Wednesday, February 6th, 2008

How many users are needed for an efficient collaborative filtering system?

Filed under: Science and Technology — Daniel Lemire @ 13:54
  • You can build an effective recommender system with as few as two users.
  • As you have more users, you tend to have more training data. Hence, you may have more accurate recommendations.
  • More accurate recommendations may not be important to your users.
  • The exact count of your users may not matter as much as the diversity of your users.
  • A good rule of thumb is that you should have many more users than you have items to recommend.
  • Given the right algorithms, your accuracy will improve monotonically with the number of users and the amount of training data.
  • Users may enter feedback data to correct the assumptions of your recommender system and thus improve it over time.
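
To make the first point concrete, here is a minimal sketch of a two-user recommender. It is only an illustration, not a method from any particular system: the `recommend` function, the `min_rating` threshold, and the sample ratings are all assumptions made up for this example. Each user is simply offered the items that the other user rated highly and that they have not rated themselves.

```python
def recommend(ratings, user, min_rating=4):
    """Suggest items that other users rated highly and `user` has not rated.

    `ratings` maps user name -> {item: rating}. The threshold `min_rating`
    is an arbitrary, hypothetical cut-off for "liked".
    """
    seen = ratings[user]
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        for item, rating in their_ratings.items():
            if item not in seen and rating >= min_rating:
                scores.setdefault(item, []).append(rating)
    # Rank by mean rating (highest first), breaking ties by item name.
    return sorted(scores, key=lambda i: (-sum(scores[i]) / len(scores[i]), i))


# Hypothetical data: two users are enough to produce recommendations.
ratings = {
    "alice": {"Alien": 5, "Brazil": 4},
    "bob": {"Alien": 4, "Casablanca": 5, "Dune": 2},
}
print(recommend(ratings, "alice"))
print(recommend(ratings, "bob"))
```

With more users, the same loop has more ratings to average over, which is one way to read the point about training data.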

Explanation: The title of this blog post is the subject line of an email I received recently. It is a very popular question.

Acknowledgment: Andre inspired me to write this post.

CASCON 2008 (May 5, 2008 / October 27-30, 2008)

Filed under: Data Warehousing and OLAP, Passed CFP, Science and Technology — Daniel Lemire @ 10:10

The CASCON 2008 call for papers is out. CASCON is a general computer science conference hosted by IBM in Toronto. The list of topics covered is broad: software engineering, databases, HCI, Web, Grid Computing, and so on.

Tuesday, February 5th, 2008

BDA 2008 (May 16, 2008 / October 21-24, 2008)

Filed under: Data Warehousing and OLAP, Passed CFP — Daniel Lemire @ 10:33

The French conference Base de données avancées 2008 (BDA) has published its web site. The name translates to "advanced databases". The conference will be held in Ardèche.

Saturday, February 2nd, 2008

Random Write Performance in Solid-State Drives

Filed under: Science and Technology — Daniel Lemire @ 12:49

I have written that solid-state drives (SSDs), as found in recent laptops such as the MacBook Air, nearly bridge the gap between internal and external memory. Indeed, the performance difference between disk and RAM went from 3 orders of magnitude to 1 order of magnitude!

There is a catch, however. SSDs can have terrible random write performance: at least two orders of magnitude slower than sequential writes!

Kevin Burton points out that, as a work-around, you can use a log-structured file system. In effect, random writes are replaced by appends at the end of a log of changes. There are certainly cases where log-structured file systems are appropriate (I don’t know much about them), but are they appropriate for external-memory B-trees or hash tables?
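
The idea of replacing random writes by appends can be sketched in a few lines. This is a toy key-value store, not a file system, and the class name `AppendOnlyStore` and its methods are hypothetical names chosen for the illustration. Every update is appended to the end of a single log file, and an in-memory index remembers where the latest value of each key lives.

```python
import os
import tempfile


class AppendOnlyStore:
    """Toy store (hypothetical): updates are appended, never written in place."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset, length) of the latest value
        open(path, "ab").close()  # create the log file if it is missing

    def put(self, key, value):
        data = value.encode("utf-8")
        offset = os.path.getsize(self.path)  # the new record starts at the end
        with open(self.path, "ab") as log:
            log.write(data)  # sequential append, no seek into old data
        self.index[key] = (offset, len(data))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as log:
            log.seek(offset)
            return log.read(length).decode("utf-8")


# Overwriting "a" does not touch the old bytes; it appends a new record.
path = os.path.join(tempfile.mkdtemp(), "changes.log")
store = AppendOnlyStore(path)
store.put("a", "1")
store.put("b", "22")
store.put("a", "333")
print(store.get("a"), store.get("b"), os.path.getsize(path))
```

The cost is visible in the file size: stale values stay in the log until something reclaims them, and reads become random even though writes are sequential. The sketch also keeps its index only in memory, so it does not survive a restart.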

However, some systems are designed to avoid random writes. For example, Google’s BigTable sorts data in memory before writing it to disk. Random writes are also minimized with most column-based databases and indexes such as C-Store and bitmap indexes.
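
As a rough sketch of the "sort in memory, then write" pattern, and certainly not BigTable's actual implementation, the hypothetical `SortedBuffer` class below collects writes in a dictionary and, once a size limit is reached, emits them as one sorted run. Each run stands in for a file that would be written to disk in a single sequential pass.

```python
class SortedBuffer:
    """Hypothetical write buffer: accumulate in memory, flush as sorted runs."""

    def __init__(self, limit):
        self.limit = limit  # number of distinct keys that triggers a flush
        self.buffer = {}    # unsorted, in-memory writes
        self.runs = []      # each run stands in for one sequentially written file

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.limit:
            self.flush()

    def flush(self):
        if self.buffer:
            # Sort by key once, in memory, then "write" the whole run at once.
            self.runs.append(sorted(self.buffer.items()))
            self.buffer = {}


# Keys arrive in random order but reach "disk" in sorted order.
buf = SortedBuffer(limit=3)
for key in ["d", "a", "c", "b"]:
    buf.put(key, key.upper())
print(buf.runs)
print(buf.buffer)
```

A real system would also need to merge runs and to search across them when reading, which this sketch leaves out.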

It is an interesting time to be a database researcher!
