As a scientist, it is important to question your assumptions. So far, most of the hard Computer Science research on collaborative filtering has used static data sets such as Netflix. Specifically, it is assumed that the recommender systems do not impact the ratings and what items get rated. A related assumption is that polls do not change how people vote (thanks to Peter for this observation).

Yet, people’s preferences are often constructed in the process of elicitation. That is, collaborative filtering is a nonlinear problem: ratings feed into the recommender system which helps to determine what people rate, which, in turn, feeds back into the recommender system…

How could a researcher take this into account? It would be too expensive to try to simulate e-commerce sites with volunteers. We need to submit simulated users to a recommender system. The usefulness of the recommendations is a tricky thing to measure and cross-validation errors are probably not what you want to study exclusively, diversity might be an important factor too.

Note 1: If someone out there know how to simulate users (something I do not know how to do), please get in touch! I have no idea how to do sane user modelling and I need help!

Note 2: Peter also once pointed me to the Iterated Prisoner’s Dilemma problem as something related.

Yahuda Koren, one of the winners of the Netflix game so far, was nice enough to send me a pointer to a recent paper he wrote, Chasing $1,000,000: How we won the Netflix progress prize (link is to PDF document, see 4th page and following).

Their approach is based on the linear combination of large number of predictors. Their work is difficult to summarize because it is so sophisticated and complex. Nevertheless, it might be useful to try to see what lessons can be learned.

First, some generic observations that are not very surprising, but nice nevertheless:

  • All data distributions are very skewed. A single movie can receive 200,000 ratings whereas a large fraction of the movies is rated fewer than 200 times. Some users have rated 10,000 movies or more whereas most users have rated around 100 movies.
  • Ratings on movies with higher variance (you either like it or hate it) are more informative.

Here are some principles I take away from their work:

  • Singular Value Decomposition is useful to get overall trends.
  • Nearest-neighbor methods are better at picking up strong interactions inside small sets of related movies.
  • Nearest-neighbor methods should discard uninformative neighbors.
  • If you discard ratings and focus on who rated which movie, you seem to get useful predictors complementing the rating-based predictors.
  • Regularization is important (they use ridge regression) as expected.

There is, in their work, a very clear trade-off from our ability to explain the recommendations, in favor of the accuracy. This is somehow dictated by the rules of the game, I suppose. They acknowledge this fact: “when recommending a movie to a user, we don’t really care why the user will like it, only that she will.” Presumably, neither the engineer or manager running the system, nor the user should care why the recommendation was made. I have argued the exact opposite, and so have others. I hope we can agree to disagree on this one. (I have said it before, my goal in life is to make people smarter, not to make smarter machines.)

Note. They claim that there is no justification found in the literature for the similarity measures, it is all arbitrary. I think Yahuda did not read my paper Scale And Translation Invariant Collaborative Filtering Systems. I suggested that the predictions of an algorithm should not change if we transform the data in some sensible way. Of course, this may not be enough to determine what the similarity measures must be, but it allows you to quickly discard some choices.

I am not a teacher per se. As a professor, I define myself as a researcher first and I do not do research on teaching methodologies. So this makes me poorly qualified to tell the world how a professor ought to be teaching. Nevertheless, I do teach. And I think that some of the time, I teach better than some. In fact, in the last few years, 95% of all students who took my courses would recommend the course to others.

Here are the rules I follow:

  • Don’t focus on content. In most fields, the content, the information, is already out there. It has been organized several times over by very smart people. Books have been written on most topics worth the attention of your students. There is a growing set of great talks available on YouTube, Google Video and elsewhere. Your students do not need you to rehash the same content they can find elsewhere, sometimes in better form. You may produce really sharp Java PowerPoint slides, but the value of these slides is very tiny when your students have access to Google. Stop lecturing already! Only produce content when you really cannot find the equivalent elsewhere. (I would attribute this idea to Stephen Downes, though I can’t find a reference.)
  • Focus on assignments and exams. Many professors are frustrated that students come in only for the grades. Probably because they focus on nice lectures and then prepare hastily some assignments. Turn this problem on its head! Focus on the assignments. If your students are not very autonomous, and they rarely are, give several long and challenging assignments (at least 4 or 5 a term). Do make sure however that they know where to get the information they need. You don’t need to provide all the information, but you need to link to all of it, because most students lack the research skills to figure out where they should look. Provide solved problems to help the weaker students.
  • Be an authentic role model. Knowing that someone ordinary, like your professor, has become a master of the course material (or he can fake it well) means that you, the very-smart-student, can do the same. That’s the power of emulation. In practice, this means that I do stress to my students that I do research in this field or that I have accomplished some difficult tasks using this very same course material. I also stress the difficulties that I have encountered and I give my personal view on the issues.

Note. Feel free to disagree.

picture by -Kj

In current work with do on database indexes, we decided to use DBLP as a data source. Among other things, we use the authors’ name as a dimension. From one plot, I noticed that there must have half a million distinct authors. I doubted this number, and Kamel was nice enough to investigate further. It turns out that there are 531,480 different authors in DBLP! (As a basis for comparison, there about 945,000 articles.)

I don’t know about you, but this feels like a large number. We started to look for explanations. I already reported that the USA is producing 1,500 new Computer Science Ph.D.s a year. Still, there cannot be many more than 100,000 recently active Computer Science authors holding a Ph.D.

Owen pointed us to the recent CACM article Are your citations clean? by Lee et al. Alas, while DBLP is certainly dirty, in that some researchers will appear under two or more different names, it cannot explain why we end up with half a million authors!

The best explanation so far is that many undergraduate or M.Sc. students have papers on DBLP. So much so that they make up the majority of the authors in DBLP.

Do you buy this theory? If not, do you have a better explanation?

(As a side-effect, it should not be very hard to be in the top 10% among the most prolific DBLP authors!)

Peter has come up with a list of 21 (important) open problems in the field of Artificial Intelligence. I am not aware of any such list anywhere, so this might be an important contribution. For comparison, Wikipedia as a list of open problems in Computer Science. In the field of database, the closest thing to a list of open problems would be the Lowell report: it falls short of providing true open problems however.

I am a bit surprised to see Learning Chess, but not Learning Go, on his list since I have the impression that Deep Blue has pretty much learned to play Chess at a very high level, whereas the same is not true of Go.

Out of Peter’s list, two of the open problems stroke a cord with me:

  • Self-References in software. I am no expert in AI, but it seems to me that the main mystery we are facing today, the deepest mystery of all, is what is consciousness? Some say that as computers grow larger, more connected and more powerful, they will acquire consciousness. Maybe. I believe however that consciousness has to do with software that can reference and modify itself. Note that we can figure out what consciousness is without building a conscious robot, and we may build a conscious robot without knowing what consciousness is.
  • Approximate queries in databases. As we now have infinite storage, and as data is created and discarded faster than ever, we need smarter database systems in the sense that they can provide the human beings exactly what they need, just like a human assistant would, only faster. The key here is probably to use lossy database engines and approximate representations. I like this topic because while I am not an AI research, it is close to my interests. For related work, see our recent paper on OLAP Tag Clouds (to be presented at WEBIST 2008), our work on quasi-monotonic segmentations (to appear in IJCM), and my work on better piecewise linear segmentations (SDM 2007). This last paper is interesting because it was motivated by my frustration at defining what a flat segment is in time series, a concept human beings can agree upon easily, it seems.
« Previous PageNext Page »

Powered by WordPress