Friday, December 21st, 2007

How to win the Netflix $1,000,000 prize?

Filed under: Science and Technology — Daniel Lemire @ 14:05

Yahuda Koren, one of the winners of the Netflix game so far, was nice enough to send me a pointer to a recent paper he wrote, Chasing $1,000,000: How we won the Netflix progress prize (link is to PDF document, see 4th page and following).

Their approach is based on the linear combination of large number of predictors. Their work is difficult to summarize because it is so sophisticated and complex. Nevertheless, it might be useful to try to see what lessons can be learned.

First, some generic observations that are not very surprising, but nice nevertheless:

  • All data distributions are very skewed. A single movie can receive 200,000 ratings whereas a large fraction of the movies is rated fewer than 200 times. Some users have rated 10,000 movies or more whereas most users have rated around 100 movies.
  • Ratings on movies with higher variance (you either like it or hate it) are more informative.

Here are some principles I take away from their work:

  • Singular Value Decomposition is useful to get overall trends.
  • Nearest-neighbor methods are better at picking up strong interactions inside small sets of related movies.
  • Nearest-neighbor methods should discard uninformative neighbors.
  • If you discard ratings and focus on who rated which movie, you seem to get useful predictors complementing the rating-based predictors.
  • Regularization is important (they use ridge regression) as expected.

There is, in their work, a very clear trade-off from our ability to explain the recommendations, in favor of the accuracy. This is somehow dictated by the rules of the game, I suppose. They acknowledge this fact: “when recommending a movie to a user, we don’t really care why the user will like it, only that she will.” Presumably, neither the engineer or manager running the system, nor the user should care why the recommendation was made. I have argued the exact opposite, and so have others. I hope we can agree to disagree on this one. (I have said it before, my goal in life is to make people smarter, not to make smarter machines.)

Note. They claim that there is no justification found in the literature for the similarity measures, it is all arbitrary. I think Yahuda did not read my paper Scale And Translation Invariant Collaborative Filtering Systems. I suggested that the predictions of an algorithm should not change if we transform the data in some sensible way. Of course, this may not be enough to determine what the similarity measures must be, but it allows you to quickly discard some choices.

How University professors ought to be teaching…

Filed under: Academia/Research, Favorite — Daniel Lemire @ 10:02

I am not a teacher per se. As a professor, I define myself as a researcher first and I do not do research on teaching methodologies. So this makes me poorly qualified to tell the world how a professor ought to be teaching. Nevertheless, I do teach. And I think that some of the time, I teach better than some. In fact, in the last few years, 95% of all students who took my courses would recommend the course to others.

Here are the rules I follow:

  • Don’t focus on content. In most fields, the content, the information, is already out there. It has been organized several times over by very smart people. Books have been written on most topics worth the attention of your students. There is a growing set of great talks available on YouTube, Google Video and elsewhere. Your students do not need you to rehash the same content they can find elsewhere, sometimes in better form. You may produce really sharp Java PowerPoint slides, but the value of these slides is very tiny when your students have access to Google. Stop lecturing already! Only produce content when you really cannot find the equivalent elsewhere. (I would attribute this idea to Stephen Downes, though I can’t find a reference.)
  • Focus on assignments and exams. Many professors are frustrated that students come in only for the grades. Probably because they focus on nice lectures and then prepare hastily some assignments. Turn this problem on its head! Focus on the assignments. If your students are not very autonomous, and they rarely are, give several long and challenging assignments (at least 4 or 5 a term). Do make sure however that they know where to get the information they need. You don’t need to provide all the information, but you need to link to all of it, because most students lack the research skills to figure out where they should look. Provide solved problems to help the weaker students.
  • Be an authentic role model. Knowing that someone ordinary, like your professor, has become a master of the course material (or he can fake it well) means that you, the very-smart-student, can do the same. That’s the power of emulation. In practice, this means that I do stress to my students that I do research in this field or that I have accomplished some difficult tasks using this very same course material. I also stress the difficulties that I have encountered and I give my personal view on the issues.

Note. Feel free to disagree.

Wednesday, December 19th, 2007

How many Computer Science researchers are there?

Filed under: Academia/Research — Daniel Lemire @ 20:00
picture by -Kj

In current work with do on database indexes, we decided to use DBLP as a data source. Among other things, we use the authors’ name as a dimension. From one plot, I noticed that there must have half a million distinct authors. I doubted this number, and Kamel was nice enough to investigate further. It turns out that there are 531,480 different authors in DBLP! (As a basis for comparison, there about 945,000 articles.)

I don’t know about you, but this feels like a large number. We started to look for explanations. I already reported that the USA is producing 1,500 new Computer Science Ph.D.s a year. Still, there cannot be many more than 100,000 recently active Computer Science authors holding a Ph.D.

Owen pointed us to the recent CACM article Are your citations clean? by Lee et al. Alas, while DBLP is certainly dirty, in that some researchers will appear under two or more different names, it cannot explain why we end up with half a million authors!

The best explanation so far is that many undergraduate or M.Sc. students have papers on DBLP. So much so that they make up the majority of the authors in DBLP.

Do you buy this theory? If not, do you have a better explanation?

(As a side-effect, it should not be very hard to be in the top 10% among the most prolific DBLP authors!)

21 open problems in Artificial Intelligence

Filed under: Science and Technology — Daniel Lemire @ 9:57

Peter has come up with a list of 21 (important) open problems in the field of Artificial Intelligence. I am not aware of any such list anywhere, so this might be an important contribution. For comparison, Wikipedia as a list of open problems in Computer Science. In the field of database, the closest thing to a list of open problems would be the Lowell report: it falls short of providing true open problems however.

I am a bit surprised to see Learning Chess, but not Learning Go, on his list since I have the impression that Deep Blue has pretty much learned to play Chess at a very high level, whereas the same is not true of Go.

Out of Peter’s list, two of the open problems stroke a cord with me:

  • Self-References in software. I am no expert in AI, but it seems to me that the main mystery we are facing today, the deepest mystery of all, is what is consciousness? Some say that as computers grow larger, more connected and more powerful, they will acquire consciousness. Maybe. I believe however that consciousness has to do with software that can reference and modify itself. Note that we can figure out what consciousness is without building a conscious robot, and we may build a conscious robot without knowing what consciousness is.
  • Approximate queries in databases. As we now have infinite storage, and as data is created and discarded faster than ever, we need smarter database systems in the sense that they can provide the human beings exactly what they need, just like a human assistant would, only faster. The key here is probably to use lossy database engines and approximate representations. I like this topic because while I am not an AI research, it is close to my interests. For related work, see our recent paper on OLAP Tag Clouds (to be presented at WEBIST 2008), our work on quasi-monotonic segmentations (to appear in IJCM), and my work on better piecewise linear segmentations (SDM 2007). This last paper is interesting because it was motivated by my frustration at defining what a flat segment is in time series, a concept human beings can agree upon easily, it seems.

Tuesday, December 18th, 2007

How much are the ideas of your competition worth to you?

Filed under: Academia/Research — Daniel Lemire @ 10:29

Scientists are typically rather secretive about whatever they are working on right now. While in most universities, you can at least see where the researchers work, in some government laboratories, such as NRC, you would think that Russian spies are on every corner: how else can you explain the armed guards you find at the entrance of some buildings? I bet that some private laboratories are even better protected.

Initially, when I started this blog, I wanted to tell the world about what I was working on. Somehow, on paper, it sounded like a nice approach. By sharing my ideas, I could get some early feedbacks, some extra references, I could maybe even get some collaboration going.

While it may work for some, opening up my research ideas simply does not work for me. Explaining, clearly, what I work on is hard. I could sketch my ideas, but only a handful of people would grasp even half of what I would write. Moreover, many ideas never make it outside my office. I abandon most of my ideas, eventually. Thus, taking the time to explain my current set of ideas would be very wasteful.

So, you simply cannot tell what I am working on. I just won’t tell you. I will tell you to go read my papers.

Most researchers behave the same way. Interestingly, however, many researchers have another reason for behaving this way: they do not want to give their competition an edge. They do not divulge their ideas for the same reason they keep their data and their software secret: they want to make sure nobody can catch up to them.

Whenever several researchers are working toward the very same goal, this a sensible concern. After all, being the first to solve a given scientific problem, is important. Science is a winner-takes-all game, at least some of time. Other times, people are simply misguided: keeping yourself out of the information-sharing loop only makes you less useful to the community and, ultimately, less important.

Me? If I were able to read minds… if I were able to see what other researchers are thinking about… I would most certainly not bother. I am already overwhelmed with carefully crafted papers on Google Scholar, I would certainly not care for the early drafts of my competitors. It helps that I do not feel like I have competitors. This is no accident since I apply Dijkstra’s rule:

Never tackle a problem of which you can be pretty sure that (now or in the near future) it will be tackled by others who are, in relation to that problem, at least as competent and well-equipped as you.

Monday, December 17th, 2007

Why tenure matters?

Filed under: Academia/Research — Daniel Lemire @ 12:24

Since the end of World War II, at least half of all university professors in North America have tenure: they cannot be dismissed without adequate cause. This job security is earned: you need to be a professor for several years, and to perform well, before you can be granted tenure.

At several schools, a large fraction of the teaching positions do not lead to tenure. There are claims that there is a growing trend to hire more and more people people on temporary positions. One justification for this trend might be that universities need the flexibility to adapt quickly to the market.

The same might be true of research institutions. While I worked at NRC, several projects were staffed by people whose contracts was for the duration of a project.

One argument that I hear to justify tenure is that it saves money. Indeed, people will accept a lower salary if they have job security. Even if the job market is favorable, even if if you could get more money elsewhere, few people like to change job frequently. Stability is nice.

But this argument is limited. What if the job market is difficult? If there is an oversupply of Ph.D.s, shouldn’t managers do away with tenure then?

Maybe not. Bland et al. have shown that tenure matters. According to their study, tenure-track professors perform significantly better than others:

faculty on tenure appointments are significantly more productive in research, more productive in education, more committed to their positions, (…)

Friday, December 14th, 2007

Why having a readership matters

Filed under: Academia/Research — Daniel Lemire @ 22:31

I recently proposed that scientists should adopt the find a readership or perish motto. (A related goal for engineers might be “find users or perish.”) The goal is certainly not to have as many readers as possible, but having some serious readers matter.

I was chatting with Seb Paquet today and he came up with a good argument to support this view: it is very hard to have a significant readership without serving a useful function in a community. In other words, it is easier to write many empty/wrong papers than it is to attract a large readership with empty/wrong papers.

« Previous PageNext Page »

35 queries. 0.615 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.