Get smart with email

Harold asks: do we suffer from information overload or do we have the wrong tools? Clearly, email is inefficient. It is like cars: everybody gets stuck in traffic.

To cope, I answer and compose emails only once a day, on a schedule (after 4pm). However, I check and prune my email regularly.

How do you cope with email?

See also Improving your intellectual productivity by accepting chaos and How to manage email (Inbox Zero).

Keeping track of your time… lazily

Active Time is a free MacOS application that keeps track of how much time I spend in various software applications, automatically! My bet is that most of my time is spent in a browser, but I want to get hard numbers.

Coping with overabundance as a scientist

We are in an era of overabundance. Many of the problems we face — spam, information overload, obesity, pollution — are actually the result of overabundance.

Scientists need new strategies:

  • Create fast, discard faster.
  • Aim for quality. When people have too much content, they want quality.
  • Focus and live in niches.
  • Produce shorter papers. People want to learn specific facts. Make it easy for them to find them.
  • Use formats that are easy to index. Paper is terrible. Slides, voice and video are not very good. Digital text in simple formats is better.
  • Make your work easy to find: nobody has time to mail order.
  • Be agile: always be ready to change focus. There are just too many new opportunities!

Reference: Ann Blair, Reading Strategies for Coping With Information Overload ca. 1550-1700, Journal of the History of Ideas 64.1 (2003) 11-28.

What to get with your Nintendo Wii?

I guess many people are getting Wiis right about now. I have had mine for a few months. Here are my recommendations:

  • The Sims 2: Castaway. You do not need to know what the sims are. If you can cope with games requiring planning, some puzzles, but relatively little action, this game is for you. Reminds me of the TV show “Lost.”
  • Super Mario Galaxy. A relaxing and fun platform game. Very innovative, in a strange way. Reminds me of the Little Prince by Saint-Exupéry.
  • Resident Evil 4. This is an older strategic/action game, retrofitted to use the Wii controls. Well done and scary!
  • Medal of Honor: Heroes 2. I think that first-person shooters are getting old, but this one uses the Wii controls to good effect. Oddly immersive.
  • Opera Web browser for the Wii. I am typing this using my Wii right now. You can purchase the browser in the online boutique directly on the Wii. Well worth the money. Opera is a good browser.

I will be a better writer in 2008… I promise!

  • I will not use negations…
  • I will avoid UA (useless acronyms).
  • When appropriate, my writing will be in an active voice…
  • I will very much try to avoid carefully needless words in my writing.
  • I will employ uncomplicated terms.

Here is a call to my readers: what annoys you about my writing? I vow to improve!

ICDM 08 (July 7, 2008 / December 15-19, 2008)

ICDM’08 — the 8th IEEE International Conference on Data Mining — will be held in Pisa, Italy. ICDM is a big and prestigious conference on Data Mining. It has some pretty good workshops, but the list of workshops is not yet available. See the workshops they had last year.

(If you do not mind useless advice, do check the workshops first if you are going to prepare a paper. I see very little evidence in recent years that the papers accepted at the main conference have more impact.)

Collaborative Filtering: Why working on static data sets is not enough

As a scientist, it is important to question your assumptions. So far, most of the hard Computer Science research on collaborative filtering has used static data sets such as Netflix. Specifically, it is assumed that the recommender systems do not impact the ratings and what items get rated. A related assumption is that polls do not change how people vote (thanks to Peter for this observation).

Yet, people’s preferences are often constructed in the process of elicitation. That is, collaborative filtering is a nonlinear problem: ratings feed into the recommender system which helps to determine what people rate, which, in turn, feeds back into the recommender system…

How could a researcher take this into account? It would be too expensive to try to simulate e-commerce sites with volunteers. We need to submit simulated users to a recommender system. The usefulness of the recommendations is a tricky thing to measure, and cross-validation errors are probably not what you want to study exclusively: diversity might be an important factor too.
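To make the feedback loop concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the popularity-based recommender, the user model (each simulated user rates one of the items shown, picked at random), and the names simulate and coverage. It is not a sane user model, only a way to see how feedback can hurt diversity.

```python
import random

def simulate(num_items=20, num_users=200, top_k=3, feedback=True, seed=0):
    """Toy loop: the recommender decides which items users get to rate."""
    rng = random.Random(seed)
    # Assumption: every item starts with one rating so that items can be ranked.
    rating_counts = [1] * num_items
    for _ in range(num_users):
        if feedback:
            # Show the top_k most-rated items (ties broken by item index).
            ranked = sorted(range(num_items), key=lambda i: (-rating_counts[i], i))
            shown = ranked[:top_k]
        else:
            # Baseline without feedback: show top_k items picked at random.
            shown = rng.sample(range(num_items), top_k)
        # Assumption: the user rates one of the shown items, chosen at random.
        chosen = rng.choice(shown)
        rating_counts[chosen] += 1
    return rating_counts

def coverage(rating_counts, baseline=1):
    """Fraction of items that received a rating beyond the initial one."""
    rated = sum(1 for count in rating_counts if count > baseline)
    return rated / len(rating_counts)

if __name__ == "__main__":
    print("coverage with feedback:   ", coverage(simulate(feedback=True)))
    print("coverage without feedback:", coverage(simulate(feedback=False)))
```

In this toy setting, the popularity recommender keeps showing the same few items, so the new ratings never spread beyond them. A static data set cannot reveal this effect, since it only records the ratings that happened to be collected.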

Note 1: If someone out there knows how to simulate users (something I do not know how to do), please get in touch! I have no idea how to do sane user modelling and I need help!

Note 2: Peter also once pointed me to the Iterated Prisoner’s Dilemma problem as something related.

How to win the Netflix $1,000,000 prize?

Yehuda Koren, one of the winners of the Netflix game so far, was nice enough to send me a pointer to a recent paper he wrote, Chasing $1,000,000: How we won the Netflix progress prize (link is to PDF document, see 4th page and following).

Their approach is based on the linear combination of a large number of predictors. Their work is difficult to summarize because it is so sophisticated and complex. Nevertheless, it might be useful to try to see what lessons can be learned.

First, some generic observations that are not very surprising, but nice nevertheless:

  • All data distributions are very skewed. A single movie can receive 200,000 ratings whereas a large fraction of the movies is rated fewer than 200 times. Some users have rated 10,000 movies or more whereas most users have rated around 100 movies.
  • Ratings on movies with higher variance (you either like it or hate it) are more informative.

Here are some principles I take away from their work:

  • Singular Value Decomposition is useful to get overall trends.
  • Nearest-neighbor methods are better at picking up strong interactions inside small sets of related movies.
  • Nearest-neighbor methods should discard uninformative neighbors.
  • If you discard ratings and focus on who rated which movie, you seem to get useful predictors complementing the rating-based predictors.
  • Regularization is important (they use ridge regression) as expected.
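To illustrate what a regularized linear combination of predictors looks like, here is a minimal sketch in plain Python. It is not their implementation: the function names ridge_weights and blend, the toy numbers, and the idea that the two columns come from an SVD-based and a neighbor-based predictor are all assumptions made for the example. The weights solve the usual ridge regression normal equations, (XᵀX + λI) w = Xᵀy.

```python
def ridge_weights(rows, targets, lam=1.0):
    """Solve (X^T X + lam * I) w = X^T y for the blending weights w.

    rows[n][i] is the rating predicted by predictor i for example n,
    and targets[n] is the true rating of example n.
    """
    k = len(rows[0])
    # Build the normal equations as an augmented k x (k + 1) matrix.
    a = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
          for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def blend(row, weights):
    """Combine the individual predictions for one (user, movie) pair."""
    return sum(w * p for w, p in zip(weights, row))

if __name__ == "__main__":
    # Hypothetical predictions: column 0 from an SVD model, column 1 from k-NN.
    rows = [[3.0, 4.0], [2.0, 1.0], [5.0, 4.0], [1.0, 2.0]]
    targets = [3.3, 1.7, 4.7, 1.3]
    weights = ridge_weights(rows, targets, lam=0.1)
    print(weights, blend([4.0, 2.0], weights))
```

A larger lam shrinks the weights toward zero, which matters when many of the predictors are strongly correlated with one another.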

In their work, there is a very clear trade-off: they give up the ability to explain the recommendations in favor of accuracy. This is somewhat dictated by the rules of the game, I suppose. They acknowledge this fact: “when recommending a movie to a user, we don’t really care why the user will like it, only that she will.” Presumably, neither the engineer or manager running the system nor the user should care why the recommendation was made. I have argued the exact opposite, and so have others. I hope we can agree to disagree on this one. (I have said it before: my goal in life is to make people smarter, not to make smarter machines.)

Note. They claim that there is no justification in the literature for the similarity measures: it is all arbitrary. I think Yehuda did not read my paper Scale And Translation Invariant Collaborative Filtering Systems. I suggested that the predictions of an algorithm should not change if we transform the data in some sensible way. Of course, this may not be enough to determine what the similarity measures must be, but it allows you to quickly discard some choices.
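As an illustration of the invariance idea, here is a minimal sketch of a predictor that normalizes each user's ratings before combining them. It is not the algorithm from the paper: the functions normalize and predict and the toy users are assumptions made for the example. The point is only that rescaling or shifting one user's ratings (r becomes a·r + b with a > 0) leaves the predictions for everybody else unchanged.

```python
from statistics import mean, pstdev

def normalize(ratings):
    """Map one user's ratings {item: value} to zero mean and unit deviation."""
    mu = mean(ratings.values())
    # Guard against users who give every item the same rating.
    sigma = pstdev(ratings.values()) or 1.0
    scores = {item: (value - mu) / sigma for item, value in ratings.items()}
    return scores, mu, sigma

def predict(users, target, item):
    """Predict target's rating of item from the other users' normalized ratings."""
    _, mu, sigma = normalize(users[target])
    scores = [normalize(ratings)[0][item]
              for name, ratings in users.items()
              if name != target and item in ratings]
    if not scores:
        return mu  # nobody else rated the item: fall back on the user's mean
    # Map the average normalized score back onto the target's own scale.
    return mu + sigma * mean(scores)

if __name__ == "__main__":
    users = {
        "ann": {"a": 1, "b": 3, "c": 5},
        "bob": {"a": 2, "b": 4, "c": 3, "d": 5},
        "cat": {"a": 4, "b": 2, "d": 1},
    }
    before = predict(users, "ann", "d")
    # Suppose bob had rated on a different scale: r -> 2 * r + 10.
    users["bob"] = {item: 2 * value + 10 for item, value in users["bob"].items()}
    after = predict(users, "ann", "d")
    print(before, after)  # equal up to rounding error
```

A similarity measure computed on raw ratings would typically fail this check, which is the kind of quick test that lets you discard some choices.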

How University professors ought to be teaching…

I am not a teacher per se. As a professor, I define myself as a researcher first and I do not do research on teaching methodologies. So this makes me poorly qualified to tell the world how a professor ought to be teaching. Nevertheless, I do teach. And I think that some of the time, I teach better than some. In fact, in the last few years, 95% of all students who took my courses would recommend the course to others.

Here are the rules I follow:

  • Don’t focus on content. In most fields, the content, the information, is already out there. It has been organized several times over by very smart people. Books have been written on most topics worth the attention of your students. There is a growing set of great talks available on YouTube, Google Video and elsewhere. Your students do not need you to rehash the same content they can find elsewhere, sometimes in better form. You may produce really sharp Java PowerPoint slides, but the value of these slides is very tiny when your students have access to Google. Stop lecturing already! Only produce content when you really cannot find the equivalent elsewhere. (I would attribute this idea to Stephen Downes, though I can’t find a reference.)
  • Focus on assignments and exams. Many professors are frustrated that students come in only for the grades. This is probably because they focus on nice lectures and then hastily prepare some assignments. Turn this problem on its head! Focus on the assignments. If your students are not very autonomous, and they rarely are, give several long and challenging assignments (at least 4 or 5 a term). Do make sure, however, that they know where to get the information they need. You don’t need to provide all the information, but you need to link to all of it, because most students lack the research skills to figure out where they should look. Provide solved problems to help the weaker students.
  • Be an authentic role model. Knowing that someone ordinary, like your professor, has become a master of the course material (or he can fake it well) means that you, the very-smart-student, can do the same. That’s the power of emulation. In practice, this means that I do stress to my students that I do research in this field or that I have accomplished some difficult tasks using this very same course material. I also stress the difficulties that I have encountered and I give my personal view on the issues.

Note. Feel free to disagree.

How many Computer Science researchers are there?

In current work we do on database indexes, we decided to use DBLP as a data source. Among other things, we use the authors’ names as a dimension. From one plot, I noticed that there must be half a million distinct authors. I doubted this number, and Kamel was nice enough to investigate further. It turns out that there are 531,480 different authors in DBLP! (As a basis for comparison, there are about 945,000 articles.)

I don’t know about you, but this feels like a large number. We started to look for explanations. I already reported that the USA is producing 1,500 new Computer Science Ph.D.s a year. Still, there cannot be many more than 100,000 recently active Computer Science authors holding a Ph.D.

Owen pointed us to the recent CACM article Are your citations clean? by Lee et al. Alas, while DBLP is certainly dirty, in that some researchers will appear under two or more different names, this cannot explain why we end up with half a million authors!

The best explanation so far is that many undergraduate or M.Sc. students have papers on DBLP. So much so that they make up the majority of the authors in DBLP.

Do you buy this theory? If not, do you have a better explanation?

(As a side-effect, it should not be very hard to be in the top 10% among the most prolific DBLP authors!)
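For readers who want to check this kind of claim on their own copy of the data, here is a minimal sketch. The input format (one list of author names per paper), the function names, and the toy records are assumptions made for the example; this is not how we processed DBLP.

```python
from collections import Counter

def papers_per_author(records):
    """Count papers per distinct author name.

    records is an iterable of papers, each given as a list of author names.
    """
    counts = Counter()
    for authors in records:
        counts.update(set(authors))  # count each name once per paper
    return counts

def top_decile_threshold(counts):
    """Paper count of the least prolific author within the top 10%."""
    ranked = sorted(counts.values(), reverse=True)
    cutoff = max(1, len(ranked) // 10)
    return ranked[cutoff - 1]

if __name__ == "__main__":
    # Hypothetical toy records: one prolific author, many one-paper authors.
    records = [["A", "B"], ["A", "C"], ["A", "D"],
               ["E"], ["F"], ["G"], ["H"], ["I"], ["J"]]
    counts = papers_per_author(records)
    print(len(counts), "distinct authors")
    print("papers needed for the top 10%:", top_decile_threshold(counts))
```

If most names belong to students with a single paper, the threshold returned by top_decile_threshold stays small, which is the intuition behind the remark above. Note that the count of distinct names is inflated whenever one researcher appears under several spellings.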
