Monday, December 1st, 2008

Put your lectures only easily and for free with Panopto

Filed under: Academia/Research, Science and Technology — Daniel Lemire @ 21:49

Panopto screen shot I saw an impressive online course this morning using Panopto. The asynchronous videocasting was really convincing. Basically, the PowerPoint slides are synced with the video, and you can move up or down in the slide deck, with the video syncing automatically. Students can annotate your slides. You can add secondary video feeds or screen capture.

What is more is that a trusty colleague said it was really easy. You can do it on his own given a good camera. The catch is that Windows is required. The price is free or relatively cheap.


Update: See the live demo.

Reference: The November press release where Panopto announces the free version of their product.

Monday, November 24th, 2008

Diversity in recommender systems: sketch of a bibliography

Filed under: Science and Technology — Daniel Lemire @ 9:18

I have been arguing on this blog that while everyone knows diversity is a desirable property of recommender systems, there has been little work on the topic. To make my claim precise, I decided to list the papers addressing both recommender systems and diversity. I mean this list to be complete.

You can find a few more references and some analysis in our technical report:

Daniel Lemire, Stephen Downes, Sébastien Paquet, Diversity in open social networks, published online, October 2008.

If I am missing any paper, tell me!

Maybe this warrants a Wikipedia page?

Friday, November 21st, 2008

Recommender systems: where are we headed?

Filed under: Favorite, Science and Technology — Daniel Lemire @ 17:57

Daniel Tunkelang comments on the recent progress in collaborative filtering:

(…) the machine learning community, much like the information retrieval community, generally prefers black box approaches, (…) If the goal is to optimize one-shot recommendations, they are probably right. But I maintain that the process of picking a movie, like most information seeking tasks, is inherently interactive, (…)

I disagree with him. Even for non-interactive recommendations, the Machine Learning community is off-track for two reasons:

  • They fail to take into account diversity. In Information Retrieval, we know that if precision is high (all documents are relevant) but recall is low (few of the relevant documents are presented), then the system is poor. There is no such balance in collaborative filtering. Precision above all else is the goal. This is wrong. Diversity metrics must be used.
  • They work over static data sets. A system like Netflix is not static and so, accuracy on a static data set might be a good predictor for real-world performance. The problem is intrinsically nonlinear. People will rate different items, and they will rate differently, if you change the recommender system. The feedback loop may work against you or in your favour. The effect might be large or small. As far as I can tell, I am the only one who keep pointing out this fundamental, but never addressed limitation of working over static data sets. Update: This has absolutely nothing to do with online versus batch algorithms.

See also my post Netflix: an interesting Machine Learning game, but is it good science?

Disclaimer: I organized the ACM KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition along with people like Yehuda Koren. Yahuda is among the candidates to win the Netflix prize. I do not encourage the Netflix competition. I just do not think that it will solve our big problems.

Thursday, November 20th, 2008

How to speed up retrieval without any index?

Filed under: Data Warehousing and OLAP, Science and Technology — Daniel Lemire @ 22:06

John Cook gives us a nice recipe to quickly find all squares in a set of integers. For example, given 3, 4, 9, 15, you want your algorithm to identify 4 and 9 as squares.

The naïve way to solve this problem goes as follows:

  1. For each element…
  2. check whether sqrt(x) is an integer.

This may prove too expensive since the square-root operation must be computed using a floating-point algorithm.

A better way is to look at the first 4 bits of each integer. If the integer is a square, then the first 4 bits must have value 0, 1, 4, or 9. If you have a random distribution of numbers, this means that you can quickly discard 3 out of 4 numbers.

It is not immediately obvious that you will speed up the retrieval because inserting this check will add some overhead. However, it doubles the speed according to John. It is even less obvious that checking the first 8 or 16 bits instead of just the first 4 bits, can help. John says it does not help in one C++ implementation, but it does in a C# implementation.

This sort of strategy is entirely general. The research question is how much work should you do on fast dismissal? Too much effort toward dismissing lots of candidates might be counterproductive. Too little and your performance might not improve optimally.

Recently I started to wonder whether we could make it multipass: you first dismiss a few candidates with a cheap test, then on the survivors you use a more expensive test and so on. For example, you first check the first 4 bits, and if you cannot dismiss the candidate, you check the next 4 bits and so on. It is not a surprising idea, but figuring out whether it is worth the effort is a research question.

To make my point, I have worked on fast retrieval under the Dynamic Time Warping (DTW) distance, a nonlinear distance measure between time series. The DTW does not satisfy a triangle inequality. It is commonly used as a pattern recognition technique when comparing time series. It was initially designed to compare voice samples, allowing for changes in voice rhythm.

Eamonn Keogh from UCIUCR has come up with a simple but nearly optimal way to compute a lower bound to the DTW between any two times series, called LB_Keogh (named after himself). Just like in the John Cook algorithm, this lower bound quickly discards the false negatives. If you are interested, Eamonn has applied LB_Keogh to just about every time series problem you can think of. (Update: one hundred people or more also used LB_Keogh in their work, see comments below.)

I improved over LB_Keogh as follows. If LB_Keogh is not good enough (and only if it is not good enough), I compute a tighter lower bound (called LB_Improved). Surprisingly, in many cases, I can improve the retrieval time by a factor of two or more.

I have published my work as a software library, but also as the following paper:

Daniel Lemire, Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound, to appear in Pattern Recognition.

This sort of work is much more difficult than it appears. I could have easily made my method look good by optimizing it, while leaving the competing methods unoptimized. By publishing my implementation, I go a long way toward keeping me honest. If I fooled myself and the reviewers, someone might find out by surveying my source code.

Tuesday, November 18th, 2008

Is what I do technical?

Filed under: Science and Technology — Daniel Lemire @ 20:31

We are trying to design a master degree in Information Technology. To me, this sort of program should be a professional master degree, that is, it does not lead naturally to a research career or a Ph.D.

My business colleagues argue in favour of research methodology courses. Apparently, students need to learn how to conduct interviews and such. In any case, I then pointed out that my master degree did not contain any such course. One of my business colleague then said a deadly thing:

Of course, you got a technical master degree!

This got me really angry. Really, really, really angry. I do not think I ever got so angry in my life.

For the record, my master degree was in Mathematics at the University of Toronto. Is Mathematics technical? If technical is to have a “practical” connotation, I can tell you that none of my graduate courses were technical. Are fewnomials practical? I think not.

But the deeper implication was that anything having to do with Science was technical. That is, it deals with nuts and bolts. And I think that it is squarely wrong. From my view point, business is far more technical. And I ran my own business for several years. The business side of things was always the boring-but-easy component.

There is a distinct feeling in North America that business is king, and science & technology are things monkeys or foreigners can do. Yet, in my experience, it is a lot harder to design a usable web application than negotiate a business deal. I believe that India and China are getting a sweet deal by doing our science & technology while we focus on business. A very sweet deal indeed.

I think that Amazon, Google, Cisco, Microsoft and so on, thrive because many of their engineers have a deep knowledge of Computer Science. Kill the science and you kill the business.

But even if you discard science. Writing good source code is hard. Very hard. And it is not hard for technical reasons, not any more than painting, movie-making and sculpture are technical challenges.

In any case, I believe that North America is headed for a wall if it fails to recognize that its prosperity is due to culture, science and technology. And given that 40% of all students at my school go for a business degree, I am nervous.

See also my post Career Swings where I wrote:

I cannot believe that in 2015, we’ll all be lawyers, business managers, salesman, and medical doctors. I cannot believe that technology will stand still and mathematics beyond basic algebra will be a lost art. I cannot believe my two sons will have business degrees and make three times my salary by managing a bunch of underpaid Indian programmers.

Monday, November 17th, 2008

SciFi book review: Spin by Robert Charles Wilson

Filed under: Science and Technology — Daniel Lemire @ 19:21

The novel Spin won the Hugo Award for Best Novel in 2006.

It is what I would call a “temporal disparity” novel. Earth becomes suddenly surrounded in a temporal shield that slows time down for human beings. Alas, the Sun is aging very fast for the poor human beings. Are we going to die? Who is creating this field?

This is almost exactly the reverse story from Georges-Jean Arnaud’s La grande séparation (1971-1973). In Arnaud’s story, a planet has a similar temporal field, but it accelerates time on the planet. Even though the planet has primitive technology, it is constantly surveyed for any sign of technological development. Spin offers the counterpart story.

A temporal disparity leads to a technological disparity: a small band of savages can evolve into a technologically superior race while you are having coffee.

Pros

The novel is very good. The author writes with good scientific rigour. The writing is supported by repeatedly introducing new mysteries in every chapter… to keep you coming for more. The characters are believable and well drawn.

Cons

The author tried to limit the scope of the story to few characters, but not all of them are good characters. The writing style reminds me a bit of Card’s Ender’s game series. There is the extra smart kid who grows up to be the is the only one able to see through what is happening. I found this particular element of the novel irritating. A major catastrophe hits the Earth and only one man seems to be able to put it all together? I am a bit disappointed by how the author dealt with anything outside the Earth, including the Martians. He could have done so much more!

Sequels are upcoming.

The most active blogs I follow…

Filed under: Science and Technology — Daniel Lemire @ 18:11

A very active feed that has remained in my list for a long time is a good feed (for me). My top 3 (in decreasing order of activity):

  • The Noisy Channel: Daniel Tunkelang, chief Scientist at Endeca. He works in information retrieval.
  • Population of One: Sylvie Noël, research scientist at the government of Canada. She works in HCI.
  • Ongoing: Tim Bray, director of Web Technologies at Sun Microsystems. He helped create XML and Atom.
Next Page »

37 queries. 1.254 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.