<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/rss2enclosuresfull.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><rss xmlns:media="http://search.yahoo.com/mrss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0"><channel><title>Daniel Lemire's blog</title><link>http://www.daniel-lemire.com/blog</link><description>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</description><language>en</language><lastBuildDate>Fri, 21 Nov 2008 16:59:01 -0600</lastBuildDate><generator>WordPress http://wordpress.org/</generator><itunes:explicit>no</itunes:explicit><itunes:subtitle>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. 
It is broadcasted from Montreal (Canada).</itunes:subtitle><creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/2.0/</creativeCommons:license><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/daniel-lemire/atom" type="application/rss+xml" /><feedburner:emailServiceId>1396075</feedburner:emailServiceId><feedburner:feedburnerHostname>http://www.feedburner.com</feedburner:feedburnerHostname><item><title>Recommender systems: where are we headed?</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/461258987/</link><category>Favorite</category><category>Science and Technology</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Fri, 21 Nov 2008 16:59:01 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1565</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Daniel  Tunkelang <a href="http://thenoisychannel.com/2008/11/21/the-napoleon-dynamite-problem/">comments</a> on the recent progress in collaborative filtering:</p>
<blockquote><p>(&#8230;) the machine learning community, much like the information retrieval community, generally prefers black box approaches, (&#8230;) If the goal is to optimize one-shot recommendations, they are probably right. But I maintain that the process of picking a movie, like most information seeking tasks, is inherently interactive, (&#8230;)</p></blockquote>
<p>I disagree with him. Even for non-interactive recommendations, the Machine Learning community is off-track for two reasons:</p>
<ul>
<li>They fail to take into account diversity. In Information Retrieval, we know that if precision is high (all documents are relevant) but recall is low (few of the relevant documents are presented), then the system is poor. There is no such balance in collaborative filtering. Precision above all else is the goal. This is wrong. <a href="http://www.daniel-lemire.com/blog/archives/2008/11/14/measuring-the-diversity-of-recommended-lists-at-last/">Diversity metrics must be used</a>.</li>
<li>They work over static data sets. A system like Netflix is not static, and so accuracy on a static data set might not be a good predictor of real-world performance. The problem is intrinsically nonlinear. People will rate different items, and they will rate differently, if you change the recommender system. The feedback loop may work against you or in your favour. The effect might be large or small. As far as I can tell, I am <a href="http://www.daniel-lemire.com/blog/archives/2007/12/22/collaborative-filtering-why-working-on-static-data-sets-is-not-enough/">the only one</a> who keeps pointing out this fundamental, but never addressed, limitation of working over static data sets.</li>
</ul>
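<p>To make the precision/recall imbalance from the first point concrete, here is a minimal sketch. The function name <code>precision_recall</code> and the toy item labels are illustrative assumptions, not taken from any recommender library.</p>

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the relevant items."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended.intersection(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A short, safe list: every recommendation is relevant,
# yet most of the relevant items are never shown.
precision, recall = precision_recall(["a", "b"], ["a", "b", "c", "d", "e", "f", "g", "h"])
```

<p>A system tuned only for the first number can look perfect while showing the user a small, narrow slice of what they would have liked.</p>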
<p>See also my post <a href="http://www.daniel-lemire.com/blog/archives/2007/12/13/netflix-an-interesting-machine-learning-game-but-is-it-good-science/">Netflix: an interesting Machine Learning game, but is it good science?</a></p>
<p><strong>Disclaimer</strong>: I organized the ACM KDD <a href="http://netflixkddworkshop2008.info/pc.html">Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition</a> along with people like Yehuda Koren. Yehuda is among the candidates to win the Netflix Prize. I do not oppose the Netflix competition. I just do not think that it will solve our big problems.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/461258987" height="1" width="1"/>]]></content:encoded><description>Daniel  Tunkelang comments on the recent progress in collaborative filtering:
(&amp;#8230;) the machine learning community, much like the information retrieval community, generally prefers black box approaches, (&amp;#8230;) If the goal is to optimize one-shot recommendations, they are probably right. But I maintain that the process of picking a movie, like most information seeking tasks, is [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F21%2Frecommender-systems-where-are-we-headed%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/21/recommender-systems-where-are-we-headed/</feedburner:origLink></item><item><title>Tim Bray on solving the economic crisis</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/460927299/</link><category>Business / Economics / Politics</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Fri, 21 Nov 2008 10:39:11 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1562</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>For reasons I will not go into, this quote feels very satisfying today:</p>
<blockquote><p>Solution to economic crisis: sack everyone who has an MBA. (<a href="http://twitter.com/timbray/statuses/1016736621">Tim Bray</a>)</p></blockquote>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/460927299" height="1" width="1"/>]]></content:encoded><description>For reasons I will not go into, this quote feels very satisfying today:
Solution to economic crisis: sack everyone who has an MBA. (Tim Bray)</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">1</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F21%2Ftim-bray-on-solving-the-economic-crisis%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/21/tim-bray-on-solving-the-economic-crisis/</feedburner:origLink></item><item><title>How to speed up retrieval without any index?</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/460294526/</link><category>Data Warehousing and OLAP</category><category>Science and Technology</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Thu, 20 Nov 2008 21:16:43 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1555</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>John Cook gives us a nice recipe to <a href="http://www.johndcook.com/blog/2008/11/17/fast-way-to-test-whether-a-number-is-a-square/">quickly find all squares in a set of integers</a>. For example, given 3, 4, 9, 15, you want your algorithm to identify 4 and 9 as squares.</p>
<p>The naïve way to solve this problem goes as follows:</p>
<ol>
<li>For each element&#8230;</li>
<li>check whether sqrt(x) is an integer.</li>
</ol>
<p>This may prove too expensive since the square-root operation must be computed using a floating-point algorithm. </p>
<p>A better way is to look at the first 4 bits of each integer, that is, its 4 least significant bits (the value modulo 16). If the integer is a square, then these 4 bits must have value 0, 1, 4, or 9. If the numbers are uniformly distributed, this means that you can quickly discard 3 out of 4 numbers.</p>
<p>It is not immediately obvious that you will speed up the retrieval because inserting this check adds some overhead. However, it doubles the speed according to John. It is even less obvious that checking the first 8 or 16 bits, instead of just the first 4 bits, can help. John says it does not help in one C++ implementation, but it does in a C# implementation.</p>
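<p>A minimal Python sketch of the fast-dismissal idea follows. It is not John's code: the names <code>is_square</code> and <code>find_squares</code> are made up for illustration, and the exact check uses the integer routine <code>math.isqrt</code> (Python 3.8 or later) rather than a floating-point square root.</p>

```python
import math

# Residues that a perfect square can have modulo 16 (its last hexadecimal digit).
SQUARE_RESIDUES_MOD_16 = frozenset((i * i) % 16 for i in range(16))

def is_square(x):
    """Return True if the integer x is a perfect square."""
    if 0 > x:
        return False
    # Cheap test: look only at the 4 least significant bits.
    if x % 16 not in SQUARE_RESIDUES_MOD_16:
        return False
    # Expensive test, reached by roughly 1 in 4 uniformly random inputs.
    root = math.isqrt(x)
    return root * root == x

def find_squares(values):
    """Keep only the perfect squares, preserving the input order."""
    return [v for v in values if is_square(v)]
```

<p>Whether the cheap test pays for itself depends on the language and on how expensive the exact test is, which is why it has to be measured.</p>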
<p>This sort of strategy is entirely general. The research question is how much work should you do on fast dismissal? Too much effort toward dismissing lots of candidates might be counterproductive. Too little and your performance might not improve optimally.</p>
<p>Recently I started to wonder whether we could make it multipass: you first dismiss a few candidates with a cheap test, then on the survivors you use a more expensive test and so on. For example, you first check the first 4 bits, and if you cannot dismiss the candidate, you check the next 4 bits and so on. It is not a surprising idea, but figuring out whether it is worth the effort is a research question.</p>
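<p>The multipass idea can be sketched as a cascade of tests ordered from cheap to expensive. This is only an illustration under assumed names (<code>residue_filter</code>, <code>exact_test</code>, <code>cascade</code>); it records how many candidates reach each stage, which is the quantity you would need to measure to decide whether an extra pass is worth it.</p>

```python
import math

def residue_filter(modulus):
    """Build a cheap test: can x be a square, judging by x modulo `modulus`?"""
    allowed = frozenset((i * i) % modulus for i in range(modulus))
    return lambda x: x % modulus in allowed

def exact_test(x):
    """Expensive test: compute the integer square root and check it."""
    root = math.isqrt(x)
    return root * root == x

def cascade(candidates, stages):
    """Apply the stages in order.

    Returns the survivors and, for each stage, how many candidates reached it.
    """
    reached = []
    survivors = list(candidates)
    for stage in stages:
        reached.append(len(survivors))
        survivors = [x for x in survivors if stage(x)]
    return survivors, reached

# First the lowest 4 bits, then the lowest 8 bits, then the exact test.
stages = [residue_filter(16), residue_filter(256), exact_test]
squares, reached = cascade(range(1000), stages)
```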
<p>To make my point, I have worked on fast retrieval under the <a href="http://en.wikipedia.org/wiki/Dynamic_time_warping">Dynamic Time Warping</a> (DTW) distance, a nonlinear distance measure between time series. The DTW does not satisfy a triangle inequality. It is commonly used as a pattern recognition technique when comparing time series. It was initially designed to compare voice samples, allowing for changes in voice rhythm.</p>
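<p>For readers who have not met the DTW before, here is a textbook dynamic-programming sketch. It assumes equal-length series, absolute differences as the cost, and a locality constraint <code>w</code> on how far the alignment may drift; it is not the code from the library mentioned in this post. The three short series at the end are a made-up example showing the triangle inequality failing.</p>

```python
def dtw(x, y, w):
    """DTW distance (sum of absolute differences) between equal-length
    sequences x and y, where x[i] may only be aligned with y[j]
    when abs(i - j) is at most w."""
    n = len(x)
    if len(y) != n:
        raise ValueError("this sketch expects sequences of equal length")
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i samples of x
    # with the first j samples of y
    cost = [[inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = abs(x[i - 1] - y[j - 1]) + best
    return cost[n][n]

a = [0, 0, 1]
b = [0, 1, 1]
c = [0, 0, 0]
# b is at distance 0 from a, and a is at distance 1 from c,
# yet b is at distance 2 from c.
distances = (dtw(b, a, 1), dtw(a, c, 1), dtw(b, c, 1))
```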
<p><a href="http://www.cs.ucr.edu/~eamonn/">Eamonn Keogh</a> from UCR has come up with a simple but nearly optimal way to compute a lower bound to the DTW between any two time series, called LB_Keogh (named after himself). Just like in the John Cook algorithm, this lower bound  <strong>quickly discards the false positives</strong>. If you are interested, Eamonn has applied LB_Keogh to just about every time series problem you can think of.</p>
<p>I improved over LB_Keogh as follows. If LB_Keogh is not good enough (and only if it is not good enough), I compute a tighter lower bound (called LB_Improved). Surprisingly, in many cases, I can improve the retrieval time by a factor of two or more. </p>
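<p>The two-pass strategy can be sketched in a few lines of Python. This is a simplified illustration, not the published library: it assumes equal-length series, absolute differences, and a band of radius <code>w</code>, and all function names are made up. For clarity, <code>lb_improved</code> recomputes the first term rather than reusing the work already done by <code>lb_keogh</code>, and the envelope is computed naively.</p>

```python
def dtw(x, y, w):
    """Banded DTW with absolute differences (equal-length sequences)."""
    n, inf = len(x), float("inf")
    cost = [[inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = abs(x[i - 1] - y[j - 1]) + best
    return cost[n][n]

def envelope(y, w):
    """Running minimum and maximum of y over windows of radius w."""
    n = len(y)
    lower = [min(y[max(0, i - w):i + w + 1]) for i in range(n)]
    upper = [max(y[max(0, i - w):i + w + 1]) for i in range(n)]
    return lower, upper

def clamp(x, lower, upper):
    """Project each x[i] onto the interval from lower[i] to upper[i]."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]

def lb_keogh(x, y, w):
    """Sum of the distances from each x[i] to the envelope of y."""
    projected = clamp(x, *envelope(y, w))
    return sum(abs(a - b) for a, b in zip(x, projected))

def lb_improved(x, y, w):
    """LB_Keogh plus a second term computed from the projection of x."""
    projected = clamp(x, *envelope(y, w))
    first = sum(abs(a - b) for a, b in zip(x, projected))
    return first + lb_keogh(y, projected, w)

def nearest(query, candidates, w):
    """Two-pass nearest-neighbour search; returns (index, distance, counts)."""
    best_index, best_dist = -1, float("inf")
    counts = {"keogh_pruned": 0, "improved_pruned": 0, "full_dtw": 0}
    for index, candidate in enumerate(candidates):
        # First pass: the cheap bound.
        if lb_keogh(candidate, query, w) >= best_dist:
            counts["keogh_pruned"] += 1
            continue
        # Second pass: the tighter bound, only for the survivors.
        if lb_improved(candidate, query, w) >= best_dist:
            counts["improved_pruned"] += 1
            continue
        counts["full_dtw"] += 1
        dist = dtw(candidate, query, w)
        if best_dist > dist:
            best_index, best_dist = index, dist
    return best_index, best_dist, counts
```

<p>The <code>counts</code> dictionary shows how many candidates each pass dismissed, which is what determines whether the second pass pays for itself on a given data set.</p>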
<p>I have published my work as a <a href="http://code.google.com/p/lbimproved/">software library</a>, but also as the following paper:</p>
<blockquote><p>Daniel Lemire, <a href="http://arxiv.org/abs/0811.3301">Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound</a>, to appear in Pattern Recognition.</p></blockquote>
<p>This sort of work is much more difficult than it appears. I could have easily made my method look good by optimizing it, while leaving the competing methods unoptimized. By publishing my implementation, I go a long way toward keeping myself honest. If I fooled myself and the reviewers, someone might find out by inspecting my source code. </p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/460294526" height="1" width="1"/>]]></content:encoded><description>John Cook gives us a nice recipe to quickly find all squares in a set of integers. For example, given 3, 4, 9, 15, you want your algorithm to identify 4 and 9 as squares.
The naïve way to solve this problem goes as follows:

For each element&amp;#8230;
check whether sqrt(x) is an integer.

This may prove too expensive [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">2</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F20%2Fhow-to-speed-up-retrieval-without-any-index%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/20/how-to-speed-up-retrieval-without-any-index/</feedburner:origLink></item><item><title>Why am I not working on world hunger?</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/459158110/</link><category>Academia/Research</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Thu, 20 Nov 2008 20:00:10 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1467</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>My wife sometimes asks me why I am not working on important problems like world hunger. Instead, I am one of the top world expert in <a href="http://arxiv.org/abs/cs.DS/0703109">tag-cloud drawing</a>. I am sure she thinks that I just fool around, faking serious research.</p>
<p>I actually take my research very seriously. </p>
<p>I like to distinguish  abstract from concrete research. Concrete research is when you seek to obtain results in special cases. For example, an AI researcher may try to first understand how we can detect spam. Eventually he might move on to even more sophisticated tasks. In such a form of research, there are no overarching formal plans. You could say it is inductive, maybe. Researchers are often driven to this form of research because the deeper problems are simply too difficult to address directly. (I define a problem to be too difficult when you cannot make noticeable progress in a matter of months.) They hope for a breakthrough to an important problem to come as they work on a narrow issue.</p>
<p>Abstract research derives from a formal plan. The Semantic Web is one such plan. Tim Berners-Lee even drew diagrams early on of what the beast should look like. The research issues are clearly laid out. As a researcher you are tackling an extremely difficult problem, unsure whether you will ever make any noticeable progress. Researchers follow this path because they believe that only a focused effort in a definite direction can solve the difficult problems. Funding agencies love abstract research.</p>
<p>It might be a matter of biology, but my brain has always been much more productive in concrete research. I resist the inductive/deductive classification because it feels wrong. However, time and time again, working on a tractable, but possibly insignificant, problem has led me to understand a deeper issue. When the problems are too big, my brain gets into circular and incorrect arguments. I need to chop down the problems to a manageable size. The problems need to be hard enough to push me to the limits, but easy enough that I can make weekly progress. Moreover, as a researcher, I can never know exactly what I will be doing a month later. </p>
<p>I will make a stronger claim: abstract research is never done. Researchers will give the illusion that they are working directly on some grand problem (like world hunger), but, in reality, they will work at a much smaller scale. And when a researcher solves a grand problem in what seems like a short time, and with few concrete possibly irrelevant steps, I attribute it to luck or lies.</p>
<p>See also my post <a href="http://www.daniel-lemire.com/blog/archives/2007/11/19/my-research-process/">my research process</a>.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/459158110" height="1" width="1"/>]]></content:encoded><description>My wife sometimes asks me why I am not working on important problems like world hunger. Instead, I am one of the top world expert in tag-cloud drawing. I am sure she thinks that I just fool around, faking serious research.
I actually take my research very seriously. 
I like to distinguish  abstract from concrete [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">12</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F19%2Fwhy-am-i-not-work-on-world-hunger%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/19/why-am-i-not-work-on-world-hunger/</feedburner:origLink></item><item><title>Is what I do technical?</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/457836589/</link><category>Science and Technology</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Tue, 18 Nov 2008 20:44:41 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1536</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We are trying to design a master degree in Information Technology. To me, this sort of program should be a professional master degree, that is, it does not lead naturally to a research career or a Ph.D.</p>
<p>My business colleagues argue in favour of research methodology courses. Apparently, students need to learn how to conduct interviews and such. In any case, I then pointed out that my master degree did not contain any such course. One of my business colleagues then said a deadly thing:</p>
<blockquote><p>Of course, you got a technical master degree!</p></blockquote>
<p>This got me really angry. Really, really, really angry. I do not think I ever got so angry in my life.</p>
<p>For the record, my master degree was in <strong>Mathematics</strong> at the University of Toronto. Is Mathematics technical? If technical is to have a &#8220;practical&#8221; connotation, I can tell you that none of my graduate courses were technical. Are <a href="http://search.barnesandnoble.com/Fewnomials/A-Khovanskii/e/9780821845479">fewnomials</a> practical? I think not.</p>
<p>But the deeper implication was that anything having to do with Science was technical. That is, it deals with nuts and bolts. And I think that it is squarely wrong. From my viewpoint, business is far more technical. And I ran my own business for several years. The business side of things was always the boring-but-easy component.</p>
<p>There is a distinct feeling in North America that <strong>business is king, and science &amp; technology are things monkeys or foreigners can do</strong>. Yet, in my experience, it is a lot harder to design a usable web application than negotiate a business deal. I believe that India and China are getting a sweet deal by doing our science &amp; technology while we focus on business. A very sweet deal indeed.</p>
<p>I think that Amazon, Google, Cisco, Microsoft and so on, thrive because many of their engineers have a deep knowledge of Computer Science. Kill the science and you kill the business.</p>
<p>But even if you discard science, writing good source code is hard. Very hard. And it is not hard for technical reasons, not any more than painting, movie-making and sculpture are technical challenges.</p>
<p>In any case, I believe that North America is headed for a wall if it fails to recognize that its prosperity is due to culture, science and technology. And given that 40% of all students at my school go for a business degree, I am nervous.</p>
<p>See also my post <a href="http://www.daniel-lemire.com/blog/archives/2005/09/16/career-swings/">Career Swings</a> where I wrote:</p>
<blockquote><p>I cannot believe that in 2015, we’ll all be lawyers, business managers, salesman, and medical doctors. I cannot believe that technology will stand still and mathematics beyond basic algebra will be a lost art. I cannot believe my two sons will have business degrees and make three times my salary by managing a bunch of underpaid Indian programmers.</p></blockquote>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/457836589" height="1" width="1"/>]]></content:encoded><description>We are trying to design a master degree in Information Technology. To me, this sort of program should be a professional master degree, that is, it does not lead naturally to a research career or a Ph.D.
My business colleagues argue in favour of research methodology courses. Apparently, students need to learn how to conduct interviews [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">4</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F18%2Fis-what-i-do-technical%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/18/is-what-i-do-technical/</feedburner:origLink></item><item><title>SciFi book review: Spin by Robert Charles Wilson</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/456581749/</link><category>Science and Technology</category><category>scifi review</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Mon, 17 Nov 2008 18:21:07 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1534</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The novel <a href="http://en.wikipedia.org/wiki/Spin_(novel)">Spin</a> won the <a title="Hugo Award" href="http://en.wikipedia.org/wiki/Hugo_Award">Hugo Award</a> for <a title="Hugo Award for Best Novel" href="http://en.wikipedia.org/wiki/Hugo_Award_for_Best_Novel">Best Novel</a> in 2006.</p>
<p>It is what I would call a &#8220;temporal disparity&#8221; novel. Earth suddenly becomes surrounded by a temporal shield that slows time down for human beings. Alas, the Sun is aging very fast for the poor human beings. Are we going to die? Who is creating this field?</p>
<p>This is almost exactly the reverse story from <a href="http://fr.wikipedia.org/wiki/Georges-Jean_Arnaud">Georges-Jean Arnaud</a>&#8217;s <em>La grande  séparation</em> (1971-1973). In Arnaud&#8217;s story, a planet has a similar temporal field, but it accelerates time on the planet. Even though the planet has primitive technology, it is constantly surveyed for any sign of technological development. Spin offers the counterpart story.</p>
<p>A temporal disparity leads to a technological disparity: a small band of savages can evolve into a technologically superior race while you are having coffee.</p>
<p><strong>Pros</strong></p>
<p>The novel is very good. The author writes with good scientific rigour.  The writing is supported by new mysteries introduced in every chapter&#8230; to keep you coming back for more. The characters are believable and well drawn.</p>
<p><strong>Cons</strong></p>
<p>The author tried to limit the scope of the story to a few characters, but not all of them are good characters. The writing style reminds me a bit of Card&#8217;s Ender&#8217;s Game series. There is the extra smart kid who grows up to be the only one able to see through what is happening. I found this particular element of the novel irritating. A major catastrophe hits the Earth and only one man seems to be able to put it all together? I am a bit disappointed by how the author dealt with anything outside the Earth, including the Martians. He could have done so much more! </p>
<p>Sequels are upcoming.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/456581749" height="1" width="1"/>]]></content:encoded><description>The novel Spin won the Hugo Award for Best Novel in 2006.
It is what I would call a &amp;#8220;temporal disparity&amp;#8221; novel. Earth becomes suddenly surrounded in a temporal shield that slows time down for human beings. Alas, the Sun is aging very fast for the poor human beings. Are we going to die? Who is [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F17%2Fscifi-book-review-spin-by-robert-charles-wilson%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/17/scifi-book-review-spin-by-robert-charles-wilson/</feedburner:origLink></item><item><title>The most active blogs I follow…</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/456538279/</link><category>Science and Technology</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Mon, 17 Nov 2008 17:15:07 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1530</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>A very active feed that has remained in my list for a long time is a good feed (for me). My top 3 (in decreasing order of activity):</p>
<ul>
<li><a href="http://thenoisychannel.com/">The Noisy Channel</a>: Daniel Tunkelang, Chief Scientist at <a href="http://www.endeca.com/">Endeca</a>. He works in information retrieval.</li>
<li><a href="http://www.sylvienoel.ca/blog/">Population of One</a>: Sylvie Noël, research scientist at the government of Canada. She works in <a href="http://en.wikipedia.org/wiki/Human-computer_interaction">HCI</a>.</li>
<li><a href="http://www.tbray.org/ongoing/">Ongoing</a>: Tim Bray, director of Web Technologies at Sun Microsystems. He helped create XML and Atom.</li>
</ul>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/456538279" height="1" width="1"/>]]></content:encoded><description>A very active feed that has remained in my list for a long time is a good feed (for me). My top 3 (in decreasing order of activity):

The Noisy Channel: Daniel Tunkelang, chief Scientist at Endeca. He works in information retrieval.
Population of One: Sylvie Noël, research scientist at the government of Canada. She works in [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">2</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F17%2Fthe-most-active-blogs-i-follow%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/17/the-most-active-blogs-i-follow/</feedburner:origLink></item><item><title>Full text search in SQL with LuSql</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/456096072/</link><category>Data Warehousing and OLAP</category><category>Science and Technology</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Mon, 17 Nov 2008 08:30:53 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1526</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>MySQL supports natively <a href="http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html">full text search</a>; many database engines do. However, few databases can match a dedicated search engine library like <a href="http://lucene.apache.org/java/docs/">Lucene</a>. Moreover, even if you do not need the power of Lucene, sometimes you are forced to use a database engine that does not support full text search (like raw <a href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV</a> files). </p>
<p>It would be nice to be able to combine a true search engine with any database engine. </p>
<p>If you are willing to use Java, then Glen Newton from NRC has the solution: <a href="http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql">LuSql</a>. It allows you to index with Lucene any database accessible by Java (through JDBC). He says it has been extensively tested. It is open source and free.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/456096072" height="1" width="1"/>]]></content:encoded><description>MySQL supports natively full text search; many database engines do. However, few databases can match a dedicated search engine library like Lucene. Moreover, even if you do not need the power of Lucene, sometimes you are forced to use a database engine that does not support full text search (like raw CSV files). 
It would [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F17%2Ffull-text-search-in-sql-with-lusql%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/17/full-text-search-in-sql-with-lusql/</feedburner:origLink></item><item><title>Toward the Commoditization of Natural Language Processing</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/456096073/</link><category>Science and Technology</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Fri, 14 Nov 2008 17:52:51 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1503</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In a remarkable <a href="http://iit-iti.nrc-cnrc.gc.ca/publications/nrc-50398_e.html">paper</a>, <a href="http://apperceptual.wordpress.com/">Peter Turney</a> shows that using a simple family of algorithms and  freely available software, one can determine analogies, synonyms, antonyms, and relations between words automatically. Here is the beginning of the abstract:</p>
<blockquote><p>
Recognizing analogies, synonyms, antonyms, and associations appear to be four distinct tasks, requiring distinct NLP algorithms. In the past, the four tasks have been treated independently, using a wide variety of algorithms. These four semantic classes, however, are a tiny sample of the full range of semantic phenomena, and we cannot afford to create ad hoc algorithms for each semantic phenomenon; we need to seek a unified approach.</p></blockquote>
<p>I do not work in Natural Language Processing (NLP) per se, but this sounds like commoditization to me in the sense that you no longer need to design, learn and tweak custom algorithms. <strong>If you have enough data, you can do NLP after learning one (remarkably simple) family of algorithms</strong>. <a href="http://norvig.com/">Peter Norvig</a> might approve.</p>
<p>In the database research world, commoditization is already an accomplished fact. Database researchers have been wondering about their relevance for about ten years. </p>
<p><a href="http://apperceptual.wordpress.com/">Peter</a> might argue that in such a context, researchers should become bold and daring. Computer Science researchers should choose crazy problems. </p>
<p><strong>Reference</strong>: Peter Turney, <a href="http://iit-iti.nrc-cnrc.gc.ca/publications/nrc-50398_e.html">A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations</a>, Coling 2008, August 2008.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/456096073" height="1" width="1"/>]]></content:encoded><description>In a remarkable paper, Peter Turney shows that using a simple family of algorithms and  freely available software, one can determine analogies, synonyms, antonyms, and relations between words automatically. Here is the beginning of the abstract:

Recognizing analogies, synonyms, antonyms, and associations appear to be four distinct tasks, requiring distinct NLP algorithms. In the past, [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">3</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F14%2Ftoward-the-commoditization-of-natural-language-processing%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/14/toward-the-commoditization-of-natural-language-processing/</feedburner:origLink></item><item><title>Do not trust financial experts</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/456096074/</link><category>Business / Economics / Politics</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Fri, 14 Nov 2008 15:47:09 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1500</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>One expert predicted the recession. He was ridiculed. Watch and draw your conclusions.</p>
<p><object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/2I0QN-FYkpw&#038;color1=0xb1b1b1&#038;color2=0xcfcfcf&#038;hl=en&#038;fs=1"></param><param name="allowFullScreen" value="true"></param><embed src="http://www.youtube.com/v/2I0QN-FYkpw&#038;color1=0xb1b1b1&#038;color2=0xcfcfcf&#038;hl=en&#038;fs=1" type="application/x-shockwave-flash" allowfullscreen="true" width="425" height="344"></embed></object></p>
<p><strong>Source</strong>: <a href="http://parand.com/say/index.php/2008/11/13/this-guy-called-the-recession/">Standard Deviations</a>.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/456096074" height="1" width="1"/>]]></content:encoded><description>One expert predicted the recession. He was ridiculed. Watch and draw your conclusions.

Source: Standard Deviations.</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">3</thr:total><media:content url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/456096075/2I0QN-FYkpw&amp;" fileSize="882" type="application/x-shockwave-flash" /><itunes:explicit>no</itunes:explicit><itunes:subtitle>One expert predicted the recession. He was ridiculed. Watch and draw your conclusions. Source: Standard Deviations. </itunes:subtitle><itunes:summary>One expert predicted the recession. He was ridiculed. Watch and draw your conclusions. Source: Standard Deviations. </itunes:summary><itunes:keywords>Business / Economics / Politics</itunes:keywords><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F14%2Fdo-not-trust-financial-experts%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/14/do-not-trust-financial-experts/</feedburner:origLink><enclosure url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/456096075/2I0QN-FYkpw&amp;" length="882" type="application/x-shockwave-flash" /><feedburner:origEnclosureLink>http://www.youtube.com/v/2I0QN-FYkpw&amp;#038;color1=0xb1b1b1&amp;#038;color2=0xcfcfcf&amp;#038;hl=en&amp;#038;fs=1</feedburner:origEnclosureLink></item><item><title>So, you think academic peer review works?</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/456096076/</link><category>Academia/Research</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Fri, 14 Nov 2008 10:13:07 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1485</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>If you think peer review is sane, consider <a href=" http://golem.ph.utexas.edu/category/2008/11/the_case_of_m_s_el_naschie.html">this example</a>:</p>
<blockquote><p>El Naschie is editor in chief of the journal Chaos, Solitons and Fractals.  This journal is published by Elsevier, one of the biggest players in the science publishing business. But here’s where things get interesting: this journal also lists 322 papers with El Naschie as an author!
</p></blockquote>
<p>The journal has a high <a href="http://golem.ph.utexas.edu/category/2008/11/the_kind_of_email_i_dont_need.html#c019806">impact factor</a> and a high rating by the Australian Academy of Sciences.</p>
<p><strong>Source</strong>: <a href="http://michaelnielsen.org/blog/?p=494">Nielsen</a>.</p>
<p>See also my posts <a href="http://www.daniel-lemire.com/blog/archives/2008/08/25/the-insane-world-of-academic-publishing/">The insane world of academic publishing</a>, <a href="http://www.daniel-lemire.com/blog/archives/2008/04/15/why-arent-there-more-scientific-breakthroughs/">Why aren’t there more scientific breakthroughs?</a> and <a href="http://www.daniel-lemire.com/blog/archives/2008/08/21/peer-review-is-an-honor-based-system/">Peer review is an honor-based system</a>.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/456096076" height="1" width="1"/>]]></content:encoded><description>If you think peer review is sane, consider this example:
El Naschie is editor in chief of the journal Chaos, Solitons and Fractals.  This journal is published by Elsevier, one of the biggest players in the science publishing business. But here’s where things get interesting: this journal also lists 322 papers with El Naschie as [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">1</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F14%2Fso-you-think-academic-peer-review-works%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/14/so-you-think-academic-peer-review-works/</feedburner:origLink></item><item><title>Measuring the diversity of recommended lists, at last</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/456096077/</link><category>Science and Technology</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Fri, 14 Nov 2008 11:56:02 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1480</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>For a number of years, algorithm researchers in collaborative filtering and recommender systems <strong>have focused on accuracy as the sole performance metric</strong>.</p>
<p>Imagine that you bought a couple of albums by Celine Dion and you liked them a lot. Then the best answer might be to suggest you buy all the other Dion albums. Or is it?</p>
<p>No. You do not want to optimize accuracy above all else. <strong>You need to balance accuracy and diversity</strong>. To any user, it is obvious. Researchers often prefer to ignore diversity because it is harder to measure.</p>
<p>Several people, <a href="http://www.daniel-lemire.com/blog/archives/2007/12/22/collaborative-filtering-why-working-on-static-data-sets-is-not-enough/">me included</a>, have argued in favour of diversity, but metric proposals were still missing. Writing a paper on <strong>measuring</strong> the diversity of recommender systems has been on my to-do list. Unfortunately, I cannot cope with more than a few projects at any one time. Fortunately, it looks like I will not have to write this paper. Zhang and Hurley have done a good job at it:</p>
<blockquote><p>Zhang, M. and Hurley, N. 2008. <a href="http://doi.acm.org/10.1145/1454008.1454030">Avoiding monotony: improving the diversity of recommendation lists</a>. In Proceedings of the 2008 ACM Conference on Recommender Systems (Lausanne, Switzerland, October 23 - 25, 2008). RecSys &#8216;08. ACM, New York, NY, 123-130.</p></blockquote>
<p>Do not be put off by the mathematics: it is a tad formal, but the right ideas are there. Just read sections 3 and 4 slowly. Basically, diversity is measured as the <strong>average dissimilarity between items</strong>. That is a standard form of diversity measure. This strategy to measure diversity is not novel, but to my knowledge, they are the first to apply it to collaborative filtering.</p>
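<p>To make the idea concrete, here is a minimal sketch of a diversity score computed as the average pairwise dissimilarity within a list. It is not Zhang and Hurley&#8217;s exact formulation: the Jaccard dissimilarity on tag sets, the function names and the sample lists are all assumptions chosen for illustration.</p>

```python
from itertools import combinations


def jaccard_dissimilarity(a, b):
    """Return 1 - |intersection| / |union| for two sets of tags.

    The tag sets are a hypothetical way of describing items.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def list_diversity(items, dissimilarity=jaccard_dissimilarity):
    """Average dissimilarity over all unordered pairs of items in a list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical recommendation lists; each item is described by a set of tags.
all_dion = [{"pop", "dion"}, {"pop", "dion"}, {"pop", "dion", "french"}]
mixed = [{"pop", "dion"}, {"jazz", "davis"}, {"rock", "u2"}]

print(list_diversity(all_dion))  # low: the items are nearly identical
print(list_diversity(mixed))     # high: the items share no tags
```

<p>Any dissimilarity function between two items can be plugged in; the only requirement is that it returns a number for a pair of items.</p>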
<p>What is next? You are looking for a paper idea?</p>
<ul>
<li> Take Zhang and Hurley&#8217;s class of diversity measures, and apply them to existing recommender systems. Show that there is an accuracy-diversity trade-off. All you need is a dissimilarity measure between items.</li>
<li> Do user studies to prove people prefer a balance between diversity and accuracy.</li>
</ul>
<p><strong>Requirement</strong>: if you steal any one of these ideas, you have to email me a copy of your paper once it is written.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/456096077" height="1" width="1"/>]]></content:encoded><description>For a number of years, algorithm researchers in collaborative filtering and recommender systems have focused on accuracy as the sole performance metric.
Imagine that you bought a couple of albums from Celine Dion and you liked them a lot. Then the best answer might be to suggest you buy all the other Dion albums. Or is [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">1</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F14%2Fmeasuring-the-diversity-of-recommended-lists-at-last%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/14/measuring-the-diversity-of-recommended-lists-at-last/</feedburner:origLink></item><item><title>To improve your indexes: sort your tables!</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/449860745/</link><category>Data Warehousing and OLAP</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Thu, 20 Nov 2008 17:44:18 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1465</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Many database indexes, including bitmap indexes, are sensitive to the order of the rows in your table. Many data warehousing practitioners urge you to sort your tables to get better results, especially with Oracle systems. In fact, column-oriented database systems like Vertica are built on sorted tables.</p>
<p>What do I mean by sorting the rows? It turns out that finding the best row reordering is an NP-hard problem. You might as well use something cheap like lexicographical sort as a heuristic. Finding an equally scalable technique that works much better is probably impossible. Ah! But there are different ways to sort lexicographically!</p>
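<p>As a small illustration of why the choice of lexicographic order matters, the sketch below sorts the same toy table by two different column orders and counts the runs of identical values in each column. Fewer runs generally means better run-length compression. The table and the helper names are made up for this example, and the run count is only a rough proxy for index size.</p>

```python
from itertools import groupby


def count_runs(column):
    """Number of maximal runs of identical consecutive values."""
    return sum(1 for _ in groupby(column))


def total_runs(rows):
    """Sum of the run counts over all columns of a table (list of tuples)."""
    return sum(count_runs(col) for col in zip(*rows))


def sort_by_columns(rows, column_order):
    """Lexicographic sort using the given column priority."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in column_order))


# Hypothetical table: column 0 has 2 distinct values, column 1 has 4.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4),
        ("a", 2), ("b", 1), ("a", 4), ("b", 3)]

print(total_runs(rows))                           # unsorted table
print(total_runs(sort_by_columns(rows, [0, 1])))  # column 0 first
print(total_runs(sort_by_columns(rows, [1, 0])))  # column 1 first
```

<p>On this toy table, both sorted versions have fewer runs than the unsorted one, and the two column orders give different totals.</p>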
<p>We wrote a few papers on this issue including one for <a href="http://arxiv.org/abs/0808.2083">DOLAP 2008</a>. Here are the slides of our presentation:</p>
<div id="__ss_742475" style="width: 425px; text-align: left;"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" title="Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes" href="http://www.slideshare.net/lemire/histogramaware-sorting-for-enhanced-wordaligned-compression-in-bitmap-indexes-presentation?type=powerpoint">Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes</a><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="355" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=dolaphandout-1226428379721064-8&amp;stripped_title=histogramaware-sorting-for-enhanced-wordaligned-compression-in-bitmap-indexes-presentation" /><embed type="application/x-shockwave-flash" width="425" height="355" src="http://static.slideshare.net/swf/ssplayer2.swf?doc=dolaphandout-1226428379721064-8&amp;stripped_title=histogramaware-sorting-for-enhanced-wordaligned-compression-in-bitmap-indexes-presentation" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<div style="font-size: 11px; font-family: tahoma,arial; height: 26px; padding-top: 2px;">View SlideShare <a style="text-decoration:underline;" title="View Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes on SlideShare" href="http://www.slideshare.net/lemire/histogramaware-sorting-for-enhanced-wordaligned-compression-in-bitmap-indexes-presentation?type=powerpoint">presentation</a> or <a style="text-decoration:underline;" href="http://www.slideshare.net/upload?type=powerpoint">Upload</a> your own. (tags: <a style="text-decoration:underline;" href="http://slideshare.net/tag/dolarp2008">dolarp2008</a>)</div>
</div>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/449860745" height="1" width="1"/>]]></content:encoded><description>Many database indexes, including bitmap indexes, are sensitive to the order of the rows in your table. Many data warehousing practitioners urge you to sort your tables to get better results, especially with Oracle systems. In fact, column-oriented database systems like Vertica are built on sorted tables.
What do I mean by sorting the rows? It [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total><media:content url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/449860749/ssplayer2.swf" fileSize="54208" type="application/x-shockwave-flash" /><itunes:explicit>no</itunes:explicit><itunes:subtitle>Many database indexes, including bitmap indexes, are sensitive to the order of the rows in your table. Many data warehousing practitioners urge you to sort your tables to get better results, especially with Oracle systems. In fact, column-oriented databas</itunes:subtitle><itunes:summary>Many database indexes, including bitmap indexes, are sensitive to the order of the rows in your table. Many data warehousing practitioners urge you to sort your tables to get better results, especially with Oracle systems. In fact, column-oriented database systems like Vertica are built on sorted tables. What do I mean by sorting the rows? It [...]</itunes:summary><itunes:keywords>Data Warehousing and OLAP</itunes:keywords><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F11%2Fto-improve-your-indexes-sort-your-tables%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/11/to-improve-your-indexes-sort-your-tables/</feedburner:origLink><enclosure url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/449860749/ssplayer2.swf" length="54208" type="application/x-shockwave-flash" /><feedburner:origEnclosureLink>http://static.slideshare.net/swf/ssplayer2.swf?doc=dolaphandout-1226428379721064-8&amp;amp;stripped_title=histogramaware-sorting-for-enhanced-wordaligned-compression-in-bitmap-indexes-presentation</feedburner:origEnclosureLink></item><item><title>Leaves in the web of 
knowledge</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/449161578/</link><category>Academia/Research</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Thu, 20 Nov 2008 17:45:49 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1463</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Sometimes people conclude that I am very humble or just not very smart when I state that most of my work is not very important. In truth, a couple of things I did as a researcher are worth considering, and I hope to produce a few more, but these are small gems in a vast underground of dirt.</p>
<p>It appears that I am not alone. Here is Zeilberger on why errors in 99.9% of mathematical papers are no big deal:</p>
<blockquote><p>Most mathematical papers are leaves in the web of knowledge, that no one reads, or will ever use to prove something else. The results that are used again and again are mostly lemmas, that while a priori non-trivial, once known, their proof is transparent. (Zeilberger&#8217;s <a href="http://www.math.rutgers.edu/~zeilberg/Opinion91.html">Opinion 91</a>)</p></blockquote>
<p>I agree that getting a formula wrong is no big deal. As long as we have a diverse set of researchers working on different problems, important errors are not going to survive long. Fortunately.</p>
<p><a href="http://www.daniel-lemire.com/blog/archives/2008/08/21/peer-review-is-an-honor-based-system/">Cheating and biases</a> are what we need to worry about. </p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/449161578" height="1" width="1"/>]]></content:encoded><description>Sometimes people conclude that I am very humble or just not very smart when I state that most of my work is not very important. In truth, a couple of things I did as a researcher are worth considering, and I hope to produce a few more, but these are small gems in a vast [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">2</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F11%2Fleaves-in-the-web-of-knowledge%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/11/leaves-in-the-web-of-knowledge/</feedburner:origLink></item><item><title>Understanding what makes database indexes work</title><link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/445655633/</link><category>Data Warehousing and OLAP</category><category>Favorite</category><category>Science and Technology</category><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Daniel Lemire</dc:creator><pubDate>Fri, 21 Nov 2008 11:36:48 -0600</pubDate><guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1454</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p><strong>Why do database indexes work?</strong></p>
<p>In a <a href="http://www.daniel-lemire.com/blog/archives/2008/10/31/a-no-free-lunch-theorem-for-database-indexes/">previous post</a>, I explained that only two factors make indexing possible:</p>
<ul>
<li>your index expects specific queries</li>
<li>or you make specific assumptions about the data sets.</li>
</ul>
<p>In other cases, you are better off just scanning the entire data set.</p>
<p><strong>What makes database indexes work?</strong></p>
<p>As far as I know, there are only 6 strategies that make indexes work. By combining them in different ways, we get all of the various existing schemes. (I would love to hear your feedback on this claim!)</p>
<p><em>1. You expect specific queries: restructure your data! </em></p>
<p>Suppose you know ahead of time that you will only need to select some of the elements in your data set. Then you can tailor an index for such queries and thus avoid scanning much of the content. For example, an inverted index in full-text search will select which documents contain the various keywords. Instead of working with all documents, you will only worry about the ones matching at least one keyword. Indexing a column with a B-tree or a hash table is another scenario where you try to immediately go to the relevant rows in a table. </p>
<p>Of course, if I look for all documents containing the words &#8220;the&#8221; and &#8220;will&#8221;, and want to know how many there are and what their average length is, such a form of indexing will not help.</p>
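<p>A toy inverted index might look like the following sketch. The whitespace tokenization, the function names and the sample documents are assumptions made for illustration; a real full-text engine does much more.</p>

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_any(index, keywords):
    """Ids of documents matching at least one keyword.

    Only the posting sets of the keywords are touched; the documents
    themselves are never scanned.
    """
    result = set()
    for word in keywords:
        result |= index.get(word.lower(), set())
    return result


# Hypothetical document collection.
docs = {
    1: "bitmap indexes compress well",
    2: "the cat will sleep",
    3: "the index will help",
}
index = build_inverted_index(docs)

print(search_any(index, ["bitmap"]))       # a selective keyword
print(search_any(index, ["the", "will"]))  # common words match many documents
```

<p>With a selective keyword, the index returns a small set right away. With very common words, the posting sets cover much of the collection, so the index saves little work.</p>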
<p><em>2. You expect specific queries: materialize them! </em></p>
<p>Another commonly used strategy is view materialization.  If 10% of all visitors on Google type in the word &#8220;sex&#8221;, they might as well precompute the result of the query. In Business Intelligence, if you can expect your users to mostly care about results aggregated over weeks, months or years, it makes sense to precompute these values instead of always working from the raw data. Alternatively, you can materialize intermediate elements that are needed to compute your results. For example, even if people do not need data aggregated per day, precomputing it might be useful for computing weekly numbers faster.</p>
<p>This form of indexing tends to work well to address the most popular queries, but it fails when people have more specific needs.</p>
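<p>Here is a minimal sketch of the intermediate materialization mentioned above: daily totals are computed once from the raw records, and weekly totals are then derived from the daily totals instead of the raw data. The record layout, the function names and the sample figures are hypothetical.</p>

```python
from collections import defaultdict
from datetime import date


def materialize_daily_totals(sales):
    """Precompute one total per day from raw (date, amount) records."""
    daily = defaultdict(float)
    for day, amount in sales:
        daily[day] += amount
    return dict(daily)


def weekly_totals(daily):
    """Roll the materialized daily totals up to (ISO year, ISO week) totals."""
    weekly = defaultdict(float)
    for day, total in daily.items():
        iso_year, iso_week, _ = day.isocalendar()
        weekly[(iso_year, iso_week)] += total
    return dict(weekly)


# Hypothetical raw records: (day, amount).
raw_sales = [
    (date(2008, 11, 10), 10.0),
    (date(2008, 11, 10), 5.0),
    (date(2008, 11, 14), 20.0),
    (date(2008, 11, 17), 7.0),
]

daily = materialize_daily_totals(raw_sales)  # done once, ahead of time
weekly = weekly_totals(daily)                # cheap: one pass over the days
print(daily)
print(weekly)
```

<p>The weekly computation touches one value per day rather than one value per raw record, which is where the savings come from when there are many records per day.</p>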
<p><em>3. You expect specific queries: redundancy is (sometimes) your friend </em></p>
<p>When you do not know exactly which queries to expect, you can try to index the data in different ways, for different queries. For example, you could use both a B-tree and a hash table, and determine at query time which is the best evaluation strategy. You might even determine that the best way is to forgo the indexes and scan the raw data!</p>
<p><em>4. Use multiresolution! </em></p>
<p>Suppose that you look for specific images, but you may still need to scan 50% of them. An index that would point you to only the relevant images might not be effective. Instead, you should try to quickly discard the irrelevant candidates. What you could do is create thumbnails (low-resolution images). Then you can quickly dismiss the images that are obviously not a good match. Naturally, you can have progressively finer resolutions. </p>
<p>Database indexes often bin values together. For example, you could bin all workers earning between $10,000 and $30,000, then all workers earning between $30,000 and $50,000, and so on. If you are looking for workers earning between $40,000 and $45,000, you can first find all workers that are in the $30,000-$50,000 bin, and then look up their actual salaries, one by one. You can adapt the bins either to the data distribution or to the types of queries you expect.</p>
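<p>The salary example can be sketched as follows. The bin edges follow the text, while the function names and the sample salaries are assumptions for illustration. Bins are half-open intervals, and salaries outside all bins are simply not indexed in this sketch.</p>

```python
def build_bins(salaries, edges):
    """Assign each row id to the bin [edges[i], edges[i+1]) holding its salary.

    Salaries that fall outside every bin are left out of the index.
    """
    bins = {i: [] for i in range(len(edges) - 1)}
    for row_id, salary in enumerate(salaries):
        for i in range(len(edges) - 1):
            if edges[i] <= salary < edges[i + 1]:
                bins[i].append(row_id)
                break
    return bins


def range_query(salaries, edges, bins, low, high):
    """Row ids with low <= salary <= high, checking only overlapping bins."""
    matches = []
    for i, row_ids in bins.items():
        # Coarse step: skip bins that cannot contain a match.
        if edges[i] <= high and low < edges[i + 1]:
            # Fine step: look up the actual salaries, one by one.
            matches.extend(r for r in row_ids if low <= salaries[r] <= high)
    return sorted(matches)


# Hypothetical salaries, indexed by row id.
salaries = [12000, 42000, 31000, 44999, 55000, 45000]
edges = [10000, 30000, 50000, 70000]
bins = build_bins(salaries, edges)

print(range_query(salaries, edges, bins, 40000, 45000))
```

<p>Only the rows in the $30,000-$50,000 bin have their salaries inspected; the other bins are dismissed after a single comparison each.</p>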
<p>For more examples, see my post <a href="http://www.daniel-lemire.com/blog/archives/2008/11/20/how-to-speed-up-retrieval-without-any-index/">How to speed up retrieval without any index?</a>.</p>
<p><em>5. Your data is not random: compress it! </em></p>
<p>Most real-world data is highly compressible. By compressing the data, you can make it so that your CPU and IO subsystem process less data. However, you have to worry about bottlenecks. Too much compression may overload your CPU. Too little compression and most time will be spent in loading the data from disk. Two techniques are often combined to get good results out of compression: sorting and <a href="http://en.wikipedia.org/wiki/Run-length_encoding">run-length encoding</a>. </p>
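<p>The combination of sorting and run-length encoding can be sketched in a few lines. The column values below are made up, and the helper names are only illustrative.</p>

```python
from itertools import groupby


def run_length_encode(values):
    """Encode a sequence as a list of (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# Hypothetical column of province codes.
column = ["QC", "ON", "QC", "BC", "ON", "QC", "ON", "QC"]

unsorted_runs = run_length_encode(column)
sorted_runs = run_length_encode(sorted(column))

print(len(unsorted_runs))  # one run per value: no gain
print(sorted_runs)         # one run per distinct value after sorting
```

<p>Without sorting, this column has as many runs as values, so run-length encoding gains nothing. After sorting, there is one run per distinct value.</p>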
<p><em>6. In any case: optimize your code </em></p>
<p>You should be using cache-aware and <a href="http://arxiv.org/abs/0808.2083">CPU-aware</a> indexes. Be aware that comparing two bits together may take nearly as long as comparing two integers. Be aware that jumping all over the place (as in a B-tree) takes longer than processing the data by tiny chunks.</p>
<img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/445655633" height="1" width="1"/>]]></content:encoded><description>Why do database indexes work?
In a previous post, I explained that only two factors make indexing possible:

your index expects specific queries
or you make specific assumptions about the data sets.

In other cases, you are better off just scanning the entire data set.
What makes database indexes work?
As far as I know, there are only 6 strategies that [...]</description><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F11%2F07%2Funderstanding-what-makes-database-indexes-really-work%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/11/07/understanding-what-makes-database-indexes-really-work/</feedburner:origLink></item><media:rating>nonadult</media:rating><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetFeedData?uri=daniel-lemire/atom</feedburner:awareness></channel></rss>
