<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/rss2enclosuresfull.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:media="http://search.yahoo.com/mrss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Daniel Lemire's blog</title>
	
	<link>http://www.daniel-lemire.com/blog</link>
	<description>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</description>
	<pubDate>Thu, 08 Jan 2009 14:40:41 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<!-- podcast_generator="podPress/8.8" -->
		<copyright>© </copyright>
		<managingEditor>lemireno@spamacm.org ()</managingEditor>
		<webMaster>lemireno@spamacm.org()</webMaster>
		<category />
		<ttl>1440</ttl>
		<itunes:keywords />
		<itunes:subtitle>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</itunes:subtitle>
		<itunes:summary>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</itunes:summary>
		<itunes:author />
		<itunes:category text="Society &amp; Culture" />
		<itunes:owner>
			<itunes:name />
			<itunes:email>lemireno@spamacm.org</itunes:email>
		</itunes:owner>
		<itunes:block>No</itunes:block>
		<itunes:explicit>no</itunes:explicit>
		<itunes:image href="http://www.daniel-lemire.com/blog/wp-content/photos/daniellohanjardin.jpg" />
		<image>
			<url>http://www.daniel-lemire.com/blog/wp-content/photos/daniellohanjardin.jpg</url>
			<title>Daniel Lemire's blog</title>
			<link>http://www.daniel-lemire.com/blog</link>
			<width>144</width>
			<height>144</height>
		</image>
		<media:copyright>©</media:copyright><media:thumbnail url="http://www.daniel-lemire.com/blog/wp-content/photos/daniellohanjardin.jpg" /><media:keywords></media:keywords><media:category scheme="http://www.itunes.com/dtds/podcast-1.0.dtd">Society &amp; Culture</media:category><creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/2.0/</creativeCommons:license><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/daniel-lemire/atom" type="application/rss+xml" /><feedburner:emailServiceId>1396075</feedburner:emailServiceId><feedburner:feedburnerHostname>http://www.feedburner.com</feedburner:feedburnerHostname><item>
		<title>Progress is continuous by nature</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/506234453/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2009/01/08/progress-is-continuous-by-nature/#comments</comments>
		<pubDate>Thu, 08 Jan 2009 14:25:54 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1716</guid>
		<description>In my post We never invent anything new, yet progress is made!, I argued that innovation is incremental and social. I derived two recommendations for innovators: be good at communicating your ideas and be networked. Indeed, while you cannot create radically new ideas, you may contribute significantly to the adoption of an important insight.
Frédéric commented:
Sometimes [...]</description>
			<content:encoded><![CDATA[<p>In my post <a href="http://www.daniel-lemire.com/blog/archives/2008/12/27/we-never-invent-anything-new-yet-progress-is-made/">We never invent anything new, yet progress is made!</a>, I argued that <strong>innovation is incremental and social</strong>. I derived two recommendations for innovators: be good at communicating your ideas and be networked. Indeed, while you cannot create radically new ideas, you may contribute significantly to the adoption of an important insight.</p>
<p>Frédéric <a href="http://www.daniel-lemire.com/blog/archives/2008/12/27/we-never-invent-anything-new-yet-progress-is-made/#comment-50517">commented</a>:</p>
<blockquote><p>Sometimes (rarely) a significant breakthrough is achieved by someone who really invent something. Some examples :</p>
<ul>
<li> Copernic and geocentrism give up,</li>
<li> Darwin and evolution theory,</li>
</ul>
<p>Sure that these breakthroughs have been possible by previous incremental progress. Citing Thomas Khun, science progress is discontinuous by nature. It is more rupture than accumulation.</p></blockquote>
<p>I disagree with the Thomas Kuhn quote. Let me take the two examples Frédéric submitted.</p>
<p>Copernicus did not invent <a href="http://en.wikipedia.org/wiki/Heliocentrism">heliocentrism</a>. From <a href="http://en.wikipedia.org/wiki/Geocentrism#Maragha_system">Wikipedia</a>, we learn that <em>The Greek <a title="Aristarchus of Samos" href="http://en.wikipedia.org/wiki/Aristarchus_of_Samos">Aristarchus of Samos</a>, in the 3rd century BC, was the first known person to speculate that the Earth revolves around a stationary sun.</em> Even Copernicus's mathematical models were not novel. Still from <a href="http://en.wikipedia.org/wiki/Geocentrism#Maragha_system">Wikipedia</a>, we learn that two centuries before Copernicus wrote <em><a title="De revolutionibus orbium coelestium" href="http://en.wikipedia.org/wiki/De_revolutionibus_orbium_coelestium">De revolutionibus orbium coelestium</a></em>, Ibn Al-Shatir, a Damascene astronomer, wrote a lunar theory which is mathematically identical to that of Copernicus.</p>
<p>Darwin did not invent evolution. Lamarck&#8217;s <a href="http://en.wikipedia.org/wiki/Inheritance_of_acquired_characters">Inheritance of acquired characters</a> theory, published in 1809, clearly describes evolution as we know it; he just misunderstood the mechanism by which evolution occurs. Darwin did not even single-handedly invent evolution by natural selection, since he co-discovered it with <a href="http://en.wikipedia.org/wiki/Alfred_Russel_Wallace">Wallace</a>. It is quite possible that others had the same ideas independently as well (see <a href=" http://www.amazon.com/Monad-Man-Concept-Progress-Evolutionary/dp/0674582209/">Monad to Man: The Concept of Progress in Evolutionary Biology</a>).</p>
<p>In his excellent blog post <a href="http://apperceptual.wordpress.com/2008/06/22/convergent-evolution-and-multiple-discovery/">Convergent Evolution and Multiple Discovery</a>, Peter Turney wrote:</p>
<blockquote><p>(&#8230;) a <a title="http://www.amazon.com/exec/obidos/ASIN/0521296811/daniellemires-20" href="http://www.amazon.com/exec/obidos/ASIN/0521296811/daniellemires-20">close study of the history of any particular technology</a> invariably shows a series of small, incremental developments. The apparent jumps are an illusion, caused by <a title="http://www.saffo.com/idea1.php" href="http://www.saffo.com/idea1.php">the way technology is adopted</a>.</p></blockquote>
<p>To express it differently, apparent jumps forward are social effects. If you ever become part of history as an innovator, it will not be because of your superior ideas, but because you were at the center of a social phenomenon.</p>
<p>Be humble.</p>
<p><strong>Further reading</strong>:</p>
<ul>
<li> <a href="http://www.amazon.com/Evolution-Technology-Cambridge-Studies-History/dp/0521296811/ ">The Evolution of Technology</a></li>
<li> <a href="http://www.amazon.com/Multiple-Discovery-Pattern-Scientific-Progress/dp/0861270258/">Multiple Discovery: The Pattern of Scientific Progress</a></li>
</ul>
<p><strong>Source</strong>: Almost all of the good content of this post must be credited to <a href="http://www.apperceptual.com/">Peter Turney</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2009/01/08/progress-is-continuous-by-nature/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2009%2F01%2F08%2Fprogress-is-continuous-by-nature%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2009/01/08/progress-is-continuous-by-nature/</feedburner:origLink></item>
		<item>
		<title>How many deleted sections do you write?</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/504789455/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2009/01/06/how-many-deleted-sections-do-you-produce/#comments</comments>
		<pubDate>Wed, 07 Jan 2009 00:50:55 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1711</guid>
		<description>Research ideas come through writing. Thinking deep thoughts while you stare at the wall is not productive. So, researchers write a lot. Some of it is incorrect or uninteresting.
Just like movie studios are filled with deleted scenes, my drawers are full of deleted sections. I write about 5 research papers a year; I must throw [...]</description>
			<content:encoded><![CDATA[<p><a href="http://www.daniel-lemire.com/blog/archives/2008/07/11/do-you-think-because-you-write-or-write-because-you-think/">Research ideas come through writing</a>. Thinking deep thoughts while you stare at the wall is not productive. So, researchers write a lot. Some of it is incorrect or uninteresting.</p>
<p>Just like movie studios are filled with deleted scenes, my drawers are full of deleted sections. I write about <a href="http://www.daniel-lemire.com/en/publications.html">5 research papers a year</a>; I must throw away several hundred pages of content each year. Deleted scenes make it to the DVD version. My deleted sections sometimes appear in technical reports. Most often, I never publish them. They fall in the following categories:</p>
<ul>
<li>Lengthy theoretical analysis of a tangential idea. Or <em>a good idea in the wrong paper</em>.</li>
<li>Description of an inconclusive analysis or experiment.</li>
</ul>
<p>I am often reluctant to throw away uninteresting content. <em>But I worked so hard on this section!</em></p>
<p>Throwing away content is easy if you write conference papers. The page limits and tight deadlines force your hand. However, it becomes harder with journal articles, where there is often no page limit and reviewers expect a thorough and exhaustive analysis.</p>
<p>The main problem, I believe, is in the medium itself. That is, our collective reluctance to definitively move away from paper. Just like DVDs make it possible to include deleted scenes, research papers appearing online should have links to deleted sections and extra material. Some of the sections I delete should still be available.</p>
<p>Yes, I can write an appendix, but appendices are peer reviewed. I want to include <em>extra stuff</em> for the curious readers.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2009/01/06/how-many-deleted-sections-do-you-produce/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2009%2F01%2F06%2Fhow-many-deleted-sections-do-you-produce%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2009/01/06/how-many-deleted-sections-do-you-produce/</feedburner:origLink></item>
		<item>
		<title>Favorite posts for 2008</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/501172378/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2009/01/02/favorite-posts-for-2008/#comments</comments>
		<pubDate>Fri, 02 Jan 2009 18:50:18 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category />

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1708</guid>
		<description>I wrote about 300 blog posts this year. Here are some of my favorites:

Do you think because you write, or write because you think?
Good research: invent new problems or explain mysteries
How to solve hard problems
To improve your indexes: sort your tables!
The secret to intellectual productivity</description>
			<content:encoded><![CDATA[<p>I wrote about 300 blog posts this year. Here are some of my favorites:</p>
<ul>
<li><a href="http://www.daniel-lemire.com/blog/archives/2008/07/11/do-you-think-because-you-write-or-write-because-you-think/">Do you think because you write, or write because you think?</a></li>
<li><a href="http://www.daniel-lemire.com/blog/archives/2008/06/24/good-research-invent-new-problems-or-explain-mysteries/">Good research: invent new problems or explain mysteries</a></li>
<li><a href="http://www.daniel-lemire.com/blog/archives/2008/03/31/how-to-solve-hard-problems/">How to solve hard problems</a></li>
<li><a href="http://www.daniel-lemire.com/blog/archives/2008/11/11/to-improve-your-indexes-sort-your-tables/">To improve your indexes: sort your tables!</a></li>
<li><a href="http://www.daniel-lemire.com/blog/archives/2008/08/19/the-secret-to-intellectual-productivity/">The secret to intellectual productivity</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2009/01/02/favorite-posts-for-2008/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2009%2F01%2F02%2Ffavorite-posts-for-2008%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2009/01/02/favorite-posts-for-2008/</feedburner:origLink></item>
		<item>
		<title>Where are the academic podcasts?</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/499578520/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/31/where-are-the-academic-podcasts/#comments</comments>
		<pubDate>Wed, 31 Dec 2008 16:42:29 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1704</guid>
		<description>This blog is also a podcast. Few people notice. I have not posted any audio in months. (If you have never listened to me: I am better at writing English than at speaking it.)
Where are the good academic/research podcasts? What I have found so far fits into these categories:

Promotional material for schools or research centers.
Capture of [...]</description>
			<content:encoded><![CDATA[<p>This blog is also a podcast. Few people notice. I have not posted any audio in months. (If you have never listened to me: I am better at writing English than at speaking it.)</p>
<p>Where are the good academic/research podcasts? What I have found so far fits into these categories:</p>
<ul>
<li>Promotional material for schools or research centers.</li>
<li>Capture of audio/video events (such as lectures).</li>
</ul>
<p>I have never liked lectures. I was the kind of student to never pay any attention in class. Some live talks are good, but most are not.</p>
<p>Podcasting is different, however, because it can be edited afterward. You can cut the parts where you are rambling. You can redo part of the podcast if the first take was not good.</p>
<p>But where are the researchers producing good podcasts?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/31/where-are-the-academic-podcasts/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F31%2Fwhere-are-the-academic-podcasts%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/31/where-are-the-academic-podcasts/</feedburner:origLink></item>
		<item>
		<title>What makes recommender systems work?</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/499478355/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/31/what-makes-recommender-systems-work/#comments</comments>
		<pubDate>Wed, 31 Dec 2008 14:08:12 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1700</guid>
		<description>Why can we predict tastes? There are several possible explanations:

Intrinsically, individuals have predictable tastes. To test this theory, we would need to isolate each individual. Collect their opinions. Then attempt to make predictions. (You also need to prevent the recommender system from giving feedback to the user!)
 People are influenced by the perceived popularity of [...]</description>
			<content:encoded><![CDATA[<p>Why can we predict tastes? There are several possible explanations:</p>
<ul>
<li>Intrinsically, individuals have predictable tastes. To test this theory, we would need to isolate each individual. Collect their opinions. Then attempt to make predictions. (You also need to prevent the recommender system from giving feedback to the user!)</li>
<li> People are influenced by the perceived popularity of items. People tend to like a movie more when they think it is a blockbuster. A  more general statement is:   <a href="http://www.amazon.com/exec/obidos/ASIN/0521542200/daniellemires-20?%5Fencoding=UTF8&amp;camp=1789&amp;link%5Fcode=xm2">Preferences are constructed in the process of elicitation</a>.</li>
</ul>
<p>Jon Dron <a href="http://community.brighton.ac.uk/jd29/weblog/39189.html">linked</a> to a 2007 New York Times article by Duncan J. Watts which <a href="http://www.nytimes.com/2007/04/15/magazine/15wwlnidealab.t.html">claims</a> that the second factor can easily dominate. The article is based on experiments, but I have several arguments of my own to support this claim. Highly cited research papers are often no more interesting than regular papers, except that you must also cite them; otherwise people will complain that you are missing a reference. The same is true of music or books. I often read novels because many people have read them: that is how, for example, I found <a href="http://www.amazon.com/compagnie-glaces-ceinture-feu/dp/2265070750/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1230730696&amp;sr=8-1">la compagnie des glaces</a> and Dune.</p>
<p>Hence,  collaborative filtering is a circular process:</p>
<ul>
<li>People like certain items (mostly) because they perceive that similar people like them.</li>
<li>Collaborative filtering matches people with such items.</li>
</ul>
<p>Of course, that is a bit pessimistic: people are also influenced by the intrinsic value of items. However, how can you differentiate the social perception from the intrinsic value of an item? Is this novel entertaining, or is it popular? Outside a laboratory, we cannot tell these factors apart!</p>
<p>How well would collaborative filtering work if individuals were isolated? It would work poorly. There are millions of books and research articles. Many are quite good. Without a social component to push us in some directions, we would have diverse tastes.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/31/what-makes-recommender-systems-work/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F31%2Fwhat-makes-recommender-systems-work%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/31/what-makes-recommender-systems-work/</feedburner:origLink></item>
		<item>
		<title>Grabbing attention or building a reputation?</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/497885166/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/29/grabbing-attention-or-building-a-reputation/#comments</comments>
		<pubDate>Mon, 29 Dec 2008 16:44:36 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1693</guid>
		<description>Daniel Tunkelang has been writing on the attention economy (here and here for example): everyone is fighting to have your attention, and you only have so much to offer.
Attention is easy to measure:

You can record the number of people subscribing to your blog.
You can count the number of people citing your research papers.
You can point [...]</description>
			<content:encoded><![CDATA[<p>Daniel Tunkelang has been writing on the <a href="http://en.wikipedia.org/wiki/Attention_economy">attention economy</a> (<a href="http://thenoisychannel.com/2008/12/29/its-the-attention-stupid/">here</a> and <a href="http://thenoisychannel.com/2008/12/27/loic-le-meur-misses-the-point-of-twitter/">here</a> for example): everyone is fighting to have your attention, and you only have so much to offer.</p>
<p>Attention is easy to measure:</p>
<ul>
<li>You can record the number of people subscribing to your blog.</li>
<li>You can count the number of people citing your research papers.</li>
<li>You can point to your number of followers on Twitter or your number of friends on Facebook.</li>
</ul>
<p>However, I do not blog or write research papers merely to grab attention. Instead, <strong>I seek to increase my reputation</strong>. While attention fluctuates depending on your current actions, reputation builds up over time based on your reliability, your honesty, and your transparency. To build a good reputation, you do not need to do anything extraordinary: you just need to be consistent over a long time.</p>
<p>Of course, you need to get some attention if you are building a reputation. However, in the long run, the saying <em>build it and they will come</em> is true. Being present and doing good work is enough. You do not need flashy presentations. Remain lean and mean. Avoid high-maintenance operations. Do good quality work.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/29/grabbing-attention-or-building-a-reputation/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F29%2Fgrabbing-attention-or-building-a-reputation%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/29/grabbing-attention-or-building-a-reputation/</feedburner:origLink></item>
		<item>
		<title>We never invent anything new, yet progress is made!</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/496533469/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/27/we-never-invent-anything-new-yet-progress-is-made/#comments</comments>
		<pubDate>Sat, 27 Dec 2008 20:50:59 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1664</guid>
		<description>Practical innovation explains how per-capita wealth increased eightfold during the last century. Yet, we are constantly reminded that we never invent anything new:

Most movies are remakes or variations on older movies.
Most research papers are variations on a theme.
Most products and services are variations on existing products and services.

Even if I invent something recognized as drastically [...]</description>
			<content:encoded><![CDATA[<p>Practical innovation <a href="http://www.daniel-lemire.com/blog/archives/2008/12/15/why-is-the-free-market-letting-us-down/">explains</a> how per-capita wealth increased eightfold during the last century. Yet, we are constantly reminded that we never invent anything new:</p>
<ul>
<li>Most movies are remakes or variations on older movies.</li>
<li>Most research papers are variations on a theme.</li>
<li>Most products and services are variations on existing products and services.</li>
</ul>
<p>Even if I invent something recognized as <em>drastically novel</em>, I am sure someone will say &#8220;oh! but we did that 20 years ago.&#8221; A recent example is <a href="http://en.wikipedia.org/wiki/Tag_cloud">tag clouds</a>, which have no equivalent in my pre-Web textbooks. Yet, I am sure we can show that they are a variation on much older visualization techniques.</p>
<p>If nothing really new is ever invented, why do we observe so much <a href="http://en.wikipedia.org/wiki/Innovation">innovation</a>? As I watch the 1993-1994 <a href="http://en.wikipedia.org/wiki/X_files">X-Files</a> (season 1), I am amazed at how backward these FBI agents appear:</p>
<ul>
<li>They do not carry a cell phone and must run back to a land line to get help. (Update: toward the end of the season, we learn that Scully has a cell phone.)</li>
<li>Even though Agent Scully uses a DOS word processor (probably WordPerfect), all their archives are on paper or microfilm.</li>
<li>While Scully has a modem, it is not clear that she uses any networked application. Agent Mulder does not use a computer? I found no sign of the Web.</li>
<li>Files are not shared. For example, Mulder has his own files (the X-files) and apparently, others must come to him to have (paper) copies.</li>
</ul>
<p>The key? Progress is incremental. Human beings are gregarious for a reason: our innovation process is social. Innovation occurs when new ideas are put into action by society.</p>
<p>For researchers wanting to create innovation, there are several implications:</p>
<ul>
<li>You can only be an effective researcher if you are an effective communicator.</li>
<li>You should be connected as much as possible to the rest of society. It is not sufficient to impress the ivory tower: you must reach out.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/27/we-never-invent-anything-new-yet-progress-is-made/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F27%2Fwe-never-invent-anything-new-yet-progress-is-made%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/27/we-never-invent-anything-new-yet-progress-is-made/</feedburner:origLink></item>
		<item>
		<title>My (short) activity report for 2008</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/495172969/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/25/my-short-activity-report-for-2008/#comments</comments>
		<pubDate>Thu, 25 Dec 2008 23:59:01 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1681</guid>
		<description>I heard on radio today that the Christmas break should be used to review the past year, and decide where you want to go. Good idea!
What did I do?

I published the Lemur Bitmap Index C++ Library. 

I published lbimproved, a C++ library for fast nearest-neighbor retrieval under dynamic time warping.
Owen presented our paper Histogram-Aware [...]</description>
			<content:encoded><![CDATA[<p>I heard on the radio today that the Christmas break should be used to review the past year and decide where you want to go. Good idea!</p>
<p>What did I do?</p>
<ul>
<li>I published the <em><a href="http://code.google.com/p/lemurbitmapindex/">Lemur Bitmap Index C++ Library</a></em>.</li>
<li>I published <a href="http://code.google.com/p/lbimproved/">lbimproved</a>, a C++ library for fast nearest-neighbor retrieval under dynamic time warping.</li>
<li>Owen presented our paper <em>Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes</em> at DOLAP 2008 (<a href="http://arxiv.org/abs/0808.2083">arXiv:0808.2083</a>).</li>
<li>I presented our paper <em>Tri de la table de faits et compression des index bitmaps avec alignement sur les mots</em> (French for <em>Fact Table Sorting and  Word-Aligned Compression for Bitmap Indexes</em>) at BDA&#8217;08 (<a href="http://arxiv.org/abs/0805.3339">arXiv:0805.3339</a>). It was my first time in ten years publishing in French.</li>
<li>Hazel presented <em>Pruning Attributes From Data Cubes with Diamond Dicing</em> at IDEAS&#8217;08 (<a href="http://arxiv.org/abs/0805.0747">arXiv:0805.0747</a>).</li>
<li>Our paper <em>Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays</em> finally appeared in ACM Transactions on Algorithms. It is available from arXiv: <a href="http://arxiv.org/abs/cs.DS/0610128">cs.DS/0610128</a>. I published the <a href="http://code.google.com/p/hierarchicalbinbuffering/">C++ source code</a> some years ago.</li>
<li>Kamel presented our paper <em>Collaborative OLAP with Tag Clouds: Web 2.0 OLAP Formalism and  Experimental Evaluation</em> at WEBIST 2008 (<a href="http://arxiv.org/abs/0710.2156">arXiv:0710.2156</a>).</li>
<li>I published my first online graduate course (<a href="http://benhur.teluq.uqam.ca/SPIP/inf6104/">INF 6104</a>) on Information Retrieval.</li>
<li>I taught Data Management in XML (INF 6450) and an undergraduate course on Information Retrieval (INF 6460).</li>
<li>I published hundreds of blog posts, and some of my posts have been relatively popular.</li>
</ul>
<p>Naturally, there was more to 2008, but you get the idea.</p>
<p>What are my plans for 2009?</p>
<ul>
<li>I will keep experimenting with different types of blogging. My goal is to get better at <strong>connecting my main research activities and my blogging</strong>.</li>
<li>I will continue to work on bitmap indexes. I will also branch out from bitmap indexes to columnar databases in general. Expect a few research papers. Expect my <em><a href="http://code.google.com/p/lemurbitmapindex/">Lemur Bitmap Index C++ Library</a></em> to grow beyond bitmap indexes.</li>
<li>I will pursue my interest in collaborative data analysis.</li>
<li>I will publish an online course on Data Warehousing.</li>
</ul>
<p>Again, there are a few more projects, but these are important goals.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/25/my-short-activity-report-for-2008/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F25%2Fmy-short-activity-report-for-2008%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/25/my-short-activity-report-for-2008/</feedburner:origLink></item>
		<item>
		<title>My low-tech research tools</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/494925144/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/25/my-low-tech-research-tools/#comments</comments>
		<pubDate>Thu, 25 Dec 2008 16:14:53 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1676</guid>
		<description>I carry a pocketbook and a pen everywhere. At night, my pocketbook is by my bed. All creative workers should carry notebooks.
Organizing and collecting ideas are different tasks. My pocketbook is strictly for collection. Every few days,  I start a new page: a list of reminders on one side, and diagrams on the other side. [...]</description>
			<content:encoded><![CDATA[<p>I carry a pocketbook and a pen everywhere. At night, my pocketbook is by my bed. All creative workers should carry notebooks.</p>
<p>Organizing and collecting ideas are different tasks. My pocketbook is strictly for collection. Every few days,  I start a new page: a list of reminders on one side, and diagrams on the other side. Important ideas get processed and stored on my laptop. I throw away used pocketbooks.</p>
<p>It is difficult to find quality pocketbooks. Here are my recommendations:</p>
<ul>
<li>My pocketbook must last a few months. Paper must be thick and of good quality. I prefer unlined paper. I need a ribbon marker to quickly find the current page. <a href="http://www.paperblanks.com/whats_new/pocket_companions.htm">Paperblanks</a> makes some good and inexpensive <em>Pocket Companions</em> that fit the bill. Alas, Paperblanks does not sell directly to customers. The retail info on the Paperblanks <a href="http://www.paperblanks.com/retail_info.htm">web site</a> is helpful.</li>
<li>I prefer black gel pens. Refillable pens create less garbage and they are often of better quality. For about a year, I have had good luck with my <a href="http://www.zebrapen.com/gel-g301.html">Zebra gel pens</a>.</li>
</ul>
<p><strong>Note</strong>: I do not profit in any way if you buy a Paperblanks pocketbook or Zebra pen.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/25/my-low-tech-research-tools/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F25%2Fmy-low-tech-research-tools%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/25/my-low-tech-research-tools/</feedburner:origLink></item>
		<item>
		<title>Where do presidents and prime ministers go to school?</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/493178470/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/23/where-do-presidents-and-prime-ministers-go-to-school/#comments</comments>
		<pubDate>Tue, 23 Dec 2008 14:21:26 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1671</guid>
		<description>In his most recent essay, After the credentials, Paul Graham tells us that in South Korea, &amp;#8220;college entrance exams determine 70 to 80 percent of a person&amp;#8217;s future.&amp;#8221; Fortunately, the Americans know better: &amp;#8220;Where you go to college still matters, but not like it used to.&amp;#8221;
Paul writes good essays, but they are [...]</description>
			<content:encoded><![CDATA[<p>In his most recent essay, <a href="http://www.paulgraham.com/credentials.html">After the credentials</a>, Paul Graham tells us that in South Korea, &#8220;college entrance exams determine 70 to 80 percent of a person&#8217;s future.&#8221; Fortunately, the Americans know better: &#8220;Where you go to college still matters, but not like it used to.&#8221;</p>
<p>Paul writes good essays, but they are thin on research. How good a predictor of your success is your alma mater? The research is available. For example, in <a href="http://www.wjh.harvard.edu/%7Ecwinship/cfa_papers/elite_jbrand_rev803.pdf">Regression and Matching Estimates of the Effects of Elite College Attendance on Career Outcomes</a>, Brand and Halaby write:</p>
<blockquote><p>Our results suggest that in terms of college quality, there is not only no direct effect on mid- and late-career attainment, but no significant effect at all.  This study questions the consequential belief that an elite college education necessarily translates into privileged socioeconomic status throughout the life course.</p></blockquote>
<p>To sum it up: <strong>If you are a privileged kid, you will do well even if you go to a local college</strong>.</p>
<p>Because my research budget for this blog is $0, I will do my own survey about a special job: the presidency in the USA and the office of prime minister in Canada. Do state leaders attend a small set of colleges?</p>
<p>Let us review where the American presidents got their first degree:</p>
<ul>
<li>John F. Kennedy earned his degree from <a href="http://en.wikipedia.org/wiki/Harvard_University">Harvard University</a>.</li>
<li>Lyndon B. Johnson earned his degree from <a href="http://en.wikipedia.org/wiki/Texas_State_University-San_Marcos">Southwest Texas State Teachers&#8217; College</a>.</li>
<li>Richard Nixon earned his degree from <a title="Whittier College" href="http://en.wikipedia.org/wiki/Whittier_College">Whittier College</a>.</li>
<li>Gerald R. Ford earned his degree from the <a title="University of Michigan" href="http://en.wikipedia.org/wiki/University_of_Michigan">University of Michigan</a>.</li>
<li>Jimmy Carter earned his degree from the  <a title="United States Naval Academy" href="http://en.wikipedia.org/wiki/United_States_Naval_Academy">United States Naval Academy</a>.</li>
<li>Ronald Reagan earned his degree from <a title="Eureka College" href="http://en.wikipedia.org/wiki/Eureka_College">Eureka College</a>.</li>
<li>George H. W. Bush earned his degree from  <a title="Yale University" href="http://en.wikipedia.org/wiki/Yale_University">Yale University</a>.</li>
<li>Bill Clinton earned his degree from <a title="Georgetown University" href="http://en.wikipedia.org/wiki/Georgetown_University">Georgetown University</a>.</li>
<li>George W. Bush earned his degree from <a title="Yale University" href="http://en.wikipedia.org/wiki/Yale_University">Yale University</a>.</li>
<li>Barack Obama earned his degree from <a title="Columbia University" href="http://en.wikipedia.org/wiki/Columbia_University">Columbia University</a>.</li>
</ul>
<p>What about Canadian prime ministers?</p>
<ul>
<li>Pierre Elliott Trudeau earned his degree from <a title="Université de Montréal" href="http://en.wikipedia.org/wiki/Universit%C3%A9_de_Montr%C3%A9al">Université de Montréal</a>.</li>
<li>Joe Clark earned his degree from the <a title="University of Alberta" href="http://en.wikipedia.org/wiki/University_of_Alberta">University of Alberta</a>.</li>
<li>John Turner earned his degree from the <a href="http://en.wikipedia.org/wiki/University_of_British_Columbia">University of British Columbia</a>.</li>
<li>Brian Mulroney earned his degree from <a href="http://en.wikipedia.org/wiki/St._Francis_Xavier_University">St. Francis Xavier University</a>.</li>
<li>Kim Campbell earned her degree from the <a title="University of British Columbia" href="http://en.wikipedia.org/wiki/University_of_British_Columbia">University of British Columbia</a>.</li>
<li>Jean Chrétien earned his degree from  <a title="Université Laval" href="http://en.wikipedia.org/wiki/Universit%C3%A9_Laval">Université Laval</a>.</li>
<li>Paul Martin earned his degree from the <a title="University of Toronto" href="http://en.wikipedia.org/wiki/University_of_Toronto">University of Toronto</a>.</li>
<li>Stephen Harper earned his degree from the <a title="University of Calgary" href="http://en.wikipedia.org/wiki/University_of_Calgary">University of Calgary</a>.</li>
</ul>
<p>Based on this evidence alone, if I were to coach a kid for a political career, I would ignore where he gets his degree. This makes sense. You  become president or prime minister several years after earning your degree. By the time you have the experience required for the job, any college premium is gone.</p>
<p>See also my post <a href="http://www.daniel-lemire.com/blog/archives/2008/03/13/the-2-myths-that-gets-students-into-heavy-league-schools/">The 2 myths getting students into heavy-league schools</a>.</p>
<p><strong>Disclaimer</strong>: I am a graduate of the <a title="University of Toronto" href="http://en.wikipedia.org/wiki/University_of_Toronto">University of Toronto</a>, maybe the most prestigious university in Canada.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/23/where-do-presidents-and-prime-ministers-go-to-school/feed/</wfw:commentRss>
		<media:content url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/493178472/elite_jbrand_rev803.pdf" fileSize="942487" type="application/pdf" /><itunes:explicit>no</itunes:explicit><itunes:summary>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</itunes:summary><itunes:keywords>Academia/Research, Science and Technology</itunes:keywords><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F23%2Fwhere-do-presidents-and-prime-ministers-go-to-school%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/23/where-do-presidents-and-prime-ministers-go-to-school/</feedburner:origLink><enclosure url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/493178472/elite_jbrand_rev803.pdf" length="942487" type="application/pdf" /><feedburner:origEnclosureLink>http://www.wjh.harvard.edu/%7Ecwinship/cfa_papers/elite_jbrand_rev803.pdf</feedburner:origEnclosureLink></item>
		<item>
		<title>Parsing CSV files is CPU bound: a C++ test case (Update 2)</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/490294108/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-2/#comments</comments>
		<pubDate>Sat, 20 Dec 2008 04:56:04 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1657</guid>
		<description>I am continuing my fun saga to determine whether parsing CSV files is CPU bound or I/O bound. Recall that I posted some C++ code and reported that it took 96 seconds of process time to parse a given 2GB CSV file and just 27 seconds to read the lines without parsing. Preston L. Bannister [...]</description>
			<content:encoded><![CDATA[<p>I am continuing my fun saga to determine whether parsing <a href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV</a> files is CPU bound or I/O bound. Recall that <a href="http://www.daniel-lemire.com/parsecsv/parsecsv.zip">I posted some C++</a> code and reported that it took 96 seconds of process time to parse a given 2GB CSV file and just 27 seconds to read the lines without parsing. Preston L. Bannister correctly pointed out that using the clock() function is wrong: on Unix-like systems it reports processor time, not elapsed time, so time spent waiting on the disk goes uncounted. So I updated my code to use his ZTimer class instead. The new numbers are 103 seconds for the full parsing and 57 seconds to just read the lines.</p>
<p>An anonymous reader <a href="http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-1/#comment-50359">claimed</a> that my code was still grossly inefficient. I do not like arguing without evidence.</p>
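<p>To see why the choice of timer matters, here is a minimal Python sketch (not the C++ code from this post, and not the ZTimer class). It measures the same task with an elapsed-time clock and with a processor-time clock. The function names <code>measure</code> and <code>wait_like_io</code> are made up for this illustration, and the sleep is only a stand-in for a blocking disk read.</p>

```python
import time

def measure(task):
    """Run task() and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    cpu_start = time.process_time()    # processor time used by this process
    task()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

def wait_like_io():
    # Stand-in for a blocking read: the process waits without using the CPU.
    time.sleep(0.2)

if __name__ == "__main__":
    wall, cpu = measure(wait_like_io)
    print("elapsed: %.2f s, processor: %.2f s" % (wall, cpu))
```

<p>A processor-time clock barely moves while the process waits, so a benchmark built on it can make an I/O-heavy pass look much cheaper than it really is.</p>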
<p>Ah! But Unix utilities can also parse CSV files. They are usually efficient.  Let us use the cut command:</p>
<p><code><br />
$ time cut -f 1,2,3,4 -d , ./netflix.csv &gt; /dev/null<br />
real    1m59.596s<br />
user    1m53.163s<br />
sys     0m3.775s<br />
</code></p>
<p>So, 120 seconds?</p>
<p>What about sorting the CSV file? Of course, it is a lot more expensive: 504 seconds.<br />
<code><br />
$ time sort -t, ./netflix.csv &gt; /dev/null<br />
real    8m23.985s<br />
user    2m28.855s<br />
sys     1m1.467s<br />
</code></p>
<p>Finally, for a basis of comparison, let us just dump the file to /dev/null:</p>
<p><code> $ time cat ./netflix.csv &gt; /dev/null<br />
real    0m29.337s<br />
user    0m0.029s<br />
sys     0m2.541s<br />
</code></p>
<p>The final story:</p>
<table border="0">
<tbody>
<tr style="background:#ccc">
<td>parsing method</td>
<td>time elapsed</td>
</tr>
<tr style="background:#ddd">
<td>cat Unix command</td>
<td>29 s</td>
</tr>
<tr style="background:#ddd">
<td>Daniel&#8217;s line parser</td>
<td>57 s</td>
</tr>
<tr style="background:#ddd">
<td>Daniel&#8217;s CSV parser</td>
<td>103 s</td>
</tr>
<tr style="background:#ddd">
<td>cut Unix command</td>
<td>120 s</td>
</tr>
<tr style="background:#ddd">
<td>sort Unix command</td>
<td>504 s</td>
</tr>
</tbody>
</table>
<p><strong>Analysis</strong>: My <a href="http://www.daniel-lemire.com/parsecsv/parsecsv.zip">C++ code</a> is not grossly inefficient. The I/O cost of reading the file is about 30 seconds, whereas reading and parsing it takes about 100 seconds. My preliminary conclusion is that parsing CSV files is more CPU bound than I/O bound.</p>
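<p>The arithmetic behind this analysis can be spelled out in a few lines of Python. This is only a sketch: it assumes that the cat time is a fair estimate of the pure I/O cost and that I/O and CPU work do not overlap, neither of which I measured. The names <code>ELAPSED</code>, <code>beyond_io</code> and <code>slowdown</code> are made up for this illustration.</p>

```python
# Elapsed times in seconds, copied from the table above.
ELAPSED = {
    "cat": 29,
    "line parser": 57,
    "CSV parser": 103,
    "cut": 120,
    "sort": 504,
}

def beyond_io(method, baseline="cat"):
    """Seconds spent beyond streaming the file, taking the baseline as pure I/O (an assumption)."""
    return ELAPSED[method] - ELAPSED[baseline]

def slowdown(method, baseline="cat"):
    """How many times slower the method is than the baseline."""
    return ELAPSED[method] / ELAPSED[baseline]

if __name__ == "__main__":
    for method in ("line parser", "CSV parser", "cut"):
        print("%s: %d s beyond I/O, %.1f times the cat time"
              % (method, beyond_io(method), slowdown(method)))
```

<p>Under these assumptions, the CSV parser spends roughly 74 of its 103 seconds on something other than moving bytes off the disk.</p>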
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-2/feed/</wfw:commentRss>
		<media:content url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/489124688/parsecsv.zip" fileSize="1614854" type="application/zip" /><itunes:explicit>no</itunes:explicit><itunes:summary>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</itunes:summary><itunes:keywords>Science and Technology</itunes:keywords><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F19%2Fparsing-csv-files-is-cpu-bound-a-c-test-case-update-2%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-2/</feedburner:origLink><enclosure url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/489124688/parsecsv.zip" length="1614854" type="application/zip" /><feedburner:origEnclosureLink>http://www.daniel-lemire.com/parsecsv/parsecsv.zip</feedburner:origEnclosureLink></item>
		<item>
		<title>Parsing CSV files is CPU bound: a C++ test case (Update 1)</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/489353356/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-1/#comments</comments>
		<pubDate>Fri, 19 Dec 2008 05:23:45 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1651</guid>
		<description>(See update 2.)
In a recent blog post, I said that parsing simple CSV files could be CPU bound. By parsing, I mean reading the data on disk and copying it into an array. I also strip the field values of spurious white space.
You can find my C++ code on my server.
A reader criticized my implementation [...]</description>
			<content:encoded><![CDATA[<p>(See <a href="http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-2/">update 2</a>.)</p>
<p>In a recent <a href="http://www.daniel-lemire.com/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/">blog post</a>, I said that parsing simple <a href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV</a> files could be CPU bound. By parsing, I mean reading the data on disk and copying it into an array. I also strip the field values of spurious white space.</p>
<p>You can find <a href="http://www.daniel-lemire.com/parsecsv/parsecsv.zip">my C++ code</a> on my server.</p>
<p>A reader criticized my implementation as follows:</p>
<ul>
<li>I use the <a href="http://www.cplusplus.com/reference/string/getline.html">C++ getline function</a> to read the lines. The reader commented that &#8220;getline does one heap allocation and copy for every line.&#8221; I doubt that getline performs a heap allocation each time it is called: I reuse the same string object for every call.</li>
<li>For each field value, I did two heap allocations and two copies. I now reuse the same string objects for fields, thus limiting the number of heap allocations.</li>
<li>The reader commented that I should use a <em>custom allocator</em> to avoid heap allocations. Currently, if the CSV file has <em>x</em> fields, I use <em>x</em>+1 string objects (a tiny number) and a small constant number of heap allocations.</li>
</ul>
<p>Despite these changes, I still get that parsing CSV files is strongly CPU bound:</p>
<p><code><br />
$ ./parsecsv ./netflix.csv<br />
without parsing: 26.55<br />
with parsing: 95.99<br />
</code></p>
<p>However, doing away with the heap allocations at every line did reduce the parsing running time by a factor of two. It is not difficult to believe I could close the gap. But I still see no evidence that <em>parsing CSV files is strongly I/O bound</em> as some of my readers have stated. Consider that in real applications, I would need to convert field values to dates or to numerical values. I might also need to filter values, or support fancier CSV formats.</p>
<p>My experiments are motivated by a <a href="http://www.ibridge.be/?p=150">post by Matt Casters</a>. <a href="http://www.daniel-lemire.com/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/">Some said</a> that Java was guilty. I use C++ and I get a similar result. So far at least. Can you tell me where I went wrong?</p>
<p><strong>Disclaimer</strong>: Yet again, I do not claim that my code is nearly optimal. My exact claim is that reading CSV files may be CPU bound using reasonable code. I find this very surprising.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-1/feed/</wfw:commentRss>
		<media:content url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/489124688/parsecsv.zip" fileSize="1614854" type="application/zip" /><itunes:explicit>no</itunes:explicit><itunes:summary>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</itunes:summary><itunes:keywords>Science and Technology</itunes:keywords><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F19%2Fparsing-csv-files-is-cpu-bound-a-c-test-case-update-1%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-1/</feedburner:origLink><enclosure url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/489124688/parsecsv.zip" length="1614854" type="application/zip" /><feedburner:origEnclosureLink>http://www.daniel-lemire.com/parsecsv/parsecsv.zip</feedburner:origEnclosureLink></item>
		<item>
		<title>Fast argmax in Python</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/489124685/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/17/fast-argmax-in-python/#comments</comments>
		<pubDate>Wed, 17 Dec 2008 13:48:13 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Software design]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1642</guid>
		<description>In my post Computing argmax fast in Python, I reported that Python has no builtin function to compute argmax, the position of a maximal value. I provided one such function and asked people to improve my solution. Here are the results:



argmax function: running time
array.index(max(array)): 0.1 s
max(izip(array, xrange(len(array))))[1]: 0.2 s
max(izip(array, xrange(len(array))))[1]: 0.5 s

Conclusion: array.index(max(array)) is simpler and faster.</description>
			<content:encoded><![CDATA[<p>In my post <a href="http://www.daniel-lemire.com/blog/archives/2004/11/25/computing-argmax-fast-in-python/">Computing argmax fast in Python</a>, I reported that Python has no builtin function to compute argmax, the position of a maximal value. I provided one such function and asked people to improve my solution. Here are the results:</p>
<table border="0">
<tbody>
<tr style="background:#ccc">
<td>argmax function</td>
<td>running time</td>
</tr>
<tr style="background:#eee">
<td>array.index(max(array))</td>
<td>0.1 s</td>
</tr>
<tr style="background:#eee">
<td>max(izip(array, xrange(len(array))))[1]</td>
<td>0.2 s</td>
</tr>
<tr style="background:#eee">
<td>max(izip(array, xrange(len(array))))[1]</td>
<td>0.5 s</td>
</tr>
</tbody>
</table>
<p><strong>Conclusion</strong>: array.index(max(array)) is simpler and faster.</p>
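<p>For readers who want to try this themselves, here is a small sketch of two of the approaches. It is written for Python 3, where <code>zip</code> and <code>range</code> stand in for <code>izip</code> and <code>xrange</code>, so the timings will not match the table. The function names and the test data are made up for this illustration.</p>

```python
import timeit

def argmax_index(array):
    # Two passes: find the maximum, then find its first position.
    return array.index(max(array))

def argmax_zip(array):
    # One pass over (value, position) pairs; the largest pair wins.
    return max(zip(array, range(len(array))))[1]

if __name__ == "__main__":
    data = [(i * 7919) % 100003 for i in range(100000)]
    for func in (argmax_index, argmax_zip):
        seconds = timeit.timeit(lambda: func(data), number=10)
        print("%s: position %d, %.3f s" % (func.__name__, func(data), seconds))
```

<p>The two functions differ when the maximum occurs more than once: the first returns the first position of the maximum, whereas the pair-based version returns the last one, because pairs with equal values are ordered by position.</p>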
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/17/fast-argmax-in-python/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F17%2Ffast-argmax-in-python%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/17/fast-argmax-in-python/</feedburner:origLink></item>
		<item>
		<title>The Synthese Recommender System</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/489124686/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/16/the-synthese-recommender-system/#comments</comments>
		<pubDate>Tue, 16 Dec 2008 15:27:48 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category><![CDATA[Academia/Research]]></category>

		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1636</guid>
		<description>Andre Vellino has just opened his  Synthese Recommender System: a recommender for journal articles. Andre works for one of the largest scientific libraries in the world (CISTI). You can read all about his project on his blog.</description>
			<content:encoded><![CDATA[<p><a href="http://web.ncf.ca/andre/">Andre Vellino</a> has just opened his  <a href="http://lab.cisti-icist.nrc-cnrc.gc.ca/synthese/welcome.jsp">Synthese Recommender System</a>: a recommender for journal articles. Andre works for one of the largest scientific libraries in the world (<a href="http://en.wikipedia.org/wiki/CISTI">CISTI</a>). You can read all about <a href="http://synthese.wordpress.com/2008/12/16/synthese-recommender/">his project on his blog</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/16/the-synthese-recommender-system/feed/</wfw:commentRss>
		<feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F16%2Fthe-synthese-recommender-system%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/16/the-synthese-recommender-system/</feedburner:origLink></item>
		<item>
		<title>Parsing CSV files is CPU bound: a C++ test case</title>
		<link>http://feeds.feedburner.com/~r/daniel-lemire/atom/~3/489124687/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/#comments</comments>
		<pubDate>Tue, 16 Dec 2008 15:12:43 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
		
		<category />

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1634</guid>
		<description>(These results were updated.)
In Parsing text files is CPU bound, I claimed that I had a C++ test case proving that parsing CSV files could be CPU bound. By CPU bound, I mean that the overhead of taking each line, finding out where the commas are, and storing the copies of the fields into an [...]</description>
			<content:encoded><![CDATA[<p>(<a href="http://www.daniel-lemire.com/blog/archives/2008/12/19/parsing-csv-files-is-cpu-bound-a-c-test-case-update-2/">These results were updated</a>.)</p>
<p>In <a href="http://www.daniel-lemire.com/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/">Parsing text files is CPU bound</a>, I claimed that I had a C++ test case proving that parsing <a href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV</a> files could be <a href="http://en.wikipedia.org/wiki/CPU_bound">CPU bound</a>. By CPU bound, I mean that the overhead of taking each line, finding out where the commas are, and storing the copies of the fields into an array, dominates the running time.</p>
<p>How do I test this theory? I read the file twice. The first time, I just read each line and report the time elapsed. The second time, I read each line, process it, and report the time elapsed. If the two times are similar, the process is I/O bound; if the second time is much larger, the process is CPU bound.</p>
<p>I get this result on a 2 GB file (numbers updated on Dec. 19, 2008):</p>
<p><code><br />
$ ./parsecsv ./netflix.csv<br />
without parsing: 26.55<br />
with parsing: 95.99<br />
</code></p>
<p>Hence, parsing dominates the running time. At least in this case. At least with my C++ code.</p>
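<p>The two-pass method is easy to sketch in Python. This is not my C++ test case, only an illustration of the idea under simple assumptions: fields are separated by plain commas with no quoting, and a small generated file stands in for the 2 GB one. All function names are made up for this sketch.</p>

```python
import os
import tempfile
import time

def parse_line(line):
    # Split at the commas and strip spurious white space from each field.
    return [field.strip() for field in line.split(",")]

def read_only(path):
    # Pass 1: read every line and do nothing with it.
    count = 0
    with open(path) as f:
        for line in f:
            count += 1
    return count

def read_and_parse(path):
    # Pass 2: read every line and copy its fields into a list.
    count = 0
    with open(path) as f:
        for line in f:
            parse_line(line)
            count += 1
    return count

def timed(func, path):
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
        for i in range(200000):
            tmp.write("%d, %d, %d, 2005-09-06\n" % (i, i % 17770, i % 5 + 1))
    try:
        for func in (read_only, read_and_parse):
            lines, seconds = timed(func, tmp.name)
            print("%s: %d lines in %.2f s" % (func.__name__, lines, seconds))
    finally:
        os.remove(tmp.name)
```

<p>A file this small will probably be served from the operating system cache, so both passes mostly measure CPU work. A meaningful comparison needs a file that is large compared with the available memory.</p>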
<p>Before you start arguing with me, please go download <a href="http://www.daniel-lemire.com/parsecsv/parsecsv.zip">my reproducible test case</a>. All you need is the GNU GCC compiler. I tested on two machines, with two different versions of GCC.</p>
<p><strong>Disclaimer</strong>: I do not claim that this is professional benchmarking. I do not claim that writing software where CSV parsing is strongly I/O bound is impossible, or even difficult.</p>
<p><strong>Reference</strong>: This quest started out from a <a href="http://www.ibridge.be/?p=150">post by Matt Casters</a> where he reported that you could parse a CSV file faster using two CPU cores instead of just one.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/feed/</wfw:commentRss>
		<media:content url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/489124688/parsecsv.zip" fileSize="1614854" type="application/zip" /><itunes:explicit>no</itunes:explicit><itunes:summary>Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</itunes:summary><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetItemData?uri=daniel-lemire/atom&amp;itemurl=http%3A%2F%2Fwww.daniel-lemire.com%2Fblog%2Farchives%2F2008%2F12%2F16%2Fparsing-csv-files-is-cpu-bound-a-c-test-case%2F</feedburner:awareness><feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/</feedburner:origLink><enclosure url="http://feeds.feedburner.com/~r/daniel-lemire/atom/~5/489124688/parsecsv.zip" length="1614854" type="application/zip" /><feedburner:origEnclosureLink>http://www.daniel-lemire.com/parsecsv/parsecsv.zip</feedburner:origEnclosureLink></item>
	<media:credit role="author"></media:credit><media:rating>nonadult</media:rating><media:description type="plain">Daniel Lemire's blog is about life in academia, research in Computer Science, wondering how we can reconcile fast databases and algorithms with the informal and asemantic nature of the world around us. It is broadcasted from Montreal (Canada).</media:description><feedburner:awareness>http://api.feedburner.com/awareness/1.0/GetFeedData?uri=daniel-lemire/atom</feedburner:awareness></channel>
</rss>
