Where are the academic podcasts?

This blog is also a podcast. Few people notice. I have not posted any audio in months. (If you have never listened to me: I am better at writing English than at speaking it.)

Where are the good academic/research podcasts? What I found so far fitted in these categories:

  • Promotional material for schools or research centers.
  • Capture of audio/video events (such as lectures).

I have never liked lectures. I was the kind of student to never pay any attention in class. Some live talks are good, but most are not.

Podcasting is different however because it can be edited afterhand. You can cut the parts where you are rambling. You can redo part of the podcast if the first take was not good.

But where are the researchers producing good podcasts?

What makes recommender systems work?

Why can we predict tastes? There are several possible explanations:

  • Intrinsically, individuals have predictable tastes. To test this theory, we would need to isolate each individual. Collect their opinions. Then attempt to make predictions. (You also need to prevent the recommender system from giving feedback to the user!)
  • People are influenced by the perceived popularity of items. People tend to like a movie more when they think it is a blockbuster. A  more general statement is: Preferences are constructed in the process of elicitation.

Jon Dron linked to a 2007 New York Times article by Duncan J. Watts which claims that the second factor can easily dominate. The article is based on experiments. However, I have several arguments to support this claim.  Highly cited research papers are often no more interesting than regular papers. Except that you must also cite them, otherwise people will complain that you are missing a reference. The same is true with music or books. I often read novels based because many people read them: that is how, for example, I found la compagnie des glaces or Dune.

Hence,  collaborative filtering is a circular process:

  • People like certain items (mostly) because they perceive that similar people like them.
  • Collaborative filtering matches people with such items.

Of course, that is a bit pessimistic: people are also influenced by the intrinsic values. However, how can you differentiate the social perception from the intrinsic value of an item? Is this novel entertaining, or is it popular? Outside a laboratory, we cannot tell these factors apart!

How well would collaborative filtering work if individuals were isolated? It would work poorly. There are millions of books and research articles. Many are quite good. Without a social component to push us in some directions, we would have diverse tastes.

Grabbing attention or building a reputation?

Daniel Tunkelang has been writing on the attention economy (here and here for example): everyone is fighting to have your attention, and you only have so much to offer.

Attention is easy to measure:

  • You can record the number of people subscribing to your blog.
  • You can count the number of people citing your research papers.
  • You can point to your number of followers on Twitter or your number of friends on Facebook.

However, I do not blog or write research papers merely to grab attention. Instead, I seek to increase my reputation. While attention fluctuates depending on your current actions, reputation builds up over time based on your reliability, your honesty, and your transparency. To build a good reputation, you do not need to do anything extraordinary: you just need to be consistent over a long time.

Of course, you need to get some attention if you are building a reputation. However, on the long run, the saying build it and they will come, is true. Being present and doing good work is enough. You do not need flashy presentations. Remain lean and mean. Avoid high maintenance operations. Do good quality work.

We never invent anything new, yet progress is made!

Practical innovation explains how per-capita wealth increased eightfold during the last century. Yet, we are constantly reminded that we never invent anything new:

  • Most movies are remake or variations on older movies.
  • Most research papers are variation on a theme.
  • Most products and services are variations on existing products and services.

Even if I invent something recognized as drastically novel, I am sure someone will say “oh! but we did that 20 years ago.” A recent example are the tag clouds which have no equivalent in my pre-Web textbooks. Yet, I am sure we can show that they are variation on much older visualization techniques.

If nothing really new is ever invented, why do we observe so much innovation? As I watch the 1993-1993 X-Files (season 1), I am amazed at how backward these FBI agents appear:

  • They do not carry a cell phone and must run back to a land line to get help. (Update: toward the end of the season, we learn that Scully has a cell phone.)
  • Even though agent Scully uses a DOS word processor (probably Word Perfect), all their archives are on paper or microfilms.
  • While Scully has a modem, it is not clear that she uses any networked application. Agent Mulder does not use a computer? I found no sign of the Web.
  • Files are not shared. For example, Mulder has his own files (the X-files) and apparently, others must come to him to have (paper) copies.

The key? Progress is incremental. Human beings are gregarious for a reason: our innovation process is social. Innovation occurs when new ideas are put into action by society.

For researchers wanting to create innovation, there are several implications:

  • You can only be an effective researcher if you are an effective communicator.
  • You should be connected as much as possible to the rest of society. It is not sufficient to impress the ivory tower: you must reach out.

My (short) activity report for 2008

I heard on radio today that the Christmas break should be used to review the past year, and decide where you want to go. Good idea!

What did I do?

  • I published the Lemur Bitmap Index C++ Library.
  • I published lbimproved, a C++ library for Fast Nearest-Neighbor Retrieval under the Dynamic Time Warping.
  • Owen presented our paper Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes at DOLAP 2008 (arXiv:0808.2083).
  • I presented our paper Tri de la table de faits et compression des index bitmaps avec alignement sur les mots (French for Fact Table Sorting and Word-Aligned Compression for Bitmap Indexes) at BDA’08 (arXiv:0805.3339). It was my first time in ten years publishing in French.
  • Hazel presetned Pruning Attributes From Data Cubes with Diamond Dicing at IDEAS’08 (arXiv:0805.0747).
  • Our paper Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays finally appeared in ACM Transactions on Algorithms. It is available from arxiv: cs.DS/0610128. I published some years ago the C++ source code.
  • Kamel presented our paper Collaborative OLAP with Tag Clouds: Web 2.0 OLAP Formalism and Experimental Evaluation at WEBIST 2008 (arXiv:0710.2156).
  • I published my first online graduate course online (INF 6104) on Information Retrieval.
  • I taught Data Management in XML (INF 6450) and an undergraduate course on Information Retrieval (INF 6460).
  • I published hundreds of blog posts, and some of my post have been relatively popular.

Naturally, there was more to 2008, but you get the idea.

What are my plans for 2009?

  • I will keep experimenting with different types of blogging. My goal is to get better at connecting my main research activities and my blogging.
  • I will continue to work on bitmap indexes. I will also branch out from bitmap indexes to columnar databases in general. Expect a few research papers.  Expect my Lemur Bitmap Index C++ Library, to grow beyond bitmap indexes.
  • I will pursue my interest in collaborative data analysis.
  • I will publish an online course on Data Warehousing.

Again, there are a few more projects, but these are important goals.

My low-tech research tools

I carry a pocketbook and a pen everywhere. At night, my pocketbook is by my bed. All creative workers should carry notebooks.

Organizing and collecting ideas are different tasks. My pocketbook is strictly for collection. Every few days,  I start a new page: a list of reminders on one side, and diagrams on the other side. Important ideas get processed and stored on my laptop. I throw away used pocketbooks.

It is difficult to find quality pocketbooks. Here are my recommendations:

  • My pocketbook must last a few months. Paper must be thick and of good quality. I prefer unlined paper. I need a ribbon marker to quickly find the current page. Paperblanks make some good and inexpensive Pocket Companions fitting the bill. Alas, Paperblanks does not sell directly to customers. The retail info on the Paperblanks’ web site is helpful.
  • I prefer black gel pens.  Rechargeable pens create less garbage and they are often of better quality. For about a year, I have had good luck with my Zebra gel pens.

Note: I do not profit in any way if you buy a Paperblanks pocketbook or Zebra pen.

Where do presidents and prime ministers go to school?

In his most recent essay, After the credentials, Paul Graham tells us that in South Korea where “college entrance exams determine 70 to 80 percent of a person’s future.” Fortunately, the Americans know better: “Where you go to college still matters, but not like it used to.”

Paul writes good essays, but they are thin on research. How much is your alma matter a predictor of your success? The research is available. For example, in Regression and Matching Estimates of the Effects of Elite College Attendance on Career Outcomes, Brand and Halaby write:

Our results suggest that in terms of college quality, there is not only no direct effect on mid- and late-career attainment, but no significant effect at all.  This study questions the consequential belief that an elite college education necessarily translates into privileged socioeconomic status throughout the life course.

To sum it up: If you are a privileged kid, you will do well even if you go to a local college.

Because my research budget for this blog is $0, I will do my own survey about a special job: the presidency in the USA and the office of prime minister in Canada. Do state leaders attend a small set of colleges?

Let us review where the American presidents got their first degree:

What about Canadian prime ministers?

Based on this evidence alone, if I were to coach a kid for a political career, I would ignore where he gets his degree. This makes sense. You  become president or prime minister several years after earning your degree. By the time you have the experience required for the job, any college premium is gone.

See also my post The 2 myths getting students into heavy-league schools.

Disclaimer: I am a graduate of the University of Toronto, maybe the most prestigious university in Canada.

Parsing CSV files is CPU bound: a C++ test case (Update 2)

I am continuing my fun saga to determine whether parsing CSV files is CPU bound or I/O bound. Recall that I posted some C++ code and reported that it took 96 seconds of process time to parse a given 2GB CSV file and just 27 seconds to read the lines without parsing. Preston L. Bannister correctly pointed out that using the clock() function is wrong. So I updated my code using his ZTimer class instead. The new numbers are 103 seconds for the full parsing and 57 seconds to just parse the lines.

Some anonymous reader claimed that my code was still grossly inefficient. I do not like arguing without evidence.

Ah! But Unix utilities can also parse CSV files. They are usually efficient. Let us use the cut command:


$ time cut -f 1,2,3,4 -d , ./netflix.csv > /dev/null
real 1m59.596s
user 1m53.163s
sys 0m3.775s

So, 120 seconds?

What about sorting the CSV file? Of course, it is a lot more expensive: 504 seconds.

$ time sort -t, ./netflix.csv > /dev/null
real 8m23.985s
user 2m28.855s
sys 1m1.467s

Finally, for a basis of comparison, let us just dump the file to /dev/null:

$ time cat ./netflix.csv > /dev/null
real 0m29.337s
user 0m0.029s
sys 0m2.541s

The final story:

parsing method time elapsed
cat Unix command 29 s
Daniel’s line parser 57 s
Daniel’s CSV parser 103 s
cut Unix command 120 s
sort Unix command 504 s

Analysis: My C++ code is not grossly inefficient. If the I/O cost of reading the file is about 30 seconds, parsing it takes about 100 seconds. My preliminary conclusion is that parsing CSV files is more CPU than I/O bound.

Parsing CSV files is CPU bound: a C++ test case (Update 1)

(See update 2.)

In a recent blog post, I said that parsing simple CSV files could be CPU bound. By parsing, I mean reading the data on disk and copying it into an array. I also strip the field values of spurious white space.

You can find my C++ code on my server.

A reader criticized my implementation as follows:

  • I use the C++ getline function to read the lines. The reader commented that “getline does one heap allocation and copy for every line.” I doubt that getline generates heap allocation each time it is called: I reuse the same string object for every call.
  • For each field value, I did two heap allocations and two copies. I now reuse the same string objects for fields, thus limiting the number of heap allocations.
  • The reader commented that I should use a custom allocator to avoid heap allocations. Currently, if the CSV file has x fields, I use x+1 string objects (a tiny number) and small constant number of heap allocations.

Despite these changes, I still get that parsing CSV files is strongly CPU bound:


$ ./parsecsv ./netflix.csv
without parsing: 26.55
with parsing: 95.99

However, doing away with the heap allocations at every line did reduce the parsing running time by a factor of two. It is not difficult to believe I could close the gap. But I still see no evidence that parsing CSV files is strongly I/O bound as some of my readers have stated. Consider that in real applications, I would need to convert field values to dates or to numerical values. I might also need to filter values, or support fancier CSV formats.

My experiments are motivated by a post by Matt Casters. Some said that Java was guilty. I use C++ and I get a similar result. So far at least. Can you tell me where I went wrong?

Disclaimer: Yet again, I do not claim that my code is nearly optimal. My exact claim is that reading CSV files may be I/O bound using reasonable code. I find this very surprising.

Fast argmax in Python

In my post Computing argmax fast in Python, I reported that Python has no builtin function to compute argmax, the position of a maximal value. I provided one such function and asked people to improve my solution. Here are the results:

argmax function running time
array.index(max(array)) 0.1 s
max(izip(array, xrange(len(array))))[1] 0.2 s
max(izip(array, xrange(len(array))))[1] 0.5 s

Conclusion: array.index(max(array)) is simpler and faster.

Next Page »

18 queries. 0.448 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.