Friday, October 31st, 2008

A no free lunch theorem for database indexes?

Filed under: Data Warehousing and OLAP, Science and Technology — Daniel Lemire @ 10:01

As a trained mathematician, I like to pull back and ask what are the fundamental limitations I face.

A common misconception in our Google-era is that database performance is a technical matter easily fixed by throwing enough hardware at the problem. We apparently face no fundamental limitation. To a large extend, this statement is correct. Thanks to the B-tree and related data structures, we can search for most things very quickly. Roughly, the database problems are solved, as long as you consider only a specific type of queries: queries targeting only a small subset of your data.

What if you want to consider more general queries? Can you hope to find a magical database index that solves all your problems?

I spent part of my morning looking for a No Free Lunch (NFL) theorem for database indexes. I would like to propose one:

Any two database indexes are equivalent when their performance is average across all possible content and queries. (Naturally, it is ill-posed.)

I draw your attention to how limited your interaction with tools that produce aggregates, such as Google Analytics, are. Basically, you navigate through precomputed data. You may not realize it, but the tool does not let you craft your own queries. Jim Gray taught us to push these techniques further with structures such as the data cube (which wikipedia insists on calling an OLAP cube). However, precomputation is just a particular form of indexing. It helps tremendously when the queries take a particular form, but when you average over all possible queries, it is unhelpful.

What if you want to consider general queries, and still get fast results? Then you have to assume something about your data. I suggest you assume that your data is highly compressible. Run-length encoding has been shown to help database queries tremendously.

Short of these two types of assumptions (specific queries or specific data sets), the only way you can improve the current indexes is by constant factors—you can double the performance of existing B-tree indexes, maybe. Or else, you can throw in more CPUs, more disks, and more memory.

For now, I will stick with my puny computers, and I will assume that the data is highly compressible. It seems to work well in real life.

Thursday, October 30th, 2008

How I built my Web presence as a researcher…

Filed under: Academia/Research — Daniel Lemire @ 12:00

Suzanne Bowness asked me to answer some questions for a paper she is preparing. I reproduce here the content of the interview. It is mildly incoherent.


When did you first start your web site? Has your purpose for it evolved over the time that it has been online? How did you decide what sections to include?

I started my web site as a graduate student, circa 1995. Initially, my goal was to keep track of my favourite web sites, and share them with the world. Later, I began to post my academic papers online systematically. Informally, I also added a news section to my web site.

Around 2004, academic blogging emerge as a new trend. Researchers like Stephen Downes showed that blogging could be an integral part of one’s research activities. Therefore, I replaced my hand-crafted news section by a bona fide blog. Later on, I added a French blog for my students.

Do you have a sense of what parts of your web site are most commonly consulted? What would you recommend that other professors include on a web site? Anything you would avoid?

My blog is read by over 1000 people. Judging by the comments alone, well known professors and researchers read my blog. Of course, many of them have blogs as well including Peter Turney (NRC)  or David Eppstein (UCI).

In fact, my blog is more than a  publication venue: it is an integral part of my networking activities as a researcher.

Most professors should not become bloggers. However, they should all be making sure that their papers, their data, their software, their courses and their talks are available online. There is mounting evidence that making your work easier to browse and download is beneficial to one’s academic career.

How does your web site help you in reaching out to students? How does it help to raise your public profile?

I believe that most of my students do not read my blog. Other students do, certainly. I have not yet found a good way to integrate blogging with teaching. Indeed, whereas most teaching in universities happens in closed groups, blogging appears to require large open social networks to be effective.

I find however that many graduate students enjoy the fact that I make my papers and my software available online freely.

Did you design your site yourself? What software do you use to maintain it? Any advice for other profs in terms of technical upkeep or updating frequency?

Like many science professors, I have initially designed my Web site using a text editor and raw HTML. I maintain my own blog engine (wordpress).

Universities often do not have the resources to help professors maintain effective Web sites. Fortunately, many inexpensive solutions are available (wordpress.com, blogger.com, and so on). In fact, most academic bloggers do not blog from within the university networks. People have even coined a term for this do-it-yourself strategy: edupunk.

I spend  many hours every week publishing content online. Some may consider it wasteful. However, I believe university professors are in the communication business.

Any other comments/advice on creating a web presence?

In 2008, the Web is a social phenomenon. Merely posting content is no longer enough. You have to find a way to interact dynamically with people interested by your work and ideas.

Tuesday, October 28th, 2008

When in doubts, prefer unimpressive negative results

Filed under: Academia/Research — Daniel Lemire @ 18:58

When you first hear about peer review, you are lead to believe that the purpose of peer review is detect mistakes and improve papers. Then, one day, you submit a beautiful paper presenting a correct result. Surprise! It gets rejected! What happened? Someone decided your work was not impressive enough.

In Why Current Publication Practices May Distort Science, Young et al. explains why the artificial scarcity of publication venues in science may systematically favor wrong results. The short story is this:

  • journals and conferences pick papers that are most likely to get cited;
  • papers with impressive positive results are more likely to get cited;
  • hence, papers with impressive positive results are favoured against unimpressive negative results;
  • however, unimpressive negative results are actually more likely to be correct!

Hence, the modern version of peer review may actually favour misguided, but impressive results over carefully crafted, but boring results.

One of their conclusions is particularly interesting:

To exorcise the winner’s curse, the quality of experiments rather than the seemingly dramatic results in a minority of them would be the focus of review, but is this feasible in the current reality?

Maybe my advice to submit your papers where they are likely to be accepted might actually lead to better science?

Related posts: Productivity measures are counterproductive?, Are we destroying research by evaluating it?, and On the upcoming collapse of peer review.

Source: Guillaume Guité

Monday, October 27th, 2008

The future of innovation is in software

Filed under: Science and Technology — Daniel Lemire @ 8:01

I keep reading about how the future will be shaped by new cheaper fuel or amazing new medications. I believe that we are misreading the trends. Yes, we will have better medications and cheaper fuel in the future. However, I believe we are clearly in the mist of an information revolution. The future will be shaped by software, defined broadly.

Specifically, I believe that:

  • Tele-work, tele-play, tele-learning will soon represent 80% of our lives.
  • There is much more room for innovation in software than in hardware. There are few ways to build a house, but many more ways to build a virtual house.

Friday, October 24th, 2008

The problem with unidimensional research

Filed under: Academia/Research, Data Warehousing and OLAP — Daniel Lemire @ 3:02

Yesterday, I listened to some of the BDA’08 talks. One common issue I noticed is that most work is unidimensional. That is, researchers tell us “my technique is 4 times faster”. The trade-off, if any, is not presented. Here is what I would like to hear: “I use twice the RAM, but I am 4 times faster”.

This is a side-effect of our quest for positive results. People hide the trade-offs because the negative points make their papers less likely to be accepted. I believe this is the main reason why practitioners ignore researchers. Researchers paint a severely distorted view of the world.

Reviewers need to wise up and give extra points to researchers who describe the negative and positive points of a method. We need to stop believing that the one ideal technique is just around the corner.

In my experience as a researcher, there is no silver bullet. Impressive techniques often fall flat when viewed from a different angle. What appears deep and thoughtful may sound trivial a few years later once it is well understood. What appears childish may be the best option in most cases.

Paint a rosy picture of your work and I will distrust you. Paint the shadows and I start believing in you.

Saturday, October 18th, 2008

Blogging is networking

Filed under: Academia/Research, Science and Technology — Daniel Lemire @ 17:00

Two years ago, I asked whether academic blogging was still relevant. At the time, two famous bloggers had stopped (Sébastien Paquet and Stephen Downes). Evidently, I kept on blogging. I even took up microblogging.

Let me revisit some of the benefits.

  • Bloggers are more visible. This blog has over 900 readers. Some are students, others are engineers, teachers, entrepreneurs, researchers or professors.
  • Blogging is good for knowledge management: leaving a trace of your thoughts is always a good idea. You are also forced to flush out your ideas and you get immediate feedback.

However, reading Daniel Tunkelang today, I realized that the most important benefit is the networking effect. A blog without a social network is nothing. Nobody wants to read a list of random thoughts. What makes blogging rewarding for me are the comments and the link I receive. But I enjoy even more reading others and commenting elsewhere.

In effect, good blogging makes you part of a rich and open community. That is very valuable.

Friday, October 17th, 2008

How do you know that you are right?

Filed under: Academia/Research — Daniel Lemire @ 13:13

How an individual evaluates his work is a fundamental intrinsic characteristic. If I had to classify researchers, for example, I would look at how they argue for the quality of their research.

Let me ask you, how do you know how well you are doing? Doing well can mean several things:

  • What you state is factually correct.
  • Many people know or appreciate your work.
  • The performance of your tool  or system is competitive with respect to some measure.
  • You are getting a lot done.
  • You are making a lot of money.

For most tasks we accomplish, no quantitative measure is satisfying. Can you quantify how good is one of my research papers, or this blog post? Even when you can use a quantitative measure, it is sometimes prohibitively expensive to compute it. Correctness is also somewhat relative. Most non-trivial work is not without errors. The best way to avoid errors is to do trivial work. Popularity is also relative: is McDonald’s better than the best restaurant in my neighborhood?

I think you can change the direction your career takes by changing your metrics.

A secondary characteristic is what you believe makes them good. Are you doing poorly because of the working conditions, or because you are an idiot? Most research papers spend little time on failures, but it would be interesting to see how people describe their failures. I find it interesting to hear famous people tell us why they are doing well. Some stess how brilliant they are. Others stress how much they worked. Others thank others for helping them.

Next Page »

36 queries. 0.931 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.