Column stores and row stores: should you care?

Most database users rely on row-oriented databases such as Oracle or MySQL. In such engines, the data is organized by rows. Database researcher and guru Michael Stonebraker has been advocating column-oriented databases. The idea is quite simple: by organizing the data into columns, we can compress it more efficiently (using simple ideas like run-length encoding). He even founded a company, Vertica, to sell this idea.
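To make the compression argument concrete, here is a minimal sketch in Python. The table, the column names and the helper functions `rle_encode` and `rle_decode` are all made up for illustration; this is not how Vertica or any particular engine implements it. Once the values of one attribute are stored next to each other, long runs of repeated values collapse into (value, run length) pairs.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical table, stored row by row as a row store would.
rows = [
    ("Montreal", "QC", 2008),
    ("Quebec", "QC", 2008),
    ("Laval", "QC", 2009),
    ("Toronto", "ON", 2009),
    ("Ottawa", "ON", 2009),
]

# Column layout: one list per attribute instead of one tuple per row.
city, province, year = (list(col) for col in zip(*rows))

encoded_province = rle_encode(province)  # [("QC", 3), ("ON", 2)]
encoded_year = rle_encode(year)          # [(2008, 2), (2009, 3)]
```

In the row layout, the repeated province and year values are interleaved with the city names, so there are no runs to exploit. The gain also depends on the sort order: a column with few repeated adjacent values (such as `city` here) does not shrink at all.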

Daniel Tunkelang is back from SIGMOD: he reports that column-oriented databases have grabbed much mindshare. While I did not attend SIGMOD, I am not surprised. Daniel Abadi was awarded the 2008 SIGMOD Jim Gray Doctoral Dissertation Award for his excellent thesis on Column-Oriented Database Systems. Such great work supported by influential people such as Stonebraker is likely to get people talking.

But are column-oriented databases the next big thing? No.

  • Column stores have been around for a long time in the form of bitmap and projection indexes. Conceptually, there is little difference. (See my own work on bitmap indexes.)
  • While it is trivial to change or delete a row in a row-oriented database, it is harder in column-oriented databases. Hence, applications are limited to data warehousing.
  • Column-oriented databases are faster for some applications, sometimes by two orders of magnitude, especially on low selectivity queries. Yet part of these gains is due to the recent evolution of our hardware. Hardware configurations where reading data sequentially is very cheap favor sequential organization of the data such as column stores. What might happen in the world of storage and microprocessors in the next ten years?

I believe Nicolas Bruno said it best in Teaching an Old Elephant New Tricks:

(…) some C-store proponents argue that C-stores are fundamentally different from traditional engines, and therefore their benefits cannot be incorporated into a relational engine short of a complete rewrite (…) we (…) show that many of the benefits of C-stores can indeed be simulated in traditional engines with no changes whatsoever.  Finally, we predict that traditional relational engines will eventually leverage most of the benefits of C-stores natively, as is currently happening in other domains such as XML data.

That is not to say that you should avoid Vertica’s products or refrain from doing research on column-oriented databases. However, do not bet your career on them. The hype will not last.

(For a contrarian point of view, read Abadi and Madden’s blog post on why column stores are fundamentally superior.)

Is collaboration correlated with productivity?

Apparently, it is prestigious to write research papers with people from other countries. Funding agencies routinely favor collaboration between different universities.

Presumably collaboration improves productivity? Maybe not:

(…), there is no clear evidence that correlation exists between the resort to extramural collaboration and the overall performance of a research institution

Reference: Giovanni Abramo, Ciriaco Andrea D’Angelo, Flavia Di Costa, Research collaboration and productivity: is there correlation? High Educ (2009) 57:155–171.

Netflix competition is over?

The Netflix competition is a $1 million research competition to improve the Netflix movie recommender system by 10%. A large team called BellKor’s Pragmatic Chaos just announced that they won (update: unless someone can beat them in the next month). Among them are Yehuda Koren, with whom I organized the 2nd Netflix-KDD Workshop, and some engineers from my home town (Piotte and Chabbert). I do not know how they will split the money, but I suspect each one of them will get at least $100k.

I want a study on the benefits of this new technology on the Netflix users.

Reference: See my older posts Proceedings of the Large-Scale Recommender Systems workshop and Netflix: an interesting Machine Learning game, but is it good science?

Physical tools to improve research productivity

Using the right tools can improve your productivity:

  • I use black gel pens with a large to medium point. Right now, I favor uni-ball 207 pens.
  • I always carry a pocketbook. I use it to collect current actionable items. There is exactly one active page at any one time. Once it gets filled up, I move to the next page. For example, I might have a three-item list: “prove conjecture A, find an algorithm to solve problem B, and run script C”. If I am shopping and I have an idea, I jot it down in my pocketbook on the active page. I never delete anything, though I strike through completed or irrelevant tasks. Right now, I am using a Paperblanks back pocket.
  • I use larger notebooks to brainstorm ideas. Again, I never delete anything: I move from page to page through a random collection of ideas. Most of the ideas I jot down are wrong. I do not worry about it since my notebooks are just collections of ideas. I only work out details on my laptop. If you see me carrying notebooks, I am probably working on some crazy new idea. Right now, I use Winnable Executive Journals.
  • I have a white board in my office, but I only use it with visitors, students or passing colleagues. I never work directly on a large board. I never use a white board for serious mathematics.

You may have noticed that these tools keep me mobile: I can do research anywhere, as long as I have a laptop and an Internet connection. I have no one true workplace. I can work in my living room, in my kitchen, in my university office, in our home office, in my bedroom, in my garden, and so on.

Other people use different tools:

Which tools do you use, and how do you use them?

Is Open Access publishing the solution? Really?

Back when I was a consultant, I had a client who was convinced that Microsoft Windows was free software. So, he insisted that all applications run on Microsoft’s web server. To him, the Apache server was an expensive proposition. Yet Microsoft is not at all in the business of free software: the cost of its products is merely hidden from the consumer.

Similarly, for professors and many graduate students, the costs of academic publishing are hidden. UQAM pays for my unrestricted access to research papers. Open Access research papers might have marginally more impact. However, the costs of Open Access are significant for me, just like the costs of Apache were important for my client:

  • There are far fewer Open Access journals to choose from.
  • On average, Open Access journals have lower standing.

Open access to research papers is the responsible thing to do. How do we change the system? Do we boycott restricted journals? No. There is nothing wrong with restricted journals. We should not force them to close; we should evolve so that they become irrelevant. For now, they serve their purpose. There is no adequate drop-in replacement.

Disruption is the solution. Younger folks may not remember this, but in the nineties, Microsoft had a tight grasp of the software market. Right now, Microsoft’s monopoly is irrelevant as far as I am concerned. Anyone can buy a PC, install Linux on it and access everything that matters. Of course, the real story is not that Linux has beaten Microsoft Windows. Instead, it is the operating system that has lost relevance.

How do we generate disruption? By providing alternatives. It is important to realize that these alternatives do not have to be better. Instead, they have to be more convenient and simpler. Unfortunately, I do not believe that Open Access journals are disruptive. They are challengers, certainly, but due to economics, they may fail to subvert the current system. 

Several years ago, I decided to publish all my preprints to arxiv. You can even grab an atom feed of my publications. Arxiv is indexed by Google Scholar and DBLP. Arxiv is well managed. Their web site is usable. Before I used arxiv, I would merely post my papers on my web site. This is an individual choice. While it is not apolitical, it does not require me to change anybody’s mind.

To me, the single most important recent event in academic publishing has been the publication by Perelman of his solution to the Poincaré conjecture on arxiv. This is truly a historical event.

Self-publishing is both simpler and more convenient than traditional publishing. It is disruptive. As is often the case with disruptive solutions, it lacks some important features. For example, reputation, peer review, quality control, validation and authentication are difficult with self-publishing. But that is to be expected. The solution is not to try to emulate these features one by one. Indeed, we may find that many of these important missing features are not relevant.

 

Some shameful facts about myself

  • In 2003, I predicted that it would take decades before videoconferencing became cheap enough for home users.
  • I do not know my own telephone number or postal code, though I have lived for many years in the same house (and we own it). I do not know my office number. I do not know my social insurance number. 
  • For the longest time, I thought that getting a Ph.D. was sufficient to get decent jobs, if not within academia, at least in industry. (That’s wrong.)
  • Once I file anything in a folder or inside a desk, I am certain never to find it again. Anything not directly on my desk is lost forever. I am not kidding. That is why I run a paperless office.
  • I once thought that computing the Hamming distance took quadratic time.
  • I can no longer understand my older research papers such as “Fourier analysis of 2-point Hermite interpolatory subdivision schemes” and “A family of 4-point dyadic multistep subdivision schemes”. I cannot even understand the abstracts of these papers. I could not prove I wrote them.
  • I lost all the electronic copies of my Ph.D. thesis the same day I sent the second revised version to the printer. Though I had backups, I overwrote all the backups with an empty file, by accident. Had they requested a second round of revisions, I would have had to retype my thesis.
  • My wife is much smarter than I am. If she did not manage our money, I would probably put all my savings in a checking account or I might forget where the money is.
  • I am somewhat of a diva: I guard my schedule against intrusions as if time spent on my research was very important. I am convinced that my research matters.
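For the record, on the Hamming distance item above: the distance between two sequences of equal length is the number of positions where they differ, so one pass over the pairs is enough. A minimal sketch in Python (the function name `hamming_distance` is just an illustrative choice):

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    # A single pass over the paired elements: linear time, not quadratic.
    return sum(x != y for x, y in zip(a, b))


distance = hamming_distance("karolin", "kathrin")  # 3
```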