Column stores and row stores: should you care?

Most database users row-oriented databases such as Oracle or MySQL. In such engines, the data is organized by rows. Database researcher and guru Michael Stonebraker has been advocating column-oriented databases. The idea is quite simple: by organizing the data into columns, we can compress it more efficiently (using simple ideas like run-length encoding). He even founded a company, Vertica, to sell this idea.

Daniel Tunkelang is back from SIGMOD: he reports that column-oriented databases have grabbed much mindshare. While I did not attend SIGMOD, I am not surprised. Daniel Abadi was awarded the 2008 SIGMOD Jim Gray Doctoral Dissertation Award for his excellent thesis on Column-Oriented Database Systems. Such great work supported by influential people such as Stonebraker is likely to get people talking.

But are column-oriented databases the next big thing? No.

  • Column stores have been around for a long time in the form of bitmap and projection indexes. Conceptually, there is little difference. (See my own work on bitmap indexes.)
  • While it is trivial to change or delete a row in a row-oriented database, it is harder in column-oriented databases. Hence, applications are limited to data warehousing.
  • Column-oriented databases are faster for some applications. Sometimes faster by two orders of magnitude, especially on low selectivity queries. Yet, part of these gains are due to the recent evolution in our hardware. Hardware configurations where reading data sequentially is very cheap favor sequential organization of the data such as column stores. What might happen in the world of storage and microprocessors in the next ten years?

I believe Nicolas Bruno said it best in Teaching an Old Elephant New Tricks:

(…) some C-store proponents argue that C-stores are fundamentally different from traditional engines, and therefore their benefits cannot be incorporated into a relational engine short of a complete rewrite (…) we (…) show that many of the benefits of C-stores can indeed be simulated in traditional engines with no changes whatsoever.  Finally, we predict that traditional relational engines will eventually leverage most of the benefits of C-stores natively, as is currently happening in other domains such as XML data.

That is not to say that you should avoid Vertica’s products or do research on column-oriented databases. However, do not bet your career on them. The hype will not last.

(For a contrarian point of view, read Adabi and Madden’s blog post on why column stores are fundamentally superior.)

Netflix competition is over?

The Netflix competition is a $1 million research competition to improve the Netflix movie recommender system by 10%. A large team called BellKor’s Pragmatic Chaos just announced that they won (update: unless someone can beat them in the next month). Among them is Yehuda Koren with whom I organized the 2nd Netflix-KDD Workshop and also some engineers from my home town (Piotte and Chabbert). I do not know how they will split the money, but I suspect each one of them will get at least 100k$. 

I want a study on the benefits of this new technology on the Netflix users.

Reference: See my older posts Proceedings of the Large-Scale Recommender Systems workshop and Netflix: an interesting Machine Learning game, but is it good science?

Why senior researchers and managers should analyze data themselves…

Scientists, businessman and even spies are supposed to analyze data collaboratively. Are they?

If you are a scientist, you are familiar with the following type of research collaboration: a lowly student collects the data, crunches the numbers and plots the data. Other collaborators—such as the professor—merely comment on the tables and plots. Similarly, the CEO sees the pie chart, while the assistant crunches the numbers. That is vertical collaboration: you clean the basement and I will clean the main floor.

Yet, reliable data analysis requires horizontal collaboration.  Indeed, there are downsides to task specialization:

  • By never looking at the data, senior scientists and managers rely on experience and hearsay. Their incoming bandwidth is dramatically reduced. Nature is the best coauthor. Consider how the best American spies were fooled prior to 9/11 while all the data to catch the terrorists was available. Bandwidth is a requirement to be smart.
  • When a single person crunches the numbers, hard-to-detect errors creep in. The problem is serious: Ioannidis showed that most research findings are wrong.
  • With nobody to review the source data, the sole data analyst is more likely to cheat. Why rerun these tests properly, when you can just randomly dismiss part of the data? People are lazy: when given no incentive, we take the easy way out.

The common justification for task specialization is that senior researchers and managers do not have the time. Yet, 30 years ago, researchers and managers did not type their own letters. Improve the tools, and reduce task specialization.

With Sylvie Noël, I decided to have a closer look. My preliminary conclusions are as follows:

  • There are adequate tools to support rich collaboration over data analysis. Collaboratories have been around for a long time. We have the technology! Yet, we may need a disruption: inexpensive, accessible and convenient tools. The current migration tower Web-based applications might help.
  • Given a chance, everyone will pitch in. To make our demonstration, we collected user data from sites such as IBM Many Eyes and StatCrunch. We then ran an Ochoa-Duval analysis. We find that the network of users within web-based data analysis tools is comparable to other Web 2.0 sites.

As a database researcher, I think that further progress lies with loosely coupled data (no big tables! no centralized tool!) and flexible visualization tools (stop the pie charts! go with tag clouds!). I am currently looking for new research directions on this problem, any idea?

Further reading

Stop generating metadata and access the full content!

 Many researchers advocate the use of metadata to help find or recommend content automatically. Metadata is certainly useful when aggregating content for human beings: I first read the titles of research papers before reading them. However, machines do better when they access at least some of content  (Lin, 2009). Moreover, metadata is of little value in ranking answers (Hawking and Zobel, 2007). 

I think that researchers cling to metadata because that is how we have indexed books for so long. When I was a kid, full text searches in a library was unthinkable. Yet, there is no escape: everything is miscellaneous. Folksonomies and ontologies will not save the day. When working with machines, let go of metadata and embrace the full content.

I am particularly puzzled by a common research approach. Take an object. Extract metadata. Then compare objects between themselves using the metadata, or use the metadata for retrieval. I understand that this may constitute a useful form of dimensionality reduction. Yet, researchers frequently omit to check whether it is necessary to extract metadata at all.

Reference

  • David Hawking and Justin Zobel, Does topic metadata help with Web search? Journal of the American Society for Information Science 58 (5), 2007.
  • Jimmy Lin, Is searching full text more effective than searching abstracts? BMC Bioinformatics 2009, 10:46, 2009.

Credit: Thanks to Andre Vellino for motivating this post.

Quantum databases: what are they good for?

Hu et al. just posted An efficient quantum search engine on unsorted database. They refer to an older paper by Patel (2001), Quantum database search can do without sorting. Apparently without any data structure or preprocessing, logarithmic-time search queries are possible in quantum databases.

Even if we did have affordable quantum computers, this would not be a big selling point. Building B-trees or sorting tables is hardly prohibitive.

I would be more interested in how quantum databases can handle low selectivity queries. For example, can a quantum computer sum up a large array of numbers in near-constant time? Our current technology solves these problems with a mix of parallelism, compression and brute-force.

All I know about quantum computers is that we do not think they can solve NP-hard problems in polynomial time. Is there any reason to see a bright future in quantum databases?

Why are dynamic languages easier than static languages?

Dynamic langages like Python or Ruby are considerably easier than static languages like Java and C++. John is asking us why:

How do you account for the huge increases in productivity that dynamic language advocates say they experience.

My answer is duck typing. It is powerful because formal and tight definitions are hard and less useful than we might expect.

The difference is fundamental:

Static view Dynamic view
Tight definitions Duck typing
Ontologies Folksonomies
Specific solutions Generic solutions

Further reading:

Next Page »

23 queries. 0.445 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.