The mythical bitmap index
A bitmap index is a popular data structure to speed up the retrieval of matching rows in a table. (It is also used in Information Retrieval, to retrieve matching words.)
Building a bitmap index is not hard. You match each possible value with a vector of bits. When the value occur, you insert the value “true” in the vector, otherwise you insert the value “false”. Hence, you get a set of vectors containing binary values. To find out when a particular value occur, you just load up the corresponding vector of bits. (You still need a B-tree or a hash table to map values of the corresponding vector.) In practice, programming with (compressed) vector of bits leads to much faster software than working of sets of row IDs.
The size of such a matrix is the number of distinct values times the number of rows. Hence, some people conclude that bitmap indexes are mostly good to index data were we have few distinct values. In fact, the first sentence of the bitmap index article on wikipedia tells us that bitmap indexes are meant to index data such as “gender” were only two values are possible.
Of course, this reasoning is wrong. Why?
Because bitmap indexes (the big matrix of zeroes and ones) are never stored uncompressed. You always manipulate them in compressed form.
How big is the compressed matrix? Let us see. By run-length encoding (RLE), the simplest compression algorithm I know, each vector of bits has a compressed size at most proportional to the number of occurrences of the corresponding value. If you sum it up, you have that the compressed size of a bitmap index is at most proportional to the size of your table! Irrespective of the number of distinct values!
For a wide range of decision-support applications, bitmap indexes can match or surpass the performance of most indexes, irrespective of the number of distinct values.
You don’t believe me? Read these benchmarks using an Oracle database: “In summary, bitmap indexes are best suited for DSS regardless of cardinality (…)”.
With my very own bitmap index library, I have scaled up to hundreds of millions of attribute values without a problem and tables having several gigabytes of data. And my budget is not comparable to Oracle (I have one developer, me).
I am not saying that bitmap indexes are always the best solution. Of course not! But there is no published evidence that the number of distinct values is a practical criterion.
Then why does the bitmap index article on wikipedia suggests otherwise? Because it is all over the blogosphere and posting boards… because when I tried to fix the wikipedia article, my changes got reverted. So, I post my rebuttal here. If you have practical evidence that bitmap indexes mostly work well when you have 2-3 attribute values, let us see it. Otherwise, help me kill this myth!
Daniel, I’m glad your pushing this. If you check out the talk page for the entry (http://en.wikipedia.org/wiki/Talk:Bitmap_index), you’ll see that a couple of us are supporting your efforts.
Fixing Wikipedia is a thankless but worthwhile task, and I’m glad to see researcher-bloggers taking it on. I’ve had a few success stories myself:
http://thenoisychannel.blogspot.com/2008/06/your-input-really-is-relevant.html
http://thenoisychannel.blogspot.com/2008/06/exploratory-search-is-relevant-too.html
Comment by Daniel Tunkelang — 20/8/2008 @ 10:25
Ouch, make that “you’re”. Next time I’ll proofread before getting caught up in the Roman numeral arithmetic.
And the talk page for the Wikipedia entry didn’t get turned into a hot link, so I’ll try again:
http://en.wikipedia.org/wiki/Talk:Bitmap_index
Comment by Daniel Tunkelang — 20/8/2008 @ 10:28
Thank you Daniel. This is more than I wished for.
Comment by Daniel Lemire — 20/8/2008 @ 11:33
Hi Daniel, I am not sure why your edit got reverted (can’t find the diff at a glance), but Wikipedia is pretty much open to referenced information. So, if almost any info is properly referenced from several reliable sources, it gets to stay.
I have added the info and reference to the Sharma benchmark article, and feel free to provide further links/refs about this in the article talk page.
Comment by Ragib Hasan — 20/8/2008 @ 12:04
Thank you. I saw that earlier today.
Comment by Daniel Lemire — 20/8/2008 @ 17:05
If all the values of a field (indexed with a bitmap field) were unique would the bitmap indices be extremely short? Would there need to be a tree-structure to store the indicies (or their locations)? How is the correct bitmap index, for a field, found when there are large numbers of distinct values?
Comment by Chris Dew — 21/8/2008 @ 4:48
Daniel,
I’m installing the software today, and don’t have the GNU seq command installed; so the /testequalityqueries.sh failed.
Let me suggest you replace `seq 0 10` with
0 1 2 3 4 5 6 7 8 9 10 in the two places seq is called in the script. This was then successful.
Comment by Will — 21/8/2008 @ 7:28
Thank you Will.
Comment by Daniel Lemire — 21/8/2008 @ 7:46
If all the values of a field (indexed with a bitmap field) were unique would the bitmap indices be extremely short? Would there need to be a tree-structure to store the indicies (or their locations)? How is the correct bitmap index, for a field, found when there are large numbers of distinct values?
As I point out in my post, you typically use a b-tree to get the location of the bitmap. You need some sort of b-tree-like data structure because compressed bitmaps have different sizes so you can’t just have them in a fix-length-record flat file.
It is not very exciting to index a field where all values are distinct because, yes, the bitmaps would all be very short. Ah! But you rarely have a single dimension. Bitmap indexes are typically used in a DSS context where you “always” have numerous dimensions.
Comment by Daniel Lemire — 21/8/2008 @ 7:48
Jesus, you’re surprised that Wikipedia is handing out bad technical advice? Have you noticed who writes it yet?
http://sc.tri-bit.com/outgoing/NeverTrustWikipediaTechnicalArticles.png
Also, your anti-spam mechanism is inappropriately case sensitive.
Comment by John Haugeland — 21/8/2008 @ 9:31
Thanks Daniel, I learned something today.
Comment by Parand — 25/8/2008 @ 11:56