Thursday, May 1st, 2008

Seeking an efficient algorithm to group identical values

Filed under: Science and Technology — Daniel Lemire @ 23:26

In the past, I have had luck with my requests for help, so here is another one.

Suppose you have a large array made of a large number of distinct values ({A,B,A,B,A,C,C}) and you want to group the identical values like so {A,A,A,B,B,C,C} or like so {C,C, A,A,A,B,B}. That is, you do not care about the order, you just want identical values to be clustered. How do you do it?

  • You could sort the array assuming that there is an ordering between the objects. There are highly efficient external-memory sorting routines. They mostly rely on sequential IO and are amazingly fast.
  • You can build a hash table. Because it is a linear time operation, for very large arrays, it should be faster than sorting, in theory. However, the catch is that external-memory hash tables are not very efficient because they rely on random IO and are prone to cache misses. Remember kids that 100 n > n log n despite what your math. teacher taught you.
  • We can mix hashing and sorting. Scan the array, and randomly hash each value into one of L bins. You know that if the value x appears in bin i, then all values x are in the same bin. So, you can simply sort each bin and concatenate the bins in O(n log n/L) time, assuming your hashing is good enough.
  • One last possible trick might be to adapt fast duplicate detection algorithms such as the Teuhola-Wegner algorithm: J. Teuhola, L. Wegner, Minimal space, average linear time duplicate deletion, Communications of the ACM, 1991.

So, what do you think?

Week-old cappucinos taste bad

Filed under: Academia/Research — Daniel Lemire @ 8:32

I am not a neat guy. I tend to drop coffee cups on my desk and I never look back. What I learned this year is that if you leave a cup of cappucino for a week on your desk, then drink a shot of it by mistake while staring at your computer screen, you are left with a terrible taste in your mouth for the rest of the day. However, day-old cappucinos are ok.

There is a lesson here kids: do not leave projects without attention for weeks at a time. You will get a bad taste once you go back to it later.

« Previous Page

27 queries. 0.338 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.