If you claim high scalability…

I just reviewed a paper where the authors come up with a nice highly scalable algorithm. And it is really scalable too! But to prove just how fast it is, they process 2,000 data points.

This is correct, strictly speaking. Their algorithm runs in O(n) time, so to know how long it would take to process 1000 times more data, just multiply by 1000.

But where is the fun in that?

The insane world of academic publishing

Stephen Few few wrote a post on how insane academic publishing is. If you publish academic papers, his post is worth your time. Don’t miss the comments!

Stephen is not in academia. From his point of view, what is required of him makes no sense:

  • While he does not expect to get paid for publishing a paper, he expects some kind of symbolic reward like a few subscription to the journal: Is there really any question that someone who takes the time to write an article and go through the lengthy process of working with a publisher, deserves a gesture of thanks equaling the cost of postage?
  • Stephen is surprised that reviewers remain anonymous through the entire process: Cloaking the process in anonymity seemed to indicate a level of discomfort with critique that I didn’t expect to find to this degree in academia.
  • He is upset with how IEEE handles copyright:”I have worked with several publishers and I have never had to give up my rights as author. Most modern publishers know that they don’t need to strip authors of their rights in order to do their job.

My own answers:

  • Anonymous review is just a system we refuse to question. Speaking your mind is certainly a dangerous thing—more so in some countries than others. However, I believe a scholar should have the backbone to speak out in the open. Do something else with your life if you are afraid to sign your opinion pieces.
  • The copyright issue is a shame. However, Stephen should also ask why so many employers ask for non-compete clauses. He should also ask why musicians sign away their soul routinely. I have always been puzzled at how easily TV series are killed: clearly the authors lose their copyright along the way. Fortunately, scholars are pretty bad at reading the contracts they sign…

Proceedings of the Large-Scale Recommender Systems workshop

We have made available a PDF copy of the proceedings for the second Netflix/Large-Scale KDD Recommender workshop. It includes the following papers:

  • Jinlong Wu and Tiejun Li. A Modified Fuzzy C-Means Algorithm For Collaborative Filtering
  • Gavin Potter. Putting the collaborator back into collaborative filtering
  • Andreas Toescher, Michael Jahrer and Robert Legenstein. Improved Neighborhood-Based Algorithms for Large-Scale Recommender Systems
  • Oscar Celma and Pedro Cano. From hits to niches? or how popular artists can bias music recommendations
  • Domonkos Tikk, Gabor Takacs, Istvan Pilaszy and Bottyan Nemeth. Investigation of Various Matrix Factorization Methods for Large Recommender Systems

Cool software design insight #4

Mathematicians and philosophers often make terrible programmers. They also tend to write gibberish even in English. (Ok, I do not know if it is a fact, but stay with me.)

A terrible way of programming is to try to hold the entire problem in your head and to put it into code in one shot. Why? Because you are almost certainly overestimating your brain. Your mind can only cope with few parameters at a time.

Here is how you have to program:

  • If you have the luxury to start fresh, start small. Otherwise, make sure you understand and respect the code you start with.
  • Identify specific changes you want to implement. Small changes! Then implement them. Then test them!

You should never redesign working software from the ground up without incremental testing. Never. Even if you work alone.

Interestingly, I also write my papers incrementally, fixing small things one after the other. No other way works for me.

How to select even or odd rows in a table using CSS 3

CSS 3 is around the corner. Already we are seeing some benefits. The latest versions of Safari and Opera, as well as the beta version of Firefox allow you to select even or odd rows in a table using only CSS:


tr:nth-child(2n+1) {
background-color: blue;
}
tr:nth-child(2n) {
background-color: red;
}

See? No ECMAScript, no server-side programming. Alas, no sign of support for this in Internet Explorer.

Quick CSS quiz

Given these CSS instructions,


z[x] > a[i] {color: blue;}
z z[x] a {text-decoration: underline;}
z > z a , z z z + a { color: red ;}

what will be the color of the text in the following XML file?

<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml-stylesheet type="text/css"
href="test.css"?>
<z><z x="x">
<z />
<a i="x">my text</a>
</z>
</z>

Peer review is an honor-based system

It would take too long to expose all of the flaws of peer review, here are some:

  • some work is just flat wrong because the reviewers cannot analyze all of the mathematical results, and because they cannot redo the experiments;
  • numerous researchers cheat, sometimes in small ways (“2 out of 3 experiments agree with by theory, let us drop the third one”), sometimes in big ways (“I don’t have time to run these experiments, so let me make up some data”);
  • peer review may perpetuate some biases and prevent researchers from putting into question some fundamental questions (“we decided that this is the right way, if you question it, you are a loony”).

However, for all its faults, peer review remains essential in science. I want other researchers to read and criticize my work. I enjoy it very much when people try to find flaws in my work. I think that my work is serious enough that when people point out flaws, I am usually aware of them at some level and I can respond easily (and enjoy the process).

The type of peer review I do not enjoy is the country-club approach: 1) does the paper agrees with the goals and views of the reviewers 2) is the paper written by someone we can respect? Fortunately, you can navigate the system and stay away (mostly) from country-club peer review.

But why do I still like peer review despite its obvious flaws? Because I see it as an honor-based system. In such a system, you have to accept that there will be cheaters. A lot of them. And there will lots of mistakes. All we have to do is be open about it. That is, you cannot say “but my work was peer reviewed so you cannot question it!” or “I am very good, look at these prestigious publications!”. The peer review is there to help the authors. It is not, however, an insurance against fraud or mistakes. I like peer review because it helps me become better, but I do not use the system to determine how good someone else is.

So, what do we do if we want to know how good someone is? You read his work. You reproduce his experiments. You redo his math. Of course, this scales poorly. If you have to hire someone, you cannot read the work of 50 or 500 candidates. So? I think we have to be realistic. It is hard to know how good someone is. You can get to know 10 or 20 researchers in your life. That is about all.

Hiring processes are flawed. You will hire cheaters. Get over it.

The mythical bitmap index

A bitmap index is a popular data structure to speed up the retrieval of matching rows in a table. (It is also used in Information Retrieval, to retrieve matching words.)

Building a bitmap index is not hard. You match each possible value with a vector of bits. When the value occur, you insert the value “true” in the vector, otherwise you insert the value “false”. Hence, you get a set of vectors containing binary values. To find out when a particular value occur, you just load up the corresponding vector of bits. (You still need a B-tree or a hash table to map values of the corresponding vector.) In practice, programming with (compressed) vector of bits leads to much faster software than working of sets of row IDs.

The size of such a matrix is the number of distinct values times the number of rows. Hence, some people conclude that bitmap indexes are mostly good to index data were we have few distinct values. In fact, the first sentence of the bitmap index article on wikipedia tells us that bitmap indexes are meant to index data such as “gender” were only two values are possible.

Of course, this reasoning is wrong. Why?

Because bitmap indexes (the big matrix of zeroes and ones) are never stored uncompressed. You always manipulate them in compressed form.

How big is the compressed matrix? Let us see. By run-length encoding (RLE), the simplest compression algorithm I know, each vector of bits has a compressed size at most proportional to the number of occurrences of the corresponding value. If you sum it up, you have that the compressed size of a bitmap index is at most proportional to the size of your table! Irrespective of the number of distinct values!

For a wide range of decision-support applications, bitmap indexes can match or surpass the performance of most indexes, irrespective of the number of distinct values.

You don’t believe me? Read these benchmarks using an Oracle database: “In summary, bitmap indexes are best suited for DSS regardless of cardinality (…)”.

With my very own bitmap index library, I have scaled up to hundreds of millions of attribute values without a problem and tables having several gigabytes of data. And my budget is not comparable to Oracle (I have one developer, me).

I am not saying that bitmap indexes are always the best solution. Of course not! But there is no published evidence that the number of distinct values is a practical criterion.

Then why does the bitmap index article on wikipedia suggests otherwise? Because it is all over the blogosphere and posting boards… because when I tried to fix the wikipedia article, my changes got reverted. So, I post my rebuttal here. If you have practical evidence that bitmap indexes mostly work well when you have 2-3 attribute values, let us see it. Otherwise, help me kill this myth!

The secret to intellectual productivity

I have written a lot about productivity in research and academia on this blog. As recent examples, see my posts Scientific productivity tips from Hartley and Branthwaite Rigor or relevance: choose one, Improving your intellectual productivity by accepting chaos, and Research productivity: what matters?.

However, there is a simple secret that anyone can apply right now. It can help you get better grades if you are students, it can help you finish that book or report you must write if you are a professional.

I should really charge you for this secret, but I am not good at capitalism. The secret is this: break down your task into small and easy chunks of work. And just do them!

Of course, there is more to it:

  • Sometimes you do not know how to break down the task. Do not worry so much about it. Break it down in some way and get working. Let me take an example… suppose that you want to prove some very difficult mathematical conjecture. Suppose that thousands of scientists have failed in making any progress at proving this conjecture. If you must work on it, then stop worrying and just pick a strategy, any strategy, and work.
  • Tasks you assign yourself can range from drawing pictures to going to sending an email to someone who has worked on the problem. As long as you work seriously and systematically, you cannot go wrong.
  • Periodically reevaluate your strategy. Always assume that your plans are quite possibly wrong. Just constantly refresh them.

Why does it work?

  • Feeling overwhelmed makes you stupid. You are better off working using a silly approach than to stand there and wait for divine inspiration. The gods help those who help themselves! The gods even help the idiots.
  • Enumerating all possible solutions to a problem, might actually be the best possible strategy! Sometimes, the best approach sounds silly a priori. All intellectuals work by trials and errors.
  • Getting your hands dirty on a problem trains your brain. After a week working hard (but stupidly) on a problem, you will have a different view of the problem because your brain will be different.
  • If you picked a very difficult problem, you may never succeed in solving it, but by working hard and regularly, you are very likely to find some results you can be proud of.

Back from vacations

I took some time off this year. No, we did not go anywhere specific. I just took two weeks off with my two sons. We had fun.

It has been years since I took time totally off. For irrational reasons, I would always keep my research going at a reduced rate during my time off. Not this year! I was 100% with my family… most of the time.

There still were a few exceptions. For example, DOLAP 2008 accepted our bitmap-index paper and we had to revise the paper within a few days. I really like this paper.

The problem we work on is simple: improve the compression and speed of bitmap indexes by row reordering. A bitmap index is a like a binary matrix with millions or billions of rows. Each column is compressed using a variant of run-length encoding. At query-time, the you load a few columns and apply logical operations (AND/OR) between them. Better compression translates into a faster index because the time required to perform logical operations over bitmaps is proportional to their compressed size. Finding the optimal row order is NP-hard.

A simple heuristic is to lexicographically sort the rows. It works surprisingly well. Fancier heuristics work slightly better. There are limits to how sophisticated your heuristic can get, however, since you need to scale to billions of rows. Fancy graph algorithms are out (as far as I can tell). (Processing the data in small blocks without merging does not work well.) The gist of our paper is that it is very difficult to predict which heuristic will work best, even with simple synthetic data. It was a difficult paper to write, and I am happy that the reviewers did not dismiss it since it breaks the traditional “A is better than B” research model. What our paper establishes is that it is a deeper problem than you might think at first. Fortunately, there are still a few elegant conclusions we came to.

Ok. Enough about research. Here is my youngest son on a poney:

Next Page »

18 queries. 0.433 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.