For those who don’t know, email is not a good collaborative editing tool. There are many superior alternatives such as wikis, version control tools, and so on. I tend to use all of those, but for serious work, when I need to actually sit down and write the paper, I use version control (such as CVS).

My friend Owen has decided to go spend the summer hidden in a shack. The downside of this is that bandwidth has become very expensive to him. So I suggested we move to Subversion from CVS. Subversion is reportedly smarter about how it uses the bandwidth. For example, if you change a single line in a file, it will not send the entired (changed) file over to the server, but just the description of the change. It has also other nice features like atomicity.

In practice, however, setting up Subversion has turned out to be a bit of a headache. My ISP has a nice Subversion howto, but I ran into many problems when I tried to setup svnserve (the Subversion server).

Some lessons learned:

  • Make sure you don’t use the Berkeley DB backend. It is very fragile: the minute something goes a bit wrong, Berkeley DB falls and can’t get up. Oh! You won’t lose data, but you’ll need to go on the server and do a “svnadmin recover”. Instead, use the newer “fsfs” which has not given me any trouble.
  • Don’t trust too much the Subversion error messages. I think they were written without much thought. Things like “this URL doesn’t exist” actually means “I can’t find your repository where you say it should be, try another URL”.
  • Use a web client for browsing the different versions. I use WebSVN and I’m very happy. It was extremely simple to setup and the requirements are very minimal.
  • Under Linux, don’t bother with the GUI clients. I tried most of them and they simply won’t cut it. Use the basic CLI tools coupled with a web client.

I recently updated my Mandrake Linux machine and got errors like this one when starting a GTK application:


(gnome_segv:24830): Pango-CRITICAL **: _pango_engine_shape_shape: assertion `PANGO_IS_FONT (font)' failed

The net result is that all GTK-based applications (Firefox, Gnumeric, Gimp, and so on) are broken.

To fix this, log in as root, and type


/usr/bin/pango-querymodules-32 > /etc/pango/i386/pango.modules

You might want to check that pango.modules is indeeed in “/etc/pango/i386″.

Suppose you could build a collision-free hash table, how fast would it be? It would be extremely fast, almost as fast as looking up data in an array.

As it turns out, collision-free hash tables have been possible for quite something and that’s called perfect hashing. See for example GNU gperf, for an implementation.

One way to building a collision-free hash table is to use two hashes: r mod q and r mod p. It is important that p and q be coprimes and that q be large enough to store all of your data. Then, these two hashes are used to create two layers of hash tables (h1 and h2): given the key a, you retrieve its value h(a) by computing h1(h2(a)). In other words, h1 stores the values and h2 uses your keys. Using two hash tables buys you the freedom of choosing (through heuristics) the values of h2, or equivalently, the keys of h1.

The downsides are that construction time might be slower and that the data structure cannot be easily updated. The upsides is that you can use very little storage and have tremendously fast look-ups.

Reference:
S. Lefebvre, H. Hoppe, Perfect spatial hashing, ACM SIGGRAPH 2006.

Here are some papers with sounding interesting titles from the list of ACM KDD papers:

Global Distance-Based Segmentation of Trajectories
Aris Anagnostopoulos, Michail Vlachos, Marios Hadjieleftheriou, Eamonn Keogh, Philip Yu

Rule Interestingness Analysis Using OLAP Operations
Bing Liu, Kaidi Zhao, Jeffrey Benkler, Weimin Xiao

Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends
Xuerui Wang, Andrew McCallum

Algorithms for time series knowledge mining
Fabian Morchen

Semi-Supervised Time Series Classification
Li Wei, Eamonn Keogh

Two of the papers I selected are from the infamous Eamonn Keogh. I wonder what it means? I must like the guy’s work!

Yesterday, I attended Olivier‘s talk at the Curves and Surfaces conference. Olivier is a fellow blogger and researcher. Alas, I was too tired after an afternoon of talks and went to sleep, so I did not hunt Olivier.

In any case, he presented one of the most interesting talk so far in the conference. While quite possibly extremely expensive (sounds like his algorithms are bound to be O(n^2)?!?), the approach he proposed, Learning on Manifolds, seems well grounded for practical problems. Among the key idea is that you draw a graph where there are vertices between close data points, and then you compute some Laplacian. By computing the eigenvectors of the Laplacian, you can “cluster” the data in different manifolds, assuming you know how many clusters you want. There are some hidden assumptions, like the fact that you have rather uniform density, and that all components of all data points are known (no missing data!).

Naturally, Olivier was asked “where are your analytical results?” I felt that while he correctly answered, Olivier was a bit shy. Basically, I would have answered “where are your experimental results?” Because, they have none. They claim that their analytical results are measures of goodness, but that’s only a claim. A theorem is not, necessarily, a valuable thing. Just the right theorems at the right times, are valuable. Olivier had experimental evidence on his side, at least on toy problems. But Olivier was probably wise not to answer the way I would have, because, no doubt, I would have been booed out of the room.

In this kind of work, there are two measures of how good something is. It can work well in practice, which you can verify by trying to solve some real-world problem, or, at least, a toy problem… or else, you can produce analytical results, or theorems, and claim that these theoretical results are an adequate measure of “goodness”. Well, guess what, theory alone is not enough, just like experimental evidence alone is not enough. If you come up with a theorem that says “if the solution is in C1 then this and this will happen”, and, no solution is ever in C1, what then? Anyone who has tried to work on industrial problems know that while theory can (and not is) be very helpful, it can be equally deceiving. Also, few real problems meet smoothness conditions such as Ck classes. And I say this as someone who has a Ph.D. in Mathematics!

An earlier talk in the conference discussed the problem of computing PageRank when the damping (or regularization) factor goes to zero. No, I don’t know what it had to do with Curves and Surfaces, but it was enjoyable anyhow. When I asked the author why he was assuming that removing the damping factor would improve search on the Internet, he clearly did not grasp my question. His answer was basically equivalent to saying “I don’t know that it will be better, but it can’t hurt.” Maybe it is true, maybe it isn’t, though I feel that he might have missed a few papers on the topic, since I don’t think researchers think that the damping factor is a harmful hack to achieve fast numerical convergence. I was annoyed that despite all his fancy algorithms he didn’t bother to present us with some experimental evidence, so we could see if his algorithms did a good job or not. He did not even present results over toy cases. Surely, he thought, presenting the mathematics is sufficient?

I wish we could change the culture and get people to work more often on real industrial problems.

« Previous Page

Powered by WordPress