Monday, November 3rd, 2008

Google is fighting back against cars: Google transit

Filed under: Science and Technology — Daniel Lemire @ 10:05

In North America, getting from point A to point B by foot or my bus can be quite difficult. Hence, many people prefer cars because, in some sense, they are easier. With cheap GPS devices, getting around in cars is really easy. Need to be somewhere at 6am? Your car will help you… What to use public transportation? Good luck.

Fortunately, Google is fighting back against cars! I tried Google transit. It helps you plan your trips using public transportation. What a brilliant tool!

Is there anything Google cannot do? I am really worried that they do too good a job. If I wanted to launch a Web start-up, I would hate to have to fear Google…

Source: Usually, I push new technology onto my wife. Not the other way around. This time however, my wife made me discover Google transit. Thanks Nathalie!

Friday, October 31st, 2008

A no free lunch theorem for database indexes?

Filed under: Data Warehousing and OLAP, Science and Technology — Daniel Lemire @ 10:01

As a trained mathematician, I like to pull back and ask what are the fundamental limitations I face.

A common misconception in our Google-era is that database performance is a technical matter easily fixed by throwing enough hardware at the problem. We apparently face no fundamental limitation. To a large extend, this statement is correct. Thanks to the B-tree and related data structures, we can search for most things very quickly. Roughly, the database problems are solved, as long as you consider only a specific type of queries: queries targeting only a small subset of your data.

What if you want to consider more general queries? Can you hope to find a magical database index that solves all your problems?

I spent part of my morning looking for a No Free Lunch (NFL) theorem for database indexes. I would like to propose one:

Any two database indexes are equivalent when their performance is average across all possible content and queries. (Naturally, it is ill-posed.)

I draw your attention to how limited your interaction with tools that produce aggregates, such as Google Analytics, are. Basically, you navigate through precomputed data. You may not realize it, but the tool does not let you craft your own queries. Jim Gray taught us to push these techniques further with structures such as the data cube (which wikipedia insists on calling an OLAP cube). However, precomputation is just a particular form of indexing. It helps tremendously when the queries take a particular form, but when you average over all possible queries, it is unhelpful.

What if you want to consider general queries, and still get fast results? Then you have to assume something about your data. I suggest you assume that your data is highly compressible. Run-length encoding has been shown to help database queries tremendously.

Short of these two types of assumptions (specific queries or specific data sets), the only way you can improve the current indexes is by constant factors—you can double the performance of existing B-tree indexes, maybe. Or else, you can throw in more CPUs, more disks, and more memory.

For now, I will stick with my puny computers, and I will assume that the data is highly compressible. It seems to work well in real life.

Monday, October 27th, 2008

The future of innovation is in software

Filed under: Science and Technology — Daniel Lemire @ 8:01

I keep reading about how the future will be shaped by new cheaper fuel or amazing new medications. I believe that we are misreading the trends. Yes, we will have better medications and cheaper fuel in the future. However, I believe we are clearly in the mist of an information revolution. The future will be shaped by software, defined broadly.

Specifically, I believe that:

  • Tele-work, tele-play, tele-learning will soon represent 80% of our lives.
  • There is much more room for innovation in software than in hardware. There are few ways to build a house, but many more ways to build a virtual house.

Friday, October 24th, 2008

The problem with unidimensional research

Filed under: Academia/Research, Data Warehousing and OLAP — Daniel Lemire @ 3:02

Yesterday, I listened to some of the BDA’08 talks. One common issue I noticed is that most work is unidimensional. That is, researchers tell us “my technique is 4 times faster”. The trade-off, if any, is not presented. Here is what I would like to hear: “I use twice the RAM, but I am 4 times faster”.

This is a side-effect of our quest for positive results. People hide the trade-offs because the negative points make their papers less likely to be accepted. I believe this is the main reason why practitioners ignore researchers. Researchers paint a severely distorted view of the world.

Reviewers need to wise up and give extra points to researchers who describe the negative and positive points of a method. We need to stop believing that the one ideal technique is just around the corner.

In my experience as a researcher, there is no silver bullet. Impressive techniques often fall flat when viewed from a different angle. What appears deep and thoughtful may sound trivial a few years later once it is well understood. What appears childish may be the best option in most cases.

Paint a rosy picture of your work and I will distrust you. Paint the shadows and I start believing in you.

Saturday, October 18th, 2008

Blogging is networking

Filed under: Academia/Research, Science and Technology — Daniel Lemire @ 17:00

Two years ago, I asked whether academic blogging was still relevant. At the time, two famous bloggers had stopped (Sébastien Paquet and Stephen Downes). Evidently, I kept on blogging. I even took up microblogging.

Let me revisit some of the benefits.

  • Bloggers are more visible. This blog has over 900 readers. Some are students, others are engineers, teachers, entrepreneurs, researchers or professors.
  • Blogging is good for knowledge management: leaving a trace of your thoughts is always a good idea. You are also forced to flush out your ideas and you get immediate feedback.

However, reading Daniel Tunkelang today, I realized that the most important benefit is the networking effect. A blog without a social network is nothing. Nobody wants to read a list of random thoughts. What makes blogging rewarding for me are the comments and the link I receive. But I enjoy even more reading others and commenting elsewhere.

In effect, good blogging makes you part of a rich and open community. That is very valuable.

Wednesday, October 1st, 2008

Google won’t help your blog

Filed under: Science and Technology — Daniel Lemire @ 14:34

My blog has been penalized in Google’s index for a few weeks now. While I would prefer that my content be easier to find, it does not worry me and my readership is not declining. Daniel Tunkelang explains exactly why: while Google brings a lot of traffic to my blog, almost none of it is relevant.

For example, my post on Simpson’s paradox is the most popular page from Google’s search engine. By far. Yet, it is not very exciting nor very representative. It gets a lot of traffic because many people want to compute the “average of averages”.

There is a practical conclusion: forget about most search-engine optimization tricks if you are a blogger. Making some of your pages show up first in Google will generally not increase your readership.

Wednesday, September 10th, 2008

Yes, Johny, your Mac has a 64-bit CPU

Filed under: Science and Technology — Daniel Lemire @ 12:54

I have written a lot of C++ code in the last year or so. The C family of languages have types that will vary in size depending on the OS and the CPU.

I thought my Macs had 32-bit CPUs because the “long” type uses 32 bits. Why would Intel sell 32-bit CPUs to Apple? I did not care for the answer.

However, my current work is very sensitive to the CPU word length. So I decided to look into the matter. Turns out that Apple is buying standard 64-bit processors from Intel. But somehow, the C compiler defaults to 32-bit binaries.

Thankfully, the fix is easy. Go from…

g++ -O2 -o test test.cpp

to…

g++ -O2 -m64 -o test test.cpp

For some code I wrote, this meant running twice as fast!

Disclaimers: 1) I have not checked whether the same thing is still true with MacOS 10.5. 2) You might encounter some difficulties if you try to link a 64-bit executable with 32-bit libraries. 3) I am not claiming that your software will run twice as fast with 64-bit words.

« Previous PageNext Page »

37 queries. 1.007 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.