Thursday, November 6th, 2008

DUAT: Do not use acronyms in titles

Filed under: Academia/Research — Daniel Lemire @ 15:16

If you submit a paper called “TIA in SAAS” to an international conference, your acceptance probability is low.

Disclaimer: Out of respect for the actual authors, I have slightly changed the title.

Wednesday, November 5th, 2008

Selecting emails per language

Filed under: Science and Technology — Daniel Lemire @ 13:14

Stephen Downes asked why he can’t tell his computer that all emails in Russian are spam. Here is my answer.

Part 1. Selecting emails per language

As far as I know, this is not a documented feature, but GMail allows you to select emails per language. Just type in the search box:

language:french

or

language:chinese in:spam

(I am aware that Chinese is not a language.)

You can also type “lang:chinese” if you prefer instead of “language:chinese”.

It has both false positives and false negatives. For example, I got an email with the subject line “L’internationalisation V.S. le SEO” which is clearly in French… but GMail fails to recognize it as French.

Part 2. Getting the spam filters to listen to your preferences

See my post My spam filter is asocial.

Tuesday, November 4th, 2008

Going back to the basics

Filed under: Academia/Research — Daniel Lemire @ 18:19

Quick! Which type of program had its enrolment grow by 54 percent between 2000 and 2004?


Philosophy is the answer, according to an article pointed out to us by Stephen Downes. This is quite simply a consequence of a phenomenon described by Paul Graham in some of his famous essays: while the best businessmen and engineers work their craft in industry, the best philosophers and mathematicians reside at your nearby university.

My experience dealing with professors from business schools has been that they are not very impressive, on average. I would not trust most of them with my business. Would they make good entrepreneurs or CEOs?

However, your average Mathematics or Philosophy professor can dazzle you with his insights. While they may be out of touch and irrelevant to our universe, at least, they can teach something.

To quote Serge Dubuc, my thesis cosupervisor:

You can do great things with Mathematics, as long as you do something else.

In short, starting with Mathematics and then moving on to applications is a wise path.

I do not want to sound too negative or too critical, but at my school, 40 percent of all students attend the Business School. While some of my business colleagues there are extremely bright (especially the operational research types), I think it is not a good omen.

I am glad more fundamental programs, like philosophy, are doing well.

Monday, November 3rd, 2008

Staying organized without planning

Filed under: — Daniel Lemire @ 16:48

In a recent post, I told you to stop planning and start prototyping.

Cyril—whose web site you should visit just to admire his simple design—objected that efficiency maximization was a very personal matter.

I conclude that I must have misrepresented my idea. I seriously doubt that anyone can manage his time better than with a greedy algorithm. In truth, I spend an hour every few days deciding on what I need to do next. What I never do is decide in October what I will do in February. I tried a few times and it never worked. Unless you are a cyborg, I doubt it can work for you.

Deciding on what to do next is not the same as planning. Or rather, my plans are crude:

  1. do step A;
  2. do step B;
  3. (missing steps);
  4. prove that P=NP (or achieve true AI).

I realized that trying to fill the blanks was useless. As long as I know what the next few steps are, I am in good shape. The trick is to constantly revise.

Google is fighting back against cars: Google transit

Filed under: Science and Technology — Daniel Lemire @ 10:05

In North America, getting from point A to point B on foot or by bus can be quite difficult. Hence, many people prefer cars because, in some sense, they are easier. With cheap GPS devices, getting around in cars is really easy. Need to be somewhere at 6am? Your car will help you… Want to use public transportation? Good luck.

Fortunately, Google is fighting back against cars! I tried Google transit. It helps you plan your trips using public transportation. What a brilliant tool!

Is there anything Google cannot do? I am really worried that they do too good a job. If I wanted to launch a Web start-up, I would hate to have to fear Google…

Source: Usually, I push new technology onto my wife. Not the other way around. This time however, my wife made me discover Google transit. Thanks Nathalie!

Friday, October 31st, 2008

A no free lunch theorem for database indexes?

Filed under: Data Warehousing and OLAP, Science and Technology — Daniel Lemire @ 10:01

As a trained mathematician, I like to pull back and ask what fundamental limitations I face.

A common misconception in our Google era is that database performance is a technical matter easily fixed by throwing enough hardware at the problem. We apparently face no fundamental limitation. To a large extent, this statement is correct. Thanks to the B-tree and related data structures, we can search for most things very quickly. Roughly, the database problems are solved, as long as you consider only a specific type of query: queries targeting only a small subset of your data.
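To make the "small subset" case concrete, here is a minimal sketch in Python. It uses a sorted list with binary search as a stand-in for a B-tree (an assumption for illustration; a real B-tree is a disk-friendly tree, not a flat list). The names `keys`, `index_lookup` and `range_count` are hypothetical.

```python
import bisect

# Hypothetical indexed column: sorted even numbers below one million.
keys = list(range(0, 1_000_000, 2))

def index_lookup(sorted_keys, key):
    """Point query: binary search, about log2(n) comparisons."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

def range_count(sorted_keys, low, high):
    """Count keys in [low, high) with two binary searches."""
    return bisect.bisect_left(sorted_keys, high) - bisect.bisect_left(sorted_keys, low)

print(index_lookup(keys, 424242))  # True
print(index_lookup(keys, 424243))  # False
print(range_count(keys, 10, 20))   # 5

# A query that touches most of the data gets no help from the index:
# it still has to visit every key.
total = sum(keys)
```

Both selective queries finish after roughly twenty comparisons, whereas the final sum reads all 500,000 keys no matter how the data is indexed.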

What if you want to consider more general queries? Can you hope to find a magical database index that solves all your problems?

I spent part of my morning looking for a No Free Lunch (NFL) theorem for database indexes. I would like to propose one:

Any two database indexes are equivalent when their performance is averaged across all possible content and queries. (Naturally, it is ill-posed.)

I draw your attention to how limited your interaction is with tools that produce aggregates, such as Google Analytics. Basically, you navigate through precomputed data. You may not realize it, but the tool does not let you craft your own queries. Jim Gray taught us to push these techniques further with structures such as the data cube (which Wikipedia insists on calling an OLAP cube). However, precomputation is just a particular form of indexing. It helps tremendously when the queries take a particular form, but when you average over all possible queries, it is unhelpful.
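As a rough illustration of precomputation, the following Python sketch builds a tiny data cube over a made-up fact table. The table, its column names and the `build_cube` helper are assumptions for the sake of the example, not anything taken from Google Analytics or from Jim Gray's work.

```python
from collections import defaultdict

# Hypothetical fact table: (country, browser, visits).
rows = [
    ("CA", "Firefox", 10),
    ("CA", "Safari", 4),
    ("FR", "Firefox", 7),
    ("FR", "Opera", 2),
]

def build_cube(rows):
    """Precompute total visits for every group-by; None means 'all values'."""
    cube = defaultdict(int)
    for country, browser, visits in rows:
        for c in (country, None):
            for b in (browser, None):
                cube[(c, b)] += visits
    return cube

cube = build_cube(rows)
print(cube[("CA", None)])       # 14: all visits from CA
print(cube[(None, "Firefox")])  # 17: all Firefox visits
print(cube[(None, None)])       # 23: grand total
```

Any query of the form "total visits for this country and/or this browser" is now a single dictionary lookup. A query outside that form, say visits from browsers whose name starts with "F", is not in the cube and falls back to scanning the rows.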

What if you want to consider general queries, and still get fast results? Then you have to assume something about your data. I suggest you assume that your data is highly compressible. Run-length encoding has been shown to help database queries tremendously.
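Here is a minimal Python sketch of why run-length encoding can help, assuming a column with long runs of identical values (for instance, after sorting). The column contents and the `rle_encode` and `count_equal` helpers are hypothetical.

```python
from itertools import groupby

def rle_encode(column):
    """Compress a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def count_equal(runs, target):
    """Count matching rows by visiting runs instead of individual rows."""
    return sum(length for value, length in runs if value == target)

# Hypothetical sorted column with long runs.
column = ["CA"] * 5 + ["FR"] * 3 + ["US"] * 4
runs = rle_encode(column)
print(runs)                     # [('CA', 5), ('FR', 3), ('US', 4)]
print(count_equal(runs, "FR"))  # 3
```

The query cost depends on the number of runs (three here), not on the number of rows (twelve), so the more compressible the column, the cheaper the query. On data without runs, the encoded form is no smaller and nothing is gained.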

Short of these two types of assumptions (specific queries or specific data sets), the only way you can improve the current indexes is by constant factors—you can double the performance of existing B-tree indexes, maybe. Or else, you can throw in more CPUs, more disks, and more memory.

For now, I will stick with my puny computers, and I will assume that the data is highly compressible. It seems to work well in real life.

Thursday, October 30th, 2008

How I built my Web presence as a researcher…

Filed under: Academia/Research — Daniel Lemire @ 12:00

Suzanne Bowness asked me to answer some questions for a paper she is preparing. I reproduce here the content of the interview. It is mildly incoherent.


When did you first start your web site? Has your purpose for it evolved over the time that it has been online? How did you decide what sections to include?

I started my web site as a graduate student, circa 1995. Initially, my goal was to keep track of my favourite web sites, and share them with the world. Later, I began to post my academic papers online systematically. Informally, I also added a news section to my web site.

Around 2004, academic blogging emerged as a new trend. Researchers like Stephen Downes showed that blogging could be an integral part of one’s research activities. Therefore, I replaced my hand-crafted news section with a bona fide blog. Later on, I added a French blog for my students.

Do you have a sense of what parts of your web site are most commonly consulted? What would you recommend that other professors include on a web site? Anything you would avoid?

My blog is read by over 1000 people. Judging by the comments alone, well-known professors and researchers read my blog. Of course, many of them have blogs as well, including Peter Turney (NRC) and David Eppstein (UCI).

In fact, my blog is more than a publication venue: it is an integral part of my networking activities as a researcher.

Most professors should not become bloggers. However, they should all make sure that their papers, their data, their software, their courses and their talks are available online. There is mounting evidence that making your work easier to browse and download is beneficial to your academic career.

How does your web site help you in reaching out to students? How does it help to raise your public profile?

I believe that most of my students do not read my blog. Other students do, certainly. I have not yet found a good way to integrate blogging with teaching. Indeed, whereas most teaching in universities happens in closed groups, blogging appears to require large open social networks to be effective.

I find however that many graduate students enjoy the fact that I make my papers and my software available online freely.

Did you design your site yourself? What software do you use to maintain it? Any advice for other profs in terms of technical upkeep or updating frequency?

Like many science professors, I initially designed my Web site using a text editor and raw HTML. I maintain my own blog engine (WordPress).

Universities often do not have the resources to help professors maintain effective Web sites. Fortunately, many inexpensive solutions are available (wordpress.com, blogger.com, and so on). In fact, most academic bloggers do not blog from within the university networks. People have even coined a term for this do-it-yourself strategy: edupunk.

I spend many hours every week publishing content online. Some may consider it wasteful. However, I believe university professors are in the communication business.

Any other comments/advice on creating a web presence?

In 2008, the Web is a social phenomenon. Merely posting content is no longer enough. You have to find a way to interact dynamically with people interested in your work and ideas.
