MIT fires associate professor for making up data

Slashdot points out this CNN article where we learn that the Massachusetts Institute of Technology fired an associate professor for falsifying research data. The fellow is named Luk Van Parijs; a quick search on Google doesn’t bring up his home page, and even archive.org has no trace of him. However, we get the news bite on MIT’s web site. This case seems very similar to that of Jan Hendrik Schön, who was publishing one paper every 8 days and making up the data as he went.

Two years ago, I caught a Korean Ph.D. student who was publishing 15 papers a year, most of them being copies of existing papers, sometimes not his own. I caught him by using Google: I was a referee for a paper he submitted and, after doing a Google search on a key phrase he used, I found out that he had paraphrased from start to finish a paper published by American authors in a Canadian workshop two years before. I reported the problem rather widely, if not publicly. Last time I checked, the student was still working toward his Ph.D. so, clearly, his school didn’t think it was a big deal. His school never got back to me, not even to acknowledge the report I sent. The student should have been dismissed from the Ph.D. program at once, if you ask me.

I suspect that fraud is far more widespread than most people think. When I write a paper, I spend about 80% of my time collecting experimental data. If I were to make up the data, I could publish 5 times faster. The incentive to cheat is significant.

What is worse is that you are unlikely to get caught. Most papers are never read thoroughly and results are almost never reproduced. But, even so, when you catch someone, where do you report them? What do you have to gain by reporting them?

There is also a grey area where the author misleads you, but you can’t quite call it fraud. In Computer Science, I found that trying to reproduce results from papers I read is often a frustrating and expensive experience. Most often, you don’t have enough details to reproduce the results accurately, and when you do have enough details, you often can’t match the results reported. This is why I mostly look at the theoretical analysis: you can’t easily falsify theory, all you can do is copy it. And even when you can reproduce the experimental results, you often find out that the author cheated a bit. How do authors cheat? By conveniently forgetting to include cases where their results are not good.

I say this is cultural. Author Z reports that technique X is best. If you ever come in, implement technique X and report that, after all, it doesn’t work so well, you will never be able to publish your negative result in as prestigious a venue. In this sense, the author who puts the results in the most positive light possible will always win. The same holds for grant applications: your research must be guaranteed to deliver outstanding results or you will not receive grant money.

Honesty is definitely not an important value in our community at large. Pure Mathematics and Theoretical Computer Science are probably lucky exceptions.

On the positive side, if I ever catch someone at MIT in serious wrongdoing, I now know they may come down hard on the person. This is probably a serious warning to anyone working at MIT. I don’t know whether most schools would come down as hard as MIT did on someone who cheated in order to get grant money. I have serious doubts, especially if this person is a rising star.

Comments are back! But you need to pass a reverse Turing test!

I’ve installed Boriel’s Capcha! Plugin in my copy of WordPress. “Captcha” is the acronym for completely automated public Turing test to tell computers and humans apart (see Wikipedia entry). It has worked well so far, but I had two issues during the installation:

  • The “TMP Folder” where images are stored must be inside a “www” directory; otherwise, a broken link will appear instead of the image. The plugin assumes that your web site is served from the directory /something/www/… This was not my case, but I was able to fix the problem using a symbolic link (command ln).
  • The “TrueType Folder” option must end with a slash “/”; otherwise, you will be told that fonts cannot be found.

Why use Boriel’s plugin? I tried two others, Secureimage and Bot Check. Secureimage, at least, had the issue that the captcha images would not contain any text. After some investigation, it turned out that it uses the ImageMagick library and assumes FreeType support: my server has ImageMagick but without FreeType support, so it cannot do text annotations.

If you want to see whether this is a problem for you, try to annotate an image using either the convert or mogrify command line utilities. You can recognize the problem by trying the following test:

$ mogrify someimage.jpg -draw 'text 0,0 tata' someimage.jpg

mogrify: FreeTypeLibraryIsNotAvailable (/usr/local/share/ghostscript/fonts/n019003l.pfb).

Anyhow, I sure hope that crazy spamming is over!

The average of averages is not the average

A fact that we teach in our OLAP class is that you can’t take the average of averages and hope it will match the average. This is a common enough mistake for people working with databases and doing number crunching. The two are guaranteed to match only if all of the averages are computed over sets having the same cardinality; otherwise, they generally differ. In fancy terms, the average is not distributive though it is algebraic. This phenomenon is closely related to Simpson’s Paradox.

Here is an example. Consider the following list of numbers:

  • 3
  • 4
  • 6
  • 5
  • 4.5

The average is 4.5. However, we can split the list in two:
The average of the first list is 3.5:

  • 3
  • 4

The average of the second list is approximately 5.2:

  • 6
  • 5
  • 4.5

However, the average of the two averages is (5.2 + 3.5)/2 = 4.35, which is less than 4.5!

This discrepancy can only arise when the two sets have a different number of elements.
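The example above can be checked with a few lines of Python. This is only a sketch: the helper names `partial` and `combine` are made up for the illustration. It also shows what "algebraic" means in practice: if each group keeps its sum and its count rather than its average, the overall average can be recovered exactly.

```python
from statistics import mean

values = [3, 4, 6, 5, 4.5]
first, second = values[:2], values[2:]

overall = mean(values)
# Naive approach: average the two group averages, ignoring group sizes.
naive = mean([mean(first), mean(second)])

def partial(group):
    """Summarize a group as a (sum, count) pair (hypothetical helper)."""
    return (sum(group), len(group))

def combine(partials):
    """Recover the overall average from per-group (sum, count) pairs."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

combined = combine([partial(first), partial(second)])

print(overall)   # average of the full list
print(naive)     # average of averages, lower than the true average here
print(combined)  # matches the average of the full list
```

Carrying the pair (sum, count) is equivalent to weighting each group average by the size of its group.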

MySQL 5.0 Now Available for Production Use

MySQL 5.0 is out. It now supports:

  • Stored Procedures and SQL Functions (about time!);
  • Triggers (about time!);
  • Views (about time!);
  • Cursors (about time!);
  • Information Schema — to provide easy access to metadata (I don’t know what this is);
  • XA Distributed Transactions — supports complex transactions across multiple databases in heterogeneous environments (sounds good);
  • SQL Mode — provides server-enforced data integrity for new and existing data (about time!).

However, I wouldn’t switch over any serious Enterprise project to MySQL with Oracle buying InnoDB and all, but if you are already using MySQL, this is good news indeed.

Slope One Collaborative Filtering now available in Vogoo library

I got an email from Stéphane Droux, letting me know that the Vogoo Collaborative Filtering library (GPL) now supports item-based Slope One predictors for collaborative filtering. If you are a company looking for collaborative filtering support, I think you ought to look at Vogoo.

Slope One is based on the surprising and simple realization that recommendations based on the average difference in ratings can match much more expensive algorithms while being orders of magnitude easier to maintain and implement.
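To make the idea concrete, here is a minimal sketch of the weighted variant of Slope One in Python. It is not Vogoo's code: the function names, the data layout (a dictionary of per-user ratings) and the sample users are assumptions made for the illustration.

```python
from collections import defaultdict

def train(ratings):
    """Compute average rating differences between item pairs.

    ratings: {user: {item: rating}} (hypothetical layout).
    Returns (diffs, counts), where diffs[i][j] is the average of
    rating(i) - rating(j) over users who rated both items, and
    counts[i][j] is the number of such users.
    """
    diffs = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    diffs[i][j] += ri - rj
                    counts[i][j] += 1
    for i in diffs:
        for j in diffs[i]:
            diffs[i][j] /= counts[i][j]
    return diffs, counts

def predict(user_ratings, item, diffs, counts):
    """Predict a rating for item, weighting each difference by its support."""
    item_counts = counts.get(item, {})
    item_diffs = diffs.get(item, {})
    num, den = 0.0, 0
    for j, rj in user_ratings.items():
        c = item_counts.get(j, 0)
        if j != item and c:
            num += (item_diffs[j] + rj) * c
            den += c
    return num / den if den else None

# Made-up sample data: three users, three items.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
diffs, counts = train(ratings)
prediction = predict(ratings["lucy"], "A", diffs, counts)
print(prediction)
```

Training only stores one average difference and one count per item pair, which is why the scheme is easy to maintain and update.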

Those who have not followed Slope One should know it is used by the DVD recommender site hitflip.de as well as by the MP3 recommender site InDiscover.net.

Daniel W. Drezner: a blogger was denied tenure

Through Jean-Pierre Cloutier’s blog, I got to this post by Daniel Drezner, an assistant professor at the University of Chicago who was recently denied tenure. It would seem he suspects his blogging activities have something to do with the decision, but I like his analysis:

That said, if one assumes that the opportunity cost of blogging (e.g., better or more scholarship) was the difference between tenure and no tenure – an unclear assertion at best – then it’s a tough call. From a strict cost-benefit analysis, one could argue that the doors that blogging opened could have been deferred for a few years in return for the annuity of a tenured position at Chicago. That said, if I did things only for the money, I never would have entered the academy in the first place. And I’ve enjoyed the psychic rewards of blogging way too much to regret my choice.

I think he has the right attitude.

Update: Annie Patenaude correctly points out she is the one who pointed me to Jean-Pierre Cloutier’s blog.

OpenOffice.org 2.0: It is all about politics!

Here’s an interesting interview with Louis Suarez-Potts, “community manager” of OpenOffice.org. I really like the point he is making: free software is all about communities, it is all about politics. Free software is, in part, politically motivated, which is an important difference from typical software, which is motivated by financial gain. Of course, OpenOffice.org is backed by Sun Microsystems which, I hope, is in it for the money, but many of the contributors around them are there for political reasons: they want OpenOffice to support such a language, such an operating system or such a technology. A company like Microsoft is mostly cut off from such support.

We see an understanding of this dynamic in Brazil, where the government is behind OpenOffice.org and open source in general. We see this in India and elsewhere, where governments understand that they can support OpenOffice.org and they can support open source, and the people who are benefiting are the localities. It is a politically inexpensive but valuable logic.

Oracle Java Applications on Linux

A nameless university is using Oracle’s jinitiator applets on some management web sites. Jinitiator is just Oracle’s version of the JVM, but you can use any recent JVM and be happy. The trick under Linux is to fool the browser into interpreting the MIME type “application/x-jinit-applet” (specific to Oracle) as just an ordinary applet. As it turns out, you just have to edit a small text file called pluginreg.dat.

Reference: Oracle Apps on Linux – AVallark.

See also my posts Oracle buys Hyperion, JOLAP versus the Oracle Java API, IBM, Oracle and Microsoft freeing their databases and Oracle and MySQL — is MySQL in a weak position?

Spoofing your user agent – When Firefox tells the world it is Internet Explorer

Some nameless university has some management web site requiring Internet Explorer. If you ask me, that’s a lot like requiring GM cars on some highways. Such a web site is no longer a web site, but an Internet Explorer site.

You can often get around these problems by using a Firefox extension called “User Agent Switcher”. It adds a menu and a toolbar button to switch the user agent of the browser. In effect, the web site will be fooled into thinking it is dealing with Internet Explorer.

My only regret is that, unlike Konqueror, Firefox apparently cannot spoof the user agent for specific web sites only. You switch your user agent for all sites at once.

Spam bots got to me: no more comments

Spam bots killed my server. I had fancy spam filtering code in place, but it was taking too much juice to filter all the crap being sent at me. This blog is now read-only. There are just too many people buying penis enhancers and falling for get-rich-quick scams. Stop wasting your money.
