Saturday, October 29th, 2005

MIT fires associate professor for making up data

Filed under: Academia/Research — Daniel Lemire @ 13:19

Slashdot points out this CNN article where we learn that the Massachusetts Institute of Technology fired an associate professor for falsifying research data. The fellow is named Luk Van Parijs; a quick Google search doesn’t bring up his home page, and even archive.org has no trace of him. However, we get the news bite on MIT’s web site. This case seems very similar to that of Jan Hendrik Schön, who was publishing one paper every 8 days and making up the data as he went.

Two years ago, I caught a Korean Ph.D. student who was publishing 15 papers a year, most of them copies of existing papers, sometimes not his own. I caught him by using Google: I was a referee for a paper he submitted and, after doing a Google search on a key phrase he used, I found out that he had paraphrased from start to finish a paper published by American authors in a Canadian workshop two years before. I reported the problem rather widely, if not publicly. Last time I checked, the student was still working toward his Ph.D. so, clearly, his school didn’t think it was a big deal. His school never got back to me, not even to acknowledge the report I sent. The student should have been dismissed from the Ph.D. program at once, if you ask me.

I suspect that fraud is far more widespread than most people think. When I write a paper, I spend about 80% of my time collecting experimental data. If I were to make up the data, only the remaining 20% of the work would be left, so I could publish 5 times faster. The incentive to cheat is significant.

What is worse is that you are unlikely to get caught. Most papers are never read thoroughly and results are almost never reproduced. But, even so, when you catch someone, where do you report them? What do you have to gain by reporting them?

There is also a grey area where the author misleads you, but you can’t quite call it fraud. In Computer Science, I have found that trying to reproduce results from papers I read is often a frustrating and expensive experience. Most often, you don’t have enough details to reproduce the results accurately, and when you do have enough details, you often can’t match the results reported. This is why I mostly look at the theoretical analysis: you can’t easily falsify theory, all you can do is copy it. And even when you can reproduce the experimental results, you often find out that the author cheated a bit. How do authors cheat? By conveniently forgetting to include cases where their results are not good.

I say this is cultural. Author Z reports that technique X is best. If you come along later, implement technique X and report that, after all, it doesn’t work so well, you will never be able to publish your negative result in as prestigious a venue. In this sense, the author who puts the results in the most positive light possible will always win. The same holds for grant applications: your research must be guaranteed to deliver outstanding results or you will not receive grant money.

Honesty is definitely not an important value in our community at large. Pure Mathematics and Theoretical Computer Science are probably lucky exceptions.

On the positive side, if I ever catch someone at MIT in serious wrongdoing, I now know the school may come down hard on the person. This is probably a serious warning to anyone working at MIT. I don’t know whether most schools would come down as hard as MIT did on someone who cheated in order to get grant money. I have serious doubts, especially if this person is a rising star.

Friday, October 28th, 2005

Comments are back! But you need to pass a reverse Turing test!

Filed under: — Daniel Lemire @ 19:51

I’ve installed Boriel’s Capcha! Plugin in my copy of WordPress. “Captcha” is the acronym for completely automated public Turing test to tell computers and humans apart (see Wikipedia entry). It has worked well so far, but I had two issues during the installation:

  • The “TMP Folder” where images are stored must be inside a “www” directory; otherwise, a broken link will appear instead of the image. The plugin assumes that your web site is served from the directory /something/www/… This was not my case, but I was able to fix the problem using a symbolic link (command ln).
  • The “TrueType Folder” option must end with a slash “/”; otherwise, you will be told that fonts cannot be found.

Why use Boriel’s plugin? I tried two others, Secureimage and Bot Check. With Secureimage, at least, the captcha images would not contain any text. After some investigation, it turned out that Secureimage uses the ImageMagick library and assumes FreeType support: my server has ImageMagick but without FreeType support, so it cannot do text annotations.

If you want to see whether this is a problem for you, try to annotate an image using either the convert or mogrify command line utilities. You can recognize the problem by running the following test and looking for an error like the one shown:

$ mogrify someimage.jpg -draw 'text 0,0 tata' someimage.jpg

mogrify: FreeTypeLibraryIsNotAvailable (/usr/local/share/ghostscript/fonts/n019003l.pfb).

Anyhow, I sure hope that crazy spamming is over!

The average of averages is not the average

Filed under: Data Warehousing and OLAP — Daniel Lemire @ 8:30

A fact that we teach in our OLAP class is that you can’t take the average of averages and hope it will match the average. This is a common enough mistake for people working with databases and doing number crunching. The two are guaranteed to match only if all of the averages are computed over sets having the same cardinality; otherwise, they generally differ. In fancy terms, the average is not distributive though it is algebraic. This phenomenon has a name: the fact that the average of averages is not the average is an instance of Simpson’s Paradox.

Here is an example. Consider the following list of numbers:

  • 3
  • 4
  • 6
  • 5
  • 4.5

The average is 4.5. However, we can split the list in two.

The average of the first list is 3.5:

  • 3
  • 4

The average of the second list is approximately 5.2:

  • 6
  • 5
  • 4.5

However, the average of the two averages is (5.2 + 3.5)/2 = 4.35, which is less than 4.5!

This discrepancy can only occur if the two sets have a different number of elements.
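The example can be checked with a few lines of Python. This is only a sketch: the helper name combine is made up for illustration, and carrying a (sum, count) pair for each sublist is one common way to exploit the fact that the average is algebraic, not something prescribed by any particular OLAP engine.

```python
# The list from the example, split into the same two sublists.
numbers = [3, 4, 6, 5, 4.5]
first, second = numbers[:2], numbers[2:]

def average(values):
    return sum(values) / len(values)

overall = average(numbers)

# Naive approach: average the two sub-averages, ignoring their sizes.
naive = average([average(first), average(second)])

def combine(parts):
    """Recombine (sum, count) pairs into one average (hypothetical helper)."""
    total = sum(s for s, _ in parts)
    count = sum(c for _, c in parts)
    return total / count

# Correct approach: keep (sum, count) per sublist, then recombine.
pairs = [(sum(first), len(first)), (sum(second), len(second))]
recombined = combine(pairs)

print(overall, naive, recombined)
```

The naive value comes out below the overall average, while recombining the (sum, count) pairs recovers it exactly.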

Monday, October 24th, 2005

MySQL 5.0 Now Available for Production Use

Filed under: Data Warehousing and OLAP — Daniel Lemire @ 13:18

MySQL 5.0 is out. It now supports:

  • Stored Procedures and SQL Functions (about time!);
  • Triggers (about time!);
  • Views (about time!);
  • Cursors (about time!);
  • Information Schema — to provide easy access to metadata (I don’t know what this is);
  • XA Distributed Transactions — supports complex transactions across multiple databases in heterogeneous environments (sounds good);
  • SQL Mode — provides server-enforced data integrity for new and existing data (about time!).

However, I wouldn’t switch any serious enterprise project over to MySQL, what with Oracle buying InnoDB and all, but if you are already using MySQL, this is good news indeed.

Sunday, October 23rd, 2005

Slope One Collaborative Filtering now available in Vogoo library

Filed under: — Daniel Lemire @ 14:47

I got an email from Stéphane Droux, letting me know that the Vogoo Collaborative Filtering library (GPL) now supports item-based Slope One predictors for collaborative filtering. If you are a company looking for collaborative filtering support, I think you ought to look at Vogoo.

Slope One is based on the surprising and simple realization that recommendations based on the average difference in ratings can match much more expensive algorithms while being orders of magnitude easier to maintain and implement.
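To make the idea concrete, here is a minimal sketch of the weighted variant of Slope One in Python. It is not the Vogoo implementation: the function names, the user names and the ratings below are all made up for illustration, and a real deployment would precompute and store the deviation table.

```python
from collections import defaultdict

def slope_one_deviations(ratings):
    """ratings maps user -> {item: rating}.

    Returns (dev, count): dev[j][i] is the average of (rating of j minus
    rating of i) over the users who rated both items, and count[j][i] is
    the number of such users.
    """
    diff_sum = defaultdict(lambda: defaultdict(float))
    count = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for j, rj in user_ratings.items():
            for i, ri in user_ratings.items():
                if i != j:
                    diff_sum[j][i] += rj - ri
                    count[j][i] += 1
    dev = {j: {i: diff_sum[j][i] / count[j][i] for i in diff_sum[j]}
           for j in diff_sum}
    return dev, count

def predict(user_ratings, target, dev, count):
    """Predict the rating of `target`, weighting each average difference
    by the number of users it was computed from. Returns None when no
    co-rated item is available."""
    numerator, denominator = 0.0, 0
    for i, ri in user_ratings.items():
        c = count.get(target, {}).get(i, 0)
        if i != target and c:
            numerator += (dev[target][i] + ri) * c
            denominator += c
    return numerator / denominator if denominator else None

# Hypothetical ratings on a 1-to-5 scale.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
dev, count = slope_one_deviations(ratings)
lucy_a = predict(ratings["lucy"], "A", dev, count)
print(lucy_a)
```

Everything the predictor needs is the table of average differences and the co-rating counts, which is why the scheme is cheap to update when new ratings arrive.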

Those who haven’t been following Slope One should know it is used by the DVD recommender site hitflip.de as well as by the MP3 recommender site InDiscover.net.

Saturday, October 22nd, 2005

Daniel W. Drezner: a blogger was denied tenure

Filed under: Academia/Research — Daniel Lemire @ 20:15

Through Jean-Pierre Cloutier’s blog, I got to this post by Daniel Drezner, an assistant professor at the University of Chicago who was recently denied tenure. It would seem he suspects his blogging activities have something to do with the decision, but I like his analysis:

That said, if one assumes that the opportunity cost of blogging (e.g., better or more scholarship) was the difference between tenure and no tenure - an unclear assertion at best - then it’s a tough call. From a strict cost-benefit analysis, one could argue that the doors that blogging opened could have been deferred for a few years in return for the annuity of a tenured position at Chicago. That said, if I did things only for the money, I never would have entered the academy in the first place. And I’ve enjoyed the psychic rewards of blogging way too much to regret my choice.

I think he has the right attitude.

Update: Annie Patenaude correctly points out she is the one who pointed me to Jean-Pierre Cloutier’s blog.

OpenOffice.org 2.0: It is all about politics!

Filed under: — Daniel Lemire @ 7:49

Here’s an interesting interview with Louis Suarez-Potts, “community manager” of OpenOffice.org. I really like the point he is making: free software is all about communities, it is all about politics. Free software is, in part, politically motivated, which sets it apart from typical small software projects motivated by financial gain. Of course, OpenOffice.org is backed by Sun Microsystems which, I hope, is in it for the money, but many of the contributors around them are there for political reasons: they want OpenOffice to support a particular language, operating system or technology. A company like Microsoft is mostly cut off from such support.

We see an understanding of this dynamic in Brazil, where the government is behind OpenOffice.org and open source in general. We see this in India and elsewhere, where governments understand that they can support OpenOffice.org and they can support open source, and the people who are benefiting are the localities. It is a politically inexpensive but valuable logic.
