David Donoho was among the first researchers to promote reproducible research through software publication (see Buckheit and Donoho, 1995). Fifteen years later, Donoho and his collaborators are even more insistent:

Scientific computation is emerging as absolutely central to the scientific method. Unfortunately, it’s error-prone and currently immature—traditional scientific publication is incapable of finding and rooting out errors in scientific computation—which must be recognized as a crisis. An important recent development and a necessary response to the crisis is reproducible computational research in which researchers publish the article along with the full computational environment that produces the results. (Donoho et al., 2009)

Their 2009 paper on reproducibility is insightful and well worth reading. I agree that sharing software is good for science, and for scientists.

Unfortunately, I fear we might lose sight of why we must publish our software.

  1. In theory, scientists should be constantly checking each other’s results. But that is not how science is done. You are rewarded for finding something new, not for checking someone’s results. So hardly anyone will ever download your code to check whether you cheated.
  2. Reproducibility and repeatability are not the same thing. It is great that I can rerun your code. But it does not follow that your code and results are right or useful.

Share your source code to spread your ideas:

  • Keep your packages simple. People need a few key pieces of code that they can integrate in their own software.
  • Use popular languages. Remember that repeatability is not enough: people are likely to tear apart your software to reconstruct their own.
  • Go beyond academia. Why assume academic researchers are the people who matter? Spreading your ideas among engineers is important as well.

The reproducibility that matters is getting people to use your ideas. Merely proving you are honest falls short of your potential!

Further reading: Open Sourcing your software hurts your competitiveness as a researcher?

5 Comments

  1. I always enjoy your thoughts on that subject. Regarding (1), do you think it’s a good thing? Can we really reach a point where researchers make their results reproducible if there are no incentives for others to check these results? As you correctly point out, reproducing is more involved than merely repeating. Who will invest the time needed to do this?

    Also, I’d love to hear your opinion on two different ways to conduct open research. One is to publish the final result as open source code. The other, which is even more appealing to me, is to perform your research in a glass box. For example, by using an online public repository to commit your code on a daily basis, or by continuously publishing your intermediate research notebooks, etc.

    Comment by Philippe Beaudoin — 20/4/2010 @ 14:41

  2. Actually, Jon Claerbout’s work on reproducibility predates Donoho’s: http://sepwww.stanford.edu/data/media/public/sep//jon/reproducible.html

    SEP is mentioned with dates going back at least to 1990.

    Comment by Carlos Scheidegger — 20/4/2010 @ 15:52

  3. @Philippe

    I think we can learn a lot by spending more time thoroughly reviewing and studying previous results. A lot of hidden assumptions are the source of great research.

    So maybe that’s my answer. People should work, and are working, to reproduce results because it is fun and fruitful.

    As for working in a glass box, I have not dared to do this yet. I hope to spend more time thinking about it.

    Comment by Daniel Lemire — 20/4/2010 @ 16:03

  4. @Carlos

    Thanks for the link. (I did not think Donoho was the first, but the idea of reproducible computational research is often attributed to him, informally.)

    Comment by Daniel Lemire — 20/4/2010 @ 16:24

  5. I have reviewed papers with open data. It was great: I ran some stats and proved to myself that their stats were at least correct. I even found a mistake in one of their p-values, where they made an erroneous decision on the border (it didn’t affect the rest of the paper).

    It probably helped the paper, as I could demonstrate that at least their stat methods were reproducible.

    Comment by Anonymous — 26/4/2010 @ 11:13
