Research productivity versus funding received

Who should you keep funding? The scientist who receives $10,000 and has an impact factor of 10, or the scientist who receives $1,000,000 and has an impact factor of 15?

Source: The idea is from Sébastien Paquet.

Publish or Perish: the Tool

Through Sébastien Paquet, I found a software application called Publish or Perish. It queries Google Scholar and computes statistics for you automagically. It works well. Linux and Windows versions are available. The Windows version runs under MacOS if you have Wine.

Online teaching is the future?

Recently, Bill Gates gave us the main reason for the ongoing revolution in university teaching:

Fortunately for all of you, you’re in a generation where all of these courses are going to be online and basically free. I’m taking solid state physics from MIT, though MIT doesn’t know it. You are far more empowered in terms of your ongoing education than any other generation has ever been.

It is, indeed, about empowerment. By teaching students online, we are telling them that they can learn at any time and at any place, not just when they are in a classroom. News flash: you don’t need to be in the physical presence of a Physics professor to learn Physics. You do not need a professor to read a textbook.

Michael asked a related question: why isn’t university free? Why don’t we just put all of the teaching material online, for free? Well, Michael, I do so. Right now. Except for graduate students, I never meet students. I have tutors who answer students’ emails and correct papers.

There are a few things you should know about online teaching however:

  • Mostly, professors are not there to provide scientific content. There are books, online articles, Wikipedia, and existing online videos from which all of the content can be derived. There is the odd exception, mostly at the graduate level, where no textbook or online material is available. But even original courses are mostly aggregates of existing material.
  • The professor and his assistants provide the structure and the motivation. Offering solved problems and examples is a great way to get the students going.
  • Watching 3 hours of video a week on your computer screen is really boring. A large fraction of what happens in a classroom has nothing to do with the material.
  • A course is mostly about assignments and tests. The whole concept of a structured course falls apart if you do not get a grade at the end.
  • Most of the expenses in a well-run university have to do with salaries. Thus, online teaching is every bit as expensive as classroom teaching.

Why is online learning taking off now? It is all about the bandwidth and latency. You do not need video to have a great online course, but you need to offer the student a lot of interesting data quickly and interactively. Web technology has reached the point where you can surpass the classroom with a well-designed online course.

IDEAS 2008 (April 27, 2008 / September 3-5, 2008)

The Twelfth International Database Engineering & Applications Symposium (IDEAS 2008) will be held in Münster, Germany.

Should we fear Google?

Google is getting into the health records business. What happens when a single company has full access to your emails, your videos, your family pictures and your health records?

Abuses are possible, but I predict that not much will happen. The American NSA is recording and mining a large fraction of all Internet communications. The same is happening in China.

Privacy is an illusion. Even if governments and businesses did not spy on us already, cryptography and technology do not protect you against social engineering. Even if all your data were safely guarded by the best possible technology, a secretary with a USB key can open up all of your data to mobsters, political opponents or your wife.

How can we fight back? Sadly, I believe we will need to change our expectations. Lots of people will know your secrets; get used to it. If you have cancer, people will learn about it. If you cheated on your wife, people will know. We are moving toward Brin’s transparent society whether we like it or not.

As for Google… unlike the Chinese government and the NSA, Google has given me powerful tools to offset my loss of privacy.

When a terabyte is small

With Kamel and Owen, I am working on a paper involving database indexes. We had over a terabyte of space, and yet, in the middle of the production of the paper, we ran out of space. Only a year ago, I thought that one terabyte was large.

So I asked our technician about getting a new drive. He came back with a small 500 GB drive. I asked how much it cost; he said “$200.”

This is a new frontier for me. Producing a simple research paper required us to generate more than one terabyte of data. Moreover, we will generate much more data before the paper is finished.

Assuming I write, say, 4 research papers a year, I will generate over 4 terabytes of data a year at my current rate, which is going to cost me about $1600 in storage.
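The arithmetic behind that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the drive price and size are the figures quoted above, and the one terabyte per paper is the lower bound mentioned in the post, so the real cost would be somewhat higher.

```python
# Figures quoted in the post: a 500 GB drive costs about $200.
DRIVE_SIZE_TB = 0.5
DRIVE_PRICE_USD = 200

# Assumptions taken from the post: 4 papers a year, at least 1 TB each.
PAPERS_PER_YEAR = 4
TB_PER_PAPER = 1.0


def yearly_storage_cost(papers=PAPERS_PER_YEAR, tb_per_paper=TB_PER_PAPER):
    """Return (terabytes per year, cost in USD) at the quoted drive price."""
    total_tb = papers * tb_per_paper
    price_per_tb = DRIVE_PRICE_USD / DRIVE_SIZE_TB
    return total_tb, total_tb * price_per_tb


total_tb, cost = yearly_storage_cost()
print(f"{total_tb:.0f} TB per year, about ${cost:.0f}")
```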

FlexDBIST 2008 (March 18, 2008 / September 1-5, 2008)

The Third International Workshop on Flexible Database and Information Systems Technology (FlexDBIST-08) will be held in Turin this year. The scope is defined by this sentence: “Enterprises and organizations need to deal with such heterogeneous and often very large volumes of data, which may also be uncertain, imprecise and incomplete.”

New Trends in Physical Data Warehouse Design (April 18, 2008)

Ladjel Bellatreche is organizing a special issue on New Trends in Physical Data Warehouse Design for the journal Distributed and Parallel Databases. Submissions are through the journal’s online system. You can read the call for papers on EventSeer.

Recommending Journal Articles in a Scientific Digital Library

André Vellino will give a talk on recommender systems in our offices (100 Sherbrooke West, room 2720) at 12:30pm this Thursday (February 21st 2008).

Recommender systems for scientific digital libraries that have been the subject of experiments in recent years have used corpora that are primarily in the field of computer science. However, designing an effective recommender system for journal articles in a broader Scientific, Technical and Medical (STM) digital library poses special challenges and presents unique opportunities.

This talk describes a recommender system for scientific scholarly articles that is both hybrid (content and collaborative filtering based) and multi-dimensional (across different rating criteria.) Our hypothesis is that such a design for a recommendation engine can improve scientists’ ability to discover new knowledge from a digital library provided that an interface to these recommendations can simultaneously offer explanations for the recommendations and increase the user’s control over how the recommender behaves.

The talk will be webcast at mms://mediasrv.lorit.ca/presentation using Microsoft-only technology.

It says that the talk will be in French.

External-Memory Shuffles?

We need to shuffle the lines in very large variable-length-record flat files.

We can load the files in MySQL and do “select * from mytable order by rand().”

However, loading the data in a DBMS and dumping it out is cumbersome. So, we do an in-memory shuffle block by block. It comes close to a full random shuffle, but I am worried it might not be good enough.
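To see why a block-by-block shuffle can fall short of a full shuffle, here is a minimal sketch of the most naive variant. The post does not say exactly how the blocks are processed, so the function name `block_shuffle` and the fixed `block_size` are assumptions made for illustration only.

```python
import random


def block_shuffle(lines, block_size, rng=random):
    """Shuffle each consecutive block of `block_size` lines independently."""
    out = []
    for start in range(0, len(lines), block_size):
        block = lines[start:start + block_size]  # copy of one block
        rng.shuffle(block)  # in-memory shuffle of this block only
        out.extend(block)
    return out


lines = [f"line{i}" for i in range(10)]
shuffled = block_shuffle(lines, block_size=5, rng=random.Random(0))

# Every line stays inside its original block of 5, so line0 can never
# land in the second half of the output.
assert set(shuffled[:5]) == set(lines[:5])
```

In this naive form, a line never leaves its block, so many permutations of the file are unreachable. Variants that also shuffle the order of the blocks, or that run several passes with different block boundaries, get closer to uniform but still do not guarantee it.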

Does anyone know of a fast way to do a full external-memory shuffle using only Ruby, Perl, Python, and other Unix utilities?

A crazy idea: Let n be the number of lines. Shuffle the numbers between 1 and n, in-memory. Then prepend these numbers at the beginning of each line, do external-memory sorting with the Unix command sort, and remove the random numbers from the final file. This will scale up to about 100 million lines on a PC.
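A minimal Python sketch of this idea follows. The function name `shuffle_file_with_sort` and the tab-separated key format are choices made for illustration, not part of the original description. It assumes a Unix `sort` on the PATH, and that only the list of n integers has to fit in memory.

```python
import os
import random
import subprocess
import tempfile


def shuffle_file_with_sort(src_path, dst_path, rng=random):
    """Shuffle the lines of src_path into dst_path using numeric keys and sort."""
    # Count the lines first so that we know n.
    with open(src_path, "rb") as src:
        n = sum(1 for _ in src)

    # In-memory shuffle of 1..n: the only part that must fit in RAM.
    keys = list(range(1, n + 1))
    rng.shuffle(keys)

    with tempfile.TemporaryDirectory() as tmp:
        keyed = os.path.join(tmp, "keyed.txt")
        # Prepend "key<TAB>" to each line.
        with open(src_path, "rb") as src, open(keyed, "wb") as out:
            for key, line in zip(keys, src):
                if not line.endswith(b"\n"):
                    line += b"\n"  # normalize a missing final newline
                out.write(b"%d\t" % key + line)

        sorted_path = os.path.join(tmp, "sorted.txt")
        # The sort utility spills to temporary files when the input is large.
        subprocess.run(
            ["sort", "-n", "-k1,1", "-t", "\t", "-o", sorted_path, keyed],
            check=True,
            env={**os.environ, "LC_ALL": "C"},
        )

        # Strip the key, that is, everything up to the first tab.
        with open(sorted_path, "rb") as src, open(dst_path, "wb") as out:
            for line in src:
                out.write(line.split(b"\t", 1)[1])
```

Because the keys are a permutation of 1 to n, there are no ties for sort to break, so the output order is exactly the shuffled order of the keys.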

It might be possible to generate random numbers from a large set, thus making collisions very unlikely, thereby avoiding a shuffle of numbers altogether.
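As a rough sketch of this variant, the code below draws an independent random key per line and estimates how likely a collision is. The 64-bit key width and the function names are assumptions; the post does not fix either. The bound used is the standard birthday (union) bound, n(n-1)/2 divided by the number of possible keys.

```python
import secrets

KEY_BITS = 64  # assumed key width; the post does not specify one


def collision_probability_bound(n, key_bits=KEY_BITS):
    """Birthday upper bound on P(at least two of n random keys are equal)."""
    return n * (n - 1) / 2 / 2 ** key_bits


def add_random_keys(lines, key_bits=KEY_BITS):
    """Yield 'key<TAB>line' with an independent random key per line."""
    width = (key_bits + 3) // 4  # fixed-width hex, so a plain text sort works
    for line in lines:
        yield f"{secrets.randbits(key_bits):0{width}x}\t{line}"


# For 100 million lines and 64-bit keys the bound is below 0.0003.
print(collision_probability_bound(100_000_000))
```

Since `add_random_keys` is a generator, nothing but the current line needs to be in memory. The keyed lines can then be sorted on the key and the key stripped afterwards. If two keys do collide, the only effect is that the relative order of those two lines is decided by the sort's tie-breaking rather than by chance.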

Definitive solution? There is a GNU command called shuf, but it works only for small files. However, the new sort command has a --random-sort flag. To install a recent sort on MacOS, do the following:

wget http://ftp.gnu.org/gnu/coreutils/coreutils-6.9.tar.bz2
tar xvjf coreutils-6.9.tar.bz2
cd coreutils-6.9
mkdir osxbuild && cd osxbuild
../configure --prefix=/usr/local/stow/coreutils-6.9 jm_cv_func_svid_putenv=yes && make
sudo cp src/sort /usr/local/bin/
sudo cp ../man/sort.1 /usr/local/share/man/man1/

However, sort --random-sort will only shuffle properly if no line is repeated, because it sorts by a random hash of each line, so identical lines end up next to each other. You need to first pass your file through cat -n to ensure that all lines are unique, then conclude with cut to remove the counter added by cat. Here is a command that does the trick:


cat -n myfile.csv | sort --random-sort | cut -f 2-

It may not generate a perfect shuffle, but it should be close.
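To illustrate why repeated lines are a problem, here is a small Python model of a hash-based random sort. It is only a sketch: the salted MD5 key is an assumption standing in for whatever keyed hash sort uses internally, and both function names are made up for the example.

```python
import hashlib
import os


def random_sort(lines, salt=None):
    """Order lines by a salted hash of their content."""
    salt = os.urandom(16) if salt is None else salt
    return sorted(lines, key=lambda s: hashlib.md5(salt + s.encode()).digest())


def numbered_random_sort(lines, salt=None):
    """Mimic cat -n | sort --random-sort | cut -f 2- on a list of lines."""
    numbered = [f"{i}\t{line}" for i, line in enumerate(lines, start=1)]
    return [s.split("\t", 1)[1] for s in random_sort(numbered, salt)]


data = ["a", "b", "a", "c", "a", "b"]
plain = random_sort(data)

# Equal lines share a hash, so all three copies of "a" end up adjacent.
first_a = plain.index("a")
assert plain[first_a:first_a + 3] == ["a", "a", "a"]

# With the line number prepended, every key is distinct and copies can separate.
print(numbered_random_sort(data))
```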

Update:  I have written a follow-up blog post.
