An upcoming revolution in science? The end of academic journals?

Adam Rogers makes a bold prediction:

Eventually, printed journal articles will be quaint artifacts. Scientific papers will be living documents with data published on Web pages – commented on, linked to, and mirrored by labs doing the same work 6,000 miles away. Every research effort will have thousands of reviewers working in real time. Today’s undergrads have never thought about the world any differently – they’ve never functioned without IM and Wikipedia and arXiv, and they’re going to demand different kinds of review for different kinds of papers. It’s in their nature.

(Source: Wired, September 2006)

I think that the science media of the future will be electronic and read/write. We will annotate, link, cross-reference more than ever before in the coming years. We are in the middle of a paradigm change.

I do not think that publishing houses need to go out of business. I do not think that traditional universities and research laboratories will disappear. However, everything is growing more distributed. Physical location and physical support are becoming irrelevant.

Actually, by 2015, I will not even bother to get out of bed. I will just sit there by my computer all day. Come to think of it, that is precisely what I do right now except that I don’t stay in bed.

Scam Spam, the death of email, and Machine Learning

Tim Bray has predicted the end of email as we know it:

I don’t know about you, but in recent weeks I’ve been hit with high volumes of spam promoting penny stocks. They are elaborately crafted and go through my spam defenses like a hot knife through butter. (…) This could be the straw that finally breaks the back of email as we know it, the kind that costs nothing to send and something to receive.

Yes, Tim, I’ve been bombarded with spam too, to the point that the fraction of non-spam email has gone below 10% for the first time in years. Before you think I’m an extreme case, ask your local IT experts about the amount of spam they are receiving. Currently, no spam filter can cope with the amount of spam I’m receiving.

The only spam filter that does anything to help is Google Mail’s spam filter, but it still lets more spam through than legit emails (if I exclude mailing lists).

What is really failing us here is not the Internet per se: it is rather trivial to think of a better way to design email protocols. What is failing us is the blunt application of Machine Learning to a real-world problem.

Many Machine Learning researchers would have you believe, mostly because they really believe it, that Bayes or Neural Networks (add your favorite algorithm here) are ideally suited to solve most classification problems. That they can be tweaked to a particular problem. That in some small way, we have strong AI at our door. But we don’t. The failure of spam filters is symbolic. There is really no free lunch as far as algorithms go.

This is not to say that Machine Learning does not work. Recommender systems like those based on collaborative filtering or PageRank work. But in the real world, the best they can do is assist us. And how fancy your algorithm is does not change the equation.

The lesson here is that until we have strong AI, and this could be a long way still, if ever, we should collectively work on finding algorithms that can assist us better instead of trying to replace us.

For example, spam filters should work with the user on defining what is spam. And I don’t mean having the user train the algorithm. I mean that the user should be allowed to change and add to the spam filter. Naturally, in practice, this is hard work, very hard work, and thus, it might be simpler and better to replace the email protocols.
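
To make the idea of a filter the user can change and add to a little more concrete, here is a minimal sketch in Python. It is only an illustration: the names Rule, EditableFilter, add_rule and classify are hypothetical, and the choice of regular expressions as the rule language is an assumption, not something any existing spam filter is claimed to do.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One user-visible rule (hypothetical structure, for illustration)."""
    pattern: str    # regular expression matched against the message text
    is_spam: bool   # verdict when the pattern matches
    note: str = ""  # the user's own explanation of the rule


@dataclass
class EditableFilter:
    """A filter whose rules the user can read, reorder, add to or delete."""
    rules: list = field(default_factory=list)

    def add_rule(self, pattern, is_spam, note=""):
        # Newest rules are checked first, so the user's latest decision wins.
        self.rules.insert(0, Rule(pattern, is_spam, note))

    def classify(self, text):
        """Return (verdict, rule); (None, None) means no rule matched."""
        for rule in self.rules:
            if re.search(rule.pattern, text, re.IGNORECASE):
                return rule.is_spam, rule
        return None, None


if __name__ == "__main__":
    spam_filter = EditableFilter()
    spam_filter.add_rule(r"penny stocks?", True, "stock promotion")
    verdict, rule = spam_filter.classify("Hot PENNY STOCK tip inside")
    print(verdict, rule.note)
```

When no rule matches, the message could be handed to a statistical classifier; the point of the sketch is only that every decision can be traced back to a rule the user can see and edit.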

We have to move away from black box algorithms and embrace the fact that we lack strong AI. The intelligence is in your users, not in your software.

Taste – Collaborative Filtering for Java

Here’s yet another Collaborative Filtering library: Taste. This one is written in Java and supports Enterprise JavaBeans.

Taste is a flexible, fast collaborative filtering engine for Java. The engine takes users’ preferences for items (“tastes”) and returns estimated preferences for other items. For example, a site that sells books or CDs could easily use Taste to figure out, from past purchase data, which CDs a customer might be interested in listening to.
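
For readers who have never seen collaborative filtering, the idea of going from known preferences to estimated ones can be sketched in a few lines of Python. This is not Taste’s API or its algorithm: the function names and the choice of a cosine-similarity weighted average are assumptions made for illustration only.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def estimate_preference(prefs, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when nobody comparable has rated the item.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for other, ratings in prefs.items():
        if other == user or item not in ratings:
            continue
        weight = cosine_similarity(prefs[user], ratings)
        weighted_sum += weight * ratings[item]
        total_weight += abs(weight)
    return weighted_sum / total_weight if total_weight else None


if __name__ == "__main__":
    # Toy ratings on a 1-5 scale (made-up data).
    prefs = {
        "alice": {"cd_a": 5, "cd_b": 1},
        "bob": {"cd_a": 5, "cd_b": 1, "cd_c": 4},
        "carol": {"cd_a": 1, "cd_b": 5, "cd_c": 2},
    }
    print(estimate_preference(prefs, "alice", "cd_c"))
```

In the toy data, alice’s tastes match bob’s much more closely than carol’s, so her estimate for cd_c is pulled toward bob’s rating.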

Highly Affordable Computing (HAC)

Slashdot reports that Amazon will let you use a powerful Xeon-based machine for $0.10 per hour. This means that for $10 per hour, you can have 100 machines cranking away on some task. You need to be a Linux user though. That’s what I call Highly Affordable Computing.

Prestige is overrated?

Grigori Iakovlevitch Perelman proved the longstanding Poincaré conjecture, one of the most difficult problems in Mathematics, and posted the solution on arXiv. Instead of publishing his work in a prestigious journal, he simply dropped it on an Internet archive. Maybe the Perelman story is meant to teach us something:

If your ideas are important enough and you get them out, people will pay attention to them, whether you publish in a high prestige peer-reviewed journal or not.

For more insights, see what Downes had to say.

Google Scholar launches a “related articles” feature

If you are a Google Scholar user, you will notice that it now allows you to search for similar articles: do a query and then look for a hyperlink below one of the returned papers. I don’t usually like these fuzzy similarity queries, but sometimes, there is no other way to mine for interesting articles.

Efficient FIFO/Queue data structure in Python

For the types of algorithms I implement these days, I need a fast FIFO-like data structure. Actually, I need a double-ended queue. Python has a list type, but the name is something of a misnomer because its performance characteristics are those of a vector. Recently, I found mxQueue, which is a separate (non-free) download. Unfortunately, mxQueue has a non-Pythonic interface. Better yet, I found out that Python comes by default with a really nice queue of its own, called deque: you can find it in the new collections module.

Thus, as a good scientist, I decided to test these 3 implementations. As it turns out, collections.deque is a perfectly good FIFO data structure:

Python class               time (s)
list (Python’s default)    2.26
collections.deque          0.42
mx.Queue.mxQueue           0.42

Through other tests, I was able to verify that both collections.deque and mx.Queue.mxQueue have constant time deletion from both the head and the tail, unlike Python’s list, where deleting from the head takes time proportional to the length of the list.
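
The original benchmark script is not shown here, so what follows is only a sketch of how such a comparison might look in current Python. The function names and the queue size are assumptions; the sketch uses a list and collections.deque as FIFOs and leaves mxQueue out.

```python
from collections import deque
from timeit import timeit


def drain_list(n):
    """Use a list as a FIFO: fill it, then remove items from the head."""
    queue = list(range(n))
    out = []
    while queue:
        out.append(queue.pop(0))  # O(n): shifts every remaining element
    return out


def drain_deque(n):
    """Use a deque as a FIFO: fill it, then remove items from the head."""
    queue = deque(range(n))
    out = []
    while queue:
        out.append(queue.popleft())  # O(1)
    return out


if __name__ == "__main__":
    n = 50_000  # arbitrary size, large enough for the gap to show
    for fn in (drain_list, drain_deque):
        seconds = timeit(lambda: fn(n), number=1)
        print(f"{fn.__name__}: {seconds:.3f} s")
```

Both functions return the items in the same order; only the cost of removing from the head differs, and the gap widens as the queue grows.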

Embedding fonts for IEEE

IEEE requires that your PDF files embed all fonts. Earlier, I told you how to get pdflatex to embed all its fonts, but what if you are including figures that have hanging fonts and you are getting desperate? Luke Fletcher gives us the solution.

  1. convert to ps: pdftops ICRA05.pdf
  2. convert back to pdf using prepress settings: ps2pdf14 -dPDFSETTINGS=/prepress ICRA05.ps
  3. check new ICRA05.pdf for horrendous formatting errors due to double conversion.

(Source: Owen Kaser.)
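
If you have to do this for many files, the two conversion steps above could be wrapped in a small Python script. This is a sketch, assuming pdftops and ps2pdf14 are installed and on your PATH; the function names and the “-embedded” suffix for the output file are my own choices (the recipe above writes back to the original file name instead).

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the two conversion commands for the given PDF file."""
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")
    # Assumption: write to a new file rather than overwrite the original.
    out = pdf.with_name(pdf.stem + "-embedded.pdf")
    return [
        ["pdftops", str(pdf), str(ps)],
        ["ps2pdf14", "-dPDFSETTINGS=/prepress", str(ps), str(out)],
    ]


def embed_fonts(pdf_path):
    """Run both conversions, stopping at the first command that fails."""
    for command in build_commands(pdf_path):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in build_commands("ICRA05.pdf"):
        print(" ".join(command))
```

Step 3 still has to be done by eye: open the new file and look for formatting errors.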

A Tectonic Shift in Global Higher Education

(…) India, which accounts for a quarter of the developing world’s population and has the third largest higher education system in the world. Today, 23 percent of all higher education enrollments in India are in distance education–specifically in 13 national and state open universities and 106 institutions, mostly public, that teach both on campus and by correspondence. The government’s target is that by 2010, 40 percent of all higher education participation will take place using distance education.

(A Tectonic Shift in Global Higher Education, Sir John Daniel et al., August 2006.)

Google launches online, shareable, spreadsheet tool!

Google has done it again! Spreadsheets.google.com offers free (as in “no money”) shareable, online spreadsheets. The UI feels a lot like Excel and you can save and load Excel documents. Unfortunately, it does not appear to support the Open Document Format. Unlike with Excel, you can easily share your Google spreadsheet.
