Write a Twitter application in 5 minutes

I spend much time alone, writing and thinking. Twitter helps me stay connected. I love the platform.

On Friday, I wanted to find the intersection between the users followed by any two individuals. Indeed, suppose that you like both Joe and Jill, and they have similar interests. Maybe whoever they both read is also interesting? I could not find a tool to do it, so I built it with python-twitter.

Anybody with a working knowledge of Python can do it in less than 5 minutes. I used only twenty lines of code (in total!!!). The code proved immediately useful.

If you do not know Python or Ruby. Learn one or the other. Tonight. It is powerful stuff.

Manifesto for Half-Arsed Academic Research

  • Research results are more important than the number of publications or citations.
    This is fine. Yet, we don’t have time to read your papers. So, just keep publishing a lot of papers each year. And get your influential friends to cite you. That’s how we’ll know whether you are good.
  • Science and truth are more important than spin and marketing.
    Yes, but keep pretending you will solve world hunger. And align your research results with the current fashionable trends.
  • You cannot tell where the next science breakthrough is going to come from.
    Maybe. Still, we want a plan of your research activities for the next five years.

Further reading: The hard truth about research grants and The secret behind radical innovation.

Source : Manifesto for Half-Arsed Agile Software Development via John D. Cook.

Counterintuitive factors determining research productivity

  • Permanent researchers publish more when they are in smaller labs.
  • Having many Ph.D. students fails to improve productivity.
  • Funding has little effect on research productivity.

Reference: Carayol, N. and Matt, M., Individual and collective determinants of academic scientists’ productivity, Information Economics and Policy 18 (1), 2006.

Further reading (on this blog): To be smarter, ignore external rewards, Is collaboration correlated with productivity?, Big schools are no longer giving researchers an edge?

Working long hours is stupid

We do too much. We carry too many projects. This overproduction creates problems which we try to fix by working even more.

We value most what we create (see Made by hand and The upside of irrationality). To be happy, you want to focus on making interesting stuff. This takes time and dedication. Yet, as Graham’s essay The top idea in your mind stresses, we often fall into the trap of thinking mostly about money and personal disputes. These thoughts pull us away from our interests and prevent us from doing great work. As an example, I hear that Tiger Woods isn’t playing great golf. I bet he is either stuck into money problems or personal disputes, or both.

It is hard to be overworked by writing a book, by writing research articles or by playing golf. People are overworked dealing with email, context switching, money, and touchy relationships. This abundance of work makes people sad and boring. And this type of work tends to reproduce. The more you have, the more you will have.

Unemployment and pollution are visible results of our overproduction. Yet, there are many more negative side effects. In academia, we train more and more Ph.D.s every year. Yet, we have had too many Ph.D.s in the job market since the seventies. We write more and more research papers every year, and spend more and more time applying for research grants… but professors spend less and less time on curiosity-driven research.

It is cool to produce great work, but it is not cool to work 60 hours a week unless it is out of passion. And nobody is passionate about grant applications, marking papers or handling difficult people. Moreover, working long hours does not scale: you can’t increase your output continuously.

Our productivity will keep improving. I can write software faster and better than ever. I can research prior work with ease. I can ask fancy mathematical questions on the Web and get answers in minutes. Instead of investing back this productivity into more silly work, we need to get smarter:

  • Focus on the essential: programming great software, writing a fun book, a set of inspiring lecture notes or an insightful article.
  • Automate, reduce or delegate. Reduce is best: doing fewer things is cool!
  • A focus on money or on personal disputes makes you stupid. Yet, that’s where success often takes you. Watch out!
  • Airplanes pollute. Travel takes you away from your family. Cars pollute and make you fat. Do you need all that junk?

How to get everyone talking about your research!

Deolalikar claims to have solved the famous P versus NP problem. Is the proof correct? Some influential researchers doubt it: Scott Aaronson is betting 200k$ of his own money against Deolalikar.

What I find most interesting is that Deolalikar did not submit the paper to a journal, as far as I know. He didn’t even post it on arxiv like Perelman. Yet, he is receiving much attention. His name is being tweeted several times a minute. Many of the most influential theoretical computer scientists are reacting to the paper. He is getting the best peer review possible. Most similar papers don’t get so much attention.

Why is this paper different?

  • Everyone seems to agree that the paper is well written, it has nice (color!) figures and the reference section appears up-to-date and complete.  If your result is important, communicate it well.
  • Deolalikar has published just a handful of papers in theoretical computer science, and none at the major conferences. But he has enough peer-reviewed research papers to be treated as a peer.
  • While I doubt he was hired to work on complexity theory, Deolalikar is an industry researcher at HP. Being paid to do research might make you more credible.

Further reading: Deolalikar’s publication list on DBLP, A Proof That P Is Not Equal To NP? by Lipton and P ≠ NP by Baker.

Update: Porreca has the best write-up on reactions to this paper.

Update 2: The consensus after two weeks is that the proof wrong and unfixable.

Is multiplication slower than addition?

Earlier, I asked whether integer addition was faster than bitwise exclusive or. My tests showed no difference, and nobody contradicted me.

However, everyone knows that multiplication is slower than addition? Right? In cryptography, there are many papers on how to trade multiplications for additions, to speed up software.

So? Can you predict which piece of code runs faster?

scalar product (N multiplications):

for(int k =0; k < N ; ++k)
answer += vector1[k] * vector2[k];

scalar product two-by-two (N multiplications):
for(int k =0; k < N ; k+=2)
answer += vector1[k] * vector2[k]
+vector1[k+1] * vector2[k+1];

non-standard scalar product (N/2 multiplications):
for(int k =0; k < N ; k+=2)
answer += ( vector1[k] + vector2[k] )
* ( vector1[k+1] + vector2[k+1] );

just additions (no multiplication):
for(int k =0; k < N ; ++k)
answer += vector1[k] + vector2[k];

Answer: Merely reducing the number of multiplications has no benefit, in these tests. Hence, simple computational cost models (such as counting the number of multiplications) may not hold on modern superscalar processors.

My results using GNU GCC 4.2.1 on both a desktop and a laptop:

algorithm Intel Core i7 Intel Core 2 Duo
scalar product 0.30 0.39
scalar product (2×2) 0.25 0.39
fewer multiplications 0.25 0.39
just additions 0.16 0.23

Times are in seconds. The source code is available without pointer arithmetics. The same test with pointer arithmetics gives faster results, but the same conclusion. I tried a similar experiment in Java. It confirms my result.

General versus domain intelligence

Our brains come with hard-wired algorithms. Cats can catch birds or mice without thinking about it. I can grab and eat a strawberry without thinking. The Savanna-IQ Interaction Hypothesis says that general intelligence may originally have evolved as a domain-specific adaptation to deal with evolutionarily novel, nonrecurrent problems. We can derive from this hypothesis that people with better general intelligence won’t be better at routine tasks. In fact, they may fare worse at it! They may only have an edge for novel tasks. Thus, general and domain intelligence may be somewhat separate entities.

How do you recognize people with better general intelligence? They are better at adapting to new settings. They are the first to adopt new strategies. But they may not be very good at baseball or boxing, and they may be socially inept.

Modern Artificial Intelligence (and Machine Learning) is typically domain-specific. My spam filter can detect spam, but it won’t ever do anything else. Our software has evolved to cope with specific problems. Yet, we still lack software with general intelligence. Trying to build better spam filters may be orthogonal to achieving general intelligence in software. In fact, software with good general intelligence may not do so well at spam filtering.

Reference: Satoshi Kanazawa, Kaja Perina, Why night owls are more intelligent, Personality and Individual Differences 47 (2009) 685–690

Further reading: Language, Cognition, and Evolution: Modularity versus Unity by Peter Turney

Summer reading: my recommendations

containmentContainment by Christian Cantrell is an excellent sci-fi novel. And you can grab it nearly for free from the author’s page. The premise of the book is that humanity built a colony on Venus. Children  are told that Earth cannot be reached. Massive research into economical oxygen production is required for long term survival. Indeed,  plants cannot survive on the surface of Venus. Or can they? Couldn’t we design special plants that could survive? One of the young researchers sets out to answer the question. Unfortunately, he won’t like the answer. The plot may not be extraordinary, but there are many things to like for computer nerds. For example, the book is set in a future where we appear to have cheap quantum computing. Or, at least, some very fast computers. One of the consequence is that any sufficiently smart kid can break any encryption. Moreover, it is cheaper to simulate most physical experiments than to actual execute them.

atrocity archiveThe Atrocity Archives by Charles Stross is the first in an ongoing series of books. Stross was a software engineer, and it shows. His book reveals many secrets all Computer Scientists should know. For example, do you know why Knuth will never finish the Art of Computer programming, no matter what he tells us? Here’s a quote:

The [Turing] theorem is a hack on discrete number theory that simultaneously disproves the Church-Turing hypothesis (wave if you understood that) and worse, permits NP-complete problems to be converted into P-complete ones. This has several consequences, starting with screwing over most cryptography algorithms—translation: all your bank account are belong to us—and ending with the ability to computationally generate a Dho-Nha geometry curve in real time.

This latter item is just slightly less dangerous than allowing nerds with laptops to wave a magic wand and turn them into hydrogen bombs at will. Because, you see, everything you know about the way this universe works is correct—except for the little problem that this isn’t the only universe we have to worry about. Information can leak between one universe and another. And in a vanishingly small number of other universes there are things that listen, and talk back—see Al-Hazred, Nietzsche, Lovecraft, Poe, et cetera. The many-angled ones, as they say, live at the bottom of the Mandelbrot set, except when a suitable incantation in the platonic realm of mathematics—computerised or otherwise—draws them forth. (And you thought running that fractal screensaver was good for your computer?)

The five most important algorithms?

Bernhard Koutschan posted a compilation of the most important algorithms. The goal is to determine the 5 most important algorithms. Out of his list, I would select the following five algorithms:

  • Binary search is the first non-trivial algorithm I remember learning.
  • The Fast Fourier transform (FFT) is an amazing algorithm. Combined with the Convolution theorem, it lets you do magic.
  • While hashing is not an algorithm, it is one of the most powerful and useful idea in Computer Science. It takes minutes to explain it, but years to master.
  • Merge sort is the most elegant sorting algorithm. You can explain it in three sentences to anyone.
  • While not an algorithm per se, the Singular Value Decomposition (SVD) is the most important Linear Algebra concept I don’t remember learning as an undergraduate. (And yes, I went to a good school. And yes, I was an A student.) It can help you invert singular matrices and do other similar magic.

NoSQL or NoJoin?

Several major players built alternatives to conventional database systems: Google created BigTable, Amazon built Dynamo and Facebook initiated Cassandra. There are many other comparable open source initiatives such as CouchDB and MongoDB. These systems are part of a trend called NoSQL because it is not centered around the SQL language. While there has always been non SQL-based database systems, the rising popularity of these alternatives in industry is drawing attention.

In The “NoSQL” Discussion has Nothing to Do With SQL, Stonebraker opposes the NoSQL trend in those terms:

(…) blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.

In effect, Stonebraker says that all of the benefits of the NoSQL systems have nothing to do with ditching the SQL language. Of course, because the current breed of SQL is Turing complete, it is difficult to argue against SQL at the formal level. In theory, all Turing complete languages are interchangeable. You can do everything (bad and good) in SQL.

However, in practice, SQL is based on joins and related low-level issues like foreign keys. SQL entices people to normalize their data. Normalization fragments databases into smaller tables which is great for data integrity and beneficial for some transactional systems. However, joins are expensive. Moreover, joins require strong consistency and fixed schemas.

In turn, avoiding join operations makes it possible to maintain flexible or informal schemas, and to scale horizontally. Thus, the NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the join operation.

How do we compute joins? There are two main techniques :

  • When dealing with large tables, you may prefer the sort merge algorithm. Because it requires sorting tables, it runs in O(n log n). (If your tables are already sorted in the correct order, sort merge is automatically the best choice.)
  • For in-memory tables, hash joins are preferable because they run in linear time O(n). However, the characteristics of modern hardware are increasing detrimental to the hash join alternative (see C. Kim, et al. Sort vs. Hash revisited. 2009).

(It is also possible to use bitmap indexes to precompute joins.) In any case, short of precomputing the joins, joining large tables is expensive and requires source tables to be consistent.

Conclusion: SQL is a fine language, but it has some biases that may trap developers. What works well in a business transaction system, may fail you in other instances.

Next Page »

18 queries. 0.425 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.