The most important Theoretical Computer Science problem is inconsequential

Some consider the P = NP problem to be the most important Theoretical Computer Science problem. It asks whether all problems whose solution can be verified quickly, can also be solved quickly. If you can answer this question, you win one million dollars.

The catch is that quickly is defined as in polynomial time. Thus, if you can solve a problem in time O(nx) where x is the number of electrons in the universe, then you are quick.

This is an annoyingly silly definition. Bubble sort is in O(n2). Yet, try replacing all sorting within Google’s infrastructure by this quick algorithm. Google would die.

Lance Fortnow asks us to take for granted that proving P = NP implies we have fast algorithms for all NP problems. William Gasarch just predicted that proving P ≠ NP would also help real world of algorithms.

Unknowingly to them, Zeilberger proved that P=NP on April 1, 2009. Yet, nothing happened. (He was kidding. Or was he?) Anyhow, enough with the dogma! While intermediate steps in the solution of the problem might be critically important to our understanding of computation, knowing that P = NP is inconsequential technologically.

This is as silly as claiming that sending men to Mars will cure cancer. It might very well that the research necessary to send men to Mars might lead to major breakthroughs, but whether we go to Mars or not has nothing to do with cancer.

Yes, please send men on Mars. Yes, please work on the P = NP problem. But stop claiming the answer would change the world.

Further reading: See my older post Is P vs. NP a practical problem?. Dick Lipton wrote a related blog post as well.

Relational databases: are they obsolete?

Michael Stonebraker is predicting that the dominance of the generic relational database is coming to an end. Having recently founded several database companies, he has a vested interested in this prediction .

Here is Stonebraker logic: we can outperform relational databases with specialized solutions. Therefore, users will migrate to specialized engines. In effect, specialized players such as Vertica will grab market shares from Oracle Database and Microsoft SQL Server.

Unfortunately, Stonebraker’s arguments are misleading. As far as performance is concerned, Stonebraker is obviously right: we are undergoing major changes. As pointed out by Daniel Tunkelang, you can store a lot of data in 32GB of RAM. Solid-state drives can be used to wipe out some IO bottlenecks. Yet, these technological changes will not change the game for two reasons:

  • We have always been able to outperform generic relational databases: (1) column stores have been around since the seventies when they were called transposed files (2) search engines have always used their own indexes (3) lightweight key-value engines like Tokyo Cabinet have always been around. Generic relational databases did not achieve dominance due to their superior performance.
  • Generic relational databases are frequently catching up to specialized engines. In particular, they are not limited to row stores. Curt Monash’s blog post on Oracle’s hybrid columnar approach makes this obvious. Nicolas Bruno, in Teaching an Old Elephant New Tricks, predicted that the lessons learned by start-ups such as Vertica will be integrated into traditional relational engines.

Further reading: I was motivated by the latest StorageMojo blog post. See also my blog posts Trading compression for speed with vectorization, Changing your perspective: horizontal, vertical and hybrid data models, Column stores and row stores: should you care? and Native XML databases: have they taken the world over yet?

The hard truth about research grants

You must do many silly things to get a large research grant:

  • You must know precisely what you will do for the next five years. Yet, in my experience, good researchers only have a vague idea of where they will be in 5 years. If you know the promising research directions you will encounter, you are an astrologer, not a scientist.
  • Your plan must be infallible. Yet, in my experience, research ideas that never fail are not research ideas at all! They are mere applications of known principles. There is no research without risk. The more ambitious the research, the riskier it is.
  • Your work must revolutionize your field. Yet, as stated above, it must also be  infallible.
  • Your work must have crucial applications. Make it up if you must! When Einstein came up with E=mc2, he should have pointed out the possibility of dropping nasty bombs on Japan. Again, we expect scientists to know the future.
  • You must work in large teams, packing up a broad range of researchers. Yet, there is no clear evidence that correlation exists between the resort to extramural collaboration and overall research performance. The goal—once more—is to minimize risks. It is unclear whether it actually reduces the risk at all. And the cost is added bureaucracy.
  • You must promise to train many new Ph.D.s. Somehow, flooding the job market so that there are hundreds of qualified applicants for every research job is highly rewarded.

Thankfully, Peter A. Lawrence has many good ideas on how to improve the system:

  • Grant applications should be short. Quick to prepare. Quick to process. Currently, it is not uncommon for scientists to spend months writing grant applications, and weeks reviewing them. Wasteful!
  • You should be able to apply for grants based on your record alone. The requirement to submit a project or research program is silly.
  • We should be critical of large groups. For some large projects, such as building a multi-billion-dollar machine, you need a group. To investigate the properties of some algebraic structure, you don’t.
  • You should only offer a handful of research papers for review. Pick your top 3 papers and don’t mention the rest. In Canada, NSERC applied this idea, but they did not go far enough. They ask us for our top contributions, and then they ask for the full list. By reviewing researchers on only their best work, we would lift the incentive to produce many weak papers.

I conclude from a quote by Michael Nielsen:

Some years back I constructed a list of papers I especially admired, and was surprised to discover that with only a few exceptions they were produced from unfunded research. This was sobering, since it suggest that receiving research grants was (…) anticorrelated with doing work of the highest quality. Grants seem to be good at sustaining an established area, but not very good at all at producing the conceptual innovations that start new subfields. (Michael Nielsen)

(And before anyone jumps at this… Michael did not write that having a large grant was incompatible with highly original research.)

How things change: Cheaters are Innovators

If you seek approval above all else, you are unlikely to innovate outside the rigid bounds of the current system:

  • You do not convince existing journals to give more respect to this new field you created. You go out and create your own journals and conferences. John von Neumann did not wait for his colleagues to approve of his work on Computers. In fact, he had to use threats to get what he wanted.
  • You do not convince libraries to embrace e-commerce. You create Amazon.com.
  • You do not convince university librarians to stop worrying about what the publishers need. You go out and create Google Scholar.
  • You do not wait for Amazon.com’s management to approve the use of a recommender system. You do what Greg Linden did and squeeze the feature in during a test.
  • If you can prove Poincaré conjecture, you don’t wait for a journal to approve your work, you just post it on arxiv.

Changing a system from the inside is inefficient and failure-prone. Stop wasting your time and make the old system irrelevant. Do not seek approval. Go out and test your ideas yourself. Listen respectfully to others, but make up your own mind. Work within a large company, scientific community or University if you must, but never fall under the illusion that you can change it by playing fair. Cheating is the fundamental mechanism by which change happens. Cheaters are innovators.

How are you planning to cheat today?

Further reading: Is scientific publishing about to be disrupted by Michael Nielsen

Where do the best mathematicians come from?

Americans think that the best scientists come from their best universities. To learn more, consider where the influential mathematicians from:

got their degree in the USA 33%
got their Ph.D. in the USA 58%
working in the USA 68%

The American scientific dominance relies partially on the ability of the USA to attract and retain the best and the brightest:

  • 33.6% of European PhDs were attracted to faculty or research positions in the US.
  • 55% of non-European foreign PhDs were attracted to faculty positions in the US.
  • Only 10% of American Ph.D.s go work abroad.

Reference: Influential Mathematicians: Birth, Education and Affiliation

Further reading: It may not matter all that much where you go to college and Big schools are no longer giving researchers an edge?

Changing your perspective: horizontal, vertical and hybrid data models

Data has natural layouts:

  • text is written from the first to the last word,
  • database tables are written one row at a time,
  • Google presents results one document at a time,
  • the early recommender systems compared users to other users,
  • discussions are organized in newsgroups and posting boards by topic,
  • research papers are organized in journals and conferences,
  • objects have attributes (a ball is red), and from these attributes we determine similarities between objects.

Using a database terminology, these are horizontal layouts.

We can rotate these models to create vertical layouts:

  • Instead of writing text sequentially, we can store the locations of each word in an inverted index.
  • Instead of writing database tables row-by-row (e.g., Oracle, MySQL), we can write databases column by column (e.g., C-Store/Vertica, LucidDB, Sybase IQ, and my Lemur Bitmap Index Library).
  • Instead of presenting results sets one document at a time, we present tag clouds and use faceted search to support exploration. Thus, instead of listing documents, we focus on attributes (date, topic, author).
  • Recommender systems are often more scalable when they compare items instead of users: the most famous example is Greg Linden‘s Amazon recommender (if you liked this book, you may like…). For example, the Slope One algorithms outperform many user-to-user algorithms.
  • The social web started out with topic-oriented newsgroups and posting boards, but it is not dominated by user-centric blogs and social sites (such as Facebook or Twitter). Since then, we have realized that user-oriented blogs can be preferable.
  • While research papers are published in conferences and journals, I argue that we should turn this around and organize research papers by author through author-specific feeds.
  • Some AI researchers are suggesting that relations might be primary whereas attributes would be secondary.
Many of the best solutions are hybrids. For example, text search sometimes require full-text indexes such as suffix arrays and Oracle recently announced a row/column hybrid.
Take away message: If you are stuck, try to rotate your data model. If neither the vertical nor the horizontal model is a good fit, create an hybrid.

Toward author-centric science

Too many research papers in Computer Science are nonsense: they convey no worthy message. Yet, they pass a Turing test of sort: at a glance, they are indistinguishable from interesting research papers. In fact, they are designed as nonsense from the beginning: the authors mimic the output of good research. The goal is to appear to be doing valued research. Some of these papers appear in top conferences, and even go on to be highly cited.

What is the difference between a Stephen King novel and the average horror manuscript? At first glance, probably not much. After all, it is all a matter of taste. Yet, if you have read enough novels, you can recognize King’s mastery. Avid readers know who are the best writers and they stick with them. It is not sufficient to get a manuscript accepted by King’s publisher to get the attention of his readers. People first care for the author. Stephen King is the brand.

In science, we publish by venue, by journal and by topic. Yet, if we followed specific researchers, the same way we follow specific writers, these researchers would have a strong incentive to produce interesting work. It would not be profitable to write many uninteresting research papers since nobody would follow your work.

The idea is not new nor original. Back in 2005, I urged researchers to setup RSS feeds for their publications. Arxiv recently made author feeds available. Creating RSS feeds is technically easy. We have no excuse.

Please: if you want me to read you, make it easy to receive notice whenever you publish a new research paper. If enough researchers do it, we might improve the quality science significantly. Why wouldn’t you want people to follow your work?

Further reading: Scholarly Communications must be Syndicated by Gideon Burton

Eating my own dog food: You can subscribe to my research papers by email or through a reader.

Why I am not publishing in PLoS One, yet

PLoS One is a new peer-reviewed journal (2006) with many interesting features:

Unfortunately, for a Computer Scientist, it is not yet attractive:

  • The Computer Science section is filled with biology and medicine papers making use of Information Technology. In other words, the PLoS One taxonomy  confuses Information Technology and Computer Science! Thankfully, I could find one article in Natural Language Processing which might be the first and only Computer Science paper published in PLoS One. So there is hope.
  • As a related point, PLoS One is not indexed at the usual places as a Computer Science journal (DBLP, ACM DL, and so on). Of course, no Computer Science indexing is possible until PLoS One correctly classifies the Computer Science articles.

If they could fix these problems, I would gladly submit some of my work to them. PLoS One could become a useful journal in Computer Science over time. What about prestige? PLoS One uses article-level metrics. Instead of trying to be a prestigious journal, PLoS One helps you measure the impact of your own papers.

    18 queries. 0.426 seconds. Valid XHTML

    Powered by WordPress

    Subscribe to this blog in a reader or by Email.