Chaining CAPTCHAs for fun and profit?

A CAPTCHA is a type of challenge-response test used in computing to determine whether a user is human. Yahoo! is having major difficulties with its CAPTCHAs. Russian hackers are able to pass their Turing tests with 35% accuracy. Some human beings say that their accuracy is 80% on these same tests.

This accuracy is not an historical breakthrough since academics got almost perfect scores on earlier Yahoo! CAPTCHAs. Nevertheless, 35% appears to be good enough to make the CAPTCHA nearly useless.

I have a neat way to decrease the accuracy for the robots while maintaining the human accuracy — at the cost of more work for the human being. Basically, you “chain” the CAPTCHAs by asking the human being to get at least k good answers out of N queries.

Say you have an p=80% probability of entering the right answer as a human being. What is the probability that you will get at least 3 out of 4 right? My answer is F(p) = 4 p3(1-p)+p4. That is, the human being still has an 80% probability of passing the test, while the robot chances went from 35% down to 13%.

I have been unable to find any reference to CAPTCHA chaining. The closest reference I found is this paper:

Ran Canetti, Shai Halevi, Michael Steiner, Hardness Amplification of Weakly Verifiable Puzzles, Lecture Notes in Computer Science, 2005.

If anyone knows how to spin this into an interesting research paper, please get in touch.

I predict that telling human beings apart from evil machines is going to become an emerging and very important research area. The war against artificial intelligences has already started. No. I am not kidding: evil people are building networks of evil software as we speak.

Closed-source software is the source of innovation?

Geoff cites an article by Jaron Lanier arguing that closed-source software is the source of innovation, that open source software is only polishing copies. The gist of the argument is there:

Why are so many of the more sophisticated examples of code in the online world—like the page-rank algorithms in the top search engines or like Adobe’s Flash—the results of proprietary development? Why did the adored iPhone come out of what many regard as the most closed, tyrannically managed software-development shop on Earth? An honest empiricist must conclude that while the open approach has been able to create lovely, polished copies, it hasn’t been so good at creating notable originals. Even though the open-source movement has a stinging countercultural rhetoric, it has in practice been a conservative force.

First of all, Page Rank was not designed in a proprietary-software environment. It was designed at Stanford. And Google is not exactly a proprietary-software shop. They use Linux, Python and plenty of open source software. They also give back a lot to the open source communities.

Apple is hot with innovations? Maybe, but before they took a sharp turn back toward BSD and built on open source software, they were going to hell with their proprietary operating system. Several key Apple components are open source software, including their Web browser engine.

Wikipedia? Built on MediaWiki, MySQL, PHP, MemCached. YouTube? Built on MySQL, Python, and MemCached.

There is plenty of innovation going on in the open source world. It just does not come in a box with a CD-ROM. Open source software is the greatest innovation enabler in our history.

A first draft of HTML 5… toward a new HTML?

W3C just published today a first draft of HTML 5. HTML 5 replaces HTML 4 and XHTML 1.

  • They are getting rid of the “acronym” elements because it was rarely used.
  • The elements “canvas,” “video”, “audio” are added: the HTML becomes fully multimedia. However, MathML and SVG remain embedded content.
  • They added the “article,” “section,” and “figure” elements which should be useful to blogs and news sites.

The network is the bottleneck?

There is a really nice article on StorageMojo about Cloud Computing. Cloud Computing is more or less the idea that you can offload your storage and processing tasks to a very large set of computers, typically maintained by some large company (such as Amazon). The novelty is that you abstract out where the data is held and which machine does the processing — not unlike what MapReduce does.

This new level of abstraction is probably a significant step forward in the way we write software. Google has shown that hand-crafting parallel algorithms is often neither necessary nor useful.

In any case, StorageMojo points out that memory and CPU cycles are becoming ridiculously cheap. Meanwhile, bandwidth is becoming a serious limitation. Therefore, he says, companies are not about to outsource their computing needs to cloud computing. It is too inexpensive to create — and maintain — your own computing infrastructure.

The first problem with his argument is that bandwidth increases with storage, automagically. Indeed, I can just drop a large hard drive in the trunk of my car and drive to destination! Our real curse is latency. However, if the organization you work for is like mine, latency-resilience is already built in the system. In a university, nobody expects the data entered yesterday to come up in the report next week.

The second problem with his argument is that paying experts to maintain your own server is almost certainly more expensive than whatever bandwidth costs cloud computing can occur. Maintaining databases and number-crunching computers is a boring task and it will get unavoidably outsourced.

Tracking call for papers… with a wiki?

WikiCFP is a tool to track call for papers collaboratively using a wiki. The call for papers are entered in categories: you can follow only the Machine Learning, Natural Language Processing, or databases call for papers. You can subscribe to RSS feeds for each category.

What a good idea!

On the sum of power laws

Many real-life data sets have power laws or Zipfian distributions. An integer-valued random variable X follows a power law with parameter a if P(X=k) is proportional to k-a. Panos asked what the sum of two power laws was. He cites Wilke at al. who claim that the sum of two power laws X and Y with parameters a and b is a power law with parameter min(a,b).

I relate this problem to the sum of exponentials. Any engineer knows that if a>b, then eat + ebt will be approximately eat for t sufficiently large.

Hence, I think that the sum of power law distributions X and Y is a power law distribution with parameter min(a,b) if you are only interested in large values of k in P(X+Y=k).

For extra credit, help me solve this problem. Suppose that I have two power laws with the same parameter. Is their sum a power law with the same parameter? (I predict it does not!)

Egghe showed in The distribution of N-grams that even if the words follow a power law, the n-grams won’t!

Disclaimer: Yes, I am being lazy. I could work it out.

What is an effective social network?

Many democratic systems require vote diversity. You do not get elected prime minister of Canada by rallying the largest number of voters. You also need to have your votes spread out over several regions.

Similarly, Scott Karp argues that completely open social networks fail. He takes two examples: Digg and Wikipedia.

Digg recommends web sites based on user votes. They recently modified their algorithm:

The algorithm change effectively holds back from the homepage any story that is Dugg by the same groups of friends, i.e. a group that is not “diverse,” (…)

As for wikipedia, Karp points out that it is not a really open system since a group of editors have a great deal of control.

Stephen Downes asks an interesting question: what constraints make a network effective?

The wisdom of crowds is not obtained by mere voting. What is required — as the new Digg algorithm explicitly recognizes — is diversity.

I would like to formalize this problem. You are given a set of users and their votes on several issues as in the Digg community. You are not given out explicitly what the cliques — or set of friends — are. Is there a canonical way to take into account diversity when counting votes?

Who should be buying expensive commercial database systems?

According to Curt Monash, few people should be buying high-end Database management systems:

There are relatively few applications that wouldn’t run perfectly well on PostgreSQL or EnterpriseDB today. (…)

What’s more, these mid-range database management systems can have significant advantages over their high-end brethren. The biggest is often price, for licenses and maintenance alike. Beyond that, they can be much easier to administer then their more complex counterparts. (…)

And what these mid-range DBMS don’t do today, they likely will do soon. (…)

EnterpriseDB is equal or superior in every way I can think of to Oracle7, a few security certifications perhaps excepted.

If you work for an organization that has expensive contracts with Oracle or Microsoft for their DBMS, it is most certainly in vain.

Meanwhile, the world of open source Business Intelligence is getting more interesting every day. We now have Pentaho Mondrian, Jedox, Birt, Enhydra Octopus, and so on. In 2005, I asked whether open source was ready for Business Intelligence. The question seems less controversial in 2008, doesn’t it?

Most of the database industry has been commoditized. If you stick around with these old schemas, you lose.

Research questions about… tag clouds?

Tag clouds are graphical representations of attributes and their relative importance. In a recent paper, we have argued that tag clouds might help bridge the gap between collaboration and Business Intelligence.

Here are some fun things to do with tag clouds:

  • In our paper, a tag cloud computation is the equivalent of an approximate orthogonal top-k range query. There has been little work in this area. We propose error measures for this problem. Our own approach is based on the pre-computation of icebergs.
  • Unlike bar charts, a tag cloud can have 50, 100 or 150 attributes. It makes it easier to collaborate because you do not need so often to rely on hierarchies. However, tag clouds tend to mix badly with non-nominal dimensions such as time or price. More generally, more work is needed on multidimensional tag clouds.
  • The problem of optimally drawing tag clouds is still very much open.

My top blog posts in 2007

Now that January 2008 is coming to an end, maybe it is time to give 2007 a final loop. According to my logs, my most popular blog posts in 2007 are:

Next Page »

18 queries. 0.420 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.