Thursday, May 31st, 2007

The Google Similarity Distance

Filed under: — Daniel Lemire @ 7:54

I read a paper this morning on the Google Similarity Distance by Cilibrasi and Vitányi. They measure how closely two terms are related by counting word co-occurrences with the Google search engine. Their formula goes as follows: (G(x,y)-min(G(x,x),G(y,y)))/max(G(x,x),G(y,y)) where G is the “Google code” function. The Google code function is defined as G(x,y) = -log g(x,y), where g(x,y) is the normalized number of web pages containing both term x and term y; the normalization is such that if you sum g(x,y) over all x,y, you get 1.0. In particular, G(x,x) is just the code of the single term x, so the distance is 0 when x and y always appear together and grows as they co-occur less often. With this simple approach, they seem to be able to translate between English and Spanish, build a thesaurus, and so on. This reminds me a bit of the recent work done by Turney on analogies.
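To make the formula concrete, here is a minimal Python sketch of how one might compute it from raw page counts. The function names, the handling of zero counts, and the sample numbers are my own assumptions for illustration, not something taken from the paper; in particular, `total` stands in for the normalizing constant and would have to be estimated in practice.

```python
import math


def google_code(count, total):
    """Google code G = -log g, with g = count / total (normalized page count).

    `total` is an assumed normalizing constant; a term that never occurs
    gets an infinite code length.
    """
    if count <= 0:
        return math.inf
    return -math.log(count / total)


def google_similarity_distance(fx, fy, fxy, total):
    """Distance (G(x,y) - min(G(x,x), G(y,y))) / max(G(x,x), G(y,y)).

    fx, fy: page counts for each term alone; fxy: pages containing both.
    """
    gx = google_code(fx, total)
    gy = google_code(fy, total)
    gxy = google_code(fxy, total)
    if math.isinf(gxy):
        # The terms never co-occur: treat them as infinitely far apart.
        return math.inf
    largest = max(gx, gy)
    if largest == 0.0:
        # Both terms appear on every page; the ratio is undefined, so
        # this sketch reports distance 0 by convention.
        return 0.0
    return (gxy - min(gx, gy)) / largest


# Illustrative counts only (hypothetical inputs, not live search results).
d = google_similarity_distance(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(d, 2))  # about 0.44
```

Note that the base of the logarithm does not matter here, since it cancels in the ratio.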

1 Comment

  1. Nice. BTW did you see this Google Desktop similarity app?

    http://googlesystem.blogspot.com/2007/05/visualize-google-desktop-results-on.html

    Comment by Shane — 6/6/2007 @ 1:34
