Sunday, January 22nd, 2006

kd-tree applet

Filed under: — lemire @ 1:04

k-dimensional trees are a clever way to select all points in a rectangle using O(sqrt(n)+k) time where n is the total number of points and k is the number of points in the range. I usually don’t care for applets as a way to illustrate results (though I have two on my own web site doing just that!), but this kd-tree applet is a great trick. I post this here because I want to use it in a course if I ever have to teach this concept. The applet is in the public domain, but was written by Jacob Marner.

I resisted posting the applet here, because I know Java can crash some browsers or create other trouble. Sad. Java is an ok language and I really wish it would work in the browser.

Thursday, January 19th, 2006

Sex.com sold for $14 million dollars

Filed under: Business / Economics / Politics — lemire @ 16:07

Slashdot reports that the sex.com domain name sold for $14 million. Recall that people spend billions every year buying software, which effectively costs nothing to copy.

The best route, these days, to become really rich is to find a way to create something that effectively costs nothing to produce, though it may cost a lot to design, and then, find a way to make sure you are the only one who can make an infinite number of copies for free. The problem is that we are collectively wising up to this get-rich-quick scheme.

I think that a more durable model, though a much less effective one, is to find a way to offer a service nobody else can match. Google comes to mind. It is a model based on constant hard work which appeals a lot more to the Catholic boy in me.

Saving Bandwidth with CSSTidy

Filed under: — lemire @ 12:32

My CSS files get verbose over time. CSS is just not a great language or else, I’m not very good at it. Anyhow, I found out about CSSTidy which is a nifty tool to optimize your CSS files. There is also an online version. CSS optimization is a cool topic and it might get to be quite a challenge as selectors become more complex.

However, given the following test:


montant {
color: red;
font-weight: bold;
background:white;
font-style: normal;
text-align: center;
}
nom {
color: white;
background:white;
font-style: normal;
text-align: left;
}
texte {
color: black;
text-align: center;
font-style: normal;
background:white;
text-align: left;
}

CSSTidy failed to rewrite it in the obvious way:


montant {
color: red;
font-weight: bold;
text-align: center;
}
nom {
color: white;
text-align: left;
}
texte {
color: black;
text-align: left;
}
montant, nom, texte {
background: white;
font-style: normal;
}

Before you rush out and try to implement your own CSS optimizer, notice that the Flumpcakes CSS Optimizer found the right solution.

Quasi-Monotonic Segmentation Talk in Ottawa

Filed under: Abstracts — lemire @ 10:49

I’m giving a talk next week at the Text Analysis and Machine Learning Group (TAMALE) seminar at the University of Ottawa. I will talk on Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation. It is not directly related to text and machine learning, but many of the ideas from time series data mining port over to text processing. After all, a sequence is a sequence. I see Joel Martin wil also give a talk there this Spring on “Libminer”. Here’s the abstract for my talk:

Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting an array in up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem, present an optimal linear time algorithm based on novel formalism, and compare experimentally its performance to a linear time top-down regression algorithm. We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.

Wednesday, January 18th, 2006

Google was eating all my bandwidth!

Filed under: — lemire @ 17:24

Some of you who tried to access my web site in recent days have noticed that it was getting increasingly sluggish. In an earlier post, I reported that Google accounted for 25% of my page hits, sometimes much more. As it turns out, these two issues are related. Google was eating all my bandwidth.

I investigated the matter and found out that Google was spending a lot of time spidering some posting boards I host. So, I did two things: I created a robots.txt file which tells Google to stop indexing the content of the posting boards, and I deleted all messages older than 90 days in these posting boards (which resulted in the deletion of 200,000+ messages). Both of these actions are bad for the web. I wanted people to have access to these archives. I wanted to keep them. I have gigabytes of storage, but I’m far more limited on bandwidth!

I’ll report here about how it goes, but this tells me that Google has reached the limits of freshness and exhaustivity. And no, I’m not the only one worrying about Google using up too much of my bandwidth. If we get to a point where Google accounts for 25% of all web traffic, what are we going to do collectively?

I don’t believe the solution lies in the webmasters. I don’t want to have to tell Google, in details, what to index and when to index it. However, I could imagine a standard by which Google could query a web service and determine what content, if any, has changed. Similarly, given a directory of static HTML page, there is got to be a way for Apache to tell Google what files have changed in the recent past. I’m amazed there isn’t a standard way to do this yet.

I know Robin will tell me to use Sitemaps, but from the look of it, while it looks easy to create a Google Sitemap for static content, creating a Sitemap for a complex site made of static content, wordpress pages, posting boards and so on, is far more daunting. I don’t want to spend the next week working on such a stupid project. This has to be automated.

Tuesday, January 17th, 2006

Technorati allows time-based text mining

Filed under: Data Warehousing and OLAP — lemire @ 23:03

Matthew is reporting that technorati now allows you to plot word usage frequency over time in the blogosphere. Here’s the usage of the word “segmentation” over time:

Technorati Chart

I think BlogPulse has been offering this sort of things for some time. I’m confused by the relationship between these various services. However, these services could benefit from OLAPish concepts (shameless plug):

Steven Keith, Owen Kaser, Daniel Lemire, Analyzing Large Collections of Electronic Text Using OLAP, APICS 2005, Wolfville, Canada, October 2005.

Linear Time Algorithm for Approximating a Curve by a Single-Peaked Curve

Filed under: Abstracts — lemire @ 22:34

Here is an interesting paper by Jinhee Chun, Kunihiko Sadakane, and Takeshi Tokuyama.

Given a function y = f(x) in one variable, we consider the problem of computing the single-peaked (unimodal) curve y=Φ(x) minimizing the L2-distance between them. If the input function f is a histogram with O(n) steps or a piecewise linear function with O(n) linear pieces, we design algorithms for computing Φ in linear time. We also give an algorithm to approximate f with a function consisting of the minimum number of unimodal pieces under the condition that each unimodal piece is within a fixed L2-distance from the corresponding portion of f.

It reminds me of this other paper:

N. Haiminen, A. Gionis, K. Laasonen, Algorithms for unimodal segmentation with applications to unimodality detection, to appear in the Journal of Knowledge and Information Systems (KAIS).

« Previous PageNext Page »

37 queries. 0.709 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.