Quasi-Monotonic Segmentation Talk in Ottawa

I’m giving a talk next week at the Text Analysis and Machine Learning Group (TAMALE) seminar at the University of Ottawa. I will talk on Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation. It is not directly related to text and machine learning, but many of the ideas from time series data mining port over to text processing. After all, a sequence is a sequence. I see Joel Martin will also give a talk there this spring on “Libminer”. Here’s the abstract for my talk:

Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting an array in up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem, present an optimal linear time algorithm based on novel formalism, and compare experimentally its performance to a linear time top-down regression algorithm. We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.
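
To give a flavour of the problem, here is a minimal sketch that splits an array into maximal monotonic runs with alternating signs. It is not the algorithm from the talk: it ignores the limit of K segments and the quality metric mentioned in the abstract, and the function name `monotone_runs` is made up for illustration.

```python
def monotone_runs(values):
    """Split values into maximal monotonic runs.

    Returns a list of (start, end, sign) with inclusive indices, where sign is
    +1 (non-decreasing), -1 (non-increasing) or 0 (constant input).
    Adjacent runs share their turning point.
    """
    n = len(values)
    if n == 0:
        return []
    runs = []
    start = 0
    sign = 0  # direction of the current run; 0 while only flat steps were seen
    for i in range(1, n):
        diff = values[i] - values[i - 1]
        step = (diff > 0) - (diff < 0)
        if step == 0:
            continue  # flat steps never break a run
        if sign == 0:
            sign = step
        elif step != sign:
            # Direction changed: close the run at the previous sample.
            runs.append((start, i - 1, sign))
            start = i - 1
            sign = step
    runs.append((start, n - 1, sign))
    return runs


print(monotone_runs([1, 3, 5, 4, 2, 2, 6]))
```

By construction the signs of consecutive runs alternate. Noisy data would produce far too many runs this way, which is exactly why a bound on the number of segments matters.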

Google was eating all my bandwidth!

Some of you who tried to access my web site in recent days have noticed that it was getting increasingly sluggish. In an earlier post, I reported that Google accounted for 25% of my page hits, sometimes much more. As it turns out, these two issues are related. Google was eating all my bandwidth.

I investigated the matter and found out that Google was spending a lot of time spidering some posting boards I host. So, I did two things: I created a robots.txt file which tells Google to stop indexing the content of the posting boards, and I deleted all messages older than 90 days in these posting boards (which resulted in the deletion of 200,000+ messages). Both of these actions are bad for the web. I wanted people to have access to these archives. I wanted to keep them. I have gigabytes of storage, but I’m far more limited on bandwidth!
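
A rule of this kind can be checked with Python's standard `urllib.robotparser`. This is only a sketch: the actual robots.txt file is not shown in this post, and the `/boards/` path and the `example.org` host below are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the /boards/ path is an assumption, not the real file.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /boards/
"""


def build_parser(robots_txt):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# Is Googlebot allowed to fetch a posting board page? A blog page?
boards_ok = parser.can_fetch("Googlebot", "http://example.org/boards/thread/1")
blog_ok = parser.can_fetch("Googlebot", "http://example.org/blog/")
print(boards_ok, blog_ok)
```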

I’ll report here about how it goes, but this tells me that Google has reached the limits of freshness and exhaustivity. And no, I’m not the only one worrying about Google using up too much of my bandwidth. If we get to a point where Google accounts for 25% of all web traffic, what are we going to do collectively?

I don’t believe the solution lies with the webmasters. I don’t want to have to tell Google, in detail, what to index and when to index it. However, I could imagine a standard by which Google could query a web service and determine what content, if any, has changed. Similarly, given a directory of static HTML pages, there has got to be a way for Apache to tell Google what files have changed in the recent past. I’m amazed there isn’t a standard way to do this yet.
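
For static files, the "what has changed recently" part is easy to sketch. The function below is a hypothetical helper (the name `recently_changed` and its parameters are made up for illustration) that walks a directory and lists files modified within a given number of days. How a crawler would query such a list is exactly the missing piece, so nothing here is a standard.

```python
import os
import time


def recently_changed(root, max_age_days, now=None):
    """Return sorted paths (relative to root) modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) >= cutoff:
                changed.append(os.path.relpath(path, root))
    return sorted(changed)
```

Calling `recently_changed("/var/www/html", 7)` (the path is an example) would list the files touched in the last week.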

I know Robin will tell me to use Sitemaps. While it looks easy to create a Google Sitemap for static content, creating a Sitemap for a complex site made of static content, WordPress pages, posting boards and so on is far more daunting. I don’t want to spend the next week working on such a stupid project. This has to be automated.
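
The easy half, writing the Sitemap file once you have a list of URLs and modification dates, might look like the sketch below. It uses the XML namespace of the sitemaps.org protocol (version 0.9), which may differ from what Google Sitemaps expected when this was written. The function name and the sample URL are made up, and collecting the entries from WordPress and the posting boards, the hard half, is not attempted.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol, version 0.9.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap from (url, lastmod) pairs, lastmod as 'YYYY-MM-DD'."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Made-up entry for illustration.
print(build_sitemap([("http://example.org/", "2006-01-15")]))
```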

Technorati allows time-based text mining

Matthew is reporting that Technorati now allows you to plot word usage frequency over time in the blogosphere. Here’s the usage of the word “segmentation” over time:

[Technorati chart: usage of the word “segmentation” over time]

I think BlogPulse has been offering this sort of thing for some time. I’m confused by the relationship between these various services. However, these services could benefit from OLAPish concepts (shameless plug):

Steven Keith, Owen Kaser, Daniel Lemire, Analyzing Large Collections of Electronic Text Using OLAP, APICS 2005, Wolfville, Canada, October 2005.
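
The basic operation behind such a chart, counting a word per time period, is a simple roll-up along the time dimension. The sketch below is not how Technorati or BlogPulse work, nor the method of the paper above; the function name and the input format, a list of (date, text) pairs, are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def monthly_word_counts(posts, word):
    """Count occurrences of word per (year, month) in (date, text) pairs."""
    target = word.lower()
    counts = Counter()
    for day, text in posts:
        tokens = text.lower().split()
        # Strip common punctuation so "segmentation," still matches.
        hits = sum(1 for tok in tokens if tok.strip('.,;:!?"()') == target)
        if hits:
            counts[(day.year, day.month)] += hits
    return dict(counts)


# Made-up posts for illustration.
sample_posts = [
    (date(2006, 1, 3), "Segmentation is fun."),
    (date(2006, 1, 20), "More segmentation, more segmentation!"),
    (date(2006, 2, 1), "Nothing to see here."),
]
print(monthly_word_counts(sample_posts, "segmentation"))
```

Rolling up to years or drilling down to days only changes the key built from each date, which is where the OLAP vocabulary fits naturally.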

Linear Time Algorithm for Approximating a Curve by a Single-Peaked Curve

Here is an interesting paper by Jinhee Chun, Kunihiko Sadakane, and Takeshi Tokuyama.

Given a function y = f(x) in one variable, we consider the problem of computing the single-peaked (unimodal) curve y=Φ(x) minimizing the L2-distance between them. If the input function f is a histogram with O(n) steps or a piecewise linear function with O(n) linear pieces, we design algorithms for computing Φ in linear time. We also give an algorithm to approximate f with a function consisting of the minimum number of unimodal pieces under the condition that each unimodal piece is within a fixed L2-distance from the corresponding portion of f.

It reminds me of this other paper:

N. Haiminen, A. Gionis, K. Laasonen, Algorithms for unimodal segmentation with applications to unimodality detection, to appear in the Journal of Knowledge and Information Systems (KAIS).

JOLAP versus the Oracle Java API

Some years ago (in 2000), the Java OLAP (JOLAP) spec. was proposed and it was finally ratified by all parties (including Oracle, Sun and Apple, but not IBM and Microsoft). One point that has been puzzling me is why JOLAP wasn’t more widely adopted, at least partially. (Update: though the Final JOLAP Draft was approved, the spec. was never released and there is no license available right now allowing anyone to implement the Final Draft.) For our course CS6905 Advanced Technologies for E-Business, I prepared what must be too many slides on JOLAP and the Oracle Java API. I didn’t find a comparison between the Oracle OLAP API and JOLAP, so here’s my own analysis:

  • Firstly, Oracle doesn’t implement the Common Warehouse Model (CWM). I have no experience working with CWM, but it seems like CWM is quite complex. Maybe they figured it wasn’t worth the trouble?
  • Secondly, in its OLAP API, Oracle doesn’t implement the J2EE Connector model, or anything having to do with J2EE. I suspect Oracle is not eager to depend on J2EE.
  • Thirdly, the Oracle OLAP API doesn’t have Cube and Edge objects. To me, this is a shame because I really like the Edge objects JOLAP defines. Does anyone know why Oracle didn’t integrate those in its revised 10g API?

So, we are left in an OLAP world where Microsoft’s MDX is the sole cross-vendor query language. How ironic!

Java 4K Game Programming Contest

I just stumbled on the Java 4K Game Programming Contest. This looks like an excellent contest for programmers out there trying to get into the video game industry or just trying to prove their hacking skills:

The Java 4K Programming Contest is the ultimate byte-squeezing Java challenge! Using only 4096 bytes, competitors use every trick up their sleeve to create an entire game. The current Java 4K is running from December 1st, 2005 – March 1st, 2006.

I really like the size limit. In the good old days, we really just had 4KB of internal memory. I programmed a few cool games back when computers still had green on black screens, including a full Othello implementation (including the AI), and it ran very well on probably less than 4K.

And I think 4K ought to be enough for a great game prototype. Myself, I would try to do it without using obfuscated code and try to get points for programming elegance.

Meetings == bad

According to Education Guardian, meetings are bad:

(…) having too many meetings (…) may have negative effects on the individual.

As one who goes to great lengths to avoid meetings and any synchronous event, I think we definitely need to move from a synchronous culture (meeting people at fixed times, using the phone) to an asynchronous culture (such as email).

The problem, of course, is that lots of people can’t write. Even university professors, sometimes, can’t write. Sure, they wrote a thesis one day, or so you hope, but they simply can’t sit down and communicate their ideas in written form without great effort.

I’m annoyed with and tired of these people. If your ideas are not clear enough to write them down in an email, without any fuss (read: without sending a Word document with carefully chosen fonts), then I have no time for you.

And no, I don’t think people are dumb. They just don’t want to have to think and work. Sitting down and chatting is so much easier than sitting down alone and having to write something coherent.

We have to move on to the next level.

Academic Careers for Experimental Computer Scientists and Engineers

Peter Turney sent me a pointer to Academic Careers for Experimental Computer Scientists and Engineers, which is a somewhat old (1994) book made available online.

Here are some of the chapter titles (all are available online):

Best Blonde Joke Ever!

Ok, Ernie linked to the Best Blonde Joke Ever! I’ve got to say, it is amazingly funny.

Googlebot accounts for one fourth of my page hits!

I just had a look at the browser stats for the visits to my site. The results are strange. Googlebot seems to be taking up a huge share of the traffic. I think I have read an explanation somewhere, maybe it was on Tim Bray’s site. Nevertheless, these numbers are scary:

Browser       Count   Percentage (%)
MSIE           7890   34.8
Googlebot      5631   24.8
Firefox        1956    8.6
undisclosed    1125    5.0
Yahoo           994    4.4
Bloglines       929    4.1
msn             860    3.8
Mozilla         758    3.3
Konqueror       409    1.8
Google          353    1.6
Safari          263    1.2

(The second Google is not Googlebot. The stats are for 24 hours. They exclude some parts of my web site.)

It seems that a large fraction of the visits to my site are from search engines. What does this say about the current state of the web?
