Why libxml2 takes forever to transform XHTML files

For my Online XML Course, I process lots and lots of XHTML content using XSLT. Up until now, I avoided the libxml2 XSLT processor (xstlproc) because it was unacceptably slow.

Today, I found out what is happening. It loads all of the XHTML DTD files from W3C each and every time it processes an XHTML file. To get around the problem, use the –novalid flag when invoking the xsltproc command line. You might get warnings about problematic entities, so I suggest you try a DOCTYPE declaration like the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ENTITY laquo "&#171;">
<!ENTITY raquo "&#187;">
<!ENTITY oelig "&#339;">
<!ENTITY nbsp "&#160;">
]>

kd-tree applet

k-dimensional trees are a clever way to select all points in a rectangle using O(sqrt(n)+k) time where n is the total number of points and k is the number of points in the range. I usually don’t care for applets as a way to illustrate results (though I have two on my own web site doing just that!), but this kd-tree applet is a great trick. I post this here because I want to use it in a course if I ever have to teach this concept. The applet is in the public domain, but was written by Jacob Marner.

I resisted posting the applet here, because I know Java can crash some browsers or create other trouble. Sad. Java is an ok language and I really wish it would work in the browser.

Sex.com sold for $14 million dollars

Slashdot reports that the sex.com domain name sold for $14 million. Recall that people spend billions every year buying software, which effectively costs nothing to copy.

The best route, these days, to become really rich is to find a way to create something that effectively costs nothing to produce, though it may cost a lot to design, and then, find a way to make sure you are the only one who can make an infinite number of copies for free. The problem is that we are collectively wising up to this get-rich-quick scheme.

I think that a more durable model, though a much less effective one, is to find a way to offer a service nobody else can match. Google comes to mind. It is a model based on constant hard work which appeals a lot more to the Catholic boy in me.

Saving Bandwidth with CSSTidy

My CSS files get verbose over time. CSS is just not a great language or else, I’m not very good at it. Anyhow, I found out about CSSTidy which is a nifty tool to optimize your CSS files. There is also an online version. CSS optimization is a cool topic and it might get to be quite a challenge as selectors become more complex.

However, given the following test:


montant {
color: red;
font-weight: bold;
background:white;
font-style: normal;
text-align: center;
}
nom {
color: white;
background:white;
font-style: normal;
text-align: left;
}
texte {
color: black;
text-align: center;
font-style: normal;
background:white;
text-align: left;
}

CSSTidy failed to rewrite it in the obvious way:


montant {
color: red;
font-weight: bold;
text-align: center;
}
nom {
color: white;
text-align: left;
}
texte {
color: black;
text-align: left;
}
montant, nom, texte {
background: white;
font-style: normal;
}

Before you rush out and try to implement your own CSS optimizer, notice that the Flumpcakes CSS Optimizer found the right solution.

Quasi-Monotonic Segmentation Talk in Ottawa

I’m giving a talk next week at the Text Analysis and Machine Learning Group (TAMALE) seminar at the University of Ottawa. I will talk on Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation. It is not directly related to text and machine learning, but many of the ideas from time series data mining port over to text processing. After all, a sequence is a sequence. I see Joel Martin wil also give a talk there this Spring on “Libminer”. Here’s the abstract for my talk:

Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting an array in up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem, present an optimal linear time algorithm based on novel formalism, and compare experimentally its performance to a linear time top-down regression algorithm. We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.

Google was eating all my bandwidth!

Some of you who tried to access my web site in recent days have noticed that it was getting increasingly sluggish. In an earlier post, I reported that Google accounted for 25% of my page hits, sometimes much more. As it turns out, these two issues are related. Google was eating all my bandwidth.

I investigated the matter and found out that Google was spending a lot of time spidering some posting boards I host. So, I did two things: I created a robots.txt file which tells Google to stop indexing the content of the posting boards, and I deleted all messages older than 90 days in these posting boards (which resulted in the deletion of 200,000+ messages). Both of these actions are bad for the web. I wanted people to have access to these archives. I wanted to keep them. I have gigabytes of storage, but I’m far more limited on bandwidth!

I’ll report here about how it goes, but this tells me that Google has reached the limits of freshness and exhaustivity. And no, I’m not the only one worrying about Google using up too much of my bandwidth. If we get to a point where Google accounts for 25% of all web traffic, what are we going to do collectively?

I don’t believe the solution lies in the webmasters. I don’t want to have to tell Google, in details, what to index and when to index it. However, I could imagine a standard by which Google could query a web service and determine what content, if any, has changed. Similarly, given a directory of static HTML page, there is got to be a way for Apache to tell Google what files have changed in the recent past. I’m amazed there isn’t a standard way to do this yet.

I know Robin will tell me to use Sitemaps, but from the look of it, while it looks easy to create a Google Sitemap for static content, creating a Sitemap for a complex site made of static content, wordpress pages, posting boards and so on, is far more daunting. I don’t want to spend the next week working on such a stupid project. This has to be automated.

« Previous PageNext Page »

19 queries. 0.368 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.