DEXA 2006 (February 21, 2006 / September 4-8, 2006)

The 17th International Conference on Database and Expert Systems Applications (DEXA 2006) call for papers is out. It will be held in Krakow, Poland.

The aim of DEXA 2006 is to present both research contributions in the area of database and intelligent systems and a large spectrum of applications, whether already implemented or still under development. DEXA will offer the opportunity to extensively discuss requirements, problems, and solutions in the field. The workshop and conference should inspire a fruitful dialogue between developers in practice, users of database and expert systems, and scientists working in the field.

Oracle and MySQL — is MySQL in a weak position?

Oracle has recently bought Innobase, which makes a library that MySQL relies upon for storing its tables. One user on Slashdot had the following insightful comment:

Among the technologies that MySQL licenses from third parties under commercial redistribution licenses:

Berkeley DB (Sleepycat Software)
InnoDB (Oracle, formerly Innobase)
MaxDB (SAP AG)

See the problem? MySQL itself is largely a language parser and a simple and technically inadequate storage engine (for anything where data integrity matters). In other words they don’t own any of the foundations of their technologies.

This is interesting. We always encourage developers to use and reuse existing libraries. Should MySQL be blamed for doing so?

The comparison with PostgreSQL is interesting. PostgreSQL is developed in a decentralized way, as opposed to MySQL, which is developed by a single company using third-party libraries.

I think that MySQL could definitely be a fragile product whose development could be impaired by various business decisions. However, I think this has nothing to do with the fact that MySQL relies on libraries it hasn’t written, but rather with the fact that there is no community of MySQL developers.

Free Software is not a cure for the world’s hunger. However, building software using a highly distributed community might very well be the best possible way to develop generic software.

Research versus Teaching versus Development versus Blogging versus Consulting

I’m working rather intensively on a new course (Information Retrieval and Filtering) which should be offered in 2006 or 2007. This course is really a pleasure. Normally, teaching is something you do seriously, while you do as much consulting or as much research as you can. You won’t see many university professors spending 60 hours a week preparing a single course. However, sometimes, teaching is something that you can really become passionate about. While I have published work in Information Retrieval, I never paid much attention to the field: I was too busy with my research to stop and fiddle with more elementary concepts such as Zipf’s law, where it comes from, and what you can do with it. Thanks to Will Fitzgerald, I now know how to use n-grams and Shannon’s information value to determine the language a text is written in. As a researcher, I find this highly enjoyable, and it is likely to help my research.
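To make the n-gram idea concrete, here is a minimal Python sketch of language identification. It is my own toy version, not necessarily the method Will Fitzgerald described: it builds character trigram profiles from two tiny made-up training sentences, then picks the language whose profile gives the lowest cross-entropy (average number of bits per trigram, in Shannon's sense). The function names, the add-one smoothing, and the `alphabet_size` constant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cross_entropy(text, profile, n=3, alpha=1.0, alphabet_size=10000):
    """Average bits per n-gram of `text` under a smoothed n-gram profile.

    `alpha` and `alphabet_size` are arbitrary smoothing constants (assumptions),
    used only so that unseen n-grams get a small nonzero probability.
    """
    total = sum(profile.values())
    grams = ngram_counts(text, n)
    count = sum(grams.values())
    bits = 0.0
    for gram, freq in grams.items():
        p = (profile.get(gram, 0) + alpha) / (total + alpha * alphabet_size)
        bits -= freq * math.log2(p)
    return bits / count


def guess_language(text, profiles):
    """Return the language whose profile encodes `text` in the fewest bits."""
    return min(profiles, key=lambda lang: cross_entropy(text, profiles[lang]))


# Tiny illustrative training samples; a real system would use large corpora.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog "
               "sleeps in the sun while the fox runs through the woods",
    "french": "le renard brun saute par dessus le chien paresseux et puis le "
              "chien dort au soleil pendant que le renard court dans les bois",
}
profiles = {lang: ngram_counts(sample) for lang, sample in TRAINING.items()}

print(guess_language("the dog and the fox", profiles))
print(guess_language("le chien et le renard", profiles))
```

With training text this small the guesses are fragile; the point is only that shared trigrams make a text cheap to encode under the right language's profile.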

Where does the logarithm of the standard deviation come from in model selection?

Update: This is a failed experiment. Online TeX to MathML simply doesn’t work fast enough to be usable. What is needed is server side support, but I don’t trust current wordpress plugins.

(This post requires MathML and JavaScript support: use Firefox or a MathML plugin such as MathPlayer. The inline MathML will also not display in an RSS aggregator.)

In several signal processing and data mining applications, when people use a probabilistic model, the logarithm of the standard deviation appears, the rest being a standard error measure. Until recently, I had been too lazy to figure out where the logarithm comes from, but I finally figured it out, in part thanks to my friend Yuhong Yan.

The Normal Distribution can be defined by the following density function:

`f(x;\mu,\sigma)= \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{(x- \mu)^2}{2\sigma^2} }`.

Ah! You see this exponential function? That’s where the logarithm will come from!

Suppose you have `m` (independent) samples of a normal distribution: `a_1,a_2, \ldots, a_m`. The joint normal distribution has the following density function:

`f(a_1,a_2, \ldots, a_m;\mu,\sigma,m)= \frac{1}{(\sigma\sqrt{2\pi})^m} e^{ -\sum_{i=1}^{m} \frac{(a_i- \mu)^2}{2\sigma^2} }`.

The logarithm of the joint normal distribution is

`m \log \frac{1}{\sigma\sqrt{2\pi}} -\sum_{i=1}^{m} \frac{(a_i- \mu)^2}{2\sigma^2}`

or

`-m \log (\sigma\sqrt{2\pi}) - \frac{\sum_{i=1}^{m} (a_i- \mu)^2}{2\sigma^2}`.

You see the last bit? `\sum_{i=1}^{m} (a_i- \mu)^2`? That’s the squared `l_2` error! And the first term expands to `-m \log \sigma - m \log \sqrt{2\pi}`, which is where the logarithm of the standard deviation appears.

Hence, whenever you see the `l_2` error combined with the logarithm of the standard deviation, chances are that you are looking at the logarithm of the normal distribution!

In particular, this trick applies to the Bayesian information criterion (BIC), which selects a model by minimizing a penalized log-likelihood such as `-2 \log L + k \log(n)`, where `L` is the maximized likelihood, `k` represents the number of parameters and `n` the number of observations in the fitted model. The log-likelihood component can sometimes be computed using the above analysis.
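As a quick sanity check, here is a small Python sketch that computes the Gaussian log-likelihood in two ways: with the closed form (logarithm of the standard deviation plus squared error) and by summing the logarithm of the density sample by sample. It then plugs the result into the BIC formula. The sample values and function names are made up for illustration, and the model is assumed to have `k = 2` parameters (the mean and the standard deviation).

```python
import math


def gaussian_log_likelihood(samples, mu, sigma):
    """Closed form: -m log(sigma sqrt(2 pi)) - sum((a_i - mu)^2) / (2 sigma^2)."""
    m = len(samples)
    squared_error = sum((a - mu) ** 2 for a in samples)
    return -m * math.log(sigma * math.sqrt(2 * math.pi)) - squared_error / (2 * sigma ** 2)


def gaussian_log_likelihood_direct(samples, mu, sigma):
    """Same quantity, obtained by summing the log of the density at each sample."""
    total = 0.0
    for a in samples:
        density = math.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += math.log(density)
    return total


def bic(log_likelihood, k, n):
    """Bayesian information criterion: -2 log-likelihood + k log(n)."""
    return -2 * log_likelihood + k * math.log(n)


# Made-up samples, for illustration only.
samples = [1.2, 0.7, 1.9, 1.1, 0.4, 1.6]

# Maximum-likelihood estimates of the mean and the standard deviation.
mu = sum(samples) / len(samples)
sigma = math.sqrt(sum((a - mu) ** 2 for a in samples) / len(samples))

closed_form = gaussian_log_likelihood(samples, mu, sigma)
direct = gaussian_log_likelihood_direct(samples, mu, sigma)
score = bic(closed_form, k=2, n=len(samples))

print(closed_form, direct, score)
```

At the maximum-likelihood estimates, the squared-error term collapses to `m/2`, so the log-likelihood depends on the data only through `m \log \sigma`, which is why the logarithm of the standard deviation is so visible in model selection formulas.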

Reference: Schwarz, G. (1978) “Estimating the Dimension of a Model”, Annals of Statistics, 6, 461-464

Analyzing Large Collections of Electronic Text Using OLAP

Steven will be presenting our paper Analyzing Large Collections of Electronic Text Using OLAP at APICS 2005. This work is based on an idea by Owen Kaser: what happens if we apply multidimensional databases (OLAP) to literary research?

Data Mining and Information Retrieval techniques are used routinely for literary research and for text processing in general, but decision support techniques commonly used in the business world (sometimes called “Business Intelligence”) have not seen much use yet in text processing. The main difference between decision support systems and data mining is that in decision support, the user remains in control, so simple yet extremely efficient algorithms are favoured over sophisticated but possibly expensive ones. Ideally, all decision support algorithms should be O(1) after accounting for precomputations. With nearly unlimited storage now available, decision support research is due for a technological and scientific boom.

Computer-assisted reading and analysis of text has various applications in the humanities and social sciences. The increasing size of many electronic text archives has the advantage of a more complete analysis but the disadvantage of taking longer to obtain results. On-Line Analytical Processing is a method used to store and quickly analyze multidimensional data. By storing text analysis information in an OLAP system, a user can obtain solutions to inquiries in a matter of seconds as opposed to minutes, hours, or even days. This analysis is user-driven allowing various users the freedom to pursue their own direction of research.
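To illustrate the "O(1) after precomputation" idea, here is a toy Python sketch, not the system described in the paper. It precomputes word counts for every combination of three dimensions (author, work, word), including rolled-up totals, so that each query becomes a single dictionary lookup. The dimension names, the `ALL` marker, and the counts are all invented for the example.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # hypothetical marker meaning "aggregated over this dimension"


def build_cube(facts):
    """Precompute counts for every roll-up of (author, work, word).

    `facts` is an iterable of (author, work, word, count) tuples.
    Each fact contributes to 2^3 = 8 cells of the cube.
    """
    cube = defaultdict(int)
    for author, work, word, count in facts:
        for cell in product((author, ALL), (work, ALL), (word, ALL)):
            cube[cell] += count
    return cube


# Invented toy data: (author, work, word, occurrences).
facts = [
    ("author_a", "book_1", "love", 3),
    ("author_a", "book_1", "money", 1),
    ("author_a", "book_2", "love", 2),
    ("author_b", "book_3", "money", 4),
]
cube = build_cube(facts)

# Each query is now a constant-time lookup.
love_by_author_a = cube[("author_a", ALL, "love")]
money_everywhere = cube[(ALL, ALL, "money")]
print(love_by_author_a, money_everywhere)
```

The trade-off is storage: the cube grows with the number of dimension combinations, which is exactly why cheap storage matters for this kind of decision support.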

Logitech USB Desktop Microphone under Linux

I got my new Logitech USB Desktop Microphone working under Linux. It should have been very easy, but I hit a small snag.

Plug the device in and type “lsusb”; you should see:

Bus 001 Device 004: ID 0556:0001 Asahi Kasei Microsystems Co., Ltd AK5370 I/F A/D Converter

Ah! The device is called AK5370.

Run “dmesg”; you should see two lines like these:

usb 1-3: new full speed USB device using ohci_hcd and address 4

usbcore: registered new driver snd-usb-audio

If you don’t see the second line, you have a problem. In my case, I didn’t have the usbaudio driver so I only got the first line. I had to go compile usbaudio. To do so, I ran “uname -a”, which gave me “Linux romeo 2.6.10-gentoo-r6”. I went under /usr/src/linux-2.6.10-gentoo-r6 and typed

genkernel --no-clean --menuconfig all

Next, after the menu opened up, I went under driver/audio and chose usb audio drivers (and loadable modules). Exiting genkernel launched the compilation of the module and all I had to do was to unplug/replug my microphone. You should check that /dev/dsp1 appears.

All I had to do after this was to launch mhwaveedit and choose “hw:1,0″ as my recording device, so that I would not record out of my sound card, but rather from my microphone. Setting the sampling rate to 44100 Hz seemed to be necessary.

To enable the microphone under KDE, you have to launch kmix and choose the appropriate device. If you don’t see the device, quit kmix (through the file menu) and restart it. This being said, I don’t see why you need the microphone under KDE. However, make sure you turn the gain all the way to the maximum for optimal sound quality.

Voilà! Isn’t Linux friendly?

For recording tips, see this page by Bob Cunningham.

Update: sometimes you might have to force the driver to load by running “modprobe snd-usb-audio”. In theory, modprobe shouldn’t be necessary as devices should be automatically recognized, but it happens to me sometimes that I need to help my kernel a bit. (Bugs?)
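If you want to check from a script whether the driver is loaded, here is a small Python sketch. It assumes the driver was built as a loadable module (a driver compiled into the kernel will not be listed) and it relies on the fact that /proc/modules lists module names with underscores, so “snd-usb-audio” shows up as “snd_usb_audio”. The function names and the sample text are mine, for illustration only.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_module_loaded(name, proc_modules_text):
    """True if `name` is listed; hyphens are normalized to underscores."""
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of loaded modules (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Illustrative sample in the /proc/modules layout, not taken from a real machine.
SAMPLE = (
    "snd_usb_audio 65536 0 - Live 0x00000000\n"
    "usbcore 110000 3 snd_usb_audio,ohci_hcd - Live 0x00000000\n"
)

print(is_module_loaded("snd-usb-audio", SAMPLE))
```

On a real Linux box, you would call `is_module_loaded("snd-usb-audio", read_proc_modules())` and run modprobe only if it returns False.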

Doing the Martin Shuffle

Through Will, I got to the Martin Shuffle, which is a cool randomized algorithm to quickly find songs on an MP3 player (without browsing them one by one). They implement a nice Markov Decision Process using my favorite language: Python.

Academic Authorship

I don’t know where this comes from, but Yuhong seems upset:

A professorship is not only a position to do research, but also a resource to exploit the other’s work by acquiring the authorship.

She’s right, of course. If you seek fame and fortune through a professorship, you have to become an “academic entrepreneur” where you seek to employ people (read: students) at the lowest possible wage (what she describes as “slave labor”) so that they do the kind of research that can sustain large research grants.

Last year, I chatted with some graduate students and I realized that students actually enjoy working for such professors, not only for the money, but also because they feel they are getting better training than with a lone crazy professor. Let’s face it: the factory model has something comforting even for the students. Working with a lone crazy professor means you won’t have fixed deadlines or any fixed research subjects.

I simply think that a professorship is a very open-ended career. There are many models, and some of them are hard to compare. I believe this derives from “academic freedom”. In practice, as long as you can find a significant number of peers to vouch for the quality of your work, no matter how you achieve it, then you are ok.

However, some routes are more rewarding, or better rewarded, than others. Some research topics are better funded than others: Bin Laden detectors are better funded than graph theory theorems. And why not? I think it is healthy.

As long as professors not working on Bin Laden detectors, professors getting small grants, and professors having few students still keep their jobs and don’t get insulted publicly, then we are ok and academic freedom is safe.

Update: I got too many insults by email. Ok, I didn’t mean the government should be funding Bin Laden detectors, only that it is ok for some subjects to be funded better than others.

Firefox 1.0.7: better memory management?

I love Firefox, but one thing that’s causing me grief is its memory leakage. On my Gentoo box, Firefox 1.0.6 would quickly eat up to 55% of my available memory. I had to kill Firefox every two days to keep my machine working. I upgraded to 1.0.7 yesterday, and it runs smoothly using only 25% of my available memory (according to “top”). Given that my browser is arguably the most important software application running on my machine and given that I’m unlikely at any one time to run two browsers, I don’t care if Firefox uses up to 25% of my memory, but please, no more. When I’m not browsing the web, I’ve got to do real work like research and teaching. Interestingly, the release notes don’t mention anything about improved memory usage. Also, my Mandrake box (running 10.1) doesn’t seem to have this problem, nor do my Windows boxes, irrespective of the Firefox version number. Anyhow, I’m crossing my fingers, hoping that Firefox will be well behaved this time.

Google launches an online RSS aggregator

Google did it, finally. They launched a beta of their RSS aggregator. It is still a bit immature, but I’m trying it out. I’m already a big fan of Gmail, which has become my sole email client.
