Friday, May 12th, 2006

Multidimensional OLAP Server for Linux as Open Source Software

Filed under: Data Warehousing and OLAP — Daniel Lemire @ 19:38

Jedox will release a free open source Linux MOLAP server by the end of the year. A pre-release of the software is expected by mid of 2005.

All data is stored entirely in memory. Data can not only be read from but also written back to the cubes. Like in a spreadsheet, all calculations and consolidations are carried out within milliseconds in the server memory while they are written back to the cube.

I sure hope that by “memory” they include “external” memory because otherwise, their cubes are going to have to be quite small. Normally, you’d at least memory map large files as Lemur OLAP does.

Thursday, May 11th, 2006

Slope One in Automated Collaborative Filtering

Filed under: — Daniel Lemire @ 15:26
Slope One in Automated Collaborative Filtering
By
Bo Xu

Examining Committee:
Supervisor(s):            Dr. Huajie Zhang
                          Dr. Bruce Spencer
Chairperson:              Dr. Mike Fleming
Internal Reader           Dr. Yuhong Yan
External Reader           Dr. Donglei Du                                                                       

Thursday, May 18, 2006
10:00 a.m.
IT-C317 (Information Technology Center)

ABSTRACT

We focus in the Collaborative Filtering field on how
to improve the quality of the prediction and
recommendation and how to improve  the
performance on large datasets. Only rating-based,
non-sequential problem space is discussed. A
detailed survey of current technologies which
might be used in recommender systems is
presented to the readers, followed by a thorough
analysis of a newly proposed algorithm by
Daniel Lemire and Anna MacLachlan: Slope One.
To test the family of the Slope One schemes'
prediction quality as well as their performance,
experiments are conducted by means of
comparing them with other representative,
up-to-date collaborative filtering and machine
learning algorithms. The results show that
Slope One, in spite of its simplicity, is an
efficient, accurate within reason, and scalable
Collaborative Filtering algorithm and therefore
especially applicable for online E-Commerce
recommender systems.

ALL GRADUATE STUDENTS ARE ENCOURAGED TO ATTEND

*********************
Linda Sales
Administrative Assistant
Faculty of Computer Science
University of New Brunswick
540 Windsor Street
Fredericton, NB
E3B 5A3

Phone: 506-458-7285
Fax: 506-453-3566

SIAM Data Mining 2007 (SDM07) (October 6th, 2006 / May 3-5, 2007)

Filed under: Data Warehousing and OLAP, Passed CFP — Daniel Lemire @ 12:27

The SIAM Conference on Data Mining 2007 (SDM07) will be held in Minneapolis, May 3-5, 2007.

Data mining is an important tool in science, engineering, industrial processes, healthcare, business, and medicine. The datasets in these fields are large, complex, and often noisy. Extracting knowledge requires the use of sophisticated, high-performance and principled analysis techniques and algorithms, based on sound statistical foundations. These techniques in turn require powerful visualization technologies; implementations that must be carefully tuned for performance; software systems that are usable by scientists, engineers, and physicians as well as researchers; and infrastructures that support them.

How do people balance out precision and recall?

Filed under: — Daniel Lemire @ 10:43

In Information Retrieval, you can’t have both great recall and great precision, so you have to balance the two. What are the possible criteria to pick the best recall/precision?

What I found so far, on Wikipedia of all places, is the so-called F-measure or balanced F-score, and it is merely the harmonic mean of the recall and the precision. This seems to have almost no theoretical foundation?
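For concreteness, here is a minimal sketch of that harmonic mean in Python. The function name `f_score` and the `beta` parameter are illustrative choices; `beta` is the usual weighted generalization (F-beta), which is not discussed above, and `beta=1` gives the balanced F-score.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta=1 gives the balanced F-score; beta > 1 favors recall,
    beta < 1 favors precision. (beta is an illustrative extra.)
    """
    if precision == 0 and recall == 0:
        # Avoid dividing by zero when both scores are zero.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Perfect recall cannot fully compensate for mediocre precision.
    print(f_score(0.5, 1.0))
```

Because it is a harmonic mean, the score is dragged toward the smaller of the two values, so a system cannot score well by maximizing only one of them.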

Has anyone out there proposed an exciting way to pick your recall and precision?

I don’t want to be too critical of the field of Information Retrieval, but sometimes, when I read papers from the 1950s, it feels like they knew everything there was to know. I sure hope that people came up with more exciting ideas than just “use the harmonic mean” to pick the best recall?

Update: I found this related paper: Cyril Goutte and Eric Gaussier, A Probabilistic Interpretation of Precision, Recall and F-score, with Implication for Evaluation, but it is only a partial answer to my question.

Java Serialization is not for long term storage

Filed under: — Daniel Lemire @ 9:32

Using serialization for long-term storage is a common mistake. In fact, Microsoft made it with Microsoft Word, and it is a well-known source of trouble (ever had a corrupted file you could not recover?). Serialization in Java was never advertised as a viable long-term storage mechanism. We serialize in order to send objects over a wire (RMI), or for lightweight persistence (especially for non-critical data). I’m not making this up; this is how Sun documents it.

Also, Sun makes no promise that you’ll be able to deserialize if your code changes. Ever heard of the java.io.InvalidClassException class? That’s what you’ll get in your face if you ever change the class you used to serialize (even if you change it just a little bit).

Think about the following scenario:

  1. You serialize some objects you care about.
  2. Weeks pass by.
  3. For some reason or other, you change the class. Suppose, for example, that you delete a field or you move the class up or down in the hierarchy. You don’t keep the old class around anymore.
  4. That’s it, you can’t deserialize your objects ever unless you do reverse engineering. It won’t stop and ask you how you want it fixed, it will just throw an exception with no direct way for you to fix this. You’ll need to “hand recover your data”. Have fun. If the data was not your own, and it was hand crafted by a client, you are probably going to lose your job.

And let’s not even get into what happens if you must exchange your data with other software not written in Java.

If you really care about your data, dump it in a custom XML format. It isn’t that hard.
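To give an idea of how little work a custom XML dump can be, here is a minimal sketch. It is written in Python rather than Java, only to stay consistent with the other code on this page, and it uses the standard `xml.etree.ElementTree` module. The element names (`points`, `point`), the `version` attribute and both function names are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET


def dump_points(points):
    """Write (x, y) integer pairs to a small, self-describing XML string."""
    # The version attribute is a hypothetical hook for future format changes.
    root = ET.Element("points", version="1")
    for x, y in points:
        ET.SubElement(root, "point", x=str(x), y=str(y))
    return ET.tostring(root, encoding="unicode")


def load_points(text):
    """Read the pairs back, ignoring attributes this reader does not know."""
    root = ET.fromstring(text)
    return [(int(p.get("x")), int(p.get("y"))) for p in root.findall("point")]


if __name__ == "__main__":
    xml_text = dump_points([(1, 2), (3, 4)])
    print(xml_text)
    print(load_points(xml_text))
```

The point of the design is that the reader only looks up the names it needs, so adding a field later does not prevent old code from reading the file, and the file stays readable by software written in any language.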

Wednesday, May 10th, 2006

Flattening lists in Python

Filed under: — Daniel Lemire @ 14:49

Can anyone do better than this ugly hack?

def flatten(x):
    ans = []
    for i in x:
        if i.__class__ is list:
            # Recurse into the nested list and keep what was collected so far.
            ans.extend(flatten(i))
        else:
            ans.append(i)
    return ans

Update. I like this solution proposed by one of the commenters (sweavo):

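def flatten(l):
    if isinstance(l, list):
        # sum needs [] as its start value to concatenate lists.
        return sum(map(flatten, l), [])
    else:
        # Wrap a non-list item so that it can be concatenated.
        return [l]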

Can anyone do better?
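One possible alternative, offered as a sketch rather than a definitive answer: a generator that yields the items one at a time. The name `iter_flatten` is made up for illustration, and `yield from` requires Python 3.3 or later. It avoids building intermediate lists, whereas concatenating lists with `sum` copies the accumulated result at every step.

```python
def iter_flatten(x):
    """Lazily yield the non-list items of an arbitrarily nested list."""
    for item in x:
        if isinstance(item, list):
            # Delegate to the recursive generator for nested lists.
            yield from iter_flatten(item)
        else:
            yield item


if __name__ == "__main__":
    print(list(iter_flatten([1, [2, [3, 4]], 5])))
```

Wrap the call in `list(...)` when an actual list is needed, as in the usage line above.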

See also my posts YouTube scalability, Yield returns are not esoteric anymore, Efficient FIFO/Queue data structure in Python and Autocompletion in the Python console.


Tuesday, May 9th, 2006

Harold and RuleML

Filed under: — Daniel Lemire @ 8:38

Harold Boley was over in Montreal yesterday. He gave a talk on RuleML. The big news, to me, is that RuleML has been modularized into sublanguages. Of particular interest to me was their DataLog sublanguage (and they have a tutorial about it). To be honest, I didn’t even know what “DataLog” meant before this talk. Guy stopped by as well as many interesting people including François Magnan and Olga.

Now, someone should go out and implement a nice DataLog-RuleML engine on top of MySQL.

We also talked about Web Services, Logic and so on. Finally, we agreed to start a network of researchers making their publication lists available as XML (I have been doing so for a long time).

Finally, we stopped by Idilia to see the lovely Anna and her office. It seems like Idilia now has a white paper out, but I can’t seem to navigate their web site.
