Tuesday, February 19th, 2008

New Trends in Physical Data Warehouse Design (April 18, 2008)

Filed under: Data Warehousing and OLAP, Passed CFP — Daniel Lemire @ 19:27

Ladjel Bellatreche is organizing a special issue on New Trends in Physical Data Warehouse Design for the journal Distributed and Parallel Databases. Submissions are through the journal’s online system. You can read the call for papers on EventSeer.

Monday, February 18th, 2008

Recommending Journal Articles in a Scientific Digital Library

Filed under: Science and Technology — Daniel Lemire @ 11:19

André Vellino will give a talk on recommender systems in our offices (100 Sherbrooke West, room 2720) at 12:30pm this Thursday (February 21st 2008).

Recommender systems for scientific digital libraries that have been the subject of experiments in recent years have used corpora that are primarily in the field of computer science. However, designing an effective recommender system for journal articles in a broader Scientific, Technical and Medical (STM) digital library poses special challenges and presents unique opportunities.

This talk describes a recommender system for scientific scholarly articles that is both hybrid (content and collaborative filtering based) and multi-dimensional (across different rating criteria). Our hypothesis is that such a design for a recommendation engine can improve scientists’ ability to discover new knowledge from a digital library, provided that an interface to these recommendations can simultaneously offer explanations for the recommendations and increase the user’s control over how the recommender behaves.

The talk will be webcast at mms://mediasrv.lorit.ca/presentation using Microsoft-only technology.

The announcement says that the talk will be in French.

Wednesday, February 13th, 2008

External-Memory Shuffles?

Filed under: Science and Technology — Daniel Lemire @ 12:44

We need to shuffle the lines in very large variable-length-record flat files.

We can load the files in MySQL and do “select * from mytable order by rand()”.

However, loading the data in a DBMS and dumping it out is cumbersome. So, we do an in-memory shuffle block by block. It comes close to a full random shuffle, but I am worried it might not be good enough.

Does anyone know of a fast way to do a full external-memory shuffle using only Ruby, Perl, Python, and other Unix utilities?

A crazy idea: Let n be the number of lines. Shuffle the numbers between 1 and n in memory. Then prepend these numbers at the beginning of each line, do external-memory sorting with the Unix command sort, and remove the random numbers from the final file. This will scale up to about 100 million lines on a PC.

It might be possible to generate random numbers from a large set, thus making collisions very unlikely, thereby avoiding a shuffle of numbers altogether.
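
Here is a minimal Python sketch of that idea, not a tested production tool. It prepends a random key to each line, sorts chunks in memory, spills them to temporary files, and merges them, so it stands in for the Unix sort step with a small external merge sort. The function name `external_shuffle`, the `chunk_lines` parameter and the 128-bit key width are all assumptions made for illustration. With 128-bit keys, collisions are unlikely enough that no prior shuffle of the numbers is needed.

```python
import heapq
import os
import random
import tempfile


def external_shuffle(src_path, dst_path, chunk_lines=100_000, rng=None):
    """Shuffle the lines of src_path into dst_path without loading the whole file."""
    rng = rng or random.Random()
    run_paths = []

    def spill(run):
        # Sort one chunk by its random key prefix and write it to a temporary run file.
        run.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "wb") as out:
            out.writelines(run)
        run_paths.append(path)

    try:
        run = []
        with open(src_path, "rb") as src:
            for line in src:
                if not line.endswith(b"\n"):
                    line += b"\n"  # normalize a final line that lacks a newline
                # Decorate with a fixed-width 128-bit random key (assumed width).
                run.append(b"%032x\t%s" % (rng.getrandbits(128), line))
                if len(run) >= chunk_lines:
                    spill(run)
                    run = []
        if run:
            spill(run)

        # Merge the sorted runs. Byte order equals key order because keys are fixed-width hex.
        runs = [open(path, "rb") for path in run_paths]
        try:
            with open(dst_path, "wb") as dst:
                for decorated in heapq.merge(*runs):
                    dst.write(decorated.split(b"\t", 1)[1])  # remove the key
        finally:
            for handle in runs:
                handle.close()
    finally:
        for path in run_paths:
            os.remove(path)
```

All run files are open at once during the merge, so `chunk_lines` should be large enough to keep the number of runs below the operating system's limit on open files.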

Definitive solution? There is a GNU command called shuf, but it works only for small files. However, the new sort command has a --random-sort flag. To install a recent sort on MacOS, do the following:

wget http://ftp.gnu.org/gnu/coreutils/coreutils-6.9.tar.bz2
tar xvjf coreutils-6.9.tar.bz2
cd coreutils-6.9
mkdir osxbuild && cd osxbuild
../configure --prefix=/usr/local/stow/coreutils-6.9 jm_cv_func_svid_putenv=yes && make
sudo cp src/sort /usr/local/bin/
sudo cp ../man/sort.1 /usr/local/share/man/man1/

However, sort --random-sort will only shuffle properly if no line is repeated: it orders lines by a random hash of their content, so identical lines get the same hash and end up next to each other. You need to first pass your file through cat -n to ensure that all lines are unique, then conclude with cut to remove the counter added by cat. Here is a command that does the trick:


cat -n myfile.csv | sort --random-sort | cut -f 2-

It may not generate a perfect shuffle, but it should be close.
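
To see why repeated lines are a problem, here is a toy Python model of a hash-based random sort. It is only an illustration, not GNU sort's actual implementation: the salted MD5 key and the helper names `hash_sort` and `shuffle_with_counter` are assumptions of this sketch. The second function mimics the cat -n and cut steps.

```python
import hashlib
import os


def hash_sort(lines, salt):
    """Order lines by a salted hash of their content (toy model of a random sort)."""
    return sorted(lines, key=lambda line: hashlib.md5(salt + line.encode("utf-8")).digest())


def shuffle_with_counter(lines, salt):
    # Like `cat -n`: prefix a unique counter so that no two entries are identical.
    numbered = [f"{i}\t{line}" for i, line in enumerate(lines, start=1)]
    # Like `cut -f 2-`: drop the counter once the entries are sorted.
    return [entry.split("\t", 1)[1] for entry in hash_sort(numbered, salt)]


lines = ["a", "b", "a", "c", "a", "d", "b"]
salt = os.urandom(16)  # a fresh salt on each run gives a different order

grouped = hash_sort(lines, salt)
# Identical lines share a hash, so the three "a" lines are always adjacent.
first_a = grouped.index("a")
print(grouped[first_a:first_a + 3])  # ['a', 'a', 'a']

# With the counter, each entry hashes differently and duplicates can be spread out.
print(shuffle_with_counter(lines, salt))
```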

What is a reusable research result?

Filed under: Academia/Research — Daniel Lemire @ 10:01

Peter argued that reusability and originality are the primary qualities of a research result.

I can tell something is not original if it looks similar to previous work.

When reviewing a paper, it might be difficult to determine if the research result is reusable. Nevertheless, here are some attributes of a reusable research result:

  • It is explained carefully and correctly in a widely available paper.
  • It can be expressed using a few simple words. Matter is Energy. No logical system can be complete. The medium is the message.
  • It can be applied to a wide range of problems. Small, large, complex or simple problems. Problems from other fields.

Because most scientists work on a small set of problems at any given time, and often have some experience in one or two fields at most, I do not think they are very good at judging the reusability of a result.

Back when Object-Oriented Programming (OOP) was the next big thing, it was believed that building reusable objects would be relatively easy. Inheritance was touted as a way to build upon existing objects that you could keep on reusing. In practice, what people reuse are libraries and APIs, but few people reuse their own objects again and again. We have learned that building reusable objects is hard and that it is very often the result of collaborative work.

Tuesday, February 12th, 2008

Yahoo! Research jobs in Montreal

Filed under: Science and Technology — Daniel Lemire @ 10:47

Fernando Diaz — an Information Retrieval Researcher from Yahoo! labs in Montreal — sent me this job offer. I had no idea Yahoo! had researchers in Montreal! I feel better about my home town!

Note: Do not get in touch with me regarding this position. I am just reposting it.

Machine Learning / NLP / Information Retrieval Researcher

Yahoo! Applied Research Lab
Montreal, QC

Job Responsibilities

The NLP group in Yahoo’s Applied Research Lab is looking for a researcher with the following qualifications:

-Strong knowledge of the Information Retrieval (IR) field
-Deep familiarity and hands-on experience in machine learning techniques
-Ability to conduct experiments involving massive data sources (mainly text and data mining)
-Background in natural language processing or computational linguistics
-Proven experience in software development in the fields mentioned above

Minimum Job Qualifications

The work that this researcher is expected to be conducting will be in the general area of Information Retrieval, typically will be related to one of Yahoo!’s initiatives in the area of search, including but not limited to relevance ranking, question answering, information extraction, text classification or subjectivity analysis. The candidate will be expected to have impact on product initiatives and at the same time will be encouraged to contribute to the general research community by active participation in scientific forums, publications, etc.

Masters Degree required

Preferable Job Qualifications

PhD

Yahoo! Inc. is an equal opportunity employer. For more information, please contact Fernando Diaz (diazf@yahoo-inc.com), Jean-François Crespo (jfcrespo@yahoo-inc.com), or visit http://careers.yahoo.com.

ICSOFT 2008 (March 19, 2008 / July 5-8, 2008)

Filed under: Passed CFP, Science and Technology — Daniel Lemire @ 8:42

The third International Conference on Software and Data Technologies (ICSOFT 2008) will be held in Oporto, Portugal. As the name implies, it is a generic data/software conference.

Keep in mind that, in Portugal, people smoke everywhere and all the time, especially in airports. But they make damn good wines.

Update. It appears that there is now a ban on smoking.

Friday, February 8th, 2008

No shortage of Information Technology Workers

Filed under: Science and Technology — Daniel Lemire @ 10:23

At my school, the dean of the Science Faculty claims that we should see a surge of enrollment in Computer Science given the current shortages in Information Technology workers. I have an idea of who is feeding him this information, but I believe it is nonsense.

First, I do not believe there is a shortage right now in Information Technology at large. I have seen numbers quoted left and right, but my friends who work in industry still work long hours and their pay is not increasing substantially. In fact, I believe that there never was a shortage except in some specific areas like Silicon Valley or Boston. Elsewhere, the shortages were mostly hype.

Second, even if there were a shortage in Information Technology workers, it might not translate into more Computer Science students. If we ever ran out of factory workers, we might not get more engineering students as a result. Besides, many of the higher level positions are management positions that can be adequately filled by people who studied Information Systems in some Business School.

Third, Information Technology workers are very good at making themselves obsolete, and so, even if there was a shortage right now, it may not last. With Google Mail as good as it is, I do not believe we will need local email servers much longer. There will always be a need for local applications to answer specific business needs, but these can be most often built with MySQL and some PHP glue. Knowing the difference between P and NP is hardly required.

There are technically sophisticated problems out there. For example, few people know how to handle very high volumes of data or to build applications that will become widely adopted. People able to solve these problems will always be able to get good jobs. But these needs for highly qualified people do not constitute shortages.
