Friday, April 27th, 2007

Final Word on SIAM Data Mining 2007

Filed under: — Daniel Lemire @ 19:24

So, the conference is over. For me, this was a pretty good experience: I was not sick, I met cool people, some folks appreciated my work, and so on. The conference was well organized: coffee was good, the hotel was well chosen, and so on. For people who know me, this is quite a review since I usually complain a lot about my trips.

However, I am a tad disappointed. Actually, I was disappointed the minute I looked at the list of accepted papers. Data Mining has lost its way.

What is Data Mining? It seems that people have totally forgotten what it is about. No, Data Mining is not Machine Learning though Machine Learning can be applied to Data Mining problems. Data Mining is primarily concerned with very large data sets. It is the essence of Data Mining. Any algorithm running in quadratic time with respect to the size of the data set is automatically out.
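To see why quadratic time is a non-starter, here is a back-of-the-envelope sketch. The machine speed (one billion elementary operations per second) and the data set size (one billion points) are assumptions picked for illustration, not measurements.

```python
# Back-of-the-envelope cost of linear vs. quadratic algorithms.
# ASSUMPTION: one billion elementary operations per second on a single
# machine; this figure is illustrative, not a measurement.
OPS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365 * 24 * 3600


def running_time_seconds(n, exponent):
    """Seconds needed to perform n**exponent elementary operations."""
    return n**exponent / OPS_PER_SECOND


n = 10**9  # ASSUMPTION: a billion data points
linear_seconds = running_time_seconds(n, 1)
quadratic_years = running_time_seconds(n, 2) / SECONDS_PER_YEAR

print(f"linear:    {linear_seconds:.0f} second(s)")
print(f"quadratic: {quadratic_years:.1f} years")
```

Under these assumptions, a single linear pass takes about a second, while a quadratic algorithm would need roughly three decades.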

Data Mining is not only about prediction or classification. Data Mining is also about visualization, explanations, approximations, databases, Business Intelligence, and so on. It is about applying MapReduce to large data sets. It is about scaling up to billions of data points. It is about dirty data.

Something is wrong with the review process: obviously, the program committee is overly focused on Machine Learning. I cannot complain because my paper was accepted, but, surely, a broader range of papers should have been accepted.

ICDM 2007 (June 1st, 2007 / October 28 - 31, 2007)

Filed under: Passed CFP — Daniel Lemire @ 8:21

The IEEE International Conference on Data Mining will be held in Omaha in October 2007.

SDM08 (October 5, 2007 / April 24-26, 2008)

Filed under: Passed CFP — Daniel Lemire @ 8:16

The SIAM International Conference on Data Mining will be held in Georgia in April 2008.

Update. I received strong hints that the abstracts are due October 5, 2007 while the manuscripts are due October 12, 2007.

Thursday, April 26th, 2007

SIAM Data Mining 2007

Filed under: — Daniel Lemire @ 8:49

I am currently attending SIAM Data Mining 2007 in Minneapolis. If you are not here, you can still read the papers. They are all online.

Interesting facts:

  • Association Rules and frequent itemsets are at an all-time low (6% of accepted papers);
  • they keep the conference size the same, but the number of submitted papers keeps increasing, so the acceptance rate keeps dropping (it is now at an all-time low of 12%).

Papers that I liked (list not exhaustive) include:

  • Aristides Gionis and Evimaria Terzi, Segmentations with Rearrangements: they introduce a new time series segmentation problem where you are allowed to reorder the time series; heuristics are provided, but no optimal solution is possible;
  • Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti and Vanja Josifovski, Bandits for Taxonomies: A Model-based Approach: they examine how you can match ads and web pages using taxonomies as a dimensionality reduction trick, and they couple this idea with techniques borrowed from the multi-armed bandit literature;
  • Michael Bertolacci and Anthony Wirth, Are approximation algorithms for consensus clustering worthwhile?: they review approximation algorithms for the consensus clustering problem (given several clusterings, find a new clustering that disagrees least with the provided clusterings) and conclude that simple approximation algorithms work well.
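To make the consensus clustering problem concrete, here is a minimal sketch of one very simple heuristic: return whichever input clustering disagrees least with all the inputs. This is my own illustration and not necessarily one of the algorithms the paper evaluates. The function names and the toy data are hypothetical, and disagreement is counted as the number of item pairs that are together in one clustering but apart in the other.

```python
from itertools import combinations


def disagreements(a, b):
    """Count item pairs grouped together in one clustering but not the other.

    Each clustering is a list of cluster labels, one label per item.
    """
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(len(a)), 2)
    )


def best_of_inputs(clusterings):
    """Return the input clustering with the least total disagreement."""
    return min(
        clusterings,
        key=lambda c: sum(disagreements(c, other) for other in clusterings),
    )


# Three hypothetical clusterings of the same five items.
clusterings = [
    [0, 0, 1, 1, 2],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
]
consensus = best_of_inputs(clusterings)
print(consensus)
```

Note that counting pairs this way is quadratic in the number of items, so this is only a sketch for small examples.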

I really enjoyed Jerome Friedman’s talk on Ensemble Learning. I am not a big fan of Machine Learning, but what he does is really neat. You throw in a lot of rules (extracted from decision trees) and you let a Lasso regression technique find the important rules. What is really neat is that he can actually explain how the predictor works once he is done, so this is not mindless prediction.
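Here is a rough sketch of the idea, not Friedman's actual implementation. The rules are written by hand rather than extracted from decision trees, the toy data is made up, and the Lasso is a bare-bones coordinate descent; all names and numbers are assumptions for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso shrinkage step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(features, targets, penalty, sweeps=500):
    """Bare-bones coordinate-descent Lasso; returns (intercept, weights)."""
    n, p = len(features), len(features[0])
    weights = [0.0] * p
    intercept = sum(targets) / n
    residual = [y - intercept for y in targets]
    for _ in range(sweeps):
        for j in range(p):
            column = [row[j] for row in features]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue  # this rule never fires on the sample
            # Correlation of rule j with the residual, with rule j's own
            # current contribution added back in.
            rho = sum(v * (r + weights[j] * v) for v, r in zip(column, residual)) / n
            new_weight = soft_threshold(rho, penalty) / scale
            delta = new_weight - weights[j]
            residual = [r - delta * v for v, r in zip(column, residual)]
            weights[j] = new_weight
        # The intercept is not penalized: re-center the residual.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, weights


# HYPOTHETICAL hand-written rules; in the talk they come from decision trees.
rules = {
    "age > 40": lambda age, income: age > 40,
    "income > 50": lambda age, income: income > 50,
    "age <= 25": lambda age, income: age <= 25,
    "age <= 25 and income > 50": lambda age, income: age <= 25 and income > 50,
}

# Made-up data: the target depends only on the first two rules.
samples = [(age, income) for age in (20, 30, 50, 60) for income in (30, 70)]
targets = [2.0 * (age > 40) + 3.0 * (income > 50) for age, income in samples]
features = [
    [float(rule(age, income)) for rule in rules.values()]
    for age, income in samples
]

intercept, weights = lasso(features, targets, penalty=0.05)
for name, weight in zip(rules, weights):
    if abs(weight) > 1e-9:
        print(f"{name}: {weight:+.2f}")
```

On this toy data, the penalty should shrink the two informative rules slightly and drive the other two to zero. Because what is left is a short list of readable rules with weights, you can explain the predictor, which is the part I found neat.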

Tom Mitchell showed that you could determine what a person is thinking about using an MRI scan. This is very impressive and constitutes the first sketch of a mind-reading machine. Of course, it is quite primitive in its current state and there is a long road ahead.

Some cool people I met include Cyril Goutte (NRC - Gatineau), François Meyer (Princeton) and Ee-Peng Lim (Nanyang Technological University).

Monday, April 16th, 2007

ICDIM’07 (June 15, 2007 / October 28-31, 2007)

Filed under: Passed CFP, Data Warehousing and OLAP — Daniel Lemire @ 14:43

The Second International Conference on Digital Information Management ICDIM’07 will be held in Lyon.

The scope is quite broad, ranging from Data Mining all the way to e-Learning, XML, Digital Libraries, and so on.

Are we destroying research by evaluating it?

Filed under: Academia/Research — Daniel Lemire @ 11:02

This morning, I read a fascinating paper, Evaluations: Hidden Costs, Questionable Benefits, and Superior Alternatives by Bruno S. Frey and Margit Osterloh (October 2006). This paper is concerned with the undesirable effects of the focus on bibliometric indicators (“publish or perish”). In many contexts, it is very difficult for a researcher to land a job, or to keep the said job, or the necessary funding, unless he publishes regularly in prestigious venues. Intuitively, these measures would ensure that research is of higher quality. Is that really so?

Their main point is that such (rigid) evaluations distort incentives for researchers.

The measurement exerts not only pressure to produce predictable but unexciting research outcomes that can be published quickly. More importantly, path-breaking contributions are exactly those at variance with accepted criteria. Indeed innovative research creates novel criteria which before were unknown or disregarded. The referee process, by necessity based on the opinions of average peers finds it difficult to appreciate creative and unorthodox contributions.

They argue that we see a homogenization of research endeavors. All laboratories and departments end up looking the same. Fads are followed religiously.

They argue that this disconnects researchers from the real world:

Research departments give no credit to faculties who write books and magazine articles designed to intermediate between the research community and the general public because they don’t contribute to the citation record. As a consequence, the gap between rigor and relevance of research is deepened and the dialogue between science and practice is undermined.

I often complain about this very fact on this blog. But I especially like this bit:

The tendency to measure research performance by the size of grants received creates an incentive to undertake more expensive, rather than relevant research.

I find the paper really fascinating. They go on to say that researchers who act as reviewers have an incentive to rate competitors or potential competitors poorly. If everyone gets to review in an open way, I guess this incentive is small, but in the current setup, where a few people get to kill or approve most grants and papers, there is a real worry that, without ever realizing it, they may seriously hurt potential competitors only to protect their own interests.

Ah! But the people getting reviewed can play games as well. For example, one can always create new metrics against which one performs well. That’s ok, but then, it can get uglier:

Authors raise their number of publications by dividing their research results to a “least publishable unit”, slicing them up as thin as salami and submitting them to different journals. Authors may also offer to include another scholar among the authors in exchange for being put as co-authors on his or her paper. Time and energy is wasted by trying to influence editors by courting them e.g. by unnecessarily citing them. More serious are manipulations of data and results.

There is the nice concept of academic prostitution (!!!) which relates to the way in which authors are inclined to change their papers to favor the reviewers, by citing them, for example. This is exemplified by this quote:

According to the study by Simkin and Roychowdhury (2003) on average only twenty percent of cited papers were ever read by the citing authors.

They explain that the system is self-sustained because anyone who questions it is then suspected of not meeting the required quality standards. The system locks people in.

What do they propose as an alternative? In short, you should choose the right people, coach them adequately, and then leave them alone. They also correctly point out that the Web makes it less important to cluster good researchers together. I really like this concluding remark:

The characteristic of a selection system is that once a decision has been made the principals put faith in the persons selected. Important positions in society (such as top judges and presidents of Central Banks) are elected either for life or for a very long time period without formal evaluations for good reasons. It is questionable why these reasons should not apply to research.

Update. Peter Turney sent me this pointer to Reviewing the Reviewers by Kenneth Church. In it, Church argues that by selecting fewer and fewer researchers and papers, we are discarding many interesting papers because they are not conventional enough.

Update 2. My own point of view is that we should mimic the Atmospheric Chemistry and Physics journal and move to open peer review (as described in this Nature article). I am not quite certain how this would work with grant and job reviews, but I think we must move toward more modern systems. I think we overestimate people’s fear of transparent review systems. After all, if I am ever to be convicted of a crime, I expect to see the jury face to face.

Friday, April 13th, 2007

DOLAP 2007 (July 13, 2007 / November 9, 2007)

Filed under: Passed CFP, Data Warehousing and OLAP — Daniel Lemire @ 10:35

The DOLAP 2007 call for papers is out. They are seeking papers that extend data warehousing to new applications. It will be held in Portugal.
