Final Word on SIAM Data Mining 2007

So, the conference is over. For me, this was a pretty good experience: I was not sick, I met cool people, some folks appreciated my work, and so on. The conference was well organized: coffee was good, the hotel was well chosen, and so on. For people who know me, this is quite a review since I usually complain a lot about my trips.

However, I am a tad disappointed. Actually, I was disappointed the minute I looked at the list of accepted papers. Data Mining has lost its way.

What is Data Mining? It seems that people have totally forgotten what it is about. No, Data Mining is not Machine Learning, though Machine Learning can be applied to Data Mining problems. Data Mining is primarily concerned with very large data sets. That is the essence of Data Mining. Any algorithm running in quadratic time with respect to the size of the data set is automatically out.
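To see why quadratic time is out, here is a back-of-the-envelope sketch. The numbers are my own assumptions, not measurements: one billion data points, and a machine that performs one billion elementary operations per second.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def running_time_seconds(operations, ops_per_second=1e9):
    """Time needed for a given number of elementary operations.

    The default rate of 1e9 operations per second is an assumption
    chosen for illustration only.
    """
    return operations / ops_per_second


n = 10**9  # assumed data set size: one billion points

linear_seconds = running_time_seconds(n)
quadratic_seconds = running_time_seconds(n**2)
quadratic_years = quadratic_seconds / SECONDS_PER_YEAR

print(f"linear scan:    {linear_seconds:.0f} second(s)")
print(f"quadratic scan: {quadratic_years:.1f} years")
```

Under these assumptions, a single linear pass takes about a second while a quadratic algorithm takes decades. Faster hardware only shifts the constant.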

Data Mining is not only about prediction or classification. Data Mining is also about visualization, explanations, approximations, databases, Business Intelligence, and so on. It is about applying Map Reduce to large data sets. It is about scaling up to billions of data points. It is about dirty data.

Something is wrong with the review process: obviously, the program committee is overly focused on Machine Learning. I cannot complain because my paper was accepted, but, surely, a broader range of papers should have been accepted.

ICDM 2007 (June 1st, 2007 / October 28 – 31, 2007)

The IEEE International Conference on Data Mining will be held in Omaha in October 2007.

SDM08 (October 5, 2007 / April 24-26, 2008)

The SIAM International Conference on Data Mining will be held in Georgia in April 2008.

Update. I received strong hints that the abstracts are due October 5, 2007 while the manuscripts are due October 12, 2007.

SIAM Data Mining 2007

I am currently attending SIAM Data Mining 2007 in Minneapolis. If you are not here, you can still read the papers. They are all online.

Interesting facts:

  • Association Rules and frequent itemsets are at an all-time low (6% of accepted papers);
  • the conference size stays the same while the number of submitted papers keeps increasing, so the acceptance rate keeps falling (it is now at an all-time low of 12%).

Papers that I liked (list not exhaustive) include:

  • Aristides Gionis and Evimaria Terzi, Segmentations with Rearrangements: they introduce a new time series segmentation problem where you are allowed to reorder the time series; heuristics are provided, but no optimal solution is possible;
  • Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti and Vanja Josifovski, Bandits for Taxonomies: A Model-based Approach: they examine how you can match ads and web pages using taxonomies as a dimensionality reduction trick, and they couple this idea with techniques borrowed from the multi-armed bandit literature;
  • Michael Bertolacci and Anthony Wirth, Are approximation algorithms for consensus clustering worthwhile?: they review approximation algorithms for the consensus clustering problem (given several clusterings, find a new clustering that disagrees least with the provided clusterings) and conclude that simple approximation algorithms work well.
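To make the consensus clustering problem concrete, here is a minimal sketch. It is not taken from the paper. It counts the pairs of items on which two clusterings disagree, then simply returns the input clustering with the fewest total disagreements, which is one of the simplest approximation strategies. The function names and the toy labels are my own.

```python
from itertools import combinations


def disagreements(a, b):
    """Number of item pairs grouped together in one clustering but not the other.

    Each clustering is a list of cluster labels, one label per item.
    """
    count = 0
    for i, j in combinations(range(len(a)), 2):
        together_in_a = a[i] == a[j]
        together_in_b = b[i] == b[j]
        if together_in_a != together_in_b:
            count += 1
    return count


def best_input_clustering(clusterings):
    """Return the input clustering with the fewest total disagreements."""
    return min(
        clusterings,
        key=lambda c: sum(disagreements(c, other) for other in clusterings),
    )


# Three toy clusterings of four items (labels are arbitrary).
c1 = [0, 0, 1, 1]
c2 = [0, 0, 1, 2]
c3 = [0, 1, 1, 1]

consensus = best_input_clustering([c1, c2, c3])
print(consensus)
```

Note that counting pairs this way is quadratic in the number of items, so it is fine for illustration but not for large data sets.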

I really enjoyed Jerome Friedman's talk on Ensemble Learning. I am not a big fan of Machine Learning, but what he does is really neat. You throw in a lot of rules (extracted from decision trees) and you let a Lasso regression technique find the important rules. What is really neat is that he can actually explain how the predictor works once he is done, so this is not mindless prediction.
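Here is a toy sketch of the idea, and certainly not Friedman's actual procedure. I assume three hand-written binary rules (rather than rules extracted from trees), a target that depends only on the first rule, and a plain coordinate-descent Lasso without an intercept. All names and the penalty value are made up for illustration.

```python
from itertools import product


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within the penalty become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=50):
    """Minimize (1/2n) * squared error + penalty * L1 norm, no intercept."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    for _ in range(sweeps):
        for j in range(len(weights)):
            rho = 0.0
            norm = 0.0
            for row, y in zip(rows, targets):
                prediction = sum(w * x for w, x in zip(weights, row))
                # Residual with the contribution of rule j added back.
                partial_residual = y - prediction + weights[j] * row[j]
                rho += row[j] * partial_residual
                norm += row[j] * row[j]
            if norm == 0.0:
                weights[j] = 0.0  # a rule that never fires carries no signal
                continue
            weights[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return weights


# Each row records whether three hypothetical rules fire (1.0) or not (0.0).
rows = [list(map(float, bits)) for bits in product([0, 1], repeat=3)]
# Assumed ground truth: only the first rule matters.
targets = [2.0 * row[0] for row in rows]

weights = lasso_coordinate_descent(rows, targets, penalty=0.1)
print(weights)
```

The Lasso keeps a (shrunken) weight on the first rule and sets the two irrelevant rules to exactly zero. That is what makes the final predictor easy to explain: you only have to read the rules that survive.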

Tom Mitchell showed that you could determine what a person is thinking about using an MRI scan. This is very impressive and constitutes the first sketch of a mind-reading machine. Of course, it is quite primitive in its current state and there is a long road ahead.

Some cool people I met include Cyril Goutte (NRC – Gatineau), François Meyer (Princeton) and Ee-Peng Lim (Nanyang Technological University).

ICDIM’07 (June 15, 2007 / October 28-31, 2007)

The Second International Conference on Digital Information Management ICDIM’07 will be held in Lyon.

The scope is quite broad, ranging from Data Mining all the way to e-Learning, XML, Digital Libraries, and so on.

Are we destroying research by evaluating it?

This morning, I read a fascinating paper, Evaluations: Hidden Costs, Questionable Benefits, and Superior Alternatives by Bruno S. Frey and Margit Osterloh (October 2006). This paper is concerned with the undesirable effects of the focus on bibliometric indicators (“publish or perish”). In many contexts, it is very difficult for a researcher to land a job, or to keep the said job, or the necessary funding, unless he publishes regularly in prestigious venues. Intuitively, these measures would ensure that research is of higher quality. Is that really so?

Their main point is that such (rigid) evaluations distort incentives for researchers.

The measurement exerts not only pressure to produce predictable but unexciting research outcomes that can be published quickly. More importantly, path-breaking contributions are exactly those at variance with accepted criteria. Indeed innovative research creates novel criteria which before were unknown or disregarded. The referee process, by necessity based on the opinions of average peers finds it difficult to appreciate creative and unorthodox contributions.

They argue that we see a homogenization of research endeavors. All laboratories and departments end up looking the same. Fads are followed religiously.

They argue that this disconnects researchers from the real world:

Research departments give no credit to faculties who write books and magazine articles designed to intermediate between the research community and the general public because they don’t contribute to the citation record. As a consequence, the gap between rigor and relevance of research is deepened and the dialogue between science and practice is undermined.

I often complain about this very fact on this blog. But I especially like this bit:

The tendency to measure research performance by the size of grants received creates an incentive to undertake more expensive, rather than relevant research.

I find the paper really fascinating. They go on to say that researchers who act as reviewers have an incentive to rate competitors or potential competitors poorly. If everyone gets to review in an open way, I guess this incentive is small, but in the current setup, where a few people get to approve or kill most grants and papers, there is a real worry that, without ever realizing it, they may seriously hurt potential competitors only to protect their own interests.

Ah! But the people getting reviewed can play games as well. For example, one can always create new metrics against which one performs well. That’s ok, but then, it can get uglier:

Authors raise their number of publications by dividing their research results to a “least publishable unit”, slicing them up as thin as salami and submitting them to different journals. Authors may also offer to include another scholar among the authors in exchange for being put as co-authors on his or her paper. Time and energy is wasted by trying to influence editors by courting them e.g. by unnecessarily citing them. More serious are manipulations of data and results.

There is the nice concept of academic prostitution (!!!) which relates to the way in which authors are inclined to change their papers to favor the reviewers, by citing them, for example. This is exemplified by this quote:

According to the study by Simkin and Roychowdhury (2003) on average only twenty percent of cited papers were ever read by the citing authors.

They explain that the system is self-sustained because anyone who questions it is then suspected of not meeting the required quality standards. The system locks people in.

What do they propose as an alternative? In short that you should choose the right people, coach them adequately, and then leave them alone. They also correctly point out that the Web makes it less important to cluster good researchers together. I really like this concluding remark:

The characteristic of a selection system is that once a decision has been made the principals put faith in the persons selected. Important positions in society (such as top judges and presidents of Central Banks) are elected either for life or for a very long time period without formal evaluations for good reasons. It is questionable why these reasons should not apply to research.

Update. Peter Turney sent me this pointer to Reviewing the Reviewers by Kenneth Church. In it, Church argues that by selecting fewer and fewer researchers and papers, we are discarding many interesting papers because they are not conventional enough.

Update 2. My own point of view is that we should mimic the Atmospheric Chemistry and Physics journal and move to open peer review (as described in this Nature article). I am not quite certain how this would work with grant and job reviews, but I think we must move toward more modern systems. I think we overestimate people’s fear of transparent review systems. After all, if I am ever to be convicted of a crime, I expect to see the jury face to face.

DOLAP 2007 (July 13, 2007 / November 9, 2007)

The DOLAP 2007 call for papers is out. They are seeking papers that extend data warehousing to new applications. It will be held in Portugal.

Google Summer of Code – Collaborative Filtering

Andre sent me a link to the projects that will be supported by Google for the Summer of Code. The Collaborative Filtering library Taste will get two developers over the summer. That’s pretty good.

I wonder why IBM, which is 100 times richer than Google, never thought of supporting Summer of Code initiatives. Take off your ties, people, and think outside the box a little!

WWW 2007 Tagging and Metadata for Social Information Organization Workshop

The WWW 2007 Tagging and Metadata for Social Information Organization Workshop has published its list of accepted papers:

(Quick! Find my paper in the list and go read it!)

ICDE 2008 (June 22, 2007 / April 7-12, 2008)

The 24th International Conference on Data Engineering (ICDE 2008) will be held in Cancún, México. This is a generic conference on information systems from an engineering point of view.

I find it interesting that they list area PC vice-chairs:

  • Data Integration, Interoperability, and Metadata – Erhard Rahm, U. of Leipzig, Germany
  • Ubiquitous Data Management and Mobile Databases – Evi Pitoura, U. of Ioannina
  • Query processing, query optimization – Guido Moerkotte, U. of Mannheim, Germany
  • Data Structures and data management algorithms – Edward Chang, Google, Beijing
  • Data Privacy and Security – Bhavani Thuraisingham, U. of Texas at Dallas, USA
  • Data Mining Algorithms – Ming-Syan Chen, National Taiwan U.
  • Data Mining Systems, Data Warehousing, OLAP and Architectures – Sunita Sarawagi, IIT Bombay, India
  • XML data Processing, Filtering, Routing, and Algorithms – Hans-Arno Jacobsen, U. of Toronto, Canada
  • XML and Relational Query Languages, Mappings and Engines – Christoph Koch, U. of Saarbruecken, Germany
  • Distributed, Parallel, Peer to Peer Databases – Peter Triantafillou, U. of Patras, Greece
  • Web Search and Deep Web – Luis Gravano, Columbia University, USA
  • Databases for Science – Claudia Medeiros, U. of Campinas, Brazil
  • Internet Grids, Web Services, Web 2.0, and Mashups – Alon Halevy, Google, USA and Sihem Amer-Yahia, Yahoo!, USA
  • Data Streams – Ugur Cetintemel, Brown U., USA
  • Sensor Networks – Philippe Bonnet, U. of Copenhagen, Denmark
  • Temporal and Multimedia Databases, Algorithms and Data Structures – Christian Jensen, U. of Aalborg, Denmark
  • Spatial and High Dimensional Databases, Algorithms and Data Structures – Dimitris Papadias, Hong Kong U. of Science and Technology, Hong Kong
  • Systems, Platforms, Middleware, Applications and Experiences – Martin Kersten, CWI, The Netherlands
  • Database System Internals, Performance and Self-tuning – Natassa Ailamaki, Carnegie Mellon U., USA
