SIAM Data Mining 2005: report on some of the talks
It has been over a week since I came back from SIAM Data Mining 2005. It is about time I put down on my blog some of the notes I took during the meeting. Notice that these notes are not meant to be accurate, always refer back to the original author…
Strategies for visual data mining by Ed Wegman
I’m not very excited by visual data mining, but this was a good talk. He started out by describing the field in terms of 4 levels (in increasing order of sophistication):
- static graphics
- interactive static graphics: the user can interact with the otherwise static graphics, I presume you can zoom in or out and so on.
- dynamic graphics: what gets plotted can be changed, presumably the user can request for only some type of data to be plotted as opposed to everything
- evolutionary graphics (streaming data)
Not all data sources are adequate for visual data mining. The scales are as follows… (in bytes)
- 10^2 (tiny)
- 10^10 (huge)
- 10^15 (super massive)
Best you can display is about 10^6 bytes (2333 by 1866 pixels).
He described several techniques for visual data mining:
- Paralel axis: instead of using one x-axis, you have several and a point becames a set of line going through all axis
- Grand tour: rotate plots over all possible angles
- Saturation brush: different clusters with different desaturated colors, frequent white, rare black
He has other innovative ideas. For example, present multidimensional data as an image and rotate the data. He calls this the pixel tour.
A tool can be downloaded from his web site.
Segmentation algorithms for time series and sequence data by Aristide Gionis and Heikki Mannila
They work on segmenting sequences, particularly genomic sequences. For them a sequence can be a string, a time series, a sequence of events or any variant on this. A genomic sequence is a sort of string. The goal of the segmentation problem is to have homogeneous segments.
The generic solution to the segmentation problem is dynamic programming. Optimal local segmentations can be glued together. Need to construct a n by k matrix. It is quadratic in n or worse, which can be very bad.
He describes a top-down approach which adds one segmentation point at a time and is n k log n, uses a heap.
He describes also a randomized algorithm (Himberg 2001) which starts with a random segmentation and take segmentation point at random and see if you can move it to a better location. Open problem: does it always converge?
He describes the sliding window algorithm which is provably optimal for the L_inf norm (for constant fitting?)
There are other approximation algorithms like approximative dynamic programming (Guha) and divide and segment (Terzi).
There are other interesting problems like the Bursty algorithm which finds dense intervals by dynamic programming; segmentation of data streams.
How do you choose the number of segments K?
He talks about BIC (bayesian approach): score is error plus penalty, error plus K log(n), related to MDL. Hypothesis testing: stop splitting segments when new segments have the same probability distribution.
Embedding by Mark Hansen
Key idea he presented was the idea of using the web to create “human sensors”. For example, using bird watchers to collect information about birds. Or plane watchers to collect information about planes. He suggests this could be used by environmentalists to monitor pollutants and stuff.
Simple Models for Customer-based Analysis by Peter Fader
This talk made quite an impression on me. He says everything must be implemented in excel and must be understandable by smart MBA students. The guy worked on the “CDNow 1997 data set”. He assumes that only 3 variables matter: monatary value of purchase, frequency of purchases, and recency. He assumes monatary value is independent of the couple frequency-recency. He models the transaction process as follows: an active customer buys according to a Poisson distribution and die after time t, exponentially distributed with rate mu. Dropout rates by user is gamma distributed. He simplifies his distributions by assuming discrete data. He points out that Poisson-Gamma distributions are often better than Zipf distributions. He says the best way to test a model is on conditional expectations. He says we find out that lesser customers (long tail?) are very important because we have many of them. [Note: he obviously focuses on retail. In a service industry, many lesser customers is not a good thing.]
He gets to an apparent contradiction with his model: customers who haven’t bought in a long time are more valuable if they are less frequent buyers. This can be explained by his model where users die: a frequent buyer who has bought in a long time is probably dead.
He explains that using simple excel-implemented model is a great way to lock in his customers: they will come back to him for more.
The practice of cluster analysis by Jon R. Kettenring
There are two types of cluster analysis: hierarchical (linkage and ward) and partitioning (k-means, mixture). He points out that hierarchical are by far more popular. He points out that cluster analysis is usually considered as a form of unsupervised learning by that even k-means is somewhat supervised: we assume clusters are about the same size and spherical.
There has been an exponential growth in the number of cluster papers between 1995 and 2003. Seems like it is slightly dropping in 2004.
Most people cluster on small data sets. It is a myth that clustering is done on large data sets.
One common problem is that the different axis use different units (cm, Kg…) and so you have to “rescale”. People do something called “autoscaling”. It is not certain that this is all done with rigor.
Clustering takes you away from the data and leads you to unexplainable conclusions. Clustering is often a black box. People using PCA left and right without understanding and it can be evil if the clusters are too large.
Truly automatic clustering is far from easy.
Online Learning by projecting by Yoram Singer (from Google)
I really liked this talk. He wants to build a linear classifier, that is, a learned vector w such that sign(w . x) classifies x (as spam or not, say). The margin of an example is the amplitude of w.x. He makes a separability assumption: the space can be separated in two with a non-zero margin.
The idea of the algorithm is to have classified x come in one by one, each time, you rotate w so that x is correctly classified, essentially projecting w on the space of all w such that w.x has the proper sign. In a variant, you don’t project all the way and only move w by epsilon.
He gives some references:
Kivinen et al., Online Learning with Kernels
Herbster, Learning additive models online with fast evaluating kernels
Bauschke, On projection algorithms for solving convexity feasibility problems