This week was held the Statistics and Optimization of Clustering Workshop (supported by the PASCAL network). There were many interesting talks and in particular, there was a great open discussion about foundations of clustering.
Several issues were raised in this discussion, and it seems that none of them could be settled in a satisfactory manner.
Among these issues, I noted the following:
- Definition of clustering: what is the goal of clustering? what does it mean to extract hidden information, or hidden structure from the data? The main issue here is that there are plenty of possible definitions but there does not seem to be one that encompasses every other.
- Difference between clustering and dimensionality reduction: can these things put together in the "exploratory data analysis" category? Does it make sense to separate them?
- Clustering with a goal in mind: most often, clustering is used as a first step of data analysis. This means that there are other steps, and defining the goal of clustering only makes sense when one knows what is really the ultimate goal (for example, in many cases, people use clustering as a pre-processing step for supervised learning).
- What is generalization in clustering: what is an appropriate formal setting for the clustering problem? Can we use the iid setting of supervised learning, where one is concerned with convergence of the generalization error as the sample size grows? Or should we (as proposed by Tali Tishby) be concerned with convergence as the number of features grows?
- Evaluation of clustering algorithms: is there some way to tell two clustering algorithms apart in terms of their "generalization ability" or their ability to find structure in the data? This seems like the most important yet most unclear point. Most people seemed to agree that a desirable feature of a clustering algorithm is stability, but that this is far from being sufficient. Some people were saying that the only way to evaluate the results is to show them to an expert. Others said that there should be a theory that allows to compare clustering algorithms.
The conclusion is that there are plenty of interesting open questions here, and many (brilliant) people are willing to work on them.
So I hope that progress will be made soon, and bring unsupervised learning to the level of formalization that supervised learning has reached.