Philosophy

The aim of the IFCS Cluster Benchmark Data Repository is to stimulate better practice in benchmarking (the performance comparison of methods) for cluster analysis by providing a variety of well-documented, high-quality datasets and simulation routines for use in practical benchmarking.

At present, new methods, algorithms, and data-analytic devices (e.g., procedures to determine the number of clusters in a dataset) are sometimes proposed without adequate comparison with current best practice and without sufficient evidence of good performance on benchmark datasets.

Even more often, when such new methods are proposed, a few simulations are run and a few real datasets are analyzed in an unsystematic manner. Sometimes this involves comparisons with one or two alternative methods, but there is no discussion of the specific conditions under which a method is supposed to work well, what kinds of clusters it attempts to find, and what would therefore be the most appropriate competitors. Such work often implicitly assumes that there should be a unique best method for general clustering and then suggests, based on a handful of experiments, that the newly proposed method is a promising candidate for this position, without giving any indication of its potential weaknesses and limitations.

Benchmarking cluster analysis methods is indeed difficult. In particular, it is essentially more difficult than benchmarking supervised classification. The reason is that in supervised classification there is typically a single true grouping, and performance can be measured by (potentially weighted) misclassification percentages. In cluster analysis (unsupervised classification), however, different aims of clustering may lead to different clusterings of the same dataset, each of which could be optimal according to a different criterion (e.g., low overall within-cluster distances, optimal representation of every object by the centroid object of its cluster, or optimal fit by a mixture probability model).
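To make the contrast concrete, here is a minimal sketch (Python with scikit-learn, purely illustrative and not part of the repository) that computes two clusterings of the same simulated data, one by k-means and one by a Gaussian mixture model. Each partition is geared towards a different criterion, so neither criterion by itself settles which clustering is the "right" one; the data generator, the two methods, and the criteria shown are arbitrary choices made only for this demonstration.

```python
# Illustrative sketch only: two clusterings of the same data, each aiming at
# a different notion of a "good" cluster.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3,
                  cluster_std=[0.5, 1.0, 2.5], random_state=0)

def within_cluster_ss(X, labels):
    # Sum of squared distances of the points to their cluster means.
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_labels = gmm.predict(X)

# k-means directly minimizes the within-cluster sum of squares, whereas the
# mixture model maximizes a Gaussian likelihood with flexible covariances;
# the two partitions can disagree, and each looks better under its own criterion.
print("within-cluster SS (k-means):", within_cluster_ss(X, km_labels))
print("within-cluster SS (GMM):   ", within_cluster_ss(X, gmm_labels))
print("mean log-likelihood of fitted GMM:", gmm.score(X))
```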

Datasets with known classes are often used for benchmarking clustering in much the same way as for supervised classification. Sometimes these are simulated from mixture probability models with clusters defined by homogeneous mixture components; sometimes they are real datasets in which the problem of interest was originally one of supervised classification. In all of these cases, however, it is conceivable that clusterings other than those assumed to be "true" could be valid as well. Indeed, clusterings different from the "known" ones could even be of greater scientific interest, because in practice cluster analysis is used to find clusterings that are not yet known. For this reason, it is also of interest to use benchmark datasets that represent genuine clustering problems, i.e., datasets in which the "true" clustering is not yet known.
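As a minimal illustration of this common style of evaluation, the following sketch (again Python/scikit-learn, with an arbitrary dataset, method, and agreement index chosen only for illustration) compares a clustering to known classes using the adjusted Rand index; low agreement with the "true" labels does not by itself mean that the clustering is worthless.

```python
# Sketch of "recover the known classes" evaluation; the dataset, method, and
# agreement index are placeholder choices for illustration.
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, y_true = load_iris(return_X_y=True)   # classes originally defined for supervised classification
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# A value of 1 means perfect recovery of the known classes; values near 0 mean
# chance-level agreement.  A clustering that disagrees with y_true may still be
# meaningful with respect to another clustering aim.
print("adjusted Rand index vs. known classes:", adjusted_rand_score(y_true, labels))
```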

It is therefore particularly important for benchmarking clustering to properly define the clustering problem that a method aims to solve, by specifying as precisely as possible what kinds of clusters are of interest. With this, appropriate simulation setups, benchmark datasets, and competing existing clustering methods can be chosen more systematically.

The IFCS Cluster Benchmark Data Repository collects datasets with and without given "true" clusterings. A particular feature of the repository is that every dataset comes with comprehensive documentation, including information on the specific nature of the clustering problem in that dataset and the characteristics that useful clusters should fulfill, with scientific justification. Where "true" clusters are provided, such information is still required in order to understand properly what it means for a certain clustering method to perform well or badly in terms of misclassification, taking into account the possibility that clusterings other than the given one may also be good in some sense. Where "true" clusters are not provided, the given information can be used to evaluate the performance of clustering methods by means other than misclassification, for example based on quality criteria that can be computed from the clustered data, or based on other external variables or information if provided.
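For example, a clustering of a dataset without given "true" classes can be assessed by an internal quality index. The following sketch uses the average silhouette width (one of many possible criteria, chosen here as an assumption of the illustration rather than a recommendation of the repository) to compare k-means solutions with different numbers of clusters.

```python
# Minimal sketch of internal evaluation without a "true" clustering; the data,
# method, and silhouette index are illustrative choices only.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    # Larger values indicate more compact, well-separated clusters in the sense
    # of this particular index; other indices may rank the solutions differently.
    print(k, round(silhouette_score(X, labels), 3))
```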

The overarching principle is that cluster benchmarking does not simply mean ranking methods in a one-dimensional manner, but rather characterizing their advantages and shortcomings thoroughly against a more precise definition of what kind of clustering is of interest and against the characteristics of the datasets.

Furthermore, the IFCS Cluster Benchmark Data Repository offers a platform for many of the different kinds of datasets on which clustering may be of interest, including standard multivariate data, data with mixed types of variables, similarity and dissimilarity data, and various kinds of structured data such as time series, regression, spatial, symbolic, and graph data.