STATISTICAL GUARANTEES FOR CONSENSUS CLUSTERING

Abstract

Consider the problem of clustering n objects. One can apply multiple algorithms to produce N potentially different clusterings of the same objects, that is, partitions of the n objects into K groups. Even a single randomized algorithm can output different clusterings. This often happens when one samples from the posterior of a Bayesian model, or runs multiple MCMC chains from random initializations. A natural task is then to form a consensus among these different clusterings. The challenge in an unsupervised setting is that the optimal matching between clusters of different inputs is unknown. We model this problem as finding a barycenter (also known as a Fréchet mean) relative to the misclassification rate. We show that by lifting the problem to the space of association matrices, one can derive aggregation algorithms that circumvent the need to know the optimal matchings. We analyze the statistical performance of aggregation algorithms under a stochastic label perturbation model, and show that a K-means type algorithm followed by a local refinement step can achieve near-optimal performance, with a rate that decays exponentially fast in N. Numerical experiments show the effectiveness of the proposed methods.

1. INTRODUCTION

Clustering is a fundamental task in machine learning and data analysis. Given data on each of the n objects in a set, there are numerous algorithms to produce a clustering of these n objects, which is formally a partition of {1, . . . , n} into K disjoint sets. A natural problem that arises in practice is how to form a consensus among these clusterings. This is especially important when the different clusterings are produced by a single randomized algorithm. This situation often arises in Bayesian modeling, where the posterior naturally encodes the variability of the clustering problem. Finding a consensus clustering then corresponds to finding the center of the posterior, from which we can also obtain estimates of the posterior variability.

A clustering of n objects can be viewed as a label vector in [K]^n, where [K] = {1, . . . , K}. We assume that we are given N label vectors z_j ∈ [K_j]^n for j = 1, . . . , N, with a potentially different number of clusters K_j for each. Let K = max_j K_j and note that we can view all z_j as vectors in [K]^n. The task is to obtain a consensus K-clustering, that is, a label vector z ∈ [K]^n which is close to all of z_1, . . . , z_N simultaneously. We also refer to this task as the label aggregation problem.

In the context of clustering, the label of each cluster carries no meaning; that is, the label aggregation problem is unsupervised, in the sense that there is no natural correspondence between the labels of different clusterings. This is in contrast to label aggregation in classification, in which the labels have a common meaning across different input classifications. We refer to the latter task as supervised label aggregation. In the unsupervised setting, forming a consensus label is a nontrivial task due to the label-switching problem. Consider, for example, the case n = 5 and the two label vectors z_1 = (1, 1, 1, 2, 2) and z_2 = (2, 2, 2, 1, 1). These two vectors differ in all 5 positions, yet they define the same clustering of the objects. In this case, the consensus label z can be taken to be either z_1 or z_2. More generally, for every z_j there could be a permutation π_j on [K] such that the permuted vectors π_j ∘ z_j := (π_j(z_{ji}))_{i=1}^n are closer to each other than the original z_j's.
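The label-switching phenomenon above can be made concrete in a few lines of code. The sketch below (our own illustration; the helper names and the use of the Hungarian algorithm via SciPy are assumptions, not notation from the paper) computes the misclassification rate between two label vectors minimized over label permutations, and also shows that the association-matrix lifting mentioned in the abstract is permutation-invariant: z_1 and z_2 have identical association matrices even though they disagree in every position.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_rate(z, z_prime, K):
    """Fraction of disagreeing positions, minimized over permutations of [K].

    The optimal label matching is found with the Hungarian algorithm
    applied to the confusion matrix of the two label vectors.
    """
    z, zp = np.asarray(z), np.asarray(z_prime)
    # Confusion matrix: C[a, b] = #{i : z_i = a+1 and z'_i = b+1}
    C = np.zeros((K, K), dtype=int)
    for a, b in zip(z, zp):
        C[a - 1, b - 1] += 1
    # Maximize total agreement over label permutations (minimize -C)
    row, col = linear_sum_assignment(-C)
    return 1.0 - C[row, col].sum() / len(z)

def association_matrix(z):
    """n-by-n matrix with entry (i, l) = 1 iff objects i and l share a cluster."""
    z = np.asarray(z)
    return (z[:, None] == z[None, :]).astype(int)

z1 = [1, 1, 1, 2, 2]
z2 = [2, 2, 2, 1, 1]
print(misclassification_rate(z1, z2, K=2))  # 0.0: the same clustering
print(np.array_equal(association_matrix(z1), association_matrix(z2)))  # True
```

Because the association matrix depends only on which objects are grouped together, not on the cluster labels themselves, working with these matrices sidesteps the unknown permutations entirely, which is the idea behind the lifting used in the paper.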

