STATISTICAL GUARANTEES FOR CONSENSUS CLUSTERING

Abstract

Consider the problem of clustering n objects. One can apply multiple algorithms to produce N potentially different clusterings of the same objects, that is, partitions of the n objects into K groups. Even a single randomized algorithm can output different clusterings; this often happens when one samples from the posterior of a Bayesian model, or runs multiple MCMC chains from random initializations. A natural task is then to form a consensus among these different clusterings. The challenge in an unsupervised setting is that the optimal matching between the clusters of different inputs is unknown. We model this problem as finding a barycenter (also known as a Fréchet mean) relative to the misclassification rate. We show that by lifting the problem to the space of association matrices, one can derive aggregation algorithms that circumvent the knowledge of the optimal matchings. We analyze the statistical performance of aggregation algorithms under a stochastic label perturbation model, and show that a K-means type algorithm followed by a local refinement step can achieve near-optimal performance, with a rate that decays exponentially fast in N. Numerical experiments show the effectiveness of the proposed methods.

1. INTRODUCTION

Clustering is a fundamental task in machine learning and data analysis. Given data on each of the n objects in a set, there are numerous algorithms to produce a clustering of these n objects, which is formally a partitioning of {1, . . . , n} into K disjoint sets. A natural problem that arises in practice is how to form a consensus among these clusterings. This is especially important if the different clusterings are produced by a single randomized algorithm. This situation often arises in Bayesian modeling, where the posterior naturally encodes the variability of the clustering problem. Finding a consensus clustering then corresponds to finding the center of the posterior, from which we can also obtain estimates of the variability of the posterior. A clustering of n objects can be viewed as a label vector in [K]^n, where [K] = {1, . . . , K}. We assume that we are given N label vectors z_j ∈ [K_j]^n for j = 1, . . . , N, each with a potentially different number of clusters. Let K = max_j K_j and note that we can view all z_j as vectors in [K]^n. The task is to obtain a consensus K-clustering, that is, a label vector z ∈ [K]^n which is close to all of z_1, . . . , z_N at the same time. We also refer to this task as the label aggregation problem. In the context of clustering, there is no meaning to the label of each cluster; that is, the label aggregation problem is unsupervised, in the sense that there is no natural correspondence between the labels of different clusterings. This is in contrast to label aggregation in classification, in which the labels have a common meaning across different input classifications. We refer to the latter task as supervised label aggregation. In the unsupervised setting, forming a consensus label is a nontrivial task due to the label-switching problem. Consider, for example, the case n = 5 and the two label vectors z_1 = (1, 1, 1, 2, 2) and z_2 = (2, 2, 2, 1, 1).
These two vectors differ in all 5 positions, but they define the same clustering of the objects. In this case, the consensus label z can be taken to be either z_1 or z_2. More generally, for every z_j, there could be a permutation π_j on [K] such that the permuted vectors π_j ∘ z_j := (π_j(z_{ji}))_{i=1}^n are closer to each other than the original z_j's. To formalize the above idea, we recall the definition of the misclassification rate between two label vectors z, y ∈ [K]^n:

    Mis(z, y) = min_π (1/n) Σ_{i=1}^n 1{z_i ≠ π(y_i)},   (1)

where the minimum is taken over all permutations π : [K] → [K]. Mis(·, ·) is a proper metric on the space of K-clusterings of n objects. It is also a metric on [K]^n if we identify vectors that are obtained from each other by label switching. We can now define the consensus label as the barycenter of z_1, . . . , z_N in the Mis(·, ·) metric, that is,

    ẑ ∈ argmin_{z ∈ [K]^n} Σ_{j=1}^N w_j Mis(z, z_j),   (2)

where w_j ≥ 0 are a given set of weights. We often assume uniform weights: w_j = 1 for all j. The barycenter ẑ is also known as the Fréchet mean. Solving (2) is complicated by the presence of the permutation in the definition of the Mis function. More explicitly, we need to solve

    ẑ ∈ argmin_{z ∈ [K]^n} min_{π_1, . . . , π_N} Σ_{j=1}^N Σ_{i=1}^n w_j 1{z_i ≠ π_j(z_{ji})},   (3)

showing that in addition to z, we have to optimize over the N permutations π_j, j = 1, . . . , N. In this paper, we provide alternative solutions that avoid optimizing over these permutations.
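The misclassification rate in (1) can be computed directly by minimizing over all K! label permutations. A minimal illustrative sketch (the function name `mis` and the brute-force search are ours; for large K one would use a bipartite matching solver instead):

```python
import itertools

def mis(z, y, K):
    """Misclassification rate Mis(z, y) of Eq. (1): minimize the fraction
    of disagreements over all permutations pi of the labels {1, ..., K}.
    Brute force over K! permutations, which is fine for small K."""
    n = len(z)
    best = n
    for perm in itertools.permutations(range(1, K + 1)):
        # perm[a - 1] plays the role of pi(a); count positions with z_i != pi(y_i)
        mismatches = sum(zi != perm[yi - 1] for zi, yi in zip(z, y))
        best = min(best, mismatches)
    return best / n

# The label-switching example from the text:
z1 = (1, 1, 1, 2, 2)
z2 = (2, 2, 2, 1, 1)
print(mis(z1, z2, 2))  # -> 0.0: identical clusterings up to label switching
```

Note that the naive Hamming distance between z_1 and z_2 is 1, while Mis(z_1, z_2) = 0, since the swap π(1) = 2, π(2) = 1 aligns them perfectly.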

Our contributions

The unsupervised version of the label aggregation problem is the realistic and practical one when dealing with aggregating labels from Bayesian clustering algorithms, since the posterior has K! modes corresponding to all possible label permutations, and the output will be near an arbitrary mode in each run of the algorithm. The main contributions of this paper to unsupervised aggregation are the following:

1. We show that by lifting the barycenter problem to the space of association matrices, one can derive algorithms that avoid optimizing over the unknown permutations (Section 2.1). In particular, we propose both a basic and a spectral K-means type aggregation algorithm.

2. We propose a random perturbation model (RPM) under which we can study the theoretical performance of both supervised and unsupervised aggregation algorithms. We prove the statistical consistency of the basic aggregation algorithm under RPM (Section 2.2).

3. Under RPM, the supervised setting corresponds to an oracle that knows the true matching permutations. By studying this oracle, we derive the optimal statistical misclassification rate for supervised aggregation (Section 3.1).

4. We propose an efficient local refinement step on the output of any consistent aggregation algorithm in the unsupervised setting, and show that the updated labels achieve nearly the same misclassification rate as the above oracle (Section 3.2).

Our theoretical analysis illustrates how different parameters affect the difficulty of the label aggregation problem. In Section 4, we provide numerical experiments comparing the performance of the proposed algorithms against each other and existing methods.
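To convey the intuition behind contribution 1, here is a toy sketch of the lifting idea (this is an illustration of the principle, not the paper's aggregation algorithms; the function name, the threshold, and the greedy linking rule are our own simplifications). Each z_j is mapped to its association matrix B_j, with B_j[i][k] = 1 iff objects i and k share a cluster in z_j. Since B_j is invariant to permuting the labels of z_j, averaging the B_j's requires no matching permutations:

```python
def consensus_by_association(labelings, n, threshold=0.5):
    """Toy consensus via lifting to association matrices (illustrative only).
    Bbar[i][k] is the fraction of input clusterings placing i and k together;
    it is invariant to label switching. We then read off a consensus by
    greedily linking pairs with Bbar above the threshold."""
    N = len(labelings)
    # Average association matrix: Bbar[i][k] = (1/N) sum_j 1{z_ji == z_jk}
    Bbar = [[sum(z[i] == z[k] for z in labelings) / N for k in range(n)]
            for i in range(n)]
    # Greedy consensus: each object joins the cluster of the first earlier
    # object it is strongly associated with, else starts a new cluster.
    labels = [0] * n
    next_label = 0
    for i in range(n):
        for k in range(i):
            if Bbar[i][k] > threshold:
                labels[i] = labels[k]
                break
        else:
            next_label += 1
            labels[i] = next_label
    return labels

# Three clusterings of 5 objects, two of them equal up to label switching:
zs = [(1, 1, 1, 2, 2), (2, 2, 2, 1, 1), (1, 1, 2, 2, 2)]
print(consensus_by_association(zs, 5))  # -> [1, 1, 1, 2, 2]
```

The paper's basic and spectral K-means type algorithms replace the crude thresholding step with a K-means or spectral clustering of the rows of the averaged association matrix.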

Related work

In the supervised setting, the problem of label aggregation is to combine multiple annotated datasets. The label inferred for each item from those produced by multiple annotators acts as the ground truth for the classification task. Various probabilistic models have been proposed for aggregating annotations, with parameters that account for the expertise of the annotators and the noise in the labeling process (47; 37). The unsupervised setting is more challenging, as there is no meaning to the cluster labels (the label-switching issue) and the clusterings can have a potentially different number of clusters. The idea of passing to association matrices to get around the label-switching issue has been leveraged in several existing approaches (13; 24; 43; 21; 29), although the connection we make to the lifted barycenter problem and the resulting spectral methods is, to the best of our knowledge, new. In (24; 43), the authors employ an Expectation-Maximization strategy to obtain a nonnegative matrix factorization of the combined association matrix. The authors of (41) provide

