STATISTICAL GUARANTEES FOR CONSENSUS CLUSTERING

Abstract

Consider the problem of clustering n objects. One can apply multiple algorithms to produce N potentially different clusterings of the same objects, that is, partitions of the n objects into K groups. Even a single randomized algorithm can output different clusterings. This often happens when one samples from the posterior of a Bayesian model, or runs multiple MCMC chains from random initializations. A natural task is then to form a consensus among these different clusterings. The challenge in an unsupervised setting is that the optimal matching between clusters of different inputs is unknown. We model this problem as finding a barycenter (also known as a Fréchet mean) relative to the misclassification rate. We show that by lifting the problem to the space of association matrices, one can derive aggregation algorithms that circumvent the knowledge of the optimal matchings. We analyze the statistical performance of aggregation algorithms under a stochastic label perturbation model, and show that a K-means type algorithm followed by a local refinement step can achieve near optimal performance, with a rate that decays exponentially fast in N. Numerical experiments show the effectiveness of the proposed methods.

1. INTRODUCTION

Clustering is a fundamental task in machine learning and data analysis. Given data on each of the n objects in a set, there are numerous algorithms to produce a clustering of these n objects, which is formally a partitioning of {1, . . . , n} into K disjoint sets. A natural problem that arises in practice is how to form a consensus among these clusterings. This is especially important if the different clusterings are produced by a single randomized algorithm. This situation often arises in Bayesian modeling, where the posterior naturally encodes the variability of the clustering problem. Finding a consensus clustering then corresponds to finding the center of the posterior, from which we can also obtain estimates of the variability of the posterior.

A clustering of n objects can be viewed as a label vector in $[K]^n$ where $[K] = \{1, \dots, K\}$. We assume that we are given N label vectors $z_j \in [K_j]^n$ for $j = 1, \dots, N$, with a potentially different number of clusters each. Let $K = \max_j K_j$ and note that we can view all $z_j$ as vectors in $[K]^n$. The task is to obtain a consensus K-clustering, that is, a label vector $z \in [K]^n$ which is close to all of $z_1, \dots, z_N$ at the same time. We also refer to this task as the label aggregation problem.

In the context of clustering, there is no meaning to the label of each cluster; that is, the label aggregation problem is unsupervised, in the sense that there is no natural correspondence between labels of different clusterings. This is in contrast to label aggregation in classification, in which the labels have a common meaning among different input classifications. We refer to the latter task as supervised label aggregation. In the unsupervised setting, forming a consensus label is a nontrivial task due to the label-switching problem. Consider, for example, the case n = 5 and the two label vectors $z_1 = (1, 1, 1, 2, 2)$ and $z_2 = (2, 2, 2, 1, 1)$. These two vectors differ in all 5 positions, but they define the same clustering of the objects. In this case, the consensus label z can be taken to be either $z_1$ or $z_2$. More generally, for every $z_j$, there could be a permutation $\pi_j$ on [K] such that the permuted vectors $\pi_j \circ z_j := (\pi_j(z_{ji}))_{i=1}^n$ are closer to each other than the original $z_j$'s.

To formalize the above idea, we recall the definition of the misclassification rate between two label vectors $z, y \in [K]^n$:
$$\mathrm{Mis}(z, y) = \min_{\pi} \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{z_i \neq \pi(y_i)\}$$
where the minimum is taken over all permutations $\pi : [K] \to [K]$. Mis(·, ·) is a proper metric on the space of K-clusterings of n objects. It is also a metric on $[K]^n$ if we identify vectors that are obtained from each other by label-switching. We can now define the consensus label as the barycenter of $z_1, \dots, z_N$ in the Mis(·, ·) metric, that is,
$$\hat z \in \operatorname*{argmin}_{z \in [K]^n} \sum_{j=1}^N w_j \,\mathrm{Mis}(z, z_j) \qquad (2)$$
where $w_j \geq 0$ are a given set of weights. We often assume uniform weights: $w_j = 1$ for all j. The barycenter $\hat z$ is also known as the Fréchet mean. Solving (2) is complicated by the presence of the permutation in the definition of the Mis function. More explicitly, we need to solve
$$\hat z \in \operatorname*{argmin}_{z \in [K]^n}\ \min_{\pi_1, \dots, \pi_N} \sum_{j=1}^N \sum_{i=1}^n w_j\, \mathbf{1}\{z_i \neq \pi_j(z_{ji})\} \qquad (3)$$
showing that, in addition to z, we have to optimize over N permutations $\pi_j$, $j = 1, \dots, N$. In this paper, we provide alternative solutions that avoid optimizing over these permutations.
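In practice, Mis can be computed without enumerating all K! permutations by solving an assignment problem on the K × K confusion matrix. The following is a minimal illustrative sketch (our own Python code, not part of the paper; it uses 0-based labels and a hypothetical helper name `mis`):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mis(z, y, K):
    """Misclassification rate Mis(z, y): minimum fraction of disagreements
    over all permutations of the labels of y (labels in {0, ..., K-1})."""
    z, y = np.asarray(z), np.asarray(y)
    n = len(z)
    # Confusion matrix: C[k, l] = #{i : z_i = k, y_i = l}.
    C = np.zeros((K, K), dtype=int)
    np.add.at(C, (z, y), 1)
    # Best label matching = assignment problem maximizing total agreement.
    row, col = linear_sum_assignment(C, maximize=True)
    return 1.0 - C[row, col].sum() / n

# The label-switching example from the text: same clustering, Mis = 0.
z1 = np.array([0, 0, 0, 1, 1])   # (1,1,1,2,2) with 0-based labels
z2 = np.array([1, 1, 1, 0, 0])   # (2,2,2,1,1)
print(mis(z1, z2, K=2))          # 0.0
```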

Our contributions

The unsupervised version of the label aggregation problem is the realistic and practical one when dealing with aggregating labels from Bayesian clustering algorithms, since the posterior has K! modes corresponding to all possible label permutations, and the output will be near an arbitrary mode in each run of the algorithm. The main contributions of this paper to unsupervised aggregation are the following:

1. We show that by lifting the barycenter problem to the space of association matrices, one can derive algorithms that avoid optimizing over the unknown permutations (Section 2.1). In particular, we propose both a basic and a spectral K-means type aggregation algorithm.
2. We propose a random perturbation model (RPM) under which we can study the theoretical performance of both supervised and unsupervised aggregation algorithms. We prove the statistical consistency of the basic aggregation algorithm under RPM (Section 2.2).
3. Under RPM, the supervised setting corresponds to an oracle that knows the true matching permutations. By studying this oracle, we derive the optimal statistical misclassification rate for supervised aggregation (Section 3.1).
4. We propose an efficient local refinement step on the output of any consistent aggregation algorithm in the unsupervised setting, and show that the updated labels achieve nearly the same misclassification rate as the above oracle (Section 3.2).

Our theoretical analysis illustrates how different parameters affect the difficulty of the label aggregation problem. In Section 4, we provide numerical experiments comparing the performance of the proposed algorithms against each other and existing methods.

Related work

In the supervised setting, the problem of label aggregation is to combine multiple annotated datasets. The label inferred for each item from those produced by multiple annotators acts as the ground truth for the classification task. Various probabilistic models have been proposed for aggregating annotations, with parameters to account for the expertise of the annotators and the noise in the labeling process (47; 37). The unsupervised setting is more challenging as there is no meaning to the cluster labels (the label-switching issue) and the clusterings can have potentially different numbers of clusters. The idea of passing to association matrices to get around the label-switching issue has been leveraged in several existing approaches (13; 24; 43; 21; 29), although the connection we make to the lifted barycenter problem and the resulting spectral methods is, to the best of our knowledge, new. In (24; 43), the authors employ an Expectation-Maximization strategy to obtain a nonnegative matrix factorization of the combined association matrix. The authors of (41) provide several approaches, using the hypergraph representation of the clusterings, that have shown promising results in the context of image segmentation (22). A set of fuzzy consensus algorithms is proposed in (40) that generate soft consensus partitions by combining a collection of fuzzy clusterings. What we refer to as the unsupervised label aggregation problem has appeared under many different names in the literature, including but not limited to cluster ensembles (41), the clustering/cluster ensemble problem (44; 10), ensemble clustering (1), clustering aggregation (17), combining clusterings (42; 14; 27), consensus clustering (48; 18) and the median partition problem (12; 46; 18). As can be surmised from the variety of names, there is a copious literature on the subject, spanning multiple fields, with many ideas rediscovered time after time. We refer to the excellent surveys (48; 44; 18) for a more exhaustive list of references and historical discussions. There is also a parallel line of work in the Bayesian clustering literature on aggregating the posterior clusterings (31; 7; 9; 25; 15; 45; 8). The barycentric view of aggregation that we take in (2) has appeared in many previous works, but often with a different distance in place of Mis, including but not limited to the symmetric difference distance (SDD), a.k.a. the Mirkin metric (up to a constant), in the median partition problem (39; 23; 12; 17; 18), the Binder loss and the variation of information (VI) in (45; 15), the normalized mutual information (NMI) in the pioneering work of (41), the adjusted Rand index (15), and the category utility function (42; 35). After introducing our methods, in Section 2.3, we give a more detailed comparison with the literature. We choose Mis as the distance in the present work for a better comparison with the oracle problem in Section 3.1. We also show in Appendix B.1 that consistency in Mis implies consistency in other distances. Despite the voluminous literature on the subject, statistical analysis of the methods under a statistical model for the input clusterings has not been undertaken before. This is the gap that we fill in this paper, by providing the first consistency and optimality results under a statistical model (the RPM) for a method of clustering aggregation that we propose. To the best of our knowledge, the question of consistency, let alone optimality, has not been considered for any method of aggregation before.
We also shed more light on the relation between the barycentric approach and those based on association matrices (Section 2.3), and on how a convex relaxation leading to a spectral method can be used to approximate the median partition. To illustrate the importance of statistical analysis, we also show that a simple common approach to the median partition problem, known as BestOfK (18), is in general inconsistent under the RPM, despite having been shown to be a 2-factor approximation of the median partition problem (12). This further highlights the kind of insight that statistical analysis under a model can provide, which is not attainable from CS-type theory on approximation algorithms alone.

2. LIFTED AGGREGATION ALGORITHMS

We start by introducing some notation. Let $E_K = \{e_k\}_{k=1}^K$ be the set of standard basis vectors of $\mathbb{R}^K$. The elements of $E_K$ can be considered one-hot encodings of the labels in [K]. From now on, instead of encoding labels as elements of [K], we encode them as elements of $E_K$. We can then view labelings of n objects as elements of the set
$$E_K^n = \{Z = (z_1, \dots, z_n) : z_i \in E_K \ \forall i \in [n]\}.$$
Each $z_i$ is viewed as a $K \times 1$ vector and each element of $E_K^n$ as a $K \times n$ matrix, which we refer to as a label matrix. For $Z \in E_K^n$, permuting the cluster labels is equivalent to pre-multiplication by a $K \times K$ permutation matrix P, that is, $PZ$. The label aggregation problem can be restated as follows: Given label matrices $Z_1, \dots, Z_N \in E_K^n$, find a consensus label matrix $\hat Z \in E_K^n$ by solving the barycenter problem
$$\hat Z \in \operatorname*{argmin}_{Z \in E_K^n}\ \min_{P_1, \dots, P_N} \sum_{j=1}^N w_j\, \|Z - P_j Z_j\|_F^2 \qquad (5)$$
where $P_1, \dots, P_N$ are $K \times K$ permutation matrices and $\|X\|_F := \big(\sum_{i,j} X_{ij}^2\big)^{1/2}$ is the Frobenius norm of a matrix X. One can verify that problem (5) is equivalent to (3). The following result shows that if we know the optimal permutations $P_j$'s, we can easily find the barycenter $\hat Z$:

Proposition 1. Let $\{\hat P_1, \dots, \hat P_N\}$ be an optimal set of permutation matrices in (5). Then, the optimal solution $\hat Z$ of (5) is the columnwise "argmax" of $\sum_j w_j \hat P_j Z_j$.

Algorithm 1 Basic label aggregation algorithm.
1: Form the association matrices $X_j = Z_j^T Z_j$.
2: Form the average association matrix $\bar X = \sum_{j=1}^N w_j X_j$.
3: Perform K-means on the rows of $\bar X$.

Algorithm 2 Spectral label aggregation algorithm.
1: Define the same average association matrix $\bar X$ as in Algorithm 1.
2: Perform the K-truncated eigendecomposition $\bar X \approx U \Lambda U^T$, where $\Lambda \in \mathbb{R}^{K \times K}$ contains the top-K eigenvalues on its diagonal and the columns of $U \in \mathbb{R}^{n \times K}$ are the corresponding eigenvectors.
3: Perform K-means on the rows of U.
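To make the one-hot representation and Proposition 1 concrete, here is a small illustrative sketch (our own code, not from the paper; `one_hot` and `barycenter_given_perms` are hypothetical helper names):

```python
import numpy as np

def one_hot(z, K):
    """Encode a label vector z in {0, ..., K-1}^n as a K x n label matrix."""
    Z = np.zeros((K, len(z)))
    Z[z, np.arange(len(z))] = 1.0
    return Z

def barycenter_given_perms(Zs, Ps, w=None):
    """Proposition 1: if the optimal permutations P_j are known, the barycenter
    is the columnwise argmax of sum_j w_j P_j Z_j (returned as a label vector)."""
    w = np.ones(len(Zs)) if w is None else np.asarray(w, dtype=float)
    S = sum(wj * Pj @ Zj for wj, Pj, Zj in zip(w, Ps, Zs))
    return S.argmax(axis=0)

# Two relabelings of the same clustering: the matching permutation undoes the switch.
Z1 = one_hot(np.array([0, 0, 0, 1, 1]), K=2)
Z2 = one_hot(np.array([1, 1, 1, 0, 0]), K=2)
swap = np.array([[0.0, 1.0], [1.0, 0.0]])
print(barycenter_given_perms([Z1, Z2], [np.eye(2), swap]))  # [0 0 0 1 1]
```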

2.1. LIFTING TO ASSOCIATION MATRICES

The difficulty in the unsupervised setting is that the optimal permutations $\{\hat P_j\}$ are unknown. To get around this issue, we lift the barycenter problem to the space of association matrices. For a label matrix $Z \in E_K^n$, we define the corresponding association matrix as $X = Z^T Z \in \{0,1\}^{n \times n}$. Note that $X_{ij} = 1$ iff i and j are in the same cluster according to Z, and $X_{ij} = 0$ otherwise. The advantage of X is that it is invariant to label switching: $X = Z^T Z = (PZ)^T (PZ)$ for any permutation matrix P. This suggests solving the following lifted barycenter problem instead of (3):
$$\hat X \in \operatorname*{argmin}_{X \in \mathcal{X}_K} \sum_{j=1}^N w_j \|X - X_j\|_F^2 \qquad (6)$$
where $X_j = Z_j^T Z_j$ and $\mathcal{X}_K = \{Z^T Z : Z \in E_K^n\}$, that is, the set of (binary) association matrices with at most K clusters.

Semidefinite relaxation. Problem (6) is still hard to solve due to the combinatorial nature of $\mathcal{X}_K$. An approach to solving problems over $\mathcal{X}_K$ is to relax them to a semidefinite program, an idea that has been applied before to community detection in networks (4). In particular, $\mathcal{X}_K$ is inside the doubly nonnegative cone $\{X : X \succeq 0,\ X \geq 0\}$, where $X \succeq 0$ and $X \geq 0$ mean that X is positive semidefinite and elementwise nonnegative, respectively. We also note that $X_{ii} = 1$ for all i. This suggests relaxing to the following problem:
$$\hat X \in \operatorname*{argmin}_{X} \Big\{ \sum_{j=1}^N w_j \|X - X_j\|_F^2 \ :\ X \succeq 0,\ X \geq 0,\ X_{ii} = 1\ \forall i \Big\}. \qquad (7)$$
Problem (7) has a simple solution. The solution of the unconstrained version of (7) over $\mathbb{R}^{n \times n}$ is $X' := \sum_{j=1}^N w_j X_j / \sum_{j=1}^N w_j$. Since the $\{X_j\}$ belong to the constraint set of (7) and this set is convex, $X'$ too belongs to the constraint set. Hence, $X'$ is the solution of (7), that is, $\hat X = X'$. It remains to translate $\hat X$ back to labels, for which we can perform rowwise K-means, leading to Algorithm 1. Since elements of $\mathcal{X}_K$ are of rank at most K, to get a solution closer to that of the lifted barycenter problem (6), we can perform a spectral truncation of $\hat X$ to its top K eigenvectors before applying the rowwise K-means. This leads to the spectral aggregation Algorithm 2. Other variants of spectral clustering on $\hat X$ are also possible, e.g., using the normalized Laplacian. In the K-means step of Algorithms 1 and 2, any constant-factor approximation to the K-means problem can be used.
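Both algorithms are straightforward to implement. The sketch below is one possible realization (our own illustrative code, not the authors' reference implementation), using scikit-learn's KMeans as a stand-in for any constant-factor approximate K-means solver; the label matrices $Z_j$ are K × n arrays as above, and the helper names are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def average_association(Zs, w=None):
    """Steps 1-2 of Algorithms 1 and 2: Xbar = sum_j w_j Z_j^T Z_j."""
    w = np.ones(len(Zs)) if w is None else np.asarray(w, dtype=float)
    return sum(wj * Zj.T @ Zj for wj, Zj in zip(w, Zs))

def basic_aggregation(Zs, K, w=None, seed=0):
    """Algorithm 1 (sketch): K-means on the rows of the average association matrix."""
    Xbar = average_association(Zs, w)
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(Xbar)

def spectral_aggregation(Zs, K, w=None, seed=0):
    """Algorithm 2 (sketch): K-means on the rows of the top-K eigenvector matrix of Xbar."""
    Xbar = average_association(Zs, w)
    vals, vecs = np.linalg.eigh(Xbar)   # symmetric matrix, eigenvalues in ascending order
    U = vecs[:, -K:]                    # top-K eigenvectors (n x K)
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U)
```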

2.2. CONSISTENCY

In order to study the statistical performance of different aggregation algorithms, we propose a random perturbation model (RPM), in which both the clusters and the labels of the clusters can undergo random perturbations, allowing us to study the difficulty of the unsupervised aggregation problem. Let $Z^* \in E_K^n$ be the "true" label matrix with columns $z_i^* \in E_K$, $i = 1, \dots, n$.

Definition 1 (RPM). We write $Z \sim \mathcal{L}(Z^*, p)$ if $Z = (z_1, \dots, z_n) \in E_K^n$ with columns $z_i$ drawn independently as follows:
$$z_i = P z_i', \qquad z_i' \sim (1-p)\,\delta_{z_i^*} + p\,\mathrm{Unif}(E_K), \qquad (8)$$
where $\delta_{z_i^*}$ is the point mass at $z_i^*$ and P is an independent $K \times K$ random permutation matrix.

Under the RPM, the observed label matrices $Z_1, \dots, Z_N$ are i.i.d., so it is reasonable to let $w_1 = \cdots = w_N = 1$ in Algorithms 1 and 2. We will discuss the algorithm for weighted samples in Section 5. Let $n_k(Z^*)$ be the number of objects in cluster k according to $Z^*$. We make the following assumption:
$$n_k(Z^*) \leq \beta n / K, \qquad k \in [K], \qquad (9)$$
for some $\beta \in [1, K]$. Here, $\beta$ measures how much the true clustering deviates from being balanced. For $\beta = 1$, we have $n_k(Z^*) = n/K$ for all k, while $\beta = K$ corresponds to no restriction on the sizes of the true clusters. The following result shows that the basic aggregation algorithm is statistically consistent under the RPM:

Theorem 1 (Consistency). Let $Z^* \in E_K^n$ be a label matrix satisfying (9). Assume that $Z_1, \dots, Z_N$ are i.i.d. draws from $\mathcal{L}(Z^*, p)$ and let $\xi := p(2-p)$. Let Mis be the misclassification rate between the true label matrix $Z^*$ and the output of Algorithm 1 with $w_j = 1$ for all $j \in [N]$. Then, there exists a universal constant $C > 0$ such that
$$\mathbb{E}[\mathrm{Mis}] \leq \frac{C\, \xi\, \beta^2 K}{(1-\xi)^2} \Big( \frac{2\beta^2}{N} + \frac{\xi K}{n} \Big). \qquad (10)$$

Consistency of Algorithm 1 follows from (10) and the Markov inequality: For any $\delta > 0$, we have $\mathbb{P}(\mathrm{Mis} \geq \delta) \leq \delta^{-1} \mathbb{E}[\mathrm{Mis}] \to 0$ as $n, N \to \infty$, provided p is bounded away from 1. We note that in this and subsequent results all the parameters, such as K and p, are allowed to change as $n, N \to \infty$, subject to the conditions of the theorems. The first term inside the parentheses in (10) is the dominant one. Assume for simplicity that $\beta \asymp 1$. If the model has low noise, then p is small and so is $\xi$, since $\xi \asymp p$. Then, the dominant term is $O(pK/N)$; that is, a smaller number of clusters, K, and a larger sample size, N, improve the performance. The second term is independent of the sample size, but vanishes at the rate $O(p^2 K^2 / n)$ in the low noise setting, as the number of objects, n, grows.
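For simulation purposes, drawing from the RPM of Definition 1 is simple. The following sketch (our own illustrative code, with the hypothetical name `sample_rpm`) generates N label vectors, applying the elementwise noise and then a common random permutation of the K labels per sample:

```python
import numpy as np

def sample_rpm(z_star, K, p, N, rng=None):
    """Draw N label vectors from L(Z*, p): each entry keeps its true label with
    probability 1-p and is resampled uniformly from {0,...,K-1} with probability p;
    then a single random permutation of the K labels is applied to the whole vector."""
    rng = np.random.default_rng(rng)
    n = len(z_star)
    samples = []
    for _ in range(N):
        z = np.asarray(z_star).copy()
        flip = rng.random(n) < p
        z[flip] = rng.integers(0, K, size=flip.sum())
        perm = rng.permutation(K)          # label switching, common to all objects
        samples.append(perm[z])
    return samples

# Example: n = 300 objects, K = 3 balanced true clusters, N = 50 noisy clusterings.
z_star = np.repeat(np.arange(3), 100)
Zs = sample_rpm(z_star, K=3, p=0.5, N=50, rng=0)
```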

2.3. LITERATURE COMPARISON

The relation between the metrics on clusterings is discussed in (34; 33). Let $d'_M(Z, Z_j)$ be the Mirkin metric between clusterings Z and $Z_j$ (34, Eqn (6)). As we show in Appendix B, it turns out that $d'_M(Z, Z_j) = \|X - X_j\|_F^2 = \|X - X_j\|_{\ell_1}$, leading to $d'_M/2 = \mathrm{SDD} = \mathrm{Binder} = \binom{n}{2}(1 - \mathrm{Rand})$, where Binder denotes the Binder loss (5) and Rand the Rand index (36). It follows that the lifted barycenter (6) that we derived is equivalent to the median partition (12) and the Binder loss barycenter of (45), as well as the Rand barycenter (15). This problem is often solved by greedy search starting from a random initialization (45; 8). Our Algorithm 2 then provides a fast, scalable spectral method for obtaining an approximate solution to this ubiquitous problem.

Many algorithms proposed for consensus clustering operate on the average association matrix $\bar X$ by treating it as a similarity matrix and performing a standard clustering on it, for example, agglomerative clustering (31; 18; 14; 28). Wade and Ghahramani (45) criticize these approaches as being ad hoc compared to the decision-theoretic approach of finding a barycenter. However, we show that $\bar X$ is indeed a solution to the relaxed version (7) of the barycenter problem (6), which is the same as the Binder barycenter in (45); hence clustering $\bar X$ is effectively solving the same problem with a different method. A K-means based aggregation algorithm, called KCC, has been proposed in (48). In our notation, this is equivalent to concatenating the $Z_j$'s row-wise to form an $NK \times n$ matrix and running K-means on the columns. This is different from our Algorithm 1, which operates on $\bar X$. We compare with KCC in Section 4. Unlike KCC, Algorithm 1 comes with a consistency guarantee under our model assumptions. BestOfK (12) essentially solves (6) by restricting the feasible region to $\{X_1, \dots, X_N\}$, hence picking the lowest-scoring input clustering. The approach proposed in (7; 9) (see also (15)) is to find the input clustering that minimizes the cost $X \mapsto \|X - \bar X\|_F$. It is not hard to see that the barycenter cost (6) equals a constant plus a multiple of $\|X - \bar X\|_F^2$ (essentially a bias-variance decomposition; see (26)). Hence, the approach of (7; 9) is equivalent to BestOfK. As we show in Appendix A, BestOfK is, in general, inconsistent under the RPM unless N grows exponentially in n, a very strong condition not needed by our algorithms. The K-means step (step 3) of Algorithm 1 can be replaced by other clustering algorithms, e.g. average-linkage clustering, BestOfK, or the CC Pivot algorithm of (18; 12). Besides global clustering algorithms, many authors, including (41; 17; 18; 45; 8), update one label at a time to minimize the barycenter loss, based on any of the metrics discussed earlier, including the information-theoretic ones (NMI and VI (32)). In (26), after reducing (6) to minimizing $X \mapsto \|X - \bar X\|_F^2$, the authors write $X = Z^T Z$ (in our notation), relax Z to a general nonnegative matrix, and solve the problem with nonnegative matrix factorization. Their work has the flavor of our Algorithm 2, although our approach, being based on a regular spectral decomposition, is highly robust and scalable. The RPM, with P set equal to the identity, is closely related to the artificial data model considered in (18), with the differences that the RPM does not potentially create a new cluster after perturbation, and that it introduces a random permutation of the cluster labels after perturbation (via P).
In Appendix E, we argue that RPM is a good model of a concentrated posterior, hence consistency under RPM is relevant to Bayesian aggregation problems for which posterior consistency has been shown.

3. OPTIMAL RATE

Theorem 1 guarantees an $O(N^{-1})$ rate of misclassification for Algorithm 1. A natural question is whether we can do better. To answer this question, we first consider the best that an oracle with knowledge of the random permutations in (8) can do. This oracle is effectively solving the supervised version of the problem. We then show that a refinement step allows us to achieve nearly the optimal oracle rate, without knowing the matching permutations.

3.1. SUPERVISED ORACLE

Let $Z'$ be a label matrix with ith column $z_i' \sim (1-p)\delta_{z_i^*} + p\,\mathrm{Unif}(E_K)$, and let $Z_1, \dots, Z_N$ be independent copies of $Z'$. We would like to recover the true label matrix $Z^*$. This is the oracle version of model (8), since the $Z_j$'s are label matrices without random permutations. In this case, the problem decouples into n independent label recovery problems. We further simplify the problem to that of deciding between $z_1^* = e_1$ and $z_1^* = e_2$. Letting $q := p/K$ and $\bar p := (K-1)q = p - q$, this problem is equivalent to the hypothesis test
$$H_0 : \mathrm{Multinomial}\big(N, (1-\bar p, q, \dots, q)\big) \quad \text{versus} \quad H_1 : \mathrm{Multinomial}\big(N, (q, 1-\bar p, \dots, q)\big).$$
A classical result from information theory (6, Theorem 11.9.1) allows us to determine the optimal performance in this case:

Proposition 2. The Bayesian error probability, $P_N$, for testing $H_0$ against $H_1$, with positive prior probabilities, is bounded by $e^{-NI}$, where
$$I := -\log\big( 2\sqrt{(1-\bar p)\,q} + (K-2)q \big) \qquad (11)$$
is the best achievable error exponent, in the sense that $-\frac{1}{N}\log P_N \to I$ as $N \to \infty$.

The left panel of Figure 1 shows plots of I as a function of p for various K, and the right panel shows the convergence of the exponent of $P_N$ for K = 2. The Bayesian error probability $P_N$ can be achieved by performing a likelihood ratio test between the two hypotheses. We can generalize this result to testing between K hypotheses, in which case the Bayesian error probability is bounded by $(K-1)e^{-NI}$. The oracle algorithm that achieves this bound is the one that finds the columnwise "argmax" of the average of the $Z_j$'s. In light of Proposition 2, the bound in Theorem 1 is far from optimal since it guarantees a linear decay of the error in N, that is $O(N^{-1})$, rather than the exponential decay $e^{-NI}$. The question is whether this gap can be filled by a non-oracle algorithm.

Algorithm 3 Local refinement algorithm (given an initial label matrix Z).
1: for i = 1, . . . , n do
2:   Let $\bar X_i$ be the ith column of $\bar X = \sum_j w_j X_j$.
3:   Replace the ith column of Z by zeros and denote this matrix by $Z_{-i}$.
4:   Let $(n_1, \dots, n_K)$ be the row sums of $Z_{-i}$.
5:   Let $(b_1, \dots, b_K) = Z_{-i} \bar X_i$.
6:   Update the ith label to $\operatorname{argmax}_k (b_k / n_k)$.
7: end for
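Returning to Proposition 2: the error exponent I and the oracle rule (columnwise argmax of the average of the aligned label matrices) are easy to compute. The following is a minimal illustrative sketch (our own code; `chernoff_exponent` and `oracle_aggregation` are hypothetical names, and the formula assumes the definitions q = p/K and p̄ = p − q as above):

```python
import numpy as np

def chernoff_exponent(p, K):
    """Error exponent I of equation (11), with q = p/K and pbar = p - q."""
    q = p / K
    pbar = p - q
    return -np.log(2.0 * np.sqrt((1.0 - pbar) * q) + (K - 2) * q)

def oracle_aggregation(Zs_aligned):
    """Supervised oracle: columnwise argmax of the average of the (already aligned)
    K x n label matrices Z_j, i.e. a plurality vote per object."""
    return np.mean(Zs_aligned, axis=0).argmax(axis=0)

print(chernoff_exponent(p=0.5, K=2))  # exponent of the oracle error rate e^{-N I}
```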

3.2. LOCAL REFINEMENT

To approach the oracle rate, we propose a fast local refinement of the label of each object, as outlined in Algorithm 3. This refinement can be performed on the output of any reasonable aggregation algorithm. The idea of performing a local refinement to boost the performance of clustering algorithms has been employed in various settings, including clustering of sub-Gaussian mixtures (30) and community detection in stochastic block models (SBM) (3; 16; 50; 51). A more detailed comparison with the SBM appears in Appendix F. Algorithm 3 requires a good initial label matrix Z, with a small number of mismatches relative to the true matrix $Z^*$. The algorithm focuses on the local information of the ith object. With $\bar X$ the (weighted) average of the association matrices, $\bar X_{ij}$ measures the weighted proportion of input clusterings in which objects i and j appear in the same cluster. Viewing $\bar X$ as the adjacency matrix of a weighted graph, one can verify that $b_k = \sum_{j \neq i} \bar X_{ji}\,\mathbf{1}\{z_j = e_k\}$ is the weighted number of connections between object i and the objects currently assigned to cluster k. The algorithm then normalizes this number of connections by the cluster size $n_k$. Thus, $b_k/n_k$ estimates the probability that object i is connected to another object in cluster k. The higher this probability, the higher the chance that object i belongs to cluster k. The last step in the for loop updates the ith label according to these statistics. It is possible to repeat the local refinement procedure by feeding its output back as the initial label matrix. Given a good initialization, local refinements usually converge in a constant or $O(\log n)$ number of steps (30).
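A direct implementation of Algorithm 3 is short. The sketch below (our own illustrative code, operating on 0-based label vectors rather than label matrices; `local_refinement` is a hypothetical name) performs the per-object update based on the ratios $b_k/n_k$ described above:

```python
import numpy as np

def local_refinement(Xbar, z_init, K, n_rounds=1):
    """Sketch of Algorithm 3: one-object-at-a-time relabeling. For each object i,
    votes[k] = sum of Xbar[j, i] over objects j != i currently in cluster k, and the
    label of i is updated to argmax_k votes[k] / n_k."""
    z = np.asarray(z_init).copy()
    n = len(z)
    for _ in range(n_rounds):
        for i in range(n):
            mask = np.arange(n) != i                                  # drop column i (Z_{-i})
            counts = np.bincount(z[mask], minlength=K).astype(float)  # n_1, ..., n_K
            votes = np.zeros(K)
            np.add.at(votes, z[mask], Xbar[mask, i])                  # b_1, ..., b_K
            with np.errstate(divide="ignore", invalid="ignore"):
                ratio = np.where(counts > 0, votes / counts, -np.inf)
            z[i] = int(ratio.argmax())
    return z
```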

3.3. ACHIEVING THE SUPERVISED RATE WITHOUT SUPERVISION

From the oracle result (Proposition 2), we expect the optimal misclassification rate to be close to $e^{-NI}$. This is verified by the next result, showing that a single local refinement step applied to a consistent aggregation algorithm, such as Algorithm 1, gets us nearly to the optimal rate:

Theorem 2. Assume that $Z_1, \dots, Z_N$ are i.i.d. draws from the random perturbation model (8). Let $\hat Z$ be the output of Algorithm 3 initialized by some, possibly data-dependent, $\tilde Z$. Let $n_{\min} = \min_{k \in [K]} n_k(Z^*)$ be the smallest true cluster size and recall the definition of I in (11). Assume that (a) $n_{\min}\, p\,(1 \wedge I)/K \to \infty$ and $NI/\log K \to \infty$, and that there exists $\delta$ satisfying $Kn\delta/(n_{\min}\, p\, I) = o(1)$ such that one of the following holds: (b1) $\mathbb{P}(\mathrm{Mis}(\tilde Z, Z^*) \leq \delta) = 1 - o(1)$, or (b2) $\mathbb{E}[\mathrm{Mis}(\tilde Z, Z^*)] \leq \delta$. Then, for some $\eta = o(1)$, the misclassification rate satisfies
$$\mathbb{P}\big( \mathrm{Mis}(\hat Z, Z^*) \leq e^{-(1-\eta)NI} \big) = 1 - o(1). \qquad (12)$$

The assumptions of Theorem 2 are mild. Suppose that p is bounded away from 0 and 1, say $p \in [0.01, 0.99]$, that $K = O(1)$ and that the cluster sizes are comparable. Then the assumptions simplify to $n_{\min}, N \to \infty$ and $\delta = o(1)$. The theorem is most interesting when $p \to 1$ and K is large. In this case, $I \to 0$ and the first requirement of assumption (a) becomes $n_{\min} I/K \to \infty$. This assumption guarantees that the samples provide sufficient information to recover the permutations, although we have not attempted to do so in our algorithm. The second requirement of assumption (a) provides the evidence needed to distinguish the true labels from the other K - 1 labels. Under the assumptions of Theorem 2, Algorithm 3 initialized with an input satisfying (12) will have an output satisfying (b1). Hence, Theorem 2 also guarantees rate-optimality of an iterated Algorithm 3.

4. EXPERIMENTAL RESULTS

We now present empirical results comparing the performance of the proposed aggregation algorithms, with additional results provided in Appendix C. The ground truth label matrix $Z^*$ is generated by randomly assigning each of the n objects to one of the K labels. The N input clusterings $Z_j$, $j \in [N]$, are generated from model (8). We measure the performance of an algorithm by the adjusted Rand index (ARI) of its output against the ground truth. We consider seven different aggregation algorithms: (1) our Algorithm 1, referred to as "Basic" in the plots; (2) our Algorithm 2, referred to as SC; (3) the KCC algorithm (48); (4) the CC Pivot algorithm (2; 18) with threshold 0.25; (5) the Best One Element Move (BOEM) algorithm (11; 18); (6) the EM algorithm of (24); and (7) the BestOfK algorithm (12; 18). In addition, we consider variants of algorithms (1)-(5) where we apply our refinement step to their output. This gives us a total of 12 methods. In the plots, the refined version is denoted with a solid line and the original version (without refinement) with a dotted line. We also use the average ARI of the N input clusterings, denoted INPUT, as a baseline.

Balanced setting with varying n and N. Figure 2 depicts plots of ARI versus the noise probability p in model (8), for various methods. The results are averaged over 40 replications. The settings in Figure 2 all correspond to balanced cluster sizes. Generally, our proposed Basic and SC algorithms outperform the EM, KCC, CC Pivot and BOEM algorithms, with the failure thresholds occurring at larger values of p (harder problems). In some settings, the refinement shows some improvement, but in others, the output of the refinement applied to Basic and SC nearly coincides with the original algorithm. This shows that in some settings the original algorithm implicitly performs the refinement itself. We note that increasing N shifts the failure thresholds to the right as expected, as does increasing n, both consistent with our theory. Note that BestOfK performs no better than INPUT. Moreover, the refined versions of KCC and BOEM outperform their original versions.

Unbalanced setting. Figure 3 depicts the results obtained with disproportionate cluster sizes, specifically with a proportion $p_1$ of the objects in one cluster and the rest distributed uniformly over the remaining clusters. Local refinement over Basic and SC performs significantly better as we deal with input clusterings of disproportionate cluster sizes, especially at lower noise probabilities.
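A minimal end-to-end sketch of the balanced setting, reusing the illustrative helpers defined in the earlier sketches (`one_hot`, `sample_rpm`, `average_association`, `basic_aggregation`, `local_refinement`); all parameter values here are illustrative and not the exact settings of the figures:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n, K, N, p = 300, 6, 50, 0.6
z_star = rng.integers(0, K, size=n)

Zs_labels = sample_rpm(z_star, K, p, N, rng=rng)   # N noisy, label-switched clusterings
Zs = [one_hot(z, K) for z in Zs_labels]            # K x n label matrices

z_hat = basic_aggregation(Zs, K)                                   # Algorithm 1 sketch
z_ref = local_refinement(average_association(Zs), z_hat, K)       # Algorithm 3 sketch

print("mean input ARI :", np.mean([adjusted_rand_score(z_star, z) for z in Zs_labels]))
print("Basic ARI      :", adjusted_rand_score(z_star, z_hat))
print("Basic+refine   :", adjusted_rand_score(z_star, z_ref))
```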

5. CONCLUSION AND DISCUSSION

In the present paper, we defined the random perturbation model to study the label aggregation problem. We developed a K-means type algorithm followed by an efficient local refinement step to achieve the optimal misclassification rate under the assumptions of the model. Numerical experiments also show the effectiveness of our proposed methods. Let us also discuss possible avenues for future work.

A single-stage algorithm. Two-stage algorithms have been popular in many clustering problems (3; 30; 16). Numerical experiments show that, in many cases, a K-means type algorithm performs sufficiently close to the local refinement with good initialization (Section 4). The K-means type algorithm assigns labels based on the distances between the objects and the centers. This criterion is different from the likelihood ratio test in many cases, and so the output of the algorithm will not achieve the misclassification rate of the oracle problem. It would be a notable improvement if the K-means type algorithm could be generalized to an EM-type algorithm, so that the "distance" between the parameter and the object is defined by the likelihood. Whether such an algorithm exists, and how efficient it is statistically and computationally, are interesting open questions.

A robust algorithm. We observe i.i.d. label matrices from the random perturbation model defined in Definition 1. As long as p < 1, every label matrix from this model provides the same amount of information in expectation, so there is no reason to assign different weights to label matrices. However, in practice, the samples may not be i.i.d.

Supplementary material for "Statistical Guarantees for Consensus Clustering"

This supplement contains the detailed proofs of the results and some extra simulations.

A INCONSISTENCY OF BESTOFK

Using the notation of the present paper, the name "BestOfK" should be "BestOfN". We will use our notation in the following proposition and keep the name "BestOfK".

Proposition 3. BestOfK is not consistent unless N grows exponentially fast in n.

Proof. We prove this proposition by providing a counterexample. Suppose $K = 2$, $1 - \bar p = 0.6$ and $q = 0.4$. Then, for a label vector z from the RPM, by Hoeffding's inequality,
$$\mathbb{P}\big(\mathrm{Mis}(z, z^*) \geq 0.1\big) \geq 1 - \exp\big(-2(0.4-0.1)^2 n\big) - \exp\big(-2(0.6-0.1)^2 n\big) \geq 1 - 2\exp(-0.18 n),$$
where we have accounted for the two permutations in the definition of Mis. Suppose we observe N i.i.d. label vectors $z_1, \dots, z_N$ from the RPM. Then
$$\mathbb{P}\Big( \min_{i \in [N]} \mathrm{Mis}(z_i, z^*) \geq 0.1 \Big) \geq \big(1 - 2\exp(-0.18 n)\big)^N \geq 1 - 2N \exp(-0.18 n).$$
This probability (of missing the target) approaches 1 unless N grows exponentially fast in n.
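For completeness, BestOfK as described in Section 2.3 amounts to picking the input whose association matrix minimizes the weighted total squared Frobenius distance to the others. A minimal illustrative sketch (our own code, hypothetical function name):

```python
import numpy as np

def best_of_k(Zs, w=None):
    """BestOfK (sketch): restrict the lifted barycenter (6) to the inputs themselves
    and return the index of the lowest-scoring input clustering."""
    Xs = [Z.T @ Z for Z in Zs]                      # association matrices of the inputs
    w = np.ones(len(Xs)) if w is None else np.asarray(w, dtype=float)
    costs = [sum(wj * np.sum((X - Xj) ** 2) for wj, Xj in zip(w, Xs)) for X in Xs]
    return int(np.argmin(costs))
```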

B RELATIONS AMONG CLUSTERING DISTANCES

Let $n_k$ be the size of the kth cluster of $Z \in E_K^n$, and $n^*_\ell$ the size of the ℓth cluster of $Z^* \in E_L^n$, and let X and $X^*$ be the corresponding association matrices. The Mirkin distance (34, Eqn (6)) is given by
$$d'_M(Z, Z^*) = \sum_k n_k^2 + \sum_\ell (n^*_\ell)^2 - 2 \sum_{k,\ell} n_{k\ell}^2 \qquad (13)$$
where $n_{k\ell}$ is the number of objects that are in cluster k according to Z and in cluster ℓ according to $Z^*$. It is not hard to see that $\sum_k n_k^2 = \|X\|_F^2$ and similarly $\sum_\ell (n^*_\ell)^2 = \|X^*\|_F^2$. We also have $Z(Z^*)^T = (n_{k\ell})$, hence, using $\|A\|_F^2 = \mathrm{tr}(AA^T)$,
$$\sum_{k,\ell} n_{k\ell}^2 = \|Z(Z^*)^T\|_F^2 = \mathrm{tr}\big(Z(Z^*)^T Z^* Z^T\big) = \mathrm{tr}\big((Z^*)^T Z^* Z^T Z\big) = \mathrm{tr}(X^* X).$$
Combining these facts, we obtain the first equality below:
$$d'_M(Z, Z^*) = \|X - X^*\|_F^2 = \|X - X^*\|_{\ell_1}. \qquad (14)$$
The second equality follows from $X - X^*$ having elements in $\{-1, 0, 1\}$. Here $\|\cdot\|_{\ell_1}$ denotes the $\ell_1$ norm of a matrix viewed as a vector. The equality $d'_M(Z, Z^*) = \|X - X^*\|_{\ell_1}$ immediately shows that $d'_M$ is indeed a distance on the space of clusterings. It also connects the Mirkin distance with the Rand index. To see the connection, let $N_{\mathrm{disagree}}$ be the number of pairs of objects for which Z and $Z^*$ disagree about their co-clustering, that is, whether the two objects are in the same cluster or not. Similarly, let $N_{\mathrm{agree}}$ be the number of pairs of objects for which Z and $Z^*$ agree about their co-clustering. We have $N_{\mathrm{disagree}} + N_{\mathrm{agree}} = \binom{n}{2}$. The Rand index is defined as the proportion of agreements, that is, $\mathrm{Rand} = N_{\mathrm{agree}}/\binom{n}{2}$. It is easy to see that $\|X - X^*\|_{\ell_1} = 2 N_{\mathrm{disagree}}$, where the factor of 2 is due to the double-counting caused by the symmetry of $X - X^*$. This proves the relation
$$\tfrac{1}{2} d'_M = \binom{n}{2}\,(1 - \mathrm{Rand}).$$
The symmetric difference distance (SDD) is another name for $N_{\mathrm{disagree}}$, hence $d'_M/2 = \mathrm{SDD}$. The Binder loss is defined as half the expression in (13).

B.1 CONSISTENCY IN Mis IMPLIES CONSISTENCY IN MIRKIN DISTANCE

Let us now show that consistency in Mis implies consistency in the normalized Mirkin distance defined as $d_M := d'_M/n^2$; see (34, Eqn (9)). It then follows that consistency in Mis implies consistency in the normalized SDD, the normalized Binder loss and the Rand index, as discussed above. This claim follows from the following inequality:

Proposition 4. We have $d_M \leq 2 \cdot \mathrm{Mis}$.

Proof. Let X and $X^*$ be the association matrices corresponding to the label vectors z and $z^*$. Then $d'_M(z, z^*) = \|X - X^*\|_{\ell_1}$, as shown in (14). The entries of $X - X^*$ take values in $\{-1, 0, 1\}$. Assume, WLOG, that the optimal permutation between z and $z^*$ is the identity. Then:
1. If the label $z_i = z^*_i$, then the ith row of $X - X^*$ has at most $n \cdot \mathrm{Mis}$ nonzero entries. There are at most n such rows.

2. If the label $z_i \neq z^*_i$, then the ith row of $X - X^*$ has at most n nonzero entries. There are at most $n \cdot \mathrm{Mis}$ such rows.

Therefore, $d'_M = \|X - X^*\|_{\ell_1} \leq n \cdot (n \cdot \mathrm{Mis}) + (n \cdot \mathrm{Mis}) \cdot n = 2 n^2 \cdot \mathrm{Mis}$, which proves the claim.

C ADDITIONAL EXPERIMENTAL RESULTS

Figure 4 shows some extra cases of unbalanced cluster sizes (various values of $p_1$ as defined earlier), showing the significant improvement of the refinement step in such cases. All the results for the unbalanced case (including those in the main text) are averaged over 120 runs. Tables 1, 2 and 3 show the average ARI in all eight settings (abbreviated Set in the tables) shown in Figures 2, 3 and 4. The tables show the performance of the methods at noise probabilities p = 0.45, 0.55 and 0.65, respectively, corresponding to a cross-section of each plot at a line parallel to the y-axis, crossing the x-axis at the respective value of p.

D PROOFS

D.1 PROOF OF PROPOSITION 1

For any $Z \in E_K^n$, we have $\|Z\|_F^2 = \sum_{k,i} Z_{ki}^2 = \sum_{k,i} Z_{ki} = n$. Thus, $\|Z\|_F^2 = \|\hat P_j Z_j\|_F^2 = n$ for all $j \in [N]$. Hence, solving (5) is equivalent to maximizing $f(Z) := \sum_{j=1}^N w_j\, \mathrm{tr}(Z^T \hat P_j Z_j) = \mathrm{tr}(Z^T \bar Z)$ over $E_K^n$, where $\bar Z := \sum_j w_j \hat P_j Z_j$. Let $Z = (z_1, \dots, z_n)$ and $\bar Z = (\bar z_1, \dots, \bar z_n)$. Maximizing $f(Z) = \sum_{i=1}^n \langle z_i, \bar z_i \rangle$ is a separable problem over i, and maximizing $z \mapsto \langle z, \bar z_i \rangle$ over $E_K$ amounts to finding the index of the maximum element of $\bar z_i$, that is, the "argmax" of $\bar z_i$, as claimed.

D.2 PROOF OF THEOREM 1

We have $Z_j = P_j Z_j'$, where $Z_j' = (z'_{j1}, \dots, z'_{jn})$ and the $z'_{ji}$ are i.i.d. draws as in (8). Since the algorithm is invariant to the permutations $P_j$, without loss of generality we assume $P_j = I_K$, hence $Z_j = Z_j'$. We write $X^* = (Z^*)^T Z^*$ for the true association matrix. Let $E_n$ be the all-ones $n \times n$ matrix.

Lemma 1. Let $Z \sim \mathcal{L}(Z^*, p)$ and let $X = Z^T Z$ be the corresponding association matrix. Then,
$$M := \mathbb{E}[X] = (1-\xi) X^* + \xi\Big( \tfrac{1}{K} E_n + \big(1 - \tfrac{1}{K}\big) I_n \Big),$$
where $\xi = p(2-p)$.

Proof of Lemma 1. We have $X_{ij} = (Z^T Z)_{ij} = \langle z_i, z_j \rangle$ and $\mathbb{E}[z_i] = (1-p) z_i^* + \frac{p}{K}\mathbf{1}_K$. For $i \neq j$, $z_i$ and $z_j$ are independent, hence
$$\mathbb{E} X_{ij} = \langle \mathbb{E} z_i, \mathbb{E} z_j \rangle = \Big\langle (1-p) z_i^* + \tfrac{p}{K}\mathbf{1}_K,\ (1-p) z_j^* + \tfrac{p}{K}\mathbf{1}_K \Big\rangle = (1-p)^2 \langle z_i^*, z_j^* \rangle + 2p(1-p)\tfrac{1}{K} + p^2 \tfrac{1}{K}.$$
For $i = j$, we have $\mathbb{E}[X_{ii}] = 1$. The above shows that
$$\mathbb{E}[X] = (1-p)^2 X^* + p(2-p)\tfrac{1}{K} E_n + p(2-p)\big(1 - \tfrac{1}{K}\big) I_n,$$
which simplifies to the desired expression.

Let $Z_1, \dots, Z_N \sim \mathcal{L}(Z^*, p)$ be independent draws, and let $X_j = Z_j^T Z_j$ be the associated association matrices. Setting $\bar X = \frac{1}{N}\sum_{t=1}^N X_t$, we obtain
$$\mathbb{E}\|\bar X - M\|_F^2 = \sum_{ij} \mathbb{E}(\bar X_{ij} - M_{ij})^2 = \sum_{ij} \mathrm{var}(\bar X_{ij}) = \frac{1}{N} \sum_{ij} \mathrm{var}(X_{ij}).$$
We have $\mathrm{var}(X_{ij}) = 0$ for $i = j$. For $i \neq j$, one has $X_{ij} \sim \mathrm{Ber}\big((1-\xi) X^*_{ij} + \xi/K\big)$, hence
$$\mathrm{var}(X_{ij}) = (1-\xi) X^*_{ij} + \tfrac{\xi}{K} - \Big[ (1-\xi)^2 X^*_{ij} + \tfrac{2\xi}{K}(1-\xi) X^*_{ij} + \tfrac{\xi^2}{K^2} \Big] = \psi(\xi)\big(1 - \tfrac{2}{K}\big) X^*_{ij} + \psi(\xi/K),$$
where $\psi(x) = x(1-x)$. Note that $\xi = p(2-p) \in (0,1)$. It follows that
$$N \cdot \mathbb{E}\|\bar X - M\|_F^2 \leq \psi(\xi)\big(1 - \tfrac{2}{K}\big) \sum_{ij} X^*_{ij} + n^2 \psi(\xi/K),$$
where the inequality is due to bounding $\mathrm{var}(X_{ii})$ by the same formula used for $\mathrm{var}(X_{ij})$, $i \neq j$. Let $n^*_k$ be the number of entities in cluster k of $Z^*$, that is, $n^*_k = (Z^* \mathbf{1}_n)_k$. We have $\sum_{ij} X^*_{ij} = \|Z^* \mathbf{1}_n\|^2 = \sum_k (n^*_k)^2$. Using the assumption $n^*_k \leq \beta n/K$, we have
$$N \cdot \mathbb{E}\|\bar X - M\|_F^2 \leq \psi(\xi)\big(1 - \tfrac{2}{K}\big) \frac{\beta^2 n^2}{K} + n^2 \psi(\xi/K).$$

Calculating the center separations. Let $\bar M = (1-\xi) X^* + (\xi/K) E_n$. We note that $M - \bar M$ is diagonal and $\|M - \bar M\|_F^2 = \|\xi(1 - 1/K) I_n\|_F^2 = \xi^2 (1-1/K)^2 n \leq \xi^2 n$. It follows that
$$\mathbb{E}\|\bar X - \bar M\|_F^2 = \mathbb{E}\sum_{i \neq j} (\bar X_{ij} - \bar M_{ij})^2 + \mathbb{E}\sum_i (\bar X_{ii} - \bar M_{ii})^2 \leq \mathbb{E}\|\bar X - M\|_F^2 + \xi^2 n.$$
We obtain
$$\frac{1}{n^2}\,\mathbb{E}\|\bar X - \bar M\|_F^2 \leq \frac{2}{N}\Big[ \psi(\xi)\big(1 - \tfrac{2}{K}\big)\frac{\beta^2}{K} + \psi(\xi/K) \Big] + \frac{2\xi^2}{n}.$$
The matrix $\bar M$ is a K-means matrix with K distinct rows. If $z^*_i = r \neq k = z^*_{i'}$, then
$$\|\bar M_{i*} - \bar M_{i'*}\|^2 = (1-\xi)^2 \|X^*_{i*} - X^*_{i'*}\|^2 = (1-\xi)^2 (n^*_r + n^*_k) \geq 2(1-\xi)^2 \frac{n}{\beta K},$$
using $n^*_k \geq n/(\beta K)$, which holds by assumption (9). We have $n_r \delta_r^2 \geq 2(1-\xi)^2 \big(\frac{n}{\beta K}\big)^2$, which gives the following bound, using (49, Proposition 1):
$$\mathbb{E}[\mathrm{Mis}_r] \lesssim \frac{1}{N(1-\xi)^2}\Big[ \psi(\xi)(K-2)\beta^4 + \beta^2 K^2 \psi(\xi/K) \Big] + \frac{\xi^2}{(1-\xi)^2}\,\frac{\beta^2 K^2}{n}.$$
Here, $\mathrm{Mis}_r$ is the misclassification rate over true cluster r. The dependence on β of the first term is $O(\beta^2)$ when $K = 2$ and $O(\beta^4)$ when $K > 2$. Ignoring this difference, we can simplify the bound by noting that $K^2 \psi(\xi/K) = K\xi(1 - \xi/K) \leq K\xi$ and $\beta^2 \leq \beta^4$. Then,
$$\mathbb{E}[\mathrm{Mis}_r] \lesssim \frac{\xi}{(1-\xi)^2}\,\frac{2K\beta^4}{N} + \frac{\xi^2}{(1-\xi)^2}\,\frac{\beta^2 K^2}{n},$$
from which the bound in the theorem follows, since $\mathrm{Mis} = \sum_r (n^*_r/n)\,\mathrm{Mis}_r$.

D.3 PROOF SKETCH FOR THEOREM 2

For the benefit of the reader, we first give a proof sketch for Theorem 2 and its key lemma. A detailed proof is given in Appendix D.4. The proof of Theorem 2 relies on the following key lemma:

Lemma 2. Let $B(\delta)$ denote the set of label matrices Z with at most $n\delta$ labels different from $Z^*$, and let $\hat Z(Z)$ be the output of Algorithm 3 with initial label matrix Z. If $n_{\min}\, p\,(1 \wedge I)/K \to \infty$ and $(\log K)/(NI) \to 0$, then
$$\mathbb{P}\big( \exists Z \in B(\delta) \text{ such that } \hat z_i(Z) \neq z^*_i \big) \leq e^{-(1-\eta')NI + \frac{3Kn\delta N}{2 p n_{\min}}} \qquad (17)$$
for some $\eta' = o(1)$.

The first step is to prove the case $\delta = 0$ in Lemma 2, corresponding to the initial label matrix in Algorithm 3 being $Z^*$. If $z^*_i = e_1$, then the algorithm fails to recover $z^*_i$ if there exists $k \neq 1$ such that $Y_k := n_1 b_k - n_k b_1 \geq 0$. $Y_k$ is the average of i.i.d. samples, where each sample follows a mixture model depending on which of the events $z_i = e_1$, $z_i = e_k$, or $z_i \notin \{e_1, e_k\}$ happens. We compute the MGF of $Y_k$ and obtain the bound
$$\mathbb{E}\exp\big( t N Y_k/(n_1 n_2 (1-p)) \big) \leq \Big( (1-\bar p)\, e^{-t(1+o(1))} + q\, e^{t(1+o(1))} + (K-2)q\, e^{\frac{2qt^2}{n_{\min}(1-p)^2}} \Big)^N.$$
The choice of t has little effect on the last term since $n_{\min}$ is large, so we set $t = \frac12 \log[(1-\bar p)/q]$ to minimize $(1-\bar p)e^{-t} + q e^{t}$. Under the regularity conditions of the lemma and the definition of I in (11), we have
$$\mathbb{E}\exp\big( t N Y_k/(n_1 n_2 (1-p)) \big) \leq \Big( 2\sqrt{(1-\bar p)q} + (K-2)q \Big)^{(1-o(1))N} = e^{-(1-o(1))NI}.$$
Applying the Chernoff inequality, it follows that
$$\mathbb{P}\big( \hat z_i(Z^*) \neq z^*_i \big) \leq \sum_{k=2}^K \mathbb{P}(Y_k \geq 0) \leq \sum_{k=2}^K \mathbb{E}\exp\big( t N Y_k/(n_1 n_2 (1-p)) \big) \leq (K-1)\, e^{-(1-o(1))NI}.$$
Using the assumption $(\log K)/(NI) \to 0$, we obtain $\mathbb{P}(\hat z_i(Z^*) \neq z^*_i) \leq e^{-(1-\eta)NI}$. This proves the case $\delta = 0$.

Now we compare the $Y_k$'s obtained from Algorithm 3 initialized with the label matrices $Z^*$ and Z, denoted by $Y_k(Z^*)$ and $Y_k(Z)$, respectively. For all $Z \in B(\delta)$, we show that $|Y_k(Z^*) - Y_k(Z)| \leq 3(n_1 \vee n_k)\, n\delta$ if $z^*_i = e_1$, giving
$$\mathbb{P}\big( \exists Z \in B(\delta) \text{ such that } \hat z_i(Z) \neq z^*_i \big) \leq \sum_{k=2}^K \mathbb{P}\big( Y_k \geq -3(n_1 \vee n_k)\, n\delta \big).$$
We apply the Chernoff inequality with the same choice of t to obtain (17).

We now turn to the proof of Theorem 2. Let $\hat Z(Z)$ be the output of Algorithm 3 with initial label matrix Z. Consider the event $A_\delta = \{\mathrm{Mis}(\tilde Z, Z^*) \leq \delta\}$. For any $\varepsilon > 0$,
$$\mathbb{P}\big( \mathrm{Mis}(\hat Z(\tilde Z), Z^*) > \varepsilon \big) \leq \mathbb{P}(A_\delta^c) + \mathbb{P}\big( \exists Z \in B(\delta),\ \mathrm{Mis}(\hat Z(Z), Z^*) > \varepsilon \big).$$
We have $\mathbb{P}(A_\delta^c) = o(1)$ under assumption (b1). Letting $\varepsilon = NI\, e^{-(1-\eta')NI + \frac{3Kn\delta N}{2 p n_{\min}}}$, one can verify that the second probability also converges to 0 under the conditions of the theorem, and $\varepsilon = e^{-(1-o(1))NI}$. This proves (12) under assumption (b1). For the proof under assumption (b2), please see Appendix D.4.
D.4 DETAILED PROOF OF THEOREM 2

Let $\hat Z(Z)$ be the output of Algorithm 3 with initial label matrix Z. Consider the event $A_{\delta'} = \{\mathrm{Mis}(\tilde Z, Z^*) \leq \delta'\}$. For any $\varepsilon > 0$,
$$\mathbb{P}\big( \mathrm{Mis}(\hat Z(\tilde Z), Z^*) > \varepsilon \big) \leq \mathbb{P}(A_{\delta'}^c) + \mathbb{P}\big( \exists Z \in B(\delta'),\ \mathrm{Mis}(\hat Z(Z), Z^*) > \varepsilon \big). \qquad (18)$$
If assumption (b1) holds, we let $\delta' = \delta$ so that $\mathbb{P}(A_{\delta'}^c) = o(1)$. If assumption (b2) holds, we let $\delta' = \big(n_{\min}\, p\, I\, \delta/(Kn)\big)^{1/2}$ so that $Kn\delta'/(n_{\min}\, p\, I) = \big(Kn\delta/(n_{\min}\, p\, I)\big)^{1/2} = o(1)$ and, by Markov's inequality,
$$\mathbb{P}\big(\mathrm{Mis}(\tilde Z, Z^*) > \delta'\big) \leq \frac{1}{\delta'}\,\mathbb{E}[\mathrm{Mis}(\tilde Z, Z^*)] \leq \frac{\delta}{\delta'} = \Big(\frac{Kn\delta}{n_{\min}\, p\, I}\Big)^{1/2} = o(1),$$
so (b1) is satisfied with δ replaced by δ′. Therefore, it is enough to consider assumption (b1) and take $\delta' = \delta$ for the rest of the proof.

Let $\pi^*$ be the permutation corresponding to $\mathrm{Mis}(\tilde Z, Z^*)$ in assumption (b1), that is, $\pi^* = \operatorname{argmin}_\pi \sum_{i=1}^n \mathbf{1}\{\tilde z_i \neq \pi(z^*_i)\}$. Since we can always take $\pi^*(z^*)$ to be the true label, without loss of generality we can assume $\pi^*$ is the identity. Writing $T_2$ for the second term in (18),
$$T_2 \leq \mathbb{P}\Big( \exists Z \in B(\delta),\ \sum_{i=1}^n \mathbf{1}\{\hat z_i(Z) \neq z^*_i\} > n\varepsilon \Big) \leq \mathbb{P}\Big( \sum_{i=1}^n \mathbf{1}\big\{ \exists Z \in B(\delta),\ \hat z_i(Z) \neq z^*_i \big\} > n\varepsilon \Big).$$
By Markov's inequality, we obtain
$$T_2 \leq \frac{1}{n\varepsilon} \sum_{i=1}^n \mathbb{P}\big( \exists Z \in B(\delta),\ \hat z_i(Z) \neq z^*_i \big) \leq \frac{1}{\varepsilon}\, e^{-(1-\eta')NI + \frac{3Kn\delta N}{2 p n_{\min}}},$$
where the second inequality follows from Lemma 2, given assumption (a) of the theorem. Assumption (a) of the theorem also implies $\frac{3Kn\delta N}{2 p n_{\min}} = o(NI)$, so this term can be absorbed into $\eta' NI$, giving $T_2 \leq \frac{1}{\varepsilon} e^{-(1-\eta'')NI}$ for some $\eta'' = o(1)$. Letting $\varepsilon = NI\, e^{-(1-\eta'')NI}$, we get $T_2 \leq 1/(NI) = o(1)$ and $\varepsilon = e^{-(1-o(1))NI}$. Combined with $\mathbb{P}(A_{\delta'}^c) = o(1)$, we obtain (12) as desired.

D.4.1 AN AUXILIARY LEMMA

We state the case $\delta = 0$ in Lemma 2 as a separate lemma and prove it first. Recall that $n_{\min} = \min_{k \in [K]} n_k$, where $n_k$ is the size of the kth true cluster. We have the following lemma.

Lemma 3 (Local refinement with $Z^*$). Suppose the initial label matrix in Algorithm 3 is $Z^*$, and assume $n_{\min}\, p\,(1 \wedge I)/K \to \infty$ and $(\log K)/(NI) \to 0$. Then, for some $\eta = o(1)$ and every $i \in [n]$,
$$\mathbb{P}\big( \hat z_i(Z^*) \neq z^*_i \big) \leq e^{-(1-\eta)NI}. \qquad (20)$$

Proof of Lemma 3. Let $q := p/K$ and $\bar p := (K-1)q = p - q$. We first focus on the probability $\mathbb{P}(b_2/n_2 \geq b_1/n_1)$ for object $i = 1$. Since $z^*_1 = e_1$ by assumption, $z_1$ takes the value $e_1$ and any of $e_\ell$, $\ell \neq 1$, with probabilities $1-\bar p$ and q, respectively. For $i \in C^*_1$, $z_i$ has the same distribution as $z_1$. For $i \in C^*_2$, $z_i$ takes the value $e_2$ and any of $e_\ell$, $\ell \neq 2$, with probabilities $1-\bar p$ and q, respectively. Note that, per sample,
$$(b_1, b_2) \mid z_1 \;\sim\; \begin{cases} \mathrm{Bin}(n_1, 1-\bar p) \otimes \mathrm{Bin}(n_2, q) & \text{if } z_1 = e_1, \\ \mathrm{Bin}(n_1, q) \otimes \mathrm{Bin}(n_2, 1-\bar p) & \text{if } z_1 = e_2, \\ \mathrm{Bin}(n_1, q) \otimes \mathrm{Bin}(n_2, q) & \text{if } z_1 \notin \{e_1, e_2\}, \end{cases}$$
where ⊗ is the notation for the product measure, that is, $b_1$ and $b_2$ are independent in each case. The three possibilities above hold with probabilities $1-\bar p$, q and $(K-2)q$, respectively. Let $Y = n_1 b_2 - n_2 b_1$ and let $M_Y(\lambda)$ be the moment-generating function (MGF) of Y.

Let $\psi(\lambda; p) = 1 - p + p e^\lambda$ be the MGF of a Ber(p) variable. Then, the MGF of Bin(n, p) is $\psi(\lambda; p)^n$ and hence, for example,
$$\mathbb{E}[e^{\lambda Y} \mid z_1 = e_1] = \psi(\lambda n_1; q)^{n_2} \cdot \psi(-\lambda n_2; 1-\bar p)^{n_1}.$$
Let $\phi(\lambda; \mu) = \exp(\mu(e^\lambda - 1))$ be the MGF of Poi(µ) and note that $\psi(\lambda; p)^n \leq \phi(\lambda; np)$. Then, for example, we have
$$\mathbb{E}[e^{\lambda Y} \mid z_1 = e_1] \leq \exp\Big( n_2 q\,(e^{\lambda n_1} - 1) + n_1(1-\bar p)\,(e^{-\lambda n_2} - 1) \Big) \leq \exp\Big( n_1 n_2 \lambda\,(q - 1 + \bar p)\,(1 + o(1)) \Big),$$
assuming that $\lambda(n_1 + n_2) = o(1)$. Take $\lambda = t\,[n_1 n_2 (1-p)]^{-1}$ for some $t \geq 0$ to be determined below. Noting $q - 1 + \bar p = -(1-p)$,
$$\mathbb{E}[e^{\lambda Y} \mid z_1 = e_1] \leq \exp\big( -t(1 + o(1)) \big).$$
The case $z_1 = e_2$ is argued similarly, and we obtain the bound $\mathbb{E}[e^{\lambda Y} \mid z_1 = e_2] \leq \exp\big( t(1 + o(1)) \big)$. For $z_1 \notin \{e_1, e_2\}$, we perform a second-order expansion, assuming $\lambda = o(1)$:
$$\phi(\lambda; \mu) = \exp\Big( \mu\big(\lambda + \tfrac12 \lambda^2 + o(\lambda^2)\big) \Big) \leq \exp\big( \mu(\lambda + \lambda^2) \big),$$
and obtain
$$\psi(\lambda n_1; q)^{n_2} \cdot \psi(-\lambda n_2; q)^{n_1} \leq \exp\big( \lambda^2 n_1 n_2 (n_1 + n_2)\, q \big).$$
Let $\gamma := 2q/(1-p)^2$ and let $n_{\mathrm{har}} := 2 n_1 n_2/(n_1 + n_2)$ be the harmonic mean of $n_1$ and $n_2$; note that $n_{\mathrm{har}} \geq n_{\min}$. We have $\lambda^2 n_1 n_2 (n_1 + n_2) q = t^2 (n_1+n_2) q/(n_1 n_2 (1-p)^2) = \gamma t^2/n_{\mathrm{har}}$. To summarize, the conditional MGF satisfies
$$\mathbb{E}\big[ e^{t Y/(n_1 n_2 (1-p))} \mid z_1 \big] \leq \begin{cases} \exp(-t(1+o(1))) & \text{if } z_1 = e_1, \\ \exp(t(1+o(1))) & \text{if } z_1 = e_2, \\ \exp(\gamma t^2/n_{\mathrm{har}}) & \text{if } z_1 \notin \{e_1, e_2\}. \end{cases}$$
Recall that the events $z_1 = e_1$, $z_1 = e_2$ and $z_1 \notin \{e_1, e_2\}$ happen with probabilities $1-\bar p$, q and $(K-2)q$, respectively. It follows that
$$M_Y\big( t/(n_1 n_2(1-p)) \big) \leq (1-\bar p)\, e^{-t(1+o(1))} + q\, e^{t(1+o(1))} + (K-2)q\, e^{\gamma t^2/n_{\mathrm{har}}}. \qquad (22)$$
Let us set
$$t = \tfrac12 \log\big( (1-\bar p)/q \big) = \tfrac12 \log\big( 1 + K(1-p)/p \big), \qquad (23)$$
so that $(1-\bar p)e^{-t} = q e^{t} = \sqrt{(1-\bar p) q}$. Then $t \geq 0$ and, since $\log(1+x) \leq x$, we have
$$t \leq K(1-p)/(2p). \qquad (24)$$
The condition $\lambda(n_1 + n_2) = o(1)$ is satisfied under the assumption $n_{\min}\, p/K \to \infty$, since
$$\lambda(n_1 + n_2) = \frac{(n_1 + n_2)\, t}{n_1 n_2 (1-p)} = \frac{2t}{n_{\mathrm{har}}(1-p)} \leq \frac{K(1-p)/p}{n_{\mathrm{har}}(1-p)} \leq \frac{K}{n_{\min}\, p} = o(1).$$
Recalling that $q = p/K$, the exponent of the last term in (22) satisfies
$$\frac{\gamma t^2}{n_{\mathrm{har}}} \leq \frac{\gamma K^2 (1-p)^2}{4 p^2 n_{\mathrm{har}}} = \frac{2 q K^2}{4 p^2 n_{\mathrm{har}}} = \frac{K}{2 n_{\mathrm{har}}\, p} \leq \frac{K}{2 n_{\min}\, p} = o(I)$$
under the assumption of the lemma. It follows that
$$M_Y\big( t/(n_1 n_2(1-p)) \big) \leq 2\sqrt{(1-\bar p) q}\; e^{o(t)} + (K-2)q\, e^{o(I)} = 2\big(\sqrt{(1-\bar p) q}\big)^{1+o(1)} + (K-2)q\, e^{o(I)} \leq \Big( \big(\sqrt{(1-\bar p) q}\big)^{o(1)} \vee e^{o(I)} \Big)\, e^{-I},$$
where the first equality is by $e^{o(t)} = (e^t)^{o(1)} = \big(\sqrt{(1-\bar p) q}\big)^{o(1)}$ for our choice of t, and the last step is by the definition (11) of I. Since $\sqrt{(1-\bar p) q} \leq e^{-I}$, we have $\big(\sqrt{(1-\bar p) q}\big)^{o(1)} = e^{o(I)}$, hence
$$M_Y\big( t/(n_1 n_2(1-p)) \big) \leq e^{o(I)}\, e^{-I} = e^{-(1-o(1))I} = e^{-(1-\eta)I}.$$
Let $Y_1, \dots, Y_N$ be the i.i.d. copies of Y. By Markov's inequality,
$$\mathbb{P}\Big( \sum_{j=1}^N Y_j \geq 0 \Big) = \mathbb{P}\big( e^{\lambda \sum_{j=1}^N Y_j} \geq 1 \big) \leq \mathbb{E}\, e^{\lambda \sum_{j=1}^N Y_j} = M_{Y_1}(\lambda)^N \leq e^{-(1-\eta)NI}.$$
The above argument shows that $\mathbb{P}(b_2/n_2 \geq b_1/n_1) \leq e^{-(1-\eta)NI}$. Repeating the argument for the ith label, it shows that
$$\mathbb{P}\big( \hat z_i(Z^*) \neq z^*_i \big) \leq \mathbb{P}\Big( \max_{k=2,\dots,K} \frac{b_k}{n_k} \geq \frac{b_1}{n_1} \Big) \leq \sum_{k=2}^K \mathbb{P}\Big( \frac{b_k}{n_k} \geq \frac{b_1}{n_1} \Big) \leq (K-1)\, e^{-(1-\eta)NI}.$$
If $K = 2$, then we have already obtained (20). If $K > 2$, then $(K-1)e^{-(1-\eta)NI} = e^{-(1-\eta)NI + \log(K-1)} = e^{-(1-\eta)NI + o(NI)}$. The term $o(NI)$ can be absorbed into $\eta NI$, so we still obtain (20).

D.4.2 DETAILED PROOF OF LEMMA 2

Let $i = 1$ without loss of generality, let $Z \in B(\delta)$, and let $\tilde n_k = n_k(Z)$ denote the size of the kth cluster in the label matrix Z. Let $b_k(Z^*) = (Z^*_{-1} \bar X_1)_k$ and $b_k(Z) = (Z_{-1} \bar X_1)_k$, where $\bar X_1$ is the first column of $\bar X$ in the algorithm. Suppose Z has at most $n\delta$ labels different from $Z^*$. Then
$$\big| [n_1 b_2(Z^*) - n_2 b_1(Z^*)] - [n_1 b_2(Z) - n_2 b_1(Z)] \big| \leq (n_1 \vee n_2)\, n\delta$$
and
$$\big| [n_1 b_2(Z) - n_2 b_1(Z)] - [\tilde n_1 b_2(Z) - \tilde n_2 b_1(Z)] \big| \leq |n_1 - \tilde n_1| \cdot b_2(Z) + |n_2 - \tilde n_2| \cdot b_1(Z) \leq n\delta\,(n_1 + n_2).$$
Let $Y(Z^*) = n_1 b_2(Z^*) - n_2 b_1(Z^*)$ and $Y(Z) = \tilde n_1 b_2(Z) - \tilde n_2 b_1(Z)$. Combining the two bounds by the triangle inequality, we obtain
$$|Y(Z^*) - Y(Z)| \leq 3 (n_1 \vee n_2)\, n\delta =: h(\delta).$$
By Markov's inequality, for $\lambda \geq 0$,
$$\mathbb{P}\Big( \max_{Z \in B(\delta)} Y(Z) \geq 0 \Big) \leq \mathbb{P}\big( Y(Z^*) \geq -h(\delta) \big) = \mathbb{P}\big( e^{\lambda N Y(Z^*)} \geq e^{-\lambda N h(\delta)} \big) \leq e^{\lambda N h(\delta)}\, \mathbb{E}\big[ e^{\lambda N Y(Z^*)} \big].$$
As in the proof of Lemma 3, we take $\lambda = t\,[n_1 n_2 (1-p)]^{-1}$, with t given by (23). Using the upper bound (24), we have
$$\lambda N h(\delta) = \frac{3(n_1 \vee n_2)\, n N}{n_1 n_2 (1-p)}\, \delta\, t \leq \frac{3 K n \delta N}{2 p\, n_{\min}}.$$
The result follows as in Lemma 3.

E RPM AND BAYESIAN AGGREGATION

One might ask whether RPM is a useful model in practice. For the applications in which all the label vectors are perturbations of a common true "center", and our goal is to recover this center, RPM is a good first approximation. This is the case for Bayesian label aggregation as we argue below. In such settings, the RPM is like the i.i.d. noise model used in classical regression. Although one can imagine more complex regression models (like those with heteroscedastic noise, or mixtures of regressions, etc.), the i.i.d. setting still provides a lot of insights for understanding the more complex models.

E.1 RPM IS A GOOD MODEL FOR A CONCENTRATED POSTERIOR

Let us now argue how one can arrive at the RPM in the context of Bayesian aggregation, by systematically making some assumptions. First, we note that our goal in this paper is not to prove "posterior concentration around the truth", also known as posterior consistency. This is problem-specific and out of the scope of this work. We assume that we are working with a model for which posterior consistency has already been established. The question that we are trying to answer is then this: Given that the posterior concentrates around the true partition, and given that the MCMC has converged, that is, we are sampling from this concentrated posterior, can we obtain a consistent estimate of the center of the posterior (which would be the true partition) based on these samples? For this purpose, it is enough to assume that we are observing samples from the posterior $p(z_1, \dots, z_n \mid D)$, where D is the observed data, and that this posterior concentrates around $z^*$, the true partition. For simplicity, let us drop D and note that the posterior is some multivariate discrete distribution $p(z_1, \dots, z_n)$. Then, we proceed in steps:

1. First, we address the label-switching issue. Let z be a sample from the posterior and let the permutation π be the minimizer of $H(z^*, \pi(z))$ over all K! permutations, where H(·, ·) is the Hamming distance. We note that π(z) is a label vector equivalent to z (only the cluster labels are permuted). We consider the distribution of π(z) as the posterior distribution rather than that of z. This is only to simplify the discussion and is without loss of generality, since our methods are invariant to label-switching. The distribution of π(z) will have a single mode at $z^*$, while that of z will have K! modes on all equivalent versions of $z^*$. Given this simplification, renaming π(z) to z from now on, the posterior is a multivariate distribution $p(z_1, \dots, z_n)$ which is highly concentrated around $z^* = (z^*_1, \dots, z^*_n)$. This follows from the posterior consistency assumption.

2. We claim that this multivariate distribution can be approximated by the product of its marginals, $p(z_1, \dots, z_n) \approx \prod_{i=1}^n p_i(z_i)$. This is intuitively clear for any concentrated discrete distribution. Alternatively, it is intuitively clear to anyone who has looked at MCMC samples at stationarity. Each node/object i usually has a most likely assignment, which is $z^*_i$, but it occasionally jumps to some other label with small probability. The fluctuations for different nodes are independent; this is intuitively because the bulk of the labels do not change their clusters all at once; only a few do at any given time.

3. Non-uniform RPM: Given the independence assumption across i, the most general form $p_i(z_i)$ can take is a categorical (a.k.a. Multinomial(1, π)) distribution. The bulk of the mass of this categorical variable will be on $z^*_i$, and the rest distributed among the other labels. For example, for some node i, the label can jump, say, between $z^*_i = 3$ and 5, with the bulk of the probability on 3. For other nodes it could be that, when they jump from $z^*_i$, they land over a larger collection of labels. Here, we are making the simplifying assumption that, for each node, the mass that is put on anything other than $z^*_i$ is uniformly distributed over the label set $[K] \setminus \{z^*_i\}$. This assumption can be removed, and we can work with general categorical variables, at the expense of making the rates and the analysis more complicated.

4. Inhomogeneous RPM: Given that the noise in the categorical variable is uniform over $[K] \setminus \{z^*_i\}$, we now assume that the probability of taking any of those values is the same for all nodes (i.e., independent of i). This is exactly the homogeneous RPM that we consider. This assumption is easy to remove, and we can work with the inhomogeneous RPM that allows this probability to depend on i.

We do not lose anything in Step 1. Steps 3 and 4 are simplifying steps that are taken for ease of understanding and presentation. The main assumption is Step 2, the approximation by the product of marginals. Below we provide some hard evidence that this is very reasonable in the Bayesian setting.

E.2 HARD EVIDENCE OF NEAR-INDEPENDENCE

We consider the problem of recovering the clusters in a stochastic block model (SBM), which is a random network model with latent node clusters. Figure 5 shows the sequential NMI plot for a Gibbs sampler on a non-parametric SBM (with a Dirichlet Process prior on labels). Sequential NMI means that we compute the NMI of the partition at iteration t relative to that at iteration t-1, and the x-axis shows t. The plot suggests that the chain enters stationarity roughly around iteration 100. We use the MMD-based approach, described in (19; 20; 38), to compare the posterior joint distribution of the labels to its approximate versions:

- Figure 6(a) shows the result for samples from iterations 100 to 1000 of the Gibbs sampler. This is the stationary joint distribution.
- Figure 6(b) shows the result for samples from iterations 1 to 100 of the Gibbs sampler. This is based on the transient samples (effectively, the average joint behavior over the transient period).
- Figure 6(c) shows the results on a movie rating dataset with a complicated, dependent joint distribution (that has nothing to do with the SBM).

We refer to (19; 20; 38) for details of how these experiments are performed. The bootstrap MMD serves as the baseline; those methods whose MMD is closer to the bootstrap are better approximations of the joint distribution. In general, a smaller MMD means a better approximation of the joint. Ind Mult is exactly the approximation by independent multinomials (i.e., the product of marginals). The Copula Mult is a good joint model for a discrete multivariate distribution. We see that for the movie rating data and the transient chain, Copula Mult indeed has a much lower MMD than Ind Mult, showing that there is dependence in the joint that is not captured by Ind Mult. However, for the stationary distribution (Figure 6(a)), Ind Mult has a comparable (and even slightly smaller) MMD relative to Copula Mult, and is close to the bootstrap. This shows that the product of marginals is a good approximation in this case, and justifies Step 2 in our reduction.

F COMPARISON WITH STOCHASTIC BLOCK MODEL

Below we outline some of the similarities and differences between the problem of cluster recovery in the SBM and the consensus clustering problem we consider in this paper. Suppose that A is the adjacency matrix generated from the stochastic block model (SBM) and $\bar X$ is the average association matrix generated from the RPM.

Similarity: A larger value of $\bar X_{ij}$ increases the likelihood of i and j being in the same cluster in the RPM. Similarly, $A_{ij} = 1$, i.e., the presence of an edge between i and j, increases the likelihood of i and j being in the same community in the SBM. Therefore, both $\bar X$ and A can be considered proximity matrices, and we can utilize a min-cut type algorithm on them to find the clusters. To approximate the min-cut algorithm, various researchers have proposed the approach of using a good initialization plus a local refinement step. This idea can be applied to many clustering problems, including community detection. In this paper, we show that it can also be applied to consensus clustering.

Difference: The entries of the adjacency matrix A are independent, but the entries of the association matrix $\bar X$ are not. Indeed, the entries in the same row of $\bar X$ have very strong dependence. The likelihood function of $\bar X$ is very different from that of A, but we can still show that a simple local refinement step outputs optimal labels. The error rate is comparable to the Bayes error rate given by the likelihood ratio test.



Figure 1: Left: the Chernoff divergence I in equation (11) as a function of p, the label perturbation probability. Right: how fast $-(\log P_N)/N$ converges to the Chernoff divergence for K = 2.

Figure 2: Performance impact with the increase in n and N (panel shown: n = 500, N = 200, K = 6).

Figure 4: Significant improvements due to local refinement in the case of unbalanced cluster sizes.

Set 6: Unbalanced, n = 100, N = 20, $p_1$ = 0.8. Set 7: Unbalanced, n = 100, N = 20, $p_1$ = 0.9.



Figure 5: Sequential NMI for the SBM Gibbs sampler


Table 1: Mean adjusted Rand index (ARI) for all settings at noise probability p = 0.45.

Table 2: Mean adjusted Rand index (ARI) for all settings at noise probability p = 0.55.

Table 3: Mean adjusted Rand index (ARI) for all settings at noise probability p = 0.65.


ACKNOWLEDGEMENT

Zhixin Zhou was supported by the Research Grants Council of Hong Kong [Project Number 21502921]. A. A. Amini was supported by the NSF grant DMS-1945667.





