SYSTEMATIC ANALYSIS OF CLUSTER SIMILARITY INDICES: HOW TO VALIDATE VALIDATION MEASURES

Anonymous

Abstract

There are many cluster similarity indices used to evaluate clustering algorithms, and choosing the best one for a particular task remains an open problem. We demonstrate that this problem is crucial: there are many disagreements among the indices, these disagreements do affect which algorithms are chosen in applications, and this can lead to degraded performance in real-world systems. We propose a theoretical solution to this problem: we develop a list of desirable properties and theoretically verify which indices satisfy them. This allows for making an informed choice: given a particular application, one can first select the properties that are desirable for it and then identify the indices that satisfy them. We observe that many popular indices have significant drawbacks. Instead, we advocate indices that are less widely adopted but have beneficial properties.

1. INTRODUCTION

Clustering is an unsupervised machine learning problem, where the task is to group objects that are similar to each other. In network analysis, a related problem is called community detection, where grouping is based on relations between items (links), and the obtained clusters are expected to be densely interconnected. Clustering is used across various applications, including text mining, online advertisement, anomaly detection, and many others (Allahyari et al., 2017; Xu & Tian, 2015).

To measure the quality of a clustering algorithm, one can use either internal or external measures. Internal measures evaluate the consistency of the clustering result with the data being clustered, e.g., Silhouette, Hubert-Gamma, Dunn, and many other indices. Unfortunately, it is unclear whether optimizing any of these measures would translate into improved quality in practical applications. External (cluster similarity) measures compare the candidate partition with a reference one (obtained, e.g., by human assessors). A comparison with such a gold standard partition, when it is available, is more reliable. There are many tasks where external evaluation is applicable: text clustering (Amigó et al., 2009), topic modeling (Virtanen & Girolami, 2019), Web categorization (Wibowo & Williams, 2002), face clustering (Wang et al., 2019), news aggregation (see Section 3), and others. Often, when there is no reference partition available, it is possible to let a group of experts annotate a subset of items and compare the algorithms on this subset.

Dozens of cluster similarity measures exist, and which one should be used is a subject of debate (Lei et al., 2017). In this paper, we systematically analyze the problem of choosing the best cluster similarity index. We start with a series of experiments demonstrating the importance of the problem (Section 3). First, we construct simple examples showing the inconsistency of all pairs of different similarity indices.
Then, we demonstrate that such disagreements often occur in practice when well-known clustering algorithms are applied to real datasets. Finally, we illustrate how an improper choice of a similarity index can affect the performance of production systems.

So, the question is: how to compare cluster similarity indices and choose the best one for a particular application? Ideally, we would want to choose an index for which good similarity scores translate to good real-world performance. However, opportunities to experimentally perform such a validation of validation indices are rare, typically expensive, and do not generalize to other applications. In contrast, we suggest a theoretical approach: we formally define properties that are desirable across various applications, discuss their importance, and formally analyze which similarity indices satisfy them (Section 4). This theoretical framework allows practitioners to choose the best index based on the properties relevant to their applications. In Section 5, we advocate two indices that are expected to be suitable across various applications.

While many ideas discussed in the paper apply to all similarity indices, we also provide a more in-depth theoretical characterization of pair-counting ones (e.g., Rand and Jaccard), which gives an analytical background for further studies of pair-counting indices. We formally prove that among dozens of known indices, only two have all the properties except for being a distance: the Correlation Coefficient and Sokal & Sneath's first index (Lei et al., 2017). Surprisingly, both indices are rarely used for cluster evaluation. The Correlation Coefficient has the additional advantage of being easily convertible to a distance measure via the arccosine function. The obtained index has all the properties except constant baseline, which is still satisfied asymptotically.
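To illustrate the conversion just mentioned, the following Python sketch computes the Pearson correlation of the binary pair-incidence vectors in its phi-coefficient form (through the four pair counts defined in Section 2) and maps it to a distance via the arccosine; the function names are ours, and degenerate denominators (e.g., a clustering with a single cluster) are not handled.

```python
from math import sqrt, acos, pi

def correlation_coefficient(n11, n10, n01, n00):
    """Pearson correlation (phi coefficient) of the two binary
    pair-incidence vectors, expressed through the four pair counts."""
    num = n11 * n00 - n10 * n01
    den = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return num / den

def correlation_distance(n11, n10, n01, n00):
    """CD(A, B) = arccos(CC(A, B)) / pi, mapping the correlation
    range [-1, 1] to the distance range [0, 1]."""
    return acos(correlation_coefficient(n11, n10, n01, n00)) / pi
```

For identical clusterings (N10 = N01 = 0) the correlation is 1 and the distance is 0, while negative correlation pushes the distance towards 1.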
Constant baseline is a particular focus of the current research: it is one of the most important and non-trivial properties. Informally, a sensible index should not prefer one candidate partition over another just because it has too large or too small clusters. To the best of our knowledge, we are the first to develop a rigorous theoretical framework for analyzing this property. In this respect, our work improves over the previous (mostly empirical) research on the constant baseline of particular indices (Albatineh et al., 2006; Lei et al., 2017; Strehl, 2002; Vinh et al., 2009; 2010); we refer to Appendix A for a detailed comparison with related research.

2. CLUSTER SIMILARITY INDICES

We assume that there is a set of elements $V$ with size $n = |V|$. A clustering is a partition of $V$ into disjoint subsets. Capital letters $A, B, C$ will be used to name the clusterings, and we represent them as $A = \{A_1, \ldots, A_{k_A}\}$, where $A_i$ is the set of elements belonging to the $i$-th cluster. If a pair of elements $v, w \in V$ lies in the same cluster in $A$, we refer to it as an intra-cluster pair of $A$; otherwise, it is an inter-cluster pair. The total number of pairs is denoted by $N = \binom{n}{2}$. The value that an index $I$ assigns to the similarity between partitions $A$ and $B$ will be denoted by $I(A, B)$. Let us now define some of the indices used throughout the paper; a more comprehensive list, together with formal definitions, is given in Appendix B.1 and B.2.

Pair-counting indices consider clusterings to be similar if they agree on many pairs. Formally, let $\mathbf{A}$ be the $N$-dimensional binary vector indexed by the set of element pairs, where the entry corresponding to $(v, w)$ equals 1 if $(v, w)$ is an intra-cluster pair and 0 otherwise. Further, let $M_{AB}$ be the $N \times 2$ matrix that results from concatenating the two column vectors $\mathbf{A}$ and $\mathbf{B}$. Each row of $M_{AB}$ is either 11, 10, 01, or 00. Let the pair-counts $N_{11}, N_{10}, N_{01}, N_{00}$ denote the number of occurrences of each of these rows in $M_{AB}$.

Definition 1. A pair-counting index is a similarity index that can be expressed as a function of the pair-counts $N_{11}, N_{10}, N_{01}, N_{00}$.

Some popular pair-counting indices are Rand and Jaccard:
$$R = \frac{N_{11} + N_{00}}{N_{11} + N_{10} + N_{01} + N_{00}}, \qquad J = \frac{N_{11}}{N_{11} + N_{10} + N_{01}}.$$
Adjusted Rand (AR) is a linear transformation of Rand ensuring that for a random $B$ we have $AR(A, B) = 0$ in expectation. A less widely used index is the Pearson Correlation Coefficient (CC) between the binary incidence vectors $\mathbf{A}$ and $\mathbf{B}$. Another index, which we discuss in more detail below, is the Correlation Distance $CD(A, B) := \frac{1}{\pi} \arccos CC(A, B)$.
In Table 4, we formally define 27 known pair-counting indices; in the main text, we mention only those of particular interest.
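To make the pair-counting machinery concrete, here is a minimal Python sketch (the function names are ours) that computes the pair counts $N_{11}, N_{10}, N_{01}, N_{00}$ from two clusterings given as label sequences and evaluates the Rand and Jaccard indices defined above:

```python
from itertools import combinations

def pair_counts(a, b):
    """Count element pairs by agreement between two clusterings,
    given as per-element label sequences of equal length.

    Returns (N11, N10, N01, N00): intra-cluster in both, intra only
    in A, intra only in B, and inter-cluster in both, respectively.
    """
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def rand(a, b):
    """Rand index: fraction of pairs on which the clusterings agree."""
    n11, n10, n01, n00 = pair_counts(a, b)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(a, b):
    """Jaccard index: agreement on intra-cluster pairs only."""
    n11, n10, n01, _ = pair_counts(a, b)
    return n11 / (n11 + n10 + n01)
```

Note that this direct enumeration of all $\binom{n}{2}$ pairs is quadratic in $n$; in practice, the pair counts are usually derived from the contingency table of the two clusterings.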



Note that Spearman and Pearson correlation are equal when comparing binary vectors. Kendall rank correlation for binary vectors coincides with the Hubert index, which is linearly equivalent to Rand.



Information-theoretic indices consider clusterings similar if they share a lot of information, i.e., if little information is needed to transform one clustering into the other. Formally, let $H(A) := H(|A_1|/n, \ldots, |A_{k_A}|/n)$ be the Shannon entropy of the cluster-label distribution of $A$. Similarly, the joint entropy $H(A, B)$ is defined as the entropy of the distribution with probabilities $(p_{ij})_{i \in [k_A], j \in [k_B]}$, where $p_{ij} = |A_i \cap B_j|/n$. Then, the mutual information of two clusterings can be defined as
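The entropies above can be sketched in Python as follows (the function names are ours; mutual information is computed via the standard identity $MI(A, B) = H(A) + H(B) - H(A, B)$):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(A) of the cluster-label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def joint_entropy(a, b):
    """Joint entropy H(A, B) of the distribution p_ij = |A_i ∩ B_j| / n,
    obtained by counting co-occurrences of label pairs."""
    n = len(a)
    return -sum((c / n) * log2(c / n)
                for c in Counter(zip(a, b)).values())

def mutual_information(a, b):
    """MI via the standard identity MI(A, B) = H(A) + H(B) - H(A, B)."""
    return entropy(a) + entropy(b) - joint_entropy(a, b)
```

For identical clusterings, $H(A, B) = H(A)$ and the mutual information equals the entropy itself; for independent label assignments, it approaches zero.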

