SYSTEMATIC ANALYSIS OF CLUSTER SIMILARITY INDICES: HOW TO VALIDATE VALIDATION MEASURES

Anonymous

Abstract

There are many cluster similarity indices used to evaluate clustering algorithms, and choosing the best one for a particular task remains an open problem. We demonstrate that this problem is crucial: there are many disagreements among the indices, these disagreements do affect which algorithms are chosen in applications, and this can lead to degraded performance in real-world systems. We propose a theoretical solution to this problem: we develop a list of desirable properties and theoretically verify which indices satisfy them. This allows for making an informed choice: given a particular application, one can first select the properties that are desirable for it and then identify the indices satisfying them. We observe that many popular indices have significant drawbacks and instead advocate alternatives that are less widely adopted but have beneficial properties.

1. INTRODUCTION

Clustering is an unsupervised machine learning problem, where the task is to group objects that are similar to each other. In network analysis, a related problem is called community detection: there, grouping is based on relations between items (links), and the obtained clusters are expected to be densely interconnected. Clustering is used across various applications, including text mining, online advertisement, anomaly detection, and many others (Allahyari et al., 2017; Xu & Tian, 2015).

To measure the quality of a clustering algorithm, one can use either internal or external measures. Internal measures evaluate the consistency of the clustering result with the data being clustered, e.g., the Silhouette, Hubert-Gamma, Dunn, and many other indices. Unfortunately, it is unclear whether optimizing any of these measures would translate into improved quality in practical applications. External (cluster similarity) measures compare the candidate partition with a reference one (obtained, e.g., by human assessors). A comparison with such a gold standard partition, when it is available, is more reliable. There are many tasks where external evaluation is applicable: text clustering (Amigó et al., 2009), topic modeling (Virtanen & Girolami, 2019), Web categorization (Wibowo & Williams, 2002), face clustering (Wang et al., 2019), news aggregation (see Section 3), and others. Often, even when no reference partition is available, it is possible to let a group of experts annotate a subset of items and compare the algorithms on this subset.

Dozens of cluster similarity measures exist, and which one should be used remains a subject of debate (Lei et al., 2017). In this paper, we systematically analyze the problem of choosing the best cluster similarity index. We start with a series of experiments demonstrating the importance of the problem (Section 3). First, we construct simple examples showing the inconsistency of all pairs of different similarity indices.
Then, we demonstrate that such disagreements often occur in practice when well-known clustering algorithms are applied to real datasets. Finally, we illustrate how an improper choice of a similarity index can affect the performance of production systems. So, the question is: how can we compare cluster similarity indices and choose the best one for a particular application? Ideally, we would want to choose an index for which good similarity scores translate into good real-world performance. However, opportunities to experimentally perform such a validation of validation indices are rare, typically expensive, and do not generalize to other applications. In contrast, we suggest a theoretical approach: we formally define properties that are desirable across various applications, discuss their importance, and formally analyze which similarity indices satisfy them (Section 4). This theoretical framework allows practitioners to choose the best index for their applications.
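The paper's own inconsistency examples appear in Section 3; purely as an illustration of the phenomenon (not one of the paper's constructions), the following sketch implements two standard external indices from scratch, the Rand index and normalized mutual information (NMI) with geometric-mean normalization, and builds a reference partition with two candidate partitions that the indices rank in opposite orders:

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt

def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which the two partitions agree
    (both place the pair together, or both place it apart)."""
    n = len(labels_true)
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) // 2)

def nmi(labels_true, labels_pred):
    """Mutual information between the partitions, normalized by the
    geometric mean of the two cluster-label entropies."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    pt, pp = Counter(labels_true), Counter(labels_pred)
    mi = sum(c / n * log(c * n / (pt[a] * pp[b])) for (a, b), c in joint.items())
    entropy = lambda cnt: -sum(c / n * log(c / n) for c in cnt.values())
    return mi / sqrt(entropy(pt) * entropy(pp))

reference = [0, 0, 0, 0, 1, 1, 1, 1]
cand_a    = [0, 0, 0, 0, 0, 1, 1, 1]  # one item moved across the boundary
cand_b    = [0, 0, 1, 1, 2, 2, 3, 3]  # each reference cluster split in half

# The two indices disagree: Rand prefers cand_a, NMI prefers cand_b.
print(rand_index(reference, cand_a), rand_index(reference, cand_b))
print(nmi(reference, cand_a), nmi(reference, cand_b))
```

Here Rand scores cand_a higher (21/28 vs. 20/28) because cand_b separates many within-cluster pairs, while NMI scores cand_b higher because its clusters are pure refinements of the reference; which ranking is "right" depends on the application, which is exactly the choice problem studied in this paper.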

