CEREAL: FEW-SAMPLE CLUSTERING EVALUATION

Abstract

Evaluating clustering quality with reliable evaluation metrics like normalized mutual information (NMI) requires labeled data that can be expensive to annotate. We focus on the underexplored problem of estimating clustering quality with limited labels. We adapt existing approaches from the few-sample model evaluation literature to actively sub-sample, with a learned surrogate model, the most informative data points for annotation to estimate the evaluation metric. However, we find that these approaches can produce biased estimates and rely only on the labeled data. To address this, we introduce CEREAL, a comprehensive framework for few-sample clustering evaluation that extends active sampling approaches in three key ways. First, we propose novel NMI-based acquisition functions that account for the distinctive properties of clustering and the uncertainties of a learned surrogate model. Next, we use ideas from semi-supervised learning and train the surrogate model with both the labeled and unlabeled data. Finally, we pseudo-label the unlabeled data with the surrogate model. We run experiments to estimate NMI in an active sampling pipeline on three datasets across vision and language. Our results show that CEREAL reduces the area under the absolute error curve by up to 78% compared to the best sampling baseline. We perform an extensive ablation study to show that our framework is agnostic to the choice of clustering algorithm and evaluation metric. We also extend CEREAL from clusterwise annotations to pairwise annotations. Overall, CEREAL can efficiently evaluate clustering with limited human annotations.

1. INTRODUCTION

Unsupervised clustering algorithms (Jain et al., 1999) partition a given dataset into meaningful groups such that similar data points belong to the same cluster. Obtaining high-quality clusterings plays an important role in numerous learning applications like intent induction (Perkins & Yang, 2019), anomaly detection (Liu et al., 2021), and self-supervision (Caron et al., 2018). However, evaluating these clusterings can be challenging. Unsupervised evaluation metrics, such as the Silhouette Index, often do not correlate well with downstream performance (von Luxburg et al., 2012). On the other hand, supervised evaluation metrics such as normalized mutual information (NMI) and adjusted Rand index (ARI) require a labeled reference clustering. This supervised evaluation step introduces a costly bottleneck which limits the applicability of clustering for exploratory data analysis. In this work, we study an underexplored area of research: estimating clustering quality with limited annotations. Existing work on this problem, adapted from few-sample model evaluation, can often perform worse than uniform random sampling (see Section 5). These works use learned surrogate models such as multilayer perceptrons to identify the most informative unlabeled data from the evaluation set. Similar to active learning, they then iteratively rank the next samples to be labeled according to an acquisition function and the surrogate model's predictions. However, many active sampling methods derive acquisition functions tailored to a specific classification or regression metric (Sawade et al., 2010; Kossen et al., 2021), making them inapplicable to clustering. Furthermore, these methods rely only on labeled data to learn the surrogate model and ignore the vast amounts of unlabeled data.
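As a concrete illustration of the supervised metrics above, the following sketch computes NMI and ARI against a labeled reference clustering using scikit-learn (assumed available; the label arrays are toy data, not from the paper):

```python
# Hedged sketch: evaluating a clustering against a labeled reference
# with scikit-learn's implementations of NMI and ARI.
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

reference = [0, 0, 1, 1, 2, 2]   # ground-truth labels (the costly annotations)
predicted = [1, 1, 0, 0, 2, 2]   # cluster assignments from some algorithm

# Both metrics are permutation-invariant: renaming cluster ids does not
# change the score, so these identical partitions score perfectly.
nmi = normalized_mutual_info_score(reference, predicted)
ari = adjusted_rand_score(reference, predicted)
print(nmi, ari)  # -> 1.0 1.0
```

Note that both metrics require the full `reference` labeling, which is exactly the annotation cost this work seeks to reduce.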
In this paper, we present CEREAL (Cluster Evaluation with REstricted Availability of Labels), a comprehensive framework for few-sample clustering evaluation without any explicit assumptions on the evaluation metric or clustering algorithm (see Figure 1). We propose several improvements to the standard active sampling pipeline. First, we derive acquisition functions based on normalized mutual information, a popular evaluation metric for clustering. The choice of acquisition function depends on whether the clustering algorithm returns a cluster assignment (hard clustering) or a distribution over clusters (soft clustering).
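To make the surrounding pipeline concrete, the sketch below shows one step of a generic active-sampling loop for metric estimation. This is not the paper's NMI-based acquisition functions; the surrogate probabilities are toy values, and the acquisition score is plain predictive-entropy uncertainty, a common illustrative choice:

```python
# Hedged sketch of one acquisition step in a generic active-sampling loop:
# rank unlabeled points by surrogate-model uncertainty and pick a batch
# to annotate. Illustrative only; CEREAL's acquisition functions differ.
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def acquire(surrogate_probs, unlabeled_ids, batch_size):
    """Return the batch_size unlabeled points the surrogate is least sure about."""
    ranked = sorted(unlabeled_ids,
                    key=lambda i: entropy(surrogate_probs[i]),
                    reverse=True)
    return ranked[:batch_size]

# Toy surrogate predictions: a distribution over labels for each point.
probs = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.6, 0.4], 3: [0.99, 0.01]}
picked = acquire(probs, list(probs), batch_size=2)
print(picked)  # -> [1, 2], the two most uncertain points
```

In a full pipeline, the picked points would be sent for annotation, the surrogate retrained on the enlarged labeled set, and the target metric re-estimated, repeating until the labeling budget is exhausted.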

