CEREAL: FEW-SAMPLE CLUSTERING EVALUATION

Abstract

Evaluating clustering quality with reliable evaluation metrics like normalized mutual information (NMI) requires labeled data that can be expensive to annotate. We focus on the underexplored problem of estimating clustering quality with limited labels. We adapt existing approaches from the few-sample model evaluation literature to actively sub-sample, with a learned surrogate model, the most informative data points for annotation to estimate the evaluation metric. However, we find that their estimates can be biased and that they rely only on the labeled data. To address this, we introduce CEREAL, a comprehensive framework for few-sample clustering evaluation that extends active sampling approaches in three key ways. First, we propose novel NMI-based acquisition functions that account for the distinctive properties of clustering and the uncertainties of a learned surrogate model. Next, we use ideas from semi-supervised learning and train the surrogate model with both the labeled and unlabeled data. Finally, we pseudo-label the unlabeled data with the surrogate model. We run experiments to estimate NMI in an active sampling pipeline on three datasets across vision and language. Our results show that CEREAL reduces the area under the absolute error curve by up to 78% compared to the best sampling baseline. We perform an extensive ablation study showing that our framework is agnostic to the choice of clustering algorithm and evaluation metric. We also extend CEREAL from clusterwise annotations to pairwise annotations. Overall, CEREAL can efficiently evaluate clustering with limited human annotations.

1. INTRODUCTION

Unsupervised clustering algorithms (Jain et al., 1999) partition a given dataset into meaningful groups such that similar data points belong to the same cluster. Obtaining high-quality clusterings plays an important role in numerous learning applications like intent induction (Perkins & Yang, 2019), anomaly detection (Liu et al., 2021), and self-supervision (Caron et al., 2018). However, evaluating these clusterings can be challenging. Unsupervised evaluation metrics, such as the Silhouette Index, often do not correlate well with downstream performance (von Luxburg et al., 2012). On the other hand, supervised evaluation metrics such as normalized mutual information (NMI) and adjusted Rand index (ARI) require a labeled reference clustering. This supervised evaluation step introduces a costly bottleneck which limits the applicability of clustering for exploratory data analysis. In this work, we study an underexplored area of research: estimating clustering quality with limited annotations. Existing work on this problem, adapted from few-sample model evaluation, can often perform worse than uniform random sampling (see Section 5). These works use learned surrogate models such as multilayer perceptrons to identify the most informative unlabeled data from the evaluation set. Similar to active learning, they then iteratively rank the next samples to be labeled according to an acquisition function and the surrogate model's predictions. However, many active sampling methods derive acquisition functions tailored to a specific classification or regression metric (Sawade et al., 2010; Kossen et al., 2021), which makes them inapplicable to clustering. Furthermore, these methods rely only on labeled data to learn the surrogate model and ignore the vast amounts of unlabeled data.
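For concreteness, the supervised metrics mentioned above are implemented in scikit-learn. A minimal sketch, using toy labels, of how NMI and ARI score a predicted clustering against a labeled reference clustering; both metrics are invariant to permutations of cluster IDs:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# Reference (ground-truth) labels and predicted cluster assignments.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])  # same partition, relabeled cluster IDs

nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
print(nmi, ari)  # perfect agreement up to relabeling: both are 1.0
```

Both scores are 1.0 here because the two partitions group points identically; only the arbitrary cluster IDs differ.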
In this paper, we present CEREAL (Cluster Evaluation with REstricted Availability of Labels), a comprehensive framework for few-sample clustering evaluation without any explicit assumptions on the evaluation metric or clustering algorithm (see Figure 1). We propose several improvements to the standard active sampling pipeline. First, we derive acquisition functions based on normalized mutual information, a popular evaluation metric for clustering. The choice of acquisition function depends on whether the clustering algorithm returns a cluster assignment (hard clustering) or a distribution over clusters (soft clustering). Then, we use a semi-supervised learning algorithm to train the surrogate model with both labeled and unlabeled data. Finally, we pseudo-label the unlabeled data with the learned surrogate model before estimating the evaluation metric. Our experiments across multiple real-world datasets, clustering algorithms, and evaluation metrics show that CEREAL estimates clustering quality much more accurately and reliably than several baselines. Our results show that CEREAL reduces the area under the absolute error curve (AEC) by up to 78.8% compared to uniform sampling. In fact, CEREAL reduces the AEC by up to 74.7% compared to the best performing active sampling method, which typically produces biased underestimates of NMI. In an extensive ablation study we observe that the combination of semi-supervised learning and pseudo-labeling is crucial for optimal performance, as each component on its own might hurt performance (see Table 1). We also validate the robustness of our framework across multiple clustering algorithms, namely K-Means, spectral clustering, and BIRCH, and a wide range of evaluation metrics, namely normalized mutual information (NMI), adjusted mutual information (AMI), and adjusted Rand index (ARI).
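The pseudo-labeling step of the pipeline can be illustrated with a small self-contained sketch. Here a k-NN classifier stands in for the learned surrogate model (the paper uses a semi-supervised surrogate), and uniform random selection stands in for the NMI-based acquisition functions; all data and numbers are illustrative, not the paper's actual setup:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
reference = (X[:, 0] > 0).astype(int)  # hidden ground-truth reference clustering
clustering = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)  # clustering to evaluate

# Small labeling budget: annotate 30 points (random here; CEREAL uses acquisition functions).
labeled = rng.choice(200, size=30, replace=False)
surrogate = KNeighborsClassifier(n_neighbors=5).fit(X[labeled], reference[labeled])

pseudo = surrogate.predict(X)          # pseudo-label the whole evaluation set
pseudo[labeled] = reference[labeled]   # keep true annotations where available

est_nmi = normalized_mutual_info_score(pseudo, clustering)       # few-sample estimate
true_nmi = normalized_mutual_info_score(reference, clustering)   # oracle value
```

The estimate is computed over the full dataset (true labels plus pseudo-labels) rather than over the 30 labeled points alone, which is the key difference from labeled-data-only estimators.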
Finally, we show that CEREAL can be extended from clusterwise annotations to pairwise annotations by using the surrogate model to pseudo-label the dataset. Our results with pairwise annotations show that pseudo-labeling can approximate the evaluation metric but requires significantly more annotations than clusterwise annotations to achieve similar estimates. We summarize our contributions as follows:
• We introduce CEREAL, a framework for few-sample clustering evaluation. To the best of our knowledge, we are the first to investigate the problem of evaluating clustering with a limited labeling budget. Our solution uses a novel combination of active sampling and semi-supervised learning, including new NMI-based acquisition functions.
• Our experiments in the active sampling pipeline show that CEREAL almost always achieves the lowest AEC across language and vision datasets. We also show that our framework reliably estimates the quality of the clustering across different clustering algorithms, evaluation metrics, and annotation types.
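The AEC used throughout the results is the area under the curve of absolute estimation error versus labeling budget. A minimal sketch with made-up estimates (trapezoidal integration is assumed here, since the quadrature rule is not pinned down in this section):

```python
import numpy as np

# Hypothetical NMI estimates at increasing labeling budgets, and the oracle NMI.
budgets   = np.array([100, 200, 300, 400, 500])  # number of annotated points
estimates = np.array([0.42, 0.55, 0.61, 0.64, 0.65])
true_nmi  = 0.66

abs_err = np.abs(estimates - true_nmi)
# Trapezoidal area under the absolute error curve.
aec = float(np.sum(0.5 * (abs_err[1:] + abs_err[:-1]) * np.diff(budgets)))
print(aec)  # 30.5 for these illustrative numbers
```

A better estimator drives the error down faster as the budget grows, shrinking this area; "reduces AEC by 78%" means the shaded error region is 78% smaller than the baseline's.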

2. RELATED WORK

Cluster Evaluation The trade-offs associated with different types of clustering evaluation are well-studied in the literature (Rousseeuw, 1987; Rosenberg & Hirschberg, 2007; Vinh et al., 2010; Gösgens et al., 2021). Clustering evaluation metrics, oftentimes referred to as validation indices, are either internal or external. Internal evaluation metrics gauge the quality of a clustering without supervision and instead rely on the geometric properties of the clusters. However, they might not be reliable as they do not account for the downstream task or make clustering-specific assumptions (von Luxburg et al., 2012; Gösgens et al., 2021; Mishra et al., 2022). On the other hand, external evaluation metrics require supervision, oftentimes in the form of ground truth annotations. Commonly used external evaluation metrics are adjusted Rand index (Hubert & Arabie, 1985), V-measure (Rosenberg & Hirschberg, 2007), and mutual information (Cover & Thomas, 2006) along with its normalized and adjusted variants. We aim to estimate external evaluation metrics for a clustering with limited labels for the ground truth or the reference clustering. Recently, Mishra et al. (2022) proposed a framework to select the expected best clustering achievable given a hyperparameter tuning method and a computation budget. Our work complements theirs by choosing the best clustering under a given labeling budget.
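The internal-versus-external distinction can be made concrete with scikit-learn: an internal metric such as the Silhouette Index needs only the data geometry, while an external metric such as NMI needs reference labels. A minimal sketch on synthetic blobs (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, normalized_mutual_info_score

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

internal = silhouette_score(X, labels)              # label-free: geometry of clusters only
external = normalized_mutual_info_score(y, labels)  # requires the reference labels y
```

The labeling cost that this paper targets comes entirely from the second line: `y` must be annotated, whereas `silhouette_score` is free but may not track downstream quality.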



Figure 1: The CEREAL framework evaluates the test clustering with limited annotations for the reference clustering.

