CONFIDENCE AND DISPERSITY SPEAK: CHARACTERISING PREDICTION MATRIX FOR UNSUPERVISED ACCURACY ESTIMATION

Abstract

This work focuses on estimating how well a model performs on out-of-distribution (OOD) datasets without using labels. While recent methods study prediction confidence, this work newly reports that prediction dispersity is another informative cue. Confidence reflects whether an individual prediction is certain; dispersity indicates how the overall predictions are distributed across all categories. Our key insight is that a well-performing model should give predictions with high confidence and high dispersity; both properties should therefore be considered jointly to make accurate estimates. To this end, we use the nuclear norm, which has been shown to characterize both properties. In our experiments, we extensively validate the effectiveness of the nuclear norm for various models (e.g., ViT and ConvNeXt), different datasets (e.g., ImageNet and CUB-200), and diverse types of distribution shift (e.g., style shift and reproduction shift). We show that the nuclear norm is more accurate and robust in predicting OOD accuracy than existing methods. Furthermore, we validate the feasibility of other measurements (e.g., mutual information maximization) for characterizing dispersity and confidence. Lastly, we study the limitations of the nuclear norm and discuss potential directions.

1. INTRODUCTION

Model evaluation is critical in both machine learning research and practice. The standard evaluation protocol is to evaluate a model on a held-out test set that is 1) fully labeled and 2) drawn from the same distribution as the training set. However, this way of evaluation is often infeasible for real-world deployment, where the test environments undergo distribution shifts and ground truths are not provided. In the presence of a distribution shift, in-distribution accuracy may only be a weak predictor of model performance (Deng & Zheng, 2021; Garg et al., 2022). Moreover, annotating data is itself a laborious task, let alone labeling every new test distribution. Hence, ways to predict a classifier's accuracy using only unlabeled test data have recently received much attention (Chuang et al., 2020; Deng & Zheng, 2021; Guillory et al., 2021; Garg et al., 2022). In the task of accuracy estimation, existing methods typically derive model-based distribution statistics of test sets (Deng & Zheng, 2021; Guillory et al., 2021; Deng et al., 2021; Garg et al., 2022; Baek et al., 2022). Recent works develop methods based on the prediction matrix computed on unlabeled data (Guillory et al., 2021; Garg et al., 2022). They focus on the overall confidence of the prediction matrix. Confidence refers to whether the model gives a confident prediction on an individual test sample; it can be measured by entropy or the maximum softmax probability. Guillory et al. (2021) show that the average of maximum softmax scores on a test set is useful for accuracy estimation. Garg et al. (2022) predict accuracy as the fraction of test data with maximum softmax scores above a threshold. In this work, we newly consider another property of the prediction matrix: dispersity, which measures how spread out the predictions are across classes.
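The two confidence-based summaries just described (averaging maximum softmax scores, and counting the fraction of scores above a threshold) can be sketched with NumPy as follows. This is an illustrative sketch, not the cited methods' exact implementations; in particular, the threshold value here is arbitrary, whereas ATC learns it from labeled validation data.

```python
import numpy as np

def confidence_scores(probs, threshold=0.8):
    """Summarize prediction confidence for a softmax prediction matrix.

    probs: array of shape (n_samples, n_classes), rows sum to 1.
    Returns the average max softmax score and the fraction of samples
    whose max score exceeds the threshold.
    """
    max_scores = probs.max(axis=1)          # per-sample confidence
    avg_conf = max_scores.mean()            # average-confidence summary
    frac_above = (max_scores > threshold).mean()  # thresholded fraction
    return avg_conf, frac_above

# Toy prediction matrix: 4 samples, 3 classes
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],
    [0.10, 0.10, 0.80],
])
avg_conf, frac_above = confidence_scores(probs)
```

Both summaries use only the unlabeled prediction matrix, which is what makes them applicable to unlabeled OOD test sets.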
When testing a source-trained classifier on a target (out-of-distribution) dataset, target features may exhibit degenerate structures due to the distribution shift, with many target features falling into a few clusters. As a result, their class predictions are also degenerate rather than diverse: the classifier assigns test features to a few specific classes and few to the others. Existing works encourage the cluster sizes in the target data to be balanced (Shi & Sha, 2012; Liang et al., 2020; Yang et al., 2021; Tang et al., 2020), thereby increasing the prediction dispersity. In contrast, this work does not aim to improve cluster structures; instead, it studies prediction dispersity to predict model performance on various test sets without ground truths. To illustrate that dispersity is useful for accuracy estimation, we report our empirical observation in Fig. 1. We compute the dispersity score by measuring whether the frequency distribution of predicted classes is uniform. Specifically, we use entropy to quantify this frequency distribution, with higher scores indicating that the overall predictions are well balanced. We show that the dispersity score exhibits a very strong correlation (Spearman's rank correlation ρ > 0.950) with classifier performance across various test sets. This implies that when the classifier does not generalize well on a test set, it tends to give degenerate predictions (i.e., low prediction dispersity), where the test samples are mainly assigned to a few specific classes. Based on this observation, we propose to use the nuclear norm, known to be effective in measuring both prediction dispersity and confidence (Cui et al., 2020; 2021), towards accurate estimation. Other measurements can also be used, such as mutual information maximization (Bridle et al., 1991; Krause et al., 2010; Shi & Sha, 1991; 2012).
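The two quantities above can be sketched as follows: entropy of the predicted-class frequencies for dispersity, and the nuclear norm (sum of singular values) of the prediction matrix, which jointly reflects confidence and dispersity. The normalization of the nuclear norm by its upper bound is one common convention, not necessarily the paper's exact formulation.

```python
import numpy as np

def dispersity_entropy(probs):
    """Entropy of the predicted-class frequency distribution.
    High entropy means predictions are spread evenly across classes."""
    preds = probs.argmax(axis=1)
    counts = np.bincount(preds, minlength=probs.shape[1])
    freq = counts / counts.sum()
    nz = freq[freq > 0]                      # avoid log(0)
    return float(-(nz * np.log(nz)).sum())

def nuclear_norm_score(probs):
    """Nuclear norm of the (n, k) prediction matrix, normalized by the
    upper bound sqrt(min(n, k) * n) so scores lie in (0, 1]."""
    n, k = probs.shape
    return float(np.linalg.norm(probs, ord='nuc') / np.sqrt(min(n, k) * n))

# Diverse predictions: each sample confidently predicted to a different class
diverse = np.eye(3)
# Degenerate predictions: everything collapses onto class 0
collapsed = np.tile([1.0, 0.0, 0.0], (3, 1))
```

On this toy example, the diverse matrix attains higher dispersity entropy and a higher (here maximal) nuclear norm score than the collapsed one, matching the intuition that degenerate predictions signal poor generalization.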
Across various model architectures and a range of datasets, we show that the nuclear norm is more effective than state-of-the-art methods (e.g., ATC (Garg et al., 2022) and DoC (Guillory et al., 2021)) in predicting OOD performance. Using uncontrollable and severe synthetic corruptions, we show that the nuclear norm is again superior. Finally, we demonstrate that the nuclear norm still makes reasonably accurate estimates on test sets with moderate class imbalance. We additionally discuss potential solutions under strong label shifts.
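Comparisons of this kind are typically scored by rank correlation between an estimator's scores and the true accuracies across test sets. A minimal sketch of Spearman's ρ (no tie handling, which suffices for the toy data) is shown below; the score and accuracy values are hypothetical, since true accuracies are unknown at deployment and only used for benchmarking.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation via rank transformation
    (assumes no ties, fine for this illustration)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

# Hypothetical estimator scores on 5 OOD test sets, paired with the
# (normally unavailable) ground-truth accuracies used for evaluation.
scores = np.array([0.92, 0.75, 0.60, 0.85, 0.40])
accs = np.array([0.88, 0.70, 0.55, 0.80, 0.35])
rho = spearman_rho(scores, accs)  # ranks agree perfectly here, so rho = 1.0
```

A ρ near 1 means the estimator correctly ranks test sets from hardest to easiest, even if its raw scores are not on the accuracy scale.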

2. RELATED WORK

Unsupervised accuracy estimation is proposed to evaluate a model on unlabeled datasets. Recent methods typically consider the characteristics of unlabeled test sets (Deng & Zheng, 2021; Guillory et al., 2021; Deng et al., 2021; Garg et al., 2022; Baek et al., 2022; Yu et al., 2022; Chen et al., 2021b;a). For example, Deng & Zheng (2021); Yu et al. (2022); Chuang et al. (2020) consider the distribution discrepancy for accuracy estimation. Chen et al. (2021b) achieve more accurate estimation by using specified slicing functions in importance weighting. Chuang et al. (2020) learn domain-invariant classifiers on the unlabeled test set to estimate target accuracy. Guillory et al. (2021); Garg et al. (2022) propose to predict accuracy based on the softmax scores on unlabeled data. In addition, the agreement score of multiple models' predictions on test data is investigated in (Madani et al., 2004; Platanios et al., 2016; 2017; Donmez et al., 2010; Chen et al., 2021a). This work also focuses on estimating a model's OOD accuracy on various datasets and proposes to achieve robust estimates by considering both prediction confidence and dispersity.

Predicting ID generalization gap. To predict the performance gap between a given pair of training and test sets, several works develop complexity measures of trained models and training data (Eilertsen et al., 2020; Unterthiner et al., 2020; Arora et al., 2018; Corneanu et al., 2020; Jiang et al., 2019a; Neyshabur et al., 2017; Jiang et al., 2019b; Schiff et al., 2021). For example, Corneanu et al. (2020) predict the generalization gap using persistent topology measures. Jiang et al. (2019a) develop a measurement of layer-wise margin distributions for generalization prediction. Neyshabur et al. (2017) use the product of norms of the weights across multiple layers. Baldock et al. (2021) introduce a measure of example difficulty (i.e., prediction depth) to study the learning of deep models. Chuang et al. (2021) develop margin-based generalization bounds with optimal transport. These works assume that the training and test sets come from the same distribution and do not consider the characteristics of the test distribution. In comparison, we focus on predicting a model's accuracy on various OOD datasets.

Calibration aims to make the probabilities produced by a model reflect the true correctness likelihood (Guo et al., 2017; Minderer et al., 2021). To achieve this, several methods improve the calibration of predictive uncertainty, both during training (Karandikar et al., 2021; Krishnan & Tickoo, 2020) and after training (Guo et al., 2017; Gupta et al., 2021). For a perfectly calibrated model, the average confidence over a distribution equals its accuracy on that distribution. However, calibration methods seldom maintain the desired calibration performance under distribution shifts (Ovadia et al., 2019; Gong et al., 2021). To estimate OOD accuracy, this work does not focus on calibrating confidence; instead, we use the dispersity and confidence of the prediction matrix to predict model performance on unlabeled data.
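The calibration property discussed in this section (average confidence matching accuracy for a perfectly calibrated model) can be made concrete with a minimal confidence-accuracy gap, a single-bin simplification of expected calibration error. The function name and toy data are illustrative, not from any cited method.

```python
import numpy as np

def confidence_accuracy_gap(probs, labels):
    """Average max-softmax confidence minus accuracy.
    Zero for a perfectly calibrated model; positive values indicate
    overconfidence, which is common under distribution shift."""
    preds = probs.argmax(axis=1)
    avg_conf = probs.max(axis=1).mean()
    acc = (preds == labels).mean()
    return float(avg_conf - acc)

# Toy binary-class predictions: the model always predicts class 0
probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.6, 0.4],
])
labels = np.array([0, 1, 1, 0])   # two of four predictions are wrong
gap = confidence_accuracy_gap(probs, labels)  # 0.75 confidence vs 0.50 accuracy
```

Note that computing this gap requires labels, which is exactly what is unavailable in the OOD setting; this is why the present work estimates accuracy from the unlabeled prediction matrix rather than relying on calibrated confidence.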

