CONFIDENCE AND DISPERSITY SPEAK: CHARACTERIZING PREDICTION MATRIX FOR UNSUPERVISED ACCURACY ESTIMATION

Abstract

This work focuses on estimating how well a model performs on out-of-distribution (OOD) datasets without using labels. While recent methods study prediction confidence, this work newly reports that prediction dispersity is another informative cue. Confidence reflects whether an individual prediction is certain; dispersity indicates how the overall predictions are distributed across all categories. Our key insight is that a well-performing model should give predictions with high confidence and high dispersity. Specifically, both properties should be considered jointly to make accurate estimates. To this end, we use the nuclear norm, which has been shown to characterize both properties. In our experiments, we extensively validate the effectiveness of the nuclear norm for various models (e.g., ViT and ConvNeXt), different datasets (e.g., ImageNet and CUB-200), and diverse types of distribution shifts (e.g., style shift and reproduction shift). We show that the nuclear norm is more accurate and robust in predicting OOD accuracy than existing methods. Furthermore, we validate the feasibility of other measurements (e.g., mutual information maximization) for characterizing dispersity and confidence. Lastly, we study the limitations of the nuclear norm and discuss potential directions.

1. INTRODUCTION

Model evaluation is critical in both machine learning research and practice. The standard evaluation protocol is to evaluate a model on a held-out test set that is 1) fully labeled and 2) drawn from the same distribution as the training set. However, this way of evaluation is often infeasible for real-world deployment, where the test environments undergo distribution shifts and ground truths are not provided. In the presence of a distribution shift, in-distribution accuracy may only be a weak predictor of model performance (Deng & Zheng, 2021; Garg et al., 2022). Moreover, annotating data itself is a laborious task, let alone labeling every new test distribution, which is impractical. Hence, a way to predict a classifier's accuracy using unlabeled test data only has recently received much attention (Chuang et al., 2020; Deng & Zheng, 2021; Guillory et al., 2021; Garg et al., 2022). In the task of accuracy estimation, existing methods typically derive model-based distribution statistics of test sets (Deng & Zheng, 2021; Guillory et al., 2021; Deng et al., 2021; Garg et al., 2022; Baek et al., 2022). Recent works develop methods based on the prediction matrix on unlabeled data (Guillory et al., 2021; Garg et al., 2022). They focus on the overall confidence of the prediction matrix. Confidence refers to whether the model gives a confident prediction on an individual test sample. It can be measured by entropy or maximum softmax probability. Guillory et al. (2021) show that the average of maximum softmax scores on a test set is useful for accuracy estimation. Garg et al. (2022) predict accuracy as the fraction of test data with maximum softmax scores above a threshold. In this work, we newly consider another property of the prediction matrix: dispersity. It measures how spread out the predictions are across classes.
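The two confidence-based estimators above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `average_confidence` corresponds to the mean maximum softmax score, and `threshold_fraction` to the fraction of samples above a threshold; the threshold value here is a placeholder, as the cited method calibrates it on source data.

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax over a (n_samples, n_classes) array
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def average_confidence(probs):
    # Mean of the maximum softmax score over the test set
    return probs.max(axis=1).mean()

def threshold_fraction(probs, threshold):
    # Fraction of test samples whose maximum softmax score exceeds the threshold
    return float((probs.max(axis=1) > threshold).mean())
```

Both statistics lie in [0, 1] and are computed from the unlabeled prediction matrix alone, which is what makes them usable without test-set annotations.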
When testing a source-trained classifier on a target (out-of-distribution) dataset, target features may exhibit degenerate structures due to the distribution shift, where many target features are distributed in a few clusters. As a result, their corresponding class predictions would also be degenerate rather than diverse: the classifier predicts most test features into a few specific classes and few into others. Existing works encourage the cluster sizes in the target data to be balanced (Shi & Sha, 2012; Liang et al., 2020; Yang et al., 2021; Tang et al., 2020).
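The nuclear norm of the softmax prediction matrix captures both properties at once: it grows when rows are confident (near one-hot) and when predictions are spread across classes. A minimal sketch follows; the normalization by the maximum attainable value sqrt(min(n, K) * n) is one natural choice and is an assumption here, not necessarily the paper's exact formulation.

```python
import numpy as np

def nuclear_norm_score(probs):
    # probs: (n, K) row-stochastic softmax prediction matrix
    n, k = probs.shape
    # Sum of singular values; larger when predictions are both
    # confident (rows near one-hot) and dispersed (classes balanced)
    nuc = np.linalg.norm(probs, ord='nuc')
    # Normalize by the maximum attainable nuclear norm so the score is in (0, 1]
    return nuc / np.sqrt(min(n, k) * n)
```

For intuition: with n = 4 and K = 2, one-hot predictions evenly split across both classes give a score of 1.0, while one-hot predictions collapsed into a single class give a lower score (about 0.71), reflecting the degenerate structure described above.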

