ASSESSING MODEL OUT-OF-DISTRIBUTION GENERALIZATION WITH SOFTMAX PREDICTION PROBABILITY BASELINES AND A CORRELATION METHOD

Abstract

This paper studies the use of Softmax prediction probabilities to assess model generalization under distribution shift. Specifically, given an out-of-distribution (OOD) test set and a pool of classifiers, we aim to develop a prediction probability-based measure which has a monotonic relationship with OOD generalization performance. We first show that existing uncertainty measures (e.g., entropy and maximum Softmax prediction probability) are fairly useful for predicting generalization in some OOD scenarios. We then propose a new measure, Softmax Correlation (SoftmaxCorr). To obtain the SoftmaxCorr score for a classifier, we compute the class-class correlation matrix from all the Softmax vectors in a test set, and then its cosine similarity with an identity matrix. We show that the class-class correlation matrix reveals significant knowledge about the confusion matrix: high similarity with the identity matrix means predictions have low confusion (uncertainty) and evenly cover all classes, and vice versa. Across three setups including ImageNet, CIFAR-10, and WILDS, we show that SoftmaxCorr is a reliable predictor of model accuracy on both in-distribution and OOD datasets.

1. INTRODUCTION

Understanding the generalization of deep neural networks is an essential problem in deep learning. There is substantial interest in predicting the in-distribution (ID) generalization gap via complexity measures (Jiang et al., 2020a; Neyshabur et al., 2015; Bartlett et al., 2017; Keskar et al., 2017; Nagarajan & Kolter, 2019; Neyshabur et al., 2017; Chuang et al., 2021; Jiang et al., 2020b; Smith & Le, 2018; Arora et al., 2018; Dziugaite & Roy, 2017; Dinh et al., 2017). Although significant, developing measures to characterize out-of-distribution (OOD) generalization remains under-explored. In fact, the test environment in the real world often undergoes distribution shift due to factors like sample bias and non-stationarity. Ignoring the potential distribution shift can lead to serious safety concerns in, e.g., self-driving cars (Kuutti et al., 2020) and histopathology (Bandi et al., 2018).

Softmax prediction probability has been found useful for analyzing the test environment (Hendrycks & Gimpel, 2016; Guillory et al., 2021; Deng et al., 2022a; Liang et al., 2018; Garg et al., 2022). For example, Hendrycks & Gimpel (2016) and Liang et al. (2018) utilize the maximum Softmax prediction probability to identify samples from open-set classes. This gives us a hint: models' prediction probabilities may be informative for assessing their OOD performance. We are therefore interested in measures based on prediction probability and conduct a large-scale correlation study to explore whether they are useful for characterizing the generalization of various models under distribution shift. Concretely, given various deep models, we aim to study and develop prediction probability-based measures which relate monotonically to model generalization. To this end, we construct a catalog of empirical prediction probability-based measures and create a wide range of experimental setups. We collect 502 different classification models, ranging from standard convolutional neural networks to Vision Transformers.
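Two of the empirical uncertainty measures in this catalog, average maximum Softmax probability and (negative) entropy, can be sketched as below. This is a toy illustration with our own function and variable names, not the paper's exact evaluation protocol; both measures map a set of unlabeled Softmax outputs to a single scalar intended to rise with the (unknown) test accuracy.

```python
import numpy as np

def softmax_baselines(probs: np.ndarray) -> dict:
    """Score a classifier on an unlabeled test set from its Softmax outputs.

    probs: (N, K) array; each row is a Softmax probability vector over K classes.
    Returns scalar scores where higher is meant to indicate higher OOD accuracy.
    """
    # Average maximum prediction probability (confidence).
    max_conf = probs.max(axis=1).mean()
    # Mean Shannon entropy; low entropy (confident predictions) -> high score,
    # so we negate it to keep "higher is better" for both measures.
    eps = 1e-12  # guard against log(0)
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=1).mean()
    return {"max_softmax": float(max_conf), "neg_entropy": float(-mean_entropy)}

# Toy example: 3 test samples, 4 classes.
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25],   # maximally uncertain row
              [0.90, 0.05, 0.03, 0.02]])
scores = softmax_baselines(p)
```

Note that both scores summarize per-sample uncertainty only; neither looks at relationships between classes, which is the gap SoftmaxCorr targets.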
We cover 19 ID and OOD datasets with various types of distribution shift, such as ImageNet-V2 (Recht et al., 2019) with dataset reproduction shift and Stylized-ImageNet (Geirhos et al., 2019) with style shift. Based on the experimental results, we observe that empirical uncertainty measures based on prediction probabilities (e.g., entropy) are useful for characterizing OOD generalization to some extent. However, they are limited in leveraging the class-wise relationships encoded in prediction probabilities. We thus propose Softmax Correlation (SoftmaxCorr), an effective prediction probability-based measure describing, for each classifier, to what extent classes correlate with each other. Specifically, for each classifier we compute a class correlation matrix from all prediction vectors in a test set. Then, we calculate its cosine similarity with an identity matrix to evaluate whether this classifier makes diverse and certain predictions. We show that class-class correlation effectively uncovers knowledge of the confusion matrix, thus better reflecting overall accuracy on an OOD test set. A broad correlation study shows the efficacy of SoftmaxCorr. Our contributions are summarized below.

• We observe that Softmax prediction probability-based measures generally give good baseline indicators of the OOD accuracy of various classification models.

• Furthering this finding, we propose SoftmaxCorr, which assesses model generalization by explicitly calculating class-class correlation, a new angle on leveraging the prediction probability. A wide range of experiments suggests the effectiveness of this method.

2. RELATED WORK

Many works aim to improve model generalization under distribution shift (Zhao et al., 2020; Sagawa et al., 2020; Liu et al., 2021a; Mansilla et al., 2021; Shi et al., 2021), such as adversarial domain augmentation (Volpi et al., 2018; Qiao & Peng, 2021) and inter-domain gradient matching (Shi et al., 2021). In contrast, few works study the characterization of model OOD generalization. Ben-David et al. (2006; 2010) provide upper bounds on the OOD generalization error for domain adaptation. Some works further bound the OOD generalization error based on the divergence between the two distributions (Acuna et al., 2021; Zhang et al., 2019; Tachet des Combes et al., 2020). However, as suggested by Miller et al. (2021), when the shift becomes larger, the above bounds on OOD performance become looser. Moreover, Vedantam et al. (2021) report that adapting theory from domain adaptation has limited power for predicting OOD generalization. In this work, we instead assess model generalization under distribution shift with OOD measures.

Hendrycks et al. (2022) further show that, without being normalized by the Softmax function, the maximal model logit is also a strong baseline for OOD detection. Liu et al. (2020b) calculate an energy score based on Softmax probability, which is regarded as a simple yet effective replacement for the maximum prediction probability. Other methods investigate predictive uncertainty (Liu et al., 2020a; Van Amersfoort et al., 2020). Different from the above works, this research does not aim to detect OOD data, but rather explores Softmax-based measures to assess model generalization.
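The SoftmaxCorr computation described in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions: we take the class-class correlation matrix to be the (unnormalized) average outer product of the Softmax vectors and use the Frobenius inner product for cosine similarity; the paper defines the precise formulation.

```python
import numpy as np

def softmax_corr(probs: np.ndarray) -> float:
    """SoftmaxCorr sketch: cosine similarity between the class-class
    correlation matrix of the predictions and the identity matrix.

    probs: (N, K) Softmax outputs. A score near 1 suggests predictions are
    both confident (little class confusion) and spread evenly over classes.
    """
    n, k = probs.shape
    # Class-class correlation matrix (K x K), averaged over the test set.
    corr = probs.T @ probs / n
    identity = np.eye(k)
    # Cosine similarity under the Frobenius inner product;
    # np.linalg.norm defaults to the Frobenius norm for 2-D arrays.
    return float(np.sum(corr * identity) /
                 (np.linalg.norm(corr) * np.linalg.norm(identity)))

# Confident, class-balanced predictions (one-hot, 2 samples per class) -> 1.0 ...
certain = np.eye(4)[np.arange(8) % 4]
# ... while maximally uncertain (uniform) predictions score lower.
uniform = np.full((8, 4), 0.25)
```

Note that confident predictions collapsed onto a single class also score below 1 under this measure, since the correlation matrix then has a single dominant diagonal entry; this is how SoftmaxCorr rewards predictions that evenly cover all classes rather than confidence alone.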

