WHAT CAN WE LEARN FROM THE SELECTIVE PREDICTION AND UNCERTAINTY ESTIMATION PERFORMANCE OF 523 IMAGENET CLASSIFIERS?

Abstract

When deployed for risk-sensitive tasks, deep neural networks must include an uncertainty estimation mechanism. Here we examine the relationship between deep architectures and their respective training regimes and their corresponding selective prediction and uncertainty estimation performance. We consider some of the most popular estimation performance metrics previously proposed, including AUROC, ECE, and AURC, as well as coverage for a selective accuracy constraint. We present a novel and comprehensive study of the selective prediction and uncertainty estimation performance of 523 existing pretrained deep ImageNet classifiers available in popular repositories. We identify numerous previously unknown factors that affect uncertainty estimation and examine the relationships between the different metrics. We find that distillation-based training regimes consistently yield better uncertainty estimation than other training schemes such as vanilla training, pretraining on a larger dataset, and adversarial training. Moreover, we find a subset of ViT models that outperforms all other models in terms of uncertainty estimation performance. For example, we discovered an unprecedented 99% top-1 selective accuracy on ImageNet at 47% coverage (and 95% top-1 accuracy at 80% coverage) for a ViT model, whereas a competing EfficientNet-V2-XL cannot satisfy these accuracy constraints at any level of coverage. Our companion paper, also published in ICLR 2023 (Galil et al., 2023), examines the performance of these classifiers in a class-out-of-distribution setting.
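The coverage-for-SAC numbers quoted above can be computed directly from a model's confidence scores and correctness labels: sort predictions by confidence and find the largest coverage whose selective accuracy still meets the target. Below is a minimal sketch on toy data; the `coverage_at_accuracy` helper and the example values are illustrative, not the paper's evaluation code.

```python
import numpy as np

def coverage_at_accuracy(confidences, correct, target_acc):
    """Largest coverage at which selective accuracy >= target_acc,
    rejecting predictions from least to most confident."""
    order = np.argsort(-confidences)  # most confident first
    selective_acc = np.cumsum(correct[order]) / np.arange(1, len(correct) + 1)
    meets = np.nonzero(selective_acc >= target_acc)[0]
    return 0.0 if len(meets) == 0 else (meets[-1] + 1) / len(correct)

# Toy model: 8 confident correct predictions, 2 unconfident mistakes.
confidences = np.array([0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8, 0.3, 0.3])
correct = np.array([True] * 8 + [False] * 2)
print(coverage_at_accuracy(confidences, correct, 0.95))  # 0.8
```

At a 95% selective accuracy target the toy model keeps 80% of its predictions; a target it cannot meet at any coverage yields 0.0, mirroring the EfficientNet-V2-XL result above.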

1. INTRODUCTION

The excellent performance of deep neural networks (DNNs) has been demonstrated in a range of applications, including computer vision, natural language understanding, and audio processing. To deploy these models successfully, it is imperative that they provide an uncertainty quantification of their predictions, either via some kind of selective prediction or a probabilistic confidence score. Which metric, however, should we use to evaluate uncertainty estimation performance? There are many diverse options, so the answer is not obvious. To demonstrate the difficulty, consider two classification models for the stock market that predict whether a stock's value is about to increase, decrease, or remain neutral (three-class classification). Suppose that model A has 95% true accuracy and generates a confidence score of 0.95 on every prediction (even on misclassified instances), while model B has 40% true accuracy but always gives a confidence score of 0.6 on correct predictions and 0.4 on incorrect ones. Model B can easily be utilized to generate perfect investment decisions: using selective prediction (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017), Model B will simply reject all investments in stocks whenever the confidence score is 0.4. While model A offers many more investment opportunities, each of its predictions carries a 5% risk of failure. Among the various metrics proposed for evaluating the performance of uncertainty estimation are: Area Under the Receiver Operating Characteristic (AUROC or AUC); Area Under the Risk-Coverage curve (AURC) (Geifman et al., 2018); selective risk or coverage for a selective accuracy constraint (SAC); Negative Log-Likelihood (NLL); Expected Calibration Error (ECE), which is often used for evaluating a model's calibration (see Section 2); and the Brier score (Brier, 1950).

[Figure 1 caption: A full version of this graph is given in Figure 8. Distilled models are better than non-distilled ones. A subset of ViT models is superior to all other models in all aspects of uncertainty estimation ("ViT" in the legend, marked as a red triangle facing upwards); the performance of EfficientNet-V2 and GENet models is worse.]

All these metrics are well known and are often used for comparing the uncertainty estimation performance of models (Moon et al., 2020; Nado et al., 2021; Maddox et al., 2019; Lakshminarayanan et al., 2017). Somewhat surprisingly, NLL, Brier, AURC, and ECE all fail to reveal the uncertainty superiority of Model B in our investment example (see Appendix A for the calculations). Both AUROC and SAC, on the other hand, reveal the advantage of Model B perfectly (see Appendix A for details). It is not hard to construct counterexamples where these two metrics fail and others (e.g., ECE) succeed. To sum up this brief discussion, we believe that the ultimate suitability of a performance metric should be determined by its context. If there is no specific application in mind, there is a strong incentive to examine a variety of metrics, as we do in this study. This study evaluates the ability of 523 models from the Torchvision and timm repositories (Paszke et al., 2019; Wightman, 2019) to estimate uncertainty. Our study identifies several major factors that affect confidence ranking, calibration, and selective prediction, leading to numerous empirical contributions important to selective prediction and uncertainty estimation. While no new algorithm or method is introduced in our paper, our study generates many interesting conclusions that will help practitioners achieve more powerful uncertainty estimation. Moreover, the research questions uncovered by our empirical study shed light on uncertainty estimation, which may stimulate the development of new methods and techniques for improving it.
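The Model A/B example can be made concrete. For classification, AUROC here is the probability that a randomly chosen correct prediction receives a higher confidence score than a randomly chosen incorrect one (with ties counting half). A minimal sketch on small illustrative arrays mirroring the example above (not the paper's code):

```python
import numpy as np

def auroc_correct_vs_incorrect(confidences, correct):
    """AUROC for the binary task 'is this prediction correct?',
    ranked by confidence (Mann-Whitney U formulation; ties count 0.5)."""
    pos = confidences[correct]    # scores on correct predictions
    neg = confidences[~correct]   # scores on incorrect predictions
    grid = pos[:, None] - neg[None, :]
    return ((grid > 0).sum() + 0.5 * (grid == 0).sum()) / grid.size

# Model A: 95% accuracy, constant confidence 0.95 on every prediction.
correct_a = np.array([True] * 19 + [False])
conf_a = np.full(20, 0.95)
# Model B: 40% accuracy, confidence 0.6 on correct / 0.4 on incorrect.
correct_b = np.array([True] * 4 + [False] * 6)
conf_b = np.where(correct_b, 0.6, 0.4)

print(auroc_correct_vs_incorrect(conf_a, correct_a))  # 0.5
print(auroc_correct_vs_incorrect(conf_b, correct_b))  # 1.0
```

Model A's constant confidence is uninformative for ranking (AUROC 0.5), while Model B separates its correct and incorrect predictions perfectly (AUROC 1.0), matching the discussion above.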
Among the most interesting conclusions our study elicits are: (1) Knowledge distillation training improves uncertainty estimation (elaborated below). (2) Certain architectures are more inclined to perform better or worse at uncertainty estimation. Some architectures seem more inclined to perform well on all aspects of uncertainty estimation, e.g., a subset of vision transformers (ViTs) (Dosovitskiy et al., 2021) and the zero-shot language-vision CLIP model (Radford et al., 2021), while other architectures tend to perform worse, e.g., EfficientNet-V2 and GENet (Tan & Le, 2021; Lin et al., 2020). These results are visualized in Figure 1. In Galil et al. (2023) we find that ViTs and CLIPs are also powerful C-OOD detectors. (3) Several training regimes result in a subset of ViTs that outperforms all other architectures and training regimes. These regimes include the original one from the paper introducing ViTs (Dosovitskiy et al., 2021; Steiner et al., 2022; Chen et al., 2022; Ridnik et al., 2021). These ViTs



Our code is available at https://github.com/IdoGalil/benchmarking-uncertainty-estimation-performance



Knowledge distillation training improves estimation. Training regimes incorporating any kind of knowledge distillation (KD) (Hinton et al., 2015) lead to DNNs with improved uncertainty estimation performance as evaluated by any metric, more so than any other training trick (such as pretraining on a larger dataset, adversarial training, etc.). In Galil et al. (2023) we find similar performance boosts for class-out-of-distribution (C-OOD) detection.
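ECE, the calibration metric used alongside the others above, can be sketched as follows. Note how it penalizes Model B from the introduction (ECE 0.4) while scoring Model A as perfectly calibrated (ECE 0), even though B's confidence ranking is perfect; this illustrates why ECE fails to reveal B's superiority. The snippet is a minimal illustration with toy arrays, not the paper's evaluation code:

```python
import numpy as np

def ece(confidences, correct, n_bins=15):
    """Expected Calibration Error: coverage-weighted gap between
    accuracy and mean confidence inside equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            total += in_bin.mean() * gap
    return total

# Model A: 95% accuracy, constant confidence 0.95 -> perfectly calibrated.
correct_a = np.array([True] * 19 + [False])
conf_a = np.full(20, 0.95)
# Model B: 40% accuracy, confidence 0.6 on correct / 0.4 on incorrect.
correct_b = np.array([True] * 4 + [False] * 6)
conf_b = np.where(correct_b, 0.6, 0.4)

print(round(ece(conf_a, correct_a), 6))  # 0.0
print(round(ece(conf_b, correct_b), 6))  # 0.4
```

Model B is overconfident when wrong and underconfident when right, so every bin shows a 0.4 gap despite the ranking being perfect, which is exactly the failure mode of calibration metrics in the investment example.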

[Figure 1 caption, continued: Each marker's size is determined by the model's number of parameters.]

