WHAT CAN WE LEARN FROM THE SELECTIVE PREDICTION AND UNCERTAINTY ESTIMATION PERFORMANCE OF 523 IMAGENET CLASSIFIERS?

Abstract

When deployed for risk-sensitive tasks, deep neural networks must include an uncertainty estimation mechanism. Here we examine how deep architectures and their respective training regimes relate to their selective prediction and uncertainty estimation performance. We consider several of the most popular estimation performance metrics previously proposed, including AUROC, ECE, and AURC, as well as coverage for a selective accuracy constraint. We present a novel and comprehensive study of the selective prediction and uncertainty estimation performance of 523 existing pretrained deep ImageNet classifiers available in popular repositories. We identify numerous previously unknown factors that affect uncertainty estimation and examine the relationships between the different metrics. We find that distillation-based training regimes consistently yield better uncertainty estimation than other training schemes, such as vanilla training, pretraining on a larger dataset, and adversarial training. Moreover, we find a subset of ViT models that outperform all other models in terms of uncertainty estimation performance. For example, we discovered an unprecedented 99% top-1 selective accuracy on ImageNet at 47% coverage (and 95% top-1 accuracy at 80% coverage) for a ViT model, whereas a competing EfficientNet-V2-XL cannot meet these accuracy constraints at any level of coverage. Our companion paper, also published in ICLR 2023 (Galil et al., 2023), examines the performance of these classifiers in a class-out-of-distribution setting.

1. INTRODUCTION

The excellent performance of deep neural networks (DNNs) has been demonstrated in a range of applications, including computer vision, natural language understanding and audio processing. To deploy these models successfully, it is imperative that they provide an uncertainty quantification of their predictions, either via some kind of selective prediction or a probabilistic confidence score. But what metric should we use to evaluate uncertainty estimation performance? There are many diverse candidates, so the answer is not obvious. To demonstrate the difficulty, consider two classification models for the stock market that predict whether a stock's value is about to increase, decrease, or remain neutral (three-class classification). Suppose that model A has a 95% true accuracy and generates a confidence score of 0.95 on every prediction (even on misclassified instances), whereas model B has only a 40% true accuracy but always gives a confidence score of 0.6 on correct predictions and 0.4 on incorrect ones. Model B can easily be used to generate perfect investment decisions: using selective prediction (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017), model B simply rejects all investments whenever its confidence score is 0.4 (see the numerical sketch following this paragraph). While model A offers many more investment opportunities, each of its predictions carries a 5% risk of failure. Among the various metrics proposed for evaluating the performance of uncertainty estimation are: Area Under the Receiver Operating Characteristic curve (AUROC or AUC), Area Under the Risk-Coverage curve (AURC) (Geifman et al., 2018), selective risk or coverage for a selective accuracy constraint (SAC), Negative Log-Likelihood (NLL), Expected Calibration Error (ECE), which is often used for evaluating a model's calibration (see Section 2), and the Brier score (Brier, 1950). All these metrics
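To make the toy example concrete, the following minimal Python sketch (ours, not taken from the paper) simulates the two hypothetical models and computes coverage and selective accuracy when predictions whose confidence falls below a threshold are rejected. The function name `selective_stats` and the simulation setup are illustrative assumptions, not part of any existing library.

```python
import numpy as np

def selective_stats(confidence: np.ndarray, correct: np.ndarray, threshold: float):
    """Coverage and selective accuracy when rejecting predictions with confidence < threshold."""
    accepted = confidence >= threshold
    coverage = accepted.mean()
    selective_accuracy = correct[accepted].mean() if accepted.any() else float("nan")
    return coverage, selective_accuracy

rng = np.random.default_rng(0)
n = 10_000

# Model A: 95% true accuracy, confidence 0.95 on every prediction.
correct_a = rng.random(n) < 0.95
conf_a = np.full(n, 0.95)

# Model B: 40% true accuracy, confidence 0.6 when correct and 0.4 when incorrect.
correct_b = rng.random(n) < 0.40
conf_b = np.where(correct_b, 0.6, 0.4)

print(selective_stats(conf_a, correct_a, threshold=0.95))  # coverage 1.0, accuracy ~0.95
print(selective_stats(conf_b, correct_b, threshold=0.5))   # coverage ~0.4, accuracy 1.0
```

With a threshold of 0.5, model B attains 100% selective accuracy at roughly 40% coverage, whereas model A cannot exceed its 95% accuracy at any coverage, which is precisely the asymmetry the example is meant to highlight.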

