UNCERTAINTY FOR DEEP IMAGE CLASSIFIERS ON OUT-OF-DISTRIBUTION DATA

Abstract

In addition to achieving high accuracy, in many applications it is important to estimate the probability that a model prediction is correct. Predictive uncertainty is particularly important on out-of-distribution (OOD) data, where accuracy degrades. However, models are typically overconfident, and model calibration on OOD data remains a challenge. In this paper we propose a simple post hoc calibration method that significantly improves on benchmark results (Ovadia et al., 2019) across a wide range of corrupted data. Our method uses outlier exposure to properly calibrate the model probabilities.

1. PREDICTIVE UNCERTAINTY

When a machine learning model makes a prediction, we want to know how confident (or uncertain) we should be about the result. Uncertainty estimates are useful for both in-distribution and out-of-distribution (OOD) data. Predictive uncertainty addresses this challenge by endowing model predictions with estimates of class membership probabilities. The baseline method for predictive uncertainty is to simply use the softmax probabilities of the model, p^softmax(x) = softmax(f(x)), as a surrogate for class membership probabilities (Hendrycks & Gimpel, 2017), where f(x) denotes the model outputs. Other approaches include temperature scaling (Guo et al., 2017), dropout (Gal & Ghahramani, 2016; Srivastava et al., 2014), and model ensembles (Lakshminarayanan et al., 2017), as well as Stochastic Variational Bayesian Inference (SVBI) for deep learning (Blundell et al., 2015; Graves, 2011; Louizos & Welling, 2016; 2017; Wen et al., 2018), among others. All methods suffer from some degree of calibration error, which is the difference between predicted error rates and actual error rates, as measured by collecting the data into bins based on the maximum softmax probability p_max = max_i p_i^softmax. The standard measurement of calibration error is the expected calibration error (ECE; Guo et al., 2017), although other measures have been used (see Nguyen et al. (2015); Hendrycks & Gimpel (2017)), including the Brier score (DeGroot & Fienberg, 1983), which is also used in Ovadia et al. (2019).
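For concreteness, the ECE can be computed by binning predictions according to p_max and averaging the gap between confidence and accuracy in each bin. The following is a minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by their confidence p_max, then average
    the |accuracy - confidence| gap, weighted by bin size."""
    p_max = probs.max(axis=1)            # confidence of the top prediction
    preds = probs.argmax(axis=1)
    correct = (preds == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_max > lo) & (p_max <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - p_max[mask].mean())
            ece += mask.mean() * gap     # mask.mean() = fraction of samples in bin
    return ece
```

With equal-width bins this matches the standard definition ECE = sum_b (n_b / n) |acc(b) - conf(b)|; an overconfident model has confidence exceeding accuracy in most bins, inflating the score.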

1.1. OUR RESULTS

We are interested in image classification problems, in particular the CIFAR-10 and ImageNet 2012 datasets, in the setting of distributional covariate shift, where the data has been corrupted by an unknown transformation of unknown intensity. These corruptions are described in Hendrycks & Dietterich (2019). Our starting point is the work of Ovadia et al. (2019), which offers a large-scale benchmark of existing state-of-the-art methods for evaluating uncertainty on classification problems under dataset shift and provides the softmax model outputs. One of the main takeaways from Ovadia et al. (2019) is that, unsurprisingly, the quality of the uncertainty predictions deteriorates significantly along with the dataset shift. In order to calibrate for different intensity levels of unknown corruptions, we make use of surrogate calibration sets, which are corruptions of the data by a different (known) corruption. Then, given an image (or a sample of images), we first estimate the corruption level, and then recalibrate the model probabilities based on the surrogate representative calibration set. The latter step is a simple statistical calibration step that converts the model outputs into calibrated uncertainty estimates. Surprisingly, we can estimate the corruption level using the model outputs alone. We focus on the probability of correct classification, using the p_max values. The ECE decreases across almost all methods and levels of intensity, with the greatest improvement at the higher intensities; see Tables 1 and 2 in the Appendix for numerical comparisons. In Figure 1 we compare each benchmark uncertainty method with our two methods. The single-image method determines the appropriate calibration set using only a single image. The multiple-image method uses a sample of images (all drawn from the same corruption level and type) to choose the calibration set.
More images allow for a better choice of the calibration set, further reducing the calibration error. As shown in the figure, the improvement is greatest at the higher intensities, and the Brier scores give similar results (see Figure 5 below). We also reproduce a figure from Ovadia et al. (2019) that gives a whisker plot of the distribution of the values. Our results rest on the fact that data distribution shift typically leads to overconfident models (Nguyen et al., 2015): the p_max values lie above the true probabilities, and so they themselves are shifted. This allows us to use the p_max distribution shift as a surrogate for the data distribution shift, and ultimately to significantly reduce the calibration error using a purely statistical approach. In practice, we perform the model recalibration for the different corruptions and intensities based simply on the p_max distribution shift, which we detect using surrogate corrupted calibration sets. Crucially, the corruption used to generate the surrogate corrupted datasets is left out of the test set. The calibrated probabilities are visualized using histograms in Figure 2. In Figure 3, we show the p_max distributions for the chosen surrogate corrupted calibration sets and for a specific test-set corruption.
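The two-step procedure described above, selecting a surrogate calibration set from the p_max distribution shift and then recalibrating against it, can be sketched as follows. The section does not pin down the distance between p_max distributions or the exact statistical calibrator, so the L1 histogram distance and histogram binning used here are illustrative assumptions, and all names are ours:

```python
import numpy as np

def choose_calibration_set(test_pmax, surrogate_pmax_sets, n_bins=20):
    """Pick the surrogate calibration set whose p_max histogram is
    closest to the test-time p_max histogram (L1 distance between
    normalized histograms is our assumption)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    t = np.histogram(test_pmax, bins=edges, density=True)[0]
    dists = [np.abs(np.histogram(s, bins=edges, density=True)[0] - t).sum()
             for s in surrogate_pmax_sets]
    return int(np.argmin(dists))

def fit_recalibrator(cal_pmax, cal_correct, n_bins=10):
    """Histogram binning on the chosen surrogate set: each confidence
    bin is mapped to the empirical accuracy of the calibration points
    that fall in it (one simple statistical calibration step)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(cal_pmax, edges) - 1, 0, n_bins - 1)
    bin_acc = np.array([
        cal_correct[idx == b].mean() if np.any(idx == b)
        else 0.5 * (edges[b] + edges[b + 1])  # empty bin: fall back to midpoint
        for b in range(n_bins)
    ])
    def recalibrate(pmax):
        j = np.clip(np.digitize(pmax, edges) - 1, 0, n_bins - 1)
        return bin_acc[j]
    return recalibrate
```

In this sketch, each surrogate set would hold the p_max values computed on data corrupted by the held-out corruption at one intensity level; `choose_calibration_set` selects the intensity whose p_max shift best matches the test sample, and the returned recalibrator replaces overconfident p_max values by empirical accuracies.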

1.2. OTHER RELATED WORK

Models trained on a given dataset are unlikely to perform as well on a shifted dataset (Hendrycks & Dietterich, 2019). Moreover, there are inevitable tradeoffs between accuracy and robustness (Chun et al., 2020). Training models against corruptions can fail to make models robust to new corruptions (Vasiljevic et al., 2016; Geirhos et al., 2018). Hendrycks et al. (2019) deal with anomaly detection, the task of distinguishing between anomalous and in-distribution data. They propose an approach called Outlier Exposure (OE), which consists in training anomaly detectors on an auxiliary dataset of outliers.



Figure 1: Comparison of the benchmark implementation (Ovadia et al., 2019) versus our single- and multiple-image methods. Mean Expected Calibration Error (ECE) across different corruption types, for fixed corruption intensity going from 0 to 5. Each box represents a different uncertainty method. The ECE decreases across almost all methods and levels of intensity, with the greatest improvement at the higher intensities. See Tables 1 and 2 in the Appendix for numerical comparisons.

