UNCERTAINTY FOR DEEP IMAGE CLASSIFIERS ON OUT-OF-DISTRIBUTION DATA

Abstract

In addition to achieving high accuracy, in many applications, it is important to estimate the probability that a model prediction is correct. Predictive uncertainty is particularly important on out-of-distribution (OOD) data where accuracy degrades. However, models are typically overconfident, and model calibration on OOD data remains a challenge. In this paper we propose a simple post hoc calibration method that significantly improves on benchmark results (Ovadia et al., 2019) on a wide range of corrupted data. Our method uses outlier exposure to properly calibrate the model probabilities.

1. PREDICTIVE UNCERTAINTY

When a machine learning model makes a prediction, we want to know how confident (or uncertain) we should be about the result. Uncertainty estimates are useful for both in-distribution and out-of-distribution (OOD) data. Predictive uncertainty addresses this challenge by endowing model predictions with estimates of class membership probabilities. The baseline method for predictive uncertainty is to simply use the softmax probabilities of the model, p_softmax(x) = softmax(f(x)), as a surrogate for class membership probabilities (Hendrycks & Gimpel, 2017). Here f(x) denotes the model outputs. Other approaches include temperature scaling (Guo et al., 2017), dropout (Gal & Ghahramani, 2016; Srivastava et al., 2014), and model ensembles (Lakshminarayanan et al., 2017), as well as Stochastic Variational Bayesian Inference (SVBI) for deep learning (Blundell et al., 2015; Graves, 2011; Louizos & Welling, 2016; 2017; Wen et al., 2018), among others. All methods suffer from some degree of calibration error, which is the difference between predicted error rates and actual error rates, as measured by collecting data into bins based on p_max = max_i p_i^softmax. The standard measurement of calibration error is the expected calibration error (ECE) (Guo et al., 2017), although other measures have been used (see Nguyen et al. (2015); Hendrycks & Gimpel (2017)), including the Brier score (DeGroot & Fienberg, 1983), which is also used in Ovadia et al. (2019).

1.1. OUR RESULTS

We are interested in image classification problems, in particular the CIFAR-10 and ImageNet 2012 datasets, in the setting of covariate shift, where the data has been corrupted by an unknown transformation of unknown intensity. These corruptions are described in Hendrycks & Dietterich (2019). Our starting point is the work of Ovadia et al. (2019), which offers a large-scale benchmark of existing state-of-the-art methods for evaluating uncertainty on classification problems under dataset shift, and which provides the softmax model outputs. One of the main take-aways from the work of Ovadia et al. (2019) is that, unsurprisingly, the quality of the uncertainty predictions deteriorates significantly along with the dataset shift. In order to calibrate for different intensity levels of unknown corruptions, we make use of surrogate calibration sets, which are corruptions of the data by a different (known) corruption. Then, when given an image (or sample of images), we first estimate the corruption level, and then recalibrate the model probabilities based on the representative surrogate calibration set. The latter step is done with a simple statistical calibration step, converting the model outputs into calibrated uncertainty estimates. Surprisingly, we can estimate the corruption level using only the model outputs. We focus on the probability of correct classification, using the p_max values.

Figure 1: Comparison of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods. Mean Expected Calibration Error (ECE) across different corruption types, for fixed corruption intensity going from 0 to 5. Each box represents a different uncertainty method. The ECE decreases across almost all methods and levels of intensity, with the greatest improvement at the higher intensities. See Tables 1 and 2 in the Appendix for numerical comparisons.

In Figure 1 we compare each benchmark uncertainty method with our two methods.
The single image method determines the appropriate calibration set using only a single image. The multiple image method uses a sample of images (all drawn from the same corruption level and type) to choose the calibration set. More images allow for a better choice of the calibration set, further reducing the calibration error. As shown in the figure, the ECE decreases across almost all methods and levels of intensity, with the greatest improvement at the higher intensities. The Brier scores give similar results (see Figure 5 below). We also reproduce a figure from Ovadia et al. (2019) which gives a whisker plot of the distribution of the values. Our results are based on the fact that data distribution shift typically leads to overconfident models (Nguyen et al., 2015): the p_max values are above the true probability, and so they themselves are shifted. This allows us to use the p_max distribution shift as a surrogate for data distribution shift and ultimately to significantly reduce the calibration error using a purely statistical approach. In practice, we perform the model recalibration for the different corruptions and intensities based simply on the p_max distribution shift, which we detect using surrogate corrupted calibration sets. Crucially, the corruption used to generate the surrogate corrupted datasets is left out of the test set. The calibrated probabilities are visualized using histograms in Figure 2. In Figure 3, we can see the p_max distributions for the chosen surrogate corrupted calibration sets and for a specific test set corruption.

1.2. OTHER RELATED WORK

Models trained on a given dataset are unlikely to perform as well on a shifted dataset (Hendrycks & Dietterich, 2019). Moreover, there are inevitable tradeoffs between accuracy and robustness (Chun et al., 2020). Training models against corruptions can fail to make models robust to new corruptions (Vasiljevic et al., 2016; Geirhos et al., 2018). Hendrycks et al. (2019) propose outlier exposure, which improves OOD detection by training the model against an auxiliary dataset of outliers. Similarly, in Hendrycks et al. (2020), the authors propose AUGMIX, a method that improves both robustness and uncertainty measures by exposing the model to perturbed images during training. Shao et al. (2020) propose a confidence calibration method that uses an auxiliary class to separate misclassified samples from correctly classified ones, thus allowing the misclassified samples to be assigned a low confidence. Nado et al. (2020) argue that the internal activations of deep models also suffer from covariate shift in the presence of OOD images; they therefore propose to recompute the batch norm statistics at prediction time using a sample of unlabeled images from the test distribution, improving the accuracy and ultimately the calibration. Related calibration approaches under dataset shift are also proposed by Park et al. (2020) and Wang et al. (2020).

2.1. CLASSIFICATION AND LABEL PROBABILITIES

Predictive uncertainty seeks to estimate the probability that an input x belongs to each class,

p_class_k(x) = P[y = k | x], for each k ∈ Y.  (1)

Here we write x ∈ X for data and y ∈ Y = {1, . . . , K} for labels. In the benchmark methods, the softmax of the model outputs, p_softmax(x) = softmax(f(x)), is used as a surrogate for the class probabilities. In the case of the vanilla and temperature scaling methods, f(x) is simply the model outputs. Similarly, for the Ensemble and Dropout methods, p_softmax(x) represents the average of the probability vectors over the multiple models or queries of the model, respectively. Generally speaking, these softmax values are not an accurate prediction of the class probabilities p_class_k(x) (Domingos & Pazzani, 1996). Here we focus on the probability of correct classification,

p_correct(x) = P[y = ŷ(x)],  (2)

where the classification of the model is given by ŷ(x) = arg max_i f_i(x). Guo et al. (2017) showed that p_max = max_i p_i^softmax usually overestimates p_correct. Our method can be extended to top-5 correctness, as well as to other quantities of interest, using different binning methods based on structured predictors (Kuleshov & Liang, 2015).
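As a concrete illustration of the baseline surrogate, the following minimal NumPy sketch (not the paper's code) computes the prediction ŷ(x) = arg max_i f_i(x) and the confidence surrogate p_max from the model outputs f(x):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_with_confidence(logits):
    """Return the predicted class y_hat = argmax_i f_i(x) and the
    surrogate confidence p_max = max_i p_i^softmax."""
    p = softmax(logits)
    return p.argmax(axis=-1), p.max(axis=-1)

# Example: a batch of 2 images with K = 3 classes (illustrative values).
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])
y_hat, p_max = predict_with_confidence(logits)
# y_hat is [0, 2]; p_max holds the (typically overconfident) softmax maxima.
```

As the text notes, these p_max values generally overestimate p_correct, which is what the calibration step below corrects.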

Problem definition

We are given a model (or ensemble of models) trained on a given dataset ρ_train. We want calibrated error estimates on an unknown (different) dataset ρ_test, which could have different levels of corruption. Since one model cannot be calibrated on all of the possible (different) datasets, we want to allow for multiple calibrations and apply them accordingly. We study two cases: (i) we have a single image drawn from an unknown distribution, or (ii) we have multiple images, each drawn from the same unknown distribution. In the latter case, we use the full test set (we obtain similar results using 100 images).

3.1. CALIBRATING FOR DATASET SHIFT

We choose C distinct calibration sets generated from shifted distributions ρ_CAL,j. These sets are chosen to have different representative degrees of corruption intensity. Each calibration set leads to a different uncertainty estimate for a given p_max value. We adaptively choose the calibration set given a single image (Single Image Method) or a test set of images (Multiple Image Method).

Figure 5: Comparison of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods. Mean Brier score across different corruption types, for fixed corruption intensity going from 0 to 5. Each box represents a different uncertainty method. See Tables 3 and 4 in the Appendix for numerical comparisons.

For each calibration dataset S_CAL,j, we convert the p_max values on that calibration set into p_correct values as follows. Recall that p_max = max_i p_i^softmax, i.e., the model's probability for the predicted class.

(i) Evaluate the model p_max values on the calibration set, P_CAL,j = {p_max(x) | x ∈ S_CAL,j}. Record the probability density h_CAL,j of the p_max values as a histogram, by binning the p_max values using equally spaced bins.

(ii) Define the calibrated model probabilities by

p_correct^CAL,j(x) = P[y = ŷ(x) | p_max(x)],  (3)

using ground truth labels on the calibration set S_CAL,j. These probabilities are computed using a histogram in the following way. Partition S_CAL,j = {x_1, . . . , x_m} into bins B_i according to its p_max values. Given an image x with p_max(x) ∈ B_i, approximate p_correct^CAL,j(x) as

(1/|B_i|) Σ_{p_max(x_j) ∈ B_i} 1[y(x_j) = ŷ(x_j)].

See (Oberman et al., 2020) for more details.

Single Image Method: Given a single image x drawn from an unknown distribution ρ_test, we estimate the likelihood that the corruption level of the image corresponds to each of the calibration sets, and then take the corresponding weighted average of the calibrated probabilities.
Ideally, in the single image method we would like to obtain q_i(p_max) close to one for the calibration set whose p_max distribution is closest to the p_max distribution of the test images.

(i) The probability that the p_max value came from a given calibration set, making the standard assumption that the a priori likelihoods of the calibration sets are all equal, is

q_i(p_max) = h_CAL,i(p_max) / Σ_{j=1}^C h_CAL,j(p_max).

(ii) Then the calibrated probability, conditional on each of the calibration sets, is given by

p_correct^test(x) = Σ_{j=1}^C q_j(p_max(x)) p_correct^CAL,j(x).

Multiple Image Method: Given a test set S_test of images drawn from the same unknown distribution: (i) Record the corresponding model p_max values, P_test = {p_max(x) | x ∈ S_test}, and compute their mean µ. (ii) Compare to the means µ_j of the P_CAL,j and find the closest mean among the calibration sets, i = arg min_j |µ − µ_j|. Set p_correct^test(x) = p_correct^CAL,i(x). We can use this simpler formula for the multiple image method because, with multiple samples, knowing the mean is sufficient to estimate the correct calibration set.
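The procedure above can be sketched in NumPy as follows. This is an illustrative implementation under assumed inputs (per-calibration-set arrays of p_max values and correctness indicators), not the authors' code; for simplicity it uses 30 equally spaced bins throughout, whereas the paper also uses equally sized bins.

```python
import numpy as np

BINS = np.linspace(0.0, 1.0, 31)  # 30 equally spaced bins on [0, 1]

def fit_calibration_set(p_max, correct):
    """From p_max values and 0/1 correctness indicators on one calibration
    set S_CAL,j, record (a) the density h_CAL,j of p_max, (b) the per-bin
    accuracy p_correct^CAL,j (eq. 3), and (c) the p_max mean (used by the
    multiple image method)."""
    idx = np.clip(np.digitize(p_max, BINS) - 1, 0, len(BINS) - 2)
    density = np.zeros(len(BINS) - 1)
    acc = np.zeros(len(BINS) - 1)
    for b in range(len(BINS) - 1):
        in_bin = idx == b
        density[b] = in_bin.mean()
        acc[b] = correct[in_bin].mean() if in_bin.any() else 0.0
    return {"h": density, "p_correct": acc, "mu": p_max.mean()}

def single_image_calibrate(p_max_x, cal_sets):
    """Single Image Method: weighted average over calibration sets, with
    weights q_i given by the relative densities h_CAL,i(p_max), assuming
    equal a priori likelihoods."""
    b = np.clip(np.digitize(p_max_x, BINS) - 1, 0, len(BINS) - 2)
    h = np.array([c["h"][b] for c in cal_sets])
    q = h / h.sum() if h.sum() > 0 else np.full(len(cal_sets), 1.0 / len(cal_sets))
    return sum(qi * c["p_correct"][b] for qi, c in zip(q, cal_sets))

def multi_image_calibrate(p_max_test, cal_sets):
    """Multiple Image Method: pick the calibration set whose p_max mean is
    closest to the test-set mean, then read off its per-bin p_correct."""
    mu = p_max_test.mean()
    i = int(np.argmin([abs(mu - c["mu"]) for c in cal_sets]))
    b = np.clip(np.digitize(p_max_test, BINS) - 1, 0, len(BINS) - 2)
    return cal_sets[i]["p_correct"][b], i
```

In this sketch, a downward shift of the test p_max distribution steers both methods toward the more heavily corrupted calibration sets, mirroring how the method uses the p_max shift as a surrogate for dataset shift.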

3.2. PRACTICAL IMPLEMENTATION OF OUR METHOD

In Ovadia et al. (2019) the additional methods used include (LL) approximate Bayesian inference for the parameters of the last layer only (Riquelme et al., 2018), (LL SVI) mean field stochastic variational inference on the last layer, and (LL Dropout) dropout only on the activations before the last layer. We refer to Ovadia et al. (2019) for more details on how each method was implemented. All these methods ultimately use the softmax probabilities as a surrogate for the class probabilities; the difference between the methods is how these probabilities are obtained. The distributional shift on ImageNet used 16 corruption types with corruption intensity on a scale from 0 to 5: various forms of Noise and Blur, as well as Pixelate, Saturate, Brightness, Contrast, Fog, Frost, etc. See Figure S3 in (Ovadia et al., 2019). We used the published softmax values of each of the methods from the benchmark dataset in Ovadia et al. (2019). We selected Contrast as the corruption to use for calibration, removing it from the test set. We calibrated our models using equally sized bins, on each of the following calibration sets: {0}, {0,1}, {0,2}, {0,3}, {0,4}, and {0,5}. For instance, the set {0,1} corresponds to clean images and their respective corruptions at an intensity level of 1. Heuristically, we always want clean images in our calibration set while having different shifted means as a result of increasingly corrupted images (see Figure 3). Without the clean images, the single image method would become uncalibrated for in-distribution images, as the calibration sets would have a disproportionate amount of corrupted images. For both CIFAR-10 and ImageNet, we select 5000 images for calibration, which are shared across the different calibration sets. This means that, for instance, the calibration set corresponding to {0,1} contains a total of 10000 images: the selected 5000 clean images and their corrupted counterparts at an intensity level of 1.
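The construction of the calibration sets {0}, {0,1}, . . . , {0,5} can be sketched as follows. This is a hypothetical sketch: the input format (a dict mapping each contrast intensity level to an array of p_max values aligned with the clean images) and the function name are assumptions for illustration, not the paper's code.

```python
import numpy as np

def build_calibration_sets(p_max_clean, p_max_contrast_by_level):
    """Assemble the surrogate calibration sets: {0} is the selected clean
    images alone; {0,k} pools the clean p_max values with those of the same
    images corrupted by contrast at intensity level k, so every set contains
    clean images but has an increasingly shifted p_max mean."""
    sets = {"{0}": p_max_clean}
    for k, p_max_k in sorted(p_max_contrast_by_level.items()):
        sets["{0,%d}" % k] = np.concatenate([p_max_clean, p_max_k])
    return sets
```

In the paper's setup each array would also carry the corresponding correctness indicators; the sketch tracks only the p_max values to show how the shifted means arise.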
A more sophisticated combination of calibration sets could lead to improvements. We reproduced the figures in Ovadia et al. (2019) with a small adjustment: we measure the ECE and Brier score just for p_correct, rather than for p_class_k. However, this made a negligible difference to the values. In addition, the ECE can depend on the binning procedure (equally spaced or equally sized bins). Equally sized bins are more effective for calibration since they reduce statistical error. They do, however, lead to different bin edges on different calibration sets, which required combining the bins. This can be done by refinement (which we used here) or simply by one-dimensional density estimation (Wasserman, 2006, Chapter 6).
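For reference, a sketch of the ECE computation under either binning procedure; the function and its defaults are illustrative, not the benchmark's exact implementation. Equally sized (quantile) bins put the same number of samples in each bin, which reduces the per-bin statistical error relative to equally spaced bins.

```python
import numpy as np

def ece(p_max, correct, n_bins=30, equal_size=True):
    """Expected calibration error for p_correct: the weighted average over
    bins of |accuracy - mean confidence|. equal_size=True uses quantile
    (equally sized) bins; False uses equally spaced bins on [0, 1]."""
    order = np.argsort(p_max)
    p, c = p_max[order], correct[order].astype(float)
    if equal_size:
        edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))
    else:
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    err, n = 0.0, len(p)
    for b in range(n_bins):
        m = idx == b
        if m.any():
            err += m.sum() / n * abs(c[m].mean() - p[m].mean())
    return err
```

A well-calibrated model (confidence 0.8, accuracy 0.8) scores near zero; an overconfident one (confidence 0.99, accuracy 0.5) scores near 0.49.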

3.3. DISCUSSION OF THE RESULTS

Both our single and multi image methods improve on the benchmark widely across methods and corruption intensity levels, as shown in Figure 1 and Tables 1 and 2, which report the mean ECE scores for each model across different corruption types, for fixed corruption intensity going from 0 to 5. The multi image method performs better than the single image method, with a few exceptions. This is a natural consequence of using more images to better estimate the corruption level. While the ensemble method remains overall the best method, the gap to other methods is significantly reduced. Moreover, we test the OOD detection performance. We evaluate the CIFAR-10 trained models on the SVHN dataset (Netzer et al., 2011), in addition to MNIST trained models (for these we used rotation to form the surrogate calibration sets) on Fashion-MNIST (Xiao et al., 2017) and Not-MNIST (Bulatov, 2011). Ideally, the models should not be confident when presented with this completely OOD data. As we can see in Figures 7 and 8, both the single and multi image methods are significantly less confident when compared to the benchmark data (Ovadia et al., 2019). We can explain the increased performance by looking at the p_max distributions depicted in Figure 3. For the SVHN dataset, the multi image method recalibrated the model based on calibration set {0, 5}: the dataset shift is correctly captured by the p_max shift and the probabilities are recalibrated accordingly. By exposing the model to corrupted images at the calibration stage, it now "knows what it does not know". We note that instead of requiring class predictions for all classes, our method only requires the p_max values, which is a considerably smaller amount of data. Moreover, we compile these values into a histogram with 30 bins, making the added cost of our method negligible.

4. CONCLUSIONS

Increasingly, we are asking models trained on a given dataset to perform on out-of-distribution data. Our work focused on uncertainty estimates, in particular an estimate of the probability that our model classification is correct. In contrast to most deep uncertainty work, we use a purely statistical approach to reduce the calibration error of deep image classifiers under dataset shift. The approach is model agnostic, so it can be applied to future models. Previous work has shown that uncertainty estimates degrade on corrupted data, as measured by the expected calibration error: the greater the mismatch between training data and test data, the greater the degradation of the uncertainty estimates. We overcome this limitation by introducing a method which allows a given model to be better calibrated to multiple corruption intensities. Our method works by no longer requiring that the model outputs approximate class probabilities. We add a simple extra calibration step, detecting the level of corruption of the data, which allows the use of calibrations tuned to the corruption level of the data.

A ABLATION AND CROSS VALIDATION STUDY

We start by investigating the impact of the choice of corruption for the calibration set. Ideally, the choice of corruption should be representative of the distribution of corruptions, so a very mild or a very strong corruption would give slightly worse results. At the same time, here we demonstrate that choosing a different corruption does not significantly degrade the results. In Figure 9 we perform a cross-validation study over the choice of corruption used to generate the calibration sets (always leaving it out of the corruptions used at test time). We plot the mean and variance of the ECE across different validation corruption types. For CIFAR-10, both the single and multi image methods are robust to the choice of validation corruption. On ImageNet, for the multi image method the improvement is consistent across the validation corruptions chosen, except for the dropout method. As for the single image method, the calibration at lower levels of intensity is degraded using certain corruptions, for example the glass blur corruption. However, this seems to be caused by the strength of the corruption: the accuracy on glass blur at level 1 was roughly half that of clean images. We hypothesize that better results could be obtained by simply having the corruption strength be proportional to the loss of accuracy, as is the case for the contrast corruption (see Figure 3). In practice, the choice of corruption for the single image method should be such that the method remains calibrated for in-distribution images. We also investigate the impact of adding corrupted images to the calibration sets. In order to do so, we compare the results of our proposed method and the benchmark from Ovadia et al. (2019) with the calibration obtained from using a single calibration set with only clean images, like the method proposed in (Oberman et al., 2020). We refer to it as the Top1 binning method.
As we can see in Figure 10, if the classifier is not well calibrated, e.g., the vanilla or the LL SVI classifiers, there is a consistent improvement across all corruption intensity levels, with the improvement being only marginal for ImageNet. Moreover, when the classifier is well calibrated, Top1 binning does not improve calibration (e.g., the Temp Scaling classifier) or even degrades it at high levels of corruption intensity (e.g., Ensemble and Dropout). Finally, we explore why the method works in practice. Figure 11 shows us that, without any calibration, the ECE scores become higher when the mismatch between the p_max distribution of the training set and the p_max distribution of the test set increases. Here we measure the mismatch in terms of the p_max means, the same criterion used in the multi image method. These qualitative results are confirmed by Pearson's correlation coefficient. This correlation justifies why detecting the p_max distribution shift allows us to significantly improve the calibration of the different methods: in practice, our proposed methods perform the recalibration of the model based on the calibration set whose p_max distribution is closest to the p_max distribution of the test set. Moreover, one notices that the higher the correlation, the bigger the calibration improvement provided by both our single and multi image methods. For instance, Dropout has the lowest Pearson's r score and is also the method where we notice the least improvement. On the other hand, Vanilla has the largest improvement and also the highest Pearson's r score.
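The correlation analysis of Figure 11 can be reproduced schematically as follows. The numbers here are toy placeholders chosen to illustrate the pattern (ECE growing with the p_max mean shift), not the paper's measurements.

```python
import numpy as np

# Each entry represents one (corruption, intensity) cell: its pre-calibration
# ECE and the mean of its test-set p_max distribution (toy values).
mu_train = 0.95
mu_test = np.array([0.94, 0.90, 0.84, 0.77, 0.70, 0.62])  # increasingly shifted
ece_pre = np.array([0.02, 0.06, 0.11, 0.17, 0.24, 0.31])  # ECE grows with shift

# Mismatch measured by the p_max means, as in the multi image method.
shift = np.abs(mu_test - mu_train)
r = np.corrcoef(shift, ece_pre)[0, 1]  # Pearson's r; close to 1 here
```

A high r indicates that the p_max mean shift is a good predictor of miscalibration, which is exactly what makes mean-matching a sensible criterion for selecting the calibration set.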

B TABLES OF BRIER METRICS

Table 3 and Table 4 report the mean Brier scores for each model and dataset across different corruption types, for fixed corruption intensity going from 0 to 5. The Brier scores can be computed directly from the data, without binning. The ranking provided by the Brier scores is quite similar to that provided by the ECE. On ImageNet the only difference is Ensemble at corruption level 1; on CIFAR-10 there were two ranking differences.

C TABLES OF ECE METRICS ACROSS DIFFERENT CORRUPTIONS

Table 5 and Table 6 report the ECE scores for the vanilla model across different corruption types and intensities ranging from 0 to 5, for ImageNet and CIFAR-10, respectively. Contrast is the corruption used to form the calibration sets.



Figure 2: Visualization of calibration errors for the Vanilla method on corrupted images on ImageNet and CIFAR-10, using the elastic transform corruption with intensity 4. The x-axis corresponds to the p_max values and the y-axis to the confidence estimates of p_correct. The blue histogram shows the calibration probabilities, the orange the test probabilities, and the brown where both overlap. The ECE is large (lower is better) for the Vanilla method at higher corruption levels, due to the probability shift. The Brier score (lower is better) is also improved. The gap between the orange and blue curves represents the calibration error. Notice that we used 30 equally sized bins, so in the CIFAR-10 plot there are very few values below 0.4, which is why the first bin is wide.

Figure 4: Comparison of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods. Expected Calibration Error (ECE) distribution across different corruption types, for fixed corruption intensity going from 0 to 5. Each box represents a different uncertainty method.

Figure 6: Comparison of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods. Brier score distribution across different corruption types, for fixed corruption intensity going from 0 to 5. Each box represents a different uncertainty method. See Tables 3 and 4 in the Appendix for numerical comparisons.

Figure 7: Confidence of CIFAR-10 (bottom) trained models on entirely OOD data (the SVHN dataset (Netzer et al., 2011)). The benchmark method (blue) has the highest confidence. The single and multi image methods are much less confident on OOD data.

Figure 8: Confidence of MNIST trained models on entirely OOD data: Fashion-MNIST (Xiao et al., 2017) and Not-MNIST (Bulatov, 2011). Our proposed methods are significantly less confident on OOD data than the benchmark method.

Figure 9: Figure 1 shows the mean ECE across different corruption types, for fixed corruption intensity going from 0 to 5, when contrast is used in the calibration sets. Here we show how those means change when different corruptions are used in the calibration set. For CIFAR-10, our proposed methods are robust to the choice of corruption used in the calibration set, while for ImageNet the choice of corruption is important, in particular for the single image method.

Figure 11: ECE (pre-calibration) versus |µ_test − µ_train| and the corresponding Pearson's r score, where µ_test and µ_train denote the p_max mean of the test and training set, respectively. Each point in the plot represents a different corruption at a different level of intensity.

Table 1: Comparison on ImageNet of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods. Numerical values of the mean ECE scores across different corruption types, for fixed corruption intensity going from 0 to 5.

Table 2: Comparison on CIFAR-10 of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods. Numerical values of the mean ECE scores across different corruption types, for fixed corruption intensity going from 0 to 5.

Table 3: Comparison on ImageNet of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods. Numerical values of the mean Brier scores across different corruption types, for fixed corruption intensity going from 0 to 5.

Table 4: Comparison on CIFAR-10 of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods. Numerical values of the mean Brier scores across different corruption types, for fixed corruption intensity going from 0 to 5.

Table 5: Comparison on ImageNet of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods for the vanilla classifier. Numerical values of the ECE scores for different corruptions at different intensity levels going from 0 to 5. The contrast corruption was used to form the calibration sets.

Table 6: Comparison on CIFAR-10 of the benchmark implementation (Ovadia et al., 2019) versus our single and multiple image methods for the vanilla classifier. Numerical values of the ECE scores for different corruptions at different intensity levels going from 0 to 5. The contrast corruption was used to form the calibration sets.

