UNCERTAINTY FOR DEEP IMAGE CLASSIFIERS ON OUT-OF-DISTRIBUTION DATA

Abstract

In addition to achieving high accuracy, in many applications it is important to estimate the probability that a model prediction is correct. Predictive uncertainty is particularly important on out-of-distribution (OOD) data, where accuracy degrades. However, models are typically overconfident, and model calibration on OOD data remains a challenge. In this paper we propose a simple post hoc calibration method that significantly improves on benchmark results (Ovadia et al., 2019) on a wide range of corrupted data. Our method uses outlier exposure to properly calibrate the model probabilities.

1. PREDICTIVE UNCERTAINTY

When a machine learning model makes a prediction, we want to know how confident (or uncertain) we should be about the result. Uncertainty estimates are useful for both in-distribution and out-of-distribution (OOD) data. Predictive uncertainty addresses this challenge by endowing model predictions with estimates of class membership probabilities. The baseline method for predictive uncertainty is to simply use the softmax probabilities of the model, p_softmax(x) = softmax(f(x)), as a surrogate for class membership probabilities (Hendrycks & Gimpel, 2017), where f(x) denotes the model outputs. Other approaches include temperature scaling (Guo et al., 2017), dropout (Gal & Ghahramani, 2016; Srivastava et al., 2014), and model ensembles (Lakshminarayanan et al., 2017), as well as Stochastic Variational Bayesian Inference (SVBI) for deep learning (Blundell et al., 2015; Graves, 2011; Louizos & Welling, 2016; 2017; Wen et al., 2018), among others. All methods suffer from some degree of calibration error, which is the difference between predicted error rates and actual error rates, as measured by collecting data into bins based on p_max = max_i (p_softmax)_i.
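As a concrete illustration of the baseline, the sketch below computes p_softmax(x) and the confidence score p_max from raw model outputs; the logits here are made up for illustration and do not come from any model in the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# f(x): raw model outputs (logits) for a toy batch of 3 inputs, 4 classes.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.2, 0.1, 0.0, 0.0],
                   [5.0, -2.0, -2.0, -2.0]])

p = softmax(logits)        # p_softmax(x): class membership probabilities
p_max = p.max(axis=-1)     # confidence of the predicted class
pred = p.argmax(axis=-1)   # predicted labels
```

The p_max values are what get binned when measuring calibration error: a well-calibrated model is correct on roughly a fraction p_max of the predictions in each confidence bin.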

1.1. OUR RESULTS

We are interested in image classification problems, in particular the CIFAR-10 and ImageNet 2012 datasets, in the setting of covariate shift, where the data has been corrupted by an unknown transformation of unknown intensity. These corruptions are described in Hendrycks & Dietterich (2019). Our starting point is the work of Ovadia et al. (2019), which offers a large-scale benchmark of existing state-of-the-art methods for evaluating uncertainty on classification problems under dataset shift, and which provides the softmax model outputs. One of the main take-aways from Ovadia et al. (2019) is that, unsurprisingly, the quality of the uncertainty predictions deteriorates significantly as the dataset shift grows. In order to calibrate for different intensity levels of unknown corruptions, we make use of surrogate calibration sets, which are versions of the data corrupted by a different, known corruption. Then, given an image (or a sample of images), we first estimate the corruption level, and then recalibrate the model probabilities based on the representative surrogate calibration set. The latter step is done with a simple statistical calibration step, converting the model outputs into calibrated uncertainty estimates. Surprisingly, we can estimate the corruption level using the model outputs alone. We focus on the probability of correct classification, using the p_max values.1
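The two-step procedure described above (estimate the corruption level from the model outputs, then recalibrate against the matched surrogate set) might be sketched as follows. This is only an illustrative reconstruction under assumptions: we stand in temperature scaling for the "simple statistical calibration step" and match mean confidence to estimate the corruption level; the paper's exact procedure may differ, and all function names here are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    # Grid search for the temperature T minimising negative log-likelihood
    # on a surrogate calibration set (stands in for the paper's
    # calibration step, whose exact form is not reproduced here).
    def nll(T):
        p = softmax(logits / T)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return min(grid, key=nll)

def calibrate_with_surrogates(test_logits, surrogate_sets):
    # surrogate_sets: {intensity: (logits, labels)} for a *known* corruption.
    # 1) Fit one temperature per intensity level on the surrogate data,
    #    and record the mean confidence at that intensity.
    temps, mean_conf = {}, {}
    for k, (lg, lb) in surrogate_sets.items():
        temps[k] = fit_temperature(lg, lb)
        mean_conf[k] = softmax(lg).max(axis=-1).mean()
    # 2) Estimate the (unknown) corruption level of the test sample from
    #    the model outputs alone, by matching its mean confidence to the
    #    closest surrogate intensity level.
    test_conf = softmax(test_logits).max(axis=-1).mean()
    k_hat = min(mean_conf, key=lambda k: abs(mean_conf[k] - test_conf))
    # 3) Recalibrate with the matched level's temperature.
    return softmax(test_logits / temps[k_hat]), k_hat
```

The design choice worth noting is that step 2 requires no labels and no knowledge of the test-time corruption: only the model outputs on the shifted sample are used, mirroring the claim that the corruption level can be estimated from the outputs alone.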



1 The standard measurement of calibration error is the expected calibration error (ECE; Guo et al., 2017), although other measures have been used (see Nguyen et al. (2015); Hendrycks & Gimpel (2017)), including the Brier score (DeGroot & Fienberg, 1983), which is also used in Ovadia et al. (2019).
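A minimal sketch of the ECE computation referenced above, assuming the common convention of equal-width confidence bins (the bin count is a free parameter, chosen here arbitrarily):

```python
import numpy as np

def expected_calibration_error(p_max, correct, n_bins=15):
    # Bin predictions by confidence p_max and compare, per bin, the mean
    # confidence against the empirical accuracy; ECE is the average of
    # the absolute gaps, weighted by the fraction of samples in each bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_max > lo) & (p_max <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - p_max[mask].mean())
            ece += mask.mean() * gap
    return ece
```

On a perfectly calibrated set of predictions (accuracy equals mean confidence in every bin) the gaps vanish and ECE is zero; overconfidence shows up as bins where mean confidence exceeds accuracy.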

