UNCERTAINTY SETS FOR IMAGE CLASSIFIERS USING CONFORMAL PREDICTION

Abstract

Convolutional image classifiers can achieve high predictive accuracy, but quantifying their uncertainty remains an unresolved challenge, hindering their deployment in consequential settings. Existing uncertainty quantification techniques, such as Platt scaling, attempt to calibrate the network's probability estimates, but they do not have formal guarantees. We present an algorithm that modifies any classifier to output a predictive set containing the true label with a user-specified probability, such as 90%. The algorithm is simple and fast like Platt scaling, but provides a formal finite-sample coverage guarantee for every model and dataset. Our method modifies an existing conformal prediction algorithm to give more stable predictive sets by regularizing the small scores of unlikely classes after Platt scaling. In experiments on both Imagenet and Imagenet-V2 with ResNet-152 and other classifiers, our scheme outperforms existing approaches, achieving coverage with sets that are often a factor of 5 to 10 smaller than those of a stand-alone Platt scaling baseline.

1. INTRODUCTION

Imagine you are a doctor making a high-stakes medical decision based on diagnostic information from a computer vision classifier. What would you want the classifier to output in order to make the best decision? This is not a casual hypothetical; such classifiers are already used in medical settings (e.g., Razzak et al., 2018; Lundervold & Lundervold, 2019; Li et al., 2014). A maximum-likelihood diagnosis with an accompanying probability may not be the most essential piece of information. To ensure the health of the patient, you must also rule in or rule out harmful diagnoses. In other words, even if the most likely diagnosis is a stomach ache, it is equally or more important to rule out stomach cancer. Therefore, you would want the classifier to give you, in addition to an estimate of the most likely outcome, actionable uncertainty quantification, such as a set of predictions that provably covers the true diagnosis with high probability (e.g., 90%). This is called a prediction set (see Figure 1).

Our paper describes a method for constructing prediction sets from any pre-trained image classifier that are formally guaranteed to contain the true class with the desired probability, are relatively small, and are practical to implement. Our method modifies a conformal predictor (Vovk et al., 2005) given in Romano et al. (2020) for the purpose of modern image classification, making it more stable in the presence of noisy small probability estimates. Just as importantly, we provide extensive evaluations and code for conformal prediction in computer vision.

Formally, for a discrete response Y ∈ Y = {1, . . . , K} and a feature vector X ∈ R^d, we desire an uncertainty set function C(X) mapping a feature vector to a subset of {1, . . . , K} such that

    P(Y ∈ C(X)) ≥ 1 − α,    (1)

for a pre-specified confidence level α such as 10%.
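To make the coverage requirement concrete, here is a minimal sketch of how it is checked empirically; the helper name and toy data are illustrative, not from the paper:

```python
import numpy as np

def empirical_coverage(pred_sets, labels):
    """Fraction of held-out examples whose prediction set contains the
    true label; for a valid conformal procedure, this average should
    land at or above 1 - alpha."""
    return float(np.mean([y in s for s, y in zip(pred_sets, labels)]))

# Toy check: two of the three sets contain their true label.
cov = empirical_coverage([{0, 1}, {2}, {3, 4}], [1, 2, 0])
```

Because the guarantee is an average over examples, coverage can be lower on hard subpopulations and higher on easy ones while the overall rate still meets 1 − α.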
Conformal predictors like our method can modify any black-box classifier to output predictive sets that are rigorously guaranteed to satisfy the desired coverage property shown in Eq. (1). For evaluations, we focus on Imagenet classification using convolutional neural networks (CNNs) as the base classifiers, since this is a particularly challenging testbed. In this setting, X is the image and Y is the class label. Note that the guarantee in Eq. (1) is marginal over X and Y: it holds on average over examples, not for a particular image X.

A first approach toward this goal might be to assemble the set by including classes from highest to lowest probability (e.g., after Platt scaling and a softmax function; see Platt et al., 1999; Guo et al., 2017) until their cumulative sum just exceeds the threshold 1 − α. We call this strategy naive and formulate it precisely in Algorithm 1. There are two problems with naive: first, the probabilities output by CNNs are known to be incorrect (Nixon et al., 2019), so the sets from naive do not achieve coverage. Second, image classification models' tail probabilities are often badly miscalibrated, leading to large sets that do not faithfully articulate the uncertainty of the model; see Section 2.3. Moreover, smaller sets that achieve the same coverage level can be generated with other methods.

The coverage problem can be solved by picking a new threshold using holdout samples. For example, with α = 10%, if sets containing 93% of the estimated probability mass achieve 90% coverage on the holdout set, we use the 93% cutoff instead. We refer to this algorithm, introduced in Romano et al. (2020), as Adaptive Prediction Sets (APS). The APS procedure provides coverage but still produces large sets, inflated by the noisy probability estimates of unlikely classes. To fix this, we introduce a regularization technique that tempers the influence of these noisy estimates, leading to smaller, more stable sets.
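The naive set construction and the holdout recalibration step can be sketched as follows. This is an illustrative simplification with hypothetical function names; the actual APS and RAPS procedures in Algorithms 1–3 differ in details such as randomization:

```python
import numpy as np

def naive_set(probs, thresh):
    """Add classes from most to least likely until their cumulative
    (Platt-scaled) probability first reaches `thresh`."""
    order = np.argsort(probs)[::-1]               # classes sorted by probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, thresh)) + 1  # smallest covering prefix
    return order[:cutoff]

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Pick the cumulative-probability cutoff so that sets built with it
    cover the true label on at least a 1 - alpha fraction of a holdout set.
    For each holdout example, record the mass needed to include its label,
    then take a conservative (conformal) quantile of those scores."""
    scores = []
    for p, y in zip(cal_probs, cal_labels):
        order = np.argsort(p)[::-1]
        rank = int(np.where(order == y)[0][0])      # position of true label
        scores.append(np.cumsum(p[order])[rank])    # mass needed to cover it
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    # method="higher" requires numpy >= 1.22 (older: interpolation="higher")
    return float(np.quantile(scores, level, method="higher"))
```

With a perfectly calibrated model, `naive_set` at `thresh = 1 - alpha` would already cover; the recalibrated cutoff corrects for the model's actual miscalibration.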
We describe our proposed algorithm, Regularized Adaptive Prediction Sets (RAPS), in Algorithms 2 and 3 (with APS as a special case). As we will see in Section 2, both APS and RAPS are always guaranteed to satisfy Eq. (1), regardless of model and dataset. Furthermore, we show that RAPS is guaranteed to have better performance than choosing a fixed-size set. Both methods impose negligible computational requirements at both training and evaluation time, and output useful estimates of the model's uncertainty on a new image given, say, 1000 held-out examples.

In Section 3 we conduct the most extensive evaluation of conformal prediction in deep learning to date, on Imagenet and Imagenet-V2. We find that RAPS sets always have smaller average size than naive and APS sets. For example, using a ResNeXt-101, naive does not achieve coverage, while APS and RAPS achieve it almost exactly; however, at α = 10%, APS sets have an average size of 19, while RAPS sets have an average size of 2 (Figure 2 and Table 1). We provide an accompanying codebase that implements our method as a wrapper for any PyTorch classifier, along with code to exactly reproduce all of our experiments.
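To give a feel for the regularization, here is a sketch of a RAPS-style conformal score. It is a deterministic simplification that omits the randomization term used in the full algorithm, and the values of `lam` and `k_reg` are illustrative (in practice they are tuned on held-out data):

```python
import numpy as np

def raps_score(probs, label, lam=0.01, k_reg=5):
    """RAPS-style score (sketch): cumulative mass of the classes ranked
    at or above `label`, plus a penalty lam * (rank - k_reg)_+ that
    discourages reaching deep into the noisy tail of the ranking."""
    order = np.argsort(probs)[::-1]
    rank = int(np.where(order == label)[0][0]) + 1   # 1-indexed rank
    mass = float(np.cumsum(probs[order])[rank - 1])
    return mass + lam * max(0, rank - k_reg)

def raps_set(probs, tau, lam=0.01, k_reg=5):
    """All classes whose score stays within the calibrated cutoff tau."""
    return [y for y in range(len(probs))
            if raps_score(probs, y, lam, k_reg) <= tau]
```

Because the penalty grows with rank, low-probability classes must clear a higher bar to enter the set, which is what shrinks the large APS sets without sacrificing coverage.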

1.1. RELATED WORK

Reliably estimating predictive uncertainty for neural networks is an unsolved problem. Historically, the standard approach has been to train a Bayesian neural network to learn a distribution over network weights (Quinonero-Candela et al., 2005; MacKay, 1992; Neal, 2012; Kuleshov et al., 2018; Gal, 2016). This approach requires computational and algorithmic modifications; other approaches avoid these via ensembles (Lakshminarayanan et al., 2017; Jiang et al., 2018) or approximations of Bayesian inference (Riquelme et al., 2018; Sensoy et al., 2018). These methods also have major practical limitations; for example, ensembling requires training many copies of a neural network. Therefore, the most widely used strategy is ad hoc calibration of the softmax scores with Platt scaling (Platt et al., 1999; Guo et al., 2017; Nixon et al., 2019).

This work develops a method for uncertainty quantification based on conformal prediction. Originating in the online learning literature, conformal prediction is an approach for generating predictive sets that satisfy the coverage property in Eq. (1) (Vovk et al., 1999; 2005). We use a convenient data-splitting version, known as split conformal prediction, that enables conformal prediction methods to be applied on top of any pre-trained classifier.



Figure 1: Prediction set examples on Imagenet. We show three examples of the class fox squirrel and the 95% prediction sets generated by RAPS to illustrate how the size of the set changes as a function of the difficulty of a test-time image.

