EMPIRICAL FREQUENTIST COVERAGE OF DEEP LEARNING UNCERTAINTY QUANTIFICATION PROCEDURES

Abstract

Uncertainty quantification for complex deep learning models is increasingly important as these techniques see growing use in high-stakes, real-world settings. Currently, the quality of a model's uncertainty is evaluated using point-prediction metrics, such as the negative log-likelihood or the Brier score, on held-out data. In this study, we provide the first large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks. We find that, in general, some methods do achieve desirable coverage properties on in-distribution samples, but that coverage is not maintained on out-of-distribution data. Our results demonstrate the failings of current uncertainty quantification techniques as dataset shift increases and establish coverage as an important metric in developing models for real-world applications.

1. INTRODUCTION

Predictive models based on deep learning have seen dramatic improvements in recent years (LeCun et al., 2015), which has led to widespread adoption in many areas. For critical, high-stakes domains such as medicine or self-driving cars, it is imperative that mechanisms are in place to ensure safe and reliable operation. Crucial to the notion of safe and reliable deep learning is the effective quantification and communication of predictive uncertainty to potential end-users of a system. Many approaches have recently been proposed that fall into two broad categories: ensembles and Bayesian methods. Ensembles (Lakshminarayanan et al., 2017) aggregate information from many individual models to provide a measure of uncertainty that reflects the ensemble's agreement about a given data point. Bayesian methods offer direct access to predictive uncertainty through the posterior predictive distribution, which combines prior knowledge with the observed data. Although conceptually elegant, calculating exact posteriors of even simple neural models is computationally intractable (Yao et al., 2019; Neal, 1996), and many approximations have been developed (Hernández-Lobato & Adams, 2015; Blundell et al., 2015; Graves, 2011; Pawlowski et al., 2017; Hernández-Lobato et al., 2016; Louizos & Welling, 2016; 2017). Though approximate Bayesian methods scale to modern-sized data and models, recent work has questioned the quality of the uncertainty they provide (Yao et al., 2019; Wenzel et al., 2020; Ovadia et al., 2019). Previous work assessing the quality of uncertainty estimates has focused on calibration metrics and scoring rules such as the negative log-likelihood (NLL), expected calibration error (ECE), and Brier score.
Here we provide a complementary perspective based on the notion of empirical coverage, a well-established concept in the statistical literature (Wasserman, 2013) that evaluates the quality of a predictive set or interval instead of a point prediction. Informally, coverage asks the question: if a model produces a predictive uncertainty interval, how often does that interval actually contain the observed value? Ideally, predictions on examples for which a model is uncertain would produce larger intervals and thus be more likely to cover the observed value. More formally, given features x_n ∈ R^d and a response y_n ∈ R, coverage is defined in terms of a set Ĉ_n(x) and a level α ∈ [0, 1]. The set Ĉ_n(x) is said to have coverage at the 1 − α level if, for all distributions P over R^d × R with (x, y) ∼ P, the following inequality holds:

P{y_n ∈ Ĉ_n(x_n)} ≥ 1 − α    (1)

The set Ĉ_n(x) can be constructed using a variety of procedures. For example, in the case of simple linear regression, a prediction interval for a new point x_{n+1} can be constructed using a simple, closed-form solution¹ such that (1) holds asymptotically. However, for more complicated models such as deep networks, closed-form solutions with coverage guarantees are unavailable, and constructing these intervals via the bootstrap (Efron, 1982) can be computationally infeasible or fail to provide the correct coverage (Chatterjee & Lahiri, 2011). Figure 1 provides a graphical depiction of coverage for two hypothetical regression models. A complementary metric to coverage is width, the size of the prediction interval or set. Width can provide a relative ranking of different methods: given two methods with the same level of coverage, we should prefer the method that provides intervals with smaller widths.
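As a concrete illustration (our own sketch, not code from the paper), empirical coverage and average width can be computed from any batch of interval predictions, whatever procedure produced the endpoints:

```python
import numpy as np

def empirical_coverage(y, lower, upper):
    """Fraction of observed values that fall inside their prediction intervals."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    return float(np.mean((y >= lower) & (y <= upper)))

def average_width(lower, upper):
    """Mean size of the prediction intervals."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))

# Toy example: intervals that cover 3 of 4 observations.
y = [1.0, 2.0, 3.0, 10.0]
lo = [0.5, 1.5, 2.5, 3.5]
hi = [1.5, 2.5, 3.5, 4.5]
cov = empirical_coverage(y, lo, hi)   # 3 of 4 covered -> 0.75
width = average_width(lo, hi)         # every interval has width 1.0
```

Averaging this indicator over a test set gives the estimate of P{y_n ∈ Ĉ_n(x_n)} used throughout the paper.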

Contributions:

In this study we investigate the empirical coverage properties of prediction intervals constructed from a catalog of popular uncertainty quantification techniques, such as ensembling, Monte Carlo dropout, Gaussian processes, and stochastic variational inference. We assess the coverage properties of these methods on nine regression tasks and two classification tasks, with and without dataset shift. These tasks let us make the following contributions:

• We introduce coverage and width as natural and interpretable metrics for evaluating predictive uncertainty.
• We provide a comprehensive set of coverage evaluations on a suite of popular uncertainty quantification techniques.
• We examine how dataset shift affects these coverage properties.

2. BACKGROUND AND RELATED WORK

Obtaining Predictive Uncertainty Estimates. Several lines of work focus on improving approximations of the posterior of a Bayesian neural network (Graves, 2011; Hernández-Lobato & Adams, 2015; Blundell et al., 2015; Hernández-Lobato et al., 2016; Louizos & Welling, 2016; Pawlowski et al., 2017; Louizos & Welling, 2017). Yao et al. (2019) provide a comparison of many of these methods and highlight issues with common metrics of comparison, such as test-set log-likelihood and RMSE. Good scores on these metrics often indicate that the model posterior happens to match the test data rather than the true posterior (Yao et al., 2019). Maddox et al. (2019) developed a technique to sample the approximate posterior from the first moment of SGD iterates. Wenzel et al. (2020) demonstrated that despite advances in these approximations, there are still outstanding challenges with Bayesian modeling for deep networks. Alternative methods that do not rely on estimating a posterior over the weights of a model can also be used to provide uncertainty estimates. Gal & Ghahramani (2016), for instance, demonstrated that Monte Carlo dropout is related to a variational approximation of the Bayesian posterior implied by the dropout procedure. Lakshminarayanan et al. (2017) used ensembles of several neural networks to obtain uncertainty estimates. Guo et al. (2017) established that temperature scaling provides well-calibrated predictions on an i.i.d. test set. More recently, van Amersfoort et al. (2020) showed that the distance from the centroids in an RBF neural network yields high-quality uncertainty estimates. Liu et al. (2020) also leveraged the notion of distance (in this case, the distance from test to train examples) to obtain uncertainty estimates with their Spectral-normalized Neural Gaussian Processes.

¹ A well-known result from the statistics literature (cf. chapter 13 of Wasserman (2013)) is that the interval is given by ŷ_{n+1} ± t_{n−2} s_y √(1/n + (x_{n+1} − x̄)² / ((n − 1) s_x²)), where ŷ_{n+1} is the predicted value, t_{n−2} is the 1 − α/2 critical value from a t-distribution with n − 2 degrees of freedom, x̄ is the mean of x in the training data, and s_y, s_x are the standard deviations of y and x, respectively.

Assessments of Uncertainty Properties under Dataset Shift. Ovadia et al. (2019) analyzed the effect of dataset shift on the accuracy and calibration of Bayesian deep learning methods. Their large-scale empirical study assessed these methods on standard datasets such as MNIST, CIFAR-10, and ImageNet, as well as other non-image datasets. Additionally, they used translations, rotations, and corruptions (Hendrycks & Gimpel, 2017) of these datasets to quantify performance under dataset shift. They found stochastic variational inference (SVI) to be promising on simpler datasets such as MNIST and CIFAR-10, but more difficult to train on larger datasets. Deep ensembles had the most robust response to dataset shift.
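The Monte Carlo dropout idea can be sketched numerically. The tiny one-hidden-layer network, weights, and helper names below are our own illustrative stand-ins, not the paper's models: dropout stays active at prediction time, and repeated stochastic forward passes yield an empirical predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W, b, p=0.5):
    """One stochastic forward pass: drop hidden units with probability p,
    rescaling survivors by 1/(1-p) (inverted dropout)."""
    h = np.maximum(x @ W, 0.0)               # ReLU hidden layer
    mask = rng.random(h.shape) > p           # keep each unit with prob 1-p
    h = h * mask / (1.0 - p)
    return h.sum(axis=-1) + b                # scalar output per sample

def mc_dropout_interval(x, W, b, n_samples=200, alpha=0.05):
    """Approximate 1-alpha prediction interval from repeated stochastic passes,
    discarding the top and bottom alpha/2 quantiles."""
    preds = np.stack([dropout_forward(x, W, b) for _ in range(n_samples)])
    lo = np.quantile(preds, alpha / 2, axis=0)
    hi = np.quantile(preds, 1 - alpha / 2, axis=0)
    return lo, hi

# Toy weights standing in for a trained network (hypothetical values).
W = rng.normal(size=(3, 8))
b = 0.1
x = rng.normal(size=(5, 3))
lo, hi = mc_dropout_interval(x, W, b)        # one interval per test sample
```

The spread of `preds` across passes is what supplies the uncertainty; a deterministic network would collapse the interval to a point.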

Theoretical Coverage Guarantees

The Bernstein-von Mises theorem connects Bayesian credible sets and frequentist confidence intervals. Under certain conditions, Bayesian credible sets of level α are asymptotically frequentist confidence sets of level α and thus have the same coverage properties. However, under model misspecification these coverage properties no longer hold (Kleijn & van der Vaart, 2012). Barber et al. (2019) explored the conditions under which conditional coverage guarantees can hold for arbitrary models (i.e. per-sample guarantees of the form P{y_n ∈ Ĉ_n(x) | x = x_n}). They show that even when these coverage properties are not required to hold for every possible distribution, there are provably no methods that can give such guarantees. By extension, no Bayesian deep learning method can provide conditional coverage guarantees.

3. METHODS

In both the regression and classification settings, we analyzed the coverage properties of the prediction intervals and sets of five different approximate Bayesian and non-Bayesian approaches for uncertainty quantification. These include dropout (Gal & Ghahramani, 2016; Srivastava et al., 2015), ensembles (Lakshminarayanan et al., 2017), stochastic variational inference (Blundell et al., 2015; Graves, 2011; Louizos & Welling, 2016; 2017; Wen et al., 2018), and last-layer approximations of SVI and dropout (Riquelme et al., 2019). Additionally, we considered prediction intervals from linear regression and the 95% credible interval of a Gaussian process with a squared exponential kernel as baselines in regression tasks. For classification, we also considered temperature scaling (Guo et al., 2017) and the softmax output of vanilla deep networks (Hendrycks & Gimpel, 2017).
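Of these baselines, temperature scaling is simple enough to sketch directly: a single scalar T, fit on validation data, divides the logits before the softmax and softens the predicted distribution. The function names and logits below are illustrative only:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temperature_scale(logits, T):
    """Divide logits by a temperature T (fit on a validation set) before softmax."""
    return softmax(np.asarray(logits, dtype=float) / T)

logits = [[4.0, 1.0, 0.0]]
p1 = temperature_scale(logits, T=1.0)   # unscaled probabilities
p2 = temperature_scale(logits, T=2.0)   # softer, less confident distribution
```

With T > 1 the top-class probability shrinks and the tail probabilities grow, which is exactly what enlarges the prediction sets measured in Section 3.2.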

3.1. REGRESSION METHODS AND METRICS

We evaluated the coverage properties of these methods on nine large real-world regression datasets used as a benchmark in Hernández-Lobato & Adams (2015) and later by Gal and Ghahramani (Gal & Ghahramani, 2016). We used the training, validation, and testing splits made publicly available by Gal and Ghahramani and performed nested cross-validation to find hyperparameters and evaluate coverage, defined as the fraction of 95% prediction intervals containing the true value in the test set. On the training sets, we ran 100 trials of a random search over the hyperparameter space of a multi-layer perceptron architecture trained with the Adam optimizer (Kingma & Ba, 2015) and selected hyperparameters based on RMSE on the validation set. Each approach required a slightly different procedure to obtain a 95% prediction interval. For an ensemble of neural networks, we trained N = 40 vanilla networks and used the 2.5% and 97.5% quantiles of their predictions as the boundaries of the prediction interval. For dropout and last-layer dropout, we made 200 predictions per sample and similarly discarded the top and bottom 2.5% quantiles. For SVI, last-layer SVI (LL SVI), and Gaussian processes, approximate posterior variances were available, which we used to calculate the prediction interval. We calculated 95% prediction intervals for linear regression using the closed-form solution. We then calculated two metrics:

• Coverage: A sample is considered covered if its true label is contained in the 95% prediction interval. We average over all samples in a test set to estimate the coverage of a method on that dataset.
• Width: The average, over the test set, of the ranges of the 95% prediction intervals.

Coverage measures how often the true label is in the prediction region, while width measures how specific that prediction region is. Ideally, we would have high coverage with low width on in-distribution data. As data becomes increasingly out of distribution, we would like coverage to remain high while width increases to indicate model uncertainty.
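The two interval constructions described above can be sketched as follows (our illustration; `preds` stands for the per-member predictions from an ensemble or from dropout passes, while `mu` and `var` stand for the approximate posterior moments used for SVI, LL SVI, and GPs):

```python
import numpy as np

def quantile_interval(preds, alpha=0.05):
    """Interval from empirical quantiles of per-member predictions
    (ensembles, MC dropout): drop the top and bottom alpha/2 mass."""
    preds = np.asarray(preds)
    return (np.quantile(preds, alpha / 2, axis=0),
            np.quantile(preds, 1 - alpha / 2, axis=0))

def gaussian_interval(mu, var, z=1.96):
    """Interval from an approximate predictive mean and variance
    (SVI, LL SVI, GPs); z = 1.96 gives a 95% interval under normality."""
    sd = np.sqrt(np.asarray(var, dtype=float))
    return np.asarray(mu, dtype=float) - z * sd, np.asarray(mu, dtype=float) + z * sd

# 40 ensemble members predicting one test point (synthetic values).
member_preds = np.linspace(0.0, 1.0, 40).reshape(40, 1)
lo_q, hi_q = quantile_interval(member_preds)
lo_g, hi_g = gaussian_interval(mu=0.5, var=0.04)   # sd = 0.2
```

Both routines feed the same downstream coverage and width computations, so the methods can be compared on equal footing.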

3.2. CLASSIFICATION METHODS AND METRICS

We calculate the prediction set of a model's output. Given α ∈ (0, 1), the 1 − α prediction set S for a sample x_i is the minimum-sized set of classes such that

Σ_{c ∈ S} p(y_c | x_i) ≥ 1 − α    (2)

This set consists of the top k_i probabilities such that 1 − α probability mass has been accumulated. We can then define:

• Coverage: For each data point, we calculate the 1 − α prediction set of the label probabilities; coverage is the fraction of these prediction sets that contain the true label.
• Width: The width of a prediction set is simply the number of labels in the set, |S|. We report the average width of prediction sets over a dataset in our figures.

Although both calibration (Guo et al., 2017) and coverage involve probabilities over a model's output, calibration considers only the most likely label and its corresponding probability, while coverage considers the top k_i probabilities. In the classification setting, coverage is more robust to label errors, as it does not penalize models for putting probability on similar classes.
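The prediction set in Equation (2) can be computed by sorting class probabilities and accumulating mass until 1 − α is reached. A minimal sketch (our own code, with synthetic probabilities):

```python
import numpy as np

def prediction_set(probs, alpha=0.05):
    """Smallest set of classes whose total probability is at least 1 - alpha:
    take classes in decreasing order of probability until the mass is reached."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                  # most to least likely class
    csum = np.cumsum(probs[order])
    k = int(np.searchsorted(csum, 1 - alpha)) + 1    # number of classes needed
    return set(order[:k].tolist())

def coverage_and_width(prob_matrix, labels, alpha=0.05):
    """Fraction of prediction sets containing the true label, and average |S|."""
    sets = [prediction_set(p, alpha) for p in prob_matrix]
    cov = np.mean([y in s for s, y in zip(sets, labels)])
    width = np.mean([len(s) for s in sets])
    return float(cov), float(width)

# One confident and one diffuse prediction over 4 classes (synthetic).
probs = [[0.97, 0.01, 0.01, 0.01],
         [0.40, 0.30, 0.20, 0.10]]
cov, width = coverage_and_width(probs, labels=[0, 3])
```

The confident row yields a singleton set while the diffuse row needs all four classes, illustrating how width tracks model uncertainty.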

4. RESULTS

4.1. REGRESSION

Table 1 shows the mean and standard error of coverage levels for the methods we evaluated. In the regression setting, we find high levels of coverage for linear regression, Gaussian processes, SVI, and LL SVI. Ensembles and dropout had lower levels of coverage, while LL Dropout had the lowest average coverage. The RMSE values of our models are comparable to those reported by Gal and Ghahramani (Gal & Ghahramani, 2016), though the intention was not to produce state-of-the-art results but merely to demonstrate that the models were trained in a reasonable manner.

4.2. MNIST

We begin by calculating coverage and width for predictions from Ovadia et al. (2019) on MNIST and shifted MNIST data. Ovadia et al. (2019) used a LeNet architecture; we refer to their manuscript for further implementation details. Figure 3 shows how coverage and width co-vary as dataset shift increases. We observe high coverage and low width for all models on training, validation, and non-shifted test data. The elevated width for SVI on these dataset splits indicates that its posterior predictions of label probabilities were the most diffuse among all models to begin with. In Figure 3, all seven models have at least 0.95 coverage at a 15 degree rotation shift. Most models do not see an appreciable increase in the average width of the 0.95 prediction set, except for SVI, whose average width jumps to over 2 at 15 degrees of rotation. As the amount of shift increases, coverage decreases across all methods in a comparable way. SVI maintains higher levels of coverage, but with a compensatory increase in width. Under translation, we observe the same coverage-width pattern in Figure 3 at the lowest level of shift, 2 pixels: all methods have at least 0.95 coverage, but only SVI has a distinct jump in the average width of its prediction set. As the amount of translation increases, the average width of the prediction set increases slightly and then plateaus for all methods but SVI. For this simple dataset, SVI outperforms the other models with regard to coverage and width; it is the only model whose average width corresponds to the amount of shift observed.

4.3. CIFAR-10

Next, we consider a more complex image dataset, CIFAR-10, for which Ovadia et al. (2019) trained 20-layer and 50-layer ResNets. Figure 4 shows that all seven models have high coverage levels over all translation shifts. Temperature scaling and ensembles, in particular, have at least 0.95 coverage for every translation. This high coverage comes with increases in width as shift increases: Figure 4 shows that temperature scaling has the highest average width across all models and shifts. All models show the same pattern of width increases, with peak average widths at a 16 pixel translation. Of the models that satisfy 0.95 coverage on all shifts, ensembles have lower width than temperature scaling. Under translation shifts on CIFAR-10, ensembles therefore perform best, given their high coverage and lower width.

Additionally, we consider the coverage properties of models on 16 different corruptions of CIFAR-10 from Hendrycks and Gimpel (Hendrycks & Gimpel, 2017). Figure 5 shows coverage vs. width over varying levels of shift intensity. Models with points dispersed farther to the right have higher widths for the same level of coverage; an ideal model would have a cluster of points above the 0.95 coverage line in the far left portion of each facet. For models with similar levels of coverage, the superior method has points farther to the left. Figure 5 demonstrates that at the lowest shift intensity, ensembles, dropout, temperature scaling, and SVI were generally able to provide high levels of coverage on most corruption types. However, as the intensity of the shift increases, coverage decreases. Up to the third intensity level, ensembles and dropout maintain at least 0.95 coverage in at least half of their 80 model-corruption evaluations. At higher levels of shift intensity, ensembles, dropout, and temperature scaling consistently have the highest levels of coverage. Although these higher-performing methods have similar levels of coverage, they have different widths. See Figure A1 for a further examination of the coverage and widths of these methods.

4.4. IMAGENET

Finally, we consider ImageNet-C. Figure 6 shows coverage vs. width plots similar to those in Figure 5. We find that over the 16 different corruptions at 5 intensity levels, ensembles, temperature scaling, and dropout had consistently higher levels of coverage. Unsurprisingly, Figure 6 shows that these methods have correspondingly higher widths. At the first three levels of corruption, ensembling has the lowest width of the top-performing methods (see Figure A2); at the highest two levels of corruption, however, dropout has lower width than ensembling. None of the methods increase their widths enough to maintain the 0.95 coverage levels seen on in-distribution test data as dataset shift increases.

5. DISCUSSION

We have provided the first comprehensive empirical study of the frequentist-style coverage properties of popular uncertainty quantification techniques for deep learning models. In regression tasks, Gaussian processes were the clear winner in terms of coverage across nearly all benchmarks, with smaller widths than linear regression, whose prediction intervals come with formal guarantees. SVI and LL SVI also had excellent coverage properties across most tasks, with tighter intervals than GPs and linear regression. In contrast, the methods based on ensembles and Monte Carlo dropout had significantly worse coverage due to their overly confident, tight prediction intervals. Another interesting finding is that despite higher levels of uncertainty (i.e. larger widths), SVI was also the most accurate model based on RMSE, as reported in Table 3. In the classification setting, all methods showed very high coverage in the i.i.d. setting (i.e. no dataset shift), as coverage reflects top-1 accuracy in this scenario. On MNIST data, SVI had the best performance, maintaining high levels of coverage under slight dataset shift and scaling the width of its prediction sets more appropriately than other methods as shift increased. On CIFAR-10 data, ensemble models were superior: they had the highest levels of coverage at the third of five intensity levels on CIFAR-10-C data, while having lower width than the next best method, temperature scaling. Dropout and SVI had slightly worse coverage, but lower widths as well. Last-layer dropout and last-layer SVI performed poorly, often having lower coverage than vanilla neural networks. In summary, we find that popular uncertainty quantification methods for deep learning models do not provide good coverage properties under moderate levels of dataset shift.
Although the widths of prediction regions do increase under increasing amounts of shift, these changes are not enough to maintain the levels of coverage seen on i.i.d. data. We conclude that the methods we evaluated for uncertainty quantification are likely insufficient for use in high-stakes, real-world applications where dataset shift is likely to occur.



Figure 1: An example of the coverage properties for two methods of uncertainty quantification. In this scenario, each model produces an uncertainty interval for each x i which attempts to cover the true y i , represented by the red points. Coverage is calculated as the fraction of true values contained in these regions, while the width of these regions is reported in terms of multiples of the standard deviation of the training set y i values.

Figure 2: An example of the corruptions in CIFAR-10-C. The 16 different corruptions have 5 discrete levels of shift, of which 3 are shown here. The same corruptions were applied to ImageNet to form the ImageNet-C dataset.

Figure 3: The effect of rotation and translation on coverage and width, respectively, for MNIST.

Figure 4: The effect of translation shifts on coverage and width in CIFAR-10 images. Coverage remains robust across all pixel shifts, while width increases.

Figure 5: The effect of corruption intensity on coverage levels vs. width in CIFAR-10-C. Each facet panel represents a different corruption level, while points are the coverage of a model on one of 16 corruptions. Each facet has 80 points per method, since 5 iterations were trained per method. For methods with points at the same coverage level, the superior method is to the left, as it has a lower width.

Figure 6: The effect of corruption intensity on coverage levels vs. width in ImageNet-C. Each facet panel represents a different corruption level, while points are the coverage of a model on one of 16 corruptions. Each facet has 16 points per method, as only 1 iteration was trained per method. For methods with equal coverage, the superior method is to the left, as it has a lower width.

For ImageNet, Ovadia et al. (2019) only analyzed model predictions on the corrupted images of ImageNet-C. Each of these transformations (rotation, translation, or any of the 16 corruptions) has multiple levels of shift. Rotations range from 15 to 180 degrees in 15 degree increments. Translations shift images every 2 and 4 pixels for MNIST and CIFAR-10, respectively. Corruptions have 5 increasing levels of intensity. Figure 2 shows the effects of the 16 corruptions in CIFAR-10-C at the first, third, and fifth levels of intensity.

Table 2 reports the average width of the 95% prediction interval in terms of standard deviations of the response variable. We see that higher coverage correlates with a higher average width.

Table 1: The average coverage of six methods across nine datasets, with the standard error over 20 cross-validation folds in parentheses.

Table 2: The average width of the posterior prediction interval of six methods across nine datasets, with the standard error over 20 cross-validation folds in parentheses. Width is reported in terms of standard deviations of the response variable in the training set.

Table 3: The average RMSE of six methods across nine datasets, with the standard error over 20 cross-validation folds in parentheses. These values are comparable to others reported in the literature for these benchmarks.



The mean coverage and widths on the test set of CIFAR-10, as well as the mean coverage and width averaged over 16 corruptions and 5 intensities.

The mean coverage and widths on the test set of ImageNet, as well as the mean coverage and width averaged over 16 corruptions and 5 intensities.

Code Availability

The code and data to reproduce our results will be made available after the anonymous review period.


