EMPIRICAL FREQUENTIST COVERAGE OF DEEP LEARNING UNCERTAINTY QUANTIFICATION PROCEDURES

Abstract

Uncertainty quantification for complex deep learning models is increasingly important as these techniques see growing use in high-stakes, real-world settings. Currently, the quality of a model's uncertainty is evaluated using point-prediction metrics, such as the negative log-likelihood or the Brier score, on held-out data. In this study, we provide the first large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks. We find that, in general, some methods do achieve desirable coverage properties on in-distribution samples, but that coverage is not maintained on out-of-distribution data. Our results demonstrate the failings of current uncertainty quantification techniques as dataset shift increases and establish coverage as an important metric in developing models for real-world applications.

1. INTRODUCTION

Predictive models based on deep learning have seen dramatic improvement in recent years (LeCun et al., 2015), which has led to widespread adoption in many areas. For critical, high-stakes domains such as medicine or self-driving cars, it is imperative that mechanisms are in place to ensure safe and reliable operation. Crucial to the notion of safe and reliable deep learning is the effective quantification and communication of predictive uncertainty to potential end-users of a system. Many approaches have recently been proposed that fall into two broad categories: ensembles and Bayesian methods. Ensembles (Lakshminarayanan et al., 2017) aggregate information from many individual models to provide a measure of uncertainty that reflects the ensemble's agreement about a given data point. Bayesian methods offer direct access to predictive uncertainty through the posterior predictive distribution, which combines prior knowledge with the observed data. Although conceptually elegant, calculating exact posteriors of even simple neural models is computationally intractable (Yao et al., 2019; Neal, 1996), and many approximations have been developed (Hernández-Lobato & Adams, 2015; Blundell et al., 2015; Graves, 2011; Pawlowski et al., 2017; Hernández-Lobato et al., 2016; Louizos & Welling, 2016; 2017). Though approximate Bayesian methods scale to modern-sized data and models, recent work has questioned the quality of the uncertainty provided by these approximations (Yao et al., 2019; Wenzel et al., 2020; Ovadia et al., 2019). Previous work assessing the quality of uncertainty estimates has focused on calibration metrics and scoring rules such as the negative log-likelihood (NLL), expected calibration error (ECE), and Brier score.
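To make the ensemble idea above concrete, the following toy sketch uses bootstrapped polynomial fits as stand-ins for independently trained networks; the data, ensemble size, and model family are all hypothetical choices for illustration, not the procedure of any cited work.

```python
import numpy as np

# Toy sketch of ensemble-based uncertainty: fit M models on
# bootstrap-resampled data, then use the spread of their predictions
# at a query point as a measure of disagreement.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = np.sin(x) + rng.normal(0.0, 0.1, size=50)  # noisy targets

preds = []
for _ in range(5):  # M = 5 ensemble members
    idx = rng.integers(0, len(x), len(x))        # bootstrap resample
    coeffs = np.polyfit(x[idx], y[idx], deg=5)   # one "model" per member
    preds.append(np.polyval(coeffs, 2.0))        # predict at x = 2.0

mean_pred = np.mean(preds)   # ensemble point prediction
uncertainty = np.std(preds)  # disagreement across members
```

A larger spread across members signals lower ensemble agreement, and hence higher predictive uncertainty, at that input.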
Here we provide a complementary perspective based on the notion of empirical coverage, a well-established concept in the statistical literature (Wasserman, 2013) that evaluates the quality of a predictive set or interval instead of a point prediction. Informally, coverage asks the question: if a model produces a predictive uncertainty interval, how often does that interval actually contain the observed value? Ideally, predictions on examples for which a model is uncertain would produce larger intervals and thus be more likely to cover the observed value. More formally, given features x_n ∈ R^d and a response y_n ∈ R, coverage is defined in terms of a set Ĉ_n(x) and a level α ∈ [0, 1]. The set Ĉ_n(x) is said to have coverage at the 1 − α level if, for all distributions P over R^d × R with (x, y) ∼ P, the following inequality holds:

P{y_n ∈ Ĉ_n(x_n)} ≥ 1 − α.    (1)

The set Ĉ_n(x) can be constructed using a variety of procedures. For example, in the case of simple linear regression, a prediction interval for a new point x_{n+1} can be constructed¹ using a simple, closed-form solution. Figure 1 provides a graphical depiction of coverage for two hypothetical regression models.

A complementary metric to coverage is width, the size of the prediction interval or set. Width provides a relative ranking of different methods: given two methods with the same level of coverage, we should prefer the one that produces intervals with smaller widths.

Contributions: In this study we investigate the empirical coverage properties of prediction intervals constructed from a catalog of popular uncertainty quantification techniques, such as ensembling, Monte Carlo dropout, Gaussian processes, and stochastic variational inference. We assess the coverage properties of these methods on nine regression tasks and two classification tasks, with and without dataset shift.
These tasks help us make the following contributions:

• We introduce coverage and width as natural and interpretable metrics for evaluating predictive uncertainty.
• We provide a comprehensive set of coverage evaluations on a suite of popular uncertainty quantification techniques.
• We examine how dataset shift affects these coverage properties.
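To make the two metrics concrete, here is a minimal sketch of how empirical coverage and average width might be computed from a set of prediction intervals; the function name and all interval values are hypothetical.

```python
import numpy as np

def empirical_coverage_and_width(y_true, lower, upper):
    """Return the fraction of observed values falling inside their
    prediction intervals, and the mean interval width."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    covered = (y_true >= lower) & (y_true <= upper)
    return covered.mean(), (upper - lower).mean()

# Hypothetical intervals for four test points at the 1 - alpha = 0.95 level.
y  = np.array([1.0, 2.5, 0.3, 4.0])
lo = np.array([0.5, 2.0, 0.5, 3.0])
hi = np.array([1.5, 3.0, 1.0, 5.0])
cov, width = empirical_coverage_and_width(y, lo, hi)
# cov = 0.75 (the third interval misses y = 0.3); width = 1.125
```

In this toy example the method falls short of its nominal 95% level, illustrating how empirical coverage can expose miscalibrated intervals.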

2. BACKGROUND AND RELATED WORK

Obtaining Predictive Uncertainty Estimates. Several lines of work focus on improving approximations of the posterior of a Bayesian neural network (Graves, 2011; Hernández-Lobato & Adams, 2015; Blundell et al., 2015; Hernández-Lobato et al., 2016; Louizos & Welling, 2016; Pawlowski et al., 2017; Louizos & Welling, 2017). Yao et al.



¹A well-known result from the statistics literature (cf. chapter 13 of Wasserman (2013)) is that the interval is given by ŷ_{n+1} ± t_{n−2} s_y √(1 + 1/n + (x_{n+1} − x̄)² / ((n − 1) s_x²)), where ŷ_{n+1} is the predicted value, t_{n−2} is the 1 − α/2 critical value from a t-distribution with n − 2 degrees of freedom, x̄ is the mean of x in the training data, and s_y, s_x are the standard deviations of y and x, respectively.

Such closed-form intervals can be constructed so that (1) holds asymptotically. However, for more complicated models such as deep learning, closed-form solutions with coverage guarantees are unavailable, and constructing these intervals via the bootstrap (Efron, 1982) can be computationally infeasible or fail to provide the correct coverage (Chatterjee & Lahiri, 2011).
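The closed-form prediction interval for simple linear regression described above can be sketched numerically as follows; the training data, the new point, and the hardcoded critical value t_{0.975, 8} ≈ 2.306 (for n = 10) are illustrative assumptions, not values from this paper.

```python
import numpy as np

# Sketch of the closed-form 95% prediction interval for simple
# linear regression, on made-up, nearly linear data.
x = np.arange(1.0, 11.0)                        # 10 training inputs
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.05, 0.3, -0.1,
                              0.2, -0.3, 0.1, 0.0, -0.15])

n = len(x)
slope, intercept = np.polyfit(x, y, 1)          # least-squares fit
x_new = 11.0
y_hat = slope * x_new + intercept               # point prediction

s_y, s_x = np.std(y, ddof=1), np.std(x, ddof=1)
t_crit = 2.306  # t_{0.975, n-2} for n = 10, from standard t tables
half_width = t_crit * s_y * np.sqrt(
    1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / ((n - 1) * s_x ** 2)
)
lower, upper = y_hat - half_width, y_hat + half_width
```

Here the interval widens as x_new moves away from the training mean x̄, reflecting greater extrapolation uncertainty; this is exactly the behavior that deep learning models lack a closed-form analogue for.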



Figure 1: An example of the coverage properties of two methods of uncertainty quantification. In this scenario, each model produces an uncertainty interval for each x_i that attempts to cover the true y_i, represented by the red points. Coverage is calculated as the fraction of true values contained in these regions, while the width of these regions is reported in multiples of the standard deviation of the training-set y_i values.

