EMPIRICAL FREQUENTIST COVERAGE OF DEEP LEARNING UNCERTAINTY QUANTIFICATION PROCEDURES

Abstract

Uncertainty quantification for complex deep learning models is increasingly important as these techniques see growing use in high-stakes, real-world settings. Currently, the quality of a model's uncertainty is evaluated using point-prediction metrics such as the negative log-likelihood or the Brier score on held-out data. In this study, we provide the first large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks. We find that, in general, some methods do achieve desirable coverage properties on in-distribution samples, but that coverage is not maintained on out-of-distribution data. Our results demonstrate the failings of current uncertainty quantification techniques as dataset shift increases and establish coverage as an important metric for developing models for real-world applications.

1. INTRODUCTION

Predictive models based on deep learning have seen dramatic improvement in recent years (LeCun et al., 2015), which has led to widespread adoption in many areas. For critical, high-stakes domains such as medicine or self-driving cars, it is imperative that mechanisms are in place to ensure safe and reliable operation. Crucial to the notion of safe and reliable deep learning is the effective quantification and communication of predictive uncertainty to potential end-users of a system. Many approaches have recently been proposed that fall into two broad categories: ensembles and Bayesian methods. Ensembles (Lakshminarayanan et al., 2017) aggregate information from many individual models to provide a measure of uncertainty that reflects the ensemble's agreement about a given data point. Bayesian methods offer direct access to predictive uncertainty through the posterior predictive distribution, which combines prior knowledge with the observed data. Although conceptually elegant, calculating exact posteriors of even simple neural models is computationally intractable (Yao et al., 2019; Neal, 1996), and many approximations have been developed (Hernández-Lobato & Adams, 2015; Blundell et al., 2015; Graves, 2011; Pawlowski et al., 2017; Hernández-Lobato et al., 2016; Louizos & Welling, 2016; 2017). Though approximate Bayesian methods scale to modern-sized data and models, recent work has questioned the quality of the uncertainty these approximations provide (Yao et al., 2019; Wenzel et al., 2020; Ovadia et al., 2019). Previous work assessing the quality of uncertainty estimates has focused on calibration metrics and scoring rules such as the negative log-likelihood (NLL), expected calibration error (ECE), and the Brier score.
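As a concrete illustration of the ensemble approach described above, disagreement among independently trained members can serve as an uncertainty proxy. The sketch below is a minimal, hypothetical example: the per-member predictions are made-up numbers, not outputs of any model from this paper.

```python
import numpy as np

# Hypothetical per-member predictions for a single regression input,
# from an ensemble of M = 5 independently trained models.
member_preds = np.array([2.1, 1.9, 2.0, 2.2, 1.8])

mean = member_preds.mean()       # ensemble point prediction
std = member_preds.std(ddof=1)   # member disagreement as an uncertainty proxy

print(f"prediction = {mean:.2f} +/- {std:.2f}")  # prediction = 2.00 +/- 0.16
```

Larger spread among members signals inputs the ensemble is less certain about, which in turn should translate into wider predictive intervals.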
Here we provide a complementary perspective based on the notion of empirical coverage, a well-established concept in the statistical literature (Wasserman, 2013) that evaluates the quality of a predictive set or interval instead of a point prediction. Informally, coverage asks the question: if a model produces a predictive uncertainty interval, how often does that interval actually contain the observed value? Ideally, predictions on examples for which a model is uncertain would produce larger intervals and thus be more likely to cover the observed value. More formally, given features x_n ∈ R^d and a response y_n ∈ R, coverage is defined in terms of a set Ĉ_n(x_n) and a level α ∈ [0, 1]. The set Ĉ_n(x_n) is said to have coverage at the 1 − α level if, for all distributions P over R^d × R with (x, y) ∼ P, the following inequality holds:

P{y_n ∈ Ĉ_n(x_n)} ≥ 1 − α.    (1)

