VARIATIONAL DETERMINISTIC UNCERTAINTY QUANTIFICATION

Abstract

Building on recent advances in uncertainty quantification using a single deep deterministic model (DUQ), we introduce variational Deterministic Uncertainty Quantification (vDUQ). We overcome several shortcomings of DUQ by recasting it as a Gaussian process (GP) approximation. Our principled approximation is based on an inducing-point GP in combination with Deep Kernel Learning. This gives vDUQ rigorous probabilistic foundations and makes it applicable not only to classification but also to regression problems. We avoid uncertainty collapse away from the training data by regularizing the spectral norm of the deep feature extractor. Our method matches state-of-the-art (SotA) accuracy, 96.2% on CIFAR-10, while maintaining the speed of softmax models, and provides uncertainty estimates competitive with Deep Ensembles. We demonstrate our method on regression problems and by estimating uncertainty in causal inference for personalized medicine.

1. INTRODUCTION

Deploying machine learning algorithms as part of automated decision making systems, such as self-driving cars and medical diagnostics, requires implementing fail-safes. Whenever the model is presented with a novel or ambiguous situation, it would not be wise to simply trust its prediction. Instead, the system should try to obtain more information, or withhold or defer judgment. While significant progress has been made towards estimating predictive uncertainty reliably in deep learning (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017), no single method has been shown to work on large datasets in both classification and regression without significant computational overhead, such as multiple forward passes. We propose Variational Deterministic Uncertainty Quantification (vDUQ), a method for obtaining predictive uncertainty in deep learning for both classification and regression problems in only a single forward pass.

In previous work, van Amersfoort et al. (2020) show that combining a distance-aware decision function with a regularized feature extractor in the form of a deep RBF network leads to a model (DUQ) that matches a softmax model in accuracy, and is competitive with Deep Ensembles for uncertainty on large datasets. The feature extractor is regularized using a two-sided gradient penalty, which encourages the model to be sensitive to changes in the input, avoiding feature collapse, while encouraging generalization by controlling the Lipschitz constant. This model, however, has several limitations: for example, the uncertainty (a distance in feature space) cannot be interpreted probabilistically, and it is difficult to disentangle aleatoric and epistemic uncertainty. Additionally, the loss function and centroid update scheme are not principled and do not extend to regression tasks.
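To make the two-sided gradient penalty concrete, the sketch below evaluates it on a toy linear model, where the input gradient is available in closed form. The function name and setup are illustrative, not taken from the DUQ implementation; a real model would obtain the input-gradient norms via automatic differentiation.

```python
import numpy as np

def two_sided_gradient_penalty(grad_norms, target=1.0):
    """Penalize input-gradient norms that deviate from `target` in either
    direction: norms far below the target indicate feature collapse
    (insensitivity to the input), norms far above it break smoothness."""
    return np.mean((grad_norms - target) ** 2)

# Toy model: f(x) = w . x, so the input gradient is w for every x,
# and its norm is simply ||w||_2.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
X = rng.normal(size=(8, 5))                      # a batch of 8 inputs
grad_norms = np.full(len(X), np.linalg.norm(w))  # ||df/dx||_2 per input

penalty = two_sided_gradient_penalty(grad_norms)
# The penalty is zero exactly when every input-gradient norm equals 1,
# i.e. when the model is locally 1-Lipschitz but not degenerate.
```

Being two-sided is the key design choice: a one-sided penalty (only punishing large gradients) would still allow the feature extractor to collapse all inputs to the same representation.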
A probabilistic and principled alternative to deep RBF networks is Gaussian processes (GPs) in combination with Deep Kernel Learning (DKL) (Hinton & Salakhutdinov, 2008; Wilson et al., 2016b). DKL was introduced as a "best of both worlds" solution: apply a deep model to the training data and learn the GP in feature space, ideally obtaining the advantages of both models. In practice, however, DKL suffers from the same failure as deep RBF networks: the deep model is free to map out-of-distribution data close to the feature representation of the training data, removing the attractive properties of GPs with distance-sensitive kernels. Using insights from DUQ, we are able to mitigate this uncertainty collapse in DKL. In particular, we use direct spectral normalization (Gouk et al., 2018; Miyato et al., 2018) in a residual architecture: the normalization enforces smoothness, while the residual connections enforce sensitivity of the feature representation to changes in the input, obtaining a similar effect as the gradient penalty of DUQ. We use an inter-domain inducing point variational approximation of the GP predictive distribution (Lázaro-Gredilla & Figueiras-Vidal, 2009; Hensman et al., 2015), which places inducing points in feature space and therefore requires fewer inducing points than previous work (Wilson et al., 2016a). These two techniques combined speed up inference in the GP model and decouple its cost from the dataset size. We release our code (available at: anonymized-for-review) and hope that it will become a drop-in alternative for softmax models with improved uncertainty.

In Figure 1, we show how vDUQ and Deep Ensembles (Lakshminarayanan et al., 2017), the current state of the art for uncertainty quantification (Ovadia et al., 2019), perform on simple 1D regression. This task is particularly hard for deep networks, as shown in Foong et al. (2019). vDUQ shows the desired behavior of reverting to the prior away from the data, while the Deep Ensemble extrapolates arbitrarily and confidently.
In between the two sinusoids, the Deep Ensemble is certain, while vDUQ increases its uncertainty. In summary, our contributions are as follows:
• We improve the training of DKL models and, for the first time, match the accuracy and speed of training a deep network with a regular softmax output on standard vision benchmarks.
• We demonstrate excellent uncertainty quantification in classification, matching or exceeding the state of the art on CIFAR-10, including ensembling approaches.
• We show state-of-the-art performance on causal inference for personalized medicine, an exciting real-world application. This task requires calibrated uncertainty in regression to be able to defer treatment to an expert when uncertainty is high.
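The direct spectral normalization used in our feature extractor can be illustrated in isolation. The sketch below normalizes a single weight matrix by an estimate of its largest singular value, obtained with power iteration (the standard trick, since a full SVD per training step would be wasteful). This is a pure-NumPy illustration with illustrative names; deep learning frameworks ship ready-made versions, e.g. `torch.nn.utils.spectral_norm` in PyTorch.

```python
import numpy as np

def spectral_normalize(W, n_iters=100):
    """Divide W by an estimate of its spectral norm (largest singular
    value), so that the linear map x -> W x is approximately 1-Lipschitz."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    # Power iteration on W W^T / W^T W converges to the top singular pair.
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimated largest singular value
    return W / sigma

W = np.random.default_rng(1).normal(size=(4, 6))
W_sn = spectral_normalize(W)
# The largest singular value of W_sn is approximately 1.
```

Applied to every layer of a residual network, this bounds the Lipschitz constant of the whole feature extractor, while the skip connections keep the map sensitive to its input, which is exactly the smoothness/sensitivity combination described above.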

2. BACKGROUND

Gaussian Processes (GPs) provide an interpretable, explainable, and principled way to make predictions, and can work well even with little training data due to their use of Bayesian inference. In contrast to deep neural networks, GPs have high uncertainty away from the training data and on noisy inputs. There are, however, two main issues with the standard GP setup: poor performance on high dimensional inputs and poor computational scaling to large datasets. The poor performance on high dimensional inputs is due to the fact that most standard shift-invariant kernels are based on






Figure 1: We show results on a 1D regression dataset of a sinusoid curve. In green are the data points and in blue the prediction with uncertainty (two standard deviations). As expected, vDUQ reverts to the prior away from the training data, while the Deep Ensemble extrapolates arbitrarily and confidently. The uncertainty is the posterior variance for vDUQ, and the variance across the ensemble members' predictions for Deep Ensembles.
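The reverting-to-prior behavior in Figure 1 is a property of exact GP regression with a shift-invariant kernel, and can be reproduced on a toy problem. The sketch below is a minimal NumPy implementation of the exact GP posterior under a zero-mean RBF prior; the lengthscale and noise values are illustrative, not the ones used in our experiments.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    """Shift-invariant RBF kernel k(a, b) = exp(-(a - b)^2 / (2 l^2))."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * lengthscale ** 2))

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Exact GP posterior mean and variance under a zero-mean prior."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train)
    mean = K_star @ np.linalg.solve(K, y_train)
    # Prior variance is k(x, x) = 1; conditioning on data reduces it.
    var = 1.0 - np.diag(K_star @ np.linalg.solve(K, K_star.T))
    return mean, var

X = np.linspace(-2, 2, 20)        # training inputs on a sinusoid
y = np.sin(4 * X)
mean, var = gp_posterior(X, y, np.array([0.0, 10.0]))
# Near the data (x = 0) the posterior variance is small; far away
# (x = 10) the kernel vanishes and the variance reverts to the prior.
```

Because the kernel depends only on the distance between inputs, a test point far from all training data is uncorrelated with it, so the posterior equals the prior there. vDUQ inherits this property as long as the deep feature extractor preserves distances, which is exactly what the spectral-norm regularization enforces.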

