IMPROVING NEURAL NETWORK ACCURACY AND CALIBRATION UNDER DISTRIBUTIONAL SHIFT WITH PRIOR AUGMENTED DATA

Abstract

Neural networks have proven successful at learning from complex data distributions by acting as universal function approximators. However, neural networks are often overconfident in their predictions, which leads to inaccurate and miscalibrated probabilistic predictions. The problem of overconfidence becomes especially apparent in cases where the test-time data distribution differs from that seen during training. We propose a solution to this problem by seeking out regions in arbitrary feature space where the model is unjustifiably overconfident, and conditionally raising the entropy of those predictions towards that of the Bayesian prior on the distribution of the labels. Our method results in a better-calibrated network and is agnostic to the underlying model structure, so it can be applied to any neural network which produces a probability density as an output. We demonstrate the effectiveness of our method and validate its performance on both classification and regression problems by applying it to the training of recent state-of-the-art neural network models.

1. INTRODUCTION

While deep neural networks have achieved success on many diverse tasks due to their ability to learn highly expressive task-specific representations, they are known to be overconfident when presented with unseen inputs from unknown data distributions. Probabilistic models should perform well in terms of both accuracy and calibration. Accuracy measures how often the model's predictions agree with the labels in the dataset. Calibration measures how trustworthy the uncertainty around a probabilistic output is: for example, an event predicted with 10% probability should be the empirical outcome 10% of the time. The probabilities assigned to rare but important outlier events must be trustworthy for mission-critical tasks such as autonomous driving.

Bayesian neural networks (BNNs) and ensembling methods are popular ways to obtain a predictive distribution for both classification and regression models. Since Gal & Ghahramani (2015) showed that Monte Carlo Dropout acts as a Bayesian approximation, there have been numerous advances in modeling predictive uncertainty with BNNs. As laid out by Kendall & Gal (2017), models need to account for sources of both aleatoric and epistemic uncertainty. Epistemic uncertainty arises from uncertainty in the knowledge or beliefs in a system; for parametric models such as BNNs, this presents as uncertainty in the parameters which are trained to encode knowledge about a data distribution. Aleatoric uncertainty arises from irreducible noise in the data. Correctly modeling both forms of uncertainty is essential for forming accurate and calibrated predictions.

Accuracy and calibration are negatively impacted when the data seen during deployment differs substantially from that seen during training. It has been shown that when test data has undergone a significant distributional shift from the training data, performance degrades across all models (Snoek et al., 2019). A recurring result from Snoek et al. (2019) is that Deep Ensembles (Lakshminarayanan et al., 2017) show superior performance on shifted test data. Previous work has also shown that BNNs fail to accurately model epistemic uncertainty, as regions with sparse training data often lead to confident predictions even when evidence to justify such confidence is lacking (Sun et al., 2019). Bayesian non-parametric models such as Gaussian processes (GPs) also model epistemic uncertainty, but suffer from limited expressiveness and the need to specify a kernel a priori, which may not be feasible for distributions with unknown structure.

In this work, we propose a new method for achieving accurate and calibrated models: generated samples from a variational distribution augment the natural data by seeking out areas of feature space for which the model exhibits unjustifiably low uncertainty. For those regions, the model is encouraged to predict uncertainty closer to that of the Bayesian prior belief. Our method can be applied to any existing neural network model during training, in any arbitrary feature space, and results in improved accuracy and calibration on shifted test data. Our contributions in this work are as follows:

• We propose a new method of data augmentation, which we dub Prior Augmented Data (PAD), that seeks to generate samples in areas where the model has an unjustifiably low level of epistemic uncertainty.
• We introduce a method for creating OOD data for regression problems, which to our knowledge has not been proposed before.
• We experimentally validate our method on shifted data distributions for both regression and classification tasks, on which it significantly improves both the accuracy and the calibration of a number of state-of-the-art Bayesian neural network models.
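The notion of calibration described above (an event predicted with 10% probability should be the empirical outcome 10% of the time) is commonly quantified with the expected calibration error (ECE). The following is a minimal illustrative sketch in plain Python; the function and the toy numbers are ours, not the paper's evaluation code:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    over the bins, weighted by the fraction of samples in each bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece

# Toy check: predictions at 95% confidence that are always right
# are slightly miscalibrated (underconfident by 0.05).
print(expected_calibration_error([0.95] * 4, [1, 1, 1, 1]))  # ≈ 0.05
```

An overconfident model shows the opposite pattern: high stated confidence with lower empirical accuracy inflates the per-bin gap, and hence the ECE.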

2. BACKGROUND

For regression tasks, we denote a set of features x ∈ R^d and labels y ∈ R which make up a dataset of i.i.d. samples D = {(x_n, y_n)}_{n=1}^N, with X := {x_n}_{n=1}^N. Let f_θ be a neural network which is parameterized by weights θ. Let the output of f_θ(x) be a probability density p_θ(y|x), which is either a Gaussian density N(μ, σ) for single-task regression or a categorical distribution in the case of multi-class classification. Let g_φ be a generative model which generates pseudo-inputs for f_θ. We assume that both f_θ and g_φ are iteratively trained via mini-batch stochastic gradient descent, with updates to generic model parameters τ given by the update rule τ_{t+1} = τ_t − ∇_{τ_t} L, where L is a loss function which is differentiable w.r.t. the generic parameters τ. We refer to an out-of-distribution (OOD) or distributionally shifted dataset D̃ as one which is drawn from a different region of the distribution than the training dataset D. This distributional shift can occur naturally for multiple reasons, including a temporal modal shift, or an imbalance in the training data which may come about when gathering data is more difficult or costly in particular regions of D.
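To make the regression setup concrete, here is a minimal stdlib-Python sketch of the Gaussian likelihood loss and the generic parameter update above. This is an illustration of the notation, not the paper's implementation; the explicit learning rate is our addition (the text absorbs it into the gradient term):

```python
import math

def gaussian_nll(y, mu, sigma):
    """Per-sample loss L: negative log of the density N(y; mu, sigma^2)
    that f_theta outputs for single-task regression."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def sgd_step(params, grads, lr=0.1):
    """Generic update tau_{t+1} = tau_t - grad(L), with the learning
    rate shown explicitly rather than absorbed into the gradient."""
    return [p - lr * g for p, g in zip(params, grads)]
```

Note that predicting a larger σ raises the entropy of the output density and bounds the penalty for a far-off mean, which is how such models express aleatoric uncertainty.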

2.1. BAYESIAN NEURAL NETWORKS

BNNs are neural networks with a distribution over the weights that can model uncertainty in the weight space. In practice, this is often done by introducing a variational distribution and then minimizing the Kullback-Leibler divergence (Kullback, 1997) between the variational distribution and the true weight posterior. For a further discussion of this topic, we refer the reader to existing works (Kingma & Welling, 2013; Blundell et al., 2015; Gal & Ghahramani, 2015). During inference, BNNs make predictions by approximating the posterior predictive integral in (1) with Monte Carlo samples from the variational distribution q(θ|D).
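The Monte Carlo approximation in (1) averages the network's predictive density over S weight samples drawn from q(θ|D). A hedged toy illustration for classification follows; the scalar "weight draw" standing in for q(θ|D) and the two-class model are purely for demonstration:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mc_predict(logit_fn, x, n_samples=500, seed=0):
    """p(y|x, D) ≈ (1/S) * sum_s p_{theta_s}(y|x), with theta_s drawn
    i.i.d. from a stand-in variational distribution q(theta|D) = N(0, 1)."""
    rng = random.Random(seed)
    mean_probs = None
    for _ in range(n_samples):
        theta = rng.gauss(0.0, 1.0)  # one Monte Carlo weight sample
        p = softmax(logit_fn(x, theta))
        mean_probs = p if mean_probs is None else [a + b for a, b in zip(mean_probs, p)]
    return [v / n_samples for v in mean_probs]

# Toy two-class model whose logits flip sign with the sampled weight:
# individual draws can be confident, but the MC average is near-uniform.
pred = mc_predict(lambda x, th: [x * th, -x * th], x=2.0)
```

Averaging the per-sample densities, rather than the logits, is what lets disagreement among weight samples raise the entropy of the final prediction.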

2.2. MISPLACED CONFIDENCE

A problem arises when neural networks do not accurately model the true posterior over the weights p(θ|D) given in (2). Our conjecture is that a major factor contributing to generalization error, in both likelihood and calibration, is a failure to revert to the prior p(θ) in regions of the input space with insufficient evidence to warrant low-entropy predictions. Bayesian non-parametric models such as Gaussian processes (GP) address this by using a kernel which makes pairwise comparisons between all datapoints. GPs come with the drawback of having to specify a kernel a priori and are



p(y|x, D) = ∫ p_θ(y|x) p(θ|D) dθ ≈ ∫ p_θ(y|x) q(θ|D) dθ ≈ (1/S) Σ_{s=1}^{S} p_{θ_s}(y|x),   θ_1, …, θ_S i.i.d. ∼ q(θ|D).   (1)

