IMPROVING NEURAL NETWORK ACCURACY AND CALIBRATION UNDER DISTRIBUTIONAL SHIFT WITH PRIOR AUGMENTED DATA

Abstract

Neural networks have proven successful at learning from complex data distributions by acting as universal function approximators. However, they are often overconfident in their predictions, which leads to inaccurate and miscalibrated probabilistic outputs. The problem of overconfidence becomes especially apparent when the test-time data distribution differs from the one seen during training. We propose a solution to this problem by seeking out regions of arbitrary feature space where the model is unjustifiably overconfident, and conditionally raising the entropy of those predictions towards that of the Bayesian prior on the distribution of the labels. Our method results in a better calibrated network and is agnostic to the underlying model structure, so it can be applied to any neural network which produces a probability density as an output. We demonstrate the effectiveness of our method on both classification and regression problems by applying it to the training of recent state-of-the-art neural network models.

1. INTRODUCTION

While deep neural networks have achieved success on many diverse tasks due to their ability to learn highly expressive task-specific representations, they are known to be overconfident when presented with unseen inputs from unknown data distributions. Probabilistic models should perform well in terms of both accuracy and calibration. Accuracy measures how often the model's predictions agree with the labels in the dataset. Calibration measures the reliability of the uncertainty around a probabilistic output. For example, an event predicted with 10% probability should be the empirical outcome 10% of the time. The probability around rare but important outlier events needs to be trustworthy for mission-critical tasks such as autonomous driving. Bayesian neural networks (BNNs) and ensembling methods are popular ways to achieve a predictive distribution for both classification and regression models. Since Gal & Ghahramani (2015) showed that Monte Carlo Dropout acts as a Bayesian approximation, there have been numerous advances in modeling predictive uncertainty with BNNs. As laid out by Kendall & Gal (2017), models need to account for sources of both aleatoric and epistemic uncertainty. Epistemic uncertainty arises from uncertainty in knowledge or beliefs about a system. For parametric models such as BNNs, this presents as uncertainty in the parameters which are trained to encode knowledge about a data distribution. Aleatoric uncertainty arises from irreducible noise in the data. Correctly modeling both forms of uncertainty is essential in order to form accurate and calibrated predictions. Accuracy and calibration are negatively impacted when the data seen during deployment varies substantially from that seen during training. It has been shown that when test data has undergone a significant distributional shift from the training data, one can witness performance degradation across all models (Snoek et al., 2019). A recurring result from Snoek et al.
(2019) is that Deep Ensembles (Lakshminarayanan et al., 2017) show superior performance on shifted test data. Previous work has also shown that BNNs fail to accurately model epistemic uncertainty, as regions with sparse amounts of training data often lead to confident predictions even when evidence to justify such confidence is lacking (Sun et al., 2019). Bayesian non-parametric models such as Gaussian processes (GPs) also model epistemic uncertainty, but suffer from limited expressiveness and the need to specify a kernel a priori, which may not be feasible for distributions with unknown structure. In this work, we propose a new method for achieving accurate and calibrated models by providing generated samples from a variational distribution which augment the natural data, seeking out areas of feature space for which the model exhibits unjustifiably low amounts of uncertainty. For those regions, the model is encouraged to predict uncertainty closer to that of the Bayesian prior belief. Our method can be applied to any existing neural network model during training, in any arbitrary feature space, and results in improved accuracy and calibration on shifted test data. Our contributions in this work are as follows:

• We propose a new method of data augmentation, which we dub Prior Augmented Data (PAD), that seeks to generate samples in areas where the model has an unjustifiably low level of epistemic uncertainty.
• We introduce a method for creating OOD data for regression problems, which to our knowledge has not been proposed before.
• We experimentally validate our method on shifted data distributions for both regression and classification tasks, on which it significantly improves both the accuracy and the calibration of a number of state-of-the-art Bayesian neural network models.

2. BACKGROUND

For regression tasks, we denote features x ∈ R^d and labels y ∈ R which make up a dataset of i.i.d. samples D = {(x_n, y_n)}_{n=1}^N, with X := {x_n}_{n=1}^N. Let f_θ be a neural network parameterized by weights θ. Let the output of f_θ(x) be a probability density p_θ(y|x), either a Gaussian density N(μ, σ) for single-task regression or a categorical distribution in the case of multi-class classification. Let g_φ be a generative model which generates pseudo inputs for f_θ. We assume that both f_θ and g_φ are iteratively trained via mini-batch stochastic gradient descent, with updates to generic model parameters τ given by the update rule τ_{t+1} = τ_t − ∇_{τ_t} L, where L is a loss function which is differentiable w.r.t. the generic parameters τ. We refer to an out-of-distribution (OOD) or distributionally shifted dataset D̃ as one which is drawn from a different region of the distribution than the training dataset D. This distributional shift can occur naturally for multiple reasons, including a temporal modal shift, or an imbalance in training data which may come about when gathering data is more difficult or costly in particular regions of the input space.
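As a concrete illustration of the generic update rule above, a minimal Python sketch with a toy one-dimensional quadratic loss (the loss function and gradient scaling here are purely illustrative stand-ins, not the paper's objectives):

```python
def sgd_step(tau, grad):
    """Generic update rule: tau_{t+1} = tau_t - grad_L(tau_t)."""
    return tau - grad

def grad_L(tau):
    # gradient of the toy loss L(tau) = 0.05 * (tau - 3)^2
    return 0.1 * (tau - 3.0)

tau = 0.0
for _ in range(200):
    tau = sgd_step(tau, grad_L(tau))
# tau has converged close to the minimizer, 3.0
```

In practice the same rule is applied independently to θ (with loss L_θ) and to φ (with loss L_φ), with mini-batch gradients in place of the exact gradient used here.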

2.1. BAYESIAN NEURAL NETWORKS

BNNs are neural networks with a distribution over the weights that can model uncertainty in the weight space. In practice, this is often done by introducing a variational distribution and then minimizing the Kullback-Leibler divergence (Kullback, 1997) between the variational distribution and the true weight posterior. For a further discussion of this topic, we refer the reader to existing works (Kingma & Welling, 2013; Blundell et al., 2015; Gal & Ghahramani, 2015). During inference, BNNs make predictions by approximating the following integral with Monte Carlo samples from the variational distribution q(θ|D):

p(y|x, D) = ∫ p_θ(y|x) p(θ|D) dθ ≈ (1/S) Σ_{s=1}^{S} p_{θ_s}(y|x),   θ_1, ..., θ_S i.i.d. ∼ q(θ|D). (1)
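A minimal NumPy sketch of the Monte Carlo approximation in (1), using a toy one-parameter "network" y = θx with unit observation noise and a Gaussian stand-in for the variational posterior q(θ|D) (the model, posterior, and all names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_predictive(y, x, q_mean, q_std, S=5000):
    # theta_s ~ q(theta | D): draw S weight samples from the variational posterior
    thetas = rng.normal(q_mean, q_std, size=S)
    mus = thetas * x                                  # per-sample mean prediction
    dens = np.exp(-0.5 * (y - mus) ** 2) / np.sqrt(2.0 * np.pi)
    return dens.mean()                                # (1/S) * sum_s p_{theta_s}(y | x)
```

Averaging densities rather than parameters is what makes the predictive a mixture: the spread of the sampled θ_s widens the resulting distribution where the posterior is uncertain.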

2.2. MISPLACED CONFIDENCE

A problem arises when neural networks do not accurately model the true posterior over the weights p(θ|D) given in (2). Our conjecture is that a major factor contributing to generalization error in both likelihood and calibration is a failure to revert to the prior p(θ) for regions of the input space with insufficient evidence to warrant low-entropy predictions. Bayesian non-parametric models such as Gaussian processes (GPs) solve this through a kernel which makes pairwise comparisons between all datapoints. GPs come with the drawback of having to specify a kernel a priori, and are generally outperformed by deep neural networks (DNNs), which are known to be more expressive than a GP with a common RBF kernel.

p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ') p(θ') dθ'. (2)

The problem of misplaced confidence in neural networks was first studied by Guo et al. (2017), who showed that modern DNNs have poor correlation between their confidence and actual accuracy, and are often overconfident. Snoek et al. (2019) came to the general conclusion that Deep Ensembles (Lakshminarayanan et al., 2017) tend to be the best calibrated model on OOD test data. Since then, there have been a number of newly proposed BNN models such as SWAG (Maddox et al., 2019), Multi-SWAG (Wilson & Izmailov, 2020), and Rank One Bayesian Neural Networks (R1BNN) (Dusenberry et al., 2020), each of which utilizes a different strategy for modeling p(θ|D). To illustrate the problem of failing to revert to the prior in underspecified regions of input space, we provide a toy example in figure 1. The true function is given by y = (x + ε) + sin(4(x + ε)) + sin(13(x + ε)), where ε ∼ N(0, 0.03). We then sample 100 points in the range [0, 0.4] and 100 points in the range [0.8, 1.0], which leaves a gap in the training data.
One can observe in figure 1 how different neural network models tend to make confident predictions in regions where they have not observed any data, and exhibit unpredictable behavior around the outer boundaries of the dataset. Our method encourages a reversion to the prior in uncertain regions, similar to the behavior of a GP. In addition to covering the gap between the datapoints, our model exhibits more predictable behavior around the outer boundaries of the data when compared to other BNN models.
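The toy dataset described above can be sketched as follows (interpreting the 0.03 in N(0, 0.03) as the standard deviation of ε, which is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_sample(n_per_region=100):
    # y = (x+eps) + sin(4(x+eps)) + sin(13(x+eps)), eps ~ N(0, 0.03),
    # with inputs drawn only from [0, 0.4] and [0.8, 1.0] so that a gap
    # of unseen inputs remains in the middle of the range.
    x = np.concatenate([rng.uniform(0.0, 0.4, n_per_region),
                        rng.uniform(0.8, 1.0, n_per_region)])
    eps = rng.normal(0.0, 0.03, size=x.shape)
    xe = x + eps
    y = xe + np.sin(4 * xe) + np.sin(13 * xe)
    return x, y
```

The gap region (0.4, 0.8) contains no training inputs by construction, which is exactly where a well-behaved model should revert towards its prior.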

3. METHOD

To encourage a reversion towards the prior in uncertain regions, we learn to generate pseudo OOD data, which leads to a better calibrated model by raising the entropy of the OOD predictions. An important design choice that we make is for our OOD generator network g_φ(·) to take a dataset X as input to produce distributions of OOD data. The goal of the OOD generator is to fill the "gaps" between the training data, using all available current knowledge, namely the training data itself. Once trained, the model f_θ should predict more uncertainty for data generated from g_φ. Raising the entropy on uninformative noisy pseudo inputs may be a solution, but such inputs could also be too distant from the natural data, and therefore provide no useful information for f_θ to learn from. Ideally, we want realistic OOD data that are not too distant from the training data but are still predicted with more uncertainty by f_θ. To achieve this, we employ an adversarial training procedure similar to generative adversarial nets (Goodfellow et al., 2014a): we train g_φ to find where f_θ may be overconfident, and at the same time train f_θ to defend against this by predicting higher levels of uncertainty in those regions. In the next section, we explain the objectives of g_φ.

3.1. THE OUT-OF-DISTRIBUTION SAMPLE GENERATOR NETWORK g_φ

g_φ takes a dataset X = {x_n}_{n=1}^N and produces a distribution over an equally sized pseudo dataset X̃ = {x̃_n}_{n=1}^N. First, each x_n is encoded via a feedforward network g_enc(·) to construct a set of representations Z = {z_n = g_enc(x_n)}_{n=1}^N. Then, we pick a subset size K ∈ [1, ⌊N/2⌋] for each mini-batch, and for each n = 1, ..., N,

g_φ(X)_n = g_dec( (1/K) Σ_{m ∈ nn_K(n)} z_m ),

where nn_K(n) denotes the set of K-nearest neighbors of z_n and g_dec is another feedforward network. The distribution for the generated data X̃ is then defined as q_φ(X̃|X) = Π_{n=1}^N q_φ(x̃_n | g_φ(X)_n).
As stated previously, our goal is to generate pseudo-OOD data, but we cannot be sure where such data will arise. The only things we can be certain of are that 1) the training data exists, and 2) there exists some distribution which is OOD relative to the training data. Therefore, our only option is to learn directly from representations of the training data in order to find likely regions of OOD data.
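The generator's forward pass described above (encode, pool the K nearest latent neighbors, decode) can be sketched in NumPy; here the encoder and decoder are random linear maps standing in for the feedforward networks g_enc and g_dec, which is an illustrative simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_forward(X, K, W_enc, W_dec):
    Z = X @ W_enc                                      # z_n = g_enc(x_n)
    # pairwise squared distances between latent codes
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                       # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :K]                 # indices of the K nearest latents
    pooled = Z[nn].mean(axis=1)                        # (1/K) * sum_{m in nn_K(n)} z_m
    return pooled @ W_dec                              # mean of q_phi(x_tilde_n | g_phi(X)_n)
```

Averaging neighboring latents before decoding is what pushes the generated points towards the "gaps": the decoded mean lies between clusters of encoded training points rather than on top of any single one.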

3.2. TRAINING OBJECTIVES

Training g_φ. Given a batch of data X_B = {x_n}_{n=1}^B, we first construct the OOD distribution q_φ(X̃_B|X_B) via g_φ. The training loss for X_B is defined as

ℓ_φ(X_B) = (1/B) Σ_{n=1}^{B} E_{q_φ(x̃_n)}[ H[p_θ(y|x̃_n)] ]  (A)
         − (1/B) Σ_{n=1}^{B} H[q_φ(x̃_n)]  (B)
         + (1/(BK)) Σ_{n=1}^{B} Σ_{m ∈ nn_K(n)} E_{q_φ(x̃_n)}[ ‖x̃_n − x_m‖² ]  (C), (5)

where q_φ(x̃_n) := q_φ(x̃_n | g_φ(X_B)_n), H[p] := −∫ p(x) log p(x) dx is the entropy, and nn_K(n) in C is the set of K-nearest neighbors of x̃_n among X_B. The A term trains g_φ to fool f_θ by seeking regions where f_θ(x̃_n) makes low-entropy predictions, i.e., where it may be overconfident on OOD data. The B term encourages diversity of the generated samples by maximizing the entropy of q_φ(x̃_n). Without B, the generated samples would be prone to mode collapse and become homogeneous and uninformative. The final term, C, minimizes the average pairwise distance between the real data and the generated data conditioned on it, where x_m ∈ X_B. As the C term minimizes this distance, the generated data is likely to lie in the "gap" regions of the natural training data. Combined with the training loss for f_θ, which we describe next, we aim to train g_φ to produce x̃ that are not too distant from the training data but still differ enough to warrant an increase in uncertainty. We provide an ablation study on the effects of the terms A, B, and C in tables 4 and 5. Given the training data D, we optimize the expected loss L_φ := E_{X_B}[ℓ_φ(X_B)] over subsets of size B obtained from X. In practice, at each step, we sample a single mini-batch X_B to compute the gradient. Also, we choose q_φ(x̃_n) to be a reparameterizable distribution and approximate the expectation over x̃_n via a single sample x̃_n ∼ q_φ(x̃_n). For regression tasks, we found the constraint given by C in (5) too strong, given that the generation happens directly in the input space and the datasets are generally of low dimensionality.
For regression, we therefore only penalize the C term distance when it exceeds a boundary threshold of ‖1_d‖, where d is the dimensionality of the inputs.

Training f_θ. For f_θ, we minimize the expected loss

L_θ = E_{(x,y)}[−log p_θ(y|x)] + E_{X_B}[r_θ(X_B)], (6)

where the first term is the negative log-likelihood of the natural data, and the second term is a regularizer which encourages more uncertainty on OOD data,

r_θ(X_B) := (1/B) Σ_{n=1}^{B} E_{q_φ(x̃_n)}[ (1 − exp(−min_m ‖x̃_n − x_m‖²₂ / (2ℓ²))) KL[p_θ(y|x̃_n) ‖ p(y)] ], (7)

where ℓ > 0 is a lengthscale parameter. The role of the regularizer is to encourage predictions on the OOD data to be closer to the prior p(y). Note that the KL term decays towards zero as x̃ becomes closer to x. This gives f_θ the freedom to make confident predictions in regions where real data exists. We draw a mini-batch X_B from X, approximating r_θ(X_B) with a single sample X̃_B ∼ q_φ(X̃_B|X_B). The loss is then approximated by (8):

L_θ ≈ −(1/B) Σ_{n=1}^{B} log p_θ(y_n|x_n) + r_θ(X_B). (8)

In terms of added complexity, PAD optimizes one additional set of parameters φ and requires an alternating pattern of training akin to that of GANs (Goodfellow et al., 2014a). In our experiments, we utilize a shallow network with a single hidden layer for φ. For further information about how we implement the KL term in (6), we refer the reader to section 8 for details and algorithms.
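A minimal NumPy sketch of the loss terms above for one-dimensional Gaussian outputs, where entropies and KLs are closed form. The function names, the reduction of the A/B/C terms, and the use of a unit-variance predictive are illustrative simplifications of the paper's objectives:

```python
import numpy as np

def gaussian_entropy(sigma):
    # closed-form entropy of a 1-D Gaussian, H[N(mu, sigma^2)]
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

def kl_to_prior(mu):
    # KL[N(mu, 1) || N(0, 1)] = 0.5 * mu^2 (std fixed to 1, as in section 8)
    return 0.5 * mu ** 2

def generator_loss(pred_sigma, gen_sigma, x_gen, x_real_nn):
    A = gaussian_entropy(pred_sigma).mean()           # predictive entropy on x_tilde
    B = gaussian_entropy(gen_sigma).mean()            # entropy of q_phi (diversity)
    C = ((x_gen - x_real_nn) ** 2).sum(-1).mean()     # squared distance to neighbours
    return A - B + C                                  # minimized w.r.t. phi

def regularizer(mu_pred, x_gen, X_real, lengthscale):
    # distance-gated KL: ~0 near real data, approaches the full KL far away
    d2 = ((x_gen[:, None, :] - X_real[None, :, :]) ** 2).sum(-1).min(axis=1)
    w = 1.0 - np.exp(-d2 / (2.0 * lengthscale ** 2))
    return (w * kl_to_prior(mu_pred)).mean()
```

The gate w is what lets confident predictions survive near real data: when a generated point coincides with a training point, its KL penalty is multiplied by zero.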

4.1. EXPERIMENTAL SETUP AND DATASETS

For regression, we use UCI datasets (Dua & Graff, 2017) following Hernández-Lobato & Adams (2015). The base network is a multi-layer perceptron with two hidden layers of 50 units with ReLU activations. The generator uses a single hidden layer of 50 units for both the encoder and decoder, with a permutation-invariant pooling layer consisting of [mean(x), max(x)]. The encoding of X and generation of X̃ are done directly in the input space. We train the models for a total of 50 epochs on 10 random splits of train/test data. In order to create a shifted test set, we first run a spectral clustering algorithm to obtain 10 clusters on each dataset. We then randomly choose test clusters until we have a test set which is ≥ 20% of the total dataset size, and use the remaining clusters for training. We repeat this process 10 times for each dataset. A TSNE visualization of a dataset created this way is given in figure 2. We do this to ensure there is a significant shift between training and testing data. We report both negative log likelihood (NLL) and calibration error as proposed by Kuleshov et al. (2018), measured with 100 bins on the cumulative distribution function (CDF) of the density p_θ(y_i|x_i). We show calibration plots in figure 3, plotting the expected outcome frequency versus the empirical frequency. For baseline models, we tune hyperparameters with a randomly chosen 80/20 training/validation split. For PAD models, we do k-fold cross validation with K = 2, selecting half of the previously made clusters for each fold. For all models and datasets, we tune the hyperparameters for each individual train/test split. Classification experiments are done on the MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009) datasets. We use architectures consisting of 4 and 5 convolutional layers respectively, followed by 3 fully connected layers. We provide further information regarding the exact architectures in the appendix (section 8).
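The cluster-based shifted split described above can be sketched as follows, given precomputed cluster labels (the spectral clustering step itself, e.g. scikit-learn's SpectralClustering, is omitted; function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def shifted_split(cluster_labels, test_frac=0.2):
    # keep adding whole random clusters to the test set until it holds
    # at least test_frac of the data; remaining clusters form the train set
    n = len(cluster_labels)
    clusters = rng.permutation(np.unique(cluster_labels))
    test_clusters, test_size = [], 0
    for c in clusters:
        test_clusters.append(c)
        test_size += np.sum(cluster_labels == c)
        if test_size >= test_frac * n:
            break
    test_mask = np.isin(cluster_labels, test_clusters)
    return np.where(~test_mask)[0], np.where(test_mask)[0]
```

Because whole clusters move together, the test set occupies regions of feature space the training set never covers, unlike a uniform random split.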
Instead of clustering to create shifted test distributions, we use image corruptions as were used by Snoek et al. (2019) . As PAD works in any arbitrary feature space, we apply our method in the latent space between the last convolutional layer and the fully connected layers. We report classification accuracy, negative log likelihood, and expected calibration error (ECE) (Guo et al., 2017) over 5 runs for all models.
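For reference, the expected calibration error (ECE) of Guo et al. (2017) used for classification can be sketched as a confidence-binned accuracy gap (bin count and names are our choices):

```python
import numpy as np

def ece(confidences, correct, n_bins=15):
    # bin predictions by confidence; average |accuracy - confidence| per bin,
    # weighted by the fraction of samples falling in that bin
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / total * gap
    return err
```

A perfectly calibrated model scores 0: in every bin, the empirical accuracy matches the average stated confidence.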

4.2. BASELINES

We compare PAD against a number of Bayesian models and neural networks, including Gaussian Processes, Functional Variational Bayesian Neural Networks (FVBNN) (Sun et al., 2019), Monte Carlo Dropout (MC Drop) (Gal & Ghahramani, 2015), Deep Ensembles (DE) (Lakshminarayanan et al., 2017), SWAG (Maddox et al., 2019), Rank One Bayesian Neural Networks (R1BNN) (Dusenberry et al., 2020), and Depth Uncertainty Networks (DUN) (Antorán et al., 2020).

4.3. ANALYSIS

For regression, it can be seen in table 1 that PAD generally improves the likelihood in the scenarios where the shifted data distribution caused the baseline model to perform poorest. The underlined entries in table 1 are those for which the negative log likelihood differs by ≥ 1 on the log scale. It can be seen that the majority (8/9) of these increases in likelihood are achieved by PAD models. In terms of single experiments, PAD beat the baseline models 21/32 ≈ 66% of the time. For calibration, in table 2, the only difference is that the underlined entries are those which differ by ≥ 5. It can be seen that PAD models account for all cases where a large difference between methods is present. In terms of single experiments, PAD beat the baselines 22/32 ≈ 69% of the time. In figure 3 we show calibration curves for MC Drop and R1BNN models on each dataset (other models are included in figure 9). Baseline models are shown with a dotted line, while PAD models are shown with a solid line. It can be seen that on datasets where the baseline model is poorly calibrated, such as Housing and Concrete, PAD results in curves which are closer to the center line than the baseline models. On datasets where the baseline models tend to be well calibrated, such as Power and Kin8nm, PAD does not result in a large deviation from the baseline. We attribute this to the fact that we tune the lengthscale parameter in such a way that if no improvement can be made over the baseline model on the validation set, the lengthscale parameter and the second term in (6) will have little to no effect on the resulting model parameters. However, the opposite is also true, which leads to large improvements in negative log likelihood and calibration error in the worst case. It is interesting to note that PAD does not give uniform performance gains across all datasets.
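The regression calibration error of Kuleshov et al. (2018) used in these tables can be sketched as follows: compare expected quantile levels against the empirical frequency of CDF values F(y_i|x_i) falling below each level (the exact weighting and normalization are our simplifying assumptions):

```python
import numpy as np

def regression_calibration_error(cdf_values, n_bins=100):
    # cdf_values[i] = F(y_i | x_i), the model CDF evaluated at the true label.
    # For a calibrated model these values are uniform on [0, 1], so the
    # empirical frequency below each level p should be close to p itself.
    levels = np.linspace(0.0, 1.0, n_bins)
    empirical = np.array([(cdf_values <= p).mean() for p in levels])
    return ((empirical - levels) ** 2).sum()
```

This is the CDF-based analogue of the classification calibration curves in figure 3: the empirical curve of a calibrated model hugs the diagonal.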
To further investigate this, we used TSNE (Hinton & Roweis, 2002) to embed both the natural data X and the generated data X̃ for all datasets (figures 6 and 7). Datasets which naturally form a single dense cluster, such as Kin8nm and Power, are consistently the lowest performers when paired with PAD, where PAD shows the best performance in 1/8 (NLL) and 2/8 (calibration) experiments. For datasets which form clusters on manifolds, such as Naval and Yacht, PAD gives the best results. In order to understand the effect of each term in equation (5), we performed an ablation of each term on the MC PAD (MC Dropout + PAD) variant. It can be seen that as more terms are removed from the equation, the model performance degrades in both NLL and calibration. The effect is more pronounced for NLL, as the full equation contains the majority of the best performances.

5. RELATED WORK

Bayesian Methods for Deep Learning Functional Variational Bayesian Neural Networks (FVBNN) (Sun et al., 2019) use a functional prior from a fully trained GP (Rasmussen, 2003). This approach is somewhat similar to PAD, but PAD adversarially generates difficult samples for f_θ and does not require specifying a GP kernel and then fitting data to it with the limited expressivity of a GP. Maddox et al. (2019) recently proposed SWAG, which keeps a moving average of the weights and subsequently builds a multivariate Gaussian posterior p(θ|D) which can be sampled for inference. Deep Ensembles (Lakshminarayanan et al., 2017) simply train an ensemble of N independent models, and have become a strong standard baseline in the calibration literature. The resulting ensemble of networks is sampled on new inputs and combined in order to achieve a set of highly probable solutions from the posterior p(θ|D). While simple and effective, this can require excessive computation for large models and datasets. Dusenberry et al. (2020) proposed R1BNN, which uses a shared set of parameters θ along with multiple sets of rank-one parameters r_i and s_i, which combine with θ via an outer product r_i s_i^T to create a parameter-efficient ensemble. Other recent advances include gathering ensemble members by treating depth as a random variable and ensembling the latent features through a shared output layer (Antorán et al., 2020). Recently, there has been a resurgence of interest in distance-sensitive RBF networks (LeCun et al., 1998) for modeling predictive uncertainty (Liu et al., 2020; van Amersfoort et al., 2020) with both Bayesian and deterministic methods, highlighting the importance of distance in uncertainty modeling.
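For Gaussian-output regression, the way an ensemble like Deep Ensembles combines its members can be sketched via the standard mixture moments (this is the common moment-matching convention, which we assume here, not a detail taken from the cited papers):

```python
import numpy as np

def ensemble_combine(mus, sigmas):
    # mus, sigmas: shape (n_members, n_points); combine the per-member
    # Gaussians into one Gaussian via mixture mean and the law of total variance
    mu = mus.mean(axis=0)
    var = (sigmas ** 2 + mus ** 2).mean(axis=0) - mu ** 2
    return mu, np.sqrt(var)
```

Disagreement between member means inflates the combined variance, which is the mechanism by which ensembles express epistemic uncertainty on OOD inputs.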

Data Augmentation

The data augmentation literature is vast, and as such we will only mention some of the most relevant works here. Adversarial examples (Goodfellow et al., 2014b) utilize the gradient of the loss with respect to the input to create a slightly perturbed input which would increase the loss, and then train further with that input. The resulting network should be more robust to such perturbations. Zhang et al. (2017) proposed Mixup, which creates linear interpolations between natural inputs with the hope of creating a robust network with linear transitions between classes instead of a hard decision boundary. Subsequent work by Thulasidasan et al. (2019) showed that mixup training and the soft decision boundary have the effect of improving network calibration.

Set Encoding

Methods which operate as a function of sets have been an active topic in recent years. Deep Sets (Zaheer et al., 2017) first proposed a model f({x_1, ..., x_n}) → R^d which passes the input set through a feature extractor, before being aggregated with a permutation-invariant pooling function, and then decoded through a DNN which projects the set representation to the output space R^d. Lee et al. (2019) proposed to use attention layers to create the Set Transformer. Our method builds on the idea of set-based methods by using sets towards a new objective: identifying and generating data in underspecified regions where the model is likely overconfident.

6. CONCLUSION

We have proposed a new method for increasing the accuracy and calibration of Bayesian neural network models, which we named Prior Augmented Data (PAD). Our method works by encouraging a Bayesian reversion to the prior beliefs over the labels for inputs in previously unseen or sparse regions of the known data distribution. We have demonstrated through various experiments that our method achieves an improvement in likelihood and calibration under shifting data conditions, yielding large improvements in log likelihood under those conditions in which the baseline models tend to make the poorest predictions. One interesting direction of future work could be to investigate a principled way of doing cross validation when training a network with the objective of being robust to OOD data. The OOD data is by definition unknown, so selecting the best hyperparameter settings is not a trivial task. We performed k-fold cross validation using clusters in our regression experiments, but we feel that this topic deserves more study into possibly better solutions. Another possible avenue of future research would be to look into the effect that different sizes of X̃ may have on the resulting uncertainty predictions. It may be the case that specific cluster sizes and manifold geometries require more or less attention, and therefore some performance and efficiency gains could be had by giving these regions the proper amount of attention.

Algorithm 1: PAD for input space
for each mini-batch D_B = (X_B, y_B) do
    # optimize θ
    X̃_B ∼ q_φ(X̃_B|X_B)
    ŷ_n = f_θ(x_n) for n = 1, ..., B
    ỹ_n = f_θ(x̃_n) for n = 1, ..., B
    θ_{t+1} ← θ_t − ∇_θ L_θ(D_B)
    # optimize φ
    X̃_B ∼ q_φ(X̃_B|X_B)
    ỹ_n = f_θ(x̃_n) for n = 1, ..., B
    φ_{t+1} ← φ_t − ∇_φ L_φ

Algorithm 2: PAD for latent features
for each mini-batch D_B = (X_B, y_B) do
    # optimize θ
    ŷ_n = f_θ(x_n) for n = 1, ..., B
    ẑ_n = f_θ^enc(x_n) for n = 1, ..., B (convolutional encoder portion of f_θ)
    Z̃_B ∼ q_φ(Z̃_B|Z_B)
    ỹ_n = f_θ^head(z̃_n) for n = 1, ..., B (fully connected head of f_θ)
    θ_{t+1} ← θ_t − ∇_θ L_θ(D_B)
    # optimize φ
    ẑ_n = f_θ^enc(x_n) for n = 1, ..., B
    Z̃_B ∼ q_φ(Z̃_B|Z_B)
    ỹ_n = f_θ^head(z̃_n) for n = 1, ..., B
    φ_{t+1} ← φ_t − ∇_φ L_φ

Figure 5: Training steps for f_θ and g_φ. PAD can work in both the input space (left) and the latent space (right).
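The alternating update pattern of Algorithm 1 can be sketched structurally with two coupled scalar "players"; the quadratic losses, coupling strength, and learning rate below are toy stand-ins, chosen only to show the GAN-style alternation converging:

```python
def train(theta, phi, steps=500, lr=0.1):
    for _ in range(steps):
        # optimize theta: fit the data (toy target 2.0) while defending
        # against the generator phi, which acts as an adversarial probe
        grad_theta = (theta - 2.0) + 0.1 * (theta - phi)
        theta = theta - lr * grad_theta
        # optimize phi: chase theta to find where it may be overconfident
        grad_phi = phi - theta
        phi = phi - lr * grad_phi
    return theta, phi
```

As in the paper's training loop, neither player is trained to convergence before the other moves; each mini-batch takes one gradient step on θ, then one on φ.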

8.3. NEW YORK REAL ESTATE

To apply PAD to a real-world problem where the dataset shift is not synthetically created, we applied it to regression on New York real estate data. The dataset consists of 12,000 sales records spanning 12 years. Each instance has a total of 667 features, including real-valued and one-hot categorical features. We used the same base models as outlined in section 4. The regression label is the price that the house sold for. We train the model to predict log(y) to account for the log-normal distribution of prices and report results based on the log-transformed label. We used the years 2008-2009 as training/validation data and evaluated performance on all following years through 2019. It can be seen that with a real temporal distributional shift, PAD models exhibit strong performance. PAD models achieve the best negative log likelihood in 35/40 cases. Similar performance can be seen in terms of calibration error, where PAD models show the best calibration error in 35/40 cases.



Figure 2: Left: A random train/test split. The test data is randomly chosen from the whole distribution. Middle: Clusters resulting from spectral clustering. Right: Shifted data distributions chosen from clusters as a train/test split for our experiments on OOD data.

Figure 3: Calibration curves for each model on each dataset. Base models are shown with dotted lines. Our models (model + PAD) are shown with solid lines. GP's are included with a solid line for reference.

Figure 4: Model performance on varying degrees of shift intensity for MNIST-C and CIFAR10-C. 0 represents the original test set while 6 represents the most extreme level of shift. Models which are augmented with PAD show comparable performance on the natural test set. As shift intensity increases, PAD models exhibit superior performance in terms of negative log likelihood and calibration error. Other models are included in figure 10

Figure 6: TSNE embeddings for the first half of the UCI regression datasets used in our experiments. Both in-distribution data D and out-of-distribution data D̃ are shown. The color gradient of D̃ corresponds to different hyperparameter settings in equation (5).

Performance of different models on OOD test data. The top row contains baseline models, the bottom row contains baseline + PAD. It can be seen that PAD effectively increases the epistemic uncertainty in regions with sparse training data and exhibits a more predictable reversion to the prior, much like a GP (top right). Baseline BNN models in the top row show unpredictable behavior in OOD regions of input space.

Table 1: Negative log likelihood on UCI regression datasets. GPs and FVBNNs are included for reference. Bold entries contain the best result for the base model. Underlined entries are those with a large difference of ≥ 1 (log scale) between methods.

Table 2: Calibration error on UCI regression datasets. GPs and FVBNNs are included for reference. Bold entries contain the best result for the base model. Underlined entries are those with a large difference of ≥ 5 between methods.

Classification accuracy, NLL, and calibration error on OOD test sets for classification. Each row contains a baseline model, with Mixup and PAD variants. Left: Accuracy, Middle: ECE, Right: NLL

Equation 5 Ablation: NLL

Equation 5 Ablation: Calibration Error

Negative log likelihood on 10 years of NY real estate data. The timeline constitutes a real-world temporal dataset shift.

7. APPENDIX

8. IMPLEMENTATION DETAILS

8.1. KL DIVERGENCES

For regression tasks we use an analytic calculation of KL[p_θ(y_i|x_i) ‖ p(y)] by assuming a prior of N(0, 1). In practice, we only want to raise the uncertainty of the prediction while keeping the expressive generalization properties of DNNs, so we use the output of p_θ as the mean with the standard deviation set to 1. For classification tasks, the prior is a uniform categorical distribution. In practice, we found that directly raising the entropy of the output led to a degradation in accuracy, so we instead add another output parameter to the base model: instead of outputting C class logits, we output C + 1 and treat the extra output logit as log t, using it as a temperature scaling parameter σ(z/t), where σ is the softmax function. We then implement the KL divergence by setting a hyperparameter τ controlling the maximum temperature, and conditionally raising the temperature as a function of w, where w is the weight given by the exponential term before the KL divergence in (6). In this way, the minimum temperature of 1 is enforced when w = 0, and the maximum temperature of τ is enforced when w = 1. Importantly, we do not transform the logits by t when calculating the likelihood loss (first term in equation 8), because we found that this led to a larger generalization error in practice.
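The conditional temperature scaling described above can be sketched as follows. The exact interpolation function between the stated endpoints is not reproduced here, so we assume the simplest form consistent with them, t = 1 + w(τ − 1), which is our assumption rather than the paper's formula (`tau_max` stands for the paper's τ):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scaled_probs(logits, w, tau_max):
    # assumed linear interpolation: t = 1 when w = 0, t = tau_max when w = 1
    t = 1.0 + w * (tau_max - 1.0)
    return softmax(logits / t)
```

With w near 1 (far from real data), dividing the logits by a large temperature flattens the categorical output towards the uniform prior; with w = 0 the plain softmax is recovered.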

