AMORTIZED CONDITIONAL NORMALIZED MAXIMUM LIKELIHOOD

Abstract

While deep neural networks provide good performance for a range of challenging tasks, calibration and uncertainty estimation remain major challenges. In this paper, we propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable general-purpose approach for uncertainty estimation, calibration, and out-of-distribution robustness with deep networks. Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle, but is computationally intractable to evaluate exactly for all but the simplest of model classes. We propose to use approximate Bayesian inference techniques to produce a tractable approximation to the CNML distribution. Our approach can be combined with any approximate inference algorithm that provides tractable posterior densities over model parameters. We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.

1. INTRODUCTION

Current machine learning methods provide unprecedented accuracy across a range of domains, from computer vision to natural language processing. However, in many high-stakes applications, such as medical diagnosis or autonomous driving, rare mistakes can be extremely costly, and thus effective deployment of learned models requires not only high expected accuracy, but also a way to measure the certainty in a model's predictions in order to assess risk and allow the model to abstain from making decisions when there is low confidence in the prediction. While deep networks offer excellent prediction accuracy, they generally do not provide the means to accurately quantify their uncertainty. This is especially true on out-of-distribution inputs, where deep networks tend to make overconfident incorrect predictions (Ovadia et al., 2019) . In this paper, we tackle the problem of obtaining reliable uncertainty estimates under distribution shift. Most prior work approaches the problem of uncertainty estimation from the standpoint of Bayesian inference. By treating parameters as random variables with some prior distribution, Bayesian inference can compute posterior distributions that capture a notion of epistemic uncertainty and allow us to quantitatively reason about uncertainty in model predictions. However, computing accurate posterior distributions becomes intractable as we use very complex models like deep neural nets, and current approaches require highly approximate inference methods that fall short of the promise of full Bayesian modeling in practice. Bayesian methods also have a deep connection with the minimum description length (MDL) principle, a formalization of Occam's razor that recasts learning as performing efficient lossless data compression and has been widely used as a motivation for model selection techniques. Codes corresponding to maximum-a-posteriori estimators and Bayes marginal likelihoods have been commonly used within the MDL framework. 
However, other coding schemes have been proposed in MDL centered around achieving different notions of minimax optimality. Interpreting coding schemes as predictive distributions, such methods can directly inspire prediction strategies that give conservative predictions and, due to their minimax formulation, do not suffer from excessive overconfidence. One such predictive distribution is the conditional normalized maximum likelihood (CNML) model (Grünwald, 2007; Rissanen and Roos, 2007; Roos et al., 2008), also known as sequential NML or predictive NML (Fogel and Feder, 2018b). To make a prediction on a new input, CNML considers every possible label and tries to find the model that best explains that label for the query point together with the training set. It then uses the corresponding model to assign a probability to each candidate label and normalizes to obtain a valid probability distribution. Intuitively, instead of relying on a learned model to extrapolate from the training set to the new (potentially out-of-distribution) input, CNML can obtain more reasonable predictive distributions by asking "given the training data, which labels would make sense for this input?" While CNML provides compelling minimax regret guarantees, practical instantiations have been exceptionally difficult, because computing predictions for a test point requires retraining the model on the test point concatenated with the entire training set. With large models like deep neural networks, this can potentially require hours of training for every prediction. In this paper, we propose amortized CNML (ACNML), a tractable and practical algorithm for approximating CNML utilizing approximate Bayesian inference. ACNML avoids the need to optimize over large datasets during inference by using an approximate posterior in place of the training set.
We demonstrate that our proposed approach is substantially more feasible and computationally efficient than prior techniques for using CNML predictions with deep neural networks and compares favorably to a number of prior techniques for uncertainty estimation on out-of-distribution inputs.

2. MINIMUM DESCRIPTION LENGTH: BACKGROUND AND PRELIMINARIES

ACNML is motivated by the minimum description length (MDL) principle, which can be used to derive a connection between optimal codes and prediction. We begin with a review of the MDL principle and discuss the challenges in implementing minimax codes that motivate our method. For more comprehensive treatments of MDL, we refer readers to Grünwald (2007) and Rissanen (1989).

Minimum description length. The MDL principle states that any regularities in a dataset can be exploited to compress it, and hence learning is reformulated as losslessly transmitting the data with the fewest number of bits (Rissanen, 1989; Grünwald, 2007). Simplicity is thus formalized as the length of the resulting description. MDL was originally formulated in a generative setting where the goal is to code arbitrary data, and we will present a brief overview in this setting. We can translate the results to a supervised learning setting, which corresponds to transmitting the labels after assuming either a fixed coding scheme for the inputs or that the inputs are known beforehand. While MDL is typically described in terms of code lengths, in general, we can associate codes with probability distributions, with the code length of an object corresponding to the negative log-likelihood under that probability distribution (Cover and Thomas, 2006).

Normalized Maximum Likelihood. Let $\hat\theta(x_{1:n})$ denote the maximum likelihood estimator for a sequence of data $x_{1:n}$ over all $\theta \in \Theta$. For any $x_{1:n} \in \mathcal{X}^n$ and distribution $q$ over $\mathcal{X}^n$, we can define a regret relative to the model class $\Theta$ as

$$R(q, \Theta, x_{1:n}) \overset{\text{def}}{=} \log p_{\hat\theta(x_{1:n})}(x_{1:n}) - \log q(x_{1:n}). \tag{1}$$

This regret corresponds to the excess number of bits $q$ uses to encode $x_{1:n}$ compared to the best distribution in $\Theta$, denoted $\hat\theta(x_{1:n})$. We can then define the normalized maximum likelihood distribution (NML) with respect to $\Theta$ as

$$p_{\text{NML}}(x_{1:n}) = \frac{p_{\hat\theta(x_{1:n})}(x_{1:n})}{\sum_{\tilde{x}_{1:n} \in \mathcal{X}^n} p_{\hat\theta(\tilde{x}_{1:n})}(\tilde{x}_{1:n})} \tag{2}$$

when the denominator is finite.
The NML distribution can be shown to achieve minimax regret (Shtarkov, 1987; Rissanen, 1996):

$$p_{\text{NML}} = \arg\min_{q} \max_{x_{1:n} \in \mathcal{X}^n} R(q, \Theta, x_{1:n}). \tag{3}$$

This corresponds, in a sense, to an optimal coding scheme for sequences of known fixed length.

Conditional NML. Instead of making predictions across entire sequences at once, we can adapt NML to the setting where we make predictions about the next data point based on the previously seen data, resulting in conditional NML (CNML) (Rissanen and Roos, 2007; Grünwald, 2007; Fogel and Feder, 2018a). While several variations on CNML exist, we consider the following:

$$p_{\text{CNML}}(x_n \mid x_{1:n-1}) \propto p_{\hat\theta(x_{1:n})}(x_n).$$

For any fixed sequence $x_{1:n-1}$, $p_{\text{CNML}}$ solves the minimax regret problem

$$p_{\text{CNML}} = \arg\min_{q} \max_{x_n} \left[ \log p_{\hat\theta(x_{1:n})}(x_n) - \log q(x_n) \right],$$

where the inner maximization is only over the last data point $x_n$. We can extend this approach to the supervised classification setting, where our models represent conditional distributions $p_\theta(y \mid x)$. The CNML distribution, given a sequence of already seen datapoints $(x_{1:n-1}, y_{1:n-1})$ and the next input $x_n$, then takes the form

$$p_{\text{CNML}}(y_n \mid x_n; x_{1:n-1}, y_{1:n-1}) \propto p_{\hat\theta(y_{1:n} \mid x_{1:n})}(y_n \mid x_n),$$

and solves the minimax problem

$$p_{\text{CNML}} = \arg\min_{q} \max_{y_n} \left[ \log p_{\hat\theta(y_{1:n} \mid x_{1:n})}(y_n \mid x_n) - \log q(y_n) \right].$$

We see that this conditional distribution is amenable to our usual inductive learning procedure, where $(x_{1:n-1}, y_{1:n-1})$ is our training set, and we want to output a predictive distribution over labels $y_n$ for a new test input $x_n$. We use a 2D logistic regression example to illustrate CNML's conservative predictions, showing a heatmap of CNML probabilities in Figure 1. CNML provides uniform predictions on most of the input space away from the training samples. In Figure 2, we illustrate how CNML arrives at these predictions, showing the predictions for the parameters $\hat\theta_0$ and $\hat\theta_1$, corresponding to labeling the test point (shown in pink in Figure 2, left) with either the label 0 or 1.
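As a small worked illustration of the CNML definition above, the sketch below evaluates it for a Bernoulli model class, where the per-label maximum likelihood estimators have closed forms. The function name `bernoulli_cnml` and its arguments are hypothetical and chosen only for this example; the Bernoulli setting is an assumption made for simplicity and is not one of the paper's experiments.

```python
from fractions import Fraction

def bernoulli_cnml(ones: int, seen: int) -> Fraction:
    """CNML probability that the next bit is 1, given `ones` ones among `seen` bits."""
    n = seen + 1  # sequence length once the candidate bit is appended
    # If the next bit is 1, the MLE is (ones + 1) / n and assigns that bit this probability.
    score_one = Fraction(ones + 1, n)
    # If the next bit is 0, the MLE is ones / n and assigns that bit probability 1 - ones / n.
    score_zero = 1 - Fraction(ones, n)
    # Normalize the two per-label scores into a distribution.
    return score_one / (score_one + score_zero)

print(bernoulli_cnml(ones=0, seen=0))     # no data yet
print(bernoulli_cnml(ones=3, seen=3))     # three ones in a row
print(bernoulli_cnml(ones=30, seen=100))
```

In this special case the result simplifies to (ones + 1) / (seen + 2), so the prediction never reaches 0 or 1 no matter how one-sided the observed data are, which is the conservative behavior the minimax formulation is meant to provide.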
However, CNML may be too conservative when the model class $\Theta$ is very expressive. Naïvely applying CNML with large model classes can result in the per-label models fitting their labels for the query point arbitrarily well, such that CNML gives unhelpful uniform predictions even on inputs we would hope to reasonably extrapolate on. We see this in the 2D logistic regression example in Figure 1. Thus, the model class $\Theta$ would need to be restricted in some form, for example by considering only parameters within a certain distance from the training set solution as a hard constraint. Another approach for controlling the expressivity of the model class is to generalize CNML to use regularized estimators instead of maximum likelihood, resulting in normalized maximum a posteriori (NMAP) codes (Kakade et al., 2006). Instead of using maximum likelihood parameters, NMAP selects $\hat\theta$ to be the parameter that maximizes the sum of the data log-likelihood and a regularization term, or log-prior, over parameters. We can define slightly altered notions of regret using these MAP estimators in all the previous equations to obtain a conditional normalized maximum a posteriori (CNMAP) distribution instead. See Appendix D for completeness. Going back to the logistic regression example, we plot heatmaps of CNMAP predictions in Figure 3, adding different amounts of L2 regularization to the logistic regression weights. As we add more regularization, the model class becomes effectively less expressive, and the CNMAP predictions become less conservative.

Computational Costs of CNML. A major practical issue with actually utilizing CNML or CNMAP with neural networks is the prohibitive computational cost of computing the maximum likelihood estimators for each new input and label combination. To evaluate the distribution on a new test point, one must solve a nonconvex optimization problem for each possible label, with each problem involving the entire training dataset along with the new test point.
This direct evaluation of CNML therefore becomes computationally infeasible with large datasets and high-capacity models, and further requires that the model carry around the entire training set even when it is deployed. In settings where critical decisions must be made in real time, even running a single epoch of additional training would be infeasible. For this reason, NML-based methods have not gained much traction as a practical tool for improving the predictive performance of high-capacity models.
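To make the cost of direct evaluation concrete, the following is a minimal sketch of naive CNMAP for L2-regularized binary logistic regression. It assumes a tiny hand-made 2D dataset with a constant bias feature; the names `fit_map`, `naive_cnmap`, `X_train`, and `y_train`, the regularization strength, and the use of plain gradient ascent are all illustrative choices rather than the setup used in the paper.

```python
import numpy as np

def fit_map(X, y, l2, iters=5000):
    """MAP logistic regression: maximize sum_i log p(y_i|x_i) - (l2 / 2) * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    # The gradient is Lipschitz with constant at most lambda_max(X^T X) / 4 + l2,
    # so a step size of 1 / L keeps plain gradient ascent stable.
    step = 1.0 / (0.25 * np.linalg.eigvalsh(X.T @ X).max() + l2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta = theta + step * (X.T @ (y - p) - l2 * theta)
    return theta

def naive_cnmap(X, y, x_query, l2):
    """Refit on the training set plus (x_query, label) for each label, then normalize."""
    scores = []
    for label in (0.0, 1.0):
        X_aug = np.vstack([X, x_query])
        y_aug = np.append(y, label)
        theta = fit_map(X_aug, y_aug, l2)  # one full refit per candidate label
        p1 = 1.0 / (1.0 + np.exp(-x_query @ theta))
        scores.append(p1 if label == 1.0 else 1.0 - p1)
    scores = np.array(scores)
    return scores / scores.sum()

# Two small clusters; the last column is a constant bias feature.
X_train = np.array([[-1.0, -1.0, 1.0], [-1.5, -0.5, 1.0], [-0.5, -1.5, 1.0],
                    [1.0, 1.0, 1.0], [1.5, 0.5, 1.0], [0.5, 1.5, 1.0]])
y_train = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

probs_near = naive_cnmap(X_train, y_train, np.array([1.0, 1.0, 1.0]), l2=0.1)
probs_axis = naive_cnmap(X_train, y_train, np.array([1.0, -1.0, 1.0]), l2=0.1)
print(probs_near, probs_axis)
```

Every prediction triggers one optimization over the whole training set per candidate label, which is exactly the cost that becomes prohibitive with deep networks. The second query lies on the symmetry axis between the two clusters, so both labels are explained equally well and the prediction is uniform.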

3. AMORTIZED CNML

In this section, we derive our method, amortized conditional normalized maximum likelihood (ACNML). ACNML provides a tractable approximation for CNML and CNMAP via approximate Bayesian inference. Instead of directly computing maximum likelihood parameters over the query point and training set, our method uses an approximate posterior distribution over parameters to capture the necessary information about the training set, and thus reduces the maximization to only the single new point. The computational cost at test-time therefore does not increase with training set size. We specialize our notation to the supervised learning setting, where our aim is to obtain a predictive distribution p(y n |x n ) after observing a training set (x 1:n-1 , y 1:n-1 ) and a test input x n .

3.1. ALGORITHM DERIVATION

Incorporating an exact posterior into CNML. Given a prior distribution $p(\theta)$, the Bayesian posterior density conditioned on the training data is given by $p(\theta \mid x_{1:n-1}, y_{1:n-1}) \propto p_\theta(y_{1:n-1} \mid x_{1:n-1})\, p(\theta)$. The CNMAP parameters $\hat\theta_y$ for a candidate label $y$ maximize $\log p_\theta(y \mid x_n) + \log p_\theta(y_{1:n-1} \mid x_{1:n-1}) + \log p(\theta)$, and the last two terms equal $\log p(\theta \mid x_{1:n-1}, y_{1:n-1})$ up to a constant that does not depend on $\theta$. We can thus replace the training data log-likelihood $\log p_\theta(y_{1:n-1} \mid x_{1:n-1})$, together with the log-prior, with the Bayesian posterior log-density $\log p(\theta \mid x_{1:n-1}, y_{1:n-1})$ when computing $\hat\theta_y$. We can also recover CNML as a special case of CNMAP by using a uniform prior, but as discussed previously, CNML with highly expressive model classes can lead to overly conservative predictions, so we opt to use non-uniform priors that help control model complexity instead. For example, with deep neural networks, we may elect to use a zero-mean Gaussian prior $p(\theta)$ on the network weights, corresponding to L2 regularization.

ACNML with an approximate posterior. Of course, the exact Bayesian posterior density is no easier to compute than the original training log-likelihood. However, we can derive a tractable approximation by replacing the exact posterior $p(\theta \mid x_{1:n-1}, y_{1:n-1})$ with an approximate posterior $q(\theta)$ instead. We can obtain an approximate posterior via standard approximate Bayesian techniques such as variational inference or Laplace approximations. We focus on Gaussian posterior approximations for computational efficiency, and discuss in Section 3.2 why this class of distributions provides a reasonable approximation. For practical purposes, we expect the approximate posterior log-density to ensure that the optimal $\hat\theta_y$ selected for each label retains good performance on the training set. By replacing the likelihood over the training data with the density under an approximate posterior, it becomes unnecessary to retain the training data at test time; only the parameters of the approximate distribution are needed. Optimization also becomes much simpler, as it no longer requires stochastic gradients, and the Gaussian posterior log-density $\log q(\theta)$ can serve as a strong convex regularizer.
ACNML algorithm summary. A summary of the ACNML algorithm is presented in Algorithm 1. The training process for obtaining $q(\theta)$ only needs to be performed once on the training set, whereas the inference step is performed for each test point. However, this inference step only requires optimizing the model parameters on a single data point, with the regularizer provided by $\log q(\theta)$.
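The inference step described above can be sketched as follows for a linear softmax classifier with a diagonal Gaussian approximate posterior. This is an illustrative sketch under stated assumptions, not a reproduction of Algorithm 1: the model class, the function names `softmax` and `acnml_predict`, the example posterior values, and the default of 5 covariance-preconditioned ascent steps with step size 0.5 are choices made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def acnml_predict(x, mean, var, steps=5, step_size=0.5):
    """ACNML for a linear softmax classifier with diagonal Gaussian q(theta).

    mean, var: arrays of shape (num_classes, dim) holding the posterior mean
    and per-parameter variances; x: query input of shape (dim,).
    """
    num_classes = mean.shape[0]
    scores = np.empty(num_classes)
    for label in range(num_classes):
        W = mean.copy()  # start each per-label optimization at the posterior mean
        onehot = np.eye(num_classes)[label]
        for _ in range(steps):
            grad_lik = np.outer(onehot - softmax(W @ x), x)  # gradient of log p_W(label | x)
            grad_post = -(W - mean) / var                     # gradient of log q(W)
            # Gradient ascent preconditioned by the (diagonal) posterior covariance.
            W = W + step_size * var * (grad_lik + grad_post)
        scores[label] = softmax(W @ x)[label]  # probability the tuned model gives its own label
    return scores / scores.sum()               # normalize across labels

mean = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
var = np.full_like(mean, 0.5)
x = np.array([1.0, 0.2])
print(acnml_predict(x, mean, var))
print(softmax(mean @ x))  # plug-in prediction at the posterior mean, for comparison
```

Note that no training data appear anywhere in the inference step; the training set enters only through `mean` and `var`. When the posterior variance shrinks toward zero, the per-label parameters cannot move and the output collapses to the plug-in prediction at the posterior mean.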

3.2. ANALYSIS OF GAUSSIAN APPROXIMATIONS IN ACNML

In this section, we argue that Gaussian approximate posteriors in ACNML, which correspond to second-order approximations to the training set log-likelihood, suffice for accurately computing the CNML distributions when the training set is large. The intuition is that for large training sets, the combined likelihoods of all the training points dominate over the single new test point, so the perturbed MLEs $\hat\theta_y$ remain close to the original training set MLE $\hat\theta$, letting us rely on local approximations to the training loss. Under some simplifying assumptions, we can formalize this argument using the concept of influence functions, which measure how maximum likelihood parameters (and more general M-estimators) for a dataset would change if the dataset were perturbed by reweighting inputs by an infinitesimal amount. We recall that the maximum likelihood estimator for a dataset with $n$ datapoints $(x_{1:n}, y_{1:n})$ is given by

$$\hat\theta = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i).$$

Influence functions analyze how $\hat\theta$ relates to the MLE of a perturbed dataset,

$$\hat\theta_{x,y,\epsilon} = \arg\max_{\theta} \left[ \epsilon \log p_\theta(y \mid x) + \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) \right],$$

where $\hat\theta_{x,y,\epsilon}$ is the new MLE if we perturb the training set by adding a datapoint $(x, y)$ with a weight $\epsilon$. A classical result (Cook and Weisberg, 1982) shows that $\hat\theta_{x,y,\epsilon}$ is differentiable (under appropriate regularity conditions) with respect to $\epsilon$, with derivative given by the influence function

$$\left. \frac{d\hat\theta_{x,y,\epsilon}}{d\epsilon} \right|_{\epsilon=0} = -H_{\hat\theta}^{-1} \nabla_\theta \log p_{\hat\theta}(y \mid x),$$

where $\hat\theta$ is the MLE for the original dataset and $H_{\hat\theta}$ is the Hessian of the mean training set log-likelihood evaluated at $\hat\theta$. CNML computes the MLE after adding the datapoint $(x, y)$ with equal weight to points in the training set, which is precisely given by $\hat\theta_{x,y,\epsilon}$ evaluated at $\epsilon = 1/n$.
Thus, for sufficiently large $n$, a first-order Taylor expansion around $\hat\theta$ should be accurate, and the new parameter can be estimated by

$$\tilde\theta_{x,y} = \hat\theta - \frac{1}{n} H_{\hat\theta}^{-1} \nabla_\theta \log p_{\hat\theta}(y \mid x),$$

which is equivalent to solving

$$\tilde\theta_{x,y} = \arg\max_{\theta} \left[ \frac{1}{n} (\theta - \hat\theta)^\top \nabla_\theta \log p_{\hat\theta}(y \mid x) + \frac{1}{2} (\theta - \hat\theta)^\top H_{\hat\theta} (\theta - \hat\theta) \right].$$

This suggests that, with large training datasets, the perturbed MLE parameters $\hat\theta_y$ in Equation 9 can be approximated accurately using a quadratic approximation to the training log-likelihood, corresponding to a Gaussian posterior obtained via a Laplace approximation. We can explicitly quantify the accuracy of this approximation in the theorem below, which is based on Theorem 1 from Giordano et al. (2019), with full details and proof in Appendix E.

Theorem 3.1. (Adapted from Giordano et al. (2019)) Consider a training set with $n$ datapoints and an additional datapoint $(x, y)$. Assume Assumptions 1-5 hold with constants $C_{\text{op}}$, $C_{\text{IJ}}$, $\Delta_\delta$ as defined in Appendix E. Let $\hat\theta_{x,y}$ denote the exact MLE if we had appended $(x, y)$ to the training set, and $\tilde\theta_{x,y}$ the parameter obtained via the approximation in Equation 13. Let

$$\delta = \frac{1}{n+2} \max\left\{ \sup_{\theta \in \Theta} \left\| \nabla_\theta \log p_\theta(y \mid x) \right\|_1,\; \sup_{\theta \in \Theta} \left\| \nabla_\theta^2 \log p_\theta(y \mid x) \right\|_1 \right\}.$$

If $\delta \le \Delta_\delta$, then

$$\left\| \tilde\theta_{x,y} - \hat\theta_{x,y} \right\|_2 \le 2 C_{\text{op}}^2 C_{\text{IJ}} \delta^2.$$

Given a bound on how accurately we estimate the new parameters for CNML, we can also explicitly quantify the accuracy of the resulting normalized distributions, with proof in Appendix E.

Proposition 3.2. Suppose $y \in \mathcal{Y}$ with $|\mathcal{Y}| = k$ (classification with $k$ classes). Let $\hat\theta_{x,y}$ be the exact MLE after appending the datapoint $(x, y)$ to the training set, and let $\tilde\theta_{x,y}$ be an approximate MLE with $\|\tilde\theta_{x,y} - \hat\theta_{x,y}\| \le \delta$ for each $y$. Further suppose $\log p_\theta(y \mid x)$ is $L$-Lipschitz in $\theta$. Denote the exact CNML distribution for the fixed input $x$ by $p_{\text{CNML}}(y) \propto p_{\hat\theta_{x,y}}(y \mid x)$ and an approximate CNML distribution by $p_{\text{ACNML}}(y) \propto p_{\tilde\theta_{x,y}}(y \mid x)$. We then have

$$\sup_{y} \left| \log p_{\text{CNML}}(y) - \log p_{\text{ACNML}}(y) \right| \le 2L\delta.$$
Theorem 3.1 and Proposition 3.2 together suggest that the approximation produced by ACNML will be increasingly close to the exact CNML distribution as the training set size $n$ grows. However, this formal theoretical result only holds for sufficiently large datasets and under strong simplifying assumptions, including smoothness and strong convexity of the training loss, so it does not necessarily hold in practical settings with deep neural networks. In the context of interpreting how different data points influence the predictions of neural networks, Koh and Liang showed that influence function approximations were able to provide useful predictions for estimating leave-one-out retraining with deep convolutional neural networks. This closely resembles the conditions we encounter when computing parameters for each label of the query point with ACNML, with the key difference being that ACNML adds a datapoint while leave-one-out retraining removes one. This suggests that second-order approximations to the training loss, corresponding to Gaussian approximations in ACNML, may suffice to yield useful predictions about how parameters change when the query point is added, despite lacking formal guarantees with deep neural networks.
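The first-order influence approximation discussed in this section can be checked by hand in a one-parameter example. The sketch below assumes a Bernoulli model with a logit parameter, $p_\theta(1) = \sigma(\theta)$, for which both the exact perturbed MLE and the influence estimate have closed forms; the function names `logit` and `influence_vs_exact` and the specific counts are hypothetical and not taken from the paper.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def influence_vs_exact(k, n, y):
    """Bernoulli model p_theta(1) = sigmoid(theta), with k ones among n training bits."""
    p_hat = k / n
    theta_hat = logit(p_hat)                      # MLE on the training set
    grad = y - p_hat                              # d/dtheta log p_theta(y) at theta_hat
    hess = -p_hat * (1.0 - p_hat)                 # Hessian of the mean training log-likelihood
    theta_approx = theta_hat - grad / (n * hess)  # theta_hat - (1/n) H^{-1} grad
    theta_exact = logit((k + y) / (n + 1))        # MLE after appending the bit y
    return theta_hat, theta_approx, theta_exact

for n in (100, 1000):
    theta_hat, theta_approx, theta_exact = influence_vs_exact(k=3 * n // 10, n=n, y=1)
    print(n, abs(theta_hat - theta_exact), abs(theta_approx - theta_exact))
```

In this example the unperturbed MLE is off by roughly $1/n$, while the influence estimate is off by roughly $1/n^2$, so growing the training set by a factor of 10 shrinks the approximation error by about a factor of 100. This mirrors the quadratic dependence on $\delta$ in Theorem 3.1, although the example is far simpler than the deep network setting.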

4. RELATED WORK

Minimum description length has been used to motivate neural net methods dating back to Hinton and van Camp (1993), who treat description length as a regularizer to mitigate overfitting. The idea of preferring flat minima (Hochreiter and Schmidhuber, 1997) also has its origins in the MDL framework, as it allows a coarser discretization of the weights (and thus fewer bits needed). Bayesian methods typically serve as the starting point for uncertainty estimation in deep networks, and a commonly used approach is to use simple tractable distributions to approximate the true posterior (Hoffman et al., 2013; Blundell et al., 2015; Ritter et al., 2018). Recent work (Maddox et al., 2019; Dusenberry et al., 2020) has shown that fairly simple posterior approximations are able to achieve well-calibrated predictions with marginalization. Our method builds on top of these approximate posterior methods, but in contrast to the Bayesian methods, where the posterior is typically used to efficiently sample models for Bayesian model averaging, our method uses the posterior density to enable efficient optimization for computing the CNML distribution, without needing to retain the training data. In the analysis of our approximation, we use influence functions (Cook and Weisberg, 1982), which have been studied as asymptotic approximations to how M-estimators change when perturbing a dataset. In deep learning, Koh and Liang advocated for using influence functions to interpret neural nets, generate adversarial examples, and diagnose errors in datasets. We use a theorem from Giordano et al. (2019), which broadened the necessary assumptions for these infinitesimal approximations to be accurate and provides explicit guarantees for fixed datasets rather than asymptotic results.

5. EXPERIMENTS

To instantiate ACNML, we must select a method for obtaining the approximate posterior. In principle, any technique for computing a tractable posterior over parameters can be used, and we demonstrate this flexibility by implementing ACNML on top of Stochastic Weight Averaging-Gaussian (SWAG) (Maddox et al., 2019), KFAC-Laplace (Ritter et al., 2018), and Bayes-by-backprop (Blundell et al., 2015). SWAG computes a posterior by fitting a Gaussian distribution to the trajectory of SGD iterates. For simplicity and computational efficiency, we instantiate ACNML with the SWAG-D variant, which uses a Gaussian posterior with only a diagonal covariance. KFAC-Laplace uses a Gaussian posterior approximation with the MAP solution as the mean and the inverse Hessian of the negative log-likelihood as covariance, approximating the Hessian using KFAC (Martens and Grosse, 2015) to allow for tractable inversion and storage. Bayes-by-backprop (Blundell et al., 2015) uses the reparameterization trick to learn a diagonal Gaussian posterior via the variational lower bound. For each model, we report results across 3 seeds. We compare negative log-likelihood (NLL), accuracy, and expected calibration error (ECE) (Naeini et al., 2015), and we also show reliability diagrams (Guo et al., 2017) to further assess calibration. For reliability diagrams, we sort data points by confidence and divide them into twenty equal-sized buckets, plotting the mean accuracy against the mean confidence for each bucket. This allows us to see qualitatively how well the confidence of the prediction relates to the actual accuracy, and also shows how the confidences are distributed for each method.

MNIST. We start with a simple illustrative task based on the MNIST dataset, where we construct out-of-distribution inputs by randomly rotating the images in the MNIST test set.
Here, ACNML is implemented on top of Bayes-by-backprop (Blundell et al., 2015) , and we compare to the MAP estimate and the marginal over models obtained from the same Bayes-by-backprop posterior. The results in Table 1 show that all methods perform well on the in-distribution MNIST test set, though ACNML likelihoods are somewhat worse due to the more conservative CNML distribution. On OOD rotated digits, we see that ACNML exhibits substantial improvements in calibration as measured by the ECE metric, as well as slightly better NLL value. In general, this agrees with what we expect from ACNML: the predictions are more conservative across the board, which does not necessarily improve results in-distribution, particularly for easy domains like MNIST, but offer considerable improvements in calibration for out-of-distribution inputs where errors are prevalent. We additionally compared to a much more computationally expensive instantiation of CNML used by Bibas et al. (2019a) (denoted naive CNML in Table 1 ), which directly finetunes for several epochs using the training set to obtain the optimal parameters for each query point and label, rather than using the approximate posterior like ACNML does. This direct instantiation of CNML performs the best in terms of accuracy and NLL on the in-distribution test set, while also improving over the MAP solution in terms of NLL and calibration on the OOD inputs. However, we find that ACNML is overall more conservative when using this particular posterior approximation, resulting in better NLL and calibration on the OOD inputs (see Appendix C for more detailed comparisons between ACNML and naive CNML).

CIFAR and Corruptions

We evaluate all methods using the VGG16 (Simonyan and Zisserman, 2014) network architecture. Focusing on the most direct comparisons, we compare against the MAP solution for the given posterior, which is equivalent to Stochastic Weight Averaging (SWA) (Izmailov et al., 2018), and Bayes model averaging with SWAG-D and KFAC-Laplace, which provide an apples-to-apples comparison to the two versions of our method that directly utilize the posteriors from these prior approaches. We use CIFAR10 (Krizhevsky, 2012) for training and in-distribution testing. Following Ovadia et al. (2019), we evaluate predictive uncertainty in out-of-distribution settings using the CIFAR10-Corrupted (Hendrycks and Dietterich, 2019) datasets, which apply different severities of 15 common corruptions to the test set images. With this, we can assess performance over a wide range of distribution shifts, as well as how performance degrades as shifts become more extreme. We include additional comparisons across other methods and architectures in Appendix B. Examining the reliability diagrams in Figure 4, we see that ACNML provides more conservative (less confident) predictions than other methods, to the point of being underconfident on the in-distribution CIFAR10 test set, while other methods tend toward being overconfident. On out-of-distribution datasets, where accuracy degrades, we see that ACNML's conservative predictions lead to many better calibrated low-confidence predictions, while other methods drastically overestimate confidence.

Figure 5: ACNML variants compared against their Bayesian counterparts and the deterministic MAP baseline on out-of-distribution CIFAR10-Corrupted datasets. We plot medians and 95% confidence intervals across all corruptions. We see that ACNML methods (solid lines) achieve much lower ECE at higher corruption values, and ACNML with SWAG-D also achieves better NLL than other methods.
All methods perform similarly in terms of accuracy in all domains, and we find that ACNML's more conservative estimates perform competitively with Bayesian methods in NLL and calibration on in-distribution datasets, with all evaluated methods performing reasonably well in-distribution (see Table 3 in Appendix B). However, differences in calibration are much more pronounced for the OOD results in Figure 5 . We see that as the corruption strength increases, ACNML variants provide much better calibration while performing similarly to or slightly better than other methods in terms of NLL. 

Timing Comparison vs. standard CNML:

In Table 2, we examine the computational costs of our method. We compare against a naïve implementation of CNML that fine-tunes for N epochs on each test point and label, similarly to the method proposed by Bibas et al. (2019b). In total, predicting a single input with k possible labels involves running kN epochs of training. While ACNML is over two orders of magnitude faster than naïve CNML even with just a single epoch of training (our experiments with naïve CNML on MNIST used 5 epochs), it is still slower than standard inference. The computational requirements of our method scale linearly with the number of classes, but are constant with respect to dataset size. It is also not easily amenable to data batching, as new copies of the model parameters are needed for each data point. Timing experiments are run using a single NVIDIA 1080Ti, using MNIST for the MNIST MLP timing results and using CIFAR10 for VGG16 and WideResNet28x10, with no parallelization over data points.

6. DISCUSSION

In this paper, we present amortized CNML (ACNML) as an alternative to Bayesian marginalization for obtaining uncertainty estimates and calibrated predictions with high-capacity models, such as deep neural networks. The CNML distribution is a theoretically well-motivated strategy derived from the MDL principle with strong minimax optimality properties, but actually evaluating this distribution is computationally daunting. ACNML utilizes approximate Bayesian posteriors to tractably approximate it, and can be instantiated on top of a wide range of approximate Bayesian methods. We view ACNML as a step towards practical uncertainty aware predictions that would be essential for real-world decision making. Future work could further improve on our proposed method, for example by combining ACNML with more complex and expressive posterior approximations. In particular, training losses are highly non-convex and have many local minima, so incorporating local approximations around multiple diverse minima could allow for even more reliable uncertainty estimation. More broadly, tractable algorithms inspired by ACNML could in the future provide for substantial improvement in our ability to produce accurate and reliable confidence estimates on out-of-distribution inputs, improving the reliability and safety of learning-enabled systems.

A EXPERIMENTAL DETAILS

For obtaining approximate posteriors with SWAG and KFAC-Laplace, we follow the exact training procedures given in Maddox et al. (2019). We then implement ACNML on top of the diagonal SWAG posterior and the KFAC-Laplace posterior. The variance of the SWAG posterior depends in a complex way on the learning rate and gradient covariances. To account for this, we introduce an additional temperature hyperparameter $\alpha$ and solve for the ACNML approximation using

$$\theta^{*} = \arg\max_{\theta \in \Theta} \left[ \log p_\theta(y_n \mid x_n) + \frac{1}{\alpha} \log q(\theta) \right].$$

To calibrate $\alpha$, we can calculate the CNML distribution using a validation set, by training on the entire training set and the validation point, and then selecting $\alpha$ such that our ACNML procedure produces similar likelihoods. We can also treat $\alpha$ as a tunable hyperparameter and select it using a validation set, similarly to how temperature scaling (Guo et al., 2017) is used to achieve better calibration for prediction, or how the relative weighting of priors and likelihoods is used in generalized Bayesian inference (Vovk, 1990) or safe Bayesian inference (Grünwald et al., 2017) as a way to deal with model misspecification. For our experiments using the SWAG-D posterior, we heuristically tune $\alpha$ to be as large as possible without degrading the accuracy compared to the MAP solution. Note, however, that this procedure is specific to the particular way in which SWAG estimates the parameter distribution, and any posterior inference procedure that explicitly approximates the posterior density (e.g., Blundell et al. (2015)) would not require this step. To select $\alpha$ for each model class, we swept over values [0.25, 0.5, 1, 1.5, 2] and selected the highest value such that accuracy and NLL on the validation set did not degrade significantly compared to SWA. For VGG16, we use $\alpha = 0.5$, and for WideResNet28x10, we use $\alpha = 1.5$.
With our posterior $q(\theta)$ being a Gaussian with covariance $\Sigma$, we approximately compute the MAP solution for each label $y$ as per Algorithm 1 by initializing $\theta_0$ to be the posterior mean and iterating

$$\theta_{t+1} = \theta_t + \epsilon_t \Sigma \left( \alpha \nabla \log p_{\theta_t}(y \mid x_n) + \nabla \log q(\theta_t) \right),$$

using the covariance as a preconditioner. For our experiments, we run 5 steps of gradient ascent on this objective, with a constant step size $\epsilon = 0.5$. We empirically find that 5 steps were often enough to find an approximate stationary point with the SWAG-D posterior, and 10 steps for the KFAC-Laplace posterior.

For the reliability diagrams in Figure 4, we again follow the procedure used by Maddox et al. (2019). We first divide the points into twenty bins based on confidence such that each bin has the same number of points, then plot the mean accuracy vs mean confidence within each bin. This differs from the reliability diagrams used by Guo et al. (2017), who divide the range of confidence values into bins uniformly, resulting in unevenly filled bins. For our expected calibration error (ECE) numbers, we use the same bins as computed for our reliability diagrams, and compute

$$\mathrm{ECE} = \sum_{i=1}^{K} P(i) \cdot |o_i - e_i|,$$

where $P(i)$ is the empirical probability that a randomly chosen point lies in bin $i$, $o_i$ is the accuracy within bin $i$, and $e_i$ is the average confidence in bin $i$.

We adapted the SWAG authors' implementation at https://github.com/wjmaddox/swa_gaussian to include the ACNML procedure for test time evaluation, and include a copy of the modified codebase in the supplementary materials with instructions on how to reproduce our experiments. We additionally include the pretrained models that were used for our experiments. Experiments were conducted using a mix of local GPU servers and Google Cloud Platform compute resources. For the MNIST experiments, we used a feedforward network with 2 hidden layers of size 1200, with no data augmentation.
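The preconditioned ascent described above can be sketched on a toy problem. The following is a minimal illustration, not the implementation used in the experiments: it assumes a linear softmax classifier and a diagonal Gaussian posterior, and the names `acnml_predict`, `mean`, and `var` are hypothetical. With a diagonal covariance, multiplying by $\Sigma$ reduces to an elementwise product with the variances.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def acnml_predict(x, mean, var, alpha=1.0, step=0.5, n_steps=5):
    """Toy ACNML for a linear softmax classifier (illustrative assumption).

    mean, var: (K, D) arrays defining a diagonal Gaussian q over weights W.
    Returns the normalized ACNML probabilities and the normalizer.
    """
    K = mean.shape[0]
    scores = np.empty(K)
    for y in range(K):
        W = mean.copy()                          # theta_0 = posterior mean
        onehot = np.eye(K)[y]
        for _ in range(n_steps):
            # gradient of log p_W(y | x) with respect to W
            grad_ll = np.outer(onehot - softmax(W @ x), x)
            # gradient of log q(W) for a diagonal Gaussian
            grad_lq = -(W - mean) / var
            # preconditioned ascent step: Sigma is diag(var)
            W = W + step * var * (alpha * grad_ll + grad_lq)
        scores[y] = softmax(W @ x)[y]            # p_{theta_y}(y | x)
    normalizer = scores.sum()
    return scores / normalizer, normalizer
```

In this sketch a very small posterior variance leaves the weights essentially at the posterior mean, so the output approaches the ordinary softmax prediction, while a larger variance lets each label adapt more and pushes the normalizer above 1.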
The posterior is factored as independent Gaussians for each parameter, with the prior for each parameter being a zero-mean Gaussian with standard deviation 0.1. 
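The equal-mass binning used for the reliability diagrams and ECE numbers in this appendix can be sketched as follows. This is a minimal illustration under our own assumptions (the function name `equal_mass_ece` and its arguments are hypothetical, and ties in confidence are split arbitrarily by the sort); it is not taken from the released codebase.

```python
import numpy as np

def equal_mass_ece(confidences, correct, n_bins=20):
    """ECE with bins holding (nearly) equal numbers of points.

    confidences: predicted confidence of the chosen label for each point.
    correct: 1 if the prediction was right, 0 otherwise.
    Returns the ECE and the (mean confidence, accuracy) pair for each bin.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)            # sort points by confidence
    bins = np.array_split(order, n_bins)       # equal-mass index groups
    n = len(confidences)
    ece, points = 0.0, []
    for idx in bins:
        if len(idx) == 0:
            continue
        acc = correct[idx].mean()              # o_i
        conf = confidences[idx].mean()         # e_i
        ece += (len(idx) / n) * abs(acc - conf)   # P(i) * |o_i - e_i|
        points.append((conf, acc))
    return ece, points
```

The returned `points` are what a reliability diagram would plot, so the same binning serves both purposes.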

B FURTHER EXPERIMENTAL RESULTS AND COMPARISONS ON CIFAR10

In addition to the comparisons in the main paper, we compare to SWA-Gaussian (SWAG), which uses a more expressive posterior than SWAG-D, and to SWA with Monte Carlo Dropout (Gal and Ghahramani, 2015) (SWA-Drop). For reference, we show the in-distribution performance of all methods in Table 3. Overall, performance differences between methods are quite small, and ACNML's conservative predictions do not improve on NLL or ECE over some baselines in distribution. This is to be expected, since the main aim of our method is to produce more calibrated predictions on out-of-distribution tasks. For completeness, we show expanded results on CIFAR10-Corrupted in Figures 6, 7, and 8. With the same architecture, all methods generally have very similar accuracy. ACNML consistently achieves significantly better ECE on the more severe corruptions, and generally comparable or slightly better NLL. While evaluating MC-Dropout, we found that adding dropout before each layer in VGG16 (labelled VGG16Drop in Figure 7) significantly improved performance on CIFAR10-C. For fair comparisons, we reran all methods with the VGG16Drop architecture as well.

C COMPARISONS BETWEEN ACNML AND NAIVE CNML ON MNIST

In this section, we include expanded comparisons between ACNML and a naive implementation of CNML from Bibas et al. (2019b) that computes the MLE/MAP $\hat\theta_y$ for each label by appending the query point and label to the dataset and finetuning for N epochs. Both ACNML and naive CNML are initialized from the same MAP solution, with ACNML taking 5 gradient steps on the query point and posterior, and naive CNML finetuning with the query point and training set for 5 epochs. This naive implementation differs slightly from Bibas et al. (2019b) in that we finetune the entire network, while Bibas et al. (2019b) proposed only tuning the last few layers. During the finetuning, we also append the query point and label to every batch in optimization, and downweight that portion of the loss accordingly to get unbiased gradient estimates. We found this led to more efficient optimization than randomly sampling.

We first examine how closely ACNML and naive CNML's predictions match on the same datapoint. To assess this, we compare the CNML normalization terms $\sum_y p_{\hat\theta_y}(y \mid x)$, the NLLs, and the confidences of the two methods. The CNML normalization term captures how much each procedure was able to adapt to different labels for that input. A higher normalization term for an input means that we were flexible enough to fit multiple different labels well together with the training set (or the approximate posterior in the case of ACNML), and typically means a less confident prediction on that input. In Figures 9 and 10, we show scatter plots over 1000 randomly selected test points (from the in-distribution test set and the rotated OOD images respectively) comparing the CNML normalizers, NLLs, and confidences of ACNML and naive CNML. In each scatter plot, we include a diagonal red line to illustrate where points would lie if the predictions of ACNML and naive CNML matched exactly. We additionally plot reliability diagrams for the MNIST experiments in Figure 11.
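The three per-input statistics compared in this section (normalizer, NLL, and confidence) all follow from the per-label adapted probabilities $p_{\hat\theta_y}(y \mid x)$. A minimal sketch, with the hypothetical helper name `cnml_statistics` and made-up input values chosen only for illustration:

```python
import numpy as np

def cnml_statistics(adapted_probs, true_label):
    """adapted_probs[y] = p_{theta_y}(y | x), one entry per candidate label."""
    adapted = np.asarray(adapted_probs, dtype=float)
    normalizer = adapted.sum()             # CNML normalization term
    probs = adapted / normalizer           # CNML predictive distribution
    return {
        "normalizer": normalizer,
        "confidence": probs.max(),         # probability of the predicted label
        "nll": -np.log(probs[true_label]),
    }

# Little adaptation to the wrong labels: normalizer near 1, confident prediction.
rigid = cnml_statistics([0.99, 0.02, 0.01], true_label=0)
# Strong adaptation to a second label: larger normalizer, lower confidence.
flexible = cnml_statistics([0.9, 0.8, 0.1], true_label=0)
```

Comparing `rigid` and `flexible` shows the relationship described above: the more labels the procedure can fit well, the larger the normalizer and the less confident the resulting prediction.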
For the in-distribution test set, we see from the CNML normalizer plot that the ACNML adaptation procedure using the approximate posterior is much less constraining than using the training set, resulting in the normalizers being higher for ACNML than naive CNML for almost all inputs. This leads to excess conservatism, with ACNML almost always having lower confidence in its predictions. As a result, on many points where naive CNML output confident correct answers and achieved close to 0 NLL, ACNML still incurs somewhat higher losses due to its less confident predictions. On the OOD rotated images, we again see that ACNML typically adapts more than naive CNML as measured by the CNML normalizers, though the difference is much less extreme than on the in-distribution dataset. In the confidence scatter plot, we again see that ACNML tends to make lower-confidence predictions than naive CNML (especially when naive CNML's predictions are confident), which, as seen in Table 1 and Figure 11, results in ACNML having better NLL and calibration on the OOD inputs.

Handling multiple MLEs in CNML: Strictly speaking, the CNML distribution is not well defined when there exist multiple potential MLEs $\hat\theta_y$ that can output different predictions (prior references to CNML typically assume such MLEs are unique). However, the non-convexity of the objective for deep neural networks means multiple MLEs can exist, and to properly define CNML in this case, we would need to select a particular MLE to use when assigning probabilities in CNML. In line with the min-max formulation of CNML, we propose to select the MLE $\hat\theta_y$ that maximizes the likelihood $p_{\hat\theta_y}(y \mid x)$ of the query point and proposed label, as this is the choice that maximizes the regret for that particular label over all MLEs.

With our naive CNML instantiation, we observe that during the finetuning for each query point $x$ and label $y$, the predicted probability of that label $p_\theta(y \mid x)$ does not monotonically increase over iterations as we might hope (since we initialize $\theta$ to be the MLE of the training set, then update it to maximize the likelihood of the training set together with the query point and label), but can oscillate substantially throughout the finetuning process. We suspect this is due to the stochasticity in the optimization procedure from sampling minibatches of the training data, which can cause the parameter trajectory to visit several different (approximate) local optima that output different predictions on the query point. While our instantiation of naive CNML simply used the parameters found at the end of 5 epochs, we additionally compare against a variant that explicitly tries to select the MLE that maximizes the likelihood of the proposed label. This variant heuristically uses the best value of $p_\theta(y \mid x)$ over all $\theta$ encountered in the last epoch of finetuning. We see in Table 4 that this variant, denoted naive CNML (max over itrs), gives more conservative predictions than naive CNML and improves in NLL and calibration on the OOD dataset.

D NMAP AND ACNML

NML-type methods can be extended with a prior-like regularization term on the selected parameter, resulting in Normalized Maximum a Posteriori (NMAP) (Kakade et al., 2006), also referred to as Luckiness NML (LNML) (Grunwald, 2004). For a regularizer given by $\log p(\theta)$, NMAP assigns probabilities according to

$$p_{\mathrm{NMAP}}(x^n) \propto p_{\hat\theta(x^n)}(x^n), \qquad \hat\theta(x^n) = \arg\max_\theta \log p_\theta(x^n) + \log p(\theta).$$

Similarly to CNML, there are several variations on NMAP or LNML that predict slightly different distributions, but we adopt the one of the same form as our CNML. Similarly to how NML was extended to CNML, NMAP can be extended to a conditional version, again with the $\theta$'s being chosen via MAP rather than MLE. As mentioned in Section 3.1, with a non-uniform prior, ACNML actually approximates a version of conditional NMAP, with the Bayesian prior term on the parameters corresponding to the additional regularizer. We also note that, by the calculations in Section 3.1, CNML can be viewed as performing NMAP on the new test point, with a regularizer corresponding to the likelihoods on the training data. In this perspective, ACNML approximates CNML by using an approximation to that training loss regularizer.
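As a small worked illustration of conditional NMAP, consider a Bernoulli model, where everything can be computed in closed form. This example is ours rather than the paper's: the Beta(a, b)-shaped regularizer $\log p(\theta) = (a-1)\log\theta + (b-1)\log(1-\theta)$ with $a, b \ge 1$ and the function name `conditional_nmap_bernoulli` are assumptions made for the sketch.

```python
def conditional_nmap_bernoulli(n_ones, n, a=1.0, b=1.0):
    """Probability that the next bit is 1 under conditional NMAP.

    n_ones out of n training bits are 1. The regularizer has the shape of a
    Beta(a, b) log-density with a, b >= 1; a = b = 1 recovers plain CNML.
    """
    def map_theta(k, m):
        # MAP estimate after seeing k ones in m bits under the regularizer
        return (k + a - 1.0) / (m + a + b - 2.0)

    theta_if_1 = map_theta(n_ones + 1, n + 1)   # refit with the query labeled 1
    theta_if_0 = map_theta(n_ones, n + 1)       # refit with the query labeled 0
    score_1 = theta_if_1                        # p_{theta_1}(y = 1)
    score_0 = 1.0 - theta_if_0                  # p_{theta_0}(y = 0)
    return score_1 / (score_1 + score_0)        # normalize over both labels
```

Working through the algebra gives $(n_1 + a)/(n + a + b)$, so with 7 ones in 10 bits the uniform regularizer yields 8/12, while $a = b = 5$ yields 12/20: the regularizer changes which parameters are selected for each label and thereby the normalized prediction.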

E DETAILS OF ANALYSIS IN SECTION 3.2

E.1 BOUNDING ERROR IN PARAMETER ESTIMATION

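Here we state the primary theorem of Giordano et al. (2019) along with the necessary definitions and assumptions. We attempt to estimate an unknown parameter $\theta \in \Omega_\theta \subseteq \mathbb{R}^D$, where $\Omega_\theta$ is compact. Suppose we have a dataset of $N$ datapoints and a weight vector $w = (w_1, \ldots, w_N)$. Let $g_i(\theta)$ denote the gradient of the loss at datapoint $i$ evaluated at $\theta$, and $h_i(\theta)$ the Hessian. We can then define

$$G(\theta, w) = \frac{1}{N} \sum_{i=1}^{N} w_i g_i(\theta) \qquad (21)$$

$$H(\theta, w) = \frac{1}{N} \sum_{i=1}^{N} w_i h_i(\theta).$$

The MLE $\hat\theta(w)$ for the dataset weighted by $w$ is given by solving $G(\hat\theta(w), w) = 0$. Let $1_w$ denote the vector of weights consisting of all 1s. We define $\hat\theta_1$ to be the MLE for the whole unweighted dataset, which is equivalent to evaluating $\hat\theta(1_w)$, and also define the corresponding Hessian $H_1 = H(\hat\theta_1, 1_w)$. We now wish to estimate $\hat\theta(w)$ using a first-order approximation around $\hat\theta_1$ given by

$$\hat\theta_{IJ}(w) = \hat\theta_1 - H_1^{-1} G(\hat\theta_1, \Delta w),$$

where we define $\Delta w = w - 1_w$. The theorem will proceed to bound $\|\hat\theta(w) - \hat\theta_{IJ}(w)\|_2$ for suitable weights $w$. We further define $g(\theta) \in \mathbb{R}^{N \times D}$ to be the concatenation of all $g_i(\theta)$, and similarly $h(\theta) \in \mathbb{R}^{N \times D \times D}$. We let $\|g(\theta)\|_p$ and $\|h(\theta)\|_p$ refer to the $p$-norms when treating those as vector quantities.

Assumption 1 (Smoothness): For all $\theta \in \Omega_\theta$, each $g_n(\theta)$ is continuously differentiable.

Assumption 2 (Non-degeneracy): For all $\theta \in \Omega_\theta$, $H(\theta, 1_w)$ is nonsingular and

$$\sup_{\theta \in \Omega_\theta} \|H(\theta, 1_w)^{-1}\|_{op} \le C_{op} < \infty. \qquad (24)$$

Assumption 3 (Bounded averages): There exist finite constants $C_g$ and $C_h$ such that $\sup_{\theta \in \Omega_\theta} \frac{1}{\sqrt{N}} \|g(\theta)\|_2 \le C_g$ and $\sup_{\theta \in \Omega_\theta} \frac{1}{\sqrt{N}} \|h(\theta)\|_2 \le C_h$.

Assumption 4 (Local Smoothness): There exists a $\Delta_\theta > 0$ and a finite constant $L_h$ such that $\|\theta - \hat\theta_1\|_2 \le \Delta_\theta$ implies $\frac{\|h(\theta) - h(\hat\theta_1)\|_2}{\sqrt{N}} \le L_h \|\theta - \hat\theta_1\|_2$.

Assumption 5 (Bounded weight averages): $\frac{1}{\sqrt{N}} \|w\|_2$ is uniformly bounded for all $w \in W$ by a finite constant $C_w$.

We note that assumption 2 is equivalent to $H_1$ being strongly positive definite. Assumption 5 is not relevant for our use cases, but is stated for completeness.

Condition 1 (Set Complexity): There exists a $\delta \ge 0$ and a corresponding set $W_\delta \subseteq W$ such that

$$\max_{w \in W_\delta} \sup_{\theta \in \Omega_\theta} \left\| \frac{1}{N} \sum_{i=1}^{N} (w_i - 1) g_i(\theta) \right\|_1 \le \delta, \qquad \max_{w \in W_\delta} \sup_{\theta \in \Omega_\theta} \left\| \frac{1}{N} \sum_{i=1}^{N} (w_i - 1) h_i(\theta) \right\|_1 \le \delta.$$

Condition 1 essentially describes the set of weight vectors for which $\hat\theta_{IJ}$ will be an accurate approximation within order $\delta$.

Definition 1: Given assumptions 1-5, define

$$C_{IJ} = 1 + D C_w L_h C_{op}, \qquad \Delta_\delta = \min\left\{ \Delta_\theta C_{op}^{-1}, \; \frac{1}{n} C_{IJ}^{-1} C_{op}^{-1} \right\}.$$

We now state the main theorem of Giordano et al. (2019).

Theorem (Error Bound for the approximation). Under assumptions 1-5 and condition 1,

$$\delta \le \Delta_\delta \;\Rightarrow\; \max_{w \in W_\delta} \|\hat\theta_{IJ}(w) - \hat\theta(w)\|_2 \le 2 C_{op}^2 C_{IJ} \delta^2.$$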
We can now apply the above theorem to provide error bounds for a setting where we have a training set of $n$ datapoints and wish to consider the MLE after adding a new datapoint $z$. The issue is that the theorem as stated bounds the error of the approximation when the approximation is centered around the uniform weighting over all the datapoints, which would be appropriate for considering the impact of removing datapoints from the dataset. To apply the theorem to bound the effects of adding a datapoint, we have to do some slight manipulation. We apply the previous theorem with $N = n + 2$, where $g_i(\theta)$ corresponds to the gradient of training datapoint $i$ for $i \in \{1, \ldots, n\}$, $g_{n+1}(\theta) = -\nabla \log p_\theta(z)$, and $g_{n+2}(\theta) = \nabla \log p_\theta(z)$, and similarly for the Hessians $h_i(\theta)$. We have thus added the query point to the dataset, as well as another fake point that serves to cancel out the contribution of the query point under a uniform weighting, so $G(\theta, 1_w)$ and $H(\theta, 1_w)$ are the mean gradients and Hessians for just the training set. Now supposing assumptions 1-5 are met for this problem, we need to check condition 1 for the particular $W_\delta$ that contains the vector $\tilde w$ of all 1s, except for a 2 in the last entry. We can then find the smallest $\delta$ that satisfies

$$\sup_{\theta \in \Omega_\theta} \left\| \frac{1}{n+2} g_{n+2}(\theta) \right\|_1 \le \delta \qquad (30)$$

$$\sup_{\theta \in \Omega_\theta} \left\| \frac{1}{n+2} h_{n+2}(\theta) \right\|_1 \le \delta,$$

and so long as $\delta \le \Delta_\delta$, applying the theorem bounds $\|\hat\theta_{IJ}(\tilde w) - \hat\theta(\tilde w)\|_2$ by $2 C_{op}^2 C_{IJ} \delta^2$.

Commentary:

The above theorem gives explicit conditions for the accuracy of the approximation that we can verify for a particular training set and query point. Assuming we have some limiting procedure for growing the training set such that the constants defined above hold uniformly, we can extend this to an asymptotic statement that the approximation error decays as $O(n^{-2})$.
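The $O(n^{-2})$ decay can be observed directly in a setting where both the exact refit and the first-order approximation are available in closed form. The sketch below is our own illustration under stated assumptions, not part of the paper's experiments: it uses least squares with the per-example loss $\frac{1}{2}(x_i^\top \theta - y_i)^2$, synthetic Gaussian data, and the hypothetical function name `ij_error_least_squares`. Here $g$ is taken to be the gradient of the loss of the added point, and the $1/N$ factors in $G$ and $H_1$ cancel.

```python
import numpy as np

def ij_error_least_squares(n, d=3, seed=0):
    """Error of the first-order approximation when one datapoint is added."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = X @ np.ones(d) + 0.1 * rng.normal(size=n)
    x_z, y_z = np.ones(d), 0.0          # query point with a large residual

    A = X.T @ X                          # sum of training Hessians (N * H_1)
    theta_train = np.linalg.solve(A, X.T @ y)       # training-set MLE
    grad_z = x_z * (x_z @ theta_train - y_z)        # query loss gradient

    # First-order step from the training MLE using the training Hessian only.
    theta_ij = theta_train - np.linalg.solve(A, grad_z)
    # Exact MLE after appending the query point.
    theta_exact = np.linalg.solve(A + np.outer(x_z, x_z), X.T @ y + x_z * y_z)
    return np.linalg.norm(theta_ij - theta_exact)

errors = {n: ij_error_least_squares(n) for n in (100, 1000, 10000)}
```

In this quadratic setting the Sherman-Morrison identity shows the error is proportional to $\|A^{-1} x_z\| \cdot x_z^\top A^{-1} x_z$, each factor scaling as $1/n$, so increasing $n$ tenfold should shrink the error by roughly a factor of one hundred.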

E.2 BOUNDING ERROR IN THE RESULTING CNML DISTRIBUTION

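We now provide the proof for Proposition 3.2, which we restate here. For notational simplicity, we ignore any dependence on the input $x$, which we consider fixed.

Proposition E.1 (3.2). Suppose $z \in Z$ with $|Z| = k$ (for example, classification with $k$ classes). Let $\hat\theta_z$ be the exact MLE after appending $z$ to the training set, and let $\tilde\theta_z$ be an approximate MLE with $\|\hat\theta_z - \tilde\theta_z\|_2 \le \delta$ for all $z$. Further suppose $\log p_\theta(z)$ is $L$-Lipschitz in $\theta$. Denote the exact CNML distribution $p_{\mathrm{CNML}}(z) \propto p_{\hat\theta_z}(z)$ and an approximate CNML distribution $p_{\mathrm{ACNML}}(z) \propto p_{\tilde\theta_z}(z)$. Then, we have the bound

$$\sup_z |\log p_{\mathrm{CNML}}(z) - \log p_{\mathrm{ACNML}}(z)| \le 2 L \delta. \qquad (32)$$

Proof. The assumed bound $\|\hat\theta_z - \tilde\theta_z\|_2 \le \delta$ combined with $L$-Lipschitzness implies a bound on the differences of the logits of each class,

$$|\log p_{\hat\theta_z}(z) - \log p_{\tilde\theta_z}(z)| \le L \delta. \qquad (33)$$

We note that the log probabilities of the exact CNML distribution $p_{\mathrm{CNML}}$ are given by

$$\log p_{\mathrm{CNML}}(z) = \log p_{\hat\theta_z}(z) - \log \sum_{z' \in Z} p_{\hat\theta_{z'}}(z'),$$

and $p_{\mathrm{ACNML}}$ is given by a similar expression using $\tilde\theta_z$ instead of $\hat\theta_z$. For any $z \in Z$, we can then expand, apply the triangle inequality, and then apply Equation 33 to obtain

$$|\log p_{\mathrm{CNML}}(z) - \log p_{\mathrm{ACNML}}(z)| \le L\delta + \left| \log \sum_{z' \in Z} p_{\hat\theta_{z'}}(z') - \log \sum_{z' \in Z} p_{\tilde\theta_{z'}}(z') \right|. \qquad (37)$$

We now bound the difference between the log-normalizers. Let $p_{\min}(z) = \min\{p_{\hat\theta_z}(z), p_{\tilde\theta_z}(z)\}$ and $p_{\max}(z) = \max\{p_{\hat\theta_z}(z), p_{\tilde\theta_z}(z)\}$, and note that Equation 33 implies $\log p_{\max}(z) \le \log p_{\min}(z) + L\delta$ for all $z$. Then

$$\left| \log \sum_{z' \in Z} p_{\hat\theta_{z'}}(z') - \log \sum_{z' \in Z} p_{\tilde\theta_{z'}}(z') \right| \le \log \frac{\sum_{z' \in Z} \exp(\log p_{\max}(z'))}{\sum_{z' \in Z} p_{\min}(z')} \le \log \frac{\exp(L\delta) \sum_{z' \in Z} p_{\min}(z')}{\sum_{z' \in Z} p_{\min}(z')} = L\delta.$$

Plugging back into Equation 37, we have the following bound for all $z \in Z$:

$$|\log p_{\mathrm{CNML}}(z) - \log p_{\mathrm{ACNML}}(z)| \le 2 L \delta. \qquad (44)$$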
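The statement being proved here says that if each unnormalized log-probability moves by at most $L\delta$, then the normalized log-probabilities move by at most $2L\delta$. This can be checked numerically with a short sketch. The helper names `normalized_log_probs` and `worst_gap`, and the random sampling scheme, are illustrative assumptions; the quantity `eps` plays the role of $L\delta$.

```python
import numpy as np

def normalized_log_probs(log_scores):
    """Log of scores / sum(scores), computed stably from log-scores."""
    log_scores = np.asarray(log_scores, dtype=float)
    m = log_scores.max()
    return log_scores - (m + np.log(np.exp(log_scores - m).sum()))

def worst_gap(eps, k=10, trials=1000, seed=0):
    """Largest observed |log p_exact(z) - log p_approx(z)| over random trials."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        # per-label adapted probabilities for the "exact" parameters
        log_exact = np.log(rng.uniform(0.05, 1.0, size=k))
        # each log-probability is perturbed by at most eps in absolute value
        log_approx = log_exact + rng.uniform(-eps, eps, size=k)
        gap = np.abs(normalized_log_probs(log_exact)
                     - normalized_log_probs(log_approx)).max()
        worst = max(worst, float(gap))
    return worst

eps = 0.05
observed = worst_gap(eps)   # never exceeds 2 * eps
```

Random perturbations typically stay well inside the bound; approaching $2L\delta$ requires lowering one label's score while raising all of the others.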



Figure 1: CNML probabilities with a logistic regression model. Note that CNML provides uniform predictions (indicated by the white color) on most of the input space away from the training set (shown in blue and orange dots).

Figure 2: Given the labeled training set (blue and orange dots), we want to predict the label at the query input (shown in pink in the left image), which the training set MLE θtrain confidently classifies as the blue class. However, CNML assigns a near-uniform prediction on the query point, as it computes new MLEs θ0 and θ1 (center and right images) by assigning different labels to the query point, and finds both labels are consistent with the training data.

Figure 3: CNMAP probability heatmaps with different levels of L2 regularization $\lambda \|w\|_2^2$. We see predictions are less conservative as regularization increases.

Figure 4: Reliability diagrams plotting confidence vs. accuracy for VGG16 on in-distribution and out-of-distribution data. ACNML provides more conservative predictions than other methods, resulting in better calibration on out-of-distribution inputs. For the OOD task, we show results for the Gaussian blur corruptions at levels 3 and 5, with level 5 corresponding to a higher amount of corruption. Each point shows the mean confidence and mean accuracy within a bucket, so the spread of points along the x-axis shows that ACNML makes more low-confidence predictions than other methods.

(a) CIFAR10C VGG16 ECEs (lower is better) (b) CIFAR10C VGG16 NLLs (lower is better)

Figure 6: CIFAR10-C performance with the VGG16 architecture. Instantiations of our methods are shown in stripes. Boxplots show quartiles of each statistic over all different corruption types of the given intensity, with the mean indicated by a circle. The accuracy (a) and NLL (c) for most methods are similar, but both ACNML variants attain significantly better ECE (b) on the more severe corruptions, as the images move further out of distribution.

(a) CIFAR10C VGG16Drop Accuracies (higher is better) (b) CIFAR10C VGG16Drop ECEs (lower is better) (c) CIFAR10C VGG16Drop NLLs (lower is better)

Figure 7: CIFAR10-C performance with the VGG16Drop architecture. Instantiations of our methods are shown in stripes. Boxplots show quartiles of each statistic over all different corruption types of the given intensity, with the mean indicated by a circle. Again, the accuracy (a) and NLL (c) for most methods are similar, but both ACNML variants attain significantly better ECE (b) on the more severe corruptions, as the images move further out of distribution.

Figure 8: CIFAR10-C performance with the WideResNet28x10 architecture. Instantiations of our methods are shown in stripes. Boxplots show quartiles of each statistic over all different corruption types of the given intensity, with the mean indicated by a circle. Again, we see that ACNML attains better ECE values than comparable methods on the heavier corruptions (b). Note that the best performing prior method, SWAG, uses a substantially more expressive posterior than the diagonal approximation used by SWAGD+ACNML, whereas the comparable SWAGD method attains worse ECE.

Figure 9: In-distribution comparisons between ACNML and naive CNML. We show scatter plots of the values of each statistic for naive CNML (x-axis) vs ACNML (y-axis), with the red line indicating where the predictions of the two methods would match exactly. Looking at the CNML normalizers, we see that the ACNML adaptation procedure using the approximate posterior is much less constraining than using the training set, resulting in the normalizers being higher for ACNML than naive CNML for almost all inputs. This leads to excess conservatism, with ACNML almost always having lower confidence in its predictions, and many inputs with close to 0 NLL with naive CNML having higher NLL with ACNML.

Figure 10: OOD comparisons between ACNML and naive CNML. We show scatter plots of the values of each statistic for naive CNML (x-axis) vs ACNML (y-axis). Looking at the CNML normalizers, we again see that the ACNML adaptation procedure using the approximate posterior is less constraining than using the training set, with the normalizers being higher for ACNML than naive CNML for most inputs (though to a lesser extent than on the in-distribution data). ACNML again outputs more conservative predictions with lower confidence on many inputs, which leads to better NLL and calibration on the OOD dataset, unlike with the in-distribution test set.

Figure 11: Reliability diagrams plotting confidence vs. accuracy for Bayes-by-Backprop experiments on the MNIST test set and the randomly rotated MNIST test set (OOD). ACNML's conservative predictions provided better calibrated predictions on the OOD test set.

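$$\sup_{\theta \in \Omega_\theta} \|H(\theta, 1_w)^{-1}\|_{op} \le C_{op} < \infty. \qquad (24)$$

Assumption 3 (Bounded averages): There exist finite constants $C_g$ and $C_h$ such that $\sup_{\theta \in \Omega_\theta} \frac{1}{\sqrt{N}} \|g(\theta)\|_2 \le C_g$ and $\sup_{\theta \in \Omega_\theta} \frac{1}{\sqrt{N}} \|h(\theta)\|_2 \le C_h$.

Assumption 4 (Local Smoothness): There exists a $\Delta_\theta > 0$ and a finite constant $L_h$ such that $\|\theta - \hat\theta_1\|_2 \le \Delta_\theta$ implies $\frac{\|h(\theta) - h(\hat\theta_1)\|_2}{\sqrt{N}} \le L_h \|\theta - \hat\theta_1\|_2$.

Assumption 5 (Bounded weight averages): $\frac{1}{\sqrt{N}} \|w\|_2$ is uniformly bounded for all $w \in W$ by a finite constant $C_w$.

Condition 1 (Set Complexity): There exists a $\delta \ge 0$ and a corresponding set $W_\delta \subseteq W$ such that $\max_{w \in W_\delta} \sup_{\theta \in \Omega_\theta} \left\| \frac{1}{N} \sum_{i=1}^{N} (w_i - 1) g_i(\theta) \right\|_1 \le \delta$ and $\max_{w \in W_\delta} \sup_{\theta \in \Omega_\theta} \left\| \frac{1}{N} \sum_{i=1}^{N} (w_i - 1) h_i(\theta) \right\|_1 \le \delta$.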

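Proposition E.1 (3.2). Suppose $z \in Z$ with $|Z| = k$ (for example, classification with $k$ classes). Let $\hat\theta_z$ be the exact MLE after appending $z$ to the training set, and let $\tilde\theta_z$ be an approximate MLE with $\|\hat\theta_z - \tilde\theta_z\|_2 \le \delta$ for all $z$. Further suppose $\log p_\theta(z)$ is $L$-Lipschitz in $\theta$. Denote the exact CNML distribution $p_{\mathrm{CNML}}(z) \propto p_{\hat\theta_z}(z)$ and an approximate CNML distribution $p_{\mathrm{ACNML}}(z) \propto p_{\tilde\theta_z}(z)$. Then, we have the bound

$$\sup_z |\log p_{\mathrm{CNML}}(z) - \log p_{\mathrm{ACNML}}(z)| \le 2 L \delta. \qquad (32)$$

Proof. The assumed bound $\|\hat\theta_z - \tilde\theta_z\|_2 \le \delta$ combined with $L$-Lipschitzness implies a bound on the differences of the logits of each class,

$$|\log p_{\hat\theta_z}(z) - \log p_{\tilde\theta_z}(z)| \le L \delta. \qquad (33)$$

For any $z \in Z$, we can then expand, apply the triangle inequality, and then apply Equation 33 to obtain

$$|\log p_{\mathrm{CNML}}(z) - \log p_{\mathrm{ACNML}}(z)| = \left| \log p_{\hat\theta_z}(z) - \log p_{\tilde\theta_z}(z) - \log \sum_{z' \in Z} p_{\hat\theta_{z'}}(z') + \log \sum_{z' \in Z} p_{\tilde\theta_{z'}}(z') \right| \le L\delta + \left| \log \sum_{z' \in Z} p_{\hat\theta_{z'}}(z') - \log \sum_{z' \in Z} p_{\tilde\theta_{z'}}(z') \right|. \qquad (37)$$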

Ovadia et al. (2019) evaluate various proposed methods for uncertainty estimates in deep learning under different types of distribution shift. They found that good calibration on in-distribution points did not necessarily indicate good calibration under distribution shift, and that methods relying on marginalizing predictions over multiple models (Lakshminarayanan et al., 2016; Srivastava et al., 2014) gave better uncertainty estimates under distribution shift than other techniques. We show that our method ACNML maintains much better calibration under distribution shift than prior methods.

Comparative results for ACNML on MNIST using a posterior obtained via Bayes by Backprop.

Inference time per input (in seconds).

In-distribution comparative results. We see that for in-distribution performance, ACNML variants perform comparably to other methods, without large separations between most methods. Results for SWA-Temp and SGD are taken from Maddox et al. (2019).

Table 4: Expanded comparative results for ACNML on MNIST using a posterior obtained via Bayes by Backprop.

Method                       NLL               Acc. (%)        ECE               NLL (OOD)         Acc. (%) (OOD)   ECE (OOD)
[...]                        [...] ± 0.0032    97.28 ± 0.21    0.1013 ± 0.0006   2.766 ± 0.0197    37.34 ± 0.06     0.1540 ± 0.0023
MAP                          0.0864 ± 0.0025   97.28 ± 0.21    0.0047 ± 0.0006   3.994 ± 0.072     37.29 ± 0.02     0.4371 ± 0.0094
Marginal                     0.1069 ± 0.0067   97.22 ± 0.24    0.0313 ± 0.0010   3.017 ± 0.022     37.63 ± 0.31     0.2928 ± 0.0032
naive CNML                   0.0774 ± 0.0024   98.05 ± 0.08    0.0231 ± 0.0001   3.100 ± 0.057     37.33 ± 0.34     0.2497 ± 0.0072
naive CNML (max over itrs)   0.0882 ± 0.0018   97.90 ± 0.23    0.0355 ± 0.0005   2.991 ± 0.021     37.34 ± 0.003    0.1858 ± 0.0075


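We first let $p_{\min}(z) = \min\{p_{\hat\theta_z}(z), p_{\tilde\theta_z}(z)\}$ and $p_{\max}(z) = \max\{p_{\hat\theta_z}(z), p_{\tilde\theta_z}(z)\}$, and note that Equation 33 implies $\log p_{\max}(z) \le \log p_{\min}(z) + L\delta$ for all $z$. We then bound the difference in log-normalizers:

$$\left| \log \sum_{z' \in Z} p_{\hat\theta_{z'}}(z') - \log \sum_{z' \in Z} p_{\tilde\theta_{z'}}(z') \right| \le \log \frac{\sum_{z' \in Z} \exp(\log p_{\max}(z'))}{\sum_{z' \in Z} p_{\min}(z')} \le \log \frac{\exp(L\delta) \sum_{z' \in Z} p_{\min}(z')}{\sum_{z' \in Z} p_{\min}(z')} = L\delta.$$

Plugging back into Equation 37, we have the following bound for all $z \in Z$:

$$|\log p_{\mathrm{CNML}}(z) - \log p_{\mathrm{ACNML}}(z)| \le 2 L \delta. \qquad (44)$$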

