COLD POSTERIORS THROUGH PAC-BAYES

Abstract

We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections of the ELBO objective from variational inference and the PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter objectives naturally contain a temperature parameter λ which is not restricted to be λ = 1. For classification tasks, in the case of Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures important aspects of the cold posterior effect.



1. INTRODUCTION

In their influential paper, Wenzel et al. (2020) highlighted the observation that Bayesian neural networks typically exhibit better test-time predictive performance if the posterior distribution is "sharpened" through tempering. Their work has been influential primarily because it serves as a well-documented example of the potential drawbacks of the Bayesian approach to deep learning. While other subfields of deep learning have seen rapid adoption and have had an impact on real-world problems, Bayesian deep learning has, to date, seen relatively limited practical use (Izmailov et al., 2021; Lotfi et al., 2022; Dusenberry et al., 2020; Wenzel et al., 2020). The "cold posterior effect", as Wenzel et al. (2020) named their observation, highlights an essential mismatch between Bayesian theory and practice. As the number of training samples increases, Bayesian theory states that the posterior distribution should concentrate more and more on the true model parameters, in a frequentist sense. At any time, the posterior is our best guess at the true model parameters, without having to resort to heuristics.

Since the original paper, a number of works (Noci et al., 2021; Zeno et al., 2020; Adlam et al., 2020; Nabarro et al., 2022; Fortuin et al., 2021; Aitchison, 2021) have attempted to explain the cold posterior effect, identify its origins, propose remedies and defend Bayesian deep learning in the process. The experimental setups where the cold posterior effect arises have, however, been hard to pinpoint precisely. Noci et al. (2021) conducted detailed experiments testing various hypotheses. The cold posterior effect was shown to arise from augmenting the data during optimization (data augmentation hypothesis), from selecting only the "easiest" data samples when constructing the dataset (data curation hypothesis), and from selecting a "bad" prior (prior misspecification hypothesis). Nabarro et al. (2022) propose a principled log-likelihood that incorporates data augmentation; however, they show that the cold posterior effect persists. Bachmann et al. (2022) also propose a mechanism by which data augmentation leads to misspecification, and show how the tempered posterior alleviates it. They prove their results for simplified settings, and acknowledge that there might be other potential sources of the cold posterior effect. Data curation was first proposed as an explanation in Aitchison (2021); however, the author shows that data curation can only explain a part of the cold posterior effect. Misspecified priors have also been explored as a possible cause in several other works (Zeno et al., 2020; Adlam et al., 2020; Fortuin et al., 2021). Again the results have been mixed. In smaller models, data-dependent priors seem to decrease the cold posterior effect, while in larger models the effect increases (Fortuin et al., 2021).

We posit that discussions of the cold posterior effect should take into account that in the non-asymptotic setting (where the number of training data points is relatively small), Bayesian inference does not readily provide a guarantee for performance on out-of-sample data. Existing theorems describe posterior contraction (Ghosal et al., 2000; Blackwell & Dubins, 1962); however, in practical settings, for a finite number of training steps and for finite training data, it is often difficult to precisely characterise how much the posterior concentrates. Furthermore, theorems on posterior contraction are somewhat unsatisfying in the supervised classification setting, in which the cold posterior effect is usually discussed. Ideally, one would want a theoretical analysis that links the posterior distribution to the test error directly. Here, we investigate PAC-Bayes generalization bounds (McAllester, 1999; Catoni, 2007; Alquier et al., 2016; Dziugaite & Roy, 2017) as the model that governs performance on out-of-sample data.

PAC-Bayes bounds describe the performance on out-of-sample data through an application of the convex duality relation between measurable functions and probability measures. The convex duality relationship naturally gives rise to the log-Laplace transform of a special random variable (Catoni, 2007). Importantly, the log-Laplace transform has a temperature parameter λ which is not constrained to be λ = 1. We investigate the relationship of this temperature parameter to cold posteriors. In summary, our contributions are the following:

• Through detailed experiments for the Laplace approximation to the posterior, we show that PAC-Bayes bounds correlate with out-of-sample performance for different values of the temperature parameter λ. This might indicate that the temperature in the cold posterior literature coincides with the temperature of the log-Laplace transform, and motivate Bayesian practitioners to use heuristics when targeting Frequentist metrics.

• Contrary to Wenzel et al. (2020), we find that the coldest temperature (such that the posterior is a Dirac delta centered on a MAP estimate of the weights) is empirically almost always optimal in terms of test accuracy. PAC-Bayes bounds track and predict this behaviour. However, the negative log-likelihood and the Expected Calibration Error (ECE) have a more complex behaviour. Contrary to prior work (Wenzel et al., 2020; Noci et al., 2021), this highlights that the choice of evaluation metric plays an important role when discussing the cold posterior effect. More importantly, we show that to improve the test ECE or NLL one typically needs to reduce the test accuracy.

• We derive a PAC-Bayes bound for the case of the widely used generalized Gauss-Newton Laplace approximations to the posterior. Contrary to prior work (Bachmann et al., 2022; Aitchison, 2021), our bound implies that λ does not simply fix a misspecified prior or likelihood. For a fixed target test risk, likelihood and prior, the required λ varies due to the stochasticity of the inference procedure and the shape of the loss landscape.

We also include a detailed FAQ section in the Appendix.

2. COLD POSTERIOR EFFECT: MISSPECIFIED AND NON-ASYMPTOTIC SETTING

We denote the learning sample $(X, Y) = \{(x_i, y_i)\}_{i=1}^{n} \in (\mathcal{X} \times \mathcal{Y})^n$, which contains $n$ input-output pairs. Observations $(X, Y)$ are assumed to be sampled randomly from a distribution $\mathcal{D}$. Thus, we denote by $(X, Y) \sim \mathcal{D}^n$ the i.i.d. observation of $n$ elements. We consider loss functions $\ell : \mathcal{F} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{F}$ is a set of predictors $f : \mathcal{X} \to \mathcal{Y}$. We also denote the risk $L^{\ell}_{\mathcal{D}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\, \ell(f, x, y)$ and the empirical risk $\hat{L}^{\ell}_{X,Y}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, x_i, y_i)$. We consider two probability measures, the prior $\pi \in \mathcal{M}(\mathcal{F})$ and the approximate posterior $\rho \in \mathcal{M}(\mathcal{F})$. Here, $\mathcal{M}(\mathcal{F})$ denotes the set of all probability measures on $\mathcal{F}$. We encounter cases where we make predictions using the approximate posterior predictive distribution $\mathbb{E}_{f\sim\rho}[p(y|x, f)]$.

We will use two loss functions: the non-differentiable zero-one loss $\ell_{01}(f, x, y) = \mathbb{I}(\arg\max_j f(x)_j \neq y)$, and the negative log-likelihood, which is a commonly used differentiable surrogate, $\ell_{\mathrm{nll}}(f, x, y) = -\log p(y|x, f)$. We assume that the outputs of $f$ form a probability distribution $p(y|x, f)$ either through a Gaussian likelihood (in the case of regression) or using the softmax activation function (in the case of classification). Given the above, the Evidence Lower Bound (ELBO) has the following form

$$-\mathbb{E}_{f\sim\rho}\, \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f) - \frac{1}{\lambda n}\mathrm{KL}(\rho\|\pi), \tag{1}$$

where λ = 1. Note that our temperature parameter λ is the inverse of the one typically used in cold posterior papers. In this form λ has a clearer interpretation as the temperature of a log-Laplace transform. Our setup is discussed in Wenzel et al. (2020), p. 3, Section 2.3, and used in Bachmann et al. (2022) and Aitchison (2021). While Wenzel et al. (2020) use MCMC to conduct their experiments, we opt for the ELBO for analytical tractability. While Wenzel et al. (2020) temper both the likelihood and the prior by λ, as discussed in Aitchison (2021) and Wenzel et al. (2020) the relevant setting for the ELBO is the one of (1), where only the KL is tempered.
One then typically models the posterior and prior distributions over weights using a parametric distribution (commonly a Gaussian) and optimizes the ELBO, using the reparametrization trick, to find the posterior distribution (Blundell et al., 2015; Khan et al., 2018). The cold posterior effect is the following observation: even though the ELBO has the form (1) with λ = 1, practitioners have found that much larger values λ ≫ 1 typically result in better test-time performance, for example a lower test misclassification rate and a lower test negative log-likelihood.

The starting point of our discussion will thus be to define the quantity that we care about in the context of Bayesian deep neural networks and cold posterior analyses. Concretely, in the setting of supervised prediction, what we often try to minimize is

$$\mathrm{KL}\big(p_{\mathcal{D}}(y|x)\,\|\,\mathbb{E}_{f\sim\rho}[p(y|x, f)]\big) = \mathbb{E}_{x,y\sim\mathcal{D}}\left[\ln \frac{p_{\mathcal{D}}(y|x)}{\mathbb{E}_{f\sim\rho}[p(y|x, f)]}\right], \tag{2}$$

the conditional relative entropy (Cover, 1999) between the true conditional distribution $p_{\mathcal{D}}(y|x)$ and the posterior predictive distribution $\mathbb{E}_{f\sim\rho}[p(y|x, f)]$. For example, this is implicitly the quantity that we minimize when optimizing classifiers using the cross-entropy loss (Masegosa, 2020; Morningstar et al., 2022). It is also on this and similar predictive metrics that the cold posterior effect appears. In the following we will outline the relationship between the ELBO, PAC-Bayes and (2).
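The tempered objective (1) and the temperature convention can be made concrete with a minimal sketch. It assumes that the expected empirical NLL and the KL term have already been estimated; the function names `tempered_elbo` and `cold_posterior_temperature` and the numeric values are illustrative and do not come from the paper.

```python
def tempered_elbo(expected_nll, kl, n, lam=1.0):
    """Tempered objective  -E_rho[empirical NLL] - KL(rho || pi) / (lam * n).

    lam = 1 recovers the standard (per-sample) ELBO of Eq. (1).
    """
    if lam <= 0 or n <= 0:
        raise ValueError("lam and n must be positive")
    return -expected_nll - kl / (lam * n)


def cold_posterior_temperature(lam):
    """Temperature T = 1 / lam, the convention common in cold posterior papers."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return 1.0 / lam


if __name__ == "__main__":
    # Hypothetical estimates, chosen only for illustration.
    expected_nll, kl, n = 0.35, 1200.0, 50_000
    for lam in (0.1, 1.0, 10.0):
        value = tempered_elbo(expected_nll, kl, n, lam)
        print(f"lam={lam:g}  T={cold_posterior_temperature(lam):g}  objective={value:.5f}")
```

For a fixed pair (expected NLL, KL), a larger λ down-weights the KL penalty, which is the sense in which λ ≫ 1 corresponds to a "cold" posterior here.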

2.1. ELBO

We assume a training sample $(X, Y) \sim \mathcal{D}^n$ as before, denote by $p(w|X, Y)$ the true posterior probability over predictors $f$ parameterized by $w$ (typically weights for neural networks), and by $\pi$ and $\rho$ respectively the prior and variational posterior distributions as before. The ELBO results from the following calculations

$$
\begin{aligned}
\mathrm{KL}(\rho(w)\|p(w|X, Y)) &= \int \rho(w) \ln \frac{\rho(w)}{p(w|X, Y)}\, dw \\
&= \int \rho(w) \ln \frac{\rho(w)\, p(Y|X)}{\pi(w)\, p(Y|X, w)}\, dw \\
&= \int \rho(w) \left[-\ln p(Y|X, w) + \ln \frac{\rho(w)}{\pi(w)} + \ln p(Y|X)\right] dw \\
&= -n \underbrace{\left[-\mathbb{E}_{f\sim\rho}\, \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f) - \frac{1}{n}\mathrm{KL}(\rho\|\pi)\right]}_{\text{ELBO}} + \ln p(Y|X).
\end{aligned}
$$

Thus, maximizing the ELBO can be seen as minimizing the KL divergence between the variational posterior and the true posterior over the weights, $\mathrm{KL}(\rho(w)\|p(w|X, Y))$. The true posterior distribution $p(w|X, Y)$ gives more probability mass to predictors which are more likely given the training data; however, these predictors do not necessarily minimize $\mathrm{KL}(p_{\mathcal{D}}(y|x)\|\mathbb{E}_{f\sim\rho}[p(y|x, f)])$, the evaluation metric of choice (2) for supervised prediction.

In the well-specified regime (where the true predictor $f^*$ satisfies $f^* \in \mathcal{F}$) and when $n \to \infty$, the Blackwell-Dubins consistency theorem (Blackwell & Dubins, 1962) implies that the posterior quickly concentrates on the true set of parameters. In such cases, a more detailed analysis, such as a PAC-Bayesian one, is unnecessary as the posterior is akin to a Dirac delta mass at the true parameters. However, neural networks do not operate in this regime. The existence of multiple minima hints that neural networks are misspecified, and the number of samples is small relative to the number of parameters. Operating in the regime where $f^* \notin \mathcal{F}$ and where $n$ is (comparatively) small makes it important to derive a more precise certificate of generalization through a generalization bound, which directly bounds the true risk. In the following we focus on analyzing a PAC-Bayes bound in order to obtain insights into when the cold posterior effect occurs.
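The identity KL(ρ‖p(w|X, Y)) = n E_ρ[empirical NLL] + KL(ρ‖π) + ln p(Y|X) can be checked numerically in a toy setting where every term has a closed form. The sketch below is not the paper's setup: it assumes a scalar weight w with prior N(0, prior_var), observations y_i ~ N(w, lik_var) with no inputs, and a Gaussian ρ = N(m, v). All names are illustrative.

```python
import math


def gaussian_kl(m_q, v_q, m_p, v_p):
    """KL(N(m_q, v_q) || N(m_p, v_p)) for scalar Gaussians."""
    return 0.5 * (v_q / v_p + (m_q - m_p) ** 2 / v_p - 1.0 - math.log(v_q / v_p))


def elbo_identity_sides(ys, m, v, lik_var=1.0, prior_var=1.0):
    """Return (KL(rho || posterior), n * E_rho[NLL] + KL(rho || prior) + ln p(Y))."""
    n = len(ys)
    sum_y = sum(ys)
    sum_y2 = sum(y * y for y in ys)

    # n * E_{w ~ rho}[empirical NLL], using E[(y - w)^2] = (y - m)^2 + v.
    n_expected_nll = 0.5 * n * math.log(2 * math.pi * lik_var) + sum(
        (y - m) ** 2 + v for y in ys
    ) / (2 * lik_var)

    kl_to_prior = gaussian_kl(m, v, 0.0, prior_var)

    # Exact conjugate posterior N(post_mean, 1 / post_prec).
    post_prec = 1.0 / prior_var + n / lik_var
    post_mean = (sum_y / lik_var) / post_prec
    kl_to_posterior = gaussian_kl(m, v, post_mean, 1.0 / post_prec)

    # Log evidence of Y ~ N(0, lik_var * I + prior_var * 1 1^T),
    # via the matrix determinant lemma and the Sherman-Morrison formula.
    log_det = (n - 1) * math.log(lik_var) + math.log(lik_var + n * prior_var)
    quad = (sum_y2 - prior_var * sum_y ** 2 / (lik_var + n * prior_var)) / lik_var
    log_evidence = -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

    return kl_to_posterior, n_expected_nll + kl_to_prior + log_evidence


if __name__ == "__main__":
    lhs, rhs = elbo_identity_sides([0.3, -1.2, 0.8, 2.1], m=0.4, v=0.2,
                                   lik_var=0.5, prior_var=2.0)
    print(f"KL(rho || posterior) = {lhs:.6f}")
    print(f"-n * ELBO + ln p(Y)  = {rhs:.6f}")
```

The two printed numbers agree up to floating-point error for any choice of m and v > 0, which is exactly the statement that maximizing the ELBO minimizes the KL to the true posterior.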

2.2. PAC-BAYES

We first look at the following bound, denoted by $B_{\text{Alquier}}$. It was considered by Alquier et al. (2016) (Theorem 4.1); see also Theorem 1 by Masegosa (2020) for a statement under the same conditions.

Theorem 1 ($B_{\text{Alquier}}$, Alquier et al., 2016). Given a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, a hypothesis set $\mathcal{F}$, a loss function $\ell : \mathcal{F} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, a prior distribution $\pi$ over $\mathcal{F}$, real numbers $\delta \in (0, 1]$ and $\lambda > 0$, with probability at least $1 - \delta$ over the choice $(X, Y) \sim \mathcal{D}^n$, we have for all $\rho$ on $\mathcal{F}$

$$\mathbb{E}_{f\sim\rho}\, L^{\ell}_{\mathcal{D}}(f) \le \mathbb{E}_{f\sim\rho}\, \hat{L}^{\ell}_{X,Y}(f) + \frac{1}{\lambda n}\left[\mathrm{KL}(\rho\|\pi) + \ln\frac{1}{\delta} + \Psi_{\ell,\pi,\mathcal{D}}(\lambda, n)\right],$$

where $\Psi_{\ell,\pi,\mathcal{D}}(\lambda, n) = \ln \mathbb{E}_{f\sim\pi}\, \mathbb{E}_{X',Y'\sim\mathcal{D}^n} \exp\left[\lambda n\left(L^{\ell}_{\mathcal{D}}(f) - \hat{L}^{\ell}_{X',Y'}(f)\right)\right]$.

There are three different terms in the above bound. The empirical risk term $\mathbb{E}_{f\sim\rho}\, \hat{L}^{\ell}_{X,Y}(f)$ is the empirical mean of the loss of the classifier over all training samples. The KL term $\frac{1}{\lambda n}\mathrm{KL}(\rho\|\pi)$ is the complexity of the model, which in this case is measured as the KL divergence between the posterior and prior distributions. The Moment term $\frac{1}{\lambda n}\Psi_{\ell,\pi,\mathcal{D}}(\lambda, n)$ is the log-Laplace transform of the difference between the risk and empirical risk for a reversal of the temperature. We will keep the name "Moment" in the following. Note that unless stated otherwise (for example in Section 4), we can make some assumption on the risk to ensure that the Moment term is bounded. Such common assumptions include that the risk is a sub-Gaussian or sub-Gamma random variable under the prior $\pi$ and the distribution $\mathcal{D}$ (see e.g. Germain et al., 2016) or more generally sub-Weibull (Vladimirova et al., 2019; 2020).

Using a PAC-Bayes bound together with Jensen's inequality, one can bound (2) directly as follows

$$
\begin{aligned}
\mathrm{KL}\big(p_{\mathcal{D}}(y|x)\,\|\,\mathbb{E}_{f\sim\rho}[p(y|x, f)]\big) &= \mathbb{E}_{x,y\sim\mathcal{D}}\left[\ln \frac{p_{\mathcal{D}}(y|x)}{\mathbb{E}_{f\sim\rho}[p(y|x, f)]}\right] \\
&= \mathbb{E}_{x,y\sim\mathcal{D}}\big[-\ln \mathbb{E}_{f\sim\rho}[p(y|x, f)]\big] + \mathbb{E}_{x,y\sim\mathcal{D}}[\ln p_{\mathcal{D}}(y|x)] \\
&\le \mathbb{E}_{x,y\sim\mathcal{D}}\big[\mathbb{E}_{f\sim\rho}[-\ln p(y|x, f)]\big] + \mathbb{E}_{x,y\sim\mathcal{D}}[\ln p_{\mathcal{D}}(y|x)] \\
&\le \underbrace{\mathbb{E}_{f\sim\rho}\, \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f) + \frac{1}{\lambda n}\left[\mathrm{KL}(\rho\|\pi) + \ln\frac{1}{\delta} + \Psi_{\ell_{\mathrm{nll}},\pi,\mathcal{D}}(\lambda, n)\right]}_{\text{PAC-Bayes}} + \mathbb{E}_{x,y\sim\mathcal{D}}[\ln p_{\mathcal{D}}(y|x)].
\end{aligned}
$$

The first inequality is Jensen's inequality applied to the convex function $-\ln$. The last line holds under the conditions of Theorem 1 and in particular with probability at least $1 - \delta$ over the choice $(X, Y) \sim \mathcal{D}^n$. Notice here the presence of the temperature parameter $\lambda > 0$, which need not be λ = 1. In particular it is easy to see that maximizing the ELBO is equivalent to minimizing a PAC-Bayes bound for λ = 1, which might not necessarily be optimal for a finite sample size. More specifically, even for exact inference, where $\mathbb{E}_{w\sim\rho}[p(y|x, w)]\big|_{\rho = p(w|X,Y)} = p(y|x, X, Y)$, the Bayesian posterior predictive distribution does not necessarily minimize $\mathrm{KL}(p_{\mathcal{D}}(y|x)\|\mathbb{E}_{f\sim\rho}[p(y|x, f)])$.
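The Jensen step used above, −ln E_ρ[p(y|x, f)] ≤ E_ρ[−ln p(y|x, f)], can be illustrated for a single data point. In this minimal sketch the list `probs` stands for the likelihoods p(y|x, f_s) that a few hypothetical posterior samples f_s assign to the observed label; the values and function names are assumptions made only for illustration.

```python
import math


def nll_of_predictive_mean(probs):
    """-ln of the Monte Carlo posterior predictive (mean likelihood over samples)."""
    return -math.log(sum(probs) / len(probs))


def mean_of_nll(probs):
    """Average per-sample NLL, i.e. the Gibbs-style risk bounded by PAC-Bayes."""
    return sum(-math.log(p) for p in probs) / len(probs)


if __name__ == "__main__":
    # Hypothetical likelihoods of the true label under three sampled predictors.
    probs = [0.9, 0.5, 0.1]
    print("NLL of predictive mean:", round(nll_of_predictive_mean(probs), 4))
    print("mean of per-sample NLL:", round(mean_of_nll(probs), 4))
```

The gap between the two quantities is the slack introduced by Jensen's inequality; it vanishes when all sampled predictors agree on the likelihood of the label.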

2.3. CLASSIFICATION TASKS

For classification tasks, we are typically mainly interested in achieving a low expected zero-one risk $\mathbb{E}_{f\sim\rho}\, L^{\ell_{01}}_{\mathcal{D}}(f)$. The ELBO objective is not directly related to this risk. However, in the PAC-Bayesian literature there exist bounds specifically adapted to it. In the following we will use one of the tightest and most commonly used bounds, the "Catoni" bound, denoted $B_{\text{Catoni}}$, from Catoni (2007), Theorem 1.2.6.

Theorem 2 ($B_{\text{Catoni}}$, Catoni, 2007). Given a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, a hypothesis set $\mathcal{F}$, the 0-1 loss function $\ell_{01} : \mathcal{F} \times \mathcal{X} \times \mathcal{Y} \to [0, 1]$, a prior distribution $\pi$ over $\mathcal{F}$, a real number $\delta \in (0, 1]$, and a real number $\lambda > 0$, with probability at least $1 - \delta$ over the choice of $(X, Y) \sim \mathcal{D}^n$, we have for all $\rho$ on $\mathcal{F}$:

$$\mathbb{E}_{f\sim\rho}\, L^{\ell_{01}}_{\mathcal{D}}(f) \le \Phi^{-1}_{\lambda}\left(\mathbb{E}_{f\sim\rho}\, \hat{L}^{\ell_{01}}_{X,Y}(f) + \frac{1}{\lambda n}\left[\mathrm{KL}(\rho\|\pi) + \ln\frac{1}{\delta}\right]\right),$$

where $\Phi^{-1}_{\lambda}(x) = \frac{1 - e^{-\lambda x}}{1 - e^{-\lambda}}$.

Similarly to the Alquier bound, the empirical risk term is the empirical mean of the loss of the classifier over all training samples. The KL term is the complexity of the model, which in this case is measured as the KL divergence between the posterior and prior distributions. The Moment term has been absorbed in this case into the function $\Phi^{-1}_{\lambda}$.
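The right-hand side of Theorem 2 is cheap to evaluate once the empirical 0-1 risk and the KL term are known. The following is a minimal sketch of that evaluation; the function name `catoni_bound`, the default δ = 0.05 and the example numbers are assumptions for illustration, not values from the paper.

```python
import math


def catoni_bound(emp_risk, kl, n, lam, delta=0.05):
    """Right-hand side of the Catoni bound for the 0-1 loss.

    emp_risk: E_rho[empirical 0-1 risk], in [0, 1]
    kl:       KL(rho || pi), non-negative
    n:        number of samples the bound is evaluated on
    lam:      temperature lambda > 0
    delta:    confidence parameter in (0, 1]
    """
    if lam <= 0 or n <= 0 or not (0 < delta <= 1):
        raise ValueError("need lam > 0, n > 0 and delta in (0, 1]")
    arg = emp_risk + (kl + math.log(1.0 / delta)) / (lam * n)
    # Phi_lambda^{-1}(x) = (1 - exp(-lam * x)) / (1 - exp(-lam)); expm1 keeps
    # precision when lam * x is small.
    return math.expm1(-lam * arg) / math.expm1(-lam)


if __name__ == "__main__":
    # Hypothetical inputs: 10% empirical error, 10,000 evaluation samples.
    for lam in (0.1, 1.0, 10.0):
        print(lam, catoni_bound(emp_risk=0.10, kl=50.0, n=10_000, lam=lam))
```

Values above 1 are vacuous for the 0-1 loss; a caller may wish to clip the result with `min(1.0, ...)`.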

2.4. SAFE-BAYES AND OTHER RELEVANT WORK

After identifying two sources of misspecification in standard Bayesian inference, Grünwald & Langford (2007) proposed a solution through an approach which they named Safe-Bayes (Grünwald, 2012; Grünwald & Van Ommen, 2017). Safe-Bayes corresponds to finding a temperature parameter λ for a generalized (tempered) posterior distribution, with λ possibly different from 1. The optimal value of λ is found by taking a sequential view of Bayesian inference, and for a Cesàro-averaged posterior, which is an average of the posteriors at different optimization steps and which does not coincide with the standard posterior. The analysis of Grünwald (2012) and Grünwald & Van Ommen (2017) is also restricted to the case where λ < 1. By contrast, we provide an analytical expression of the bound on the true risk, given λ, and also numerically investigate the case of λ > 1. Our analysis thus provides intuition regarding which parameters (for example the curvature) might result in cold posteriors. Catoni (2007) discusses the optimal value of the temperature λ for PAC-Bayes bounds, for fixed priors and posteriors. By contrast, we investigate the case where the posterior is optimized for different λ, which is the relevant one for the cold posterior literature. Germain et al. (2016) find that minimizing a PAC-Bayesian generalization risk bound maximizes the Bayesian marginal likelihood. However, they only investigate the case where λ = 1.

3.1. EXPERIMENTAL SETUP

The ELBO (1) is maximized at the probability density $\rho^{\star}(f)$ given by

$$\rho^{\star}(f) := \frac{\pi(f)\, e^{-\lambda n \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f)}}{\mathbb{E}_{f\sim\pi}\, e^{-\lambda n \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f)}}$$

(Catoni, 2007). We will use the Laplace approximation to the posterior in our experiments. This is equivalent to approximating $\lambda n \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f)$ using a second-order Taylor expansion around a minimum $w_\rho$, such that

$$\lambda n \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f_w) \approx \lambda n \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f_{w_\rho}) + \lambda n (w - w_\rho)^{\top} \left(\frac{1}{2}\nabla\nabla \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f_w)\Big|_{w = w_\rho}\right) (w - w_\rho).$$

Assuming a Gaussian prior $\pi = \mathcal{N}(0, \sigma_\pi^2 I)$, the Laplace approximation to the posterior $\rho$ is again a Gaussian

$$\rho = \mathcal{N}\left(w_\rho, \left(\lambda H + \frac{1}{\sigma_\pi^2} I\right)^{-1}\right),$$

where $H$ is the network Hessian $H = n \nabla\nabla \hat{L}^{\ell_{\mathrm{nll}}}_{X,Y}(f_w)\big|_{w = w_\rho}$. This Hessian is generally infeasible to compute in practice for modern deep neural networks, such that many approaches employ the generalized Gauss-Newton (GGN) approximation

$$H^{\mathrm{GGN}} := \sum_{i=1}^{n} J_w(x_i)^{\top} \Lambda(y_i; f_i) J_w(x_i),$$

where $J_w(x)$ is the network per-sample Jacobian, $[J_w(x)]_c = \nabla_w f_c(x; w_\rho)$, and $\Lambda(y; f) = -\nabla^2_{ff} \log p(y; f)$ is the per-input noise matrix (Kunstner et al., 2019). We will use two simplified versions of the GGN:

• An isotropic approximation with variance $\sigma_\rho^2(\lambda)$ such that $\sigma_\rho^2(\lambda) = \left(\frac{\lambda h}{d} + \frac{1}{\sigma_\pi^2}\right)^{-1}$, where $h = \sum_{i,j,k} g(i, k)\left(\nabla_w f_k(x_i; w_\rho)_j\right)^2$ is the trace of the Gauss-Newton approximation to the Hessian, with $g(i, k) = [\Lambda(y_i; f)]_{kk}$.

• The Kronecker-Factorized Approximate Curvature (KFAC) (Martens & Grosse, 2015) approximation, which retains only a block-diagonal part of the GGN.

When making predictions, we use the posterior predictive distribution $\mathbb{E}_{w\sim\rho}[p(y|x, f_w)]$ of the full neural network model, meaning that samples from $\rho$ are fed through the full neural network. Since the 0-1 loss is not differentiable, the posterior estimated with the cross-entropy loss will be used for classification problems. We have tested extensively on realistic classification tasks.
We used the CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011) and FashionMnist (Xiao et al., 2017) datasets. In all experiments, we split the dataset into the three sets typical of prediction tasks: a training set $Z_{\text{train}}$, a testing set $Z_{\text{test}}$, and a validation set $Z_{\text{validation}}$. We use Monte Carlo sampling to estimate the Empirical Risk term ($f \sim \rho$). For the isotropic Laplace approximation and a Gaussian isotropic prior, the KL divergence has a simple analytical expression

$$\mathrm{KL}(\rho\|\pi) = \frac{1}{2}\left[d\,\frac{\sigma_\rho^2(\lambda)}{\sigma_\pi^2} + \frac{1}{\sigma_\pi^2}\|w_\rho - w_\pi\|_2^2 - d - d \ln \sigma_\rho^2(\lambda) + d \ln \sigma_\pi^2\right].$$

Figure 3: The Pareto front of the test ECE compared to test 0-1 loss. Apart from the case of FMNIST we see a clear trade-off between 0-1 loss and ECE. As such there seems to be no λ that simultaneously doesn't hurt the test 0-1 loss while improving the ECE, which is the most pertinent task, in light of the different behaviours of the different metrics with respect to λ.

PAC-Bayes bounds require correct control of the prior mean, as the $\ell_2$ distance between prior and posterior means in the KL term is often the dominant term in the bound. To control this distance, we follow a variation of the approach in Dziugaite et al. (2021) to constructing our classifiers. We first use $Z_{\text{train}}$ to find a prior mean $w_\pi$. We then set the posterior mean equal to the prior mean, $w_\rho = w_\pi$, but evaluate the r.h.s. of the bounds on $Z_{\text{validation}}$. Note that in this way $\|w_\rho - w_\pi\|_2^2 = 0$, while the bound is still valid since the prior is independent of the evaluation set $X, Y = Z_{\text{validation}}$. For the CIFAR-10, CIFAR-100, and SVHN datasets, we use a WideResNet22 (Zagoruyko & Komodakis, 2016) with Fixup initialization (Zhang et al., 2019). For the FashionMnist dataset, we use a convolutional architecture with three convolutional layers, followed by two fully connected non-linear layers. More details on the experimental setup can be found in the Appendix.
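The isotropic Laplace variance and the closed-form KL of this section can be sketched in a few lines. The snippet assumes the variance has the form σ²ρ(λ) = (λh/d + 1/σ²π)⁻¹, which matches the limits described in the caption of Figure 1(b); the function names and the values of h and d are hypothetical.

```python
import math


def isotropic_posterior_variance(lam, h, d, prior_var):
    """sigma_rho^2(lam) = 1 / (lam * h / d + 1 / prior_var) (assumed form)."""
    return 1.0 / (lam * h / d + 1.0 / prior_var)


def isotropic_gaussian_kl(post_var, prior_var, d, sq_mean_dist=0.0):
    """KL(N(w_rho, post_var * I) || N(w_pi, prior_var * I)) in d dimensions.

    sq_mean_dist is ||w_rho - w_pi||_2^2; it is 0 when the posterior mean is
    set equal to the prior mean, as in the experimental setup.
    """
    return 0.5 * (
        d * post_var / prior_var
        + sq_mean_dist / prior_var
        - d
        - d * math.log(post_var)
        + d * math.log(prior_var)
    )


if __name__ == "__main__":
    # Hypothetical curvature trace h and parameter count d; prior variance 0.1
    # as in the experiments.
    h, d, prior_var, n = 5.0e4, 1_000, 0.1, 10_000
    for lam in (1e-7, 1e-2, 1.0, 1e2, 1e4):
        var = isotropic_posterior_variance(lam, h, d, prior_var)
        kl = isotropic_gaussian_kl(var, prior_var, d)
        print(f"lam={lam:g}  var={var:.3e}  KL={kl:.2f}  KL/(lam*n)={kl / (lam * n):.3e}")
```

With these assumed inputs, small λ leaves the variance at the prior variance and the KL near zero, while large λ shrinks the variance toward a point mass on the MAP estimate.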

3.2. CLASSIFICATION EXPERIMENTS

We find ten MAP estimates for the neural network weights on the CIFAR-10, CIFAR-100, SVHN and FMNIST datasets by training on $Z_{\text{train}}$ using SGD. We then fit an isotropic Laplace approximation to each MAP estimate using $X, Y = Z_{\text{validation}}$. For different values of λ we then estimate the Catoni bound (Theorem 2) using $Z_{\text{validation}}$. We also estimate the test 0-1 loss, the negative log-likelihood (NLL) and the Expected Calibration Error (ECE) (Naeini et al., 2015) of the posterior predictive on $Z_{\text{test}}$. We use the prior variance $\sigma_\pi^2 = 0.1$, as optimizing the marginal likelihood leads to $\sigma_\pi^2 \approx 0$, which is not relevant for BNNs.

We also test a standard KFAC Laplace setup. Specifically, we fit the KFAC Laplace approximation on $Z_{\text{train}}$ and also choose the prior through the marginal likelihood. In this case, we estimate a standard validation set bound using the validation set $Z_{\text{validation}}$ (instead of a PAC-Bayes bound), since we know from the literature that any PAC-Bayes bound will be vacuous (larger than 1) when we do not control $\|w_\rho - w_\pi\|_2^2$.

We plot the results for all datasets in Figure 2. The Catoni bound correlates tightly with the test 0-1 loss for all datasets, and we plot this correlation in Figure 1(a). Contrary to Wenzel et al. (2020), in terms of test 0-1 loss the MAP estimate (obtained where λ ≫ 1 and the posterior is "coldest") is almost always optimal. This behaviour is replicated in the "KFAC" case. This result is more coherent than the one in Wenzel et al. (2020) (Figure 1), where the 0-1 test loss at the coldest temperature and the 0-1 test loss of the MAP estimate don't match. It highlights that in a tightly controlled setting, Bayesian approaches often don't improve at all over deterministic ones in terms of test 0-1 loss, for any temperature λ.

We plot in Figure 2 (bottom row) the NLL for the KFAC case. Even without data augmentation and even when we optimize the prior variance using the marginal likelihood, we find that all three cases of temperatures (cold posterior, warm posterior, as well as posterior with λ = 1) can be optimal, for varying datasets. This highlights the importance of the choice of the evaluation metric when discussing the cold posterior effect, as results can vary significantly depending on our choice. Specifically, one can see in Kapoor et al. (2022), p. 18, and Aitchison (2021), p. 9, that different metrics have different behaviours and/or minima with respect to λ.

We then ask the more pertinent question: "If the Laplace approximation doesn't in general improve the test 0-1 loss, can it (for some value of λ) retain the same test 0-1 loss while improving calibration?". In Figure 3 we see that in most of our experiments we couldn't find such a temperature λ. Apart from FMNIST, there seems to be a clear trade-off between test 0-1 loss and calibration error.
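The trade-off between 0-1 loss and calibration can be examined with two small helpers: a binned ECE estimate and a Pareto-front filter over (0-1 loss, ECE) pairs collected at different values of λ. This is a minimal sketch using equal-width confidence bins, which is one common way to compute ECE and may differ from the paper's exact evaluation code; the function names and the example points are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1]
    correct:     booleans, True where the prediction matched the label
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def pareto_front(points):
    """Keep the (zero_one_loss, ece) pairs not dominated by another pair.

    Lower is better in both coordinates.
    """
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)


if __name__ == "__main__":
    # Hypothetical (0-1 loss, ECE) pairs, one per temperature.
    results = [(0.05, 0.10), (0.06, 0.05), (0.07, 0.07), (0.10, 0.02)]
    print(pareto_front(results))
```

A temperature that retains the best 0-1 loss while improving calibration would show up as a single point dominating the rest of the front.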

4. THE TEMPERATURE λ, ALEATORIC UNCERTAINTY AND CURVATURE

In light of our empirical results, it would be interesting to derive an analytical form that elucidates the important variables that affect the bound. However, PAC-Bayes objectives are difficult to analyze theoretically for the non-convex case. Thus in the following we make a number of simplifying assumptions. The Laplace approximation with the Generalized Gauss-Newton approximation to the Hessian corresponds to a linearization of the neural network around the MAP estimate $w_\rho \in \mathbb{R}^d$ (Immer et al., 2021)

$$f_{\mathrm{lin}}(x; w) = f(x; w_\rho) + \nabla_w f(x; w_\rho)^{\top}(w - w_\rho).$$

When analyzing minima of the loss landscape, linearization is reasonable even without assuming infinite width (Zancato et al., 2020; Maddox et al., 2021). For appropriate modelling choices, we aim at deriving a bound for this linearized model. We adopt the aforementioned linear form together with the Gaussian likelihood, yielding

$$\ell_{\mathrm{nll}}(w, x, y) = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\left(y - f(x; w_\rho) - \nabla_w f(x; w_\rho)^{\top}(w - w_\rho)\right)^2.$$

We also make the following modeling choices:

1) Prior over weights: $w \sim \mathcal{N}(w_\pi, \sigma_\pi^2 I)$.
2) Gradients as a Gaussian mixture: $\nabla_w f(x; w_\rho) \sim \sum_{i=1}^{k} \phi_i\, \mathcal{N}(\mu_i, \sigma_{x_i}^2 I)$.
3) Labeling function: $y = f(x; w_\rho) + \nabla_w f(x; w_\rho)^{\top}(w^* - w_\rho) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$.

We also assume that we have a deterministic estimate of the posterior weights $w_\rho$ which we keep fixed, and we model the posterior as $\rho = \mathcal{N}(w_\rho, \sigma_\rho^2(\lambda) I)$, similarly to our experimental section. Therefore estimating the posterior corresponds to estimating the variance $\sigma_\rho^2(\lambda)$.

Proposition 1 ($B_{\text{approximate}}$). With the above modeling choices, and given a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, real numbers $\delta \in (0, 1]$ and $\lambda \in (0, \frac{1}{c})$ with $c = 2n\sigma_x^2\sigma_\pi^2$, with probability at least $1 - \delta$ over the choice $(X, Y) \sim \mathcal{D}^n$, we have

$$
\begin{aligned}
\mathbb{E}_{w\sim\rho}\, L^{\ell_{\mathrm{nll}}}_{\mathcal{D}}(w) \le{}& \underbrace{\frac{\|y - f(X; w_\rho)\|_2^2}{2n\sigma^2} + \left(\frac{\lambda h}{d\sigma^2} + \frac{1}{\sigma_\pi^2}\right)^{-1}\frac{h}{2n\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)}_{\text{Empirical Risk}} \\
&+ \underbrace{\frac{\sigma_x^2\left(\sigma_\pi^2 d + \|w^*\|_2^2\right)}{1 - 2\lambda n\sigma_x^2\sigma_\pi^2} + \sigma_\epsilon^2}_{\text{Moment}} \\
&+ \underbrace{\frac{1}{\lambda n}\left[\frac{1}{2}\left(\frac{d}{\sigma_\pi^2}\cdot\frac{1}{\frac{\lambda h}{d\sigma^2} + \frac{1}{\sigma_\pi^2}} + \frac{1}{\sigma_\pi^2}\|w_\rho - w_\pi\|_2^2 - d - d\ln\frac{1}{\frac{\lambda h}{d\sigma^2} + \frac{1}{\sigma_\pi^2}} + d\ln\sigma_\pi^2\right) + \ln\frac{1}{\delta}\right]}_{\text{KL}}
\end{aligned}
$$

where $h = \sum_i \sum_j \left(\nabla_w f(x_i; w_\rho)_j\right)^2$ is the curvature parameter, $\sigma_x^2 = \sum_{j=1}^{k} \phi_j \sigma_{x_j}^2$ is the posterior gradient variance, and $\sigma^2$ is the variance of the likelihood function.

We now make a number of observations regarding Proposition 1. Here, $h$ is the trace of the Hessian under the Gauss-Newton approximation (without a scaling factor $n$). Under the PAC-Bayesian modeling of the risk, cold posteriors are the result of a complex interaction between various parameters resulting from 1) our model, such as the prior variance $\sigma_\pi^2$ and prior mean $w_\pi$, and 2) our data, $\sigma_x^2$ and $w^*$ (the curvature of the minimum $h$ and the MAP estimate $w_\rho$ depend on the deep neural network architecture, the optimization procedure and the data). Contrary to prior work (Bachmann et al., 2022; Nabarro et al., 2022; Aitchison, 2021), our bound suggests that λ cannot be seen as simply fixing a misspecified likelihood variance $\sigma^2$ or prior variance $\sigma_\pi^2$. In particular, it does not simply rescale the aforementioned quantities. Furthermore, our bound implies that even for a fixed prior and likelihood the same λ can imply different test risk based on the properties of each MAP estimate, such as the curvature $h$ and the distance from initialization $\|w_\rho - w_\pi\|_2^2$. We can observe this in our empirical data.
We fit a Laplace approximation with a similar procedure as the KFAC case of Figure 2 but using the isotropic approximation to the posterior, for the CIFAR-10, CIFAR-100, SVHN and FMNIST datasets. We keep the prior variance fixed at $\sigma_\pi^2 = 0.1$. For each value of $\lambda \in [10^{-7}, 10^{4}]$ we compute the test 0-1 loss of all the Laplace approximations at the different MAP estimates, $\mathbb{E}_{f\sim\rho^{(i)}}\, \hat{L}^{\ell_{01}}_{X_{\text{test}},Y_{\text{test}}}(f)$, $i \in \{1, \ldots, M\}$, where $M$ is the number of MAP estimates, and then the mean of these test 0-1 losses, $\frac{1}{M}\sum_{i=1}^{M} \mathbb{E}_{f\sim\rho^{(i)}}\, \hat{L}^{\ell_{01}}_{X_{\text{test}},Y_{\text{test}}}(f)$. We then compute the residuals of the test 0-1 loss of each Laplace approximation with respect to the aforementioned mean. This gives us a measure of how much the test 0-1 loss of each Laplace approximation deviates from the mean test 0-1 loss of all Laplace approximations. We plot the inferred distribution of residuals for all λ combined in Figure 4.

For the Laplace approximation at each MAP estimate we use 100 Monte Carlo samples to approximate the predictive. As such, it is important to plot, as a sanity check, the variability we would expect simply from the MC approximation of each predictive for each $\mathbb{E}_{f\sim\rho^{(i)}}\, \hat{L}^{\ell_{01}}_{X_{\text{test}},Y_{\text{test}}}(f)$. Specifically, we model this as a Gaussian distribution with σ = 0.005 such that

$$P\left(\left|\mathbb{E}_{f\sim\rho}\, \hat{L}^{\ell_{01}}_{X_{\text{test}},Y_{\text{test}}}(f) - \frac{1}{N_{\mathrm{MC}}}\sum_{i=1}^{N_{\mathrm{MC}}} \hat{L}^{\ell_{01}}_{X_{\text{test}},Y_{\text{test}}}(f_i)\right| \le 0.01\right) \ge 0.95,$$

where $f_i \sim \rho$ (though we stress that this modelling comes from past experience and we didn't have time to precisely estimate this error). In Figure 4 we see significant deviations from $\frac{1}{M}\sum_{i=1}^{M} \mathbb{E}_{f\sim\rho^{(i)}}\, \hat{L}^{\ell_{01}}_{X_{\text{test}},Y_{\text{test}}}(f)$ which can't be explained by the MC approximation of each predictive. Most of the variability comes from moderate values of λ, as for large values the Laplace approximations degenerate to Dirac masses on the MAP estimates, and all our MAP estimates have approximately the same test 0-1 loss. Conversely, for small values of λ all Laplace approximations have 90% test 0-1 loss.

Assume that different MAP estimates $w_\rho$ are generated from some randomized algorithm $w_\rho \sim P(X, Y)$, where $X, Y$ are the training data. Our results imply that even for the same dataset and a fixed prior and likelihood, it might be difficult to find a single value of λ that gives a target test loss, simply because of the stochasticity of $w_\rho \sim P(X, Y)$.
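The residual analysis described above can be sketched as follows. The snippet assumes a table `losses[k][i]` holding the test 0-1 loss at the k-th value of λ for the i-th MAP estimate, and compares the pooled residuals to the ±2σ band of the assumed Monte Carlo noise model (σ = 0.005). The function names and example numbers are hypothetical.

```python
import statistics


def residuals_per_lambda(losses):
    """Pool, over all lambdas, the deviation of each MAP's loss from the per-lambda mean.

    losses[k][i]: test 0-1 loss at the k-th lambda for the i-th MAP estimate.
    """
    pooled = []
    for row in losses:
        mean_loss = statistics.fmean(row)
        pooled.extend(loss - mean_loss for loss in row)
    return pooled


def fraction_outside_mc_band(residuals, mc_sigma=0.005, z=2.0):
    """Fraction of residuals larger in magnitude than z * mc_sigma."""
    band = z * mc_sigma
    return sum(abs(r) > band for r in residuals) / len(residuals)


if __name__ == "__main__":
    # Hypothetical losses for two temperatures and three MAP estimates.
    losses = [
        [0.10, 0.12, 0.14],  # moderate lambda: MAP estimates disagree
        [0.05, 0.05, 0.05],  # large lambda: all collapse to similar MAP losses
    ]
    res = residuals_per_lambda(losses)
    print("residuals:", [round(r, 3) for r in res])
    print("fraction outside MC band:", fraction_outside_mc_band(res))
```

If the MC approximation were the only source of variability, roughly 5% of the residuals would be expected to fall outside the ±2σ band under the Gaussian noise model.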

5. DISCUSSION

We argued that Bayesian inference does not readily provide high-probability guarantees on out-of-sample performance, leading to inconsistencies such as the cold posterior effect. We hope that this will motivate Bayesian practitioners to use heuristics "guilt-free" when targeting Frequentist performance metrics, or to target contraction to the true posterior. Furthermore, our empirical results on the ECE point towards the need for Pareto curves when evaluating Bayesian approaches. Finally, it would be interesting to see how our results from Section 4 translate to the MCMC setting.



Figure 1: PAC-Bayes bounds correlate with the test 0-1 loss for different values of the temperature λ (quantities on both axes are normalized). (a) Classification tasks on the CIFAR-10, CIFAR-100, and SVHN datasets ($\sigma_\pi^2 = 0.1$, ResNet22) and the FMNIST dataset ($\sigma_\pi^2 = 0.1$, ConvNet). (b) Graphical representation of the Laplace approximation for different temperatures: for hot temperatures λ ≪ 1, the posterior variance becomes equal to the prior variance; for λ = 1 the posterior variance is regularized according to the curvature $h$; for cold temperatures λ ≫ 1, the posterior becomes a Dirac delta on the MAP estimate.

Figure 2: Test 0-1 loss mean, as well as 10 MAP trials, along with the mean generalization certificate (λ = 1 is marked): $B_{\text{Catoni}}$ PAC-Bayes bound 0-1 loss (top), standard KFAC Laplace 0-1 loss (middle) and standard KFAC NLL (bottom). The $B_{\text{Catoni}}$ PAC-Bayes bound closely tracks the test 0-1 loss. For the standard KFAC posteriors the test and validation 0-1 losses behave similarly to the Catoni case, with a rapid improvement as λ increases, followed by a plateau. The coldest posteriors, λ ≫ 1, are almost always best.

Figure 4: Intragroup variability of the test 0-1 loss for all Laplace approximations, $w_\rho \sim P(X, Y)$, as well as the test 0-1 loss variability for each Laplace approximation due to the MC approximation to the Laplace predictive, $w \sim \rho(w)$. We display the combined variability for all values of $\lambda \in [10^{-7}, 10^{4}]$. The intragroup test 0-1 variability cannot be explained by the variability of each Laplace predictive due to the MC approximation. This large intragroup variability contradicts the assumption that λ simply adjusts for the aleatoric uncertainty of the data. Intuitively, different MAP estimates result in different predictive functions, which yield different test 0-1 losses for a fixed prior, λ, and likelihood. Alternatively, for a fixed prior, likelihood, and target test 0-1 loss, the required value of λ depends on the MAP estimate. As MAP estimates are usually found using a stochastic algorithm $P(X, Y)$, finding a λ that consistently achieves the target test 0-1 loss might be difficult.

