COLD POSTERIORS THROUGH PAC-BAYES

Abstract

We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections between the ELBO objective from variational inference and PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter naturally contain a temperature parameter λ which is not restricted to be λ = 1. For classification tasks, in the case of Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures important aspects of the cold posterior effect.



Introduction

In their influential paper, Wenzel et al. (2020) highlighted the observation that Bayesian neural networks typically exhibit better test-time predictive performance if the posterior distribution is "sharpened" through tempering. Their work has been influential primarily because it serves as a well-documented example of the potential drawbacks of the Bayesian approach to deep learning. While other subfields of deep learning have seen rapid adoption, and have had impact on real-world problems, Bayesian deep learning has, to date, seen relatively limited practical use (Izmailov et al., 2021; Lotfi et al., 2022; Dusenberry et al., 2020; Wenzel et al., 2020). The "cold posterior effect", as the authors of Wenzel et al. (2020) named their observation, highlights an essential mismatch between Bayesian theory and practice. As the number of training samples increases, Bayesian theory states that the posterior distribution should concentrate more and more on the true model parameters, in a frequentist sense. At any time, the posterior is our best guess at the true model parameters, without having to resort to heuristics. Since the original paper, a number of works have investigated the cold posterior effect (Noci et al., 2021; Zeno et al., 2020; Adlam et al., 2020; Nabarro et al., 2022; Fortuin et al., 2021).

The experimental setups where the cold posterior effect arises have, however, been hard to pinpoint precisely. Noci et al. (2021) conducted detailed experiments testing various hypotheses. The cold posterior effect was shown to arise from augmenting the data during optimization (the data augmentation hypothesis), from selecting only the "easiest" data samples when constructing the dataset (the data curation hypothesis), and from selecting a "bad" prior (the prior misspecification hypothesis). Nabarro et al. (2022) propose a principled log-likelihood that incorporates data augmentation; however, they show that the cold posterior effect persists. Bachmann et al.
(2022) also propose a mechanism by which data augmentation leads to misspecification, and show how the tempered posterior alleviates it. They prove their results for simplified settings, and acknowledge that there might be other potential sources of the cold posterior effect. Data curation was first proposed as an explanation in Aitchison (2021); however, the author shows that data curation can only explain part of the cold posterior effect. Misspecified priors have also been explored as a possible cause in several other works (Zeno et al., 2020; Adlam et al., 2020; Fortuin et al., 2021). Again, the results have been mixed: in smaller models, data-dependent priors seem to decrease the cold posterior effect, while in larger models the effect increases (Fortuin et al., 2021).

We posit that discussions of the cold posterior effect should take into account that in the non-asymptotic setting (where the number of training data points is relatively small), Bayesian inference does not readily provide a guarantee for performance on out-of-sample data. Existing theorems describe posterior contraction (Ghosal et al., 2000; Blackwell & Dubins, 1962); however, in practical settings, for a finite number of training steps and for finite training data, it is often difficult to precisely characterise how much the posterior concentrates. Furthermore, theorems on posterior contraction are somewhat unsatisfying in the supervised classification setting, in which the cold posterior effect is usually discussed. Ideally, one would want a theoretical analysis that links the posterior distribution to the test error directly.

Here, we investigate PAC-Bayes generalization bounds (McAllester, 1999; Catoni, 2007; Alquier et al., 2016; Dziugaite & Roy, 2017) as the model that governs performance on out-of-sample data. PAC-Bayes bounds describe the performance on out-of-sample data through an application of the convex duality relation between measurable functions and probability measures.
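A representative bound of this kind is that of Alquier et al. (2016); the symbols below are our own notation for illustration. For any λ > 0 and any posterior ρ, with probability at least 1 − δ over the draw of the n training samples,

```latex
\mathbb{E}_{\theta \sim \rho}\left[R(\theta)\right]
\;\le\;
\mathbb{E}_{\theta \sim \rho}\left[r_n(\theta)\right]
+ \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{1}{\delta} + \Psi_{\pi,\lambda}(n)}{\lambda},
\qquad
\Psi_{\pi,\lambda}(n) \;=\; \ln \mathbb{E}_{\theta \sim \pi}\, \mathbb{E}\!\left[ e^{\lambda \left( R(\theta) - r_n(\theta) \right)} \right],
```

where R(θ) is the out-of-sample risk, r_n(θ) the empirical risk on the n samples, π the prior, and Ψ_{π,λ}(n) the log-Laplace transform term. The temperature λ trades off the empirical term against the KL complexity and is not constrained to equal 1.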
The convex duality relationship naturally gives rise to the log-Laplace transform of a special random variable (Catoni, 2007). Importantly, the log-Laplace transform has a temperature parameter λ which is not constrained to be λ = 1. We investigate the relationship of this temperature parameter to cold posteriors. In summary, our contributions are the following:

• Through detailed experiments for the Laplace approximation to the posterior, we show that PAC-Bayes bounds correlate with out-of-sample performance for different values of the temperature parameter λ. This might indicate that the temperature in the cold posterior literature coincides with the temperature of the log-Laplace transform, and might motivate Bayesian practitioners to use heuristics when targeting frequentist metrics.

• Contrary to Wenzel et al. (2020), we find that the coldest temperature (such that the posterior is a Dirac delta centered on a MAP estimate of the weights) is empirically almost always optimal in terms of test accuracy. PAC-Bayes bounds track and predict this behaviour. However, the negative log-likelihood (NLL) and the Expected Calibration Error (ECE) have a more complex behaviour. Contrary to prior work (Wenzel et al., 2020; Noci et al., 2021), this highlights that the choice of evaluation metric plays an important role when discussing the cold posterior effect. More importantly, we show that to improve the test ECE or NLL one typically needs to reduce the test accuracy.

• We derive a PAC-Bayes bound for the case of the widely used generalized Gauss-Newton Laplace approximations to the posterior. Contrary to prior work (Bachmann et al., 2022; Aitchison, 2021), our bound implies that λ does not simply fix a misspecified prior or likelihood: for a fixed target test risk, likelihood, and prior, the required λ varies due to the stochasticity of the inference procedure and the shape of the loss landscape.

We also include a detailed FAQ section in the Appendix.



Figure 1: PAC-Bayes bounds correlate with the test 0-1 loss for different values of the temperature λ (quantities on both axes are normalized). (a) Classification tasks on the CIFAR-10, CIFAR-100, and SVHN datasets (σ_π^2 = 0.1, ResNet22) and the FMNIST dataset (σ_π^2 = 0.1, ConvNet). (b) Graphical representation of the Laplace approximation for different temperatures: for hot temperatures λ ≪ 1, the posterior variance becomes equal to the prior variance; for λ = 1, the posterior variance is regularized according to the curvature h; for cold temperatures λ ≫ 1, the posterior becomes a Dirac delta on the MAP estimate.
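The limiting behaviour described in panel (b) can be illustrated numerically. The sketch below assumes a diagonal generalized Gauss-Newton Laplace approximation with per-parameter curvature h and a tempered posterior variance of the form 1/(λh + 1/σ_π²); the function name and this exact scaling are our own illustrative assumptions, chosen to be consistent with the limits stated in the caption, not the paper's implementation.

```python
import numpy as np

def tempered_laplace_variance(h, prior_var, lam):
    """Per-parameter posterior variance of a (hypothetical) diagonal
    tempered Laplace approximation:

        sigma^2 = 1 / (lam * h + 1 / prior_var)

    As lam -> 0 (hot), the variance tends to the prior variance;
    as lam -> inf (cold), it tends to 0, i.e. a Dirac delta on the MAP.
    """
    return 1.0 / (lam * np.asarray(h, dtype=float) + 1.0 / prior_var)

h = np.array([10.0, 100.0])   # illustrative per-parameter curvature estimates
prior_var = 0.1               # sigma_pi^2 = 0.1, as in Figure 1

print(tempered_laplace_variance(h, prior_var, 1e-9))  # hot: close to the prior variance
print(tempered_laplace_variance(h, prior_var, 1.0))   # lambda = 1: curvature-regularized
print(tempered_laplace_variance(h, prior_var, 1e9))   # cold: collapses towards zero
```

Running this shows the monotone interpolation between prior-dominated and curvature-dominated regimes as λ grows.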

