COLD POSTERIORS THROUGH PAC-BAYES

Abstract

We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections between the ELBO objective from variational inference and PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter naturally contain a temperature parameter λ which is not restricted to λ = 1. For classification tasks, in the case of Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures important aspects of the cold posterior effect.
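To make the correspondence above concrete, one schematic way to write the two objectives (the notation here is an illustrative sketch, not necessarily the exact form developed later in the paper) is
\[
-\mathrm{ELBO}(q) \;=\; \mathbb{E}_{w\sim q}\Big[\textstyle\sum_{i=1}^{n} -\log p(y_i \mid x_i, w)\Big] \;+\; \mathrm{KL}(q \,\|\, \pi),
\]
\[
\mathcal{L}_{\text{PAC-Bayes}}(q, \lambda) \;=\; \mathbb{E}_{w\sim q}\Big[\textstyle\sum_{i=1}^{n} -\log p(y_i \mid x_i, w)\Big] \;+\; \frac{1}{\lambda}\, \mathrm{KL}(q \,\|\, \pi),
\]
where q is the approximate posterior and π the prior. Setting λ = 1 recovers the negative ELBO, while λ ≠ 1 tempers the relative weight of the KL term; this is the temperature parameter referred to in the abstract.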



In their influential paper, Wenzel et al. (2020) highlighted the observation that Bayesian neural networks typically exhibit better test-time predictive performance if the posterior distribution is "sharpened" through tempering. Their work has been influential primarily because it serves as a well-documented example of the potential drawbacks of the Bayesian approach to deep learning. While other subfields of deep learning have seen rapid adoption and have had impact on real-world problems, Bayesian deep learning has, to date, seen relatively limited practical use (Izmailov et al., 2021; Lotfi et al., 2022; Dusenberry et al., 2020; Wenzel et al., 2020). The "cold posterior effect", as the authors of Wenzel et al. (2020) named their observation, highlights an essential mismatch between Bayesian theory and practice. As the number of training samples increases, Bayesian theory states that the posterior distribution should concentrate more and more on the true model parameters, in a frequentist sense. At any point, the posterior is our best guess at the true model parameters, without resorting to heuristics. Since the original paper, a number of works (Noci et al., 2021; Zeno et al., 2020; Adlam et al., 2020; Nabarro et al., 2022; Fortuin et al., 2021;



Figure 1: PAC-Bayes bounds correlate with the test 0-1 loss for different values of the temperature λ (quantities on both axes are normalized). (a) Classification tasks on the CIFAR-10, CIFAR-100, and SVHN datasets (σ²_π = 0.1, ResNet22) and the FMNIST dataset (σ²_π = 0.1, ConvNet). (b) Graphical representation of the Laplace approximation for different temperatures: for hot temperatures λ ≪ 1, the posterior variance becomes equal to the prior variance; for λ = 1, the posterior variance is regularized according to the curvature h; for cold temperatures λ ≫ 1, the posterior becomes a Dirac delta on the MAP estimate.
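As a worked illustration of the three regimes in panel (b), suppose (an assumed parameterization consistent with the caption, not necessarily the exact expression used in the paper) that each diagonal entry of the tempered Laplace posterior covariance takes the form
\[
\sigma^2_{\text{post}}(\lambda) \;=\; \big(\lambda\, h + \sigma_\pi^{-2}\big)^{-1},
\]
where h is the corresponding curvature (Hessian diagonal) term and σ²_π is the prior variance. Then σ²_post(λ) → σ²_π as λ → 0 (hot: the posterior reverts to the prior), σ²_post(1) = (h + σ_π^{-2})^{-1} (the curvature-regularized case), and σ²_post(λ) → 0 as λ → ∞ (cold: a Dirac delta at the MAP estimate).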

