A STATISTICAL THEORY OF COLD POSTERIORS IN DEEP NEURAL NETWORKS

Abstract

To get Bayesian neural networks (BNNs) to perform comparably to standard neural networks, it is usually necessary to artificially reduce uncertainty using a "tempered" or "cold" posterior. This is extremely concerning: if the generative model is accurate, Bayesian inference/decision theory is optimal, and any artificial change to the posterior should harm performance. While this suggests that the prior may be at fault, here we argue that in fact, BNNs for image classification use the wrong likelihood. In particular, standard image benchmark datasets such as CIFAR-10 are carefully curated. We develop a generative model describing curation which gives a principled Bayesian account of cold posteriors, because the likelihood under this new generative model closely matches the tempered likelihoods used in past work.

1. INTRODUCTION

Recent work has highlighted that Bayesian neural networks (BNNs) typically have better predictive performance when we "sharpen" the posterior (Wenzel et al., 2020). In stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011), this can be achieved by multiplying the log-posterior by 1/T, where the "temperature" T is smaller than 1 (Wenzel et al., 2020). Broadly the same effect can be achieved in variational inference by "tempering", i.e. downweighting the KL term. As noted in Wenzel et al. (2020), this approach has been used in many recent papers to obtain good performance, albeit without always emphasising the importance of this factor (Zhang et al., 2017; Bae et al., 2018; Osawa et al., 2019; Ashukha et al., 2020).

These results are puzzling if we take the usual Bayesian viewpoint, which says that the Bayesian posterior, used with the right prior and in combination with Bayes decision theory, should give optimal performance (Jaynes, 2003). Thus, these results may suggest we are using the wrong prior. While new priors have been suggested (e.g. Ober & Aitchison, 2020), they give only minor improvements in performance, certainly not enough to close the gap to carefully trained non-Bayesian networks. In contrast, tempered posteriors directly give performance comparable to a carefully trained finite network. The failure to develop an effective prior suggests that we should consider alternative explanations for the effectiveness of tempering. Here, we consider the possibility that it is predominantly (but not entirely) the likelihood, and not the prior, that is at fault. In particular, we note that standard image benchmark datasets such as ImageNet and CIFAR-10 are carefully curated, and that it is important to consider this curation as part of our generative model.
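To make the tempering operation concrete, the following is a minimal NumPy sketch (not from the paper) of a tempered posterior and one SGLD step targeting it, using an assumed toy Gaussian linear model: a standard Gaussian prior over parameters and a Gaussian likelihood. Scaling the log-posterior by 1/T while leaving the injected noise unchanged makes the chain sample p(theta|D)^(1/T); T < 1 gives a "cold", sharpened posterior.

```python
import numpy as np

def tempered_log_posterior(theta, x, y, T):
    # Unnormalised log of p(theta | D)^(1/T) for a toy model:
    # standard Gaussian prior, Gaussian likelihood y ~ N(x @ theta, 1).
    # T < 1 sharpens ("cools") the posterior; T = 1 is standard Bayes.
    log_prior = -0.5 * np.sum(theta ** 2)
    resid = y - x @ theta
    log_lik = -0.5 * np.sum(resid ** 2)
    return (log_prior + log_lik) / T

def sgld_step(theta, x, y, step_size, T, rng):
    # One SGLD step on the tempered posterior: the drift uses the
    # gradient of the (1/T)-scaled log-posterior, while the noise
    # variance stays at step_size, so the stationary distribution
    # is the tempered posterior rather than the Bayes posterior.
    grad = (-theta + x.T @ (y - x @ theta)) / T
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad + noise
```

Note that at T = 1/2 the tempered log-posterior is exactly twice the untempered one, i.e. every datapoint (and the prior) counts double.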
We develop a simplified generative model describing dataset curation which assumes that a datapoint is included in the dataset only if there is unanimous agreement on the class amongst multiple labellers. This model naturally multiplies the effect of each datapoint, and hence gives posteriors that closely match tempered or cold posteriors. We show that toy data drawn from our generative model of curation can give rise to optimal temperatures smaller than 1. Our model predicts that cold posteriors will not be helpful when the original underlying labels from all labellers are available. While these are not available for standard datasets such as CIFAR-10, we found a good proxy: the CIFAR-10H dataset (Peterson et al., 2019), in which ∼50 human annotators labelled the CIFAR-10 test set (we use these as our training set, and use the standard CIFAR-10 training set for test data). As expected, we find strong cold-posterior effects.
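The curation mechanism described above can be sketched as follows; this is an illustrative simulation under assumed names, not the paper's implementation. Each datapoint receives S independent labels sampled from its true class distribution p(y|x), and the point enters the dataset only on unanimous agreement. Each class y then contributes p(y|x)^S to the acceptance probability, so the curated likelihood of a kept point raises p(y|x) to the power S, which is the same multiplier a cold posterior with T = 1/S applies to the likelihood term.

```python
import numpy as np

def consensus_prob(p, S):
    # Probability that S independent labellers, each drawing a class
    # from distribution p, all agree -- i.e. that the datapoint
    # survives curation. Summing p[y]**S over classes y shows the
    # curated likelihood is the per-labeller likelihood to the power S.
    return float(np.sum(np.asarray(p) ** S))

def curate(probs, S, rng):
    # Simulate curation of a toy dataset: probs is an (N, C) array of
    # per-datapoint class probabilities p(y|x). Draw S labels per point
    # and keep the point only if all S labellers agree.
    kept_idx, kept_labels = [], []
    for i, p in enumerate(probs):
        draws = rng.choice(len(p), size=S, p=p)
        if np.all(draws == draws[0]):
            kept_idx.append(i)
            kept_labels.append(int(draws[0]))
    return np.array(kept_idx, dtype=int), np.array(kept_labels, dtype=int)
```

Unambiguous points (p(y|x) concentrated on one class) survive curation almost surely, while ambiguous points are filtered out, so the curated dataset over-represents confidently labelled examples.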

