A STATISTICAL THEORY OF COLD POSTERIORS IN DEEP NEURAL NETWORKS

Abstract

To get Bayesian neural networks to perform comparably to standard neural networks it is usually necessary to artificially reduce uncertainty using a "tempered" or "cold" posterior. This is extremely concerning: if the generative model is accurate, Bayesian inference/decision theory is optimal, and any artificial changes to the posterior should harm performance. While this suggests that the prior may be at fault, here we argue that in fact, BNNs for image classification use the wrong likelihood. In particular, standard image benchmark datasets such as CIFAR-10 are carefully curated. We develop a generative model describing curation which gives a principled Bayesian account of cold posteriors, because the likelihood under this new generative model closely matches the tempered likelihoods used in past work.

1. INTRODUCTION

Recent work has highlighted that Bayesian neural networks (BNNs) typically have better predictive performance when we "sharpen" the posterior (Wenzel et al., 2020). In stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011), this can be achieved by multiplying the log-posterior by 1/T, where the "temperature", T, is smaller than 1 (Wenzel et al., 2020). Broadly the same effect can be achieved in variational inference by "tempering", i.e. downweighting the KL term. As noted in Wenzel et al. (2020), this approach has been used in many recent papers to obtain good performance, albeit without always emphasising the importance of this factor (Zhang et al., 2017; Bae et al., 2018; Osawa et al., 2019; Ashukha et al., 2020). These results are puzzling if we take the usual Bayesian viewpoint, which says that the Bayesian posterior, used with the right prior and in combination with Bayes decision theory, should give optimal performance (Jaynes, 2003). Thus, these results may suggest we are using the wrong prior. While new priors have been suggested (e.g. Ober & Aitchison, 2020), they give only minor improvements in performance: certainly nothing like enough to close the gap to carefully trained non-Bayesian networks. In contrast, tempered posteriors directly give performance comparable to a carefully trained non-Bayesian network. The failure to develop an effective prior suggests that we should consider alternative explanations for the effectiveness of tempering.

Here, we consider the possibility that it is predominantly (but not entirely) the likelihood, and not the prior, that is at fault. In particular, we note that standard image benchmark datasets such as ImageNet and CIFAR-10 are carefully curated, and that it is important to consider this curation as part of our generative model.
We develop a simplified generative model describing dataset curation which assumes that a datapoint is included in the dataset only if there is unanimous agreement on the class amongst multiple labellers. This model naturally multiplies the effect of each datapoint, and hence gives posteriors that closely match tempered or cold posteriors. We show that toy data drawn from our generative model of curation can give rise to optimal temperatures smaller than 1. Our model predicts that cold posteriors will not be helpful when the original underlying labels from all labellers are available. While these are not available for standard datasets such as CIFAR-10, we found a good proxy: the CIFAR-10H dataset (Peterson et al., 2019), in which ∼50 human annotators labelled the CIFAR-10 test set (we use these as our training set, and use the standard CIFAR-10 training set as test data). As expected, we find strong cold-posterior effects when using the original single labels, and these effects are almost entirely eliminated when using the ∼50 labels from CIFAR-10H. In addition, curation implies that each label is almost certain to be correct, which is one way to understand the statistical patterns exploited by cold posteriors. As such, if we destroy this pattern by adding noise to the labels, the cold posterior effect should disappear. We confirmed that with increasing label noise, the cold posterior effect disappears and eventually reverses (giving better performance at temperatures close to 1).

2. BACKGROUND: COLD AND TEMPERED POSTERIORS

Tempered (e.g. Zhang et al., 2017) and cold (Wenzel et al., 2020) posteriors differ slightly in how they apply the temperature parameter. For cold posteriors, we scale the whole log-posterior, whereas tempering is a method typically applied in variational inference, and corresponds to scaling the likelihood but not the prior,

$$\log P_{\text{cold}}(\theta|X,Y) = \frac{1}{T}\log P(X,Y|\theta) + \frac{1}{T}\log P(\theta) + \text{const} \quad (1)$$

$$\log P_{\text{tempered}}(\theta|X,Y) = \frac{1}{\lambda}\log P(X,Y|\theta) + \log P(\theta) + \text{const}. \quad (2)$$

While cold posteriors are typically used in SGLD, tempered posteriors are usually targeted by variational methods. In particular, variational methods apply temperature scaling to the KL-divergence between the approximate posterior, Q(θ), and prior,

$$\mathcal{L} = \mathbb{E}_{Q(\theta)}\left[\log P(X,Y|\theta)\right] - \lambda\, D_{\text{KL}}\left(Q(\theta)\,\|\,P(\theta)\right). \quad (3)$$

Note that the only difference between cold and tempered posteriors is whether we scale the prior, and if we have Gaussian priors over the parameters (the usual case in Bayesian neural networks), this scaling can be absorbed into the prior variance,

$$\frac{1}{T}\log P_{\text{cold}}(\theta) = -\frac{1}{2T\sigma^2_{\text{cold}}}\sum_i \theta_i^2 + \text{const} = -\frac{1}{2\sigma^2_{\text{tempered}}}\sum_i \theta_i^2 + \text{const} = \log P_{\text{tempered}}(\theta), \quad (4)$$

in which case σ²_cold = σ²_tempered/T, so the tempered posteriors we discuss are equivalent to cold posteriors with rescaled prior variances.
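The equivalence between cold and tempered posteriors under a Gaussian prior can be checked numerically. The following sketch (the helper functions and the specific numbers are our own illustrative choices, not part of any published implementation) evaluates Eq. (1) and Eq. (2) with λ = T and with the prior variance rescaled as σ²_tempered = T·σ²_cold, and confirms the two log-posteriors agree:

```python
import numpy as np

def cold_log_posterior(theta, log_lik, T, sigma2):
    # Cold posterior (Eq. 1): scale BOTH the log-likelihood and the
    # Gaussian log-prior by 1/T (constants dropped).
    log_prior = -0.5 * np.sum(theta**2) / sigma2
    return (log_lik + log_prior) / T

def tempered_log_posterior(theta, log_lik, lam, sigma2):
    # Tempered posterior (Eq. 2): scale only the log-likelihood by 1/lambda.
    log_prior = -0.5 * np.sum(theta**2) / sigma2
    return log_lik / lam + log_prior

rng = np.random.default_rng(0)
theta = rng.normal(size=10)        # arbitrary parameter vector
log_lik, T = -3.7, 0.25            # arbitrary log-likelihood, temperature < 1
sigma2_cold = 1.0

# Absorb the 1/T prior scaling into the variance (Eq. 4):
# sigma2_tempered = T * sigma2_cold, i.e. sigma2_cold = sigma2_tempered / T.
cold = cold_log_posterior(theta, log_lik, T, sigma2_cold)
temp = tempered_log_posterior(theta, log_lik, lam=T, sigma2=T * sigma2_cold)
print(np.isclose(cold, temp))  # → True
```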

3. METHODS: A GENERATIVE MODEL FOR CURATED DATASETS

Standard image datasets such as CIFAR-10 and ImageNet are carefully curated to include only unambiguous examples of each class. For instance, in CIFAR-10, student labellers were paid per hour (rather than per image), were instructed that "It's worse to include one that shouldn't be included than to exclude one", and Krizhevsky (2009) then "personally verified every label submitted by the labellers". For ImageNet, Deng et al. (2009) required the consensus of a number of Amazon Mechanical Turk labellers before including an image in the dataset. To understand the statistical patterns that might emerge in these curated datasets, we consider a highly simplified generative model of consensus-formation. In particular, we draw a random image, X, from some underlying distribution over images, P(X), and ask S humans to assign labels, {Y_s}_{s=1}^S (e.g. using Mechanical Turk). We force every labeller to label every image; if the image is ambiguous, they are instructed to give a random label. If all the labellers agree, Y_1 = Y_2 = ··· = Y_S, consensus is reached and we include the datapoint in the dataset. If any of the labellers disagree, consensus is not reached and we exclude the datapoint (Fig. 1). Formally, the observed random variable, Y, is taken to be the usual label if consensus was reached and None if consensus was not reached (Fig. 2B),

$$Y \,|\, \{Y_s\}_{s=1}^S = \begin{cases} Y_1 & \text{if } Y_1 = Y_2 = \cdots = Y_S \\ \text{None} & \text{otherwise.} \end{cases} \quad (5)$$

Taking the human labels, Y_s, to come from the set $\mathcal{Y}$, so $Y_s \in \mathcal{Y}$, the consensus label, Y, could be any of the underlying labels in $\mathcal{Y}$, or None if no consensus is reached, so $Y \in \mathcal{Y} \cup \{\text{None}\}$. When consensus is reached, the likelihood is

$$P(Y\!=\!y|X,\theta) = P\!\left(\{Y_s\!=\!y\}_{s=1}^S \,\middle|\, X, \theta\right) = \prod_{s=1}^S P(Y_s\!=\!y|X,\theta) = P(Y_s\!=\!y|X,\theta)^S. \quad (6)$$
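A small simulation makes the effect of this curation model concrete. The sketch below (the `curate` helper and the particular label distributions are our own illustrative assumptions) draws S independent labels per image as in Eq. (5) and keeps only unanimous points, showing that unambiguous images survive curation with probability ≈ Σ_y P(Y_s=y|X)^S while ambiguous ones are almost always discarded, matching Eq. (6):

```python
import numpy as np

rng = np.random.default_rng(0)

def curate(p_classes, S, n_images=100_000):
    # Simulate Eq. 5: each of S labellers independently draws a label
    # from p_classes; an image enters the dataset only if all S agree.
    labels = rng.choice(len(p_classes), size=(n_images, S), p=p_classes)
    consensus = (labels == labels[:, :1]).all(axis=1)
    return consensus.mean()  # empirical inclusion probability

S = 5
p_unambiguous = np.array([0.98, 0.01, 0.01])  # labellers near-certain of the class
p_ambiguous = np.full(10, 0.1)                # labellers guess uniformly at random

# Analytic inclusion probabilities (Eq. 6): sum_y p(y)^S.
# Unambiguous: 0.98^5 + 2 * 0.01^5 ≈ 0.90; ambiguous: 10 * 0.1^5 = 1e-4.
print(curate(p_unambiguous, S))
print(curate(p_ambiguous, S))
```

This illustrates why curated datasets carry an effective likelihood of P(Y_s=y|X,θ)^S: only images on which labellers are already near-certain make it through, sharpening the label distribution relative to the uncurated population.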

