GLOBAL INDUCING POINT VARIATIONAL POSTERIORS FOR BAYESIAN NEURAL NETWORKS AND DEEP GAUSSIAN PROCESSES

Abstract

We derive the optimal approximate posterior over the top-layer weights in a Bayesian neural network for regression, and show that it exhibits strong dependencies on the lower-layer weights. We adapt this result to develop a correlated approximate posterior over the weights at all layers in a Bayesian neural network. We extend this approach to deep Gaussian processes, unifying inference in the two model classes. Our approximate posterior uses learned "global" inducing points, which are defined only at the input layer and propagated through the network to obtain inducing inputs at subsequent layers. By contrast, standard, "local", inducing point methods from the deep Gaussian process literature optimise a separate set of inducing inputs at every layer, and thus do not model correlations across layers. Our method gives state-of-the-art performance for a variational Bayesian method, without data augmentation or tempering: 86.7% test accuracy on CIFAR-10.

1. INTRODUCTION

Deep models, formed by stacking together many simple layers, give rise to extremely powerful machine learning algorithms, from deep neural networks (DNNs) to deep Gaussian processes (DGPs) (Damianou & Lawrence, 2013). One approach to reason about uncertainty in these models is to use variational inference (VI) (Jordan et al., 1999). VI in Bayesian neural networks (BNNs) requires the user to specify a family of approximate posteriors over the weights, with the classical approach being independent Gaussian distributions over each individual weight (Hinton & Van Camp, 1993; Graves, 2011; Blundell et al., 2015). Later work has considered more complex approximate posteriors, for instance using a matrix-normal distribution as the approximate posterior for a full weight matrix (Louizos & Welling, 2016; Ritter et al., 2018). By contrast, DGPs use an approximate posterior defined over functions: the standard approach is to specify the inputs and outputs at a finite number of "inducing" points (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017). Critically, these classical BNN and DGP approaches define approximate posteriors over functions that are independent across layers. An approximate posterior that factorises across layers is problematic, because what matters for a deep model is the overall input-output transformation for the full model, not the input-output transformation for individual layers. This raises the question of what family of approximate posteriors should be used to capture correlations across layers. One approach for BNNs would be to introduce a flexible "hypernetwork", used to generate the weights (Krueger et al., 2017; Pawlowski et al., 2017). However, this approach is likely to be suboptimal as it does not sufficiently exploit the rich structure in the underlying neural network. For guidance, we consider the optimal approximate posterior over the top-layer units in a deep network for regression.
Remarkably, the optimal approximate posterior for the last-layer weights given the earlier weights can be obtained in closed form without choosing a restrictive family of distributions. In particular, the optimal approximate posterior is given by propagating the training inputs through the lower layers to compute the top-layer representation, then using Bayesian linear regression to map from the top-layer representation to the outputs. Inspired by this result, we use Bayesian linear regression to define a generic family of approximate posteriors for BNNs. In particular, we introduce learned "pseudo-data" at every layer, and compute the posterior over the weights by performing linear regression from the inputs (propagated from lower layers) onto the pseudo-data. We reduce the burden of working with many training inputs by summarising the posterior using a small number of "inducing" points. We find that these approximate posteriors give excellent performance in the non-tempered, no-data-augmentation regime, with performance on datasets such as CIFAR-10 reaching 86.7%, comparable to SGMCMC (Wenzel et al., 2020). Our approach can be extended to DGPs, and we explore connections to the inducing point GP literature, showing that inference in the two classes of models can be unified.

2. METHODS

We consider neural networks with lower-layer weights $\{W_\ell\}_{\ell=1}^{L}$, $W_\ell \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$, and top-layer weights, $W_{L+1} \in \mathbb{R}^{N_L \times N_{L+1}}$, where the activity, $F_\ell$, at layer $\ell$ is given by

$$F_1 = X W_1, \qquad F_\ell = \phi(F_{\ell-1})\, W_\ell \quad \text{for } \ell \in \{2, \dots, L\},$$

where $\phi(\cdot)$ is an elementwise nonlinearity. The outputs, $Y \in \mathbb{R}^{P \times N_{L+1}}$, depend on the top-level activity, $F_L$, and the output weights, $W_{L+1}$, through a likelihood, $P(Y \mid W_{L+1}, F_L)$. In the following derivations, we focus on $\ell > 1$; corresponding expressions for the input layer can be obtained by replacing $\phi(F_0)$ with the inputs, $X \in \mathbb{R}^{P \times N_0}$. The prior over weights is independent across layers and output units (see Sec. 2.3 for the form of $S_\ell$),

$$P(W_\ell) = \prod_{\lambda=1}^{N_\ell} P(w_\lambda^\ell), \qquad P(w_\lambda^\ell) = \mathcal{N}\!\left(w_\lambda^\ell;\, \mathbf{0},\, \tfrac{1}{N_{\ell-1}} S_\ell\right),$$

where $w_\lambda^\ell$ is a column of $W_\ell$, i.e. all the input weights to unit $\lambda$ in layer $\ell$. To fit the parameters of the approximate posterior, $Q(\{W_\ell\}_{\ell=1}^{L+1})$, we maximise the evidence lower bound (ELBO),

$$\mathcal{L} = \mathbb{E}_{Q(\{W_\ell\}_{\ell=1}^{L+1})}\!\left[\log P\!\left(Y, \{W_\ell\}_{\ell=1}^{L+1} \mid X\right) - \log Q\!\left(\{W_\ell\}_{\ell=1}^{L+1}\right)\right].$$

To build intuition about how to parameterise $Q(\{W_\ell\}_{\ell=1}^{L+1})$, we consider the optimal $Q(W_{L+1} \mid \{W_\ell\}_{\ell=1}^{L})$ for any given $Q(\{W_\ell\}_{\ell=1}^{L})$. We begin by simplifying the ELBO by absorbing terms that do not depend on $W_{L+1}$ into a constant, $c$,

$$\mathcal{L} = \mathbb{E}_{Q(\{W_\ell\}_{\ell=1}^{L+1})}\!\left[\log P\!\left(Y, W_{L+1} \mid X, \{W_\ell\}_{\ell=1}^{L}\right) - \log Q\!\left(W_{L+1} \mid \{W_\ell\}_{\ell=1}^{L}\right)\right] + c. \tag{4}$$

Rearranging these terms, we find that all $W_{L+1}$ dependence can be written in terms of the KL divergence between the approximate posterior of interest and the true posterior,

$$\mathcal{L} = \mathbb{E}_{Q(\{W_\ell\}_{\ell=1}^{L})}\!\left[\log P\!\left(Y \mid X, \{W_\ell\}_{\ell=1}^{L}\right) - D_{\mathrm{KL}}\!\left(Q\!\left(W_{L+1} \mid \{W_\ell\}_{\ell=1}^{L}\right) \middle\| \, P\!\left(W_{L+1} \mid Y, X, \{W_\ell\}_{\ell=1}^{L}\right)\right)\right] + c. \tag{5}$$

Thus, the optimal approximate posterior is

$$Q\!\left(W_{L+1} \mid \{W_\ell\}_{\ell=1}^{L}\right) = P\!\left(W_{L+1} \mid Y, X, \{W_\ell\}_{\ell=1}^{L}\right) \propto P(Y \mid W_{L+1}, F_L)\, P(W_{L+1}), \tag{6}$$

where the final proportionality follows by applying Bayes' theorem and exploiting the model's conditional independencies.
For regression, the likelihood is Gaussian,

$$P(Y \mid W_{L+1}, F_L) = \prod_{\lambda=1}^{N_{L+1}} \mathcal{N}\!\left(y_\lambda;\, \phi(F_L)\, w_\lambda^{L+1},\, \Lambda_{L+1}^{-1}\right),$$

where $y_\lambda$ is the value of a single output channel for all training inputs, and $\Lambda_{L+1}$ is a precision matrix. Thus, the posterior is given in closed form by Bayesian linear regression (Rasmussen & Williams, 2006).
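To make the closed-form solution concrete, the following sketch computes the optimal posterior over one column of the top-layer weights by Bayesian linear regression. The function name, the isotropic noise precision, and the toy data are our own illustrative assumptions; the prior is $w \sim \mathcal{N}(0, S/N_L)$ as above.

```python
import numpy as np

def top_layer_posterior(phi, y, noise_prec, S):
    """Optimal Gaussian posterior over one column of the top-layer weights,
    Q(w) = N(w; Sigma Phi^T Lambda y, Sigma), via Bayesian linear regression.
    phi: (P, N_L) top-layer features, phi = phi(F_L)
    y:   (P,) one output channel for all training inputs
    noise_prec: scalar noise precision (isotropic Lambda, an assumption)
    S:   (N_L, N_L) prior covariance scale, prior w ~ N(0, S / N_L)."""
    N_L = phi.shape[1]
    prior_prec = N_L * np.linalg.inv(S)          # (S / N_L)^{-1}
    Sigma = np.linalg.inv(prior_prec + noise_prec * phi.T @ phi)
    mean = Sigma @ (noise_prec * phi.T @ y)
    return mean, Sigma

# toy check: with enough data, the posterior mean recovers the true weights
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 10))
w_true = rng.normal(size=10) / np.sqrt(10)
y = phi @ w_true + 0.1 * rng.normal(size=200)
mean, Sigma = top_layer_posterior(phi, y, noise_prec=100.0, S=np.eye(10))
```

With many data points the likelihood term dominates the prior precision, so the posterior mean approaches the maximum-likelihood regression weights.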

2.1. DEFINING THE FULL APPROXIMATE POSTERIOR WITH GLOBAL INDUCING POINTS AND PSEUDO-DATA

We adapt the optimal scheme above to give a scalable approximate posterior over the weights at all layers. To avoid propagating all training inputs through the network, which is intractable for large datasets, we instead propagate $M$ global inducing locations, $U_0$,

$$U_1 = U_0 W_1, \qquad U_\ell = \phi(U_{\ell-1})\, W_\ell \quad \text{for } \ell = 2, \dots, L+1. \tag{8}$$

Next, the optimal posterior requires outputs, $Y$. However, no outputs are available at inducing locations for the output layer, let alone for intermediate layers. We thus introduce learned variational parameters to mimic the form of the optimal posterior. In particular, we use the product of the prior over weights and a "pseudo-likelihood", $\mathcal{N}(v_\lambda^\ell; u_\lambda^\ell, \Lambda_\ell^{-1})$, representing noisy "pseudo-observations" of the outputs of the linear layer at the inducing locations, $u_\lambda^\ell = \phi(U_{\ell-1})\, w_\lambda^\ell$. Substituting $u_\lambda^\ell$ into the pseudo-likelihood, the approximate posterior becomes

$$Q\!\left(W_\ell \mid \{W_{\ell'}\}_{\ell'=1}^{\ell-1}\right) \propto \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(v_\lambda^\ell;\, \phi(U_{\ell-1})\, w_\lambda^\ell,\, \Lambda_\ell^{-1}\right) P(w_\lambda^\ell),$$

$$Q\!\left(W_\ell \mid \{W_{\ell'}\}_{\ell'=1}^{\ell-1}\right) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(w_\lambda^\ell;\, \Sigma_\ell^{w}\, \phi(U_{\ell-1})^T \Lambda_\ell v_\lambda^\ell,\, \Sigma_\ell^{w}\right), \qquad \Sigma_\ell^{w} = \left(N_{\ell-1} S_\ell^{-1} + \phi(U_{\ell-1})^T \Lambda_\ell\, \phi(U_{\ell-1})\right)^{-1}, \tag{9}$$

where $v_\lambda^\ell$ and $\Lambda_\ell$ are variational parameters. Therefore, our full approximate posterior factorises as $Q(\{W_\ell\}_{\ell=1}^{L+1}) = \prod_{\ell=1}^{L+1} Q(W_\ell \mid \{W_{\ell'}\}_{\ell'=1}^{\ell-1})$. Thus, the full ELBO can be written

$$\mathcal{L} = \mathbb{E}_{Q(\{W_\ell\}_{\ell=1}^{L+1})}\!\left[\log P\!\left(Y \mid X, \{W_\ell\}_{\ell=1}^{L+1}\right) + \log P\!\left(\{W_\ell\}_{\ell=1}^{L+1}\right) - \log Q\!\left(\{W_\ell\}_{\ell=1}^{L+1}\right)\right] \tag{11}$$

$$= \mathbb{E}_{Q(\{W_\ell\}_{\ell=1}^{L+1})}\!\left[\log P\!\left(Y \mid X, \{W_\ell\}_{\ell=1}^{L+1}\right) + \sum_{\ell=1}^{L+1} \log \frac{P(W_\ell)}{Q\!\left(W_\ell \mid \{W_{\ell'}\}_{\ell'=1}^{\ell-1}\right)}\right]. \tag{12}$$

The forms of the ELBO and approximate posterior suggest a sequential procedure to evaluate and subsequently optimise it: we alternate between sampling the weights using Eq. (9) and propagating the data and inducing locations using Eq. (8) (see Alg. 1). In summary, the parameters of the approximate posterior are the global inducing inputs, $U_0$, and the pseudo-data and precisions at all layers, $\{V_\ell, \Lambda_\ell\}_{\ell=1}^{L+1}$.
As each factor $Q(W_\ell \mid \{W_{\ell'}\}_{\ell'=1}^{\ell-1})$ is Gaussian, these parameters can be optimised using standard reparameterised variational inference (Kingma & Welling, 2013; Rezende et al., 2014) in combination with the Adam optimiser (Kingma & Ba, 2014) (Appendix A). Importantly, by placing the inducing inputs on the training data (i.e. $U_0 = X$) and setting $v_\lambda^{L+1} = y_\lambda$, this approximate posterior matches the optimal top-layer posterior (Eq. 6). Finally, we note that while this posterior is conditionally Gaussian, the full posterior over all $\{W_\ell\}_{\ell=1}^{L+1}$ is non-Gaussian, and is thus potentially more flexible than a full-covariance Gaussian over all weights.

Algorithm 1: Global inducing points for neural networks
Parameters: inducing inputs, $U_0$; inducing outputs and precisions, $\{V_\ell, \Lambda_\ell\}_{\ell=1}^{L+1}$, at all layers.
Neural network inputs (e.g. MNIST digits): $F_0$
Neural network outputs (e.g. classification logits): $F_{L+1}$
$\mathcal{L} \leftarrow 0$
for $\ell \in \{1, \dots, L+1\}$ do
    // Compute the mean and covariance over the weights at this layer
    $\Sigma_\ell^{w} = \left(N_{\ell-1} S_\ell^{-1} + \phi(U_{\ell-1})^T \Lambda_\ell\, \phi(U_{\ell-1})\right)^{-1}$
    $M_\ell = \Sigma_\ell^{w}\, \phi(U_{\ell-1})^T \Lambda_\ell V_\ell$
    // Sample the weights and accumulate the ELBO
    $W_\ell \sim \mathcal{N}(M_\ell, \Sigma_\ell^{w}) = Q(W_\ell \mid \{W_{\ell'}\}_{\ell'=1}^{\ell-1})$
    $\mathcal{L} \leftarrow \mathcal{L} + \log P(W_\ell) - \log \mathcal{N}(W_\ell \mid M_\ell, \Sigma_\ell^{w})$
    // Propagate the inputs and inducing points using the sampled weights
    $U_\ell = \phi(U_{\ell-1})\, W_\ell$
    $F_\ell = \phi(F_{\ell-1})\, W_\ell$
end for
$\mathcal{L} \leftarrow \mathcal{L} + \log P(Y \mid F_{L+1})$
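The sequential procedure of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch under simplifying assumptions of our own (NealPrior $S_\ell = I$, diagonal precisions, ELBO terms omitted); the function and variable names are ours, not those of the paper's library.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sample_global_inducing(X, U0, Vs, Lams, rng):
    """Draw one sample of all weights from the global inducing posterior
    (a sketch of Alg. 1 with S_l = I and diagonal precisions Lams[l]: (M,)).
    Vs[l]: (M, N_l) pseudo-outputs. Returns sampled weights and outputs F."""
    F, U = X, U0
    Ws = []
    for l in range(len(Vs)):
        phiU = U if l == 0 else relu(U)   # the first layer acts on raw inputs
        phiF = F if l == 0 else relu(F)
        N_in, N_out = phiU.shape[1], Vs[l].shape[1]
        # Sigma_w = (N_{l-1} I + phi(U)^T Lam phi(U))^{-1}, M_w = Sigma_w phi(U)^T Lam V
        A = phiU.T * Lams[l]              # (N_in, M): phi(U)^T Lam
        Sigma_w = np.linalg.inv(N_in * np.eye(N_in) + A @ phiU)
        M_w = Sigma_w @ A @ Vs[l]
        # sample W ~ N(M_w, Sigma_w) column-wise via the Cholesky factor
        W = M_w + np.linalg.cholesky(Sigma_w) @ rng.normal(size=(N_in, N_out))
        U, F = phiU @ W, phiF @ W         # propagate inducing points and data
        Ws.append(W)
    return Ws, F

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))              # minibatch of inputs
U0 = rng.normal(size=(10, 5))             # M = 10 global inducing inputs
Vs = [rng.normal(size=(10, 20)), rng.normal(size=(10, 1))]
Lams = [np.ones(10), np.ones(10)]
Ws, F = sample_global_inducing(X, U0, Vs, Lams, rng)
```

Note how the inducing points $U$ and the data $F$ are propagated through the same sampled weights, which is what couples the posterior across layers.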

2.2. EFFICIENT CONVOLUTIONAL BAYESIAN LINEAR REGRESSION

The previous sections apply to fully connected networks. The extension to convolutional networks is straightforward in principle: we transform the convolution into a matrix multiplication by treating each patch as a separate input feature-vector, flattening the spatial and channel dimensions together into a single vector. Thus, the feature-vectors have length in_channels × kernel_width × kernel_height, and the matrix $U_{\ell-1}$ contains patches_per_image × minibatch patches. Likewise, we now have inducing outputs, $v_\lambda^\ell$, at each location in all the inducing images, so these again have length patches_per_image × minibatch. After explicitly extracting the patches, we can straightforwardly apply standard Bayesian linear regression. However, explicitly extracting image patches is very memory intensive in a DNN: for a standard convolution with a 3 × 3 convolutional kernel, there is a 3 × 3 patch centred at each pixel in the input image, implying a factor-of-9 increase in memory consumption. Instead, we note that computing the matrices required for linear regression, $\phi(U_{\ell-1})^T \Lambda_\ell\, \phi(U_{\ell-1})$ and $\phi(U_{\ell-1})^T \Lambda_\ell V_\ell$, does not require explicit extraction of image patches. These matrices can be computed by taking the autocorrelation of the image/feature map, i.e. a convolution operation in which we treat the image/feature map as both the inputs and the weights (see Appendix B for details).
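The patch-flattening view can be checked numerically. The sketch below works in 1D with a single channel and circular padding for simplicity (our assumptions, not the paper's setup); it shows that a convolution equals a matrix product between the patch matrix and the flattened kernel.

```python
import numpy as np

def conv1d_circular(x, w):
    """1-D convolution of signal x with kernel w, circular boundaries."""
    W, S = len(x), len(w)
    h = (S - 1) // 2
    return np.array([sum(w[d + h] * x[(u + d) % W] for d in range(-h, h + 1))
                     for u in range(W)])

def patch_matrix(x, S):
    """X'_{u,delta} = x_{(u+delta) mod W}: the S-wide patch at each location."""
    W = len(x)
    h = (S - 1) // 2
    return np.array([[x[(u + d) % W] for d in range(-h, h + 1)]
                     for u in range(W)])

rng = np.random.default_rng(0)
x = rng.normal(size=8)
w = rng.normal(size=3)
# convolution == patch matrix @ flattened kernel, i.e. Y = X' W
assert np.allclose(conv1d_circular(x, w), patch_matrix(x, 3) @ w)
```

The memory issue discussed above is visible here: `patch_matrix` stores `S` copies of every input location, which is exactly what the autocorrelation trick of Appendix B avoids.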

2.3. PRIORS

We consider four priors in this work, which we refer to using the class names in the BNN library published with this paper. We are careful to ensure that all parameters in the model have a prior and approximate posterior, which is necessary to ensure that ELBOs are comparable across models. First, we consider a Gaussian prior with fixed scale, NealPrior, $S_\ell = I$, so named because it is necessary to obtain meaningful results when considering infinite networks (Neal, 1996), though it bears strong similarities to the "He" initialisation (He et al., 2015). NealPrior is defined so as to ensure that the activations retain a sensible scale as they propagate through the network. We compare this to the standard $\mathcal{N}(0, 1)$ prior (StandardPrior), $S_\ell = N_{\ell-1} I$, which causes the activations to increase exponentially as they propagate through network layers (see Eq. 2). Next, we consider ScalePrior, which defines a prior and approximate posterior over the scale,

$$S_\ell = \tfrac{1}{s_\ell} I, \qquad P(s_\ell) = \mathrm{Gamma}(s_\ell; 2, 2), \qquad Q(s_\ell) = \mathrm{Gamma}(s_\ell; 2 + \alpha_\ell, 2 + \beta_\ell), \tag{15}$$

where we parameterise the Gamma distribution with shape and rate parameters, and $\alpha_\ell$ and $\beta_\ell$ are non-negative learned parameters of the approximate posterior over $s_\ell$. Finally, we consider SpatialIWPrior, which allows for spatial correlations in the weights. In particular, we take the covariance to be the Kronecker product of an identity matrix over the channel dimensions and an inverse-Wishart-distributed matrix, $L_\ell$, over the spatial dimensions,

$$S_\ell = I \otimes L_\ell, \qquad P(L_\ell) = \mathrm{InverseWishart}\!\left(L_\ell;\, (N_{\ell-1}+1) I,\, N_{\ell-1}+1\right), \qquad Q(L_\ell) = \mathrm{InverseWishart}\!\left(L_\ell;\, (N_{\ell-1}+1) I + \Psi,\, N_{\ell-1}+1+\nu\right), \tag{16}$$

where the non-negative real number, $\nu$, and the positive definite matrix, $\Psi$, are learned parameters of the approximate posterior (see Appendix C).

3. EXTENSION TO DGPS

It is a remarkable but underappreciated fact that BNNs are special cases of DGPs, with a particular choice of kernel (Louizos & Welling, 2016; Aitchison, 2019). Combining Eqs. (1) and (2),

$$P(F_\ell \mid F_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(f_\lambda^\ell;\, \mathbf{0},\, K_\ell(F_{\ell-1})\right), \qquad K_\ell(F_{\ell-1}) = \tfrac{1}{N_{\ell-1}}\, \phi(F_{\ell-1})\, S_\ell\, \phi(F_{\ell-1})^T. \tag{17}$$

Here, we generalise our approximate posterior to the DGP case and link to the DGP literature. In a DGP there are no weights; instead we work directly with inducing outputs $\{U_\ell\}_{\ell=1}^{L+1}$,

$$P(U_\ell \mid U_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(u_\lambda^\ell;\, \mathbf{0},\, K_\ell(U_{\ell-1})\right). \tag{18}$$

[Figure 1: predictive distributions on the toy 1D regression problem (true function and predictions) for factorised VI, local inducing, global inducing, and Hamiltonian Monte Carlo.]

Note that here, we take the "global" inducing approach of using the inducing outputs from the previous layer, $U_{\ell-1}$, as the inducing inputs for the next layer. In this case, we need only learn the original inducing inputs, $U_0$. This contrasts with the standard "local" inducing formulation (as in Salimbeni & Deisenroth, 2017), which learns separate inducing inputs at every layer, $Z_{\ell-1}$, giving $P(U_\ell \mid Z_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}(u_\lambda^\ell; \mathbf{0}, K_\ell(Z_{\ell-1}))$. As usual in DGPs (Salimbeni & Deisenroth, 2017), the approximate posterior over $U_\ell$ induces an approximate posterior on $F_\ell$ through the prior correlations. However, it is important to remember that underneath the tractable distributions in Eqs. (17) and (18), there is an infinite-dimensional GP-distributed function, $\mathcal{F}_\ell$, such that $F_\ell = \mathcal{F}_\ell(F_{\ell-1})$. Standard local inducing point methods specify a factorised approximate posterior over $\mathcal{F}_\ell$ by specifying the function's inducing outputs, $U_\ell = \mathcal{F}_\ell(Z_{\ell-1})$, at a finite number of inducing input locations, $Z_{\ell-1}$. Importantly, the approximate posterior over a function, $\mathcal{F}_\ell$, depends only on $Z_{\ell-1}$ and $U_\ell$. Thus, standard, local inducing, DGP approaches (e.g.
Salimbeni & Deisenroth, 2017) give a layerwise-independent approximate posterior over $\{\mathcal{F}_\ell\}_{\ell=1}^{L+1}$, as they treat the inducing inputs, $\{Z_{\ell-1}\}_{\ell=1}^{L+1}$, as fixed, learned parameters and use a layerwise-independent approximate posterior over $\{U_\ell\}_{\ell=1}^{L+1}$ (Appendix F). Next, we need to choose the approximate posterior on $\{U_\ell\}_{\ell=1}^{L+1}$. However, if our goal is to introduce dependence across layers, it seems inappropriate to use the standard layerwise-independent approximate posterior over $\{U_\ell\}_{\ell=1}^{L+1}$. Indeed, in Appendix F, we show that such a posterior implies that functions in non-adjacent layers (e.g. $\mathcal{F}_\ell$ and $\mathcal{F}_{\ell+2}$) are marginally independent, even with global inducing points. To obtain more appropriate approximate posteriors, we derive the optimal top-layer posterior for DGPs, which involves GP regression from activations propagated from lower layers onto the output data (Appendix E). Inspired by the form of the optimal posterior, we again define an approximate posterior by taking the product of the prior and a "pseudo-likelihood",

$$Q(U_\ell \mid U_{\ell-1}) \propto \prod_{\lambda=1}^{N_\ell} P(u_\lambda^\ell \mid U_{\ell-1})\, \mathcal{N}\!\left(v_\lambda^\ell;\, u_\lambda^\ell,\, \Lambda_\ell^{-1}\right),$$

$$Q(U_\ell \mid U_{\ell-1}) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(u_\lambda^\ell;\, \Sigma_\ell^{u} \Lambda_\ell v_\lambda^\ell,\, \Sigma_\ell^{u}\right), \qquad \Sigma_\ell^{u} = \left(K_\ell^{-1}(U_{\ell-1}) + \Lambda_\ell\right)^{-1}, \tag{19}$$

where $v_\lambda^\ell$ and $\Lambda_\ell$ are learned parameters, and in our global inducing method, the inducing inputs, $U_{\ell-1}$, are propagated from lower layers (Eq. 18). Importantly, setting the inducing inputs to the training data and $v_\lambda^{L+1} = y_\lambda$, the approximate posterior captures the optimal top-layer posterior for regression (Appendix E). Under this approximate posterior, dependencies in $U_\ell$ naturally arise across all layers, and hence there are dependencies between the functions $\mathcal{F}_\ell$ at all layers (Appendix F). In summary, we propose an approximate posterior over inducing outputs that takes the form

$$Q\!\left(\{U_\ell\}_{\ell=1}^{L+1}\right) = \prod_{\ell=1}^{L+1} Q(U_\ell \mid U_{\ell-1}). \tag{20}$$

As before, the parameters of this approximate posterior are the global inducing inputs, $U_0$, and the pseudo-data and precisions at all layers, $\{V_\ell, \Lambda_\ell\}_{\ell=1}^{L+1}$.
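Each conditional factor of Eq. (19) is cheap to compute once the kernel matrix at the propagated inducing inputs is available. A minimal sketch, with a toy feature-based kernel and names of our own choosing:

```python
import numpy as np

def dgp_inducing_posterior(K, v, Lam):
    """One factor of Q(U_l | U_{l-1}): prior N(u; 0, K) times pseudo-
    likelihood N(v; u, Lam^{-1}), giving N(u; Sigma_u Lam v, Sigma_u)
    with Sigma_u = (K^{-1} + Lam)^{-1}."""
    Sigma_u = np.linalg.inv(np.linalg.inv(K) + Lam)
    return Sigma_u @ Lam @ v, Sigma_u

rng = np.random.default_rng(0)
phi = rng.normal(size=(12, 20))            # propagated features at M = 12 points
K = phi @ phi.T / 20 + 1e-6 * np.eye(12)   # toy linear kernel, jittered
v = rng.normal(size=12)

# high pseudo-precision: the posterior mean is pulled onto the pseudo-data
mean_hi, _ = dgp_inducing_posterior(K, v, 1e6 * np.eye(12))
# low pseudo-precision: the posterior reverts to the zero-mean prior
mean_lo, _ = dgp_inducing_posterior(K, v, 1e-6 * np.eye(12))
```

The two limiting cases illustrate how the learned precisions $\Lambda_\ell$ interpolate between trusting the pseudo-data and falling back on the prior.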
Our ELBO, which we derive in Appendix D, takes the form

$$\mathcal{L} = \mathbb{E}_{Q(\{F_\ell, U_\ell\}_{\ell=1}^{L+1} \mid X, U_0)}\!\left[\log P(Y \mid F_{L+1}) + \sum_{\ell=1}^{L+1} \log \frac{P(U_\ell \mid U_{\ell-1})}{Q(U_\ell \mid U_{\ell-1})}\right].$$

We provide a full description of our method as applied to DGPs in Appendix D, and an analysis of its asymptotic complexity in Appendix L.

4. RESULTS

We describe our experiments and results to assess the performance of global inducing points ('gi') against local inducing points ('li') and the fully factorised ('fac') approximation family. We additionally consider models where we use one method up to the last layer and another for the last layer, which may have computational advantages; we denote such models 'method1 → method2'.

4.1. UNCERTAINTY IN 1D REGRESSION

We demonstrate the use of local and global inducing point methods on a toy 1D regression problem, comparing them with fully factorised VI and Hamiltonian Monte Carlo (HMC; Neal et al., 2011). Following Hernández-Lobato & Adams (2015), we generate 40 input-output pairs $(x, y)$ with the inputs $x$ sampled i.i.d. from $U([-4, -2] \cup [2, 4])$ and the outputs generated by $y = x^3 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 3^2)$; we then normalise the inputs and outputs. Note that we have introduced a 'gap' in the inputs, following recent work (Foong et al., 2019b; Yao et al., 2019; Foong et al., 2019a) that identifies the ability to express 'in-between' uncertainty as an important quality of approximate inference algorithms. We evaluated the inference algorithms using fully-connected BNNs with 2 hidden layers of 50 ReLU units, using the NealPrior. For the inducing point methods, we used 100 inducing points per layer. The predictive distributions for the toy experiment can be seen in Fig. 1. We observe that of the variational methods, the global inducing method produces predictive distributions closest to HMC, with good uncertainty in the gap. Meanwhile, factorised and local inducing fit the training data, but do not produce reasonable error bars: this demonstrates an important limitation of methods lacking correlation structure between layers. We provide additional experiments looking at the effect of the number of inducing points in Appendix H, and experiments looking at compositional uncertainty (Ustyuzhaninov et al., 2020) in both BNNs and DGPs for 1D regression in Appendix I.

4.2. DEPTH-DEPENDENCE IN DEEP LINEAR NETWORKS

The lack of correlations between layers might be expected to become more problematic in deeper networks. To isolate the effect of depth on different approximate posteriors, we considered deep linear networks trained on data generated from a toy linear model: 5 input features were mapped to 1 output feature, where the 1000 training and 100 test inputs are drawn i.i.d. from a standard Gaussian, and the true outputs are generated using a weight-vector drawn i.i.d. from a Gaussian with variance 1/5, with noise variance 0.1. We can evaluate the model evidence under the true data-generating process, which forms an upper bound (in expectation) on the model evidence and ELBO for all models. We found that the ELBO for methods that factorise across layers - factorised and local inducing - drops rapidly as networks get deeper and wider (Fig. 2). This is undesirable behaviour, as we know that wide, deep networks are necessary for good performance on difficult machine learning tasks. In contrast, we found that methods with global inducing points at the last layer decay much more slowly with depth, and perform better as networks get wider. Remarkably, global inducing points gave good performance even with lower-layer weights drawn at random from the prior, which is not possible for any method that factorises across layers. We believe that fac → gi performed poorly at width 250 due to optimisation issues, as rand → gi performs better despite being a special case of fac → gi.

4.3. REGRESSION BENCHMARK: UCI

We benchmark our methods on the UCI datasets from Hernández-Lobato & Adams (2015), popular benchmark regression datasets for BNNs and DGPs. Following the standard approach (Gal & Ghahramani, 2015), each dataset uses 20 train-test 'splits' (except for protein, with 5 splits), and the inputs and outputs are normalised to have zero mean and unit standard deviation. We focus on the five smallest datasets, as we expect Bayesian methods to be most relevant in small-data settings (see App. J and M for all datasets). We consider two-layer fully-connected ReLU networks, using fully factorised and global inducing approximating families, as well as two- and five-layer DGPs. For the BNNs, we consider the standard $\mathcal{N}(0, 1)$ prior and ScalePrior. We display ELBOs and average test log likelihoods for the un-normalised data in Fig. 3, where the dots and error bars represent the means and standard errors over the test splits, respectively. We observe that global inducing obtains better ELBOs than factorised and DSVI (doubly stochastic variational inference; Salimbeni & Deisenroth, 2017) in every case, indicating that it does indeed approximate the true posterior better (since the ELBO is the marginal likelihood minus the KL to the posterior). However, this does not always translate to a better test log likelihood due to model misspecification, as we see that occasionally DSVI outperforms global inducing by a very small margin. The very poor results for factorised with ScalePrior indicate that it has difficulty learning useful prior hyperparameters for prediction, which is due to the looseness of its bound on the marginal likelihood. We provide experimental details, as well as additional results with additional architectures, priors, datasets, and RMSEs, in Appendices J and M, for BNNs and DGPs, respectively.

4.4. CONVOLUTIONAL BENCHMARK: CIFAR-10

For CIFAR-10, we considered a ResNet-inspired model consisting of conv2d-relu-block-avgpool2-block-avgpool2-block-avgpool-linear, where the ResNet blocks consist of a shortcut connection in parallel with conv2d-relu-conv2d-relu, using 32 channels in all layers. In all our experiments, we used no data augmentation and 500 inducing points. Our training scheme (see App. O) ensured that our results did not reflect a 'cold posterior' (Wenzel et al., 2020). Our results are shown in Table 1. We achieved a remarkable 86.7% predictive accuracy, with global inducing points used for all layers, and with a spatial inverse Wishart prior (SpatialIWPrior) on the weights. These results compare very favourably with comparable Bayesian approaches, i.e. those without data augmentation or posterior sharpening: past work with deep GPs obtained 80.3% (Shi et al., 2019), and work using infinite-width neural networks to define a GP obtained 81.4% accuracy (Li et al., 2019). Remarkably, with only 500 inducing points we approach the accuracy of sampling-based methods (Wenzel et al., 2020), which are in principle able to more closely approximate the true posterior. Furthermore, we see that global inducing performs best in terms of ELBO (per datapoint) by a wide margin, demonstrating that it gets far closer to the true posterior than the other methods. We provide additional results on uncertainty calibration in Appendix K.

5. RELATED WORK

Louizos & Welling (2016) attempted to use pseudo-data along with matrix-variate Gaussians to form an approximate posterior for BNNs; however, they restricted their analysis to BNNs, and it is not clear how their method could be applied to DGPs. Their approach factorises across layers, thus missing the important layerwise correlations that we obtain. Moreover, they encountered an important limitation: the BNN prior implies that $U_\ell$ is low-rank, and it is difficult to design an approximate posterior capturing this constraint. As such, they were forced to use $M < N_\ell$ inducing points, which is particularly problematic in the convolutional, global-inducing case, where there are many patches (points) in each inducing image input. Note that some work on BNNs reports better performance on datasets such as CIFAR-10. However, to the best of our knowledge, no variational Bayesian method outperforms ours without modifying the BNN model or using some form of posterior tempering (Wenzel et al., 2020), where the KL term in the ELBO is down-weighted relative to the likelihood (Zhang et al., 2017; Bae et al., 2018; Osawa et al., 2019; Ashukha et al., 2020), which often increases the test accuracy. However, tempering clouds the Bayesian perspective, as the KL to the posterior is no longer minimised and the resulting objective is no longer a lower bound on the marginal likelihood. By contrast, we use the untempered ELBO, thereby retaining the full Bayesian perspective. Dusenberry et al. (2020) report better performance on CIFAR-10 without tempering, but only perform variational inference over a rank-1 perturbation to the weights, and maximise over all the other parameters; our approach retains a full-rank parameterisation of the weight matrices. Ustyuzhaninov et al. (2020) attempted to introduce dependence across layers in a deep GP by coupling inducing inputs to pseudo-outputs, which they term "inducing points as inducing locations".
However, as described above, the choice of approximate posterior over $U_\ell$ is also critical. They used the standard approximate posterior that is independent across layers, meaning that while functions in adjacent layers were marginally dependent, the functions for non-adjacent layers were independent (Appendix F). By contrast, our approximate posteriors have marginal dependencies across the $U_\ell$ and functions at all layers, and are capable of capturing the optimal top-layer posterior.

6. CONCLUSIONS

We derived optimal top-layer variational approximate posteriors for BNNs and deep GPs, and used them to develop generic, scalable approximate posteriors. These posteriors make use of global inducing points, which are learned only at the bottom layer and are propagated through the network. This leads to extremely flexible posteriors, which even allow the lower-layer weights to be drawn from the prior. We showed that these global inducing variational posteriors lead to improved performance with better ELBOs, and state-of-the-art performance for variational BNNs on CIFAR-10.

A REPARAMETERISED VARIATIONAL INFERENCE

In variational inference, the ELBO objective takes the form

$$\mathcal{L}(\phi) = \mathbb{E}_{Q_\phi(w)}\!\left[\log P(Y \mid X, w) + \log \frac{P(w)}{Q_\phi(w)}\right],$$

where $w$ is a vector containing all of the elements of the weight matrices in the full network, $\{W_\ell\}_{\ell=1}^{L+1}$, and $\phi = (U_0, \{V_\ell, \Lambda_\ell\}_{\ell=1}^{L+1})$ are the parameters of the approximate posterior. This objective is difficult to differentiate with respect to $\phi$, because $\phi$ parameterises the distribution over which the expectation is taken. Following Kingma & Welling (2013) and Rezende et al. (2014), we sample noise $\epsilon$ from a simple, fixed distribution (e.g. i.i.d. Gaussian) and transform it to give samples from $Q_\phi(w)$,

$$w(\epsilon; \phi) \sim Q_\phi(w).$$

Thus, the ELBO can be written

$$\mathcal{L}(\phi) = \mathbb{E}_\epsilon\!\left[\log P(Y \mid X, w(\epsilon; \phi)) + \log \frac{P(w(\epsilon; \phi))}{Q_\phi(w(\epsilon; \phi))}\right].$$

As the distribution over which the expectation is taken is now independent of $\phi$, we can form unbiased estimates of the gradient of $\mathcal{L}(\phi)$ by drawing one or a few samples of $\epsilon$.
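The reparameterised estimator can be illustrated with a one-dimensional Gaussian $Q_\phi(w) = \mathcal{N}(\mu, \sigma^2)$; this is a toy sketch of our own, with the objective $\mathbb{E}[w^2]$ standing in for the ELBO integrand.

```python
import numpy as np

def sample_q(mu, log_sigma, eps):
    """w(eps; phi): deterministic transform of fixed noise eps ~ N(0, 1)
    into a sample from Q_phi(w) = N(mu, sigma^2)."""
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
mu, log_sigma = 0.5, np.log(0.3)
eps = rng.normal(size=100_000)
w = sample_q(mu, log_sigma, eps)

# pathwise gradient of E[w^2] wrt mu: E[2 w * dw/dmu] = E[2 w],
# which should match the exact gradient 2 * mu = 1.0
grad_est = np.mean(2.0 * w)
```

Because the noise distribution does not depend on $\phi$, the gradient passes through the transform, giving a low-variance unbiased estimate; automatic differentiation frameworks apply exactly this construction to the full network.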

B EFFICIENT CONVOLUTIONAL LINEAR REGRESSION

Working in one dimension for simplicity, the standard form for a convolution in deep learning is

$$Y_{iu,c'} = \sum_{c,\delta} X_{i(u+\delta)c}\, W_{\delta c, c'},$$

where $X$ is the input image/feature-map, $Y$ is the output feature-map, $W$ contains the convolutional weights, $i$ indexes images, $c$ and $c'$ index channels, $u$ indexes the location within the image, and $\delta$ indexes the location within the convolutional patch. Later, we will swap the identity of the "patch location" and the "image location", and to facilitate this, we define both to be centred on zero,

$$u \in \{-(W-1)/2, \dots, (W-1)/2\}, \qquad \delta \in \{-(S-1)/2, \dots, (S-1)/2\},$$

where $W$ is the size of an image and $S$ is the size of a patch, such that, for example, for a size-3 kernel, $\delta \in \{-1, 0, 1\}$. To use the expressions for the fully-connected case, we can form a new input, $X'$, by cutting out each image patch, $X'_{iu,\delta c} = X_{i(u+\delta)c}$, leading to

$$Y_{iu,c'} = \sum_{\delta c} X'_{iu,\delta c}\, W_{\delta c, c'}.$$

Note that we have used commas to group pairs of indices ($iu$ and $\delta c$) that may be combined into a single index (e.g. using a reshape operation). Indeed, combining $i$ and $u$ into a single index and combining $\delta$ and $c$ into a single index, this expression can be viewed as standard matrix multiplication,

$$Y = X' W.$$

This means that we can directly apply the approximate posterior we derived for the fully-connected case in Eq. (9) to the convolutional case. To allow for this, we take

$$X' = \Lambda_\ell^{1/2}\, \phi(U_{\ell-1}), \qquad Y = \Lambda_\ell^{1/2}\, V_\ell.$$

While we can explicitly compute $X'$ by extracting image patches, this imposes a very large memory cost (a factor of 9 for a 3 × 3 kernel with stride 1, because there are roughly as many patches as pixels in the image, and a 3 × 3 patch requires 9 times the storage of a pixel). To implement convolutional linear regression with a more manageable memory cost, we instead compute the matrices required for linear regression directly as convolutions of the input feature-maps, $X$, with themselves, and as convolutions of $X$ with the output feature-maps, $Y$, which we describe here.
For linear regression (Eq. 9), we first need to compute

$$\left(\phi(U_{\ell-1})^T \Lambda_\ell V_\ell\right)_{\delta c, c'} = \left(X'^T Y\right)_{\delta c, c'} = \sum_{iu} X'_{iu,\delta c}\, Y_{iu,c'}.$$

Rewriting this in terms of $X$ (i.e. without explicitly cutting out image patches), we obtain

$$\left(X'^T Y\right)_{\delta c, c'} = \sum_{iu} X_{i(u+\delta)c}\, Y_{iu,c'}.$$

This can be viewed directly as a convolution of $X$ and $Y$, where we treat $Y$ as the "convolutional weights", $u$ as the location within the now very large (size $W$) "convolutional patch", and $\delta$ as the location in the resulting output. Once we realise that the computation is a spatial convolution, it is possible to fit it into the standard convolution functions provided by deep-learning frameworks (albeit with some rearrangement of the tensors). Next, we need to compute

$$\left(\phi(U_{\ell-1})^T \Lambda_\ell\, \phi(U_{\ell-1})\right)_{\delta c, \delta' c'} = \left(X'^T X'\right)_{\delta c, \delta' c'} = \sum_{iu} X'_{iu,\delta c}\, X'_{iu,\delta' c'}.$$

Again, rewriting this in terms of $X$ (i.e. without explicitly cutting out image patches), we obtain

$$\left(X'^T X'\right)_{\delta c, \delta' c'} = \sum_{iu} X_{i(u+\delta)c}\, X_{i(u+\delta')c'}.$$

To treat this as a convolution, we first need exact translational invariance, which can be achieved by using circular boundary conditions. Note that circular boundary conditions are not typically used in neural networks for images, and we therefore use them only to define the approximate posterior over weights. The variational framework does not restrict us to also using circular boundary conditions within our feedforward network, and as such, we use standard zero-padding there. With exact translational invariance, we can write this expression directly as a convolution,

$$\left(X'^T X'\right)_{\delta c, \delta' c'} = \sum_{iu} X_{iuc}\, X_{i(u+\delta'-\delta)c'}, \qquad (\delta' - \delta) \in \{-(S-1), \dots, (S-1)\},$$

i.e. for a size-3 kernel, $(\delta' - \delta) \in \{-2, -1, 0, 1, 2\}$, where we treat $X_{iuc}$ as the "convolutional weights", $u$ as the location within the "convolutional patch", and $\delta' - \delta$ as the location in the resulting output "feature-map". Finally, note that this form offers considerable benefits in terms of memory consumption.
In particular, the output matrices are usually quite small: the number of channels is typically 32 or 64, and the number of locations within a patch is typically 9, giving a very manageable total size that is typically smaller than 1000 × 1000.
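The autocorrelation identity above can be verified numerically. The sketch below works in 1D with one image and one channel under circular boundary conditions (our simplifications); it checks that the Gram matrix $X'^T X'$ built from explicit patches matches the one built from the autocorrelation of the raw signal.

```python
import numpy as np

rng = np.random.default_rng(0)
W, S = 8, 3                       # image width, patch size
h = (S - 1) // 2
x = rng.normal(size=W)            # one image, one channel

# explicit patch matrix X'_{u,delta} = x_{(u+delta) mod W}
Xp = np.array([[x[(u + d) % W] for d in range(-h, h + 1)] for u in range(W)])
gram_patches = Xp.T @ Xp          # (S, S) matrix needed for linear regression

# autocorrelation: auto[k] = sum_u x_u x_{(u+k) mod W}, with k = delta' - delta
auto = np.array([np.sum(x * np.roll(x, -k)) for k in range(-(S - 1), S)])
gram_auto = np.array([[auto[(dp - d) + (S - 1)] for dp in range(-h, h + 1)]
                      for d in range(-h, h + 1)])
```

Only $2S - 1$ autocorrelation values are needed to fill the whole $S \times S$ Gram matrix, which is the source of the memory savings described above.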

C WISHART DISTRIBUTIONS WITH REAL-VALUED DEGREES OF FREEDOM

The classical description of the Wishart distribution,
\[
\boldsymbol{\Sigma} \sim \text{Wishart}(\mathbf{I}, \nu),
\]
where Σ is a P × P matrix, requires ν ≥ P to be an integer, in which case we can generate Σ by taking the product of matrices $\mathbf{X} \in \mathbb{R}^{\nu \times P}$ with entries generated IID from a standard Gaussian,
\[
\boldsymbol{\Sigma} = \mathbf{X}^T \mathbf{X}, \qquad X_{ij} \sim \mathcal{N}(0, 1).
\]
However, for the purposes of defining learnable approximate posteriors, we need to be able to sample and evaluate the probability density when ν is a positive real. To do this, consider the alternative, much more efficient means of sampling from a Wishart distribution using the Bartlett decomposition (Bartlett, 1933), which gives the probability density for the Cholesky factor of a Wishart sample. In particular, for the upper-triangular factor
\[
\mathbf{T} = \begin{pmatrix} T_{11} & \cdots & T_{1m} \\ & \ddots & \vdots \\ 0 & & T_{mm} \end{pmatrix}, \qquad
P\!\left(T_{jj}^2\right) = \text{Gamma}\!\left(T_{jj}^2;\ \tfrac{\nu-j+1}{2},\ \tfrac{1}{2}\right), \qquad
P\!\left(T_{jk}\right) = \mathcal{N}\!\left(T_{jk};\ 0,\ 1\right) \ \text{for } j < k.
\]
Here, $T_{jj}^2$ is usually considered to be sampled from a χ² distribution, but we have generalised this slightly using the equivalent Gamma distribution to allow for real-valued ν. Following Chafaï (2015), we change variables from $T_{jj}^2$ to $T_{jj}$,
\[
P\!\left(T_{jj}\right) = P\!\left(T_{jj}^2\right) \left|\frac{\partial T_{jj}^2}{\partial T_{jj}}\right|
= \frac{\left(T_{jj}^2\right)^{(\nu-j+1)/2 - 1} e^{-T_{jj}^2/2}}{2^{(\nu-j+1)/2}\, \Gamma\!\left(\tfrac{\nu-j+1}{2}\right)}\, 2 T_{jj}
= \frac{T_{jj}^{\nu-j}\, e^{-T_{jj}^2/2}}{2^{(\nu-j-1)/2}\, \Gamma\!\left(\tfrac{\nu-j+1}{2}\right)}.
\]
Thus, the probability density for T under the Bartlett sampling operation is
\[
P(\mathbf{T}) = \underbrace{\prod_j \frac{T_{jj}^{\nu-j}\, e^{-T_{jj}^2/2}}{2^{(\nu-j-1)/2}\, \Gamma\!\left(\tfrac{\nu-j+1}{2}\right)}}_{\text{on-diagonals}}\ \underbrace{\prod_j \prod_{k=j+1}^{m} \frac{1}{\sqrt{2\pi}}\, e^{-T_{jk}^2/2}}_{\text{off-diagonals}}.
\]
To convert this to a distribution on Σ, we need the volume element for the transformation from T to Σ,
\[
d\boldsymbol{\Sigma} = 2^m \prod_{j=1}^m T_{jj}^{m-j+1}\, d\mathbf{T},
\]
which can be obtained directly by computing the log-determinant of the Jacobian for the transformation from T to Σ, or by taking the ratio of the density above and the usual Wishart probability density (with integer ν). Thus,
\[
P(\boldsymbol{\Sigma}) = P(\mathbf{T}) \left(2^m \prod_{j=1}^m T_{jj}^{m-j+1}\right)^{-1}
= \prod_j \frac{T_{jj}^{\nu-m-1}\, e^{-T_{jj}^2/2}}{2^{(\nu-j+1)/2}\, \Gamma\!\left(\tfrac{\nu-j+1}{2}\right)} \prod_{k=j+1}^{m} \frac{1}{\sqrt{2\pi}}\, e^{-T_{jk}^2/2}.
\]
Breaking this down into separate components and performing straightforward algebraic manipulations,
\[
\prod_j T_{jj}^{\nu-m-1} = |\mathbf{T}|^{\nu-m-1} = |\boldsymbol{\Sigma}|^{(\nu-m-1)/2},
\]
\[
\prod_j e^{-T_{jj}^2/2} \prod_{k=j+1}^{m} e^{-T_{jk}^2/2} = e^{-\sum_{jk} T_{jk}^2/2} = e^{-\text{Tr}(\boldsymbol{\Sigma})/2},
\]
\[
\prod_j \left(\tfrac{1}{2}\right)^{(\nu-j+1)/2} \prod_{k=j+1}^{m} \tfrac{1}{\sqrt{2}}
= \prod_j \left(\tfrac{1}{2}\right)^{(\nu-j+1)/2} \left(\tfrac{1}{2}\right)^{(m-j)/2}
= \prod_j \left(\tfrac{1}{2}\right)^{(\nu+m+1-2j)/2}
= 2^{-m\nu/2},
\]
where the last equality follows because $\sum_{j=1}^m (\nu+m+1-2j) = m(\nu+m+1) - m(m+1) = m\nu$. Finally, using the definition of the multivariate Gamma function,
\[
\prod_j \Gamma\!\left(\tfrac{\nu-j+1}{2}\right) \prod_{k=j+1}^{m} \sqrt{\pi} = \pi^{m(m-1)/4} \prod_j \Gamma\!\left(\tfrac{\nu-j+1}{2}\right) = \Gamma_m\!\left(\tfrac{\nu}{2}\right).
\]
We thereby re-obtain the probability density for the standard Wishart distribution,
\[
P(\boldsymbol{\Sigma}) = \frac{|\boldsymbol{\Sigma}|^{(\nu-m-1)/2}\, e^{-\text{Tr}(\boldsymbol{\Sigma})/2}}{2^{m\nu/2}\, \Gamma_m\!\left(\tfrac{\nu}{2}\right)}.
\]
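The Bartlett sampling scheme above is easy to implement for real-valued ν. The following is a hedged numpy sketch (our own function names, 0-indexed rows so the Gamma shape is (ν − j)/2; not the paper's code):

```python
import numpy as np

# Sample Sigma ~ Wishart(I, nu) for real-valued nu > m - 1 via the
# Bartlett decomposition: squared diagonals are Gamma((nu-j+1)/2, 1/2)
# (1-indexed j), off-diagonals are standard normal, and Sigma = T T^T.
def sample_wishart_bartlett(nu, m, rng):
    T = np.zeros((m, m))
    for j in range(m):                  # 0-indexed, so shape is (nu - j)/2
        # rate 1/2 corresponds to scale 2 in numpy's parameterisation
        T[j, j] = np.sqrt(rng.gamma(shape=(nu - j) / 2.0, scale=2.0))
        T[j, :j] = rng.normal(size=j)   # off-diagonal entries ~ N(0, 1)
    return T @ T.T

rng = np.random.default_rng(0)
nu, m = 3.7, 2                          # real-valued degrees of freedom
samples = [sample_wishart_bartlett(nu, m, rng) for _ in range(20000)]
mean = np.mean(samples, axis=0)

# E[Sigma] = nu * I for a Wishart(I, nu) distribution.
assert np.allclose(mean, nu * np.eye(m), atol=0.1)
```

Note that this requires ν > m − 1 so that every Gamma shape parameter is positive, exactly the usual existence condition for the density.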

D FULL DESCRIPTION OF OUR METHOD FOR DEEP GAUSSIAN PROCESS

Here we give the full derivation for our doubly-stochastic variational deep GPs, following Salimbeni & Deisenroth (2017). A deep Gaussian process (DGP; Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017) defines a prior over function values $\mathbf{F}^\ell \in \mathbb{R}^{P \times N_\ell}$, where $\ell$ is the layer, P is the number of input points, and $N_\ell$ is the "width" of this layer, by stacking L + 1 layers of standard Gaussian processes (we use L + 1 layers instead of the typical L layers to retain consistency with how we define BNNs):
\[
P\!\left(\mathbf{Y}, \{\mathbf{F}^\ell\}_{\ell=1}^{L+1} \,\middle|\, \mathbf{X}\right) = P\!\left(\mathbf{Y} \,\middle|\, \mathbf{F}^{L+1}\right) \prod_{\ell=1}^{L+1} P\!\left(\mathbf{F}^\ell \,\middle|\, \mathbf{F}^{\ell-1}\right).
\]
Here, the input is $\mathbf{F}^0 = \mathbf{X}$, the output is Y (which could be continuous values for regression, or class labels for classification), and the distribution over $\mathbf{F}^\ell$ factorises into independent multivariate Gaussian distributions over each function,
\[
P\!\left(\mathbf{F}^\ell \,\middle|\, \mathbf{F}^{\ell-1}\right) = \prod_{\lambda=1}^{N_\ell} P\!\left(\mathbf{f}^\ell_\lambda \,\middle|\, \mathbf{F}^{\ell-1}\right) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(\mathbf{f}^\ell_\lambda;\ \mathbf{0},\ \mathbf{K}\!\left(\mathbf{F}^{\ell-1}\right)\right),
\]
where $\mathbf{f}^\ell_\lambda$ is the λth column of $\mathbf{F}^\ell$, giving the activations of all datapoints for the λth feature, and $\mathbf{K}(\cdot)$ is a function that computes the kernel matrix from the features at the previous layer. To define a variational approximate posterior, we augment $\{\mathbf{F}^\ell\}_{\ell=1}^{L+1}$ with inducing points consisting of the function values $\{\mathbf{U}^\ell\}_{\ell=1}^{L+1}$,
\[
P\!\left(\mathbf{Y}, \{\mathbf{F}^\ell, \mathbf{U}^\ell\}_{\ell=1}^{L+1} \,\middle|\, \mathbf{X}, \mathbf{U}^0\right) = P\!\left(\mathbf{Y} \,\middle|\, \mathbf{F}^{L+1}\right) \prod_{\ell=1}^{L+1} P\!\left(\mathbf{F}^\ell, \mathbf{U}^\ell \,\middle|\, \mathbf{F}^{\ell-1}, \mathbf{U}^{\ell-1}\right),
\]
and because $\mathbf{F}^\ell$ and $\mathbf{U}^\ell$ are the function outputs corresponding to different inputs ($\mathbf{F}^{\ell-1}$ and $\mathbf{U}^{\ell-1}$), they form a joint multivariate Gaussian distribution, analogous to the prior over $\mathbf{F}^\ell$ above,
\[
P\!\left(\mathbf{F}^\ell, \mathbf{U}^\ell \,\middle|\, \mathbf{F}^{\ell-1}, \mathbf{U}^{\ell-1}\right) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(\begin{pmatrix} \mathbf{f}^\ell_\lambda \\ \mathbf{u}^\ell_\lambda \end{pmatrix};\ \mathbf{0},\ \mathbf{K}\!\left(\begin{pmatrix} \mathbf{F}^{\ell-1} \\ \mathbf{U}^{\ell-1} \end{pmatrix}\right)\right).
\]
Following Salimbeni & Deisenroth (2017), we form an approximate posterior by conditioning the function values $\mathbf{F}^\ell$ on the inducing outputs $\mathbf{U}^\ell$,
\[
Q\!\left(\{\mathbf{F}^\ell, \mathbf{U}^\ell\}_{\ell=1}^{L+1} \,\middle|\, \mathbf{X}, \mathbf{U}^0\right) = \prod_{\ell=1}^{L+1} Q\!\left(\mathbf{F}^\ell, \mathbf{U}^\ell \,\middle|\, \mathbf{F}^{\ell-1}, \mathbf{U}^{\ell-1}\right) = \prod_{\ell=1}^{L+1} P\!\left(\mathbf{F}^\ell \,\middle|\, \mathbf{U}^\ell, \mathbf{U}^{\ell-1}, \mathbf{F}^{\ell-1}\right) Q\!\left(\mathbf{U}^\ell \,\middle|\, \mathbf{U}^{\ell-1}\right),
\]
where $Q(\mathbf{U}^\ell | \mathbf{U}^{\ell-1})$ is given by Eq. (19), i.e.
\[
Q\!\left(\mathbf{U}^\ell \,\middle|\, \mathbf{U}^{\ell-1}\right) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(\mathbf{u}^\ell_\lambda;\ \boldsymbol{\Sigma}^\ell_u \boldsymbol{\Lambda}^\ell \mathbf{v}^\ell_\lambda,\ \boldsymbol{\Sigma}^\ell_u\right), \qquad \boldsymbol{\Sigma}^\ell_u = \left(\mathbf{K}^{-1}\!\left(\mathbf{U}^{\ell-1}\right) + \boldsymbol{\Lambda}^\ell\right)^{-1},
\]
and $P(\mathbf{F}^\ell | \mathbf{U}^\ell, \mathbf{U}^{\ell-1}, \mathbf{F}^{\ell-1})$ is given by standard manipulations of the joint Gaussian above.
Importantly, the prior can be factorised in an analogous fashion,
\[
P\!\left(\mathbf{F}^\ell, \mathbf{U}^\ell \,\middle|\, \mathbf{F}^{\ell-1}, \mathbf{U}^{\ell-1}\right) = P\!\left(\mathbf{F}^\ell \,\middle|\, \mathbf{U}^\ell, \mathbf{U}^{\ell-1}, \mathbf{F}^{\ell-1}\right) P\!\left(\mathbf{U}^\ell \,\middle|\, \mathbf{U}^{\ell-1}\right).
\]
The full ELBO can be written as
\[
\mathcal{L} = \mathbb{E}_{Q\left(\{\mathbf{F}^\ell, \mathbf{U}^\ell\}_{\ell=1}^{L+1} \middle| \mathbf{X}, \mathbf{U}^0\right)}\!\left[ \log \frac{P\!\left(\mathbf{Y}, \{\mathbf{F}^\ell, \mathbf{U}^\ell\}_{\ell=1}^{L+1} \,\middle|\, \mathbf{X}, \mathbf{U}^0\right)}{Q\!\left(\{\mathbf{F}^\ell, \mathbf{U}^\ell\}_{\ell=1}^{L+1} \,\middle|\, \mathbf{X}, \mathbf{U}^0\right)} \right].
\]
Substituting the augmented prior, its factorisation, and the approximate posterior into this expression, the $P(\mathbf{F}^\ell | \mathbf{U}^\ell, \mathbf{U}^{\ell-1}, \mathbf{F}^{\ell-1})$ terms cancel and we obtain
\[
\mathcal{L} = \mathbb{E}_{Q\left(\{\mathbf{F}^\ell, \mathbf{U}^\ell\}_{\ell=1}^{L+1} \middle| \mathbf{X}, \mathbf{U}^0\right)}\!\left[ \log P\!\left(\mathbf{Y} \,\middle|\, \mathbf{F}^{L+1}\right) + \sum_{\ell=1}^{L+1} \log \frac{P\!\left(\mathbf{U}^\ell \,\middle|\, \mathbf{U}^{\ell-1}\right)}{Q\!\left(\mathbf{U}^\ell \,\middle|\, \mathbf{U}^{\ell-1}\right)} \right].
\]
Analogous to the BNN case, we evaluate the ELBO by alternately sampling the inducing function values $\mathbf{U}^\ell$ given $\mathbf{U}^{\ell-1}$ and propagating the data by sampling from $P(\mathbf{F}^\ell | \mathbf{U}^\ell, \mathbf{U}^{\ell-1}, \mathbf{F}^{\ell-1})$ (see Alg. 2). As in Salimbeni & Deisenroth (2017), since all likelihoods we use factorise across datapoints, we only need to sample from the marginals of the latter distribution to sample the propagated function values. As in the BNN case, the parameters of the approximate posterior are the global inducing inputs, $\mathbf{U}^0$, and the pseudo-data and precisions at all layers, $\{\mathbf{V}^\ell, \boldsymbol{\Lambda}^\ell\}_{\ell=1}^{L+1}$. We can similarly use standard reparameterised variational inference (Kingma & Welling, 2013; Rezende et al., 2014) to optimise these variational parameters, as $Q(\mathbf{U}^\ell | \mathbf{U}^{\ell-1})$ is Gaussian. We use Adam (Kingma & Ba, 2014) to optimise the parameters.

Algorithm 2: Global inducing points for deep Gaussian processes
Parameters: inducing inputs, $\mathbf{U}^0$; inducing outputs and precisions, $\{\mathbf{V}^\ell, \boldsymbol{\Lambda}^\ell\}_{\ell=1}^{L+1}$, at all layers.
DGP inputs (e.g. MNIST digits): $\mathbf{F}^0$. DGP outputs (e.g. classification logits): $\mathbf{F}^{L+1}$.
$\mathcal{L} \leftarrow 0$
for $\ell \in \{1, \dots, L+1\}$ do
  Compute the mean and covariance over the inducing outputs at this layer:
    $\boldsymbol{\Sigma}^\ell_u = \left(\mathbf{K}^{-1}(\mathbf{U}^{\ell-1}) + \boldsymbol{\Lambda}^\ell\right)^{-1}$
    $\mathbf{M}^\ell = \boldsymbol{\Sigma}^\ell_u \boldsymbol{\Lambda}^\ell \mathbf{V}^\ell$
  Sample the inducing outputs and accumulate the ELBO:
    $\mathbf{U}^\ell \sim \mathcal{N}\!\left(\mathbf{M}^\ell, \boldsymbol{\Sigma}^\ell_u\right) = Q(\mathbf{U}^\ell | \mathbf{U}^{\ell-1})$
    $\mathcal{L} \leftarrow \mathcal{L} + \log P(\mathbf{U}^\ell | \mathbf{U}^{\ell-1}) - \log \mathcal{N}\!\left(\mathbf{U}^\ell \,\middle|\, \mathbf{M}^\ell, \boldsymbol{\Sigma}^\ell_u\right)$
  Propagate the inputs using the sampled inducing outputs:
    $\mathbf{F}^\ell \sim P(\mathbf{F}^\ell | \mathbf{U}^\ell, \mathbf{U}^{\ell-1}, \mathbf{F}^{\ell-1})$
end for
$\mathcal{L} \leftarrow \mathcal{L} + \log P(\mathbf{Y} | \mathbf{F}^{L+1})$
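A single layer of this procedure can be sketched numerically. The following is a hedged numpy sketch (our own variable names, a squared-exponential kernel, and a diagonal precision are assumed; not the paper's implementation):

```python
import numpy as np

# One layer of the inducing-output posterior: given inducing inputs
# U_prev, pseudo-outputs V and diagonal precision Lam, form
#   Sigma_u = (K(U_prev)^{-1} + Lam)^{-1},   M = Sigma_u @ Lam @ V,
# then draw a sample of the inducing outputs U for the next layer.
def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def inducing_posterior(U_prev, V, lam_diag, jitter=1e-4):
    K = rbf_kernel(U_prev, U_prev) + jitter * np.eye(len(U_prev))
    Lam = np.diag(lam_diag)
    Sigma_u = np.linalg.inv(np.linalg.inv(K) + Lam)
    return Sigma_u @ Lam @ V, Sigma_u

rng = np.random.default_rng(0)
U_prev = rng.normal(size=(5, 2))        # 5 inducing points, 2-d inputs
V = rng.normal(size=(5, 1))             # pseudo-outputs for one feature
M, Sigma_u = inducing_posterior(U_prev, V, lam_diag=np.full(5, 10.0))
U = M + np.linalg.cholesky(Sigma_u) @ rng.normal(size=(5, 1))

# Sigma_u = (K^{-1} + Lam)^{-1} is dominated by Lam^{-1}, so the
# posterior variances cannot exceed 1/precision = 0.1.
assert np.all(np.diag(Sigma_u) <= 0.1 + 1e-9)
```

In practice one would work with Cholesky factors rather than explicit inverses for numerical stability; the explicit inverses here are purely for readability.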

E MOTIVATING THE APPROXIMATE POSTERIOR FOR DEEP GPS

Our original motivation for the approximate posterior was the BNN case, which we then extended to deep GPs. Here, we show how the same approximate posterior can be motivated from a deep GP perspective. As with the BNN case, we first derive the form of the optimal approximate posterior for the last layer, in the regression case. Without inducing points, the ELBO becomes
\[
\mathcal{L} = \mathbb{E}_{Q\left(\{\mathbf{F}^\ell\}_{\ell=1}^{L+1}\right)}\!\left[ \log \frac{P\!\left(\mathbf{Y}, \{\mathbf{F}^\ell\}_{\ell=1}^{L+1}\right)}{Q\!\left(\{\mathbf{F}^\ell\}_{\ell=1}^{L+1}\right)} \right],
\]
where we have defined a generic variational posterior $Q(\{\mathbf{F}^\ell\}_{\ell=1}^{L+1})$. Since we are interested in the form of $Q(\mathbf{F}^{L+1} | \{\mathbf{F}^\ell\}_{\ell=1}^{L})$, we rearrange the ELBO so that all terms that do not depend on $\mathbf{F}^{L+1}$ are absorbed into a constant, c:
\[
\mathcal{L} = \mathbb{E}_{Q\left(\{\mathbf{F}^\ell\}_{\ell=1}^{L+1}\right)}\!\left[ \log P\!\left(\mathbf{Y}, \mathbf{F}^{L+1} \,\middle|\, \{\mathbf{F}^\ell\}_{\ell=1}^{L}\right) - \log Q\!\left(\mathbf{F}^{L+1} \,\middle|\, \{\mathbf{F}^\ell\}_{\ell=1}^{L}\right) \right] + c.
\]
Some straightforward rearrangements lead to a similar form to before,
\[
\mathcal{L} = \mathbb{E}_{Q\left(\{\mathbf{F}^\ell\}_{\ell=1}^{L}\right)}\!\left[ \log P\!\left(\mathbf{Y} \,\middle|\, \{\mathbf{F}^\ell\}_{\ell=1}^{L}\right) - D_{\mathrm{KL}}\!\left( Q\!\left(\mathbf{F}^{L+1} \,\middle|\, \{\mathbf{F}^\ell\}_{\ell=1}^{L}\right) \,\middle\|\, P\!\left(\mathbf{F}^{L+1} \,\middle|\, \mathbf{Y}, \{\mathbf{F}^\ell\}_{\ell=1}^{L}\right) \right) \right] + c,
\]
from which we see that the optimal conditional posterior is given by
\[
Q\!\left(\mathbf{F}^{L+1} \,\middle|\, \{\mathbf{F}^\ell\}_{\ell=1}^{L}\right) = Q\!\left(\mathbf{F}^{L+1} \,\middle|\, \mathbf{F}^{L}\right) = P\!\left(\mathbf{F}^{L+1} \,\middle|\, \mathbf{Y}, \mathbf{F}^{L}\right),
\]
which has a closed form for regression: it is simply the standard GP posterior given training data Y at inputs $\mathbf{F}^L$. In particular, for the likelihood
\[
P\!\left(\mathbf{Y} \,\middle|\, \mathbf{F}^{L+1}, \boldsymbol{\Lambda}_{L+1}\right) = \prod_{\lambda=1}^{N_{L+1}} \mathcal{N}\!\left(\mathbf{y}_\lambda;\ \mathbf{f}^{L+1}_\lambda,\ \boldsymbol{\Lambda}^{-1}_{L+1}\right),
\]
where $\boldsymbol{\Lambda}_{L+1}$ is the precision, we have
\[
Q\!\left(\mathbf{F}^{L+1} \,\middle|\, \mathbf{F}^{L}\right) = \prod_{\lambda=1}^{N_{L+1}} \mathcal{N}\!\left(\mathbf{f}^{L+1}_\lambda;\ \boldsymbol{\Sigma}^{L+1}_f \boldsymbol{\Lambda}_{L+1} \mathbf{y}_\lambda,\ \boldsymbol{\Sigma}^{L+1}_f\right), \qquad \boldsymbol{\Sigma}^{L+1}_f = \left(\mathbf{K}^{-1}\!\left(\mathbf{F}^{L}\right) + \boldsymbol{\Lambda}_{L+1}\right)^{-1}.
\]
Finally, as is usual in GPs (Rasmussen & Williams, 2006), the predictive distribution for test points can be obtained by conditioning using this approximate posterior.

F COMPARING OUR DEEP GP APPROXIMATE POSTERIOR TO PREVIOUS WORK

The standard approach to inference in deep GPs (e.g. Salimbeni & Deisenroth, 2017) involves local inducing points and an approximate posterior over $\{\mathbf{U}^\ell\}_{\ell=1}^{L+1}$ that is factorised across layers,
\[
Q\!\left(\{\mathbf{U}^\ell\}_{\ell=1}^{L+1}\right) = \prod_{\ell=1}^{L+1} \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(\mathbf{u}^\ell_\lambda;\ \mathbf{m}^\ell_\lambda,\ \boldsymbol{\Sigma}^\ell_\lambda\right).
\]
This approximate posterior induces an approximate posterior over the underlying infinite-dimensional functions $f^\ell$ at each layer, which are implicitly used to propagate the data through the network via $\mathbf{F}^\ell = f^\ell(\mathbf{F}^{\ell-1})$. We show a graphical model summarising the standard approach in Fig. A1a. While, as Salimbeni & Deisenroth (2017) point out, the function values $\{\mathbf{F}^\ell\}_{\ell=1}^{L+1}$ are correlated, the functions $\{f^\ell\}_{\ell=1}^{L+1}$ themselves are independent across layers. We note that for BNNs, this is equivalent to having a posterior over weights that factorises across layers. One approach to introducing dependencies across layers for the functions would be to introduce the notion of global inducing points, propagating the initial $\mathbf{U}^0$ through the model. In fact, as we note in the Related Work section, Ustyuzhaninov et al. (2020) independently proposed this approach to introducing dependencies, using a toy problem to motivate it. However, they kept the form of the approximate posterior the same as in the standard approach (Eq. 18); we show the corresponding graphical model in Fig. A1b. The graphical model shows that as adjacent functions $f^\ell$ and $f^{\ell+1}$ share the parent node $\mathbf{U}^\ell$, they are in fact dependent. However, non-adjacent functions do not share any parent nodes, and so are independent; this can be seen by applying the d-separation criterion (Pearl, 1988) to $f^{\ell-1}$ and $f^{\ell+1}$, which have parents $(\mathbf{U}^{\ell-2}, \mathbf{U}^{\ell-1})$ and $(\mathbf{U}^{\ell}, \mathbf{U}^{\ell+1})$ respectively. Our approach, by contrast, determines the form of the approximate posterior over $\mathbf{U}^\ell$ by performing Bayesian regression using $\mathbf{U}^{\ell-1}$ as the input to that layer's GP, where the output data is $\mathbf{V}^\ell$. This results in a posterior that depends on the previous layer, $Q(\mathbf{U}^\ell | \mathbf{U}^{\ell-1})$.
We show the corresponding graphical model in Fig. A1c. From this graphical model it is straightforward to see that our approach results in a posterior over functions that is correlated across all layers.

[Fig. A1: (a) Graphical model illustrating the approach of Salimbeni & Deisenroth (2017); the inducing inputs $\{\mathbf{U}^{\ell-1}\}_{\ell=1}^{L+1}$ are treated as learned parameters and are therefore omitted from the model. (b) Graphical model illustrating the approach of Ustyuzhaninov et al. (2020); the inducing inputs are given by the inducing outputs at the previous layer. (c) Graphical model illustrating our approach.]

G PARAMETER SCALING FOR ADAM

The standard optimiser for variational BNNs is Adam (Kingma & Ba, 2014), which we also use. Considering the similar but simpler RMSprop updates (Tieleman & Hinton, 2012),
\[
\Delta w = \eta\, \frac{g}{\sqrt{\mathbb{E}\!\left[g^2\right]}},
\]
where the expectation over g² is approximated using a moving average of past gradients, absolute parameter changes are going to be of order η. This is fine if all the parameters have roughly the same order of magnitude, but becomes a serious problem if some of the parameters are very large and others are very small. For instance, if a parameter is around 10⁻⁴ and η = 10⁻⁴, then a single Adam step can easily double the parameter estimate, or change it from positive to negative. In contrast, if a parameter is around 1, then Adam with η = 10⁻⁴ can make proportionally much smaller changes to this parameter (around 0.01%). Thus, we need to ensure that all of our parameters have the same scale, especially as we mix methods, such as combining factorised and global inducing points. We therefore design all our new approximate posteriors (i.e. the inducing inputs and outputs) such that the parameters have a scale of around 1. The key issue is that the mean weights in factorised methods tend to be quite small: they have scale around 1/√fan-in. To resolve this issue, we store scaled weights, and divide these stored, scaled mean parameters by the square root of the fan-in as part of the forward pass,
\[
\text{weights} = \frac{\text{scaled weights}}{\sqrt{\text{fan-in}}}.
\]
This scaling forces us to use larger learning rates than are typically used.
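The scaling trick above amounts to a single division in the forward pass. A minimal sketch (our own function names, not the paper's code):

```python
import math

# Store "scaled" mean parameters of order one (easy for Adam's fixed
# per-parameter step size), and divide by sqrt(fan-in) in the forward
# pass so the effective weights follow the 1/sqrt(fan-in) scaling.
def effective_weights(scaled_weights, fan_in):
    return [w / math.sqrt(fan_in) for w in scaled_weights]

fan_in = 100
scaled = [1.0, -0.5, 2.0]           # stored parameters, all of order 1
weights = effective_weights(scaled, fan_in)

# Effective weights are of order 1/sqrt(100) = 0.1.
assert weights[0] == 0.1
```

The optimiser only ever sees the order-one stored parameters, so a single learning rate works across all parameter groups.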

H EXPLORING THE EFFECT OF THE NUMBER OF INDUCING POINTS

In this section, we briefly consider the effect of changing the number of inducing points, M, used in global inducing. We reconsider the toy problem from Sec. 4.1, and plot the predictive posteriors obtained with global inducing as the number of inducing points increases from 2 to 40 (noting that in Fig. 1 we used 100 inducing points). We plot the results of this experiment in Fig. A2. While two inducing points are clearly not sufficient, we observe that there is remarkably little difference between the predictive posteriors for 10 or more inducing points. This observation is reflected in the ELBOs per datapoint (listed above each plot), which show that adding more points beyond 10 gains very little in terms of closeness to the true posterior. However, we note that this is a very simple dataset: it consists of only two clusters of close points with a very clear trend. We would therefore expect more complex datasets to require more inducing points. We leave a full investigation of how many inducing points are required to obtain a suitable approximate posterior, such as that carried out by Burt et al. (2020) for sparse GP regression, to future work.

I UNDERSTANDING COMPOSITIONAL UNCERTAINTY

In this section, we take inspiration from the experiments of Ustyuzhaninov et al. (2020), which investigate the compositional uncertainty obtained by different approximate posteriors for DGPs. They noted that methods which factorise over layers have a tendency to cause the posterior distribution for each layer to collapse to a (nearly) deterministic function, resulting in worse uncertainty quantification within layers and worse ELBOs. In contrast, they found that allowing the approximate posterior to have correlations between layers allows those layers to capture more uncertainty, resulting in better ELBOs and therefore a closer approximation to the true posterior. They argue that this then allows the model to better discover compositional structure in the data. We first consider a toy problem consisting of 100 datapoints generated by sampling from a two-layer DGP of width one, with squared-exponential kernels in each layer. We then fit two two-layer DGPs to this data: one using local inducing, the other using global inducing. The results of this experiment can be seen in Fig. A3, which shows the final fit along with the learned posteriors over the intermediate functions F¹ and F². These results mirror those observed by Ustyuzhaninov et al. (2020) on a similar experiment: local inducing, which factorises over layers, collapses to a nearly deterministic posterior over the intermediate functions, whereas global inducing provides a much broader distribution over functions for the two layers. Therefore, global inducing leads to a wider range of plausible functions that could explain the data via composition, which can be important in understanding the data. We observe that this behaviour directly leads to better uncertainty quantification in the out-of-distribution region, as well as better ELBOs (local inducing: 4.921; global inducing: 7.483).
To illustrate a similar phenomenon in BNNs, we reconsider the toy problem of Sec. 4.1. As it is not meaningful to consider neural networks with only one hidden unit per layer, instead of plotting intermediate functions we look at the mean variance of the functions at random input points, roughly following the experiment of Dutordoir et al. (2019, Table 1). For each layer ℓ, we consider the quantity
\[
\mathbb{E}_{\mathbf{x}}\!\left[ \frac{1}{N_\ell} \sum_{\lambda=1}^{N_\ell} \mathbb{V}\!\left[ f^\ell_\lambda(\mathbf{x}) \right] \right],
\]
where the expectation is over random input points, which we sample from a standard normal. We expect that for methods which introduce correlations across layers, this quantity will be higher, as there will be a wider range of intermediate functions that could plausibly explain the data. We confirm this in Table 2, which indicates that global inducing leads to many more compositions of functions being considered as plausible explanations of the data. This is additionally reflected in the ELBO, which is far better for global inducing than for the other, factorised methods. However, we note that the variances are far closer than we might otherwise expect for the second layer. We hypothesise that this is due to the pruning effects described in Trippe & Turner (2018), where a layer has many weights that are close to the prior, which are then pruned out by the following layer as the outgoing weights collapse to zero. Indeed, the variances in the last layer are small for the factorised methods, which supports this hypothesis. By contrast, global inducing leads to high variances across all layers. We believe that understanding the role of compositional uncertainty in variational inference for deep Bayesian models can lead to important conclusions about both the models being used and the compositional structure underlying the data being modelled, and is therefore an important direction for future work.
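The per-layer variance metric above is straightforward to estimate from posterior samples. A hedged sketch (our own function names and toy sample arrays, not the paper's experiment):

```python
import numpy as np

# Estimate E_x[(1/N) * sum_lam Var[f_lam(x)]] from f_samples of shape
# (S, R, N): S posterior samples of a layer's N activations at R
# random input points.
def mean_function_variance(f_samples):
    var_over_posterior = f_samples.var(axis=0)   # (R, N): variance per input/unit
    return var_over_posterior.mean()             # average over inputs and units

rng = np.random.default_rng(0)
S, R, N = 200, 50, 10
# A collapsed (near-deterministic) layer vs a broad one.
collapsed = rng.normal(size=(S, R, N)) * 0.01
broad = rng.normal(size=(S, R, N)) * 1.0

# The metric distinguishes collapsed from broad intermediate functions.
assert mean_function_variance(collapsed) < mean_function_variance(broad)
```

A collapsed layer scores near zero on this metric, while a layer retaining posterior uncertainty scores near its marginal variance, which is exactly the contrast reported in Table 2.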

J UCI RESULTS WITH BAYESIAN NEURAL NETWORKS

For this appendix, we consider all of the UCI datasets from Hernández-Lobato & Adams (2015), along with four approximation families: factorised (i.e. mean-field), local inducing, global inducing, and fac→gi, which may offer some computational advantages over global inducing. We also considered three priors: the standard N(0, 1) prior, NealPrior, and ScalePrior. The test log likelihoods and ELBOs for BNNs applied to the UCI datasets are given in Fig. A4. Note that the ELBOs for the global inducing methods (both global inducing and fac→gi) are almost always better than those for the baseline methods, often by a very large margin. However, as noted earlier, this does not necessarily correspond to better test log likelihoods, due to model misspecification: there is not a straightforward relationship between the ELBO and predictive performance, and so it is possible to obtain better test log likelihoods with worse inference. We present all the results, including the test RMSEs, in tabulated form in Appendix P. The architecture we considered for all BNN UCI experiments was a fully-connected ReLU network with 2 hidden layers of 50 hidden units each. We performed a grid search to select the learning rate and minibatch size. For the fully factorised approximation, we selected the learning rate from {3e-4, 1e-3, 3e-3, 1e-2} and the minibatch size from {32, 100, 500}, optimising for 25000 gradient steps; for the other methods we selected the learning rate from {3e-3, 1e-2} and fixed the minibatch size to 10000 (as in Salimbeni & Deisenroth (2017)), optimising for 10000 gradient steps. For all methods we selected the hyperparameters that gave the best ELBO. We trained the models using 10 samples from the approximate posterior, and used 100 for evaluation. For the inducing point methods, we used the selected batch size as the number of inducing points per layer.
For all methods, we initialised the log noise variance at -3, but use the scaling trick in Appendix G to accelerate convergence, scaling by a factor of 10. Note that for the fully factorised method we used the local reparameterisation trick (Kingma et al., 2015) ; however, for fac → gi we cannot do so because the inducing point methods require that covariances be propagated through the network correctly. For the inducing point methods, we additionally use output channel-specific precisions, Λ l,λ , which effectively allows the network to prune unnecessary neurons if that benefits the ELBO. However, we only parameterise the diagonal of these precision matrices to save on computational and memory cost.

K UNCERTAINTY CALIBRATION FOR CIFAR-10

To assess how well our methods capture uncertainty, we consider calibration. Calibration is assessed by comparing the model's probabilistic assessment of its confidence with its accuracy, i.e. the proportion of the time that it is actually correct. For instance, gathering model predictions with some confidence (e.g. softmax probabilities in the range 0.9 to 0.95) and looking at the accuracy of these predictions, we would expect the model to be correct with probability around 0.925; a higher or lower value would represent miscalibration. We begin by plotting calibration curves in Fig. A5, obtained by binning the predictions into 20 equal bins and assessing the mean accuracy of the binned predictions. For well-calibrated models, we expect the line to lie on the diagonal. A line above the diagonal indicates the model is underconfident (it is performing better than it expects), whereas a line below the diagonal indicates it is overconfident (it is performing worse than it expects). While it is difficult to draw strong conclusions from these plots, it appears generally that factorised is poorly calibrated for both priors, that SpatialIWPrior generally improves calibration over ScalePrior, and that local inducing with SpatialIWPrior performs very well. To come to more quantitative conclusions, we use the expected calibration error (ECE; Naeini et al., 2015; Guo et al., 2017), which measures the expected absolute difference between the model's confidence and its accuracy. Confirming the results from the plots (Fig. A5), we find that using the more sophisticated SpatialIWPrior gave considerable improvements in calibration. While, as expected, our most accurate prior, SpatialIWPrior, in combination with global inducing points did very well (ECE of 0.021), the model with the best ECE is actually local inducing with SpatialIWPrior, albeit by a very small margin. We leave investigation of exactly why this is to future work.
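The binned ECE computation described above can be sketched as follows (toy confidences and labels, not the paper's predictions):

```python
# Expected calibration error with equal-width confidence bins: for each
# bin, compare the mean confidence against the empirical accuracy, and
# average the absolute gaps weighted by bin occupancy.
def expected_calibration_error(confidences, correct, n_bins=20):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece

# A perfectly calibrated toy model (accuracy 0.75 at confidence 0.75)
# has ECE 0.
confs = [0.75, 0.75, 0.75, 0.75]
corr = [1, 1, 1, 0]
assert abs(expected_calibration_error(confs, corr)) < 1e-12
```

An overconfident model (say, confidence 0.9 with accuracy 0.5) would instead score an ECE of 0.4 on this metric.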
Finally, note that our final ECE value of 0.021 is a considerable improvement over those reported for uncalibrated models, which are in the region of 0.03-0.045 (although considerably better calibration can be achieved by post-hoc scaling of the model's confidence).

L ASYMPTOTIC COMPLEXITY

In the deep GP case, the complexity of global inducing is exactly that of standard inducing-point Gaussian processes, i.e. O(M³ + PM²), where M is the number of inducing points and P can be taken to be the number of training inputs, or the size of a minibatch, as appropriate. The first term, M³, comes from computing and sampling the posterior over $\mathbf{U}^\ell$ based on the inducing points (e.g. inverting the covariance). The second term, PM², comes from computing the implied distribution over $\mathbf{F}^\ell$. In the fully-connected BNN case, we have three terms, O(N³ + MN² + PN²), where N is the width of the network. The first term, N³, arises from taking the inverse of the covariance matrix in Eq. (9), and is also the complexity of, e.g., propagating the inducing points from one layer to the next (Eq. 8). The second term, MN², comes from computing that covariance in Eq. (9), by taking the product of the input features with themselves. The third term, PN², comes from multiplying the training inputs/minibatch by the sampled weights (Eq. 1).

M UCI RESULTS WITH DEEP GAUSSIAN PROCESSES

In this appendix, we again consider all of the UCI datasets from Hernández-Lobato & Adams (2015) for DGPs with depths ranging from two to five layers. We compare DSVI (Salimbeni & Deisenroth, 2017) , local inducing, and global inducing. While local inducing uses the same inducing-point architecture as Salimbeni & Deisenroth (2017) , the actual implementation and parameterisation is very different and in addition they used a complex architecture involving skip connections that we may not have matched exactly. As such, we do expect to see differences between local inducing and Salimbeni & Deisenroth (2017) . We show our results in Fig. A6 . Here, the results are not as clear-cut as in the BNN case. For the smaller datasets (i.e. boston, concrete, energy, wine, yacht), global inducing generally outperforms both local inducing and DSVI, as noted in the main text, especially when considering the ELBOs. We do however observe that for power, protein, and one model for kin8nm, the local approaches sometimes outperform global inducing, even for the ELBOs. We believe this is due to the fact that relatively few inducing points were used (100), in combination with the fact that global inducing has far fewer variational parameters than the local approaches. This may make optimisation harder in the global inducing case, especially for larger datasets where the model uncertainty does not matter as much, as the posterior concentration will be stronger. Importantly, however, our results on CIFAR-10 indicate that these issues do not arise in very large-scale, high-dimensional datasets, which are of most interest for future work. We provide tabulated results, including for RMSEs, in Appendix P.

M.1 EXPERIMENTAL DETAILS

Here, we matched the experimental setup of Salimbeni & Deisenroth (2017) as closely as possible. In particular, we used 100 inducing points and full-covariance observation noise. However, our parameterisation is still somewhat different from theirs, in part because our approximate posterior is defined in terms of noisy function values, while their approximate posterior was defined in terms of the function values themselves. As the original results in Salimbeni & Deisenroth (2017) used different UCI splits, and did not provide the ELBO, we reran their code (https://github.com/ICL-SML/Doubly-Stochastic-DGP, changing the number of epochs and noise variance to reflect the values in the paper), which gave very similar log likelihoods to those in the paper.

N MNIST 500

For MNIST, we considered a LeNet-inspired model consisting of two conv2d-relu-maxpool blocks, followed by conv2d-relu-linear, where the convolutions all have 3 × 3 kernels with 64 channels. We trained all models using a learning rate of 10⁻³. When training on very small datasets, such as the first 500 training examples in MNIST, we can see a variety of pathologies emerge with standard methods. To help build intuition for these pathologies, we introduce a sanity check for the ELBO. In particular, we could imagine a model that sets the distribution over all lower-layer parameters equal to the prior, and sets the top-layer parameters so as to ensure that the predictions are uniform. With 10 classes, this results in an average test log likelihood of log(1/10) ≈ -2.30, and an ELBO (per datapoint) of approximately -2.30. We found that many combinations of approximate posterior and prior converged to ELBOs near this baseline. Indeed, the only approximate posterior to escape this baseline for ScalePrior and SpatialIWPrior is global inducing points. This is because ScalePrior and SpatialIWPrior both offer the flexibility to shrink the prior variance, and hence shrink the weights towards zero, giving uniform predictions and a potentially zero KL divergence. In contrast, NealPrior and StandardPrior do not offer this flexibility: one always has to pay something in KL divergence in order to give uniform predictions. We believe that this is the reason that factorised performs better than expected with NealPrior, despite having an ELBO that is close to the baseline. Furthermore, it is unclear why local inducing gives very poor test log likelihood and performance, despite having an ELBO that is similar to factorised. For StandardPrior, all the ELBOs are far lower than the baseline, and far lower than for any other prior. Despite this, factorised and fac → gi in combination with StandardPrior appear to transiently perform better in terms of predictive accuracy than any other method.
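The baseline arithmetic above can be checked directly (toy numbers, not the paper's model):

```python
import math

# A network that predicts uniformly over 10 classes has an average test
# log likelihood of log(1/10) ~= -2.30; if the lower-layer posterior
# equals the prior, the per-datapoint KL term is (near) zero and the
# ELBO per datapoint sits at roughly the same value.
n_classes = 10
uniform_log_lik = math.log(1.0 / n_classes)   # per-datapoint log likelihood
kl_per_datapoint = 0.0                        # posterior == prior
elbo_per_datapoint = uniform_log_lik - kl_per_datapoint

assert abs(uniform_log_lik + 2.3026) < 1e-3
assert elbo_per_datapoint == uniform_log_lik
```

Any method whose ELBO hovers around -2.30 per datapoint on this problem should therefore be suspected of having collapsed to this trivial solution.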
These results should sound a note of caution whenever we try to use factorised approximate posteriors with fixed prior covariances (e.g. Blundell et al., 2015; Farquhar et al., 2020) . We leave a full investigation of these effects for future work.

O ADDITIONAL EXPERIMENTAL DETAILS

All the methods were implemented in PyTorch. We ran the toy experiments on CPU, with the UCI experiments run on a mixture of CPU and GPU. The remaining experiments (linear, CIFAR-10, and MNIST 500) were run on various GPUs. For CIFAR-10, the most intensive of our experiments, we trained the models on one NVIDIA Tesla P100-PCIE-16GB. We optimised using Adam (Kingma & Ba, 2014) throughout.

Factorised: We initialise the posterior weight means to follow the Neal scaling (i.e. drawn from NealPrior); however, we use the scaling described in Appendix G to accelerate training. We initialise the weight variances to $10^{-3}/N_{\ell-1}$ for each layer.

Inducing point methods: For global inducing, we initialise the inducing inputs, $\mathbf{U}^0$, and the pseudo-data for the last layer, $\mathbf{V}^{L+1}$, using the first batch of data, except for the toy experiment, where we initialise using samples from N(0, 1) (since we used more inducing points than datapoints). For the remaining layers, we initialise the pseudo-data by sampling from N(0, 1). We initialise the log precision to -4, except for the last layer, where we initialise it to 0. We additionally use a scaling factor of 3, as described in Appendix G. For local inducing, the initialisation is largely the same, except that we initialise the pseudo-data for every layer by sampling from N(0, 1), and we additionally sample the inducing inputs for every layer from N(0, 1).

Toy experiment: For each variational method, we optimise the ELBO over 5000 epochs, using full batches for the gradient descent. We use a learning rate of 1e-2. We fix the noise variance at its



Figure 1: Predictive distributions on the toy dataset. Shaded regions represent one standard deviation.

Figure 2: ELBO for different approximate posteriors as we change network depth/width on a dataset generated using a linear Gaussian model. The rand → gi line lies behind the global inducing line in width = 50 and width = 250.

Figure 3: Average test log likelihoods for BNNs on the UCI datasets (in nats). Error bars represent one standard error. Shading represents different priors. We connect the factorised models with the fac → gi models with a thin grey line as an aid for easier comparison. Further to the right is better.

Figure A1: Comparison of the graphical models for three approaches to inference in deep GPs: Salimbeni & Deisenroth (2017), Ustyuzhaninov et al. (2020), and ours.

Figure A2: Predictive distributions on the toy dataset as the number of inducing points changes.

Figure A3: Posterior distributions for 2-layer DGPs with local inducing and global inducing. The first two columns show the predictive distributions for each layer taken individually, while the last column shows the predictive distribution of the output y.

Table 2: ELBOs and variances of the intermediate functions for a BNN fit to the toy data of Fig. 1.

Figure A4: ELBOs per datapoint and average test log likelihoods for BNNs on UCI datasets.

Figure A5: Calibration curves for CIFAR-10

Figure A6: ELBOs per datapoint and average test log likelihoods for DGPs on UCI datasets. The numbers indicate the depths of the models.

Figure A7: The ELBO, test log likelihoods and classification accuracy with different priors and approximate posteriors on a reduced MNIST dataset consisting of only the first 500 training examples.

CIFAR-10 classification accuracy. The first block shows our results using SpatialIWPrior, with ScalePrior in brackets. The next block shows comparable past results, from GPs and BNNs. The final block shows non-comparable (sampling-based) methods. Dashes indicate that the figures were either not reported or are not applicable. The time is reported per epoch, with ScalePrior and for MNIST rather than CIFAR-10, because of a known performance bug in the convolutions required in Sec. 2.2 with 32 × 32 (and above) images (https://github.com/pytorch/pytorch/issues/35603).

Expected calibration error for CIFAR-10
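For reference, expected calibration error can be computed as in the following sketch. This is a minimal NumPy implementation with equal-width confidence bins; the bin count and binning scheme are illustrative assumptions, as binning choices vary across papers.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |accuracy - confidence| gap in each bin, weighted by bin mass."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    ece = 0.0
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of points in bin
    return ece
```

A perfectly calibrated classifier (confidence equals accuracy in every bin) attains an ECE of zero.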


true value, to help assess the differences between each method more clearly. We use 10 samples from the variational posterior for training, and 100 for testing. For HMC, we use 10,000 samples for burn-in and 10,000 samples for evaluation, which we subsequently thin by a factor of 10. We initialise the chain from a standard normal distribution, and use 20 leapfrog steps per sample. We hand-tune the leapfrog step sizes to 0.0007 and 0.003 for the burn-in and sampling phases, respectively.

Deep linear network  We use 10 inducing points for the inducing point methods. We use 1 sample from the approximate posterior for training and 10 for testing, training for 40 periods of 1000 gradient steps each, using full batches, with a learning rate of 1e-2.
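The HMC configuration above (leapfrog integration with hand-tuned step sizes and a Metropolis correction) can be sketched as follows. The standard-normal target here is purely illustrative, not the BNN posterior used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_prob(theta):
    # Illustrative target: standard normal log-density (up to a constant).
    return -0.5 * np.sum(theta ** 2)

def grad_log_prob(theta):
    return -theta

def hmc_step(theta, step_size, n_leapfrog=20):
    """One HMC transition: sample momentum, run leapfrog, accept/reject."""
    momentum = rng.standard_normal(theta.shape)
    theta_new, p = theta.copy(), momentum.copy()
    # Leapfrog integration: half momentum step, alternating full steps,
    # final half momentum step.
    p += 0.5 * step_size * grad_log_prob(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * p
        p += step_size * grad_log_prob(theta_new)
    theta_new += step_size * p
    p += 0.5 * step_size * grad_log_prob(theta_new)
    # Metropolis correction on the change in total energy.
    current_h = -log_prob(theta) + 0.5 * np.sum(momentum ** 2)
    proposed_h = -log_prob(theta_new) + 0.5 * np.sum(p ** 2)
    if np.log(rng.uniform()) < current_h - proposed_h:
        return theta_new, True
    return theta, False
```

In the experiments, separate step sizes are used for the burn-in and sampling phases, and the evaluation chain is thinned by a factor of 10.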

UCI experiments

The splits that we used for the UCI datasets can be found at https://github.com/yaringal/DropoutUncertaintyExps.

CIFAR-10

The CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html; Krizhevsky et al., 2009) is a 10-class dataset comprising RGB, 32 × 32 images. It is divided into two sets: a training set of 50,000 examples and a validation set of 10,000 examples. For the purposes of this paper, we use the validation set as our test set and refer to it as such, as is commonly done in the literature. We use a batch size of 500, with one sample from the approximate posterior for training and 10 for testing. For pre-processing, we normalise the data using the training dataset's mean and standard deviation. We train for 1000 epochs with a learning rate of 1e-2 (see App. G for an explanation of why our learning rate is higher than might be expected), and we use a tempering scheme for the first 100 epochs, slowly increasing the influence of the KL divergence to the prior by multiplying it by a factor that increases from 0 to 1. We increase this factor in a step-wise manner: it is 0 for the first ten epochs, 0.1 for the next ten, 0.2 for the following ten, and so on. Importantly, we still have 900 epochs of training where the standard, untempered ELBO is used, meaning that our results reflect that ELBO. Finally, we note that we share the precisions Λ within layers instead of using a separate precision for each output channel as in the UCI case. This saves memory and computational cost, although possibly at the expense of predictive performance.

MNIST 500  The MNIST dataset (http://yann.lecun.com/exdb/mnist/) is a dataset of grayscale handwritten digits, each 28 × 28 pixels, with 10 classes. It comprises 60,000 training images and 10,000 test images. For the MNIST 500 experiments, we trained using the first 500 images from the training dataset and discarded the rest.
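The step-wise KL tempering schedule described above for CIFAR-10 can be sketched as follows; the function name and signature are illustrative, not the paper's code.

```python
def kl_factor(epoch, tempering_epochs=100):
    """Step-wise KL tempering factor: 0 for epochs 0-9, 0.1 for epochs
    10-19, ..., 0.9 for epochs 90-99, then 1.0 thereafter."""
    if epoch >= tempering_epochs:
        return 1.0
    block = tempering_epochs // 10  # epochs per step (10 here)
    return (epoch // block) / 10.0
```

During training, the KL term of the ELBO is multiplied by `kl_factor(epoch)`, so after epoch 100 the standard, untempered ELBO is optimised.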
We normalised the images using the full training dataset's statistics.

Deep GPs  As mentioned, we largely follow the approach of Salimbeni & Deisenroth (2017) for hyperparameters. For global inducing, we initialise the inducing inputs to the first batch of training inputs, and we initialise the pseudo-outputs for the last layer to the respective training outputs. For the remaining layers, we initialise the pseudo-outputs by sampling from a standard normal distribution. We initialise the precision matrix to be diagonal, with log precision zero for the output layer and log precision -4 for the remaining layers. For local inducing, we initialise the inducing inputs and pseudo data by sampling from a standard normal distribution at every layer, and initialise the precision matrices to be diagonal with log precision zero.

Dataset    Prior        factorised       local inducing   fac → gi         global inducing
kin8nm     N(0, 1)      0.07 ± 0.00      0.08 ± 0.00      0.07 ± 0.00      0.07 ± 0.00
           NealPrior    0.07 ± 0.00      0.08 ± 0.00      0.07 ± 0.00      0.07 ± 0.00
           ScalePrior   0.07 ± 0.00      0.08 ± 0.00      0.07 ± 0.00      0.07 ± 0.00
naval      N(0, 1)      0.00 ± 0.00      0.00 ± 0.00      0.00 ± 0.00      0.00 ± 0.00
           NealPrior    0.00 ± 0.00      0.01 ± 0.00      0.00 ± 0.00      0.00 ± 0.00
           ScalePrior   0.00 ± 0.00      0.01 ± 0.00      0.00 ± 0.00      0.00 ± 0.00

Dataset    Prior        factorised       local inducing   fac → gi         global inducing
boston     N(0, 1)      -1.55 ± 0.00     -1.54 ± 0.00     -1.02 ± 0.01     -1.03 ± 0.00
           NealPrior    -1.03 ± 0.00     -0.99 ± 0.00     -0.63 ± 0.00     -0.70 ± 0.00
           ScalePrior   -1.54 ± 0.00     -0.96 ± 0.00     -0.59 ± 0.00     -0.70 ± 0.00
concrete   N(0, 1)      -1.10 ± 0.00     -1.08 ± 0.00     -0.71 ± 0.00     -0.78 ± 0.00
           NealPrior    -0.88 ± 0.00     -0.87 ± 0.00     -0.59 ± 0.00     -0.65 ± 0.00
           ScalePrior   -1.45 ± 0.01     -0.88 ± 0.00     -0.57 ± 0.00     -0.63 ± 0.00
energy     N(0, 1)      -0.13 ± 0.02     -0.53 ± 0.01     0.72 ± 0.00      0.59 ± 0.00
           NealPrior    0.21 ± 0.00      -0.33 ± 0.04     0.95 ± 0.00      0.79 ± 0.01
           ScalePrior   -1.12 ± 0.00     -0.47 ± 0.01     0.96 ± 0.01      0.80 ± 0.01
kin8nm     N(0, 1)      -0.38 ± 0.00     -0.43 ± 0.00     -0.26 ± 0.00     -0.31 ± 0.00
           NealPrior    -0.35 ± 0.00     -0.43 ± 0.00     -0.31 ± 0.00     -0.28 ± 0.00
           ScalePrior   -0.51 ± 0.00     -0.39 ± 0.01     -0.29 ± 0.00     -0.27 ± 0.00
power      N(0, 1)      -0.08 ± 0.00     -0.06 ± 0.00     -0.02 ± 0.00     -0.03 ± 0.00
           NealPrior    -0.05 ± 0.00     -0.05 ± 0.00     -0.01 ± 0.00     -0.01 ± 0.00
           ScalePrior   -0.13 ± 0.00     -0.05 ± 0.00     -0.01 ± 0.00     -0.01 ± 0.00
protein    N(0, 1)      -1.09 ± 0.00     -1.14 ± 0.00     -1.06 ± 0.01     -1.09 ± 0.00
           NealPrior    -1.11 ± 0.00     -1.13 ± 0.00     -1.09 ± 0.00     -1.09 ± 0.00
           ScalePrior   -1.13 ± 0.00     -1.12 ± 0.00     -1.07 ± 0.00     -1.07 ± 0.00
wine       N(0, 1)      -1.48 ± 0.00     -1.47 ± 0.00     -1.36 ± 0.00     -1.36 ± 0.00
           NealPrior    -1.31 ± 0.00     -1.30 ± 0.00     -1.22 ± 0.00     -1.23 ± 0.00
           ScalePrior   -1.46 ± 0.00     -1.29 ± 0.00     -1.22 ± 0.00     -1.23 ± 0.00
yacht      N(0, 1)      -1.04 ± 0.02     -1.30 ± 0.02     0.08 ± 0.01      -0.23 ± 0.01
           NealPrior    -0.46 ± 0.02     -0.39 ± 0.01     0.74 ± 0.01      0.31 ± 0.01
           ScalePrior   -1.61 ± 0.00     -0.77 ± 0.10     0.79 ± 0.01      0.30 ± 0.01

Dataset    Depth
kin8nm     2      0.06 ± 0.00     0.06 ± 0.00     0.06 ± 0.00
           3      0.06 ± 0.00     0.06 ± 0.00     0.06 ± 0.00
           4      0.06 ± 0.00     0.06 ± 0.00     0.06 ± 0.00
           5      0.06 ± 0.00     0.06 ± 0.00     0.06 ± 0.00
naval      2      0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
           3      0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
           4      0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
           5      0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
-          -      2.17 ± 0.08     3.32 ± 0.04     3.51 ± 0.05
power      2      0.02 ± 0.00     0.05 ± 0.00     0.03 ± 0.00
           3      0.03 ± 0.00     0.05 ± 0.00     0.03 ± 0.00
           4      0.03 ± 0.00     0.05 ± 0.00     0.04 ± 0.00
           5      0.03 ± 0.00     0.05 ± 0.00     0.03 ± 0.00
protein    2      -1.06 ± 0.00    -1.07 ± 0.00    -1.06 ± 0.00
           3      -1.02 ± 0.00    -1.05 ± 0.01    -1.05 ± 0.00
           4      -1.01 ± 0.00    -1.06 ± 0.01    -1.04 ± 0.00
           5      -1.01 ± 0.00    -1.08 ± 0.00    -1.04 ± 0.00
wine       2      -1.19 ± 0.00    -1.18 ± 0.00    -1.17 ± 0.00
           3      -1.19 ± 0.00    -1.18 ± 0.00    -1.17 ± 0.00
           4      -1.20 ± 0.00    -1.18 ± 0.00    -1.17 ± 0.00
           5      -1.20 ± 0.00    -1.18 ± 0.00    -1.17 ± 0.00
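The initialisation scheme for global inducing points in DGPs described above can be sketched as follows. The array shapes, names, and dict layout are illustrative assumptions, not the paper's code; the last layer's width is assumed to match the output dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_global_inducing(first_batch, train_outputs, layer_widths):
    """Illustrative global inducing initialisation for a DGP:
    - inducing inputs start at the first batch of training inputs;
    - last-layer pseudo-outputs start at the corresponding training outputs;
    - intermediate pseudo-outputs are standard-normal draws;
    - diagonal log-precisions are 0 for the output layer, -4 elsewhere."""
    params = {"inducing_inputs": first_batch.copy(), "layers": []}
    n_layers = len(layer_widths)
    m = first_batch.shape[0]  # number of inducing points
    for l, width in enumerate(layer_widths):
        last = (l == n_layers - 1)
        pseudo_outputs = (train_outputs[:m].copy() if last
                          else rng.standard_normal((m, width)))
        log_prec = np.zeros(width) if last else np.full(width, -4.0)
        params["layers"].append({"pseudo_outputs": pseudo_outputs,
                                 "log_precision": log_prec})
    return params
```

For the local inducing baseline, both inducing inputs and pseudo data would instead be drawn from a standard normal at every layer, with all log precisions set to zero.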

