GLOBAL INDUCING POINT VARIATIONAL POSTERIORS FOR BAYESIAN NEURAL NETWORKS AND DEEP GAUSSIAN PROCESSES

Abstract

We derive the optimal approximate posterior over the top-layer weights in a Bayesian neural network for regression, and show that it exhibits strong dependencies on the lower-layer weights. We adapt this result to develop a correlated approximate posterior over the weights at all layers in a Bayesian neural network. We extend this approach to deep Gaussian processes, unifying inference in the two model classes. Our approximate posterior uses learned "global" inducing points, which are defined only at the input layer and propagated through the network to obtain inducing inputs at subsequent layers. By contrast, standard, "local", inducing point methods from the deep Gaussian process literature optimise a separate set of inducing inputs at every layer, and thus do not model correlations across layers. Our method gives state-of-the-art performance for a variational Bayesian method, without data augmentation or tempering, achieving 86.7% accuracy on CIFAR-10.

1. INTRODUCTION

Deep models, formed by stacking together many simple layers, give rise to extremely powerful machine learning algorithms, from deep neural networks (DNNs) to deep Gaussian processes (DGPs) (Damianou & Lawrence, 2013). One approach to reason about uncertainty in these models is to use variational inference (VI) (Jordan et al., 1999). VI in Bayesian neural networks (BNNs) requires the user to specify a family of approximate posteriors over the weights, with the classical approach being independent Gaussian distributions over each individual weight (Hinton & Van Camp, 1993; Graves, 2011; Blundell et al., 2015). Later work has considered more complex approximate posteriors, for instance using a Matrix-Normal distribution as the approximate posterior for a full weight-matrix (Louizos & Welling, 2016; Ritter et al., 2018). By contrast, DGPs use an approximate posterior defined over functions: the standard approach is to specify the inputs and outputs at a finite number of "inducing" points (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017).

Critically, these classical BNN and DGP approaches define approximate posteriors over functions that are independent across layers. An approximate posterior that factorises across layers is problematic, because what matters for a deep model is the overall input-output transformation for the full model, not the input-output transformation for individual layers. This raises the question of what family of approximate posteriors should be used to capture correlations across layers. One approach for BNNs would be to introduce a flexible "hypernetwork", used to generate the weights (Krueger et al., 2017; Pawlowski et al., 2017). However, this approach is likely to be suboptimal as it does not sufficiently exploit the rich structure in the underlying neural network. For guidance, we consider the optimal approximate posterior over the top-layer units in a deep network for regression.
Remarkably, the optimal approximate posterior for the last-layer weights given the earlier weights can be obtained in closed form without choosing a restrictive family of distributions. In particular, the optimal approximate posterior is given by propagating the training inputs through the lower layers to compute the top-layer representation, then using Bayesian linear regression to map from the top-layer representation to the outputs. Inspired by this result, we use Bayesian linear regression to define a generic family of approximate posteriors for BNNs. In particular, we introduce learned "pseudo-data" at every layer, and compute the posterior over the weights by performing linear regression from the inputs (propagated from lower layers) onto the pseudo-data. We reduce the burden of working with many training inputs by summarising the posterior using a small number of "inducing" points. We find that these approximate posteriors give excellent performance in the non-tempered, no-data-augmentation regime, with performance on datasets such as CIFAR-10 reaching 86.7%, comparable to SGMCMC (Wenzel et al., 2020). Our approach can be extended to DGPs, and we explore connections to the inducing point GP literature, showing that inference in the two classes of models can be unified.

2. METHODS

We consider neural networks with lower-layer weights $\{W^\ell\}_{\ell=1}^L$, $W^\ell \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$, and top-layer weights $W^{L+1} \in \mathbb{R}^{N_L \times N_{L+1}}$, where the activity $F^\ell$ at layer $\ell$ is given by
$$F^1 = X W^1, \qquad F^\ell = \phi\!\left(F^{\ell-1}\right) W^\ell \quad \text{for } \ell \in \{2, \dots, L\},$$
where $\phi(\cdot)$ is an elementwise nonlinearity. The outputs, $Y \in \mathbb{R}^{P \times N_{L+1}}$, depend on the top-level activity, $F^L$, and the output weights, $W^{L+1}$, according to a likelihood, $P(Y \,|\, W^{L+1}, F^L)$. In the following derivations, we will focus on $\ell > 1$; corresponding expressions for the input layer can be obtained by replacing $\phi(F^0)$ with the inputs, $X \in \mathbb{R}^{P \times N_0}$. The prior over weights is independent across layers and output units (see Sec. 2.3 for the form of $S^\ell$),
$$P\!\left(W^\ell\right) = \prod_{\lambda=1}^{N_\ell} P\!\left(w^\ell_\lambda\right), \qquad P\!\left(w^\ell_\lambda\right) = \mathcal{N}\!\left(w^\ell_\lambda;\, \mathbf{0},\, \tfrac{1}{N_{\ell-1}} S^\ell\right),$$
where $w^\ell_\lambda$ is a column of $W^\ell$, i.e. all the input weights to unit $\lambda$ in layer $\ell$. To fit the parameters of the approximate posterior, $Q(\{W^\ell\}_{\ell=1}^{L+1})$, we maximise the evidence lower bound (ELBO),
$$\mathcal{L} = \mathbb{E}_{Q(\{W^\ell\}_{\ell=1}^{L+1})}\!\left[\log P\!\left(Y, \{W^\ell\}_{\ell=1}^{L+1} \,\middle|\, X\right) - \log Q\!\left(\{W^\ell\}_{\ell=1}^{L+1}\right)\right].$$
To build intuition about how to parameterise $Q(\{W^\ell\}_{\ell=1}^{L+1})$, we consider the optimal $Q(W^{L+1} \,|\, \{W^\ell\}_{\ell=1}^L)$ for any given $Q(\{W^\ell\}_{\ell=1}^L)$. We begin by simplifying the ELBO, incorporating terms that do not depend on $W^{L+1}$ into a constant, $c$,
$$\mathcal{L} = \mathbb{E}_{Q(\{W^\ell\}_{\ell=1}^{L+1})}\!\left[\log P\!\left(Y, W^{L+1} \,\middle|\, X, \{W^\ell\}_{\ell=1}^L\right) - \log Q\!\left(W^{L+1} \,\middle|\, \{W^\ell\}_{\ell=1}^L\right)\right] + c. \tag{4}$$
Rearranging these terms, we find that all $W^{L+1}$ dependence can be written in terms of the KL divergence between the approximate posterior of interest and the true posterior,
$$\mathcal{L} = \mathbb{E}_{Q(\{W^\ell\}_{\ell=1}^L)}\!\left[\log P\!\left(Y \,\middle|\, X, \{W^\ell\}_{\ell=1}^L\right) - D_{\mathrm{KL}}\!\left(Q\!\left(W^{L+1} \,\middle|\, \{W^\ell\}_{\ell=1}^L\right) \,\middle\|\, P\!\left(W^{L+1} \,\middle|\, Y, X, \{W^\ell\}_{\ell=1}^L\right)\right)\right] + c. \tag{5}$$
Thus, the optimal approximate posterior is
$$Q\!\left(W^{L+1} \,\middle|\, \{W^\ell\}_{\ell=1}^L\right) = P\!\left(W^{L+1} \,\middle|\, Y, X, \{W^\ell\}_{\ell=1}^L\right) \propto P\!\left(Y \,\middle|\, W^{L+1}, F^L\right) P\!\left(W^{L+1}\right),$$
where the final proportionality comes from applying Bayes' theorem and exploiting the model's conditional independencies.
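The feedforward computation and factorised Gaussian prior above can be sketched as follows. This is an illustrative NumPy implementation, not the paper's code: it assumes a ReLU nonlinearity for $\phi$ and the identity for each prior covariance $S^\ell$ (the paper allows more general choices), and the layer widths are hypothetical.

```python
import numpy as np

def sample_prior_weights(layer_widths, rng):
    """Sample W^l with columns w^l_lambda ~ N(0, S^l / N_{l-1}).

    Assumes S^l = I for illustration, so each entry has variance
    1 / N_{l-1} (the 1/N_{l-1} scaling from the prior in the text).
    """
    weights = []
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        weights.append(rng.standard_normal((n_in, n_out)) / np.sqrt(n_in))
    return weights

def forward(X, weights):
    """Compute F^1 = X W^1 and F^l = phi(F^{l-1}) W^l for l >= 2,
    with phi taken to be ReLU (an assumption for this sketch)."""
    F = X @ weights[0]
    for W in weights[1:]:
        F = np.maximum(F, 0.0) @ W
    return F
```

A single prior function sample is then `forward(X, sample_prior_weights([N0, N1, ..., NL], rng))`, returning the top-level activity $F^L$ for all $P$ inputs.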
For regression, the likelihood is Gaussian,
$$P\!\left(Y \,\middle|\, W^{L+1}, F^L\right) = \prod_{\lambda=1}^{N_{L+1}} \mathcal{N}\!\left(y_\lambda;\, \phi\!\left(F^L\right) w^{L+1}_\lambda,\, \Lambda^{-1}_{L+1}\right),$$
where $y_\lambda$ is the value of a single output channel at all training inputs, and $\Lambda_{L+1}$ is a precision matrix. Thus, the posterior is given in closed form by Bayesian linear regression (Rasmussen & Williams, 2006).

2.1. DEFINING THE FULL APPROXIMATE POSTERIOR WITH GLOBAL INDUCING POINTS AND PSEUDO-DATA

We adapt the optimal scheme above to give a scalable approximate posterior over the weights at all layers. To avoid propagating all training inputs through the network, which is intractable for large datasets, we instead propagate $M$ global inducing locations, $U^0$,
$$U^1 = U^0 W^1, \qquad U^\ell = \phi\!\left(U^{\ell-1}\right) W^\ell \quad \text{for } \ell = 2, \dots, L+1. \tag{8}$$
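The propagation in Eq. (8) can be sketched as below. This is an illustrative NumPy version: it assumes ReLU for $\phi$, and in practice $U^0$ would be a learned variational parameter and the $W^\ell$ samples from the approximate posterior; here both are simply passed in as arrays.

```python
import numpy as np

def propagate_inducing(U0, weights):
    """Propagate M global inducing locations U^0 through the network:
    U^1 = U^0 W^1 and U^l = phi(U^{l-1}) W^l for l = 2, ..., L+1 (Eq. 8),
    with phi taken to be ReLU for this sketch. Returns [U^0, ..., U^{L+1}]."""
    Us = [U0, U0 @ weights[0]]
    for W in weights[1:]:
        Us.append(np.maximum(Us[-1], 0.0) @ W)
    return Us
```

Each $U^\ell$ then serves as the inducing inputs at layer $\ell+1$, so a single set of $M$ input-layer locations induces correlated posteriors across all layers.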

