GLOBAL INDUCING POINT VARIATIONAL POSTERIORS FOR BAYESIAN NEURAL NETWORKS AND DEEP GAUSSIAN PROCESSES

Abstract

We derive the optimal approximate posterior over the top-layer weights in a Bayesian neural network for regression, and show that it exhibits strong dependencies on the lower-layer weights. We adapt this result to develop a correlated approximate posterior over the weights at all layers in a Bayesian neural network. We extend this approach to deep Gaussian processes, unifying inference in the two model classes. Our approximate posterior uses learned "global" inducing points, which are defined only at the input layer and propagated through the network to obtain inducing inputs at subsequent layers. By contrast, standard, "local", inducing point methods from the deep Gaussian process literature optimise a separate set of inducing inputs at every layer, and thus do not model correlations across layers. Our method gives state-of-the-art performance for a variational Bayesian method on CIFAR-10 (86.7%), without data augmentation or tempering.

1. INTRODUCTION

Deep models, formed by stacking together many simple layers, give rise to extremely powerful machine learning algorithms, from deep neural networks (DNNs) to deep Gaussian processes (DGPs) (Damianou & Lawrence, 2013). One approach to reasoning about uncertainty in these models is variational inference (VI) (Jordan et al., 1999). VI in Bayesian neural networks (BNNs) requires the user to specify a family of approximate posteriors over the weights, with the classical approach being independent Gaussian distributions over each individual weight (Hinton & Van Camp, 1993; Graves, 2011; Blundell et al., 2015). Later work has considered more complex approximate posteriors, for instance using a Matrix-Normal distribution as the approximate posterior over a full weight matrix (Louizos & Welling, 2016; Ritter et al., 2018). By contrast, DGPs use an approximate posterior defined over functions: the standard approach is to specify the inputs and outputs at a finite number of "inducing" points (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017).

Critically, these classical BNN and DGP approaches define approximate posteriors over functions that are independent across layers. An approximate posterior that factorises across layers is problematic, because what matters for a deep model is the overall input-output transformation of the full model, not the input-output transformation of individual layers. This raises the question of what family of approximate posteriors should be used to capture correlations across layers. One approach for BNNs would be to introduce a flexible "hypernetwork" to generate the weights (Krueger et al., 2017; Pawlowski et al., 2017). However, this approach is likely to be suboptimal, as it does not sufficiently exploit the rich structure in the underlying neural network. For guidance, we consider the optimal approximate posterior over the top-layer units in a deep network for regression.
Remarkably, the optimal approximate posterior for the last-layer weights given the earlier weights can be obtained in closed form without choosing a restrictive family of distributions. In particular, the optimal approximate posterior is given by propagating the training inputs through lower layers to compute the top-layer representation, then using Bayesian linear regression to map from the top-layer representation to the outputs. Inspired by this result, we use Bayesian linear regression to define a generic family of approximate posteriors for BNNs. In particular, we introduce learned "pseudo-data" at every layer, and compute the posterior over the weights by performing linear regression from the inputs (propagated from
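The last-layer construction described above can be sketched concretely. The following is a minimal illustration, not the paper's exact formulation: given features obtained by propagating the training inputs through the lower layers, the posterior over the last-layer weights is computed by conjugate Bayesian linear regression. All priors, shapes, and variance settings here are illustrative assumptions.

```python
import numpy as np

def last_layer_posterior(phi, y, noise_var=0.1, prior_var=1.0):
    """Posterior over last-layer weights w, given top-layer features
    phi (N x D) and targets y (N,), under an illustrative model:
        w ~ N(0, prior_var * I),  y | phi, w ~ N(phi @ w, noise_var * I).
    Returns the Gaussian posterior mean (D,) and covariance (D x D)."""
    n, d = phi.shape
    # Standard conjugate update for Bayesian linear regression:
    precision = phi.T @ phi / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ phi.T @ y / noise_var
    return mean, cov

# Toy example: "phi" stands in for features computed by lower layers.
rng = np.random.default_rng(0)
phi = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = phi @ w_true + 0.1 * rng.standard_normal(50)
mean, cov = last_layer_posterior(phi, y, noise_var=0.01)
```

With enough data the posterior mean concentrates near the weights that generated the targets, while the covariance quantifies the remaining uncertainty; conditioning on the lower-layer weights in this way is what induces the cross-layer dependencies discussed above.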

