SPARSE UNCERTAINTY REPRESENTATION IN DEEP LEARNING WITH INDUCING WEIGHTS

Abstract

Bayesian neural networks and deep ensembles represent two modern paradigms of uncertainty quantification in deep learning. Yet these approaches struggle to scale, mainly due to memory inefficiency: they require parameter storage several times that of their deterministic counterparts. To address this, we augment the weight matrix of each layer with a small number of inducing weights, thereby projecting the uncertainty quantification into such lower-dimensional spaces. We further extend Matheron's conditional Gaussian sampling rule to enable fast weight sampling, which allows our inference method to maintain a reasonable run-time compared with ensembles. Importantly, our approach achieves competitive performance to the state of the art in prediction and uncertainty estimation tasks with fully connected neural networks and ResNets, while reducing the parameter size to ≤ 47.9% of that of a single neural network.

1. INTRODUCTION

Deep learning models are becoming deeper and wider than ever before. From image recognition models such as ResNet-101 (He et al., 2016a) and DenseNet (Huang et al., 2017) to BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) for language modelling, deep neural networks have found consistent success in fitting large-scale data. As these models are increasingly deployed in real-world applications, calibrated uncertainty estimates for their predictions become crucial, especially in safety-critical areas such as healthcare. In this regard, Bayesian neural networks (BNNs) (MacKay, 1995; Blundell et al., 2015; Gal & Ghahramani, 2016; Zhang et al., 2020) and deep ensembles (Lakshminarayanan et al., 2017) represent two popular paradigms for estimating uncertainty, which have shown promising results in applications such as (medical) image processing (Kendall & Gal, 2017; Tanno et al., 2017) and out-of-distribution detection (Ovadia et al., 2019).

Though progress has been made, one major obstacle to scaling up BNNs and deep ensembles is their computation cost in both time and space. Especially for the latter, both approaches require several times more parameters than their deterministic counterparts. Recent efforts have been made to improve their memory efficiency (Louizos & Welling, 2017; Swiatkowski et al., 2020; Wen et al., 2020; Dusenberry et al., 2020), but these approaches still require more storage than a deterministic neural network. Perhaps surprisingly, when taking the width of the network layers to the infinite limit, the resulting neural network becomes "parameter efficient": an infinitely wide BNN becomes a Gaussian process (GP), which is known for good uncertainty estimates (Neal, 1995; Matthews et al., 2018; Lee et al., 2018). Effectively, the "parameters" of a GP are the datapoints, which have a considerably smaller memory footprint.
To further reduce the computational burden, sparse posterior approximations with a small number of inducing points are widely used (Snelson & Ghahramani, 2006; Titsias, 2009), rendering sparse GPs more memory efficient than their neural network counterparts. Can we bring the advantages of sparse approximations in GPs (which are infinitely wide neural networks) to finite-width deep learning models? We provide an affirmative answer regarding memory efficiency, by proposing an uncertainty quantification framework based on sparse uncertainty representations. We present our approach in the BNN context, but it is applicable to deep ensembles as well. In detail, our contributions are as follows:

• We introduce inducing weights as auxiliary variables for uncertainty estimation in deep neural networks with efficient (approximate) posterior sampling. Specifically:
  - We introduce inducing weights, lower-dimensional counterparts to the actual weight matrices, for variational inference in BNNs, as well as a memory-efficient parameterisation and an extension to ensemble methods (Section 3.1).
  - We extend Matheron's rule to facilitate efficient posterior sampling (Section 3.2).
  - We show the connection to sparse (deep) GPs, in that inducing weights can be viewed as projected noisy inducing outputs in pre-activation output space (Section 3.3).
  - We provide an in-depth computational complexity analysis (Section 3.4), showing a significant advantage in parameter efficiency.
• We apply the proposed approach to both BNNs and deep ensembles. Experiments in classification, model robustness and out-of-distribution detection tasks show that our inducing weight approaches achieve competitive performance to their counterparts in the original weight space on modern deep architectures for image classification, while reducing the parameter count to less than half of that of a single neural network.
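As background for the sampling contribution above: the classical Matheron rule states that if $(\theta, a)$ are jointly Gaussian, a sample from $\theta | a$ can be obtained by drawing a joint sample $(\theta_s, a_s)$ and shifting it onto the conditional. The sketch below is a minimal NumPy illustration with made-up covariance blocks, not the paper's extended rule for weight matrices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Joint zero-mean Gaussian over (theta, a) with covariance
# [[S_tt, S_ta], [S_ta^T, S_aa]] (illustrative values).
S_tt = np.array([[2.0, 0.6], [0.6, 1.5]])
S_ta = np.array([[0.8], [0.3]])
S_aa = np.array([[1.0]])

def matheron_conditional_sample(a, rng):
    """Sample theta | a via Matheron's rule: draw a joint sample
    (theta_s, a_s), then shift it onto the conditional distribution."""
    S = np.block([[S_tt, S_ta], [S_ta.T, S_aa]])
    L = np.linalg.cholesky(S)
    z = L @ rng.standard_normal(3)          # joint sample from N(0, S)
    theta_s, a_s = z[:2], z[2:]
    # theta_s + S_ta S_aa^{-1} (a - a_s) has the law of theta | a.
    return theta_s + (S_ta @ np.linalg.solve(S_aa, a - a_s)).ravel()

theta = matheron_conditional_sample(np.array([0.5]), rng)
print(theta.shape)  # (2,)
```

The appeal of this construction is that it only needs joint sampling plus a linear solve in the (small) dimension of $a$, which is what makes a low-dimensional inducing variable attractive.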

2. VARIATIONAL INFERENCE WITH INDUCING VARIABLES

This section lays out the basics of variational inference and inducing variables for posterior approximation, which serve as the foundation and inspiration for this work. Given observations $\mathcal{D} = \{\mathbf{X}, \mathbf{Y}\}$ with $\mathbf{X} = [\mathbf{x}_1, ..., \mathbf{x}_N]$, $\mathbf{Y} = [\mathbf{y}_1, ..., \mathbf{y}_N]$, we would like to fit a neural network $p(\mathbf{y}|\mathbf{x}, W_{1:L})$ with weights $W_{1:L}$ to the data. BNNs posit a prior distribution $p(W_{1:L})$ over the weights, and construct an approximate posterior $q(W_{1:L})$ to the intractable exact posterior $p(W_{1:L}|\mathcal{D}) \propto p(\mathcal{D}|W_{1:L}) p(W_{1:L})$, where $p(\mathcal{D}|W_{1:L}) = p(\mathbf{Y}|\mathbf{X}, W_{1:L}) = \prod_{n=1}^{N} p(\mathbf{y}_n|\mathbf{x}_n, W_{1:L})$.

Variational inference. Variational inference (Jordan et al., 1999; Zhang et al., 2018a) constructs an approximation $q(\theta)$ to the posterior $p(\theta|\mathcal{D}) \propto p(\theta) p(\mathcal{D}|\theta)$ by maximising a variational lower-bound:

$\log p(\mathcal{D}) \geq \mathcal{L}(q(\theta)) := \mathbb{E}_{q(\theta)}[\log p(\mathcal{D}|\theta)] - \mathrm{KL}[q(\theta) \| p(\theta)]. \quad (1)$

For BNNs, $\theta = \{W_{1:L}\}$, and a simple choice of $q$ is a fully factorised Gaussian (FFG):

$q(W_{1:L}) = \prod_{l=1}^{L} \prod_{i=1}^{d_l^{out}} \prod_{j=1}^{d_l^{in}} \mathcal{N}(W_l^{(i,j)}; m_l^{(i,j)}, v_l^{(i,j)}),$

with $m_l^{(i,j)}, v_l^{(i,j)}$ the mean and variance of $W_l^{(i,j)}$, and $d_l^{in}, d_l^{out}$ the respective number of inputs and outputs to layer $l$. The variational parameters are then $\varphi = \{m_l^{(i,j)}, v_l^{(i,j)}\}_{l=1}^{L}$. Gradients of (1) w.r.t. $\varphi$ can be estimated with mini-batches of data (Hoffman et al., 2013) and with Monte Carlo sampling from the $q$ distribution (Titsias & Lázaro-Gredilla, 2014; Kingma & Welling, 2014). By setting $q$ to an FFG, a variational BNN can be trained with computational requirements similar to those of a deterministic network (Blundell et al., 2015).

Improved posterior approximation with inducing variables. Auxiliary variable approaches (Agakov & Barber, 2004; Salimans et al., 2015; Ranganath et al., 2016) construct the $q(\theta)$ distribution with an auxiliary variable $a$: $q(\theta) = \int q(\theta|a) q(a)\, da$, with the hope that a potentially richer mixture distribution $q(\theta)$ can achieve better approximations.
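To make the FFG approximation concrete, the sketch below (a minimal NumPy illustration, not the paper's code) draws a reparameterised weight sample $W = m + \sqrt{v} \odot \epsilon$ for one layer and evaluates the analytic KL term against a standard normal prior:

```python
import numpy as np

def sample_ffg_weights(mean, log_var, rng):
    """Reparameterised sample W = m + sqrt(v) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

def kl_ffg_to_standard_normal(mean, log_var):
    """Analytic KL[ N(m, v) || N(0, 1) ], summed over all weight entries."""
    var = np.exp(log_var)
    return 0.5 * np.sum(var + mean ** 2 - 1.0 - log_var)

# Toy layer with d_in = 3 inputs and d_out = 2 outputs.
rng = np.random.default_rng(0)
m = np.zeros((2, 3))       # variational means  m_l^{(i,j)}
log_v = np.zeros((2, 3))   # log-variances; v = 1 initially

W = sample_ffg_weights(m, log_v, rng)     # one Monte Carlo weight sample
kl = kl_ffg_to_standard_normal(m, log_v)  # 0.0 here, since q equals the prior
```

In a training loop, `W` would be used for a forward pass to estimate the expected log-likelihood term, and `kl` would be subtracted to form the lower bound; gradients flow through both via the reparameterisation.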
As $q(\theta)$ then becomes intractable, an auxiliary variational lower-bound is used to optimise $q(\theta, a)$:

$\log p(\mathcal{D}) \geq \mathcal{L}(q(\theta, a)) = \mathbb{E}_{q(\theta, a)}[\log p(\mathcal{D}|\theta)] + \mathbb{E}_{q(\theta, a)}\left[\log \frac{p(\theta) r(a|\theta)}{q(\theta|a) q(a)}\right]. \quad (2)$

Here $r(a|\theta)$ is an auxiliary distribution that needs to be specified; existing approaches often use a "reverse model" for $r(a|\theta)$. Instead, we define $r(a|\theta)$ in a generative manner: $r(a|\theta)$ is the "posterior" of the following "generative model", whose "evidence" is exactly the prior of $\theta$:

$r(a|\theta) = \tilde{p}(a|\theta) \propto p(a) p(\theta|a), \quad \text{such that} \quad \tilde{p}(\theta) := \int p(a) p(\theta|a)\, da = p(\theta). \quad (3)$

Plugging (3) into (2) immediately leads to:

$\mathcal{L}(q(\theta, a)) = \mathbb{E}_{q(\theta)}[\log p(\mathcal{D}|\theta)] - \mathbb{E}_{q(a)}[\mathrm{KL}[q(\theta|a) \| p(\theta|a)]] - \mathrm{KL}[q(a) \| p(a)].$

This approach yields an efficient approximate inference algorithm, translating the complexity of inference from $\theta$ to $a$, if $\dim(a) < \dim(\theta)$ and $q(\theta, a) = q(\theta|a) q(a)$ has the following properties:
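To see the structure of the two KL penalty terms in the auxiliary bound, the sketch below works through an illustrative scalar example with made-up Gaussian choices (not the paper's model): $p(a) = \mathcal{N}(0, 1)$, $p(\theta|a) = \mathcal{N}(a, \sigma^2)$, $q(a) = \mathcal{N}(\mu_a, v_a)$, and $q(\theta|a) = \mathcal{N}(a + b, s^2)$. Both KL terms are then available in closed form:

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL[ N(m1, v1) || N(m2, v2) ] for scalar Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Hypothetical scalar model: p(a) = N(0, 1), p(theta|a) = N(a, sig2).
sig2 = 0.5
# Variational choices: q(a) = N(mu_a, v_a), q(theta|a) = N(a + b, s2).
mu_a, v_a = 0.3, 0.8
b, s2 = 0.1, 0.4

# Conditional KL term: E_{q(a)} KL[q(theta|a) || p(theta|a)].
# Here it is independent of a, because both conditionals are centred
# on a (up to the constant shift b), so the mean gap is always b.
kl_cond = kl_gauss(b, s2, 0.0, sig2)
# Marginal KL term: KL[q(a) || p(a)].
kl_a = kl_gauss(mu_a, v_a, 0.0, 1.0)
```

Subtracting `kl_cond + kl_a` from a Monte Carlo estimate of the expected log-likelihood gives the auxiliary bound; when both KLs vanish, $q(\theta, a)$ matches the prior-induced joint exactly.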

