SPARSE UNCERTAINTY REPRESENTATION IN DEEP LEARNING WITH INDUCING WEIGHTS

Abstract

Bayesian neural networks and deep ensembles represent two modern paradigms of uncertainty quantification in deep learning. Yet these approaches struggle to scale, mainly due to memory inefficiency: they require parameter storage several times that of their deterministic counterparts. To address this, we augment each layer's weight matrix with a small number of inducing weights, thereby projecting uncertainty quantification into a low-dimensional space. We further extend Matheron's conditional Gaussian sampling rule to enable fast weight sampling, which allows our inference method to maintain reasonable run-time compared with ensembles. Importantly, our approach achieves competitive performance to the state-of-the-art in prediction and uncertainty estimation tasks with fully connected neural networks and ResNets, while reducing the parameter size to ≤ 47.9% of that of a single neural network.

1. INTRODUCTION

Deep learning models are becoming deeper and wider than ever before. From image recognition models such as ResNet-101 (He et al., 2016a) and DenseNet (Huang et al., 2017) to BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) for language modelling, deep neural networks have found consistent success in fitting large-scale data. As these models are increasingly deployed in real-world applications, calibrated uncertainty estimates for their predictions become crucial, especially in safety-critical areas such as healthcare. In this regard, Bayesian neural networks (BNNs) (MacKay, 1995; Blundell et al., 2015; Gal & Ghahramani, 2016; Zhang et al., 2020) and deep ensembles (Lakshminarayanan et al., 2017) represent two popular paradigms for estimating uncertainty, which have shown promising results in applications such as (medical) image processing (Kendall & Gal, 2017; Tanno et al., 2017) and out-of-distribution detection (Ovadia et al., 2019).

Though progress has been made, one major obstacle to scaling up BNNs and deep ensembles is their computational cost in both time and space. For the latter in particular, both approaches require several times as many parameters as their deterministic counterparts. Recent efforts have improved their memory efficiency (Louizos & Welling, 2017; Swiatkowski et al., 2020; Wen et al., 2020; Dusenberry et al., 2020), but these approaches still require more storage than a deterministic neural network.

Perhaps surprisingly, when the width of the network layers is taken to the infinite limit, the resulting neural network becomes "parameter efficient": an infinitely wide BNN becomes a Gaussian process (GP), which is known for good uncertainty estimates (Neal, 1995; Matthews et al., 2018; Lee et al., 2018). Effectively, the "parameters" of a GP are the datapoints, which have a considerably smaller memory footprint.
To further reduce the computational burden, sparse posterior approximations with a small number of inducing points are widely used (Snelson & Ghahramani, 2006; Titsias, 2009), rendering sparse GPs more memory efficient than their neural network counterparts. Can we bring the advantages of sparse approximations in GPs (which are infinitely wide neural networks) to finite-width deep learning models? We provide an affirmative answer regarding memory efficiency, by proposing an uncertainty quantification framework based on sparse uncertainty representations. We present our approach in the BNN context, but it is applicable to deep ensembles as well. In detail, our contributions are as follows:

• We introduce inducing weights as auxiliary variables for uncertainty estimation in deep neural networks with efficient (approximate) posterior sampling. Specifically:
  - We introduce inducing weights, lower-dimensional counterparts to the actual weight matrices, for variational inference in BNNs, as well as a memory-efficient parameterisation and an extension to ensemble methods (Section 3.1).
  - We extend Matheron's rule to facilitate efficient posterior sampling (Section 3.2).
  - We show the connection to sparse (deep) GPs, in that inducing weights can be viewed as projected noisy inducing outputs in pre-activation output space (Section 3.3).
  - We provide an in-depth computational complexity analysis (Section 3.4), showing a significant advantage in parameter efficiency.
• We apply the proposed approach to both BNNs and deep ensembles. Experiments in classification, model robustness and out-of-distribution detection tasks show that our inducing weight approaches achieve competitive performance to their counterparts in the original weight space on modern deep architectures for image classification, while reducing the parameter count to less than half of that of a single neural network.

2. VARIATIONAL INFERENCE WITH INDUCING VARIABLES

This section lays out the basics of variational inference and inducing variables for posterior approximations, which serve as the foundation and inspiration for this work. Given observations D = {X, Y} with X = [x_1, ..., x_N], Y = [y_1, ..., y_N], we would like to fit a neural network p(y|x, W_{1:L}) with weights W_{1:L} to the data. BNNs posit a prior distribution p(W_{1:L}) over the weights, and construct an approximate posterior q(W_{1:L}) to the intractable exact posterior p(W_{1:L}|D) ∝ p(D|W_{1:L})p(W_{1:L}), where p(D|W_{1:L}) = p(Y|X, W_{1:L}) = ∏_{n=1}^N p(y_n|x_n, W_{1:L}).

Variational inference   Variational inference (Jordan et al., 1999; Zhang et al., 2018a) constructs an approximation q(θ) to the posterior p(θ|D) ∝ p(θ)p(D|θ) by maximising a variational lower-bound:

  log p(D) ≥ L(q(θ)) := E_{q(θ)}[log p(D|θ)] − KL[q(θ)||p(θ)].   (1)

For BNNs, θ = {W_{1:L}}, and a simple choice of q is a fully factorised Gaussian (FFG): q(W_{1:L}) = ∏_{l=1}^L ∏_{i=1}^{d_out^l} ∏_{j=1}^{d_in^l} N(m_l^{(i,j)}, v_l^{(i,j)}), with m_l^{(i,j)}, v_l^{(i,j)} the mean and variance of W_l^{(i,j)}, and d_in^l, d_out^l the respective numbers of inputs and outputs of layer l. The variational parameters are then φ = {m_l^{(i,j)}, v_l^{(i,j)}}_{l=1}^L. Gradients of (1) w.r.t. φ can be estimated with mini-batches of data (Hoffman et al., 2013) and with Monte Carlo sampling from the q distribution (Titsias & Lázaro-Gredilla, 2014; Kingma & Welling, 2014). By setting q to an FFG, a variational BNN can be trained with similar computational requirements to a deterministic network (Blundell et al., 2015).

Improved posterior approximation with inducing variables   Auxiliary variable approaches (Agakov & Barber, 2004; Salimans et al., 2015; Ranganath et al., 2016) construct the q(θ) distribution with an auxiliary variable a: q(θ) = ∫ q(θ|a)q(a) da, with the hope that a potentially richer mixture distribution q(θ) can achieve better approximations.
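To make the FFG construction above concrete, here is a minimal numpy sketch (our own illustration, not code from the paper) of reparameterised sampling from an FFG posterior and its closed-form KL to an isotropic Gaussian prior; `sample_ffg` and `kl_ffg_to_isotropic` are hypothetical helper names:

```python
import numpy as np

def sample_ffg(m, log_v, rng):
    """Reparameterised sample W = m + sqrt(v) * eps, with eps ~ N(0, I)."""
    return m + np.exp(0.5 * log_v) * rng.standard_normal(m.shape)

def kl_ffg_to_isotropic(m, log_v, sigma2=1.0):
    """KL[N(m, diag(v)) || N(0, sigma2 I)], summed over all weight entries."""
    v = np.exp(log_v)
    return 0.5 * np.sum(v / sigma2 + m**2 / sigma2 - 1.0 - log_v + np.log(sigma2))

rng = np.random.default_rng(0)
m, log_v = np.zeros((4, 3)), np.zeros((4, 3))   # q = N(0, I) entry-wise
assert kl_ffg_to_isotropic(m, log_v) == 0.0     # q equals the prior, KL vanishes
W = sample_ffg(m, log_v, rng)
assert W.shape == (4, 3)
```

Training would then follow the bound in (1), with the KL term above and Monte Carlo estimates of the likelihood term from mini-batches.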
As q(θ) then becomes intractable, an auxiliary variational lower-bound is used to optimise q(θ, a):

  log p(D) ≥ L(q(θ, a)) = E_{q(θ,a)}[log p(D|θ)] + E_{q(θ,a)}[log (p(θ)r(a|θ)) / (q(θ|a)q(a))].   (2)

Here r(a|θ) is an auxiliary distribution that needs to be specified; existing approaches often use a "reverse model" for r(a|θ). Instead, we define r(a|θ) in a generative manner: r(a|θ) is the "posterior" of the following "generative model", whose "evidence" is exactly the prior of θ:

  r(a|θ) = p̃(a|θ) ∝ p̃(a)p̃(θ|a), such that p̃(θ) := ∫ p̃(a)p̃(θ|a) da = p(θ).   (3)

Plugging (3) into (2) immediately leads to

  L(q(θ, a)) = E_{q(θ)}[log p(D|θ)] − E_{q(a)}[KL[q(θ|a)||p̃(θ|a)]] − KL[q(a)||p̃(a)].   (4)

This approach yields an efficient approximate inference algorithm, translating the complexity of inference in θ to a, if dim(a) < dim(θ) and q(θ, a) = q(θ|a)q(a) has the following properties:

1. A "pseudo prior" p̃(a)p̃(θ|a) is defined such that ∫ p̃(a)p̃(θ|a) da = p(θ);
2. The conditional distributions q(θ|a) and p̃(θ|a) are in the same parametric family, so that they can share parameters;
3. Both sampling θ ∼ q(θ) and computing KL[q(θ|a)||p̃(θ|a)] can be done efficiently;
4. The designs of q(a) and p̃(a) can potentially provide extra advantages (in time and space complexities and/or ease of optimisation).

We call a the inducing variable of θ, a name inspired by variational sparse GPs (SVGP) with inducing points (Snelson & Ghahramani, 2006; Titsias, 2009). Indeed SVGP is a special case: θ = f, a = u, the GP prior is p(f|X) = GP(0, K_XX), p̃(u) = GP(0, K_ZZ), p̃(f, u) = p̃(u)p̃(f|X, u), q(f|u) = p̃(f|X, u), q(f, u) = p̃(f|X, u)q(u), and Z are the optimisable inducing inputs. The variational lower-bound is L(q(f, u)) = E_{q(f)}[log p(Y|f)] − KL[q(u)||p̃(u)], and the variational parameters are φ = {Z, distribution parameters of q(u)}. SVGP satisfies the marginalisation constraint (3) by definition, and it has KL[q(f|u)||p̃(f|u)] = 0.
Moreover, by using a small M = dim(u) and exploiting the structure of the q distribution, SVGP reduces the run-time from O(N³) to O(NM²), where N is the number of inputs in X, while also making it affordable to store a full Gaussian q(u). Lastly, u can be whitened, leading to the "pseudo prior" p̃(f, v) = p̃(f|X, u = K_ZZ^{1/2} v)p̃(v), p̃(v) = N(v; 0, I), which can bring benefits in optimisation. In the rest of the paper we assume the "pseudo prior" p̃(θ, a) satisfies the marginalisation constraint (3), allowing us to write p(θ, a) := p̃(θ, a). It might seem unclear how to design p̃(θ, a) for an arbitrary probabilistic model; however, for a Gaussian prior on θ, the rules for computing conditional Gaussian distributions can be used to construct p̃. In Section 3 we exploit these rules to develop an efficient approximate inference method for Bayesian neural networks with inducing weights.

3.1. INDUCING WEIGHTS FOR NEURAL NETWORK PARAMETERS

Following the design principles of inducing variables, we introduce to each network layer l a smaller inducing weight matrix U_l, and construct joint approximate posterior distributions for inference. In the rest of the paper we assume a factorised prior across layers, p(W_{1:L}) = ∏_l p(W_l), and for ease of notation we drop the l indices when the context is clear.

Augmenting network layers with inducing weights   Suppose the weight W ∈ R^{d_out×d_in} has a Gaussian prior p(W) = p(vec(W)) = N(0, σ²I), where vec(W) concatenates the columns of the weight matrix into a vector. A first attempt to augment p(vec(W)) with an inducing weight variable U ∈ R^{M_out×M_in} may be to construct a multivariate Gaussian p(vec(W), vec(U)) such that ∫ p(vec(W), vec(U)) dU = N(0, σ²I). This requires the block of the joint covariance matrix corresponding to vec(W) to match the prior covariance σ²I. We are then free to parameterise the remaining entries of the joint covariance matrix, as long as the full matrix remains positive definite. The conditional distribution p(W|U) is then a function of these parameters, and conditional sampling from p(W|U) is discussed further in Appendix A.1. Unfortunately, as dim(vec(W)) is typically large (e.g. of the order of 10^7), using a full-covariance Gaussian for p(vec(W), vec(U)) is computationally intractable.

This issue can be addressed using matrix normal distributions (Gupta & Nagar, 2018). Notice that the prior p(vec(W)) = N(0, σ²I) has an equivalent matrix normal form p(W) = MN(0, σ_r²I, σ_c²I), with row and column standard deviations σ_r, σ_c > 0 satisfying σ = σ_r σ_c. We now introduce the inducing variables in matrix space: in addition to U we pad in two auxiliary variables U_r ∈ R^{M_out×d_in} and U_c ∈ R^{d_out×M_in}, so that the full augmented prior is

  [W, U_c; U_r, U] ∼ p(W, U_c, U_r, U) := MN(0, Σ_r, Σ_c),   (5)

with Z_r ∈ R^{d_out×M_out}, Z_c ∈ R^{d_in×M_in} free parameters, D_r, D_c diagonal, and

  L_r = [σ_r I, 0; Z_r^T, D_r]  s.t.  Σ_r = L_r L_r^T = [σ_r²I, σ_r Z_r; σ_r Z_r^T, Z_r^T Z_r + D_r²],
  L_c = [σ_c I, 0; Z_c^T, D_c]  s.t.  Σ_c = L_c L_c^T = [σ_c²I, σ_c Z_c; σ_c Z_c^T, Z_c^T Z_c + D_c²].

The marginal of the inducing weight is p(U) = MN(0, Ψ_r, Ψ_c) with Ψ_r = Z_r^T Z_r + D_r² and Ψ_c = Z_c^T Z_c + D_c². In the experiments we use whitened inducing weights, transforming U so that p(U) = MN(0, I, I) (Appendix E); for clarity we continue using the formulas above in the main text.

The matrix normal parameterisation introduces two additional variables U_r, U_c without providing additional expressiveness. Hence it is desirable to integrate them out, leading to a joint multivariate normal whose covariance has Khatri-Rao product structure:

  p(vec(W), vec(U)) = N(0, [σ_c²I ⊗ σ_r²I, σ_c Z_c ⊗ σ_r Z_r; σ_c Z_c^T ⊗ σ_r Z_r^T, Ψ_c ⊗ Ψ_r]).   (6)

As the dominating memory cost, O(d_out M_out + d_in M_in), comes from storing Z_r and Z_c, the matrix normal parameterisation of the augmented prior is memory efficient.

Posterior approximation in the joint space   We construct a posterior approximation factorised across layers: q(W_{1:L}, U_{1:L}) = ∏_l q(W_l|U_l)q(U_l). The simplest option for q(W|U) is q(W|U) = p(vec(W)|vec(U)) = N(µ_{W|U}, Σ_{W|U}), similar to sparse GPs. A slightly more flexible variant rescales the covariance matrix while keeping the mean tied, i.e. q(W|U) = q(vec(W)|vec(U)) = N(µ_{W|U}, λ²Σ_{W|U}), which still allows the KL term to be computed efficiently (see Appendix B):

  R(λ) := KL[q(W|U)||p(W|U)] = d_in d_out (0.5λ² − log λ − 0.5),  W ∈ R^{d_out×d_in}.   (7)

Plugging θ = {W_{1:L}}, a = {U_{1:L}} into (4) results in the following variational lower-bound:

  L(q(W_{1:L}, U_{1:L})) = E_{q(W_{1:L})}[log p(D|W_{1:L})] − Σ_{l=1}^L (R(λ_l) + KL[q(U_l)||p(U_l)]),   (8)

with λ_l the scaling parameter associated with q(W_l|U_l). The variational parameters are therefore φ = {Z_c, Z_r, D_c, D_r, λ, dist. params of q(U)} for each network layer.
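The block structure of the augmented prior can be checked numerically. The sketch below (our own, with arbitrary placeholder values for Z_r, Z_c, D_r, D_c and the shape conventions Z_r ∈ R^{d_out×M_out}, Z_c ∈ R^{d_in×M_in}) verifies that the W-block of Σ_r stays at σ_r²I, as required by the marginalisation constraint, and implements the rescaling penalty R(λ) from (7):

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, M_out, M_in = 8, 6, 3, 2
sigma_r = sigma_c = 1.0

# Free parameters of the augmented prior (random placeholder values).
Z_r = rng.standard_normal((d_out, M_out))
Z_c = rng.standard_normal((d_in, M_in))
D_r = np.diag(rng.uniform(0.5, 1.0, M_out))
D_c = np.diag(rng.uniform(0.5, 1.0, M_in))

# Marginal covariance factors of the inducing weight: p(U) = MN(0, Psi_r, Psi_c).
Psi_r = Z_r.T @ Z_r + D_r @ D_r
Psi_c = Z_c.T @ Z_c + D_c @ D_c

# Marginalisation constraint: the W-block of Sigma_r = L_r L_r^T stays at
# sigma_r^2 I, so integrating out the inducing variables recovers p(W).
L_r = np.block([[sigma_r * np.eye(d_out), np.zeros((d_out, M_out))],
                [Z_r.T, D_r]])
Sigma_r = L_r @ L_r.T
assert np.allclose(Sigma_r[:d_out, :d_out], sigma_r**2 * np.eye(d_out))
assert np.allclose(Sigma_r[d_out:, d_out:], Psi_r)

def R(lam, d_in, d_out):
    """KL[q(W|U) || p(W|U)] for the lambda-rescaled conditional, eq. (7)."""
    return d_in * d_out * (0.5 * lam**2 - np.log(lam) - 0.5)

assert R(1.0, d_in, d_out) == 0.0  # lambda = 1 recovers p(W|U) exactly
```

The same check applies to Σ_c with rows and columns swapped; in practice only Z_r, Z_c, D_r, D_c are stored, never the full Σ matrices.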
Two choices of q(U)   A simple choice is an FFG q(vec(U)) = N(m_u, diag(v_u)), which performs mean-field inference in U space (cf. Blundell et al., 2015); in this case KL[q(U)||p(U)] has a closed-form solution. Another choice is a "mixture of delta measures" q(U) = (1/K) Σ_{k=1}^K δ(U = U^{(k)}). In other words, we keep K distinct sets of parameters {U_{1:L}^{(k)}}_{k=1}^K in inducing space, which are projected back into the original parameter space via the shared conditional distributions q(W_l|U_l) to obtain the weights. This approach can be viewed as constructing "deep ensembles" in U space, and we follow ensemble methods (e.g. Lakshminarayanan et al., 2017) in dropping KL[q(U)||p(U)] from (8). The inducing weight U is typically chosen to have significantly lower dimensions than W. Combined with the fact that q(W|U) and p(W|U) differ only in the covariance scaling constant, this means U can be regarded as a sparse representation of uncertainty for the network layer, as the major updates in (approximate) posterior belief are quantified by q(U).

3.2. EFFICIENT SAMPLING WITH EXTENDED MATHERON'S RULE

Computing the variational lower-bound (8) requires samples from q(W), which calls for an efficient sampling procedure for the conditional q(W|U). Unfortunately, the q(W|U) derived from (6) with covariance rescaling is not a matrix normal, so direct sampling remains prohibitively expensive. To address this challenge, we extend Matheron's rule (Journel & Huijbregts, 1978; Hoffman & Ribak, 1991; Doucet, 2010) to efficiently sample from q(W|U). The idea is that one can sample from a conditional Gaussian by transforming a sample from the joint distribution. In detail, we derive in Appendix C the extended Matheron's rule to sample W ∼ q(W|U):

  W = λW̄ + σ Z_r Ψ_r^{-1} (U − λŪ) Ψ_c^{-1} Z_c^T,  W̄, Ū ∼ p(W̄, Ū_c, Ū_r, Ū) = MN(0, Σ_r, Σ_c).   (9)

Here W̄, Ū ∼ p(W̄, Ū_c, Ū_r, Ū) means we sample W̄, Ū_c, Ū_r, Ū from the joint and drop Ū_c, Ū_r. In fact Ū_c, Ū_r are never computed: as shown in Appendix C, the samples W̄, Ū can be obtained by

  W̄ = σE_1,  Ū = Z_r^T E_1 Z_c + L̃_r Ẽ_2 D_c + D_r Ẽ_3 L̃_c^T + D_r E_4 D_c,
  E_1 ∼ MN(0, I_{d_out}, I_{d_in}),  Ẽ_2, Ẽ_3, E_4 ∼ MN(0, I_{M_out}, I_{M_in}),
  L̃_r = Cholesky(Z_r^T Z_r),  L̃_c = Cholesky(Z_c^T Z_c).   (10)

Therefore the major extra cost is the O(2M_out³ + 2M_in³ + d_out M_out M_in + M_in d_out d_in) required for inverting Ψ_r, Ψ_c, computing L̃_r, L̃_c, and the matrix multiplications. The extended Matheron's rule is visualised in Figure 1, with a comparison to the original Matheron's rule for sampling from q(vec(W)|vec(U)). This clearly shows that our recipe avoids computing big matrix inverses and multiplications, resulting in a significant speed-up for conditional sampling.

Figure 2: Showing the U variables in pre-activation spaces. To simplify we set σ_c = 1 w.l.o.g.
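A minimal numpy sketch of the extended Matheron's rule, following (9) and (10) under the shape conventions Z_r ∈ R^{d_out×M_out}, Z_c ∈ R^{d_in×M_in}; `sample_W_given_U` is a hypothetical helper name, and in practice the solves against Ψ_r, Ψ_c and the Cholesky factors would be cached across samples:

```python
import numpy as np

def sample_W_given_U(U, Z_r, Z_c, D_r, D_c, sigma, lam, rng):
    """Extended Matheron's rule (eqs. 9-10), a sketch: sample W ~ q(W|U)
    by transforming a joint sample (W_bar, U_bar) from the augmented prior."""
    d_out, M_out = Z_r.shape
    d_in, M_in = Z_c.shape
    Psi_r = Z_r.T @ Z_r + D_r @ D_r
    Psi_c = Z_c.T @ Z_c + D_c @ D_c
    # Joint sample via eq. (10); U_bar_c and U_bar_r are never materialised.
    E1 = rng.standard_normal((d_out, d_in))
    E2, E3, E4 = (rng.standard_normal((M_out, M_in)) for _ in range(3))
    Lt_r = np.linalg.cholesky(Z_r.T @ Z_r)
    Lt_c = np.linalg.cholesky(Z_c.T @ Z_c)
    W_bar = sigma * E1
    U_bar = Z_r.T @ E1 @ Z_c + Lt_r @ E2 @ D_c + D_r @ E3 @ Lt_c.T + D_r @ E4 @ D_c
    # Eq. (9): correct the joint sample towards the conditioning value U.
    return lam * W_bar + sigma * Z_r @ np.linalg.solve(Psi_r, U - lam * U_bar) \
           @ np.linalg.solve(Psi_c, Z_c.T)

rng = np.random.default_rng(0)
d_out, d_in, M_out, M_in = 16, 12, 4, 3
Z_r = rng.standard_normal((d_out, M_out)); Z_c = rng.standard_normal((d_in, M_in))
D_r = np.diag(rng.uniform(0.5, 1.0, M_out)); D_c = np.diag(rng.uniform(0.5, 1.0, M_in))
U = rng.standard_normal((M_out, M_in))
W = sample_W_given_U(U, Z_r, Z_c, D_r, D_c, sigma=1.0, lam=1.0, rng=rng)
assert W.shape == (d_out, d_in)
```

Note that only M×M systems are solved; no d_out d_in-dimensional covariance is ever formed, which is the point of the extended rule.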

3.3. UNDERSTANDING INDUCING WEIGHTS: A FUNCTION-SPACE PERSPECTIVE

We present the proposed approach again, but from a function-space inference perspective. Assume a network layer computes the following transformation of the input X = [x_1, ..., x_N], x_i ∈ R^{d_in×1}:

  F = WX,  H = g(F),  W ∈ R^{d_out×d_in},  X ∈ R^{d_in×N},  g(·) the non-linearity.   (11)

As W has a Gaussian prior p(vec(W)) = N(0, σ²I), each of the rows of F = [f_1, ..., f_{d_out}]^T, f_i ∈ R^{N×1}, has a Gaussian process form with a linear kernel: f_i|X ∼ GP(0, K_XX), K_XX(m, n) = σ² x_m^T x_n. Inference on F directly has O(N³ + d_out N²) cost, so a sparse approximation is needed. Slightly different from the usual approach, we introduce auxiliary variables U_c = [u_1^c, ..., u_{d_out}^c]^T ∈ R^{d_out×M_in} as follows, using shared "inducing inputs" Z_c ∈ R^{d_in×M_in}:

  p(f_i, û_i|X) = GP(0, K_{[X,Z_c],[X,Z_c]}),  p(u_i^c|û_i) = N(û_i/σ_c, σ_r² D_c²).

By marginalising out the "noiseless inducing outputs" {û_i} (derivations in Appendix D), we can compute the marginal distributions as p(U_c) := p({u_i^c}) = MN(0, σ_r²I, Ψ_c) and

  p(F|X, U_c) = MN(U_c Ψ_c^{-1} σ_c Z_c^T X, σ_r²I, X^T σ_c²(I − Z_c Ψ_c^{-1} Z_c^T)X).   (12)

Note that p(W|U_c) = MN(U_c Ψ_c^{-1} σ_c Z_c^T, σ_r²I, σ_c²(I − Z_c Ψ_c^{-1} Z_c^T)) (see Appendix A.2). Since W ∼ MN(M, Σ_1, Σ_2) implies WX ∼ MN(MX, Σ_1, X^T Σ_2 X), p(F|X, U_c) is the push-forward distribution of p(W|U_c) under the operation F = WX:

  F ∼ p(F|X, U_c)  ⇔  W ∼ p(W|U_c), F = WX.

As {u_i^c} are "noisy" versions of {û_i} in f space, U_c can thus be viewed as "scaled noisy inducing outputs" in function space (see the red bars in the 2nd column of Figure 2).

So far the inducing variable U_c serves as a compact representation of the correlations between the columns of F only. In other words, the output dimension d_out of each f_i (and u_i^c) remains large (e.g. > 1000 in a fully connected layer). Therefore dimensionality reduction can be applied to the column vectors of U_c and F. In Appendix D we present a generative approach to do so by extending probabilistic PCA (Tipping & Bishop, 1999) to matrix normals: p(U_c) = ∫ p(U_c|U)p(U) dU, where the projection's parameters are {Z_r, D_r}, and p(U_c, U) matches the marginals of (5). This means U can be viewed as the "noisy projected inducing outputs" of the GP whose corresponding "inducing inputs" are Z_c (see the red bar in the 1st column of Figure 2). Similarly the column vectors of U_r X can be viewed as noisy projections of the column vectors of F. In Appendix D we further show that the push-forward distribution q(F|X, U) of q(W|U) differs from p(F|X, U) only in the covariance matrices, up to the same scaling constant λ. Therefore the resulting function-space variational objective is almost identical to (8), except for scaling coefficients added to the R(λ_l) terms to account for the change in dimensionality from vec(W) to vec(F). This result nicely connects posterior inference in weight- and function-space.

Table 1: Computational complexity for a single layer. We assume W ∈ R^{d_out×d_in}, U ∈ R^{M_out×M_in}, and K forward passes are made for each of the N inputs. (*Uses a parallel-computing-friendly vectorisation technique (Wen et al., 2020) for further speed-up.)

Method           | Time complexity                                 | Memory complexity
Deterministic-W  | O(N d_in d_out)                                 | O(d_in d_out)
FFG-W            | O(N K d_in d_out)                               | O(2 d_in d_out)
Ensemble-W       | O(N K d_in d_out)                               | O(K d_in d_out)
Matrix-normal-W  | O(N K d_in d_out)                               | O(d_in d_out + d_in + d_out)
k-tied FFG-W     | O(N K d_in d_out)                               | O(d_in d_out + k(d_in + d_out))
rank-1 BNN       | O(N K d_in d_out)*                              | O(d_in d_out + 2(d_in + d_out))
FFG-U            | O(N K d_in d_out + 2M_in³ + 2M_out³             | O(d_in M_in + d_out M_out + 2 M_in M_out)
                 |   + K(d_out M_out M_in + M_in d_out d_in))      |
Ensemble-U       | same as above                                   | O(d_in M_in + d_out M_out + K M_in M_out)

3.4. COMPUTATIONAL COMPLEXITIES

In Table 1 we report the computational complexity figures for two types of inducing weight approaches: FFG q(U) (FFG-U) and delta mixture q(U) (Ensemble-U). Baseline approaches include: Deterministic-W, variational inference with FFG q(W) (FFG-W, Blundell et al., 2015), deep ensembles in W space (Ensemble-W, Lakshminarayanan et al., 2017), as well as parameter-efficient approaches such as matrix-normal q(W) (Matrix-normal-W, Louizos & Welling, 2017), variational inference with k-tied FFG q(W) (k-tied FFG-W, Swiatkowski et al., 2020), and rank-1 BNN (Dusenberry et al., 2020). The gain in memory is significant for the inducing weight approaches; in fact, with M_in < d_in and M_out < d_out, the parameter storage requirement is smaller than that of a single deterministic neural network. The major run-time overhead comes from the extended Matheron's rule for sampling q(W|U). Some of these computations are performed only once, and our experiments show that with a relatively low-dimensional U the overhead is acceptable.
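The memory column of Table 1 can be made concrete with a small back-of-the-envelope script (our own sketch; the layer shape is a hypothetical ResNet-style convolution, and the helper names are ours):

```python
def params_ffg_w(d_in, d_out):
    # FFG-W stores a mean and a variance per weight entry (Table 1).
    return 2 * d_in * d_out

def params_ffg_u(d_in, d_out, M_in, M_out):
    # Inducing-weight storage: Z_c, Z_r plus a mean/variance pair for q(U);
    # the diagonal D_r, D_c and the scalar lambda are lower-order terms.
    return d_in * M_in + d_out * M_out + 2 * M_in * M_out

# A hypothetical 512-channel 3x3 convolution, reshaped to a 512 x 4608
# matrix as in Section 4.2, with M_in = M_out = 128.
d_out, d_in, M = 512, 512 * 9, 128
assert params_ffg_u(d_in, d_out, M, M) < d_in * d_out      # below one deterministic layer
assert 2 * params_ffg_u(d_in, d_out, M, M) < params_ffg_w(d_in, d_out)
```

For this layer the inducing parameterisation needs roughly 0.69M numbers against 2.36M deterministic weights, illustrating why the whole network fits in less than half the storage of its deterministic counterpart.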

4. EXPERIMENTS

We evaluate the inducing weight approaches on regression, classification and related uncertainty estimation tasks. The goal is to demonstrate competitive performance to popular W-space uncertainty estimation methods while using significantly fewer parameters. The evaluation baselines are: (1) variational inference with FFG q(W) (FFG-W, Blundell et al., 2015) vs. FFG q(U) (FFG-U, ours); (2) ensemble methods in W space (Ensemble-W, Lakshminarayanan et al., 2017) vs. ensembles in U space (Ensemble-U, ours). Another baseline is a deterministic neural network trained with maximum likelihood. Details and additional results can be found in Appendices F and G.

4.1. SYNTHETIC 1-D REGRESSION

We follow Foong et al. (2019) and construct a synthetic regression task by sampling two clusters of inputs, x_1 ∼ U[−1, −0.7] and x_2 ∼ U[0.5, 1], with targets y ∼ N(cos(4x + 0.8), 0.01). As ground truth we show exact posterior results obtained with the NUTS sampler (Hoffman & Gelman, 2014). The results are visualised in Figure 3, with the noiseless function in black, the predictive mean in blue, and up to three standard deviations as shaded area. Similar to prior results in the literature, FFG-W fails to represent the increased uncertainty away from the data and in between clusters. While underestimating predictive uncertainty overall, FFG-U shows a small increase in predictive uncertainty away from the data. In contrast, a per-layer full-covariance Gaussian in both weight space (FCG-W) and inducing space (FCG-U), as well as Ensemble-U, better captures the increased predictive variance, although the mean function is more similar to that of FFG-W.

[Figure 3 panels: (a) FFG-W, (b) FFG-U, (c) FCG-W, (d) FCG-U, (e) Ensemble-U, (f) NUTS.]
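For reference, the toy dataset can be generated as follows (a sketch under our reading of the sampling ranges; `make_toy_data` is our own helper, and N(·, 0.01) is read as variance 0.01):

```python
import numpy as np

def make_toy_data(n_per_cluster=50, seed=0):
    """Two-cluster 1-D regression data following Foong et al. (2019):
    x ~ U[-1, -0.7] or U[0.5, 1], y ~ N(cos(4x + 0.8), 0.01)."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-1.0, -0.7, n_per_cluster)
    x2 = rng.uniform(0.5, 1.0, n_per_cluster)
    x = np.concatenate([x1, x2])
    y = np.cos(4 * x + 0.8) + rng.normal(0.0, np.sqrt(0.01), x.shape)
    return x, y

x, y = make_toy_data()
assert x.shape == y.shape == (100,)
```

The gap between the two input clusters, roughly (−0.7, 0.5), is where the in-between uncertainty discussed above should grow.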

4.2. CLASSIFICATION AND IN-DISTRIBUTION CALIBRATION

As the core empirical evaluation, we train ResNet-18 models (He et al., 2016b) on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). To avoid underfitting issues with FFG-W, a useful trick is to set an upper limit σ²_max on the variance of q(W) (Louizos & Welling, 2017). We apply the same trick to the U-space methods, capping λ ≤ λ_max for q(W|U); for FFG-U we also set σ²_max on the variance of q(U). In convolution layers, we treat the 4-D weight tensor W of shape (c_out, c_in, h, w) as a c_out × c_in hw matrix. We use U matrices of shape 128 × 128 for all layers (i.e. M = M_in = M_out = 128), except that for CIFAR-10 we set M_out = 10 for the last layer.

In Table 2 we report test accuracy and test expected calibration error (ECE) (Guo et al., 2017) as a first evaluation of the uncertainty estimates. Overall, Ensemble-W achieves the highest accuracy, but is not as well-calibrated as the variational methods. Among the inducing weight approaches, Ensemble-U outperforms FFG-U on both datasets, and it is overall the best performing approach on the more challenging CIFAR-100 dataset (close-to-Ensemble-W accuracy and the lowest ECE).

In Figure 4 we show prediction run-times on trained models, relative to those of an ensemble of deterministic networks, as well as parameter sizes relative to a single ResNet-18. The extra run-time cost of the inducing methods comes from computing the extended Matheron's rule. However, since these quantities can be computed once and cached when drawing multiple samples, the overhead reduces to a small factor for larger numbers of samples K and batch sizes N. More importantly, compared to a deterministic ResNet-18, the inducing weight models reduce the parameter count by over 50% (5,352,853 vs. 11,173,962 parameters, i.e. 47.9%), even for the large value M = 128.

Hyper-parameter choices   We visualise in Figure 5 the accuracy and ECE results of the inducing weight models under different hyper-parameters.
The right-most panels make clear that performance on both metrics improves as the U matrix size M increases, with the results for M = 64 and M = 128 being fairly similar. Setting proper values for λ_max and σ_max is also key to the improved performance. The left-most panels show that with fixed σ_max values (or with an ensemble in U space), the preferred conditional variance cap λ_max is fairly small (but still larger than 0, which would correspond to a point estimate of W given U). For σ_max, which controls the variance in U space, the top middle panel shows that accuracy is fairly robust to σ_max as long as λ_max is not too large; for ECE, however, a careful selection of σ_max is required (bottom middle panel).

Figure 6: Accuracy (↑) and ECE (↓) on corrupted CIFAR. We show the mean and two standard errors for each metric over the 19 perturbations provided in Hendrycks & Dietterich (2019).
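The ECE metric used throughout this section can be sketched in a few lines of numpy (a standard equal-width binning implementation in the spirit of Guo et al. (2017); not the paper's evaluation code):

```python
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected calibration error: bin predictions by confidence and average
    |accuracy - confidence| per bin, weighted by the bin's share of samples."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

# A perfectly confident, perfectly correct classifier has zero ECE.
probs = np.eye(3)[np.array([0, 1, 2, 1])]
assert ece(probs, np.array([0, 1, 2, 1])) == 0.0
```

A model that predicts with 75% confidence but is always wrong would score an ECE of 0.75, which is the kind of over-confidence the variance caps above are tuned to avoid.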

4.3. MODEL ROBUSTNESS AND OUT-OF-DISTRIBUTION DETECTION

To investigate the robustness of the inducing weight methods to dataset shift, we compute predictions on the corrupted CIFAR datasets (Hendrycks & Dietterich, 2019) after training on clean data. Figure 6 shows accuracy and ECE results. Ensemble-W is the most accurate model across skew intensities, while FFG-W, though performing well on clean data, returns the worst accuracy under perturbation. The inducing weight methods perform competitively with Ensemble-W, although FFG-U surprisingly maintains slightly higher accuracy on CIFAR-100 than Ensemble-U despite being less accurate on clean data. In terms of ECE, the inducing weight methods again perform competitively with Ensemble-W, with Ensemble-U sometimes being the best of the three. Interestingly, while the accuracy of FFG-W decays quickly as the data is perturbed more strongly, its ECE remains roughly constant. We further present in Table 3 the utility of the maximum predicted probability for out-of-distribution (OOD) detection, when models are presented with both in-distribution data (the CIFAR10 and CIFAR100 test sets) and an OOD dataset (CIFAR100/SVHN for CIFAR10 models, CIFAR10/SVHN for CIFAR100 models). The metrics are the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR). Again Ensemble-W performs best in most settings, but, more importantly, the inducing weight methods achieve very close results despite using the smallest number of parameters.
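For this detection setup, the AUROC reduces to a Mann-Whitney statistic on the confidence scores; a small self-contained sketch (our own illustration; real evaluations typically use a library routine such as scikit-learn's):

```python
import numpy as np

def auroc(scores_in, scores_out):
    """AUROC for OOD detection, treating in-distribution as the positive class
    and the max predicted probability as the score (Mann-Whitney U form):
    the probability that a random in-distribution input outscores an OOD one."""
    s_in = np.asarray(scores_in, dtype=float)[:, None]
    s_out = np.asarray(scores_out, dtype=float)[None, :]
    wins = (s_in > s_out).sum() + 0.5 * (s_in == s_out).sum()
    return wins / (s_in.size * s_out.size)

# In-distribution inputs should receive higher confidence than OOD ones.
assert auroc([0.99, 0.95, 0.90], [0.40, 0.35]) == 1.0
assert auroc([0.5], [0.5]) == 0.5   # indistinguishable scores give chance level
```

The pairwise form is quadratic in the number of inputs but makes the metric's meaning transparent; AUPR is computed analogously from the precision-recall curve.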

5. RELATED WORK

Parameter-efficient uncertainty quantification methods   Recent research has proposed Gaussian posterior approximations for BNNs with efficient covariance structure (Ritter et al., 2018; Zhang et al., 2018b; Mishkin et al., 2018). The inducing weight approach differs from these in introducing structure via a hierarchical posterior with low-dimensional auxiliary variables. Another line of work reduces the memory overhead via efficient parameter sharing (Louizos & Welling, 2017; Wen et al., 2020; Swiatkowski et al., 2020; Dusenberry et al., 2020). These all maintain a "mean parameter" for the weights, making their memory footprint at least that of storing a deterministic neural network. Instead, our approach shares parameters via the augmented prior with efficient low-rank structure, reducing memory use below that of a deterministic network. In a similar spirit, Izmailov et al. (2019) perform inference in a d-dimensional subspace obtained from PCA on weights collected from an SGD trajectory. However, this approach does not leverage the layer structure of neural networks and requires d times the memory of a single network.

Sparse GPs and function-space inference   As BNNs and GPs are closely related (Neal, 1995; Matthews et al., 2018; Lee et al., 2018), recent efforts have introduced GP-inspired techniques to BNNs (Ma et al., 2019; Sun et al., 2019; Khan et al., 2019; Ober & Aitchison, 2020). Compared to weight-space inference, function-space inference is appealing since its uncertainty is more directly relevant for many predictive uncertainty estimation tasks. While the inducing weight approach performs its computations in weight space, Section 3.3 establishes the connection to function-space posteriors. Our approach is related to sparse deep GP methods, with U_c having a similar interpretation to the inducing outputs in e.g. Salimbeni & Deisenroth (2017).
The major difference is that U lies in a low-dimensional space projected from the pre-activation output space of a network layer.

Priors on neural network weights   Hierarchical priors for weights have also been explored (Louizos et al., 2017; Krueger et al., 2017; Atanov et al., 2019; Ghosh et al., 2019; Karaletsos & Bui, 2020). However, we emphasise that p(W, U) is a pseudo prior, constructed to assist posterior inference rather than to improve model design. Indeed, the parameters associated with the inducing weights are optimised to improve the posterior approximation. Our approach can be adapted to other priors, e.g. for a Horseshoe prior p(θ, ν) = p(θ|ν)p(ν) = N(θ; 0, ν²)C⁺(ν; 0, 1), the pseudo prior can be defined as p(θ, ν, a) = p(θ|ν, a)p(a)p(ν) such that ∫ p(θ|ν, a)p(a) da = p(θ|ν). More broadly, pseudo priors have found success in Bayesian computation (Carlin & Chib, 1995).

6. CONCLUSION

We have proposed a parameter-efficient uncertainty quantification framework for neural networks. It augments each network layer's weights with a small matrix of inducing weights and, by extending Matheron's rule to matrix-normal related distributions, maintains a relatively small run-time overhead compared with ensemble methods. Critically, experiments on prediction and uncertainty estimation tasks demonstrate that the inducing weight methods are competitive with the state-of-the-art, while reducing the parameter count to less than half of that of a deterministic ResNet-18. Several directions remain to be explored. First, modelling correlations across layers might further improve inference quality; we outline an initial approach leveraging inducing variables in Appendix E. Second, based on the function-space interpretation of inducing weights, better initialisation techniques can be derived from the sparse GP and dimensionality reduction literature. Lastly, the remaining run-time overhead of our approach can be reduced by a better design of the inducing weight structure, as well as vectorisation techniques amenable to parallelised computation.



Figure 1: Visualisation of (a) the inducing weight augmentation, comparing (b) the original Matheron's rule with (c) our extended version. The white blocks represent random noise from the joint.

Figure 3: Toy regression results, with observations in red dots and the ground truth function in black.

Figure 4: Resnet-18 run-times and model sizes.

Figure 5: Averaged CIFAR-10 accuracy (↑) and ECE (↓) results for the inducing weight methods with different hyper-parameters. Models reported in the first two columns use M = 128 for the U dimensions. For λ_max = 0 (and σ_max = 0) we use point estimates for the corresponding variables.

Table 2: CIFAR in-distribution metrics (in %).

Table 3: OOD detection metrics for ResNet-18 trained on CIFAR10/100.

