LEARNABLE UNCERTAINTY UNDER LAPLACE APPROXIMATIONS

Abstract

Laplace approximations are classic, computationally lightweight means for constructing Bayesian neural networks (BNNs). As in other approximate BNNs, one cannot necessarily expect the induced predictive uncertainty to be calibrated. Here we develop a formalism to explicitly "train" the uncertainty in a way that is decoupled from the prediction itself. To this end we introduce uncertainty units for Laplace-approximated networks: hidden units with zero weights that can be added to any pre-trained, point-estimated network. Since these units are inactive, they do not affect the predictions. But their presence changes the geometry (in particular the Hessian) of the loss landscape around the point estimate, thereby affecting the network's uncertainty estimates under a Laplace approximation. We show that such units can be trained via an uncertainty-aware objective, making the Laplace approximation competitive with more expensive alternative uncertainty-quantification frameworks.

1. INTRODUCTION

The point estimates of neural networks (NNs), constructed as maximum a posteriori (MAP) estimates via (regularized) empirical risk minimization, empirically achieve high predictive performance. However, they tend to underestimate the uncertainty of their predictions, leading to an overconfidence problem (Hein et al., 2019), which could be disastrous in safety-critical applications such as autonomous driving. Bayesian inference offers a principled path to overcome this issue. The goal is to turn a "vanilla" NN into a Bayesian neural network (BNN), where the posterior distribution over the network's weights is inferred via Bayes' rule and subsequently taken into account when making predictions. Since the cost of exact posterior inference in a BNN is often prohibitive, approximate Bayesian methods are employed instead. Laplace approximations (LAs) are classic methods for this purpose (MacKay, 1992b). The key idea is to obtain an approximate posterior by "surrounding" a MAP estimate of a network with a Gaussian, based on the loss landscape's geometry around it. A standard practice in LAs is to tune a single hyperparameter, the prior precision, which is inflexible (Ritter et al., 2018b; Kristiadi et al., 2020).

Here, we aim at improving the flexibility of uncertainty tuning in LAs. To this end, we introduce Learnable Uncertainty under Laplace Approximations (LULA) units: hidden units associated with zeroed weights that can be added to the hidden layers of any MAP-trained network. Because they are inactive, such units do not affect the prediction of the underlying network. However, they can still contribute to the Hessian of the loss with respect to the parameters, and hence induce additional structure in the posterior covariance under a LA. LULA units can be trained via an uncertainty-aware objective (Hendrycks et al., 2019; Hein et al., 2019, etc.)
, such that they improve the predictive uncertainty-quantification (UQ) performance of the Laplace-approximated BNN. Figure 1 demonstrates trained LULA units in action: they improve the UQ performance of a standard LA while keeping the MAP predictions in both regression and classification tasks. In summary, we (i) introduce LULA units: inactive hidden units for the uncertainty tuning of a LA, (ii) bring a robust training technique from the non-Bayesian literature to train these units, and (iii) show empirically that LULA-augmented Laplace-approximated BNNs can yield better UQ performance than both previous tuning techniques and contemporary, more expensive baselines.

Figure 1: LULA-augmented Laplace-approximated (LA+LULA) neural network on regression (top) and classification (bottom) tasks. Black curves represent the predictive mean and the decision boundary; shades represent ±3 standard deviations and confidence in regression and classification, respectively. MAP is overconfident and LA can mitigate this. However, LA can still be overconfident away from the data. LULA improves LA's uncertainty further without affecting its predictions.

2. BACKGROUND

2.1 BAYESIAN NEURAL NETWORKS

Let f : R^n × R^d → R^k defined by (x, θ) ↦ f(x; θ) be an L-layer neural network. Here, θ is the concatenation of all the parameters of f. Suppose that the size of each layer of f is given by the sequence (n_l ∈ Z_{>0})_{l=1}^L. Then, for each l = 1, ..., L, the l-th layer of f is defined by

a^{(l)} := W^{(l)} h^{(l-1)} + b^{(l)} ,  with  h^{(l)} := φ(a^{(l)}) if l < L,  and  h^{(l)} := a^{(l)} if l = L ,

where W^{(l)} ∈ R^{n_l × n_{l-1}} and b^{(l)} ∈ R^{n_l} are the weight matrix and bias vector of the layer, and φ is a component-wise activation function. We call the vector h^{(l)} ∈ R^{n_l} the l-th hidden units of f. Note that by convention, we consider n_0 := n and n_L := k, while h^{(0)} := x and h^{(L)} := f(x; θ).

From the Bayesian perspective, the ubiquitous training formalism of neural networks amounts to MAP estimation: the empirical risk and the regularizer are interpretable as the negative log-likelihood under an i.i.d. dataset D := {x_i, y_i}_{i=1}^m and the negative log-prior, respectively. That is, the loss function is interpreted as

L(θ) := -∑_{i=1}^m log p(y_i | f(x_i; θ)) - log p(θ) = -log p(θ | D) .   (2)

In this view, the de facto weight decay regularizer amounts to a zero-mean isotropic Gaussian prior p(θ) = N(θ | 0, λ^{-1} I) with a scalar precision hyperparameter λ. Meanwhile, the usual softmax and quadratic output losses correspond to the Categorical and Gaussian distributions over y_i in the case of classification and regression, respectively.

MAP-trained neural networks have been shown to be overconfident (Hein et al., 2019) and BNNs can mitigate this issue (Kristiadi et al., 2020). They quantify epistemic uncertainty by inferring the full posterior distribution of the parameters θ (instead of just a single point estimate as in MAP training). Given the posterior p(θ | D), the prediction for any test point x ∈ R^n is obtained via marginalization

p(y | x, D) = ∫ p(y | f(x; θ)) p(θ | D) dθ ,   (3)

which captures the uncertainty encoded in the posterior.
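As an illustration of the layer recursion and the MAP objective (2), here is a minimal NumPy sketch. The function names and the toy Gaussian likelihood/prior setup are ours, not the paper's code; the quadratic loss plays the role of the negative log-likelihood and weight decay that of the negative log-prior:

```python
import numpy as np

def forward(x, weights, biases, phi=lambda a: np.maximum(a, 0.0)):
    """Compute f(x; theta) via a^(l) = W^(l) h^(l-1) + b^(l), ReLU activations,
    and no activation on the last layer (h^(L) = a^(L))."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        a = W @ h + b
        h = phi(a) if l < len(weights) - 1 else a  # last layer: identity
    return h

def map_loss(weights, biases, X, Y, lam=1e-2):
    """Regularized empirical risk: Gaussian neg. log-likelihood plus the
    neg. log of a zero-mean isotropic Gaussian prior N(0, lam^{-1} I)."""
    nll = sum(0.5 * np.sum((forward(x, weights, biases) - y) ** 2)
              for x, y in zip(X, Y))
    neg_log_prior = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return nll + neg_log_prior
```

A gradient-based optimizer applied to `map_loss` recovers the usual weight-decay training.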

2.2. LAPLACE APPROXIMATIONS

In deep learning, since the exact Bayesian posterior is intractable, approximate Bayesian inference methods are used. An important family of such methods is formed by LAs. Let θ_MAP be the minimizer of (2), which corresponds to a mode of the posterior distribution. A LA locally approximates the posterior using a Gaussian

p(θ | D) ≈ N(θ | θ_MAP, Σ) := N(θ | θ_MAP, (∇²L|_{θ_MAP})^{-1}) .

Thus, LAs construct an approximate Gaussian posterior around θ_MAP whose precision equals the Hessian of the loss at θ_MAP, i.e. the "curvature" of the loss landscape at θ_MAP. While the covariance of a LA is tied to the weight decay of the loss, a common practice in LAs is to tune the prior precision under some objective in a post-hoc manner. In other words, the MAP estimation and the covariance inference are thought of as separate, independent processes. For example, given a fixed MAP estimate, one can maximize the log-likelihood of a LA w.r.t. the prior precision to obtain the covariance. This hyperparameter tuning can thus be thought of as uncertainty tuning.

A recent example of LAs is the Kronecker-factored Laplace (KFL, Ritter et al., 2018b). The key idea is to approximate the Hessian matrix with the layer-wise Kronecker factorization scheme proposed by Heskes (2000); Martens & Grosse (2015). That is, for each layer l = 1, ..., L, KFL assumes that the Hessian corresponding to the l-th weight matrix W^{(l)} ∈ R^{n_l × n_{l-1}} can be written as the Kronecker product G^{(l)} ⊗ A^{(l)} for some G^{(l)} ∈ R^{n_l × n_l} and A^{(l)} ∈ R^{n_{l-1} × n_{l-1}}. This assumption brings the inversion cost of the Hessian down to Θ(n_l³ + n_{l-1}³), instead of the usual Θ(n_l³ n_{l-1}³). The approximate Hessian can easily be computed via tools such as BackPACK (Dangel et al., 2020).

Even with a closed-form Laplace-approximated posterior, the predictive distribution (3) in general does not have an analytic solution since f is nonlinear.
Instead, one can employ Monte-Carlo (MC) integration by sampling from the Gaussian:

p(y | x, D) ≈ (1/S) ∑_{s=1}^S p(y | f(x; θ_s)) ;   θ_s ∼ N(θ | θ_MAP, Σ) ,

for S samples. In the case of binary classification with f : R^n × R^d → R, one can use the following well-known approximation, due to MacKay (1992a):

p(y = 1 | x, D) ≈ σ( f(x; θ_MAP) / √(1 + π v(x)/8) ) ,   (4)

where σ is the logistic-sigmoid function and v(x) is the marginal variance of the network output f(x), which is often approximated via a linearization of the network around the MAP estimate:

v(x) ≈ (∇_θ f(x; θ)|_{θ_MAP})^T Σ (∇_θ f(x; θ)|_{θ_MAP}) .   (5)

(This approximation has also been generalized to multi-class classification by Gibbs (1997).) In particular, as v(x) increases, the predictive probability of y = 1 goes to 0.5 and therefore the uncertainty increases. This relationship has also been shown empirically in multi-class classification with MC integration (Kristiadi et al., 2020).
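MacKay's approximation (4) is easy to state in code. The sketch below (function names are ours) shows how the prediction collapses to 0.5 as the marginal variance v(x) grows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def probit_pred(mean, var):
    """MacKay's approximation: p(y=1 | x, D) ~ sigmoid(mean / sqrt(1 + pi*var/8)),
    where `mean` is f(x; theta_MAP) and `var` is the marginal output variance v(x)."""
    return sigmoid(mean / np.sqrt(1.0 + np.pi * var / 8.0))
```

With zero variance this recovers the plain MAP prediction; with large variance the confidence is damped toward 0.5, which is exactly the mechanism LULA training exploits.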

3. LULA UNITS

The problem with the standard uncertainty tuning in LAs is that the only degree of freedom available for the optimization is the scalar prior precision, which makes it inflexible.¹ We shall address this by introducing "uncertainty units", which can be added on top of the hidden units of any MAP-trained network (Section 3.1) and can be trained via an uncertainty-aware loss (Section 3.2).

3.1. CONSTRUCTION

Let f : R^n × R^d → R^k be a MAP-trained L-layer neural network with parameters θ_MAP = {W^{(l)}_MAP, b^{(l)}_MAP}_{l=1}^L. The premise of our method is simple: at each hidden layer l = 1, ..., L-1, suppose we add m_l ∈ Z_{≥0} additional hidden units, under the original activation function, to h^{(l)}. As a consequence, we need to augment each of the weight matrices to accommodate them. Consider the following construction: for each layer l = 1, ..., L-1 of the network, we expand W^{(l)} and b^{(l)} to obtain the block matrix and vector

W̃^{(l)} := [ W^{(l)}_MAP  0 ; Ŵ^{(l)}_1  Ŵ^{(l)}_2 ] ∈ R^{(n_l+m_l) × (n_{l-1}+m_{l-1})} ;   b̃^{(l)} := ( b^{(l)}_MAP ; b̂^{(l)} ) ∈ R^{n_l+m_l} ,   (6)

respectively, with m_0 = 0 since we do not add additional units to the input. For l = L, we define

W̃^{(L)} := (W^{(L)}_MAP, 0) ∈ R^{k × (n_{L-1}+m_{L-1})} ;   b̃^{(L)} := b^{(L)}_MAP ∈ R^k ,

so that the output dimensionality is unchanged. For brevity, we denote Ŵ^{(l)} := (Ŵ^{(l)}_1, Ŵ^{(l)}_2). Refer to Figure 2 for an illustration and Algorithm 2 in Appendix B for a step-by-step summary. Taken together, we denote the resulting augmented network as f̃ and the resulting parameter vector as θ̃_MAP ∈ R^{d̃}, where d̃ is the resulting number of parameters. Note that we can easily extend this construction to convolutional nets by expanding the "channels" of a hidden layer.²

Figure 2: Illustration of the construction. The additional units are represented by the additional block at the bottom of each layer. Dashed lines correspond to the free parameters Ŵ^{(1)}, ..., Ŵ^{(L-1)}, while dotted lines correspond to the zero weights.

Let us inspect the implications of this construction. For each l = 1, ..., L-1, since they are zero, the entries in the upper-right quadrant of W̃^{(l)} deactivate the m_{l-1} additional hidden units in the previous layer, so those units do not contribute to the original hidden units in the l-th layer. Meanwhile, the sub-matrix Ŵ^{(l)} and the sub-vector b̂^{(l)} contain the parameters for the additional m_l hidden units in the l-th layer.
We are free to choose the values of these parameters, since the following proposition guarantees that they will not change the output of the network (the proof is in Appendix A).

Proposition 1. Let f : R^n × R^d → R^k be a MAP-trained L-layer network parametrized by θ_MAP. Suppose f̃ : R^n × R^{d̃} → R^k and θ̃_MAP ∈ R^{d̃} are obtained via the previous construction. Then for any input x ∈ R^n, we have f̃(x; θ̃_MAP) = f(x; θ_MAP).

So far, it looks like all our changes to the network are inconsequential. However, they do affect the curvature of the landscape of L,³ and thus the uncertainty arising in a LA. Let θ̃ be a random variable in R^{d̃} and θ̃_MAP an instance of it. Suppose we have a Laplace-approximated posterior p(θ̃ | D) ≈ N(θ̃ | θ̃_MAP, Σ̃) over θ̃, where the covariance Σ̃ is the inverse Hessian of the negative log-posterior w.r.t. the augmented parameters at θ̃_MAP. Then, Σ̃ contains additional dimensions (and thus, in general, additional structured, non-zero uncertainty) absent in the original network, which depend on the values of the free parameters {Ŵ^{(l)}, b̂^{(l)}}_{l=1}^{L-1}.

The implication of the previous finding can be seen clearly in real-valued networks with diagonal LA posteriors. The following proposition shows that, under such a network and posterior, the construction above will affect the output uncertainty of the original network f (the proof is in Appendix A).

Proposition 2. Suppose f : R^n × R^d → R is a real-valued network and f̃ is as constructed above. Suppose further that diagonal Laplace-approximated posteriors N(θ | θ_MAP, diag(σ)) and N(θ̃ | θ̃_MAP, diag(σ̃)) are employed. Using the linearization (5), for any input x ∈ R^n, the variance over the output f̃(x; θ̃) is at least that over f(x; θ).

In summary, the construction along with Propositions 1 and 2 imply that the additional hidden units we have added to the original network are uncertainty units under LAs, i.e.
hidden units that only contribute to the Laplace-approximated uncertainty and not the predictions. This property gives rise to the name Learnable Uncertainty under Laplace Approximations (LULA) units.
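The construction and the claim of Proposition 1 can be checked numerically. The following NumPy sketch (a toy two-layer ReLU network; all names are ours) augments the hidden layer with m_1 = 2 LULA units and verifies that the output is unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda a: np.maximum(a, 0.0)

# A "MAP-trained" two-layer network (toy weights standing in for W_MAP, b_MAP).
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)
f = lambda x: W2 @ relu(W1 @ x + b1) + b2

# Add m_1 = 2 LULA units: free rows in layer 1 (the W-hat, b-hat parameters),
# and a zero block in layer 2 so the new units never reach the output.
m1 = 2
W1_hat, b1_hat = rng.standard_normal((m1, 2)), rng.standard_normal(m1)
W1_aug = np.vstack([W1, W1_hat])
b1_aug = np.concatenate([b1, b1_hat])
W2_aug = np.hstack([W2, np.zeros((1, m1))])  # zero block deactivates the LULA units
f_aug = lambda x: W2_aug @ relu(W1_aug @ x + b1_aug) + b2

x = rng.standard_normal(2)
assert np.allclose(f(x), f_aug(x))  # Proposition 1: predictions are unchanged
```

The free parameters `W1_hat`, `b1_hat` can be set arbitrarily without changing the assertion, which is exactly the freedom that LULA training exploits.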

3.2. TRAINING

We have seen that by adding LULA units to a network, we obtain additional free parameters that only affect the uncertainty under a LA. These parameters are thus useful for uncertainty calibration. Our goal is therefore to train them to induce low uncertainty over the data (inliers) and high uncertainty on outliers, the so-called out-of-distribution (OOD) data. Specifically, this can be done by minimizing the output variance over inliers while maximizing it over outliers. Note that using the variance makes sense in both the regression and classification cases: in the former, this objective directly maintains narrow error bars near the data while widening those far away from them, cf. Figure 1 (c, top). Meanwhile, in classification, variances over function outputs directly impact predictive confidences, as we have noted in the discussion of (4): higher variance implies lower confidence.

Thus, following the contemporary technique from the non-Bayesian robust learning literature (Hendrycks et al., 2019; Hein et al., 2019, etc.), we construct the following loss. Let f : R^n × R^d → R^k be an L-layer neural network with MAP-trained parameters θ_MAP, and let f̃ : R^n × R^{d̃} → R^k along with θ̃_MAP be obtained by adding LULA units. Denoting the dataset sampled i.i.d. from the data distribution as D_in and that from some outlier distribution as D_out, we define

L_LULA(θ̃_MAP) := (1/|D_in|) ∑_{x_in ∈ D_in} ν(f̃(x_in); θ̃_MAP) - (1/|D_out|) ∑_{x_out ∈ D_out} ν(f̃(x_out); θ̃_MAP) ,   (7)

where ν(f̃(x); θ̃_MAP) is the total variance over the k components of the network output f̃_1(x; θ̃), ..., f̃_k(x; θ̃) under the Laplace-approximated posterior p(θ̃ | D) ≈ N(θ̃ | θ̃_MAP, Σ̃(θ̃_MAP)), which can be approximated via an S-sample MC integral

ν(f̃(x); θ̃_MAP) := ∑_{i=1}^k var_{p(θ̃|D)} f̃_i(x; θ̃) ≈ ∑_{i=1}^k [ (1/S) ∑_{s=1}^S f̃_i(x; θ̃_s)² - ( (1/S) ∑_{s=1}^S f̃_i(x; θ̃_s) )² ] ,

with θ̃_s ∼ p(θ̃ | D). Here, for clarity, we have shown explicitly the dependency of Σ̃ on θ̃_MAP.
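The total-variance estimate ν and the loss (7) can be sketched as follows (illustrative NumPy; in practice the output samples come from the Laplace posterior over the augmented parameters):

```python
import numpy as np

def total_variance(output_samples):
    """MC estimate of nu(f(x)): total variance over the k outputs.
    `output_samples` has shape (S, k), one row per posterior sample theta_s.
    Uses the biased estimator E[f_i^2] - E[f_i]^2, as in the text."""
    return np.sum(np.mean(output_samples ** 2, axis=0)
                  - np.mean(output_samples, axis=0) ** 2)

def lula_loss(inlier_vars, outlier_vars):
    """L_LULA: mean variance over inliers minus mean variance over outliers.
    Minimizing it makes the model certain on data and uncertain elsewhere."""
    return np.mean(inlier_vars) - np.mean(outlier_vars)
```

For LULA training, `output_samples` would be obtained by sampling θ̃_s from the Laplace posterior and evaluating the augmented network f̃ on each inlier and outlier input.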
Note that we can simply set D_in to be the training set. Furthermore, throughout this paper, we use the simple OOD dataset proposed by Hein et al. (2019), which is constructed via permutation, blurring, and contrast rescaling of the in-distribution dataset D_in. As we shall show in Section 5, this artificial, uninformative OOD dataset is sufficient for obtaining good results across benchmark problems. A more complex dataset as D_out might improve LULA's performance further but is not strictly necessary.

Since our aim is to solely improve the uncertainty, we must maintain the structure of all weights and biases in θ̃_MAP, in accordance with (6). This can simply be enforced via gradient masking: for all l = 1, ..., L-1, set the gradients of the blocks of W̃^{(l)} and b̃^{(l)} not corresponding to Ŵ^{(l)} and b̂^{(l)}, respectively, to zero. Furthermore, since the covariance matrix Σ̃(θ̃) of the Laplace-approximated posterior is a function of θ̃, it needs to be updated at every iteration during the optimization of L_LULA.

Algorithm 1 Training LULA units.

Input:

MAP-trained network f. Dataset D, OOD dataset D_out. Learning rate α. Number of epochs E.
1: Construct f̃ from f by following Section 3.1
2: for i = 1, ..., E do
3:   p(θ̃ | D) ≈ N(θ̃ | θ̃_MAP, Σ̃(θ̃_MAP))   ▷ Obtain a Laplace-approximated posterior of f̃
4:   Compute L_LULA(θ̃_MAP) via (7) using p(θ̃ | D), D, and D_out
5:   g = ∇L_LULA(θ̃_MAP)
6:   g = mask_gradient(g)   ▷ Zero out the derivatives not corresponding to Ŵ^{(l)}, b̂^{(l)}
7:   θ̃_MAP = θ̃_MAP - α g
8: end for
9: p(θ̃ | D) ≈ N(θ̃ | θ̃_MAP, Σ̃(θ̃_MAP))   ▷ Obtain the final Laplace approximation
10: return f̃ and p(θ̃ | D)

The cost of this procedure scales with the network's depth and can thus be expensive. Inspired by recent findings that last-layer Bayesian methods are competitive with all-layer alternatives (Ober & Rasmussen, 2019), we thus consider a last-layer LA (Kristiadi et al., 2020) as a proxy: we apply a LA only at the last layer and treat the first L-1 layers as a fixed feature extractor. We use a diagonal last-layer Fisher matrix to approximate the last-layer Hessian. Note that backpropagation through this matrix does not pose a difficulty since modern deep learning libraries such as PyTorch and TensorFlow support "double backprop" (backpropagation through a gradient) efficiently. Finally, the loss L_LULA can be minimized using standard gradient-based optimizers. Refer to Algorithm 1 for a summary.

Last but not least, the intuition of LULA training is as follows. By adding LULA units, we obtain an augmented version of the network's loss landscape. The goal of LULA training is then to exploit the weight-space symmetry (i.e. different parameters that induce the same output) arising from the construction, as shown by Proposition 1, and to pick a parameter that is symmetric to the original one but has "better" curvature. Here, we define "good curvature" in terms of the above objective.
When used in a Laplace approximation, these curvatures could then yield better uncertainty estimates than standard, non-LULA-augmented Laplace approximations.
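The gradient-masking step (line 6 of Algorithm 1) amounts to zeroing the rows of each weight gradient that correspond to the frozen MAP block. A minimal sketch for one weight matrix, assuming the block layout of Section 3.1 (top n_l rows hold W_MAP and the zero block; bottom m_l rows hold the free LULA parameters):

```python
import numpy as np

def mask_gradient(grad_W, n_l):
    """Freeze the MAP block of an augmented weight matrix: the top n_l rows
    (holding W_MAP and the zero block) get zero gradient, while the bottom
    m_l rows (the free LULA parameters W_hat_1, W_hat_2) keep theirs."""
    masked = grad_W.copy()
    masked[:n_l, :] = 0.0
    return masked
```

A gradient step using the masked gradient then updates only the LULA parameters, so the MAP weights and the zero block, and hence the predictions, stay intact. The bias gradient is masked analogously on its first n_l entries.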

4. RELATED WORK

While hyperparameter optimization in a LA traditionally requires re-training of the network, under the evidence framework (MacKay, 1992b) or empirical Bayes (Bernardo & Smith, 2009), tuning it in a post-hoc manner has increasingly become common practice. Ritter et al. (2018a;b) tune the prior precision of a LA by maximizing the predictive log-likelihood. Kristiadi et al. (2020) extend this procedure by also using OOD data to better calibrate the uncertainty. However, these approaches are limited in terms of flexibility since the prior precision of a LA constitutes just a single parameter. LULA can be seen as an extension of these approaches with greater flexibility, and it is complementary to them since LULA is independent of the prior precision.

Confidence calibration via OOD data has achieved state-of-the-art performance in non-Bayesian outlier detection. Hendrycks et al. (2019); Hein et al. (2019); Meinke & Hein (2020) use OOD data to regularize the standard maximum-likelihood training. Malinin & Gales (2018; 2019) use OOD data to train probabilistic models based on the Dirichlet distribution. All these methods are non-Bayesian and non-post-hoc. Our work is thus orthogonal since we aim at improving a class of Bayesian models in a post-hoc manner. LULA can be seen as bringing the best of both worlds: Bayesian uncertainty that is tuned via state-of-the-art non-Bayesian techniques.

5. EXPERIMENTS

In this section, we focus on classification using the standard OOD benchmark problems. Supplementary results on regression are in Section C.2. 

5.1. IMAGE CLASSIFICATIONS

Here, we aim to show that LULA units and the proposed training procedure are (i) a significantly better method for tuning the uncertainty of a LA than previous methods and (ii) able to make a vanilla LA better than non-post-hoc (thus more expensive) UQ methods. For (i), we compare against previous LA tuning techniques; for (ii), against contemporary UQ baselines (see below).

We use 5- and 8-layer CNNs for MNIST and for CIFAR-10, SVHN, CIFAR-100, respectively. These networks achieve around 99%, 90%, and 50% accuracy on MNIST, CIFAR-10, and CIFAR-100, respectively. For MC integration during prediction, we use 10 posterior samples. We quantify the results using standard metrics: the mean maximum confidence (MMC) and the area under the ROC curve (AUR) (Hendrycks & Gimpel, 2017). All reported results are averages over ten prediction runs. Finally, we use standard test OOD datasets along with the "asymptotic" dataset introduced by Kristiadi et al. (2020), where random uniform noise images are scaled by a large number (5000).

For simplicity, we add LULA units and apply the LA only at the last layer of the network. For each in-distribution dataset, the training of the free parameters of the LULA units is performed for a single epoch using Adam over the respective validation dataset. The number of these units is obtained via a grid search over the set {64, 128, 256, 512}, balancing both the in- and out-distribution confidences (see Appendix B for details).

OOD Detection

The results for MMC and AUR are shown in Table 1 and Table 2, respectively. First, we would like to direct the reader's attention to the last four columns of the tables. We can see that, in general, KFL+LULA performs the best among all LA tuning methods. These results validate the effectiveness of the additional flexibility given by LULA units and the proposed training procedure. Indeed, without this additional flexibility, OOD training on just the prior precision becomes less effective, as shown by the results of KFL+OOD. Finally, as we can see in the results on the Asymptotic OOD dataset, LULA makes KFL significantly better at mitigating overconfidence far away from the training data. Now, compared to the contemporary baselines (MCD, DE), we can see that the vanilla KFL yields somewhat worse results. Augmenting the base KFL with LULA makes it competitive with MCD and DE in general. Keep in mind that both KFL and LULA are post-hoc methods.

Dataset Shift We use the corrupted CIFAR-10 dataset (Hendrycks & Dietterich, 2019) for measuring the robustness of a LULA-augmented LA to dataset shifts, following Ovadia et al. (2019). Note that dataset shift is a slightly different concept from OOD data: it concerns small perturbations of the true data, while OOD data do not come from the true data distribution at all. Intuitively, data under a dataset shift lie near the true data, while OOD data are farther away. We present the results in Table 3. Focusing first on the last four columns of the table, we see that LULA yields the best results compared to the other tuning methods for KFL. Furthermore, we see that KFL+LULA outperforms DE, which has been shown by Ovadia et al. (2019) to give state-of-the-art results in terms of robustness to dataset shifts.
Finally, while MCD achieves the best results in this experiment, considering its performance in the previous OOD experiment, we conclude that KFL+LULA provides more consistent performance across different tasks.

Comparison with DPN Finally, we compare KFL+LULA with the (non-Bayesian) Dirichlet prior network (DPN, Malinin & Gales, 2018) in the Rotated-MNIST benchmark (Ovadia et al., 2019) (Figure 3). We found that LULA makes the performance of KFL competitive with DPN. We stress that KFL and LULA are post-hoc methods, while DPN requires training from scratch.

5.2. COST ANALYSIS

The cost of constructing a LULA network is negligible even for our deepest network: on both the 5- and 8-layer CNNs, the wall-clock time required (with a standard consumer GPU) to add the LULA units is on average 0.01 seconds (over ten trials). For training, using the last-layer LA as a proxy of the true LA posterior, it took on average 7.25 seconds for MNIST and 35 seconds for CIFAR-10, SVHN, and CIFAR-100. This tuning cost is cheap relative to the training time of the base network, which ranges from several minutes to more than an hour. We refer the reader to Table 9 (Appendix C) for details. All in all, LULA is not only effective, but also cost-efficient.

6. CONCLUSION

We have proposed LULA units: hidden units that can be added to any pre-trained MAP network for the purpose of exclusively tuning the uncertainty of a Laplace approximation without affecting its predictive performance. They can be trained via an objective that depends on both inlier and outlier datasets to minimize (resp. maximize) the network's output variance, bringing a state-of-the-art technique from the non-Bayesian robust learning literature to the Bayesian world. Even with a very simple outlier dataset for training, we show in extensive experiments that LULA units provide more effective post-hoc uncertainty tuning for Laplace approximations and make their performance competitive with more expensive baselines that require re-training of the whole network.

APPENDIX A PROOFS

Proposition 1. Let f : R^n × R^d → R^k be a MAP-trained L-layer network parametrized by θ_MAP. Suppose f̃ : R^n × R^{d̃} → R^k and θ̃_MAP ∈ R^{d̃} are obtained via the previous construction. Then for any input x ∈ R^n, we have f̃(x; θ̃_MAP) = f(x; θ_MAP).

Proof. Let x ∈ R^n be arbitrary. For each layer l = 1, ..., L we denote the hidden units and pre-activations of f̃ as h̃^{(l)} and ã^{(l)}, respectively. We need to show that the output of f̃, i.e. the last pre-activations ã^{(L)}, equals the last pre-activations a^{(L)} of f. For the first layer, we have that

ã^{(1)} = W̃^{(1)} x + b̃^{(1)} = ( W^{(1)} ; Ŵ^{(1)} ) x + ( b^{(1)} ; b̂^{(1)} ) = ( W^{(1)} x + b^{(1)} ; Ŵ^{(1)} x + b̂^{(1)} ) =: ( a^{(1)} ; â^{(1)} ) .

For every layer l = 1, ..., L-1, we denote the hidden units as the block vector

h̃^{(l)} = ( φ(a^{(l)}) ; φ(â^{(l)}) ) = ( h^{(l)} ; ĥ^{(l)} ) .

Now, for the intermediate layers l = 2, ..., L-1, we observe that

ã^{(l)} = W̃^{(l)} h̃^{(l-1)} + b̃^{(l)} = [ W^{(l)}  0 ; Ŵ^{(l)}_1  Ŵ^{(l)}_2 ] ( h^{(l-1)} ; ĥ^{(l-1)} ) + ( b^{(l)} ; b̂^{(l)} ) = ( W^{(l)} h^{(l-1)} + 0 + b^{(l)} ; Ŵ^{(l)}_1 h^{(l-1)} + Ŵ^{(l)}_2 ĥ^{(l-1)} + b̂^{(l)} ) =: ( a^{(l)} ; â^{(l)} ) .

Finally, for the last layer, we get

ã^{(L)} = W̃^{(L)} h̃^{(L-1)} + b̃^{(L)} = ( W^{(L)}  0 ) ( h^{(L-1)} ; ĥ^{(L-1)} ) + b^{(L)} = W^{(L)} h^{(L-1)} + 0 + b^{(L)} = a^{(L)} .

This ends the proof.

Proposition 2. Suppose f : R^n × R^d → R is a real-valued network and f̃ is as constructed above. Suppose further that diagonal Laplace-approximated posteriors N(θ | θ_MAP, diag(σ)) and N(θ̃ | θ̃_MAP, diag(σ̃)) are employed. Using the linearization (5), for any input x ∈ R^n, the variance over the output f̃(x; θ̃) is at least that over f(x; θ).

Proof. W.l.o.g. we arrange the parameters θ̃ := (θ, θ̂), where θ̂ ∈ R^{d̃-d} contains the weights corresponding to the additional LULA units. If g(x) is the gradient of the output f(x; θ) w.r.t. θ, then the gradient g̃(x) of f̃(x; θ̃) w.r.t. θ̃ can be written as the concatenation (g(x), ĝ(x)), where ĝ(x) is the corresponding gradient w.r.t. θ̂. Furthermore, diag(σ̃) has diagonal elements

(σ_11, ..., σ_dd, σ̂_11, ..., σ̂_{d̃-d, d̃-d}) =: (σ, σ̂) .
Hence we have

ṽ(x) = g̃(x)^T diag(σ̃) g̃(x) = g(x)^T diag(σ) g(x) + ĝ(x)^T diag(σ̂) ĝ(x) = v(x) + ĝ(x)^T diag(σ̂) ĝ(x) ≥ v(x) ,

since diag(σ̂) is positive-definite. This ends the proof.

Algorithm 2 Adding LULA units.

Input:

L-layer net with a MAP estimate θ_MAP = (W^{(l)}_MAP, b^{(l)}_MAP)_{l=1}^L. Sequence of non-negative integers (m_l)_{l=1}^L.
1: for l = 1, ..., L-1 do
2:   vec Ŵ^{(l)} ∼ p(vec Ŵ^{(l)})   ▷ Draw from a prior
3:   b̂^{(l)} ∼ p(b̂^{(l)})   ▷ Draw from a prior
4:   W̃^{(l)}_MAP = [ W^{(l)}_MAP  0 ; Ŵ^{(l)}_1  Ŵ^{(l)}_2 ]   ▷ The zero submatrix 0 is of size n_l × m_{l-1}
5:   b̃^{(l)}_MAP := ( b^{(l)}_MAP ; b̂^{(l)} )
6: end for
7: W̃^{(L)}_MAP = (W^{(L)}_MAP, 0)   ▷ The zero submatrix is of size k × m_{L-1}
8: b̃^{(L)}_MAP = b^{(L)}_MAP
9: θ̃_MAP = (W̃^{(l)}_MAP, b̃^{(l)}_MAP)_{l=1}^L

We show the confidence of randomly-initialized and fine-tuned LULA units in Figure 4. Even when set randomly from the standard Gaussian prior, LULA weights provide a significant improvement over the vanilla Laplace approximation. Moreover, training them yields even better predictive confidence estimates. In particular, far from the data, the confidence becomes even lower, while high confidence is maintained in the data regions.

C.2 UCI REGRESSION

To validate the performance of LULA in regression, we employ a subset of the UCI regression benchmark datasets. Following previous works, the network architecture used here is a single-hidden-layer ReLU network with 50 hidden units. The data are standardized to have zero mean and unit variance. We use 50 LULA units and optimize them for 40 epochs using OOD data sampled uniformly from [-10, 10]^n. For MCD, KFL, and KFL+LULA, each prediction is done via MC integration with 100 samples. For the evaluation of each dataset, we use a 60-20-20 train-validation-test split. We repeat each train-test process 10 times and report the average.

In Table 5 we report the average predictive standard deviation for each dataset. Note that this metric is the direct generalization of the 1D error bar in Figure 1 (top) to multiple dimensions. The outliers are sampled uniformly from [-10, 10]^n. Since the inlier data are centered around the origin and have unit variance, they lie approximately in a Euclidean ball of radius 2, so these outliers are very far away from them; naturally, high uncertainty values over these outliers are desirable. Uncertainties over the test sets are generally low for all methods, although KFL+LULA has slightly higher uncertainty than the base KFL. However, KFL+LULA yields much higher uncertainty over outliers across all datasets, significantly more than the baselines. Moreover, in Table 4, we show that KFL+LULA maintains the predictive performance of the base KFL. Altogether, this implies that KFL+LULA can detect outliers better than the other methods without sacrificing predictive performance.

C.3 IMAGE CLASSIFICATION

In Table 6 we present a sensitivity analysis of the confidences under a Laplace approximation w.r.t. the number of additional LULA units. Generally, we found that a small number of additional LULA units, e.g. 32 or 64, is optimal. It is clear that increasing the number of LULA units decreases both the in- and out-distribution confidences. In the case of larger networks, we found that larger values (e.g. 512 in CIFAR-10) make the Hessian badly conditioned, resulting in numerical instability during its inversion. One might be able to alleviate this issue by additionally tuning the prior precision hyperparameter of the Laplace approximation (as in Ritter et al. (2018b); Kristiadi et al. (2020)), which corresponds to varying the strength of the diagonal correction of the Hessian. However, we emphasize that even with a small number of additional LULA units, we can already improve vanilla Laplace approximations significantly, as shown in the main text (Section 5.1).

We present the predictive performance of all methods in Table 7. LULA achieves similar accuracies to the base MAP and KFL baselines. Differences in their exact values are likely due to the various approximations used (e.g. the MC integral). In the case of CIFAR-100, we found that MAP underperforms compared to MCD and DE. This might be because of overfitting, since only weight decay is used for regularization, in contrast to MCD where dropout is used on top of weight decay. Due to MAP's underperformance, LULA also underperforms. However, we stress that whenever the base MAP model performs well, by construction LULA will also perform well.

As a supplement, we show the performance of KFL+LULA against DPN in OOD detection on MNIST (Table 8). We found that KFL+LULA's OOD detection performance is competitive with or better than DPN's.

C.4 COMPUTATIONAL COST

To supplement the cost analysis in the main text, we show the wall-clock times required for the construction and training of LULA units in Table 9 . 

C.5 DEEPER NETWORKS

We also assess the performance of LULA in larger networks. We use a 20-layer CNN on CIFAR-10, SVHN, and CIFAR-100. Both the KFL and LULA are applied only at the last layer of the network. The results, in terms of MMC, expected calibration error (ECE), and AUR, are presented in Table 10 and Table 11. We observe that LULA is the best method for uncertainty tuning in LA: it makes KFL better calibrated in both in- and out-distribution settings. Moreover, the LULA-imbued KFL is competitive with DE, which has been shown by Ovadia et al. (2019) to be the best Bayesian method for uncertainty quantification. Note that KFL+LULA is a post-hoc method and thus can be applied to any pre-trained network. In contrast, DE requires training multiple networks (usually 5) from scratch, which can be very expensive. We additionally show the performance of LULA when applied on top of a KFL-approximated DenseNet-121 (Huang et al., 2017) in Tables 12 and 13. LULA generally outperforms previous uncertainty-tuning methods for LA and is competitive with DE. However, we observe on SVHN that LULA does not improve KFL significantly. This issue is due to the use of the Smooth Noise dataset for training LULA, since KFL has already assigned low confidence to it in this case. Thus, we re-train LULA with the Uniform Noise dataset and present the result in Table 14. We show that using this dataset, we obtain better OOD calibration performance, outperforming DE.
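The last-layer setting above treats the network up to its final layer as a fixed feature map phi(x) and places the Laplace posterior only over the last-layer weights. The sketch below illustrates this under a simplifying assumption (a squared-loss GGN, Phi^T Phi + tau I, instead of the classification Hessian used in the paper): predictive variance is phi(x)^T Sigma phi(x), which grows for features far from the training data.

```python
import numpy as np

def last_layer_laplace(features, prior_precision=1.0):
    """Last-layer Laplace: Gaussian posterior over the final-layer
    weights with covariance (Phi^T Phi + tau * I)^{-1}, where Phi
    stacks the penultimate-layer features of the training set."""
    Phi = np.asarray(features)
    d = Phi.shape[1]
    return np.linalg.inv(Phi.T @ Phi + prior_precision * np.eye(d))

def predictive_variance(Sigma, phi_x):
    """Variance of the (pre-activation) output at a test point."""
    return phi_x @ Sigma @ phi_x

rng = np.random.default_rng(0)
Phi_train = rng.normal(size=(500, 16))   # hypothetical feature matrix
Sigma = last_layer_laplace(Phi_train)

phi_in = rng.normal(size=16)   # feature of an in-distribution point
phi_far = 10.0 * phi_in        # feature far from the training data
assert predictive_variance(Sigma, phi_far) > predictive_variance(Sigma, phi_in)
```

Because only the d x d last-layer covariance is stored and inverted, this construction scales to deep CNNs such as the 20-layer network and DenseNet-121 used above.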



Footnotes from the main text:
1. While one can also use a non-scalar prior precision, it appears to be uncommon in deep learning. In any case, such an element-wise weight cost would interact with the training procedure.
2. E.g., if the hidden units are a 3D array of (channel × height × width), then we expand the first dimension.
3. More formally: the principal curvatures of the graph of L, seen as a d-dimensional submanifold of R^(d+1).
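The channel expansion mentioned above (adding LULA units to convolutional layers by expanding the channel dimension) can be sketched as follows. The helper names and shapes are hypothetical; the key point is that the new filters are free parameters while the next layer's weights *into* the new channels are zeroed, so the prediction is unchanged.

```python
import numpy as np

def add_lula_channels(W, m, rng=None):
    """Append m extra output channels (LULA units) to a conv weight
    of shape (out_channels, in_channels, kh, kw)."""
    rng = rng or np.random.default_rng(0)
    out_c, in_c, kh, kw = W.shape
    new_filters = rng.normal(size=(m, in_c, kh, kw))  # free LULA weights
    return np.concatenate([W, new_filters], axis=0)

def zero_outgoing(W_next, m):
    """Zero the next layer's weights coming from the m new channels,
    so the added units cannot affect the network's output."""
    W_next = W_next.copy()
    W_next[:, -m:] = 0.0
    return W_next

W1 = np.zeros((8, 3, 3, 3))
W1_aug = add_lula_channels(W1, 4)        # 8 -> 12 output channels
W2 = zero_outgoing(np.ones((4, 12, 3, 3)), 4)
assert W1_aug.shape == (12, 3, 3, 3)
assert np.all(W2[:, -4:] == 0.0)         # LULA channels are cut off
```

Although the zeroed weights leave the forward pass untouched, they still enter the Hessian of the loss, which is how the added channels reshape the Laplace posterior.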



Figure 1: Predictive uncertainty of (a) a MAP-trained, (b) a Laplace-approximated (LA), and (c) a LULA-augmented Laplace-approximated (LA+LULA) neural network on regression (top) and classification (bottom) tasks. The black curve represents the predictive mean and the decision boundary, and the shade represents ±3 standard deviations and the confidence, in regression and classification respectively. MAP is overconfident and LA can mitigate this. However, LA can still be overconfident away from the data. LULA improves LA's uncertainty further without affecting its predictions.

Figure 2: An illustration of the proposed construction. Rectangles represent layers, and solid lines represent connections between layers, given by the original weight matrices W^(1)_MAP, ..., W^(L)_MAP.

Figure 3: LULA compared to DPN on the Rotated-MNIST benchmark.

Figure 4: Even when their weights are assigned randomly, LULA units improve the vanilla Laplace in terms of UQ. Fine-tuning the LULA weights improves it even further, in particular in terms of confidence far from the data: trained LULA yields less confident predictions in this region.

Average confidences (MMCs in percent) over ten prediction runs. Lower is better for OOD data, while higher is better for in-distribution data. See Table 2 for the AUR values.

OOD detection performance measured by the AUR metric. Values reported are averages over ten prediction runs. Higher is better. Underline and bold faces indicate the highest values over the last four columns and all columns in a given row, respectively.

Robustness to dataset shifts on the corrupted CIFAR-10 dataset (Hendrycks & Dietterich, 2019), following Ovadia et al. (2019). All values are averages and standard deviations over all perturbation types and intensities (for a total of 95 dataset shifts). For accuracy, higher is better, while for ECE and NLL, lower is better.

Predictive performances on UCI regression datasets in terms of average test log-likelihood. The numbers reported are averages over ten runs along with the corresponding standard deviations. The performances of LULA are similar to KFL's; the differences between their exact values are likely due to MC integration.

UQ performances on UCI datasets. Values are the average (over all data points and ten trials) predictive standard deviations. Lower is better for test data and vice versa for outliers. By definition, MAP does not have (epistemic) uncertainty.

In- and out-distribution validation MMCs for varying numbers of additional LULA units. "In" and "out" values are in percent. "Loss" is the value of the loss in (9). Missing entries signify that errors occurred; see Section C.3 for details.

Accuracies (in percent) over image classification test sets. Values are averages over ten trials.

LULA compared to DPN on OOD detection in terms of MMC and AUR, both in percent.

Wall-clock time of adding and training LULA units. All values are in seconds.

Average confidences (MMCs in percent) on 20-layer CNNs over ten prediction runs. Lower is better for OOD data. Values shown for each in-distribution dataset are ECE; lower is better. Underline and bold faces indicate the highest values over the last four columns and all columns in a given row, respectively.

OOD detection performance measured by the AUR metric on 20-layer CNNs. Values reported are averages over ten prediction runs. Higher is better. Underline and bold faces indicate the highest values over the last four columns and all columns in a given row, respectively.

Average confidences (MMCs in percent) on DenseNet-121 over ten prediction runs. Lower is better for OOD data. Values shown for each in-distribution dataset are ECE; lower is better. Underline and bold faces indicate the highest values over the last four columns and all columns in a given row, respectively.

OOD detection performance measured by the AUR metric on DenseNet-121. Values reported are averages over ten prediction runs. Higher is better. Underline and bold faces indicate the highest values over the last four columns and all columns in a given row, respectively.

LULA's OOD detection performance on DenseNet-121 with uniform noise as the training OOD data. Values are the ECE, MMC, and AUR metrics, averaged over ten prediction runs. Underline and bold faces indicate the highest values over the last four columns and all columns in a given row, respectively.

APPENDIX B IMPLEMENTATION

We summarize the augmentation of a network with LULA units in Algorithm 2. Note that the priors of the free parameters W(l), b(l) (lines 2 and 3) can be chosen as independent Gaussians; this reflects the standard procedure for initializing NNs' parameters. We train LULA units for a single epoch (since for each dataset, we have a large amount of training points) with learning rate 0.01. For each dataset, the number of the additional last-layer units m_L is obtained via a grid search over the set {64, 128, 256, 512} =: M_L, minimizing the absolute distance to the optimal MMC, i.e. 1 and 1/k for the in- and out-distribution validation sets, respectively:

m_L = argmin_{m in M_L} |MMC_in(m) - 1| + |MMC_out(m) - 1/k|,

where MMC_in and MMC_out are the validation in- and out-distribution MMCs of the Laplace-approximated, trained-LULA network, respectively.
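The grid search above can be sketched as follows. The MMC curves here are hypothetical stand-ins for real validation runs (in practice each candidate requires re-training the LULA units and re-fitting the Laplace approximation); only the objective matches the description above.

```python
def select_num_units(candidates, mmc_in_fn, mmc_out_fn, k):
    """Grid search for the number of last-layer LULA units m_L:
    pick the candidate minimizing |MMC_in - 1| + |MMC_out - 1/k|,
    i.e. the distance of the validation MMCs to their ideal values
    (full confidence in-distribution, uniform 1/k out-of-distribution)."""
    def objective(m):
        return abs(mmc_in_fn(m) - 1.0) + abs(mmc_out_fn(m) - 1.0 / k)
    return min(candidates, key=objective)

# Hypothetical monotone MMC curves (assumption): adding units lowers
# both the in- and out-distribution confidence.
mmc_in = lambda m: 1.0 - 0.0002 * m
mmc_out = lambda m: max(0.1, 0.9 - 0.003 * m)

best = select_num_units([64, 128, 256, 512], mmc_in, mmc_out, k=10)
assert best == 256  # under these toy curves, 256 balances the two terms
```

The two terms pull in opposite directions, which is why an intermediate m_L wins: more units push MMC_out toward 1/k but also erode MMC_in.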

C.1 TOY DATASET

In practice, one can simply set {W(l), b(l)}_{l=1}^{L-1} randomly given a prior, e.g. the standard Gaussian (see also Algorithm 2). To validate this practice, we show the vanilla Laplace, untrained LULA,

