FIXING ASYMPTOTIC UNCERTAINTY OF BAYESIAN NEURAL NETWORKS WITH INFINITE RELU FEATURES

Abstract

Approximate Bayesian methods can mitigate overconfidence in ReLU networks. However, far away from the training data, even Bayesian neural networks (BNNs) can still underestimate uncertainty and thus be overconfident. We suggest fixing this by considering an infinite number of ReLU features over the input domain that are never part of the training process and thus remain at their prior values. Perhaps surprisingly, we show that this model leads to a tractable Gaussian process (GP) term that can be added to a pre-trained BNN's posterior at test time with negligible cost overhead. The BNN then yields structured uncertainty in the proximity of training data, while the GP prior calibrates uncertainty far away from them. As a key contribution, we prove that the added uncertainty yields cubic predictive variance growth, and thus the ideal uniform (maximum-entropy) confidence in multi-class classification far from the training data.

1. INTRODUCTION

Calibrated uncertainty is crucial for safety-critical decision making by neural networks (NNs) (Amodei et al., 2016). Standard training methods of NNs yield point estimates that, even if they are highly accurate, can still be severely overconfident (Guo et al., 2017). Approximate Bayesian methods, which turn NNs into Bayesian neural networks (BNNs), can be used to address this issue. Kristiadi et al. (2020) recently showed that for binary ReLU classification networks, far away from the training data (more precisely: when scaling any input x with a scalar α > 0 and taking the limit α → ∞), the uncertainty of BNNs can be bounded away from zero. This is an encouraging result when put in contrast to standard point-estimated networks, for which Hein et al. (2019) showed earlier that the same asymptotic limit always yields arbitrarily high (over-)confidence. Nevertheless, BNNs can still be asymptotically overconfident (albeit less so than standard NNs) since the aforementioned uncertainty bound can be loose. This issue is our principal interest in this paper.

An intuitive interpretation is that ReLU NNs "miss out on some uncertainty" even in their Bayesian formulation, because they fit a finite number of ReLU features to the training data by "moving around" these features within the coverage of the data. This process has no means to encode the desideratum that the model should be increasingly uncertain away from the data. In this work, we "add in" the missing uncertainty by considering an infinite number of additional ReLU features spaced at regular intervals away from the data in the input and hidden spaces of the network. Since these features have negligible values in the data region, they do not contribute to the training process. Hence, we can consider a prior for their weights, chosen to be an independent Gaussian, and arrive at a specific Gaussian process (GP) whose covariance function is a generalization of the classic cubic-spline kernel (Wahba, 1990).
This GP prior can be added to any pre-trained ReLU BNN as a simple augmentation to its output. Considering the additive combination of the parametric BNN and the GP prior together, we arrive at another view of the method: it approximates the "full GP posterior" that models the residual of a point-estimated NN (Blight & Ott, 1975; Qiu et al., 2020). In our factorization, the BNN models uncertainty around the training data, while the GP prior models uncertainty far away from them. By factorizing these two parts from each other, our formulation requires no (costly) GP posterior inference, and thus offers lightweight, modular uncertainty calibration. See Fig. 1 for an illustration.

Figure 1: Toy classification with a BNN and our method. Shade represents confidence; the suffix "ZO" stands for "zoomed-out". Far away from the training data, vanilla BNNs can still be overconfident (a, c). Our method fixes this issue while keeping predictions unchanged (b, d).

Theoretical analysis is a core contribution of this work. We show that the proposed method (i) preserves the predictive performance of the base ReLU BNN. Furthermore, it (ii) ensures that the output variance asymptotically grows cubically in the distance to the training data, and thus (iii) yields uniform asymptotic confidence in the multi-class classification setting. These results extend those of Kristiadi et al. (2020) insofar as their analysis is limited to the binary classification case and their bound can be loose. Furthermore, our approach is complementary to the method of Meinke & Hein (2020), which attains maximum uncertainty far from the data for non-Bayesian point-estimated NNs. Finally, our empirical evaluation confirms our analysis and shows that the proposed method also improves uncertainty estimates in the non-asymptotic regime.

2. BACKGROUND

2.1. BAYESIAN NEURAL NETWORKS

Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$, defined by $(x, \theta) \mapsto f(x; \theta) =: f_\theta(x)$, be a neural network, where $\theta$ is the collection of all parameters of $f$. Given an i.i.d. dataset $\mathcal{D} := \{(x_m, y_m)\}_{m=1}^M$, the standard training procedure amounts to finding a point estimate $\theta^*$ of the parameters, which can be identified in the Bayesian framework with maximum a posteriori (MAP) estimation:
\[
\theta^* = \arg\max_\theta \log p(\theta \mid \mathcal{D}) = \arg\max_\theta \sum_{m=1}^M \log p(y_m \mid f_\theta(x_m)) + \log p(\theta).
\]
While this point estimate may yield highly accurate predictions, it does not encode uncertainty over $\theta$, causing an overconfidence problem (Hein et al., 2019). Bayesian methods can mitigate this issue by treating the parameters of $f$ as a random variable and applying Bayes' rule. The resulting network is called a Bayesian neural network (BNN). The common way to approximate the posterior $p(\theta \mid \mathcal{D})$ of a BNN is by a Gaussian $q(\theta \mid \mathcal{D}) = \mathcal{N}(\theta \mid \mu, \Sigma)$, which can be constructed, for example, by a Laplace approximation (MacKay, 1992b) or variational Bayes (Hinton & Van Camp, 1993).

Given such an approximate posterior $q(\theta \mid \mathcal{D})$ and a test point $x_* \in \mathbb{R}^N$, one then needs to marginalize the parameters to make predictions, i.e. we compute the integral $y_* = \int h(f(x_*; \theta)) \, q(\theta \mid \mathcal{D}) \, d\theta$, where $h$ is an inverse link function, such as the identity for regression or the logistic-sigmoid and softmax functions for binary and multi-class classification, respectively. Since the network $f$ is a non-linear function of $\theta$, this integral does not have an analytic solution. However, one can obtain a useful approximation via network linearization: let $x_* \in \mathbb{R}^N$ be a test point and $q(\theta \mid \mathcal{D}) = \mathcal{N}(\theta \mid \mu, \Sigma)$ be a Gaussian approximate posterior. Linearizing $f$ around $\mu$ yields the following marginal distribution over the function output $f(x_*)$:
\[
p(f(x_*) \mid x_*, \mathcal{D}) \approx \mathcal{N}\big(f(x_*) \mid \underbrace{f(x_*; \mu)}_{=: m_*}, \; \underbrace{J_* \Sigma J_*^\top}_{=: V_*}\big), \qquad (1)
\]
where $J_*$ is the Jacobian of $f(x_*; \theta)$ w.r.t. $\theta$ at $\mu$.
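The linearized predictive (1) is simple to compute for small models. The following sketch illustrates it; the two-layer toy network and the finite-difference Jacobian are hypothetical stand-ins, not the paper's actual architecture:

```python
import numpy as np

def jacobian(f, theta, x, eps=1e-6):
    """Numerical Jacobian of f(x; theta) w.r.t. theta, shape (C, D)."""
    f0 = f(x, theta)
    J = np.zeros((f0.size, theta.size))
    for d in range(theta.size):
        t = theta.copy()
        t[d] += eps
        J[:, d] = (f(x, t) - f0) / eps
    return J

def linearized_predictive(f, mu, Sigma, x):
    """Gaussian predictive over f(x): mean f(x; mu), covariance J Sigma J^T."""
    J = jacobian(f, mu, x)
    return f(x, mu), J @ Sigma @ J.T

# Hypothetical tiny one-hidden-layer ReLU network with C = 2 outputs.
def net(x, theta):
    W1, b1, W2 = theta[:4].reshape(2, 2), theta[4:6], theta[6:10].reshape(2, 2)
    return W2 @ np.maximum(0.0, W1 @ x + b1)

mu = np.random.default_rng(0).normal(size=10)   # made-up posterior mean
Sigma = 0.1 * np.eye(10)                        # made-up posterior covariance
m, V = linearized_predictive(net, mu, Sigma, np.array([1.0, -0.5]))
```

The resulting covariance $V_*$ is positive semi-definite by construction, since it is a congruence of the posterior covariance.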
(In the case of a real-valued network $f$, we use the gradient $g_* := \nabla_\theta f(x_*; \theta)|_\mu$ instead of $J_*$.) This distribution can then be used as the predictive distribution $p(y_* \mid x_*, \mathcal{D})$ in the regression case. For classification, we need another approximation since $h$ is not the identity function. One such approximation is the generalized probit approximation (Gibbs, 1997; Spiegelhalter & Lauritzen, 1990; MacKay, 1992a):
\[
p(y_* = c \mid x_*, \mathcal{D}) \approx \frac{\exp(m_{*c} \kappa_{*c})}{\sum_{i=1}^C \exp(m_{*i} \kappa_{*i})}, \qquad c = 1, \dots, C, \qquad (2)
\]
where for each $i = 1, \dots, C$, the real number $m_{*i}$ is the $i$-th component of the vector $m_*$, and $\kappa_{*i} := (1 + \pi/8 \, v_{*ii})^{-1/2}$, with $v_{*ii}$ the $i$-th diagonal entry of the matrix $V_*$. These approximations are analytically useful, but can be expensive due to the computation of the Jacobian matrix $J_*$. Thus, Monte Carlo (MC) integration is commonly used as an alternative, i.e. we approximate $y_* \approx \frac{1}{S} \sum_{s=1}^S h(f(x_*; \theta_s))$ with $\theta_s \sim q(\theta \mid \mathcal{D})$.
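As a minimal sketch, the generalized probit approximation (2) takes only a few lines; the means and variances below are made-up numbers for illustration:

```python
import numpy as np

def probit_softmax(m, v):
    """Generalized probit approximation: p(y = c) proportional to
    exp(m_c * kappa_c), with kappa_c = (1 + pi/8 * v_c)^(-1/2)."""
    kappa = 1.0 / np.sqrt(1.0 + np.pi / 8.0 * v)
    z = m * kappa
    z = z - z.max()                  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

m = np.array([2.0, 0.5, -1.0])                     # predictive means (made up)
p_confident = probit_softmax(m, np.zeros(3))       # zero variance: plain softmax
p_uncertain = probit_softmax(m, 1e6 * np.ones(3))  # huge variance: near-uniform
```

With zero variance the approximation reduces to the ordinary softmax; as the variances grow, the prediction approaches the uniform distribution, which is exactly the mechanism RGPR exploits far from the data.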

2.2. RELU AND GAUSSIAN PROCESSES

The ReLU activation function $\mathrm{ReLU}(z) := \max(0, z)$ (Nair & Hinton, 2010) has become the de-facto choice of non-linearity in deep learning. Given an arbitrary real number $c$, it can be generalized as $\mathrm{ReLU}(z; c) := \max(0, z - c)$, with the "kink" at location $c$. An alternative formulation, useful below, is in terms of the Heaviside function $H$ as $\mathrm{ReLU}(z; c) = H(z - c)(z - c)$. We may define a collection of $K$ such ReLU functions evaluated at some point in $\mathbb{R}$ as the function $\phi : \mathbb{R} \to \mathbb{R}^K$ with $z \mapsto (\mathrm{ReLU}(z; c_1), \dots, \mathrm{ReLU}(z; c_K))^\top$. We call this function the ReLU feature map; it can be interpreted as "placing" ReLU functions at different locations in $\mathbb{R}$.

Consider a linear model $g : \mathbb{R} \times \mathbb{R}^K \to \mathbb{R}$ defined by $g(x; w) := w^\top \phi(x)$. Suppose $\phi$ regularly places the $K$ generalized ReLU functions, centered at $(c_i)_{i=1}^K$, over $[c_{\min}, c_{\max}] \subset \mathbb{R}$, where $c_{\min} < c_{\max}$. If we consider a Gaussian prior $p(w) := \mathcal{N}(w \mid 0, \sigma^2 K^{-1} (c_{\max} - c_{\min}) I)$ over the weights $w$ then, as $K$ goes to infinity, the distribution over $g(x)$ is a Gaussian process with mean zero and covariance (using the shorthands $g_x := g(x)$ and $\underline{x} := \min(x, x')$; full derivation in Appendix A)
\[
\lim_{K \to \infty} \mathrm{cov}(g_x, g_{x'}) = \sigma^2 H(\underline{x} - c_{\min}) \Big[ \tfrac{1}{3}(\underline{x}^3 - c_{\min}^3) - \tfrac{1}{2}(\underline{x}^2 - c_{\min}^2)(x + x') + (\underline{x} - c_{\min}) x x' \Big] =: k^1(x, x'; c_{\min}, \sigma^2),
\]
for $x, x' \leq c_{\max}$. Since this expression does not depend on $c_{\max}$, we consider the limit $c_{\max} \to \infty$. The resulting covariance function is the cubic spline kernel (Wahba, 1990).

3. METHOD

Hein et al. (2019) showed that the confidence of point-estimated ReLU networks (i.e. feed-forward nets which use piecewise-affine activation functions and are linear in the output layer) approaches 1 with increasing distance from the training data. For binary classification, Kristiadi et al. (2020) showed that Gaussian-approximated ReLU BNNs $f$ instead approach a constant confidence bounded away from 1, but not necessarily close to the maximum-uncertainty value of 1/2. Thus, just being Bayesian as such does not fix overconfidence entirely.
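To make the construction of Section 2.2 concrete, the following sketch implements the cubic-spline kernel $k^1$ and checks it against a finite-$K$ ReLU feature model (the Riemann-sum covariance of $K$ regularly placed features); all function names and parameter values are ours:

```python
import numpy as np

def k1(x, xp, c_min=0.0, sigma2=1.0):
    """Cubic-spline kernel: infinite-K limit of the covariance of regularly
    placed ReLU features ReLU(z; c) = max(0, z - c) on [c_min, c_max]."""
    lo = min(x, xp)
    if lo <= c_min:
        return 0.0
    return sigma2 * ((lo**3 - c_min**3) / 3.0
                     - (lo**2 - c_min**2) * (x + xp) / 2.0
                     + (lo - c_min) * x * xp)

def finite_cov(x, xp, K=20000, c_min=0.0, c_max=5.0, sigma2=1.0):
    """Covariance of the finite-K feature model with per-weight prior
    variance sigma2 * (c_max - c_min) / K: a Riemann sum converging to k1."""
    c = np.linspace(c_min, c_max, K)
    return sigma2 * (c_max - c_min) / K * np.sum(
        np.maximum(0.0, x - c) * np.maximum(0.0, xp - c))
```

For instance, `k1(1.0, 1.0)` equals the integral of `(1 - c)**2` over `[0, 1]`, i.e. 1/3, and `finite_cov` matches `k1` up to discretization error.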
A close look at their proof suggests that the issue is a structural limitation of the deep model itself: for any input $x_*$ and a sufficiently large scalar $\alpha$, both the mean and standard deviation of the output $f(\alpha x_*)$ are linear functions of $\alpha$. Intuitively, this issue arises because the net only has finitely many ReLU features available to "explain" the data, and thus it "lacks" ReLU features for modeling uncertainty away from the data. In this section, we will utilize the cubic spline kernel to construct a new kernel and method that, intuitively speaking, adds an infinite number of ReLU features away from the data to pre-trained BNNs. This construction adds the "missing" ReLU features and endows BNNs with super-quadratic output variance growth, without affecting predictions. All proofs are in Appendix B.

3.1. THE DOUBLE-SIDED CUBIC SPLINE KERNEL

The cubic spline kernel constructed above is non-zero only on $(c_{\min}, \infty) \subset \mathbb{R}$. To make it suitable for modeling uncertainty on an unbounded domain, we set $c_{\min} = 0$ and obtain a kernel $k^1_\to(x, x'; \sigma^2) := k^1(x, x'; 0, \sigma^2)$ which is non-zero only on $(0, \infty)$. Doing an entirely analogous construction with infinitely many ReLU functions pointing to the left, i.e. $\mathrm{ReLU}(-z; c)$, we obtain the kernel $k^1_\leftarrow(x, x'; \sigma^2) := k^1_\to(-x, -x'; \sigma^2)$, which is non-zero only on $(-\infty, 0)$. We combine both into the kernel $k^1_\leftrightarrow(x, x'; \sigma^2) := k^1_\leftarrow(x, x'; \sigma^2) + k^1_\to(x, x'; \sigma^2)$, which covers the whole real line (the value at the origin, $k^1_\leftrightarrow(0, 0)$, is zero); see Figure 2. For multivariate input domains, we define
\[
k_\leftrightarrow(x, x'; \sigma^2) := \frac{1}{N} \sum_{i=1}^N k^1_\leftrightarrow(x_i, x'_i; \sigma^2) \qquad (3)
\]
for any $x, x' \in \mathbb{R}^N$ with $N > 1$. We here deliberately use a summation, instead of the alternative of a product, since we want the associated GP to add uncertainty anywhere where at least one input dimension has a non-vanishing value. We call this kernel the double-sided cubic spline (DSCS) kernel. Two crucial properties of this kernel are that it has negligible values around the origin and that, for any $x_* \in \mathbb{R}^N$ and $\alpha \in \mathbb{R}$, the value $k_\leftrightarrow(\alpha x_*, \alpha x_*)$ is cubic in $\alpha$.
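A minimal NumPy sketch of the DSCS kernel (3); the helper names are our own:

```python
import numpy as np

def k1_cubic(x, xp, sigma2=1.0):
    """One-sided cubic-spline kernel with c_min = 0 (non-zero on (0, inf))."""
    lo = min(x, xp)
    if lo <= 0.0:
        return 0.0
    return sigma2 * (lo**3 / 3.0 - lo**2 * (x + xp) / 2.0 + lo * x * xp)

def k_dscs(x, xp, sigma2=1.0):
    """Double-sided cubic spline (DSCS) kernel on R^N: average over input
    dimensions of the right- plus left-pointing one-sided kernels."""
    x, xp = np.atleast_1d(x), np.atleast_1d(xp)
    per_dim = [k1_cubic(a, b, sigma2) + k1_cubic(-a, -b, sigma2)
               for a, b in zip(x, xp)]
    return sum(per_dim) / len(per_dim)
```

The two crucial properties stated above are directly observable: the kernel vanishes at the origin, and scaling the input by alpha scales the diagonal value by alpha cubed.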

3.2. RELU-GP RESIDUAL

Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}$ be an $L$-layer, real-valued ReLU BNN. Suppose we place infinitely many ReLU features following the previous construction. Then, we arrive at a zero-mean GP prior $\mathcal{GP}(\hat{f}^{(0)} \mid 0, k_\leftrightarrow)$ over some real-valued function $\hat{f}^{(0)} : \mathbb{R}^N \to \mathbb{R}$ on the input space $\mathbb{R}^N$. We can use this GP to model the "missing" uncertainty whose absence makes $f$ overconfident far away from the data. We do so in a standard manner by assuming that the "true" latent function $\tilde{f}$ is the sum of $f$ and $\hat{f}^{(0)}$:
\[
\tilde{f} := f + \hat{f}^{(0)}, \qquad \hat{f}^{(0)} \sim \mathcal{GP}(\hat{f}^{(0)} \mid 0, k_\leftrightarrow). \qquad (4)
\]
Under this assumption, given an input $x_*$, it is clear that $\hat{f}^{(0)}$ does not affect the expected output of the BNN since the GP over $\hat{f}^{(0)}$ has zero mean. However, $\hat{f}^{(0)}$ does additively affect the uncertainty of the BNN's output $f_* := f(x_*)$: if we assume that $f_* \sim \mathcal{N}(\mathbb{E} f_*, \mathrm{var}\, f_*)$, then it follows that $\tilde{f}_* \sim \mathcal{N}(\mathbb{E} f_*, \mathrm{var}\, f_* + k_\leftrightarrow(x_*, x_*))$. Hence, the random function $\hat{f}^{(0)}$, resulting from placing an infinite number of ReLU features in the input space, indeed models the uncertainty residual of the BNN $f$. We thus call our method ReLU-GP residual (RGPR).

Unlike previous methods for modeling residuals with GPs, RGPR does not require posterior inference since, intuitively, the additional infinitely many ReLU features are never part of the training process: their "kinks" point away from the data. So even if we were to actively include them in the training process somehow, they would have (near-)zero training gradient and just stay where and as they are. The following statement illustrates this intuition more formally in GP regression under the linearization (1), assuming w.l.o.g. that the kernel values over the dataset are negligible (by shifting and scaling until the data is sufficiently close to $0 \in \mathbb{R}^N$).

Proposition 1. Suppose $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}$, defined by $(x, \theta) \mapsto f(x; \theta)$, is a ReLU regression BNN with a prior $p(\theta) = \mathcal{N}(\theta \mid 0, B)$ and $\mathcal{D} := \{(x_m, y_m)\}_{m=1}^M$ is a dataset. Let $\hat{f}^{(0)}$ and $\tilde{f}$ be defined as in (4), and let $x_* \in \mathbb{R}^N$ be arbitrary. Under the linearization of $f$ w.r.t. $\theta$ around $0$, given that all $x_1, \dots, x_M$ are sufficiently close to the origin, the GP posterior of $\tilde{f}_* := \tilde{f}(x_*)$ is given by
\[
p(\tilde{f}_* \mid x_*, \mathcal{D}) \approx \mathcal{N}\big(\tilde{f}_* \mid f(x_*; \mu), \; g_*^\top \Sigma g_* + k_\leftrightarrow(x_*, x_*)\big),
\]
where $\mu$ and $\Sigma$ are the mean and covariance of the posterior of the linearized network, respectively, and $g_* := \nabla_\theta f(x_*; \theta)|_0$.

The previous proposition shows that the GP prior of $\hat{f}^{(0)}$ does not affect the BNN's approximate posterior: $\tilde{f}$ is written as an a posteriori $f$ plus an a priori $\hat{f}^{(0)}$. Therefore, given a pre-trained BNN $f$ with its associated posterior $p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\theta \mid \mu, \Sigma)$, we can simply add to its output $f(x_*; \theta)$ (with $\theta \sim p(\theta \mid \mathcal{D})$) a random number $\hat{f}^{(0)}(x_*) \sim \mathcal{N}(0, k_\leftrightarrow(x_*, x_*))$. We henceforth assume that $f$ is a pre-trained BNN.

While the previous construction is sufficient for modeling uncertainty far away from the data, it does not model the uncertainty near the data region well. Figure 3(a) shows this behavior: placing infinitely many ReLU features over just the input space yields uncertainty that is not adapted to the data and hence, far away from the data, we can still have low variance. To alleviate this issue, we additionally place infinitely many ReLU features on the representation spaces of the point-estimated network $f_\mu(\cdot) = f(\cdot\,; \mu)$, which encode information about the data since $f$ is a trained BNN, as follows. For each $l = 1, \dots, L-1$ and any input $x_*$, let $N_l$ be the size of the $l$-th hidden layer of $f_\mu$ and let $h^{(l)}(x_*) =: h^{(l)}_*$ be the $l$-th hidden units. By convention, we set $N_0 := N$ and $h^{(0)}_* := x_*$. We place, for each $l = 0, \dots, L-1$, an infinite number of ReLU features on the representation space $\mathbb{R}^{N_l}$, and thus obtain a random function $\hat{f}^{(l)} : \mathbb{R}^{N_l} \to \mathbb{R}$ distributed by the Gaussian process $\mathcal{GP}(\hat{f}^{(l)} \mid 0, k_\leftrightarrow)$. Now, letting $\bar{N} := \sum_{l=0}^{L-1} N_l$, we define the function $\hat{f} : \mathbb{R}^{\bar{N}} \to \mathbb{R}$ by $\hat{f} := \hat{f}^{(0)} + \dots + \hat{f}^{(L-1)}$.
This function is therefore a function over all representation (including the input) spaces of $f_\mu$, distributed by the additive Gaussian process $\mathcal{GP}(\hat{f} \mid 0, \sum_{l=0}^{L-1} k_\leftrightarrow)$. In other words, given the representations $h_* := (h^{(l)}_*)_{l=0}^{L-1}$ of $x_*$, the marginal over the function output $\hat{f}(h_*) =: \hat{f}_*$ is given by
\[
p(\hat{f}_*) = \mathcal{N}\Big(\hat{f}_* \,\Big|\, 0, \; \sum_{l=0}^{L-1} k_\leftrightarrow\big(h^{(l)}_*, h^{(l)}_*; \sigma_l^2\big)\Big).
\]
Figure 3(c) visualizes the effect of this definition. The low-variance region modeled by the random function $\hat{f}$ becomes more compact around the data and can be controlled by varying the kernel hyperparameter $\sigma_l^2$ for each layer $l = 0, \dots, L-1$. Finally, we can then model the residual in (4) using $\hat{f}$ instead, i.e. we assume $\tilde{f} = f + \hat{f}$.

The generalization of RGPR to BNNs with multiple outputs is straightforward. Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$ be a vector-valued, pre-trained, $L$-layer ReLU BNN. We assume that the sequence of random functions $(\hat{f}_c : \mathbb{R}^{\bar{N}} \to \mathbb{R})_{c=1}^C$ is independent and identically distributed by the previous Gaussian process $\mathcal{GP}(\hat{f} \mid 0, \sum_{l=0}^{L-1} k_\leftrightarrow)$. Thus, defining $\hat{f}_* := \hat{f}(h_*) := (\hat{f}_1(h_*), \dots, \hat{f}_C(h_*))^\top$, we have
\[
p(\hat{f}_*) = \mathcal{N}\Big(\hat{f}_* \,\Big|\, 0, \; \sum_{l=0}^{L-1} k_\leftrightarrow\big(h^{(l)}_*, h^{(l)}_*; \sigma_l^2\big) I\Big). \qquad (7)
\]
Furthermore, as in the real-valued case, for any $x_*$, the GP posterior $p(\tilde{f}_* \mid x_*, \mathcal{D})$ is approximately (under the linearization of $f$) given by the Gaussian derived from (1) and (7):
\[
p(\tilde{f}_* \mid x_*, \mathcal{D}) \approx \mathcal{N}\Big(\tilde{f}_* \,\Big|\, f_\mu(x_*), \; J_* \Sigma J_*^\top + \sum_{l=0}^{L-1} k_\leftrightarrow\big(h^{(l)}_*, h^{(l)}_*; \sigma_l^2\big) I\Big). \qquad (8)
\]
Although the derivations above may appear involved, it is worth emphasizing that in practice, the only overheads compared to the usual MC-integrated BNN prediction step are (i) a single additional forward pass over $f_\mu$, (ii) $L$ evaluations of the kernel $k_\leftrightarrow$, and (iii) sampling from the $C$-dimensional Gaussian (7).
These costs are negligible compared to the cost of obtaining the standard MC prediction of $f$. We refer the reader to Algorithm 1 for step-by-step pseudocode.

Figure 3: When $\hat{f}$ is also defined over the hidden representations (b), it captures the data region better than when $\hat{f}$ is only defined on the input space (a). Increasing the kernel hyperparameters (here assumed to have the same value for all layers) makes the low-variance region more compact around the data (c).

Algorithm 1: MC prediction using RGPR. Differences from the standard procedure are in red.
Input: Pre-trained multi-class BNN classifier $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$ with posterior $p(\theta \mid \mathcal{D})$; test point $x_* \in \mathbb{R}^N$; prior variance hyperparameters $(\sigma_l^2)_{l=0}^{L-1}$ of $\hat{f}$; inverse link function $h$; number of MC samples $S$.
1: $\{h^{(l)}_*\}_{l=1}^{L-1} = \mathrm{forward}(f_\mu, x_*)$   // compute representations of $x_*$ via a forward pass on $f_\mu$
2: $v(x_*) = \sum_{l=0}^{L-1} k_\leftrightarrow(h^{(l)}_*, h^{(l)}_*; \sigma_l^2)$   // compute the prior variance of $\hat{f}$
3: for $s = 1, \dots, S$ do
4:   $\theta_s \sim \mathcal{N}(\theta \mid \mu, \Sigma)$   // sample from the (approximate) posterior of $f$
5:   $f_s(x_*) = f(x_*; \theta_s)$   // forward pass on $f$ using the sampled parameters
6:   $\hat{f}_s(x_*) \sim \mathcal{N}(\hat{f}(h_*) \mid 0, v(x_*) I)$   // sample from the marginal (7)
7:   $\tilde{f}_s(x_*) = f_s(x_*) + \hat{f}_s(x_*)$   // compute $\tilde{f}(x_*; \theta_s)$
8: end for
9: return $S^{-1} \sum_{s=1}^S h(\tilde{f}_s(x_*))$   // make the prediction by averaging
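The MC-prediction procedure of Algorithm 1 can be sketched in NumPy as follows; the toy network, its parameterization, and `sample_theta` are all hypothetical stand-ins for a real pre-trained BNN:

```python
import numpy as np

def k_dscs(x, xp, sigma2=1.0):
    """DSCS kernel: average over dimensions of two one-sided cubic kernels."""
    def k1(a, b):
        lo = min(a, b)
        return 0.0 if lo <= 0 else sigma2 * (lo**3 / 3 - lo**2 * (a + b) / 2 + lo * a * b)
    return np.mean([k1(a, b) + k1(-a, -b) for a, b in zip(x, xp)])

def toy_net(x, theta):
    """Hypothetical one-hidden-layer ReLU net; returns logits and hidden reps."""
    W1, W2 = theta[:8].reshape(4, 2), theta[8:].reshape(3, 4)
    h = np.maximum(0.0, W1 @ x)
    return W2 @ h, [h]

def rgpr_predict(net, mu, sample_theta, x, sigma2s, S=500, seed=0):
    """MC prediction with RGPR, following Algorithm 1."""
    rng = np.random.default_rng(seed)
    _, hs = net(x, mu)                                  # step 1: reps via f_mu
    v = sum(k_dscs(h, h, s2) for h, s2 in zip([x] + hs, sigma2s))  # step 2
    probs = 0.0
    for _ in range(S):
        f_s, _ = net(x, sample_theta(rng))              # steps 4-5: BNN sample
        f_s = f_s + rng.normal(0.0, np.sqrt(v), size=f_s.shape)  # steps 6-7
        e = np.exp(f_s - f_s.max())                     # softmax link h
        probs += e / e.sum()
    return probs / S                                    # step 9: MC average

mu = np.random.default_rng(42).normal(size=20)          # made-up posterior mean
sample_theta = lambda rng: mu + 0.01 * rng.normal(size=mu.size)
x_near = np.array([0.5, -0.3])
p_near = rgpr_predict(toy_net, mu, sample_theta, x_near, [1.0, 1.0])
p_far = rgpr_predict(toy_net, mu, sample_theta, 1e4 * x_near, [1.0, 1.0])
```

Near the origin the added variance is negligible, so the prediction is essentially the base BNN's; far away, the DSCS variance dominates and the averaged prediction approaches the uniform distribution.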

4. ANALYSIS

Here, we study the theoretical properties of RGPR. Our assumptions are mild: we (i) assume that RGPR is applied only to the input space and (ii) use the network linearization technique. Assumption (i) is the minimal condition for the results presented in this section to hold; similar results can easily be obtained when hidden layers are also utilized in RGPR. Meanwhile, assumption (ii) is necessary for tractability; in Section 6 we validate our analysis in general settings. The following two propositions (i) summarize the property that RGPR preserves the original BNN's prediction and (ii) show that, asymptotically, the marginal variance of the output of $\tilde{f}$ grows cubically.

Proposition 2 (Invariance of Predictions). Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$ be any network with posterior $\mathcal{N}(\theta \mid \mu, \Sigma)$ and let $\tilde{f}$ be obtained from $f$ via RGPR (4). Then, under the linearization of $f$, for any $x_* \in \mathbb{R}^N$ we have $\mathbb{E}_{p(\tilde{f}_* \mid x_*, \mathcal{D})} \tilde{f}_* = \mathbb{E}_{p(f_* \mid x_*, \mathcal{D})} f_*$.

Proposition 3 (Asymptotic Variance Growth). Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$ be a $C$-class ReLU network with posterior $\mathcal{N}(\theta \mid \mu, \Sigma)$ and let $\tilde{f}$ be obtained from $f$ via RGPR over the input space. Suppose that the linearization of $f$ w.r.t. $\theta$ around $\mu$ is employed. For any $x_* \in \mathbb{R}^N$ with $x_* \neq 0$ there exists $\beta > 0$ such that for any $\alpha \geq \beta$, the variance of each output component $\tilde{f}_1(\alpha x_*), \dots, \tilde{f}_C(\alpha x_*)$ under $p(\tilde{f}_* \mid x_*, \mathcal{D})$ in (8) is in $\Theta(\alpha^3)$.

As a consequence of Proposition 3, in the binary classification case, the confidence at $\alpha x_*$ decays like $1/\sqrt{\alpha}$ far away from the training data, as can be seen using the (binary) probit approximation. Thus, in this case we obtain maximum entropy in the limit $\alpha \to \infty$. The following theorem formalizes this statement in the more general multi-class classification setting.

Theorem 4 (Uniform Asymptotic Confidence). Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$ be a $C$-class ReLU network equipped with the posterior $\mathcal{N}(\theta \mid \mu, \Sigma)$ and let $\tilde{f}$ be obtained from $f$ via RGPR over the input space. Suppose that the linearization of $f$ and the generalized probit approximation (2) are used for approximating the predictive distribution $p(y_* = c \mid \alpha x_*, \tilde{f}, \mathcal{D})$ under $\tilde{f}$. Then for any input $x_* \in \mathbb{R}^N$ with $x_* \neq 0$ and for every class $c = 1, \dots, C$,
\[
\lim_{\alpha \to \infty} p(y_* = c \mid \alpha x_*, \tilde{f}, \mathcal{D}) = \frac{1}{C}.
\]
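The mechanism behind Proposition 3 and Theorem 4 can be checked numerically. In the sketch below, the linear-region constants (`m_slope`, `quad`) are made-up numbers: within a fixed linear region the mean logit grows like alpha and the variance like alpha cubed, so the probit-scaled logit decays like alpha^(-1/2) and the confidence tends to 1/C:

```python
import numpy as np

def k_dscs(x, xp, sigma2=1.0):
    """DSCS kernel over input dimensions (as in Eq. 3)."""
    def k1(a, b):
        lo = min(a, b)
        return 0.0 if lo <= 0 else sigma2 * (lo**3 / 3 - lo**2 * (a + b) / 2 + lo * a * b)
    return np.mean([k1(a, b) + k1(-a, -b) for a, b in zip(x, xp)])

x = np.array([1.0, -2.0])          # direction of the test point
m_slope, quad = 3.0, 5.0           # made-up linear-region constants

def confidence(alpha, C=4):
    """Max probit-approximated class probability at alpha * x."""
    m = np.array([m_slope * alpha] + [0.0] * (C - 1))   # mean logits, Theta(alpha)
    v = quad * alpha**2 + k_dscs(alpha * x, alpha * x)  # variance, Theta(alpha^3)
    kappa = 1.0 / np.sqrt(1.0 + np.pi / 8.0 * v)
    z = m * kappa
    p = np.exp(z - z.max())
    return (p / p.sum()).max()
```

As alpha grows, `confidence(alpha)` decreases monotonically toward the uniform value 1/C = 0.25, illustrating the theorem.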

5. RELATED WORK

Mitigating the asymptotic overconfidence problem has been studied recently. Although Hein et al. (2019) theoretically demonstrated the issue, their proposed method does not fix it for sufficiently large α. Kristiadi et al. (2020) showed that any Gaussian-approximated BNN can mitigate this issue even in the limit α → ∞. However, the asymptotic confidence estimates of BNNs converge to a constant in (0, 1), not to the ideal uniform confidence. In a non-Bayesian framework, using Gaussian mixture models, Meinke & Hein (2020) integrate density estimates of inlier and outlier data into the confidence estimates of an NN to achieve uniform confidence far away from the data. Nevertheless, this property has not previously been achieved in the context of BNNs.

Modeling the residual of a predictive model with a GP has been proposed by Blight & Ott (1975); Wahba (1978); O'Hagan (1978); Qiu et al. (2020). The key distinguishing factors between RGPR and those methods are that (i) RGPR models the residual of BNNs, in contrast to that of point-estimated networks, (ii) RGPR uses a novel kernel which guarantees cubic uncertainty growth, and (iii) RGPR requires no posterior inference. Nevertheless, whenever those methods use our DSCS kernel, RGPR can be seen as an economical approximation of their posterior: RGPR estimates uncertainty near the data with a BNN, while the GP-DSCS prior estimates uncertainty far away from them.

A combination of weight- and function-space models has been proposed in the context of non-parametric GP posterior sampling. Wilson et al. (2020) proposed to approximate a function as the sum of a weight-space prior and a function-space posterior. In contrast, RGPR models a function as the sum of a weight-space posterior and a function-space prior in the context of parametric BNNs.

6. EMPIRICAL EVALUATIONS

Our goals in this section are (i) to validate the analysis in the preceding section, by showing that RGPR's low confidence far away from the training data is observable in practice, and (ii) to explore the effect of RGPR's hyperparameters on non-asymptotic confidence estimates. We focus on classification; experiments on regression are in Appendix D.

6.1. ASYMPTOTIC REGIME

We use standard benchmark datasets: MNIST, CIFAR10, SVHN, and CIFAR100, with LeNet for MNIST and ResNet-18 for the rest of the datasets. Our main reference is the method based on Blight & Ott (1975) (with our kernel): we follow Qiu et al. (2020) for combining the network and GP, and for carrying out the posterior inference. We refer to this baseline as the Blight and Ott method (BNO); cf. Appendix C for an exposition of this method. The base methods, on which RGPR is implemented, are the following recently proposed BNNs: (i) last-layer Laplace (LLL, Kristiadi et al., 2020), (ii) Kronecker-factored Laplace (KFL, Ritter et al., 2018), (iii) stochastic weight averaging-Gaussian (SWAG, Maddox et al., 2019), and (iv) stochastic variational deep kernel learning (SVDKL, Wilson et al., 2016). All kernel hyperparameters for RGPR are set to 1. In all cases, an MC integral with 10 posterior samples is used for making predictions.

To validate Theorem 4, we construct an artificial test dataset by sampling 2000 uniform noise points in [0, 1]^N and scaling them with the scalar α = 2000. The goal is to distinguish test points from these outliers based on the confidence estimates. Since a visual inspection of these confidence estimates as in Figure 1 is not possible in high dimension, we measure the results using the mean maximum confidence (MMC) and area under ROC (AUR) metrics (Hendrycks & Gimpel, 2017). MMC is useful for summarizing confidence estimates, while AUR tells us how useful the confidences are for distinguishing between inliers and outliers.

The results are presented in Table 1. We observe that the RGPR-augmented methods are significantly better than their respective base methods. In particular, the confidences drop, as shown by the MMC values. We also observe in Table 3 (Appendix D) that the confidence estimates close to the training data do not change significantly. These two facts together yield high AUR values, close to the ideal value of 100.
Moreover, most RGPR-imbued methods achieve similar or better performance than the BNO baseline, likely due to uncertainty already present in the base BNNs. However, the confidences on far-away points are not quite uniform due to the number of MC samples used: recall that far away from the data, RGPR yields high variance; since the error of an MC integral depends on both the variance and the number of samples, a large number of samples is needed to obtain accurate MC estimates. See Figure 5 (Appendix D) for results with 1000 samples: in this more accurate setting, the convergence to the uniform confidence happens at finite (and small) α. Nevertheless, this issue is not detrimental to the detection of far-away outliers, as shown by the AUR values in Table 1.

6.2. NON-ASYMPTOTIC REGIME

The main goal of this section is to show that RGPR can also improve uncertainty estimates near the data by varying its kernel hyperparameters. For this purpose, we use a simple hyperparameter optimization based on noise out-of-distribution (OOD) data, similar to Kristiadi et al. (2020), to tune $(\sigma_l^2)_{l=0}^{L-1}$; the details are in Section C.2. We use LLL as the base BNN. First, we use the rotated-MNIST experiment proposed by Ovadia et al. (2019), where we measure the methods' calibration at different rotation angles; see Figure 4. LLL gives significantly better performance than BNO, and RGPR improves the performance further. Moreover, we use standard OOD detection tasks where one distinguishes in-distribution from out-of-distribution samples. We do this with CIFAR10 as the in-distribution dataset against various OOD datasets (more results in Appendix D). As shown in Table 2, LLL outperforms BNO on CIFAR10, and RGPR further improves LLL.

7. CONCLUSION

We have shown that adding the "missing uncertainty" to ReLU BNNs with a carefully crafted GP prior that represents infinite ReLU features fixes the asymptotic overconfidence problem of such networks. The core of our method is a generalization of the classic cubic-spline kernel which, when used as the covariance function of the GP prior, yields a marginal variance that scales cubically in the distance between a test point and the training data. The main strength of our approach lies in its simplicity: RGPR is relatively straightforward to implement and can be applied inexpensively to any pre-trained BNN. Furthermore, extensive theoretical analysis shows that RGPR provides significant improvements over previous results with vanilla BNNs. In particular, we were able to show uniform confidence far away from the training data in multi-class classification. On a less formal note, our construction, while derived as a post-hoc addition to the network, follows a pleasingly simple intuition that bridges the worlds of deep learning and non-parametric/kernel models: the RGPR model amounts to considering a non-parametric model of infinitely many ReLU features, only finitely many of which are trained as a deep ReLU network.

APPENDIX B PROOFS

Proof of Proposition 1. The definition of RGPR implies that we have $\tilde{f}(x) \approx g(x)^\top \theta + \hat{f}^{(0)}(x)$ with $\hat{f}^{(0)}(x) \sim \mathcal{N}(0, k_\leftrightarrow(x, x))$. Following O'Hagan (1978), we thus obtain a GP prior over $\tilde{f}$ whose marginal is $\tilde{f}(x) \sim \mathcal{N}(\tilde{f}(x) \mid 0, \; g(x)^\top B g(x) + k_\leftrightarrow(x, x))$. Suppose we write the dataset as $\mathcal{D} = (X, y)$, where $X$ is the data matrix and $y$ is the target vector, and let $x_* \in \mathbb{R}^N$ be an arbitrary test point. Let $k_\leftrightarrow^* := (k_\leftrightarrow(x_*, x_1), \dots, k_\leftrightarrow(x_*, x_M))^\top$, let $K_\leftrightarrow := K + \sigma^2 I$ with $K_{ij} := k_\leftrightarrow(x_i, x_j)$ and $\sigma^2 > 0$ sufficiently large be the (regularized) kernel matrix, and let $G := (g(x_1), \dots, g(x_M))$ be the matrix of training "features". As Rasmussen & Williams (2005, Sec.
2.7) suggest, we then have the GP posterior mean and variance
\[
\mathbb{E}(\tilde{f}(x_*) \mid \mathcal{D}) = g(x_*)^\top \mu + {k_\leftrightarrow^*}^\top K_\leftrightarrow^{-1} (y - G^\top \mu), \qquad (10)
\]
\[
\mathrm{var}(\tilde{f}(x_*) \mid \mathcal{D}) = k_\leftrightarrow(x_*, x_*) - {k_\leftrightarrow^*}^\top K_\leftrightarrow^{-1} k_\leftrightarrow^* + r^\top (B^{-1} + G K_\leftrightarrow^{-1} G^\top)^{-1} r, \qquad (11)
\]
where $\mu := (B^{-1} + G K_\leftrightarrow^{-1} G^\top)^{-1} G K_\leftrightarrow^{-1} y$ and $r := g(x_*) - G K_\leftrightarrow^{-1} k_\leftrightarrow^*$. Since all training points $x_1, \dots, x_M$ are sufficiently close to the origin, by definition of the DSCS kernel we have $k_\leftrightarrow^* \approx 0$ and $K_\leftrightarrow^{-1} \approx (1/\sigma^2) I$. These imply that $\mu \approx (B^{-1} + (1/\sigma^2) G G^\top)^{-1} (1/\sigma^2) G y$ and $r \approx g(x_*)$. In particular, notice that $\mu$ is approximately the posterior mean of the Bayesian linear regression on $f$ (Bishop, 2006, Sec. 3.3). Furthermore, (10) and (11) become
\[
\mathbb{E}(\tilde{f}(x_*) \mid \mathcal{D}) \approx g(x_*)^\top \mu = f(x_*; \mu),
\]
\[
\mathrm{var}(\tilde{f}(x_*) \mid \mathcal{D}) \approx k_\leftrightarrow(x_*, x_*) + g(x_*)^\top \underbrace{(B^{-1} + (1/\sigma^2) G G^\top)^{-1}}_{=: \Sigma} g(x_*),
\]
respectively. Notice in particular that $\Sigma$ is the posterior covariance of the Bayesian linear regression on $f$. Thus, the claim follows.

To prove Proposition 3 and Theorem 4, we need the following definition. Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$, defined by $(x, \theta) \mapsto f(x; \theta)$, be a feed-forward neural network which uses piecewise-affine activation functions (such as ReLU and leaky-ReLU) and is linear in the output layer. Such a network is called a ReLU network and can be written as a continuous piecewise-affine function (Arora et al., 2018). That is, there exists a finite set of polytopes $\{Q_i\}_{i=1}^P$, referred to as the linear regions of $f$, such that $\cup_{i=1}^P Q_i = \mathbb{R}^N$ and $f|_{Q_i}$ is an affine function for each $i = 1, \dots, P$ (Hein et al., 2019). The following lemma is central in our proofs below (the proof is in Lemma 3.1 of Hein et al. (2019)).

Lemma 5 (Hein et al., 2019). Let $\{Q_i\}_{i=1}^P$ be the set of linear regions associated with the ReLU network $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$. For any $x \in \mathbb{R}^N$ with $x \neq 0$ there exist a positive real number $\beta$ and $j \in \{1, \dots, P\}$ such that $\alpha x \in Q_j$ for all $\alpha \geq \beta$.

Proposition 3 (Asymptotic Variance Growth).
Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$ be a $C$-class ReLU network with posterior $\mathcal{N}(\theta \mid \mu, \Sigma)$ and let $\tilde{f}$ be obtained from $f$ via RGPR over the input space. Suppose that the linearization of $f$ w.r.t. $\theta$ around $\mu$ is employed. For any $x_* \in \mathbb{R}^N$ with $x_* \neq 0$ there exists $\beta > 0$ such that for any $\alpha \geq \beta$, the variance of each output component $\tilde{f}_1(\alpha x_*), \dots, \tilde{f}_C(\alpha x_*)$ under $p(\tilde{f}_* \mid x_*, \mathcal{D})$ in (8) is in $\Theta(\alpha^3)$.

Proof. Let $x_* \in \mathbb{R}^N$ with $x_* \neq 0$ be arbitrary. By Lemma 5 and the definition of a ReLU network, there exist a linear region $R$ and a real number $\beta > 0$ such that for any $\alpha \geq \beta$, the restriction of $f$ to $R$ can be written as $f|_R(\alpha x; \theta) = W(\alpha x) + b$ for some matrix $W \in \mathbb{R}^{C \times N}$ and vector $b \in \mathbb{R}^C$, which are functions of the parameters $\theta$, evaluated at $\mu$. In particular, for each $c = 1, \dots, C$, the $c$-th output component of $f|_R$ can be written as $f_c|_R = w_c^\top(\alpha x) + b_c$, where $w_c^\top$ and $b_c$ are the $c$-th row of $W$ and $c$-th component of $b$, respectively. Let $c \in \{1, \dots, C\}$ and let $j_c(\alpha x_*)$ be the $c$-th column of the Jacobian (transposed) defined in (1). Then, by definition of $p(\tilde{f}_* \mid x_*, \mathcal{D})$, the variance of $\tilde{f}_c|_R(\alpha x_*)$, i.e. the $c$-th diagonal entry of the covariance of $p(\tilde{f}_* \mid x_*, \mathcal{D})$, is given by
\[
\mathrm{var}(\tilde{f}_c|_R(\alpha x_*)) = j_c(\alpha x_*)^\top \Sigma \, j_c(\alpha x_*) + k_\leftrightarrow(\alpha x_*, \alpha x_*).
\]
Now, from the definition of the DSCS kernel in (3), we have
\[
k_\leftrightarrow(\alpha x_*, \alpha x_*) = \frac{1}{N} \sum_{i=1}^N k^1_\leftrightarrow(\alpha x_{*i}, \alpha x_{*i}) = \frac{1}{N} \sum_{i=1}^N \alpha^3 \frac{\sigma^2}{3} |x_{*i}|^3 = \frac{\alpha^3}{N} \sum_{i=1}^N k^1_\leftrightarrow(x_{*i}, x_{*i}) \in \Theta(\alpha^3).
\]
Furthermore, we have
\[
j_c(\alpha x_*)^\top \Sigma \, j_c(\alpha x_*) = \big(\alpha (\nabla_\theta w_c|_\mu)^\top x_* + \nabla_\theta b_c|_\mu\big)^\top \Sigma \big(\alpha (\nabla_\theta w_c|_\mu)^\top x_* + \nabla_\theta b_c|_\mu\big),
\]
so $j_c(\alpha x_*)^\top \Sigma \, j_c(\alpha x_*)$ is a quadratic function of $\alpha$. Therefore, $\mathrm{var}(\tilde{f}_c|_R(\alpha x_*))$ is in $\Theta(\alpha^3)$.

Theorem 4 (Uniform Asymptotic Confidence). Let $f : \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$ be a $C$-class ReLU network equipped with the posterior $\mathcal{N}(\theta \mid \mu, \Sigma)$ and let $\tilde{f}$ be obtained from $f$ via RGPR over the input space.
Suppose that the linearization of $f$ and the generalized probit approximation (2) are used for approximating the predictive distribution $p(y_* = c \mid \alpha x_*, \tilde{f}, \mathcal{D})$ under $\tilde{f}$. Then for any input $x_* \in \mathbb{R}^N$ with $x_* \neq 0$ and for every class $c = 1, \dots, C$,

$$\lim_{\alpha \to \infty} p(y_* = c \mid \alpha x_*, \tilde{f}, \mathcal{D}) = \frac{1}{C}.$$

Proof. Let $x_* \neq 0 \in \mathbb{R}^N$ be arbitrary. By Lemma 5 and the definition of a ReLU network, there exist a linear region $R$ and a real number $\beta > 0$ such that for any $\alpha \geq \beta$, the restriction of $f$ to $R$ can be written as $f|_R(\alpha x) = W(\alpha x) + b$, where the matrix $W \in \mathbb{R}^{C \times N}$ and vector $b \in \mathbb{R}^C$ are functions of the parameter $\theta$, evaluated at $\mu$. Furthermore, for $i = 1, \dots, C$ we denote the $i$-th row of $W$ and the $i$-th component of $b$ as $w_i^\top$ and $b_i$, respectively. Under the linearization of $f$, the marginal distribution (8) over the output $\tilde{f}(\alpha x)$ holds. Hence, under the generalized probit approximation, the predictive distribution restricted to $R$ is given by

$$p(y_* = c \mid \alpha x_*, \mathcal{D}) \approx \frac{\exp(m_c(\alpha x_*)\, \kappa_c(\alpha x_*))}{\sum_{i=1}^C \exp(m_i(\alpha x_*)\, \kappa_i(\alpha x_*))} = \frac{1}{1 + \sum_{i \neq c} \exp\bigl(\underbrace{m_i(\alpha x_*)\, \kappa_i(\alpha x_*) - m_c(\alpha x_*)\, \kappa_c(\alpha x_*)}_{=: z_{ic}(\alpha x_*)}\bigr)},$$

where for all $i = 1, \dots, C$,

$$m_i(\alpha x_*) = f_i|_R(\alpha x_*; \mu) = w_i^\top(\alpha x_*) + b_i \in \mathbb{R}, \qquad \kappa_i(\alpha x_*) = \bigl(1 + \pi/8\, (v_{ii}(\alpha x_*) + k_\leftrightarrow(\alpha x_*, \alpha x_*))\bigr)^{-\frac{1}{2}} \in \mathbb{R}_{>0}.$$

In particular, for all $i = 1, \dots, C$, note that $m_i(\alpha x_*) \in \Theta(\alpha)$ and $\kappa_i(\alpha x_*) \in \Theta(\alpha^{-3/2})$, since $v_{ii}(\alpha x_*) + k_\leftrightarrow(\alpha x_*, \alpha x_*)$ is in $\Theta(\alpha^3)$ by Proposition 3. Now, notice that for any $c = 1, \dots, C$ and any $i \in \{1, \dots, C\} \setminus \{c\}$, we have

$$z_{ic}(\alpha x_*) = m_i(\alpha x_*)\, \kappa_i(\alpha x_*) - m_c(\alpha x_*)\, \kappa_c(\alpha x_*) = \bigl(\underbrace{\kappa_i(\alpha x_*)\, w_i}_{\Theta(\alpha^{-3/2})} - \underbrace{\kappa_c(\alpha x_*)\, w_c}_{\Theta(\alpha^{-3/2})}\bigr)^\top (\alpha x_*) + \underbrace{\kappa_i(\alpha x_*)\, b_i}_{\Theta(\alpha^{-3/2})} - \underbrace{\kappa_c(\alpha x_*)\, b_c}_{\Theta(\alpha^{-3/2})}.$$

Since the first term is in $\Theta(\alpha^{-1/2})$ and the remaining terms vanish even faster, $\lim_{\alpha \to \infty} z_{ic}(\alpha x_*) = 0$. Hence we have

$$\lim_{\alpha \to \infty} p(y_* = c \mid \alpha x_*, \mathcal{D}) = \lim_{\alpha \to \infty} \frac{1}{1 + \sum_{i \neq c} \exp(z_{ic}(\alpha x_*))} = \frac{1}{1 + \sum_{i \neq c} \exp(0)} = \frac{1}{C},$$

as required.
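The limiting behavior can likewise be illustrated numerically. In the sketch below, the logit means and variances are synthetic stand-ins that grow as $\Theta(\alpha)$ and $\Theta(\alpha^3)$, respectively, as the proof prescribes; the slopes `w` are hypothetical:

```python
import numpy as np

def probit_softmax(m, v):
    """Generalized probit approximation of the softmax predictive:
    softmax(m_i * kappa_i) with kappa_i = (1 + pi/8 * v_i)^(-1/2)."""
    kappa = (1.0 + np.pi / 8.0 * v) ** -0.5
    z = m * kappa
    e = np.exp(z - z.max())          # stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(0)
C = 4
w = rng.normal(size=C)               # hypothetical slopes: m_i(alpha) = w_i * alpha
for alpha in [1.0, 1e2, 1e6]:
    m = w * alpha                    # logit means grow linearly, Theta(alpha)
    v = np.full(C, alpha ** 3)       # variances grow cubically, Theta(alpha^3)
    p = probit_softmax(m, v)
    print(alpha, p.round(3))         # approaches the uniform distribution 1/C
```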

APPENDIX C FURTHER DETAILS

C.1 THE METHOD OF BLIGHT AND OTT

The method of Blight and Ott (BNO) models the residual of polynomial regression. That is, supposing $\phi: \mathbb{R} \to \mathbb{R}^D$ is a polynomial basis function defined by $\phi(x) := (1, x, x^2, \dots, x^{D-1})$, $k$ is an arbitrary kernel, and $w \in \mathbb{R}^D$ is a weight vector, BNO assumes $\tilde{f}(x) := w^\top \phi(x) + f(x)$, where $f(x) \sim \mathcal{GP}(0, k(x, x'))$. Recently, this method has been extended to neural networks: Qiu et al. (2020) apply the same idea of modeling residuals with GPs to pre-trained networks, resulting in a method called RIO. Suppose that $f_\mu: \mathbb{R}^N \to \mathbb{R}$ is a neural network with pre-trained, point-estimated parameters $\mu$. Their method is defined by $\tilde{f}(x) := f_\mu(x) + f(x)$, where $f(x) \sim \mathcal{GP}(0, k_{\mathrm{IO}}(x, x'))$. The kernel $k_{\mathrm{IO}}$ is a sum of RBF kernels applied to the dataset $\mathcal{D}$ (inputs) and the network's predictions over $\mathcal{D}$ (outputs), hence the name IO for input-output. As in the original method of Blight and Ott, RIO also focuses on posterior inference over the GP. Suppose that $m(x)$ and $v(x)$ are the a posteriori marginal mean and variance of the GP, respectively. Then, via standard computations, one can see that even though $f_\mu$ is a point-estimated network, $\tilde{f}$ is a random function, distributed a posteriori as $\tilde{f}(x) \sim \mathcal{N}(\tilde{f}(x) \mid f_\mu(x) + m(x), v(x))$. Thus, BNO and RIO effectively add uncertainty to point-estimated networks. The posterior inference of BNO and RIO can be computationally expensive, depending on the number of training examples $M$: the cost of exact posterior inference is in $\Theta(M^3)$. While this can be alleviated by approximate inference, e.g. via inducing-point methods and stochastic optimization, the posterior-inference requirement can still be a hindrance to practical adoption of BNO and RIO, especially on large problems.
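To make the residual-modeling idea concrete, here is a minimal numpy sketch in the spirit of BNO/RIO (not their actual implementations): a stand-in "pre-trained network" `f_mu`, a single RBF kernel instead of RIO's input-output kernel sum, and an assumed noise level:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """RBF kernel matrix between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

# Stand-in for a pre-trained point-estimated network f_mu (illustrative).
f_mu = lambda X: np.sin(3 * X).sum(-1)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X).sum(-1) + 0.3 * X[:, 0] + 0.05 * rng.normal(size=50)

# Fit a GP to the residuals y - f_mu(X); exact inference costs Theta(M^3).
resid = y - f_mu(X)
sigma_n2 = 0.05**2
K = rbf(X, X) + sigma_n2 * np.eye(len(X))
K_inv_r = np.linalg.solve(K, resid)

X_test = np.array([[0.2], [5.0]])             # one inlier, one far-away point
K_s = rbf(X_test, X)
m = K_s @ K_inv_r                              # posterior mean of residual GP
v = rbf(X_test, X_test).diagonal() - np.einsum(
    "ij,ji->i", K_s, np.linalg.solve(K, K_s.T))

# The augmented predictive: f_tilde(x) ~ N(f_mu(x) + m(x), v(x)).
mean = f_mu(X_test) + m
print(mean, v)   # far from the data, v reverts to the prior variance (~1)
```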

C.2 HYPERPARAMETER TUNING

We have shown in the main text (both theoretically and empirically) that the asymptotic performance of RGPR does not depend on the choice of its hyperparameters $(\sigma^2_l)_{l=0}^{L-1}$. Indeed, we simply set each $\sigma^2_l$ to its default value of 1 in all experiments and showed that RGPR already fixes the asymptotic overconfidence problem effectively. Nevertheless, Figure 3 hints that learning these hyperparameters might be beneficial for uncertainty estimation. Intuitively, by increasing $(\sigma^2_l)$, one can make the high-confidence (low-uncertainty) region more compact. However, if the values of $(\sigma^2_l)$ are too large, the uncertainty will be high even in the data region, resulting in underconfident predictions. Borrowing a contemporary technique from the robust learning literature (Hendrycks et al., 2019; Hein et al., 2019; Meinke & Hein, 2020, etc.), one way to train $(\sigma^2_l)$ is via the following objective, which intuitively balances high-confidence predictions on inliers against low-confidence predictions on outliers. Let $H$ be the entropy functional, $\mathcal{D}$ the training dataset, $\mathcal{D}_{\mathrm{out}}$ an outlier dataset, $\sigma^2 := (\sigma^2_l)$, and $\lambda \in \mathbb{R}$ a trade-off parameter. We define

$$\mathcal{L}(\sigma^2) := \mathbb{E}_{x_*^{(\mathrm{in})} \in \mathcal{D}}\, H\bigl(p(y_* \mid x_*^{(\mathrm{in})}, \mathcal{D}; \sigma^2)\bigr) - \lambda\, \mathbb{E}_{x_*^{(\mathrm{out})} \in \mathcal{D}_{\mathrm{out}}}\, H\bigl(p(y_* \mid x_*^{(\mathrm{out})}, \mathcal{D}; \sigma^2)\bigr), \qquad (12)$$

where the predictive distribution $p(y_* \mid x_*, \mathcal{D}; \sigma^2)$ is as defined in Section 4, with its dependence on $\sigma^2$ made explicit. In this paper, we use for the outlier dataset $\mathcal{D}_{\mathrm{out}}$ a noise dataset constructed via Gaussian blur and contrast scaling, as proposed by Hein et al. (2019). We found that this simple dataset is already sufficient to show clear improvements over the default values ($\sigma^2_l = 1$). Nevertheless, using more sophisticated outlier datasets, e.g. those used in the robust learning literature, could potentially improve the results further.
Lastly, we use trade-off values of λ = 1 and λ = 0.75 for our experiments with LeNet/ResNet-18 and DenseNet-BC-121, respectively, since we found that λ = 1 on the latter architecture generally makes the network severely underconfident.
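The objective (12) can be sketched directly from its definition. Here the predictive distributions are given as plain probability arrays; how they depend on $\sigma^2$ (through the RGPR predictive) is abstracted away:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each categorical row of p."""
    return -(p * np.log(p + eps)).sum(-1)

def tuning_loss(p_in, p_out, lam=1.0):
    """Objective (12): encourage confident (low-entropy) predictions on
    inliers and high-entropy predictions on outliers."""
    return entropy(p_in).mean() - lam * entropy(p_out).mean()

# Illustrative predictive distributions (rows sum to one).
p_in = np.array([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]])   # confident inliers
p_out = np.array([[1/3, 1/3, 1/3], [0.4, 0.3, 0.3]])    # uncertain outliers
print(tuning_loss(p_in, p_out, lam=1.0))   # negative here: the desired regime
```

In practice $\sigma^2$ would be optimized (e.g. with Adam, as described below) so that the RGPR predictive produces exactly this inlier/outlier entropy split.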

APPENDIX D ADDITIONAL EXPERIMENTS

D.1 CLASSIFICATION

We show the behavior of an RGPR-imbued image classifier (LLL) as a function of α in Figure 5. While Table 1 has already shown that RGPR makes confidence estimates close to uniform, here we show that the convergence to low confidence occurs already at moderate values of α. Furthermore, notice that at α = 1, i.e. at the test data, RGPR maintains the high confidence of the base method.

D.2 REGRESSION

To empirically validate our method and analysis (esp. Proposition 3), we present toy regression results in Figure 6. RGPR improves the BNN further: far away from the data, the error bar becomes wider. For more challenging problems, we employ a subset of the standard UCI regression datasets. Our goal here, as in the classification case, is to compare the uncertainty behavior of RGPR-augmented BNN baselines near the training data (inliers) and far away from them (outliers). The outlier dataset is constructed by sampling 1000 points from the standard Gaussian and scaling them with α = 2000. Naturally, the metric we choose is the predictive error bar (standard deviation), i.e. the same metric used in Figure 1. Following standard practice (see e.g. Sun et al. (2019)), we use a two-layer ReLU network with 50 hidden units. The Bayesian methods used are LLL, KFL, SWAG, and the stochastic variational GP (SVGP; Hensman et al., 2015) with 50 inducing points. Finally, we standardize the data and set the RGPR hyperparameter to 0.001 so that RGPR does not incur significant uncertainty on the inliers. The results are presented in Table 4. We observe that all RGPR variants retain high confidence over inlier data while yielding much larger error bars on outliers than the base methods. Furthermore, as we show in Table 5, the RGPR-augmented methods retain the base methods' predictive performance in terms of test RMSE. All in all, these findings confirm the effectiveness of RGPR for far-away outlier detection.
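The outlier construction and the role of the small RGPR hyperparameter can be sketched as follows. The added-variance formula is the DSCS kernel diagonal from Proposition 3; the input dimension and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                       # input dimension (illustrative)
alpha = 2000.0
inliers = rng.standard_normal((1000, N))    # standardized data region
outliers = alpha * inliers                  # far-away outliers, as in the text

def rgpr_var(X, sigma2=0.001):
    """Additional RGPR variance from the DSCS kernel diagonal,
    k(x, x) = sigma^2 / (3 N) * sum_i |x_i|^3.  A small sigma2 keeps the
    extra uncertainty negligible on the inliers."""
    return sigma2 / (3 * X.shape[1]) * (np.abs(X) ** 3).sum(-1)

print(np.sqrt(rgpr_var(inliers).mean()))    # tiny error bar near the data
print(np.sqrt(rgpr_var(outliers).mean()))   # huge error bar on the outliers
```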

D.3 NON-ASYMPTOTIC REGIME

Using (12), we show the results of a tuned RGPR on standard out-of-distribution (OOD) detection benchmarks for the LeNet/ResNet architectures in Tables 2 and 6. Furthermore, we show results for a deeper network (121-layer DenseNet-BC) in Table 7. We optimize (σ²_l) using Adam with learning rate 0.1 over each validation set and the noise dataset (both containing 2000 points) for 10 epochs. Note that this process is quick since no backpropagation through the network is required. In general, tuning the kernel hyperparameters of RGPR leads to significantly lower average confidence (MMC) on outliers compared to the vanilla method (LLL), which in turn leads to higher detection performance (AUR).



In the statistical learning view, $-\log p(y_m \mid f_\theta(x_m))$ is identified with the empirical risk and $-\log p(\theta)$ with the regularizer; the two views are equivalent in this regard. See Bishop (2006, Sec. 5.7.1) for more details.

By contrast, a product kernel $k_\leftrightarrow(x, x'; \sigma^2)$ is zero if any one of the $k^1_\leftrightarrow(x_i, x'_i; \sigma^2)$ is zero.



Finally, given a classification predictive distribution p(y * | x * , D), we define the predictive confidence of x * as the maximum probability conf(x * ) := max c∈{1,...,C} p(y * = c | x * , D) over class labels.

Figure 2: The construction of our kernel in 1D, as the limiting covariance of the output of a Bayesian linear model with D ReLU features. Grey curves are ReLU features while thin red curves are samples. Red shades are the ±1 standard deviations of those samples.

Figure 3: Variance of f (6) as a function of x_*. When f is a function over neural-network representations of the data (b), it captures the data region better than when f is defined only on the input space (a). Increasing the kernel hyperparameters (here assumed equal across all layers) makes the low-variance region more compact around the data (c).

Figure 4: Rotated-MNIST results (averages of 10 predictions). x-axes are rotation angles. In (a), all methods achieve similar accuracies.

(Invariance in Predictions). Let $f: \mathbb{R}^N \times \mathbb{R}^D \to \mathbb{R}^C$ be any network with posterior $\mathcal{N}(\theta \mid \mu, \Sigma)$ and let $\tilde{f}$ be obtained from $f$ via RGPR (4). Then, under the linearization of $f$, for any $x_* \in \mathbb{R}^N$ we have $\mathbb{E}_{p(\tilde{f}_* \mid x_*, \mathcal{D})} \tilde{f}_* = \mathbb{E}_{p(f_* \mid x_*, \mathcal{D})} f_*$. Proof. Simply compare the means of the Gaussians $p(\tilde{f}_* \mid x_*, \mathcal{D})$ in (8) and $p(f_* \mid x_*, \mathcal{D})$ in (1).

Figure 5: Average confidence as a function of α. Top: the vanilla LLL. Bottom: LLL with RGPR. Test data are constructed by scaling the original test sets with α. Error bars are ±1 standard deviation. Black lines are the uniform confidences. MC-integral with 1000 samples is employed.

Figure 6: Toy regression with a BNN and additionally, our RGPR. Shades represent ±1 standard deviation.

Performance of RGPR variants compared to their respective base methods on the detection of far-away outliers. Error bars are standard deviations over ten trials. For each dataset, the best value between each vanilla method and its RGPR-imbued counterpart (e.g. LLL vs. LLL-RGPR) is in bold.

OOD data detection. Datasets in bold face are the in-distribution datasets.

Confidence over test sets (i.e. α = 1) in terms of MMC. Values are averaged over ten trials. Larger is better.

Regression far-away outlier detection. Values correspond to predictive error bars (averaged over ten trials), similar to what the shades represent in Figures 1 and 2. "In" and "Out" correspond to inliers and outliers, respectively.

The predictive performance corresponding to Table 4, in terms of the RMSE metric. Values are averaged over ten trials. Smaller is better.

OOD data detection results using the hyperparameter tuning objective in (12). All values are averages and standard deviations over 10 trials.

OOD data detection results using the hyperparameter tuning objective in (12) on DenseNet-BC-121 network. All values are averages and standard deviations over 10 trials.

Calibration performance of RGPR on DenseNet-BC-121. Values are expected calibration errors (ECEs), averaged over ten prediction runs. RGPR makes the base BNN (LLL) better calibrated, even more so than the "gold standard" GP in BNO.

Optimal hyperparameter for each layer (or residual block and dense block for ResNet and DenseNet, respectively) on LLL.

Comparison between Deep Ensemble (DE) and LLL-RGPR in terms of AUR. Results for DE are obtained from (Meinke & Hein, 2020) since we use the same networks and training protocol.

APPENDIX A DERIVATIONS

A.1 THE CUBIC SPLINE KERNEL

Recall that we have a linear model $f: [c_{\min}, c_{\max}] \times \mathbb{R}^K \to \mathbb{R}$ with the ReLU feature map $\phi$, defined by $f(x; w) := w^\top \phi(x)$ over the input space $[c_{\min}, c_{\max}] \subset \mathbb{R}$, where $c_{\min} < c_{\max}$. Furthermore, $\phi$ regularly places the $K$ generalized ReLU functions centered at $(c_i)_{i=1}^K$, where $c_i = c_{\min} + \frac{i-1}{K-1}(c_{\max} - c_{\min})$, in the input space, and we consider a Gaussian prior $p(w) := \mathcal{N}(w \mid 0, \sigma^2 K^{-1} (c_{\max} - c_{\min}) I)$ over the weight $w$. Then, as $K$ goes to infinity, the distribution over the function output $f(x)$ is a Gaussian process with mean $0$ and covariance

$$\mathrm{cov}(f(x), f(x')) = \frac{\sigma^2 (c_{\max} - c_{\min})}{K} \sum_{i=1}^K \max(0, x - c_i) \max(0, x' - c_i) = \frac{\sigma^2 (c_{\max} - c_{\min})}{K} \sum_{i:\, c_i \leq \min(x, x')} (x - c_i)(x' - c_i), \qquad (9)$$

where the last equality follows from the fact that both $x$ and $x'$ must be greater than or equal to $c_i$ for the summand to be non-zero.

Let $\bar{x} := \min(x, x')$. Since (9) is a Riemann sum, in the limit of $K \to \infty$ it is expressed by the following integral:

$$k(x, x') = \sigma^2 \int_{c_{\min}}^{c_{\max}} \max(0, x - c) \max(0, x' - c)\, dc = \sigma^2\, H(\bar{x} - c_{\min}) \int_{c_{\min}}^{z} (x - c)(x' - c)\, dc,$$

where we have defined $z := \min\{\bar{x}, c_{\max}\}$. The term $H(\bar{x} - c_{\min})$ has been added in the second equality since the previous expression is zero if $\bar{x} \leq c_{\min}$ (in this region, all the ReLU functions evaluate to zero). Note that $k$ is itself a positive definite kernel. We also note that $c_{\max}$ can be chosen sufficiently large that $[-c_{\max}, c_{\max}]^d$ is guaranteed to contain the data (this holds anyway for data from bounded domains, e.g. images in $[0, 1]^d$), and thus we can set $z = \bar{x} = \min(x, x')$. In particular, for $c_{\min} = 0$, evaluating the integral yields the classic cubic-spline form $k(x, x') = \sigma^2 (\bar{x}^3/3 + |x - x'|\, \bar{x}^2/2)$, whose diagonal $k(x, x) = \sigma^2 x^3/3$ is the one used in the proof of Proposition 3.

Let $\tilde{f}$ be defined as in (4), and let $x_* \in \mathbb{R}^N$ be arbitrary. Under the linearization of $f$ w.r.t. $\theta$ around $0$, given that all $x_1, \dots, x_M$ are sufficiently close to the origin, the GP posterior of $\tilde{f}_* := \tilde{f}(x_*)$ is given by
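The limiting argument can be verified numerically: the covariance of the finite-$K$ ReLU feature model converges to the cubic-spline kernel. This is a sketch for the special case $c_{\min} = 0$, $c_{\max} = 1$; the test inputs are arbitrary points inside that interval:

```python
import numpy as np

def empirical_cov(x, xp, K, c_min=0.0, c_max=1.0, sigma2=1.0):
    """Covariance of f(x) = w^T phi(x) with K ReLU features on a regular grid
    and prior w ~ N(0, sigma2 * (c_max - c_min) / K * I)."""
    c = np.linspace(c_min, c_max, K)
    phi = lambda t: np.maximum(0.0, t - c)
    return sigma2 * (c_max - c_min) / K * phi(x) @ phi(xp)

def cubic_spline_kernel(x, xp, sigma2=1.0):
    """K -> infinity limit for c_min = 0 (classic cubic-spline kernel):
    sigma^2 * (min^3 / 3 + |x - x'| * min^2 / 2)."""
    z = min(x, xp)
    return sigma2 * (z**3 / 3.0 + abs(x - xp) * z**2 / 2.0)

x, xp = 0.3, 0.7
for K in [10, 100, 10000]:
    # The Riemann sum approaches the closed-form kernel as K grows.
    print(K, empirical_cov(x, xp, K), cubic_spline_kernel(x, xp))
```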

APPENDIX B PROOFS

where $\mu$ and $\Sigma$ are the mean and covariance of the posterior of the linearized network, respectively.

Proof. Under the linearization of $f$ w.r.t. $\theta$ around $0$, the claim follows from the derivation of the GP posterior with explicit basis functions given above.

Finally, we show the calibration performance of RGPR on the DenseNet in Table 8. We observe that the base BNN we use, LLL, does not necessarily give good calibration performance. Applying RGPR improves this, making LLL better calibrated than the "gold standard" baseline BNO. We also compare LLL-RGPR to Deep Ensemble (DE) (Lakshminarayanan et al., 2017), which has been shown to perform better than Bayesian methods (Ovadia et al., 2019). As we can see in Table 10, LLL-RGPR is competitive with DE. These results further reinforce our finding that RGPR is also useful in the non-asymptotic regime.

Inspecting the optimal hyperparameters (σ²_l), we found that high kernel variances on higher layers tend to be detrimental to the uncertainty estimate as measured by (12), leading to low variance values on those layers, cf. Table 9. Specifically, for the LeNet architecture, we found that having high kernel variance on the input (the bottom-most layer) is desirable. Meanwhile, the first residual block and the second dense block are the most impactful in terms of uncertainty estimation for the ResNet and DenseNet architectures, respectively.

