FIXING ASYMPTOTIC UNCERTAINTY OF BAYESIAN NEURAL NETWORKS WITH INFINITE RELU FEATURES

Abstract

Approximate Bayesian methods can mitigate overconfidence in ReLU networks. However, far away from the training data, even Bayesian neural networks (BNNs) can still underestimate uncertainty and thus be overconfident. We propose to fix this by considering an infinite number of ReLU features over the input domain that are never part of the training process and thus remain at their prior values. Perhaps surprisingly, we show that this model leads to a tractable Gaussian process (GP) term that can be added to a pre-trained BNN's posterior at test time with negligible cost overhead. The BNN then yields structured uncertainty in the proximity of training data, while the GP prior calibrates uncertainty far away from them. As a key contribution, we prove that the added uncertainty yields cubic predictive variance growth, and thus the ideal uniform (maximum entropy) confidence in multi-class classification far from the training data.

1. INTRODUCTION

Calibrated uncertainty is crucial for safety-critical decision making by neural networks (NNs) (Amodei et al., 2016). Standard training methods of NNs yield point estimates that, even if they are highly accurate, can still be severely overconfident (Guo et al., 2017). Approximate Bayesian methods, which turn NNs into Bayesian neural networks (BNNs), can be used to address this issue. Kristiadi et al. (2020) recently showed that for binary ReLU classification networks, far away from the training data (more precisely: when scaling any input x by a scalar α > 0 and taking the limit α → ∞), the uncertainty of BNNs can be bounded away from zero. This is an encouraging result when put in contrast to standard point-estimated networks, for which Hein et al. (2019) showed earlier that the same asymptotic limit always yields arbitrarily high (over-)confidence. Nevertheless, BNNs can still be asymptotically overconfident (albeit less so than standard NNs), since the aforementioned uncertainty bound can be loose. This issue is our principal interest in this paper.

An intuitive interpretation is that ReLU NNs "miss out on some uncertainty" even in their Bayesian formulation, because they fit a finite number of ReLU features to the training data by "moving around" these features within the coverage of the data. This process has no means to encode the desideratum that the model should be increasingly uncertain away from the data. In this work, we "add in" the missing uncertainty by considering an infinite number of additional ReLU features, spaced at regular intervals away from the data in the input and hidden spaces of the network. Since these features have negligible values in the data region, they do not contribute to the training process. Hence, we can retain a prior for their weights, chosen to be an independent Gaussian, and arrive at a specific Gaussian process (GP) whose covariance function is a generalization of the classic cubic-spline kernel (Wahba, 1990).
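To see why regularly spaced ReLU features with i.i.d. Gaussian weights induce a cubic-spline-type covariance, consider a minimal 1D sketch (illustrative, not the paper's exact construction): features max(0, x − c) on a grid of locations c, with weight variance scaled by the grid spacing. As the spacing shrinks, the covariance of the resulting random function approaches the classic cubic-spline kernel, whose marginal variance k(x, x) = x³/3 grows cubically.

```python
import numpy as np

def relu_feature_kernel(x1, x2, centers, sigma2=1.0, spacing=1.0):
    """Prior covariance of f(x) = sum_i w_i * max(0, x - c_i) with
    independent weights w_i ~ N(0, sigma2 * spacing)."""
    phi1 = np.maximum(0.0, x1 - centers)
    phi2 = np.maximum(0.0, x2 - centers)
    return sigma2 * spacing * float(np.sum(phi1 * phi2))

def cubic_spline_kernel(x1, x2, sigma2=1.0):
    """Limit of the feature sum as the grid spacing goes to zero
    (the classic cubic-spline kernel, for x1, x2 >= 0)."""
    m = min(x1, x2)
    return sigma2 * (m**3 / 3.0 + abs(x1 - x2) * m**2 / 2.0)

# A fine grid of ReLU features approaches the closed-form kernel,
# and the prior variance k(x, x) = x^3 / 3 grows cubically in x:
spacing = 0.001
centers = np.arange(0.0, 20.0, spacing)
for x in [1.0, 5.0, 10.0]:
    approx = relu_feature_kernel(x, x, centers, spacing=spacing)
    exact = cubic_spline_kernel(x, x)
    assert abs(approx - exact) / exact < 1e-2
```

The features near a test point x contribute (x − c)² each; integrating over c up to x gives the x³/3 variance term, which is the source of the cubic growth discussed below.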
This GP prior can be added to any pre-trained ReLU BNN as a simple augmentation of its output. Considering the additive combination of the parametric BNN and the GP prior together, we arrive at another view of the method: it approximates the "full GP posterior" that models the residual of a point-estimated NN (Blight & Ott, 1975; Qiu et al., 2020). In our factorization, the BNN models uncertainty around the training data, while the GP prior models uncertainty far away from them. By factorizing these two parts from each other, our formulation requires no (costly) GP posterior inference, and thus offers lightweight, modular uncertainty calibration. See Fig. 1 for an illustration.

[Figure 1: Toy classification with a BNN and our method. Shade represents confidence; the suffix "ZO" stands for "zoomed-out". Far away from the training data, vanilla BNNs can still be overconfident (a, c). Our method fixes this issue while keeping predictions unchanged (b, d).]

Theoretical analysis is a core contribution of this work. We show that the proposed method (i) preserves the predictive performance of the base ReLU BNN. Furthermore, it (ii) ensures that the output variance asymptotically grows cubically in the distance to the training data, and thus (iii) yields uniform asymptotic confidence in the multi-class classification setting. These results extend those of Kristiadi et al. (2020) insofar as their analysis is limited to the binary classification case and their bound can be loose. Furthermore, our approach is complementary to the method of Meinke & Hein (2020), which attains maximum uncertainty far from the data for non-Bayesian, point-estimated NNs. Finally, our empirical evaluation confirms our analysis and shows that the proposed method also improves uncertainty estimates in the non-asymptotic regime.
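The test-time augmentation described above can be sketched in a few lines: add the GP prior variance to the BNN's predictive variance before squashing the logits. In this illustrative snippet, `bnn_mean` and `bnn_var` are placeholders for a pre-trained BNN's linearized predictive moments, and the cubic term stands in for the paper's generalized cubic-spline kernel; all numbers are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def probit_softmax_confidence(mean, var):
    """Generalized probit approximation of the predictive confidence
    max_c E[softmax(f)_c] under f ~ N(mean, diag(var))."""
    return float(softmax(mean / np.sqrt(1.0 + np.pi / 8.0 * var)).max())

C = 4
rng = np.random.default_rng(0)
bnn_mean = rng.normal(size=C)      # placeholder: linearized BNN predictive mean
bnn_var = np.full(C, 0.5)          # placeholder: linearized BNN predictive variance

confs = []
for alpha in [1.0, 100.0, 1e6]:
    # Scaling an input x by alpha scales the logit mean roughly linearly
    # (ReLU nets are piecewise affine), while the added GP prior variance
    # grows cubically, so the rescaled logits shrink toward zero:
    gp_var = alpha**3 / 3.0
    confs.append(probit_softmax_confidence(alpha * bnn_mean, bnn_var + gp_var))
# confs[-1] is close to the uniform confidence 1/C = 0.25
```

Because the variance in the probit denominator grows like α³ while the mean grows like α, the scaled logits vanish as α → ∞ and the confidence tends to 1/C, which is the uniform-confidence behavior the analysis establishes.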
2. BACKGROUND

2.1 BAYESIAN NEURAL NETWORKS

Let f : R^N × R^D → R^C, defined by (x, θ) ↦ f(x; θ) =: f_θ(x), be a neural network. Here, θ is the collection of all parameters of f. Given an i.i.d. dataset D := {(x_m, y_m)}_{m=1}^M, the standard training procedure amounts to finding a point estimate θ* of the parameters θ, which can be identified in the Bayesian framework with maximum a posteriori (MAP) estimation¹

θ* = arg max_θ log p(θ | D) = arg max_θ Σ_{m=1}^M log p(y_m | f_θ(x_m)) + log p(θ).

While this point estimate may yield highly accurate predictions, it does not encode uncertainty over θ, causing an overconfidence problem (Hein et al., 2019).

Bayesian methods can mitigate this issue, specifically by treating the parameters θ of f as a random variable and applying Bayes' rule. The resulting network is called a Bayesian neural network (BNN). The common way to approximate the posterior p(θ | D) of a BNN is by a Gaussian q(θ | D) = N(θ | µ, Σ), which can be constructed, for example, by a Laplace approximation (MacKay, 1992b) or variational Bayes (Hinton & Van Camp, 1993). Given such an approximate posterior q(θ | D) and a test point x* ∈ R^N, one then needs to marginalize over the parameters to make predictions, i.e. we compute the integral y* = ∫ h(f(x*; θ)) q(θ | D) dθ, where h is an inverse link function, such as the identity function for regression, or the logistic-sigmoid and softmax functions for binary and multi-class classification, respectively. Since the network f is a non-linear function of θ, this integral does not have an analytic solution. However, one can obtain a useful approximation by linearizing f around µ, which yields the following marginal distribution over the function output f(x*):

p(f(x*) | x*, D) ≈ N(f(x*) | f(x*; µ), J* Σ J*^T),    (1)

where J* is the Jacobian of f(x*; θ) w.r.t. θ, evaluated at µ. (In the case of a real-valued network f, we use the gradient g* := ∇_θ f(x*; θ)|_µ instead of J*, so the variance is g*^T Σ g*.) This distribution can then be used as the predictive distribution p(y* | x*, D) in the regression case. For classification, we need a further approximation since h is not the identity function. One such approximation is the generalized probit approximation.

¹ In the statistical learning view, −log p(y_m | f_θ(x_m)) is identified with the empirical risk and −log p(θ) with the regularizer; the two views are equivalent in this regard. See Bishop (2006, Sec. 5.7.1) for more details.

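The linearized predictive distribution of Eq. (1) can be made concrete with a minimal numerical sketch. The tiny two-layer network, the finite-difference Jacobian (standing in for backpropagation), and all sizes and numbers below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def f(x, theta):
    """Tiny two-layer ReLU network R^2 -> R^3; theta packs both weight matrices."""
    W1 = theta[:8].reshape(4, 2)
    W2 = theta[8:].reshape(3, 4)
    return W2 @ np.maximum(0.0, W1 @ x)

def jacobian_wrt_theta(x, theta, eps=1e-6):
    """Finite-difference Jacobian J* of f(x; theta) w.r.t. theta (shape C x D)."""
    f0 = f(x, theta)
    J = np.zeros((f0.size, theta.size))
    for d in range(theta.size):
        t = theta.copy()
        t[d] += eps
        J[:, d] = (f(x, t) - f0) / eps
    return J

rng = np.random.default_rng(1)
mu = rng.normal(size=20)            # stand-in for the MAP estimate / posterior mean
Sigma = 0.1 * np.eye(20)            # stand-in for a Laplace posterior covariance
x_star = rng.normal(size=2)

J = jacobian_wrt_theta(x_star, mu)
pred_mean = f(x_star, mu)           # Eq. (1): N(f(x*) | f(x*; mu), J Sigma J^T)
pred_cov = J @ Sigma @ J.T

# Generalized probit approximation of the class probabilities:
z = pred_mean / np.sqrt(1.0 + np.pi / 8.0 * np.diag(pred_cov))
probs = np.exp(z - z.max())
probs /= probs.sum()
```

The predictive covariance J* Σ J*^T is symmetric positive semi-definite by construction, which is what lets the GP prior variance later be added to it class-wise before the probit step.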
