FIXING ASYMPTOTIC UNCERTAINTY OF BAYESIAN NEURAL NETWORKS WITH INFINITE RELU FEATURES

Abstract

Approximate Bayesian methods can mitigate overconfidence in ReLU networks. However, far away from the training data, even Bayesian neural networks (BNNs) can still underestimate uncertainty and thus be overconfident. We suggest fixing this by considering an infinite number of ReLU features over the input domain that are never part of the training process and thus remain at their prior values. Perhaps surprisingly, we show that this model leads to a tractable Gaussian process (GP) term that can be added to a pre-trained BNN's posterior at test time with negligible cost overhead. The BNN then yields structured uncertainty in the proximity of the training data, while the GP prior calibrates uncertainty far away from it. As a key contribution, we prove that the added uncertainty yields cubic predictive variance growth, and thus the ideal uniform (maximum entropy) confidence in multi-class classification far from the training data.

1. INTRODUCTION

Calibrated uncertainty is crucial for safety-critical decision making by neural networks (NNs) (Amodei et al., 2016). Standard training methods of NNs yield point estimates that, even if they are highly accurate, can still be severely overconfident (Guo et al., 2017). Approximate Bayesian methods, which turn NNs into Bayesian neural networks (BNNs), can be used to address this issue. Kristiadi et al. (2020) recently showed that for binary ReLU classification networks, far away from the training data (more precisely: when scaling any input x with a scalar α > 0 and taking the limit α → ∞), the uncertainty of BNNs can be bounded away from zero. This is an encouraging result when contrasted with standard point-estimated networks, for which Hein et al. (2019) showed earlier that the same asymptotic limit always yields arbitrarily high (over-)confidence. Nevertheless, BNNs can still be asymptotically overconfident (albeit less so than standard NNs), since the aforementioned uncertainty bound can be loose. This issue is our principal interest in this paper.

An intuitive interpretation is that ReLU NNs "miss out on some uncertainty" even in their Bayesian formulation: they fit a finite number of ReLU features to the training data by "moving around" these features within the coverage of the data. This process has no means to encode the desideratum that the model should become increasingly uncertain away from the data.

In this work, we "add in" the missing uncertainty by considering an infinite number of additional ReLU features, spaced at regular intervals away from the data in the input and hidden spaces of the network. Since these features have negligible values in the data region, they do not contribute to the training process. Hence, we can place a prior on their weights, chosen to be an independent Gaussian, and arrive at a specific Gaussian process (GP) whose covariance function is a generalization of the classic cubic-spline kernel (Wahba, 1990).
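To make this construction concrete, here is a minimal one-dimensional sketch (not the paper's implementation; the function names, the grid placement on [0, c_max], and the unit weight variance are illustrative assumptions): placing ReLU features max(0, x − c) on a dense grid of locations c ≥ 0, each with an independent zero-mean Gaussian weight whose variance is proportional to the grid spacing, yields in the infinite-feature limit a GP with the cubic-spline kernel k(x, x') = σ²(m³/3 + |x − x'| m²/2), where m = min(x, x'), for x, x' ≥ 0:

```python
import numpy as np

def cubic_spline_kernel(x1, x2, sigma2=1.0):
    """Closed-form kernel induced by infinitely many ReLU features
    max(0, x - c), c >= 0, with independent Gaussian weights
    (the classic cubic-spline kernel)."""
    m = np.minimum(x1, x2)
    return sigma2 * (m**3 / 3.0 + np.abs(x1 - x2) * m**2 / 2.0)

def relu_feature_kernel(x1, x2, sigma2=1.0, c_max=50.0, n_feats=200_000):
    """Riemann-sum approximation: a dense grid of ReLU features at
    locations c, each weight w_c ~ N(0, sigma2 * dc)."""
    c = np.linspace(0.0, c_max, n_feats)
    dc = c[1] - c[0]
    phi1 = np.maximum(0.0, x1 - c)  # feature activations at x1
    phi2 = np.maximum(0.0, x2 - c)  # feature activations at x2
    # cov(f(x1), f(x2)) = sum_c E[w_c^2] * phi_c(x1) * phi_c(x2)
    return sigma2 * dc * np.sum(phi1 * phi2)

print(relu_feature_kernel(1.5, 3.0))   # ≈ 2.8125
print(cubic_spline_kernel(1.5, 3.0))   # = 2.8125
```

Note that k(x, x) = σ²|x|³/3, i.e., the prior variance of this GP grows cubically with the input magnitude; this is the behavior that the asymptotic results below rely on.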
This GP prior can be added to any pre-trained ReLU BNN as a simple augmentation to its output. Considering the additive combination of the parametric BNN and the GP prior together, we arrive at another view of the method: it approximates the "full GP posterior" that models the residual of a point-estimated NN (Blight & Ott, 1975; Qiu et al., 2020). In our factorization, the BNN models uncertainty around the training data, while the GP prior models uncertainty far away from it. By factorizing these two parts from each other, our formulation requires no (costly) GP posterior inference, and thus offers lightweight, modular uncertainty calibration. See Fig. 1 for an illustration.

Theoretical analysis is a core contribution of this work. We show that the proposed method (i) preserves the predictive performance of the base ReLU BNN. Furthermore, it (ii) yields cubic growth of the predictive variance away from the training data, and thus the ideal uniform (maximum entropy) confidence in the asymptotic limit.
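A toy sketch of why cubic variance growth forces asymptotically uniform confidence (all quantities here are hypothetical stand-ins, not the paper's model): a ReLU network is piecewise affine, so its output mean grows at most linearly in the input scaling α, while the added GP prior variance grows cubically; under MacKay's probit approximation to the Gaussian-predictive sigmoid integral, the binary confidence then decays toward 1/2, the maximum-entropy value:

```python
import numpy as np

def probit_confidence(mean, var):
    """MacKay's probit approximation to E[sigmoid(f)] for f ~ N(mean, var):
    sigmoid(mean / sqrt(1 + (pi/8) * var))."""
    return 1.0 / (1.0 + np.exp(-mean / np.sqrt(1.0 + (np.pi / 8.0) * var)))

def cubic_prior_var(alpha, sigma2=1.0):
    """Prior variance of the infinite-ReLU GP at input scale alpha: k(x, x) ∝ |x|^3."""
    return sigma2 * abs(alpha) ** 3 / 3.0

for alpha in [1.0, 10.0, 100.0, 1000.0]:
    mean = 2.0 * alpha   # hypothetical: BNN output mean grows linearly in alpha
    var_bnn = 1.0        # hypothetical: bounded BNN predictive variance
    conf = probit_confidence(mean, var_bnn + cubic_prior_var(alpha))
    print(f"alpha={alpha:7.1f}  confidence={conf:.3f}")
```

Since the squared mean grows like α² while the variance grows like α³, the probit argument decays like α^(-1/2), so the confidence approaches 0.5; with only a bounded BNN variance, it would instead saturate near 1, which is exactly the asymptotic overconfidence the added GP term removes.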

