EXPLORING THE UNCERTAINTY PROPERTIES OF NEURAL NETWORKS' IMPLICIT PRIORS IN THE INFINITE-WIDTH LIMIT

Abstract

Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failure under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the best-calibrated models on OOD data. Inspired by this, we leverage recent theoretical advances that characterize the function-space prior of an infinitely wide NN as a Gaussian process, termed the neural network Gaussian process (NNGP). We use the NNGP with a softmax link function to build a probabilistic model for multi-class classification and marginalize over the latent Gaussian outputs to sample from the posterior. This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue. We also examine the calibration of previous approaches to classification with the NNGP, which treat classification problems as regression to the one-hot labels. In this case the Bayesian posterior is exact, and we compare several heuristics for generating a categorical distribution over classes. We find these methods are well calibrated under distributional shift. Finally, we consider an infinite-width final layer in conjunction with a pre-trained embedding. This replicates the important practical use case of transfer learning and allows scaling to significantly larger datasets. As well as achieving competitive predictive accuracy, this approach is better calibrated than its finite-width analogue.

1. INTRODUCTION

Large, neural network (NN) based models have demonstrated remarkable predictive performance on test data drawn from the same distribution as their training data, but the demands of real-world applications often require stringent levels of robustness to novel, changing, or shifted distributions. Specifically, we might ask that our models be calibrated: aggregated over many predictions, well-calibrated models report confidences that are consistent with measured performance. The Brier score (BS), expected calibration error (ECE), and negative log-likelihood (NLL) are common measurements of calibration (Brier, 1950; Naeini et al., 2015; Gneiting & Raftery, 2007). Empirically, there are many concerning findings about the calibration of deep learning techniques, particularly on out-of-distribution (OOD) data, whose distribution differs from that of the training data. For example, MacKay (1992) showed that non-Bayesian NNs are overconfident away from the training data, and Hein et al. (2019) confirmed this theoretically and empirically for deep NNs that use ReLU. For in-distribution data, post-hoc calibration techniques such as temperature scaling tuned on a validation set (Platt et al., 1999; Guo et al., 2017) often give excellent results; however, such methods have not been found to be robust on shifted data, and indeed sometimes even reduce calibration on such data (Ovadia et al., 2019). Thus, finding ways to detect or build models that produce reliable probabilities when making predictions on OOD data is a key challenge. Sometimes data can be only slightly OOD or can shift from the training distribution gradually over time. This is called dataset shift (Quionero-Candela et al., 2009) and is important in practice, for example for models dealing with seasonality effects.
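As an illustration of the metrics named above, ECE is computed by binning predictions by confidence and comparing each bin's average confidence to its empirical accuracy. A minimal sketch (the function name, bin count, and equal-width binning scheme are our own illustrative choices):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE (Naeini et al., 2015): bin predictions by confidence and
    average the |accuracy - confidence| gap, weighted by bin size.

    probs:  (N, K) array of predicted class probabilities.
    labels: (N,) array of integer class labels.
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of points in bin
    return ece
```

Note that a fully confident and always-correct model achieves an ECE of 0, as does (on a balanced dataset) a uniformly random predictor, which is why ECE is not a proper scoring rule.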
While perfect calibration under arbitrary distributional shift is impossible, simulating plausible kinds of distributional shift that may occur in practice, at different intensities, can be a useful tool for evaluating the calibration of existing models. A recently proposed benchmark takes this approach (Ovadia et al., 2019). For example, using several kinds of common image corruptions applied at various intensities, the authors observed the degradation in accuracy expected of models trained only on clean images (Hendrycks & Dietterich, 2019; Mu & Gilmer, 2019), but also saw very different levels of calibration, with deep ensembles (Lakshminarayanan et al., 2017) proving the best.

Calibration metrics. There are many common metrics used to evaluate the calibration of a model. Probably the most common is the negative log-likelihood (NLL), which is a proper scoring rule and whose unit, the nat, is a coherent unit of information entropy. The Brier score (BS) of Brier (1950) is also a proper scoring rule, but its scale is harder to interpret and its use of the squared loss makes its behaviour depend on how frequent events are. We also consider the expected calibration error (ECE) of Naeini et al. (2015), which is not a proper scoring rule; in particular, uniformly random predictions on a balanced classification dataset will produce an ECE of 0. However, since it measures the absolute difference between predicted and observed probabilities, the scale of ECE is easier to interpret. Finally, since the previous metrics require labels, we consider the average confidence and entropy on OOD data, where the confidence is the probability assigned to the most likely class and the entropy is the entropy of the predicted distribution over labels. Ovadia et al. (2019) has more discussion on calibration metrics.

Bridging Bayesian Learning and Neural Networks.
In principle, Bayesian methods provide a promising way to tackle calibration, allowing us to define models with specific aleatoric and epistemic uncertainty and to perform inference under them. Typically, the datasets on which deep learning has proven successful have a high signal-to-noise ratio (SNR), meaning epistemic uncertainty is dominant, and model averaging is crucial because our overparameterized models are not determined by the training data. Indeed, Wilson (2020) argues that ensembles are a kind of Bayesian model average. Ongoing theoretical work has built a bridge between NNs and Bayesian methods (Neal, 1994a; Lee et al., 2018; Matthews et al., 2018b) by identifying NNs as converging to Gaussian processes in the limit of very large width. Specifically, the neural network Gaussian process (NNGP) describes the prior on function space that is realized by an i.i.d. prior over the parameters. The function-space prior is a GP with a specific kernel that is defined recursively with respect to the layers. While the many heuristics used in training NNs may obfuscate the issue, little is known theoretically about the uncertainty properties implied by even basic architectures and initializations of NNs; indeed, theoretically understanding overparameterized NNs is a major open problem. With the NNGP prior in hand, it is possible to disambiguate between the uncertainty properties of the NN prior and those due to specific optimization decisions by performing Bayesian inference. Moreover, it is only in this infinite-width limit that the posterior of a Bayesian NN can be computed exactly.
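The layer-wise recursion defining the NNGP kernel has a closed form for fully connected ReLU networks (the arc-cosine expressions of Cho & Saul, used in Lee et al., 2018). A minimal sketch for two inputs (function name and default hyperparameters are our own choices; real implementations vectorize over whole datasets):

```python
import numpy as np

def nngp_relu_kernel(x1, x2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """Recursive NNGP kernel for a depth-`depth` fully connected ReLU
    network with i.i.d. N(0, sigma_w2 / fan_in) weights and
    N(0, sigma_b2) biases. x1, x2 are 1-D inputs; returns a scalar."""
    d = x1.shape[0]
    # Input-layer covariances.
    k11 = sigma_b2 + sigma_w2 * (x1 @ x1) / d
    k22 = sigma_b2 + sigma_w2 * (x2 @ x2) / d
    k12 = sigma_b2 + sigma_w2 * (x1 @ x2) / d
    for _ in range(depth):
        # Closed form of E[relu(u) relu(v)] for (u, v) ~ N(0, K).
        cos_t = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        theta = np.arccos(cos_t)
        k12 = sigma_b2 + sigma_w2 / (2 * np.pi) * np.sqrt(k11 * k22) * (
            np.sin(theta) + (np.pi - theta) * cos_t)
        # E[relu(u)^2] = k11 / 2 for zero-mean Gaussian u.
        k11 = sigma_b2 + sigma_w2 * k11 / 2
        k22 = sigma_b2 + sigma_w2 * k22 / 2
    return k12
```

With sigma_w2 = 2 ("He" scaling) and no bias, the diagonal of the kernel is preserved layer to layer, which is why this parameterization is the standard choice for deep ReLU limits.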

1.1. SUMMARY OF CONTRIBUTIONS

This work is the first extensive evaluation of the uncertainty properties of infinite-width NNs. Unlike previous work, we construct a valid probabilistic model for classification tasks using the NNGP, i.e., the prediction for a label is always a categorical distribution. We perform neural network Gaussian process classification (NNGP-C) using a softmax link function to exactly mirror NNs used in practice. Note that prior work on the NNGP has also used a softmax link function, but considered approximate inference using inducing points on MNIST (Garriga-Alonso et al., 2019). We perform a detailed comparison of NNGP-C against its corresponding NN on clean, OOD, and shifted test data and find NNGP-C to be significantly better calibrated and more accurate than the NN. Next, we evaluate the calibration of neural network Gaussian process regression (NNGP-R) on both UCI regression problems and classification on CIFAR10. As the posterior of NNGP-R is a multivariate normal, and so not a categorical distribution, a heuristic must be used to calculate confidences for classification problems. On the full benchmark of Ovadia et al. (2019), we compare several such heuristics against one another, as well as against the standard RBF kernel and ensemble methods. We find the calibration of NNGP-R to be competitive with the best results reported in Ovadia et al. (2019). However in the process of preparing our findings for publication, newer strong baselines have been
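To make the regression-to-classification setup concrete, one plausible heuristic (our own illustration, not necessarily one of the heuristics evaluated above) turns the Gaussian posterior over latent one-hot regressors into a categorical distribution by Monte Carlo: sample latent outputs and record how often each class attains the maximum.

```python
import numpy as np

def regression_to_categorical(mean, cov, n_samples=1000, rng=None):
    """Hypothetical heuristic: convert a Gaussian posterior over K
    latent one-hot regression outputs into class probabilities.

    mean: (K,) posterior mean of the latent outputs for one test point.
    cov:  (K, K) posterior covariance of those outputs.
    Returns a length-K vector of argmax frequencies (sums to 1).
    """
    rng = np.random.default_rng(rng)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)  # (S, K)
    winners = samples.argmax(axis=1)
    return np.bincount(winners, minlength=mean.shape[0]) / n_samples
```

Simpler alternatives, such as a softmax of the posterior mean (optionally temperature-scaled by the posterior variance), avoid sampling entirely; the comparison of such choices is exactly the kind of question the benchmark above probes.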

