FAST PREDICTIVE UNCERTAINTY FOR CLASSIFICATION WITH BAYESIAN DEEP NETWORKS

Abstract

In Bayesian Deep Learning, distributions over the output of classification neural networks are typically approximated by first constructing a Gaussian distribution over the weights, then sampling from it to obtain a distribution over the categorical output. This is costly. We extend existing work to construct a Dirichlet approximation of this output distribution, yielding an analytic map between Gaussian distributions in logit space and Dirichlet distributions in the output space. We argue that the resulting Dirichlet distribution has theoretical and practical advantages, in particular more efficient computation of the uncertainty estimate, scaling to large networks and datasets such as DenseNet and ImageNet. We demonstrate the use of this Dirichlet approximation by constructing a lightweight uncertainty-aware output ranking for the ImageNet setup.

1. INTRODUCTION

Quantifying the uncertainty of Neural Networks' (NNs) predictions is important in safety-critical applications such as medical diagnosis (Begoli et al., 2019) and self-driving vehicles (McAllister et al., 2017; Michelmore et al., 2018). Architectures for classification tasks produce a probability distribution as their output, constructed by applying the softmax to the point-estimate output of the penultimate layer. However, it has been shown that this distribution is overconfident (Nguyen et al., 2015; Hein et al., 2019) and thus cannot be used for predictive uncertainty quantification. Approximate Bayesian methods provide quantified uncertainty over the network's parameters, and thus over its outputs, in a tractable fashion. The commonly used Gaussian approximate posterior (MacKay, 1992a; Graves, 2011; Blundell et al., 2015; Ritter et al., 2018) approximately induces a Gaussian distribution over the logits of a NN (MacKay, 1995). But the associated predictive distribution does not have an analytic form; it is thus generally approximated by Monte Carlo (MC) integration, which requires multiple samples. Predictions in Bayesian Neural Networks (BNNs) are therefore generally expensive operations. In this paper, we re-consider an old but largely overlooked idea originally proposed by MacKay (1998) in a different setting (arguably the inverse of the Deep Learning setting), which transforms a Dirichlet distribution into a Gaussian. Dirichlet distributions are generally defined on the probability simplex. But when the variable is expressed in the domain of the inverse softmax, the distribution's shape effectively approximates a Gaussian. The inverse of this approximation, which will be called the Laplace Bridge here (Hennig et al., 2012), analytically maps a Gaussian distribution onto a Dirichlet distribution. Given a Gaussian distribution over the logits of a NN, one can thus efficiently obtain an approximate Dirichlet distribution over the softmax outputs (Figure 1).
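To make the cost of MC integration concrete, the standard sampling-based predictive distribution can be sketched as follows. This is only an illustration under the assumption of a Gaussian over the logits, not the paper's code; `n_samples` and the Gaussian parameters are placeholders:

```python
import numpy as np

def mc_predictive(mu, Sigma, n_samples=1000, seed=0):
    """Monte Carlo approximation of the predictive distribution:
    draw logit samples from N(mu, Sigma), push each through the
    softmax, and average the resulting categorical distributions."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(mu, Sigma, size=n_samples)  # (n_samples, K)
    e = np.exp(z - z.max(axis=1, keepdims=True))            # stable softmax
    probs = e / e.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)
```

Each prediction costs `n_samples` softmax evaluations (and, in a real BNN where the logit Gaussian is itself induced by a weight posterior, up to `n_samples` forward passes), which is the expense the analytic map of this paper avoids.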
Our contributions are: We re-visit MacKay's derivation with particular attention to a symmetry constraint that becomes necessary in our "inverted" use of the argument, from the Gaussian to the Dirichlet family. We then validate the quality of this approximation both theoretically and empirically, and demonstrate significant speed-ups over MC integration. Finally, we show a use-case, leveraging the analytic properties of Dirichlets to improve the popular top-k metric through uncertainties. Section 2 provides the mathematical derivation. Section 3 discusses the Laplace Bridge in the context of NNs. We compare it to recent approximations of the predictive distributions of NNs in Section 4. Experiments are presented in Section 5.
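To give a flavor of the top-k use-case: the marginals of a Dirichlet are Beta distributions, $\pi_k \sim \mathrm{Beta}(\alpha_k, \sum_l \alpha_l - \alpha_k)$, so an uncertainty-aware ranking can include every class whose marginal plausibly overlaps that of the top-ranked class. The sketch below is our illustration only; the one-standard-deviation overlap criterion is a simplification we chose for brevity, not necessarily the exact rule used in the experiments:

```python
import numpy as np

def uncertainty_aware_topk(alpha):
    """Return the classes whose Beta marginal overlaps, within one standard
    deviation, with the top-ranked class's marginal.
    Marginals of Dir(alpha): pi_k ~ Beta(alpha_k, alpha_0 - alpha_k)."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    mean = alpha / a0
    std = np.sqrt(alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1)))
    order = np.argsort(-mean)
    top = order[0]
    lower_top = mean[top] - std[top]
    # keep every class whose plausible upper value reaches the top's lower value
    return [int(k) for k in order if mean[k] + std[k] >= lower_top]
```

The point is that mean and variance of each class marginal are available in closed form from $\alpha$, so such a ranking costs essentially nothing compared to sampling.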

2. THE LAPLACE BRIDGE

Laplace approximations* are a popular and lightweight method to approximate a general probability distribution $q(x)$ with a Gaussian $\mathcal{N}(x \mid \mu, \Sigma)$ when $q(x)$ is twice differentiable and the Hessian at the mode is positive definite. The scheme sets $\mu$ to a mode of $q$, and $\Sigma = -(\nabla^2 \log q(x)\vert_{\mu})^{-1}$, the inverse Hessian of $\log q$ at that mode. It can work well if the true distribution is unimodal and defined on a real vector space.

The Dirichlet distribution, which has the density function

$$\mathrm{Dir}(\pi \mid \alpha) := \frac{\Gamma\!\left(\sum_{k=1}^K \alpha_k\right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k^{\alpha_k - 1},$$

is defined on the probability simplex and can be "multimodal" in the sense that the density diverges in the $k$-th corner of the simplex when $\alpha_k < 1$. Both issues preclude a Laplace approximation, at least in the naïve form described above. However, MacKay (1998) noted that both can be fixed, elegantly, by a change of variable (see Figure 2). Details of the following argument can be found in the supplements.

Consider the $K$-dimensional variable $\pi \sim \mathrm{Dir}(\pi \mid \alpha)$ defined as the softmax of $z \in \mathbb{R}^K$:

$$\pi_k(z) := \frac{\exp(z_k)}{\sum_{l=1}^K \exp(z_l)}, \qquad \text{for all } k = 1, \dots, K.$$

We will call $z$ the logit of $\pi$. When expressed as a function of $z$, the density of the Dirichlet in $\pi$ has to be multiplied by the absolute value of the determinant of the Jacobian, $\left|\det \frac{\partial \pi}{\partial z}\right| = \prod_k \pi_k(z)$, thus removing the $-1$ terms in the exponent:

$$\mathrm{Dir}_z(\pi(z) \mid \alpha) := \frac{\Gamma\!\left(\sum_{k=1}^K \alpha_k\right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k(z)^{\alpha_k}.$$

This density of $z$, the Dirichlet distribution in the softmax basis, can now be accurately approximated by a Gaussian through a Laplace approximation, yielding an analytic map from the parameter space $\alpha \in \mathbb{R}_+^K$ to the parameter space of the Gaussian ($\mu \in \mathbb{R}^K$ and symmetric positive definite $\Sigma \in \mathbb{R}^{K \times K}$).
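The construction above can be checked numerically. The sketch below is our illustration: it finds the mode of the softmax-basis density by gradient descent and inverts the analytic Hessian there. The quadratic penalty in $\sum_k z_k$ is one simple way to pin down the softmax's shift invariance, standing in for the symmetry constraint discussed earlier; it is an assumption of this sketch, not the paper's exact construction:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dirichlet_to_gaussian(alpha, lam=1.0, lr=0.1, steps=2000):
    """Laplace-approximate the softmax-basis Dirichlet Dir_z(pi(z)|alpha).
    The penalty lam*(sum z)^2/2 removes the softmax's shift invariance,
    which would otherwise make the Hessian singular."""
    alpha = np.asarray(alpha, dtype=float)
    K, a0 = len(alpha), alpha.sum()
    z = np.zeros(K)
    for _ in range(steps):  # gradient descent on the penalized neg. log density
        grad = softmax(z) * a0 - alpha + lam * z.sum()
        z -= lr * grad
    pi = softmax(z)
    # analytic Hessian of the penalized negative log density at the mode:
    # a0 * (diag(pi) - pi pi^T) + lam * 11^T, which is positive definite
    H = a0 * (np.diag(pi) - np.outer(pi, pi)) + lam * np.ones((K, K))
    return z, np.linalg.inv(H)
```

At the mode one recovers $\pi_k(\mu) = \alpha_k / \sum_l \alpha_l$ with $\sum_k \mu_k = 0$, which is the expected behavior: the Gaussian's mean maps back to the Dirichlet's mean point on the simplex.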



*For clarity: Laplace approximations are also one of several possible ways to construct a Gaussian approximation to the weight posterior of a NN, by constructing a second-order Taylor approximation of the empirical risk at the trained weights. This is not the way they are used in this section. The Laplace Bridge is agnostic to how the input Gaussian distribution is constructed. It could, e.g., also be constructed as a variational approximation, or from the moments of Monte Carlo samples. See also Section 5.
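Whichever way the Gaussian is obtained, the bridge itself in the inverse direction is a cheap closed-form map that uses only the logit mean $\mu$ and the marginal variances $\Sigma_{kk}$. The sketch below assumes the closed-form expression $\alpha_k = \Sigma_{kk}^{-1}\left(1 - \tfrac{2}{K} + \tfrac{e^{\mu_k}}{K^2}\sum_{l=1}^K e^{-\mu_l}\right)$; we restate it without its derivation, which is in the supplements:

```python
import numpy as np

def laplace_bridge(mu, Sigma):
    """Map a Gaussian N(mu, Sigma) over logits to Dirichlet parameters.
    Only the mean and the marginal variances diag(Sigma) enter the map:
    alpha_k = (1 - 2/K + exp(mu_k)/K^2 * sum_l exp(-mu_l)) / Sigma_kk."""
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    s = np.exp(-mu).sum()
    return (1.0 - 2.0 / K + np.exp(mu) * s / K ** 2) / np.diag(Sigma)
```

Note the map is a handful of vector operations per prediction, in contrast to the many softmax-pushforward samples required by MC integration.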



Figure 1: Densities on the simplex of the true distribution (top row, computed by MC integration) and the "Laplace Bridge" approximation constructed in this paper (bottom row). For columns (a) and (b), two different Gaussians were constructed such that the resulting MAP estimate is the same but the uncertainty differs. For (c), (d) and (e) the same mean with decreasing uncertainty was used. We find that in all cases the Laplace Bridge is a good approximation and captures the desired properties.

