BAYESIAN NEURAL NETWORKS WITH VARIANCE PROPAGATION FOR UNCERTAINTY EVALUATION

Abstract

Uncertainty evaluation is a core technique when deep neural networks (DNNs) are used in real-world problems. In practical applications, we often encounter unexpected samples that have not been seen in the training process. Not only achieving high prediction accuracy but also detecting uncertain data is significant for safety-critical systems. In statistics and machine learning, Bayesian inference has been exploited for uncertainty evaluation. Bayesian neural networks (BNNs) have recently attracted considerable attention in this context, as a DNN trained using dropout can be interpreted as a Bayesian method. Based on this interpretation, several methods to calculate the Bayes predictive distribution for DNNs have been developed. Though the Monte-Carlo method called MC dropout is a popular method for uncertainty evaluation, it requires a number of repeated feed-forward calculations of DNNs with randomly sampled weight parameters. To overcome this computational issue, we propose a sampling-free method to evaluate uncertainty. Our method converts a neural network trained using dropout to the corresponding Bayesian neural network with variance propagation. Our method is applicable not only to feed-forward NNs but also to recurrent NNs, including LSTMs. We report the computational efficiency and statistical reliability of our method in numerical experiments on language modeling using RNNs and on out-of-distribution detection with DNNs.

1. INTRODUCTION

Uncertainty evaluation is a core technique in practical applications of deep neural networks (DNNs). As an example, let us consider Cyber-Physical Systems (CPS) such as the automated driving system. In the past decade, machine learning methods have been widely utilized to realize the environment perception and path-planning components of a CPS. In particular, the automated driving system has drawn considerable attention as a safety-critical, real-time CPS (NITRD CPS Senior Steering Group, 2012; Wing, 2009). In the automated driving system, the environment perception component is built using DNN-based predictive models. In real-world applications, the CPS is required to deal with unexpected samples that have not been seen in the training process. Therefore, not only achieving high prediction accuracy under an ideal environment but also providing uncertainty evaluation for real-world data is significant for safety-critical systems (Henne et al., 2019). When the uncertainty is high, the CPS should prepare options such as rejecting the recommended action to promote the user's intervention. Such an interactive system is necessary to build fail-safe systems (Varshney & Alemzadeh, 2017; Varshney, 2016). On the other hand, uncertainty evaluation is also useful to enhance the efficiency of learning algorithms, i.e., samples with high uncertainty are thought to convey important information for training networks. Active data selection based on uncertainty has been studied for a long time under the name of active learning (David et al., 1996; Gal et al., 2017; Holub et al., 2008; Li & Guo, 2013; Shui et al., 2020). In statistics and machine learning, Bayesian estimation has been commonly exploited for uncertainty evaluation (Bishop, 2006). In the Bayesian framework, prior knowledge is represented as the prior distribution of the statistical model, and the prior distribution is updated to the posterior distribution based on observations.
The epistemic model uncertainty is represented in the prior distribution; upon observing data, those beliefs are updated in the form of a posterior distribution, which yields model uncertainty conditioned on the observed data. The entropy and the variance are representative uncertainty measures (Cover & Thomas, 2006). For complicated models such as DNNs, however, a direct application of Bayesian methods is prohibitive, as the computation involves costly high-dimensional integration. In deep learning, Bayesian methods are related to stochastic learning algorithms, and this relation is utilized to approximate the posterior over complex models. The stochastic method called dropout is a powerful regularization method for DNNs (Srivastava et al., 2014): in each layer of the DNN, some units are randomly dropped during learning with stochastic gradient descent methods. Gal & Ghahramani (2016a) revealed that dropout can be interpreted as a variational Bayes method. Based on this interpretation, they proposed a simple method to sample DNN parameters from the approximate posterior distribution. Furthermore, the uncertainty of the DNN-based prediction is evaluated using the Monte-Carlo (MC) method called MC dropout. While the Bayesian DNN trained using dropout is realized by a simple procedure, the computational overhead is not negligible: in MC dropout, dropout is also used at test time, with a number of repeated feed-forward calculations to effectively sample from the approximate posterior. Hence, the naive MC dropout is not necessarily suitable for systems demanding real-time responses. In this work, we propose a sampling-free method to evaluate the uncertainty of DNN-based predictions. Our method is computationally inexpensive compared to MC dropout and provides reliable uncertainty evaluation. In the following, we first outline related works. Section 3 presents the detailed formulae for calculating the uncertainty.
In our method, an upper bound of the variance is propagated through each layer to evaluate the uncertainty of the output. We show that our method alleviates overconfident predictions. This property is shared with scaling methods for the calibration of class probabilities on test samples. In Section 4, we study the relation between our method and scaling methods. In Section 5, we demonstrate the computational efficiency and statistical reliability of our method through numerical experiments using both DNNs and RNNs.

2. RELATED WORKS

The framework of Bayesian inference is often utilized to evaluate the uncertainty of DNN-based predictions. In Bayesian methods, the uncertainty is represented by the predictive distribution defined from the posterior distribution of the weight parameters. MacKay (1992) proposed a simple approximation method for the posterior distribution of neural networks and demonstrated that the Bayesian method improves prediction performance on classification tasks. Graves (2011) showed that the variational method works efficiently to approximate the posterior distribution of complex neural network models. There are many approaches to evaluating the uncertainty of modern DNNs (Alex Kendall & Cipolla, 2017; Choi et al., 2018; Lu et al., 2017; Le et al., 2018). We briefly review MC-based methods and sampling-free methods.

Monte-Carlo methods based on Stochastic Learning:

The randomness in the learning process can be interpreted as a prior distribution. In particular, dropout is a landmark stochastic regularization method for training DNNs (Srivastava et al., 2014). Gal & Ghahramani (2016a) proposed a simple method to generate weight parameters from the posterior distribution induced by the prior corresponding to dropout regularization. The predictive distribution is approximated by MC dropout, which computes the expected output over Monte-Carlo samples of the weight parameters. Gal & Ghahramani (2016b) reported that MC dropout works efficiently not only for feed-forward DNNs but also for recurrent neural networks (RNNs). Another sampling-based method builds an ensemble of networks trained with different random seeds (Lakshminarayanan et al., 2017). However, its computation cost is high, as the bootstrap method requires repeated training of parameters on resampled data.

Sampling-free methods:

Though MC dropout is a simple and practical method to evaluate uncertainty, a number of feed-forward computations are necessary to approximate the predictive distribution. Recently, some sampling-free methods have been proposed for uncertainty evaluation. A probabilistic network is a direct way to deal with uncertainty: the parameters of a probabilistic model, say the mean and the variance of a Gaussian distribution, are propagated through the probabilistic neural network, so the uncertainty evaluation is given by a single feed-forward calculation. Choi et al. (2018) used a mixture of Gaussian distributions as a probabilistic neural network, and Wang et al. (2016) proposed natural-parameter networks as a class of probabilistic neural networks based on exponential families. For a given input vector, the network outputs the parameters of a distribution. For recurrent neural networks, Hwang et al. (2019) proposed a variant of the natural-parameter networks. Instead of parameters of statistical models, Wu et al. (2019) developed a sampling-free method that propagates the first and second order moments of the posterior distribution. Sampling-free methods can evaluate the uncertainty with a one-pass computation for neural networks. However, specialized learning algorithms are required to train the probabilistic networks, whereas our method is applicable to DNNs and RNNs trained by common learning methods with dropout. Postels et al. (2019) and Shekhovtsov & Flach (2019) proposed similar methods that propagate the uncertainty of the network to the output layer. Unlike these past works, our method takes into account an upper limit of the correlations among the inputs to the affine layer when the uncertainty is evaluated. In addition, we show that our method works efficiently even for RNNs.
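For concreteness, the MC dropout baseline can be sketched in a few lines of NumPy. This is a minimal illustration for a single affine layer with an input dropout mask; the function names, the keep probability, and the toy dimensions are our own choices, not taken from any referenced implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W, b, keep_prob):
    """One stochastic forward pass: Bernoulli mask on the input,
    then the affine map y = (x * d) W + b."""
    mask = (rng.random(x.shape) < keep_prob).astype(float)
    return (x * mask) @ W + b

def mc_dropout(x, W, b, keep_prob=0.9, T=1000):
    """MC dropout: average T stochastic passes to obtain the predictive
    mean; the sample variance serves as the uncertainty estimate."""
    samples = np.stack([dropout_forward(x, W, b, keep_prob) for _ in range(T)])
    return samples.mean(axis=0), samples.var(axis=0)

# toy usage
x = np.ones(4)
W = rng.normal(size=(4, 2))
b = np.zeros(2)
mean, var = mc_dropout(x, W, b)
```

The cost is T feed-forward passes per test point, which is the overhead that motivates the sampling-free propagation developed in Section 3.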

3. UNCERTAINTY EVALUATION WITH VARIANCE PROPAGATION

In this work, we assume that we can access the weight parameters of the DNN and the dropout probability used in the training process. As the variance is a common measure of uncertainty, we propose a variance propagation algorithm for the trained DNN. An implementation of our method, called nn2vpbnn, is presented in Section A of the appendix. Our method requires only a DNN or RNN trained using dropout; unlike various kinds of probabilistic NNs, we do not need any specialized training procedure to evaluate the uncertainty. This is a great advantage for implementation. Furthermore, the representative values of the predictive distribution, i.e., the mean and variance, are obtained by a one-pass feed-forward calculation. Hence, we can circumvent iterative Monte-Carlo calculations.

3.1. UNCERTAINTY IN AFFINE LAYER

Let us consider the output of the affine layer y = Wx + b for a random input x, where W = (W_{ij}) ∈ R^{ℓ×m} and b = (b_i)_{i=1}^ℓ ∈ R^ℓ. Suppose that the random vector x has mean vector E[x] and variance-covariance matrix (Σ_x)_{i,j} = Cov(x_i, x_j) for i, j = 1, ..., m. Then the mean vector E[y] and the variance-covariance matrix Σ_y of y are given by E[y] = W E[x] + b and Σ_y = W Σ_x W^T. As the estimation of the full variance-covariance matrix is not necessarily reliable, we use only the variance of each x_i and an upper bound of the absolute correlation coefficient to evaluate the uncertainty. For W = (W_{ij}), the variance Var[y_i] is

  Var[y_i] = Σ_j W_{ij}^2 Var[x_j] + Σ_{j ≠ j'} W_{ij} W_{ij'} Cov(x_j, x_{j'}).

Suppose the absolute correlation coefficient among x_1, ..., x_m is bounded above by ρ, 0 ≤ ρ ≤ 1. Using the relation between correlation and variance, we have

  Var[y_i] ≤ Σ_j W_{ij}^2 Var[x_j] + ρ Σ_{j ≠ j'} |W_{ij}| |W_{ij'}| √(Var[x_j] Var[x_{j'}])
           = (1 − ρ) Σ_j W_{ij}^2 Var[x_j] + ρ (Σ_j |W_{ij}| √Var[x_j])^2,   i = 1, ..., ℓ.   (1)

Under the independence assumption, i.e., ρ = 0, the minimum upper bound is obtained. A prediction with too small a variance leads to overconfident decision making; hence, the upper bounding of the variance is important to build fail-safe systems. A simple method for estimating ρ is presented in Section 3.5. Using the above formula, the mean and an upper bound of the variance of y are computed from the mean and an upper bound of the variance of x. In this paper, such a computation is referred to as Variance Propagation, or VP for short. Let us define the variance vector of the m-dimensional random vector x = (x_1, ..., x_m) ∈ R^m by Var[x] = (Var[x_1], ..., Var[x_m]) ∈ R^m. Furthermore, we denote the concatenated vector of the mean and variance of z, or its approximation, by U(z), i.e., U(z) = (E[z], Var[z]).
The VP at the affine layer is expressed by the function T_aff,

  U(y) = (m, v) = T_aff(U(x)),   (2)

where m = W E[x] + b ∈ R^ℓ and each element of v ∈ R^ℓ is defined by the upper bound in equation 1. The average pooling layer, the global average pooling layer (Lin et al., 2013), and the batch normalization layer (Ioffe & Szegedy, 2015) are examples of affine layers; hence, the VP of the affine layer also works to evaluate the uncertainty of these layers. The distribution of y_i is well approximated by a univariate Gaussian distribution if the correlation among the elements of x is small (Wang & Manning, 2013; Wu et al., 2019). Based on this fact, the uncertainty of y_i can be represented by the univariate Gaussian distribution N(E[y_i], Var[y_i]). In our method, the variance Var[y_i] of the approximate Gaussian is given by the variance v in equation 2.
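The propagation T_aff with the bound of equation 1 can be sketched in NumPy as follows. This is a minimal illustration under our own naming (`vp_affine`), not the paper's nn2vpbnn implementation:

```python
import numpy as np

def vp_affine(mean_x, var_x, W, b, rho=0.0):
    """VP through y = W x + b (equations 1-2): exact mean, and an
    upper bound on Var[y_i] assuming |corr(x_j, x_j')| <= rho."""
    mean_y = W @ mean_x + b
    # (1 - rho) sum_j W_ij^2 Var[x_j]  +  rho (sum_j |W_ij| sqrt(Var[x_j]))^2
    var_ind = (W ** 2) @ var_x
    var_cor = (np.abs(W) @ np.sqrt(var_x)) ** 2
    var_y = (1.0 - rho) * var_ind + rho * var_cor
    return mean_y, var_y
```

Setting rho = 0 recovers the independence assumption, while rho = 1 gives the most conservative bound.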

3.2. OUTPUT OF DROPOUT LAYER

Let us consider the uncertainty induced by the dropout layer (Srivastava et al., 2014). The dropout probability is denoted by p. In the dropout layer, the m-dimensional random input vector x = (x_1, ..., x_m) is transformed by the element-wise product z = x ⊙ d, where d = (d_1, ..., d_m) is a vector of i.i.d. Bernoulli(p) random variables. As x and d are independent, the VP in the dropout layer is given by (E[z], Var[z]) = T_drop(U(x)), where

  E[z] = p E[x],   Var[z] = p Var[x] + p(1 − p) E[x]^2.

According to the Bayesian interpretation of dropout revealed by Gal & Ghahramani (2016a), the approximate posterior distribution of the output of an affine layer trained using dropout is given by the distribution of the random variable

  y_i = Σ_{j=1}^m W_{ij} x_j d_j + b_i,   d_1, ..., d_m ~ Bernoulli(p).

The mean and the variance of y_i satisfy E[y] = p W E[x] + b and

  Var[y_i] ≤ (1 − ρ) Σ_j W_{ij}^2 Var[x_j d_j] + ρ (Σ_j |W_{ij}| √Var[x_j d_j])^2.

Since the stochastic input and the weight parameter in the dropout layer are independent, one can exactly calculate the variance of the product from the individual expectations and variances. The VP at the affine layer with dropout is given by the composite function (m, v) = T_aff ∘ T_drop(U(x)). The uncertainty of y_i is then represented by the Gaussian distribution N(m_i, v_i). A similar formula, with explicit expressions, appears in the uncertainty evaluation of the LSTM unit in Section 3.4.
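The map T_drop has a closed form because x and the Bernoulli mask are independent. A minimal NumPy sketch (the function name is ours; p here denotes the retention probability of the mask, matching E[z] = pE[x] above):

```python
import numpy as np

def vp_dropout(mean_x, var_x, p):
    """VP through z = x * d with d ~ Bernoulli(p) independent of x:
    E[z] = p E[x],  Var[z] = p Var[x] + p (1 - p) E[x]^2."""
    mean_z = p * mean_x
    var_z = p * var_x + p * (1.0 - p) * mean_x ** 2
    return mean_z, var_z
```

Composing this with the affine VP of Section 3.1 yields the composite map T_aff ∘ T_drop used for layers trained with dropout.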

3.3. UNCERTAINTY VIA ACTIVATION FUNCTIONS

The nonlinear activation function is an important component of neural network models for achieving high representation ability and accurate prediction (Cybenko, 1989). The ReLU, the sigmoid function, and their variants are common activation functions. In several works, the expectation and the variance of the output of activation functions have been calculated (Frey & Hinton, 1999; MacKay, 1992; Daunizeau, 2017). Let us introduce the distributions transformed by the ReLU and the sigmoid function. The ReLU function is defined by y = max(x, 0). For x ~ N(E[x], Var[x]), the exact expectation and variance of y are expressed with the probability density φ and the cumulative distribution function Φ of the standard Gaussian distribution (Frey & Hinton, 1999; Wu et al., 2019):

  E[y] = E[x] Φ(E[x]/√Var[x]) + √Var[x] φ(E[x]/√Var[x]),
  Var[y] = (E[x]^2 + Var[x]) Φ(E[x]/√Var[x]) + E[x] √Var[x] φ(E[x]/√Var[x]) − E[y]^2.

The sigmoid function is defined by y_i = s(x_i) = 1/(1 + e^{−x_i}). For x_i ~ N(E[x_i], Var[x_i]), MacKay (1992) and Daunizeau (2017) derived the approximate expectation and variance of y:

  E[y] ≈ s(E[x]/√(1 + c Var[x])),
  Var[y] ≈ s(E[x]/√(1 + c Var[x])) (1 − s(E[x]/√(1 + c Var[x]))) (1 − 1/√(1 + c Var[x])),

where the constant c depends on the approximation method. A common choice is c = π/8 ≈ 0.393, while Daunizeau (2017) found c = 0.368 based on numerical optimization. In the same way, one can calculate the approximate expectation and variance of the tanh activation. The VP at the activation layer is expressed by U(y) = T_act(U(x)), where the operation T_act depends on the activation function; the output U(y) is defined by the above expectation and variance. In multiclass classification problems, the softmax function is commonly used at the last layer of DNNs. However, the expectation of the softmax function has no analytic expression under a multivariate Gaussian distribution. Daunizeau (2017) utilized the approximate expectation of the sigmoid function to approximate the expected softmax output; however, the variance of the softmax function was not provided. In this paper, we interpret the multiclass classification problem as a multi-label problem and, at the last layer, use as many sigmoid functions as there are labels. Given the transformations z_k ↦ s(z_k), k = 1, ..., G, at the last layer for classification with G labels, the prediction is given by the label that attains the maximum value of s(z_k). The advantage of this replacement is that a reliable evaluation of the uncertainty is possible for the sigmoid function, as shown above. In numerical experiments, we show that the multi-label formulation with several sigmoid functions provides prediction accuracy comparable to the standard multi-class formulation using the softmax function, while also giving a reliable uncertainty evaluation.
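The moment formulas above can be sketched as follows: the exact ReLU moments use the standard Gaussian pdf and cdf, and the sigmoid moments use the c-approximation. The helper names are ours, and Φ is implemented via math.erf to stay self-contained:

```python
import math
import numpy as np

def _phi(x):
    """Standard normal pdf."""
    return np.exp(-0.5 * x ** 2) / math.sqrt(2.0 * math.pi)

def _Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def vp_relu(mean_x, var_x):
    """Exact mean/variance of y = max(x, 0) for x ~ N(mean_x, var_x), var_x > 0."""
    s = np.sqrt(var_x)
    a = mean_x / s
    mean_y = mean_x * _Phi(a) + s * _phi(a)
    var_y = (mean_x ** 2 + var_x) * _Phi(a) + mean_x * s * _phi(a) - mean_y ** 2
    return mean_y, np.maximum(var_y, 0.0)

def vp_sigmoid(mean_x, var_x, c=math.pi / 8.0):
    """Approximate mean/variance of s(x) = 1/(1+exp(-x)) under a Gaussian."""
    t = 1.0 / np.sqrt(1.0 + c * var_x)
    m = 1.0 / (1.0 + np.exp(-mean_x * t))
    return m, m * (1.0 - m) * (1.0 - t)
```

With zero input variance, vp_sigmoid reduces to the deterministic sigmoid with zero output variance, as expected.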

3.4. LSTM UNIT WITH DROPOUT

The uncertainty evaluation of recurrent neural networks (RNNs) is an important task, as RNNs are widely used in real-world problems. This section is devoted to the uncertainty propagation in the LSTM unit when dropout is used to train the weight parameters (Gal & Ghahramani, 2016b). According to Greff et al. (2017), the standard form of the LSTM unit is defined by

  (i, f, g, o) = (s, s, tanh, s) ∘ ( (h_{t−1}, x_t) [ U_i U_f U_g U_o ; W_i W_f W_g W_o ] ),
  c_t = f ⊙ c_{t−1} + i ⊙ g,   h_t = o ⊙ tanh(c_t),

using the sigmoid function s, where ⊙ denotes the element-wise product of two vectors and ∘ is the composition of the linear layer and the activation functions, i.e., i = s(h_{t−1} U_i + x_t W_i), g = tanh(h_{t−1} U_g + x_t W_g), and so on. The matrices W_* and U_* are the input weights and recurrent weights, respectively. The vectors i, f, g, and o denote the input gate, forget gate, new candidate vector, and output gate. The cell state c_t and the hidden state h_t retain the long- and short-term memory. Here, U_* and W_* are regarded as random matrices distributed according to the posterior distribution induced by dropout with Bernoulli(p); each row of U_* and W_* is set to the null row vector with probability 1 − p. When tied dropout is used for the LSTM, the same rows of all U_* are randomly dropped, and the same rule is applied to W_*. In the untied dropout layer, on the other hand, dropout is executed separately for each U_* and W_*. Details of tied and untied dropout are found in Gal & Ghahramani (2016b). Let us consider the map from U(h_{t−1}, c_{t−1}) to U(h_t, c_t); the map depends on the data x_t. Since the computation in the LSTM with dropout is expressed as the composite function of the dropout layer, the affine layer, and the activation function, we have U(i, f, g, o) = T_act ∘ T_aff ∘ T_drop(U(h_{t−1}, x_t)). Hence, the mean and variance vectors of h_t and c_t are obtained from those of i, f, g, o, and c_{t−1}. This computation is shown below.
We need an appropriate assumption to calculate E[c_t] and Var[c_t], as we do not use the correlations. The simplest assumption is the independence of the random vectors. When f, c_{t−1}, i, and g are independent, we obtain

  E[c_t] = E[f ⊙ c_{t−1}] + E[i ⊙ g] = E[f] E[c_{t−1}] + E[i] E[g],   (3)
  Var[c_t] = Var[f] Var[c_{t−1}] + Var[f] E[c_{t−1}]^2 + E[f]^2 Var[c_{t−1}] + Var[i] Var[g] + Var[i] E[g]^2 + E[i]^2 Var[g].   (4)

This is the VP for the cell state vector c_t in the LSTM. Likewise, the VP for h_t is obtained. The above update function computing the uncertainty of h_t and c_t from i, f, g, o, and c_{t−1} is denoted by T_cell. As a result, we have

  U(h_t, c_t) = T_cell(U(i, f, g, o), U(c_{t−1})) = T_cell(T_act ∘ T_aff ∘ T_drop(U(h_{t−1}, x_t)), U(c_{t−1})).

This is the VP formula from (c_{t−1}, h_{t−1}) to (c_t, h_t). Repeating the above computation over the observed sequence {x_t}_{t=1}^T, one can evaluate the uncertainty of the cell state vectors and the outputs {y_t}_{t=1}^T, where y_t = h_t, t = 1, ..., T. Let us consider the validity of the above independence assumption. Given h_{t−1}, the conditional independence of i, f, g, o, and c_{t−1} holds when untied dropout is used to train the LSTM unit, i.e., the equality

  p(i, f, g, o, c_{t−1} | h_{t−1}) = p(c_{t−1} | h_{t−1}) ∏_{s ∈ {i, f, g, o}} p(s | h_{t−1})

holds for the posterior distribution. The randomness comes from the Bayesian interpretation of untied dropout; here, the observation x_t is regarded as a constant without uncertainty. Then, equations 3 and 4 hold exactly when the mean and variance are replaced by the conditional expectation and conditional variance given h_{t−1}. If the variance of h_{t−1} is small, the independence assumption is expected to be approximately valid.
When the uncertainty of h_{t−1} is not negligible, sampling from the Gaussian distribution representing the uncertainty of h_{t−1} is available, with the formulae E[c_t] = E_{h_{t−1}}[E[c_t | h_{t−1}]] and Var[c_t] = E_{h_{t−1}}[Var[c_t | h_{t−1}]], to compute E[c_t] and Var[c_t] approximately.
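Under the independence assumption, the cell-state update T_cell of equations 3 and 4 reduces to element-wise moment arithmetic. A minimal NumPy sketch (naming ours):

```python
import numpy as np

def vp_cell(mean_i, var_i, mean_f, var_f, mean_g, var_g, mean_c, var_c):
    """VP for c_t = f * c_{t-1} + i * g (equations 3-4), assuming
    f, c_{t-1}, i, and g are mutually independent."""
    mean_ct = mean_f * mean_c + mean_i * mean_g
    # variance of a sum of two products of independent variables
    var_ct = (var_f * var_c + var_f * mean_c ** 2 + mean_f ** 2 * var_c
              + var_i * var_g + var_i * mean_g ** 2 + mean_i ** 2 * var_g)
    return mean_ct, var_ct
```

Together with the gate moments from T_act ∘ T_aff ∘ T_drop, repeated application of this update propagates U(h_t, c_t) along the sequence.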

3.5. ESTIMATION OF CORRELATION PARAMETER

When we evaluate the uncertainty in the affine layer, we need to determine the correlation parameter ρ in equation 1. If the correlation of the inputs to the affine layer is not negligible, the upper bound of the variance with an appropriate ρ is used to avoid overconfidence; hence, the estimation of ρ is important. A simple method of estimating an appropriate ρ using the validation set is as follows.

1. For each candidate value of ρ, execute the following steps.
   (a) For each data point (x_i, y_i) in the validation set, compute the mean vector m_i and variance vector v_i of the network output for x_i using the VPBNN.
   (b) Compute the predictive log-likelihood on the validation set, LL_ρ = Σ_{i: validation set} log p(y_i; m_i, v_i), where p(y; m, v) is the probability density of the uncorrelated normal distribution with mean m_j and variance v_j for each element.
2. Choose the ρ that maximizes LL_ρ.

Though we need to prepare several candidates for the correlation parameter, the computation cost is still lower than that of MC dropout. To evaluate the uncertainty on N_test test samples, MC dropout with T samplings requires N_test T feed-forward calculations. The VPBNN with adaptive choice of ρ needs approximately 2 N_val K_cor + 2 N_test feed-forward calculations, where N_val is the size of the validation set and K_cor is the number of candidates for ρ; the factor 2 comes from computing both the mean and the variance. Usually, K_cor is much smaller than T, and N_val is not extremely large compared to N_test. In practice, we do not need a large validation set to estimate ρ, as shown in the numerical experiments. Hence, the VPBNN with adaptive ρ is more computationally efficient than MC dropout. If distinct correlation parameters are used in each affine layer, the computation cost becomes large. In numerical experiments, we find that uncertainty evaluation using the same ρ in all affine layers works well.
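The selection procedure above can be sketched as a simple grid search over candidate ρ's. Here `vpbnn_forward` is a hypothetical callable returning the (mean, variance) output of the converted network for a given input and ρ; the candidate grid is also illustrative:

```python
import numpy as np

def log_likelihood(y, mean, var, eps=1e-12):
    """Predictive log-likelihood under an uncorrelated Gaussian."""
    var = var + eps
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def select_rho(vpbnn_forward, val_x, val_y,
               candidates=(0.0, 0.05, 0.1, 0.2, 0.5, 1.0)):
    """Choose the candidate rho maximizing the validation log-likelihood."""
    best_rho, best_ll = None, -np.inf
    for rho in candidates:
        ll = sum(log_likelihood(y, *vpbnn_forward(x, rho))
                 for x, y in zip(val_x, val_y))
        if ll > best_ll:
            best_rho, best_ll = rho, ll
    return best_rho
```

The cost is the 2 N_val K_cor feed-forward passes counted above, plus a negligible amount of scalar arithmetic.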

4. SCALING METHODS FOR CALIBRATION AND VARIANCE PROPAGATION

The variance propagation can be regarded as a calibration based on the uncertainty. There are several scaling methods for calibration. Platt scaling (Platt, 1999) is a classical calibration method for multiclass classification problems, and temperature scaling (Guo et al., 2017; Ji et al., 2019) is a simplified version of Platt scaling. Let us consider the conditional probability function Pr(y|z) = e^{z_y} / Σ_{j=1}^m e^{z_j} for z = (z_1, ..., z_m) ∈ R^m and y = 1, ..., m. The temperature scaling of Pr(y|z) is given by Pr(y|z/T), where T > 0 is the temperature parameter. Usually T is greater than one, which softens the class probability. The Platt scaling of Pr(y|z) is defined by a scaling of z such that Pr(y|Wz + b), where W is a diagonal matrix and b is an m-dimensional vector. Platt scaling is a coordinate-wise calibration, while temperature scaling is a homogeneous scaling of the feature vector z. Another extension of temperature scaling is bin-wise temperature scaling (Ji et al., 2019). In bin-wise scaling, the sample space is divided into K bins. The label probability is calibrated by Pr(y|z/T_{k(z)}), in which k(z) is the index of the bin including the sample z, and T_k > 0 is the temperature for calibration in the k-th bin. Each T_k is determined from validation data in the k-th bin. Intuitively, an extremely large label probability tends to yield an overconfident prediction; at a point z with a large maximum probability, a large scaling parameter T_k is introduced to soften the overconfidence in the region including z. The calculation of the VP at the sigmoid activation layer for multi-label classification is given by s(z_j) ↦ s(E[z_j]/√(1 + c Var[z_j])), j = 1, ..., m. When the uncertainty of the random vector z is not taken into account, the prediction using s(E[z_j]), j = 1, ..., m, is apt to be overconfident.
Compared to s(E[z_j]), the uncertainty given by the variance Var[z_j] works as a coordinate-wise calibration, like Platt scaling. If the variance is isotropic, the scaling does not change the ranking of the label probabilities, like temperature scaling. In the variance propagation using the Taylor approximation (Postels et al., 2019), the class probability is calculated in the standard manner without calibration, while the variance is propagated along the layers of the DNN in order to evaluate the uncertainty. Hence, the calibration effect is not incorporated into the naive Taylor approximation method.
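The contrast between the two calibrations can be made concrete in a few lines: temperature scaling divides all logits by one scalar, whereas the VP sigmoid shrinks each logit by its own variance-dependent factor. Both function names are our illustrative choices:

```python
import numpy as np

def temperature_scale(z, T):
    """Homogeneous temperature scaling of the softmax probabilities."""
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def vp_sigmoid_calibrated(mean_z, var_z, c=np.pi / 8.0):
    """Coordinate-wise calibration induced by VP: logit j is shrunk
    by 1 / sqrt(1 + c Var[z_j]) before the sigmoid."""
    return 1.0 / (1.0 + np.exp(-mean_z / np.sqrt(1.0 + c * var_z)))
```

With isotropic variance the ranking of the labels is preserved, mimicking temperature scaling; with anisotropic variance the calibration is coordinate-wise, like Platt scaling.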

5. NUMERICAL EXPERIMENTS

5.1. NUMERICAL EXPERIMENTS ON SYNTHETIC DATA

We assess the uncertainty computed by the VPBNN and the Taylor approximation using synthetic data. Let us consider uncertainty evaluation for regression problems. The target function is the real-valued function on a one-dimensional input space shown in Figure 1. We compared three methods: MC dropout with 1000 samples from the posterior distribution associated with dropout, the Taylor approximation, and the VPBNN with adaptive ρ and several fixed ρ's. The architecture of the NN model is shown in Figure 4 of the appendix. The results are presented in Figure 1. The Taylor approximation and the VPBNN with ρ = 0 tend to be overconfident, while the VPBNN with ρ = 0.15 gives a similar result to MC dropout. We find that the adaptive choice of ρ can avoid overconfidence while providing a meaningful result. In Section C of the appendix, we present the uncertainty evaluation for RNNs. Likewise, we find that an appropriate choice of ρ relaxes the overconfident prediction and that the adaptive ρ provides a meaningful result, as does MC dropout. Overall, the VPBNN with an appropriate ρ provides similar results to MC dropout. As the VPBNN needs only a one-pass calculation, like the VP using the Taylor approximation, its computation cost is much lower than that of MC dropout. The adaptive choice of ρ using the validation set works efficiently to produce results similar to MC dropout. Further numerical results are presented in Section B of the appendix. The Taylor approximation for uncertainty evaluation proposed by Postels et al. (2019) also leads to a computationally efficient single-shot method to compute the uncertainty. However, we find that the Taylor approximation tends to give overconfident results compared to our method in the present experiments.

5.2. RNN FOR LANGUAGE MODELING

We report numerical experiments on a language modeling problem. The problem setup is the same as that considered by Zaremba et al. (2014) and Gal & Ghahramani (2016b). We use the Penn Treebank, a standard benchmark in this field. In the experiments, an LSTM consisting of two layers with 650 units in each layer is used. The model architecture and most of the hyperparameters are set to those used by Gal & Ghahramani (2016b). Figure 7 in the appendix shows the RNN and the converted VPBNN. The weight decay parameter is set to 10^-7 according to the code on GitHub provided by the authors of Gal & Ghahramani (2016b), as the parameter was not explicitly stated in their paper. The results are shown in Table 1. The prediction performance is evaluated by the perplexity on the test set. In the table, the standard dropout approximation propagates the mean of each approximating distribution as input to the next layer (Gal & Ghahramani, 2016b). As the Taylor approximation computes the mean of the output without using the variance, it must provide the same result as the standard dropout approximation. In our experiment, both methods produced almost identical perplexity scores. This result means that we approximately reproduced the numerical results reported in past papers. MC dropout and the VPBNN with ρ = 0 achieved a lower perplexity than the others. Our method, using only a one-pass calculation, can provide almost the same accuracy as MC dropout, which requires more than 1000 feed-forward calculations of the output values. Note that the VPBNN is not an approximation of MC dropout: both MC dropout and the VPBNN are approximations of the posterior distribution, though MC dropout with a sufficient number of feed-forward calculations tends to provide a satisfactory result. The numerical experiments indicate that the number of feed-forward calculations in MC dropout is not sufficient for this task.

5.3. OUT-OF-DISTRIBUTION DETECTION

Let us consider the out-of-distribution detection problem. The task is to find samples whose distribution differs from that of the training samples; the uncertainty of samples is evaluated for this task. First, the neural network is trained using the Fashion-MNIST dataset (Xiao et al., 2017). Then, several methods for uncertainty evaluation are used to detect samples from non-training datasets. In these experiments, we use MNIST (Lecun et al., 1998), EMNIST-MNIST (Cohen et al., 2017), Kannada (Prabhu, 2019), and Kuzushiji (Clanuwat et al., 2018) as non-training datasets. The detection accuracy of each method is evaluated by the AUC measure on the test dataset. We compared MC dropout with 100 samples (MC100) or 2000 samples (MC2000), the Taylor approximation, and the VPBNN with adaptive ρ. The network architecture is the CNN shown in Figure 8 of the appendix. At the output layer, the multi-label sigmoid function is used. Numerical results for the softmax function are reported in Section E of the appendix. In the training process, the Adam optimizer with early stopping on a validation dataset is used. The 60k training data were divided into 50k training data for weight parameter tuning and 10k validation data for hyperparameter tuning. We confirmed that all methods achieve almost the same prediction accuracy on the test data of Fashion-MNIST; the result is shown in the "Test accuracy" column of Table 2. The prediction is made using the top-ranked labels. Though MC dropout and the VPBNN tend to relax overconfident predictions, the calibration does not significantly affect the label prediction accuracy in this problem. For the uncertainty evaluation, we used two criteria.
One is the entropy computed from the mean value of the output, H[y] = -(1/m) Σ_{i=1}^{m} { E[y_i] log E[y_i] + (1 - E[y_i]) log(1 - E[y_i]) }, for the output y = (y_1, . . . , y_m) of the NN, and the other is the mean-standard deviation (mean-std) (Kampffmeyer et al., 2016; Gal et al., 2017), i.e., the averaged standard deviation σ(y) = (1/m) Σ_{i=1}^{m} sqrt(Var[y_i]). In Section E of the appendix, we report the results of other uncertainty measures using not only the sigmoid function but also the softmax function. Overall, we find that the mean-standard deviation outperforms the entropy measure in out-of-distribution detection. This is because the uncertainty is related to the variance rather than the expectation of the output value. Table 2 shows the results of the mean-standard deviation. In the adaptive VPBNN for this task, the estimated correlation parameter ρ approximately ranges from 0.0 to 0.0005. Hence, VPBNN with the independence assumption, i.e., ρ = 0, also works well. The Taylor approximation method fails to detect samples from the non-training distribution, because the estimation accuracy of its variance is not necessarily high, as shown in Section B of the appendix. Let us consider the computation cost. In our experiments, the computation time of MC dropout with 100 samples is 15.6[sec] ± 28.3[ms] on average for the uncertainty evaluation of 10k test samples. For MC dropout with 2000 samples, the computation cost is approximately 20 times higher, since it is proportional to the number of samples. For the adaptive VPBNN, the computation time is 175[ms] ± 31.7[ms], including the adaptive choice of ρ from 10 candidates using 10k validation samples. VPBNN with adaptive ρ provides performance comparable to MC dropout with a sufficient number of samples at a much lower computation cost. As discussed in the language modeling experiment, the number of feed-forward calculations in MC dropout is considered insufficient for this task.
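Both criteria can be computed directly from the per-label means and variances produced by any of the methods above; a minimal sketch (function names are illustrative, not the authors' code):

```python
import numpy as np

def multilabel_entropy(mean):
    """H[y] = -(1/m) sum_i { E[y_i] log E[y_i] + (1-E[y_i]) log(1-E[y_i]) } for sigmoid outputs."""
    p = np.clip(mean, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    return -np.mean(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def mean_std(var):
    """mean-std: sigma(y) = (1/m) sum_i sqrt(Var[y_i])."""
    return np.mean(np.sqrt(var))

# A confident prediction has low entropy; a maximally ambiguous one peaks at log 2 per label.
confident = multilabel_entropy(np.array([0.99, 0.01, 0.01]))
uncertain = multilabel_entropy(np.array([0.5, 0.5, 0.5]))
```

For out-of-distribution scoring, a larger entropy or mean-std value flags a sample as more uncertain.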
At the affine layer, our method allows users to tune the degree of dependence among the input variables by setting the parameter ρ. This parameter is determined according to the balance between the validity of the independence assumption and the safety required of the system. In the numerical experiments shown in Figure 6, setting ρ = 0 tends to produce an over-confident prediction, while the setting ρ = 1 corresponds to the least-confident prediction. In the BNN, the dropout layer is the main source of uncertainty. Besides the dropout layer, the batch normalization layer also has a Bayesian interpretation (Teye et al., 2018). The Gaussian dropout layer and the Gaussian noise layer also contribute uncertainty to the output of the layer, though these layers do not necessarily have a Bayesian interpretation. One can easily implement the variance propagation for these layers as a part of nn2vpbnn. In our method, we utilize neural networks trained with dropout. Unlike various kinds of probabilistic NNs, we do not need any specialized training procedure to evaluate the uncertainty. This is a great advantage of our implementation. Furthermore, the representative values of the predictive distribution, i.e., the mean and variance, are obtained by a one-path feed-forward calculation. Hence, we can circumvent iterative Monte-Carlo calculations. These advantages are shared with the Taylor approximation method of Postels et al. (2019).
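To make the role of ρ concrete, consider propagating input variances through an affine layer y = Wx + b under the assumption that every pair of inputs shares the same correlation ρ. This is a sketch of one natural reading of the text, not the authors' exact formula: ρ = 0 recovers the independent case and ρ = 1 the fully correlated case.

```python
import numpy as np

def affine_variance(W, var_x, rho=0.0):
    """Var[y_j] = sum_i w_ji^2 Var[x_i] + rho * sum_{i!=k} w_ji w_jk sigma_i sigma_k,
    written as a linear interpolation between the independent and fully correlated cases."""
    std = np.sqrt(var_x)
    indep = (W ** 2) @ var_x                 # rho = 0: sum_i w_ji^2 Var[x_i]
    full = (W @ std) ** 2                    # rho = 1: (sum_i w_ji sigma_i)^2
    return (1.0 - rho) * indep + rho * full

W = np.array([[0.5, 0.5], [1.0, -1.0]])
v0 = affine_variance(W, np.array([1.0, 1.0]), rho=0.0)  # independent inputs
v1 = affine_variance(W, np.array([1.0, 1.0]), rho=1.0)  # fully correlated inputs
```

Note that full correlation is not uniformly "least confident" per output unit: for weights of equal sign it inflates the variance, while for the second row (weights 1 and -1) the correlated contributions cancel.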

LSTM

VPBNN for LSTM

The absolute error of VPBNN and the Taylor approximation from the MC method is shown in Figure 3. The horizontal axis and the vertical axis denote the variance and the mean of the input distribution, respectively. In the numerical experiments, VPBNN achieved higher accuracy than the Taylor approximation. We find that the Taylor approximation tends to yield an extremely large variance even when the sigmoid function is used as the activation function. Overall, VPBNN has preferable properties compared to the Taylor approximation. The architectures of the NN and RNN used in Section 5.1 are shown in Figure 4, where the input is a sequence of length 30. The results are shown in Figure 6. Again, the Taylor approximation and the VPBNN with the independence assumption (ρ = 0) tend to yield overconfident results compared to MC dropout. We find that the VPBNN with adaptive ρ provides results similar to those of MC dropout.
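The variance blow-up of the first-order Taylor approximation through the sigmoid can be reproduced with a small self-contained check (an illustration, not the paper's exact experiment): Var[sigmoid(x)] ≈ sigmoid'(μ)² Var[x] ignores the saturation of the sigmoid, so it grows without bound in Var[x], whereas the true output variance of a (0, 1)-valued quantity can never exceed 0.25.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def taylor_var(mu, var):
    """First-order Taylor estimate: Var[sigmoid(x)] ~= sigmoid'(mu)^2 * Var[x]."""
    d = sigmoid(mu) * (1.0 - sigmoid(mu))  # sigmoid derivative at the input mean
    return d ** 2 * var

def mc_var(mu, var, n=200_000, seed=0):
    """Monte-Carlo reference: sample x ~ N(mu, var), take the empirical output variance."""
    rng = np.random.default_rng(seed)
    return sigmoid(rng.normal(mu, np.sqrt(var), n)).var()

mu, var = 0.0, 25.0                 # large input variance, as in the saturated regime
t = taylor_var(mu, var)             # unbounded linearized estimate
m = mc_var(mu, var)                 # bounded by 0.25, the variance of a (0,1) variable
```

Here t = 0.25² × 25 = 1.5625, far above the hard bound of 0.25 that the Monte-Carlo reference respects.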

D SUPPLEMENTARY OF RNN FOR LANGUAGE MODELING

The architecture of the RNN used in Section 5.2 is shown in Figure 7.

E SUPPLEMENTARY OF OUT-OF-DISTRIBUTION

Let us consider the out-of-distribution detection. First, the neural network is trained using Fashion-MNIST (Xiao et al., 2017). Then, several methods for uncertainty evaluation are used to detect samples from non-training datasets.



CONCLUSION

We developed a sampling-free method for uncertainty evaluation. Our method requires only a one-path calculation of DNNs or RNNs, while the MC-based methods need thousands of feed-forward calculations to evaluate the uncertainty. In the numerical experiments, we showed that our method provides more reliable results than existing sampling-free methods such as the Taylor approximation.



using the element-wise operations for the two vectors E[x] and Var[x].

Figure 1: The uncertainty for feed-forward NNs is evaluated. The solid line is the target function, and the training samples (×) are plotted. For each method, the uncertainty is depicted as a confidence interval.

Figure 2: Left: LSTM. Right: the corresponding VPBNN.

Figure 6: The uncertainty of the Bayesian RNN is evaluated. The solid line is the target function. For each method, the uncertainty is depicted as a confidence interval. Upper panels: the target and estimated regression function with its uncertainty. Lower panels: the standard deviation of the output value at each input value.

Single-model perplexity for the Penn Treebank language modeling task is presented. The asterisk (*) denotes the best perplexity on the test set for each dropout setting. The bullet (•) marks the perplexity reported in (Gal & Ghahramani, 2016b).

Results of out-of-distribution detection are presented. "Test accuracy" is the prediction accuracy computed on the test set of Fashion-MNIST. For each pair of training domain (Fashion-MNIST) and non-training domain (MNIST, EMNIST, Kannada, or Kuzushiji), the averaged AUC score computed over 30 random seeds is shown with the standard deviation. The double asterisk (**) (resp. single asterisk (*)) denotes the highest (resp. second-highest) AUC for each non-training dataset.

The architecture of the RNN used in Section 5.1. The sigmoid function and the tanh function are used as activation functions.

AUC score of out-of-distribution detection for MC dropout using 2000 samples, the Taylor approximation, and VPBNN. The standard deviation of the AUC score is calculated over 30 different random seeds. The double asterisk (**) (resp. single asterisk (*)) denotes the highest (resp. second-highest) AUC for each pair of training and non-training datasets.

Test accuracy on the test set of Fashion-MNIST.

Gaussian input to ReLU function

NN used for out-of-distribution detection.

Several methods for uncertainty evaluation are used to detect samples from the non-training datasets MNIST (Lecun et al., 1998), EMNIST-MNIST (Cohen et al., 2017), Kannada (Prabhu, 2019), and Kuzushiji (Clanuwat et al., 2018). The detection accuracy of each method is evaluated by the AUC measure on the test dataset. The network architecture used in Section 5.3 is the CNN in Figure 8 of the appendix. In addition to the CNN with the softmax function provided in Keras, we implemented another CNN with multi-label sigmoid functions at the output layer. In the training process of the NNs, the Adam optimizer with early stopping on a validation dataset is exploited. The 60k training samples were divided into 50k samples for weight parameter tuning and 10k validation samples for hyperparameter tuning. For the uncertainty evaluation, we used two criteria: one is the entropy computed from the mean value of the output, and the other is the mean-standard deviation (mean-std) (Kampffmeyer et al., 2016; Gal et al., 2017), which is computed from the variance. More precisely, the entropy defined from the softmax output is H[y] = -Σ_{i=1}^{m} E[y_i] log E[y_i], and the entropy defined from the sigmoid function in the multi-label setting is given by H[y] = -(1/m) Σ_{i=1}^{m} { E[y_i] log E[y_i] + (1 - E[y_i]) log(1 - E[y_i]) }. The mean-std is defined by σ(y) = (1/m) Σ_{i=1}^{m} sqrt(Var[y_i]). The results are presented in Table 3 in the appendix. For out-of-distribution detection, we find that the mean-std based method outperforms the methods using the entropy criterion. Moreover, our method provides the uncertainty by only a one-path feed-forward calculation, while MC dropout needs more than hundreds of feed-forward calculations. The Taylor approximation fails to detect samples from the non-training distribution, because its approximation accuracy is not necessarily high, as shown in Section B. On the other hand, all the methods considered here achieve almost the same prediction accuracy on the test data of Fashion-MNIST, as shown in Table 4. The prediction is done using the top-ranked labels. In this experiment, the calibration of the label probability does not significantly affect the ranking of the label probabilities.

