EXCESS RISK ANALYSIS FOR EPISTEMIC UNCERTAINTY WITH APPLICATION TO VARIATIONAL INFERENCE

Abstract

Bayesian deep learning plays an important role especially for its ability to evaluate epistemic uncertainty (EU). Due to computational complexity, approximation methods such as variational inference (VI) have been used in practice to obtain posterior distributions, and their generalization abilities have been analyzed extensively, for example, by PAC-Bayesian theory; however, little theoretical analysis exists on EU, although many numerical experiments have been conducted on it. In this study, we analyze the EU of supervised learning in approximate Bayesian inference by focusing on its excess risk. First, we theoretically establish novel relations between the generalization error and widely used EU measurements, such as the variance and mutual information of the predictive distribution, and derive their convergence behaviors. Next, we clarify how the objective function of VI regularizes the EU. Based on this analysis, we propose a new objective function for VI that directly controls both the prediction performance and the EU, grounded in PAC-Bayesian theory. Numerical experiments show that our algorithm significantly improves the EU evaluation over existing VI methods.

1. INTRODUCTION

As machine learning applications spread, understanding the uncertainty of predictions is becoming more important to increase our confidence in machine learning algorithms (Bhatt et al., 2021). Uncertainty refers to the variability of a prediction caused by missing information. For example, in regression problems, it corresponds to the error bars of predictions; in classification problems, it is often expressed via the class posterior probability, entropy, or mutual information (Hüllermeier & Waegeman, 2021; Gawlikowski et al., 2022). There are two types of uncertainty (Bhatt et al., 2021): 1) aleatoric uncertainty (AU), which is caused by noise in the data itself, and 2) epistemic uncertainty (EU), which is caused by a lack of training data. In particular, since EU can tell us which regions of the input space have yet to be learned, it is used, integrated with deep learning methods, in such applications as dataset shift (Ovadia et al., 2019), adversarial data detection (Ye & Zhu, 2018), active learning (Houlsby et al., 2011), Bayesian optimization (Hernández-Lobato et al., 2014), and reinforcement learning (Janz et al., 2019). Mathematically, AU is defined as the Bayes risk, which expresses the fundamental difficulty of a learning problem (Depeweg et al., 2018; Jain et al., 2021; Xu, 2020). For EU, Bayesian inference is useful because the posterior distribution, updated from the prior distribution, can represent the lack of data (Hüllermeier & Waegeman, 2021). In practice, measurements such as the variance of the posterior predictive distribution and the associated conditional mutual information are used to represent EU (Kendall & Gal, 2017; Depeweg et al., 2018). In Bayesian inference, since the posterior distribution is characterized by the training data and the model through Bayes' formula, its prediction performance and EU are determined automatically.
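The EU measurements mentioned above (predictive variance, conditional mutual information) can be made concrete via Monte Carlo samples from a (possibly approximate) posterior. Below is a minimal sketch for classification: the total predictive entropy decomposes into an aleatoric part (expected entropy) and an epistemic part (the mutual information between the label and the parameters). The function name and array shapes are illustrative, not from the paper.

```python
import numpy as np

def predictive_uncertainty(probs):
    """Decompose predictive uncertainty from Monte Carlo samples.

    probs: array of shape (S, C) holding class probabilities p(y|x, theta_s)
    for S posterior samples theta_s ~ q(theta|Z^N) and C classes.
    Returns (total, aleatoric, epistemic), where the epistemic part is the
    mutual information between the label Y and the parameters theta.
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)                          # predictive distribution
    total = -np.sum(mean_p * np.log(mean_p + eps))       # H[E_theta p(y|x, theta)]
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    epistemic = total - aleatoric                        # mutual information (>= 0)
    return total, aleatoric, epistemic
```

When all posterior samples agree, the epistemic term vanishes; when the samples disagree (e.g., far from the training data), it grows toward the total entropy.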
However, due to computational issues, such exact Bayesian inference is difficult to implement; we often use approximation methods, such as variational inference (VI) (Bishop, 2006), especially for deep Bayesian models. Since the derived posterior distribution also depends on the properties of the approximation method, the prediction performance and EU of deep Bayesian learning are no longer automatically guaranteed through Bayes' formula. The prediction performance has been analyzed as generalization error, for example, by PAC-Bayesian theory (Alquier, 2021). Since EU is also essential in practice, we must obtain a theoretical guarantee for EU that is algorithm- and sample-dependent and non-asymptotic, similarly to generalization error analysis. Unfortunately, research in that direction has been limited. Traditional EU analysis has focused on the properties of the exact Bayesian posterior and predictive distributions (Fiedler et al., 2021; Lederer et al., 2019) as well as large-sample behaviors (Clarke & Barron, 1990). Since Bayesian deep learning uses approximate posterior distributions, such traditional EU analysis based on Bayes' formula cannot be applied to it. Moreover, the asymptotic theory of a sufficiently large sample may overlook an important property of EU, which stems precisely from the lack of training data. Recently, an analysis of EU focusing on loss functions was proposed for supervised learning (Xu & Raginsky, 2020; Jain et al., 2021). EU was defined as the excess risk obtained by subtracting the Bayes risk, corresponding to the AU, from the total risk. Thus, excess risk expresses the loss due to insufficient data when the model is well specified. Although this approach successfully defines EU with loss functions, the following limitation remains: Xu & Raginsky (2020) assume that the data-generating mechanism is already known and that we can precisely evaluate the Bayesian posterior and predictive distributions.
A correct model is not necessarily a realistic assumption, and the assumption of an exact Bayesian posterior hampers understanding of EU in approximation methods. To address these limitations, it appears reasonable to analyze excess risk under a setting similar to PAC-Bayesian theory and apply it to the EU of approximate Bayesian inference. However, as shown in Sec. 2, analyzing excess risk in such a way leads to impractical theoretical results, and the relations between excess risk and the widely used EU measurements remain unclear. This greatly complicates EU analysis. Because of this difficulty, to the best of our knowledge, no research exists on excess risk for EU under approximation methods. In this paper, we propose a new theoretical analysis for EU that addresses the above limitations of these existing settings. Our contributions are as follows:
• We show a non-asymptotic analysis for widely used EU measurements (Theorems 2 and 3). We propose computing the Bayesian excess risk (BER) (Eq. (9)) and show that this excess risk equals widely used EU measurements. We then theoretically show the convergence behavior of the BER using PAC-Bayesian theory (Eqs. (13) and (18)).
• Based on this theoretical analysis, we give a new interpretation of existing VI that clarifies how the EU is regularized (Eqs. (19) and (20)). We then propose a novel algorithm that directly controls the prediction and EU estimation performance simultaneously based on PAC-Bayesian theory (Eq. (21)). Numerical experiments suggest that our algorithm significantly improves EU evaluation over existing VI.

2. BACKGROUND OF PAC-BAYESIAN THEORY AND EPISTEMIC UNCERTAINTY

Here we introduce preliminaries. Capital letters such as X represent random variables, and lowercase letters such as x represent deterministic values. All the notations are summarized in Appendix A. In Appendix B, we summarize the settings.

2.1. PAC-BAYESIAN THEORY

We consider a supervised setting and denote input-output pairs by Z = (X, Y) ∈ Z := X × Y. We assume that all data are i.i.d. draws from some unknown data-generating distribution ν(Z) = ν(Y|X)ν(X). Learners can access N training data, Z^N := (Z_1, . . . , Z_N) with Z_n := (X_n, Y_n), generated as Z^N ∼ ν(Z)^N. We write ν(Z)^N as ν(Z^N) and the conditional distribution ν(Y|X = x) as ν(Y|x) for simplicity. We introduce a loss function l : Y × A → R, where A is an action space; the loss of action a ∈ A for target variable y is written l(y, a). We introduce a model f_θ : X → A, parameterized by θ ∈ Θ ⊂ R^d. When we put a prior p(θ) over θ, PAC-Bayesian theory (Alquier, 2021; Germain et al., 2016) guarantees the prediction performance by focusing on the average of the loss with respect to a posterior distribution q(θ|Z^N) ∈ Q, where Q is a family of distributions and q(θ|Z^N) is not restricted to the Bayesian posterior. In this work we consider the log loss and the squared loss. For the log loss, we consider a model p(y|x, θ), and the loss is given as l(y, p(y|x, θ)) = −ln p(y|x, θ), where A is the set of probability distributions over Y. For the squared loss, we use a model f_θ(x) and l(y, f_θ(x)) = |y − f_θ(x)|^2, where Y = A = R.
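The posterior-averaged (Gibbs) empirical risk that PAC-Bayesian bounds control can be estimated by Monte Carlo over posterior samples. Below is a minimal sketch for the squared loss, using a toy linear model and a Gaussian stand-in for q(θ|Z^N); all names and distributions here are hypothetical illustrations, not the paper's setting.

```python
import numpy as np

def gibbs_squared_risk(theta_samples, f, x, y):
    """Empirical Gibbs risk E_{theta ~ q}[ (1/N) sum_n |y_n - f_theta(x_n)|^2 ],
    estimated with S posterior samples theta_s ~ q(theta|Z^N)."""
    losses = [np.mean((y - f(theta, x)) ** 2) for theta in theta_samples]
    return float(np.mean(losses))

rng = np.random.default_rng(0)

# Toy setup: f_theta(x) = theta * x, data generated with theta* = 2 plus noise.
f = lambda theta, x: theta * x
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

# Stand-in for an approximate posterior concentrated near theta* = 2.
theta_samples = rng.normal(loc=2.0, scale=0.05, size=20)
risk = gibbs_squared_risk(theta_samples, f, x, y)
```

Since the posterior samples concentrate near the true parameter, the Gibbs risk stays close to the irreducible noise level; a more diffuse q(θ|Z^N) would inflate it, which is exactly the posterior-averaged quantity PAC-Bayesian bounds keep track of.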

