APPROXIMATE BAYESIAN INFERENCE WITH STEIN FUNCTIONAL VARIATIONAL GRADIENT DESCENT

Abstract

We propose a general-purpose variational algorithm that forms a natural analogue of Stein variational gradient descent (SVGD) in function space. While SVGD successively updates a set of particles to match a target density, the method introduced here, Stein functional variational gradient descent (SFVGD), updates a set of particle functions to match a target stochastic process (SP). The update step is derived from the functional derivative of the Kullback-Leibler divergence between SPs. SFVGD can be used either to train Bayesian neural networks (BNNs) or for ensemble gradient boosting. We show the efficacy of training BNNs with SFVGD on various real-world datasets.

1. INTRODUCTION

Bayesian inference is a powerful framework for data modeling and reasoning under uncertainty. However, it assumes that we can encode our prior knowledge in a meaningful manner, typically by specifying the prior distribution of the model parameters. In machine learning (ML), models may consist of millions of parameters with highly complex interactions (e.g., very large neural networks (NNs)). Furthermore, the parameter structure of a model may itself change during training; e.g., the number of parameters grows when using gradient boosting (GB). This makes defining meaningful prior assumptions over parameter spaces difficult or practically infeasible. Since we usually do not care about single parameters but about the complete resulting function, it is intuitive to express our prior knowledge directly in the hypothesis function space, e.g., by specifying the characteristic length scale, periodicity, or smoothness in general. Fortunately, Bayesian inference can also be formulated in function space. In this case, the prior and posterior distributions are stochastic processes (SPs). The most prominent representative is the Gaussian process (GP), for which the posterior GP can be computed analytically. However, training GPs scales cubically in the number of observations, and the implicit Gaussian likelihood assumption is often violated in practice. In this paper, we introduce Stein functional variational gradient descent (SFVGD), a general gradient descent method in function space that enables practitioners to train ML models to approximate the posterior SP, assuming certain regularity conditions on the prior SP and the likelihood function hold.

1.1. RELATED WORK

Kernelized Stein Methods These methods combine Stein's identity with a reproducing kernel Hilbert space (RKHS) assumption. Based on a finite particle set, they can be used either to find the optimal transport direction to match a target density or to estimate the score gradient of the empirical distribution of the particles. The former is called Stein variational gradient descent (SVGD; Liu & Wang, 2016), and approaches of the latter category are called (non-parametric) score estimators (Zhou et al., 2020). Our method internally uses SVGD and forms a natural analogue in function space. Several extensions to SVGD exist, e.g., approaches incorporating second-order information such as Leviyev et al. (2022) and the more general matrix-valued kernel approach by Wang et al. (2019a). While these extensions usually outperform SVGD, their computational costs are also higher.

Bayesian Neural Networks (BNNs) Typically, BNNs are NNs with weight priors that are trained via variational inference. Prominent representatives are BNNs using Bayes by Backprop (Blundell et al., 2015) and scalable probabilistic backpropagation (Hernandez-Lobato & Adams, 2015). Recently, Immer et al. (2020) proposed transforming BNNs into generalized linear models, with inference based on a Gaussian process equivalent to the model. While Markov chain Monte Carlo (MCMC) methods are often prohibitively expensive for BNNs, some variants, e.g., Chen et al. (2014), account for noisy gradient evaluations and can be used in this setting. However, MCMC-based methods are still usually employed only for relatively low-dimensional problems.

Functional BNNs (FBNNs) Sun et al. (2019) proposed using functional priors to train BNNs. Training BNNs with our descent is closely related to their method, but while they use a score-based approach to estimate the derivative of the Kullback-Leibler divergence D KL between the prior SP and the variational SP, we estimate this derivative directly via SVGD. Wang et al. (2019b) also use SVGD for FBNNs but apply SVGD directly to the D KL between the posterior SP and the variational SP. Furthermore, their work does not show that this in fact maximizes a lower bound for the log marginal likelihood. Recently, Ma & Hernández-Lobato (2021) and Rudner et al. (2021) proposed different FBNN approaches that also build upon the results of Sun et al. (2019), but while their methods are specific to training NN function generators, our method can be used to update a set of particle functions in general.

Repulsive Deep Ensembles Repulsive deep ensembles are deep ensembles that incorporate repulsive terms in their gradient update, forcing their members' weights apart. A variety of repulsive terms are presented in D'Angelo & Fortuin (2021) and D'Angelo et al. (2021), outperforming the approach by Wang et al. (2019b). However, these approaches mainly focus on weight priors, and their empirical findings also relate only to the weight space. In contrast to our work, functional priors can only be applied if a posterior SP with analytical marginal density exists.

GB with Uncertainty The closest neighbor of our approach applied to GB is the ensemble GB scheme proposed by Malinin et al. (2021), which is based on Bayesian ensembles. In contrast to our functional approach, their method approximates the posterior of the model parameters. Another GB-based method is NGBoost (Duan et al., 2019), which directly learns the predictive uncertainty; however, prior knowledge cannot be taken into account.

1.2. OUR CONTRIBUTION

We propose a novel natural extension of SVGD in function space (Section 3), which enables the practitioner to match a target SP. This approach can be implemented in a BNN or as a GB routine (Section 3.3). Using real-world benchmarks, we show that the resulting generator training algorithm is competitive while having lower computational costs than the approach of Sun et al. (2019). In contrast to other existing uncertainty-aware GB algorithms, a GB ensemble trained via SFVGD can naturally incorporate prior functional information. These versatile applications of our framework are made possible by providing a unifying view of NNs and GB from a functional analysis perspective.

2.1. SUPERVISED ML FROM A FUNCTIONAL ANALYSIS PERSPECTIVE

Given a labeled dataset D ∈ (X × Y) n ∼ P n xy of n ∈ N independent and identically distributed (i.i.d.) observations from an unknown data generating process P xy , a supervised ML algorithm tries to construct a risk-optimal model f under a pre-specified loss L. Here, the function f defines a mapping from the feature space X to the target space Y. The learning algorithm I used to construct f is a function mapping from the set of all datasets ∪ n∈N (X × Y) n to a hypothesis space H, which is a subset of the set of all functions mapping from X to the model output space Y ⊂ R g with g ∈ N. In order to specify the goodness-of-fit of a function f , one can define a loss function L : Y × Y → R, (y, f (x)) → L(y, f (x)), which measures how well the output of a fixed model f ∈ H fits an observation (x, y) ∼ P xy . In the following, we present supervised ML from a functional analysis perspective. Here, we fix the observation and associate the loss L with the loss functional L (x,y) [f ] : H → R, f → L(y, f (x)). Based on this loss functional, we can define the risk functional of a model f , R[f ] = E (x,y)∼Pxy L (x,y) [f ], which measures the expected loss of f and is used to theoretically identify optimal models. In the following, we assume that this expectation exists and is finite. If we knew the usually unknown data generating process, and hence the risk functional, we could update any model f ∈ H in the direction of steepest descent in H w.r.t. R by following the negative functional gradient of R. The negative functional gradient of R, -∇ f R[f ], is itself a mapping from X to Y. For every input location x, this gradient returns the direction in the model output space Y that points towards the locally steepest descent w.r.t. R. In the following, unless otherwise stated, the functional derivative is taken in the L 2 space.
Proposition 2.1 Assuming sufficient regularity and that L(y, f (x)) is partially continuously differentiable w.r.t. f (x), we observe for numeric inputs and model outputs that -∇ f R[f ](x) = -p x (x) • E y∼P y|x [∂L(y, f (x))/∂f (x)], where p x is the marginal density of x.

The proof is given in A.1.1. In practice, we usually do not know p x , and since our dataset D is finite, we only have access to n realizations of ∂L(y, f (x))/∂f (x). If the feature space is at least partially continuous, its size is |X | = ∞, and we thus cannot estimate -∇ f R[f ](x) without additional assumptions. However, we have access to the functional empirical risk R emp,D [f ] := Σ (xi,yi)∈D L (xi,yi) [f ], for which we assume convergence in mean to R as n → ∞. Its negative functional gradient can be expressed via the chain rule as -∇ f R emp,D [f ](x) = -Σ (xi,yi)∈D ∂L(y i , f (x i ))/∂f (x i ) • ∇ f [f (x i )](x), where ∇ f [f (x i )] is the functional gradient of the evaluation functional of f at x i , which evaluates to the Dirac delta function δ xi . However, since we take the functional gradient in H, ∇ f [f (x i )] becomes the projection of δ xi into H. For example, if H is an RKHS with associated kernel k, then ∇ f [f (x i )](x) = k(x i , x), i.e., our choice of H directly influences the "bumpiness" of ∇ f R emp,D [f ]. Furthermore, we can interpret ∇ f R emp,D [f ] as a (jump-)continuous functional representation of the dataset ∂D L,f := {(x i , -∂L(y i , f (x i ))/∂f (x i )) | (x i , y i ) ∈ D} ⊂ (X × Y) n , which also implicitly defines a learner. In the following, we show how two core supervised ML algorithms (gradient boosting and neural networks) naturally incorporate this functional gradient during training.

Gradient Boosting (GB) For GB (Friedman, 2001), the situation is usually reversed: we choose a (base) learner I b that implicitly defines H and with which we fit a model to the dataset ∂D L,f .
GB uses these approximations of the negative functional gradient of the empirical risk to successively update a model f [0] such that f [t+1] = f [t] + η [t] b [t] with b [t] = I b (∂D L,f [t] ), where η [t] ∈ R >0 is the learning rate, which possibly depends on the iteration t ∈ N. For further details, see Appendix A.2.

Neural Networks (NNs) If f is an NN with parameters ϕ, then the parameter gradients w.r.t. the empirical risk functional needed for backpropagation can be obtained via the chain rule as ∇ ϕ R emp,D [f ] = ∫ X ∇ f R emp,D [f ](x) • ∇ ϕ f (x) dx = Σ (xi,yi)∈D ∂L(y i , f (x i ))/∂f (x i ) • ∇ ϕ f (x i ), where the second equality holds since here we do not restrict H, i.e., ∇ f [f (x i )] = δ xi . However, these procedures only ensure that we can find an optimal model f ∈ H w.r.t. R emp , which does not imply that f is optimal w.r.t. R. In practice, we tune the hyperparameters of the algorithms (i.e., use data withheld from learning for subsequent model selection) and apply early stopping to find a model f that is approximately optimal w.r.t. R.
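The GB update above can be illustrated in a few lines. The sketch below is our own illustration, not the paper's implementation: it assumes a squared loss, for which the negative functional gradient evaluated at the training inputs is simply the residual y i - f(x i ), and uses a kernel ridge regression as the base learner I b , so that H is effectively an RKHS.

```python
import numpy as np

def rbf(a, b, ls=0.5):
    # RBF kernel matrix between 1-D input vectors a and b
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ls**2))

def fit_base_learner(x, g, reg=1e-3):
    """Fit a kernel ridge regression to the negative functional gradient
    evaluated at the training inputs (the dataset dD_{L,f})."""
    K = rbf(x, x)
    alpha = np.linalg.solve(K + reg * np.eye(len(x)), g)
    return lambda xq: rbf(xq, x) @ alpha

def gradient_boost(x, y, steps=50, lr=0.3):
    f = lambda xq: np.zeros_like(xq)          # f^[0] = 0
    for _ in range(steps):
        resid = y - f(x)                      # -dL/df for squared loss L = (y - f)^2 / 2
        b = fit_base_learner(x, resid)        # b^[t] = I_b(dD_{L, f^[t]})
        f_prev = f
        f = lambda xq, fp=f_prev, b=b: fp(xq) + lr * b(xq)  # f^[t+1] = f^[t] + eta * b^[t]
    return f

x = np.linspace(0, 3, 30)
y = np.sin(2 * x)
f = gradient_boost(x, y)                      # training residuals shrink each step
```

Each boosting step fits the base learner to the functional-gradient dataset and adds a damped version of the fit to the current model; the choice of kernel length scale here plays exactly the role of the "bumpiness" of the projected gradient discussed above.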

2.2. STOCHASTIC PROCESSES

In this section, we briefly introduce stochastic processes (SPs), which can be used to represent distributions over functions and thereby allow us to express the uncertainty of models independently of their specific parameter structure. We regard X as an index set and let (Y, G) be a measurable space with σ-algebra G on the state space Y. Let (Ω, F, P) be a probability space with sample space Ω, σ-algebra F on Ω, and probability measure P, and for x ∈ X , let Q(x) be a random variable mapping from Ω to Y. An SP Q is the family {Q(x); x ∈ X } of all random variables Q(x) (Lamperti, 1977). With this, we can define a sample function f ω : X → Y, x → Q(x)(ω) for a fixed ω ∈ Ω. It is often easier to look at SPs from this sample function view: for every A ∈ F, a set of functions {f ω ; ω ∈ A} with an associated measure P(A) can be identified, i.e., SPs define a distribution over functions mapping from X to Y. For a finite index set X := x 1:m ∈ X m , we denote the finite-dimensional marginal joint distribution over the function values {Q(x 1 ), . . . , Q(x m )} as Q X . In the following, we assume that for every Q X there exists a corresponding density function p Q X : Y m → R ≥0 , f X → p Q X (f X ), where f X := (f (x 1 ), . . . , f (x m )) are the function values at x 1:m based on a sample function f , and where we suppress the ω to ease notation. We denote the functional associated with this density function by p Q X [f ]. The D KL is a measure of distance between two distributions over the same probability space. Since SPs are distributions over functions, the D KL can also be used to measure the distance between two SPs. Unfortunately, computing this quantity is non-trivial (Matthews et al., 2015). However, for two consistent and ergodic SPs Q and P , i.e., SPs that can be characterized by their marginals over all finite index sets (e.g., GPs), Sun et al. (2019) showed that the D KL between these SPs can be expressed solely in terms of their marginals, i.e., D KL (Q∥P ) = sup m∈N,X∈X m D KL (Q X ∥P X ). This expression provides a differentiable distance measure between two stochastic processes.
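Each term D KL (Q X ∥P X ) inside this supremum is an ordinary KL divergence between finite-dimensional distributions. If both processes are GPs, the marginals are multivariate Gaussians and a single term has a closed form. The following sketch (our own illustration, with hypothetical helper names) evaluates one such marginal KL:

```python
import numpy as np

def gauss_kl(mu_q, S_q, mu_p, S_p):
    """KL( N(mu_q, S_q) || N(mu_p, S_p) ): the marginal D_KL(Q_X || P_X)
    at a shared finite index set X when both SPs are GPs."""
    m = len(mu_q)
    S_p_inv = np.linalg.inv(S_p)
    d = mu_p - mu_q
    _, logdet_p = np.linalg.slogdet(S_p)   # slogdet avoids under/overflow
    _, logdet_q = np.linalg.slogdet(S_q)
    return 0.5 * (np.trace(S_p_inv @ S_q) + d @ S_p_inv @ d
                  - m + logdet_p - logdet_q)

def rbf_cov(x, ls):
    # RBF covariance over index points x, with jitter for numerical stability
    return np.exp(-(x[:, None] - x[None, :])**2 / (2 * ls**2)) + 1e-6 * np.eye(len(x))

x = np.linspace(0, 1, 5)                   # a finite index set X of size m = 5
kl = gauss_kl(np.zeros(5), rbf_cov(x, 0.5), np.zeros(5), rbf_cov(x, 1.0))
# kl > 0; it vanishes iff the two marginals coincide
```

The full D KL (Q∥P ) would be the supremum of such terms over all finite index sets, which is exactly what the sampling-based objective in Section 3 approximates.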

2.3. STEIN VARIATIONAL GRADIENT DESCENT

SVGD (Liu & Wang, 2016) is a variational Bayesian method. Variational methods can be used to approximate the generally intractable posterior density of a continuous random variable θ, p θ|D (θ) = p D|θ (D|θ) p θ (θ) / ∫ p D|θ (D|θ) p θ (θ) dθ, where p D|θ and p θ are the likelihood and the prior density function, respectively. SVGD tries to match the posterior p θ|D with a density q represented via a fixed number r ∈ N of pseudo-samples, so-called particles, and iteratively updates them by minimizing D KL (q∥p θ|D ) = ∫ q(θ) log(q(θ)/p θ|D (θ)) dθ. In an RKHS with associated kernel k, the optimal update direction is found by considering the negative functional derivative -∇ f D KL (q [T ] ∥p θ|D )| f =0 = E θ∼q [∇ θ log p θ|D (θ) k(θ, •) + ∇ θ k(θ, •)], where T (θ) = θ + f (θ) and q [T ] is the density of θ ′ = T (θ) when θ ∼ q. We can estimate this functional gradient based on the particles in an unbiased manner, as we are able to evaluate the score function of p θ|D (i.e., ∇ θ log p θ|D ) even though log p θ|D itself might be intractable.
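A minimal numpy sketch of this update (our own illustration; it uses a fixed kernel bandwidth rather than the median heuristic commonly paired with SVGD):

```python
import numpy as np

def rbf_kernel(theta, h=1.0):
    # pairwise kernel matrix K[i, j] = k(theta_i, theta_j) and its gradient
    # grad_K[i, j] = grad_{theta_i} k(theta_i, theta_j)
    diff = theta[:, None, :] - theta[None, :, :]
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2))
    grad_K = -diff / h**2 * K[..., None]
    return K, grad_K

def svgd_step(theta, score, h=1.0, lr=0.05):
    """One SVGD update: phi(x) = E_theta[ k(theta, x) grad log p(theta)
    + grad_theta k(theta, x) ], estimated over the particle set."""
    K, grad_K = rbf_kernel(theta, h)
    phi = (K @ score(theta) + grad_K.sum(axis=0)) / theta.shape[0]
    return theta + lr * phi

# target: standard normal, so the score is simply -theta
rng = np.random.default_rng(0)
theta = rng.normal(3.0, 0.5, size=(50, 1))   # particles start far from the target
for _ in range(500):
    theta = svgd_step(theta, lambda t: -t)
# the particle cloud drifts toward 0 and spreads to cover the target
```

The first term drives particles toward high-density regions via the kernel-weighted score; the ∇ θ k term repels particles from each other, preventing collapse to the mode.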

3. STEIN FUNCTIONAL VARIATIONAL GRADIENT DESCENT

In this section, we develop a functional version of SVGD, which we call Stein functional variational gradient descent (SFVGD). While SVGD can be used to approximate the posterior distribution of a continuous random variable, SFVGD can be applied when we are interested in the posterior SP P f |D defined by its Radon-Nikodym derivative (Schervish, 1995) w.r.t. the prior SP P f , dP f |D /dP f [f ] = p D|f [D|f ] / ∫ p D|f [D|f ] dP f [f ], where p D|f is the likelihood functional, which measures how likely it is to observe D given a sample function f . In the following, we assume that the posterior P f |D exists and that it is an ergodic and consistent SP. Analogously to SVGD, we try to approximate P f |D with a distribution Q represented by pseudo-samples. However, for SFVGD, these particles are now functions.

3.1. OBJECTIVE FUNCTION

Since analytical solutions for Eq. (9) only exist in special cases (e.g., if the prior P f is a GP and the likelihood is also Gaussian), we use the D KL between two SPs to formulate an optimization objective. More specifically, the goal of our framework is to construct an approximating measure Q * for which it holds that Q * ∈ arg min Q∈Q D KL (Q∥P f |D ), (10) where Q is the set of representable variational posterior processes. Here, we represent Q via r ∈ N sample functions f 1 , . . . , f r from Q, which act as pseudo-samples and which we also call particle functions. It can be shown (Matthews et al., 2015) that minimizing Eq. (10) is equivalent to maximizing the functional evidence lower bound (ELBO) L D , i.e., Q * ∈ arg max Q∈Q E f ∼Q [ℓ[D|f ]] - D KL (Q∥P f ) =: L D (Q), (11) where ℓ[D|f ] := log p D|f [D|f ]. The advantage of formulation (11) over (10) is that Eq. (11) only depends on known quantities. In the following, we apply Eq. (6), i.e., the results of Sun et al. (2019) regarding the D KL of ergodic and consistent SPs, yielding Q * ∈ arg max Q∈Q inf m∈N,X∈X m E f ∼Q [ℓ[D|f ]] - D KL (Q X ∥P f X ) =: L D,X (Q). (12) In contrast to Sun et al. (2019), however, we do not unfold the D KL term, since we are able to take its functional gradient directly via SVGD. The resulting maximin formulation of Eq. (12) proves challenging to solve, especially since we need to minimize over discrete sets X and the infimum does not ensure a finite m. Hence, we follow Sun et al. (2019) in replacing the inner minimization with a sampling-based approach, i.e., Q * ∈ arg max Q∈Q E Ds E X M ∼C X [L Ds,[X Ds ,X M ] (Q)], (13) where D s is a random subsample of size |D s | = s drawn from D, X Ds are the associated feature vectors of D s , and X M = [x 1 , . . . , x M ] ⊤ ∈ X M are M stacked random feature vectors drawn from a sampling distribution C X with support X . If X is bounded, Sun et al. (2019) propose a uniform distribution for C X .
It has been shown in Sun et al. (2019) that for D s = D and M > 1, L D,[X D ,X M ] is a lower bound on the log marginal likelihood log p(D); i.e., the maximization in Eq. 13 implies the minimization in Eq. 10. Although the objective is ill-defined if Q is a parametric family, as noted by Burt et al. (2020), we did not encounter any problems in practice. We could also straightforwardly use the grid functional D KL proposed by Ma & Hernández-Lobato (2021), which fixes some of these theoretical shortcomings. However, note that SFVGD itself does not assume Q to be parametric.
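Assembling the index set for one stochastic evaluation of this objective only requires a minibatch plus M sampled measurement points; a small sketch (function and variable names are our own) for a bounded feature space with a uniform C X:

```python
import numpy as np

def sample_index_points(X_train, s, M, low, high, rng):
    """Assemble the joint index set X = [X_Ds, X_M]: the inputs of a
    random minibatch D_s of size s plus M measurement points drawn
    uniformly from the bounded feature space (the choice of C_X
    suggested for bounded X)."""
    idx = rng.choice(len(X_train), size=s, replace=False)   # minibatch indices
    X_M = rng.uniform(low, high, size=(M, X_train.shape[1]))  # measurement points
    return idx, np.concatenate([X_train[idx], X_M], axis=0)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
idx, X = sample_index_points(X_train, s=8, M=4, low=-2, high=2, rng=rng)
print(X.shape)  # (12, 3)
```

The likelihood term of the objective is then evaluated only at the first s rows of X, while the D KL term is evaluated at all s + M rows.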

3.2. FUNCTIONAL DERIVATIVE OF THE OBJECTIVE

When using conventional gradient descent methods, we apply a map to update the parameters of our model such that our loss is reduced. In SFVGD, we proceed in a similar manner but update functions towards a loss-minimizing direction. A map that takes a function as an argument and returns another function is called an operator. Hence, we want to express how our objective value in Eq. 13 changes when an operator F : H → H, f → f F is applied to every f ∼ Q. This means that the objective value changes with F such that L Ds,X (Q [T ] ) = E f ∼Q [T ] ℓ[D s |f ] - D KL (Q [T ] X ∥P f X ), (14) where T (f ) = f + F (f ) and Q [T ] is the distribution of f ′ = T (f ) when f ∼ Q. Naturally, we are interested in the functional derivative of Eq. 14 w.r.t. F , since it gives us the direction of steepest ascent in operator space regarding the functional ELBO. However, in order to make our computations tractable, we must limit the space of feasible operators:

Definition 3.1 Let F : H → H, f → f F be a continuous operator with the property that for all m ∈ N and each X ∈ X m there exists a function F X : Y m → Y m such that (f F ) X = F X (f X ) for any f ∈ H. We call such an operator "evaluation-only dependent".

Thus, F does not depend on derivatives of f (which is not a restriction, since we only assumed f to be continuous); we can also treat F as a construction rule of F X for arbitrary m and X. Now, we can state the functional gradient of the objective functional w.r.t. an evaluation-only dependent operator F . For X = [X Ds , X M ] and X̃ = X Ds , ∇ F L Ds,X (Q [T ] ) = ∇ F E f ∼Q [T ] ℓ[D s |f ] - ∇ F D KL (Q [T ] X ∥P f X ) = ∇ F E ỹ∼Q [T ] X̃ ℓ(D s , ỹ) - ∇ F D KL (Q [T ] X ∥P f X ), where we assumed that there exists a log-likelihood function ℓ : ∪ s∈N (X × Y) s × Y s → R such that ℓ[D s |f ] = ℓ(D s , f X̃ ) for every D s . If we set F = 0, then T becomes the identity operator, i.e., Q [T ] = Q.
Since we want to iteratively update our particle functions, we only need to consider small perturbations around F = 0.

Proposition 3.1 For an evaluation-only dependent operator F , the functional derivative of the functional ELBO at F = 0 evaluated at a function f is ∇ F L Ds,X (Q [T ] )| F =0 (f ) = E ỹ∼Q X̃ [∇ ỹ ℓ(D s , ỹ) • δ ỹ (f X̃ )] • (δ X̃1 (•), . . . , δ X̃s (•)) ⊤ + E y∼Q X [∇ y log p P f X (y) k Y (y, f X ) + ∇ y k Y (y, f X )] • (δ X1 (•), . . . , δ X s+M (•)) ⊤ , (16) where we assume that H Y ⊂ {f : Y s+M → Y s+M } is an RKHS with associated kernel k Y .

The proof is given in A.1.2, where we also show the following corollary.

Corollary 3.1.1 For an evaluation-only dependent operator F , the functional derivative of the functional ELBO at F = 0 evaluated at a function f is ∇ F L Ds,X (Q [T ] )| F =0 (f ) = E ỹ∼Q X̃ [∇ ỹ ℓ(D s , ỹ) • k Ỹ (ỹ, f X̃ )] • (δ X̃1 (•), . . . , δ X̃s (•)) ⊤ + E y∼Q X [∇ y log p P f X (y) k Y (y, f X ) + ∇ y k Y (y, f X )] • (δ X1 (•), . . . , δ X s+M (•)) ⊤ , (17) where we assume that H Y ⊂ {f : Y s+M → Y s+M } and H Ỹ ⊂ {f : Y s → Y s } are RKHSs with associated kernels k Y and k Ỹ , respectively.

We call ∇ F L Ds,X (Q [T ] )| F =0 the Stein functional variational gradient operator. It inherits its name from SVGD, which is used internally to find the functional derivative of the D KL term. The key idea of SFVGD is that by updating every particle function f ∼ Q via functional gradient descent in the direction of ∇ F L Ds,X (Q [T ] )| F =0 (f ), we carry out a gradient step in distribution space. This increases the current functional ELBO value we want to maximize by pulling Q closer to Q * and consequently also closer to the true posterior stochastic process P f |D .

3.3. ALGORITHMS

Based on the particle functions f 1 , . . . , f r , we can construct an estimator of Eq. 16: ∇ F L Ds,X (Q [T ] )| F =0 (f ) = 1/r Σ r i=1 ∇ f i X̃ ℓ(D s , f i X̃ ) δ f i X̃ (f X̃ ) • (δ X̃1 (•), . . . , δ X̃s (•)) ⊤ + λ/r Σ r i=1 [∇ f i X log p P f X (f i X ) k Y (f i X , f X ) + ∇ f i X k Y (f i X , f X )] • (δ X1 (•), . . . , δ X s+M (•)) ⊤ , (18) where we introduce a regularization parameter λ ∈ R ≥0 . Furthermore, if we set λ = 1, the estimator becomes an unbiased estimator of Eq. (16).

Algorithm 1: Stein Functional Variational Gradient Descent Step (sfvgd_step)
  Hyperparameters: dataset D, log-likelihood ℓ, prior SP P f , number of measure points M , sampling distribution C X over X , regularization parameter λ
  Input: set of particle functions {f i } r i=1 treated as multi-output function f
  Output: input locations to update X, Stein functional variational gradient (of f evaluated at X) ∆ f X
  X M ∼ C X ; D s = (X̃, ỹ) ⊂ D
  X = [X̃, X M ]
  for j = 1, . . . , r do
    ∆ j,ℓ = 1/r Σ r i=1 ∇ f i X̃ ℓ(D s , f i X̃ ) δ f i X̃ (f j X̃ ) • (δ X̃1 (•), . . . , δ X̃s (•)) ⊤
    ∆ j,KL = 1/r Σ r i=1 [∇ f i X log p P f X (f i X ) k Y (f i X , f j X ) + ∇ f i X k Y (f i X , f j X )] • (δ X1 (•), . . . , δ X s+M (•)) ⊤
  end
  ∆ f X = (∆ ℓ + λ • ∆ KL )(X)

Algorithm 2: Stein Functional Variational Neural Network
  Hyperparameters: same as for sfvgd_step
  Input: variational posterior g(•), optimizer opt
  Output: variational posterior g(•), which approximates the target distribution
  while ϕ not converged do
    f i = g(h ϕ (X, ξ i )), ξ i ∼ p(ξ), i = 1, . . . , r
    X, ∆ f X = sfvgd_step(f )
    ϕ = opt(ϕ, X, ∆ f X )
  end

Since L D,X is a lower bound of the log marginal likelihood log p(D), it would be preferable to update the particle functions via ∇ F L D,X (Q [T ] )| F =0 . However, the major computational bottleneck in Eq. 18 is the calculation of the score gradient ∇ f i X log p P f X (f i X ) for all particle functions f i , i = 1, . . . , r, evaluated at X. For example, if P f is a GP, then the cost of computing ∇ f i X log p P f X (f i X ) is O((s + M ) 3 r).
In addition, the computation of all kernel values k Y (f i X , f j X ), i = 1, . . . , r, j = 1, . . . , r, required in Eq. 18 costs O(r 2 ). However, this is usually small compared to the cost of computing the score gradient of the functional prior. We choose a small constant number for M , since L D,[X D ,X M ] is a lower bound for the log marginal likelihood log p(D) for M > 1, and we set r to a number of particle functions that can represent the posterior SP reasonably well. Thus, we are interested in estimating ∇ F L D,X (Q [T ] )| F =0 with mini-batches. In principle, an unbiased estimate of ℓ(D, f i X D ) is n/s • ℓ(D s , f i X̃ ), which suggests λ = s/n. Although (in general) L Ds,X is not a lower bound of log p(D), we found in practice that setting λ to s/n still results in reasonable performance. However, our theoretical framework gives the reassuring guarantee that if we use full-batch training, we do in fact maximize a lower bound of log p(D). In the following, we present two algorithms, namely Stein functional variational NNs and Stein functional variational gradient boosting (A.3.1), based on the estimated Stein functional variational gradient; i.e., both depend on the score gradient of the functional prior evaluated at X. If no analytical score gradient exists, we can use a score gradient estimator, as suggested in Sun et al. (2019). This only requires function samples of the prior process evaluated at X, but estimating the score gradient is usually computationally expensive (Zhou et al., 2020). Since our approach builds upon SVGD, our framework also admits an additional approach based on gradient-free SVGD (Han & Liu, 2018) that only requires the evaluation of the marginal densities of the prior process.

Stein Functional Variational Neural Network (SFVNN) Sun et al.
(2019) proposed to train neural networks (NNs) acting as function generators with the negative functional ELBO as loss, which they call functional variational Bayesian neural networks (FVBNNs). Such a function generator can be modeled via an NN with stochastic weights, which can be represented as a differentiable function g : Z → Y, z → g(z), where z ∈ Z consists of the deterministic input x and stochastic inputs, i.e., we can model z as a random variable z ∼ p(z|x). These NNs are applicable as long as the reparameterization trick (Kingma & Welling, 2014) can be used, i.e., there exists a random variable ξ ∈ Ξ with ξ ∼ p(ξ) and a differentiable function h ϕ : X × Ξ → Z parametrized by ϕ such that h ϕ (x, ξ) ∼ p(z|x). With this, we can sample a function by sampling ξ ∼ p(ξ) and defining f ξ : X → Y, x → g(h ϕ (x, ξ)). In this case, we can write the gradient of Eq. 14 w.r.t. ϕ as ∇ ϕ L Ds,X (Q [T ] ) = E ξ∼p(ξ) [∇ f ξ X̃ ℓ(D s , f ξ X̃ ) ∇ ϕ f ξ X̃ ] - E ξ∼p(ξ) [(∇ f ξ X log p Q X (f ξ X ) - ∇ f ξ X log p P f X (f ξ X )) ∇ ϕ f ξ X ]. This is also the result obtained in Sun et al. (2019), who then use a score estimator (namely, the spectral Stein gradient estimator (SSGE; Shi et al., 2018)) to approximate the entropy gradient ∇ f ξ X log p Q X (f ξ X ). SSGE estimates this score gradient in an RKHS; hence, the entropy gradient and the cross-entropy gradient ∇ f ξ X log p P f X (f ξ X ) are taken in different function spaces. Our SFVNN is instead based on the parameter gradient ∇ ϕ L Ds,X (Q [T ] ) = E ξ∼p(ξ) [∇ F L Ds,X (Q [T ] )| F =0 (f ξ )(X) • ∇ ϕ f ξ X ] - E f X ∼Q X [∇ ϕ log p Q X ,ϕ (f X )] (a term which equals 0), where we use the general Stein functional variational gradient from Eq. 16. In contrast to Sun et al. (2019), we thereby directly take the functional gradient of the D KL term in an RKHS, and our score gradients of the prior process are also subject to the implicit kernel smoothing.
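Such a generator can be sketched as follows (a hypothetical toy architecture of our own, not the one used in the experiments): the stochastic input ξ is appended to x by h ϕ , so fixing ξ yields one deterministic sample function f ξ , while ϕ is shared across all samples.

```python
import numpy as np

rng = np.random.default_rng(1)
# phi holds the deterministic weights of the generator g
phi = {"W1": rng.normal(size=(16, 2)), "b1": np.zeros(16),
       "W2": rng.normal(size=(1, 16)) / 4, "b2": np.zeros(1)}

def sample_function(phi, xi):
    """f_xi(x) = g(h_phi(x, xi)); here h_phi simply stacks x and xi,
    a minimal instance of the reparameterization trick."""
    def f(x):
        z = np.stack([x, np.full_like(x, xi)])               # h_phi(x, xi)
        hidden = np.tanh(phi["W1"] @ z + phi["b1"][:, None])
        return (phi["W2"] @ hidden + phi["b2"][:, None]).ravel()  # g(z)
    return f

x = np.linspace(-1, 1, 50)
# five distinct sample functions from the same deterministic parameters phi
fs = [sample_function(phi, rng.normal()) for _ in range(5)]
```

Because the stochasticity is isolated in ξ, the gradient of any loss on the sampled function values flows back to ϕ, which is what Algorithm 2 exploits.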
Runtime comparison between SFVNN and FVBNN While FVBNN scales as O(r 3 + r 2 (s + M )) (because of SSGE), our approach scales only quadratically in r (because of SVGD), allowing for a larger number of sample functions (see Appendix B).

Toy example Given three sample functions and two data points {(0.75, 1.0), (2.0, 0.0)} (Figure 1a), we want to approximate the posterior GP (Figure 1h) w.r.t. the prior GP shown in Figure 1b and a Gaussian likelihood via SFVGB. Hence, we also sample a necessary measure point x M = 1.5. The resulting three-dimensional marginal density is defined by the prior GP. SVGD gives us the optimal update direction for the sample function values to fit this marginal density (Figures 1c, 1d). We fit a kernel ridge regression to these directions after adding the log-likelihood gradients at the two data points (Figure 1e). The resulting updated function samples after this SFVGD step can be seen in Figure 1f, and the converged function samples in Figure 1g. Qualitatively comparing the converged sample functions in Figure 1g with sample functions from the exact posterior GP in Figure 1h reassures us that we are able to approximate the posterior GP reasonably well in this toy example.
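This toy example can be reproduced in spirit with a heavily simplified estimator of the Stein functional variational gradient. The sketch below is our own and makes several simplifying assumptions: the evaluation points are held fixed (instead of being resampled and the update smoothed back into function space with a kernel ridge regression), the GP prior score is the closed form ∇ y log N(y; 0, K) = -K⁻¹y, and a Gaussian log-likelihood gradient is added at the observed locations.

```python
import numpy as np

def sfvgd_step_values(F, y_obs, obs_idx, K_inv, lam=1.0, h=2.0, noise=0.3, lr=0.05):
    """One simplified SFVGD step on r particle-function values F (r x m),
    evaluated at m fixed input locations, under a GP prior with inverse
    Gram matrix K_inv and a Gaussian likelihood at obs_idx."""
    r = F.shape[0]
    grad_ll = np.zeros_like(F)                    # likelihood gradient, zero off-data
    grad_ll[:, obs_idx] = (y_obs - F[:, obs_idx]) / noise**2
    score = -F @ K_inv                            # GP prior score at each particle
    diff = F[:, None, :] - F[None, :, :]          # particles live in R^m
    Kyy = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2))
    drive = Kyy @ score                           # kernel-weighted prior score
    repulse = (-diff / h**2 * Kyy[..., None]).sum(axis=0)  # SVGD repulsion term
    return F + lr * (grad_ll + lam * (drive + repulse) / r)

x = np.linspace(0, 4, 5)                          # fixed evaluation locations
K = np.exp(-(x[:, None] - x[None, :])**2 / 2) + 1e-6 * np.eye(5)
rng = np.random.default_rng(0)
F = rng.multivariate_normal(np.zeros(5), K, size=10)   # r = 10 particle functions
K_inv = np.linalg.inv(K)
for _ in range(500):
    F = sfvgd_step_values(F, np.array([1.0, -1.0]), [0, 4], K_inv)
```

After the updates, the particle-function values concentrate near the observations at the two data locations, while the repulsive term keeps the particles spread out elsewhere, mimicking the posterior GP's behavior away from the data.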

4. BENCHMARK STUDY

We further investigate the competitiveness of our approach via its neural network variant (SFVNN) against its closest neighbor, the functional variational Bayesian neural network (FVBNN) of Sun et al. (2019). We also include a well-established BNN baseline (Blundell et al., 2015) and the standard Gaussian process for the small data sets (where analytical computation is feasible). For results from SFVGB, we refer to Section A.3.2 in the Appendix; for most of the datasets, however, the NN provides a better fit to the data. Further details and a contextual bandits experiment can be found in Appendix A.5.

Data and experimental setup All data sets are standardized prior to model fitting and split into 90% training data and 10% test data. This splitting process is repeated 10 times with different splits to also evaluate the variability of each method. Further details on data sets, data set-specific pre-processing, and their references can be found in the Appendix.

Details on methods and comparisons In order to provide a fair comparison between methods, we reproduce the best results reported by Sun et al. (2019) for FVBNN and BNN, and also use the same hyperparameters for our method, except that while Sun et al. (2019) use λ = 1, we set λ to s/n. Details for each procedure are given in the Appendix. We compare methods based on the negative log-likelihood (NLL) and the root mean squared error (RMSE) on each test data set and calculate the mean and standard deviation across all 10 data splits.

Results Results are summarized in Tables 1 and 2, indicating that SFVNN is competitive with other existing approaches for both small and large data sets. As the two functional approaches (SFVNN, FVBNN) optimize the same objective, we would expect them to perform similarly, which is confirmed by the results.
We further observe that a weight-space approach (the BNN) seems to work better than the functional approaches for a few datasets (in particular for the Wine dataset, where this is expected as the outcome is discrete in nature).

5. CONCLUSION

We introduced a novel gradient descent in distribution space that allows us to update a set of particle functions in a general manner to resemble sample functions from a target process. SFVNN was found to be competitive with or to outperform FVBNN while having lower computational costs.

A APPENDIX

A.1 PROOFS A.1.1 FUNCTIONAL DERIVATIVE OF THE RISK FUNCTIONAL Assuming that p(x, y)L (x,y) [f ] ∈ L 1 using Fubini's theorem, we find that R[f ] = E (x,y)∼Pxy L (x,y) [f ] = X ×Y p(x, y)L (x,y) [f ]dx dy (19) = X p(x) Y p(y|x)L (x,y) [f ]dy =:L(x,f (x)) dx. (20) Assuming that L is sufficiently smooth using the Euler-Lagrange derivative, we find that ∇ f R[f ](x) = ∂ (L(x, f (x))) ∂f (x) (21) = p(x) Y p(y|x) ∂L (x,y) [f ] ∂f (x) dy (22) = p(x) • E y∼P y|x ∂L (x,y) [f ] ∂f (x) . A.1.2 FUNCTIONAL DERIVATIVE OF THE FUNCTIONAL ELBO Let F be an operator that depends on evaluation only with associated maps F X : Y s → Y s , F X : Y s+M → Y s+M . Further, let F[F ] = L Ds,X (Q [T ] ) = E y∼Q [T ] X ℓ(D s , y) =F1[F X ] -D KL (Q [T ] X ∥P f X ) =F2[F X ] . In general, and under the assumption that ℓ is sufficiently smooth, we find that ∇ F X F 1 [F X ] = ∇ F X E y∼Q X ℓ(D s , y + F X ( y)) = E y∼Q X ∇ y ℓ(D s , y + F X ( y)) • δ y (•) . ( ) Under the assumption that H Y ⊂ {f : Y s → Y s } is an RKHS with associated kernels k Y , we find that F 1 [F X + εG X ] -F 1 [F X ] = E y∼Q X ℓ(D s , y + F X ( y) + εG X ( y)) -ℓ(D s , y + F X ( y)) (25) = ε E y∼Q X ∇ y ℓ(D s , y + F X ( y)) • G X ( y) + O(ε 2 ) (26) = ε E y∼Q X ∇ y ℓ(D s , y + F X ( y)) • ⟨k Y ( y, •), G X ⟩ + O(ε 2 ) (27) = ε ⟨E y∼Q X ∇ y ℓ(D s , y + F X ( y)) • k Y ( y, •) , G X ⟩ + O(ε 2 ), from which it follows that ∇ F X F 1 [F X ] F X =0 = E y∼Q X ∇ y ℓ(D s , y + F X ( y)) • k Y ( y, •) F X =0 (29) = E y∼Q X ∇ y ℓ(D s , y) • k Y ( y, •) . ( ) Under the assumption that H Y ⊂ {f : Y s+M → Y s+M } is an RKHS with associated kernels k Y , it has been shown in Liu & Wang (2016) that ∇ F X F 2 [F X ] F X =0 = -E y∼Q X ∇ y log p P f X (y)k Y (y, •) + ∇ y k Y (y, •) . ( ) Using the chain rule, we obtain Further experimental details Model tuning of SGB is done as explained in Malinin et al. (2021) . ∇ F L Ds,X (Q [T ] ) F =0 (f ) = ∇ F X F 1 [F X ](f X )∇ F F X (f ) F =0 (32) -∇ F X F 2 [F X ](f X )∇ F [F X ] (f ) F =0 . 
Model tuning of SGB is done as explained in Malinin et al. (2021). Different tree depths ∈ {3, 4, 5, 6}, learning rates ∈ {0.001, 0.01, 0.1}, and data-subsampling fractions ∈ {0.25, 0.5, 0.75} are considered; the resulting approaches are trained on the first 80% of the training data and evaluated on the remaining 20%. The GLMB approach is tuned using 10-fold cross-validation to determine the number of stopping iterations, which is the only hyperparameter of the model.

Results

Results show that SFVGB can often improve over the GLMB baseline but still yields inferior performance on some data sets. This is, in particular, the case for rather discrete outcome spaces (e.g., Wine, which only consists of the values 0, 1, and 2). The SGB model, in turn, works better than both SFVGB and GLMB in most cases. This is likely due to its much more flexible base learner structure: SGB, like most state-of-the-art boosting approaches, uses trees, whereas GLMB uses linear regression and our approach uses kernel regression.

We use the benchmark data setup proposed by Hernandez-Lobato & Adams (2015) and Sun et al. (2019) for evaluating probabilistic regression approaches. This setup includes selected data sets from the UCI repository, namely the four smaller data sets Concrete, Energy, Wine, and Yacht, and the four larger data sets Naval, Protein, Video (Memory and Time), and GPU. In addition to Sun et al. (2019), we also compare our approaches on three additional smaller data sets (Airfoil, Diabetes, Forest Fire) and further investigate the second task on the Naval data set (i.e., we examine both the compressor decay and the turbine decay state coefficient, referred to as NavalC and NavalT, respectively). Table 5 lists the data characteristics and pre-processing steps.

A.4.2 FURTHER EXPERIMENTAL DETAILS

BNN approaches are fitted using the architecture and tuning parameters recommended by Sun et al. (2019). We reduced the number of epochs from 10,000 to 1,000 for the smaller data sets to reduce the computational runtime; this did not negatively impact the BNNs' performance. In all our benchmark experiments, we follow the setup for BNNs and FVBNNs described by Sun et al. (2019).

A.5 FURTHER RESULTS

Here, we provide further results on the application of probabilistic methods to contextual bandits.

Contextual Bandits. One important application of uncertainty-aware models is exploration, as in Bayesian optimization, reinforcement learning, or bandits. Following Sun et al. (2019), we evaluate SFVNN using the contextual bandits benchmark by Riquelme et al. (2018), re-running the settings investigated in Sun et al. (2019) and reporting the cumulative regret based on the best expected reward. Following Riquelme et al. (2018), we use the benchmark data sets Adult, Census, Covertype, Financial, Jester, Mushroom, Statlog, and Wheel. For these, the rewards are deterministic, and the regret is measured against the best realized reward. As in previous works, we report the regret relative to a random uniform sampling procedure that emulates Thompson Sampling (see Riquelme et al., 2018). As comparison methods, we use the BNN (with 50 and 500 units), three spin-offs of the NeuralLinear algorithm (namely, the Bootstrapped NN trained with RMSprop (BootRMS), Parameter Noise (ParamNoise), and Dropout; see Riquelme et al., 2018, for more details), as well as two variants of the FVBNN (with one and three layers, each with 50 units). Experiments are run 5 times with shuffled contexts, for which we report the mean and standard deviation of the relative cumulative regret. Results are given in Table 6, suggesting that both FVBNN and SFVBNN are well- and, in particular, similarly performing methods in the application of contextual bandits.
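The relative regret reported here can be computed as in the following short sketch. The function name and array conventions are our assumptions; the per-round best rewards serve as the regret reference, and the result is expressed in percent of the uniform-sampling baseline's cumulative regret.

```python
import numpy as np

def relative_cumulative_regret(method_rewards, best_rewards, uniform_rewards):
    """Cumulative regret of a method, in percent of the cumulative regret
    of uniform random action selection (the baseline in Riquelme et al.)."""
    best = np.asarray(best_rewards, dtype=float)
    regret = np.sum(best - np.asarray(method_rewards, dtype=float))
    regret_uniform = np.sum(best - np.asarray(uniform_rewards, dtype=float))
    return 100.0 * regret / regret_uniform
```

For example, a method that misses the best reward in one of four rounds, against a uniform baseline that misses all four, has a relative regret of 25.0.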

B RUNTIME EXPERIMENT

B.1 SETUP

We trained FVBNN and SFVNN 5 times on the Energy data set for each number of particle functions r ∈ {50, 100, 150} and plotted the resulting runtimes in Figure 2. It becomes apparent that the number of particle functions influences the runtime of FVBNN more substantially than that of SFVNN, as expected from our computational complexity analysis. All experiments and benchmarks were carried out on an internal cluster with an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (32 cores), 64 GB RAM, and Ubuntu 20.04.1 LTS.



If Y is numeric, Ỹ = Y. Otherwise, Ỹ is a numerical encoding of Y.



Figure 1: Illustrative example of SFVGB. The points in (a) represent the given data points, and the dashed lines represent all x values that define the marginal prior density in (c,d) w.r.t. the prior GP shown in (b). The arrows in (c) show the resulting SFVGD gradients.

A.4 NUMERICAL EXPERIMENTS: FURTHER DETAILS

A.4.1 FURTHER DATA DETAILS

The benchmark data setup follows Hernandez-Lobato & Adams (2015) and Sun et al. (2019).

Figure 2: Comparison of the runtimes of FVBNN and SFVNN on the Energy data set with 5 repetitions for each number of particle functions.

Comparison of different methods (columns) on small benchmark data sets (rows) using the average NLL (smaller is better) and RMSE over 10 train-test data splits with standard deviation in brackets. The best performing method for each data set is highlighted in bold.

Comparison of different methods (columns) on large benchmark data sets (rows) using the average NLL (smaller is better) and RMSE over 10 train-test data splits with standard deviation in brackets. The best performing method for each data set is highlighted in bold.

Comparison of boosting approaches on the small benchmark data sets (columns) using the average NLL (smaller is better) over 10 train-test data splits with standard deviation in brackets. The best performing method for each data set is highlighted in bold.

Comparison of boosting approaches on the small benchmark data sets (columns) using the average RMSE (smaller is better) over 10 train-test data splits with standard deviation in brackets. The best performing method for each data set is highlighted in bold.

Data set characteristics, additional pre-processing and references.

Relative contextual bandits regret (relative to the cumulative regret of Uniform sampling) for different data sets (columns) and methods (rows). Numbers in brackets of methods indicate the network sizes. Reported numbers are the mean (and standard derivation in brackets) over 5 trials. The best algorithms per data set are highlighted in bold.

Algorithm 3: Stein Functional Variational Gradient Boosting

Hyperparameters: same as for sfvgd_step
Input: number of iterations t_max, learning rate η^[t] in the t-th iteration, set of initial particle functions {f^[0]_i}_{i=1}^r treated as a multi-output function f^[0], multi-output base learner I_b
Output: set of particle functions {f^[t_max]_i}_{i=1}^r

If H is assumed to be an RKHS with associated kernel k_X(x, ·), then this becomes a kernel regression update (cf. Section 2.1).

A.2 GRADIENT BOOSTING

GB (Friedman, 2001) is a powerful supervised learning algorithm in which the residuals are iteratively minimized via so-called weak learners. The resulting model is a weighted ensemble of these weak learners. In its original form, tree-based learners were used as weak learners, which proved to be effective, especially in the presence of heterogeneous features. XGBoost (Chen & Guestrin, 2016) is a highly efficient algorithm that builds upon the GB paradigm and is a strong baseline for many structured, supervised regression and classification tasks.
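As an illustration of the GB principle described above, a minimal squared-error variant with depth-1 trees (stumps) can be sketched as follows. This is a toy sketch of the general paradigm, not the XGBoost implementation; function names are ours.

```python
import numpy as np

def fit_stump(x, g):
    """Fit a depth-1 regression tree (stump) to the pseudo-residuals g."""
    best_sse, best_split = np.inf, None
    for t in np.unique(x)[:-1]:                 # candidate thresholds
        left, right = g[x <= t], g[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((g - pred) ** 2).sum()
        if sse < best_sse:
            best_sse, best_split = sse, (t, left.mean(), right.mean())
    t, lv, rv = best_split
    return lambda z: np.where(z <= t, lv, rv)

def gradient_boost(x, y, n_iter=50, lr=0.1):
    """Squared-error GB: each weak learner fits the current residuals."""
    f = np.full_like(y, y.mean(), dtype=float)  # constant initial model
    for _ in range(n_iter):
        h = fit_stump(x, y - f)                 # weak learner on residuals
        f = f + lr * h(x)                       # damped ensemble update
    return f
```

For squared error, the negative gradient of the loss is exactly the residual y - f(x), which is why each weak learner is fitted to the residuals.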

A.3.1 SFVGB ALGORITHM

We treat the r particle functions mapping from X to Y as a single function f^[0] mapping from X to Y^r; i.e., we identify the i-th sample function f^[0]_i, i = 1, ..., r, with the i-th component of f^[0]. Analogously to standard GB, we choose a base learner I_b which defines a hypothesis space H ⊂ {f : X → Y^r}. While this requires the base learner I_b to be a multi-output learner, it is always possible to use an ensemble of single-output learners. With this, SFVGB is vanilla GB of a multi-output function f where the loss is the negative functional ELBO. Consequently, in the t-th iteration, we update f^[t] by fitting the base learner I_b to the stacked SFVGD gradients (ĝ_1(X), ..., ĝ_r(X))^⊤ from Eq. 18 and taking a step with a (potentially adaptive) learning rate η^[t].
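The SFVGB update just described can be sketched as the following generic loop. The interfaces here (sfvgd_step returning the stacked gradients at X, fit_base_learner returning a multi-output predictor) are our assumptions about the algorithm's structure, not the authors' code.

```python
import numpy as np

def sfvgb(X, F0, sfvgd_step, fit_base_learner, t_max=100, lr=0.05):
    """Sketch of Stein functional variational gradient boosting.

    X:  (s, d) evaluation inputs
    F0: (r, s) initial particle-function evaluations at X
    sfvgd_step(F): (r, s) SFVGD gradients at the current particles (Eq. 18)
    fit_base_learner(X, G): fits the multi-output base learner I_b to the
        gradients G and returns a predict function X -> (r, s)
    """
    F = F0.copy()
    learners = []
    for _ in range(t_max):
        G = sfvgd_step(F)              # functional gradient at the particles
        h = fit_base_learner(X, G)     # project it onto the hypothesis space
        F = F + lr * h(X)              # boosting step on the multi-output f
        learners.append(h)
    return F, learners
```

The fitted ensemble of learners, together with the initial particle functions, defines the final set of particle functions.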

A.3.2 SFVGB REGRESSION EXPERIMENTS

We also test SFVGB on the small regression data sets and compare it to 1) the approach proposed in Malinin et al. (2021), i.e., an uncertainty quantification approach that randomly subsamples the data in every iteration of a stochastic gradient boosting (SGB) model, and 2) a boosted generalized linear model (GLMB; Buehlmann, 2006) as a baseline approach. As base learners, SGB uses trees and GLMB uses linear models. Our model uses a Nadaraya-Watson kernel regression variant that does not scale the resulting sum of kernel functions. This is the natural learner if the hypothesis space H is assumed to be an RKHS, as discussed in Section 2.1. The hyperparameters of the SFVGD step are the same as those used for SFVNN. The results after 1,000 iterations are summarized in Table 4.
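The kernel-regression base learner mentioned above can be sketched as follows. The RBF kernel, the bandwidth h, and the normalize flag are our assumptions; normalize=False corresponds to the unscaled kernel-sum variant described in the text, normalize=True to standard Nadaraya-Watson regression.

```python
import numpy as np

def kernel_regression_learner(X_train, Y_train, h=1.0, normalize=True):
    """Nadaraya-Watson-style base learner for multi-output targets Y (n, d)."""
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=float)

    def predict(X):
        diff = np.asarray(X, dtype=float)[:, None, :] - X_train[None, :, :]
        K = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))  # (n_test, n_train)
        out = K @ Y_train                                 # kernel-weighted sum
        return out / K.sum(axis=1, keepdims=True) if normalize else out

    return predict
```

Because fitting amounts to storing the training pairs, this learner can be refitted cheaply in every boosting iteration.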

