APPROXIMATE BAYESIAN INFERENCE WITH STEIN FUNCTIONAL VARIATIONAL GRADIENT DESCENT

Abstract

We propose a general-purpose variational algorithm that forms a natural analogue of Stein variational gradient descent (SVGD) in function space. While SVGD iteratively updates a set of particles to match a target density, the method introduced here, Stein functional variational gradient descent (SFVGD), updates a set of particle functions to match a target stochastic process (SP). The update step is derived by minimizing the functional derivative of the Kullback-Leibler divergence between SPs. SFVGD can be used either to train Bayesian neural networks (BNNs) or for ensemble gradient boosting. We demonstrate the efficacy of training BNNs with SFVGD on various real-world datasets.

1. INTRODUCTION

Bayesian inference provides a powerful framework for data modeling and reasoning under uncertainty. However, this assumes that we can encode our prior knowledge in a meaningful manner. Typically, this is done by specifying the prior distribution of the model parameters. However, in machine learning (ML), models may consist of millions of parameters with highly complex interactions (e.g., very large neural networks (NNs)). Furthermore, the parameter structure of the model itself may change during training; e.g., the number of parameters grows when using gradient boosting (GB). This makes defining meaningful prior assumptions in parameter space difficult or practically infeasible. As we usually do not care about single parameters but about the complete resulting function, it seems intuitive to express our prior knowledge directly in the hypothesis function space by, e.g., specifying the characteristic length scale, periodicity, or smoothness in general.

Fortunately, Bayesian inference can also be formulated in function space. In this case, the prior and posterior distributions are stochastic processes (SPs). The most prominent representative is the Gaussian process (GP), for which the posterior GP can be computed analytically. However, training GPs scales cubically in the number of observations, and the implicit Gaussian likelihood assumption is often violated in practice. In this paper, we introduce Stein functional variational gradient descent (SFVGD). This method provides a general gradient descent method in function space that enables practitioners to train ML models to approximate the posterior SP, assuming certain regularity conditions on the prior SP and the likelihood function hold.

1.1. RELATED WORK

Kernelized Stein Methods. These methods combine Stein's identity with a reproducing kernel Hilbert space (RKHS) assumption. Based on a finite particle set, they can be used either to find the optimal transport direction toward a target density or to estimate the score gradient of the empirical distribution of the particles. The former is called Stein variational gradient descent (SVGD; Liu & Wang, 2016), and approaches of the latter category are called (non-parametric) score estimators (Zhou et al., 2020). Our method internally uses SVGD and forms its natural analogue in function space. Several extensions to SVGD exist, e.g., approaches incorporating second-order information such as Leviyev et al. (2022) and the more general matrix-valued kernel approach by Wang et al. (2019a). While these extensions usually outperform SVGD, their computational costs are also higher.

Bayesian Neural Networks (BNNs). Typically, BNNs are NNs with weight priors that are trained via variational inference. Prominent representatives are BNNs using Bayes by Backprop (Blundell et al., 2015) and scalable probabilistic backpropagation (Hernández-Lobato & Adams, 2015). Recently, Immer et al. (2020) proposed transforming BNNs into generalized linear models, with inference based on a Gaussian process equivalent to the model. While Markov chain Monte Carlo (MCMC) methods are often prohibitively expensive for BNNs, some variants, e.g., Chen et al. (2014), account for noisy gradient evaluations and can be used in this setting. However, MCMC-based methods are still usually employed only for relatively low-dimensional problems.

Functional BNNs (FBNNs). Sun et al. (2019) proposed using functional priors to train BNNs. Training BNNs with our descent is closely related to their method, but while they use a score-based approach to estimate the derivative of the Kullback-Leibler divergence D_KL between the prior SP and the variational SP, we estimate this derivative directly via SVGD.
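As a concrete point of reference for the SVGD update our method builds on, the standard particle update combines a kernel-smoothed score term with a repulsive term derived from the kernel gradient. Below is a minimal NumPy sketch with an RBF kernel and the common median-heuristic bandwidth; the function name, step size, and particle counts are illustrative, not prescribed by any of the cited works.

```python
import numpy as np

def svgd_phi(x, score):
    # One SVGD update direction for particles x of shape (n, d), given the
    # target's score function grad log p. Uses an RBF kernel
    # k(a, b) = exp(-||a - b||^2 / h) with a median-heuristic bandwidth h.
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]
    d2 = (diff ** 2).sum(-1)
    h = np.median(d2) / max(np.log(n), 1.0) + 1e-8
    K = np.exp(-d2 / h)
    attract = K @ score(x)                                       # drives particles toward high density
    repulse = (2.0 / h) * (K.sum(1, keepdims=True) * x - K @ x)  # kernel-gradient term keeps particles apart
    return (attract + repulse) / n

# transport particles initialized around 5 toward the target N(0, 1)
rng = np.random.default_rng(0)
x = rng.normal(5.0, 0.5, size=(50, 1))
for _ in range(1000):
    x = x + 0.1 * svgd_phi(x, lambda z: -z)  # score of N(0, 1) is -z
```

The repulsive term is what distinguishes SVGD from plain gradient ascent on log p: without it, all particles would collapse onto the mode instead of approximating the target distribution.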
Wang et al. (2019b) also use SVGD for FBNNs but apply SVGD directly to the D_KL between the posterior SP and the variational SP. Furthermore, their work does not show that this in fact maximizes a lower bound on the log marginal likelihood. Recently, Ma & Hernández-Lobato (2021) and Rudner et al. (2021) proposed different FBNN approaches that also build upon the results of Sun et al. (2019), but while their methods are specific to training NN function generators, our method can be used to update a set of particle functions in general.

Repulsive Deep Ensembles. Repulsive deep ensembles are deep ensembles that incorporate repulsive terms in their gradient update, forcing their members' weights apart. A variety of repulsive terms are presented by D'Angelo & Fortuin (2021) and D'Angelo et al. (2021), outperforming the approach by Wang et al. (2019b). However, these approaches mainly focus on weight priors, and the empirical findings also relate only to the weight space. In contrast to our work, functional priors can only be applied if a posterior SP with an analytical marginal density exists.

GB with Uncertainty. The closest neighbor of our approach applied to GB is the ensemble GB scheme proposed by Malinin et al. (2021), which is based on Bayesian ensembles. In contrast to our functional approach, their method approximates the posterior of the model parameters. Another GB-based method is NGBoost (Duan et al., 2019), which directly learns the predictive uncertainty; however, prior knowledge cannot be taken into account.

1.2. OUR CONTRIBUTION

We propose a natural extension of SVGD to function space (Section 3), which enables the practitioner to match a target SP. This approach can be implemented in a BNN or as a GB routine (Section 3.3). Using real-world benchmarks, we show that the resulting generator training algorithm is competitive while incurring lower computational costs than the approach of Sun et al. (2019). In contrast to other existing uncertainty-aware GB algorithms, a GB ensemble trained via SFVGD can naturally incorporate prior functional information. These versatile applications of our framework are made possible by a unifying view of NNs and GB from a functional analysis perspective.

2.1. SUPERVISED ML FROM A FUNCTIONAL ANALYSIS PERSPECTIVE

Given a labeled dataset D ∈ (X × Y)^n ∼ P_xy^n of n ∈ N independent and identically distributed (i.i.d.) observations from an unknown data-generating process P_xy, a supervised ML algorithm tries to construct a risk-optimal model f under a pre-specified loss L. Here, the function f defines a mapping from the feature space X to the target space Y. The learning algorithm I used to construct f is a function mapping from the set of all datasets ⋃_{n∈N} (X × Y)^n to a hypothesis space H, which is a subset of the set of all functions mapping from X to the model output space¹ Ỹ ⊂ R^g with g ∈ N. In order to specify the goodness-of-fit of a function f, one can define a loss function L : Y × Ỹ → R, (y, f(x)) ↦ L(y, f(x)), which measures how well the output of a fixed model f ∈ H fits an observation (x, y) ∼ P_xy. In the following, we present supervised ML from a functional analysis perspective. Here, we fix the observation and associate the loss L with the loss functional L_(x,y) : H → R, f ↦ L(y, f(x)). Based on this loss functional, we can define the risk functional of a model f,

    R[f] = E_{(x,y)∼P_xy} [ L_(x,y)[f] ],    (1)



¹ If Y is numeric, Ỹ = Y. Otherwise, Ỹ is a numerical encoding of Y.

