ELBO-ING STEIN MIXTURES

Abstract

Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a particle-based technique for Bayesian inference. SVGD has recently gained popularity because it combines the ability of variational inference to handle tall data with the modeling power of non-parametric inference. Unfortunately, variance estimation scales inversely with model dimensionality, leading to underestimation, meaning more particles are required to represent high-dimensional models adequately. Stein mixtures (Nalisnick & Smyth, 2017) alleviate the exponential growth in the number of required particles by letting each particle parameterize a distribution. However, the inference algorithm proposed by Nalisnick & Smyth (2017) can be numerically unstable. We show that their algorithm corresponds to inference with the Rényi α-divergence for α = 0 and that other values of α can lead to more stable inference. We empirically study the performance of Stein mixtures inferred with different α values on various real-world problems, demonstrating significantly and consistently improved results when using α = 1, which corresponds to using the evidence lower bound (ELBO). We call this instance of our algorithm ELBO-within-Stein. An easy-to-use version of the inference algorithm (for arbitrary α ∈ ℝ) is available in the deep probabilistic programming language NumPyro (Phan et al., 2019).

1. INTRODUCTION

The ability of Bayesian deep learning to quantify the uncertainty of predictions by deep models is causing a surge of interest in using these techniques (Izmailov et al., 2021). Bayesian inference aims to describe i.i.d. data $\mathcal{D} = \{x_i\}_{i=1}^n$ using a model with a latent variable $z$. Bayesian inference does this by computing a posterior distribution $p(z \mid \mathcal{D})$ over the latent variable given a model describing the joint distribution $p(z, \mathcal{D}) = p(\mathcal{D} \mid z)\,p(z)$. We obtain the posterior by following Bayes' theorem, $p(z \mid \mathcal{D}) = \prod_{i=1}^n p(x_i \mid z)\,p(z) / p(\mathcal{D})$, where $p(\mathcal{D}) = \int \prod_{i=1}^n p(x_i \mid z)\,p(z)\,\mathrm{d}z$ is the normalization constant. For most practical models, the normalization constant lacks an analytic solution or is computationally intractable, complicating the Bayesian inference problem. Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a recent technique for Bayesian inference that uses a set of particles $Z = \{z_i\}_{i=1}^N$ to approximate the posterior $p(z \mid \mathcal{D})$. The idea behind SVGD is to iteratively transport $Z$ according to a force field $S_Z$, called the Stein force, given by
$$S_Z(z_i) = \mathbb{E}_{z_j \sim q_Z}\left[ k(z_i, z_j)\, \nabla_{z_j} \log p(z_j \mid \mathcal{D}) + \nabla_{z_j} k(z_i, z_j) \right], \qquad (1)$$
where $k(\cdot, \cdot)$ is a reproducing kernel (Berlinet & Thomas-Agnan, 2011), $q_Z = N^{-1} \sum_i \delta_{z_i}$ is the empirical measure on the set of particles $Z$, $\delta_{z_i}$ denotes the Dirac delta measure at $z_i$, and $\nabla_{z_j} \log p(z_j \mid \mathcal{D})$ is the gradient of the log-posterior with respect to the $j$-th particle. The technique is scalable to tall data (i.e., datasets with many data points) and offers the flexibility and scope of techniques such as Markov chain Monte Carlo (MCMC). SVGD is good at capturing multi-modality (Liu & Wang, 2016; Wang & Liu, 2019) and has useful theoretical interpretations, such as a set of particles following a gradient flow (Liu, 2017) or in terms of the properties of kernels (Liu & Wang, 2018).
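As a concrete reference, one SVGD transport step in Equation (1) can be sketched in a few lines of NumPy with an RBF kernel; the bandwidth, step size, and particle count below are illustrative choices, not values from the paper:

```python
import numpy as np

def rbf_kernel(zs, bandwidth=1.0):
    # Pairwise RBF kernel k(z_i, z_j) and its gradient w.r.t. z_j.
    diffs = zs[:, None, :] - zs[None, :, :]            # (N, N, d), z_i - z_j
    sq_dists = np.sum(diffs ** 2, axis=-1)             # (N, N)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    # grad_{z_j} k(z_i, z_j) = (z_i - z_j) / h^2 * k(z_i, z_j)
    grad_K = diffs / bandwidth ** 2 * K[..., None]     # (N, N, d)
    return K, grad_K

def svgd_step(zs, grad_logp, step_size=0.1, bandwidth=1.0):
    # One SVGD update: z_i <- z_i + eps * S_Z(z_i), with the Stein force
    # S_Z(z_i) = E_{z_j ~ q_Z}[k(z_i, z_j) grad log p(z_j|D) + grad_{z_j} k(z_i, z_j)].
    K, grad_K = rbf_kernel(zs, bandwidth)
    scores = np.stack([grad_logp(z) for z in zs])      # (N, d)
    force = (K @ scores + grad_K.sum(axis=1)) / len(zs)
    return zs + step_size * force
```

Iterating `svgd_step` with the score of the target, e.g. `grad_logp = lambda z: -z` for a standard normal, attracts the particles to high-density regions while the kernel-gradient term acts as a repulsive force that keeps them spread out.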
The main problem is that SVGD suffers from the curse of dimensionality: variance estimation scales inversely with dimensionality (Ba et al., 2021). Nalisnick & Smyth (2017) suggest resolving this by using a Stein mixture (SM). SMs lift each particle to the parameters of a variational distribution $q$, also called a guide. The idea is that each guide in the Stein mixture represents the density of multiple SVGD particles, thereby reducing the number of particles needed to represent a posterior. The Nalisnick & Smyth algorithm introduces guides by replacing each posterior gradient $\nabla_{z_j} \log p(z_j \mid \mathcal{D})$ in Equation (1) with the corresponding gradient of the marginal log-variational likelihood, given by
$$\log p(\mathcal{D} \mid \boldsymbol{\phi}_j) = \log \mathbb{E}_{q(z \mid \mathcal{D}, \boldsymbol{\phi}_j)}\left[ \frac{p(\mathcal{D}, z)}{q(z \mid \mathcal{D}, \boldsymbol{\phi}_j)} \right]. \qquad (2)$$
Here, we denote the particles by $\Phi = \{\boldsymbol{\phi}_j\}_{j=1}^N$ instead of $Z = \{z_i\}_{i=1}^N$ to emphasize that they parameterize guide components $q(z \mid \boldsymbol{\phi}_j, \mathcal{D})$. The change in gradient corresponds to minimizing $D_{\mathrm{KL}}[q_\Phi(\boldsymbol{\phi}) \,\|\, p(\boldsymbol{\phi} \mid \mathcal{D})]$ rather than $D_{\mathrm{KL}}[q_Z(z) \,\|\, p(z \mid \mathcal{D})]$, as in SVGD. Note that the line between the model $p$ and guide $q$ becomes blurred, as $p(\mathcal{D} \mid \boldsymbol{\phi})$ is random both in the data $\mathcal{D}$, as is usually the case, and in the guide hyper-parameters $\boldsymbol{\phi}$ (Ranganath et al., 2016; Nalisnick & Smyth, 2017). To distinguish the two, we subsequently refer to $p(\mathcal{D})$ as the evidence and $p(\mathcal{D} \mid \boldsymbol{\phi})$ as the hierarchical likelihood. The Stein force using the log hierarchical likelihood, which we call the hierarchical Stein force $S_\Phi^H$, becomes
$$S_\Phi^H(\boldsymbol{\phi}_i) = \mathbb{E}_{\boldsymbol{\phi}_j \sim q_\Phi}\left[ k(\boldsymbol{\phi}_i, \boldsymbol{\phi}_j)\, \nabla_{\boldsymbol{\phi}_j} \log \mathbb{E}_{q(z \mid \mathcal{D}, \boldsymbol{\phi}_j)}\left[ \frac{p(\mathcal{D}, z)}{q(z \mid \mathcal{D}, \boldsymbol{\phi}_j)} \right] + \nabla_{\boldsymbol{\phi}_j} k(\boldsymbol{\phi}_i, \boldsymbol{\phi}_j) \right],$$
where $q_\Phi$ is an empirical measure analogous to $q_Z$. Inference converges (i.e., reaches a fixed point) when $S_\Phi^H(\boldsymbol{\phi}_i) = 0$ for all particles, meaning all gradients in $S_\Phi^H$ must cancel (i.e., sum to zero).
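For intuition, the expectation in Equation (2) can be approximated by simple Monte Carlo over guide draws. The sketch below is our illustration, not the paper's implementation: it assumes a diagonal-Gaussian guide parameterized by (`mu`, `log_sigma`) and a user-supplied `log_joint(z)` returning log p(D, z), and uses a log-sum-exp trick for numerical stability:

```python
import numpy as np

def log_hierarchical_likelihood(log_joint, mu, log_sigma, num_draws=64, rng=None):
    # Monte Carlo estimate of log p(D | phi) = log E_q[ p(D, z) / q(z | phi) ]
    # for a diagonal-Gaussian guide q(z | phi) = N(mu, exp(log_sigma)^2).
    rng = rng or np.random.default_rng(0)
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((num_draws, mu.shape[0]))
    z = mu + sigma * eps                                  # reparameterized draws
    # log q(z | phi) under the diagonal Gaussian guide.
    log_q = np.sum(-0.5 * eps ** 2 - log_sigma - 0.5 * np.log(2 * np.pi), axis=-1)
    log_w = np.array([log_joint(zi) for zi in z]) - log_q  # log importance weights
    m = log_w.max()                                        # log-sum-exp stabilization
    return m + np.log(np.mean(np.exp(log_w - m)))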
However, computing the gradient of the log hierarchical likelihood requires numerical estimation, as analytical solutions do not exist for most models. Hence, we cannot expect inference to converge with noisy gradient estimates, as the Stein force will compensate for the error in the gradient with a counterforce in the next iteration. Therefore, SMs require good (i.e., low relative variance) gradient approximations; otherwise, the particles will fluctuate around a fixed point without reaching it. We demonstrate that replacing the log hierarchical likelihood with the evidence lower bound (ELBO) can provide better (lower relative variance) gradient approximations. We call the new algorithm ELBO-within-Stein (EoS). We connect EoS with the algorithm proposed by Nalisnick & Smyth (2017) in terms of computing the gradient of different orders of the variational Rényi (VR) bound (Van Erven & Harremos, 2014). Similarly to the ELBO, the VR bound is a lower bound on the log evidence $\log p(\mathcal{D})$, where $p(\mathcal{D})$ is also called the normalization constant, and is given by
$$\log p(\mathcal{D}) \geq \frac{1}{1-\alpha} \log \mathbb{E}_{q(z \mid \mathcal{D}, \boldsymbol{\phi})}\left[ \left( \frac{p(\mathcal{D}, z)}{q(z \mid \mathcal{D}, \boldsymbol{\phi})} \right)^{1-\alpha} \right],$$
where $\alpha \geq 0$ is known as its order¹. Understanding the inference of SMs in terms of the VR bound yields insight into the behavior of the two algorithms, as we can now understand $\alpha$ as controlling the variance of each component of the guide. Furthermore, presuming accurate gradient approximation for all (viable) values of $\alpha$, the connection leads to a family of inference algorithms indexed by the VR bound order. After reviewing SVGD, the Rényi divergence, and the signal-to-noise ratio (SN-ratio) that is used to estimate the relative variance in Section 2, we make the following contributions:

• We demonstrate that inaccurate gradient estimates can lead to issues with convergence for SMs.

• We introduce a new family of inference algorithms for SMs indexed by the parameter $\alpha$. The family results from connecting inference with SMs to the Rényi α-divergence and includes the inference algorithm by Nalisnick & Smyth (2017) as a special case for $\alpha = 0$.

• Unlike previous work, our algorithm allows for investigating a range of values of $\alpha$ for a model of interest. This allows us to investigate the convergence stability for different $\alpha$'s by measuring the SN-ratio. We find that $\alpha = 1$ is optimal for models with a latent variable
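To make the family concrete, the VR bound can be estimated from log importance weights $\log w_s = \log p(\mathcal{D}, z_s) - \log q(z_s \mid \boldsymbol{\phi})$, $z_s \sim q$, for any order. The sketch below (our illustration, assuming such weights are already available) shows how $\alpha = 1$ reduces to the standard ELBO estimator, while $\alpha = 0$ recovers the log hierarchical likelihood estimate used by Nalisnick & Smyth (2017):

```python
import numpy as np

def vr_bound(log_w, alpha):
    # Monte Carlo estimate of the VR bound of order alpha from
    # log importance weights log_w[s] = log p(D, z_s) - log q(z_s | phi).
    # alpha = 1 gives the ELBO; alpha = 0 gives the log hierarchical likelihood.
    if np.isclose(alpha, 1.0):
        return float(np.mean(log_w))          # ELBO: E_q[log p(D, z) - log q(z)]
    m = log_w.max()                           # stabilize before exponentiating
    scaled = (1.0 - alpha) * (log_w - m)
    return float(m + np.log(np.mean(np.exp(scaled))) / (1.0 - alpha))
```

Because the estimate is a power mean of the weights, it is non-increasing in $\alpha$: smaller orders give tighter but typically higher-variance estimates, which is the trade-off the SN-ratio is used to quantify.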



¹ The VR bound can be extended to α ∈ ℝ. We presume α is finite, but we allow α to be less than or equal to zero (Van Erven & Harremos, 2014).

