ELBO-ING STEIN MIXTURES

Abstract

Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a particle-based technique for Bayesian inference. SVGD has recently gained popularity because it combines the ability of variational inference to handle tall data with the modeling power of non-parametric inference. Unfortunately, variance estimation scales inversely with the dimensionality of the model, leading to underestimation, so more particles are required to represent high-dimensional models adequately. Stein mixtures (Nalisnick & Smyth, 2017) alleviate the exponential growth in particles by letting each particle parameterize a distribution. However, the inference algorithm proposed by Nalisnick & Smyth (2017) can be numerically unstable. We show that their algorithm corresponds to inference with the Rényi α-divergence for α = 0 and that using other values for α can lead to more stable inference. We empirically study the performance of Stein mixtures inferred with different α values on various real-world problems, demonstrating significantly and consistently improved results when using α = 1, which corresponds to using the evidence lower bound (ELBO). We call this instance of our algorithm ELBO-within-Stein. An easy-to-use version of the inference algorithm (for arbitrary α ∈ R) is available in the deep probabilistic programming language NumPyro (Phan et al., 2019).

1. INTRODUCTION

The ability of Bayesian deep learning to quantify the uncertainty of predictions by deep models is causing a surge of interest in using these techniques (Izmailov et al., 2021). Bayesian inference aims to describe i.i.d. data $D = \{x_i\}_{i=1}^{n}$ using a model with a latent variable $z$. It does this by computing a posterior distribution $p(z|D)$ over the latent variable given a model describing the joint distribution $p(z, D) = p(D|z)p(z)$. We obtain the posterior by following Bayes' theorem, $p(z|D) = \prod_{i=1}^{n} p(x_i|z)\,p(z) / p(D)$, where $p(D) = \int_z \prod_{i=1}^{n} p(x_i|z)\,p(z)\,dz$ is the normalization constant. For most practical models, the normalization constant lacks an analytic solution or poses a computability problem, complicating the Bayesian inference problem. Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a recent technique for Bayesian inference that uses a set of particles $Z = \{z_i\}_{i=1}^{N}$ to approximate the posterior $p(z|D)$. The idea behind SVGD is to iteratively transport $Z$ according to a force field $S_Z$, called the Stein force. The Stein force is given by

$$S_Z(z_i) = \mathbb{E}_{z_j \sim q_Z}\left[k(z_i, z_j)\,\nabla_{z_j} \log p(z_j|D) + \nabla_{z_j} k(z_i, z_j)\right], \qquad (1)$$

where $k(\cdot, \cdot)$ is a reproducing kernel (Berlinet & Thomas-Agnan, 2011), $q_Z = N^{-1} \sum_i \delta_{z_i}$ is the empirical measure on the set of particles $Z$, $\delta_x(y)$ represents the Dirac delta measure, which is equal to 1 if $x = y$ and 0 otherwise, and $\nabla_{z_j} \log p(z_j|D)$ is the gradient of the log posterior with respect to the $j$-th particle. The technique is scalable to tall data (i.e. datasets with many data points) and offers the flexibility and scope of techniques such as Markov chain Monte Carlo (MCMC). SVGD is good at capturing multi-modality (Liu & Wang, 2016; Wang & Liu, 2019), and has useful theoretical interpretations such as a set of particles following a gradient flow (Liu, 2017) or in terms of the properties of kernels (Liu & Wang, 2018).
The main problem is that SVGD suffers from the curse of dimensionality: variance estimation scales inversely with dimensionality (Ba et al., 2021). Nalisnick & Smyth (2017) suggest resolving this by using a Stein mixture (SM). SMs lift each particle to the parameters of a variational distribution $q$, also called a guide. The idea is that each guide in the Stein mixture represents the density of multiple particles in SVGD, thereby reducing the number of particles needed to represent a posterior. The Nalisnick & Smyth algorithm introduces guides by replacing each posterior gradient $\nabla_{z_j} \log p(z_j|D)$ in Equation (1) with the corresponding gradient of the marginal log-variational likelihood given by

$$\log p(D|\boldsymbol{\phi}_j) = \log \mathbb{E}_{q(z|D,\boldsymbol{\phi}_j)}\left[\frac{p(D, z)}{q(z|D, \boldsymbol{\phi}_j)}\right]. \qquad (2)$$

Here, we denote the particles by $\Phi = \{\boldsymbol{\phi}_i\}_{i=1}^{N}$ instead of $Z = \{z_i\}_{i=1}^{N}$ to emphasize that they parameterize guide components $q(z|\boldsymbol{\phi}_i, D)$. The change in gradient corresponds to minimizing $D_{KL}[q_\Phi(\boldsymbol{\phi}) \,\|\, p(\boldsymbol{\phi}|D)]$ rather than $D_{KL}[q_Z(z) \,\|\, p(z|D)]$, as in SVGD. Note that the line between the model $p$ and guide $q$ becomes blurred, as $p(D|\boldsymbol{\phi})$ is random not only in the data $D$, as is usually the case, but also in the guide hyper-parameters $\boldsymbol{\phi}$ (Ranganath et al., 2016; Nalisnick & Smyth, 2017). To distinguish the two, we subsequently refer to $p(D)$ as the evidence and $p(D|\boldsymbol{\phi})$ as the hierarchical likelihood. The Stein force using the log hierarchical likelihood, which we call the hierarchical Stein force $S^H_\Phi$, becomes

$$S^H_\Phi(\boldsymbol{\phi}_i) = \mathbb{E}_{\boldsymbol{\phi}_j \sim q_\Phi}\left[k(\boldsymbol{\phi}_i, \boldsymbol{\phi}_j)\,\nabla_{\boldsymbol{\phi}_j} \log \mathbb{E}_{q(z|D,\boldsymbol{\phi}_j)}\left[\frac{p(D, z)}{q(z|D, \boldsymbol{\phi}_j)}\right] + \nabla_{\boldsymbol{\phi}_j} k(\boldsymbol{\phi}_i, \boldsymbol{\phi}_j)\right], \qquad (3)$$

where $q_\Phi$ is an empirical measure analogous to $q_Z$. Inference converges (i.e. reaches a fixed point) when $S^H_\Phi(\boldsymbol{\phi}_i) = 0$ for all particles, meaning all gradients in $S^H_\Phi$ must cancel (i.e. sum to zero).
However, computing the gradient of the log-variational likelihood requires numerical estimation, as analytical solutions do not exist for most models. Hence, we cannot expect the inference to converge with noisy gradient estimates, as the Stein force will compensate for the error in the gradient by a counterforce in the next iteration. Therefore, SMs require good (i.e. low relative variance) gradient approximations; otherwise, the particles will fluctuate around a fixed point without reaching it. We demonstrate that replacing the log hierarchical likelihood with the evidence lower bound (ELBO) can provide better (lower relative variance) gradient approximations. We call the new algorithm ELBO-within-Stein (EoS). We connect EoS with the algorithm proposed by Nalisnick & Smyth (2017) in terms of computing the gradient of different orders of the variational Rényi (VR) bound (Van Erven & Harremos, 2014). Similarly to the ELBO, the VR bound is a lower bound of the log evidence, $\log p(D)$, where $p(D)$ is also called the normalization constant, and is given by

$$\log p(D) \geq \frac{1}{1-\alpha} \log \mathbb{E}_{q(z|D,\boldsymbol{\phi})}\left[\left(\frac{p(D, z)}{q(z|D, \boldsymbol{\phi})}\right)^{1-\alpha}\right],$$

where $\alpha \geq 0$ is known as its order. Understanding the inference of SMs in terms of the VR bound yields insight into the behavior of the two algorithms, as we can now understand α as controlling the variance of each component of the guide. Furthermore, presuming accurate gradient approximation for all (viable) values of α, the connection leads to a family of inference algorithms indexed by the VR bound order. After reviewing SVGD, the Rényi divergence, and the signal-to-noise ratio (SN-ratio) that is used to estimate the relative variance in Section 2, we make the following contributions:

• We demonstrate that inaccurate gradient estimates can lead to issues with convergence for SMs.
• We introduce a new family of inference algorithms for SMs indexed by the parameter α. The family results from connecting inference with SMs to the Rényi α-divergence and includes the inference algorithm by Nalisnick & Smyth (2017) as a special case for α = 0.
• Unlike previous work, our algorithm allows for investigating a range of values for α for a model of interest. This allows us to investigate the convergence stability for different values of α by measuring the SN-ratio. We find that α = 1 is optimal for models with a latent variable for each data point (local latent variables), resulting in better SN-ratios than all other α values. For models where all data points share a latent variable (global latent variables), using α = 0.5 (corresponding to the Hellinger distance) is on par with the algorithm of Nalisnick & Smyth (2017) (which corresponds to α = 0). Other values for α result in worse SN-ratios.
• We evaluate our inference algorithm for different values of α on Bayesian neural networks (BNNs) and variational autoencoders (VAEs), showing that the α that results in the highest performance varies depending on both model and data set.
• We describe a black-box inference algorithm for our proposed family of inference algorithms and provide a software library, called EinSteinVI, in NumPyro.

In Section 4 we discuss related work. We benchmark our algorithm in Section 5. Finally, we summarize our results in Section 6.

2. BACKGROUND

Let $z$ be a latent variable of interest taking values in a space $Z \subseteq \mathbb{R}^d$ (up to a diffeomorphism) and $D = \{x_i\}_{i=1}^{n}$, $n \in \mathbb{N}$, be a set of i.i.d. observations. For many models, exact Bayesian inference is computationally impracticable due to the cost of evaluating the evidence $p(D)$. Therefore, practitioners turn to tractable approximate variational inference (VI). VI aims to bring a computationally cheap variational distribution $q(z|D)$ close to the model posterior. Typically, we measure closeness by the Kullback-Leibler divergence ($D_{KL}$), i.e. $D_{KL}[q(z|D) \,\|\, p(z|D)]$. However, we generally avoid directly evaluating $D_{KL}[q(z|D) \,\|\, p(z|D)]$ as this requires evaluating the evidence, $p(D)$. We will concern ourselves with two types of VI. The first type of VI searches for a parameterization $\psi^*$ of $q$ in a family of distributions $Q$ that minimizes the divergence to the posterior. When the divergence is measured by $D_{KL}$, this type of VI is made tractable by maximizing the evidence lower bound (ELBO), that is

$$\psi^* = \arg\max_\psi \left(\log p(D) - D_{KL}[q(z|D; \psi) \,\|\, p(z|D)]\right) = \arg\max_\psi \mathbb{E}_{q(z|D;\psi)}\left[\log \frac{p(D, z)}{q(z|D; \psi)}\right].$$

The second type of VI we consider relies on particle-based methods and is the focus of this article. This type of VI relies on transporting a finite set of particles such that their empirical measure is close to the posterior. We will discuss this method in detail below.
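To make the ELBO concrete, the following is a minimal sketch on a toy conjugate model that is an assumption of this example and not one of the paper's benchmarks: $z \sim N(0, 1)$ and $x|z \sim N(z, 1)$, so that the evidence is $N(x; 0, 2)$ and the exact posterior is $N(x/2, 1/2)$. The function names are hypothetical. The Monte Carlo ELBO matches the log evidence when the guide equals the posterior and falls below it otherwise.

```python
import math
import random


def log_normal(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def log_evidence(x):
    # z ~ N(0, 1) and x | z ~ N(z, 1) imply x ~ N(0, 2).
    return log_normal(x, 0.0, 2.0)


def elbo_estimate(x, q_mean, q_var, draws=5000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] for a Gaussian guide."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        total += log_joint - log_normal(z, q_mean, q_var)
    return total / draws


def elbo_gap(x, q_mean, q_var):
    """Log evidence minus the ELBO estimate, i.e. an estimate of the KL term."""
    return log_evidence(x) - elbo_estimate(x, q_mean, q_var)
```

For `x = 1.0`, `elbo_gap(1.0, 0.5, 0.5)` is zero up to floating-point error because the guide is the exact posterior, while a mismatched guide such as `elbo_gap(1.0, 0.0, 1.0)` gives a clearly positive gap.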

2.1. STEIN VARIATIONAL GRADIENT DESCENT

The core idea of SVGD is to perform inference by approximating the target posterior distribution $p(z|D)$ by an empirical distribution $q_Z(z) = N^{-1} \sum_i \delta_{z_i}(z)$ based on a set of particles $Z = \{z_i\}_{i=1}^{N}$. One could thus see the approximating distribution $q_Z(z)$ as a (uniform) mixture of point estimates, each represented by a particle $z_i \in Z$. The SVGD algorithm minimizes the Kullback-Leibler divergence $D_{KL}[q_Z(z) \,\|\, p(z|D)]$ between the approximated and the true posterior by iteratively updating the particles using the following expression:

$$z_{i+1} \leftarrow z_i + \epsilon S_Z(z_i),$$

where $\epsilon$ is the learning rate and $S_Z$ denotes the Stein force.

The two forces of SVGD. The Stein force $S_Z$ consists of two underlying forces that work additively, with $S_Z = S^+_Z + S^-_Z$. The attractive force is given by

$$S^+_Z(z_i) = \mathbb{E}_{z_j \sim q_Z}\left[k(z_i, z_j)\,\nabla_{z_j} \log p(z_j|D)\right]$$

and the repulsive force by

$$S^-_Z(z_i) = \mathbb{E}_{z_j \sim q_Z}\left[\nabla_{z_j} k(z_i, z_j)\right]. \qquad (5)$$

Here $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel. The attractive force can be seen as pushing the particles towards the modes of the true posterior distribution, smoothed by some kernel. The repulsive force stops particles with high kernel values from collapsing onto each other. In Appendix C, we demonstrate the repulsive behavior for a radial basis function (RBF) kernel. The computational cost of $S_Z$ is quadratic in the size of $Z$, i.e. $O(N^2)$, which makes SVGD computationally burdensome for high-dimensional posteriors. For a particle method such as SVGD, the number of particles required to represent a posterior distribution adequately is exponential in its dimensionality. SVGD suffers from the curse of dimensionality (Ba et al., 2021), which results in variance collapse (i.e. variance is underestimated). Wang et al. (2018) demonstrate the problem with a simple factorized Gaussian, suggesting that the (RBF) kernel introduces global statistical dependence, driving up the number of particles needed for accurate representation. Ba et al. (2021) demonstrate that the collapse is due to the deterministic update of the attractive force. They do this by showing that re-sampling the particles at each iteration eliminates the underestimation of variance. Note that the particle re-sampling scheme of Ba et al. (2021) is not generally tractable; hence it does not suffice as a practical solution.
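The SVGD update and its two forces can be sketched in a few lines. The example below is a one-dimensional illustration under stated assumptions: the target is a standard Gaussian (so the gradient of the log density is $-z$), the RBF bandwidth is fixed to `h = 1.0` rather than chosen by a median heuristic, and all function names are hypothetical. It is not the NumPyro implementation.

```python
import math
import random


def rbf(a, b, h):
    """RBF kernel k(a, b) = exp(-(a - b)^2 / h) for scalar particles."""
    return math.exp(-((a - b) ** 2) / h)


def stein_force(particles, grad_log_p, h):
    """Stein force at every particle: attractive plus repulsive term."""
    n = len(particles)
    forces = []
    for zi in particles:
        total = 0.0
        for zj in particles:
            k = rbf(zi, zj, h)
            attract = k * grad_log_p(zj)
            # Gradient of k(z_i, z_j) with respect to z_j; points from z_j to z_i.
            repulse = (2.0 / h) * (zi - zj) * k
            total += attract + repulse
        forces.append(total / n)  # expectation under the empirical measure
    return forces


def svgd(particles, grad_log_p, step=0.1, h=1.0, iters=500):
    """Iterate z <- z + step * S_Z(z) for a fixed number of steps."""
    z = list(particles)
    for _ in range(iters):
        f = stein_force(z, grad_log_p, h)
        z = [zi + step * fi for zi, fi in zip(z, f)]
    return z


def demo():
    rng = random.Random(0)
    init = [rng.uniform(4.0, 6.0) for _ in range(10)]
    # Assumed target: N(0, 1), whose log density has gradient -z.
    return svgd(init, lambda z: -z)
```

Calling `demo()` moves ten particles that start far from the mode to a spread-out configuration centred near zero: the attractive term pulls them towards the mode, and the repulsive term keeps them from collapsing onto it.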

2.2. RÉNYI DIVERGENCE AND THE VARIATIONAL RÉNYI BOUND

The Rényi divergence (Rényi, 1961) is a family of divergences between distributions $p$ and $q$ indexed by the order parameter $\alpha \in \mathbb{R}^+ \setminus \{0, 1\}$ with $|D_\alpha| < \infty$. The divergence is given by

$$D_\alpha[p \,\|\, q] = \frac{1}{\alpha - 1} \log \int p(z)^\alpha q(z)^{1-\alpha}\,dz.$$

The Rényi divergence can be extended to $\alpha \in \{0, 1, \infty\}$ by continuity. In addition, if we allow for $D_\alpha[p \,\|\, q] \leq 0$, the order can be further extended to $\alpha \in \mathbb{R}$ (Van Erven & Harremos, 2014). Several orders correspond to known divergences (see Van Erven & Harremos (2014) and Li & Turner (2016) for an overview). In particular, α = 1 corresponds to $D_{KL}$. Analogous to the use of $D_{KL}$ in the ELBO, $D_\alpha$ leads to a variational Rényi bound (Li & Turner, 2016) which, when formulated as used with SMs, is given by

$$\log p(D) - D_\alpha[q(z|D, \boldsymbol{\phi}) \,\|\, p(z|D)] = \frac{1}{1-\alpha} \log \mathbb{E}_{q(z|D,\boldsymbol{\phi})}\left[\left(\frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right)^{1-\alpha}\right]. \qquad (6)$$

Note that the hyper-parameters $\boldsymbol{\phi}$ in the variational posterior, $q(z|D, \boldsymbol{\phi})$, are lifted to random variables when doing inference with SMs. See Appendix A for the derivation of Equation (6). Assuming reparameterization of $z$ is possible, we can approximate the gradient $\Lambda(\boldsymbol{\phi})$ of Equation (6) using Monte Carlo integration by

$$\Lambda_K(\boldsymbol{\phi}) = \sum_{k=1}^{K} \omega^\alpha_k(Z, D)\,\nabla_{\boldsymbol{\phi}} \log \frac{p(Z_k, D)}{q(Z_k|D, \boldsymbol{\phi})}, \quad \text{with } Z_k \sim q(z|D, \boldsymbol{\phi}), \qquad (7)$$

where $K \in \mathbb{N}$ is the number of draws used to compute the VR bound and

$$\omega^\alpha_k(z, D) = \frac{1}{C}\left(\frac{p(D, z_k)}{q(z_k|D, \boldsymbol{\phi})}\right)^{1-\alpha}, \quad \text{with } C = \sum_{i=1}^{K}\left(\frac{p(D, z_i)}{q(z_i|D, \boldsymbol{\phi})}\right)^{1-\alpha}. \qquad (8)$$

We provide the derivation in Appendix A.
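The normalized weights in the Monte Carlo gradient are a softmax of the scaled log ratios $(1-\alpha)\log(p/q)$. The sketch below computes them from log ratios in a numerically stable way; the function name `vr_weights` and the use of log-space inputs are assumptions of this example rather than details taken from the paper's implementation.

```python
import math


def vr_weights(log_ratios, alpha):
    """Normalized weights proportional to (p(D, z_k) / q(z_k))^(1 - alpha).

    `log_ratios` holds log p(D, z_k) - log q(z_k | D, phi) for each draw.
    The maximum is subtracted before exponentiating to avoid overflow.
    """
    scaled = [(1.0 - alpha) * lr for lr in log_ratios]
    top = max(scaled)
    unnorm = [math.exp(s - top) for s in scaled]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]
```

With α = 1 every weight equals 1/K, so the estimator reduces to an average of ELBO gradients. With α = 0 the weights are self-normalized importance weights, which concentrate on the draws with the largest ratio.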

2.3. THE SIGNAL-TO-NOISE RATIO

The signal-to-noise (SN) ratio was introduced by Rainforth et al. (2018) to study the effect of tighter variational bounds on gradient estimation. The SN-ratio is given by

$$\mathrm{SNR}_{M,K}(\boldsymbol{\phi}) = \frac{\mathbb{E}\left[\Delta^\alpha_{M,K}(\boldsymbol{\phi})\right]}{\sigma\left[\Delta^\alpha_{M,K}(\boldsymbol{\phi})\right]}, \qquad (9)$$

where $\sigma[\cdot]$ is the standard deviation, $M, K \in \mathbb{N}$ are the numbers of Monte Carlo draws, and $\Delta^\alpha_{M,K}(\boldsymbol{\phi})$ derives from rewriting Equation (7) in the form

$$\Delta^\alpha_{M,K}(\boldsymbol{\phi}) = \frac{1}{1-\alpha}\,\frac{1}{M} \sum_{m=1}^{M} \nabla_{\boldsymbol{\phi}} \log \frac{1}{K} \sum_{k=1}^{K} \left(\frac{p(Z_{m,k}, D)}{q(Z_{m,k}|D, \boldsymbol{\phi})}\right)^{1-\alpha}. \qquad (10)$$

Here, we separate tightening the bound (by increasing $K$) from reducing the noise in the gradient estimation (by increasing $M$). If the rate at which the expected gradient decreases is faster than the rate of decrease of the variance, the gradient estimates worsen as $K$ increases. The counter-intuitive implication is that a tighter bound can worsen the gradient estimation.
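The estimator and its SN-ratio can be illustrated on a one-dimensional toy problem. Everything specific here is an assumption of the sketch: the model is $z \sim N(0,1)$, $x|z \sim N(z,1)$, the guide is $q(z|\phi) = N(\phi, 1)$ reparameterized as $z = \phi + \varepsilon$, the function names are hypothetical, and the SN-ratio is computed with an absolute value so that it is non-negative. For this model the reparameterized gradient of $\log(p/q)$ with respect to $\phi$ is $x - 2z$.

```python
import math
import random


def delta_estimate(rng, phi, x, alpha, m, k):
    """One draw of the gradient estimator with M outer and K inner samples."""
    total = 0.0
    for _ in range(m):
        zs = [phi + rng.gauss(0.0, 1.0) for _ in range(k)]
        # log p(x, z) - log q(z | phi), up to a constant that cancels in the weights.
        log_w = [-0.5 * (x - z) ** 2 - 0.5 * z ** 2 + 0.5 * (z - phi) ** 2 for z in zs]
        scaled = [(1.0 - alpha) * lw for lw in log_w]
        top = max(scaled)
        unnorm = [math.exp(s - top) for s in scaled]
        norm = sum(unnorm)
        # Weighted sum of per-draw gradients x - 2 z (toy-model specific).
        total += sum(u / norm * (x - 2.0 * z) for u, z in zip(unnorm, zs))
    return total / m


def snr(samples):
    """Absolute mean divided by the sample standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return abs(mean) / math.sqrt(var)


def gradient_samples(alpha, m, k, n_samples=4000, seed=0, phi=0.0, x=1.0):
    rng = random.Random(seed)
    return [delta_estimate(rng, phi, x, alpha, m, k) for _ in range(n_samples)]


def empirical_snr(alpha, m, k, n_samples=4000, seed=0):
    return snr(gradient_samples(alpha, m, k, n_samples, seed))
```

For α = 1 and K = 1 the estimator is an average of M independent ELBO gradient draws, so `empirical_snr(1.0, 16, 1)` is roughly four times `empirical_snr(1.0, 1, 1)`. Other values of α can be explored by varying `k` with `m = 1`; this sketch makes no claim about how they compare.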

2.4. THE STEIN MIXTURE

Variational inference with SMs (Nalisnick & Smyth, 2017) approximates the target posterior distribution $p(z|D)$ by letting the Stein particles $\Phi = \{\boldsymbol{\phi}_i\}_{i=1}^{N}$ parameterize guide programs, $q(z|\boldsymbol{\phi}_i, D)$. An SM yields a mixture marginal variational posterior, $p(z|D) \approx \frac{1}{|\Phi|} \sum_{\boldsymbol{\phi} \in \Phi} q(z|\boldsymbol{\phi}, D)$, from which it takes its name. Formally, an SM is a hierarchical variational model (HVM) (Ranganath et al., 2016) with an empirical measure of particles $q_\Phi$ (defined in the same way as $q_Z$) as its variational posterior, a uniform variational prior, and variational likelihood $\mathbb{E}_{q(z|D,\boldsymbol{\phi})}[p(D, z|\boldsymbol{\phi}) / q(z|D, \boldsymbol{\phi})]$. Similarly to SVGD, SM minimizes $D_{KL}(q(\boldsymbol{\phi}) \,\|\, p(\boldsymbol{\phi}|D))$ by iteratively transporting the particles according to the following expression:

$$\boldsymbol{\phi}_{i+1} \leftarrow \boldsymbol{\phi}_i + \epsilon S^H_\Phi(\boldsymbol{\phi}_i),$$

where $\epsilon \geq 0$ is the learning rate and $S^H_\Phi$ is the hierarchical Stein force.

The attractive force of SM. Like SVGD, SM also makes use of two additive forces, $S^H_\Phi = S^{H+}_\Phi + S^-_\Phi$. The repulsive force $S^-_\Phi$ is the same as in SVGD, given by Equation (5). The attractive force is given by

$$S^{H+}_\Phi(\boldsymbol{\phi}_i) = \mathbb{E}_{\boldsymbol{\phi} \sim q_\Phi}\left[k(\boldsymbol{\phi}_i, \boldsymbol{\phi})\,\nabla_{\boldsymbol{\phi}} \log \mathbb{E}_{q(z|D,\boldsymbol{\phi})}\left[\frac{p(D, z)}{q(z|D, \boldsymbol{\phi})}\right]\right],$$

where $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel. From the construction of SVGD, we require that the kernel has the reproducing property, so the kernel is dense in the space of continuous functions. If we choose Gaussian guides, the expected likelihood (EL) kernel (Jebara et al., 2004) is a natural choice because it accounts for the geometry of $q(z|D, \boldsymbol{\phi}_j)$ and reduces to the RBF kernel for fixed variance, which is a reproducing kernel. The EL kernel is given by

$$k(\boldsymbol{\phi}_i, \boldsymbol{\phi}_j) = \int q(z|D, \boldsymbol{\phi}_i)\,q(z|D, \boldsymbol{\phi}_j)\,dz = \langle q(z|D, \boldsymbol{\phi}_i), q(z|D, \boldsymbol{\phi}_j) \rangle_{L^2},$$

where $\langle \cdot, \cdot \rangle_{L^2}$ is the $L^2$ inner product and $k$ is a positive definite kernel.
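For Gaussian guides with a shared, fixed isotropic variance, the EL kernel has a closed form: the integral of the product of two Gaussian densities with means $m_1, m_2$ and variance $s^2 I_d$ is $(4\pi s^2)^{-d/2}\exp(-\|m_1 - m_2\|^2 / (4 s^2))$. The sketch below, with hypothetical function names, implements this closed form and checks it against numerical integration in one dimension. Up to the constant factor it is an RBF kernel with bandwidth $4s^2$.

```python
import math


def normal_pdf(z, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-((z - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def el_kernel(m1, m2, var):
    """Closed-form EL kernel for isotropic Gaussians with shared variance `var`."""
    d = len(m1)
    sq_dist = sum((a - b) ** 2 for a, b in zip(m1, m2))
    return (4.0 * math.pi * var) ** (-d / 2.0) * math.exp(-sq_dist / (4.0 * var))


def el_kernel_numeric_1d(m1, m2, var, lo=-30.0, hi=30.0, steps=60000):
    """Midpoint-rule approximation of the integral of the two densities."""
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        total += normal_pdf(z, m1, var) * normal_pdf(z, m2, var)
    return total * dz


def rbf_ratio(m1, m2, var):
    """EL kernel normalized by its value at zero distance."""
    return el_kernel(m1, m2, var) / el_kernel(m1, m1, var)
```

For example, `el_kernel([0.3], [1.1], 1.5)` agrees with `el_kernel_numeric_1d(0.3, 1.1, 1.5)` to high precision, and `rbf_ratio` returns `exp(-||m1 - m2||^2 / (4 * var))`.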

3. α-INDEXED STEIN MIXTURES INFERENCE AND ELBO-WITHIN-STEIN

To see the connection between the hierarchical Stein force given in Equation (3) and the Rényi divergence, consider the gradient of the log hierarchical likelihood (that occurs in $S^{H+}_\Phi$) and the VR bound given in Equation (6) for α = 0. Presuming the support of the variational likelihood $q(z|\boldsymbol{\phi})$ is a subset of the support of the model prior, $\mathrm{supp}(q(z|\boldsymbol{\phi})) \subseteq \mathrm{supp}(p(z))$, the gradient of the log hierarchical likelihood is given by

$$\nabla_{\boldsymbol{\phi}} \log p(D|\boldsymbol{\phi}) = \nabla_{\boldsymbol{\phi}} \log \mathbb{E}_q\left[\frac{p(D, z)}{q(z|D, \boldsymbol{\phi})}\right] \overset{\alpha = 0,\ \text{eq. (6)}}{=} \nabla_{\boldsymbol{\phi}}\left(\log p(D) - D_{\alpha=0}[q(z|D, \boldsymbol{\phi}) \,\|\, p(z|D)]\right) = -\nabla_{\boldsymbol{\phi}} D_{\alpha=0}[q(z|D, \boldsymbol{\phi}) \,\|\, p(z|D)]. \qquad (11)$$

From Equation (11), we see that the gradient of the log hierarchical likelihood is exactly the gradient of the difference between the log evidence $\log p(D)$, on the one hand, and the Rényi divergence (at α = 0) between the variational posterior $q(z|\boldsymbol{\phi})$ and the model posterior $p(z|D)$, on the other hand. Since $\log p(D)$ does not depend on $\boldsymbol{\phi}$, only the divergence term contributes to the gradient. Thus, Equation (11) shows that the attractive hierarchical force ($S^{H+}_\Phi$) pushes the components of the variational posterior, $q(z|D, \boldsymbol{\phi})$, towards the model posterior, $p(z|D)$; see Appendix D for details. The equivalence in Equation (11) suggests a whole class of hierarchical attractive forces indexed by the order α of the VR bound. Note that choosing α ≠ 0 means we lose the interpretation of the attractive force as moving the particles towards the nearest peak of the conditional evidence. Assuming our marginal variational posterior $q(z|\boldsymbol{\phi})$ is reparameterizable, we can approximate the attractive force for any α ≥ 0 as

$$S^{\alpha+}_\Phi(\boldsymbol{\phi}_i) = \mathbb{E}_{\boldsymbol{\phi} \sim q_\Phi}\left[k(\boldsymbol{\phi}_i, \boldsymbol{\phi})\,\Lambda_K(\boldsymbol{\phi})\right], \qquad (12)$$

where $\Lambda_K(\boldsymbol{\phi})$ is given by Equation (7). We call inference with Equation (12) α-indexed Stein mixture inference. There are two special cases of α that are worth highlighting. The first is α = 1/2, for which the Rényi divergence corresponds to the Hellinger divergence (Van Erven & Harremos, 2014; Li & Turner, 2016). The second is α = 1, corresponding to the $D_{KL}$-divergence.
In this case, the VR bound recovers the ELBO. We call this instance of our α-indexed SM inference algorithm ELBO-within-Stein. In Appendix E we show that we can also recover the α = 1 case directly by applying Jensen's inequality to the conditional evidence.
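The α = 1 case can also be checked as a limit of the VR bound. The following short derivation is a sketch that assumes differentiation under the expectation is permitted and writes $r(z) = p(z, D)/q(z|D, \boldsymbol{\phi})$ and $f(\alpha) = \log \mathbb{E}_q[r(z)^{1-\alpha}]$, both of which are shorthand introduced only for this illustration.

```latex
% f(1) = \log \mathbb{E}_q[1] = 0, so the bound is a 0/0 form at \alpha = 1.
\begin{align*}
f'(\alpha) &= \frac{\mathbb{E}_q\!\left[-\,r(z)^{1-\alpha} \log r(z)\right]}
                  {\mathbb{E}_q\!\left[r(z)^{1-\alpha}\right]},
\qquad f'(1) = -\,\mathbb{E}_q\!\left[\log r(z)\right], \\
\lim_{\alpha \to 1} \frac{f(\alpha)}{1-\alpha}
  &= \frac{f'(1)}{-1}
   = \mathbb{E}_{q(z|D,\boldsymbol{\phi})}\!\left[\log \frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right].
\end{align*}
```

The second line applies L'Hôpital's rule, and the right-hand side is the ELBO, in agreement with the statement that the VR bound recovers the ELBO at α = 1.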

3.1. INVESTIGATING THE SIGNAL-TO-NOISE RATIO

Estimation of a Stein mixture converges when $S^H_\Phi = 0$, which means that the repulsive and attractive forces must be equal and opposite for them to cancel. Hence, convergence requires $\Delta^\alpha_{M,K}(\boldsymbol{\phi})$ and $\nabla_{\boldsymbol{\phi}_1} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2)$ to be accurate. In Figure 1 we demonstrate the effect of inaccurate gradient approximations. To study the sensitivity of gradient approximations to the choice of α, we measure the SN-ratio (see Equation (9)) of the VR bound gradients (see Equation (6)). We simulate data $\{x_i\}_{i=1}^{n}$ from a simple latent variable model given by $N(D|z, I_d)\,N(z|\boldsymbol{\mu}, I_d)$, where $\boldsymbol{\mu} \in \mathbb{R}^d$ is unknown and $I_d$ is the $d$-dimensional identity matrix. To approximate its posterior we use a Stein mixture of the form $\frac{1}{2}\left(N(\boldsymbol{\phi}_1, \tfrac{3}{2} I_d) + N(\boldsymbol{\phi}_2, \tfrac{3}{2} I_d)\right)$ and an expected likelihood kernel. We choose a (computationally convenient) fixed variance such that the Stein mixture cannot exactly recover the posterior. We can see this as the posterior is unimodal, which is only the case for the Stein mixture if $|\boldsymbol{\phi}_1 - \boldsymbol{\phi}_2| < 3$ (Behboodian, 1970), but in this interval the variance of the Stein mixture will be greater than or equal to 3/2. With the expected likelihood kernel we can analytically characterize all fixed points for the Stein particles as

$$-\nabla_{\boldsymbol{\phi}_1} \frac{1}{1-\alpha} \log \mathbb{E}_{q(z|\boldsymbol{\phi}_1)}\left[\left(\frac{p(z, D)}{q(z|\boldsymbol{\phi}_1)}\right)^{1-\alpha}\right] = \nabla_{\boldsymbol{\phi}_2} \frac{1}{1-\alpha} \log \mathbb{E}_{q(z|\boldsymbol{\phi}_2)}\left[\left(\frac{p(z, D)}{q(z|\boldsymbol{\phi}_2)}\right)^{1-\alpha}\right].$$

See Appendix B for the derivation. To measure the effect of gradient approximation on the system we use Equation (10) to estimate the gradients. To conduct our experiment, we sample the location $\boldsymbol{\mu}$ from a 20-dimensional standard Gaussian and use this $\boldsymbol{\mu}$ to simulate n = 64 data points $D$.
We then approximate the gradients at a random point close to a fixed point,

$$(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = \left(\frac{\boldsymbol{\mu} + n\bar{D}}{n + 1} + \nabla_{\boldsymbol{\phi}_1} \Delta^\alpha_{M,K}(\boldsymbol{\phi}_1) + \epsilon,\ \frac{\boldsymbol{\mu} + n\bar{D}}{n + 1} + \nabla_{\boldsymbol{\phi}_2} \Delta^\alpha_{M,K}(\boldsymbol{\phi}_2)\right),$$

where $\bar{D}$ is the data average, $(\boldsymbol{\mu} + n\bar{D})/(n + 1)$ is the posterior mean and $\epsilon$ offsets each dimension by a Gaussian with mean zero and variance 0.01. Figure 2b and Figure 2c show the convergence of the SN-ratio (see Equation (9)) as we tighten the VR bound by increasing either $K$ or $M$. We only show the SN-ratio for the first particle ($\boldsymbol{\phi}_1$) as the behaviour for the second particle is the same. For the ELBO (α = 1), we fix K = 1 and increase $M$ to reduce the gradient approximation variance, while for the rest (α ∈ {0, 0.5, 2, 10}), we fix M = 1 and increase $K$ to tighten the VR bound. We do not need to consider α ≠ 1 when K = 1, as the associated gradient scaling (1 − α) cancels in the SN-ratio. We empirically estimate the SN-ratio by estimating the expectation and standard deviation from 10,000 gradient samples. In Figure 2b, there is a latent variable for each data point. Note how the SN-ratio only improves with tightness for α = 1 (green line). In Figure 2c all data points share a latent variable. Figure 2a illustrates the experimental setup in two dimensions. For legibility in Figure 2a, we do not include the perturbation (i.e. added $\epsilon$ noise) on $\boldsymbol{\phi}_1$ in the visualization. The contours correspond to the exact posterior. As the particles are placed equidistant from the posterior mean (marked with a blue cross), in this setting the Stein forces are zero. As we would expect (see blue arrows in Figure 2a), the gradient estimates $\nabla_{\boldsymbol{\phi}} \Delta^\alpha_{M,K}$ are equal and opposite for the two particles. In Figure 2b we show a local variant of the model where there is a latent variable $z_i$ for each $x_i \in D$.
We see that for α ≠ 1 the SN-ratio does not depend on the particular choice of α and that its growth is outpaced by that of α = 1. This means there is little to no benefit in increasing $K$ beyond K = 1, for which we recover the ELBO gradient when the guide is reparameterizable (see Section 2). In Figure 2c we evaluate a global latent variable variant of the model, so that there is one $z$ for all data points in $D$. As with the local version, we fix either $M$ or $K$. For this model, we see that α = 0 and α = 1/2 achieve the highest SN-ratio. The result aligns with the BNN example, where α = 1 does not dominate α ∈ {0, 1/2} in performance on all datasets. High precision is desirable to avoid fluctuation at convergence. From the above results, we see that α = 0 is not necessarily the best choice for precise (high SN-ratio) gradient estimation. In particular, for local latent variable models, α = 1 is a better choice, and for global latent variable models, α = 1/2 is on par with α = 0 in our experiment.

3.2. BLACK-BOX INFERENCE FOR ELBO-WITHIN-STEIN

We provide a mini-batch version of ELBO-within-Stein, called EinSteinVI, in NumPyro. Computing the VR bound exactly requires all the data points; that is, we cannot represent the bound as a pointwise expectation, except for α = 1. Therefore, in order to make EinSteinVI scalable to tall data, we provide an approximation of the VR bound which replaces the likelihood by $p_I(D|z, \boldsymbol{\phi}) = \prod_{i \in I} p(x_i|z, \boldsymbol{\phi})^{|D|/|I|}$, where $I$ is a subset of a permutation of the data indices. The approximate attractive force $S^{\alpha+}_\Phi(\boldsymbol{\phi})$ used in EinSteinVI is given by

$$S^{\alpha+}_\Phi(\boldsymbol{\phi}_i) = \mathbb{E}_{\boldsymbol{\phi} \sim q_\Phi}\left[k(\boldsymbol{\phi}, \boldsymbol{\phi}_i)\,\nabla_{\boldsymbol{\phi}} \frac{1}{1-\alpha} \log \mathbb{E}_{q_I(z|D)}\left[\left(\frac{p_I(D|z, \boldsymbol{\phi})\,p(z)}{q_I(z|D, \boldsymbol{\phi})}\right)^{1-\alpha}\right]\right],$$

which recovers the exact VR bound when $|I| = |D|$. We describe the NumPyro integration and provide example programs in Appendix I.

4. RELATED WORK

[...] (Amari, 2012; Tsallis, 1988). Hernandez-Lobato et al. (2016) introduced a black-box algorithm for variational inference based on the α-divergence using automatic differentiation. Unlike our algorithm, their algorithm is not for HVMs. Rainforth et al. (2018) demonstrated that for VAEs the gradient estimation degrades for multi-sample approximations when using the importance weighted variational autoencoder (IWAE) bound (Burda et al., 2015). Furthermore, Rainforth et al. (2018) showed that this is not the case when using the ELBO. Rainforth et al. (2018) differs from our work in that the VAEs are estimated with a point mass guide, as their inference algorithm is not for HVMs. Le et al. (2020) investigate the deterioration experimentally, providing evidence for it on several real-world tasks. Tucker et al. (2018) show that by doubly reparameterizing the gradient estimator, they can eliminate the degrading SN-ratio for multi-sample estimation of the IWAE gradient, among others.

5. EXAMPLES

We evaluate α-indexed Stein mixture inference by inferring Bayesian neural networks (BNN) and variational autoencoders (VAE) on standard datasets. We use the BNNs for regression on the UCI regression benchmark (the same as Hernández-Lobato & Adams (2015)) and the VAE for unsupervised learning on MNIST (Salakhutdinov & Murray, 2008; LeCun et al., 1998) and OMNIGLOT (Lake et al., 2013).

Bayesian neural networks. For brevity, we present BNNs for the subset of the UCI regression benchmark detailed in Appendix G. All datasets use real-valued features. We use a 90-10 split for training and test datasets. We compare α-indexed SM inference for α ∈ {0, 0.5, 1} on BNN regression. We test with two guides: factorized Gaussian guides with an EL kernel and point mass (Dirac delta) guides with an RBF kernel. Like Liu & Wang (2016), we use a BNN with one hidden layer of size fifty and a ReLU activation. We put a Gamma(1, 0.1) prior on the precision of the neurons and the likelihood. We use five particles for all experiments. We run all datasets for 35,000 epochs with a subsample size of 32, the Adam optimizer (Kingma & Ba, 2014) and a step size of 0.002. All measurements are repeated three times and obtained on a GPU. We compare against the SVGD implementation from Liu et al. (2016) with 20 particles, mean field variational Bayes (MFVI) with a factorized Gaussian guide (Hoffman et al., 2013) and Laplace approximation. For the latter two we use inference engines from NumPyro (Phan et al., 2019). Table 1 shows the root mean squared error (RMSE) on test sets. EoS with delta guides outperforms the baselines and other α-orders, except on Boston. SM (α = 0) and Hell (α = 0.5) perform similarly with factorized Gaussian guides, which aligns with our SN-ratio experiment that shows the gradient approximations are similar for these two cases. Note that the Stein mixtures use only five particles, whereas SVGD uses twenty. Table 2 gives the log-likelihood on the same test sets.
EoS achieves better average log-likelihood for all datasets with factorized Gaussian guides than other α-orders. We see that α = 0 and α = 0.5 perform similarly with a factorized Gaussian prior, which aligns with our SN-ratio experiment in that the quality of gradient approximations is similar.

Variational autoencoder. We evaluate Stein mixtures and SVGD for VAEs on two datasets with α ∈ {0, 0.5, 1}. We use binarized MNIST (Salakhutdinov & Murray, 2008; LeCun et al., 1998), a dataset of 28 × 28 pixel images of handwritten single-digit numbers, and a variant of OMNIGLOT (Lake et al., 2013), which contains 28 × 28 pixel images of characters from fifty different alphabets. We use the same VAE architecture as Burda et al. (2015), detailed in Appendix H. For both datasets we optimize using the Adam optimizer and a learning rate of $5 \cdot 10^{-4}$. We optimize with a batch size of 20 and use 20 draws to approximate the gradients. For OMNIGLOT we use 20 epochs and for MNIST we use 50. Table 3 shows the performance of ELBO-within-Stein for α ∈ {0.0, 0.5, 1.0}. We find that ELBO-within-Stein with α = 1 achieves better log-likelihoods on MNIST. On OMNIGLOT, α = 0.5 and α = 0 achieve comparable log-likelihoods, with α = 0 slightly outperforming α = 0.5.

6. SUMMARY

We introduce a new algorithm called ELBO-within-Stein (EoS) based on a new connection between the inference of Stein mixtures and the variational Rényi bound. We demonstrate that EoS provides better gradient approximations than alternative algorithms, which results in better performance on standard benchmark problems. EoS is integrated as a black-box library in the NumPyro PPL, which is distributed freely.

A VARIATIONAL RÉNYI BOUND

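For convenience, we derive the variational Rényi bound (Li & Turner, 2016) in the context of inference with our algorithm below. Recall that Stein mixtures lift the set of guide hyper-parameters $\phi$ (optimized in VI) to a random variable $\boldsymbol{\phi}$. Let $D$ be a finite set of observations, $z \in \mathbb{R}^d$ be a latent variable, $D_\alpha[q \,\|\, p]$ the Rényi α-divergence (Rényi, 1961) between distributions $p$ and $q$, and $\boldsymbol{\phi} \in \mathbb{R}^d$ a set of guide hyper-parameters lifted to a random variable. Then we have

$$\log p(D) - D_\alpha[q(z|D, \boldsymbol{\phi}) \,\|\, p(z|D)] = \frac{1}{1-\alpha} \log \mathbb{E}_{q(z|D,\boldsymbol{\phi})}\left[\left(\frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right)^{1-\alpha}\right].$$

To see this, consider

$$\begin{aligned}
D_\alpha[q(z|D, \boldsymbol{\phi}) \,\|\, p(z|D)]
&= \frac{1}{\alpha - 1} \log \int q(z|D, \boldsymbol{\phi})^\alpha\, p(z|D)^{1-\alpha}\,dz \\
&= \frac{1}{\alpha - 1} \log \int q(z|D, \boldsymbol{\phi})^\alpha \left(\frac{p(z, D)}{p(D)}\right)^{1-\alpha} dz \\
&= \frac{1}{\alpha - 1} \log \left(\int q(z|D, \boldsymbol{\phi})^\alpha\, p(z, D)^{1-\alpha}\,dz \cdot p(D)^{\alpha-1}\right) \\
&= \frac{1}{\alpha - 1} \log \int q(z|D, \boldsymbol{\phi})^\alpha\, p(z, D)^{1-\alpha}\,dz + \frac{\alpha - 1}{\alpha - 1} \log p(D) \\
&= \frac{1}{\alpha - 1} \log \int q(z|D, \boldsymbol{\phi})\, q(z|D, \boldsymbol{\phi})^{-(1-\alpha)}\, p(z, D)^{1-\alpha}\,dz + \log p(D) \\
&= \frac{1}{\alpha - 1} \log \mathbb{E}_{q(z|D,\boldsymbol{\phi})}\left[\left(\frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right)^{1-\alpha}\right] + \log p(D),
\end{aligned}$$

from which we recover the desired equality by rearranging and multiplying both sides by negative one,

$$\log p(D) - D_\alpha[q(z|D, \boldsymbol{\phi}) \,\|\, p(z|D)] = \frac{1}{1-\alpha} \log \mathbb{E}_{q(z|D,\boldsymbol{\phi})}\left[\left(\frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right)^{1-\alpha}\right].$$

To shorten notation, we let $C_\alpha(D, \boldsymbol{\phi}) = \mathbb{E}\left[\left(p(z, D)/q(z|D, \boldsymbol{\phi})\right)^{1-\alpha}\right]$, where the expectation is with respect to $q(z|D, \boldsymbol{\phi})$. The gradient of the variational Rényi bound with respect to $\boldsymbol{\phi}$ is

$$\begin{aligned}
\nabla_{\boldsymbol{\phi}} \frac{1}{1-\alpha} \log \mathbb{E}\left[\left(\frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right)^{1-\alpha}\right]
&= \frac{1}{1-\alpha}\, C_\alpha(D, \boldsymbol{\phi})^{-1}\, \mathbb{E}\left[\nabla_{\boldsymbol{\phi}} \left(\frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right)^{1-\alpha}\right] \\
&= C_\alpha(D, \boldsymbol{\phi})^{-1}\, \mathbb{E}\left[\left(\frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right)^{1-\alpha} \nabla_{\boldsymbol{\phi}} \log \frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right] \\
&= \mathbb{E}\left[\frac{\left(p(z, D)/q(z|D, \boldsymbol{\phi})\right)^{1-\alpha}}{C_\alpha(D, \boldsymbol{\phi})}\, \nabla_{\boldsymbol{\phi}} \log \frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right] \\
&= \mathbb{E}\left[\omega^\alpha(z, D)\, \nabla_{\boldsymbol{\phi}} \log \frac{p(z, D)}{q(z|D, \boldsymbol{\phi})}\right],
\end{aligned}$$

where

$$\omega^\alpha(z, D) = \frac{\left(p(z, D)/q(z|D, \boldsymbol{\phi})\right)^{1-\alpha}}{\mathbb{E}\left[\left(p(z, D)/q(z|D, \boldsymbol{\phi})\right)^{1-\alpha}\right]}.$$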
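The identity derived above can be checked numerically when the latent variable takes finitely many values, since integrals become sums. The sketch below uses a hypothetical three-state example; the joint values, the guide, and the function names are assumptions chosen for illustration only.

```python
import math


def renyi_divergence(q, p, alpha):
    """D_alpha[q || p] for discrete distributions with full support, alpha != 1."""
    s = sum(qi ** alpha * pi ** (1.0 - alpha) for qi, pi in zip(q, p))
    return math.log(s) / (alpha - 1.0)


def vr_bound(q, joint, alpha):
    """(1 / (1 - alpha)) * log E_q[(p(z, D) / q(z))^(1 - alpha)], alpha != 1."""
    s = sum(qi * (ji / qi) ** (1.0 - alpha) for qi, ji in zip(q, joint))
    return math.log(s) / (1.0 - alpha)


def identity_residual(q, joint, alpha):
    """Left-hand side minus right-hand side of the VR bound identity."""
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    lhs = math.log(evidence) - renyi_divergence(q, posterior, alpha)
    return lhs - vr_bound(q, joint, alpha)


def bound_slack(q, joint, alpha):
    """Log evidence minus the VR bound; equals the Renyi divergence."""
    return math.log(sum(joint)) - vr_bound(q, joint, alpha)
```

With `joint = [0.10, 0.05, 0.25]` (so the evidence is 0.4) and `q = [0.2, 0.3, 0.5]`, `identity_residual` is zero up to rounding for α ∈ {0, 0.5, 2}, and `bound_slack` is zero at α = 0 and positive for the other two orders.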

B CHARACTERIZING TWO PARTICLE FIXED POINTS

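We give the full derivation of stationary points for the Stein mixture that we consider in Section 3. Recall that Section 3 investigated the SN-ratio for a Stein mixture given by

$$\tfrac{1}{2}\left(N(\boldsymbol{\phi}_1, \tfrac{3}{2} I_d) + N(\boldsymbol{\phi}_2, \tfrac{3}{2} I_d)\right),$$

where $\boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \in \mathbb{R}^d$ are two $d$-dimensional particles. We use the kernel given by

$$k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = \exp\left(-\frac{1}{h} \|\boldsymbol{\phi}_1 - \boldsymbol{\phi}_2\|_2^2\right),$$

where $h \in \mathbb{R}^+$ is the bandwidth. The kernel has the following properties, which we will use in the derivation:

$$\nabla_{\boldsymbol{\phi}_1} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = -\nabla_{\boldsymbol{\phi}_2} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2), \quad k(\boldsymbol{\phi}, \boldsymbol{\phi}) = 1, \quad k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = k(\boldsymbol{\phi}_2, \boldsymbol{\phi}_1), \quad \nabla_{\boldsymbol{\phi}} k(\boldsymbol{\phi}, \boldsymbol{\phi}) = 0.$$

Finally, we introduce

$$\xi_\alpha(\boldsymbol{\phi}) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(z \mid \boldsymbol{\phi})}\left[\left(\frac{p(z, D)}{q(z \mid \boldsymbol{\phi})}\right)^{1-\alpha}\right]$$

as notational shorthand. Our two-particle configuration reaches a fixed point when

$$\left(\boldsymbol{\phi}_1 + \epsilon S^H_\Phi(\boldsymbol{\phi}_1),\; \boldsymbol{\phi}_2 + \epsilon S^H_\Phi(\boldsymbol{\phi}_2)\right) = (\boldsymbol{\phi}_1, \boldsymbol{\phi}_2),$$

where $\epsilon > 0$ is the step size. Therefore, $S^H_\Phi(\boldsymbol{\phi}_1) = 0$ and $S^H_\Phi(\boldsymbol{\phi}_2) = 0$ at any fixed point. $S^H_\Phi(\boldsymbol{\phi}_1)$ is given by

$$\begin{aligned}
S^H_\Phi(\boldsymbol{\phi}_1)
&= \underbrace{k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_1)}_{=1} \nabla_{\boldsymbol{\phi}_1} \xi_\alpha(\boldsymbol{\phi}_1) + \underbrace{\nabla_{\boldsymbol{\phi}_1} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_1)}_{=0} + k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) \nabla_{\boldsymbol{\phi}_2} \xi_\alpha(\boldsymbol{\phi}_2) + \nabla_{\boldsymbol{\phi}_2} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) \\
&= \nabla_{\boldsymbol{\phi}_1} \xi_\alpha(\boldsymbol{\phi}_1) + k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) \nabla_{\boldsymbol{\phi}_2} \xi_\alpha(\boldsymbol{\phi}_2) + \nabla_{\boldsymbol{\phi}_2} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = 0.
\end{aligned}$$

Therefore,

$$-\nabla_{\boldsymbol{\phi}_2} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = \nabla_{\boldsymbol{\phi}_1} \xi_\alpha(\boldsymbol{\phi}_1) + k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) \nabla_{\boldsymbol{\phi}_2} \xi_\alpha(\boldsymbol{\phi}_2) \tag{16}$$

at a fixed point. By a similar argument for $S^H_\Phi(\boldsymbol{\phi}_2)$, we have

$$\nabla_{\boldsymbol{\phi}_1} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = -\left(\nabla_{\boldsymbol{\phi}_2} \xi_\alpha(\boldsymbol{\phi}_2) + k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) \nabla_{\boldsymbol{\phi}_1} \xi_\alpha(\boldsymbol{\phi}_1)\right) \tag{17}$$

at a fixed point. As $\nabla_{\boldsymbol{\phi}_1} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = -\nabla_{\boldsymbol{\phi}_2} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2)$, it follows from Equations (16) and (17) that

$$\begin{aligned}
\nabla_{\boldsymbol{\phi}_1} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) &= -\nabla_{\boldsymbol{\phi}_2} k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) \\
-\left(\nabla_{\boldsymbol{\phi}_2} \xi_\alpha(\boldsymbol{\phi}_2) + k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) \nabla_{\boldsymbol{\phi}_1} \xi_\alpha(\boldsymbol{\phi}_1)\right) &= \nabla_{\boldsymbol{\phi}_1} \xi_\alpha(\boldsymbol{\phi}_1) + k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) \nabla_{\boldsymbol{\phi}_2} \xi_\alpha(\boldsymbol{\phi}_2) \\
-\nabla_{\boldsymbol{\phi}_2} \xi_\alpha(\boldsymbol{\phi}_2)\left(1 + k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2)\right) &= \nabla_{\boldsymbol{\phi}_1} \xi_\alpha(\boldsymbol{\phi}_1)\left(1 + k(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2)\right) \\
-\nabla_{\boldsymbol{\phi}_2} \xi_\alpha(\boldsymbol{\phi}_2) &= \nabla_{\boldsymbol{\phi}_1} \xi_\alpha(\boldsymbol{\phi}_1).
\end{aligned}$$

Hence, we see that at any fixed point of our two-particle configuration, the gradients of the VR bound are equal and opposite.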
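The fixed-point condition can be checked numerically on a one-dimensional toy case. The sketch below is not the setting of Section 3: it swaps in the hypothetical stand-in $\xi(\phi) = -\phi^2/2$ for the VR bound so that the fixed point has a closed form, $\phi_1 = -\phi_2 = a$ with $a = \sqrt{(h/4)\log(1 + 4/h)}$. All function names are illustrative.

```python
import math


def rbf(x, y, h):
    """RBF kernel exp(-(x - y)^2 / h) for scalar particles."""
    return math.exp(-((x - y) ** 2) / h)


def grad_second_rbf(x, y, h):
    """Derivative of k(x, y) with respect to its second argument y."""
    return (2.0 / h) * (x - y) * rbf(x, y, h)


def grad_xi(phi):
    """Gradient of the stand-in objective xi(phi) = -phi^2 / 2 (an assumption)."""
    return -phi


def stein_force(phi_i, phi_j, h):
    """Two-particle Stein force on phi_i, written as in the derivation above."""
    self_term = rbf(phi_i, phi_i, h) * grad_xi(phi_i) + grad_second_rbf(phi_i, phi_i, h)
    cross_term = rbf(phi_i, phi_j, h) * grad_xi(phi_j) + grad_second_rbf(phi_i, phi_j, h)
    return self_term + cross_term


h = 1.0
a = math.sqrt(h / 4.0 * math.log(1.0 + 4.0 / h))  # closed-form fixed point for this toy xi
phi_1, phi_2 = a, -a

force_1 = stein_force(phi_1, phi_2, h)
force_2 = stein_force(phi_2, phi_1, h)
print(force_1, force_2, grad_xi(phi_1), grad_xi(phi_2))
```

Both forces vanish up to floating-point error, and the two objective gradients have the same magnitude with opposite signs, as the derivation predicts.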

C KERNELS IN SVGD

For an example of a kernel, consider the radial basis function (RBF) kernel $k(z_i, z_j) = \exp\left(-\frac{1}{h} \|z_i - z_j\|_2^2\right)$ with bandwidth parameter $h$, chosen as $\frac{1}{\log N} \operatorname{med}(Z)$, where $\operatorname{med}$ is the median operator. The repulsive force moves particles away from each other, ensuring that they do not collapse onto the same mode. For the RBF kernel, the repulsive force becomes

$$\mathbb{E}_{z_j \sim q_Z}\left[\nabla_{z_j} k(z_i, z_j)\right] = \sum_j \frac{2}{h}\, k(z_i, z_j)\, (z_i - z_j) \cdot 1_d,$$

where $\cdot$ is the (Euclidean) inner product and $1_d$ is a $d$-dimensional vector of ones. It follows that $z_i$ is pushed away from $z_j$ when $k(z_i, z_j)$ is large.
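A minimal sketch of the repulsive term for the RBF kernel follows. Differentiating $\exp(-\|z_i - z_j\|_2^2 / h)$ with respect to $z_j$ gives $\frac{2}{h} k(z_i, z_j)(z_i - z_j)$, which is what the code computes. Two choices here are assumptions of the sketch rather than statements from the text: the force is kept as a $d$-dimensional vector, and it is averaged over particles as an expectation under the empirical measure $q_Z$ would be. The bandwidth is passed in as a plain argument.

```python
import math


def rbf_kernel(z_i, z_j, h):
    """RBF kernel exp(-||z_i - z_j||^2 / h) for particles given as lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    return math.exp(-sq_dist / h)


def grad_zj_rbf(z_i, z_j, h):
    """Gradient of k(z_i, z_j) with respect to z_j: (2 / h) * k * (z_i - z_j)."""
    k = rbf_kernel(z_i, z_j, h)
    return [(2.0 / h) * k * (a - b) for a, b in zip(z_i, z_j)]


def repulsive_force(i, particles, h):
    """Average of grad_zj k(z_i, z_j) over all particles z_j (vector-valued)."""
    n = len(particles)
    force = [0.0] * len(particles[i])
    for z_j in particles:
        grad = grad_zj_rbf(particles[i], z_j, h)
        force = [f + g / n for f, g in zip(force, grad)]
    return force


particles = [[0.0, 0.0], [1.0, 0.0]]
force_0 = repulsive_force(0, particles, h=1.0)
force_1 = repulsive_force(1, particles, h=1.0)
print(force_0, force_1)
```

With two particles on the first axis, the force on the left particle points further left and the force on the right particle points further right, so the pair is pushed apart.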

D CONDITIONAL EVIDENCE AS RÉNYI DIVERGENCE BETWEEN POSTERIORS

That Equation (11) pushes posteriors towards each other follows from properties of the Rényi divergence with $\alpha \in [0, 1]$ and the negative direction of the gradient of the Rényi divergence. In particular, we have (i) that the divergence is a similarity measure of distributions for $\alpha \geq 0$, so $D_{\alpha=0}[q \,\|\, p] = 0 \implies q = p$, (ii) that $D_\alpha[q \,\|\, p]$ is non-negative everywhere, and (iii) that the divergence is jointly convex (i.e. convex in both distributions) (Van Erven & Harremos, 2014). Putting it all together, we see from (ii) and (iii) that the extremum at $D_{\alpha=0}[q \,\|\, p] = 0$ is global, and from the (negative) gradient that we are minimizing $D_{\alpha=0}[q \,\|\, p]$. So, we have that

$$-\nabla_{\boldsymbol{\phi}} D_{\alpha=0}[q(z \mid D, \boldsymbol{\phi}) \,\|\, p(z \mid D)] = 0 \implies D_{\alpha=0}[q(z \mid D, \boldsymbol{\phi}) \,\|\, p(z \mid D)] = 0,$$

which from (i) means $q(z \mid D, \boldsymbol{\phi}) = p(z \mid D)$.

E ALTERNATIVE ELBO-WITHIN-STEIN DERIVATION

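For $\alpha = 1$ we can derive the attractive force of $S^H_\Phi$ directly by applying Jensen's inequality to the log conditional evidence, resulting in

$$\nabla_{\boldsymbol{\phi}} \log \mathbb{E}_q\left[\frac{p(D, z \mid \boldsymbol{\phi})}{q(z \mid D, \boldsymbol{\phi})}\right] \geq \nabla_{\boldsymbol{\phi}} \mathbb{E}_q\left[\log \frac{p(D, z \mid \boldsymbol{\phi})}{q(z \mid D, \boldsymbol{\phi})}\right].$$

In ELBO-within-Stein, the attractive force takes the simple form

$$S^{\text{ELBO}+}_\Phi(\boldsymbol{\phi}_i) = \mathbb{E}_{\boldsymbol{\phi} \sim q_\Phi}\left[k(\boldsymbol{\phi}_i, \boldsymbol{\phi}) \nabla_{\boldsymbol{\phi}} \mathbb{E}_{q(z \mid \boldsymbol{\phi})}\left[\log p(D, z \mid \boldsymbol{\phi})\right] - k(\boldsymbol{\phi}_i, \boldsymbol{\phi}) \nabla_{\boldsymbol{\phi}} \mathbb{E}_{q(z \mid \boldsymbol{\phi})}\left[\log q(z \mid D, \boldsymbol{\phi})\right]\right],$$

and the repulsive force is given by Equation (5).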
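Written out before taking gradients, the Jensen step splits the bound into an expected log joint and an entropy-like term. The block below is a sketch in the notation of this appendix; identifying the expectation of the ratio with $p(D \mid \boldsymbol{\phi})$ assumes that $q(z \mid D, \boldsymbol{\phi})$ is normalized and covers the support of $p(D, z \mid \boldsymbol{\phi})$.

```latex
\begin{aligned}
\log p(D \mid \boldsymbol{\phi})
  &= \log \mathbb{E}_{q(z \mid D, \boldsymbol{\phi})}
     \left[\frac{p(D, z \mid \boldsymbol{\phi})}{q(z \mid D, \boldsymbol{\phi})}\right] \\
  &\geq \mathbb{E}_{q(z \mid D, \boldsymbol{\phi})}
     \left[\log \frac{p(D, z \mid \boldsymbol{\phi})}{q(z \mid D, \boldsymbol{\phi})}\right] \\
  &= \mathbb{E}_{q(z \mid D, \boldsymbol{\phi})}\left[\log p(D, z \mid \boldsymbol{\phi})\right]
   - \mathbb{E}_{q(z \mid D, \boldsymbol{\phi})}\left[\log q(z \mid D, \boldsymbol{\phi})\right].
\end{aligned}
```

The two terms in the last line are the ones whose kernel-weighted gradients make up the attractive force.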

| Dataset | Data points | Feature count |
|---|---|---|
| Boston (Harrison Jr & Rubinfeld, 1978) | 506 | 13 |
| Concrete (Yeh, 1998) | 1030 | 8 |
| Energy (Tsanas & Xifara, 2012) | 768 | 8 |
| Power (Tüfekci, 2014) | 9568 | 4 |
| Protein (Rana, 2013) | 45730 | 9 |
| Year (Bertin-Mahieux et al., 2011) | 515345 | 90 |

Table 5: Variational autoencoder architecture for MNIST and OMNIGLOT. s denotes a stochastic layer and d denotes a deterministic layer. Read the networks left-to-right for the guide description and right-to-left for the model description.

| Dataset | Architecture | Activation |
|---|---|---|
| MNIST | d200-d200-s50 | tanh |
| OMNIGLOT | d200-d200-s100-d100-d100-s50 | tanh |

F ILLUSTRATING STEIN MIXTURES

To illustrate the use of an SM, consider the variational autoencoder (VAE) (Kingma & Welling, 2013). The VAE simultaneously trains a generative model $p(D \mid g_\theta(z))\, p(z)$ and a variational approximation $q(z \mid f_\psi(D))$ of the posterior $p(z \mid D)$. Here, $\theta$ and $\psi$ are parameters of the generative neural network $g_\theta(\cdot)$ and the inference network $f_\psi(\cdot)$, respectively. VAE training is typically done by stochastic variational inference (SVI) (Hoffman et al., 2013), which optimizes $\theta$ and $\psi$ to maximize the ELBO. With an SM, the generative model remains the same; that is, we obtain a point estimate of $\theta$. However, the marginal posterior approximation changes to $\frac{1}{|\Phi|} \sum_{\boldsymbol{\phi} \in \Phi} q(z \mid f_{\boldsymbol{\phi}}(D))$. So with a Stein mixture, each particle $\boldsymbol{\phi}$ parameterizes a separate inference network $f_{\boldsymbol{\phi}}(\cdot)$, meaning the guide becomes amortized, similar to Shu et al. (2018).
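The uniform mixture over particles can be evaluated in log space with a log-sum-exp. The sketch below is a deliberately simplified stand-in: each particle is represented directly by a hypothetical (mean, standard deviation) pair for a scalar latent $z$, playing the role of the output of an inference network $f_{\boldsymbol{\phi}}(D)$ at one data point. The function names are illustrative.

```python
import math


def log_normal_pdf(z, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((z - mean) / std) ** 2


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def stein_mixture_log_density(z, particles):
    """Log of (1 / |Phi|) * sum over particles of q(z | particle).

    particles is a list of (mean, std) pairs, one per Stein particle.
    """
    component_logs = [log_normal_pdf(z, mean, std) for mean, std in particles]
    return logsumexp(component_logs) - math.log(len(particles))


particles = [(-1.0, 0.5), (2.0, 1.0)]  # two hypothetical particles
log_q = stein_mixture_log_density(0.3, particles)
print(log_q)
```

With a single particle the mixture reduces to the usual variational approximation, which is a quick sanity check on the weighting.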

G UCI BENCHMARK DETAILS

We compare ELBO-within-Stein for α ∈ {0, 0.5, 1} on BNN regression with a point mass (Dirac delta) guide and an RBF kernel. With ELBO-within-Stein we recover a variant of SVGD with a VR gradient rather than the score function. Like Liu & Wang (2016), we use a BNN with one hidden layer of size fifty and a ReLU activation. We put a Gamma(1, 0.1) prior on the precision of the neurons and the likelihood. For both versions we use 5 particles and update Year for 40 epochs, Protein for 100 epochs, and the rest for 500 epochs. We use Adagrad (Duchi et al., 2011) with a step size of 0.05 and a subsample size of 100. All measurements are repeated five times and obtained on a GPU (footnote 3).

H VAE DETAILS

Following Li & Turner (2016), Burda et al. (2015), and Rainforth et al. (2018), we use VAEs with multiple stochastic layers. The idea is to define the model through ancestral sampling as

$$p(x \mid \boldsymbol{\theta}) = \int_{z_1, \ldots, z_L} p(z_L)\, p(z_{L-1} \mid f_{\boldsymbol{\theta}_{L-1}}(z_L)) \cdots p(x \mid f_{\boldsymbol{\theta}_0}(z_1)),$$

where $x$ is a data point (which we will also denote $z_0$), $z_1, \ldots, z_L$ are the $L$ stochastic layers, and $\boldsymbol{\theta}_l$ parameterizes a neural network $f_l$ which takes $z_{l+1}$ to the parameters of the distribution $p_l$, i.e. $p(z_l \mid f_l(z_{l+1}))$. We then let the guide factor in the opposite direction, resulting in

$$q(z \mid \boldsymbol{\phi}, x) = q(z_1 \mid f_{\boldsymbol{\phi}_1}(x))\, q(z_2 \mid f_{\boldsymbol{\phi}_2}(z_1)) \cdots q(z_L \mid f_{\boldsymbol{\phi}_L}(z_{L-1})).$$

We use the same network architecture as Rainforth et al. (2018) (summarized in Table 5). In Table 5, s denotes a stochastic layer and d denotes a deterministic layer (affine transform). We use tanh as the activation function on deterministic layers. Stochastic layers are distributed according to a factorized Gaussian distribution, and for the likelihood we use the Bernoulli distribution (hence the binarization of the datasets).
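The ancestral-sampling structure of the model can be sketched with scalar latents. This is only a toy under loud assumptions: each network $f_l$ is replaced by a hypothetical one-parameter tanh map, every stochastic layer is a unit-variance Gaussian, and the data point is a single Bernoulli draw. None of these choices come from the architecture in Table 5.

```python
import math
import random


def make_layer(weight, bias):
    """Hypothetical stand-in for a network f_l: a scalar affine map followed by tanh."""
    return lambda z: math.tanh(weight * z + bias)


def sample_model(layers, rng):
    """Ancestral sample from p(z_L) p(z_{L-1} | f_{L-1}(z_L)) ... p(x | f_0(z_1)).

    layers[l] plays the role of f_l, so len(layers) equals the number of
    stochastic layers L. Returns the sampled x and the latents [z_1, ..., z_L].
    """
    num_layers = len(layers)
    z = rng.gauss(0.0, 1.0)  # z_L drawn from the prior p(z_L)
    latents = [z]
    for l in range(num_layers - 1, 0, -1):
        z = rng.gauss(layers[l](z), 1.0)  # z_l given z_{l+1}
        latents.append(z)
    logit = layers[0](z)  # f_0(z_1) parameterizes the Bernoulli likelihood
    prob = 1.0 / (1.0 + math.exp(-logit))
    x = 1 if rng.random() < prob else 0
    return x, latents[::-1]


rng = random.Random(1)
layers = [make_layer(1.0, 0.0), make_layer(0.5, 0.1), make_layer(-0.7, 0.2)]
x, latents = sample_model(layers, rng)
print(x, latents)
```

A guide that factors in the opposite direction would run the same loop from $x$ upward, sampling $z_1$ first and $z_L$ last.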

I THE EINSTEINVI LIBRARY

We provide a library called EinSteinVI for inferring Stein mixtures in the probabilistic programming language (PPL) NumPyro (Bingham et al., 2019; Phan et al., 2019). EinSteinVI uses α-indexed SM inference as its core algorithm, as described in Section 3. NumPyro is a universal PPL (van de Meent et al., 2018) embedded in Python. NumPyro provides specialized constructs for expressing probabilistic models as Python programs and allows executing arbitrary code in its model and guide. The computational backend of NumPyro is JAX (Frostig et al., 2018), which combines XLA (accelerated linear algebra) (Sabne, 2020) program transformations with automatic differentiation. As EinSteinVI works with arbitrary guides, NumPyro is a well-suited language for embedding EinSteinVI. Further, we chose NumPyro because:

• NumPyro is embedded in Python, the de facto programming language for data science;
• NumPyro includes the necessary data structures for tracking random variables in both model and guide;
• NumPyro features stochastic variational inference (SVI) with an application programming interface (API) that is well suited for EinSteinVI; and
• NumPyro benefits computationally from JAX.

I.1 AN EINSTEINVI PROGRAM EXAMPLE

To demonstrate the two modes of VI (SVGD and Stein mixtures) with EinSteinVI, we consider the 1D Gaussian mixture $\frac{1}{3} N(-2, 1) + \frac{2}{3} N(2, 1)$ (see Figure 3 and Figure 4). The Gaussian mixture is bimodal and well suited to the nonparametric nature of SVGD and Stein mixtures. Figure 4 shows that both SVGD (footnote 4) and the Stein mixture naturally capture the bimodality of the target distribution, in contrast to SVI with a Gaussian guide. Note the reduction in particles required to estimate the target when using Stein mixtures compared to SVGD. Also, note that the Stein mixture overestimates the variance and slightly perturbs the locations. The error seen at the right mode for the Stein mixture with two particles is due to the uniform weighting of the particles in SVGD, and as such is algorithmic. The Stein mixture will therefore not be able to exactly capture the mixing weights of a target mixture model with one particle per component. However, with more particles the mixture can be approximated better, as demonstrated using three particles.
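The effect of uniform particle weights can be seen with a small idealized calculation. The sketch below assumes each particle contributes a unit-variance Gaussian placed exactly on a mode, which is a simplification and not what the fitted Stein mixture in Figure 4 looks like, and compares the probability mass below zero for the target against uniformly weighted mixtures with two and three components.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def mixture_cdf(x, components):
    """CDF of a Gaussian mixture given as a list of (weight, mean, std)."""
    return sum(w * normal_cdf(x, mean, std) for w, mean, std in components)


# Target from the text: 1/3 N(-2, 1) + 2/3 N(2, 1).
target = [(1.0 / 3.0, -2.0, 1.0), (2.0 / 3.0, 2.0, 1.0)]

# Idealized uniformly weighted mixtures (component placement is an assumption).
two_particles = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
three_particles = [(1.0 / 3.0, -2.0, 1.0), (1.0 / 3.0, 2.0, 1.0), (1.0 / 3.0, 2.0, 1.0)]

target_mass = mixture_cdf(0.0, target)
two_mass = mixture_cdf(0.0, two_particles)
three_mass = mixture_cdf(0.0, three_particles)
print(target_mass, two_mass, three_mass)
```

Two equally weighted components put half of the mass on the left, more than the target does, while three components can reproduce the 1/3 to 2/3 split by placing two of them on the right mode.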

I.2 INTEGRATION WITH NUMPYRO

Integrating EinSteinVI with NumPyro requires handling transformations between the parameter representation of NumPyro (footnote 5) and the array representation that ELBO-within-Stein operates on. For this, we rely on JAX's PyTrees (footnote 6), which convert back and forth between Python dictionaries and array representations. Algorithm 1 shows the black-box version of α-indexed SM inference in NumPyro. The algorithm allows SVI to estimate a subset of the parameters and α-indexed SM inference the rest. To differentiate the two, we denote parameters updated by SVI with $\boldsymbol{\psi}$ and parameters updated by ELBO-within-Stein with $\boldsymbol{\phi}_i$ (i.e. the Stein particles $\Phi = \{\boldsymbol{\phi}_i\}_{i=1}^N$). In the model, only SVI can update parameters, which we denote by $\boldsymbol{\theta}$. We update $\boldsymbol{\theta}$ and $\boldsymbol{\psi}$ by averaging the loss over the Stein particles.

For the Stein particles, the process is more elaborate. First, we convert the set of individual distribution parameters in the guide to a monolithic array using JAX's PyTrees. The array represents the particles as a flattened and stacked JAX array. Then, we compute a kernel on the particles, delegated to the kernel interface (see Appendix I.3) as the computation is kernel-dependent. We apply JAX's vmap (Frostig et al., 2018; Phan et al., 2019) operator to compute the Stein forces for each particle in a vectorized manner. As we compute the Stein forces in unconstrained space, we must correct them by the Jacobian of the bijection to constrained space. Naively computing the Jacobian on the monolithic array incurs a massive memory overhead. However, as NumPyro registers a bijection for each distribution parameter, we can eliminate the overhead by computing the Jacobian on the JAX representations of the individual distribution parameters rather than the monolith. Like computing the Stein forces, the correction is embarrassingly parallel, so we can use a vmap operator again. Inside the vmap we nest a tree_map to do the appropriate conversion between representations. Finally, we convert the monolithic array to its NumPyro representation and return the expected changes for SVI and Stein parameters.

Figure 4: The blue dashed line is the target pdf, while the solid green line is the density of the particles. We estimate the particle density for SVGD with Gaussian kernel density estimation. We use 100 particles for SVGD, and two or three particles for the Stein mixture. SVI uses a Gaussian guide.

Algorithm 1: α-indexed Stein mixture inference

Require: SVI parameters $\boldsymbol{\theta}$ and $\boldsymbol{\psi}$, Stein parameters $\{\boldsymbol{\phi}_i\}_{i=1}^N$, model $p_{\boldsymbol{\theta}}(z, x)$, guide $q_{\boldsymbol{\phi}, \boldsymbol{\psi}}(z)$, loss $L_\alpha$, kernel interface KI.
Ensure: Parameter changes based on SVI ($\Delta\boldsymbol{\theta}$, $\Delta\boldsymbol{\psi}$) and hierarchical Stein forces ($\{\Delta\boldsymbol{\phi}_i\}_{i=1}^N$).

procedure UPDATE($\boldsymbol{\theta}$, $\boldsymbol{\psi}$, $\{\boldsymbol{\phi}_i\}_{i=1}^N$, $p_{\boldsymbol{\theta}}$, $q_{\boldsymbol{\phi}, \boldsymbol{\psi}}$)
    $\Delta\boldsymbol{\theta} \leftarrow \mathbb{E}_{\boldsymbol{\theta}}[\nabla_{\boldsymbol{\theta}} L_\alpha(p_{\boldsymbol{\theta}}, q_{\boldsymbol{\phi}, \boldsymbol{\psi}})]$
    $\Delta\boldsymbol{\psi} \leftarrow \mathbb{E}_{\boldsymbol{\psi}}[\nabla_{\boldsymbol{\psi}} L_\alpha(p_{\boldsymbol{\theta}}, q_{\boldsymbol{\phi}, \boldsymbol{\psi}})]$
    $\{a_i\}_{i=1}^N \leftarrow$ PYTREEFLATTEN($\{\boldsymbol{\phi}_i\}_{i=1}^N$)
    $k \leftarrow$ KI($\{a_i\}_{i=1}^N$)
    procedure HSTEIN-FORCES($a_i$)    ▷ Calculate forces per particle for higher-order vmap function.
        $\boldsymbol{\phi}_i \leftarrow$ PYTREERESTORE($a_i$)
        $\Delta a_i \leftarrow \sum_{a_j} k(a_j, a_i) \nabla_{a_i} L_\alpha(p_{\boldsymbol{\theta}}, q_{\boldsymbol{\phi}_i, \boldsymbol{\psi}}) + \nabla_{a_i} k(a_j, a_i)$
        return $\Delta a_i$
    end procedure
    $\{\Delta a_i\}_{i=1}^N \leftarrow$ VMAP($\{a_i\}_{i=1}^N$, HSTEIN-FORCES)
    $\{\Delta\boldsymbol{\phi}_i\}_{i=1}^N \leftarrow$ PYTREERESTORE($\{\Delta a_i\}_{i=1}^N$)
    return $\Delta\boldsymbol{\theta}$, $\Delta\boldsymbol{\psi}$, $\{\Delta\boldsymbol{\phi}_i\}_{i=1}^N$
end procedure
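The flatten, compute forces, restore pattern of the Stein-particle update can be sketched in plain Python. This is not the EinSteinVI implementation: `flatten`, `stein_forces`, and the dictionary layout are hypothetical, the kernel is a fixed-bandwidth RBF, the SVI updates and the Jacobian correction for constrained parameters are left out, and the objective gradient is evaluated at each particle $a_j$ and summed over $j$, which is one reading of the update that follows the Stein force defined in the introduction.

```python
import math


def flatten(params):
    """Flatten a dict of name -> list of floats into one list plus a restore function."""
    names = sorted(params)
    sizes = [len(params[name]) for name in names]
    flat = [value for name in names for value in params[name]]

    def restore(array):
        out, start = {}, 0
        for name, size in zip(names, sizes):
            out[name] = list(array[start:start + size])
            start += size
        return out

    return flat, restore


def rbf(a, b, h):
    """RBF kernel on flattened particles."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / h)


def stein_forces(arrays, grad_objective, h=1.0):
    """Kernel-weighted objective gradients plus repulsion, one force per particle.

    grad_objective is assumed to return the gradient of the quantity being
    maximized, evaluated at one flattened particle.
    """
    n = len(arrays)
    grads = [grad_objective(a) for a in arrays]
    deltas = []
    for a_i in arrays:
        delta = [0.0] * len(a_i)
        for a_j, g_j in zip(arrays, grads):
            k = rbf(a_j, a_i, h)
            # Gradient of k(a_j, a_i) with respect to a_j.
            repulsion = [-(2.0 / h) * k * (x - y) for x, y in zip(a_j, a_i)]
            delta = [d + (k * g + r) / n for d, g, r in zip(delta, g_j, repulsion)]
        deltas.append(delta)
    return deltas


particles = [{"loc": [0.0], "scale": [1.0]}, {"loc": [1.0], "scale": [1.0]}]
flattened = [flatten(p) for p in particles]
arrays = [flat for flat, _ in flattened]
restore = flattened[0][1]  # all particles share one structure

# With a zero objective gradient only the repulsive term remains.
deltas = stein_forces(arrays, lambda a: [0.0] * len(a), h=1.0)
updates = [restore(delta) for delta in deltas]
print(updates)
```

In JAX, the inner loop over $a_i$ would typically be replaced by `vmap` and the hand-written flattening by the PyTree utilities the text mentions.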

I.3 KERNEL INTERFACE

The kernel interface is straightforward. To extend the interface, users must implement the compute function, which accepts as input the current set of particles, the mapping between model parameters and particles, and the loss function $L$, and returns a differentiable kernel $k$. Table 6 gives the complete list of kernels in EinSteinVI.

| Kernel | Definition | Detail | Type | Reference |
|---|---|---|---|---|
| Constant Preconditioned | $Q^{-\frac{1}{2}} K(Q^{\frac{1}{2}} x, Q^{\frac{1}{2}} y)\, Q^{-\frac{1}{2}}$ | $K$ is an inner matrix-valued kernel and $Q$ is a preconditioning matrix like the Hessian $-\nabla_z^2 \log p(z \mid x)$ or Fisher information $-\mathbb{E}_{z \sim q_Z(z)}[\nabla_z^2 \log p(z \mid x)]$ matrices | matrix | Wang et al. (2019) |
| Anchor Point Preconditioned | $\sum_{\ell=1}^{m} K_{Q_\ell}(x, y)\, \boldsymbol{\omega}_\ell(x)\, \boldsymbol{\omega}_\ell(y)$ | $\{a_\ell\}_{\ell=1}^{m}$ is a set of anchor points, $Q_\ell = Q(a_\ell)$ is a preconditioning matrix for each anchor point, $K_{Q_\ell}$ is an inner kernel conditioned using $Q_\ell$, and $\boldsymbol{\omega}_\ell(x) = \operatorname{softmax}_\ell(\{N(x \mid a_{\ell'}, Q^{-1}_{\ell'})\}_{\ell'})$ | matrix | Wang et al. (2019) |
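The shape of such an interface can be sketched as follows. The class name, argument names, and types are hypothetical and are not the signatures used by EinSteinVI or NumPyro; the sketch only mirrors the description above, where compute receives the particles, a parameter-to-particle mapping, and the loss, and returns a kernel function. The bandwidth rule assumes that med is the median of pairwise squared distances, and the kernel works on plain Python floats, whereas in the library the returned function would be differentiated by JAX.

```python
import math
import statistics
from typing import Callable, Dict, Optional, Sequence, Tuple

Particle = Sequence[float]
Kernel = Callable[[Particle, Particle], float]


def squared_distance(x: Particle, y: Particle) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


class RBFKernelSketch:
    """Hypothetical kernel object exposing a compute method."""

    def compute(
        self,
        particles: Sequence[Particle],
        particle_info: Dict[str, Tuple[int, int]],
        loss_fn: Optional[Callable[[Particle], float]],
    ) -> Kernel:
        # particle_info (name -> slice bounds) and loss_fn are accepted to match
        # the described interface; this simple kernel does not need them.
        n = len(particles)
        pairwise = [
            squared_distance(particles[i], particles[j])
            for i in range(n)
            for j in range(i + 1, n)
        ]
        median = statistics.median(pairwise) if pairwise else 0.0
        # Bandwidth med / log N, with a fallback of 1.0 for degenerate cases.
        bandwidth = median / math.log(n) if n > 1 and median > 0.0 else 1.0

        def kernel(x: Particle, y: Particle) -> float:
            return math.exp(-squared_distance(x, y) / bandwidth)

        return kernel


particles = [[0.0], [1.0], [3.0]]
k = RBFKernelSketch().compute(particles, {"loc": (0, 1)}, None)
print(k(particles[0], particles[1]))
```

Returning a closure keeps the bandwidth fixed for one update step while letting the caller evaluate the kernel on any pair of particles.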



Footnotes

1. The VR bound can be extended to α ∈ R. We presume α is finite, but we allow α to be less than or equal to zero (Van Erven & Harremos, 2014).
2. Quadro RTX 6000 with CUDA V11.4.120.
3. Quadro RTX 6000 with CUDA V11.4.120.
4. We recover SVGD with a point mass (Dirac delta distribution) on all distributions in the guide.
5. A dictionary mapping parameters to their values, which can be of arbitrary Python type.
6. https://jax.readthedocs.io/en/latest/pytrees.html



Figure 1: Two particle system at theoretical fixed point. The blue arrows indicate the magnitude and direction of the attractive force, the red arrows show the repulsive force, and the black arrows the Stein force. Note that Figure 1b has no Stein force as expected for a converged system.

(a) Experimental setup. (b) SN-ratio convergence with a local latent variable. (c) SN-ratio convergence with a global latent variable.

Figure 3: 1D Gaussian mixture model in NumPyro. We use the deprecated NormalMixture over the more general (and more verbose) MixtureSameFamily for clarity.

Average RMSE (lower is better) test results for BNNs on the UCI benchmark. EoS (ours) corresponds to α = 1, Hell (ours) to α = 0.5, and SM (Nalisnick & Smyth, 2017) to α = 0. Parentheses mean we use Dirac delta guides. EoS and Hell give the best results.

Nalisnick & Smyth (2017) first suggested Stein mixtures as an alternative to HVMs (Ranganath et al., 2016). Using SVGD allows Stein mixtures to side-step the need of HVMs for an auxiliary distribution to keep the bound (learning objective) tight. This is an improvement, as the effect of the auxiliary distribution on the approximation is implicit and therefore hard to understand, whereas with Stein mixtures the choice of kernel controls the tightness and we have a theoretical understanding of kernels.

Average log likelihood (higher is better) test results for BNNs on the UCI benchmark. EoS (ours) corresponds to α = 1, Hell (ours) to α = 0.5, and SM (Nalisnick & Smyth, 2017) to α = 0. Parentheses mean we use Dirac delta guides. EoS generally outperforms Hell and SM.

Log likelihood (higher is better) test results for VAE. EoS (ours) corresponds to α = 1, Hell (ours) to α = 0.5, and SM (Nalisnick & Smyth, 2017) to α = 0.

Summary statistics of datasets from the UCI regression benchmark.

Kernels included in the EinSteinVI library.

