ELBO-ING STEIN MIXTURES

Abstract

Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a particle-based technique for Bayesian inference. SVGD has recently gained popularity because it combines the ability of variational inference to handle tall data with the modeling power of non-parametric inference. Unfortunately, variance estimation scales inversely with the dimensionality of the model, leading to underestimation; as a result, more particles are required to represent high-dimensional models adequately. Stein mixtures (Nalisnick & Smyth, 2017) alleviate the exponential growth in particles by letting each particle parameterize a distribution. However, the inference algorithm proposed by Nalisnick & Smyth (2017) can be numerically unstable. We show that their algorithm corresponds to inference with the Rényi α-divergence for α = 0 and that using other values for α can lead to more stable inference. We empirically study the performance of Stein mixtures inferred with different α values on various real-world problems, demonstrating significantly and consistently improved results when using α = 1, which corresponds to using the evidence lower bound (ELBO). We call this instance of our algorithm ELBO-within-Stein. An easy-to-use version of the inference algorithm (for arbitrary α ∈ ℝ) is available in the deep probabilistic programming language NumPyro (Phan et al., 2019).

1. INTRODUCTION

The ability of Bayesian deep learning to quantify the uncertainty of predictions made by deep models is causing a surge of interest in these techniques (Izmailov et al., 2021). Bayesian inference aims to describe i.i.d. data $\mathcal{D} = \{x_i\}_{i=1}^n$ using a model with a latent variable $z$. It does so by computing a posterior distribution $p(z|\mathcal{D})$ over the latent variable, given a model describing the joint distribution $p(z, \mathcal{D}) = p(\mathcal{D}|z)p(z)$. We obtain the posterior from Bayes' theorem, $p(z|\mathcal{D}) = \prod_{i=1}^n p(x_i|z)\,p(z) / p(\mathcal{D})$, where $p(\mathcal{D}) = \int \prod_{i=1}^n p(x_i|z)\,p(z)\,dz$ is the normalization constant. For most practical models, the normalization constant lacks an analytic solution or is computationally intractable, complicating the Bayesian inference problem.

Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a recent technique for Bayesian inference that uses a set of particles $Z = \{z_i\}_{i=1}^N$ to approximate the posterior $p(z|\mathcal{D})$. The idea behind SVGD is to iteratively transport $Z$ according to a force field $S_Z$, called the Stein force. The Stein force is given by $S_Z(z_i) = \mathbb{E}_{z_j \sim q_Z}\left[k(z_i, z_j)\,\nabla_{z_j} \log p(z_j|\mathcal{D}) + \nabla_{z_j} k(z_i, z_j)\right]$, where $k(\cdot, \cdot)$ is a reproducing kernel (Berlinet & Thomas-Agnan, 2011), $q_Z = N^{-1}\sum_i \delta_{z_i}$ is the empirical measure on the set of particles $Z$, with $\delta_{z_i}$ the Dirac measure centered at $z_i$, and $\nabla_{z_j} \log p(z_j|\mathcal{D})$ is the gradient of the log-posterior with respect to the $j$-th particle. The technique scales to tall data (i.e., datasets with many data points) and offers the flexibility and scope of techniques such as Markov chain Monte Carlo (MCMC). SVGD is good at capturing multi-modality (Liu & Wang, 2016; Wang & Liu, 2019) and admits useful theoretical interpretations, such as a set of particles following a gradient flow (Liu, 2017) or in terms of the properties of kernels (Liu & Wang, 2018).
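To make the update concrete, the following is a minimal NumPy sketch of one SVGD step with an RBF kernel. The function names (`rbf_kernel`, `svgd_step`) and the median-heuristic bandwidth are illustrative choices, not part of the paper; NumPyro ships a full implementation.

```python
import numpy as np

def rbf_kernel(z):
    """RBF kernel matrix and its gradients for a particle set z of shape (N, D)."""
    diff = z[:, None, :] - z[None, :, :]            # diff[i, j] = z_i - z_j
    sq = np.sum(diff ** 2, axis=-1)                 # pairwise squared distances
    h = np.median(sq) / np.log(len(z) + 1) + 1e-8   # median-heuristic bandwidth
    K = np.exp(-sq / h)                             # K[i, j] = k(z_i, z_j)
    grad_K = (2.0 / h) * K[:, :, None] * diff       # grad_K[i, j] = grad_{z_j} k(z_i, z_j)
    return K, grad_K

def svgd_step(z, score, lr=0.05):
    """One SVGD update: move each particle along the empirical Stein force.

    score(z) returns grad log p(z | D) evaluated row-wise at the particles.
    """
    K, grad_K = rbf_kernel(z)
    # S_Z(z_i) = (1/N) sum_j [ k(z_i, z_j) * score(z_j) + grad_{z_j} k(z_i, z_j) ]
    phi = (K @ score(z) + grad_K.sum(axis=1)) / len(z)
    return z + lr * phi

# Example: transport particles toward a standard normal, whose score is -z.
particles = np.random.default_rng(0).normal(0.0, 3.0, size=(50, 1))
for _ in range(1000):
    particles = svgd_step(particles, lambda x: -x)
```

The first term in `phi` is attractive (it pulls particles toward high-density regions of the posterior), while the kernel-gradient term is repulsive, keeping the particles spread out rather than collapsing onto the mode.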
The main problem is that SVGD suffers from the curse of dimensionality: variance estimation scales inversely with dimensionality (Ba et al., 2021). Nalisnick & Smyth (2017) suggest resolving this by letting each particle parameterize a distribution.

