EINSTEIN VI: GENERAL AND INTEGRATED STEIN VARIATIONAL INFERENCE IN NUMPYRO

Abstract

Stein Variational Inference is a technique for approximate Bayesian inference that is gaining popularity since it combines the scalability of traditional Variational Inference (VI) with the flexibility of non-parametric particle-based inference methods. While there has been considerable progress in the development of algorithms, integration into existing probabilistic programming languages (PPLs) with an easy-to-use interface is currently lacking. EinStein VI is a lightweight, composable library that integrates the latest Stein Variational Inference methods with the NumPyro PPL. Inference with EinStein VI relies on ELBO-within-Stein and supports custom inference programs (guides), non-linear scaling of the repulsion force, second-order gradient updates using matrix-valued kernels, and parameter transforms. We demonstrate the synergy of the different Stein techniques and the versatility of the EinStein VI library by applying it to a range of examples. Compared to traditional Stochastic VI, EinStein VI is better at capturing uncertainty and representing richer posteriors. We use several applications to show how one can use Neural Transport (NeuTra) and second-order optimization to provide better inference with EinStein VI. We show how EinStein VI can be used to infer novel Stein Mixture versions of realistic models. We infer the parameters of a Stein Mixture Latent Dirichlet Allocation model (SM-LDA) with a neural guide. The results indicate that EinStein VI can be combined with NumPyro's support for automatic marginalization to do inference over models with discrete latent variables. Finally, we introduce an example with a novel Stein Mixture extension to Deep Markov Models, called the Stein Mixture Deep Markov Model (SM-DMM), which shows that EinStein VI can be scaled to reasonably large models with over 500,000 parameters.

1. INTRODUCTION

Interest in Bayesian deep learning has surged due to the need for quantifying the uncertainty of predictions provided by machine learning algorithms. The idea behind Bayesian learning is to describe observed data x using a model with latent variables z (representing model parameters and nuisance variables, see e.g., Fig. 4a). The goal is then to infer a posterior distribution p(z|x) over latent variables given a model describing the joint distribution p(z, x) = p(x|z)p(z), following the rules of Bayesian inference: p(z|x) = Z⁻¹ p(x|z)p(z), where the normalization constant Z = ∫ p(x|z)p(z) dz is intractable for most practical models, including deep neural networks: an analytic solution is lacking or may require an infeasible number of calculations. Variational Inference (VI) techniques (Blei et al., 2017; Hoffman et al., 2013; Ranganath et al., 2014) provide a way to find an approximation of the posterior distribution. VI poses a family of distributions over latent variables q(z) ∈ Q (e.g., Fig. 4b) and chooses the one that minimizes a chosen divergence (an asymmetric distance) D(q(z) ‖ p(z|x)) (e.g., Kullback-Leibler) to the true posterior distribution. VI often provides good approximations that can capture uncertainty, scaling to millions of data points using mini-batch training.
Stein Variational Inference (Stein VI) (Liu and Wang, 2016) is a recent non-parametric approach to VI which uses a set of particles {z_i}_{i=1}^N as the approximating distribution q(z) to provide better flexibility in capturing correlations between latent variables. The technique preserves the scalability of traditional VI approaches while offering the flexibility and modelling scope of techniques such as Markov Chain Monte Carlo (MCMC). Stein VI has been shown to be good at capturing multi-modality (Liu and Wang, 2016; Wang and Liu, 2019a), and has useful theoretical interpretations as particles following a gradient flow (Liu, 2017) and as a moment matching optimization system (Liu and Wang, 2018).
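The particle update at the heart of Stein VI (the SVGD update of Liu and Wang, 2016) can be sketched in plain NumPy. This is an illustration of the algorithm only, not the EinStein VI API; the function names, median-heuristic bandwidth, step size, and the 1-D Gaussian target are our own choices for the example:

```python
import numpy as np

def rbf_kernel(particles, bandwidth=None):
    """RBF kernel matrix and its gradients, with the common median heuristic."""
    diffs = particles[:, None, :] - particles[None, :, :]  # (N, N, D): z_j - z_i
    sq_dists = np.sum(diffs ** 2, axis=-1)                 # (N, N)
    if bandwidth is None:
        # median heuristic: h = med^2 / log(N), a standard default for SVGD
        bandwidth = np.median(sq_dists) / np.log(len(particles) + 1) + 1e-8
    K = np.exp(-sq_dists / bandwidth)                      # k(z_j, z_i)
    grad_K = -2.0 / bandwidth * diffs * K[..., None]       # d k(z_j, z_i) / d z_j
    return K, grad_K

def svgd_step(particles, score_fn, step_size=0.1):
    """One SVGD update: kernel-smoothed log-density gradients plus repulsion."""
    K, grad_K = rbf_kernel(particles)
    scores = score_fn(particles)                           # (N, D): grad log p(z)
    # first term attracts particles to high-density regions,
    # second (kernel gradient) term repels them from each other
    phi = (K @ scores + grad_K.sum(axis=0)) / len(particles)
    return particles + step_size * phi

# Approximate a 1-D Gaussian N(2, 1), whose score is grad log p(z) = -(z - 2)
particles = np.linspace(-3.0, 3.0, 50)[:, None]
for _ in range(1500):
    particles = svgd_step(particles, lambda z: -(z - 2.0))
print(particles.mean())  # the particle mean approaches the target mean of 2
```

The repulsion term is what distinguishes SVGD from simply running gradient ascent on log p for each particle: without it all particles would collapse onto the posterior mode, whereas with it the particle set spreads out to represent posterior uncertainty.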
Many advanced inference methods based on Stein VI have recently been developed, including Stein mixtures (Nalisnick, 2019), non-linear Stein (Wang and Liu, 2019b), factorized graphical models (Zhuo et al., 2018; Wang et al., 2018b), matrix-valued kernels (Wang et al., 2019) and support for higher-order gradient-based optimization (Detommaso et al., 2018). These techniques have been shown to significantly extend the power of Stein VI, allowing more flexible and effective approximations of the true posterior distribution. While algorithmic power is growing, there remains a distinct lack of integration of these techniques into a general probabilistic programming language (PPL) framework. Such an integration would address one of the most prominent limitations of traditional VI, which lacks the flexibility to capture rich correlations in the approximated posterior. This paper presents the EinStein VI library, which extends the NumPyro PPL (Bingham et al., 2019; Phan et al., 2019) with support for the recent developments in Stein Variational Inference in an integrated and compositional fashion (see Fig. 1c and Fig. 1d). The library takes advantage of the capabilities of NumPyro-including universal probabilistic programming (van de Meent et al., 2018), integration with deep learning using JAX (Frostig et al., 2018), and automatic optimization and marginalization of discrete latent variables (Obermeyer et al., 2019)-to provide capabilities that work synergistically with the Stein algorithms. Concretely, our contributions are:

Figure 1: Linear regression example model in NumPyro with EinStein VI for inference

