SEMI-IMPLICIT VARIATIONAL INFERENCE VIA SCORE MATCHING

Abstract

Semi-implicit variational inference (SIVI) greatly enriches the expressiveness of variational families by considering implicit variational distributions defined in a hierarchical manner. However, due to the intractable densities of variational distributions, current SIVI approaches often use surrogate evidence lower bounds (ELBOs) or employ expensive inner-loop MCMC runs for direct ELBO maximization during training. In this paper, we propose SIVI-SM, a new method for SIVI based on an alternative training objective via score matching. Leveraging the hierarchical structure of semi-implicit variational families, the score matching objective allows a minimax formulation where the intractable variational densities can be naturally handled with denoising score matching. We show that SIVI-SM closely matches the accuracy of MCMC and outperforms ELBO-based SIVI methods in a variety of Bayesian inference tasks.

1. INTRODUCTION

Variational inference (VI) is an approximate Bayesian inference approach that transforms the inference problem into an optimization problem (Jordan et al., 1999; Wainwright & Jordan, 2008; Blei et al., 2017). It starts by introducing a family of variational distributions over the model parameters (or latent variables) to approximate the posterior. The goal is then to find the member of this family closest to the target posterior, where closeness is usually measured by the Kullback-Leibler (KL) divergence from the posterior to the variational approximation. In practice, this is often achieved by maximizing the evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence (Jordan et al., 1999). One of the classical VI methods is mean-field VI (Bishop & Tipping, 2000), where the variational distributions are assumed to factorize over the parameters (or latent variables). When combined with conditional conjugacy, this often leads to simple optimization schemes with closed-form update rules (Blei et al., 2017). While popular, the factorization assumption and conjugacy condition greatly restrict the flexibility and applicability of variational posteriors, especially for complicated models with high-dimensional parameter spaces. Recent years have witnessed much progress in extending VI to more complicated settings. For example, the conjugacy condition has been removed by black-box VI methods, which accommodate a broad class of models via Monte Carlo gradient estimators (Nott et al., 2012; Paisley et al., 2012; Ranganath et al., 2014; Rezende et al., 2014; Kingma & Welling, 2014).
On the other hand, more flexible variational families have been proposed that either explicitly incorporate more complicated structures among the parameters (Jaakkola & Jordan, 1998; Saul & Jordan, 1996; Giordano et al., 2015; Tran et al., 2015) or borrow ideas from invertible transformations of probability distributions (Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma et al., 2016; Papamakarios et al., 2019). All these methods require tractable densities for the variational distributions. It turns out that the variational family can be further expanded by allowing implicit models that have intractable densities but are easy to sample from (Huszár, 2017). One way to construct such implicit models is to transform a simple base distribution via a deterministic map, e.g., a deep neural network (Tran et al., 2017; Mescheder et al., 2017; Shi et al., 2018a;b; Song et al., 2019). Due to the intractable densities of implicit models, evaluating the ELBO during training often requires density ratio estimation, which is known to be challenging in high-dimensional settings (Sugiyama et al., 2012). To avoid density ratio estimation, semi-implicit variational inference (SIVI) has been proposed, where the variational distributions are formed through a semi-implicit hierarchical construction and (asymptotically unbiased) surrogate ELBOs are employed for training (Yin & Zhou, 2018; Moens et al., 2021). Instead of surrogate ELBOs, an unbiased gradient estimator of the exact ELBO has been derived based on MCMC samples from a reverse conditional (Titsias & Ruiz, 2019). However, the inner-loop MCMC runs can easily become expensive in high-dimensional regimes. There are also approaches that estimate the gradients instead of the objective (Li & Turner, 2018; Shi et al., 2018b; Song et al., 2019).
Besides the KL divergence, score-based distance measures have been introduced in various statistical tasks (Hyvärinen, 2005; Zhang et al., 2018) and have shown advantages in complicated nonlinear models (Song & Ermon, 2019; Ding et al., 2019; Elkhalil et al., 2021). Recently, some studies have used score matching for variational inference (Yang et al., 2019; Hu et al., 2018). However, these methods are not designed for SIVI and hence either do not apply to, or cannot fully exploit, the hierarchical structure of semi-implicit variational distributions. In this paper, we propose SIVI-SM, a new method for SIVI using an alternative training objective based on score matching. We show that the score matching objective and the semi-implicit hierarchical construction of variational posteriors can be combined in a minimax formulation where the intractability of densities is naturally handled with denoising score matching. We demonstrate the effectiveness and efficiency of our method on both synthetic distributions and a variety of real-data Bayesian inference tasks.
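As a concrete illustration of the denoising score matching idea invoked above, the following sketch (a toy one-dimensional setup of our own, not the paper's algorithm) fits a linear score model to Gaussian-perturbed samples. The least-squares minimizer of the denoising objective recovers the score of the perturbed marginal without ever evaluating a density, which is exactly why the technique suits implicit distributions.

```python
import numpy as np

# Toy denoising score matching (DSM): perturb samples x ~ N(0, 1) with
# Gaussian noise of scale sigma, then regress a linear score model
# s(x_tilde) = a * x_tilde onto the DSM target -(x_tilde - x) / sigma^2.
# The population minimizer is the score of the perturbed marginal
# N(0, 1 + sigma^2), i.e. a* = -1 / (1 + sigma^2), obtained from samples only.
rng = np.random.default_rng(0)
n, sigma = 200_000, 0.5

x = rng.standard_normal(n)                    # samples from the model
x_tilde = x + sigma * rng.standard_normal(n)  # perturbed samples
target = -(x_tilde - x) / sigma**2            # DSM regression target

# Closed-form least squares for the single coefficient a.
a_hat = np.sum(x_tilde * target) / np.sum(x_tilde**2)

a_true = -1.0 / (1.0 + sigma**2)  # true score slope of N(0, 1 + sigma^2)
print(a_hat, a_true)
```

In practice the linear model is replaced by a neural network and the same objective is minimized by stochastic gradient descent; the point of the toy example is only that the DSM target, which needs no density evaluations, identifies the correct score.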

2. BACKGROUND

Semi-Implicit Variational Inference  Semi-implicit variational inference (SIVI) (Yin & Zhou, 2018) posits a flexible variational family defined hierarchically using a mixing parameter as follows:

$$x \sim q_\phi(x|z), \quad z \sim q_\xi(z), \quad q_\varphi(x) = \int q_\phi(x|z)\, q_\xi(z)\, dz, \tag{1}$$

where $\varphi = \{\phi, \xi\}$ are the variational parameters. This variational distribution is called semi-implicit because the conditional layer $q_\phi(x|z)$ is required to be explicit while the mixing layer $q_\xi(z)$ can be implicit, and $q_\varphi(x)$ is often implicit unless $q_\xi(z)$ is conjugate to $q_\phi(x|z)$. Compared to standard VI, the above hierarchical construction allows a much richer variational family that is able to capture complicated dependencies between parameters (Yin & Zhou, 2018). Similar to standard VI, current SIVI methods fit the variational parameters by maximizing the evidence lower bound (ELBO):

$$\log p(\mathcal{D}) \geq \mathrm{ELBO} := \mathbb{E}_{x \sim q_\varphi(x)}\left[\log p(\mathcal{D}, x) - \log q_\varphi(x)\right],$$

where $\mathcal{D}$ is the observed data. As $q_\varphi(x)$ is no longer tractable, Yin & Zhou (2018) considered a sequence of lower bounds of the ELBO:

$$\mathrm{ELBO} \geq \mathcal{L}_L := \mathbb{E}_{z \sim q_\xi(z),\, x \sim q_\phi(x|z)}\, \mathbb{E}_{z^{(1)}, \ldots, z^{(L)} \overset{\text{i.i.d.}}{\sim} q_\xi(z)}\left[\log p(\mathcal{D}, x) - \log \frac{1}{L+1}\left(q_\phi(x|z) + \sum_{l=1}^{L} q_\phi(x|z^{(l)})\right)\right].$$

Note that $\mathcal{L}_L$ is an asymptotically exact surrogate ELBO as $L \to \infty$. An increasing sequence $\{L_t\}_{t=1}^{\infty}$ is therefore often suggested, with $\mathcal{L}_{L_t}$ being optimized at the $t$-th iteration. Instead of maximizing surrogate ELBOs, Titsias & Ruiz (2019) proposed unbiased implicit variational inference (UIVI), which is based on an unbiased gradient estimator of the exact ELBO. More specifically, consider a fixed mixing distribution $q_\xi(z) = q(z)$ and a reparameterizable conditional $q_\phi(x|z)$ such that

$$x = T_\phi(z, \epsilon), \quad \epsilon \sim q(\epsilon) \iff x \sim q_\phi(x|z),$$

then

$$\nabla_\phi \mathrm{ELBO} = \nabla_\phi\, \mathbb{E}_{q(\epsilon) q(z)}\left[\log p(\mathcal{D}, x) - \log q_\phi(x)\right]\Big|_{x = T_\phi(z, \epsilon)} = \mathbb{E}_{q(\epsilon) q(z)}\left[g^{\mathrm{mod}}_\phi(z, \epsilon) + g^{\mathrm{ent}}_\phi(z, \epsilon)\right],$$
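To make the surrogate bound concrete, the sketch below (a toy Gaussian example of our own, not taken from the paper) estimates $\mathcal{L}_L$ by Monte Carlo for a semi-implicit family with conditional $q_\phi(x|z) = \mathcal{N}(x; z, 1)$ and mixing $q(z) = \mathcal{N}(0, 1)$, so that the marginal $q(x) = \mathcal{N}(0, 2)$ is tractable. Choosing $\log p(\mathcal{D}, x) = \log \mathcal{N}(x; 0, 2)$ makes the exact ELBO equal to 0, so the estimates of $\mathcal{L}_L$ should be negative and increase toward 0 as $L$ grows.

```python
import numpy as np

# Toy check of the SIVI surrogate bound L_L (hypothetical example):
# conditional q_phi(x|z) = N(x; z, 1), mixing q(z) = N(0, 1), so the
# marginal is q(x) = N(0, 2). With log p(D, x) = log N(x; 0, 2) the exact
# ELBO is 0, and L_L should approach it from below as L increases.
rng = np.random.default_rng(1)

def log_normal(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def surrogate_elbo(L, n=50_000):
    z = rng.standard_normal(n)             # z ~ q(z)
    x = z + rng.standard_normal(n)         # x ~ q_phi(x|z)
    z_extra = rng.standard_normal((L, n))  # z^(1), ..., z^(L) i.i.d. ~ q(z)
    # Density estimate averaging the generating z and the L fresh mixing draws.
    dens = np.exp(log_normal(x, z, 1.0)) + np.exp(log_normal(x, z_extra, 1.0)).sum(axis=0)
    log_q_hat = np.log(dens / (L + 1))
    log_joint = log_normal(x, 0.0, 2.0)    # log p(D, x); exact ELBO = 0
    return np.mean(log_joint - log_q_hat)

l1, l50 = surrogate_elbo(1), surrogate_elbo(50)
print(l1, l50)  # l1 < l50 <= 0 up to Monte Carlo error
```

The bias of $\mathcal{L}_L$ comes from including the generating $z$ in the density average, which overestimates $\log q_\varphi(x)$; adding more fresh mixing samples $z^{(l)}$ dilutes this term, which is why an increasing schedule $\{L_t\}$ is used in practice.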

