SEMI-IMPLICIT VARIATIONAL INFERENCE VIA SCORE MATCHING

Abstract

Semi-implicit variational inference (SIVI) greatly enriches the expressiveness of variational families by considering implicit variational distributions defined in a hierarchical manner. However, due to the intractable densities of variational distributions, current SIVI approaches often use surrogate evidence lower bounds (ELBOs) or employ expensive inner-loop MCMC runs for direct ELBO maximization during training. In this paper, we propose SIVI-SM, a new method for SIVI based on an alternative training objective via score matching. Leveraging the hierarchical structure of semi-implicit variational families, the score matching objective allows a minimax formulation where the intractable variational densities can be naturally handled with denoising score matching. We show that SIVI-SM closely matches the accuracy of MCMC and outperforms ELBO-based SIVI methods in a variety of Bayesian inference tasks.

1. INTRODUCTION

Variational inference (VI) is an approximate Bayesian inference approach where the inference problem is transformed into an optimization problem (Jordan et al., 1999; Wainwright & Jordan, 2008; Blei et al., 2017). It starts by introducing a family of variational distributions over the model parameters (or latent variables) to approximate the posterior. The goal then is to find the member of this family that is closest to the target posterior, where the closeness is usually measured by the Kullback-Leibler (KL) divergence from the posterior to the variational approximation. In practice, this is often achieved by maximizing the evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence (Jordan et al., 1999). One of the classical VI methods is mean-field VI (Bishop & Tipping, 2000), where the variational distributions are assumed to be factorized over the parameters (or latent variables). When combined with conditional conjugacy, this often leads to simple optimization schemes with closed-form update rules (Blei et al., 2017). While popular, the factorization assumption and the conjugacy condition greatly restrict the flexibility and applicability of variational posteriors, especially for complicated models with high-dimensional parameter spaces. Recent years have witnessed much progress in extending VI to more complicated settings. For example, the conjugacy condition has been removed by black-box VI methods, which accommodate a broad class of models via Monte Carlo gradient estimators (Nott et al., 2012; Paisley et al., 2012; Ranganath et al., 2014; Rezende et al., 2014; Kingma & Welling, 2014).
On the other hand, more flexible variational families have been proposed that either explicitly incorporate more complicated structures among the parameters (Jaakkola & Jordan, 1998; Saul & Jordan, 1996; Giordano et al., 2015; Tran et al., 2015) or borrow ideas from invertible transformations of probability distributions (Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma et al., 2016; Papamakarios et al., 2019). All these methods require tractable densities for the variational distributions. It turns out that the variational family can be further expanded by allowing implicit models that have intractable densities but are easy to sample from (Huszár, 2017). One way to construct these implicit models is to transform a simple base distribution via a deterministic map, e.g., a deep neural network (Tran et al., 2017; Mescheder et al., 2017; Shi et al., 2018a;b; Song et al., 2019). Due to the intractable densities of implicit models, evaluating the ELBO during training often requires density ratio estimation, which is known to be challenging in high-dimensional settings (Sugiyama et al., 2012). To avoid density ratio estimation, semi-implicit variational inference (SIVI) has been proposed, where the variational distributions are formed through a semi-implicit hierarchical construction and (asymptotically unbiased) surrogate ELBOs are employed for training (Yin & Zhou, 2018; Moens et al., 2021). Instead of surrogate ELBOs, an unbiased gradient estimator of the exact ELBO has been derived based on MCMC samples from a reverse conditional (Titsias & Ruiz, 2019). However, the computation for the inner-loop MCMC runs can easily become expensive in high-dimensional regimes. There are also approaches that estimate the gradients instead of the objective (Li & Turner, 2018; Shi et al., 2018b; Song et al., 2019).
Besides the KL divergence, score-based distance measures have also been introduced in various statistical tasks (Hyvärinen, 2005; Zhang et al., 2018) and have shown advantages in complicated nonlinear models (Song & Ermon, 2019; Ding et al., 2019; Elkhalil et al., 2021). Recently, some studies have also used score matching for variational inference (Yang et al., 2019; Hu et al., 2018). However, these methods are not designed for SIVI and hence either do not apply to, or cannot fully exploit, the hierarchical structure of semi-implicit variational distributions. In this paper, we propose SIVI-SM, a new method for SIVI using an alternative training objective via score matching. We show that the score matching objective and the semi-implicit hierarchical construction of variational posteriors can be combined in a minimax formulation where the intractability of densities is naturally handled with denoising score matching. We demonstrate the effectiveness and efficiency of our method on both synthetic distributions and a variety of real-data Bayesian inference tasks.

2. BACKGROUND

Semi-Implicit Variational Inference. Semi-implicit variational inference (SIVI) (Yin & Zhou, 2018) posits a flexible variational family defined hierarchically using a mixing parameter as follows

$$x \sim q_\varphi(x|z), \quad z \sim q_\xi(z), \quad q_\phi(x) = \int q_\varphi(x|z)\, q_\xi(z)\, dz, \tag{1}$$

where $\phi = \{\varphi, \xi\}$ are the variational parameters. This variational distribution is called semi-implicit as the conditional layer $q_\varphi(x|z)$ is required to be explicit but the mixing layer $q_\xi(z)$ can be implicit, and $q_\phi(x)$ is often implicit unless $q_\xi(z)$ is conjugate to $q_\varphi(x|z)$. Compared to standard VI, the above hierarchical construction allows a much richer variational family that is able to capture complicated dependencies between parameters (Yin & Zhou, 2018). Similar to standard VI, current SIVI methods fit the model parameters by maximizing the evidence lower bound (ELBO) derived as follows

$$\log p(D) \ge \mathrm{ELBO} := \mathbb{E}_{x \sim q_\phi(x)}\left[\log p(D, x) - \log q_\phi(x)\right],$$

where $D$ is the observed data. As $q_\phi(x)$ is no longer tractable, Yin & Zhou (2018) considered a sequence of lower bounds of the ELBO

$$\mathrm{ELBO} \ge \mathcal{L}_L := \mathbb{E}_{z \sim q_\xi(z),\, x \sim q_\varphi(x|z)}\, \mathbb{E}_{z^{(1)}, \dots, z^{(L)} \overset{\text{i.i.d.}}{\sim} q_\xi(z)} \log \frac{p(D, x)}{\frac{1}{L+1}\left[q_\varphi(x|z) + \sum_{l=1}^{L} q_\varphi(x|z^{(l)})\right]}. \tag{2}$$

Note that $\mathcal{L}_L$ is an asymptotically exact surrogate ELBO as $L \to \infty$. An increasing sequence $\{L_t\}_{t=1}^{\infty}$, therefore, is often suggested, with $\mathcal{L}_{L_t}$ being optimized at the $t$-th iteration. Instead of maximizing surrogate ELBOs, Titsias & Ruiz (2019) proposed unbiased implicit variational inference (UIVI), which is based on an unbiased gradient estimator of the exact ELBO. More specifically, consider a fixed mixing distribution $q_\xi(z) = q(z)$ and a reparameterizable conditional $q_\varphi(x|z)$ such that $x = T_\varphi(z, \epsilon),\ \epsilon \sim q_\epsilon(\epsilon) \Leftrightarrow x \sim q_\varphi(x|z)$. Then

$$\nabla_\varphi \mathrm{ELBO} = \nabla_\varphi \mathbb{E}_{q_\epsilon(\epsilon) q(z)}\left[\log p(D, x) - \log q_\varphi(x)\right]\Big|_{x = T_\varphi(z, \epsilon)} := \mathbb{E}_{q_\epsilon(\epsilon) q(z)}\left[g^{\mathrm{mod}}_\varphi(z, \epsilon) + g^{\mathrm{ent}}_\varphi(z, \epsilon)\right], \tag{3}$$

where

$$g^{\mathrm{mod}}_\varphi(z, \epsilon) := \nabla_x \log p(D, x)\big|_{x = T_\varphi(z, \epsilon)} \nabla_\varphi T_\varphi(z, \epsilon), \qquad g^{\mathrm{ent}}_\varphi(z, \epsilon) := -\mathbb{E}_{q_\varphi(z'|x)}\left[\nabla_x \log q_\varphi(x|z')\right]\Big|_{x = T_\varphi(z, \epsilon)} \nabla_\varphi T_\varphi(z, \epsilon). \tag{4}$$
The gradient term in Eq. 4 involves an expectation w.r.t. the reverse conditional $q_\varphi(z|x)$, which can be estimated using an MCMC sampler (e.g., Hamiltonian Monte Carlo (Neal, 2011)). However, the inner-loop MCMC runs can easily become computationally expensive in high-dimensional regimes. See Appendices A and D for a more detailed discussion of the derivation and the computational issues of UIVI.

Score Matching. Score matching was first introduced by Hyvärinen (2005) to learn un-normalized statistical models given i.i.d. samples from an unknown data distribution $p(x)$. Instead of estimating $p(x)$ directly, score matching trains a score network $S(x)$ to estimate the score of the data distribution, i.e., $\nabla_x \log p(x)$, by minimizing the score matching objective $\mathbb{E}_{p(x)}\left[\frac{1}{2}\|S(x) - \nabla_x \log p(x)\|_2^2\right]$. Using integration by parts, Hyvärinen (2005) shows that the score matching objective is equivalent to the following up to a constant

$$\mathbb{E}_{x \sim p(x)}\left[\mathrm{Tr}(\nabla_x S(x)) + \frac{1}{2}\|S(x)\|_2^2\right]. \tag{5}$$

The expectation in Eq. 5 can be quickly estimated using data samples. However, it is often challenging to scale up score matching to high-dimensional data due to the computation of $\mathrm{Tr}(\nabla_x S(x))$. A commonly used variant of score matching that can scale up to high-dimensional data is denoising score matching (DSM) (Vincent, 2011). The first step of DSM is to perturb the data with a known noise distribution $q_\sigma(\tilde{x}|x)$, which leads to a perturbed data distribution $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x}|x)\, p(x)\, dx$. The score matching objective for $q_\sigma(\tilde{x})$ turns out to be equivalent to the following up to a constant

$$\frac{1}{2}\mathbb{E}_{q_\sigma(\tilde{x}|x) p(x)} \left\|S(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x)\right\|_2^2. \tag{6}$$

Unlike Eq. 5, Eq. 6 does not involve the trace term and can be computed efficiently, as long as the score of the noise distribution $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x)$ is easy to compute. Note that the optimal score network here estimates the score of the perturbed data distribution rather than that of the true data distribution.
A small noise level, therefore, is required for an accurate approximation of the true data score $\nabla_x \log p(x)$. Despite this, DSM is widely used in learning energy-based models (Saremi et al., 2018) and score-based generative models (Song & Ermon, 2019).
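The remark that DSM recovers the score of the perturbed distribution can be checked on a small example. The sketch below is illustrative only and rests on assumptions that are not part of the paper: one-dimensional data $x \sim \mathcal{N}(0, 1)$, Gaussian noise of scale `sigma`, and a hypothetical linear score model $S(\tilde{x}) = a\tilde{x}$, for which the DSM minimizer is an ordinary least-squares slope.

```python
import random

def fit_linear_dsm_score(n=200_000, sigma=0.5, seed=0):
    """Fit S(x_tilde) = a * x_tilde by minimizing the DSM objective in closed form.

    Assumed toy setup: x ~ N(0, 1), x_tilde = x + sigma * e with e ~ N(0, 1).
    DSM regression target: d/dx_tilde log q_sigma(x_tilde | x) = -(x_tilde - x) / sigma**2.
    """
    rng = random.Random(seed)
    num = 0.0  # accumulates x_tilde * target
    den = 0.0  # accumulates x_tilde ** 2
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        x_tilde = x + sigma * rng.gauss(0.0, 1.0)
        target = -(x_tilde - x) / sigma**2
        num += x_tilde * target
        den += x_tilde * x_tilde
    return num / den  # least-squares slope of the linear score model

sigma = 0.5
a_hat = fit_linear_dsm_score(sigma=sigma)
score_perturbed = -1.0 / (1.0 + sigma**2)  # score slope of N(0, 1 + sigma^2)
score_clean = -1.0                          # score slope of N(0, 1)
print(a_hat, score_perturbed, score_clean)
```

In this setup the fitted slope should sit near the perturbed-distribution value rather than the clean one, and the gap between the two shrinks as `sigma` decreases.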

3. PROPOSED METHOD

While ELBO-based training objectives prove effective for semi-implicit variational inference, current approaches either rely on surrogates of the exact ELBO or on expensive inner-loop MCMC runs for unbiased gradient estimates due to the intractable variational posteriors. In this section, we introduce an alternative training objective for SIVI based on score matching. We show that the score matching objective can be reformulated in a minimax fashion such that the semi-implicit hierarchical construction of variational posteriors can be efficiently exploited using denoising score matching. Throughout this section, we assume the conditional layer $q_\varphi(x|z)$ to be reparameterizable and its score function $\nabla_x \log q_\varphi(x|z)$ to be easy to evaluate.

3.1. A MINIMAX REFORMULATION

Rather than maximizing the ELBO as in previous semi-implicit variational inference methods, we can instead minimize the following Fisher divergence that compares the score functions of the target and the variational distribution

$$\min_\phi\ \mathbb{E}_{x \sim q_\phi(x)} \left\|S(x) - \nabla_x \log q_\phi(x)\right\|_2^2. \tag{7}$$

Here $S(x) = \nabla_x \log p(x|D) = \nabla_x \log p(D, x)$ is the score of the target posterior distribution, and the variational distribution $q_\phi(x)$ is defined in Eq. 1. Due to the semi-implicit construction in Eq. 1, the score of the variational distribution, i.e., $\nabla_x \log q_\phi(x)$, is intractable, making the Fisher divergence in Eq. 7 not readily computable. Although the hierarchical structure of $q_\phi(x)$ allows us to estimate its score function via denoising score matching, the estimated score function would break the dependency on the variational parameter $\phi$, leading to biased gradient estimates (see an illustration in Appendix L). Fortunately, this issue can be remedied by reformulating Eq. 7 as a minimax problem. The key observation is that the squared norm of $S(x) - \nabla_x \log q_\phi(x)$ can be viewed as the maximum value of the following nested optimization problem

$$\left\|S(x) - \nabla_x \log q_\phi(x)\right\|_2^2 = \max_{f(x)} \left\{2 f(x)^T \left[S(x) - \nabla_x \log q_\phi(x)\right] - \|f(x)\|_2^2\right\}, \quad \forall x,$$

where $f(x)$ is an arbitrary function of $x$, and the unique optimal solution is $f_\phi(x) := S(x) - \nabla_x \log q_\phi(x)$. Based on this observation, we can rewrite the optimization problem in Eq. 7 as

$$\min_\phi \max_f\ \mathbb{E}_{x \sim q_\phi(x)} \left\{2 f(x)^T \left[S(x) - \nabla_x \log q_\phi(x)\right] - \|f(x)\|_2^2\right\}. \tag{8}$$

Now we can take advantage of the hierarchical structure of $q_\phi(x)$ to get rid of the intractable score term $\nabla_x \log q_\phi(x)$ in Eq. 8, similarly to what is done in DSM. More specifically, note that

$$\nabla_x \log q_\phi(x) = \frac{1}{q_\phi(x)} \int q_\xi(z)\, q_\varphi(x|z)\, \nabla_x \log q_\varphi(x|z)\, dz.$$

We have

$$\begin{aligned} \mathbb{E}_{x \sim q_\phi(x)}\, f(x)^T \nabla_x \log q_\phi(x) &= \int q_\phi(x)\, f(x)^T \frac{1}{q_\phi(x)} \int q_\xi(z)\, q_\varphi(x|z)\, \nabla_x \log q_\varphi(x|z)\, dz\, dx \\ &= \iint q_\xi(z)\, q_\varphi(x|z)\, f(x)^T \nabla_x \log q_\varphi(x|z)\, dz\, dx \\ &= \mathbb{E}_{z \sim q_\xi(z),\, x \sim q_\varphi(x|z)}\, f(x)^T \nabla_x \log q_\varphi(x|z). \end{aligned} \tag{9}$$

Substituting Eq. 9 into Eq. 8 completes our reformulation.

Theorem 1. Assume the variational distribution $q_\phi(x)$ is a semi-implicit distribution defined by Eq. 1. Then the optimization problem in Eq. 7 is equivalent to the following minimax problem

$$\min_\phi \max_f\ \mathbb{E}_{z \sim q_\xi(z),\, x \sim q_\varphi(x|z)} \left\{2 f(x)^T \left[S(x) - \nabla_x \log q_\varphi(x|z)\right] - \|f(x)\|_2^2\right\}, \tag{10}$$

where $\phi = \{\varphi, \xi\}$. Moreover, assume that $f$ can represent any function. If $(\phi^*, f^*)$ defines a Nash equilibrium of Eq. 10, then $f^*$ and $\phi^*$ are given by

$$f^*(x) = S(x) - \nabla_x \log q_{\phi^*}(x), \qquad \phi^* \in \arg\min_\phi \left\{\mathbb{E}_{x \sim q_\phi(x)} \left\|S(x) - \nabla_x \log q_\phi(x)\right\|_2^2\right\}. \tag{11}$$

See Appendix B for a detailed proof of Theorem 1. Note that the objective in Eq. 8 can also be derived via Stein discrepancy minimization (Liu et al., 2016; Gorham & Mackey, 2015; Ranganath et al., 2016; Grathwohl et al., 2020). However, our reformulation in Eq. 10 takes a further step by utilizing the hierarchical structure of $q_\phi(x)$ and hence can easily scale up to high dimensions. See Appendix K for a more detailed discussion.
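The identity in Eq. 9, which replaces the intractable marginal score by the conditional score under the joint expectation, can be checked by Monte Carlo on a hierarchy whose marginal is known. The sketch below assumes a conjugate Gaussian toy case that is not taken from the paper: $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(z, s^2)$, so that $q(x) = \mathcal{N}(0, 1 + s^2)$; the test function `tanh` is an arbitrary choice.

```python
import math
import random

def check_eq9(n=200_000, s=0.5, seed=0):
    """Estimate both sides of Eq. 9 with the same samples.

    Assumed toy hierarchy: z ~ N(0, 1), x | z ~ N(z, s^2), hence q(x) = N(0, 1 + s^2).
    """
    rng = random.Random(seed)
    f = math.tanh  # arbitrary scalar test function f(x)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = z + s * rng.gauss(0.0, 1.0)
        lhs += f(x) * (-x / (1.0 + s * s))   # uses the marginal score of q(x)
        rhs += f(x) * (-(x - z) / (s * s))   # uses the conditional score of q(x | z)
    return lhs / n, rhs / n

lhs, rhs = check_eq9()
print(lhs, rhs)
```

The two estimates should agree up to Monte Carlo error, even though the right-hand side never evaluates the marginal density.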

3.2. PRACTICAL ALGORITHMS

In practice, we parameterize $f$ with a neural network $f_\psi(x)$. According to the above minimax reformulation, the Monte Carlo estimate of the objective function in Eq. 10 can be easily obtained by sampling from the hierarchical variational distribution $z \sim q_\xi(z)$, $x \sim q_\varphi(x|z)$. Furthermore, using the reparameterization trick (Kingma & Welling, 2013), i.e., $x = T_\varphi(z; \epsilon)$, $z = h_\xi(\gamma)$, where $\epsilon \sim q_\epsilon(\epsilon)$, $\gamma \sim q_\gamma(\gamma)$, we can rewrite Eq. 10 as follows

$$\min_\phi \max_\psi\ \mathbb{E}_{q_\gamma(\gamma),\, q_\epsilon(\epsilon)} \left\{2 f_\psi(x)^T \left[S(x) - \nabla_x \log q_\varphi(x|z) - \frac{1}{2} f_\psi(x)\right]\right\}\Big|_{x = T_\varphi(z, \epsilon),\ z = h_\xi(\gamma)}.$$

Algorithm 1 SIVI-SM with multivariate Gaussian conditional layer
Input: Score of target posterior distribution $S(x)$. Total iteration number $N$. Number of gradient steps $K$ for the inner optimization on $f_\psi(x)$.
Output: Variational parameter $\phi$ and the neural network parameter $\psi$.
Initialize $\phi_0$, $\psi_0$
for $t = 0$ to $N - 1$ do
    Sample $\{\gamma^{(1)}, \gamma^{(2)}, \dots, \gamma^{(m)}\}$ from prior $q_\gamma(\gamma)$ and let $z^{(i)} = h_\xi(\gamma^{(i)})$, $i = 1, \dots, m$.
    Sample $\{\epsilon^{(1)}, \epsilon^{(2)}, \dots, \epsilon^{(m)}\}$ from $\mathcal{N}(0, I)$.
    Compute $x^{(i)} = \mu_\varphi(z^{(i)}) + \sigma_\varphi(z^{(i)}) \odot \epsilon^{(i)}$.
    Update $\phi$ by descending its stochastic gradient:
        $\nabla_\phi \frac{1}{m} \sum_{i=1}^{m} \left\{ f_\psi(x^{(i)})^T \left[S(x^{(i)}) + \sigma_\varphi(z^{(i)})^{-1} \odot \epsilon^{(i)}\right] - \frac{1}{2} \|f_\psi(x^{(i)})\|_2^2 \right\}$.
    for $j = 1$ to $K$ do
        Sample $\{\gamma^{(1)}, \gamma^{(2)}, \dots, \gamma^{(m)}\}$ from prior $q_\gamma(\gamma)$ and let $z^{(i)} = h_\xi(\gamma^{(i)})$, $i = 1, \dots, m$.
        Sample $\{\epsilon^{(1)}, \epsilon^{(2)}, \dots, \epsilon^{(m)}\}$ from $\mathcal{N}(0, I)$.
        Compute $x^{(i)} = \mu_\varphi(z^{(i)}) + \sigma_\varphi(z^{(i)}) \odot \epsilon^{(i)}$.
        Update $\psi$ by ascending its stochastic gradient:
            $\nabla_\psi \frac{1}{m} \sum_{i=1}^{m} \left\{ f_\psi(x^{(i)})^T \left[S(x^{(i)}) + \sigma_\varphi(z^{(i)})^{-1} \odot \epsilon^{(i)}\right] - \frac{1}{2} \|f_\psi(x^{(i)})\|_2^2 \right\}$.
    end for
end for

This allows us to directly optimize the parameters $\phi$, $\psi$ with gradient-based optimization methods (Goodfellow et al., 2014). Optimizing $f$ to completion in the inner loop of training is computationally prohibitive. Therefore, we alternate between $K$ steps of optimizing $f$ and one step of optimizing $q_\phi$. For the conditional layer, we use the multivariate Gaussian distribution with diagonal covariance matrix, $q_\varphi(x|z) = \mathcal{N}(\mu_\varphi(z), \mathrm{diag}\{\sigma_\varphi^2(z)\})$, which can be reparameterized as $x = \mu_\varphi(z) + \sigma_\varphi(z) \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ denotes the element-wise product. The corresponding score function is $\nabla_x \log q_\varphi(x|z) = -\sigma_\varphi(z)^{-1} \odot \epsilon$, with the inverse taken element-wise. The complete procedure of SIVI-SM is formally presented in Algorithm 1.
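To make the alternating updates of Algorithm 1 concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. Everything specific in it is an assumption chosen so that the gradients can be written by hand without an autodiff library: the target is $\mathcal{N}(m, v)$, the mixing layer is $z \sim \mathcal{N}(0, 1)$, the conditional mean is linear, $\mu(z) = az + b$, the conditional scale `sigma` is held fixed, and the critic is linear, $f(x) = cx + d$. The function name and all hyperparameter values are hypothetical.

```python
import random

def train_sivi_sm_1d(m=1.0, v=1.0, sigma=0.6, iters=2000, K=5, batch=64,
                     lr_phi=0.01, lr_psi=0.1, seed=0):
    """Toy SIVI-SM: target N(m, v); q(x|z) = N(a*z + b, sigma^2) with z ~ N(0, 1)."""
    rng = random.Random(seed)
    a, b = 0.3, 0.0   # variational parameters (sigma is kept fixed in this sketch)
    c, d = 0.0, 0.0   # parameters of the linear critic f(x) = c*x + d

    def draw():
        z = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        x = a * z + b + sigma * eps
        # S(x) - grad_x log q(x|z) = S(x) + eps / sigma for a Gaussian conditional
        resid = -(x - m) / v + eps / sigma
        return z, x, resid

    for _ in range(iters):
        # One descent step on (a, b) for J = f(x) * resid - 0.5 * f(x)^2
        ga = gb = 0.0
        for _ in range(batch):
            z, x, resid = draw()
            f = c * x + d
            dJ_dx = c * (resid - f) - f / v   # d resid / dx = -1/v, d f / dx = c
            ga += dJ_dx * z                   # dx/da = z
            gb += dJ_dx                       # dx/db = 1
        a -= lr_phi * ga / batch
        b -= lr_phi * gb / batch
        # K ascent steps on (c, d), each with fresh samples
        for _ in range(K):
            gc = gd = 0.0
            for _ in range(batch):
                z, x, resid = draw()
                f = c * x + d
                gc += (resid - f) * x
                gd += resid - f
            c += lr_psi * gc / batch
            d += lr_psi * gd / batch
    return a, b, c, d

a, b, c, d = train_sivi_sm_1d()
print("mean:", b, "variance:", a * a + 0.6 ** 2, "critic:", (c, d))
```

Under these assumptions the variational marginal is $\mathcal{N}(b, a^2 + \sigma^2)$, so a successful run should move `b` towards `m` and `a * a + sigma ** 2` towards `v`, while the critic shrinks towards zero as the two scores agree. A linear critic suffices here only because the score difference between two Gaussians is linear; the paper uses a neural network for $f_\psi$.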

3.3. THEORETICAL RESULTS REGARDING NEURAL NETWORK APPROXIMATION

The inexact lower-level optimization for neural networks introduces approximation errors. Also, neural networks themselves may introduce approximation gaps due to their limited approximation capacities. These numerical errors may affect the approximation quality of the variational distribution $q_\phi(x)$, which we analyze in the following proposition.

Proposition 1. Let $\Omega$ be the feasible domain of $\phi$. For all $\phi \in \Omega$, we say that $f_{\psi(\phi)}$ is $\varepsilon$-accurate if

$$\mathbb{E}_{x \sim q_\phi(x)} \|R_\phi(x)\|_2^2 \le \varepsilon, \quad \text{where } R_\phi(x) := S(x) - \nabla_x \log q_\phi(x) - f_{\psi(\phi)}(x). \tag{12}$$

Let $\phi^*$ be one of the optimal variational parameters defined in Eq. 11 and $\hat{\phi}$ be the one obtained using the neural network approximation, defined as follows with $f_{\psi(\phi)}$ being $\varepsilon$-accurate:

$$\hat{\phi} := \arg\min_{\phi \in \Omega} \left\{\mathbb{E}_{x \sim q_\phi(x)} \left[2 f_{\psi(\phi)}(x)^T \left[S(x) - \nabla_x \log q_\phi(x)\right] - \|f_{\psi(\phi)}(x)\|_2^2\right]\right\}.$$

Then we have

$$\mathbb{E}_{x \sim q_{\hat{\phi}}(x)} \left\|S(x) - \nabla_x \log q_{\hat{\phi}}(x)\right\|_2^2 \le \mathbb{E}_{x \sim q_{\phi^*}(x)} \left\|S(x) - \nabla_x \log q_{\phi^*}(x)\right\|_2^2 + \varepsilon.$$

See Appendix C for a detailed proof of Proposition 1. From Proposition 1, we see that the approximation error of our numerical solution to the minimax problem in Eq. 10 is controlled by two terms. The first term, $\mathbb{E}_{x \sim q_{\phi^*}(x)} \|S(x) - \nabla_x \log q_{\phi^*}(x)\|_2^2$, measures the approximation ability of the variational distribution, and the second term, $\varepsilon$, measures the approximation/optimization error of the neural network $f_\psi(x)$. As long as the approximation error of $f_\psi(x)$ is small, the minimax formulation in Eq. 10 can provide variational posteriors with approximation accuracy similar to that of the original problem in Eq. 7 in terms of Fisher divergence to the target posterior.

Remark. When the variational parameter $\phi$ is fixed, the lower-level optimization problem on $f_\psi(x)$ is equivalent to

$$\min_\psi\ \mathbb{E}_{x \sim q_\phi(x)} \left\|S(x) - \nabla_x \log q_\phi(x) - f_\psi(x)\right\|_2^2,$$

and we use this objective as a measure of the approximation accuracy of $f_\psi(x)$ given $\phi$ in Eq. 12.

4. EXPERIMENTS

In this section, we compare SIVI-SM to ELBO-based methods, including the original SIVI and UIVI, on a range of inference tasks. We first show the effectiveness of our method and illustrate the role of the auxiliary network approximation $f_\psi$ on several two-dimensional toy examples. The KL divergence from the target distributions to the different variational approximations is also provided for direct comparison. We then compare the performance of SIVI-SM with both baseline methods on several Bayesian inference tasks, including a multidimensional Bayesian logistic regression problem and a high-dimensional Bayesian multinomial logistic regression problem. Following Titsias & Ruiz (2019), we set the conditional layer to be $q_\varphi(x|z) = \mathcal{N}(x|\mu_\varphi(z), \mathrm{diag}(\sigma))$ and fix the mixing layer as $q(z) = \mathcal{N}(0, I)$. The variational parameters therefore are $\phi = \{\varphi, \sigma\}$. All experiments were implemented in PyTorch (Paszke et al., 2019). If not otherwise specified, we use the Adam optimizer for training (Kingma & Ba, 2014).

4.1. TOY EXAMPLES

We first apply SIVI-SM to approximate three synthetic distributions defined on a two-dimensional space: a banana-shaped distribution, a multimodal Gaussian, and an X-shaped mixture of Gaussians. The densities of these distributions are given in Table 2 in Appendix E. For ease of comparison, we used the same configuration of the semi-implicit distribution family as in UIVI (Titsias & Ruiz, 2019). The mean network $\mu_\varphi(z)$ is a multilayer perceptron (MLP) with layer widths [3, 50, 50, 2]. The network approximation $f_\psi(x)$ is parameterized by a 4-layer MLP with layer widths [2, 128, 128, 2]. For SIVI-SM, we set the number of inner-loop gradient steps $K = 1$. For SIVI, we set $L = 50$ for the surrogate ELBO defined in Eq. 2. For UIVI, we used 10 iterations for every inner-loop HMC sampling. To facilitate exploration, for all methods, we used the annealing trick (Rezende & Mohamed, 2015) during training for the multimodal and X-shaped Gaussian distributions. Variational approximations from all methods were obtained after 50,000 variational parameter updates. Figure 5 in Appendix F shows the contour plots of the synthetic distributions, together with 1000 samples from the trained variational distributions. We see that SIVI-SM produces samples that match the target distributions well. We also report the KL divergence from the target distributions to the variational posteriors (estimated via the ITE package (Szabó, 2014) using 100,000 samples from each distribution) given by different methods in Table 3 in Appendix G. We see that SIVI-SM performs better for more challenging target distributions. To better understand the role that the network approximation $f_\psi(x)$ plays during the training process, we visualize its training dynamics on the X-shaped distribution in Figure 1. We see that during training, $f_\psi(x)$ automatically detects where the current approximation is insufficient and guides the variational posterior towards these areas. Note that the Nash equilibrium $f^*(x)$ in Eq. 11 is the difference between the score functions of the target distribution $p(x)$ and $q_\phi(x)$. As the variational posterior gets closer to the target, the signal provided by $f_\psi(x)$ becomes weaker and would converge to zero in the perfect case. More details on the convergence of $f_\psi(x)$ can be found in Appendix H.

4.2. BAYESIAN LOGISTIC REGRESSION

Our second example is a Bayesian logistic regression problem where the log-likelihood function takes the following form

$$\log p(y_i|x_i, \beta) = y_i\, \beta^T \tilde{x}_i - \log\left(1 + \exp(\beta^T \tilde{x}_i)\right), \quad y_i \in \{0, 1\}, \quad \tilde{x}_i = \begin{bmatrix} 1 \\ x_i \end{bmatrix}.$$

Here $x_i$ are covariates, and $y_i \in \{0, 1\}$ are binary response variables. Following Yin & Zhou (2018), we set the prior as $\beta \sim \mathcal{N}(0, \alpha^{-1} I)$, where $\alpha = 0.01$. We consider the waveform dataset, where the dimension of $x_i$ is 21, which leads to a parameter space of 22 dimensions. We used a standard 10-dimensional Gaussian prior for $q(z)$. For $\mu_\varphi(z)$, we used a 4-layer MLP with layer widths [10, 100, 100, 22]. The network approximation $f_\psi(\beta)$ is also a 4-layer MLP with layer widths [22, 256, 256, 22]. As in Section 4.1, we set the number of inner-loop gradient steps $K = 1$ in SIVI-SM. For SIVI, we set $L = 100$ and used the same training method as in Yin & Zhou (2018). For UIVI, we set the length of inner-loop HMC iterations to be 10 with the first 5 iterations discarded as burn-in, with 5 leapfrog steps in each iteration. The results of all methods were collected after 20,000 variational parameter updates. We collected 1000 samples of $\beta$ to represent the approximated posterior distributions for all three SIVI variants. The ground truth was formed from a long MCMC run of 400,000 iterations using parallel stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) with 1000 independent particles and a small stepsize of $10^{-4}$. Figure 2 shows the posterior estimates provided by different SIVI variants in contrast to the ground truth MCMC results. We see that SIVI and UIVI tend to slightly underestimate the variance for both univariate marginal and pairwise joint posteriors (especially for $\beta_4$, $\beta_5$), while SIVI-SM agrees well with MCMC. Furthermore, we also examined the covariance estimates of $\beta$; the results are presented in Figure 3.
We see that SIVI-SM provides the best overall approximation to the posterior, achieving the smallest root mean square error (RMSE) to the ground truth at 0.0184.
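For this model, the target score $S(\beta) = \nabla_\beta \log p(D, \beta)$ required by SIVI-SM is available in closed form. The sketch below is a plain-Python illustration under the likelihood and prior stated above; the function names and the tiny data set are made up for the example and are not from the paper's code.

```python
import math

def log_joint(beta, X, y, alpha=0.01):
    """log p(y, beta | X) up to a constant: logistic likelihood plus N(0, I / alpha) prior."""
    total = -0.5 * alpha * sum(b * b for b in beta)
    for xi, yi in zip(X, y):
        xt = [1.0] + list(xi)                      # prepend the intercept feature
        u = sum(b * t for b, t in zip(beta, xt))   # logit beta^T x_tilde
        total += yi * u - math.log1p(math.exp(u))
    return total

def score(beta, X, y, alpha=0.01):
    """S(beta) = grad_beta log p(y, beta | X)."""
    grad = [-alpha * b for b in beta]              # gradient of the Gaussian log-prior
    for xi, yi in zip(X, y):
        xt = [1.0] + list(xi)
        u = sum(b * t for b, t in zip(beta, xt))
        p = 1.0 / (1.0 + math.exp(-u))             # sigmoid(u)
        for k, t in enumerate(xt):
            grad[k] += (yi - p) * t
    return grad

# Hypothetical toy data: three observations with two covariates each
X = [[0.5, -1.2], [1.5, 0.3], [-0.7, 0.8]]
y = [1, 0, 1]
beta = [0.1, -0.2, 0.3]
print(score(beta, X, y))
```

In practice one would typically obtain this gradient by automatic differentiation; the closed form is shown only to make explicit what $S(x)$ denotes in this experiment.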

4.3. BAYESIAN MULTINOMIAL LOGISTIC REGRESSION

Our next example is a Bayesian multinomial logistic regression problem. For a data set of $N$ covariate and label pairs $\{(x_i, y_i) : i = 1, \dots, N\}$, where $y_i \in \{1, \dots, R\}$, the categorical likelihood is

$$p(y_i = r|x_i) \propto \exp\left([1, x_i^T] \cdot \beta_r\right), \quad r \in \{1, 2, \dots, R\},$$

where $\beta = (\beta_1^T, \beta_2^T, \dots, \beta_R^T)^T$ is the model parameter. We used the same variational family as before, with a 100-dimensional standard Gaussian prior for $q(z)$. We used MLPs with two hidden layers for the mean network $\mu_\varphi(z)$ of the Gaussian conditional and the network approximation $f_\psi(\beta)$, with 200 hidden neurons per hidden layer for $\mu_\varphi(z)$ and 256 hidden neurons per hidden layer for $f_\psi(\beta)$. We used the same initialization of variational parameters for all methods. Following Titsias & Ruiz (2019), we used a minibatch size of 2,000 for MNIST and 863 for HAPT. As before, we set the number of inner-loop gradient steps $K = 1$ in SIVI-SM. For SIVI, we set $L = 200$ as previously done by Titsias & Ruiz (2019). For UIVI, we set the number of inner-loop HMC iterations to be 10 and discarded the first 5 iterations as burn-in, with 5 leapfrog steps in each iteration. As done in UIVI, we used the RMSProp optimizer (Tieleman & Hinton, 2012) for training. We used different batch sizes during training to investigate their effect on the quality of variational approximations for different methods. These batch sizes were selected in such a way that the corresponding computational times are comparable between different methods. See Appendix I for a detailed comparison of the computation times (the experiments were run on an RTX2080 GPU). As the gradient computation for the inner-loop HMC sampling required by UIVI is not scalable (see Appendix D), the batch size for UIVI is set as $m = 1$, which was also used by Titsias & Ruiz (2019). For all methods, the results were collected after 90,000 variational parameter updates for MNIST and 40,000 variational parameter updates for HAPT.

Figure 4 shows the predictive log-likelihood on the test data as a function of the number of iterations for both data sets, where the estimates were formed based on 8,000 samples from the variational distributions.

4.4. BAYESIAN NEURAL NETWORKS

Lastly, we compare our method with SIVI, UIVI and SGLD on sampling from the posterior of Bayesian neural networks on the UCI datasets. We use a two-layer network with 50 hidden units and the ReLU activation function. The datasets are all randomly partitioned into 90% for training and 10% for testing. We use the same variational family as before, with a 3-dimensional standard Gaussian prior for $q(z)$, 10 hidden neurons for $\mu_\varphi(z)$, and 16 hidden neurons for $f_\psi(x)$. The results are averaged over 10 random trials. We refer the reader to Appendix J for hyper-parameter tuning and other experiment details. Table 1 shows the average test RMSE and NLL and their standard deviations. We see that SIVI-SM achieves results that are on par with or better than those of SIVI and UIVI. Although SGLD performs better on some datasets, it requires a long run to generate samples.

5. CONCLUSION

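We proposed SIVI-SM, a new method for semi-implicit variational inference based on an alternative training objective via score matching. Unlike the ELBO-based objectives, we showed that the score matching objective allows a minimax formulation where the hierarchical structure of semi-implicit variational families can be more efficiently exploited, as the corresponding intractable variational densities can be naturally handled with denoising score matching. In experiments, we demonstrated that SIVI-SM closely matches the accuracy of MCMC in posterior estimation and outperforms two typical ELBO-based methods (SIVI and UIVI) in a variety of Bayesian inference tasks.

A DERIVATION OF EQ. 2-4

The gradient of the ELBO is

$$\begin{aligned} \nabla_\varphi \mathrm{ELBO} &= \nabla_\varphi \mathbb{E}_{q_\epsilon(\epsilon) q(z)} \left[\log p(D, x) - \log q_\varphi(x)\right]\Big|_{x = T_\varphi(z, \epsilon)} \\ &= \mathbb{E}_{q_\epsilon(\epsilon) q(z)} \left[\nabla_x \log p(D, x) \nabla_\varphi T_\varphi(z, \epsilon) - \nabla_x \log q_\varphi(x) \nabla_\varphi T_\varphi(z, \epsilon)\right]\Big|_{x = T_\varphi(z, \epsilon)} \\ &:= \mathbb{E}_{q_\epsilon(\epsilon) q(z)} \left[g^{\mathrm{mod}}_\varphi(z, \epsilon) + g^{\mathrm{ent}}_\varphi(z, \epsilon)\right], \end{aligned}$$

where

$$g^{\mathrm{mod}}_\varphi(z, \epsilon) := \nabla_x \log p(D, x)\big|_{x = T_\varphi(z, \epsilon)} \nabla_\varphi T_\varphi(z, \epsilon), \qquad g^{\mathrm{ent}}_\varphi(z, \epsilon) := -\nabla_x \log q_\varphi(x)\big|_{x = T_\varphi(z, \epsilon)} \nabla_\varphi T_\varphi(z, \epsilon).$$

The second equality uses the fact that the term arising from the explicit dependence of $\log q_\varphi(x)$ on $\varphi$ vanishes in expectation, since $\mathbb{E}_{q_\varphi(x)}\left[\nabla_\varphi \log q_\varphi(x)\right] = 0$. Note the following property of the marginal score function,

$$\nabla_x \log q_\varphi(x) = \int q_\varphi(z'|x)\, \nabla_x \log q_\varphi(x|z')\, dz'.$$

Then the gradient $g^{\mathrm{ent}}_\varphi$ can be represented as

$$g^{\mathrm{ent}}_\varphi(z, \epsilon) = -\mathbb{E}_{q_\varphi(z'|x)}\left[\nabla_x \log q_\varphi(x|z')\right]\Big|_{x = T_\varphi(z, \epsilon)} \nabla_\varphi T_\varphi(z, \epsilon).$$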
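As a sanity check of the marginal score property used in this appendix, it can be worked out by hand for a conjugate Gaussian hierarchy. This case is chosen purely for illustration (the symbol $s$ and the specific distributions are assumptions, not the setting of the experiments):

```latex
\begin{aligned}
& z \sim \mathcal{N}(0, 1), \quad x \mid z \sim \mathcal{N}(z, s^2)
\;\Longrightarrow\;
q(x) = \mathcal{N}(0, 1 + s^2), \quad
q(z \mid x) = \mathcal{N}\!\left(\tfrac{x}{1 + s^2},\, \tfrac{s^2}{1 + s^2}\right), \\
& \mathbb{E}_{q(z' \mid x)}\!\left[\nabla_x \log q(x \mid z')\right]
= -\frac{x - \mathbb{E}[z' \mid x]}{s^2}
= -\frac{1}{s^2}\left(x - \frac{x}{1 + s^2}\right)
= -\frac{x}{1 + s^2}
= \nabla_x \log q(x).
\end{aligned}
```

Here the reverse conditional is available in closed form, so no MCMC is needed; in the general semi-implicit case it is not, which is why UIVI resorts to inner-loop sampling.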

B PROOF OF THEOREM 1

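Proof. As discussed in Section 3.1, by introducing the vector-valued function $f$, we can rewrite the optimization objective in Eq. 7 as

$$\mathbb{E}_{x \sim q_\phi(x)} \max_{f(x)} \left\{2 f(x)^T \left[S(x) - \nabla_x \log q_\phi(x)\right] - \|f(x)\|_2^2\right\}. \tag{16}$$

Computing the score of the semi-implicit distribution $q_\phi(x)$ defined in Eq. 1, we have

$$\nabla_x \log q_\phi(x) = \frac{1}{q_\phi(x)} \nabla_x \int q_\xi(z)\, q_\varphi(x|z)\, dz = \frac{1}{q_\phi(x)} \int q_\xi(z)\, q_\varphi(x|z)\, \nabla_x \log q_\varphi(x|z)\, dz.$$

Substituting the above score of $q_\phi(x)$ into Eq. 16, we have

$$\begin{aligned} & \mathbb{E}_{x \sim q_\phi(x)} \max_{f(x)} \left\{2 f(x)^T \left[S(x) - \nabla_x \log q_\phi(x)\right] - \|f(x)\|_2^2\right\} \\ =\ & \mathbb{E}_{x \sim q_\phi(x)} \max_{f(x)} \left\{2 f(x)^T \left[S(x) - \frac{1}{q_\phi(x)} \int q_\xi(z)\, q_\varphi(x|z)\, \nabla_x \log q_\varphi(x|z)\, dz\right] - \|f(x)\|_2^2\right\} \\ =\ & \max_{f} \left\{\mathbb{E}_{x \sim q_\phi(x)} \left[2 f(x)^T S(x) - \|f(x)\|_2^2\right] - \iint q_\xi(z)\, q_\varphi(x|z)\, 2 f(x)^T \nabla_x \log q_\varphi(x|z)\, dx\, dz\right\} \\ =\ & \max_{f} \left\{\mathbb{E}_{z \sim q_\xi(z),\, x \sim q_\varphi(x|z)} \left[2 f(x)^T \left(S(x) - \nabla_x \log q_\varphi(x|z)\right) - \|f(x)\|_2^2\right]\right\}. \end{aligned} \tag{17}$$

Therefore, letting $\phi = \{\xi, \varphi\}$, we can rewrite the original score matching problem in Eq. 7 as

$$\min_\phi \max_f\ \mathbb{E}_{z \sim q_\xi(z),\, x \sim q_\varphi(x|z)} \left\{2 f(x)^T \left[S(x) - \nabla_x \log q_\varphi(x|z)\right] - \|f(x)\|_2^2\right\}.$$

If $(\phi^*, f^*)$ defines a Nash equilibrium of the above problem, then, fixing the parameters $\phi = \phi^*$, the optimal vector-valued function $f^*(x)$ is unchanged throughout the derivation of Eq. 17. So we can deduce $f^*$ from Eq. 16:

$$f^*(x) = S(x) - \nabla_x \log q_{\phi^*}(x).$$

Substituting $f^*(x)$ into Eq. 10, we have

$$\begin{aligned} \phi^* &\in \arg\min_\phi \left\{\mathbb{E}_{z \sim q_\xi(z),\, x \sim q_\varphi(x|z)} \left[2 f^*(x)^T \left[S(x) - \nabla_x \log q_\varphi(x|z)\right] - \|f^*(x)\|_2^2\right]\right\} \\ &\in \arg\min_\phi \left\{\mathbb{E}_{x \sim q_\phi(x)} \left\|S(x) - \nabla_x \log q_\phi(x)\right\|_2^2\right\}. \end{aligned}$$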

C PROOF OF PROPOSITION 1

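Proof. Consider the score matching problem with the well-trained (i.e., $\varepsilon$-accurate) $f_{\psi(\phi)}$. We have

$$\begin{aligned} \hat{\phi} &= \arg\min_{\phi \in \Omega} \left\{\mathbb{E}_{x \sim q_\phi(x)} \left[2 f_{\psi(\phi)}(x)^T \left[S(x) - \nabla_x \log q_\phi(x)\right] - \|f_{\psi(\phi)}(x)\|_2^2\right]\right\} \\ &= \arg\min_{\phi \in \Omega} \left\{\mathbb{E}_{x \sim q_\phi(x)} \left[\left\|S(x) - \nabla_x \log q_\phi(x)\right\|_2^2 - \left\|S(x) - \nabla_x \log q_\phi(x) - f_{\psi(\phi)}(x)\right\|_2^2\right]\right\} \\ &= \arg\min_{\phi \in \Omega} \left\{\mathbb{E}_{x \sim q_\phi(x)} \left[\left\|S(x) - \nabla_x \log q_\phi(x)\right\|_2^2 - \|R_\phi(x)\|_2^2\right]\right\}. \end{aligned}$$

Therefore, we can bound the Fisher divergence between $S(x)$ and $\nabla_x \log q_{\hat{\phi}}(x)$ from above:

$$\begin{aligned} & \mathbb{E}_{x \sim q_{\hat{\phi}}(x)} \left\|S(x) - \nabla_x \log q_{\hat{\phi}}(x)\right\|_2^2 \\ =\ & \mathbb{E}_{x \sim q_{\hat{\phi}}(x)} \left[\left\|S(x) - \nabla_x \log q_{\hat{\phi}}(x)\right\|_2^2 - \|R_{\hat{\phi}}(x)\|_2^2 + \|R_{\hat{\phi}}(x)\|_2^2\right] \\ \le\ & \mathbb{E}_{x \sim q_{\phi^*}(x)} \left[\left\|S(x) - \nabla_x \log q_{\phi^*}(x)\right\|_2^2 - \|R_{\phi^*}(x)\|_2^2\right] + \mathbb{E}_{x \sim q_{\hat{\phi}}(x)} \|R_{\hat{\phi}}(x)\|_2^2 \\ \le\ & \mathbb{E}_{x \sim q_{\phi^*}(x)} \left\|S(x) - \nabla_x \log q_{\phi^*}(x)\right\|_2^2 + \varepsilon, \end{aligned}$$

where $\phi^* := \arg\min_{\phi \in \Omega} \left\{\mathbb{E}_{x \sim q_\phi(x)} \left\|S(x) - \nabla_x \log q_\phi(x)\right\|_2^2\right\}$. The first inequality follows from the definition of $\hat{\phi}$ as a minimizer, and the last inequality is due to the fact that $f_{\psi(\phi)}$ is $\varepsilon$-accurate and $\|R_{\phi^*}(x)\|_2^2$ is non-negative.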
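The rewriting of the objective inside the arg min in the proof relies on completing the square. Spelled out for a generic critic $f$, with the shorthand $a(x)$ introduced here only for this step:

```latex
\begin{aligned}
a(x) &:= S(x) - \nabla_x \log q_\phi(x), \\
2 f(x)^T a(x) - \|f(x)\|_2^2
&= \|a(x)\|_2^2 - \left(\|a(x)\|_2^2 - 2 f(x)^T a(x) + \|f(x)\|_2^2\right) \\
&= \|a(x)\|_2^2 - \|a(x) - f(x)\|_2^2 .
\end{aligned}
```

Taking $f = f_{\psi(\phi)}$ gives $a(x) - f(x) = R_\phi(x)$, which is the residual defined in Eq. 12; the same identity shows that the maximum over $f$ is attained at $f = a$.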

D COMPUTATIONAL ISSUES ON UIVI

Unlike SIVI, which samples from the prior $q(z)$, UIVI samples from the posterior distribution $q(z|x)$, which provides an unbiased gradient estimate of the exact ELBO (Titsias & Ruiz, 2019). However, UIVI requires computing the gradient of $\log q(z|x)$ during the iterations of the HMC sampling procedure. If UIVI uses a minibatch of $m$ data points $x_1, x_2, \dots, x_m$ in the training process, it needs to compute the Jacobian matrix $[\nabla_z \log q(z|x_1), \nabla_z \log q(z|x_2), \dots, \nabla_z \log q(z|x_m)]$, which is not scalable for automatic differentiation using backpropagation. Therefore, we set the batch size $m = 1$ for UIVI as done in Titsias & Ruiz (2019).

E DENSITIES FOR THE TOY EXAMPLES

Table 2 : Synthetic target distributions used in the toy experiments.

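Name | $p(x)$ | Parameters
--- | --- | ---
Banana-shaped | $x = (v_1,\ v_1^2 + v_2 + 1)$, $v \sim \mathcal{N}(0, \Sigma)$ | $\Sigma = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}$
Multimodal | $\frac{1}{2}\mathcal{N}(x|\mu_1, I) + \frac{1}{2}\mathcal{N}(x|\mu_2, I)$ | $-\mu_1 = \mu_2 = [2, 0]^T$
X-shaped | $\frac{1}{2}\mathcal{N}(x|0, \Sigma_1) + \frac{1}{2}\mathcal{N}(x|0, \Sigma_2)$ | $\Sigma_1 = \begin{bmatrix} 2 & 1.8 \\ 1.8 & 2 \end{bmatrix}$, $\Sigma_2 = \begin{bmatrix} 2 & -1.8 \\ -1.8 & 2 \end{bmatrix}$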
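SIVI-SM only needs the score $S(x) = \nabla_x \log p(x)$ of each target. For the banana-shaped distribution, the score follows from inverting the map $v \mapsto x$, whose Jacobian determinant is 1. The sketch below assumes the definition in Table 2 with correlation 0.9; the function names are illustrative and not taken from the paper's code.

```python
def banana_log_density(x, rho=0.9):
    """Unnormalized log p(x) for x = (v1, v1^2 + v2 + 1), v ~ N(0, [[1, rho], [rho, 1]])."""
    v1 = x[0]
    v2 = x[1] - x[0] ** 2 - 1.0          # inverse map; its Jacobian determinant is 1
    quad = (v1 * v1 - 2.0 * rho * v1 * v2 + v2 * v2) / (1.0 - rho * rho)
    return -0.5 * quad

def banana_score(x, rho=0.9):
    """S(x) = grad_x log p(x), obtained by the chain rule through the inverse map."""
    v1 = x[0]
    v2 = x[1] - x[0] ** 2 - 1.0
    det = 1.0 - rho * rho
    g1 = -(v1 - rho * v2) / det          # d log N(v) / d v1
    g2 = -(v2 - rho * v1) / det          # d log N(v) / d v2
    # dv1/dx1 = 1, dv2/dx1 = -2 * x1, dv1/dx2 = 0, dv2/dx2 = 1
    return [g1 - 2.0 * x[0] * g2, g2]

point = [0.7, 1.9]
print(banana_score(point))
```

The two Gaussian mixtures in the table can be handled analogously by differentiating the log of the mixture density.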

H CONVERGENCE PERFORMANCE OF SIVI-SM

Here, we demonstrate the convergence behavior of SIVI-SM in our experiments. For the toy examples in Section 4.1, we use 500 samples from $q_\phi(x)$ to form the Monte Carlo estimates of the loss in Eq. 10 (SM loss) and of the squared $L_2$-norm $\mathbb{E}_{q_\phi(x)} \|f_\psi(x)\|_2^2$ of the function $f_\psi$ (fnet's norm) during the training process. Figure 6 shows the estimated SM loss and fnet's norm as a function of the number of iterations for the three synthetic toy distributions. Similarly, Figure 7 shows the convergence traces of the SM loss and fnet's norm in the experiments in Section 4.3. Note that although the dimensions of the posterior distributions are high, i.e., 7850 for MNIST and 6744 for HAPT, the corresponding fnet's norms can be quite low (58.510 for MNIST

I SECONDS PER ITERATION IN FIGURE 4

The following table shows the run times of different methods per iteration on an RTX2080 GPU. We see that the run times for SIVI and SIVI-SM are comparable with the chosen pairs of batch sizes (i.e., 10 vs 100 and 20 vs 200). As discussed before, the inner-loop HMC iterations make UIVI slower than the other methods.

J EXPERIMENT SETTING FOR BAYESIAN NEURAL NETWORKS

For SIVI, we set $L = 100$ and use a batch size of $m = 10$ during training. For UIVI, the HMC inner-loop settings are similar to those in Section 4.3. For SGLD, we choose the step size from $\{10^{-4}, 10^{-5}, 10^{-6}\}$ and the number of iterations from $\{50000, 100000\}$ by validation during training, using 100 particles. For SIVI-SM, we choose the number of inner-loop gradient steps $K \in \{1, 3\}$ by validation and run 20,000 training iterations.

K RELATED METHODS

Given a class of test functions $\mathcal{F}$, the Stein discrepancy (Gorham & Mackey, 2015) between $p$ and $q$ is defined as

$$S(q, p) = \sup_{f \in \mathcal{F}} \mathbb{E}_{q(x)} \left[ \nabla_x \log p(x)^T f(x) + \mathrm{Tr}(\nabla_x f(x)) \right]. \tag{20}$$

This measure is based on Stein's identity (Stein, 1972):

$$\mathbb{E}_{q(x)} \left[ \nabla_x \log q(x)^T f(x) + \mathrm{Tr}(\nabla_x f(x)) \right] = 0. \tag{21}$$

An early use of the Stein discrepancy in variational inference is operator variational inference (OPVI) (Ranganath et al., 2016), which constructs a variational operator (e.g., the Langevin-Stein operator $O^p_{LS}$) objective (see footnote 7):

$$\mathcal{L}(q, O^p_{LS}, \mathcal{F}) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{q(x)} \left[ \nabla_x \log p(x)^T f(x) + \mathrm{Tr}(\nabla_x f(x)) \right] \right)^2.$$

OPVI then solves the resulting minimax optimization problem jointly over $q$ and $f$. Unlike OPVI, the learned Stein discrepancy (LSD) (Grathwohl et al., 2020) replaces the constraint on $\mathcal{F}$ with an $L_2$ regularization term:

$$\mathcal{L}_{LSD} = \sup_{f} \mathbb{E}_{q(x)} \left[ \nabla_x \log p(x)^T f(x) + \mathrm{Tr}(\nabla_x f(x)) - \lambda \|f(x)\|_2^2 \right]. \tag{22}$$

In fact, the variational objective of SIVI-SM in Eq. 8 can be viewed as $\mathcal{L}_{LSD}$. Substituting Eq. 21 into Eq. 22 and letting $\lambda = \frac{1}{2}$, we have

$$\begin{aligned} \mathcal{L}_{LSD} &= \sup_{f} \mathbb{E}_{q(x)} \left[ f(x)^T \left( \nabla_x \log p(x) - \nabla_x \log q(x) \right) - \frac{1}{2} \|f(x)\|_2^2 \right] \\ &= \frac{1}{2} \mathbb{E}_{q(x)} \left\| \nabla_x \log p(x) - \nabla_x \log q(x) \right\|_2^2, \end{aligned}$$

where the supremum is attained pointwise at $f^*(x) = \nabla_x \log p(x) - \nabla_x \log q(x)$. However, OPVI and LSD both involve the $\mathrm{Tr}(\nabla_x f(x))$ term, which is not easy to compute for high-dimensional problems. Our method goes a step further by exploiting the hierarchical structure of $q(x)$ in Eq. 1. Using a technique similar to denoising score matching, we arrive at a formulation that scales easily to high dimensions.

Published as a conference paper at ICLR 2023

L ON THE BIASEDNESS OF THE MC GRADIENT ESTIMATE OF THE FISHER DIVERGENCE IN EQ. 7



1. This assumption is quite general and holds for many classical distributions that are commonly used as conditionals, such as Gaussian and other exponential family distributions.
2. Here σ is a vector with the same dimension as x.
3. https://archive.ics.uci.edu/ml/machine-learning-databases/waveform/
4. http://yann.lecun.com/exdb/mnist/
5. http://archive.ics.uci.edu/ml/machine-learning-databases/00341/
6. See a more detailed explanation in Appendix D.
7. The objective is similar to the definition of the Stein measure in (Liu et al., 2016).



Figure 1: Quiver plots of $f_\psi(x)$ and samples from the variational posteriors during the training process on the X-shaped distribution.

Figure 2: Comparison of the marginal and pairwise joint posteriors. The contours of the marginal and pairwise empirical densities trained by the three semi-implicit variational inference algorithms, i.e. SIVI-SM (orange), SIVI (blue) and UIVI (green), are plotted against the ground truth (black).

Figure 3: Scatter plot comparison of the sample covariances of the posterior. The x-axis and y-axis represent the estimates from the ground truth MCMC runs and from the corresponding SIVI variants, respectively. The red lines are the regression lines.

the model parameter and follows a standard Gaussian prior. Following Titsias & Ruiz (2019), we used two data sets: MNIST (footnote 4) and HAPT (footnote 5). MNIST is a commonly used machine learning dataset that contains 60,000 training and 10,000 test instances of 28×28 images of handwritten digits from R = 10 classes. HAPT (Reyes-Ortiz et al., 2016) is a human activity recognition dataset. It contains 7,767 training and 3,162 test data points, each a 561-dimensional feature vector of measurements captured by inertial sensors, labeled with one of R = 12 classes of static postures, dynamic activities, and postural transitions. The dimensions of the posterior distributions are 7,850 (MNIST) and 6,744 (HAPT), respectively.
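The stated posterior dimensions are consistent with a multinomial logistic regression that has one weight per (feature, class) pair plus one bias per class. The short check below verifies this arithmetic; the parameterization and the function name `softmax_regression_dim` are assumptions made for illustration, since this passage does not spell them out.

```python
def softmax_regression_dim(num_features, num_classes):
    # Assumption: one weight per (feature, class) pair plus one bias per class.
    return (num_features + 1) * num_classes


mnist_dim = softmax_regression_dim(28 * 28, 10)  # 784 pixel features, R = 10
hapt_dim = softmax_regression_dim(561, 12)       # 561 sensor features, R = 12
print(mnist_dim, hapt_dim)  # prints "7850 6744"
```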

Figure 4: Estimates of the test log-likelihood for the Bayesian multinomial logistic regression model. The numbers in parentheses specify the batch sizes used for training.

Figure 6: The training loss and $L_2$-norm of $f_\psi(x)$. $\mathbb{E}_{q_\phi(x)} \|f_\psi(x)\|_2^2$ (fnet's norm) and the training objective loss (SM loss) in Eq. 10 are both estimated using 500 samples.
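A minimal sketch of how a 500-sample Monte Carlo estimate of fnet's norm, $\mathbb{E}_{q_\phi(x)} \|f_\psi(x)\|_2^2$, might be formed. The sampler and the function below are hypothetical stand-ins (a 2-D standard Gaussian and the identity map), not the trained $q_\phi$ and $f_\psi$ from the experiments; with these stand-ins the exact value is 2.

```python
import random


def mc_sq_norm(f, sampler, rng, num_samples=500):
    """Monte Carlo estimate of E_q ||f(x)||_2^2 from num_samples draws of q."""
    total = 0.0
    for _ in range(num_samples):
        fx = f(sampler(rng))
        total += sum(c * c for c in fx)
    return total / num_samples


def toy_sampler(rng):
    # Stand-in for sampling x ~ q_phi(x): a 2-D standard Gaussian (assumption).
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


def toy_f(x):
    # Stand-in for f_psi: the identity map (assumption).
    return x


rng = random.Random(0)  # arbitrary seed
estimate = mc_sq_norm(toy_f, toy_sampler, rng)
print(estimate)  # should be close to the exact value 2 for these stand-ins
```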

Figure 7: Loss convergence for MNIST and HAPT. The loss trace has been smoothed with a rolling window of size 5.
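The smoothing mentioned in the caption of Figure 7 can be sketched as a moving average with a window of 5. The caption does not say whether the window is trailing or centered, so the trailing variant below is an assumption, and `rolling_mean` is a hypothetical helper name.

```python
from collections import deque


def rolling_mean(values, window=5):
    """Trailing moving average; the first window-1 outputs average fewer points."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # oldest value is evicted by the append below
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


smoothed = rolling_mean([1, 2, 3, 4, 5, 6])
print(smoothed)  # prints [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
```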

Figure 8: Quiver plots of $f_\psi(x)$ and samples from the variational posteriors during the training process on the multimodal Gaussian distribution. Top: SIVI-SM. Bottom: Biased gradient estimates.

Averaged test RMSE and test negative log-likelihood of Bayesian neural networks on seven UCI datasets. The results were averaged over 10 independent runs.

KL divergence from the target to the variational posteriors. The results were averaged over 5 independent runs, with one standard deviation in parentheses.

Seconds per iteration for MNIST and HAPT.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (grant no. 12201014). The research of Cheng Zhang was supported in part by the Key Laboratory of Mathematics and Its Applications (LMAM) and the Key Laboratory of Mathematical Economics and Quantitative Finance (LMEQF) of Peking University. The authors are grateful for the computational resources provided by the Megvii institute. The authors thank the anonymous ICLR reviewers for their constructive feedback.

