PARTICLE-BASED VARIATIONAL INFERENCE WITH PRECONDITIONED FUNCTIONAL GRADIENT FLOW

Abstract

Particle-based variational inference (VI) minimizes the KL divergence between model samples and the target posterior with gradient flow estimates. With the popularity of Stein variational gradient descent (SVGD), particle-based VI algorithms have focused on the properties of functions in a reproducing kernel Hilbert space (RKHS) to approximate the gradient flow. However, the RKHS requirement restricts the function class and algorithmic flexibility. This paper offers a general solution to this problem by introducing a functional regularization term that encompasses the RKHS norm as a special case. This allows us to propose a new particle-based VI algorithm called preconditioned functional gradient flow (PFG). Compared to SVGD, PFG has several advantages: a larger function class, improved scalability for large particle sizes, better adaptation to ill-conditioned distributions, and provable continuous-time convergence in KL divergence. Additionally, non-linear function classes such as neural networks can be incorporated to estimate the gradient flow. Our theory and experiments demonstrate the effectiveness of the proposed framework.

1. INTRODUCTION

Sampling from an unnormalized density is a fundamental problem in machine learning and statistics, especially for posterior sampling. Markov chain Monte Carlo (MCMC) (Welling & Teh, 2011; Hoffman et al., 2014; Chen et al., 2014) and variational inference (VI) (Ranganath et al., 2014; Jordan et al., 1999; Blei et al., 2017) are two mainstream solutions: MCMC is asymptotically unbiased but sample-intensive; VI is computationally efficient but usually biased. Recently, particle-based VI algorithms (Liu & Wang, 2016; Detommaso et al., 2018; Liu et al., 2019) minimize the Kullback-Leibler (KL) divergence between particle samples and the posterior, and absorb the advantages of both MCMC and VI: (1) non-parametric flexibility and asymptotic unbiasedness; (2) sample efficiency due to the interaction between particles; (3) deterministic updates. Thus, these algorithms are competitive in sampling tasks such as Bayesian inference (Liu & Wang, 2016; Feng et al., 2017; Detommaso et al., 2018) and probabilistic models (Wang & Liu, 2016; Pu et al., 2017). Given a target distribution $p_*(x)$, particle-based VI aims to find $g(t, x)$ so that, starting with $X_0 \sim p_0$, the distribution $p(t, x)$ of the dynamics $dX_t = g(t, X_t)\,dt$ converges to $p_*(x)$ as $t \to \infty$. By the continuity equation (Jordan et al., 1998), we can capture the evolution of $p(t, x)$ by

$$\frac{\partial p(t, x)}{\partial t} = -\nabla \cdot \left(p(t, x)\, g(t, x)\right). \quad (1)$$

To measure the "closeness" between $p(t, \cdot)$ and $p_*$, we typically adopt the KL divergence,

$$D_{\mathrm{KL}}(t) = \int p(t, x) \ln \frac{p(t, x)}{p_*(x)}\, dx. \quad (2)$$

Using the chain rule and integration by parts, we have

$$\frac{dD_{\mathrm{KL}}(t)}{dt} = -\int p(t, x)\left[\nabla \cdot g(t, x) + g(t, x)^\top \nabla_x \ln p_*(x)\right] dx, \quad (3)$$

which captures the evolution of the KL divergence. To minimize the KL divergence, one needs to define a "gradient" to update the particles as our $g(t, x)$.
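The identity in Eq. (3) can be checked numerically. The sketch below (our own illustration, not from the paper) takes $p_0 = N(0, 1)$, $p_* = N(2, 1)$, and the constant field $g(x) = 2$, which equals $\nabla \ln(p_*/p_0)$ here, and compares the right-hand side of Eq. (3) against a finite difference of the closed-form KL divergence along the flow:

```python
import numpy as np

# p_t = N(mu_t, 1), p* = N(2, 1); the flow dX_t = g(X_t) dt with the constant
# field g(x) = 2 shifts the mean: mu_t = 2 t, so D_KL(t) = (2 - 2 t)^2 / 2.
def kl(mu_t, mu_star=2.0):
    return 0.5 * (mu_star - mu_t) ** 2   # KL between N(mu_t, 1) and N(mu_star, 1)

# Right-hand side of Eq. (3): -E_{p_t}[ g'(x) + g(x) * d/dx ln p*(x) ]
# with g(x) = 2, g'(x) = 0 and d/dx ln p*(x) = -(x - 2), estimated at t = 0.
xs = np.random.default_rng(0).normal(0.0, 1.0, 200_000)   # samples from p_0
rhs = -np.mean(0.0 + 2.0 * (-(xs - 2.0)))

# Finite difference of D_KL along the flow at t = 0.
h = 1e-5
lhs = (kl(2.0 * h) - kl(0.0)) / h

print(rhs, lhs)   # both close to -4
```

Both quantities come out near $-4$, which is exactly $-E_{p_0}\|\nabla \ln(p_*/p_0)\|^2$ in this example.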
The most standard approach, the Wasserstein gradient (Ambrosio et al., 2005), defines a gradient for $p(t, x)$ in the Wasserstein space, which contains probability measures with bounded second moments. In particular, for any functional $L$ that maps a probability density $p(t, x)$ to a non-negative scalar, we say that the particle density $p(t, x)$ follows the Wasserstein gradient flow of $L$ if $g(t, x)$ is the gradient field of the $L^2(\mathbb{R}^d)$-functional derivative of $L$ (Villani, 2009). For the KL divergence, the solution is $\nabla \ln \frac{p_*(x)}{p(t, x)}$. However, the computation of this deterministic and time-inhomogeneous Wasserstein gradient is non-trivial. It is necessary to restrict the function class of $g(t, x)$ to obtain a tractable form. Stein variational gradient descent (SVGD) is the most popular particle-based algorithm, and provides a tractable form to update particles with the kernelized gradient flow (Chewi et al., 2020; Liu, 2017). It updates particles by minimizing the KL divergence with a functional gradient measured in an RKHS. By restricting the functional gradient to a bounded RKHS norm, it admits an explicit formulation: $g(t, x)$ can be obtained by minimizing Eq. (3). Nonetheless, there are still some limitations due to the restriction to an RKHS: (1) the expressive power is limited because kernel methods are known to suffer from the curse of dimensionality (Geenens, 2011); (2) with $n$ particles, an $O(n^2)$ computational overhead for the kernel matrix is required. Further, we identify another crucial limitation of SVGD: the kernel design is highly non-trivial. Even in the simple Gaussian case, where particles start from $N(0, I)$ and $p_* = N(\mu_*, \Sigma_*)$, commonly used kernels such as the linear and RBF kernels have fundamental drawbacks in the SVGD algorithm (Example 1). Our motivation originates from functional gradient boosting (Friedman, 2001; Nitanda & Suzuki, 2018; Johnson & Zhang, 2019). For each $p(t, x)$, we find a proper function $g(t, x)$ in the function class $\mathcal{F}$ to minimize Eq. (3).
In this context, we design a regularizer for the functional gradient to approximate variants of the "gradient" explicitly. We propose a regularization family to penalize the output of the particle distribution's functional gradient. For well-conditioned $-\nabla^2 \ln p_*$, we can approximate the Wasserstein gradient directly; for ill-conditioned $-\nabla^2 \ln p_*$, we can adapt our regularizer to approximate a preconditioned one. Thus, our functional gradient is an approximation to the preconditioned Wasserstein gradient. Regarding the function space, we do not restrict the function to an RKHS. Instead, we can use non-linear function classes such as neural networks to obtain better approximation capacity. The flexibility of the function space can lead to a better sampling algorithm, which is supported by our empirical results. Contributions. We present a novel particle-based VI framework that incorporates functional gradient flow with general regularizers. We leverage a special family of regularizers to approximate the preconditioned Wasserstein gradient flow, which proves to be more effective than SVGD. The functional gradient in our framework explicitly approximates the preconditioned Wasserstein gradient, making it well-suited to handle ill-conditioned cases and delivering provable convergence rates. Additionally, our proposed algorithm eliminates the need for the computationally expensive $O(n^2)$ kernel matrix, resulting in increased computational efficiency for larger particle sizes. Both theoretical and empirical results demonstrate the superior performance of our framework and proposed algorithm.

2. ANALYSIS

Notations. In this paper, we use $x$ to denote particle samples in $\mathbb{R}^d$. The distributions are assumed to be absolutely continuous w.r.t. the Lebesgue measure. The probability density function of the posterior is denoted by $p_*$. $p(t, x)$ (or $p_t$) refers to the particle distribution at time $t$. For a scalar function $p(t, x)$, $\nabla_x p(t, x)$ denotes its gradient w.r.t. $x$. For a vector function $g(t, x)$, $\nabla_x g(t, x)$, $\nabla_x \cdot g(t, x)$, and $\nabla_x^2 g(t, x)$ denote its Jacobian matrix, divergence, and Hessian w.r.t. $x$. We let $g(t, x)$ belong to a vector-valued function class $\mathcal{F}$, and find the best functional gradient direction. Inspired by the gradient boosting algorithm for regression and classification problems, we approximate the gradient flow by a function $g(t, x) \in \mathcal{F}$ with a regularization term, which solves the following minimization formulation:

$$g(t, x) = \arg\min_{f \in \mathcal{F}} \left\{ -\int p(t, x)\left[\nabla \cdot f(x) + f(x)^\top \nabla \ln p_*(x)\right] dx + Q(f) \right\}, \quad (4)$$

where $Q(\cdot)$ is a regularization term that limits the output magnitude of $f$. This regularization term also implicitly determines the underlying "distance metric" used to define the gradient estimates $g(t, x)$ in our framework. When $Q(f) = \frac{1}{2}\int p(t, x)\|f(x)\|^2 dx$, Eq. (4) is equivalent to

$$g(t, x) = \arg\min_{f \in \mathcal{F}} \frac{1}{2}\int p(t, x)\left\| f(x) - \nabla \ln \frac{p_*(x)}{p_t(x)} \right\|^2 dx. \quad (5)$$

If $\mathcal{F}$ is well-specified, i.e., $\nabla \ln \frac{p_*(x)}{p_t(x)} \in \mathcal{F}$, we have $g(t, x) = \nabla \ln \frac{p_*(x)}{p_t(x)}$, which is the direction of the Wasserstein gradient flow. Interestingly, despite the computational intractability of the Wasserstein gradient, Eq. (4) provides a tractable variational approximation.

[Figure 2: Evolution of the particle distribution from $N([0, 0]^\top, I)$ to $N([20, 20]^\top, \mathrm{diag}(100, 1))$; first row: evolution of the particle mean $\mu_t$; second row: particle distribution $p(5, x)$ at $t = 5$. Panels: (a) $Q(f) = \frac{1}{2}\|f\|^2_{\mathcal{H}^d}$ (linear); (b) $Q(f) = \frac{1}{2}\|f\|^2_{\mathcal{H}^d}$ (RBF); (c) $Q(f) = \frac{1}{2}E_{p_t}\|f\|^2$; (d) $Q(f) = \frac{1}{2}E_{p_t}\|f\|^2_{\Sigma_*^{-1}}$; (e) optimal transport.]
For an RKHS, we will show that SVGD is a special case of Eq. (4) with $Q(f) = \frac{1}{2}\|f\|^2_{\mathcal{H}^d}$ (Sec. 3.2.1). We point out that the RKHS norm usually fails to regularize the functional gradient properly, since it is fixed for any $p(t, x)$. Our next example shows that SVGD is a suboptimal solution in the Gaussian case. Example 1. Consider that $p(t, \cdot)$ is $N(\mu_t, \Sigma_t)$ and $p_*$ is $N(\mu_*, \Sigma_*)$. We consider the SVGD algorithm with linear kernel and RBF kernel, and the regularized functional gradient formulation with $Q(f) = \frac{1}{2}E_{p_t}\|f\|^2$ and with $Q(f) = \frac{1}{2}E_{p_t}\|f\|^2_{\Sigma_*^{-1}}$. Starting with $N(0, I)$, Fig. 1 plots $g(0, x)$ for the different $Q$ and the optimal transport direction; the path of $\mu_t$ and $p(5, x)$ are illustrated in Fig. 2. The detailed mathematical derivations and analytical results are provided in Appendix A.2. Example 1 compares different regularizations for the functional gradient. For the RKHS norm, we consider the most commonly used kernels: linear and RBF. Fig. 1 shows $g(0, x)$ for the different regularizers: only $Q(f) = \frac{1}{2}E_{p_t}\|f\|^2_{\Sigma_*^{-1}}$ approximates the optimal transport direction, while the other choices of $g(0, x)$ deviate significantly. SVGD with the linear kernel underestimates the gradient along large-variance directions; SVGD with the RBF kernel even suffers from gradient vanishing in low-density areas. Fig. 2 demonstrates the path of $\mu_t$ under the different regularizers. For the linear kernel, due to the curl component (the $\mu_* \mu_t^\top$ term in linear SVGD is non-symmetric; see Appendix for details), $p(5, x)$ is rotated by an angle. For the RBF kernel, the functional gradient of SVGD is not linear, leading to slow convergence. The $L^2$ regularizer is suboptimal due to the ill-conditioned $\Sigma_*$. We can see that $Q(f) = \frac{1}{2}E_{p_t}\|f\|^2_{\Sigma_*^{-1}}$ produces the optimal path for $\mu_t$ (the line segment between $\mu_0$ and $\mu_*$).
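The mean dynamics in Example 1 for the two non-kernel regularizers can be reproduced with a few lines of numerical integration (a sketch using the closed-form mean ODEs derived in Appendix A.2; the step size and horizon are illustrative choices):

```python
import numpy as np

# Mean ODEs from Example 1 (cases (3) and (4) in Appendix A.2), Euler-discretized.
mu_star = np.array([20.0, 20.0])
Sigma_star_inv = np.diag([1 / 100.0, 1.0])    # Sigma_* = diag(100, 1), ill-conditioned

def simulate(precondition, steps=5000, dt=1e-3):
    mu = np.zeros(2)
    path = [mu.copy()]
    for _ in range(steps):
        if precondition:                       # Q(f) = 0.5 E ||f||^2_{Sigma_*^{-1}}
            mu = mu + dt * (mu_star - mu)      # d mu / dt = mu_* - mu
        else:                                  # Q(f) = 0.5 E ||f||^2  (L2)
            mu = mu + dt * Sigma_star_inv @ (mu_star - mu)
        path.append(mu.copy())
    return np.array(path)

l2_path = simulate(precondition=False)
pfg_path = simulate(precondition=True)

# The preconditioned path stays on the segment from (0, 0) to (20, 20):
# both coordinates move at the same rate, so mu_1 == mu_2 along the path.
print(np.max(np.abs(pfg_path[:, 0] - pfg_path[:, 1])))
# The L2 path is bent: the small-variance coordinate converges much faster.
print(l2_path[-1])
```

The preconditioned path keeps both coordinates equal, i.e., it stays on the line segment from $\mu_0$ to $\mu_*$, while under the $L^2$ regularizer the small-variance coordinate converges first and the large-variance coordinate lags far behind.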

2.1. GENERAL REGULARIZATION

Inspired by the Gaussian case, we consider the general form

$$Q(f) = \frac{1}{2}\int p(t, x)\|f(x)\|_H^2\, dx, \quad (6)$$

where $H$ is a symmetric positive definite matrix. Proposition 1 shows that with Eq. (6), the resulting functional gradient is an approximation of the preconditioned Wasserstein gradient flow. It also implies the well-definedness and existence of $g(t, x)$ when $\mathcal{F}$ is closed, since $\|\cdot\|_H$ is lower bounded by 0. Proposition 1. Consider the case that $Q(f) = \frac{1}{2}\int p(t, x)\|f(x)\|_H^2\, dx$, where $H$ is a symmetric positive definite matrix. Then the functional gradient defined by Eq. (4) is equivalent to

$$g(t, x) = \arg\min_{f \in \mathcal{F}} \frac{1}{2}\int p(t, x)\left\| f(x) - H^{-1}\nabla \ln \frac{p_*(x)}{p(t, x)} \right\|_H^2 dx. \quad (7)$$

Remark. Our regularizer is similar to the Bregman divergence in mirror descent. We can further extend $Q(\cdot)$ with a convex function $h(\cdot): \mathbb{R} \to [0, \infty)$, where the regularizer is defined by $Q(f) = E_{p_t}\, h(\|f(x)\|)$. We can adapt to more complex geometry in our framework with a proper $h(\cdot)$.
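Pointwise, the objective in Eq. (4) with the regularizer (6) is a quadratic in $f(x)$, and Proposition 1 amounts to completing the square. A small numerical check (an arbitrary SPD matrix $H$ and a random vector `s` standing in for $\nabla \ln(p_*/p_t)$ at a fixed $x$; both are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
H = A @ A.T + 3 * np.eye(3)          # a symmetric positive definite preconditioner
s = rng.normal(size=3)               # stands in for grad ln(p*/p_t) at one point x

# Pointwise, Eq. (4) minimizes  -f.s + 0.5 ||f||_H^2  over f; stationarity
# gives H f = s, i.e. f* = H^{-1} s, the preconditioned gradient of Eq. (7).
f_star = np.linalg.solve(H, s)

def objective(f):
    return -f @ s + 0.5 * f @ H @ f

# Completing the square:
#   -f.s + 0.5||f||_H^2 = 0.5||f - H^{-1}s||_H^2 - 0.5||H^{-1}s||_H^2
for _ in range(100):
    f = rng.normal(size=3)
    d = f - f_star
    assert abs(objective(f) - (0.5 * d @ H @ d - 0.5 * f_star @ H @ f_star)) < 1e-9
print("objectives match; minimizer is H^{-1} s")
```

So minimizing Eq. (4) over a rich enough class recovers $H^{-1}\nabla \ln(p_*/p_t)$ pointwise, which is exactly the preconditioned Wasserstein gradient of Eq. (7).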

2.2. CONVERGENCE ANALYSIS

Equilibrium condition. To provide theoretical guarantees, we show that the stationary distribution of our $g(t, x)$ update is $p_*$. Meanwhile, the evolution of the KL divergence is well-defined (without explosion of the functional gradient) and descending. We list the regularity conditions below. [A1] (Regular function class) The discrepancy induced by the function class $\mathcal{F}$ is positive definite: for any $q \neq p$, there exists $f \in \mathcal{F}$ such that $E_q[\nabla \cdot f(x) + f(x)^\top \nabla \ln p(x)] > 0$. For $f \in \mathcal{F}$ and $c \in \mathbb{R}$, $cf \in \mathcal{F}$, and $\mathcal{F}$ is closed. The tail of $f$ is regular: $\lim_{\|x\|\to\infty} f(x)\, p_*(x) = 0$ for $f \in \mathcal{F}$. [A2] ($L$-smoothness) For any $x \in \mathbb{R}^d$, $p_*(x) > 0$ and $p(t, x) > 0$ are $L$-smooth densities, i.e., $\|\nabla \ln p(x) - \nabla \ln p(y)\| \leq L\|x - y\|$, with $E_p\|x\|^2 < \infty$. In particular, [A1] is similar to the positive definite kernel in an RKHS, and guarantees the decrease of the KL divergence. [A2] is designed to make the gradient well-defined: the RHS of Eq. (3) is finite. Proposition 2. Under [A1], [A2], when we update $X_t$ as in Eq. (7), we have $-\infty < \frac{dD_{\mathrm{KL}}}{dt} < 0$ for all $p(t, x) \neq p_*(x)$; i.e., $g(t, x) = 0$ if and only if $p(t, x) = p_*(x)$. Proposition 2 shows that the continuous dynamics of our $g(t, x)$ is well-defined and that the KL divergence along the dynamics is descending. The only stationary distribution for $p(t, x)$ is $p_*(x)$. Convergence rate. The convergence rate of our framework mainly depends on (1) the capacity of the function class $\mathcal{F}$ and (2) the complexity of $p_*$. In this section, we show that when the approximation error is small and the target $p_*$ is log-Sobolev, the KL divergence converges linearly. [A3] ($\epsilon$-approximation) For any $t > 0$, there exist $f_t(x) \in \mathcal{F}$ and $\epsilon < 1$ such that

$$\int p(t, x)\left\| f_t(x) - H^{-1}\nabla \ln \frac{p_*(x)}{p(t, x)} \right\|_H^2 dx \leq \epsilon \int p(t, x)\left\| \nabla \ln \frac{p(t, x)}{p_*(x)} \right\|_{H^{-1}}^2 dx.$$

[A4] The target $p_*$ satisfies the $\mu$-log-Sobolev inequality ($\mu > 0$): for any differentiable function $g$, we have

$$E_{p_*}\left[g^2 \ln g^2\right] - E_{p_*}\left[g^2\right] \ln E_{p_*}\left[g^2\right] \leq \frac{2}{\mu} E_{p_*}\|\nabla g\|^2.$$
Specifically, [A3] controls the error of the gradient approximation. By the universal approximation theorem (Hornik et al., 1989), any continuous function can be approximated by neural networks, which indicates the existence of such a function class. [A4] is a common assumption in the sampling literature, and is more general than the strong log-concavity assumption (Vempala & Wibisono, 2019). Theorem 1. Under [A1]-[A4], for any $t > 0$, assume that the largest eigenvalue satisfies $\lambda_{\max}(H) = m$. Then we have

$$\frac{dD_{\mathrm{KL}}(t)}{dt} \leq -\frac{1-\epsilon}{2}\, E_{p_t}\left\| \nabla \ln \frac{p_t(x)}{p_*(x)} \right\|_{H^{-1}}^2 \quad \text{and} \quad D_{\mathrm{KL}}(t) \leq \exp\left(-(1-\epsilon)\mu t/m\right) D_{\mathrm{KL}}(0).$$

Theorem 1 shows that when $p_*$ is log-Sobolev, our algorithm achieves a linear convergence rate with a function class powerful enough to approximate the preconditioned Wasserstein gradient. Moreover, for the discrete algorithm, we have to make sure that the step size is proper, i.e., $\|H^{-1}\nabla^2 \ln p_*(x)\| \leq c$ for some constant $c > 0$. Thus, the preconditioned choice $H = -c\nabla^2 \ln p_*$ is better than the plain one $H = cLI$, since $(-\nabla^2 \ln p_*)^{-1} \succeq L^{-1} I$. For better understanding, we provide a discretized proof for the Gaussian case in Appendix A.6 to illustrate this phenomenon.
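The rate in Theorem 1 can be observed directly in a one-dimensional Gaussian example where the functional gradient is exact ($\epsilon = 0$, $H = I$; our illustrative setup with $p_* = N(0, 1)$, which satisfies the log-Sobolev inequality with $\mu = 1$, so the predicted decay is at least $e^{-2t}$):

```python
import numpy as np

# For p* = N(0, 1) and Gaussian p_t = N(m_t, s_t^2), the exact Wasserstein
# gradient g(t, x) = (x - m_t)/s_t^2 - x yields the moment ODEs
#   dm/dt = -m,   d(s^2)/dt = 2 (1 - s^2).
def kl(m, s2):   # KL( N(m, s2) || N(0, 1) )
    return 0.5 * (m**2 + s2 - 1.0 - np.log(s2))

dt, T = 1e-4, 2.0
m, s2 = 3.0, 4.0
kl0 = kl(m, s2)
for _ in range(int(T / dt)):          # Euler integration of the moment ODEs
    m += dt * (-m)
    s2 += dt * 2.0 * (1.0 - s2)
ratio = kl(m, s2) / kl0
print(ratio, np.exp(-2 * T))          # observed decay beats e^{-2 T}
```

The observed KL ratio is below $e^{-2T}$, consistent with the linear rate (the variance term in the KL actually decays faster than the mean term, so the bound is not tight).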

3. PFG: PRECONDITIONED FUNCTIONAL GRADIENT FLOW

Algorithm 1 PFG: Preconditioned Functional Gradient Flow
Input: unnormalized target distribution $p_*(x) = e^{-U(x)}$; parametric function $f_\theta(x): \mathbb{R}^d \to \mathbb{R}^d$; initial particles $\{x_0^i\}_{i=1}^n$ and parameters $\theta_0$; iteration numbers $T, T'$; step sizes $\eta, \eta'$; regularization function $h(\cdot)$.
for $t = 1, \ldots, T$ do
    Assign $\theta_t^0 = \theta_{t-1}$;
    for $t' = 1, \ldots, T'$ do
        Compute $\hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^n \left[ h(f_\theta(x_{t-1}^i)) + f_\theta(x_{t-1}^i) \cdot \nabla U(x_{t-1}^i) - \nabla \cdot f_\theta(x_{t-1}^i) \right]$;
        Update $\theta_t^{t'} = \theta_t^{t'-1} - \eta' \nabla \hat{L}(\theta_t^{t'-1})$;
    end
    Assign $\theta_t = \theta_t^{T'}$ and update the particles $x_t^i = x_{t-1}^i + \eta f_{\theta_t}(x_{t-1}^i)$ for all $i = 1, \ldots, n$;
end
Return: optimized particles $\{x_T^i\}_{i=1}^n$.

3.1. ALGORITHM

We realize our algorithm with a parametric $f_\theta$ (such as a neural network) and discretize the update. Parametric function class. We can let $\mathcal{F} = \{f_\theta(x): \theta \in \Theta\}$ and apply $g(t, x) = f_{\theta_t}(x)$, such that

$$\theta_t = \arg\min_{\theta \in \Theta} \int p(t, x)\left[ -\nabla \cdot f_\theta(x) - f_\theta(x)^\top \nabla \ln p_*(x) + \frac{1}{2}\|f_\theta(x)\|_H^2 \right] dx, \quad (8)$$

where $H$ is a symmetric positive definite matrix estimated at time $t$. Eq. (8) is a direct result of Eqs. (4) and (6). The parametric function class allows $f_\theta$ to be optimized by iterative algorithms. Choice of H. Considering the posterior mean trajectory, the problem is equivalent to conventional optimization, so $-\nabla^2 \ln p_*$ is the ideal preconditioner (Newton's method) for implementing the discrete algorithm. We use diagonal Fisher information estimators for efficient computation, as in Adagrad (Duchi et al., 2011), and approximate the preconditioner $H$ for all particles at each time step $t$ by moving averaging. Discrete update. We present our algorithm by discretizing Eqs. (4) and (8). Given $X_0 \sim p_0$, we update $X_k$ as $X_{k+1} = X_k + \eta f_{\theta_k}(X_k)$, where $\theta_k$ is obtained from Eq. (8) with (stochastic) gradient descent. The integral over $p(k, x)$ is estimated by particle samples. The full procedure is presented in Alg. 1, where the regularizer $h$ is $\frac{1}{2}\|\cdot\|_H^2$ by default.
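A compact, self-contained sketch of Algorithm 1 follows. Our simplifications (all assumptions for illustration, not the paper's setup): a *linear* function class $f(x) = Ax + b$ whose divergence is $\mathrm{trace}(A)$, $H = I$, $h = \frac{1}{2}\|\cdot\|^2$, a 2-D Gaussian target, and the inner quadratic solved in closed form in place of the inner SGD loop:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star = np.array([3.0, -2.0])
Sigma_star_inv = np.diag([1 / 4.0, 1.0])          # target N(mu_*, diag(4, 1))

def grad_U(x):                                    # grad U(x) = Sigma_*^{-1} (x - mu_*)
    return (x - mu_star) @ Sigma_star_inv

n, d, eta, T = 500, 2, 0.05, 1500
x = rng.normal(size=(n, d))                       # particles from N(0, I)
for _ in range(T):
    z = np.hstack([x, np.ones((n, 1))])           # z = (x, 1), so f(x) = W z
    u = grad_U(x)
    # Inner problem (Eq. (8)): min_W mean[0.5 ||W z||^2 + (W z).u] - trace(A);
    # stationarity gives W = ([I 0] - E[u z^T]) E[z z^T]^{-1}.
    M = np.hstack([np.eye(d), np.zeros((d, 1))]) - (u.T @ z) / n
    W = M @ np.linalg.inv(z.T @ z / n)
    x = x + eta * z @ W.T                         # particle update: x += eta f(x)

print(x.mean(axis=0))          # close to mu_* = [3, -2]
print(np.cov(x.T))             # close to diag(4, 1)
```

Because the linear class is well-specified for Gaussian targets, the particles' empirical mean and covariance converge to $\mu_*$ and $\Sigma_*$; a neural-network $f_\theta$ with an autodiff divergence replaces the closed-form solve in the general case.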

3.2.1. SVGD FROM A FUNCTIONAL GRADIENT VIEW

For simplicity, we prove the case of a finite-dimensional feature map $\psi(x): \mathbb{R}^d \to \mathbb{R}^h$. We assume that $\mathcal{F} = \{W\psi(x): W \in \mathbb{R}^{d \times h}\}$, and let the kernel be $k(x, y) = \psi(x)^\top \psi(y)$ and $Q(f) = \frac{1}{2}\|f\|^2_{\mathcal{H}^d} = \frac{1}{2}\|W\|_F^2$, where the RKHS norm is the Frobenius norm of $W$. The solution is defined by

$$\hat{W}_t = \arg\min_{W \in \mathbb{R}^{d \times h}} \left\{ -\int p(t, x)\, \mathrm{trace}\left[ W\nabla\psi(x) + W\psi(x)\nabla \ln p_*(x)^\top \right] dx + \frac{1}{2}\|W\|_F^2 \right\},$$

which gives

$$\hat{W}_t = \int p(t, x')\left[ \nabla\psi(x')^\top + \nabla \ln p_*(x')\,\psi(x')^\top \right] dx'.$$

This implies that

$$g(t, x) = \hat{W}_t \psi(x) = \int p(t, y)\left[ \nabla_y k(y, x) + \nabla \ln p_*(y)\, k(y, x) \right] dy,$$

which is equivalent to SVGD. For linear function classes, such as an RKHS, $Q$ can be applied directly to the parameters, such as $W$ here. The regularization of SVGD is $\frac{1}{2}\|\cdot\|_F^2$ (the norm defined in the RKHS). For non-linear function classes, such as neural networks, the RKHS norm cannot be defined.
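For comparison, the kernelized solution above is what the standard SVGD update (Liu & Wang, 2016) computes with the empirical particle measure in place of $p(t, y)$. A minimal numpy sketch (the fixed RBF bandwidth and the isotropic Gaussian target are illustrative assumptions):

```python
import numpy as np

# SVGD: phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p*(x_j)    (attraction)
#                               + grad_{x_j} k(x_j, x_i) ]        (repulsion)
rng = np.random.default_rng(0)
mu_star = np.array([3.0, -2.0])

def grad_log_p(x):                      # target N(mu_*, I)
    return mu_star - x

def svgd_step(x, eta=0.1, sigma2=1.0):
    diff = x[:, None, :] - x[None, :, :]                 # x_i - x_j, shape (n, n, d)
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * sigma2)) # O(n^2) kernel matrix
    grad_K = diff * K[..., None] / sigma2                # grad_{x_j} k(x_j, x_i)
    phi = (K @ grad_log_p(x) + grad_K.sum(axis=1)) / len(x)
    return x + eta * phi

x = rng.normal(size=(200, 2))
for _ in range(1000):
    x = svgd_step(x)
print(x.mean(axis=0))    # near mu_*
```

Note that the $n \times n$ kernel matrix appears in every step; this is the $O(n^2)$ cost that the parametric formulation of Eq. (8) avoids.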

3.2.2. LIMITATIONS OF SVGD

Kernel function class. As in Section 3.2.1, an RKHS only contains linear combinations of the feature-map functions, which suffers from the curse of dimensionality (Geenens, 2011). On the other hand, some non-linear function classes, such as neural networks, perform well on high-dimensional data (LeCun et al., 2015). The extension to non-linear function classes is needed for better performance. Gradient preconditioning in SVGD. In Example 1, when $-\nabla^2 \ln p_*$ is ill-conditioned, the PFG algorithm follows the shortest path from $\mu_0$ to $\mu_*$. Although SVGD can incorporate preconditioning matrices as in (Wang et al., 2019), due to the curl component and the time-dependent Jacobian of $d\mu_t/dt$, no symmetric matrix can provide optimal preconditioning (detailed derivation in the Appendix). Suboptimal convergence rate. For log-Sobolev $p_*$, SVGD with commonly used bounded smoothing kernels (such as the RBF kernel) cannot reach a linear convergence rate (Duncan et al., 2019), and an explicit KL convergence rate is not yet known. Meanwhile, the Wasserstein gradient flow converges linearly. When the function class is sufficiently large, PFG provably converges faster than SVGD. Computational cost. For SVGD, the main computational cost comes from the kernel matrix: with $n$ particles, we need $O(n^2)$ memory and computation. Our algorithm uses an iterative approximation to optimize $g(t, x)$, whose memory cost is independent of $n$ and whose computational cost is $O(n)$ (Bertsekas et al., 2011). The repulsive force between particles is realized by the $\nabla \cdot f$ operator on each particle.
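The last point can be made concrete: the divergence term costs $O(d)$ per particle and needs no pairwise kernel evaluations. A sketch (a hypothetical two-layer tanh network stands in for $f_\theta$; finite differences are used only to cross-check the closed-form Jacobian trace):

```python
import numpy as np

rng = np.random.default_rng(0)
d, hdim = 2, 16
W1, b1 = rng.normal(size=(hdim, d)) / np.sqrt(d), np.zeros(hdim)
W2 = rng.normal(size=(d, hdim)) / np.sqrt(hdim)

def f(x):                        # f_theta: R^d -> R^d, two-layer tanh network
    return W2 @ np.tanh(W1 @ x + b1)

def divergence_fd(x, eps=1e-5):  # sum_i d f_i / d x_i by central differences
    div = 0.0
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        div += (f(x + e)[i] - f(x - e)[i]) / (2 * eps)
    return div

def divergence_exact(x):         # Jacobian J = W2 diag(tanh') W1; take its trace
    h = np.tanh(W1 @ x + b1)
    return np.trace(W2 @ (np.diag(1 - h**2) @ W1))

x = rng.normal(size=d)
print(divergence_fd(x), divergence_exact(x))  # agree up to finite-difference error
```

In practice $\nabla \cdot f_\theta$ is obtained via automatic differentiation (or a Hutchinson trace estimator in high dimension), and is evaluated independently per particle, giving the $O(n)$ cost.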

4. EXPERIMENT

To validate the effectiveness of our algorithm, we conduct experiments on both synthetic and real datasets. Unless otherwise stated, we use parametric two-layer neural networks with sigmoid activation as our function class. To approximate $H$ on real datasets, we use an approximated diagonal Hessian matrix $\hat{H}$ and choose $H = \hat{H}^\alpha$, where $\alpha \in \{0, 0.1, 0.2, 0.5, 1\}$; the inner loop $T'$ of PFG is chosen from $\{1, 2, 5, 10\}$, and the hidden layer size is chosen from $\{32, 64, 128, 256, 512\}$. The parameters are chosen by validation. More detailed settings are provided in the Appendix.

[Figure 4: Evolution of the particle distribution from $N(0, I)$ to $N(\mu_*, \Sigma_*)$ for (a) $\mu_* = [1, 1]^\top$, $\Sigma_* = \mathrm{diag}(1, 1)$; (b) $\mu_* = [1, 1]^\top$, $\Sigma_* = \mathrm{diag}(100, 1)$; (c) $\mu_* = [20, 20]^\top$, $\Sigma_* = \mathrm{diag}(1, 1)$; (d) $\mu_* = [20, 20]^\top$, $\Sigma_* = \mathrm{diag}(100, 1)$; first row: mean squared error $\|\mu_t - \mu_*\|^2$; second row: KL divergence between $p(t, x)$ and $p_*(x)$.]

SVGD is only able to determine the gradient near connected clusters and fails to capture the cluster near $(-0.75, 0.5)$. However, the non-linear function class resolves the above difficulties: in PFG, we found that the score function is well estimated and the resulting samples mimic the true density properly. Ill-conditioned Gaussian distribution. We next show the effectiveness of our proposed regularizer. For an ill-conditioned Gaussian distribution, the condition number of $\Sigma_*$ is large, i.e., the ratio between the maximal and minimal eigenvalues is large. We follow Example 1 and compare different $\mu_*$ and $\Sigma_*$. When $\Sigma_*$ is well-conditioned ($\Sigma_* = I$), the $L^2$ regularizer (equivalent to the Wasserstein gradient) performs well. However, it is slowed down significantly by an ill-conditioned $\Sigma_*$. For SVGD with the linear kernel, convergence slows down with a shifted $\mu_*$ or an ill-conditioned $\Sigma_*$. For SVGD with the RBF kernel, convergence is slow due to the improper function class.
Interestingly, in the ill-conditioned case, $\mu_t$ of SVGD (linear) converges faster than our method with $H = I$, but the KL divergence does not always follow this trend. The reason is that $\Sigma_t$ of SVGD is highly biased, making the KL divergence large. Our algorithm provides feasible Wasserstein gradient estimators and makes particle-based sampling compatible with the ill-conditioned case. Logistic Regression. We conduct Bayesian logistic regression for binary classification on the Sonar and Australian datasets (Dua & Graff, 2017). In particular, the prior on the weight samples is $N(0, 1)$; the step size is fixed at 0.01. We compare our algorithm with SVGD (200 particles); full-batch gradients are used in this part (Hoffman et al., 2014). Fig. 5 shows that our method outperforms SVGD consistently. Hierarchical Logistic Regression. For hierarchical logistic regression, we consider a 2-level hierarchical model, where the prior on the weight samples is $N(0, \alpha^{-1})$ and the prior on $\alpha$ is $\mathrm{Gamma}(1, 0.01)$. All step sizes are chosen from $\{10^{-t}: t = 1, 2, \ldots\}$ by validation. We compare the test accuracy and negative log-likelihood as in Liu & Wang (2016). Mini-batches are used to estimate the log-likelihood of the training samples (batch size 200). Tab. 1 shows that our algorithm performs best against SVGD and SGLD on hierarchical logistic regression. We further compare with SGLD (Welling & Teh, 2011) and pSGLD (Li et al., 2016). Data samples are randomly partitioned into two parts, 90% (training) and 10% (testing), and 100 particles are used. For MNIST, the data split follows the default setting and we compare our algorithm with SVGD for 5, 10, 50, and 100 particles (SVGD with 100 particles exceeded the memory limit). In Tab. 2, PFG outperforms SVGD and SGLD significantly. In Fig. 6, the accuracy and NLL of PFG are better than those of SVGD for all particle sizes. With more particle samples, the PFG algorithm also improves. Time Comparison.
As indicated, our algorithm is more scalable than SVGD in the number of particles. Tab. 3 shows that SVGD incurs more computational cost as the particle number increases. Our functional gradient is obtained by iterative approximation without kernel matrix computation. When $n$ is large, our $O(n)$ algorithm is much faster than the $O(n^2)$ kernel-based method. Interestingly, the introduction of particle interactions is a key merit of particle-based VI algorithms, and intuitively requires $O(n^2)$ computation. The integration-by-parts technique draws the connection between particle interaction and the $\nabla \cdot f_\theta$ term, which supports more efficient computational realizations.
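The experiments above interact with the model only through $\nabla \ln p_*$. For Bayesian logistic regression with an $N(0, 1)$ prior, this score has a simple closed form, sketched below on synthetic data (the data and dimensions are illustrative, not the Sonar/Australian setups):

```python
import numpy as np

# Log-posterior and its gradient for Bayesian logistic regression with a
# N(0, 1) prior on each weight -- the grad ln p*(w) that PFG and SVGD consume.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (rng.random(100) < 1 / (1 + np.exp(-X @ rng.normal(size=5)))).astype(float)

def log_post(w):
    logits = X @ w
    # log sigma(z) = -log(1 + exp(-z)), written stably via logaddexp
    ll = np.sum(y * -np.logaddexp(0, -logits) + (1 - y) * -np.logaddexp(0, logits))
    return ll - 0.5 * w @ w                     # + log prior (up to constants)

def grad_log_post(w):
    p = 1 / (1 + np.exp(-X @ w))                # predicted probabilities
    return X.T @ (y - p) - w                    # closed-form score

# Finite-difference check of the closed form.
w = rng.normal(size=5)
g = grad_log_post(w)
eps = 1e-6
g_fd = np.array([(log_post(w + eps * e) - log_post(w - eps * e)) / (2 * eps)
                 for e in np.eye(5)])
print(np.max(np.abs(g - g_fd)))   # small
```

With mini-batches, the log-likelihood sum is replaced by a rescaled batch sum, as in the hierarchical experiments above.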

5. RELATED WORKS

Stein Discrepancy and SVGD. The Stein discrepancy (Liu et al., 2016; Chwialkowski et al., 2016; Liu & Wang, 2016) is known as the foundation of the SVGD algorithm, and is defined on bounded function classes, such as functions with bounded RKHS norm. We provide another view to derive SVGD: SVGD is a functional gradient method with an RKHS-norm regularizer. From this view, we find that there are other variants of particle-based VI. Our proposed framework has many advantages over SVGD, as the function class is more general and the kernel computation can be avoided. We use neural networks as our $\mathcal{F}$ in practice, which is much more efficient than SVGD. Moreover, the general regularizer in Eq. (6) improves the conditioning of the functional gradient, which is proven better than SVGD in both theory and experiments. The preconditioning of our algorithm is also more interpretable than that of SVGD, without incorporating RKHS-dependent properties. Learning Discrepancy with Neural Networks. Some recent works (Hu et al., 2018; Grathwohl et al., 2020; di Langosco et al., 2021) leverage neural networks to learn the Stein discrepancy, which is related to our algorithm. They only implement $Q(f) = E_{p_t}\|f(x)\|^2$ to estimate the discrepancy. Grathwohl et al. (2020) measure the similarity between distributions and train energy-based models (EBM) with the discrepancy. However, they did not further validate the sampling performance of their "learned discrepancy", so it is a new version of KSD (Liu et al., 2016) rather than of SVGD. Hu et al. (2018) train a neural "generator" for sampling tasks, and discuss the scaling of $L^2$ regularization to align with the step size. In contrast, we restrict our investigation to particle-based VI and avoid learning a "generator", which may introduce more parameterization errors. di Langosco et al. (2021) is an empirical study implementing the $L^2$ version of our algorithm with performance comparable to SVGD.
We extend the design of $Q$ to the general case, and emphasize the benefits of our proposed regularizers corresponding to the preconditioned algorithm. We include Example 1 and Fig. 4 to demonstrate the necessity of preconditioning. Our theoretical and experimental results demonstrate the improvement of the PFG algorithm over SVGD. KL Wasserstein Gradient Flow. The Wasserstein gradient flow (Ambrosio et al., 2005) is the continuous dynamics that minimizes functionals in Wasserstein space, which is crucial in sampling and optimal transport tasks. However, the numerical computation of the Wasserstein gradient is non-trivial. Previous works either rely on numerical schemes (Peyré, 2015; Benamou et al., 2016; Carlier et al., 2017) or approximate the flow (Welling & Teh, 2011; Bernton, 2018) by adding Gaussian noise to particles, which yields variants of MCMC. A deterministic algorithm to approximate the Wasserstein gradient flow remains challenging. Our framework provides a tractable approximation to fill this gap.

6. CONCLUSION

In particle-based VI, we consider that the particles can be updated with a preconditioned functional gradient flow. Our theoretical results and Gaussian example lead to an algorithm, PFG, that approximates the preconditioned Wasserstein gradient directly. Our theory indicates that when the function class is large, PFG is provably faster than the conventional particle-based VI method SVGD. Empirically, PFG performs better than conventional SVGD and SGLD variants.

A PROOFS

A.1 PROOF OF EQ. (3)

From the continuity equation (or Fokker-Planck equation), we have

$$\frac{\partial p(t, x)}{\partial t} = -\nabla \cdot [g(t, x)\, p(t, x)].$$

By the chain rule and integration by parts,

$$\begin{aligned}
\frac{dD_{\mathrm{KL}}(t)}{dt} &= \int \frac{\partial p(t, x)}{\partial t}\left(1 + \ln\frac{p(t, x)}{p_*(x)}\right) dx \\
&= -\int \nabla \cdot [g(t, x)\, p(t, x)]\left(1 + \ln\frac{p(t, x)}{p_*(x)}\right) dx \\
&= \int [g(t, x)\, p(t, x)]^\top \nabla \ln\frac{p(t, x)}{p_*(x)}\, dx \\
&= \int [g(t, x)\, p(t, x)]^\top \nabla\left[\ln p(t, x) - \ln p_*(x)\right] dx \\
&= \int g(t, x)^\top \left[\nabla p(t, x) - p(t, x)\nabla \ln p_*(x)\right] dx \\
&= -\int p(t, x)\left[\nabla \cdot g(t, x) + g(t, x)^\top \nabla_x \ln p_*(x)\right] dx.
\end{aligned}$$

A.2 DISCUSSION OF EXAMPLE 1

(1) SVGD with linear kernel $k(x, y) = x^\top K y + 1$:
$$g(t, x) = -\Sigma_*^{-1}\left[\left(\Sigma_t + (\mu_t - \mu_*)\mu_t^\top\right)Kx + \mu_t - \mu_*\right] + Kx.$$
(2) SVGD with RBF kernel $k_\sigma(x, y) = \frac{1}{\sqrt{\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\|y - x\|^2\right)$:
$$g(t, x) = O\left(x\exp(-\|x\|^2)\right).$$
(3) Linear function class with $L^2$ regularization $Q(f) = \frac{1}{2}E_{p_t}\|f(x)\|^2$:
$$g(t, x) = -\Sigma_*^{-1}(x - \mu_*) + \Sigma_t^{-1}(x - \mu_t).$$
(4) Linear function class with Mahalanobis regularization $Q(f) = \frac{1}{2}E_{p_t}\|f(x)\|^2_{\Sigma_*^{-1}}$:
$$g(t, x) = -(x - \mu_*) + \Sigma_*\Sigma_t^{-1}(x - \mu_t).$$
For optimal transport,
$$g(t, x) = -(x - \mu_*) + \Sigma_t^{-1/2}\left(\Sigma_t^{1/2}\Sigma_*\Sigma_t^{1/2}\right)^{1/2}\Sigma_t^{-1/2}(x - \mu_t).$$
Note that $\mu_*\mu_t^\top$ is not symmetric, which makes the distribution rotate. Figure 7 shows that we can split off the curl component by the Helmholtz decomposition, which means that this part rotates the distribution and cannot be compensated by a preconditioning method. Thus, we cannot find any $K$ that yields a proper preconditioning. Proof. We consider the four cases separately.
(1) SVGD with linear kernel. If $k(x, y) = x^\top K y + 1$, then $\nabla_y k(x, y) = Kx$, and

$$\begin{aligned}
g(t, x) &= \int p(t, y)\left(-\Sigma_*^{-1}(y - \mu_*) + \Sigma_t^{-1}(y - \mu_t)\right)\left(y^\top Kx + 1\right) dy \\
&= -\int p(t, y)\,\Sigma_*^{-1}(y - \mu_*)\left(y^\top Kx + 1\right) dy + \int p(t, y)\,\Sigma_t^{-1}\left(yy^\top - \mu_t y^\top\right) dy\, Kx \\
&= -\Sigma_*^{-1}\left[\left(\Sigma_t + (\mu_t - \mu_*)\mu_t^\top\right)Kx + \mu_t - \mu_*\right] + Kx.
\end{aligned}$$

(2) SVGD with RBF kernel. Substituting $\tilde{y} = y - \mu_t$ and completing the square in the Gaussian integral,

$$\begin{aligned}
g(t, x) &= \int \frac{1}{2\pi|\Sigma_t|^{1/2}}\exp\left(-\frac{1}{2}(y - \mu_t)^\top\Sigma_t^{-1}(y - \mu_t)\right)\frac{1}{\sqrt{\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\|y - x\|^2\right)\left[\Sigma_*^{-1}(y - \mu_*) - \Sigma_t^{-1}(y - \mu_t)\right] dy \\
&= \int \frac{1}{2\pi|\Sigma_t|^{1/2}}\exp\left(-\frac{1}{2}\tilde{y}^\top\Sigma_t^{-1}\tilde{y}\right)\frac{1}{\sqrt{\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\|\tilde{y} + \mu_t - x\|^2\right)\left[\Sigma_*^{-1}(\tilde{y} + \mu_t - \mu_*) - \Sigma_t^{-1}\tilde{y}\right] d\tilde{y} \\
&= C_x\left[\left(\Sigma_*^{-1} - \Sigma_t^{-1}\right)\mu_x + \Sigma_*^{-1}(\mu_t - \mu_*)\right],
\end{aligned}$$

where

$$\mu_x = \sigma^{-2}\left(\sigma^{-2}I + \Sigma_t^{-1}\right)^{-1}(x - \mu_t) = \left(I + \sigma^2\Sigma_t^{-1}\right)^{-1}(x - \mu_t),$$

$$C_x = \frac{1}{\sqrt{|\sigma^2 I + \Sigma_t|}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu_t)^\top\left(I - \left(I + \sigma^2\Sigma_t^{-1}\right)^{-1}\right)(x - \mu_t)\right).$$

Note that $g(t, x) = O(x\exp(-\|x\|^2))$: when $\|x\| \to \infty$, $g(t, x)$ vanishes.

(3) and (4) are special cases of Proposition 1. Considering the evolution of $\mu_t$, we have:
(1) $\frac{d\mu_t}{dt} = -\Sigma_*^{-1}\left[\left(\Sigma_t + (\mu_t - \mu_*)\mu_t^\top\right)K\mu_t + \mu_t - \mu_*\right] + K\mu_t$, which contains the non-symmetric term $\mu_*\mu_t^\top$.
(2) The transport is not linear.
(3) $\frac{d\mu_t}{dt} = \Sigma_*^{-1}(\mu_* - \mu_t)$.
(4) $\frac{d\mu_t}{dt} = \mu_* - \mu_t$, which means that the mean evolution of (4) is equivalent to that of optimal transport.
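The closed form in case (1) can be verified against the defining expectation using exact Gaussian moments (no sampling; all matrices below are arbitrary illustrative choices):

```python
import numpy as np

# Case (1): for k(x, y) = x^T K y + 1 the SVGD field has the closed form
#   g(t, x) = -Sigma_*^{-1}[(Sigma_t + (mu_t - mu_*) mu_t^T) K x + mu_t - mu_*] + K x.
# We evaluate the defining expectation with exact Gaussian moments instead.
mu_t, mu_star = np.array([1.0, -1.0]), np.array([4.0, 2.0])
Sigma_t = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_star_inv = np.linalg.inv(np.diag([9.0, 1.0]))
K = np.array([[1.0, 0.2], [0.2, 0.5]])
x = np.array([0.3, 0.7])

# E_y[(-Sigma_*^{-1}(y - mu_*) + Sigma_t^{-1}(y - mu_t)) (y^T K x + 1)], using
# E[y] = mu_t and E[y y^T] = Sigma_t + mu_t mu_t^T:
EyyT = Sigma_t + np.outer(mu_t, mu_t)
term1 = -Sigma_star_inv @ ((EyyT - np.outer(mu_star, mu_t)) @ K @ x
                           + mu_t - mu_star)      # from -Sigma_*^{-1}(y - mu_*)
term2 = np.linalg.inv(Sigma_t) @ (EyyT - np.outer(mu_t, mu_t)) @ K @ x
g_moments = term1 + term2                         # term2 reduces to K x

g_closed = (-Sigma_star_inv @ ((Sigma_t + np.outer(mu_t - mu_star, mu_t)) @ K @ x
                               + mu_t - mu_star) + K @ x)
print(np.max(np.abs(g_moments - g_closed)))   # zero up to floating point
```

The two expressions agree exactly, since $\Sigma_t + \mu_t\mu_t^\top - \mu_*\mu_t^\top = E[yy^\top] - \mu_*\mu_t^\top$ and $\Sigma_t^{-1}(E[yy^\top] - \mu_t\mu_t^\top) = I$.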

A.3 PROOF OF PROPOSITION 1

Proof. Assume that $Q(f) = \frac{1}{2}\int p(t, x)\|f(x)\|_H^2\, dx$. Then Eq. (4) becomes

$$\begin{aligned}
g(t, x) &= \arg\min_{f \in \mathcal{F}} \int p(t, x)\left[-\nabla \cdot f(x) - f(x)^\top\nabla \ln p_*(x) + \frac{1}{2}\|f(x)\|_H^2\right] dx \quad (15)\\
&= \arg\min_{f \in \mathcal{F}} \int p(t, x)\left[-f(x)^\top\nabla \ln\frac{p_*(x)}{p(t, x)} + \frac{1}{2}\|f(x)\|_H^2\right] dx \quad (16)\\
&= \arg\min_{f \in \mathcal{F}} E_{p_t}\left[\frac{1}{2}\left\|H^{-1}\nabla \ln\frac{p_*(x)}{p(t, x)}\right\|_H^2 - f(x)^\top\nabla \ln\frac{p_*(x)}{p(t, x)} + \frac{1}{2}\|f(x)\|_H^2\right] \quad (17)\\
&= \arg\min_{f \in \mathcal{F}} \frac{1}{2} E_{p_t}\left\|f(x) - H^{-1}\nabla \ln\frac{p_*(x)}{p(t, x)}\right\|_H^2, \quad (18)
\end{aligned}$$

where (16) uses integration by parts and (17) adds a term independent of $f$.

A.4 PROOF OF PROPOSITION 2

Lemma 1. (A variant of Lemma 11 in Vempala & Wibisono (2019)) Suppose that $p(x) > 0$ is $L$-smooth. Then $\int p(x)\|\nabla \ln p(x)\|^2\, dx \leq Ld$.

Proof. By [A1], there exists $f$ such that

$$-\int p(t, x)\left[\nabla \cdot f + f^\top\nabla_x \ln p_*(x)\right] dx < 0. \quad (19)$$

Consider $f_c(x) = cf(x)$ and write

$$L(c) = -\int p(t, x)\left[\nabla \cdot (cf) + cf^\top\nabla_x \ln p_*(x)\right] dx =: -C_f c, \quad (20)$$
$$l(c) = \frac{1}{2}\int p(t, x)\|cf\|_H^2\, dx = \frac{c^2}{2}\int p(t, x)\|f\|_H^2\, dx =: C_H c^2, \quad (21)$$

where $C_H, C_f > 0$. Choosing $c_0 = \frac{C_f}{2C_H}$ gives $C_H c_0^2 - C_f c_0 = -\frac{C_f^2}{4C_H} < 0$. Since $g(t, x)$ minimizes the objective in Eq. (4),

$$-\int p(t, x)\left[\nabla \cdot g(t, x) + g(t, x)^\top\nabla \ln p_*(x)\right] dx + \frac{1}{2}\int p(t, x)\|g(t, x)\|_H^2\, dx \quad (22)$$
$$\leq -\int p(t, x)\left[\nabla \cdot (c_0 f) + c_0 f^\top\nabla \ln p_*(x)\right] dx + \frac{1}{2}\int p(t, x)\|c_0 f\|_H^2\, dx \quad (23)$$
$$< 0. \quad (24)$$

Since the regularization term is non-negative,

$$\frac{dD_{\mathrm{KL}}(t)}{dt} = -\int p(t, x)\left[\nabla \cdot g(t, x) + g(t, x)^\top\nabla \ln p_*(x)\right] dx < 0.$$

For the lower bound, by Eq. (3),

$$\begin{aligned}
\frac{dD_{\mathrm{KL}}(t)}{dt} &= -\int p(t, x)\left[\nabla \cdot g(t, x) + g(t, x)^\top\nabla_x \ln p_*(x)\right] dx \quad (25)\\
&= -\int p(t, x)\, g(t, x)^\top\nabla \ln\frac{p_*(x)}{p(t, x)}\, dx \quad (26)\\
&\geq -\frac{1}{2}\int p(t, x)\left\|\nabla \ln\frac{p_*(x)}{p(t, x)}\right\|^2 dx - \frac{1}{2}\int p(t, x)\,\|g(t, x)\|^2\, dx \quad (27)\\
&\geq -\left(1 + \lambda_{\max}(H)\lambda_{\min}^{-2}(H)\right)\int p(t, x)\left\|\nabla \ln\frac{p_*(x)}{p(t, x)}\right\|^2 dx \quad (28)\\
&> -\infty, \quad (29)
\end{aligned}$$

where (28) uses the optimality of $g$ to bound $\int p(t, x)\|g(t, x)\|^2\, dx$, and $\int p(t, x)\|\nabla \ln\frac{p_*(x)}{p(t, x)}\|^2\, dx$ is controlled by $Ld$ ([A2] and Lemma 1) and $E_{p_t}\|x\|^2$, since $L$-smoothness gives $\|\nabla \ln p_*(x)\| \leq \|\nabla \ln p_*(0)\| + L\|x\|$.

A.5 PROOF OF THEOREM 1

Proof. By the $\mu$-log-Sobolev inequality in [A4] with $g^2 = p_t/p_*$, we have

$$D_{\mathrm{KL}}(t) \leq \frac{1}{2\mu}\int p(t, x)\left\|\nabla \ln\frac{p(t, x)}{p_*(x)}\right\|^2 dx.$$

For the $H$-induced norm, $\|v\|_{H^{-1}}^2 \geq \frac{1}{\lambda_{\max}(H)}\|v\|^2$ for any $v$, so

$$D_{\mathrm{KL}}(t) \leq \frac{\lambda_{\max}(H)}{2\mu}\int p(t, x)\left\|\nabla \ln\frac{p(t, x)}{p_*(x)}\right\|_{H^{-1}}^2 dx. \quad (30)$$

Using the Cauchy-Schwarz inequality and [A3], we have

$$\begin{aligned}
\frac{dD_{\mathrm{KL}}(t)}{dt} &= -\int p(t, x)\,\nabla \ln\frac{p_*(x)}{p(t, x)} \cdot g(t, x)\, dx \\
&= -\int p(t, x)\,\nabla \ln\frac{p_*(x)}{p(t, x)} \cdot \left[g(t, x) - H^{-1}\nabla \ln\frac{p_*(x)}{p(t, x)} + H^{-1}\nabla \ln\frac{p_*(x)}{p(t, x)}\right] dx \\
&= -\int p(t, x)\left\|\nabla \ln\frac{p(t, x)}{p_*(x)}\right\|_{H^{-1}}^2 dx - \int p(t, x)\,\nabla \ln\frac{p_*(x)}{p(t, x)} \cdot \left[g(t, x) - H^{-1}\nabla \ln\frac{p_*(x)}{p(t, x)}\right] dx \\
&\leq -\int p(t, x)\left\|\nabla \ln\frac{p(t, x)}{p_*(x)}\right\|_{H^{-1}}^2 dx + \int p(t, x)\left\|\nabla \ln\frac{p_*(x)}{p(t, x)}\right\|_{H^{-1}}\left\|g(t, x) - H^{-1}\nabla \ln\frac{p_*(x)}{p(t, x)}\right\|_H dx \\
&\leq -\frac{1}{2}\int p(t, x)\left\|\nabla \ln\frac{p(t, x)}{p_*(x)}\right\|_{H^{-1}}^2 dx + \frac{1}{2}\int p(t, x)\left\|g(t, x) - H^{-1}\nabla \ln\frac{p_*(x)}{p(t, x)}\right\|_H^2 dx \\
&\leq -\frac{1-\epsilon}{2}\int p(t, x)\left\|\nabla \ln\frac{p(t, x)}{p_*(x)}\right\|_{H^{-1}}^2 dx \leq -\frac{\mu(1 - \epsilon)}{\lambda_{\max}(H)}\, D_{\mathrm{KL}}(t).
\end{aligned}$$

Thus, by Grönwall's inequality,

$$D_{\mathrm{KL}}(t) \leq \exp\left(-\frac{\mu(1 - \epsilon)}{\lambda_{\max}(H)}\, t\right) D_{\mathrm{KL}}(0).$$

Remark: Theorem 1 shows that the PFG algorithm converges at a linear rate. Moreover, the choice of preconditioning matrix affects the convergence rate through its largest eigenvalue. In practice, due to the discretized algorithm, one needs to normalize the scaling of $H$, which is equivalent to choosing the learning rate. For any preconditioner $\tilde{H}$, we compute the corresponding learning rate and set $H = \eta^{-1}\tilde{H}$. One usually requires $\eta\|\tilde{H}^{-1}\nabla^2 \ln p_*(x)\|_2 \leq 1$ for the discretized (first-order) algorithm to be feasible, i.e., $\eta \leq \|\tilde{H}^{-1}\nabla^2 \ln p_*(x)\|_2^{-1}$; otherwise the second-order error would be too large. The construction is similar to the condition-number dependency in the optimization literature. Taking the largest feasible fixed learning rate in the continuous case: when $\tilde{H} = I$, we have $\eta = 1/L$, $H = LI$, and

$$\frac{dD_{\mathrm{KL}}(t)}{dt} \leq -\frac{1}{2L}\, E_{p_t}\left\|\nabla \ln\frac{p_t(x)}{p_*(x)}\right\|^2.$$

When $\tilde{H} = -\nabla^2 \ln p_*$, we have $\eta = 1$, $H = -\nabla^2 \ln p_*$, and

$$\frac{dD_{\mathrm{KL}}(t)}{dt} \leq -\frac{1}{2}\, E_{p_t}\left\|\nabla \ln\frac{p_t(x)}{p_*(x)}\right\|_{(-\nabla^2 \ln p_*)^{-1}}^2 \leq -\frac{1}{2L}\, E_{p_t}\left\|\nabla \ln\frac{p_t(x)}{p_*(x)}\right\|^2.$$

A.6 ANALYSIS OF DISCRETE GAUSSIAN CASE

As shown in Example 1, given x_0 \sim N(0, I) and the target distribution N(\mu_*, \Sigma_*), the PFG update is

x_{t+1} = x_t + \eta H^{-1} \left[ \Sigma_*^{-1}(\mu_* - x_t) + \Sigma_t^{-1}(x_t - \mu_t) \right].

Taking expectations, we have

\mu_{t+1} - \mu_t = \eta H^{-1} \Sigma_*^{-1} (\mu_* - \mu_t),
\Sigma_{t+1} - \Sigma_t = \eta H^{-1} (\Sigma_t^{-1} - \Sigma_*^{-1}) \Sigma_t + \eta \Sigma_t (\Sigma_t^{-1} - \Sigma_*^{-1}) H^{-1} + \eta^2 H^{-1} (\Sigma_t^{-1} - \Sigma_*^{-1}) \Sigma_t (\Sigma_t^{-1} - \Sigma_*^{-1}) H^{-1}.

Since \Sigma_0 = I, the eigenvectors of \Sigma_t do not change during the update. For notational simplicity, we treat the diagonal case in the following discussion. Assume \Sigma_* = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2), \Sigma_t = \mathrm{diag}(s_1^2(t), \dots, s_d^2(t)), H = \mathrm{diag}(h_1, \dots, h_d), such that \sigma_i \ge \sigma_{i+1} for i = 1, \dots, d-1. Without loss of generality, we assume \sigma_1 \le 1; otherwise we can rescale the initialization to obtain a similar constraint. Using the fact that x - \ln(1+x) \le x^2/2 when x > 0, we have

D_{\mathrm{KL}}(t) \le \frac{1}{2} \|\mu_t - \mu_*\|_{\Sigma_*^{-1}}^2 + \frac{1}{2} \sum_{i=1}^d \left( \frac{s_i^2(t) - \sigma_i^2}{\sigma_i^2} \right)^2.

Thus, we define

C_0 := \frac{1}{2} \|\mu_0 - \mu_*\|_{\Sigma_*^{-1}}^2 + \frac{1}{2} \sum_{i=1}^d \left( \frac{1 - \sigma_i^2}{\sigma_i^2} \right)^2.   (33)

Considering the i-th dimension (with notation [\,\cdot\,]_i),

[\mu_{t+1}]_i = \left(1 - \frac{\eta}{h_i \sigma_i^2}\right) [\mu_t]_i + \frac{\eta}{h_i \sigma_i^2} [\mu_*]_i,   (34)
s_i^2(t+1) = s_i^2(t) + \frac{2\eta s_i^2(t)}{h_i} \left( \frac{1}{s_i^2(t)} - \frac{1}{\sigma_i^2} \right) + \frac{\eta^2 s_i^2(t)}{h_i^2} \left( \frac{1}{s_i^2(t)} - \frac{1}{\sigma_i^2} \right)^2.   (35)

To guarantee convergence of the algorithm, both \mu_t and \Sigma_t must converge. Rearranging (34) and (35), we have

\frac{[\mu_{t+1}]_i - [\mu_*]_i}{\sigma_i} = \left(1 - \frac{\eta}{h_i \sigma_i^2}\right) \frac{[\mu_t]_i - [\mu_*]_i}{\sigma_i},
\frac{s_i^2(t+1) - \sigma_i^2}{\sigma_i^2} = \left(1 - \frac{2\eta}{h_i \sigma_i^2}\right) \frac{s_i^2(t) - \sigma_i^2}{\sigma_i^2} + \frac{\eta^2}{h_i^2 \sigma_i^2 s_i^2(t)} \left( \frac{s_i^2(t) - \sigma_i^2}{\sigma_i^2} \right)^2.

One needs to construct contraction sequences for both the mean and variance gaps, which are governed by the factors 1 - \eta/(h_i \sigma_i^2) and 1 - 2\eta/(h_i \sigma_i^2). Notice that s_i^2(0) \ge \sigma_i^2. Assume \eta \le \min_i h_i \sigma_i^2 / 2. Then s_i^2(t+1) \ge \sigma_i^2, and thus

\frac{\eta}{h_i} \left( \frac{1}{\sigma_i^2} - \frac{1}{s_i^2(t)} \right) \le 1.
It is natural to obtain contraction of both the mean and variance gaps:

\|\mu_{t+1} - \mu_*\|_{\Sigma_*^{-1}}^2 \le \max_i \left(1 - \frac{\eta}{h_i \sigma_i^2}\right)^2 \|\mu_t - \mu_*\|_{\Sigma_*^{-1}}^2,
\frac{s_i^2(t+1) - \sigma_i^2}{\sigma_i^2} \le \left(1 - \frac{\eta}{h_i \sigma_i^2}\right) \frac{s_i^2(t) - \sigma_i^2}{\sigma_i^2},
D_{\mathrm{KL}}(t) \le \max_i \left(1 - \frac{\eta}{h_i \sigma_i^2}\right)^{2t} C_0,

where C_0 is defined in Eq. (33). When h_i = 1, we have \eta = \sigma_d^2 / 2 and

D_{\mathrm{KL}}(t) \le \left(1 - \frac{\sigma_d^2}{2 \sigma_1^2}\right)^{2t} C_0.

When h_i = \sigma_i^{-2}, we have \eta = 1/2 and

D_{\mathrm{KL}}(t) \le \left(\frac{1}{4}\right)^t C_0.

This result matches the Remark in A.5.
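The diagonal recursions (34)-(35) can be checked numerically. Below is a minimal sketch, assuming exact expectations (the infinite-particle limit) and illustrative values for σ_i², µ_*, and η; with h_i = σ_i^{-2} and η = 1/2, both gaps contract as derived above.

```python
import numpy as np

# Iterate the per-dimension PFG recursions for the mean and variance of a
# diagonal Gaussian, Eqs. (34)-(35), assuming exact expectations.
sigma2 = np.array([1.0, 0.25, 0.01])   # target variances sigma_i^2 (sigma_1 <= 1)
mu_star = np.array([2.0, -1.0, 0.5])   # target mean
h = 1.0 / sigma2                        # preconditioner h_i = sigma_i^{-2}
eta = 0.5                               # step size; eta <= min_i h_i sigma_i^2 / 2

mu, s2 = np.zeros(3), np.ones(3)        # initialization N(0, I)
for t in range(50):
    mu = (1 - eta / (h * sigma2)) * mu + eta / (h * sigma2) * mu_star
    s2 = s2 + 2 * eta * s2 / h * (1 / s2 - 1 / sigma2) \
            + eta**2 * s2 / h**2 * (1 / s2 - 1 / sigma2) ** 2

print(np.max(np.abs(mu - mu_star)), np.max(np.abs(s2 - sigma2)))
```

Both gaps shrink geometrically, consistent with the (1/4)^t rate for h_i = σ_i^{-2}.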

A.7 PROOF OF INFINITE-DIMENSIONAL CASE OF RKHS

Proof. Assume the feature map \psi(x) : \mathbb{R}^d \to \mathcal{H}, and let the kernel be k(x, y) = \langle \psi(x), \psi(y) \rangle_{\mathcal{H}} with Q(f) = \frac{1}{2} \|f\|_{\mathcal{H}^d}^2. One can perform a spectral decomposition to obtain k(x, y) = \sum_{i=1}^\infty \lambda_i \psi_i(x) \psi_i(y), where the \psi_i : \mathbb{R}^d \to \mathbb{R} form an orthonormal basis and the \lambda_i are the corresponding eigenvalues. Any g \in \mathcal{H}^d admits the decomposition

g(x) = \sum_{i=1}^\infty g_i \sqrt{\lambda_i}\, \psi_i(x),

where g_i \in \mathbb{R}^d and \sum_{i=1}^\infty \|g_i\|^2 < \infty, so that \|g\|_{\mathcal{H}^d}^2 = \sum_{i=1}^\infty \|g_i\|^2. The solution is defined by

\hat{g} = \arg\min_{g \in \mathcal{H}^d} -\int p(t,x) \left[ \nabla \cdot g + \nabla \ln p_*(x)^\top g(x) \right] dx + \frac{1}{2} \|g\|_{\mathcal{H}^d}^2
= \arg\min_{g \in \mathcal{H}^d} -\int p(t,x) \left[ \nabla \cdot \sum_{i=1}^\infty g_i \sqrt{\lambda_i}\, \psi_i(x) + \sum_{i=1}^\infty \sqrt{\lambda_i}\, \nabla \ln p_*(x)^\top g_i\, \psi_i(x) \right] dx + \frac{1}{2} \sum_{i=1}^\infty \|g_i\|^2,

whose first-order condition gives

\hat{g}_i = \sqrt{\lambda_i} \int p(t,y) \left[ \nabla \psi_i(y) + \nabla \ln p_*(y)\, \psi_i(y) \right] dy.

This implies

\hat{g}(t, x) = \sum_{i=1}^\infty \sqrt{\lambda_i}\, \hat{g}_i\, \psi_i(x) = \int p(t,y) \left[ \nabla_y k(y, x) + \nabla \ln p_*(y)\, k(y, x) \right] dy,

which is equivalent to SVGD.
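As a concrete check of the recovered SVGD form, the following sketch (our own illustration, with an RBF kernel, a standard-normal target, and an arbitrary bandwidth and particle count) estimates g(t,x) = ∫ p(t,y)[∇_y k(y,x) + ∇ ln p_*(y) k(y,x)] dy by its particle average:

```python
import numpy as np

# Particle estimate of the SVGD direction with an RBF kernel, for a
# standard-normal target, so grad log p*(y) = -y.
def rbf_svgd_direction(X, score, bw=1.0):
    # X: (n, d) particles; score: function mapping (n, d) -> grad log p*
    diff = X[:, None, :] - X[None, :, :]                 # x_i - y_j
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * bw**2))  # k(y_j, x_i)
    grad_K = diff * K[:, :, None] / bw**2                # grad_{y_j} k(y_j, x_i)
    drive = K @ score(X)                                 # sum_j k(y_j, x_i) grad log p*(y_j)
    repulse = grad_K.sum(axis=1)                         # sum_j grad_{y_j} k(y_j, x_i)
    return (drive + repulse) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + 3.0      # particles offset from the target N(0, I)
g = rbf_svgd_direction(X, lambda y: -y)
print(np.mean(g, axis=0))                # on average, points back toward the origin
```

The repulsive term averages to zero over the particle set by antisymmetry, so the mean direction is driven by the score term toward the target.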

B MORE DISCUSSIONS

B.1 SCORE MATCHING

Score matching (SM) is related to our algorithm through the integration-by-parts technique: it uses parameterized models to estimate the Stein score ∇ ln p_*. We make two modifications to the original SM techniques: (1) we extend score matching of ∇ ln p_* to approximation of the Wasserstein gradient ∇ ln p_* - ∇ ln p_t; (2) we introduce a geometry-aware regularization to approximate the preconditioned gradient. These modifications make the approach suitable for sampling tasks. In short, both SM and PFG are derived from integration by parts; SM is a promising framework for generative modeling, while PFG is better suited to sampling tasks.
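To make the contrast concrete, here is a hedged sketch (not the paper's implementation; a linear class g(x) = Ax + b, Gaussian particles, and the regularizer Q(g) = ½E‖g‖² are all chosen for illustration) of the regularized objective obtained from integration by parts, whose minimizer approximates the Wasserstein gradient ∇ ln p_* - ∇ ln p_t rather than the score alone:

```python
import numpy as np

# Objective: J(g) = E_{p_t}[ -div g(x) - grad log p*(x)^T g(x) ] + 1/2 E_{p_t}||g(x)||^2,
# minimized by plain gradient descent over a linear class g(x) = A x + b.
rng = np.random.default_rng(0)
m = np.array([1.0, -1.0])
X = rng.normal(size=(2000, 2)) + m         # particles p_t ~ N(m, I)
score = lambda x: -x                        # target p* = N(0, I)

A, b = np.zeros((2, 2)), np.zeros(2)
for _ in range(300):
    g = X @ A.T + b
    r = g - score(X)                        # residual against grad log p*
    grad_A = -np.eye(2) + r.T @ X / len(X)  # -I from the divergence term -tr(A)
    grad_b = r.mean(axis=0)
    A -= 0.2 * grad_A
    b -= 0.2 * grad_b

# For p_t = N(m, I) and p* = N(0, I), the Wasserstein gradient is
# grad log p* - grad log p_t = -x + (x - m) = -m, a constant field:
print(A.round(2), b.round(2))               # A ~ 0, b ~ -m
```

Without the divergence term this would reduce to regularized score regression; with it, the fitted field is the (approximate) Wasserstein gradient.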

B.2 OTHER PRECONDITIONED SAMPLING ALGORITHMS

Preconditioning is a popular topic in the SVGD literature. The Stein variational Newton method (SVN) (Detommaso et al., 2018) approximates the Hessian to accelerate SVGD; however, due to the gap between SVGD and Wasserstein gradient flow, its theoretical interpretation is not clear. Matrix SVGD (Wang et al., 2019) leverages more general matrix-valued kernels, includes SVN as a variant, and outperforms it. Chen et al. (2019) perform a parallel update of the parameter samples projected into a low-dimensional subspace by an SVN method, implementing dimension reduction for the SVN algorithm. Liu et al. (2022) project SVGD onto arbitrary-dimensional subspaces with the Grassmann Stein kernel discrepancy, using Riemannian optimization techniques to find a proper projection space. Note that all of these algorithms are based on RKHS norm regularization. As mentioned in Example 1, because of the RKHS, the preconditioning matrix is often hard to find: the preconditioning matrix of the linear kernel varies over time, while our algorithm only needs a constant matrix; for the RBF kernel, previous works have demonstrated the improvement from the Hessian inverse, but the design remains heuristic due to the gap between the Wasserstein gradient and SVGD. Our algorithm directly approximates the preconditioned Wasserstein gradient, so further analysis is clear and conclusive. The SVGD-based algorithms also suffer from the drawbacks of RKHS norm regularization, as discussed in Example 1. Besides, other interesting works (Li et al., 2019; Garbuno-Inigo et al., 2020; Wang et al., 2021; Lin et al., 2021; Wang & Li, 2020; 2022; Li & Ying, 2019) also take preconditioning methods, local geometry, or subspace properties into consideration, and have shown improvements over their plain versions, with motivations similar to our algorithm.
Although some of these methods fall outside the scope of particle-based VI, we believe they have great potential in this literature and constitute interesting future work.

B.3 OTHER WASSERSTEIN GRADIENT FLOW ALGORITHMS

There is another line of work that approximates Wasserstein gradient flow through JKO discretization (Mokrov et al., 2021; Alvarez-Melis et al., 2021; Fan et al., 2021). The continuous versions of these Wasserstein gradient flow algorithms are related to our algorithm when the functional is chosen as the KL divergence, and they are promising alternatives to PFG in transport tasks. From a theoretical view, they solve the JKO operator with special neural networks (ICNNs) and aim to estimate the Wasserstein gradient of general functionals, including the KL and JS divergences. On the other hand, our algorithm is motivated by the continuous KL Wasserstein gradient flow, and we only consider Euler discretization (rather than JKO), which is the setting of particle-based variational inference. For the KL divergence and posterior sampling tasks, our proposed method is more efficient than the Wasserstein gradient flow based ones, due to the flexibility of the gradient estimation function. We provide empirical results on Bayesian logistic regression below (Covtype dataset).

Some recent works (Ba et al., 2021; Gong et al., 2020; Zhuo et al., 2018; Liu et al., 2022) suggest that SVGD suffers from the curse of dimensionality and that the variance of SVGD-trained particles tends to diminish in high-dimensional spaces. We highlight that one key issue is that the SVGD algorithm with the RBF function space is improper. As shown in Tab. 5, when fitting a Gaussian distribution, an ideal function class is the linear function class given a Gaussian initialization. With RBF-based SVGD, the variance collapse phenomenon is extremely severe. On the contrary, when using a proper (linear) function class, both SVGD and PFG perform well. More importantly, some powerful non-linear function classes (neural networks) also perform well with PFG.
The effectiveness of the neural network function class and the PFG algorithm is particularly important when the target distribution is more complex. We include a further justification in Tab. 7, which computes the energy distance between the target distribution and the particle estimate on a 4-mode Gaussian mixture distribution. In this case, the required function class of the gradient is much larger than in the linear case, and the resulting performance improves on previous work (GSVGD). Ideally, one would use the exact Hessian inverse as the preconditioning matrix to make the algorithm well-conditioned, which is also the target of other preconditioned SVGD algorithms. However, computing the exact Hessian is challenging. Thus, motivated by adaptive gradient optimization algorithms, we use a moving average of the Fisher information to obtain the preconditioning matrix. We conduct hierarchical logistic regression to discuss the importance of preconditioning, comparing 3 estimation strategies: (1) Full Hessian: compute the exact full Hessian matrix; (2) K-FAC: Kronecker-factored Approximate Curvature to estimate the Hessian; (3) Diagonal: a diagonal moving average to estimate the Hessian. According to the table, preconditioning indeed improves the performance of Bayesian inference, and the choice of estimation strategy does not matter much; thus, we choose the most efficient one: diagonal estimation. We also include more complete empirical results on Bayesian neural networks, including comparisons with the plain version of SVGD (Liu & Wang, 2016), SGLD (Welling & Teh, 2011), and Sliced SVGD (S-SVGD) (Gong et al., 2020). Our algorithm remains the most competitive according to the table. Besides, we conduct an ablation study to further justify the importance of preconditioning, which demonstrates the value of the Q(·) design.
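The "Diagonal" strategy above can be sketched as an Adam-style moving average of squared scores serving as a diagonal Hessian/Fisher surrogate. The hyperparameters beta, eps, and alpha below are illustrative, not the paper's values:

```python
import numpy as np

# Diagonal moving-average preconditioner: track H_hat = EMA of squared scores,
# then precondition an update direction by H_hat^{-alpha}.
class DiagonalPreconditioner:
    def __init__(self, dim, beta=0.9, eps=1e-8, alpha=0.5):
        self.v = np.zeros(dim)
        self.beta, self.eps, self.alpha = beta, eps, alpha

    def update(self, scores):  # scores: (n, d) batch of grad log p* at particles
        self.v = self.beta * self.v + (1 - self.beta) * np.mean(scores**2, axis=0)

    def apply(self, direction):  # H_hat^{-alpha} applied elementwise
        return direction / (self.v + self.eps) ** self.alpha

pre = DiagonalPreconditioner(dim=2)
rng = np.random.default_rng(0)
for _ in range(100):
    # toy scores with very different scales per dimension (ill-conditioning)
    pre.update(rng.normal(size=(32, 2)) * np.array([10.0, 0.1]))
step = pre.apply(np.ones(2))
print(step)  # the large-curvature direction is damped, the flat one amplified
```

This mirrors how adaptive optimizers equalize per-coordinate scales without forming a full Hessian.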
In Table 11, we have shown that the full algorithm with a proper Q outperforms the plain version. One may wonder why we choose particle-based VI rather than Langevin dynamics. In the continuous case, Langevin dynamics solves the Fokker-Planck equation; we have several reasons for preferring the proposed framework.

1. Motivation of particle-based variational inference: deterministic updates and repulsive interactions. A key algorithmic difference between particle-based variational inference and Langevin dynamics is the realization of ∇ ln p_t: particle-based variational inference explicitly estimates a deterministic repulsive function, while Langevin dynamics uses Brownian motion. The deterministic repulsive force introduces interactions between particles, whereas the stochastic version only maintains the variance through randomness. Thus, for each particle, only the deterministic algorithm leads to convergence, which is more stable and efficient (w.r.t. the number of particles). Figure 8 demonstrates this phenomenon: we use both particle-based variational inference and Langevin dynamics to sample from a Gaussian distribution. Although both algorithms return reasonable particles, the Langevin-induced particles are fully random and highly unstable due to the empirical randomness, while particle-based VI is robust across random seeds. When the number of particles is small, the sample efficiency of particle-based VI should be much better than that of Langevin dynamics. Besides, the deterministic update induces a transport function that can map inputs to the target distribution directly, which is a great potential advantage of our proposed framework: for example, we can maintain the composite function of all the particle updates.
When x_{t+1} = f_t(x_t) = x_t + \eta g(t, x_t) for t = 1, \dots, T-1, we have x_T = f_{T-1} \circ \cdots \circ f_1(x_1), so we can perform resampling painlessly. For the Gaussian case, this composite function is just a linear transform without other overheads. For neural networks, we may also use distillation to obtain a transport function; we believe the distillation of the transport function has great potential in the future.

2. Discretization of the Fokker-Planck equation: Forward-Flow discretization vs Euler discretization. Regarding the discretization of the Fokker-Planck equation, Langevin dynamics performs Forward-Flow (FFl) discretization, which is not the same as conventional gradient descent (Euler discretization). In short, FFl only discretizes the ln p_* term and solves the ln p_t term exactly with an SDE, so the discretized gradient is biased in general (Wibisono, 2018). Particle-based variational inference instead performs Euler discretization, which is unbiased because it discretizes the full (Wasserstein) gradient, similar to conventional gradient descent. Thus, from a theoretical perspective, Euler-type discretization is simpler and more direct, and is worth exploring further, as in our paper. In particular, neural networks have been shown to outperform conventional RKHS methods in many areas, because they form a non-linear function class with learnable features that can handle uneven subspaces, whereas the RBF kernel applies the same smoothing operator to all gradients. In Figure 3, we have shown that the RKHS is incapable of capturing the functional gradient near connected clusters, while neural networks can do so. As a result, the sampling quality can be improved.
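The composite transport map x_T = f_{T-1} ∘ ⋯ ∘ f_1(x_1) described above can be sketched for the Gaussian/linear case, where each step f_t(x) = x + η g(t, x) is affine and the whole trajectory collapses into a single transform (the matrices below are a toy illustration, not the algorithm's actual updates):

```python
import numpy as np

# Compose affine particle updates f_t(x) = x + A_t x + b_t into one map (W, c),
# so a fresh sample can be transported in a single shot: x_T = W x_1 + c.
def compose_affine(steps):
    W, c = np.eye(2), np.zeros(2)
    for A, b in steps:
        M = np.eye(2) + A
        W, c = M @ W, M @ c + b
    return W, c

# toy example: 50 repeated contractions toward the point [1, 2]
steps = [(-0.1 * np.eye(2), 0.1 * np.array([1.0, 2.0]))] * 50
W, c = compose_affine(steps)
x_new = W @ np.array([5.0, -3.0]) + c  # resample without re-running the dynamics
print(x_new)                            # close to the fixed point [1, 2]
```

For non-affine (neural network) updates, the analogous object would be the composed network, or a distilled approximation of it, as discussed above.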



For any positive-definite matrix, the condition number is the ratio of the maximal eigenvalue to the minimal eigenvalue. A matrix with a low condition number is well-conditioned, while one with a high condition number is ill-conditioned. The infinite-dimensional version is provided in Appendix A.7.



Figure 1: Illustration of the functional gradient g(t, x) compared with optimal transport. (a)-(d) denote the corresponding Q of the regularized functional gradient. (a)-(b) are SVGD algorithms with linear and RBF kernels. Optimal transport denotes the direction of the shortest path towards p_*.

Figure 3: Particle-based VI for Gaussian mixture sampling.

Gaussian Mixture. To demonstrate the capacity of the non-linear function class, we conducted Gaussian mixture experiments to show the advantage over the linear function class (RBF kernel) with SVGD. We consider sampling from a 10-cluster Gaussian mixture distribution. Both SVGD and our algorithm are trained with 1,000 particles. Fig. 3 shows that the "score" estimated by the RBF kernel is usually unsatisfactory: (1) in low-density areas, it suffers from gradient vanishing, which makes samples stuck at these parts (similar to Fig. 1 (b)); (2) the score function cannot distinguish connected clusters. Specifically, some clusters are isolated while others might be connected, and the choice of bandwidth is hard. The fixed bandwidth makes the SVGD algorithm un-

Figure 5: Posterior sampling for Bayesian logistic regression. (200 particles; dataset: sonar (first column) and Australian (second column). µ t : particle mean; µ * : posterior mean.)

Figure 6: Test accuracy and NLL of Bayesian Neural Network (MNIST classification)

Figure 7: Illustration of the µ_* µ_t^⊤ term by Helmholtz decomposition. (µ_t = [0, 10], µ_* = [20, 20])

Figure 8: Particle-based VI (a-c) vs Langevin dynamics (d-f). (Blue dots refer to particles obtained.)

3. Function classes: non-linear function classes (neural networks) vs linear function classes (RKHS). (Note that the linearity is w.r.t. function bases rather than plain linear functions.) In our framework, both RKHS and neural networks (or other function classes) are valid for estimating the Wasserstein gradient.

Without ambiguity, ∇ stands for ∇_x for conciseness. The notation \|x\|_H^2 stands for x^⊤ H x, and \|x\|_I is denoted by \|x\|. The notation \|·\|_{\mathcal{H}^d} denotes the RKHS norm on \mathbb{R}^d.



Averaged test root-mean-square error (RMSE) and test log-likelihood of Bayesian Neural Networks on UCI datasets (100 particles). Results are computed by 10 trials.

Time Comparison on Bayesian logistic regression (sonar dataset, 1,000 iterations). PFG: 7.87 ±0.16, 7.95 ±0.15, 8.15 ±0.17, 11.56 ±0.24, 13.22 ±0.26.

attempt to find tractable formulations with spatial discretization, which suffers from the curse of dimensionality. More recently, Mokrov et al. (2021) and Alvarez-Melis et al. (2021) leverage neural networks to model the Wasserstein gradient and aim to find the full transport path between p_0 and p_*; the computation of the full path is extremely expensive. Salim et al. (2020) define a proximal gradient in Wasserstein space via the JKO operator; however, that work mainly focuses on theoretical properties, and an efficient implementation remains open. Wang et al. (2022) solve an SDP to approximate the Wasserstein gradient, which considers the dual form of the variational problem. When the functional is the KL divergence, the Wasserstein gradient can also be realized with Langevin dynamics.

Bayesian logistic regression (Covtype dataset)

Illustration of Variance Collapse in RBF function classes (20-dim N (0, I))

Estimating the dimension-averaged marginal variance of N (0, I)

Energy distance (×10 -2 ) between the target distribution and the particle estimation on Multimodal Gaussian mixture distribution (4 modes)

Average Performance on Hierarchical Logistic Regression (German dataset)

Comparison with plain version of SVGD and SGLD. Averaged test root-mean-square error (RMSE) and test log-likelihood of Bayesian Neural Networks on UCI datasets (100 particles). Results are computed by 10 trials.

Comparison with Sliced SVGD (S-SVGD). Averaged test root-mean-square error (RMSE) and test log-likelihood of Bayesian Neural Networks on UCI datasets (100 particles). Results are computed by 10 trials.

Ablation Study. Averaged test root-mean-square error (RMSE) and test log-likelihood of Bayesian Neural Networks on UCI datasets (100 particles). Results are computed by 10 trials.

ACKNOWLEDGEMENTS

The work was supported by the General Research Fund (GRF 16310222 and GRF 16201320).

C IMPLEMENTATION DETAILS

All experiments are conducted with Python 3.7 on an NVIDIA 2080 Ti. In particular, we use PyTorch 1.9 to build models; NumPy, SciPy, scikit-learn, Matplotlib, and Pillow are also used.

C.1 SYNTHETIC EXPERIMENTS

C.1.1 ILL-CONDITIONED GAUSSIAN DISTRIBUTION

For the ill-conditioned Gaussian distribution, we reproduce the continuous dynamics with Euler discretization. We consider 4 cases, e.g., (a) µ_* = [20, 20]^⊤, Σ_* = diag(100, 1), to illustrate the behavior of SVGD and our algorithm clearly. Notably, for our algorithm and SVGD (linear kernel), we use step size 10^-3 to approximate the continuous dynamics and solve the mean and variance exactly (equivalent to infinite particles); for SVGD with RBF kernel, we use step size 10^-2 with 1,000 particles to approximate the mean and variance. For the neural network function class, we assume a two-layer network class F_θ = {f_θ : θ ∈ Θ}. In practice, we may incorporate a base shift f_0, as in gradient boosting, to accelerate convergence, scaled by some constant c selected from {0, 0.1, 0.2, 0.5, 1} by validation. Empirically, high-dimensional models (such as Bayesian neural networks) can be accelerated significantly with the base function.

C.1.2 GAUSSIAN MIXTURE

For the Gaussian mixture, we sample a 2-d 10-cluster Gaussian mixture, where the cluster means are sampled from a standard normal distribution, the covariance matrix is 0.1² I, and the marginal probability of each cluster is 1/10. The function class is a 2-layer network with 32 hidden neurons and Tanh activation. For the inner loop, we choose T′ = 5 and the SGD optimizer (lr = 1e-3) with momentum 0.9. For particle optimization, we choose step size 1e-1. For SVGD, we choose the RBF kernel with median bandwidth and step size 1e-2.

C.2 APPROXIMATE POSTERIOR INFERENCE

To compute the trace, we use randomized trace estimation (Hutchinson's trace estimator) (Hutchinson, 1989) to accelerate the computation:
• Sample {ξ_i}_{i=1}^K such that E[ξ] = 0 and Cov[ξ] = I, where ξ follows the Rademacher distribution;
• Estimate the trace by tr(A) ≈ (1/K) Σ_{i=1}^K ξ_i^⊤ A ξ_i.
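A minimal sketch of the estimator (the matrix and probe count are illustrative; `matvec` stands for any Jacobian-vector product, as used for the divergence term):

```python
import numpy as np

# Hutchinson's trace estimator with Rademacher probes:
# tr(A) ~ (1/K) sum_i xi_i^T A xi_i, where E[xi] = 0 and Cov[xi] = I.
def hutchinson_trace(matvec, dim, num_probes=1000, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        xi = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe
        total += xi @ matvec(xi)                 # xi^T A xi via one matvec
    return total / num_probes

A = np.array([[2.0, 1.0], [1.0, 3.0]])
est = hutchinson_trace(lambda v: A @ v, dim=2)
print(est)  # close to tr(A) = 5
```

In practice, `matvec` is realized as a vector-Jacobian product of the gradient network, so the full Jacobian is never formed.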

C.2.1 LOGISTIC REGRESSION

For logistic regression, we compare our algorithm with SVGD in terms of the "goodness" of the particle distribution. The ground truth is computed by NUTS (Hoffman et al., 2014): the mean is computed from 40,000 samples, and the MMD is computed with 4,000 samples. The H in this part is chosen as [...]. In the experiments, we select the step size from {10^-1, 10^-2, 10^-3, 10^-4} for each algorithm by validation, and we use the Adam optimizer without momentum. For SVGD, we choose the RBF kernel with median bandwidth. For PFG, we select the inner loop from {1, 2, 5, 10} by validation; the hidden layer has 32 neurons.

C.2.2 HIERARCHICAL LOGISTIC REGRESSION

For hierarchical logistic regression, we use test performance (likelihood and accuracy) to measure the different algorithms. The H in this part is chosen as H = Ĥ^{0.5}. For all experiments, the batch size is 200, and we select the step size from {10^-1, 10^-2, 10^-3, 10^-4} for each algorithm by validation; we use the Adam optimizer without momentum. For SVGD, we choose the RBF kernel with median bandwidth. For PFG, we select the inner loop from {1, 2, 5, 10} by validation; the hidden layer has 32 neurons.

C.2.3 BAYESIAN NEURAL NETWORKS (BNN)

We have two experiments in this part: UCI datasets and MNIST classification. The metrics include MSE/accuracy and likelihood. The hidden layer size is chosen from {32, 64, 128, 256, 512}. To approximate H, we use the approximated diagonal Hessian matrix Ĥ and choose H = Ĥ^α, where α ∈ {0, 0.1, 0.2, 0.5, 1}. The inner loop T′ of PFG is chosen from {1, 2, 5, 10}, with the SGD optimizer (lr = 1e-3, momentum = 0.9).

For the UCI datasets, we follow the setting of SVGD (Liu & Wang, 2016). Data samples are randomly partitioned into two parts, 90% (training) and 10% (testing), and we use 100 particles for inference. We use a two-layer BNN with 50 hidden units (100 for the Year dataset) and ReLU activation. The batch size is 100 (1,000 for the Year dataset), and the step size is 0.01.

For MNIST, we use a two-layer BNN with 128 neurons and compare experiments with different particle sizes. Step sizes are chosen from {10^-1, 10^-2, 10^-3, 10^-4}; the batch size is 100.

