IMPROVED STEIN VARIATIONAL GRADIENT DESCENT WITH IMPORTANCE WEIGHTS

Abstract

Stein Variational Gradient Descent (SVGD) is a popular sampling algorithm used in various machine learning tasks. It is well known that SVGD arises from a discretization of the kernelized gradient flow of the Kullback-Leibler divergence $D_{\mathrm{KL}}(\cdot \mid \pi)$, where $\pi$ is the target distribution. In this work, we propose to enhance SVGD via the introduction of importance weights, which leads to a new method for which we coin the name β-SVGD. In the continuous time and infinite particles regime, the time for this flow to converge to the equilibrium distribution $\pi$, quantified by the Stein Fisher information, depends on $\rho_0$ and $\pi$ very weakly. This is very different from the kernelized gradient flow of the Kullback-Leibler divergence, whose time complexity depends on $D_{\mathrm{KL}}(\rho_0 \mid \pi)$. Under certain assumptions, we provide a descent lemma for the population limit β-SVGD, which covers the descent lemma for the population limit SVGD when $\beta \to 0$. We also illustrate the advantages of β-SVGD over SVGD by simple experiments.

1. INTRODUCTION

The main technical task of Bayesian inference is to estimate integrals with respect to the posterior distribution $\pi(x) \propto e^{-V(x)}$, where $V : \mathbb{R}^d \to \mathbb{R}$ is a potential. In practice, this is often reduced to sampling points from the distribution $\pi$. Typical methods that employ this strategy include algorithms based on Markov Chain Monte Carlo (MCMC), such as Hamiltonian Monte Carlo (Neal, 2011), also known as Hybrid Monte Carlo (HMC) (Duane et al., 1987; Betancourt, 2017), and algorithms based on Langevin dynamics (Dalalyan & Karagulyan, 2019; Durmus & Moulines, 2017; Cheng et al., 2018). On the other hand, Stein Variational Gradient Descent (SVGD), a different strategy suggested by Liu & Wang (2016), is based on an interacting particle system. In the population limit, the interacting particle system can be seen as the kernelized negative gradient flow of the Kullback-Leibler divergence

$$D_{\mathrm{KL}}(\rho \mid \pi) := \int \log\Big(\frac{\rho}{\pi}\Big)(x)\, d\rho(x); \tag{1}$$

see (Liu, 2017; Duncan et al., 2019).

SVGD has already been widely used in a variety of machine learning settings, including variational auto-encoders (Pu et al., 2017), reinforcement learning (Liu et al., 2017), sequential decision making (Zhang et al., 2018; 2019), generative adversarial networks (Tao et al., 2019) and federated learning (Kassab & Simeone, 2022). However, the current theoretical understanding of SVGD is limited to its infinite particle version (Liu, 2017; Korba et al., 2020; Salim et al., 2021; Sun et al., 2022), and the theory of finite particle SVGD is far from satisfactory.

Since SVGD is built on a discretization of the kernelized negative gradient flow of (1), we can learn about its sampling potential by studying this flow. In fact, a simple calculation (for example, see Korba et al. (2020)) reveals that

$$\min_{0 \le s \le t} I_{\mathrm{Stein}}(\rho_s \mid \pi) \le \frac{D_{\mathrm{KL}}(\rho_0 \mid \pi)}{t}, \tag{2}$$

where $I_{\mathrm{Stein}}(\rho_s \mid \pi)$ is the Stein Fisher information (see Definition 2) of $\rho_s$ relative to $\pi$, which is typically used to quantify how close to $\pi$ the probability distributions $(\rho_s)_{s=0}^{t}$ generated along this flow are. In particular, if our goal is to guarantee $\min_{0 \le s \le t} I_{\mathrm{Stein}}(\rho_s \mid \pi) \le \varepsilon$, result (2) says that we need to take $t \ge \frac{D_{\mathrm{KL}}(\rho_0 \mid \pi)}{\varepsilon}$. Unfortunately, and this is the key motivation for our work, the initial KL divergence $D_{\mathrm{KL}}(\rho_0 \mid \pi)$ can be very large. Indeed, it can be proportional to the underlying dimension, which is highly problematic in high dimensional regimes. Salim et al. (2021) and Sun et al. (2022) have recently derived an iteration complexity bound for the infinite particle SVGD method. However, similarly to the time complexity of the continuous flow, their bound depends on $D_{\mathrm{KL}}(\rho_0 \mid \pi)$.
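The calculation behind (2) can be sketched in a few lines. The sketch takes as given the standard dissipation identity along the kernelized flow, $\frac{d}{ds} D_{\mathrm{KL}}(\rho_s \mid \pi) = -I_{\mathrm{Stein}}(\rho_s \mid \pi)$, and the non-negativity of the KL divergence; regularity conditions are not tracked here, so it should be read as an illustration rather than a complete proof.

```latex
\begin{align*}
t \cdot \min_{0 \le s \le t} I_{\mathrm{Stein}}(\rho_s \mid \pi)
  &\le \int_0^t I_{\mathrm{Stein}}(\rho_s \mid \pi)\, ds \\
  &= D_{\mathrm{KL}}(\rho_0 \mid \pi) - D_{\mathrm{KL}}(\rho_t \mid \pi) \\
  &\le D_{\mathrm{KL}}(\rho_0 \mid \pi).
\end{align*}
```

Dividing both sides by $t$ gives (2). The only quantity depending on the initialization is $D_{\mathrm{KL}}(\rho_0 \mid \pi)$, which is why a poor initialization directly inflates the time bound.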

1.1. SUMMARY OF CONTRIBUTIONS

In this paper, we design a family of continuous time flows, which we call β-SVGD flows, by combining importance weights with the kernelized gradient flow of the KL-divergence. Surprisingly, we prove that the time for this flow to converge to the equilibrium distribution $\pi$, that is, the time $t$ needed to guarantee $\min_{0 \le s \le t} I_{\mathrm{Stein}}(\rho_s \mid \pi) \le \varepsilon$ with $(\rho_s)_{s=0}^{t}$ generated along the β-SVGD flow, can be bounded by $-\frac{1}{\varepsilon\beta(\beta+1)}$ when $\beta \in (-1, 0)$. This indicates that the importance weights can potentially accelerate SVGD. We then design the β-SVGD method based on a discretization of the β-SVGD flow and provide a descent lemma for its population limit version. Some simple experiments in Appendix D verify our predictions. We summarize our contributions as follows:

• A new family of flows. We construct a family of continuous time flows for which we coin the name β-SVGD flows. These flows do not arise from a time re-parameterization of the SVGD flow since their trajectories are different, nor can they be seen as the kernelized gradient flows of the Rényi divergence.

• Convergence rates. As $\beta \to 0$, we recover the kernelized gradient flow of the KL-divergence (SVGD flow); when $\beta \in (-1, 0)$, the convergence rate of β-SVGD flows is significantly better than that of the SVGD flow in the case where $D_{\mathrm{KL}}(\rho_0 \mid \pi)$ is large. Under a Stein Poincaré inequality, we derive an exponential convergence rate of the 2-Rényi divergence along the 1-SVGD flow. The Stein Poincaré inequality is proved to be weaker than the Stein log-Sobolev inequality; however, as with the Stein log-Sobolev inequality, it is not clear to us when it holds.

• Algorithm. We design the β-SVGD algorithm based on a discretization of the β-SVGD flow, and we derive a descent lemma for the population limit β-SVGD.

• Experiments. Finally, we run some experiments to illustrate the advantages of β-SVGD with negative $\beta$. The simulation results on β-SVGD when $\beta$ changes from positive to negative corroborate our theory.

1.2. RELATED WORKS

The SVGD sampling technique was first presented in the fundamental work of Liu & Wang (2016). Since then, a number of SVGD variations have been proposed. The following is a partial list: Newton version SVGD (Detommaso et al., 2018), stochastic SVGD (Gorham et al., 2020), mirrored SVGD (Shi et al., 2021), random-batch method SVGD (Li et al., 2020) and matrix kernel SVGD (Wang et al., 2019). The theoretical understanding of SVGD is still restricted to population limit SVGD. The first work to demonstrate the convergence of SVGD in the population limit was by Liu (2017); Korba et al. (2020) then derived a similar descent lemma for the population limit SVGD using a different approach. However, their results relied on the path information and thus were not self-contained. To provide a clean analysis, Salim et al. (2021) assumed a Talagrand's $T_1$ inequality of the target distribution $\pi$ and gave the first iteration complexity analysis in terms of the dimension $d$. Following the work of Salim et al. (2021), Sun et al. (2022) derived a descent lemma for the population limit SVGD under a non-smooth potential $V$.

In this paper, we consider a family of generalized divergences, the Rényi divergences, and SVGD with importance weights. For these two themes, we mention a few related results; the list is not exhaustive. Wang et al. (2018) proposed to use the $f$-divergence instead of the KL-divergence in the variational inference problem, where $f$ is a convex function; Yu et al. (2020) also considered variational inference with the $f$-divergence, but in its dual form. Han & Liu (2017) considered combining importance sampling with SVGD; however, the importance weights were only used to adjust the final sampled points, and not in the iteration of SVGD as in this paper. Liu & Lee (2017) also considered importance sampling; they designed a black-box scheme to calculate the importance weights (called Stein importance weights in their paper) of any set of points.

2. PRELIMINARIES

We assume the target distribution satisfies $\pi \propto e^{-V}$, and that we have an oracle to calculate the value of $e^{-V(x)}$ for all $x \in \mathbb{R}^d$.

2.1. NOTATION

Let $x = (x_1, \dots, x_d)^\top$, $y = (y_1, \dots, y_d)^\top \in \mathbb{R}^d$; we denote $\langle x, y \rangle := \sum_{i=1}^d x_i y_i$ and $\|x\| := \sqrt{\langle x, x \rangle}$. For a square matrix $B \in \mathbb{R}^{d \times d}$, the operator norm and Frobenius norm of $B$ are defined by $\|B\|_{op} := \sqrt{\varrho(B^\top B)}$ and $\|B\|_F := \sqrt{\sum_{i=1}^d \sum_{j=1}^d B_{i,j}^2}$, respectively, where $\varrho(\cdot)$ denotes the spectral radius. It is easy to verify that $\|B\|_{op} \le \|B\|_F$.

Let $\mathcal{P}_2(\mathbb{R}^d)$ denote the space of probability measures with finite second moment; that is, for any $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ we have $\int \|x\|^2\, d\mu(x) < +\infty$. The Wasserstein 2-distance between $\rho, \mu \in \mathcal{P}_2(\mathbb{R}^d)$ is defined by

$$W_2(\rho, \mu) := \sqrt{\inf_{\eta \in \Gamma(\rho, \mu)} \int \|x - y\|^2\, d\eta(x, y)},$$

where $\Gamma(\rho, \mu)$ is the set of all joint distributions defined on $\mathbb{R}^d \times \mathbb{R}^d$ having $\rho$ and $\mu$ as marginals. The push-forward distribution of $\rho \in \mathcal{P}_2(\mathbb{R}^d)$ by a map $T : \mathbb{R}^d \to \mathbb{R}^d$, denoted by $T_\# \rho$, is defined as follows: for any measurable set $\Omega \subset \mathbb{R}^d$, $T_\# \rho(\Omega) := \rho(T^{-1}(\Omega))$. By the definition of the push-forward distribution, it is not hard to verify that the probability densities satisfy $T_\# \rho(T(x))\, |\det DT(x)| = \rho(x)$, where $DT$ is the Jacobian matrix of $T$. The reader can refer to Villani (2009) for more details.
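The two matrix norms and the Wasserstein 2-distance above can be checked numerically on small examples. The following is a minimal NumPy sketch; the helper names (`operator_norm`, `frobenius_norm`, `w2_empirical_1d`) are hypothetical and introduced only for illustration. The one-dimensional Wasserstein helper relies on the standard fact that, for two empirical measures on the real line with the same number of equally weighted atoms, the optimal coupling matches sorted samples.

```python
import numpy as np


def operator_norm(B):
    """Operator norm: square root of the spectral radius of B^T B."""
    eigvals = np.linalg.eigvalsh(B.T @ B)  # B^T B is symmetric positive semi-definite
    return float(np.sqrt(max(eigvals.max(), 0.0)))


def frobenius_norm(B):
    """Frobenius norm: square root of the sum of squared entries."""
    return float(np.sqrt(np.sum(B ** 2)))


def w2_empirical_1d(x, y):
    """W_2 between two 1D empirical measures with equally many, equally weighted atoms.

    In one dimension the optimal coupling pairs the sorted samples.
    """
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((x_sorted - y_sorted) ** 2)))


rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
op, fro = operator_norm(B), frobenius_norm(B)
print(f"operator norm = {op:.4f}, Frobenius norm = {fro:.4f}")

# Shifting every atom by 3 gives a Wasserstein 2-distance of exactly 3.
shift_distance = w2_empirical_1d(np.array([0.0, 1.0, 2.0]), np.array([5.0, 3.0, 4.0]))
print(f"W_2 = {shift_distance:.4f}")
```

The inequality between the two norms holds because the squared Frobenius norm is the sum of all eigenvalues of $B^\top B$, while the squared operator norm is only the largest one.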

2.2. RÉNYI DIVERGENCE

Next, we define the Rényi divergence, which plays an important role in information theory and many other areas such as hypothesis testing (Morales González et al., 2013) and multiple source adaptation (Mansour et al., 2012).

Definition 1 (Rényi divergence) For two probability distributions $\rho$ and $\mu$ on $\mathbb{R}^d$ with $\rho \ll \mu$, the Rényi divergence of positive order $\alpha$ is defined as

$$D_\alpha(\rho \mid \mu) := \begin{cases} \dfrac{1}{\alpha - 1} \log \displaystyle\int \Big(\frac{\rho}{\mu}\Big)^{\alpha - 1}(x)\, d\rho(x), & 0 < \alpha < \infty,\ \alpha \neq 1, \\[2ex] \displaystyle\int \log\Big(\frac{\rho}{\mu}\Big)(x)\, d\rho(x), & \alpha = 1. \end{cases} \tag{3}$$

If $\rho$ is not absolutely continuous with respect to $\mu$, we set $D_\alpha(\rho \mid \mu) = \infty$. Further, we denote $D_{\mathrm{KL}}(\rho \mid \mu) := D_1(\rho \mid \mu)$.

The Rényi divergence is non-negative, continuous and non-decreasing in the parameter $\alpha$; specifically, we have $D_{\mathrm{KL}}(\rho \mid \mu) = \lim_{\alpha \to 1} D_\alpha(\rho \mid \mu)$. More properties of the Rényi divergence can be found in the comprehensive article by Van Erven & Harremos (2014). Besides the Rényi divergence, there are other generalizations of the KL-divergence, e.g., admissible relative entropies (Arnold et al., 2001).
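As a small numerical illustration of Definition 1, the sketch below evaluates the Rényi divergence between two one-dimensional Gaussians on a grid. The function names and the grid are assumptions made for this example. For two Gaussians with the same variance $\sigma^2$ and means differing by $\Delta$, the closed form $D_\alpha = \alpha \Delta^2 / (2\sigma^2)$ is a standard fact, which makes the monotonicity in $\alpha$ and the limit $\alpha \to 1$ easy to see.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def renyi_divergence_grid(rho, mu, dx, alpha):
    """D_alpha(rho | mu) for densities tabulated on a uniform grid (Riemann sum)."""
    if np.isclose(alpha, 1.0):
        # alpha = 1 is the KL divergence.
        return float(np.sum(rho * np.log(rho / mu)) * dx)
    integral = np.sum((rho / mu) ** (alpha - 1.0) * rho) * dx
    return float(np.log(integral) / (alpha - 1.0))


x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]
rho = gaussian_pdf(x, mean=1.0, std=1.0)
mu = gaussian_pdf(x, mean=0.0, std=1.0)

alphas = [0.5, 0.999, 1.0, 1.001, 2.0, 3.0]
values = [renyi_divergence_grid(rho, mu, dx, a) for a in alphas]
for a, v in zip(alphas, values):
    # Closed form for equal-variance Gaussians with unit mean shift: alpha / 2.
    print(f"alpha = {a:5.3f}: numeric = {v:.6f}, closed form = {a / 2.0:.6f}")
```

The values for $\alpha$ just below and just above 1 bracket the KL value, consistent with the continuity statement above.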

2.3. BACKGROUND ON SVGD

Stein Variational Gradient Descent (SVGD) is defined on a Reproducing Kernel Hilbert Space (RKHS) $H_0$ with a non-negative definite reproducing kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$. The key feature of this space is its reproducing property:

$$f(x) = \langle f(\cdot), k(x, \cdot) \rangle_{H_0}, \quad \forall f \in H_0, \tag{4}$$

where $\langle \cdot, \cdot \rangle_{H_0}$ is the inner product defined on $H_0$. Let $H$ be the $d$-fold Cartesian product of $H_0$. That is, $f \in H$ if and only if there exist $f_1, \dots, f_d \in H_0$ such that $f = (f_1, \dots, f_d)^\top$. Naturally, the inner product on $H$ is given by

$$\langle f, g \rangle_H := \sum_{i=1}^d \langle f_i, g_i \rangle_{H_0}, \quad f = (f_1, \dots, f_d)^\top \in H,\ g = (g_1, \dots, g_d)^\top \in H. \tag{5}$$

For more details on RKHS, the reader can refer to Berlinet & Thomas-Agnan (2011).

It is well known (see for example Ambrosio et al. (2005)) that $\nabla \log\big(\frac{\rho}{\pi}\big)$ is the Wasserstein gradient of $D_{\mathrm{KL}}(\cdot \mid \pi)$ at $\rho \in \mathcal{P}_2(\mathbb{R}^d)$. Liu & Wang (2016) proposed a kernelized Wasserstein gradient of the KL-divergence, defined by

$$g_\rho(x) := \int k(x, y)\, \nabla \log\Big(\frac{\rho}{\pi}\Big)(y)\, d\rho(y) \in H. \tag{6}$$

Integration by parts yields

$$g_\rho(x) = -\int \big[\nabla \log \pi(y)\, k(x, y) + \nabla_y k(x, y)\big]\, d\rho(y). \tag{7}$$

Comparing the Wasserstein gradient $\nabla \log\big(\frac{\rho}{\pi}\big)$ with (7), we find that the latter can be easily approximated by

$$g_\rho(x) \approx \hat{g}_{\hat\rho}(x) := -\frac{1}{N} \sum_{i=1}^N \big[\nabla \log \pi(x_i)\, k(x, x_i) + \nabla_{x_i} k(x, x_i)\big], \tag{8}$$

with $\hat\rho = \frac{1}{N} \sum_{i=1}^N \delta_{x_i}$ and $(x_i)_{i=1}^N$ sampled from $\rho$. With the above notation, the SVGD update rule

$$x_i \leftarrow x_i + \frac{\gamma}{N} \sum_{j=1}^N \big[\nabla \log \pi(x_j)\, k(x_i, x_j) + \nabla_{x_j} k(x_i, x_j)\big], \quad i = 1, \dots, N,$$

where $\gamma$ is the step-size, can be presented in the compact form $\hat\rho \leftarrow (I - \gamma \hat{g}_{\hat\rho})_\# \hat\rho$. When we talk about the infinite particle SVGD, or population limit SVGD, we mean $\rho \leftarrow (I - \gamma g_\rho)_\# \rho$.

The metric used in the study of SVGD is the Stein Fisher information or the Kernelized Stein Discrepancy (KSD).

Definition 2 (Stein Fisher Information) Let $\rho \in \mathcal{P}_2(\mathbb{R}^d)$. The Stein Fisher Information of $\rho$ relative to $\pi$ is defined by

$$I_{\mathrm{Stein}}(\rho \mid \pi) := \iint k(x, y) \Big\langle \nabla \log\Big(\frac{\rho}{\pi}\Big)(x), \nabla \log\Big(\frac{\rho}{\pi}\Big)(y) \Big\rangle\, d\rho(x)\, d\rho(y).$$
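The finite particle SVGD update rule stated above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the Gaussian RBF kernel with a fixed bandwidth, the function names, the step-size and the standard Gaussian target are all assumptions chosen for the example.

```python
import numpy as np


def rbf_kernel_and_grad(x, bandwidth):
    """Return k[i, j] = k(x_i, x_j) and grad_k[i, j] = gradient of k(x_i, x_j) in x_j.

    Assumes the Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).
    """
    diff = x[:, None, :] - x[None, :, :]  # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2))
    grad_second = diff * k[:, :, None] / bandwidth ** 2
    return k, grad_second


def svgd_step(x, grad_log_pi, step_size, bandwidth=1.0):
    """One SVGD update for particles x of shape (N, d)."""
    n = x.shape[0]
    k, grad_k = rbf_kernel_and_grad(x, bandwidth)
    scores = grad_log_pi(x)  # row j holds the score of the target at x_j
    # Sum over j of [k(x_i, x_j) * score_j + grad_{x_j} k(x_i, x_j)].
    drift = k @ scores + grad_k.sum(axis=1)
    return x + step_size * drift / n


def run_svgd(x0, grad_log_pi, step_size, n_iters, bandwidth=1.0):
    """Iterate the SVGD update n_iters times starting from x0."""
    x = x0.copy()
    for _ in range(n_iters):
        x = svgd_step(x, grad_log_pi, step_size, bandwidth)
    return x


# Example: standard Gaussian target in one dimension, so the score is -x.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=4.0, scale=0.5, size=(50, 1))
particles = run_svgd(x0, grad_log_pi=lambda x: -x, step_size=0.1, n_iters=1000)
print(f"mean = {particles.mean():.3f}, std = {particles.std():.3f}")
```

The first term of the drift pulls particles towards high-density regions of the target, while the kernel gradient term acts as a repulsion that keeps the particles from collapsing onto a single mode.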
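A sufficient condition under which $\lim_{n \to \infty} I_{\mathrm{Stein}}(\rho_n \mid \pi) = 0$ implies $\rho_n \to \pi$ weakly can be found in Gorham & Mackey (2017); it requires: i) the kernel $k$ to be of the form $k(x, y) = (c^2 + \|x - y\|^2)^\theta$ for some $c > 0$ and $\theta \in (-1, 0)$; ii) $\pi \propto e^{-V}$ to be distantly dissipative; roughly speaking, this requires $V$ to be convex outside a compact set, see Gorham & Mackey (2017) for a precise definition.

In the study of the kernelized Wasserstein gradient (7) and its corresponding continuity equation

$$\frac{\partial \rho_t}{\partial t} - \mathrm{div}(\rho_t g_{\rho_t}) = 0,$$

Duncan et al. (2019) introduced the following kernelized log-Sobolev inequality to prove the exponential convergence of $D_{\mathrm{KL}}(\rho_t \mid \pi)$ along the direction (7):

Definition 3 (Stein log-Sobolev inequality) We say $\pi$ satisfies the Stein log-Sobolev inequality with constant $\lambda > 0$ if

$$D_{\mathrm{KL}}(\rho \mid \pi) \le \frac{1}{2\lambda} I_{\mathrm{Stein}}(\rho \mid \pi). \tag{11}$$

While this inequality can guarantee an exponential convergence rate of $\rho_t$ to $\pi$, quantified by the KL-divergence, the condition for $\pi$ to satisfy the Stein log-Sobolev inequality is very restrictive. In fact, little is known about when (11) holds.

Figure 1: The performance of β-SVGD with three choices of $\beta$, but using the same step-size. The blue dashed line is the target distribution $\pi$: the Gaussian mixture $\frac{2}{5} N(2, 1) + \frac{3}{5} N(6, 1)$. The green solid line is the distribution generated by β-SVGD after 100 iterations; see Appendix D for more results and details.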

3. CONTINUOUS TIME DYNAMICS OF THE β-SVGD FLOW

In this section, we mainly focus on the continuous time dynamics of the β-SVGD flow. Due to page limitation, we leave all of the proofs to Appendix B.

3.1. β-SVGD FLOW

In this paper, a flow refers to some time-dependent vector field $v_t : \mathbb{R}^d \to \mathbb{R}^d$. This time-dependent vector field influences the mass distribution on $\mathbb{R}^d$ through the continuity equation, or equation of conservation of mass,

$$\frac{\partial \rho_t}{\partial t} + \mathrm{div}(\rho_t v_t) = 0;$$

the reader can refer to Ambrosio et al. (2005) for more details.

Definition 4 (β-SVGD flow) Given a weight parameter $\beta \in (-1, +\infty)$, the β-SVGD flow is given by

$$v_t^\beta(x) := -\Big(\frac{\pi}{\rho_t}\Big)^\beta(x) \int k(x, y)\, \nabla \log\Big(\frac{\rho_t}{\pi}\Big)(y)\, d\rho_t(y). \tag{13}$$

Note that when $\beta = 0$, this is the negative kernelized Wasserstein gradient (6). Note also that we cannot treat the β-SVGD flow as the kernelized Wasserstein gradient flow of the $(\beta + 1)$-Rényi divergence. However, they are closely related, and we can derive the following theorem.

Theorem 1 (Main result) Along the β-SVGD flow (13), we have¹

$$\min_{t \in [0, T]} I_{\mathrm{Stein}}(\rho_t \mid \pi) \le \begin{cases} \dfrac{e^{\beta D_{\beta+1}(\rho_0 \mid \pi)}}{T\beta(\beta+1)}, & \beta > 0, \\[2ex] \dfrac{D_{\mathrm{KL}}(\rho_0 \mid \pi)}{T}, & \beta = 0, \\[2ex] -\dfrac{1}{T\beta(\beta+1)}, & \beta \in (-1, 0). \end{cases} \tag{14}$$

¹ In fact, the proof in Appendix B yields a stronger result. When $\beta \in (-1, 0)$, the right hand side of (14) is only weakly dependent on $\rho_0$ and $\pi$ and should be $\frac{e^{\beta D_{\beta+1}(\rho_0 \mid \pi)} - e^{\beta D_{\beta+1}(\rho_T \mid \pi)}}{T\beta(\beta+1)}$, which is less than $-\frac{1}{T\beta(\beta+1)}$.

Note that the left hand side of (14) is the Stein Fisher information. When $\beta$ decreases from positive to negative, the right hand side of (14) changes dramatically; it appears to be independent of $\rho_0$ and $\pi$. If we do not know the Rényi divergence between $\rho_0$ and $\pi$, it seems the best convergence rate is obtained by setting $\beta = -\frac{1}{2}$, that is, $\min_{t \in [0, T]} I_{\mathrm{Stein}}(\rho_t \mid \pi) \le \frac{4}{T}$. It is somewhat unexpected to observe that the time complexity is independent of $\rho_0$ and $\pi$, or to be more precise, that it relies only very weakly on $\rho_0$ and $\pi$ when $\beta \in (-1, 0)$. We wish to stress that this is not achieved by time re-parameterization. In the proof of Theorem 1, we can see that the term $(\pi/\rho_t)^\beta$ in the β-SVGD flow (13) is utilized to cancel the term $(\rho_t/\pi)^\beta$ in the Wasserstein gradient of the $(\beta + 1)$-Rényi divergence.
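The cancellation just mentioned can be made explicit with a short formal computation. The sketch below is an illustration and not a substitute for the proof in Appendix B: it assumes enough smoothness and decay to differentiate under the integral and to integrate by parts, and it introduces the shorthand $F(\rho) := \int (\rho/\pi)^\beta\, d\rho = e^{\beta D_{\beta+1}(\rho \mid \pi)}$, which is not notation used elsewhere in the paper.

```latex
\begin{align*}
\frac{d}{dt} F(\rho_t)
  &= (\beta + 1) \int \Big(\frac{\rho_t}{\pi}\Big)^{\beta}(x)\, \partial_t \rho_t(x)\, dx
   = -(\beta + 1) \int \Big(\frac{\rho_t}{\pi}\Big)^{\beta}(x)\, \mathrm{div}\big(\rho_t v_t^{\beta}\big)(x)\, dx \\
  &= \beta(\beta + 1) \int \Big(\frac{\rho_t}{\pi}\Big)^{\beta}(x)
     \Big\langle \nabla \log\Big(\frac{\rho_t}{\pi}\Big)(x),\, v_t^{\beta}(x) \Big\rangle\, d\rho_t(x) \\
  &= -\beta(\beta + 1) \iint k(x, y)
     \Big\langle \nabla \log\Big(\frac{\rho_t}{\pi}\Big)(x),\, \nabla \log\Big(\frac{\rho_t}{\pi}\Big)(y) \Big\rangle\, d\rho_t(x)\, d\rho_t(y) \\
  &= -\beta(\beta + 1)\, I_{\mathrm{Stein}}(\rho_t \mid \pi).
\end{align*}
```

In the third line the factor $(\rho_t/\pi)^\beta$ is cancelled by the weight $(\pi/\rho_t)^\beta$ in (13). Integrating over $[0, T]$ and bounding the minimum by the average gives $T \min_{t \in [0,T]} I_{\mathrm{Stein}}(\rho_t \mid \pi) \le \frac{F(\rho_0) - F(\rho_T)}{\beta(\beta+1)}$. For $\beta > 0$ one drops $F(\rho_T) \ge 0$, which gives the first case of (14); for $\beta \in (-1, 0)$ one uses $0 \le F \le 1$ (since the Rényi divergence is non-negative and $\beta < 0$), which gives the third case without any reference to $\rho_0$.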
Actually, when $\beta \in (-1, 0)$, the factor $(\pi/\rho_t)^\beta$ has an added advantage and can be seen as an acceleration factor in front of the kernelized Wasserstein gradient of the KL-divergence. Specifically, the negative kernelized Wasserstein gradient of the KL-divergence, $v_t^0(x) := -\int k(x, y)\, \nabla \log\big(\frac{\rho_t}{\pi}\big)(y)\, d\rho_t(y)$, is the vector field that compels $\rho_t$ to approach $\pi$. When $(\pi/\rho_t)^\beta(x)$ is large (roughly speaking, this means $x$ is close to the mass concentration region of $\rho_t$ but away from that of $\pi$), this factor enhances the vector field at the point $x$ and forces the mass around $x$ to move faster towards the mass concentration region of $\pi$. On the other hand, if $(\pi/\rho_t)^\beta(x)$ is small (this means $x$ is already near the mass concentration region of $\pi$), this factor weakens the vector field and makes the mass surrounding $x$ remain within the mass concentration region of $\pi$. This is the intuitive justification for why, when $\beta \in (-1, 0)$, the time complexity for the β-SVGD flow to diminish the Stein Fisher information depends on $\rho_0$ and $\pi$ only very weakly.

Remark 1 While it may seem reasonable to suspect that the time complexity of the β-SVGD flow with $\beta \le -1$ will also depend on $\rho_0$ and $\pi$ very weakly, surprisingly, this is not true. In fact, we can prove (see Appendix B) that

$$\min_{t \in [0, T]} I_{\mathrm{Stein}}(\rho_t \mid \pi) \le \frac{e^{(-\beta - 1) D_{-\beta}(\pi \mid \rho_0)}}{|T\beta(\beta + 1)|}.$$

Letting $\beta \to -1$, we get $\min_{t \in [0, T]} I_{\mathrm{Stein}}(\rho_t \mid \pi) \le \frac{D_{\mathrm{KL}}(\pi \mid \rho_0)}{T}$. The regime $\beta \le -1$ is similar to the $\beta > 0$ regime in Theorem 1: the bound depends heavily on $\rho_0$ and $\pi$. Mathematically speaking, the weak dependence on $\rho_0$ and $\pi$ is caused by the concavity of the function $s \mapsto s^\alpha$ on $s \in \mathbb{R}_+$ when $\alpha = \beta + 1 \in (0, 1)$.
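To make the comparison between the three regimes of (14) concrete, the right-hand side can be evaluated numerically. The sketch below only transcribes the bound of Theorem 1; the function name `stein_fisher_bound`, its argument names, and the example values $D_{\mathrm{KL}}(\rho_0 \mid \pi) = 50$ and $T = 100$ are assumptions made for illustration.

```python
import math


def stein_fisher_bound(beta, T, renyi_0=None, kl_0=None):
    """Right-hand side of (14) for beta > -1.

    renyi_0 stands for D_{beta+1}(rho_0 | pi) (needed when beta > 0),
    kl_0 stands for D_KL(rho_0 | pi) (needed when beta == 0).
    """
    if beta > 0:
        if renyi_0 is None:
            raise ValueError("renyi_0 is required when beta > 0")
        return math.exp(beta * renyi_0) / (T * beta * (beta + 1.0))
    if beta == 0:
        if kl_0 is None:
            raise ValueError("kl_0 is required when beta == 0")
        return kl_0 / T
    if -1.0 < beta < 0.0:
        return -1.0 / (T * beta * (beta + 1.0))
    raise ValueError("Theorem 1 covers beta > -1 only")


# Hypothetical numbers: a poor initialization with D_KL(rho_0 | pi) = 50.
T = 100.0
bound_svgd = stein_fisher_bound(0.0, T, kl_0=50.0)
bound_weighted = stein_fisher_bound(-0.5, T)
print(f"beta = 0:    bound = {bound_svgd:.4f}")
print(f"beta = -0.5: bound = {bound_weighted:.4f}")

# The bound for beta in (-1, 0) is minimized at beta = -1/2, where it equals 4 / T.
grid = [-0.99 + 0.01 * i for i in range(99)]
best_beta = min(grid, key=lambda b: stein_fisher_bound(b, 1.0))
print(f"best beta on the grid: {best_beta:.2f}")
```

The choice $\beta = -\frac{1}{2}$ maximizes $|\beta(\beta + 1)|$ on $(-1, 0)$, which is why it gives the smallest bound when nothing is known about the Rényi divergence between $\rho_0$ and $\pi$.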

3.2. 1-SVGD FLOW AND THE STEIN POINCARÉ INEQUALITY

The functional $D_{\mathrm{KL}}(\cdot \mid \cdot)$ is non-symmetric; that is, $D_{\mathrm{KL}}(\cdot \mid \pi) \neq D_{\mathrm{KL}}(\pi \mid \cdot)$, and so are their Wasserstein gradients. The Wasserstein gradient of $D_{\mathrm{KL}}(\pi \mid \cdot)$ at a distribution $\rho \in \mathcal{P}_2(\mathbb{R}^d)$ is $-\nabla \frac{\pi}{\rho}$ (see Appendix A), or, to put it another way, $\frac{\pi}{\rho} \nabla \log\big(\frac{\rho}{\pi}\big)$, which may be regarded as the non-kernelized 1-SVGD flow (modulo a minus sign) when compared to (13). To conclude, the 1-SVGD flow

$$v_t^1(x) := -\frac{\pi}{\rho_t}(x) \int k(x, y)\, \nabla \log\Big(\frac{\rho_t}{\pi}\Big)(y)\, d\rho_t(y) \tag{15}$$

is the negative kernelized Wasserstein gradient flow of $D_{\mathrm{KL}}(\pi \mid \cdot)$. Next, we will study the exponential convergence of the 2-Rényi divergence along the 1-SVGD flow under the Stein Poincaré inequality.

Definition 5 (Stein Poincaré inequality) We say that $\pi$ satisfies the Stein Poincaré inequality with constant $\lambda > 0$ if

$$\int |g|^2\, d\pi \le \frac{1}{\lambda} \iint k(x, y) \langle \nabla g(x), \nabla g(y) \rangle\, d\pi(x)\, d\pi(y) \tag{16}$$

for any smooth $g$ with $\int g\, d\pi = 0$.

While Duncan et al. (2019) also introduced the Stein Poincaré inequality, they presented it in a different form. Just as the Poincaré inequality is a linearized log-Sobolev inequality (see for example (Bakry et al., 2014, Proposition 5.1.3)), the Stein Poincaré inequality is a linearized Stein log-Sobolev inequality (11). Although the Stein Poincaré inequality is weaker than the Stein log-Sobolev inequality, the condition for it to hold is quite restrictive, as in the case of the Stein log-Sobolev inequality; see the discussion in (Duncan et al., 2019, Section 6).

Lemma 1 (Stein log-Sobolev implies Stein Poincaré) If $\pi$ satisfies the Stein log-Sobolev inequality (11) with constant $\lambda > 0$, then it also satisfies the Stein Poincaré inequality with the same constant $\lambda$.

While the proof of the above lemma is a routine task, for completeness we provide it in Appendix B. The following theorem is inspired by Cao et al. (2019), who proved the exponential convergence of the Rényi divergence along Langevin dynamics under a strongly convex potential $V$.
However, due to the structure of the 1-SVGD flow, we can only prove the results for the $\alpha$-Rényi divergence with $\alpha \in (0, 2]$.

Theorem 2 Suppose $\pi$ satisfies the Stein Poincaré inequality with constant $\lambda > 0$. Then the flow (15) satisfies

$$D_2(\rho_t \mid \pi) \le C \cdot D_2(\rho_0 \mid \pi) \cdot e^{-2\lambda t}, \tag{17}$$

where $C = \frac{e^{D_2(\rho_0 \mid \pi)} - 1}{D_2(\rho_0 \mid \pi)}$.

Since $D_{\alpha_1}(\rho \mid \pi) \le D_{\alpha_2}(\rho \mid \pi)$ for any $0 < \alpha_1 \le \alpha_2 < \infty$, the exponential convergence of the $\alpha$-Rényi divergence with $\alpha \in (0, 2)$ can be easily deduced from (17).

Corollary 1 Suppose $\pi$ satisfies the Stein Poincaré inequality with constant $\lambda > 0$. Then the flow (15) satisfies

$$D_\alpha(\rho_t \mid \pi) \le C \cdot D_\alpha(\rho_0 \mid \pi) \cdot e^{-2\lambda t}$$

for all $\alpha \in (0, 2]$, where $C = \frac{e^{D_2(\rho_0 \mid \pi)} - 1}{D_\alpha(\rho_0 \mid \pi)}$.
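The deduction of Corollary 1 from Theorem 2 can be written out in one chain of inequalities. The sketch below uses only the monotonicity of the Rényi divergence in its order and the bound (17), in which the product of the constant and $D_2(\rho_0 \mid \pi)$ simplifies to $e^{D_2(\rho_0 \mid \pi)} - 1$; it assumes $D_\alpha(\rho_0 \mid \pi) > 0$ so that the constant is well defined.

```latex
\begin{align*}
D_\alpha(\rho_t \mid \pi)
  &\le D_2(\rho_t \mid \pi)
     && \text{(monotonicity in the order, } \alpha \le 2\text{)} \\
  &\le \big(e^{D_2(\rho_0 \mid \pi)} - 1\big)\, e^{-2\lambda t}
     && \text{(Theorem 2)} \\
  &= \frac{e^{D_2(\rho_0 \mid \pi)} - 1}{D_\alpha(\rho_0 \mid \pi)} \cdot D_\alpha(\rho_0 \mid \pi) \cdot e^{-2\lambda t}.
\end{align*}
```

The exponential rate $2\lambda$ is therefore the same for every $\alpha \in (0, 2]$; only the constant in front changes with $\alpha$.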

4. THE β-SVGD ALGORITHM

The β-SVGD algorithm² proposed here is a sampling method suggested by the discretization of the β-SVGD flow (13). Our method reverts to the traditional SVGD algorithm when $\beta = 0$. As in SVGD, the integral term in the β-SVGD flow (13) can be approximated by (8). However, when $\beta \neq 0$, we have to estimate the extra importance weight term $(\pi/\rho_t)^\beta$. We can use the kernel method (Silverman, 2018) to estimate $\rho_t$ given points sampled from $\rho_t$.

The idea behind kernel density estimation is simple. Assume that the Dirac $\delta(\cdot)$ can be weakly approximated by a kernel $K(\cdot)$, that is, for any bounded smooth function $f \in C_b^\infty(\mathbb{R}^d)$, we have

$$\lim_{h \to 0} \frac{1}{h^d} \int f(x) K\Big(\frac{x}{h}\Big)\, dx = \int f(x) \delta(x)\, dx = f(0).$$

Then for $\rho$, a smooth probability density function on $\mathbb{R}^d$, we have

$$\rho(x) = \int \delta(x - y) \rho(y)\, dy \approx \frac{1}{h^d} \int K\Big(\frac{x - y}{h}\Big) \rho(y)\, dy \approx \frac{1}{N h^d} \sum_{i=1}^N K\Big(\frac{x - y_i}{h}\Big) =: \hat\rho(x),$$

with $(y_i)_{i=1}^N$ sampled from $\rho$. Usually, $K$ will be a radially symmetric unimodal probability density function, for example, the standard multivariate Gaussian $K_g(x) = \frac{1}{(2\Pi)^{d/2}} e^{-\frac{\|x\|^2}{2}}$, where $\Pi$ is the area of the unit disk.

Unfortunately, we only know the value of $\pi(x)$ up to a normalizing constant; however, this constant is independent of $x$, which allows us to absorb it into the step-size. One needs to keep in mind, though, that $(\pi/\rho_t)^\beta$ may become arbitrarily large. Therefore, in the implementation of β-SVGD (Algorithm 1), we must truncate this value from above by a relatively large number $M$.

Remark 2 Note that the performance of kernel density estimation largely depends on the sample size (Parzen, 1962; Devroye & Wagner, 1979) and the bandwidth $h$ (Sheather, 2004). An optimal bandwidth is difficult to obtain even with good bandwidth selection heuristics (Scott & Sheather, 1985). Silverman (2018, Section 4.3.1) showed that the approximately optimal bandwidth $h_{\mathrm{opt}}$, in the sense of minimizing the mean integrated square error (see the section in the book), should be of order $N^{-\frac{1}{d+4}}$. When $d = 1$, a lemma from Parzen (1962) (see also Lemma 2 in Appendix C) suggests $h \sim N^{-\frac{1}{2}}$.
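The kernel density estimator described above can be sketched in NumPy as follows. This is a minimal illustration with a Gaussian kernel; the function name `kde_gaussian`, the sample size, the evaluation grid and the bandwidth constant (taken to be 1 in front of $N^{-1/(d+4)}$) are assumptions for the example.

```python
import numpy as np


def kde_gaussian(points, samples, h):
    """Gaussian kernel density estimate at `points`, given `samples`.

    Both arrays have shape (number of points, d); h is the bandwidth.
    Implements rho_hat(x) = 1 / (N h^d) * sum_i K_g((x - y_i) / h).
    """
    n, d = samples.shape
    scaled = (points[:, None, :] - samples[None, :, :]) / h
    kernel_vals = np.exp(-0.5 * np.sum(scaled ** 2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)
    return kernel_vals.sum(axis=1) / (n * h ** d)


rng = np.random.default_rng(0)
n_samples, dim = 2000, 1
samples = rng.normal(size=(n_samples, dim))       # draws from N(0, 1)
bandwidth = n_samples ** (-1.0 / (dim + 4))       # order N^{-1/(d+4)}, constant assumed to be 1

grid = np.linspace(-8.0, 8.0, 1601)[:, None]
dx = grid[1, 0] - grid[0, 0]
density = kde_gaussian(grid, samples, bandwidth)

total_mass = float(density.sum() * dx)             # Riemann sum of the estimate
density_at_zero = float(kde_gaussian(np.zeros((1, 1)), samples, bandwidth)[0])
print(f"total mass = {total_mass:.4f}")
print(f"estimate at 0 = {density_at_zero:.4f}, true value = {1.0 / np.sqrt(2.0 * np.pi):.4f}")
```

Because each kernel term integrates to one, the estimate is itself a probability density for every bandwidth; the bandwidth only controls the trade-off between bias and variance.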

4.1. NON-ASYMPTOTIC ANALYSIS FOR β-SVGD

In this section, we study the convergence of the population limit β-SVGD. Specifically, we establish a descent lemma for it. The derivation of the descent lemma is based on several assumptions. The first assumption postulates $L$-smoothness of $V$; this is typically assumed in the study of optimization algorithms, Langevin algorithms and SVGD.

Algorithm 1 Beta Stein Variational Gradient Descent (β-SVGD)
1: Input: the potential function $V : \mathbb{R}^d \to \mathbb{R}$ of the target distribution $\pi \propto e^{-V}$, density estimation kernel $K(\cdot)$, the scaling parameter $h$, reproducing kernel $k(\cdot, \cdot)$, a set of initial particles $(x_i^0)_{i=1}^N$ and iteration number $n$.
2: for $l = 0, 1, \dots, n$ do
3:   Estimate density: $\hat\rho_i^l = \frac{1}{N h^d} \sum_{j=1}^N K\Big(\frac{x_i^l - x_j^l}{h}\Big)$, $i = 1, \dots, N$
4:   Calculate the weight with choice 1: $w_i^l = \dfrac{\big(e^{-V(x_i^l)} / \hat\rho_i^l\big)^\beta \wedge M_l}{\sum_{j=1}^N \big(e^{-V(x_j^l)} / \hat\rho_j^l\big)^\beta \wedge M_l}$; or choice 2: $w_i^l = \dfrac{\big(e^{-V(x_i^l)} / \hat\rho_i^l\big)^\beta \wedge M_l}{N}$, $i = 1, \dots, N$, where $M_l$ is the truncating number
5:   Update particles with step-size $\gamma_l$: $x_i^{l+1} \leftarrow x_i^l + \gamma_l w_i^l \sum_{j=1}^N \big[-k(x_i^l, x_j^l) \nabla_{x_j^l} V(x_j^l) + \nabla_{x_j^l} k(x_i^l, x_j^l)\big]$, $i = 1, \dots, N$
6: end for
7: Return: particles $(x_i^{n+1})_{i=1}^N$.

Assumption 1 ($L$-smoothness) The potential function $V$ of the target distribution $\pi \propto e^{-V}$ is $L$-smooth; that is, $\|\nabla^2 V\|_{op} \le L$.

Our second assumption postulates two bounds involving the reproducing kernel $k(\cdot, \cdot)$, and is also common when studying SVGD; see (Liu, 2017; Korba et al., 2020; Salim et al., 2021; Sun et al., 2022).

Assumption 2 The kernel $k$ is continuously differentiable and there exists $B > 0$ such that $\|k(x, \cdot)\|_{H_0} \le B$ and $\|\nabla_x k(x, \cdot)\|_H^2 = \sum_{i=1}^d \|\partial_{x_i} k(x, \cdot)\|_{H_0}^2 \le B^2$, $\forall x \in \mathbb{R}^d$.

By the reproducing property (4), this is equivalent to $k(x, x) \le B^2$ and $\sum_{i=1}^d \partial_{x_i} \partial_{y_i} k(x, y)|_{y=x} \le B^2$ for any $x \in \mathbb{R}^d$, and this is easily satisfied by kernels of the form $k(x, y) = f(x - y)$, where $f$ is a function that is smooth at the point 0.

The third assumption was already used by Liu (2017) and Korba et al. (2020), and was later replaced by Salim et al. (2021) with a Talagrand inequality (under which the Wasserstein distance can be upper bounded by the KL-divergence), which depends on $\pi$ only. However, β-SVGD reduces the Rényi divergence instead of the KL-divergence. Since we do not have a comparable inequality for the Rényi divergence, we are forced to adopt the one from (Liu, 2017; Korba et al., 2020) here.

Assumption 3 There exists $C > 0$ such that $I_{\mathrm{Stein}}(\rho_n \mid \pi) \le C$ for all $n = 0, 1, \dots, N$.

In the proof of the descent lemma, the next two assumptions help us deal with the extra term $(\pi/\rho_n)^\beta$. Note that the fourth assumption is very weak. In fact, as long as $Z_n(x, y) \rho_n(x) \rho_n(y)$ is integrable on $\mathbb{R}^d \times \mathbb{R}^d$, then by the monotone convergence theorem, the truncating number $M_{\rho_n}(\delta)$ is always attainable, since $(\rho_n/\pi)^\beta(x) \big((\pi/\rho_n)^\beta(x) \wedge M\big)$ is non-decreasing in $M$ and converges point-wise to 1 as $M \to +\infty$.
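Assumption 4 For any small $\delta > 0$, we can find $M_{\rho_n}(\delta) > 0$ such that

$$I_{\mathrm{Stein}}(\rho_n \mid \pi) - \iint \Big(\frac{\rho_n}{\pi}\Big)^\beta(x) \Big(\Big(\frac{\pi}{\rho_n}\Big)^\beta(x) \wedge M_{\rho_n}(\delta)\Big) Z_n(x, y)\, d\rho_n(x)\, d\rho_n(y) \le \delta,$$

where $Z_n(x, y) := k(x, y) \big\langle \nabla \log\big(\frac{\rho_n}{\pi}\big)(x), \nabla \log\big(\frac{\rho_n}{\pi}\big)(y) \big\rangle$.

Our fifth and last assumption is of a technical nature, and helps us bound $\big\| \nabla_x \big\{ (\pi/\rho_n)^\beta(x) \int k(x, y)\, \nabla \log\big(\frac{\rho_n}{\pi}\big)(y)\, d\rho_n(y) \big\} \big\|_F$. It is also relatively weak, and achievable for example when the potential function of $\rho_n$ does not fluctuate wildly.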
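Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions that are not fixed by the algorithm itself: a Gaussian RBF reproducing kernel, a Gaussian density estimation kernel, fixed bandwidths, a constant step-size and a constant truncating number. All function names and parameter values are hypothetical. The target in the example is the Gaussian mixture $\frac{2}{5} N(2, 1) + \frac{3}{5} N(6, 1)$ used in Figure 1.

```python
import numpy as np


def rbf_kernel_and_grad(x, bandwidth):
    """Return k[i, j] = k(x_i, x_j) and grad_k[i, j] = gradient of k(x_i, x_j) in x_j."""
    diff = x[:, None, :] - x[None, :, :]  # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2))
    return k, diff * k[:, :, None] / bandwidth ** 2


def kde_at_particles(x, h):
    """Step 3: Gaussian kernel density estimate evaluated at the particles themselves."""
    n, d = x.shape
    scaled = (x[:, None, :] - x[None, :, :]) / h
    vals = np.exp(-0.5 * np.sum(scaled ** 2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)
    return vals.sum(axis=1) / (n * h ** d)


def importance_weights(x, potential, beta, kde_h, trunc, normalize=True):
    """Step 4: truncated weights (e^{-V} / rho_hat)^beta; choice 1 if normalize, else choice 2."""
    log_ratio = -potential(x) - np.log(kde_at_particles(x, kde_h))
    # min(exp(a), M) is computed as exp(min(a, log M)) to avoid overflow.
    raw = np.exp(np.minimum(beta * log_ratio, np.log(trunc)))
    return raw / raw.sum() if normalize else raw / x.shape[0]


def beta_svgd_step(x, potential, grad_potential, beta, step_size,
                   kde_h, kernel_h, trunc, normalize=True):
    """Step 5: one weighted particle update for particles x of shape (N, d)."""
    w = importance_weights(x, potential, beta, kde_h, trunc, normalize)
    k, grad_k = rbf_kernel_and_grad(x, kernel_h)
    direction = -k @ grad_potential(x) + grad_k.sum(axis=1)
    return x + step_size * w[:, None] * direction


def _log_components(z):
    """Unnormalized log-densities of the two mixture components."""
    return np.log(0.4) - 0.5 * (z - 2.0) ** 2, np.log(0.6) - 0.5 * (z - 6.0) ** 2


def potential(x):
    """V(x) = -log(0.4 exp(-(x-2)^2/2) + 0.6 exp(-(x-6)^2/2)), up to a constant."""
    la, lb = _log_components(x[:, 0])
    return -np.logaddexp(la, lb)


def grad_potential(x):
    """Gradient of V, written with the component responsibilities."""
    z = x[:, 0]
    la, lb = _log_components(z)
    resp_a = np.exp(la - np.logaddexp(la, lb))
    return (resp_a * (z - 2.0) + (1.0 - resp_a) * (z - 6.0))[:, None]


rng = np.random.default_rng(0)
particles = rng.normal(size=(100, 1))
for _ in range(100):
    particles = beta_svgd_step(particles, potential, grad_potential, beta=-0.5,
                               step_size=0.05, kde_h=0.4, kernel_h=1.0, trunc=1e3)
print(f"particle mean after 100 iterations: {particles.mean():.3f}")
```

With `beta=0.0` and `normalize=False` the weights all equal $1/N$ and the step reduces to the plain SVGD update, which gives a simple sanity check for an implementation. Working with the logarithm of the ratio $e^{-V} / \hat\rho$ is a numerical convenience and not part of Algorithm 1.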



² For simplicity, we will often refer to the β-SVGD algorithm simply as β-SVGD; it should not be confused with the β-SVGD flow.



$\frac{\partial \rho_t}{\partial t} - \mathrm{div}(\rho_t g_{\rho_t}) = 0$, Duncan et al. (2019) introduced the following kernelized log-Sobolev inequality to prove the exponential convergence of $D_{\mathrm{KL}}(\rho_t \mid \pi)$ along the direction (7): Definition 3 (Stein log-Sobolev inequality) We say $\pi$ satisfies the Stein log-Sobolev inequality with constant $\lambda$

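$\|\nabla_x k(x, \cdot)\|_H^2 = \sum_{i=1}^d \|\partial_{x_i} k(x, \cdot)\|_{H_0}^2 \le B^2$, $\forall x \in \mathbb{R}^d$. By the reproducing property (4), this is equivalent to $k(x, x) \le B^2$ and $\sum_{i=1}^d \partial_{x_i} \partial_{y_i} k(x, y)|_{y=x} \le B^2$ for any $x \in \mathbb{R}^d$, and this is easily satisfied by kernels of the form $k(x, y) = f(x - y)$, where $f$ is a function that is smooth at the point 0.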


Though Assumptions 3, 4 and 5 are relatively reasonable, as we stated, we do not know how to estimate the constants $C$, $M_{\rho_n}(\delta)$ and $C_{\rho_n}(\delta)$ beforehand.

With all this preparation, we can now formulate our descent lemma for the population limit β-SVGD when $\beta \in (-1, 0)$. The proof can be found in Appendix B.

[...] and Assumptions 1, 2, 4 and 5 we have the descent property [...]

Proposition 1 contains the descent lemma for the population limit SVGD (Liu, 2017; Korba et al., 2020). Indeed, letting $\beta$ and $\delta$ approach 0, the descent lemma for the population limit SVGD follows by L'Hospital's rule. When $\beta > 0$, we also have Equation (21); however, due to the sign change of $-\beta$, Equation (21) can no longer guarantee $D_{\beta+1}(\rho_{n+1} \mid \pi) < D_{\beta+1}(\rho_n \mid \pi)$ (for an asymptotic analysis, please refer to Appendix C).

Remark 3 The lack of a descent lemma for β-SVGD when $\beta > 0$ is not a great loss for us: as explained in Section 3.1, negative $\beta$ is preferable in the implementation of β-SVGD. One can see from our experiments that β-SVGD with negative $\beta$ performs much better than with positive $\beta$, which is consistent with our theory in Section 3.1.

The next corollary is a discrete time version of Theorem 1. Assuming that $M_{\rho_n}(\varepsilon)$ and $C_{\rho_n}(\varepsilon)$ have a uniform upper bound is reasonable since, intuitively, $\rho_n$ will approach $\pi$, though we cannot verify this beforehand.

Corollary 2 In Proposition 1, choose $\delta = \varepsilon$ and suppose Assumptions 1, 2, 3, 4 and 5 hold with uniformly bounded $M_{\rho_n}(\varepsilon)$ and $C_{\rho_n}(\varepsilon)$, so that $\gamma$ is uniformly lower bounded. Then we have at most [...] Korba et al. (2020). However, the existing methods either give a qualitative analysis or provide an exponential bound under more restrictive assumptions, which cannot provide useful information for the implementation of β-SVGD. Interested readers can refer to Korba et al. (2020) and Shi et al. (2021); we will not include these results in this paper.

5. CONCLUSION

We construct a family of continuous time flows, called β-SVGD flows, on the space of probability distributions. When $\beta \in (-1, 0)$, the bound on their convergence rate is independent of the initial distribution and the target distribution. Based on the β-SVGD flow, we design a family of weighted SVGD algorithms called β-SVGD. β-SVGD has a computational complexity similar to that of SVGD, but achieves a faster convergence rate in our analysis and experiments. Many questions about β-SVGD remain open and deserve further exploration, such as how to tune its parameters to make it more efficient, and how it performs on more complex models.

