LOCAL KL CONVERGENCE RATE FOR STEIN VARIATIONAL GRADIENT DESCENT WITH REWEIGHTED KERNEL

Abstract

We study the convergence properties of the Stein Variational Gradient Descent (SVGD) algorithm for sampling from an unnormalized probability distribution p_*(x) ∝ exp(-f_*(x)). Compared with the Kernelized Stein Discrepancy (KSD) convergence analyzed in previous literature, KL convergence is a more convincing criterion and better explains the effectiveness of SVGD in real-world applications. In the population limit, SVGD performs smoothed gradient descent with a kernel integral operator. Notably, SVGD with smoothing kernels suffers from gradient vanishing in low-density areas, which makes the error term between the smoothed gradient and the Wasserstein gradient uncontrollable. In this context, we introduce a reweighted kernel that amplifies the smoothed gradient in low-density areas, leading to a bounded error term. When p_*(x) satisfies the log-Sobolev inequality, we derive the convergence rate of SVGD in KL divergence with the reweighted kernel. Our analysis points out the defects of conventional smoothing kernels in SVGD and provides a KL convergence rate for SVGD.

1. INTRODUCTION

Sampling from unnormalized distributions is a crucial task in statistics. In Bayesian inference in particular, Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) are considered the two mainstream approaches to handling the intractable integration of posterior distributions. On the one hand, although methods based on MCMC, e.g., Langevin Monte Carlo (LMC) (Durmus & Moulines, 2019; Welling & Teh, 2011) and the Metropolis-adjusted Langevin algorithm (MALA) (Xifara et al., 2014), are able to approximate target distributions with arbitrarily small error (Wibisono, 2018), their sample efficiency is low due to the lack of repulsive force between samples (Duncan et al., 2019; Korba et al., 2020). On the other hand, VI-based sampling methods (Blei et al., 2017; Ranganath et al., 2014) can improve sampling efficiency by reformulating inference as an optimization problem. However, restricting the search space of the optimization problem to a parametric family, as VI does, usually leaves a large gap between its solution and the target distribution p_*. Inspired by conventional VI, a series of recent works analyze LMC as an optimization problem over the Kullback-Leibler (KL) divergence (Wibisono, 2018; Bernton, 2018; Durmus et al., 2019), i.e.,

arg min_{p ∈ P_2(R^d)} H_{p_*}(p) := D_KL(p‖p_*) = ∫ p(x) ln (p(x)/p_*(x)) dx,   (1)

where P_2(R^d) is the set of Radon-Nikodym derivatives of probability measures ν with respect to the Lebesgue measure such that p(x) = dν(x)/dx and ∫ ‖x‖² p(x) dx < ∞. LMC can then be viewed as a discrete scheme of the gradient flow of the relative entropy, driving particles with a stochastic and an energy-induced force. Moreover, to take the best of both MCMC and VI, Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) was proposed as a non-parametric VI method. It replaces the stochastic force in LMC with interaction between particles and approximates the target distribution via a driving force in a Reproducing Kernel Hilbert Space (RKHS).
This means the gradient flow of SVGD is defined by projecting the functional derivative of Eq. 1 onto the RKHS. The empirical performance of SVGD and its variants has been widely demonstrated in various tasks such as learning deep probabilistic models (Liu & Wang, 2016; Pu et al., 2017), Bayesian inference (Liu & Wang, 2016; Feng et al., 2017; Detommaso et al., 2018), and reinforcement learning (Liu et al., 2017). In addition to these rich applications, there is a substantial body of work on the theoretical analysis of SVGD. For example, the Kernelized Stein Discrepancy (KSD) convergence properties of SVGD in the asymptotic and non-asymptotic settings are investigated by Liu (2017); Lu et al. (2019) and Korba et al. (2020); Salim et al. (2021; 2022), respectively. However, unlike the KL-divergence convergence obtained in the analysis of LMC (Cheng & Bartlett, 2018; Vempala & Wibisono, 2019), KSD convergence cannot explain the effectiveness of SVGD in some real-world applications, e.g., posterior sampling (Welling & Teh, 2011) and non-convex learning (Raginsky et al., 2017). To provide KL convergence, other works (Duncan et al., 2019; Korba et al., 2020) present a linear convergence of SVGD under the Stein log-Sobolev inequality (SLSI) (Duncan et al., 2019). Nonetheless, unlike the clear meaning and checkable criteria of the standard log-Sobolev inequality (LSI) in the analysis of LMC (Vempala & Wibisono, 2019), establishing SLSI requires properties of the coupling between the designed smoothing kernel and the target distribution, which can hardly be verified for commonly used kernels (Duncan et al., 2019). In addition, SLSI is even harder to establish in higher dimensions. To fill these gaps, we aim in this paper to provide the convergence rate of SVGD (in the infinite-particle regime) in terms of the KL objective when p_* = e^{-f_*} satisfies the standard LSI.
Specifically, we first point out that SVGD with a smoothing kernel, e.g., the RBF kernel, suffers from gradient vanishing in low-density areas due to an extra p_t(x) scaling. We then demonstrate the importance of reweighted kernels obtained by dividing by p_t(x) or p_*(x), which normalizes the scaling of the smoothed gradient. With the reweighting p_*^{-1/2}(x) p_*^{-1/2}(y) applied to a kernel k(x, y) and suitable regularity conditions, SLSI can be nearly established in higher dimensions, up to an additional term controlled by the kernel approximation error. Finally, by choosing a proper reweighted smoothing kernel, the KL divergence of the SVGD dynamics attains a local linear convergence rate to any neighborhood of p_*(x) under mild assumptions when the initialization p_0(x) is relatively close to p_*(x). The main contributions of the paper are as follows:
• We introduce reweighted kernels to SVGD, which replace traditional smoothing kernels and overcome the gradient vanishing problem in low-density areas.
• We study the KL convergence rate of the SVGD algorithm. Under the standard LSI and some mild assumptions, we show that SVGD with a reweighted kernel has a local linear convergence rate to any neighborhood of p_*(x).

2. PRELIMINARIES

In this section, we first introduce important notation used in the following sections. Then, we explain how to optimize functionals on the Wasserstein space via continuous updates in the infinite-particle regime. After that, we show that LSI on the target distribution is the key condition for obtaining the KL convergence rate of LMC; obtaining the convergence rate of the SVGD dynamics under this assumption alone, however, is non-trivial. Notations. The set P_2(X) consists of probability measures µ on X with finite second-order moment. k̃ denotes a smoothing function, such as exp(-‖x‖²) or max{0, 1 - ‖x‖}.

2.1. OPTIMIZATION IN THE WASSERSTEIN SPACE

Sampling algorithms can be considered as optimizing a given functional in the Wasserstein space, as in Eq. 1. Generally, they only update particles, which drives the evolution of the particles' distribution; this evolution in turn changes the objective functional. In particular, given an initial distribution x_0 ∼ p_0(x) and a function class H, suppose the update of x_t is dx_t = ϕ_t(x_t) dt, where ϕ_t : R^d → R^d, ϕ_t ∈ H. By the continuity equation, we obtain the differential equation for p_t(x),

dp_t(x)/dt = -∇ · (p_t(x) ϕ_t(x)).   (2)

For any suitable functional F of p_t, the evolution can be written as

(d/dt) F(p_t) = ∫_{R^d} (δF/δp)(p_t) ∂_t p_t dx = ∫_{R^d} ∇(δF/δp)(p_t) · ϕ_t(x) dp_t,

where δF/δp denotes the L²(R^d) functional derivative (Villani, 2009; Duncan et al., 2019). Taking F = H_{p_*}, we have

Proposition 2.1. The evolution of the KL divergence under Eq. 2 is

dH_{p_*}(p_t)/dt = -∫ p_t(x) ϕ_t(x)^⊤ ∇ ln (p_*(x)/p_t(x)) dx.   (3)

Proposition 2.1 is a direct consequence of the functional derivative of the KL divergence, δF/δp = ln p + 1 - ln p_* (Chapter 15 of Villani (2009)). By choosing ϕ_t that decreases the KL divergence via Eq. 3, we can optimize p_t to approach p_*.
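For completeness, Proposition 2.1 follows in one line from the continuity equation and integration by parts (assuming p_t ϕ_t decays at infinity):

```latex
\frac{d}{dt}H_{p_*}(p_t)
= \int \Big(\ln\frac{p_t}{p_*}+1\Big)\,\partial_t p_t \,dx
= -\int \Big(\ln\frac{p_t}{p_*}+1\Big)\,\nabla\cdot\big(p_t\phi_t\big)\,dx
= \int p_t(x)\,\phi_t(x)^{\top}\nabla\ln\frac{p_t(x)}{p_*(x)}\,dx
= -\int p_t(x)\,\phi_t(x)^{\top}\nabla\ln\frac{p_*(x)}{p_t(x)}\,dx .
```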

2.2. LOG-SOBOLEV INEQUALITY

In the Wasserstein-space optimization literature, LSI is particularly crucial for obtaining convergence rates; it is the analogue of the Polyak-Lojasiewicz (PL) inequality in Euclidean space. LSI applies to a wider class of measures than log-concave distributions and can be checked by the Bakry-Émery criterion (Bakry & Émery, 1985). Moreover, bounded perturbations and Lipschitz mappings preserve LSI (Vempala & Wibisono, 2019), whereas log-concavity may fail under them. For example, subtracting some small Gaussians from a strongly log-concave distribution destroys its log-concavity, while it still satisfies LSI as long as the Gaussians we subtract are small enough. The target distribution p_* satisfies µ-LSI if

E_{p_*}[g² ln g²] - E_{p_*}[g²] ln E_{p_*}[g²] ≤ (2/µ) E_{p_*}[‖∇g‖²]   (4)

for any differentiable function g ∈ L²(ν). Such an inequality connects the sufficient descent of the functional evolution with its exact value. Coupling the log-Sobolev inequality with Langevin dynamics. In particular, the most popular algorithm, Langevin dynamics (Vempala & Wibisono, 2019), chooses ϕ_t(x) = ∇ ln (p_*(x)/p_t(x)), so that

dp_t(x)/dt = ∇ · (p_t(x) ∇ ln (p_t(x)/p_*(x))),   (5)

to decrease the KL functional (Eq. 3), which is equivalent to the particle update

dx_t = -∇f_*(x_t) dt + √2 dB_t,

where B_t is a standard Brownian motion. The introduction of randomness also converts ∇ ln p_t(x) into a tractable form, and the dynamics becomes

dH_{p_*}(p_t)/dt = -∫ p_t(x) ‖∇ ln (p_t(x)/p_*(x))‖² dx,   (6)

where the absolute value of the RHS of Eq. 6 is called the relative Fisher information. Taking g² = p_t/p_*, we have

H_{p_*}(p_t) ≤ (1/2µ) ∫ p_t(x) ‖∇ ln (p_t(x)/p_*(x))‖² dx.   (7)

LSI thus provides a lower bound on the gradient norm, leading to sufficient descent of the KL divergence. Combining Eq. 7 and Eq. 6, we have dH_{p_*}(p_t)/dt ≤ -2µ H_{p_*}(p_t) for some µ > 0. Note that Eq.
10 defines a functional gradient in the RKHS, which indicates the steepest KL-decreasing direction among functional vectors of unit norm. Using integration by parts, Eq. 9 can be rewritten as

ϕ_t = arg max_{ϕ∈H} ∫ p_t(x) (ϕ(x)^⊤ ∇ ln p_*(x) + ∇ · ϕ(x)) dx,  such that ‖ϕ‖_H ≤ S(p_t, q).   (10)

The explicit form of ϕ_t(x) is as follows.

Proposition 2.2. Assume that ϕ_t satisfies Eq. 10. Then

ϕ_t(x) = ∫ p_t(y) [∇ ln p_*(y) k(x, y) + ∇_1 k(y, x)] dy,   (11)

where ϕ_t(x) can be estimated from particle samples from p_t. Proposition 2.2 (proved as Lemma 3.2 in Liu & Wang (2016)) turns Eq. 2 into a practically tractable algorithm via Monte Carlo estimation of Eq. 11, where the particle samples are drawn from p_t(y). Note that Eq. 2 and Eq. 11 together yield the SVGD algorithm. Combining Propositions 2.1 and 2.2, the dissipation of the KL divergence along continuous SVGD is

dH_{p_*}(p_t)/dt = -∫_{R^d} ∫_{R^d} k(x, y) p_t(x) p_t(y) ∇ ln (p_t(x)/p_*(x)) · ∇ ln (p_t(y)/p_*(y)) dy dx.   (12)

When the kernel is strictly positive definite, the RHS of Eq. 12 is negative, so the KL divergence decreases. However, unlike the sufficient descent bounded by the functional value in Eq. 7, LSI cannot be applied to the RHS of Eq. 12 because of the kernelization, which leaves the KL convergence rate unknown. Instead, previous works (Liu, 2017; Lu et al., 2019) establish convergence in terms of KSD.
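As a concrete illustration of the Monte Carlo estimate of Eq. 11, the following sketch implements one SVGD step with a plain RBF kernel on a toy Gaussian target. The function names and the test target are ours, not part of the paper:

```python
import numpy as np

def svgd_step(x, score, step=0.2, sigma=1.0):
    """One SVGD update: phi(x_i) = mean_j [k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i)]."""
    diff = x[:, None, :] - x[None, :, :]                  # diff[j, i] = x_j - x_i, shape (n, n, d)
    k = np.exp(-np.sum(diff**2, axis=-1) / (2 * sigma**2))  # RBF kernel values k[j, i]
    grad_k = -diff / sigma**2 * k[..., None]              # gradient of k(x_j, x_i) w.r.t. x_j
    s = score(x)                                          # scores grad log p_*(x_j), shape (n, d)
    phi = (k[..., None] * s[:, None, :] + grad_k).mean(axis=0)  # average over j (Eq. 11)
    return x + step * phi

# Toy target p_*(x) = N(0, I), so grad log p_*(x) = -x.
rng = np.random.default_rng(0)
parts = rng.normal(loc=3.0, size=(200, 2))   # initialize far from the target
for _ in range(200):
    parts = svgd_step(parts, lambda x: -x)
print(parts.mean(axis=0))                    # particle mean should drift toward 0
```

Each particle is pushed by a kernel-weighted average of the scores plus a repulsive term from the kernel gradient, matching the two terms of Eq. 11.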

3. KL CONVERGENCE OF SVGD

In this section, we first show that, compared with the KSD convergence proved in most previous works on SVGD (Liu, 2017; Lu et al., 2019; Korba et al., 2020; Salim et al., 2021; 2022), KL convergence is more powerful in explaining practical performance in real-world applications. After that, we explain that the Stein log-Sobolev inequality (SLSI), the key condition in existing analyses of the KL convergence of SVGD, can hardly be verified. To work with more reasonable conditions, we propose our assumptions and validate them empirically in some simple cases. Finally, we provide the main theorem, i.e., the KL convergence of SVGD under these mild assumptions. Due to the page limit, we leave the itemized comparison with previous works to Appendix A.

3.1. SAMPLING TASKS REQUIRE KL CONVERGENCE

Sampling algorithms are widely used in real-world applications to solve the corresponding machine learning problems. Although different tasks usually require different criteria for convergence analysis, most of these criteria can be deduced from KL convergence. In Bayesian learning (Welling & Teh, 2011), the posterior satisfies p(w|z) ∝ p(w) ∏_{i=1}^n p(z_i|w), so posterior sampling amounts to solving min_{p′} H_{p(w) ∏_{i=1}^n p(z_i|w)}(p′). This means the convergence of Bayesian learning depends directly on KL convergence. Another important application of sampling algorithms is minimizing the expected excess risk

E[F(ŵ)] - F^*,   (15)

where F denotes the objective of stochastic optimization under the unknown data distribution P, F(w) = E_P[f(w, z)] = ∫_z f(w, z) P(dz), and F^* = inf_{w∈R^d} F(w). When F is L-smooth, previous gradient-based MCMC methods analyze convergence via the general decomposition

E[F(ŵ_k)] - F^* = E[F(ŵ_k)] - E[F(ŵ_*)] + E[F(ŵ_*)] - F^*
 = ∫ P^{(n)}(dz) [∫_{R^d} F(w) p_{k,z}(w) dw - ∫_{R^d} F(w) p_{*,z}(w) dw] + E[F(ŵ_*)] - F^*
 ≤ ∫ P^{(n)}(dz) [L C · W_2(p_{k,z}, p_{*,z})]  (training error)  + E[F(ŵ_*)] - F^*,   (16)

where the n-tuple of i.i.d. data samples z = {z_1, z_2, ..., z_n} is drawn from P. Hence, minimizing the Wasserstein-2 distance between p_{k,z} (MCMC samples at step k) and p_{*,z}(w) ∝ p(w|z) leads to convergence of the expected excess risk. The aforementioned results can be deduced directly from KL convergence (Raginsky et al., 2017; Xu et al., 2018) via

W_2(p_{k,z}, p_{*,z}) ≤ C · [H_{p_{*,z}}(p_{k,z})^{1/2} + (H_{p_{*,z}}(p_{k,z})/2)^{1/4}].

Unfortunately, the connection between W_2(p_{k,z}, p_{*,z}) and S(p_{k,z}, p_{*,z}) depends on the choice of RKHS, which is highly specialized and non-general. From a theoretical perspective, when the RKHS is over-smoothed with a sufficiently large bandwidth, e.g., k(x, y) = σ^{-d} exp(-‖x - y‖²/2σ²) with large σ, the corresponding kernelized gradient tends to vanish, i.e., lim_{σ→∞} E_{p_t}[k(x, y) ∇ ln (p_t(y)/p_*(y))] = 0.
This means the KSD can be made arbitrarily small by an improper choice of RKHS. In this case, KSD convergence says little about the quality of p_t, since it can be trivially controlled by special choices of RKHS.
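The vanishing kernelized gradient for large bandwidth can be checked numerically. In this hypothetical 1-D example (our own construction) with p_t = N(1, 1) and p_* = N(0, 1), the score difference ∇ ln(p_t/p_*)(y) = -(y - 1) + y = 1 is constant, so the kernelized gradient reduces to E_{p_t}[k_σ(x, y)], which shrinks as σ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=100_000)   # Monte Carlo samples from p_t = N(1, 1)

def kernelized_grad(x, sigma):
    # E_{p_t}[k_sigma(x, y) * grad ln(p_t/p_*)(y)] with k_sigma = sigma^{-d} RBF, d = 1,
    # and grad ln(p_t/p_*)(y) = 1 for this pair of Gaussians.
    k = sigma**-1 * np.exp(-(x - y)**2 / (2 * sigma**2))
    return float(np.mean(k * 1.0))

small = kernelized_grad(0.0, sigma=1.0)
large = kernelized_grad(0.0, sigma=100.0)
print(small, large)   # the large-bandwidth estimate is orders of magnitude smaller
```

Even though p_t is clearly distinct from p_*, the over-smoothed kernel hides the discrepancy, which is exactly the failure mode described above.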

3.2. KL CONVERGENCE OF SVGD WITH DIFFERENT ASSUMPTIONS

To investigate the KL convergence of SVGD, some previous works (Duncan et al., 2019; Korba et al., 2020; Salim et al., 2021) introduce the following assumption.

Assumption 1. The probability density p_* satisfies the Stein log-Sobolev inequality (SLSI) with a constant µ > 0 if, for any p_t ∈ P_2(R^d),

H_{p_*}(p_t) ≤ (1/2µ) S(p_t, p_*).   (17)

We immediately obtain

H_{p_*}(p_t) ≤ (1/2µ) ∫_{R^d} ∫_{R^d} k(x, y) p_t(x) p_t(y) ∇ ln (p_t(x)/p_*(x)) · ∇ ln (p_t(y)/p_*(y)) dy dx,   (18)

and a linear KL convergence rate follows by combining Eq. 12 and Eq. 18 (Duncan et al., 2019; Korba et al., 2020; Salim et al., 2021). However, verifying this assumption is highly non-trivial because we cannot test all p_t ∈ P_2(R^d); the RHS of Eq. 18 can be estimated only when the designed RKHS is sufficiently regular. Meanwhile, an overly regular kernel, i.e., one with

∫ p_*(x) [∇ ln p_*(x) · ∇ ln p_*(x) k(x, x) - 2∇ ln p_*(x) · ∇_1 k(x, x) + ∇_1 k(x, x) · ∇_2 k(x, x)] dx < ∞,   (19)

makes SLSI fail, as indicated in Duncan et al. (2019). Moreover, Eq. 19 holds for the most widely used smoothing kernels, such as the radial basis function (RBF) kernel. The contradiction between Eq. 18 and Eq. 19 makes the current KL-divergence analysis highly restricted. We therefore seek more reasonable assumptions for investigating the KL convergence of SVGD. Similar to Arbel et al. (2019), we place additional assumptions on the trajectory of p_t. Specifically, we assume the following.

[A1] p_* satisfies the µ-log-Sobolev inequality (Eq. 4) and f_* is L-smooth, i.e., for any x, y ∈ R^d, ‖∇f_*(x) - ∇f_*(y)‖ ≤ L‖x - y‖.
[A2] f_t is L-smooth, where p_t = e^{-f_t}.
[A3] p_t is warm: sup_{x∈R^d} p_t(x)/p_*(x) ≤ β for some constant β ≥ 1.
Theorem 3.1. Suppose Assumptions [A1]-[A3] hold, and use the reweighted kernel

k(x, y) = (p_*(x))^{-1/2} k_σ(x, y) (p_*(y))^{-1/2},  k_σ(x, y) = k̃_σ(x - y) = σ^{-d} k̃(σ^{-1}(x - y)),

where the smoother k̃ satisfies ∫_{R^d} ‖y‖⁴ k̃(y) dy ≤ M and |1 - ∫_{R^d} k_σ(x, x - y) dy| ≤ 1/(2√2). If

σ = min{1, ε · [12LM√β]^{-1} · [16Cd + 9βCd + 2M + 6βL + 3CL + βCL]^{-1}},  C = ∫ √(p_*(x)) dx,   (21)

then the KL divergence between p_t and p_* satisfies

H_{p_*}(p_t) ≤ max{0, H_{p_*}(p_0) - 64ε/µ} · exp(-µt/16) + 64ε/µ.   (22)

Remark 1. Note that C < ∞ is proven in Lemma B.1. Although we require the algorithm's trajectory to satisfy some assumptions, the actual requirements are much looser. For example, we allow the smoothness coefficient L in Assumption [A2] and the maximum density ratio β in Assumption [A3] to grow with the iteration count t. Even if the growth rate is polynomial, an O(1/ε) convergence rate can still be obtained by decreasing σ_t in the reweighted kernel of Eq. 21, as shown in Remark 3. Assumption [A3] is introduced to ensure that the Rényi divergence between p_t and p_* stays finite in a region near the target, which is easily obtained for Langevin dynamics. This theorem demonstrates that, by introducing the reweighted kernel k(x, y) and controlling the bandwidth σ of the smoother k̃, SVGD initialized in a local region attains a linear convergence rate to any ε-neighborhood of the target distribution. To validate this result, we provide experiments on synthetic data in Appendix D and illustrate that SVGD with reweighted kernels usually achieves a lower KL divergence than traditional SVGD. Note that commonly used kernels, e.g., the RBF kernel and the Bump kernel, are proper choices of k̃. Moreover, the linear convergence shows that none of the parameters deteriorate the convergence of SVGD when σ is small enough.
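A minimal sketch of the reweighted kernel used in Theorem 3.1, assuming a standard Gaussian target and an RBF smoother; the function names are ours and the bandwidth is illustrative:

```python
import numpy as np

def p_star(x):
    """Hypothetical target density: standard Gaussian N(0, 1)."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def k_sigma(x, y, sigma=0.5):
    """Plain 1-D RBF smoothing kernel with bandwidth sigma (sigma^{-d} normalization, d = 1)."""
    return sigma**-1 * np.exp(-(x - y)**2 / (2 * sigma**2))

def reweighted_kernel(x, y, sigma=0.5):
    """k(x, y) = p_*(x)^{-1/2} k_sigma(x, y) p_*(y)^{-1/2}, the reweighting of Theorem 3.1."""
    return p_star(x)**-0.5 * k_sigma(x, y, sigma) * p_star(y)**-0.5

# In a low-density region (x = 3 is about 3 standard deviations out),
# the reweighting inflates the kernel value, compensating for gradient vanishing.
print(k_sigma(3.0, 3.0), reweighted_kernel(3.0, 3.0))
```

The amplification factor is exactly 1/√(p_*(x) p_*(y)), so the correction grows in the tails, which is what keeps the kernel approximation error bounded in the analysis.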

4. REWEIGHTED KERNEL FOR KL CONVERGENCE

In this section, we explain why the reweighted kernel is introduced in Theorem 3.1. The intuition splits into two parts: (1) LSI cannot be applied directly because of the kernel approximation error; (2) with a reweighted kernel, the kernel approximation error takes a tractable form.

4.1. KERNEL APPROXIMATION

Intuitively, to measure the error between the Wasserstein gradient and its kernelized counterpart, the most direct idea is to control the error in the relative Fisher information and then apply LSI. However, this idea runs into fatal bottlenecks.

The bottleneck of SVGD analysis with Eq. 7. If we directly upper bound the descent of the KL divergence via Eq. 7, we have

dH_{p_*}(p_t)/dt ≤ -(1/2) ∫ p_t(x) ‖∇ ln (p_t(x)/p_*(x))‖² dx  (sufficient descent)
 + (1/2) ∫_{R^d} p_t(x) ‖∫ p_t(y) k(x, y) ∇ ln (p_*(y)/p_t(y)) dy - ∇ ln (p_*(x)/p_t(x))‖² dx  (kernel approximation error),   (23)

where the kernel approximation error can hardly be upper bounded because of the p_t(y) factor inside the integral. Consequently, if we directly plug a smoothing kernel into the SVGD iteration, the kernel approximation error may dominate the RHS of Eq. 23 and cause SVGD to converge, with an uncontrollable bias, to a limit different from the target distribution.

Failures of smoothing kernels. Smoothing kernels (kernel smoothers), such as the radial basis function kernel, are widely used in SVGD due to their universal approximation capability for smooth functions (Park & Sandberg, 1991; Micchelli et al., 2006). Assume that k_σ(x, y) = k̃_σ(x - y) = σ^{-d} k̃(σ^{-1}(x - y)) is a smoothing kernel with parameter σ > 0, where σ is called the bandwidth of the kernel. The bandwidth controls the smoothness of the estimated gradient: a large σ makes the kernelized gradient well estimated with finite samples, but the kernel approximation error is then large. In the population limit, where randomness from p_t(y) is ignored, the optimal smoothing kernel is the Dirac delta function δ_x(y), whose bandwidth is small enough to recover the Wasserstein gradient. At each point x, the kernelized gradient then becomes p_t(x) ∇ ln (p_*(x)/p_t(x)), which means that in low-density areas where p_t(x) → 0, the kernelization suffers from gradient vanishing.
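The gradient-vanishing claim can be verified at the Dirac limit: substituting k(x, y) = δ_x(y) into Eq. 11 and integrating the derivative of the delta by parts gives

```latex
\phi_t(x)=\int p_t(y)\big[\nabla\ln p_*(y)\,k(x,y)+\nabla_1 k(y,x)\big]\,dy
\;\xrightarrow{k\to\delta}\;
p_t(x)\nabla\ln p_*(x)-\nabla p_t(x)
= p_t(x)\,\nabla\ln\frac{p_*(x)}{p_t(x)},
```

so the Wasserstein gradient is damped by the factor p_t(x), which vanishes in low-density regions.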
Such biased Wasserstein-gradient estimation in smoothing-kernel SVGD carries an additional p_t factor, which hampers the sufficient descent of each iteration and thus the convergence rate. This indicates that SVGD is not compatible with smoothing kernels in low-density regions. Ideally, the kernel should be reweighted by 1/√(p_t(x) p_t(y)), which balances the vanishing scaling in the current SVGD. However, p_t is unknown in general, so this reweighting cannot directly provide an algorithmic improvement to SVGD. To balance the order of p_t, we may require the kernel k to be related to p_t through the following proposition.

Proposition 4.1. If we use the kernel k(x, y) = (p_t(x))^{-1/2} k_σ(x, y) (p_t(y))^{-1/2} and choose the delta function as the smoothing kernel, k_0(x, y) = δ_x(y), then the kernel approximation error is 0 and SVGD is equivalent to the Wasserstein gradient flow.

Unfortunately, reweighting with p_t(x) makes the SVGD iteration (Eq. 11) computationally intractable, as p_t(x) and ∇p_t(x) are unknown in general. If we consider the local convergence of SVGD, when p_t(x) approaches p_*(x), we can expect a small kernel approximation error by replacing p_t^{-1/2} in Proposition 4.1 with p_*^{-1/2}(·):

k(x, y) = (p_*(x))^{-1/2} k_σ(x, y) (p_*(y))^{-1/2},   (24)

where k_σ denotes the smoothing kernel and satisfies

∫_{R^d} ‖y‖⁴ k_σ(x, x - y) dy < ∞  and  |1 - ∫_{R^d} k_σ(x, x - y) dy| ≤ 1/(2√2).   (25)

Notice that many popular kernels satisfy Eq. 25, e.g., the standard RBF kernel and the Bump function (Eq. 26):

k̃_σ(z) = (1/σ) exp(-1/(1 - ‖z‖²/σ²)),  z ∈ B(0, σ).   (26)

Therefore, the dynamics of the KL divergence with such a reweighted kernel is

dH_{p_*}(p_t)/dt = -∫_{R^d} ∫_{R^d} k_σ(x, y) · (p_t(x)/√(p_*(x))) ∇ ln (p_t(x)/p_*(x)) · (p_t(y)/√(p_*(y))) ∇ ln (p_t(y)/p_*(y)) dy dx.   (27)

As with the delta-function requirement in Proposition 4.1, a small σ is preferred in our setting, since our analysis is based on the population limit of p_t(x).
Besides, our analysis indicates which kernel choice yields convergence. Compared with using a smoothing kernel directly, we can expect a smaller kernel approximation error by introducing the kernel of Eq. 24. We also validate this phenomenon in Figure 2, which shows the gradient vanishing of the smoothing kernel. Figure 2(a) is the vector field of the Wasserstein gradient ∇ ln (p_*(x)/p_t(x)), which is a linear function in the Gaussian case. With a smoothing kernel, however, the low-density area in Figure 2(b) has almost no gradient, so particles in that area get stuck. Our proposed reweighted kernel amplifies the gradient in low-density areas, and the resulting gradient field is similar to the Wasserstein gradient.
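A minimal 1-D implementation of the compactly supported Bump smoother of Eq. 26; the helper name is ours:

```python
import numpy as np

def bump_kernel(z, sigma=1.0):
    """Bump smoother of Eq. 26: (1/sigma) exp(-1/(1 - ||z||^2/sigma^2)) inside B(0, sigma), 0 outside.

    Expects a 1-D array of evaluation points z.
    """
    z = np.asarray(z, dtype=float)
    r2 = z**2 / sigma**2
    out = np.zeros_like(z)
    inside = r2 < 1.0                 # compact support: zero outside the open ball
    out[inside] = np.exp(-1.0 / (1.0 - r2[inside])) / sigma
    return out

z = np.linspace(-2.0, 2.0, 5)
print(bump_kernel(z, sigma=1.0))      # vanishes for |z| >= 1, smooth and positive inside
```

Compact support makes the normalization condition |1 - ∫ k_σ(x, x - y) dy| ≤ 1/(2√2) easy to check after rescaling, which is why the Bump function qualifies as a proper smoother in Eq. 25.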

4.2. ERROR REWEIGHTING

To achieve a better kernel approximation error with the reweighted smoothing kernel of Eq. 24, we need a corresponding local version of the log-Sobolev inequality to control the sufficient descent in the evolution of the KL divergence (Eq. 27).

Local version of the log-Sobolev inequality. In this context, the sufficient-descent term should be reformulated so that the kernel approximation error becomes well behaved. The choice of g in Eq. 38 can convert the kernel approximation error into a tractable form. Choosing g(x) = p_t(x)/p_*(x), we can bound the KL divergence by the χ² version of the Rényi information.

Lemma 4.2. Suppose p_* satisfies the µ-log-Sobolev inequality (LSI) with a constant µ > 0. For any probability density p_t with D_{χ²}(p_t, p_*) ≤ 1/2, we have

(µ/4) H_{p_*}(p_t) ≤ ∫_{R^d} p_*(x) ‖(p_t(x)/p_*(x)) ∇ ln (p_t(x)/p_*(x))‖² dx.   (28)

Lemma 4.2 indicates that the KL divergence is also bounded by the Wasserstein gradient of the χ² divergence. If p_t/p_* is bounded, Eq. 28 provides a tighter upper bound on the KL divergence than Eq. 7, especially in the tails; controlling the error term in this form is therefore more promising. With this construction, the decrease of the KL divergence satisfies

(d/dt) H_{p_*}(p_t) ≤ -(1/2) ∫_{R^d} p_*(x) ‖(p_t(x)/p_*(x)) ∇ ln (p_t(x)/p_*(x))‖² dx  (sufficient descent)
 + (1/2) ∫_{R^d} ‖(p_t(x)/√(p_*(x))) ∇ ln (p_t(x)/p_*(x)) - ∫_{R^d} k_σ(x, y) (p_t(y)/√(p_*(y))) ∇ ln (p_t(y)/p_*(y)) dy‖² dx  (kernel approximation error),

where the kernel approximation error term is at most

4ε + (1/4) ∫_{R^d} p_*(x) ‖(p_t(x)/p_*(x)) ∇ ln (p_t(x)/p_*(x))‖² dx,   (29)

and k_σ satisfies the requirements in Eq. 25. In this condition, we nearly establish the "Stein log-Sobolev inequality" with arbitrarily small ε by combining Eq. 27, Eq. 28, and Eq. 29:

(µ/8) H_{p_*}(p_t) ≤ ∫_{R^d} ∫_{R^d} k_σ(x, y) · (p_t(x)/√(p_*(x))) ∇ ln (p_t(x)/p_*(x)) · (p_t(y)/√(p_*(y))) ∇ ln (p_t(y)/p_*(y)) dy dx + 4ε.   (30)

Therefore, reweighted kernels can be considered a sufficient condition for establishing a local "Stein log-Sobolev inequality" near the target distribution p_*.
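Schematically, the near-SLSI combined with the KL dissipation yields a linear differential inequality, which integrates by Grönwall's lemma into exactly the shape of the rate in Eq. 22 (the precise constants are tracked in the full proof):

```latex
\frac{d}{dt}H_{p_*}(p_t)\le -c\,\mu\,H_{p_*}(p_t)+c'\epsilon
\;\Longrightarrow\;
H_{p_*}(p_t)\le \Big(H_{p_*}(p_0)-\frac{c'}{c\mu}\epsilon\Big)e^{-c\mu t}+\frac{c'}{c\mu}\epsilon .
```

With c = 1/16 and c' = 4 this reproduces the 64ε/µ bias term and the exp(-µt/16) rate of Eq. 22.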

A CONVERGENCE RATE COMPARISON IN SECTION 3

The assumptions are listed as follows.
[AS1] p_* satisfies the µ-log-Sobolev inequality (Eq. 4).
[AS2] p_* satisfies the Talagrand T1 inequality.
[AS3] The Stein log-Sobolev inequality holds with constant λ, i.e., KL(p‖p_*) ≤ (1/(2λ)) SD(p‖p_*)².
[AS4] f_* is L-smooth, i.e., for any x, y ∈ R^d, ‖∇f_*(x) - ∇f_*(y)‖ ≤ L‖x - y‖.
[AS5] f_t is L-smooth, where p_t = e^{-f_t}.
[AS6] p_t is warm: sup_{x∈R^d} p_t(x)/p_*(x) ≤ β for some constant β ≥ 1.
[AS7] The SVGD iterates p_t satisfy ∫ k(x, x) p_t(x) dx < ∞.
[AS8] Kernel regularization assumption 1: sup_x [(1/2) ‖∇ log p_*‖_Lip k(x, x) + 2∇_{xx′} k(x, x)] < ∞.
[AS9] Kernel regularization assumption 2: ‖k(x, ·)‖_H ≤ B and ‖∇_x k(x, ·)‖_{H^d} ≤ B.
[AS10] Bounded moment assumption: sup_t ∫ ‖x‖ p_t(x) dx < ∞.
We compare our theoretical results with all previous work, where Stein discrepancy is abbreviated as SD.

ϕ_t(x) = ∫ (k(x, y) ∇ ln p_*(y) + ∇_y k(x, y)) dp_t(y).   (35)

Notice that "Eq. 35 holds in the sense of distributions" here means

∫ ∂_t p_t(x) v(x) dx = ∫ ∇v(x) · ϕ_t(x) dp_t(x)   (37)

for all v(x) ∈ C_c^∞(R^d).

B.1 PROOF OF PROPOSITION 4.1

Proof. When k(x, y) = (p_t(x))^{-1/2} k_σ(x, y) (p_t(y))^{-1/2} and k_0(x, y) = δ_x(y), we have

k(x, y) = δ_x(y) / ((p_t(x))^{1/2} (p_t(y))^{1/2}) = δ_x(y) / p_t(y).

Thus,

ϕ_t(x) = ∫_{R^d} p_t(y) [∇ ln p_*(y) k(x, y) + ∇_1 k(y, x)] dy = ∇ ln (p_*(x)/p_t(x)),

which is exactly the Wasserstein gradient, so the kernel approximation error is 0.

B.2 PROOF OF LEMMA 4.2

Proof. Since p_* satisfies the µ-LSI, for any differentiable g : R^d → R with E_{p_*}[g²] < ∞,

E_{p_*}[g² ln g²] - E_{p_*}[g²] ln E_{p_*}[g²] ≤ (2/µ) E_{p_*}[‖∇g‖²].   (38)

Taking g = p_t/p_*, we have E_{p_*}[g²] = D_{χ²}(p_t, p_*) + 1 < ∞ and

∫_{R^d} p_*(x) (p_t(x)/p_*(x))² ln (p_t(x)/p_*(x))² dx - ∫_{R^d} p_*(x) (p_t(x)/p_*(x))² dx · ln ∫_{R^d} p_*(x) (p_t(x)/p_*(x))² dx
 ≤ (2/µ) ∫_{R^d} p_*(x) ‖∇ (p_t(x)/p_*(x))‖² dx = (2/µ) ∫_{R^d} p_*(x) ‖(p_t(x)/p_*(x)) ∇ ln (p_t(x)/p_*(x))‖² dx.   (39)

With the facts ln x ≤ x - 1 and ln x ≥ 1 - 1/x for x ≥ 0, we have

ln ∫_{R^d} p_*(x) (p_t(x)/p_*(x))² dx ≤ ∫_{R^d} p_t(x) [(p_t(x)/p_*(x)) - 1] dx
 = ∫_{R^d} (p_t²(x)/p_*(x)) · [(p_t(x)/p_*(x) - 1)/(p_t(x)/p_*(x))] dx ≤ ∫_{R^d} (p_t²(x)/p_*(x)) ln (p_t(x)/p_*(x)) dx.

Plugging the previous inequality into the LHS of Eq.
39, we have

LHS ≥ ∫_{R^d} p_*(x) (p_t/p_*)²(x) ln (p_t/p_*)²(x) dx - ∫_{R^d} p_*(x) (p_t/p_*)²(x) dx · ∫_{R^d} (p_t²(x)/p_*(x)) ln (p_t/p_*)(x) dx
 = 2∫ p_*(x) (p_t/p_*)²(x) ln (p_t/p_*)(x) dx - ∫ p_*(x) (p_t/p_*)²(x) ln (p_t/p_*)(x) dx - [∫ p_*(x) (p_t/p_*)²(x) dx - 1] · ∫ (p_t²(x)/p_*(x)) ln (p_t/p_*)(x) dx
 ≥ [2 - D_{χ²}(p_t, p_*) - 1] · ∫ p_*(x) (p_t/p_*)²(x) ln (p_t/p_*)(x) dx
 ≥ (1/2) ∫ p_*(x) (p_t/p_*)²(x) ln (p_t/p_*)(x) dx,   (41)

where the last inequality follows from D_{χ²}(p_t, p_*) ≤ 1/2. Besides, we have

∫ p_*(x) (p_t/p_*)²(x) ln (p_t/p_*)(x) dx = ∫ (p_t/p_*)(x) · p_t(x) ln (p_t/p_*)(x) dx
 = ∫ [(p_t/p_*)(x) - 1] p_t(x) ln (p_t/p_*)(x) dx + ∫ p_t(x) ln (p_t/p_*)(x) dx ≥ H_{p_*}(p_t),   (42)

where the last inequality follows from (x - 1) ln x ≥ 0 for all x ≥ 0. Combining Eq. 39, Eq. 41, and Eq. 42, we complete the proof and obtain

(µ/4) H_{p_*}(p_t) ≤ ∫_{R^d} p_*(x) ‖(p_t(x)/p_*(x)) ∇ ln (p_t(x)/p_*(x))‖² dx.

B.3 PROOF OF LEMMA 4.3

Before bounding the Kernel Approximation Error (KAE) in Lemma 4.3, we introduce some lemmas.

Lemma B.1. Assume that p(x) = e^{-f(x)} with f L-smooth and that p satisfies LSI. Then ∫ p(x)^α dx < ∞ for any α > 0.

Proof. By Ledoux (1999), when p satisfies LSI, there exists c > 0 such that ∫ p(x) e^{c‖x‖²} dx < ∞, i.e., ∫ e^{ln p(x) + c‖x‖²} dx < ∞. Hence there exists c_x > 0 such that ln p(x) < -c‖x‖² for all sufficiently large ‖x‖ > c_x. Therefore,

∫_{‖x‖>c_x} p^α(x) dx < ∫_{‖x‖>c_x} e^{-αc‖x‖²} dx < (π/(cα))^{d/2},

and by L-smoothness,

∫_{‖x‖≤c_x} p^α(x) dx ≤ ∫_{‖x‖≤c_x} p^α(0) e^{αL‖x‖²} dx < p^α(0) C(d, c_x, αL),

where C(d, c_x, αL) is a constant depending on d, c_x, and αL.

Lemma B.2. (A variant of Lemma 11 in Vempala & Wibisono (2019)) Suppose p(x) = e^{-f(x)}, x ∈ R^d, with f L-smooth, and p satisfies ∫_{R^d} √(p(x)) dx ≤ C/2. Then we have

∫ √(p(x)) ‖∇f(x)‖² dx ≤ LCd.

Proof.
Since f(x) is L-smooth, we have ∇²f(x) ⪯ LI for any x. Using integration by parts, we have

∫ e^{-f(x)/2} ‖∇f(x)‖² dx = 4∫ e^{-f(x)/2} ‖∇f(x)/2‖² dx   (44)
 = 4∫ e^{-f(x)/2} Δ(f(x)/2) dx ≤ 2Ld ∫ e^{-f(x)/2} dx = LCd.   (46)

Proof (of Lemma 4.3). In the following, we mainly focus on providing the upper bound of the Kernel Approximation Error (KAE), and have

KAE = ∫_{R^d} ‖(p_t(x)/√(p_*(x))) ∇ ln (p_t(x)/p_*(x)) - ∫_{R^d} k_σ(x, x - y) (p_t(x - y)/√(p_*(x - y))) ∇ ln (p_t(x - y)/p_*(x - y)) dy‖² dx
 ≤ 2∫_{R^d} ‖∫_{R^d} k_σ(x, x - y) [(p_t(x)/√(p_*(x))) ∇ ln (p_t(x)/p_*(x)) - (p_t(x - y)/√(p_*(x - y))) ∇ ln (p_t(x - y)/p_*(x - y))] dy‖² dx
 + 2∫_{R^d} |1 - ∫_{R^d} k_σ(x, x - y) dy|² ‖(p_t(x)/√(p_*(x))) ∇ ln (p_t(x)/p_*(x))‖² dx,   (47)

where the first equality follows from a change of variables. With the requirements on k̃, we have

k_σ(x, y) = k̃_σ(x - y) = σ^{-d} k̃((x - y)/σ),  ∫_{R^d} ‖y‖⁴ k̃(y) dy ≤ M,  and  |1 - ∫_{R^d} k_σ(x, x - y) dy| ≤ 1/(2√2).   (48)

In this condition, substituting y = σz in Eq. 47 and writing x_z := x - σz, we have

KAE ≤ 4∫_{R^d} ‖∫_{R^d} k̃(z) [(p_t(x)/√(p_*(x))) ∇ ln (p_t(x)/p_*(x)) - (p_t(x_z)/√(p_*(x_z))) ∇ ln (p_t(x_z)/p_*(x_z))] dz‖² dx  (Term 1)
 + (1/4) ∫_{R^d} p_*(x) ‖(p_t(x)/p_*(x)) ∇ ln (p_t(x)/p_*(x))‖² dx.   (49)

For any x ∈ R^d, we have

(p_t(x)/√(p_*(x))) ∇ ln (p_t(x)/p_*(x)) = (p_t(x)/√(p_*(x))) (∇f_*(x) - ∇f_t(x)),

where p_t(x) = e^{-f_t(x)} and p_*(x) = e^{-f_*(x)}. Plugging such an equation into Eq.
49, we have

Term 1 = ∫_{R^d} ‖∫_{R^d} k̃(z) {[(p_t(x)/√(p_*(x))) ∇f_*(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_*(x_z)] - [(p_t(x)/√(p_*(x))) ∇f_t(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_t(x_z)]} dz‖² dx
 ≤ 2∫ ‖∫ k̃(z) [(p_t(x)/√(p_*(x))) ∇f_*(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_*(x_z)] dz‖² dx + 2∫ ‖∫ k̃(z) [(p_t(x)/√(p_*(x))) ∇f_t(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_t(x_z)] dz‖² dx
 ≤ 2∫ [∫ k̃(z) dz · ∫ k̃(z) ‖(p_t(x)/√(p_*(x))) ∇f_*(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_*(x_z)‖² dz] dx + 2∫ [∫ k̃(z) dz · ∫ k̃(z) ‖(p_t(x)/√(p_*(x))) ∇f_t(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_t(x_z)‖² dz] dx
 ≤ 3∫∫ k̃(z) ‖(p_t(x)/√(p_*(x))) ∇f_t(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_t(x_z)‖² dz dx  (Term 1.1)
 + 3∫∫ k̃(z) ‖(p_t(x)/√(p_*(x))) ∇f_*(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_*(x_z)‖² dz dx  (Term 1.2),   (50)

where the first inequality follows from the Minkowski inequality, the second from the Cauchy-Schwarz inequality, and the third from Eq. 48. Considering Term 1.1, we have

Term 1.1 = ∫∫ k̃(z) ‖(p_t(x)/√(p_*(x))) ∇f_t(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_t(x) + (p_t(x_z)/√(p_*(x_z))) ∇f_t(x) - (p_t(x_z)/√(p_*(x_z))) ∇f_t(x_z)‖² dz dx
 ≤ 2∫∫ k̃(z) ‖[(p_t(x)/√(p_*(x))) - (p_t(x_z)/√(p_*(x_z)))] ∇f_t(x)‖² dz dx + 2∫∫ k̃(z) ‖(p_t(x_z)/√(p_*(x_z))) (∇f_t(x) - ∇f_t(x_z))‖² dz dx
 ≤ 2∫∫ k̃(z) ‖[(p_t(x)/√(p_*(x))) - (p_t(x_z)/√(p_*(x_z)))] ∇f_t(x)‖² dz dx + 2∫ k̃(z) L²σ²‖z‖² ∫ (p_t(x_z)/√(p_*(x_z)))² dx dz,   (51)

where the second inequality follows from the L-smoothness of f_t and Fubini's theorem, and the last term is controlled by the warmness of p_t (Assumption [A3]) and the second-moment bound on k̃ implied by Eq. 48. In the following, we focus on the first term on the RHS of Eq. 51, using

p_t(x)/√(p_*(x)) = exp(f_*(x)/2 - f_t(x)).

For each x ∈ R^d, R^d can be divided into two parts:

B_l(x) = {z : f_*(x)/2 - f_t(x) ≤ f_*(x_z)/2 - f_t(x_z)},  B_u(x) = {z : f_*(x)/2 - f_t(x) ≥ f_*(x_z)/2 - f_t(x_z)}.
For $z\in B_l(x)$, we have
$$\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}-\frac{p_t(x)}{\sqrt{p_*(x)}}=\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\cdot\bigg(1-\exp\Big(\frac{f_*(x)}{2}-f_t(x)-\frac{f_*(x_z)}{2}+f_t(x_z)\Big)\bigg)$$
$$\le\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\cdot\bigg(\frac{f_*(x_z)}{2}-f_t(x_z)-\frac{f_*(x)}{2}+f_t(x)\bigg)$$
$$\le\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\cdot\bigg(-\frac{\sigma}{2}\big(\nabla f_*(x)-\nabla f_*(x_z)+\nabla f_*(x_z)\big)^\top z+\sigma\nabla f_t(x_z)^\top z+\frac{3L\sigma^2}{4}\|z\|^2\bigg)$$
$$\le\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\cdot\bigg(-\frac{\sigma}{2}\nabla f_*(x_z)^\top z+\sigma\nabla f_t(x_z)^\top z+\frac{5L\sigma^2}{4}\|z\|^2\bigg), \quad(54)$$
where the first inequality follows from $1-e^{-u}\le u$, and the second and third inequalities follow from the $L$-smoothness of $f_*$ and $f_t$. Therefore, we have
$$\int_{\mathbb{R}^d}\!\int_{B_l}k(z)\bigg\|\bigg(\frac{p_t(x)}{\sqrt{p_*(x)}}-\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\bigg)\nabla f_t(x)\bigg\|^2dz\,dx$$
$$\le\int_{\mathbb{R}^d}\!\int_{B_l}k(z)\bigg\|\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\cdot\bigg(-\frac{\sigma}{2}\nabla f_*(x_z)^\top z+\sigma\nabla f_t(x_z)^\top z+\frac{5L\sigma^2}{4}\|z\|^2\bigg)\cdot\nabla f_t(x)\bigg\|^2dz\,dx$$
$$\le\int_{\mathbb{R}^d}\!\int_{B_l}k(z)\,\frac{2}{\sigma}\bigg(-\frac{\sigma}{2}\nabla f_*(x_z)^\top z+\sigma\nabla f_t(x_z)^\top z+\frac{5L\sigma^2}{4}\|z\|^2\bigg)^2 p_t(x_z)\,dz\,dx+\int_{\mathbb{R}^d}\!\int_{B_l}k(z)\,\frac{\sigma}{2}\cdot\frac{(p_t(x_z))^{1.5}}{p_*(x_z)}\cdot\big\|\nabla f_t(x)-\nabla f_t(x_z)+\nabla f_t(x_z)\big\|^2dz\,dx$$
$$\le\int_{\mathbb{R}^d}\!\int_{B_l}\sigma\,k(z)\|z\|^2\,p_t(x_z)\bigg(\frac12\|\nabla f_*(x_z)\|^2+\frac32\|\nabla f_t(x_z)\|^2+\frac{5L^2\sigma^2}{2}\|z\|^2\bigg)dz\,dx+\int_{\mathbb{R}^d}\!\int_{B_l}\sigma\,k(z)\cdot\frac{(p_t(x_z))^{1.5}}{p_*(x_z)}\cdot\big(\|\nabla f_t(x_z)\|^2+L^2\sigma^2\|z\|^2\big)dz\,dx$$
$$\le\frac{\sqrt\beta\,\sigma}{2}\int k(z)\|z\|^2\!\int\!\sqrt{p_*(x_z)}\,\|\nabla f_*(x_z)\|^2dx\,dz+\frac{3\sigma}{2}\int k(z)\|z\|^2\!\int\!\sqrt{p_t(x_z)}\,\|\nabla f_t(x_z)\|^2dx\,dz+\frac{5L^2\sigma^3}{2}\int k(z)\|z\|^4\!\int\!\sqrt{p_t(x_z)}\,dx\,dz+\beta\sigma\int k(z)\!\int\!\sqrt{p_t(x_z)}\,\|\nabla f_t(x_z)\|^2dx\,dz+\beta L^2\sigma^3\int k(z)\|z\|^2\!\int\!\sqrt{p_t(x_z)}\,dx\,dz$$
$$\le\sqrt\beta\,CdL(M+1)\sigma+3\sqrt\beta\,CdL(M+1)\sigma+\frac54\sqrt\beta\,CL^2M\sigma^3+3\beta^{1.5}CdL\sigma+\frac12\beta^{1.5}CL^2(M+1)\sigma^3. \quad(55)$$
It can be noticed that the first inequality follows from Eq. 54, the second and third inequalities follow from the Cauchy–Schwarz inequality, and the fourth inequality follows from the $\beta$-warmness during the update, Lemma B.2, and the fact $\int_{B_l}k(z)dz\le\int_{\mathbb{R}^d}k(z)dz$. Besides, the constant $C$ is provided by Lemma B.1 as $\int_{\mathbb{R}^d}\sqrt{p_*(x)}\,dx\le C$.
For $z\in B_u(x)$, we have
$$\frac{p_t(x)}{\sqrt{p_*(x)}}-\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}=\frac{p_t(x)}{\sqrt{p_*(x)}}\cdot\bigg(1-\exp\Big(\frac{f_*(x_z)}{2}-f_t(x_z)-\frac{f_*(x)}{2}+f_t(x)\Big)\bigg)$$
$$\le\frac{p_t(x)}{\sqrt{p_*(x)}}\cdot\bigg(\frac{f_*(x)}{2}-f_t(x)-\frac{f_*(x_z)}{2}+f_t(x_z)\bigg)$$
$$\le\frac{p_t(x)}{\sqrt{p_*(x)}}\cdot\bigg(\frac{\sigma}{2}\big(\nabla f_*(x_z)-\nabla f_*(x)+\nabla f_*(x)\big)^\top z-\sigma\nabla f_t(x)^\top z+\frac{3L\sigma^2}{4}\|z\|^2\bigg)$$
$$\le\frac{p_t(x)}{\sqrt{p_*(x)}}\cdot\bigg(\frac{\sigma}{2}\nabla f_*(x)^\top z-\sigma\nabla f_t(x)^\top z+\frac{5L\sigma^2}{4}\|z\|^2\bigg).$$
Similar to Eq. 55, we have
$$\int_{\mathbb{R}^d}\!\int_{B_u}k(z)\bigg\|\bigg(\frac{p_t(x)}{\sqrt{p_*(x)}}-\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\bigg)\nabla f_t(x)\bigg\|^2dz\,dx$$
$$\le\int_{\mathbb{R}^d}\!\int_{B_u}k(z)\bigg\|\frac{p_t(x)}{\sqrt{p_*(x)}}\cdot\bigg(\frac{\sigma}{2}\nabla f_*(x)^\top z-\sigma\nabla f_t(x)^\top z+\frac{5L\sigma^2}{4}\|z\|^2\bigg)\cdot\nabla f_t(x)\bigg\|^2dz\,dx$$
$$\le\int_{\mathbb{R}^d}\!\int_{B_u}k(z)\,\frac{2}{\sigma}\bigg(\frac{\sigma}{2}\nabla f_*(x)^\top z-\sigma\nabla f_t(x)^\top z+\frac{5L\sigma^2}{4}\|z\|^2\bigg)^2 p_t(x)\,dz\,dx+\int_{\mathbb{R}^d}\!\int_{B_u}k(z)\,\frac{\sigma}{2}\cdot\frac{(p_t(x))^{1.5}}{p_*(x)}\cdot\|\nabla f_t(x)\|^2dz\,dx$$
$$\le\int_{\mathbb{R}^d}\!\int_{B_u}\sigma\,k(z)\|z\|^2\,p_t(x)\bigg(\frac12\|\nabla f_*(x)\|^2+\frac32\|\nabla f_t(x)\|^2+\frac{5L^2\sigma^2}{2}\|z\|^2\bigg)dz\,dx+\int_{\mathbb{R}^d}\!\int_{B_u}\frac{\sigma}{2}\,k(z)\cdot\frac{(p_t(x))^{1.5}}{p_*(x)}\cdot\|\nabla f_t(x)\|^2dz\,dx$$
$$\le\frac{\sqrt\beta\,\sigma}{2}\int k(z)\|z\|^2\!\int\!\sqrt{p_*(x)}\,\|\nabla f_*(x)\|^2dx\,dz+\frac{3\sigma}{2}\int k(z)\|z\|^2\!\int\!\sqrt{p_t(x)}\,\|\nabla f_t(x)\|^2dx\,dz+\frac{5L^2\sigma^3}{2}\int k(z)\|z\|^4\!\int\!\sqrt{p_t(x)}\,dx\,dz+\frac{\beta\sigma}{2}\int k(z)\!\int\!\sqrt{p_t(x)}\,\|\nabla f_t(x)\|^2dx\,dz$$
$$\le\sqrt\beta\,CdL(M+1)\sigma+3\sqrt\beta\,CdL(M+1)\sigma+\frac54\sqrt\beta\,CL^2M\sigma^3+\frac32\beta^{1.5}CdL\sigma. \quad(57)$$
Plugging Eq. 55 and Eq. 57 into Eq. 51, we have
$$\text{Term 1.1}\le 16\sqrt\beta\,CdL(M+1)\sigma+9\beta^{1.5}CdL\sigma+2\beta L^2(M+1)\sigma^2+5\sqrt\beta\,CL^2M\sigma^3+\beta^{1.5}CL^2(M+1)\sigma^3. \quad(58)$$
With the same techniques as in Eq. 51, we have
$$\text{Term 1.2}\le 2\int_{\mathbb{R}^d}\!\int_{B(0,r)}k(z)\bigg\|\bigg(\frac{p_t(x)}{\sqrt{p_*(x)}}-\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\bigg)\nabla f_*(x)\bigg\|^2dz\,dx+5L^2\sigma^2r^2. \quad(59)$$
Similar to Eq.
55, when $z\in B_l(x)$, we have
$$\int_{\mathbb{R}^d}\!\int_{B_l}k(z)\bigg\|\bigg(\frac{p_t(x)}{\sqrt{p_*(x)}}-\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\bigg)\nabla f_*(x)\bigg\|^2dz\,dx$$
$$\le\int_{\mathbb{R}^d}\!\int_{B_l}\sigma\,k(z)\|z\|^2\,p_t(x_z)\bigg(\frac12\|\nabla f_*(x_z)\|^2+\frac32\|\nabla f_t(x_z)\|^2+\frac{5L^2\sigma^2}{2}\|z\|^2\bigg)dz\,dx+\int_{\mathbb{R}^d}\!\int_{B_l}\sigma\,k(z)\cdot\frac{(p_t(x_z))^{1.5}}{p_*(x_z)}\cdot\big(\|\nabla f_*(x_z)\|^2+L^2\sigma^2\|z\|^2\big)dz\,dx$$
$$\le\frac{\sqrt\beta\,\sigma}{2}\int k(z)\|z\|^2\!\int\!\sqrt{p_*(x_z)}\,\|\nabla f_*(x_z)\|^2dx\,dz+\frac{3\sigma}{2}\int k(z)\|z\|^2\!\int\!\sqrt{p_t(x_z)}\,\|\nabla f_t(x_z)\|^2dx\,dz+\frac{5L^2\sigma^3}{2}\int k(z)\|z\|^4\!\int\!\sqrt{p_t(x_z)}\,dx\,dz+\beta\sigma\int k(z)\!\int\!\sqrt{p_t(x_z)}\,\|\nabla f_*(x_z)\|^2dx\,dz+\beta L^2\sigma^3\int k(z)\|z\|^2\!\int\!\sqrt{p_t(x_z)}\,dx\,dz$$
$$\le\sqrt\beta\,CdL(M+1)\sigma+3\sqrt\beta\,CdL(M+1)\sigma+\frac54\sqrt\beta\,CL^2M\sigma^3+3\beta^{1.5}CdL\sigma+\frac12\beta^{1.5}CL^2(M+1)\sigma^3, \quad(60)$$
where the last inequality utilizes an additional $\beta$-warm condition compared with Eq. 55. Similar to Eq. 57, when $z\in B_u(x)$, we have
$$\int_{\mathbb{R}^d}\!\int_{B_u}k(z)\bigg\|\bigg(\frac{p_t(x)}{\sqrt{p_*(x)}}-\frac{p_t(x_z)}{\sqrt{p_*(x_z)}}\bigg)\nabla f_*(x)\bigg\|^2dz\,dx$$
$$\le\int_{\mathbb{R}^d}\!\int_{B_u}\sigma\,k(z)\|z\|^2\,p_t(x)\bigg(\frac12\|\nabla f_*(x)\|^2+\frac32\|\nabla f_t(x)\|^2+\frac{5L^2\sigma^2}{2}\|z\|^2\bigg)dz\,dx+\int_{\mathbb{R}^d}\!\int_{B_u}\frac{\sigma}{2}\,k(z)\cdot\frac{(p_t(x))^{1.5}}{p_*(x)}\cdot\|\nabla f_*(x)\|^2dz\,dx$$
$$\le\frac{\sqrt\beta\,\sigma}{2}\int k(z)\|z\|^2\!\int\!\sqrt{p_*(x)}\,\|\nabla f_*(x)\|^2dx\,dz+\frac{3\sigma}{2}\int k(z)\|z\|^2\!\int\!\sqrt{p_t(x)}\,\|\nabla f_t(x)\|^2dx\,dz+\frac{5L^2\sigma^3}{2}\int k(z)\|z\|^4\!\int\!\sqrt{p_t(x)}\,dx\,dz+\frac{\beta\sigma}{2}\int k(z)\!\int\!\sqrt{p_t(x)}\,\|\nabla f_*(x)\|^2dx\,dz$$
$$\le\sqrt\beta\,CdL(M+1)\sigma+3\sqrt\beta\,CdL(M+1)\sigma+\frac54\sqrt\beta\,CL^2M\sigma^3+\frac32\beta^{1.5}CdL\sigma. \quad(61)$$
Combining Eq. 60 and Eq. 61 with Eq. 59, we have
$$\text{Term 1.2}\le 16\sqrt\beta\,CdL(M+1)\sigma+9\beta^{1.5}CdL\sigma+2\beta L^2(M+1)\sigma^2+5\sqrt\beta\,CL^2M\sigma^3+\beta^{1.5}CL^2(M+1)\sigma^3. \quad(62)$$
Without loss of generality, we suppose $\sigma\le1$ and $M\ge1$. Plugging Eq. 58 and Eq. 62 into Eq. 50 yields the bound on Term 1 stated in Eq. 63. Combining this result with Remark 2, we have
$$\frac{dD_\chi(p_t,p_*)}{dt}=-\int_{\mathbb{R}^d}2\,p_t(x)\,\nabla\frac{p_t(x)}{p_*(x)}\cdot\int_{\mathbb{R}^d}p_t(y)\,k(x,y)\,\nabla\ln\frac{p_t(y)}{p_*(y)}\,dy\,dx=-\int_{\mathbb{R}^d}2\,p_t(x)\,\nabla\frac{p_t(x)}{p_*(x)}\cdot\int_{\mathbb{R}^d}p_*(y)\,k(x,y)\,\nabla\frac{p_t(y)}{p_*(y)}\,dy\,dx.$$
Plugging Eq.
20 into the previous equation, we have
$$\frac{dD_\chi(p_t,p_*)}{dt}=-2\int_{\mathbb{R}^d}p_t(x)\,\nabla\frac{p_t(x)}{p_*(x)}\cdot\int_{\mathbb{R}^d}k(x,y)\,p_*(y)\,\nabla\frac{p_t(y)}{p_*(y)}\,dy\,dx$$
$$=-2\int_{\mathbb{R}^d}p_t(x)\bigg\|\nabla\frac{p_t(x)}{p_*(x)}\bigg\|^2dx-2\int_{\mathbb{R}^d}p_t(x)\,\nabla\frac{p_t(x)}{p_*(x)}\cdot\bigg(\int_{\mathbb{R}^d}k(x,y)\,p_*(y)\,\nabla\frac{p_t(y)}{p_*(y)}\,dy-\nabla\frac{p_t(x)}{p_*(x)}\bigg)dx$$
$$\le-\int_{\mathbb{R}^d}p_t(x)\bigg\|\nabla\frac{p_t(x)}{p_*(x)}\bigg\|^2dx+\int_{\mathbb{R}^d}p_t(x)\bigg\|\int_{\mathbb{R}^d}k(x,y)\,p_*(y)\,\nabla\frac{p_t(y)}{p_*(y)}\,dy-\nabla\frac{p_t(x)}{p_*(x)}\bigg\|^2dx\le\beta\cdot\mathrm{KAE}, \quad(67)$$
where the last inequality follows from the $p_t$ warm assumption. Suppose we control the KAE by Eq. 64, which means $\partial_t D_\chi(p_t,p_*)\le 4\beta\epsilon$, and take the time $T=-C\ln\epsilon$ that leads $p_t$ to the target region, i.e., $\mathrm{KL}(p_t\|p_*)\le\epsilon$ by the linear convergence. When $\epsilon$ is small enough, e.g., $\epsilon\le(16\beta C)^{-2}$, and $D_\chi(p_0,p_*)\le1/4$, we have
$$D_\chi(p_T,p_*)\le D_\chi(p_0,p_*)+4\beta\epsilon T=D_\chi(p_0,p_*)-4C\beta\epsilon\ln\epsilon\le D_\chi(p_0,p_*)+4C\beta\sqrt\epsilon\le 1/2.$$
Hence, the proof is completed. With these lemmas, we provide the proof of the main theorem in the following.
Proof. We write $H_{p_*}(p_t):=\mathrm{KL}(p_t\|p_*)$ for abbreviation. According to the time derivative of the KL divergence along any flow, we have
$$\frac{d}{dt}H_{p_*}(p_t)=\int_{\mathbb{R}^d}\frac{\delta H_{p_*}}{\delta p}(p_t)\,\partial_t p_t\,dx.$$
Therefore, along Remark 2, we have
$$\frac{d}{dt}H_{p_*}(p_t)=\int_{\mathbb{R}^d}\nabla\ln\frac{p_t(x)}{p_*(x)}\cdot\phi_t(x)\,p_t(x)\,dx=-\int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d}k(x,y)\,p_t(x)\,p_t(y)\,\nabla\ln\frac{p_t(x)}{p_*(x)}\cdot\nabla\ln\frac{p_t(y)}{p_*(y)}\,dy\,dx, \quad(69)$$
which follows from Eq. 36. By taking $k(x,y)=(p_*(x))^{-1/2}\,k_\sigma(x,y)\,(p_*(y))^{-1/2}$, Eq. 69 satisfies
$$\frac{d}{dt}H_{p_*}(p_t)=-\int_{\mathbb{R}^d}\frac{p_t(x)}{\sqrt{p_*(x)}}\nabla\ln\frac{p_t(x)}{p_*(x)}\cdot\int_{\mathbb{R}^d}k_\sigma(x,y)\,\frac{p_t(y)}{\sqrt{p_*(y)}}\nabla\ln\frac{p_t(y)}{p_*(y)}\,dy\,dx$$
$$=-\int_{\mathbb{R}^d}\frac{p_t(x)}{\sqrt{p_*(x)}}\nabla\ln\frac{p_t(x)}{p_*(x)}\cdot\bigg(\int_{\mathbb{R}^d}k_\sigma(x,y)\,\frac{p_t(y)}{\sqrt{p_*(y)}}\nabla\ln\frac{p_t(y)}{p_*(y)}\,dy-\frac{p_t(x)}{\sqrt{p_*(x)}}\nabla\ln\frac{p_t(x)}{p_*(x)}+\frac{p_t(x)}{\sqrt{p_*(x)}}\nabla\ln\frac{p_t(x)}{p_*(x)}\bigg)dx$$
$$=-\int_{\mathbb{R}^d}p_*(x)\bigg\|\frac{p_t(x)}{p_*(x)}\nabla\ln\frac{p_t(x)}{p_*(x)}\bigg\|^2dx-\int_{\mathbb{R}^d}\frac{p_t(x)}{\sqrt{p_*(x)}}\nabla\ln\frac{p_t(x)}{p_*(x)}\cdot\bigg(\int_{\mathbb{R}^d}k_\sigma(x,y)\,\frac{p_t(y)}{\sqrt{p_*(y)}}\nabla\ln\frac{p_t(y)}{p_*(y)}\,dy-\frac{p_t(x)}{\sqrt{p_*(x)}}\nabla\ln\frac{p_t(x)}{p_*(x)}\bigg)dx$$
$$\le-\frac12\int_{\mathbb{R}^d}p_*(x)\bigg\|\frac{p_t(x)}{p_*(x)}\nabla\ln\frac{p_t(x)}{p_*(x)}\bigg\|^2dx+\frac12\int_{\mathbb{R}^d}\bigg\|\frac{p_t(x)}{\sqrt{p_*(x)}}\nabla\ln\frac{p_t(x)}{p_*(x)}-\int_{\mathbb{R}^d}k_\sigma(x,y)\,\frac{p_t(y)}{\sqrt{p_*(y)}}\nabla\ln\frac{p_t(y)}{p_*(y)}\,dy\bigg\|^2dx\le-\frac{\mu}{16}H_{p_*}(p_t)+4\epsilon, \quad(71)$$
where the first inequality follows from Lemma 4.3 and the second one follows from Lemma 4.2. However, Lemma 4.2 requires a local condition on $p_t$, which we proved in Lemma C.1. By applying Gronwall's lemma, Eq. 71 implies the desired bound
$$H_{p_*}(p_t)\le\max\Big\{0,\;H_{p_*}(p_0)-\frac{64\epsilon}{\mu}\Big\}\cdot\exp\Big(-\frac{\mu t}{16}\Big)+\frac{64\epsilon}{\mu}.$$
Hence, the proof is completed.
Remark 3. Actually, instead of the constant upper bound on the density ratio provided in Assumption [A$_3$], we can allow the density ratio to be upper-bounded as
$$\sup_{x\in\mathbb{R}^d}\,p_t(x)/p_*(x)\le P(t),$$
where $P(t)$ denotes a polynomial function. Without loss of generality, we suppose $P(t)\le(t+1)^q$. In this condition, reweighted SVGD can achieve an $O(1/\epsilon)$ rate when we choose $\sigma_t=\min\{1,e^{-t-1}\}$. In the following, we show how this choice affects the kernel approximation error shown in Lemma 4.3. Similar to Eq. 63, we can obtain the following inequality:
$$\text{Term 1}\le\big(192LMCd+54LCd\big)\,P(t)^{1.5}\sigma_t+36L^2M\cdot P(t)\sigma_t+12L^2MC\cdot P(t)^{1.5}\sigma_t.$$
We can easily obtain that
$$P^{1.5}(t)\,\sigma_t=(1+t)^{1.5q}\,e^{-(t+1)}\le\frac{(1+1.5q)^{1.5q}\,e^{-1-1.5q}}{t+1},$$
so that
$$\text{Term 1}\le\big(192LMCd+54LCd+36L^2M+12L^2MC\big)\cdot(1+1.5q)^{1.5q}\,e^{-1-1.5q}\cdot(t+1)^{-1}.$$
For abbreviation, we set
$$C_*=\big(192LMCd+54LCd+36L^2M+12L^2MC\big)\cdot(1+1.5q)^{1.5q}\,e^{-1-1.5q}.$$
Then, similar to Eq. 71, we have
$$\frac{d}{dt}H_{p_*}(p_t)\le-\frac14\int_{\mathbb{R}^d}p_*(x)\bigg\|\frac{p_t(x)}{p_*(x)}\nabla\ln\frac{p_t(x)}{p_*(x)}\bigg\|^2dx+\frac{4C_*}{1+t}\le-\frac{\mu}{16}H_{p_*}(p_t)+\frac{4C_*}{1+t},$$
whose solution involves the exponential integral $\mathrm{Ei}$, denoted as
$$\mathrm{Ei}(x)=-\int_{-x}^{\infty}\frac{\exp(-t)}{t}\,dt.$$
According to Abramowitz (1972), we have $e^{-x}\,\mathrm{Ei}(x)\le-\frac12\ln\big(1-\frac2x\big)$, so that
$$4C_*\cdot\exp\Big(-\frac{\mu(t+1)}{16}\Big)\cdot\mathrm{Ei}\Big(\frac{\mu(t+1)}{16}\Big)\le-2C_*\ln\Big(1-\frac{32}{\mu(t+1)}\Big)\le\frac{64C_*}{\mu(t+1)-32}, \quad(76)$$
where the last inequality follows from $\ln x\ge1-1/x$. Hence, by requiring the RHS of Eq. 76 to be smaller than $\epsilon$, we have $t\ge 64C_*/(\mu\epsilon)+32/\mu$.
In the following, we show that the convergence rate is preserved even when only an approximation of the unknown normalizing constant of the target distribution $p_*$ is available (similar to Theorem 3.1).
Proposition C.2. Suppose Assumptions [A$_1$]–[A$_3$] are satisfied, the chi-square divergence satisfies $D_\chi(p_0,p_*)\le1/4$, and the PDF of the target distribution $p_*(x)$ can be estimated by $\hat p_*(x)$ satisfying
$$p_*(x)=\frac{e^{-f_*(x)}}{C_*},\qquad \hat p_*(x)=\frac{e^{-f_*(x)}}{\hat C},\qquad 0<C_*,\hat C<\infty.$$
For any $\epsilon>0$, if we set the reweighted kernel $\hat k$:
$$\hat k(x,y)=(\hat p_*(x))^{-1/2}\,k_\sigma(x,y)\,(\hat p_*(y))^{-1/2},\qquad k_\sigma(x,y)=\bar k_\sigma(x-y)=\sigma^{-d}\,k\big(\sigma^{-1}(x-y)\big),$$
This proposition demonstrates that approximating an unknown normalizing constant does not harm the linear convergence rate of SVGD with reweighted kernels; it only contributes an additional factor $\hat C/C_*$ to the total complexity. A natural question is whether convergence can be made arbitrarily fast by taking the factor $\hat C/C_*$ large. It cannot: this convergence rate is established only in the asymptotic (population-limit) analysis, and the discretization error cannot be controlled without a tiny step size when $\hat C/C_*$ is large. In practice, a large $\hat C/C_*$ therefore implies a small step size rather than arbitrarily fast convergence.
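The scale effect described above can be checked directly: replacing the true normalizer $C_*$ with an estimate $\hat C$ rescales the reweighted kernel by exactly $\hat C/C_*$, since $(\hat p_*(x))^{-1/2}=\sqrt{\hat C}\,e^{f_*(x)/2}$. A minimal 1-D numeric sketch (function names and parameter values are illustrative, not from the paper's code):

```python
import numpy as np

def reweighted_kernel(x, y, f_star, Z, sigma=0.5):
    """k(x, y) = p(x)^{-1/2} k_sigma(x, y) p(y)^{-1/2} with p = exp(-f_star) / Z."""
    k_sigma = np.exp(-(x - y) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    p_x = np.exp(-f_star(x)) / Z
    p_y = np.exp(-f_star(y)) / Z
    return k_sigma / np.sqrt(p_x * p_y)

f_star = lambda x: 0.5 * x**2      # standard Gaussian potential
C_true = np.sqrt(2 * np.pi)        # exact normalizer of exp(-x^2/2)
C_hat = 5.0                        # a deliberately rough estimate

x, y = 0.3, -1.2
ratio = reweighted_kernel(x, y, f_star, C_hat) / reweighted_kernel(x, y, f_star, C_true)
print(ratio)  # equals C_hat / C_true, independently of (x, y)
```

The ratio is constant in $(x,y)$, which is why the misestimated kernel only rescales the vector field (and hence the effective step size) rather than changing its direction.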

D EXPERIMENTAL RESULTS

In this section, we run SVGD with reweighted kernels on synthetic data to validate our claims, i.e., that compared with traditional SVGD, SVGD with reweighted kernels can reach any ϵ-neighborhood with a linear convergence rate. To validate our theoretical results in the asymptotic setting, we vary the particle size and show that sampling by SVGD with reweighted kernels attains a lower KL divergence. To demonstrate the efficiency and stability of SVGD, we provide a numerical illustration of the two algorithms in Figure 5. It is clear that Langevin dynamics suffers from the injected stochasticity, which makes the particles highly unstable; it therefore needs more particles to perform the task with guaranteed stability.
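The experimental setup above can be sketched in a few lines. Below is a minimal 1-D illustration of SVGD with the reweighted kernel $k(x,y)=p_*(x)^{-1/2}k_\sigma(x,y)\,p_*(y)^{-1/2}$ for a standard Gaussian target; all names and parameter values here are illustrative choices, not the paper's experimental configuration. The unknown normalizer of $p_*$ is dropped, since by Proposition C.2 it only rescales the vector field:

```python
import numpy as np

rng = np.random.default_rng(0)

f_star = lambda x: 0.5 * x**2      # target p*(x) ∝ exp(-f*(x)), a standard Gaussian
grad_f_star = lambda x: x          # ∇f*(x)

def svgd_step(x, sigma=0.5, eta=0.05, eps=1e-4):
    """One population-SVGD update with the reweighted kernel (normalizer dropped)."""
    def k(a, b):  # a = source particles, b = query point
        k_sig = np.exp(-(a - b) ** 2 / (2 * sigma**2))
        return np.exp(f_star(a) / 2) * k_sig * np.exp(f_star(b) / 2)

    phi = np.empty_like(x)
    for i, xi in enumerate(x):
        drift = k(x, xi) * (-grad_f_star(x))                     # k(x_j, x_i) ∇ln p*(x_j)
        repulse = (k(x + eps, xi) - k(x - eps, xi)) / (2 * eps)  # ∇_{x_j} k(x_j, x_i), numeric
        phi[i] = np.mean(drift + repulse)
    return x + eta * phi

x = rng.normal(loc=1.0, scale=0.5, size=200)  # initialize off-center
for _ in range(300):
    x = svgd_step(x)
print(f"mean={np.mean(x):.2f}, std={np.std(x):.2f}")  # particles should approach N(0, 1)
```

Note how the $e^{f_*/2}$ factors amplify the update for particles sitting in low-density regions, which is exactly the gradient-scaling effect the reweighting is designed to provide.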



CONCLUSIONS

In this paper, we prove the local linear convergence of SVGD with a reweighted kernel in KL divergence. In particular, our analysis is based on the smoothing kernel for the SVGD algorithm, and we point out that the conventional smoothing kernel fails to provide valid gradient scaling in low-density areas. Thus, we highlight that reweighting is necessary for smoothing kernels in the SVGD algorithm. With the $(p_*(x)p_*(y))^{-1/2}$ weighting for $k(x,y)$, we provide a local KL convergence rate for SVGD when $p_*(x)$ satisfies a log-Sobolev inequality. Our analysis provides new insights into kernel design in SVGD, especially gradient amplification in low-density areas.



are similar to the convexity geometry and L-smoothness in conventional Euclidean optimization. Assumption [A 3 ] restricts the domain of our proof: the tail of p t should be lighter than that of p * , a condition widely used in the analysis of Langevin dynamics. Compared with SLSI, thanks to the decoupling of the requirements on the target distribution p * and on the designed kernels k, we can verify these assumptions in several ways. The log-Sobolev inequality for the target distribution p * can be checked by the criteria mentioned in Section 2.2. For the trajectory assumptions, i.e., [A 2 ] and [A 3 ], we provide empirical validation by estimating the density ratio and the smoothness of p t as t grows in some simple cases (Figure 1). We then have the following theorem. Theorem 3.1. Suppose Assumptions [A 1 ]-[A 3 ] are satisfied, and the chi-square divergence D χ (p 0 , p * ) ≤ 1/4. For any ϵ > 0, if we set the reweighted kernel k:

Figure 1: Illustration of warmness and smoothness evolution, where p 0 ∼ N (0, 0.25), p * ∼ N (0, 5).

Figure 2: Illustration of gradient vanishing with the smoothing kernel. The vector field illustrates -∇ ln p t (x) for the data distribution N (0, diag(1, 0.5 2 )).

Figure 3: Reweighted vs smoothing kernel (1K particles). p * = N ([0, 0], diag(5, 1)).

Figure 4: Reweighted vs smoothing kernel (0.5K particles). p * = N ([0, 0], diag(5, 1)).

Figure 5: SVGD vs Langevin dynamics.

By applying Gronwall's lemma, Eq. 8 yields H p * (p t ) ≤ H p * (p 0 ) exp(-2µt), which indicates the linear convergence rate of Langevin dynamics. Nonetheless, coupling LSI with SVGD is challenging due to the introduction of the RKHS H 0 , where H denotes the d-fold Cartesian product space of H 0 that contains Frobenius-normalized functions mapping to R d , i.e., ⟨ϕ t , ϕ t ⟩ H ≤ 1.
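The linear rate above can be sanity-checked in the Gaussian case, where Langevin dynamics has closed-form marginals. A minimal sketch, assuming target $p_*=N(0,1)$ (so the LSI constant is $\mu=1$) and initialization $N(m_0,1)$, whose unit variance Langevin dynamics preserves while contracting the mean as $m_t=m_0e^{-t}$:

```python
import numpy as np

# Closed-form check of the Gronwall rate for a Gaussian target: with
# p* = N(0, 1) (mu = 1) and p_0 = N(m0, 1), the Langevin marginal is
# p_t = N(m0 * exp(-t), 1), so
#   H(p_t) = KL(p_t || p*) = m_t^2 / 2 = H(p_0) * exp(-2 * mu * t).
mu, m0 = 1.0, 3.0
H0 = m0**2 / 2.0
for t in np.linspace(0.0, 3.0, 7):
    m_t = m0 * np.exp(-t)
    H_t = m_t**2 / 2.0                      # KL between N(m_t, 1) and N(0, 1)
    assert np.isclose(H_t, H0 * np.exp(-2.0 * mu * t))
print("H(p_t) = H(p_0) * exp(-2*mu*t) holds exactly in this case")
```

Here the Gronwall bound is attained with equality; for non-Gaussian log-Sobolev targets it holds only as an upper bound.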



Before providing the main theorem, i.e., Theorem 3.1, we need to introduce some lemmas. Lemma C.1. Suppose Assumptions [A 1 ]-[A 3 ] are satisfied, and p 0 is close to the target p * , satisfying D χ (p 0 , p * ) ≤ 1/4. Then for any time T = -C ln ϵ with ϵ ≤ (16βC) -2 , we have D χ (p T , p * ) ≤ 1/2. Proof. We denote by D χ 2 (p t , p * ) the chi-square distance between p t and p * . We have the following functional derivative: dD χ (p t , p * )/dt =

then the KL divergence between p t and p * satisfies H p * (p t ) ≤ max{0, H p * (p 0 ) - Proof. We consider the reweighted kernel k̂(x, y) = (p̂ * (x)) -1/2 k σ (x, y) (p̂ * (y)) -1/2 , where the equation follows from the definition of k̂, and the inequality follows from Theorem 3.1. By applying Gronwall's lemma, Eq. 80 implies the desired bound H p * (p t ) ≤ max{0, H p * (p 0 ) -

