LOCAL KL CONVERGENCE RATE FOR STEIN VARIATIONAL GRADIENT DESCENT WITH REWEIGHTED KERNEL

Abstract

We study the convergence properties of the Stein Variational Gradient Descent (SVGD) algorithm for sampling from a non-normalized probability distribution p*(x) ∝ exp(-f*(x)). Compared with the Kernelized Stein Discrepancy (KSD) convergence analyzed in the previous literature, KL convergence is a more convincing criterion that better explains the effectiveness of SVGD in real-world applications. In the population limit, SVGD performs smoothed gradient descent with a kernel integral operator. Notably, SVGD with standard smoothing kernels suffers from gradient vanishing in low-density areas, which makes the error between the smoothed gradient and the Wasserstein gradient uncontrollable. In this context, we introduce a reweighted kernel that amplifies the smoothed gradient in low-density areas and leads to a bounded error term. When p*(x) satisfies a log-Sobolev inequality, we establish a convergence rate for SVGD in KL divergence with the reweighted kernel. Our analysis points out the defects of conventional smoothing kernels in SVGD and provides a convergence rate for SVGD in KL divergence.

1. INTRODUCTION

Sampling from non-normalized distributions is a crucial task in statistics. In particular, in Bayesian inference, Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) are considered the two mainstream approaches for handling the intractable integrals of posterior distributions. On the one hand, although MCMC-based methods, e.g., Langevin Monte Carlo (LMC) (Durmus & Moulines, 2019; Welling & Teh, 2011) and the Metropolis-adjusted Langevin algorithm (MALA) (Xifara et al., 2014), can approximate the target distribution with arbitrarily small error (Wibisono, 2018), their sample efficiency is low due to the lack of repulsive forces between samples (Duncan et al., 2019; Korba et al., 2020). On the other hand, VI-based sampling methods (Blei et al., 2017; Ranganath et al., 2014) improve sampling efficiency by reformulating inference as an optimization problem. However, restricting the search space of this optimization problem to a parametric family of distributions usually leaves a large gap between the VI solution and the target distribution p*. Inspired by conventional VI, a series of recent works analyze LMC as an optimization problem over the Kullback-Leibler (KL) divergence (Wibisono, 2018; Bernton, 2018; Durmus et al., 2019). Another particle-based method, Stein Variational Gradient Descent (SVGD), deterministically transports a set of interacting particles along a kernelized velocity field toward p*, combining the flexibility of MCMC with the optimization perspective of VI.

In addition to its rich applications, there is a substantial body of work on the theoretical analysis of SVGD. For example, the Kernelized Stein Discrepancy (KSD) convergence properties of SVGD in the asymptotic and non-asymptotic settings are investigated by Liu (2017); Lu et al. (2019) and by Korba et al. (2020); Salim et al. (2021; 2022), respectively. However, different from the convergence in KL divergence established in the analysis of LMC (Cheng & Bartlett, 2018; Vempala & Wibisono, 2019), KSD convergence cannot explain the effectiveness of SVGD in some real-world applications, e.g., posterior sampling (Welling & Teh, 2011) and non-convex learning (Raginsky et al., 2017). To provide KL convergence, other works (Duncan et al., 2019; Korba et al., 2020) present a linear convergence of SVGD under the Stein log-Sobolev inequality (SLSI) (Duncan et al., 2019). Nonetheless, different from the clear meaning and verifiable criteria of the standard log-Sobolev inequality (LSI) used in the analysis of LMC (Vempala & Wibisono, 2019), establishing SLSI requires properties of the coupling between the designed smoothing kernel and the target distribution, which can hardly be verified for commonly used kernels (Duncan et al., 2019). In addition, SLSI is even more challenging to establish in higher dimensions.

To fill these gaps, in this paper we aim to provide the convergence rate of SVGD (in the infinite particle regime) in terms of the KL objective when p* = e^{-f*} satisfies the standard LSI. Specifically, we first point out that SVGD with a smoothing kernel, e.g., the RBF kernel, suffers from gradient vanishing in low-density areas due to the extra p_t(x) scaling. Then, we demonstrate the importance of reweighted kernels, obtained by dividing by p_t(x) or p*(x), under which the scaling of the smoothed gradient is normalized (an illustrative sketch of this reweighting is given after the contribution list below). With the reweighting scaling p*^{-1/2}(x) p*^{-1/2}(y) for a kernel k(x, y) and regularity conditions, SLSI in higher dimensions can be nearly established, up to an additional term controlled by the kernel approximation error. Finally, by choosing a proper reweighted smoothing kernel, the KL divergence along the SVGD dynamics attains a local linear convergence rate to any neighborhood of p*(x) under mild assumptions, provided the initialization p_0(x) is relatively close to p*(x).

The main contributions of the paper are as follows:

• We introduce reweighted kernels to SVGD, which replace traditional smoothing kernels and overcome the gradient vanishing problem in low-density areas.

• We study the KL convergence rate of the SVGD algorithm. Under the standard LSI and some mild assumptions, we show that SVGD with a reweighted kernel attains a local linear convergence rate to any neighborhood of p*(x).
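To make the reweighting concrete, the following is a minimal finite-particle sketch of a single SVGD update with an RBF kernel and, optionally, the reweighted kernel k_w(x, y) = p*^{-1/2}(x) k(x, y) p*^{-1/2}(y) described above. The analysis in this paper is carried out in the population (infinite particle) limit; the function and argument names below are illustrative only, and log p* is assumed to be known only up to an additive constant, which merely rescales the reweighted field and can be absorbed into the step size.

```python
import numpy as np

def svgd_step(X, score, log_p=None, step=1e-2, h=1.0, reweight=False):
    """One SVGD update in the finite-particle regime (illustrative sketch).

    X        : (n, d) array of particle positions.
    score    : callable (n, d) -> (n, d), evaluates grad log p*(x) = -grad f*(x).
    log_p    : callable (n, d) -> (n,), log p*(x) up to an additive constant;
               only required when reweight=True.
    reweight : if True, use k_w(x, y) = p*(x)^{-1/2} k(x, y) p*(y)^{-1/2} instead of k.
    """
    n, _ = X.shape
    s = score(X)                                            # (n, d)
    diffs = X[:, None, :] - X[None, :, :]                   # (n, n, d), entries x_i - x_j
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2)) # RBF kernel k(x_i, x_j)
    gK = diffs * K[..., None] / h ** 2                      # grad of k(x_i, .) w.r.t. the second argument x_j

    if reweight:
        # w_i = p*(x_i)^{-1/2}: amplifies the velocity field in low-density regions.
        w = np.exp(-0.5 * log_p(X))                         # (n,), known up to a constant factor
        W = w[:, None] * w[None, :]                         # (n, n)
        # Product rule for the reweighted kernel:
        # grad_y k_w(x, y) = W * ( grad_y k(x, y) - 0.5 * k(x, y) * grad log p*(y) ).
        gK = W[..., None] * (gK - 0.5 * K[..., None] * s[None, :, :])
        K = W * K

    # SVGD velocity: phi(x_i) = (1/n) sum_j [ k(x_i, x_j) score(x_j) + grad_y k(x_i, x_j) ]
    phi = (K @ s + gK.sum(axis=1)) / n
    return X + step * phi
```

For instance, for a Gaussian target N(mu, sigma^2 I), one would pass score(X) = -(X - mu) / sigma**2 and log_p(X) = -0.5 * np.sum((X - mu)**2, axis=-1) / sigma**2 (up to an additive constant).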

2. PRELIMINARIES

In this section, we first introduce the important notation used in the following sections. Then, we explain how to optimize functionals on the Wasserstein space by continuous updates in the infinite particle regime. After that, we show that LSI on the target distribution is the key condition for obtaining the KL convergence rate of LMC; however, the convergence rate of the SVGD dynamics remains non-trivial under this assumption.

In the following sections, bold letters x, y, z denote vectors in R^d, and B(x, r) denotes the open ball centered at x with radius r > 0. For a function f : R^d → R, ∇f(·) and ∇²f(·) refer to its gradient and Hessian matrix, respectively. For a function f : R^d → R^d, ∇f(·) and ∇ · f(·) denote the Jacobian matrix and the divergence. For a function of multiple variables, ∇_i denotes the gradient w.r.t. the i-th variable. All distributions are assumed to be absolutely continuous with respect to the Lebesgue measure, with corresponding density function p; the density of the target posterior is denoted by p*, and the density at time t is p_t. The notation ∥ · ∥ denotes the 2-norm for both vectors and matrices, and in a Hilbert space H equipped with the inner product ⟨·, ·⟩_H, the induced norm is ∥ · ∥_H. The set P_2(X) consists of probability measures µ on X with finite second-order moments. Finally, k denotes a smoothing function (kernel), such as exp(-∥x∥²) or max{0, 1 - ∥x∥}.
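For reference, one standard form of the LSI used in the LMC analysis (Vempala & Wibisono, 2019) reads as follows: p* satisfies LSI with constant λ > 0 if, for every probability density p,

KL(p ∥ p*) ≤ (1 / (2λ)) E_{x∼p}[ ∥∇ log( p(x) / p*(x) )∥² ],

i.e., the KL divergence is controlled by the relative Fisher information.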

2.1. OPTIMIZATION IN THE WASSERSTEIN SPACE

Sampling algorithms can be considered as optimizing given functionals in the Wasserstein space, as in Eq. 1. Generally, they only update particles, which induces an evolution of the particles' distribution, and this evolution in turn changes the objective functional. In particular, given an initial distribution x_0 ∼ p_0(x) and a function class H, suppose the update of x_t is dx_t = ϕ_t(x_t) dt with ϕ_t ∈ H.
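Under standard regularity conditions, the density p_t of x_t then evolves according to the continuity equation

∂p_t / ∂t = -∇ · (p_t ϕ_t),

so the choice of ϕ_t ∈ H determines how the objective functional decreases along the resulting flow.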





