LOCAL KL CONVERGENCE RATE FOR STEIN VARIATIONAL GRADIENT DESCENT WITH REWEIGHTED KERNEL

Abstract

We study the convergence properties of the Stein Variational Gradient Descent (SVGD) algorithm for sampling from an unnormalized probability distribution p*(x) ∝ exp(−f*(x)). Compared with the Kernelized Stein Discrepancy (KSD) convergence analyzed in previous literature, convergence in KL divergence is a more convincing criterion and better explains the effectiveness of SVGD in real-world applications. In the population limit, SVGD performs smoothed gradient descent with a kernel integral operator. Notably, SVGD with standard smoothing kernels suffers from vanishing gradients in low-density areas, which makes the error term between the smoothed gradient and the Wasserstein gradient uncontrollable. In this context, we introduce a reweighted kernel that amplifies the smoothed gradient in low-density areas, leading to a bounded error term. When p*(x) satisfies a log-Sobolev inequality, we establish a convergence rate for SVGD in KL divergence with the reweighted kernel. Our analysis points out the defects of conventional smoothing kernels in SVGD and provides a convergence rate for SVGD in KL divergence.

1. INTRODUCTION

Sampling from unnormalized distributions is a crucial task in statistics. In particular, in Bayesian inference, Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) are considered the two mainstream lines for handling the intractable integration of posterior distributions. On the one hand, although methods based on MCMC, e.g., Langevin Monte Carlo (LMC) (Durmus & Moulines, 2019; Welling & Teh, 2011) and the Metropolis-adjusted Langevin algorithm (MALA) (Xifara et al., 2014), are able to approximate target distributions with arbitrarily small error (Wibisono, 2018), their sample efficiency is low due to the lack of repulsive force between samples (Duncan et al., 2019; Korba et al., 2020). On the other hand, VI-based sampling methods (Blei et al., 2017; Ranganath et al., 2014) can improve sampling efficiency by reformulating inference as an optimization problem. However, restricting the search space of the optimization problem to a parametric family of distributions in VI usually causes a large gap between its solution and the target distribution p*. Inspired by conventional VI, a series of recent works analyze LMC as the optimization of the Kullback-Leibler (KL) divergence (Wibisono, 2018; Bernton, 2018; Durmus et al., 2019), i.e.,

\arg\min_{p \in \mathcal{P}_2(\mathbb{R}^d)} H_{p^*}(p) := D_{\mathrm{KL}}(p \,\|\, p^*) = \int p(x) \ln \frac{p(x)}{p^*(x)} \, dx, \quad (1)

where \mathcal{P}_2(\mathbb{R}^d) is the set of Radon-Nikodym derivatives of probability measures ν with respect to the Lebesgue measure, i.e., p(x) = dν(x)/dx with \int \|x\|^2 p(x) \, dx < \infty. LMC is considered a discrete scheme of the gradient flow of the relative entropy, driving particles with a stochastic and an energy-induced force. Besides, to take the best of both MCMC and VI, Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) was proposed as a non-parametric VI method. It replaces the stochastic force in LMC with interaction between particles and approximates the target distribution by a driving force in a Reproducing Kernel Hilbert Space (RKHS).
That is, the gradient flow of SVGD is defined by projecting the functional derivative of Eq. 1 onto the RKHS. The empirical performance of SVGD and its variants has been widely demonstrated in various tasks such as learning deep probabilistic models (Liu & Wang, 2016; Pu et al., 2017), Bayesian inference (Liu & Wang, 2016; Feng et al., 2017; Detommaso et al., 2018), and reinforcement learning (Liu et al., 2017).
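To make the contrast concrete, the following is a minimal numerical sketch of the two update rules on a standard Gaussian target: LMC drives particles with an energy gradient plus stochastic noise, while SVGD replaces the noise with a deterministic, kernel-mediated interaction between particles. This is the standard SVGD update of Liu & Wang (2016) with an RBF kernel, not this paper's reweighted-kernel method; the bandwidth, step size, and particle count are illustrative choices.

```python
import numpy as np

def grad_log_p(x):
    # Target p*(x) ∝ exp(-||x||^2 / 2), a standard Gaussian, so ∇ log p*(x) = -x.
    return -x

def lmc_step(x, step, rng):
    # Langevin Monte Carlo: energy-induced drift plus Gaussian noise (the stochastic force).
    return x + step * grad_log_p(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)

def svgd_step(x, step, bandwidth=1.0):
    # SVGD: deterministic update in which the noise of LMC is replaced by
    # kernel-mediated interaction between particles,
    #   phi(x_i) = (1/n) sum_j [ k(x_j, x_i) ∇ log p*(x_j) + ∇_{x_j} k(x_j, x_i) ].
    n = x.shape[0]
    diffs = x[:, None, :] - x[None, :, :]      # diffs[i, j] = x_i - x_j
    k = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * bandwidth ** 2))  # RBF kernel matrix
    drift = k @ grad_log_p(x)                  # kernel-smoothed gradient term
    repulsion = np.sum(k[:, :, None] * diffs, axis=1) / bandwidth ** 2  # repulsive force
    return x + step * (drift + repulsion) / n

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 2)) + 5.0         # start in a low-density region of p*
for _ in range(200):
    x = svgd_step(x, step=0.1)
print(np.linalg.norm(x.mean(axis=0)))          # particles have drifted toward the mode at 0
```

Note that when the particles sit far out in the tails of p*, the smoothed drift term `k @ grad_log_p(x)` is damped by small kernel weights; this is precisely the vanishing-gradient issue in low-density areas that the reweighted kernel studied in this paper is designed to address.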

