IMPROVED STEIN VARIATIONAL GRADIENT DESCENT WITH IMPORTANCE WEIGHTS

Abstract

Stein Variational Gradient Descent (SVGD) is a popular sampling algorithm used in various machine learning tasks. It is well known that SVGD arises from a discretization of the kernelized gradient flow of the Kullback-Leibler divergence $D_{\mathrm{KL}}(\cdot \mid \pi)$, where π is the target distribution. In this work, we propose to enhance SVGD via the introduction of importance weights, which leads to a new method we call β-SVGD. In the continuous time and infinite particles regime, the time for this flow to converge to the equilibrium distribution π, quantified by the Stein Fisher information, depends only very weakly on $\rho_0$ and π. This is very different from the kernelized gradient flow of the Kullback-Leibler divergence, whose time complexity depends on $D_{\mathrm{KL}}(\rho_0 \mid \pi)$. Under certain assumptions, we provide a descent lemma for the population limit β-SVGD, which recovers the descent lemma for the population limit SVGD as β → 0. We also illustrate the advantages of β-SVGD over SVGD through simple experiments.

1. INTRODUCTION

The main technical task of Bayesian inference is to estimate integrals with respect to the posterior distribution $\pi(x) \propto e^{-V(x)}$, where $V : \mathbb{R}^d \to \mathbb{R}$ is a potential. In practice, this is often reduced to sampling points from the distribution π. Typical methods that employ this strategy include algorithms based on Markov Chain Monte Carlo (MCMC), such as Hamiltonian Monte Carlo (Neal, 2011), also known as Hybrid Monte Carlo (HMC) (Duane et al., 1987; Betancourt, 2017), and algorithms based on Langevin dynamics (Dalalyan & Karagulyan, 2019; Durmus & Moulines, 2017; Cheng et al., 2018). On the other hand, Stein Variational Gradient Descent (SVGD), a different strategy suggested by Liu & Wang (2016), is based on an interacting particle system. In the population limit, the interacting particle system can be seen as the kernelized negative gradient flow of the Kullback-Leibler divergence (Sun et al., 2022),

$$D_{\mathrm{KL}}(\rho \mid \pi) := \int \log\left(\frac{\rho}{\pi}(x)\right) d\rho(x). \tag{1}$$

However, the current theoretical understanding of SVGD is limited to its infinite particle version (Liu, 2017; Korba et al., 2020; Salim et al., 2021; Sun et al., 2022), and the theory on finite particle SVGD is far from satisfactory.

Since SVGD is built on a discretization of the kernelized negative gradient flow of (1), we can learn about its sampling potential by studying this flow. In fact, a simple calculation (see, for example, Korba et al. (2020)) reveals that

$$\min_{0 \le s \le t} I_{\mathrm{Stein}}(\rho_s \mid \pi) \le \frac{D_{\mathrm{KL}}(\rho_0 \mid \pi)}{t},$$

where $I_{\mathrm{Stein}}(\rho_s \mid \pi)$ is the Stein Fisher information (see Definition 2) of $\rho_s$ relative to π, which is typically used to quantify how close the probability distributions $(\rho_s)_{s=0}^{t}$ generated along the flow are to π. Indeed, along this flow one has $\frac{d}{dt} D_{\mathrm{KL}}(\rho_t \mid \pi) = -I_{\mathrm{Stein}}(\rho_t \mid \pi)$, and integrating this identity over $[0, t]$ and bounding the minimum by the average yields the inequality above.
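To make the interacting particle system behind SVGD concrete, the following is a minimal sketch of the standard SVGD update of Liu & Wang (2016) with an RBF kernel. The bandwidth, step size, particle count, and the Gaussian target in the usage example are illustrative assumptions rather than choices made in this paper, and the sketch implements plain SVGD, not the β-SVGD method introduced here.

```python
import numpy as np

def rbf_kernel(X, h):
    """Pairwise RBF kernel K[i, j] = exp(-||x_i - x_j||^2 / (2 h^2))
    and its gradient with respect to the first argument."""
    diffs = X[:, None, :] - X[None, :, :]                      # (n, n, d)
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))    # (n, n)
    grad_K = -diffs / h ** 2 * K[:, :, None]                   # grad_{x_i} k(x_i, x_j)
    return K, grad_K

def svgd_step(X, grad_log_pi, step_size=0.05, h=1.0):
    """One SVGD update: x_i <- x_i + eps * phi(x_i), with
    phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log pi(x_j) + grad_{x_j} k(x_j, x_i) ]."""
    n = X.shape[0]
    K, grad_K = rbf_kernel(X, h)
    # First term drives particles toward high-density regions of pi;
    # second term is a repulsive force that keeps the particles spread out.
    phi = (K @ grad_log_pi(X) + grad_K.sum(axis=0)) / n
    return X + step_size * phi

# Usage: sample a standard 2D Gaussian target, for which grad log pi(x) = -x.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(200, 2))  # badly initialized particles
for _ in range(1000):
    X = svgd_step(X, lambda x: -x)
print(X.mean(axis=0), X.std(axis=0))  # should approach [0, 0] and [1, 1]
```

The repulsive term $\nabla_{x_j} k(x_j, x_i)$ is what distinguishes SVGD from running independent gradient ascent on log π: it prevents the particles from collapsing onto the mode and lets the empirical measure approximate π itself.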

