SAMPLING WITH MOLLIFIED INTERACTION ENERGY DESCENT

Abstract

Sampling from a target measure whose density is known only up to a normalization constant is a fundamental problem in computational statistics and machine learning. In this paper, we present a new optimization-based method for sampling called mollified interaction energy descent (MIED). MIED minimizes a new class of energies on probability measures called mollified interaction energies (MIEs). These energies rely on mollifier functions, smooth approximations of the Dirac delta originating from PDE theory. We show that as the mollifier approaches the Dirac delta, the MIE converges to the chi-square divergence with respect to the target measure, and the minimizers of MIE converge to the target measure. Optimizing this energy with proper discretization yields a practical first-order particle-based algorithm for sampling in both unconstrained and constrained domains. We show experimentally that for unconstrained sampling problems, our algorithm performs on par with existing particle-based algorithms like SVGD, while for constrained sampling problems our method readily incorporates constrained optimization techniques to handle more flexible constraints with strong performance compared to alternatives.

1. INTRODUCTION

Sampling from an unnormalized probability density is a ubiquitous task in statistics, mathematical physics, and machine learning. While Markov chain Monte Carlo (MCMC) methods (Brooks et al., 2011) provide a way to obtain unbiased samples at the price of potentially long mixing times, variational inference (VI) methods (Blei et al., 2017) approximate the target measure with simpler (e.g., parametric) distributions at a lower computational cost. In this work, we focus on a particular class of VI methods that approximate the target measure using a collection of interacting particles. A primary example is Stein variational gradient descent (SVGD) proposed by Liu & Wang (2016), which iteratively applies deterministic updates to a set of particles to decrease the KL divergence to the target distribution. While MCMC and VI methods have found great success in sampling from unconstrained distributions, they often break down for distributions supported in a constrained domain. Constrained sampling is needed when the target density is undefined outside a given domain (e.g., the Dirichlet distribution), when the target density is not integrable in the entire Euclidean space (e.g., the uniform distribution), or when we only want samples that satisfy certain inequalities (e.g., fairness constraints in Bayesian inference (Liu et al., 2021)). A few recent approaches (Brubaker et al., 2012; Byrne & Girolami, 2013; Liu & Zhu, 2018; Shi et al., 2021) extend classical sampling methods like Hamiltonian Monte Carlo (HMC) or SVGD to constrained domains. These extensions, however, typically contain expensive numerical subroutines like solving nonlinear systems of equations and require explicit formulas for quantities such as Riemannian metric tensors or mirror maps to be derived on a case-by-case basis from the constraints.
We present an optimization-based method called mollified interaction energy descent (MIED) that minimizes mollified interaction energies (MIEs) for both unconstrained and constrained sampling. An MIE takes the form of a double integral of a mollifier (a smooth approximation of δ_0, the Dirac delta at the origin) divided by the target density, properly scaled. Intuitively, minimizing an MIE balances two types of forces: attractive forces that drive the current measure towards the target density, and repulsive forces from the mollifier that prevent collapse. We show that as the mollifier converges to δ_0, the MIE converges to the χ² divergence to the target measure up to an additive constant (Theorem 3.3). Moreover, the MIE Γ-converges to the χ² divergence (Theorem 3.6), so that minimizers of MIEs converge to the target measure, providing a theoretical basis for sampling by minimizing an MIE. While mollifiers can be interpreted as kernels with diminishing bandwidths, our analysis is fundamentally different from that of SVGD, where a fixed-bandwidth kernel is used to define a reproducing kernel Hilbert space (RKHS) on which the Stein discrepancy has a closed form (Gorham & Mackey, 2017). Deriving a version of the Stein discrepancy for constrained domains is far from trivial and requires special treatment (Shi et al., 2021; Xu, 2021). In contrast, our energy has a unified form for constrained and unconstrained domains and approximates the χ² divergence as long as the bandwidth is small enough that short-range interactions dominate the energy; this idea of using diminishing bandwidths in sampling is under-explored for methods like SVGD. Algorithmically, we use first-order optimization to minimize MIEs discretized using particles. We introduce a log-sum-exp trick that neutralizes the effect of arbitrary scaling of the mollifiers and the target density; this form also significantly improves numerical stability.
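To illustrate the particle discretization and the log-sum-exp trick, here is a minimal numerical sketch. It is not the paper's exact formulation: it assumes a Gaussian mollifier φ_ε(r) ∝ exp(-|r|²/(2ε²)) and the hypothetical pairwise energy Σ_{i≠j} φ_ε(x_i - x_j)/√(p(x_i)p(x_j)), minimized through its logarithm; the function names are placeholders.

```python
import numpy as np

def mied_loss_and_grad(x, log_p, grad_log_p, eps=0.5):
    """Log-sum-exp form of a discretized mollified interaction energy.

    Illustrative sketch only: assumes a Gaussian mollifier
    phi_eps(r) ~ exp(-|r|^2 / (2 eps^2)) and the particle energy
    sum_{i != j} phi_eps(x_i - x_j) / sqrt(p(x_i) p(x_j)), minimized
    through its logarithm, which is invariant to rescaling phi_eps or p.
    """
    diff = x[:, None, :] - x[None, :, :]           # pairwise differences (n, n, d)
    sq = (diff ** 2).sum(-1)                       # squared distances (n, n)
    lp = log_p(x)                                  # log-densities at particles (n,)
    # log of each pairwise term; additive constants in log phi_eps or log p
    # shift every entry equally, so the softmax weights below are unchanged
    f = -sq / (2.0 * eps ** 2) - 0.5 * (lp[:, None] + lp[None, :])
    np.fill_diagonal(f, -np.inf)                   # drop self-interaction
    m = f.max()
    loss = m + np.log(np.exp(f - m).sum())         # stable log-sum-exp
    w = np.exp(f - m)
    w /= w.sum()                                   # softmax weights of the logsumexp
    s = w + w.T
    row = s.sum(axis=1, keepdims=True)
    # d loss / d x_k: descending this gradient repels nearby particles
    # (mollifier term) and attracts particles along grad log p
    grad = -(row * x - s @ x) / eps ** 2 - 0.5 * row * grad_log_p(x)
    return loss, grad
```

A first-order sampler then simply iterates x -= lr * grad; the max-shift inside the log-sum-exp is what makes arbitrary constant scalings of the mollifier and the unnormalized density harmless.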
Since we turn sampling into optimization, we can readily apply existing constrained optimization techniques, such as reparameterization through a differentiable (not necessarily bijective) map or the dynamic barrier method of Gong & Liu (2021), to handle generic differentiable inequality constraints. Our method is efficient as it only uses first-order derivatives of the target density and of the inequality constraint (or the reparameterization map), enabling large-scale applications in machine learning (see e.g. Figure 4). For unconstrained sampling problems, we show MIED achieves performance comparable to particle-based algorithms like SVGD, while for constrained sampling problems, MIED demonstrates strong performance compared to alternatives while handling constraints more flexibly.
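As a sketch of the reparameterization route, consider sampling inside the open unit ball: one hypothetical smooth map (not taken from the paper) is x = z/√(1+|z|²), so particles live in unconstrained z-space and a first-order method only needs the map and its vector-Jacobian product via the chain rule.

```python
import numpy as np

def ball_map(z):
    """Hypothetical smooth reparameterization R^d -> open unit ball,
    x = z / sqrt(1 + |z|^2). Particles are optimized in the
    unconstrained z-space; every image satisfies |x| < 1 by construction."""
    norm2 = (z ** 2).sum(-1, keepdims=True)
    return z / np.sqrt(1.0 + norm2)

def ball_map_vjp(z, g):
    """Vector-Jacobian product of ball_map: pulls a gradient g with
    respect to x back to a gradient with respect to z (chain rule),
    which is all a first-order update in z-space requires."""
    norm2 = (z ** 2).sum(-1, keepdims=True)
    a = 1.0 / np.sqrt(1.0 + norm2)
    # Jacobian: d x_i / d z_j = a * delta_ij - a^3 * z_i * z_j (symmetric)
    return a * g - a ** 3 * (g * z).sum(-1, keepdims=True) * z
```

Because the constraint is absorbed into the map, the same unconstrained optimizer applies verbatim; no per-step projection or nonlinear solve is needed.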

2. RELATED WORKS

KL gradient flow and its discretization for unconstrained sampling. The Wasserstein gradient flow of the Kullback-Leibler (KL) divergence has been extensively studied, and many popular sampling algorithms can be viewed as discretizations of the KL-divergence gradient flow. Two primary examples are Langevin Monte Carlo (LMC) and Stein variational gradient descent (SVGD). LMC simulates Langevin diffusion and can be viewed as a forward-flow splitting scheme for the KL-divergence gradient flow (Wibisono, 2018). At each iteration of LMC, particles are pulled along ∇ log p, where p is the target density, while random Gaussian noise is injected; a Metropolis adjustment step is typically needed for unbiased sampling. In contrast with LMC, SVGD is a deterministic algorithm that updates a collection of particles using a combination of an attractive force involving ∇ log p and a repulsive force among the particles; it can be viewed as a kernelized gradient flow of the KL divergence (Liu, 2017) or of the χ² divergence (Chewi et al., 2020). The connection to the continuous gradient flow in the Wasserstein space is fruitful for deriving sharp convergence guarantees for these sampling algorithms (Durmus et al., 2019; Balasubramanian et al., 2022; Korba et al., 2020; Salim et al., 2022).

Sampling in constrained domains. Sampling in constrained domains is more challenging than in the unconstrained setting. Typical solutions are rejection sampling and reparameterization to an unconstrained domain. However, rejection sampling can have a high rejection rate when the constrained domain is small, while reparameterization maps need to be chosen on a case-by-case basis with a determinant-of-Jacobian term that can be costly to evaluate. Brubaker et al. (2012) propose a constrained version of HMC for sampling on implicit submanifolds, but their algorithm is expensive as it must solve a nonlinear system of equations for every integration step within each step of HMC.
Byrne & Girolami (2013) propose geodesic Hamiltonian Monte Carlo for sampling on embedded manifolds, but they require explicit geodesic formulae. Zhang et al. (2020); Ahn & Chewi (2021) propose discretizations of the mirror-Langevin diffusion for constrained sampling.
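For concreteness, the LMC update discussed at the start of this section reduces, in its unadjusted (Metropolis-free) form, to a one-line step that touches the target only through ∇ log p; this is a textbook sketch, not an algorithm from the paper:

```python
import numpy as np

def langevin_step(x, grad_log_p, step=0.01, rng=None):
    """One unadjusted Langevin Monte Carlo (ULA) update:
    x <- x + step * grad log p(x) + sqrt(2 * step) * Gaussian noise.
    Without a Metropolis accept/reject correction, the chain is biased
    for any finite step size."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(size=x.shape)
    return x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise
```

For a standard Gaussian target (grad_log_p(x) = -x), iterating this update leaves a cloud of particles whose empirical mean and variance are close to 0 and 1, up to the discretization bias.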

