DENOISING DIFFUSION SAMPLERS

Abstract

Denoising diffusion models are a popular class of generative models providing state-of-the-art results in many domains. One gradually adds noise to the data using a diffusion which transforms the data distribution into a Gaussian distribution. Samples from the generative model are then obtained by simulating an approximation of the time-reversal of this diffusion, initialized with Gaussian samples. In practice, the intractable score terms appearing in the time-reversed process are approximated using score matching techniques. We explore here a similar idea to sample approximately from unnormalized probability density functions and estimate their normalizing constants. We consider a process where the target density diffuses towards a Gaussian. Denoising Diffusion Samplers (DDS) are obtained by approximating the corresponding time-reversal. While score matching is not applicable in this context, we can leverage many of the ideas introduced in generative modeling for Monte Carlo sampling. Existing theoretical results for denoising diffusion models also provide theoretical guarantees for DDS. We discuss the connections between DDS, optimal control and Schrödinger bridges, and finally demonstrate DDS experimentally on a variety of challenging sampling tasks.

1. INTRODUCTION

Let π be a probability density on R^d of the form π(x) = γ(x)/Z, where Z = ∫_{R^d} γ(x) dx and γ : R^d → R^+ can be evaluated pointwise but the normalizing constant Z is intractable. We are interested here both in estimating Z and in obtaining approximate samples from π. A large variety of Monte Carlo techniques have been developed to address this problem. In particular, Annealed Importance Sampling (AIS) (Neal, 2001) and its Sequential Monte Carlo (SMC) extensions (Del Moral et al., 2006) are often regarded as the gold standard for computing normalizing constants. Variational techniques are a popular alternative to Markov chain Monte Carlo (MCMC) and SMC, where one considers a flexible family of easy-to-sample distributions q_θ whose parameters are optimized by minimizing a suitable metric, typically the reverse Kullback-Leibler discrepancy KL(q_θ||π). Typical choices for q_θ include mean-field approximations (Wainwright & Jordan, 2008) or normalizing flows (Papamakarios et al., 2021). To be able to model complex variational distributions, it is often useful to model q_θ(x) as the marginal of an auxiliary extended distribution, i.e. q_θ(x) = ∫ q_θ(x, u) du. As this marginal is typically intractable, θ is then learned by minimizing a discrepancy measure between q_θ(x, u) and an extended target p_θ(x, u) = π(x) p_θ(u|x), where p_θ(u|x) is an auxiliary conditional distribution (Agakov & Barber, 2004). Over recent years, Monte Carlo techniques have also been fruitfully combined with variational techniques. For example, AIS can be thought of as a procedure where q_θ(x, u) is the joint distribution of a Markov chain defined by a sequence of MCMC kernels whose final state is x, while p_θ(x, u) is the corresponding AIS extended target (Neal, 2001).
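To make the problem setup concrete, recall that Z can be estimated by plain importance sampling when the dimension is small. The sketch below is our own illustration (not code from the paper): γ is chosen as an unnormalized Gaussian so that the true Z = (2π)^{d/2} is known for checking, and the proposal q, sample size and all variable names are our assumptions.

```python
import numpy as np

# Illustrative sketch: estimate Z = ∫ γ(x) dx by importance sampling,
# Z_hat = (1/n) Σ_i γ(x_i) / q(x_i) with x_i ~ q.
rng = np.random.default_rng(0)
d = 2

def log_gamma(x):
    # Unnormalized log-density; here an isotropic Gaussian, so the true
    # normalizing constant is (2π)^{d/2} and the estimate can be checked.
    return -0.5 * np.sum(x**2, axis=-1)

# Gaussian proposal q = N(0, s^2 I), chosen wider than γ.
s = 2.0
n = 200_000
x = rng.normal(0.0, s, size=(n, d))
log_q = -0.5 * np.sum(x**2, axis=-1) / s**2 - 0.5 * d * np.log(2 * np.pi * s**2)
log_w = log_gamma(x) - log_q
# Average the weights in log space (log-sum-exp) for numerical stability.
Z_hat = np.exp(np.logaddexp.reduce(log_w) - np.log(n))
```

For d = 2 this recovers Z ≈ 2π; the variance of such estimators degrades quickly with dimension, which is precisely what motivates AIS, SMC and the samplers developed in this paper.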
The parameters θ of these kernels can then be optimized by minimizing KL(q_θ||p_θ) using stochastic gradient descent (Wu et al., 2020; Geffner & Domke, 2021; Thin et al., 2021; Zhang et al., 2021; Doucet et al., 2022; Geffner & Domke, 2022). Instead of following an AIS-type approach to define a flexible variational family, we follow here an approach inspired by Denoising Diffusion Probabilistic Models (DDPM), a powerful class of generative models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021c). In this context, one progressively adds noise to data using a diffusion so as to transform the complex data distribution into a Gaussian distribution. The time-reversal of this diffusion can then be used to transform a Gaussian sample into a sample from the target. While superficially similar to Langevin dynamics, this process mixes fast even in high dimensions as it inherits the mixing properties of the forward diffusion (De Bortoli et al., 2021, Theorem 1). However, as the time-reversal involves the derivatives of the logarithms of the intractable marginal densities of the forward diffusion, these so-called scores are approximated in practice using score matching techniques. If the score estimation error is small, the approximate time-reversal still enjoys remarkable theoretical properties (De Bortoli, 2022; Chen et al., 2022; Lee et al., 2022). These results motivate us to introduce Denoising Diffusion Samplers (DDS). Like DDPM, we consider a forward diffusion which progressively transforms the target π into a Gaussian distribution. This defines an extended target distribution p(x, u) = π(x) p(u|x). DDS are obtained by approximating the time-reversal of this diffusion using a process with distribution q_θ(x, u). What distinguishes DDS from DDPM is that we cannot simulate sample paths from the diffusion we want to time-reverse, as we cannot sample its initial state x from π. Hence score matching ideas cannot be used to approximate the score terms.
We focus on minimizing KL(q_θ||p), or equivalently maximizing an Evidence Lower Bound (ELBO), as in variational inference. We leverage a representation of this KL discrepancy based on the introduction of a suitable auxiliary reference process, which provides low-variance estimates of this objective and of its gradient. We can exploit the many similarities between DDS and DDPM to leverage some of the ideas developed in generative modeling for Monte Carlo sampling. This includes using the probability flow ordinary differential equation (ODE) (Song et al., 2021c) to derive novel normalizing flows, and the use of underdamped Langevin diffusions as a forward noising diffusion (Dockhorn et al., 2022). The implementation of these samplers requires designing numerical integrators for the resulting stochastic differential equations (SDE) and ODE. However, simple integrators such as the standard Euler-Maruyama scheme do not yield a valid ELBO in discrete time. To guarantee a valid ELBO, DDS relies instead on an integrator for an auxiliary stationary reference process which preserves its invariant distribution, together with an integrator for the approximate time-reversal inducing a distribution absolutely continuous w.r.t. the distribution of the discretized reference process. Finally, we compare DDS experimentally to AIS, SMC and other state-of-the-art Monte Carlo methods on a variety of sampling tasks.
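The distinction between integrators that preserve the invariant distribution and those that do not can be seen on a one-dimensional OU process. The sketch below is our own illustration (not the paper's implementation): it compares the one-step variance recursion of the exact OU transition, which leaves the stationary law N(0, σ²) invariant for any step size δ, with the Euler-Maruyama recursion, which is only exact in the limit δ → 0.

```python
import numpy as np

# One-dimensional OU process dx = -β x dt + σ sqrt(2β) dB with
# stationary distribution N(0, σ²). Both steps below are linear-Gaussian,
# so it suffices to track how they map the variance v of x.
beta, sigma, delta = 1.0, 1.0, 0.5

def exact_step_var(v):
    # Exact OU transition over a step of size delta:
    # x' = e^{-β δ} x + σ sqrt(1 - e^{-2 β δ}) ε, preserves v = σ² exactly.
    a = np.exp(-2.0 * beta * delta)
    return a * v + sigma**2 * (1.0 - a)

def euler_step_var(v):
    # Euler-Maruyama step: x' = x - β x δ + σ sqrt(2 β δ) ε.
    return (1.0 - beta * delta)**2 * v + 2.0 * beta * delta * sigma**2

v0 = sigma**2                  # start at stationarity, v = σ² = 1
v_exact = exact_step_var(v0)   # remains exactly 1.0 for any delta
v_euler = euler_step_var(v0)   # 1.25 here: stationarity is not preserved
```

The exact update keeps the reference process stationary at any discretization level, which is the property DDS exploits to obtain a valid discrete-time ELBO.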

2. DENOISING DIFFUSION SAMPLERS: CONTINUOUS TIME

We start here by formulating DDS in continuous time to gain insight into the structure of the time-reversal we want to approximate. We introduce C = C([0, T], R^d), the space of continuous functions from [0, T] to R^d, and B(C), the Borel sets on C. In this section we will consider path measures, which are probability measures on (C, B(C)); see Léonard (2014a) for a formal definition. Numerical integrators are discussed in the following section.

2.1. FORWARD DIFFUSION AND ITS TIME-REVERSAL

Consider the forward noising diffusion given by an Ornstein-Uhlenbeck (OU) process¹

dx_t = -β_t x_t dt + σ √(2β_t) dB_t,   x_0 ∼ π,   (2)

where (B_t)_{t∈[0,T]} is a d-dimensional Brownian motion and t → β_t is a non-decreasing positive function. This diffusion induces the path measure P on the time interval [0, T], and the marginal density of x_t is denoted p_t. The transition density of this diffusion is given by p_{t|0}(x_t|x_0) = N(x_t; √(1-λ_t) x_0, σ² λ_t I), where λ_t = 1 - exp(-2 ∫_0^t β_s ds). From now on, we will always consider a scenario where ∫_0^T β_s ds ≫ 1, so that p_T(x_T) ≈ N(x_T; 0, σ² I). We can thus think of (2) as transporting the target density π approximately to this Gaussian density.
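Since the transition density of the OU process is Gaussian, x_t can be simulated exactly without discretizing the SDE. A minimal sketch of this (our own notation; the constant schedule β_t = β and all parameter values are assumptions made for illustration) draws x_t ∼ N(√(1-λ_t) x_0, σ² λ_t I) and checks that after ∫_0^T β_s ds ≫ 1 the samples are approximately N(0, σ² I), even when x_0 starts far from the origin.

```python
import numpy as np

# Exact simulation of the OU (Variance Preserving) transition
# p_{t|0}(x_t | x_0) = N(sqrt(1 - λ_t) x_0, σ² λ_t I),
# λ_t = 1 - exp(-2 ∫_0^t β_s ds), here with constant β_t = β.
rng = np.random.default_rng(1)
d, sigma, beta, T = 2, 1.0, 4.0, 1.0

def sample_xt(x0, t):
    lam = 1.0 - np.exp(-2.0 * beta * t)   # ∫_0^t β_s ds = β t for constant β
    mean = np.sqrt(1.0 - lam) * x0        # equals exp(-β t) * x0
    return mean + sigma * np.sqrt(lam) * rng.normal(size=x0.shape)

# Start all particles far from the origin; with ∫_0^T β_s ds = 4 ≫ 1,
# the terminal marginal is close to N(0, σ² I).
x0 = np.full((10_000, d), 5.0)
xT = sample_xt(x0, T)
```

This closed-form sampler is only available for the forward process; the point of DDS is that the time-reversal, which is what one needs for sampling from π, has no such closed form.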



¹ This is referred to as a Variance Preserving diffusion by Song et al. (2021c).

