DENOISING DIFFUSION SAMPLERS

Abstract

Denoising diffusion models are a popular class of generative models providing state-of-the-art results in many domains. One gradually adds noise to data using a diffusion, transforming the data distribution into a Gaussian distribution. Samples from the generative model are then obtained by simulating an approximation of the time-reversal of this diffusion, initialized by Gaussian samples. In practice, the intractable score terms appearing in the time-reversed process are approximated using score matching techniques. We explore here a similar idea to sample approximately from unnormalized probability density functions and estimate their normalizing constants. We consider a process where the target density diffuses towards a Gaussian. Denoising Diffusion Samplers (DDS) are obtained by approximating the corresponding time-reversal. While score matching is not applicable in this context, we can leverage many of the ideas introduced in generative modeling for Monte Carlo sampling. Existing theoretical results for denoising diffusion models also provide theoretical guarantees for DDS. We discuss the connections between DDS, optimal control and Schrödinger bridges, and finally demonstrate DDS experimentally on a variety of challenging sampling tasks.

1. INTRODUCTION

Let π be a probability density on R^d of the form π(x) = γ(x)/Z, Z = ∫_{R^d} γ(x) dx, where γ : R^d → R^+ can be evaluated pointwise but the normalizing constant Z is intractable. We are interested here both in estimating Z and in obtaining approximate samples from π. A large variety of Monte Carlo techniques have been developed to address this problem. In particular, Annealed Importance Sampling (AIS) (Neal, 2001) and its Sequential Monte Carlo (SMC) extensions (Del Moral et al., 2006) are often regarded as the gold standard for computing normalizing constants. Variational techniques are a popular alternative to Markov chain Monte Carlo (MCMC) and SMC, where one considers a flexible family of easy-to-sample distributions q_θ whose parameters are optimized by minimizing a suitable metric, typically the reverse Kullback-Leibler discrepancy KL(q_θ||π). Typical choices for q_θ include mean-field approximations (Wainwright & Jordan, 2008) or normalizing flows (Papamakarios et al., 2021). To model complex variational distributions, it is often useful to express q_θ(x) as the marginal of an auxiliary extended distribution, i.e. q_θ(x) = ∫ q_θ(x, u) du. As this marginal is typically intractable, θ is then learned by minimizing a discrepancy measure between q_θ(x, u) and an extended target p_θ(x, u) = π(x) p_θ(u|x), where p_θ(u|x) is an auxiliary conditional distribution (Agakov & Barber, 2004). Over recent years, Monte Carlo techniques have also been fruitfully combined with variational techniques. For example, AIS can be thought of as a procedure where q_θ(x, u) is the joint distribution of a Markov chain defined by a sequence of MCMC kernels whose final state is x, while p_θ(x, u) is the corresponding AIS extended target (Neal, 2001).
The parameters θ of these kernels can then be optimized by minimizing KL(q_θ||p_θ) using stochastic gradient descent (Wu et al., 2020; Geffner & Domke, 2021; Thin et al., 2021; Zhang et al., 2021; Doucet et al., 2022; Geffner & Domke, 2022). Instead of following an AIS-type approach to define a flexible variational family, we follow here an approach inspired by Denoising Diffusion Probabilistic Models (DDPM), a powerful class of generative models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021c). In this context, one progressively adds noise to data using a diffusion so as to transform the complex data distribution into a Gaussian distribution. The time-reversal of this diffusion can then be used to transform a Gaussian sample into a sample from the target. While superficially similar to Langevin dynamics, this process mixes fast even in high dimensions as it inherits the mixing properties of the forward diffusion (De Bortoli et al., 2021, Theorem 1). However, as the time-reversal involves the gradients of the logarithms of the intractable marginal densities of the forward diffusion, these so-called scores are approximated in practice using score matching techniques. If the score estimation error is small, the approximate time-reversal still enjoys remarkable theoretical properties (De Bortoli, 2022; Chen et al., 2022; Lee et al., 2022). These results motivate us to introduce Denoising Diffusion Samplers (DDS). Like DDPM, we consider a forward diffusion which progressively transforms the target π into a Gaussian distribution. This defines an extended target distribution p(x, u) = π(x) p(u|x). DDS are obtained by approximating the time-reversal of this diffusion using a process of distribution q_θ(x, u). What distinguishes DDS from DDPM is that we cannot simulate sample paths from the diffusion we want to time-reverse, as we cannot sample its initial state x from π. Hence score matching ideas cannot be used to approximate the score terms.
We focus on minimizing KL(q_θ||p), equivalently maximizing an Evidence Lower Bound (ELBO), as in variational inference. We leverage a representation of this KL discrepancy based on the introduction of a suitable auxiliary reference process that provides low-variance estimates of this objective and its gradient. We can exploit the many similarities between DDS and DDPM to leverage some of the ideas developed in generative modeling for Monte Carlo sampling. This includes using the probability flow ordinary differential equation (ODE) (Song et al., 2021c) to derive novel normalizing flows, and the use of underdamped Langevin diffusions as a forward noising diffusion (Dockhorn et al., 2022). The implementation of these samplers requires designing numerical integrators for the resulting stochastic differential equations (SDEs) and ODEs. However, simple integrators such as the standard Euler-Maruyama scheme do not yield a valid ELBO in discrete time. To guarantee a valid ELBO, DDS relies instead on an integrator for an auxiliary stationary reference process which preserves its invariant distribution, as well as an integrator for the approximate time-reversal inducing a distribution absolutely continuous w.r.t. the distribution of the discretized reference process. Finally, we compare DDS experimentally to AIS, SMC and other state-of-the-art Monte Carlo methods on a variety of sampling tasks.

2. DENOISING DIFFUSION SAMPLERS: CONTINUOUS TIME

We start here by formulating DDS in continuous time to gain insight into the structure of the time-reversal we want to approximate. We introduce C = C([0, T], R^d), the space of continuous functions from [0, T] to R^d, and B(C), the Borel sets on C. We will consider in this section path measures, which are probability measures on (C, B(C)); see Léonard (2014a) for a formal definition. Numerical integrators are discussed in the following section.

2.1. FORWARD DIFFUSION AND ITS TIME-REVERSAL

Consider the forward noising diffusion given by an Ornstein-Uhlenbeck (OU) process

dx_t = -β_t x_t dt + σ √(2β_t) dB_t,  x_0 ∼ π,  (2)

where (B_t)_{t∈[0,T]} is a d-dimensional Brownian motion and t ↦ β_t is a non-decreasing positive function. This diffusion induces the path measure P on the time interval [0, T], and the marginal density of x_t is denoted p_t. The transition density of this diffusion is given by p_{t|0}(x_t|x_0) = N(x_t; √(1-λ_t) x_0, σ² λ_t I), where λ_t = 1 - exp(-2 ∫_0^t β_s ds). From now on, we will always consider a scenario where ∫_0^T β_s ds ≫ 1, so that p_T(x_T) ≈ N(x_T; 0, σ² I). We can thus think of (2) as approximately transporting the target density π to this Gaussian density. From (Haussmann & Pardoux, 1986), its time-reversal (y_t)_{t∈[0,T]} = (x_{T-t})_{t∈[0,T]}, where equality holds in distribution, satisfies

dy_t = β_{T-t} {y_t + 2σ² ∇ ln p_{T-t}(y_t)} dt + σ √(2β_{T-t}) dW_t,  y_0 ∼ p_T,  (3)

where (W_t)_{t∈[0,T]} is another d-dimensional Brownian motion. By definition this time-reversal starts from y_0 ∼ p_T(y_0) ≈ N(y_0; 0, σ² I) and is such that y_T ∼ π. This suggests that if we could approximately simulate the diffusion (3), then we would obtain approximate samples from π. However, putting this idea into practice requires being able to approximate the intractable scores (∇ ln p_t(x))_{t∈[0,T]}. To achieve this, DDPM rely on score matching techniques (Hyvärinen, 2005; Vincent, 2011), as it is easy to sample from (2). This is impossible in our scenario, as sampling from (2) requires sampling x_0 ∼ π, which is impossible by assumption.
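The transition density above can be checked numerically. The following NumPy sketch (function names are ours; it assumes a constant schedule β_t ≡ β for illustration) samples x_t | x_0 and matches its first two moments:

```python
import numpy as np

def lambda_t(t, beta=1.0):
    """lambda_t = 1 - exp(-2 int_0^t beta_s ds), here with constant beta_s = beta."""
    return 1.0 - np.exp(-2.0 * beta * t)

def sample_forward(x0, t, sigma=1.0, beta=1.0, rng=None):
    """Sample x_t | x_0 ~ N(sqrt(1 - lambda_t) x_0, sigma^2 lambda_t I)."""
    rng = np.random.default_rng(rng)
    lam = lambda_t(t, beta)
    return np.sqrt(1.0 - lam) * x0 + sigma * np.sqrt(lam) * rng.standard_normal(np.shape(x0))
```

For ∫_0^T β_s ds ≫ 1, `lambda_t` approaches 1 and the samples become (approximately) N(0, σ²I), regardless of x_0, as the text describes.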

2.2. REFERENCE DIFFUSION AND VALUE FUNCTION

In our context, it is useful to introduce a reference process defined by the diffusion (2) but initialized at p_0^ref(x_0) = N(x_0; 0, σ² I) rather than π(x_0), thus ensuring that the marginals of the resulting path measure P^ref all satisfy p_t^ref(x_t) = N(x_t; 0, σ² I). From the chain rule for the KL divergence between path measures (see e.g. (Léonard, 2014b, Theorem 2.4) and (De Bortoli et al., 2021, Proposition 24)), one has

KL(Q||P^ref) = KL(q_0||p_0^ref) + E_{x_0∼q_0}[KL(Q(·|x_0)||P^ref(·|x_0))].

Thus P can be identified as the path measure minimizing the KL discrepancy w.r.t. P^ref over the set of path measures Q with marginal q_0(x_0) = π(x_0) at time t = 0, i.e. P = arg min_Q {KL(Q||P^ref) : q_0 = π}. A time-reversal representation of P^ref is given by (y_t)_{t∈[0,T]} = (x_{T-t})_{t∈[0,T]} satisfying

dy_t = -β_{T-t} y_t dt + σ √(2β_{T-t}) dW_t,  y_0 ∼ p_0^ref.  (4)

As β_{T-t}{y_t + 2σ² ∇ ln p^ref_{T-t}(y_t)} = -β_{T-t} y_t, we can rewrite the time-reversal (3) of P as

dy_t = -β_{T-t} {y_t - 2σ² ∇ ln φ_{T-t}(y_t)} dt + σ √(2β_{T-t}) dW_t,  y_0 ∼ p_T,  (5)

where φ_t(x) = p_t(x)/p_t^ref(x) is a so-called value function, which can be shown to satisfy a Kolmogorov equation such that φ_t(x_t) = E_{P^ref}[φ_0(x_0)|x_t], the expectation being w.r.t. P^ref.

2.3. LEARNING THE TIME-REVERSAL

To approximate the time-reversal (3) of P, consider a path measure Q^θ whose time-reversal is induced by

dy_t = β_{T-t} {y_t + 2σ² s_θ(T-t, y_t)} dt + σ √(2β_{T-t}) dW_t,  y_0 ∼ N(0, σ² I),  (6)

so that y_t ∼ q^θ_{T-t}. To obtain s_θ(t, x) ≈ ∇ ln p_t(x), we parameterize s_θ(t, x) by a neural network whose parameters are obtained by minimizing

KL(Q^θ||P) = KL(N(0, σ² I)||p_T) + E_{y_0∼N(0,σ² I)}[KL(Q^θ(·|y_0)||P(·|y_0))]  (7)
           = KL(N(0, σ² I)||p_T) + σ² E_{Q^θ}[∫_0^T β_{T-t} ||s_θ(T-t, y_t) - ∇ ln p_{T-t}(y_t)||² dt],

where we have used the chain rule for KL and then Girsanov's theorem (see e.g. (Klebaner, 2012)). This expression of the KL is reminiscent of the expression obtained in (Song et al., 2021b, Theorem 1) in the context of DDPM. However, the expectation appearing in (7) is here w.r.t. Q^θ and not w.r.t. P, and we cannot get rid of the intractable scores (∇ ln p_t(x))_{t∈[0,T]} using score matching ideas. Instead, taking inspiration from (5), we reparameterize the time-reversal of Q^θ using

dy_t = -β_{T-t} {y_t - 2σ² f_θ(T-t, y_t)} dt + σ √(2β_{T-t}) dW_t,  y_0 ∼ N(0, σ² I),  (8)

i.e. f_θ(t, x) = s_θ(t, x) - ∇ ln p_t^ref(x) = s_θ(t, x) + x/σ². This reparameterization allows us to express KL(Q^θ||P) in a compact form.

Proposition 1. The Radon-Nikodym derivative (dQ^θ/dP^ref)(y_{[0,T]}) satisfies under Q^θ

ln (dQ^θ/dP^ref) = σ² ∫_0^T β_{T-t} ||f_θ(T-t, y_t)||² dt + σ ∫_0^T √(2β_{T-t}) f_θ(T-t, y_t)^⊤ dW_t.  (9)

From the identity KL(Q^θ||P) = KL(Q^θ||P^ref) + E_{y_T∼q^θ_0}[ln (p_0^ref(y_T)/p_0(y_T))], it follows that

KL(Q^θ||P) = E_{Q^θ}[σ² ∫_0^T β_{T-t} ||f_θ(T-t, y_t)||² dt + ln (N(y_T; 0, σ² I)/π(y_T))].  (10)

For θ minimizing (10), approximate samples from π can be obtained by simulating (8) and returning y_T ∼ q^θ_0. We can obtain an unbiased estimate of Z via the following importance sampling identity

Ẑ = (γ(y_T)/N(y_T; 0, σ² I)) (dP^ref/dQ^θ)(y_{[0,T]}),  (11)

where the second term can be computed directly from (9) and y_{[0,T]} is obtained by solving (8).

2.4. CONTINUOUS-TIME NORMALIZING FLOW

To approximate the log-likelihood of DDPM, it was proposed by Song et al. (2021c) to use the probability flow ODE, an ODE whose marginals match those of the corresponding SDE. In our setting, substituting f_θ into the probability flow equations yields

dy_t = σ² β_{T-t} f_θ(T-t, y_t) dt,  y_0 ∼ N(0, σ² I).  (12)

Denoting by q̃^θ_{T-t} the distribution of y_t, then y_T ∼ q̃^θ_0 is an approximate sample from π. We can use this normalizing flow to obtain an unbiased estimate of Z using importance sampling, Ẑ = γ(y_T)/q̃^θ_0(y_T) for y_T ∼ q̃^θ_0. Indeed, contrary to the proposal q^θ_0 induced by (8), we can compute q̃^θ_0 pointwise using the instantaneous change of variables formula (Chen et al., 2018), such that

ln q̃^θ_0(y_T) = ln N(y_0; 0, σ² I) - σ² ∫_0^T β_{T-t} ∇ · f_θ(T-t, y_t) dt,

where (y_t)_{t∈[0,T]} arises from the ODE (12). The expensive divergence term can be estimated using the Hutchinson estimator (Grathwohl et al., 2018; Song et al., 2021c).
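The Hutchinson estimator replaces the exact trace of the Jacobian of f_θ by an average of quadratic forms E[v^⊤ J v] over random probes v with E[v v^⊤] = I. A minimal illustrative sketch (our function names; we form the full Jacobian for clarity, whereas practical implementations use Jacobian-vector products):

```python
import numpy as np

def hutchinson_divergence(jac, x, n_probes=1000, rng=None):
    """Estimate div f(x) = tr(J_f(x)) via E[v^T J v] with Rademacher probes v."""
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    J = jac(x)  # (d, d) Jacobian evaluated at x
    est = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=d)  # Rademacher probe, E[v v^T] = I
        est += v @ J @ v
    return est / n_probes
```

For a linear map f(x) = A x the estimator converges to tr(A); the variance comes only from the off-diagonal entries of the Jacobian.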

2.5. EXTENSION TO UNDERDAMPED LANGEVIN DYNAMICS

For DDPM, it has been proposed to extend the original state x ∈ R^d by a momentum variable m ∈ R^d. One then diffuses the data distribution, augmented by a Gaussian distribution on the momentum, using an underdamped Langevin dynamics targeting N(x; 0, σ² I) N(m; 0, M), where M is a positive definite mass matrix (Dockhorn et al., 2022). It was demonstrated empirically that the resulting scores are smoother, hence easier to estimate, and that this leads to improved performance. We adapt this approach here to Monte Carlo sampling; see Section B for more details. We diffuse π(x, m) = π(x) N(m; 0, M) using an underdamped Langevin dynamics, i.e.

dx_t = M^{-1} m_t dt,  dm_t = -(x_t/σ²) dt - β_t m_t dt + √(2β_t) M^{1/2} dB_t,  (x_0, m_0) ∼ π.  (13)

The resulting path measure on [0, T] is denoted P and the marginal of (x_t, m_t) is denoted η_t. Here, the reference process P^ref is defined by the diffusion (13) initialized using x_0 ∼ N(0, σ² I), m_0 ∼ N(0, M). This ensures that η^ref_t(x_t, m_t) = N(x_t; 0, σ² I) N(m_t; 0, M) for all t, and the time-reversal process (y_t, n_t)_{t∈[0,T]} = (x_{T-t}, m_{T-t})_{t∈[0,T]} (where equality is in distribution) of this stationary diffusion satisfies

dy_t = -M^{-1} n_t dt,  dn_t = (y_t/σ²) dt - β_{T-t} n_t dt + √(2β_{T-t}) M^{1/2} dW_t.

Using manipulations identical to Section 2.2, the time-reversal of P can also be written, for φ_t(x, m) := η_t(x, m)/η^ref_t(x, m), as

dy_t = -M^{-1} n_t dt,  dn_t = (y_t/σ²) dt - β_{T-t} n_t dt + 2β_{T-t} ∇_{n_t} ln φ_{T-t}(y_t, n_t) dt + √(2β_{T-t}) M^{1/2} dW_t.

To approximate P, we consider a parameterized path measure Q^θ whose time-reversal is defined, for (y_0, n_0) ∼ N(y_0; 0, σ² I) N(n_0; 0, M), by

dy_t = -M^{-1} n_t dt,  dn_t = (y_t/σ²) dt - β_{T-t} n_t dt + 2β_{T-t} M f_θ(T-t, y_t, n_t) dt + √(2β_{T-t}) M^{1/2} dW_t.

We can then express the KL of interest in a compact way similar to Proposition 1; see Appendix B.2. A normalising flow formulation is presented in Appendix B.3.
Algorithm 1 DDS Training
1: Input: σ > 0, γ : R^d → R^+, (β_k)_{k=1}^K ∈ (R^+)^K, initial θ ∈ R^p, learning rate λ > 0
2: for i = 1, 2, . . . do
3:   for n = 1, . . . , N do  # Iterate over samples in the batch.
4:     Sample y_{0,n} ∼ N(0, σ² I) and set r_{0,n} = 0
5:     for k = 0, . . . , K - 1 do  # Iterate over integration steps.
6:       λ_{K-k} := 1 - √(1 - α_{K-k})
7:       y_{k+1,n} = √(1 - α_{K-k}) y_{k,n} + 2σ² λ_{K-k} f_θ(K - k, y_{k,n}) + σ √(α_{K-k}) ε_{k,n},  ε_{k,n} i.i.d. ∼ N(0, I)
8:       r_{k+1,n} = r_{k,n} + 2σ² (λ²_{K-k}/α_{K-k}) ||f_θ(K - k, y_{k,n})||²
9:     end for
10:  end for
11:  θ ← θ - λ ∇_θ (1/N) Σ_{n=1}^N [r_{K,n} + ln (N(y_{K,n}; 0, σ² I)/π(y_{K,n}))]
12: end for
13: Return: θ
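The training loop of Algorithm 1 can be sketched as follows. This is a toy, self-contained version under simplifying assumptions we make for illustration: scalar state, constant β schedule, and a linear drift θ·y standing in for the network f_θ; it evaluates the objective for one batch rather than taking a gradient step.

```python
import numpy as np

def dds_loss(theta, log_gamma, K=32, sigma=1.0, n_batch=64, rng=None):
    """One evaluation of the DDS objective (inner loop of Algorithm 1), with the
    toy drift f_theta(k, y) = theta * y in place of the neural network."""
    rng = np.random.default_rng(rng)
    # Constant schedule: alpha_k = 1 - exp(-2 * beta * delta) with beta = 1, delta = 1/K.
    alphas = (1.0 - np.exp(-2.0 / K)) * np.ones(K)
    y = sigma * rng.standard_normal(n_batch)  # y_0 ~ N(0, sigma^2)
    r = np.zeros(n_batch)
    for k in range(K):
        a = alphas[K - 1 - k]
        lam = 1.0 - np.sqrt(1.0 - a)
        f = theta * y                          # f_theta(K - k, y_k)
        eps = rng.standard_normal(n_batch)
        y = np.sqrt(1.0 - a) * y + 2.0 * sigma**2 * lam * f + sigma * np.sqrt(a) * eps
        r += 2.0 * sigma**2 * (lam**2 / a) * f**2
    # ln N(y_K; 0, sigma^2) - ln gamma(y_K); using gamma instead of pi only
    # shifts the objective by the constant ln Z.
    log_ref = -0.5 * y**2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    return np.mean(r + log_ref - log_gamma(y))
```

With θ = 0 the proposal coincides with the reference process, so the objective reduces to an estimate of KL(N(0, σ²I)||π); it vanishes exactly when the target is that Gaussian.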

3. DENOISING DIFFUSION SAMPLERS: DISCRETE-TIME

We introduce here discrete-time integrators for the SDEs and ODEs introduced in Section 2. Contrary to DDPM, we not only need an integrator for Q^θ but also one for P^ref, so as to be able to compute an approximation of the Radon-Nikodym derivative dQ^θ/dP^ref. Additionally, this integrator needs to be carefully designed, as explained below, to preserve an ELBO. For the sake of simplicity, we consider a constant discretization step δ > 0 such that K = T/δ is an integer. In the indices, we write k for t_k = kδ to simplify notation, e.g. x_k for x_{kδ}.

3.1. INTEGRATOR FOR P ref

Proposition 1 shows that learning the time-reversal in continuous time can be achieved by minimising the objective KL(Q^θ||P) given in (10). This expression is obtained by using the identity KL(Q^θ||P) = KL(Q^θ||P^ref) + E_{y_T∼q^θ_0}[ln(p_0^ref(y_T)/p_0(y_T))]. This is also equivalent to maximizing the corresponding ELBO, E_{Q^θ}[ln Ẑ], for the unbiased estimate Ẑ of Z given in (11). An Euler-Maruyama (EM) discretisation of the corresponding SDEs is obviously applicable. However, it is problematic, as established below.

Proposition 2. Consider integrators of P^ref and Q^θ leading to approximations p^ref(x_{0:K}) and q^θ(x_{0:K}). Then KL(q^θ||p^ref) + E_{y_K∼q^θ_0}[ln(p_0^ref(y_K)/π(y_K))] is guaranteed to be non-negative, and E_{Q^θ}[ln Ẑ] is an ELBO for Ẑ = (γ(y_K) p^ref(y_{0:K}))/(N(y_K; 0, σ² I) q^θ(y_{0:K})), if one has p_K^ref(y_K) = p_0^ref(y_K). The EM discretisation of P^ref does not satisfy this condition.

A direct consequence of this result is that the estimator for ln Z that uses the EM discretisation can be such that E_{Q^θ}[ln Ẑ] ≥ ln Z, as observed empirically in Table 5. To resolve this issue, we derive in the next section an integrator for P^ref which ensures that p_k^ref(y) = p_0^ref(y) for all k.
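The failure mode identified in Proposition 2 can be seen from a one-step variance computation (a sketch with our function names, constant β and step δ): starting from the invariant variance σ², one EM step of (2) inflates the variance to σ²(1 + β²δ²), while the exact OU step derived in the next section preserves it.

```python
import numpy as np

def em_step_var(var, beta, delta, sigma=1.0):
    """Variance after one Euler-Maruyama step
    x' = (1 - beta*delta) x + sigma*sqrt(2*beta*delta) eps."""
    return (1.0 - beta * delta) ** 2 * var + 2.0 * beta * delta * sigma**2

def exact_step_var(var, beta, delta, sigma=1.0):
    """Variance after one exact OU step x' = sqrt(1-alpha) x + sigma*sqrt(alpha) eps,
    with alpha = 1 - exp(-2*beta*delta)."""
    alpha = 1.0 - np.exp(-2.0 * beta * delta)
    return (1.0 - alpha) * var + alpha * sigma**2
```

Iterating `em_step_var` therefore drifts away from the stationary variance, so p^ref_K ≠ p^ref_0 under EM, whereas `exact_step_var` is a fixed point at var = σ².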

3.2. ORNSTEIN-UHLENBECK

We can integrate (4) exactly. This is given by y_0 ∼ N(0, σ² I) and

y_{k+1} = √(1 - α_{K-k}) y_k + σ √(α_{K-k}) ε_k,  ε_k i.i.d. ∼ N(0, I),  (17)

for α_k = 1 - exp(-2 ∫_{(k-1)δ}^{kδ} β_s ds). This defines the discrete-time reference process. We propose the following exponential-type integrator (De Bortoli, 2022) for (8), initialized using y_0 ∼ N(0, σ² I), where

y_{k+1} = √(1 - α_{K-k}) y_k + 2σ² (1 - √(1 - α_{K-k})) f_θ(K - k, y_k) + σ √(α_{K-k}) ε_k,  ε_k i.i.d. ∼ N(0, I).  (18)

Equations (17) and (18) define the time-reversals of the reference process p^ref(x_{0:K}) and proposal q^θ(x_{0:K}). For its time-reversal, we write q^θ(y_{0:K}) = N(y_0; 0, σ² I) Π_{k=1}^K q^θ_{k-1|k}(y_{K-k+1}|y_{K-k}), slightly abusing notation, and similarly for p^ref(y_{0:K}). We will be relying on the discrete-time counterpart of (9).

Proposition 3. The log-density ratio between q^θ(y_{0:K}) and p^ref(y_{0:K}) satisfies, for y_{0:K} ∼ q^θ(y_{0:K}),

ln (q^θ(y_{0:K})/p^ref(y_{0:K})) = 2σ² Σ_{k=1}^K (λ_k²/α_k) ||f_θ(k, y_{K-k})||² + 2σ Σ_{k=1}^K (λ_k/√α_k) f_θ(k, y_{K-k})^⊤ ε_k,  (19)

where λ_k := 1 - √(1 - α_k) and ε_k, defined through (18), is such that ε_k i.i.d. ∼ N(0, I). In particular, one obtains from KL(q^θ||p) = KL(q^θ||p^ref) + E_{q^θ_0}[ln(N(y_K; 0, σ² I)/π(y_K))] that

KL(q^θ||p) = E_{q^θ}[2σ² Σ_{k=1}^K (λ_k²/α_k) ||f_θ(k, y_{K-k})||² + ln (N(y_K; 0, σ² I)/π(y_K))].  (20)

We compute an unbiased gradient of this objective using the reparameterization trick and the JAX software package (Bradbury et al., 2018). The training procedure is summarized in Algorithm 1 in Appendix D. Unfortunately, contrary to DDPM, q^θ_{k|K} is not available in closed form for k < K - 1, so we can neither use mini-batches over the time index k without having to simulate the process until the minimum sampled time, nor reparameterize x_k as in Ho et al. (2020). Once we obtain the parameter θ minimizing (20), DDS samples from q^θ using (18). The final sample y_K has a distribution q^θ_0 approximating π by design.
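Proposition 3 can be verified numerically: along any simulated trajectory of (18), the closed-form expression (19) must agree path-by-path with the log-ratio of the Gaussian transition densities of the two processes. A sketch for d = 1 with a constant schedule (our function names):

```python
import numpy as np

def log_rn_path(f, K, sigma=1.0, beta=1.0, delta=0.05, rng=None):
    """Simulate y_{0:K} from the exponential integrator and return both the
    closed-form log Radon-Nikodym term of Proposition 3 and a direct evaluation
    of log q(y_{0:K}) - log p_ref(y_{0:K}) from the Gaussian transitions."""
    rng = np.random.default_rng(rng)
    alpha = (1.0 - np.exp(-2.0 * beta * delta)) * np.ones(K + 1)  # constant schedule
    y = sigma * rng.standard_normal()  # y_0 ~ N(0, sigma^2)
    formula, direct = 0.0, 0.0
    for k in range(K):
        a = alpha[K - k]
        lam = 1.0 - np.sqrt(1.0 - a)
        fk = f(K - k, y)
        eps = rng.standard_normal()
        y_next = np.sqrt(1.0 - a) * y + 2.0 * sigma**2 * lam * fk + sigma * np.sqrt(a) * eps
        # Closed-form increment from (19).
        formula += 2.0 * sigma**2 * (lam**2 / a) * fk**2 + 2.0 * sigma * (lam / np.sqrt(a)) * fk * eps
        # Direct log-ratio of the two Gaussian transition densities (equal variances).
        mu_q = np.sqrt(1.0 - a) * y + 2.0 * sigma**2 * lam * fk
        mu_p = np.sqrt(1.0 - a) * y
        direct += (-(y_next - mu_q) ** 2 + (y_next - mu_p) ** 2) / (2.0 * sigma**2 * a)
        y = y_next
    return formula, direct
```

The two returned quantities coincide up to floating-point error for any drift f, which is the content of Proposition 3.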
By using importance sampling, we obtain an unbiased estimate of the normalizing constant, Ẑ = (γ(y_K) p^ref(y_{0:K}))/(N(y_K; 0, σ² I) q^θ(y_{0:K})) for y_{0:K} ∼ q^θ(y_{0:K}). Finally, Appendices B.4 and B.6 extend this approach to discretize the underdamped dynamics proposed in Section 2.5. In this context, the proposed integrators rely on a leapfrog scheme (Leimkuhler & Matthews, 2016).
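As a sanity check of this estimator, note that before training (f_θ = 0) the proposal coincides with the discrete-time reference process, so the path ratio cancels and the weight collapses to γ(y_K)/N(y_K; 0, σ² I), which is unbiased for Z. A sketch for a 1-d Gaussian γ with known Z = √(2π) (our function name):

```python
import numpy as np

def z_hat_reference(log_gamma, n=200_000, sigma=1.0, rng=None):
    """Importance sampling estimate of Z with the untrained (f_theta = 0) proposal,
    for which q = p_ref and the weight is gamma(y_K) / N(y_K; 0, sigma^2)."""
    rng = np.random.default_rng(rng)
    y = sigma * rng.standard_normal(n)  # y_K ~ N(0, sigma^2) under the reference
    log_ref = -0.5 * y**2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    return np.mean(np.exp(log_gamma(y) - log_ref))
```

Training f_θ does not change the unbiasedness, only the variance of the weights.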

3.3. THEORETICAL GUARANTEES

Motivated by DDPM, bounds on the total variation distance between the target distribution and the distribution of the samples generated by a time-discretization of an approximate time-reversal of a forward noising diffusion were first obtained in (De Bortoli et al., 2021), then refined in (De Bortoli, 2022; Chen et al., 2022; Lee et al., 2022) and extended to the Wasserstein metric. These results are directly applicable to DDS, because their proofs rely on assumptions on the score approximation error but not on the way these approximations are learned, i.e. via score matching for DDPM and reverse KL for DDS. For example, the main result in Chen et al. (2022) shows that, if the true scores are L-Lipschitz, the L²(p_t) error on the scores is bounded and other mild integrability assumptions hold, then DDS outputs samples ϵ-close in total variation to π in O(L² d/ϵ²) time steps. As pointed out by the authors, this matches state-of-the-art complexity bounds for the Langevin Monte Carlo algorithm for sampling targets satisfying a log-Sobolev inequality, without having to make any log-concavity assumption on π. However, the assumption on the approximation error for the score estimates is less realistic for DDS than for DDPM, as we do not observe realizations of the forward diffusion.

3.4. INTERPRETATION, RELATED WORK AND EXTENSIONS

DDS as KL and path integral control. The reverse KL we minimize can be expressed as

KL(q^θ||p) = E_{q^θ}[ln (N(x_0; 0, σ² I)/π(x_0)) + Σ_{k=1}^K ln (q^θ_{k-1|k}(x_{k-1}|x_k)/p^ref_{k-1|k}(x_{k-1}|x_k))].  (21)

This objective is a specific example of a KL control problem (Kappen et al., 2012). In continuous time, (10) corresponds to a path integral control problem; see e.g. (Kappen & Ruiz, 2016). Heng et al. (2020) use KL control ideas to sample from a target π and estimate Z. However, their algorithm relies on p^ref(x_{0:K}) being defined by a discretized non-homogeneous Langevin dynamics, such that p_K(x_K) is typically not approximating a known distribution. Additionally, it approximates the value functions (φ_k)_{k=0}^K by regression onto simple linear/quadratic functions. Finally, it relies on a good initial estimate of p(x_{0:K}) obtained through SMC. This limits the applicability of this methodology to a restricted class of models.

Connections to Schrödinger bridges. The Schrödinger bridge (SB) problem (Léonard, 2014a; De Bortoli et al., 2021) takes the following form in discrete time. Given a reference density p^ref(x_{0:K}), we want to find the density p^sb(x_{0:K}) s.t. p^sb = arg min_q {KL(q||p^ref) : q_0 = µ_0, q_K = µ_K}, where µ_0, µ_K are prescribed distributions. This problem can be solved using iterative proportional fitting (IPF), defined by the following recursion with initialization p¹ = p^ref:

p^{2n} = arg min_q {KL(q||p^{2n-1}) : q_0 = µ_0},  p^{2n+1} = arg min_q {KL(q||p^{2n}) : q_K = µ_K}.  (22)

Consider the SB problem where µ_0(x_0) = π(x_0), µ_K(x_K) = N(x_K; 0, σ² I) and the time-reversal of p^ref(x_{0:K}) is defined through (17). In this case, p² = p corresponds to the discrete-time version of the noising process, and p³ to the time-reversal of p but initialized at µ_K instead of p_K. This is the process DDS is approximating. As p_K ≈ µ_K for K large enough, we have approximately p^sb ≈ p³ ≈ p².
We can thus think of DDS as approximating the solution to this SB problem.

Consider now another SB problem where µ_0(x_0) = π(x_0), µ_K(x_K) = δ_0(x_K) and p^ref(x_{0:K}) = δ_0(x_K) Π_{k=0}^{K-1} N(x_k; x_{k+1}, δσ² I), i.e. p^ref is a pinned Brownian motion running backwards in time. This SB problem was discussed in discrete time in (Beghi, 1996) and in continuous time in (Föllmer, 1984; Dai Pra, 1991; Tzen & Raginsky, 2019). In this case, p²(x_{0:K}) = π(x_0) p^ref(x_{1:K}|x_0) is a modified "noising" process that transports π to the degenerate measure δ_0, and it is easily shown that p^sb = p². Sampling from the time-reversal of this measure would generate samples from π starting from δ_0. Algorithmically, Zhang et al. (2021) proposed approximating this time-reversal by a projection on some eigenfunctions. In parallel, Barr et al. (2020), Vargas et al. (2021) and Zhang & Chen (2022) approximated this SB by using a neural network parameterization of the gradient of the logarithm of the corresponding value function, trained by minimizing a reverse KL. We adopt the Path Integral Sampler (PIS) terminology proposed by Zhang & Chen (2022) for this approach. DDS can thus be seen as an alternative to PIS which relies on a reference dynamics corresponding to an overdamped or underdamped OU process instead of a pinned Brownian motion. Theoretically, the drift of the resulting time-reversal is not as steep for DDS as for PIS (see Appendix A.2), and empirically this significantly improves the numerical stability of the training procedure; see Figure 1. The use of a pinned Brownian motion is also not amenable to the construction of normalizing flows.

Forward KL minimization. In Jing et al. (2022), diffusion model ideas are also used to sample from unnormalized probability densities. The criterion minimized therein is the forward KL, as in DDPM, that is KL(p||q_θ). As samples from p are not available, an importance sampling approximation of p based on samples from q_θ is used to obtain an estimate of this KL and its gradient w.r.t. θ.
The method is shown to perform well in low-dimensional examples but is expected to degrade significantly in high dimensions, as the importance sampling approximation will typically be poor in the first training iterations.

4. EXPERIMENTS

We present here experiments for Algorithm 1. In our implementation, f_θ follows the PIS-GRAD network proposed in (Zhang & Chen, 2022): f_θ(k, x) = NN_1(k, x; θ) + NN_2(k; θ) ⊙ ∇ ln π(x). Across all experiments we use a two-layer architecture with 64 hidden units each (for both networks), as in Zhang & Chen (2022), with the exception of the NICE (Dinh et al., 2014) target, where we use 3 layers with 512, 256, and 64 units respectively. The final layers are initialised to 0 in order to make the path regularisation term null. We use α_k^{1/2} ∝ α_max^{1/2} cos²((π/2)(1 - k/K + s)/(1 + s)) with s = 0.008, as in (Nichol & Dhariwal, 2021). We found that detaching the target score stabilised optimization in both approaches without affecting the final result. We adopt this across experiments; an ablation of this feature can be found in Appendix C.9.1. Across all tasks we compare DDS to SMC (Del Moral et al., 2006; Zhou et al., 2016), PIS (Barr et al., 2020; Vargas et al., 2021; Zhang & Chen, 2022), and mean-field VI (MF-VI) with a Gaussian variational distribution. We also compare DDS to AIS (Neal, 2001) and to optimized variants of AIS using score matching (MCD) (Doucet et al., 2022; Geffner & Domke, 2022) for two standard Bayesian models. Finally, we explore a task introduced in (Doucet et al., 2022) that uses a pre-trained normalising flow as a target. Within this setting we propose a benchmarking criterion that allows us to assess mode collapse in high dimensions and to explore the benefits of incorporating inductive biases into f_θ. We carefully tuned the hyper-parameters of all algorithms (e.g. step size, diffusion coefficient, and so on); details can be found in Appendix C.2. Training times can be found in Appendix C.5. Additional experiments for the normalizing flows are presented in Appendix C.11, and for the underdamped approach in Appendix C.12. We note that these extensions did not bring any benefit compared to Algorithm 1.
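The cosine schedule above can be written as follows (our function name; `alpha_max` is the assumed proportionality cap, and we square the result to return α_k itself):

```python
import numpy as np

def cosine_alphas(K, alpha_max=0.999, s=0.008):
    """alpha_k^{1/2} proportional to alpha_max^{1/2} cos^2((pi/2)(1 - k/K + s)/(1 + s)),
    following Nichol & Dhariwal (2021); returns alpha_k for k = 1, ..., K."""
    k = np.arange(1, K + 1)
    sqrt_alpha = np.sqrt(alpha_max) * np.cos(0.5 * np.pi * (1.0 - k / K + s) / (1.0 + s)) ** 2
    return sqrt_alpha ** 2
```

The schedule increases monotonically in k, starting near 0 and saturating near `alpha_max`, so most of the noise is injected towards the end of the backward indexing.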

4.1. BENCHMARKING TARGETS

We first discuss two standard target distributions which are often used to benchmark methods; see e.g. (Neal, 2003; Arbel et al., 2021; Heng et al., 2020; Zhang & Chen, 2022). Results are presented in Figure 2.

Funnel distribution: This challenging 10-dimensional distribution is given by γ(x_{1:10}) = N(x_1; 0, σ_f²) N(x_{2:10}; 0, exp(x_1) I), where σ_f² = 9 (Neal, 2003).

Log Gaussian Cox process: This model arises in spatial statistics (Møller et al., 1998). We use a d = M × M = 1600 grid, resulting in the unnormalized target density γ(x) = N(x; µ, K) Π_{i∈[1:M]²} exp(x_i y_i - a exp(x_i)).
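The funnel target admits a direct transcription (our function name); the log-density is what the sampler actually evaluates:

```python
import numpy as np

def log_gamma_funnel(x, sigma_f2=9.0):
    """Unnormalized log density of Neal's 10-d funnel:
    gamma(x) = N(x_1; 0, sigma_f^2) N(x_{2:10}; 0, exp(x_1) I)."""
    x1, rest = x[0], x[1:]
    lp1 = -0.5 * x1**2 / sigma_f2 - 0.5 * np.log(2 * np.pi * sigma_f2)
    # N(rest; 0, exp(x1) I): variance exp(x1) in each of the remaining coordinates.
    lp_rest = -0.5 * np.sum(rest**2) * np.exp(-x1) - 0.5 * rest.size * (np.log(2 * np.pi) + x1)
    return lp1 + lp_rest
```

The coupling through exp(x_1) is what makes the target challenging: the conditional scale of x_{2:10} varies over several orders of magnitude along x_1.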

4.2. BAYESIAN MODELS

We explore two standard Bayesian models and compare against the standard AIS and MCD benchmarks (see Appendix C.8), in addition to the SMC and VI benchmarks presented so far. Results for Ionosphere can be found in the third pane of Figure 2, whilst Sonar and Brownian are in Figure 3.

Logistic regression: We set x ∼ N(0, σ_w² I), y_i ∼ Bernoulli(sigmoid(x^⊤ u_i)). This Bayesian logistic model is evaluated on two datasets, Ionosphere (d = 32) and Sonar (d = 61).

Brownian motion: We consider a discretised Brownian motion time series model with a Gaussian observation model and a latent volatility, d = 32. This model, proposed in the software package developed by Sountsov et al. (2020), is specified in more detail in Appendix C.3.
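The logistic regression target is, up to Z, the following log-posterior (a direct transcription; in our sketch `U` stacks the covariates u_i as rows):

```python
import numpy as np

def log_gamma_logistic(x, U, y, sigma_w=1.0):
    """Unnormalized log posterior for Bayesian logistic regression:
    prior x ~ N(0, sigma_w^2 I), likelihood y_i ~ Bernoulli(sigmoid(x^T u_i))."""
    logits = U @ x
    # Bernoulli log-likelihood written stably: y * logit - log(1 + exp(logit)).
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.sum(x**2) / sigma_w**2
    return log_lik + log_prior
```

Since the sampler only needs γ up to a constant, the Gaussian prior's normalizing term is dropped.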

4.3. MODE COLLAPSE IN HIGH DIMENSIONS

Normalizing flow evaluation: Following Doucet et al. (2022), we train NICE (Dinh et al., 2014) on a down-sampled d = 14 × 14 variant of MNIST (LeCun & Cortes, 2010) and use the trained model as our target. As we can generate samples from our target, we evaluate the methods' samples by measuring the Sinkhorn distance between true and estimated samples. This evaluation criterion allows us to assess mode collapse for samplers in high-dimensional settings. Results can be seen in Table 1 and the third pane of Figure 3. For MCD we used the results from (Doucet et al., 2022).
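The Sinkhorn distance used for evaluation can be sketched with plain Sinkhorn iterations on the entropy-regularized optimal transport problem (a minimal, numerically naive version for illustration; practical implementations iterate in the log domain for stability):

```python
import numpy as np

def sinkhorn_distance(X, Y, eps=0.1, n_iters=200):
    """Entropy-regularized OT cost between two empirical samples with uniform weights."""
    # Squared Euclidean cost matrix between the two point clouds.
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-C / eps)                       # Gibbs kernel
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iters):                   # alternating marginal scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # approximate transport plan
    return np.sum(P * C)
```

Unlike per-marginal diagnostics, this transport cost penalizes a sampler that drops modes, since the missing mass must be transported from far away.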

5. DISCUSSION

We have explored the use of DDPM ideas to sample from unnormalized probability distributions and estimate their normalizing constants. The DDS in Algorithm 1 is empirically competitive with state-of-the-art SMC and numerically more stable than PIS. This comes at the cost of a non-negligible training time compared to SMC. When accounting for it, SMC often provides better performance on simple targets. However, in the challenging multimodal NICE example, even a carefully tuned SMC sampler using Hamiltonian Monte Carlo transitions was not competitive with DDS. This is despite the fact that DDS (and PIS) are prone to mode dropping, as is any method relying on the reverse KL. We have also investigated normalizing flows based on the probability flow ODE, as well as DDS based on an underdamped dynamics. Our experimental results were disappointing in both cases in high-dimensional scenarios. We conjecture that more sophisticated numerical integrators need to be developed for the normalizing flows to be competitive, and that the neural network parameterization used in the underdamped scenario should be improved to better leverage the structure of the logarithmic derivative of the value functions. Overall, DDS are a class of algorithms worth investigating and developing further. Much work has been devoted to successfully improving DDPM over the past two years, including, among many others, modified forward noising mechanisms (Hoogeboom & Salimans, 2022), denoising diffusion implicit models (Song et al., 2021a) and sophisticated numerical integrators (Karras et al., 2022). It is highly likely that some of these techniques will lead to more powerful DDS. Advances in KL and path integral control might also be adapted to DDS to provide better training procedures; see e.g. Thalmeier et al. (2020).

Guodong Zhang, Kyle Hsu, Jianing Li, Chelsea Finn, and Roger Grosse. Differentiable annealed importance sampling and the perils of gradient noise. In Advances in Neural Information Processing Systems, 2021.
Qinsheng Zhang and Yongxin Chen. Path integral sampler: a stochastic control approach for sampling. In International Conference on Learning Representations, 2022.
Yan Zhou, Adam M Johansen, and John AD Aston. Toward automatic model comparison: an adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics, 25(3):701-726, 2016.

A APPENDIX

A.1 PROOF OF PROPOSITION 1

The Radon-Nikodym derivative expression (9) directly follows from an application of Girsanov's theorem (see e.g. Klebaner (2012)). The path measures P and P^ref are induced by two diffusions following the same dynamics but with different initial conditions, so that

P(ω) = P(ω|y_T) π(y_T) = P^ref(ω|y_T) π(y_T) = P^ref(ω|y_T) N(y_T; 0, σ² I) (π(y_T)/N(y_T; 0, σ² I)) = P^ref(ω) (π(y_T)/N(y_T; 0, σ² I)).

So it follows directly that KL(Q^θ||P) = KL(Q^θ||P^ref) + E_{y_T∼q^θ_0}[ln(p_0^ref(y_T)/p_0(y_T))]. Note that we apply Girsanov's theorem to the time-reversals of the path measures Q^θ and P, using that the Radon-Nikodym derivative between two path measures equals that between their respective time-reversals. As W_t is a Brownian motion under Q^θ, the final expression (10) for KL(Q^θ||P) follows.

A.2 COMPARING DDS AND PIS DRIFTS FOR GAUSSIAN TARGETS

The optimal drifts for DDS and PIS can both be expressed in terms of the logarithmic derivative of a value function plus additional terms:

b^{DDS}(x, t) = -\beta_{T-t}(x - 2\sigma^2 \nabla_x \ln \phi_{T-t}(x)),    b^{PIS}(x, t) = \sigma^2 \nabla_x \ln \phi_{T-t}(x),

where \phi_t is the corresponding value function for DDS or PIS respectively. Recall that \ln \phi_t(x) = \ln p_t(x) - \ln p_t^{ref}(x). For a target \pi(x) = N(x; \mu, \Sigma), with p_t^{ref}(x) = N(x; \mu_t^{ref}, b_t^{ref} I) and p_{t|0}(x_t|x_0) = N(x_t; a_t x_0, b_t I), we obtain

p_t(x) = \int \pi(x_0) p_{t|0}(x|x_0) dx_0 = N(x; a_t \mu, a_t^2 \Sigma + b_t I),

so that

\nabla \ln \phi_t(x) = -(a_t^2 \Sigma + b_t I)^{-1}(x - a_t \mu) + (b_t^{ref})^{-1}(x - \mu_t^{ref}).

Corollary 1. For DDS, we have \mu_t^{ref} = 0, b_t^{ref} = \sigma^2, a_t = \sqrt{1 - \lambda_t}, b_t = \sigma^2 \lambda_t with \lambda_t = 1 - \exp(-2 \int_0^t \beta_s ds). So, for example, for \sigma = \beta_t = 1 and \Sigma = I, we have

\nabla \ln \phi_t(x) = \mu \exp(-t),    b^{DDS}(x, t) = -x + 2\mu \exp(-(T - t)).

Corollary 2. For PIS, the reference process is a pinned Brownian motion running backwards in time, with \mu_t^{ref} = 0, b_t^{ref} = \sigma^2 (T - t) I, a_t = (T - t)/T and b_t = \sigma^2 t(T - t)/T. For \sigma = 1, \Sigma = I, we obtain

b^{PIS}(x, t) = -(a_{T-t}^2 + b_{T-t})^{-1}(x - a_{T-t} \mu) + x/t.

In particular b^{PIS}(x, t) \approx x/t - (x - \mu) as t \to 0. Hence for PIS the drift function explodes close to the time origin, in contrast to DDS. This explosion holds for any target \pi, as it is only related to the fact that p_t^{ref} concentrates on \delta_0 as t \to T. This makes the drift harder to approximate with a neural network; additionally, when discretizing the resulting diffusion, smaller discretization steps should be used close to t = 0.
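As a quick numerical check of Corollary 1 (our own scalar sketch, with sigma = beta_t = 1 and Sigma = I), the general expression for the gradient of ln phi_t collapses to mu exp(-t), independently of x:

```python
import numpy as np

mu = 3.0   # target mean; sigma = beta_t = 1, Sigma = I, scalar case

def grad_log_phi(x, t):
    # General formula: -(a_t^2 + b_t)^{-1} (x - a_t * mu) + (b_t^ref)^{-1} x,
    # with b_t^ref = sigma^2 = 1.
    lam = 1.0 - np.exp(-2.0 * t)          # lambda_t for beta_s = 1
    a, b = np.sqrt(1.0 - lam), lam        # a_t, b_t with sigma = 1
    return -(x - a * mu) / (a**2 + b) + x

# Corollary 1 predicts mu * exp(-t), independent of x (here a_t^2 + b_t = 1).
print(grad_log_phi(2.5, 0.5), mu * np.exp(-0.5))
```

The cancellation works because a_t^2 + b_t = 1 for this schedule, so the x-dependent terms cancel exactly.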

A.3 PROOF OF PROPOSITION 2

Proof. Let us express the discrete-time version of the reference process as

p^{ref}(y_{0:K}) = p_0^{ref}(y_0) \prod_{k=0}^{K-1} p^{ref}_{k+1|k}(y_{k+1}|y_k),

and denote by p_k^{ref} the corresponding marginal density of y_k, which satisfies

p^{ref}_{k+1}(y_{k+1}) = \int p^{ref}_{k+1|k}(y_{k+1}|y_k) p^{ref}_k(y_k) dy_k.

The backward decomposition of this joint distribution is given by

p^{ref}(y_{0:K}) = p^{ref}_K(y_K) \prod_{k=0}^{K-1} p^{ref}_{k|k+1}(y_k|y_{k+1}),    where p^{ref}_{k|k+1}(y_k|y_{k+1}) = \frac{p^{ref}_{k+1|k}(y_{k+1}|y_k) p^{ref}_k(y_k)}{p^{ref}_{k+1}(y_{k+1})}.

If our chosen integrator induces a transition kernel p^{ref}_{k+1|k}(y_{k+1}|y_k) such that p^{ref}_K(y_K) = p^{ref}_0(y_K), then

p^{ref}(y_{0:K}) \frac{\pi(y_K)}{p^{ref}_0(y_K)} = \pi(y_K) \prod_{k=0}^{K-1} p^{ref}_{k|k+1}(y_k|y_{k+1})    (31)

is a valid (normalised) probability density. Hence it follows that

KL(q^\theta(y_{0:K}) || p^{ref}(y_{0:K})) + E_{y_K \sim q_0^\theta}\left[\ln \frac{p^{ref}_0(y_K)}{\pi(y_K)}\right]    (32)
= KL\left(q^\theta(y_{0:K}) \,\Big\|\, p^{ref}(y_{0:K}) \frac{\pi(y_K)}{p^{ref}_0(y_K)}\right) \geq 0.

If the integrator does not preserve the marginals, we have instead

p^{ref}(y_{0:K}) \frac{\pi(y_K)}{p^{ref}_0(y_K)} = \frac{p^{ref}_K(y_K)}{p^{ref}_0(y_K)} \pi(y_K) \prod_{k=0}^{K-1} p^{ref}_{k|k+1}(y_k|y_{k+1}).    (33)

This is not a probability density, so the objective is no longer guaranteed to be non-negative and, consequently, the expectation of our estimator of ln Z is no longer necessarily a lower bound on ln Z. Finally, simple calculations show that the Euler discretisation does not preserve the invariant distribution of P^{ref} in DDS.
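The closing claim can be illustrated with a small variance-propagation sketch (our own constants): the exponential integrator leaves the variance of the OU reference exactly invariant, while the Euler-Maruyama transition inflates it towards sigma^2 / (1 - beta*delta/2).

```python
import numpy as np

# Variance propagation for the OU reference dx = -beta*x dt + sqrt(2*beta)*sigma dB,
# started at the invariant distribution N(0, sigma^2).
sigma, beta, delta, K = 1.0, 1.0, 0.05, 200
alpha = 1.0 - np.exp(-2.0 * beta * delta)

v_exact = v_euler = sigma**2
for _ in range(K):
    # exponential integrator: x' = sqrt(1 - alpha) x + sigma sqrt(alpha) eps
    v_exact = (1.0 - alpha) * v_exact + sigma**2 * alpha
    # Euler-Maruyama: x' = (1 - beta*delta) x + sigma sqrt(2*beta*delta) eps
    v_euler = (1.0 - beta * delta) ** 2 * v_euler + 2.0 * beta * delta * sigma**2

print(v_exact)   # stays exactly sigma^2 = 1.0
print(v_euler)   # drifts to sigma^2 / (1 - beta*delta/2) ≈ 1.0256
```

The fixed point of the Euler recursion solves v = (1 - beta*delta)^2 v + 2*beta*delta*sigma^2, which gives the inflated variance above; this is the mechanism behind the overestimation of ln Z reported in Section C.9.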

A.4 PROOF OF PROPOSITION 3

We have

\ln \frac{q^\theta(y_{0:K})}{p(y_{0:K})} = \ln \frac{q^\theta(y_{0:K})}{p^{ref}(y_{0:K})} + \ln \frac{N(y_K; 0, \sigma^2 I)}{\pi(y_K)}.    (34)

Now, by construction, we have from (17) that p^{ref}_{k-1|k}(y_{K-k+1}|y_{K-k}) = N(y_{K-k+1}; \sqrt{1-\alpha_k}\, y_{K-k}, \sigma^2 \alpha_k I), and from (18) we obtain q^\theta_{k-1|k}(y_{K-k+1}|y_{K-k}) = N(y_{K-k+1}; \sqrt{1-\alpha_k}\, y_{K-k} + 2\sigma^2(1 - \sqrt{1-\alpha_k}) f_\theta(k, y_{K-k}), \sigma^2 \alpha_k I). It follows that

\ln \frac{q^\theta(y_{0:K})}{p^{ref}(y_{0:K})}    (35)
= \ln \frac{q_K(y_0)}{p^{ref}_K(y_0)} + \sum_{k=1}^K \ln \frac{q^\theta_{k-1|k}(y_{K-k+1}|y_{K-k})}{p^{ref}_{k-1|k}(y_{K-k+1}|y_{K-k})}
= \sum_{k=1}^K \frac{1}{2\alpha_k \sigma^2} \left[ ||y_{K-k+1} - \sqrt{1-\alpha_k}\, y_{K-k}||^2 - ||y_{K-k+1} - \sqrt{1-\alpha_k}\, y_{K-k} - 2\sigma^2(1-\sqrt{1-\alpha_k}) f_\theta(k, y_{K-k})||^2 \right],

where we have exploited the fact that q_K(y_0) = p^{ref}_K(y_0) = N(y_0; 0, \sigma^2 I). Now, using

\epsilon_k := \frac{1}{\sigma\sqrt{\alpha_k}} \left( y_{K-k+1} - \sqrt{1-\alpha_k}\, y_{K-k} - 2\sigma^2(1-\sqrt{1-\alpha_k}) f_\theta(k, y_{K-k}) \right),

we can rewrite (35) as

\ln \frac{q^\theta(y_{0:K})}{p^{ref}(y_{0:K})}    (36)
= \frac{1}{2} \sum_{k=1}^K \left[ \left\| \epsilon_k + 2\sigma \frac{1-\sqrt{1-\alpha_k}}{\sqrt{\alpha_k}} f_\theta(k, y_{K-k}) \right\|^2 - ||\epsilon_k||^2 \right]
= 2\sigma^2 \sum_{k=1}^K \frac{(1-\sqrt{1-\alpha_k})^2}{\alpha_k} ||f_\theta(k, y_{K-k})||^2 + 2\sigma \sum_{k=1}^K \frac{1-\sqrt{1-\alpha_k}}{\sqrt{\alpha_k}} f_\theta(k, y_{K-k})^\top \epsilon_k.

Now (19) follows directly from (34) and (36). Note that, for \delta \ll 1, we have 2\sigma^2 (1-\sqrt{1-\alpha_k})^2/\alpha_k \approx \sigma^2 \beta_{k\delta} \delta and 2\sigma (1-\sqrt{1-\alpha_k})/\sqrt{\alpha_k} \approx \sigma\sqrt{2\beta_{k\delta}\delta}, as expected from (9). Finally, the final expression (20) of the KL follows from the fact that

E_{q^\theta}[f_\theta(k, y_{K-k})^\top \epsilon_k] = E_{q^\theta_k}[E_{q^\theta_{k-1|k}}[f_\theta(k, y_{K-k})^\top \epsilon_k | y_{K-k}]] = E_{q^\theta_k}[f_\theta(k, y_{K-k})^\top E_{q^\theta_{k-1|k}}[\epsilon_k | y_{K-k}]] = 0.
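The discrete-time construction above can be exercised end to end on a toy problem. The sketch below (our own setup) runs the exponential-integrator sampler for a 1-d Gaussian target using the optimal drift of Corollary 1, accumulates the log-ratio terms of (36) as importance weights, and recovers ln Z for the unnormalised density gamma(x) = exp(-(x - mu)^2 / 2), whose true value is ln sqrt(2*pi).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: unnormalised target gamma(x) = exp(-(x - mu)^2 / 2), so Z = sqrt(2*pi);
# sigma = beta_t = 1, and the optimal drift mu * exp(-t) from Corollary 1.
mu, sigma = 3.0, 1.0
T, K, N = 5.0, 200, 20000
delta = T / K
alpha = 1.0 - np.exp(-2.0 * delta)       # uniform schedule (beta_t = 1)
sq = np.sqrt(1.0 - alpha)

y = sigma * rng.standard_normal(N)       # y_0 ~ N(0, sigma^2)
log_w = np.zeros(N)                      # accumulates -ln(dq/dp_ref), cf. (36)
for j in range(K):
    k = K - j                            # f(k, .) approximates grad ln phi at time k*delta
    f = mu * np.exp(-k * delta)
    eps = rng.standard_normal(N)
    c = 2.0 * sigma * (1.0 - sq) / np.sqrt(alpha)
    log_w -= 0.5 * ((eps + c * f) ** 2 - eps**2)
    y = sq * y + 2.0 * sigma**2 * (1.0 - sq) * f + sigma * np.sqrt(alpha) * eps

# terminal correction: ln gamma(y_K) - ln N(y_K; 0, sigma^2)
log_w += -0.5 * (y - mu) ** 2 + 0.5 * np.log(2.0 * np.pi * sigma**2) + 0.5 * y**2 / sigma**2

m = log_w.max()
ln_z = m + np.log(np.mean(np.exp(log_w - m)))
print(ln_z)   # close to 0.5 * ln(2*pi) ≈ 0.9189
```

Because the exponential integrator preserves the reference marginals (Proposition 2), the importance weights are unbiased for Z even with an imperfect drift; with the near-optimal drift their variance is tiny.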

A.5 ALTERNATIVE KULLBACK-LEIBLER DECOMPOSITION

A KL decomposition similar in spirit to the one developed for DDPM (Ho et al., 2020) can be derived. It leverages the fact that

p_{k-1|k,0}(x_{k-1}|x_k, x_0) = N(x_{k-1}; \tilde{\mu}_k(x_k, x_0), \sigma^2 \tilde{\beta}_k I)    (37)

for

\tilde{\mu}_k(x_k, x_0) = \frac{\sqrt{\bar{\alpha}_{k-1}} \beta_k}{1 - \bar{\alpha}_k} x_0 + \frac{\sqrt{\alpha_k}(1 - \bar{\alpha}_{k-1})}{1 - \bar{\alpha}_k} x_k,    \tilde{\beta}_k = \frac{1 - \bar{\alpha}_{k-1}}{1 - \bar{\alpha}_k} \beta_k.

Proposition 4. The reverse Kullback-Leibler discrepancy KL(q^\theta || p) satisfies

KL(q^\theta || p) = E_{q^\theta}\Big[ KL(q_K(x_K) || p_{K|0}(x_K|x_0)) + \sum_{k=2}^K KL(q^\theta_{k-1|k}(x_{k-1}|x_k) || p_{k-1|0,k}(x_{k-1}|x_0, x_k)) + KL(q^\theta_{0|1}(x_0|x_1) || \pi(x_0)) \Big].

So for q^\theta_{k-1|k}(x_{k-1}|x_k) = N(x_{k-1}; \sqrt{1-\beta_k}\, x_k + \sigma^2 \beta_k f_\theta(k, x_k), \sigma^2 \beta_k I), the terms KL(q^\theta_{k-1|k}(x_{k-1}|x_k) || p_{k-1|0,k}(x_{k-1}|x_0, x_k)) are KL divergences between two Gaussian distributions and can be calculated analytically. We found this decomposition to be numerically unstable and prone to diverging in our reverse KL setting.

Proof. The reverse KL can be decomposed as follows:

KL(q^\theta || p) = E_{q^\theta}\left[\ln \frac{q^\theta(x_{0:K})}{p(x_{0:K})}\right] = E_{q^\theta}\left[\ln \frac{q^\theta(x_{0:K})}{\pi(x_0) p(x_{1:K}|x_0)}\right] = L(\theta) - E_{q^\theta}[\ln \pi(x_0)],

where

L(\theta) = E_{q^\theta}\left[\ln \frac{q^\theta(x_{0:K})}{p(x_{1:K}|x_0)}\right] = E_{q^\theta}\left[\ln q_K(x_K) + \sum_{k=1}^K \ln \frac{q^\theta_{k-1|k}(x_{k-1}|x_k)}{p_{k|k-1}(x_k|x_{k-1})}\right].

Now, using the identity, for k \geq 2,

p_{k-1,k|0}(x_{k-1}, x_k|x_0) = p_{k-1|0}(x_{k-1}|x_0) p_{k|k-1}(x_k|x_{k-1}) = p_{k-1|0,k}(x_{k-1}|x_0, x_k) p_{k|0}(x_k|x_0),

we can rewrite L(\theta) as

L(\theta) = E_{q^\theta}\left[\ln q_K(x_K) + \sum_{k=2}^K \ln\left( \frac{q^\theta_{k-1|k}(x_{k-1}|x_k)}{p_{k-1|0,k}(x_{k-1}|x_0, x_k)} \cdot \frac{p_{k-1|0}(x_{k-1}|x_0)}{p_{k|0}(x_k|x_0)} \right) + \ln \frac{q^\theta_{0|1}(x_0|x_1)}{p_{1|0}(x_1|x_0)}\right]
= E_{q^\theta}\left[\ln \frac{q_K(x_K)}{p_{K|0}(x_K|x_0)} + \sum_{k=2}^K \ln \frac{q^\theta_{k-1|k}(x_{k-1}|x_k)}{p_{k-1|0,k}(x_{k-1}|x_0, x_k)} + \ln q^\theta_{0|1}(x_0|x_1)\right]
= E_{q^\theta}\left[KL(q_K(x_K) || p_{K|0}(x_K|x_0)) + \sum_{k=2}^K KL(q^\theta_{k-1|k}(x_{k-1}|x_k) || p_{k-1|0,k}(x_{k-1}|x_0, x_k)) + \ln q^\theta_{0|1}(x_0|x_1)\right].

The result now follows directly.
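The per-step terms above are KL divergences between Gaussians and are hence available in closed form; for reference, the standard one-dimensional formula (the helper name is ours):

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians with variances v1, v2 > 0.
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0: identical distributions
print(kl_gauss(1.0, 1.0, 0.0, 1.0))  # 0.5: equal variances, mean shift of 1
```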

B UNDERDAMPED LANGEVIN DYNAMICS

In the generative modeling context, it has been proposed to extend the original state x ∈ R^d with a momentum variable m ∈ R^d. One then diffuses in this extended space using underdamped Langevin dynamics (Dockhorn et al., 2022) targeting N(x; 0, σ² I) N(m; 0, M). It was demonstrated empirically that the resulting scores one needs to estimate are smoother, which led to improved performance. We adapt this approach to Monte Carlo sampling. This adaptation is non-trivial and, in particular, requires carefully designed numerical integrators.

B.1 CONTINUOUS TIME

We now consider an augmented target distribution \pi(x) N(m; 0, M), where M is a positive definite mass matrix. We then diffuse this extended target using the following underdamped Langevin dynamics:

dx_t = M^{-1} m_t \, dt,    dm_t = -\frac{x_t}{\sigma^2} dt - \beta_t m_t \, dt + \sqrt{2\beta_t} M^{1/2} dB_t,    (38)

where x_0 \sim \pi, m_0 \sim N(0, M). The resulting path measure on [0, T] is again denoted P. From (Haussmann & Pardoux, 1986), the time-reversal process is also a diffusion, satisfying

dy_t = -M^{-1} n_t \, dt,    dn_t = \frac{y_t}{\sigma^2} dt + \beta_{T-t} n_t \, dt + 2\beta_{T-t} M \nabla_{n_t} \ln \eta_{T-t}(y_t, n_t) \, dt + \sqrt{2\beta_{T-t}} M^{1/2} dW_t,    (39)

for (y_0, n_0) \sim \eta_T, where \eta_t denotes the density of (x_t, m_t) under (38). Now consider a reference process P^{ref} on [0, T] defined by the forward process (38) initialized using x_0 \sim N(0, \sigma^2 I), m_0 \sim N(0, M). In this case one can check that \eta_t^{ref}(x_t, m_t) = N(x_t; 0, \sigma^2 I) N(m_t; 0, M), and the time-reversal process of P^{ref} satisfies

dy_t = -M^{-1} n_t \, dt,    dn_t = \frac{y_t}{\sigma^2} dt + \beta_{T-t} n_t \, dt - 2\beta_{T-t} n_t \, dt + \sqrt{2\beta_{T-t}} M^{1/2} dW_t = \frac{y_t}{\sigma^2} dt - \beta_{T-t} n_t \, dt + \sqrt{2\beta_{T-t}} M^{1/2} dW_t,    (40)

as \nabla_n \ln \eta_t^{ref}(y, n) = \nabla_n \ln(N(y; 0, \sigma^2 I) N(n; 0, M)) = -M^{-1} n. Hence it follows that the time-reversal (39) of P can also be written as

dy_t = -M^{-1} n_t \, dt,    dn_t = \frac{y_t}{\sigma^2} dt - \beta_{T-t} n_t \, dt + 2\beta_{T-t} M \nabla_{n_t} \ln \phi_{T-t}(y_t, n_t) \, dt + \sqrt{2\beta_{T-t}} M^{1/2} dW_t,

where \phi_t(x, m) := \eta_t(x, m)/\eta_t^{ref}(x, m). To approximate P, we consider a parameterized path measure Q^\theta whose time-reversal is defined, for (y_0, n_0) \sim N(y_0; 0, \sigma^2 I) N(n_0; 0, M), by

dy_t = -M^{-1} n_t \, dt,    dn_t = \frac{y_t}{\sigma^2} dt - \beta_{T-t} n_t \, dt + 2\beta_{T-t} M f_\theta(T-t, y_t, n_t) \, dt + \sqrt{2\beta_{T-t}} M^{1/2} dW_t.

B.2 LEARNING THE TIME-REVERSAL THROUGH KL MINIMIZATION

To approximate P, we consider a parameterized diffusion whose time-reversal is defined by

dy_t = -M^{-1} n_t \, dt,    dn_t = \frac{y_t}{\sigma^2} dt - \beta_{T-t} n_t \, dt + 2\beta_{T-t} M f_\theta(T-t, y_t, n_t) \, dt + \sqrt{2\beta_{T-t}} M^{1/2} dW_t,

for (y_0, n_0) \sim N(y_0; 0, \sigma^2 I) N(n_0; 0, M), inducing a path measure Q^\theta on the time interval [0, T]. We can now compute the Radon-Nikodym derivative between Q^\theta and P^{ref} using an extension of Girsanov's theorem (Theorem A.3 in Sottinen & Särkkä (2008)):

\ln \frac{dQ^\theta}{dP^{ref}} = \int_0^T \sqrt{2\beta_{T-t}} (M^{1/2} f_\theta(T-t, y_t, n_t))^\top dW_t + \frac{1}{2} \int_0^T ||2\beta_{T-t} M f_\theta(T-t, y_t, n_t)||^2_{(2\beta_{T-t} M)^{-1}} dt.

To summarize, we have the following proposition.

Proposition 5. The Radon-Nikodym derivative \frac{dQ^\theta}{dP^{ref}}(y_{[0,T]}, n_{[0,T]}) satisfies, under Q^\theta,

\ln \frac{dQ^\theta}{dP^{ref}} = \int_0^T \beta_{T-t} ||f_\theta(T-t, y_t, n_t)||^2_M \, dt + \int_0^T \sqrt{2\beta_{T-t}} (M^{1/2} f_\theta(T-t, y_t, n_t))^\top dW_t.    (45)

From KL(Q^\theta || P) = KL(Q^\theta || P^{ref}) + E_{Q^\theta}[\ln(p_0^{ref}(y_T, n_T)/\eta_0(y_T, n_T))], it follows that

KL(Q^\theta || P) = E_{Q^\theta}\left[ \int_0^T \beta_{T-t} ||f_\theta(T-t, y_t, n_t)||^2_M \, dt + \ln \frac{N(y_T; 0, \sigma^2 I)}{\pi(y_T)} \right].    (46)

The second term on the r.h.s. of (46) follows from the fact that \ln(p_0^{ref}(y_T, n_T)/\eta_0(y_T, n_T)) = \ln(N(y_T; 0, \sigma^2 I)/\pi(y_T)), the momentum factors cancelling.

B.3 NORMALIZING FLOW THROUGH ORDINARY DIFFERENTIAL EQUATION

The following ODE yields exactly the same marginals \eta_t as the SDE (38) defining P:

dx_t = M^{-1} m_t \, dt,    dm_t = -\frac{x_t}{\sigma^2} dt - \beta_t m_t \, dt - \beta_t M \nabla_{m_t} \ln \eta_t(x_t, m_t) \, dt = -\frac{x_t}{\sigma^2} dt - \beta_t M \nabla_{m_t} \ln \phi_t(x_t, m_t) \, dt.

Thus if we integrate its time-reversal from 0 to T starting from (y_0, n_0) \sim \eta_T,

dy_t = -M^{-1} n_t \, dt,    dn_t = \frac{y_t}{\sigma^2} dt + \beta_{T-t} M \nabla_{n_t} \ln \phi_{T-t}(y_t, n_t) \, dt,

then we obtain at time T a sample (y_T, n_T) \sim \pi(y_T) N(n_T; 0, M). In practice, this suggests that once we have learned an approximation f_{\theta^*}(t, x, m) \approx \nabla_m \ln \phi_t(x, m) by minimization of the KL (equivalently, maximization of the ELBO), we can construct a proposal using

dy_t = -M^{-1} n_t \, dt,    dn_t = \frac{y_t}{\sigma^2} dt + \beta_{T-t} M f_{\theta^*}(T-t, y_t, n_t) \, dt,

for (y_0, n_0) \sim N(y_0; 0, \sigma^2 I) N(n_0; 0, M). The resulting sample (y_T, n_T) \sim \tilde{\eta}_0 will have a distribution close to \pi \times N(0, M). Again, it is possible to compute the density of this sample pointwise in order to perform an importance sampling correction.

B.4 FROM CONTINUOUS TIME TO DISCRETE TIME

We now need a discrete-time integrator for the time-reversal of P^{ref} given by (40), and for the time-reversal of Q^\theta given by (43). Let us start with (40). We split it into the two components

dy_t = -M^{-1} n_t \, dt,    dn_t = \frac{y_t}{\sigma^2} dt,    (50)

and

dy_t = 0,    dn_t = -\beta_{T-t} n_t \, dt + \sqrt{2\beta_{T-t}} M^{1/2} dW_t,    (51)

and compose these transitions. To obtain (y_{k+1}, n_{k+1}) from (y_k, n_k), we first integrate (50). To do this, consider the Hamiltonian equation dy_t = M^{-1} n_t \, dt, dn_t = -\frac{y_t}{\sigma^2} dt, which preserves N(y; 0, \sigma^2 I) N(n; 0, M) as invariant distribution. We can integrate this ODE exactly over an interval of length \delta and denote its solution \Phi(y, n); see Section B.5 for details. We use its inverse \Phi^{-1}(y, n) = \Phi_{flip} \circ \Phi \circ \Phi_{flip}(y, n), where \Phi_{flip}(y, n) = (y, -n), so that (y_{k+1}, n'_k) = \Phi^{-1}(y_k, n_k). Then we integrate (51) exactly using

n_{k+1} = \sqrt{1 - \alpha_{K-k}}\, n'_k + \sqrt{\alpha_{K-k}} M^{1/2} \epsilon_k.    (52)

We have thus designed a transition of the form

p^{ref}(y_{k+1}, n_{k+1}, n'_k | y_k, n_k) = \delta_{\Phi^{-1}(y_k, n_k)}(y_{k+1}, n'_k) N(n_{k+1}; \sqrt{1 - \alpha_{K-k}}\, n'_k, \alpha_{K-k} M).    (53)

Now, to integrate (43), we split it into three parts. We first integrate (50) using exactly the same integrator, (y_{k+1}, n'_k) = \Phi^{-1}(y_k, n_k). Then we integrate

dn_t = 2\beta_{T-t} M f_\theta(T-t, y_t, n_t) \, dt    (54)

using n''_k = n'_k + 2(1 - \sqrt{1 - \alpha_{K-k}}) M f_\theta(K-k, y_{k+1}, n'_k), where we abuse notation and write f_\theta(K-k, y_{k+1}, n'_k) instead of f_\theta((K-k)\delta, y_{k+1}, n'_k). Finally, we integrate the OU part using (52), replacing n'_k with n''_k. The final transition is thus

q^\theta(y_{k+1}, n_{k+1}, n'_k, n''_k | y_k, n_k) = \delta_{\Phi^{-1}(y_k, n_k)}(y_{k+1}, n'_k) \, \delta_{n'_k + 2(1 - \sqrt{1 - \alpha_{K-k}}) M f_\theta(K-k, y_{k+1}, n'_k)}(n''_k) \times N(n_{k+1}; \sqrt{1 - \alpha_{K-k}}\, n''_k, \alpha_{K-k} M).    (55)
However, we can marginalize n''_k analytically to obtain

\tilde{q}^\theta(y_{k+1}, n_{k+1}, n'_k | y_k, n_k) = \delta_{\Phi^{-1}(y_k, n_k)}(y_{k+1}, n'_k) \times N(n_{k+1}; \sqrt{1 - \alpha_{K-k}}\,(n'_k + 2(1 - \sqrt{1 - \alpha_{K-k}}) M f_\theta(K-k, y_{k+1}, n'_k)), \alpha_{K-k} M).    (56)

B.5 EXACT SOLUTION OF HAMILTONIAN EQUATION

Consider the Hamiltonian equation with M = I. The solution of

dx_t = m_t \, dt,    dm_t = -\frac{x_t}{\sigma^2} dt    (57)

can be written exactly as (x_t, m_t) = \Phi_t(x_0, m_0), defined through

x_t = x_0 \cos(t/\sigma) + m_0 \sigma \sin(t/\sigma),    m_t = -\frac{x_0}{\sigma} \sin(t/\sigma) + m_0 \cos(t/\sigma).

This is the so-called harmonic oscillator. The inverse of the Hamiltonian flow satisfies \Phi^{-1}(x, m) = \Phi_{flip} \circ \Phi \circ \Phi_{flip}(x, m) (see e.g. Leimkuhler & Matthews (2016)), so we can simply integrate dy_t = -n_t \, dt, dn_t = \frac{y_t}{\sigma^2} dt using (y_{k+1}, n'_k) = \Phi_\tau^{-1}(y_k, n_k). This gives

y_{k+1} = y_k \cos(\tau/\sigma) - n_k \sigma \sin(\tau/\sigma),    (60)
n'_k = \frac{y_k}{\sigma} \sin(\tau/\sigma) + n_k \cos(\tau/\sigma).

If instead M = \alpha I, then dx_t = \frac{m_t}{\alpha} dt, dm_t = -\frac{x_t}{\sigma^2} dt; using the reparameterization \tilde{m}_t = m_t/\alpha, we have dx_t = \tilde{m}_t \, dt, d\tilde{m}_t = -\frac{1}{\alpha} \frac{x_t}{\sigma^2} dt, and the solution is as above with \sigma replaced by \sigma\sqrt{\alpha}. So, for \tilde{n}_k = n_k/\alpha, we have

y_{k+1} = y_k \cos(\tau/(\sqrt{\alpha}\sigma)) - \tilde{n}_k \sigma\sqrt{\alpha} \sin(\tau/(\sqrt{\alpha}\sigma)),    \tilde{n}'_k = \frac{y_k}{\sqrt{\alpha}\sigma} \sin(\tau/(\sqrt{\alpha}\sigma)) + \tilde{n}_k \cos(\tau/(\sqrt{\alpha}\sigma)),

so, as n_k = \alpha \tilde{n}_k and similarly n'_k = \alpha \tilde{n}'_k, we finally obtain

y_{k+1} = y_k \cos(\tau/(\sqrt{\alpha}\sigma)) - n_k \frac{\sigma}{\sqrt{\alpha}} \sin(\tau/(\sqrt{\alpha}\sigma)),    n'_k = \frac{\sqrt{\alpha}}{\sigma} y_k \sin(\tau/(\sqrt{\alpha}\sigma)) + n_k \cos(\tau/(\sqrt{\alpha}\sigma)).
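A minimal check of the flow and its momentum-flip inverse (our own code, M = I): the exact harmonic flow conserves the Hamiltonian, and Phi_flip composed with Phi and Phi_flip inverts it.

```python
import numpy as np

sigma, tau = 2.0, 0.3

def flow(x, m, t):
    # exact flow Phi_t of dx = m dt, dm = -(x / sigma^2) dt (harmonic oscillator, M = I)
    c, s = np.cos(t / sigma), np.sin(t / sigma)
    return x * c + m * sigma * s, -(x / sigma) * s + m * c

def flip(x, m):
    return x, -m

def inv_flow(x, m, t):
    # Phi_t^{-1} = Phi_flip o Phi_t o Phi_flip (momentum flip)
    return flip(*flow(*flip(x, m), t))

ham = lambda x, m: x**2 / (2 * sigma**2) + m**2 / 2

x1, m1 = flow(1.3, -0.7, tau)
xb, mb = inv_flow(x1, m1, tau)
print(xb, mb)   # recovers the initial state (1.3, -0.7)
```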

B.6 DERIVATION OF UNDERDAMPENED KL IN DISCRETE TIME

In this section we derive the discrete-time KL for the underdamped noising dynamics.

Proposition 6. The log density ratio lr = \ln(q^\theta(y_{0:K}, n_{0:K}, n'_{0:K})/p^{ref}(y_{0:K}, n_{0:K}, n'_{0:K})), for y_{0:K}, n_{0:K}, n'_{0:K} \sim q^\theta(\cdot), equals

lr = \sum_{k=1}^K \left[ \frac{2\kappa_k^2}{\alpha_k} ||f_\theta(k, y_{K-k+1}, n'_{K-k})||^2_M + \frac{2\kappa_k}{\sqrt{\alpha_k}} (M^{1/2} f_\theta(k, y_{K-k+1}, n'_{K-k}))^\top \varepsilon_k \right],    (66)

where \kappa_k := \sqrt{1 - \alpha_k}(1 - \sqrt{1 - \alpha_k}) and \varepsilon_k is obtained as a function of n_{K-k+1} and n''_{K-k} as described for the integrator of Q^\theta. In particular, we have \varepsilon_k i.i.d. \sim N(0, I), and one obtains

KL(q^\theta || p) = E_{q^\theta}\left[ \sum_{k=1}^K \frac{2\kappa_k^2}{\alpha_k} ||f_\theta(k, y_{K-k+1}, n'_{K-k})||^2_M + \ln \frac{N(y_K; 0, \sigma^2 I)}{\pi(y_K)} \right].    (67)

The integrator has been designed so that the ratio between the transitions of the proposal (56) and of the reference process (53) is well-defined: the deterministic parts of the two transitions are identical and cancel, so

\frac{q^\theta(y_{k+1}, n_{k+1}, n'_k | y_k, n_k)}{p^{ref}(y_{k+1}, n_{k+1}, n'_k | y_k, n_k)}    (68)
= \frac{N(n_{k+1}; \sqrt{1 - \alpha_{K-k}}\,(n'_k + 2(1 - \sqrt{1 - \alpha_{K-k}}) M f_\theta(K-k, y_{k+1}, n'_k)), \alpha_{K-k} M)}{N(n_{k+1}; \sqrt{1 - \alpha_{K-k}}\, n'_k, \alpha_{K-k} M)}.

The calculations to compute the Radon-Nikodym derivative are now very similar to those in the proof of Proposition 3. We have

\ln \frac{q^\theta(y_{0:K}, n_{0:K}, n'_{0:K})}{p^{ref}(y_{0:K}, n_{0:K}, n'_{0:K})}    (69)
= \ln \frac{q_K(y_0, n_0)}{p^{ref}_K(y_0, n_0)} + \sum_{k=1}^K \ln \frac{q^\theta_{k-1|k}(y_{K-k+1}, n_{K-k+1}, n'_{K-k} | y_{K-k}, n_{K-k})}{p^{ref}_{k-1|k}(y_{K-k+1}, n_{K-k+1}, n'_{K-k} | y_{K-k}, n_{K-k})}
= \sum_{k=1}^K \left[ ||n_{K-k+1} - \sqrt{1 - \alpha_k}\, n'_{K-k}||^2_{(2\alpha_k M)^{-1}} - ||n_{K-k+1} - \sqrt{1 - \alpha_k}\,(n'_{K-k} + 2(1 - \sqrt{1 - \alpha_k}) M f_\theta(k, y_{K-k+1}, n'_{K-k}))||^2_{(2\alpha_k M)^{-1}} \right],

where we have exploited the fact that q_K(y_0, n_0) = p^{ref}_K(y_0, n_0) = N(y_0; 0, \sigma^2 I) N(n_0; 0, M).
Now let us introduce

\epsilon_k := \frac{M^{-1/2}}{\sqrt{\alpha_k}} \left( n_{K-k+1} - \sqrt{1 - \alpha_k}\, n'_{K-k} - 2\sqrt{1 - \alpha_k}(1 - \sqrt{1 - \alpha_k}) M f_\theta(k, y_{K-k+1}, n'_{K-k}) \right),    (70)

hence

\frac{M^{-1/2}}{\sqrt{\alpha_k}} (n_{K-k+1} - \sqrt{1 - \alpha_k}\, n'_{K-k}) = \epsilon_k + \frac{2}{\sqrt{\alpha_k}} \sqrt{1 - \alpha_k}(1 - \sqrt{1 - \alpha_k}) M^{1/2} f_\theta(k, y_{K-k+1}, n'_{K-k}).

We can rewrite (69) as

\ln \frac{q^\theta(y_{0:K}, n_{0:K}, n'_{0:K})}{p^{ref}(y_{0:K}, n_{0:K}, n'_{0:K})}    (71)
= \frac{1}{2} \sum_{k=1}^K \left[ \left\| \epsilon_k + \frac{2\kappa_k}{\sqrt{\alpha_k}} M^{1/2} f_\theta(k, y_{K-k+1}, n'_{K-k}) \right\|^2 - ||\epsilon_k||^2 \right]
= \sum_{k=1}^K \left[ \frac{2\kappa_k^2}{\alpha_k} ||f_\theta(k, y_{K-k+1}, n'_{K-k})||^2_M + \frac{2\kappa_k}{\sqrt{\alpha_k}} (M^{1/2} f_\theta(k, y_{K-k+1}, n'_{K-k}))^\top \epsilon_k \right],

where we note that \kappa_k = \sqrt{1 - \alpha_k}(1 - \sqrt{1 - \alpha_k}) \approx \alpha_k/2 \approx \beta_{k\delta} \delta.

C EXPERIMENTS

In this section we provide additional experiments and ablations as well as further experimental detail. For both PIS and DDS we created a grid with δ = T/K = 0.05 and values of T ∈ {3.4, 6.4, 12.8, 25.6} with corresponding numbers of steps K ∈ {64, 128, 128, 256}. When tuning PIS we explore different values of σ which, in practice, when discretised, amounts to the same effect as changing δ. For DDS we apply the cosine schedule directly on the number of steps; however, we ensure that Σ_k α_k = α_max T so that we have a similar scaling as in PIS. Both PIS and DDS were trained with Adam (Kingma & Ba, 2015) for at most 11000 iterations, although in most cases both converged in fewer than 6000 iterations. Across all experiments we used a sample size of 300 to estimate the ELBO at train time and of 2000 for the reported IS estimator of Z for all methods. Finally, where available we use the true normalizing constant as ground truth, as is the case for the Funnel distribution and the normalizing flows used in the NICE (Dinh et al., 2014) task. For the remaining tasks we follow prior work (Zhang & Chen, 2022; Arbel et al., 2021) and use a long-run SMC chain (1000 temperatures, 30 seeds with 2000 samples each).
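For illustration, the rescaling constraint on the schedule can be sketched as follows; the helper below is our own, and the exact cosine parameterisation used in the experiments may differ.

```python
import numpy as np

def cosine_alphas(K, total):
    # Hypothetical helper: cosine-shaped increments alpha_k, rescaled so that
    # sum_k alpha_k == total (the constraint sum_k alpha_k = alpha_max * T above).
    k = np.arange(1, K + 1)
    raw = np.cos(0.5 * np.pi * (K - k) / K) ** 2
    return raw * total / raw.sum()

alphas = cosine_alphas(128, 0.05 * 128)
print(alphas.sum())   # 6.4 by construction
```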

C.1 NETWORK PARAMETRISATION

In order to improve numerical stability, we re-scale the neural network drift parametrisation as follows:

f_\theta(k, y) = \frac{\alpha_k}{2\lambda_k} \tilde{f}_\theta(k, y),    \tilde{f}_\theta(k, y) = NN_1(k, y; \theta) + NN_2(k; \theta) \odot \nabla \ln \pi(y).

This is done so that the reciprocal term \alpha_{K-k}^{-1} in the DDS objective is cancelled, as this term reaches very small values at the boundaries, causing the overall objective to be large. Finally, as \lambda_k = 1 - \sqrt{1 - \alpha_k} \approx \alpha_k/2, our proposed re-parametrisation converges to the same SDE as before, but now with stabler and simpler updates:

y_{k+1,n} = \sqrt{1 - \alpha_{K-k}}\, y_{k,n} + \sigma^2 \alpha_{K-k} \tilde{f}_\theta(K-k, y_{k,n}) + \sigma \sqrt{\alpha_{K-k}} \varepsilon_{k,n},    \varepsilon_{k,n} i.i.d. \sim N(0, I),
r_{k+1,n} = r_{k,n} + \frac{1}{2} \sigma^2 \alpha_{K-k} ||\tilde{f}_\theta(K-k, y_{k,n})||^2.

C.2 TUNING HYPER-PARAMETERS

For both DDS and PIS we explore a grid of 25 hyper-parameter values. For PIS we search for the best-performing value of the volatility coefficient over 25 values; depending on the task we vary the end points of the grid, however we noticed that PIS leads to numerical instabilities for γ > 4, thus we never search for values larger than this. For DDS we searched across 5 values for σ and 5 values for α_max, a total of 25 combinations for each experiment. Finally, for SMC we searched over its 3 different step sizes, leading to a total of 5³ = 125 combinations.
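The re-scaled update from Section C.1 can be sketched as follows (our own toy drift stands in for the neural network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for f_tilde = NN_1(k, y) + NN_2(k) * grad log pi(y):
# pretend pi = N(0, 1), so grad log pi(y) = -y, and the networks are constants.
def f_tilde(k, y):
    return -y

sigma, K, N = 1.0, 64, 1000
alphas = np.full(K + 1, 0.05)

y = sigma * rng.standard_normal(N)     # y_0 ~ N(0, sigma^2)
r = np.zeros(N)                        # running quadratic cost term of the objective
for k in range(K):
    a = alphas[K - k]
    f = f_tilde(K - k, y)
    eps = rng.standard_normal(N)
    y = np.sqrt(1.0 - a) * y + sigma**2 * a * f + sigma * np.sqrt(a) * eps
    r = r + 0.5 * sigma**2 * a * f**2
```

Note that the update only involves alpha_{K-k} directly, with no reciprocal term, which is the point of the re-parametrisation.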

C.3 BROWNIAN MOTION MODEL WITH GAUSSIAN OBSERVATION NOISE

The model is given by

α_inn ∼ LogNormal(0, 2),    α_obs ∼ LogNormal(0, 2),
x_1 ∼ N(0, α_inn),    x_i ∼ N(x_{i-1}, α_inn), i = 2, ..., 20,
y_i ∼ N(x_i, α_obs), i = 1, ..., 30.

The goal is to perform inference over the variables α_inn, α_obs and {x_i}.

C.4 DRIFT MAGNITUDE

In Figure 4 we compare the magnitudes of the NN drifts learned by PIS and DDS on a scalar N(6, σ²) target. The drifts were evaluated along randomly sampled trajectories from each of the learned samplers. We observe that the PIS drift has large magnitudes with higher variance, making it more prone to numerical instabilities, whilst the OU drift has a notably lower magnitude with significantly less variance. We conjecture this is the reason that PIS becomes unstable at training time for a large number of steps in several of our PIS experiments, many of which did not converge due to numerical instabilities. For simplicity, in this experiment we set α_k to be uniform.

C.5 TRAINING TIME

We evaluate the training times of DDS and PIS on an 8-chip TPU and average over 99 runs. We found all approaches to reach a very small error for ln Z very quickly; however, as seen with PIS across experiments, results became unstable for large T, and thus many hyper-parameter settings when tuning the volatility for PIS led to the loss exploding.

C.7 GENERATED IMAGES

In this section we provide some of the generated samples for the normalising flow evaluation task. In Figure 6 we can observe that both PIS and DDS are able to generate more image classes than SMC, which suffers from mode collapse. Out of the 3 approaches, DDS mode-collapses the least, whilst SMC generates the highest-quality images. Using a neural network with inductive biases, such as a convolutional neural network (LeCun et al., 1998), should improve the image quality over the simple feed-forward networks we have used. We believe this is the reason behind the low quality of the images, as both minimising the path KL and sampling high-quality images is a difficult task for a small feed-forward network.

C.8 COMPARISON TO AIS AND MCD

We compare our results to the AIS and MCD baselines presented in Geffner & Domke (2022). Results can be seen in Tables 4 and 3; both DDS and PIS outperform LDVI across all numbers of steps and tasks, with DDS outperforming PIS for higher values of k due to the inconsistent and unstable behaviour exhibited by PIS.

C.9 EULER-MARUYAMA VS EXPONENTIAL INTEGRATOR

In this section we demonstrate empirically how using the Euler-Maruyama integrator directly on the DDS objective leads to overestimating ln Z. We compare on the Funnel dataset, for which we know ln Z to be 0, and on the Ionosphere dataset, which is small and for which we have a reliable estimate of ln Z via a long SMC run. Results are reported in Table 5.

k = 64   -113.8 -112.5 -112.8 -112.1 -111.7 -111.6   -115.3 -111.1 -111.9 -109.7 -109.4 -108.9
k = 128  -113.1 -112.2 -112.3 -111.9 -111.6 -112.0   -113.5 -110.2 -110.6 -109.1 -108.9 -109.3
k = 256  -112.7 -112.1 -112.1 -111.7 -111.6 -113.0   -112.1 -109.7 -109.7 -108.9 -108.9 -108.9

Table 3: Comparison of DDS and PIS to AIS based approaches on logistic regression models.

Figures 7 and 8 show samples obtained from the probability flow ODE of a trained DDS model. We can see that the probability flow ODE is able to perfectly sample from a uni-modal Gaussian; matters became more challenging with multi-modal distributions. Initially we discretised the ODE (12) using the same type of integrators as for the SDEs, to obtain y_{k+1} = y_k + \delta \sigma^2 (1 - \sqrt{1 - \alpha_{K-k}}) f_\theta(K-k, y_k) for y_0 \sim N(0, \sigma^2 I). Unfortunately, under this discretisation we found the probability flow ODE to become stiff for more complex distributions such as the mixture of Gaussians, resulting in strange effects in the samples (matching the modes but with completely wrong shapes).

Proposition 7. We have, for k = 1, ..., K,

p_{k-1|k}(x_{k-1}|x_k) = \frac{\phi_{k-1}(x_{k-1}) p^{ref}_{k-1|k}(x_{k-1}|x_k)}{\phi_k(x_k)},    for \phi_k(x_k) = \frac{p_k(x_k)}{p^{ref}_k(x_k)}.    (78)

The value functions (\phi_k)_{k=1}^K satisfy a forward Bellman-type equation

\phi_k(x_k) = \int \phi_{k-1}(x_{k-1}) p^{ref}_{k-1|k}(x_{k-1}|x_k) dx_{k-1},    \phi_0(x_0) = \frac{\pi(x_0)}{p^{ref}_0(x_0)}.    (79)

It follows that \phi_k(x_k) = \int \phi_0(x_0) p^{ref}_{0|k}(x_0|x_k) dx_0.    (80)

Proof.
To establish (78), we use Bayes' rule:

p_{k-1|k}(x_{k-1}|x_k) = \frac{p_{k-1}(x_{k-1}) p_{k|k-1}(x_k|x_{k-1})}{p_k(x_k)} = \frac{\phi_{k-1}(x_{k-1}) p^{ref}_{k-1}(x_{k-1}) p_{k|k-1}(x_k|x_{k-1})}{\phi_k(x_k) p^{ref}_k(x_k)} = \frac{\phi_{k-1}(x_{k-1}) p^{ref}_{k-1|k}(x_{k-1}|x_k)}{\phi_k(x_k)},

where we have used the fact that p_k(x_k) = \phi_k(x_k) p^{ref}_k(x_k), that p^{ref}_{k|k-1}(x_k|x_{k-1}) = p_{k|k-1}(x_k|x_{k-1}), and Bayes' rule again. Now we have \int p_{k-1|k}(x_{k-1}|x_k) dx_{k-1} = 1 for any x_k, so it follows directly from the expression of this transition kernel that the value function satisfies

\phi_k(x_k) = \int \phi_{k-1}(x_{k-1}) p^{ref}_{k-1|k}(x_{k-1}|x_k) dx_{k-1},    \phi_0(x_0) = \frac{\pi(x_0)}{p^{ref}_0(x_0)}.    (79)

By iterating this recursion, it follows that \phi_k(x_k) = \int \phi_0(x_0) p^{ref}_{0|k}(x_0|x_k) dx_0.
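The closing identity \phi_k(x_k) = \int \phi_0(x_0) p^{ref}_{0|k}(x_0|x_k) dx_0 can be verified by Monte Carlo in one dimension (our own Gaussian example, where the stationary OU reference makes every kernel available in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-d Gaussian example: target pi = N(mu, s_pi^2), stationary OU reference N(0, sigma^2),
# exponential-integrator transitions, so all kernels are closed form.
mu, s_pi, sigma = 1.0, 0.5, 1.0
alpha, k = 0.1, 7
rho = (1.0 - alpha) ** (k / 2)          # Corr(x_0, x_k) under p_ref

def npdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

phi0 = lambda x0: npdf(x0, mu, s_pi**2) / npdf(x0, 0.0, sigma**2)

xk = 0.8
# Monte Carlo over the reference backward kernel p_ref(x_0 | x_k) = N(rho x_k, sigma^2 (1 - rho^2))
x0 = rho * xk + sigma * np.sqrt(1.0 - rho**2) * rng.standard_normal(400000)
phi_mc = phi0(x0).mean()

# Closed form: phi_k(x_k) = p_k(x_k) / p_ref_k(x_k)
phi_exact = npdf(xk, rho * mu, rho**2 * s_pi**2 + sigma**2 * (1.0 - rho**2)) / npdf(xk, 0.0, sigma**2)
print(phi_mc, phi_exact)   # agree to Monte Carlo accuracy
```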



This is referred to as a Variance Preserving diffusion by Song et al. (2021c). Henceforth, it is assumed that α_k can be computed exactly. Zhang & Chen (2022) inadvertently considered the more favourable σ_f = 1 scenario for PIS but used σ_f = 3 for other methods; this explains the significant differences between their results and ours.



Figure 1: Training loss per hyperparameter, PIS (left) vs DDS (right) (Ionosphere, steps = 256), over all σ, α_max values.

Figure 2: ln Z estimate (median plus upper/lower quartiles) as a function of the number of steps K: a) Funnel, b) LGCP, c) Logistic Ionosphere dataset. The yellow dotted line is MF-VI and the dashed magenta line is the gold standard.

Figure 3: ln Z estimate as a function of the number of steps K: a) Logistic Sonar dataset, b) Brownian motion, c) NICE. The yellow dotted line is MF-VI and the dashed magenta line is the gold standard.


Figure 5: Results on pretrained VAE from Arbel et al. (2021).

Figure 6: a) DDS, b) PIS, c) SMC. We can see that SMC mode-collapses significantly, as it appears to cover only 5 image classes. For PIS we identified 7 classes, whilst for DDS we found 9 out of 10.

Figure 9: ln Z estimate as a function of number of steps K -a) Funnel , b) LGCP, c) Logistic Ionosphere dataset. Yellow Dotted line is MF-VI and dashed magenta is the gold standard.

to rely on the probability flow ODE. This is an ODE admitting the same marginal distributions (p_t)_{t \in [0,T]} as the noising diffusion, given by

dx_t = -\beta_t (x_t + \sigma^2 \nabla \ln p_t(x_t)) \, dt.

We use this ODE here to design a continuous-time normalizing flow to sample from \pi. For \theta such that f_\theta(t, x) \approx \nabla \ln \phi_t(x) (obtained by minimization of the KL in (10)), we have x_t + \sigma^2 \nabla \ln p_t(x_t) \approx \sigma^2 f_\theta(t, x_t). So it is possible to sample approximately from \pi by integrating the following ODE over [0, T]:

dy_t = \sigma^2 \beta_{T-t} f_\theta(T-t, y_t) \, dt,    y_0 \sim N(0, \sigma^2 I).
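For a 1-d Gaussian target this flow can be integrated explicitly. The sketch below (our own setup, with the exact value-function gradient of Corollary 1 standing in for f_theta) integrates the reverse ODE with Heun's method and recovers samples close to N(mu, 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-d Gaussian target N(mu, 1), sigma = beta_t = 1; f_theta is replaced by the
# exact value-function gradient mu * exp(-t) from Corollary 1.
mu, T, K, N = 2.0, 8.0, 100, 50000
dt = T / K

def f(t, y):
    return mu * np.exp(-t) * np.ones_like(y)

y = rng.standard_normal(N)              # y_0 ~ N(0, sigma^2 I)
for j in range(K):
    t = j * dt
    d1 = f(T - t, y)                    # Heun (explicit trapezoidal) step
    d2 = f(T - (t + dt), y + dt * d1)
    y = y + 0.5 * dt * (d1 + d2)

print(y.mean(), y.std())                # approximately mu = 2 and 1
```

Since the drift is independent of y here, the flow only shifts the initial Gaussian, which is why the variance is preserved exactly.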


Results on the NICE target; we performed 30 runs with different seeds for each approach.

Figure 4: Magnitude of the learnt neural net approximation of the drift ∇_x ln ϕ_t(x) (see (23)) as a function of t.

Training time per ELBO gradient update on 300 samples.



Table 5 showcases how, for lower values of k = 64 and 128, the Euler-Maruyama (EM) based approach significantly overestimates ln Z and overall does not perform well.

Brownian Motion

         ULA    MCD   UHA   LDVI   DDS   PIS
k = 64   -0.7   0.1   0.1   -0.5   1.0   1.0
k = 128  -0.3   0.2   0.4    0.7   1.1   1.0
k = 256  -0.1   0.5   0.6    0.9   1.0   0.7

C.9.1 DETACHED GRADIENT ABLATIONS

In this section we perform an ablation over our proposed modification of the PIS-Grad network architecture. We compare the same architecture with and without the gradient attached, and found that detaching the gradient led to favourable performance as well as more stable training. Results are presented in Table 7 and were carried out for K = 128, using the best-found diffusion coefficient for each task. We chose to explore this feature as it is known that optimization through unrolled computation graphs can be chaotic and introduce exploding/vanishing gradients (Parmas et al., 2018; Metz et al., 2020), which can numerically introduce bias. Providing detached scores as features to recognition networks has been successfully applied in related prior work (Greff et al., 2019). Alternative approaches exist based on smoothing the chaotic loss landscape (Vicol et al., 2021).

By using a Heun integrator, as proposed in Karras et al. (2022), we were able to obtain better results for the probability flow ODE; however, we also found we needed to increase the network size, as well as train with a learning rate decay of 0.99. With these extra features we were able to simulate a probability flow ODE that roughly matched the marginal densities of the SDE. However, even with these fixes we still found the probability-flow-based estimator of ln Z to drastically overestimate. This is not entirely surprising, as in discrete time the expectation of this estimator is not guaranteed to be a lower bound on ln Z.

C.12 UNDERDAMPED OU RESULTS

We perform some additional experiments with the underdamped OU reference process. Similar to the damped setting, we parametrise the network analogously. Unfortunately, unlike in DDS and PIS, we cannot directly aid the update/proposal for x_t with ∇ ln π(x); thus this naive parametrisation is not the ideal inductive bias. Results in Figures 9 and 10 show some experiments for this approach. The results are worse than those of DDS with a standard OU reference process and of PIS; however, they are still better than VI. We believe that future work exploring better inductive biases for the network f_θ(k, x, p) could narrow the performance gap of this approach. Additionally, we highlight that in Figure 9 (Funnel distribution) the underdamped approach overestimates ln Z for k = 64; we believe this may be due to numerical error when generating a sample.

C.13 BOX PLOTS FOR MAIN RESULTS

For completeness, in this section we also present our results via box plots, as can be seen in Figures 11 and 12.

In discrete time, we consider the corresponding discrete-time "forward" Markov process; following (17), we use the associated transitions. We propose here to sample approximately from π by approximating the ancestral sampling scheme for (75) corresponding to the backward decomposition, where p_0(x_0) = π(x_0). If we could sample x_K ∼ p_K(·) and then x_{k-1} ∼ p_{k-1|k}(·|x_k) for k = K, ..., 1, then x_0 would indeed be a sample from π. However, as the marginal densities (p_k)_{k=1}^K are not available in closed form, we cannot implement this ancestral sampling procedure exactly. As we have by design p_K(x_K) ≈ N(x_K; 0, σ² I), we can simply initialize the ancestral sampling procedure with a Gaussian sample. The approximation of the backward Markov kernels p_{k-1|k}(x_{k-1}|x_k) is, however, more involved.

D.2 REFERENCE PROCESS AND BELLMAN RECURSION

We introduce the "simple" reference process p^ref(x_{0:K}), which is designed by construction to admit the marginal distributions p^ref_k(x_k) = N(x_k; 0, σ² I) for all k. It can be easily verified using the chain rule for KL that the extended target process p is the distribution minimizing the reverse (or forward) KL discrepancy w.r.t. p^ref over the set of path measures q(x_{0:K}) with marginal q_0(x_0) = π(x_0) at the initial time, i.e. p = arg min_q {KL(q || p^ref) : q_0 = π}.

D.3 APPROXIMATING THE BACKWARD KERNELS

We want to approximate the backward Markov transitions p_{k-1|k}. From (78), this would require not only approximating the value functions (Monte Carlo estimates based on (80) are high variance), but also sampling from the resulting twisted kernel, which is difficult. However, if we select β_k ≈ 0, then ϕ_{k-1}(x) ≈ ϕ_k(x) and, by a Taylor expansion of ∇ ln ϕ_k around x_k, we obtain an approximate backward kernel. By differentiating the identity (80) w.r.t. x_k, we obtain an expression for ∇ ln ϕ_k. Alternatively, we can also use ∇ ln ϕ_k(x_k) = ∇ ln p_k(x_k) + x_k/σ², where ∇ ln p_k(x_k) can easily be shown to satisfy a conditional-expectation identity. For DDPM, the conditional expectation in (85) is reformulated as the solution to a regression problem, leveraging the fact that we can easily obtain samples from π(x_0)p_{k|0}(x_k|x_0), as π is the data distribution in this context. In the Monte Carlo sampling context considered here, we could sample approximately from p_{0|k}(x_0|x_k) ∝ π(x_0)p_{k|0}(x_k|x_0) using MCMC so as to approximate the conditional expectations in (84) and (85), but this would defeat the purpose of the proposed methodology. We propose instead to approximate ∇ ln ϕ_k by minimizing a suitable criterion. In practice, we will consider a distribution q^θ(x_{0:K}) approximating p(x_{0:K}); inspired by (83), we will consider an approximation of that form.

Published as a conference paper at ICLR 2023

D.4 HYPERPARAMETERS

D.4.1 FITTED HYPERPARAMETERS

To aid reproducibility, we report all fitted hyperparameters for our methods and PIS across all experiments in Tables 8 and 9.

D.4.2 OPTIMISATION HYPERPARAMETERS

As mentioned in the experimental section, across all experiments except the Funnel we use the Adam optimiser with a learning rate of 0.0001, no learning rate decay, and 11000 training iterations; for the remaining optimisation parameters we use the default settings provided by the Optax library (Hessel et al., 2020), namely b_1 = 0.9, b_2 = 0.999, ϵ = 10^{-8}, following the naming of Kingma & Ba (2015). From the GitHub repository of Zhang & Chen (2022) we were only able to find hyperparameters reported for the Funnel distribution. To reproduce their results we used a learning rate of 0.005 and a learning rate decay of 0.95 as per their implementation; their results were initially not reproducible due to a bug setting σ_f = 1 while comparing to other methods at the less favourable value σ_f = 3. For σ_f = 1 we were able to reproduce their results; however, we report results at σ_f = 3, as this is the traditional value used for this loss. As no other optimisation configuration files were provided, we used the more conservative learning rate of 0.0001, since PIS was very unstable at 0.005 with decay 0.95 across many of our tasks. Finally, we clarify that exactly the same optimiser settings were used for both PIS and DDS in order to ensure a fair comparison.

D.4.3 DRIFT AND GRADIENT CLIPPING

We follow the same gradient clipping as in Zhang & Chen (2022).

