SAMPLING IS AS EASY AS LEARNING THE SCORE: THEORY FOR DIFFUSION MODELS WITH MINIMAL DATA ASSUMPTIONS

Abstract

We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL•E 2. Our main result is that, assuming accurate score estimates, such SGMs can efficiently sample from essentially any realistic data distribution. In contrast to prior works, our results (1) hold for an $L^2$-accurate score estimate (rather than $L^\infty$-accurate); (2) do not require restrictive functional inequality conditions that preclude substantial non-log-concavity; (3) scale polynomially in all relevant problem parameters; and (4) match state-of-the-art complexity guarantees for discretization of the Langevin diffusion, provided that the score error is sufficiently small. We view this as strong theoretical justification for the empirical success of SGMs. We also examine SGMs based on the critically damped Langevin diffusion (CLD). Contrary to conventional wisdom, we provide evidence that the use of the CLD does not reduce the complexity of SGMs.

1. INTRODUCTION

Score-based generative models (SGMs) are a family of generative models which achieve state-of-the-art performance for generating audio and image data (Sohl-Dickstein et al., 2015; Ho et al., 2020; Dhariwal & Nichol, 2021; Kingma et al., 2021; Song et al., 2021a;b; Vahdat et al., 2021); see, e.g., the recent surveys (Cao et al., 2022; Croitoru et al., 2022; Yang et al., 2022). For example, denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a key component in large-scale generative models such as DALL•E 2 (Ramesh et al., 2022). As the importance of SGMs continues to grow due to newfound applications in commercial domains, it is a pressing question of both practical and theoretical concern to understand the mathematical underpinnings which explain their startling empirical successes. As we explain in Section 2, at their mathematical core, SGMs consist of two stochastic processes, the forward process and the reverse process. The forward process transforms samples from a data distribution q (e.g., images) into noise, whereas the reverse process transforms noise into samples from q, hence performing generative modeling. Running the reverse process requires estimating the score function of the law of the forward process; this is typically done by training neural networks on a score matching objective (Hyvärinen, 2005; Vincent, 2011; Song & Ermon, 2019). Providing precise guarantees for estimation of the score function is difficult, as it requires an understanding of the non-convex training dynamics of neural network optimization that is currently out of reach. However, given the empirical success of neural networks on the score estimation task, a natural and important question is whether accurate score estimation implies that SGMs provably converge to the true data distribution in realistic settings.
This is a surprisingly delicate question: even with accurate score estimates, as we explain in Section 2.1, there are several other sources of error which could cause the SGM to fail to converge. Indeed, despite a flurry of recent work (Block et al., 2020; De Bortoli et al., 2021; De Bortoli, 2022; Lee et al., 2022a; Pidstrigach, 2022; Liu et al., 2022), prior analyses fall short of answering this question, for (at least) one of three main reasons: they are not quantitative, they scale exponentially in problem parameters such as the dimension, or they impose restrictive assumptions (such as a log-Sobolev inequality) on the data distribution. Our main result removes these obstructions: given an $L^2$-accurate score estimate, SGMs can sample from (essentially) any data distribution.

Critically damped Langevin diffusion (CLD). Using our techniques, we also investigate the use of the critically damped Langevin diffusion (CLD) for SGMs, which was proposed in Dockhorn et al. (2022). Although numerical experiments and intuition from the log-concave sampling literature suggest that the CLD could potentially speed up sampling via SGMs, we provide theoretical evidence to the contrary. Based on this, in Section 3.3, we conjecture that SGMs based on the CLD do not exhibit improved dimension dependence compared to the original DDPM algorithm.

1.2. PRIOR WORK

We now provide a detailed comparison to prior work. By now, there is a vast literature on providing precise complexity estimates for log-concave sampling; see, e.g., Chewi (2022) for an exposition on recent developments. The proofs in this work build upon the techniques developed in this literature. However, our work addresses the significantly more challenging setting of non-log-concave sampling. The work of De Bortoli et al. (2021) provides guarantees for the diffusion Schrödinger bridge (Song et al., 2021b). However, as previously mentioned, their result is not quantitative, and they require an $L^\infty$-accurate score estimate. The works Block et al. (2020); Lee et al. (2022a); Liu et al. (2022) instead analyze SGMs under the more realistic assumption of an $L^2$-accurate score estimate. However, the bounds of Block et al. (2020); Liu et al. (2022) suffer from exponential dependencies on parameters such as the dimension and the smoothness, whereas the bounds of Lee et al. (2022a) require q to satisfy an LSI. The recent work of De Bortoli (2022), motivated by the manifold hypothesis, considers a different pointwise assumption on the score estimation error which allows the error to blow up at time 0 and at spatial infinity. We discuss the manifold setting in more detail in Section 3.2. Unfortunately, the bounds of De Bortoli (2022) also scale exponentially in problem parameters such as the manifold diameter. We also mention that the use of reversed SDEs for sampling is implicit in the interpretation of the proximal sampler (Lee et al., 2021) given by Chen et al. (2022c). Our work can be viewed as expanding upon the theory of Chen et al. (2022c) using a different forward channel (the OU process).

Concurrent work. Very recently, Lee et al. (2022b) independently obtained results similar to our results for DDPM.
While our assumptions are technically somewhat incomparable (they assume the score error can vary with time but assume the data is compactly supported), our quantitative bounds are stronger. Additionally, the upper and lower bounds for CLD are unique to our work.

2. BACKGROUND ON SGMS

Throughout this paper, given a probability measure p which admits a density w.r.t. Lebesgue measure, we abuse notation and identify it with its density function. Additionally, we will let q denote the data distribution from which we want to generate new samples. We assume that q is a probability measure on $\mathbb R^d$ with full support, and that it admits a smooth density $q = \exp(-U)$ (we relax this assumption in Section 3.2). In this section, we provide a brief exposition to SGMs, following Song et al. (2021b).

2.1. BACKGROUND ON DENOISING DIFFUSION PROBABILISTIC MODELING (DDPM)

Forward process. In denoising diffusion probabilistic modeling (DDPM), we start with a forward process, which is a stochastic differential equation (SDE). For clarity, we consider the simplest possible choice, which is the Ornstein-Uhlenbeck (OU) process
\[ d\bar X_t = -\bar X_t \, dt + \sqrt{2} \, dB_t \,, \qquad \bar X_0 \sim q \,, \tag{1} \]
where $(B_t)_{t \ge 0}$ is a standard Brownian motion in $\mathbb R^d$. The OU process is the unique time-homogeneous Markov process which is also a Gaussian process, with stationary distribution equal to the standard Gaussian distribution $\gamma^d$ on $\mathbb R^d$. In practice, it is also common to introduce a positive smooth function $g : \mathbb R_+ \to \mathbb R$ and consider the time-rescaled OU process
\[ d\bar X_t = -g(t)^2 \, \bar X_t \, dt + \sqrt{2} \, g(t) \, dB_t \,, \qquad \bar X_0 \sim q \,. \tag{2} \]
Although our analysis could be extended to consider these variants, in this work we stick with the choice $g \equiv 1$ for simplicity; see Song et al. (2021b) for further discussion. The forward process has the interpretation of transforming samples from the data distribution q into pure noise. From the well-developed theory of Markov diffusions, it is known that if $q_t := \operatorname{law}(\bar X_t)$ denotes the law of the OU process at time t, then $q_t \to \gamma^d$ exponentially fast in various divergences and metrics, such as the 2-Wasserstein metric $W_2$; see Bakry et al. (2014).

Reverse process. If we reverse the forward process (1) in time, then we obtain a process that transforms noise into samples from q, which is the aim of generative modeling. In general, suppose that we have an SDE of the form
\[ d\bar X_t = b_t(\bar X_t) \, dt + \sigma_t \, dB_t \,, \]
where $(\sigma_t)_{t \ge 0}$ is a deterministic matrix-valued process. Then, under mild conditions on the process (e.g., Föllmer, 1985; Cattiaux et al., 2022), which are satisfied for all processes under consideration in this work, the reverse process also admits an SDE description.
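To make the forward dynamics concrete, here is a minimal numerical sketch (our own illustrative code, not from any SGM codebase) that uses the closed-form OU transition to push a far-from-Gaussian toy distribution toward the standard Gaussian:

```python
import numpy as np

def ou_marginal_sample(x0, t, rng):
    """Exact sample from the OU process dX_t = -X_t dt + sqrt(2) dB_t at time t,
    started at x0: X_t = exp(-t) X_0 + sqrt(1 - exp(-2t)) Z with Z ~ N(0, I)."""
    z = rng.standard_normal(x0.shape)
    return np.exp(-t) * x0 + np.sqrt(1.0 - np.exp(-2.0 * t)) * z

rng = np.random.default_rng(0)
# toy "data distribution": a far-from-Gaussian two-point mixture in d = 2
x0 = rng.choice([-5.0, 5.0], size=(100_000, 2))
xT = ou_marginal_sample(x0, t=5.0, rng=rng)
# after t = 5, the marginal should be close to N(0, I): mean ~ 0, variance ~ 1
print(xT.mean(axis=0), xT.var(axis=0))
```

This illustrates the exponential convergence $q_t \to \gamma^d$: even for a strongly bimodal start, a moderate time horizon suffices to make the marginal indistinguishable from pure noise at this sample size.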
Namely, if we fix the terminal time $T > 0$ and set $\bar X^\leftarrow_t := \bar X_{T-t}$ for $t \in [0, T]$, then the process $(\bar X^\leftarrow_t)_{t \in [0,T]}$ satisfies the SDE $d\bar X^\leftarrow_t = b^\leftarrow_t(\bar X^\leftarrow_t) \, dt + \sigma_{T-t} \, dB_t$, where the backwards drift satisfies the relation
\[ b_t + b^\leftarrow_{T-t} = \sigma_t \sigma_t^\top \nabla \ln q_t \,, \qquad q_t := \operatorname{law}(\bar X_t) \,. \tag{3} \]
Applying this to the forward process (1), we obtain the reverse process
\[ d\bar X^\leftarrow_t = \{\bar X^\leftarrow_t + 2\, \nabla \ln q_{T-t}(\bar X^\leftarrow_t)\} \, dt + \sqrt{2} \, dB_t \,, \qquad \bar X^\leftarrow_0 \sim q_T \,, \tag{4} \]
where now $(B_t)_{t \in [0,T]}$ is the reversed Brownian motion. Here, $\nabla \ln q_t$ is called the score function for $q_t$. Since q (and hence $q_t$ for $t \ge 0$) is not explicitly known, in order to implement the reverse process the score function must be estimated on the basis of samples. The mechanism behind this is the idea of score matching, which goes back to Hyvärinen (2005); Vincent (2011): roughly speaking, Gaussian integration by parts implies that minimizing the $L^2(q_t)$ loss achieved by an estimate $s_t$ for the score $\nabla \ln q_t$ is equivalent to minimizing the $L^2(q_t)$ loss in predicting, given a sample from the forward process at time t, what noise was applied to the corresponding sample at time 0 to obtain it. We defer an exposition of the details of score matching to Sections A.1 and D of the supplement. In light of this, it is thus most natural to assume the $L^2(q_t)$ error bound
\[ \mathbb E_{q_t}[\|s_t - \nabla \ln q_t\|^2] \le \varepsilon_{\mathrm{score}}^2 \]
for the score estimator $s_t$. If $s_t$ is taken to be the empirical risk minimizer over a suitable function class, then guarantees for the $L^2(q_t)$ error can be obtained via standard statistical analysis; see, e.g., Block et al. (2020).

Discretization and implementation. We now discuss the final steps required to obtain an implementable algorithm. First, in the learning phase, given samples $\bar X^{(1)}_0, \ldots, \bar X^{(n)}_0$ from q (e.g., a database of natural images), we train a neural network via score matching; see Song & Ermon (2019).
Let $h > 0$ be the step size of the discretization; we assume that we have obtained a score estimate $s_{kh}$ of $\nabla \ln q_{kh}$ for each time $kh$, $k = 0, 1, \ldots, N$, where $T = Nh$. In order to approximately implement the reverse SDE (4), we first replace the score function $\nabla \ln q_{T-t}$ with the estimate $s_{T-t}$. Then, for $t \in [kh, (k+1)h]$, we freeze the value of this coefficient in the SDE at time $kh$. This yields the new SDE
\[ dX^\leftarrow_t = \{X^\leftarrow_t + 2\, s_{T-kh}(X^\leftarrow_{kh})\} \, dt + \sqrt{2} \, dB_t \,, \qquad t \in [kh, (k+1)h] \,. \tag{5} \]
Since this is a linear SDE, it can be integrated in closed form; in particular, conditionally on $X^\leftarrow_{kh}$, the next iterate $X^\leftarrow_{(k+1)h}$ has an explicit Gaussian distribution. There is one final detail: although the reverse SDE (4) should be started at $q_T$, we do not have access to $q_T$ directly. Instead, taking advantage of the fact that $q_T \approx \gamma^d$, we initialize the algorithm at $X^\leftarrow_0 \sim \gamma^d$, i.e., from pure noise. Let $p_t := \operatorname{law}(X^\leftarrow_t)$ denote the law of the algorithm at time t. The goal of this work is to bound $\mathrm{TV}(p_T, q)$, taking into account three sources of error: (1) estimation of the score; (2) discretization of the SDE with step size $h > 0$; and (3) initialization of the algorithm at $\gamma^d$ rather than at $q_T$.
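The closed-form Gaussian transition can be written out explicitly: a short computation (solving the linear SDE (5) with the score term frozen) gives the update $X_{(k+1)h} = e^h X_{kh} + 2(e^h - 1)\, s(X_{kh}) + \sqrt{e^{2h} - 1}\, Z$ with $Z \sim \mathcal N(0, I_d)$. The following sketch is our own illustrative code (the names `ddpm_reverse_step` and `score_est` are hypothetical); the sanity check uses the one case where the score is known exactly, $q = \gamma^d$, for which the reverse dynamics should keep the iterates approximately standard Gaussian:

```python
import numpy as np

def ddpm_reverse_step(x, score_est, h, rng):
    """One exponential-integrator step of the discretized reverse SDE (5):
    dX_t = {X_t + 2 s(x_k)} dt + sqrt(2) dB_t with the score frozen at x_k,
    integrated exactly over a step of length h to a Gaussian update."""
    eh = np.exp(h)
    noise = np.sqrt(eh**2 - 1.0) * rng.standard_normal(x.shape)
    return eh * x + 2.0 * (eh - 1.0) * score_est(x) + noise

# sanity check: if q = N(0, I), then q_t = N(0, I) for all t and the true
# score is s_t(x) = -x; starting from pure noise, the iterates should stay
# approximately standard Gaussian (up to an O(h) discretization bias).
rng = np.random.default_rng(1)
x = rng.standard_normal((50_000, 2))
for _ in range(100):
    x = ddpm_reverse_step(x, lambda y: -y, h=0.01, rng=rng)
print(x.mean(), x.var())
```

In a full sampler, `score_est` would be the trained network evaluated at the appropriate time index, and the loop would run for $N = T/h$ steps starting from $X^\leftarrow_0 \sim \gamma^d$.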

2.2. BACKGROUND ON THE CRITICALLY DAMPED LANGEVIN DIFFUSION (CLD)

The critically damped Langevin diffusion (CLD) is based on the forward process
\[ d\bar X_t = \bar V_t \, dt \,, \qquad d\bar V_t = -(\bar X_t + 2 \bar V_t) \, dt + 2 \, dB_t \,. \tag{6} \]
Compared to the OU process (1), this is now a coupled system of SDEs, in which we have introduced a new variable V representing the velocity process. The stationary distribution of the process is $\gamma^{2d}$, the standard Gaussian measure on phase space $\mathbb R^d \times \mathbb R^d$, and we initialize at $\bar X_0 \sim q$ and $\bar V_0 \sim \gamma^d$. More generally, the CLD (6) is an instance of what is referred to as the kinetic Langevin or underdamped Langevin process in the sampling literature. In the context of strongly log-concave sampling, the smoother paths of X lead to smaller discretization error, thereby furnishing an algorithm with $O(\sqrt d/\varepsilon)$ gradient complexity (as opposed to sampling based on the overdamped Langevin process, which has complexity $O(d/\varepsilon^2)$); see Cheng et al. (2018); Shen & Lee (2019); Dalalyan & Riou-Durand (2020); Ma et al. (2021). The recent paper of Dockhorn et al. (2022) proposed to use the CLD as the basis for an SGM and empirically observed improvements over DDPM. Applying (3), the corresponding reverse process is
\[ d\bar X^\leftarrow_t = -\bar V^\leftarrow_t \, dt \,, \qquad d\bar V^\leftarrow_t = \{\bar X^\leftarrow_t + 2 \bar V^\leftarrow_t + 4\, \nabla_v \ln q_{T-t}(\bar X^\leftarrow_t, \bar V^\leftarrow_t)\} \, dt + 2 \, dB_t \,, \tag{7} \]
where $q_t := \operatorname{law}(\bar X_t, \bar V_t)$ is the law of the forward process at time t. Note that the gradient in the score function is taken only w.r.t. the velocity coordinate. Upon replacing the score function with an estimate s, we arrive at the algorithm
\[ dX^\leftarrow_t = -V^\leftarrow_t \, dt \,, \qquad dV^\leftarrow_t = \{X^\leftarrow_t + 2 V^\leftarrow_t + 4\, s_{T-kh}(X^\leftarrow_{kh}, V^\leftarrow_{kh})\} \, dt + 2 \, dB_t \,, \]
for $t \in [kh, (k+1)h]$. We provide further background on the CLD in Section C.1.
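As a concrete illustration of the forward system (6), the following Euler-Maruyama sketch (our own illustrative code; a practical implementation could instead use the exact Gaussian transition of this linear system) simulates the phase-space dynamics and checks that both coordinates converge to the standard Gaussian:

```python
import numpy as np

def cld_forward(x0, T, n_steps, rng):
    """Euler-Maruyama sketch of the CLD forward system (6):
        dX_t = V_t dt ,   dV_t = -(X_t + 2 V_t) dt + 2 dB_t ,
    initialized at X_0 ~ q and V_0 ~ N(0, I). Stationary law: N(0, I_{2d})."""
    h = T / n_steps
    x = x0.copy()
    v = rng.standard_normal(x.shape)
    for _ in range(n_steps):
        x, v = (x + h * v,
                v - h * (x + 2.0 * v) + 2.0 * np.sqrt(h) * rng.standard_normal(v.shape))
    return x, v

rng = np.random.default_rng(2)
x0 = rng.choice([-3.0, 3.0], size=(100_000, 1))   # toy non-Gaussian data
x, v = cld_forward(x0, T=8.0, n_steps=2000, rng=rng)
print(x.var(), v.var())   # both should be close to 1
```

Note the smoother position paths: the noise enters X only through V, which is the structural feature the log-concave sampling literature exploits.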

3. RESULTS

We now formally state our assumptions and our main results.

3.1. RESULTS FOR DDPM

For DDPM, we make the following mild assumptions on the data distribution q.

Assumption 1 (Lipschitz score). For all $t \ge 0$, the score $\nabla \ln q_t$ is L-Lipschitz.

Assumption 2 (second moment bound). For some $\eta > 0$, $\mathbb E_q[\|\cdot\|^{2+\eta}]$ is finite. We also write $m_2^2 := \mathbb E_q[\|\cdot\|^2]$ for the second moment of q.

For technical reasons, we need to assume that q has a finite moment of order slightly but strictly greater than 2, but our quantitative bounds depend only on the second moment $m_2^2$. Assumption 1 is standard and has been used in the prior works Block et al. (2020); Lee et al. (2022a). However, unlike Lee et al. (2022a); Liu et al. (2022), we do not assume Lipschitzness of the score estimate. Moreover, unlike Block et al. (2020); De Bortoli et al. (2021); Liu et al. (2022), we do not make any convexity or dissipativity assumptions on the potential U, and unlike Lee et al. (2022a) we do not assume that q satisfies a log-Sobolev inequality. Hence, our assumptions cover a wide range of highly non-log-concave data distributions. Our proof technique is fairly robust, and even Assumption 1 could be relaxed (as could other extensions, such as considering the time-changed forward process (2)), although we focus on the simplest setting in order to better illustrate the conceptual significance of our results. We also assume a bound on the score estimation error.

Assumption 3 (score estimation error). For all $k = 1, \ldots, N$, $\mathbb E_{q_{kh}}[\|s_{kh} - \nabla \ln q_{kh}\|^2] \le \varepsilon_{\mathrm{score}}^2$.

This is the same assumption as in Lee et al. (2022a), and as discussed in Section 2.1, it is a natural and realistic assumption in light of the derivation of the score matching objective. Our main result for DDPM is the following theorem.

Theorem 2 (DDPM, see Section B of supplement). Suppose that Assumptions 1, 2, and 3 hold. Let $p_T$ be the output of the DDPM algorithm (Section 2.1) at time T, and suppose that the step size $h := T/N$ satisfies $h \lesssim 1/L$, where $L \ge 1$.
Then, it holds that
\[ \mathrm{TV}(p_T, q) \;\lesssim\; \underbrace{\sqrt{\mathrm{KL}(q \,\|\, \gamma^d)}\, \exp(-T)}_{\text{convergence of forward process}} \;+\; \underbrace{(L\sqrt{dh} + L m_2 h)\, \sqrt T}_{\text{discretization error}} \;+\; \underbrace{\varepsilon_{\mathrm{score}} \sqrt T}_{\text{score estimation error}} \,. \]
To interpret this result, suppose that, e.g., $\mathrm{KL}(q \,\|\, \gamma^d) \le \operatorname{poly}(d)$ and $m_2 \le d$. Choosing $T \asymp \log(\mathrm{KL}(q \,\|\, \gamma^d)/\varepsilon)$ and $h \asymp \frac{\varepsilon^2}{L^2 d}$, and hiding logarithmic factors,
\[ \mathrm{TV}(p_T, q) \le O(\varepsilon + \varepsilon_{\mathrm{score}}) \,, \qquad \text{for } N = \Theta\Big(\frac{L^2 d}{\varepsilon^2}\Big) \,. \]
In particular, in order to have $\mathrm{TV}(p_T, q) \le \varepsilon$, it suffices to have score error $\varepsilon_{\mathrm{score}} \le O(\varepsilon)$. We remark that the iteration complexity of $N = \Theta(\frac{L^2 d}{\varepsilon^2})$ matches state-of-the-art complexity bounds for the Langevin Monte Carlo (LMC) algorithm for sampling under a log-Sobolev inequality (LSI); see Vempala & Wibisono (2019); Chewi et al. (2021a). This provides some evidence that our discretization bounds are of the correct order, at least with respect to the dimension and accuracy parameters, and without higher-order smoothness assumptions.
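To see how the theorem translates into concrete algorithm settings, here is a toy calculation. It is purely illustrative: all constants hidden by the $\asymp$ notation are set to 1, and `ddpm_schedule` is our own hypothetical helper, not part of any algorithm in the paper.

```python
import math

def ddpm_schedule(d, L, eps, kl_bound):
    """Illustrative parameter choice from Theorem 2 (hidden constants set to 1):
    T ~ log(kl_bound / eps), h ~ eps^2 / (L^2 d), N = ceil(T / h)."""
    T = math.log(max(kl_bound / eps, math.e))
    h = eps**2 / (L**2 * d)
    N = math.ceil(T / h)
    return T, h, N

T, h, N = ddpm_schedule(d=1000, L=2, eps=0.1, kl_bound=1e6)
print(N)  # iteration count scales as L^2 d / eps^2, up to the log factor in T
```

Doubling the dimension d (or halving $\varepsilon$ twice over) roughly doubles (respectively, quadruples) the iteration count, reflecting the $\Theta(L^2 d/\varepsilon^2)$ complexity.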

3.2. CONSEQUENCES FOR ARBITRARY DATA DISTRIBUTIONS WITH BOUNDED SUPPORT

We now elaborate upon the implications of our results under the sole assumption that the data distribution q is compactly supported, $\operatorname{supp} q \subseteq B(0, R)$. In particular, we do not assume that q has a smooth density w.r.t. Lebesgue measure, which allows for studying the case when q is supported on a lower-dimensional submanifold of $\mathbb R^d$, as in the manifold hypothesis. This setting was investigated recently in De Bortoli (2022). Here, our results do not apply directly, because the score function of q is not well-defined and hence Assumption 1 fails to hold. Also, the bound in Theorem 2 has a term involving $\mathrm{KL}(q \,\|\, \gamma^d)$, which is infinite if q is not absolutely continuous w.r.t. $\gamma^d$. As pointed out by De Bortoli (2022), in general we cannot obtain non-trivial guarantees for $\mathrm{TV}(p_T, q)$, because $p_T$ has full support and therefore $\mathrm{TV}(p_T, q) = 1$ under the manifold hypothesis. Nevertheless, we show that we can apply our results using an early stopping technique. Namely, using the following lemma, we obtain a sequence of corollaries.

Lemma 3 (Lemma 21 in supplement). Suppose that $\operatorname{supp} q \subseteq B(0, R)$ where $R \ge 1$, and let $q_t$ denote the law of the OU process at time t, started at q. Let $\varepsilon_{W_2} > 0$ be such that $\varepsilon_{W_2} \ll \sqrt d$, and set $t \asymp \varepsilon_{W_2}^2/(\sqrt d\,(R \vee \sqrt d))$. Then: (1) $W_2(q_t, q) \le \varepsilon_{W_2}$; (2) $q_t$ satisfies $\mathrm{KL}(q_t \,\|\, \gamma^d) \lesssim \frac{\sqrt d\,(R \vee \sqrt d)^3}{\varepsilon_{W_2}^2}$; and (3) for every $t' \ge t$, $q_{t'}$ satisfies Assumption 1 with $L \lesssim \frac{d R^2 (R \vee \sqrt d)^2}{\varepsilon_{W_2}^4}$.

By substituting $q_t$, for this choice of t, in place of q in Theorem 2, we obtain Corollary 4 below. We remark that taking $q_t$ as the new target corresponds to stopping the algorithm early: instead of running the algorithm backward for a time T, we run it backward for a time $T - t$ (note that $T - t$ should be a multiple of the step size h).

Corollary 4 (compactly supported data). Suppose that q is supported on the ball of radius $R \ge 1$. Let $t \asymp \varepsilon_{W_2}^2/(\sqrt d\,(R \vee \sqrt d))$.
Then, the output $p_{T-t}$ of DDPM is $\varepsilon_{\mathrm{TV}}$-close in TV to the distribution $q_t$, which in turn is $\varepsilon_{W_2}$-close in $W_2$ to q, provided that the step size h is chosen appropriately according to Theorem 2, $N = \Theta\big(\frac{d^3 R^4 (R \vee \sqrt d)^4}{\varepsilon_{\mathrm{TV}}^2\, \varepsilon_{W_2}^8}\big)$, and $\varepsilon_{\mathrm{score}} \le O(\varepsilon_{\mathrm{TV}})$.

Observing that both the TV and $W_1$ metrics upper bound the bounded Lipschitz metric
\[ d_{\mathrm{BL}}(\mu, \nu) := \sup\Big\{ \int f \, d\mu - \int f \, d\nu \;:\; f \colon \mathbb R^d \to [-1, 1] \text{ is 1-Lipschitz} \Big\} \,, \]
we immediately obtain the following corollary.

Corollary 5 (compactly supported data, BL metric). Suppose that q is supported on the ball of radius $R \ge 1$. Let $t \asymp \varepsilon_{W_2}^2/(\sqrt d\,(R \vee \sqrt d))$. Then, the output $p_{T-t}$ of the DDPM algorithm satisfies $d_{\mathrm{BL}}(p_{T-t}, q) \le \varepsilon$, provided that the step size h is chosen appropriately according to Theorem 2, $N = \Theta(d^3 R^4 (R \vee \sqrt d)^4/\varepsilon^{10})$, and $\varepsilon_{\mathrm{score}} \le O(\varepsilon)$.

Finally, if the output $p_{T-t}$ of DDPM at time $T - t$ is projected onto $B(0, R_0)$ for an appropriate choice of $R_0$, then we can also translate our guarantees to the standard $W_2$ metric, which we state as the following corollary.

Corollary 6 (compactly supported data, $W_2$ metric; see Section B.5 in supplement). Suppose that q is supported on the ball of radius $R \ge 1$. Let $t \asymp \varepsilon_{W_2}^2/(\sqrt d\,(R \vee \sqrt d))$, and let $p_{T-t, R_0}$ denote the output of DDPM at time $T - t$ projected onto $B(0, R_0)$ for $R_0 = \Theta(R)$. Then, it holds that $W_2(p_{T-t, R_0}, q) \le \varepsilon$, provided that the step size h is chosen appropriately according to Theorem 2, $N = \Theta(d^3 R^8 (R \vee \sqrt d)^4/\varepsilon^{12})$, and $\varepsilon_{\mathrm{score}} \le O(\varepsilon)$.

Note that the dependencies in the three corollaries above are polynomial in all of the relevant problem parameters. In particular, since the last corollary holds in the $W_2$ metric, it is directly comparable to De Bortoli (2022) and vastly improves upon the exponential dependencies therein.

3.3. RESULTS FOR CLD

In order to state our results for score-based generative modeling based on the CLD, we must first modify Assumptions 1 and 3 accordingly.

Assumption 4. For all $t \ge 0$, the score $\nabla_v \ln q_t$ is L-Lipschitz.

Assumption 5. For all $k = 1, \ldots, N$, $\mathbb E_{q_{kh}}[\|s_{kh} - \nabla_v \ln q_{kh}\|^2] \le \varepsilon_{\mathrm{score}}^2$.

If we ignore the dependence on L and assume that the score estimate is sufficiently accurate, then the iteration complexity guarantee of Theorem 2 is $N = \Theta(d/\varepsilon^2)$. On the other hand, recall from Section 2.2 that, based on intuition from the literature on log-concave sampling and on the empirical findings in Dockhorn et al. (2022), we might expect SGMs based on the CLD to have a smaller iteration complexity than DDPM. We prove the following theorem.

Theorem 7 (CLD, see Section C of supplement). Suppose that Assumptions 2, 4, and 5 hold. Let $p_T$ be the output of the SGM algorithm based on the CLD (Section 2.2) at time T, and suppose that the step size $h := T/N$ satisfies $h \lesssim 1/L$, where $L \ge 1$. Then, there is a universal constant $c > 0$ such that $\mathrm{TV}(p_T, q \otimes \gamma^d)$ is bounded, up to a constant factor, by
\[ \underbrace{\sqrt{\mathrm{KL}(q \,\|\, \gamma^d) + \mathrm{FI}(q \,\|\, \gamma^d)}\, \exp(-cT)}_{\text{convergence of forward process}} \;+\; \underbrace{(L\sqrt{dh} + L m_2 h)\, \sqrt T}_{\text{discretization error}} \;+\; \underbrace{\varepsilon_{\mathrm{score}} \sqrt T}_{\text{score estimation error}} \,, \]
where $\mathrm{FI}(q \,\|\, \gamma^d)$ is the relative Fisher information, $\mathrm{FI}(q \,\|\, \gamma^d) := \mathbb E_q[\|\nabla \ln(q/\gamma^d)\|^2]$.

Note that the guarantee of Theorem 7 is in fact no better than our guarantee for DDPM in Theorem 2. Although it is possible that this is an artefact of our analysis, we believe that it is in fact fundamental. As we discuss in Remark C.2, from the form of the reverse process (7), the SGM based on the CLD lacks a certain property, namely that the discretization error should depend only on the size of the increments of the X process rather than on the increments of both the X and V processes, which is crucial for the improved dimension dependence of the CLD over the Langevin diffusion in log-concave sampling.
Hence, we conjecture that, under our assumptions, SGMs based on the CLD do not in general achieve a better dimension dependence than DDPM. We provide evidence for our conjecture via a lower bound. In our proofs of Theorems 2 and 7, we rely on bounding the KL divergence between certain measures on the path space $C([0, T]; \mathbb R^d)$ via Girsanov's theorem. The following result lower bounds this KL divergence, even in the setting where the score estimate is perfect ($\varepsilon_{\mathrm{score}} = 0$) and the data distribution q is the standard Gaussian.

Theorem 8 (Section C.5 of supplement). Let $p_T$ be the output of the SGM algorithm based on the CLD (Section 2.2) at time T, where the data distribution q is the standard Gaussian $\gamma^d$ and the score estimate is exact ($\varepsilon_{\mathrm{score}} = 0$). Suppose that the step size h satisfies $h \lesssim 1/(T \vee 1)$. Then, for the path measures $P_T$ and $Q^\leftarrow_T$ of the algorithm and of the continuous-time process (7) respectively (see Section C for details), it holds that $\mathrm{KL}(Q^\leftarrow_T \,\|\, P_T) \gtrsim dhT$.

Theorem 8 shows that in order to make the KL divergence between the path measures small, we must take $h \lesssim 1/d$, which leads to an iteration complexity that scales linearly in the dimension d. Theorem 8 does not prove that SGMs based on the CLD cannot achieve better than linear dimension dependence, since it is possible that the output $p_T$ of the SGM is close to $q \otimes \gamma^d$ even if the path measures are not close, but it rules out the possibility of obtaining a better dimension dependence via our Girsanov proof technique. We believe that it provides compelling evidence for our conjecture, i.e., that under our assumptions, the CLD does not improve the complexity of SGMs over DDPM. We remark that in this section, we have only considered the error arising from discretization of the SDE. It is possible that the score function $\nabla_v \ln q_t$ for the SGM with the CLD is easier to estimate than the score function for DDPM, which would provide a statistical benefit to using the CLD.
Indeed, under the manifold hypothesis, the score $\nabla \ln q_t$ for DDPM blows up as $t \searrow 0$, whereas the score $\nabla_v \ln q_t$ for the CLD is well-defined at $t = 0$, which may lead to improvements over DDPM. We do not investigate this question here and leave it for future work.

4. TECHNICAL OVERVIEW

We now give a detailed technical overview of the proof for DDPM (Theorem 2); the proof for CLD (Theorem 7) follows along similar lines. Recall that we must deal with three sources of error: (1) the estimation of the score function; (2) the discretization of the SDE; and (3) the initialization of the reverse process at $\gamma^d$ rather than at $q_T$.

First, we ignore the errors (1) and (2) and focus on the error (3). Hence, we consider the continuous-time reverse SDE (4), initialized from $\gamma^d$ (resp. $q_T$), and denote by $\bar p_t$ (resp. $q_{T-t}$) its marginal distributions. Note that $\bar p_0 = \gamma^d$ and that $q_0 = q$, the data distribution. Using the exponential contraction of the KL divergence along the (forward) OU process, we have $\mathrm{KL}(q_T \,\|\, \gamma^d) \le \exp(-2T)\, \mathrm{KL}(q \,\|\, \gamma^d)$. Then, using the data processing inequality along the backward process, we have $\mathrm{TV}(\bar p_T, q) \le \mathrm{TV}(\gamma^d, q_T)$. Therefore, using Pinsker's inequality, we get
\[ \mathrm{TV}(\bar p_T, q) \le \mathrm{TV}(\gamma^d, q_T) \le \sqrt{\mathrm{KL}(q_T \,\|\, \gamma^d)} \le \exp(-T)\, \sqrt{\mathrm{KL}(q \,\|\, \gamma^d)} \,, \]
i.e., the output $\bar p_T$ converges to the data distribution q exponentially fast as $T \to \infty$.

Next, we consider the score estimation error (1) and the discretization error (2). Using Girsanov's theorem, these errors can be bounded by
\[ \sum_{k=0}^{N-1} \mathbb E \int_{kh}^{(k+1)h} \|s_{T-kh}(\bar X^\leftarrow_{kh}) - \nabla \ln q_{T-t}(\bar X^\leftarrow_t)\|^2 \, dt \tag{8} \]
(see the inequality (15) in the supplement). Unlike other proof techniques, such as the interpolation method in Lee et al. (2022a), the error term (8) in Girsanov's theorem involves an expectation under the law of the true reverse process, instead of the law of the algorithm. This difference allows us to bound the score estimation error using Assumption 3 directly, which yields a simpler proof that works under milder assumptions on the data distribution. However, the use of Girsanov's theorem typically requires a technical condition known as Novikov's condition, which fails to hold under our minimal assumptions.
To circumvent this issue, we use an approximation argument relying on abstract results on the convergence of stochastic processes. A recent concurrent and independent work Liu et al. (2022) also uses Girsanov's theorem, but assumes that Novikov's condition holds at the outset.
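The first step of the argument, the exponential KL contraction along the forward process, can be checked exactly in the Gaussian case, where the OU marginals are available in closed form. The following sketch (our own illustrative code) verifies $\mathrm{KL}(q_t \,\|\, \gamma) \le e^{-2t}\, \mathrm{KL}(q \,\|\, \gamma)$ for one-dimensional Gaussian initial data:

```python
import numpy as np

def kl_gauss_to_std(m, s2):
    """KL( N(m, s2) || N(0, 1) ) in one dimension."""
    return 0.5 * (s2 + m**2 - 1.0 - np.log(s2))

def ou_pushforward(m, s2, t):
    """Law of the OU process at time t started from N(m, s2):
    N( m e^{-t}, s2 e^{-2t} + 1 - e^{-2t} )."""
    a = np.exp(-2.0 * t)
    return m * np.exp(-t), s2 * a + 1.0 - a

# check the exponential contraction KL(q_t || gamma) <= e^{-2t} KL(q || gamma)
m0, s20 = 3.0, 0.5
kl0 = kl_gauss_to_std(m0, s20)
for t in [0.1, 0.5, 1.0, 2.0, 5.0]:
    mt, s2t = ou_pushforward(m0, s20, t)
    assert kl_gauss_to_std(mt, s2t) <= np.exp(-2.0 * t) * kl0 + 1e-12
print("contraction verified")
```

For general (non-Gaussian) initial data, the same contraction follows from the log-Sobolev inequality for $\gamma^d$, which is the mechanism invoked in the proof.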

5. CONCLUSION

In this work, we provided the first convergence guarantees for SGMs which hold under realistic assumptions (namely, $L^2$-accurate score estimation and arbitrarily non-log-concave data distributions) and which scale polynomially in the problem parameters. Our results take a step towards explaining the remarkable empirical success of SGMs, at least assuming the score is learned with small $L^2$ error. The main limitation of this work is that we did not address the question of when the score function can be learned well. In general, studying the non-convex training dynamics of learning the score function via neural networks is challenging, but we believe that the resolution of this problem, even for simple learning tasks, would shed considerable light on SGMs. Together with the results in this paper, it would yield the first end-to-end guarantees for SGMs. In light of the interpretation of our result as a reduction of the task of sampling to the task of score function estimation, we also ask whether there are interesting situations where it is easier to learn the score function (not necessarily via a neural network) than to (directly) sample.

A PRELIMINARIES

In this section, we review the notion of score matching and provide a list of notation for the proofs.

A.1 PRIMER ON SCORE MATCHING

In order to estimate the score function $\nabla \ln q_t$, consider minimizing the $L^2(q_t)$ loss over a function class $\mathcal F$:
\[ \operatorname*{minimize}_{s_t \in \mathcal F} \; \mathbb E_{q_t}[\|s_t - \nabla \ln q_t\|^2] \,, \tag{9} \]
where $\mathcal F$ could be, e.g., a class of neural networks. The idea of score matching, which goes back to Hyvärinen (2005); Vincent (2011), is that after applying integration by parts for the Gaussian measure, the problem (9) is equivalent to the following problem:
\[ \operatorname*{minimize}_{s_t \in \mathcal F} \; \mathbb E\Big\| s_t(\bar X_t) + \frac{1}{\sqrt{1 - \exp(-2t)}}\, Z_t \Big\|^2 \,, \tag{10} \]
where $Z_t \sim \mathcal N(0, I_d)$ is independent of $\bar X_0$ and $\bar X_t = \exp(-t)\, \bar X_0 + \sqrt{1 - \exp(-2t)}\, Z_t$, in the sense that (9) and (10) share the same minimizers. We give a self-contained derivation in Appendix D for the sake of completeness. Unlike (9), however, the objective in (10) can be replaced with an empirical version and estimated on the basis of samples $\bar X^{(1)}_0, \ldots, \bar X^{(n)}_0$ from q, leading to the finite-sample problem
\[ \operatorname*{minimize}_{s_t \in \mathcal F} \; \frac 1 n \sum_{i=1}^n \Big\| s_t(\bar X^{(i)}_t) + \frac{1}{\sqrt{1 - \exp(-2t)}}\, Z^{(i)}_t \Big\|^2 \,, \]
where $(Z^{(i)}_t)_{i \in [n]}$ are i.i.d. standard Gaussians independent of $(\bar X^{(i)}_0)_{i \in [n]}$. Moreover, if we parameterize the score as $s_t = -\frac{1}{\sqrt{1 - \exp(-2t)}}\, z_t$, then the empirical problem is equivalent to
\[ \operatorname*{minimize}_{z_t \in -\sqrt{1 - \exp(-2t)}\, \mathcal F} \; \frac 1 n \sum_{i=1}^n \|z_t(\bar X^{(i)}_t) - Z^{(i)}_t\|^2 \,, \]
which has the illuminating interpretation of predicting the added noise $Z^{(i)}_t$ from the noised data $\bar X^{(i)}_t$, i.e., denoising.
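The equivalence between (9) and (10) can be illustrated numerically in a toy case where both objects are known in closed form: for Gaussian data $q = \mathcal N(0, \sigma_0^2)$, the true score of $q_t$ is linear, and the minimizer of the empirical denoising objective over the linear class $s_t(x) = \theta x$ recovers it. This is our own illustrative sketch (variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
t, sigma0_sq, n = 0.5, 4.0, 200_000

# forward marginal: X_t = e^{-t} X_0 + sqrt(1 - e^{-2t}) Z, as in (10)
x0 = np.sqrt(sigma0_sq) * rng.standard_normal(n)
z = rng.standard_normal(n)
lam = np.sqrt(1.0 - np.exp(-2.0 * t))
xt = np.exp(-t) * x0 + lam * z

# denoising score matching over the linear class s_t(x) = theta * x:
# minimize (1/n) sum | theta * xt_i + z_i / lam |^2, a least-squares
# problem whose solution is theta_hat = -<xt, z/lam> / <xt, xt>.
theta_hat = -np.dot(xt, z / lam) / np.dot(xt, xt)

# the true score of q_t = N(0, sigma_t^2) is s_t(x) = -x / sigma_t^2
sigma_t_sq = sigma0_sq * np.exp(-2.0 * t) + 1.0 - np.exp(-2.0 * t)
print(theta_hat, -1.0 / sigma_t_sq)  # the two should nearly agree
```

The denoising objective never references $\nabla \ln q_t$, yet its empirical minimizer converges to the true score slope as $n \to \infty$, which is exactly the content of the equivalence between (9) and (10).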

NOTATION

For a measurable mapping $T : \mathcal X \to \mathcal X$ and a measure $\mu$ on $\mathcal X$, where $\mathcal X$ is a measurable space, the notation $T_\# \mu$ refers to the pushforward of $\mu$ by the mapping T, i.e., if $X \sim \mu$, then $T(X) \sim T_\# \mu$.

Stochastic processes and their laws.
• The data distribution is $q = q_0$.
• The forward process (1) is denoted $(\bar X_t)_{t \in [0,T]}$, and $\bar X_t \sim q_t$.
• The reverse process (4) is denoted $(\bar X^\leftarrow_t)_{t \in [0,T]}$, where $\bar X^\leftarrow_t := \bar X_{T-t} \sim q_{T-t}$.
• The SGM algorithm (5) is denoted $(X^\leftarrow_t)_{t \in [0,T]}$, and $X^\leftarrow_t \sim p_t$. Recall that we initialize at $p_0 = \gamma^d$, the standard Gaussian measure.
• The process $(X^{\leftarrow, q_T}_t)_{t \in [0,T]}$ is the same as $(X^\leftarrow_t)_{t \in [0,T]}$, except that we initialize this process at $q_T$ rather than at $\gamma^d$. We write $X^{\leftarrow, q_T}_t \sim p^{q_T}_t$.

Conventions for Girsanov's theorem. When we apply Girsanov's theorem, it is convenient to instead think about a single stochastic process, which for ease of notation we denote simply by $(X_t)_{t \in [0,T]}$, and to consider different measures over the path space $C([0, T]; \mathbb R^d)$. The two measures we consider over path space are:
• $Q^\leftarrow_T$, under which $(X_t)_{t \in [0,T]}$ has the law of the reverse process (4);
• $P^{q_T}_T$, under which $(X_t)_{t \in [0,T]}$ has the law of the SGM algorithm initialized at $q_T$ (corresponding to the process $(X^{\leftarrow, q_T}_t)_{t \in [0,T]}$ defined above).
We also use the following notion from stochastic calculus (Le Gall, 2016, Definition 4.6):
• A local martingale $(L_t)_{t \in [0,T]}$ is a stochastic process such that there exists a nondecreasing sequence of stopping times $T_n \to T$ with the property that $L^n = (L_{t \wedge T_n})_{t \in [0,T]}$ is a martingale.

Other parameters. We recall that $T > 0$ denotes the total time for which we run the forward process; $h > 0$ is the step size of the discretization; $L \ge 1$ is the Lipschitz constant of the score function; $m_2^2 := \mathbb E_q[\|\cdot\|^2]$ is the second moment under the data distribution; and $\varepsilon_{\mathrm{score}}$ is the $L^2$ score estimation error.

Notation for CLD.
The notational conventions for the CLD are similar; however, we must also consider a velocity variable V . When discussing quantities which involve both position and velocity (e.g., the joint distribution q t of ( Xt , Vt )), we typically use boldface fonts.

B PROOFS FOR DDPM

B.1 PRELIMINARIES ON GIRSANOV'S THEOREM AND A FIRST ATTEMPT AT APPLYING GIRSANOV'S THEOREM

First, we recall a consequence of Girsanov's theorem, which can be obtained by combining pages 136-139, Theorem 5.22, and Theorem 4.13 of Le Gall (2016).

Theorem 9. For $t \in [0, T]$, let $L_t = \int_0^t b_s \, dB_s$, where B is a Q-Brownian motion. Assume that $\mathbb E_Q \int_0^T \|b_s\|^2 \, ds < \infty$. Then, L is a Q-martingale in $L^2(Q)$. Moreover, if
\[ \mathbb E_Q\, \mathcal E(L)_T = 1 \,, \qquad \text{where } \mathcal E(L)_t := \exp\Big( \int_0^t b_s \, dB_s - \frac 1 2 \int_0^t \|b_s\|^2 \, ds \Big) \,, \tag{11} \]
then $\mathcal E(L)$ is also a Q-martingale, and the process $t \mapsto B_t - \int_0^t b_s \, ds$ is a Brownian motion under $P := \mathcal E(L)_T\, Q$, the probability distribution with density $\mathcal E(L)_T$ w.r.t. Q.

If the assumptions of Girsanov's theorem are satisfied (i.e., the condition (11) holds), then we can apply Girsanov's theorem to $Q = Q^\leftarrow_T$ and
\[ b_t = \sqrt 2\, (s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)) \,, \qquad t \in [kh, (k+1)h] \,. \]
This tells us that under $P = \mathcal E(L)_T\, Q^\leftarrow_T$, there exists a Brownian motion $(\beta_t)_{t \in [0,T]}$ such that
\[ dB_t = \sqrt 2\, (s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)) \, dt + d\beta_t \,. \tag{12} \]
Recall that under $Q^\leftarrow_T$, we have a.s.
\[ dX_t = \{X_t + 2\, \nabla \ln q_{T-t}(X_t)\} \, dt + \sqrt 2 \, dB_t \,, \qquad X_0 \sim q_T \,. \tag{13} \]
The equation above still holds P-a.s., since $P \ll Q^\leftarrow_T$ (even if B is no longer a P-Brownian motion). Plugging (12) into (13), we have, P-a.s.,
\[ dX_t = \{X_t + 2\, s_{T-kh}(X_{kh})\} \, dt + \sqrt 2 \, d\beta_t \,, \qquad X_0 \sim q_T \,. \]
In other words, under P, the distribution of X is that of the SGM algorithm started at $q_T$, i.e., $P = P^{q_T}_T = \mathcal E(L)_T\, Q^\leftarrow_T$. Therefore,
\[ \mathrm{KL}(Q^\leftarrow_T \,\|\, P^{q_T}_T) = \mathbb E_{Q^\leftarrow_T} \ln \frac{dQ^\leftarrow_T}{dP^{q_T}_T} = \mathbb E_{Q^\leftarrow_T} \ln \mathcal E(L)_T^{-1} = \sum_{k=0}^{N-1} \mathbb E_{Q^\leftarrow_T} \int_{kh}^{(k+1)h} \|s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2 \, dt \,, \tag{14} \]
where we used $\mathbb E_{Q^\leftarrow_T} L_T = 0$ because L is a martingale. The equality (14) allows us to bound the discrepancy between the SGM algorithm and the reverse process.

B.2 CHECKING THE ASSUMPTIONS OF GIRSANOV'S THEOREM AND THE GIRSANOV DISCRETIZATION ARGUMENT

In most applications of Girsanov's theorem in sampling, a sufficient condition for (11) to hold, known as Novikov's condition, is satisfied. Here, Novikov's condition reads
$$\mathbb E_{Q^\leftarrow_T} \exp\Big(\sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \|s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2\,dt\Big) < \infty,$$
and if Novikov's condition holds, we can apply Girsanov's theorem directly. However, under Assumptions 1, 2, and 3 alone, Novikov's condition need not hold: in order to check it, we would want $X_0$ to have sub-Gaussian tails, for instance. Furthermore, we also could not check that the condition (11), which is weaker than Novikov's condition, holds. Therefore, in the proof of the next theorem, we use an approximation technique to show that
$$\mathrm{KL}(Q^\leftarrow_T \,\|\, P^{q_T}_T) = \mathbb E_{Q^\leftarrow_T} \ln \frac{dQ^\leftarrow_T}{dP^{q_T}_T} \le \mathbb E_{Q^\leftarrow_T} \ln \mathcal E(L)^{-1}_T = \sum_{k=0}^{N-1} \mathbb E_{Q^\leftarrow_T} \int_{kh}^{(k+1)h} \|s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2\,dt. \tag{15}$$
We then use a discretization argument based on stochastic calculus to further bound this quantity. The result is the following theorem.

Theorem 10 (discretization error for DDPM). Suppose that Assumptions 1, 2, and 3 hold. Let $Q^\leftarrow_T$ and $P^{q_T}_T$ denote the measures on path space corresponding to the reverse process (4) and the SGM algorithm with $L^2$-accurate score estimate initialized at $q_T$. Assume that $L \ge 1$ and $h \lesssim 1/L$. Then,
$$\mathrm{TV}(P^{q_T}_T, Q^\leftarrow_T)^2 \le \mathrm{KL}(Q^\leftarrow_T \,\|\, P^{q_T}_T) \lesssim (\varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2)\,T.$$

Proof. We start by proving
$$\sum_{k=0}^{N-1} \mathbb E_{Q^\leftarrow_T} \int_{kh}^{(k+1)h} \|s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2\,dt \lesssim (\varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2)\,T.$$
Then, we give the approximation argument to prove the inequality (15).

Bound on the discretization error. For $t \in [kh, (k+1)h]$, we can decompose
$$\begin{aligned}
\mathbb E_{Q^\leftarrow_T}[\|s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2]
&\lesssim \mathbb E_{Q^\leftarrow_T}[\|s_{T-kh}(X_{kh}) - \nabla \ln q_{T-kh}(X_{kh})\|^2] \\
&\quad + \mathbb E_{Q^\leftarrow_T}[\|\nabla \ln q_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_{kh})\|^2] \\
&\quad + \mathbb E_{Q^\leftarrow_T}[\|\nabla \ln q_{T-t}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2] \\
&\lesssim \varepsilon_{\mathrm{score}}^2 + \mathbb E_{Q^\leftarrow_T}\Big[\Big\|\nabla \ln \frac{q_{T-kh}}{q_{T-t}}(X_{kh})\Big\|^2\Big] + L^2\,\mathbb E_{Q^\leftarrow_T}[\|X_{kh} - X_t\|^2]. \tag{16}
\end{aligned}$$
We must bound the change in the score function along the forward process. If $S : \mathbb R^d \to \mathbb R^d$ is the mapping $S(x) := \exp(-(t-kh))\,x$, then $q_{T-kh} = S_\# q_{T-t} * \mathrm{normal}(0, 1 - \exp(-2(t-kh)))$. We can then use Lee et al. (2022a, Lemma C.12) (or the more general Lemma 17 that we prove in Section C.4) with $\alpha = \exp(t-kh) = 1 + O(h)$ and $\sigma^2 = 1 - \exp(-2(t-kh)) = O(h)$ to obtain
$$\Big\|\nabla \ln \frac{q_{T-kh}}{q_{T-t}}(X_{kh})\Big\|^2 \lesssim L^2 dh + L^2 h^2\,\|X_{kh}\|^2 + (1 + L^2)\,h^2\,\|\nabla \ln q_{T-t}(X_{kh})\|^2 \lesssim L^2 dh + L^2 h^2\,\|X_{kh}\|^2 + L^2 h^2\,\|\nabla \ln q_{T-t}(X_{kh})\|^2,$$
where the last line uses $L \ge 1$. For the last term,
$$\|\nabla \ln q_{T-t}(X_{kh})\|^2 \lesssim \|\nabla \ln q_{T-t}(X_t)\|^2 + \|\nabla \ln q_{T-t}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2 \lesssim \|\nabla \ln q_{T-t}(X_t)\|^2 + L^2\,\|X_{kh} - X_t\|^2,$$
where the second term above is absorbed into the third term of the decomposition (16). Hence,
$$\mathbb E_{Q^\leftarrow_T}[\|s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2] \lesssim \varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 h^2\,\mathbb E_{Q^\leftarrow_T}[\|X_{kh}\|^2] + L^2 h^2\,\mathbb E_{Q^\leftarrow_T}[\|\nabla \ln q_{T-t}(X_t)\|^2] + L^2\,\mathbb E_{Q^\leftarrow_T}[\|X_{kh} - X_t\|^2].$$
Using the fact that under $Q^\leftarrow_T$ the process $(X_t)_{t\in[0,T]}$ is the time reversal of the forward process $(\bar X_t)_{t\in[0,T]}$, we can apply the moment bounds in Lemma 11 and the movement bound in Lemma 12 to obtain
$$\mathbb E_{Q^\leftarrow_T}[\|s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\|^2] \lesssim \varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 h^2\,(d + m_2^2) + L^3 dh^2 + L^2\,(m_2^2 h^2 + dh) \lesssim \varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2.$$

Approximation argument. For $t \in [0,T]$, let $L_t = \int_0^t b_s\,dB_s$, where $B$ is a $Q^\leftarrow_T$-Brownian motion and we define
$$b_t = \sqrt 2\,\{s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\}, \qquad t \in [kh, (k+1)h].$$
We proved that $\mathbb E_{Q^\leftarrow_T} \int_0^T \|b_s\|^2\,ds \lesssim (\varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2)\,T < \infty$. Using Le Gall (2016, Proposition 5.11), $(\mathcal E(L)_t)_{t\in[0,T]}$ is a local martingale. Therefore, there exists a non-decreasing sequence of stopping times $T_n \nearrow T$ s.t. $(\mathcal E(L)_{t \wedge T_n})_{t\in[0,T]}$ is a martingale. Note that $\mathcal E(L)_{t \wedge T_n} = \mathcal E(L^n)_t$, where $L^n_t = L_{t \wedge T_n}$.
Since $\mathcal E(L^n)$ is a martingale, we have $\mathbb E_{Q^\leftarrow_T} \mathcal E(L^n)_T = \mathbb E_{Q^\leftarrow_T} \mathcal E(L^n)_0 = 1$, i.e., $\mathbb E_{Q^\leftarrow_T} \mathcal E(L)_{T_n} = 1$. We apply Girsanov's theorem to $L^n_t = \int_0^t b_s\,\mathbf 1_{[0,T_n]}(s)\,dB_s$, where $B$ is a $Q^\leftarrow_T$-Brownian motion. Since $\mathbb E_{Q^\leftarrow_T} \int_0^T \|b_s\,\mathbf 1_{[0,T_n]}(s)\|^2\,ds \le \mathbb E_{Q^\leftarrow_T} \int_0^T \|b_s\|^2\,ds < \infty$ (see the last paragraph) and $\mathbb E_{Q^\leftarrow_T} \mathcal E(L^n)_T = 1$, we obtain that under $P^n := \mathcal E(L^n)_T\,Q^\leftarrow_T$, there exists a Brownian motion $\beta^n$ s.t. for $t \in [0,T]$,
$$dB_t = \sqrt 2\,\{s_{T-kh}(X_{kh}) - \nabla \ln q_{T-t}(X_t)\}\,\mathbf 1_{[0,T_n]}(t)\,dt + d\beta^n_t.$$
Recall that under $Q^\leftarrow_T$ we have a.s.
$$dX_t = \{X_t + 2\,\nabla \ln q_{T-t}(X_t)\}\,dt + \sqrt 2\,dB_t, \qquad X_0 \sim q_T.$$
The equation above still holds $P^n$-a.s. since $P^n \ll Q^\leftarrow_T$. Combining the last two equations, we then obtain $P^n$-a.s.
$$dX_t = \{X_t + 2\,s_{T-kh}(X_{kh})\}\,\mathbf 1_{[0,T_n]}(t)\,dt + \{X_t + 2\,\nabla \ln q_{T-t}(X_t)\}\,\mathbf 1_{[T_n,T]}(t)\,dt + \sqrt 2\,d\beta^n_t, \qquad X_0 \sim q_T.$$
In other words, $P^n$ is the law of the solution of the SDE (17). At this stage, we have the bound
$$\mathrm{KL}(Q^\leftarrow_T \,\|\, P^n) = \mathbb E_{Q^\leftarrow_T} \ln \mathcal E(L)^{-1}_{T_n} = \mathbb E_{Q^\leftarrow_T}\Big[{-L_{T_n}} + \frac12 \int_0^{T_n} \|b_s\|^2\,ds\Big] = \mathbb E_{Q^\leftarrow_T} \frac12 \int_0^{T_n} \|b_s\|^2\,ds \le \mathbb E_{Q^\leftarrow_T} \frac12 \int_0^T \|b_s\|^2\,ds \lesssim (\varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2)\,T,$$
where we used that $\mathbb E_{Q^\leftarrow_T} L_{T_n} = 0$ because $L$ is a $Q^\leftarrow_T$-martingale and $T_n$ is a bounded stopping time (Le Gall, 2016, Corollary 3.23). Our goal is now to show that we can obtain the final result by an approximation argument. We consider a coupling of $(P^n)_{n\in\mathbb N}$ and $P^{q_T}_T$: a sequence of stochastic processes $(X^n)_{n\in\mathbb N}$ over the same probability space, a stochastic process $X$, and a single Brownian motion $W$ over that space s.t.
$$dX^n_t = \{X^n_t + 2\,s_{T-kh}(X^n_{kh})\}\,\mathbf 1_{[0,T_n]}(t)\,dt + \{X^n_t + 2\,\nabla \ln q_{T-t}(X^n_t)\}\,\mathbf 1_{[T_n,T]}(t)\,dt + \sqrt 2\,dW_t,$$
and
$$dX_t = \{X_t + 2\,s_{T-kh}(X_{kh})\}\,dt + \sqrt 2\,dW_t,$$
with $X_0 = X^n_0$ a.s. and $X_0 \sim q_T$. Note that the distribution of $X^n$ (resp. $X$) is $P^n$ (resp. $P^{q_T}_T$). Let $\varepsilon > 0$ and consider the map $\pi_\varepsilon : C([0,T]; \mathbb R^d) \to C([0,T]; \mathbb R^d)$ defined by $\pi_\varepsilon(\omega)(t) := \omega(t \wedge (T - \varepsilon))$.
Noting that $X^n_t = X_t$ for every $t \in [0, T_n]$ and using Lemma 13, we have $\pi_\varepsilon(X^n) \to \pi_\varepsilon(X)$ a.s., uniformly over $[0,T]$. Therefore, $(\pi_\varepsilon)_\# P^n \to (\pi_\varepsilon)_\# P^{q_T}_T$ weakly. Using the lower semicontinuity of the KL divergence and the data-processing inequality (Ambrosio et al., 2005, Lemma 9.4.3 and Lemma 9.4.5), we obtain
$$\mathrm{KL}((\pi_\varepsilon)_\# Q^\leftarrow_T \,\|\, (\pi_\varepsilon)_\# P^{q_T}_T) \le \liminf_{n\to\infty} \mathrm{KL}((\pi_\varepsilon)_\# Q^\leftarrow_T \,\|\, (\pi_\varepsilon)_\# P^n) \le \liminf_{n\to\infty} \mathrm{KL}(Q^\leftarrow_T \,\|\, P^n) \lesssim (\varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2)\,T.$$
Finally, using Lemma 14, $\pi_\varepsilon(\omega) \to \omega$ as $\varepsilon \to 0$, uniformly over $[0,T]$. Therefore, using Ambrosio et al. (2005, Corollary 9.4.6), $\mathrm{KL}((\pi_\varepsilon)_\# Q^\leftarrow_T \,\|\, (\pi_\varepsilon)_\# P^{q_T}_T) \to \mathrm{KL}(Q^\leftarrow_T \,\|\, P^{q_T}_T)$ as $\varepsilon \searrow 0$. Therefore, $\mathrm{KL}(Q^\leftarrow_T \,\|\, P^{q_T}_T) \lesssim (\varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2)\,T$. We conclude with Pinsker's inequality ($\mathrm{TV}^2 \le \mathrm{KL}$).
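The $h$-dependence in Theorem 10 can be observed exactly in a one-dimensional Gaussian model (a sketch of our own, not part of the proof): when $q = \mathrm{normal}(0, v_0)$, every law along the forward and reverse processes is Gaussian, so the variance produced by the exponential-integrator step of (17) with the exact score can be tracked in closed form, with no sampling.

```python
import numpy as np

def run_sgm_gaussian(var0=4.0, T=5.0, N=100):
    """Variance recursion for the SGM with exact score, 1-D Gaussian data.

    Along the OU forward process, q_t = N(0, s2(t)) with
    s2(t) = var0*exp(-2t) + 1 - exp(-2t), so grad ln q_t(x) = -x / s2(t).
    Over [kh, (k+1)h] the step dX_t = {X_t + 2 s_{T-kh}(X_kh)} dt + sqrt(2) dB_t
    is linear, hence Gaussian laws stay Gaussian and the variance obeys an
    explicit recursion.
    """
    h = T / N
    s2 = lambda t: var0 * np.exp(-2 * t) + 1 - np.exp(-2 * t)
    v = 1.0                        # the algorithm is initialized at N(0, 1)
    for k in range(N):
        c = -1.0 / s2(T - k * h)   # frozen score coefficient for this step
        m = np.exp(h) + 2 * c * (np.exp(h) - 1)   # deterministic multiplier
        v = m ** 2 * v + np.exp(2 * h) - 1        # add the injected noise
    r = v / var0
    return 0.5 * (r - 1 - np.log(r))  # KL(N(0, v) || N(0, var0))

kls = [run_sgm_gaussian(N=N) for N in (50, 200, 800)]
print(kls)   # the error shrinks as the step size h = T/N decreases
```

The output KL divergences decrease as $h$ shrinks, consistent with (though not a proof of) the $h$-dependence of Theorem 10; for marginal laws of Gaussians the decay is in fact faster than the path-space bound.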

B.3 PROOF OF THEOREM 2

We can now conclude our main result.

Proof of Theorem 2. We recall the notation from Section 4. By the data-processing inequality,
$$\mathrm{TV}(p_T, q) \le \mathrm{TV}(P_T, P^{q_T}_T) + \mathrm{TV}(P^{q_T}_T, Q^\leftarrow_T) \le \mathrm{TV}(q_T, \gamma^d) + \mathrm{TV}(P^{q_T}_T, Q^\leftarrow_T).$$
Using the convergence of the OU process in KL divergence (see, e.g., Bakry et al., 2014, Theorem 5.2.1) for the first term and applying Theorem 10 for the second term,
$$\mathrm{TV}(p_T, q) \lesssim \sqrt{\mathrm{KL}(q \,\|\, \gamma^d)}\,\exp(-T) + (\varepsilon_{\mathrm{score}} + L\sqrt{dh} + L m_2 h)\,\sqrt T,$$
which proves the result.
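To see how the two terms of this bound are balanced in practice, here is a hypothetical parameter choice of our own (the absolute constants below, and the factor-of-three split, are illustrative rather than the paper's): take $T$ logarithmic in $\mathrm{KL}(q\|\gamma^d)/\varepsilon$ and $h$ of order $\varepsilon^2/(L^2 d T)$, which yields $N = T/h = \Theta(L^2 d T^2 / \varepsilon^2)$ iterations, assuming the score error satisfies $\varepsilon_{\mathrm{score}}\sqrt T \le \varepsilon/3$.

```python
import math

def ddpm_parameters(eps, L, d, kl0):
    """Illustrative (hypothetical constants) choice of T and h so that
    sqrt(kl0) * e^{-T} <= eps/3 and L * sqrt(d*h*T) <= eps/3; the remaining
    L*m2*h term of Theorem 2 is lower order when m2^2 = O(d)."""
    T = max(1.0, math.log(3.0 * math.sqrt(kl0) / eps))
    h = (eps / (3.0 * L)) ** 2 / (d * T)
    N = math.ceil(T / h)
    return T, h, N

T, h, N = ddpm_parameters(eps=0.1, L=1.0, d=10, kl0=10.0)
print(T, h, N)   # N scales like L^2 * d * T^2 / eps^2
```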

B.4 AUXILIARY LEMMAS

In this section, we prove some auxiliary lemmas which are used in the proof of Theorem 2.

Lemma 11 (moment bounds for DDPM). Suppose that Assumptions 1 and 2 hold. Let $(\bar X_t)_{t\in[0,T]}$ denote the forward process (1).
1. (moment bound) For all $t \ge 0$, $\mathbb E[\|\bar X_t\|^2] \le d \vee m_2^2$.
2. (score function bound) For all $t \ge 0$, $\mathbb E[\|\nabla \ln q_t(\bar X_t)\|^2] \le Ld$.

Proof. 1. Along the OU process, we have $\bar X_t \overset{d}{=} \exp(-t)\,\bar X_0 + \sqrt{1 - \exp(-2t)}\,\xi$, where $\xi \sim \mathrm{normal}(0, I_d)$ is independent of $\bar X_0$. Hence,
$$\mathbb E[\|\bar X_t\|^2] = \exp(-2t)\,\mathbb E[\|\bar X_0\|^2] + \{1 - \exp(-2t)\}\,d \le d \vee m_2^2.$$
2. This follows from the $L$-smoothness of $\ln q_t$ (see, e.g., Vempala & Wibisono, 2019, Lemma 9). We give a short proof for the sake of completeness. If $\mathcal L_t f := \Delta f - \langle \nabla U_t, \nabla f\rangle$ is the generator associated with $q_t \propto \exp(-U_t)$, then
$$0 = \mathbb E_{q_t} \mathcal L_t U_t = \mathbb E_{q_t} \Delta U_t - \mathbb E_{q_t}[\|\nabla U_t\|^2] \le Ld - \mathbb E_{q_t}[\|\nabla U_t\|^2].$$

Lemma 12 (movement bound for DDPM). Suppose that Assumption 2 holds. Let $(\bar X_t)_{t\in[0,T]}$ denote the forward process (1). For $0 \le s < t$ with $\delta := t - s$, if $\delta \le 1$, then
$$\mathbb E[\|\bar X_t - \bar X_s\|^2] \lesssim \delta^2 m_2^2 + \delta d.$$

Proof. We can write
$$\mathbb E[\|\bar X_t - \bar X_s\|^2] = \mathbb E\Big[\Big\|{-\int_s^t \bar X_r\,dr} + \sqrt 2\,(B_t - B_s)\Big\|^2\Big] \lesssim \delta \int_s^t \mathbb E[\|\bar X_r\|^2]\,dr + \delta d \lesssim \delta^2\,(d + m_2^2) + \delta d \lesssim \delta^2 m_2^2 + \delta d,$$
where we used Lemma 11.

We omit the proofs of the next two lemmas as they are straightforward.

Lemma 13. Consider $f_n, f : [0,T] \to \mathbb R^d$ s.t. there exists an increasing sequence $(T_n)_{n\in\mathbb N} \subseteq [0,T]$ satisfying the conditions
- $T_n \to T$ as $n \to \infty$,
- $f_n(t) = f(t)$ for every $t \le T_n$.

Then, for every $\varepsilon > 0$, $f_n \to f$ uniformly over $[0, T-\varepsilon]$. In particular, $f_n(\cdot \wedge (T-\varepsilon)) \to f(\cdot \wedge (T-\varepsilon))$ uniformly over $[0,T]$.

Lemma 14. Consider $f : [0,T] \to \mathbb R^d$ continuous, and $f_\varepsilon : [0,T] \to \mathbb R^d$ s.t. $f_\varepsilon(t) = f(t \wedge (T-\varepsilon))$ for $\varepsilon > 0$. Then $f_\varepsilon \to f$ uniformly over $[0,T]$ as $\varepsilon \to 0$.
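The moment bound of Lemma 11 is easy to observe numerically (a quick sanity check of our own): for a compactly supported data distribution with $m_2^2 > d$, the exact OU marginal $\bar X_t = e^{-t}\bar X_0 + \sqrt{1-e^{-2t}}\,\xi$ interpolates between $m_2^2$ and $d$, never exceeding $d \vee m_2^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200_000
# toy data: uniform on {-2, 2}^d, so m2^2 = E||X0||^2 = 4d exactly
X0 = 2.0 * rng.choice([-1.0, 1.0], size=(n, d))
m2_sq = (X0 ** 2).sum(axis=1).mean()     # equals 4d = 20 here

def ou_sample(X0, t, rng):
    """Exact OU marginal: X_t = e^{-t} X_0 + sqrt(1 - e^{-2t}) Z, Z ~ N(0, I)."""
    Z = rng.standard_normal(X0.shape)
    return np.exp(-t) * X0 + np.sqrt(1.0 - np.exp(-2 * t)) * Z

moments = {t: (ou_sample(X0, t, rng) ** 2).sum(axis=1).mean()
           for t in (0.1, 1.0, 5.0)}
print(m2_sq, moments)   # every second moment stays below d v m2^2 = 20
```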

B.5 PROOF OF COROLLARY 6

Proof of Corollary 6. For $R_0 > 0$, let $\Pi_{R_0}$ denote the projection onto $B(0, R_0)$. We want to prove that $W_2((\Pi_{R_0})_\# p_{T-t}, q) \le \varepsilon$. We use the decomposition
$$W_2((\Pi_{R_0})_\# p_{T-t}, q) \le W_2((\Pi_{R_0})_\# p_{T-t}, (\Pi_{R_0})_\# q_t) + W_2((\Pi_{R_0})_\# q_t, q).$$
For the first term, since $(\Pi_{R_0})_\# p_{T-t}$ and $(\Pi_{R_0})_\# q_t$ both have support contained in $B(0, R_0)$, we can upper bound the Wasserstein distance by the total variation distance. Namely, Rolland (2022, Lemma 9) implies that
$$W_2((\Pi_{R_0})_\# p_{T-t}, (\Pi_{R_0})_\# q_t) \lesssim R_0 \sqrt{\mathrm{TV}((\Pi_{R_0})_\# p_{T-t}, (\Pi_{R_0})_\# q_t)} + R_0 \exp(-R_0).$$
By the data-processing inequality, $\mathrm{TV}((\Pi_{R_0})_\# p_{T-t}, (\Pi_{R_0})_\# q_t) \le \mathrm{TV}(p_{T-t}, q_t) \le \varepsilon_{\mathrm{TV}}$, where $\varepsilon_{\mathrm{TV}}$ is from Corollary 4, yielding
$$W_2((\Pi_{R_0})_\# p_{T-t}, (\Pi_{R_0})_\# q_t) \lesssim R_0 \sqrt{\varepsilon_{\mathrm{TV}}} + R_0 \exp(-R_0).$$
Next, we take $R_0 \ge R$ so that $(\Pi_{R_0})_\# q = q$. Since $\Pi_{R_0}$ is 1-Lipschitz, we have
$$W_2((\Pi_{R_0})_\# q_t, q) = W_2((\Pi_{R_0})_\# q_t, (\Pi_{R_0})_\# q) \le W_2(q_t, q) \le \varepsilon_{W_2},$$
where $\varepsilon_{W_2}$ is from Corollary 4. Combining these bounds,
$$W_2((\Pi_{R_0})_\# p_{T-t}, q) \lesssim R_0 \sqrt{\varepsilon_{\mathrm{TV}}} + R_0 \exp(-R_0) + \varepsilon_{W_2}.$$
We now take $\varepsilon_{W_2} = \varepsilon/3$, $R_0 = \Theta(R)$, and $\varepsilon_{\mathrm{TV}} = \Theta(\varepsilon^2/R^2)$ to obtain the desired result. The iteration complexity follows from Corollary 4.

C PROOFS FOR CLD

C.1 BACKGROUND ON THE CLD PROCESS

More generally, for the forward process we can introduce a friction parameter $\gamma > 0$ and consider
$$d\bar X_t = \bar V_t\,dt, \qquad d\bar V_t = -\bar X_t\,dt - \gamma \bar V_t\,dt + \sqrt{2\gamma}\,dB_t.$$
If we write $\bar\theta_t := (\bar X_t, \bar V_t)$, then the forward process satisfies the linear SDE $d\bar\theta_t = A_\gamma \bar\theta_t\,dt + \Sigma_\gamma\,dB_t$, where
$$A_\gamma := \begin{pmatrix} 0 & 1 \\ -1 & -\gamma \end{pmatrix} \qquad\text{and}\qquad \Sigma_\gamma := \begin{pmatrix} 0 \\ \sqrt{2\gamma} \end{pmatrix}.$$
The solution to the SDE is given by
$$\bar\theta_t = \exp(tA_\gamma)\,\bar\theta_0 + \int_0^t \exp\{(t-s)\,A_\gamma\}\,\Sigma_\gamma\,dB_s, \tag{18}$$
which means that, by the Itô isometry,
$$\mathrm{law}(\bar\theta_t) = \exp(tA_\gamma)_\#\,\mathrm{law}(\bar\theta_0) * \mathrm{normal}\Big(0, \int_0^t \exp\{(t-s)\,A_\gamma\}\,\Sigma_\gamma \Sigma_\gamma^{\mathsf T} \exp\{(t-s)\,A_\gamma^{\mathsf T}\}\,ds\Big).$$
Since $\det A_\gamma = 1$, $A_\gamma$ is always invertible. Moreover, from $\operatorname{tr} A_\gamma = -\gamma$, one can work out that the spectrum of $A_\gamma$ is
$$\operatorname{spec}(A_\gamma) = \Big\{{-\frac\gamma2} \pm \sqrt{\frac{\gamma^2}4 - 1}\Big\}.$$
The case $\gamma = 2$ is special: it is the critically damped case, in which the spectrum is $\{-1\}$ and $A_\gamma$ fails to be diagonalizable. Following Dockhorn et al. (2022), which advocated for setting $\gamma = 2$, we will also only consider the critically damped case. This also has the advantage of substantially simplifying the calculations.
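Because $A_2 = -I + N$ with $N$ nilpotent ($N^2 = 0$) and $-I$ commuting with $N$, the matrix exponential at critical damping has the closed form $\exp(tA_2) = e^{-t}(I + tN)$. A quick numerical illustration of our own, confirming the defective spectrum $\{-1\}$ and the closed form:

```python
import numpy as np

gamma = 2.0                                # critical damping
A = np.array([[0.0, 1.0], [-1.0, -gamma]])
N = A + np.eye(2)                          # nilpotent part: A = -I + N, N @ N = 0
eigs = np.linalg.eigvals(A)
print(eigs)                                # both eigenvalues equal -1

def expA(t):
    """exp(t A) in closed form: -I and N commute and N^2 = 0, so
    exp(tA) = e^{-t} (I + t N)."""
    return np.exp(-t) * (np.eye(2) + t * N)

# sanity check against a crude Euler product (I + (t/n) A)^n
t, n = 1.5, 200_000
approx = np.linalg.matrix_power(np.eye(2) + (t / n) * A, n)
print(np.abs(expA(t) - approx).max())      # small discretization error
```

The representation $\exp(tA_2) = e^{-t}(I + tN)$ also explains the polynomial-times-exponential decay typical of critically damped systems.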

C.2 GIRSANOV DISCRETIZATION ARGUMENT

In order to apply Girsanov's theorem, we introduce the path measures $P^{q_T}_T$ and $Q^\leftarrow_T$, under which, respectively,
$$dX_t = -V_t\,dt, \qquad dV_t = \{X_t + 2V_t + 4\,s_{T-kh}(X_{kh}, V_{kh})\}\,dt + 2\,dB_t, \qquad t \in [kh, (k+1)h],$$
and
$$dX_t = -V_t\,dt, \qquad dV_t = \{X_t + 2V_t + 4\,\nabla_v \ln q_{T-t}(X_t, V_t)\}\,dt + 2\,dB_t.$$
Applying Girsanov's theorem, we have the following corollary.

Corollary 15. Suppose that Novikov's condition holds:
$$\mathbb E_{Q^\leftarrow_T} \exp\Big(2 \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2\,dt\Big) < \infty.$$
Then,
$$\mathrm{KL}(Q^\leftarrow_T \,\|\, P^{q_T}_T) = \mathbb E_{Q^\leftarrow_T} \ln \frac{dQ^\leftarrow_T}{dP^{q_T}_T} = 2 \sum_{k=0}^{N-1} \mathbb E_{Q^\leftarrow_T} \int_{kh}^{(k+1)h} \|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2\,dt.$$

Similarly to Appendix B.2, even if Novikov's condition does not hold, one can use an approximation argument to show that the KL divergence is still upper bounded by the last expression. Since the argument follows along the same lines, we omit it for brevity. Using this, we now aim to prove the following theorem.

Theorem 16 (discretization error for CLD). Suppose that Assumptions 2, 4, and 5 hold. Let $Q^\leftarrow_T$ and $P^{q_T}_T$ denote the measures on path space corresponding to the reverse process (7) and the SGM algorithm with $L^2$-accurate score estimate initialized at $q_T$. Assume that $L \ge 1$ and $h \lesssim 1/L$. Then,
$$\mathrm{TV}(P^{q_T}_T, Q^\leftarrow_T)^2 \le \mathrm{KL}(Q^\leftarrow_T \,\|\, P^{q_T}_T) \lesssim (\varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2)\,T.$$

Proof. For $t \in [kh, (k+1)h]$, we can decompose
$$\begin{aligned}
\mathbb E_{Q^\leftarrow_T}[\|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2]
&\lesssim \mathbb E_{Q^\leftarrow_T}[\|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-kh}(X_{kh}, V_{kh})\|^2] \\
&\quad + \mathbb E_{Q^\leftarrow_T}[\|\nabla_v \ln q_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_{kh}, V_{kh})\|^2] \\
&\quad + \mathbb E_{Q^\leftarrow_T}[\|\nabla_v \ln q_{T-t}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2] \\
&\lesssim \varepsilon_{\mathrm{score}}^2 + \mathbb E_{Q^\leftarrow_T}\Big[\Big\|\nabla_v \ln \frac{q_{T-kh}}{q_{T-t}}(X_{kh}, V_{kh})\Big\|^2\Big] + L^2\,\mathbb E_{Q^\leftarrow_T}[\|(X_{kh}, V_{kh}) - (X_t, V_t)\|^2]. \tag{19}
\end{aligned}$$
The change in the score function is bounded by Lemma 17, which generalizes Lee et al. (2022a, Lemma C.12).
From the representation (18) of the solution to the CLD, we note that $q_{T-kh} = (M_0)_\# q_{T-t} * \mathrm{normal}(0, M_1)$, with
$$M_0 = \exp\{(t-kh)\,A_2\}, \qquad M_1 = \int_0^{t-kh} \exp\{(t-kh-s)\,A_2\}\,\Sigma_2 \Sigma_2^{\mathsf T} \exp\{(t-kh-s)\,A_2^{\mathsf T}\}\,ds.$$
In particular, since $\|A_2\|_{\mathrm{op}} \lesssim 1$, $\|A_2^{-1}\|_{\mathrm{op}} \lesssim 1$, and $\|\Sigma_2\|_{\mathrm{op}} \lesssim 1$, it follows that $\|M_0\|_{\mathrm{op}} = 1 + O(h)$ and $\|M_1\|_{\mathrm{op}} = O(h)$. Substituting this into Lemma 17, we deduce that if $h \lesssim 1/L$, then
$$\Big\|\nabla_v \ln \frac{q_{T-kh}}{q_{T-t}}(X_{kh}, V_{kh})\Big\|^2 \le \Big\|\nabla \ln \frac{q_{T-kh}}{q_{T-t}}(X_{kh}, V_{kh})\Big\|^2 \lesssim L^2 dh + L^2 h^2\,(\|X_{kh}\|^2 + \|V_{kh}\|^2) + (1 + L^2)\,h^2\,\|\nabla \ln q_{T-t}(X_{kh}, V_{kh})\|^2 \lesssim L^2 dh + L^2 h^2\,(\|X_{kh}\|^2 + \|V_{kh}\|^2) + L^2 h^2\,\|\nabla \ln q_{T-t}(X_{kh}, V_{kh})\|^2,$$
where in the last step we used $L \ge 1$. For the last term,
$$\|\nabla \ln q_{T-t}(X_{kh}, V_{kh})\|^2 \lesssim \|\nabla \ln q_{T-t}(X_t, V_t)\|^2 + L^2\,\|(X_{kh}, V_{kh}) - (X_t, V_t)\|^2,$$
where the second term above is absorbed into the third term of the decomposition (19). Hence,
$$\mathbb E_{Q^\leftarrow_T}[\|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2] \lesssim \varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 h^2\,\mathbb E_{Q^\leftarrow_T}[\|X_{kh}\|^2 + \|V_{kh}\|^2] + L^2 h^2\,\mathbb E_{Q^\leftarrow_T}[\|\nabla \ln q_{T-t}(X_t, V_t)\|^2] + L^2\,\mathbb E_{Q^\leftarrow_T}[\|(X_{kh}, V_{kh}) - (X_t, V_t)\|^2].$$
By applying the moment bounds in Lemma 18 together with Lemma 19 on the movement of the CLD process, we obtain
$$\mathbb E_{Q^\leftarrow_T}[\|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2] \lesssim \varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 h^2\,(d + m_2^2) + L^3 dh^2 + L^2\,(dh + m_2^2 h^2) \lesssim \varepsilon_{\mathrm{score}}^2 + L^2 dh + L^2 m_2^2 h^2.$$
The proof is concluded via an approximation argument as in Section B.2.

Remark. We now pause to discuss why the discretization bound above does not improve upon the result for DDPM (Theorem 10). In the context of log-concave sampling, one instead considers the underdamped Langevin process
$$dX_t = V_t\,dt, \qquad dV_t = -\nabla U(X_t)\,dt - \gamma V_t\,dt + \sqrt{2\gamma}\,dB_t,$$
which is discretized to yield the algorithm
$$dX_t = V_t\,dt, \qquad dV_t = -\nabla U(X_{kh})\,dt - \gamma V_t\,dt + \sqrt{2\gamma}\,dB_t, \qquad t \in [kh, (k+1)h].$$
Let $P_T$ denote the path measure for the algorithm, and let $Q_T$ denote the path measure for the continuous-time process. After applying Girsanov's theorem, we obtain
$$\mathrm{KL}(Q_T \,\|\, P_T) \asymp \frac1\gamma \sum_{k=0}^{N-1} \mathbb E_{Q_T} \int_{kh}^{(k+1)h} \|\nabla U(X_t) - \nabla U(X_{kh})\|^2\,dt.$$
In this expression, note that $\nabla U$ depends only on the position coordinate. Since the $X$ process is smoother (as we do not add Brownian motion directly to $X$), the error $\|\nabla U(X_t) - \nabla U(X_{kh})\|^2$ is of size $O(dh^2)$, which allows us to take step size $h \lesssim 1/\sqrt d$. This explains why the use of the underdamped Langevin diffusion leads to improved dimension dependence for log-concave sampling. In contrast, consider the reverse process, for which
$$\mathrm{KL}(Q^\leftarrow_T \,\|\, P^{q_T}_T) = 2 \sum_{k=0}^{N-1} \mathbb E_{Q^\leftarrow_T} \int_{kh}^{(k+1)h} \|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2\,dt.$$
Since discretization of the reverse process involves the score function, which depends on both $X$ and $V$, the error now involves controlling $\|V_t - V_{kh}\|^2$, which is of size $O(dh)$ (the process $V$ is not very smooth, because it includes a Brownian motion component). Therefore, from the form of the reverse process, we may expect that SGMs based on the CLD do not improve upon the dimension dependence of DDPM. In Section C.5, we use this observation in order to prove a rigorous lower bound against discretization of SGMs based on the CLD.

Lemma 18 (moment bounds for CLD). Suppose that Assumptions 2 and 4 hold. Let $(\bar X_t, \bar V_t)_{t\in[0,T]}$ denote the forward process (6).
1. (moment bound) For all $t \ge 0$, $\mathbb E[\|(\bar X_t, \bar V_t)\|^2] \lesssim d + m_2^2$.
2. (score function bound) For all $t \ge 0$, $\mathbb E[\|\nabla \ln \boldsymbol q_t(\bar X_t, \bar V_t)\|^2] \le Ld$.

Proof. 1. We can write
$$\mathbb E[\|(\bar X_t, \bar V_t)\|^2] = W_2^2(\boldsymbol q_t, \delta_0) \lesssim W_2^2(\boldsymbol q_t, \gamma^{2d}) + W_2^2(\gamma^{2d}, \delta_0) \lesssim d + W_2^2(\boldsymbol q_t, \gamma^{2d}).$$
Next, the coupling argument of Cheng et al. (2018) shows that the CLD converges exponentially fast in the Wasserstein metric associated to a twisted norm, yielding
$$W_2^2(\boldsymbol q_t, \gamma^{2d}) \lesssim W_2^2(\boldsymbol q, \gamma^{2d}) \lesssim W_2^2(\boldsymbol q, \delta_0) + W_2^2(\delta_0, \gamma^{2d}) \lesssim d + m_2^2.$$
2. The proof is the same as in Lemma 11.

Lemma 19 (movement bound for CLD). Suppose that Assumption 2 holds. Let $(\bar X_t, \bar V_t)_{t\in[0,T]}$ denote the forward process (6).
For $0 < s < t$ with $\delta := t - s$, if $\delta \le 1$, then
$$\mathbb E[\|(\bar X_t, \bar V_t) - (\bar X_s, \bar V_s)\|^2] \lesssim \delta^2 m_2^2 + \delta d.$$

Proof. First,
$$\mathbb E[\|\bar X_t - \bar X_s\|^2] = \mathbb E\Big[\Big\|\int_s^t \bar V_r\,dr\Big\|^2\Big] \le \delta \int_s^t \mathbb E[\|\bar V_r\|^2]\,dr \lesssim \delta^2\,(d + m_2^2),$$
where we used the moment bound in Lemma 18. Next,
$$\mathbb E[\|\bar V_t - \bar V_s\|^2] = \mathbb E\Big[\Big\|\int_s^t (-\bar X_r - 2\bar V_r)\,dr + 2\,(B_t - B_s)\Big\|^2\Big] \lesssim \delta \int_s^t \mathbb E[\|\bar X_r\|^2 + \|\bar V_r\|^2]\,dr + \delta d \lesssim \delta^2\,(d + m_2^2) + \delta d,$$
where we used Lemma 18 again.
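The contrast driving the remark above, namely that the velocity increment is of order $\delta d$ while the position increment is only of order $\delta^2(d + m_2^2)$, shows up immediately in a simulation (our own sketch, with $\gamma = 2$ and a stationary Gaussian initialization, for which $\gamma^{2d}$ is invariant):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, gamma = 4, 100_000, 2.0
X = rng.standard_normal((n, d))    # gamma^{2d} is stationary for this CLD
V = rng.standard_normal((n, d))
X0, V0 = X.copy(), V.copy()

delta, sub = 0.01, 100             # integrate over [0, delta] with fine Euler steps
dt = delta / sub
for _ in range(sub):
    X, V = (X + V * dt,
            V - (X + gamma * V) * dt
              + np.sqrt(2 * gamma * dt) * rng.standard_normal((n, d)))

move_V = ((V - V0) ** 2).sum(axis=1).mean()   # ~ 2 * gamma * delta * d = O(delta d)
move_X = ((X - X0) ** 2).sum(axis=1).mean()   # ~ d * delta^2, much smaller
print(move_V, move_X)
```

With $\delta = 0.01$ and $d = 4$, the velocity movement is close to $2\gamma\delta d = 0.16$, while the position movement is roughly $d\delta^2 = 4\times 10^{-4}$; it is exactly this $O(\delta d)$ velocity term that blocks an improved step size for CLD-based SGMs.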

C.5 LOWER BOUND AGAINST CLD

When proving upper bounds on the KL divergence, we can use the approximation argument described in Section B.2 in order to invoke Girsanov's theorem. However, when proving lower bounds on the KL divergence, this approach no longer works, so we check Novikov's condition directly in the setting of Theorem 8.

Lemma 20 (Novikov's condition holds for CLD). Consider the setting of Theorem 8. Then, the Novikov condition of Corollary 15 holds.

We defer the proof of Lemma 20 to the end of this section. Admitting Lemma 20, we now prove Theorem 8.

Proof of Theorem 8. Since $\boldsymbol q_0 = \gamma^d \otimes \gamma^d = \gamma^{2d}$ is stationary for the forward process (6), we have $\boldsymbol q_t = \gamma^{2d}$ for all $t \ge 0$. In this proof, since the score estimate is perfect and $\boldsymbol q_T = \gamma^{2d}$, we simply denote the path measure for the algorithm as $P_T = P^{q_T}_T$. From Girsanov's theorem in the form of Corollary 15, and from $s_{T-kh}(x, v) = \nabla_v \ln \boldsymbol q_{T-kh}(x, v) = -v$, we have
$$\mathrm{KL}(Q^\leftarrow_T \,\|\, P_T) = 2 \sum_{k=0}^{N-1} \mathbb E_{Q^\leftarrow_T} \int_{kh}^{(k+1)h} \|V_{kh} - V_t\|^2\,dt. \tag{21}$$
To lower bound this quantity, we use the inequality $\|x + y\|^2 \ge \frac12\|x\|^2 - \|y\|^2$ to write, for $t \in [kh, (k+1)h]$,
$$\begin{aligned}
\mathbb E_{Q^\leftarrow_T}[\|V_{kh} - V_t\|^2] &= \mathbb E[\|\bar V_{T-kh} - \bar V_{T-t}\|^2] = \mathbb E\Big[\Big\|\int_{T-t}^{T-kh} \{-\bar X_s - 2\bar V_s\}\,ds + 2\,(B_{T-kh} - B_{T-t})\Big\|^2\Big] \\
&\ge 2\,\mathbb E[\|B_{T-kh} - B_{T-t}\|^2] - \mathbb E\Big[\Big\|\int_{T-t}^{T-kh} \{-\bar X_s - 2\bar V_s\}\,ds\Big\|^2\Big] \\
&\ge 2d\,(t - kh) - (t - kh) \int_{T-t}^{T-kh} \mathbb E[\|\bar X_s + 2\bar V_s\|^2]\,ds \\
&\ge 2d\,(t - kh) - (t - kh) \int_{T-t}^{T-kh} \mathbb E[2\,\|\bar X_s\|^2 + 8\,\|\bar V_s\|^2]\,ds.
\end{aligned}$$
Using the fact that $\bar X_s \sim \gamma^d$ and $\bar V_s \sim \gamma^d$ for all $s \in [0,T]$, we can then bound
$$\mathbb E_{Q^\leftarrow_T}[\|V_{kh} - V_t\|^2] \ge 2d\,(t - kh) - 10d\,(t - kh)^2 \ge d\,(t - kh),$$
provided that $h \le \frac1{10}$. Substituting this into (21),
$$\mathrm{KL}(Q^\leftarrow_T \,\|\, P_T) \ge 2d \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} (t - kh)\,dt = dh^2 N = dhT.$$
This proves the result.

This lower bound shows that the Girsanov discretization argument of Theorem 16 is essentially tight (except possibly for the dependence on $L$). We now prove Lemma 20.
Proof of Lemma 20. Similarly to the proof of Theorem 8 above, we note that
$$\|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2 = \|\bar V_{T-kh} - \bar V_{T-t}\|^2 = \Big\|\int_{T-t}^{T-kh} \{-\bar X_s - 2\bar V_s\}\,ds + 2\,(B_{T-kh} - B_{T-t})\Big\|^2 \lesssim h^2 \sup_{s\in[0,T]} (\|\bar X_s\|^2 + \|\bar V_s\|^2) + \sup_{s\in[T-(k+1)h,\,T-kh]} \|B_{T-kh} - B_s\|^2.$$
Hence, for a universal constant $C > 0$ (which may change from line to line),
$$\mathbb E_{Q^\leftarrow_T} \exp\Big(2 \sum_{k=0}^{N-1} \int_{kh}^{(k+1)h} \|s_{T-kh}(X_{kh}, V_{kh}) - \nabla_v \ln q_{T-t}(X_t, V_t)\|^2\,dt\Big) \le \mathbb E \exp\Big(CTh^2 \sup_{s\in[0,T]} (\|\bar X_s\|^2 + \|\bar V_s\|^2) + Ch \sum_{k=0}^{N-1} \sup_{s\in[T-(k+1)h,\,T-kh]} \|B_{T-kh} - B_s\|^2\Big).$$
By the Cauchy-Schwarz inequality, to prove that this expectation is finite, it suffices to consider the two terms in the exponential separately.

For the first term, solving the CLD explicitly (see the computation below) yields
$$\|\bar X_t\| + \|\bar V_t\| \lesssim \|\bar X_0\| + \|\bar V_0\| + \sup_{t\in[0,T]} \exp(-t)\,\|\tilde B_{(\exp(2t)-1)/2}\|,$$
where $\tilde B$ is another standard Brownian motion and we use the interpretation of stochastic integrals as time changes of Brownian motion (Steele, 2001, Corollary 7.1). Since $(\bar X_0, \bar V_0) \sim \gamma^{2d}$ has independent entries,
$$\mathbb E \exp(CTh^2\,\{\|\bar X_0\|^2 + \|\bar V_0\|^2\}) = \prod_{j=1}^d \mathbb E \exp(CTh^2\,\langle e_j, \bar X_0\rangle^2)\,\mathbb E \exp(CTh^2\,\langle e_j, \bar V_0\rangle^2) < \infty,$$
provided that $h \lesssim 1/\sqrt T$. Also, by the Cauchy-Schwarz inequality, we can give a crude bound: writing $\tau(t) = (\exp(2t) - 1)/2$,
$$\mathbb E \exp\Big(CTh^2 \sup_{t\in[0,T]} \exp(-2t)\,\|\tilde B_{\tau(t)}\|^2\Big) \le \Big[\mathbb E \exp\Big(2CTh^2 \sup_{t\in[0,1]} \exp(-2t)\,\|\tilde B_{\tau(t)}\|^2\Big)\Big]^{1/2} \Big[\mathbb E \exp\Big(2CTh^2 \sup_{t\in[1,T]} \exp(-2t)\,\|\tilde B_{\tau(t)}\|^2\Big)\Big]^{1/2},$$
where, by standard estimates on the supremum of Brownian motion (see, e.g., Chewi et al., 2021b, Lemma 23), the first factor is finite if $h \lesssim 1/\sqrt T$ (again using independence across the dimensions). For the second factor, if we split the supremum according to the blocks $2^k \le t \le 2^{k+1}$ and use Hölder's inequality,
$$\mathbb E \exp\Big(CTh^2 \sup_{t\in[1,T]} \exp(-2t)\,\|\tilde B_{\tau(t)}\|^2\Big) \le \prod_{k=1}^K \Big[\mathbb E \exp\Big(CKTh^2 \sup_{2^k \le t \le 2^{k+1}} \exp(-2t)\,\|\tilde B_{\tau(t)}\|^2\Big)\Big]^{1/K},$$
where $K = O(T)$. Then,
$$\mathbb E \exp\Big(CT^2 h^2 \sup_{2^k \le t \le 2^{k+1}} \exp(-2t)\,\|\tilde B_{\tau(t)}\|^2\Big) \le \mathbb E \exp\Big(CT^2 h^2\,2^{-k} \sup_{1 \le t \le 2^{k+1}} \|\tilde B_{\tau(t)}\|^2\Big) < \infty.$$
By Chewi et al. (2021b, Lemma 23), this quantity is finite if $h \lesssim 1$, which completes the proof.
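The key estimate $\mathbb E\|V_{kh} - V_t\|^2 \ge d\,(t-kh)$ in the proof of Theorem 8 can be cross-checked in closed form (a computation of our own): at stationarity, $\mathbb E[\bar\theta_t \bar\theta_0^{\mathsf T}] = \exp(tA_2)$, and since $\exp(tA_2) = e^{-t}(I + tN)$, the per-coordinate velocity autocovariance is $\mathbb E[\bar V_0 \bar V_t] = e^{-t}(1 - t)$.

```python
import numpy as np

def v_increment_sq(delta, d):
    """E ||V_{s+delta} - V_s||^2 under the stationary law gamma^{2d} of the
    critically damped (gamma = 2) CLD, using E[V_0 V_delta] = e^{-delta}(1-delta)
    per coordinate (our closed-form computation)."""
    return d * (2.0 - 2.0 * np.exp(-delta) * (1.0 - delta))

d = 3
deltas = np.linspace(1e-4, 0.1, 50)
vals = v_increment_sq(deltas, d)
print(vals[:3])
# for small delta the increment behaves like 2 * gamma * delta * d = 4 d delta,
# comfortably above the proof's lower bound d * delta for delta <= 1/10
```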

D DERIVATION OF THE SCORE MATCHING OBJECTIVE

In this section, we present a self-contained derivation of the score matching objective (10) for the reader's convenience; see also Hyvärinen (2005); Vincent (2011); Song & Ermon (2019). Recall that the problem is to solve
$$\operatorname*{minimize}_{s_t \in \mathcal F}\; \mathbb E_{q_t}[\|s_t - \nabla \ln q_t\|^2].$$
This objective cannot be evaluated, even if we replace the expectation over $q_t$ with an empirical average over samples from $q_t$. The trick is to use an integration by parts identity to reformulate the objective. Here, $C$ will denote any constant that does not depend on the optimization variable $s_t$. Expanding the square,
$$\mathbb E_{q_t}[\|s_t - \nabla \ln q_t\|^2] = \mathbb E_{q_t}[\|s_t\|^2 - 2\,\langle s_t, \nabla \ln q_t\rangle] + C.$$
We can rewrite the second term using integration by parts:
$$\int \langle s_t, \nabla \ln q_t\rangle\,dq_t = \int \langle s_t, \nabla q_t\rangle = -\int (\operatorname{div} s_t)\,dq_t = -\iint (\operatorname{div} s_t)\big(\exp(-t)\,x_0 + \sqrt{1 - \exp(-2t)}\,z_t\big)\,dq(x_0)\,d\gamma^d(z_t),$$
where $\gamma^d = \mathrm{normal}(0, I_d)$ and we used the explicit form of the law of the OU process at time $t$. Recall the Gaussian integration by parts identity: for any vector field $v : \mathbb R^d \to \mathbb R^d$,
$$\int (\operatorname{div} v)\,d\gamma^d = \int \langle x, v(x)\rangle\,d\gamma^d(x).$$
Applying this identity,
$$\int \langle s_t, \nabla \ln q_t\rangle\,dq_t = -\frac1{\sqrt{1 - \exp(-2t)}} \iint \langle z_t, s_t(x_t)\rangle\,dq(x_0)\,d\gamma^d(z_t), \qquad\text{where } x_t = \exp(-t)\,x_0 + \sqrt{1 - \exp(-2t)}\,z_t.$$
Substituting this in,
$$\mathbb E_{q_t}[\|s_t - \nabla \ln q_t\|^2] = \mathbb E\Big[\|s_t(X_t)\|^2 + \frac2{\sqrt{1 - \exp(-2t)}}\,\langle Z_t, s_t(X_t)\rangle\Big] + C = \mathbb E\Big[\Big\|s_t(X_t) + \frac{Z_t}{\sqrt{1 - \exp(-2t)}}\Big\|^2\Big] + C,$$
where $X_0 \sim q$ and $Z_t \sim \gamma^d$ are independent, and $X_t := \exp(-t)\,X_0 + \sqrt{1 - \exp(-2t)}\,Z_t$.
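The final identity is the denoising score matching objective: it involves only samples $(X_0, Z_t)$, no density. As a toy sanity check of our own (not from the paper), for Gaussian data $q = \mathrm{normal}(0, v_0)$ the minimizer of the empirical objective over linear functions $s_t(x) = a\,x$ recovers the true score coefficient $-1/\sigma_t^2$, where $\sigma_t^2 = v_0 e^{-2t} + 1 - e^{-2t}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, v0, t = 500_000, 4.0, 0.7
X0 = np.sqrt(v0) * rng.standard_normal(n)   # data q = N(0, v0)
Z = rng.standard_normal(n)
lam = np.sqrt(1.0 - np.exp(-2 * t))
Xt = np.exp(-t) * X0 + lam * Z              # sample from q_t

# minimize the empirical DSM objective E |a*Xt + Z/lam|^2 over a in R;
# this is least squares, with the closed-form solution below
a_hat = -np.mean(Xt * Z) / (lam * np.mean(Xt ** 2))
a_true = -1.0 / (v0 * np.exp(-2 * t) + 1 - np.exp(-2 * t))   # score of N(0, sigma_t^2)
print(a_hat, a_true)   # the fitted coefficient matches the true score slope
```

A neural network trained on the same objective plays the role of the linear family here; the point is only that the DSM minimizer coincides with the true score.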

E DEFERRED PROOFS

Lemma 21. Suppose that $\operatorname{supp} q \subseteq B(0, R)$ where $R \ge 1$, and let $q_t$ denote the law of the OU process at time $t$, started at $q$. Let $\varepsilon > 0$ be such that $\varepsilon \ll \sqrt d$, and set $t \asymp \varepsilon^2/(\sqrt d\,(R \vee \sqrt d))$. Then:
1. $W_2(q_t, q) \le \varepsilon$.
2. $q_t$ satisfies $\mathrm{KL}(q_t \,\|\, \gamma^d) \lesssim \frac{\sqrt d\,(R \vee \sqrt d)^3}{\varepsilon^2}$.
3. For every $t' \ge t$, $q_{t'}$ satisfies Assumption 1 with $L \lesssim \frac{d R^2\,(R \vee \sqrt d)^2}{\varepsilon^4}$.

Proof. 1. For the OU process (1), we have $\bar X_t := \exp(-t)\,\bar X_0 + \sqrt{1 - \exp(-2t)}\,Z$, where $Z \sim \mathrm{normal}(0, I_d)$ is independent of $\bar X_0$. Hence, for $t \lesssim 1$,
$$W_2^2(q, q_t) \le \mathbb E\big[\big\|\{1 - \exp(-t)\}\,\bar X_0 + \sqrt{1 - \exp(-2t)}\,Z\big\|^2\big] = \{1 - \exp(-t)\}^2\,\mathbb E[\|\bar X_0\|^2] + \{1 - \exp(-2t)\}\,d \lesssim R^2 t^2 + dt.$$
We now take $t \lesssim \min\{\varepsilon/R, \varepsilon^2/d\}$ to ensure that $W_2^2(q, q_t) \le \varepsilon^2$. Since $\varepsilon \ll \sqrt d$, it suffices to take $t \asymp \varepsilon^2/(\sqrt d\,(R \vee \sqrt d))$.
2. For this, we use the short-time regularization result in Otto & Villani (2001, Corollary 2), which implies that
$$\mathrm{KL}(q_t \,\|\, \gamma^d) \le \frac{W_2^2(q, \gamma^d)}{4t} \lesssim \frac{W_2^2(q, \delta_0) + W_2^2(\gamma^d, \delta_0)}{t} \lesssim \frac{\sqrt d\,(R \vee \sqrt d)^3}{\varepsilon^2}.$$
3. Using Mikulincer & Shenfeld (2022, Lemma 4), along the OU process,
$$\frac1{1 - \exp(-2t)}\,I_d - \frac{\exp(-2t)\,R^2}{(1 - \exp(-2t))^2}\,I_d \preceq -\nabla^2 \ln q_t(x) \preceq \frac1{1 - \exp(-2t)}\,I_d.$$
With our choice of $t$, it implies
$$\|\nabla^2 \ln q_{t'}\|_{\mathrm{op}} \lesssim \frac1{1 - \exp(-2t')} \vee \frac{\exp(-2t')\,R^2}{(1 - \exp(-2t'))^2} \lesssim \frac1t \vee \frac{R^2}{t^2} \lesssim \frac{d R^2\,(R \vee \sqrt d)^2}{\varepsilon^4}.$$
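A one-dimensional illustration of part 3 (our own, with absolute constants omitted): for the two-point mixture $q = \frac12(\delta_{-R} + \delta_R)$, the smoothness of $\ln q_t$ genuinely blows up like $R^2/t^2$ as $t \to 0$, while remaining below the $1/t \vee R^2/t^2$ scaling of the lemma. The maximal curvature is attained at the midpoint, where $\partial_x^2 \ln q_t(0) = m^2/\sigma^2{}^2 - 1/\sigma^2$ with $m = e^{-t}R$ and $\sigma^2 = 1 - e^{-2t}$.

```python
import numpy as np

def log_qt(x, t, R):
    """log-density (up to an additive constant) of the OU flow at time t,
    started from the two-point mixture (delta_{-R} + delta_R)/2."""
    m, s2 = np.exp(-t) * R, 1.0 - np.exp(-2 * t)
    return np.logaddexp(-(x - m) ** 2 / (2 * s2), -(x + m) ** 2 / (2 * s2))

def max_hessian(t, R):
    """Max of |d^2/dx^2 log q_t| on a grid, via central finite differences."""
    x = np.linspace(-2 * R, 2 * R, 20001)
    dx = x[1] - x[0]
    f = log_qt(x, t, R)
    hess = (f[2:] - 2 * f[1:-1] + f[:-2]) / dx ** 2
    return np.abs(hess).max()

R, t = 4.0, 0.05
L_obs = max_hessian(t, R)
L_scaling = 1.0 / t + R ** 2 / t ** 2   # the 1/t v R^2/t^2 scaling of Lemma 21
print(L_obs, L_scaling)                 # L_obs is far above 1/t, below the scaling
```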



Footnotes.
1. For ease of notation, we do not distinguish between the forward and the reverse Brownian motions.
2. For many distributions of interest, e.g., the standard Gaussian distribution or product measures, we in fact have $m_2 = O(\sqrt d)$. Also, for applications to images in which each coordinate (i.e., pixel) lies in a bounded range $[-1, 1]$, we have $m_2 \le \sqrt d$.
3. We still have $X_0 \sim q_T$ under $P$ because the marginal at time $t = 0$ of $P$ is equal to the marginal at time $t = 0$ of $Q^\leftarrow_T$; this is a consequence of the fact that $\mathcal E(L)$ is a (true) $Q^\leftarrow_T$-martingale.
4. Such a coupling always exists; see Le Gall (2016, Corollary 8.5).



Next, we recall that
$$d\bar X_t = \bar V_t\,dt, \qquad d\bar V_t = -(\bar X_t + 2\bar V_t)\,dt + 2\,dB_t.$$
Define $\bar Y_t := \bar X_t + \bar V_t$. Then, $d\bar Y_t = -\bar Y_t\,dt + 2\,dB_t$, which admits the explicit solution
$$\bar Y_t = \exp(-t)\,\bar Y_0 + 2 \int_0^t \exp\{-(t-s)\}\,dB_s.$$
Also, $d\bar X_t = -\bar X_t\,dt + \bar Y_t\,dt$, which admits the solution
$$\bar X_t = \exp(-t)\,\bar X_0 + \int_0^t \exp\{-(t-s)\}\,\bar Y_s\,ds.$$
Hence,
$$\|\bar X_t\| + \|\bar V_t\| \le 2\,\|\bar X_t\| + \|\bar Y_t\| \lesssim \|\bar X_0\| + \|\bar V_0\| + \sup_{t\in[0,T]} \exp(-t)\,\|\tilde B_{(\exp(2t)-1)/2}\|,$$

which is an exponential moment that is finite for $h \lesssim 1/\sqrt T$, where we again use Chewi et al. (2021b, Lemma 23) and split across the coordinates. The Cauchy-Schwarz inequality then implies
$$\mathbb E \exp\Big(CTh^2 \sup_{s\in[0,T]} (\|\bar X_s\|^2 + \|\bar V_s\|^2)\Big) < \infty.$$
For the second term, by independence of the increments,
$$\mathbb E \exp\Big(Ch \sum_{k=0}^{N-1} \sup_{s\in[T-(k+1)h,\,T-kh]} \|B_{T-kh} - B_s\|^2\Big) = \prod_{k=0}^{N-1} \mathbb E \exp\Big(Ch \sup_{s\in[T-(k+1)h,\,T-kh]} \|B_{T-kh} - B_s\|^2\Big) = \Big[\mathbb E \exp\Big(Ch \sup_{s\in[0,h]} \|B_s\|^2\Big)\Big]^N.$$

(Here, $|||\cdot|||$ denotes a twisted norm which is equivalent, up to universal constants, to the Euclidean norm $\|\cdot\|$; it implies the following result, see, e.g., Cheng et al. (2018, Lemma 8).)


Published as a conference paper at ICLR 2023

C.3 PROOF OF THEOREM 7

Proof of Theorem 7. By the data-processing inequality,
$$\mathrm{TV}(p_T, \boldsymbol q_0) \le \mathrm{TV}(P_T, P^{q_T}_T) + \mathrm{TV}(P^{q_T}_T, Q^\leftarrow_T) \le \mathrm{TV}(\boldsymbol q_T, \gamma^{2d}) + \mathrm{TV}(P^{q_T}_T, Q^\leftarrow_T).$$
In Ma et al. (2021), following the entropic hypocoercivity approach of Villani (2009), the authors consider a Lyapunov functional $\mathcal L$ which is equivalent to the sum of the KL divergence and a Fisher information term, and which decays exponentially fast in time: there exists a universal constant $c > 0$ such that $\mathcal L(\boldsymbol q_t) \lesssim \exp(-ct)\,\mathcal L(\boldsymbol q_0)$ for all $t \ge 0$. This controls the first term above. By Pinsker's inequality and Theorem 16 for the second term, we deduce the claimed bound, which completes the proof.

C.4 AUXILIARY LEMMAS

We start with a perturbation lemma for the score function (Lemma 17); its statement generalizes Lee et al. (2022a, Lemma C.12) to a pushforward by a matrix $M_0$ followed by convolution with $\mathrm{normal}(0, M_1)$, as used in the proof of Theorem 16.

Proof of Lemma 17. The proof follows along the lines of Lee et al. (2022a, Lemma C.12). First, we treat the case $M_0 = I$. Let $S$ denote the subspace $S := \operatorname{range} M_1$. Since $M_1^{-1}$ is well-defined on $S$, we may introduce the conditional measure $q_\theta$ on $\theta + S$, and since $L \le \frac1{2\,\|M_1\|_{\mathrm{op}}}$, we may write $q_\theta(\theta') \propto \exp(-H_\theta(\theta'))$. Let $\theta_\star \in \arg\min H_\theta$ denote a mode. Of the two resulting terms, the first is controlled by Dalalyan et al. (2019, Proposition 2), and the second uses that the mode satisfies $\nabla H_\theta(\theta_\star) = 0$. After combining the bounds, we obtain the claimed estimate (20). Next, we consider the case of general $M_0$: we apply (20) with $(M_0)_\# q$ in place of $q$, and combining the bounds, the lemma follows.

The moment and movement bounds for the CLD (Lemmas 18 and 19) are stated and proved in Section C.2 above.

