SAMPLING IS AS EASY AS LEARNING THE SCORE: THEORY FOR DIFFUSION MODELS WITH MINIMAL DATA ASSUMPTIONS

Abstract

We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL·E 2. Our main result is that, assuming accurate score estimates, such SGMs can efficiently sample from essentially any realistic data distribution. In contrast to prior works, our results (1) hold for an L^2-accurate score estimate (rather than L^∞-accurate); (2) do not require restrictive functional inequality conditions that preclude substantial non-log-concavity; (3) scale polynomially in all relevant problem parameters; and (4) match state-of-the-art complexity guarantees for discretization of the Langevin diffusion, provided that the score error is sufficiently small. We view this as strong theoretical justification for the empirical success of SGMs. We also examine SGMs based on the critically damped Langevin diffusion (CLD). Contrary to conventional wisdom, we provide evidence that the use of the CLD does not reduce the complexity of SGMs.

1. INTRODUCTION

Score-based generative models (SGMs) are a family of generative models which achieve state-of-the-art performance for generating audio and image data (Sohl-Dickstein et al., 2015; Ho et al., 2020; Dhariwal & Nichol, 2021; Kingma et al., 2021; Song et al., 2021a;b; Vahdat et al., 2021); see, e.g., the recent surveys (Cao et al., 2022; Croitoru et al., 2022; Yang et al., 2022). For example, denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a key component in large-scale generative models such as DALL·E 2 (Ramesh et al., 2022). As the importance of SGMs continues to grow due to newfound applications in commercial domains, it is a pressing question of both practical and theoretical concern to understand the mathematical underpinnings which explain their startling empirical successes.

As we explain in Section 2, at their mathematical core, SGMs consist of two stochastic processes, the forward process and the reverse process. The forward process transforms samples from a data distribution q (e.g., images) into noise, whereas the reverse process transforms noise into samples from q, hence performing generative modeling. Running the reverse process requires estimating the score function of the law of the forward process; this is typically done by training neural networks on a score matching objective (Hyvärinen, 2005; Vincent, 2011; Song & Ermon, 2019).

Providing precise guarantees for estimation of the score function is difficult, as it requires an understanding of the non-convex training dynamics of neural network optimization that is currently out of reach. However, given the empirical success of neural networks on the score estimation task, a natural and important question is whether accurate score estimation implies that SGMs provably converge to the true data distribution in realistic settings.
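To make the forward process and the score matching objective concrete, the following self-contained sketch simulates an Ornstein-Uhlenbeck forward process for a toy Gaussian data distribution (for which the time-t score is known in closed form) and evaluates the denoising score matching objective, verifying that the exact score achieves a smaller loss than a perturbed one. The specific distribution, time, perturbation, and sample size are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 200_000
mu, sigma = np.array([3.0, -1.0]), 0.5   # toy data distribution q = N(mu, sigma^2 I)

def forward_sample(x0, t):
    # OU forward process: X_t = e^{-t} X_0 + sqrt(1 - e^{-2t}) Z, Z ~ N(0, I).
    z = rng.standard_normal(x0.shape)
    return np.exp(-t) * x0 + np.sqrt(1 - np.exp(-2 * t)) * z, z

def true_score(x, t):
    # Score of the marginal q_t = N(e^{-t} mu, var_t I), closed-form for this toy q.
    var_t = sigma**2 * np.exp(-2 * t) + 1 - np.exp(-2 * t)
    return -(x - np.exp(-t) * mu) / var_t

def dsm_loss(score_fn, t):
    # Empirical denoising score matching objective: an L2 regression of the
    # candidate score onto the conditional score of X_t given X_0.
    x0 = mu + sigma * rng.standard_normal((n, d))
    xt, z = forward_sample(x0, t)
    target = -z / np.sqrt(1 - np.exp(-2 * t))
    return np.mean(np.sum((score_fn(xt, t) - target) ** 2, axis=1))

t = 0.5
loss_true = dsm_loss(true_score, t)
loss_off = dsm_loss(lambda x, s: true_score(x, s) + 0.3, t)  # perturbed score
# The exact score minimizes the DSM objective, so loss_true < loss_off.
```

In practice the candidate `score_fn` is a neural network trained on this objective; here the closed-form Gaussian score stands in for it so the comparison is exact.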
This is a surprisingly delicate question: as we explain in Section 2.1, even with accurate score estimates, there are several other sources of error which could cause the SGM to fail to converge. Indeed, despite a flurry of recent work (Block et al., 2020; De Bortoli et al., 2021; De Bortoli, 2022; Lee et al., 2022a; Pidstrigach, 2022; Liu et al., 2022), prior analyses fall short of answering this question, for (at least) one of three main reasons:

1. Super-polynomial convergence. The bounds obtained are not quantitative (e.g., De Bortoli et al., 2021; Pidstrigach, 2022), or scale exponentially in the dimension and other problem parameters such as time and smoothness (Block et al., 2020; De Bortoli, 2022; Liu et al., 2022), and hence are typically vacuous for the high-dimensional settings of interest in practice.

2. Strong assumptions on the data distribution. The bounds require strong assumptions on the true data distribution, such as a log-Sobolev inequality (LSI) (see, e.g., Lee et al., 2022a). While the LSI is slightly weaker than strong log-concavity, it ultimately precludes the presence of substantial non-convexity, which impedes the application of these results to complex and highly multi-modal real-world data distributions. Indeed, obtaining a polynomial-time convergence analysis for SGMs that holds for multi-modal distributions was posed as an open question in (Lee et al., 2022a).

3. Strong assumptions on the score estimation error. The bounds require that the score estimate is L^∞-accurate. This is problematic because the score matching objective is an L^2 loss (see Section A.1 in the supplement), and there are empirical studies suggesting that in practice, the score estimate is not in fact L^∞-accurate (e.g., Zhang & Chen, 2022). Intuitively, this is because we cannot expect the score estimate to be accurate in regions of space where the true density is very low, simply because we do not expect to see many (or indeed, any) samples from there.

Providing an analysis which goes beyond these limitations is a pressing first step towards theoretically understanding why SGMs actually work in practice.
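The gap between L^2-accuracy and L^∞-accuracy is easy to see numerically: a hypothetical score estimate that is exact except far in the tails of q_t has a small L^2 error under q_t, but a large worst-case error. The one-dimensional standard Gaussian, the tail threshold, and the size of the tail mistake below are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)      # samples from q_t = N(0, 1)

def true_score(v):
    return -v                           # score of the standard Gaussian

def est_score(v):
    # Hypothetical estimate: exact where the density is non-negligible,
    # badly wrong in the low-density tail |v| >= 4 (an illustrative choice).
    return np.where(np.abs(v) < 4, -v, -v + 10.0)

# The L2 error is measured under q_t itself, so tail mistakes barely register.
l2_err = np.sqrt(np.mean((est_score(x) - true_score(x)) ** 2))
# The worst-case (L-infinity) error over a grid is large.
grid = np.linspace(-8.0, 8.0, 10_001)
linf_err = np.max(np.abs(est_score(grid) - true_score(grid)))
```

Here `l2_err` is on the order of 0.1 while `linf_err` is 10: an analysis requiring L^∞-accuracy would reject this estimate, even though it is accurate wherever samples actually land.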

1.1. OUR CONTRIBUTIONS

In this work, we take a step towards bridging theory and practice by providing a convergence guarantee for SGMs, under realistic (in fact, quite minimal) assumptions, which scales polynomially in all relevant problem parameters. Namely, our main result (Theorem 2) only requires the following assumptions on the data distribution q, which we make more quantitative in Section 3:

A1 The score function of the forward process is L-Lipschitz.

A2 The (2 + η)-th moment of q is finite, where η > 0 is an arbitrarily small constant.

A3 The data distribution q has finite KL divergence w.r.t. the standard Gaussian.

We note that all of these assumptions are either standard or, in the case of A2, far weaker than what is needed in prior work. Crucially, unlike prior works, we do not assume log-concavity, an LSI, or dissipativity; hence, our assumptions cover arbitrarily non-log-concave data distributions. Our main result is summarized informally as follows.

Theorem 1 (informal, see Theorem 2). Under assumptions A1-A3, and if the score estimation error in L^2 is at most O(ε), then with an appropriate choice of step size, the SGM outputs a measure which is ε-close in total variation (TV) distance to q in O(L^2 d/ε^2) iterations.

Our iteration complexity is quite tight: it matches state-of-the-art discretization guarantees for the Langevin diffusion (Vempala & Wibisono, 2019; Chewi et al., 2021a). We find Theorem 1 surprising, because it shows that SGMs can sample from the data distribution q with polynomial complexity, even when q is highly non-log-concave (a task that is usually intractable), provided that one has access to an accurate score estimator. This answers the open question of (Lee et al., 2022a) regarding whether or not SGMs can sample from multimodal distributions, e.g., mixtures of distributions with bounded log-Sobolev constant. In the context of neural networks, our result implies that so long as the neural network succeeds at the score estimation task, the remaining part of the SGM algorithm based on the diffusion model is completely principled, in that it admits a strong theoretical justification.
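As a sanity check on the informal statement, the sketch below discretizes the reverse SDE with an Euler-Maruyama scheme, using the exact score of a toy Gaussian target (so the score estimation error is zero), and recovers the target's mean and standard deviation. The target distribution, horizon T, and step size h are illustrative choices, not the quantities from Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
mu, sigma = np.array([3.0, -1.0]), 0.5   # toy target q = N(mu, sigma^2 I)

def score(x, t):
    # Exact score of the OU forward marginal q_t = N(e^{-t} mu, var_t I),
    # var_t = sigma^2 e^{-2t} + 1 - e^{-2t}; known in closed form for this toy q.
    var_t = sigma**2 * np.exp(-2 * t) + 1 - np.exp(-2 * t)
    return -(x - np.exp(-t) * mu) / var_t

def reverse_sampler(n, T=5.0, h=0.01):
    # Euler-Maruyama discretization of the reverse SDE
    #   dY_s = (Y_s + 2 * score(Y_s, T - s)) ds + sqrt(2) dB_s,
    # initialized at N(0, I), which is close to q_T for large T.
    y = rng.standard_normal((n, d))
    t = T
    while t > h / 2:
        y = y + h * (y + 2 * score(y, t)) + np.sqrt(2 * h) * rng.standard_normal((n, d))
        t -= h
    return y

samples = reverse_sampler(50_000)   # empirical mean/std should approach (mu, sigma)
```

With an inexact score, the L^2 score error and the discretization error of exactly this kind of scheme are the two quantities the theorem trades off against the step size.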




