SAMPLING IS AS EASY AS LEARNING THE SCORE: THEORY FOR DIFFUSION MODELS WITH MINIMAL DATA ASSUMPTIONS

Abstract

We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL·E 2. Our main result is that, assuming accurate score estimates, such SGMs can efficiently sample from essentially any realistic data distribution. In contrast to prior works, our results (1) hold for an L²-accurate score estimate (rather than L∞-accurate); (2) do not require restrictive functional inequality conditions that preclude substantial non-log-concavity; (3) scale polynomially in all relevant problem parameters; and (4) match state-of-the-art complexity guarantees for discretization of the Langevin diffusion, provided that the score error is sufficiently small. We view this as strong theoretical justification for the empirical success of SGMs. We also examine SGMs based on the critically damped Langevin diffusion (CLD). Contrary to conventional wisdom, we provide evidence that the use of the CLD does not reduce the complexity of SGMs.

1. INTRODUCTION

Score-based generative models (SGMs) are a family of generative models which achieve state-of-the-art performance for generating audio and image data (Sohl-Dickstein et al., 2015; Ho et al., 2020; Dhariwal & Nichol, 2021; Kingma et al., 2021; Song et al., 2021a;b; Vahdat et al., 2021); see, e.g., the recent surveys (Cao et al., 2022; Croitoru et al., 2022; Yang et al., 2022). For example, denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a key component in large-scale generative models such as DALL·E 2 (Ramesh et al., 2022). As the importance of SGMs continues to grow due to newfound applications in commercial domains, it is a pressing question of both practical and theoretical concern to understand the mathematical underpinnings which explain their startling empirical successes.

As we explain in Section 2, at their mathematical core, SGMs consist of two stochastic processes, the forward process and the reverse process. The forward process transforms samples from a data distribution q (e.g., images) into noise, whereas the reverse process transforms noise into samples from q, hence performing generative modeling. Running the reverse process requires estimating the score function of the law of the forward process; this is typically done by training neural networks on a score matching objective (Hyvärinen, 2005; Vincent, 2011; Song & Ermon, 2019).

Providing precise guarantees for estimation of the score function is difficult, as it requires an understanding of the non-convex training dynamics of neural network optimization that is currently out of reach. However, given the empirical success of neural networks on the score estimation task,
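To make the forward/reverse-process picture concrete, the following is a minimal NumPy sketch, not the paper's construction: it takes an Ornstein-Uhlenbeck (OU) forward process, the denoising score matching objective of Vincent (2011), and an Euler-Maruyama discretization of the reverse process. The toy data distribution q = N(0, 1) (for which the true score is available in closed form) and all function names are illustrative choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward process: the OU diffusion dX_t = -X_t dt + sqrt(2) dB_t, whose
# Gaussian transition kernel can be sampled in closed form:
#   X_t = exp(-t) X_0 + sqrt(1 - exp(-2t)) Z,  Z ~ N(0, I),
# so the law of X_t interpolates between the data q and pure noise N(0, I).
def forward_sample(x0, t):
    z = rng.standard_normal(x0.shape)
    return np.exp(-t) * x0 + np.sqrt(1.0 - np.exp(-2.0 * t)) * z, z

# Denoising score matching: the conditional score of X_t given X_0 is
# -Z / sqrt(1 - exp(-2t)), and regressing a model s(x, t) onto this target
# equals, up to an additive constant, the L2 error against the true score
# of the law of the forward process.
def dsm_loss(score_fn, x0, t):
    xt, z = forward_sample(x0, t)
    target = -z / np.sqrt(1.0 - np.exp(-2.0 * t))
    return np.mean((score_fn(xt, t) - target) ** 2)

# Reverse process: the time reversal of the OU diffusion has drift
# x + 2 * score(x, t); discretizing it gives a generative sampler that
# turns N(0, 1) noise back into (approximate) samples from q.
def reverse_sample(score_fn, n, T=1.0, h=0.01):
    x = rng.standard_normal(n)  # initialize from the stationary noise
    t = T
    while t > 0:
        x = x + h * (x + 2.0 * score_fn(x, t)) \
              + np.sqrt(2.0 * h) * rng.standard_normal(n)
        t -= h
    return x
```

For q = N(0, 1) the true score at every time is s(x, t) = -x; plugging it into `dsm_loss` gives a smaller loss than any mismatched score estimate, and `reverse_sample` with this score produces samples with the correct mean and variance. In practice `score_fn` is a trained neural network, and the paper's question is how its L² estimation error propagates to the quality of the generated samples.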

