DENOISING DIFFUSION IMPLICIT MODELS

Abstract

Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps in order to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a particular Markovian diffusion process. We generalize DDPMs via a class of non-Markovian diffusion processes that lead to the same training objective. These non-Markovian processes can correspond to generative processes that are deterministic, giving rise to implicit models that produce high quality samples much faster. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, perform semantically meaningful image interpolation directly in the latent space, and reconstruct observations with very low error.

1. INTRODUCTION

Deep generative models have demonstrated the ability to produce high quality samples in many domains (Karras et al., 2020; van den Oord et al., 2016a). In terms of image generation, generative adversarial networks (GANs, Goodfellow et al. (2014)) currently exhibit higher sample quality than likelihood-based methods such as variational autoencoders (Kingma & Welling, 2013), autoregressive models (van den Oord et al., 2016b) and normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2016). However, GANs require very specific choices in optimization and architectures in order to stabilize training (Arjovsky et al., 2017; Gulrajani et al., 2017; Karras et al., 2018; Brock et al., 2018), and could fail to cover modes of the data distribution (Zhao et al., 2018). Recent works on iterative generative models (Bengio et al., 2014), such as denoising diffusion probabilistic models (DDPM, Ho et al. (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019)) have demonstrated the ability to produce samples comparable to those of GANs, without having to perform adversarial training. To achieve this, many denoising autoencoding models are trained to denoise samples corrupted by various levels of Gaussian noise. Samples are then produced by a Markov chain which, starting from white noise, progressively denoises it into an image. This generative Markov chain process is either based on Langevin dynamics (Song & Ermon, 2019) or obtained by reversing a forward diffusion process that progressively turns an image into noise (Sohl-Dickstein et al., 2015). A critical drawback of these models is that they require many iterations to produce a high quality sample.
For DDPMs, this is because the generative process (from noise to data) approximates the reverse of the forward diffusion process (from data to noise), which could have thousands of steps; iterating over all the steps is required to produce a single sample. This is much slower than GANs, which need only one pass through a network. For example, it takes around 20 hours to sample 50k images of size 32 × 32 from a DDPM, but less than a minute to do so from a GAN on an Nvidia 2080 Ti GPU. This becomes more problematic for larger images, as sampling 50k images of size 256 × 256 could take nearly 1000 hours on the same GPU. To close this efficiency gap between DDPMs and GANs, we present denoising diffusion implicit models (DDIMs). DDIMs are implicit probabilistic models (Mohamed & Lakshminarayanan, 2016) and are closely related to DDPMs, in the sense that they are trained with the same objective function. In Section 3, we generalize the forward diffusion process used by DDPMs, which is Markovian, to non-Markovian ones, for which we are still able to design suitable reverse generative Markov chains. We show that the resulting variational training objectives have a shared surrogate objective, which is exactly the objective used to train DDPMs. Therefore, we can freely choose from a large family of generative models using the same neural network simply by choosing a different, non-Markovian diffusion process (Section 4.1) and the corresponding reverse generative Markov chain. In particular, we are able to use non-Markovian diffusion processes which lead to "short" generative Markov chains (Section 4.2) that can be simulated in a small number of steps. This can massively increase sample efficiency at only a minor cost in sample quality. In Section 5, we demonstrate several empirical benefits of DDIMs over DDPMs. First, DDIMs have superior sample generation quality compared to DDPMs when we accelerate sampling by 10× to 100× using our proposed method.
Second, DDIM samples have the following "consistency" property, which does not hold for DDPMs: if we start with the same initial latent variable and generate several samples with Markov chains of various lengths, these samples would have similar high-level features. Third, because of "consistency" in DDIMs, we can perform semantically meaningful image interpolation by manipulating the initial latent variable in DDIMs, unlike DDPMs, which interpolate near the image space due to the stochastic generative process.
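To make the cost of DDPM sampling concrete: because the generative process approximates the reverse chain, sampling is sequential, and each of the T steps requires one forward pass through the network. The following NumPy sketch of the ancestral sampling loop uses a hypothetical `mean_and_scale` callable as a stand-in for the learned Gaussian reverse transition; it is an illustration of the iteration structure, not the paper's implementation.

```python
import numpy as np

def sample_ddpm(mean_and_scale, shape, T, rng):
    """Ancestral sampling sketch: start from white noise x_T and apply the
    learned reverse transition T times. Each step calls the network once,
    which is why large T (e.g. 1000) makes generation slow."""
    x = rng.standard_normal(shape)            # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        mu, sigma = mean_and_scale(x, t)      # parameters of the Gaussian reverse transition
        noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mu + sigma * noise                # draw x_{t-1}
    return x                                  # final sample x_0
```

With `mean_and_scale` backed by a trained denoising network, this loop incurs exactly the many-iteration cost discussed above; the DDIM construction shortens it by taking fewer steps through a non-Markovian process.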

2. BACKGROUND

Given samples from a data distribution $q(x_0)$, we are interested in learning a model distribution $p_\theta(x_0)$ that approximates $q(x_0)$ and is easy to sample from. Denoising diffusion probabilistic models (DDPMs, Sohl-Dickstein et al. (2015); Ho et al. (2020)) are latent variable models of the form

$$p_\theta(x_0) = \int p_\theta(x_{0:T}) \,\mathrm{d}x_{1:T}, \qquad \text{where} \quad p_\theta(x_{0:T}) := p_\theta(x_T) \prod_{t=1}^{T} p_\theta^{(t)}(x_{t-1} \mid x_t),$$

where $x_1, \ldots, x_T$ are latent variables in the same sample space as $x_0$ (denoted as $\mathcal{X}$). The parameters $\theta$ are learned to fit the data distribution $q(x_0)$ by maximizing a variational lower bound:

$$\max_\theta \mathbb{E}_{q(x_0)}[\log p_\theta(x_0)] \geq \max_\theta \mathbb{E}_{q(x_0, x_1, \ldots, x_T)}[\log p_\theta(x_{0:T}) - \log q(x_{1:T} \mid x_0)],$$

where $q(x_{1:T} \mid x_0)$ is some inference distribution over the latent variables. Unlike typical latent variable models (such as the variational autoencoder (Rezende et al., 2014)), DDPMs are learned with a fixed (rather than trainable) inference procedure $q(x_{1:T} \mid x_0)$, and the latent variables are relatively high dimensional. For example, Ho et al. (2020) considered the following Markov chain with Gaussian transitions parameterized by a decreasing sequence $\alpha_{1:T} \in (0, 1]^T$:

$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad \text{where} \quad q(x_t \mid x_{t-1}) := \mathcal{N}\!\left(\sqrt{\frac{\alpha_t}{\alpha_{t-1}}}\, x_{t-1}, \left(1 - \frac{\alpha_t}{\alpha_{t-1}}\right) I\right),$$

where the covariance matrix is ensured to have positive terms on its diagonal. This is called the forward process due to the autoregressive nature of the sampling procedure (from $x_0$ to $x_T$). We call the latent variable model $p_\theta(x_{0:T})$, which is a Markov chain that samples from $x_T$ to $x_0$, the generative process, since it approximates the intractable reverse process $q(x_{t-1} \mid x_t)$. Intuitively, the forward process progressively adds noise to the observation $x_0$, whereas the generative process progressively denoises a noisy observation (Figure 1, left). A special property of the forward process is that

$$q(x_t \mid x_0) := \int q(x_{1:t} \mid x_0) \,\mathrm{d}x_{1:(t-1)} = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I);$$
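This marginal means $x_t$ can be drawn from $x_0$ in a single step, without simulating the first $t$ transitions of the chain. The NumPy sketch below (the noise schedule and names are illustrative choices, not the paper's) samples from the closed form and checks numerically that the stepwise variance recursion telescopes to $1 - \alpha_t$:

```python
import numpy as np

# Illustrative decreasing schedule alpha_1..alpha_T in (0, 1]
# (a hypothetical choice for demonstration, not the one used in the paper).
T = 1000
alphas = np.cumprod(np.full(T, 0.999))

def q_sample(x0, t, alphas, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_t) * x0, (1 - alpha_t) * I)
    in one shot, without simulating the chain step by step."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps

# The per-step transitions compose to this marginal: the mean scale
# telescopes to sqrt(alpha_t), and the variance recursion
#   v_t = (alpha_t / alpha_{t-1}) * v_{t-1} + (1 - alpha_t / alpha_{t-1})
# closes to 1 - alpha_t (with alpha_0 = 1 and v_0 = 0).
v, a_prev = 0.0, 1.0
for a_t in alphas:
    ratio = a_t / a_prev
    v = ratio * v + (1.0 - ratio)
    a_prev = a_t
assert np.isclose(v, 1.0 - alphas[-1])
```

This one-shot access to $q(x_t \mid x_0)$ is what makes the denoising training objective cheap to estimate: a random timestep $t$ can be trained on directly, which is the property the non-Markovian processes of Section 3 are designed to preserve.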



Figure 1: Graphical models for diffusion (left) and non-Markovian (right) inference models.

