IMPROVED DENOISING DIFFUSION PROBABILISTIC MODELS

Abstract

We explore denoising diffusion probabilistic models, a class of generative models which have recently been shown to produce excellent samples in the image and audio domains. While these models produce excellent samples, it has yet to be shown that they can achieve competitive log-likelihoods. We show that, with several small modifications, diffusion models can achieve competitive log-likelihoods in the image domain while maintaining high sample quality. Additionally, our models allow for sampling with an order of magnitude fewer diffusion steps with only a modest difference in sample quality. Finally, we explore how sample quality and log-likelihood scale with the number of diffusion steps and the amount of model capacity. We conclude that denoising diffusion probabilistic models are a promising class of generative models with excellent scaling properties and sample quality.

1. INTRODUCTION

Sohl-Dickstein et al. (2015) introduced diffusion probabilistic models ("diffusion models" for brevity), a class of generative models which match a data distribution by learning to reverse a gradual, multi-step noising process. More recently, Ho et al. (2020) showed an equivalence between these models and score-based generative models (Song & Ermon, 2019; 2020), which learn a gradient of the log-density of the data distribution using denoising score matching (Hyvärinen, 2005). It has recently been shown that this class of models can produce high-quality images (Ho et al., 2020; Song & Ermon, 2020; Jolicoeur-Martineau et al., 2020) and audio (Chen et al., 2020b; Kong et al., 2020), but it has yet to be shown that diffusion models can achieve competitive log-likelihoods. Furthermore, while Ho et al. (2020) showed extremely good results on the CIFAR-10 (Krizhevsky, 2009) and LSUN (Yu et al., 2015) datasets, it is unclear how well diffusion models scale to datasets with higher diversity such as ImageNet. Finally, while Chen et al. (2020b) found that diffusion models can efficiently generate audio using a small number of sampling steps, it has yet to be shown that the same is true for images.

In this paper, we show that diffusion models can achieve competitive log-likelihoods while maintaining good sample quality, even on high-diversity datasets like ImageNet. Additionally, we show that our improved models can produce competitive samples an order of magnitude faster than those from Ho et al. (2020). We achieve these results by combining a simple reparameterization of the reverse process variance, a hybrid learning objective that combines the variational lower bound with the simplified objective from Ho et al. (2020), and a novel noise schedule which allows the model to better leverage the entire diffusion process.
Surprisingly, we find that with our hybrid objective, our models obtain better log-likelihoods than those obtained by optimizing the log-likelihood directly, and we discover that the latter objective has much more gradient noise during training. We show that a simple importance sampling technique reduces this noise and allows us to achieve better log-likelihoods than with the hybrid objective. Using our trained models, we study how sample quality and log-likelihood change as we adjust the number of diffusion steps used at sampling time. We demonstrate that our improved models allow us to use an order of magnitude fewer steps at test time with only a modest change in sample quality and log-likelihood, thus speeding up sampling for practical applications. Finally, we evaluate the performance of these models as we increase model size, and observe trends that suggest predictable improvements in performance as we increase training compute.

2. DENOISING DIFFUSION PROBABILISTIC MODELS

We briefly review the formulation of diffusion models from Ho et al. (2020). This formulation makes various simplifying assumptions, such as a fixed noising process q which adds diagonal Gaussian noise at each timestep. For a more general derivation, see Sohl-Dickstein et al. (2015).

2.1. DEFINITIONS

Given a data distribution x_0 ∼ q(x_0), we define a forward noising process q which produces latents x_1 through x_T by adding Gaussian noise at time t with variance β_t ∈ (0, 1) as follows:

    q(x_1, ..., x_T | x_0) := ∏_{t=1}^{T} q(x_t | x_{t-1})    (1)

    q(x_t | x_{t-1}) := N(x_t; √(1 - β_t) x_{t-1}, β_t I)    (2)

Given sufficiently large T and a well-behaved schedule of β_t, the latent x_T is nearly an isotropic Gaussian distribution. Thus, if we knew the exact reverse distribution q(x_{t-1} | x_t), we could sample x_T ∼ N(0, I) and run the process in reverse to get a sample from q(x_0). However, since q(x_{t-1} | x_t) depends on the entire data distribution, we approximate it using a neural network:

    p_θ(x_{t-1} | x_t) := N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))    (3)

The combination of q and p_θ is a variational auto-encoder (Kingma & Welling, 2013), and we can write the variational lower bound (VLB) as follows:

    L_vlb := L_0 + L_1 + ... + L_{T-1} + L_T    (4)

    L_0 := -log p_θ(x_0 | x_1)    (5)

    L_{t-1} := D_KL(q(x_{t-1} | x_t, x_0) || p_θ(x_{t-1} | x_t))    (6)

    L_T := D_KL(q(x_T | x_0) || p(x_T))    (7)

Aside from L_0, each term of Equation 4 is a KL divergence between two Gaussian distributions, and can thus be evaluated in closed form. To evaluate L_0 for images, we assume that each color component is divided into 256 bins, and we compute the probability of p_θ(x_0 | x_1) landing in the correct bin (which is tractable using the CDF of the Gaussian distribution). Also note that while L_T does not depend on θ, it will be close to zero if the forward noising process adequately destroys the data distribution so that q(x_T | x_0) ≈ N(0, I).

It is useful to define and derive several other quantities relevant to the forward noising process, so we repeat them here from Ho et al. (2020):

    α_t := 1 - β_t    (8)

    ᾱ_t := ∏_{s=0}^{t} α_s    (9)

    β̃_t := ((1 - ᾱ_{t-1}) / (1 - ᾱ_t)) β_t    (10)

    μ̃_t(x_t, x_0) := (√ᾱ_{t-1} β_t / (1 - ᾱ_t)) x_0 + (√α_t (1 - ᾱ_{t-1}) / (1 - ᾱ_t)) x_t    (11)

    q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 - ᾱ_t) I)    (12)

    q(x_{t-1} | x_t, x_0) = N(x_{t-1}; μ̃_t(x_t, x_0), β̃_t I)    (13)
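As a concrete illustration, the quantities in Equations 8-12 follow mechanically from a choice of β_t schedule. The NumPy sketch below uses the linear schedule of Ho et al. (2020) as one possible choice; the function names and hyperparameter values are illustrative, not the paper's implementation.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Derive alpha_t, alpha-bar_t, and beta-tilde_t (Eqs. 8-10) from a
    linear beta_t schedule; T and the endpoints are illustrative."""
    betas = np.linspace(beta_start, beta_end, T)          # beta_t
    alphas = 1.0 - betas                                  # Eq. (8)
    alphas_bar = np.cumprod(alphas)                       # Eq. (9)
    alphas_bar_prev = np.append(1.0, alphas_bar[:-1])     # alpha-bar_{t-1}
    betas_tilde = (1.0 - alphas_bar_prev) / (1.0 - alphas_bar) * betas  # Eq. (10)
    return betas, alphas_bar, alphas_bar_prev, betas_tilde

def q_sample(x0, t, alphas_bar, rng):
    """Jump straight to x_t ~ q(x_t | x_0) via Eq. (12), without
    simulating the intermediate noising steps."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise
```

Note that ᾱ_T being close to zero is exactly the condition under which L_T is negligible, since q(x_T | x_0) then approaches N(0, I).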

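To make the L_0 term concrete: it requires the probability that the Gaussian p_θ(x_0 | x_1) lands in the correct one of 256 pixel bins. The sketch below assumes images scaled to [-1, 1], so bins have width 2/255 and the edge bins absorb all mass beyond [-1, 1]; the helper name and edge handling are our own illustrative choices, not necessarily the authors' exact implementation.

```python
import numpy as np
from math import erf

def _phi(z):
    """Standard normal CDF, vectorized via math.erf."""
    return np.vectorize(lambda v: 0.5 * (1.0 + erf(v / 2.0 ** 0.5)))(z)

def discretized_gaussian_loglik(x0, mean, std, bin_half_width=1.0 / 255.0):
    """Log-probability that N(mean, std^2) lands in the 2/255-wide bin
    centered at x0, for pixel values scaled to [-1, 1] (used for L_0)."""
    upper = _phi((x0 + bin_half_width - mean) / std)
    lower = _phi((x0 - bin_half_width - mean) / std)
    upper = np.where(x0 >= 1.0 - 1e-3, 1.0, upper)   # top edge bin -> +inf
    lower = np.where(x0 <= -1.0 + 1e-3, 0.0, lower)  # bottom edge bin -> -inf
    return np.log(np.clip(upper - lower, 1e-12, 1.0))
```

Because the bins partition the real line, the 256 per-bin probabilities for any fixed mean and standard deviation sum to one.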
2.2. TRAINING IN PRACTICE

Equation 12 provides an efficient way to jump directly to an arbitrary step of the forward noising process. This makes it possible to randomly sample t during training; Ho et al. (2020) uniformly sample t for each image in each mini-batch.
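A minimal training step under this scheme might look as follows. Here `model` stands in for any noise-prediction network, the flattened (batch, dims) layout is an illustrative simplification, and the mean-squared-error loss shown is the simplified objective of Ho et al. (2020) rather than the full VLB.

```python
import numpy as np

def training_loss(model, x0_batch, alphas_bar, rng):
    """One training step's loss: draw t uniformly per image, jump to x_t
    in a single step via Eq. (12), and regress the model's noise
    prediction onto the true noise (the simplified objective of
    Ho et al., 2020). x0_batch has shape (batch, dims); `model` is any
    callable model(x_t, t) -> predicted noise of the same shape."""
    T = len(alphas_bar)
    B = x0_batch.shape[0]
    t = rng.integers(0, T, size=B)              # t ~ Uniform over timesteps
    noise = rng.standard_normal(x0_batch.shape)
    a = np.sqrt(alphas_bar[t])[:, None]         # per-image coefficients
    b = np.sqrt(1.0 - alphas_bar[t])[:, None]
    x_t = a * x0_batch + b * noise              # Eq. (12) in one jump
    return np.mean((model(x_t, t) - noise) ** 2)
```

Sampling t independently per image, rather than per mini-batch, reduces the variance of the stochastic gradient since each batch covers many different noise levels.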

