TRUNCATED DIFFUSION PROBABILISTIC MODELS AND DIFFUSION-BASED ADVERSARIAL AUTO-ENCODERS

Abstract

Employing a forward diffusion chain to gradually map the data to a noise distribution, diffusion-based generative models learn how to generate the data by inferring a reverse diffusion chain. However, this approach is slow and costly because it needs many forward and reverse steps. We propose a faster and cheaper approach that injects noise not until the data become pure random noise, but until they reach a hidden noisy-data distribution that we can confidently learn. We then generate data with fewer reverse steps, starting from an implicit distribution trained to match this hidden noisy-data distribution. We reveal that the proposed model can be cast as an adversarial auto-encoder empowered by both the diffusion process and a learnable implicit prior. Experimental results show that even with a significantly smaller number of reverse diffusion steps, the proposed truncated diffusion probabilistic models can provide consistent improvements over the non-truncated ones in both unconditional and text-guided image generation.

1. INTRODUCTION

Generating photo-realistic images with probabilistic models is a challenging and important task in machine learning and computer vision, with many potential applications in data augmentation, image editing, style transfer, etc. Recently, a new class of image generative models based on diffusion processes (Sohl-Dickstein et al., 2015) has achieved remarkable results on various commonly used image generation benchmarks (Song & Ermon, 2019; Ho et al., 2020; Song & Ermon, 2020; Song et al., 2021b; Dhariwal & Nichol, 2021), surpassing many existing deep generative models, such as autoregressive models (van den Oord et al., 2016), variational auto-encoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014; van den Oord et al., 2017; Razavi et al., 2019), and generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015; Arjovsky et al., 2017; Miyato et al., 2018; Brock et al., 2019; Karras et al., 2019; 2020b). This modeling class, which includes both score-based and diffusion-based generative models, uses noise injection to gradually corrupt the data distribution into a simple noise distribution that can be easily sampled from, and then uses a denoising network to reverse the noise injection and generate photo-realistic images. From the perspective of score matching (Hyvärinen & Dayan, 2005; Vincent, 2011) and Langevin dynamics (Neal, 2011; Welling & Teh, 2011), the denoising network is trained by matching the score function, i.e., the gradient of the log-density, of the corrupted data distribution to that of the generator distribution at different noise levels (Song & Ermon, 2019). This training objective can also be formulated under diffusion-based generative models (Sohl-Dickstein et al., 2015; Ho et al., 2020). These two types of models were further unified by Song et al. (2021b) under the framework of discretized stochastic differential equations.
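To make the denoising objective concrete, the following is a minimal NumPy sketch of the closed-form forward process q(x_t | x_0) and the simplified noise-prediction loss of Ho et al. (2020). The denoiser `eps_theta` is a toy placeholder standing in for a neural network (in practice a U-Net conditioned on t), and the linear noise schedule is one common choice, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                                   # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)         # linear noise schedule (one common choice)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)             # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(ab_t) x0 + sqrt(1-ab_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def eps_theta(x_t, t):
    """Toy 'denoiser' placeholder; a real model would be a learned network."""
    return np.zeros_like(x_t)

def ddpm_loss(x0):
    """Simplified objective: E_{t, eps} || eps - eps_theta(x_t, t) ||^2."""
    t = rng.integers(0, T)                 # uniform random timestep
    eps = rng.standard_normal(x0.shape)    # the noise the network must predict
    x_t = q_sample(x0, t, eps)
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

x0 = rng.standard_normal((16, 8))          # a toy batch standing in for data
loss = ddpm_loss(x0)
```

The closed form for q(x_t | x_0) is what makes training practical: x_t at any timestep can be sampled in one step, so each minibatch can train the denoiser at a random noise level without simulating the whole chain.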
Despite their impressive performance, diffusion-based (or score-based) generative models suffer from high computational costs in both training and sampling. This is because they need to perform a large number of diffusion steps, typically hundreds or thousands, to ensure that the noise injected at each step is small enough for the Gaussian assumption on both the diffusion and denoising processes to hold in the limit of a small diffusion rate (Feller, 1949; Sohl-Dickstein et al., 2015). In other words, when the number of diffusion steps is small or the diffusion rate is large, the Gaussian assumption may not hold well, and the model may fail to capture the true score function of the data. Previous works have therefore tried to reduce the number of diffusion steps by using non-Markovian reverse processes (Song et al., 2020; Kong & Ping, 2021), adaptive noise scheduling (San-Roman et al., 2021; Kingma et al., 2021), knowledge distillation (Luhman & Luhman, 2021; Salimans & Ho, 2022), and diffusion in a lower-dimensional latent space (Rombach et al., 2022), but they still cannot achieve a significant speedup without sacrificing generation quality.

In this paper, we propose a novel way to shorten the diffusion trajectory by learning an implicit distribution from which to start the reverse diffusion process, instead of relying on a tractable noise distribution. We call our method truncated diffusion probabilistic modeling (TDPM), which is based on the idea of truncating the forward diffusion chain of an existing diffusion model, such as the denoising diffusion probabilistic model (DDPM) of Ho et al. (2020). To significantly accelerate diffusion-based text-to-image generation, we also introduce the truncated latent diffusion model (TLDM), which truncates the diffusion chain of the latent diffusion model (LDM) of Rombach et al. (2022). We note that LDM is the latent text-to-image diffusion model behind Stable Diffusion, an open-source project that provides state-of-the-art performance in generating photo-realistic images from text input. By truncating the chain, we can reduce the number of diffusion steps to an arbitrary level, but we also lose the tractability of the distribution at the end of the chain. Therefore, we learn an implicit generative distribution that approximates this distribution and provides the initial samples for the reverse diffusion process.

We show that this implicit generative distribution can be implemented in different ways, such as using a separate generator network or reusing the denoising network. The former offers more flexibility and can improve the generation quality, while the latter adds no parameters and can achieve comparable results. We reveal that DDPM relates to a VAE in the same way that TDPM relates to an adversarial auto-encoder (AAE; Makhzani et al., 2015). Specifically, DDPM resembles a VAE whose fixed encoder and learnable decoder are realized by a diffusion process, with a predefined prior; TDPM resembles an AAE whose fixed encoder and learnable decoder are realized by a truncated diffusion process, with a learnable implicit prior. Our truncation method has several advantages when applied to DDPM for unconditional image generation or to LDM for text-guided image generation. First, it can generate samples much faster by using fewer diffusion steps, without sacrificing, and sometimes even enhancing, the generation quality. Second, it exploits the cooperation between the implicit model and the diffusion model: the diffusion model helps train the implicit model by providing noisy data samples, and the implicit model helps the reverse diffusion by providing better initial samples. Third, it can adapt the truncation level to balance generation quality against efficiency, depending on the data complexity and the available computational resources. For generating images with text guidance, our method speeds up sampling substantially, making it suitable for real-time processing: in the time LDM takes to generate one photo-realistic image, our TLDM can generate more than 50 such images.

Figure 1: (Best viewed in color) An illustrative depiction of diffusion models and our truncated diffusion models. Top: Conventional denoising diffusion models add Gaussian noise gradually over a large number of time steps, so the true posterior stays close to Gaussian and hence is easy to fit with a denoising (score-matching) loss (marked in the solid blue box). Bottom: Truncated diffusion models truncate the diffusion chain to keep only its first few steps, a small diffusion segment (marked in the dashed blue box). This truncated diffusion chain can still be learned with previous denoising methods. Meanwhile, as the left part is truncated, the Gaussian prior p(x_T) will have a large gap to the distribution q(x_T | x_0) at the truncation point, which is bridged with an implicit generative distribution p_ψ(x_T) = ∫ p_ψ(x_T | z) p(z) dz (marked in the dashed red box).
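To make the truncation idea concrete, here is a schematic NumPy sketch of the two objectives a truncated model of this kind combines: a denoising loss restricted to the kept segment t ≤ T_trunc, and an adversarial loss pushing the implicit prior toward the noisy-data distribution at the truncation point. The functions `eps_theta`, `g_psi`, and `d_phi` are illustrative stand-ins, not the paper's actual architectures, and the losses shown are a generic GAN formulation assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

T_trunc = 100                              # truncated chain length (vs. ~1000 in DDPM)
betas = np.linspace(1e-4, 0.02, 1000)[:T_trunc]
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Closed-form forward sample x_t ~ q(x_t | x_0)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def eps_theta(x_t, t):
    """Toy denoiser placeholder (would be a learned network)."""
    return np.zeros_like(x_t)

def g_psi(z):
    """Toy implicit prior: maps latent z to a sample of p_psi(x_T)."""
    return z

def d_phi(x):
    """Toy discriminator: per-example score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x.mean(axis=-1)))

def tdpm_losses(x0):
    # (1) Denoising loss, now only over the kept segment t <= T_trunc.
    t = rng.integers(0, T_trunc)
    eps = rng.standard_normal(x0.shape)
    denoise = np.mean((eps - eps_theta(q_sample(x0, t, eps), t)) ** 2)
    # (2) Adversarial loss pushing p_psi(x_T) toward q(x_{T_trunc} | x_0):
    # "real" samples come from the truncated forward chain, "fake" from g_psi.
    real = q_sample(x0, T_trunc - 1, rng.standard_normal(x0.shape))
    fake = g_psi(rng.standard_normal(x0.shape))
    gan = -np.mean(np.log(d_phi(real) + 1e-8) + np.log(1.0 - d_phi(fake) + 1e-8))
    return denoise, gan

denoise, gan = tdpm_losses(rng.standard_normal((16, 8)))
```

Note that `alpha_bar[-1]` at the truncation point remains far from zero, i.e., x_{T_trunc} still retains substantial signal from x_0; this is precisely the gap between q(x_{T_trunc} | x_0) and a standard Gaussian prior that the implicit distribution p_ψ(x_T) is trained to bridge.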

