PYRAMIDAL DENOISING DIFFUSION PROBABILISTIC MODELS

Abstract

Recently, diffusion models have demonstrated impressive image generation performance and have been extensively studied for various computer vision tasks. Unfortunately, training and evaluating diffusion models consume a lot of time and computational resources. To address this problem and allow training with even a single GPU, here we present a novel pyramidal diffusion model that can generate high-resolution images starting from much coarser-resolution images, using a single score function trained with a positional embedding. This makes the neural network much lighter and enables time-efficient image generation without compromising performance. Furthermore, we show that the proposed approach can also be used efficiently for the multi-scale super-resolution problem with a single score function.

1. INTRODUCTION

Diffusion models produce high-quality images via reverse diffusion processes and have achieved impressive performance in many computer vision tasks. Score-based generative models (Song et al., 2021b) produce images by solving a stochastic differential equation using a score function estimated by a neural network. Denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020; Sohl-Dickstein et al., 2015) can be considered a discrete form of score-based generative models. Thanks to their state-of-the-art image generation performance, these diffusion models have been widely investigated for various applications. For example, Rombach et al. (2021) trained a diffusion model on the latent space of a convolutional neural network (CNN)-based generative model, which enabled a variety of tasks. DiffusionCLIP (Kim & Ye, 2021) leveraged the contrastive language-image pretraining (CLIP) loss (Radford et al., 2021) and the denoising diffusion implicit model (DDIM) (Song et al., 2021a) for text-driven style transfer. ILVR (Choi et al., 2021) proposed conditional diffusion models using unconditionally trained score functions, and CCDF (Chung et al., 2021) developed generalized frameworks and acceleration techniques for them. Also, recently proposed models (Nichol et al., 2021; Ramesh et al., 2022) have achieved incredible performance on text-conditioned image generation and editing. In spite of the amazing performance and flexible extensions, slow training and generation speed remains a critical drawback. To resolve this problem, various approaches have been investigated, including cascaded methods (Saharia et al., 2021; Ho et al., 2022a) that refine low-resolution images to high resolution using cascaded applications of multiple diffusion models. However, in contrast to Saharia et al. (2021); Ho et al. (2022a), our model does not need to train multiple models and can be implemented with a much lighter single architecture, which results in speed enhancement in both training and inference without compromising the generation quality.
Specifically, in contrast to existing diffusion models that adopt an encoder-decoder architecture with input and output of the same dimension, here we propose a new conditional training method for the score function that uses positional information, which gives flexibility in the sampling process of the reverse diffusion. Our pyramidal DDPM can generate images at multiple resolutions using a single score function by utilizing positional information as a condition for training and inference. Fig. 1 shows images generated at three different resolutions using only one model in the reverse diffusion process, which clearly demonstrates the flexibility of our method. Furthermore, as a byproduct, we also demonstrate multi-scale super-resolution using a single diffusion model. The contributions of this work can be summarized as follows:

• We propose a novel method of conditionally training a diffusion model for multi-scale image generation by exploiting the positional embedding. In contrast to existing diffusion models, in which the latent dimension and the output dimension are the same, in our method the output dimension can be arbitrarily large compared to the latent input dimension.

• Using a single score network, we mitigate the high computational cost and slow speed of the reverse diffusion process through coarse-to-fine refinement while preserving the generation quality. The key element for this is again the positional encoding as a condition for the diffusion model.

• We present multi-scale super-resolution, which recursively refines the image resolution using a single score model.
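To make the coarse-to-fine idea above concrete, the following toy sketch (our own illustration, not the authors' code; `positional_grid`, `coarse_to_fine_sample`, the two-channel coordinate embedding, and the simplified update rule are all hypothetical stand-ins) shows how a single position-conditioned score function could be reused across resolutions:

```python
import numpy as np

def positional_grid(h, w):
    """Normalized (y, x) coordinate grid in [-1, 1], shared across scales so the
    same score network sees consistent positional conditioning at every resolution.
    (Illustrative only; the paper's actual positional embedding may differ.)"""
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([gy, gx], axis=0)  # shape (2, h, w)

def coarse_to_fine_sample(score_fn, scales=(16, 32, 64), steps=10, rng=None):
    """Generate at the coarsest scale, then upsample and refine at each finer
    scale with the *same* score_fn, conditioned on the positional grid."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal((1, scales[0], scales[0]))
    for i, s in enumerate(scales):
        if i > 0:  # nearest-neighbor upsample of the previous scale's result
            x = x.repeat(s // x.shape[1], axis=1).repeat(s // x.shape[2], axis=2)
        pos = positional_grid(s, s)
        for t in reversed(range(steps)):  # toy stand-in for reverse diffusion
            x = x + 0.1 * score_fn(x, t, pos)
    return x
```

Because the positional grid is normalized rather than pixel-indexed, the score function receives the same conditioning signal regardless of the working resolution, which is what lets one network serve every scale.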

2.1. DENOISING DIFFUSION PROBABILISTIC MODELS

In DDPMs (Ho et al., 2020; Sohl-Dickstein et al., 2015), for a given data distribution x_0 ∼ q(x_0), we define a forward diffusion process q(x_t | x_{t-1}) as a Markov chain that gradually adds Gaussian noise according to a variance schedule β_1, …, β_T:

q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) x_{t-1}, β_t I),

which admits the closed-form marginal q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I), where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s.

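As a concrete illustration (ours, not from the paper), the standard DDPM forward process can be sampled directly at any timestep in closed form, which is what makes training by random-timestep noising efficient:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I) in closed
    form, with abar_t = prod_{s<=t} (1 - beta_s). Standard DDPM forward process."""
    rng = np.random.default_rng(rng)
    alphas = 1.0 - betas
    abar_t = np.prod(alphas[: t + 1])
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * noise

betas = np.linspace(1e-4, 0.02, 1000)  # linear schedule from Ho et al. (2020)
x0 = np.ones((4, 4))
xT = forward_diffuse(x0, 999, betas, rng=0)  # near-isotropic Gaussian at the last step
```

With this schedule, ᾱ_T is nearly zero, so x_T is almost pure noise, which is the starting point of the reverse process.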
Rombach et al. (2021); Vahdat et al. (2021) trained a diffusion model in a low-dimensional representational space provided by pre-trained autoencoders. DDIM (Song et al., 2021a) proposed deterministic forward and reverse sampling schemes to accelerate the generation speed. Song & Ermon (2020) proposed a parameterization of the covariance term to achieve better performance and faster sampling speed. Jolicoeur-Martineau et al. (2021) used an adaptive step size without any tuning. PNDM (Liu et al., 2022) devised a pseudo numerical method by slightly changing classical numerical methods (Sauer, 2011) for speed enhancement. Salimans & Ho (2022) reduced the sampling time by progressively halving the diffusion steps without losing sample quality. Denoising diffusion GANs (Xiao et al., 2021) enabled large denoising steps by parameterizing the diffusion process with multimodal conditional GANs. For conditional diffusion, a short forward diffusion of the corrupted input can reduce the number of reverse diffusion steps, as in SDEdit (Meng et al., 2021) and RePaint (Lugmayr et al., 2022), whose theoretical justification was established in CCDF (Chung et al., 2021) using stochastic contraction theory.
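Of the acceleration schemes surveyed above, the DDIM update is compact enough to sketch. The following is a generic illustration of the deterministic (η = 0) step, written by us rather than taken from any of the cited works: the predicted noise is first inverted to an estimate of x_0, which is then re-noised to the previous timestep's marginal.

```python
import numpy as np

def ddim_step(xt, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0): recover x0_hat from the
    predicted noise eps_pred, then map it to timestep t-1's marginal."""
    x0_hat = (xt - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps_pred
```

Because the update is deterministic, large jumps between timesteps become possible, which is the source of DDIM's sampling speedup.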

Figure 1: Progressive image generation from noise using the proposed method trained on the FFHQ (Choi et al., 2020) dataset. Images at three different resolutions are generated from noise through reverse diffusion processes using a single model. The red boxes highlight the preservation of semantic information across the different resolutions.

