PYRAMIDAL DENOISING DIFFUSION PROBABILISTIC MODELS

Abstract

Recently, diffusion models have demonstrated impressive image generation performance and have been extensively studied in various computer vision tasks. Unfortunately, training and evaluating diffusion models consume a lot of time and computational resources. To address this problem and allow training with even a single GPU, here we present a novel pyramidal diffusion model that can generate high-resolution images starting from much coarser-resolution images using a single score function trained with a positional embedding. This makes the neural network much lighter and enables time-efficient image generation without compromising performance. Furthermore, we show that the proposed approach can also be efficiently used for the multi-scale super-resolution problem using a single score function.

1. INTRODUCTION

Diffusion models produce high-quality images via reverse diffusion processes and have achieved impressive performance in many computer vision tasks. Score-based generative models (Song et al., 2021b) produce images by solving a stochastic differential equation using a score function estimated by a neural network. Denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020; Sohl-Dickstein et al., 2015) can be considered a discrete form of score-based generative models. Thanks to their state-of-the-art image generation performance, these diffusion models have been widely investigated for various applications.
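To make the discrete formulation concrete, the DDPM forward (noising) process admits a closed form q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I). Below is a minimal sketch of this forward process, assuming the linear beta schedule of Ho et al. (2020); all function names here are illustrative, not from the paper.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule beta_1..beta_T and cumulative
    # products alpha_bar_t = prod_{s<=t} (1 - beta_s).
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    # Closed-form forward diffusion:
    # q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.standard_normal((8, 8))          # toy "image"
xT = q_sample(x0, 999, alpha_bars, rng)   # near-pure Gaussian noise
```

Since alpha_bar_T is close to zero, x_T is essentially pure Gaussian noise, which is why sampling can start from N(0, I) and run the learned reverse process.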
For example, Rombach et al. (2021) trained a diffusion model on the latent space of a convolutional neural network (CNN)-based generative model, which enabled a variety of tasks. Diffusion-CLIP (Kim & Ye, 2021) leveraged the contrastive language-image pretraining (CLIP) loss (Radford et al., 2021) and the denoising diffusion implicit model (DDIM) (Song et al., 2021a) for text-driven style transfer. ILVR (Choi et al., 2021) proposed conditional diffusion models using unconditionally trained score functions, and CCDF (Chung et al., 2021) developed generalized frameworks and acceleration techniques for them. Recently proposed models (Nichol et al., 2021; Ramesh et al., 2022) have also achieved remarkable performance on text-conditioned image generation and editing.

In spite of this impressive performance and flexibility, slow training and generation speed remains a critical drawback. To resolve this problem, various approaches have been investigated. Rombach et al. (2021) and Vahdat et al. (2021) trained diffusion models in a low-dimensional representational space provided by pre-trained autoencoders. DDIM (Song et al., 2021a) proposed deterministic forward and reverse sampling schemes to accelerate generation. Song & Ermon (2020) proposed a parameterization of the covariance term to achieve better performance and faster sampling. Jolicoeur-Martineau et al. (2021) used an adaptive step size without any tuning. PNDM (Liu et al., 2022) devised a pseudo-numerical method by slightly modifying classical numerical methods (Sauer, 2011) for faster sampling. Salimans & Ho (2022) reduced the sampling time by progressively halving the number of diffusion steps without losing sample quality. Denoising diffusion GANs (Xiao et al., 2021) enabled large denoising steps by parameterizing the diffusion process with multimodal conditional GANs.
For conditional diffusion, a short forward diffusion of the corrupted input can reduce the number of reverse diffusion steps, as in SDEdit (Meng et al., 2021) and RePaint (Lugmayr et al., 2022); a theoretical justification for this was provided in CCDF (Chung et al., 2021) using stochastic contraction theory.
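The shortened-sampling idea above can be sketched as follows: instead of starting the reverse diffusion at t = T from pure noise, the corrupted input is forward-diffused only to an intermediate step t0 < T, and the reverse process runs from t0 down to 0. This is a minimal illustration assuming the standard ancestral DDPM update; `eps_model` is a hypothetical stand-in for a trained noise-prediction network, not part of any of the cited methods' code.

```python
import numpy as np

def shortened_sampling(y, t0, betas, alpha_bars, eps_model, rng):
    # Forward-diffuse the input y to intermediate step t0 (closed form).
    x = (np.sqrt(alpha_bars[t0]) * y
         + np.sqrt(1.0 - alpha_bars[t0]) * rng.standard_normal(y.shape))
    # Ancestral DDPM reverse updates from t0 down to 0 -- only t0 + 1
    # steps instead of the full T.
    for t in range(t0, -1, -1):
        alpha_t = 1.0 - betas[t]
        eps = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_t)
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Toy usage with a dummy (zero) noise predictor, for illustration only.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
y = rng.standard_normal((8, 8))
x = shortened_sampling(y, 300, betas, alpha_bars,
                       eps_model=lambda x, t: np.zeros_like(x), rng=rng)
```

With t0 = 300 instead of T = 1000, the reverse process here runs roughly 70% fewer steps, which is the source of the speedup these conditional methods exploit.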

