COLD DIFFUSION: INVERTING ARBITRARY IMAGE TRANSFORMS WITHOUT NOISE

Abstract

Standard diffusion models involve an image transform (adding Gaussian noise) and an image restoration operator that inverts this degradation. We observe that the generative behavior of diffusion models is not strongly dependent on the choice of image degradation, and in fact an entire family of generative models can be constructed by varying this choice. Even when using completely deterministic degradations (e.g., blur, masking, and more), the training and test-time update rules that underlie diffusion models can be easily generalized to create generative models. The success of these fully deterministic models calls into question the community's understanding of diffusion models, which relies on noise in either gradient Langevin dynamics or variational inference, and paves the way for generalized diffusion models that invert arbitrary processes.

Figure 1: Demonstration of the forward (Original → Degraded) and reverse (Degraded → Generated) processes for both hot and cold diffusions, shown for six transforms: Noise, Blur, Animorph, Mask, Pixelate, and Snow. While standard diffusions are built on Gaussian noise (top row), we show that generative models can be built on arbitrary and even noiseless/cold image transforms, including the ImageNet-C snowification operator and an animorphosis operator that adds a random animal image from AFHQ.

1. INTRODUCTION

Diffusion models have recently emerged as powerful tools for generative modeling (Ramesh et al., 2022). Diffusion models come in many flavors, but all are built around the concept of random noise removal; one trains an image restoration/denoising network that accepts an image contaminated with Gaussian noise and outputs a denoised image. At test time, the denoising network is used to convert pure Gaussian noise into a photo-realistic image via an update rule that alternates between applying the denoiser and adding Gaussian noise. When the right sequence of updates is applied, complex generative behavior is observed.

Both the origins of diffusion models and our theoretical understanding of them rest heavily on the role played by Gaussian noise during training and generation. Diffusion has been understood as a random walk around the image density function using Langevin dynamics (Sohl-Dickstein et al., 2015; Song & Ermon, 2019), which requires Gaussian noise in each step. The walk begins in a high-temperature (heavy noise) state and slowly anneals into a "cold" state with little, if any, noise. Another line of work derives the loss for the denoising network using variational inference with a Gaussian prior (Ho et al., 2020; Song et al., 2021a; Nichol & Dhariwal, 2021).

In this work, we examine the need for Gaussian noise, or any randomness at all, for diffusion models to work in practice. We consider generalized diffusion models that live outside the confines of the theoretical frameworks from which diffusion models arose. Rather than limit ourselves to models built around Gaussian noise, we consider models built around arbitrary image transformations such as blurring and downsampling. We train a restoration network to invert these deformations using a simple ℓp loss. When we apply a sequence of test-time updates that alternates between the image restoration model and the image degradation operation, generative behavior emerges, and we obtain photo-realistic images.

The existence of cold diffusions that require no Gaussian noise (or any randomness) during training or testing raises questions about the limits of our theoretical understanding of diffusion models. It also opens the door to new types of generative models with properties very different from those of conventional diffusion models seen so far.
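To make the generalized training and sampling procedure concrete, below is a minimal sketch in PyTorch. The interfaces are our own assumptions, not taken from the text: restore(x, t) is the restoration network trained to invert the degradation at severity t (here with an ℓ1 loss, one instance of the ℓp family mentioned above), and degrade(x, t) applies a deterministic transform (e.g., blur) at severity t, with degrade(x, 0) acting as the identity. The sampler shown is the naive alternating update described in the paragraph above; the exact update rule used in practice may differ.

import torch

def train_step(restore, degrade, x0, T, optimizer):
    """One training step: degrade a clean batch to a random severity,
    then train the restoration network to invert it (simple l1 loss)."""
    t = torch.randint(1, T + 1, (1,)).item()       # random severity level
    x_t = degrade(x0, t)                           # deterministic degradation
    loss = (restore(x_t, t) - x0).abs().mean()     # l_p loss with p = 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def cold_sample(restore, degrade, x_T, T):
    """Naive sampling: alternate restoration with re-degradation at a
    progressively milder severity until a clean image remains."""
    x = x_T                                        # fully degraded input
    for t in range(T, 0, -1):
        x0_hat = restore(x, t)                     # estimate the clean image
        x = degrade(x0_hat, t - 1)                 # re-degrade one step milder
    return x                                       # degrade(x, 0) is identity

Note that nothing in this loop is stochastic: given a deterministic degrade and a trained restore, the entire generation process is a fixed function of the initial degraded input x_T.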

2. BACKGROUND

Both the Langevin dynamics and variational inference interpretations of diffusion models rely on properties of the Gaussian noise used in the training and sampling pipelines. From the score-matching perspective on generative networks (Song & Ermon, 2019; Song et al., 2021b), noise in the training process is thought to be critical for expanding the support of the low-dimensional training distribution to a set of full measure in the ambient space. The noise is also thought to act as data augmentation that improves score predictions in low-density regions, allowing for mode mixing in stochastic gradient Langevin dynamics (SGLD) sampling. The gradient signal in low-density regions can be further improved during sampling by injecting large magnitudes of noise in the early steps of SGLD and gradually reducing this noise in later stages; a sketch of this annealed sampler appears after this section's discussion.

Kingma et al. (2021) propose a method to learn a noise schedule that leads to faster optimization. Using a classic statistical result, Kadkhodaie & Simoncelli (2021) show the connection between removing additive Gaussian noise and the gradient of the log of the noisy signal density in deterministic linear inverse problems. Here, we shed light on the role of noise in diffusion models through theoretical and empirical results in applications to inverse problems and image generation.

Iterative neural models have been used for various inverse problems (Romano et al., 2016; Metzler et al., 2017). Recently, diffusion models have been applied to them (Song et al., 2021b) for deblurring, denoising, super-resolution, and compressive sensing (Whang et al., 2021; Kawar et al., 2021; Saharia et al., 2021; Kadkhodaie & Simoncelli, 2021). Although not their focus, previous works on diffusion models have included experiments with deterministic image generation (Song et al., 2021a; Dhariwal & Nichol, 2021; Karras et al., 2022) and with selected inverse problems (Kawar et al., 2022). Recently, Rissanen et al. (2022) use a combination of Gaussian noise and blurring as a forward process for diffusion. Though they show the feasibility of a different degradation, here we show definitively that noise is not a necessity in diffusion models, and we observe the effects of removing noise for a number of inverse problems.

Despite prolific work on generative models in recent years, methods to probe the properties of learned distributions and to measure how closely they approximate the real training data are by no means closed fields of investigation. Indirect feature-space similarity metrics such as Inception Score (Salimans et al., 2016), Mode Score (Che et al., 2016), Fréchet inception distance (FID) (Heusel et al., 2017), and Kernel inception distance (KID) (Bińkowski et al., 2018) have been proposed and adopted to some extent, but they
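For contrast with the noiseless sampler sketched in Section 1, the following is a minimal sketch of the annealed SGLD sampling procedure that the score-matching interpretation relies on, in the style of Song & Ermon (2019). The function name, the step-size schedule constant, and the iteration count are illustrative assumptions rather than values taken from the text; score_fn(x, sigma) stands for a learned approximation of the score, grad_x log p_sigma(x).

import torch

@torch.no_grad()
def annealed_langevin_sample(score_fn, x, sigmas, eps=2e-5, steps_per_level=100):
    """Annealed SGLD: heavy noise in early steps, little noise in late ones.

    sigmas is a decreasing sequence of noise levels (the annealing
    schedule); every update injects fresh Gaussian noise, which is
    exactly the ingredient cold diffusion removes.
    """
    for sigma in sigmas:                            # anneal from hot to cold
        alpha = eps * (sigma / sigmas[-1]) ** 2     # per-level step size
        for _ in range(steps_per_level):
            z = torch.randn_like(x)                 # fresh Gaussian noise
            x = x + 0.5 * alpha * score_fn(x, sigma) + alpha ** 0.5 * z
    return x

The contrast with cold_sample above is the point: here randomness enters at every update, whereas the generalized diffusion studied in this paper produces generative behavior with no injected noise at all.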

