BLURRING DIFFUSION MODELS

Abstract

Recently, Rissanen et al. (2022) have presented a new type of diffusion process for generative modeling based on heat dissipation, or blurring, as an alternative to isotropic Gaussian diffusion. Here, we show that blurring can equivalently be defined through a Gaussian diffusion process with non-isotropic noise. In making this connection, we bridge the gap between inverse heat dissipation and denoising diffusion, and we shed light on the inductive bias that results from this modeling choice. Finally, we propose a generalized class of diffusion models that offers the best of both standard Gaussian denoising diffusion and inverse heat dissipation, which we call Blurring Diffusion Models.

1. INTRODUCTION

Diffusion models are becoming increasingly successful for image generation, audio synthesis, and video generation. A diffusion model defines a (stochastic) process that destroys a signal such as an image; in general, this process adds Gaussian noise to each dimension independently. However, data such as images clearly exhibit multi-scale properties, which such a diffusion process ignores.

Recently, the community has been exploring new destruction processes, referred to as deterministic or 'cold' diffusion (Rissanen et al., 2022; Bansal et al., 2022). In these works, the diffusion process is deterministic or close to deterministic. For example, Rissanen et al. (2022) propose a diffusion model that incorporates heat dissipation, which can be seen as a form of blurring. Blurring is a natural destruction process for images, because it retains low frequencies longer than higher frequencies.

However, there still exists a considerable gap between the visual quality of standard denoising diffusion models and these new deterministic diffusion models. This difference cannot be explained away by a limited computational budget: a standard diffusion model can be trained with relatively little compute (about one to four GPUs) to high visual quality on a task such as unconditional CIFAR10 generation (for an example denoising diffusion implementation, see https://github.com/w86763777/pytorch-ddpm). In contrast, the visual quality of deterministic diffusion models has been much worse so far. In addition, fundamental questions remain around the justification of deterministic diffusion models: does their specification offer any guarantees about being able to model the data distribution?

In this work, we aim to resolve the gap in quality between models using blurring and models using additive noise. We present Blurring Diffusion Models, which combine blurring (or heat dissipation) with additive Gaussian noise.
We show that this process can have Markov transitions and that the denoising process can be written with diagonal covariance in frequency space. As a result, we can use modern techniques from denoising diffusion. Our model generates samples with higher visual quality, as evidenced by better FID scores.
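The connection between blurring and frequency space can be made concrete: blurring an image amounts to scaling its DCT (cosine) coefficients by a frequency-dependent decay, so the destruction acts diagonally in that basis. The NumPy sketch below illustrates this; the squared-frequency decay schedule and time scale are illustrative assumptions, not the exact constants used in the paper.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: row k is the cosine of frequency k.
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0] *= 1.0 / np.sqrt(2.0)
    return C

def blur(x, tau):
    """Heat dissipation for time tau: diagonal decay in DCT space."""
    h, w = x.shape
    Ch, Cw = dct_matrix(h), dct_matrix(w)
    u = Ch @ x @ Cw.T                       # project onto cosine basis
    freqs = (np.pi * np.arange(h)[:, None] / h) ** 2 \
          + (np.pi * np.arange(w)[None, :] / w) ** 2
    u = u * np.exp(-freqs * tau)            # high frequencies decay fastest
    return Ch.T @ u @ Cw                    # back to pixel space

x = np.random.default_rng(0).normal(size=(8, 8))
assert np.allclose(blur(x, 0.0), x)         # tau = 0 leaves the image intact
assert np.isclose(blur(x, 3.0).mean(), x.mean())  # the DC component survives
```

Because the decay is exactly 1 for the zero-frequency component and strictly below 1 elsewhere, the image mean is preserved while fine detail is destroyed first.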

2. BACKGROUND

2.1 DIFFUSION MODELS

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) learn to generate data by denoising a pre-defined destruction process, which is named the diffusion process. Commonly, the diffusion process starts with a datapoint and gradually adds Gaussian noise to the datapoint. Before defining the generative process, this diffusion process needs to be defined. Following the definition of Kingma et al. (2021), the diffusion process can be written as:

q(z_t | x) = N(z_t | \alpha_t x, \sigma_t^2 I),    (1)

where x represents the data and z_t are the noisy latent variables. Since \alpha_t is monotonically decreasing and \sigma_t is monotonically increasing, the information from x in z_t is gradually destroyed as t increases. Assuming that the process defined by Equation 1 is Markov, it has transition distributions for z_t given z_s, where 0 ≤ s < t:

q(z_t | z_s) = N(z_t | \alpha_{t|s} z_s, \sigma_{t|s}^2 I),    (2)

where \alpha_{t|s} = \alpha_t / \alpha_s and \sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2. A convenient property is that the grid of timesteps can be defined arbitrarily and does not depend on the specific spacing of s and t. We let T = 1 denote the last diffusion step, at which q(z_T | x) ≈ N(z_T | 0, I), a standard normal distribution. Unless otherwise specified, a time step lies in the unit interval [0, 1].

The Denoising Process   Another important distribution is the true denoising distribution q(z_s | z_t, x) given a datapoint x. Using that q(z_s | z_t, x) ∝ q(z_t | z_s) q(z_s | x), one can derive that:

q(z_s | z_t, x) = N(z_s | \mu_{t \to s}, \sigma_{t \to s}^2 I),  where  \sigma_{t \to s}^2 = \left( \frac{1}{\sigma_s^2} + \frac{\alpha_{t|s}^2}{\sigma_{t|s}^2} \right)^{-1}  and  \mu_{t \to s} = \sigma_{t \to s}^2 \left( \frac{\alpha_{t|s}}{\sigma_{t|s}^2} z_t + \frac{\alpha_s}{\sigma_s^2} x \right).    (3)

To generate data, the true denoising process is approximated by a learned denoising process p(z_s | z_t), where the datapoint x is replaced by a prediction from a learned model. The model distribution is then given by p(z_s | z_t) = q(z_s | z_t, x = \hat{x}(z_t)), where \hat{x}(z_t) is a prediction provided by a neural network. As shown by Song et al. (2020), the true q(z_s | z_t) → q(z_s | z_t, x = E[x | z_t]) as s → t, which justifies this choice of model: if the generative model takes sufficiently small steps, and if \hat{x}(z_t) is sufficiently expressive, the model can learn the data distribution exactly.

Instead of directly predicting x, diffusion models can also model \hat{\epsilon}_t = f_\theta(z_t, t), where f_\theta is a neural network, so that \hat{x} = z_t / \alpha_t - (\sigma_t / \alpha_t) \hat{\epsilon}_t. This is inspired by the reparametrization used to sample from Equation 1, namely z_t = \alpha_t x + \sigma_t \epsilon_t with \epsilon_t ∼ N(0, I). This parametrization is called the epsilon parametrization and empirically leads to better sample quality than predicting x directly (Ho et al., 2020).
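The quantities above can be checked numerically. The sketch below assumes a simple cosine schedule \alpha_t = cos(\pi t / 2) (an illustrative choice, not prescribed by the text) and verifies the transition variance, the posterior formulas, and the epsilon reparametrization on a toy datapoint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative variance-preserving schedule: alpha_t = cos(pi t / 2),
# so that alpha_t^2 + sigma_t^2 = 1 for all t.
def alpha(t): return np.cos(0.5 * np.pi * t)
def sigma(t): return np.sqrt(1.0 - alpha(t) ** 2)

x = rng.normal(size=4)                  # a toy "datapoint"
t, s = 0.8, 0.6                         # 0 <= s < t <= 1

# Forward process, Eq. (1): z_t = alpha_t x + sigma_t eps.
eps = rng.normal(size=4)
z_t = alpha(t) * x + sigma(t) * eps

# Transition coefficients of q(z_t | z_s), Eq. (2).
a_ts = alpha(t) / alpha(s)
var_ts = sigma(t) ** 2 - a_ts ** 2 * sigma(s) ** 2
assert var_ts > 0                       # valid Gaussian transition

# True denoising posterior q(z_s | z_t, x), Eq. (3): diagonal Gaussian.
var_post = 1.0 / (1.0 / sigma(s) ** 2 + a_ts ** 2 / var_ts)
mu_post = var_post * (a_ts / var_ts * z_t + alpha(s) / sigma(s) ** 2 * x)

# Epsilon parametrization: with the exact noise, x is recovered from z_t.
x_hat = z_t / alpha(t) - (sigma(t) / alpha(t)) * eps
assert np.allclose(x_hat, x)
```

Note that the posterior variance var_post is strictly smaller than sigma(s)**2: conditioning on z_t always reduces uncertainty about z_s compared to q(z_s | x) alone.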



Figure 1: Comparison between (a) standard diffusion (Sohl-Dickstein et al., 2015; Ho et al., 2020), (b) heat dissipation (Rissanen et al., 2022), and (c) blurring diffusion.

