COLD DIFFUSION: INVERTING ARBITRARY IMAGE TRANSFORMS WITHOUT NOISE

Abstract

Standard diffusion models involve an image transform (adding Gaussian noise) and an image restoration operator that inverts this degradation. We observe that the generative behavior of diffusion models is not strongly dependent on the choice of image degradation, and in fact an entire family of generative models can be constructed by varying this choice. Even when using completely deterministic degradations (e.g., blur, masking, and more), the training and test-time update rules that underlie diffusion models can be easily generalized to create generative models. The success of these fully deterministic models calls into question the community's understanding of diffusion models, which relies on noise in either gradient Langevin dynamics or variational inference, and paves the way for generalized diffusion models that invert arbitrary processes.

1. INTRODUCTION

Diffusion models have recently emerged as powerful tools for generative modeling (Ramesh et al., 2022). Diffusion models come in many flavors, but all are built around the concept of random noise removal; one trains an image restoration/denoising network that accepts an image contaminated with Gaussian noise and outputs a denoised image. At test time, the denoising network is used to convert pure Gaussian noise into a photo-realistic image using an update rule that alternates between applying the denoiser and adding Gaussian noise. When the right sequence of updates is applied, complex generative behavior is observed.

The origins of diffusion models, and also our theoretical understanding of these models, are strongly based on the role played by Gaussian noise during training and generation. Diffusion has been understood as a random walk around the image density function using Langevin dynamics (Sohl-Dickstein et al., 2015; Song & Ermon, 2019), which requires Gaussian noise in each step. The walk begins in a high temperature (heavy noise) state and slowly anneals into a "cold" state with little if any noise. Another line of work derives the loss for the denoising network using variational inference with a Gaussian prior (Ho et al., 2020; Song et al., 2021a; Nichol & Dhariwal, 2021).

In this work, we examine the need for Gaussian noise, or any randomness at all, for diffusion models to work in practice. We consider generalized diffusion models that live outside the confines of the theoretical frameworks from which diffusion models arose. Rather than limit ourselves to models built around Gaussian noise, we consider models built around arbitrary image transformations like blurring, downsampling, etc. We train a restoration network to invert these deformations using a simple ℓ_p loss.
When we apply a sequence of updates at test time that alternate between the image restoration model and the image degradation operation, generative behavior emerges and we obtain photo-realistic images. The existence of cold diffusions that require no Gaussian noise (or any randomness) during training or testing raises questions about the limits of our theoretical understanding of diffusion models. It also unlocks the door for potentially new types of generative models with very different properties from conventional diffusion models seen so far.

2. BACKGROUND

Both the Langevin dynamics and variational inference interpretations of diffusion models rely on properties of the Gaussian noise used in the training and sampling pipelines. From the score-matching generative networks perspective (Song & Ermon, 2019; Song et al., 2021b), noise in the training process is thought to be critical for expanding the support of the low-dimensional training distribution to a set of full measure in ambient space. The noise is also thought to act as data augmentation that improves score predictions in low-density regions, allowing for mode mixing in stochastic gradient Langevin dynamics (SGLD) sampling. The gradient signal in low-density regions can be further improved during sampling by injecting large magnitudes of noise in the early steps of SGLD and gradually reducing this noise in later stages. Kingma et al. (2021) propose a method to learn a noise schedule that leads to faster optimization. Using a classic statistical result, Kadkhodaie & Simoncelli (2021) show the connection between removing additive Gaussian noise and the gradient of the log of the noisy signal density in deterministic linear inverse problems. Here, we shed light on the role of noise in diffusion models through theoretical and empirical results in applications to inverse problems and image generation.

Iterative neural models have been used for various inverse problems (Romano et al., 2016; Metzler et al., 2017). Recently, diffusion models have been applied to them (Song et al., 2021b) for the problems of deblurring, denoising, super-resolution, and compressive sensing (Whang et al., 2021; Kawar et al., 2021; Saharia et al., 2021; Kadkhodaie & Simoncelli, 2021). Although not their focus, previous works on diffusion models have included experiments with deterministic image generation (Song et al., 2021a; Dhariwal & Nichol, 2021; Karras et al., 2022) and selected inverse problems (Kawar et al., 2022). Recently, Rissanen et al. (2022) use a combination of Gaussian noise and blurring as a forward process for diffusion. Though they show the feasibility of a different degradation, here we show definitively that noise is not a necessity in diffusion models, and we observe the effects of removing noise for a number of inverse problems.

Despite prolific work on generative models in recent years, methods to probe the properties of learned distributions and measure how closely they approximate the real training data are by no means closed fields of investigation. Indirect feature-space similarity metrics such as Inception Score (Salimans et al., 2016), Mode Score (Che et al., 2016), Fréchet inception distance (FID) (Heusel et al., 2017), and Kernel inception distance (KID) (Bińkowski et al., 2018) have been proposed and adopted to some extent, but they have notable limitations (Barratt & Sharma, 2018). To adopt a popular frame of reference, we will use FID as the feature similarity metric for our experiments.

3. GENERALIZED DIFFUSION

Standard diffusion models are built around two components. First, there is an image degradation operator that contaminates images with Gaussian noise. Second, there is a trained restoration operator that performs denoising. The image generation process alternates between applications of these two operators. In this work, we consider the construction of generalized diffusions built around arbitrary degradation operations. These degradations can be randomized (as in the case of standard diffusion) or deterministic.

3.1. MODEL COMPONENTS AND TRAINING

Given an image x_0 ∈ R^N, consider the degradation of x_0 by operator D with severity t, denoted x_t = D(x_0, t). The output distribution D(x_0, t) of the degradation should vary continuously in t, and the operator should satisfy D(x_0, 0) = x_0. In the standard diffusion framework, D adds Gaussian noise with variance proportional to t. In our generalized formulation, we choose D to perform various other transformations such as blurring, masking out pixels, downsampling, and more, with severity that depends on t. We explore a range of choices for D in Section 4.

We also require a restoration operator R that (approximately) inverts D. This operator has the property that R(x_t, t) ≈ x_0. In practice, this operator is implemented via a neural network parameterized by θ. The restoration network is trained via the minimization problem

    min_θ E_{x∼X} ‖R_θ(D(x, t), t) - x‖,    (1)

where x denotes a random image sampled from distribution X and ‖·‖ denotes a norm, which we take to be ℓ_1 in our experiments. We have so far used the subscript R_θ to emphasize the dependence of R on θ during training, but we will omit this symbol for simplicity in the discussion below.
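As a concrete (and deliberately tiny) illustration of this objective, the sketch below trains a restoration model against a toy deterministic degradation with an ℓ_1 loss. The dimming operator D, the scalar-gain "restoration model", and all hyperparameters are illustrative inventions for this sketch, not the U-Net architecture or operators used in the paper:

```python
import numpy as np

T = 4                                    # number of severity levels
rng = np.random.default_rng(0)

def D(x, t):
    """Toy deterministic degradation: progressively dim the signal.
    A stand-in for blur/masking; note D(x, 0) == x, as required."""
    return (1.0 - t / (2 * T)) * x

# Hypothetical restoration model: one scalar gain per severity level,
# standing in for the neural network R_theta used in the paper.
theta = np.ones(T + 1)

lr = 0.05
for step in range(4000):
    x = rng.normal(size=8)               # a random "image"
    t = rng.integers(0, T + 1)
    x_t = D(x, t)
    resid = theta[t] * x_t - x           # R(D(x, t), t) - x
    # l1 loss: subgradient w.r.t. theta[t] is mean(sign(resid) * x_t)
    theta[t] -= lr * np.mean(np.sign(resid) * x_t)

# The learned gains approximately invert the dimming: theta[t] ~ 1/(1 - t/(2T))
print(theta)
```

Minimizing the ℓ_1 objective drives each gain toward the exact inverse of the corresponding dimming factor, mirroring how R_θ learns to invert D at every severity.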

3.2. SAMPLING

Algorithm 1 Naive Sampling
  Input: A degraded sample x_t
  for s = t, t - 1, . . . , 1 do
    x̂_0 ← R(x_s, s)
    x_{s-1} = D(x̂_0, s - 1)
  end for
  Return: x_0

Algorithm 2 Transformation Agnostic Cold Sampling (TACoS)
  Input: A degraded sample x_t
  for s = t, t - 1, . . . , 1 do
    x̂_0 ← R(x_s, s)
    x_{s-1} = x_s - D(x̂_0, s) + D(x̂_0, s - 1)
  end for
  Return: x_0

After choosing a degradation D and training a model R to perform the restoration, these operators can be used in tandem to invert severe degradations by using standard methods borrowed from the diffusion literature. For small degradations (t ≈ 0), a single application of R can be used to obtain a restored image in one shot. However, because R is typically trained using a simple convex loss, it yields blurry results when used with large t. Rather, diffusion models (Song et al., 2021a; Ho et al., 2020) perform generation by iteratively applying the denoising operator and then adding noise back to the image, with the level of added noise decreasing over time. This is the standard update sequence in Algorithm 1.

When the restoration operator is perfect, i.e., when R(D(x_0, t), t) = x_0 for all t, one can easily see that Algorithm 1 produces exact iterates of the form x_s = D(x_0, s). But what happens for imperfect restoration operators? In this case, errors can cause the iterates x_s to wander away from D(x_0, s), and inaccurate reconstruction may occur. We find that the standard sampling approach in Algorithm 1 (explained further in Appendix A.8) works well for noise-based diffusion, possibly because the restoration operator R has been trained to correct (random Gaussian) errors in its inputs. However, it yields poor results in the case of cold diffusions with smooth/differentiable degradations, as demonstrated for a deblurring model in Figure 2. We propose Transformation Agnostic Cold Sampling (TACoS) in Algorithm 2, which we find to be superior for inverting smooth, cold degradations.
This sampler has important mathematical properties that enable it to recover high quality results. Specifically, for a class of linear degradation operations, it can be shown to produce exact reconstruction (i.e., x_s = D(x_0, s)) even when the restoration operator R fails to perfectly invert D. We discuss this in the following section.
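The two samplers above can be sketched in a few lines, with R and D passed in as callables. The linear degradation and exact-inverse restoration used in the demo at the bottom are purely illustrative:

```python
import numpy as np

def naive_sample(x_t, t, R, D):
    """Algorithm 1: restore, then re-degrade to the next severity level."""
    x = x_t
    for s in range(t, 0, -1):
        x0_hat = R(x, s)
        x = D(x0_hat, s - 1)
    return x

def tacos_sample(x_t, t, R, D):
    """Algorithm 2 (TACoS): keep the current iterate and apply only the
    *difference* of degradations between severities s and s - 1."""
    x = x_t
    for s in range(t, 0, -1):
        x0_hat = R(x, s)
        x = x - D(x0_hat, s) + D(x0_hat, s - 1)
    return x

# Toy check with a linear degradation and its exact inverse (illustrative):
e = np.array([1.0, -2.0, 0.5])
D = lambda x, s: x + s * e
R = lambda x, s: x - s * e
x0 = np.array([0.3, 0.1, -0.7])
print(naive_sample(D(x0, 5), 5, R, D))
print(tacos_sample(D(x0, 5), 5, R, D))
```

When R is a perfect inverse of D, both samplers recover x_0; the two algorithms only diverge once R is imperfect, which is the case analyzed next.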

3.3. PROPERTIES OF TACOS

Figure 2: Comparison of sampling methods for unconditional generation using cold diffusion on the CelebA dataset. Iterations 2, 4, 8, 16, 32, 64, 128, 192, and 256 are presented. Top: Algorithm 1 produces compounding artifacts and fails to generate a new image. Bottom: TACoS succeeds in sampling a high quality image without noise.

It is clear from inspection that both Algorithms 1 and 2 perfectly reconstruct the iterates x_s = D(x_0, s) for all s < t if the restoration operator is a perfect inverse for the degradation operator. In this section, we analyze the stability of these algorithms to errors in the restoration operator. For small values of s, TACoS as described in Algorithm 2 is tolerant of error in the restoration operator R. To see why, consider a model problem with a linear degradation function of the form D(x, s) = x + s · e for a constant vector e. We chose this ansatz because the Taylor expansion of any smooth degradation D(x, s) around x = x_0, s = 0 has the form D(x, s) ≈ x + s · e(x) + HOT, where HOT denotes higher order terms. Note, however, that the analysis below requires e to be a constant that does not depend on x. The constant/zeroth-order term in this Taylor expansion is zero because we assumed above that the degradation operator satisfies D(x, 0) = x.

For this linear degradation D(x, s) and any restoration operator R, the update for x_{s-1} in TACoS becomes

    x_{s-1} = x_s - D(R(x_s, s), s) + D(R(x_s, s), s - 1)
            = D(x_0, s) - D(R(x_s, s), s) + D(R(x_s, s), s - 1)
            = x_0 + s · e - R(x_s, s) - s · e + R(x_s, s) + (s - 1) · e
            = x_0 + (s - 1) · e = D(x_0, s - 1).

By induction, we see that the algorithm produces the value x_s = D(x_0, s) for all s < t, regardless of the choice of R. In other words, for any choice of R, the iteration behaves the same as it would when R is a perfect inverse for the degradation D. By contrast, Algorithm 1 does not enjoy this behavior even for small values of s. In fact, when R is not a perfect inverse for D, x_0 is not a fixed point of the update rule in Algorithm 1, because x_0 ≠ D(R(x_0, 0), 0) = R(x_0, 0), and so errors compound. If R does not perfectly invert D, we should expect Algorithm 1 to incur errors, even for small values of s. Meanwhile, for small values of s, the behavior of D approaches its first-order Taylor expansion and Algorithm 2 becomes immune to errors in R. Figure 2 demonstrates the stability of TACoS (Algorithm 2) vs. Algorithm 1 for a deblurring model. Note that our analysis is not meant to be a complete convergence theory, but rather to highlight a desirable theoretical property of our method that a naive sampler lacks.
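The cancellation argued above can be checked numerically: with a linear degradation and a restoration operator that is deliberately wrong by a constant bias, the TACoS iterates still land exactly on x_0, while the naive iterates accumulate the bias once per step. (The vectors and the 0.3 bias are arbitrary choices for this demonstration.)

```python
import numpy as np

e = np.array([1.0, -2.0, 0.5])           # constant degradation direction
D = lambda x, s: x + s * e               # the linear model degradation
R = lambda x, s: x - s * e + 0.3         # deliberately *imperfect* restoration

x0 = np.array([0.3, 0.1, -0.7])

# TACoS (Algorithm 2): the restoration error cancels term by term.
x = D(x0, 5)
for s in range(5, 0, -1):
    x = x - D(R(x, s), s) + D(R(x, s), s - 1)

# Naive sampling (Algorithm 1): the 0.3 bias is re-applied every step.
y = D(x0, 5)
for s in range(5, 0, -1):
    y = D(R(y, s), s - 1)

print(np.abs(x - x0).max())   # ~0: exact recovery despite the biased R
print(np.abs(y - x0).max())   # 5 * 0.3 = 1.5: accumulated error
```

This mirrors the derivation exactly: in the TACoS update, R(x_s, s) appears once with a plus sign and once with a minus sign, so any error in it vanishes.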

4. GENERALIZED DIFFUSIONS WITH VARIOUS TRANSFORMATIONS

In this section, we take the first step towards cold diffusion by reversing different degradations and hence performing conditional generation. We will extend our methods to perform unconditional (i.e., from scratch) generation in Section 5. We empirically evaluate generalized diffusion models trained on different degradations with TACoS, proposed in Algorithm 2. We perform experiments on the vision tasks of deblurring, inpainting, super-resolution, and the unconventional task of synthetic snow removal. We perform our experiments on MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and CelebA (Liu et al., 2015). In each of these tasks, we gradually remove information from the clean image, creating a sequence of images such that D(x_0, t) retains less information than D(x_0, t - 1). For these different tasks, we present both qualitative and quantitative results on a held-out testing dataset and demonstrate the importance of the sampling technique described in Algorithm 2. For all quantitative results in this section, the Fréchet inception distance (FID) scores (Heusel et al., 2017) for degraded and reconstructed images are measured with respect to the testing data. Additional information about the quantitative results, convergence criteria, hyperparameters, and architecture of the models presented below can be found in the appendix.

4.1. DEBLURRING

We consider a generalized diffusion based on a Gaussian blur operation (as opposed to Gaussian noise) in which an image at step t has more blur than at t - 1. The forward process, given the Gaussian kernels {G_s} and the image x_{t-1} at step t - 1, can thus be written as

    x_t = G_t ∗ x_{t-1} = G_t ∗ . . . ∗ G_1 ∗ x_0 = Ḡ_t ∗ x_0 = D(x_0, t),

where ∗ denotes the convolution operator, which blurs an image using a kernel. We train a deblurring model by minimizing the loss in equation 1, and then use TACoS to invert this blurred diffusion process, for which we trained a DNN to predict the clean image x̂_0. Qualitative results are shown in Figure 3 and quantitative results in Table 1. Qualitatively, we can see that images created using the sampling process are sharper, and in some cases completely different, as compared to the direct reconstruction of the clean image. Quantitatively, we can see that reconstruction metrics such as RMSE and PSNR get worse when we use the sampling process, but on the other hand FID with respect to held-out test data improves. The qualitative improvements and decrease in FID show the benefits of the generalized sampling routine, which brings the learned distribution closer to the true data manifold.

In the case of the blur operator, the sampling routine can be thought of as adding frequencies back at each step. This is because the sampling routine involves the term D(x̂_0, t) - D(x̂_0, t - 1), which in the case of blur becomes Ḡ_t ∗ x̂_0 - Ḡ_{t-1} ∗ x̂_0. This results in a difference of Gaussians, which is a band-pass filter and contains frequencies that were removed at step t. Thus, in the sampling process, we sequentially add back the frequencies that were removed during the degradation process.
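A minimal 1-D sketch of the composed blur degradation and the difference-of-Gaussians correction term. The kernel schedule here is illustrative; the paper's experiments use 2-D image blurs with dataset-specific kernel sizes and schedules:

```python
import numpy as np

def gauss_kernel(sigma, radius=12):
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2.0 * sigma**2))
    return k / k.sum()

def blur(x, sigma):
    # 1-D stand-in for the 2-D image blur G_s * x
    return np.convolve(x, gauss_kernel(sigma), mode="same")

def D(x0, t, sigmas):
    """x_t = G_t * ... * G_1 * x_0: recursively composed Gaussian blurs."""
    x = x0
    for s in range(t):
        x = blur(x, sigmas[s])
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=256)
sigmas = [np.exp(0.01 * s) for s in range(10)]   # illustrative schedule

# The correction term D(x0, t) - D(x0, t-1) is a difference of Gaussians:
# a band-pass signal carrying the frequencies removed at step t.
band = D(x0, 5, sigmas) - D(x0, 4, sigmas)
print(np.var(x0), np.var(D(x0, 5, sigmas)), np.var(band))
```

Blurring strictly shrinks the variance of the white-noise input, and the band-pass residual is what TACoS re-injects at each reverse step.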

4.2. INPAINTING

We define a schedule of transforms that progressively grays out pixels from the input image. We remove pixels using a Gaussian mask as follows: for input images of size n × n, we start with a 2D Gaussian curve of variance β, discretized into an n × n array. We normalize so the peak of the curve has value 1, and subtract the result from 1 so the center of the mask has value 0. We randomize the location of the Gaussian mask for MNIST and CIFAR-10, but keep it centered for CelebA. We denote the final mask by z_β. Input images x_0 are iteratively masked for T steps via multiplication with a sequence of masks {z_{β_i}} with increasing β_i. We can control the amount of information removed at each step by tuning the β_i parameter. In the language of Section 3, D(x_0, t) = x_0 ⊙ ∏_{i=1}^{t} z_{β_i}, where ⊙ denotes entry-wise multiplication.

Figure 4 presents results on test images and compares the output of the inpainting model to the original image. The reconstructed images display features qualitatively consistent with the context provided by the unperturbed regions of the image. We quantitatively assess the effectiveness of the inpainting models on each of the datasets by comparing distributional similarity metrics before and after the reconstruction. Our results are summarized in Table 2. Note that the FID scores here are computed with respect to the held-out validation set.
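A small sketch of this Gaussian-mask degradation, assuming a centered mask as used for CelebA; the β schedule below is an illustrative invention:

```python
import numpy as np

def gaussian_mask(n, beta, center=None):
    """z_beta: one minus a peak-normalized 2-D Gaussian, so the mask is
    exactly 0 at its center and close to 1 far away."""
    cy, cx = center if center is not None else (n // 2, n // 2)
    yy, xx = np.mgrid[0:n, 0:n]
    g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * beta))
    return 1.0 - g / g.max()

def D(x0, t, betas):
    """D(x0, t) = x0 * z_{beta_1} * ... * z_{beta_t} (entry-wise)."""
    x = x0.copy()
    for i in range(t):
        x = x * gaussian_mask(x.shape[0], betas[i])
    return x

x0 = np.ones((32, 32))
betas = [4.0 * (i + 1) for i in range(8)]   # increasing beta_i removes more
x8 = D(x0, 8, betas)
print(x8[16, 16], x8[0, 0])   # center fully grayed out, corner nearly intact
```

Because every mask lies in [0, 1], each step can only remove information; the masked region grows with t as the β_i increase.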


4.4. SNOWIFICATION

Apart from traditional degradations, we additionally provide results for the task of synthetic snow removal, using the official implementation of the snowification transform from ImageNet-C (Hendrycks & Dietterich, 2019). The purpose of this experiment is to demonstrate that generalized diffusion can succeed even with exotic transforms that lack the scale-space and compositional properties of blur operators. Similar to other tasks, we degrade the images by adding snow, such that the level of snow increases with step t. We provide more implementation details in the Appendix. We illustrate our desnowification results in Figure 6.

5. COLD GENERATION

We will first discuss deterministic generation using Gaussian noise and then discuss in detail unconditional generation using deblurring. Finally, we provide a proof of concept that TACoS (Algorithm 2) can be extended to other degradations.

5.1. GENERATION USING DETERMINISTIC NOISE DEGRADATION

Here we discuss image generation using a noise-based degradation, presented in our notation from Section 3, which we will later prove is equivalent to DDIM (Song et al., 2021a). We use the following degradation operator:

    D(x, t) = √α_t x + √(1 - α_t) z.

D is an interpolation between the data point x and a sampled noise pattern z ∼ N(0, I). During training, D is applied once, and thus z is sampled once for every image in every batch. However, sampling involves iterative applications of the degradation operator D, which poses the question of how to pick z for the sequence of degradations applied in a single image generation. There are three possible choices for z. The first would be to resample z for each application of D, but this would make the sampling process nondeterministic for a fixed starting point. Another option is to sample a noise pattern z once for each separate image generation and reuse it in each application of D. In Table 5 we refer to this approach as Fixed Noise. Finally, one can calculate the noise vector z to be used in step t of reconstruction via the formula

    ẑ(x_t, t) = (x_t - √α_t R(x_t, t)) / √(1 - α_t).

This method, denoted Estimated Noise in Table 5, turns out to be equivalent to the deterministic sampling proposed in Song et al. (2021a). We discuss this equivalence in detail in Appendix A.6.
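The Estimated Noise formula is easy to sanity-check: if the restoration were perfect (R(x_t, t) = x_0), the formula returns exactly the z used in the forward degradation. The α schedule below is illustrative, not the one used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = np.linspace(0.99, 0.05, 10)     # an illustrative alpha_t schedule

x0 = rng.normal(size=64)                 # a "clean image"
z = rng.normal(size=64)                  # noise sampled once, as in training

def D(x, t):
    return np.sqrt(alphas[t]) * x + np.sqrt(1.0 - alphas[t]) * z

t = 7
x_t = D(x0, t)

# With a perfect restoration R(x_t, t) = x0, the Estimated Noise formula
# recovers exactly the z used in the forward process:
z_hat = (x_t - np.sqrt(alphas[t]) * x0) / np.sqrt(1.0 - alphas[t])
print(np.abs(z_hat - z).max())   # ~0 up to floating point
```

In practice R is imperfect, so ẑ is an estimate; re-estimating it at every step is what makes this sampler deterministic given x_T.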

5.2. IMAGE GENERATION USING BLUR

The forward diffusion process in noise-based diffusion models has the advantage that the degraded image distribution at the final step T is simply an isotropic Gaussian. One can therefore perform (unconditional) generation by first drawing a sample from the isotropic Gaussian, and sequentially denoising it with backward diffusion. When using blur as a degradation, the fully degraded images do not form a nice closed-form distribution that we can sample from. They do, however, form a distribution simple enough to be modeled with basic methods. Note that every image x_0 degenerates to an x_T that is constant (i.e., every pixel is the same color) for large T. Furthermore, the constant value is exactly the channel-wise mean of the RGB image x_0, and can be represented with a 3-vector. This 3-dimensional distribution is easily represented using a Gaussian mixture model (GMM). This GMM can be sampled to produce the random pixel values of a severely blurred image, which can then be deblurred using cold diffusion to create a new image.

Our generative model uses a blurring schedule where we progressively blur each image with a Gaussian kernel of size 27 × 27 over 300 steps. The standard deviation of the kernel starts at 1 and increases exponentially at a rate of 0.01. We then fit a simple GMM with one component to the distribution of channel-wise means. To generate an image from scratch, we sample the channel-wise mean from the GMM, expand the 3-vector into a 128 × 128 image with three channels, and then apply TACoS. Empirically, this pipeline generates images with high fidelity but low diversity, as reflected quantitatively by comparing the perfect symmetry column with results from hot diffusion in Table 5. We attribute this to the perfect correlation between pixels of x_T sampled from the channel-wise mean GMM. To break the symmetry between pixels, we add a small amount of Gaussian noise (of standard deviation 0.002) to each sampled x_T.
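A sketch of this initialization, using a stand-in set of channel-wise means (a one-component "GMM" is simply a single Gaussian fit); the image size and the 0.002 symmetry-breaking noise follow the text, while the dataset of means is invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in dataset: channel-wise means of 1000 training "images" (3-vectors).
channel_means = rng.uniform(0.2, 0.8, size=(1000, 3))

# A one-component GMM is just a single Gaussian fit to the 3-D means.
mu = channel_means.mean(axis=0)
cov = np.cov(channel_means.T)

def sample_xT(h=128, w=128, sym_break=0.002):
    """Draw a channel-wise mean, tile it into a constant image, and add a
    little Gaussian noise to break the perfect pixel-to-pixel symmetry."""
    c = rng.multivariate_normal(mu, cov)                  # shape (3,)
    xT = np.broadcast_to(c, (h, w, 3)).copy()
    return xT + sym_break * rng.normal(size=xT.shape)

xT = sample_xT()
print(xT.shape, xT.std(axis=(0, 1)))   # within-image variation ~ sym_break
```

The sampled x_T is then handed to TACoS with the trained deblurring model to produce a new image.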
As shown in Table 5, this simple trick drastically improves the quality of generated images. We also present qualitative results for cold diffusion using the blur transformation in Figure 7, and further discuss the necessity of TACoS (Algorithm 2) for generation in Appendix A.7.

Table 5: FID scores for CelebA and AFHQ datasets using hot (noise) and cold diffusion (blur transformation). Breaking the symmetry within pixels of the same channel further improves FID.

5.3. GENERATION USING OTHER TRANSFORMATIONS

In this section, we provide a proof of concept that generation can be extended to other transformations. Specifically, we show preliminary results on inpainting, super-resolution, and animorphosis. Inspired by the simplicity of the degraded image distribution for the blurring routine presented in the previous section, we use degradation routines with predictable final distributions here as well.

To use the Gaussian mask transformation for generation, we modify the masking routine so the final degraded image is completely devoid of information. One might think a natural option is to send all of the images to a completely black image x_T, but this would not allow for any diversity in generation. To get around this maximally non-injective property, we instead make the mask turn all pixels to a random, solid color. This still removes all of the information from the image, but it allows us to recover different samples from the learned distribution via Algorithm 2 by starting off with different color images. More formally, a Gaussian mask G_t = ∏_{i=1}^{t} z_{β_i} is created as discussed in Section 4.2, but instead of multiplying it directly with the image x_0, we create x_t as G_t ⊙ x_0 + (1 - G_t) ⊙ c, where c is an image of a randomly sampled solid color.

For super-resolution, the routine down-samples to a resolution of 2 × 2, or 4 values in each channel. These degraded images can be represented as one-dimensional vectors, and their distribution is modeled using a single Gaussian distribution. Using the same method described for generation via blurring above, we sample from this Gaussian fit to the lower-dimensional degraded image space and pass the sampled point through the generation process trained on super-resolution data to create one output.

Additionally, to show one can invert nearly any transformation, we include a new transformation deemed animorphosis, in which we iteratively transform a human face from CelebA into an animal face from AFHQ. Though we chose CelebA and AFHQ for our experimentation, in principle such an interpolation can be done for any two initial data distributions. More formally, given an image x and a random image z sampled from the AFHQ manifold, x_t can be written as x_t = √α_t x + √(1 - α_t) z. Note that this is essentially the same as the noising procedure, but instead of adding noise we are adding a progressively higher-weighted AFHQ image. In order to sample from the learned distribution, we sample a random image of an animal and use TACoS. We present results for the CelebA dataset; the FID scores for inpainting, super-resolution, and animorphosis are 90.14, 92.91, and 48.51, respectively. We show qualitative samples in Figure 8 and in Figure 1.

6. CONCLUSION

Existing diffusion models rely on Gaussian noise for both forward and reverse processes. In this work, we find that the random noise can be removed entirely from the diffusion model framework, and replaced with arbitrary transforms. In doing so, our generalization of diffusion models and their sampling procedures allows us to restore images afflicted by deterministic degradations such as blur, inpainting and downsampling. This framework paves the way for a more diverse landscape of diffusion models beyond the Gaussian noise paradigm. The different properties of these diffusions may prove useful for a range of applications, including image generation and beyond.

REPRODUCIBILITY STATEMENT

We provided our full code base as supplementary material; it is a modified version of the popular diffusion codebase found at https://github.com/lucidrains/denoising-diffusion-pytorch. To facilitate the reproducibility of our results, we have included detailed hyperparameters for training each of our cold diffusion models in Appendices A.1-A.5. Due to space constraints in the main body, we opted to present a relatively small number of qualitative results. Many more examples of both conditionally and unconditionally generated images can be found in the Appendix.

A APPENDIX

A.1 DEBLURRING

For the deblurring experiments, we train the models on different datasets for 700,000 gradient steps. We use the Adam (Kingma & Ba, 2014) optimizer with learning rate 2 × 10^-5. Training uses a batch size of 32, and we accumulate gradients every 2 steps. Our final model is an exponential moving average (EMA) of the trained model with decay rate 0.995, updated after every 10 gradient steps. For the MNIST dataset, we blur recursively 40 times with a discrete Gaussian kernel of size 11 × 11 and standard deviation 7. In the case of CIFAR-10, we recursively blur with a Gaussian kernel of fixed size 11 × 11, but at each step t the standard deviation of the kernel is given by 0.01 · t + 0.35. The blur routine for the CelebA dataset involves blurring images with a 15 × 15 Gaussian kernel whose standard deviation grows exponentially with time t at a rate of 0.01.

A.2 INPAINTING

To avoid potential leakage of information due to floating point computation of the Gaussian mask, we discretize the masked image before passing it through the inpainting model. This was done by rounding all pixel values to the eight most significant digits. Figure 11 shows nine additional inpainting examples on each of the MNIST, CIFAR-10, and CelebA datasets. Figure 10 demonstrates an example of the iterative sampling process of an inpainting model for one image in each dataset.

A.3 SUPER-RESOLUTION

We train the super-resolution model per Section 3.1 for 700,000 iterations. We use the Adam (Kingma & Ba, 2014) optimizer with learning rate 2 × 10^-5. The batch size is 32, and we accumulate gradients every 2 steps. Our final model is an exponential moving average of the trained model with decay rate 0.995, updated every 10 gradient steps. The number of time steps depends on the sizes of the input image and the final image. For MNIST and CIFAR-10, the number of time steps is 3, as it takes three steps of halving the resolution to reduce the initial image down to 4 × 4. For CelebA, the number of time steps is 6, reducing the initial image down to 2 × 2. For CIFAR-10, we apply random crop and random horizontal flip for regularization. Figure 13 shows an additional nine super-resolution examples on each of the MNIST, CIFAR-10, and CelebA datasets. Figure 12 shows one example of the progressive increase in resolution achieved with the sampling process using a super-resolution model for each dataset.

A.4 COLORIZATION

Here we provide results for the additional task of colorization. Starting with the original RGB image x_0, we realize colorization by iteratively desaturating for T steps until the final image x_T is fully gray-scale. We use a series of three-channel 1 × 1 convolution filters z(α) = {z_1(α), z_2(α), z_3(α)} of the form

    z_1(α) = α (1/3, 1/3, 1/3) + (1 - α) (1, 0, 0)
    z_2(α) = α (1/3, 1/3, 1/3) + (1 - α) (0, 1, 0)
    z_3(α) = α (1/3, 1/3, 1/3) + (1 - α) (0, 0, 1)

and obtain D(x, t) = z(α_t) ∗ x via a schedule α_1, . . . , α_t for each respective step. Notice that a gray image is obtained when x_T = z(1) ∗ x_0. We can tune the ratio α_t to control the amount of information removed in each step. For our experiment, we schedule the ratios such that for every t we have x_t = z(α_t) ∗ . . . ∗ z(α_1) ∗ x_0 = z(t/T) ∗ x_0. This schedule ensures that the color information lost between steps is smaller in the earlier stages of the diffusion and becomes larger as t increases.

We train the models on different datasets for 700,000 gradient steps. We use the Adam (Kingma & Ba, 2014) optimizer with learning rate 2 × 10^-5. We use batch size 32, and we accumulate gradients every 2 steps. Our final model is an exponential moving average of the trained model with decay rate 0.995, updated every 10 gradient steps. For CIFAR-10 we use T = 50, and for CelebA we use T = 20.

We illustrate our recolorization results in Figure 14. We present test examples, as well as their gray-scale versions, from all the datasets, and compare the recolorization results with the original images. The recolored images feature correct color separation between different regions, and varied yet semantically correct colorization of objects. Our sampling technique still yields minor differences in comparison to the direct reconstruction, although the change is not visually apparent.
We attribute this to the shape restriction of colorization task, as human perception is rather insensitive to minor color change. We also provide quantitative measurement for the effectiveness of our recolorization results in terms of different similarity metrics, and summarize the results in Table 6 .  S C [i][j] = 0, S B [i][j] ≤ c 1 S B [i][j], S B [i][j] > c 1 and clip each entry of S C into the range [0, 1]. We then convolve S C using a motion blur kernel with standard deviation c 2 to create the snow pattern S and its up-side-down rotation S ′ . The direction of x t-1 = x t -D( x0 , t) + D( x0 , t -1) = x t -( √ α t x0 + √ 1 -α t ẑ) + ( √ α t-1 x0 + 1 -α t-1 ẑ) = √ α t-1 x0 + 1 -α t-1 ẑ (1) which is same as the sampling method as described in (Song et al., 2021a) . The only difference from the original (Song et al., 2021a) is the order for estimating x0 and ẑ. The original (Song et al., 2021a) paper estimated ẑ first and then used this to predict clean image x0 , while we first predict the clean image x0 and then estimate the noise ẑ. A.7 GENERATION USING BLUR TRANSFORMATION: FURTHER DETAILS Figure 16 : Examples of generated samples from 128×128 CelebA and AFHQ datasets using Method 2 with perfect symmetry. The Figure 16 , shows the generation without breaking any symmetry within each channel are quite promising as well. Necessity of Algorithm 2: In the case of unconditional generation, we observe a marked superiority in quality of the sampled reconstruction using Algorithm 2 over any other method considered. For example, in the broken symmetry case, the FID of the directly reconstructed images is 257.69 for CelebA and 214.24 for AFHQ, which are far worse than the scores of 49.45 and 54.68 from Table 5 . In Figure 17 , we also give a qualitative comparison of this difference. We can also clearly see from x t-1 = √ α t-1 • "predicted x 0 " + 1 -α t-1 -σ 2 t ϵ θ (x t ) + σ t ϵ t where ϵ θ (x t ) is the noise predicted by the diffusion model given x t and t. 
The term "predicted x_0", or x̂_0, can be computed directly from x_t and ε_θ(x_t) as

x̂_0 = (x_t - √(1 - α_t) ε_θ(x_t)) / √α_t.

Hence, writing ẑ in place of ε_θ(x_t) and x̂_0 for the predicted clean image, we have

x_{t-1} = √α_{t-1} · x̂_0 + √(1 - α_{t-1} - σ_t²) ẑ + σ_t ε_t.

Thus, the sampling step can be interpreted as follows. At each step t, we start with a noisy image x_t and use the diffusion model to estimate the clean image x̂_0 and the noise ẑ that was added to this clean image to produce x_t. To move to the less noisy image x_{t-1}, one "adds back" a smaller amount of noise to the predicted clean image x̂_0. Noise can be added back in two ways: either the noise ẑ estimated to have been added to the clean image x̂_0, or freshly sampled uncorrelated noise ε_t. In fact, both noises can be mixed, with σ_t acting as the hyperparameter that weighs the amount of each. This σ_t is placed in the equation such that, for any choice of σ_t, the standard deviation of the noise added back is √(1 - α_{t-1}). For σ_t = 0, we add back only the estimated noise ẑ and no uncorrelated noise ε_t, which is exactly DDIM sampling, while for

σ_t = √((1 - α_{t-1})/(1 - α_t)) · √(1 - α_t/α_{t-1})

we recover the sampling method described in DDPM. Nevertheless, for any choice of σ_t, the sampling method involves a denoising operation, shown as R(x_s, s) in Algorithm 1, and an operation that adds back noise, shown as x_{s-1} = D(x̂_0, s-1) in Algorithm 1. The only difference between the sampling methods of DDPM and DDIM is how one degrades the image back.
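The interpolation between DDIM (σ_t = 0) and DDPM can be sketched numerically. In the sketch below (our own function names, with the predicted clean image x̂_0 passed in directly rather than produced by a network), the step re-adds a σ_t-weighted mixture of the estimated noise ẑ and fresh noise ε_t:

```python
import numpy as np

def sampling_step(x_t, x0_hat, alpha_t, alpha_prev, sigma_t, rng):
    """One reverse step: estimate z_hat from x_t and the predicted clean
    image x0_hat, then add back noise with total std sqrt(1 - alpha_prev),
    split between z_hat and fresh noise eps_t according to sigma_t."""
    z_hat = (x_t - np.sqrt(alpha_t) * x0_hat) / np.sqrt(1.0 - alpha_t)
    eps_t = rng.standard_normal(x_t.shape)
    return (np.sqrt(alpha_prev) * x0_hat
            + np.sqrt(1.0 - alpha_prev - sigma_t ** 2) * z_hat
            + sigma_t * eps_t)
```

With sigma_t = 0 the step is fully deterministic (DDIM); with the DDPM choice of sigma_t it matches ancestral sampling.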



Figure 3: Deblurring models trained on the MNIST, CIFAR-10, and CelebA datasets. Left to right: degraded inputs D(x_0, T), direct reconstruction R(D(x_0, T)), sampled reconstruction with TACoS described in Algorithm 2, and original image.

Figure 4: Inpainting models trained on the MNIST, CIFAR-10, and CelebA datasets. Left to right: degraded inputs D(x_0, T), direct reconstruction R(D(x_0, T)), sampled reconstruction with TACoS described in Algorithm 2, and original image.

Figure 5 presents example test inputs for all datasets and compares the output of the super-resolution model to the original image. Though the reconstructed images are not perfect for the more challenging datasets, the reconstructed features are qualitatively consistent with the context provided by the low-resolution image. Table 3 compares the distributional similarity metrics between degraded/reconstructed images and test samples.

Figure 6: Desnowification models trained on the CIFAR-10 and CelebA datasets. Left to right: degraded inputs D(x_0, T), direct reconstruction R(D(x_0, T)), sampled reconstruction with TACoS described in Algorithm 2, and original image.

Figure 7: Examples of generated samples from 128 × 128 CelebA and AFHQ datasets using cold diffusion with the blur transformation.

Figure 8: Preliminary demonstration of the generative abilities of other cold diffusions on the 128 × 128 CelebA dataset. The top row uses animorphosis models, the middle row inpainting models, and the bottom row super-resolution models.

Figure 9 shows an additional nine images for each of MNIST, CIFAR-10, and CelebA. Figures 19 and 20 show the iterative sampling process using a deblurring model for ten example images from each dataset. We further show 400 random images to demonstrate the qualitative results in Figure 21.

Figure 10: Progressive inpainting of selected masked MNIST, CIFAR-10, and CelebA images.

Figure 12: Progressive upsampling of selected downsampled MNIST, CIFAR-10, and CelebA images. The original image is at the left for each of these progressive upsamplings.

Figure 13: Additional examples from super-resolution models trained on the MNIST, CIFAR-10, and CelebA datasets. Left to right: degraded inputs D(x_0, T), direct reconstruction R(D(x_0, T)), sampled reconstruction with TACoS described in Algorithm 2, and original image.

We can also clearly see from Figure 18 that Algorithm 1, the method used in Song et al. (2021b) and Ho et al. (2020), completely fails to produce an image close to the target data distribution.

A.8 ALGORITHM 1 IS THE SAME AS DDIM/DDPM SAMPLING

The sampling method proposed in Song et al. (2021a), in its equation 12, is given as

x_{t-1} = √α_{t-1} · "predicted x_0" + √(1 - α_{t-1} - σ_t²) ε_θ(x_t) + σ_t ε_t.

Figure 17: Comparison of direct reconstruction with sampling using TACoS described in Algorithm 2 for generation with blur transformation and broken symmetry. The left-hand column shows the initial cold images generated using the simple Gaussian model, the middle column shows images generated in one step (i.e., direct reconstruction), and the right-hand column shows images sampled with TACoS as described in Algorithm 2. We present results for both CelebA (top) and AFHQ (bottom) at resolution 128 × 128.

Figure 18: Comparison of Algorithm 1 (top row) and Algorithm 2 (bottom row) for generation with Method 2 and broken symmetry on 128 × 128 CelebA dataset. We demonstrate that Algorithm 1 fails completely to generate a new image.

Quantitative metrics for quality of image reconstruction using deblurring models.

Quantitative metrics for quality of image reconstruction using inpainting models.

Table 3: Quantitative metrics for quality of image reconstruction using super-resolution models.

Quantitative metrics for quality of image reconstruction using desnowification models.

Table 6: Quantitative metrics for quality of image reconstruction using recolorization models for all three-channel datasets.


The direction of the motion blur kernel is randomly chosen as either vertical or horizontal. The final snow image is created by again clipping each value of x_0 + S + S′ into the range [0, 1]. For simplicity, we abstract the process as a function h(x_0, S_A, c_0, c_1).


To create a series of T images with increasing snowification, we linearly interpolate c_0 and c_1 between [c_0^start, c_0^end] and [c_1^start, c_1^end], respectively, to create c_0(t) and c_1(t) for t = 1, ..., T. Then for each x_0, a seed matrix S_x is sampled, the motion blur direction is randomized, and we construct each related x_t by x_t = h(x_0, S_x, c_0(t), c_1(t)). Visually, c_0(t) dictates the severity of the snow, while c_1(t) determines how "windy" the snowified image seems. ... = 0.05 and c_1^end = 20, which generates a visually heavier snow. We train the models on the different datasets for 700,000 gradient steps, using the Adam optimizer (Kingma & Ba, 2014) with learning rate 2×10^-5, batch size 32, and gradient accumulation every 2 steps. Our final model is an exponential moving average of the trained model with decay rate 0.995; the EMA model is updated every 10 gradient steps. For both CIFAR-10 and CelebA we use T = 200. We note that the seed matrix is resampled for each individual training batch, and hence the snow pattern varies over the course of training.
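The linear interpolation of the snow parameters can be sketched as follows; the exact endpoint convention for t = 1, ..., T is our assumption, not specified in the text:

```python
import numpy as np

def snow_schedule(c0_range, c1_range, T):
    """Linearly interpolate severity c0(t) and 'wind' c1(t) between their
    start and end values over T snowification steps."""
    c0 = np.linspace(c0_range[0], c0_range[1], T)
    c1 = np.linspace(c1_range[0], c1_range[1], T)
    return c0, c1
```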

A.6 GENERATION USING NOISE: FURTHER DETAILS

Here we show the equivalence between the sampling method proposed in Algorithm 2 and the deterministic sampling of DDIM (Song et al., 2021a). Given the image x_t at step t, we obtain the restored clean image x̂_0 from the diffusion model. Hence, given the estimated x̂_0 and x_t, we can estimate the noise z(x_t, t) (or ẑ) as

ẑ = (x_t - √α_t x̂_0) / √(1 - α_t).

Thus, D(x̂_0, t) and D(x̂_0, t-1) can be written as

D(x̂_0, t) = √α_t x̂_0 + √(1 - α_t) ẑ,   D(x̂_0, t-1) = √α_{t-1} x̂_0 + √(1 - α_{t-1}) ẑ,

using which the sampling process in Algorithm 2 to estimate x_{t-1} can be written as

x_{t-1} = x_t - D(x̂_0, t) + D(x̂_0, t-1) = √α_{t-1} x̂_0 + √(1 - α_{t-1}) ẑ.
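This algebraic identity is easy to verify numerically. In the sketch below (helper names are ours), D applies the Gaussian forward transform using the estimated noise ẑ:

```python
import numpy as np

def algorithm2_step(x_t, x0_hat, alpha_t, alpha_prev):
    """x_{t-1} = x_t - D(x0_hat, t) + D(x0_hat, t-1), where
    D(x, s) = sqrt(alpha_s) * x + sqrt(1 - alpha_s) * z_hat."""
    z_hat = (x_t - np.sqrt(alpha_t) * x0_hat) / np.sqrt(1.0 - alpha_t)
    D = lambda a: np.sqrt(a) * x0_hat + np.sqrt(1.0 - a) * z_hat
    return x_t - D(alpha_t) + D(alpha_prev)
```

Since D(x̂_0, t) reconstructs x_t exactly, the x_t terms cancel and the step reduces to √α_{t-1} x̂_0 + √(1 - α_{t-1}) ẑ, the deterministic DDIM update.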

