SOFT DIFFUSION: SCORE MATCHING FOR GENERAL CORRUPTIONS

Abstract

We define a broader family of corruption processes that generalizes previously known diffusion models. To reverse these general diffusions, we propose a new objective called Soft Score Matching that provably learns the score function for any linear corruption process and yields state-of-the-art results on CelebA. Soft Score Matching incorporates the degradation process in the network: our new loss trains the model to predict a clean image that, after corruption, matches the diffused observation. We show that our objective learns the gradient of the likelihood under suitable regularity conditions for a family of corruption processes. We further develop a principled way to select the corruption levels for general diffusion processes and a novel sampling method that we call Momentum Sampler. We show experimentally that our framework works for general linear corruption processes, such as Gaussian blur and masking. We achieve a state-of-the-art FID score of 1.85 on CelebA-64, outperforming all previous linear diffusion models. We also show significant computational benefits compared to vanilla denoising diffusion.

1. INTRODUCTION

Score-based models (Song & Ermon, 2019; 2020; Song et al., 2021b) and Denoising Diffusion Probabilistic Models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021a) are two powerful classes of generative models that produce samples by inverting a diffusion process. These two classes have been unified under a single framework (Song et al., 2021b) and are widely known as diffusion models. Diffusion modeling has found great success in a wide range of applications (Croitoru et al., 2022; Yang et al., 2022), including image (Saharia et al., 2022a; Ramesh et al., 2022; Rombach et al., 2022; Dhariwal & Nichol, 2021), audio (Kong et al., 2021; Richter et al., 2022; Serrà et al., 2022), and video generation (Ho et al., 2022b), as well as solving inverse problems (Daras et al., 2022; Kadkhodaie & Simoncelli, 2021; Kawar et al., 2022; 2021; Jalal et al., 2021; Saharia et al., 2022b; Laumont et al., 2022; Whang et al., 2022; Chung et al., 2022).

Karras et al. (2022) analyze the design space of diffusion models. The authors identify three stages: i) the noise scheduling, ii) the network parametrization (each one leads to a different loss function), iii) the sampling algorithm. We argue that there is one more important step: choosing how to corrupt. Typically, the diffusion is additive noise of different magnitudes (and sometimes input rescalings). There have been a few recent attempts to use different corruptions (Deasy et al., 2021; Hoogeboom et al., 2022a; b; Avrahami et al., 2022; Nachmani et al., 2021; Johnson et al., 2021; Lee et al., 2022; Ye et al., 2022), but the results are usually inferior to diffusion with additive noise. Also, a common framework for properly designing general corruption processes is missing.

We present such a principled framework for learning to invert a general class of corruption processes. We propose a new objective called Soft Score Matching that provably learns the score for any regular linear corruption process. Soft Score Matching incorporates the filtering process in the network and trains the model to predict a clean image that, after corruption, matches the diffused observation. Our theoretical results show that Soft Score Matching learns the score (i.e. likelihood gradients) for corruption processes that satisfy a regularity condition that we identify: the diffusion must transform any image into any other image with nonzero likelihood. Using our method and Gaussian blur paired with low-magnitude noise as the diffusion mechanism, we achieve state-of-the-art FID on CelebA (FID 1.85) for linear diffusion models. We also show that our corruption process leads to generative models that are faster compared to vanilla Gaussian denoising diffusion.


Figure 1: Top two rows: demonstration of our generalized diffusion method. Instead of corrupting by only adding noise, we propose a framework to provably learn the score function to reverse any linear diffusion (left: blur and noise, right: masking and noise). Our (blur and noise) models achieve a state-of-the-art FID score of 1.85 for linear diffusion models on CelebA-64. Uncurated samples are shown in the last three rows.

Our contributions: a) We propose a learning objective that i) provably learns the score for a wide family of regular diffusion processes and ii) enables learning under limited randomness in the diffusion. b) We present a principled way to select the intermediate distributions; our method minimizes the Wasserstein distance along the path from the initial to the final distribution. c) We propose a novel sampling method that we call the Momentum Sampler: our sampler uses a convex combination of corruptions at different diffusion levels and is inspired by momentum methods in optimization. d) We train models on CelebA and CIFAR-10. Our trained models on CelebA achieve a new state-of-the-art FID score of 1.85 for linear diffusion models while being significantly faster compared to models trained with vanilla Gaussian denoising diffusion.

2. BACKGROUND

Diffusion models are generative models that produce samples by inverting a corruption process. The corruption level is typically indexed by a time t, with t = 0 corresponding to clean and t = 1 to fully corrupted images. The diffusion process can be discrete or continuous. The two general classes of diffusion models are Score-Based Models (Song & Ermon, 2019; 2020; Song et al., 2021b) and Denoising Diffusion Probabilistic Models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020).

The typical diffusion in score-based modeling is additive noise of increasing magnitude. The perturbation kernel at time t is q_t(x_t|x_0) = N(x_t; μ = x_0, Σ = σ_t² I), where x_0 ∼ q_0 is a clean image. Score models are trained with the Denoising Score Matching (DSM) objective:

min_θ E_{t∼U[0,1]} w_t E_{(x_0, x_t)∼q_0(x_0) q_t(x_t|x_0)} ||s_θ(x_t|t) - ∇_{x_t} log q_t(x_t|x_0)||²,

where w_t weights the objectives of the different noise levels. If we train for each noise level t independently, given enough data and model capacity, the network is guaranteed to recover the gradient of the log-likelihood (Vincent, 2011), known as the score function. In other words, the model s_θ(x_t|t) is trained such that s_θ(x_t|t) ≈ ∇_{x_t} log q_t(x_t). In practice, we use parameter sharing and conditioning on time t to learn all the scores. Once the model is trained, we start from a sample of the final distribution, q_1, and then use the learned score to gradually denoise it (Song & Ermon, 2019; 2020). The final variance σ_1² is selected to be very large such that the distribution q_1 is approximately Gaussian, i.e. the signal-to-noise ratio tends to 0.

DDPMs corrupt by rescaling the input images and by adding noise. The corruption can be modelled with a Markov chain with perturbation kernel q_t(x_t|x_{t-Δt}) = N(x_t; μ = √(1 - β_t) x_{t-Δt}, Σ = β_t I). Typically, β_1 = 1 and hence q_1 = N(0, I). DDPMs are also trained with the DSM objective, which is derived by minimizing an evidence lower bound (ELBO) (Ho et al., 2020).

In their seminal work, Song et al. (2021b) observe that the diffusions of both Score-Based models and DDPMs can be expressed as solutions of Stochastic Differential Equations (SDEs) of the form:

dx = f(x, t) dt + g(t) dw,   (2)

where w is the standard Wiener process. Particularly, Score-Based models use f(x, t) = 0, g(t) = √(dσ_t²/dt), and DDPMs use f(x, t) = -(1/2) β_t x, g(t) = √β_t. As explained earlier, for Score-Based models we need large noise at the end for the final distribution to approximate a normal distribution. Hence, the corresponding SDE is named the Variance Exploding (VE) SDE (Song et al., 2021b). DDPMs usually have a final distribution of unit variance and hence their SDE is known as the Variance Preserving (VP) SDE (Song et al., 2021b). Song et al. (2021b) also propose another SDE with bounded variance, the subVP-SDE, which experimentally yields better likelihoods. For both Score-Based models and DDPMs, Eq. (2) is known as the Forward SDE. This SDE is reversible (Anderson, 1982) and the Reverse SDE is given below:

dx = [f(x, t) - g²(t) ∇_x log q_t(x)] dt + g(t) dw̄,   (3)

where w̄ is the reverse-time standard Wiener process. Typically, ∇_x log q_t(x) is approximated by s_θ(x_t|t) and samples are generated by solving the Reverse SDE (Song et al., 2021b).
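To make the DSM objective concrete, the following sketch estimates it for the VE perturbation kernel described above. It is a minimal illustration, assuming flattened image vectors, a geometric noise schedule, and a hypothetical callable score_model(x_t, t); it is not the training code used in the paper.

import numpy as np

def dsm_loss_ve(score_model, x0_batch, sigma_min=0.01, sigma_max=50.0, rng=np.random):
    # Monte-Carlo estimate of the DSM loss for q_t(x_t | x_0) = N(x_0, sigma_t^2 I).
    # x0_batch has shape (n, d); score_model(x_t, t) returns an array of shape (n, d).
    n = x0_batch.shape[0]
    t = rng.uniform(size=(n, 1))                          # t ~ U[0, 1]
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t    # geometric noise schedule
    eps = rng.standard_normal(x0_batch.shape)
    x_t = x0_batch + sigma_t * eps                        # sample from q_t(x_t | x_0)
    target = -eps / sigma_t                               # grad_{x_t} log q_t(x_t | x_0)
    residual = score_model(x_t, t) - target
    w_t = sigma_t[:, 0] ** 2                              # common weighting: w_t = sigma_t^2
    return np.mean(w_t * np.sum(residual ** 2, axis=-1))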

3. METHOD

Our framework for training diffusion models with more general corruptions includes three components: i) the training objective, ii) the sampling, iii) the scheduling of the corruption mechanism.

3.1. TRAINING OBJECTIVE

We study corruption processes of the form:

x_t = C_t x_0 + s_t η_t,   (4)

where C_t : R^n → R^n is a deterministic linear operator, η_t is a standard Wiener process, s_t is a nonnegative scalar controlling the noise level at time t, and x_0 ∼ q_0(x). We further denote with σ_t² the variance of the noise at level t and we assume that it is a non-decreasing function of t. Unless stated otherwise, we assume that time is continuous and runs from t = 0 to t = 1. Additionally, we assume that at t = 0, we have C_0 = I_{n×n} and σ_0 = 0, i.e. t = 0 corresponds to natural images. We also assume that recovering x_0 from x_t is harder as t gets larger (i.e., the entropy of q_t(x_0|x_t) increases with t). Eq. (4) defines a general class of diffusion processes that includes (as special cases) the VE, VP and subVP SDEs used in Song et al. (2021b). Our diffusion is the sum of a deterministic linear corruption of x_0 and a stochastic part that progressively adds noise.

For any corruption process of this family, we are interested in learning the scores, i.e. ∇_{x_t} log q_t(x_t) for all t. For the vanilla Gaussian denoising diffusion, the celebrated result of Vincent (2011) shows that we only need access to the gradient of the conditional log-likelihood, ∇_{x_t} log q_t(x_t|x_0), in order to learn the score, ∇_{x_t} log q_t(x_t). By revisiting the proof of Vincent (2011), we find that this is actually true for a wide set of corruption processes, as long as some mild technical conditions are satisfied. In fact, the following general Theorem holds:

Theorem 3.1. Let q_0, q_t be two distributions in R^n. Assume that all conditional distributions, q_t(x_t|x_0), are fully supported and differentiable in R^n. Let:

J_1(θ) = (1/2) E_{x_t∼q_t} ||s_θ(x_t) - ∇_{x_t} log q_t(x_t)||²,   (5)
J_2(θ) = (1/2) E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} ||s_θ(x_t) - ∇_{x_t} log q_t(x_t|x_0)||².   (6)

Then, there is a universal constant C (that does not depend on θ) such that J_1(θ) = J_2(θ) + C.

Theorem 3.1 implies that minimizing the second function is equivalent to minimizing the first one. The second function is nothing else than the DSM objective. Our main observation is that noise is not always necessary for learning the score using the DSM objective. A necessary condition is that the corruption process gives non-zero probability to all x_t for any image x_0. This is easily achieved by adding Gaussian noise, but this is not the only option. The proof of this Theorem is deferred to the Appendix and follows the calculations of Vincent (2011).

Network parametrization. For the class of diffusion processes given by Eq. (4), we have that q_t(x_t|x_0) = N(x_t; μ = C_t x_0, Σ = σ_t² I) and hence the objective becomes:

L(t) = (1/2) E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} || s_θ(x_t|t) - (C_t x_0 - x_t)/σ_t² ||².   (7)

As shown, the objective of the model is to predict the (normalized) difference, C_t x_0 - x_t, which is, up to sign and scale, the noise η_t. We argue that even though this objective is theoretically grounded, in many cases it would not work in practice, because we would need infinitely many samples to actually learn the vector field ∇_{x_t} log q_t(x_t) in a way that would allow sampling. Assume that the corruption process is blurring (at different levels) paired with additive noise of small magnitude. The objective written in Eq. (7) learns the distributions of blurry (and slightly noisy) images by just removing noise.
Hence, in practice we might only learn these distributions locally (around the blurry images), and thus we might not be able to reduce the blurriness. This point might be better understood after we present our Sampling Method in Section 3.2. To account for this problem, we propose a network reparametrization which leverages the fact that we know the linear corruption mechanism, C_t. Specifically, we propose the following parametrization:

s_θ(x_t|t) = (C_t h_θ(x_t|t) - x_t) / σ_t².

Crucially, the network incorporates the corruption process. The loss becomes:

L(t) = (1/2) E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} (1/σ_t⁴) ||C_t (h_θ(x_t|t) - x_0)||².   (8)

When C_t is a blurring matrix, this loss function is the MSE between the blurred prediction of h_θ and the blurred clean image. Finally, as observed in previous works (Song & Ermon, 2019; Ho et al., 2020; Karras et al., 2022), the optimization landscape becomes smoother when we predict the residual instead of the clean image directly. This corresponds to the additional reparametrization:

h_θ(x_t|t) = φ_θ(x_t|t) + x_t,   (9)

which leads to the final form of our loss function:

L(t) = (1/2) E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} (1/σ_t⁴) ||C_t (φ_θ(x_t|t) - r_t)||²,   (10)

where r_t is the residual with respect to the clean image, i.e. r_t = x_0 - x_t. Following prior work, we use a single network conditioned on time t that is optimized for all L(t). Hence, the total loss is:

L = E_{t∼U[0,1]} w(t) E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} ||C_t (φ_θ(x_t|t) - r_t)||²,   (12)

where the weights are usually chosen to be 1 or 1/σ_t² (Karras et al., 2022; Kingma et al., 2021). We call our training objective Soft Score Matching. The name is inspired by "soft filtering", a term used in photography to denote an image filter that removes fine details (e.g., blur, fading, etc.). As in Denoising Score Matching, the network is essentially trained to predict the residual to the clean image, but in our case the loss is in the filtered space. When there is no filtering matrix, i.e. C_t = I, we recover the DSM objective used in (Song & Ermon, 2019; 2020; Song et al., 2021b).

Comparison with objectives used in other works. Previous (Anonymous, 2022) or concurrent (Bansal et al., 2022; Rissanen et al., 2022; Hoogeboom & Salimans, 2022) works that consider degradations other than Gaussian Diffusion use the heuristic objective of predicting the clean image, i.e. they minimize ||φ_θ(x_t|t) - r_t||. This is actually an upper bound on our loss, i.e. ||C_t (φ_θ(x_t|t) - r_t)|| ≤ ||C_t|| ||φ_θ(x_t|t) - r_t||. Since the spectral norm ||C_t|| is fixed, one can optimize for the upper bound by minimizing ||φ_θ(x_t|t) - r_t||. Instead, Soft Score Matching optimizes directly for learning the score. Experimentally, Soft Score Matching outperforms this simple baseline under the exact same setting (ours FID: 1.85, theirs: 5.91); see the Experiments.

3.2. SAMPLING

Naive Sampler. Once the model is trained, we need a way to generate samples. The simplest idea is to use our model, φ_θ(x_t|t), recursively to get estimates of the clean image, x̂_0. To move from corruption level t to corruption level t - Δt, we feed the image x_t to the model to get an estimate of the clean image, x̂_0, and then corrupt it back to level t - Δt. This idea is shown in Algorithm 1.

Momentum Sampler. As we will show in the Experiments section, the naive sampler presented above leads to generated images that lack diversity. We propose a simple, yet novel, alternative method for sampling from the general linear diffusion model presented in Eq. (4). Our method is inspired by the continuous formulation of diffusion models introduced in Song et al. (2021b). The first step is to find a Markovian stochastic process that is "close" to the non-Markovian corruption process of Eq. (4). Consider the following SDE:

dx_t = Ċ_t E[x_0|x_t] dt + √(d(σ_t²)/dt) dw,   (13)

where w is the standard Wiener process. This is a special case of the Itô SDE dx = f(x, t) dt + g(t) dw that appears in Song et al. (2021b), with f(x, t) = Ċ_t E[x_0|x_t] and g(t) = √(d(σ_t²)/dt); since E[x_0|x_t] does not depend on previous values of x_t, our SDE is indeed an Itô SDE. To build some intuition, it is useful to think of the toy setting where the dataset contains one image, α ∈ R^n.
Under this setting, the Markovian corruption described by the SDE of Eq. (13) has the same marginal distributions as the process of Eq. (4). As we explain in the Appendix (Section E), in the general case the SDE (13) introduces an approximation error with respect to the corruption process of Eq. (4): the former uses the conditional expectation, E[x_0|x_t], while the latter corrupts x_0 directly. Nevertheless, the approximation error for the considered corruptions seems to be small, given the experimental success of the derived sampler. Intuitively, this is because for low corruption levels the conditional expectation and x_0 are close, while for high corruption levels the distance is contracted by the multiplication with the corruption matrix C_t (see also Appendix, Section E).

Eq. (13) describes a reversible diffusion process (Anderson, 1982). The reverse is also an Itô SDE:

dx_t = [Ċ_t E[x_0|x_t] - (d(σ_t²)/dt) ∇_{x_t} log q_t(x_t)] dt + √(d(σ_t²)/dt) dw̄,   (14)

where w̄ is a standard Wiener process when time flows backwards from t = 1 to t = 0. In practice, to solve Eq. (14), we discretize the SDE (i.e., apply Euler-Maruyama and approximate the function derivatives with finite differences). The Euler-Maruyama discretization is given below:

x_{t-Δt} - x_t = (C_{t-Δt} - C_t) E[x_0|x_t] - (σ²_{t-Δt} - σ_t²) ∇_{x_t} log q_t(x_t) + √(σ_t² - σ²_{t-Δt}) η,   (15)

where η ∼ N(0, I). In this update equation, there are two unknowns: the conditional expectation, E[x_0|x_t], and the score function, ∇_{x_t} log q_t(x_t). We show that these are actually connected through Tweedie's formula (Efron, 2011; Robbins, 1956; Stein, 1981). Specifically, it holds that:

C_t E[x_0|x_t] = x_t + σ_t² ∇_{x_t} log q_t(x_t).   (16)

The proof is given for completeness in the Appendix, Lemma A.1. To estimate ∇_{x_t} log q_t(x_t), we use our model, which provably learns the score according to Theorem 3.1. Putting everything together:

Δx_t = x_{t-Δt} - x_t = (C_{t-Δt} - C_t)(φ_θ(x_t|t) + x_t) - [(σ²_{t-Δt} - σ_t²)/σ_t²] (C_t (φ_θ(x_t|t) + x_t) - x_t) + √(σ_t² - σ²_{t-Δt}) η,   (17)

where φ_θ(x_t|t) + x_t = x̂_0 is the prediction of the clean image. Our sampler is summarized in Algorithm 2. Essentially, there is one update for deblurring and one for denoising. At the core of this update equation is the prediction of the clean image, x̂_0.

Algorithm 2 Momentum Sampler
Require: p_1, φ_θ, C_t, σ_t, Δt
  x_1 ∼ p_1(x)
  for t = 1 to 0 with step -Δt do
    x̂_0 ← φ_θ(x_t|t) + x_t   ▷ Coarse prediction of the clean image.
    ŷ_t ← C_t x̂_0   ▷ Coarse prediction of the filtered image at t.
    η_t ∼ N(0, I)
    ε̂_t ← ŷ_t - x_t   ▷ Estimate of the noise at t.
    z_{t-Δt} ← x_t - [(σ²_{t-Δt} - σ_t²)/σ_t²] ε̂_t + √(σ_t² - σ²_{t-Δt}) η_t   ▷ Filtered image at t with noise at t - Δt.
    ŷ_{t-Δt} ← C_{t-Δt} x̂_0   ▷ Coarse prediction of the filtered image at t - Δt.
    x_{t-Δt} ← z_{t-Δt} + (ŷ_{t-Δt} - ŷ_t)   ▷ Filtered image at t - Δt with noise at t - Δt.
  end for
  return x̂_0

Once the clean image is predicted, we blur it back to two different corruption levels, t and t - Δt. The deblurring gradient is the residual between the blurred images at levels t - Δt and t. Interestingly, the denoising update is the same as the one used in typical score-based models (that only use additive noise). In fact, if there is no blur (C_t = I), our sampler becomes exactly the sampler used for the Variance Exploding (VE) SDE in Song et al. (2021b). We call our sampler the Momentum Sampler because we can think of it as a generalization of the update of the Naive Sampler, where there is a momentum term. To understand this better, we look at the setting where there is no noise. Then, the update rule of the Momentum Sampler is:

Δx_t = C_{t-Δt} x̂_0 - C_t x̂_0.   (18)
As seen, the first term is what the Naive Sampler would use to update the image at level t and the second term is what the Naive Sampler would use to update the image at level t - Δt. If these two directions are aligned, then the gradient Δx_t is small. Hence, there is a notion of momentum, analogous to how the term is used in classical optimization. In the Appendix, Section B.2, we also present a DDIM-type (Song et al., 2021a) sampler for which the momentum term also appears.

Probability Flow Momentum Sampler. The update rule of our sampler was derived using the discretization of the backward SDE of our corruption, given in Eq. (14). Similarly to Song et al. (2021b), we can also consider the Ordinary Differential Equation (ODE) associated with this SDE:

dx_t = [Ċ_t E[x_0|x_t] - (1/2) (d(σ_t²)/dt) ∇_{x_t} log q_t(x_t)] dt.

Again, we can approximate Ċ_t E[x_0|x_t] and ∇_{x_t} log q_t(x_t) with our trained network and get a deterministic version of the Momentum Sampler, which we call the Probability Flow Momentum Sampler, as in Song et al. (2021b). We detail our derivations in the Appendix, Section B.1.
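The following is a minimal sketch of the Momentum Sampler loop (Algorithm 2) in code. It assumes hypothetical callables phi(x, t) for the trained residual-predicting network, C(x, t) for the corruption operator C_t, and sigma(t) for the noise level; the real implementation additionally handles batching, the discrete schedules, and sampling from the terminal distribution p_1.

import numpy as np

def momentum_sampler(phi, C, sigma, n_steps, x1, rng=np.random):
    # One run of Algorithm 2, going from t = 1 down to t = 0 in n_steps uniform steps.
    dt = 1.0 / n_steps
    x_t = x1
    for i in range(n_steps):
        t, s = 1.0 - i * dt, 1.0 - (i + 1) * dt
        x0_hat = phi(x_t, t) + x_t                      # coarse prediction of the clean image
        y_t = C(x0_hat, t)                              # filtered prediction at level t
        eps_hat = y_t - x_t                             # estimate of the noise at level t
        var_t, var_s = sigma(t) ** 2, sigma(s) ** 2
        noise = rng.standard_normal(x_t.shape)
        z = x_t - (var_s - var_t) / var_t * eps_hat + np.sqrt(max(var_t - var_s, 0.0)) * noise
        x_t = z + (C(x0_hat, s) - y_t)                  # deblurring (momentum) update
    return phi(x_t, 0.0) + x_t                          # final clean-image prediction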

3.3. SCHEDULING

The last piece of our framework is how to choose the corruption levels, i.e. the scheduling of the diffusion. For example, for Gaussian blur, the scheduling decides how much time the diffusion spends in the very blurry, somewhat blurry and almost non-blurry regimes. We provide a principled way of choosing the corruption levels for arbitrary corruption processes. Let D_0 be the distribution of real images and D_1 be a known distribution that we know how to sample from, e.g. a Normal distribution. In the design phase of score-based modeling, the goal is to choose a set of intermediate distributions, {D_t}, that smoothly transform images from D_0 into samples from the distribution D_1. Let Θ = {θ_1, θ_2, ..., θ_k, ...} be the space of diffusion parameters, i.e. each θ_i corresponds to a distribution D_{θ_i}. In the case of blur, for example, θ_i controls how much we blur the image. Let also M : X × X → R be a metric that measures distances between distributions, e.g. M might be the Wasserstein distance of distributions with support X. We construct a weighted graph G with the nodes being the distributions and the weights given by:

w(D_{θ_i}, D_{θ_j}) = M(D_{θ_i}, D_{θ_j}), if M(D_{θ_i}, D_{θ_j}) ≤ ε, and ∞ otherwise.

For a fixed ε, we choose the distributions that minimize the cost of the path between D_0 and D_1. The parameter ε expresses the power of the best neural network we can train to reverse one step of the diffusion. If ε = ∞, then for any metric M, the shortest path is to go directly from D_0 to D_1. However, it is impossible to denoise a very noisy image in one step. Hence, we need to go through many intermediate steps, which is forced by setting ε to a smaller value. As we increase the number of candidate distributions, we get closer to finding the geodesic between D_0 and D_1. However, the computational cost of the method increases since we need to estimate all the pairwise distances M(D_{θ_i}, D_{θ_j}). In practice, we use a relatively small number of candidate distributions, e.g. T = 256, and once the path is found, we do linear interpolation to extend to the continuous case.
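A minimal sketch of this scheduling step follows. It assumes that a matrix of pairwise distance estimates M(D_{θ_i}, D_{θ_j}) has already been computed (e.g. with a Sinkhorn/Wasserstein estimator, see Appendix C) for candidate corruption levels ordered from D_0 (index 0) to D_1 (last index), and runs Dijkstra's algorithm on the ε-thresholded graph.

import heapq
import numpy as np

def select_schedule(dist_matrix, eps):
    # Shortest path from the clean distribution (node 0) to the terminal one (node n-1),
    # where edges longer than eps are forbidden. Returns the indices of the chosen levels.
    n = dist_matrix.shape[0]
    cost = np.where(dist_matrix <= eps, dist_matrix, np.inf)
    best = np.full(n, np.inf)
    prev = np.full(n, -1)
    best[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue
        for v in range(n):
            nd = d + cost[u, v]
            if nd < best[v]:
                best[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, v = [], n - 1                 # backtrack from D_1 to recover the schedule
    while v != -1:
        path.append(int(v))
        v = prev[v]
    return path[::-1]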

4. EXPERIMENTS

We evaluate our method on CelebA-64 and CIFAR-10. We show that by just changing the corruption mechanism (and using our framework for scheduling, learning and sampling) we significantly improve the FID score and reduce the sampling time. We use the architecture and the training hyperparameters from Song et al. (2021b) (full details can be found in the Appendix). For most of our experiments, we use Gaussian blur as our primary corruption mechanism. To illustrate that our method works more generally, we also show results with masking, which is discussed separately later. Consistent with the description of our method, our deterministic corruptions are also paired with additive low-magnitude noise. This is required by our theoretical results; otherwise the conditional log-likelihood, log q(x_t|x_0), would be undefined. We also find the addition of noise beneficial in practice (see Appendix F.1.1 for ablation studies on the role of noise).

Scheduling. We use the standard geometric scheduling for the noise (Song & Ermon, 2019; 2020; Song et al., 2021b) and use the methodology described in Section 3.3 to select the blur levels. We use the Wasserstein distance as the metric M to measure how close the different distributions are. To clearly illustrate the switch to a different corruption, our diffusion has an initial stage that increases the noise (with no blur); then we fix the noise (to a small value, e.g. σ = 0.1) and change (using our scheduling framework) the amount of blur. Our diffusion spends less than 20% of the total time in the initial stage that increases the noise. We ablate these choices in Section F.1 of the Appendix. One important property of our framework is that the scheduling is dataset specific. Intuitively, the way we corrupt depends on the nature of the data we are modelling. The interested reader can find the schedules found for each dataset in Figure 6 of the Appendix.

Results. We train our networks on CelebA-64 and CIFAR-10 using the found schedulings and the training objective of Eq. (12). We start by showing uncurated samples of our models in Figure 2. The generated images have high diversity and fidelity in both datasets. We compare the FID (Heusel et al., 2017) obtained by our method on CelebA with many natural baselines that use any of the VE, VP or subVP SDEs. Specifically, we compare against DDPM (Ho et al., 2020) that uses the VP SDE, DDIM (Song et al., 2021a) that uses the same model but with a different sampler, DDPM++ (Kim et al., 2022b) that is the state-of-the-art model for the VP SDE, and the NCSN++ models (Song et al., 2021b) trained with the VE and subVP SDEs. For a fair comparison, we only use numbers reported in published papers for the baselines and we do not rerun them ourselves. Our model achieves a state-of-the-art FID score, 1.85, on CelebA, outperforming all the other methods. The results are summarized in Table 1. For CIFAR-10, we obtain FID score 3.86 with our Probability Flow Momentum Sampler and 3.91 with our Momentum Sampler. We summarize the results in Table 2. Our best FID is competitive with similar samplers applied to similar methods, e.g. with NCSN++ (VE SDE) with Reverse SDE sampling. However, it is behind other state-of-the-art models such as LSGM (Vahdat et al., 2021) and DDPM (Ho et al., 2020). We believe that this performance gap can be decreased in the future by further research in the area of diffusion models with general corruptions. Our method is superior in sampling time, for both CIFAR-10 and CelebA.
Figure 3 shows how FID changes based on the Number of Function Evaluations (NFEs). Our method requires significantly fewer steps to achieve the same or better quality than NCSN++ (VE SDE) (Song et al., 2021b), using the same architecture and training hyperparameters (FID values taken from (Ma et al., 2022)).

Table 1: FID results on CelebA-64.
  Model                                      FID
  DDPM (VP SDE) (Ho et al., 2020)            3.26
  DDIM (VP SDE) (Song et al., 2021a)         3.51
  DDPM++ (VP SDE) (Kim et al., 2022b)        1.90
  NCSN++ (subVP-SDE) (Song et al., 2021b)    3.95
  NCSN++ (VE SDE) (Song et al., 2021b)       3.25
  Ours (VE SDE + Blur)                       1.85

Ablation Study for Sampling. For all the results we presented so far, we used the Momentum Sampler that we introduced in Section 3.2 and Algorithm 2. In this section, we ablate the choice of the sampler. Specifically, we compare with the intuitive Naive Sampler described in Algorithm 1. We show that the choice of the sampler has a dramatic effect on the quality and the diversity of the generated images. Results are shown in Figure 4. The images from the Naive Sampler seem repetitive and lack details. This is reflected in the poor FID score. The Momentum Sampler leads to images with greater variety and detail, dramatically improving FID from 27.82 to 1.85.

Ablation Studies for Scheduling. We perform extensive ablations on the scheduling to understand the role of noise in the framework and whether our scheme outperforms other natural baselines. The results are detailed in the Appendix, Section F.1. Our main findings are: i) the Momentum Sampler works even if noise and blur are changing simultaneously (i.e. for schedulings with non-fixed noise), ii) lowering the maximum value of noise leads to important performance degradation (for very small noise the method completely fails), and iii) our found scheduling significantly outperforms (baseline FID: 8.35, ours: 1.85) a natural baseline that sets the blur parameters such that the MSE between the corrupted and the clean image decays at the same rate for Gaussian Denoising and Soft Diffusion.

Masking Diffusion Models. To show the generality of our framework, we also train models with (discrete) masking diffusion paired with noise. Figure 1 (top 2 rows) shows the forward and the (learned) reverse process for blur on the left and masking on the right. We train the model with our Soft Score Matching objective on CelebA. Unconditional samples from the model trained with masking can be found in Figure 13 of the Appendix. In Figure 5, we show the predictions of our two trained models (blur and masking) for the conditional mean, E[x_0|x_t], at different times of the diffusion. Soft Score Matching trains the model to make a prediction that matches the real images in the filtered space. Hence, given masked images, the masking model is incentivized to predict correctly only the observed noisy region. As the diffusion time t becomes smaller (cleaner images), the observed region grows and the model predicts bigger windows. Although it is interesting that we can train Masking Diffusion models, there are several limitations compared to Blur Diffusion (and even Gaussian Diffusion). We observe that these models are very slow to sample from: with 1024 sampling steps, FID is 30.92, while with 4096, FID improves to 12.37.

5. RELATED WORK

We showed that by just changing the corruption mechanism, we observe important computational benefits. Several recent works also aim to accelerate sampling; for example, Salimans & Ho (2022) progressively train a student network that mimics the teacher diffusion model with fewer sampling steps. We note that all these works are orthogonal to ours and can therefore be used in combination with our framework for even faster sampling.

On scheduling, there is closely related work by Bao et al. (2022). The authors find a closed-form solution (w.r.t. the score function) for the variance of the reverse SDE for Gaussian Diffusion. Then, they select a noise scheduling that minimizes the KL divergence along the path from the initial to the final distribution. In our work, we use Wasserstein distances and consider general corruption processes for which it is unclear whether such a closed-form solution exists. Instead, we estimate the distances in a data-driven way. This allows us to schedule arbitrary diffusion processes in a principled way.

There is significant recent (Anonymous, 2022) and concurrent work (Rissanen et al., 2022; Bansal et al., 2022; Hoogeboom & Salimans, 2022) that proposes diffusion with other degradations. These works have significant differences since they use different loss functions and sampling methods. Soft Score Matching experimentally outperforms (under the exact same setting) the loss functions used in the concurrent works (ours FID: 1.85, theirs: 5.91). Our (blur) models obtain state-of-the-art FID for linear diffusion models on CelebA (FID 1.85). For CIFAR-10, we outperform (FID: 3.86) Gaussian diffusion with the Variance Exploding SDE (Song et al., 2021b). Hoogeboom & Salimans (2022) also use blurring (but with the Variance Preserving SDE) to further push the CIFAR-10 performance to FID 3.17. These advancements show that there are multiple diffusions with promising potential.

6. CONCLUSIONS AND FUTURE WORK

We presented a framework to train and sample from diffusion models that reverse general linear corruptions. We showed that by changing the corruption process, we can get significant sample-quality improvements and computational benefits. This work opens several future research directions. For example, it is possible to optimize or learn the corruption process for solving a specific type of inverse problem. It is also worth exploring whether mixing different corruptions (blur, noise, masking, etc.) improves performance. Our work could also be extended to the non-linear case, leveraging the techniques introduced in (Rombach et al., 2022; Kim et al., 2022a; Wang et al., 2022). Finally, it is important to understand the role of noise, from both a theoretical and a practical standpoint.

A APPENDIX

A.1 PROOFS

Theorem 3.1. Let q_0, q_t be two distributions in R^n. Assume that all the conditional distributions, q_t(x_t|x_0), are fully supported and differentiable in R^n. Let:

J_1(θ) = (1/2) E_{x_t∼q_t} ||s_θ(x_t) - ∇_{x_t} log q_t(x_t)||²,
J_2(θ) = (1/2) E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} ||s_θ(x_t) - ∇_{x_t} log q_t(x_t|x_0)||².

Then, there is a universal constant C (that does not depend on θ) such that J_1(θ) = J_2(θ) + C.

The proof of this Theorem follows the calculations of Vincent (2011). We observe that as long as the technical conditions listed are satisfied, the proof holds independently of the corruption type. We provide the proof below for the sake of completeness.

Proof of Theorem 3.1. Expanding the square,

J_1(θ) = (1/2) E_{x_t∼q_t} [||s_θ(x_t)||² - 2 s_θ(x_t)^T ∇_{x_t} log q_t(x_t) + ||∇_{x_t} log q_t(x_t)||²]
       = (1/2) E_{x_t∼q_t} [||s_θ(x_t)||²] - E_{x_t∼q_t} [s_θ(x_t)^T ∇_{x_t} log q_t(x_t)] + C_1.

Similarly,

J_2(θ) = (1/2) E_{x_t∼q_t} [||s_θ(x_t)||²] - E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} [s_θ(x_t)^T ∇_{x_t} log q_t(x_t|x_0)] + C_2.

It suffices to show that:

E_{x_t∼q_t} [s_θ(x_t)^T ∇_{x_t} log q_t(x_t)] = E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} [s_θ(x_t)^T ∇_{x_t} log q_t(x_t|x_0)].

We start with the second term:

E_{(x_0,x_t)∼q_0(x_0)q_t(x_t|x_0)} [s_θ(x_t)^T ∇_{x_t} log q_t(x_t|x_0)]
  = ∫_{x_0} ∫_{x_t} q_0(x_0) q_t(x_t|x_0) s_θ(x_t)^T ∇_{x_t} log q_t(x_t|x_0) dx_t dx_0
  = ∫_{x_0} ∫_{x_t} s_θ(x_t)^T q_0(x_0) q_t(x_t|x_0) (1/q_t(x_t|x_0)) ∇_{x_t} q_t(x_t|x_0) dx_t dx_0
  = ∫_{x_0} ∫_{x_t} s_θ(x_t)^T q_0(x_0) ∇_{x_t} q_t(x_t|x_0) dx_t dx_0
  = ∫_{x_t} s_θ(x_t)^T ∫_{x_0} q_0(x_0) ∇_{x_t} q_t(x_t|x_0) dx_0 dx_t
  = ∫_{x_t} s_θ(x_t)^T ∇_{x_t} ∫_{x_0} q_0(x_0) q_t(x_t|x_0) dx_0 dx_t
  = ∫_{x_t} s_θ(x_t)^T ∇_{x_t} q_t(x_t) dx_t
  = ∫_{x_t} q_t(x_t) s_θ(x_t)^T ∇_{x_t} log q_t(x_t) dx_t
  = E_{x_t∼q_t(x_t)} [s_θ(x_t)^T ∇_{x_t} log q_t(x_t)].

Lemma A.1 (Tweedie's formula). Consider the corruption process x_t = C_t x_0 + σ_t η, where η ∼ N(0, I). Denote with q_t(x_t) the density of x_t and assume that log q_t(x_t) is differentiable everywhere. Then, for the Minimum Mean-Squared Error (MMSE) estimate of x_0 given x_t, it holds that:

C_t E[x_0|x_t] = x_t + σ_t² ∇_{x_t} log q_t(x_t).

Proof. We have:

q_t(x_t) = ∫ q_t(x_t|x_0) q_0(x_0) dx_0 ⇒
∇_{x_t} q_t(x_t) = ∫ q_0(x_0) ∇_{x_t} q_t(x_t|x_0) dx_0 = ∫ q_0(x_0) q_t(x_t|x_0) ∇_{x_t} log q_t(x_t|x_0) dx_0.

We know that q_t(x_t|x_0) = N(x_t; μ = C_t x_0, Σ = σ_t² I). Hence,

∇_{x_t} q_t(x_t) = ∫ q_0(x_0) q_t(x_t|x_0) (C_t x_0 - x_t)/σ_t² dx_0
  = (1/σ_t²) [ C_t ∫ q_0(x_0) q_t(x_t|x_0) x_0 dx_0 - x_t ∫ q_0(x_0) q_t(x_t|x_0) dx_0 ]
  = (1/σ_t²) [ C_t ∫ q_0(x_0|x_t) q_t(x_t) x_0 dx_0 - x_t q_t(x_t) ] ⇒
∇_{x_t} q_t(x_t) / q_t(x_t) = (1/σ_t²) (C_t E[x_0|x_t] - x_t) ⇔
∇_{x_t} log q_t(x_t) = (1/σ_t²) (C_t E[x_0|x_t] - x_t).

B DETERMINISTIC SAMPLERS

B.1 PROBABILITY FLOW ODE

In the main text, we derived our Momentum Sampler by analyzing the Backward SDE associated with our corruption process. Inspired by the works of Song et al. (2021b); Maoutsa et al. (2020), we also consider deterministic sampling derived from the ODE that describes our diffusion. Particularly, the ODE:

dx_t = [f(x_t, t) - (1/2) g²(t) ∇_{x_t} log q_t(x_t)] dt,   (47)

has the same marginal distributions (Anderson, 1982; Maoutsa et al., 2020; Song et al., 2021b; Chen et al., 2018) as the Backward SDE:

dx_t = [f(x_t, t) - g²(t) ∇_{x_t} log q_t(x_t)] dt + g(t) dw̄.   (48)

For our case, Eq. (47) becomes:

dx_t = [Ċ_t E[x_0|x_t] - (1/2) (d(σ_t²)/dt) ∇_{x_t} log q_t(x_t)] dt.   (49)

The first-order discretization of this ODE is given below:

x_{t-Δt} - x_t = (C_{t-Δt} - C_t) E[x_0|x_t] - ((σ²_{t-Δt} - σ_t²)/2) ∇_{x_t} log q_t(x_t).   (50)

We estimate C_t E[x_0|x_t] and ∇_{x_t} log q_t(x_t) with our neural network and obtain the Neural ODE (Chen et al., 2018):

Δx_t = x_{t-Δt} - x_t = (C_{t-Δt} - C_t)(φ_θ(x_t|t) + x_t) - (1/2) [(σ²_{t-Δt} - σ_t²)/σ_t²] (C_t (φ_θ(x_t|t) + x_t) - x_t),   (51)

which is the update rule of our Probability Flow Momentum Sampler. We note that our simple discretization is not the only way to solve the ODE of Eq. (47); more sophisticated solvers can be used, e.g. see Dormand & Prince (1980).
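For concreteness, a single Euler step of Eq. (51) can be sketched as follows, reusing the hypothetical callables phi, C, and sigma from the Momentum Sampler sketch in Section 3.2.

def probability_flow_step(x_t, t, s, phi, C, sigma):
    # One deterministic step from level t down to s = t - dt, following Eq. (51).
    x0_hat = phi(x_t, t) + x_t                       # coarse prediction of the clean image
    var_t, var_s = sigma(t) ** 2, sigma(s) ** 2
    deblur = C(x0_hat, s) - C(x0_hat, t)             # (C_{t-dt} - C_t) applied to x0_hat
    denoise = (var_s - var_t) / var_t * (C(x0_hat, t) - x_t)
    return x_t + deblur - 0.5 * denoise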

B.2 DDIM

A popular sampling scheme is the DDIM method, introduced in Song et al. (2021a). The idea is to use the network to predict x̂_0 from x_t and then use the forward model to move from x̂_0 to x_{t-Δt}. This is the same as the Naive Sampler we introduced in the main paper, but the difference in DDIM is that part of the stochasticity is replaced by "simulated" noise. The main trick to simulate the noise is to observe that once we know x̂_0 and x_t, we can once again use the forward model to estimate the noise. We can use the same idea for Soft Diffusion sampling. Specifically, we propose the following DDIM-type sampler:

x_{t-Δt} = C_{t-Δt} h_θ(x_t, t) + √(σ²_{t-Δt} - k²) (x_t - C_t h_θ(x_t, t))/σ_t + k η,

where h_θ(x_t, t) plays the role of x̂_0 and (x_t - C_t h_θ(x_t, t))/σ_t is the simulated noise ε̂_θ(x_t, t). This equation is very similar to Equation 12, page 5, in the DDIM paper (Song et al., 2021a). The parameter k controls the amount of stochasticity, i.e. how much of the noise is simulated. For k = 0, we get a deterministic sampler. Experiments with the deterministic DDIM-type sampler can be found in Section F.3.
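A minimal sketch of one step of this DDIM-type sampler is given below, with a hypothetical clean-image predictor h(x_t, t), corruption operator C(x, t), and noise schedule sigma(t); k is the stochasticity parameter from the equation above.

import numpy as np

def ddim_type_step(x_t, t, s, h, C, sigma, k, rng=np.random):
    # One step from level t to s = t - dt; k = 0 gives the deterministic sampler.
    x0_hat = h(x_t, t)
    eps_hat = (x_t - C(x0_hat, t)) / sigma(t)        # simulated noise at level t
    noise = rng.standard_normal(x_t.shape)
    return C(x0_hat, s) + np.sqrt(max(sigma(s) ** 2 - k ** 2, 0.0)) * eps_hat + k * noise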

C SCHEDULINGS

We use our framework to select the blur levels in an unsupervised way. For solving the optimal transport problem, we use the Sinkhorn distances of Cuturi (2013), which have been used extensively for image experiments. We use the implementation from the software package ott-jax (Cuturi et al., 2022). We experimented with both using the whole dataset and using slices. As expected, dataset slices lead to higher estimation errors, but we observe that the found scheduling does not change much, i.e. the estimation error increases approximately uniformly across all the pairs for reasonably sized dataset slices. For the schedules found in the paper, we used slices of 10% of the dataset. We start with 256 different blur levels and we tune ε such that the shortest path contains 32 distributions. We then use linear interpolation to extend to the continuous case. Full experimental details can be found in the Appendix. These choices seem to work well in practice, but further optimization could be made in future work. Figure 6 shows the found schedulings for the CelebA and the CIFAR-10 datasets. Notice that the scheduling is slightly different between the two datasets: the diffusion depends on the nature of the data we are trying to model. We underline that the parameters for the blur are selected without any supervision, by solving the optimization problem we defined in Section 3.3.
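For reference, the following is a minimal, self-contained sketch of an entropic-OT (Sinkhorn) distance estimate between two dataset slices. It is only a stand-in for the ott-jax implementation actually used; the regularization strength, iteration count, and cost rescaling are illustrative choices.

import numpy as np

def sinkhorn_distance(X, Y, reg=0.05, n_iters=200):
    # X and Y hold one flattened (corrupted) image per row, sampled from the two distributions.
    n, m = X.shape[0], Y.shape[0]
    cost = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2.0 * X @ Y.T
    scale = cost.mean()
    cost = cost / scale                              # rescale cost to keep exp() well-behaved
    K = np.exp(-cost / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                         # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                  # entropic transport plan
    return float(np.sum(P * cost) * scale)           # transport cost in original units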

D TRAINING DETAILS

Hyperparameters. For our trainings, we use the Adam optimizer with learning rate 2e-4, β_1 = 0.9, β_2 = 0.999, ε = 1e-8. We additionally use gradient clipping for gradient norms bigger than 1. For the learning rate scheduling, we use 5000 steps of linear warmup. We use batch size 128 and we train for 1-2M iterations (based on observed FID performance).

Blur parameters. For the blurring operator, we use Gaussian blur with a fixed kernel size and we vary the variance of the kernel. For CelebA-64, we keep the kernel half-size fixed to 80 and we vary the standard deviation from 0.01 to 23. For CIFAR-10, we keep the kernel half-size fixed to 32 and we vary the standard deviation from 0.01 to 18. For both datasets, we implement blur with zero-padding. We chose the final blur level such that the final distribution is easy to sample from. In both cases, the final distribution becomes noise on top of (almost) a single color.

Final distribution. For the blurring models, the final distribution is the distribution of very blurry images with additive Gaussian noise (we first blur, then add noise). At the limit, each blurry image becomes a constant image having a single color (i.e., the average color of the image). Hence, to sample from this final distribution, we first have to sample a single color (from the distribution of average colors in the dataset), generate a constant image having that color, and then add a small amount of noise to the constant image. The distribution of average colors for all the considered datasets is very simple and we model it with a Gaussian distribution. Specifically, we fit a three-dimensional Gaussian distribution (one dimension for each color channel, diagonal covariance) to the average colors of the dataset. To begin the inference process, we sample one color from this distribution, we create an image where every pixel has this color, and then we add Gaussian noise.

Architecture. We use the architecture of Song et al. (2021b) without any changes.

Training objective. For all our experiments, we scale the loss at level t with w(t) = 1/σ_t², as in Song & Ermon (2019; 2020); Song et al. (2021b).
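As a sketch of how these pieces fit together, the following estimates the Soft Score Matching objective of Eq. (12) with the weighting w(t) = 1/σ_t² used above. The callables phi(x, t), C(x, t), and sigma(t) are hypothetical stand-ins for the trained network, the corruption operator, and the noise schedule.

import numpy as np

def soft_score_matching_loss(phi, C, sigma, x0_batch, rng=np.random):
    # Monte-Carlo estimate of Eq. (12); one random corruption level per training image.
    losses = []
    for x0 in x0_batch:
        t = rng.uniform()                                            # t ~ U[0, 1]
        x_t = C(x0, t) + sigma(t) * rng.standard_normal(x0.shape)    # sample q_t(x_t | x_0)
        r_t = x0 - x_t                                               # residual to the clean image
        diff = C(phi(x_t, t) - r_t, t)                               # loss lives in the filtered space
        losses.append(np.sum(diff ** 2) / sigma(t) ** 2)             # weighting w(t) = 1 / sigma_t^2
    return np.mean(losses)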

Compute and Training Time

We train our models on 16 v2-TPUs. Our blur models on CelebA had an average speed of 6 iterations per second. For CIFAR-10, the average speed was 11 iterations per second. We note that there is no overhead over the NCSN++ paper other than projecting to the measurement space, which can be done very efficiently for both blur and masking.

Evaluation. We keep one checkpoint every 10000 steps and we keep the best model among the kept checkpoints based on the obtained FID score. We use 50000 samples to evaluate the FID, as is typically done in prior work. Regarding the number of steps, we selected the best FID among models evaluated with steps ranging from 200 to 1000 with a step size of 80 (10 experiments total). The Momentum Sampler seems to have a U-shaped performance plot, i.e. there is a sweet spot in the number of function evaluations that gives the lowest FID score. We observed this in both CelebA and CIFAR-10. However, this is not unique to our sampler: the general belief that sample quality always improves with the number of steps has been called into question by recent works. Specifically, Karras et al. (2022) do an extensive evaluation of the role of NFEs for different samplers and find that many stochastic samplers behave similarly to our Momentum Sampler. For example, we refer the interested reader to Figure 4, page 8, of the paper Elucidating the Design Space of Diffusion-Based Generative Models, which clearly shows that for many stochastic samplers performance slightly deteriorates after a certain threshold of function evaluations and the optimum is at some intermediate point. We emphasize, however, that the deterioration we observed in CelebA is bigger compared to other stochastic samplers and therefore selecting the correct number of NFEs is important for our method. Consistent with the observations by Karras et al. (2022) (e.g. Figure 2, page 4), for deterministic samplers the performance flattens (DDIM) or deteriorates slightly (Probability Flow Momentum Sampler) in the regime of a very high Number of Function Evaluations (NFEs). We also want to note that selecting the optimal number of function evaluations for reporting FID is not uncommon in the literature: it is done in both the landmark papers Elucidating the Design Space of Diffusion-Based Generative Models (Karras et al., 2022) and Score-Based Generative Modeling through Stochastic Differential Equations (Song et al., 2021b).

E LIMITATIONS AND THINGS THAT DID NOT WORK

Our method has several limitations. First, it requires the diffusion operator to be known. This is not always the case, e.g. see Peebles et al. (2022), where diffusion is applied to checkpoints of different models. Another limitation is that our framework does not offer any guidance on which diffusion operators are actually more or less useful for learning the data distribution. Particularly, we already showed that blurring is a much more powerful diffusion mechanism than masking, in the sense that it leads to better FID scores and faster generation. It is yet to be seen whether blurring will be outperformed by some other corruption method. Our method also only concerns linear diffusions (however, the extension to the non-linear case is relatively straightforward).

On the theoretical side, our method also has some shortcomings. First, it only intuitively explains why the reparametrization of the Denoising Score Matching objective is needed. Second, since our method is based on Denoising Score Matching, it only works when the conditional log-likelihood is defined everywhere. There are distributions for which this condition is not satisfied, but which can still be learned (to some extent) with heuristic methods (Bansal et al., 2022).

Another limitation of our work is the derivation of the Momentum Sampler and its sampling distribution. Consider the forward process:

dx_t = Ċ_t E_{x_0∼p_0}[x_0|x_t] dt + g(t) dw.

This process is described by an Itô SDE, i.e. an SDE of the form dx_t = f(x_t, t) dt + g(t) dw, where f(x_t, t) = Ċ_t E_{x_0∼p_0}[x_0|x_t]. The sampling process we write in the main body of the paper (that leads to the Momentum Sampler) is exactly the reverse of this forward process, where E_{x_0∼p_0}[x_0|x_t] is approximated by the neural network. Hence, the Fokker-Planck equations hold and we are guaranteed that (as long as the approximation of the conditional expectation and the approximation of the score function are accurate) we are sampling from the correct distribution (Song et al., 2021b). During training, we cannot use this forward process because we do not have the conditional expectation. Instead, we replace the conditional expectation of x_0 with x_0 itself. The mismatch between x_0 and the conditional expectation of x_0 leads to an additional approximation error in the learning of the score. For small t, the conditional expectation will be very close to x_0 and hence this approximation error is small. For large values of t, the distance between the conditional expectation and x_0 is bigger. However, we are multiplying with the corruption matrix C_t, which removes information as t grows. Therefore, the distance between the corrupted conditional expectation and the corrupted x_0 is also expected to be small. Our training process learns the correct score for the corruption process applied directly to x_0. Our sampler, however, assumes that we have learned the score using the conditional expectation of x_0. We intuitively expect that these two are not far apart (for all t), as we explained. However, precisely characterizing this mismatch remains open.

On the practical side, we believe that our objective, Soft Score Matching, sometimes leads to slower sampling compared to the simpler objective of predicting the clean image. For example, for masking, since the model is only penalized in the observed region, there is no incentive to expand this region. Hence, to achieve smooth transitions between different masking levels we need to run many steps.
Experimentally, we tried using our framework with even less stochasticity, but it did not work, e.g. see Figure 7. It would be interesting to understand better what is causing the failure and also what is the proper amount of randomness required at each diffusion step.

F.1.1 ABLATION STUDY FOR NOISE

In the experiments of the main paper, our diffusion involves an initial stage where only noise is added. Then the noise is fixed and the images are corrupted by the deterministic operator (e.g. blur or masking). In this section, we show two ablations regarding the noise.

Magnitude of noise.

In this ablation study, we still keep the noise fixed for a significant part of the diffusion, but we ablate the magnitude of the noise. Specifically, we attempt to study to what extent noise is needed in order to learn to reverse corruption processes with Soft Score Matching. We train two additional models on CelebA-64 where we decrease the maximum noise to a lower value. The corruption for all models involves an initial stage where the noise grows at a geometric rate from the initial value (0.01) to the maximum value. Then, all the models keep the noise fixed at their maximum value for the rest of the diffusion. We use the following maximum values: i) σ_max = 0.1 (model used in the paper), ii) σ_max = 0.05, and iii) σ_max = 0.025. Unconditional samples from the two ablation models are shown in Figure 7. As shown, both models fail to produce realistic samples. The quality of samples deteriorates significantly as the noise decreases: the samples from the ablation models are significantly worse than the ones produced by our state-of-the-art model (see Figure 2 (right)). We want to underline that this is not a conclusive study. It might be the case that with different hyperparameters one can make Soft Score Matching work with lower values of noise. For example, we might need to tune the weights w(t), since for the ablations (and the state-of-the-art model) we use w(t) = 1/σ_t² (as in Song & Ermon (2019; 2020); Song et al. (2021b)), which might be causing instabilities for low values of noise (Nichol & Dhariwal, 2021).

Noise changing throughout the diffusion. We also train a model where noise and blur change simultaneously throughout the diffusion. This is a sanity check to verify that our framework (learning and sampling) still works when the model needs to deblur and denoise at the same time. For the noise scheduling, we simply use geometric scheduling from 0.1 to 0.01. We keep the blur parameters the same as for the state-of-the-art model, i.e. we use the blur parameters shown in Figure 6. This model achieves a competitive FID score, 4.31. This score could probably be improved further (by jointly selecting the blur and the noise scheduling with our framework), but this is beyond the scope of this ablation.

F.1.2 ABLATION STUDY FOR BLUR

For all our experiments so far, we chose the blur corruption levels based on the scheduling method described in Section 3.3. We show the benefits of our approach by comparing to a natural baseline for selecting the diffusion levels. For this natural baseline, we use the scheduling of Variance Exploding (VE) as guidance. Specifically, we choose the blur parameters such that the MSE between the corrupted image and the clean image decays at the same rate for Gaussian Denoising Diffusion and for our Soft Diffusion (blur and low-magnitude noise). Formally, let {q_t}_{t=0}^{1} be the (noisy) distributions used in Song et al. (2021b) for the Variance Exploding (VE) SDE and let {q'_t}_{t=0}^{1} be the blurry (and noisy) distributions we want to select. At level t, we choose the blur parameters such that:

E_{(x_0,x_t)∼q_t(x_t|x_0)q_0(x_0)}[||x_0 - x_t||²] / E_{(x_0,x_1)∼q_1(x_1|x_0)q_0(x_0)}[||x_0 - x_1||²] = E_{(x_0,x_t)∼q'_t(x_t|x_0)q_0(x_0)}[||x_0 - x_t||²] / E_{(x_0,x_1)∼q'_1(x_1|x_0)q_0(x_0)}[||x_0 - x_1||²].

We retrain on CelebA using this natural baseline for selecting the diffusion parameters. For a fair comparison, we keep the architecture and all the hyperparameters the same and only ablate the scheduling of the blur. We measure FID for the model trained with the baseline scheduling and observe that it increases from 1.85 to 8.35. Apart from this large deterioration in performance, the baseline model obtains its best FID score after 2000 steps, while with our scheduling we only need 280 steps to obtain the best FID. This experiment shows that the choice of scheduling is important not only for model performance but also for the computational requirements of sampling.

We also show visual samples from our blurring model, NCSN++ (VE SDE) and DDPM++ (VP SDE) in Figure 9. For all the models we fix the NFEs to 200 and we show samples generated with the Euler-Maruyama discretization of the associated reverse ODE. As shown, the baseline models generate samples with artifacts for low NFEs, while our model leads to images of superior visual quality. We note that different samplers can be used to accelerate sampling for all models, e.g. the samplers from Karras et al. (2022) or DDIM-type samplers (Song et al., 2021a), as we show in Section B.2. We also report the best FID performance for different samplers for ours and competing methods in Table 2. Our best FID is competitive with similar samplers applied to similar methods, e.g. with NCSN++ trained with the VE SDE and using the Reverse SDE sampler. However, it is significantly behind the state-of-the-art LSGM (Vahdat et al., 2021) model. We believe that this performance gap can be decreased in the future by further research in the area of diffusion models with general corruptions.

F.3 SAMPLING ABLATIONS

We perform several ablations regarding the sampling method, additional to the results mentioned in the paper.

DDIM. We experiment with the deterministic DDIM-type sampler and report our results in Table 3 for CelebA-64. Our DDIM-type sampler is very effective when the number of function evaluations is very low: it achieves FID 5.08 with only 50 steps, whereas the Momentum Sampler needs considerably more steps to reach comparable quality. This behavior is consistent with prior observations that deterministic samplers are particularly effective at low NFEs (Song et al., 2021a; Karras et al., 2022). DDIM's performance also does not get worse as we increase the number of steps. Instead, the performance of the Momentum Sampler, as observed in Figure 3, has a U-shape: there is an intermediate optimal number of steps that achieves the best FID score. We believe this could be related to the fact that the Momentum Sampler is not deterministic. Specifically, in Karras et al. (2022) it is observed that for deterministic samplers performance usually flattens after a point (e.g. see Figure 2, page 4), while for some stochastic samplers performance goes up again after a certain number of function evaluations (e.g. see Figure 4, page 8).

Probability Flow Momentum Sampler. For CelebA-64, we reported results only for the (stochastic) Momentum Sampler in the main body of the paper (see Figure 3). We report here results for the deterministic counterpart of this sampler, the Probability Flow Momentum Sampler, summarized in Table 4. The deterministic version of the Momentum Sampler seems to perform slightly worse than the stochastic version: FID jumps from 1.85 to 2.14. However, this sampler, similar to the DDIM-type sampler, maintains more of its performance as we increase the number of steps. This is an important advantage of deterministic sampling methods, since practitioners need to put less effort into tuning the NFEs to get the best result.

Predictor-Corrector Samplers. Finally, we perform experiments with Predictor-Corrector samplers, as proposed in Song et al. (2021b). The idea is to alternate at each diffusion step between two different samplers. We experiment with a DDIM Predictor and a Probability Flow Momentum Sampler corrector. The results are summarized in Table 5. The Predictor-Corrector sampler maintains some of the benefits of both samplers. Namely, for a low number of function evaluations it performs better than the Probability Flow Momentum Sampler (a benefit coming from the DDIM sampler), and for a higher number of function evaluations its performance is better than the DDIM sampler (a benefit coming from the Probability Flow Momentum Sampler). There is a spot, at 100 NFEs, where the Predictor-Corrector sampler is better than both the Predictor and the Corrector. We encourage future research into identifying even better pairs of Predictors and Correctors that might outperform both in some regime.

F.4 NEAREST SAMPLES IN TRAINING DATA

To verify that our model does not simply memorize the training dataset, we present generated images from our model and their nearest neighbor (L2 pixel distance) from the dataset. The results are shown in Figure 10 . 

G IMPLEMENTATION OF DEGRADATION OPERATORS

Blurring. The blurring operator is implemented as a convolution with a Gaussian kernel. The Gaussian kernel is truncated to have a fixed support size (161 × 161 pixels) and normalized to have area one. The standard deviation of the Gaussian kernel, sigma_blur (the parameter controlling the strength of the blur), is computed using the optimal transport optimization for the scheduling. The obtained schedule for the parameter sigma_blur on CIFAR-10 can be accessed from the following anonymous url: https://drive.google.com/file/d/192kdbj9oq1EGCm7KY52QZj--x0g5Elgs.

Masking. The masking operator is implemented using a centered binary mask that sets to zero some percentage of the pixels in the image. The scheduling of the masking operator can be accessed from the following anonymous url: https://drive.google.com/file/d/1YjzYKgivhvbHOzTABOYDiUoHbzq60Ozy.

Noise. The noise degradation is implemented by adding to the blurred or masked image white Gaussian noise with standard deviation sigma_noise. The scheduling of the noise can be accessed from the following anonymous url: https://drive.google.com/file/d/17UwFlJyp4euQeKVAVmRRHg8HL52anapf.
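A minimal sketch of the masking and noise degradations is given below. The exact window geometry and its schedule are assumptions made for illustration; the actual schedules are the ones linked above.

import numpy as np

def center_mask(image, keep_fraction):
    # Zero out everything outside a centered square window keeping roughly
    # `keep_fraction` of the pixels; the fraction would be driven by the masking schedule.
    h, w = image.shape[:2]
    mh = int(round(h * np.sqrt(keep_fraction)))
    mw = int(round(w * np.sqrt(keep_fraction)))
    mask = np.zeros_like(image)
    top, left = (h - mh) // 2, (w - mw) // 2
    mask[top:top + mh, left:left + mw] = 1.0
    return image * mask

def degrade(image, corruption, sigma_noise, rng=np.random):
    # Apply a deterministic corruption (blur or masking) and then add white Gaussian
    # noise with standard deviation sigma_noise, as in Eq. (4).
    return corruption(image) + sigma_noise * rng.standard_normal(image.shape)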



Figure 2: Uncurated samples from our trained models on CIFAR-10 (left) and CelebA (right).

Figure 3: FID versus NFEs (CelebA-64).

(a) Naive Sampler (uncurated). FID: 27.82. (b) Momentum Sampler (uncurated). FID: 1.85.

Figure 4: Effect of sampling method on the quality of the samples. Naive Sampler (4a) gives repetitive images that lack details. Momentum Sampler (4b) dramatically improves the sampling quality and the FID score.

Figure 5: Conditional means E[x0|xt] predictions of our blur/masking models, at different diffusion times.

Figure 6: Diffusion scheduling for CelebA-64 and CIFAR-10. The blur corruption levels are selected without supervision to minimize the sum of the Wasserstein distances between consecutive distributions. Notice that the scheduling is slightly different between the two datasets -the diffusion depends on the nature of the data we are trying to model. The support of the Gaussian blur kernel was set to 65 × 65 and 161 × 161 for CIFAR-10 and CelebA-64 datasets respectively.

Figure 7: Ablation study for the magnitude of noise.

Figure 8: FID versus the Number of Function Evaluations (NFEs) for our (blur) model on CIFAR-10 with momentum sampling. Our model offers significant performance benefits for a low number of function evaluations.

Figure 9: Visual comparison of samples from our model and baselines for 200 NFEs. For all the models, we are using the Euler-Maruyama discretization of the associated reverse ODE to generate the samples. As shown, our model leads to superior visual quality.

Table 2: FID results on CIFAR-10 for different samplers. For each of our samplers, we report the best result obtained among 10 runs with different NFEs, ranging from 200 to 1000 with a step size of 80. The results for competing methods are taken directly from the papers.
  Method                                              FID
  Ours (VE SDE), Naive Sampler                        40.07
  Ours (VE SDE), Momentum Sampler                     3.91
  Ours (VE SDE), Probability Flow Momentum Sampler    3.86
  NCSN++ (VE SDE), Reverse SDE                        4.79
  NCSN++ (VE SDE), Probability Flow                   10.54
  DDPM (VP SDE)                                       3.17
  LSGM (Vahdat et al., 2021)                          2.10

(a) Generated images. (b) Nearest neighbors from dataset.

Figure 10: Generated images and nearest neighbors (L2 pixel distance) from the training dataset. As shown, the model produces new samples and does not simply memorize the training dataset.

Figure 11: More uncurated samples from our blur model trained on CelebA. These samples are obtained using our Momentum Sampler with 280 NFEs.

Figure 12: More uncurated samples from our blur model trained on CIFAR-10. These samples are obtained using the Probability Flow Momentum Sampler with 200 NFEs.

Figure 13: Uncurated samples from our masking model trained on CelebA. These samples are obtained using the Probability Flow Momentum Sampler with 1024 NFEs.

import numpy as np

def generate_gaussian_kernel(sigma_blur, half_size=80):
    # Build a (2 * half_size + 1) x (2 * half_size + 1) Gaussian blur kernel,
    # truncated to a fixed support and normalized to sum to one.
    v = np.arange(-half_size, half_size + 1)
    x, y = np.meshgrid(v, v)
    k = np.exp(-(x ** 2 + y ** 2) / (2 * sigma_blur ** 2))
    k = k / np.sum(k)
    return k
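As a usage sketch (not part of the original listing), the kernel can be applied with zero-padding via a standard 2-D convolution; here image is assumed to be a single-channel 2-D array and sigma_blur = 5.0 is an arbitrary illustrative value.

from scipy.signal import convolve2d

kernel = generate_gaussian_kernel(sigma_blur=5.0, half_size=80)
blurred = convolve2d(image, kernel, mode='same', boundary='fill', fillvalue=0)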





Table 3: Results of the DDIM (Song et al., 2021a) sampler on CelebA-64 using our blurring model. The DDIM sampler is particularly effective for a low number of function evaluations and maintains its performance as we increase the number of steps.

Table 4: Results of the Probability Flow Momentum Sampler on CelebA-64 for our blurring models.

Table 5: Results of a Predictor-Corrector (Song et al., 2021b) sampler on CelebA-64 for our blurring models. The Predictor is the DDIM-type sampler and the Corrector our Probability Flow Momentum Sampler.

