SOFT DIFFUSION: SCORE MATCHING FOR GENERAL CORRUPTIONS

Abstract

We define a broader family of corruption processes that generalizes previously known diffusion models. To reverse these general diffusions, we propose a new objective called Soft Score Matching that provably learns the score function for any linear corruption process and yields state-of-the-art results on CelebA. Soft Score Matching incorporates the degradation process in the network. Our new loss trains the model to predict a clean image that, after corruption, matches the diffused observation. We show that our objective learns the gradient of the likelihood under suitable regularity conditions for a family of corruption processes. We further develop a principled way to select the corruption levels for general diffusion processes and a novel sampling method that we call Momentum Sampler. We show experimentally that our framework works for general linear corruption processes, such as Gaussian blur and masking. We achieve a state-of-the-art FID score of 1.85 on CelebA-64, outperforming all previous linear diffusion models. We also show significant computational benefits compared to vanilla denoising diffusion.

1. INTRODUCTION

Score-based models (Song & Ermon, 2019; 2020; Song et al., 2021b) and Denoising Diffusion Probabilistic Models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021a) are two powerful classes of generative models that produce samples by inverting a diffusion process. These two classes have been unified under a single framework (Song et al., 2021b) and are widely known as diffusion models. Diffusion modeling has found great success in a wide range of applications (Croitoru et al., 2022; Yang et al., 2022), including image (Saharia et al., 2022a; Ramesh et al., 2022; Rombach et al., 2022; Dhariwal & Nichol, 2021), audio (Kong et al., 2021; Richter et al., 2022; Serrà et al., 2022), and video generation (Ho et al., 2022b), as well as solving inverse problems (Daras et al., 2022; Kadkhodaie & Simoncelli, 2021; Kawar et al., 2022; 2021; Jalal et al., 2021; Saharia et al., 2022b; Laumont et al., 2022; Whang et al., 2022; Chung et al., 2022).

Karras et al. (2022) analyze the design space of diffusion models. The authors identify three stages: i) the noise scheduling, ii) the network parametrization (each leading to a different loss function), and iii) the sampling algorithm. We argue that there is one more important step: choosing how to corrupt. Typically, the diffusion adds noise of different magnitudes (sometimes combined with input rescaling). There have been a few recent attempts to use different corruptions (Deasy et al., 2021; Hoogeboom et al., 2022a; b; Avrahami et al., 2022; Nachmani et al., 2021; Johnson et al., 2021; Lee et al., 2022; Ye et al., 2022), but the results are usually inferior to diffusion with additive noise. Moreover, a common framework for properly designing general corruption processes has been missing. We present such a principled framework for learning to invert a general class of corruption processes, and we propose a new objective called Soft Score Matching that provably learns the score for any regular linear corruption process.
Soft Score Matching incorporates the filtering process in the network and trains the model to predict a clean image that, after corruption, matches the diffused observation. Our theoretical results show that Soft Score Matching learns the score (i.e., the likelihood gradients) for corruption processes that satisfy a regularity condition we identify: the diffusion must transform any image into any other image with nonzero likelihood. Using our method with Gaussian blur paired with low-magnitude noise as the diffusion mechanism, we achieve state-of-the-art FID on CelebA (FID 1.85) among linear diffusion models. We also show that our corruption process leads to generative models that sample faster than vanilla Gaussian denoising diffusion.
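The idea of measuring the prediction error after corruption can be sketched as follows. This is a minimal illustration, not the paper's implementation: the blur operator, the model signature, and the per-level weighting (omitted here) are all simplifying assumptions, with a 1-D box blur standing in for the Gaussian blur used in the paper.

```python
import numpy as np

def blur_operator(x, t):
    """Hypothetical linear corruption C_t: a box blur along the last
    axis whose width grows with t (a stand-in for Gaussian blur)."""
    width = 1 + 2 * int(4 * t)  # odd kernel width, growing with t
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), -1, x)

def soft_score_matching_loss(model, x0, t, sigma_t, rng):
    """One-sample Soft Score Matching sketch: the network predicts a
    clean image, and the residual is measured *after* applying C_t,
    so the degradation is part of the objective."""
    noise = rng.standard_normal(x0.shape)
    xt = blur_operator(x0, t) + sigma_t * noise   # diffused observation
    x0_hat = model(xt, t)                         # predicted clean image
    # compare the corrupted prediction with the corrupted ground truth
    return np.mean((blur_operator(x0_hat, t) - blur_operator(x0, t)) ** 2)
```

Note that a perfect prediction drives this loss to zero even though the noisy observation differs from the blurred clean image; the loss only penalizes errors that survive the corruption operator.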

2. BACKGROUND

Diffusion models are generative models that produce samples by inverting a corruption process. The corruption level is typically indexed by a time t, with t = 0 corresponding to clean images and t = 1 to fully corrupted ones. The diffusion process can be discrete or continuous. The two general classes of diffusion models are Score-Based Models (Song & Ermon, 2019; 2020; Song et al., 2021b) and Denoising Diffusion Probabilistic Models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020). The typical diffusion in score-based modeling is additive noise of increasing magnitude. The perturbation kernel at time t is $q_t(x_t \mid x_0) = \mathcal{N}(x_t; \mu = x_0, \Sigma = \sigma_t^2 I)$, where $x_0 \sim q_0$ is a clean image. Score models are trained with the Denoising Score Matching (DSM) objective:
$$\min_\theta \; \mathbb{E}_{t \sim \mathcal{U}[0,1]} \, w_t \, \mathbb{E}_{(x_0, x_t) \sim q_0(x_0)\, q_t(x_t \mid x_0)} \left\| s_\theta(x_t \mid t) - \nabla_{x_t} \log q_t(x_t \mid x_0) \right\|^2,$$
where $w_t$ weights the inner objectives differently across levels. If we train each noise level t independently, then given enough data and model capacity, the network is guaranteed to recover the gradient of the log-likelihood (Vincent, 2011), known as the score function. In other words, the model is trained such that $s_\theta(x_t \mid t) \approx \nabla_{x_t} \log q_t(x_t)$. In practice, we use parameter sharing and conditioning on the time t to learn all the scores with a single network. Once the model is trained, we start from a sample of the final distribution $q_1$ and use the learned score to gradually denoise it (Song & Ermon, 2019; 2020). The final variance $\sigma_1^2$ is selected to be very large so that the distribution $q_1$ is approximately Gaussian, i.e., the signal-to-noise ratio tends to 0. DDPMs corrupt by rescaling the input images and adding noise. The corruption can be modeled as a Markov chain with perturbation kernel $q_t(x_t \mid x_{t-\Delta t}) = \mathcal{N}(x_t; \mu = \sqrt{1 - \beta_t}\, x_{t-\Delta t}, \Sigma = \beta_t I)$.
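For the Gaussian perturbation kernel above, the conditional score has the closed form $\nabla_{x_t} \log q_t(x_t \mid x_0) = (x_0 - x_t)/\sigma_t^2$, which makes the DSM objective easy to compute. The sketch below illustrates a single-level DSM loss; the function names and the model signature are illustrative assumptions, and the time-dependent weighting $w_t$ and expectation over t are omitted for brevity.

```python
import numpy as np

def dsm_loss(score_model, x0, sigma_t, rng):
    """Single-level Denoising Score Matching (Vincent, 2011) sketch.
    For q_t(x_t | x_0) = N(x_0, sigma_t^2 I), the conditional score
    is (x_0 - x_t) / sigma_t^2, i.e. -noise / sigma_t."""
    noise = rng.standard_normal(x0.shape)
    xt = x0 + sigma_t * noise                 # sample from the kernel
    target = (x0 - xt) / sigma_t ** 2         # grad_{x_t} log q_t(x_t | x_0)
    return np.mean((score_model(xt, sigma_t) - target) ** 2)
```

Minimizing this over all noise levels (with shared parameters and conditioning on the level) is what yields a network approximating the unconditional score $\nabla_{x_t} \log q_t(x_t)$.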



Figure 1: Top two rows: demonstration of our generalized diffusion method. Instead of corrupting by only adding noise, we propose a framework that provably learns the score function needed to reverse any linear diffusion (left: blur and noise; right: masking and noise). Our blur-and-noise models achieve a state-of-the-art FID score of 1.85 for linear diffusion models on CelebA-64. Uncurated samples are shown in the last three rows.

