RESTORATION BASED GENERATIVE MODELS

Abstract

Denoising diffusion models (DDMs) have recently attracted increasing attention by showing impressive synthesis quality. DDMs are built on a diffusion process that pushes data to the noise distribution, and the models learn to denoise. In this paper, we establish an interpretation of DDMs in terms of image restoration (IR). Integrating the IR literature allows us to use an alternative objective and diverse forward processes, not confined to the diffusion process. By imposing prior knowledge on the loss function, grounded in MAP estimation, we eliminate the need for the expensive sampling of DDMs. We also propose a multi-scale training scheme that improves performance over the diffusion process by taking advantage of the flexibility of the forward process. Our model improves the quality and efficiency of both training and inference. Furthermore, we show the applicability of our model to inverse problems. We believe that our framework paves the way for designing a new type of flexible general generative model.



1. INTRODUCTION

Generative modeling is a prolific machine learning task in which models learn to describe how a dataset is distributed and to generate new samples from that distribution. The most widely used generative models primarily differ in how they bridge the data distribution to a tractable latent distribution (Goodfellow et al., 2020; Kingma & Welling, 2014; Rezende et al., 2014; Rezende & Mohamed, 2015; Sohl-Dickstein et al., 2015; Chen et al., 2021a). In recent years, denoising diffusion models (DDMs) (Ho et al., 2020; Song & Ermon, 2019; Song et al., 2020b; Dockhorn et al., 2021) have drawn considerable attention by demonstrating remarkable results in terms of both sample quality and likelihood. DDMs rely on a forward diffusion process that progressively transforms the data into Gaussian noise, and they learn to reverse this noising process. Despite their enormous success, the forward process is fixed as a diffusion process, which gives rise to a few limitations. To pull latent variables back to the data distribution, the denoising process requires thousands of network evaluations to sample a single instance. Many follow-up studies consider enhancing inference speed (Song et al., 2020a; Jolicoeur-Martineau et al., 2021; Tachibana et al., 2021) or grafting with other generative models (Xiao et al., 2021a; Vahdat et al., 2021; Zhang & Chen, 2021; Pandey et al., 2022). In this study, we take a different perspective. We interpret DDMs through the lens of image restoration (IR), a family of inverse problems for recovering original images from corrupted ones (Castleman, 1996; Gunturk & Li, 2018). The corruption arises in various forms, including noising (Buades et al., 2005; Rudin et al., 1992), blurring (Biemond et al., 1990), and downsampling (Farsiu et al., 2004).
IR has been a long-standing problem because of its high practical value in various applications (Besag et al., 1991; Banham & Katsaggelos, 1997; Lehtinen et al., 2018; Ma et al., 2011). From an IR point of view, DDMs can be considered IR models based on minimum mean square error (MMSE) estimation (Zervakis & Venetsanopoulos, 1991; Laumont et al., 2022), focusing only on the denoising task. Mathematically, IR is an ill-posed inverse problem in the sense that it does not admit a unique solution and hence leads to instability in reconstruction (Hadamard, 1902). Owing to this ill-posedness, MMSE estimation, which only measures data fidelity, produces impertinent results. DDMs alleviate this problem by leveraging costly Langevin sampling, and this inefficient inference scheme has been regarded as an indispensable tool in the DDM literature. By casting DDMs as IR models, however, the forward process need not be restricted to Gaussian noising, and ill-posedness can be circumvented in ways other than Langevin dynamics. Inspired by this observation, we propose a new flexible family of generative models, which we refer to as restoration-based generative models (RGMs). First, we adopt an alternative objective: maximum a posteriori (MAP) estimation (Trussell, 1980; Hunt, 1977), which is predominantly used in IR. The MAP-based estimator compensates for the ill-posedness by regularizing the data fidelity loss with a prior term. Many established hand-crafted regularization schemes (Tikhonov, 1963; Donoho, 1995; Mallat, 1999; Baraniuk, 2007) encourage solutions to satisfy certain properties, such as smoothness and sparsity. However, for the purpose of density estimation, we implicitly parameterize the prior term as a variational regularization via a GAN (Goodfellow et al., 2020) with a newly introduced random auxiliary variable. Our MAP approach retains the density estimating capability of DDMs at a much smaller computational cost.
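To make the MAP recipe concrete, the following is a minimal numpy sketch of MAP-based restoration by gradient descent. It is purely illustrative and is not the paper's model: the learned variational (GAN) prior is replaced here by a simple hand-crafted Tikhonov regularizer $g(x) = \|x\|_2^2$, and the fidelity term assumes unit noise covariance.

```python
import numpy as np

# MAP restoration sketch: recover x from y = A x + noise by minimising
#   f(x, y) + lam * g(x) = 0.5 * ||A x - y||^2 + lam * ||x||^2
# with plain gradient descent. g is a toy Tikhonov prior; the paper
# instead learns the prior term implicitly with a GAN.

def map_restore(y, A, lam=0.1, lr=0.1, steps=500):
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - y) + 2 * lam * x  # gradient of fidelity + prior
        x -= lr * grad
    return x

rng = np.random.default_rng(0)
x_true = rng.normal(size=16)
A = np.eye(16)                           # pure denoising: A = identity
y = A @ x_true + 0.3 * rng.normal(size=16)
x_hat = map_restore(y, A)
```

With `A = I`, this objective has the closed-form minimizer `y / (1 + 2 * lam)`, so the iteration can be checked against it; with a blur or downsampling matrix for `A`, the same loop performs deblurring or super-resolution.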
Second, unlike DDMs, which are tied to a Gaussian noising process, RGMs can be combined with other general degradation processes. As one instantiation, we design a multi-scale training scheme that resolves the latent inefficiency of DDMs. Because the behavior of generative models is significantly affected by how the data distribution is transformed into a simple distribution, our approach opens the way for designing more flexible generative models. Our comprehensive empirical studies on image generation and inverse problems demonstrate that RGMs generate samples rivaling the quality of DDMs, while inference is several orders of magnitude faster. In particular, our model achieves an FID of 2.47 on CIFAR10 with only seven network function evaluations.
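As a rough illustration of a non-Gaussian forward process in the spirit of the multi-scale idea, the sketch below coarsens an image by average pooling at each step while injecting a small amount of noise, so the data is pushed toward a coarse, noisy latent rather than toward pure Gaussian noise. The function name and schedule are hypothetical, not the paper's exact construction.

```python
import numpy as np

# One step of a multi-scale "degradation" forward process (illustrative):
# 2x2 average pooling halves the resolution, then a little noise is added.

def degrade_step(x, noise_std=0.05, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    h, w = x.shape
    pooled = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # coarsen
    return pooled + noise_std * rng.normal(size=pooled.shape)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 32))
x1 = degrade_step(x0, rng=rng)   # 16x16
x2 = degrade_step(x1, rng=rng)   # 8x8
```

A generative model trained against such a process would learn the corresponding restoration (upsampling plus denoising) step at each scale.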

2. BACKGROUND

Image Restoration  A common inverse problem arising in image processing, including denoising, deblurring, super-resolution, and inpainting, is the estimation of an image $x$ given a corrupted image
$$y = Ax + \xi, \tag{1}$$
where $A$ is a matrix that models the degradation process, such as a blurring or downsampling kernel, and $\xi \sim \mathcal{N}(0, \Sigma)$ is additive noise. This family of problems is known as image restoration (IR). The inference of the image $x$ from the corrupted $y$ is typically ill-posed, in the sense that the inverse problem (1) has multiple valid explanations (Hadamard, 1902). In other words, the corrupted $y$ does not have exactly one restoration $x$, and this is further exacerbated when the noise level is large. To produce consistent results, most methods use the maximum a posteriori (MAP) estimator:
$$x^*_{\mathrm{MAP}} = \operatorname*{argmax}_x \log p(x \mid y) = \operatorname*{argmin}_x f(x, y) + \lambda g(x),$$
where $f(x, y) = -\log p(y \mid x) = \frac{1}{2} \big\| \Sigma^{\dagger\frac{1}{2}} (Ax - y) \big\|_2^2$ is the data fidelity term with the pseudoinverse (Moore, 1920) $\Sigma^{\dagger}$, $g$ is the prior term that encourages the reconstruction to satisfy some prior assumptions on $x$, and the scalar $\lambda \geq 0$ controls the strength of the regularization. The regularization term $g$ is essential because it relieves the ill-posed nature of the inverse problem by imposing assumptions about the desirable solution. Therefore, much research has been devoted to designing a proper $g$ (Rudin et al., 1992; Mallat, 1999; Lunz et al., 2018).

Denoising Generative Models  Denoising diffusion models (DDMs), such as DDPM (Ho et al., 2020) and score matching with Langevin dynamics (Song et al., 2020b), have recently emerged at the forefront of image synthesis research. Starting from the data distribution, they gradually corrupt an image $x_0 \sim p_{\mathrm{data}}$ into Gaussian noise over time through a forward Markovian diffusion process:
$$q(x_{1:T} \mid x_0) = \prod_{t=0}^{T-1} q^{(t)}(x_{t+1} \mid x_t), \quad x_0 \sim p_{\mathrm{data}}.$$
They pose data generation as an iterative denoising procedure $p_\theta^{(t)}(x_t \mid x_{t+1})$, the reverse of the forward diffusion process:
$$p_\theta(x_{0:T}) = p^{(T)}(x_T) \prod_{t=0}^{T-1} p_\theta^{(t)}(x_t \mid x_{t+1}), \quad x_T \sim \mathcal{N}(0, I). \tag{4}$$

Figure 1: Comparison of DDMs and RGMs.
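For intuition about the forward process described above, the following sketch simulates a DDPM-style Markov transition, where each step rescales the sample and injects Gaussian noise, $q^{(t)}(x_{t+1} \mid x_t) = \mathcal{N}(\sqrt{1-\beta_t}\, x_t, \beta_t I)$ (Ho et al., 2020). The variance schedule below is an arbitrary illustrative choice.

```python
import numpy as np

# Forward diffusion sketch: after many noising steps the signal content
# is destroyed and x_T is approximately standard normal, which is what
# makes N(0, I) a valid starting point for the reverse (denoising) chain.

def forward_diffusion(x0, betas, rng):
    x = x0
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=1000) + 5.0          # "data" far from N(0, I)
betas = np.linspace(1e-4, 0.2, 200)       # toy variance schedule
xT = forward_diffusion(x0, betas, rng)    # mean ~ 0, std ~ 1
```

Each transition preserves unit stationary variance (since $(1-\beta)v + \beta = v$ at $v = 1$) while the cumulative scaling $\prod_t \sqrt{1-\beta_t}$ drives the data-dependent mean toward zero.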

