ADVERSARIAL SCORE MATCHING AND IMPROVED SAMPLING FOR IMAGE GENERATION

Abstract

Denoising Score Matching with Annealed Langevin Sampling (DSM-ALS) has recently found success in generative modeling. The approach works by first training a neural network to estimate the score of a distribution, and then using Langevin dynamics to sample from the data distribution assumed by the score network. Despite the convincing visual quality of samples, this method appears to perform worse than Generative Adversarial Networks (GANs) under the Fréchet Inception Distance, a standard metric for generative models. We show that this apparent gap vanishes when denoising the final Langevin samples using the score network. In addition, we propose two improvements to DSM-ALS: 1) Consistent Annealed Sampling as a more stable alternative to Annealed Langevin Sampling, and 2) a hybrid training formulation, composed of both Denoising Score Matching and adversarial objectives. By combining these two techniques and exploring different network architectures, we elevate score matching methods and obtain results competitive with state-of-the-art image generation on CIFAR-10.

1. INTRODUCTION

Song and Ermon (2019) recently proposed a novel method of generating samples from a target distribution through a combination of Denoising Score Matching (DSM) (Hyvärinen, 2005; Vincent, 2011; Raphan and Simoncelli, 2011) and Annealed Langevin Sampling (ALS) (Welling and Teh, 2011; Roberts et al., 1996). Since ALS guarantees convergence to the modeled distribution, their approach (DSM-ALS) produces high-quality samples while ensuring high diversity. However, this comes at the cost of an iterative sampling process, contrary to other generative methods. Such score-based generative methods can notably be applied to diverse tasks such as colorization, image restoration, and image inpainting (Song and Ermon, 2019; Kadkhodaie and Simoncelli, 2020). Song and Ermon (2020) further improved their approach by increasing the stability of score matching training and proposing theoretically sound choices of hyperparameters. They also scaled their approach to higher-resolution images and showed that DSM-ALS is competitive with other generative models. Song and Ermon (2020) observed that the images produced by their improved model were more visually appealing than the ones from their original work; however, the reported Fréchet Inception Distance (FID) (Heusel et al., 2017) did not correlate with this improvement. Although DSM-ALS is gaining traction, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) remain the leading approach to generative modeling. GANs are a very popular class of generative models; they have been successfully applied to image generation (Brock et al., 2018; Karras et al., 2017; 2019; 2020) and have subsequently spawned a wealth of variants (Radford et al., 2015a; Miyato et al., 2018; Jolicoeur-Martineau, 2018; Zhang et al., 2019). The idea behind this method is to train a Discriminator (D) to correctly distinguish real samples from fake samples generated by a second agent, known as the Generator (G).
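For concreteness, ALS can be sketched in a few lines. The following NumPy sketch (names are ours) assumes a `score(x, sigma)` callable approximating the score of the σ-smoothed density, and uses the per-level step size α_i = ε · σ_i² / σ_L² from Song and Ermon (2019); it is an illustration of the sampling loop, not the paper's implementation.

```python
import numpy as np

def annealed_langevin_sampling(score, x, sigmas, eps=2e-5, n_steps=100, rng=None):
    """Annealed Langevin Sampling, sketched with NumPy.

    score(x, sigma): approximates the score of the sigma-smoothed density.
    sigmas: noise levels annealed from largest to smallest.
    """
    rng = np.random.default_rng() if rng is None else rng
    for sigma in sigmas:
        # Step size shrinks with the noise level (alpha_i = eps * sigma_i^2 / sigma_L^2).
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(n_steps):
            z = rng.standard_normal(x.shape)
            # Langevin update: gradient step on log-density plus Gaussian noise.
            x = x + (alpha / 2) * score(x, sigma) + np.sqrt(alpha) * z
    return x
```

With a perfect score function the chain samples from the σ_L-smoothed data distribution; in practice, `score` is the trained network s_θ.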
GANs excel at generating high-quality samples as the discriminator captures features that make an image plausible, while the generator learns to emulate them. Still, GANs often have trouble producing data from all possible modes, which limits the diversity of the generated samples. A wide variety of tricks have been developed to address this issue in GANs (Kodali et al., 2017; Gulrajani et al., 2017; Arjovsky et al., 2017; Miyato et al., 2018; Jolicoeur-Martineau and Mitliagkas, 2019), though it remains an issue to this day. DSM-ALS, on the other hand, does not suffer from that problem since ALS allows for sampling from the full distribution captured by the score network. Nevertheless, the perceptual quality of DSM-ALS higher-resolution images has so far been inferior to that of GAN-generated images.

Generative modeling has since seen some impressive work from Ho et al. (2020), who achieved exceptionally low (better) FID on image generation tasks. Their approach showcased a diffusion-based method (Sohl-Dickstein et al., 2015; Goyal et al., 2017) that shares close ties with DSM-ALS, and additionally proposed a convincing network architecture derived from Salimans et al. (2017).

In this paper, after introducing the necessary technical background in the next section, we build upon the work of Song and Ermon (2020) and propose improvements based on theoretical analyses both at training and sampling time. Our contributions are as follows:

• We propose Consistent Annealed Sampling (CAS) as a more stable alternative to ALS, correcting inconsistencies relating to the scaling of the added noise;

• We show how to recover the expected denoised sample (EDS) and demonstrate its unequivocal benefits w.r.t. the FID. Notably, we show how to resolve the mismatch observed in DSM-ALS between the visual quality of generated images and its high (worse) FID;

• We propose to further exploit the EDS through a hybrid objective function, combining GAN and Denoising Score Matching objectives, thereby encouraging the EDS of the score network to be as realistic as possible.

In addition, we show that the network architecture used by Ho et al. (2020) significantly improves sample quality over the RefineNet (Lin et al., 2017a) architecture used by Song and Ermon (2020). In an ablation study performed on CIFAR-10 and LSUN-church, we demonstrate how these contributions bring DSM-ALS in range of the state-of-the-art for image generation tasks w.r.t. the FID. The code to replicate our experiments is publicly available at [Available in supplementary material].
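For Gaussian corruption, the expected denoised sample mentioned above corresponds to the standard empirical-Bayes (Tweedie) identity, E[x | x̃] = x̃ + σ² ∇_x̃ log q_σ(x̃). A minimal NumPy sketch, in which `score` is a stand-in for the trained score network s_θ (names are ours, not the paper's):

```python
import numpy as np

def denoise(x_noisy, sigma, score):
    """Expected denoised sample via the empirical-Bayes (Tweedie) identity:
    E[x | x_noisy] = x_noisy + sigma^2 * score(x_noisy, sigma)."""
    return x_noisy + sigma ** 2 * score(x_noisy, sigma)
```

With an exact score, a single application of this identity recovers the posterior mean of the clean sample; the paper applies it to the final Langevin iterates before computing the FID.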

2.1. DENOISING SCORE MATCHING

Denoising Score Matching (DSM) (Hyvärinen, 2005) consists of training a score network to approximate the gradient of the log density of a given distribution, ∇_x log p(x), referred to as the score function. This is achieved by training the network to approximate a noisy surrogate of p at multiple levels of Gaussian noise corruption (Vincent, 2011). The score network s_θ, parametrized by θ and conditioned on the noise level σ, is tasked to minimize the following loss:

$$\frac{1}{2}\,\mathbb{E}_{p(\tilde{x}, x, \sigma)}\left[\left\| \sigma s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma} \right\|_2^2\right], \quad (1)$$

where $p(\tilde{x}, x, \sigma) = q_\sigma(\tilde{x}|x)\,p(x)\,p(\sigma)$. We further define $q_\sigma(\tilde{x}|x) = \mathcal{N}(\tilde{x}\,|\,x, \sigma^2 I)$, the corrupted data distribution; $p(x)$, the training data distribution; and $p(\sigma)$, the uniform distribution over a set $\{\sigma_i\}$ corresponding to different levels of noise. In practice, this set is defined as a geometric progression between $\sigma_1$ and $\sigma_L$ (with L chosen according to some computational budget):

$$\{\sigma_i\}_{i=1}^{L} = \left\{\gamma^i \sigma_1 \,\middle|\, i \in \{0, \ldots, L-1\}\right\}, \quad \gamma \triangleq \frac{\sigma_2}{\sigma_1} = \cdots = \frac{\sigma_L}{\sigma_{L-1}} = \left(\frac{\sigma_L}{\sigma_1}\right)^{\frac{1}{L-1}} < 1. \quad (2)$$

Rather than having to learn a different score function for every σ_i, one can train an unconditional score network by defining s_θ(x, σ_i) = s_θ(x)/σ_i and then minimizing Eq. 1. While unconditional networks are computationally lighter, it remains an open question whether conditioning helps performance. Li et al. (2019) and Song and Ermon (2020) found that the unconditional network produced better samples, while Ho et al. (2020) obtained better results than both using a conditional network. Additionally, the denoising autoencoder described in Lim et al. (2020) gives evidence supporting the benefits of conditioning when the noise becomes small (see also App. D and E for a theoretical discussion of the difference). While our experiments are conducted with unconditional networks, we believe our techniques can be straightforwardly applied to conditional networks; we leave that extension for future work.
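As an illustration, the geometric schedule of Eq. 2 and a Monte-Carlo estimate of the objective in Eq. 1 can be sketched in NumPy as follows. Here `score` stands in for a trained s_θ, and the function names are ours, chosen for exposition only.

```python
import numpy as np

def geometric_sigmas(sigma_1, sigma_L, L):
    """Noise levels {sigma_i} = gamma^i * sigma_1 with
    gamma = (sigma_L / sigma_1)^(1 / (L - 1)), as in Eq. 2."""
    gamma = (sigma_L / sigma_1) ** (1.0 / (L - 1))
    return sigma_1 * gamma ** np.arange(L)

def dsm_loss(score, x, sigmas, rng=None):
    """Monte-Carlo estimate of the DSM objective (Eq. 1):
    0.5 * E || sigma * s_theta(x_tilde, sigma) + (x_tilde - x) / sigma ||^2."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample one noise level per example (broadcast over remaining axes).
    sigma = rng.choice(sigmas, size=(x.shape[0],) + (1,) * (x.ndim - 1))
    noise = rng.standard_normal(x.shape)
    x_tilde = x + sigma * noise  # x_tilde ~ q_sigma(x_tilde | x) = N(x, sigma^2 I)
    residual = sigma * score(x_tilde, sigma) + (x_tilde - x) / sigma
    return 0.5 * np.mean(np.sum(residual.reshape(x.shape[0], -1) ** 2, axis=1))
```

The loss vanishes exactly when the network matches the score of the corrupted data distribution, which is what makes the minimizer a valid score estimator.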

