ADVERSARIAL SCORE MATCHING AND IMPROVED SAMPLING FOR IMAGE GENERATION

Abstract

Denoising Score Matching with Annealed Langevin Sampling (DSM-ALS) has recently found success in generative modeling. The approach works by first training a neural network to estimate the score of a distribution, and then using Langevin dynamics to sample from the data distribution assumed by the score network. Despite the convincing visual quality of samples, this method appears to perform worse than Generative Adversarial Networks (GANs) under the Fréchet Inception Distance, a standard metric for generative models. We show that this apparent gap vanishes when denoising the final Langevin samples using the score network. In addition, we propose two improvements to DSM-ALS: 1) Consistent Annealed Sampling as a more stable alternative to Annealed Langevin Sampling, and 2) a hybrid training formulation, composed of both Denoising Score Matching and adversarial objectives. By combining these two techniques and exploring different network architectures, we elevate score matching methods and obtain results competitive with state-of-the-art image generation on CIFAR-10.

1. INTRODUCTION

Song and Ermon (2019) recently proposed generating samples from a target distribution by combining Denoising Score Matching (DSM) (Hyvärinen, 2005; Vincent, 2011; Raphan and Simoncelli, 2011) with Annealed Langevin Sampling (ALS) (Welling and Teh, 2011; Roberts et al., 1996). Since ALS guarantees convergence to the target distribution, their approach (DSM-ALS) produces high-quality samples with high diversity. However, this comes at the cost of an iterative sampling process, contrary to most other generative methods. Such score-based generative methods can notably be applied to diverse tasks such as colorization, image restoration, and image inpainting (Song and Ermon, 2019; Kadkhodaie and Simoncelli, 2020). Song and Ermon (2020) further improved their approach by stabilizing score matching training and proposing theoretically grounded choices of hyperparameters. They also scaled the approach to higher-resolution images and showed that DSM-ALS is competitive with other generative models. Song and Ermon (2020) observed that the images produced by their improved model were more visually appealing than those from their original work; however, the reported Fréchet Inception Distance (FID) (Heusel et al., 2017) did not reflect this improvement.

Although DSM-ALS is gaining traction, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) remain the leading approach to generative modeling. GANs have been successfully applied to image generation (Brock et al., 2018; Karras et al., 2017; 2019; 2020) and have subsequently spawned a wealth of variants (Radford et al., 2015a; Miyato et al., 2018; Jolicoeur-Martineau, 2018; Zhang et al., 2019). The idea behind GANs is to train a Discriminator (D) to correctly distinguish real samples from fake samples generated by a second network, known as the Generator (G).
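To make the adversarial objective concrete, the following is a minimal NumPy sketch (not any specific implementation from the literature) of the standard discriminator loss and the non-saturating generator loss of Goodfellow et al. (2014), operating on raw discriminator logits; the function names are illustrative:

```python
import numpy as np

def bce(logits, target):
    # Binary cross-entropy from logits, in the numerically stable form
    # max(l, 0) - l * t + log(1 + exp(-|l|)), averaged over the batch.
    return np.mean(np.maximum(logits, 0.0) - logits * target
                   + np.log1p(np.exp(-np.abs(logits))))

def discriminator_loss(d_real_logits, d_fake_logits):
    # D is trained to output 1 on real samples and 0 on generated ones.
    return bce(d_real_logits, 1.0) + bce(d_fake_logits, 0.0)

def generator_loss(d_fake_logits):
    # Non-saturating form: G is trained to maximize log D(G(z)),
    # i.e., to make the discriminator label its samples as real.
    return bce(d_fake_logits, 1.0)
```

A perfect discriminator (large positive logits on real data, large negative logits on fakes) drives `discriminator_loss` toward zero, while an undecided one (zero logits) leaves the generator at a loss of log 2.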
GANs excel at generating high-quality samples: the discriminator captures features that make an image plausible, while the generator learns to emulate them. Still, GANs often fail to cover all modes of the data distribution, which limits the diversity of the generated samples. A wide variety of tricks has been developed to address this issue (Kodali et al., 2017; Gulrajani et al., 2017; Arjovsky et al., 2017; Miyato et al., 2018; Jolicoeur-Martineau and Mitliagkas, 2019), though it remains a problem to this day. DSM-ALS, on the other hand, does not suffer from this issue, since ALS allows sampling from the full distribution.¹
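The ALS procedure underlying this property can be sketched as follows. This is a schematic illustration, not the authors' implementation: the trained score network is replaced by the analytic score of a 1-D standard normal, and the step-size rule follows the form used by Song and Ermon (2019), alpha_i = eps * sigma_i^2 / sigma_L^2, over a geometric noise schedule:

```python
import numpy as np

def gaussian_score(x, sigma):
    # Stand-in for a trained score network s(x, sigma): the exact score of a
    # standard normal perturbed by N(0, sigma^2) noise is -x / (1 + sigma^2).
    return -x / (1.0 + sigma**2)

def annealed_langevin_sampling(score, sigmas, n_steps=100, eps=2e-5, seed=0):
    # sigmas: decreasing noise levels sigma_1 > ... > sigma_L.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=1000)  # initialize 1000 independent chains
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2  # step size for this level
        for _ in range(n_steps):
            z = rng.normal(size=x.shape)
            # Langevin update: drift along the score plus injected noise
            x = x + 0.5 * alpha * score(x, sigma) + np.sqrt(alpha) * z
    return x

sigmas = np.geomspace(1.0, 0.01, 10)  # geometric annealing schedule
samples = annealed_langevin_sampling(gaussian_score, sigmas)
```

Since the stand-in score corresponds to a standard normal, the chains should end up approximately N(0, 1) distributed; with a trained score network, the same loop samples from the distribution the network has learned.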

