DIFFUSION-GAN: TRAINING GANS WITH DIFFUSION

Abstract

Generative adversarial networks (GANs) are challenging to train stably, and a promising remedy of injecting instance noise into the discriminator input has not been very effective in practice. In this paper, we propose Diffusion-GAN, a novel GAN framework that leverages a forward diffusion chain to generate Gaussian-mixture distributed instance noise. Diffusion-GAN consists of three components: an adaptive diffusion process, a diffusion timestep-dependent discriminator, and a generator. Both the observed and generated data are diffused by the same adaptive diffusion process. Each diffusion timestep has a different noise-to-data ratio, and the timestep-dependent discriminator learns to distinguish the diffused real data from the diffused generated data. The generator learns from the discriminator's feedback by backpropagating through the forward diffusion chain, whose length is adaptively adjusted to balance the noise and data levels. We theoretically show that the discriminator's timestep-dependent strategy gives consistent and helpful guidance to the generator, enabling it to match the true data distribution. We demonstrate the advantages of Diffusion-GAN over strong GAN baselines on various datasets, showing that it can produce more realistic images with higher stability and data efficiency than state-of-the-art GANs.

1. INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) and their variants (Brock et al., 2018; Karras et al., 2019; 2020a; Zhao et al., 2020) have achieved great success in synthesizing photo-realistic high-resolution images. In practice, however, GANs are known to suffer from a variety of issues ranging from non-convergence and training instability to mode collapse (Arjovsky and Bottou, 2017; Mescheder et al., 2018). As a result, a wide array of analyses and modifications has been proposed for GANs, including improving the network architectures (Karras et al., 2019; Radford et al., 2016; Sauer et al., 2021; Zhang et al., 2019), gaining theoretical understanding of GAN training (Arjovsky and Bottou, 2017; Heusel et al., 2017; Mescheder et al., 2017; 2018), changing the objective functions (Arjovsky et al., 2017; Bellemare et al., 2017; Deshpande et al., 2018; Li et al., 2017a; Nowozin et al., 2016; Zheng and Zhou, 2021; Yang et al., 2021), regularizing the weights and/or gradients (Arjovsky et al., 2017; Fedus et al., 2018; Mescheder et al., 2018; Miyato et al., 2018a; Roth et al., 2017; Salimans et al., 2016), utilizing side information (Wang et al., 2018; Zhang et al., 2017; 2020b), adding a mapping from the data to a latent representation (Donahue et al., 2016; Dumoulin et al., 2016; Li et al., 2017b), and applying differentiable data augmentation (Karras et al., 2020a; Zhang et al., 2020a; Zhao et al., 2020). A simple technique to stabilize GAN training is to inject instance noise, i.e., to add noise to the discriminator input, which can widen the support of both the generator and discriminator distributions and prevent the discriminator from overfitting (Arjovsky and Bottou, 2017; Sønderby et al., 2017). However, this technique is hard to implement in practice, as finding a suitable noise distribution is challenging (Arjovsky and Bottou, 2017). Roth et al. (2017) show that adding instance noise to the high-dimensional discriminator input does not work well, and propose to approximate it by adding a zero-centered gradient penalty on the discriminator. This approach is theoretically and empirically shown to converge by Mescheder et al. (2018), who also demonstrate that adding zero-centered gradient penalties to non-saturating GANs can result in stable training and better or comparable generation quality compared to WGAN-GP (Arjovsky et al., 2017). However, Brock et al. (2018) caution that zero-centered gradient penalties and other similar regularization methods may stabilize training at the cost of generation performance. To the best of our knowledge, no existing work has empirically demonstrated the success of using instance noise in GAN training on high-dimensional image data.

To inject proper instance noise that can facilitate GAN training, we introduce Diffusion-GAN, which uses a diffusion process to generate Gaussian-mixture distributed instance noise. We show a graphical representation of Diffusion-GAN in Figure 1. In Diffusion-GAN, the input to the diffusion process is either a real or a generated image, and the diffusion process consists of a series of steps that gradually add noise to the image. The number of diffusion steps is not fixed; it adaptively depends on the data and the generator. We also design the diffusion process to be differentiable, which means that we can compute the derivative of the output with respect to the input. This allows us to propagate the gradient from the discriminator to the generator through the diffusion process, and update the generator accordingly. Unlike vanilla GANs, which compare the real and generated images directly, Diffusion-GAN compares their noisy versions, which are obtained by sampling from the Gaussian-mixture distribution over the diffusion steps, with the help of our timestep-dependent discriminator.
This distribution has the property that its components have different noise-to-data ratios, i.e., some components add more noise than others. Sampling from this distribution yields two benefits. First, it stabilizes training by easing the vanishing-gradient problem, which occurs when the data and generator distributions are too different. Second, it augments the data by creating different noisy versions of the same image, which improves the data efficiency and the diversity of the generator. We provide a theoretical analysis to support our method, and show that the min-max objective function of Diffusion-GAN, which measures the difference between the data and generator distributions, is continuous and differentiable everywhere. This means that the generator can in theory always receive a useful gradient from the discriminator and improve its performance. Our main contributions include: 1) We show both theoretically and empirically how the diffusion process can be utilized to provide a model- and domain-agnostic differentiable augmentation, enabling data-efficient and leaking-free stable GAN training. 2) Extensive experiments show that Diffusion-GAN boosts the stability and generation performance of strong baselines, including StyleGAN2 (Karras et al., 2020b), Projected GAN (Sauer et al., 2021), and InsGen (Yang et al., 2021), achieving state-of-the-art results in synthesizing photo-realistic images, as measured by both the Fréchet Inception Distance (FID) (Heusel et al., 2017) and Recall score (Kynkäänniemi et al., 2019).
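To make the mixture of noise levels concrete, the following sketch applies the usual closed-form Gaussian diffusion marginal to a toy input at a randomly drawn timestep. The linear variance schedule, the chain length T, and the uniform distribution over timesteps are illustrative placeholders only; the paper's method adapts the chain length during training and uses its own timestep distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear variance schedule (placeholder values, not the paper's).
T = 32
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention at each step

def diffuse(x, t):
    """Sample from the closed-form diffusion marginal q(y | x, t).

    The same transformation is applied to real and generated images, so the
    discriminator only ever sees noisy versions of either.
    """
    noise = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar[t]) * x + np.sqrt(1.0 - alpha_bar[t]) * noise

# A toy "image" standing in for either a real sample or a generator output.
x = rng.standard_normal((8, 8))

# Draw a timestep; marginalizing over t makes the noisy input a Gaussian
# mixture whose components have different noise-to-data ratios.
t = int(rng.integers(0, T))
y = diffuse(x, t)

print(y.shape, t)
```

Because `diffuse` is a differentiable function of `x` (noise enters via reparameterization), the same construction lets gradients flow from a discriminator evaluated on `y` back to the generator that produced `x`.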

2. PRELIMINARIES: GANS AND DIFFUSION-BASED GENERATIVE MODELS

GANs (Goodfellow et al., 2014) are a class of generative models that aim to learn the data distribution p(x) of a target dataset by setting up a min-max game between two neural networks: a generator and a discriminator. The generator G takes as input a random noise vector z sampled from a simple prior distribution p(z), such as a standard normal or uniform distribution, and tries to produce realistic-looking samples G(z) that resemble the data. The discriminator D receives either a real sample x ~ p(x) or a generated sample G(z) and is trained to distinguish between the two.
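This game is the familiar min-max objective of Goodfellow et al. (2014):

```latex
\min_{G}\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

At the optimum for a fixed G, the inner maximization recovers the Jensen-Shannon divergence (up to constants) between the data and generator distributions.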



Figure 1: Flowchart for Diffusion-GAN. The top-row images represent the forward diffusion process of a real image, while the bottom-row images represent the forward diffusion process of a generated fake image. The discriminator learns to distinguish a diffused real image from a diffused fake image at all diffusion steps.

