DIFFUSION-GAN: TRAINING GANS WITH DIFFUSION

Abstract

Generative adversarial networks (GANs) are challenging to train stably, and a promising remedy of injecting instance noise into the discriminator input has not been very effective in practice. In this paper, we propose Diffusion-GAN, a novel GAN framework that leverages a forward diffusion chain to generate Gaussian-mixture distributed instance noise. Diffusion-GAN consists of three components: an adaptive diffusion process, a diffusion timestep-dependent discriminator, and a generator. Both the observed and generated data are diffused by the same adaptive diffusion process. At each diffusion timestep, there is a different noise-to-data ratio and the timestep-dependent discriminator learns to distinguish the diffused real data from the diffused generated data. The generator learns from the discriminator's feedback by backpropagating through the forward diffusion chain, whose length is adaptively adjusted to balance the noise and data levels. We theoretically show that the discriminator's timestep-dependent strategy gives consistent and helpful guidance to the generator, enabling it to match the true data distribution. We demonstrate the advantages of Diffusion-GAN over strong GAN baselines on various datasets, showing that it can produce more realistic images with higher stability and data efficiency than state-of-the-art GANs.

1. INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) and their variants (Brock et al., 2018; Karras et al., 2019; 2020a; Zhao et al., 2020) have achieved great success in synthesizing photo-realistic high-resolution images. GANs in practice, however, are known to suffer from a variety of issues ranging from non-convergence and training instability to mode collapse (Arjovsky and Bottou, 2017; Mescheder et al., 2018). As a result, a wide array of analyses and modifications has been proposed for GANs, including improving the network architectures (Karras et al., 2019; Radford et al., 2016; Sauer et al., 2021; Zhang et al., 2019), gaining theoretical understanding of GAN training (Arjovsky and Bottou, 2017; Heusel et al., 2017; Mescheder et al., 2017; 2018), changing the objective functions (Arjovsky et al., 2017; Bellemare et al., 2017; Deshpande et al., 2018; Li et al., 2017a; Nowozin et al., 2016; Zheng and Zhou, 2021; Yang et al., 2021), regularizing the weights and/or gradients (Arjovsky et al., 2017; Fedus et al., 2018; Mescheder et al., 2018; Miyato et al., 2018a; Roth et al., 2017; Salimans et al., 2016), utilizing side information (Wang et al., 2018; Zhang et al., 2017; 2020b), adding a mapping from the data to latent representation (Donahue et al., 2016; Dumoulin et al., 2016; Li et al., 2017b), and applying differentiable data augmentation (Karras et al., 2020a; Zhang et al., 2020a; Zhao et al., 2020).

A simple technique to stabilize GAN training is to inject instance noise, i.e., to add noise to the discriminator input, which can widen the support of both the generator and discriminator distributions and prevent the discriminator from overfitting (Arjovsky and Bottou, 2017; Sønderby et al., 2017). However, this technique is hard to implement in practice, as finding a suitable noise distribution is challenging (Arjovsky and Bottou, 2017). Roth et al. (2017) show that adding instance noise to the high-dimensional discriminator input does not work well, and propose to approximate it by adding a zero-centered gradient penalty on the discriminator. This approach is theoretically and empirically shown to converge in Mescheder et al. (2018), who also demonstrate that adding zero-centered gradient penalties to non-saturating GANs can result in stable training and better or comparable generation quality compared to WGAN-GP (Arjovsky et al., 2017). However, Brock [...]

[...] a real data sample $x$ drawn from $p(x)$ or a fake sample $G(z)$ generated by $G$, and tries to correctly classify them as real or fake. The goal of $G$ is to fool $D$ into making mistakes, while the goal of $D$ is to accurately distinguish $G(z)$ from $x$. The min-max objective function of GANs is given by

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p(x)}[\log(D(x))] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].$$

In practice, this vanilla objective function is often modified to improve the stability and performance of GANs (Goodfellow et al., 2014; Miyato et al., 2018a; Fedus et al., 2018), but the general idea of adversarial learning between $G$ and $D$ remains the same.

Diffusion-based generative models (Ho et al., 2020b; Sohl-Dickstein et al., 2015; Song and Ermon, 2019) assume $p_\theta(x_0) := \int p_\theta(x_{0:T})\,dx_{1:T}$, where $x_1, \ldots, x_T$ are latent variables of the same dimensionality as the data $x_0 \sim p(x_0)$. There is a forward diffusion chain that gradually adds noise to the data $x_0 \sim q(x_0)$ in $T$ steps with pre-defined variance schedule $\beta_t$ and variance $\sigma^2$:

$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \sigma^2 I).$$

A notable property is that $x_t$ at an arbitrary timestep $t$ can be sampled in closed form as

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)\sigma^2 I), \quad \text{where } \alpha_t := 1 - \beta_t, \quad \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s. \tag{1}$$

A variational lower bound (Blei et al., 2017) is then used to optimize the reverse diffusion chain as

$$p_\theta(x_{0:T}) := \mathcal{N}(x_T; 0, \sigma^2 I) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).$$
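Equation (1) means that a diffused sample at any step can be drawn directly from the clean data, without simulating the chain step by step. The following is a minimal NumPy sketch of that property. The linear β schedule, its endpoints, and the function names `make_alpha_bar` and `diffuse` are illustrative assumptions, not values or names taken from the text above.

```python
import numpy as np

def make_alpha_bar(beta_start=1e-4, beta_end=2e-2, num_steps=1000):
    """Cumulative products alpha_bar_t for an ASSUMED linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)  # entry t-1 holds alpha_bar_t

def diffuse(x0, t, alpha_bar, sigma=1.0, rng=None):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * sigma^2 * I)."""
    rng = np.random.default_rng() if rng is None else rng
    a_bar = alpha_bar[t - 1]  # timesteps are 1-indexed, arrays are 0-indexed
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * sigma * eps

alpha_bar = make_alpha_bar()
x0 = np.ones((4, 8))  # stand-in for a batch of data
x_t = diffuse(x0, t=500, alpha_bar=alpha_bar, rng=np.random.default_rng(0))
print(x_t.shape, float(alpha_bar[499]))
```

Because only the cumulative product is needed, the cost of diffusing to step t does not depend on t.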

3. DIFFUSION-GAN: METHOD AND THEORETICAL ANALYSIS

To construct Diffusion-GAN, we describe how to inject instance noise via diffusion, how to train the generator by backpropagating through the forward diffusion process, and how to adaptively adjust the diffusion intensity. We further provide theoretical analysis illustrated with a toy example.

3.1. INSTANCE NOISE INJECTION VIA DIFFUSION

We aim to generate realistic samples $x_g$ from a generator network $G$ that maps a latent variable $z$ sampled from a simple prior distribution $p(z)$ to a high-dimensional data space, such as images. The distribution of generator samples $x_g = G(z)$, $z \sim p(z)$ is denoted by $p_g(x) = \int p(x_g \mid z)\,p(z)\,dz$. To make the generator more robust and diverse, we inject instance noise into the generated samples $x_g$ by applying a diffusion process that adds Gaussian noise at each step. The diffusion process can be seen as a Markov chain that starts from the original sample $x$ and gradually erases its information until reaching a noise level $\sigma^2$ after $T$ steps.

We define a mixture distribution $q(y \mid x)$ that models the noisy samples $y$ obtained at any step of the diffusion process, with a mixture weight $\pi_t$ for each step $t$. The mixture components $q(y \mid x, t)$ are Gaussian distributions with mean proportional to $x$ and variance depending on the noise level at step $t$. We use the same diffusion process and mixture distribution for both the real samples $x \sim p(x)$ and the generated samples $x_g \sim p_g(x)$. More specifically, the diffusion-induced mixture distributions are expressed as

$$x \sim p(x), \quad y \sim q(y \mid x), \quad q(y \mid x) := \sum_{t=1}^{T} \pi_t\, q(y \mid x, t),$$
$$x_g \sim p_g(x), \quad y_g \sim q(y_g \mid x_g), \quad q(y_g \mid x_g) := \sum_{t=1}^{T} \pi_t\, q(y_g \mid x_g, t),$$

where $q(y \mid x)$ is a $T$-component mixture distribution, the mixture weights $\pi_t$ are non-negative and sum to one, and the mixture components $q(y \mid x, t)$ are obtained via diffusion as in Equation (1), expressed as

$$q(y \mid x, t) = \mathcal{N}(y; \sqrt{\bar{\alpha}_t}\, x, (1 - \bar{\alpha}_t)\sigma^2 I). \tag{2}$$

Samples from this mixture can be drawn as $t \sim p_\pi := \mathrm{Discrete}(\pi_1, \ldots, \pi_T)$, $y \sim q(y \mid x, t)$. By sampling $y$ from this mixture distribution, we can obtain noisy versions of both real and generated samples with varying degrees of noise. The more steps we take in the diffusion process, the more noise we add to $y$ and the less information we preserve from $x$.
We can then use this diffusion-induced mixture distribution to train a timestep-dependent discriminator D that distinguishes between real and generated noisy samples, and a generator G that matches the distribution of generated noisy samples to the distribution of real noisy samples. Next we introduce Diffusion-GAN that trains its discriminator and generator with the help of the diffusion-induced mixture distribution.
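The two-stage sampling described above (first a timestep from the discrete distribution, then a Gaussian perturbation at that step) can be sketched as follows. This is an illustrative sketch only: the schedule, the value T = 64, uniform weights, and the name `sample_mixture` are assumptions introduced here, and the random arrays stand in for real data and generator outputs.

```python
import numpy as np

def sample_mixture(x, alpha_bar, pi, sigma=1.0, rng=None):
    """Per example: t ~ Discrete(pi), then y ~ N(sqrt(a_bar_t) x, (1 - a_bar_t) sigma^2 I)."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(pi)
    t = rng.choice(np.arange(1, T + 1), size=x.shape[0], p=pi)
    # Reshape alpha_bar_t so it broadcasts over all non-batch dimensions.
    a_bar = alpha_bar[t - 1].reshape(-1, *([1] * (x.ndim - 1)))
    eps = rng.standard_normal(x.shape)
    y = np.sqrt(a_bar) * x + np.sqrt(1.0 - a_bar) * sigma * eps
    return y, t

T = 64                                                   # assumed chain length
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, T)) # assumed linear schedule
pi = np.full(T, 1.0 / T)                                 # uniform mixture weights

rng = np.random.default_rng(0)
x_real = rng.standard_normal((16, 3, 8, 8))  # placeholder for real images
x_fake = rng.standard_normal((16, 3, 8, 8))  # placeholder for G(z)

# The same diffusion process is applied to real and generated samples.
y_real, t_real = sample_mixture(x_real, alpha_bar, pi, rng=rng)
y_fake, t_fake = sample_mixture(x_fake, alpha_bar, pi, rng=rng)
print(y_real.shape, t_real[:5], t_fake[:5])
```

The sampled `t` is returned alongside `y` because the timestep-dependent discriminator takes both as input.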

3.2. ADVERSARIAL TRAINING

Diffusion-GAN trains its generator and discriminator by solving a min-max game with the objective

$$V(G, D) = \mathbb{E}_{x \sim p(x),\, t \sim p_\pi,\, y \sim q(y \mid x, t)}[\log(D_\phi(y, t))] + \mathbb{E}_{z \sim p(z),\, t \sim p_\pi,\, y_g \sim q(y \mid G_\theta(z), t)}[\log(1 - D_\phi(y_g, t))]. \tag{3}$$

Here, $p(x)$ is the true data distribution, $p_\pi$ is a discrete distribution that assigns different weights $\pi_t$ to each diffusion step $t \in \{1, \ldots, T\}$, and $q(y \mid x, t)$ is the conditional distribution of the perturbed sample $y$ given the original data $x$ and the diffusion step $t$. By Equation (2), with Gaussian reparameterization, the perturbation function can be written as $y = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \sigma \epsilon$, where $1 - \bar{\alpha}_t = 1 - \prod_{s=1}^{t} \alpha_s$ is the cumulative noise level at step $t$, $\sigma$ is a scale factor, and $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise.

The objective function in Equation (3) encourages the discriminator to assign high probabilities to the perturbed real data and low probabilities to the perturbed generated data, for any diffusion step $t$. The generator, on the other hand, tries to produce samples that can deceive the discriminator at any diffusion step $t$. Note that the perturbed generated sample $y_g \sim q(y \mid G_\theta(z), t)$ can be rewritten as $y_g = \sqrt{\bar{\alpha}_t}\, G_\theta(z) + \sqrt{1 - \bar{\alpha}_t}\, \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. This means that the objective function in Equation (3) is differentiable with respect to the generator parameters, and we can use gradient descent to optimize it with back-propagation.

The objective function in Equation (3) is similar to the one used by the original GAN (Goodfellow et al., 2014), except that it involves the diffusion steps and the perturbation functions. We can show that this objective function also minimizes an approximation of the Jensen-Shannon (JS) divergence between the true and the generated distributions, but with respect to the perturbed samples and the diffusion steps, as follows:

$$D_{JS}(p(y, t)\,\|\,p_g(y, t)) = \mathbb{E}_{t \sim p_\pi}[D_{JS}(p(y \mid t)\,\|\,p_g(y \mid t))]. \tag{4}$$

The JS divergence measures the dissimilarity between two probability distributions, and it reaches its minimum value of zero when the two distributions are identical. The proof of the equality in Equation (4) is given in Appendix C. A natural question that arises from this result is whether minimizing the JS divergence between the perturbed distributions implies minimizing the JS divergence between the original distributions, i.e., whether the optimal generator for Equation (3) is also the optimal generator for $D_{JS}(p(x)\,\|\,p_g(x))$. We will answer this question affirmatively and provide a theoretical justification in Section 3.4.
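The equality in Equation (4) can be checked in a few lines. The following is a sketch of the argument rather than a reproduction of the proof in Appendix C; the shorthand $m$ for the equal-weight mixture of the two distributions is introduced here for illustration. The key observation is that both joint distributions share the same timestep marginal $\pi_t$, so it cancels inside the logarithm.

```latex
\begin{align*}
p(y, t) &= \pi_t\, p(y \mid t), \qquad p_g(y, t) = \pi_t\, p_g(y \mid t), \\
m(y, t) &:= \tfrac{1}{2}\bigl(p(y, t) + p_g(y, t)\bigr) = \pi_t\, m(y \mid t),
  \qquad m(y \mid t) := \tfrac{1}{2}\bigl(p(y \mid t) + p_g(y \mid t)\bigr), \\
D_{KL}\bigl(p(y, t)\,\|\,m(y, t)\bigr)
  &= \sum_{t=1}^{T} \pi_t \int p(y \mid t)\,
     \log \frac{\pi_t\, p(y \mid t)}{\pi_t\, m(y \mid t)}\, dy
   = \mathbb{E}_{t \sim p_\pi}\bigl[ D_{KL}\bigl(p(y \mid t)\,\|\,m(y \mid t)\bigr) \bigr], \\
D_{JS}\bigl(p(y, t)\,\|\,p_g(y, t)\bigr)
  &= \tfrac{1}{2} D_{KL}\bigl(p(y, t)\,\|\,m(y, t)\bigr)
   + \tfrac{1}{2} D_{KL}\bigl(p_g(y, t)\,\|\,m(y, t)\bigr) \\
  &= \mathbb{E}_{t \sim p_\pi}\bigl[ D_{JS}\bigl(p(y \mid t)\,\|\,p_g(y \mid t)\bigr) \bigr].
\end{align*}
```

The same cancellation applies to the second KL term with $p_g$ in place of $p$, which gives the last line.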

3.3. ADAPTIVE DIFFUSION

With the help of the perturbation function and timestep dependency, we have a new strategy to optimize the discriminator. We want the discriminator $D$ to have a challenging task, neither too easy to allow overfitting the data (Karras et al., 2020a; Zhao et al., 2020) nor too hard to impede learning. Therefore, we adjust the intensity of the diffusion process, which adds noise to both $y$ and $y_g$, depending on how much $D$ can distinguish them. When the diffusion step $t$ is larger, the noise-to-data ratios are higher and the task is harder. We use $1 - \bar{\alpha}_t$ to measure the intensity of the diffusion, which increases as $t$ grows. To control the diffusion intensity, we adaptively modify the maximum number of steps $T$.

Our strategy is to make the discriminator learn from the easiest samples first, which are the original data samples, and then gradually increase the difficulty by feeding it samples from larger $t$. To do this, we use a self-paced schedule for $T$, which depends on a metric $r_d$ that estimates how much the discriminator overfits to the data:

$$r_d = \mathbb{E}_{y, t \sim p(y, t)}[\mathrm{sign}(D_\phi(y, t) - 0.5)], \qquad T = T + \mathrm{sign}(r_d - d_{target}) \cdot C,$$

where $r_d$ is the same as in Karras et al. (2020a) and $C$ is a constant. We calculate $r_d$ and update $T$ every four minibatches. We have two options for the distribution $p_\pi$ that we use to sample $t$ for the diffusion process:

$$t \sim p_\pi := \begin{cases} \text{uniform:} & \mathrm{Discrete}\left(\frac{1}{T}, \frac{1}{T}, \ldots, \frac{1}{T}\right), \\ \text{priority:} & \mathrm{Discrete}\left(\frac{1}{\sum_{t=1}^{T} t}, \frac{2}{\sum_{t=1}^{T} t}, \ldots, \frac{T}{\sum_{t=1}^{T} t}\right). \end{cases}$$

The 'priority' option gives more weight to larger $t$, which means the discriminator will see more new samples from the new steps when $T$ increases. This is because we want the discriminator to focus on the new and harder samples that it has not seen before, as this indicates that it is confident about the easier ones. Note that even with the 'priority' option, the discriminator can still see samples from smaller $t$, because $q(y \mid x)$ is a mixture of Gaussians that covers all steps of the diffusion chain.
To avoid sudden changes in T during training, we use an exploration list t epl that contains t values sampled from p π . We keep t epl fixed until we update T , and we sample t from t epl to generate noisy samples for the discriminator. This way, the model can explore each t sufficiently before moving to a higher T . We give the details of training Diffusion-GAN in Algorithm 1 in Appendix F.
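The adaptive schedule can be sketched as below. This is a hedged illustration of the update rule and the two timestep distributions, not the training procedure of Algorithm 1. The default values of `d_target` and `C`, the clamping bounds `T_min` and `T_max`, the exploration-list size, and all function names are assumptions introduced here; the text above does not specify them.

```python
import numpy as np

def timestep_probs(T, mode="priority"):
    """Weights pi_1..pi_T: uniform, or proportional to t ('priority')."""
    if mode == "uniform":
        return np.full(T, 1.0 / T)
    if mode == "priority":
        t = np.arange(1, T + 1, dtype=float)
        return t / t.sum()
    raise ValueError(f"unknown mode: {mode}")

def overfit_metric(d_probs_on_diffused_real):
    """r_d = mean of sign(D(y, t) - 0.5) over diffused real samples."""
    return float(np.mean(np.sign(d_probs_on_diffused_real - 0.5)))

def update_T(T, r_d, d_target=0.6, C=8, T_min=8, T_max=1000):
    """T <- T + sign(r_d - d_target) * C, clamped to an ASSUMED range."""
    T_new = T + int(np.sign(r_d - d_target)) * C
    return int(np.clip(T_new, T_min, T_max))

def make_exploration_list(T, size=64, mode="priority", rng=None):
    """Fixed list of t values drawn from p_pi, reused until T is next updated."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(np.arange(1, T + 1), size=size, p=timestep_probs(T, mode))

T = 100
d_out = np.array([0.9, 0.8, 0.3, 0.7])  # placeholder discriminator outputs
r_d = overfit_metric(d_out)
T = update_T(T, r_d)
t_epl = make_exploration_list(T, rng=np.random.default_rng(0))
print(r_d, T, t_epl[:8])
```

When `r_d` exceeds the target the discriminator is judged too confident and the chain is lengthened; when it falls below the target the chain is shortened.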

3.4. THEORETICAL ANALYSIS WITH EXAMPLES

To better understand the theoretical properties of our proposed method, we present two theorems that address two important questions about the use of diffusion-based instance noise injection for training GANs. The proofs of these theorems are deferred to Appendix B. The first question, denoted as (a), is whether adding noise to the real and generated samples in a diffusion process can facilitate the learning. The second question, denoted as (b), is whether minimizing the JS divergence between the joint distributions of the noisy samples and the noise levels, $p(y, t)$ and $p_g(y, t)$, can lead to the same optimal generator as minimizing the JS divergence between the original distributions of the real and generated samples, $p(x)$ and $p_g(x)$.

To answer (a), we prove that for any choice of noise level $t$ and any choice of convex function $f$, the $f$-divergence (Nowozin et al., 2016) between the marginal distributions of the noisy real and generated samples, $q(y \mid t)$ and $q_g(y \mid t)$, is a smooth function that can be computed and optimized by the discriminator. This implies that the diffusion-based noise injection does not introduce any singularity or discontinuity in the objective function of the GAN. The JS divergence is a special case of $f$-divergence, where $f(u) = -\log(2u) - \log(2 - 2u)$.

Theorem 1 (Valid gradients anywhere for GANs training). Let $p(x)$ be a fixed distribution over $\mathcal{X}$ and $z$ be a random noise over another space $\mathcal{Z}$. Denote $G_\theta : \mathcal{Z} \to \mathcal{X}$ as a function with parameter $\theta$ and input $z$, and $p_g(x)$ as the distribution of $G_\theta(z)$. Let $q(y \mid x, t) = \mathcal{N}(y; \sqrt{\bar{\alpha}_t}\, x, (1 - \bar{\alpha}_t)\sigma^2 I)$, where $\bar{\alpha}_t \in (0, 1)$ and $\sigma > 0$. Let $q(y \mid t) = \int p(x)\, q(y \mid x, t)\, dx$ and $q_g(y \mid t) = \int p_g(x)\, q(y \mid x, t)\, dx$. Then, $\forall t$, if the function $G_\theta$ is continuous and differentiable, the $f$-divergence $D_f(q(y \mid t)\,\|\,q_g(y \mid t))$ is continuous and differentiable with respect to $\theta$.
Theorem 1 shows that with the help of diffusion noise injection by $q(y \mid x, t)$, $\forall t$, $y$ and $y_g$ are defined on the same support space, the whole $\mathcal{X}$, and $D_f(q(y \mid t)\,\|\,q_g(y \mid t))$ is continuous and differentiable everywhere. A natural follow-up question is what happens if $D_f(q(y \mid t)\,\|\,q_g(y \mid t))$ stays nearly constant and hence provides little useful gradient. We show empirically, via the toy example below, that by injecting noise through a mixture defined over all steps of the diffusion chain, there is always a good chance that a sufficiently large $t$ is sampled to provide a useful gradient.

Toy example. We use the same simple example from Arjovsky et al. (2017) to illustrate our method. Let $x = (0, z)$ be the real data and $x_g = (\theta, z)$ be the data generated by a one-parameter generator, where $z$ is a uniform random variable in $[0, 1]$. The JS divergence between the real and the generated distributions, $D_{JS}(p(x)\,\|\,p(x_g))$, is discontinuous: it is zero when $\theta = 0$ and $\log 2$ otherwise, so it does not provide a useful gradient to guide $\theta$ towards zero. We introduce diffusion-based noise to both the real and the generated data, as shown in the first row of Figure 2. The noisy data, $y$ and $y_g$, have supports that cover the whole space $\mathbb{R}^2$, and their densities overlap more or less depending on the diffusion step $t$. In the second row, left, of Figure 2, we plot how the JS divergence between the noisy distributions, $D_{JS}(q(y \mid t)\,\|\,q_g(y \mid t))$, varies with $\theta$ for different $t$ values. The black line with $t = 0$ is the original JS divergence, which has a discontinuity at $\theta = 0$. As $t$ increases, the JS divergence curves become smoother and have nonzero gradients for a larger range of $\theta$. However, some values of $t$, such as $t = 200$ in this example, still have flat regions where the JS divergence is nearly constant. To avoid this, we use a mixture of all steps to ensure that there is always a high chance of getting informative gradients.
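The smoothing effect in the toy example can be reproduced numerically. In this sketch the second coordinate of the noisy real and generated data has the same distribution and is independent of the first coordinate, so the JS divergence reduces to the one between two 1-D Gaussians, $\mathcal{N}(0, s^2)$ and $\mathcal{N}(\sqrt{\bar{\alpha}_t}\,\theta, s^2)$ with $s^2 = (1 - \bar{\alpha}_t)\sigma^2$. The grid-based integration, the chosen $\bar{\alpha}_t$ values, and the function names are assumptions made for illustration; they are not the settings used for Figure 2.

```python
import numpy as np

def js_divergence_1d(mu, s, half_width=12.0, n=20001):
    """JS divergence (nats) between N(0, s^2) and N(mu, s^2) via a Riemann sum."""
    lo = min(0.0, mu) - half_width * s
    hi = max(0.0, mu) + half_width * s
    u = np.linspace(lo, hi, n)
    du = u[1] - u[0]
    norm = s * np.sqrt(2.0 * np.pi)
    p = np.exp(-0.5 * (u / s) ** 2) / norm
    q = np.exp(-0.5 * ((u - mu) / s) ** 2) / norm
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # skip points where the density underflows to zero
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) * du

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def noisy_toy_js(theta, a_bar, sigma=1.0):
    """JS divergence between diffused real (0, z) and generated (theta, z) data."""
    mu = np.sqrt(a_bar) * theta       # shift of the first coordinate after scaling
    s = np.sqrt(1.0 - a_bar) * sigma  # noise standard deviation at this step
    return js_divergence_1d(mu, s)

for a_bar in (0.999, 0.9, 0.5):  # smaller a_bar corresponds to a larger t
    values = [noisy_toy_js(theta, a_bar) for theta in (0.0, 0.5, 1.0, 2.0)]
    print(a_bar, np.round(values, 4))
```

With little noise the divergence saturates near log 2 for almost every nonzero θ, whereas with more noise it varies gradually with θ, which is the behavior the mixture over timesteps is meant to exploit.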
For the discriminator optimization, as shown in the second row, right, of Figure 2, the optimal discriminator under the original JS divergence is discontinuous and unattainable. With diffusion-based noise, the optimal discriminator changes with $t$: a smaller $t$ makes it more confident and a larger $t$ makes it more cautious. Thus the diffusion acts like a scale to balance the power of the discriminator. This suggests the use of a differentiable forward diffusion chain that can provide various levels of gradient smoothness to help the generator training.

Theorem 2 (Non-leaking noise injection). Let $x \sim p(x)$, $y \sim q(y \mid x)$ and $x_g \sim p_g(x)$, $y_g \sim q(y_g \mid x_g)$, where $q(y \mid x)$ is the transition density. Given certain $q(y \mid x)$, if $y$ could be reparameterized into $y = f(x) + h(\epsilon)$, $\epsilon \sim p(\epsilon)$, where $p(\epsilon)$ is a known distribution, and both $f$ and $h$ are one-to-one mapping functions, then we could have $p(y) = p_g(y) \Leftrightarrow p(x) = p_g(x)$.

To answer question (b), we present Theorem 2, which shows a sufficient condition for the equality of the original and the augmented data distributions. In Theorem 2, the function $f$ maps each $x$ to a unique $y$, the function $h$ maps each $\epsilon$ to a unique noise term, and the distribution of $\epsilon$ is known and independent of $x$. Under these assumptions, the theorem proves that the distribution of $y$ is the same as the distribution of $y_g$ if and only if the distribution of $x$ is the same as the distribution of $x_g$. If we take $y \mid t$ as the $y$ introduced in the theorem, then for $\forall t$, Equation (2) fits the assumption made. This means that, by minimizing the divergence between $q(y \mid t)$ and $q_g(y \mid t)$, which is the same as minimizing the divergence between $p(x) \mid t$ and $p_g(x) \mid t$, we are also minimizing the divergence between $p(x)$ and $p_g(x)$. This implies that the noise injection does not affect the quality of the generated samples, and we can safely use our noise injection to improve the training of the generative model.

3.5. RELATED WORK

The proposed Diffusion-GAN can be related to previous works on stabilizing GAN training, building diffusion-based generative models, and constructing differentiable augmentation for data-efficient GAN training. A detailed discussion of these related works is deferred to Appendix A.

4. EXPERIMENTS

We conduct extensive experiments to answer the following questions: (a) Will Diffusion-GAN outperform state-of-the-art GAN baselines on benchmark datasets? (b) Will the diffusion-based noise injection help the learning of GANs in domain-agnostic tasks? (c) Will our method improve the performance of data-efficient GANs trained with a very limited amount of data?

Datasets. We conduct experiments on image datasets ranging from low-resolution (e.g., 32 × 32) to high-resolution (e.g., 1024 × 1024) and from low-diversity to high-diversity: CIFAR-10 (Krizhevsky, 2009), STL-10 (Coates et al., 2011), LSUN-Bedroom (Yu et al., 2015), LSUN-Church (Yu et al., 2015), AFHQ (Cat/Dog/Wild) (Choi et al., 2020), and FFHQ (Karras et al., 2019). More details on these benchmark datasets are provided in Appendix E.

Evaluation protocol. We measure image quality using FID (Heusel et al., 2017). Following Karras et al. (2019; 2020b), we measure FID using 50k generated samples, with the full training set used as reference. We use the number of real images shown to the discriminator to evaluate convergence (Karras et al., 2020a; Sauer et al., 2021). Unless specified otherwise, all models are trained with 25 million images to ensure convergence (those trained with more or fewer images are specified in table captions). We further report the improved Recall score introduced by Kynkäänniemi et al. (2019) to measure the sample diversity of generative models.

Implementations and resources. We build Diffusion-GANs based on the code of StyleGAN2 (Karras et al., 2020b), ProjectedGAN (Sauer et al., 2021), and InsGen (Yang et al., 2021) to answer questions (a), (b), and (c), respectively. Diffusion-GANs inherit from their corresponding base GANs all their network architectures and training hyperparameters, whose details are provided in Appendix G. Specifically for StyleGAN2 and InsGen, we construct the discriminator as $D_\phi(y, t)$, where $t$ is injected via their mapping network. For ProjectedGAN, we empirically find that $t$ in the discriminator could be ignored to simplify the implementation and minimize the modifications to ProjectedGAN. More implementation details are provided in Appendix H. By applying our diffusion-based noise injection, we denote our models as Diffusion StyleGAN2/ProjectedGAN/InsGen. In the following experiments, we train related models with their official code if the results are unavailable, while others are all reported from references and marked with *. We run all our experiments with either 4 or 8 NVIDIA V100 GPUs depending on the demands of the inherited training configurations.

4.1. COMPARISON TO STATE-OF-THE-ART GANS

We compare Diffusion-GAN with its state-of-the-art GAN backbone, StyleGAN2 (Karras et al., 2020b), and to evaluate its effectiveness from the data augmentation perspective, we compare it with both StyleGAN2 + DiffAug (Zhao et al., 2020) and StyleGAN2 + ADA (Karras et al., 2020a), in terms of both sample fidelity (FID) and sample diversity (Recall) over extensive benchmark datasets. We present the quantitative and qualitative results in Table 1 and Figure 3. Qualitatively, the generated images from Diffusion StyleGAN2 are all photo-realistic and have good diversity, ranging from low-resolution (32 × 32) to high-resolution (1024 × 1024). Additional randomly generated images can be found in Appendix L. Quantitatively, Diffusion StyleGAN2 outperforms all the GAN baselines in generation diversity, as measured by Recall, on all 6 benchmark datasets and outperforms them in FID by a clear margin on 5 out of the 6 benchmark datasets.

From the data augmentation perspective, we observe that Diffusion StyleGAN2 always clearly outperforms the backbone model StyleGAN2 across various datasets, which empirically validates our Theorem 2. By contrast, both the ADA (Karras et al., 2020a) and DiffAug (Zhao et al., 2020) techniques could sometimes impair the generation performance on sufficiently large datasets, e.g., LSUN-Bedroom and LSUN-Church, which is also observed by Yang et al. (2021) on FFHQ. This is possibly because their risk of leaking augmentation overshadows the benefits of data augmentation.

To investigate how the adaptive diffusion process works during training, we illustrate in Figure 4 the convergence of the maximum timestep $T$ in our adaptive diffusion and the discriminator outputs. We see that $T$ is adaptively adjusted: the $T$ for Diffusion StyleGAN2 increases as training progresses, while the $T$ for Diffusion ProjectedGAN first goes up and then goes down. Note that $T$ is adjusted according to the overfitting status of the discriminator.
The second panel shows that trained with the diffusion-based mixture distribution, the discriminator is always well behaved and provides useful learning signals for the generator, which validates our analysis in Section 3.4 and Theorem 1. Memory and time costs. Generally speaking, the memory and time costs of a Diffusion-GAN are comparable to those of the corresponding GAN baseline. More specifically, switching from ADA (Karras et al., 2020a) to our diffusion-based augmentation, the added memory cost is negative, the added training time cost is negative, and the added inference time cost is zero. For example, for CIFAR-10, with four NVIDIA V100 GPUs, the training time for each 4k images is around 8.0s for StyleGAN2, 9.8s for StyleGAN2-ADA, and 9.5s for Diffusion-StyleGAN2.

4.2. EFFECTIVENESS OF DIFFUSION-GAN FOR DOMAIN-AGNOSTIC AUGMENTATION

To verify whether our method is domain-agnostic, we apply Diffusion-GAN onto the input feature vectors of GANs. We conduct experiments on both low-dimensional and high-dimensional feature vectors, for which commonly used image augmentation methods are no longer applicable.

25-Gaussians Example.

We conduct experiments on the popular 25-Gaussians generation task. The 25-Gaussians dataset is a 2-D toy dataset, generated by a mixture of 25 two-dimensional Gaussian distributions. Each data point is a 2-dimensional feature vector. We train a small GAN model, whose generator and discriminator are both parameterized by multilayer perceptrons (MLPs), with two 128-unit hidden layers and LeakyReLU nonlinearities. The training results are shown in Figure 5. We observe that the vanilla GAN exhibits severe mode collapse, capturing only a few modes. Its discriminator outputs for real and fake samples depart from each other very quickly. This implies that the discriminator overfits strongly and stops providing useful learning signals for the generator. By contrast, Diffusion-GAN successfully captures all 25 Gaussian modes, and the discriminator is kept under control so that it continuously provides useful learning signals. We interpret the improvement from two perspectives: first, non-leaking augmentation helps provide more information about the data space; second, the discriminator is well behaved given the adaptively adjusted diffusion-based noise injection.

Figure 5: The 25-Gaussians example. We show the true data samples, the generated samples from vanilla GANs, the discriminator outputs of the vanilla GANs, the generated samples from our Diffusion-GAN, and the discriminator outputs of Diffusion-GAN.

ProjectedGAN. To verify that our adaptive diffusion-based noise injection could benefit the learning of GANs on high-dimensional feature vectors, we directly apply it to the discriminator feature space of ProjectedGAN (Sauer et al., 2021). ProjectedGANs generally leverage pre-trained neural networks to extract meaningful features for the adversarial learning of the discriminator and generator. Following Sauer et al. (2021), we adaptively diffuse the feature vectors extracted by EfficientNet-v0 and keep all the other training parts unchanged. We report the performance of Diffusion ProjectedGAN on several benchmark datasets in Table 2, which verifies that our augmentation method is domain-agnostic. Under the ProjectedGAN framework, we see that with noise properly injected into the high-dimensional feature space, Diffusion ProjectedGAN shows clear improvement in terms of both FID and Recall. We reach state-of-the-art FID results with Diffusion ProjectedGAN on STL-10 and LSUN-Bedroom/Church datasets.
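For readers who want to reproduce the 25-Gaussians setting described above, the data can be generated in a few lines. The text only states that the data come from a mixture of 25 two-dimensional Gaussians, so the 5 × 5 grid layout, the spacing, the per-mode standard deviation, and the helper `modes_covered` (a simple way to count captured modes) are all assumptions introduced here for illustration.

```python
import numpy as np

def grid_centers(spacing=2.0):
    """25 mode centers on an ASSUMED 5 x 5 grid centered at the origin."""
    ticks = spacing * np.arange(-2, 3)
    return np.array([(cx, cy) for cx in ticks for cy in ticks])

def sample_25_gaussians(n, spacing=2.0, std=0.05, rng=None):
    """Draw n points from an equal-weight mixture of 25 isotropic Gaussians."""
    rng = np.random.default_rng() if rng is None else rng
    centers = grid_centers(spacing)
    idx = rng.integers(0, len(centers), size=n)  # mode index per sample
    points = centers[idx] + std * rng.standard_normal((n, 2))
    return points, idx

def modes_covered(samples, centers, radius=0.2):
    """Number of modes with at least one sample within `radius` of the center."""
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    close = dists.min(axis=1) < radius
    return int(np.unique(nearest[close]).size)

rng = np.random.default_rng(0)
centers = grid_centers()
real, _ = sample_25_gaussians(5000, rng=rng)
# A mode-collapsed generator is mimicked by sampling around three centers only.
collapsed = centers[rng.integers(0, 3, size=5000)] + 0.05 * rng.standard_normal((5000, 2))
print(modes_covered(real, centers), modes_covered(collapsed, centers))
```

Counting covered modes in this way gives a quick numerical counterpart to the visual comparison in Figure 5.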

4.3. EFFECTIVENESS OF DIFFUSION-GAN FOR LIMITED DATA

We evaluate whether Diffusion-GAN can provide data-efficient GAN training. We first generate five FFHQ (1024 × 1024) dataset splits, consisting of 200, 500, 1k, 2k, and 5k images, respectively, where 200 and 500 images are considered to be extremely limited data cases. We also consider AFHQ-Cat, -Dog, and -Wild (512 × 512), each with as few as around 5k images. Motivated by the success of InsGen (Yang et al., 2021) on small datasets, we build our Diffusion-GAN upon it. We note that on limited data, InsGen convincingly outperforms both StyleGAN2+ADA and StyleGAN2+DiffAug, and currently holds the state-of-the-art performance for data-efficient GAN training. The results in Table 3 show that our Diffusion-GAN method can further boost the performance of InsGen in limited data settings.

5. CONCLUSION

We present Diffusion-GAN, a novel GAN framework that uses a variable-length forward diffusion chain with a Gaussian mixture distribution to generate instance noise for GAN training. This approach enables model- and domain-agnostic differentiable augmentation that leverages the advantages of diffusion without requiring a costly reverse diffusion chain. We prove theoretically and demonstrate empirically that Diffusion-GAN can prevent discriminator overfitting and provide non-leaking augmentation. We also demonstrate that Diffusion-GAN can produce high-resolution photo-realistic images with high fidelity and diversity, outperforming its corresponding state-of-the-art GAN baselines on standard benchmark datasets according to both FID and Recall.

A RELATED WORK

Stabilizing GAN training. A root cause of training difficulties in GANs is often attributed to the JS divergence that GANs intend to minimize. This is because when the data and generator distributions have non-overlapping supports, which is often the case for high-dimensional data supported by low-dimensional manifolds, the gradient of the JS divergence may provide no useful guidance to optimize the generator (Arjovsky and Bottou, 2017; Arjovsky et al., 2017; Mescheder et al., 2018; Roth et al., 2017). For this reason, Arjovsky et al. (2017) propose to instead use the Wasserstein-1 distance, which in theory can provide a useful gradient for the generator even if the two distributions have disjoint supports. However, Wasserstein GANs often require the use of a critic function under the 1-Lipschitz constraint, which is difficult to satisfy in practice and hence realized with heuristics such as weight clipping (Arjovsky et al., 2017), gradient penalty (Gulrajani et al., 2017), and spectral normalization (Miyato et al., 2018a).

While the divergence minimization perspective has played an important role in motivating the construction of Wasserstein GANs and gradient penalty-based regularizations, caution should be exercised in relying on it alone to understand GAN training, due to not only the discrepancy between the divergence in theory and the actual min-max objective function used in practice, but also the potential confounding between different divergences and different training and regularization strategies (Fedus et al., 2018; Mescheder et al., 2018). For example, Mescheder et al. (2018) have provided a simple example where in theory the Wasserstein GAN is predicted to succeed while the vanilla GAN is predicted to fail, but in practice the Wasserstein GAN with a finite number of discriminator updates per generator update fails to converge while the vanilla GAN with the non-saturating loss can slowly converge. Fedus et al. (2018) provide a rich set of empirical evidence to discourage viewing GANs purely from the perspective of minimizing a specific divergence at each training step and emphasize the important role played by gradient penalties in stabilizing GAN training.

Diffusion models. Due to the use of a forward diffusion chain, the proposed Diffusion-GAN can be related to diffusion-based (or score-based) deep generative models (Ho et al., 2020b; Sohl-Dickstein et al., 2015; Song and Ermon, 2019) that employ both a forward (inference) and a reverse (generative) diffusion chain. These diffusion-based generative models are stable to train and can generate high-fidelity photo-realistic images (Dhariwal and Nichol, 2021; Ho et al., 2020b; Nichol et al., 2021; Ramesh et al., 2022; Song and Ermon, 2019; Song et al., 2021b). However, they are notoriously slow in generation due to the need to traverse the reverse diffusion chain, which involves going through the same U-Net-based generator network hundreds or even thousands of times (Song et al., 2021a). For this reason, a variety of methods have been proposed to reduce the generation cost of diffusion-based generative models (Kong and Ping, 2021; Luhman and Luhman, 2021; Pandey et al., 2022; San-Roman et al., 2021; Song et al., 2021a; Xiao et al., 2021; Zheng et al., 2022). A key distinction is that Diffusion-GAN needs a reverse diffusion chain during neither training nor generation. More specifically, its generator maps the noise to a generated sample in a single step. Diffusion-GAN can train and generate as quickly as a vanilla GAN does with the same generator size. For example, it takes around 20 hours to sample 50k images of size 32 × 32 from a DDPM (Ho et al., 2020b) on an Nvidia 2080 Ti GPU, but would take less than a minute to do so from Diffusion-GAN.

Differentiable augmentation.
As Diffusion-GAN transforms both the data and generated samples before sending them to the discriminator, we can also relate it to differentiable augmentation (Karras et al., 2020a; Zhao et al., 2020) proposed for data-efficient GAN training. Karras et al. (2020a) introduce a stochastic augmentation pipeline with 18 transformations and develop an adaptive mechanism for controlling the augmentation probability. Zhao et al. (2020) propose to use Color + Translation + Cutout as differentiable augmentations for both generated and real images. While providing good empirical results on some datasets, these augmentation methods are developed with domain-specific knowledge and have the risk of leaking augmentation into generation (Karras et al., 2020a) . As observed in our experiments, they sometime worsen the results when applied to a new dataset, likely because the risk of augmentation leakage overpowers the benefits of enlarging the training set, which could happen especially if the training set size is already sufficiently large. By contrast, Diffusion-GAN uses a differentiable forward diffusion process to stochastically transform the data and can be considered as both a domain-agnostic and a model-agnostic augmentation method. In other words, Diffusion-GAN can be applied to non-image data or even latent features, for which appropriate data augmentation is difficult to be defined, and easily plugged into an existing GAN to improve its generation performance. Moreover, we prove in theory and show in experiments that augmentation leakage is not a concern for Diffusion-GAN. Tran et al. (2021) provide a theoretical analysis for deterministic non-leaking transformation with differentiable and invertible mapping functions. Bora et al. (2018) show similar theorems to us for specific stochastic transformations, such as Gaussian Projection, Convolve+Noise, and stochastic Block-Pixels, while our Theorem 2 includes more satisfying possibilities as discussed in Appendix B.

B PROOF

Proof of Theorem 1. For simplicity, let $x \sim P_r$, $x_g \sim P_g$, $y \sim P_{r',t}$, $y_g \sim P_{g',t}$, $a_t = \sqrt{\bar{\alpha}_t}$, and $b_t = (1 - \bar{\alpha}_t)\sigma^2$. Then,

$$p_{r',t}(y) = \int_{\mathcal{X}} p_r(x)\, \mathcal{N}(y; a_t x, b_t I)\, dx,$$

$$p_{g',t}(y) = \int_{\mathcal{X}} p_g(x)\, \mathcal{N}(y; a_t x, b_t I)\, dx,$$

$$z \sim p(z), \quad x_g = g_\theta(z), \quad y_g = a_t x_g + b_t \epsilon, \quad \epsilon \sim p(\epsilon),$$

$$D_f\big(p_{r',t}(y) \,\|\, p_{g',t}(y)\big) = \int_{\mathcal{X}} p_{g',t}(y)\, f\!\left(\frac{p_{r',t}(y)}{p_{g',t}(y)}\right) dy = \mathbb{E}_{y \sim p_{g',t}(y)}\left[ f\!\left(\frac{p_{r',t}(y)}{p_{g',t}(y)}\right) \right] = \mathbb{E}_{z \sim p(z),\, \epsilon \sim p(\epsilon)}\left[ f\!\left(\frac{p_{r',t}(a_t g_\theta(z) + b_t \epsilon)}{p_{g',t}(a_t g_\theta(z) + b_t \epsilon)}\right) \right].$$

Since $\mathcal{N}(y; a_t x, b_t I)$ is assumed to be an isotropic Gaussian distribution, for simplicity we present the proof for the univariate Gaussian, which can be easily extended to the multivariate Gaussian by the product rule. We first show that, under mild conditions, $p_{r',t}(y)$ and $p_{g',t}(y)$ are continuous functions of $y$:

$$\lim_{\Delta y \to 0} p_{r',t}(y - \Delta y) = \lim_{\Delta y \to 0} \int_{\mathcal{X}} p_r(x)\, \mathcal{N}(y - \Delta y; a_t x, b_t)\, dx = \int_{\mathcal{X}} p_r(x) \lim_{\Delta y \to 0} \mathcal{N}(y - \Delta y; a_t x, b_t)\, dx$$

$$= \int_{\mathcal{X}} p_r(x) \lim_{\Delta y \to 0} \frac{1}{C_1} \exp\!\left(\frac{\big((y - \Delta y) - a_t x\big)^2}{C_2}\right) dx = \int_{\mathcal{X}} p_r(x)\, \mathcal{N}(y; a_t x, b_t)\, dx = p_{r',t}(y),$$

where $C_1$ and $C_2$ are constants. Hence, $p_{r',t}(y)$ is a continuous function of $y$. The proof of continuity for $p_{g',t}(y)$ is exactly the same. Then, given that $g_\theta$ is also a continuous function, it follows that $D_f(p_{r',t}(y) \,\|\, p_{g',t}(y))$ is a continuous function of $\theta$.

Next, we show that $D_f(p_{r',t}(y) \,\|\, p_{g',t}(y))$ is differentiable. By the chain rule, showing that $D_f(p_{r',t}(y) \,\|\, p_{g',t}(y))$ is differentiable is equivalent to showing that $p_{r',t}(y)$, $p_{g',t}(y)$, and $f$ are differentiable. Usually, $f$ is defined to be differentiable (Nowozin et al., 2016). For the two densities,

$$\nabla_\theta\, p_{r',t}(a_t g_\theta(z) + b_t \epsilon) = \nabla_\theta \int_{\mathcal{X}} p_r(x)\, \mathcal{N}(a_t g_\theta(z) + b_t \epsilon; a_t x, b_t)\, dx = \int_{\mathcal{X}} p_r(x)\, \frac{1}{C_1} \nabla_\theta \exp\!\left(\frac{\|a_t g_\theta(z) + b_t \epsilon - a_t x\|_2^2}{C_2}\right) dx,$$

$$\nabla_\theta\, p_{g',t}(a_t g_\theta(z) + b_t \epsilon) = \nabla_\theta \int_{\mathcal{X}} p_g(x)\, \mathcal{N}(a_t g_\theta(z) + b_t \epsilon; a_t x, b_t)\, dx = \nabla_\theta\, \mathbb{E}_{z' \sim p(z')}\big[\mathcal{N}(a_t g_\theta(z) + b_t \epsilon; a_t g_\theta(z'), b_t)\big]$$

$$= \mathbb{E}_{z' \sim p(z')}\left[\frac{1}{C_1} \nabla_\theta \exp\!\left(\frac{\|a_t g_\theta(z) + b_t \epsilon - a_t g_\theta(z')\|_2^2}{C_2}\right)\right],$$

where $C_1$ and $C_2$ are constants. Hence, $p_{r',t}(y)$ and $p_{g',t}(y)$ are differentiable, which concludes the proof.

Proof of Theorem 2. We have $p(y) = \int p(x)\, q(y \mid x)\, dx$ and $p_g(y) = \int p_g(x)\, q(y \mid x)\, dx$.

($\Leftarrow$) If $p(x) = p_g(x)$, then $p(y) = p_g(y)$.

($\Rightarrow$) Let $y \sim p(y)$ and $y_g \sim p_g(y)$. Given the assumption on $q(y \mid x)$, we have

$$y = f(x) + g(\epsilon), \quad x \sim p(x), \quad \epsilon \sim p(\epsilon),$$

$$y_g = f(x_g) + g(\epsilon_g), \quad x_g \sim p_g(x), \quad \epsilon_g \sim p(\epsilon).$$

Since $f$ and $g$ are one-to-one mapping functions, $f(x)$ and $g(\epsilon)$ are identifiable, which indicates $f(x) \overset{D}{=} f(x_g) \Rightarrow x \overset{D}{=} x_g$. By the property of moment-generating functions (MGF), given that $f(x)$ is independent of $g(\epsilon)$, we have for all $s$

$$M_y(s) = M_{f(x)}(s) \cdot M_{g(\epsilon)}(s),$$

$$M_{y_g}(s) = M_{f(x_g)}(s) \cdot M_{g(\epsilon_g)}(s).$$

Discussion. Next, we discuss which $q(y \mid x)$ fits the assumption we made on it. We follow the discussion of reparameterization of distributions as used in Kingma and Welling (2014)
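As a worked instance of the assumption in Theorem 2, consider the Gaussian kernel used by the forward diffusion itself. This is a sketch under our own reading of the assumption, with $b_t$ taken as the variance in $\mathcal{N}(y; a_t x, b_t I)$ and with $a_t > 0$, $b_t > 0$; the particular choice of $f$ and $g$ below is illustrative rather than quoted from the discussion.

$$q(y \mid x) = \mathcal{N}(y; a_t x, b_t I) \;\Longleftrightarrow\; y = f(x) + g(\epsilon), \quad f(x) = a_t x, \quad g(\epsilon) = \sqrt{b_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$$

$$M_y(s) = M_{a_t x}(s)\, \exp\!\left(\tfrac{b_t}{2}\|s\|_2^2\right), \qquad M_{y_g}(s) = M_{a_t x_g}(s)\, \exp\!\left(\tfrac{b_t}{2}\|s\|_2^2\right).$$

Both $f$ and $g$ are one-to-one, and the Gaussian factor is strictly positive, so $M_y(s) = M_{y_g}(s)$ for all $s$ gives $M_{a_t x}(s) = M_{a_t x_g}(s)$, and hence $x \overset{D}{=} x_g$.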

D DETAILS OF TOY EXAMPLE

Here, we provide the detailed analysis of the JS divergence toy example.

Notation. Let $\mathcal{X}$ be a compact metric set (such as the space of images $[0, 1]^d$) and let $\mathrm{Prob}(\mathcal{X})$ denote the space of probability measures defined on $\mathcal{X}$. Let $P_r$ be the target data distribution and $P_g$ be the generator distribution. The JSD between the two distributions $P_r, P_g \in \mathrm{Prob}(\mathcal{X})$ is defined as

$$D_{JS}(P_r \,\|\, P_g) = \frac{1}{2} D_{KL}(P_r \,\|\, P_m) + \frac{1}{2} D_{KL}(P_g \,\|\, P_m),$$

where $P_m$ is the mixture $(P_r + P_g)/2$ and $D_{KL}$ denotes the Kullback-Leibler divergence, i.e., $D_{KL}(P_r \,\|\, P_g) = \int_{\mathcal{X}} p_r(x) \log\!\left(\frac{p_r(x)}{p_g(x)}\right) dx$. More generally, the $f$-divergence (Nowozin et al., 2016) between $P_r$ and $P_g$ is defined as

$$D_f(P_r \,\|\, P_g) = \int_{\mathcal{X}} p_g(x)\, f\!\left(\frac{p_r(x)}{p_g(x)}\right) dx,$$

where the generator function $f: \mathbb{R}_+ \to \mathbb{R}$ is a convex and lower-semicontinuous function satisfying $f(1) = 0$. We refer to Nowozin et al. (2016) for more details. We recall the typical example introduced in Arjovsky and Bottou (2017) and follow their notation.

Example. Let $Z \sim U[0, 1]$ be uniformly distributed on the unit interval. Let $X \sim P_r$ be distributed as $(0, Z) \in \mathbb{R}^2$, i.e., a 0 on the x-axis and the random variable $Z$ on the y-axis. Let $X_g \sim P_g$ be distributed as $(\theta, Z) \in \mathbb{R}^2$, where $\theta$ is a single real parameter. In this case, $D_{JS}(P_r \,\|\, P_g)$ is not continuous in $\theta$,

$$D_{JS}(P_r \,\|\, P_g) = \begin{cases} 0 & \text{if } \theta = 0, \\ \log 2 & \text{if } \theta \neq 0, \end{cases}$$

and therefore cannot provide a usable gradient for training. The derivation is as follows:

$$D_{JS}(P_r \,\|\, P_g) = \frac{1}{2}\, \mathbb{E}_{x \sim p_r(x)}\left[\log \frac{2\, p_r(x)}{p_r(x) + p_g(x)}\right] + \frac{1}{2}\, \mathbb{E}_{y \sim p_g(y)}\left[\log \frac{2\, p_g(y)}{p_r(y) + p_g(y)}\right]$$

$$= \frac{1}{2}\, \mathbb{E}_{x_1 = 0,\, x_2 \sim U[0,1]}\left[\log \frac{2 \cdot \mathbb{1}[x_1 = 0] \cdot U(x_2)}{\mathbb{1}[x_1 = 0] \cdot U(x_2) + \mathbb{1}[x_1 = \theta] \cdot U(x_2)}\right] + \frac{1}{2}\, \mathbb{E}_{y_1 = \theta,\, y_2 \sim U[0,1]}\left[\log \frac{2 \cdot \mathbb{1}[y_1 = \theta] \cdot U(y_2)}{\mathbb{1}[y_1 = 0] \cdot U(y_2) + \mathbb{1}[y_1 = \theta] \cdot U(y_2)}\right]$$

$$= \frac{1}{2} \left[\log \frac{2 \cdot \mathbb{1}[x_1 = 0]}{\mathbb{1}[x_1 = 0] + \mathbb{1}[x_1 = \theta]} \,\middle|\, x_1 = 0\right] + \frac{1}{2} \left[\log \frac{2 \cdot \mathbb{1}[y_1 = \theta]}{\mathbb{1}[y_1 = 0] + \mathbb{1}[y_1 = \theta]} \,\middle|\, y_1 = \theta\right]$$

$$= \begin{cases} 0 & \text{if } \theta = 0, \\ \log 2 & \text{if } \theta \neq 0. \end{cases}$$
Although this simple example features distributions with disjoint supports, the same conclusion holds when the supports have a nonempty intersection contained in a set of measure zero (Arjovsky and Bottou, 2017). This happens to be the case when two low-dimensional manifolds intersect in general position (Arjovsky and Bottou, 2017). To avoid the potential issue caused by non-overlapping distribution supports, a common remedy is to use the Wasserstein-1 distance, which in theory can still provide a usable gradient (Arjovsky and Bottou, 2017; Arjovsky et al., 2017). In this case, the Wasserstein-1 distance is |θ|.
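The contrast between the two quantities in this example can be checked numerically. The following is a minimal sketch, not code from the paper: `js_divergence`, `toy_jsd`, and `toy_w1` are hypothetical helper names, and the reduction to point masses on the first coordinate relies on the shared U[0, 1] factor cancelling in the density ratio.

```python
import math


def js_divergence(p, q):
    """JS divergence between two discrete distributions given as {outcome: probability}."""
    total = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0.0:
            total += 0.5 * pk * math.log(pk / mk)
        if qk > 0.0:
            total += 0.5 * qk * math.log(qk / mk)
    return total


def toy_jsd(theta):
    # Only the point masses on the first coordinate matter: P_r sits at 0, P_g at theta.
    return js_divergence({0.0: 1.0}, {float(theta): 1.0})


def toy_w1(theta):
    # Transporting (0, z) to (theta, z) costs |theta| for every z.
    return abs(theta)


for theta in (-1.0, -0.1, 0.0, 0.1, 1.0):
    print(f"theta={theta:+.1f}  JSD={toy_jsd(theta):.4f}  W1={toy_w1(theta):.4f}")
```

The JSD column jumps between 0 and log 2 no matter how small |θ| is, whereas the Wasserstein-1 column shrinks linearly with |θ|.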

Diffusion-based noise injection

In general, with our diffusion noise injected, we have

$$p_{r',t}(y) = \int_{\mathcal{X}} p_r(x)\, \mathcal{N}\big(y; \sqrt{\bar{\alpha}_t}\, x, (1 - \bar{\alpha}_t)\sigma^2 I\big)\, dx,$$

$$p_{g',t}(y) = \int_{\mathcal{X}} p_g(x)\, \mathcal{N}\big(y; \sqrt{\bar{\alpha}_t}\, x, (1 - \bar{\alpha}_t)\sigma^2 I\big)\, dx,$$

$$D_{JS}(p_{r',t} \,\|\, p_{g',t}) = \frac{1}{2}\, \mathbb{E}_{p_{r',t}}\left[\log \frac{2\, p_{r',t}}{p_{r',t} + p_{g',t}}\right] + \frac{1}{2}\, \mathbb{E}_{p_{g',t}}\left[\log \frac{2\, p_{g',t}}{p_{r',t} + p_{g',t}}\right].$$

For the previous example, we have $Y'_t$ and $Y'_{g,t}$ such that

$$Y'_t = (y_1, y_2) \sim p_{r',t} = \mathcal{N}(y_1 \mid 0, b_t)\, f(y_2), \qquad Y'_{g,t} = (y_{g,1}, y_{g,2}) \sim p_{g',t} = \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)\, f(y_{g,2}),$$

where $f(\cdot) = \int_0^1 \mathcal{N}(\cdot \mid a_t Z, b_t)\, U(Z)\, dZ$, with $a_t = \sqrt{\bar{\alpha}_t}$ and $b_t = (1 - \bar{\alpha}_t)\sigma^2$ as in Appendix B.

$$D_{JS}(p_{r',t} \,\|\, p_{g',t}) = \frac{1}{2}\, \mathbb{E}_{y_1 \sim \mathcal{N}(y_1 \mid 0, b_t),\, y_2 \sim f(y_2)}\left[\log \frac{2\, \mathcal{N}(y_1 \mid 0, b_t)\, f(y_2)}{\mathcal{N}(y_1 \mid 0, b_t)\, f(y_2) + \mathcal{N}(y_1 \mid a_t \theta, b_t)\, f(y_2)}\right]$$

$$\quad + \frac{1}{2}\, \mathbb{E}_{y_{g,1} \sim \mathcal{N}(y_{g,1} \mid a_t \theta, b_t),\, y_{g,2} \sim f(y_{g,2})}\left[\log \frac{2\, \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)\, f(y_{g,2})}{\mathcal{N}(y_{g,1} \mid 0, b_t)\, f(y_{g,2}) + \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)\, f(y_{g,2})}\right]$$

$$= \frac{1}{2}\, \mathbb{E}_{y_1 \sim \mathcal{N}(0, b_t)}\left[\log \frac{2\, \mathcal{N}(y_1 \mid 0, b_t)}{\mathcal{N}(y_1 \mid 0, b_t) + \mathcal{N}(y_1 \mid a_t \theta, b_t)}\right] + \frac{1}{2}\, \mathbb{E}_{y_{g,1} \sim \mathcal{N}(a_t \theta, b_t)}\left[\log \frac{2\, \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)}{\mathcal{N}(y_{g,1} \mid 0, b_t) + \mathcal{N}(y_{g,1} \mid a_t \theta, b_t)}\right],$$

which is clearly continuous and differentiable. We show this $D_{JS}(p_{r',t} \,\|\, p_{g',t})$ for increasing values of $t$ over a grid of $\theta$ in the second row of Figure 2. As shown in the left panel, the black line with $t = 0$ shows the original JSD, which is not even continuous, while as the diffusion level $t$ increases, the lines become smoother and flatter. These smooth curves provide good learning signals for $\theta$. Recall that the Wasserstein-1 distance is $|\theta|$ in this case. Meanwhile, with intense diffusion, e.g., $t = 800$, the curve becomes flatter, which indicates smaller gradients and a much slower learning process. This suggests that an adaptive diffusion can provide different levels of gradient smoothness and is possibly better for training. The right panel shows the optimal discriminator outputs over the space $\mathcal{X}$. With diffusion, the optimal discriminator is well defined over the space and the gradient is smooth, while without diffusion the optimal discriminator is only valid on two star points. Interestingly, we find that a smaller $t$ drives the optimal discriminator to become more assertive, while a larger $t$ makes the discriminator more neutral. The diffusion here works like a scale that balances the power of the discriminator.
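The one-dimensional expression above is easy to evaluate numerically, which makes the smoothing effect concrete. The following is a minimal sketch using a plain trapezoid rule; `diffused_toy_jsd` is a hypothetical helper, and the `(a_t, b_t)` pairs in the usage lines are illustrative values chosen by us, not numbers taken from the paper or from Figure 2.

```python
import math


def normal_pdf(y, mean, var):
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def diffused_toy_jsd(theta, a_t, b_t, n=4001, width=8.0):
    """Trapezoid-rule estimate of JSD( N(0, b_t) || N(a_t * theta, b_t) ), b_t being a variance."""
    mu_r, mu_g, std = 0.0, a_t * theta, math.sqrt(b_t)
    lo = min(mu_r, mu_g) - width * std
    hi = max(mu_r, mu_g) + width * std
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        y = lo + i * h
        p, q = normal_pdf(y, mu_r, b_t), normal_pdf(y, mu_g, b_t)
        m = 0.5 * (p + q)
        if m <= 0.0:  # both densities underflowed; the integrand is zero here
            continue
        term = 0.0
        if p > 0.0:
            term += 0.5 * p * math.log(p / m)
        if q > 0.0:
            term += 0.5 * q * math.log(q / m)
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * term * h
    return total


# Light diffusion (a_t near 1, small b_t) versus heavy diffusion (small a_t, larger b_t).
for label, a_t, b_t in (("light", 0.95, 2.5e-4), ("heavy", 0.04, 2.5e-3)):
    row = [round(diffused_toy_jsd(th, a_t, b_t), 4) for th in (0.0, 0.25, 0.5, 1.0)]
    print(label, row)
```

Under light diffusion the curve stays close to the discontinuous limit (near log 2 away from θ = 0), while heavy diffusion gives a smooth but much flatter curve, which matches the trade-off described above.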

E DATASET DESCRIPTIONS

The CIFAR-10 dataset consists of 50k 32 × 32 training images in 10 categories. The STL-10 dataset, derived from ImageNet (Deng et al., 2009), consists of 100k unlabeled images in 10 categories, which we resize to 64 × 64 resolution. For the LSUN datasets, we sample 200k images from LSUN-Bedroom, use all 125k images from LSUN-Church, and resize them to 256 × 256 resolution for training. The AFHQ dataset includes around 5k 512 × 512 images per category for dogs, cats, and wildlife; we train a separate network for each category. The FFHQ dataset contains 70k images crawled from Flickr at 1024 × 1024 resolution, and we use all of them for training.

F ALGORITHM

We provide the Diffusion-GAN algorithm in Algorithm 1.

G HYPERPARAMETERS

Diffusion-GAN is built on GAN backbones, so we keep the learning hyperparameters of the original GAN backbones untouched. Diffusion-GAN introduces four new hyperparameters: the noise standard deviation $\sigma$, the upper bound $T_{max}$ on the number of diffusion steps $T$, the threshold $d_{target}$ for increasing $T$, and the $t$ sampling distribution $p_\pi$. We fix $\sigma = 0.05$ for images (pixel values rescaled to $[-1, 1]$) in all our experiments, where it performs well. $T_{max}$ can be fixed at 500 or 1000, depending on the diversity of the dataset; we recommend a large $T_{max}$ for diverse datasets. $d_{target}$ is usually fixed at 0.6 and has little influence on performance. $p_\pi$ has two choices, 'uniform' and 'priority'. Generally, ($\sigma = 0.05$, $T_{max} = 500$, $d_{target} = 0.6$, $p_\pi$ = 'uniform') is a good starting point for a new dataset. In our experiments, we find that StyleGAN2-based models are not sensitive to the value of $d_{target}$, so we set $d_{target} = 0.6$ for them across all datasets, except that we set $d_{target} = 0.8$ for FFHQ.
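The four hyperparameters and the suggested starting point can be collected in a small configuration object. This is only a sketch: the class name `DiffusionHyperparams` and its field names are hypothetical and do not correspond to any released code; only the default values come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffusionHyperparams:
    """Hypothetical container for the four Diffusion-GAN hyperparameters."""

    sigma: float = 0.05          # noise standard deviation (pixels rescaled to [-1, 1])
    t_max: int = 500             # upper bound on the number of diffusion steps T
    d_target: float = 0.6        # threshold that decides when T is increased
    t_sampling: str = "uniform"  # 'uniform' or 'priority'

    def __post_init__(self):
        if self.sigma <= 0.0:
            raise ValueError("sigma must be positive")
        if self.t_max < 1:
            raise ValueError("t_max must be at least 1")
        if self.t_sampling not in ("uniform", "priority"):
            raise ValueError("t_sampling must be 'uniform' or 'priority'")


default_cfg = DiffusionHyperparams()                # suggested starting point
diverse_cfg = DiffusionHyperparams(t_max=1000)      # larger T_max for a diverse dataset
print(default_cfg)
print(diverse_cfg)
```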

I ABLATION ON THE MIXING PROCEDURE AND T ADAPTIVENESS

Note that the mixing procedure described in Equation (6), referred to as "priority mixing" in what follows, is designed based on our intuition. Here we conduct an ablation study on the mixing procedure by comparing priority mixing with uniform mixing on three representative datasets. We report the FID results in Table 7, which suggest that uniform mixing can work better than priority mixing on some datasets, and hence Diffusion-GAN may be further improved by optimizing its mixing procedure according to the training data. While optimizing the mixing procedure is beyond the focus of this paper, it is worth further investigation in future studies. We further conduct an ablation study on whether $T$ needs to be adaptively adjusted. As shown in Figure 7, with the adaptive diffusion strategy, the FID training curves converge faster and reach lower final FIDs.
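The two mixing procedures differ only in the weights used to draw the timestep t. The sketch below assumes that priority mixing weights each t in {1, ..., T} proportionally to t, so that larger timesteps are drawn more often; Equation (6) is not reproduced in this appendix, so that form is an assumption, and the function names are hypothetical.

```python
import random


def t_sampling_weights(T, mode="uniform"):
    """Probabilities over t = 1..T for the two mixing procedures (priority form assumed)."""
    if mode == "uniform":
        return [1.0 / T] * T
    if mode == "priority":
        total = T * (T + 1) / 2.0
        return [t / total for t in range(1, T + 1)]
    raise ValueError("mode must be 'uniform' or 'priority'")


def sample_timesteps(T, m, mode="uniform", rng=None):
    """Draw m timesteps with replacement according to the chosen mixing procedure."""
    rng = rng or random.Random()
    return rng.choices(range(1, T + 1), weights=t_sampling_weights(T, mode), k=m)


rng = random.Random(0)
print(sample_timesteps(T=10, m=8, mode="uniform", rng=rng))
print(sample_timesteps(T=10, m=8, mode="priority", rng=rng))
```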

J MORE GAN VARIANTS

To further validate our noise injection via diffusion-based mixtures, we add our diffusion-based training to two more representative GAN variants: DCGAN (Radford et al., 2015) and SNGAN (Miyato et al., 2018b), whose architectures differ considerably from StyleGAN2. We provide the FIDs for CIFAR-10 in Table 8. We observe that both Diffusion-DCGAN and Diffusion-SNGAN clearly outperform their corresponding baseline GANs.

(Ho et al., 2020a) 9.46 3.21 0.57 1000
DDIM (Song et al., 2020) 8.78 4.67 0.53 50
Denoising Diffusion GAN (Xiao et al., 2021)

K INCEPTION SCORE FOR CIFAR-10

We report the Inception Score (IS) (Salimans et al., 2016) of Diffusion StyleGAN2 on the CIFAR-10 dataset in Table 9, together with other state-of-the-art GANs and diffusion models as baselines. We choose CIFAR-10 because it is a well-known dataset on which almost all baselines have been tested, and for a fair comparison we take the IS values reported in the original papers.

L MORE GENERATED IMAGES

We provide more randomly generated images for LSUN-Bedroom, LSUN-Church, AFHQ, and FFHQ datasets in Figure 8, Figure 9, and Figure 10.



For notation simplicity, g and G both denote the generator network in GANs in this paper.



Figure 2: The toy example inherited from Arjovsky et al. (2017). The first row plots the distributions of data with diffusion noise injected for t. The second row shows the JS divergence and the optimal discriminator value with and without our noise injection.

Figure 3: Randomly generated images from Diffusion StyleGAN2 trained on CIFAR-10, CelebA, STL-10, LSUN-Bedroom, LSUN-Church, and FFHQ datasets.

Figure 4: Plot of adaptively adjusted maximum diffusion steps T and discriminator outputs of Diffusion-GANs.

where $M_y(s) = \mathbb{E}_{y \sim p(y)}[e^{s^T y}]$ denotes the MGF of the random variable $y$, and the others follow the same form. By the moment-generating function uniqueness theorem, given $y \overset{D}{=} y_g$ and $g(\epsilon) \overset{D}{=} g(\epsilon_g)$, we have $M_y(s) = M_{y_g}(s)$ and $M_{g(\epsilon)}(s) = M_{g(\epsilon_g)}(s)$ for all $s$. Then, we obtain $M_{f(x)}(s) = M_{f(x_g)}(s)$ for all $s$. Thus, $M_{f(x)} = M_{f(x_g)} \Rightarrow f(x) \overset{D}{=} f(x_g) \Rightarrow p(x) = p_g(x)$, which concludes the proof.

Figure 6: We show the data distribution and $D_{JS}(P_r \,\|\, P_g)$.

Figure 7: Ablation study on the T adaptiveness.

Figure 8: More generated images for LSUN-Bedroom (FID 1.43, Recall 0.58) and LSUN-Church (FID 1.85, Recall 0.65) from Diffusion ProjectedGAN.

Image generation results on benchmark datasets: CIFAR-10, CelebA, STL-10, LSUN-Bedroom, LSUN-Church, and FFHQ. We highlight the best and second best results in each column with bold and underline, respectively. Lower FIDs indicate better fidelity, while higher Recalls indicate better diversity.

Domain-agnostic experiments on ProjectedGAN.

FFHQ (1024 × 1024) FID results with 200, 500, 1k, 2k, and 5k training samples; AFHQ (512 × 512) FID results. To ensure convergence, all models are trained across 10M images for FFHQ and 25M images for AFHQ. We bold the best number in each column.

| t) + p g (y | t) = E t∼pπ(t) [JSD(p(y | t), p g (y | t))].

The supports of $Y'_t$ and $Y'_{g,t}$ are both the whole metric space $\mathbb{R}^2$, and they overlap with each other to a degree that depends on $t$, as shown in Figure 2. As $t$ increases, the high-density regions of $Y'_t$ and $Y'_{g,t}$ get closer, since the weight $a_t$ decreases towards 0. Then, we derive the JS divergence,

Table 7: Ablation study on the mixing procedure. "Priority Mixing" refers to the mixing procedure in Equation (6) and "Uniform Mixing" refers to sampling t uniformly at random.

Table 8: FIDs on CIFAR-10 for DCGAN, Diffusion-DCGAN, SNGAN, and Diffusion-SNGAN.

Table 9: Inception Score for CIFAR-10. For sampling time, we use the number of function evaluations (NFE).

ACKNOWLEDGEMENTS

Z. Wang, H. Zheng, and M. Zhou acknowledge the support of NSF-IIS 2212418 and IFML.

Algorithm 1 Diffusion-GAN
while i ≤ number of training iterations do
    Step I: Update discriminator
    • Sample minibatch of m noise samples $\{z_1, z_2, \ldots, z_m\} \sim p_z(z)$.
    • Obtain generated samples $\{x_{g,1}, x_{g,2}, \ldots, x_{g,m}\}$ by $x_g = G(z)$.
    • Sample minibatch of m data examples $\{x_1, x_2, \ldots, x_m\} \sim p(x)$.
    • Sample $\{t_1, t_2, \ldots, t_m\}$ from the $t_{epl}$ list uniformly with replacement.
    • For $j \in \{1, 2, \ldots, m\}$, sample $y_j \sim q(y_j \mid x_j, t_j)$ and $y_{g,j} \sim q(y_{g,j} \mid x_{g,j}, t_j)$.
    • Update discriminator by maximizing Equation (3).
    Step II: Update generator
    • Obtain generated samples $\{x_{g,1}, x_{g,2}, \ldots, x_{g,m}\}$ by $x_g = G(z)$.
    • Sample $\{t_1, t_2, \ldots, t_m\}$ from the $t_{epl}$ list with replacement.
    • For $j \in \{1, 2, \ldots, m\}$, sample $y_{g,j} \sim q(y_{g,j} \mid x_{g,j}, t_j)$.
    • Update generator by minimizing Equation (3).
    Step

($d_{target} = 0.8$ for FFHQ is slightly better than 0.6 in FID). We report $d_{target}$ of Diffusion ProjectedGAN for our experiments in Table 4. We also evaluated two $t$ sampling distributions $p_\pi$, 'priority' and 'uniform', defined in Equation (6). In most cases, 'priority' works slightly better, while in some cases, such as FFHQ, 'uniform' is better. Overall, we did not modify anything in the model architectures or the training hyperparameters, such as learning rate and batch size. The forward diffusion configuration and model training configurations are as follows.

Diffusion config. For our diffusion-based noise injection, we set up a linearly increasing schedule for $\beta_t$, where $t \in \{1, 2, \ldots, T\}$. For pixel-level injection in StyleGAN2, we follow Ho et al. (2020b) and set $\beta_0 = 0.0001$ and $\beta_T = 0.02$. We adaptively modify $T$, ranging from $T_{min} = 5$ to $T_{max} = 1000$. The image pixels are usually rescaled to $[-1, 1]$, so we set the Gaussian noise standard deviation $\sigma = 0.05$. For feature-level injection in Diffusion ProjectedGAN, we set $\beta_0 = 0.0001$, $\beta_T = 0.01$, $T_{min} = 5$, $T_{max} = 500$, and $\sigma = 0.5$. We list all these values in Table 5.

| Setting | $\beta_0$ | $\beta_T$ | $T_{min}$ | $T_{max}$ | $\sigma$ |
|---|---|---|---|---|---|
| Diffusion config for pixel, priority | 0.0001 | 0.02 | 5 | 1000 | 0.05 |
| Diffusion config for pixel, uniform | 0.0001 | 0.02 | 5 | 500 | 0.05 |
| Diffusion config for feature | 0.0001 | 0.01 | 5 | 500 | 0.5 |

Model config. For StyleGAN2-based models, we borrow the config settings provided by Karras et al. (2020a), which include ['auto', 'stylegan2', 'cifar', 'paper256', 'paper512', 'paper1024']. We create the 'stl' config based on 'cifar' with one small modification: we change the gamma term to 0.01. For ProjectedGAN models, we use the recommended default config (Sauer et al., 2021), which is based on FastGAN (Liu et al., 2020). We report the config settings used for our experiments in Table 6.
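The diffusion config translates into a short sampling routine for $q(y \mid x, t)$. The sketch below assumes the standard cumulative product $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$, a linear schedule built once over $T_{max}$ steps, and the marginal $\mathcal{N}(y; \sqrt{\bar{\alpha}_t}\, x, (1 - \bar{\alpha}_t)\sigma^2 I)$ used in Appendix D; the function names are hypothetical and the exact spacing of the $\beta$ values is an assumption.

```python
import math
import random


def alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Return [alpha_bar_1, ..., alpha_bar_T] for a linearly increasing beta schedule."""
    out, prod = [], 1.0
    for s in range(T):
        beta = beta_start + (beta_end - beta_start) * s / max(T - 1, 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def diffuse(x, t, abar, sigma=0.05, rng=random):
    """Sample y ~ N(sqrt(abar_t) * x, (1 - abar_t) * sigma^2 * I) for a flat list x, t >= 1."""
    a = math.sqrt(abar[t - 1])
    s = math.sqrt(1.0 - abar[t - 1]) * sigma
    return [a * xi + s * rng.gauss(0.0, 1.0) for xi in x]


abar = alpha_bars(1000)          # pixel-level setting with T_max = 1000
x = [0.5, -0.25, 1.0]            # a toy "image" with values in [-1, 1]
for t in (1, 100, 500, 1000):
    print(t, round(abar[t - 1], 5), [round(v, 3) for v in diffuse(x, t, abar)])
```

Because the marginal at step t is available in closed form, each sample costs one scaling and one noise draw, with no need to simulate the chain step by step.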

H IMPLEMENTATION DETAILS

We implement an additional diffusion sampling pipeline, where the diffusion configurations are set in Appendix G. The $T$ in the forward diffusion process is adaptively adjusted and clipped to $[T_{min}, T_{max}]$. As illustrated in Algorithm 1, at each update step, we sample $t$ from $t_{epl}$ for each point $x$, and then use the analytic Gaussian distribution at diffusion step $t$ to sample $y$. Next, we use $y$ and $t$ instead of $x$ for optimization.

Diffusion StyleGAN2. We inherit all the network architectures from StyleGAN2 as implemented by Karras et al. (2020a). We modify the original mapping network inside the discriminator, which is there for label conditioning and unused for unconditional image generation tasks, to inject $t$. Specifically, we change the original input of the mapping network, the class label $c$, to our discrete-valued timestep $t$. Then, we train the generator and discriminator with diffused samples $y$ and $t$.

Diffusion ProjectedGAN. To simplify the implementation and minimize the modifications to ProjectedGAN, we construct the discriminator as $D_\phi(y)$, where $t$ is ignored. Our method is plugged in as a data augmentation method. The only change in the optimization stage is that the discriminator is fed with diffused images $y$ instead of original images $x$.

Diffusion InsGen. To simplify the implementation and minimize the modifications to InsGen, we keep their contrastive learning part untouched. We modify the original discriminator network to inject $t$ similarly to Diffusion StyleGAN2. Then, we train the generator and discriminator with diffused samples $y$ and $t$.
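The adaptive adjustment of $T$ with clipping to $[T_{min}, T_{max}]$ can be sketched as follows. The update rule itself is not reproduced in this appendix, so the form used here is an assumption: an overfitting estimate $r_d$ is computed as the mean sign of $D(y, t) - 0.5$ on diffused real data, and $T$ moves up by a fixed step when $r_d$ exceeds $d_{target}$ and down otherwise. The names `overfit_estimate`, `update_T`, and `step_size` are hypothetical.

```python
def overfit_estimate(d_outputs):
    """Mean of sign(D - 0.5) over discriminator outputs in [0, 1] on diffused real data."""
    signs = [(d > 0.5) - (d < 0.5) for d in d_outputs]
    return sum(signs) / len(signs)


def update_T(T, r_d, d_target=0.6, step_size=8, t_min=5, t_max=500):
    """Move T up when the discriminator looks too confident, down otherwise, then clip."""
    direction = (r_d > d_target) - (r_d < d_target)
    return max(t_min, min(t_max, T + direction * step_size))


T = 5
for outputs in ([0.9, 0.8, 0.95, 0.7], [0.9, 0.4, 0.6, 0.3], [0.99, 0.97, 0.9, 0.8]):
    r_d = overfit_estimate(outputs)
    T = update_T(T, r_d)
    print(f"r_d={r_d:+.2f} -> T={T}")
```

Clipping keeps the chain length inside the configured range, so the first decrease from `t_min` and the last increase near `t_max` have no effect.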

