INTERVENTION GENERATIVE ADVERSARIAL NETS

Abstract

In this paper we propose a novel approach for stabilizing the training process of Generative Adversarial Networks and alleviating the mode collapse problem. The main idea is to incorporate a regularization term, which we call intervention, into the objective. We refer to the resulting generative model as Intervention Generative Adversarial Networks (IVGAN). By perturbing the latent representations of real images obtained from an auxiliary encoder network with Gaussian-invariant interventions and penalizing the dissimilarity of the distributions of the resulting generated images, the intervention term provides more informative gradients for the generator, significantly improving training stability and encouraging mode-covering behaviour. We demonstrate the performance of our approach via solid theoretical analysis and thorough evaluation on standard real-world datasets as well as the stacked MNIST dataset.

1. INTRODUCTION

As one of the most important advances in generative models in recent years, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have been attracting great attention in the machine learning community. GANs aim to train a generator network that transforms simple noise vectors into "realistic" samples from the data distribution. In the basic training process of GANs, a discriminator and a target generator are trained in an adversarial manner: the discriminator tries to distinguish the generated fake samples from the real ones, and the generator tries to fool the discriminator into believing the generated samples to be real. Though successful, GAN training faces two major challenges: instability of the training process and the mode collapse problem. To deal with these problems, one class of approaches focuses on designing more informative objective functions (Salimans et al., 2016; Mao et al., 2016; Kodali et al., 2018; Arjovsky & Bottou, 2017; Arjovsky et al., 2017; Gulrajani et al., 2017; Zhou et al., 2019). For example, Mao et al. (2016) proposed Least Squares GAN (LSGAN), which uses a least-squares loss to penalize outliers more harshly. Arjovsky & Bottou (2017) discussed the role of the Jensen-Shannon divergence in training GANs, which led to WGAN (Arjovsky et al., 2017) and WGAN-GP (Gulrajani et al., 2017), which use the more informative Wasserstein distance instead. Other approaches enforce proper constraints on latent-space representations to better capture the data distribution (Makhzani et al., 2015; Larsen et al., 2015; Che et al., 2016; Tran et al., 2018). A representative work is Adversarial Autoencoders (AAE) (Makhzani et al., 2015), in which the discriminator distinguishes the latent representations produced by an encoder from Gaussian noise. Larsen et al. (2015) employed the image representations learned by the discriminator as the reconstruction basis of a VAE.
Their method turns the pixel-wise loss into a feature-wise one, which can capture the real distribution more easily when some form of invariance is induced. Different from VAE-GAN, Che et al. (2016) regarded the encoder as an auxiliary network that encourages the GAN to pay attention to missing modes, and derived an objective function similar to VAE-GAN's. A more detailed discussion of related work can be found in Appendix C. In this paper we propose a novel technique for GANs that improves both training stability and the quality of generated images. The core of our approach is a regularization term based on the latent representations of real images provided by an encoder network. More specifically, we apply auxiliary intervention operations that preserve the standard Gaussian (i.e., the noise distribution) to these latent representations. The perturbed latent representations are then fed into the generator to produce intervened samples. We then introduce a classifier network to identify the intervention operations that would have led to these intervened samples. The resulting negative cross-entropy loss is added as a regularizer to the objective when training the generator. We call this regularization term the intervention loss and our approach InterVention Generative Adversarial Nets (IVGAN). We show that the intervention loss is equivalent to the JS-divergence among multiple intervened distributions. Furthermore, these intervened distributions interpolate between the original generative distribution of GAN and the data distribution, providing useful information for the generator that is unavailable in common GAN models (see the analysis of a toy example in Example 1). We show empirically that our model can be trained efficiently by sharing parameters between the discriminator and the classifier.
Models trained on the MNIST, CIFAR-10, LSUN and STL-10 datasets successfully generate diverse, visually appealing images, outperforming state-of-the-art baseline methods such as WGAN-GP and MRGAN in terms of the Fréchet Inception Distance (FID) (Heusel et al., 2017). We also perform a series of experiments on the stacked MNIST dataset, and the results show that our proposed method can effectively alleviate the mode collapse problem. Moreover, an ablation study validates the effectiveness of the proposed intervention loss. In summary, our work offers three major contributions. (i) We propose a novel method that improves GAN's training as well as its generating performance. (ii) We theoretically analyze the proposed model and give insights into how it makes the generator's gradient more informative and thus stabilizes GAN's training. (iii) We evaluate the performance of our method on both standard real-world datasets and the stacked MNIST dataset through carefully designed experiments, showing that our approach stabilizes GAN's training and improves the quality and diversity of generated samples.

2. PRELIMINARIES

Generative adversarial nets The basic idea of GANs is to utilize a discriminator to continuously push a generator to map Gaussian noise to samples drawn according to an implicit data distribution. The objective function of the vanilla GAN takes the following form:

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))],

where p_z is a prior distribution (e.g., the standard Gaussian). It can easily be seen that when the discriminator reaches its optimum, that is, D*(x) = p_data(x) / (p_data(x) + p_G(x)), the objective is equivalent (up to an additive constant) to the Jensen-Shannon (JS) divergence between the generated distribution p_G and the data distribution p_data:

JS(p_G ‖ p_data) = (1/2) [ KL(p_G ‖ (p_G + p_data)/2) + KL(p_data ‖ (p_G + p_data)/2) ].

Minimizing this JS divergence guarantees that the generated distribution converges to the data distribution given adequate model capacity.
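As a quick numerical sanity check of this identity on discrete toy distributions (a minimal NumPy sketch; the two distributions below are arbitrary stand-ins for p_data and p_G, not outputs of any trained model), plugging the optimal discriminator D* into the objective yields exactly 2·JS(p_G ‖ p_data) − log 4:

```python
import numpy as np

def js(p, q):
    # Jensen-Shannon divergence between two discrete distributions (natural log).
    m = (p + q) / 2
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two toy discrete distributions standing in for p_data and p_G.
p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.5, 0.3])

# Optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_G(x)).
d_star = p_data / (p_data + p_g)

# Plug D* into the vanilla GAN objective V(D, G).
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

assert np.isclose(v, 2 * js(p_g, p_data) - 2 * np.log(2))
print("objective at D* matches 2*JS - log 4")
```

The −log 4 offset explains why a maximally confused discriminator (D ≡ 1/2) yields V = −log 4 rather than 0.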

Multi-distribution JS divergence

The JS divergence between two distributions p_1 and p_2 can be rewritten as

JS(p_1 ‖ p_2) = H((p_1 + p_2)/2) − (1/2) H(p_1) − (1/2) H(p_2),

where H(p) denotes the entropy of distribution p. That is, the JS divergence is the entropy of the mean of the two distributions minus the mean of their entropies, which immediately generalizes to multiple distributions. In particular, we define the JS divergence of p_1, p_2, ..., p_n with respect to weights π_1, π_2, ..., π_n (with Σ_i π_i = 1 and π_i ≥ 0) as

JS_{π_1,...,π_n}(p_1, p_2, ..., p_n) = H(Σ_{i=1}^n π_i p_i) − Σ_{i=1}^n π_i H(p_i).

The two-distribution case described above is the special case of this 'multi-JS divergence' with π_1 = π_2 = 1/2. When π_i > 0 for all i, it follows immediately from Jensen's inequality that JS_{π_1,...,π_n}(p_1, p_2, ..., p_n) = 0 if and only if p_1 = p_2 = ... = p_n.
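For discrete distributions, this multi-distribution JS divergence is a few lines of NumPy (a sketch; the test distributions below are illustrative):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]  # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p))

def multi_js(ps, weights):
    # H(sum_i pi_i * p_i) - sum_i pi_i * H(p_i)
    ps, w = np.asarray(ps, float), np.asarray(weights, float)
    mix = w @ ps
    return entropy(mix) - sum(wi * entropy(pi) for wi, pi in zip(w, ps))

# Identical distributions -> divergence 0.
p = np.array([0.4, 0.6])
assert np.isclose(multi_js([p, p, p], [1/3, 1/3, 1/3]), 0.0)

# k distributions with pairwise disjoint supports and equal weights -> log k.
ps = np.eye(4)  # each row is a point mass on a different atom
assert np.isclose(multi_js(ps, [0.25] * 4), np.log(4))
```

The two asserts check the endpoint behaviours used later in the paper: identical distributions give 0, and pairwise disjoint supports with equal weights give the maximum log k.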

3. METHODOLOGY

Training GANs has been challenging, especially when the generated distribution and the data distribution are far away from each other. In such cases, the discriminator often struggles to provide useful information for the generator, leading to instability and mode collapse. The key idea behind our approach is to construct auxiliary intermediate distributions that interpolate between the generated distribution and the data distribution. To do so, we first introduce an encoder network and combine it with the generator to learn the latent representations of real images within the framework of a standard autoencoder. We then perturb these latent representations with carefully designed intervention operations before feeding them into the generator to create the auxiliary interpolating distributions. A classifier is used to distinguish the intervened samples, which leads to an intervention loss that penalizes the dissimilarity of these intervened distributions. The reconstruction loss and the intervention loss are added as regularization terms to the standard GAN loss for training. We start with some notation and definitions.

Definition 1 (Intervention). Let O be a transformation on the space of d-dimensional random vectors and let P be a probability distribution supported in R^d. We call O a P-intervention if for any d-dimensional random vector X, X ∼ P ⇒ O(X) ∼ P.

Since the noise distribution in GAN models is usually taken to be standard Gaussian, we use the standard Gaussian distribution as the default choice of P and abbreviate P-intervention as intervention, unless otherwise stated. One of the simplest groups of interventions is block substitution. Let Z ∈ R^d be a random vector and k ∈ N with k | d. We slice Z into k blocks, each in R^{d/k}. The block substitution intervention O_i replaces the i-th block of Z with fresh Gaussian noise, for i = 1, ..., k.
We will use block substitution interventions in the rest of the paper unless otherwise specified. Note that our theoretical analysis as well as the algorithmic framework do not depend on the specific choice of the intervention group.
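A block substitution intervention is easy to implement. The following NumPy sketch (the function name and batch layout are our own choices, not from the paper) replaces one of k blocks with fresh standard Gaussian noise, an operation that leaves N(0, I_d) invariant:

```python
import numpy as np

rng = np.random.default_rng(0)

def block_substitution(z, i, k):
    """Replace the i-th of k equal blocks of z with fresh standard Gaussian noise.

    If z ~ N(0, I_d), the output is still N(0, I_d), so this is a
    standard-Gaussian-preserving intervention in the paper's sense.
    """
    d = z.shape[-1]
    assert d % k == 0, "k must divide the latent dimension d"
    b = d // k
    out = z.copy()
    out[..., i * b:(i + 1) * b] = rng.standard_normal(out[..., i * b:(i + 1) * b].shape)
    return out

z = rng.standard_normal((5, 100))     # batch of latent codes, d = 100
z1 = block_substitution(z, i=1, k=4)  # perturb block 1 of 4 (dims 25..49)
assert z1.shape == z.shape
assert np.allclose(z1[:, :25], z[:, :25])   # untouched blocks unchanged
assert np.allclose(z1[:, 50:], z[:, 50:])
assert not np.allclose(z1[:, 25:50], z[:, 25:50])
```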

Notation

We use E, G, D, f to denote the encoder, generator, discriminator and classifier, respectively. Throughout, p_real denotes the distribution of the real data X, and p_z the prior distribution of the noise z on the latent space (usually taken to be Gaussian). Let O_i, i = 1, ..., k denote k different interventions and let p_i be the distribution of the intervened sample X′_i produced by O_i (namely X′_i = G(O_i(E(X)))).

Intervention loss The intervention loss is the core of our approach. Given a latent representation z produced by the encoder E, we sample an intervention O_i from a complete group S = {O_1, ..., O_k} and obtain the intervened latent variable O_i(z) with label e_i (the i-th standard basis vector). The perturbed latent representations are then fed into the generator to produce intervened samples. We introduce an auxiliary classifier network f to identify which intervention operation produced each intervened sample. The intervention loss L_IV(G, E) is the resulting negative cross-entropy loss, and we add it as a regularizer to the objective when training the generator. The intervention loss thus penalizes the dissimilarity of the distributions of images generated under different intervention operations. Notice that the classifier on one side, and the generator-encoder pair on the other, play a two-player adversarial game, and we train them in an adversarial manner. In particular, we define L_IV(G, E) = −min_f V_class, where

V_class = E_{i∼U([k])} E_{x′∼p_i} [−e_i^T log f(x′)]. (3)

Theorem 1 (Optimal Classifier). The optimal classifier is the conditional probability of the label y given X′, where X′ is the intervened sample generated by an intervention sampled uniformly from S; and the minimum of the cross-entropy loss equals, up to an additive constant, the negative of the Jensen-Shannon divergence among {p_1, p_2, ..., p_k}.
That is, f*_i(x) = p_i(x) / Σ_{j=1}^k p_j(x) and L_IV(G, E) = JS(p_1, p_2, ..., p_k) − log k. (4)

The proof can be found in Appendix A.1. Clearly, the intervention loss is an approximation of the multi-JS divergence among the intervened distributions {p_i : i ∈ [k]}. If the intervention loss reaches its global minimum, we have p_1 = p_2 = ... = p_k; the multi-JS divergence attains its maximum log k if and only if the supports of the k distributions are pairwise disjoint. Between these extremes the multi-JS divergence is far less likely to take a constant value, which means the phenomenon of vanishing gradients should be rare in IVGAN. Moreover, as the following example shows, thanks to the auxiliary intervened distributions, the intervention loss provides more informative gradients for the generator than are available in other GAN variants.

Example 1 (Square fitting). Let X_0 be a random vector with distribution U(α), where α = [−1/2, 1/2] × [−1/2, 1/2], and let X_1 ∼ U(β), where β = [a − 1/2, a + 1/2] × [1/2, 3/2] and 0 ≤ a ≤ 1. Assuming a perfect discriminator (or classifier), we compute the vanilla GAN loss (i.e., the JS-divergence) and the intervention loss between these two distributions.

• Since α and β are disjoint up to a null set, JS(X_0 ‖ X_1) = log 2, independently of a.

• To compute the intervention loss we need the distributions of the two intervened samples evolved from U(α) and U(β): Y_1 ∼ U(γ_1) with γ_1 = [−1/2, 1/2] × [1/2, 3/2], and Y_2 ∼ U(γ_2) with γ_2 = [a − 1/2, a + 1/2] × [−1/2, 1/2]. The intervention loss is then, up to the additive constant −log 4, the multi-JS divergence among these four distributions. Each component is uniform on a unit square, so its differential entropy is 0; the equal-weight mixture has density 1/2 on the overlap region A, with µ(A) = 2(1 − a), and density 1/4 on A^c, with µ(A^c) = 4a. Hence

JS(X_0, X_1, Y_1, Y_2) = −∫_{A^c} (1/4) log(1/4) dµ − ∫_A (1/2) log(1/2) dµ = (log 2 / 2) [µ(A^c) + µ(A)] = (1 + a) log 2.

Here A is the shaded part of Figure 2 and A^c = (α ∪ β ∪ γ_1 ∪ γ_2) \ A. The most important observation is that the intervention loss is a non-constant (indeed linear) function of the parameter a, whereas the traditional GAN loss is constant.
If we replace the JS divergence with any other f-divergence, the distance between U(α) and U(β) still remains constant in a. Hence in this situation the standard GAN loss provides no information for training the generator, while the intervention loss works well.

Steps of Algorithm 1 (inner loop):
2: Sample a minibatch z_j ∼ p_z, j = 1, ..., n
3: Sample a minibatch x_j ∼ p_real, j = 1, ..., n
4: for number of inner iterations do
5:   w_j ← E(x_j), j = 1, ..., n
6:   Sample Gaussian noise
7:   Sample i_j ∈ [k], j = 1, ..., n
8:   x′_j ← G(O_{i_j}(w_j))
9:   Update the parameters of D by:
10:    θ_D ← θ_D − (α / 2n) ∇_{θ_D} L_Adv(θ_D)
11:   Update the parameters of f by:
12:    θ_f ← θ_f + (α / n) ∇_{θ_f} Σ_{j=1}^n log f_{i_j}(x′_j)
13:   Calculate L_Adv and L_IV
14:   Update the parameters of G by:
15:    θ_G ← θ_G + (α / n) ∇_{θ_G} (L_Adv + λ L_recon + µ L_IV)
16:   Update the parameters of E by:
17:    θ_E ← θ_E + (α / n) ∇_{θ_E} (λ L_recon + µ L_IV)
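Example 1 can be checked numerically on a grid. This sketch assumes our reading of the geometry (four unit squares α, β, γ_1, γ_2 as defined in the example, with a = 1/2); each component is uniform on a unit square with differential entropy 0, so the four-way JS divergence reduces to the differential entropy of the equal-weight mixture. The check confirms that the four-distribution JS varies linearly with a while the plain JS between U(α) and U(β) stays at log 2:

```python
import numpy as np

a = 0.5
h = 0.005  # grid step; 0.5 is a multiple of h, so cells align with square edges
xs = np.arange(-0.5 + h / 2, a + 0.5, h)   # cell centers covering all squares
ys = np.arange(-0.5 + h / 2, 1.5, h)
X, Y = np.meshgrid(xs, ys)

def uniform_sq(x0, y0):
    # density of U([x0, x0+1] x [y0, y0+1]) evaluated at the cell centers
    return ((X >= x0) & (X <= x0 + 1) & (Y >= y0) & (Y <= y0 + 1)).astype(float)

p0 = uniform_sq(-0.5, -0.5)     # X0 ~ U(alpha)
p1 = uniform_sq(a - 0.5, 0.5)   # X1 ~ U(beta)
q1 = uniform_sq(-0.5, 0.5)      # Y1 ~ U(gamma_1)
q2 = uniform_sq(a - 0.5, -0.5)  # Y2 ~ U(gamma_2)

# Four-way JS = differential entropy of the mixture (component entropies are 0).
mix = (p0 + p1 + q1 + q2) / 4
m = mix > 0
js4 = -np.sum(mix[m] * np.log(mix[m])) * h * h
assert abs(js4 - (1 + a) * np.log(2)) < 1e-2   # varies with a

# Plain JS between the two original squares: constant log 2 (disjoint supports).
mix2 = (p0 + p1) / 2
m2 = mix2 > 0
js2 = -np.sum(mix2[m2] * np.log(mix2[m2])) * h * h
assert abs(js2 - np.log(2)) < 1e-2
```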

Reconstruction loss

In some sense we expect the encoder to invert the generator. It is therefore necessary for the objective to contain a term pushing the composition of the encoder and the generator to reconstruct real samples; likewise, we want the latent representations to be recoverable from samples in pixel space. Formally, the reconstruction loss can be defined via an ℓ_p norm (p ≥ 1) between the two samples, or as a Wasserstein distance if images are regarded as histograms. Here we use the ℓ_1 norm:

L_recon = E_{X∼p_real} ‖G(E(X)) − X‖_1 + E_{i∼U([k])} E_{z∼p_z} ‖E(G(O_i(z))) − O_i(z)‖_1. (5)

Theorem 2 (Inverse Distribution). Suppose the cumulative distribution function of O_i(z) is q_i. For any given positive real number ε, there exists δ > 0 such that if L_recon + L_IV ≤ δ, then for all i, j ∈ [k], sup_r |q_i(r) − q_j(r)| ≤ ε. The proof is in Appendix A.2.
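The two ℓ_1 terms of Eq. (5) can be sketched as follows, with trivial stand-ins for G and E (tanh and its clipped inverse, chosen only so that the pair is approximately invertible; none of this is the paper's architecture):

```python
import numpy as np

def l1(u, v):
    # L1 norm of the difference, averaged over the batch axis
    return np.mean(np.sum(np.abs(u - v), axis=tuple(range(1, u.ndim))))

# Hypothetical stand-ins for the networks (the real G and E are deep nets).
G = lambda z: np.tanh(z)                              # generator: latent -> "image"
E = lambda x: np.arctanh(np.clip(x, -0.999, 0.999))   # encoder: "image" -> latent

rng = np.random.default_rng(1)
x = np.tanh(rng.standard_normal((8, 16)))  # toy "real" batch in G's range
z = rng.standard_normal((8, 16))

# A block substitution intervention on the first 4 of 16 latent dimensions.
O_i = lambda w: np.concatenate([rng.standard_normal(w[:, :4].shape), w[:, 4:]], axis=1)

w = O_i(z)
# Eq. (5): image-space term + latent-space term.
loss_recon = l1(G(E(x)), x) + l1(E(G(w)), w)
assert loss_recon >= 0 and np.isfinite(loss_recon)
```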

Adversarial loss

The intervention loss and the reconstruction loss can be added as regularization terms to the adversarial loss of many GAN models, e.g., the binary cross-entropy loss of the vanilla GAN or the least-squares loss of LSGAN. In our experiments we use LSGAN (Mao et al., 2016) and DCGAN (Radford et al., 2015) as base models, and name the resulting IVGAN models IVLSGAN and IVDCGAN, respectively. Having introduced the essential components of the IVGAN objective, we can write the loss function of the entire model: L_model = L_Adv + λ L_recon + µ L_IV, where λ and µ are the regularization coefficients for the reconstruction loss and the intervention loss, respectively. We summarize the training procedure in Algorithm 1. A diagram of the full workflow of our framework can be found in Figure 3.
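The classifier term V_class in Eq. (3) is an ordinary softmax cross-entropy over the k intervention labels. A NumPy sketch (the logits here are random placeholders for the output of a classifier head):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def intervention_ce(logits, labels):
    # V_class: mean cross-entropy of the classifier on intervened samples,
    # where labels[j] = index i of the intervention O_i applied to sample j
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

k, n = 4, 32
rng = np.random.default_rng(2)
labels = rng.integers(0, k, size=n)
logits = rng.standard_normal((n, k))  # hypothetical classifier outputs
ce = intervention_ce(logits, labels)
assert ce > 0

# A perfectly confident, correct classifier drives V_class to 0,
# i.e. the intervention loss -min_f V_class toward its maximum.
perfect = np.full((n, k), -100.0)
perfect[np.arange(n), labels] = 100.0
assert intervention_ce(perfect, labels) < 1e-6
```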

4. EXPERIMENTS

In this section we conduct a series of experiments to study IVGAN from multiple aspects. First we evaluate IVGAN's performance on standard real-world datasets. Then we show IVGAN's ability to tackle the mode collapse problem on the stacked MNIST dataset. Finally, through an ablation study we investigate the performance of our proposed method under different hyperparameter settings and demonstrate the effectiveness of the intervention loss. We implement our models in PyTorch (Paszke et al., 2019) with the Adam optimizer (Kingma & Ba, 2015). Network architectures are chosen fairly across the baseline methods and IVGAN (see Table B.1 in the appendix for details). The classifier used to compute the intervention loss shares its parameters with the discriminator except for the output layer. All input images are resized to 64 × 64 pixels. We use a 100-dimensional standard Gaussian distribution as the prior p_z, and deploy the instance noise technique as in (Jenni & Favaro, 2019). See Appendix B.2 for detailed hyperparameter settings. All experiments are run on a single NVIDIA RTX 2080Ti GPU. Although IVGAN introduces extra computation to the original GAN framework, its training cost remains within an acceptable range thanks to strategies such as parameter sharing.

Real-world datasets experiments

We first test IVGAN on four standard real-world datasets, including CIFAR-10 (Krizhevsky, 2009), MNIST (Lecun et al., 1998), STL-10 (Coates et al., 2011), and the "church_outdoor" subclass of the LSUN dataset (Yu et al., 2015), to investigate its training stability and the quality of the generated images. We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) to measure the performance of all methods. The FID results are listed in Table 1, and the training curves of the baseline methods and IVGAN on the four datasets are shown in Figure 5. On each dataset, the IVGAN counterparts obtain better FID scores than their corresponding baselines. Moreover, the training curves also suggest that the learning processes of IVDCGAN and IVLSGAN are smoother and steadier than those of DCGAN, LSGAN and MRGAN (Che et al., 2016), and converge faster than WGAN-GP. Samples of generated images on all datasets are presented in Figure 4. In the stacked MNIST experiments below, the mode of a generated image is identified using a pre-trained MNIST digit classifier.

Stacked MNIST experiments

Our results are shown in Table 2. Our model works very well at preventing mode collapse: both IVLSGAN and IVDCGAN reach all 1,000 modes and greatly outperform earlier approaches to mitigating mode collapse, such as VEEGAN (Srivastava et al., 2017) and Unrolled GAN (Metz et al., 2017). Moreover, the performance of our model is comparable to more recently proposed methods such as PacDCGAN (Lin et al., 2018). Figure 6 shows images generated randomly by our model and by the baseline methods.

Ablation study Our ablation study is conducted on the CIFAR-10 dataset. First, we show the effectiveness of the intervention loss. We consider two cases: IVLSGAN without the intervention loss (µ = 0) and standard IVLSGAN (µ = 0.5). From Figure 7 we find that the intervention loss makes the training process much smoother and leads to a lower final FID score. We also investigate the performance of our model with different numbers of blocks for the block substitution interventions and different regularization coefficients for the intervention loss. The results are presented in Table 3. To some extent, our model is not sensitive to the choice of hyperparameters and performs well under several different settings. However, when the number of blocks or the scale of the intervention loss becomes too large, performance degrades.

5. CONCLUSION

We have presented a novel model, Intervention GAN (IVGAN), to stabilize the training process of GANs and alleviate the mode collapse problem. By applying auxiliary Gaussian-invariant interventions to the latent representations of real images and feeding these perturbed representations into the generator, we create intermediate distributions that interpolate between the generated distribution of GAN and the data distribution. The intervention loss based on these auxiliary intervened distributions, together with the reconstruction loss, is added as a regularizer to the objective to provide more informative gradients for the generator, significantly improving GAN's training stability and alleviating mode collapse. We have conducted a detailed theoretical analysis of the proposed approach and illustrated the advantage of the intervention loss on a toy example. Experiments on both real-world datasets and the stacked MNIST dataset demonstrate that, compared to the baseline methods, IVGAN variants train more stably and smoothly, and generate images of higher quality (achieving state-of-the-art FID scores) and diversity. We believe our approach can also be applied to other generative models such as Adversarial Autoencoders (Makhzani et al., 2015), which we leave to future work.

A PROOFS

A.1 PROOF OF THEOREM 1

Proof. The conditional probability of X′ given label e_i is P(X′ | e_i) = p_i(X′), so the joint distribution is P(X′, e_i) = (1/k) p_i(X′). Denote the marginal distribution of x by p(x) = (1/k) Σ_{i=1}^k p_i(x), so that p(e_i | x) = p_i(x) / Σ_{j=1}^k p_j(x). Since the activation at the output layer of the classifier is a softmax, we can rewrite the loss function in a more explicit form:

V_class(f) = E_{i∼U([k])} E_{x′∼p_i} [−e_i^T log f(x′)] = E_{i∼U([k])} E_{x′∼p_i} [−log f_i(x′)] = (1/k) Σ_{i=1}^k ∫ −p_i(x) log f_i(x) dx = ∫ p(x) [ −Σ_{i=1}^k p(e_i | x) log f_i(x) ] dx.

Let g_i(x) = f_i(x) / p(e_i | x); then Σ_{i=1}^k p(e_i | x) g_i(x) = Σ_i f_i(x) = 1, and also Σ_{i=1}^k p(e_i | x) = 1. By Jensen's inequality,

Σ_i −p(e_i | x) log f_i(x) = Σ_i −p(e_i | x) log [g_i(x) p(e_i | x)] = Σ_i −p(e_i | x) log g_i(x) + H(p(· | x)) ≥ −log Σ_i p(e_i | x) g_i(x) + H(p(· | x)) = −log 1 + H(p(· | x)) = H(p(· | x)),

with equality if and only if g_i(x) = g_j(x) for all i ≠ j, i.e. f_i(x) / p(e_i | x) = r(x) for all i ∈ [k]. Since Σ_i f_i(x) = 1 = Σ_i p(e_i | x), this forces f*_i(x) = p(e_i | x). The minimum loss is therefore

V_class(f*) = ∫ p(x) H(p(· | x)) dx = (1/k) Σ_{i=1}^k ∫ −p_i(x) log p(e_i | x) dx = −H(p) + (1/k) Σ_{i=1}^k H(p_i) + log k = −JS(p_1, p_2, ..., p_k) + log k.

A.2 PROOF OF THEOREM 2

Proof. By Theorem 1, for any given ε_1 > 0 there exists δ_1 > 0 such that, whenever the intervention loss is less than δ_1, the JS-divergence between p_i and p_j is less than ε_1. Since convergence in JS-divergence and convergence in total variation (TV) distance are equivalent, we can bound the TV distance between p_i and p_j by their JS-divergence: ∫ |p_i − p_j| dx ≤ ε_0 whenever the intervention loss is small enough (choosing ε_1 according to the given ε_0). Using this conclusion we can deduce that

It is shown that MRGAN ameliorates GAN's 'mode missing'-prone weakness to some extent. However, both of these methods fail to fully exploit the interaction between VAEs and GANs. Kim & Mnih (2018) proposed Factor VAE, where a regularization term called the total correlation penalty is added to the traditional VAE loss. The total correlation is essentially the Kullback-Leibler divergence between the joint distribution p(z_1, z_2, ..., z_d) and the product of the marginal distributions p(z_i). Because the closed forms of these two distributions are unavailable, Factor VAE uses adversarial training to approximate the likelihood ratio.
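Theorem 1 can be verified numerically on discrete toy distributions: the classifier f*_i = p_i / Σ_j p_j attains V_class = log k − JS(p_1, ..., p_k), and any perturbation of it only increases the loss (a NumPy sketch with random toy distributions):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(3)
k, m = 3, 6  # k interventions, m-point sample space
ps = rng.random((k, m))
ps /= ps.sum(axis=1, keepdims=True)   # rows are the distributions p_1..p_k

mix = ps.mean(axis=0)                 # p(x) = (1/k) sum_i p_i(x)
f_star = ps / ps.sum(axis=0)          # f*_i(x) = p_i(x) / sum_j p_j(x)

# V_class at the optimum: (1/k) sum_i E_{x~p_i}[-log f*_i(x)]
v_star = -np.mean(np.sum(ps * np.log(f_star), axis=1))

js = entropy(mix) - np.mean([entropy(p) for p in ps])
assert np.isclose(v_star, np.log(k) - js)

# Any other valid classifier gives a larger cross-entropy.
pert = f_star + 0.01 * rng.random((k, m))
pert /= pert.sum(axis=0)
v_pert = -np.mean(np.sum(ps * np.log(pert), axis=1))
assert v_pert >= v_star - 1e-12
```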



Empirically IVGANs are approximately times slower than their corresponding baseline methods.



Figure 1: Comparison between the vanilla GAN loss and the Intervention loss. Here the intervened samples are generated through different intervention operations, namely O 1 , ..., O k .

Figure 2: The supports of the two original distribution are the squares with black border, and the supports of the synthetic distributions are the area enclosed by red and blue dotted line, respectively.

Figure 3: Full workflow of our approach.

Figure 4: Random samples of generated images on MNIST, CIFAR-10, LSUN and STL-10.

Figure 5: Training curves of different methods in terms of FID on different datasets, averaged over five runs. Left: CIFAR-10. Right: Church Outdoors. Note that the rise of the curves in the later stages may indicate mode collapse.

Figure 6: Sampled images on the stacked MNIST dataset. Left: Ground-truth. Middle: LSGAN. Right: IVLSGAN. Images generated by our method are more diverse.

Figure 7: Training curve of IVLSGAN, with and without the intervention loss.

|P(E(G(O_i(z))) ≤ r) − P(E(G(O_j(z))) ≤ r)| ≤ ε_0, where r is an arbitrary vector in R^d. Further, we have

|P(O_i(z) ≤ r) − P(O_j(z) ≤ r)| ≤ |P(O_i(z) ≤ r; ‖O_i(z) − E(G(O_i(z)))‖ > δ)| + |P(O_j(z) ≤ r; ‖O_j(z) − E(G(O_j(z)))‖ > δ)| + |P(O_i(z) ≤ r; ‖O_i(z) − E(G(O_i(z)))‖ ≤ δ) − P(O_j(z) ≤ r; ‖O_j(z) − E(G(O_j(z)))‖ ≤ δ)|. (8)

We control the three terms on the right-hand side separately. For the first,

P(O_i(z) ≤ r; ‖O_i(z) − E(G(O_i(z)))‖ > δ) ≤ P(‖O_i(z) − E(G(O_i(z)))‖ > δ) ≤ E‖O_i(z) − E(G(O_i(z)))‖ / δ, (9)

and the last expression is bounded by the reconstruction loss (divided by δ). The same argument applies to P(O_j(z) ≤ r; ‖O_j(z) − E(G(O_j(z)))‖ > δ). Moreover, we have

P(E(G(O_i(z))) ≤ r − δ) − P(‖O_i(z) − E(G(O_i(z)))‖ > δ) ≤ P(O_i(z) ≤ r; ‖O_i(z) − E(G(O_i(z)))‖ ≤ δ) ≤ P(E(G(O_i(z))) ≤ r + δ). (10)

Let s_i(r, δ) = |P(E(G(O_i(z))) ≤ r ± δ) − P(E(G(O_i(z))) ≤ r)|. Then the last term of inequality (8) can be bounded as

|P(O_i(z) ≤ r; ‖O_i(z) − E(G(O_i(z)))‖ ≤ δ) − P(O_j(z) ≤ r; ‖O_j(z) − E(G(O_j(z)))‖ ≤ δ)| ≤ |P(E(G(O_i(z))) ≤ r) − P(E(G(O_j(z))) ≤ r)| + P(‖O_i(z) − E(G(O_i(z)))‖ > δ) + s_i(r, δ) + s_j(r, δ). (11)

Every term on the right-hand side of the inequality can be made arbitrarily close to 0 by the inequalities above.

Algorithm 1 Intervention GAN
Input: learning rate α, regularization parameters λ and µ, dimension d of the latent space, number k of blocks into which the latent space is divided, minibatch size n, Hadamard multiplier *
1: for number of training iterations do

Table 1: Minimum FIDs on different datasets. The FID is computed every 10 epochs and averaged over five independent runs. Lower is better.

Table 2: Results of our stacked MNIST experiments. The first four rows are copied directly from (Lin et al., 2018) and (Srivastava et al., 2017). The last three rows are obtained after training each model for 100K iterations.

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=BydrOIcle.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.

Akash Srivastava, Lazar Valkov, Chris Russell, Michael U. Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308-3318, 2017.

C RELATED WORK

In order to address GAN's unstable training and mode missing problems, many researchers have turned their attention to the latent representations of samples. Makhzani et al. (2015) proposed the Adversarial Autoencoder (AAE). As its name suggests, AAE is essentially a probabilistic autoencoder built on the GAN framework. Unlike classical GAN models, in AAE the discriminator's task is to distinguish the latent representations of real images, produced by an encoder network, from Gaussian noise, while the generator and the encoder are trained to fool the discriminator as well as to reconstruct the input image from the encoded representation. However, the generator can only be trained to fit the inverse of the encoder and cannot get any information from the latent representation. The VAE-GAN (Larsen et al., 2015) combines the objective function of a VAE with a GAN and utilizes the features learned by the discriminator as a better image similarity metric, which greatly helps sample visual fidelity. From the opposite perspective, Che et al. (2016) argue that the learning process of a generative model can be divided into a manifold learning phase and a diffusion learning phase, the former being the source of the mode missing problem. Che et al. (2016) then proposed Mode Regularized Generative Adversarial Nets (MRGAN), which introduce a reconstruction loss term into the GAN training objective to penalize missing modes.

