THE BURES METRIC FOR TAMING MODE COLLAPSE IN GENERATIVE ADVERSARIAL NETWORKS

Abstract

Generative Adversarial Networks (GANs) are performant generative methods yielding high-quality samples. However, under certain circumstances, the training of GANs can lead to mode collapse or mode dropping, i.e., the generative model failing to sample from the entire probability distribution. To address this problem, we use the last layer of the discriminator as a feature map with which to study the distribution of the real and the fake data. During training, we propose to match the diversity of a real batch to that of a fake batch by using the Bures distance between covariance matrices in feature space. The Bures distance can be conveniently computed in either feature space or kernel space, in terms of the covariance matrix and the kernel matrix respectively. We observe that diversity matching reduces mode collapse substantially and has a positive effect on sample quality. On the practical side, a very simple training procedure that does not require additional hyperparameter tuning is proposed and assessed on several datasets.

1. INTRODUCTION

In several machine learning applications, data is assumed to be sampled from an implicit probability distribution. The estimation of this implicit distribution is often intractable, especially in high dimensions. To tackle this issue, generative models are trained to provide an algorithmic procedure for sampling from this unknown distribution. Popular approaches are Variational Auto-Encoders, proposed by Kingma & Welling (2014), normalizing flow models, by Rezende & Mohamed (2015), and Generative Adversarial Networks (GANs), initially developed by Goodfellow et al. (2014). The latter are particularly successful at producing high quality samples, especially in the case of natural images, though their training is notoriously difficult. The vanilla GAN consists of two networks: a generator and a discriminator. The generator maps random noise, usually drawn from a multivariate normal, to fake data in input space. The discriminator estimates the likelihood ratio of the generator network to the data distribution. It often happens that a GAN generates samples only from a few of the many modes of the distribution; this phenomenon is called 'mode collapse'.

Contribution. We propose BuresGAN: a generative adversarial network whose objective function is that of a vanilla GAN complemented by an additional term, given by the squared Bures distance between the covariance matrices of real and fake batches in a latent space. This loss function promotes a matching of fake and real data in a feature space R^f, so that mode collapse is reduced. Conveniently, the Bures distance admits both a feature space and a kernel based expression. Contrary to other related approaches, such as Che et al. (2017) or Srivastava et al. (2017), the architecture of the GAN is unchanged; only the objective is modified.
A variant called Alt-BuresGAN, which is trained with alternating minimization, achieves competitive performance with a simple training procedure that does not require hyperparameter tuning or an additional regularization such as a gradient penalty. We empirically show that the proposed methods are robust with respect to the choice of architecture and do not require an additional fine architecture search. Finally, an extra asset of BuresGAN is that it yields competitive or improved IS and FID scores compared with the state of the art on CIFAR-10 and STL-10 using a ResNet architecture.

Related works. The Bures distance is closely related to the Fréchet distance (Dowson & Landau, 1982), which is a 2-Wasserstein distance between multivariate normal distributions. Namely, the Fréchet distance between multivariate normals with equal means is the Bures distance between their covariance matrices. The Bures distance also coincides with the exact expression for the 2-Wasserstein distance between two elliptically contoured distributions with the same mean, as shown in Gelbrich (1990) and Peyré et al. (2019). Noticeably, the Fréchet Inception Distance (FID) is a popular way to assess the quality of generative models. This score uses the Fréchet distance between real and generated samples in the feature space of a pre-trained inception network, as explained in Salimans et al. (2016) and Heusel et al. (2017). There exist numerous works aiming to improve the training efficiency of generative networks. For mode collapse evaluation, we compare BuresGAN to the most closely related works. GDPP-GAN (Elfeki et al., 2019) and VEEGAN (Srivastava et al., 2017) also try to enforce diversity in 'latent' space. GDPP-GAN matches the eigenvectors and eigenvalues of the real and fake diversity kernels. In VEEGAN, an additional reconstructor network is introduced to map the true data distribution to Gaussian random noise.
In a similar way, architectures with two discriminators are analysed by Nguyen et al. (2017), while MADGAN (Ghosh et al., 2018) uses multiple discriminators and generators. A different approach is taken by Unrolled-GAN (Metz et al., 2017), which updates the generator with respect to an unrolled optimization of the discriminator. This allows the training to interpolate between using the current discriminator in the generator's objective and using the optimal discriminator, which is ideal but infeasible in practice. Wasserstein GANs (Arjovsky et al., 2017; Gulrajani et al., 2017) leverage the 1-Wasserstein distance to match the real and generated data distributions. In MDGAN (Che et al., 2017), a regularization is added to the objective function so that the generator can take advantage of another similarity metric with more predictable behavior. This idea is combined with a penalization of the missing modes. More recent approaches to reducing mode collapse include variations of WGAN (Wu et al., 2018). Entropic regularization has also been proposed in PresGAN (Dieng et al., 2020), while metric embeddings were used in the paper introducing BourGAN (Xiao et al., 2018). A simple packing procedure which significantly reduces mode collapse was proposed in PacGAN (Lin et al., 2018), which we also consider in our comparisons hereafter.

2. METHOD

A GAN consists of a discriminator D : R^d → R and a generator G : R^ℓ → R^d, typically defined by neural networks and parametrized by real vectors. The value D(x) gives the probability that x comes from the empirical distribution, while the generator G maps a point z of the latent space R^ℓ to a point in the input space R^d. The training of a GAN consists in solving

min_G max_D E_{x∼p_d}[log D(x)] + E_{x∼p_g}[log(1 - D(x))],        (1)

by alternating two phases of training. In equation 1, the expectation in the first term is over the empirical data distribution p_d, while the expectation in the second term is over the generated data distribution p_g, implicitly given by the mapping by G of the latent prior distribution N(0, I_ℓ). It is common to define and minimize the discriminator loss V_D = -E_{x∼p_d}[log D(x)] - E_{x∼p_g}[log(1 - D(x))]. In practice, it is proposed in Goodfellow et al. (2014) to minimize the generator loss V_G = -E_{z∼N(0,I_ℓ)}[log D(G(z))], rather than the second term of equation 1, for improved training efficiency.

Matching real and fake data covariance. To prevent mode collapse, we encourage the generator to sample fake data of similar diversity to the real data. This is achieved by matching the sample covariance matrices of the real and the fake data. Covariance matching and similar ideas were explored for GANs in Mroueh et al. (2017) and Elfeki et al. (2019). In order to compare covariance matrices, we propose to use the squared Bures distance between f × f positive semi-definite matrices (Bhatia et al., 2019), i.e.,

B(A, B)^2 = min_{U ∈ O(f)} ||A^{1/2} - B^{1/2} U||_F^2 = Tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2}).

Being a Riemannian metric on the manifold of positive semi-definite matrices (Massart & Absil, 2020), the Bures metric is well suited to compare covariance matrices. The covariances are defined in a feature space associated with the discriminator.
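For numerical intuition, the squared Bures distance above can be evaluated with a small eigendecomposition-based routine. This is only an illustrative sketch, not the training implementation (which avoids eigendecompositions in favor of the Newton-Schulz iteration discussed later); the function names are ours.

```python
import numpy as np

def sqrtm_psd(A):
    """Principal square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def bures_sq(A, B):
    """Squared Bures distance Tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2})."""
    As = sqrtm_psd(A)
    return np.trace(A) + np.trace(B) - 2.0 * np.trace(sqrtm_psd(As @ B @ As))
```

On random covariance matrices, `bures_sq` is nonnegative, symmetric in its arguments, and vanishes exactly when A = B, consistent with it being a distance.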
More precisely, the last layer of the discriminator, denoted by φ(x) ∈ R^f, defines a feature map, namely D(x) = σ(w^⊤ φ(x)), where w is the weight vector of the last dense layer and σ is the sigmoid function. We use the normalization φ̄(x) = φ(x)/||φ(x)||_2, after the centering of φ(x). Then, we define a covariance matrix as C(p) = E_{x∼p}[φ̄(x) φ̄(x)^⊤].

The training described in Algorithm 1 is analogous to the training of GDPP-GAN, although the additional generator loss is rather different. The computational advantage of the Bures distance is that it admits two expressions which can be evaluated numerically in a stable way; in particular, there is no need to propagate a gradient update through an eigendecomposition.

Feature space expression. In the training procedure, real data x_i^(d) and fake data x_i^(g), with i = 1, . . . , b, are sampled respectively from the empirical distribution and from the mapping of the normal distribution N(0, I_ℓ) by the generator. Consider the case where the batch size b is larger than the feature space dimension f. Let the embeddings of the batches in feature space be Φ_α = [φ(x_1^(α)), . . . , φ(x_b^(α))]^⊤ ∈ R^{b×f} with α = d, g. The covariance matrix of one batch in feature space¹ is Ĉ = Φ̄^⊤ Φ̄, where Φ̄ is the ℓ2-normalized centered feature map of the batch. Numerical instabilities can be avoided by adding a small number, e.g. 10^{-14}, to the diagonal elements of the covariance matrices, so that, in practice, we only deal with strictly positive definite matrices. From the computational perspective, an interesting alternative expression for the Bures distance is

B(C_d, C_g)^2 = Tr(C_d + C_g - 2 (C_g C_d)^{1/2}),        (4)

whose computation requires only one matrix square root. This identity can be obtained from Lemma 1; an analogous result is proved in Oh et al. (2020).

Lemma 1. Let A and B be f × f symmetric positive semi-definite matrices and let B = Y Y^⊤.
Then, we have: (i) AB is diagonalizable with nonnegative eigenvalues, and (ii) Tr((AB)^{1/2}) = Tr((Y^⊤ A Y)^{1/2}).

Kernel based expression. Alternatively, if the feature space dimension f is larger than the batch size b, it is more efficient to compute B(Ĉ_d, Ĉ_g) thanks to the b × b kernel matrices K_d = Φ̄_d Φ̄_d^⊤, K_g = Φ̄_g Φ̄_g^⊤ and K_dg = Φ̄_d Φ̄_g^⊤. Then, we have the kernel based expression

B(Ĉ_d, Ĉ_g)^2 = Tr(K_d + K_g - 2 (K_dg K_dg^⊤)^{1/2}),        (5)

which allows the Bures distance between covariance matrices to be calculated by computing a matrix square root of a b × b matrix. This is a consequence of Lemma 2.

Lemma 2. The matrices X^⊤ X Y^⊤ Y and Y X^⊤ X Y^⊤ are diagonalizable with nonnegative eigenvalues and share the same non-zero eigenvalues.

Connection with Wasserstein GAN and integral probability metrics. The Bures distance is proportional to the 2-Wasserstein distance W_2 between two elliptically contoured distributions with the same mean (Gelbrich, 1990). For instance, in the case of multivariate normal distributions, we have B(A, B)^2 = min_π E_{(X,Y)∼π} ||X - Y||_2^2 s.t. X ∼ N(0, A) and Y ∼ N(0, B), where the minimization is over the joint distributions π with these marginals. More precisely, in this paper, we make the approximation that the implicit distributions of the real and generated data in the feature space R^f (associated with φ̄(x)) are elliptically contoured with the same mean. Under different assumptions, the Generative Moment Matching Networks (Ren et al., 2016; Li et al., 2017) work in the same spirit, but use a different approach to match covariance matrices. On the contrary, WGAN uses the Kantorovich dual formula for the 1-Wasserstein distance: W_1(α, β) = sup_{f ∈ Lip_1} ∫ f d(α - β), where α, β are signed measures. Generalizations of such integral formulae are called integral probability metrics (see for instance Binkowski et al. (2018)).
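The agreement between the symmetric definition of the Bures distance, the one-square-root feature space formula in equation 4, and the kernel based formula in equation 5 can be checked numerically on random batches. The sketch below is a standalone NumPy illustration (batch size and feature dimension are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
b, f = 8, 32  # batch size smaller than feature dimension

def normalize(Phi):
    Phi = Phi - Phi.mean(axis=0, keepdims=True)              # center over the batch
    return Phi / np.linalg.norm(Phi, axis=1, keepdims=True)  # l2-normalize each row

def sqrtm_psd(A):
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def trace_sqrt(M):
    # Tr(M^{1/2}) for a matrix diagonalizable with nonnegative eigenvalues (Lemma 1)
    return np.sqrt(np.clip(np.linalg.eigvals(M).real, 0, None)).sum()

Phi_d = normalize(rng.standard_normal((b, f)))
Phi_g = normalize(rng.standard_normal((b, f)))
C_d, C_g = Phi_d.T @ Phi_d, Phi_g.T @ Phi_g                  # f x f covariances

# Symmetric definition: Tr(C_d + C_g - 2 (C_d^{1/2} C_g C_d^{1/2})^{1/2})
S = sqrtm_psd(C_d)
bures_sym = np.trace(C_d) + np.trace(C_g) - 2 * np.trace(sqrtm_psd(S @ C_g @ S))

# Feature space formula (equation 4): a single square root of C_g C_d
bures_feat = np.trace(C_d) + np.trace(C_g) - 2 * trace_sqrt(C_g @ C_d)

# Kernel based formula (equation 5): only b x b matrices are involved
K_d, K_g, K_dg = Phi_d @ Phi_d.T, Phi_g @ Phi_g.T, Phi_d @ Phi_g.T
bures_ker = np.trace(K_d) + np.trace(K_g) - 2 * trace_sqrt(K_dg @ K_dg.T)
```

The three quantities coincide up to numerical error, which is the practical content of Lemmas 1 and 2: when b < f one can work entirely with the b × b kernel matrices.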
Here, f is the discriminator, so that the maximization over Lipschitz functions f plays the role of the maximization over discriminator parameters in the min-max game of equation 1. Then, in the training procedure, this maximization alternates with a minimization over the generator parameters. We can now discuss the connection with Wasserstein GAN. Coming back to the definition of BuresGAN, we can now explain that the 2-Wasserstein distance provides an upper bound on an integral probability metric. Then, if we assume that the densities are elliptically contoured distributions in feature space, the use of the Bures distance to calculate W_2 allows us to spare the maximization over the discriminator parameters, and this motivates why the optimization of B only influences updates of the generator in Algorithm 1 and Algorithm 2. Going into more detail, the 2-Wasserstein distance between two probability densities (w.r.t. the same measure) is equivalent to a Sobolev dual norm, which can be interpreted as an integral probability metric. Indeed, let the Sobolev seminorm be ||f||_{H^1} = (∫ ||∇f(x)||^2 dx)^{1/2}. Its dual norm over signed measures is defined as ||ν||_{H^{-1}} = sup_{||f||_{H^1} ≤ 1} ∫ f dν. It is then shown in Peyre (2018) and Peyré et al. (2019) that there exist two positive constants c_1 and c_2 such that c_1 ||α - β||_{H^{-1}} ≤ W_2(α, β) ≤ c_2 ||α - β||_{H^{-1}}. Hence, the 2-Wasserstein distance gives an upper bound on an integral probability metric.

Algorithmic details. The matrix square root in equation 4 and equation 5 is obtained thanks to the Newton-Schulz algorithm, which is inversion free and can be efficiently run on GPUs since it involves only matrix products. In practice, we found 15 iterations of this algorithm to be sufficient for the small scale datasets, while 20 iterations were used for the ResNet examples. A small regularization term 10^{-14} is added for stability. The latent prior distribution is N(0, I_ℓ) with ℓ = 100, and the parameter in Algorithm 1 is always set to λ = 1.
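A minimal version of the Newton-Schulz iteration for the matrix square root can be written with matrix products only. This is a sketch rather than the authors' implementation; the initial scaling by the Frobenius norm, assumed here, is one standard way to bring the spectrum into the region of convergence of the iteration.

```python
import numpy as np

def newton_schulz_sqrt(A, num_iters=20):
    """Matrix square root of a symmetric positive definite A using the
    inversion-free Newton-Schulz iteration (only matrix products, GPU friendly)."""
    n = A.shape[0]
    nrm = np.linalg.norm(A)       # Frobenius norm; scales eigenvalues into (0, 1]
    Y = A / nrm                   # Y converges to (A / nrm)^{1/2}
    Z = np.eye(n)                 # Z converges to (A / nrm)^{-1/2}
    for _ in range(num_iters):
        T = 0.5 * (3.0 * np.eye(n) - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return Y * np.sqrt(nrm)       # undo the normalization
```

For a well-conditioned matrix, 15 to 20 iterations already reach close to machine precision, in line with the iteration counts reported above.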
In the tables hereafter, we indicate the largest scores in bold, although we invite the reader to also consider the standard deviation.

3. EMPIRICAL EVALUATION OF MODE COLLAPSE

The performance of BuresGAN and Alt-BuresGAN on synthetic data and on artificial and real images is compared with the standard DCGAN (Radford et al., 2016), WGAN-GP, MDGAN, Unrolled GAN, VEEGAN, GDPP and PacGAN. We want to emphasize that the purpose of this experiment is not to challenge these baselines, but to report the improvement obtained by adding the Bures metric to the objective function. It would be straightforward to add the Bures loss to other GAN variants, as well as to most GAN architectures, and we would expect an improvement in mode coverage and generation quality. In the experiments, we notice that adding the Bures loss to the vanilla GAN already significantly improves the results. A low dimensional feature space (f = 128) is used for the synthetic data, so that the feature space formula in equation 4 is used, while the kernel based formula in equation 5 is used for the image datasets (Stacked MNIST, CIFAR-10, CIFAR-100 and STL-10), for which the feature space is larger than the batch size. The architectures used for the image datasets are based on the DCGAN (Radford et al., 2016), while results using ResNets are given in Section 4. All images are scaled in between -1 and 1 before running the algorithms. Additional information on the architectures and datasets is given in Appendix. The hyperparameters of the other methods are chosen as suggested in the authors' reference implementations. The number of unrolling steps in Unrolled GAN is chosen to be 5. For MDGAN, both versions are implemented: the first version, which corresponds to the mode regularizer, has hyperparameters λ1 = 0.2 and λ2 = 0.4, while the second version, which corresponds to manifold-diffusion training for regularized GANs, has λ = 10^{-2}. WGAN-GP uses λ = 10.0 and n_critic = 5. All models are trained using Adam (Kingma & Ba, 2015) with β1 = 0.5, β2 = 0.999 and learning rate 10^{-3} for both the generator and the discriminator. Unless stated otherwise, the batch size is taken to be 256.
Examples of random generations of all the GANs are given in Appendix. Notice that in this section we report the results achieved only at the end of the training.

3.1. ARTIFICIAL DATA

Synthetic. The ring is a mixture of eight two-dimensional spherical Gaussians in the plane with means 2.5 × (cos((2π/8)i), sin((2π/8)i)) and standard deviation 0.01 for i ∈ {1, . . . , 8}. The 2D-grid is a mixture of 25 two-dimensional isotropic normals in the plane with means separated by 2 and with standard deviation 0.05. All models have the same architecture, with a latent dimension of 256 following Elfeki et al. (2019), and are trained for 25k iterations. The evaluation is done by sampling 3k points from the generator network. A sample is counted as high quality if it is within 3 standard deviations of the nearest mode. The experiments are repeated 10 times for all models and their performance is compared in Table 1. BuresGANs consistently capture all the modes and produce the highest quality samples. The training progress of Alt-BuresGAN is shown in Figure 1, where we observe that it captures all the modes early on in the training procedure and afterwards improves sample quality. The training progress of the other GAN models listed in Table 1 is given in Appendix. Although BuresGAN training times are larger than those of most other methods for this low dimensional example, we show in Appendix D.1 that BuresGAN scales better with the input data dimension and architecture complexity.

Stacked MNIST. The Stacked MNIST dataset is specifically constructed to contain 1000 known modes. This is done by stacking three digits, sampled uniformly at random from the original MNIST dataset, each in a different channel. The BuresGAN models are compared to the other methods and are trained for 25k iterations. The results with different batch sizes can be found in Table 2. For the evaluation of the performance, we follow Metz et al. (2017) and use the following metrics: the number of captured modes, which measures mode collapse, and the KL divergence between the generated mode distribution and the uniform distribution, which also measures sample quality.
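Coming back to the synthetic setup, the mode-counting and high-quality-sample protocol for the ring can be sketched as follows. This is a hypothetical illustration of the protocol, not the authors' evaluation code; the 3-standard-deviation threshold is applied here to the distance to the nearest mode.

```python
import numpy as np

# Means of the eight ring modes and their standard deviation, as described above.
modes = 2.5 * np.stack(
    [(np.cos(2 * np.pi * i / 8), np.sin(2 * np.pi * i / 8)) for i in range(8)]
)
std = 0.01

def ring_metrics(samples, threshold=3 * std):
    """Return (number of captured modes, fraction of high-quality samples)."""
    d = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)  # (n, 8)
    nearest = d.argmin(axis=1)              # index of the closest mode
    hq = d.min(axis=1) <= threshold         # within 3 std of the nearest mode
    captured = len(np.unique(nearest[hq]))  # modes hit by at least one HQ sample
    return captured, hq.mean()
```

Sampling 3k points from the true mixture recovers all 8 modes with a high-quality fraction near 99%, while a generator collapsed onto a single mode scores 1 captured mode.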
The mode of each generated image is identified by using a standard MNIST classifier, trained up to 98.43% accuracy on the validation set (see Appendix), which classifies each channel of the fake sample. The same classifier is used to count the number of captured modes. The metrics are calculated based on 10k generated images for all the models. Interestingly, for most models, an improvement is observed both in the quality of the images, measured by the KL divergence, and in terms of mode collapse, measured by the number of captured modes, as the size of the batch increases. For the same batch size, architecture and number of iterations, the image quality is improved by BuresGAN, which is robust with respect to batch size and architecture choice. The other methods show a higher variability over the different experiments. WGAN-GP has the best single run performance with a discriminator with 3 convolutional layers and has on average a superior performance when using a discriminator with 2 convolutional layers (see Table 21 in Appendix), but sometimes fails to converge when increasing the number of discriminator layers by 1 along with increasing the batch size. MDGANv2, VEEGAN, GDPP and WGAN-GP often have an excellent single run performance. However, when increasing the number of discriminator layers, the training of these models has a tendency to collapse more often, as indicated by the large std. Vanilla GAN is one of the best performing models in the variant with 3 layers. This indicates that, for certain datasets, careful architecture tuning can be more important than complicated training schemes. A lesson from Table 2 is that BuresGAN's mode coverage does not vary much as the batch size increases, although the KL divergence is slightly improved. Generated samples from Alt-BuresGAN are given in Figure 2. Since the VEEGAN and PacGAN papers use a different setup, we also report in Table 12 the specifications of the different settings.
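The Stacked MNIST metrics can be sketched in a few lines. The per-channel digit predictions are assumed to come from a pretrained MNIST classifier, which is not reproduced here; the mode id simply concatenates the three digits, and the KL divergence is taken between the empirical mode histogram and the uniform distribution over the 1000 modes.

```python
import numpy as np

def stacked_mnist_metrics(channel_digits):
    """channel_digits: (n, 3) integer array of per-channel digit predictions
    (0-9), e.g. produced by a pretrained MNIST classifier (assumed here).
    Returns (number of captured modes, KL(generated || uniform))."""
    mode_ids = (channel_digits[:, 0] * 100
                + channel_digits[:, 1] * 10
                + channel_digits[:, 2])            # one of 1000 possible modes
    counts = np.bincount(mode_ids, minlength=1000)
    n_modes = int((counts > 0).sum())
    p = counts / counts.sum()                      # generated mode distribution
    q = 1.0 / 1000                                 # true distribution is uniform
    mask = p > 0
    kl = float(np.sum(p[mask] * np.log(p[mask] / q)))
    return n_modes, kl
```

A generator covering all modes uniformly yields 1000 modes and a KL close to 0, while a fully collapsed generator yields 1 mode and a KL of log(1000) ≈ 6.9, matching the extremes seen in mode collapse tables.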
Since the results of Table 2 do not determine a clear winner, we repeated a similar experiment with a more challenging architecture (reported in Table 10), which includes 4 convolutional layers for both the generator and discriminator, in an analogous way to the experiment proposed in Srivastava et al. (2017). The leading methods of Table 2 are considered for this experiment. We then observe in Table 3 that vanilla GAN and GDPP collapse for this architecture. Compared with the results of Table 2, the difference between the methods is more significant: WGAN-GP yields the best result and is followed by the BuresGAN models. Note, however, that our empirical results indicate that WGAN-GP is sensitive to the choice of architecture and hyperparameters, and its training time is also longer, as can be seen from Table 19 in Appendix.

3.2. REAL IMAGES

Metrics. The image quality is assessed thanks to the Inception Score (IS), Fréchet Inception Distance (FID), Inference via Optimization (IvO) and Sliced Wasserstein Distance (SWD). The latter was also used in Elfeki et al. (2019) and Karras et al. (2017) to evaluate the quality of images as well as the severity of mode collapse. In a word, SWD evaluates the multiscale statistical similarity between distributions of local image patches drawn from Laplacian pyramids. A small sliced Wasserstein distance indicates that the distributions of the patches are similar, so that real and fake images appear similar in both appearance and variation at this spatial resolution. IvO (Metz et al., 2017) measures mode collapse by comparing real images with the nearest generated image. It relies on an optimization procedure within the latent space to find the closest generated image to a given test image, and returns a distance which can be large in the case of mode collapse. The metrics are calculated based on 10k generated images for all the models.

CIFAR datasets. In Table 4, we evaluate the performance on the 32 × 32 × 3 CIFAR datasets. While all models are trained for 100k iterations, the best performance is observed for BuresGAN and Alt-BuresGAN in terms of image quality, measured by FID and Inception Score, and in terms of mode collapse, measured by SWD and IvO. We also notice that UnrolledGAN, VEEGAN and WGAN-GP have difficulty converging to a satisfactory result for this architecture. This is in contrast to the 'simpler' synthetic data and the Stacked MNIST dataset, where these models attain a performance comparable to BuresGAN and Alt-BuresGAN. In Gulrajani et al. (2017), WGAN-GP achieves a very good performance on CIFAR-10 with a ResNet architecture which is considerably more complicated than the DCGAN used here. Therefore, results with a ResNet architecture are reported in Section 4.
Also, for this architecture and number of training iterations, MDGAN-v1 and MDGAN-v2 did not converge to a meaningful result in our simulations. The CIFAR-100 dataset consists of 100 different classes and is therefore more diverse. Compared to the original CIFAR-10 dataset, the performance of the proposed GANs remains almost the same, with a small increase in IvO. An exception is vanilla GAN, which shows a higher presence of mode collapse as measured by IvO and SWD.

STL-10. The STL-10 dataset includes higher resolution images of size 96 × 96 × 3. The best performing models from the previous experiments are trained for 150k iterations. Samples of generated images from a trained Alt-BuresGAN are given in Figure 3. The metrics are calculated based on 5k generated images for all the models. Compared to the previous datasets, GDPP and vanilla GAN are rarely able to generate high quality images on the higher resolution STL-10 dataset. Only BuresGANs are capable of consistently generating high quality images as well as preventing mode collapse, for the same architecture.

Timings. The computing times for these datasets are given in Appendix. For the same number of iterations, (alt-)BuresGAN training time is comparable to WGAN-GP training for the simple data in Table 1. For more complicated architectures, (alt-)BuresGAN scales better, and the training time was observed to be significantly shorter than for WGAN-GP and several other methods.
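As an aside on the SWD metric used above: the full evaluation operates on patch descriptors drawn from Laplacian pyramids, but the underlying sliced distance can be illustrated with a simplified sketch on raw point clouds (our simplification, not the paper's implementation), averaging 1D Wasserstein distances over random projections.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=128, rng=None):
    """Simplified sliced 1-Wasserstein distance between two equally sized
    point clouds: for each random direction, project both sets, sort, and
    average the absolute differences of the sorted projections."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    dirs = rng.standard_normal((n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    px = np.sort(X @ dirs.T, axis=0)                     # (n, n_proj)
    py = np.sort(Y @ dirs.T, axis=0)
    return float(np.abs(px - py).mean())
```

Two samples from the same distribution give a value close to zero, while shifting one sample makes the distance grow, which is the behavior exploited when SWD is used to detect missing variation in generated images.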

4. HIGH QUALITY GENERATION USING A RESNET ARCHITECTURE

As noted by Lucic et al. (2018), the relative performance of GAN models can depend strongly on the architecture used; hence, we also question the performance of BuresGAN with a ResNet architecture. We therefore trained BuresGAN on the CIFAR-10 and STL-10 datasets by using the ResNet architecture taken from Gulrajani et al. (2017). In this section, the STL-10 images are rescaled to 48 × 48 × 3 according to the procedure described in Miyato et al. (2018), Lee et al. (2019) and Wang et al. (2019), so that the comparison of IS and FID scores with other works is meaningful. Note that BuresGAN has no parameters to tune, except for the hyperparameters of the optimizers. The results are displayed in Table 6, where the scores of state-of-the-art unconditional GAN models with a ResNet architecture are also reported. In contrast with Section 3, we report here the best performance achieved at any time during the training, averaged over several runs. To the best of our knowledge, our method achieves a new state of the art inception score on STL-10 and is within a standard deviation of the state of the art on CIFAR-10 using a ResNet architecture. The FID score achieved by BuresGAN is nonetheless smaller than the reported FID scores for GANs using a ResNet architecture. A visual inspection of the generated images in Figure 3 shows that the high inception score is warranted: the samples are clear, diverse and often recognizable. BuresGAN also performs well on the full-sized STL-10 dataset, where an inception score of 11.11 ± 0.19 and an FID of 50.9 ± 0.13 are achieved (average and std over 3 runs).

Table 5: Generation quality on STL-10 with DCGAN architecture. Average(std) over 5 runs of 150k iterations each. The SWD score was multiplied by 100 for readability.

Table 6: Best achieved IS and FID, using a ResNet architecture. Results with an asterisk are quoted from their respective papers (std in parenthesis). BuresGAN results were obtained after 300k iterations and averaged over 3 runs. Results indicated with † are taken from Wu et al. (2018). For all the methods, the STL-10 images are rescaled to 48 × 48 × 3, in contrast with Table 5.

5. CONCLUSION

In this work, we discussed an additional term based on the Bures distance which promotes a matching of the distributions of the generated and real data in a feature space R^f. The Bures distance admits both a feature space and a kernel based expression, which makes the proposed model time and data efficient when compared to state of the art models. Two training procedures are proposed: Algorithm 1 treats the squared Bures distance as an additive term in the generator loss, while an alternating training is used in Algorithm 2 so that no extra parameter is introduced. Our experiments show that the proposed methods are capable of reducing mode collapse and, on the real datasets, achieve a clear improvement of sample quality without parameter tuning and without the need for a regularization such as a gradient penalty. Moreover, the proposed GANs show a stable performance over different architectures, datasets and hyperparameters.

A PROOFS

Proof of Lemma 1. (i) is a consequence of Corollary 2.3 in (Hong & Horn, 1991). (ii) We now follow (Oh et al., 2020). Thanks to (i), we have AB = P D P^{-1}, where D is a nonnegative diagonal matrix and the columns of P contain the right eigenvectors of AB. Therefore, Tr((AB)^{1/2}) = Tr(D^{1/2}). Then, Y^⊤ A Y is clearly diagonalizable, being symmetric positive semi-definite. Let us show that it shares its nonzero eigenvalues with AB. Proof of Lemma 2. The result follows from Lemma 1 and its proof, with A = X^⊤ X and B = Y^⊤ Y, i.e., taking Y^⊤ in place of Y in Lemma 1.

B DETAILS OF THE THEORETICAL RESULTS

Let A and B be symmetric positive semi-definite n × n matrices. Let A^{1/2} = U diag(√λ) U^⊤, where U and λ are obtained from the eigenvalue decomposition A = U diag(λ) U^⊤. We show here that the Bures distance between A and B is

B(A, B)^2 = min_{U ∈ O(n)} ||A^{1/2} - B^{1/2} U||_F^2 = Tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2}),        (6)

where O(n) is the set of n × n orthogonal matrices. We can simplify the above expression as follows:

min_{U ∈ O(n)} ||A^{1/2} - B^{1/2} U||_F^2 = Tr(A) + Tr(B) - 2 max_{U ∈ O(n)} Tr(A^{1/2} B^{1/2} U),

since Tr(U^⊤ B^{1/2} A^{1/2}) = Tr(A^{1/2} B^{1/2} U). Let the characteristic function of the set of orthogonal matrices be f(U) = χ_{O(n)}(U), that is, f(U) = 0 if U ∈ O(n) and +∞ otherwise.

Lemma 3. The Fenchel conjugate of f(U) = χ_{O(n)}(U) is f*(M) = ||M||_*, where the nuclear norm is ||M||_* = Tr(√(M^⊤ M)) and U, M ∈ R^{n×n}.

Proof. The definition of the Fenchel conjugate with respect to the Frobenius inner product gives f*(M) = sup_{U ∈ R^{n×n}} Tr(U^⊤ M) - f(U) = max_{U ∈ O(n)} Tr(U^⊤ M). Next we decompose M as M = W Σ V^⊤, where W, V ∈ O(n) are orthogonal matrices and Σ is an n × n diagonal matrix with nonnegative diagonal entries, such that M M^⊤ = W Σ^2 W^⊤ and M^⊤ M = V Σ^2 V^⊤. Notice that the nonzero diagonal entries of Σ are the singular values of M. Then, max_{U ∈ O(n)} Tr(U^⊤ M) = max_{U ∈ O(n)} Tr(U^⊤ W Σ V^⊤) = max_{U' ∈ O(n)} Tr(Σ U'), where we renamed U' = V^⊤ U^⊤ W. Next, we remark that Tr(Σ U') = Tr(Σ diag(U')). Since by construction Σ is diagonal with nonnegative entries, the maximum is attained at U' = I. Then, the optimal objective is Tr(Σ) = Tr(√(M^⊤ M)). By taking M = A^{1/2} B^{1/2}, we obtain equation 6. Notice that the roles of A and B can be exchanged in equation 6 since U is orthogonal.
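Lemma 3 has a direct numerical illustration via the singular value decomposition: the maximizer of Tr(U^⊤ M) over orthogonal U is U = W V^⊤ (using the factors of M = W Σ V^⊤ from the proof), and the optimal value is the nuclear norm. The script below is a sketch of this fact, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))

W, s, Vt = np.linalg.svd(M)   # M = W diag(s) V^T
nuclear = s.sum()             # ||M||_* = Tr(sqrt(M^T M)) = sum of singular values

# The maximizer of Tr(U^T M) over orthogonal U is U = W V^T:
U_opt = W @ Vt
value = np.trace(U_opt.T @ M)
```

Any other orthogonal matrix gives a strictly smaller or equal objective, so `value` attains the nuclear norm, which is exactly the Fenchel conjugate identity used to obtain equation 6.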

C TRAINING DETAILS

C.1 SYNTHETIC ARCHITECTURES

Following the recommendation in the original work (Srivastava et al., 2017), the same fully-connected architecture is used for the VEEGAN reconstructor in all experiments.

D ADDITIONAL EXPERIMENTS

D.1 TIMINGS

The timings per iteration for the experiments presented in the paper are listed in Table 19. Times are given for all the methods considered, although some methods do not always generate meaningful images for all datasets. They are measured for 50 iterations after the first 5 iterations, and the average number of iterations per second is computed. The fastest method is the vanilla GAN. BuresGAN has a similar computational cost as GDPP. We observe that (alt-)BuresGAN is significantly faster than WGAN-GP. In order to obtain reliable timings, these results were obtained on the same GPU (Nvidia Quadro P4000), although, for convenience, the experiments on the image datasets were executed on a machine equipped with different GPUs.

Table 21: KL-divergence between the generated distribution and the true distribution (quality, lower is better). The number of counted modes indicates the amount of mode collapse (higher is better). 25k iterations; average and std over 10 runs. Same architecture as in Table 2, with a discriminator with 2 convolutional layers.

E ADDITIONAL FIGURES




¹ For simplicity, we omit the normalization by 1/(b - 1) in front of the covariance matrix.



Figure 1: Figure accompanying Table 1, showing the progress of Alt-BuresGAN on the synthetic examples. Each column shows 3k samples from the training generator in blue and 3k samples from the true distribution in green.

Figure 2: Generated samples from a trained Alt-BuresGAN, with a DCGAN architecture.

Figure 3: Images generated by BuresGAN with a ResNet architecture for CIFAR-10 (left) and STL-10 (right) datasets. The STL-10 samples are full-sized 96 × 96 images.

a) We have ABP = P D, so that, by multiplying on the left by Y^⊤, it holds that (Y^⊤ A Y)(Y^⊤ P) = (Y^⊤ P) D. b) Similarly, suppose that we have the eigenvalue decomposition Y^⊤ A Y Q = Q Λ. Then, we have B A (Y Q) = (Y Q) Λ with B = Y Y^⊤. This means that the non-zero eigenvalues of Y^⊤ A Y are also eigenvalues of BA, and hence of AB, since AB and BA share the same non-zero spectrum. Since A and B are symmetric, this completes the proof.

Figure 4: The progress of different GANs on the synthetic ring example. Each column shows 3000 samples from the training generator in blue with 3000 samples from the true distribution in green.

Figure 5: The progress of different GANs on the synthetic grid example. Each column shows 3000 samples from the generator during training in blue and 3000 samples from the true distribution in green.

For simplicity, we denote the real data and generated data covariance matrices by C_d = C(p_d) and C_g = C(p_g), respectively. Our proposal is to replace the generator loss by V_G + λ B(C_d, C_g)^2.
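For concreteness, the squared Bures distance between two covariance matrices is B(C_d, C_g)^2 = Tr(C_d) + Tr(C_g) − 2 Tr((C_d^{1/2} C_g C_d^{1/2})^{1/2}). A numpy sketch could look as follows (the `psd_sqrt` and `squared_bures` helpers are hypothetical names, not the authors' implementation):

```python
import numpy as np

def psd_sqrt(C):
    """Symmetric PSD square root via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def squared_bures(C_d, C_g):
    """Squared Bures distance:
    Tr(C_d) + Tr(C_g) - 2 Tr((C_d^{1/2} C_g C_d^{1/2})^{1/2})."""
    s = psd_sqrt(C_d)
    cross = psd_sqrt(s @ C_g @ s)
    return np.trace(C_d) + np.trace(C_g) - 2.0 * np.trace(cross)
```

The distance vanishes when the two covariances coincide; for example, for diag(4, 1) versus diag(1, 1) it reduces to (2 − 1)^2 = 1.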



Experiments on the synthetic datasets with an Nvidia Quadro P4000 GPU. Average (std) over 10 runs. All the models are trained for 25k iterations and the total training time is indicated in seconds.

Table 19 in Appendix.

KL-divergence between the generated distribution and true distribution for an architecture with 3 conv. layers for the Stacked MNIST dataset. The number of counted modes assesses mode collapse. The results are obtained after 25k iterations and we report the average(std) over 10 runs.



Generation quality on CIFAR-10 and CIFAR-100 with the DCGAN architecture. Average (std) over 10 runs, 100k iterations each. To improve readability, the SWD score was multiplied by 100.


, the same fully-connected architecture is used for the VEEGAN reconstructor in all experiments. The generator and discriminator architectures for the synthetic examples.

Respectively, the MDGAN encoder model and VEEGAN stochastic inverse generator architectures for the synthetic examples. The outputs of the VEEGAN models are samples drawn from a normal distribution with unit scale and a learned location.

The generator and discriminator architectures for the Stacked MNIST experiments. The BN column indicates whether batch normalization is used after the layer or not. For the experiments with 2 convolutional layers in Table 21, the final convolutional layer is removed from the discriminator.

The generator and discriminator architectures for the CIFAR-10 and CIFAR-100 experiments. The BN column indicates whether batch normalization is used after the layer or not.

The MDGAN encoder model architecture for the CIFAR-10 and CIFAR-100 experiments. The BN column indicates whether batch normalization is used after the layer or not.

The generator and discriminator architectures for the STL-10 experiments. The BN column indicates whether batch normalization is used after the layer or not.

Average time per iteration in seconds for the convolutional architecture. Averaged over 5 runs, with std in parentheses. The batch size is 256. For Stacked MNIST, we use a discriminator architecture with 3 convolutional layers.

D.2 BEST INCEPTION SCORES ACHIEVED WITH DCGAN ARCHITECTURE

The inception scores for the best trained models are listed in Table 20. For the CIFAR datasets, the largest inception score is significantly better than the mean for UnrolledGAN and VEEGAN. The same holds for GAN and GDPP on the STL-10 dataset, where these methods often converge to bad results. Only the proposed methods are capable of consistently generating high quality images over all datasets.

Inception Score for the best trained models on CIFAR-10, CIFAR-100 and STL-10, with a DCGAN architecture (higher is better).

D.3 INFLUENCE OF THE NUMBER OF CONVOLUTIONAL LAYERS FOR DCGAN ARCHITECTURE

Also, we provide in Table 21 results with a DCGAN architecture including only 2 convolutional layers for the discriminator, in contrast to Table 2, which uses 3 convolutional layers.

C.5 RESNET ARCHITECTURES

For CIFAR-10, we used the ResNet architecture from the appendix of Gulrajani et al. (2017) with minor changes, as given in Table 17. We used an initial learning rate of 5e-4 for both CIFAR-10 and STL-10. For both datasets, the models are run for 200k iterations. For STL-10, we used a similar architecture, given in Table 18.

