THE BURES METRIC FOR TAMING MODE COLLAPSE IN GENERATIVE ADVERSARIAL NETWORKS

Abstract

Generative Adversarial Networks (GANs) are performant generative methods yielding high-quality samples. However, under certain circumstances, the training of GANs can lead to mode collapse or mode dropping, i.e. the generative models not being able to sample from the entire probability distribution. To address this problem, we use the last layer of the discriminator as a feature map to study the distribution of the real and the fake data. During training, we propose to match the real batch diversity to the fake batch diversity by using the Bures distance between covariance matrices in feature space. The computation of the Bures distance can be conveniently done in either feature space or kernel space in terms of the covariance and kernel matrix respectively. We observe that diversity matching reduces mode collapse substantially and has a positive effect on the sample quality. On the practical side, a very simple training procedure, that does not require additional hyperparameter tuning, is proposed and assessed on several datasets.

1. INTRODUCTION

In several machine learning applications, data is assumed to be sampled from an implicit probability distribution. Estimating this implicit distribution is often intractable, especially in high dimensions. To tackle this issue, generative models are trained to provide an algorithmic procedure for sampling from this unknown distribution. Popular approaches are Variational Auto-Encoders proposed by Kingma & Welling (2014), normalizing flow models by Rezende & Mohamed (2015) and Generative Adversarial Networks (GANs) initially developed by Goodfellow et al. (2014). The latter are a particularly successful approach for producing high quality samples, especially in the case of natural images, though their training is notoriously difficult. The vanilla GAN consists of two networks: a generator and a discriminator. The generator maps random noise, usually drawn from a multivariate normal, to fake data in input space. The discriminator estimates the likelihood ratio of the generator network to the data distribution. It often happens that a GAN generates samples only from a few of the many modes of the distribution. This phenomenon is called 'mode collapse'.

Contribution. We propose BuresGAN: a generative adversarial network whose objective function is that of a vanilla GAN complemented by an additional term, given by the squared Bures distance between the covariance matrices of real and fake batches in a latent space. This loss function promotes a matching of fake and real data in a feature space $\mathbb{R}^f$, so that mode collapse is reduced. Conveniently, the Bures distance admits both a feature space and a kernel based expression. Contrary to other related approaches such as Che et al. (2017) or Srivastava et al. (2017), the architecture of the GAN is unchanged; only the objective is modified.
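Schematically, the modified generator objective can be written as follows. This is an illustrative sketch: the unit weighting of the Bures term is an assumption on our part, consistent with the claim that no additional hyperparameter tuning is required.

```latex
\mathcal{L}_G^{\text{BuresGAN}}
  \;=\;
  \underbrace{-\,\mathbb{E}_{z\sim\mathcal{N}(0,I)}\big[\log D(G(z))\big]}_{\text{vanilla generator loss}}
  \;+\;
  \mathcal{B}\big(C_{\text{real}}, C_{\text{fake}}\big)^2,
```

where $C_{\text{real}}$ and $C_{\text{fake}}$ denote the sample covariance matrices of a real and a fake batch in the feature space, and $\mathcal{B}$ is the Bures distance.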
A variant called alt-BuresGAN, which is trained with alternating minimization, achieves competitive performance with a simple training procedure that does not require hyperparameter tuning or an additional regularization such as a gradient penalty. We empirically show that the proposed methods are robust with respect to the choice of architecture and do not require an additional fine architecture search. Finally, an extra asset of BuresGAN is that it yields competitive or improved IS and FID scores compared with the state of the art on CIFAR-10 and STL-10 using a ResNet architecture.

Related works. The Bures distance is closely related to the Fréchet distance (Dowson & Landau, 1982), which is a 2-Wasserstein distance between multivariate normal distributions. Namely, the Fréchet distance between multivariate normals of equal means is the Bures distance between their covariance matrices. The Bures distance is also equivalent to the exact expression for the 2-Wasserstein distance between two elliptically contoured distributions with the same mean, as shown in Gelbrich (1990) and Peyré et al. (2019). Noticeably, the Fréchet Inception Distance (FID) is a popular way to assess the quality of generative models. This score uses the Fréchet distance between real and generated samples in the feature space of a pre-trained inception network, as explained in Salimans et al. (2016) and Heusel et al. (2017). There exist numerous works aiming to improve the training efficiency of generative networks. For mode collapse evaluation, we compare BuresGAN to the most closely related works. GDPP-GAN (Elfeki et al., 2019) and VEEGAN (Srivastava et al., 2017) also try to enforce diversity in 'latent' space. GDPP-GAN matches the eigenvectors and eigenvalues of the real and fake diversity kernel. In VEEGAN, an additional reconstructor network is introduced to map the true data distribution to Gaussian random noise.
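The stated equivalence, that the Fréchet distance between equal-mean Gaussians reduces to the Bures distance between their covariances, can be checked numerically. The sketch below (an illustration with hypothetical helper names, not code from the paper) implements both distances with SciPy and verifies that they agree when the means coincide:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet2(mu1, cov1, mu2, cov2):
    """Squared Frechet (2-Wasserstein) distance between two Gaussians."""
    s1 = sqrtm(cov1)
    covmean = sqrtm(s1 @ cov2 @ s1)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2 * covmean).real)

def bures2(A, B):
    """Squared Bures distance Tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2})."""
    sA = sqrtm(A)
    return float(np.trace(A + B - 2 * sqrtm(sA @ B @ sA)).real)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4)); A = X @ X.T + np.eye(4)  # SPD covariance
Y = rng.standard_normal((4, 4)); B = Y @ Y.T + np.eye(4)  # SPD covariance
mu = rng.standard_normal(4)

# Equal means: the Frechet distance reduces to the Bures distance.
assert np.isclose(frechet2(mu, A, mu, B), bures2(A, B))
```

With distinct means, `frechet2` additionally picks up the squared Euclidean distance between the means, which is exactly the extra term in the 2-Wasserstein distance between Gaussians.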
In a similar way, architectures with two discriminators are analysed by Nguyen et al. (2017), while MADGAN (Ghosh et al., 2018) uses multiple discriminators and generators. A different approach is taken by Unrolled-GAN (Metz et al., 2017), which updates the generator with respect to the unrolled optimization of the discriminator. This allows the training to be adjusted between using the optimal discriminator in the generator's objective, which is ideal but infeasible in practice, and using the current discriminator. Wasserstein GANs (Arjovsky et al., 2017; Gulrajani et al., 2017) leverage the 1-Wasserstein distance to match the real and generated data distributions. In MDGAN (Che et al., 2017), a regularization is added to the objective function so that the generator can take advantage of another similarity metric with more predictable behavior. This idea is combined with a penalization of the missing modes. More recent approaches to reducing mode collapse are variations of WGAN (Wu et al., 2018). Entropic regularization has also been proposed in PresGAN (Dieng et al., 2020), while metric embeddings were used in the paper introducing BourGAN (Xiao et al., 2018). A simple packing procedure which significantly reduces mode collapse was proposed in PacGAN (Lin et al., 2018), which we also consider in our comparisons below.

2. METHOD

A GAN consists of a discriminator $D : \mathbb{R}^d \to \mathbb{R}$ and a generator $G : \mathbb{R}^\ell \to \mathbb{R}^d$, which are typically defined by neural networks and parametrized by real vectors. The value $D(x)$ gives the probability that $x$ comes from the empirical distribution, while the generator $G$ maps a point $z$ in the latent space $\mathbb{R}^\ell$ to a point in the input space $\mathbb{R}^d$. The training of a GAN consists in solving

$$\min_G \max_D \; \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))], \qquad (1)$$

by alternating two phases of training. In equation 1, the expectation in the first term is over the empirical data distribution $p_d$, while the expectation in the second term is over the generated data distribution $p_g$, implicitly given by the mapping by $G$ of the latent prior distribution $\mathcal{N}(0, I_\ell)$. It is common to define and minimize the discriminator loss $V_D = -\mathbb{E}_{x \sim p_d}[\log D(x)] - \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$. In practice, Goodfellow et al. (2014) propose to minimize the generator loss $V_G = -\mathbb{E}_{z \sim \mathcal{N}(0, I_\ell)}[\log D(G(z))]$, rather than the second term of equation 1, for improved training efficiency.

Matching real and fake data covariance. To prevent mode collapse, we encourage the generator to sample fake data whose diversity is similar to that of the real data. This is achieved by matching the sample covariance matrices of the real and fake data respectively. Covariance matching and similar ideas were explored for GANs in Mroueh et al. (2017) and Elfeki et al. (2019). In order to compare covariance matrices, we propose to use the squared Bures distance between positive semi-definite $f \times f$ matrices (Bhatia et al., 2019), i.e.,

$$\mathcal{B}(A, B)^2 = \min_{U \in O(f)} \big\| A^{1/2} - B^{1/2} U \big\|_F^2 = \mathrm{Tr}\Big( A + B - 2 \big( A^{1/2} B A^{1/2} \big)^{1/2} \Big).$$

Being a Riemannian metric on the manifold of positive semi-definite matrices (Massart & Absil, 2020), the Bures metric is adequate to compare covariance matrices. The covariances are
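The squared Bures distance between batch covariances admits a direct numerical sketch. The NumPy code below is an illustration, not the paper's implementation: the helper names, the batch shapes, and the small diagonal jitter `eps` (added for numerical stability of the matrix square root) are our assumptions. It computes the diversity-matching term from a real and a fake feature batch, as would be extracted from the discriminator's last layer:

```python
import numpy as np
from scipy.linalg import sqrtm

def batch_covariance(F):
    """Sample covariance of an (n, f) batch of feature vectors."""
    Fc = F - F.mean(axis=0, keepdims=True)
    return Fc.T @ Fc / F.shape[0]

def bures2(A, B, eps=1e-8):
    """Squared Bures distance Tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2})."""
    A = A + eps * np.eye(A.shape[0])  # jitter: keeps sqrtm well-behaved
    B = B + eps * np.eye(B.shape[0])
    sA = sqrtm(A)
    return float(np.trace(A + B - 2 * sqrtm(sA @ B @ sA)).real)

rng = np.random.default_rng(1)
real_feats = rng.standard_normal((64, 8))        # diverse real batch in R^f, f = 8
fake_feats = 0.1 * rng.standard_normal((64, 8))  # low-diversity (collapsed) fake batch
loss = bures2(batch_covariance(real_feats), batch_covariance(fake_feats))
```

A collapsed fake batch has a near-degenerate covariance, so `loss` is large; as the fake diversity approaches the real diversity, the term shrinks toward zero, which is the behavior the generator is rewarded for.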

