SAMPLE WEIGHTING AS AN EXPLANATION FOR MODE COLLAPSE IN GENERATIVE ADVERSARIAL NETWORKS

Abstract

Generative adversarial networks were introduced with a logistic minimax cost formulation, which normally fails to train due to saturation, and a non-saturating reformulation (NS-GAN). While addressing the saturation problem, NS-GAN also inverts the generator's sample weighting, implicitly shifting emphasis from higher-scoring to lower-scoring samples when updating parameters. We present both theory and empirical results suggesting that this makes NS-GAN prone to mode dropping. We design MM-nsat, which preserves MM-GAN's sample weighting while avoiding saturation by rescaling the MM-GAN minibatch gradient such that its magnitude approximates NS-GAN's gradient magnitude. MM-nsat has qualitatively different training dynamics, and on MNIST and CIFAR-10 it is stronger in terms of mode coverage, stability and FID. While the empirical results for MM-nsat are promising, and favorable also in comparison with the LS-GAN and Hinge-GAN formulations, our main contribution is to show how and why NS-GAN's sample weighting causes mode dropping and training collapse.

1. INTRODUCTION

Generative adversarial networks have come a long way since their introduction (Goodfellow et al., 2014) and are currently state of the art for some tasks, such as image generation. A combination of deep learning developments, GAN-specific advances and vast improvements in datasets and computational resources has enabled GANs to generate high-resolution images that require some effort to distinguish from real photos (Zhang et al., 2018; Brock et al., 2018; Karras et al., 2018). GANs use two competing networks: a generator G that maps input noise to samples mimicking real data, and a discriminator D that outputs estimated probabilities of samples being real rather than generated by G. We summarize their cost functions, J_D and J_G, for the minimax and non-saturating formulations introduced in Goodfellow et al. (2014). We denote samples from the real data and noise distributions by x and z and omit the proper expectation-value formalism:

J_DMM(x, z) = J_DNS(x, z) = -log(D_p(x)) - log(1 - D_p(G(z)))
J_GMM(z) = log(1 - D_p(G(z)))
J_GNS(z) = -log(D_p(G(z)))    (1)

For clarity, we use subscripts to distinguish between the discriminator's pre-activation logit output D_l and the probability representation D_p:

D_p ≡ (1 + exp(-D_l))^(-1)    (2)

Both formulations share the same cost function for D, the cross entropy between probability estimates and ground truth. In the minimax formulation (MM-GAN), G is simply trained to maximize D's cost. Ideally, G matches its outputs to the real data distribution while also achieving meaningful generalization, but many failure modes are observed in practice. NS-GAN uses a modified cost for G that is non-saturating when D distinguishes real and generated data with very high confidence, so that G's gradients do not vanish.
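The inverted sample weighting can be made concrete by differentiating the two generator costs with respect to the discriminator logit s = D_l(G(z)): dJ_GMM/ds = -D_p(G(z)), while dJ_GNS/ds = -(1 - D_p(G(z))). MM-GAN therefore weights each sample's gradient by its score D_p, whereas NS-GAN weights it by 1 - D_p. A small self-contained Python check of these derivatives (the function names are ours, for illustration only):

```python
import math

def sigmoid(s):
    # Probability representation D_p = (1 + exp(-D_l))^-1, Eq. (2).
    return 1.0 / (1.0 + math.exp(-s))

def grad_mm(s):
    # d/ds log(1 - sigmoid(s)) = -sigmoid(s):
    # MM-GAN weights a sample's gradient by its score D_p.
    return -sigmoid(s)

def grad_ns(s):
    # d/ds [-log(sigmoid(s))] = -(1 - sigmoid(s)):
    # NS-GAN weights a sample by 1 - D_p, inverting the emphasis.
    return -(1.0 - sigmoid(s))

# A sample D scores highly (s = 2) gets a large MM gradient but a
# small NS gradient; a poorly scored sample (s = -2) gets the reverse.
w_mm_high, w_ns_high = abs(grad_mm(2.0)), abs(grad_ns(2.0))
```

At s = 2 the MM weight is about 0.88 while the NS weight is about 0.12, and the ordering flips at s = -2, which is the shift in emphasis from higher-scoring to lower-scoring samples discussed above.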
Various publications establish what the different cost functions optimize in terms of the Jensen-Shannon and reverse Kullback-Leibler divergences between real and generated data (see Supplementary C):

J_GMM ⇔ 2 · D_JS (Goodfellow et al., 2014)
J_GMM + J_GNS ⇔ D_RKL (Huszár, 2016)
J_GNS ⇔ D_RKL - 2 · D_JS (Arjovsky & Bottou, 2017)    (3)

Huszár (2015) and Arjovsky & Bottou (2017) have suggested NS-GAN's divergence as an explanation for the ubiquitous mode-dropping and mode-collapsing problems with GANs (Metz et al., 2016; Salimans et al., 2016; Srivastava et al., 2017). While MM-GAN seems promising in terms of its Jensen-Shannon divergence, the formulation has largely been ignored because the saturating cost causes training to break down.
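The abstract describes MM-nsat as rescaling the MM-GAN minibatch gradient so that its magnitude approximates NS-GAN's. A minimal NumPy sketch of that idea, applied to the generator's gradients with respect to D's logits on generated samples; the L2 norm over the minibatch and the function name mm_nsat_grad are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def mm_nsat_grad(logits_fake, eps=1e-8):
    """Sketch of MM-nsat: keep MM-GAN's per-sample weighting but
    rescale the minibatch gradient to NS-GAN's magnitude.

    logits_fake: D_l(G(z)) for one minibatch of generated samples.
    """
    p = 1.0 / (1.0 + np.exp(-logits_fake))   # D_p(G(z)), Eq. (2)
    g_mm = -p           # per-sample dJ_GMM/dD_l: emphasizes high scorers
    g_ns = -(1.0 - p)   # per-sample dJ_GNS/dD_l: emphasizes low scorers
    # Scale MM gradient so its minibatch L2 norm matches NS-GAN's
    # (eps guards against division by zero when all scores vanish).
    scale = np.linalg.norm(g_ns) / (np.linalg.norm(g_mm) + eps)
    return g_mm * scale  # MM-GAN weighting, NS-GAN-like magnitude
```

The returned gradient keeps MM-GAN's relative sample weighting (its direction is proportional to -D_p) while avoiding the vanishing magnitude that saturates plain MM-GAN when D confidently rejects the generated samples.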



Figure 1: Median Fréchet Inception Distance during training for ten runs on MNIST, CIFAR-10, CAT 128² and FFHQ 512², using very simple convolutional GANs. The shaded areas show the minimum and maximum value during training for each cost formulation. MM-nsat is best overall, suffers less from gradual mode dropping and trains reliably on the more challenging datasets.

Figure 2: Scaling factors as a function of the discriminator output, at a scale emphasizing asymptotic behaviors. MM-GAN's scaling factor causes G's gradient to vanish when D_p(G(z)) → 0, which corresponds to the maximum value of its cost.

