TRAINING GANS WITH STRONGER AUGMENTATIONS VIA CONTRASTIVE DISCRIMINATOR

Abstract

Recent work on Generative Adversarial Networks (GANs) has been actively revisiting various data augmentation techniques as an effective way to prevent discriminator overfitting. It is still unclear, however, which augmentations can actually improve GANs, and in particular, how to apply a wider range of augmentations in training. In this paper, we propose a novel way to address these questions by incorporating a recent contrastive representation learning scheme into the GAN discriminator, coined ContraD. This "fusion" enables the discriminator to work with much stronger augmentations without increasing its training instability, thereby preventing discriminator overfitting in GANs more effectively. Even better, we observe that contrastive learning itself also benefits from our GAN training, i.e., by maintaining discriminative features between real and fake samples, suggesting a strong coherence between the two worlds: good contrastive representations are also good for GAN discriminators, and vice versa. Our experimental results show that GANs with ContraD consistently improve FID and IS compared to other recent techniques incorporating data augmentations, while still maintaining highly discriminative features in the discriminator in terms of linear evaluation. Finally, as a byproduct, we also show that our GANs trained in an unsupervised manner (without labels) can induce many conditional generative models via simple latent sampling, leveraging the learned features of ContraD.

1. INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have become one of the most prominent approaches for generative modeling, with a wide range of applications (Ho & Ermon, 2016; Zhu et al., 2017; Karras et al., 2019; Rott Shaham et al., 2019). In general, a GAN is defined by a minimax game between two neural networks: a generator network that maps a random vector into the data domain, and a discriminator network that classifies whether a given sample is real (from the training dataset) or fake (from the generator). Provided that the generator and discriminator alternately attain their optima of the minimax objective, it is theoretically guaranteed that the generator converges to implicitly model the data-generating distribution (Goodfellow et al., 2014). Due to the non-convex, non-stationary nature of the minimax game, however, training GANs in practice is often very unstable, with an extreme sensitivity to many hyperparameters (Salimans et al., 2016; Lucic et al., 2018; Kurach et al., 2019). Stabilizing GAN dynamics has been extensively studied in the literature (Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018; Wei et al., 2018; Jolicoeur-Martineau, 2019; Chen et al., 2019; Schonfeld et al., 2020), and the idea of incorporating data augmentation techniques has recently gained particular attention in this line of research: specifically, Zhang et al. (2020) have shown that consistency regularization between discriminator outputs of clean and augmented samples can greatly stabilize GAN training, and Zhao et al. (2020c) further improved this idea. The question of which augmentations are good for GANs has been investigated very recently in several works (Zhao et al., 2020d; Tran et al., 2021; Karras et al., 2020a; Zhao et al., 2020a), which unanimously conclude that only a limited range of augmentations (e.g., flipping and spatial translation) is actually helpful for the current form of GAN training.
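To make this minimax game concrete, the following is a minimal NumPy sketch (not the paper's code) of the optimal discriminator and the two generator losses that are formalized later in Section 2; the function names are illustrative:

```python
import numpy as np

def d_star(p_data, p_g):
    """Optimal discriminator for a fixed generator:
    D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data / (p_data + p_g)

def saturating_g_loss(d_fake):
    """Generator term of the original minimax objective:
    E_z[log(1 - D(G(z)))] -- nearly flat when D(G(z)) is close to 0."""
    return np.mean(np.log1p(-d_fake))

def nonsaturating_g_loss(d_fake):
    """Nonsaturating generator loss: -E_z[log D(G(z))] -- steep when
    D(G(z)) is close to 0, giving the generator a useful gradient."""
    return -np.mean(np.log(d_fake))
```

When the discriminator confidently rejects fakes (D(G(z)) ≈ 0), log(1 − D) is nearly zero with a tiny slope, whereas −log D grows without bound, which is why the nonsaturating form is preferred in practice.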
Meanwhile, not only for GANs, data augmentation has also played a key role in the literature of self-supervised representation learning (Doersch et al., 2015; Gidaris et al., 2018; Wu et al., 2018), especially with the recent advances in contrastive learning (Bachman et al., 2019; Oord et al., 2018; Chen et al., 2020a; b; Grill et al., 2020): e.g., Chen et al. (2020a) have shown that the performance gap between supervised and unsupervised learning can be significantly closed with large-scale contrastive learning over strong data augmentations. In this case, contrastive learning aims to extract the mutual information shared across augmentations, so good augmentations for contrastive learning should keep information relevant to downstream tasks (e.g., classification) while discarding nuisances for generalization. Finding such augmentations is still challenging, yet in some sense it is more tangible than in the case of GANs, as there are known ways to formulate the goal rigorously, e.g., the InfoMax (Linsker, 1988) or InfoMin (Tian et al., 2020) principles.

Contribution. In this paper, we propose Contrastive Discriminator (ContraD), a new way of training GAN discriminators that incorporates the principle of contrastive learning. Specifically, instead of directly optimizing the discriminator network for the GAN loss, ContraD uses the network mainly to extract a contrastive representation from a given set of data augmentations and (real or generated) samples. The actual discriminator that minimizes the GAN loss is defined separately on top of the contrastive representation, and it turns out that a simple 2-layer network there is sufficient to form a complete GAN.
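As a concrete reference for the contrastive objective mentioned above, the SimCLR-style NT-Xent loss (Chen et al., 2020a) can be sketched in NumPy as follows; the batch size, feature dimension, and temperature here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss over
    two batches of projected features, one per augmented view.
    z1[i] and z2[i] are the two views of the same underlying sample."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)              # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / tau                               # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    # the positive for row i is the other view of the same sample
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos_idx].mean()
```

The loss is a cross-entropy that pulls the two views of each sample together while pushing all other samples in the batch apart; feeding it two highly correlated views yields a much lower loss than feeding it unrelated batches.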
By design, ContraD can be naturally trained with augmentations used in the literature of contrastive learning, e.g., those proposed by SimCLR (Chen et al., 2020a), which are in fact much stronger than typical practice in the context of GAN training (Zhang et al., 2020; Zhao et al., 2020c; a; Karras et al., 2020a). Our key observation here is that the task of contrastive learning (to discriminate independent real samples from one another) and that of the GAN discriminator (to discriminate fake samples from real ones) benefit each other when jointly trained with a shared representation. Self-supervised learning, including contrastive learning, has recently been applied to GANs as an auxiliary task on top of the GAN loss (Chen et al., 2019; Tran et al., 2019; Lee et al., 2021; Zhao et al., 2020d), mainly in an attempt to alleviate catastrophic forgetting in discriminators (Chen et al., 2019). For conditional GANs, Kang & Park (2020) have proposed a contrastive form of loss to efficiently incorporate given conditional information into discriminators. Our work differs from these prior works in that, to the best of our knowledge, it is the first method that successfully leverages contrastive learning alone to incorporate a wide range of data augmentations in GAN training. Indeed, for example, Zhao et al. (2020d) recently reported that simply regularizing with an auxiliary SimCLR loss (Chen et al., 2020a) improves GAN training, but could not outperform existing methods based on simple data augmentations, e.g., bCR (Zhao et al., 2020c).
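A minimal sketch of the shared-representation design described above: a contrastive projection head and a small 2-layer discriminator head both operate on the same encoder features. All names, shapes, and initializations below are hypothetical stand-ins, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_forward(x, W1, b1, W2, b2):
    """A 2-layer MLP head: ReLU hidden layer, linear output."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

# Hypothetical dimensions: the shared encoder maps images to d_repr features.
d_repr, d_proj = 128, 64
feats = rng.normal(size=(4, d_repr))  # stand-in for encoder(augment(images))

# SimCLR-style projection head, trained with the contrastive loss.
proj = mlp_forward(feats,
                   rng.normal(size=(d_repr, d_repr)) * 0.05, np.zeros(d_repr),
                   rng.normal(size=(d_repr, d_proj)) * 0.05, np.zeros(d_proj))

# Separate 2-layer discriminator head producing one real/fake logit per
# sample; in ContraD only the heads' own losses shape their parameters,
# while the encoder is learned mainly from the contrastive objective.
logits = mlp_forward(feats,
                     rng.normal(size=(d_repr, d_repr)) * 0.05, np.zeros(d_repr),
                     rng.normal(size=(d_repr, 1)) * 0.05, np.zeros(1))
```

The point of the sketch is the factorization: the expensive encoder is shared, and the GAN discriminator itself is only a lightweight head on the contrastive representation.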

2. BACKGROUND

Generative adversarial networks. We consider the problem of learning a generative model p_g from a given dataset {x_i}_{i=1}^N, where x_i ~ p_data and x_i ∈ X. To this end, a generative adversarial network (GAN) (Goodfellow et al., 2014) considers two neural networks: (a) a generator network G : Z → X that maps a latent variable z ~ p(z) into X, where p(z) is a given prior distribution, and (b) a discriminator network D : X → [0, 1] that discriminates samples from p_data from those of the implicit distribution p_g induced by G(z). The primitive form of training G and D is the following:

\min_G \max_D V(G, D) := \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].   (1)

For a fixed G, the inner maximization of (1) with respect to D is attained by the optimal discriminator

D^*_G(x) := \arg\max_D V(G, D) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},

and consequently the outer minimization with respect to G becomes minimizing the Jensen-Shannon divergence between p_data and p_g. Although formulation (1) theoretically guarantees p^*_g = p_data as the global optimum, the nonsaturating loss (Goodfellow et al., 2014) is favored in practice for better optimization stability:

\max_D L(D) := V(G, D), \quad \text{and} \quad \min_G L(G) := -\mathbb{E}_z[\log D(G(z))].   (2)

Here, compared to (1), G is now optimized to let D classify G(z) as 1, i.e., as "real".

Contrastive representation learning. Consider two random variables v^(1) and v^(2), which are often referred to as views. Generally speaking, contrastive learning aims to extract a useful representation of v^(1) and v^(2) by learning a function that identifies whether a given sample is from

