DISSECTING ADAPTIVE METHODS IN GANS

Abstract

Adaptive methods are a crucial component widely used for training generative adversarial networks (GANs). While there has been some work to pinpoint the "marginal value of adaptive methods" in standard tasks, it remains unclear why they are still critical for GAN training. In this paper, we formally study how adaptive methods help train GANs; inspired by the grafting method proposed in Agarwal et al. (2020), we separate the magnitude and direction components of the Adam updates, and graft them onto the direction and magnitude of SGDA updates respectively. By considering an update rule with the magnitude of the Adam update and the normalized direction of SGDA, we empirically show that the adaptive magnitude of Adam is key for GAN training. This motivates a closer look at the class of normalized stochastic gradient descent ascent (nSGDA) methods in the context of GAN training. We propose a synthetic theoretical framework to compare the performance of nSGDA and SGDA for GAN training with neural networks. We prove that in this setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (under any learning rate configuration) suffer from mode collapse. The critical insight in our analysis is that normalizing the gradients forces the discriminator and generator to be updated at the same pace. We also show experimentally that, for several datasets, Adam's performance can be recovered with nSGDA methods.

1. INTRODUCTION

Adaptive algorithms have become a key component in training modern neural network architectures in various deep learning tasks. Minimization problems that arise in natural language processing (Vaswani et al., 2017), fMRI (Zbontar et al., 2018), or min-max problems such as generative adversarial networks (GANs) (Goodfellow et al., 2014) almost exclusively use adaptive methods, and it has been empirically observed that Adam (Kingma & Ba, 2014) yields a solution with better generalization than stochastic gradient descent (SGD) in such problems (Choi et al., 2019). Several works have attempted to explain this phenomenon in the minimization setting. Common explanations are that adaptive methods train faster (Zhou et al., 2018), escape flat "saddle-point"-like plateaus faster (Orvieto et al., 2021), or handle heavy-tailed stochastic gradients better (Zhang et al., 2019; Gorbunov et al., 2022). However, much less is known about why adaptive methods are so critical for solving min-max problems such as GANs. Several previous works attribute this performance to the superior convergence speed of adaptive methods. For instance, Liu et al. (2019) show that an adaptive variant of Optimistic Gradient Descent (Daskalakis et al., 2017) converges faster than stochastic gradient descent ascent (SGDA) for a class of non-convex, non-concave min-max problems. However, contrary to the minimization setting, convergence to a stationary point is not required to obtain satisfactory GAN performance. Mescheder et al. (2018) empirically show that popular architectures such as Wasserstein GANs (WGANs) (Arjovsky et al., 2017) do not always converge, and yet they produce realistic images. We support this observation with our own experiments in Section 2. Our findings motivate the central question in this paper: what factors of Adam contribute to better quality solutions over SGDA when training GANs?
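To fix ideas, the two update rules being compared can be sketched on a toy min-max problem. The sketch below (in numpy, with illustrative names and hyperparameters; it is not the paper's training code) applies SGDA and an Adam-style update to min_x max_y f(x, y) = xy, treating ascent on y as descent on -f:

```python
import numpy as np

def grads(x, y):
    # For f(x, y) = x * y: df/dx = y, df/dy = x.
    return y, x

def sgda_step(x, y, lr=0.1):
    # Stochastic gradient descent ascent: descent on x, ascent on y.
    gx, gy = grads(x, y)
    return x - lr * gx, y + lr * gy

def adam_step(x, y, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam on the stacked gradient [df/dx, -df/dy] (ascent on y == descent on -f).
    gx, gy = grads(x, y)
    g = np.array([gx, -gy])
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g        # first-moment estimate
    state["v"] = b2 * state["v"] + (1 - b2) * g**2     # second-moment estimate
    mhat = state["m"] / (1 - b1 ** state["t"])         # bias correction
    vhat = state["v"] / (1 - b2 ** state["t"])
    step = lr * mhat / (np.sqrt(vhat) + eps)           # per-coordinate rescaling
    return x - step[0], y - step[1], state

x, y = 1.0, 1.0
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
for _ in range(5):
    x, y, state = adam_step(x, y, state)
```

The per-coordinate rescaling by sqrt(vhat) is what makes Adam's effective step magnitude adaptive; SGDA's step is proportional to the raw gradient.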
In this paper, we investigate why GANs trained with adaptive methods outperform those trained using SGDA. Directly analyzing Adam is challenging due to the highly non-linear nature of its gradient oracle and its path-dependent update rule. Inspired by the grafting approach of Agarwal et al. (2020), we disentangle the adaptive magnitude and direction of Adam and show evidence that an algorithm using the adaptive magnitude of Adam and the direction of SGDA (which we call Ada-nSGDA) recovers the performance of Adam in GANs. Our contributions are as follows:

• In Section 2, we present the Ada-nSGDA algorithm. We empirically show that the adaptive magnitude in Ada-nSGDA stays within a constant range and does not fluctuate heavily, which motivates our focus on normalized SGDA (nSGDA), which retains only the direction of SGDA.

• In Section 3, we prove that for a synthetic dataset consisting of two modes, a model trained with SGDA suffers from mode collapse (producing only a single type of output), while a model trained with nSGDA does not. This provides an explanation for why GANs trained with nSGDA outperform those trained with SGDA.

• In Section 4, we empirically confirm that nSGDA mostly recovers the performance of Ada-nSGDA when using different GAN architectures on a wide range of datasets.

Our key theoretical insight is that with SGDA and any step-size configuration, either the generator G or the discriminator D updates much faster than the other. By normalizing the gradients as done in nSGDA, D and G are forced to update at the same speed throughout training. The consequence is that whenever D learns a mode of the distribution, G learns it right after, which makes both of them learn all the modes of the distribution separately at the same pace.

1.1. RELATED WORK

Adaptive methods in games optimization. Several works have designed adaptive algorithms and analyzed their convergence to show their benefits relative to SGDA, e.g. in variational inequality problems (Gasnikov et al., 2019; Antonakopoulos et al., 2019; Bach & Levy, 2019; Antonakopoulos et al., 2020; Liu et al., 2019; Barazandeh et al., 2021). Heusel et al. (2017) show that Adam locally converges to a Nash equilibrium in the regime where the step-size of the discriminator is much larger than that of the generator. Our work differs in that we do not focus on the convergence properties of Adam, but rather on the fit of the trained model to the true (and not empirical) data distribution.

Statistical results in GANs. Early works studied whether GANs memorize the training data or actually learn the distribution (Arora et al., 2017; Liang, 2017; Feizi et al., 2017; Zhang et al., 2017; Arora et al., 2018; Bai et al., 2018; Dumoulin et al., 2016). Some works explained GAN performance through the lens of optimization. Lei et al. (2020) and Balaji et al. (2021) show that GANs trained with SGDA converge to a global saddle point when the generator is a one-layer neural network and the discriminator is a specific quadratic/linear function. Our contribution differs in that i) we construct a setting where SGDA converges to a locally optimal min-max equilibrium but still suffers from mode collapse, and ii) our setting is more challenging since we need at least a degree-3 discriminator to learn the distribution, as discussed in Section 3.

Normalized gradient descent. Introduced by Nesterov (1984), normalized gradient descent has been widely used in minimization problems. Normalizing the gradient remedies the issue of iterates being stuck in flat regions such as spurious local minima or saddle points (Hazan et al., 2015; Levy, 2016). Normalized gradient descent methods outperform their non-normalized counterparts in multi-agent coordination (Cortés, 2006) and deep learning tasks (Cutkosky & Mehta, 2020). Our work considers the min-max setting and shows that nSGDA outperforms SGDA because it forces the discriminator and generator to update at the same rate.

1.2. BACKGROUND

Generative adversarial networks. Given a training set sampled from some target distribution D, a GAN learns to generate new data from this distribution. The architecture comprises two networks: a generator that maps points in the latent space to samples of the desired distribution, and a discriminator that evaluates these samples by comparing them to samples from D. More formally, the generator is a mapping G_V : R^k → R^d and the discriminator is a mapping D_W : R^d → R, where V and W are their corresponding parameter sets. To find the optimal parameters of these two networks, one solves a min-max optimization problem of the form

min_V max_W E_{X∼p_data}[log(D_W(X))] + E_{z∼p_z}[log(1 − D_W(G_V(z)))] := f(V, W),   (GAN)

where p_data is the distribution of the training set, p_z the latent distribution, G_V the generator, and D_W the discriminator. Contrary to minimization problems, where convergence to a local minimum is required for good generalization, we empirically verify that most well-performing GANs do not converge to a stationary point.
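As a concrete illustration, the (GAN) objective f(V, W) can be evaluated empirically on a batch of real samples and latent draws. The sketch below uses a toy linear generator and logistic discriminator; all names, shapes, and architectures are illustrative stand-ins, not the networks studied in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def D(W, X):
    # Toy discriminator D_W : R^d -> (0, 1).
    return sigmoid(X @ W)

def G(V, z):
    # Toy linear generator G_V : R^k -> R^d.
    return z @ V

def f(V, W, X_real, Z):
    # Empirical version of the (GAN) objective:
    #   E_X[log D_W(X)] + E_z[log(1 - D_W(G_V(z)))]
    real_term = np.mean(np.log(D(W, X_real)))
    fake_term = np.mean(np.log(1.0 - D(W, G(V, Z))))
    return real_term + fake_term

k, d = 2, 3
V = rng.normal(size=(k, d))         # generator parameters
W = rng.normal(size=d)              # discriminator parameters
X_real = rng.normal(size=(8, d))    # batch of "real" samples
Z = rng.normal(size=(8, k))         # batch of latent draws z ~ p_z
value = f(V, W, X_real, Z)          # G minimizes this over V, D maximizes it over W
```

Since D outputs probabilities in (0, 1), both log terms are negative, so f(V, W) is always below zero; training alternates ascent steps on W with descent steps on V.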


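To make the grafting idea behind Ada-nSGDA concrete, the sketch below combines the step magnitude of an Adam update with the normalized direction of the SGDA gradient for one parameter group. This is an illustrative reconstruction under assumptions (function names, hyperparameters, and the per-group granularity are ours), not the paper's exact algorithm:

```python
import numpy as np

def adam_magnitude(g, state, lr=2e-4, b1=0.5, b2=0.999, eps=1e-8):
    # Norm of the step Adam would take for gradient g (one parameter group).
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g**2
    mhat = state["m"] / (1 - b1 ** state["t"])
    vhat = state["v"] / (1 - b2 ** state["t"])
    step = lr * mhat / (np.sqrt(vhat) + eps)
    return np.linalg.norm(step)

def ada_nsgda_step(w, g, state):
    # Grafted update: Adam's adaptive magnitude, SGDA's normalized direction.
    direction = g / (np.linalg.norm(g) + 1e-12)
    return w - adam_magnitude(g, state) * direction

w = np.ones(4)
state = {"t": 0, "m": np.zeros(4), "v": np.zeros(4)}
g = np.array([0.5, -0.25, 0.1, 0.0])   # stand-in stochastic gradient
w = ada_nsgda_step(w, g, state)
```

Dropping `adam_magnitude` and using a fixed step size on the normalized direction yields plain nSGDA; applying the same normalization to both the discriminator's and the generator's gradients is what forces the two networks to move at the same pace.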