DISSECTING ADAPTIVE METHODS IN GANS

Abstract

Adaptive methods are a crucial component widely used for training generative adversarial networks (GANs). While there has been some work to pinpoint the "marginal value of adaptive methods" in standard tasks, it remains unclear why they are still critical for GAN training. In this paper, we formally study how adaptive methods help train GANs; inspired by the grafting method proposed in Agarwal et al. (2020), we separate the magnitude and direction components of the Adam updates, and graft them to the direction and magnitude of SGDA updates respectively. By considering an update rule with the magnitude of the Adam update and the normalized direction of SGDA, we empirically show that the adaptive magnitude of Adam is key for GAN training. This motivates us to have a closer look at the class of normalized stochastic gradient descent ascent (nSGDA) methods in the context of GAN training. We propose a synthetic theoretical framework to compare the performance of nSGDA and SGDA for GAN training with neural networks. We prove that in that setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (under any learning rate configuration) suffer from mode collapse. The critical insight in our analysis is that normalizing the gradients forces the discriminator and generator to be updated at the same pace. We also experimentally show that for several datasets, Adam's performance can be recovered with nSGDA methods.

1. INTRODUCTION

Adaptive algorithms have become a key component in training modern neural network architectures in various deep learning tasks. Minimization problems that arise in natural language processing (Vaswani et al., 2017), fMRI (Zbontar et al., 2018), or min-max problems such as generative adversarial networks (GANs) (Goodfellow et al., 2014) almost exclusively use adaptive methods, and it has been empirically observed that Adam (Kingma & Ba, 2014) yields a solution with better generalization than stochastic gradient descent (SGD) in such problems (Choi et al., 2019). Several works have attempted to explain this phenomenon in the minimization setting. Common explanations are that adaptive methods train faster (Zhou et al., 2018), escape flat "saddle-point"-like plateaus faster (Orvieto et al., 2021), or handle heavy-tailed stochastic gradients better (Zhang et al., 2019; Gorbunov et al., 2022). However, much less is known about why adaptive methods are so critical for solving min-max problems such as GANs. Several previous works attribute this performance to the superior convergence speed of adaptive methods. For instance, Liu et al. (2019) show that an adaptive variant of Optimistic Gradient Descent (Daskalakis et al., 2017) converges faster than stochastic gradient descent ascent (SGDA) for a class of non-convex, non-concave min-max problems. However, contrary to the minimization setting, convergence to a stationary point is not required to obtain satisfactory GAN performance. Mescheder et al. (2018) empirically show that popular architectures such as Wasserstein GANs (WGANs) (Arjovsky et al., 2017) do not always converge, and yet they produce realistic images. We support this observation with our own experiments in Section 2. Our findings motivate the central question of this paper: what factors of Adam contribute to better-quality solutions over SGDA when training GANs?
In this paper, we investigate why GANs trained with adaptive methods outperform those trained using SGDA. Directly analyzing Adam is challenging due to the highly non-linear nature of its gradient oracle and its path-dependent update rule. Inspired by the grafting approach of Agarwal et al. (2020), we disentangle the adaptive magnitude and direction of Adam and show evidence that an algorithm using the adaptive magnitude of Adam and the direction of SGDA (which we call Ada-nSGDA) recovers the performance of Adam in GANs. Our contributions are as follows:

• In Section 2, we present the Ada-nSGDA algorithm. We empirically show that the adaptive magnitude in Ada-nSGDA stays within a constant range and does not heavily fluctuate, which motivates the focus on normalized SGDA (nSGDA), an algorithm that only retains the direction of SGDA.

• In Section 3, we prove that for a synthetic dataset consisting of two modes, a model trained with SGDA suffers from mode collapse (producing only a single type of output), while a model trained with nSGDA does not. This provides an explanation for why GANs trained with nSGDA outperform those trained with SGDA.

• In Section 4, we empirically confirm that nSGDA mostly recovers the performance of Ada-nSGDA when using different GAN architectures on a wide range of datasets.

Our key theoretical insight is that when using SGDA with any step-size configuration, either the generator G or the discriminator D updates much faster than the other. By normalizing the gradients as done in nSGDA, D and G are forced to update at the same speed throughout training. The consequence is that whenever D learns a mode of the distribution, G learns it right after, which makes both of them learn all the modes of the distribution separately at the same pace.

1.1. RELATED WORK

Adaptive methods in games optimization. Several works designed adaptive algorithms and analyzed their convergence to show their benefits relative to SGDA, e.g., in variational inequality problems (Barazandeh et al., 2021). Heusel et al. (2017) show that Adam locally converges to a Nash equilibrium in the regime where the step-size of the discriminator is much larger than that of the generator. Our work differs as we do not focus on the convergence properties of Adam, but rather on the fit of the trained model to the true (and not empirical) data distribution.

Statistical results in GANs. Early works studied whether GANs memorize the training data or actually learn the distribution (Arora et al., 2017; Liang, 2017; Feizi et al., 2017; Zhang et al., 2017; Arora et al., 2018; Bai et al., 2018; Dumoulin et al., 2016). Some works explained GAN performance through the lens of optimization. Lei et al. (2020); Balaji et al. (2021) show that GANs trained with SGDA converge to a global saddle point when the generator is a one-layer neural network and the discriminator is a specific quadratic/linear function. Our contribution differs as i) we construct a setting where SGDA converges to a locally optimal min-max equilibrium but still suffers from mode collapse, and ii) we have a more challenging setting, since we need at least a degree-3 discriminator to learn the distribution, as discussed in Section 3.

Normalized gradient descent. Introduced by Nesterov (1984), normalized gradient descent has been widely used in minimization problems. Normalizing the gradient remedies the issue of iterates getting stuck in flat regions such as spurious local minima or saddle points (Hazan et al., 2015; Levy, 2016). Normalized gradient descent methods outperform their non-normalized counterparts in multi-agent coordination (Cortés, 2006) and deep learning tasks (Cutkosky & Mehta, 2020). Our work considers the min-max setting and shows that nSGDA outperforms SGDA, as it forces the discriminator and generator to update at the same rate.

1.2. BACKGROUND

Generative adversarial networks. Given a training set sampled from some target distribution, a GAN learns to generate new data from this distribution. The architecture comprises two networks: a generator that maps points from the latent space to samples of the desired distribution, and a discriminator that evaluates these samples by comparing them to samples from the target distribution. More formally, the generator is a mapping G_V : R^k → R^d and the discriminator is a mapping D_W : R^d → R, where V and W are their corresponding parameter sets. To find the optimal parameters of these two networks, one must solve a min-max optimization problem of the form

min_V max_W  f(V, W) := E_{X∼p_data}[log(D_W(X))] + E_{z∼p_z}[log(1 − D_W(G_V(z)))],   (GAN)

where p_data is the distribution of the training set, p_z the latent distribution, G_V the generator, and D_W the discriminator. Contrary to minimization problems, where convergence to a local minimum is required for high generalization, we empirically verify that most well-performing GANs do not converge to a stationary point.

Convergence and performance are decorrelated in GANs. We support this observation through the following experiment. We train a DCGAN (Radford et al., 2015) using Adam and set the step-sizes for G and D to η_G and η_D, respectively. Note that D is usually trained faster than G, i.e., η_D ≥ η_G. Figure 1a displays GAN convergence, measured by a ratio of gradient norms, against GAN performance, measured by the FID score (Heusel et al., 2017). This ratio is equal to ∥grad_G^(t)∥_2/∥grad_G^(0)∥_2 + ∥grad_D^(t)∥_2/∥grad_D^(0)∥_2, where grad_G^(t) (resp. grad_D^(t)) are the current gradients of G (resp. D) and grad_G^(0) (resp. grad_D^(0)) the gradients at initialization. Note that ∥•∥_2 aggregates the norms of all the parameters in a network. For all the plots, the models are trained for 100 epochs using a batch size of 64. For (b), the results are averaged over 5 seeds.
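As a concrete reference, the convergence measure above can be computed as in the following numpy sketch. The aggregation of per-parameter norms into a single network norm is our reading of the text, and the function names are ours:

```python
import numpy as np

def network_norm(grads):
    # The paper's ∥•∥2 aggregates the parameter norms of a whole network;
    # here we read that as the sum of the per-tensor gradient norms.
    return sum(float(np.linalg.norm(g)) for g in grads)

def convergence_ratio(g_G_t, g_D_t, g_G_0, g_D_0):
    # Convergence measure from Figure 1:
    # ∥grad_G^(t)∥2/∥grad_G^(0)∥2 + ∥grad_D^(t)∥2/∥grad_D^(0)∥2.
    return (network_norm(g_G_t) / network_norm(g_G_0)
            + network_norm(g_D_t) / network_norm(g_D_0))
```

A value that stays near 2 during training (current gradients as large as at initialization, for both networks) indicates non-convergence in this metric.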
We observe that when η_D/η_G is close to 1, the algorithm does not converge, but the model produces high-quality samples. On the other hand, when η_D/η_G ≫ 1, the model converges to an equilibrium; a similar statement has been proved by Jin et al. (2020) and Fiez & Ratliff (2020) in the case of SGDA. However, the trained GAN produces low-quality samples at this equilibrium, so simply comparing the convergence speed of adaptive methods and SGDA cannot explain the performance obtained with adaptive methods.

SGDA and adaptive methods. The simplest algorithm to solve the min-max problem (GAN) is SGDA, defined as

W^(t+1) = W^(t) + η_D M_{W,1}^(t),   V^(t+1) = V^(t) − η_G M_{V,1}^(t),

where M_{W,1}^(t) and M_{V,1}^(t) are the first-order momentum gradients defined in Algorithm 1. While this method was used in the first GANs (Radford et al., 2015), most modern GANs are trained with adaptive methods such as Adam (Kingma & Ba, 2014); its definition for game optimization is given in Algorithm 1. The hyperparameters β_1, β_2 ∈ [0, 1) control the weighting of the exponential moving averages of the first and second-order moments. In many deep-learning tasks, practitioners have found that setting β_2 = 0.9 works for most problem settings. Additionally, it has been empirically observed that having no momentum (i.e., β_1 ≈ 0) is optimal for many popular GAN architectures (Karras et al., 2020; Brock et al., 2018). Therefore, we only consider the case where β_1 = 0. Optimizers such as Adam (Algorithm 1) are adaptive since the effective step-size applied to each parameter is not simply proportional (up to a constant such as the global learning rate) to the gradient g_Y^(t) for that parameter, and this step-size evolves while the model trains.
There are three components that make the adaptive update differ from the standard SGDA update: 1) the adaptive normalization by ∥g_Y^(t)∥_2, 2) the change of direction from g_Y^(t)/∥g_Y^(t)∥_2 to A_Y^(t)/∥A_Y^(t)∥_2, and 3) the adaptive scaling by ∥A_Y^(t)∥_2. In summary, the steps from the standard to the adaptive update are:

g_Y^(t)  --[normalization: × 1/∥g_Y^(t)∥_2]-->  g_Y^(t)/∥g_Y^(t)∥_2  --[change of direction]-->  A_Y^(t)/∥A_Y^(t)∥_2  --[adaptive scaling: × ∥A_Y^(t)∥_2]-->  A_Y^(t)   (2)

These three components are entangled, and it remains unclear how each of them contributes to the superior performance of adaptive methods relative to SGDA in GANs.

Algorithm 1 Adam (Kingma & Ba, 2014) for games. All operations on vectors are element-wise.
Input: initial points W^(0), V^(0), step-size schedules {(η_G^(t), η_D^(t))}, hyperparameters {β_1, β_2, ε}.
Initialize M_{W,1}^(0), M_{W,2}^(0), M_{V,1}^(0), and M_{V,2}^(0) to zero.
for t = 0 ... T − 1 do
  Receive stochastic gradients g_W^(t), g_V^(t) evaluated at W^(t) and V^(t).
  Update for Y ∈ {W, V}: M_{Y,1}^(t+1) = β_1 M_{Y,1}^(t) + g_Y^(t) and M_{Y,2}^(t+1) = β_2 M_{Y,2}^(t) + (g_Y^(t))^2.
  Compute gradient oracles for Y ∈ {W, V}: A_Y^(t+1) = M_{Y,1}^(t+1) / (√(M_{Y,2}^(t+1)) + ε).
  Update: W^(t+1) = W^(t) + η_D^(t) A_W^(t+1),  V^(t+1) = V^(t) − η_G^(t) A_V^(t+1).
end for
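Algorithm 1 can be written in a few lines of numpy. This is a sketch under our reading of the algorithm: the moment accumulators follow the update rules exactly as stated (no (1 − β) factors or bias correction), and the function name and state layout are ours:

```python
import numpy as np

def adam_game_step(W, V, g_W, g_V, state, eta_D, eta_G,
                   beta1=0.0, beta2=0.9, eps=1e-8):
    # Moment updates exactly as written in Algorithm 1 (no debiasing factors).
    state["MW1"] = beta1 * state["MW1"] + g_W
    state["MW2"] = beta2 * state["MW2"] + g_W ** 2
    state["MV1"] = beta1 * state["MV1"] + g_V
    state["MV2"] = beta2 * state["MV2"] + g_V ** 2
    # Adam gradient oracles A_Y = M_{Y,1} / (sqrt(M_{Y,2}) + eps).
    A_W = state["MW1"] / (np.sqrt(state["MW2"]) + eps)
    A_V = state["MV1"] / (np.sqrt(state["MV2"]) + eps)
    # Ascent step on the discriminator, descent step on the generator.
    return W + eta_D * A_W, V - eta_G * A_V
```

With β_1 = 0 (the regime considered in this paper) the oracle reduces to an element-wise sign-like step g/( √(β_2 M_2 + g²) + ε).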

2. NSGDA AS A MODEL TO ANALYZE ADAM IN GANS

In this section, we show that normalized stochastic gradient descent-ascent (nSGDA) is a suitable proxy to study the learning dynamics of Adam. To decouple the normalization, change of direction, and adaptive scaling in Adam, we adopt the grafting approach proposed by Agarwal et al. (2020). At each iteration, we compute stochastic gradients, pass them to two optimizers A_1, A_2, and make a grafted step that combines the magnitude of A_1's step and the direction of A_2's step. We focus on the optimizer defined by grafting the Adam magnitude onto the SGDA direction, which corresponds to omitting the change of direction in (2):

W^(t+1) = W^(t) + η_D ∥A_W^(t)∥_2 · g_W^(t)/(∥g_W^(t)∥_2 + ε),   V^(t+1) = V^(t) − η_G ∥A_V^(t)∥_2 · g_V^(t)/(∥g_V^(t)∥_2 + ε),   (3)

where A_W^(t), A_V^(t) are the Adam gradient oracles as in Algorithm 1 and g_W^(t), g_V^(t) the stochastic gradients. We refer to this algorithm as Ada-nSGDA (combining the Adam magnitude and SGDA direction). There are two natural implementations for nSGDA. In the layer-wise version, Y^(t) is a single parameter group (typically a layer in a neural network), and the updates are applied to each group separately. In the global version, Y^(t) contains all of the model's weights. In Fig. 2a, we see that Ada-nSGDA and Adam have similar learning dynamics in terms of the FID score. Both Adam and Ada-nSGDA significantly outperform SGDA as well as AdaDir, the alternate case of (3) where we instead graft the magnitude of the SGDA update onto the direction of the Adam update. AdaDir diverged after a single step, so we omit it in Fig. 2. These results show that the adaptive scaling and normalization components are sufficient to recover the performance of Adam, suggesting that Ada-nSGDA is a valid proxy for Adam. A natural question that arises is how the total adaptive magnitude varies during training.
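The grafted update (3) for one network can be sketched as follows; the function name is ours, and `A` is the Adam gradient oracle maintained separately as in Algorithm 1 (global variant shown, where the arrays gather all weights of one network):

```python
import numpy as np

def ada_nsgda_step(Y, g, A, eta, eps=1e-8, maximize=False):
    # Magnitude of the Adam step (∥A∥2), direction of the SGDA step (g/∥g∥2).
    step = np.linalg.norm(A) * g / (np.linalg.norm(g) + eps)
    # Ascent for the discriminator (maximize=True), descent for the generator.
    return Y + eta * step if maximize else Y - eta * step
```

In the layer-wise variant, the same function is simply applied per parameter group instead of once on the concatenated weights.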
We empirically investigate this by tracking the layer-wise adaptive magnitudes of the Adam gradient oracles when training a GAN with Ada-nSGDA, and summarize our key findings here (see Section 4 for complete experimental details). We first train a WGAN-GP (Arjovsky et al., 2017) model and find that the adaptive magnitude is bounded within a constant range, and that all the layers have approximately the same adaptive magnitude (Fig. 2 (b, c)). This suggests that the adaptive scaling component is constant (in expectation) and motivates the use of nSGDA, which corresponds to Ada-nSGDA with a constant adaptive scaling factor. We then train a WGAN-GP model with nSGDA and find that nSGDA mostly recovers the FID score obtained by Ada-nSGDA (Fig. 2a). We also validate this observation for more complicated GAN architectures by repeating this study on StyleGAN2 (Karras et al., 2020). We find that the adaptive magnitudes also vary within a constant range, but each layer has its own constant scaling factor. Thus, training StyleGAN2 with nSGDA and a global normalization fails, but training with nSGDA with a different constant step-size for each layer yields a performance that mostly recovers that of Ada-nSGDA. These results suggest that the schedule of the adaptive scaling is not central to the success of Ada-nSGDA in GANs. Instead, adaptive methods are successful because they normalize the gradients for each layer, which allows for more balanced updates between G and D, as we will show in Section 3. We conduct more experiments in Section 4 and in Appendix A.

3. WHY DOES NSGDA PERFORM BETTER THAN SGDA IN GANS?

In Section 2, we empirically showed that the most important component of the adaptive magnitude is the normalization, and that nSGDA (an algorithm consisting of this component alone) is sufficient to recover most of the performance of Ada-nSGDA (and by extension, Adam).
Our goal is to construct a dataset and model where we can prove that a model trained with nSGDA generates samples from the true training distribution while SGDA fails. To this end, we consider a dataset where the underlying distribution consists of two modes, defined as vectors u_1, u_2, that are slightly correlated (see Assumption 1), and consider the standard GAN training objective. We show that a GAN trained with SGDA using any reasonable¹ step-size configuration suffers from mode collapse (Theorem 3.1); it only outputs samples from a single mode, which is a weighted average of u_1 and u_2. Conversely, nSGDA-trained GANs learn the two modes separately (Theorem 3.2).

Notation. We set the GAN 1-sample loss L_{V,W}^(t)(X, z) = log(D_W^(t)(X)) + log(1 − D_W^(t)(G_V^(t)(z))). We denote by g_Y^(t) = ∇_Y L_{V,W}^(t)(X, z) the 1-sample stochastic gradient. We use asymptotic complexity notation when defining the different constants, e.g., poly(d) refers to any polynomial in the dimension d, polylog(d) to any polynomial in log(d), and o(1) to a quantity ≪ 1. We write a ∝ b for vectors a and b in R^d if there is a positive scaling factor c > 0 such that ∥a − cb∥_2 = o(∥b∥_2).

3.1. SETTING

In this section, we present the setting needed to state our main results in Theorem 3.1 and Theorem 3.2. We first define the distributions for the training set and latent samples, then specify our GAN model and the algorithms we analyze to solve (GAN). Note that for many assumptions and theorems below, we present informal statements which are sufficient to capture the main insights; the precise statements can be found in Appendix B. Our synthetic theoretical framework considers a bimodal data distribution with two correlated modes:

Assumption 1 (p_data structure). Let γ = 1/polylog(d). We assume that the modes are correlated, i.e., ⟨u_1, u_2⟩ = γ > 0, and that each data point X is either X = u_1 or X = u_2.

Next, we define the latent distribution p_z that G_V samples from and maps to p_data. Each sample from p_z is a binary-valued vector z ∈ {0, 1}^{m_G}, where m_G is the number of neurons in G_V, with non-zero support, i.e., ∥z∥_0 ≥ 1. Although the typical choice of latent distribution in GANs is either Gaussian or uniform, we choose p_z to be a binary distribution because it models the weights' distribution of a hidden layer of a deep generator; Allen-Zhu & Li (2021) argue that the distributions of these hidden layers are sparse, non-negative, and non-positively correlated. We now make the following assumptions on the coefficients of z:

Assumption 2 (p_z structure). Let z ∼ p_z. We assume that with probability 1 − o(1), there is only one non-zero entry in z. The probability that entry i ∈ [m_G] is non-zero is Pr[z_i = 1] = Θ(1/m_G).

Under Assumption 2, the output of G_V consists of a single mode with probability 1 − o(1). This avoids summing two of the generator's neurons, which may cause mode collapse.
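A minimal sampler for this synthetic setup might look as follows. The explicit coordinates of u_1, u_2 and the uniform choice between the two modes are illustrative assumptions (the analysis only needs unit modes with correlation γ), and the sketch keeps exactly one active coordinate of z, i.e., the probability-(1 − o(1)) event of Assumption 2:

```python
import numpy as np

def make_modes(d, gamma):
    # Two unit-norm modes u1, u2 with correlation <u1, u2> = gamma (Assumption 1).
    u1 = np.zeros(d); u1[0] = 1.0
    u2 = np.zeros(d); u2[0] = gamma; u2[1] = np.sqrt(1.0 - gamma ** 2)
    return u1, u2

def sample_p_data(u1, u2, rng):
    # Each data point X is one of the two modes.
    return u1 if rng.random() < 0.5 else u2

def sample_p_z(m_G, rng):
    # z in {0,1}^{m_G} with a single active coordinate, each coordinate
    # being chosen with probability 1/m_G (the dominant event of Assumption 2).
    z = np.zeros(m_G)
    z[rng.integers(m_G)] = 1.0
    return z
```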
To learn the target distribution p_data, we use a linear generator G_V with m_G neurons and a non-linear discriminator D_W with m_D neurons:

G_V(z) = Vz = Σ_{i=1}^{m_G} v_i z_i,   D_W(X) = sigmoid( a Σ_{i=1}^{m_D} ⟨w_i, X⟩^3 + b√d ),

where V = [v_1^⊤, v_2^⊤, ..., v_{m_G}^⊤] ∈ R^{m_G×d}, z ∈ {0, 1}^{m_G}, W = [w_1^⊤, ..., w_{m_D}^⊤] ∈ R^{m_D×d}, and a, b ∈ R. Intuitively, G_V outputs linear combinations of the modes v_i. We choose a cubic activation as it is the smallest monomial degree for the discriminator's non-linearity that is sufficient for the generator to recover the modes u_1, u_2.²

We now state the SGDA and nSGDA algorithms used to solve the GAN training problem (GAN). For simplicity, we set the batch size to 1. The resulting update rules for SGDA and nSGDA are:³

SGDA: at each step t, sample X ∼ p_data and z ∼ p_z and update
W^(t+1) = W^(t) + η_D g_W^(t),   V^(t+1) = V^(t) − η_G g_V^(t).   (5)

nSGDA: at each step t, sample X ∼ p_data and z ∼ p_z and update
W^(t+1) = W^(t) + η_D g_W^(t)/∥g_W^(t)∥_2,   V^(t+1) = V^(t) − η_G g_V^(t)/∥g_V^(t)∥_2.   (6)

Compared to the versions of SGDA and Ada-nSGDA introduced in Section 2, these are the same algorithms except that we set β_1 = 0 and omit ε in (5) and (6). Note that since there is only one layer in the neural networks we study here, the global and layer-wise versions of nSGDA coincide. Lastly, we detail how to set the optimization parameters for SGDA and nSGDA in (5) and (6).

Parametrization 3.1 (Informal). When running SGDA and nSGDA on (GAN), we set:
– Initialization: b^(0) = 0, and a^(0), w_i^(0) (i ∈ [m_D]), v_j^(0) (j ∈ [m_G]) are initialized from a Gaussian with small variance.
– Number of iterations: we run SGDA for t ≤ T_0 iterations, where T_0 is the first iteration at which the algorithm converges to an approximate first-order local minimum. For nSGDA, we run T_1 = Θ(1/η_D) iterations.
– Step-sizes: for SGDA, η_D, η_G ∈ (0, 1/poly(d)) can be arbitrary. For nSGDA, η_D ∈ (0, 1/poly(d)], and η_G is slightly smaller than η_D.
– Over-parametrization: for SGDA, m_D, m_G = polylog(d) are arbitrarily chosen, i.e., m_D may be larger than m_G or the opposite. For nSGDA, we set m_D = log(d) and m_G = 2 log(d).

Our theorem holds when running SGDA for any (polynomially bounded) number of iterations; after T_0 steps, the gradient becomes inverse-polynomially small and SGDA essentially stops updating the parameters. Additionally, our setting allows any step-size configuration for SGDA, i.e., a larger, smaller, or equal step-size for D compared to G. Note that our choice of step-sizes for nSGDA is the one used in practice, i.e., η_D slightly larger than η_G.
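The two update rules (5) and (6) differ in one line: in the nSGDA step, each player moves a distance of exactly η_D (resp. η_G) per iteration regardless of the gradient magnitudes, which is the balancing mechanism the analysis exploits. A sketch (function names ours):

```python
import numpy as np

def sgda_step(W, V, g_W, g_V, eta_D, eta_G):
    # (5): plain stochastic gradient descent-ascent.
    return W + eta_D * g_W, V - eta_G * g_V

def nsgda_step(W, V, g_W, g_V, eta_D, eta_G):
    # (6): same directions, but each player's step has a fixed length,
    # so D and G are forced to update at the same pace.
    return (W + eta_D * g_W / np.linalg.norm(g_W),
            V - eta_G * g_V / np.linalg.norm(g_V))
```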

3.2. MAIN RESULTS

We state our main results on the performance of models trained using SGDA (5) and nSGDA (6). We show that nSGDA learns the modes of the distribution p_data while SGDA does not.

Theorem 3.1 (Informal). Consider a training dataset and a latent distribution as described above and let Assumption 1 and Assumption 2 hold. Let T_0, η_G, η_D and the initialization be as defined in Parametrization 3.1, and let t ≤ T_0. Run SGDA on (GAN) for t iterations with step-sizes η_G, η_D. Then, with probability at least 1 − o(1), the generator outputs, for all z ∈ {0, 1}^{m_G},

G_V^(t)(z) ∝ u_1 + u_2  if η_D ≥ η_G,   and   G_V^(t)(z) = ξ^(t)(z)  otherwise,

where ξ^(t)(z) ∈ R^d is some vector that is not correlated with any of the modes. Formally, for all ℓ ∈ [2], cos(ξ^(t)(z), u_ℓ) = o(1) for all z ∈ {0, 1}^{m_G}.

A formal proof can be found in Appendix G. Theorem 3.1 indicates that when training with SGDA and any step-size configuration, the generator either does not learn the modes at all (G_V^(t)(z) = ξ^(t)(z)) or learns an average of the modes (G_V^(t)(z) ∝ u_1 + u_2). The theorem holds for any time t ≤ T_0, where T_0 is the iteration at which SGDA converges to an approximate first-order locally optimal min-max equilibrium. Conversely, nSGDA succeeds in learning the two modes separately:

Theorem 3.2 (Informal). Consider a training dataset and a latent distribution as described above and let Assumption 1 and Assumption 2 hold. Let T_1, η_G, η_D and the initialization be as defined in Parametrization 3.1. Run nSGDA on (GAN) for T_1 iterations with step-sizes η_G, η_D. Then the generator learns both modes u_1, u_2, i.e., for ℓ ∈ {1, 2},

Pr_{z∼p_z}[G_V^(T_1)(z) ∝ u_ℓ] is non-negligible.   (8)

A formal proof can be found in Appendix I. Theorem 3.2 indicates that when we train a GAN with nSGDA in the regime where the discriminator updates slightly faster than the generator (as done in practice), the generator successfully learns the distribution containing the direction of both modes.
We implement the setting introduced in Subsection 3.1 and validate Theorem 3.1 and Theorem 3.2 in Fig. 3. Fig. 3a displays the relative update speed η∥g_Y^(t)∥_2/∥Y^(t)∥_2, where Y denotes the parameters of either D or G. Fig. 3b shows the correlation ⟨w_i^(t), u_ℓ⟩/∥w_i^(t)∥_2 between one of D's neurons and a mode u_ℓ, and Fig. 3c the correlation ⟨v_j^(t), u_ℓ⟩/∥v_j^(t)∥_2 between G's neurons and u_ℓ. We discuss the interpretation of these plots in the next section.

3.3. WHY DOES SGDA SUFFER FROM MODE COLLAPSE AND NSGDA LEARN THE MODES?

We now explain why SGDA suffers from mode collapse, which corresponds to the case where η_D ≥ η_G. Our explanation relies on the interpretation of Figs. 3a, 3b, and 3c, and on the updates around initialization, which are defined as follows. There exists i ∈ [m_D] such that D's update is

E[w_i^(t+1) | w_i^(t)] ≈ w_i^(t) + η_D Σ_{ℓ=1}^{2} E[⟨w_i^(t), u_ℓ⟩²] u_ℓ.   (9)

Thus, the weights of D receive gradients directed along u_1 and u_2. On the other hand, the weights of G at early stages receive gradients directed along the w_j^(t):

v_i^(t+1) ≈ v_i^(t) + η_G Σ_j ⟨v_i^(t), w_j^(t)⟩² w_j^(t).   (10)

We observe that the learning process in Figs. 3a & 3b has three distinct phases. In the first phase (iterations 1-20), D learns one of the modes (u_1 or u_2) of p_data (Fig. 3b) while G barely updates its weights (Fig. 3a). In the second phase (iterations 20-40), D learns the weighted average u_1 + u_2 (Fig. 3b) while G starts moving its weights (Fig. 3a). In the final phase (iterations 40+), G learns u_1 + u_2 (Fig. 3c) from D. In more detail, the learning process is as follows:

Phase 1: At initialization, w_j^(0) and v_j^(0) are small. Assume w.l.o.g. that ⟨w_i^(0), u_2⟩ > ⟨w_i^(0), u_1⟩. Because of the factor ⟨w_i^(t), u_ℓ⟩² in front of u_2 in (9), the parameter w_i^(t) gradually grows its correlation with u_2 (Fig. 3b), and D's gradient norm thus increases (Fig. 3a). While ∥w_j^(t)∥ ≪ 1 for all j, we have v_i^(t) ≈ v_i^(0) (Fig. 3a).

Phase 2: D has learned u_2. Because of the sigmoid in the gradient of w_i^(t) (which was negligible during Phase 1) and ⟨u_1, u_2⟩ = γ > 0, w_i^(t) now mainly receives updates with direction u_2. Since G has not yet updated its weights, the min-max problem (GAN) is approximately just a minimization problem with respect to D's parameters. Since the optimum of such a problem is the weighted average u_1 + u_2, w_j^(t) slowly converges to this optimum. Meanwhile, the v_i^(t) start to receive some significant signal (Fig. 3a) but mainly learn the direction u_1 + u_2 (Fig. 3c), because w_j^(t) is aligning with this direction.

Phase 3: The parameters of G only receive gradients directed along u_1 + u_2. The norm of its relative updates stays large, and D only changes its last-layer terms (slope a and bias b).

In contrast to SGDA, nSGDA ensures that G and D always learn at the same speed, with the updates

w_i^(t+1) ≈ w_i^(t) + η_D ⟨w_i^(t), X⟩² X / ∥⟨w_i^(t), X⟩² X∥_2,   v_i^(t+1) ≈ v_i^(t) + η_G Σ_j ⟨w_j^(t), v_i^(t)⟩² w_j^(t) / ∥Σ_j ⟨w_j^(t), v_i^(t)⟩² w_j^(t)∥_2.   (11)

No matter how large ⟨w_i^(t), X⟩ is, G still learns at the same speed as D. There is a tight window (iteration 25, Fig. 3b) where only one neuron of D is aligned with u_1. This is when G can also learn to generate u_1 by "catching up" to D at that point, which avoids mode collapse.

4. NUMERICAL PERFORMANCE OF NSGDA

In Section 2, we presented the Ada-nSGDA algorithm (3), which corresponds to "grafting" the Adam magnitude onto the SGDA direction. In Section 3, we constructed a dataset and GAN model for which we prove that a GAN trained with nSGDA generates examples from the true training distribution, while a GAN trained with SGDA fails due to mode collapse. We now provide more experiments comparing nSGDA and Ada-nSGDA with Adam on real GANs and datasets. We train a ResNet WGAN with gradient penalty on CIFAR-10 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011) with Adam, Ada-nSGDA, SGDA, and nSGDA with a fixed learning rate as done in Section 3. We use the default architectures and training parameters specified in Gulrajani et al. (2017) (λ_GP = 10, n_dis = 5, learning rate decayed linearly to 0 over 100k steps). We also train a StyleGAN2 model (Karras et al., 2020) on FFHQ (Karras et al., 2019) and LSUN Churches (Yu et al., 2016) (both resized to 128 × 128 pixels) with Adam, Ada-nSGDA, SGDA, and nSGDA. We use the recommended StyleGAN2 hyperparameter configuration for this resolution (batch size = 32, γ = 0.1024, map depth = 2, channel multiplier = 16384). We use the Fréchet Inception distance (FID) (Heusel et al., 2017) to quantitatively assess the performance of the models. For each optimizer, we conduct a coarse log-space sweep over step-sizes and optimize for FID. We train the WGAN-GP models for 2880 thousand images (kimgs) on CIFAR-10 and STL-10 (45k steps with a batch size of 64), and the StyleGAN2 models for 2600 kimgs on FFHQ and LSUN Churches. We average our results over 5 seeds for the WGAN-GP ResNets, and over 3 seeds for the StyleGAN2 models due to the computational cost associated with training GANs. Figures 4a and 4b validate the conclusions on WGAN-GP from Section 2. We find that both Ada-nSGDA and nSGDA mostly recover the performance of Adam, with nSGDA obtaining a final FID ∼2-3 points lower than Ada-nSGDA.
As discussed in Section 2, such performance is possible because the adaptive magnitude stays within a constant range. In contrast, models trained with SGDA consistently perform significantly worse, with final FID scores 4× larger than Adam's. Tracking the layer-wise adaptive magnitudes, we find that although the magnitude for each layer fluctuates, the fluctuations are bounded to some fixed range for each layer. We show similar behaviour for the generator in the Appendix.


StyleGAN2 Figures 4c and 4d show the final FID scores when training a StyleGAN2. We find that Ada-nSGDA recovers most of the performance of Adam, but one difference from WGAN-GP is that nSGDA does not work if we use the same global learning rate for each layer. As discussed in Section 2, nSGDA with a different (but constant) step-size for each layer does work, and mostly recovers Ada-nSGDA's performance (Fig. 5a). To choose the scaling for each layer, we train StyleGAN2 with Ada-nSGDA on FFHQ-128, track the layer-wise adaptive magnitudes, and take the mean of these magnitudes over the training run (for each layer). Figure 5b shows that the fluctuations for each layer are bounded to a constant range, validating our assumption of constant step-sizes. Additionally, the scaling obtained from training on FFHQ seems to transfer across datasets; we used it to train StyleGAN2 with nSGDA on LSUN Churches-128 and recovered performance similar to that of Ada-nSGDA on this dataset (Fig. 4d).
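The layer-wise variant with constant per-layer scalings can be sketched as below. The dictionary layout and the idea of reusing, as `scales`, the mean tracked Adam magnitude ∥A∥_2 per layer follow the procedure described above, while the function and parameter names are ours:

```python
import numpy as np

def layerwise_nsgda_step(params, grads, eta, scales, eps=1e-8, maximize=False):
    # nSGDA with a fixed per-layer scaling: each layer's gradient is
    # normalized, then multiplied by a constant factor for that layer
    # (e.g., the mean adaptive magnitude tracked during an Ada-nSGDA run).
    sign = 1.0 if maximize else -1.0
    return {name: p + sign * eta * scales[name] * grads[name]
                  / (np.linalg.norm(grads[name]) + eps)
            for name, p in params.items()}
```

Replacing `scales[name]` with the live Adam magnitude ∥A_Y^(t)∥_2 recovers the layer-wise Ada-nSGDA update (3).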

5. CONCLUSION

Our work addresses the question of which mechanisms in adaptive methods are critical for training GANs, and why they outperform non-adaptive methods. We empirically show that Ada-nSGDA, an algorithm composed of the adaptive magnitude of Adam and the direction of SGDA, recovers most of the performance of Adam. We further decompose the adaptive magnitude into two components: normalization and adaptive step-size. We then show that the adaptive step-size is roughly constant (bounded fluctuations) for multiple architectures and datasets. This empirically indicates that the normalization component of the adaptive magnitude is the key mechanism of Ada-nSGDA, and motivates the study of nSGDA; we verify that it too recovers the performance of Ada-nSGDA. Having shown that nSGDA is a good proxy for a key mechanism of adaptive methods, we then construct a setting where we prove that nSGDA, thanks to its balanced updates, recovers the modes of the true distribution while SGDA fails to do so. The key insight from our theoretical analysis is that the ratio of the updates of D and G must be close to 1 during training in order to recover the modes of the distribution. This matches the experimental setting with nSGDA, as we find that the global norms of the parameter updates of D and G are almost equal for optimal choices of learning rates.



¹ Reasonable simply means that the learning rates are bounded to prevent the training from diverging.
² Li & Dou (2020) show that when using linear or quadratic activations, the generator can fool the discriminator by only matching the first and second moments of p_data.
³ In the nSGDA algorithm defined in (3), the step-sizes were time-dependent. Here, we assume for simplicity that the step-sizes η_D, η_G > 0 are constant.



(a) Each circle corresponds to a specific step-size configuration η_D/η_G. The best-performing models have step-size ratios between 10^(-1) and 1, and do not converge. As η_D/η_G increases, the models perform worse but get closer to an equilibrium. (b) shows that during training, the gradient ratio of a well-performing GAN stays approximately constant at 1. We also display the images produced by the model during training.

Figure 1: Gradient ratio against FID score (a) and number of epochs (b) obtained with DCGAN on CIFAR-10.

Figure 2: (a) shows the FID training curve for a WGAN-GP ResNet, averaged over 5 seeds. We see that Ada-nSGDA and nSGDA have very similar performance to Adam for a WGAN-GP. (b, c) display the fluctuations of the Ada-nSGDA adaptive magnitude. We plot the ratio ∥A_Y^(t)∥_2/∥A_Y^(0)∥_2 for each of the generator's (b) and discriminator's (c) layers. At early stages, this ratio barely increases, and it remains constant after 10 steps.

Figure 3: (a) shows the relative gradient updates for SGDA. D first updates its weights while G does not move until iteration 20; then G moves its weights. (b) shows the correlation of one neuron of D (with maximal correlation to u_2 at initialization) with the modes u_1, u_2 during the learning process of SGDA. (c, d) show the correlations of the neurons of G with the modes when trained with SGDA and nSGDA respectively. For SGDA (c), the model ultimately learns the weighted average u_1 + u_2. For nSGDA (d), one of the neurons (V4) is highly correlated with u_1 and another one (V3) is correlated with u_2.

Figure 4: (a, b) show the final FID scores (5 seeds) for a ResNet WGAN-GP model trained for 45k steps on CIFAR-10 and STL-10 respectively. (c, d) show the final FID scores (3 seeds) for a StyleGAN2 model trained for 2600 kimgs on FFHQ and LSUN Churches respectively. We use the same constant layer scaling in (d) for nSGDA as in (c), which was found by tracking the layer-wise adaptive step-sizes.

