TAMING GANS WITH LOOKAHEAD-MINMAX

Abstract

Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and to the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, extending a method originally developed for single-objective minimization only. The backtracking step of our Lookahead-minmax naturally handles the rotational game dynamics, a property which was identified as key to enabling gradient descent ascent methods to converge on challenging examples often analyzed in the literature. Moreover, it implicitly handles high variance without using large mini-batches, which are known to be essential for reaching state-of-the-art performance. Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient, in terms of performance and improved stability, for negligible memory and computational cost. Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10, obtaining an FID of 12.19 without using the class labels, bringing state-of-the-art GAN training within reach of common computational resources.

1. INTRODUCTION

Gradient-based methods are the workhorse of machine learning. These methods optimize the parameters of a model with respect to a single objective f : X → R. However, interest in multi-objective optimization is increasing in various domains, such as mathematics, economics, and multi-agent reinforcement learning (Omidshafiei et al., 2017), where several agents each aim at optimizing their own cost function f_i : X_1 × ⋯ × X_N → R simultaneously. A particularly successful class of models of this kind are Generative Adversarial Networks (GANs; Goodfellow et al., 2014), which consist of two players referred to as a generator and a discriminator. GANs were originally formulated as minmax optimization f : X × Y → R (Von Neumann & Morgenstern, 1944), where the generator and the discriminator aim at minimizing and maximizing the same value function, see § 2. A natural generalization of gradient descent to minmax problems is the gradient descent ascent algorithm (GDA), which alternates between a gradient descent step for the min-player and a gradient ascent step for the max-player. This minmax training aims at finding a Nash equilibrium, where no player has an incentive to change its parameters.
Despite the impressive quality of the samples generated by GANs, relative to classical maximum-likelihood-based generative models, these models remain notoriously difficult to train. In particular, poor performance (sometimes manifesting as "mode collapse"), brittle dependency on hyperparameters, or divergence are often reported. Consequently, obtaining state-of-the-art performance was shown to require large computational resources (Brock et al., 2019), making well-performing models unavailable for common computational budgets.
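As a minimal illustration of GDA (a toy sketch, not code from the paper), consider the bilinear game f(x, y) = x·y with equilibrium at (0, 0); the step size is an arbitrary choice:

```python
# Sketch of simultaneous gradient descent ascent (GDA) on a two-player
# zero-sum game f(x, y): descent for the min-player x, ascent for the
# max-player y. Toy illustration only.
def gda_step(x, y, grad_x, grad_y, lr=0.1):
    """One simultaneous GDA step with step size lr."""
    return x - lr * grad_x(x, y), y + lr * grad_y(x, y)

# Bilinear game f(x, y) = x * y: grad wrt x is y, grad wrt y is x.
x, y = 1.0, 1.0
for _ in range(100):
    x, y = gda_step(x, y, grad_x=lambda x, y: y, grad_y=lambda x, y: x)

# On this game simultaneous GDA diverges: each step multiplies the
# squared distance to the equilibrium (0, 0) by exactly (1 + lr**2).
```

This divergence on the bilinear game is precisely the failure mode that motivates the rotation-aware methods discussed below.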
It was empirically shown that: (i) GANs often converge to a locally stable stationary point that is not a differential Nash equilibrium (Berard et al., 2020); and (ii) increased batch size improves GAN performance (Brock et al., 2019), in contrast to minimization (Defazio & Bottou, 2019; Shallue et al., 2018). A principal reason is attributed to the rotations arising from the adversarial component of the game vector field associated with the gradients of the two players' parameters (Mescheder et al., 2018; Balduzzi et al., 2018), which are atypical for minimization. More precisely, the Jacobian of the associated vector field (see def. in § 2) can be decomposed into a symmetric and an antisymmetric component (Balduzzi et al., 2018), which behave as a potential game (Monderer & Shapley, 1996) and a Hamiltonian game, resp. Games are often a combination of the two, making this general case harder to solve.
In the context of single-objective minimization, Zhang et al. (2019) recently proposed the Lookahead algorithm, which intuitively chooses an update direction by "looking ahead" at a sequence of parameters generated by an inner optimizer, called fast weights, which change with higher variance due to the stochasticity of the gradient estimates. Lookahead was shown to improve stability during training and to reduce the variance of the so-called slow weights.
Contributions. Our contributions can be summarized as follows:
• We propose Lookahead-minmax for optimizing minmax problems, which applies extrapolation in the joint parameter space (see Alg. 1), so as to account for the rotational component of the associated game vector field (defined in § 2).
• In the context of (i) single-objective minimization: building on insights of Wang et al. (2020), who argue that Lookahead can be interpreted as an instance of local SGD, we derive improved convergence guarantees for the Lookahead algorithm (§ 3); and (ii) two-player games: we elaborate why Lookahead-minmax suppresses the rotational part in a simple bilinear game, and prove its convergence for a given converging base optimizer (§ 4).
• We motivate the use of Lookahead-minmax for games on the extensively studied toy bilinear example (Goodfellow, 2016) and show that: (i) Lookahead allows the otherwise diverging GDA to converge on the classical bilinear game in the full-batch setting (see § 4.2.1); and (ii) it yields good performance on challenging stochastic variants of this game, despite the high variance (see § 4.2.2).
• We empirically benchmark Lookahead-minmax on GANs on four standard datasets (MNIST, CIFAR-10, SVHN, and ImageNet) and two models (DCGAN and ResNet), combined with standard optimization methods for GANs, GDA and extragradient, called LA-AltGAN and LA-ExtraGradient, resp. We consistently observe improvements in both stability and performance at a negligible additional cost that does not require additional forward and backward passes, see § 5.
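For reference, the slow/fast-weight structure of the original Lookahead wrapper of Zhang et al. (2019) can be sketched as follows; the inner optimizer (plain gradient descent), the step sizes, and the quadratic test function are illustrative choices, not the experimental setup of either paper:

```python
# Sketch of the Lookahead wrapper for single-objective minimization.
# The fast weights are produced by k inner-optimizer steps; the slow
# weights then move a fraction alpha toward the final fast weights.
def lookahead_gd(w, grad, lr=0.1, k=5, alpha=0.5, outer_steps=100):
    for _ in range(outer_steps):
        fast = w                      # fast weights start at the slow weights
        for _ in range(k):            # k inner (fast) steps
            fast = fast - lr * grad(fast)
        w = w + alpha * (fast - w)    # slow-weight update toward fast weights
    return w

# Example: minimizing f(w) = w**2, whose gradient is 2*w.
w = lookahead_gd(3.0, grad=lambda w: 2 * w)
```

The averaging step is what dampens the variance of the fast weights; Lookahead-minmax (Alg. 1) performs the analogous averaging jointly over both players' parameters.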



Figure 1: Illustration of Lookahead-minmax (Alg. 1) with GDA on min_x max_y x·y, with α = 0.5. The solution, the trajectory {(x_t, y_t)}_{t=1}^{T}, and the lines between (x_P, y_P) and (x_t, y_t) are shown with a red star, a blue line, and dashed green lines, resp. The backtracking step of Alg. 1 (lines 10 & 11) allows the otherwise non-converging GDA to converge, see § 4.2.1.
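The stabilizing effect shown in the figure can be reproduced with a toy sketch (an illustrative reading of Alg. 1 with plain simultaneous GDA as the inner optimizer; the step size and number of fast steps are arbitrary choices, not the paper's settings):

```python
# Toy sketch of Lookahead-minmax on the bilinear game min_x max_y x*y.
# k fast GDA steps are followed by a backtracking step that moves the
# slow weights a fraction alpha toward the fast weights, jointly in
# the (x, y) space.
def lookahead_minmax(x, y, lr=0.1, k=5, alpha=0.5, outer_steps=500):
    for _ in range(outer_steps):
        xf, yf = x, y                              # fast weights
        for _ in range(k):                         # k inner GDA steps
            xf, yf = xf - lr * yf, yf + lr * xf
        # Backtracking step: joint averaging of slow and fast weights.
        x, y = x + alpha * (xf - x), y + alpha * (yf - y)
    return x, y

x, y = lookahead_minmax(1.0, 1.0)
# The iterates spiral in toward the equilibrium (0, 0), whereas plain
# simultaneous GDA with the same step size diverges on this game.
```

Intuitively, the k fast steps rotate around the equilibrium while drifting outward, and averaging the start and end points of that arc cuts across the rotation, yielding a net contraction.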

Availability: https://github.com/Chavdarova/

