TWO STEPS AT A TIME - TAKING GAN TRAINING IN STRIDE WITH TSENG'S METHOD

Abstract

Motivated by the training of Generative Adversarial Networks (GANs), we study methods for solving minimax problems with additional nonsmooth regularizers. We do so by employing monotone operator theory, in particular the Forward-Backward-Forward (FBF) method, which avoids the known issue of limit cycling by correcting each update by a second gradient evaluation. Furthermore, we propose a seemingly new scheme which recycles old gradients to mitigate the additional computational cost. In doing so we rediscover a known method, related to Optimistic Gradient Descent Ascent (OGDA). For both schemes we prove novel convergence rates for convex-concave minimax problems via a unifying approach. The derived error bounds are in terms of the gap function for the ergodic iterates. For the deterministic and the stochastic problem we show a convergence rate of O(1/k) and O(1/√k), respectively. We complement our theoretical results with empirical improvements in the training of Wasserstein GANs on the CIFAR10 dataset.

1. INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have proven to be a powerful class of generative models, producing for example unseen realistic images. Two neural networks, called generator and discriminator, compete against each other in a game. In the special case of a zero-sum game this task can be formulated as a minimax (aka saddle point) problem. Conventionally, GANs are trained using variants of (stochastic) Gradient Descent Ascent (GDA), which are known to exhibit oscillatory behavior and thus fail to converge even for simple bilinear saddle point problems, see Goodfellow (2016). We therefore propose the use of methods with provable convergence guarantees for (stochastic) convex-concave minimax problems, even though GANs are well known not to warrant these properties. Along similar considerations, an adaptation of the Extragradient method (EG) (Korpelevich, 1976) for the training of GANs was suggested in Gidel et al. (2019), whereas Daskalakis et al. (2018); Daskalakis & Panageas (2018); Liang & Stokes (2019) studied Optimistic Gradient Descent Ascent (OGDA) based on optimistic mirror descent (Rakhlin & Sridharan, 2013a;b). We however investigate the Forward-Backward-Forward (FBF) method (Tseng, 1991) from monotone operator theory, which, similar to EG, uses two gradient evaluations per update in order to circumvent the aforementioned issues. Instead of trying to improve GAN performance via new architectures, loss functions, etc., we contribute to the theoretical foundation of their training from the point of view of optimization.

Contribution. Establishing the connection between GAN training and monotone inclusions motivates the use of the FBF method, originally designed to solve this type of problem. This approach allows us to naturally extend the constrained setting to a regularized one by making use of the proximal operator. We also propose a variant of FBF that reuses previous gradients to reduce the computational cost per iteration, which turns out to be a known method related to OGDA. By developing a unifying scheme that captures FBF and a generalization of OGDA, we reveal a hitherto unknown connection.
Using this approach we prove novel non-asymptotic convergence statements in terms of the minimax gap for both methods in the context of saddle point problems. In the deterministic and stochastic setting we obtain rates of O(1/k) and O(1/√k), respectively. Concluding, we highlight the relevance of our proposed method as well as the role of regularizers by showing empirical improvements in the training of Wasserstein GANs on the CIFAR10 dataset.

Organization. This paper is structured as follows. In Section 2 we highlight the connection between GAN training and monotone inclusions and give an extensive review of methods with convergence guarantees for the latter. The main results, as well as a precise definition of the measure of optimality, are discussed in Section 3. Concluding, Section 4 illustrates the empirical performance in the training of GANs as well as in solving bilinear problems.
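The oscillatory behavior of GDA on bilinear problems mentioned above, and the benefit of a second gradient evaluation per update, can be illustrated on the toy problem min_x max_y xy. The following minimal sketch (illustrative, not the paper's implementation) compares simultaneous GDA with the Extragradient method; both step sizes and the iteration count are arbitrary choices:

```python
import numpy as np

def gda_step(x, y, eta):
    """Simultaneous Gradient Descent Ascent on Phi(x, y) = x * y."""
    return x - eta * y, y + eta * x

def eg_step(x, y, eta):
    """Extragradient: extrapolate to a midpoint, then update using its gradient."""
    x_mid, y_mid = x - eta * y, y + eta * x
    return x - eta * y_mid, y + eta * x_mid

eta = 0.1
x_g = y_g = x_e = y_e = 1.0
for _ in range(100):
    x_g, y_g = gda_step(x_g, y_g, eta)
    x_e, y_e = eg_step(x_e, y_e, eta)

# GDA spirals away from the saddle point (0, 0), while EG contracts toward it.
print(np.hypot(x_g, y_g))  # larger than the initial distance sqrt(2)
print(np.hypot(x_e, y_e))  # smaller than the initial distance
```

A short calculation confirms this: the GDA update matrix has eigenvalues 1 ± iη with modulus √(1 + η²) > 1, whereas the EG update contracts with modulus √(1 − η² + η⁴) < 1 for 0 < η < 1.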

2. GAN TRAINING AS MONOTONE INCLUSION

The GAN objective was originally cast as a two-player zero-sum game between the discriminator $D_y$ and the generator $G_x$ (Goodfellow et al., 2014), given by
$$\min_x \max_y \; \mathbb{E}_{\rho\sim q}[\log(D_y(\rho))] + \mathbb{E}_{\zeta\sim p}[\log(1 - D_y(G_x(\zeta)))],$$
exhibiting the aforementioned minimax structure. Due to problems with vanishing gradients in the training of such models, a successful alternative formulation called Wasserstein GAN (WGAN) (Arjovsky et al., 2017) has been proposed. In this case the minimization tries to reduce the Wasserstein distance between the true distribution q and the one learned by the generator. Reformulating this distance via the Kantorovich-Rubinstein duality leads to an inner maximization over 1-Lipschitz functions, which are approximated via neural networks, yielding the saddle point problem
$$\min_x \max_{y:\,\|D_y\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{\rho\sim q}[D_y(\rho)] - \mathbb{E}_{\zeta\sim p}[D_y(G_x(\zeta))].$$
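To make the WGAN objective concrete, a toy one-dimensional sketch (purely illustrative; the critic, distributions, and sample sizes are assumptions, not the paper's setup) estimates the Wasserstein-1 distance between two Gaussians via the dual formulation, using a linear critic whose slope constraint enforces 1-Lipschitzness:

```python
import numpy as np

rng = np.random.default_rng(1)

def critic(v, w):
    """A linear 'critic' D_y(v) = w * v; restricting |w| <= 1 keeps it 1-Lipschitz."""
    return w * v

real = rng.normal(loc=2.0, scale=1.0, size=10000)  # samples from the true distribution q
fake = rng.normal(loc=0.0, scale=1.0, size=10000)  # samples from the generator

# Inner maximization of E[D(real)] - E[D(fake)] over w in [-1, 1]:
# the objective is linear in w, so the maximizer is the sign of the mean difference.
w_star = np.sign(real.mean() - fake.mean())
w1_estimate = critic(real, w_star).mean() - critic(fake, w_star).mean()
print(w1_estimate)
```

For two Gaussians with equal variance, the true W1 distance is the difference of the means (here 2.0), and the linear critic attains it; in practice the critic is a neural network and the Lipschitz constraint is enforced approximately.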

2.1. CONVEX-CONCAVE MINIMAX PROBLEMS

Due to the observations made in the previous paragraph we study the following abstract minimax problem
$$\min_{x\in\mathbb{R}^d} \max_{y\in\mathbb{R}^n} \; \Psi(x, y) := f(x) + \mathbb{E}_{\xi\sim Q}[\Phi(x, y; \xi)] - h(y), \tag{1}$$
where the convex-concave coupling function $\Phi(x, y) := \mathbb{E}_{\xi\sim Q}[\Phi(x, y; \xi)]$, which hides the stochasticity for ease of notation, is differentiable with L-Lipschitz continuous gradient. The proper, convex and lower semicontinuous functions $f : \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ and $h : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ act as regularizers. A solution of (1) is given by a so-called saddle point $(x^*, y^*)$ fulfilling for all x and y
$$\Psi(x^*, y) \le \Psi(x^*, y^*) \le \Psi(x, y^*).$$
In the context of two-player games this corresponds to a pair of strategies where no player can be better off by changing just their own strategy. For the purpose of this motivating section, we will restrict ourselves for now to the special case of the deterministic constrained version of (1), given by
$$\min_{x\in X} \max_{y\in Y} \; \Phi(x, y),$$
where f and h are given by the indicator functions of the closed convex sets X and Y, respectively. The indicator function $\delta_C$ of a set C is defined as $\delta_C(z) = 0$ for $z \in C$ and $\delta_C(z) = +\infty$ otherwise.
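The role of the regularizers f and h can be made concrete via the proximal operator: for an indicator function it reduces to Euclidean projection, which is how the constrained setting generalizes to the regularized one. A minimal sketch for two standard cases (function names are illustrative):

```python
import numpy as np

def prox_box(z, lo, hi):
    """Prox of the indicator of the box [lo, hi]^d is Euclidean projection, i.e. clipping."""
    return np.clip(z, lo, hi)

def prox_l1(z, lam):
    """Prox of lam * ||.||_1 is coordinate-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([2.0, -0.3, 0.5])
print(prox_box(z, -1.0, 1.0))  # [ 1.  -0.3  0.5]
print(prox_l1(z, 0.4))         # [ 1.6 -0.   0.1]
```

Both operators are cheap and available in closed form, which is what makes prox-based schemes such as FBF practical.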

2.2. MINIMAX PROBLEMS AS MONOTONE INCLUSIONS

If the coupling function Φ is convex-concave and differentiable, then solving (1) is equivalent to solving the first order optimality conditions, which can be written as a so-called monotone inclusion with $w = (x, y) \in \mathbb{R}^m$ and $m = d + n$, given by
$$0 \in F(w) + N_\Omega(w). \tag{2}$$
The entities involved are the operator $F(x, y) := (\nabla_x \Phi(x, y), -\nabla_y \Phi(x, y))$ and the normal cone $N_\Omega$ of the set $\Omega := X \times Y$.
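For a bilinear coupling $\Phi(x, y) = x^\top A y$, the operator F and one FBF iteration can be sketched as follows (a minimal illustration, assuming a diagonal A, box constraints, and a hand-picked step size, none of which come from the paper):

```python
import numpy as np

A = np.diag([2.0, 1.0, 0.5])  # coupling matrix of Phi(x, y) = x^T A y

def F(w):
    """Operator F(x, y) = (grad_x Phi, -grad_y Phi) for the bilinear coupling."""
    x, y = w[:3], w[3:]
    return np.concatenate([A @ y, -A.T @ x])

def project(w):
    """Projection onto Omega = X x Y, here the box [-1, 1]^6 for illustration."""
    return np.clip(w, -1.0, 1.0)

def fbf_step(w, eta):
    """One Forward-Backward-Forward update (Tseng)."""
    z = project(w - eta * F(w))     # forward-backward step
    return z - eta * (F(z) - F(w))  # correcting forward step

eta = 0.3  # theory requires eta < 1/L, with L = ||A||_2 = 2 here
w = np.ones(6)
for _ in range(500):
    w = fbf_step(w, eta)

# The iterates approach the unique saddle point of this bilinear problem, (0, 0).
print(np.linalg.norm(w))
```

Note that the second (correcting) forward step is exactly what distinguishes FBF from plain projected GDA and prevents the limit cycling discussed above; unlike EG, it reuses F(w) rather than evaluating the operator at a third point.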




