A PROVABLY CONVERGENT AND PRACTICAL ALGORITHM FOR MIN-MAX OPTIMIZATION WITH APPLICATIONS TO

Abstract

We present a first-order algorithm for nonconvex-nonconcave min-max optimization problems such as those that arise in training GANs. Our algorithm provably converges in poly(d, L, b) steps for any loss function f : R^d × R^d → R which is b-bounded with L-Lipschitz gradient. To achieve convergence, we 1) give a novel approximation to the global strategy of the max-player based on first-order algorithms such as gradient ascent, and 2) empower the min-player to look ahead and simulate the max-player's response for arbitrarily many steps, but restrict the min-player to move according to updates sampled from a stochastic gradient oracle. Our algorithm, when used to train GANs on synthetic and real-world datasets, does not cycle, results in GANs that seem to avoid mode collapse, and achieves a training time per iteration and memory requirement similar to gradient descent-ascent.

1. INTRODUCTION

We consider the problem of min-max optimization min_{x∈R^d} max_{y∈R^d} f(x, y), where the loss function f may be nonconvex in x and nonconcave in y. Min-max optimization of such loss functions has many applications to machine learning, including GANs (Goodfellow et al., 2014) and adversarial training (Madry et al., 2018). In particular, following Goodfellow et al. (2014), GAN training can be formulated as a min-max optimization problem where x encodes the parameters of a "generator" network and y encodes the parameters of a "discriminator" network. Unlike standard minimization problems, the min-max nature of GANs makes them particularly difficult to train (Goodfellow, 2017), a difficulty that has received wide attention.

A common algorithm for solving these min-max optimization problems, gradient descent ascent (GDA), alternates between stochastic gradient descent steps for x and ascent steps for y.^1 The advantage of GDA is that it requires only first-order access to f, and each iteration is efficient in terms of memory and time, making it quite practical. However, as many works have observed, GDA can suffer from issues such as cycling (Arjovsky & Bottou, 2017) and "mode collapse" (Dumoulin et al., 2017; Che et al., 2017; Santurkar et al., 2018). Several recent works have focused on finding convergent first-order algorithms for min-max optimization (Rafique et al., 2018; Daskalakis et al., 2018; Liang & Stokes, 2019; Gidel et al., 2019b; Mertikopoulos et al., 2019; Nouiehed et al., 2019; Lu et al., 2020; Lin et al., 2020; Mokhtari et al., 2019; Thekumparampil et al., 2019; Mokhtari et al., 2020). However, these algorithms are also not guaranteed to converge for general nonconvex-nonconcave min-max problems. The challenge is that min-max optimization generalizes nonconvex minimization, which, in general, is intractable. Algorithms for nonconvex minimization resort to finding "local" optima or assume a starting point "close" to a global optimum.
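To make the cycling issue concrete, the following minimal sketch (our illustration, not taken from this paper) runs simultaneous GDA on the classic bilinear toy problem f(x, y) = xy, whose unique equilibrium is (0, 0); the iterates spiral away from the equilibrium rather than converge.

```python
import numpy as np

def gda(grad_x, grad_y, x0, y0, eta=0.1, steps=100):
    """Simultaneous gradient descent-ascent: descend in x, ascend in y."""
    x, y = float(x0), float(y0)
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)  # gradients at the current point
        x, y = x - eta * gx, y + eta * gy    # min-player descends, max-player ascends
    return x, y

# Bilinear toy problem f(x, y) = x * y, so grad_x = y and grad_y = x.
x, y = gda(lambda x, y: y, lambda x, y: x, 1.0, 1.0)
# Each step multiplies the squared distance to (0, 0) by (1 + eta^2),
# so the iterates spiral outward instead of converging.
```

Here `gda`, `eta`, and `steps` are our own illustrative names and constants; the outward spiral on this bilinear problem is the standard demonstration of why plain GDA can fail to converge.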
However, unlike minimization problems, where local notions of optima exist (Nesterov & Polyak, 2006), it has been challenging to define a notion of convergent points for min-max optimization, and most notions of local optima considered in previous works (Daskalakis & Panageas, 2018; Jin et al., 2020; Fiez et al., 2019) require significant restrictions for existence.

Our contributions. Our main result is a new first-order algorithm for min-max optimization (Algorithm 1) that, for any ε > 0, any nonconvex-nonconcave loss function, and any starting point, converges in poly(d, L, b, 1/ε) steps, provided f is b-bounded with L-Lipschitz gradient (Theorem 2.3).
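The two ingredients described above can be sketched schematically as follows. This is an illustrative simplification, not Algorithm 1 itself: the max-player's global strategy is approximated by k steps of gradient ascent, and the min-player accepts a proposed noisy-gradient step only if it decreases the simulated look-ahead value. All function names, the toy loss, and the constants are our own choices for illustration (the toy loss is even convex-concave, purely for readability).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_max(grad_y, x, y, eta=0.1, k=25):
    """Approximate the max-player's response by k gradient-ascent steps."""
    for _ in range(k):
        y = y + eta * grad_y(x, y)
    return y

def lookahead_min_step(f, grad_x, grad_y, x, y, eta=0.1, trials=5):
    """One min-player update: propose steps from a noisy gradient oracle and
    accept a proposal only if it improves the simulated look-ahead value."""
    base = f(x, simulate_max(grad_y, x, y))
    for _ in range(trials):
        x_new = x - eta * (grad_x(x, y) + 0.01 * rng.standard_normal())
        if f(x_new, simulate_max(grad_y, x_new, y)) < base:
            return x_new
    return x  # no improving proposal found; keep the current point

# Toy loss f(x, y) = x^2 + x*y - y^2: max over y gives y*(x) = x/2,
# so the min-max value is attained at x = 0.
f = lambda x, y: x**2 + x * y - y**2
gx = lambda x, y: 2 * x + y
gy = lambda x, y: x - 2 * y

x, y = 1.0, 0.0
for _ in range(30):
    y = simulate_max(gy, x, y)               # max-player responds
    x = lookahead_min_step(f, gx, gy, x, y)  # min-player moves with look-ahead
```

Because accepted proposals must lower the simulated look-ahead value f(x, y*(x)), the min-player's iterate cannot cycle on this toy problem and drifts toward x = 0.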



^1 In practice, gradient steps are often replaced by Adam steps; we ignore this distinction in this discussion.

