A PROVABLY CONVERGENT AND PRACTICAL ALGORITHM FOR MIN-MAX OPTIMIZATION WITH APPLICATIONS TO GANS

Abstract

We present a first-order algorithm for nonconvex-nonconcave min-max optimization problems such as those that arise in training GANs. Our algorithm provably converges in poly(d, L, b) steps for any loss function f : R^d × R^d → R which is b-bounded with L-Lipschitz gradient. To achieve convergence, we 1) give a novel approximation to the global strategy of the max-player based on first-order algorithms such as gradient ascent, and 2) empower the min-player to look ahead and simulate the max-player's response for arbitrarily many steps, but restrict the min-player to move according to updates sampled from a stochastic gradient oracle. Our algorithm, when used to train GANs on synthetic and real-world datasets, does not cycle, results in GANs that seem to avoid mode collapse, and achieves a training time per iteration and memory requirement similar to gradient descent-ascent.

1. INTRODUCTION

We consider the problem of min-max optimization min_{x∈R^d} max_{y∈R^d} f(x, y), where the loss function f may be nonconvex in x and nonconcave in y. Min-max optimization of such loss functions has many applications to machine learning, including to GANs (Goodfellow et al., 2014) and adversarial training (Madry et al., 2018). In particular, following Goodfellow et al. (2014), GAN training can be formulated as a min-max optimization problem where x encodes the parameters of a "generator" network and y encodes the parameters of a "discriminator" network. Unlike standard minimization problems, the min-max nature of GANs makes them particularly difficult to train (Goodfellow, 2017), and this difficulty has received wide attention.

A common algorithm for solving these min-max optimization problems, gradient descent-ascent (GDA), alternates between stochastic gradient descent steps for x and ascent steps for y. The advantage of GDA is that it requires only first-order access to f, and each iteration is efficient in terms of memory and time, making it quite practical. However, as many works have observed, GDA can suffer from issues such as cycling (Arjovsky & Bottou, 2017) and "mode collapse" (Dumoulin et al., 2017; Che et al., 2017; Santurkar et al., 2018). Several recent works have focused on finding convergent first-order algorithms for min-max optimization (Rafique et al., 2018; Daskalakis et al., 2018; Liang & Stokes, 2019; Gidel et al., 2019b; Mertikopoulos et al., 2019; Nouiehed et al., 2019; Lu et al., 2020; Lin et al., 2020; Mokhtari et al., 2019; Thekumparampil et al., 2019; Mokhtari et al., 2020). However, these algorithms are also not guaranteed to converge for general nonconvex-nonconcave min-max problems. The challenge is that min-max optimization generalizes nonconvex minimization, which, in general, is intractable; algorithms for nonconvex minimization resort to finding "local" optima or assume a starting point "close" to a global optimum.
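Before turning to notions of optimality, the GDA baseline discussed above can be sketched in a few lines of Python. This is a generic illustration with our own toy step size and function names, not code from the paper:

```python
def gradient_descent_ascent(grad_x, grad_y, x, y, eta=0.01, steps=1000):
    """Alternating (stochastic) gradient descent-ascent: the min-player
    takes a descent step on f in x, then the max-player takes an ascent
    step on f in y. grad_x and grad_y estimate the partial gradients."""
    for _ in range(steps):
        x = x - eta * grad_x(x, y)   # descent step for the min-player
        y = y + eta * grad_y(x, y)   # ascent step for the max-player
    return x, y

# Toy bilinear example f(x, y) = x*y, a classic case where GDA cycles
# around the equilibrium (0, 0) instead of converging to it.
x, y = gradient_descent_ascent(lambda x, y: y, lambda x, y: x, 1.0, 1.0)
```

On this bilinear loss the iterates rotate around the origin rather than converge, illustrating the cycling behavior that motivates the paper's algorithm.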
However, unlike minimization problems, where local notions of optima exist (Nesterov & Polyak, 2006), it has been challenging to define a notion of convergent points for min-max optimization, and most notions of local optima considered in previous works (Daskalakis & Panageas, 2018; Jin et al., 2020; Fiez et al., 2019) require significant restrictions for existence.

Our contributions. Our main result is a new first-order algorithm for min-max optimization (Algorithm 1) that, for any ε > 0, any nonconvex-nonconcave loss function, and any starting point, converges in poly(d, L, b, 1/ε) steps, provided f is b-bounded with L-Lipschitz gradient (Theorem 2.3). A key ingredient in our result is an approximation to the global max function max_{z∈R^d} f(x, z). Unlike GDA and related algorithms, which alternate between updating the discriminator and generator in an incremental fashion, our algorithm lets the discriminator run a convergent algorithm (such as gradient ascent) until it reaches a first-order stationary point. We then empower the generator to simulate the discriminator's response for arbitrarily many gradient ascent updates. Roughly, at each iteration of our algorithm, the min-player proposes a stochastic (batch) gradient update for x and simulates the response of the max-player with gradient ascent steps for y until it reaches a first-order stationary point. If the resulting loss has decreased, the updates for x and y are accepted; otherwise they are only accepted with a small probability (à la simulated annealing). The point (x*, y*) returned by our algorithm satisfies the following guarantee: if the min-player proposes a stochastic gradient descent update to x*, and the max-player is allowed to respond by updating y* using any "path" that increases the loss at a rate of at least ε, then with high probability the final loss cannot decrease by more than ε.
See Section 2 for our convergence guarantees, Section 4 for the key ideas in our proof, and Appendix C for a comparison to previous notions of convergence.

Empirically, we apply our algorithm to training GANs (with the cross-entropy loss) on both synthetic (mixture of Gaussians) and real-world (MNIST and CIFAR-10) datasets (Section 3). We compare our algorithm's performance against two related algorithms: gradient/ADAM descent-ascent (with one or multiple discriminator steps) and Unrolled GANs (Metz et al., 2017). Our simulations with MNIST (Figure 1) and a mixture of Gaussians (Figure 2) indicate that training GANs using our algorithm can avoid mode collapse and cycling. For instance, on the Gaussian mixture dataset, we found that by around the 1500th iteration GDA had learned only one mode in 100% of the runs, and cycled between multiple modes. In contrast, our algorithm learned all four modes in 68% of the runs, and three modes in 26% of the runs. On 0-1 MNIST, we found that GDA tends to briefly generate shapes that look like a combination of 0's and 1's, then switches between generating only 1's and only 0's. In contrast, our algorithm seems to learn to generate both 0's and 1's early on and does not stop generating either digit. GANs trained using our algorithm generated both digits by the 1000th iteration in 86% of the runs, while those trained using GDA did so in only 23% of the runs. Our CIFAR-10 simulations (Figure 3) indicate that our algorithm trains more stably, resulting in a lower mean and standard deviation for FID scores compared to GDA. Furthermore, the per-step computational and memory cost of our algorithm is similar to that of GDA, indicating that our algorithm can scale to larger datasets.

Related work

Guaranteed convergence for min-max optimization. Several works have studied GDA dynamics in GANs (Nagarajan & Kolter, 2017; Mescheder et al., 2017; Li et al., 2018; Balduzzi et al., 2018; Daskalakis & Panageas, 2018; Jin et al., 2020) and established that GDA suffers from severe limitations: GDA can exhibit rotation around some points, or otherwise fail to converge. Thus, we cannot expect global convergence guarantees for GDA. To address these convergence issues, multiple works have proposed algorithms based on Optimistic Mirror Descent (OMD), the extra-gradient method, or similar approaches (Gidel et al., 2019b; Daskalakis et al., 2018; Liang & Stokes, 2019; Daskalakis & Panageas, 2019; Mokhtari et al., 2019; 2020). These algorithms avoid some of the pathological behaviors of GDA and achieve guaranteed convergence in poly(κ, log(1/ε)) iterations, where κ is the condition number of f. However, all these results either require convexity/concavity assumptions on f, which usually do not hold for GANs, or require that the starting point lie in a small region around an equilibrium point, and hence provide no guarantees for an arbitrary initialization. Some works also provide convergence guarantees for min-max optimization (Nemirovski & Yudin, 1978; Kinderlehrer & Stampacchia, 1980; Nemirovski, 2004; Rafique et al., 2018; Lu et al., 2020; Lin et al., 2020; Nouiehed et al., 2019; Thekumparampil et al., 2019), but they require f to be concave in y, again limiting their applicability. As for nonconvex-nonconcave min-max optimization, Heusel et al. (2017) prove convergence of finite-step GDA under the assumption that the underlying continuous dynamics converge to a local min-max optimum (an assumption that may not hold even for bilinear f). Jin et al. (2020) present a version of GDA for min-max optimization (generalized by Fiez et al. (2019)) such that, if the algorithm converges, the convergence point is a local min-max optimum.
Both these results require that the min-player use a vanishingly small step size relative to the max-player, resulting in slow convergence. Wang et al. (2020) present an algorithm that can converge for nonconvex-nonconcave functions, but it requires the initial point to lie in a region close to a local min-max optimum (such optima are not guaranteed to exist). In contrast to the above works, our algorithm is guaranteed to converge for any nonconvex-nonconcave loss, from any starting point, in poly(d, L, b, 1/ε) steps, provided f is b-bounded with L-Lipschitz gradient.

Greedy paths. The paths along which the max-player is allowed to make updates in our equilibrium definition are inspired by the work of Mangoubi & Vishnoi (2020), which gives a second-order algorithm for min-max optimization. The "greedy paths" considered in their work are defined such that at every point along these paths f is non-decreasing, and the first derivative of f is at least ε or the second derivative is at least √ε. In contrast, we only require a condition on the first derivative of f along the path. This distinction gives rise to a different notion of equilibrium than the one presented in their work. Crucially, the first-order condition on the paths also makes our algorithm applicable in machine learning settings where only first-order oracles are available, because unlike Mangoubi & Vishnoi (2020), traversing such a path requires only first-order access to f.

Training GANs. Starting with Goodfellow et al. (2014), there has been considerable work to develop algorithms to train GANs. One line of work focuses on modifying the loss to improve convergence (Arjovsky et al., 2017; Bellemare et al., 2017; Lim & Ye, 2017; Mao et al., 2017; Salimans et al., 2018; Metz et al., 2017). Another line of work regularizes the discriminator using gradient penalties or spectral normalization (Gulrajani et al., 2017; Kodali et al., 2017; Miyato et al., 2018). Metz et al. (2017) introduced Unrolled GANs, in which the generator optimizes an "unrolled" loss function that allows it to simulate a fixed number of discriminator updates. While this has some similarity to our algorithm, there are two important distinctions: 1) the discriminator in Unrolled GANs may not reach a first-order stationary point, and hence their algorithm does not come with any convergence guarantees; and 2) unlike our algorithm, the implementation of the generator in Unrolled GANs requires memory that grows with the number of discriminator steps, limiting its scalability. We observe that our algorithm, applied to training GANs, trains stably and avoids mode collapse, while achieving a training time per iteration and memory requirements that are similar to GDA and much lower than Unrolled GANs (Metz et al., 2017) (see also the discussion in Section 5).

Remark 1.1 (Loss functions in GANs). Loss functions f : R^d × R^d → R which take bounded values on R^d × R^d arise in many GAN applications. For instance, GANs with mean-squared-error loss (Mao et al., 2017) have uniformly bounded f. GANs with cross-entropy loss (Goodfellow et al., 2014) have f uniformly bounded above, and Wasserstein GANs (Arjovsky et al., 2017) have a loss function f(x, y) which is bounded above as a function of y and is uniformly bounded below.

2. THEORETICAL RESULTS

We consider the problem min_x max_y f(x, y), where x, y ∈ R^d and f is a function R^d × R^d → R. We consider f that is an empirical risk loss over m training examples. Thus, we have f = (1/m) Σ_{i∈[m]} f_i, and are given access to f via a randomized oracle F such that E[F] = f. We call such an oracle a stochastic zeroth-order oracle for f. We are also given randomized oracles G_x, G_y for ∇_x f, ∇_y f, such that E[G_x] = ∇_x f and E[G_y] = ∇_y f. We call such oracles stochastic gradient oracles for f. In practice, these oracles are computed by randomly sampling a "batch" B ⊆ [m] and returning F = (1/|B|) Σ_{i∈B} f_i, G_x = (1/|B|) Σ_{i∈B} ∇_x f_i, and G_y = (1/|B|) Σ_{i∈B} ∇_y f_i. For our convergence guarantees, we require bounds on standard smoothness parameters for the functions f_i: a bound b such that |f_i(x, y)| ≤ b for all i and all x, y, and a bound L such that ‖∇f_i(x, y) − ∇f_i(x', y')‖_2 ≤ L‖x − x'‖_2 + L‖y − y'‖_2. Such smoothness/Lipschitz bounds are standard in convergence guarantees for optimization algorithms (Bubeck, 2017; Nesterov & Polyak, 2006; Ge et al., 2015), and imply that f is also continuous, b-bounded, and L-gradient-Lipschitz. Our algorithm is described informally in Algorithm 1 and formally as Algorithm 2 in the Appendix.

Intuition for the algorithm. To solve the min-max problem, the max-player would ideally find the global maximum max_z f(x, z). However, since f may be nonconcave in y, finding the global maximum may be computationally intractable. To get around this problem, roughly speaking, in our algorithm the max-player computes its update by running gradient ascent until it reaches a first-order ε-stationary point y', that is, a point where ‖∇_y f(x, y')‖ ≤ ε. This allows our algorithm to compute an approximation L_ε(x, y) = f(x, y') to the global maximum. (Note that even though max_z f(x, z) is a function of x alone, L_ε may depend on both x and the initial point y.)
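The max-player's approximation can be sketched as follows. This is a minimal illustration assuming exact (non-stochastic) gradients and plain gradient ascent in place of ADAM; the function names and constants are our own:

```python
import numpy as np

def simulated_loss(f, grad_y, x, y, eps=1e-3, eta=0.1, max_steps=100_000):
    """Approximate the simulated loss L_eps(x, y): run gradient ascent on y
    until a first-order eps-stationary point y' is reached (l1-norm of the
    gradient at most eps), then return f(x, y') and y'."""
    for _ in range(max_steps):
        g = np.atleast_1d(grad_y(x, y))
        if np.linalg.norm(g, ord=1) <= eps:  # first-order stationarity
            break
        y = y + eta * g                      # ascent step on f(x, .)
    return f(x, y), y

# Example: f(x, y) = -(y - x)^2 is concave in y, with maximizer y' = x.
f = lambda x, y: -(y - x) ** 2
grad_y = lambda x, y: -2.0 * (y - x)
val, y_star = simulated_loss(f, grad_y, x=np.array([2.0]), y=np.array([0.0]))
```

On this concave example the subroutine recovers the global maximum; in general it only reaches a stationary point, which is exactly what L_ε captures.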
Algorithm 1 Algorithm for min-max optimization
input: A stochastic zeroth-order oracle F for the loss function f : R^d × R^d → R, and stochastic gradient oracles G_x for ∇_x f and G_y for ∇_y f; an initial point (x, y); an error parameter ε
output: A point (x*, y*)
hyperparameters: r_max (maximum number of rejections); τ_1 (hyperparameter for annealing); ADAM hyperparameters
Set r ← 0, i ← 0
while r ≤ r_max do
  f_old ← F(x, y), i ← i + 1
  Set g_x ← G_x(x, y) … (the remaining steps are given in full as Algorithm 2 in the Appendix)

We would like the min-player to minimize L_ε(x, y). Ideally, the min-player would make updates in the direction −∇_x L_ε. However, L_ε(x, y) may not be differentiable and may even be discontinuous in x (see Section 2.2 for an example), making it challenging to optimize. Moreover, even at points where L_ε is differentiable, computing ∇_x L_ε may require memory proportional to the number of max-player steps used to compute L_ε (for instance, this is the case for Unrolled GANs (Metz et al., 2017)). For this reason, we only provide our min-player with access to the value of L_ε.

One approach to minimizing L_ε would be a zeroth-order optimization procedure in which the min-player proposes a random update to x, and the update is accepted only if it results in a decrease in L_ε. At each iteration of our algorithm, the min-player proposes an update roughly in the direction −∇_x f(x, y). To motivate this choice, note that once the min-player proposes an update ∆ to x, the max-player's updates will only increase f, i.e., L_ε(x + ∆, y) ≥ f(x + ∆, y). Moreover, since y is a first-order stationary point of f(x, ·) (because y was computed using gradient ascent in the previous iteration), we also have L_ε(x, y) = f(x, y). Therefore, we want an update ∆ such that f(x + ∆, y) ≤ L_ε(x + ∆, y) ≤ L_ε(x, y) = f(x, y), which implies that any proposed step which decreases L_ε must also decrease f (although the converse is not true). This motivates proposing steps in the direction of the gradient −∇_x f(x, y).
Unfortunately, updates in the direction −∇_x f(x, y) do not necessarily decrease L_ε. Our algorithm instead has the min-player perform a random search by proposing a stochastic update in the direction of a batch gradient with mean −∇_x f(x, y) (or, more precisely, the ADAM batch gradients), and accepts this update only if L_ε decreases by some fixed amount. We show empirically that these directions allow the algorithm to rapidly decrease the simulated loss. The fact that L_ε decreases whenever the min-player takes a step allows us to guarantee that our algorithm eventually converges. A final issue is that converging to a local minimum point does not guarantee that the point is desirable from an applications standpoint. To allow our algorithm to escape undesirable local minima of L_ε(·, y), we use a randomized accept-reject rule inspired by simulated annealing: if the resulting loss has decreased, the updates for x and y are accepted; otherwise they are only accepted with a small probability e^{−i/τ_1}, where τ_1 is a "temperature" parameter.
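Putting the pieces together, one iteration of the outer loop might look like the following sketch. The proposal here uses a plain scaled stochastic gradient in place of the paper's ADAM update, and `simulated_loss` stands for the max-player's gradient ascent subroutine described above; all names and constants are illustrative:

```python
import numpy as np

def minmax_step(f, grad_x, simulated_loss, x, y, i, eta=0.05, tau=5.0,
                rng=np.random.default_rng(0)):
    """One proposal/accept-reject iteration of the outer loop.

    f(x, y): (stochastic) loss estimate; grad_x(x, y): stochastic gradient
    in x; simulated_loss(x, y): runs the max-player's gradient ascent from y
    and returns (loss at the resulting stationary point, that point)."""
    f_old = f(x, y)
    delta = -eta * grad_x(x, y)                  # proposed min-player update
    f_new, y_new = simulated_loss(x + delta, y)  # simulated max-player response
    # Accept if the simulated loss decreased; otherwise accept with
    # probability e^{-i/tau}, which decays as the iteration count i grows.
    if f_new <= f_old or rng.random() < np.exp(-i / tau):
        return x + delta, y_new, True
    return x, y, False
```

The shared `rng` default keeps the sketch short; a real implementation would thread the generator through explicitly.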

2.1. CONVERGENCE GUARANTEES

We first formally define the "simulated loss" and what it means for f to increase rapidly.

Definition 2.1. For any x, y, and ε > 0, define E(ε, x, y) ⊆ R^d to be the set of points w such that there is a continuous and (except at finitely many points) differentiable path γ(t) starting at y, ending at w, and moving with "speed" at most 1 in the ℓ_∞-norm, i.e., ‖(d/dt)γ(t)‖_∞ ≤ 1, such that at every point on γ, (d/dt) f(x, γ(t)) > ε. We define L_ε(x, y) := sup_{w∈E(ε,x,y)} f(x, w), and refer to it as the simulated loss.

A few remarks are in order. Observe that L_{−∞}(x, y) = max_y f(x, y). Further, if ‖∇_y f(x, y)‖_1 ≤ ε, then E(ε, x, y) = {y} and L_ε(x, y) = f(x, y) (this follows from Hölder's inequality, since γ(t) has speed at most 1 in the ℓ_∞-norm, the dual of the ℓ_1-norm). Note that the path γ need not be in the direction of the gradient, and there can potentially be infinitely many such paths starting at y. Unfortunately, the simulated loss may not even be continuous in x, and thus gradient-based notions of approximate local minima do not apply. To bypass this discontinuity (and hence nondifferentiability), we sample updates to x and test whether L_ε has decreased. This leads to the following definition of local min-max equilibrium (see also Section 2.3):

Definition 2.2. Given a distribution D_{x,y} for each x, y ∈ R^d, and ε' > 0, we say that (x*, y*) is an ε'-local min-max equilibrium with respect to the distribution D if

  ‖∇_y f(x*, y*)‖_1 ≤ ε',  and
  Pr_{∆∼D_{x*,y*}} [ L_{ε'}(x* + ∆, y*) < L_{ε'}(x*, y*) − ε' ] < ε'.

In our main result, we set D to be, roughly, the distribution of ADAM stochastic gradients for −∇_x f. Also note that, from the above discussion, the first condition of Definition 2.2 implies L_{ε'}(x*, y*) = f(x*, y*). To ensure convergence of ADAM (and avoid non-convergence issues such as those encountered in Reddi et al. (2019)), in our main result we constrain the values of the ADAM hyperparameters.
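The remark that ‖∇_y f(x, y)‖_1 ≤ ε forces E(ε, x, y) = {y} is a one-line Hölder computation: for any admissible path γ with γ(0) = y, at t = 0 we have

```latex
\frac{d}{dt} f(x,\gamma(t))\Big|_{t=0}
  = \big\langle \nabla_y f(x,y),\ \gamma'(0) \big\rangle
  \le \|\nabla_y f(x,y)\|_{1}\,\|\gamma'(0)\|_{\infty}
  \le \varepsilon \cdot 1
  = \varepsilon,
```

so the strict requirement (d/dt) f(x, γ(t)) > ε already fails at the start of the path, and no point other than y itself is reachable.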
Now we can state the formal guarantees of our algorithm.

Theorem 2.3. Algorithm 2, with appropriate hyperparameters for ADAM and some constant τ_1 > 0, given access to stochastic zeroth-order and gradient oracles for a function f = (1/m) Σ_{i∈[m]} f_i where each f_i is b-bounded with L-Lipschitz gradient for some b, L > 0, and given ε > 0, with probability at least 9/10 returns a point (x*, y*) ∈ R^d × R^d such that, for some ε' ∈ [ε/2, ε], the point (x*, y*) is an ε'-local min-max equilibrium with respect to the distribution D_{x,y}, where D_{x,y} is the distribution of G_x(x, y)/√(G_x²(x, y)), with the division "/" and the square applied element-wise. The number of stochastic gradient and function evaluations required by the algorithm is poly(1/ε, d, b, L).

We present the key ideas in the proof in Section 4 and a proof overview in Section B; the full proof appears in Appendix D. Note that D_{x,y} is the distribution of the stochastic gradient updates with element-wise normalization. Roughly, this corresponds to the distribution of ADAM steps taken by the algorithm for updating x if one uses a small step-size parameter for ADAM. We conclude this section with a few technical remarks about the theorem. Our algorithm could provide guarantees with respect to other distributions D_{x,y} in Definition 2.2 by sampling update steps for x from D_{x,y} instead of from ADAM. The norm in the stationarity guarantee for y is the ℓ_1-norm because we use ADAM for updating y; a simpler version of this algorithm using SGD would result in an ℓ_2-norm guarantee.

Comparison to notions of local optimality. Definition 2.2 provides a new notion of equilibrium for min-max optimization. Consider the problem min_x max_y f(x, y), but with constraints on the players' updates to x and y: the max-player is restricted to updating y via a path which increases f(x, y) at a rate of at least ε at every point on the path, and the min-player proposes an update ∆ for x sampled from D.
Then (x*, y*) satisfies Definition 2.2 if and only if (i) there is no update to y* that the max-player can make which increases the loss at a rate of at least ε (the first condition of Definition 2.2), and (ii) with probability at least 1 − ε over a random step ∆ ∼ D proposed by the min-player, the final loss after the above-constrained max-player's response is at most ε below its original value f(x*, y*). As one comparison to a previous notion of local optimality, any point which is a local optimum under the definition used previously, e.g., in Daskalakis & Panageas (2018), also satisfies our Definition 2.2 for small enough ε and a distribution D corresponding to a small enough step size. On the other hand, previous notions of local optima, including the one in Daskalakis & Panageas (2018), are not guaranteed to exist in a general setting, unlike our definition. (See Appendix C for a detailed comparison of how Definition 2.2 relates to previously proposed notions of local optimality.)

3. EMPIRICAL RESULTS

We seek to apply our min-max optimization algorithm to training GANs on both real-world and synthetic datasets. Following Goodfellow et al. (2014), we formulate GAN training as a min-max optimization problem using the cross-entropy loss, f(x, y) = log(D_y(ζ)) + log(1 − D_y(G_x(ξ))), where x, y are the weights of the generator and discriminator networks G and D respectively, ζ is sampled from the data, and ξ ∼ N(0, I_d). For this loss, the smoothness parameters b, L may not be finite. To adapt Algorithm 1 to training GANs, we make the following simplifications in our simulations: (1) Temperature schedule: We use a fixed temperature τ, constant across iterations i, making it simpler to choose a good temperature value rather than a temperature schedule. (2) Accept/reject rule: We replace the randomized acceptance rule with a deterministic rule: if f_new ≤ f_old we accept the proposed step, and if f_new > f_old we accept only if i is a multiple of e^{1/τ}, corresponding to an average acceptance rate of e^{−1/τ}. (3) Discriminator steps: We take a fixed number of discriminator steps at each iteration, instead of taking as many steps as needed to achieve a small gradient. These simplifications do not seem to significantly affect our algorithm's performance (see Appendix F.5 for simulations showing it effectively trains GANs without most of these simplifications). Moreover, our simulations show that a small number of discriminator steps k is usually sufficient in practice.

Datasets and metrics. We perform simulations on the MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al.) datasets to evaluate whether GANs trained using our algorithm converge, and whether they are able to learn the target distribution. Convergence is evaluated by visual inspection (for MNIST and CIFAR-10), and by tracking the loss (for MNIST) and FID scores (for CIFAR-10).
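Simplification (2), the deterministic accept/reject rule, can be sketched as follows; this is an illustrative reading of the rule, with the rounding of e^{1/τ} to an integer period being our own choice:

```python
import math

def accept(f_new, f_old, i, tau):
    """Deterministic accept/reject rule: improving steps are always
    accepted; a worsening step is accepted only when the iteration
    counter i is a multiple of round(e^{1/tau}), for an average
    acceptance rate of about e^{-1/tau} on worsening steps."""
    if f_new <= f_old:
        return True
    period = round(math.exp(1.0 / tau))
    return i % period == 0

# tau chosen so that e^{-1/tau} = 1/5, the acceptance rate used on MNIST.
tau = 1.0 / math.log(5.0)
```

With this τ, a worsening step is accepted once every 5 iterations, matching the 1/5 average acceptance rate reported for the MNIST experiments.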
As noted in previous works (Borji, 2019; Metz et al., 2017; Srivastava et al., 2017), it is challenging to detect mode collapse on CIFAR-10 and MNIST, visually or using standard quantitative metrics such as FID scores, because CIFAR-10 (and to some extent MNIST) does not have well-separated modes. Thus, we consider two datasets, one real and one synthetic, with well-separated modes, where mode collapse can be clearly detected by visual inspection. For the real dataset we consider the 0-1 MNIST dataset (MNIST restricted to digits labeled 0 or 1). The synthetic dataset consists of 512 points sampled from a mixture of four Gaussians in two dimensions with standard deviation 0.01 and means at (0, 1), (1, 0), (−1, 0), and (0, −1).

Hyperparameters and hardware. The details of the networks and hyperparameter choices are given in Appendix E. Simulations on the MNIST and Gaussian datasets used four 3.0 GHz Intel Scalable CPUs, provided by AWS. On CIFAR-10, we used one high-frequency Intel Xeon E5-2686 v4 GPU instance.

3.1. EVALUATING THE PERFORMANCE OF OUR ALGORITHM

We compare our algorithm's performance to both GDA and Unrolled GANs. All algorithms are implemented using ADAM (Kingma & Ba, 2015).

MNIST. We trained a GAN on the full MNIST dataset using our algorithm for 39,000 iterations (with k = 1 discriminator steps and acceptance rate e^{−1/τ} = 1/5). We ran this simulation five times; each time the GAN learned to generate all ten digits (see Appendix F.1 for generated images).

0-1 MNIST. We trained a GAN using our algorithm on the 0-1 MNIST dataset for 30,000 iterations and plotted a moving average of the loss values. We repeated this simulation five times; in each of the five runs our algorithm learned to generate digits which look like both the "0" and "1" digits, and the loss seems to decrease and stabilize once our algorithm learns how to generate the two digits. (See Appendix F.3 for the generated images and loss value plots.)

Figure 1: We trained a GAN using our algorithm on 0-1 MNIST for 30,000 iterations (with k = 1 discriminator steps and acceptance rate e^{−1/τ} = 1/5). We repeated this experiment 22 times for our algorithm and 13 times for GDA. Shown here are the images generated from one of these runs at various iterations for our algorithm (right) and GDA (left) (see also Appendix F.3 for images from other runs).

CIFAR-10. We ran our algorithm (with k = 1 discriminator steps and acceptance rate e^{−1/τ} = 1/2) on CIFAR-10 for 50,000 iterations. We compare with GDA with k = 1 discriminator steps. We plotted the FID scores for both algorithms. We found that both algorithms have similar FID scores which decrease over time, and produce images of similar quality after 50,000 iterations (Figure 3).

Clock time per iteration. When training on the 0-1 MNIST dataset (with k = 1 discriminator steps each iteration), our algorithm took 1.4 seconds per iteration on the AWS CPU server. On the same machine, GDA took 0.85 seconds per iteration.
When training on CIFAR-10, our algorithm and GDA both took the same amount of time per iteration, 0.08 seconds, on the AWS GPU server.

Mitigating mode collapse on 0-1 MNIST. We trained GANs using both GDA and our algorithm on the 0-1 MNIST dataset, and ran each algorithm for 3000 iterations (Figure 1). GDA seems to briefly generate shapes that look like a combination of 0's and 1's, then switches to generating only 1's, and then re-learns how to generate 0's. In contrast, our algorithm seems to learn how to generate both 0's and 1's early on and does not mode-collapse to either digit. We repeated this simulation 22 times for our algorithm and 13 times for GDA, and visually inspected the images at iteration 1000. GANs trained using our algorithm generated both digits by the 1000th iteration in 86% of the runs, while those trained using GDA did so in only 23% of the runs (see Appendix F.4 for images from all runs).

Mitigating mode collapse on synthetic data. We trained GANs on the 4-Gaussian mixture dataset for 1500 iterations (Figure 2) using our algorithm, Unrolled GANs with k = 6 unrolling steps, and GDA with k = 1 and k = 6 discriminator steps. We repeated each simulation 10-20 times. By the 1500th iteration, GDA with k = 1 discriminator steps had learned only one mode in 100% of the runs. GDA with k = 6 discriminator steps learned two modes in 65% of the runs, one mode in 20% of the runs, and four modes in 15% of the runs. Unrolled GANs learned one mode in 75% of the runs, two modes in 15% of the runs, and three modes in 10% of the runs. In contrast, our algorithm learned all four modes in 68% of the runs, three modes in 26% of the runs, and two modes in 5% of the runs.

Figure 3: We trained GANs using our algorithm and GDA on CIFAR-10 for 50,000 iterations; shown are images generated by the resulting generators for our algorithm (middle) and GDA (left). Over 9 runs, our algorithm achieves a very similar minimum FID score (33.8) compared to GDA (33.0), and a better average FID score (mean µ = 35.6, std. dev. σ = 1.1) compared to GDA (µ = 53.8, σ = 53.9). Images are shown from one run each; see Appendix F.2 for full results.

4. KEY IDEAS IN THE PROOF

For simplicity, assume b = L = τ_1 = 1. There are two key pieces to proving Theorem 2.3. The first is to show that our algorithm converges to some point (x*, y*) in poly(d, 1/ε) gradient and function evaluations (Lemma D.7). The second is to show that y* is a first-order ε-stationary point for f(x*, ·) and that x* is, roughly, an ε-local minimum for the simulated loss function L_ε(·, y*) (Lemma D.9).

Step 1: Bounding the number of gradient evaluations. After Θ(log(1/ε)) steps, the decaying acceptance rate of the simulated annealing step ensures that our algorithm stops whenever r_max = O(1/ε) proposed steps are rejected in a row. Thus, for every O(r_max/ε²) iterations where the algorithm does not terminate, with probability at least 1 − ε the value of the loss decreases by more than ε. Since f is 1-bounded, this implies our algorithm terminates after roughly O(r_max/ε³) iterations of the minimization routine (Proposition D.6). Next, since f is 1-bounded with 1-Lipschitz gradient, each iteration requires at most poly(d/ε) gradient ascent steps to reach an ε-stationary point. Since each step of the maximization subroutine requires one gradient evaluation, and each iteration of the minimization routine calls the maximization routine exactly once, the total number of gradient evaluations is poly(d, 1/ε).

Step 2: Showing x* is an ε-local minimum for L_ε(·, y*) and y* is an ε-stationary point. First, since our algorithm runs the gradient ascent maximization subroutine until it reaches an ε-stationary point, we have ‖∇_y f(x*, y*)‖_1 ≤ ε. Our stopping condition implies that the last r_max updates ∆ proposed by the min-player were all rejected, and hence were sampled from the distribution D_{x*,y*} of the ADAM gradient at (x*, y*). Roughly, this implies

  Pr_{∆∼D_{x*,y*}} [ f(x* + ∆, y') ≥ f(x*, y*) − ε ] ≥ 1 − ε,

where the maximization subroutine computes y' by gradient ascent on f(x* + ∆, ·) initialized at y*.
In other words, at the point (x*, y*) where our algorithm stops, if the min-player samples an update ∆ from the distribution D_{x*,y*}, followed by the max-player updating y* using gradient ascent, then with high probability the final loss value cannot decrease by more than ε. To show the second condition of Definition 2.2 holds, we need to replace f in the above inequality with the simulated loss L_ε. We first show that the gradient ascent steps form an "ε-increasing" path, starting at y* with endpoint y', along which f increases at a rate of at least ε (Prop. D.8). This crucially exploits the fact that our algorithm restricts the max-player to only use such "ε-increasing" paths. Since L_ε is the supremum of f over the endpoints of all such ε-increasing paths starting at y*, we get f(x* + ∆, y') ≤ L_ε(x* + ∆, y*). Finally, recall from Section 2 that ‖∇_y f(x*, y*)‖_1 ≤ ε implies L_ε(x*, y*) = f(x*, y*). Combining the above observations yields the ε-local minimum condition of Definition 2.2. Note that we could pick any distribution D for the updates and the proof would still hold; the distribution of ADAM gradients happens to work well in practice. Also, we could replace simulated annealing with a deterministic rule, but such an algorithm often gets stuck at poor local equilibria in GAN training.
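The counting in Step 1 can be summarized as a potential argument: with b = 1, the loss can decrease by more than ε at most O(1/ε) times in total, so the algorithm runs for

```latex
\underbrace{O\!\left(\tfrac{1}{\varepsilon}\right)}_{\#\ \varepsilon\text{-decreases}}
\;\times\;
\underbrace{O\!\left(\tfrac{r_{\max}}{\varepsilon^{2}}\right)}_{\text{iterations per decrease (w.h.p.)}}
\;=\;
O\!\left(\tfrac{r_{\max}}{\varepsilon^{3}}\right)\ \text{iterations},
```

each of which invokes the maximization subroutine once, at poly(d/ε) gradient evaluations per invocation.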

5. CONCLUSION

In this paper, we develop a convergent first-order algorithm for min-max optimization and show how it can lead to a stable and scalable algorithm for training GANs. We prove that our algorithm converges in time polynomial in the dimension and in the smoothness parameters of the loss function. Our simulations show that a version of our algorithm can lead to more stable training of GANs on synthetic and real-world datasets, and yet the amount of memory and time required by each of its iterations is competitive with GDA. Our algorithm synthesizes a first-order approximation to the global strategy of the max-player, a look-ahead strategy based on batch gradients for the min-player, and simulated annealing. We believe that these ideas of imposing computational restrictions on the min- and max-players should be useful in obtaining convergent and practical algorithms for other applications of min-max optimization, such as adversarial learning.

A THE FULL ALGORITHM

Algorithm 2 Algorithm for min-max optimization (formal version)
input: Stochastic zeroth-order oracle F for the loss function f : R^d × R^d → R, and stochastic gradient oracles G_x for ∇_x f and G_y for ∇_y f
input: Initial point (x_0, y_0), error parameter ε
output: A point (x*, y*)
hyperparameters: r_max (max number of rejections); τ_1 > 0 (hyperparameter for simulated annealing); hyperparameters α, η, δ > 0 and β_1, β_2 ∈ [0, 1] for ADAM
1: Set i ← 0, r ← 0, a ← 0, m_0 ← 0, v_0 ← 0, m ← 0, v ← 0, s ← 0, ε_1 = ε/2
2: Set f_old ← ∞ {Set f_old to ∞ (or the largest value allowed by the computer) to ensure that the first step is accepted}
3: while r ≤ r_max do
4:   Set i ← i + 1
5:   Set g_{x,i} ← G_x(x_i, y_i) {Compute proposed stochastic gradient}
6:   Set M_{i+1} ← β_1 m_i + (1 − β_1) g_{x,i}, and V_{i+1} ← β_2 v_i + (1 − β_2) g_{x,i}^2 {Compute proposed ADAM updates of the first- and second-moment estimates}
7:   Set X_{i+1} ← x_i − α (M_{i+1}/(1 − β_1^{a+1})) / ( √(V_{i+1}/(1 − β_2^{a+1})) + δ ) {Compute proposed ADAM update of the x-variable}
8:   Run Algorithm 3 with inputs x ← X_{i+1}, y_0 ← y_i, (m, v, s), and ε' ← ε_i(1 + η)
9:   Set Y_{i+1} ← y_stationary and (M, V, S) ← (m_out, v_out, s_out) to be the outputs of Alg. 3 {Use the ADAM optimizer to simulate the max-player's response}
10:  Set f_new ← F(X_{i+1}, Y_{i+1})
11:  if f_new > f_old − ε/4 then
12:    Set Accept_i ← False with probability max(0, 1 − e^{−i/τ_1}) {Decide whether to accept or reject}
13:  if Accept_i = True then [. . .] Set ε_{i+1} ← ε_i
25: return (x*, y*) ← (x_i, y_i)

Algorithm 3 ADAM (for the max-player updates)
input: Stochastic gradient oracle G_y for ∇_y f
input: x, y_0, m, v, ε', and the number of steps s taken at previous iterations
output: A point y_stationary which is a first-order ε'-stationary point for f(x, ·), together with s_out, m_out, v_out
hyperparameters: η > 0; ADAM hyperparameters α, δ > 0 and β_1, β_2 ∈ [0, 1]
1: Set j ← 0
2: Set Stop ← False
3: while Stop = False do
4:   Set j ← j + 1
5:   Set g_{y,j} ← G_y(x, y_j)
6:   if ‖g_{y,j}‖_1 > ε' then
7:     Set m_j ← β_1 m_j + (1 − β_1) g_{y,j}, and v_j ← β_2 v_j + (1 − β_2) g_{y,j}^2 {Compute proposed ADAM updates of the first- and second-moment estimates}
8:     Set y_{j+1} ← y_j + α (m_j/(1 − β_1^{s+j+1})) / ( √(v_j/(1 − β_2^{s+j+1})) + δ )
9:   else Set Stop ← True, y_stationary ← y_j, s_out ← s + j, m_out ← m_j, v_out ← v_j

B PROOF OVERVIEW

To prove Theorem 2.3 we would like to show two things: first, that our algorithm converges to some point (x*, y*) in a number of gradient and function evaluations which is polynomial in 1/ε, d, b, L (Lemma D.7, in the appendix); and second, roughly, that y* is a first-order ε-stationary point for f(x*, ·) and x* is an ε-local minimum for the simulated loss function L_ε(·, y*) (Lemma D.9).

Bounding the number of gradient and function evaluations (Lemma D.7). First, we bound the number of gradient evaluations for the maximization subroutine for y (Line 6 of Algorithm 1), and then bound the number of iterations of the minimization routine for x (the "While" loop in Alg. 1).
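The control flow of Algorithms 2 and 3 can be sketched as follows, in the simplified theoretical setting β_1 = β_2 = δ = 0 where the ADAM direction reduces to a per-coordinate sign. All names are illustrative, the stochastic oracles are replaced by exact gradients, and the full bookkeeping of the formal version is omitted:

```python
import numpy as np

def normalized_grad(g):
    # With beta1 = beta2 = 0 and delta = 0 (the theoretical setting), the
    # ADAM direction g / sqrt(g^2) is the per-coordinate sign of g.
    return np.sign(g)

def max_player_ascent(grad_y, x, y, alpha, eps_prime, max_steps=10_000):
    """Algorithm 3, simplified: normalized gradient ascent on f(x, .)
    until the y-gradient has l1-norm at most eps_prime."""
    for _ in range(max_steps):
        g = grad_y(x, y)
        if np.abs(g).sum() <= eps_prime:
            break
        y = y + alpha * normalized_grad(g)
    return y

def min_max(f, grad_x, grad_y, x, y, alpha, eps, tau1, r_max, rng):
    """Algorithm 2, simplified: the min-player proposes a normalized-gradient
    step, simulates the max-player's response, and accepts or rejects the
    pair via the annealed rule; stop after r_max rejections in a row."""
    f_old, i, rejects = np.inf, 0, 0
    while rejects <= r_max:
        i += 1
        x_prop = x - alpha * normalized_grad(grad_x(x, y))
        y_prop = max_player_ascent(grad_y, x_prop, y, alpha, eps)
        f_new = f(x_prop, y_prop)
        accept = (f_new <= f_old - eps / 4
                  or rng.random() >= max(0.0, 1.0 - np.exp(-i / tau1)))
        if accept:
            x, y, f_old, rejects = x_prop, y_prop, f_new, 0
        else:
            rejects += 1
    return x, y
```

On a toy quadratic min-max problem such as f(x, y) = ‖x‖^2 − ‖y − x‖^2, this sketch drives x toward the min-max optimum at the origin while y tracks the max-player's best response y = x.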
Step 1: Bounding the number of gradient ascent steps for the maximization subroutine. Consider the sequence of ADAM gradient ascent steps y* = y_1, y_2, . . . , y' used by the max-player to compute her update y' in Line 6 of Alg. 1. For our choice of hyperparameters, the ADAM update is y_{j+1} = y_j + α G_y(x, y_j)/√(G_y^2(x, y_j)), where α is the ADAM learning rate and the division and square root are element-wise. Since this normalized direction satisfies ⟨G_y(x, y_j)/√(G_y^2(x, y_j)), G_y(x, y_j)⟩ = ‖G_y(x, y_j)‖_1, our stopping condition for gradient ascent (Line 6 of Alg. 1), which says that gradient ascent stops whenever ‖G_y(x, y_j)‖_1 ≤ ε, implies that until termination each ADAM step ascends along the stochastic gradient at rate more than ε per unit step size. Using this, we then show that at each step y_j of gradient ascent we have, with high probability, f(y_{j+1}) − f(y_j) ≥ Ω(αε), provided the ADAM learning rate satisfies α ≈ Θ(ε/(Ld)) (Proposition D.5). Since f is b-bounded, this implies that the ADAM gradient ascent subroutine must terminate after O(b/(αε)) = O(bLd/ε^2) steps.

Step 2: Bounding the number of iterations of the minimization routine. We first show a concentration bound for our stochastic zeroth-order oracle (Proposition D.3) and use it to show that, for our temperature schedule τ_i = τ_1/i, after Î := τ_1 log(2 r_max b/ε^2) iterations our algorithm accepts any proposed step x' = x + ∆ for which f(x', y') > f(x, y) − ε with probability at most roughly e^{−1/τ_i} ≤ ε^2/(2 r_max b). Therefore, roughly, with probability at least 1 − ε, during the first 2 r_max b/ε iterations after Î, only proposed steps x' = x + ∆ for which f(x', y') ≤ f(x, y) − ε are accepted. Moreover, our stopping condition (Line 2 in Alg. 1) stops the algorithm whenever r_max proposed steps are rejected in a row.
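The identities used in Step 1 for the element-wise normalized direction v = g/√(g^2) can be checked numerically (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=8)            # a stochastic gradient with no zero entries
v = g / np.sqrt(g**2)             # ADAM direction with beta1 = beta2 = delta = 0

# <v, g> = ||g||_1: the normalized step ascends at rate equal to the l1-norm.
assert np.isclose(v @ g, np.abs(g).sum())
# ||v||_2 = sqrt(d): every step has the same Euclidean length alpha * sqrt(d).
assert np.isclose(np.linalg.norm(v), np.sqrt(len(g)))
# ||v||_inf <= 1: the step moves at unit speed in the infinity norm (cf. Prop. D.8).
assert np.abs(v).max() <= 1.0 + 1e-12
```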
Therefore, with probability at least 1 − ε, the value of the loss decreases by more than 2b/ε between iterations Î and Î + r_max^2 b/ε^2, unless our algorithm terminates during these iterations. Since f is b-bounded, this implies our algorithm must terminate after roughly O(r_max^2 b/ε^2) iterations of the minimization routine (Proposition D.6). Now, each of the O(bLd/ε^2) steps of the maximization subroutine requires one gradient evaluation, and each of the O(r_max^2 b/ε^2) iterations of the minimization routine calls the maximization routine exactly once (and makes one call to the stochastic oracles for the function f and for the x-gradient). Therefore, the total number of gradient and function evaluations is roughly bLd/ε^2 × r_max^2 b/ε^2, which, for our choice of hyperparameter r_max ≈ 1/ε, is polynomial in 1/ε, d, b, L.

Showing that (x*, y*) satisfies Inequalities equation 34 and equation 35 (Lemma D.9).

Step 1: Showing y* is a first-order ε-stationary point for f(x*, ·) (equation 34). Our stopping condition for the maximization subroutine guarantees ‖G_y(x*, y*)‖_1 ≤ ε at the point y* where the subroutine terminates. To prove equation 34, we show a concentration bound for our stochastic gradient G_y (the second part of Proposition D.3) and use it to show that this bound on the stochastic gradient implies the desired bound on the exact gradient, ‖∇_y f(x*, y*)‖_1 ≤ ε.

Step 2: Showing x* is an ε-local minimum for L_ε(·, y*) (equation 35). First, we show that our stopping condition for the minimization routine implies the last r_max steps ∆ proposed by the algorithm were all rejected. This implies the last r_max proposed steps were sampled from the distribution D_{x*,y*} of the ADAM gradient G_x(x*, y*)/√(G_x^2(x*, y*)) and, since they were rejected, our stopping condition roughly implies f(x* + ∆, y') > f(x*, y*) − ε for all of these proposed steps ∆.
Roughly, we use this, together with our poly(1/ε, d, b, L) bound on the number of iterations (Proposition D.6), to show

Pr_{∆∼D_{x*,y*}}[ f(x* + ∆, y') ≥ f(x*, y*) − ε ] ≥ 1 − ε,    (8)

where the maximization subroutine computes y' by gradient ascent on f(x* + ∆, ·) initialized at y*. To show that equation 35 holds, roughly, we would like to replace f in the bound in Ineq. equation 8 with the simulated loss function L_ε. Towards this end, we first show that the steps traced by the gradient ascent maximization subroutine form an "ε-increasing" path, with endpoint y', along which f increases at rate at least ε (Prop. D.8). Although we would ideally like to use this fact to show that f(x* + ∆, y') = L_ε(x* + ∆, y*), this equality does not hold in general, since L_ε is defined using a large set of such ε-increasing paths, only one of which is simulated by the maximization subroutine. To get around this problem, we instead use the fact that L_ε is the supremum of the values of f at the endpoints of all ε-increasing paths starting at y* which seek to maximize f(x* + ∆, ·), to show that

f(x* + ∆, y') ≤ L_ε(x* + ∆, y*).    (9)

Finally, we already showed that ‖∇_y f(x*, y*)‖_1 ≤ ε; recall from Section 2 that this implies L_ε(x*, y*) = f(x*, y*). Combining this with equation 8, equation 9 implies the ε-local minimum condition equation 35.

C COMPARISON OF NOTIONS OF LOCAL OPTIMALITY

In previous works (Daskalakis & Panageas, 2018; Heusel et al., 2017), a local saddle point (equivalently, a local Nash point) has been considered as a possible notion of local min-max optimum. A point (x*, y*) is said to be a local saddle point if there exists a small neighborhood around x* and y* such that y* is a first-order stationary point for the function f(x*, ·), and x* is a local minimum for the function f(·, y*). A key difference between a local saddle point and the local min-max equilibrium we introduce in our paper (Definition 2.2) is that a local saddle point does not take into account the order in which the min-player and max-player choose x and y. While a local saddle point is not guaranteed to exist in general nonconvex-nonconcave min-max problems, whenever a local saddle point does exist, it is also a local min-max equilibrium in the sense of Definition 2.2. More specifically, any local saddle point of a smooth function f is an ε-local min-max equilibrium for every ε > 0, provided we pick the step size of the min-player small enough. This is because, if a point (x*, y*) is a local saddle, then y* is a first-order stationary point for the function f(x*, ·). Thus the gradient at (x*, y*) must be zero and the first condition (Inequality equation 34) is satisfied. If we pick the step size small enough, the step ∆ will be such that x* + ∆ lies within this neighborhood of x*, and hence L_ε(x*, y*) = f(x*, y*) ≤ f(x* + ∆, y*) ≤ L_ε(x* + ∆, y*) for all ε > 0. Thus the second condition (Inequality equation 35) is satisfied for any ε > 0. Another notion of a local minimax point was proposed in Jin et al. (2020). In their definition, the max-player is able to choose her move after the min-player reveals her move. The max-player is restricted to move in a small ball around y*, but is always able to find the global maximum inside this ball.
In contrast, in our definition the max-player is empowered to move much farther, as long as she follows a path along which the loss function increases rapidly. Hence, in general, the two notions are incomparable. However, even under mild smoothness conditions a local minimax point is not guaranteed to exist (Jin et al., 2020), whereas a local min-max equilibrium (Definition 2.2) is. Finally, in a parallel line of work, we prove the existence of a stronger, second-order notion of min-max equilibrium for nonconvex-nonconcave functions, motivated by the notion of approximate local minimum introduced in Nesterov & Polyak (2006). The advantage of the notion in our current paper (Definition 2.2) is that it yields a concise proof of convergence, and the accompanying algorithm, as our empirical results show, is effective for training GANs.
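The claim that a local saddle point satisfies both equilibrium conditions can be sanity-checked on the toy function f(x, y) = x^2 − y^2, for which (0, 0) is a (global) saddle; this example is illustrative and not from the paper:

```python
import numpy as np

# For f(x, y) = x^2 - y^2, the point (0, 0) is a saddle: the y-gradient
# vanishes there (condition 34), and no small min-player step Delta can
# decrease the loss (which drives condition 35 in the argument above).
f      = lambda x, y: x**2 - y**2
grad_y = lambda x, y: -2.0 * y

assert grad_y(0.0, 0.0) == 0.0                 # y* is stationary for f(x*, .)
for delta in np.linspace(-0.1, 0.1, 21):
    assert f(0.0 + delta, 0.0) >= f(0.0, 0.0)  # no small step decreases the loss
```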

D MAIN RESULT

Recall that we have defined D_{x,y} to be the distribution of the coordinate-wise normalized stochastic gradient G_x(x, y)/√(G_x^2(x, y)). We also recall the following assumptions about our stochastic gradients, restated here for convenience.

Assumption D.1 (smoothness). Suppose that f_t(x, y) ∼ D_{x,y} is sampled from the data distribution for any x, y ∈ R^d. Then, with probability 1, f_t is b-bounded, L_1-Lipschitz, and has L-Lipschitz gradient ∇f_t.

Assumption D.2 (batch gradients). F(x, y) = (1/b_0) Σ_{t=1}^{b_0} f_t(x, y), G_x(x, y) = (1/b_x) Σ_{t=1}^{b_x} ∇_x f_t(x, y), and G_y(x, y) = (1/b_y) Σ_{t=1}^{b_y} ∇_y f_t(x, y), where f_1, . . . , f_t are sampled i.i.d. (with replacement) from the distribution D_{(x,y)}, and b_0, b_x, b_y > 0 are batch sizes. Every time G_x, G_y, or F is evaluated, a new, independent batch is used.

Setting parameters. For the theoretical analysis, we set the following parameters, where we define z/z := 1 if z = 0, and we assume 0 < ε ≤ 1:
1. α = 1
2. β_1 = 0, β_2 = 0, δ = 0
3. ω = ε^4 2^{−36} (bLd + 16)^{−1} [ (τ_1 + b/ε^2)(256/ε^2) log(τ_1 · 256/ε^2) log^4( (100/ε)(τ_1 + 1)(8b/ε + 1) ) ]^{−2}
4. r_max = (128/ε) log^2( (100/ε)(τ_1 + 1)(8b/ε + 1) ) + log(1/ω)
5. Define I := τ_1 log(r_max/ω) + 8 r_max b/ε + 1
6. η = 2^{1/(2I)} − 1 (in particular, η = 2^{1/(2I)} − 1 and I ≥ 1 imply 1/(5I) ≤ η ≤ 1/I)
7. α = ε(1 − 1/(1+η))/(10Ld)
8. Define J := 20b/(3αε)
9. ε_1 = (ε/√d)(1 − 1/(1+η))
10. b_x = 1
11. b_0 = ε_1^{−2} 140^2 b^2 log(1/ω)
12. b_y = ε_1^{−2} 140^2 L_1^2 log(1/ω)

Under review as a conference paper at ICLR 2021

In particular, we note that ω ≤ ε/((32J + 16)I) and r_max ≥ (4/ε) log(100I/ε). At every iteration i, where we set ε' = ε_i, we also have ε' ≤ ε_0(1 + η)^{2I} ≤ ε, and hence (ε_1/10 + L√d α)√d ≤ (1 − 1/(1+η))ε_0 ≤ (1 − 1/(1+η))ε'.

Proposition D.3.
For any ε_1, ω > 0, if we use batch sizes b_y = ε_1^{−2} 140^2 L_1^2 log(1/ω) and b_0 = ε_1^{−2} 140^2 b^2 log(1/ω), then

P( ‖G_y(x, y) − ∇_y f(x, y)‖_2 ≥ ε_1/10 ) < ω,    (10)

and

P( |F(x, y) − f(x, y)| ≥ ε_1/10 ) < ω.    (11)

Proof. From Assumption D.2 we have G_y(x, y) − ∇_y f(x, y) = (1/b_y) Σ_{t=1}^{b_y} [∇_y f_t(x, y) − ∇_y f(x, y)], where f_1, . . . , f_{b_y} are sampled i.i.d. (with replacement) from the data distribution D. By Assumption D.1 we have, with probability 1, ‖∇_y f_t(x, y) − ∇_y f(x, y)‖_2 ≤ ‖∇_y f_t(x, y)‖_2 + ‖∇_y f(x, y)‖_2 ≤ 2L_1. Moreover, E[∇_y f_t(x, y) − ∇_y f(x, y)] = E[∇_y f_t(x, y) − E[∇_y f_t(x, y)]] = 0. Therefore, by the Azuma-Hoeffding inequality for mean-zero bounded vectors,

P( ‖(1/b_y) Σ_{t=1}^{b_y} [∇_y f_t(x, y) − ∇_y f(x, y)]‖_2 ≥ (s/√b_y + 1/√b_y) · 2L_1 ) < 2e^{1 − s^2/2}   for all s > 0.

Hence, setting s = 6 log^{1/2}(2/ω), we have 7 log^{1/2}(2/ω)/√b_y ≥ (s + 1)/√b_y, and hence

P( ‖(1/b_y) Σ_{t=1}^{b_y} [∇_y f_t(x, y) − ∇_y f(x, y)]‖_2 ≥ (7 log^{1/2}(2/ω)/√b_y) · 2L_1 ) < ω.

Therefore, by our choice of b_y,

P( ‖(1/b_y) Σ_{t=1}^{b_y} [∇_y f_t(x, y) − ∇_y f(x, y)]‖_2 ≥ ε_1/10 ) < ω,

which completes the proof of Inequality equation 10. Inequality equation 11 follows from the exact same steps as the proof of Inequality equation 10, if we replace the bound L_1 on ‖∇_y f_t(x, y)‖_2 with the bound b on |f_t(x, y)|.

Proposition D.4. For every j we have

‖y_{j+1} − y_j‖_2 ≤ α√d.    (12)

Moreover, for every i we have

‖x_{i+1} − x_i‖_2 ≤ α√d.    (13)

Proof. We have ‖y_{j+1} − y_j‖_2 ≤ α‖m_j/√v_j‖_2 = α‖( g_{y,j}[k]/√(g_{y,j}^2[k]) )_{k=1}^d‖_2 = α√d. This proves Inequality equation 12. The proof of Inequality equation 13 follows from the same steps as above.

Proposition D.5. Algorithm 3 terminates after at most J := 20b/(3αε) iterations of its "While" loop, with probability at least 1 − ω × J.

Proof.
Recalling that / and √· denote element-wise operations, we have

⟨m_j/√v_j, g_{y,j}⟩ = Σ_{k=1}^d ( g_{y,j}[k]/√(g_{y,j}^2[k]) ) × g_{y,j}[k] = Σ_{k=1}^d |g_{y,j}[k]| = ‖g_{y,j}‖_1 ≥ ε' ≥ ε/2,    (14)

since, whenever Algorithm 3 is called by Algorithm 2, Algorithm 2 sets Algorithm 3's input ε' to some value ε' ≥ ε/2. Therefore, with probability at least 1 − ω, we have

⟨y_{j+1} − y_j, ∇_y f(x, y_j)⟩ ≥ ⟨y_{j+1} − y_j, g_{y,j}⟩ − ‖y_{j+1} − y_j‖_2 × ε_1/10   (Prop. D.3)
 = α⟨m_j/√v_j, g_{y,j}⟩ − ‖y_{j+1} − y_j‖_2 × ε_1/10
 ≥ αε/2 − α√d × ε_1/10   (Eq. 14 and Prop. D.4)
 ≥ (4/10)αε,

since ε_1 ≤ ε/√d. Since f has L-Lipschitz gradient, there exists a vector u with ‖u‖_2 ≤ L‖y_{j+1} − y_j‖_2 such that

f(y_{j+1}) − f(y_j) = ⟨y_{j+1} − y_j, ∇_y f(x, y_j) + u⟩ = ⟨y_{j+1} − y_j, ∇_y f(x, y_j)⟩ + ⟨y_{j+1} − y_j, u⟩.    (16)

Proof (of Proposition D.6). For any i > 0, let E_i be the "bad" event that both f(x_{i+1}, y_{i+1}) − f(x_i, y_i) > −ε/4 and Accept_i = True. Then by Proposition D.3, since ε_1/10 ≤ ε/8, we have

P(E_i) ≤ e^{−i/τ_1} + ω.    (17)

Define Î := τ_1 log(r_max/ω). Then for i ≥ Î, from Line 12 of Algorithm 2 and Inequality equation 17, we have P(E_i) ≤ 2ω. Define h := r_max · 2b/(ε/4) + 1 = 8 r_max b/ε + 1. Then

P( ∪_{i=Î}^{Î+h} E_i ) ≤ 2ω × h.    (18)

Since f takes values in [−b, b], if ∪_{i=Î}^{Î+h} E_i does not occur, the number of accepted steps over the iterations Î ≤ i ≤ Î + h (that is, the size of the set {i : Î ≤ i ≤ Î + h, Accept_i = True}) is at most 2b/(ε/4) = 8b/ε. Therefore, since h = 8 r_max b/ε + 1, there must exist a number ī, with Î ≤ ī ≤ ī + r_max ≤ Î + h, such that Accept_i = False for all i ∈ [ī, ī + r_max]. Therefore the condition in the While loop of Algorithm 2 implies that Algorithm 2 terminates after at most ī + r_max ≤ Î + h iterations of its While loop, as long as ∪_{i=Î}^{Î+h} E_i does not occur. Therefore, Inequality 18 implies that, with probability at least 1 − 2ω × (8 r_max b/ε + 1), Algorithm 2 terminates after at most Î + h = τ_1 log(r_max/ω) + 8 r_max b/ε + 1 iterations of its "While" loop.
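The per-step gain of order α‖g‖_1 behind Proposition D.5 can be checked numerically on a quadratic with 1-Lipschitz gradient (an illustrative sketch, not the paper's loss): a normalized ascent step of size α gains at least α‖g‖_1 − (L/2)α^2 d.

```python
import numpy as np

rng = np.random.default_rng(2)
c = rng.normal(size=6)
f = lambda y: -0.5 * np.sum((y - c) ** 2)   # gradient c - y is 1-Lipschitz
grad = lambda y: c - y
L, alpha, d = 1.0, 0.05, len(c)

y = np.zeros(d)
for _ in range(50):
    g = grad(y)
    y_next = y + alpha * np.sign(g)         # normalized (sign) ascent step
    gain = f(y_next) - f(y)
    # descent-lemma lower bound on the ascent progress
    assert gain >= alpha * np.abs(g).sum() - 0.5 * L * alpha**2 * d - 1e-12
    y = y_next
```

Summed over steps, this is exactly why b-boundedness of f caps the number of ascent iterations at O(b/(αε)) once ‖g‖_1 > ε.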
Recall the paths γ(t) from Definition 2.1. From now on we will refer to such paths as "ε-increasing paths". That is, for any ε > 0, we say that a path γ(t) is an ε-increasing path if at every point along this path we have ‖(d/dt)γ(t)‖_∞ ≤ 1 and (d/dt) f(x, γ(t)) > ε (Inequality equation 33).

Proposition D.8. Every time Algorithm 3 is called, with probability at least 1 − 2ωJ, the path consisting of the line segments [y_j, y_{j+1}] formed by the points y_j computed by Algorithm 3 has a parametrization γ(t) which is an (ε'/(1+η))-increasing path.

Proof. We consider the following continuously parametrized path γ(t):

γ(t) = y_j + (t − α(j − 1)) v_j,   t ∈ [α(j − 1), αj],  j ∈ [j_max],

where v_j := g_{y,j}/√(g_{y,j}^2) and j_max ∈ N ∪ {∞} is the number of iterations of the While loop of Algorithm 3. First, we show that this path has at most unit velocity in the infinity norm at all but finitely many time points. For any point in one of the open intervals t ∈ (α(j − 1), αj), we have

‖(d/dt)γ(t)‖_∞ = ‖v_j‖_∞ = ‖g_{y,j}/√(g_{y,j}^2)‖_∞ ≤ ‖(1, . . . , 1)‖_∞ = 1.

This implies that the path γ(t) has unit velocity in the infinity norm at all but finitely many time points. Next, we show that (d/dt) f(x, γ(t)) ≥ ε'/(1+η). Now,

‖v_j‖_2 = √d,    (19)

and hence

‖y_{j+1} − y_j‖_2 = α‖v_j‖_2 = α√d at every step j.    (20)

Moreover, by the stopping condition of Algorithm 3 we have ⟨g_{y,j}, v_j⟩ = ‖g_{y,j}‖_1 > ε' for every j ∈ [j_max].
Therefore, for every j ∈ [j_max], by Proposition D.3 we have, with probability at least 1 − ω, that for all t ∈ [α(j − 1), αj],

(d/dt) f(x, γ(t)) = ∇_y f(x, γ(t))^T v_j
 ≥ [∇_y f(x, y_j) − L‖y_{j+1} − y_j‖_2 u]^T v_j    (Eq. 20)
 = [∇_y f(x, y_j) − L√d α u]^T v_j
 = [g_{y,j} − (ε_1/10) w − L√d α u]^T v_j
 = g_{y,j}^T v_j − [(ε_1/10) w + L√d α u]^T v_j
 = ‖g_{y,j}‖_1 − [(ε_1/10) w + L√d α u]^T v_j
 ≥ ‖g_{y,j}‖_1 − (ε_1/10 + L√d α)‖v_j‖_2
 = ‖g_{y,j}‖_1 − (ε_1/10 + L√d α)√d    (Eq. 19)
 ≥ ε' − (ε_1/10 + L√d α)√d
 ≥ ε'/(1 + η),    (21)

for some unit vectors u, w ∈ R^d, since (ε_1/10 + L√d α)√d ≤ (1 − 1/(1+η))ε_0 ≤ (1 − 1/(1+η))ε'. But by Proposition D.5 we have j_max ≤ J with probability at least 1 − ω × J. Therefore Inequality equation 21 implies that

(d/dt) f(x, γ(t)) ≥ ε'/(1 + η)   for all t ∈ [α(j − 1), αj] and all j ∈ [j_max],

with probability at least 1 − 2ωJ.

Lemma D.9. Let i* be such that i* − 1 is the last iteration i of the "While" loop in Algorithm 2 for which Accept_i = True. Then, with probability at least 1 − 2ωJI − 2ω × (8 r_max b/ε + 1), we have

‖∇_y f(x*, y*)‖_1 ≤ ε_{i*}/√(1 + η).    (22)

Moreover, with probability at least 1 − ε/100 − 2ω × (8 r_max b/ε + 1), we have

P_{∆∼D_{x*,y*}}( L_{ε_{i*}}(x* + ∆, y*) ≤ L_{ε_{i*}}(x*, y*) − ε/2 | x*, y* ) ≤ ε/2    (23)

and that

ε/2 ≤ ε_{i*} ≤ ε.    (24)

Proof. First, we note that (x*, y*) = (x_i, y_i) for all i ∈ {i*, . . . , i* + r_max}, and that Algorithm 2 stops after exactly i* + r_max iterations of its "While" loop. Define ∆_i := g_{x,i}/√(g_{x,i}^2) for every i; then ∆_i ∼ D_{x_i,y_i}. Let H_i be the "bad" event that, when Algorithm 3 is called during the i-th iteration of the "While" loop in Algorithm 2, the path traced by Algorithm 3 is not an ε_i-increasing path. Then, by Proposition D.8, we have

P(H_i) ≤ 2ωJ.    (25)

Let K_i be the "bad" event that ‖G_y(x_i, y_i) − ∇_y f(x_i, y_i)‖_2 ≥ ε_1/10. Then by Propositions D.3 and D.5 we have

P(K_i) ≤ 2ωJ.    (26)
Whenever K_i^c occurs, we have

‖∇_y f(x_i, y_i)‖_1 ≤ ‖G_y(x_i, y_i)‖_1 + ‖G_y(x_i, y_i) − ∇_y f(x_i, y_i)‖_1 ≤ ε_i/(1 + η) + √d ‖G_y(x_i, y_i) − ∇_y f(x_i, y_i)‖_2 ≤ ε_i/(1 + η) + √d ε_1/10 ≤ ε_i/√(1 + η),    (27)

where the second inequality holds by the stopping condition of Algorithm 3, and the last inequality holds by our choice of ε_1. Therefore ‖∇_y f(x*, y*)‖_1 ≤ ε_{i*}/√(1 + η) with probability at least 1 − 2ωJI − 2ω × (8 r_max b/ε + 1). This proves Inequality equation 22. Inequality equation 27 also implies that, whenever K_i^c occurs, the set S_{ε_i,x_i,y_i} of ε_i-increasing paths with initial point y_i (and x-value x_i) consists only of the single point y_i. Therefore, we have that

L_{ε_i}(x_i, y_i) = f(x_i, y_i) whenever K_i^c occurs.    (28)

Moreover, whenever H_i^c occurs, we have that Y_{i+1} is the endpoint of an ε_i-increasing path with starting point (x_i + ∆_i, y_i). Now, L_{ε_i}(x_i + ∆_i, y_i) is the supremum of the value of f at the endpoints of all ε_i-increasing paths with starting point (x_i + ∆_i, y_i). Therefore, we must have that

L_{ε_i}(x_i + ∆_i, y_i) ≥ f(x_i + ∆_i, Y_{i+1}) whenever H_i^c occurs.    (29)

Therefore, for all i ≤ I,

P_{∆∼D_{x_i,y_i}}( L_{ε_i}(x_i + ∆, y_i) > L_{ε_i}(x_i, y_i) − ε/2 | x_i, y_i )
 ≥ P_{∆∼D_{x_i,y_i}}( f(x_i + ∆_i, Y_{i+1}) > f(x_i, y_i) − ε/2 | x_i, y_i ) − P(H_i) − P(K_i)    (Eqs. 28, 29)
 ≥ P_{∆∼D_{x_i,y_i}}( F(x_i + ∆, Y_{i+1}) > F(x_i, y_i) − ε/4 | x_i, y_i ) − 2ω − P(H_i) − P(K_i)    (Prop. D.3)
 ≥ P( Accept_i = False | x_i, y_i ) − 2ω − P(H_i) − P(K_i)
 ≥ P( Accept_i = False | x_i, y_i ) − 2ω − 2ωJ − 2ωJ,    (30)   (Eqs. 25, 26)

where the second inequality holds by Proposition D.3, since ε_1/10 ≤ ε/8. Define p_i := P_{∆∼D_{x_i,y_i}}( L_{ε_i}(x_i + ∆, y_i) > L_{ε_i}(x_i, y_i) − ε/2 | x_i, y_i ) for every i ∈ N. Then Inequality equation 30 implies that

P( Accept_i = False | x_i, y_i ) ≤ p_i + ω(4J + 2) ≤ p_i + ε/8   for all i ≤ I,    (31)

since ω ≤ ε/(32J + 16). We now consider what happens at indices i for which p_i ≤ 1 − ε/2.
Since (x_{i+s}, y_{i+s}) = (x_i, y_i) whenever Accept_{i+k} = False for all 0 ≤ k ≤ s, we have by Inequality equation 31 that

P( ∩_{s=0}^{r_max} {Accept_{i+s} = False} | p_i ≤ 1 − ε/2 ) ≤ (1 − ε/4)^{r_max} ≤ ε/(100I)   for all i ≤ I − r_max,

since r_max ≥ (4/ε) log(100I/ε). Therefore, with probability at least 1 − (ε/(100I)) × I = 1 − ε/100, the event ∩_{s=0}^{r_max} {Accept_{i+s} = False} does not occur for any i ≤ I − r_max for which p_i ≤ 1 − ε/2. Recall from Proposition D.6 that Algorithm 2 terminates within at most I iterations of its "While" loop with probability at least 1 − 2ω × (8 r_max b/ε + 1). Therefore,

P( p_{i*} > 1 − ε/2 ) ≥ 1 − ε/100 − 2ω × (8 r_max b/ε + 1).    (32)

In other words, by the definition of p_i, Inequality equation 32 implies that, with probability at least 1 − ε/100 − 2ω × (8 r_max b/ε + 1), the point (x*, y*) is such that

P_{∆∼D_{x*,y*}}( L_{ε_{i*}}(x* + ∆, y*) ≤ L_{ε_{i*}}(x*, y*) − ε/2 | x*, y* ) ≤ ε/2.

This completes the proof of Inequality equation 23. Finally, we note that when Algorithm 2 terminates within at most I iterations of its "While" loop, we have ε_{i*} = ε_0(1 + η)^{2i*} ≤ ε_0(1 + η)^{2I} ≤ ε. This completes the proof of Inequality equation 24.

We can now complete the proof of the main theorem.

Proof of Theorem 2.3. First, by Lemma D.7 our algorithm converges to some point (x*, y*) after at most (τ_1 log(r_max/ω) + 4 r_max b/ε + 1) × (J × b_y + b_0 + b_x) gradient and function evaluations, which is polynomial in 1/ε, d, b, L_1, L. In particular, for L, b ≥ 1, the number of gradient and function evaluations is Õ(d^2 L^2 b^6 ε^{−11}). By Lemma D.9, setting ε' = ε_{i*}, Inequalities equation 34 and equation 35 hold for some ε' ∈ [ε/2, ε] with probability at least 1 − 2ωJI − 2ω × (8 r_max b/ε + 1) ≥ 9/10.
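The batch-size concentration underlying Proposition D.3 can be illustrated with a quick Monte Carlo experiment (the error distribution below is a stand-in for the gradient noise, not the paper's data): averaging b_y i.i.d. bounded, mean-zero gradient errors shrinks the deviation ‖G_y − ∇_y f‖_2 like 1/√b_y.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L1 = 10, 1.0

def batch_error(b_y):
    # each summand is bounded (here by L1 per coordinate) and has mean zero,
    # as in Assumption D.1; the batch average mimics G_y - grad_y f
    errs = rng.uniform(-L1, L1, size=(b_y, d))
    return np.linalg.norm(errs.mean(axis=0))

small = np.mean([batch_error(10)   for _ in range(200)])
large = np.mean([batch_error(1000) for _ in range(200)])
assert large < small / 5   # roughly the 1/sqrt(100) = 1/10 shrinkage
```

This is why the batch sizes b_0 and b_y in the parameter list scale with ε_1^{−2} log(1/ω): quadrupling the required accuracy quadruples... rather, halving ε_1 quadruples the batch size.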

E SIMULATION SETUP

In this section we discuss the neural network architectures, choice of hyperparameters, and hardware used for both the real and synthetic data simulations.

Datasets. We evaluate the performance of our algorithm on both real-world and synthetic data. The real-world datasets used are MNIST and CIFAR-10. In one of our MNIST simulations we train our algorithm on the entire MNIST dataset, and in the remaining simulations we train on the subset of the MNIST dataset which includes only the digits labeled 0 or 1 (note, however, that the algorithms do not see the labels). For simplicity, we refer to this dataset as the 0-1 MNIST dataset. The synthetic dataset we use consists of 512 points sampled at the start of each simulation from a mixture of four equally weighted Gaussians in two dimensions with standard deviation 0.01 and means positioned at (0, 1), (1, 0), (−1, 0), and (0, −1). For all of our simulations on both real and synthetic datasets, following Goodfellow et al. (2014) and Metz et al. (2017), we use the cross-entropy loss function for training, where f(x, y) = log(D_y(ζ)) + log(1 − D_y(G_x(ξ))), where x are the weights of the generator's neural network, y are the weights of the discriminator's neural network, ζ is sampled from the data distribution, and ξ is a Gaussian with identity covariance matrix. The neural network architectures and hyperparameters for both the real and synthetic data simulations are specified in the following paragraphs.

Hyperparameters for MNIST simulations. For the MNIST simulations, we use a batch size of 128, with Adam learning rate of 0.0002 and hyperparameter β_1 = 0.5 for both the generator and discriminator gradients. Our code for the MNIST simulations is based on the code of Renu Khandelwal (Khandelwal, 2019) and Rowel Atienza (Atienza, 2017), which originally used gradient descent ascent and ADAM gradients for training.
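The cross-entropy loss above can be written down directly; the toy linear generator and discriminator below are placeholders standing in for the networks described in this section:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def gan_loss(x, y, zeta, xi):
    """The GAN objective f(x, y) = log D_y(zeta) + log(1 - D_y(G_x(xi))).
    x: generator weights, y: discriminator weights,
    zeta: a real sample, xi: standard-Gaussian noise.
    The linear G and logistic D here are illustrative stand-ins."""
    D = lambda s: sigmoid(y @ s)     # discriminator's probability of "real"
    G = lambda z: x @ z              # generator maps noise to a fake sample
    return np.log(D(zeta)) + np.log(1.0 - D(G(xi)))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4)); y = rng.normal(size=2)
val = gan_loss(x, y, zeta=np.array([0.0, 1.0]), xi=rng.normal(size=4))
assert np.isfinite(val) and val < 0.0   # a sum of log-probabilities is negative
```

The min-player (generator) seeks to decrease this value while the max-player (discriminator) seeks to increase it, matching the min-max formulation from the introduction.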
For the generator we use a neural network with input of size 256 and 3 hidden layers, with leaky ReLUs, each with "alpha" parameter 0.2, and dropout regularization of 0.2 at each layer. The first layer has size 256, the second layer has size 512, and the third layer has size 1024, followed by an output layer with hyperbolic tangent ("tanh") activation. For the discriminator we use a neural network with 3 hidden layers, and leaky ReLUs, each with "alpha" parameter 0.2, and dropout regularization of 0.3 (for the first two layers) and 0.2 (for the last layer). The first layer has size 1024, the second layer has size 512, the third layer has size 256, and the hidden layers are followed by a projection to 1 dimension with sigmoid activation (which is fed as input to the cross-entropy loss function).

Hyperparameters for Gaussian mixture simulations. For the simulations on Gaussian mixture data, we have used the code provided by the authors of Metz et al. (2017), which uses a batch size of 512, Adam learning rates of 10^{-3} for the generator and 10^{-4} for the discriminator, and Adam parameter β_1 = 0.5 for both the generator and discriminator. We use the same neural networks that were used in the code from Metz et al. (2017): the generator uses a fully connected neural network with 2 hidden layers of size 128 and ReLU activation, followed by a linear projection to two dimensions. The discriminator uses a fully connected neural network with 2 hidden layers of size 128 and ReLU activation, followed by a linear projection to 1 dimension (which is fed as input to the cross-entropy loss function). As in the paper Metz et al. (2017), we initialize all the neural network weights to be orthogonal with scaling 0.8.

Hyperparameters for CIFAR-10 simulations. For the CIFAR-10 simulations, we use a batch size of 128, with Adam learning rate of 0.02 and hyperparameter β_1 = 0.5 for both the generator and discriminator gradients.
Our code for the CIFAR-10 simulations is based on the code of Jason Brownlee (Brownlee, 2019), which originally used gradient descent ascent and ADAM gradients for training. For the generator we use a neural network with input of size 100 and 4 hidden layers. The first hidden layer consists of a dense layer with 4,096 parameters, followed by a leaky ReLU layer, whose activations are reshaped into 256 feature maps of size 4 × 4. The feature maps are then upscaled to an output shape of 32 × 32 via three hidden layers of size 128, each consisting of a convolutional Conv2DTranspose layer followed by a leaky ReLU layer, until the output layer, where three filter maps (channels) are created. Each leaky ReLU layer has "alpha" parameter 0.2. For the discriminator, we use a neural network with input of size 32 × 32 × 3 followed by 5 hidden layers. The first four hidden layers each consist of a convolutional layer followed by a leaky ReLU layer with "alpha" parameter 0.2. The first layer has size 64, the next two layers each have size 128, and the fourth layer has size 256. The output layer consists of a projection to 1 dimension with dropout regularization of 0.4 and sigmoid activation function. FID scores were computed every 2,500 iterations. To compute each FID score we used 10,000 randomly selected images from the CIFAR-10 dataset and 10,000 generated images.

Setting hyperparameters. In our simulations, our goal was to use the smallest number of discriminator or unrolled steps while still learning the distribution in a short amount of time, and we therefore decided to compare all algorithms using the same hyperparameter k. To choose this single value of k, we started by running each algorithm with k = 1 and increased the number of discriminator steps until one of the algorithms was able to learn the distribution consistently in the first 1500 iterations.
This resulted in a choice of k = 1 for the MNIST datasets and a choice of k = 6 for the Gaussian mixture model data. For the CIFAR-10 dataset we simply used k = 1 for both algorithms (since for CIFAR-10 it is difficult to visually determine whether all modes were learned). Our temperature hyperparameter was set by running our algorithm with temperatures in the set {1, 2, 3, 4, 5, 10} and choosing the temperature which gave the best performance.

Hardware. Our simulations on the MNIST, 0-1 MNIST, and Gaussian datasets were performed on four 3.0 GHz Intel Scalable CPU processors, provided by AWS. Our simulations on the CIFAR-10 dataset were performed on one GPU with high-frequency Intel Xeon E5-2686 v4 (Broadwell) processors, provided by AWS.
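The MNIST generator architecture described above (three hidden layers of sizes 256, 512, and 1024 with leaky ReLUs, and a tanh output) can be sketched as a plain numpy forward pass; the weights are random placeholders, and dropout, which is only active during training, is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
leaky_relu = lambda t, a=0.2: np.where(t > 0, t, a * t)
sizes = [256, 256, 512, 1024, 784]            # noise -> 3 hidden layers -> 28x28 image
weights = [rng.normal(scale=0.02, size=(m, n)) for m, n in zip(sizes, sizes[1:])]

def generator(z):
    h = z
    for W in weights[:-1]:
        h = leaky_relu(h @ W)                 # hidden layers: leaky ReLU, alpha = 0.2
    return np.tanh(h @ weights[-1])           # output layer: tanh activation

img = generator(rng.normal(size=256))
assert img.shape == (784,) and np.all(np.abs(img) <= 1.0)
```

The tanh output keeps generated pixels in [−1, 1], matching the usual rescaling of MNIST images for GAN training.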

F ADDITIONAL SIMULATION RESULTS

In this section we show additional results from simulations which we did not have space to include in the main body of the paper.

F.1 OUR ALGORITHM ON THE FULL MNIST DATASET

Here we show the results of the simulation of our algorithm on the full MNIST dataset (Fig. 4).

Figure 4: We trained our algorithm on the full MNIST dataset for 39,000 iterations, and then plotted images generated from the resulting generator. We repeated this simulation five times; the generated images from each of the five runs are shown here.

F.2 OUR ALGORITHM AND GDA TRAINED ON THE CIFAR-10 DATASET

In this section we show the results of all the runs of the simulations of our algorithm and GDA on the CIFAR-10 dataset, which were mentioned in Section 3.1 (Figures 5 and 6).

F.3 OUR ALGORITHM TRAINED ON THE 0-1 MNIST DATASET

Here we show the results of the simulations of our algorithm on 0-1 MNIST which were mentioned in Section 3.1 (Figures 7 and 8).

F.4 COMPARISON WITH GDA ON MNIST

In this section we show the results from the different runs of the simulations of our algorithm and GDA on the 0-1 MNIST dataset, which were mentioned in Section 3.2 (Figures 9 and 10).

F.5 RANDOMIZED ACCEPTANCE RULE WITH DECREASING TEMPERATURE

In this section we give the simulations mentioned in the paragraph towards the beginning of Section 3, which discusses simplifications to our algorithm. We included these simulations to check whether our algorithm also works well when it is implemented using a randomized acceptance rule with a decreasing temperature schedule.
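As a sketch of how such a decreasing-temperature rule behaves, the snippet below implements our reading of the schedule reported with the corresponding figures, e^{-1/τ_i} = 1/(4 + e^{(i/20000)²}): a proposal that decreases the computed loss is always accepted, while a loss-increasing proposal is accepted with probability e^{-1/τ_i}, which shrinks as training proceeds. The function name is ours.

```python
import numpy as np

def accept_prob_bad_step(i):
    """Probability of accepting a loss-increasing proposal at iteration i,
    under the decreasing-temperature schedule
    e^{-1/tau_i} = 1 / (4 + e^{(i/20000)^2}).
    (Our reading of the schedule; loss-decreasing proposals are always
    accepted with probability 1.)"""
    return 1.0 / (4.0 + np.exp((i / 20000.0) ** 2))

print(accept_prob_bad_step(0))   # 0.2, i.e. 1/5 at the start of training
print(accept_prob_bad_step(39000) < accept_prob_bad_step(0))  # True: it decreases
```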

F.6 COMPARISON OF ALGORITHMS ON MIXTURE OF 4 GAUSSIANS

In this section we show the results of all the runs of the simulation mentioned in Figure 2, where all the algorithms were trained on the 4-Gaussian mixture dataset for 1,500 iterations. For each run, we plot points from the generated distribution at iteration 1,500. Figure 13 gives the results for GDA with k = 1 discriminator step; Figure 14 gives the results for GDA with k = 6 discriminator steps; Figure 15 gives the results for the Unrolled GANs algorithm; and Figure 16 gives the results for our algorithm.
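For reference, a mixture dataset of this kind can be generated in a few lines. The specific means (corners of a square) and standard deviation below are illustrative choices of ours, since the exact mixture parameters are not restated in this appendix.

```python
import numpy as np

def sample_4gauss_mixture(n, scale=1.0, std=0.05, rng=None):
    """Sample n points from a mixture of 4 Gaussians in the plane.
    Means at the corners of a square and std=0.05 are illustrative
    choices, not the paper's exact parameters."""
    rng = np.random.default_rng(0) if rng is None else rng
    means = scale * np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    modes = rng.integers(0, 4, size=n)          # pick a mode uniformly at random
    pts = means[modes] + std * rng.standard_normal((n, 2))
    return pts, modes

pts, modes = sample_4gauss_mixture(1000)
print(pts.shape)  # (1000, 2)
```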

GDA Our Algorithm

Figure 5: GAN trained using our algorithm (with k = 1 discriminator steps and acceptance rate e^{-1/τ} = 1/2) and GDA. We repeated this simulation nine times; we display here images generated from the resulting generator for each of the nine runs of GDA (top) and our algorithm (bottom). The final FID scores at 50,000 iterations for each of the nine runs (corresponding to the images above from left to right and then top to bottom) were {35.6, 36.3, 33.8, 35.2, 34.5, 36.7, 34.9, 36.9, 36.6} for our algorithm and {33.0, 197.1, 34.3, 34.3, 33.8, 37.0, 45.3, 34.7, 34.7} for GDA.
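The FID scores reported in these captions are Fréchet distances between Gaussians fitted to feature statistics of real and generated images. A minimal sketch of the distance computation itself follows; in practice the features are Inception-v3 activations of the 10,000 real and 10,000 generated images, whereas here arbitrary feature vectors stand in for them.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature samples:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    Only the formula; the usual Inception feature extraction is omitted."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):        # drop spurious imaginary parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 8))
print(fid(a, a) < 1e-4)   # True: identical samples have distance ~0
```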

Our algorithm

Figure 7: We trained our algorithm on the 0-1 MNIST dataset for 30,000 iterations (with k = 1 discriminator steps and acceptance rate e^{-1/τ} = 1/5). We repeated this experiment five times. For one of the runs, we plotted 25 generated images produced by the generator at various iterations (from iteration 100 to iteration 30,000), together with a moving average of the computed loss function values, averaged over a window of size 50.

Figure 8: We show the images generated at the 30,000'th iteration for all 5 runs of the simulation in Figure 7, together with a plot of their computed loss function values. Recall from the caption of Figure 7 that we trained our algorithm on the 0-1 MNIST dataset for 30,000 iterations (with k = 1 discriminator steps and acceptance rate e^{-1/τ} = 1/5), repeating the experiment five times.

Figure 9: Images generated at the 1000'th iteration of the 13 runs of the GDA simulation mentioned in Figure 1. In 77% of the runs the generator seems to be generating only 1's at the 1000'th iteration.

For the simulation with a randomized acceptance rule (Figure 11), the algorithm was run for 39,000 iterations with a temperature schedule of e^{-1/τ_i} = 1/(4 + e^{(i/20000)²}). Proposed steps which decreased the computed value of the loss function were accepted with probability 1, and proposed steps which increased the computed value of the loss function were rejected with probability max(0, 1 - e^{-1/τ_i}) at each iteration i. We ran the simulation 5 times and obtained similar results each time, with the generator learning both modes. In Figure 11 we plotted the generated images from one of the runs at various iterations, with the iteration number specified at the bottom of each image (see also Figure 12 for results from the other four runs).

Figure 12: Images generated at the 39,000'th iteration of each of the 5 runs of our algorithm for the simulation mentioned in Figure 11, with a randomized acceptance rule and a temperature schedule of e^{-1/τ_i} = 1/(4 + e^{(i/20000)²}).

Theorem G.5. Let f : X × Y → R be convex-concave, where X, Y ⊆ R^d are compact convex sets, and let D_{x,y} be a continuous distribution with Pr_{∆∼D_{x,y}}(x + ∆ ∈ X) = 1 such that, for every (x, y) ∈ X × Y, there is some open ball B ⊂ R^d containing x such that D_{x,y} has non-zero probability density at every point in B ∩ X. Then (x⋆, y⋆) is an ε-local min-max equilibrium for ε = 0 if and only if it is a global min-max point.

Proof. Define the "global max" function ψ(x) := max_{y∈Y} f(x, y) for all x ∈ X. We start by showing that the function ψ is convex on the convex set X. Indeed, for any x₁, x₂ ∈ X and any λ ∈ [0, 1], a short chain of inequalities, in which the second inequality holds by convexity of f(·, y), shows that ψ(λx₁ + (1-λ)x₂) ≤ λψ(x₁) + (1-λ)ψ(x₂). Moreover, we note that, since for all x ∈ X the function f(x, ·) is continuously differentiable on a compact convex set, every allowable path (with parameter ε = 0) can be extended to an allowable path whose endpoint ŷ has projected gradient ∇_y^Y f(x, ŷ) = 0. Therefore, for every (x, y) ∈ X × Y, there exists an allowable path with initial point y whose endpoint ŷ satisfies this stationarity condition.

1. First we prove the "only if" direction. Suppose that (x⋆, y⋆) is an ε-local min-max equilibrium of f for ε = 0. Let y† be a global maximizer of the function f(x⋆, ·) (the function achieves its global maximum since it is continuous and Y is compact). Then the projected gradient at this point is ∇_y^Y f(x⋆, y†) = 0. (40)

H COMPARISON TO OTHER MIN-MAX ALGORITHMS WITH MAXIMIZATION SUBROUTINES

Many algorithms for min-max games can be viewed as using an inner maximization subroutine (e.g., unrolled GANs (Metz et al., 2017), versions of the GDA algorithm where the max-player's update is computed using multiple gradient ascent steps (Goodfellow et al., 2014), as well as Maheswaranathan et al. (2019) and Bolte et al. (2020)). However, in contrast to these algorithms, our algorithm has polynomial-time guarantees on the number of gradient and function evaluations when f is bounded with Lipschitz Hessian. In particular, while Bolte et al. (2020) provide theoretical guarantees, there are two key differences between their work and ours. (i) Their min-max algorithm (Algorithm 3 in their paper) requires access to an oracle which returns the global maximum argmax_y f(x, y) for any input x. However, since f can be nonconcave in y, computing the global maximum may be intractable, and one therefore may not have access to such an oracle in practice. In contrast, our algorithm only requires access to a (stochastic) oracle for the gradient and function value of f. (ii) Bolte et al. (2020) only prove that their algorithm converges asymptotically, and do not provide any bounds on the time to convergence. In contrast, our algorithm has polynomial-time guarantees on the number of gradient and function evaluations.
This is a relatively strong assumption, since it says that at every point (x, y) ∈ R^d × R^d, the vector field (-∇_x f(x, y), ∇_y f(x, y)) must not point in a direction "away" from the global min-max point (x⋆, y⋆). Another setting where convergence guarantees have been shown for min-max algorithms in the nonconvex-nonconcave setting is when f satisfies a "sufficient bilinearity" condition (Abernethy et al., 2019). Roughly, this condition says that there is a number γ > 2 such that, at every x, y ∈ R^d, all the singular values of the cross derivative ∇²_{xy} f(x, y) are greater than γ. If, in addition, f is 1-Lipschitz, then Abernethy et al. (2019) show that their algorithm reaches a first-order ε-stationary point (that is, a point where the gradient for the min- and max-players has magnitude at most ε) in roughly O((1/γ²) log(1/ε)) evaluations of a Hessian-vector product of f. In contrast to these works, our main result only assumes that the loss function is bounded with Lipschitz Hessian.
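As a concrete illustration of the variational inequality above, for the bilinear function f(x, y) = ⟨x, y⟩ with global min-max point (x⋆, y⋆) = (0, 0), the left-hand side is identically zero, so the coherence inequality holds with equality; a quick numerical check:

```python
import numpy as np

# Check that f(x, y) = <x, y>, with min-max point (x*, y*) = (0, 0),
# satisfies <grad_x f, x - x*> - <grad_y f, y - y*> >= 0 everywhere:
# here grad_x f = y and grad_y f = x, so the left-hand side is
# <y, x> - <x, y> = 0, i.e. the inequality holds with equality.
rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    gx, gy = y, x                       # gradients of the bilinear loss
    lhs = gx @ (x - 0.0) - gy @ (y - 0.0)
    assert abs(lhs) < 1e-12             # identically zero, hence >= 0
print("coherence inequality holds with equality for f(x, y) = <x, y>")
```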



Footnotes.
1. In practice, gradient steps are often replaced by ADAM steps; we ignore this distinction for this discussion.
2. We use the ∞-norm in place of the Euclidean 2-norm, as it is a more natural norm for ADAM gradients.
3. In this equation the derivative d/dt is taken from the right.
4. Consider the example f(x, y) = min(x²y², 1). The simulated loss function for ε > 0 is L_ε(x, y) = f(x, y) if 2x²y < ε, and 1 otherwise. Thus L_{1/2} is discontinuous at (1/2, 1).
5. In particular, we set the ADAM hyperparameters β₁, β₂ to be β₁ = β₂ = 0.
6. Recall from Sec. 2 that if equation (34) is satisfied then L_ε(x⋆, y⋆) = f(x⋆, y⋆).
7. Recall that we use the convention 0/0 = 0 when computing the ADAM update in our algorithm. This implies that |g_{y,j}/√((g_{y,j})²)| ≤ (1, . . . , 1).
8. Note that the authors also mention using slightly different ADAM parameters and neural network architecture in their paper than in their code; we have used the ADAM parameters and neural network architecture provided in their code.
9. In other words, there is some τ ≥ 0 such that γ : [0, τ] → Y.
10. In this equation the derivative d/dt is taken from the right.



Figure 2: Our algorithm (bottom right), unrolled GANs with k = 6 unrolling steps (top right), and GDA with k = 1 (top left) and k = 6 discriminator steps (bottom left). Each algorithm was trained on a 4-Gaussian mixture for 1,500 iterations. Our algorithm used k = 6 discriminator steps and acceptance rate e^{-1/τ} = 1/4. The plots show the points generated by each of these algorithms after the specified number of iterations.

Figure 3: GAN trained using our algorithm (with k = 1 discriminator steps and acceptance rate e^{-1/τ} = 1/2)

x_{i+1} ← X_{i+1}, y_{i+1} ← Y_{i+1} {accept the proposed x and y updates}
Set m_{i+1} ← M_{i+1}, v_{i+1} ← V_{i+1}, and (m, v, s) ← (M, V, S) {accept the proposal's ADAM moment estimates}
else x_{i+1} ← x_i, y_{i+1} ← y_i {go back to the previous x and y values}
Set m_{i+1} ← m_i, v_{i+1} ← v_i {go back to the previous ADAM moment estimates}
r ← r + 1 {keep track of how many steps were rejected since the last acceptance; if too many steps were rejected in a row, stop the algorithm and output the current weights}

least 1 - ω, since α ≤ ε/(10Ld). Since f takes values in [-b, b], Inequality (16) implies that Algorithm 3 terminates in at most J = 20b/(3αε) iterations, with probability at least 1 - ω × J.

Proposition D.6. Algorithm 2 terminates in at most I := τ₁ log(r_max/ω) + 8 r_max b/ε + 1 iterations of its "While" loop, with probability at least 1 - 2ω × (

Algorithm 2 terminates after at most (τ₁ log(r_max/ω) + 4 r_max b/ε + 1) × (J × b_y + b_0 + b_x) gradient and function evaluations.

Proof. Each iteration of the While loop in Algorithm 2 computes one batch gradient with batch size b_x and one stochastic function evaluation with batch size b_0, and calls Algorithm 3 exactly once. Each iteration of the While loop in Algorithm 3 computes one batch gradient with batch size b_y. The result then follows directly from Propositions D.6 and D.5.

Equations (26) and (27), together with Proposition D.6, imply that

Figure 4: We ran our algorithm (with k = 1 discriminator steps and acceptance rate e^{-1/τ} = 1/5) on the full MNIST dataset for 39,000 iterations, and then plotted images generated from the resulting generator. We repeated this simulation five times; the generated images from each of the five runs are shown here.

Figure 6: Plots of FID scores for GANs trained using our algorithm (with k = 1 discriminator steps and acceptance rate e^{-1/τ} = 1/2) and GDA on CIFAR-10 for 50,000 iterations. We repeated this simulation nine times, and plotted the FID scores for our algorithm (dashed blue) and GDA (solid red).

Figure 10: Images generated at the 1000'th iteration of each of the 22 runs of our algorithm for the simulation mentioned in Figure 1.

Figure 11: In this simulation we used a randomized accept/reject rule, with a decreasing temperature schedule.

Figure 13: The generated points at the 1500'th iteration for all 9 runs of the GDA algorithm with k = 1 discriminator step, for the simulation mentioned in Figure 2. At the 1500'th iteration, GDA had learned exactly one mode in each of the 9 runs.

GDA with 6 discriminator steps

Figure 14: The generated points at the 1500'th iteration for all 20 runs of the GDA algorithm, with k = 6 discriminator steps, for the simulation mentioned in Figure 2. At the 1500'th iteration, GDA had learned two modes in 65% of the runs, one mode in 20% of the runs, and four modes in 15% of the runs.

Unrolled GANs with 6 unrolling steps

Figure 15: The generated points at the 1500'th iteration for all 20 runs of the Unrolled GAN algorithm for the example in Figure 2, with k = 6 unrolling steps. By the 1500'th iteration, Unrolled GANs had learned one mode in 75% of the runs, two modes in 15% of the runs, and three modes in 10% of the runs.

Our algorithm

ψ(λx₁ + (1-λ)x₂) = max_{y∈Y} f(λx₁ + (1-λ)x₂, y) ≤ max_{y∈Y} [λ f(x₁, y) + (1-λ) f(x₂, y)] ≤ λ [max_{y∈Y} f(x₁, y)] + (1-λ) [max_{y∈Y} f(x₂, y)] = λψ(x₁) + (1-λ)ψ(x₂),

Equations (37) and (38) imply that L₀(x, y) = ψ(x) for all (x, y) ∈ X × Y (39), since ψ(x) = max_{y∈Y} f(x, y).

Since f(x, ·) is concave for all x, and ∇_y^Y f(x⋆, y⋆) = 0, equation (40) implies that, at every point y along the line segment [y†, y⋆] connecting the points y† and y⋆,

∇_y^Y f(x⋆, y) = 0 for all y ∈ [y†, y⋆]. (41)

Therefore, equation (41) implies that f(x⋆, y†) = f(x⋆, y⋆), and hence that

f(x⋆, y⋆) = max_{y∈Y} f(x⋆, y), (42)

since max_{y∈Y} f(x⋆, y) = f(x⋆, y†). Now, since (x⋆, y⋆) is an ε-local min-max equilibrium for ε = 0,

Pr_{∆∼D_{x⋆,y⋆}}[L₀(x⋆ + ∆, y⋆) < L₀(x⋆, y⋆)] = 0. (44)

Since ψ is convex, and since there is an open ball B for which D_{x⋆,y⋆} has non-zero probability density at every point in B ∩ X, equation (44) implies that x⋆ is a global minimizer of ψ. (45)

Equations (42) and (45) imply that (x⋆, y⋆) is a global min-max point for f : X × Y → R whenever (x⋆, y⋆) is an ε-local min-max equilibrium of f for ε = 0.

2. Next, we prove the "if" direction. Conversely, suppose that (x⋆, y⋆) is a global min-max point for f : X × Y → R. Then f(x⋆, y⋆) = max_{y∈Y} f(x⋆, y). Since f is differentiable on X × Y, this implies that

∇_y^Y f(x⋆, y⋆) = 0. (46)

Moreover, since (x⋆, y⋆) is a global min-max point, we also have that f(x⋆, y⋆) = min_{x∈X} ψ(x). (47)

Since we have already shown that ψ is convex, equation (47) implies that

Pr_{∆∼D_{x⋆,y⋆}}[ψ(x⋆ + ∆) < ψ(x⋆)] = 0. (48)

Since we have also shown in equation (39) that ψ(x) = L₀(x, y) for all (x, y) ∈ X × Y, equation (48) implies that

Pr_{∆∼D_{x⋆,y⋆}}[L₀(x⋆ + ∆, y⋆) < L₀(x⋆, y⋆)] = 0. (49)

Therefore, equations (46) and (49) imply that, for ε = 0, (x⋆, y⋆) is an ε-local min-max equilibrium of f : X × Y → R whenever (x⋆, y⋆) is a global min-max point for f.

COMPARISON TO MIN-MAX ALGORITHMS WITH CONVERGENCE GUARANTEES IN NONCONVEX-NONCONCAVE SETTINGS

Multiple works provide min-max optimization algorithms with convergence guarantees in various settings where f may be nonconvex-nonconcave. However, the convergence results in these works still require strong assumptions on the loss function. For instance, Mertikopoulos et al. (2019); Lin et al. (2018); Gidel et al. (2019a) provide convergence guarantees under the assumption that f satisfies a variational inequality, such as the "coherence" condition of Mertikopoulos et al. (2019). Specifically, one of the assumptions of this coherence condition is that there exists a global min-max solution point (x⋆, y⋆) for min_x max_y f(x, y) which satisfies the variational inequality ⟨∇_x f(x, y), x - x⋆⟩ - ⟨∇_y f(x, y), y - y⋆⟩ ≥ 0 for all (x, y) ∈ R^d × R^d.

Set G_x ← G_x(x, y) {Compute a stochastic gradient}
Use the stochastic gradient G_x to compute a one-step ADAM update ∆ for x
Set x′ ← x + ∆ {Compute the proposed update for the min-player}
Starting at the point y, use stochastic gradients G_y(x′, ·) to run multiple ADAM steps in the y-variable, until a point y′ is reached such that ‖G_y(x′, y′)‖₁ ≤ ε {Simulate the max-player's update}
Set f_new ← F(x′, y′) {Compute the new loss value}
Set Accept ← True
If f_new > f_old - ε/2, set Accept ← False with probability max(0, 1 - e^{-i/τ₁}) {Accept or reject}
if Accept = True then set x ← x′, y ← y′, r ← 0 {Accept the updates}
else set r ← r + 1 {Reject the updates, and track how many successive steps were rejected}
return (x, y)
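The outer loop above can be sketched in a few lines of Python. This is a simplified, hypothetical rendition, not the paper's implementation: it uses plain gradient steps rather than ADAM, scalar variables, a toy loss, and arbitrary hyperparameter values.

```python
import numpy as np

def min_max_step(f, grad_x, grad_y, x, y, f_old, i,
                 eta=0.05, eps=1e-3, tau1=50.0, rng=None):
    """One outer iteration of the accept/reject scheme sketched above:
    propose a min-player step, simulate the max-player's response by
    gradient ascent until near-stationarity, then accept or reject based
    on the new loss value.  Plain gradient steps stand in for ADAM."""
    rng = np.random.default_rng(0) if rng is None else rng
    x_prop = x - eta * grad_x(x, y)          # proposed min-player update
    y_prop = y
    for _ in range(1000):                    # simulate the max-player's response
        g = grad_y(x_prop, y_prop)
        if abs(g) <= eps:                    # stop once the gradient is small
            break
        y_prop = y_prop + eta * g
    f_new = f(x_prop, y_prop)
    accept = True
    if f_new > f_old - eps / 2:              # loss did not decrease enough:
        if rng.random() < max(0.0, 1.0 - np.exp(-i / tau1)):
            accept = False                   # reject, more aggressively as i grows
    if accept:
        return x_prop, y_prop, f_new
    return x, y, f_old                       # revert to the previous point

# Toy min-max problem: f(x, y) = x^2 - (y - x)^2; once the max-player
# plays its best response y = x, the min-player faces x^2, so the
# min-max point is at x = 0.
f = lambda x, y: x**2 - (y - x)**2
gx = lambda x, y: 2*x + 2*(y - x)
gy = lambda x, y: -2*(y - x)
x, y, f_old = 1.0, 0.0, 0.0
rng = np.random.default_rng(0)
for i in range(200):
    x, y, f_old = min_max_step(f, gx, gy, x, y, f_old, i, rng=rng)
print(abs(x) < 0.1)  # True: the min-player converges near x = 0
```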

{Compute the proposed ADAM update of y} Set y_stationary ← y_j and s_out ← True

G EXTENSION TO FUNCTIONS WITH COMPACT CONVEX SUPPORT

In this section we introduce a version of our algorithm (Section G.1) for loss functions with compact convex support, and run it on a simple bilinear loss function (Section G.2). We also introduce a version of our ε-local min-max equilibrium (Section G.3) which applies to loss functions with compact convex support. We then show that, if f is also convex-concave, then, for ε = 0, this ε-local min-max equilibrium is equivalent to a global min-max point (Section G.4).

G.1 PROJECTED GRADIENT MIN-MAX ALGORITHM FOR COMPACTLY SUPPORTED LOSS

Algorithm 4 Algorithm for min-max optimization on compact support
input: A stochastic zeroth-order oracle F for the loss function f : X × Y → R, where X, Y ⊆ R^d are compact convex sets; stochastic gradient oracles G_x for ∇_x f and G_y for ∇_y f; projection oracles P_X : R^d → X for X and P_Y : R^d → Y for Y; an initial point (x, y) ∈ X × Y; and an error parameter ε.
output: A point (x⋆, y⋆)
hyperparameters:
else set r ← r + 1 {Reject the updates, and track how many successive steps were rejected.}
return (x, y)
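For box-shaped constraint sets such as the [-1, 1] × [-1, 1] domain used in the next subsection, the projection oracles P_X and P_Y are just coordinatewise clipping. A minimal sketch (the helper name is ours):

```python
import numpy as np

def make_box_projection(lo, hi):
    """Projection oracle for a box [lo, hi]^d: for a box, the Euclidean
    projection is coordinatewise clipping."""
    return lambda z: np.clip(z, lo, hi)

P = make_box_projection(-1.0, 1.0)
print(P(np.array([1.7, -0.3])).tolist())  # [1.0, -0.3]

# A projected gradient step x <- P_X(x - eta * g) always stays inside X:
x = P(np.array([0.95, -0.95]) - 0.2 * np.array([-1.0, 1.0]))
print(x.tolist())  # [1.0, -1.0]
```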

G.2 NUMERICAL SIMULATION ON COMPACTLY SUPPORTED BILINEAR FUNCTION

In this appendix, we discuss numerical simulations on the bilinear loss function f(x, y) = xy with compact support (x, y) ∈ [-1, 1] × [-1, 1]. For this function, the gradient descent-ascent algorithm is known to diverge away from the global min-max point (see, for instance, Jin et al. (2020)). This function has a global min-max point at every point in the set {(x, y) : x = 0, y ∈ [-1, 1]}. We ran Algorithm 4 on this function with hyperparameters η = 0.2, ε = 0.06, and r_max = 5, value oracle F(x, y) = f(x, y), gradient oracle G_y(x, y) = ∇_y f(x, y), stochastic gradient oracle G_x(x, y) = ∇_x f(x, y) + ξ where ξ ∼ N(0, 1), and initial point (x, y) = (0.4, 0.4). After 341 iterations of the outer loop, our algorithm reached the point (0.0279, -0.9944), which is very close to one of the true global min-max points, (0, -1).
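The divergence of GDA on this loss is easy to reproduce: under simultaneous descent/ascent steps, each iteration multiplies the distance to the min-max point (0, 0) by √(1 + η²). A minimal check (using simultaneous rather than alternating updates, and ignoring the projection, in order to exhibit the divergence):

```python
import numpy as np

# Unprojected, simultaneous GDA on f(x, y) = x * y spirals outward:
# (x, y) -> (x - eta*y, y + eta*x) scales the norm by sqrt(1 + eta^2).
eta = 0.2
x, y = 0.4, 0.4
r0 = np.hypot(x, y)                     # initial distance to (0, 0)
for _ in range(100):
    x, y = x - eta * y, y + eta * x     # simultaneous descent/ascent step
print(np.hypot(x, y) > r0)              # True: GDA diverges on this function
```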

G.3 LOCAL MIN-MAX EQUILIBRIUM FOR COMPACTLY SUPPORTED LOSS FUNCTIONS

Global min-max point. First, we recall the definition of a global min-max point:

Definition G.1. We say that (x⋆, y⋆) ∈ X × Y is a global min-max point for a function f : X × Y → R if f(x⋆, y⋆) = max_{y∈Y} f(x⋆, y) and max_{y∈Y} f(x⋆, y) = min_{x∈X} max_{y∈Y} f(x, y).

Local min-max equilibrium for projected subgradients. In this section we introduce a version of the local min-max equilibrium which applies to compactly supported loss functions (Definition G.3). The main difference from our previous definition (Definition 2.2) is the need for a projected gradient to deal with the compact support of the objective function. In the following we assume that f : X × Y → R, where X, Y ⊂ R^d are two compact convex sets, and that f is continuously differentiable on X × Y. We denote by ∇_y^Y the projected gradient in the y variable for the set Y.

We first formally define the "simulated loss" and what it means for f to increase rapidly.

Definition G.2. For any x, y, and ε > 0, define E(ε, x, y) ⊆ Y to be the set of points w such that there is a continuous and (except at finitely many points) differentiable path γ(t) contained in Y⁹, starting at y, ending at w, and moving with "speed" at most 1 in the ∞-norm, ‖(d/dt)γ(t)‖_∞ ≤ 1, such that at any point on γ,¹⁰

(d/dt) f(x, γ(t)) ≥ ε. (33)

We define L_ε(x, y) := sup_{w∈E(ε,x,y)} f(x, w), and refer to it as the simulated loss.

Definition G.3. Given a distribution D_{x,y} with Pr_{∆∼D_{x,y}}(x + ∆ ∈ X) = 1 for each x, y ∈ R^d, and ε ≥ 0, we say that (x⋆, y⋆) is an ε-local min-max equilibrium with respect to the distribution

Remark G.4. As a simple application of Algorithm 4, consider the bilinear loss function f(x, y) = xy where x and y are constrained to the set [-1/2, 1/2]. It is easy to see that the set of global min-max points consists of the points (x, y) where x = 0 and y is any point in [-1/2, 1/2]. The local min-max equilibria according to our definition are the set of points (x, y) where x is any point in [-ε, ε] and y is any point in [-1/2, 1/2].
This is because, when running Algorithm 4 on this example, if x is outside the set [-ε, ε], the max-player will follow an increasing trajectory and always return a point y = -1/2 or y = 1/2, which means that, roughly speaking, the min-player is attempting to minimize the function |x|/2. This means that the algorithm will accept all updates x + ∆ for which (1/2)|x + ∆| < (1/2)|x| - ε/2, implying that the algorithm converges towards a point with |x| ≤ ε. Thus, as ε goes to zero, the set of local min-max equilibrium points coincides with the set of global min-max optima for the function f(x, y) = xy. This latter fact holds more generally for any convex-concave function with compact convex domain (see Theorem G.5).

G.4 COMPARISON OF LOCAL MIN-MAX AND GLOBAL MIN-MAX IN THE COMPACTLY SUPPORTED CONVEX-CONCAVE SETTING.

The following theorem shows that, in the compactly supported convex-concave setting, a point (x⋆, y⋆) is an ε-local min-max equilibrium for ε = 0 (in the sense of Definition G.3) if and only if it is a global min-max point.

