Adaptive Extra-Gradient Methods for Min-Max Optimization and Games

Abstract

We present a new family of min-max optimization algorithms that automatically exploit the geometry of the gradient data observed at earlier iterations to perform more informative extra-gradient steps in later ones. Thanks to this adaptation mechanism, the proposed method automatically detects whether the problem is smooth or not, without requiring any prior tuning by the optimizer. As a result, the algorithm simultaneously achieves order-optimal convergence rates, i.e., it converges to an ε-optimal solution within O(1/ε) iterations in smooth problems, and within O(1/ε²) iterations in non-smooth ones. Importantly, these guarantees do not require any of the standard boundedness or Lipschitz continuity conditions that are typically assumed in the literature; in particular, they apply even to problems with singularities (such as resource allocation problems and the like). This adaptation is achieved through the use of a geometric apparatus based on Finsler metrics and a suitably chosen mirror-prox template that allows us to derive sharp convergence rates for the methods at hand.

1. Introduction

The surge of recent breakthroughs in generative adversarial networks (GANs) [20], robust reinforcement learning [41], and other adversarial learning models [27] has sparked renewed interest in the theory of min-max optimization problems and games. In this broad setting, it has become empirically clear that, ceteris paribus, the simultaneous training of two (or more) antagonistic models faces drastically new challenges relative to the training of a single one. Perhaps the most prominent of these challenges is the appearance of cycles and recurrent (or even chaotic) behavior in min-max games. This has been studied extensively in the context of learning in bilinear games, in both continuous [16, 31, 40] and discrete time [12, 18, 19, 32], and the methods proposed to overcome recurrence typically focus on mitigating the rotational component of min-max games. The method with the richest history in this context is the extra-gradient (EG) algorithm of Korpelevich [25] and its variants. The EG algorithm exploits the Lipschitz smoothness of the problem and, if coupled with a Polyak-Ruppert averaging scheme, it achieves an O(1/T) rate of convergence in smooth, convex-concave min-max problems [35]. This rate is known to be tight [34, 39] but, in order to achieve it, the original method requires the problem's Lipschitz constant to be known in advance. If the problem is not Lipschitz smooth (or the algorithm is run with a vanishing step-size schedule), the method's rate of convergence drops to O(1/√T).

Our contributions. Our aim in this paper is to provide an algorithm that automatically adapts to smooth / non-smooth min-max problems and games, and achieves order-optimal rates in both classes without requiring any prior tuning by the optimizer. In this regard, we propose a flexible algorithmic scheme, which we call AdaProx, and which exploits gradient data observed at earlier iterations to perform more informative extra-gradient steps in later ones.
Thanks to this mechanism, and to the best of our knowledge, AdaProx is the first algorithm that simultaneously achieves the following:

1. An O(1/√T) convergence rate in non-smooth problems and O(1/T) in smooth ones.
2. Applicability to min-max problems and games where the standard boundedness / Lipschitz continuity conditions required in the literature do not hold.
3. Convergence without prior knowledge of the problem's parameters (e.g., whether the problem's defining vector field is smooth or not, its smoothness modulus if it is, etc.).

[Table 1 compares EG [24, 25, 35], GRAAL [29], GMP [47], AMP [1, 17], BL [2], and AdaProx [ours] along four axes: Parameter-Agnostic, Rate Interpolation, Unbounded Domain, and Singularities.]

Table 1: Overview of related work. For the purposes of this table, "parameter-agnostic" means that the method does not require prior knowledge of the parameters of the problem it was designed to solve (Lipschitz modulus, domain diameter, etc.); "rate interpolation" means that the algorithm's convergence rate is O(1/T) or O(1/√T) in smooth / non-smooth problems respectively; "unbounded domain" is self-explanatory; and, finally, "singularities" means that the problem's defining vector field may blow up at a boundary point of the problem's domain.

Our proposed method achieves the above by fusing the following ingredients: a) a family of local norms (a Finsler metric) capturing any singularities in the problem at hand; b) a suitable mirror-prox template; and c) an adaptive step-size policy in the spirit of Rakhlin & Sridharan [43]. We also show that, under a suitable coherence assumption, the sequence of iterates generated by the algorithm converges, thus providing an appealing alternative to iterate averaging in cases where the method's "last iterate" is more appropriate (for instance, if using AdaProx to solve non-monotone problems).

Related works. There have been several works improving on the guarantees of the original extra-gradient / mirror-prox template. We review the most relevant of these works below; for convenience, we also tabulate these contributions in Table 1 above. Because many of these works appear in the literature on variational inequalities [15], we also use this language in the sequel. In unconstrained problems with an operator that is locally Lipschitz continuous (but not necessarily globally so), the golden ratio algorithm (GRAAL) [29] achieves convergence without requiring prior knowledge of the problem's Lipschitz parameter. However, GRAAL provides no rate guarantees for non-smooth problems; hence, a fortiori, no interpolation guarantees either. By contrast, such guarantees are provided in problems with a bounded domain by the generalized mirror-prox (GMP) algorithm of [47] under the umbrella of Hölder continuity.
Still, nothing is known about the convergence of GRAAL / GMP in problems with singularities (i.e., when the problem's defining vector field blows up at a boundary point of the problem's domain). Singularities of this type were treated in a recent series of papers [1, 17, 48] by means of a "Bregman continuity" or "Lipschitz-like" condition. These methods are order-optimal in the smooth case, without requiring any knowledge of the problem's smoothness modulus. On the other hand, like GRAAL (but unlike GMP), they do not provide any rate interpolation guarantees between smooth and non-smooth problems. Another method that simultaneously achieves an O(1/√T) rate in non-smooth problems and an O(1/T) rate in smooth ones is the recent algorithm of Bach & Levy [2]. The BL algorithm employs an adaptive, AdaGrad-like step-size policy which allows the method to interpolate between the two regimes, even with noisy gradient feedback. On the negative side, the BL algorithm requires a bounded domain with a (Bregman) diameter that is known in advance; as a result, its theoretical guarantees do not apply to unbounded problems. In addition, the BL algorithm makes crucial use of boundedness and Lipschitz continuity; extending the BL method beyond this standard framework is a highly non-trivial endeavor which formed a big part of this paper's motivation.

2. Problem Setup and Blanket Assumptions

We begin in this section by reviewing some basics for min-max problems and games.

2.1. Min-max / Saddle-point problems.

A min-max game is a saddle-point problem of the form

min_{θ∈Θ} max_{φ∈Φ} L(θ, φ), (SP)

where Θ, Φ are convex subsets of some ambient real space and L : Θ × Φ → ℝ is the problem's loss function. In the game-theoretic interpretation of (SP), the player controlling θ seeks to minimize L(θ, φ) for any value of the maximization variable φ, while the player controlling φ seeks to maximize L(θ, φ) for any value of the minimization variable θ. Accordingly, solving (SP) consists of finding a Nash equilibrium (NE), i.e., an action profile (θ*, φ*) ∈ Θ × Φ such that

L(θ*, φ) ≤ L(θ*, φ*) ≤ L(θ, φ*) for all θ ∈ Θ, φ ∈ Φ.

By the minimax theorem of von Neumann [49], Nash equilibria are guaranteed to exist when Θ, Φ are compact and L is convex-concave (i.e., convex in θ and concave in φ). Much of our paper is motivated by the question of calculating a Nash equilibrium (θ*, φ*) of (SP) in the context of von Neumann's theorem; we expand on this below.

2.2. Games.

Going beyond the min-max setting, a continuous game in normal form is defined as follows: First, consider a finite set of players N = {1, …, N}, each with their own action space K_i ⊆ ℝ^{d_i} (assumed convex but possibly not closed). During play, each player selects an action x_i from K_i with the aim of minimizing a loss determined by the ensemble x ≡ (x_i; x_{−i}) ≡ (x_1, …, x_N) of all players' actions. In more detail, writing K ≡ ∏_i K_i for the game's total action space, we assume that the loss incurred by the i-th player is ℓ_i(x_i; x_{−i}), where ℓ_i : K → ℝ is the player's loss function. In this context, a Nash equilibrium is any action profile x* ∈ K that is unilaterally stable, i.e.,

ℓ_i(x*_i; x*_{−i}) ≤ ℓ_i(x_i; x*_{−i}) for all x_i ∈ K_i and all i ∈ N. (NE)

If each K_i is compact and ℓ_i is convex in x_i, existence of Nash equilibria is guaranteed by the theorem of Debreu [13]. Given that a min-max problem can be seen as a two-player zero-sum game with ℓ_1 = L, ℓ_2 = −L, von Neumann's theorem may in turn be seen as a special case of Debreu's; in the sequel, we describe a first-order characterization of Nash equilibria that encapsulates both.

In most cases of interest, the players' loss functions are individually subdifferentiable on a subset X of K with ri K ⊆ X ⊆ K [21, 44]. This means that there exists a (possibly discontinuous) vector field V_i : X → ℝ^{d_i} such that

ℓ_i(x′_i; x_{−i}) ≥ ℓ_i(x_i; x_{−i}) + ⟨V_i(x), x′_i − x_i⟩ for all x ∈ X, x′ ∈ K, and all i ∈ N. (2)

In the simplest case, if ℓ_i is differentiable at x, then V_i(x) can be interpreted as the gradient of ℓ_i with respect to x_i. The raison d'être of the more general definition (2) is that it allows us to treat non-smooth loss functions that are common in machine learning (such as L1-regularized losses). We make this distinction precise below:

1. If there is no continuous vector field V_i(x) satisfying (2), the game is called non-smooth.
2. If there is a continuous vector field V_i(x) satisfying (2), the game is called smooth.

Remark. We stress here that the adjective "smooth" refers to the game itself: for instance, if ℓ(x) = |x| for x ∈ ℝ, the game is not smooth and any V satisfying (2) is discontinuous at 0. In this regard, the above boils down to whether the (individual) subdifferential of each ℓ_i admits a continuous selection.
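As a concrete sanity check (our own illustration, not part of the paper), the non-smooth loss ℓ(x) = |x| from the remark above admits V(x) = sign(x) as a valid, but discontinuous, selection of its subdifferential; the inequality (2) can be verified numerically:

```python
import numpy as np

# Sanity check of the subdifferential inequality (2) for the non-smooth loss
# l(x) = |x| discussed in the remark above (our own illustration):
# V(x) = sign(x) is a valid selection, since |x'| >= |x| + V(x) * (x' - x)
# for all x, x' -- but no continuous selection exists at x = 0.

def V(x):
    return np.sign(x)  # discontinuous at 0 (sign(0) = 0 lies in [-1, 1])

rng = np.random.default_rng(1)
pairs = rng.uniform(-2.0, 2.0, size=(100, 2))
holds = all(abs(xp) >= abs(x) + V(x) * (xp - x) - 1e-12 for x, xp in pairs)
```

The check passes for every pair (x, x′), even though V jumps at 0, which is exactly why the game defined by ℓ is non-smooth.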

2.3. Resource allocation and equilibrium problems.

The notion of a Nash equilibrium captures the unilateral minimization of the players' individual loss functions. In many practical cases of interest, a notion of equilibrium is still relevant, even though it is not necessarily attached to the minimization of individual loss functions. Such problems are known as "equilibrium problems" [15, 26]; to avoid unnecessary generalities, we focus here on a relevant problem that arises in distributed computing architectures (such as GPU clusters and the like).

To state the problem, consider a distributed computing grid consisting of N parallel processors that serve demands arriving at a rate of ρ per unit of time (measured, e.g., in flop/s). If the maximum processing rate of the i-th node is µ_i (without overclocking), and jobs are buffered and served on a first-come, first-served (FCFS) basis, the mean time required to process a unit demand at the i-th node is given by the Kleinrock M/M/1 response function τ_i(x_i) = 1/(µ_i − x_i), where x_i denotes the node's load [5]. Accordingly, the set of feasible loads that can be processed by the grid is

X ≡ {(x_1, …, x_N) : 0 ≤ x_i < µ_i, x_1 + ⋯ + x_N = ρ}.

In this context, a load profile x* ∈ X is said to be balanced if no infinitesimal process can be better served by buffering it at a different node [38]; formally, this amounts to the so-called Wardrop equilibrium condition

τ_i(x*_i) ≤ τ_j(x*_j) for all i, j ∈ N with x*_i > 0. (WE)

We note here a crucial difference between (WE) and (NE): if we view the grid's computing nodes as "players", the constraint ∑_i x_i = ρ means that there is no allowable unilateral deviation (x*_i; x*_{−i}) → (x_i; x*_{−i}) with x_i ≠ x*_i. As a result, (NE) is meaningless as a requirement for this equilibrium problem. As we discuss below, this resource allocation problem will require the full capacity of our framework.
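For intuition, the balanced loads of (WE) can be computed in closed form by a water-filling argument: active nodes share a common residual capacity c = µ_i − x_i (hence a common latency 1/c), while idle nodes must satisfy µ_j ≤ c so that they attract no traffic. The sketch below is our own illustration; `balanced_loads` is a hypothetical helper, not from the paper:

```python
import numpy as np

def balanced_loads(mu, rho):
    """Water-filling sketch for (WE) (hypothetical helper, not from the paper):
    active nodes equalize tau_i(x_i) = 1/(mu_i - x_i), i.e. they share a common
    residual capacity c = mu_i - x_i, while idle nodes satisfy mu_j <= c."""
    order = np.argsort(-mu)              # try activating the fastest nodes first
    mu_s = mu[order]
    for k in range(1, len(mu) + 1):
        c = (mu_s[:k].sum() - rho) / k   # common residual capacity of k active nodes
        active_ok = c > 0 and mu_s[k - 1] > c    # active nodes carry positive load
        idle_ok = k == len(mu) or mu_s[k] <= c   # idle nodes attract no traffic
        if active_ok and idle_ok:
            x = np.zeros(len(mu))
            x[order[:k]] = mu_s[:k] - c
            return x
    raise ValueError("total demand rho exceeds the grid's capacity")
```

For instance, with µ = (4, 2, 1) and ρ = 3, two nodes are activated and the loads (2.5, 0.5, 0) equalize the active latencies at 1/1.5, while the idle node's empty-queue latency 1/µ_3 = 1 is no better.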

2.4. Variational inequalities.

Importantly, all of the above problems can be restated as a variational inequality of the form:

Find x* ∈ X such that ⟨V(x*), x − x*⟩ ≥ 0 for all x ∈ X. (VI)

In the above, X is a convex subset of ℝ^d (not necessarily closed) that represents the problem's domain. The problem's defining vector field V : X → ℝ^d is then given as follows: in min-max problems and games, V is any field satisfying (2); otherwise, in equilibrium problems of the form (WE), the components of V are V_i = τ_i (we leave the details of this verification to the reader). This equivalent formulation is quite common in the literature on min-max / equilibrium problems [14, 15, 26, 30], and it is often referred to as the "vector field formulation" [3, 8, 23]. Its usefulness lies in that it allows us to abstract away from the underlying game-theoretic complications (multiple indices, individual subdifferentials, etc.), and it provides a unifying framework for a wide range of problems in machine learning, signal processing, operations research, and many other fields [15, 45]. For this reason, our analysis will focus almost exclusively on solving (VI), and we will treat V and X ⊆ ℝ^d, d = ∑_i d_i, as the problem's primitive data.
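As a minimal illustration of the vector field formulation (our own sketch, not from the paper), consider the bilinear saddle point L(θ, φ) = θᵀAφ: the corresponding field stacks the individual gradients as V(x) = (∇_θ L, −∇_φ L) = (Aφ, −Aᵀθ), and the monotonicity condition discussed in the next subsection holds with equality:

```python
import numpy as np

# Vector field formulation for the bilinear saddle point L(theta, phi) = theta^T A phi
# (our own illustration): stack V(x) = (grad_theta L, -grad_phi L) = (A phi, -A^T theta).

def make_field(A):
    n, m = A.shape
    def V(x):
        theta, phi = x[:n], x[n:]
        return np.concatenate([A @ phi, -A.T @ theta])
    return V

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
V = make_field(A)

# For bilinear games the monotonicity pairing <V(x) - V(x'), x - x'> vanishes
# identically (the field is a pure rotation), so the game is (weakly) monotone.
x, xp = rng.standard_normal(5), rng.standard_normal(5)
mon_gap = np.dot(V(x) - V(xp), x - xp)   # zero up to round-off
```

The vanishing pairing is exactly the rotational component that makes plain gradient methods cycle in bilinear games, and which extra-gradient steps are designed to mitigate.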

2.5. Merit functions and monotonicity.

A widely used assumption in the literature on equilibrium problems and variational inequalities is the monotonicity condition

⟨V(x) − V(x′), x − x′⟩ ≥ 0 for all x, x′ ∈ X. (Mon)

In single-player games, monotonicity is equivalent to convexity of the optimizer's loss function; in min-max games, it is equivalent to L being convex-concave [26]; etc. In the absence of monotonicity, approximating an equilibrium is PPAD-hard [11], so we will state most of our results under (Mon). Now, to assess the quality of a candidate solution x̂ ∈ X, we will employ the restricted merit function

Gap_C(x̂) = sup_{x∈C} ⟨V(x), x̂ − x⟩,

where the "test domain" C is a nonempty convex subset of X [15, 24, 37]. The motivation for this is provided by the following proposition:

Proposition 1. Let C be a nonempty convex subset of X. Then: a) Gap_C(x̂) ≥ 0 whenever x̂ ∈ C; and b) if Gap_C(x̂) = 0 and C contains a neighborhood of x̂, then x̂ is a solution of (VI).

Proposition 1 generalizes an earlier characterization by Nesterov [37] and justifies the use of Gap_C(x̂) as a merit function for (VI); to streamline our presentation, we defer the proof to the paper's supplement. Moreover, to avoid trivialities, we will also assume that the solution set X* of (VI) is nonempty, and we will reserve the notation x* for solutions of (VI). Together with monotonicity, this will be our only blanket assumption.

3. The Extra-Gradient Algorithm and its Limits

Perhaps the most widely used solution method for games and variational inequalities (VIs) is the extra-gradient (EG) algorithm of Korpelevich [25] and its variants [28, 42, 43]. This algorithm has a rich history in optimization, and it has recently attracted considerable interest in the fields of machine learning and AI, see e.g., [8, 12, 18, 22, 23, 32, 33] and references therein.
In its simplest form, for problems with closed domains, the algorithm proceeds recursively as

X_{t+1/2} = Π(X_t − γ_t V_t),  X_{t+1} = Π(X_t − γ_t V_{t+1/2}),  (EG)

where Π(x) = arg min_{x′∈X} ‖x − x′‖ is the Euclidean projection onto X, V_t ≡ V(X_t) for t = 1, 3/2, …, and γ_t > 0 is the method's step-size. Then, running (EG) for T iterations, the algorithm returns the "ergodic average"

X̄_T = ∑_{t=1}^T γ_t X_{t+1/2} / ∑_{t=1}^T γ_t.

In this setting, the main guarantees for (EG) date back to [35] and can be summarized as follows:

1. For non-smooth problems (discontinuous V): Assume V is bounded, i.e., there exists some M > 0 such that

‖V(x)‖ ≤ M for all x ∈ X. (BD)

Then, if (EG) is run with a step-size of the form γ_t ∝ 1/√t, we have Gap_C(X̄_T) = O(1/√T). (5)

2. For smooth problems (continuous V): Assume V is L-Lipschitz continuous, i.e.,

‖V(x) − V(x′)‖ ≤ L‖x − x′‖ for all x, x′ ∈ X. (LC)

Then, if (EG) is run with a constant step-size γ < 1/L, we have Gap_C(X̄_T) = O(1/T). (6)

Remark. In the above, ‖·‖ is tacitly assumed to be the standard Euclidean norm. Non-Euclidean considerations will play a crucial role in the sequel, but they are not necessary for the moment.

Importantly, the distinction between smooth and non-smooth problems cannot be lifted: the bounds (5) and (6) are tight in their respective problem classes, and they cannot be improved without further assumptions [34, 39]. Moreover, we should also note the following:

1. The algorithm changes drastically from the non-smooth to the smooth case: non-smoothness requires γ_t ∝ 1/√t, but such a step-size cannot achieve a fast O(1/T) rate.

2. If (EG) is run with a constant step-size, L must be known in advance; otherwise, running (EG) with an ill-adapted step-size (γ > 1/L) could lead to non-convergence. We illustrate this failure of (EG) in Fig. 1.
As we discussed in the introduction, our aim in the sequel will be to provide a single, adaptive algorithm that simultaneously achieves the following: a) an order-optimal O(1/√T) convergence rate in non-smooth problems and O(1/T) in smooth ones; b) convergence in problems where the boundedness / Lipschitz continuity conditions (BD) / (LC) no longer hold; and c) all of the above without prior knowledge of the problem's parameters.
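For reference, a bare-bones Euclidean implementation of (EG) with Polyak-Ruppert averaging might look as follows (our own sketch, instantiated on the clipped bilinear game L(θ, φ) = θφ of Fig. 1):

```python
import numpy as np

# A bare-bones Euclidean (EG) loop with Polyak-Ruppert averaging (our own sketch),
# run on the clipped bilinear game L(theta, phi) = theta * phi of Fig. 1.

def V(x):
    theta, phi = x
    return np.array([phi, -theta])       # (dL/dtheta, -dL/dphi)

def extra_gradient(x0, gamma, T, lo=-1.0, hi=1.0):
    proj = lambda z: np.clip(z, lo, hi)  # Euclidean projection onto the box
    x = np.array(x0, dtype=float)
    avg, wsum = np.zeros_like(x), 0.0
    for _ in range(T):
        x_half = proj(x - gamma * V(x))  # leading (extrapolation) state
        x = proj(x - gamma * V(x_half))  # update from the base state
        avg += gamma * x_half            # ergodic average of leading states
        wsum += gamma
    return x, avg / wsum
```

With γ = 0.5 < 1/L (here L = 1), both the last iterate and the ergodic average approach the solution (0, 0); with γ above 1/L the method can fail to converge, as in the left panel of Fig. 1.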

4. Rate Interpolation: the Euclidean Case

As a prelude to our main result, we provide in this section an adaptive version of (EG) that achieves the "best of both worlds" in the Euclidean setting of Section 3, i.e., an O(1/√T) convergence rate in problems satisfying (BD), and an O(1/T) rate in problems satisfying (LC). Our starting point is the observation that, if the sequence X_t produced by (EG) converges to a solution of (VI), the difference δ_t ≡ ‖V_{t+1/2} − V_t‖ = ‖V(X_{t+1/2}) − V(X_t)‖ must itself become vanishingly small if V is (Lipschitz) continuous. On the contrary, if V is discontinuous, this difference may remain bounded away from zero (consider for example the L1 loss ℓ(x) = |x| near 0). Based on this observation, we consider the adaptive step-size policy

γ_{t+1} = 1 / √(1 + ∑_{s=1}^t δ_s²). (8)

The intuition behind (8) is as follows: If V is not smooth and lim inf_{t→∞} δ_t > 0, then γ_t will vanish at a Θ(1/√t) rate, which is the optimal step-size schedule for problems satisfying (BD) but not (LC). Instead, if V satisfies (LC) and X_t converges to a solution x* of (VI), it is plausible to expect that the infinite series ∑_t δ_t² is summable, in which case the step-size γ_t will not vanish as t → ∞. Furthermore, since δ_t is defined in terms of successive gradient differences, it automatically exploits the variation of the gradient data observed up to time t, so it can be expected to adjust to the "local" Lipschitz constant of V around a solution x* of (VI). Our step-size policy and motivation are similar in spirit to the "predictable sequence" approach of [43]. For now, we only state (without proof) our main result for problems satisfying (BD) or (LC).

Theorem 1. Suppose V satisfies (Mon), let C be a compact neighborhood of a solution of (VI), and let H = sup_{x∈C} ‖X_1 − x‖². If (EG) is run with the adaptive step-size policy (8), we have:

a) If V satisfies (BD): Gap_C(X̄_T) = O((H + 4M³ + log(1 + 4M²T)) / √T). (9a)

b) If V satisfies (LC): Gap_C(X̄_T) = O(H/T). (9b)

Theorem 1 (which is proved in the sequel as a special case of Theorem 2) should be compared to the corresponding results of Bach & Levy [2].
In the non-smooth case, [2] provides a bound of the form Õ(αMD/√T) with D² = ½ max_{x∈X} ‖x‖² − ½ min_{x∈X} ‖x‖² (recall that [2] only treats problems with a bounded domain), and α = max{M/M_0, M_0/M}, where M_0 is an initial estimate of M. The worst-case value of α is O(M) when good estimates are not readily available; in this regard, (9a) essentially replaces the O(D) constant of Bach & Levy [2] by O(M). Since D = ∞ in problems with an unbounded domain, Theorem 1 provides a significant improvement in this regard. In terms of L, the smooth guarantee of [2] is Õ(α²LD²/T), so the multiplicative constant in the bound also becomes infinite in problems with an unbounded domain. In our case, D² is replaced by H (which is always finite) times an additional multiplicative constant which is increasing in M and L (but is otherwise asymptotic, so it is not included in the statement of Theorem 1). This removes an additional limitation in the results of [2]; in the next sections, we drop even the Euclidean regularity requirements (BD)/(LC), and we provide a rate interpolation result that does not require either condition.
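The Euclidean scheme of this section, namely (EG) driven by the adaptive policy (8), can be sketched as follows (our own illustration, not the paper's reference implementation):

```python
import numpy as np

# (EG) driven by the adaptive policy (8) (our own sketch of the Euclidean
# method of this section): gamma_{t+1} = 1 / sqrt(1 + sum_{s<=t} delta_s^2),
# with delta_t = ||V(X_{t+1/2}) - V(X_t)|| the successive gradient difference.

def adaptive_eg(V, x0, T, proj=lambda z: z):
    x = np.array(x0, dtype=float)
    gamma, ssq = 1.0, 0.0
    avg, wsum = np.zeros_like(x), 0.0
    for _ in range(T):
        v = V(x)
        x_half = proj(x - gamma * v)
        v_half = V(x_half)
        x = proj(x - gamma * v_half)
        avg += gamma * x_half
        wsum += gamma
        ssq += np.sum((v_half - v) ** 2)   # accumulate delta_t^2
        gamma = 1.0 / np.sqrt(1.0 + ssq)   # step-size for the next iteration
    return x, avg / wsum
```

Note that no Lipschitz constant is supplied: on a smooth problem the accumulated δ² stays summable and γ_t stabilizes at a positive value, while on a non-smooth one γ_t decays as Θ(1/√t), mirroring the dichotomy behind Theorem 1.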

5. Finsler Regularity

To motivate our analysis outside the setting of (BD)/(LC), consider the vector field

V_i(x) = (µ_i − x_i)^{−1} + λ·1{x_i > 0}, i = 1, …, N, (10)

which corresponds to the distributed computing problem of Section 2.3 plus a regularization term designed to limit the activation of computing nodes at low loads. Clearly, we have ‖V(x)‖ → ∞ whenever x_i → µ_i⁻, so (BD) and (LC) both fail (the latter even if λ = 0). On the other hand, if we consider the "local" norm

‖v‖_{x,*} = ∑_{i=1}^d (µ_i − x_i)|v_i|,

we have ‖V(x)‖_{x,*} ≤ d + λ ∑_{i=1}^d µ_i, so V is bounded relative to ‖·‖_{x,*}. This observation motivates the use of a local (as opposed to global) norm, which we define formally as follows:

Definition 1. A Finsler metric on a convex subset X of ℝ^d is a continuous function F : X × ℝ^d → ℝ_+ which satisfies the following properties for all x ∈ X and all z, z′ ∈ ℝ^d:

1. Subadditivity: F(x; z + z′) ≤ F(x; z) + F(x; z′).
2. Absolute homogeneity: F(x; λz) = |λ|F(x; z) for all λ ∈ ℝ.
3. Positive-definiteness: F(x; z) ≥ 0, with equality if and only if z = 0.

Given a Finsler metric on X, the induced primal / dual local norms on X are respectively defined as

‖z‖_x = F(x; z) and ‖v‖_{x,*} = max{⟨v, z⟩ : F(x; z) = 1} (11)

for all x ∈ X and all z, v ∈ ℝ^d. We will also say that a Finsler metric on X is regular when

‖v‖_{x′,*} / ‖v‖_{x,*} = 1 + O(‖x′ − x‖_x) for all x, x′ ∈ X, v ∈ ℝ^d.

Finally, for simplicity, we will also assume in the sequel that ‖·‖_x ≥ ν‖·‖ for some ν > 0 and all x ∈ X (this last assumption is for convenience only, as the norm could be redefined as ‖·‖_x ← ‖·‖_x + ν‖·‖ without affecting our theoretical analysis). When X is equipped with a regular Finsler metric as above, we will say that it is a Finsler space.

Example 5.1. Let F(x; z) = ‖z‖, where ‖·‖ denotes the reference norm of X = ℝ^d. Then the properties of Definition 1 are satisfied trivially.

Example 5.2. For a more interesting example of a Finsler structure, consider the set X = (0, 1]^d and the metric ‖z‖_x = max_i |z_i|/x_i for z ∈ ℝ^d, x ∈ X.
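The primal / dual pair of this example is easy to probe numerically (our own sketch, not from the paper): the dual norm ‖v‖_{x,*} = ∑_i x_i|v_i| derived in the sequel upper-bounds the pairing ⟨v, z⟩ against all z with ‖z‖_x ≤ 1, and the bound is attained at z_i = x_i·sign(v_i):

```python
import numpy as np

# Numerical probe of Example 5.2 (our own sketch): on X = (0,1]^d the primal
# local norm is ||z||_x = max_i |z_i|/x_i, its dual is ||v||_{x,*} = sum_i x_i|v_i|,
# and the pairing obeys <v, z> <= ||v||_{x,*} ||z||_x, with equality attained
# at z_i = x_i * sign(v_i).

def primal(z, x):
    return np.max(np.abs(z) / x)

def dual(v, x):
    return np.sum(x * np.abs(v))

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=5)
v = rng.standard_normal(5)

holds = all(
    np.dot(v, z) <= dual(v, x) * primal(z, x) + 1e-12
    for z in rng.standard_normal((50, 5))
)
z_star = x * np.sign(v)   # ||z_star||_x = 1 and <v, z_star> = ||v||_{x,*}
```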
In this case, ‖v‖_{x,*} = ∑_{i=1}^d x_i|v_i| for all v ∈ ℝ^d, and the only property of Definition 1 that remains to be proved is that of regularity. To that end, we have

‖v‖_{x′,*} − ‖v‖_{x,*} ≤ ∑_{i=1}^d |v_i| · |x′_i − x_i| = ∑_{i=1}^d x_i|v_i| · |x′_i − x_i|/x_i ≤ ‖v‖_{x,*} · ‖x′ − x‖_x.

Hence, dividing by ‖v‖_{x,*}, we readily get ‖v‖_{x′,*} / ‖v‖_{x,*} ≤ 1 + ‖x′ − x‖_x, i.e., ‖·‖_x is regular in the sense of Definition 1. As we discuss in the sequel, this metric plays an important role for distributed computing problems of the form presented in Section 2.3.

With all this in hand, we will say that a vector field V : X → ℝ^d is:

1. Metrically bounded if there exists some M > 0 such that

‖V(x)‖_{x,*} ≤ M for all x ∈ X. (MB)

2. Metrically smooth if there exists some L > 0 such that

‖V(x′) − V(x)‖_{x,*} ≤ L‖x′ − x‖_x for all x, x′ ∈ X. (MS)

The notion of metric boundedness/smoothness extends that of ordinary boundedness/Lipschitz continuity to a Finsler context; note also that, even though neither side of (MS) is unilaterally symmetric under the exchange x ↔ x′, the condition (MS) as a whole is. Our next example shows that this extension is proper, i.e., (BD)/(LC) may both fail while (MB)/(MS) both hold:

Example 5.3. Consider the change of variables x_i ← 1 − x_i/µ_i in the resource allocation problem of Section 2.3. Then, writing V_i(x) = −1/x_i − λ·1{x_i < 1} for the transformed field (10) under this change of variables, we readily get V_i(x) → −∞ as x_i → 0⁺; as a result, both (BD) and (LC) fail to hold for any global norm on ℝ^d. Instead, under the local norm ‖z‖_x = max_i |z_i|/x_i, we have:

1. For all λ ≥ 0, V satisfies (MB) with M = d(1 + λ):

‖V(x)‖_{x,*} ≤ ∑_{i=1}^d x_i · (1/x_i + λ) = d + λ ∑_{i=1}^d x_i ≤ d(1 + λ).

2. For λ = 0, V satisfies (MS) with L = d: indeed, for all x, x′ ∈ X, we have

‖V(x′) − V(x)‖_{x′,*} = ∑_{i=1}^d x′_i |1/x′_i − 1/x_i| = ∑_{i=1}^d |x′_i − x_i|/x_i ≤ d max_i |x′_i − x_i|/x_i = d‖x′ − x‖_x.

6. The AdaProx Algorithm and its Guarantees

The method.
We are now in a position to define a family of algorithms that is capable of interpolating between the optimal smooth/non-smooth convergence rates for solving (VI) without requiring either (BD) or (LC). To do so, the key steps in our approach will be to (i) equip X with a suitable Finsler structure (as in Section 5); and (ii) replace the Euclidean projection in (EG) with a suitable "Bregman proximal" step that is compatible with the chosen Finsler structure on X. We begin with the latter (assuming that X is equipped with an arbitrary Finsler structure):

Definition 2. We say that h : ℝ^d → ℝ ∪ {∞} is a Bregman-Finsler function on X if:

1. h is convex, lower semi-continuous (l.s.c.), cl(dom h) = cl(X), and dom ∂h = X.
2. The subdifferential of h admits a continuous selection ∇h(x) ∈ ∂h(x) for all x ∈ X.
3. h is strongly convex, i.e., there exists some K > 0 such that

h(x′) ≥ h(x) + ⟨∇h(x), x′ − x⟩ + (K/2)‖x′ − x‖²_x for all x ∈ X and all x′ ∈ dom h.

Published as a conference paper at ICLR 2021

The Bregman divergence induced by h is defined for all x ∈ X, x′ ∈ dom h as

D(x′, x) = h(x′) − h(x) − ⟨∇h(x), x′ − x⟩,

and the associated prox-mapping is defined for all x ∈ X and y ∈ ℝ^d as

P_x(y) = arg min_{x′∈X} {⟨y, x − x′⟩ + D(x′, x)}. (16)

Definition 2 is fairly technical, so some clarifications are in order. First, to connect this definition with the Euclidean setup of Section 4, the prox-mapping (16) should be seen as the Bregman equivalent of a Euclidean projection step, i.e., in the Euclidean case, P_x(y) = Π(x + y). Second, a key difference between Definition 2 and other definitions of Bregman functions in the literature [4, 6, 7, 9, 24, 36, 37, 46] is that h is assumed strongly convex relative to a local norm, not a global one. This "locality" will play a crucial role in allowing the proposed methods to adapt to the geometry of the problem. For concreteness, we provide below an example that expands further on Examples 5.2 and 5.3:

Example 6.1. Consider the local norm ‖z‖_x = max_i |z_i|/x_i on X = (0, 1]^d and let h(x) = ∑_{i=1}^d 1/x_i on (0, 1]^d. We then have

D(x′, x) = ∑_{i=1}^d [1/x′_i − 1/x_i + (x′_i − x_i)/x_i²] = ∑_{i=1}^d (x′_i − x_i)²/(x_i² x′_i) ≥ ∑_{i=1}^d (1 − x′_i/x_i)² ≥ ‖x′ − x‖²_x,

i.e., h is 1-strongly convex relative to ‖·‖_x on X.

With all this in place, the extra-gradient method can be adapted to our current setting as follows:

X_{t+1/2} = P_{X_t}(−γ_t V_t),
δ_t = ‖V_{t+1/2} − V_t‖_{X_{t+1/2},*},
X_{t+1} = P_{X_t}(−γ_t V_{t+1/2}),
γ_{t+1} = 1 / √(1 + ∑_{s=1}^t δ_s²), (AdaProx)

with V_t = V(X_t), t = 1, 3/2, …, as in Section 3. In words, this method builds on the template of (EG) by (i) replacing the Euclidean projection with a mirror step; and (ii) replacing the global norm in (8) with a dual Finsler norm evaluated at the algorithm's leading state X_{t+1/2}.

Convergence speed. With all this in hand, our main result for AdaProx can be stated as follows:

Theorem 2.
Suppose V satisfies (Mon), let C be a compact neighborhood of a solution of (VI), and set H = sup_{x∈C} D(x, X_1). Then, the AdaProx algorithm enjoys the guarantees:

a) If V satisfies (MB): Gap_C(X̄_T) = O((H + M³(1 + 1/K)² + log(1 + 4M²(1 + 2/K)²T)) / √T). (18a)

b) If V satisfies (MS): Gap_C(X̄_T) = O(H/T). (18b)

For the constants that appear in Eq. (18), we refer the reader to the discussion following Theorem 1. Moreover, we defer the proof of Theorem 2 to the paper's supplement. We only mention here that its key element is the determination of the asymptotic behavior of the adaptive step-size policy γ_t in the non-smooth and smooth regimes, i.e., under (MB) and (MS) respectively. At a very high level, (MB) guarantees that the difference sequence δ_t is bounded, which implies in turn that ∑_{t=1}^T γ_t = Ω(√T) and eventually yields the bound (18a) for the algorithm's ergodic average X̄_T. On the other hand, if (MS) kicks in, we have the following finer result:

Lemma 1. Assume V satisfies (MS). Then: a) γ_t decreases monotonically to a strictly positive limit γ_∞ = lim_{t→∞} γ_t > 0; and b) the sequence δ_t is square-summable: in particular, ∑_{t=1}^∞ δ_t² = 1/γ_∞² − 1.

By means of this lemma (which we prove in the paper's supplement), it follows that ∑_{t=1}^T γ_t ≥ γ_∞T = Ω(T); hence it ultimately follows that AdaProx enjoys an O(1/T) rate of convergence under (MS).

Trajectory convergence. In complement to Theorem 2, we also provide a trajectory convergence result that governs the actual iterates of the AdaProx algorithm:

Theorem 3. Suppose that ⟨V(x), x − x*⟩ < 0 whenever x* is a solution of (VI) and x is not. If, in addition, V satisfies (MB) or (MS), the iterates X_t of AdaProx converge to a solution of (VI).
The importance of this result is that, in many practical applications (especially in non-monotone problems), it is more common to harvest the "last iterate" of the method (X_t) rather than its ergodic average (X̄_T); as such, Theorem 3 provides a certain justification for this design choice.

The proof of Theorem 3 relies on non-standard arguments, so we relegate it to the supplement. Structurally, the first step is to show that X_t visits any neighborhood of a solution point x* ∈ X* infinitely often (this is where the coherence assumption ⟨V(x), x − x*⟩ < 0 is used).
The second is to use this trapping property in conjunction with a suitable "energy inequality" to establish convergence via the use of a quasi-Fejér technique as in [10] ; this part is detailed in a separate appendix.
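To make the template concrete, the following is a minimal numerical sketch of (AdaProx), our own illustration (not the authors' implementation), instantiated with the geometry of Example 6.1 and the field of Example 5.3 with λ = 0, whose unique solution of (VI) on X = (0, 1]^d is x* = (1, …, 1):

```python
import numpy as np

# A minimal sketch of (AdaProx) (our own illustration, not the authors' code):
# X = (0,1]^d with h(x) = sum_i 1/x_i (Example 6.1), local norm
# ||z||_x = max_i |z_i|/x_i, and the monotone field V_i(x) = -1/x_i of
# Example 5.3 (lambda = 0), whose unique solution of (VI) is x* = (1, ..., 1).

def V(x):
    return -1.0 / x

def dual_norm(v, x):                     # ||v||_{x,*} = sum_i x_i |v_i|
    return np.sum(x * np.abs(v))

def prox(x, y):
    # P_x(y): since grad h(x)_i = -1/x_i^2, the mirror step solves
    # 1/x'_i^2 = 1/x_i^2 - y_i, clipped at the upper bound x'_i <= 1
    # (a non-positive argument also sends the coordinate to the bound).
    arg = np.maximum(1.0 / x**2 - y, 1e-12)
    return np.minimum(1.0 / np.sqrt(arg), 1.0)

def adaprox(x0, T):
    x, gamma, ssq = np.array(x0, dtype=float), 1.0, 0.0
    for _ in range(T):
        x_half = prox(x, -gamma * V(x))              # leading state
        delta = dual_norm(V(x_half) - V(x), x_half)  # local gradient variation
        x = prox(x, -gamma * V(x_half))              # update
        ssq += delta**2
        gamma = 1.0 / np.sqrt(1.0 + ssq)             # adaptive step-size
    return x
```

Starting from x_0 = (0.5, …, 0.5), the iterates climb to the boundary solution x* = (1, …, 1), even though V blows up near x = 0 and the Euclidean conditions (BD)/(LC) fail.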

7. Numerical Experiments

We conclude in this section with a numerical illustration of the convergence properties of AdaProx in two different settings: a) bilinear min-max games; and b) a simple Wasserstein GAN in the spirit of Daskalakis et al. [12] with the aim of learning an unknown covariance matrix.

Bilinear min-max games. For our first set of experiments, we consider a min-max game of the form L(θ, φ) = (θ − θ*)ᵀA(φ − φ*) with θ, φ ∈ ℝ^100 and A ∈ ℝ^{100×100} (drawn i.i.d. component-wise from a standard Gaussian). To test the convergence of AdaProx beyond the "full gradient" framework, we ran the algorithm with stochastic gradient signals of the form V̂_t = V(X_t) + U_t, where U_t is drawn i.i.d. from a centered Gaussian distribution with unit covariance matrix. We then plotted in Fig. 2 the squared gradient norm ‖V(X̄_T)‖² of the method's ergodic average X̄_T after T iterations (so values closer to zero are better). For benchmarking purposes, we also ran the extra-gradient (EG) and Bach-Levy (BL) algorithms [2] with the same random seed for the simulated gradient noise. The step-size parameter of the EG algorithm was chosen as γ_t = 0.025/√t, whereas the BL algorithm was run with diameter and gradient bound estimation parameters D_0 = 0.5 and M_0 = 2.5 respectively (both determined after a hyper-parameter search, since the only theoretically allowable values are D_0 = M_0 = ∞; interestingly, very large values for D_0 and M_0 did not yield good results). The experiment was repeated S = 100 times, and AdaProx gave consistently faster rates.

The goal here is to generate data drawn from a centered Gaussian distribution with unknown covariance Σ; in particular, this model follows the Wasserstein GAN formulation of Daskalakis et al. [12] with generator and discriminator respectively given by G(z) = θz and D(x) = xᵀφx (no clipping).
For the experiments, we took d = 100, a mini-batch of m = 128 samples per update, and we ran the EG, BL and AdaProx algorithms as above, tracing the square norm of V as a measure of convergence. Since the problem is non-monotone, there are several disjoint equilibrium components so the algorithms' behavior is considerably more erratic; however, after this initial warm-up phase, AdaProx again gave the faster convergence rates.



Min-max / Saddle-point problems. A min-max game is a saddle-point problem of the form min θ∈Θ max φ∈Φ L(θ, φ) (SP)

Figure 1: The behavior of (EG) in the bilinear min-max problem L(θ, φ) = θφ with θ, φ ∈ [-1, 1]. Given the clipping at [-1, 1], this problem is smooth with L = 1; instead, in the unconstrained case, both (BD) and (LC) fail. Still, even in the constrained case, running (EG) with a step-size only slightly above the 1/L bound (L = 1, γ = 1.04) results in a dramatic convergence failure (left plot). Tuning the step-size of (EG) resolves this problem (center), but a constant step-size makes the algorithm unnecessarily conservative towards the end. The proposed AdaProx algorithm automatically exploits previous gradient data to perform more informative extra-gradient steps in later ones, thus achieving faster convergence without tuning.

Figure2: Numerical comparison between the extra-gradient (EG), Bach-Levy (BL) and AdaProx algorithms (red circles, green squares and blue triangles respectively). The figure on the left shows the methods' convergence in a 100 × 100 bilinear game; the one on the right shows the methods' convergence in a non-convex/non-concave covariance learning problem. In both cases, the parameters of the EG and BL algorithms have been tuned with a grid search (AdaProx has no parameters to tune). All curves have been averaged over S = 100 sample runs, and the 95% confidence interval is indicated by the shaded area.

Covariance matrix learning. Going a step further, consider the covariance learning gameL(θ, φ) = ¾ x∼N (0,Σ) [x θx] -¾ z∼N (0,I) [z θ φθz], θ, φ ∈ d × d .
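To make the bilinear protocol above concrete, the following sketch reruns a scaled-down version of it (d = 10 instead of 100) with a noisy gradient oracle. The AdaProx update itself is not reproduced here; as a purely illustrative stand-in for an adaptive method we use an AdaGrad-style step γ_t = 1/(Σ_{s≤t} ‖V_s‖²)^{1/2} — this rule, and all numerical choices below, are our assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                                   # scaled down from the paper's d = 100
A = rng.standard_normal((d, d))
theta_star = rng.standard_normal(d)      # saddle point (θ*, φ*)
phi_star = rng.standard_normal(d)

def V(x):
    """Gradient field of L(θ, φ) = (θ − θ*)ᵀ A (φ − φ*), i.e. V = (∇_θL, −∇_φL)."""
    theta, phi = x[:d], x[d:]
    return np.concatenate([A @ (phi - phi_star), -A.T @ (theta - theta_star)])

def run(T, adaptive, noise=1.0, gamma0=0.025, seed=1):
    rng_run = np.random.default_rng(seed)
    x, avg, ssq = np.zeros(2 * d), np.zeros(2 * d), 0.0
    for t in range(1, T + 1):
        g1 = V(x) + noise * rng_run.standard_normal(2 * d)  # noisy oracle V_t
        ssq += g1 @ g1
        # AdaGrad-style step (hypothetical stand-in) vs. the EG schedule γ0/√t
        gamma = 1.0 / np.sqrt(ssq) if adaptive else gamma0 / np.sqrt(t)
        x_half = x - gamma * g1                             # leading step
        g2 = V(x_half) + noise * rng_run.standard_normal(2 * d)
        x = x - gamma * g2                                  # extra-gradient update
        avg += (x_half - avg) / t                           # ergodic average X̄_T
    return float(np.linalg.norm(V(avg)) ** 2)               # ‖V(X̄_T)‖²

print(run(5000, adaptive=False), run(5000, adaptive=True))
```

The quantity returned is the squared gradient norm of the ergodic average, matching the convergence measure plotted in Fig. 2; running both variants with the same seed mimics the common-random-numbers comparison used in the experiments.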

Acknowledgments

This research was partially supported by the COST Action CA16228 "European Network for Game Theory" (GAMENET) and the French National Research Agency (ANR) in the framework of the grants ORACLESS (ANR-16-CE33-0004-01) and ELIOT (ANR-18-CE40-0030 and FAPESP 2018/12579-7), the "Investissements d'avenir" program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), and MIAI@Grenoble Alpes (ANR-19-P3IA-0003).

