LINEAR LAST-ITERATE CONVERGENCE IN CONSTRAINED SADDLE-POINT OPTIMIZATION

Abstract

Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) for saddle-point optimization have received growing attention due to their favorable last-iterate convergence. However, their behaviors for simple bilinear games over the probability simplex are still not fully understood: previous analyses lack explicit convergence rates, only apply to an exponentially small learning rate, or require additional assumptions such as the uniqueness of the optimal solution. In this work, we significantly expand the understanding of last-iterate convergence for OGDA and OMWU in the constrained setting. Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achieved with a learning rate whose value is set to a universal constant, improving the result of (Daskalakis & Panageas, 2019b) under the same assumption. We then significantly extend the results to more general objectives and feasible sets for the projected OGDA algorithm, by introducing a sufficient condition under which OGDA exhibits concrete last-iterate convergence rates with a constant learning rate whose value only depends on the smoothness of the objective function. We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption. Our condition also holds for strongly-convex-strongly-concave functions, recovering the result of (Hsieh et al., 2019). Finally, we provide experimental results to further support our theory.

1. INTRODUCTION

Saddle-point optimization in the form of min_x max_y f(x, y) dates back to (Neumann, 1928), where the celebrated minimax theorem was discovered. Due to advances in Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), whose training is itself a saddle-point problem, the question of how to find a good approximation of the saddle point, especially via an efficient iterative algorithm, has recently gained significant research interest. Simple algorithms such as Gradient Descent Ascent (GDA) and Multiplicative Weights Update (MWU) are known to cycle and fail to converge even in simple bilinear cases (see, e.g., (Bailey & Piliouras, 2018) and (Cheung & Piliouras, 2019)). Many recent works consider resolving this issue via simple modifications of standard algorithms, usually in the form of some extra gradient descent/ascent steps. These include Extra-Gradient methods (EG) (Liang & Stokes, 2019; Mokhtari et al., 2020b), Optimistic Gradient Descent Ascent (OGDA) (Daskalakis et al., 2018; Gidel et al., 2019; Mertikopoulos et al., 2019), Optimistic Multiplicative Weights Update (OMWU) (Daskalakis & Panageas, 2019b; Lei et al., 2021), and others. In particular, OGDA and OMWU are suitable for the repeated game setting where two players repeatedly propose x_t and y_t and receive only ∇_x f(x_t, y_t) and ∇_y f(x_t, y_t) respectively as feedback, with the goal of converging to a saddle point, or equivalently a Nash equilibrium in game-theoretic terminology. One notable benefit of OGDA and OMWU is that they are also no-regret algorithms with important applications in online learning, especially when playing against adversarial opponents (Chiang et al., 2012; Rakhlin & Sridharan, 2013). Despite considerable progress, especially in the unconstrained setting, the behavior of these algorithms in the constrained setting, where x and y are restricted to closed convex sets X and Y respectively, is still not fully understood.
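The cycling of GDA versus the convergence of optimistic methods can already be seen in the scalar bilinear problem min_x max_y xy. The following is a minimal numerical sketch (our own illustration, not from the paper) comparing plain GDA with the unconstrained form of OGDA, which corrects each step using the previous gradient:

```python
import math

# Toy bilinear problem f(x, y) = x * y with unique saddle point (0, 0).
# Gradients: df/dx = y and df/dy = x.
eta = 0.1

def gda(x, y, steps):
    # Plain simultaneous Gradient Descent Ascent: spirals away from (0, 0),
    # since each step multiplies the norm by sqrt(1 + eta^2).
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x
    return x, y

def ogda(x, y, steps):
    # Unconstrained OGDA: x_{t+1} = x_t - 2*eta*y_t + eta*y_{t-1} (and
    # symmetrically for y), i.e., the current gradient is applied twice
    # and the previous one is subtracted once.
    xp, yp = x, y  # previous iterate, whose gradient is "remembered"
    for _ in range(steps):
        x_new = x - 2 * eta * y + eta * yp
        y_new = y + 2 * eta * x - eta * xp
        xp, yp, x, y = x, y, x_new, y_new
    return x, y

x0, y0 = 1.0, 1.0
gx, gy = gda(x0, y0, 1000)
ox, oy = ogda(x0, y0, 1000)
print(math.hypot(gx, gy))  # far above the initial norm sqrt(2): GDA diverges
print(math.hypot(ox, oy))  # contracts toward the saddle point
```

The only difference between the two loops is the extra correction term with the previous gradient, yet it flips the behavior from divergence to (linear) convergence on this instance.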
This is true even when f is a bilinear function and X and Y are simplices, the setting known as classic two-player zero-sum games in normal form, or simply matrix games. Indeed, existing convergence results on the last iterate of OGDA or OMWU for matrix games are unsatisfactory: they lack explicit convergence rates (Popov, 1980; Mertikopoulos et al., 2019), only apply to an exponentially small learning rate and thus do not reflect the behavior of the algorithms in practice (Daskalakis & Panageas, 2019b), or require additional conditions such as uniqueness of the equilibrium or a good initialization (Daskalakis & Panageas, 2019b). Motivated by this fact, in this work we first improve the last-iterate convergence result of OMWU for matrix games. Under the same unique equilibrium assumption as made by Daskalakis & Panageas (2019b), we show linear convergence with a concrete rate in terms of the Kullback-Leibler divergence between the last iterate and the equilibrium, using a learning rate whose value is set to a universal constant. We then significantly extend our results and consider OGDA for general constrained and smooth convex-concave saddle-point problems, without the uniqueness assumption. Specifically, we start by proving an average duality gap convergence of OGDA at the rate of O(1/√T) after T iterations. Then, to obtain a more favorable last-iterate convergence in terms of the distance to the set of equilibria, we propose a general sufficient condition on X, Y, and f, called Saddle-Point Metric Subregularity (SP-MS), under which we prove concrete last-iterate convergence rates, all with a constant learning rate and without further assumptions. Our last-iterate convergence results for OGDA greatly generalize (Hsieh et al., 2019, Theorem 2), which itself is a consolidated version of results from several earlier works.
The key implication of our new results is that, by showing that matrix games satisfy our SP-MS condition, we provide by far the most general last-iterate guarantee with linear convergence for this problem using OGDA. Compared to that of OMWU, the convergence result of OGDA holds more generally, even when there are multiple equilibria. More generally, the same linear last-iterate convergence holds for any bilinear game over polytopes, since such games also satisfy the SP-MS condition, as we show. To complement this result, we construct an example of a bilinear game with a non-polytope feasible set where OGDA provably does not ensure linear convergence, indicating that the shape of the feasible set matters. Finally, we also provide experimental results to support our theory. In particular, we observe that OGDA generally converges faster than OMWU for matrix games, despite the facts that both provably converge exponentially fast and that OMWU is often considered more favorable than OGDA when the feasible set is the simplex.

2. RELATED WORK

Average-iterate convergence. While showing last-iterate convergence has been a challenging task, it is well known that the average iterate of many standard algorithms such as GDA and MWU enjoys a converging duality gap at the rate of O(1/√T) (Freund & Schapire, 1999). A line of works shows that the rate can be improved to O(1/T) using the "optimistic" version of these algorithms, such as OGDA and OMWU (Rakhlin & Sridharan, 2013; Daskalakis et al., 2015; Syrgkanis et al., 2015). For tasks such as training GANs, however, average-iterate convergence is unsatisfactory, since averaging large neural networks is usually prohibitive.

Extra-Gradient (EG) algorithms. The saddle-point problem fits into the more general variational inequality framework (Harker & Pang, 1990). A classic algorithm for variational inequalities is EG, first introduced in (Korpelevich, 1976). Tseng (1995) is the first to show last-iterate convergence for EG in various settings such as bilinear or strongly-convex-strongly-concave problems. Recent works significantly expand the understanding of EG and its variants for unconstrained bilinear problems (Liang & Stokes, 2019), unconstrained strongly-convex-strongly-concave problems (Mokhtari et al., 2020b), and more (Zhang et al., 2019; Lin et al., 2020; Golowich et al., 2020b). The original EG is not applicable to a repeated game setting where only one gradient evaluation is possible in each iteration. Moreover, unlike OGDA and OMWU, EG is shown to have linear regret against adversarial opponents, and thus it is not a no-regret learning algorithm (Bowling, 2005; Golowich et al., 2020a). However, there are "single-call variants" of EG that address these issues. In fact, some of these versions coincide with the OGDA algorithm under different names, such as the modified Arrow-Hurwicz method (Popov, 1980) and "extrapolation from the past" (Gidel et al., 2019).
Apart from OGDA, other single-call variants of EG include Reflected Gradient (Malitsky, 2015; Cui & Shanbhag, 2016; Malitsky & Tam) and Optimistic Gradient (Daskalakis et al., 2018; Mokhtari et al., 2020a). These variants are all equivalent in the unconstrained setting but differ in the constrained setting. To the best of our knowledge, none of the existing results for any single-call variant of EG covers the constrained bilinear case (which is one of our key contributions).

Error Bounds and Metric Subregularity. To derive linear convergence for variational inequality problems, the error bound method is a commonly used technique (Pang, 1997; Luo & Tseng, 1993). For example, it is a standard approach to studying the last-iterate convergence of EG algorithms (Tseng, 1995; Hsieh et al., 2020; Azizian et al., 2020). An error bound method is associated with an error function that assigns every point in the feasible set a measure of sub-optimality, one that is lower bounded by the point's distance to the optimal set up to some problem-dependent constant. If such an error function exists, linear convergence can be obtained. The choice of the error function depends on the feasible region, the objective function, and the algorithm. Common error functions include natural residual functions (Iusem et al., 2017; Malitsky, 2019) and gap functions (Larsson & Patriksson, 1994; Solodov & Tseng, 2000; Chen et al., 2017). Our method for deriving the last-iterate convergence of OGDA can also be viewed as an error bound method. Metric subregularity is another important concept for deriving linear convergence via some Lipschitz behavior of a set-valued operator (Leventhal, 2009; Liang et al., 2016; Alacaoglu et al., 2019; Latafat et al., 2019), and it is closely related to error bound methods (Kruger, 2015).
In fact, as we prove in Appendix F, one special case of our SP-MS condition (the one that allows us to show linear convergence) is equivalent to metric subregularity of an operator defined in terms of the normal cone of the feasible set and the gradient of the objective. This is also the reason why we call our condition Saddle-Point Metric Subregularity. Although metric subregularity has been extensively used in the literature, to the best of our knowledge, our work is the first to use this condition to analyze OGDA.

OGDA and OMWU. Recently, last-iterate convergence for OGDA has been proven in various settings such as convex-concave problems (Daskalakis et al., 2018), unconstrained bilinear problems (Daskalakis & Panageas, 2018; Liang & Stokes, 2019), strongly-convex-strongly-concave problems (Mokhtari et al., 2020b), and others (e.g., (Mertikopoulos et al., 2019)). However, the behavior of OGDA and OMWU for the constrained bilinear case, or even the special case of classic matrix games, appears to be much more mysterious and less understood. Cheung & Piliouras (2020) provide an alternative view on the convergence behavior of OMWU by studying volume contraction in the dual space. Daskalakis & Panageas (2019b) show last-iterate convergence of OMWU for matrix games under a uniqueness assumption and without a concrete rate. Although it is implicitly suggested in (Daskalakis & Panageas, 2019b; a) that a rate of O(1/T^{1/9}) is possible, it is still not clear how to choose the learning rate appropriately from their analysis. As mentioned, our results for OMWU significantly improve theirs, with a clean linear convergence rate using a constant learning rate under the same uniqueness assumption, while our results for OGDA further remove the uniqueness assumption.

3. NOTATIONS AND PRELIMINARIES

We consider the following constrained saddle-point problem: min_{x∈X} max_{y∈Y} f(x, y), where X and Y are closed convex sets, and f is a continuously differentiable function that is convex in x for any fixed y and concave in y for any fixed x. By the celebrated minimax theorem (Neumann, 1928), we have min_{x∈X} max_{y∈Y} f(x, y) = max_{y∈Y} min_{x∈X} f(x, y). The set of minimax optimal strategies is denoted by X* = argmin_{x∈X} max_{y∈Y} f(x, y), and the set of maximin optimal strategies is denoted by Y* = argmax_{y∈Y} min_{x∈X} f(x, y). It is well known that X* and Y* are convex, and any pair (x*, y*) ∈ X* × Y* is a Nash equilibrium satisfying f(x*, y) ≤ f(x*, y*) ≤ f(x, y*) for any (x, y) ∈ X × Y. For notational convenience, we define Z = X × Y and similarly Z* = X* × Y*. For a point z = (x, y) ∈ Z, we further define f(z) = f(x, y) and F(z) = (∇_x f(x, y), −∇_y f(x, y)).

Our goal is to find a point z ∈ Z that is close to the set of Nash equilibria Z*, and we consider three ways of measuring the closeness. The first one is the duality gap, defined as α_f(z) = max_{y'∈Y} f(x, y') − min_{x'∈X} f(x', y), which is always non-negative since max_{y'∈Y} f(x, y') ≥ f(x, y) ≥ min_{x'∈X} f(x', y). The second one is the distance between z and Z*. Specifically, for any closed set A, we define the projection operator Π_A as Π_A(a) = argmin_{a'∈A} ‖a − a'‖ (throughout this work, ‖·‖ represents the L2 norm). The squared distance between z and Z* is then defined as dist²(z, Z*) = ‖z − Π_{Z*}(z)‖². The third one is only for the case when X and Y are probability simplices and z* = (x*, y*) is the unique equilibrium. In this case, we use the sum of Kullback-Leibler divergences KL(x*, x) + KL(y*, y) to measure the closeness between z = (x, y) and z*, where KL(x, x') = Σ_i x_i ln(x_i / x'_i). With a slight abuse of notation, we use KL(z, z') to denote KL(x, x') + KL(y, y').

Other notations.
We denote the (d − 1)-dimensional probability simplex by ∆_d = {u ∈ R^d_+ : Σ_{i=1}^d u_i = 1}. For a convex function ψ, the corresponding Bregman divergence is defined as D_ψ(u, v) = ψ(u) − ψ(v) − ⟨∇ψ(v), u − v⟩. If ψ is γ-strongly convex in a domain, then D_ψ(u, v) ≥ (γ/2)‖u − v‖² for any u, v in that domain. For u ∈ R^d, we define supp(u) = {i : u_i > 0}.

Optimistic Gradient Descent Ascent (OGDA). Starting from an arbitrary point (x̂_1, ŷ_1) = (x_0, y_0) in Z, OGDA with step size η > 0 iteratively computes the following for t = 1, 2, . . .:

x_t = Π_X(x̂_t − η∇_x f(x_{t−1}, y_{t−1})),   x̂_{t+1} = Π_X(x̂_t − η∇_x f(x_t, y_t)),
y_t = Π_Y(ŷ_t + η∇_y f(x_{t−1}, y_{t−1})),   ŷ_{t+1} = Π_Y(ŷ_t + η∇_y f(x_t, y_t)).

Note that there are several slightly different versions of the algorithm in the literature, which differ in the timing of performing the projection. Our version is the same as those in (Chiang et al., 2012), among others. Also note that OGDA only requires accessing f via its gradient; in fact, only one gradient at the point (x_t, y_t) is needed for iteration t. This aspect makes it especially suitable for a repeated game setting, where in each round one player proposes x_t while another player proposes y_t. With only the gradient information from the environment (∇_x f(x_t, y_t) for the first player and ∇_y f(x_t, y_t) for the other), both players can execute the algorithm.

Optimistic Multiplicative Weights Update (OMWU). When the feasible sets X and Y are probability simplices ∆_M and ∆_N for some integers M and N, OMWU is another common iterative algorithm for solving the saddle-point problem. For simplicity, we assume that it starts from the uniform distributions (x̂_1, ŷ_1) = (x_0, y_0) = (1_M/M, 1_N/N), where 1_d is the all-one vector of dimension d. Then OMWU with step size η > 0 iteratively computes the following for t = 1, 2, . . .
x_{t,i} = x̂_{t,i} exp(−η(∇_x f(x_{t−1}, y_{t−1}))_i) / Σ_j x̂_{t,j} exp(−η(∇_x f(x_{t−1}, y_{t−1}))_j),   x̂_{t+1,i} = x̂_{t,i} exp(−η(∇_x f(x_t, y_t))_i) / Σ_j x̂_{t,j} exp(−η(∇_x f(x_t, y_t))_j),
y_{t,i} = ŷ_{t,i} exp(η(∇_y f(x_{t−1}, y_{t−1}))_i) / Σ_j ŷ_{t,j} exp(η(∇_y f(x_{t−1}, y_{t−1}))_j),   ŷ_{t+1,i} = ŷ_{t,i} exp(η(∇_y f(x_t, y_t))_i) / Σ_j ŷ_{t,j} exp(η(∇_y f(x_t, y_t))_j).

OMWU and OGDA as Optimistic Mirror Descent Ascent. OMWU and OGDA can be viewed as special cases of Optimistic Mirror Descent Ascent. Specifically, let the regularizer ψ(u) be the negative entropy Σ_i u_i ln u_i for OMWU and (half of) the squared L2 norm (1/2)‖u‖² for OGDA (so that D_ψ(u, v) is KL(u, v) and (1/2)‖u − v‖², respectively). Then, using the shorthands z_t = (x_t, y_t) and ẑ_t = (x̂_t, ŷ_t) and recalling the notation defined earlier (Z = X × Y and F(z) = (∇_x f(x, y), −∇_y f(x, y))), one can rewrite OMWU/OGDA compactly as

z_t = argmin_{z∈Z} η⟨z, F(z_{t−1})⟩ + D_ψ(z, ẑ_t),   (1)
ẑ_{t+1} = argmin_{z∈Z} η⟨z, F(z_t)⟩ + D_ψ(z, ẑ_t).   (2)

By the standard regret analysis of Optimistic Mirror Descent, we have the following important lemma, which is readily applied to OMWU and OGDA when ψ is instantiated as the corresponding regularizer. The proof is mostly standard (see, e.g., (Rakhlin & Sridharan, 2013, Lemma 1)); for completeness, we include it in Appendix B.

Lemma 1. Consider update rules Eq. (1) and Eq. (2) and define dist²_p(z, z') = ‖x − x'‖²_p + ‖y − y'‖²_p. Suppose that ψ satisfies D_ψ(z, z') ≥ (1/2)dist²_p(z, z') for some p ≥ 1, and F satisfies dist²_q(F(z), F(z')) ≤ L² dist²_p(z, z') for q ≥ 1 with 1/p + 1/q = 1. Also, assume that η ≤ 1/(8L). Then for any z ∈ Z and any t ≥ 1, we have

ηF(z_t)ᵀ(z_t − z) ≤ D_ψ(z, ẑ_t) − D_ψ(z, ẑ_{t+1}) − D_ψ(ẑ_{t+1}, z_t) − (15/16)D_ψ(z_t, ẑ_t) + (1/16)D_ψ(ẑ_t, z_{t−1}).
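As a concrete instantiation of the updates above, the following sketch (our own illustration, not the authors' code) implements projected OGDA for a matrix game, including the standard sort-based Euclidean projection onto the simplex, and evaluates the duality gap defined earlier. The 2×2 game G is a hypothetical example whose unique equilibrium one can check to be x* = y* = (2/5, 3/5):

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex via the standard
    # sort-and-threshold method.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def run_ogda(G, eta, T):
    # Projected OGDA for f(x, y) = x^T G y over simplices, following the
    # update pattern above: the played iterate (x_t, y_t) uses the previous
    # gradient, while the "hat" iterate uses the current one.
    M, N = G.shape
    xh, yh = np.ones(M) / M, np.ones(N) / N      # (x-hat_1, y-hat_1)
    gx, gy = G @ yh, G.T @ xh                    # gradients at (x_0, y_0)
    for _ in range(T):
        x = project_simplex(xh - eta * gx)       # x_t
        y = project_simplex(yh + eta * gy)       # y_t
        gx, gy = G @ y, G.T @ x                  # gradients at (x_t, y_t)
        xh = project_simplex(xh - eta * gx)      # x-hat_{t+1}
        yh = project_simplex(yh + eta * gy)      # y-hat_{t+1}
    return x, y

# Hypothetical 2x2 game with unique equilibrium x* = y* = (2/5, 3/5)
# (both players equalize the opponent's payoffs).
G = np.array([[2.0, -1.0], [-1.0, 1.0]])
x, y = run_ogda(G, eta=0.1, T=2000)
gap = float((G.T @ x).max() - (G @ y).min())     # duality gap alpha_f(z_t)
```

For a matrix game the inner maximization and minimization in the duality gap are attained at vertices, which is why the gap reduces to max_j (Gᵀx)_j − min_i (Gy)_i.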

4. CONVERGENCE RESULTS FOR OMWU

In this section, we show that for a two-player zero-sum matrix game with a unique equilibrium, OMWU with a constant learning rate converges to the equilibrium exponentially fast. The assumption and the algorithm are the same as those considered in (Daskalakis & Panageas, 2019b), but our analysis improves theirs in two ways. First, we do not require the learning rate to be exponentially smaller than some problem-dependent quantity. Second, we explicitly provide a linear convergence rate. In Section 5, we further remove the uniqueness assumption and significantly generalize the results by studying OGDA.

In a matrix game we have X = ∆_M, Y = ∆_N, and f(z) = xᵀGy for some matrix G ∈ [−1, 1]^{M×N}. To show the last-iterate convergence of OMWU, we first apply Lemma 1 with D_ψ(u, v) = KL(u, v), z = z* (the unique equilibrium of the game matrix G), and (p, q) = (1, ∞). The constant L can be chosen as 1 since dist²_∞(F(z), F(z')) = max_i |(G(y − y'))_i|² + max_j |(Gᵀ(x − x'))_j|² ≤ ‖y − y'‖²_1 + ‖x − x'‖²_1 = dist²_1(z, z'). Also notice that F(z_t)ᵀ(z_t − z*) = f(x_t, y_t) − f(x*, y_t) + f(x_t, y*) − f(x_t, y_t) = f(x_t, y*) − f(x*, y_t) ≥ 0. Defining Θ_t = KL(z*, ẑ_t) + (1/16)KL(ẑ_t, z_{t−1}) and ζ_t = KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t), Lemma 1 thus implies

Θ_{t+1} ≤ Θ_t − (15/16)ζ_t.   (3)

From Eq. (3) it is clear that the quantity Θ_t is always non-increasing in t due to the non-negativity of ζ_t. Furthermore, the more the algorithm moves between round t and round t + 1 (that is, the larger ζ_t is), the more Θ_t decreases. To establish the rate of convergence, a natural idea is to relate ζ_t back to Θ_t or Θ_{t+1}. For example, if we can show ζ_t ≥ cΘ_{t+1} for some constant c > 0, then Eq. (3) implies Θ_{t+1} ≤ Θ_t − (15c/16)Θ_{t+1}, which further gives Θ_{t+1} ≤ (1 + 15c/16)^{−1}Θ_t. This immediately implies a linear convergence rate for Θ_t as well as KL(z*, ẑ_t), since KL(z*, ẑ_t) ≤ Θ_t. Moreover, notice that to find such a c, it suffices to find a c' > 0 such that ζ_t ≥ c' KL(z*, ẑ_{t+1}).
This is because it will then give ζ_t ≥ (1/16)KL(ẑ_{t+1}, z_t) + (15/16)ζ_t ≥ (1/16)KL(ẑ_{t+1}, z_t) + (15c'/16)KL(z*, ẑ_{t+1}) ≥ min{1, 15c'/16}Θ_{t+1}, and thus c = min{1, 15c'/16} satisfies the condition. From the discussion above, we see that to establish the linear convergence of KL(z*, ẑ_t), we only need to show that there exists some c' > 0 such that KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t) ≥ c' KL(z*, ẑ_{t+1}). The high-level interpretation of this inequality is that when ẑ_{t+1} is far from the equilibrium z* (i.e., KL(z*, ẑ_{t+1}) is large), the algorithm should make a large move between rounds t and t + 1, making KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t) large. In our analysis, we use a two-stage argument to find such a c'. In the first stage, we only show that KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t) ≥ c'' KL(z*, ẑ_{t+1})² for some c'' > 0, and use it to argue a slower convergence rate KL(z*, ẑ_t) = O(1/t). Then in the second stage, we show that after z_t and ẑ_t become close enough to z*, we have KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t) ≥ c' KL(z*, ẑ_{t+1}) for some c' > 0. This kind of two-stage argument might be reminiscent of that used by Daskalakis & Panageas (2019b); however, the techniques we use are very different. Specifically, Daskalakis & Panageas (2019b) utilize tools of "spectral analysis" similar to (Liang & Stokes, 2019) and show that the OMWU update can be viewed as a "contraction mapping" with respect to a matrix whose eigenvalue is smaller than 1. Our analysis, on the other hand, leverages the analysis of online mirror descent, starting from the "one-step regret bound" (Lemma 1) and making use of the two negative terms that are typically dropped in the analysis. Importantly, our analysis does not need the exponentially small learning rate required by (Daskalakis & Panageas, 2019b); thus, unlike their results, our learning rate is kept as a universal constant in all stages. The arguments above are formalized below.

Lemma 2.
Consider a matrix game f(x, y) = xᵀGy with X = ∆_M, Y = ∆_N, and G ∈ [−1, 1]^{M×N}. Assume that there exists a unique Nash equilibrium z* and η ≤ 1/8. Then, there exists a constant C_1 > 0 that depends on G such that for any t ≥ 1, OMWU ensures KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t) ≥ η²C_1 KL(z*, ẑ_{t+1})². Also, there is a constant ξ > 0 that depends on G (defined in Definition 2) such that as long as max{‖z* − ẑ_t‖_1, ‖z* − z_t‖_1} ≤ ηξ/10, then KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t) ≥ η²C_2 KL(z*, ẑ_{t+1}) for another constant C_2 > 0 that depends on G.

With Lemma 2 and the earlier discussion, the last-iterate convergence rate of OMWU is established:

Theorem 3. For a matrix game f(x, y) = xᵀGy with a unique Nash equilibrium z*, OMWU with a learning rate η ≤ 1/8 guarantees KL(z*, z_t) ≤ C_3(1 + C_4)^{−t}, where C_3, C_4 > 0 are some constants depending on the game matrix G.

Proofs for this section are deferred to Appendix D, where all problem-dependent constants are specified as well. To the best of our knowledge, Theorem 3 gives the first last-iterate convergence result for OMWU with a concrete linear rate. We note that the uniqueness assumption is critical for our analysis, and whether it is indeed necessary for OMWU is left as an important future direction.
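Theorem 3 is easy to probe numerically. The sketch below (our own illustration) runs OMWU on a hypothetical 2×2 game with entries in [−1, 1] whose unique equilibrium one can check to be x* = y* = (2/5, 3/5), and tracks KL(z*, z_t):

```python
import numpy as np

def kl(p, q):
    # KL(p, q) = sum_i p_i ln(p_i / q_i) for fully supported p.
    return float(np.sum(p * np.log(p / q)))

def mwu(p, g, eta):
    # Multiplicative weights step: p_i proportional to p_i * exp(-eta * g_i).
    w = p * np.exp(-eta * g)
    return w / w.sum()

def run_omwu(G, eta, T):
    # OMWU for f(x, y) = x^T G y: the played iterate uses the previous
    # gradient, the "hat" iterate the current one. The y-player ascends,
    # which we encode by negating its gradient before the mwu step.
    M, N = G.shape
    xh, yh = np.ones(M) / M, np.ones(N) / N
    gx, gy = G @ yh, -(G.T @ xh)
    kls = []
    for _ in range(T):
        x, y = mwu(xh, gx, eta), mwu(yh, gy, eta)
        gx, gy = G @ y, -(G.T @ x)
        xh, yh = mwu(xh, gx, eta), mwu(yh, gy, eta)
        kls.append(kl(x_star, x) + kl(y_star, y))   # KL(z*, z_t)
    return kls

# Hypothetical game with entries in [-1, 1]; the unique equilibrium is
# x* = y* = (2/5, 3/5), unchanged by the scaling of the payoffs.
G = np.array([[1.0, -0.5], [-0.5, 0.5]])
x_star = np.array([0.4, 0.6])
y_star = np.array([0.4, 0.6])
kls = run_omwu(G, eta=0.1, T=5000)
```

Plotting ln(kls) against t would show the straight line predicted by the linear rate; here we only check that the measure keeps shrinking.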

5. CONVERGENCE RESULTS FOR OGDA

In this section, we provide last-iterate convergence results for OGDA, which are much more general than those in Section 4. We propose a general condition subsuming many well-studied cases, under which OGDA enjoys a concrete last-iterate convergence guarantee in terms of the L2 distance between z_t and Z*. The results in this part can be specialized to the setting of bilinear games over the simplex, but the unique equilibrium assumption made in Section 4 and in (Daskalakis & Panageas, 2019b) is no longer needed. Throughout the section we make the assumption that f is L-smooth:

Assumption 1. For any z, z' ∈ Z, ‖F(z) − F(z')‖ ≤ L‖z − z'‖ holds.

To introduce our general condition, we first provide some intuition by applying Lemma 1 again. Letting ψ(u) = (1/2)‖u‖² in Lemma 1, we get that for OGDA, for any z ∈ Z and any t ≥ 1,

2ηF(z_t)ᵀ(z_t − z) ≤ ‖ẑ_t − z‖² − ‖ẑ_{t+1} − z‖² − ‖ẑ_{t+1} − z_t‖² − (15/16)‖z_t − ẑ_t‖² + (1/16)‖ẑ_t − z_{t−1}‖².

Now we instantiate the inequality above with z = Π_{Z*}(ẑ_t) ∈ Z*. Since z = Π_{Z*}(ẑ_t) is an equilibrium, we have F(z_t)ᵀ(z_t − z) ≥ f(x_t, y_t) − f(x, y_t) + f(x_t, y) − f(x_t, y_t) = f(x_t, y) − f(x, y_t) ≥ 0 by the convexity/concavity of f and the optimality of z, and thus

‖ẑ_{t+1} − Π_{Z*}(ẑ_t)‖² ≤ ‖ẑ_t − Π_{Z*}(ẑ_t)‖² − ‖ẑ_{t+1} − z_t‖² − (15/16)‖z_t − ẑ_t‖² + (1/16)‖ẑ_t − z_{t−1}‖².

Further noting that the left-hand side is lower bounded by dist²(ẑ_{t+1}, Z*) by definition, we arrive at

dist²(ẑ_{t+1}, Z*) ≤ dist²(ẑ_t, Z*) − ‖ẑ_{t+1} − z_t‖² − (15/16)‖z_t − ẑ_t‖² + (1/16)‖ẑ_t − z_{t−1}‖².   (4)

Similarly, we define Θ_t = ‖ẑ_t − Π_{Z*}(ẑ_t)‖² + (1/16)‖ẑ_t − z_{t−1}‖² and ζ_t = ‖ẑ_{t+1} − z_t‖² + ‖z_t − ẑ_t‖², and rewrite the above as Θ_{t+1} ≤ Θ_t − (15/16)ζ_t. As in Section 4, our goal now is to lower bound ζ_t by some quantity related to dist²(ẑ_{t+1}, Z*), and then use Eq. (4) to obtain a convergence rate for Θ_t. In order to incorporate more general objective functions into the discussion, in the following Lemma 4 we provide an intermediate lower bound for ζ_t, which will be further related to dist²(ẑ_{t+1}, Z*) later.

Lemma 4. For any t ≥ 0 and z' ∈ Z with z' ≠ ẑ_{t+1}, OGDA with η ≤ 1/(8L) ensures

‖ẑ_{t+1} − z_t‖² + ‖z_t − ẑ_t‖² ≥ (32/81)η² [F(ẑ_{t+1})ᵀ(ẑ_{t+1} − z')]²_+ / ‖ẑ_{t+1} − z'‖²,   (5)

where [a]_+ = max{a, 0}, and similarly, for z' ≠ z_{t+1},

‖ẑ_{t+1} − z_{t+1}‖² + ‖z_t − z_{t+1}‖² ≥ (32/81)η² [F(z_{t+1})ᵀ(z_{t+1} − z')]²_+ / ‖z_{t+1} − z'‖².   (6)

We note that a direct consequence of Lemma 4 is an "average duality gap" guarantee for OGDA when Z is bounded:

(1/T) Σ_{t=1}^T α_f(z_t) = (1/T) Σ_{t=1}^T max_{x'∈X, y'∈Y} (f(x_t, y') − f(x', y_t)) = O(D/(η√T)),   (7)

where D = sup_{z,z'∈Z} ‖z − z'‖ is the diameter of Z (the duality gap may be undefined when Z is unbounded). We are not aware of any previous work that gives this result for the constrained case. See Appendix E for the proof of Eq. (7) and comparisons with previous works. However, to obtain last-iterate convergence results, we need to make sure that the right-hand side of Eq. (5) is large enough.
Motivated by this fact, we propose the following general condition on f and Z to achieve so.

Definition 1 (Saddle-Point Metric Subregularity (SP-MS)). The SP-MS condition is defined as: for any z ∈ Z \ Z* with z* = Π_{Z*}(z),

sup_{z'∈Z} F(z)ᵀ(z − z') / ‖z − z'‖ ≥ C‖z − z*‖^{β+1}   (SP-MS)

holds for some parameters β ≥ 0 and C > 0.

We call this condition Saddle-Point Metric Subregularity because the case with β = 0 is equivalent to one type of metric subregularity in variational inequality problems, as we prove in Appendix F. The condition is also closely related to other error bound conditions that have been identified for variational inequality problems (e.g., Tseng (1995); Gilpin et al. (2008); Malitsky (2019)). Although these works have shown that under similar conditions their algorithms exhibit linear convergence, to the best of our knowledge, there is no previous work that analyzes OGDA or other no-regret learning algorithms using such conditions.

SP-MS covers many standard settings studied in the literature. The first and perhaps the most important example is bilinear games with a polytope feasible set, which in particular includes the classic two-player matrix games considered in Section 4.

Theorem 5. A bilinear game f(x, y) = xᵀGy with X ⊆ R^M and Y ⊆ R^N being polytopes and G ∈ R^{M×N} satisfies SP-MS with β = 0.

We emphasize again that, different from Lemma 2, Theorem 5 does not require a unique equilibrium. Note that we have not provided the concrete form of the parameter C in the theorem (which depends on X, Y, and G), but it can be found in the proof (see Appendix G). The next example shows that strongly-convex-strongly-concave problems are also special cases of our condition.

Theorem 6. If f is strongly convex in x and strongly concave in y, then SP-MS holds with β = 0.

Next, we provide a toy example where SP-MS holds with β > 0.

Theorem 7. Let X = Y = {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}, let n > 2 be an integer, and let f(x, y) = x_1^{2n} − x_1y_1 − y_1^{2n}.
Then SP-MS holds with β = 2n − 2.

With this general condition, we are now able to complete the loop: for any value of β, we show the following last-iterate convergence guarantee for OGDA.

Theorem 8. For any η ≤ 1/(8L), if SP-MS holds with β = 0, then OGDA guarantees linear last-iterate convergence:

dist²(z_t, Z*) ≤ 64 dist²(ẑ_1, Z*)(1 + C_5)^{−t};   (8)

on the other hand, if the condition holds with β > 0, then we have a slower convergence:

dist²(z_t, Z*) ≤ 32((1 + 4·4^β)^{1/β} dist²(ẑ_1, Z*) + (2²/(C_5β))^{1/β}) t^{−1/β},

where C_5 = min{16η²C²/81, 1/2}.

We defer the proof to Appendix I and make several remarks. First, note that based on a convergence result for dist²(z_t, Z*), one can immediately obtain a convergence guarantee for the duality gap α_f(z_t) as long as f is also Lipschitz. This is because α_f(z_t) ≤ max_{x',y'} (f(x_t, y') − f(x*, y') + f(x', y*) − f(x', y_t)) ≤ O(‖x_t − x*‖ + ‖y_t − y*‖) = O(√(dist²(z_t, Z*))), where (x*, y*) = Π_{Z*}(z_t). While this leads to stronger guarantees compared to Eq. (7), we emphasize that the latter holds even without the SP-MS condition. Second, our results significantly generalize (Hsieh et al., 2019, Theorem 2), which itself is a consolidated version of several earlier works and also shows a linear convergence rate of OGDA under a condition stronger than our SP-MS with β = 0, as discussed earlier. More specifically, our results show that linear convergence holds for a much broader set of problems. Furthermore, we also show slower sublinear convergence rates for any value of β > 0, which is also new as far as we know. In particular, we empirically verify that OGDA indeed does not converge exponentially fast for the toy example defined in Theorem 7 (see Appendix A).
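The toy example of Theorem 7 becomes one-dimensional after using the constraint a + b = 1, so it is easy to probe numerically. The following sketch (our own illustration, with n = 3 and a clipping projection onto [0, 1]) runs projected OGDA on it and shows that the iterate approaches the equilibrium x_1 = y_1 = 0 only slowly, consistent with the sublinear rate for β = 2n − 2:

```python
# Toy example of Theorem 7 with n = 3: on the segment a + b = 1 only the
# first coordinates matter, giving the scalar problem
#   min_{a in [0,1]} max_{c in [0,1]}  a^6 - a*c - c^6,
# whose unique equilibrium is a = c = 0.
n = 3
eta = 0.1

def grad_a(a, c):
    return 2 * n * a ** (2 * n - 1) - c      # df/da

def grad_c(a, c):
    return -a - 2 * n * c ** (2 * n - 1)     # df/dc (c is the ascending player)

def clip01(v):
    return min(max(v, 0.0), 1.0)             # projection onto [0, 1]

# Projected OGDA: the played iterate uses the previous gradient,
# the hat iterate uses the current one.
ah, ch = 0.5, 0.5
ga, gc = grad_a(ah, ch), grad_c(ah, ch)
for _ in range(5000):
    a = clip01(ah - eta * ga)
    c = clip01(ch + eta * gc)
    ga, gc = grad_a(a, c), grad_c(a, c)
    ah = clip01(ah - eta * ga)
    ch = clip01(ch + eta * gc)
```

After 5000 iterations the y-player has settled at the equilibrium, but the x-player is still visibly far from it: the flat gradient 6a^5 near a = 0 slows the descent to a polynomial rate, in contrast to the linear rates of the β = 0 cases.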
Last but not least, the most significant implication of Theorem 8 is that it provides by far the most general linear convergence result for OGDA for the classic two-player matrix games, or more generally for bilinear games with polytope constraints, according to Theorem 5 and Eq. (8). Compared to the recent works (Daskalakis & Panageas, 2018; 2019b) for matrix games (on OGDA or OMWU), our result is considerably stronger: 1) we do not require a unique equilibrium while they do; 2) linear convergence holds for any initial point z_1, while their result only holds if the initial point is in a small neighborhood of the unique equilibrium (otherwise the convergence is sublinear initially); 3) our only requirement on the step size is η ≤ 1/(8L), while they require an exponentially small η, which does not reflect the behavior of the algorithms in practice. Even compared with our result in Section 4, we see that for OGDA the unique equilibrium assumption is not required, and we do not have an initial phase of sublinear convergence as in Lemma 2. In Appendix A, we empirically show that OGDA often outperforms OMWU when both are tuned with a constant learning rate.

One may wonder what happens if a bilinear game has a non-polytope constraint. It turns out that in this case, SP-MS may only hold with β > 0, due to the following example showing that linear convergence provably does not hold for OGDA when the feasible set has a curved boundary.

Theorem 9. There exists a bilinear game with a non-polytope feasible set such that SP-MS holds with β = 3, and dist²(z_t, Z*) = Ω(1/t²) holds for OGDA.

This example indicates that the shape of the feasible set plays an important role in last-iterate convergence, which may be an interesting future direction to investigate. This is also verified empirically in our experiments (see Appendix A).

6. EXPERIMENTS FOR MATRIX GAMES

In this section, we provide empirical results on the performance of OGDA and OMWU for matrix games over the probability simplex. We include more empirical results in other settings in Appendix A. We set the size of the game matrix to 32 × 32, generate a random matrix with each entry G_ij drawn uniformly at random from [−1, 1], and finally rescale its operator norm to 1. With probability 1, the game has a unique Nash equilibrium (Daskalakis & Panageas, 2019b). We compare the performances of OGDA and OMWU. For both algorithms, we choose a series of different learning rates and compare their performances, as shown in Figure 1. The x-axis represents the time step t, and the y-axis represents ln(KL(z*, z_t)) (we observe similar results using dist²(z*, z_t) or the duality gap as the measure; see Appendix A.1). Note that here we approximate z* by running OGDA for many more iterations and taking the very last iterate. We also verify that the iterates of OMWU converge to the same point as OGDA. From Figure 1, we see that all curves eventually become a straight line, supporting our linear convergence results. Generally, the slope of the straight line is larger for a larger learning rate η. However, the algorithm diverges when η exceeds some value (such as 11 for the case of OMWU). Comparing OMWU and OGDA, we see that OGDA converges faster, which is also consistent with our theory if one compares the bounds in Theorem 3 and Theorem 8 (with the values of the constants revealed in the proofs). We find this observation interesting, since OMWU is usually considered more favorable for problems defined over the simplex, especially in terms of regret minimization. Our experiments suggest, however, that in terms of last-iterate convergence, OGDA might perform even better than OMWU.
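The experimental setup can be reproduced in a few lines. Below is a minimal sketch (our own code, not the authors' implementation) that generates the random game, rescales its operator norm to 1, runs projected OGDA, and tracks the duality gap of the played iterate:

```python
import numpy as np

def project_simplex(v):
    # Sort-based Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1), 0.0)

rng = np.random.default_rng(0)
G = rng.uniform(-1.0, 1.0, size=(32, 32))
G /= np.linalg.norm(G, 2)                    # rescale the operator norm to 1

eta, T = 0.1, 3000
xh = np.ones(32) / 32                        # hat iterates start uniform
yh = np.ones(32) / 32
gx, gy = G @ yh, G.T @ xh
gaps = []
for _ in range(T):
    x = project_simplex(xh - eta * gx)       # play with the previous gradient
    y = project_simplex(yh + eta * gy)
    gx, gy = G @ y, G.T @ x                  # gradient at the played point
    xh = project_simplex(xh - eta * gx)
    yh = project_simplex(yh + eta * gy)
    gaps.append(float((G.T @ x).max() - (G @ y).min()))  # duality gap
```

Plotting ln(gaps) (or ln(KL(z*, z_t)) with z* estimated from a much longer run, as done in the paper) against t produces the eventually-straight lines of Figure 1.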

A.1 MORE EMPIRICAL RESULTS FOR MATRIX GAMES

Here, we provide more plots for the same matrix game experiment described in Section 6. Specifically, the left plot in Figure 2 shows the convergence with respect to ln‖z_t − z^*‖, while the right plot shows the convergence with respect to the logarithm of the duality gap, ln(α_f(z_t)) = ln(max_j (G^T x_t)_j − min_i (Gy_t)_i). One can see that the plots are very similar to those in Figure 1. Figure 2: Experiments of OGDA and OMWU with different learning rates on a matrix game f(x, y) = x^T Gy, where we generate G ∈ R^{32×32} with each entry G_ij drawn uniformly at random from [-1, 1] and then rescale G's operator norm to 1. "OGDA/OMWU-eta=η" represents the curve of OGDA/OMWU with learning rate η. The configuration order in the legend is consistent with the order of the curves. For OMWU, η ≥ 11 makes the algorithm diverge. The plot confirms the linear convergence of OMWU and OGDA, although OGDA is generally observed to converge faster than OMWU.

A.2 MATRIX GAME ON CURVED REGIONS

Next, we conduct experiments on a bilinear game similar to the one constructed in the proof of Theorem 9. Specifically, the bilinear game is defined by f(x, y) = x_2y_1 − x_1y_2, X = Y = {(a, b) : 0 ≤ a ≤ 1/2, 0 ≤ b ≤ 1/2^n, a^n ≤ b}. For any positive integer n, the equilibrium point of this game is (0, 0) for both x and y. Note that in Theorem 9, we prove that when n = 2, OGDA converges at a rate no better than dist^2(z_t, Z^*) = Ω(1/t^2) in this game. Figure 3 shows the empirical results for various values of n. In this figure, we plot ‖z_t − z^*‖ versus time step t in log-log scale. Note that in a log-log plot, a straight line with slope s implies a convergence rate of order O(t^s), that is, a sublinear (polynomial) convergence rate. It is clear from Figure 3 that OGDA indeed converges sublinearly for all n, supporting our Theorem 9.
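Reading off the rate from Figure 3 amounts to fitting a line in log-log space: a fitted slope s corresponds to a rate of order O(t^s). A small helper for this estimate (our own illustrative code, assuming the error sequence is available as an array):

```python
import numpy as np

def loglog_slope(ts, vals):
    """Least-squares slope of log(vals) against log(ts).

    For vals proportional to t**s, the returned slope is s.
    """
    A = np.vstack([np.log(ts), np.ones(len(ts))]).T
    slope, _intercept = np.linalg.lstsq(A, np.log(vals), rcond=None)[0]
    return float(slope)
```

For example, applying this to the tail of the OGDA error curve on the curved-region game should produce a negative slope of constant order, rather than the increasingly steep descent that linear convergence would display in a log-log plot.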

A.3 STRONGLY-CONVEX-STRONGLY-CONCAVE GAMES

In this section, we use the same experiment setup for strongly-convex-strongly-concave games as in (Lei et al., 2021), where f(x, y) = x_1^2 − y_1^2 + 2x_1y_1, and X = Y = {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}. The equilibrium point is (0, 1) for both x and y. In Figure 4, we present the log plot of ‖z_t − z^*‖ versus time step t and compare OGDA with OMWU using different learning rates as in Appendix A.1. The straight line of OGDA implies that OGDA converges exponentially fast, supporting Theorem 6 and Theorem 8. Also note that here OGDA outperforms OMWU, which differs from the empirical results shown in (Lei et al., 2021). We hypothesize that this is because they use a different version of OGDA. (Captions of the figures referenced in this appendix: Figure 3 uses f(x, y) = x_2y_1 − x_1y_2 with X = Y = {(a, b) : 0 ≤ a ≤ 1/2, 0 ≤ b ≤ 1/2^n, a^n ≤ b}; Figure 5 uses f(x, y) = x_1^{2n} − x_1y_1 − y_1^{2n} with X = Y = {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}. Both show that OGDA converges to the Nash equilibrium at sublinear rates in these instances.)
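This experiment is easy to replicate. Below is a minimal sketch (our own code; a single-sequence OGDA variant rather than the paper's exact formulation) for f(x, y) = x_1^2 − y_1^2 + 2x_1y_1 on the segment {(a, b) : a + b = 1, 0 ≤ a, b ≤ 1}, whose equilibrium is x = y = (0, 1); `proj_segment` and `ogda_scsc` are hypothetical names.

```python
import numpy as np

def proj_segment(v):
    """Project onto {(a, 1-a) : 0 <= a <= 1}: project onto the line a + b = 1, then clip."""
    a = np.clip((v[0] - v[1] + 1.0) / 2.0, 0.0, 1.0)
    return np.array([a, 1.0 - a])

def ogda_scsc(eta=0.04, T=5000):
    """OGDA on f(x, y) = x1^2 - y1^2 + 2*x1*y1; the equilibrium is x = y = (0, 1)."""
    x = np.array([0.5, 0.5])
    y = x.copy()
    gx_prev = np.array([2 * x[0] + 2 * y[0], 0.0])
    gy_prev = np.array([-2 * y[0] + 2 * x[0], 0.0])
    for _ in range(T):
        gx = np.array([2 * x[0] + 2 * y[0], 0.0])   # grad_x f
        gy = np.array([-2 * y[0] + 2 * x[0], 0.0])  # grad_y f
        x = proj_segment(x - eta * (2 * gx - gx_prev))  # x descends
        y = proj_segment(y + eta * (2 * gy - gy_prev))  # y ascends
        gx_prev, gy_prev = gx, gy
    return x, y
```

Plotting log‖z_t − z^*‖ for these iterates gives the straight line described above, i.e., exponential convergence of the last iterate.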

A.4 AN EXAMPLE WITH β > 0 FOR SP-MS

We also consider the toy example in Theorem 7, where f(x, y) = x_1^{2n} − x_1y_1 − y_1^{2n} for some integer n ≥ 2 and X = Y = {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}. The equilibrium point is (0, 1) for both x and y. We prove in Theorem 7 that SP-MS does not hold for β = 0 but does hold for β = 2n − 2. The point-wise convergence result is shown in Figure 5, which is again a log-log plot of ‖z_t − z^*‖ versus time step t. One can observe that the convergence rate of OGDA is sublinear, supporting our theory again.

A.5 MATRIX GAMES WITH MULTIPLE NASH EQUILIBRIA

Finally, we provide empirical results for OGDA and OMWU in matrix games with multiple Nash equilibria, even though theoretically we only prove linear convergence results for OMWU assuming that the Nash equilibrium is unique. We consider the following game matrix:

G = [  0  -1   1   0   0
       1   0  -1   0   0
      -1   1   0   0   0
      -1   1   0   2  -1
      -1   1   0  -1   2 ].

The value of G is 0. To verify this, consider x_0 = y_0 = (1/3, 1/3, 1/3, 0, 0). Then we have max_{y∈∆_5} x_0^T Gy = min_{x∈∆_5} x^T Gy_0 = 0. Direct calculation gives the following set of Nash equilibria: X^* = {x_0}, Y^* = {y ∈ ∆_5 : y_1 = y_2 = y_3; (1/2)y_5 ≤ y_4 ≤ 2y_5}. Figure 6 shows the point-wise convergence result, where Π_{Z^*}(z_t) is the projection of z_t onto the set of Nash equilibria. One can observe from the plots that both OGDA and OMWU achieve a linear convergence rate in this example. We thus conjecture that the uniqueness assumption for Theorem 3 can be further relaxed. Figure 6: Experiments of OGDA and OMWU with different learning rates on a matrix game with multiple Nash equilibria. "OGDA/OMWU-eta=η" represents the curve of OGDA/OMWU with learning rate η. We observe from these plots that both OGDA and OMWU enjoy a linear convergence rate, even though we are only able to show the linear convergence of OMWU under the uniqueness assumption.
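The claimed value and equilibrium set are easy to check numerically: over the simplex, max_y x^T Gy = max_j (G^T x)_j and min_x x^T Gy = min_i (Gy)_i. A quick verification (our own code) of the value and of one member of Y^*:

```python
import numpy as np

G = np.array([
    [ 0., -1.,  1.,  0.,  0.],
    [ 1.,  0., -1.,  0.,  0.],
    [-1.,  1.,  0.,  0.,  0.],
    [-1.,  1.,  0.,  2., -1.],
    [-1.,  1.,  0., -1.,  2.],
])

x0 = np.array([1/3, 1/3, 1/3, 0, 0])
y0 = x0.copy()

# Game value is 0: the best response to x0 (resp. y0) cannot do better than 0.
assert np.isclose(np.max(G.T @ x0), 0.0)
assert np.isclose(np.min(G @ y0), 0.0)

# One point of Y*: y1 = y2 = y3 = 1/4 and y4 = y5 = 1/8, so (1/2)y5 <= y4 <= 2y5.
y = np.array([1/4, 1/4, 1/4, 1/8, 1/8])
assert np.isclose(y.sum(), 1.0)
assert np.isclose(np.min(G @ y), 0.0)  # y also guarantees the game value
```

Perturbing y_4 outside the interval [(1/2)y_5, 2y_5] makes min_i (Gy)_i drop below 0, so such a y is no longer a maximin strategy, consistent with the description of Y^*.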

B LEMMAS FOR OPTIMISTIC MIRROR DESCENT

We prove Lemma 1 in this section. To do so, we use the following two lemmas.

Lemma 10. Let A be a convex set and u' = argmin_{u''∈A} { u''^T g + D_ψ(u'', u) }. Then for any u* ∈ A, (u' − u*)^T g ≤ D_ψ(u*, u) − D_ψ(u*, u') − D_ψ(u', u). (10)

Proof. Since D_ψ(u', u) = ψ(u') − ψ(u) − ∇ψ(u)^T(u' − u), by the first-order optimality condition of u', we have (g + ∇ψ(u') − ∇ψ(u))^T(u* − u') ≥ 0. On the other hand, notice that the right-hand side of Eq. (10) is ψ(u*) − ψ(u) − ∇ψ(u)^T(u* − u) − ψ(u*) + ψ(u') + ∇ψ(u')^T(u* − u') − ψ(u') + ψ(u) + ∇ψ(u)^T(u' − u) = (∇ψ(u') − ∇ψ(u))^T(u* − u'). Therefore, Eq. (10) is equivalent to (g + ∇ψ(u') − ∇ψ(u))^T(u* − u') ≥ 0, which we have already shown above.

Lemma 11. Suppose that ψ satisfies D_ψ(x, x') ≥ (1/2)‖x − x'‖_p^2 for some p ≥ 1, and let u, u_1, u_2 ∈ A (a convex set) be related by the following: u_1 = argmin_{u'∈A} { u'^T g_1 + D_ψ(u', u) }, u_2 = argmin_{u'∈A} { u'^T g_2 + D_ψ(u', u) }. Then we have ‖u_1 − u_2‖_p ≤ ‖g_1 − g_2‖_q, where q ≥ 1 and 1/p + 1/q = 1.

Proof. By the first-order optimality conditions of u_1 and u_2, we have (∇ψ(u_1) − ∇ψ(u) + g_1)^T(u_2 − u_1) ≥ 0 and (∇ψ(u_2) − ∇ψ(u) + g_2)^T(u_1 − u_2) ≥ 0. Summing them up and rearranging the terms, we get (u_2 − u_1)^T(g_1 − g_2) ≥ (∇ψ(u_1) − ∇ψ(u_2))^T(u_1 − u_2). (11) By the condition on ψ, we have ∇ψ(u_1)^T(u_1 − u_2) ≥ ψ(u_1) − ψ(u_2) + (1/2)‖u_1 − u_2‖_p^2 and ∇ψ(u_2)^T(u_2 − u_1) ≥ ψ(u_2) − ψ(u_1) + (1/2)‖u_1 − u_2‖_p^2. Summing them up, we get (∇ψ(u_1) − ∇ψ(u_2))^T(u_1 − u_2) ≥ ‖u_1 − u_2‖_p^2. Combining this with Eq. (11), we get (u_2 − u_1)^T(g_1 − g_2) ≥ ‖u_1 − u_2‖_p^2. Since (u_2 − u_1)^T(g_1 − g_2) ≤ ‖u_1 − u_2‖_p ‖g_1 − g_2‖_q by Hölder's inequality, we further get ‖u_1 − u_2‖_p ≤ ‖g_1 − g_2‖_q.

Proof of Lemma 1. Considering Eq. (2), and using Lemma 10 with u = ẑ_t, u' = ẑ_{t+1}, u* = z, and g = ηF(z_t), we get ηF(z_t)^T(ẑ_{t+1} − z) ≤ D_ψ(z, ẑ_t) − D_ψ(z, ẑ_{t+1}) − D_ψ(ẑ_{t+1}, ẑ_t). Considering Eq.
(1), and using Lemma 10 with u = ẑ_t, u' = z_t, u* = ẑ_{t+1}, and g = ηF(z_{t−1}), we get ηF(z_{t−1})^T(z_t − ẑ_{t+1}) ≤ D_ψ(ẑ_{t+1}, ẑ_t) − D_ψ(ẑ_{t+1}, z_t) − D_ψ(z_t, ẑ_t). Summing up the two inequalities above, and adding η(F(z_t) − F(z_{t−1}))^T(z_t − ẑ_{t+1}) to both sides, we get

ηF(z_t)^T(z_t − z) ≤ D_ψ(z, ẑ_t) − D_ψ(z, ẑ_{t+1}) − D_ψ(ẑ_{t+1}, z_t) − D_ψ(z_t, ẑ_t) + η(F(z_t) − F(z_{t−1}))^T(z_t − ẑ_{t+1}). (12)

Using Lemma 11 with u = x̂_t, u_1 = x_t, u_2 = x̂_{t+1}, g_1 = η∇_x f(z_{t−1}) and g_2 = η∇_x f(z_t), we get ‖x_t − x̂_{t+1}‖_p ≤ η‖∇_x f(z_{t−1}) − ∇_x f(z_t)‖_q. Similarly, we have ‖y_t − ŷ_{t+1}‖_p ≤ η‖∇_y f(z_t) − ∇_y f(z_{t−1})‖_q. Therefore, by Hölder's inequality, we have

η(F(z_t) − F(z_{t−1}))^T(z_t − ẑ_{t+1}) ≤ η‖x_t − x̂_{t+1}‖_p ‖∇_x f(z_{t−1}) − ∇_x f(z_t)‖_q + η‖y_t − ŷ_{t+1}‖_p ‖∇_y f(z_{t−1}) − ∇_y f(z_t)‖_q ≤ η^2 ‖∇_x f(z_{t−1}) − ∇_x f(z_t)‖_q^2 + η^2 ‖∇_y f(z_{t−1}) − ∇_y f(z_t)‖_q^2 = η^2 dist_q^2(F(z_t), F(z_{t−1})) ≤ η^2 L^2 dist_p^2(z_t, z_{t−1}) (by assumption) ≤ (1/64) dist_p^2(z_t, z_{t−1}). (by our choice of η)

Continuing from Eq. (12), we then have

ηF(z_t)^T(z_t − z) ≤ D_ψ(z, ẑ_t) − D_ψ(z, ẑ_{t+1}) − D_ψ(ẑ_{t+1}, z_t) − D_ψ(z_t, ẑ_t) + (1/64) dist_p^2(z_t, z_{t−1})
≤ D_ψ(z, ẑ_t) − D_ψ(z, ẑ_{t+1}) − D_ψ(ẑ_{t+1}, z_t) − D_ψ(z_t, ẑ_t) + (1/32) dist_p^2(z_t, ẑ_t) + (1/32) dist_p^2(ẑ_t, z_{t−1}) (since ‖u + v‖_p^2 ≤ (‖u‖_p + ‖v‖_p)^2 ≤ 2‖u‖_p^2 + 2‖v‖_p^2)
≤ D_ψ(z, ẑ_t) − D_ψ(z, ẑ_{t+1}) − D_ψ(ẑ_{t+1}, z_t) − D_ψ(z_t, ẑ_t) + (1/16) D_ψ(z_t, ẑ_t) + (1/16) D_ψ(ẑ_t, z_{t−1}) (by the assumption on ψ)
= D_ψ(z, ẑ_t) − D_ψ(z, ẑ_{t+1}) − D_ψ(ẑ_{t+1}, z_t) − (15/16) D_ψ(z_t, ẑ_t) + (1/16) D_ψ(ẑ_t, z_{t−1}).

This concludes the proof.

C AN AUXILIARY LEMMA ON RECURSIVE FORMULAS

Here, we provide an auxiliary lemma that gives an explicit bound based on a particular recursive formula. This will be useful later for deriving the convergence rate.

Lemma 12. Consider a non-negative sequence {B_t}_{t=1,2,...} that satisfies, for some p > 0 and q > 0,
• B_{t+1} ≤ B_t − qB_{t+1}^{p+1}, ∀t ≥ 1,
• q(1 + p)B_1^p ≤ 1.
Then B_t ≤ ct^{−1/p}, where c = max{ B_1, (2/(qp))^{1/p} }.

Proof. We first prove that B_{t+1} ≤ B_t − (q/2)B_t^{p+1}. Notice that since the B_t are all non-negative, by the first condition, we have B_{t+1} ≤ B_t ≤ ... ≤ B_1. Using the fundamental theorem of calculus, we have B_t^{p+1} − B_{t+1}^{p+1} = ∫_{B_{t+1}}^{B_t} (d/dx) x^{p+1} dx = (p+1) ∫_{B_{t+1}}^{B_t} x^p dx ≤ (p+1)(B_t − B_{t+1})B_t^p, and thus B_{t+1} ≤ B_t − qB_{t+1}^{p+1} ≤ B_t − qB_t^{p+1} + q(p+1)(B_t − B_{t+1})B_t^p. By rearranging, we get B_{t+1} ≤ (1 − qB_t^p/(1 + q(1+p)B_t^p)) B_t ≤ (1 − qB_t^p/2) B_t = B_t − (q/2)B_t^{p+1}, where the last inequality is because q(1+p)B_t^p ≤ q(1+p)B_1^p ≤ 1.

Below we use induction to prove B_t ≤ ct^{−1/p}, where c = max{ B_1, (2/(qp))^{1/p} }. This clearly holds for t = 1. Suppose that it holds for 1, ..., t. Note that the function f(B_t) = (1 − (q/2)B_t^p)B_t is increasing in B_t as f'(B_t) = 1 − (q(p+1)/2)B_t^p ≥ 1 − (q(p+1)/2)B_1^p ≥ 0. Therefore, we apply the induction hypothesis and get B_{t+1} ≤ (1 − (q/2)B_t^p)B_t ≤ (1 − (q/2)c^p t^{−1}) ct^{−1/p} = ct^{−1/p} − (q/2)c^{p+1} t^{−1−1/p} ≤ ct^{−1/p} − (c/p) t^{−1−1/p} (c/p ≤ (q/2)c^{p+1} by the definition of c) ≤ c(t + 1)^{−1/p}, where the last inequality is by the fundamental theorem of calculus: t^{−1/p} − (1 + t)^{−1/p} = −∫_t^{1+t} (d/dx) x^{−1/p} dx = ∫_t^{t+1} (1/p) x^{−1−1/p} dx ≤ (1/p) t^{−1−1/p}. This completes the induction.
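Lemma 12 can be sanity-checked numerically by running the recursion with equality (the worst case allowed by the first condition) and comparing against the claimed bound. The implicit update is solved by bisection; this is our own illustrative code.

```python
def next_B(B, p, q, iters=100):
    """Solve B' + q * B'**(p + 1) = B for B' in [0, B] by bisection.

    The left-hand side is increasing in B', so the root is unique; returning
    the lower bisection endpoint keeps B' + q*B'**(p+1) <= B, i.e. the
    recursion of Lemma 12 holds.
    """
    lo, hi = 0.0, B
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + q * mid ** (p + 1) > B:
            hi = mid
        else:
            lo = mid
    return lo

def check_lemma12(B1, p, q, T=500):
    """Verify B_t <= c * t**(-1/p) with c = max(B1, (2/(q*p))**(1/p))."""
    assert q * (1 + p) * B1 ** p <= 1.0, "Lemma 12's precondition"
    c = max(B1, (2.0 / (q * p)) ** (1.0 / p))
    B = B1
    for t in range(1, T + 1):
        assert B <= c * t ** (-1.0 / p) + 1e-12
        B = next_B(B, p, q)
    return c
```

For instance, with B_1 = 1, p = 1, and q = 0.1 the constant is c = max(1, 2/(0.1·1)) = 20, and the iterates indeed stay below 20/t.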

D PROOFS OF LEMMA 2 AND THEOREM 3

In this section, we consider f(x, y) = x^T Gy with X = ∆_M and Y = ∆_N being probability simplexes and G ∈ [-1, 1]^{M×N}. We assume that G has a unique Nash equilibrium z^* = (x^*, y^*). The value of the game is denoted by ρ = min_{x∈X} max_{y∈Y} x^T Gy = max_{y∈Y} min_{x∈X} x^T Gy = x^{*T} Gy^*. Before proving Lemma 2 and Theorem 3, in Section D.1 we define some constants for later analysis; in Section D.2 we state more auxiliary lemmas, which are useful when proving Lemma 2 and Theorem 3 in Section D.3.

D.1 SOME PROBLEM-DEPENDENT CONSTANTS

First, we define a constant ξ that is determined by G. Definition 2. ξ min min i / ∈supp(x * ) (Gy * ) i -ρ, ρ -max i / ∈supp(y * ) (G x * ) i ∈ (0, 1]. The fact ξ ≤ 1 can be shown by: ξ ≤ min i / ∈supp(x * ) (Gy * ) i -ρ + ρ -max i / ∈supp(y * ) (G x * ) i 2 ≤ Gy * ∞ + G x * ∞ 2 ≤ 1, while the fact ξ > 0 is a direct consequence of Lemma C.3 of Mertikopoulos et al. (2018) , stated below. Lemma 13 (Lemma C.3 of Mertikopoulos et al. (2018) ). Let G ∈ R M ×N be a game matrix for a two-player zero-sum game with value ρ. Then there exists a Nash equilibrium (x * , y * ) such that (Gy * ) i = ρ ∀i ∈ supp(x * ), (Gy * ) i > ρ ∀i / ∈ supp(x * ), (G x * ) i = ρ ∀i ∈ supp(y * ), (G x * ) i < ρ ∀i / ∈ supp(y * ). Below, we define V * (Z) = V * (X ) × V * (Y), where V * (X ) {x : x ∈ ∆ M , supp(x) ⊆ supp(x * )} and V * (Y) {y : y ∈ ∆ N , supp(y) ⊆ supp(y * )}. Definition 3. c x min x∈∆ M \{x * } max y∈V * (Y) (x -x * ) Gy x -x * 1 , c y min y∈∆ N \{y * } max x∈V * (X ) x G(y * -y) y * -y 1 . Note that in the definition of c x and c y , the outer minimization is over an open set, which may make the definition problematic as the optimal value may not be attained. However, the following lemma shows that c x and c y are well-defined. Lemma 14. c x and c y are well-defined, and 0 < c x , c y ≤ 1. Proof. We first show c x and c y are well-defined. To simplify the notations, we define x * min min i∈supp(x * ) x * i and X {x : x ∈ ∆ M , xx * 1 ≥ x * min }, and define y * min and Y similarly. We will show that c x = min x∈X max y∈V * (Y) (x -x * ) Gy x -x * 1 , c y = min y∈Y max x∈V * (X ) x G(y * -y) y * -y 1 , which are well-defined as the outer minimization is now over a closed set. Consider c x , it suffices to show that for any x ∈ ∆ M such that x = x * and xx * 1 < x * min , there exists x ∈ ∆ M such that xx * 1 = x * min and (x -x * ) Gy x -x * 1 = (x -x * ) Gy x -x * 1 , ∀y. In fact, we can simply choose x = x * + (x -x * ) • x * min x-x * 1 . 
We first argue that x is still in ∆ M . For each j ∈ [K], if x j -x * j ≥ 0, we surely have x j ≥ x * j + 0 ≥ 0; otherwise, x * j > x j ≥ 0 and thus j ∈ supp(x * ) and x * j ≥ x * min , which implies x j ≥ x * j -|x j -x * j | • x * min x-x * 1 ≥ x * j -x * min ≥ 0. In addition, j x j = j x * j = 1. Combining these facts, we have x ∈ ∆ M . Moreover, according to the definition of x , x -x * 1 = x * min holds. Also, since x * -x and x * -x are parallel vectors, Eq. ( 13) is satisfied. The arguments above show that the c x in Definition 3 is a well-defined real number. The case of c y is similar. Now we show 0 < c x , c y ≤ 1. The fact that c x , c y ≤ 1 is a direct consequence of G being in [-1, 1] M ×N . Below, we use contradiction to prove that c y > 0. First, if c y < 0, then there exists y = y * such that x * Gy * < x * Gy. This contradicts with the fact that (x * , y * ) is the equilibrium. On the other hand, if c y = 0, then there is some y = y * such that max x∈V * (X ) x G(y * -y) = 0. ( ) Consider the point y = y * + ξ 2 (y-y * ) (recall the definition of ξ in Definition 2 and that 0 < ξ ≤ 1), which lies on the line segment between y * and y. Then, for any x ∈ X , x Gy = i / ∈supp(x * ) x i (Gy ) i + i∈supp(x * ) x i (Gy ) i ≥ i / ∈supp(x * ) x i (Gy * ) i -x i y -y * 1 + i∈supp(x * ) ξ 2 • x i (G(y -y * )) i + x i (Gy * ) i (using G ij ∈ [-1, -1] for the first part and y = y * + ξ 2 (y -y * ) for the second) ≥ i / ∈supp(x * ) x i (Gy * ) i -x i y -y * 1 + i∈supp(x * ) x i ρ (using Eq. ( 14) and (Gy * ) i = ρ for all i ∈ supp(x * )) ≥ i / ∈supp(x * ) x i ((Gy * ) i -ξ) + i∈supp(x * ) x i ρ (using y -y * = ξ 2 (y -y * ) and y -y * 1 ≤ 2) ≥ i / ∈supp(x * ) x i ρ + i∈supp(x * ) x i ρ (by the definition of ξ) = ρ. This shows that min x∈X x Gy ≥ ρ, that is, y = y * is also a maximin point, contradicting that z * is unique. Therefore, c y > 0 has to hold, and so does c x > 0 by the same argument. 
Finally, we define the following constant that depends on G: Definition 4. min j∈supp(z * ) exp - ln(M N ) z * j .

D.2 AUXILIARY LEMMAS

All lemmas stated in this section is for the case f (x, y) = x Gy with Z = ∆ M × ∆ N and a unique Nash equilibrium z * = (x * , y * ). Lemma 15. For any z ∈ Z, we have max z ∈V * (Z) F (z) (z -z ) ≥ C z * -z 1 for C = min{c x , c y } ∈ (0, 1]. Proof. Recall that ρ = x * Gy * is the game value and note that max z ∈V * (Z) F (z) (z -z ) = max z ∈V * (Z) (x -x ) Gy + x G(y -y) = max z ∈V * (Z) -x Gy + x Gy = max x ∈V * (X ) ρ -x Gy + max y ∈V * (Y) x Gy -ρ = max x ∈V * (X ) x G(y * -y) + max y ∈V * (Y) (x -x * ) Gy (Lemma 13) ≥ c y y * -y 1 + c x x * -x 1 (by Definition 3) ≥ min{c x , c y } z * -z 1 , which completes the proof. Lemma 16. For any z ∈ Z, we have KL(z * , z) ≤ i∈supp(z * ) (z * i -z i ) 2 z i + i / ∈supp(z * ) z i ≤ 1 min i∈supp(z * ) z i z * -z 1 . Proof. Using the definition of the Kullback-Leibler divergence, we have KL(x * , x) = i x * i ln x * i x i ≤ ln i x * i 2 x i = ln 1 + i (x * i -x i ) 2 x i ≤ i (x * i -x i ) 2 x i , where the first inequality is by the concavity of the ln(•) function, and the second inequality is because ln(1 + u) ≤ u. Considering i ∈ supp(x * ) and i / ∈ supp(x * ) separately in the last summation, we have i (x * i -x i ) 2 x i = i∈supp(x * ) (x * i -x i ) 2 x i + i / ∈supp(x * ) (x i ) 2 x i = i∈supp(x * ) (x * i -x i ) 2 x i + i / ∈supp(x * ) x i . The case for KL(y * , y) is similar. Combining both cases finishes the proof of the first inequality (recall that KL(z * , z) is defined as KL(x * , x) + KL(y * , y)). The second inequality is straightforward: i∈supp(z * ) (z * i -z i ) 2 z i + i / ∈supp(z * ) z i ≤ 1 min i∈supp(z * ) z i   i∈supp(z * ) |z * i -z i | + i / ∈supp(z * ) |z i |   = 1 min i∈supp(z * ) z i z * -z 1 . Lemma 17. For η ≤ 1 8 , OMWU guarantees 3 4 z t,i ≤ z t,i ≤ 4 3 z t,i and 3 4 z t,i ≤ z t+1,i ≤ 4 3 z t,i . Proof. This is shown directly by the update of x t : x t,i exp (-η) exp (η) ≤ x t+1,i = x t,i exp (-η • (Gy t ) i ) j x t,j exp (-η • (Gy t ) j ) ≤ x t,i exp (η) exp(-η) . 
So by the condition on η, we have 3 4 x t,i ≤ exp(-2η) • x t,i ≤ x t+1,i ≤ exp(2η) • x t,i ≤ 4 3 x t,i . The cases for x t , y t and y t are similar. Lemma 18. For any two probability vectors u, v, if for every entry i, 1 2 u i ≤ v i ≤ 3 2 u i , then 1 3 i (vi-ui) 2 ui ≤ KL(u, v) ≤ i (vi-ui) 2 ui ≤ 1 4 . Proof. Using the definition of the Kullback-Leibler divergence, we have KL(u, v) = - i u i ln v i u i ≥ - i u i v i -u i u i - 1 3 (v i -u i ) 2 u 2 i = 1 3 i (v i -u i ) 2 u i , KL(u, v) = - i u i ln v i u i ≤ - i u i v i -u i u i - (v i -u i ) 2 u 2 i = i (v i -u i ) 2 u i ≤ 1 4 , where the first inequality is because ln(1+a) ≤ a-1 3 a 2 for -1 2 ≤ a ≤ 1 2 , and the second inequality is because ln(1 + a) ≥ a -a 2 for -1 2 ≤ a ≤ 1 2 . The third inequality is by using the condition |u i -v i | ≤ 1 2 u i . Lemma 19. For all i ∈ supp(z * ) and t, OMWU guarantees z t,i ≥ ( is defined in Definition 4). Proof. Using Eq. ( 3), we have KL(z * , z t ) ≤ Θ t ≤ • • • ≤ Θ 1 = 1 16 KL( z 1 , z 0 ) + KL(z * , z 1 ) = KL(z * , z 1 ), where the last equality is because z 1 = z 0 = ( 1 M M , 1 N N ). Then, for any i ∈ supp(z * ), we have z * i ln 1 z t,i ≤ j z * j ln 1 z t,j = KL(z * , z t ) - j z * j ln z * j ≤ KL(z * , z 1 ) - j z * j ln z * j = j z * j ln 1 z 1,j = ln(M N ). Therefore, we conclude for all t and i ∈ supp(z * ), z t,i satisfies z t,i ≥ exp - ln(M N ) z * i ≥ min j∈supp(z * ) exp - ln(M N ) z * j = . D.3 PROOFS OF LEMMA 2 AND THEOREM 3 Proof of Lemma 2. Below we consider any z ∈ Z such that supp(z ) ⊆ supp(z * ), that is, z ∈ V * (Z). Considering Eq. ( 1), and using the first-order optimality condition of z t+1 , we have (∇ψ( z t+1 ) -∇ψ( z t ) + ηF (z t )) (z -z t+1 ) ≥ 0, where ψ(z) = i z i ln z i . Rearranging the terms and we get ηF (z t ) ( z t+1 -z ) ≤ (∇ψ( z t+1 ) -∇ψ( z t )) (z -z t+1 ) = i (z i -z t+1,i ) ln z t+1,i z t,i . The left hand side of Eq. 
( 16) is lower bounded as ηF (z t ) ( z t+1 -z ) = ηF ( z t+1 ) ( z t+1 -z ) + η (F (z t ) -F ( z t+1 )) ( z t+1 -z ) ≥ ηF ( z t+1 ) ( z t+1 -z ) -η F (z t ) -F ( z t+1 ) ∞ z t+1 -z 1 ≥ ηF ( z t+1 ) ( z t+1 -z ) -4η z t -z t+1 1 ( F (z t ) -F ( z t+1 ) ∞ ≤ z t -z t+1 1 ≤ 4) ≥ ηF ( z t+1 ) ( z t+1 -z ) - 1 2 z t -z t+1 1 ; (η ≤ 1/8) on the other hand, the right hand side of Eq. ( 16) is upper bounded by i (z i -z t+1,i ) ln z t+1,i z t,i = i∈supp(z * ) z i ln z t+1,i z t,i -KL( z t+1 , z t ) (supp(z ) ⊆ supp(z * )) ≤ i∈supp(z * ) ln z t+1,i z t,i = i∈supp(z * ) max ln 1 + z t+1,i -z t,i z t,i , ln 1 + z t,i -z t+1,i z t+1,i ≤ i∈supp(z * ) ln 1 + | z t+1,i -z t,i | min{ z t+1,i , z t,i } ≤ 4 3 i∈supp(z * ) | z t+1,i -z t,i | z t,i (ln(1 + a) ≤ a and Lemma 17) Combining the bounds on the two sides of Eq. ( 16), we get ηF ( z t+1 ) ( z t+1 -z ) ≤ 4 3 i∈supp(z * ) | z t+1,i -z t,i | z t,i + 1 2 z t -z t+1 1 . Since z can be chosen as any point in V * (Z), we further lower bound the left-hand side above using Lemma 15 and get ηC z * -z t+1 1 ≤ 4 3 i∈supp(z * ) | z t+1,i -z t,i | z t,i + 1 2 z t -z t+1 1 ≤ 4 3 z t+1 -z t 1 + 1 2 z t -z t+1 1 , (Lemma 19) ≤ 4 3 ( z t+1 -z t 1 + z t -z t+1 1 ) where the last inequality uses ≤ 1. With the help of Eq. ( 17), below we prove the desired inequalities. Case 1. General case. KL( z t+1 , z t ) + KL(z t , z t ) ≥ 1 2 x t+1 -x t 2 1 + 1 2 y t+1 -y t 2 1 + 1 2 x t -x t 2 1 + 1 2 y t -y t 2 1 (Pinsker's inequality) ≥ 1 4 z t+1 -z t 2 1 + 1 4 z t -z t 2 1 (a 2 + b 2 ≥ 1 2 (a + b) 2 ) ≥ 1 16 z t+1 -z t 2 1 + 1 8 z t+1 -z t 2 1 + z t -z t 2 1 ≥ 1 16 z t+1 -z t 2 1 + 1 16 z t+1 -z t 2 1 (a 2 + b 2 ≥ 1 2 (a + b) 2 and triangle inequality) ≥ 1 32 ( z t+1 -z t 1 + z t+1 -z t 1 ) 2 (a 2 + b 2 ≥ 1 2 (a + b) 2 ) ≥ 1 32 3 ηC 4 2 z * -z t+1 2 1 (Eq. ( 17)) (Lemma 16 and Lemma 19) This proves the first part of the lemma with C 1 = 4 C 2 /64. ≥ 2 η 2 C 2 64 × 2 KL(z * , z t+1 ) 2 = 4 η 2 C 2 64 KL(z * , z t+1 ) 2 . Case 2. 
The case when max{ z * -z t 1 , z * -z t 1 } ≤ ηξ 10 . KL( z t+1 , z t ) + KL(z t , z t ) ≥ 1 3 i ( z t+1,i -z t,i ) 2 z t+1,i + (z t,i -z t,i ) 2 z t,i (Lemma 17 and Lemma 18) ≥ 1 4 i / ∈supp(z * ) ( z t+1,i -z t,i ) 2 z t,i + (z t,i -z t,i ) 2 z t,i (Lemma 17) ≥ 1 8 i / ∈supp(z * ) ( z t+1,i -z t,i ) 2 z t,i . ( ) Below we continue to bound i / ∈supp(z * ) ( zt+1,i-zt,i) 2 zt,i . By the assumption, we have y t -y * 1 ≤ ηξ 10 , which by Lemma 13 and Definition 2 implies ∀i ∈ supp(x * ), (Gy t ) i ≤ (Gy * ) i + ηξ 10 = ρ + ηξ 10 ≤ ρ + ξ 10 , ∀i / ∈ supp(x * ), (Gy t ) i ≥ (Gy * ) i - ηξ 10 ≥ ρ + ξ - ηξ 10 ≥ ρ + 9ξ 10 . We also have x t -x * 1 ≤ ηξ 10 , so j / ∈supp(x * ) x t,j ≤ ηξ 10 . Then, for i / ∈ supp(x * ), we have x t+1,i = x t,i exp(-η(Gy t ) i ) j x t,j exp(-η(Gy t ) j ) ≤ x t,i exp(-η(Gy t ) i ) j∈supp(x * ) x t,j exp(-η(Gy t ) j ) ≤ x t,i exp(-η(ρ + 9ξ 10 )) j∈supp(x * ) x t,j exp(-η(ρ + ξ 10 )) = x t,i exp -8 10 ηξ 1 -j / ∈supp(x * ) x t,j ≤ x t,i exp -8 10 ηξ 1 -ηξ 10 ≤ x t,i 1 - 1 2 ηξ , where the last inequality is because exp(-0.8u) 1-0.1u ≤ 1 -0.5u for u ∈ [0, 1]. Rearranging gives | x t+1,i -x t,i | 2 x t,i ≥ η 2 ξ 2 4 x t,i ≥ η 2 ξ 2 8 x t+1,i , where the last step uses Lemma 17. The case for y t is similar, so we have | z t+1,i -z t,i | 2 z t,i ≥ η 2 ξ 2 8 z t+1,i . Combining this with Eq. ( 18), we get KL( z t+1 , z t ) + KL(z t , z t ) ≥ η 2 ξ 2 64 i / ∈supp(z * ) z t+1,i . Now we combine two lower bounds of KL( z t+1 , z t ) + KL(z t , z t ). Using an intermediate step in Case 1, and Eq. ( 19), we get 16 and Lemma 19) This proves the second part of the lemma with C 2 = 3 C 2 ξ 2 /128. Now we are ready to prove Theorem 3. 
KL( z t+1 , z t ) + KL(z t , z t ) = 1 2 (KL( z t+1 , z t ) + KL(z t , z t )) + 1 2 (KL( z t+1 , z t ) + KL(z t , z t )) ≥ 2 η 2 C 2 128 z * -z t+1 2 1 + η 2 ξ 2 128 i / ∈supp(z * ) z t+1,i = 3 η 2 C 2 ξ 2 128   1 ξ 2 z t+1 -z * 2 1 + 1 3 C 2 i / ∈supp(z * ) z t+1,i   ≥ 3 η 2 C 2 ξ 2 128   1 z t+1 -z * 2 1 + i / ∈supp(z * ) z t+1,i   (ξ ≤ 1, C ≤ 1, and ≤ 1) ≥ 3 η 2 C 2 ξ 2 128 KL(z * , z t+1 ). ( Proof of Theorem 3. As argued in Section 4, with Θ t = KL(z * , z t ) + 1 16 KL( z t , z t-1 ) and ζ t = KL( z t+1 , z t ) + KL(z t , z t ), we have (see Eq. ( 3)) Θ t+1 ≤ Θ t -15 16 ζ t . We the proceed as, ζ t ≥ 1 2 KL( z t+1 , z t ) + 1 2 ζ t ≥ 1 2 KL( z t+1 , z t ) + η 2 C 1 2 KL(z * , z t+1 ) 2 (Lemma 2) ≥ 2KL( z t+1 , z t ) 2 + η 2 C 1 2 KL(z * , z t+1 ) 2 (by Lemma 17 and Lemma 18) ≥ η 2 C 1 2 KL( z t+1 , z t ) 2 + KL(z * , z t+1 ) 2 (C 1 = 4 C 2 /64 ≤ 1/64 as shown in the proof of Lemma 2) ≥ η 2 C 1 4 (KL( z t+1 , z t ) + KL(z * , z t+1 )) 2 ≥ η 2 C 1 4 Θ 2 t+1 . Therefore, Θ t+1 ≤ Θ t -15η 2 C1 64 Θ 2 t+1 ≤ Θ t -15η 2 C1 64+ln M N Θ 2 t+1 . Also, recall z 1 = z 0 = ( 1 M M , 1 N N ) and thus Θ 1 = KL(z * , z 1 ) ≤ ln(M N ). Therefore, the conditions of Lemma 12 are satisfied with p = 1 and q = 15η 2 C1 64+ln(M N ) , and we conclude that Θ t ≤ C t , where C = max ln(M N ), 128+2 ln(M N ) 15η 2 C1 = 128+2 ln(M N ) 15η 2 C1 . Next we prove the main result. Set T 0 = 12800C η 2 ξ 2 . For t ≥ T 0 , we have using Pinsker's inequality, z * -z t 2 1 ≤ 2 x * -x t 2 1 + 2 y * -y t 2 1 ≤ 4KL(z * , z t ) ≤ 4C T 0 ≤ η 2 ξ 2 100 , z * -z t 2 1 ≤ 2 z * -z t+1 2 1 + 2 z t+1 -z t 2 1 ≤ 4 x * -x t+1 2 1 + 4 x t+1 -x t 2 1 + 4 y * -y t+1 2 1 + 4 y t+1 -y t 2 1 ≤ 8KL(z * , z t+1 ) + 8KL( z t+1 , z t ) ≤ 128Θ t+1 ≤ 128C T 0 ≤ η 2 ξ 2 100 . Therefore, when t ≥ T 0 , the condition of the second part of Lemma 2 is satisfied, and we have ζ t ≥ 1 2 KL( z t+1 , z t ) + 1 2 ζ t ≥ 1 2 KL( z t+1 , z t ) + η 2 C 2 2 KL(z * , z t+1 ) (by Lemma 2) ≥ η 2 C 2 2 Θ t+1 . 
(C 2 = 3 C 2 ξ 2 /128 ≤ 1/128 as shown in the proof of Lemma 2) Therefore, when t ≥ T 0 , Θ t+1 ≤ Θ t -15η 2 C2 32 Θ t+1 , which further leads to Θ t ≤ Θ T0 • 1 + 15η 2 C 2 32 T0-t ≤ Θ 1 • 1 + 15η 2 C 2 32 T0-t ≤ ln(M N ) 1 + 15η 2 C 2 32 T0-t . where the second inequality uses Eq. ( 15). The inequality trivially holds for t < T 0 as well, so it holds for all t. We finish the proof by relating KL(z * , z t ) and Θ t+1 . Note that by Lemma 16, Lemma 17, and Lemma 19, we have KL(z * , z t ) 2 ≤ z * -z t 2 min i∈supp(z * ) z 2 t,i ≤ 16 z * -z t 2 9 2 ≤ 4 z * -z t+1 2 + z t+1 -z t 2 2 . We continue to bound the last term as 4 z * -z t+1 2 + z t+1 -z t 2 2 = 4 x * -x t+1 2 + y * -y t+1 2 + x t+1 -x t 2 + y t+1 -y t 2 2 = 4 x * -x t+1 2 1 + y * -y t+1 2 1 + x t+1 -x t 2 1 + y t+1 -y t 2 1 2 ( x 2 ≤ x 1 ) ≤ 128 2 KL(z * , z t+1 ) 16 + KL( z t+1 , z t ) 16 (Pinsker's inequality) ≤ 128 2 Θ t+1 . Combining everything, we get KL(z * , z t ) ≤ √ 128 Θ t+1 ≤ 128 ln(M N ) 1 + 15η 2 C 2 32 T 0 -t-1 2 , which completes the proof.
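The proofs in this section repeatedly rely on the scalar bounds a − a^2 ≤ ln(1 + a) ≤ a − a^2/3 for −1/2 ≤ a ≤ 1/2 (used in the proof of Lemma 18). A quick numerical confirmation of these two inequalities on a dense grid (our own check):

```python
import numpy as np

# Grid over the range where Lemma 18's proof applies the bounds.
a = np.linspace(-0.5, 0.5, 100001)
lower = a - a ** 2          # claim: ln(1 + a) >= a - a^2 on [-1/2, 1/2]
upper = a - a ** 2 / 3.0    # claim: ln(1 + a) <= a - a^2/3 on [-1/2, 1/2]
log1p = np.log1p(a)

assert np.all(log1p >= lower - 1e-12)
assert np.all(log1p <= upper + 1e-12)
```

Both checks pass on the whole interval; the bounds are tight at a = 0 and are what turn the KL divergence into a chi-square-like quantity in Lemma 18.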

E PROOFS OF LEMMA 4 AND THE SUM-OF-DUALITY-GAP BOUND

Proof of Lemma 4. Below we consider any z = z t+1 ∈ Z. Considering Eq. ( 1) with D ψ (u, v) = 1 2 u -v 2 , and using the first-order optimality condition of z t+1 , we have ( z t+1 -z t + ηF (z t )) (z -z t+1 ) ≥ 0, (z t+1 -z t+1 + ηF (z t )) (z -z t+1 ) ≥ 0. Rearranging the terms and we get ( z t+1 -z t ) (z -z t+1 ) ≥ ηF (z t ) ( z t+1 -z ) = ηF ( z t+1 ) ( z t+1 -z ) + η (F (z t ) -F ( z t+1 )) ( z t+1 -z ) ≥ ηF ( z t+1 ) ( z t+1 -z ) -ηL z t -z t+1 z t+1 -z ≥ ηF ( z t+1 ) ( z t+1 -z ) - 1 8 z t -z t+1 z t+1 -z , (z t+1 -z t+1 ) (z -z t+1 ) ≥ ηF (z t ) (z t+1 -z ) = ηF (z t+1 ) (z t+1 -z ) + η (F (z t ) -F (z t+1 )) (z t+1 -z ) ≥ ηF (z t+1 ) (z t+1 -z ) -ηL z t -z t+1 z t+1 -z ≥ ηF (z t+1 ) (z t+1 -z ) - 1 8 z t -z t+1 z t+1 -z . Here, for both block, the third step uses Hölder's inequality and the smoothness condition Assumption 1, and the last step uses the condition η ≤ 1/(8L). Upper bounding the left-hand side of the two inequalities by z t+1 -z t z t+1 -z and z t+1 -z t+1 z t+1 -z respectively and then rearranging, we get z t+1 -z z t+1 -z t + 1 8 z t -z t+1 ≥ ηF ( z t+1 ) ( z t+1 -z ), z t+1 -z z t+1 -z t+1 + 1 8 z t -z t+1 ≥ ηF (z t+1 ) (z t+1 -z ). Therefore, we have z t+1 -z t + 1 8 z t -z t+1 2 ≥ η 2 [F ( z t+1 ) ( z t+1 -z )] 2 + z t+1 -z 2 , z t+1 -z t+1 + 1 8 z t -z t+1 2 ≥ η 2 [F (z t+1 ) (z t+1 -z )] 2 + z t+1 -z 2 . Finally, by the triangle inequality and the fact (a + b) 2 ≤ 2a 2 + 2b 2 , we have z t+1 -z t + 1 8 z t -z t+1 2 ≤ z t -z t + 9 8 z t -z t+1 2 ≤ 9 8 z t -z t + 9 8 z t -z t+1 2 ≤ 81 32 z t -z t 2 + z t -z t+1 2 , z t+1 -z t+1 + 1 8 z t -z t+1 2 ≤ 9 8 z t+1 -z t+1 + z t -z t+1 2 ≤ 9 8 z t+1 -z t+1 + 9 8 z t -z t+1 2 ≤ 81 32 z t+1 -z t+1 2 + z t -z t+1 2 , which finishes the proof. Next, we use Eq. ( 4) and Eq. ( 6) to derive a result on the convergence of "average duality gap" across time. First, we use the following lemma to relate the right-hand side of Eq. ( 6) to the duality gap of z t . Lemma 20. Let Z be closed and bounded. 
Then for any z ∈ Z, we have α f (z) ≤ max z ∈Z F (z) (z -z ). Proof. This is a direct consequence of the convexity of f (•, y) and the concavity of f (x, •): α f (z) = max (x ,y )∈X ×Y (f (x, y ) -f (x, y) + f (x, y) -f (x , y)) ≤ max (x ,y )∈X ×Y ∇ y f (x, y) (y -y) + ∇ x f (x, y) (x -x ) = max z ∈Z F (z) (z -z ). With Lemma 20, the following theorem can be proven straightforwardly. Theorem 21. Let Z be closed and bounded. Then OGDA with η ≤ 1 8L ensures 1 T T t=1 α f (z t ) = O D η √ T for any T , where D sup z,z ∈Z zz . Proof. We first bound the sum of squared duality gap as (recall ζ t = z t+1 -z t 2 + z t -z t 2 ): T t=1 α f (z t ) 2 ≤ T t=1 max z ∈Z F (z t ) (z t -z ) 2 (Lemma 20) ≤ 81 32η 2 T t=1 (ζ t-1 + ζ t ) z t -z 2 (Lemma 4) ≤ O D 2 η 2 T t=2 (Θ t-1 -Θ t + Θ t -Θ t+1 ) (Eq. (4)) = O D 2 η 2 . (telescoping) Finally, by Cauchy-Schwarz inequality, we get 1 T T t=1 α f (z t ) ≤ 1 T T T t=1 α f (z t ) 2 = O D η √ T . This theorem indicates that α f (z t ) is converging to zero. A rate of α f (z t ) = O( D η √ t ) would be compatible with the theorem, but is not directly implied by it. In a recent work, Golowich et al. (2020b) consider the unconstrained setting and show that the extra-gradient algorithm obtains the rate α f (z t ) = O( D η √ t ) , under an extra assumption that the Hessian of f is also Lipschitz (since Golowich et al. (2020b) study the unconstrained setting, their duality gap α f is defined only with respect to the best responses that lie within a ball of radius D centered around the equilibrium). Note that the extra-gradient algorithm requires more cooperation between the two players compared to OGDA and is less suitable for a repeated game setting.

F THE EQUIVALENCE BETWEEN SP-MS AND METRIC SUBREGULARITY

In this section, we formally show that our SP-MS condition with β = 0 is equivalent to metric subregularity. Before introducing the main theorem, we introduce several definitions. We let Z^* ⊆ Z ⊆ R^K (Z^* and Z follow the same definitions as in our main text). First, we define the element-to-set distance function d:

Definition 5. The element-to-set distance function d : R^K × 2^{R^K} → R is defined as d(z, S) = inf_{z'∈S} ‖z − z'‖.

The definition of metric subregularity involves a set-valued operator T : Z → 2^{R^K}, which maps an element of Z to a set in R^K.

Definition 6. A set-valued operator T is called metric subregular at (z̄, v̄) for v̄ ∈ T(z̄) if there exist κ > 0 and a neighborhood Ω of z̄ such that d(v̄, T(z)) ≥ κ d(z, T^{−1}(v̄)) for all z ∈ Ω, where x ∈ T^{−1}(v̄) ⇔ v̄ ∈ T(x). If Ω = Z, we call T globally metric subregular.

The following definition of normal cone is also required in the analysis:

Definition 7. The normal cone of Z at point z is N(z) = {g | g^T(z' − z) ≤ 0, ∀z' ∈ Z} (we omit its dependence on Z for simplicity). Equivalently, N(z) is the polar cone of the convex set Z − z (a property that we will use in the proof).

Now we are ready to show that our SP-MS condition with β = 0 is equivalent to metric subregularity of the operator N + F, defined via (N + F)(z) = {g + F(z) | g ∈ N(z)}.

Theorem 22. Let z^* ∈ Z^*. Then the following two statements are equivalent:
• (N + F) is globally metric subregular at (z^*, 0) with κ > 0;
• For all z ∈ Z\Z^*, max_{z'∈Z} F(z)^T(z − z')/‖z − z'‖ ≥ κ d(z, Z^*).

Proof. Let T = N + F. Notice that z ∈ Z^* ⇔ F(z)^T(z' − z) ≥ 0 for all z' ∈ Z ⇔ −F(z) ∈ N(z) ⇔ 0 ∈ (N + F)(z). Therefore, 0 ∈ T(z^*) indeed holds, and we have T^{−1}(0) = Z^*. This means that the first statement in the theorem is equivalent to d(0, T(z)) ≥ κ d(z, T^{−1}(0)) ⇔ d(0, N(z) + F(z)) ≥ κ d(z, Z^*). This inequality holds trivially when z ∈ Z^*. Thus, to complete the proof, it suffices to prove that d(0, N(z) + F(z)) = max_{z'∈Z} F(z)^T(z − z')/‖z − z'‖ for z ∈ Z\Z^*.
To do so, note that d(0, N (z) + F (z)) = d(-F (z), N (z)) = -F (z) -Π N (z) (-F (z)) = Π N • (z) (-F (z)) where N • (z) = {g | g n ≤ 0, ∀n ∈ N (z)} is the polar cone of N (z) and the last step is by Moreau's theorem. Now consider the projection of -F (z) onto the polar cone N • (z): Π N • (z) (-F (z)) = argmin y∈N • (z) -F (z) -y 2 = argmin y∈N • (z) 2F (z) y + y 2 = argmin y∈N • (z) 2F (z) y y • y + y 2 = argmin λ≥0, z∈N • (z), z =1 2λF (z) z + λ 2 , where the last equality is because N • (z) is a cone. Next, we find the z * and λ * that realize the last argmin operator: notice that the objective is increasing in F (z) z, so z * = argmin z∈N • (z): z =1 F (z) z , and thus λ * = -F (z) z * when F (z) z * ≤ 0 and λ * = 0 otherwise. Therefore, Π N • (z) (-F (z)) = λ * = max 0, max z∈N • (z), z =1 -F (z) z . Note that N (z) is the polar cone of the conic hull of Z -z. Therefore, N • (z) = (ConicHull(Z - z)) •• = ConicHull(Z -z) and max 0, max z∈N • (z), z =1 -F (z) z = max 0, max z ∈Z F (z) (z -z ) z -z . Finally, note that when z ∈ Z\Z * , we have max z ∈Z F (z) (z -z ) > 0. Combining all the facts above, we have shown d(0, N (z) + F (z)) = max z ∈Z F (z) (z-z ) z-z . G PROOF OF THEOREM 5 Proof of Theorem 5. Let ρ = min x∈X max y∈Y x Gy = max y∈Y min x∈X x Gy be the game value. In this proof, we prove that there exists some c > 0 such that max y ∈Y x Gy -ρ ≥ c x -Π X * (x) for all x ∈ X . Similarly we prove max x ∈X ρ -x Gy ≥ c y -Π Y * (y) for all y ∈ Y. Assume that the diameter of the polytope is D < ∞. Then combining the two proves max z F (z) (z -z ) z -z ≥ 1 D max z F (z) (z -z ) = 1 D max y x Gy -min x x Gy ≥ c D ( y -Π Y * (y) + x -Π X * (x) ) ≥ c D z -Π Z * (z) , meaning that SP-MS holds with β = 0. We break the proof into following several claims. Claim 1. If X , Y are polytopes, then X * and Y * are also polytopes. Proof of Claim 1. Note that X * = x ∈ X : max y∈Y x Gy ≤ ρ . Since Y is a polytope, the maximum is attained at vertices of Y. 
Therefore, $\mathcal{X}^*$ can be equivalently written as $\{x \in \mathcal{X} : \max_{y \in V(\mathcal{Y})} x^\top G y \leq \rho\}$, where $V(\mathcal{Y})$ is the set of vertices of $\mathcal{Y}$. Since the constraints of $\mathcal{X}^*$ are all linear, $\mathcal{X}^*$ is a polytope. ∎

With Claim 1, we write without loss of generality $\mathcal{X}^* = \{x \in \mathbb{R}^M : a_i^\top x \leq b_i \text{ for } i = 1, \ldots, L,\ c_i^\top x \leq d_i \text{ for } i = 1, \ldots, K\}$, where the $a_i^\top x \leq b_i$ constraints come from $x \in \mathcal{X}$ and the $c_i^\top x \leq d_i$ constraints come from $\max_{y \in V(\mathcal{Y})} x^\top G y \leq \rho$. Below, we refer to $a_i^\top x \leq b_i$ as the feasibility constraints, and $c_i^\top x \leq d_i$ as the optimality constraints. In fact, one can identify the $i$-th optimality constraint as $c_i = G y^{(i)}$ and $d_i = \rho$, where $y^{(i)}$ is the $i$-th vertex of $\mathcal{Y}$; this is based on our construction of $\mathcal{X}^*$ in the proof of Claim 1. Therefore, $K = |V(\mathcal{Y})|$. Since Eq. (20) clearly holds for $x \in \mathcal{X}^*$, below we focus on an $x \in \mathcal{X} \setminus \mathcal{X}^*$, and let $x^* = \Pi_{\mathcal{X}^*}(x)$. We say a constraint is tight at $x^*$ if $a_i^\top x^* = b_i$ or $c_i^\top x^* = d_i$. Below we assume that there are $\ell$ tight feasibility constraints and $k$ tight optimality constraints at $x^*$. Without loss of generality, we assume these tight constraints correspond to $i = 1, \ldots, \ell$ and $i = 1, \ldots, k$ respectively, that is, $a_i^\top x^* = b_i$ for $i = 1, \ldots, \ell$, and $c_i^\top x^* = d_i$ for $i = 1, \ldots, k$.

Claim 2. $x$ violates at least one of the tight optimality constraints at $x^*$.

Proof of Claim 2. We prove this by contradiction. Suppose that $x$ satisfies all $k$ tight optimality constraints at $x^*$. Then $x$ must violate some of the remaining $K - k$ optimality constraints (otherwise $x \in \mathcal{X}^*$). Assume that it violates constraints $K - n + 1, \ldots, K$ for some $1 \leq n \leq K - k$. Thus, we have the following: $c_i^\top x \leq d_i$ for $i = 1, \ldots, K - n$, and $c_i^\top x > d_i$ for $i = K - n + 1, \ldots, K$. Recall that $c_i^\top x^* \leq d_i$ for $i = 1, \ldots, K - n$ and $c_i^\top x^* < d_i$ for all $i = K - n + 1, \ldots, K$.
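Claim 1's construction of $\mathcal{X}^*$ via finitely many vertex constraints can be sanity-checked on a small game. A sketch on a hypothetical matching-pennies instance (game value $\rho = 0$, unique optimal strategy $(1/2, 1/2)$), testing the optimality constraints $c_i^\top x \leq d_i$ with $c_i = G y^{(i)}$ and $d_i = \rho$:

```python
import numpy as np

# Hypothetical instance: matching pennies over the 2-simplex.
G = np.array([[1.0, -1.0], [-1.0, 1.0]])
rho = 0.0
vertices_Y = np.eye(2)  # vertices of the simplex Y

def in_X_star(x, tol=1e-9):
    # x is optimal iff every optimality constraint holds:
    # one per vertex y^(i) of Y, namely x^T (G y^(i)) <= rho.
    return all(x @ G @ y <= rho + tol for y in vertices_Y)

# scan the simplex: only (1/2, 1/2) should satisfy all constraints
grid = [np.array([a, 1 - a]) for a in np.linspace(0, 1, 101)]
opt = [x for x in grid if in_X_star(x)]
assert len(opt) == 1 and np.allclose(opt[0], [0.5, 0.5])
```

Since the constraints are linear in $x$, the set they cut out of $\mathcal{X}$ is a polytope, exactly as Claim 1 states.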
Thus, there exists some $x'$ that lies strictly between $x$ and $x^*$ and makes all optimality constraints hold (notice that $x$ and $x^*$ both satisfy all feasibility constraints, and so does $x'$ by convexity), which contradicts $\Pi_{\mathcal{X}^*}(x) = x^*$. ∎

Claim 3. $\max_{y' \in \mathcal{Y}} x^\top G y' - \rho \geq \max_{i \in \{1, \ldots, k\}} c_i^\top (x - x^*)$.

Proof of Claim 3. Recall that we identify $c_i$ with $G y^{(i)}$ and $d_i = \rho$. Therefore, $\max_{y' \in \mathcal{Y}} x^\top G y' - \rho = \max_{i \in \{1, \ldots, |V(\mathcal{Y})|\}} (c_i^\top x - d_i) \geq \max_{i \in \{1, \ldots, k\}} (c_i^\top x - d_i) = \max_{i \in \{1, \ldots, k\}} c_i^\top (x - x^*)$, where the last equality is because $c_i^\top x^* = d_i$ for $i = 1, \ldots, k$. ∎

Recall from the linear programming literature (Davis, 2016a;b) that the normal cone of $\mathcal{X}^*$ at $x^*$ can be expressed as $N_{x^*} = \{x' - x^* : x' \in \mathbb{R}^M,\ \Pi_{\mathcal{X}^*}(x') = x^*\} = \{\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i : p_i \geq 0,\ q_i \geq 0\}$. The normal cone of $\mathcal{X}^*$ at $x^*$ consists of all outgoing normal vectors of $\mathcal{X}^*$ originating from $x^*$. Clearly, $x - x^*$ belongs to $N_{x^*}$. However, besides the fact that $x - x^*$ is a normal vector of $\mathcal{X}^*$, we also have the additional constraint that $x \in \mathcal{X}$. We claim that in our case, $x - x^*$ lies in the following smaller cone (which is a subset of $N_{x^*}$):

Claim 4. $x - x^*$ belongs to $M_{x^*} = \{\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i : p_i \geq 0,\ q_i \geq 0,\ a_j^\top (\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i) \leq 0,\ \forall j = 1, \ldots, \ell\}$.

Proof of Claim 4. As argued above, $x - x^* \in N_{x^*}$, and thus $x - x^*$ can be expressed as $\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i$ with $p_i \geq 0$, $q_i \geq 0$. To prove that $x - x^* \in M_{x^*}$, we only need to verify the additional constraints, that is, $a_i^\top (x - x^*) \leq 0$ for all $i = 1, \ldots, \ell$. This is shown by noticing that for all $i = 1, \ldots, \ell$, $a_i^\top (x - x^*) = a_i^\top x^* - b_i + a_i^\top (x - x^*)$ (the $i$-th feasibility constraint is tight at $x^*$) $= a_i^\top x - b_i \leq 0$ ($x \in \mathcal{X}$). ∎

Claim 5. $x - x^*$ can be written as $\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i$ with $0 \leq p_i, q_i \leq C' \|x - x^*\|$ for all $i$ and some problem-dependent constant $C' < \infty$.

Proof of Claim 5. Notice that $\frac{x - x^*}{\|x - x^*\|} \in M_{x^*}$ (because $0 \neq x - x^* \in M_{x^*}$ and $M_{x^*}$ is a cone). Furthermore, $\frac{x - x^*}{\|x - x^*\|} \in \{v \in \mathbb{R}^M : \|v\|_\infty \leq 1\}$.
Therefore, $\frac{x - x^*}{\|x - x^*\|} \in M_{x^*} \cap \{v \in \mathbb{R}^M : \|v\|_\infty \leq 1\}$, a bounded subset of the cone $M_{x^*}$. Below we argue that there exists a large enough $C' > 0$ such that $\{\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i : 0 \leq p_i, q_i \leq C',\ \forall i\} \supseteq M_{x^*} \cap \{v \in \mathbb{R}^M : \|v\|_\infty \leq 1\} =: \mathcal{P}$. To see this, first note that $\mathcal{P}$ is a polytope. For every vertex $v$ of $\mathcal{P}$, the smallest $C_v$ such that $v$ belongs to the left-hand side is the solution of the following linear program: $\min_{p_i, q_i, C_v} C_v$ subject to $v = \sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i$ and $0 \leq p_i, q_i \leq C_v$. Since $v \in M_{x^*}$, this linear program is always feasible and admits a finite solution $C_v < \infty$. Now let $C' = \max_{v \in V(\mathcal{P})} C_v$, where $V(\mathcal{P})$ is the set of all vertices of $\mathcal{P}$. Since any $v \in \mathcal{P}$ can be expressed as a convex combination of points in $V(\mathcal{P})$, $v$ can also be expressed as $\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i$ with $0 \leq p_i, q_i \leq C'$. To sum up, $\frac{x - x^*}{\|x - x^*\|}$ can be represented as $\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i$ with $0 \leq p_i, q_i \leq C'$, which further implies that $x - x^*$ can be represented as $\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i$ with $0 \leq p_i, q_i \leq C' \|x - x^*\|$. Notice that $C'$ only depends on the set of tight constraints at $x^*$. ∎

Finally, we are ready to combine all previous claims and prove the desired inequality. Define $A_i = a_i^\top (x - x^*)$ and $C_i = c_i^\top (x - x^*)$. By Claim 5, we can write $x - x^*$ as $\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i$ with $0 \leq p_i, q_i \leq C' \|x - x^*\|$, and thus $\sum_{i=1}^{\ell} p_i A_i + \sum_{i=1}^{k} q_i C_i = (\sum_{i=1}^{\ell} p_i a_i + \sum_{i=1}^{k} q_i c_i)^\top (x - x^*) = \|x - x^*\|^2$. On the other hand, since $x - x^* \in M_{x^*}$ by Claim 4, we have $\sum_{i=1}^{\ell} p_i A_i = \sum_{i=1}^{\ell} p_i a_i^\top (x - x^*) \leq 0$ and $\sum_{i=1}^{k} q_i C_i \leq \max_{i \in \{1, \ldots, k\}} C_i \cdot \sum_{i=1}^{k} q_i \leq \max_{i \in \{1, \ldots, k\}} C_i \cdot k C' \|x - x^*\|$, where the first inequality uses $q_i \geq 0$, and the second uses $\max_{i \in \{1, \ldots, k\}} C_i > 0$ (by Claim 2, $x$ violates some tight optimality constraint, so $c_i^\top (x - x^*) > 0$ for some $i \leq k$) together with $0 \leq q_i \leq C' \|x - x^*\|$. Combining the three inequalities above, we get $\max_{i \in \{1, \ldots, k\}} C_i \geq \frac{1}{k C'} \|x - x^*\|$. Then by Claim 3, $\max_{y' \in \mathcal{Y}} x^\top G y' - \rho \geq \max_{i \in \{1, \ldots, k\}} C_i \geq \frac{1}{k C'} \|x - x^*\|$.
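The linear program defining $C_v$ in Claim 5 can be written down directly. A minimal sketch, assuming a hypothetical 2-D instance with a single feasibility direction $a_1$ and a single optimality direction $c_1$ (illustrative stand-ins for the $a_i$, $c_i$ in the proof):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical directions and a vertex v of the truncated cone P.
a = [np.array([1.0, 0.0])]
c = [np.array([0.0, 1.0])]
v = np.array([0.3, 0.7])

n = len(a) + len(c)                    # variables: p_1, q_1, then C_v
obj = np.zeros(n + 1); obj[-1] = 1.0   # minimize C_v
# equality: sum_i p_i a_i + sum_i q_i c_i = v (C_v has zero coefficient)
A_eq = np.column_stack(a + c + [np.zeros(2)])
# inequalities: p_i - C_v <= 0 and q_i - C_v <= 0
A_ub = np.hstack([np.eye(n), -np.ones((n, 1))])
res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(n),
              A_eq=A_eq, b_eq=v, bounds=(0, None))
assert res.success and abs(res.fun - 0.7) < 1e-8  # C_v = largest coefficient
```

Taking the maximum of $C_v$ over the (finitely many) vertices of $\mathcal{P}$ yields the constant $C'$ of the claim.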
Note that $k$ and $C'$ only depend on the set of tight constraints at the projection point $x^*$, and there are only finitely many different sets of tight constraints. Therefore, we conclude that there exists a constant $c > 0$ such that $\max_{y' \in \mathcal{Y}} x^\top G y' - \rho \geq c \|x - x^*\|$ holds for all $x \in \mathcal{X}$, which completes the proof. ∎

H PROOF OF THEOREM 6 AND THEOREM 7

Proof of Theorem 6. Suppose that $f$ is $\gamma$-strongly-convex in $x$ and $\gamma$-strongly-concave in $y$, and let $(x^*, y^*) \in \mathcal{Z}^*$. Then for any $(x, y)$ we have $f(x, y) - f(x^*, y) \leq \nabla_x f(x, y)^\top (x - x^*) - \frac{\gamma}{2} \|x - x^*\|^2$ and $f(x, y^*) - f(x, y) \leq \nabla_y f(x, y)^\top (y^* - y) - \frac{\gamma}{2} \|y - y^*\|^2$. Summing up the two inequalities, and noticing that $f(x, y^*) - f(x^*, y) \geq 0$ for any $(x^*, y^*) \in \mathcal{Z}^*$, we get $F(z)^\top (z - z^*) \geq \frac{\gamma}{2} \|z - z^*\|^2$, and therefore, for $z \notin \mathcal{Z}^*$, $\frac{F(z)^\top (z - z^*)}{\|z - z^*\|} \geq \frac{\gamma}{2} \|z - z^*\|$, which implies SP-MS with $\beta = 0$ and $C = \gamma/2$. ∎

Proof of Theorem 7. First, we show that $f$ has a unique Nash equilibrium $z^* = (x^*, y^*) = ((0, 1), (0, 1))$. As $f$ is strictly decreasing in $y_1$, we must have $y^*_1 = 0$ and $y^*_2 = 1$. In addition, if $x = (0, 1)$, then $\max_{y \in \mathcal{Y}} f(x, y) = -\min_{y \in \mathcal{Y}} y_1^{2n} = 0$. If $x \neq (0, 1)$, then by choosing $y^* = (0, 1)$, $f(x, y^*) = x_1^{2n} > 0$. Therefore, we have $x^* = (0, 1)$, which proves that the unique Nash equilibrium is $x^* = (0, 1)$, $y^* = (0, 1)$.

Second, we show that $f$ satisfies SP-MS with $\beta = 2n - 2$. In fact, for any $z = (x, y) \neq z^*$, we have $F(z)^\top (z - z^*) = (2n x_1^{2n-1} - y_1,\ 0,\ 2n y_1^{2n-1} + x_1,\ 0)^\top (x_1,\ x_2 - 1,\ y_1,\ y_2 - 1) = 2n (x_1^{2n} + y_1^{2n}) \geq 4n \left( \frac{x_1^2 + y_1^2}{2} \right)^n$ (Jensen's inequality) $= \frac{n}{2^{n-2}} (x_1^2 + y_1^2)^n$. Note that $\|z - z^*\|^2 = x_1^2 + (1 - x_2)^2 + y_1^2 + (1 - y_2)^2 = 2 x_1^2 + 2 y_1^2$. Therefore, we have $\frac{F(z)^\top (z - z^*)}{\|z - z^*\|} \geq \frac{n}{2^{2n-2}} \|z - z^*\|^{2n-1}$. This shows that $f$ satisfies SP-MS with $\beta = 2n - 2$ and $C = \frac{n}{2^{2n-2}}$. ∎
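The inequality established in the proof of Theorem 6 is easy to test numerically. A sketch on a hypothetical instance $f(x, y) = \frac{\gamma}{2} x^2 - \frac{\gamma}{2} y^2 + x y$ over $[-1, 1]^2$, whose saddle point is $z^* = (0, 0)$:

```python
import numpy as np

g = 0.7  # strong convexity/concavity modulus gamma (hypothetical value)

def F(z):
    # F(z) = (grad_x f, -grad_y f) for f(x, y) = (g/2) x^2 - (g/2) y^2 + x y
    x, y = z
    return np.array([g * x + y, g * y - x])

# check F(z)^T (z - z*) >= (g/2) ||z - z*||^2 at random points, i.e. SP-MS with beta = 0
rng = np.random.default_rng(1)
for _ in range(1000):
    z = rng.uniform(-1, 1, size=2)
    assert F(z) @ z >= (g / 2) * (z @ z) - 1e-12
```

For this instance $F(z)^\top z = \gamma \|z\|^2$, so the bound $\frac{\gamma}{2}\|z\|^2$ holds with room to spare; the proof shows the factor $\gamma/2$ is what survives in general.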

I PROOF OF THEOREM 8

Proof of Theorem 8. As argued in Section 5, with $\Theta_t = \|\hat{z}_t - \Pi_{\mathcal{Z}^*}(\hat{z}_t)\|^2 + \frac{1}{16} \|\hat{z}_t - z_{t-1}\|^2$ and $\zeta_t = \|\hat{z}_{t+1} - z_t\|^2 + \|z_t - \hat{z}_t\|^2$, we have (see Eq. (4)) $\Theta_{t+1} \leq \Theta_t - \frac{15}{16} \zeta_t$. (21) Below, we relate $\zeta_t$ to $\Theta_{t+1}$ using the SP-MS condition, and then apply Lemma 12 to show $\Theta_t \leq 2\, \mathrm{dist}^2(\hat{z}_1, \mathcal{Z}^*)(1 + C_5)^{-t}$ if $\beta = 0$, and $\Theta_t \leq \big[ (1 + 4 (\frac{4}{\beta})^{\frac{1}{\beta}})\, \mathrm{dist}^2(\hat{z}_1, \mathcal{Z}^*) + (\frac{2 \cdot 2^\beta}{C_5 \beta})^{\frac{1}{\beta}} \big] t^{-\frac{1}{\beta}}$ if $\beta > 0$, (22) where $C_5 = \min\{\frac{16 \eta^2 C^2}{81}, \frac{1}{2}\}$ as defined in the statement of the theorem. This is enough to prove the theorem since $\mathrm{dist}^2(z_t, \mathcal{Z}^*) \leq \|z_t - \Pi_{\mathcal{Z}^*}(\hat{z}_{t+1})\|^2 \leq 2 \|\hat{z}_{t+1} - \Pi_{\mathcal{Z}^*}(\hat{z}_{t+1})\|^2 + 2 \|\hat{z}_{t+1} - z_t\|^2 \leq 32 \Theta_{t+1} \leq 32 \Theta_t$.

Next, we prove Eq. (22). We first record a simple fact implied by Eq. (21): $\|\hat{z}_{t+1} - z_t\|^2 \leq \zeta_t \leq \frac{16}{15} \Theta_t \leq \cdots \leq \frac{16}{15} \Theta_1$. (23) Notice that
$\zeta_t \geq \frac{1}{2} \|\hat{z}_{t+1} - z_t\|^2 + \frac{1}{2} \left( \|\hat{z}_{t+1} - z_t\|^2 + \|z_t - \hat{z}_t\|^2 \right)$
$\geq \frac{1}{2} \|\hat{z}_{t+1} - z_t\|^2 + \frac{16 \eta^2}{81} \sup_{z' \in \mathcal{Z}} \frac{[F(\hat{z}_{t+1})^\top (\hat{z}_{t+1} - z')]_+^2}{\|\hat{z}_{t+1} - z'\|^2}$ (Lemma 4)
$\geq \frac{1}{2} \|\hat{z}_{t+1} - z_t\|^2 + \frac{16 \eta^2 C^2}{81} \|\hat{z}_{t+1} - \Pi_{\mathcal{Z}^*}(\hat{z}_{t+1})\|^{2(\beta+1)}$ (SP-MS condition)
$\geq \min\{\tfrac{16 \eta^2 C^2}{81}, \tfrac{1}{2}\} \left( \tfrac{15}{16 \Theta_1} \right)^\beta \left[ \|\hat{z}_{t+1} - z_t\|^{2(\beta+1)} + \|\hat{z}_{t+1} - \Pi_{\mathcal{Z}^*}(\hat{z}_{t+1})\|^{2(\beta+1)} \right]$ (by Eq. (23))
$\geq \min\{\tfrac{16 \eta^2 C^2}{2^\beta \cdot 81}, \tfrac{1}{2}\} \left( \tfrac{15}{32 \Theta_1} \right)^\beta \left[ \|\hat{z}_{t+1} - z_t\|^2 + \|\hat{z}_{t+1} - \Pi_{\mathcal{Z}^*}(\hat{z}_{t+1})\|^2 \right]^{\beta+1}$ (by Hölder's inequality: $(a^{\beta+1} + b^{\beta+1})(1 + 1)^\beta \geq (a + b)^{\beta+1}$)
$\geq \min\{\tfrac{C_5}{2^\beta}, \tfrac{1}{2}\} \left( \tfrac{1}{4 \Theta_1} \right)^\beta \Theta_{t+1}^{\beta+1}$ (recall that $C_5 = \min\{\tfrac{16 \eta^2 C^2}{81}, \tfrac{1}{2}\}$)
$= C'' \Theta_{t+1}^{\beta+1}$, (define $C'' = \min\{\tfrac{C_5}{2^\beta}, \tfrac{1}{2}\} (\tfrac{1}{4 \Theta_1})^\beta$)
Combining this with Eq. (21), we get $\Theta_{t+1} \leq \Theta_t - C'' \Theta_{t+1}^{\beta+1}$. (24)
When $\beta = 0$, Eq. (24) implies $\Theta_{t+1} \leq (1 + C_5)^{-1} \Theta_t$, which immediately gives $\Theta_t \leq (1 + C_5)^{-t+1} \Theta_1 \leq 2 \Theta_1 (1 + C_5)^{-t}$. When $\beta > 0$, Eq. (24) is of the form specified in Lemma 12 with $p = \beta$ and $q = C''$. Note that the second required condition of the lemma is satisfied: $C'' (\beta + 1) \Theta_1^\beta \leq \frac{\beta + 1}{2 \cdot 4^\beta} \leq 1$. Therefore, by the conclusion of Lemma 12, $\Theta_t \leq \max\{\Theta_1, (\tfrac{2}{C'' \beta})^{\frac{1}{\beta}}\}\, t^{-\frac{1}{\beta}} \leq \big[ (1 + 4 (\tfrac{4}{\beta})^{\frac{1}{\beta}}) \Theta_1 + (\tfrac{2 \cdot 2^\beta}{C_5 \beta})^{\frac{1}{\beta}} \big] t^{-\frac{1}{\beta}}$. Eq. (22) is then proven by noticing that $\Theta_1 = \mathrm{dist}^2(\hat{z}_1, \mathcal{Z}^*)$.
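The key recursion $\Theta_{t+1} \leq \Theta_t - C'' \Theta_{t+1}^{\beta+1}$ can be simulated to see the two regimes of Eq. (22). The sketch below takes the recursion with equality (the worst case, solving the implicit step by bisection) and checks the geometric rate for $\beta = 0$ and the $t^{-1/\beta}$ decay for $\beta = 1$; the constants are hypothetical.

```python
def run(beta, Cp, theta0, T):
    # iterate theta_{t+1} + Cp * theta_{t+1}^{beta+1} = theta_t for T steps
    theta = theta0
    for _ in range(T):
        lo, hi = 0.0, theta
        for _ in range(80):  # bisection on the monotone map m -> m + Cp m^{beta+1}
            mid = 0.5 * (lo + hi)
            if mid + Cp * mid ** (beta + 1) > theta:
                hi = mid
            else:
                lo = mid
        theta = lo
    return theta

t0, Cp = 1.0, 0.5

# beta = 0: exact linear rate Theta_t = Theta_0 / (1 + Cp)^t
assert abs(run(0, Cp, t0, 20) - t0 / (1 + Cp) ** 20) < 1e-9

# beta = 1: sublinear decay, roughly Theta_t ~ 1 / (Cp * t)
vals = [run(1, Cp, t0, T) for T in (100, 200, 400)]
ratios = [vals[i] / vals[i + 1] for i in range(2)]
assert all(1.5 < r < 2.5 for r in ratios)  # doubling t roughly halves Theta
```

This mirrors the dichotomy in the proof: the implicit (rather than explicit) appearance of $\Theta_{t+1}$ is what makes the $\beta = 0$ case contract by a fixed factor per step.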
We prove this by induction. The base case trivially holds. Suppose that at step $t$ we have $x_t = y_t$, $\hat{x}_t = \hat{y}_t$, and $x_t, y_t, \hat{x}_t, \hat{y}_t \in \{u \in \mathbb{R}^2 : u_2 = u_1^2\}$. Consider step $t + 1$. According to the dynamics of OGDA, we have
$\hat{x}_{t+1} = \Pi_{\mathcal{X}}(\hat{x}_t - \eta(-y_{t,2},\ y_{t,1})) = \Pi_{\mathcal{X}}((\hat{x}_{t,1} + \eta y_{t,2},\ \hat{x}_{t,2} - \eta y_{t,1}))$,
$x_{t+1} = \Pi_{\mathcal{X}}(\hat{x}_{t+1} - \eta(-y_{t,2},\ y_{t,1})) = \Pi_{\mathcal{X}}((\hat{x}_{t+1,1} + \eta y_{t,2},\ \hat{x}_{t+1,2} - \eta y_{t,1}))$,
$\hat{y}_{t+1} = \Pi_{\mathcal{Y}}(\hat{y}_t + \eta(x_{t,2},\ -x_{t,1})) = \Pi_{\mathcal{Y}}((\hat{y}_{t,1} + \eta x_{t,2},\ \hat{y}_{t,2} - \eta x_{t,1}))$,
$y_{t+1} = \Pi_{\mathcal{Y}}(\hat{y}_{t+1} + \eta(x_{t,2},\ -x_{t,1})) = \Pi_{\mathcal{Y}}((\hat{y}_{t+1,1} + \eta x_{t,2},\ \hat{y}_{t+1,2} - \eta x_{t,1}))$. (25)
By the induction hypothesis, $\hat{x}_{t+1} = \hat{y}_{t+1}$, which further leads to $x_{t+1} = y_{t+1}$.

Now we prove that for any $(x'_1, x'_2)$ with $x'_1 \geq 0$, $x'_2 \leq \frac{1}{4}$ and $x'_2 < x'^2_1$, the projection $(x''_1, x''_2) = \Pi_{\mathcal{X}}((x'_1, x'_2))$ satisfies $x''^2_1 = x''_2$. Suppose instead that $x''^2_1 < x''_2$. Then by the intermediate value theorem, there exists $(x'''_1, x'''_2)$ on the line segment between $(x'_1, x'_2)$ and $(x''_1, x''_2)$ such that $x'''^2_1 = x'''_2$. Moreover, as $x'_1 \geq 0$, $x''_1 \geq 0$, $x'_2 \leq \frac{1}{4}$, $x''_2 \leq \frac{1}{4}$, we know that $(x'''_1, x'''_2) \in \mathcal{X}$. Therefore, $\|x' - x'''\| < \|x' - x''\|$, a contradiction.

Now consider $\hat{x}_{t+1}$. By the induction hypothesis, $(\hat{x}_{t,1} + \eta y_{t,2})^2 \geq \hat{x}^2_{t,1} = \hat{x}_{t,2} \geq \hat{x}_{t,2} - \eta y_{t,1}$. If both hold with equality, trivially $\hat{x}^2_{t+1,1} = \hat{x}^2_{t,1} = \hat{x}_{t,2} = \hat{x}_{t+1,2}$ according to Eq. (25). Otherwise, as $\hat{x}_{t,1} + \eta y_{t,2} \geq 0$ and $\hat{x}_{t,2} - \eta y_{t,1} \leq \frac{1}{4}$, the analysis above also gives $\hat{x}^2_{t+1,1} = \hat{x}_{t+1,2}$. Applying the same analysis to $\hat{y}_{t+1}$, $x_{t+1}$ and $y_{t+1}$ finishes the induction. ∎

Claim 3. With $\eta \leq \frac{1}{64}$, the following holds for all $t \geq 1$: $x_{t,1} \in [\tfrac{1}{2} \hat{x}_{t,1},\ 2 \hat{x}_{t,1}]$ (26), and $\hat{x}_{t,1} \in [\hat{x}_{t-1,1} - 4 \eta \hat{x}^2_{t-1,1},\ \hat{x}_{t-1,1} + 4 \eta \hat{x}^2_{t-1,1}]$ (27).

We prove the claim by induction on $t$. The case $t = 1$ trivially holds. Suppose that Eq. (26) and Eq. (27) hold at step $t$, and consider step $t + 1$.

Induction step for Eq. (27). According to Claim 2, we have $\hat{x}_{t+1} = \Pi_{\mathcal{X}}(\hat{x}_t - \eta(-y_{t,2},\ y_{t,1})) = \Pi_{\mathcal{X}}((\hat{x}_{t,1} + \eta x^2_{t,1},\ \hat{x}^2_{t,1} - \eta x_{t,1}))$, and $\hat{x}_{t+1} = (u, u^2)$ for some $u \in [0, 1/2]$.
Using the definition of the projection, we have $\hat{x}_{t+1,1} = \operatorname{argmin}_{u \in [0, \frac{1}{2}]} \big[ (\hat{x}_{t,1} + \eta x^2_{t,1} - u)^2 + (\hat{x}^2_{t,1} - \eta x_{t,1} - u^2)^2 \big] =: \operatorname{argmin}_{u \in [0, \frac{1}{2}]} g(u)$. (28) Now we show that $\operatorname{argmin}_{u \in [0, \frac{1}{2}]} g(u) = \operatorname{argmin}_{u \in \mathbb{R}} g(u)$. Note that $\nabla g(u) = 2(u - \hat{x}_{t,1} - \eta x^2_{t,1}) + 4u(u^2 + \eta x_{t,1} - \hat{x}^2_{t,1})$. When $u > \frac{1}{2}$, using $\hat{x}_{t,1} \leq \frac{1}{2}$, we have $\nabla g(u) > -2 \eta x^2_{t,1} + 2 \eta x_{t,1} \geq 0$, (29) which means $g(u) > g(\frac{1}{2})$. On the other hand, when $u < 0$, using $\hat{x}_{t,1} \leq \frac{1}{2}$, we have $\nabla g(u) < 2u - 4u \hat{x}^2_{t,1} \leq u < 0$, (30) which means $g(u) > g(0)$. Combining Eq. (29) and Eq. (30), we know that $\operatorname{argmin}_{u \in [0, \frac{1}{2}]} g(u) = \operatorname{argmin}_{u \in \mathbb{R}} g(u)$. Therefore, $\hat{x}_{t+1,1}$ is the unconstrained minimizer of the convex function $g$, which means $\nabla g(\hat{x}_{t+1,1}) = 0$.

Below we use contradiction to prove that $\hat{x}_{t+1,1} \geq \hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1}$. (31) If $\hat{x}_{t+1,1} < \hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1}$, we use Eq. (28) and get
$\nabla g(\hat{x}_{t+1,1}) = 2(\hat{x}_{t+1,1} - \hat{x}_{t,1} - \eta x^2_{t,1}) + 4 \hat{x}_{t+1,1} (\hat{x}^2_{t+1,1} + \eta x_{t,1} - \hat{x}^2_{t,1})$
$< 2(-4 \eta \hat{x}^2_{t,1} - \eta x^2_{t,1}) + 4 \hat{x}_{t+1,1} (\eta x_{t,1} - 8 \eta \hat{x}^3_{t,1} + 16 \eta^2 \hat{x}^4_{t,1})$ ($\hat{x}_{t+1,1} < \hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1}$)
$\leq -\tfrac{1}{2} \eta \hat{x}^2_{t,1} + 64 \eta^2 \hat{x}^5_{t,1} - 32 \eta^2 \hat{x}^3_{t,1} - 256 \eta^3 \hat{x}^6_{t,1}$ (by Eq. (26))
$\leq -\tfrac{1}{2} \eta \hat{x}^2_{t,1} - 16 \eta^2 \hat{x}^3_{t,1} - 256 \eta^3 \hat{x}^6_{t,1}$ ($\hat{x}_{t,1} \leq \tfrac{1}{2}$)
$\leq 0$,
a contradiction. Similarly, if $\hat{x}_{t+1,1} > \hat{x}_{t,1} + 4 \eta \hat{x}^2_{t,1}$, we have
$\nabla g(\hat{x}_{t+1,1}) = 2(\hat{x}_{t+1,1} - \hat{x}_{t,1} - \eta x^2_{t,1}) + 4 \hat{x}_{t+1,1} (\hat{x}^2_{t+1,1} + \eta x_{t,1} - \hat{x}^2_{t,1}) > 2(4 \eta \hat{x}^2_{t,1} - \eta x^2_{t,1}) + 4 \hat{x}_{t+1,1} (\eta x_{t,1} + 8 \eta \hat{x}^3_{t,1} + 16 \eta^2 \hat{x}^4_{t,1}) \geq 0$. (Eq. (26))
The calculations above conclude that $\hat{x}_{t+1,1} \in [\hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1},\ \hat{x}_{t,1} + 4 \eta \hat{x}^2_{t,1}]$.

Induction step for Eq. (26). Similarly, we have $x_{t+1,1} = \operatorname{argmin}_{u \in [0, \frac{1}{2}]} \big[ (\hat{x}_{t+1,1} + \eta x^2_{t,1} - u)^2 + (\hat{x}^2_{t+1,1} - \eta x_{t,1} - u^2)^2 \big] =: \operatorname{argmin}_{u \in [0, \frac{1}{2}]} h(u)$, with $\nabla h(u) = 2(u - \hat{x}_{t+1,1} - \eta x^2_{t,1}) + 4u(u^2 + \eta x_{t,1} - \hat{x}^2_{t+1,1})$ and $\nabla h(x_{t+1,1}) = 0$. If $x_{t+1,1} < \frac{1}{2} \hat{x}_{t+1,1}$, we have
$\nabla h(x_{t+1,1}) = 2(x_{t+1,1} - \hat{x}_{t+1,1} - \eta x^2_{t,1}) + 4 x_{t+1,1} (x^2_{t+1,1} + \eta x_{t,1} - \hat{x}^2_{t+1,1}) < -\hat{x}_{t+1,1} - 2 \eta x^2_{t,1} - 3 x_{t+1,1} \hat{x}^2_{t+1,1} + 2 \eta \hat{x}_{t+1,1} x_{t,1}$ ($x_{t+1,1} < \tfrac{1}{2} \hat{x}_{t+1,1}$)
$\leq 0$. ($\eta \leq \tfrac{1}{64}$, $x_{t,1} \leq \tfrac{1}{2}$)
If $x_{t+1,1} > 2 \hat{x}_{t+1,1}$, we also have
$\nabla h(x_{t+1,1}) = 2(x_{t+1,1} - \hat{x}_{t+1,1} - \eta x^2_{t,1}) + 4 x_{t+1,1} (x^2_{t+1,1} + \eta x_{t,1} - \hat{x}^2_{t+1,1})$
$> 2 \hat{x}_{t+1,1} - 2 \eta x^2_{t,1} + 24 \hat{x}^3_{t+1,1} + 8 \eta \hat{x}_{t+1,1} x_{t,1}$ ($x_{t+1,1} > 2 \hat{x}_{t+1,1}$)
$\geq 2 \hat{x}_{t+1,1} - 2 \eta x^2_{t,1} + 24 \hat{x}^3_{t+1,1} + 8 \eta (\hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1}) x_{t,1}$ (Eq. (31))
$\geq 2 \hat{x}_{t+1,1} - 2 \eta x^2_{t,1} + 24 \hat{x}^3_{t+1,1} + 8 \eta (\tfrac{1}{2} x_{t,1} - 4 \eta \hat{x}^2_{t,1}) x_{t,1}$ (Eq. (26))
$= 2 \hat{x}_{t+1,1} + 2 \eta x^2_{t,1} + 24 \hat{x}^3_{t+1,1} - 32 \eta^2 \hat{x}^2_{t,1} x_{t,1}$
$\geq 2 \hat{x}_{t+1,1} + \tfrac{1}{4} \eta \hat{x}^2_{t,1} + 24 \hat{x}^3_{t+1,1} - 32 \eta^2 \hat{x}^2_{t,1} x_{t,1}$ (Eq. (26))
$\geq 0$. ($\eta \leq \tfrac{1}{64}$, $x_{t,1} \leq \tfrac{1}{2}$)
Both lead to a contradiction. Therefore, we conclude that $x_{t+1,1} \in [\tfrac{1}{2} \hat{x}_{t+1,1},\ 2 \hat{x}_{t+1,1}]$, which finishes the induction. ∎

Claim 4. $x_{t,1} \geq \hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1}$ for all $t \geq 1$.

The case $t = 1$ holds trivially. For $t \geq 2$, we prove this by contradiction. Using the definition of the projection, we again have $x_{t+1,1} = \operatorname{argmin}_{u \in [0, \frac{1}{2}]} h(u)$ with $h$ as above. Similar to the analysis in Claim 3, we have $\operatorname{argmin}_{u \in [0, \frac{1}{2}]} h(u) = \operatorname{argmin}_{u \in \mathbb{R}} h(u)$, which means $\nabla h(x_{t+1,1}) = 0$. Note that $\eta \leq \frac{1}{64}$ and $0 \leq x_{t,1} \leq \frac{1}{2}$; according to Eq. (26) and Eq. (27), we have $\hat{x}_{t+1,1} \in [\hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1},\ \hat{x}_{t,1} + 4 \eta \hat{x}^2_{t,1}]$. If $x_{t+1,1} < \hat{x}_{t+1,1} - 4 \eta \hat{x}^2_{t+1,1}$, we show that $\nabla h(x_{t+1,1}) < 0$. In fact,
$\nabla h(x_{t+1,1}) = 2(x_{t+1,1} - \hat{x}_{t+1,1} - \eta x^2_{t,1}) + 4 x_{t+1,1} (x^2_{t+1,1} + \eta x_{t,1} - \hat{x}^2_{t+1,1})$
$< 2(-4 \eta \hat{x}^2_{t+1,1} - \eta x^2_{t,1}) + 4 x_{t+1,1} (\eta x_{t,1} - 8 \eta \hat{x}^3_{t+1,1} + 16 \eta^2 \hat{x}^4_{t+1,1})$
$\leq 64 \eta^2 \hat{x}^5_{t+1,1} - 32 \eta^2 \hat{x}^3_{t+1,1} - 256 \eta^3 \hat{x}^6_{t+1,1}$
$\leq -16 \eta^2 \hat{x}^3_{t,1} - 256 \eta^3 \hat{x}^6_{t,1}$ ($\hat{x}_{t,1} \leq \tfrac{1}{2}$)
$\leq 0$,
a contradiction. Therefore, $x_{t,1} \geq \hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1}$ for all $t \geq 1$. ∎

Claim 5. If $\eta \leq \frac{1}{64}$, then $\|z_t - z^*\| \geq \Omega(1/t)$.

Now we are ready to prove $\|z_t - z^*\| \geq \Omega(1/t)$. First we show $\hat{x}_{t,1} \geq \frac{1}{2t}$ for all $t \geq 1$ by induction. The case $t = 1$ trivially holds. Suppose that it holds at step $t$.
Considering step $t + 1$, we have
$\hat{x}_{t+1,1} \geq \hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1}$ (Claim 3)
$\geq \hat{x}_{t,1} - \tfrac{1}{16} \hat{x}^2_{t,1}$ ($\eta \leq \tfrac{1}{64}$)
$\geq \tfrac{1}{2t} - \tfrac{1}{64 t^2}$ ($\tfrac{1}{2t} \leq \hat{x}_{t,1} \leq \tfrac{1}{2}$, and $x - \tfrac{1}{16} x^2$ is increasing for $x \leq 8$)
$\geq \tfrac{1}{2(t+1)}$. ($t \geq 1$)
Therefore, $\hat{x}_{t,1} \geq \frac{1}{2t}$ for all $t \geq 1$. By Claim 4 and the analysis above, this shows that $x_{t,1} \geq \hat{x}_{t,1} - 4 \eta \hat{x}^2_{t,1} \geq \frac{1}{2(t+1)}$. Note that according to Claim 1, $x^* = 0$. Therefore, we have $\|z_t - z^*\| \geq x_{t,1} \geq \frac{1}{2(t+1)}$, which finishes the proof. ∎

Claim 6. In this example, SP-MS holds with $\beta = 3$.

This can be seen as follows (note that the diameter of $\mathcal{Z}$ is at most $1$): $\max_{z' \in \mathcal{Z}} \frac{F(z)^\top (z - z')}{\|z - z'\|} \geq \max_{z' \in \mathcal{Z}} F(z)^\top (z - z') = \max_{x' \in \mathcal{X}, y' \in \mathcal{Y}} \left( x^\top G y' - x'^\top G y \right)$

$= \max_{x' \in \mathcal{X}, y' \in \mathcal{Y}} \{-x_1 y'_2 + x_2 y'_1 + x'_1 y_2 - x'_2 y_1\}$
$\geq -x_1 x_2^2 + x_2^2 + y_2^2 - y_2^2 y_1$ (picking $y'_1 = x_2$, $y'_2 = x_2^2$, $x'_1 = y_2$, $x'_2 = y_2^2$)
$\geq \tfrac{1}{2}(x_2^2 + y_2^2)$ (since $x_1 \leq \tfrac{1}{2}$ and $y_1 \leq \tfrac{1}{2}$)
$\geq \tfrac{1}{4}(x_1^4 + x_2^2 + y_1^4 + y_2^2)$ (since $x_2 \geq x_1^2$ and $y_2 \geq y_1^2$)
$\geq \tfrac{1}{16}(x_1^2 + x_2^2 + y_1^2 + y_2^2)^2 = \tfrac{1}{16} \|z - z^*\|^4$,
where the last inequality uses $(a + b + c + d)^2 \leq 4(a^2 + b^2 + c^2 + d^2)$ together with $x_2^4 \leq x_2^2$, $y_2^4 \leq y_2^2$, and $z^* = 0$ by Claim 1. This is exactly the SP-MS condition with $\beta = 3$.



(Rakhlin & Sridharan, 2013). It is also referred to as "single-call extra-gradient" in (Hsieh et al., 2019), but it does not belong to the class of "extra-gradient" methods discussed in (Tseng, 1995; Liang & Stokes, 2019; Golowich et al., 2020b), for example.

One might find that the constant $C_3$ is exponential in some problem-dependent quantity $T_0$. However, this is simply a loose bound in exchange for a more concise presentation: our proof in fact shows that when $t < T_0$ the convergence is at a slower $1/t$ rate, and when $t \geq T_0$ the convergence is linear without this large constant.

This is equivalent to the condition $\mathrm{dist}^2_q(F(z), F(z')) \leq L^2\, \mathrm{dist}^2_p(z, z')$ in Lemma 1 with $p = 2$, hence the same notation $L$.

After the first version of this paper, we found that (Gilpin et al., 2008, Lemma 3) gives a simpler proof of our Theorem 5. Although their lemma only focuses on the case where the feasible sets are probability simplices, it can be directly extended to the case of polytopes.

In fact, any $\eta < \frac{1}{2L}$ is enough to achieve a linear convergence rate for OGDA, as one can verify by going over our proof. We use $\eta \leq \frac{1}{8L}$ simply for consistency with the results for OMWU (where $\eta$ cannot be set any larger due to technical reasons).

Note that in this case the projection step of OGDA can be implemented efficiently in $O(M \ln M + N \ln N)$ time (Wang & Carreira-Perpinán, 2013).



Figure 1: Experiments of OGDA and OMWU with different learning rates for a matrix game $f(x, y) = x^\top G y$. "OGDA/OMWU-eta=$\eta$" represents the curve of OGDA/OMWU with learning rate $\eta$. The configuration order in the legend is consistent with the order of the curves. For OMWU, $\eta \geq 11$ makes the algorithm diverge. The plot confirms the linear convergence of OMWU and OGDA, although OGDA is generally observed to converge faster than OMWU.

Figure 3: Experiments of OGDA on matrix games with curved regions, where $f(x, y) = x_2 y_1 - x_1 y_2$, $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \leq a \leq \tfrac{1}{2},\ 0 \leq b \leq \tfrac{1}{2^n},\ a^n \leq b\}$, and $n = 2, 4, 6, 8$. This figure is a log-log plot of $\|z_t - z^*\|$ versus $t$, and it indicates sublinear convergence rates of OGDA in all these games.

Figure 4: Experiments on a strongly-convex-strongly-concave game where $f(x, y) = x_1^2 - y_1^2 + 2 x_1 y_1$ and $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \leq a, b \leq 1,\ a + b = 1\}$. The figure shows $\ln \|z_t - z^*\|$ versus the time step $t$. The result shows that OGDA enjoys linear convergence and outperforms OMWU in this case.

Figure 5: Experiments of OGDA on a set of games satisfying SP-MS with $\beta > 0$, where $f(x, y) = x_1^{2n} - x_1 y_1 - y_1^{2n}$ for some integer $n \geq 2$ and $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \leq a, b \leq 1,\ a + b = 1\}$. The result shows that OGDA converges to the Nash equilibrium at sublinear rates in these instances.



ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for providing highly constructive comments, which brought about significant improvements of the results during the rebuttal phase. CL would like to thank Yu-Guan Hsieh for many helpful discussions on error bounds and metric subregularity. The authors are supported by NSF Awards IIS-1755781 and IIS-1943607.

J PROOF OF THEOREM 9

Proof of Theorem 9. Consider the following $2 \times 2$ bilinear game with curved feasible sets: $f(x, y) = x^\top G y = x_2 y_1 - x_1 y_2$ with $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \leq a \leq \tfrac{1}{2},\ a^2 \leq b \leq \tfrac{1}{4}\}$. Below, we use Claims 1-5 to argue that if the two players start from $x_0 = y_0 = \hat{x}_0 = \hat{y}_0 = (\tfrac{1}{2}, \tfrac{1}{4})$ and use any constant learning rate $\eta \leq \tfrac{1}{64}$, then the convergence is sublinear in the sense that $\|z_t - z^*\| \geq \Omega(1/t)$. Then, in Claim 6, we show that in this example SP-MS holds with $\beta = 3$.

Claim 1. The unique equilibrium is $x^* = 0$, $y^* = 0$.

When $x = 0$, clearly $\max_{y' \in \mathcal{Y}} f(x, y') = 0$. When $x \neq 0$, we prove $\max_{y' \in \mathcal{Y}} f(x, y') > 0$ below. If $x_1 \neq 0$, we let $y'_1 = \tfrac{1}{2} x_1$ and $y'_2 = \tfrac{1}{4} x_1^2$ (which satisfies $y' \in \mathcal{Y}$), and thus $f(x, y') = \tfrac{1}{2} x_1 x_2 - \tfrac{1}{4} x_1^3 \geq \tfrac{1}{4} x_1^3 > 0$ (using $x_2 \geq x_1^2$). If $x_1 = 0$ but $x_2 \neq 0$, we let $y'_1 = \tfrac{1}{2}$, $y'_2 = \tfrac{1}{4}$, and thus $f(x, y') = \tfrac{1}{2} x_2 > 0$. Thus, $\max_{y' \in \mathcal{Y}} f(x, y') > 0$ if $x \neq 0$, and $x^* = 0$ is the unique optimal solution for $x$. By the symmetry between $x$ and $y$ (because $G = -G^\top$), we can also prove that the unique optimal solution for $y$ is $y^* = 0$. ∎

Claim 2. Suppose that $x_0 = y_0 = \hat{x}_0 = \hat{y}_0 = (\tfrac{1}{2}, \tfrac{1}{4})$. Then, at any step $t \in [T]$, we have $x_t = y_t$ and $\hat{x}_t = \hat{y}_t$, and all of $x_t, y_t, \hat{x}_t, \hat{y}_t$ belong to $\{u \in \mathbb{R}^2 : u_2 = u_1^2\}$.
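The dynamics analyzed in Claims 1-5 can be simulated directly. The sketch below runs OGDA on this instance with $\eta = 1/64$; relying on Claim 2 (all iterates stay on the curve $b = a^2$), it implements the projection as a 1-D minimization over the curve, an assumption justified only by that claim. The observed decay of $x_{t,1}$ is consistent with the $\Omega(1/t)$ lower bound.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def proj_curve(p):
    # nearest point (u, u^2), u in [0, 1/2], to p (valid by Claim 2's argument)
    res = minimize_scalar(lambda u: (u - p[0]) ** 2 + (u * u - p[1]) ** 2,
                          bounds=(0.0, 0.5), method='bounded',
                          options={'xatol': 1e-12})
    return np.array([res.x, res.x ** 2])

eta, T = 1.0 / 64, 300
x = xh = y = yh = np.array([0.5, 0.25])   # start at (1/2, 1/4)
traj = []
for t in range(T):
    gx = np.array([-y[1], y[0]])          # grad_x f(x_t, y_t) for f = x2 y1 - x1 y2
    gy = np.array([x[1], -x[0]])          # grad_y f(x_t, y_t)
    xh_new = proj_curve(xh - eta * gx)    # hat-sequence update
    yh_new = proj_curve(yh + eta * gy)
    x = proj_curve(xh_new - eta * gx)     # optimistic step reuses the same gradient
    y = proj_curve(yh_new + eta * gy)
    xh, yh = xh_new, yh_new
    traj.append(x[0])

# decay toward x* = 0, but no faster than Omega(1/t) (Claim 5)
assert traj[-1] > 1.0 / (4 * T)
assert traj[-1] < traj[T // 2] < traj[0]
```

The trajectory shrinks roughly like $1/t$ rather than geometrically, matching the sublinear lower bound the theorem establishes for this curved-set instance.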

