PROXIMAL GRADIENT DESCENT-ASCENT: VARIABLE CONVERGENCE UNDER KŁ GEOMETRY

Abstract

The gradient descent-ascent (GDA) algorithm has been widely applied to solve minimax optimization problems. In order to obtain convergent policy parameters for minimax optimization, it is important that GDA generates convergent variable sequences rather than merely convergent sequences of function values or gradient norms. However, the variable convergence of GDA has been proved only under convexity geometries, and such an understanding is lacking for general nonconvex minimax optimization. This paper fills this gap by studying the convergence of a more general proximal-GDA for regularized nonconvex-strongly-concave minimax optimization. Specifically, we show that proximal-GDA admits a novel Lyapunov function, which monotonically decreases in the minimax optimization process and drives the variable sequence to a critical point. By leveraging this Lyapunov function and the KŁ geometry that parameterizes the local geometries of general nonconvex functions, we formally establish the variable convergence of proximal-GDA to a critical point x*, i.e., x_t → x*, y_t → y*(x*). Furthermore, over the full spectrum of the KŁ-parameterized geometry, we show that proximal-GDA achieves different types of convergence rates ranging from sublinear convergence up to finite-step convergence, depending on the geometry associated with the KŁ parameter. This is the first theoretical result on the variable convergence for nonconvex minimax optimization.

1. INTRODUCTION

Minimax optimization is a classical optimization framework that has been widely applied in various modern machine learning applications, including game theory Ferreira et al. (2012), generative adversarial networks (GANs) Goodfellow et al. (2014), adversarial training Sinha et al. (2017), reinforcement learning Qiu et al. (2020), imitation learning Ho and Ermon (2016); Song et al. (2018), etc. A typical minimax optimization problem takes the form min_x max_y f(x, y), where f is a differentiable function. A popular algorithm for solving such minimax problems is gradient descent-ascent (GDA), which in each iteration alternately performs a gradient descent update on the variable x and a gradient ascent update on the variable y. Under the alternation between descent and ascent updates, it is much desired that GDA generates sequences of variables that converge to a certain optimal point, i.e., that the minimax players obtain convergent optimal policies.

In the existing literature, many studies have established the convergence of GDA-type algorithms under various global geometries of the objective function, e.g., the convex-concave geometry (f is convex in x and concave in y) Nedić and Ozdaglar (2009), the bi-linear geometry Neumann (1928); Robinson (1951) and the Polyak-Łojasiewicz (PŁ) geometry Nouiehed et al. (2019); Yang et al. (2020). Other works studied GDA under stronger global geometric conditions on f such as the convex-strongly-concave geometry Du and Hu (2019) and the strongly-convex-strongly-concave geometry Mokhtari et al. (2020); Zhang and Wang (2020), under which GDA is shown to generate convergent variable sequences. However, these special global function geometries do not hold for modern machine learning problems, which usually involve complex models and nonconvex geometry. Recently, many studies characterized the convergence of GDA in nonconvex minimax optimization, where the objective function is nonconvex in x. Specifically, Lin et al. (2020); Nouiehed et al. (2019); Xu et al. (2020b); Boţ and Böhm (2020) studied the convergence of GDA in the nonconvex-concave setting, and Lin et al. (2020); Xu et al. (2020b) studied the nonconvex-strongly-concave setting. In these general nonconvex settings, it has been shown that GDA converges to a certain stationary point at a sublinear rate, i.e., G(x_t) ≤ O(t^{−α}) for some α > 0, where G(x_t) corresponds to a certain notion of gradient. Although such a gradient convergence result implies the stability of the algorithm, namely lim_{t→∞} ‖x_{t+1} − x_t‖ = 0, it does not guarantee the convergence of the variable sequences {x_t}_t, {y_t}_t generated by GDA. So far, the variable convergence of GDA has not been established for nonconvex problems, but only under the (strongly) convex function geometries mentioned previously Du and Hu (2019); Mokhtari et al. (2020); Zhang and Wang (2020). Therefore, we ask the following fundamental question:

• Q1: Does GDA have guaranteed variable convergence in nonconvex minimax optimization? If so, where do the variables converge to?

In fact, proving the variable convergence of GDA in the nonconvex setting is highly nontrivial for the following reasons: 1) the algorithm alternates between a minimization step and a maximization step; and 2) while it is well understood that strong global function geometry leads to the convergence of GDA, objective functions in the general nonconvex setting typically do not have an amenable global geometry. Instead, they may satisfy different types of local geometries around their critical points. Hence, it is natural and much desired to exploit the local geometries of functions in analyzing the convergence of GDA. The Kurdyka-Łojasiewicz (KŁ) geometry (see Section 2 for details) Bolte et al. (2007; 2014) provides a broad characterization of such local geometries for nonconvex functions: it parameterizes a broad spectrum of local nonconvex geometries and has been shown to hold for a broad class of practical functions. Moreover, it generalizes other global geometries such as strong convexity and the PŁ geometry. In the existing literature, the KŁ geometry has been exploited extensively to analyze the convergence rates of various gradient-based algorithms in nonconvex optimization, e.g., gradient descent Attouch and Bolte (2009); Li et al. (2017) as well as its accelerated version Zhou et al. (2020) and its distributed version Zhou et al. (2016a). Hence, we are highly motivated to study the rate of variable convergence of GDA in nonconvex minimax optimization under the KŁ geometry. In particular, we want to address the following question:

• Q2: How does the local function geometry captured by the KŁ parameter affect the variable convergence rate of GDA?

In this paper, we provide comprehensive answers to these questions. We develop a new analysis framework to study the variable convergence of GDA in nonconvex-strongly-concave minimax optimization under the KŁ geometry, and we characterize the convergence rates of GDA over the full spectrum of the parameterization of the KŁ geometry.

1.1. OUR CONTRIBUTIONS

We consider the following regularized nonconvex-strongly-concave minimax optimization problem

min_{x∈R^m} max_{y∈Y} f(x, y) + g(x) − h(y),  (P)

where f is a differentiable and nonconvex-strongly-concave function, g is a general nonconvex regularizer and h is a convex regularizer. Both g and h can be possibly nonsmooth. To solve the above regularized minimax problem, we study a proximal-GDA algorithm that leverages the forward-backward splitting update Lions and Mercier (1979); Attouch et al. (2013). We study the variable convergence property of proximal-GDA in solving the minimax problem (P). Specifically, we show that proximal-GDA admits a novel Lyapunov function H(x, y) (see Proposition 2), which is monotonically decreasing along the trajectory of proximal-GDA, i.e., H(x_{t+1}, y_{t+1}) < H(x_t, y_t). Based on the monotonicity of this Lyapunov function, we show that every limit point of the variable sequences generated by proximal-GDA is a critical point of the objective function. Moreover, by exploiting the ubiquitous KŁ geometry of the Lyapunov function, we prove that the entire variable sequence of proximal-GDA has a unique limit point, or equivalently, that it converges to a certain critical point x*, i.e., x_t → x*, y_t → y*(x*) (see the definition of y* in Section 2). To the best of our knowledge, this is the first variable convergence result for GDA-type algorithms in nonconvex minimax optimization. Furthermore, we characterize the asymptotic convergence rates of both the variable sequences and the function values of proximal-GDA in different parameterization regimes of the KŁ geometry. Depending on the value of the KŁ parameter θ, we show that proximal-GDA achieves different types of convergence rates ranging from sublinear convergence up to finite-step convergence, as summarized in Table 1 below.
Table 1: Convergence rates of proximal-GDA under different regimes of the KŁ parameter θ.

KŁ parameter | Function value convergence rate | Variable convergence rate
θ = 1 | Finite-step convergence | Finite-step convergence
θ ∈ (1/2, 1) | O(exp(−[2(1−θ)]^{t0−t})), super-linear | O(exp(−[2(1−θ)]^{t0−t})), super-linear
θ = 1/2 | O((1 + ρ)^{t0−t}), ρ > 0, linear | O(min{2, 1 + ρ}^{(t0−t)/2}), ρ > 0, linear
θ ∈ (0, 1/2) | O((t − t0)^{−1/(1−2θ)}), sub-linear | O((t − t0)^{−θ/(1−2θ)}), sub-linear

1.2. RELATED WORK

Deterministic GDA algorithms: Yang et al. (2020) studied an alternating gradient descent-ascent (AGDA) algorithm in which the gradient ascent step uses the current variable x_{t+1} instead of x_t. Boţ and Böhm (2020) extended the AGDA algorithm to an alternating proximal-GDA (APGDA) algorithm for regularized minimax optimization. Xu et al. (2020b) studied an alternating gradient projection algorithm which applies an ℓ2 regularizer to the local objective function of GDA, followed by projection onto the constraint sets. Daskalakis and Panageas (2018); Mokhtari et al. (2020); Zhang and Wang (2020) analyzed the optimistic gradient descent-ascent (OGDA) algorithm, which applies negative momentum to accelerate GDA. Mokhtari et al. (2020) also studied an extra-gradient algorithm which applies two-step GDA in each iteration. Nouiehed et al. (2019) studied a variant of GDA that performs multiple gradient ascent steps per iteration.

KŁ geometry: The KŁ geometry was defined in Bolte et al. (2007). It has been exploited to study the convergence of various first-order algorithms for solving minimization problems, including gradient descent Attouch and Bolte (2009), alternating gradient descent Bolte et al. (2014), distributed gradient descent Zhou et al. (2016a; 2018a) and accelerated gradient descent Li et al. (2017). It has also been exploited to study the convergence of second-order algorithms such as Newton's method Noll and Rondepierre (2013); Frankel et al. (2015) and the cubic regularization method Zhou et al. (2018b).

2. PROBLEM FORMULATION AND KŁ GEOMETRY

In this section, we introduce the problem formulation, the technical assumptions and the Kurdyka-Łojasiewicz (KŁ) geometry. We consider the following regularized minimax optimization problem

min_{x∈R^m} max_{y∈Y} f(x, y) + g(x) − h(y),

where f : R^m × R^n → R is a differentiable and nonconvex-strongly-concave loss function, Y ⊂ R^n is a compact and convex set, and g, h are regularizers that are possibly non-smooth. In particular, define Φ(x) := max_{y∈Y} f(x, y) − h(y); then the problem (P) is equivalent to the minimization problem min_{x∈R^m} Φ(x) + g(x). Throughout the paper, we adopt the following standard assumptions on the problem (P).

Assumption 1. The objective function of the problem (P) satisfies:
1. Function f(·, ·) is L-smooth and function f(x, ·) is µ-strongly concave;
2. Function Φ(x) + g(x) is bounded below, i.e., inf_{x∈R^m} Φ(x) + g(x) > −∞;
3. Function Φ(x) + g(x) has compact sub-level sets, i.e., for every α ∈ R, the set {x : Φ(x) + g(x) ≤ α} is compact;
4. Function g is proper and lower semi-continuous, and function h is proper, convex and lower semi-continuous.

Items 2 and 3 guarantee that the minimax problem (P) has at least one solution, and that the variable sequences generated by the proximal-GDA algorithm (see Algorithm 1) are bounded. Item 4 requires the regularizer h to be convex (possibly non-smooth), which includes many popular norm-based regularizers such as ℓp (p ≥ 1), elastic net, nuclear norm, spectral norm, etc. On the other hand, the other regularizer g can be nonconvex but lower semi-continuous, which includes all the aforementioned convex regularizers, ℓp (0 ≤ p < 1), Schatten-p norm, rank, etc. Hence, our formulation of the problem (P) covers a rich class of nonconvex objective functions and regularizers, and is more general than the existing nonconvex minimax formulation in Lin et al. (2020), which does not consider any regularizer.

Remark 1. We note that the strong concavity of f(x, ·) in item 1 can be relaxed to concavity, provided that the regularizer h(y) is µ-strongly convex. In this case, we can add −(µ/2)‖y‖² to both f(x, y) and h(y) such that Assumption 1 still holds. For simplicity, we omit the discussion of this case.

By the strong concavity of f(x, ·), the mapping y*(x) := argmax_{y∈Y} f(x, y) − h(y) is uniquely defined for every x ∈ R^m. In particular, if x* is the desired minimizer of Φ(x), then (x*, y*(x*)) is the desired solution of the minimax problem (P). Next, we present some important properties of the function Φ(x) and the mapping y*(x). The following proposition from Boţ and Böhm (2020) generalizes Lemma 4.3 of Lin et al. (2020) to the regularized setting. The proof can be found in Appendix A. Throughout, we denote κ = L/µ as the condition number and denote ∇_1 f(x, y), ∇_2 f(x, y) as the gradients with respect to the first and the second input argument, respectively. For example, with this notation, ∇_1 f(x, y*(x)) denotes the gradient of f(x, y*(x)) with respect to only the first input argument x, and the x in the second input argument y*(x) is treated as a constant.

Proposition 1 (Lipschitz continuity of y*(x) and ∇Φ(x)). Let Assumption 1 hold. Then, the mapping y*(x) and the function Φ(x) satisfy:
1. Mapping y*(x) is κ-Lipschitz continuous;
2. Function Φ(x) is L(1 + κ)-smooth with ∇Φ(x) = ∇_1 f(x, y*(x)).

As an intuitive explanation of Proposition 1, since the function f(x, y) − h(y) is L-smooth with respect to x, both the maximizer y*(x) and the corresponding maximum function value Φ(x) do not change substantially under a small change of x. Recall that the minimax problem (P) is equivalent to the standard minimization problem min_{x∈R^m} Φ(x) + g(x), which, according to item 2 of Proposition 1, consists of a smooth nonconvex function Φ(x) and a lower semi-continuous regularizer g(x). Hence, we define the optimization goal of the minimax problem (P) as finding a critical point x* of the nonconvex function Φ(x) + g(x), i.e., a point that satisfies the necessary optimality condition 0 ∈ ∂(Φ + g)(x*) for minimizing nonconvex functions.
Here, ∂ denotes the notion of subdifferential, as we elaborate below.

Definition 1 (Subdifferential and critical point, Rockafellar and Wets (2009)). The Frechét subdifferential ∂̂h of function h at x ∈ dom h is the set of u ∈ R^d defined as

∂̂h(x) := {u : liminf_{z≠x, z→x} [h(z) − h(x) − uᵀ(z − x)] / ‖z − x‖ ≥ 0},

and the limiting subdifferential ∂h at x ∈ dom h is the graphical closure of ∂̂h defined as

∂h(x) := {u : ∃ x_k → x, h(x_k) → h(x), u_k ∈ ∂̂h(x_k), u_k → u}.

The set of critical points of h is defined as crit h := {x : 0 ∈ ∂h(x)}.

Throughout, we refer to the limiting subdifferential simply as the subdifferential. We note that the subdifferential is a generalization of the gradient (when h is differentiable) and of the subgradient (when h is convex) to the nonconvex setting. In particular, any local minimizer of h must be a critical point. Next, we introduce the Kurdyka-Łojasiewicz (KŁ) geometry of a function h. Throughout, the point-to-set distance is denoted as dist_Ω(x) := inf_{u∈Ω} ‖x − u‖.

Definition 2 (KŁ geometry, Bolte et al. (2014)). A proper and lower semi-continuous function h is said to have the KŁ geometry if for every compact set Ω ⊂ dom h on which h takes a constant value h_Ω ∈ R, there exist ε, λ > 0 such that for all x̄ ∈ Ω and all x ∈ {z ∈ R^m : dist_Ω(z) < ε, h_Ω < h(z) < h_Ω + λ}, the following condition holds:

ϕ′(h(x) − h_Ω) · dist_{∂h(x)}(0) ≥ 1,  (1)

where ϕ′ is the derivative of the function ϕ : [0, λ) → R+, which takes the form ϕ(t) = (c/θ) t^θ for a certain universal constant c > 0 and KŁ parameter θ ∈ (0, 1].

The KŁ geometry characterizes the local geometry of a nonconvex function around the set of critical points. To explain, consider the case where h is a differentiable function so that ∂h(x) = {∇h(x)}. Then, the KŁ inequality in eq. (1) becomes h(x) − h_Ω ≤ O(‖∇h(x)‖^{1/(1−θ)}), which generalizes the Polyak-Łojasiewicz (PŁ) condition h(x) − h_Ω ≤ O(‖∇h(x)‖²) Łojasiewicz (1963); Karimi et al. (2016) (i.e., the KŁ parameter θ = 1/2). Moreover, the KŁ geometry has been shown to hold for a large class of functions including sub-analytic functions, logarithm and exponential functions and semi-algebraic functions. These function classes cover most of the nonconvex objective functions encountered in practical machine learning applications Zhou et al. (2016b); Yue et al. (2018); Zhou and Liang (2017); Zhou et al. (2018b). The KŁ geometry has been exploited extensively to analyze the convergence of various first-order algorithms, e.g., gradient descent Attouch and Bolte (2009); Li et al. (2017), alternating minimization Bolte et al. (2014) and distributed gradient methods Zhou et al. (2016a). It has also been exploited to study the convergence of second-order algorithms such as cubic regularization Zhou et al. (2018b). In these works, it has been shown that the variable sequences generated by these algorithms converge to a desired critical point in nonconvex optimization, and that the convergence rates critically depend on the parameterization θ of the KŁ geometry. In the subsequent sections, we provide a comprehensive understanding of the convergence and the convergence rate of proximal-GDA under the KŁ geometry.
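As a quick numerical illustration of the KŁ inequality in eq. (1) (our own sketch, not part of the paper's analysis), consider the differentiable function h(x) = ‖x‖², which satisfies the PŁ case θ = 1/2 with h_Ω = 0 and c = 1: here ϕ′(t) = t^{−1/2}, so the left-hand side of eq. (1) equals (‖x‖²)^{−1/2} · ‖2x‖ = 2 ≥ 1 at every x ≠ 0.

```python
import numpy as np

def h(x):
    # h(x) = ||x||^2, a PL function with h_Omega = 0 at the critical point x = 0
    return np.dot(x, x)

def grad_h(x):
    # gradient of h
    return 2.0 * x

def kl_quantity(x, theta=0.5, c=1.0):
    """Left-hand side of the KL inequality: phi'(h(x) - h_Omega) * ||grad h(x)||,
    where phi(t) = (c/theta) * t**theta, hence phi'(t) = c * t**(theta - 1)."""
    t = h(x)  # h(x) - h_Omega with h_Omega = 0
    return c * t ** (theta - 1.0) * np.linalg.norm(grad_h(x))

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.standard_normal(5)
    # For h(x) = ||x||^2 and theta = 1/2 the quantity equals 2, so eq. (1) holds with margin.
    assert kl_quantity(x) >= 1.0
```

The same check with θ > 1/2 would fail far from the critical point, reflecting that a larger θ encodes a sharper local geometry.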

3. PROXIMAL-GDA AND GLOBAL CONVERGENCE ANALYSIS

In this section, we study the following proximal-GDA algorithm, which leverages the forward-backward splitting updates Lions and Mercier (1979); Attouch et al. (2013) to solve the regularized minimax problem (P), and we analyze its global convergence properties. In particular, the proximal-GDA algorithm is a generalization of the GDA Du and Hu (2019) and projected GDA Nedić and Ozdaglar (2009) algorithms. The update rule is specified in Algorithm 1, where the two proximal gradient steps are formally defined as

prox_{η_x g}(x_t − η_x ∇_1 f(x_t, y_t)) :∈ argmin_{u∈R^m} { g(u) + (1/(2η_x)) ‖u − x_t + η_x ∇_1 f(x_t, y_t)‖² },  (2)
prox_{η_y h}(y_t + η_y ∇_2 f(x_t, y_t)) := argmin_{v∈Y} { h(v) + (1/(2η_y)) ‖v − y_t − η_y ∇_2 f(x_t, y_t)‖² }.  (3)

Algorithm 1 Proximal-GDA
Input: Initialization x_0, y_0, learning rates η_x, η_y.
for t = 0, 1, 2, . . . , T − 1 do
  x_{t+1} ∈ prox_{η_x g}(x_t − η_x ∇_1 f(x_t, y_t)),
  y_{t+1} = prox_{η_y h}(y_t + η_y ∇_2 f(x_t, y_t)).
end
Output: x_T, y_T.

Recall that our goal is to obtain a critical point of the minimization problem min_{x∈R^m} Φ(x) + g(x). Unlike the gradient descent algorithm, which generates a sequence of monotonically decreasing function values, the function value (Φ + g)(x_t) along the variable sequence generated by proximal-GDA is generally oscillating due to the alternation between the gradient descent and gradient ascent steps. Hence, it may seem that proximal-GDA is less stable than gradient descent. However, our next result shows that, for the problem (P), proximal-GDA admits a special Lyapunov function that monotonically decreases in the optimization process. The proof of Proposition 2 is in Appendix B.

Proposition 2. Let Assumption 1 hold and define the Lyapunov function H(z) := Φ(x) + g(x) + (1 − 1/(4κ²))‖y − y*(x)‖² with z := (x, y). Choose the learning rates such that η_x ≤ 1/(κ³(L+3)²), η_y ≤ 1/L. Then, the variables z_t = (x_t, y_t) generated by proximal-GDA satisfy, for all t = 0, 1, 2, ...,
H(z_{t+1}) ≤ H(z_t) − 2‖x_{t+1} − x_t‖² − (1/(4κ²)) [ ‖y_{t+1} − y*(x_{t+1})‖² + ‖y_t − y*(x_t)‖² ].  (4)

We first explain how this Lyapunov function is introduced in the proof. In eq. (19) in the supplementary material, we establish a recursive inequality on the objective function (Φ + g)(x_{t+1}). One can see that the right-hand side of eq. (19) contains a negative term −‖x_{t+1} − x_t‖² and an undesired positive term ‖y*(x_t) − y_t‖². Hence, the objective function (Φ + g)(x_{t+1}) may be oscillating and cannot serve as a proper Lyapunov function. In the subsequent analysis, we break this positive term into a difference of two terms ‖y*(x_t) − y_t‖² − ‖y*(x_{t+1}) − y_{t+1}‖², by leveraging the update of y_{t+1} for solving the strongly concave maximization problem. After proper rearranging, this difference term contributes to the quadratic term in the Lyapunov function. We note that the Lyapunov function H(z) is the objective function Φ(x) + g(x) regularized by the additional quadratic term (1 − 1/(4κ²))‖y − y*(x)‖², and such a Lyapunov function clearly characterizes our optimization goal. To elaborate, consider a desired case where the sequence x_t converges to a certain critical point x* and the sequence y_t converges to the corresponding point y*(x*). In this case, it can be seen that the Lyapunov function H(z_t) converges to the desired function value (Φ + g)(x*). Hence, solving the minimax problem (P) is equivalent to minimizing the Lyapunov function. More importantly, Proposition 2 shows that the Lyapunov function value sequence {H(z_t)}_t is monotonically decreasing in the optimization process of proximal-GDA, implying that the algorithm continuously makes optimization progress. We also note that the coefficient (1 − 1/(4κ²)) in the Lyapunov function is chosen so that eq. (4) can be proven to be strictly decreasing. This monotonicity property is the core of our analysis of proximal-GDA.
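To make Algorithm 1 concrete, the following is a minimal runnable sketch on a toy instance (our own illustration; the bilinear-quadratic f(x, y) = ⟨y, Ax⟩ − ½‖y‖², the ℓ1 regularizer g, the ball constraint playing the role of h, and all step sizes are assumptions, not from the paper). The prox of g is the soft-thresholding operator, and the prox of the ball indicator is Euclidean projection.

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def project_ball(v, radius):
    # prox of the indicator of {||y|| <= radius}, i.e., Euclidean projection onto Y
    n = np.linalg.norm(v)
    return v if n <= radius else v * (radius / n)

def proximal_gda(A, lam=0.1, radius=10.0, eta_x=0.05, eta_y=0.5, T=2000):
    """Proximal-GDA (Algorithm 1) for the toy problem
       f(x, y) = <y, A x> - 0.5 ||y||^2,  g(x) = lam ||x||_1,  h = indicator of a ball.
       Problem instance, step sizes and T are illustrative choices, not from the paper."""
    m, n = A.shape[1], A.shape[0]
    x, y = np.ones(m), np.zeros(n)
    for _ in range(T):
        grad_x = A.T @ y                 # nabla_1 f(x_t, y_t)
        grad_y = A @ x - y               # nabla_2 f(x_t, y_t)
        x = soft_threshold(x - eta_x * grad_x, eta_x * lam)   # proximal gradient descent on x
        y = project_ball(y + eta_y * grad_y, radius)          # proximal gradient ascent on y
    return x, y

A = np.array([[1.0, 0.5], [0.0, 1.0]])
x_T, y_T = proximal_gda(A)
# Here Phi(x) + g(x) = 0.5 ||A x||^2 + lam ||x||_1 is minimized at x* = 0, with y*(x*) = 0.
assert np.linalg.norm(x_T) < 1e-6 and np.linalg.norm(y_T) < 1e-6
```

Note that the iterates settle at the critical point x* = 0 together with y*(x*) = 0, the kind of variable convergence studied in this paper.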
Based on Proposition 2, we obtain the following asymptotic properties of the variable sequences generated by proximal-GDA. The proof can be found in Appendix C.

Corollary 1. Based on Proposition 2, the sequences {x_t, y_t}_t generated by proximal-GDA satisfy lim_{t→∞} ‖x_{t+1} − x_t‖ = 0, lim_{t→∞} ‖y_{t+1} − y_t‖ = 0, lim_{t→∞} ‖y_t − y*(x_t)‖ = 0.

The above result shows that the variable sequences generated by proximal-GDA in solving the problem (P) are asymptotically stable. In particular, the last two equations show that y_t asymptotically approaches the corresponding maximizer y*(x_t) of the objective function f(x_t, y) + g(x_t) − h(y). Hence, if x_t converges to a certain critical point, then y_t converges to the corresponding maximizer.

Discussion: We note that the monotonicity property in Proposition 2 further implies the convergence rate result min_{0≤k≤t} ‖x_{k+1} − x_k‖ ≤ O(t^{−1/2}) (by telescoping over t). When there is no regularizer, this result can be shown to further imply that min_{0≤k≤t} ‖∇Φ(x_k)‖ ≤ O(t^{−1/2}), which reduces to Theorem 4.4 of Lin et al. (2020). However, such a convergence rate result does not imply the convergence of the variable sequences {x_t}_t, {y_t}_t. To explain, applying the bound ‖x_{t+1} − x_t‖ ≤ O(t^{−1/2}) only controls the trajectory norm as ‖x_T‖ ≤ ‖x_0‖ + Σ_{t=0}^{T−1} ‖x_{t+1} − x_t‖ ≈ √T, which diverges as T → ∞. Therefore, such a type of convergence rate does not even imply the boundedness of the trajectory. In this paper, our focus is to establish the convergence of the variable sequences generated by proximal-GDA. The results in Corollary 1 imply that the alternating proximal gradient descent & ascent updates of proximal-GDA achieve stationary points, which we show below to be critical points.

Theorem 1 (Global convergence). Let Assumption 1 hold and choose the learning rates η_x ≤ 1/(κ³(L+3)²), η_y ≤ 1/L. Then, proximal-GDA satisfies the following properties.
1.
The function value sequence {(Φ + g)(x_t)}_t converges to a finite limit H* > −∞;
2. The sequences {x_t}_t, {y_t}_t are bounded and have compact sets of limit points. Moreover, (Φ + g)(x*) ≡ H* for any limit point x* of {x_t}_t;
3. Every limit point of {x_t}_t is a critical point of (Φ + g)(x).

The proof of Theorem 1 is presented in Appendix D. The above theorem establishes the global convergence properties of proximal-GDA. Specifically, item 1 shows that the function value sequence {(Φ + g)(x_t)}_t converges to a finite limit H*, which is also the limit of the Lyapunov function value sequence {H(z_t)}_t. Moreover, items 2 & 3 further show that all convergent subsequences of {x_t}_t converge to critical points of the problem, at which the function Φ + g takes the constant value H*. These results show that proximal-GDA can properly find critical points of the minimax problem (P). Furthermore, based on these results, the variable sequences generated by proximal-GDA are guaranteed to enter a local parameter region where the Kurdyka-Łojasiewicz geometry holds, which we exploit in the next section to establish stronger convergence results for the algorithm.

4. VARIABLE CONVERGENCE OF PROXIMAL-GDA UNDER KŁ GEOMETRY

We note that Theorem 1 only shows that every limit point of {x_t}_t is a critical point; the sequences {x_t, y_t}_t may not necessarily be convergent. In this section, we exploit the local KŁ geometry of the Lyapunov function to formally prove the convergence of these sequences. Throughout this section, we adopt the following assumption.

Assumption 2. Regarding the mapping y*(x), the function ‖y*(x) − y‖² has a non-empty subdifferential, i.e., ∂_x(‖y*(x) − y‖²) ≠ ∅.

Note that in many practical scenarios y*(x) is sub-differentiable. In addition, Assumption 2 ensures the sub-differentiability of the Lyapunov function H(z) := Φ(x) + g(x) + (1 − 1/(4κ²))‖y − y*(x)‖². We obtain the following variable convergence result of proximal-GDA under the KŁ geometry. The proof is presented in Appendix E.

Theorem 2 (Variable convergence). Let Assumptions 1 & 2 hold and assume that H has the KŁ geometry. Choose the learning rates η_x ≤ 1/(κ³(L+3)²) and η_y ≤ 1/L. Then, the sequence {(x_t, y_t)}_t generated by proximal-GDA converges to a certain critical point (x*, y*(x*)) of (Φ + g)(x), i.e., x_t → x*, y_t → y*(x*).

Theorem 2 formally shows that proximal-GDA is guaranteed to converge to a certain critical point (x*, y*(x*)) of the minimax problem (P), provided that the Lyapunov function belongs to the large class of KŁ functions. To the best of our knowledge, this is the first variable convergence result for GDA-type algorithms in nonconvex minimax optimization. The proof of Theorem 2 consists of the following two key steps. Step 1: By leveraging the monotonicity property of the Lyapunov function in Proposition 2, we first show that the variable sequences of proximal-GDA eventually enter a local region where the KŁ geometry holds. Step 2: Then, combining the KŁ inequality in eq. (1) with the monotonicity property of the Lyapunov function in eq. (4), we show that the variable sequences of proximal-GDA are Cauchy sequences and hence converge to a certain critical point.
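The role of the Cauchy-sequence argument in Step 2 can be illustrated numerically (our own sketch, not from the paper): increments bounded only as O(t^{−1/2}) permit the trajectory to travel arbitrarily far, whereas geometrically decaying increments, of the type the KŁ inequality yields in the θ = 1/2 case, have finite total length, which makes the sequence Cauchy and hence convergent.

```python
import math

def travel(deltas):
    # sum of increment sizes sum_t ||x_{t+1} - x_t||: an upper bound on how far the
    # iterates can move, so a finite sum certifies a Cauchy (hence convergent) sequence
    return sum(deltas)

T = 10**5
sqrt_increments = [t ** -0.5 for t in range(1, T + 1)]   # ||x_{t+1} - x_t|| = t^{-1/2}
geo_increments = [0.9 ** t for t in range(1, T + 1)]     # ||x_{t+1} - x_t|| = 0.9^t

# O(t^{-1/2}) increments sum to about 2*sqrt(T): the trajectory bound diverges with T.
assert travel(sqrt_increments) > math.sqrt(T)
# Geometric increments sum to at most 0.9/(1 - 0.9) = 9 regardless of T: Cauchy.
assert travel(geo_increments) < 9.0
```

This is exactly why the gradient-norm rates discussed in Section 3 cannot by themselves deliver variable convergence, while the KŁ-based analysis can.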

5. CONVERGENCE RATE OF PROXIMAL-GDA UNDER KŁ GEOMETRY

In this section, we exploit the parameterization of the KŁ geometry to establish various types of asymptotic convergence rates of proximal-GDA under different parameter regimes of the KŁ geometry. The proof is presented in Appendix F. In the sequel, we denote by t_0 a sufficiently large positive integer, denote by c > 0 the constant in Definition 2, and define

M := max{ (1/2)[1/η_x + (L + 4κ²)(1 + κ)²], 4κ²(L + 4κ)² }.  (5)

Theorem 3 (Function value convergence rate). Under the same conditions as those of Theorem 2, the Lyapunov function value sequence {H(z_t)}_t converges to the limit H* at the following rates.
1. If the KŁ geometry holds with θ = 1, then H(z_t) ↓ H* within a finite number of iterations;
2. If the KŁ geometry holds with θ ∈ (1/2, 1), then H(z_t) ↓ H* super-linearly as
H(z_t) − H* ≤ (2Mc²)^{−1/(2θ−1)} exp(−[2(1−θ)]^{t_0−t}), ∀t ≥ t_0;  (6)
3. If the KŁ geometry holds with θ = 1/2, then H(z_t) ↓ H* linearly as
H(z_t) − H* ≤ (1 + 1/(2Mc²))^{t_0−t} (H(z_{t_0}) − H*), ∀t ≥ t_0;  (7)
4. If the KŁ geometry holds with θ ∈ (0, 1/2), then H(z_t) ↓ H* sub-linearly as
H(z_t) − H* ≤ C(t − t_0)^{−1/(1−2θ)}, ∀t ≥ t_0,  (8)
where d_{t_0} := H(z_{t_0}) − H* and C = min{ (1−2θ)/(8Mc²), d_{t_0}^{−(1−2θ)}(1 − 2^{−(1−2θ)}) } > 0.

It can be seen from the above theorem that the convergence rate of the Lyapunov function of proximal-GDA is determined by the KŁ parameter θ. A larger θ implies that the local geometry of H is 'sharper', and hence the corresponding convergence rate is orderwise faster. In particular, the algorithm converges at a linear rate when the KŁ geometry holds with θ = 1/2 (see item 3), which generalizes the Polyak-Łojasiewicz (PŁ) geometry. As a comparison, in the existing analyses of GDA, such a linear convergence result is established under stronger geometries, e.g., the convex-strongly-concave Du and Hu (2019), strongly-convex-strongly-concave Mokhtari et al. (2020); Zhang and Wang (2020) and two-sided PŁ Yang et al. (2020) conditions. In summary, the above theorem provides a full characterization of the convergence rates of proximal-GDA over the full spectrum of the KŁ geometry.

Moreover, we also obtain the following asymptotic convergence rates of the variable sequences generated by proximal-GDA under different parameterizations of the KŁ geometry. The proof is presented in Appendix G.

Theorem 4 (Variable convergence rate). Under the same conditions as those of Theorem 2, the sequences {x_t, y_t}_t converge to their limits x*, y*(x*) respectively at the following rates.
1. If the KŁ geometry holds with θ = 1, then (x_t, y_t) → (x*, y*(x*)) within a finite number of iterations;
2. If the KŁ geometry holds with θ ∈ (1/2, 1), then (x_t, y_t) → (x*, y*(x*)) super-linearly as
max{ ‖x_t − x*‖, ‖y_t − y*(x*)‖ } ≤ O(exp(−[2(1−θ)]^{t_0−t})), ∀t ≥ t_0;  (9)
3. If the KŁ geometry holds with θ = 1/2, then (x_t, y_t) → (x*, y*(x*)) linearly as
max{ ‖x_t − x*‖, ‖y_t − y*(x*)‖ } ≤ O( min{2, 1 + 1/(2Mc²)}^{(t_0−t)/2} ), ∀t ≥ t_0;  (10)
4. If the KŁ geometry holds with θ ∈ (0, 1/2), then (x_t, y_t) → (x*, y*(x*)) sub-linearly as
max{ ‖x_t − x*‖, ‖y_t − y*(x*)‖ } ≤ O((t − t_0)^{−θ/(1−2θ)}), ∀t ≥ t_0.  (11)

To the best of our knowledge, this is the first characterization of the variable convergence rates of proximal-GDA over the full spectrum of the KŁ geometry. Similar to the convergence rates of the function value sequence, the convergence rates of the variable sequences are also determined by the parameterization of the KŁ geometry.
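To build intuition for how the KŁ parameter θ controls the decay of d_t := H(z_t) − H*, the following sketch iterates a stylized recursion d_{t+1} = d_t − γ·d_t^{2(1−θ)} (our own illustration of the kind of inequality a KŁ-based proof produces, not the paper's exact recursion; γ and d_0 are arbitrary choices) and compares θ = 1/2, which contracts geometrically, with θ = 1/4, which decays only polynomially like t^{−1/(1−2θ)} = t^{−2}.

```python
def simulate(theta, gamma=0.1, d0=1.0, T=10000):
    # Stylized KL-type recursion: d_{t+1} = d_t - gamma * d_t^{2(1-theta)}.
    d = d0
    for _ in range(T):
        d = d - gamma * d ** (2.0 * (1.0 - theta))
    return d

d_linear = simulate(theta=0.5)    # exponent is 1, so d contracts by (1 - gamma) each step
d_sublin = simulate(theta=0.25)   # exponent is 1.5, so the decrement shrinks with d

# theta = 1/2: geometric decay d_T = (1 - gamma)^T * d0, astronomically small here.
assert d_linear < 1e-300
# theta = 1/4: sub-linear decay, after T steps d_T is roughly of order T^{-2}.
assert 1e-10 < d_sublin < 1e-4
```

The gap between the two outputs mirrors the linear versus sub-linear regimes in Theorems 3 and 4.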

6. CONCLUSION

In this paper, we develop a new analysis framework for the proximal-GDA algorithm in nonconvex-strongly-concave optimization. Our key observation is that proximal-GDA has an intrinsic Lyapunov function that monotonically decreases in the minimax optimization process. Such a property demonstrates the stability of the algorithm. Moreover, we establish the formal variable convergence of proximal-GDA to a critical point of the objective function under the ubiquitous KŁ geometry. Our results fully characterize the impact of the parameterization of the KŁ geometry on the convergence rate of the algorithm. In future work, we will leverage this analysis framework to explore the convergence of stochastic GDA algorithms and their variance-reduced variants.

B PROOF OF PROPOSITION 2

Proposition 2. Let Assumption 1 hold and define the Lyapunov function H(z) := Φ(x) + g(x) + (1 -1 4κ 2 ) y-y * (x) 2 with z := (x, y). Choose the learning rates such that η x ≤ 1 κ 3 (L+3) 2 , η y ≤ 1 L . Then, the variables z t = (x t , y t ) generated by proximal-GDA satisfy, for all t = 0, 1, 2, ... H(z t+1 ) ≤ H(z t ) -2 x t+1 -x t 2 - 1 4κ 2 y t+1 -y * (x t+1 ) 2 + y t -y * (x t ) 2 . (4) Proof. Consider the t-th iteration of proximal-GDA. By smoothness of Φ we obtain that Φ(x t+1 ) ≤ Φ(x t ) + x t+1 -x t , ∇Φ(x t ) + L(1 + κ) 2 x t+1 -x t 2 . ( ) On the other hand, by the definition of the proximal gradient step of x t , we have g(x t+1 ) + 1 2η x x t+1 -x t + η x ∇ 1 f (x t , y t ) 2 ≤ g(x t ) + 1 2η x η x ∇ 1 f (x t , y t ) 2 , which further simplifies to g(x t+1 ) ≤ g(x t ) - 1 2η x x t+1 -x t 2 -x t+1 -x t , ∇ 1 f (x t , y t ) . ( ) Adding up eq. ( 17) and eq. ( 18) yields that Φ(x t+1 ) + g(x t+1 ) ≤ Φ(x t ) + g(x t ) - 1 2η x - L(1 + κ) 2 x t+1 -x t 2 + x t+1 -x t , ∇Φ(x t ) -∇ 1 f (x t , y t ) = Φ(x t ) + g(x t ) - 1 2η x - L(1 + κ) 2 x t+1 -x t 2 + x t+1 -x t ∇Φ(x t ) -∇ 1 f (x t , y t ) = Φ(x t ) + g(x t ) - 1 2η x - L(1 + κ) 2 x t+1 -x t 2 + x t+1 -x t ∇ 1 f (x t , y * (x t )) -∇ 1 f (x t , y t ) ≤ Φ(x t ) + g(x t ) - 1 2η x - L(1 + κ) 2 x t+1 -x t 2 + L x t+1 -x t y * (x t ) -y t . ≤ Φ(x t ) + g(x t ) - 1 2η x - L(1 + κ) 2 - L 2 κ 2 2 x t+1 -x t 2 + 1 2κ 2 y * (x t ) -y t 2 (19) Next, consider the term y * (x t ) -y t in the above inequality. Note that y * (x t ) is the unique minimizer of the strongly concave function f (x t , y) -h(y), and y t+1 is obtained by applying one proximal gradient step on it starting from y t . Hence, by the convergence rate of proximal gradient ascent algorithm under strong concavity, we conclude that with η y ≤ 1 L , y t+1 -y * (x t ) 2 ≤ 1 -κ -1 y t -y * (x t ) 2 . 
Hence, we further obtain that
$$\begin{aligned}
\|y^*(x_{t+1}) - y_{t+1}\|^2 &\le (1 + \kappa^{-1})\|y_{t+1} - y^*(x_t)\|^2 + (1 + \kappa)\|y^*(x_{t+1}) - y^*(x_t)\|^2 \\
&\le (1 - \kappa^{-2})\|y_t - y^*(x_t)\|^2 + \kappa^2(1 + \kappa)\|x_{t+1} - x_t\|^2, \quad (21)
\end{aligned}$$
where the second step uses eq. (20) and the $\kappa$-Lipschitz continuity of $y^*$. Adding eqs. (19) & (21), we obtain
$$\Phi(x_{t+1}) + g(x_{t+1}) \le \Phi(x_t) + g(x_t) - \Big(\frac{1}{2\eta_x} - \frac{L(1+\kappa)}{2} - \frac{L^2\kappa^2}{2} - \kappa^2(1+\kappa)\Big)\|x_{t+1} - x_t\|^2 + \Big(1 - \frac{1}{2\kappa^2}\Big)\|y^*(x_t) - y_t\|^2 - \|y^*(x_{t+1}) - y_{t+1}\|^2.$$
Rearranging the inequality above and recalling the definition of the Lyapunov function $H(z) := \Phi(x) + g(x) + (1 - \frac{1}{4\kappa^2})\|y - y^*(x)\|^2$, we have
$$H(z_{t+1}) \le H(z_t) - \Big(\frac{1}{2\eta_x} - \frac{L(1+\kappa)}{2} - \frac{L^2\kappa^2}{2} - \kappa^2(1+\kappa)\Big)\|x_{t+1} - x_t\|^2 - \frac{1}{4\kappa^2}\big(\|y^*(x_t) - y_t\|^2 + \|y^*(x_{t+1}) - y_{t+1}\|^2\big). \quad (22)$$
When $\eta_x \le \kappa^{-3}(L+3)^{-2}$, using $\kappa \ge 1$ yields that
$$\frac{1}{2\eta_x} - \frac{L(1+\kappa)}{2} - \frac{L^2\kappa^2}{2} - \kappa^2(1+\kappa) \ge \frac{1}{2}\kappa^3(L+3)^2 - \frac{L}{2}(2\kappa)\kappa^2 - \frac{L^2\kappa^3}{2} - \kappa^2(2\kappa) = \frac{1}{2}\kappa^3\big[(L+3)^2 - 2L - L^2 - 4\big] = \frac{1}{2}\kappa^3(4L+5) > 2. \quad (23)$$
As a result, eq. (4) follows by substituting eq. (23) into eq. (22).
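The monotone decrease (4) of the Lyapunov function can be sanity-checked numerically. The sketch below assumes a toy quadratic that is strongly concave in y, with g = h = 0 (so the proximal steps reduce to gradient steps); this test function and its constants are illustrative assumptions, not the paper's general setting.

```python
import numpy as np

# Toy problem: f(x, y) = 0.5*x**2 + x*y - y**2, with g = h = 0.
# Then y*(x) = x/2 and Phi(x) = max_y f(x, y) = 0.75*x**2.
L = max(abs(np.linalg.eigvalsh(np.array([[1.0, 1.0], [1.0, -2.0]]))))  # joint smoothness
mu = 2.0           # strong-concavity modulus in y
kappa = L / mu     # condition number used in the Lyapunov weight

def H(x, y):
    # Lyapunov function H(z) = Phi(x) + g(x) + (1 - 1/(4*kappa**2)) * ||y - y*(x)||^2
    return 0.75 * x**2 + (1.0 - 1.0 / (4.0 * kappa**2)) * (y - 0.5 * x) ** 2

eta_x = 1.0 / (kappa**3 * (L + 3.0) ** 2)  # step sizes satisfying Proposition 2
eta_y = 1.0 / L
x, y = 1.0, -1.0
values = [H(x, y)]
for _ in range(200):
    x_new = x - eta_x * (x + y)        # gradient step on x (proximal step with g = 0)
    y = y + eta_y * (x - 2.0 * y)      # ascent step on y, evaluated at the old x_t
    x = x_new
    values.append(H(x, y))
diffs = np.diff(values)  # should be nonpositive: H(z_{t+1}) <= H(z_t)
```

On this toy instance, every step decreases H, in agreement with eq. (4).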

C PROOF OF COROLLARY 1

Corollary 1. Based on Proposition 2, the sequences $\{x_t, y_t\}_t$ generated by proximal-GDA satisfy
$$\lim_{t\to\infty}\|x_{t+1} - x_t\| = 0, \quad \lim_{t\to\infty}\|y_{t+1} - y_t\| = 0, \quad \lim_{t\to\infty}\|y_t - y^*(x_t)\| = 0.$$
Proof. To prove the first and third items of Corollary 1, we sum the inequality of Proposition 2 over $t = 0, 1, \dots, T-1$ and obtain that for all $T \ge 1$,
$$\sum_{t=0}^{T-1}\Big[2\|x_{t+1} - x_t\|^2 + \frac{1}{4\kappa^2}\big(\|y_{t+1} - y^*(x_{t+1})\|^2 + \|y_t - y^*(x_t)\|^2\big)\Big] \le H(z_0) - H(z_T) \le H(z_0) - \big(\Phi(x_T) + g(x_T)\big) \le H(z_0) - \inf_{x\in\mathbb{R}^m}\big(\Phi(x) + g(x)\big) < +\infty.$$
Letting $T \to \infty$, we conclude that
$$\sum_{t=0}^{\infty}\Big[2\|x_{t+1} - x_t\|^2 + \frac{1}{4\kappa^2}\big(\|y_{t+1} - y^*(x_{t+1})\|^2 + \|y_t - y^*(x_t)\|^2\big)\Big] < +\infty.$$
Therefore, we must have $\lim_{t\to\infty}\|x_{t+1} - x_t\| = \lim_{t\to\infty}\|y_t - y^*(x_t)\| = 0$. To prove the second item, note that
$$\|y_{t+1} - y_t\| \le \|y_{t+1} - y^*(x_t)\| + \|y_t - y^*(x_t)\| \overset{\text{eq. (20)}}{\le} \big(\sqrt{1 - \kappa^{-1}} + 1\big)\|y_t - y^*(x_t)\| \overset{t}{\to} 0.$$

D PROOF OF THEOREM 1

Theorem 1 (Global convergence). Let Assumption 1 hold and choose the learning rates $\eta_x \le \frac{1}{\kappa^3(L+3)^2}$, $\eta_y \le \frac{1}{L}$. Then, proximal-GDA satisfies the following properties. 1. The function value sequence $\{(\Phi + g)(x_t)\}_t$ converges to a finite limit $H^* > -\infty$; 2. The sequences $\{x_t\}_t, \{y_t\}_t$ are bounded and have compact sets of limit points. Moreover, $(\Phi + g)(x^*) \equiv H^*$ for any limit point $x^*$ of $\{x_t\}_t$; 3. Every limit point of $\{x_t\}_t$ is a critical point of $(\Phi + g)(x)$. Proof. We first prove some useful results on the Lyapunov function $H(z)$. By Assumption 1 we know that $\Phi + g$ is bounded below and has compact sub-level sets, and we first show that $H(z)$ also satisfies these conditions. First, note that $H(z) = \Phi(x) + g(x) + (1 - \frac{1}{4\kappa^2})\|y - y^*(x)\|^2 \ge \Phi(x) + g(x)$. Taking the infimum over $x, y$ on both sides, we obtain that $\inf_{x,y} H(z) \ge \inf_x \Phi(x) + g(x) > -\infty$. This shows that $H(z)$ is bounded below. Second, consider the sub-level set $Z_\alpha := \{z = (x, y) : H(z) \le \alpha\}$ for any $\alpha \in \mathbb{R}$, which is equivalent to $\{(x, y) : \Phi(x) + g(x) + (1 - \frac{1}{4\kappa^2})\|y - y^*(x)\|^2 \le \alpha\}$. For any point $(x, y) \in Z_\alpha$, the $x$ part is included in the compact set $\{x : \Phi(x) + g(x) \le \alpha\}$, so the $x$ part of $Z_\alpha$ is bounded. The $y$ part of $Z_\alpha$ is also bounded, as it lies in a sub-level set of the coercive function $\|y - y^*(x)\|^2$ with $x$ ranging over a bounded set. Hence, we have shown that $H(z)$ is bounded below and has compact sub-level sets. We next show that $\{(\Phi + g)(x_t)\}_t$ has a finite limit. We have shown in Proposition 2 that $\{H(z_t)\}_t$ is monotonically decreasing. Since $H(z)$ is bounded below, we conclude that $\{H(z_t)\}_t$ has a finite limit $H^* > -\infty$, i.e., $\lim_{t\to\infty}\big[(\Phi + g)(x_t) + (1 - \frac{1}{4\kappa^2})\|y_t - y^*(x_t)\|^2\big] = H^*$. Moreover, since $\|y_t - y^*(x_t)\| \overset{t}{\to} 0$, we further conclude that $\lim_{t\to\infty}(\Phi + g)(x_t) = H^*$. Next, we prove the second item. Since $\{H(z_t)\}_t$ is monotonically decreasing and $H(z)$ has compact sub-level sets, we conclude that $\{x_t\}_t, \{y_t\}_t$ are bounded and hence have compact sets of limit points.
Next, we derive a bound on the subdifferential. By the optimality condition of the proximal gradient update of $x_t$ and the summation rule of subdifferentials in Corollary 1.12.2 of Kruger (2003), we have
$$0 \in \partial g(x_{t+1}) + \frac{1}{\eta_x}\big(x_{t+1} - x_t + \eta_x\nabla_1 f(x_t, y_t)\big).$$
Then, we obtain that
$$\frac{1}{\eta_x}(x_t - x_{t+1}) - \nabla_1 f(x_t, y_t) + \nabla\Phi(x_{t+1}) \in \partial(\Phi + g)(x_{t+1}), \quad (24)$$
which further implies that
$$\begin{aligned}
\mathrm{dist}_{\partial(\Phi+g)(x_{t+1})}(0) &\le \frac{1}{\eta_x}\|x_{t+1} - x_t\| + \|\nabla_1 f(x_t, y_t) - \nabla\Phi(x_{t+1})\| \\
&= \frac{1}{\eta_x}\|x_{t+1} - x_t\| + \|\nabla_1 f(x_t, y_t) - \nabla_1 f(x_{t+1}, y^*(x_{t+1}))\| \\
&\le \frac{1}{\eta_x}\|x_{t+1} - x_t\| + L\big(\|x_{t+1} - x_t\| + \|y^*(x_{t+1}) - y_t\|\big) \\
&\le \Big(\frac{1}{\eta_x} + L\Big)\|x_{t+1} - x_t\| + L\big(\|y^*(x_{t+1}) - y^*(x_t)\| + \|y^*(x_t) - y_t\|\big) \\
&\le \Big(\frac{1}{\eta_x} + L(1+\kappa)\Big)\|x_{t+1} - x_t\| + L\|y^*(x_t) - y_t\|. \quad (25)
\end{aligned}$$
Since we have shown that $\|x_{t+1} - x_t\| \overset{t}{\to} 0$ and $\|y^*(x_t) - y_t\| \overset{t}{\to} 0$, we conclude from the above inequality that $\mathrm{dist}_{\partial(\Phi+g)(x_t)}(0) \overset{t}{\to} 0$. Therefore, we have shown that
$$\frac{1}{\eta_x}(x_{t-1} - x_t) - \nabla_1 f(x_{t-1}, y_{t-1}) + \nabla\Phi(x_t) \in \partial(\Phi + g)(x_t), \quad \frac{1}{\eta_x}(x_{t-1} - x_t) - \nabla_1 f(x_{t-1}, y_{t-1}) + \nabla\Phi(x_t) \overset{t}{\to} 0.$$
Now consider any limit point $x^*$ of $\{x_t\}_t$, so that $x_{t(j)} \overset{j}{\to} x^*$ along a subsequence. By the proximal update of $x_{t(j)}$, we have
$$g(x_{t(j)}) + \frac{1}{2\eta_x}\|x_{t(j)} - x_{t(j)-1}\|^2 + \langle x_{t(j)} - x_{t(j)-1}, \nabla_1 f(x_{t(j)-1}, y_{t(j)-1})\rangle \le g(x^*) + \frac{1}{2\eta_x}\|x^* - x_{t(j)-1}\|^2 + \langle x^* - x_{t(j)-1}, \nabla_1 f(x_{t(j)-1}, y_{t(j)-1})\rangle.$$
Taking the limsup on both sides of the above inequality and noting that $\{x_t\}_t, \{y_t\}_t$ are bounded, $\nabla f$ is Lipschitz, $\|x_{t+1} - x_t\| \overset{t}{\to} 0$ and $x_{t(j)} \to x^*$, we conclude that $\limsup_j g(x_{t(j)}) \le g(x^*)$. Since $g$ is lower-semicontinuous, we know that $\liminf_j g(x_{t(j)}) \ge g(x^*)$. Combining these two inequalities yields that $\lim_j g(x_{t(j)}) = g(x^*)$. By continuity of $\Phi$, we further conclude that $\lim_j (\Phi + g)(x_{t(j)}) = (\Phi + g)(x^*)$. Since we have shown that the entire sequence $\{(\Phi + g)(x_t)\}_t$ converges to the finite limit $H^*$, we conclude that $(\Phi + g)(x^*) \equiv H^*$ for all limit points $x^*$ of $\{x_t\}_t$. Next, we prove the third item.
To summarize, we have shown that for every subsequence $x_{t(j)} \overset{j}{\to} x^*$, we have $(\Phi + g)(x_{t(j)}) \overset{j}{\to} (\Phi + g)(x^*)$, and there exists $u_t \in \partial(\Phi + g)(x_t)$ such that $u_t \overset{t}{\to} 0$ (by eq. (25)). Recalling the definition of the limiting sub-differential, we conclude that every limit point $x^*$ of $\{x_t\}_t$ is a critical point of $(\Phi + g)(x)$, i.e., $0 \in \partial(\Phi + g)(x^*)$.

E PROOF OF THEOREM 2

Theorem 2 (Variable convergence). Let Assumptions 1 & 2 hold and assume that $H$ has the KŁ geometry. Choose the learning rates $\eta_x \le \frac{1}{\kappa^3(L+3)^2}$ and $\eta_y \le \frac{1}{L}$. Then, the sequence $\{(x_t, y_t)\}_t$ generated by proximal-GDA converges to a certain critical point $(x^*, y^*(x^*))$ of $(\Phi + g)(x)$, i.e., $x_t \overset{t}{\to} x^*$, $y_t \overset{t}{\to} y^*(x^*)$. Proof. We first derive a bound on $\partial H(z)$. Recall that $H(z) = \Phi(x) + g(x) + (1 - \frac{1}{4\kappa^2})\|y - y^*(x)\|^2$, and that $\|y^*(x) - y\|^2$ has a non-empty subdifferential $\partial_x(\|y^*(x) - y\|^2)$. We therefore have
$$\partial_x H(z) \supset \partial(\Phi + g)(x) + \Big(1 - \frac{1}{4\kappa^2}\Big)\partial_x\big(\|y^*(x) - y\|^2\big), \quad \nabla_y H(z) = -\Big(2 - \frac{1}{2\kappa^2}\Big)\big(y^*(x) - y\big),$$
where the first inclusion follows from the scalar multiplication rule and sum rule of sub-differentials; see Propositions 1.11 & 1.12 of Kruger (2003). Next, we derive upper bounds on these sub-differentials. Based on Definition 1, we can take any $u \in \hat\partial_x(\|y^*(x) - y\|^2)$ and obtain that
$$\begin{aligned}
0 &\le \liminf_{z \ne x, z \to x} \frac{\|y^*(z) - y\|^2 - \|y^*(x) - y\|^2 - u^\top(z - x)}{\|z - x\|} \\
&\le \liminf_{z \ne x, z \to x} \frac{[y^*(z) - y^*(x)]^\top[y^*(z) + y^*(x) - 2y] - u^\top(z - x)}{\|z - x\|} \\
&\le \liminf_{z \ne x, z \to x} \frac{\|y^*(z) - y^*(x)\|\,\|y^*(z) + y^*(x) - 2y\| - u^\top(z - x)}{\|z - x\|} \\
&\overset{(i)}{\le} \liminf_{z \ne x, z \to x} \Big[\kappa\|y^*(z) + y^*(x) - 2y\| - \frac{u^\top(z - x)}{\|z - x\|}\Big] \\
&\overset{(ii)}{=} 2\kappa\|y^*(x) - y\| - \limsup_{z \ne x, z \to x} \frac{u^\top(z - x)}{\|z - x\|} \overset{(iii)}{=} 2\kappa\|y^*(x) - y\| - \|u\|,
\end{aligned}$$
where (i) and (ii) use the fact that $y^*$ is $\kappa$-Lipschitz based on Proposition 1, and the limsup in (iii) is achieved by letting $z = x + \sigma u$ with $\sigma \to 0^+$ in (ii). Hence, we conclude that $\|u\| \le 2\kappa\|y^*(x) - y\|$. Since $\partial_x(\|y^*(x) - y\|^2)$ is the graphical closure of $\hat\partial_x(\|y^*(x) - y\|^2)$, we have that $\mathrm{dist}_{\partial_x(\|y^*(x) - y\|^2)}(0) \le 2\kappa\|y^*(x) - y\|$. Then, utilizing the characterization of $\partial(\Phi + g)(x)$ in eq.
(24), we obtain that
$$\begin{aligned}
\mathrm{dist}_{\partial H(z_{t+1})}(0) &\le \mathrm{dist}_{\partial_x H(z_{t+1})}(0) + \|\nabla_y H(z_{t+1})\| \\
&\le \mathrm{dist}_{\partial(\Phi+g)(x_{t+1})}(0) + \Big(1 - \frac{1}{4\kappa^2}\Big)\mathrm{dist}_{\partial_x(\|y^*(x_{t+1}) - y_{t+1}\|^2)}(0) + \Big(2 - \frac{1}{2\kappa^2}\Big)\|y^*(x_{t+1}) - y_{t+1}\| \\
&\le \frac{1}{\eta_x}\|x_{t+1} - x_t\| + \|\nabla_1 f(x_t, y_t) - \nabla\Phi(x_{t+1})\| + 2(1 + \kappa)\|y^*(x_{t+1}) - y_{t+1}\| \\
&\overset{(i)}{\le} \Big(\frac{1}{\eta_x} + L\Big)\|x_{t+1} - x_t\| + L\|y^*(x_{t+1}) - y_t\| + 2(1 + \kappa)\|y^*(x_{t+1}) - y_{t+1}\| \\
&\overset{(ii)}{\le} \Big(\frac{1}{\eta_x} + L(1 + \kappa)\Big)\|x_{t+1} - x_t\| + L\|y^*(x_t) - y_t\| + 2(1 + \kappa)\big(\sqrt{1 - \kappa^{-2}}\,\|y^*(x_t) - y_t\| + \kappa\sqrt{1 + \kappa}\,\|x_{t+1} - x_t\|\big) \\
&\overset{(iii)}{\le} \Big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\Big)\|x_{t+1} - x_t\| + (L + 4\kappa)\|y^*(x_t) - y_t\|, \quad (27)
\end{aligned}$$
where (i) uses Proposition 1 (i.e., $\nabla\Phi(x_{t+1}) = \nabla_1 f(x_{t+1}, y^*(x_{t+1}))$) and the Lipschitz continuity of $\nabla_1 f$, (ii) uses eq. (21), the $\kappa$-Lipschitz continuity of $y^*$ and the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ ($a, b \ge 0$), and (iii) uses $\kappa \ge 1$. Next, we prove the convergence of the sequence under the assumption that $H(z)$ is a KŁ function. Recall that we have shown in the proof of Theorem 1 that: 1) $\{H(z_t)\}_t$ decreases monotonically to the finite limit $H^*$; 2) for any limit points $x^*, y^*$ of $\{x_t\}_t, \{y_t\}_t$, $H(x^*, y^*)$ takes the constant value $H^*$. Hence, the KŁ inequality (see Definition 2) holds after a sufficiently large number of iterations, i.e., there exists $t_0 \in \mathbb{N}_+$ such that for all $t \ge t_0$,
$$\varphi'\big(H(z_t) - H^*\big)\,\mathrm{dist}_{\partial H(z_t)}(0) \ge 1. \quad (28)$$
Rearranging the above inequality and utilizing eq. (27), we obtain that for all $t \ge t_0$,
$$\varphi'\big(H(z_t) - H^*\big) \ge \frac{1}{\mathrm{dist}_{\partial H(z_t)}(0)} \ge \Big[\Big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\Big)\|x_t - x_{t-1}\| + (L + 4\kappa)\|y^*(x_{t-1}) - y_{t-1}\|\Big]^{-1}.$$
By concavity of the function $\varphi$ (see Definition 2), we know that
$$\begin{aligned}
\varphi\big(H(z_t) - H^*\big) - \varphi\big(H(z_{t+1}) - H^*\big) &\ge \varphi'\big(H(z_t) - H^*\big)\big(H(z_t) - H(z_{t+1})\big) \\
&\overset{(i)}{\ge} \frac{\|x_{t+1} - x_t\|^2 + \frac{1}{4\kappa^2}\|y_t - y^*(x_t)\|^2}{\big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\big)\|x_t - x_{t-1}\| + (L + 4\kappa)\|y^*(x_{t-1}) - y_{t-1}\|} \quad (29) \\
&\overset{(ii)}{\ge} \frac{\frac{1}{2}\big(\|x_{t+1} - x_t\| + \frac{1}{2\kappa}\|y_t - y^*(x_t)\|\big)^2}{\big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\big)\|x_t - x_{t-1}\| + (L + 4\kappa)\|y^*(x_{t-1}) - y_{t-1}\|},
\end{aligned}$$
where (i) uses Proposition 2 and eq. (28), and (ii) uses the inequality $a^2 + b^2 \ge \frac{1}{2}(a + b)^2$.
Rearranging the above inequality yields that
$$\Big(\|x_{t+1} - x_t\| + \frac{1}{2\kappa}\|y_t - y^*(x_t)\|\Big)^2 \le 2\big[\varphi(H(z_t) - H^*) - \varphi(H(z_{t+1}) - H^*)\big]\Big[\Big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\Big)\|x_t - x_{t-1}\| + (L + 4\kappa)\|y^*(x_{t-1}) - y_{t-1}\|\Big].$$
Taking the square root of both sides, applying the inequality $2ab \le (Ca + \frac{b}{C})^2$ for any $a, b \ge 0$ and $C > 0$ (the value of $C$ will be assigned later), and telescoping over $t = t_0, \dots, T - 1$, we obtain that
$$\begin{aligned}
\sum_{t=t_0}^{T-1}\|x_{t+1} - x_t\| + \frac{1}{2\kappa}\sum_{t=t_0}^{T-1}\|y_t - y^*(x_t)\| &\le C\varphi\big[H(z_{t_0}) - H^*\big] - C\varphi\big[H(z_T) - H^*\big] + \frac{1}{C}\Big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\Big)\sum_{t=t_0-1}^{T-2}\|x_{t+1} - x_t\| + \frac{1}{C}(L + 4\kappa)\sum_{t=t_0-1}^{T-2}\|y^*(x_t) - y_t\| \\
&\le \frac{Cc}{\theta}\big[H(z_{t_0}) - H^*\big]^\theta + \frac{1}{C}\Big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\Big)\sum_{t=t_0-1}^{T-2}\|x_{t+1} - x_t\| + \frac{1}{C}(L + 4\kappa)\sum_{t=t_0-1}^{T-2}\|y^*(x_t) - y_t\|,
\end{aligned}$$
where the final step uses $\varphi(s) = \frac{c}{\theta}s^\theta$ and the fact that $H(z_T) - H^* \ge 0$. Since the value of $C > 0$ is arbitrary, we can select $C$ large enough such that $\frac{1}{C}\big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\big) < \frac{1}{2}$ and $\frac{1}{C}(L + 4\kappa) < \frac{1}{2\kappa}$. Hence, the inequality above further implies that
$$\frac{1}{2}\sum_{t=t_0}^{T-1}\|x_{t+1} - x_t\| \le \frac{Cc}{\theta}\big[H(z_{t_0}) - H^*\big]^\theta + \frac{1}{2}\|x_{t_0} - x_{t_0-1}\| + \frac{1}{2\kappa}\|y^*(x_{t_0-1}) - y_{t_0-1}\| < +\infty.$$
Letting $T \to \infty$, we conclude that $\sum_{t=1}^{\infty}\|x_{t+1} - x_t\| < +\infty$. In particular, $\{x_t\}_t$ is a Cauchy sequence and therefore converges to a certain limit, i.e., $x_t \overset{t}{\to} x^*$. We have shown in Theorem 1 that any such limit point must be a critical point of $\Phi + g$. Hence, $\{x_t\}_t$ converges to a certain critical point $x^*$ of $(\Phi + g)(x)$. Also, note that $\|y^*(x_t) - y_t\| \overset{t}{\to} 0$, $x_t \overset{t}{\to} x^*$ and $y^*$ is a Lipschitz mapping, so we conclude that $\{y_t\}_t$ converges to $y^*(x^*)$.
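The finite-length property $\sum_t \|x_{t+1} - x_t\| < +\infty$ that drives the Cauchy argument above can be observed numerically. This is a sketch on the same kind of toy quadratic used for illustration earlier; the objective and step sizes are assumptions, not the paper's general setting.

```python
import numpy as np

# Illustrative quadratic: f(x, y) = 0.5*x**2 + x*y - y**2, with g = h = 0.
x, y = 1.0, -1.0
eta_x, eta_y = 0.02, 0.4
increments = []
for _ in range(400):
    x_new = x - eta_x * (x + y)        # gradient step on x
    y = y + eta_y * (x - 2.0 * y)      # ascent step on y at the old x_t
    increments.append(abs(x_new - x))  # ||x_{t+1} - x_t||
    x = x_new
partial_sums = np.cumsum(increments)
# The partial sums plateau: the iterate path has finite length, so {x_t} is Cauchy.
```

The tail of the series becomes negligible, which is exactly the summability used to conclude variable convergence.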

F PROOF OF THEOREM 3

Theorem 3 (Function value convergence rate). Under the same conditions as those of Theorem 2, the Lyapunov function value sequence $\{H(z_t)\}_t$ converges to the limit $H^*$ at the following rates.
1. If the KŁ geometry holds with $\theta = 1$, then $H(z_t) \downarrow H^*$ within a finite number of iterations;
2. If the KŁ geometry holds with $\theta \in (\frac{1}{2}, 1)$, then $H(z_t) \downarrow H^*$ super-linearly as
$$H(z_t) - H^* \le (2Mc^2)^{-\frac{1}{2\theta-1}}\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_0}\Big), \quad \forall t \ge t_0; \quad (6)$$
3. If the KŁ geometry holds with $\theta = \frac{1}{2}$, then $H(z_t) \downarrow H^*$ linearly as
$$H(z_t) - H^* \le \Big(1 + \frac{1}{2Mc^2}\Big)^{t_0-t}\big(H(z_{t_0}) - H^*\big), \quad \forall t \ge t_0; \quad (7)$$
Proof. (Case 1) If $\theta = 1$, then based on the first case of Appendix F, $H(z_t) \equiv H^*$ after a finite number of iterations. Hence, for large enough $t$, Proposition 2 yields that
$$2\|x_{t+1} - x_t\|^2 + \frac{1}{4\kappa^2}\big(\|y_{t+1} - y^*(x_{t+1})\|^2 + \|y_t - y^*(x_t)\|^2\big) \le H(z_t) - H(z_{t+1}) = 0,$$
which implies that $x_{t+1} = x_t$ and $y_t = y^*(x_t)$ for large enough $t$. Hence, $x_t \to x^*$ and $y_t \to y^*(x^*)$ within a finite number of iterations.
(Case 2) If $\theta \in (\frac{1}{2}, 1)$, denote $A_t = \|x_{t+1} - x_t\| + \frac{1}{2\kappa}\|y_t - y^*(x_t)\|$. Then, based on the definition of $M$ in eq. (5), we have
$$\Big(\frac{1}{\eta_x} + (L + 4\kappa^2)(1 + \kappa)\Big)\|x_t - x_{t-1}\| + (L + 4\kappa)\|y^*(x_{t-1}) - y_{t-1}\| \le \sqrt{2M}A_{t-1}.$$
Hence, eqs. (28) & (40) and $\varphi'(s) = cs^{\theta-1}$ imply that $c(H(z_t) - H^*)^{\theta-1} \ge (\sqrt{2M}A_{t-1})^{-1}$, which along with $\theta - 1 < 0$ implies
$$H(z_t) - H^* \le (c\sqrt{2M}A_{t-1})^{\frac{1}{1-\theta}}. \quad (41)$$
Then, eqs. (29) & (40) imply that
$$\varphi(H(z_t) - H^*) - \varphi(H(z_{t+1}) - H^*) \ge \frac{\|x_{t+1} - x_t\|^2 + \frac{1}{4\kappa^2}\|y_t - y^*(x_t)\|^2}{2\sqrt{2M}A_{t-1}}.$$
Using the inequality $a^2 + b^2 \ge \frac{1}{2}(a + b)^2$ and recalling the definition of $A_t$ and $\varphi(s) = \frac{c}{\theta}s^\theta$, the above inequality further implies that
$$\frac{c}{\theta}(H(z_t) - H^*)^\theta - \frac{c}{\theta}(H(z_{t+1}) - H^*)^\theta \ge \frac{A_t^2}{4\sqrt{2M}A_{t-1}}. \quad (42)$$
Substituting eq. (41) into eq. (42) and using $H(z_{t+1}) - H^* \ge 0$ yield that $A_t^2 \le \frac{4}{\theta}(c\sqrt{2M}A_{t-1})^{\frac{1}{1-\theta}}$, which is equivalent to $C_1 A_t \le (C_1 A_{t-1})^{\frac{1}{2(1-\theta)}}$ (43), where $C_1 = (4/\theta)^{\frac{1-\theta}{2\theta-1}}(c\sqrt{2M})^{\frac{1}{2\theta-1}}$. Note that eq. (43) holds for $t \ge t_0$.
Since $A_t \to 0$, there exists $t_1 \ge t_0$ such that $C_1 A_{t_1} \le e^{-1}$. Hence, by iterating eq. (43) from $t = t_1 + 1$, we obtain
$$C_1 A_t \le \exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big), \quad \forall t \ge t_1 + 1.$$
Hence, for any $t \ge t_1 + 1$,
$$\begin{aligned}
\sum_{s=t}^{\infty} A_s &\le \frac{1}{C_1}\sum_{s=t}^{\infty}\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{s-t_1}\Big) \\
&= \frac{1}{C_1}\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big)\sum_{s=t}^{\infty}\exp\Big(\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1} - \big(\tfrac{1}{2(1-\theta)}\big)^{s-t_1}\Big) \\
&= \frac{1}{C_1}\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big)\sum_{s=t}^{\infty}\exp\Big(\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big[1 - \big(\tfrac{1}{2(1-\theta)}\big)^{s-t}\Big]\Big) \\
&\overset{(i)}{\le} \frac{1}{C_1}\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big)\sum_{s=t}^{\infty}\exp\Big(1 - \big(\tfrac{1}{2(1-\theta)}\big)^{s-t}\Big) \\
&= \frac{1}{C_1}\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big)\sum_{s=0}^{\infty}\exp\Big(1 - \big(\tfrac{1}{2(1-\theta)}\big)^{s}\Big) \overset{(ii)}{\le} \mathcal{O}\Big(\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big)\Big), \quad (44)
\end{aligned}$$
where (i) uses the inequalities $\frac{1}{2(1-\theta)} > 1$ and $s \ge t \ge t_1 + 1$, and (ii) uses the fact that $\sum_{s=0}^{\infty}\exp\big(1 - (\frac{1}{2(1-\theta)})^s\big) < +\infty$ is a positive constant independent of $t$. Therefore, the convergence rate (9) can be directly derived as follows:
$$\|x_t - x^*\| = \limsup_{T\to\infty}\|x_t - x_T\| \le \limsup_{T\to\infty}\sum_{s=t}^{T-1}\|x_{s+1} - x_s\| \le \limsup_{T\to\infty}\sum_{s=t}^{T-1}A_s \le \mathcal{O}\Big(\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big)\Big), \quad (45)$$
and
$$\|y_t - y^*(x^*)\| \le \|y_t - y^*(x_t)\| + \|y^*(x_t) - y^*(x^*)\| \overset{(i)}{\le} 2\kappa A_t + \kappa\|x_t - x^*\| \le 2\kappa\sum_{s=t}^{\infty}A_s + \kappa\|x_t - x^*\| \overset{(ii)}{\le} \mathcal{O}\Big(\exp\Big(-\big(\tfrac{1}{2(1-\theta)}\big)^{t-t_1}\Big)\Big),$$
where (i) uses the Lipschitz property of $y^*$ in Proposition 1, and (ii) uses eqs. (44) & (45).
(Cases 3 & 4) Notice that eq. (42) still holds if $\theta \in (0, \frac{1}{2}]$. Hence, if $A_t \ge \frac{1}{2}A_{t-1}$, then eq. (42) implies that
$$A_t \le \frac{8c\sqrt{2M}}{\theta}\big[(H(z_t) - H^*)^\theta - (H(z_{t+1}) - H^*)^\theta\big].$$
Otherwise, $A_t \le \frac{1}{2}A_{t-1}$. Combining these two inequalities yields that
$$A_t \le \frac{8c\sqrt{2M}}{\theta}\big[(H(z_t) - H^*)^\theta - (H(z_{t+1}) - H^*)^\theta\big] + \frac{1}{2}A_{t-1}.$$
Notice that the inequality above holds whenever $t \ge t_0$. Hence, telescoping the inequality above yields
$$\sum_{s=t}^{T}A_s \le \frac{8c\sqrt{2M}}{\theta}\big[(H(z_t) - H^*)^\theta - (H(z_{T+1}) - H^*)^\theta\big] + \frac{1}{2}\sum_{s=t-1}^{T-1}A_s, \quad \forall t \ge t_0,$$
which along with $A_T \ge 0$ and $H(z_{T+1}) - H^* \ge 0$ implies eq. (47). (Case 3) If $\theta = 1/2$, eq. (7) holds. Substituting eq. (7) and $\theta = 1/2$ into eq.
(47) yields that: if $\frac{1}{4} + \frac{1}{8Mc^2} \ge 1$, then
$$\sum_{s=t_0+1}^{t}\Big(\frac{1}{4} + \frac{1}{8Mc^2}\Big)^{-s/2} \le t - t_0;$$
if instead $\frac{1}{4} + \frac{1}{8Mc^2} < 1$, then
$$\sum_{s=t_0+1}^{t}\Big(\frac{1}{4} + \frac{1}{8Mc^2}\Big)^{-s/2} = \Big(\frac{1}{4} + \frac{1}{8Mc^2}\Big)^{-t/2}\frac{1 - \big(\frac{1}{4} + \frac{1}{8Mc^2}\big)^{(t-t_0)/2}}{1 - \big(\frac{1}{4} + \frac{1}{8Mc^2}\big)^{1/2}} \le \mathcal{O}\Big(\Big(\frac{1}{4} + \frac{1}{8Mc^2}\Big)^{-t/2}\Big).$$
Since one of the two above inequalities always holds, combining them yields that
$$\sum_{s=t_0+1}^{t}\Big(\frac{1}{4} + \frac{1}{8Mc^2}\Big)^{-s/2} \le \mathcal{O}\Big(\max\Big(t - t_0, \Big(\frac{1}{4} + \frac{1}{8Mc^2}\Big)^{-t/2}\Big)\Big).$$
Substituting the above inequality into eq. (48) yields that
$$S_t \le \Big(\frac{1}{2}\Big)^{t-t_0}S_{t_0} + \mathcal{O}\Big(\max\Big(2^{-t}(t - t_0), \Big(1 + \frac{1}{2Mc^2}\Big)^{-t/2}\Big)\Big) \le \mathcal{O}\Big(\min\Big(2, 1 + \frac{1}{2Mc^2}\Big)^{-t/2}\Big).$$
Hence,
$$\|x_t - x^*\| \overset{(i)}{\le} \sum_{s=t}^{\infty}A_s = S_t \le \mathcal{O}\Big(\min\Big(2, 1 + \frac{1}{2Mc^2}\Big)^{-t/2}\Big),$$
where (i) comes from eq. (45). Then,
$$\|y_t - y^*(x^*)\| \le \|y_t - y^*(x_t)\| + \|y^*(x_t) - y^*(x^*)\| \le 2\kappa A_t + \kappa\|x_t - x^*\| \le 2\kappa S_t + \kappa\|x_t - x^*\| \le \mathcal{O}\Big(\min\Big(2, 1 + \frac{1}{2Mc^2}\Big)^{-t/2}\Big).$$
The two above inequalities yield the linear convergence rate (10).
(Case 4) If $\theta \in (0, \frac{1}{2})$, then eq. (8) holds. Substituting eq. (8) into eq. (47) yields that for some constant $C_3 > 0$,
$$\begin{aligned}
S_t &\le \Big(\frac{1}{2}\Big)^{t-t_0}S_{t_0} + \frac{8c\sqrt{2M}}{\theta}\sum_{s=t_0+1}^{t}C_3\Big(\frac{1}{2}\Big)^{t-s}(s - t_0)^{-\frac{\theta}{1-2\theta}} \\
&\le \Big(\frac{1}{2}\Big)^{t-t_0}S_{t_0} + \frac{8cC_3\sqrt{2M}}{2^{t-t_0}\theta}\sum_{s=1}^{t-t_0}2^{s}s^{-\frac{\theta}{1-2\theta}} \\
&\overset{(i)}{=} \Big(\frac{1}{2}\Big)^{t-t_0}S_{t_0} + \frac{8cC_3\sqrt{2M}}{2^{t-t_0}\theta}\sum_{s=1}^{t_1}2^{s}s^{-\frac{\theta}{1-2\theta}} + \frac{8cC_3\sqrt{2M}}{2^{t-t_0}\theta}\sum_{s=t_1+1}^{t-t_0}2^{s}s^{-\frac{\theta}{1-2\theta}} \\
&\le \Big(\frac{1}{2}\Big)^{t-t_0}S_{t_0} + \frac{8cC_3\sqrt{2M}}{2^{t-t_0}\theta}\sum_{s=1}^{t_1}2^{s} + \frac{8cC_3\sqrt{2M}}{2^{t-t_0}\theta}\sum_{s=t_1+1}^{t-t_0}2^{s}\Big(\frac{t - t_0}{2}\Big)^{-\frac{\theta}{1-2\theta}} \\
&\overset{(ii)}{\le} \Big(\frac{1}{2}\Big)^{t-t_0}S_{t_0} + \frac{8cC_3\sqrt{2M}}{2^{t-t_0}\theta}2^{t_1+1} + \frac{8cC_3\sqrt{2M}}{2^{t-t_0}\theta}\Big(\frac{t - t_0}{2}\Big)^{-\frac{\theta}{1-2\theta}}2^{t-t_0+1} \\
&= \mathcal{O}\Big(\Big(\frac{1}{2}\Big)^{t-t_0} + \Big(\frac{1}{2}\Big)^{(t-t_0)/2} + (t - t_0)^{-\frac{\theta}{1-2\theta}}\Big) = \mathcal{O}\big((t - t_0)^{-\frac{\theta}{1-2\theta}}\big),
\end{aligned}$$
where (i) denotes $t_1 = \lfloor(t - t_0)/2\rfloor$, and (ii) uses the inequality $\sum_{s=t_1+1}^{t-t_0}2^s < \sum_{s=0}^{t-t_0}2^s < 2^{t-t_0+1}$. Therefore, the sub-linear convergence rate eq. (11) follows from the following inequalities:
$$\|x_t - x^*\| \le S_t \le \mathcal{O}\big((t - t_0)^{-\frac{\theta}{1-2\theta}}\big),$$
and
$$\|y_t - y^*(x^*)\| \le \|y_t - y^*(x_t)\| + \|y^*(x_t) - y^*(x^*)\| \le 2\kappa A_t + \kappa\|x_t - x^*\| \le 2\kappa S_t + \kappa\|x_t - x^*\| \le \mathcal{O}\big((t - t_0)^{-\frac{\theta}{1-2\theta}}\big).$$
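For a strongly-convex-strongly-concave quadratic, which satisfies the KŁ property with exponent θ = 1/2, the linear contraction of H(z_t) − H* predicted by Case 3 can be checked numerically. The toy objective and constants below are illustrative assumptions.

```python
import numpy as np

# Illustrative quadratic f(x, y) = 0.5*x**2 + x*y - y**2 (a KL function with
# theta = 1/2). For this toy instance H* = 0, so the gap is H(z_t) itself.
kappa = 2.3028 / 2.0  # L / mu for this quadratic (L ≈ 2.3028, mu = 2)

def H(x, y):
    # Lyapunov function with y*(x) = x/2 and Phi(x) = 0.75*x**2
    return 0.75 * x**2 + (1.0 - 1.0 / (4.0 * kappa**2)) * (y - 0.5 * x) ** 2

x, y = 1.0, -1.0
eta_x, eta_y = 0.02, 0.4
gaps = []
for _ in range(300):
    x_new = x - eta_x * (x + y)
    y = y + eta_y * (x - 2.0 * y)
    x = x_new
    gaps.append(H(x, y))
# Per-step contraction factors of the gap, skipping a short transient.
ratios = np.array(gaps[51:]) / np.array(gaps[50:-1])
```

After the transient, the contraction factor settles at a constant strictly below 1, i.e., the gap decays linearly (geometrically) as in eq. (7).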



studied multi-step GDA, where multiple gradient ascent steps are performed per iteration, and they also studied the momentum-accelerated version. Cherukuri et al. (2017); Daskalakis and Panageas (2018); Jin et al. (2020) studied GDA in continuous-time dynamics using differential equations. Adolphs et al. (2019) analyzed a second-order variant of the GDA algorithm. Stochastic GDA algorithms: Lin et al. (2020); Yang et al. (2020); Boţ and Böhm (2020) analyzed stochastic GDA, stochastic AGDA and stochastic APGDA, which are direct extensions of GDA, AGDA and APGDA to the stochastic setting, respectively. Variance reduction techniques have been applied to stochastic minimax optimization, including SVRG-based Du and Hu (2019); Yang et al. (2020), SPIDER-based Xu et al. (2020a), STORM Qiu et al. (2020) and its gradient-free version Huang et al. (2020). Xie et al. (2020) studied the complexity lower bound of first-order stochastic algorithms for finite-sum minimax problems.

Convergence rates of proximal-GDA under different parameterizations of the KŁ geometry. Note that $t_0$ denotes a sufficiently large positive integer.

Letting $t = t_0$ and $T \to \infty$ in the above inequality yields that $\sum_{s=t_0}^{\infty} A_s < +\infty$. Hence, by letting $T \to \infty$ and denoting $S_t = \sum_{s=t}^{\infty} A_s$, we obtain eq. (47).

$$S_t \le \Big(\frac{1}{2}\Big)^{t-t_0}S_{t_0} + \frac{8c\sqrt{2M}}{\theta}\big[H(z_{t_0}) - H^*\big]^\theta$$

ACKNOWLEDGEMENT

The work of T. Xu and Y. Liang was supported partially by the U.S. National Science Foundation under the grants CCF-1900145 and CCF-1909291. 

SUPPLEMENTARY MATERIAL

A PROOF OF PROPOSITION 1

Proposition 1 (Lipschitz continuity of $y^*(x)$ and $\nabla\Phi(x)$). Let Assumption 1 hold. Then, the mapping $y^*(x)$ and the function $\Phi(x)$ satisfy: 1. The mapping $y^*(x)$ is $\kappa$-Lipschitz continuous; 2. The function $\Phi(x)$ is $L(1+\kappa)$-smooth with $\nabla\Phi(x) = \nabla_1 f(x, y^*(x))$.
Proof. We first prove item 1. Since $f(x, y)$ is strongly concave in $y$ for every $x$ and $h(y)$ is convex, the mapping $y^*(x) = \arg\max_{y} f(x, y) - h(y)$ is uniquely defined. We first show that $y^*(x)$ is a Lipschitz mapping. Consider two arbitrary points $x_1, x_2$ and the optimality conditions of $y^*(x_1)$ and $y^*(x_2)$ in eqs. (12) and (13). Setting $y = y^*(x_2)$ in eq. (12), $y = y^*(x_1)$ in eq. (13) and summing up the two inequalities, we obtain an inequality involving the subgradients $u_1, u_2$ of $h$. Since $\partial h$ is a monotone operator (by convexity), we know that $\langle u_2 - u_1, y^*(x_2) - y^*(x_1)\rangle \ge 0$. Hence, adding up the resulting inequalities yields that $\mu\|y^*(x_1) - y^*(x_2)\|^2 \le L\|x_1 - x_2\|\,\|y^*(x_1) - y^*(x_2)\|$. The above inequality shows that $\|y^*(x_1) - y^*(x_2)\| \le \kappa\|x_2 - x_1\|$, and item 1 is proved. Next, we prove item 2. Based on Corollary 10.1.1 of Rockafellar (1970), $h(y)$ is continuous on its domain. Therefore, based on the Danskin theorem Bernhard and Rapaport (1995), the function $\Phi$ is differentiable with $\nabla\Phi(x) = \nabla_1 f(x, y^*(x))$.
4. If the KŁ geometry holds with $\theta \in (0, \frac{1}{2})$, then $H(z_t) \downarrow H^*$ sub-linearly as in eq. (8), where $M$ is defined in eq. (5).
Proof. Note that eq. (27) implies a bound on $\mathrm{dist}_{\partial H(z_t)}(0)$. Recall that we have shown that the KŁ property holds for all $t \ge t_0$; throughout the rest of the proof, we assume $t \ge t_0$. Substituting eq. (30) into the above bound yields an inequality whose second step uses the definition of $M$ in eq. (5). Substituting eq. (4) and $\varphi'(s) = cs^{\theta-1}$ ($c > 0$) into eq. (31) and rearranging, we further obtain a recursion on the optimality gap. Defining $d_t = H(z_t) - H^*$, the above inequality further becomes eq. (32). Next, we prove the convergence rates case by case.
(Case 1) If $\theta = 1$, then eq. (32) implies that $d_{t-1} - d_t \ge \frac{1}{2Mc^2} > 0$ whenever $d_t > 0$.
Hence, $d_t$ achieves $0$ (i.e., $H(z_t)$ achieves $H^*$) within a finite number of iterations.
(Case 2) If $\theta \in (\frac{1}{2}, 1)$, since $d_t \ge 0$, eq. (32) implies a recursion which, iterated from a sufficiently large $t_1 \in \mathbb{N}_+$ with $t_1 \ge t_0$, shows that for $t \ge t_1$ the gap contracts at a doubly geometric rate. Note that $\theta \in (\frac{1}{2}, 1)$ implies that $\frac{1}{2(1-\theta)} > 1$, and thus the inequality above implies that $H(z_t) \downarrow H^*$ at the super-linear rate given by eq. (6).
(Case 3) If $\theta = \frac{1}{2}$, eq. (32) implies that $d_t \le \big(1 + \frac{1}{2Mc^2}\big)^{-1}d_{t-1}$. Therefore, $d_t \downarrow 0$ (i.e., $H(z_t) \downarrow H^*$) at the linear rate given by eq. (7).
(Case 4) If $\theta \in (0, \frac{1}{2})$, consider the two subcases in the display above, where (i) uses $d_t \le d_{t-1}$ and $-2(1-\theta) < -1$, and (ii) uses eq. (32). Combining the subcases and telescoping yields a bound on $d_t$; by substituting the definition of $\psi$, the resulting inequality implies that $H(z_t) \downarrow H^*$ at the sub-linear rate given by eq. (8).
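The κ-Lipschitz continuity of the best-response map $y^*(\cdot)$ in Proposition 1 can also be checked numerically by solving the inner maximization with gradient ascent. The toy objective below, with cross-Lipschitz constant 1 and strong-concavity modulus 2 (hence a predicted $\frac{1}{2}$-Lipschitz best response), is an illustrative assumption.

```python
import numpy as np

def f_grad_y(x, y):
    # Partial derivative in y of the toy objective f(x, y) = sin(x)*y - y**2,
    # which is 2-strongly concave in y with best response y*(x) = sin(x)/2.
    return np.sin(x) - 2.0 * y

def best_response(x, iters=200, eta=0.4):
    # Inner gradient ascent; the map y -> y + eta*(sin(x) - 2y) is a contraction,
    # so this converges to y*(x) = sin(x)/2.
    y = 0.0
    for _ in range(iters):
        y = y + eta * f_grad_y(x, y)
    return y

# Empirical check of ||y*(x1) - y*(x2)|| <= kappa * ||x1 - x2|| with kappa = 1/2.
xs = np.linspace(-3.0, 3.0, 61)
ys = np.array([best_response(x) for x in xs])
slopes = np.abs(np.diff(ys) / np.diff(xs))  # finite-difference Lipschitz estimates
```

All finite-difference slopes stay below the predicted Lipschitz constant, in line with item 1 of Proposition 1.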

G PROOF OF THEOREM 4

Theorem 4 (Variable convergence rate). Under the same conditions as those of Theorem 2, the sequences {x t , y t } t converge to their limits x * , y * (x * ) respectively at the following rates. 

