LOCAL CONVERGENCE ANALYSIS OF GRADIENT DESCENT-ASCENT WITH FINITE TIMESCALE SEPARATION

Abstract

We study the role that a finite timescale separation parameter τ has on gradient descent-ascent in non-convex, non-concave zero-sum games where the learning rate of player 1 is denoted by γ_1 and the learning rate of player 2 is defined to be γ_2 = τγ_1. We provide a non-asymptotic construction of the finite timescale separation parameter τ* such that gradient descent-ascent locally converges to a critical point x* for all τ ∈ (τ*, ∞) if and only if x* is a strict local minmax equilibrium. Moreover, we provide explicit local convergence rates given the finite timescale separation. The convergence results we present are complemented by a non-convergence result: given a critical point x* that is not a strict local minmax equilibrium, we present a non-asymptotic construction of a finite timescale separation τ_0 such that gradient descent-ascent with timescale separation τ ∈ (τ_0, ∞) does not converge to x*. Finally, we extend the results to gradient penalty regularization methods for generative adversarial networks and empirically demonstrate on CIFAR-10 and CelebA the significant impact timescale separation has on training performance.

1. INTRODUCTION

In this paper we study learning in zero-sum games of the form min_{x_1∈X_1} max_{x_2∈X_2} f(x_1, x_2) where the objective function f of the game is assumed to be sufficiently smooth and potentially non-convex and non-concave in the strategy spaces X_1 and X_2 respectively, with each X_i a precompact subset of R^{n_i}. This general problem formulation has long been fundamental in game theory (Başar & Olsder, 1998) and recently it has become central to machine learning with applications in generative adversarial networks (Goodfellow et al., 2014), robust supervised learning (Madry et al., 2018; Sinha et al., 2018), reinforcement and multi-agent reinforcement learning (Rajeswaran et al., 2020; Zhang et al., 2019), imitation learning (Ho & Ermon, 2016), constrained optimization (Cherukuri et al., 2017), and hyperparameter optimization (Lorraine et al., 2020; MacKay et al., 2019). The gradient descent-ascent learning dynamics are widely studied as a potential method for efficiently computing equilibria in game formulations. However, in zero-sum games, a number of past works highlight problems with this learning dynamic, including both non-convergence to meaningful critical points and convergence to critical points devoid of game-theoretic meaning, where common notions of 'meaningful' equilibria include the local Nash and local minmax (Stackelberg) concepts. For instance, in bilinear games, gradient descent-ascent avoids local Nash and Stackelberg equilibria due to the inherent instability of the update rule for this class. Fortunately, in this class of games, regularization or gradient-based learning dynamics that employ different numerical discretization schemes (as compared to forward Euler for gradient descent-ascent) are known to alleviate this issue (Daskalakis et al., 2018; Mertikopoulos et al., 2019; Zhang & Yu, 2020).
For the more general nonlinear nonconvex-nonconcave class of games, it has been shown that gradient descent-ascent with a shared learning rate is prone to reaching critical points that are neither local Nash equilibria nor local Stackelberg equilibria (Daskalakis & Panageas, 2018; Jin et al., 2020; Mazumdar et al., 2020). While an important negative result, it does not rule out the prospect that gradient descent-ascent may be able to guarantee equilibrium convergence, as it fails to account for a key structural parameter of the dynamics, namely the ratio of learning rates between the players. Motivated by the observation that the order of play between players is fundamental to the definition of the game, the role of timescale separation in gradient descent-ascent has recently been explored theoretically (Chasnov et al., 2019; Heusel et al., 2017; Jin et al., 2020). On the empirical side, it has been widely demonstrated that timescale separation in gradient descent-ascent is crucial to improving the solution quality when training generative adversarial networks (Arjovsky et al., 2017; Goodfellow et al., 2014; Heusel et al., 2017). Denoting γ_1 as the learning rate of player 1, the learning rate of player 2 can be defined as γ_2 = τγ_1 where τ = γ_2/γ_1 > 0 is the learning rate ratio. Toward understanding the effect of timescale separation, Jin et al. (2020) show the locally stable critical points of gradient descent-ascent coincide with the set of strict local minmax/Stackelberg equilibria across the spectrum of sufficiently smooth zero-sum games as τ → ∞. In other words, all 'bad critical points' (critical points lacking game-theoretic meaning) become unstable and all 'good critical points' (game-theoretically meaningful equilibria) remain or become locally exponentially stable (cf. Definition 3) as τ → ∞.
While a promising theoretical development, gradient descent-ascent with a timescale separation approaching infinity does not lead to a practical learning rule, and its analysis does not necessarily provide insights into the common usage of a reasonable finite timescale separation. An important observation is that choosing τ arbitrarily large with the goal of ensuring local equilibrium convergence can lead to numerically ill-conditioned problems. This highlights the significance of understanding the exact range of learning rate ratios that guarantee local stability. Moreover, our experiments in Section 5 (Dirac-GAN) and in Appendix K show that modest values of τ are typically sufficient to guarantee that only equilibria are stable, which allows for larger choices of γ_1 and results in faster convergence to an equilibrium.

Contributions. We show gradient descent-ascent locally converges to a critical point for a range of finite learning rate ratios if and only if the critical point is a strict local Stackelberg equilibrium (Theorem 1). This result is constructive in the sense that we explicitly characterize the exact range of learning rate ratios for which the guarantee holds. Furthermore, we show all other critical points are unstable for a range of finite learning rate ratios that we explicitly construct (Theorem 2). To our knowledge, the aforementioned guarantees are the first of their kind in nonconvex-nonconcave zero-sum games for an implementable first-order method. Moreover, the technical results in this work rely on tools that have not previously appeared in the machine learning and optimization communities analyzing games.
Finally, we extend these results to gradient penalty regularization methods in generative adversarial networks (Theorem 3), thereby providing theoretical guarantees for a common combination of heuristics used in practice, and empirically demonstrate the benefits and trade-offs of regularization and timescale separation on the Dirac-GAN along with image datasets.

2. PRELIMINARIES

A two-player zero-sum continuous game is defined by a collection of costs (f_1, f_2) where f_1 ≡ f and f_2 ≡ −f with f ∈ C^r(X, R) for some r ≥ 2 and where X = X_1 × X_2 with each X_i a precompact subset of R^{n_i} for i ∈ {1, 2} and n = n_1 + n_2. Each player i ∈ I seeks to minimize their cost f_i(x_i, x_{−i}) with respect to their choice variable x_i, where x_{−i} is the vector of all other actions x_j with j ≠ i. We denote D_i f_i as the derivative of f_i with respect to x_i, D_{ij} f_i as the partial derivative of D_i f_i with respect to x_j, and D_i^2 f_i as the partial derivative of D_i f_i with respect to x_i.

Mathematical Notation. Given a matrix A ∈ R^{n_1×n_2}, let vec(A) ∈ R^{n_1 n_2} be its vectorization such that vec(A) takes the rows a_i of A, transposes them, and stacks them vertically in order of their index. Let ⊗ and ⊕ denote the Kronecker product and sum respectively, where A ⊕ B = A ⊗ I + I ⊗ B. Moreover, ¯· is an operator that generates a (1/2)n(n+1) × (1/2)n(n+1) matrix from a matrix A ∈ R^{n×n} such that Ā = H_n^+ (A ⊕ A) H_n where H_n^+ = (H_n^T H_n)^{−1} H_n^T is the (left) pseudo-inverse of H_n, a full column rank duplication matrix. Let λ_max^+(·) be the largest positive real root of its argument if it exists and zero otherwise. See Lancaster & Tismenetsky (1985) and Appendix B for more detail.

Equilibrium. There are natural equilibrium concepts depending on the order of play: the (local) Nash equilibrium concept in the case of simultaneous play and the (local) Stackelberg (equivalently, minmax in zero-sum games) equilibrium concept in the case of hierarchical play (Başar & Olsder, 1998). Formal local equilibrium definitions are provided in Appendix B, while here we characterize the different equilibrium notions in terms of sufficient conditions on player costs as is typical in the machine learning and optimization literature (see, e.g., Berard et al. 2020; Daskalakis & Panageas 2018; Fiez et al. 2020; Goodfellow 2016; Jin et al. 2020; Mazumdar et al.
2020; Wang et al. 2020). The following definition is characterized by sufficient conditions for a local Nash equilibrium.

Definition 1 (Differential Nash Equilibrium, Ratliff et al. 2013). The joint strategy x ∈ X is a differential Nash equilibrium if D_1 f(x) = 0, −D_2 f(x) = 0, D_1^2 f(x) > 0, and D_2^2 f(x) < 0.

The Jacobian of the vector of individual gradients g(x) = (D_1 f(x), −D_2 f(x)) is defined by

J(x) = [  D_1^2 f(x)       D_{12} f(x)
         −D_{12} f(x)^T   −D_2^2 f(x) ].   (1)

Let S_1(·) denote the Schur complement of (·) with respect to the n_2 × n_2 block in (·). The following definition is characterized by sufficient conditions for a local Stackelberg equilibrium.

Definition 2 (Differential Stackelberg Equilibrium, Fiez et al. 2020). The joint strategy x ∈ X is a differential Stackelberg equilibrium if D_1 f(x) = 0, −D_2 f(x) = 0, S_1(J(x)) > 0, and D_2^2 f(x) < 0.

Learning Dynamics. We study agents seeking equilibria of the game via a learning algorithm and consider arguably the most natural learning rule in zero-sum continuous games: gradient descent-ascent (GDA). Moreover, we investigate this learning rule with timescale separation between the players. Let τ = γ_2/γ_1 be the learning rate ratio and define Λ_τ = blockdiag(I_{n_1}, τI_{n_2}) where I_{n_i} is an n_i × n_i identity matrix. The τ-GDA dynamics with g(x) = (D_1 f(x), −D_2 f(x)) are given by x_{k+1} = x_k − γ_1 Λ_τ g(x_k).
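The conditions in Definitions 1 and 2 are directly checkable from the second-order blocks of the cost. Below is a minimal NumPy sketch on two illustrative scalar quadratic games (the cost choices are hypothetical examples, not from the text):

```python
import numpy as np

def classify(D11, D12, D22):
    """Classify a critical point of a zero-sum game from its second-order
    blocks D11 = D_1^2 f, D12 = D_12 f, D22 = D_2^2 f (Definitions 1 and 2).
    Returns (is_differential_nash, is_differential_stackelberg)."""
    pd = lambda M: bool(np.all(np.linalg.eigvalsh(M) > 0))  # positive definite
    S1 = D11 - D12 @ np.linalg.inv(D22) @ D12.T             # Schur complement S_1(J)
    return pd(D11) and pd(-D22), pd(S1) and pd(-D22)

# f(x1, x2) = x1^2 + 4 x1 x2 - x2^2: a differential Nash (hence Stackelberg) point.
print(classify(np.array([[2.0]]), np.array([[4.0]]), np.array([[-2.0]])))   # (True, True)
# f(x1, x2) = -x1^2/2 + 2 x1 x2 - x2^2: Stackelberg but not Nash, since
# D_1^2 f = -1 < 0 while S_1(J) = -1 + 2 = 1 > 0.
print(classify(np.array([[-1.0]]), np.array([[2.0]]), np.array([[-2.0]])))  # (False, True)
```

Note that every differential Nash equilibrium is a differential Stackelberg equilibrium, since S_1(J) = D_1^2 f − D_{12}f (D_2^2 f)^{−1} D_{12}f^T adds a positive semi-definite term to D_1^2 f when D_2^2 f < 0.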

3. STABILITY OF CONTINUOUS TIME GDA WITH TIMESCALE SEPARATION

To characterize the convergence of τ-GDA, we begin by studying its continuous time limiting system

ẋ = −Λ_τ g(x).   (3)

The Jacobian of the system in (3) is given by J_τ(x) = Λ_τ J(x) where J(x) is defined in (1). Observe that critical points (x* such that g(x*) = 0) are shared between τ-GDA and (3). Thus, by analyzing the stability of the continuous time system around critical points as a function of the timescale separation τ using the Jacobian J_τ(x*), we can draw conclusions about the stability and convergence of the discrete time system τ-GDA. It is well known that a critical point is locally (exponentially) stable when the spectrum of −J_τ(x*) is in the open left-half complex plane C°− (cf. Theorem B.1, Appendix B). Throughout, we use the broader term "stable" to mean the following.

Definition 3. A critical point x* is locally exponentially stable for ẋ = −Λ_τ g(x) if and only if spec(−J_τ(x*)) ⊂ C°− (or, equivalently, spec(J_τ(x*)) ⊂ C°+) where C°− and C°+ denote the open left-half and right-half complex plane, respectively.

We now show differential Stackelberg equilibria are the only critical points that are stable for a range of finite learning rate ratios, whereas the remaining critical points are unstable for a range of finite learning rate ratios. Importantly, we characterize the learning rate ratios for which the results hold.

3.1. NECESSARY AND SUFFICIENT CONDITIONS FOR STABILITY

To motivate our main stability result, the following example shows the existence of a differential Stackelberg equilibrium which is unstable for τ = 1, but is stable for all τ ∈ (τ*, ∞) where τ* is finite.

Example 1. Consider the quadratic zero-sum game defined by the cost f(x_1, x_2) = (v/2)(−x_{11}^2 + (1/2)x_{12}^2 − 2x_{11}x_{21} − (1/2)x_{21}^2 + x_{12}x_{22} − x_{22}^2) where v > 0 and x_1, x_2 ∈ R^2. The unique critical point x* = (0, 0) is a differential Stackelberg equilibrium since g(x*) = 0, S_1(J(x*)) = diag(v, 3v/4) > 0, and D_2^2 f(x*) = −diag(v/2, v) < 0. Moreover, spec(−J_τ(x*)) = {−(v/4)(2τ + 1 ± √(4τ^2 − 8τ + 1)), −(v/4)(τ − 2 ± √(τ^2 − 12τ + 4))}. Observe that for any v > 0, x* is unstable for τ = 1 since spec(−J_τ(x*)) ⊄ C°−, but x* is stable for a range of learning rate ratios since spec(−J_τ(x*)) ⊂ C°− for all τ ∈ (2, ∞). In other words, GDA fails to converge to the equilibrium but a finite timescale separation is sufficient to remedy this problem.

We now fully characterize this phenomenon. To provide some background, we remark it is known that the spectrum of −J_τ(x*) asymptotically splits as τ → ∞ such that n_1 eigenvalues tend to fixed positions defined by the eigenvalues of −S_1(J(x*)), while the remaining n_2 eigenvalues tend to infinity at a linear rate in τ along asymptotes defined by the eigenvalues of D_2^2 f(x*). This result is known from Klimushchev & Krasovskii (1961) and further discussion can be found in Appendix I as well as in Kokotovic et al. (1986, Chap. 2, Thm. 3.1). The previous fact is specialized from the class of singularly perturbed linear systems to τ-GDA by Jin et al. (2020), which directly results in the connection between critical points of ∞-GDA and differential Stackelberg equilibria. Specifically, Jin et al. (2020) show that for the class of all sufficiently smooth games the stable critical points of ∞-GDA are exactly the strict local minmax equilibria.
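Example 1 can be verified numerically. A minimal NumPy sketch (taking v = 1, an illustrative choice) builds J from the second-order blocks of the cost and inspects spec(−J_τ(x*)):

```python
import numpy as np

# Second-order blocks of the Example 1 cost at x* = (0, 0) with v = 1.
D11 = np.diag([-1.0, 0.5])   # D_1^2 f(x*)
D12 = np.diag([-1.0, 0.5])   # D_12 f(x*)
D22 = np.diag([-0.5, -1.0])  # D_2^2 f(x*)
J = np.block([[D11, D12], [-D12.T, -D22]])  # Jacobian of g from Eq. (1)

def max_real_part(tau):
    """Largest real part of spec(-J_tau(x*)) for the limiting ODE of tau-GDA."""
    Lam = np.diag([1.0, 1.0, tau, tau])
    return np.linalg.eigvals(-Lam @ J).real.max()

print(max_real_part(1.0) > 0)  # True: unstable under vanilla GDA (tau = 1)
print(max_real_part(3.0) < 0)  # True: stable once tau > tau* = 2
```

At the boundary τ = 2 the largest real part is exactly zero (a purely imaginary eigenvalue pair), matching the claimed stability interval (2, ∞).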
As a corollary of this fact, there exists a τ_1 < ∞ such that x* is stable under τ-GDA for all τ > τ_1 (Kokotovic et al., 1986, Chap. 2, Cor. 3.1); this can be inferred from the proof of Theorem 28 in Jin et al. (2020) as well. Indeed, Jin et al. (2020) give an asymptotic expansion showing that n_1 eigenvalues of −J_τ(x*) are in spec(−S_1(J(x*))) + O(τ^{−1}) and the remaining n_2 eigenvalues are in τ(spec(D_2^2 f(x*)) + O(τ^{−1})). Using the limit definition for the asymptotic expansion, for any fixed game and a strict local minmax x*, one can show that there exists a finite τ such that x* is stable. We provide a detailed discussion of the relationship between the results of Jin et al. (2020) and Kokotovic et al. (1986) in Appendices A, I, and J. Unfortunately, the finite τ_1 obtainable from the asymptotic expansion method can be arbitrarily large. From a practical perspective, this poses significant problems for the implementation and performance of τ-GDA. Indeed, the eigenvalue gap between spec(−S_1(J(x*))) and spec(τD_2^2 f(x*)) has a linear dependence on τ and, in turn, the problem may become highly ill-conditioned from a numerical perspective as τ becomes large (Kokotovic, 1975). In contrast, we determine exactly the range of τ such that the spectrum of −J_τ(x) remains in C°−, and hence remedy this problem.

For the statement of the following theorem on the non-asymptotic construction of τ*, we define the following matrices: for a critical point x*, write

−J_τ(x*) = [ −D_1^2 f(x*)        −D_{12} f(x*)
              τD_{12} f(x*)^T     τD_2^2 f(x*) ] = [  A_{11}        A_{12}
                                                      −τA_{12}^T    τA_{22} ]

and let S_1 = S_1(−J_τ(x*)) = A_{11} + A_{12} A_{22}^{−1} A_{12}^T (= −S_1(J(x*))), which does not depend on τ.

Theorem 1 (Non-Asymptotic Construction of Necessary and Sufficient Conditions for Stability). Consider a zero-sum game (f_1, f_2) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2. Suppose that x* is such that g(x*) = 0 and det(D_2^2 f(x*)) ≠ 0.
There exists a τ* ∈ [0, ∞) such that spec(−J_τ(x*)) ⊂ C°− for all τ ∈ (τ*, ∞) if and only if x* is a differential Stackelberg equilibrium. Moreover, τ* = λ_max^+(Q) where

Q = 2[(A_{12} ⊗ A_{22}^{−1})H_{n_2} Ā_{22}^{−1} H_{n_2}^+ (A_{12}^T ⊗ I_{n_2}) − (I_{n_1} ⊗ A_{22}^{−1}A_{12}^T)H_{n_1} S̄_1^{−1} H_{n_1}^+ (S_1 ⊗ A_{12}A_{22}^{−1})] − (A_{11} ⊗ A_{22}^{−1})

with Ā_{22} and S̄_1 obtained by applying the operator ¯· of Section 2 to A_{22} and S_1.

While at first glance Q may appear difficult to understand, it is efficiently computable and can be used to understand the typical value of τ* for important classes of games. Indeed, many problems like generative adversarial networks have specific structure in the individual Hessians of each player and the interaction matrix D_{12} f (cf. Assumption 1, Section 3.3) and are in a sense subject to design via network architecture and loss function selection. This result opens up an interesting future direction of research on understanding and potentially designing the structure of Q. To take a step in this direction, we explore a number of games in Section 5 and Appendix K where we compute τ* by the construction and validate empirically that it is tight. Along the way, we discover that τ* is typically a reasonable value that is amenable to practical implementations. As a direct consequence of Theorem 1, τ-GDA converges locally asymptotically for any sufficiently small learning rate γ_1 (depending on τ) and for all τ ∈ (τ*, ∞).

Proof Sketch of Theorem 1. The full proof is contained in Appendix C. The key tools used in this proof are a combination of Lyapunov stability and the notion of a guard map (Saydy et al., 1990), a tool new to the learning community. Recall that x* is stable if and only if there exists a symmetric positive definite P = P^T > 0 such that PJ_τ(x*) + J_τ(x*)^T P > 0 (cf. Theorem B.1, Appendix B).
Hence, given a positive definite Q = Q^T > 0, −J_τ(x*) is stable if and only if there exists a unique solution P = P^T to

((J_τ(x*)^T ⊗ I) + (I ⊗ J_τ(x*)^T))vec(P) = (J_τ(x*)^T ⊕ J_τ(x*)^T)vec(P) = vec(Q)   (4)

where ⊗ and ⊕ denote the Kronecker product and Kronecker sum, respectively. The existence of a unique solution P occurs if and only if J_τ and −J_τ have no eigenvalues in common. Hence, using the fact that eigenvalues vary continuously, if we vary τ and examine the eigenvalues of the map J_τ(x*) ⊕ J_τ(x*), this tells us the range of τ for which spec(−J_τ(x*)) remains in C°−. This method of varying parameters and determining when the roots of a polynomial (or, correspondingly, the eigenvalues of a map) cross the boundary of a domain uses a guard map; it provides a certificate that the roots of a polynomial lie in a particular guarded domain for a range of parameter values. Formally, let X be the set of all n × n real matrices or the set of all polynomials of degree n with real coefficients. Consider S an open subset of X with closure S̄ and boundary ∂S. The map ν : X → C is said to be a guardian map for S if for all x ∈ S̄, ν(x) = 0 ⟺ x ∈ ∂S. Elements of S(C°−) = {A ∈ R^{n×n} : spec(A) ⊂ C°−} are (Hurwitz) stable. Given a pathwise connected set U ⊆ R, the parameterized family {A(τ) : τ ∈ U} is stable if and only if (i) it is nominally stable, meaning A(τ_1) ∈ S(C°−) for some τ_1 ∈ U, and (ii) ν(A(τ)) ≠ 0 for all τ ∈ U (Saydy et al., 1990, Prop. 1). The map ν(τ) = det(−(J_τ(x*) ⊕ J_τ(x*))), which vanishes together with det(2(−J_τ(x*) ⊙ I)) and det(−J_τ(x*)) where ⊙ is the bialternate product (Govaerts, 2000, Sec. 4.4.4), guards S(C°−). For intuition, consider the case where each x_1, x_2 ∈ R so that J_τ(x*) = [ a  b; −τb  τd ] ∈ R^{2×2}.
It is known that spec(−J_τ(x*)) ⊂ C°− if and only if det(−J_τ(x*)) > 0 and tr(−J_τ(x*)) < 0, so that ν(τ) = det(−J_τ(x*)) tr(−J_τ(x*)) is a guard map for the 2×2 stable matrices S(C°−). Since the bialternate product generalizes the trace operator and det(−J_τ(x*)) = τ^{n_2} det(D_2^2 f(x*)) det(−S_1(J(x*))) ≠ 0 for τ ≠ 0 by the facts det(S_1(J(x*))) ≠ 0 and det(D_2^2 f(x*)) ≠ 0 for a differential Stackelberg equilibrium x*, a guard map in the general n × n case is ν(τ) = det(−(J_τ(x*) ⊕ J_τ(x*))). This guard map in τ is closely related to the vectorization in (4): for any symmetric positive definite Q = Q^T > 0, there is a unique symmetric solution P = P^T of −(J_τ(x*) ⊕ J_τ(x*))vec(P) = vec(−Q) if and only if det(−(J_τ(x*) ⊕ J_τ(x*))) ≠ 0. Hence, to find the range of τ for which, given any Q = Q^T > 0, the solution P = P^T remains positive definite, we need to find the values of τ such that ν(τ) = det(−(J_τ(x*) ⊕ J_τ(x*))) = 0, that is, where the spectrum hits the boundary ∂S(C°−). Through algebraic manipulation, this problem reduces to an eigenvalue problem in τ, giving rise to an explicit construction of τ*.
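Both ingredients of the proof sketch can be illustrated numerically on Example 1 (taking v = 1): the Kronecker-sum coefficient matrix from (4) becomes singular exactly at τ* = 2, and the Lyapunov solution P is positive definite precisely on the stable side. A minimal sketch (the row-major vec convention here is an implementation choice):

```python
import numpy as np

# Example 1 (v = 1): J from Eq. (1); Theorem 1 gives tau* = 2 here.
D11 = np.diag([-1.0, 0.5]); D12 = np.diag([-1.0, 0.5]); D22 = np.diag([-0.5, -1.0])
J = np.block([[D11, D12], [-D12.T, -D22]])
n = 4; I = np.eye(n)

def kron_sum(tau):
    """Coefficient matrix of the vectorized Lyapunov equation (4):
    J_tau^T (+) J_tau^T, under row-major vec."""
    A = (np.diag([1.0, 1.0, tau, tau]) @ J).T
    return np.kron(A, I) + np.kron(I, A)

def lyapunov_P(tau):
    """Unique symmetric P with P J_tau + J_tau^T P = I, when it exists."""
    P = np.linalg.solve(kron_sum(tau), I.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against round-off

# The guard map det(-(J_tau (+) J_tau)) vanishes where the spectrum touches the
# imaginary axis; here the Kronecker sum is (numerically) singular at tau = 2.
print(np.linalg.svd(kron_sum(2.0), compute_uv=False).min())  # ~ 0
# The Lyapunov certificate P > 0 holds precisely on the stable side tau > 2.
print(np.linalg.eigvalsh(lyapunov_P(3.0)).min() > 0)  # True
print(np.linalg.eigvalsh(lyapunov_P(1.0)).min() > 0)  # False
```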

3.2. SUFFICIENT CONDITIONS FOR INSTABILITY

To motivate our main instability result, the following example shows a non-equilibrium critical point that is stable for τ = 1, but is unstable for all τ ∈ (τ_0, ∞) where τ_0 is finite.

Example 2. Consider the quadratic zero-sum game defined by the cost f(x_1, x_2) = (v/4)(x_{11}^2 − (1/2)x_{12}^2 + 2x_{11}x_{21} + (1/2)x_{21}^2 + 2x_{12}x_{22} − x_{22}^2) where x_1, x_2 ∈ R^2 and v > 0. The unique critical point x* = (0, 0) is not a differential Stackelberg (nor Nash) equilibrium since D_1^2 f(x*) = diag(v/2, −v/4) ≯ 0 and D_2^2 f(x*) = diag(v/4, −v/2) ≮ 0. Moreover, spec(−J_τ(x*)) = {−(v/8)(2τ − 1 ± √(4τ^2 − 12τ + 1)), −(v/8)(2 − τ ± √(τ^2 − 12τ + 4))}. Observe that for any v > 0, x* is stable for τ = 1 since spec(−J_τ(x*)) ⊂ C°−, but x* is unstable for a range of learning rate ratios since spec(−J_τ(x*)) ⊄ C°− for all τ ∈ (2, ∞). This is not an artifact of the quadratic example: games can be constructed in which stable critical points lacking game-theoretic meaning become unstable for all τ > τ_0 even in the presence of multiple equilibria. This example demonstrates that a finite timescale separation can prevent convergence to critical points lacking game-theoretic meaning. We now characterize this behavior generally.

Note that Theorem 1 implies that for any critical point which is not a differential Stackelberg equilibrium, there is no finite τ* such that spec(−J_τ(x*)) ⊂ C°− for all τ ∈ (τ*, ∞). In particular, there exists at least one finite, positive value of τ such that spec(−J_τ(x*)) ⊄ C°−. We can extend this result to answer the question of whether there exists a finite learning rate ratio τ_0 such that −J_τ(x*) has at least one eigenvalue with strictly positive real part for all τ ∈ (τ_0, ∞), thereby implying that x* is unstable.

Theorem 2 (Non-Asymptotic Construction of Sufficient Condition for Instability). Consider a zero-sum game (f_1, f_2) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2.
Suppose that x* is such that g(x*) = 0, det(D_2^2 f(x*)) ≠ 0, and x* is not a differential Stackelberg equilibrium. Then spec(−J_τ(x*)) ⊄ C°− for all τ ∈ (τ_0, ∞) with

τ_0 = λ_max^+(Q_2^{−1}((P_1 D_{12} f(x*) + S_1(−J(x*))L_0^T P_2)^T Q_1^{−1}(P_1 D_{12} f(x*) + S_1(−J(x*))L_0^T P_2) − P_2 L_0 D_{12} f(x*) − (P_2 L_0 D_{12} f(x*))^T))

where P_1, P_2, Q_1, Q_2 are any non-singular Hermitian matrices such that (a) Q_i > 0 for each i = 1, 2, (b) S_1(−J(x*))^T P_1 + P_1 S_1(−J(x*)) = Q_1 and (D_2^2 f(x*))^T P_2 + P_2 D_2^2 f(x*) = Q_2, and (c) the following matrix pairs have the same inertia: (P_1, S_1(−J(x*))) and (P_2, D_2^2 f(x*)).

Proof Sketch. The full proof is provided in Appendix D. The key idea is to leverage the Lyapunov equation and Lemma B.3 to show that −J_τ(x*) has at least one eigenvalue with strictly positive real part. Indeed, Lemma B.3 states that if S_1(−J(x*)) has no eigenvalues with zero real part, then there exist matrices P_1 = P_1^T and Q_1 = Q_1^T > 0 such that P_1 S_1(−J(x*)) + S_1(−J(x*))^T P_1 = Q_1 where P_1 and S_1(−J(x*)) have the same inertia, that is, the numbers of eigenvalues with positive, negative, and zero real parts, respectively, are the same. An analogous statement applies to D_2^2 f(x*) with some P_2 and Q_2. Since x* is a non-equilibrium critical point, without loss of generality, let S_1(−J(x*)) have at least one eigenvalue with strictly positive real part so that P_1 does as well. Next, we construct a matrix P that is congruent to blockdiag(P_1, P_2) and a matrix Q_τ such that P(−J_τ(x*)) + (−J_τ(x*))^T P = Q_τ. Since P and blockdiag(P_1, P_2) are congruent, Sylvester's law of inertia implies that they have the same numbers of positive, negative, and zero eigenvalues, respectively. Hence, P has at least one strictly positive eigenvalue. We then construct τ_0 via an eigenvalue problem such that for all τ > τ_0, Q_τ > 0.
Applying Lemma B.3 again, for any τ > τ_0, −J_τ(x*) has at least one eigenvalue with strictly positive real part so that spec(−J_τ(x*)) ⊄ C°−.
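Example 2 and the inertia argument above can be checked numerically. A minimal sketch with v = 1 (the Lyapunov right-hand side Q = I is an illustrative choice):

```python
import numpy as np

# Second-order blocks of the Example 2 cost at x* = (0, 0) with v = 1.
D11 = np.diag([0.5, -0.25]); D12 = np.diag([0.5, 0.5]); D22 = np.diag([0.25, -0.5])
J = np.block([[D11, D12], [-D12.T, -D22]])
n = 4; I = np.eye(n)

def neg_J_tau(tau):
    return -np.diag([1.0, 1.0, tau, tau]) @ J

# Stable at tau = 1, unstable once tau > tau_0 = 2, as claimed in Example 2.
print(np.linalg.eigvals(neg_J_tau(1.0)).real.max() < 0)  # True
print(np.linalg.eigvals(neg_J_tau(3.0)).real.max() > 0)  # True

def inertia(M):
    """(# eigenvalues with positive real part, # with negative real part)."""
    re = np.linalg.eigvals(M).real
    return int((re > 0).sum()), int((re < 0).sum())

def lyapunov_solution(M, Q):
    """Symmetric P solving M^T P + P M = Q via vectorization (row-major vec)."""
    K = np.kron(M.T, I) + np.kron(I, M.T)
    P = np.linalg.solve(K, Q.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)

# Lemma B.3-style certificate: the P solving M^T P + P M = I > 0 shares the
# inertia of M, exposing the unstable eigenvalues of -J_tau at tau = 3.
M = neg_J_tau(3.0)
print(inertia(M), inertia(lyapunov_solution(M, I)))  # matching inertia
```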

3.3. REGULARIZATION WITH APPLICATIONS TO ADVERSARIAL LEARNING

In this section, we focus on generative adversarial networks with regularization and, using the theory developed so far, extend the results to provide a stability guarantee for a range of regularization parameters and learning rate ratios. Consider the training objective f(θ, ω) = E_{p(z)}[ℓ(D(G(z; θ); ω))] + E_{p_D(x)}[ℓ(−D(x; ω))] where D(·; ω) and G(·; θ) are the discriminator and generator networks, p_D(x) is the data distribution, p(z) is the latent distribution, and ℓ ∈ C^2(R) is some real-valued function. Nagarajan & Kolter (2017) show, under suitable assumptions, that gradient-based methods for training generative adversarial networks are locally convergent assuming the data distributions are absolutely continuous. However, as observed by Mescheder et al. (2018), not only may such assumptions fail to be satisfied in many practical generative adversarial network training scenarios such as natural images, but often the data distribution is concentrated on a lower dimensional manifold. The latter characteristic leads to highly ill-conditioned problems and nearly purely imaginary eigenvalues. Gradient penalties ensure that the discriminator cannot create a non-zero gradient orthogonal to the data manifold without suffering a loss. Introduced by Roth et al. (2017) and refined in Mescheder et al. (2018), we consider training generative adversarial networks with one of two fairly natural gradient penalties used to regularize the discriminator: R_1(θ, ω) = (µ/2)E_{p_D(x)}[‖∇_x D(x; ω)‖^2] and R_2(θ, ω) = (µ/2)E_{p_θ(x)}[‖∇_x D(x; ω)‖^2], where, by a slight abuse of notation, ∇_x(·) denotes the partial gradient with respect to x of the argument (·) when the argument is the discriminator D(·; ω), in order to prevent any conflation with the notation D(·) used elsewhere for derivatives. Let h_1(θ) = E_{p_θ(x)}[∇_ω D(x; ω)|_{ω=ω*}] and h_2(ω) = E_{p_D(x)}[|D(x; ω)|^2 + ‖∇_x D(x; ω)‖^2].
Define the reparameterization manifolds M_G = {θ : p_θ = p_D} and M_D = {ω : h_2(ω) = 0}. Assumption 1, stated in full in the appendix, requires in particular that: (b) M_G ∩ B_ε(θ*) and M_D ∩ B_ε(ω*) define C^1-manifolds; and (c) (i) if w ∉ T_{θ*}M_G, then w^T ∇_θ h_1(θ*) ≠ 0, and (ii) if v ∉ T_{ω*}M_D, then v^T ∇_ω^2 h_2(ω*) v ≠ 0. We note that, as explained by Mescheder et al. (2018), Assumption 1.c(i) implies that the discriminator is capable of detecting deviations from the generator distribution in equilibrium, and Assumption 1.c(ii) implies that the manifold M_D is sufficiently regular and, in particular, its (local) geometry is captured by the second (directional) derivative of h_2.

Theorem 3. Consider training a generative adversarial network via a zero-sum game with generator network G_θ, discriminator network D_ω, and loss f(θ, ω) with regularization R_j(θ, ω) (for either j = 1 or j = 2) and any regularization parameter µ ∈ (0, ∞) such that Assumption 1 is satisfied for an equilibrium x* = (θ*, ω*) of the regularized dynamics. Then, x* = (θ*, ω*) is a differential Stackelberg equilibrium. Furthermore, for any τ ∈ (0, ∞), spec(−J_{(τ,µ)}(x*)) ⊂ C°−.
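As a minimal numerical illustration of the τ- and µ-independence in Theorem 3, consider the Dirac-GAN of Section 5 with the gradient penalty folded into the cost. Assuming ℓ(z) = −log(1 + e^{−z}) (an illustrative choice, so ℓ'(0) = 1/2), the regularized Jacobian at the equilibrium is 2 × 2 and its stability can be spot-checked on a grid:

```python
import numpy as np

ELL_PRIME_0 = 0.5  # ell(z) = -log(1 + e^(-z)) gives ell'(0) = 1/2

def neg_J_tau_mu(tau, mu):
    """-J_(tau,mu) at (theta*, omega*) = (0, 0) for the Dirac-GAN with the
    penalty folded into the cost: f = ell(theta*omega) + ell(0) - (mu/2)omega^2."""
    J = np.array([[0.0, ELL_PRIME_0], [-ELL_PRIME_0, mu]])
    return -np.diag([1.0, tau]) @ J

# Theorem 3 predicts stability for every tau, mu > 0; spot-check a grid.
stable = all(
    np.linalg.eigvals(neg_J_tau_mu(tau, mu)).real.max() < 0
    for tau in (0.1, 1.0, 8.0, 16.0) for mu in (0.1, 0.3, 1.0, 10.0)
)
print(stable)  # True

# Without regularization (mu = 0) the eigenvalues are purely imaginary (+/- 0.5j),
# the oscillatory failure mode discussed in Section 5.
print(np.linalg.eigvals(neg_J_tau_mu(1.0, 0.0)))
```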

4. PROVABLE CONVERGENCE OF GDA WITH TIMESCALE SEPARATION

In this section, we characterize the asymptotic convergence rate of τ-GDA to differential Stackelberg equilibria, and provide a finite time guarantee for convergence to an ε-approximate equilibrium. The asymptotic convergence rate result uses Theorem 1 to construct a finite τ* ∈ (0, ∞) such that x* is stable, meaning spec(−J_τ(x*)) ⊂ C°−, and then for any τ ∈ (τ*, ∞), two key lemmas (Lemmas F.1 and F.2 in Appendix F) imply a local asymptotic convergence rate.

Theorem 4. Consider a zero-sum game (f_1, f_2) = (f, −f) defined by f ∈ C^r(X, R) for r ≥ 2 and let x* be a differential Stackelberg equilibrium of the game. There exists a τ* ∈ (0, ∞) such that for any τ ∈ (τ*, ∞) and α ∈ (0, γ̄), τ-GDA with learning rate γ_1 = γ̄ − α converges locally asymptotically at a rate of O((1 − α/(4β))^{k/2}) where γ̄ = min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|^2, λ_m = arg min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|^2, and β = (2Re(λ_m) − α|λ_m|^2)^{−1}. Moreover, if x* is a differential Nash equilibrium, then τ* = 0 so that for any τ ∈ (0, ∞) and α ∈ (0, γ̄), τ-GDA with γ_1 = γ̄ − α converges with a rate O((1 − α/(4β))^{k/2}).

To build some intuition, consider a differential Stackelberg equilibrium x* and its corresponding τ* obtained via Theorem 1 so that for any fixed τ ∈ (τ*, ∞), spec(−J_τ(x*)) ⊂ C°−. For the discrete time system x_{k+1} = x_k − γ_1Λ_τ g(x_k), if γ_1 is chosen such that the spectral radius of the local linearization of the discrete time map is a contraction, then x_k locally (exponentially) converges to x* (cf. Proposition B.1). With this in mind, we formulate an optimization problem to find the upper bound γ̄ on the learning rate γ_1 such that for all γ_1 ∈ (0, γ̄), ρ(I − γ_1 J_τ(x*)) < 1; indeed, let γ̄ = max{γ > 0 : max_{λ∈spec(J_τ(x*))} |1 − γλ| ≤ 1}. The intuition is as follows. The inner maximization problem is over the finite set spec(J_τ(x*)) = {λ_1, …, λ_n} where J_τ(x*) ∈ R^{n×n}.
As γ increases away from zero, each |1 − γλ_i| initially shrinks in magnitude. The first λ_i for which 1 − γλ_i hits the boundary of the unit circle in the complex plane gives us the optimal γ̄ and the λ_m ∈ spec(J_τ(x*)) that achieves it. Examining the constraint, we have that for each λ_i, |1 − γλ_i| ≤ 1 is equivalent to γ(γ|λ_i|^2 − 2Re(λ_i)) ≤ 0 for any γ > 0. As noted, this constraint will be tight for one of the λ, in which case γ̄ = 2Re(λ)/|λ|^2 since γ̄ > 0. Hence, by selecting γ̄ = min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|^2, we have that |1 − γ_1λ| < 1 for all λ ∈ spec(J_τ(x*)) and any γ_1 ∈ (0, γ̄). From here, one can use standard arguments from numerical analysis to show that for the choice of α and β, the claimed asymptotic rate holds.

Theorem 4 directly implies a finite time convergence guarantee for obtaining an ε-differential Stackelberg equilibrium, that is, a point within an ε-ball around a differential Stackelberg equilibrium x*.

Corollary 1. Given ε > 0, under the assumptions of Theorem 4, τ-GDA obtains an ε-differential Stackelberg equilibrium in (4β/α) log(‖x_0 − x*‖/ε) iterations for any x_0 ∈ B_δ(x*) with δ = α/(4Lβ) where L is the local Lipschitz constant of I − γJ_τ(x*).

Moreover, the convergence rates and finite time guarantees extend to the gradient penalty regularized generative adversarial network described in the preceding section.

Corollary 2. Under the assumptions of Theorems 3 and 4, for any fixed µ ∈ (0, ∞) and τ ∈ (0, ∞), τ-GDA converges locally asymptotically at a rate of O((1 − α/(4β))^{k/2}), and achieves an ε-equilibrium in (4β/α) log(‖x_0 − x*‖/ε) iterations for any x_0 ∈ B_δ(x*). In Appendix H, we extend the convergence analysis to the stochastic setting.
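The threshold γ̄ and the contraction check are a few lines of NumPy. A sketch on the game of Example 1 with τ = 3 (an illustrative choice above τ* = 2):

```python
import numpy as np

# Example 1 (v = 1) and its tau-GDA Jacobian at tau = 3 > tau* = 2.
D11 = np.diag([-1.0, 0.5]); D12 = np.diag([-1.0, 0.5]); D22 = np.diag([-0.5, -1.0])
J = np.block([[D11, D12], [-D12.T, -D22]])
Jtau = np.diag([1.0, 1.0, 3.0, 3.0]) @ J

lam = np.linalg.eigvals(Jtau)                        # spec(J_tau(x*)), all Re > 0
gamma_bar = (2 * lam.real / np.abs(lam) ** 2).min()  # step-size threshold

def rho(gamma1):
    """Spectral radius of the local linearization I - gamma1 * J_tau(x*)."""
    return np.abs(np.linalg.eigvals(np.eye(4) - gamma1 * Jtau)).max()

print(rho(0.9 * gamma_bar) < 1)  # True: contraction, tau-GDA converges locally
print(rho(1.1 * gamma_bar) < 1)  # False: step too large, no contraction
```

For this game γ̄ works out to 1/3, attained by the complex pair from the slow block; the spectral radius ρ(I − γ_1 J_τ(x*)) crosses 1 exactly at γ_1 = γ̄.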

5. EXPERIMENTS

We now present numerical experiments; Appendix K contains further simulations and details.

Dirac-GAN: Regularization, Timescale Separation, and Convergence Rate. The Dirac-GAN (Mescheder et al., 2018) consists of a univariate generator distribution p_θ = δ_θ and a linear discriminator D(x; ω) = ωx, where the real data distribution p_D is given by a Dirac distribution concentrated at zero. The resulting zero-sum game is defined by the cost f(θ, ω) = ℓ(θω) + ℓ(0) and the unique critical point (θ*, ω*) = (0, 0) is a local Nash equilibrium. However, the eigenvalues of the Jacobian are purely imaginary regardless of the choice of timescale separation, so τ-GDA oscillates and fails to converge. This behavior is expected since the equilibrium is not hyperbolic and corresponds to neither a differential Nash equilibrium nor a differential Stackelberg equilibrium, but it is undesirable nonetheless. The zero-sum game corresponding to the Dirac-GAN with regularization can be defined by the cost f(θ, ω) = ℓ(θω) + ℓ(0) − (µ/2)ω^2. The unique critical point remains unchanged, but for all τ ∈ (0, ∞) and µ ∈ (0, ∞) it is stable and corresponds to a differential Stackelberg equilibrium of the regularized game. From Figures 1a and 1f, we observe that the impact of timescale separation with regularization µ = 0.3 is that the trajectory is not as oscillatory since it moves faster to the zero line of −D_2 f(θ, ω) and then follows along that line until reaching the equilibrium. We further see from Figure 1b that with regularization µ = 0.3, τ-GDA with τ = 8 converges faster to the equilibrium than τ-GDA with τ = 16, despite the fact that the former exhibits some cyclic behavior in the dynamics while the latter does not. Figure 1c explains this behavior: the imaginary parts of the eigenvalues are non-zero with τ = 8 and zero with τ = 16, while the eigenvalue with the minimum real part is greater at τ = 8 than at τ = 16.
This highlights that some oscillatory behavior in the dynamics is not always harmful for convergence. For µ = 1 and τ = 1, Figures 1a and 1b show that even though τ-GDA does not cycle, since the eigenvalues of the Jacobian are purely real, the trajectory converges slowly to the equilibrium. Indeed, for each regularization parameter, the eigenvalues of J_τ(θ*, ω*) split after becoming purely real and then converge toward the eigenvalues of S₁(J(θ*, ω*)) and −τD₂²f(θ*, ω*). Since S₁(J(θ*, ω*)) ∝ 1/µ and −τD₂²f(θ*, ω*) ∝ τµ, there is a trade-off between the choice of regularization µ and the timescale separation τ on the conditioning of the Jacobian matrix that dictates the convergence rate.

Generative Adversarial Networks: Image Datasets. We build on the implementations of Mescheder et al. (2018) and train with the non-saturating objective and the R₁ gradient penalty. The network architectures are both ResNet based. We fix the initial learning rate for the generator to γ₁ = 0.0001 for CIFAR-10 and γ₁ = 0.00005 for CelebA. The learning rates are decayed so that γ_{1,k} = γ₁/(1 + ν)^k and γ_{2,k} = τγ_{1,k} are the generator and discriminator learning rates at update k, where ν = 0.005. The batch size is 64, the latent data is drawn from a standard normal of dimension 256, and the resolution of the images is 32 × 32 × 3. We run RMSprop with parameter α = 0.99 and retain an exponential moving average of the generator parameters for evaluation with parameter β = 0.9999. We remark that RMSprop is an adaptive method that builds on GDA and is commonly used in training for image datasets. It is adopted here to explore the interplay with timescale separation and to determine whether similar observations emerge compared to our extensive experiments with τ-GDA (see Appendix K).
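The Dirac-GAN dynamics discussed above are simple enough to reproduce directly. Below is a minimal sketch, assuming the logistic loss ℓ(t) = −log(1 + e^{−t}) for the cost and an illustrative step size (not the values used in the figures), with the µ = 0.3, τ = 8 configuration discussed above:

```python
import numpy as np

def sigma(t):
    # numerically stable logistic sigmoid
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def tau_gda(theta, omega, gamma=0.05, tau=8.0, mu=0.3, steps=5000):
    """Simultaneous tau-GDA on the (regularized) Dirac-GAN cost
    f(theta, omega) = l(theta*omega) + l(0) - (mu/2)*omega**2,
    with l(t) = -log(1 + exp(-t)) so that l'(t) = sigma(-t)."""
    for _ in range(steps):
        g = sigma(-theta * omega)          # l'(theta * omega)
        d_theta = g * omega                # D1 f
        d_omega = g * theta - mu * omega   # D2 f (includes the regularizer)
        theta, omega = theta - gamma * d_theta, omega + tau * gamma * d_omega
    return theta, omega

# With regularization (mu > 0), the iterates reach the equilibrium (0, 0).
theta, omega = tau_gda(1.0, 1.0, mu=0.3, tau=8.0)
assert abs(theta) < 1e-3 and abs(omega) < 1e-3

# Without it (mu = 0), the continuous-time flow is a center and the
# forward-Euler iterates spiral outward, never settling at the equilibrium.
theta, omega = tau_gda(1.0, 1.0, mu=0.0, tau=8.0, steps=500)
assert np.hypot(theta, omega) > 0.5
```

In the unregularized game the quantity τθ² + ω² is conserved by the continuous-time flow and strictly increased by the forward-Euler discretization, which is why the second run cannot approach the origin.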
The FID scores (Heusel et al., 2017) along the learning path and in numeric form at 150k/300k mini-batch updates for CIFAR-10 and CelebA with regularization parameters µ = 10 and µ = 1 are presented in Figures 2 and 3, respectively. The experiments were each repeated with 3 random seeds, which yielded similar results, and the mean scores are reported. The choices of τ = 4 and τ = 8 converge fastest with each regularization parameter for CIFAR-10 and CelebA, respectively. The performance with regularization µ = 1 is superior to that with µ = 10, which highlights the interplay between timescale separation and regularization. Moreover, we see that timescale separation improves convergence until hitting a limiting value. These conclusions agree with the insights from the simple Dirac-GAN experiment. Finally, it is worth reiterating that there is a coupling between τ and γ₁: τ must be selected so that the continuous-time system is stable, and then γ₁ must be chosen so that the discrete-time update is both stable and numerically well-conditioned for the choice of τ.

6. CONCLUSION

We prove that gradient descent-ascent locally converges to a critical point for a range of finite learning rate ratios if and only if the critical point is a differential Stackelberg equilibrium. This answers a standing open question about the local convergence of first order methods to local minmax equilibria. A key component of the proof is the construction of a (tight) finite lower bound on the learning rate ratio τ for which stability, and hence local asymptotic convergence of τ-GDA, is guaranteed.

APPENDIX

Below we provide a table of contents as a guide to the appendix.

A Related Work

Section A contains an extensive discussion of related work at the intersection of machine learning and game theory as well as historical connections to control theory and dynamical systems.

B Mathematical and Game Theoretic Preliminaries

Section B includes mathematical and game theoretic preliminaries not covered in the main text and contains proofs of short technical helper lemmas.

C Proof of Theorem 1: Stability of τ-GDA

Section C contains the proof of our main stability result: a critical point x* of the continuous-time dynamical system ẋ = −Λ_τ g(x) is stable for a range of τ if and only if x* is a differential Stackelberg equilibrium. Moreover, this range of timescale separation parameters is characterized.

D Proof of Theorem 2: Instability of τ-GDA

Section D contains the proof of our second main result on the existence of a finite range of values of τ such that non-equilibrium critical points which are stable for some τ become unstable for all τ in a range lower bounded by a finite value.

E Proof of Theorem 3: Stability of τ-GDA in Regularized GANs

Section E contains the proof of Theorem 3, which is an application of the main results to the important adversarial learning application of training generative adversarial networks. Within the section, we also provide necessary conditions on the sizes of the discriminator and generator network architectures for stability.

F Proof of Helper Lemmas and Theorem 4 for τ-GDA Convergence

Section F contains the statement and proofs of two core technical lemmas on the convergence rate of τ-GDA. We then show how to invoke them together with the stability results to obtain Theorem 4.

G Proof of Corollary 1: Finite Time Convergence of τ-GDA

Section G contains the proof of a finite time convergence guarantee on achieving an ε-differential Stackelberg equilibrium, including an estimate on the size of the ball and commentary on how to improve that estimate with alternative techniques.

H Convergence of Stochastic GDA with Timescale Separation

Section H contains an extension of the deterministic convergence guarantees to the stochastic setting in which agents have access only to an unbiased estimator of their gradient. We provide convergence guarantees for τ -GDA in this setting as well under the least restrictive, to our knowledge, assumptions on stochastic approximation updates to date.

I Stability of ∞-GDA: A Singular Perturbation Approach

Section I contains a proof of the result for the limiting case of τ → ∞. While this result appeared recently in Jin et al. (2020) , an analogous result has been known for some time in the theory of singularly perturbed dynamical systems. Since the proof techniques are particularly illuminating and expose a new set of tools to this community, we provide this alternative proof from the singular perturbation theory perspective.

J Further Details on Related Work

Section J contains an extended discussion of the related work by Jin et al. (2020). Specifically, we expand on the relationship between Proposition 27 in Jin et al. (2020), which provides examples of games such that the stable critical points of τ-GDA are not a subset of the local minmax equilibria and vice versa, and our main results. We show that the examples in these results, while illustrative of the fact intended to be shown by the proposition, do not conflict with our findings.

K Experiments Supplement

Section K contains extended experimental results, including applications to illustrative games (such as the examples included in the main body) as well as extended GAN experiments. Our results show that we are able to leverage τ to improve training results. These results are promising as they suggest that understanding the relationship between the three key hyperparameters, namely timescale separation τ, regularization µ, and exponential moving averaging weight β, can improve first order (and hence scalable) training algorithms.

L Alternative Proof of Theorem 3 via τ* Construction from Theorem 1

In order to highlight the utility of Theorem 1 along with future directions for obtaining values of τ* in structured games, we revisit the proof of Theorem 3 and derive the result directly from the construction. The purpose of this section is to illustrate that the structure of equilibria considered in Theorem 3 can be exploited to obtain the value of τ* for the entire class of games using properties of the Kronecker product and sum.

A RELATED WORK

In this section, we provide a review of related work at the intersection of machine learning and game theory, as well as connections to dynamical systems theory and control.

A.1 MACHINE LEARNING AND LEARNING IN GAMES

Given the extensive work on the topic of learning in games in machine learning over the last several years, we cannot cover all of it and instead focus our attention on the work most relevant to this paper. We begin by reviewing solution concepts developed for the class of games under consideration and then discuss some learning dynamics studied in the literature beyond gradient descent-ascent. Following this, we delineate the related work studying gradient descent-ascent in non-convex, non-concave zero-sum games and finish by noting the literature on bilinear and non-convex, concave zero-sum games.

Solution Concepts. Owing to the numerous applications in machine learning, a significant portion of the modern work on learning in games has focused on the zero-sum formulation with non-convex, non-concave cost functions. Most recently, Daskalakis et al. (2021) tout the importance and significance of this class of games in a paper on the complexity of finding equilibria (in particular, in the constrained setting) in such games. Consequently, local solution concepts have been broadly adopted. Compared to the standard game-theoretic notions of equilibrium that characterize players' incentives to deviate given the game and information structure, local equilibrium concepts restrict the deviation search space to a suitable local neighborhood. Following the standard game-theoretic viewpoint, a vast number of works in machine learning study the local Nash equilibrium concept and critical points satisfying gradient-based sufficient conditions for the equilibrium, which are often referred to as differential Nash equilibria (Ratliff et al., 2013; 2014; 2016).
Based on the observation that in non-convex, non-concave zero-sum games the order of play is fundamental to the definition of the game, there has been a push toward considering local notions of the Stackelberg equilibrium concept, which is the usual game-theoretic equilibrium when there is an explicit order of play between players. In the zero-sum formulation, Stackelberg equilibria are often referred to as minmax equilibria. As with the Nash equilibrium, gradient-based sufficient conditions for local minmax/Stackelberg equilibria have been given (Fiez et al., 2020; Jin et al., 2020) and such critical points have been referred to as differential Stackelberg equilibria (Fiez et al., 2020). We remark that it has been shown that local/differential Nash equilibria are a subset of local/differential Stackelberg equilibria (Fiez et al., 2020; Jin et al., 2020).

Learning Dynamics. Given that the focus of this work is on gradient descent-ascent, we center our coverage of related work on papers analyzing its behavior. Nonetheless, we mention that a significant number of learning dynamics for zero-sum games have been developed in the past few years, in some cases motivated by the shortcomings of gradient descent-ascent without timescale separation. The methods include optimistic and extra-gradient algorithms (Daskalakis et al., 2018; Gidel et al., 2019a; Mertikopoulos et al., 2019), negative momentum (Gidel et al., 2019b), gradient adjustments (Balduzzi et al., 2018; Letcher et al., 2019a; Mescheder et al., 2017), and opponent modeling methods (Foerster et al., 2018; Letcher et al., 2019b; Metz et al., 2017; Schäfer & Anandkumar, 2019; Zhang & Lesser, 2010), among others. While the aforementioned learning dynamics possess some desirable characteristics, they cannot guarantee that the set of stable critical points coincides with a set of local equilibria for the class of games under consideration.
However, there have been a select few learning dynamics proposed that can guarantee the stable critical points coincide with either the set of differential Nash equilibria (Adolphs et al., 2019; Mazumdar et al., 2019) or the set of differential Stackelberg equilibria (Fiez et al., 2020; Wang et al., 2020; Zhang et al., 2020b), effectively solving the problem of guaranteeing local convergence to only a class of local equilibria. However, since each of the algorithms achieving the equilibrium stability guarantee requires solving a linear equation in each update step, they are not efficient and can potentially suffer from degeneracies along the learning path in applications such as generative adversarial networks. These practical shortcomings motivate either proving that existing learning dynamics using only first-order gradient feedback achieve analogous theoretical guarantees or developing novel computationally efficient learning dynamics that can match the theoretical guarantee of interest.

Gradient Descent-Ascent. Gradient descent-ascent has been studied extensively in non-convex, non-concave zero-sum games since it is a natural analogue of gradient descent from optimization, is computationally efficient, and has been shown to be effective in practice for applications of interest when combined with common heuristics. A prevailing approach toward gaining an understanding of the convergence characteristics of gradient descent-ascent has been to analyze the local stability around critical points of the continuous time limiting dynamical system. The majority of this work has not considered the impact of timescale separation. Numerous papers have pointed out that the stable critical points of gradient descent-ascent without timescale separation may not be game-theoretically meaningful. In particular, it has been shown that there can exist stable critical points that are not differential Nash equilibria (Daskalakis & Panageas, 2018; Mazumdar et al., 2020).
Furthermore, it is known that there can exist stable critical points that are not differential Stackelberg equilibria (Jin et al., 2020). The aforementioned results rule out the possibility that gradient descent-ascent without timescale separation can guarantee equilibrium convergence. In terms of the stability of equilibria, it is known that differential Nash equilibria are stable for gradient descent-ascent without timescale separation (Daskalakis & Panageas, 2018; Mazumdar et al., 2020), but that there can exist differential Stackelberg equilibria which are not stable with respect to gradient descent-ascent without timescale separation. The work of Jin et al. (2020) is the most relevant in exploring how the aforementioned stability properties of gradient descent-ascent change with timescale separation. In particular, Jin et al. (2020) investigate whether the desirable stability characteristics (stability of differential Nash equilibria) and undesirable stability characteristics (stability of non-equilibrium critical points and instability of differential Stackelberg equilibria) of gradient descent-ascent without timescale separation are maintained and remedied, respectively, with timescale separation. In terms of the former query, extending the examples shown in Mazumdar et al. (2020) and Daskalakis & Panageas (2018), Jin et al. (2020) show that differential Nash equilibria are stable for gradient descent-ascent with any amount of timescale separation. On the other hand, for the latter query, Jin et al. (2020) show (in Proposition 27) two interesting examples: (a) for an a priori fixed τ, there exists a game with a differential Stackelberg equilibrium that is not stable and (b) for an a priori fixed τ, there exists a game with a stable critical point that is not a differential Stackelberg equilibrium.
However, (a) does not imply that for the constructed game there does not exist another (finite) τ, independent of the game parameters, such that the differential Stackelberg equilibrium is stable for all larger τ. In simple language, the result summarized in (a) says the following: if a bad timescale separation is chosen, then convergence may not be guaranteed. Similarly, (b) does not imply that there is no τ such that, for all larger τ in the constructed game instance, the critical point becomes unstable. Again, in simple language, the result summarized in (b) says the following: if a bad timescale separation is chosen, then non-game-theoretically meaningful equilibria may persist. While at first glance this set of results may appear to indicate that the undesirable stability characteristics of gradient descent-ascent without timescale separation cannot be averted by any finite timescale separation, it is important to emphasize that these results do not answer the questions of whether there (a) exists a game with a critical point that is not a differential Stackelberg equilibrium which is stable with respect to gradient descent-ascent without timescale separation and remains stable for all finite timescale separation ratios or (b) exists a game with a differential Stackelberg equilibrium that is not stable for all finite timescale separation ratios. The preceding questions are left open by previous work and are exactly the focus of this paper. In Appendix J, we go into greater detail on the comparison with Proposition 27 of Jin et al. (2020), as we believe this to be an important point of distinction for Theorems 1 and 2 in this paper. Finally, Jin et al. (2020) study GDA with a timescale separation approaching infinity and show that the stable critical points of gradient descent-ascent coincide with the set of differential Stackelberg equilibria in this regime.
This result effectively shows that gradient descent-ascent can guarantee equilibrium-only convergence with timescale separation, albeit only when the separation becomes arbitrarily large. We remark that an equivalent result in the context of general singularly perturbed systems has been known in the literature (Kokotovic et al., 1986, Chap. 2), as we discuss further in Section I. Finally, we point out that since a timescale separation approaching infinity does not result in an implementable algorithm, fully understanding the behavior with a finite timescale separation is of fundamental importance and the motivation for our work. Beyond the work of Jin et al. (2020) considering timescale separation in gradient descent-ascent, it is worth mentioning the work of Chasnov et al. (2019) and Heusel et al. (2017). Chasnov et al. (2019) study the impact of timescale separation on gradient descent-ascent, but focus on the convergence rate as a function of it given an initialization around a differential Nash equilibrium and do not consider the stability questions examined in this paper. Heusel et al. (2017) study stochastic gradient descent-ascent with timescale separation and invoke the results of Borkar (2008) for analysis. The stochastic approximation results that their claims rely on guarantee convergence of the system locally to a stable critical point. Consequently, the claim of convergence to differential Nash equilibria of stochastic gradient descent-ascent given by Heusel et al. (2017) only holds given an initialization in a local neighborhood around a differential Nash equilibrium. In this regard, the issue of the local stability of the relevant types of critical points is effectively assumed away and not considered. In contrast, we are able to combine our stability results for gradient descent-ascent with timescale separation together with the stochastic approximation theory of Borkar (2008) to guarantee local convergence to a differential Stackelberg equilibrium in Section H.
We remark that Heusel et al. (2017) empirically demonstrate that timescale separation can significantly improve the performance of gradient descent-ascent when training generative adversarial networks.

The final relevant line of work studying gradient descent-ascent is specific to generative adversarial networks. The results from this literature develop assumptions relevant to generative adversarial networks and then analyze the stability and convergence properties of gradient descent-ascent under them (see, e.g., works by Daskalakis et al. (2018); Goodfellow et al. (2014); Mescheder et al. (2018); Metz et al. (2017); Nagarajan & Kolter (2017)). Within this body of work, there has been a significant amount of effort focused on how the stability (and, hence, convergence properties) of gradient descent-ascent in generative adversarial networks can be enhanced with regularization methods. Nagarajan & Kolter (2017) show, under suitable assumptions, that gradient-based methods for training generative adversarial networks are locally convergent assuming the data distributions are absolutely continuous. However, as observed by Mescheder et al. (2018), not only may such assumptions fail to hold in many practical generative adversarial network training scenarios, such as natural images, but it is often the case that the data distribution is concentrated on a lower dimensional manifold. The latter characteristic leads to nearly purely imaginary eigenvalues and highly ill-conditioned problems. Mescheder et al. (2018) provide an explanation for the observed instabilities resulting from the true data distribution being concentrated on a lower dimensional manifold using discriminator gradients orthogonal to the tangent space of the data manifold. Further, the authors introduce regularization via gradient penalties that leads to convergence guarantees under less restrictive assumptions than were previously known.
In this paper, we further extend these results to show that convergence to differential Stackelberg equilibria is guaranteed under a wide array of hyperparameter configurations (i.e., learning rate ratio and regularization).

Bilinear Games. We would be remiss not to include a discussion of gradient methods applied to an important class of zero-sum games: bilinear games. Bilinear games fall within a measure zero set of C² games that do not possess the generic properties that, at critical points x*, det(D₂²f(x*)) ≠ 0 and det(S₁(J(x*))) ≠ 0 (see Fiez et al. 2020 for the genericity statements regarding local minmax equilibria in zero-sum games). This means they need special treatment. For zero-sum bilinear unconstrained games, however, the behavior of gradient descent-ascent is already known. In particular, the zero point is always a center-type equilibrium of the continuous time dynamics, so that the continuous time dynamics are recurrent. A forward Euler discretization of the continuous time dynamics gives gradient descent-ascent, and such dynamics will always diverge. The alternating gradient descent update introduced by Bailey et al. (2020) is one solution to this problem such that the behavior of the continuous time dynamics is preserved under the discretization. This method simply uses a different discretization scheme that respects the continuous time behavior. We also note that since zero-sum bilinear games do not admit equilibria with generic properties, simply introducing regularization can remedy the problem. In particular, if the maximizing player's update is modified to include the derivative of −(µ/2)‖y‖², then the local minmax (which is not strict) becomes a strict local minmax for the regularized game. In this case, our results do apply. Nonconvex-Concave Optimization.
A final related line of work is on nonconvex-concave optimization (Lin et al., 2020a;b; Lu et al., 2020; Nouiehed et al., 2019; Ostrovskii et al., 2020; Rafique et al., 2018). The focus in this set of works (among many others on the topic) is on characterizing the iteration complexity of reaching stationary points, rather than stability and asymptotic convergence as in the non-convex, non-concave zero-sum game setting. The primary relevance of work on this problem is that a number of the algorithms rely on timescale separation and variations of gradient descent-ascent. Moreover, the methods for obtaining fast convergence rates may be relevant to future work attempting to characterize fast rates in the non-convex, non-concave setting once there is a more fundamental understanding of stability and asymptotic convergence.
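The bilinear-game behavior described above, divergence of forward-Euler GDA and convergence once the maximizing player is regularized, can be checked in a few lines; the step size and µ below are illustrative choices, not values from the text:

```python
import numpy as np

def gda_bilinear(x, y, gamma=0.1, mu=0.0, steps=1000):
    """Simultaneous GDA on f(x, y) = x*y - (mu/2)*y**2.
    mu = 0 is the pure bilinear game; mu > 0 regularizes the maximizer."""
    for _ in range(steps):
        x, y = x - gamma * y, y + gamma * (x - mu * y)
    return x, y

# Pure bilinear game: the forward-Euler iterates spiral outward (the norm is
# multiplied by sqrt(1 + gamma^2) every step), so GDA diverges from (0, 0).
x, y = gda_bilinear(1.0, 1.0, mu=0.0)
assert np.hypot(x, y) > np.hypot(1.0, 1.0)

# With regularization, the origin becomes a strict local minmax of the
# regularized game and the same discretization converges.
x, y = gda_bilinear(1.0, 1.0, mu=0.5)
assert np.hypot(x, y) < 1e-3
```

For the unregularized case the update matrix is a rotation scaled by sqrt(1 + γ²), which makes the divergence exact rather than a numerical artifact.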

A.2 HISTORICAL PERSPECTIVE: DYNAMICAL SYSTEMS AND CONTROL

The study of gradient descent-ascent dynamics with timescale separation between the minimizing and maximizing players is closely related to that of singularly perturbed dynamical systems (Kokotovic et al., 1986). Such systems arise in classical control and dynamical systems in the context of physical systems that either have multiple states which evolve on different timescales due to some underlying immutable physical process or property, or a single dynamical system which evolves on a sub-manifold of the larger state-space. For example, robot manipulators or end effectors often have slower mechanical dynamics than electrical dynamics. On the other hand, in electrical circuits or mechanical systems, certain resistor-capacitor circuits or spring-mass systems have a state which evolves subject to a constraint equation (Lagerstrom & Casten, 1972; Sastry & Desoer, 1981). Due to their prevalence, singularly perturbed systems have been studied extensively, with one of the outcomes being a number of works on determining the range of perturbation parameters for which the overall system is stable (Kokotovic et al., 1986; Saydy, 1996; Saydy et al., 1990). We exploit these results and analysis techniques to develop novel results for learning in games. One of the contributions of this work is the introduction of these algebraic analysis techniques to the machine learning and game theory communities. These tools open up new avenues for algorithm synthesis; we comment on potential directions in the concluding discussion section. This being said, there are a couple of key differences between the present setting and that of the classical literature, including the following: 1. The perturbation parameter is no longer an immutable characteristic of the physical system, but rather a hyperparameter subject to design.
Indeed, in singular perturbation theory, the typical dynamical system studied takes the form

ẋ = g₁(x, y), ε ẏ = g₂(x, y), (6)

where ε is a small parameter that abstracts some physical characteristics of the state variables. On the other hand, in learning in games, the continuous time limiting dynamical system of gradient descent-ascent for a zero-sum game defined by f ∈ C²(X × Y, R) takes the form

ẋ = −D₁f(x, y), ẏ = τ D₂f(x, y), (7)

where the x-player seeks to minimize f with respect to x and the y-player seeks to maximize f with respect to y, and τ is the ratio of learning rates (without loss of generality) of the maximizing player to the minimizing player. These learning rates, and hence the value of τ, are hyperparameters subject to design in most machine learning and optimization applications. Another feature of (7) as compared to (6) is that the dynamics D_i f are partial derivatives of a function f, which leads to the second key difference. 2. There is structure in the dynamical system that arises from gradient play which reflects the underlying game theoretic interactions between players. This structure can be exploited in obtaining convergence guarantees in machine learning and optimization applications of game theory. For instance, minmax optimization is analogous to a zero-sum game for which the local linearization of gradient descent-ascent dynamics has the structure

J = [A, B; −τBᵀ, −τC],

where A = Aᵀ, C = Cᵀ, and τ is the learning rate ratio or timescale separation parameter. Such block matrices have very interesting properties. In particular, second order optimality conditions for a minmax equilibrium correspond to positive definiteness of the first Schur complement S₁(J) = A − BC⁻¹Bᵀ > 0, and of −C > 0 (Fiez et al., 2020). This turns out to be keenly important for understanding convergence of gradient descent-ascent.
Furthermore, due to the structure of J, tools from the theory of block operators (see, e.g., works by Lancaster & Tismenetsky (1985) ; Magnus (1988) ; Tretter (2008) ) such as the quadratic numerical range can be exploited (and combined with singular perturbation theory) to understand the effects of hyperparameters such as τ (the learning rate ratio) and regularization (which is common in applications such as generative adversarial networks) on convergence.
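A minimal numerical sketch of this block structure, with hypothetical scalar blocks A = −0.5, B = 2, C = −1 (a differential Stackelberg point that is not a differential Nash equilibrium), assembles J_τ and locates the critical timescale for this instance:

```python
import numpy as np

# Hypothetical scalar game blocks at a critical point x*:
# A = D1^2 f, B = D12 f, C = D2^2 f.
A, B, C = np.array([[-0.5]]), np.array([[2.0]]), np.array([[-1.0]])

# Differential Stackelberg conditions: S1(J) = A - B C^{-1} B^T > 0 and -C > 0.
S1 = A - B @ np.linalg.inv(C) @ B.T
assert np.all(np.linalg.eigvalsh(S1) > 0) and np.all(np.linalg.eigvalsh(-C) > 0)
# But A < 0, so x* is not a differential Nash equilibrium.
assert np.linalg.eigvalsh(A)[0] < 0

def J_tau(tau):
    """Block linearization of tau-GDA: J_tau = [[A, B], [-tau*B^T, -tau*C]]."""
    return np.block([[A, B], [-tau * B.T, -tau * C]])

def stable(tau):
    # x* is exponentially stable for x_dot = -Lambda_tau g(x) iff spec(J_tau)
    # lies in the open right half plane.
    return bool(np.all(np.linalg.eigvals(J_tau(tau)).real > 0))

# For these numbers, trace(J_tau) = tau - 0.5 and det(J_tau) = 3.5*tau, so the
# critical timescale is tau* = 1/2: unstable below it, stable above it.
assert not stable(0.25) and stable(1.0)
```

Note that τ cancels in the Schur complement, which is why the equilibrium conditions themselves are τ-independent even though stability is not.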

B MATHEMATICAL AND GAME THEORETIC PRELIMINARIES

In this appendix section, we review mathematical and game theoretic preliminaries needed for the technical details of the proofs. We also include some short technical lemmas from algebra that are used in the proofs.

B.1 GAME THEORY

We now formally present the local equilibrium definitions (see Başar & Olsder 1998) that we characterize in the main body via gradient-based sufficient conditions, as is typical in the literature.

Definition B.1 (Local Nash Equilibrium). The joint strategy x* ∈ X is a local Nash equilibrium on ∏_{i∈I} U_i ⊂ X, where U_i ⊆ X_i, if f(x*₁, x*₂) ≤ f(x₁, x*₂) for all x₁ ∈ U₁ ⊂ X₁ and f(x*₁, x*₂) ≥ f(x*₁, x₂) for all x₂ ∈ U₂ ⊂ X₂. Furthermore, if the inequalities are strict, we say x* is a strict local Nash equilibrium.

Definition B.2 (Local Stackelberg Equilibrium). Consider U_i ⊂ X_i for i = 1, 2 where, without loss of generality, player 1 is the leader (minimizing player) and player 2 is the follower (maximizing player). The strategy x*₁ ∈ U₁ is a local Stackelberg solution for the leader if, for all x₁ ∈ U₁,

sup_{x₂ ∈ r_{U₂}(x*₁)} f(x*₁, x₂) ≤ sup_{x₂ ∈ r_{U₂}(x₁)} f(x₁, x₂),

where r_{U₂}(x₁) = {y ∈ U₂ | f(x₁, y) ≥ f(x₁, x₂), for all x₂ ∈ U₂} is the reaction curve. Moreover, for any x*₂ ∈ r_{U₂}(x*₁), the joint strategy profile (x*₁, x*₂) ∈ U₁ × U₂ is a local Stackelberg equilibrium on U₁ × U₂.

While characterizing the existence of equilibria is outside the scope of this work, we remark that Nash equilibria exist for convex costs on compact and convex strategy spaces and Stackelberg equilibria exist on compact strategy spaces (Başar & Olsder, 1998, Thm. 4.3, Thm. 4.8, & Sec. 4.9). Existence of local equilibria is guaranteed if the neighborhoods and the cost functions restricted to those neighborhoods satisfy the assumptions of the cited results. The differential characterization of local Nash equilibria in continuous games was first reported in (Ratliff et al., 2013). Genericity and structural stability were studied in general-sum settings in (Ratliff et al., 2014) and in zero-sum settings in (Mazumdar & Ratliff, 2019).
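The two definitions can be probed by brute force on a small quadratic game. The instance below is hypothetical; at (0, 0) it satisfies the local Stackelberg definition but fails the local Nash definition, since the leader's cost is concave in its own variable:

```python
import numpy as np

# Quadratic zero-sum game (hypothetical instance): player 1 (leader) minimizes
# over x, player 2 (follower) maximizes over y.
def f(x, y):
    return -0.25 * x**2 + 2.0 * x * y - 0.5 * y**2

grid = np.linspace(-0.1, 0.1, 101)          # local neighborhoods U1 = U2
F = f(grid[:, None], grid[None, :])          # F[i, j] = f(grid[i], grid[j])

# Local Stackelberg check at (0, 0): the leader's worst-case value
# sup_{y in U2} f(x, y) is minimized at x = 0 (grid index 50), with value 0.
leader_value = F.max(axis=1)
assert np.argmin(leader_value) == 50 and np.isclose(leader_value[50], 0.0)

# Local Nash fails: with y fixed at 0, x = 0 *maximizes* f(x, 0) = -0.25 x^2,
# so the minimizing player can profitably deviate unilaterally.
assert f(0.05, 0.0) < f(0.0, 0.0)
```

Here D₁²f = −0.5 < 0 while S₁(J) = −0.5 − 2(−1)⁻¹2 = 3.5 > 0 and −D₂²f = 1 > 0, so the brute-force outcome matches the gradient-based sufficient conditions referenced above.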

B.2 DYNAMICAL SYSTEMS THEORY

Recall the following equivalent characterizations of stability for an equilibrium of ẋ = −g(x) in terms of the Jacobian matrix J(x) = Dg(x).

Theorem B.1 (Theorem 4.15, Khalil 2002). Consider a critical point x* of g(x). The following are equivalent: (a) x* is a locally exponentially stable equilibrium of ẋ = −g(x); (b) spec(−J(x*)) ⊂ C°₋; (c) there exists a symmetric positive definite matrix P = Pᵀ > 0 such that PJ(x*) + J(x*)ᵀP > 0.

Leveraging Linearization to Infer Qualitative Properties. The Hartman-Grobman theorem asserts that it is possible to continuously deform all trajectories of a nonlinear system onto trajectories of the linearization at a fixed point of the nonlinear system. Informally, the theorem states that if the linearization of the nonlinear dynamical system ẋ = F(x) around a fixed point x̄ (i.e., F(x̄) = 0) has no zero or purely imaginary eigenvalues, then there exists a neighborhood U of x̄ and a homeomorphism h : U → Rⁿ (i.e., h, h⁻¹ ∈ C(U, Rⁿ)) taking trajectories of ẋ = F(x) and mapping them onto those of ż = DF(x̄)z. In particular, h(x̄) = 0. Given a dynamical system ẋ = F(x), the state or solution of the system at time t starting from x at time t₀ is called the flow and is denoted φ_t(x).

Theorem B.2 (Hartman-Grobman: Theorem 7.3, Sastry 1999; Theorem 9.9, Teschl 2000). Consider the n-dimensional dynamical system ẋ = F(x) with equilibrium point x̄. If DF(x̄) has no zero or purely imaginary eigenvalues, there is a homeomorphism h defined on a neighborhood U of x̄ taking orbits of the flow φ_t of ẋ = F(x) to those of the linear flow e^{tDF(x̄)}, that is, the flows are topologically conjugate. The homeomorphism preserves the sense of the orbits and can be chosen to preserve parameterization by time.
The above theorem says that the qualitative properties of the nonlinear system ẋ = F(x) in the vicinity (determined by the neighborhood U) of an isolated equilibrium x̄ are determined by its linearization whenever the linearization has no eigenvalues on the imaginary axis of the complex plane. We also remark that Hartman-Grobman can be applied to discrete time maps (Sastry, 1999, Thm. 2.18) with the same qualitative outcome.

Limiting dynamical systems and connections to singular perturbation theory. The continuous time dynamical system takes the form ẋ = -Λ_τ g(x) due to the timescale separation τ. Such a system is known as a singularly perturbed system or a multi-timescale system in the dynamical systems theory literature (Kokotovic et al., 1986), particularly where τ^{-1} is small. Singularly perturbed systems are classically expressed as ẋ = -D_1 f_1(x, z), ε ż = -D_2 f_2(x, z), where ε = τ^{-1} is most often a physically meaningful quantity inherent to some dynamical system describing the evolution of a physical phenomenon; e.g., in circuits it may be a constant related to device material properties, and in communication networks it is often the speed at which data flows through a physical medium such as cable. This brings us to one key point of separation in applying dynamical systems theory to the study of algorithms versus physical system dynamics: ε is no longer necessarily a physical quantity but is most often a hyperparameter subject to design.

Internal chain transitivity. In proving results for stochastic gradient descent-ascent, we leverage what is known as the ordinary differential equation method, in which the flow of the limiting continuous time system started at sample points from the players' stochastic updates is compared to asymptotic pseudo-trajectories, i.e., linear interpolations between sample points. To understand stability in the stochastic case, we need the notion of internally chain transitive sets.
For more detail, the reader is referred to (Alongi & Nelson, 2007, Chap. 2-3). A closed set U ⊂ R^m is an invariant set for a differential equation ẋ = F(x) if any trajectory x(t) with x(0) ∈ U satisfies x(t) ∈ U for all t ∈ R. Let φ_t be a flow on a metric space (X, d). Given ε > 0, T > 0 and x, y ∈ X, an (ε, T)-chain from x to y with respect to φ_t and d is a pair of finite sequences x = x_0, x_1, . . . , x_{k-1}, x_k = y in X and t_0, . . . , t_{k-1} in [T, ∞), denoted together by (x_0, x_1, . . . , x_{k-1}, x_k; t_0, . . . , t_{k-1}), such that d(φ_{t_i}(x_i), x_{i+1}) < ε for i = 0, 1, . . . , k-1. A set U ⊆ X is (internally) chain transitive with respect to φ_t if U is a non-empty closed invariant set with respect to φ_t such that for each x, y ∈ U, ε > 0 and T > 0 there exists an (ε, T)-chain from x to y. A compact invariant set U is invariantly connected if it cannot be decomposed into two disjoint closed nonempty invariant sets. It is easy to see that every internally chain transitive set is invariantly connected.

B.3 TOOLS FOR CONVERGENCE ANALYSIS

The following proposition is a well-known result in numerical analysis and can be found in a number of books and papers on the subject. Essentially, it provides an asymptotic convergence guarantee for a discrete time update process or dynamical system.

Proposition B.1 (Ostrowski's Theorem, Argyros 1999; Theorem 10.1.2, Ortega & Rheinboldt 1970). Let x* be a fixed point of the discrete dynamical system x_{k+1} = F(x_k). If the spectral radius of the Jacobian satisfies ρ(DF(x*)) < 1, then F is a contraction at x* and hence x* is asymptotically stable.

We analyze the iteration complexity, or local asymptotic rate of convergence, of learning rules of the form x_{k+1} = h(x_k) in the neighborhood of an equilibrium. Given two real valued functions F(k) and G(k), we write F(k) = O(G(k)) if there exists a positive constant c > 0 such that |F(k)| ≤ c|G(k)|. For example, consider iterates generated by x_{k+1} = h(x_k) with initial condition x_0 and critical point x*. Then, if F(k) = ‖x_{k+1} - x*‖ ≤ M^k ‖x_0 - x*‖, we write F(k) = O(M^k) where c = ‖x_0 - x*‖.

B.4 NUMERICAL AND QUADRATIC NUMERICAL RANGE

The numerical range and quadratic numerical range of a block operator matrix are particularly useful for proving results about the spectrum of a block operator matrix as they are supersets of the spectrum (Tretter, 2008). Given a matrix A ∈ R^{n×n}, the numerical range is defined by W(A) = {⟨Az, z⟩ : z ∈ C^n, ‖z‖ = 1}, and is a convex subset of C. Define the spaces W_i = {z ∈ C^{n_i} : ‖z‖ = 1} for each i ∈ {1, 2}.
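A minimal sketch of Proposition B.1 on a toy map (the map F and its coefficients are our own illustrative choices, not from the paper):

```python
import numpy as np

# Toy smooth map with fixed point x* = 0.
def F(x):
    return np.array([0.5 * x[0] + 0.1 * x[1] ** 2,
                     0.2 * x[1] + 0.1 * x[0] ** 2])

DF = np.array([[0.5, 0.0],   # Jacobian of F at x* = 0
               [0.0, 0.2]])
rho = max(abs(np.linalg.eigvals(DF)))
assert rho < 1  # spectral radius condition of the proposition

x = np.array([0.3, -0.2])
for k in range(50):
    x = F(x)

# Iterates contract toward x*; the rate is governed by rho, matching
# the O(M^k) convention above.
print(np.linalg.norm(x) < 1e-10)  # True
```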

Consider a block operator matrix

A = [A_11, A_12; A_21, A_22],

where A_ii ∈ R^{n_i×n_i} and A_ij ∈ R^{n_i×n_j} for each i, j ∈ {1, 2}. Given v ∈ W_1 and w ∈ W_2, let A_{v,w} ∈ C^{2×2} be defined by

A_{v,w} = [⟨A_11 v, v⟩, ⟨A_12 w, v⟩; ⟨A_21 v, w⟩, ⟨A_22 w, w⟩].

The quadratic numerical range of A is defined by W²(A) = ∪_{v∈W_1, w∈W_2} spec(A_{v,w}), where spec(•) denotes the spectrum of its argument. The quadratic numerical range can be described as the set of solutions of the characteristic polynomial

λ² - λ(⟨A_11 v, v⟩ + ⟨A_22 w, w⟩) + ⟨A_11 v, v⟩⟨A_22 w, w⟩ - ⟨A_12 w, v⟩⟨A_21 v, w⟩ = 0 (9)

for v ∈ W_1 and w ∈ W_2. We use the notation ⟨Av, w⟩ = w*Av to denote the inner product. Note that W²(A) is a (potentially non-convex) subset of W(A) and contains spec(A).
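The containment W²(A) ⊆ W(A) can be probed numerically by sampling unit vectors. The sketch below (our own, with a random block matrix) checks each sampled point of W²(A) against the outer bound Re(W(A)) ≤ λ_max((A + A^T)/2):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 2, 3
A = rng.standard_normal((n1 + n2, n1 + n2))
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]

def sample_W2(num=2000):
    """Sample points of the quadratic numerical range W^2(A)."""
    pts = []
    for _ in range(num):
        v = rng.standard_normal(n1) + 1j * rng.standard_normal(n1)
        w = rng.standard_normal(n2) + 1j * rng.standard_normal(n2)
        v, w = v / np.linalg.norm(v), w / np.linalg.norm(w)
        # The 2x2 compression A_{v,w} from the definition above.
        Avw = np.array([[v.conj() @ A11 @ v, v.conj() @ A12 @ w],
                        [w.conj() @ A21 @ v, w.conj() @ A22 @ w]])
        pts.extend(np.linalg.eigvals(Avw))
    return np.array(pts)

pts = sample_W2()
# W^2(A) lies in W(A), whose real part is bounded by the largest
# eigenvalue of the symmetric part of A.
bound = np.linalg.eigvalsh((A + A.T) / 2).max()
print(bool(np.all(pts.real <= bound + 1e-9)))  # True
```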

B.5 TECHNICAL LEMMAS

In this appendix, we present a handful of technical lemmas and review some additional mathematical preliminaries excluded from the main body but important in proving the results of the paper. The following technical lemma is used in proving an upper bound on the spectral radius of the linearization of the discrete time update τ-GDA, a requirement for obtaining the convergence rate results.

Lemma B.1. The function c(z) = (1 - z)^{1/2} + z/4 - (1 - z/2)^{1/2} satisfies c(z) ≤ 0 for all z ∈ [0, 1].

Proof. Since c(0) = 0 and c(1) = 1/4 - 1/√2 ≤ 0, we simply need to show that c′(z) ≤ 0 on (0, 1) to get that c is decreasing on [0, 1], and hence nonpositive on [0, 1]. Indeed, c′(z) = 1/4 + 1/(2√(4 - 2z)) - 1/(2√(1 - z)) ≤ 0 since (1 - z)^{-1/2} - (4 - 2z)^{-1/2} ≥ 1/2 for all z ∈ (0, 1).

The following technical lemma, due to Mustafa & Davidson (1994), is used in constructing the finite learning rate ratio.

Lemma B.2 (Lemma 15, Mustafa & Davidson 1994). Let V, Z ∈ R^{p×p}, W ∈ R^{p×q}, X ∈ R^{q×p} and Y ∈ R^{q×q}. If V and Y - XV^{-1}W are non-singular, then

det[V + Z, W; X, Y] = det(V) det(Y - XV^{-1}W) det(I + V^{-1}(I + W(Y - XV^{-1}W)^{-1}XV^{-1})Z).

For completeness (and because there is a typo in the original manuscript), we provide the proof here.

Proof. Suppose that V and Y - XV^{-1}W are non-singular so that the partial Schur decomposition

[V, W; X, Y] = [V, 0; X, Y - XV^{-1}W][I, V^{-1}W; 0, I]

holds, and

det[V, W; X, Y] = det(V) det(Y - XV^{-1}W). (10)

Further,

[V, W; X, Y]^{-1} = [I, -V^{-1}W; 0, I][V^{-1}, 0; -(Y - XV^{-1}W)^{-1}XV^{-1}, (Y - XV^{-1}W)^{-1}].

Applying the determinant operator, we have that

det[V + Z, W; X, Y] = det[V, W; X, Y] det([I, 0; 0, I] + [V, W; X, Y]^{-1}[Z, 0; 0, 0]) (11)

so that

det([I, 0; 0, I] + [V, W; X, Y]^{-1}[Z, 0; 0, 0]) = det(V^{-1}(I + W(Y - XV^{-1}W)^{-1}XV^{-1})Z + I). (12)

Combining (10) with (12) in (11) gives exactly the claimed result.

The following lemma is Theorem 2 of Lancaster & Tismenetsky (1985, Chap. 13.1). We use this lemma several times in the proofs of Theorems 1 and 2, so we include it here for ease of reference.
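Both lemmas are easy to sanity-check numerically; the sketch below (our own, with arbitrary random blocks) verifies the inequality of Lemma B.1 on a grid and the determinant identity of Lemma B.2:

```python
import numpy as np

# Lemma B.1: c(z) = sqrt(1 - z) + z/4 - sqrt(1 - z/2) <= 0 on [0, 1].
z = np.linspace(0.0, 1.0, 10001)
c = np.sqrt(1 - z) + z / 4 - np.sqrt(1 - z / 2)
print(bool(c.max() <= 1e-12))  # True

# Lemma B.2 with random blocks (p = 3, q = 2); the identity shifts keep
# V and the Schur complement S well away from singularity.
rng = np.random.default_rng(1)
p, q = 3, 2
V = rng.standard_normal((p, p)) + 3 * np.eye(p)
Z = rng.standard_normal((p, p))
W = rng.standard_normal((p, q))
X = rng.standard_normal((q, p))
Y = rng.standard_normal((q, q)) + 3 * np.eye(q)

S = Y - X @ np.linalg.inv(V) @ W          # Schur complement
lhs = np.linalg.det(np.block([[V + Z, W], [X, Y]]))
rhs = (np.linalg.det(V) * np.linalg.det(S)
       * np.linalg.det(np.eye(p) + np.linalg.inv(V)
                       @ (np.eye(p) + W @ np.linalg.inv(S) @ X @ np.linalg.inv(V)) @ Z))
print(bool(np.isclose(lhs, rhs)))
```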
For a given matrix A, υ_+(A), υ_-(A), and ζ(A) are the number of eigenvalues of A with positive, negative and zero real parts, respectively.

Lemma B.3. Consider a matrix A ∈ R^{n×n}. (a) If P is a symmetric matrix such that AP + PA^T = Q where Q = Q^T > 0, then P is non-singular and P and A have the same inertia, meaning that

υ_+(A) = υ_+(P), υ_-(A) = υ_-(P), ζ(A) = ζ(P). (13)

(b) On the other hand, if ζ(A) = 0, then there exist a matrix P = P^T and a matrix Q = Q^T > 0 such that AP + PA^T = Q and P and A have the same inertia ((13) holds).

C PROOF OF THEOREM 1: STABILITY OF τ-GDA

To prove Theorem 1 and Corollary C.1, we introduce some techniques that are arguably new to the machine learning and artificial intelligence communities. The first is the notion of a guard map. A guard map can be used to provide a certificate of a particular behavior for a dynamical system as a parameter (or parameters) varies. A critical point of a dynamical system is known to be stable if the spectrum of the Jacobian at the critical point lies in the open left-half complex plane, denoted C°_-. Hence, we construct a guard map as a function of τ and show that it guards C°_-. Specifically, we show that the existence of a τ* ∈ (0, ∞) such that ν(τ*) = 0 and ν(τ) ≠ 0 for all τ ∈ (τ*, ∞) is equivalent to S_1(J(x*)) > 0 and -D_2²f(x*) > 0, where S_1(J(x*)) = S_1(J_τ(x*)) = D_1²f(x*) - D_12 f(x*)(D_2²f(x*))^{-1}D_21 f(x*). Towards this end, we need to introduce some notation as well as formal definitions of important concepts such as the guard map.

C.1 NOTATION AND PRELIMINARIES

Given a matrix A ∈ R^{n_1×n_2}, let vec(A) ∈ R^{n_1 n_2} be the vectorization of A. We use the convention that rows are transposed and stacked in order; that is, if a_i^T denotes the i-th row of A, then vec(A) is the column vector (a_1^T, . . . , a_{n_1}^T)^T. Let ⊗ and ⊕ denote the Kronecker product and Kronecker sum respectively. Recall that A ⊕ B = A ⊗ I + I ⊗ B. We also define a less common operator, ⊡, which generates an ½n(n+1) × ½n(n+1) matrix from a matrix A ∈ R^{n×n} via A ⊡ A = H_n^+(A ⊕ A)H_n, where H_n^+ = (H_n^T H_n)^{-1}H_n^T is the (left) pseudo-inverse of H_n, a full column rank duplication matrix. A duplication matrix H_n ∈ R^{n²×n(n+1)/2} is a clever linear algebra tool mapping the ½n(n+1)-dimensional half-vectorization of a symmetric matrix to its n²-dimensional vectorization; it is designed to respect the vectorization map vec(•). In particular, if vech(X) is the half-vectorization of any symmetric matrix X ∈ R^{n×n}, then vec(X) = H_n vech(X) and vech(X) = H_n^+ vec(X). Given a square matrix A, let λ_max^+(A) be the largest positive real eigenvalue of A; if A does not have a positive real eigenvalue, we set λ_max^+(A) = 0.

Guardian maps. The use of guardian maps for studying stability of parameterized families of dynamical systems was arguably introduced by Saydy et al. (1990). Guardian or guard maps act as a certificate for a performance criterion such as stability. Formally, let X be the set of all n × n real matrices or the set of all polynomials of degree n with real coefficients. Consider an open subset S of X with closure S̄ and boundary ∂S.

Definition C.1. The map ν : X → C is said to be a guardian map for S if for all x ∈ S̄, ν(x) = 0 ⇐⇒ x ∈ ∂S.

Consider an open subset Ω of the complex plane that is symmetric with respect to the real axis. Then, elements of S(Ω) = {A ∈ R^{n×n} : spec(A) ⊂ Ω} are said to be stable relative to Ω.
The following result gives a necessary and sufficient condition for stability of parameterized families of matrices relative to some open subset of the complex plane.

Proposition C.1 (Proposition 1, Saydy et al. 1990; Theorem 2, Abed et al. 1990). Let U be a pathwise connected subset of R and A(τ) ∈ R^{n×n} a matrix which depends continuously on τ. Let S(Ω) be guarded by the map ν. The family {A(τ) : τ ∈ U} is stable relative to Ω if and only if (i) it is nominally stable, meaning A(τ_1) ∈ S(Ω) for some τ_1 ∈ U, and (ii) ν(A(τ)) ≠ 0 for all τ ∈ U.

In proving Theorem 1, we define a guard map for the space of n × n Hurwitz stable matrices, denoted S(C°_-).

Lemma C.1. The map ν : A → det(A ⊡ A) guards the set of non-singular n × n Hurwitz stable matrices S(C°_-).

Proof. This follows from the following observation: for A ∈ R^{n×n},

vech(AX + XA^T) = H_n^+ vec(AX + XA^T) = H_n^+(A ⊕ A)vec(X) = H_n^+(A ⊕ A)H_n vech(X),

from which it can be shown that the eigenvalues of A ⊡ A are λ_i + λ_j for 1 ≤ j ≤ i ≤ n, where λ_i, i = 1, . . . , n, are the eigenvalues of A. Indeed, let S be a non-singular matrix such that S^{-1}AS = M where M is upper triangular with λ_1, . . . , λ_n on its diagonal. Observe that for any n × n matrix P, H_n H_n^+(P ⊗ P)H_n = (P ⊗ P)H_n and H_n^+(P ⊗ P)H_n H_n^+ = H_n^+(P ⊗ P). Hence, using properties of the Kronecker product (namely, that (A_1 ⊗ A_2)(B_1 ⊗ B_2) = (A_1 B_1 ⊗ A_2 B_2)), we have that

H_n^+(S^{-1} ⊗ S^{-1})H_n H_n^+(I ⊗ A + A ⊗ I)H_n H_n^+(S ⊗ S)H_n = H_n^+(I ⊗ M + M ⊗ I)H_n,

so that the spectra of H_n^+(I ⊗ A + A ⊗ I)H_n and H_n^+(I ⊗ M + M ⊗ I)H_n coincide. Now, since M is upper triangular, H_n^+(I ⊗ M + M ⊗ I)H_n is upper triangular with diagonal elements λ_i + λ_j (1 ≤ j ≤ i ≤ n), which can be verified by direct computation using the definition of H_n. This implies that λ_i + λ_j (1 ≤ j ≤ i ≤ n) are exactly the eigenvalues of H_n^+(I ⊗ A + A ⊗ I)H_n.
We note that there are several other guard maps for the space of Hurwitz stable matrices, including ν : A → det(A ⊕ A). To give some intuition for this map, it is fairly straightforward to see that the Kronecker sum A ⊕ A = A ⊗ I + I ⊗ A has spectrum {λ_i + λ_j} where λ_i, λ_j ∈ spec(A). The operator A ⊡ A is simply a more computationally efficient expression of A ⊕ A, and as such the eigenvalues of A ⊡ A are those of A ⊕ A with redundancies removed. We use A ⊡ A specifically because of its computational advantages in computing τ*.
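The spectral property underlying Lemma C.1, spec(A ⊡ A) = {λ_i + λ_j : 1 ≤ j ≤ i ≤ n}, can be verified numerically by constructing the duplication matrix explicitly (our own sketch; the ordering of the vech entries is an implementation choice):

```python
import numpy as np
from itertools import combinations_with_replacement

def duplication(n):
    # H_n with vec(X) = H_n vech(X) for symmetric X; columns are indexed
    # by the upper-triangular positions (i, j), i <= j.
    H = np.zeros((n * n, n * (n + 1) // 2))
    for col, (i, j) in enumerate(combinations_with_replacement(range(n), 2)):
        H[i * n + j, col] = 1.0
        H[j * n + i, col] = 1.0
    return H

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
H = duplication(n)
Hp = np.linalg.inv(H.T @ H) @ H.T                       # left pseudo-inverse
box = Hp @ (np.kron(A, np.eye(n)) + np.kron(np.eye(n), A)) @ H  # A "box" A

lam = np.linalg.eigvals(A)
expected = np.array([lam[i] + lam[j] for i, j in
                     combinations_with_replacement(range(n), 2)])
got = np.linalg.eigvals(box)
# Every pairwise sum lambda_i + lambda_j (i <= j) appears in spec(box).
err = max(np.min(np.abs(got - e)) for e in expected)
print(err < 1e-8)  # True
```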

C.2 PROOF OF THEOREM 1

We first prove that if x* is a differential Stackelberg equilibrium (that is, S_1(J_τ(x*)) > 0 and -D_2²f(x*) > 0), then there exists a finite τ* ∈ (0, ∞) such that for all τ ∈ (τ*, ∞), x* is locally exponentially stable for ẋ = -Λ_τ g(x) (that is, spec(-J_τ(x*)) ⊂ C°_-). Towards this end, we construct a guard map for the space of n × n Hurwitz stable matrices and explicitly construct τ* using it. Then we prove the other direction: if there exists a finite τ* ∈ (0, ∞) such that for all τ ∈ (τ*, ∞), x* is exponentially stable for ẋ = -Λ_τ g(x), then x* is a differential Stackelberg equilibrium. We prove this by contradiction.

C.2.1 PROOF THAT IF x* IS A DIFFERENTIAL STACKELBERG THEN A FINITE τ* EXISTS

For a critical point x*, let

-J_τ(x*) = [-D_1²f(x*), -D_12 f(x*); τ D_21 f(x*), τ D_2²f(x*)] = [A_11, A_12; -τA_12^T, τA_22]

and define S_1 = S_1(-J_τ(x*)) = A_11 + A_12 A_22^{-1} A_12^T. Note that this is equivalent to the first Schur complement of -J(x*) (i.e., when τ = 1) since the τ and τ^{-1} cancel, and by assumption the first Schur complement of J(x*) is positive definite. Suppose that x* is a differential Stackelberg equilibrium so that -S_1 > 0 and -A_22 > 0.

Polynomial guard map with family of matrices parameterized by τ. By Lemma C.1, ν : A → det(A ⊡ A) is a guard map for S(C°_-). Indeed, using the fact that the determinant is the product of the eigenvalues of a matrix and the fact that spec(A ⊡ A) = {λ_i + λ_j : 1 ≤ j ≤ i ≤ n, λ_i, λ_j ∈ spec(A)}, we have that

det(A ⊡ A) = ∏_{1≤j≤i≤n} (λ_i + λ_j),

where each complex conjugate pair of eigenvalues λ_i, λ̄_i contributes the factors 2Re(λ_i)(4Re²(λ_i) + 4Im²(λ_i)), and the remaining factors are λ_i + λ_j over pairs i < j with λ_j ≠ λ̄_i. Hence, for A in the closure of S(C°_-), det(A ⊡ A) = 0 if and only if A ⊡ A is singular, if and only if A has a zero or purely imaginary eigenvalue, that is, if and only if A ∈ ∂S(C°_-). Now, consider the parameterized family of matrices -J_τ(x*), parameterized by τ. By an abuse of notation, let ν(τ) = det((-J_τ(x*)) ⊡ (-J_τ(x*))).
If we consider the subset of this family of matrices that lies in S(C°_-) (this subset could a priori be empty, though we show it is not), then for any τ such that -J_τ(x*) is in this subset, we have that ν(τ) = 0 if and only if (-J_τ(x*)) ⊡ (-J_τ(x*)) is singular, if and only if -J_τ(x*) ∈ ∂S(C°_-). Hence, ν(τ) = det((-J_τ(x*)) ⊡ (-J_τ(x*))) guards S(C°_-). In particular, if we envision -J_τ(x*) as the input to ν : A → det(A ⊡ A) and simply vary τ (holding all the entries of -J_τ(x*) otherwise fixed), then ν : τ → det((-J_τ(x*)) ⊡ (-J_τ(x*))) can be thought of simply as a function of τ which guards the set of Hurwitz stable matrices via the reasoning described above. Indeed, slightly overloading the notation for ν,

ν(τ) := ν_0 + ν_1 τ + ⋯ + ν_{p-1} τ^{p-1} + ν_p τ^p = ν(-J_τ(x*)).

Hence, for intuition, observe that as τ decreases (towards zero), stability is first lost when at least one eigenvalue of -J_τ(x*) reaches the imaginary axis, at which point ν(τ) = 0. There are two cases to consider.

Case 1: ν(τ) is an identically zero polynomial. In this case, -J_τ(x*) is in the interior of the complement of the set of Hurwitz stable matrices for all values of τ > 0; that is, -J_τ(x*) ∈ int(S^c(C°_-)) for all τ ∈ R_+ = (0, ∞).

Case 2: ν(τ) is not an identically zero polynomial. In this case, ν(τ) has finitely many zeros. If ν(τ) has no positive real roots, then as τ varies in R_+, -J_τ(x*) does not cross ∂S(C°_-), i.e., the boundary of the space of n × n Hurwitz stable matrices. Hence, {-J_τ(x*) : τ ∈ R_+} ⊂ S(C°_-) or {-J_τ(x*) : τ ∈ R_+} ⊂ int(S^c(C°_-)), and it suffices to check whether -J_τ(x*) ∈ S(C°_-) or -J_τ(x*) ∈ int(S^c(C°_-)) for an arbitrary τ ∈ R_+. On the other hand, if ν(τ) has one or more real positive zeros, say 0 < τ_1 < ⋯ < τ_ℓ = τ*, then by Proposition C.1, -J_τ(x*) ∈ S(C°_-) for all τ > τ* if and only if -J_τ(x*) ∈ S(C°_-) for an arbitrarily chosen τ > τ*.
We choose the largest positive root τ_ℓ because we are guaranteed that ν(τ) stops changing sign for τ > τ*; further, the largest neighborhood in R_+ for which -J_τ(x*) ∈ S(C°_-) is (τ*, ∞). Recall that we have assumed that x* is a differential Stackelberg equilibrium (that is, -S_1 > 0 and -A_22 > 0). We show next, by way of explicit construction of τ*, that we are always in Case 2.

Construction of τ*. We note that there are more elegant, simpler constructions, but to our knowledge this construction gives the tightest bound on the range of τ for which -J_τ(x*) is guaranteed to be Hurwitz stable. Recall that

-J_τ(x*) = [-D_1²f(x*), -D_12 f(x*); τ D_21 f(x*), τ D_2²f(x*)] = [A_11, A_12; -τA_12^T, τA_22]

and S_1 = A_11 + A_12 A_22^{-1} A_12^T. Let I_m denote the m × m identity matrix.

Claim C.1. The finite learning rate ratio is τ* = λ_max^+(Q) where

Q = -M_2(A_11 ⊗ I_{n_2} + M_1), (14)

with M_1 and M_2 as defined in the proof below in terms of Ā_22 = A_22 ⊡ A_22 and S̄_1 = S_1 ⊡ S_1.

Proof. Recall that ν(τ) = det((-J_τ(x*)) ⊡ (-J_τ(x*))) is a guard map for S(C°_-). We apply basic properties of the Kronecker product and sum as well as Schur's determinant formula to obtain a reduced form of the guard map. To this end, we have that

(-J_τ(x*)) ⊡ (-J_τ(x*)) = [A_11 ⊡ A_11, 2H_{n_1}^+(I_{n_1} ⊗ A_12), 0; τ(I_{n_1} ⊗ (-A_12^T))H_{n_1}, A_11 ⊕ τA_22, (A_12 ⊗ I_{n_2})H_{n_2}; 0, 2τH_{n_2}^+(-A_12^T ⊗ I_{n_2}), τ(A_22 ⊡ A_22)].

Now, we apply Schur's determinant formula to get that

ν(τ) = τ^{n_2(n_2+1)/2} det(A_22 ⊡ A_22) det([A_11 ⊡ A_11, 2H_{n_1}^+(I_{n_1} ⊗ A_12); τ(I_{n_1} ⊗ (-A_12^T))H_{n_1}, A_11 ⊕ τA_22 + M_1]) (15)

where M_1 = -2(A_12 ⊗ I_{n_2})H_{n_2} Ā_22^{-1} H_{n_2}^+(-A_12^T ⊗ I_{n_2}). From here, we apply Lemma B.2 to further reduce the guard map. First, note that A_11 ⊕ τA_22 = A_11 ⊗ I_{n_2} + I_{n_1} ⊗ τA_22.

Let V = I_{n_1} ⊗ τA_22, Z = A_11 ⊗ I_{n_2} + M_1, Y = A_11 ⊡ A_11, W = -τ(I_{n_1} ⊗ A_12^T)H_{n_1}, and X = 2H_{n_1}^+(I_{n_1} ⊗ A_12). Using the two properties of the Kronecker product (B_1 ⊗ B_2)(B_3 ⊗ B_4) = (B_1 B_3 ⊗ B_2 B_4) and (B_1 ⊗ B_2)^{-1} = (B_1^{-1} ⊗ B_2^{-1}), we have that

Y - XV^{-1}W = A_11 ⊡ A_11 + 2H_{n_1}^+(I_{n_1} ⊗ A_12)(I_{n_1} ⊗ A_22)^{-1}(I_{n_1} ⊗ A_12^T)H_{n_1} (16)
= A_11 ⊡ A_11 + 2H_{n_1}^+(I_{n_1} ⊗ A_12 A_22^{-1}A_12^T)H_{n_1} (17)
= A_11 ⊡ A_11 + H_{n_1}^+((I_{n_1} ⊗ A_12 A_22^{-1}A_12^T) + (A_12 A_22^{-1}A_12^T ⊗ I_{n_1}))H_{n_1} (18)
= S_1 ⊡ S_1, (19)

where (18) holds since H_{n_1}^+(I_{n_1} ⊗ A_12 A_22^{-1}A_12^T)H_{n_1} = H_{n_1}^+(A_12 A_22^{-1}A_12^T ⊗ I_{n_1})H_{n_1}. Now, define V^{-1} + V^{-1}W(Y - XV^{-1}W)^{-1}XV^{-1} = τ^{-1}M_2 where

M_2 = I_{n_1} ⊗ A_22^{-1} - 2(I_{n_1} ⊗ A_22^{-1}A_12^T)H_{n_1} S̄_1^{-1} H_{n_1}^+(I_{n_1} ⊗ A_12 A_22^{-1}),

so that applying Lemma B.2 we have

ν(τ) = τ^{n_2(n_2+1)/2} det(A_22 ⊡ A_22) det(S_1 ⊡ S_1) det(I_{n_1} ⊗ A_22) det(τI_{n_1 n_2} + M_2(A_11 ⊗ I_{n_2} + M_1)). (20)

The assumptions -S_1 > 0 and -A_22 > 0 together imply that det(S_1 ⊡ S_1) ≠ 0 and det(I_{n_1} ⊗ A_22) ≠ 0. Hence, for 0 < τ < ∞, ν(τ) = 0 if and only if det(τI_{n_1 n_2} + M_2(A_11 ⊗ I_{n_2} + M_1)) = 0. This determinant expression is exactly an eigenvalue problem. Since by assumption the Schur complement S_1(J(x*)) and the individual Hessian -D_2²f(x*) are positive definite (that is, x* is a differential Stackelberg equilibrium), the largest positive real root of ν(τ) = 0 is τ* = λ_max^+(-M_2(A_11 ⊗ I_{n_2} + M_1)) = λ_max^+(Q), where λ_max^+(•) is the largest positive real eigenvalue of its argument if one exists and zero otherwise. This claim concludes the proof that if x* is a differential Stackelberg equilibrium, then there exists a finite τ* ∈ [0, ∞) such that for all τ ∈ (τ*, ∞), spec(-J_τ(x*)) ⊂ C°_-.
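A minimal numerical sketch of the construction (our own scalar-block example with n_1 = n_2 = 1, not from the paper): take D_1²f(x*) = -1, D_12 f(x*) = 2, D_2²f(x*) = -1, so x* is a differential Stackelberg equilibrium with S_1(J(x*)) = 3 > 0 and -D_2²f(x*) = 1 > 0. Here trace(-J_τ(x*)) = 1 - τ and det(-J_τ(x*)) = 3τ, so stability is gained exactly at τ* = 1 and kept thereafter:

```python
import numpy as np

# -J_tau for the scalar-block example above: A11 = 1, A12 = -2, A22 = -1.
def neg_J(tau):
    return np.array([[1.0, -2.0],
                     [2.0 * tau, -tau]])

def hurwitz(M):
    return bool(np.all(np.linalg.eigvals(M).real < 0))

# Scanning tau: unstable below tau* = 1, Hurwitz stable above it.
print(hurwitz(neg_J(0.5)), hurwitz(neg_J(1.5)), hurwitz(neg_J(10.0)))
# False True True
```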
C.2.2 PROOF THAT EXISTENCE OF A FINITE τ* IMPLIES THAT x* IS A DIFFERENTIAL STACKELBERG EQUILIBRIUM

To begin, consider a critical point x* such that g(x*) = 0 and det(D_2²f(x*)) ≠ 0. Then, det(-J_τ(x*)) = τ^{n_2} det(D_2²f(x*)) det(-S_1(J(x*))), so that det(-J_τ(x*)) = 0 if and only if det(-S_1(J(x*))) = 0, which implies -J_τ(x*) is not Hurwitz stable for any τ ∈ (0, ∞) when det(-S_1(J(x*))) = 0. As a result, we are left to consider the case det(S_1(J(x*))) ≠ 0 for the remainder of the proof.

We proceed by arguing a contradiction. Let C ≡ D_2²f(x*) and S_1 ≡ S_1(J(x*)) = D_1²f(x*) - D_12 f(x*)(D_2²f(x*))^{-1}D_21 f(x*) have no zero eigenvalues; that is, det(S_1) ≠ 0 and det(C) ≠ 0. Suppose that there exists a τ* ∈ (0, ∞) such that for all τ ∈ (τ*, ∞), spec(-J_τ(x*)) ⊂ C°_-, yet x* is not a differential Stackelberg equilibrium; that is, either -S_1 or C has at least one positive eigenvalue. Without loss of generality, let -S_1 have at least one positive eigenvalue. Since det(S_1) ≠ 0 and det(C) ≠ 0, by Lemma B.3(b), there exist non-singular Hermitian matrices P_1, P_2 and positive definite Hermitian matrices Q_1, Q_2 such that -S_1 P_1 - P_1 S_1 = Q_1 and CP_2 + P_2 C = Q_2. Further, -S_1 and P_1 have the same inertia, meaning υ_+(-S_1) = υ_+(P_1), υ_-(-S_1) = υ_-(P_1), ζ(-S_1) = ζ(P_1), where for a given matrix A, υ_+(A), υ_-(A), and ζ(A) are the number of eigenvalues of A with positive, negative and zero real parts, respectively. Similarly, C and P_2 have the same inertia: υ_+(C) = υ_+(P_2), υ_-(C) = υ_-(P_2), ζ(C) = ζ(P_2). Since -S_1 has at least one strictly positive eigenvalue, υ_+(P_1) = υ_+(-S_1) ≥ 1.

Define

P = [I, L_0^T; 0, I][P_1, 0; 0, P_2][I, 0; L_0, I]

where L_0 = (D_2²f(x*))^{-1}D_21 f(x*) = C^{-1}D_21 f(x*). Since P is congruent to blockdiag(P_1, P_2), by Sylvester's law of inertia (Horn & Johnson, 1985, Thm. 4.5.8), P and blockdiag(P_1, P_2) have the same inertia, meaning that υ_+(P) = υ_+(blockdiag(P_1, P_2)), υ_-(P) = υ_-(blockdiag(P_1, P_2)), and ζ(P) = ζ(blockdiag(P_1, P_2)). Consider the matrix equation -PJ_τ(x*) - J_τ(x*)^T P = Q_τ for -J_τ(x*), where

Q_τ = [I, L_0^T; 0, I] B_τ [I, 0; L_0, I]

with

B_τ = [Q_1, P_1 D_12 f(x*) - S_1 L_0^T P_2; (P_1 D_12 f(x*) - S_1 L_0^T P_2)^T, P_2 L_0 D_12 f(x*) + (P_2 L_0 D_12 f(x*))^T + τQ_2],

which can be verified by straightforward calculation. Observe that Q_τ > 0 is equivalent to B_τ > 0, and both matrices are symmetric, so that B_τ > 0 if and only if Q_1 > 0 and S_2(B_τ) > 0 where

S_2(B_τ) = P_2 L_0 D_12 f(x*) + (P_2 L_0 D_12 f(x*))^T + τQ_2 - (P_1 D_12 f(x*) - S_1 L_0^T P_2)^T Q_1^{-1}(P_1 D_12 f(x*) - S_1 L_0^T P_2).

Now, S_2(B_τ) is also a real symmetric matrix, and hence it is positive definite if and only if all its eigenvalues are positive. To determine the range of τ such that S_2(B_τ) is positive definite, we can formulate an eigenvalue problem for the value of τ at which the matrix S_2(B_τ) becomes singular. This is analogous to the guard map approach used in the previous subsection for the other direction of the proof; here we vary τ from zero to infinity and find the point such that for all larger τ, S_2(B_τ) is positive definite. Intuitively, such an argument works since τ scales the positive definite matrix Q_2. Towards this end, consider the eigenvalue problem in τ given by

0 = det(τI - Q_2^{-1}((P_1 D_12 f(x*) - S_1 L_0^T P_2)^T Q_1^{-1}(P_1 D_12 f(x*) - S_1 L_0^T P_2) - P_2 L_0 D_12 f(x*) - (P_2 L_0 D_12 f(x*))^T)).

Let τ_0 be the maximum positive real eigenvalue, and zero otherwise.
Then, since eigenvalues vary continuously, for all τ ∈ (τ_0, ∞), Q_τ > 0, so that by Lemma B.3(a) we conclude that P and -J_τ(x*) have the same inertia; but this contradicts the stability of -J_τ(x*) for all τ ∈ (τ*, ∞) since υ_+(P) ≥ 1.

Remark on the tightness of τ*. The construction of τ* is tight in the following sense. While it is possible to construct multiple guard maps for a domain, all guard maps have the same positive real roots by definition (Saydy, 1996, Remark 2). Hence, independent of the guard map choice, we will get the same value of τ*. Moreover, τ* tells us exactly when the eigenvalues of -J_τ(x*) move into the open left-half complex plane C°_- and remain there. Hence, this gives a precise, tight lower bound on the range of timescale separations for which local convergence is guaranteed.

C.3 τ-GDA PROVABLY CONVERGES TO DIFFERENTIAL STACKELBERG EQUILIBRIA

As a corollary to Theorem 1, we first show that the discrete time τ-GDA update is locally asymptotically stable for a range of learning rates γ_1.

Corollary C.1 (Asymptotic convergence of τ-GDA). Suppose the assumptions of Theorem 1 hold so that x* is a critical point of g and S_1(J(x*)) and D_2²f(x*) are non-singular. There exists a τ* ∈ (0, ∞) such that τ-GDA with γ_1 ∈ (0, γ̄(τ)), where γ̄(τ) = min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|², converges locally asymptotically for all τ ∈ (τ*, ∞) if and only if x* is a differential Stackelberg equilibrium.

This corollary to Theorem 1 follows immediately from Lemma F.1 given in Appendix F.

Proof. Suppose that x* is a differential Stackelberg equilibrium so that by Theorem 1, there exists a τ* ∈ (0, ∞) such that spec(-J_τ(x*)) ⊂ C°_- for all τ ∈ (τ*, ∞). Now that we have a guarantee that -J_τ(x*) is Hurwitz stable for any τ ∈ (τ*, ∞), we apply Hartman-Grobman to get that the nonlinear system ẋ = -Λ_τ g(x) is stable in a neighborhood of x*. Fix any τ ∈ (τ*, ∞) and let γ̄ = min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|². Then, applying Lemma F.1, for any γ_1 ∈ (0, γ̄), τ-GDA converges locally asymptotically to x*. On the other hand, suppose that there exists a τ* ∈ (0, ∞) such that spec(-J_τ(x*)) ⊂ C°_- for all τ ∈ (τ*, ∞). Then by Theorem 1, x* is a differential Stackelberg equilibrium. Furthermore, since spec(-J_τ(x*)) ⊂ C°_- for all τ ∈ (τ*, ∞), if we let γ̄ = min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|², then by Lemma F.1, τ-GDA converges locally asymptotically to x* for any choice of γ_1 ∈ (0, γ̄).
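A step-size sketch for the corollary with our own numbers: for a hypothetical scalar-block game with D_1²f = -1, D_12 f = 2, D_2²f = -1 and τ = 2, the matrix J_τ(x*) = [-1, 2; -4, 2] has spectrum 0.5 ± i√23/2, so γ̄(τ) = 2(0.5)/6 = 1/6, and the discrete update matrix I - γ_1 J_τ(x*) is a contraction exactly for γ_1 below that threshold:

```python
import numpy as np

J_tau = np.array([[-1.0, 2.0],
                  [-4.0, 2.0]])   # assumed J_tau(x*); -J_tau is Hurwitz
lam = np.linalg.eigvals(J_tau)
assert np.all(lam.real > 0)

gamma_bar = min(2 * l.real / abs(l) ** 2 for l in lam)   # = 1/6 here

def rho(g):
    # Spectral radius of the linearized tau-GDA update I - g*J_tau.
    return max(abs(np.linalg.eigvals(np.eye(2) - g * J_tau)))

print(np.isclose(gamma_bar, 1 / 6))   # True
print(rho(0.1) < 1, rho(0.2) < 1)     # True False
```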

D PROOF OF THEOREM 2: INSTABILITY OF τ-GDA

To begin, consider a critical point x* (defined by g(x*) = 0) which is not a differential Stackelberg equilibrium and for which det(D_2²f(x*)) ≠ 0. Then, det(-J_τ(x*)) = τ^{n_2} det(D_2²f(x*)) det(-S_1(J(x*))), so that det(-J_τ(x*)) = 0 if and only if det(-S_1(J(x*))) = 0, which implies -J_τ(x*) is unstable for all τ ∈ (τ_0, ∞) with τ_0 = 0 when det(-S_1(J(x*))) = 0. We consider the case det(S_1(J(x*))) ≠ 0 for the remainder of the proof.

Let x* be a stable critical point of 1-GDA (without loss of generality) which is not a differential Stackelberg equilibrium, and suppose, without loss of generality, that S_1(-J(x*)) has at least one strictly positive eigenvalue. Note that both S_1(-J(x*)) and -D_2²f(x*) are symmetric matrices and hence have purely real eigenvalues. Since both S_1(-J(x*)) and D_2²f(x*) have no zero eigenvalues, by Lemma B.3(b), there exist non-singular Hermitian matrices P_1, P_2 and positive definite Hermitian matrices Q_1, Q_2 such that S_1(-J(x*))P_1 + P_1 S_1(-J(x*)) = Q_1 and D_2²f(x*)P_2 + P_2 D_2²f(x*) = Q_2. Further, S_1(-J(x*)) and P_1 have the same inertia, meaning υ_+(S_1(-J(x*))) = υ_+(P_1), υ_-(S_1(-J(x*))) = υ_-(P_1), ζ(S_1(-J(x*))) = ζ(P_1), where for a given matrix A, υ_+(A), υ_-(A), and ζ(A) are the number of eigenvalues of A with positive, negative and zero real parts, respectively. Similarly, D_2²f(x*) and P_2 have the same inertia: υ_+(D_2²f(x*)) = υ_+(P_2), υ_-(D_2²f(x*)) = υ_-(P_2), ζ(D_2²f(x*)) = ζ(P_2). Since S_1(-J(x*)) has at least one strictly positive eigenvalue, υ_+(P_1) = υ_+(S_1(-J(x*))) ≥ 1.

Define

P = [I, L_0^T; 0, I][P_1, 0; 0, P_2][I, 0; L_0, I]

where L_0 = (D_2²f(x*))^{-1}D_21 f(x*). Since P is congruent to blockdiag(P_1, P_2), by Sylvester's law of inertia (Horn & Johnson, 1985, Thm. 4.5.8), P and blockdiag(P_1, P_2) have the same inertia, meaning that υ_+(P) = υ_+(blockdiag(P_1, P_2)), υ_-(P) = υ_-(blockdiag(P_1, P_2)), and ζ(P) = ζ(blockdiag(P_1, P_2)). Consider now the Lyapunov equation -PJ_τ(x*) - J_τ(x*)^T P = Q_τ for -J_τ(x*), where

Q_τ = [I, L_0^T; 0, I] B_τ [I, 0; L_0, I]

with

B_τ = [Q_1, P_1 D_12 f(x*) + S_1(-J(x*))L_0^T P_2; (P_1 D_12 f(x*) + S_1(-J(x*))L_0^T P_2)^T, P_2 L_0 D_12 f(x*) + (P_2 L_0 D_12 f(x*))^T + τQ_2],

which can be verified by straightforward calculation. Since υ_+(P_1) ≥ 1, we have that υ_+(P) ≥ 1. Now, we find the value of τ_0 such that for all τ > τ_0, Q_τ > 0, so that, in turn, we can apply Lemma B.3(a) to conclude that P and -J_τ(x*) have the same inertia and hence that spec(-J_τ(x*)) ⊄ C°_-. Indeed, observe that Q_τ > 0 is equivalent to B_τ > 0, and both matrices are symmetric, so that B_τ > 0 if and only if Q_1 > 0 and S_2(B_τ) > 0 where

S_2(B_τ) = P_2 L_0 D_12 f(x*) + (P_2 L_0 D_12 f(x*))^T + τQ_2 - (P_1 D_12 f(x*) + S_1(-J(x*))L_0^T P_2)^T Q_1^{-1}(P_1 D_12 f(x*) + S_1(-J(x*))L_0^T P_2).

Now, S_2(B_τ) is also a real symmetric matrix, and hence it is positive definite if and only if all its eigenvalues are positive. To determine the range of τ for which Q_τ > 0, we simply need to solve the eigenvalue problem

0 = det(τI - Q_2^{-1}((P_1 D_12 f(x*) + S_1(-J(x*))L_0^T P_2)^T Q_1^{-1}(P_1 D_12 f(x*) + S_1(-J(x*))L_0^T P_2) - P_2 L_0 D_12 f(x*) - (P_2 L_0 D_12 f(x*))^T))

and extract the maximum positive real eigenvalue (taking τ_0 = 0 if none exists), namely,

τ_0 = λ_max^+(Q_2^{-1}((P_1 D_12 f(x*) + S_1(-J(x*))L_0^T P_2)^T Q_1^{-1}(P_1 D_12 f(x*) + S_1(-J(x*))L_0^T P_2) - P_2 L_0 D_12 f(x*) - (P_2 L_0 D_12 f(x*))^T)).

Hence, as noted above, by Lemma B.3(a) we conclude that for all τ ∈ (τ_0, ∞), P and -J_τ(x*) have the same inertia; since υ_+(P) ≥ 1, it follows that spec(-J_τ(x*)) ⊄ C°_-, i.e., x* is unstable for τ-GDA. This concludes the proof.
Additional context/intuition for the proof approach. To provide some context for the proof approach, we remark that it follows the same idea as the proof of Theorem 1 in Appendix C.2.2. Indeed, to determine the range of τ such that S 2 (B τ ) is positive definite, we can formulate an eigenvalue problem to determine the value of τ such that the matrix S 2 (B τ ) becomes singular. We vary τ from zero to infinity in order to find the point such that for all larger τ , S 2 (B τ ) is positive definite. Intuitively, such an argument works since τ scales the positive definite matrix Q 2 .
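For additional intuition, here is a scalar-block sketch (our own example, not from the paper) of the non-convergence phenomenon: take D_1²f(x*) = -2, D_12 f(x*) = 1, D_2²f(x*) = -1, so S_1(J(x*)) = -1 < 0 and x* is not a differential Stackelberg equilibrium even though -D_2²f(x*) = 1 > 0. No amount of timescale separation stabilizes x* (here τ_0 = 0):

```python
import numpy as np

# -J_tau for the scalar-block example above.
def neg_J(tau):
    return np.array([[2.0, -1.0],
                     [tau, -tau]])

# det(-J_tau) = -tau < 0, so -J_tau always has a real eigenvalue with
# positive real part: gradient descent-ascent avoids x* for every tau > 0.
unstable = [bool(np.any(np.linalg.eigvals(neg_J(t)).real > 0))
            for t in (0.5, 2.0, 50.0)]
print(unstable)  # [True, True, True]
```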

E PROOF OF THEOREM 3: STABILITY OF τ-GDA IN REGULARIZED GANS

As in Mescheder et al. (2018), we only apply the regularization to the discriminator. In the following proof, we use ∇_x(•) to denote the partial gradient with respect to x of the argument when the argument is the discriminator D(•; ω), in order to prevent any confusion with the notation D(•), which we use elsewhere for derivatives. To prove the first part of this result, we follow arguments similar to Theorem 4.1 of (Mescheder et al., 2018). To prove the second part, we leverage the concept of the quadratic numerical range. For both components of the proof, we use the following form of the Jacobian of the regularized game. Indeed, first observe that the structural form of J_(τ,μ)(x*) is

J_(τ,μ)(x*) = [0, B; -τB^T, τ(C + μR)] (22)

where B = D_12 f(x*), C = -D_2²f(x*) and R = D_2²R_j(x*), with R_j either gradient penalty, indexed by j ∈ {1, 2}. This follows from Assumption 1-a, which implies that D(x; ω*) = 0 in some neighborhood of supp(p_D), and hence ∇_x D(x; ω*) = 0 and ∇²_x D(x; ω*) = 0 for x ∈ supp(p_D). In turn, we have that D_1²f(x*) = 0.

Proof that x* = (θ*, ω*) is a differential Stackelberg equilibrium. Fix any μ ∈ (0, ∞). We first observe that x* is also a critical point of the unregularized dynamics. Indeed, by Assumption 1-a, D(x; ω*) = 0 in some neighborhood of supp(p_D), and hence ∇_x D(x; ω*) = 0 and ∇²_x D(x; ω*) = 0 for x ∈ supp(p_D). Further, D_2 R_j(θ, ω) = μE_{p_j(x)}[D_2(∇_x D(x; ω))∇_x D(x; ω)] for j = 1, 2, where p_1(x) = p_D(x) and p_2(x) = p_θ(x). Thus, using the above observation that ∇_x D(x; ω*) = 0, we have that D_2 R_j(θ*, ω*) = 0 for j = 1, 2, meaning that the derivative of the regularizer with respect to ω is zero at x* = (θ*, ω*), which in turn implies that D_1 f(x*) = 0 and -D_2 f(x*) = 0. Hence, x* is a critical point of the unregularized dynamics as claimed. Further, C + μR > 0, which follows from Lemma D.5 in (Mescheder et al., 2018).
From Lemma D.6 in Mescheder et al. (2018), due to Assumption 1-c., if v ≠ 0 and v ∉ T_θ* M_G, then Bv ≠ 0, which implies that B can only be rank deficient on T_θ* M_G. Using this fact along with the structure of the Jacobian in (22), the Schur complement of J_(τ,µ)(x*) is equal to B(C + µR)^{-1}B^⊤ > 0 since C + µR > 0. Hence, x* = (θ*, ω*) is a differential Stackelberg equilibrium.

Proof of stability. Examining (22), it is straightforward to see that the quadratic numerical range W²(J_(τ,µ)) has elements of the form

λ_(τ,µ) = (1/2)τ(c + µr) ± (1/2)√( (τ(c + µr))² − 4τ|b|² )

where b = ⟨D_12 f(x*)v, w⟩, c = −⟨D_2^2 f(x*)w, w⟩, and r = ⟨D_2^2 R_j(x*)w, w⟩ for vectors v ∈ W_1 ∩ (T_θ* M_G)^⊥ and w ∈ W_2 ∩ (T_ω* M_D)^⊥, where U^⊥ denotes the orthogonal complement of U. We claim that for any value of µ ∈ (0, ∞) and any τ ∈ (0, ∞), Re(λ_(τ,µ)) > 0. Indeed, we argue this by considering the two possible cases: (1) (τ(c + µr))² ≤ 4τ|b|²; (2) (τ(c + µr))² > 4τ|b|².
• Case 1: Suppose that (τ(c + µr))² ≤ 4τ|b|². Then Re(λ_(τ,µ)) = (1/2)τ(c + µr) > 0 trivially since c + µr > 0.
• Case 2: Suppose that (τ(c + µr))² > 4τ|b|². In this case, we need Re(λ_(τ,µ)) ≥ (1/2)τ(c + µr) − (1/2)√( (τ(c + µr))² − 4τ|b|² ) > 0, which holds since (τ(c + µr))² > (τ(c + µr))² − 4τ|b|² ⟺ 0 > −4τ|b|².
This concludes the proof.
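As a numerical sanity check on this stability argument (our own illustration, not part of the original proof), one can assemble the Jacobian in (22) for a randomly drawn full-rank cross term B and positive definite C + µR, and confirm that the spectrum stays in the open right half plane for every tested (τ, µ):

```python
import numpy as np

rng = np.random.default_rng(0)

def regularized_jacobian(B, C, R, tau, mu):
    """Assemble J_(tau,mu) = [[0, B], [-tau*B^T, tau*(C + mu*R)]] as in (22)."""
    n1, n2 = B.shape
    top = np.hstack([np.zeros((n1, n1)), B])
    bottom = np.hstack([-tau * B.T, tau * (C + mu * R)])
    return np.vstack([top, bottom])

n1 = n2 = 3
B = rng.standard_normal((n1, n2))   # stand-in for D_12 f(x*), full rank a.s.
M = rng.standard_normal((n2, n2))
C = M @ M.T + np.eye(n2)            # stand-in for C = -D_2^2 f(x*) > 0
N = rng.standard_normal((n2, n2))
R = N @ N.T                         # stand-in regularizer Hessian, PSD

# Re(spec(J_(tau,mu))) > 0 for every tau, mu > 0, matching the case analysis.
for tau in [0.1, 1.0, 10.0]:
    for mu in [0.01, 1.0, 100.0]:
        eigs = np.linalg.eigvals(regularized_jacobian(B, C, R, tau, mu))
        assert eigs.real.min() > 0, (tau, mu)
```

The matrices above are arbitrary stand-ins for the blocks of (22); the check mirrors the case analysis rather than reproving it.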

E.1 NECESSARY CONDITIONS ON GAN ARCHITECTURE

The following proposition provides necessary conditions on the sizes of the discriminator and generator network architectures for stability. Theorem A.7 of Mescheder et al. (2018) shows that matrices of the form

−J = [ 0, −B ; B^⊤, −C ]    (23)

are stable if B is full rank and C > 0.

Proposition E.1. Consider training a generative adversarial network via a zero-sum game with generator network G_θ, discriminator network D_ω, and loss f(θ, ω) with regularization R_j(θ, ω) (for some j ∈ {1, 2}) such that Assumption 1 is satisfied for an equilibrium x* = (θ*, ω*). Independent of the learning rate ratio and the regularization parameter µ, for x* to be stable it is necessary that the dimension of the discriminator network parameter vector is at least half as large as that of the generator network parameter vector: n_2 ≥ n_1/2, where θ ∈ R^{n_1} and ω ∈ R^{n_2}.

The intuition for why this proposition should hold follows immediately from the structure of the Jacobian: for any matrix of the form (23), at least one eigenvalue will be purely imaginary if n_2 < n_1/2, where B ∈ R^{n_1×n_2} and C ∈ R^{n_2×n_2}. Indeed, by Lyapunov's stability theorem for linear systems (Hespanha, 2018, Theorem 8.2), a matrix A is Hurwitz stable if and only if for every symmetric positive definite Q = Q^⊤ > 0, there exists a unique symmetric positive definite P = P^⊤ > 0 such that A^⊤P + PA = −Q.
Hence, −J is Hurwitz stable if and only if there exists a P = P^⊤ > 0 such that

0 < Q = [ 0, −B ; B^⊤, C ][ P_1, P_2 ; P_2^⊤, P_3 ] + [ P_1, P_2 ; P_2^⊤, P_3 ][ 0, B ; −B^⊤, C ]
      = [ −BP_2^⊤ − P_2B^⊤,  −BP_3 + P_1B + P_2C ; B^⊤P_1 + CP_2^⊤ − P_3B^⊤,  B^⊤P_2 + CP_3 + P_2^⊤B + P_3C ].

Since this is a symmetric positive definite matrix, the block diagonal components must also be symmetric positive definite, so that −BP_2^⊤ − P_2B^⊤ > 0. Recall that B ∈ R^{n_1×n_2} and P_2 ∈ R^{n_1×n_2}. Hence, a necessary condition for this matrix to be positive definite is that n_2 ≥ n_1/2: the matrix −BP_2^⊤ − P_2B^⊤ has rank at most 2n_2, so it can only have full rank n_1 if n_2 ≥ n_1/2. Of course this is not sufficient, but it is necessary. It is easy to see that this argument is independent of whether a learning rate ratio τ or regularization is incorporated.
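The rank argument can be checked numerically (a sketch of our own; the dimensions and matrices are arbitrary). When n_2 < n_1/2, the matrix in (23) necessarily has spectrum touching the imaginary axis, while for a square full-rank B with C > 0 it is Hurwitz:

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_jacobian(B, C):
    """-J = [[0, -B], [B^T, -C]] as in (23)."""
    n1, n2 = B.shape
    return np.block([[np.zeros((n1, n1)), -B], [B.T, -C]])

# Unstable case: n2 = 1 < n1/2 = 2. Any v with B^T v = 0 gives a kernel
# direction (v, 0), so 0 is an eigenvalue and -J cannot be Hurwitz.
B = rng.standard_normal((4, 1))
C = np.array([[2.0]])
eigs = np.linalg.eigvals(neg_jacobian(B, C))
assert np.min(np.abs(eigs.real)) < 1e-8   # an eigenvalue sits on the axis

# Stable contrast: square full-rank B and C > 0 (cf. Mescheder et al., Thm. A.7).
B = rng.standard_normal((3, 3))
M = rng.standard_normal((3, 3))
C = M @ M.T + np.eye(3)
eigs = np.linalg.eigvals(neg_jacobian(B, C))
assert np.max(eigs.real) < 0              # Hurwitz stable
```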

F PROOF OF HELPER LEMMAS AND THEOREM 4 FOR τ -GDA CONVERGENCE

In this appendix section, we prove the following two lemmas. The proofs build on one another, so we state them jointly here. We then invoke them to prove Theorem 4.

Lemma F.1. Consider a zero-sum game (f_1, f_2) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2. Suppose that x* is a differential Stackelberg equilibrium and that, given τ > 0, spec(−J_τ(x*)) ⊂ C°_−. Let γ̄ = min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|². For any γ_1 ∈ (0, γ̄), τ-GDA converges locally asymptotically.

Lemma F.2. Consider a zero-sum game (f_1, f_2) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2. Suppose that x* is a differential Stackelberg equilibrium and that, given τ, spec(−J_τ(x*)) ⊂ C°_−. Let γ̄ = min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|² and λ_m = argmin_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|². For any α ∈ (0, γ̄), τ-GDA with learning rate γ_1 = γ̄ − α converges locally asymptotically at a rate of O((1 − α/(4β))^{k/2}) where β = (2Re(λ_m) − α|λ_m|²)^{-1}.

F.1 PROOF OF LEMMA F.1

Suppose that x* is a differential Stackelberg or Nash equilibrium and that 0 < τ < ∞ is such that spec(−J_τ(x*)) ⊂ C°_−. For the discrete time dynamical system x_{k+1} = x_k − γ_1Λ_τ g(x_k), it is well known that if γ_1 is chosen such that ρ(I − γ_1J_τ(x*)) < 1, then x_k locally (exponentially) converges to x* (Ortega & Rheinboldt, 1970). With this in mind, we formulate an optimization problem to find the upper bound γ̄ on the learning rate γ_1 such that for all γ_1 ∈ (0, γ̄), the local linearization of the discrete time map is a contraction, which is precisely ρ(I − γ_1J_τ(x*)) < 1. The optimization problem is given by

γ̄ = max{ γ > 0 : max_{λ∈spec(J_τ(x*))} |1 − γλ| ≤ 1 }.    (24)

The intuition is as follows. The maximization inside the constraint is over the finite set spec(J_τ(x*)) = {λ_1, ..., λ_n} where J_τ(x*) ∈ R^{n×n}. As γ increases away from zero, each |1 − γλ_i| initially shrinks in magnitude.
The first λ_i for which 1 − γλ_i returns to the boundary of the unit circle in the complex plane (that is, |1 − γλ_i| = 1) gives us the optimal value γ̄ and the element of spec(J_τ(x*)) that achieves it. Examining the constraint, we have that for each λ_i, |1 − γλ_i|² ≤ 1 is equivalent to γ(γ|λ_i|² − 2Re(λ_i)) ≤ 0 for γ > 0. As noted, this constraint will be tight for one of the λ, in which case γ̄ = 2Re(λ)/|λ|² since γ̄ > 0. Hence, by selecting γ̄ = min_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|², we have that |1 − γ_1λ| < 1 for all λ ∈ spec(J_τ(x*)) and any γ_1 ∈ (0, γ̄). To see this is the case, let λ_m = argmin_{λ∈spec(J_τ(x*))} 2Re(λ)/|λ|², so that γ̄ = 2Re(λ_m)/|λ_m|². Using the expression for γ̄, we have that

|1 − γ̄λ|² = 1 − 2γ̄Re(λ) + γ̄²(Re(λ)² + Im(λ)²) = 1 − 4(Re(λ_m)/|λ_m|²)Re(λ) + (2Re(λ_m)/|λ_m|²)²|λ|².

Now, using the fact that Re(λ)/|λ|² ≥ Re(λ_m)/|λ_m|², i.e., |λ|² ≤ |λ_m|²Re(λ)/Re(λ_m), we have

1 − 4(Re(λ_m)/|λ_m|²)Re(λ) + (2Re(λ_m)/|λ_m|²)²|λ|² ≤ 1 − 4(Re(λ_m)/|λ_m|²)Re(λ) + (2Re(λ_m)/|λ_m|²)²|λ_m|²(Re(λ)/Re(λ_m)) = 1 − 4(Re(λ_m)/|λ_m|²)Re(λ) + 4(Re(λ_m)/|λ_m|²)Re(λ) = 1

as claimed. From this argument, it is clear that for any γ_1 ∈ (0, γ̄), |1 − γ_1λ| < 1 for all λ ∈ spec(J_τ(x*)). Now, consider any α ∈ (0, γ̄) and let β = (2Re(λ_m) − α|λ_m|²)^{-1}. Observe that setting γ_1 = γ̄ − α gives γ_1 ∈ (0, γ̄). Hence,

|1 − (γ̄ − α)λ_m|² = (1 − (2Re(λ_m)/|λ_m|² − α)Re(λ_m))² + (2Re(λ_m)/|λ_m|² − α)²Im(λ_m)²
= 1 − 4Re(λ_m)²/|λ_m|² + 2αRe(λ_m) + 4Re(λ_m)²/|λ_m|² − 4αRe(λ_m) + α²|λ_m|²
= 1 − 2αRe(λ_m) + α²|λ_m|² = 1 − α/β,

so that ρ(I − γ_1J_τ(x*)) ≤ (1 − α/β)^{1/2}. Hence, ρ(I − γ_1J_τ(x*)) < 1, so that an application of Proposition B.1 gives us the desired result.
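The bound γ̄ = min_λ 2Re(λ)/|λ|² can be checked on a small example (our own illustration; the matrix is arbitrary). For a Jacobian with eigenvalues 1 ± 2i, γ̄ = 2/5, and the spectral radius of I − γ_1 J_τ crosses 1 exactly at γ_1 = γ̄:

```python
import numpy as np

# A Jacobian with eigenvalues 1 +/- 2i, so gamma_bar = 2*Re(lam)/|lam|^2 = 2/5.
J = np.array([[1.0, 2.0],
              [-2.0, 1.0]])
eigs = np.linalg.eigvals(J)
gamma_bar = np.min(2 * eigs.real / np.abs(eigs) ** 2)
assert abs(gamma_bar - 0.4) < 1e-12

def rho(gamma):
    """Spectral radius of the linearized discrete-time map I - gamma*J."""
    return np.max(np.abs(np.linalg.eigvals(np.eye(2) - gamma * J)))

assert rho(0.9 * gamma_bar) < 1.0         # any gamma_1 in (0, gamma_bar) contracts
assert abs(rho(gamma_bar) - 1.0) < 1e-9   # the bound is tight
assert rho(1.1 * gamma_bar) > 1.0         # contraction is lost beyond gamma_bar
```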

F.2 PROOF OF LEMMA F.2

To prove this lemma, we build directly on the conclusion of the proof of Lemma F.1. Indeed, since ρ(I − γ_1J_τ(x*)) ≤ (1 − α/β)^{1/2}, given ε = α/(4β) > 0 there exists a norm ‖·‖ (cf. Horn & Johnson, 1985, Lemma 5.6.10) such that

‖I − γ_1J_τ(x*)‖ ≤ (1 − α/β)^{1/2} + α/(4β) ≤ (1 − α/(2β))^{1/2}

where the last inequality holds by Lemma B.1. Taking the Taylor expansion of the map x − γ_1g_τ(x) around x*, we have

x − γ_1g_τ(x) = (x* − γ_1g_τ(x*)) + (I − γ_1J_τ(x*))(x − x*) + R_2(x − x*)

where R_2(x − x*) is the remainder term satisfying ‖R_2(x − x*)‖ = o(‖x − x*‖) as x → x*, meaning that lim_{x→x*} ‖R_2(x − x*)‖/‖x − x*‖ = 0. This implies that there is a δ > 0 such that ‖R_2(x − x*)‖ ≤ (α/(8β))‖x − x*‖ whenever ‖x − x*‖ < δ. Hence,

‖(x − γ_1g_τ(x)) − (x* − γ_1g_τ(x*))‖ ≤ (‖I − γ_1J_τ(x*)‖ + α/(8β))‖x − x*‖ ≤ ((1 − α/(2β))^{1/2} + α/(8β))‖x − x*‖ ≤ (1 − α/(4β))^{1/2}‖x − x*‖

where the last inequality holds again by Lemma B.1. Hence,

‖x_k − x*‖ ≤ (1 − α/(4β))^{k/2}‖x_0 − x*‖    (25)

whenever ‖x_0 − x*‖ < δ, which verifies the claimed convergence rate.
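To illustrate the rate, consider the quadratic game f(x_1, x_2) = ½x_1² + 3x_1x_2 − 2x_2² (a toy example of our own), whose origin is a differential Stackelberg equilibrium. For this f the τ-GDA map is exactly linear, so the iterates contract geometrically once γ_1 < γ̄:

```python
import numpy as np

# f(x1, x2) = 0.5*x1^2 + 3*x1*x2 - 2*x2^2; g(x) = (D_1 f, -D_2 f).
# With tau = 1, J_tau = [[1, 3], [-3, 4]], and the origin is a
# differential Stackelberg equilibrium (Schur complement 1 + 9/4 > 0).
tau = 1.0
J = np.array([[1.0, 3.0],
              [-3.0 * tau, 4.0 * tau]])
eigs = np.linalg.eigvals(J)
assert np.all(eigs.real > 0)

gamma_bar = np.min(2 * eigs.real / np.abs(eigs) ** 2)
gamma1 = 0.5 * gamma_bar                 # well inside (0, gamma_bar)

x = np.array([1.0, 1.0])
errs = []
for _ in range(200):
    g = np.array([x[0] + 3 * x[1], tau * (-3 * x[0] + 4 * x[1])])
    x = x - gamma1 * g                   # tau-GDA step
    errs.append(np.linalg.norm(x))

assert errs[-1] < 1e-8                   # geometric convergence to x* = 0
```

Since the game is quadratic, the update is exactly x_{k+1} = (I − γ_1J_τ)x_k, so the observed decay matches the spectral radius of the linearization.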

F.3 PROOF OF THEOREM 4

This result follows directly from Theorem 1, Theorem B.2, Lemma F.2, and the following result.

Proposition F.1 (Jin et al. 2020). Consider a zero-sum game (f_1, f_2) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2. Suppose that x* is a differential Nash equilibrium. Then spec(−J_τ(x*)) ⊂ C°_− for all τ ∈ (0, ∞).

Now, to prove Theorem 4, we apply Theorem 1 to construct τ* via the guard map ν(τ) = det(−J_τ(x*) ⊕ −J_τ(x*)), where ⊕ denotes the Kronecker sum, such that for all τ ∈ (τ*, ∞), spec(J_τ(x*)) ⊂ C°_+. This guarantees that spec(−J_τ(x*)) ⊂ C°_− for any τ ∈ (τ*, ∞) and hence the nonlinear dynamical system ẋ = −Λ_τg(x) is locally asymptotically (in fact, exponentially) stable (cf. Theorem B.2). Therefore, for any τ ∈ (τ*, ∞), by Lemma F.2, τ-GDA converges with a rate of O((1 − α/(4β))^{k/2}). Finally, τ* = 0 by Proposition F.1 if x* is a differential Nash equilibrium. This concludes the proof.

Intuition for Proposition F.1. While this result was shown in Jin et al. (2020), we give some intuition for why it holds based on the quadratic numerical range tool. Indeed, simple observation of the eigenvalues of the quadratic numerical range shows that spec(J_τ(x*)) ⊂ C°_+ for any τ ∈ (0, ∞). Recall that W²(J_τ(x*)) = ⋃_{v∈W_1, w∈W_2} spec(J_τ^{v,w}(x*)) where

J_τ^{v,w}(x*) = [ ⟨D_1^2 f(x*)v, v⟩, ⟨D_12 f(x*)w, v⟩ ; −τ⟨D_12 f(x*)v, w⟩, −τ⟨D_2^2 f(x*)w, w⟩ ]

and W_i = {z ∈ C^{n_i} : ‖z‖ = 1} for each i = 1, 2. Fix v ∈ W_1 and w ∈ W_2 and consider

J_τ^{v,w}(x*) = [ a, b ; −τb̄, τd ].

Then the elements of W²(J_τ(x*)) are of the form

λ_τ = (1/2)(a + τd) ± (1/2)√( (a − τd)² − 4τ|b|² )

where a = ⟨D_1^2 f(x*)v, v⟩, b = ⟨D_12 f(x*)w, v⟩, and d = −⟨D_2^2 f(x*)w, w⟩ for vectors v ∈ W_1 and w ∈ W_2. Since D_1^2 f(x*) > 0 and −D_2^2 f(x*) > 0, we have Re(λ(J^{v,w}(x*))) > 0 in the τ = 1 case since λ(J^{v,w}(x*)) = (1/2)(a + d) ± (1/2)√( (a − d)² − 4|b|² ).
The introduction of τ ∈ (0, ∞) does not alter the sign. This is obvious when Im(λ_τ) ≠ 0 since then Re(λ_τ) = (1/2)(a + τd) > 0. On the other hand, if Im(λ_τ) = 0, so that (a − τd)² ≥ 4τ|b|², then Re(λ_τ) ≥ (1/2)(a + τd) − (1/2)√( (a − τd)² − 4τ|b|² ) > 0. The last inequality is easily seen to be equivalent to −ad < |b|², which holds for any pair of vectors (v, w) such that v ∈ W_1 and w ∈ W_2 since a > 0 and d > 0. Hence, for any τ ∈ (0, ∞), spec(J_τ(x*)) ⊂ C°_+ since the spectrum of an operator is contained in its quadratic numerical range and the above argument shows that W²(J_τ(x*)) ⊂ C°_+.

G PROOF OF COROLLARY 1: FINITE TIME CONVERGENCE OF τ-GDA

Let ‖·‖ be the norm that exists (via the construction à la Horn & Johnson (1985, Lem. 5.6.10)) in the proof of Lemma F.2, which is given in Appendix F. Following standard arguments, (25) in the proof of Lemma F.2 implies a finite time convergence guarantee. Indeed, let ε > 0 be given. Since 0 < α/(4β) < 1, we have that (1 − α/(4β))^{k/2} < exp(−kα/(8β)). Hence, ‖x_k − x*‖ ≤ exp(−kα/(8β))‖x_0 − x*‖. In turn, this implies that x_k ∈ B_ε(x*), meaning that x_k is an ε-differential Stackelberg equilibrium, for all k ≥ (8β/α)log(‖x_0 − x*‖/ε) whenever ‖x_0 − x*‖ < δ. Now, given that f_i ∈ C^r(X, R) for r ≥ 2, I − γ_1J_τ(x) is locally Lipschitz with constant L, so that we can find an explicit expression for δ in terms of L. Indeed, recall that ‖R_2(x − x*)‖ = o(‖x − x*‖) as x → x*, which means lim_{x→x*} ‖R_2(x − x*)‖/‖x − x*‖ = 0, so that

‖R_2(x − x*)‖ ≤ ∫_0^1 ‖(I − γ_1J_τ(x* + η(x − x*))) − (I − γ_1J_τ(x*))‖‖x − x*‖ dη ≤ (L/2)‖x − x*‖².

Observing that ‖R_2(x − x*)‖ ≤ (L/2)‖x − x*‖² = (L/2)‖x − x*‖‖x − x*‖, we have that the δ > 0 such that ‖R_2(x − x*)‖ ≤ (α/(8β))‖x − x*‖ is δ = α/(4Lβ).

Comments on computing the neighborhood B_δ(x*). We note that we have essentially given a proof that there exists a neighborhood on which τ-GDA converges. Of course, due to the non-convexity of the problem in general, this neighborhood could be arbitrarily small.
We provide an estimate of the neighborhood size using the local Lipschitz constant of the local linearization I − γ_1J_τ(x*). One way to better understand the size of this neighborhood is to use Lyapunov analysis, a tool which is well explored in singular perturbation theory (Kokotovic et al., 1986). In particular, Lyapunov methods can be applied directly to the nonlinear system if one can construct Lyapunov functions for the fast and slow subsystems individually (also known as the boundary layer model and the reduced order model). With these Lyapunov functions in hand, one can "stitch" the two together (via a convex combination) and show, under some reasonable assumptions, that the combined function is a Lyapunov function for the overall singularly perturbed system. The benefit of this analysis is that the Lyapunov function gives an estimate of the region of attraction (via, e.g., its level sets); however, it is not easy to construct a Lyapunov function for a nonlinear system in general. We leave the development of such methods to future work.
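The finite threshold τ* from Theorem 1, as opposed to τ* = 0 in the Nash case of Proposition F.1, can be seen on a single 2×2 slice (an illustrative example of our own; the numbers are arbitrary). Take a = −1, b = 2, d = 1, so that the critical point is differential Stackelberg but not Nash; the trace a + τd changes sign at τ* = −a/d = 1:

```python
import numpy as np

# Quadratic-numerical-range slice J_tau = [[a, b], [-tau*b, tau*d]] with a < 0:
# differential Stackelberg (a + b^2/d > 0) but not Nash (a < 0).
a, b, d = -1.0, 2.0, 1.0
assert a + b ** 2 / d > 0                 # Schur complement condition

def J(tau):
    return np.array([[a, b], [-tau * b, tau * d]])

# Below tau* = -a/d = 1 the spectrum sits in the left half plane, so GDA is
# repelled from the equilibrium; above tau* it moves into the right half plane.
assert np.all(np.linalg.eigvals(J(0.5)).real < 0)
assert np.all(np.linalg.eigvals(J(2.0)).real > 0)

# For a Nash slice (a > 0, d > 0) the sign never flips: tau* = 0.
assert all(np.all(np.linalg.eigvals(np.array([[1.0, b], [-t * b, t * d]])).real > 0)
           for t in np.logspace(-2, 2, 25))
```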

H CONVERGENCE OF STOCHASTIC GDA WITH TIMESCALE SEPARATION

In this section, we analyze convergence when players do not have oracle access to their gradients but instead have an unbiased estimator in the presence of zero mean, finite variance noise. Specifically, we show that the agents will converge locally asymptotically almost surely to a differential Stackelberg equilibrium. The key insight in this section is that, due to Theorem 1 in the main body, we know that a critical point x* is stable for ẋ = −Λ_τg(x) for a range of finite timescale separations τ ∈ (τ*, ∞) if and only if x* is a differential Stackelberg equilibrium. Hence, treating ẋ = −Λ_τg(x) as the continuous time limiting differential equation in the so-called ordinary differential equation (ODE) method in stochastic approximation (Borkar, 2008), we apply classical stochastic approximation analysis to conclude that the stochastic gradient descent-ascent update with timescale separation converges.

H.1 ASYMPTOTIC CONVERGENCE GUARANTEES VIA STOCHASTIC APPROXIMATION

The stochastic form of the update is given by

x_{k+1} = x_k − γ_k(Λ_τg(x_k) + w_{k+1})    (26)

where w_{k+1} is a zero mean, finite variance random variable and {γ_k} is the learning rate sequence.

Assumption 2. The stochastic process {w_k} is a martingale difference sequence with respect to the increasing family of σ-fields F_k = σ(x_ℓ, w_ℓ, ℓ ≤ k), ∀k ≥ 0, so that E[w_{k+1} | F_k] = 0 almost surely (a.s.) for all k ≥ 0. Moreover, w_k is square-integrable so that, for some constant C > 0, E[‖w_{k+1}‖² | F_k] ≤ C(1 + ‖x_k‖²) a.s., ∀k ≥ 0.

We note that this assumption has been relaxed in the literature (Thoppe & Borkar, 2019); however, for simplicity, we state the theorem with the most accessible criteria. We remark on the nature of the relaxed assumptions below in the paragraph on extensions to concentration bounds.

Theorem H.1. Consider a zero-sum game (f, −f) such that f ∈ C^r(X, R) for some r ≥ 2. Suppose that Assumption 2 holds and that {γ_k} is square summable but not summable, i.e., Σ_k γ_k² < ∞, yet Σ_k γ_k = ∞. For any τ ∈ (0, ∞), the sequence {x_k} generated by (26) converges to a, possibly sample path dependent, internally chain transitive invariant set of ẋ = −Λ_τg(x). Moreover, if x* is a differential Stackelberg equilibrium, then there exists a finite τ* ∈ [0, ∞) such that {x_k} almost surely converges locally asymptotically to x* for every τ ∈ (τ*, ∞).

Proof. The convergence of {x_k} to a, possibly sample path dependent, compact connected internally chain transitive invariant set of ẋ = −Λ_τg(x) follows from classical results in stochastic approximation theory (Borkar, 2008, Chap. 2; Benaim, 1996). Suppose that x* is a differential Stackelberg equilibrium. By Theorem 1, there exists a finite τ* ∈ [0, ∞) such that for all τ ∈ (τ*, ∞), x* is a locally exponentially stable equilibrium of the continuous time dynamics ẋ = −Λ_τg(x), that is, spec(−J_τ(x*)) ⊂ C°_− for all τ ∈ (τ*, ∞). Fix an arbitrary τ ∈ (τ*, ∞).
Since spec(−J_τ(x*)) ⊂ C°_−, det(−J_τ(x*)) ≠ 0, so that x* is an isolated critical point. Furthermore, exponential stability of x* implies that there exists a (local) Lyapunov function defined on a neighborhood of x* by the converse Lyapunov theorem (Sastry, 1999, Thm. 5.17; Krasovskii, 1963, Thm. 4.3). Let U be the neighborhood of x* on which the local Lyapunov function is defined, chosen such that U contains no other critical points (which is possible since x* is isolated). That is, let Φ : U → [0, ∞) be the local Lyapunov function defined on U, where x* ∈ U, Φ is positive definite on U, and for all x ∈ U, (d/dt)Φ(x) ≤ 0, where equality holds for z ∈ U if and only if Φ(z) = 0. By Corollary 3 (Borkar, 2008, Chap. 2), {x_k} converges to an internally chain transitive invariant set contained in U almost surely. The only internally chain transitive invariant set in U is x*.

The following corollary shows that if there is a finite τ* such that x* is stable for ẋ = −Λ_τg(x), then by Theorem 1, x* must be a differential Stackelberg equilibrium and, in turn, {x_k} almost surely converges locally asymptotically to x* by the above theorem.

Corollary H.1. Consider a zero-sum game (f, −f) such that f ∈ C²(X, R). Suppose that Assumption 2 holds and that {γ_k} is square summable but not summable: Σ_k γ_k² < ∞, yet Σ_k γ_k = ∞. If there exists a finite τ* ∈ [0, ∞) such that spec(−J_τ(x*)) ⊂ C°_− for all τ ∈ (τ*, ∞), then x* is a differential Stackelberg equilibrium and {x_k} almost surely converges locally asymptotically to x*.

While (local) almost sure convergence of gradient descent-ascent to a critical point has been shown in the stochastic setting (Chasnov et al., 2019), the result requires time varying learning rates with a sufficient separation in timescale.
Specifically, the players need to be using learning rate sequences {γ_{i,k}} for each i ∈ {1, 2} such that (without loss of generality) not only is it assumed that γ_{1,k} = o(γ_{2,k}), but also Σ_k (γ_{1,k}² + γ_{2,k}²) < ∞ and Σ_k γ_{i,k} = ∞ for each i ∈ {1, 2}. The challenge with these assumptions on the learning rate sequences is that, empirically, the sequences that satisfy them result in poor behavior along the learning path, such as getting stuck at saddle points or making no progress. This is, in essence, due to the fact that the faster player (that is, player 2 if γ_{1,k} = o(γ_{2,k})) equilibrates too quickly, causing progress to stall. This can result in undesirable behavior such as vanishing gradients (so that the discriminator does not provide enough information for the generator to make progress), mode collapse, or failure to converge in practical applications such as generative adversarial networks. On the other hand, our convergence result gives a similar guarantee with less restrictive requirements on the learning rate sequence. In particular, only a single learning rate sequence is required (so that the algorithm can be viewed as a single timescale stochastic approximation update) as long as the fast player (who, without loss of generality, is player 2 in this paper) scales their estimated gradient by τ ∈ (τ*, ∞) where τ* is as in Theorem 1.
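A small simulation (our own sketch; the game, noise level, and stepsize exponent are arbitrary choices) illustrates the single-timescale update (26) with a common stepsize γ_k = 1/(k+1)^0.7 and the fast player's gradient scaled by τ > τ*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-sum quadratic game with a Stackelberg (non-Nash) equilibrium at 0.
# Dg = [[-1, 2], [-2, 1]], so J_tau = Lam @ Dg has tau* = 1; take tau = 2.
tau = 2.0
A = np.array([[-1.0, 2.0], [-2.0, 1.0]])   # Jacobian of g = (D_1 f, -D_2 f)
Lam = np.diag([1.0, tau])

x = np.array([0.5, 0.5])
for k in range(50_000):
    gamma_k = 1.0 / (k + 1) ** 0.7         # not summable, so the ODE method applies
    noise = 0.05 * rng.standard_normal(2)  # zero mean, finite variance
    x = x - gamma_k * (Lam @ (A @ x) + noise)

assert np.linalg.norm(x) < 0.2             # iterates settle near x* = 0
```

With τ = 0.5 < τ* instead, spec(−J_τ) leaves the left half plane and the same iteration is repelled from the equilibrium.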

STEPSIZES

It is possible to obtain concentration bounds and even finite time, high probability guarantees on convergence leveraging recent advances in stochastic approximation (Borkar, 2008; Kamal, 2010; Thoppe & Borkar, 2019). To our knowledge, the concentration bounds in Thoppe & Borkar (2019) require the weakest assumptions on learning rates, e.g., the learning rate sequence {γ_k} needs only to satisfy Σ_k γ_k = ∞, lim_{k→∞} γ_k = 0, and sup_k γ_k ≤ 1. Specifically, since it is assumed, for the zero-sum game (f, −f), that f ∈ C²(X, R) and x* is a differential Stackelberg equilibrium, Theorem 1 implies that x* is a locally asymptotically stable attractor of ẋ = −Λ_τg(x) for an arbitrary fixed τ ∈ (τ*, ∞), and hence the concentration bounds in Theorems 1.1 and 1.2 of Thoppe & Borkar (2019) directly apply. Furthermore, we note that in applications such as generative adversarial networks, while it has been observed that timescale separation heuristics such as unrolling or annealing the stepsize of the discriminator work well, in the stochastic case, summable/square-summable assumptions on stepsizes are generally too restrictive in practice since they lead to a rapid decay in the stepsize which, in turn, can stall progress. On the other hand, stepsize sequences such as γ_k = 1/(k + 1)^β for β ∈ (0, 1], a family which satisfies the assumptions posed in Thoppe & Borkar (2019), tend not to have this issue of decaying too rapidly for appropriately chosen β, while also maintaining the guarantees of the theoretical results. We state a convergence guarantee under these relaxed assumptions in Proposition H.1 below. Let x̄(t) be the asymptotic pseudo-trajectory of the stochastic approximation process {x_k}. That is, x̄(t) linearly interpolates between the sample points x_k generated by the stochastic τ-GDA process, and is defined by x̄(t) = x(t_k) + ((t − t_k)/γ_k)(x(t_{k+1}) − x(t_k)) for t ∈ [t_k, t_{k+1}], where t_{k+1} = t_k + γ_k and t_0 = 0.

Assumption 3.
The stochastic process {w_k} is a martingale difference sequence with respect to the increasing family of σ-fields F_k = σ(x_ℓ, w_ℓ, ℓ ≤ k), ∀k ≥ 0, so that E[w_{k+1} | F_k] = 0 almost surely for all k ≥ 0. Furthermore, there exist c_1, c_2 ∈ C(R^d, R_{>0}) such that Pr{‖w_{k+1}‖ > v | F_k} ≤ c_1(x_k)exp(−c_2(x_k)v), ∀k ≥ 0, for all v ≥ ṽ, where ṽ is some sufficiently large, fixed number.

Proposition H.1. Suppose that Assumption 3 holds and that x* is a differential Stackelberg equilibrium. Let γ_k = 1/(k + 1)^β where β ∈ (0, 1]. There exist a τ* ∈ [0, ∞) and an ε_0 ∈ (0, ∞) such that for any fixed ε ∈ (0, ε_0], there exist functions h_1(ε) = O(log(1/ε)) and h_2(ε) = O(1/ε) so that when T ≥ h_1(ε) and k_0 ≥ K_τ, where K_τ is such that 1/γ_k ≥ h_2(ε) for all k ≥ K_τ, the stochastic iterates of τ-GDA with stepsize sequence γ_k and timescale separation τ ∈ (τ*, ∞) satisfy

Pr{ ‖x̄(t) − x*‖ ≤ ε ∀t ≥ t_{k_0} + T + 1 | x̄(t_{k_0}) ∈ B_ε(x*) } = 1 − O(k_0^{1−β/2} exp(−C_τ k_0^{β/2}))

for some constant C_τ > 0.

The proof largely follows from the proofs of Theorems 1.1 and 1.2 in Thoppe & Borkar (2019), combined with the existence of a finite timescale separation parameter obtained via Theorem 1. Indeed, since x* is a differential Stackelberg equilibrium, by Theorem 1 there exists a range of τ, namely (τ*, ∞), such that for any τ ∈ (τ*, ∞), x* is a locally asymptotically stable equilibrium of ẋ = −Λ_τg(x). Hence, fixing any τ ∈ (τ*, ∞), a converse Lyapunov theorem can be applied to construct a local Lyapunov function. Let V : R^n → R be this Lyapunov function, so that there exist r, r_0, ε_0 > 0 such that r > r_0 and B_ε(x*) ⊆ V^{r_0} ⊂ N^{ε_0}(V^{r_0}) ⊆ V^r for any ε ∈ (0, ε_0], where, for a given q > 0, V^q = {x ∈ dom(V) : V(x) ≤ q} and N^{ε_0}(V^{r_0}) is an ε_0-neighborhood of V^{r_0}, i.e., N^{ε_0}(V^{r_0}) = {x ∈ R^n : ∃y ∈ V^{r_0}, ‖x − y‖ ≤ ε_0}. From here, the result follows from an application of the results in the work by Thoppe & Borkar (2019).
The utility of this result is that it provides a guarantee in the stochastic setting for a more reasonable and practically useful stepsize sequence. However, constructing the constants such as K_τ, C_τ, and ε_0 is highly non-trivial, as can be seen in the work of Thoppe & Borkar (2019) and similar works in the area of stochastic approximation (Borkar, 2008). One direction of future work is examining the Lyapunov approach for directly analyzing the nonlinear singularly perturbed system; it is known, however, that stochastic singularly perturbed systems have much weaker guarantees in terms of stability (Kokotovic et al., 1986, Chap. 4).

I STABILITY OF ∞-GDA: A SINGULAR PERTURBATION APPROACH

The examples included in Section 3 provide evidence that there exists a range of finite learning rate ratios for which differential Stackelberg equilibria are stable and a range of learning rate ratios for which non-equilibrium critical points are unstable. Yet, until this paper, no result has appeared in the literature on gradient descent-ascent with timescale separation confirming this behavior in general. The closest existing result studies the limiting case τ → ∞. As mentioned previously, Jin et al. (2020) show that as τ → ∞, the set of stable critical points with respect to the dynamics ẋ = −Λ_τg(x) coincides with the set of differential Stackelberg equilibria. However, an equivalent result in the context of general singularly perturbed systems has been known in the literature (Kokotovic et al., 1986, Chap. 2). We give a proof based on this type of analysis because it reveals a new set of analysis tools for the study of game-theoretic formulations of machine learning and optimization problems. The formal statement (which is a restatement of the result from Jin et al. 2020 in our notation) is given below.

Proposition I.1. Consider a zero-sum game (f_1, f_2) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2. Suppose that x* is such that g(x*) = 0 and det(D_2^2 f_2(x*)) ≠ 0. Then, as τ → ∞, spec(−J_τ(x*)) ⊂ C°_− if and only if x* is a differential Stackelberg equilibrium.

The structure of this proof is as follows. We begin by introducing general background for analyzing singularly perturbed systems. Following this, we consider the linearization of the singularly perturbed system that approximates the simultaneous gradient dynamics and describe how insights made about this system translate to the corresponding nonlinear system. Finally, we analyze the stability of the linear system around a critical point to arrive at the stated result. The analysis primarily follows Kokotovic et al. (1986).

Analysis of General Singularly Perturbed Systems.
Let us begin by considering a general singularly perturbed system for x ∈ R^n, z ∈ R^m, and a sufficiently small parameter ε > 0 given by

ẋ = f(x, z, ε, t),   x(t_0, ε) = x_0,   x ∈ R^n
ε ż = g(x, z, ε, t),   z(t_0, ε) = z_0,   z ∈ R^m    (27)

where f and g are assumed to be sufficiently many times continuously differentiable functions of the arguments x, z, ε, and t. Observe that when ε = 0, the dimension of the system in (27) drops from n + m to n since ż degenerates into the algebraic equation

0 = g(x̄, z̄, 0, t)    (28)

where the bar notation x̄, z̄ indicates that the variables belong to the system with ε = 0. We further require the assumption that (28) has k ≥ 1 isolated roots, which for each i ∈ {1, ..., k} are given by z̄ = φ̄_i(x̄, t). We now define an n-dimensional manifold M_ε for any ε > 0 characterized by the expression

z(t, ε) = φ(x(t, ε), ε)    (29)

where φ is a sufficiently many times continuously differentiable function of x and ε. For M_ε to be an invariant manifold of the system in (27), the expression in (29) must hold for all t > t* if it holds for t = t*. Formally, if

z(t*, ε) = φ(x(t*, ε), ε) ⟹ z(t, ε) = φ(x(t, ε), ε) ∀t ≥ t*,    (30)

then M_ε is an invariant manifold for (27). Differentiating the expression in (30) with respect to t, we obtain

ż = (d/dt)φ(x(t, ε), ε) = (∂φ/∂x)ẋ.    (31)

Now, multiplying the expression in (31) by ε and substituting in the forms of ẋ, ż, and z from (27) and (29), the manifold condition becomes

g(x, φ(x, ε), ε, t) = ε(∂φ/∂x)f(x, φ(x, ε), ε, t),    (32)

which φ(x, ε) must satisfy for all x of interest and all ε ∈ [0, ε*], where ε* is a positive constant. We now define η = z − φ(x, ε). Then, in terms of x and η, the system becomes

ẋ = f(x, φ(x, ε) + η, ε, t)
ε η̇ = g(x, φ(x, ε) + η, ε, t) − ε(∂φ/∂x)f(x, φ(x, ε) + η, ε, t).

Remark 1. One interesting observation is that the above system is exactly the continuous time limiting system for the τ-Stackelberg learning update in Fiez et al. (2020) under a simple transformation of coordinates.
Observe that the invariant manifold M_ε is characterized by the fact that η = 0 implies η̇ = 0 for all x for which the manifold condition in (32) holds. This implies that if η(t_0, ε) = 0, it is sufficient to solve the system

ẋ = f(x, φ(x, ε), ε, t),   x(t_0, ε) = x_0.

This system is often referred to as the exact slow model; it is valid for all x, z ∈ M_ε, and M_ε is known as the slow manifold of (27).

Linearization of the Simultaneous Gradient Descent Singularly Perturbed System. We now consider the singularly perturbed system for simultaneous gradient descent given by

ẋ = −D_1 f_1(x, z)
ε ż = −D_2 f_2(x, z).    (33)

Let us linearize the system around a point (x*, z*). Then,

D_1 f_1(x, z) ≈ D_1 f_1(x*, z*) + D_1^2 f_1(x*, z*)(x − x*) + D_12 f_1(x*, z*)(z − z*)
D_2 f_2(x, z) ≈ D_2 f_2(x*, z*) + D_21 f_2(x*, z*)(x − x*) + D_2^2 f_2(x*, z*)(z − z*).    (34)

Defining u = x − x* and v = z − z*, and considering a point (x*, z*) such that D_1 f_1(x*, z*) = 0 and D_2 f_2(x*, z*) = 0, the linearized singularly perturbed system is given by

u̇ = −D_1^2 f_1(x*, z*)u − D_12 f_1(x*, z*)v
ε v̇ = −D_21 f_2(x*, z*)u − D_2^2 f_2(x*, z*)v.    (35)

To simplify notation, let us define

J_τ = [ D_1^2 f_1(x*, z*), D_12 f_1(x*, z*) ; ε^{-1}D_21 f_2(x*, z*), ε^{-1}D_2^2 f_2(x*, z*) ] = [ A_11, A_12 ; ε^{-1}A_21, ε^{-1}A_22 ]

along with w = (u, v). Then, an equivalent form of (35) is given by

ẇ = −J_τ w.    (36)

In what follows, we make insights about the behavior of the nonlinear system in (33) around a critical point (x*, z*) by analyzing the linear system in (36). Recall that if (x*, z*) is asymptotically stable with respect to the linear system in (36), then it is also asymptotically stable with respect to the nonlinear system in (33). Moreover, to determine asymptotic stability, it is sufficient to prove that spec(J_τ(x*, z*)) ⊂ C°_+.
In what follows, we specialize the general analysis of singularly perturbed systems to the singularly perturbed linear system in (36).

Stability of Critical Points of Simultaneous Gradient Descent. The manifold condition from (32) for the system in (36) is given by

A_21 u + A_22 φ(u, ε) = ε(∂φ/∂u)(A_11 u + A_12 φ(u, ε)).    (37)

We claim that (37) can be satisfied by a function φ that is linear in u. Indeed, defining v = φ(u, ε) = −L(ε)u and substituting back into (37), we get the simplified manifold condition

A_21 − A_22 L(ε) = −εL(ε)A_11 + εL(ε)A_12 L(ε).    (38)

Before we prove that an L(ε) always exists to satisfy (38), consider the change of variables η = v + L(ε)u. The change of variables transforms the system from (36) into the equivalent representation

[ u̇ ; ε η̇ ] = −[ A_11 − A_12 L(ε), A_12 ; R(L, ε), A_22 + εL(ε)A_12 ][ u ; η ]    (39)

where

R(L, ε) = A_21 − A_22 L(ε) + εL(ε)A_11 − εL(ε)A_12 L(ε).    (40)

Suppose that R(L, ε) = 0. Then the system from (39) has the upper block-triangular form

[ u̇ ; ε η̇ ] = −[ A_11 − A_12 L(ε), A_12 ; 0, A_22 + εL(ε)A_12 ][ u ; η ],    (41)

which has the effect of generating a replacement fast subsystem given by ε η̇ = −(A_22 + εL(ε)A_12)η. We now proceed to show that an L(ε) such that R(L, ε) = 0 always exists.

Lemma I.1. If A_22 is such that det(A_22) ≠ 0, there is an ε* such that for all ε ∈ [0, ε*] there exists a solution L(ε) to the matrix quadratic equation

R(L, ε) = A_21 − A_22 L(ε) + εL(ε)A_11 − εL(ε)A_12 L(ε) = 0    (42)

which is approximated according to

L(ε) = A_22^{-1}A_21 + εA_22^{-2}A_21 A_0 + O(ε²),    (43)

where

A_0 = A_11 − A_12 A_22^{-1}A_21.    (44)

Proof. To begin, observe that for ε = 0, the unique solution to (42) is given by L(0) = A_22^{-1}A_21. Now, differentiating R(L, ε) = 0 from (42) with respect to ε, we find

(A_22 + εL(ε)A_12)(dL/dε) − ε(dL/dε)(A_11 − A_12 L(ε)) = L(ε)A_11 − L(ε)A_12 L(ε).

The unique solution of this equation at ε = 0 is

(dL/dε)|_{ε=0} = A_22^{-1}L(0)(A_11 − A_12 L(0)) = A_22^{-2}A_21 A_0.

Accordingly, (43) gives the first two terms of the Maclaurin series for L(ε).
We remark that L(ε) as defined in (43) is unique in the sense that even though R(L, ε) = 0 as given in (42) may have several real solutions, only one is approximated by (43). The characteristic equation of (41) is equivalent to that of the system from (36) owing to the similarity transform between the systems. The block-triangular form of (41) admits a characteristic equation given by
ψ(s, ε) = ε⁻ᵐ ψ_s(s, ε) ψ_f(p, ε) = 0, (45)
where
ψ_s(s, ε) = det(sI − (A₁₁ − A₁₂L(ε))) (46)
is the characteristic polynomial of the slow subsystem, and
ψ_f(p, ε) = det(pI − (A₂₂ + εL(ε)A₁₂)) (47)
is the characteristic polynomial of the fast subsystem in the timescale p = sε. Consequently, n of the eigenvalues of (36), denoted by {λ₁, …, λₙ}, are the roots of the slow characteristic equation ψ_s(s, ε) = 0, and the remaining eigenvalues {λₙ₊₁, …, λₙ₊ₘ} are given by λᵢ = νⱼ/ε for i = n + j and j ∈ {1, …, m}, where {ν₁, …, νₘ} are the roots of the fast characteristic equation ψ_f(p, ε) = 0. The roots of ψ_s(s, ε) at ε = 0, given by the solution to
ψ_s(s, 0) = det(sI − (A₁₁ − A₁₂L(0))) = 0, (48)
are the eigenvalues of the matrix A₀ defined in (44) since L(0) = A₂₂⁻¹A₂₁ as shown in Lemma I.1. The roots of the fast characteristic equation at ε = 0, given by the solution to
ψ_f(p, 0) = det(pI − A₂₂) = 0, (49)
are the eigenvalues of the matrix A₂₂. The roots of these systems correspond to the conditions for a differential Stackelberg equilibrium, which thus gives the result. We now proceed by characterizing how closely the eigenvalues of the system at ε = 0 approximate the eigenvalues of the system from (36) as ε → 0.
If det(A₂₂) ≠ 0, then as ε → 0, n eigenvalues of the system given in (36) tend toward the eigenvalues of the matrix A₀, while the remaining m eigenvalues of the system from (36) tend to infinity at the rate 1/ε along asymptotes defined by the eigenvalues of A₂₂, given as spec(A₂₂)/ε, as a result of the continuity of the coefficients of the polynomials from (46) and (47) with respect to ε. Now, consider the special (but generic) case in which the eigenvalues of A₀ are distinct and the eigenvalues of A₂₂ are distinct, but A₀ and A₂₂ may have common eigenvalues. Then, taking the total derivative of (45) with respect to ε, we have that
(∂ψ_s/∂s)(ds/dε) + ∂ψ_s/∂ε = 0.
Now, observe that ∂ψ_s/∂s ≠ 0 since the eigenvalues of A₀ = A₁₁ − A₁₂A₂₂⁻¹A₂₁ are distinct.foot_10 For each i = 1, …, n, this gives us a well-defined derivative ds/dε (by the implicit function theorem) and hence, with s(0) = λᵢ(A₀), the O(ε) approximation of s(ε) follows directly. That is,
λᵢ = λᵢ(A₀) + O(ε), i = 1, …, n.
Similarly, taking the total derivative of ψ_f(p, ε) = 0 and again applying the implicit function theorem, we have
λₙ₊ⱼ = ε⁻¹(λⱼ(A₂₂) + O(ε)), j = 1, …, m,
where we have used the fact that p = sε.
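This two-timescale eigenvalue splitting can be observed directly. In the sketch below (an illustrative check with block matrices of our own choosing, not from the paper), the slow eigenvalues of the block matrix [A₁₁, A₁₂; ε⁻¹A₂₁, ε⁻¹A₂₂] approach spec(A₀) while the fast eigenvalue grows like spec(A₂₂)/ε:

```python
import numpy as np

A11 = np.array([[3.0, 1.0],
                [0.0, 2.0]])
A12 = np.array([[1.0], [0.5]])
A21 = np.array([[0.5, 1.0]])
A22 = np.array([[5.0]])
A0 = A11 - A12 @ np.linalg.inv(A22) @ A21

eps = 1e-3
M = np.block([[A11, A12],
              [A21 / eps, A22 / eps]])

lam = np.sort(np.linalg.eigvals(M).real)   # all eigenvalues are real in this example
slow, fast = lam[:2], lam[2]

print(slow, np.sort(np.linalg.eigvals(A0).real))  # slow eigenvalues ≈ spec(A0) + O(eps)
print(fast * eps)                                  # ≈ spec(A22) + O(eps)
```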

J FURTHER DETAILS ON RELATED WORK

In this section, we provide further details on the discussion from Section A regarding the results presented by Jin et al. (2020) on the local stability of gradient descent-ascent with a finite timescale separation. The purpose of this discussion is to make clear that Proposition 27 from the work of Jin et al. (2020) does not disagree with the results we provide in Theorem 1 and Theorem 2 and is instead complementary. In what follows, we recall Proposition 27 of Jin et al. (2020) in separate pieces in the terminology of this paper and delineate its meaning from our results on the stability of gradient descent-ascent with a finite timescale separation. To begin, we consider the component of Proposition 27 from Jin et al. (2020) which says that given any fixed and finite timescale separation τ > 0, a zero-sum game can be constructed with a differential Stackelberg equilibrium that is not stable with respect to the continuous-time limiting system of τ-GDA given by the dynamics ẋ = −Λ_τ g(x).

Proposition J.1 (Rephrasing of Jin et al. 2020, Proposition 27(a)). For any fixed τ > 0, there exists a zero-sum game G = (f, −f) such that spec(J_τ(x*)) ⊄ C°₊ for a differential Stackelberg equilibrium x*.

We now explain the proof. Let us consider any ε > 0 and the game
f(x, y) = −x² + 2√ε xy − (ε/2)y². (50)
At the unique critical point (x*, y*) = (0, 0), the Jacobian of the dynamics is given by
J_τ(x*, y*) = [−2, 2√ε; −2τ√ε, τε].
Moreover, observe that (x*, y*) is a differential Stackelberg equilibrium and not a differential Nash equilibrium since D₁²f(x*, y*) = −2 ≯ 0, −D₂²f(x*, y*) = ε > 0, and S₁(J(x*, y*)) = 2 > 0. Finally, the spectrum of the Jacobian is
spec(J_τ(x*, y*)) = {(−2 + τε ± √(τ²ε² − 12τε + 4))/2}.
Let us now fix τ as any arbitrary positive value. Then, consider the game construction from (50) with ε = 1/τ. For the fixed choice of τ and subsequent game construction, we get that spec(J_τ(x*, y*)) = {(−1 ± i√7)/2} ⊄ C°₊.
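Both regimes in this discussion can be verified numerically. The sketch below (an illustrative check of our own, using the Jacobian from (50)) confirms that with ε = 1/τ the spectrum has negative real parts, while for a fixed game the spectrum lies in the open right-half plane once τ > 2/ε:

```python
import numpy as np

def J_tau(eps, tau):
    # Jacobian of tau-GDA at the critical point of
    # f(x, y) = -x^2 + 2*sqrt(eps)*x*y - (eps/2)*y^2.
    s = np.sqrt(eps)
    return np.array([[-2.0, 2.0 * s],
                     [-2.0 * tau * s, tau * eps]])

# Jin et al. construction: fix tau, then pick eps = 1/tau -> unstable equilibrium.
tau = 5.0
eig_bad = np.linalg.eigvals(J_tau(1.0 / tau, tau))
print(eig_bad.real)  # both real parts are -1/2: spectrum not in the open right-half plane

# Theorem 1 viewpoint: fix the game (eps), then take tau > 2/eps -> stable equilibrium.
eps = 0.5
eig_good = np.linalg.eigvals(J_tau(eps, 2.0 / eps + 1.0))
print(eig_good.real)  # both real parts positive
```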
This in turn means the differential Stackelberg equilibrium is not stable with respect to the dynamics ẋ = −Λ_τ g(x) for the given choice of τ. Since the choice of τ was arbitrary, this is a valid procedure to generate a game with a differential Stackelberg equilibrium that is not stable with respect to ẋ = −Λ_τ g(x) given a choice of τ beforehand. This result contrasts with that of Theorem 1 in the following fundamental way. In the proof of Proposition J.1, τ is fixed and then the game is constructed, whereas in Theorem 1 the game is fixed and then the conditions on τ are given. To illustrate this point, consider the game construction from (50) with ε fixed to be an arbitrary positive value. It can be verified that spec(J_τ(x*, y*)) ⊂ C°₊ for all τ > 2/ε. This means that given the differential Stackelberg equilibrium in this game construction, there is indeed a finite τ* such that the equilibrium is stable with respect to ẋ = −Λ_τ g(x) for all τ ∈ (τ*, ∞). Put concisely, Proposition J.1 shows that there exists a continuum of games for which a differential Stackelberg equilibrium is unstable with an improper choice of finite learning rate ratio τ. On the other hand, Theorem 1 proves that given a game with a differential Stackelberg equilibrium, there exists a range of suitable finite learning rate ratios such that the differential Stackelberg equilibrium is guaranteed to be stable. We now move on to examining the portion of Proposition 27 from Jin et al. (2020) which says that given any fixed and finite timescale separation τ > 0, a zero-sum game can be constructed with a critical point that is not a differential Stackelberg equilibrium which is stable with respect to the continuous-time limiting system of τ-GDA given by ẋ = −Λ_τ g(x).

Proposition J.2 (Rephrasing of Jin et al. 2020, Proposition 27(b)).
For any fixed τ > 0, there exists a zero-sum game G = (f, −f) such that spec(J_τ(x*)) ⊂ C°₊ for a critical point x* satisfying g(x*) = 0 that is not a differential Stackelberg equilibrium.

In a similar manner as following Proposition J.1, we now explain the proof of Proposition J.2 and then contrast the result with Theorem 2. Again, consider any ε > 0, along with the game construction
f(x, y) = x₁² + 2√ε x₁y₁ + (ε/2)y₁² − x₂²/2 + 2√ε x₂y₂ − εy₂². (51)
At the unique critical point (x*, y*) = (0, 0), the Jacobian of the dynamics is given by
J_τ(x*, y*) =
[ 2       0       2√ε    0
  0      −1       0      2√ε
 −2τ√ε    0      −τε     0
  0      −2τ√ε    0      2τε ].
Observe that (x*, y*) is neither a differential Nash equilibrium nor a differential Stackelberg equilibrium since D₁²f(x*, y*) = diag(2, −1) and −D₂²f(x*, y*) = diag(−ε, 2ε) are both indefinite. The spectrum of the Jacobian is
spec(J_τ(x*, y*)) = {(2 − τε ± √(τ²ε² − 12τε + 4))/2, (−1 + 2τε ± √(4τ²ε² − 12τε + 1))/2}.
Now, fix τ as any arbitrary positive value, then consider the game construction from (51) with ε = 1/τ. For the fixed choice of τ and resulting game construction given the choice of ε, we have that spec(J_τ(x*, y*)) = {(1 ± i√7)/2, (1 ± i√7)/2} ⊂ C°₊. This indicates that the non-equilibrium critical point is stable with respect to the dynamics ż = −Λ_τ g(z) where z = (x, y) for the given choice of τ. Similar to the proof of Proposition J.1, since the choice of τ was arbitrary, the procedure to generate a game with a non-equilibrium critical point that is stable with respect to ż = −Λ_τ g(z) is valid given a choice of τ beforehand. The key distinction between Proposition J.2 and Theorem 2 is analogous to that between Proposition J.1 and Theorem 1. Indeed, the proof and result of Proposition J.2 rely on τ being fixed followed by the game being constructed. On the other hand, in Theorem 2 the game is fixed and then the conditions on τ are given.
To make this clear, consider the game construction from (51) with ε fixed to be an arbitrary positive value. It turns out that spec(J_τ(x*, y*)) ⊄ C°₊ for all τ > 2/ε since Re((2 − τε ± √(τ²ε² − 12τε + 4))/2) < 0. As a result, given the unique critical point of the game, there is a finite τ₀ such that the non-equilibrium critical point is not stable with respect to ẋ = −Λ_τ g(x) for all τ ∈ (τ₀, ∞). In summary, Proposition J.2 shows that there exists a continuum of games for which a non-equilibrium critical point is stable given an unsuitable choice of finite learning rate ratio τ. In contrast, Theorem 2 shows that given a game with a non-equilibrium critical point, there exists a range of finite learning rate ratios such that it is not stable. To recap, the discussion in this section is meant to explicitly contrast Proposition 27 from the work of Jin et al. (2020) with Theorem 1 and Theorem 2 since they may appear contradictory without close inspection. The result of Jin et al. (2020) shows that (i) given a fixed finite learning rate ratio, there exists a game with a differential Stackelberg equilibrium that is not stable and (ii) given a fixed finite learning rate ratio, there exists a game with a non-equilibrium critical point that is stable. From a different perspective, we show that (i) given a fixed game and differential Stackelberg equilibrium, there exists a range of finite learning rate ratios for which the equilibrium is stable (Theorem 1) and (ii) given a fixed game and a non-equilibrium critical point, there exists a range of finite learning rate ratios for which the critical point is not stable (Theorem 2).
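The analogous numerical check for Proposition J.2 (again an illustrative sketch of our own, using the Jacobian from (51)) confirms that with ε = 1/τ the non-equilibrium critical point is stable, while for a fixed game it loses stability once τ > 2/ε:

```python
import numpy as np

def J_tau(eps, tau):
    # Jacobian of tau-GDA at the critical point of the game in (51).
    s = np.sqrt(eps)
    return np.array([[2.0, 0.0, 2.0 * s, 0.0],
                     [0.0, -1.0, 0.0, 2.0 * s],
                     [-2.0 * tau * s, 0.0, -tau * eps, 0.0],
                     [0.0, -2.0 * tau * s, 0.0, 2.0 * tau * eps]])

# Jin et al. construction: fix tau, pick eps = 1/tau -> the non-equilibrium
# critical point is stable (all real parts equal +1/2).
tau = 5.0
eig_bad = np.linalg.eigvals(J_tau(1.0 / tau, tau))
print(eig_bad.real)

# Theorem 2 viewpoint: fix the game (eps = 1), take tau = 3 > 2/eps -> some
# eigenvalues cross into the left-half plane, so the critical point is not stable.
eig_good = np.linalg.eigvals(J_tau(1.0, 3.0))
print(eig_good.real.min())  # negative
```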

K EXPERIMENTS SUPPLEMENT

In this section we present several experiments not included in the body of the paper along with supplemental simulation results and details for the experiments presented in Section 5. We numerically investigate Example 1 in Section K.1 and a game similar to that from Example 2 in Section K.2. After that, we investigate a polynomial game with multiple equilibria in Section K.3. We study a torus game in Section K.4 and examine the connection between timescale separation and the region of attraction. Then, in Section K.5, we return to the Dirac-GAN game and consider the non-saturating objective function. In Section K.6, we explore a generative adversarial network formulation using the Wasserstein cost function with a linear generator and quadratic discriminator for the problem of learning a covariance matrix. We finish in Section K.7 by presenting further results and details on our experiments training generative adversarial networks on image datasets, along with additional generative adversarial network experiments, parameterized by neural networks, on a mixture of Gaussians. Code for the experiments is included in the supplemental material.

K.1 QUADRATIC GAME: TIMESCALE SEPARATION AND STACKELBERG STABILITY

We now revisit the game from Example 1, which demonstrated that there exist differential Stackelberg equilibria that are unstable for some choices of the timescale separation τ. To be clear, we repeat the game construction and some characteristics of the game. Let us consider the quadratic zero-sum game defined by the cost
f(x₁, x₂) = (1/2) [x₁ᵀ x₂ᵀ] M [x₁; x₂], where M =
[ −v    0    −v    0
   0   v/2    0   v/2
  −v    0   −v/2   0
   0   v/2    0   −v ],
where x₁, x₂ ∈ R² and v > 0. The unique critical point of the game, given by x* = (x₁*, x₂*) = (0, 0), is a differential Stackelberg equilibrium. The spectrum of the Jacobian evaluated at the equilibrium is given by
spec(J_τ(x*)) = {v(2τ + 1 ± √(4τ² − 8τ + 1))/4, v(τ − 2 ± √(τ² − 12τ + 4))/4}.
As mentioned in Example 1, it turns out that spec(J_τ(x*)) ⊂ C°₊ only when τ ∈ (2, ∞). We remark that we computed τ* using the theoretical construction from Theorem 1 and found that it recovered the precise value of τ* = 2 such that the equilibrium is stable for all τ ∈ (τ*, ∞) with respect to the dynamics ẋ = −Λ_τ g(x). In the experiments that follow, we consistently observe that the construction of τ* from the theory is tight. For this experiment, we select v = 4 and simulate τ-GDA from the initial condition (x⁰₁, x⁰₂) = (5, 4, 3, 2) with γ₁ = 0.0005 and τ ∈ {2, 2.5, 3, 5, 10}. In Figures 4a and 4b, we show the trajectories of the players' coordinate pairs (x₁₁, x₂₁) and (x₁₂, x₂₂), respectively. We observe that τ-GDA cycles around the equilibrium with τ = 2 since it is marginally stable with respect to the dynamics. For τ ∈ (2, ∞), the equilibrium is stable and τ-GDA converges to it at a rate that depends on the choice of τ. We demonstrate how the convergence rate depends on the choice of τ in Figure 4c by showing the distance from the equilibrium along the learning path for each of the trajectories. The primary observation is that the cyclic behavior of τ-GDA dissipates as τ grows, and as a result the dynamics then rapidly converge to the equilibrium. The behavior of the learning dynamics as a function of the timescale separation τ can be further explained by evaluating the eigenvalues of the game Jacobian at the equilibrium. We show the eigenvalues of the Jacobian at the equilibrium in several forms in Figures 4e, 4f, and 4g. Analyzing the spectrum, we are able to verify that for all τ ∈ (2, ∞) the equilibrium is indeed stable. Moreover, we see that the imaginary parts of the conjugate pairs of eigenvalues decay after τ = 1 and τ = 6, and the eigenvalues of the conjugate pairs eventually become purely real at τ = 1.87 and τ = 11.66, respectively.
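The threshold behavior can be reproduced from the spectrum alone. The sketch below (our own verification script, not the paper's code) assembles J_τ = Λ_τ J(x*) for the quadratic game with v = 4 and confirms that all eigenvalues move into the open right-half plane exactly for τ > 2:

```python
import numpy as np

v = 4.0
# J(x*) for g = (D1 f, -D2 f) in the quadratic game; the last two rows carry
# the sign flip from the ascending player.
J = np.array([[-v, 0.0, -v, 0.0],
              [0.0, v / 2, 0.0, v / 2],
              [v, 0.0, v / 2, 0.0],
              [0.0, -v / 2, 0.0, v]])

def min_real(tau):
    # Smallest real part of spec(Lambda_tau J(x*)); stability requires it > 0.
    J_tau = np.diag([1.0, 1.0, tau, tau]) @ J
    return np.linalg.eigvals(J_tau).real.min()

print(min_real(1.9))  # < 0: unstable below the threshold
print(min_real(2.0))  # ≈ 0: marginally stable at tau* = 2
print(min_real(2.1))  # > 0: stable above the threshold
```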
After the eigenvalues of a conjugate pair become purely real, they split so that one of the eigenvalues asymptotically converges to an eigenvalue of S₁(J(x*)) by moving back along the real line, while the other eigenvalue tends toward an eigenvalue of −τD₂²f(x*). This occurrence is exactly what was described in Section 3 as an immediate implication of Proposition I.1 when the eigenvalues of S₁(J(x*)) and −τD₂²f(x*) are distinct. The convergence rate is in fact limited by the eigenvalue splitting since, as τ grows, the spectrum of the Jacobian is constrained by the eigenvalues of the Schur complement, which remain constant. A related open question centers on finding the worst-case convergence rate as a function of the spectral properties of S₁(J(x*)) and D₂²f(x*). Finally, the evolution of the eigenvalues as a function of the timescale separation τ demonstrates that the rotational dynamics in τ-GDA vanish as the ratio between the magnitude of the real and imaginary parts of the eigenvalues grows.
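The splitting can also be observed numerically (again an illustrative sketch of our own): for large τ, two eigenvalues of J_τ settle near spec(S₁(J(x*))) while the other two track τ·spec(−D₂²f(x*)):

```python
import numpy as np

v = 4.0
D11 = np.array([[-v, 0.0], [0.0, v / 2]])   # D1^2 f(x*)
D12 = np.array([[-v, 0.0], [0.0, v / 2]])   # D12 f(x*)
D22 = np.array([[-v / 2, 0.0], [0.0, -v]])  # D2^2 f(x*)

# Schur complement S1(J(x*)) = D11 - D12 D22^{-1} D21; spec(S1) = {v, 3v/4}.
S1 = D11 - D12 @ np.linalg.inv(D22) @ D12.T

tau = 1e4
J_tau = np.block([[D11, D12],
                  [-tau * D12.T, -tau * D22]])
lam = np.sort(np.linalg.eigvals(J_tau).real)  # all real for large tau

print(lam[:2])        # ≈ spec(S1) = {3, 4} for v = 4
print(lam[2:] / tau)  # ≈ spec(-D22) = {2, 4} for v = 4
```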

K.2 POLYNOMIAL GAME: TIMESCALE SEPARATION AND NON-EQUILIBRIUM STABILITY

We now return to a game similar to that from Example 2: it has a non-equilibrium critical point which is stable without timescale separation and becomes unstable for a range of finite learning rate ratios, with multiple equilibria in the vicinity. Consider a zero-sum game defined by the cost
f(x₁, x₂) = ((5/4)x₁₁² + 2x₁₁x₂₁ + (1/2)x₂₁² − (1/2)x₁₂² + 2x₁₂x₂₂ − x₂₂²)(x₁₁ − 1)² + x₁₁² Σᵢ₌₁² ((x₁ᵢ − 1)² − (x₂ᵢ − 1)²). (53)
This game has critical points at (0, 0, 0, 0), (1, 1, 1, 1), and (−4.73, 0.28, −92.47, 0.53). Among the critical points, only (1, 1, 1, 1) and (−4.73, 0.28, −92.47, 0.53) are game-theoretically meaningful equilibria. In fact, they are each differential Nash equilibria and are locally stable for any choice of τ ∈ (0, ∞) as a result of Proposition F.1. On the other hand, the critical point x* = (0, 0, 0, 0) is neither a differential Nash equilibrium nor a differential Stackelberg equilibrium. However, x* is stable for τ ∈ (0, 2) and marginally stable for τ = 2. In general, convergence to the non-equilibrium critical point x* in the presence of multiple game-theoretically meaningful equilibria would be viewed as undesirable. In fact, this is precisely the type of critical point that sophisticated schemes for converging to only differential Nash equilibria or only differential Stackelberg equilibria seek to avoid (Adolphs et al., 2019; Fiez et al., 2020; Mazumdar et al., 2019; Wang et al., 2020). We show in this example that the simple inclusion of timescale separation in gradient descent-ascent is sufficient to avoid x* and instead converge to a differential Nash equilibrium. Indeed, for all τ ∈ (2, ∞) the non-equilibrium critical point x* is unstable with respect to ẋ = −Λ_τ g(x).
We simulate τ-GDA from the initial condition (x⁰₁, x⁰₂) = (−1.5, 2.5, 2.5, 3) with γ₁ = 0.0005 and τ ∈ {0.75, 2, 5, 12}, where we use the superscript to denote the time index so as not to confuse it with the multiple indices for the player choice variables. In Figures 5a and 5b, we show the trajectories of the players' coordinate pairs (x₁₁, x₂₁) and (x₁₂, x₂₂), respectively. We observe that τ-GDA converges to the non-equilibrium critical point x* with τ = 0.75 as expected, and with τ = 2 the dynamics move near it and then cycle around it since the critical point becomes marginally stable. However, for τ = 5 and τ = 12, τ-GDA avoids the non-equilibrium critical point since it becomes unstable, and instead the dynamics converge to the nearby differential Nash equilibrium. We show the eigenvalues of the Jacobian at the non-equilibrium critical point x* = (0, 0, 0, 0) in several forms in Figures 5d-5f. Again, we observe that the eigenvalues quickly become purely real as τ grows and then split, asymptotically converging toward the eigenvalues of S₁(J(x*)) and −τD₂²f(x*). Together, this example demonstrates that often there is a reasonable finite learning rate ratio such that non-meaningful critical points become unstable for τ-GDA.

K.3 POLYNOMIAL GAME: VECTOR FIELD WARPING AND REGION OF ATTRACTION

Consider a zero-sum game defined by the cost
f(x₁, x₂) = −e^{−(0.01x₁² + 0.01x₂²)}((0.3x₁ + x₂²)² + (0.3x₂ + x₁²)²). (54)
The cost structure of this game is visualized in Figure 6a, where we present a three-dimensional view of −f(x₁, x₂) along with the cost contours and the locations of critical points. This game has eleven critical points, including one differential Nash equilibrium and two differential Stackelberg equilibria that are not differential Nash equilibria. The critical points that are neither a differential Nash equilibrium nor a differential Stackelberg equilibrium are unstable for any choice of timescale separation τ. The differential Nash equilibrium is at (x₁, x₂) = (10.57, −8.95) and it is stable for all τ ∈ (0, ∞) by Proposition F.1. The differential Stackelberg equilibria are at (x₁, x₂) = (−1.625, −1.625) and (x₁*, x₂*) = (−11.03, −11.03); each is stable for all τ ∈ (1, ∞). We computed τ* for the pair of differential Stackelberg equilibria using the theoretical construction from Theorem 1 and observed that it properly recovered τ* = 1 for each equilibrium as the timescale separation such that the continuous-time system is stable for all τ ∈ (τ*, ∞).

[Figure 7: Experimental results for the polynomial game defined in (54) of Section K.3. In Figure 7a, we overlay the trajectories from Figure 6b produced by τ-GDA onto the vector field generated by the choice of timescale separation τ; the shading of the vector field is dictated by its magnitude, so that lighter shading corresponds to a higher magnitude and darker shading to a lower magnitude. Figure 7b demonstrates the effect of timescale separation on the regions of attraction around critical points by coloring points in the strategy space according to the equilibrium to which τ-GDA converges; areas without coloring indicate where τ-GDA did not converge in the time horizon.]
Finally, we note that while the equilibria appear to be related by a translation, this game is generic and the equilibria are in fact isolated. In Figure 6b, we show the trajectories of τ-GDA with γ₁ = 0.0001 and τ ∈ {1, 2, 5, 20} given the initialization (x⁰₁, x⁰₂) = (−9, −9) near the differential Stackelberg equilibrium at (x₁*, x₂*) = (−11.03, −11.03). Moreover, in Figure 7a, we overlay the trajectories on the vector field generated by the respective timescale separation parameters. As expected, the choice of τ = 1 results in a trajectory that cycles around the equilibrium in a closed curve since it is marginally stable and J_τ(x*) has purely imaginary eigenvalues. Notably, as τ grows, the cyclic behavior dissipates as the timescale separation reshapes the vector field, until the trajectory moves nearly directly to the zero-derivative line of the maximizing player and then follows a path along that line toward the equilibrium and converges rapidly. The eigenvalues of J_τ(x*) as a function of τ are presented in Figures 6c and 6d. As was the case for the previous experiments, we observe that after the eigenvalues become purely real as τ grows, they then split and asymptotically converge toward the eigenvalues of S₁(J(x*)) and −τD₂²f(x*). It is worth noting that much of the rotational behavior in the dynamics and vector field disappears as a result of timescale separation well before the eigenvalues become purely real; this seems to occur once the timescale separation is such that the magnitude of the real part of the eigenvalues is greater than that of the imaginary part. Finally, in Figure 7b, we demonstrate how the choice of timescale separation τ not only warps the vector field but also shapes the regions of attraction around critical points. The vector field is again shown for each τ ∈ {1, 2, 5, 20}, but now zoomed out to include each of the equilibria.
The colors overlaid on the vector field indicate the equilibria that the dynamics converge to given an initialization at that position. Positions in the strategy space without color did not converge to an equilibrium in the fixed horizon of 75000 iterations with γ₁ = 0.001. This is explained by the fact that the dynamics are not guaranteed to be globally convergent and may get stuck in limit cycles or may simply move slowly for a long time in flat regions of the optimization landscape. We produced this experiment by running τ-GDA from a dense set of initial conditions chosen uniformly over the space of interest. It is clear from the experiment that the choice of timescale separation not only determines the stability of equilibria, but also has a fundamental impact on the equilibria the dynamics converge to from a given initial condition as a result of the warping of the vector field. As a concrete example, given an initialization of (x₁, x₂) = (−10, −2), the dynamics with τ = 1 converge to the differential Nash equilibrium at (x₁, x₂) = (10.57, −8.95). However, for any τ > 1, the dynamics instead converge to the differential Stackelberg equilibrium at (x₁, x₂) = (−11.03, −11.03) that is significantly closer to the initial condition. This example motivates future work on methods for obtaining accurate estimates of the regions of attraction around critical points and techniques to design τ in order to explicitly shape the region of attraction around an equilibrium of interest. We refer to the end of Section G for further discussion on potentially relevant analysis methods in this direction.

K.4 LOCATION GAME ON THE TORUS

We use the example in this section to further study the role of timescale separation on the regions of attraction around critical points. Consider the zero-sum game defined by the cost
f(x₁, x₂) = −0.15 cos(x₁) + cos(x₁ − x₂) + 0.15 cos(x₂). (55)
This game can be interpreted as a location game on the torus. Specifically, the first player seeks to be far from the second player but near zero, while the second player seeks to be near the first player. This is a non-convex game on a non-convex strategy space. The critical points are given by the setfoot_11
{x : g(x) = 0} = {(0, 0), (π, π), (π, 0), (0, π), (−1.646, −1.496), (1.646, 1.496)}.
The critical points (0, 0) and (π, π) are the only differential Stackelberg equilibria, and neither is a differential Nash equilibrium. The differential Stackelberg equilibrium at (0, 0) is stable for all τ ∈ (τ*, ∞) where τ* = 0.74, and the differential Stackelberg equilibrium at (π, π) is stable for all τ ∈ (τ*, ∞) where τ* = 1.35. The rest of the critical points are unstable for any choice of τ. We remark that we computed τ* for each differential Stackelberg equilibrium using the construction from Theorem 1 in Section 3, and it again gave the exact value of τ* such that the system is stable for all τ > τ*. In Figure 8a, we show the trajectories of τ-GDA with γ₁ = 0.001 and τ ∈ {1, 2, 5, 10} given the initializations (x⁰₁, x⁰₂) = (2, −1) and (x⁰₁, x⁰₂) = (1.9, −2.1), overlaid on the vector field generated by the respective timescale separation parameters. We observe that as the timescale separation τ grows, the rotational dynamics in the vector field dissipate and the directions of movement become sharp. As we mentioned in previous examples, τ-GDA moves directly to the zero line of −D₂f(x₁, x₂) and then along that line to an equilibrium given sufficient timescale separation.
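The thresholds τ* = 0.74 and τ* = 1.35 follow from a trace condition on the 2×2 Jacobian at each equilibrium. The sketch below (our own check based on (55), not the paper's code) computes them in closed form:

```python
import numpy as np

# f(x1, x2) = -0.15*cos(x1) + cos(x1 - x2) + 0.15*cos(x2), cf. (55).
def hessian_blocks(x1, x2):
    d11 = 0.15 * np.cos(x1) - np.cos(x1 - x2)    # D1^2 f
    d12 = np.cos(x1 - x2)                         # D12 f
    d22 = -np.cos(x1 - x2) - 0.15 * np.cos(x2)    # D2^2 f
    return d11, d12, d22

def tau_star(x1, x2):
    # For the Jacobian J_tau = diag(1, tau) @ [[d11, d12], [-d12, -d22]],
    # the determinant is positive at both equilibria, so stability flips
    # exactly where the trace d11 - tau*d22 crosses zero.
    d11, d12, d22 = hessian_blocks(x1, x2)
    return d11 / d22

print(round(tau_star(0.0, 0.0), 2))      # 0.74
print(round(tau_star(np.pi, np.pi), 2))  # 1.35
```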
The warping of the vector field that occurs as a result of timescale separation impacts the equilibrium that the dynamics converge to from a fixed initial condition and the neighborhood on which τ-GDA converges to an equilibrium. In other words, the regions of attraction around critical points depend heavily on the timescale separation τ. To illustrate this fact, in Figure 8b we show the regions of attraction for each choice of timescale separation. The vector fields are again shown for each τ ∈ {1, 2, 5, 10}, but now with colors overlaid indicating the equilibria that the dynamics converge to given an initialization at that position. This experiment was generated by running τ-GDA from a dense set of initial conditions chosen uniformly over the strategy space. Positions in the strategy space without color did not converge to an equilibrium in the fixed horizon of 20000 iterations with γ₁ = 0.04. This happens when τ-GDA is not initialized in the local neighborhood of attraction around a stable equilibrium. For the choice of τ = 1, (0, 0) is the only stable equilibrium. However, as demonstrated in Figure 8a, τ-GDA fails to converge to the equilibrium from the initial conditions (x⁰₁, x⁰₂) = (2, −1) and (x⁰₁, x⁰₂) = (1.9, −2.1).

[Figure 8: Experimental results for the torus game defined in (55) of Appendix K.4. In Figure 8a, we overlay multiple trajectories produced by τ-GDA onto the vector field generated by the choice of timescale separation τ; the shading of the vector field is dictated by its magnitude, so that lighter shading corresponds to a higher magnitude and darker shading to a lower magnitude. Figure 8b demonstrates the effect of timescale separation on the regions of attraction around critical points by coloring points in the strategy space according to the equilibrium to which τ-GDA converges; areas without coloring indicate where τ-GDA did not converge in the time horizon.]
This behavior is further demonstrated over the strategy space in Figure 8b and highlights the local nature of the guarantees, since convergence is only assured given an initialization in a suitable local neighborhood around a stable critical point. On the other hand, τ-GDA converges to an equilibrium from any initial condition for τ ∈ {2, 5, 10}, as can be seen in Figure 8b. Notably, the equilibrium to which the learning dynamics converge depends on the timescale separation and the initial condition. To give a concrete example, consider the initial conditions shown in Figure 8a of (x⁰₁, x⁰₂) = (2, −1) and (x⁰₁, x⁰₂) = (1.9, −2.1). For the initial condition (x⁰₁, x⁰₂) = (2, −1), τ-GDA converges to the equilibrium at (0, 0) for each τ ∈ {2, 5, 10}. Yet, for the initial condition (x⁰₁, x⁰₂) = (1.9, −2.1), τ-GDA converges to the equilibria (0, 0), (π, π), and (π, π) for the respective choices of τ ∈ {2, 5, 10}. In other words, the regions of attraction around the critical points change so that, from a fixed initial condition, τ-GDA may converge to distinct equilibria depending on the timescale separation. From Figure 8b, we see that the region of attraction containing (x⁰₁, x⁰₂) = (1.9, −2.1) grows from τ = 1 to τ = 2 and τ = 5, but then shrinks at τ = 10. This example highlights that timescale separation has a fundamental impact on the regions of attraction around critical points, and as τ grows it is possible for the region of attraction around an equilibrium to shrink. Collectively, this motivates explicit methods for trying to shape the region of attraction around desirable equilibria.

[Figure 9: Experimental results for the Dirac-GAN game defined in (56) of Appendix K.5. Figure 9a shows trajectories of τ-GDA for τ ∈ {1, 4, 8, 16} with regularization µ = 0.3 and τ = 1 with regularization µ = 1. Figure 9b shows the distance from the equilibrium along the learning paths. Figure 9d shows the trajectories of τ-GDA overlaid on the vector field generated by the respective timescale separation and regularization parameters; the shading of the vector field is dictated by its magnitude, so that lighter shading corresponds to a higher magnitude and darker shading to a lower magnitude.]
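Returning to the torus game, a coarse version of the region-of-attraction scan described above is easy to reproduce. The sketch below (a scaled-down illustration; the grid, step size, and horizon are our own choices, not the paper's) runs τ-GDA from a grid of initial conditions and classifies the limit points against the stable equilibria:

```python
import numpy as np

def grad(x1, x2):
    # Gradient of f(x1, x2) = -0.15*cos(x1) + cos(x1 - x2) + 0.15*cos(x2), cf. (55).
    d1 = 0.15 * np.sin(x1) - np.sin(x1 - x2)
    d2 = np.sin(x1 - x2) - 0.15 * np.sin(x2)
    return d1, d2

def tau_gda(x1, x2, tau, gamma=0.05, iters=5000):
    # Discrete-time tau-GDA: player 1 descends, player 2 ascends at rate tau*gamma.
    for _ in range(iters):
        d1, d2 = grad(x1, x2)
        x1, x2 = x1 - gamma * d1, x2 + tau * gamma * d2
    return x1, x2

def torus_dist(p, q):
    # Maximum angular distance between two points on the torus.
    d = np.abs(np.mod(np.array(p) - np.array(q) + np.pi, 2 * np.pi) - np.pi)
    return d.max()

tau = 5.0
stable = [(0.0, 0.0), (np.pi, np.pi)]
grid = np.linspace(-3.0, 3.0, 7) + 0.11  # offset the grid away from symmetry lines
hits, total = 0, 0
for a in grid:
    for b in grid:
        total += 1
        if min(torus_dist(tau_gda(a, b, tau), eq) for eq in stable) < 0.3:
            hits += 1
frac = hits / total
print(frac)  # with tau = 5, most initializations reach one of the two stable equilibria
```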

K.5 DIRAC-GAN AND REGULARIZATION: NON-SATURATING FORMULATION

In Section 5, we presented experiments for the Dirac-GAN game studied by Mescheder et al. (2017) using the original generative adversarial network formulation of Goodfellow et al. (2014). In this section, we revisit the Dirac-GAN game using the non-saturating generative adversarial network formulation also proposed by Goodfellow et al. (2014). Recall that the zero-sum game which arises from the original objective with regularization µ > 0 is defined by the cost
f(θ, ω) = ℓ(θω) + ℓ(0) − (µ/2)ω². (56)
As discussed in Section 5, the unique critical point of the game is (θ*, ω*) = (0, 0) and it corresponds to the local Nash equilibrium of the unregularized game and a differential Stackelberg equilibrium of the regularized game. Moreover, the equilibrium is stable with respect to the continuous-time dynamics for all τ > 0 and µ > 0, so the discrete-time update τ-GDA converges with a suitable learning rate γ₁. The non-saturating generative adversarial network formulation proposed by Goodfellow et al. (2014), in the context of the Dirac-GAN game, corresponds to player 1 maximizing ℓ(−θω) instead of minimizing ℓ(θω). This results in the general-sum game defined by the costs
(f₁(θ, ω), f₂(θ, ω)) = (−ℓ(−θω) + ℓ(0) − (µ/2)ω², −ℓ(θω) − ℓ(0) + (µ/2)ω²).
As shown by Mescheder et al. (2018), the unique critical point of the game remains at (θ*, ω*) = (0, 0). Moreover, it can be observed that J_τ(θ*, ω*) in this formulation is identical to the game Jacobian from the original formulation, so the conclusions on stability and convergence carry over.

K.6 WASSERSTEIN GAN: LEARNING A COVARIANCE MATRIX

In this section, we consider a generative adversarial network formulation using the Wasserstein cost with a linear generator and a quadratic discriminator for the problem of learning a covariance matrix. We begin by considering the simplest form of this problem, which is that of dimension d = 1. The critical points with this restriction are (V*, W*) = (σ, 0) and (V*, W*) = (−σ, 0), and the game Jacobian evaluated at them is
J_τ(V*, W*) = [0, −2σ; 2τσ, τµ].
Each critical point is a local Nash equilibrium of the unregularized game and a differential Stackelberg equilibrium of the regularized game since −D₂²f(V*, W*) = µ > 0 and S₁(J(V*, W*)) = 4σ²/µ > 0.
Furthermore, spec(J_τ(V*, W*)) = {(τµ ± √(τ²µ² − 16τσ²))/2}, so each critical point is stable for all τ ∈ (0, ∞) and µ ∈ (0, ∞) since spec(J_τ(V*, W*)) ⊂ C°₊. Thus, given a suitably chosen learning rate γ₁, the discrete time update τ-GDA locally converges to an equilibrium. For this reason, we focus on studying the rate of convergence for the problem as a function of timescale separation and regularization. Figures 10a, 10b, and 10c show the distance from an equilibrium along the learning path of τ-GDA with τ ∈ {1, 5, 10, 25} given a fixed initial condition with learning rate γ₁ = 0.001 and regularization µ ∈ {0.5, 0.75, 1}, respectively. Moreover, Figures 10d, 10e, and 10f show the trajectories of the eigenvalues of J_τ(V*, W*) as a function of τ for the regularization parameters µ ∈ {0.5, 0.75, 1}. Finally, Figures 11a, 11b, and 11c show the trajectories of τ-GDA overlaid on the vector field generated by the respective timescale separation and regularization parameters. From the eigenvalue trajectories, we see that as µ grows, the eigenvalues become purely real at a smaller value of τ. Moreover, as µ increases, the magnitude of the real and imaginary parts of the eigenvalues decreases. We observe the effect of this on the convergence, where the dynamics do not cycle as much for larger µ. Again, we see the trade-off between timescale separation, regularization, and convergence. For example, despite the eigenvalues being purely real with µ = 1 and τ = 25, so that there are no rotational dynamics, the convergence is slower than for µ = 0.75 where the eigenvalues have a non-zero imaginary part.
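These spectral claims are easy to sanity-check numerically. The sketch below (our illustration, not code from the paper) builds the stated 2 × 2 game Jacobian and verifies both the stability claim and the threshold τ ≥ 16σ²/µ² after which the eigenvalues become purely real:

```python
import numpy as np

def covariance_game_jacobian(sigma, mu, tau):
    """Game Jacobian of the d = 1 covariance-learning game at (V*, W*) = (±sigma, 0),
    as stated in the text: J_tau = [[0, -2*sigma], [2*tau*sigma, tau*mu]]."""
    return np.array([[0.0, -2.0 * sigma], [2.0 * tau * sigma, tau * mu]])

sigma, mu = 1.0, 0.75
for tau in [1.0, 5.0, 10.0, 30.0]:
    eigs = np.linalg.eigvals(covariance_game_jacobian(sigma, mu, tau))
    # Stability for every tau, mu > 0: both eigenvalues lie in the open
    # right half plane, matching spec(J_tau) \subset C_+.
    assert np.all(eigs.real > 1e-12)
    # The discriminant tau^2*mu^2 - 16*tau*sigma^2 determines when the
    # eigenvalues become purely real (tau >= 16*sigma^2/mu^2 ~ 28.4 here).
    disc = tau**2 * mu**2 - 16.0 * tau * sigma**2
    assert (disc >= 0) == bool(np.all(np.abs(eigs.imag) < 1e-9))
```

With µ = 0.75 and σ = 1 only τ = 30 exceeds the purely-real threshold, which matches the eigenvalue trajectories discussed above.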
Figures 10g, 10h, and 10i show the distance from a critical point along the learning path of τ-GDA with τ ∈ {1, 5, 10, 25} given a fixed initial condition with learning rate γ₁ = 0.001, regularization µ = 1, and the dimension of the problem d ∈ {5, 10, 20}, respectively. The primary purpose of showing this set of results is to make clear that the behavior for d = 1, which is easier to explain and visualize, transfers to higher dimensional formulations of this problem. This is to be expected since the problem dimension is not fundamental to the convergence rate; rather, the rate depends on the conditioning of Σ, and each Σ was chosen so that the behavior was comparable for each choice of dimension.

K.7 GENERATIVE ADVERSARIAL NETWORKS PARAMETERIZED BY NEURAL NETWORKS

In this section, we provide a much more detailed discussion of our experiments training generative adversarial networks on image datasets than was included in Section 5 and present the results in greater depth. We also present experiments training a generative adversarial network with τ-GDA to learn a mixture of Gaussians.

Background. The empirical benefits of training with a timescale separation have been documented previously. For example, Heusel et al. (2017) showed on a number of image datasets that a timescale separation between the generator and discriminator improves generation performance as measured by the Fréchet Inception Distance (FID). Since then, a significant number of papers have presented results training generative adversarial networks with timescale separation. Moreover, it is common in the literature for the discriminator to be updated multiple times between each update of the generator (Arjovsky et al., 2017). Indeed, it has been widely demonstrated that this heuristic improves the stability and convergence of the training process, and locally it has a similar effect to including a timescale separation between the generator and discriminator. The disadvantage of this approach is that the number of gradient calls per generator update increases, and consequently the convergence is slower in terms of wall-clock time when a similar effect could potentially be achieved by a learning rate separation between the generator and discriminator. We remark that it appears to be reasonably common for practitioners to fix a shared learning rate for the generator and discriminator along with a pre-selected number of discriminator updates per generator update, and not to thoroughly investigate the impact timescale separation has on the training process.
The goal of our generative adversarial network experiments is to reinforce the importance of the timescale separation between the generator and the discriminator as a hyperparameter in the training process, demonstrate how it changes the behavior along the learning path, and show that it is compatible with a number of common training heuristics. That is, our goal is not necessarily to show state-of-the-art performance, but rather to perform experiments that allow us to draw insights relevant to the theory in this paper. We remark that our empirical work on training generative adversarial networks is distinct from and complementary to that of Heusel et al. (2017) in several ways. The theory given by Heusel et al. (2017) only applies to decaying stochastic approximation stepsizes; however, in their experiments they implemented constant stepsizes. We train with mini-batches and decaying stepsizes in our image dataset experiments, which does satisfy the theory we provide, as detailed in Section H. Moreover, by and large, the experiments of Heusel et al. (2017) compare a fixed learning rate ratio between the generator and discriminator to multiple fixed shared learning rates for the generator and discriminator. In contrast, we fix a learning rate for the generator and explore the behavior of the training process as the timescale parameter τ is swept over a given range.
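For concreteness, the τ-GDA update studied throughout can be sketched as follows. The quadratic objective in the usage example is our own toy construction with a strict local minmax equilibrium at the origin, not one of the games from the paper:

```python
import numpy as np

def tau_gda(grad1, grad2, x1, x2, gamma1, tau, steps):
    """Simultaneous gradient descent-ascent with finite timescale separation:
    player 1 descends f with learning rate gamma1 while player 2 ascends f
    with learning rate gamma2 = tau * gamma1."""
    gamma2 = tau * gamma1
    for _ in range(steps):
        g1, g2 = grad1(x1, x2), grad2(x1, x2)
        x1 = x1 - gamma1 * g1  # simultaneous update: both gradients are
        x2 = x2 + gamma2 * g2  # evaluated at the same iterate
    return x1, x2

# Toy example: f(x1, x2) = 0.5*x1^2 + x1*x2 - 0.5*x2^2, a strict local
# minmax problem with equilibrium (0, 0).
d1 = lambda x1, x2: x1 + x2   # D_1 f
d2 = lambda x1, x2: x1 - x2   # D_2 f
x1, x2 = tau_gda(d1, d2, 1.0, 1.0, gamma1=0.05, tau=4.0, steps=2000)
assert abs(x1) < 1e-6 and abs(x2) < 1e-6
```

Sweeping `tau` in this sketch is exactly the experimental protocol described above: γ₁ is held fixed while γ₂ = τγ₁ varies.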

K.7.1 GENERATIVE ADVERSARIAL NETWORKS: MIXTURE OF GAUSSIANS

We now provide the results from training generative adversarial networks to learn a mixture of Gaussians. The underlying data distribution consists of Gaussian distributions arranged in a ring with means given by µ = [sin(ω), cos(ω)] for ω ∈ {kπ/4}, k = 0, …, 7, each with covariance σ²I where σ² = 0.05. Each sample of real data given to the discriminator is drawn uniformly at random from the set of Gaussian distributions. We train the generator using latent vectors z ∈ R¹⁶ sampled from a standard normal distribution in each training batch. The batch size for each player in the game is 512. The generator and discriminator networks contain two and one hidden layers, respectively, each of which contains 32 neurons with ReLU activation functions. The training objective is the non-saturating objective, and we run experiments without and with the R₁ gradient penalty proposed by Mescheder et al. (2018) using parameter µ = 0.1. The generator learning rate is fixed at γ₁ = 0.005 and the discriminator learning rate is fixed as γ₂ = τγ₁, where we experiment with τ ∈ {4, 8, 16, 32, 64, 100}. For each parameter choice (timescale separation τ and regularization µ), the experiment is repeated with 50 random seeds. The training does not rely on any adaptive gradient methods (Adam, RMSprop, etc.); it is the 'vanilla' stochastic τ-GDA dynamics. We evaluate the performance along the learning path by computing the KL-divergence between the generated data and the real data, where we sample 4096 data points from each. The results of this experiment are presented in Figure 12. We show the mean of the KL-divergence and the standard error of the means across the runs along the learning path without (µ = 0) and with regularization (µ = 0.1) in Figures 12a and 12b, respectively. Moreover, Figures 12c and 12d show the medians of the KL-divergence across the runs without (µ = 0) and with regularization (µ = 0.1), respectively.

In Figure 18 we include the hyperparameters that were selected. The architectures are analogous to those reported in Mescheder et al.
(2018), but scaled down since we run experiments with 32 × 32 × 3 images. For evaluation, we computed the Fréchet Inception Distance using 10k samples from the real and generated data. For both experiments and across the set of hyperparameters, we performed the evaluation using a fixed random noise vector to make for an equal comparison, along with a fixed set of real images which were randomly selected. The evaluation was done using the training data. We used the FID score implementation in PyTorch available at https://github.com/mseitzer/pytorch-fid. We train the generative adversarial networks with the non-saturating objective function and the R₁ gradient penalty proposed by Mescheder et al. (2018) with regularization parameters µ ∈ {1, 10}. We note that the non-saturating objective results in a game that is not zero-sum; however, it is commonly used in practice, and under the realizability assumptions it can be locally equivalent to the zero-sum objective as discussed in Section K.5. The theory we provide does not apply to using RMSprop, but it is ubiquitous in practice for training generative adversarial networks, and we are interested in exploring the interplay of timescale separation with common heuristics to understand whether conclusions similar to those from the previous experiments with the 'vanilla' τ-GDA dynamics continue to hold. Moreover, we note that Heusel et al. (2017) and Mescheder et al. (2018) similarly rely upon Adam or RMSprop in generative adversarial network experiments. A final heuristic and hyperparameter that we explore in conjunction with the timescale separation τ is that of using an exponential moving average to produce the model that is evaluated. This means that at each update k, given that the parameters of the generator are x_{1,k}, the moving average x̄_k = βx̄_{k−1} + (1 − β)x_{1,k} is kept, where β ∈ (0, 1).
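A minimal sketch of this parameter-averaging heuristic, assuming the standard EMA convention x̄_k = βx̄_{k−1} + (1 − β)x_{1,k}:

```python
import numpy as np

def update_ema(x_bar, x_new, beta):
    """One exponential-moving-average step for the evaluated generator:
    x_bar_k = beta * x_bar_{k-1} + (1 - beta) * x_{1,k}."""
    return beta * x_bar + (1.0 - beta) * x_new

# With beta close to 1 the average changes slowly, damping oscillations and
# mini-batch noise; here the averaged parameters approach the (constant)
# iterate at rate beta^k.
x_bar = np.zeros(3)
for k in range(10_000):
    params = np.ones(3)  # stand-in for the generator parameters x_{1,k}
    x_bar = update_ema(x_bar, params, beta=0.999)
assert np.allclose(x_bar, 1.0, atol=1e-3)
```

In practice one copy of `x_bar` is kept per β ∈ {0.99, 0.999, 0.9999} and the averaged model, not the raw iterate, is evaluated.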
Experimental studies have shown that this heuristic can yield a significant improvement in terms of both the inception score and the FID (Gidel et al., 2019a; Yazici et al., 2019). The success of this method is thought to be a result of dampening both rotational dynamics and the noise from the randomness in the mini-batches of data.

Experimental Results. We run the training algorithm with the learning rate ratio τ belonging to the set {1, 2, 4, 8} for CIFAR-10 and {1, 2, 4, 8, 16} for CelebA, along with the regularization parameter µ belonging to the set {1, 10}. For each choice of τ and µ, we retain exponential moving averages of the generator parameters for β ∈ {0.99, 0.999, 0.9999}. The training process is repeated 2 times for each hyperparameter configuration in the CIFAR-10 experiments, and the experiments with CelebA are run once. The performance is evaluated along the learning path every 10,000 updates in terms of the FID score. We report the mean scores and the standard error of the mean over the repeated experiments for each dataset. Note that for the FID score, lower is better. The experiments are computationally intensive, which limits the number of repeats that can be simulated; however, we observed that the scores were quite consistent between random seeds, particularly with exponential averaging of the parameters. We run the experiments with µ = 10 for 150k mini-batch updates and the experiments with µ = 1 for 300k mini-batch updates. The results for each dataset across the hyperparameter configurations are presented in numeric form in Figure 15. Figure 16 shows generated samples selected at random for each dataset with the hyperparameter configuration that performed best in terms of the FID score at the end of the training process. We now describe the key observations from the experiments for each dataset.

CIFAR-10.
The FID scores along the learning path for CIFAR-10 with µ = 10 and µ = 1 are presented in Figures 13a and 13b, respectively. The corresponding scores in numeric form are given in Figures 15a, 15c, and 15e for µ = 10 at 150k iterations and µ = 1 at 150k and 300k iterations, respectively. To begin, we observe that the exponential moving average significantly improves performance, and of the parameters considered, β = 0.9999 performed best. This may be a result of removing noise as mentioned previously, or potentially it could be from dampening oscillatory behavior in the dynamics. Moreover, we see that timescale separation also has a significant impact on the FID score of the training process. Indeed, even selecting τ = 2 versus τ = 1 can yield an impressive performance gain. In this experiment, for each regularization parameter, τ = 4 converges fastest and performs the best. We see that τ = 2 outperforms τ = 8 when µ = 10; the relationship is flipped when viewing the evaluation at 150k updates with µ = 1, and then returns when looking at the evaluation at 300k updates. The choice of τ = 1 performs the worst for each regularization parameter by a wide margin.

A₁₂ < 0 by Assumption 1. Now, we can conclude for this entire class of games that τ* = 0 precisely because Q has no positive real roots. Indeed, to start, recall from the proof that we need to find τ* = λ_max^+(−M_2(A_{11} ⊗ I_{n2} + M_1Q)). Hence, as long as M_2 ≥ 0, we know that τ* = 0. Here is where we can exploit the properties from Magnus (1988), with M_1 = −2H



Footnotes. Following Fiez et al. (2020), we refer to strict local Stackelberg equilibria as differential Stackelberg equilibria throughout. Note that differential Nash equilibria are a subset of differential Stackelberg equilibria (Fiez et al., 2020; Jin et al., 2020).
See Lancaster & Tismenetsky (1985); Magnus (1988) for more detail on the definition and properties of these mathematical operators, and Appendix C for more detail directly related to their use in this paper.
For example, ℓ(x) = −log(1 + exp(−x)) gives the original formulation of Goodfellow et al. (2014).
Indeed, this holds since the only scenarios in which det(AᵀA) = 0 are such that the eigenvalues of A do not lie in S(C°₋).
Mescheder et al. (2018) imply that their results hold for a convex combination of the two gradient penalties, which would in turn imply our results hold in this case. However, we have not included the details here.
If a block matrix Q with block entries Q_{ij} for i, j ∈ {1, 2} is positive definite symmetric, then Q_{ii} > 0 for i = 1, 2.
The norm that exists can easily be constructed as essentially a weighted induced 1-norm. Note that the norm construction is not unique. The proof in Horn & Johnson (1985) is by construction, and the construction of this norm can be found there.
To date it has not been shown that for a sufficient separation in timescale the only critical point attractors are local minmax.
Here, the ≈ means, e.g., D₁f₁(x, z) = D₁f₁(x*, z*) + D₁²f₁(x*, z*)(x − x*) + D₁₂f₁(x*, z*)(z − z*) + O(‖x − x*‖² + ‖z − z*‖²), and similarly for D₂f₂(x, z).
Recall that having distinct eigenvalues is a generic condition for an n₁ × n₁ matrix, though it is not explicitly required for the asymptotic results; it is only a condition for the big-O approximation λᵢ = λᵢ(A₀) + O(ε) for i = 1, …, n₁ and λᵢ = ε⁻¹(λⱼ(A₂₂) + O(ε)) where i = n₁ + j for j = 1, …, n₂.
Note that because the joint strategy space is a torus, (±π, ±π) = (∓π, ±π), (π, 0) = (−π, 0), and (0, −π) = (0, π).



and let T_{θ*}M_G and T_{ω*}M_D denote their respective tangent spaces at θ* and ω*. As in Mescheder et al. (2018), we make the following assumption.

Assumption 1. Consider a zero-sum game of the form given in (5) where f ∈ C²(R^{n₁} × R^{n₂}, R), G(·; θ) and D(·; ω) are the generator and discriminator networks, respectively, and x = (θ, ω) ∈ R^{n₁} × R^{n₂}. Suppose that x* = (θ*, ω*) is an equilibrium. Then, (a) at (θ*, ω*), p_{θ*} = p_D and D(x; ω*) = 0 in some neighborhood of supp(p_D); (b) the function ℓ ∈ C²(R) satisfies ℓ'(0) ≠ 0 and ℓ''(0) < 0; (c) there are ε-balls B_ε(θ*) and B_ε(ω*) centered around θ* and ω*, respectively, so that

Figure 1: Experimental results for the Dirac-GAN game of Section 5. (a) Trajectories of τ-GDA; (b) distance to equilibrium; (c) spec(J_τ) with µ = 0.3; (d) spec(J_τ) with µ = 1; (f) trajectories of τ-GDA overlaid on vector fields generated by choices of τ and µ.

Fiez et al. (2020); Jin et al. (2020) present sufficient conditions for local Stackelberg equilibria, and Fiez et al. (2020) study their genericity and structural stability.

Figure 4: Experimental results for the quadratic game defined in (52) of Section K.1 and presented in Example 1. Figures 4a and 4b show trajectories of the players' coordinate pairs (x₁₁, x₁₂) and (x₂₁, x₂₂) for a range of learning rate ratios, respectively. Figure 4c shows the distance from the equilibrium along the learning paths. Figures 4e, 4f, and 4g show the trajectories of the eigenvalues, the real parts of the eigenvalues, and the imaginary parts of the eigenvalues of J_τ(x*) as a function of τ, respectively.

Figure 5: Experimental results for the polynomial game defined in (53) of Section K.2 and presented in Example 2. Figures 5a and 5b show trajectories of the players' coordinate pairs (x₁₁, x₁₂) and (x₂₁, x₂₂) for a range of learning rate ratios, respectively. Figures 5d, 5e, and 5f show the trajectories of the eigenvalues, the real parts of the eigenvalues, and the imaginary parts of the eigenvalues of J_τ(x*) as a function of τ, respectively, where x* is the non-equilibrium critical point.

Figure 6: Experimental results for the polynomial game defined in (54) of Section K.3. Figure 6a provides a 3D view of the cost function −f(x₁, x₂) along with the cost contours and critical point locations. Figure 6b shows trajectories of τ-GDA for a range of learning rate ratios given an initialization around the differential Stackelberg equilibrium (x*₁, x*₂) = (−11.03, −11.03). Figures 6c and 6d show the evolution of the eigenvalues of J_τ(x*) as a function of τ, where x* is the differential Stackelberg equilibrium (x*₁, x*₂) = (−11.03, −11.03).

Figure 10 panels: (g) d = 5, µ = 1; (h) d = 10, µ = 1; (i) d = 20, µ = 1; (j) legend for Figures 10a, 10b, 10c, 10g, 10h, and 10i.

Figure 10: Experimental results for the generative adversarial network formulation for learning a covariance matrix defined by the cost from (58) of Section K.6. Figures 10a, 10b, and 10c show the distance from the equilibrium along the learning paths of τ-GDA with d = 1. Figures 10d, 10e, and 10f show the trajectories of the eigenvalues of J_τ(x*) as a function of τ for µ ∈ {0.5, 0.75, 1}, respectively. Figures 10g, 10h, and 10i show the distance from the equilibrium along the learning paths of τ-GDA with d = 5, 10, 20.

Figure 11: Experimental results for learning a covariance matrix defined by the cost from (58) of Section K.6. We overlay the trajectories produced by τ-GDA onto the vector field generated by the choices of τ and µ. The shading of the vector field is dictated by its magnitude: lighter shading corresponds to a higher magnitude and darker shading to a lower magnitude.

Figure 14: CelebA FID scores with regularization µ = 10 in Figure 14a and µ = 1 in Figure 14b.

Figure 15: FID Scores on CIFAR-10 and CelebA.

I_{n1} ⊗ (µR + C)^{-1} − 2(I_{n1} ⊗ (µR + C)^{-1}B^T)H_{n1}((B(µR + C)^{-1}B^T) ⊕ (B(µR + C)^{-1}B^T))^{-1}H_{n1}^+(I_{n1} ⊗ B(µR + C)^{-1}).

For later use, let us define

M_3 = −2(I_{n1} ⊗ (µR + C)^{-1}B^T)H_{n1}((B(µR + C)^{-1}B^T) ⊕ (B(µR + C)^{-1}B^T))^{-1}H_{n1}^+(I_{n1} ⊗ B(µR + C)^{-1})
= −2(I_{n1} ⊗ (µR + C)^{-1}B^T)H_{n1}(H_{n1}^+(B(µR + C)^{-1}B^T ⊗ I_{n1} + I_{n1} ⊗ B(µR + C)^{-1}B^T)H_{n1})^{-1}H_{n1}^+(I_{n1} ⊗ B(µR + C)^{-1}).

2H_{n1}^+((B(µR + C)^{-1}B^T) ⊕ (B(µR + C)^{-1}B^T))^{-1}H_{n1} = (H_{n1}^+(I_{n1} ⊗ B(µR + C)^{-1}B^T)H_{n1})^{-1}.

Hence,

M_3 = −(I_{n1} ⊗ (µR + C)^{-1}B^T)(I_{n1} ⊗ B(µR + C)^{-1}B^T)^{-1}(I_{n1} ⊗ B(µR + C)^{-1})
= −(I_{n1} ⊗ (µR + C)^{-1}B^T)(I_{n1} ⊗ (B(µR + C)^{-1}B^T)^{-1})(I_{n1} ⊗ B(µR + C)^{-1})
= −(I_{n1} ⊗ (µR + C)^{-1}B^T(B(µR + C)^{-1}B^T)^{-1})(I_{n1} ⊗ B(µR + C)^{-1})
= −(I_{n1} ⊗ B^+)(I_{n1} ⊗ B(µR + C)^{-1})
= −(I_{n1} ⊗ (µR + C)^{-1}),

which holds since B is full rank in general by Assumption 1, for which it is necessary that n₂ ≥ n₁/2 by Proposition E.1.
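The duplication-matrix identities used in these manipulations can be checked numerically. The sketch below constructs H_n explicitly (our helper, following the Magnus (1988) definition vec(A) = H_n vech(A)) and verifies both the symmetrization property and its inverse form for a generic matrix G; the shift by 3I is only there to guarantee that G ⊕ G is invertible:

```python
import numpy as np

def duplication_matrix(n):
    """Duplication matrix H_n with vec(A) = H_n vech(A) for symmetric A,
    using the usual column-major vech ordering (Magnus, 1988)."""
    cols = []
    for j in range(n):
        for i in range(j, n):
            E = np.zeros((n, n))
            E[i, j] = E[j, i] = 1.0
            cols.append(E.flatten(order="F"))
    return np.array(cols).T

n = 3
rng = np.random.default_rng(0)
G = 3.0 * np.eye(n) + 0.5 * rng.standard_normal((n, n))  # generic, G ⊕ G invertible
H = duplication_matrix(n)
Hp = np.linalg.pinv(H)  # Moore-Penrose inverse H_n^+
I = np.eye(n)

# H^+ (G ⊗ I) H = H^+ (I ⊗ G) H = (1/2) H^+ (I ⊗ G + G ⊗ I) H
A = Hp @ np.kron(G, I) @ H
B = Hp @ np.kron(I, G) @ H
C = 0.5 * Hp @ (np.kron(I, G) + np.kron(G, I)) @ H
assert np.allclose(A, B) and np.allclose(A, C)

# Inverse form: (H^+ (I ⊗ G) H)^{-1} = 2 H^+ (I ⊗ G + G ⊗ I)^{-1} H
lhs = np.linalg.inv(B)
rhs = 2.0 * Hp @ np.linalg.inv(np.kron(I, G) + np.kron(G, I)) @ H
assert np.allclose(lhs, rhs)
```

Both identities follow from the fact that H_n H_n^+ is the orthogonal projector onto the subspace of vectorized symmetric matrices.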


M_1 = −2H_{n2}^+(−A_{12}^T ⊗ I_{n2})(A_{22} ⊕ A_{22})^{-1}(−A_{12} ⊗ I_{n2})H_{n2} and M_2 = I_{n1} ⊗ A_{22}^{-1} − 2(I_{n1} ⊗ A_{22}^{-1}A_{12}^T)H_{n1}(S_1 ⊕ S_1)^{-1}H_{n1}^+(I_{n1} ⊗ A_{12}A_{22}^{-1}). Plugging in the generative adversarial network parameters, we see that we need τ* = λ_max^+(−M_2M_1), where M_1 = 2H_{n2}^+(B^T ⊗ I_{n2})((µR + C) ⊕ (µR + C))^{-1}(B ⊗ I_{n2})H_{n2} < 0 and

on the ⊕ and ⊗ operators. Specifically, let us use the property that H_{n1}^+(G ⊗ I_{n1})H_{n1} = H_{n1}^+(I_{n1} ⊗ G)H_{n1} = (1/2)H_{n1}^+(I_{n1} ⊗ G + G ⊗ I_{n1})H_{n1} for a matrix G, which in turn implies that (H_{n1}^+(I_{n1} ⊗ G)H_{n1})^{-1} = 2H_{n1}^+(I_{n1} ⊗ G + G ⊗ I_{n1})^{-1}H_{n1}, to get the following:

annex

Published as a conference paper at ICLR 2021

Jacobian for the Dirac-GAN, which is given by J_τ(θ*, ω*) = [[0, ℓ'(0)], [−τℓ'(0), τµ]], (57), so this game is locally equivalent to the zero-sum game that arises from the original objective proposed by Goodfellow et al. (2014). This is despite the fact that the non-saturating objective was motivated by global concerns (vanishing gradients early in the training process) rather than local considerations. In Figure 9 we present experiments with τ-GDA for the regularized Dirac-GAN game with the non-saturating objective and ℓ(t) = −log(1 + exp(−t)). We observe similar behavior as in the experiments with the standard objective and refer back to Section 5 for the insights we draw from the simulation. This experiment is primarily included for completeness and to motivate our use of the non-saturating objective in the generative adversarial network experiments we perform on image datasets in Section 5.
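The local stability claim behind (57) can be checked numerically. The sketch below assumes the logistic loss ℓ(x) = −log(1 + exp(−x)), so that ℓ'(0) = 1/2, and verifies that the spectrum lies in the open right half plane on a grid of τ, µ > 0:

```python
import numpy as np

# Jacobian (57) for the regularized Dirac-GAN: [[0, l'(0)], [-tau*l'(0), tau*mu]].
lp0 = 0.5  # l'(0) for l(x) = -log(1 + exp(-x))
for tau in [0.5, 1.0, 10.0, 100.0]:
    for mu in [0.1, 1.0, 10.0]:
        J = np.array([[0.0, lp0], [-tau * lp0, tau * mu]])
        # Stability for all tau, mu > 0: eigenvalues have positive real part,
        # since the characteristic polynomial is s^2 - tau*mu*s + tau*l'(0)^2
        # with positive trace and determinant.
        assert np.all(np.linalg.eigvals(J).real > 0)
```

Setting µ = 0 makes the trace zero and the eigenvalues purely imaginary, which recovers the cycling behavior of the unregularized game.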

K.6 GENERATIVE ADVERSARIAL NETWORK: LEARNING A COVARIANCE MATRIX

We now consider a generative adversarial network formulation presented by Daskalakis et al. (2018) for learning a covariance matrix. This is a simple example with degeneracies much like the Dirac-GAN game, but it can be generalized to arbitrary dimensional strategy spaces and has served as a benchmark for comparing convergence rates in a number of recent papers on learning in games. Often, the example is used to show that gradient descent-ascent cycles and converges slowly. However, by and large, timescale separation is not considered. We show that gradient descent-ascent converges fast in this game with suitable timescale separation and further explore the interplay between timescale separation, regularization, and rate of convergence. We primarily follow the notation of Daskalakis et al. (2018) when describing the problem. The objective of this problem is to learn a covariance matrix using the Wasserstein GAN formulation. The real data x is drawn from a mean-zero multivariate normal distribution with an unknown covariance matrix Σ. The generator is restricted to be a linear function of the random input noise z ∼ N(0, I) and is of the form G_V(z) = Vz. The discriminator is restricted to the set of all quadratic functions, which we represent by D_W(x) = xᵀWx. The parameters of the generator and the discriminator are given by V ∈ R^{d×d} and W ∈ R^{d×d}, respectively. For the given generator and discriminator classes, the Wasserstein GAN game is defined by a suitable cost which, as shown by Daskalakis et al. (2018), can be simplified and expressed as a function g(V, W). With this cost, the individual gradients for gradient descent-ascent can be computed directly. From the individual gradients, it is clear that the critical points of the game are given by (V, W) such that VVᵀ = Σ and W + Wᵀ = 0. Moreover, given the form of g(V, W), the game Jacobian at any critical point can be computed directly. Consequently, the eigenvalues of the game Jacobian are purely imaginary and the critical points are not stable.
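To make the discussion concrete, the following sketch runs τ-GDA on the d = 1 instance of this game with discriminator regularization µ. The scalar cost f(V, W) = W(σ² − V²) − (µ/2)W² is our assumed d = 1 reduction; it is consistent with the game Jacobian [[0, −2σ], [2τσ, τµ]] stated in this section, but the exact cost expression is not reproduced here from the paper:

```python
import numpy as np

# tau-GDA on the assumed d = 1 regularized covariance game
# f(V, W) = W * (sigma^2 - V^2) - (mu / 2) * W^2.
sigma, mu, tau, gamma1 = 1.0, 1.0, 10.0, 0.001
V, W = 1.5, 0.5  # initial condition away from the equilibrium (sigma, 0)
for _ in range(200_000):
    dV = -2.0 * V * W               # D_1 f (generator gradient)
    dW = sigma**2 - V**2 - mu * W   # D_2 f (discriminator gradient)
    V -= gamma1 * dV                # generator descends with gamma_1
    W += tau * gamma1 * dW          # discriminator ascends with gamma_2 = tau*gamma_1
assert abs(abs(V) - sigma) < 1e-3 and abs(W) < 1e-3  # converged to (±sigma, 0)
```

With µ = 0 the same iteration cycles around the critical set rather than converging, matching the instability noted above.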
To fix this problem, Daskalakis et al. (2018) regularized both the generator and discriminator. We only regularize the discriminator in this example. The cost function of the zero-sum game with regularization µ > 0 is given in (58), from which the individual gradients for gradient descent-ascent in this regularized game follow directly.

From Figure 12a, we observe that the choices of τ = 4 and τ = 100 do not show on the plot since they perform poorly, which may be a result of the equilibrium not being stable for τ = 4 and numerical conditioning for τ = 100. Furthermore, we see that timescale separation improves the results up to a reasonable timescale parameter, after which the performance degrades. Furthermore, Figure 12b reveals that the results are improved with regularization, and we see that τ = 100 ends up performing well, potentially since the regularization can alleviate some of the problems of numerical stability. In general, we draw similar conclusions from the median scores as reported in Figures 12c and 12d. The primary purpose of this experiment is to train generative adversarial networks parameterized by neural networks using τ-GDA without heuristics such as adaptive gradient methods or parameter averaging as employed in the image dataset experiments from Section 5 and expanded on further in Appendix K.7.2. Notably, we see that consistent themes emerge: timescale separation improves convergence until hitting a limiting value, and regularization can improve the rate of convergence but there is an interplay with the timescale separation.
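The ring-of-Gaussians data distribution used in the mixture experiments of Section K.7.1 can be sketched as follows (our illustration of the stated construction, with means [sin(ω), cos(ω)] for ω ∈ {kπ/4} and per-component covariance 0.05·I):

```python
import numpy as np

def sample_ring_mixture(batch_size, rng):
    """Sample 'real' data from 8 Gaussians arranged on the unit circle,
    with means [sin(w), cos(w)] for w in {k*pi/4, k = 0..7} and
    covariance 0.05 * I for each component."""
    angles = np.arange(8) * np.pi / 4.0
    means = np.stack([np.sin(angles), np.cos(angles)], axis=1)  # shape (8, 2)
    idx = rng.integers(0, 8, size=batch_size)  # component chosen uniformly
    noise = np.sqrt(0.05) * rng.standard_normal((batch_size, 2))
    return means[idx] + noise

rng = np.random.default_rng(0)
batch = sample_ring_mixture(512, rng)  # one real-data mini-batch
assert batch.shape == (512, 2)
```

Samples concentrate near the unit circle, so a collapsing generator that covers only a few modes is easy to detect with the KL-divergence evaluation described above.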

K.7.2 GENERATIVE ADVERSARIAL NETWORKS: IMAGE DATASETS

We now provide further details on the methods for the experiments training generative adversarial networks with image datasets presented in Section 5, along with more in-depth results.

Methods. We again note that we built our experiments on the methods and implementations of Mescheder et al. (2018) and used the publicly available code from the paper at https://github.com/LMescheder/GAN_stability. We effectively only changed the learning rates, retained multiple exponential averages at once, and modified the updates to be simultaneous in the code. In Figure 17 we provide the network architectures from our experiments.

CelebA. The FID scores along the learning path for CelebA with µ = 10 and µ = 1 are presented in Figures 14a and 14b, respectively. The corresponding scores in numeric form are given in Figures 15b, 15d, and 15f for µ = 10 at 150k iterations and µ = 1 at 150k and 300k iterations, respectively. In this experiment we observe that while the exponential moving average helps performance, the gain is not as drastic as it was for CIFAR-10. It is not entirely clear if this is a consequence of the scores being lower or something fundamental to the optimization landscape and dynamics for the dataset. The timescale separation in combination with the regularization again has a major effect on the FID score of the training process in this experiment. For regularization µ = 10, the timescale parameters τ = 4 and τ = 8 outperform τ = 1, τ = 2, and τ = 16 by a wide margin, again highlighting that timescale separation can speed up convergence until a certain point, after which it can potentially slow it down owing to the effect on the conditioning of the problem locally. A similar trend can be observed with regularization µ = 1, but with τ = 16 performing closer to τ = 4 and τ = 8. For each regularization parameter, we see that τ = 8 performs the best.
We again observe in this experiment that for all timescale separation parameters, the performance is significantly improved with regularization µ = 1 as compared with µ = 10. This once again highlights the importance of considering how the hyperparameters of regularization and timescale interact and dictate the local convergence rates.

Summary. In summary, we took a well-performing method and implementation for training generative adversarial networks and demonstrated that timescale separation is an extremely important, and easy to implement, hyperparameter that is worth careful consideration since it can have a major impact on the convergence speed and final performance of the training process. Interestingly, the conclusions we draw are in line with the insights drawn from the simple Dirac-GAN experiment in Section 5 and from the mixture of Gaussians experiments in Appendix K.7.1. In particular, timescale separation only speeds up convergence until hitting a limiting value, and there is a key interplay between timescale separation, regularization, and convergence rate.

In order to highlight the utility of Theorem 1 along with future directions for obtaining values of τ* for structured games, we revisit the proof of Theorem 3 and derive the result directly from the construction. The purpose of this section is to illustrate that the structure of equilibria considered in Theorem 3 can be exploited to obtain the value of τ* for the entire class of games using properties of the Kronecker product and sum. Consider the generative adversarial network example under Assumption 1. Then, we can show precisely from the construction that for all games of this class τ* = 0 as long as µ > 0. Fix a

