LOCAL CONVERGENCE ANALYSIS OF GRADIENT DESCENT-ASCENT WITH FINITE TIMESCALE SEPARATION

Abstract

We study the role that a finite timescale separation parameter τ has on gradient descent-ascent in non-convex, non-concave zero-sum games where the learning rate of player 1 is denoted by γ1 and the learning rate of player 2 is defined to be γ2 = τγ1. We provide a non-asymptotic construction of the finite timescale separation parameter τ* such that gradient descent-ascent locally converges to a critical point x* for all τ ∈ (τ*, ∞) if and only if it is a strict local minmax equilibrium. Moreover, we provide explicit local convergence rates given the finite timescale separation. The convergence results we present are complemented by a non-convergence result: given a critical point x* that is not a strict local minmax equilibrium, we present a non-asymptotic construction of a finite timescale separation τ0 such that gradient descent-ascent with timescale separation τ ∈ (τ0, ∞) does not converge to x*. Finally, we extend the results to gradient penalty regularization methods for generative adversarial networks and empirically demonstrate on CIFAR-10 and CelebA the significant impact timescale separation has on training performance.

1. INTRODUCTION

In this paper we study learning in zero-sum games of the form min_{x1∈X1} max_{x2∈X2} f(x1, x2), where the objective function f is assumed to be sufficiently smooth and potentially non-convex and non-concave in the strategy spaces X1 and X2 respectively, with each Xi a precompact subset of R^{ni}. This general problem formulation has long been fundamental in game theory (Başar & Olsder, 1998) and recently it has become central to machine learning with applications in generative adversarial networks (Goodfellow et al., 2014), robust supervised learning (Madry et al., 2018; Sinha et al., 2018), reinforcement and multi-agent reinforcement learning (Rajeswaran et al., 2020; Zhang et al., 2019), imitation learning (Ho & Ermon, 2016), constrained optimization (Cherukuri et al., 2017), and hyperparameter optimization (Lorraine et al., 2020; MacKay et al., 2019). The gradient descent-ascent learning dynamics are widely studied as a potential method for efficiently computing equilibria in game formulations. However, in zero-sum games, a number of past works highlight problems with this learning dynamic, including both non-convergence to meaningful critical points and convergence to critical points devoid of game-theoretic meaning, where common notions of 'meaningful' equilibria include the local Nash and local minmax (Stackelberg) concepts. For instance, in bilinear games, gradient descent-ascent avoids local Nash and Stackelberg equilibria due to the inherent instability of the update rule for this class. Fortunately, in this class of games, regularization or gradient-based learning dynamics that employ different numerical discretization schemes (as compared to the forward Euler scheme underlying gradient descent-ascent) are known to alleviate this issue (Daskalakis et al., 2018; Mertikopoulos et al., 2019; Zhang & Yu, 2020).
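The bilinear instability described above is straightforward to reproduce numerically. The following sketch (an illustration of our own, not an experiment from the paper; the learning rate and iteration count are arbitrary choices) runs simultaneous gradient descent-ascent on the bilinear game f(x1, x2) = x1·x2, whose unique equilibrium is the origin:

```python
import numpy as np

# Bilinear game f(x1, x2) = x1 * x2: player 1 descends, player 2 ascends.
gamma = 0.05          # shared learning rate (arbitrary)
x1, x2 = 1.0, 1.0     # initial strategies
norms = [np.hypot(x1, x2)]
for _ in range(200):
    g1 = x2           # df/dx1
    g2 = x1           # df/dx2
    x1, x2 = x1 - gamma * g1, x2 + gamma * g2   # simultaneous forward-Euler updates
    norms.append(np.hypot(x1, x2))

# Each step multiplies the distance to the origin by sqrt(1 + gamma^2) > 1,
# so the iterates spiral away from the equilibrium rather than converge.
print(norms[0], norms[-1])
```

The update matrix [[1, -γ], [γ, 1]] is a scaled rotation with spectral radius sqrt(1 + γ²) > 1 for any γ > 0, which is exactly the inherent instability referenced in the text.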
For the more general nonlinear nonconvex-nonconcave class of games, it has been shown that gradient descent-ascent with a shared learning rate is prone to reaching critical points that are neither local Nash equilibria nor local Stackelberg equilibria (Daskalakis & Panageas, 2018; Jin et al., 2020; Mazumdar et al., 2020). While an important negative result, it does not rule out the prospect that gradient descent-ascent may be able to guarantee equilibrium convergence, as it fails to account for a key structural parameter of the dynamics, namely the ratio of the learning rates between the players. Motivated by the observation that the order of play between players is fundamental to the definition of the game, the role of timescale separation in gradient descent-ascent has recently been explored theoretically (Chasnov et al., 2019; Heusel et al., 2017; Jin et al., 2020). On the empirical side, it has been widely demonstrated that timescale separation in gradient descent-ascent is crucial to improving the solution quality when training generative adversarial networks (Arjovsky et al., 2017; Goodfellow et al., 2014; Heusel et al., 2017). Denoting γ1 as the learning rate of player 1, the learning rate of player 2 can be redefined as γ2 = τγ1, where τ = γ2/γ1 > 0 is the learning rate ratio. Toward understanding the effect of timescale separation, Jin et al. (2020) show that as τ → ∞ the locally stable critical points of gradient descent-ascent coincide with the set of strict local minmax/Stackelberg equilibria across the spectrum of sufficiently smooth zero-sum games. In other words, all 'bad critical points' (critical points lacking game-theoretic meaning) become unstable and all 'good critical points' (game-theoretically meaningful equilibria) remain or become locally exponentially stable (cf. Definition 3) as τ → ∞.
While a promising theoretical development, gradient descent-ascent with a timescale separation approaching infinity does not lead to a practical learning rule, and its analysis does not necessarily provide insights into the common usage of a reasonable finite timescale separation. An important observation is that choosing τ arbitrarily large with the goal of ensuring local equilibrium convergence can lead to numerically ill-conditioned problems. This highlights the significance of understanding the exact range of learning rate ratios that guarantee local stability. Moreover, our experiments in Section 5 (Dirac-GAN) and in Appendix K show that modest values of τ are typically sufficient to guarantee that only equilibria are stable, which allows for larger choices of γ1 and results in faster convergence to an equilibrium.

Contributions. We show gradient descent-ascent locally converges to a critical point for a range of finite learning rate ratios if and only if the critical point is a strict local Stackelberg equilibrium (Theorem 1).[1] This result is constructive in the sense that we explicitly characterize the exact range of learning rate ratios for which the guarantee holds. Furthermore, we show all other critical points are unstable for a range of finite learning rate ratios that we explicitly construct (Theorem 2). To our knowledge, the aforementioned guarantees are the first of their kind in nonconvex-nonconcave zero-sum games for an implementable first-order method. Moreover, the technical results in this work rely on tools that have not previously appeared in the machine learning and optimization communities analyzing games.
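A finite stability threshold of this kind can be seen concretely on a small quadratic game (an illustrative example of our own construction, not one from the paper). For f(x1, x2) = -x1²/2 + 2x1x2 - x2²/2, the origin is a strict local minmax equilibrium (D2²f = -1 < 0 and the Schur complement -1 + 4 = 3 > 0) but not a local Nash equilibrium, and linearizing gradient descent-ascent shows it is stable exactly when τ > 1:

```python
import numpy as np

# f(x1, x2) = -x1**2 / 2 + 2 * x1 * x2 - x2**2 / 2: the origin is a strict
# local minmax equilibrium but not a local Nash equilibrium (D1^2 f = -1 < 0).
def run_gda(tau, gamma1=0.01, steps=5000):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        g1 = -x[0] + 2.0 * x[1]     # D1 f
        g2 = 2.0 * x[0] - x[1]      # D2 f
        # Player 1 descends with rate gamma1; player 2 ascends with rate tau * gamma1.
        x = x + gamma1 * np.array([-g1, tau * g2])
    return np.linalg.norm(x)

# The Jacobian of the dynamics at the origin is [[1, -2], [2*tau, -tau]] with
# trace 1 - tau and determinant 3*tau, so stability holds precisely when tau > 1.
print(run_gda(tau=0.5), run_gda(tau=2.0))
```

With τ = 0.5 the iterates diverge from the equilibrium, while with τ = 2 they converge, matching the trace/determinant calculation above: the threshold here is τ* = 1.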
Finally, we extend these results to gradient penalty regularization methods in generative adversarial networks (Theorem 3), thereby providing theoretical guarantees for a common combination of heuristics used in practice, and empirically demonstrate the benefits and trade-offs of regularization and timescale separation on the Dirac-GAN along with image datasets.
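The interplay between gradient penalty regularization and convergence can be previewed on the Dirac-GAN (Mescheder et al., 2018): a generator producing the point mass δ_θ, true data δ_0, and a linear discriminator D_ψ(x) = ψx yield the objective V(θ, ψ) = f(θψ) + f(0) with f(t) = -log(1 + e^{-t}). The sketch below (our own illustration; the step size, penalty weight, and iteration count are arbitrary choices) contrasts plain simultaneous gradient descent-ascent with the same dynamics after adding the penalty R(ψ) = (λ/2)ψ² to the discriminator's objective:

```python
import math

def fprime(t):
    # f'(t) for f(t) = -log(1 + exp(-t)), i.e. the logistic sigmoid of -t.
    return 1.0 / (1.0 + math.exp(t))

def train(penalty, gamma=0.1, steps=2000):
    theta, psi = 0.5, 0.5                      # generator / discriminator parameters
    for _ in range(steps):
        fp = fprime(theta * psi)
        g_theta = psi * fp                     # dV/dtheta (generator descends)
        g_psi = theta * fp - penalty * psi     # dV/dpsi minus penalty gradient (ascends)
        theta, psi = theta - gamma * g_theta, psi + gamma * g_psi
    return math.hypot(theta, psi)              # distance to the equilibrium (0, 0)

print(train(penalty=0.0), train(penalty=1.0))
```

Without the penalty the update direction is orthogonal to (θ, ψ), so the discrete iterates strictly drift away from the equilibrium; with the penalty, both parameters are driven to zero.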

2. PRELIMINARIES

A two-player zero-sum continuous game is defined by a collection of costs (f1, f2) where f1 ≡ f and f2 ≡ -f with f ∈ C^r(X, R) for some r ≥ 2, and where X = X1 × X2 with each Xi a precompact subset of R^{ni} for i ∈ {1, 2} and n = n1 + n2. Each player i ∈ I seeks to minimize their cost fi(xi, x-i) with respect to their choice variable xi, where x-i is the vector of all other actions xj with j ≠ i. We denote by Di fi the derivative of fi with respect to xi, by Dij fi the partial derivative of Di fi with respect to xj, and by Di² fi the partial derivative of Di fi with respect to xi.

Mathematical Notation. Given a matrix A ∈ R^{n1×n2}, let vec(A) ∈ R^{n1n2} be its vectorization such that vec(A) takes the rows ai of A, transposes them, and stacks them vertically in order of their index. Let ⊗ and ⊕ denote the Kronecker product and sum respectively, where A ⊕ B = A ⊗ I + I ⊗ B. Moreover, we define an operator that generates a (1/2)n(n+1) × (1/2)n(n+1) matrix from a matrix A ∈ R^{n×n} by mapping A to H_n^+ (A ⊕ A) H_n, where H_n^+ = (H_n^T H_n)^{-1} H_n^T is the (left) pseudo-inverse of H_n, a full column rank duplication matrix. Let λ_max^+(·) be the largest positive real root of its argument if it exists and zero otherwise. See Lancaster & Tismenetsky (1985) and Appendix B for more detail.

Equilibrium. There are natural equilibrium concepts depending on the order of play: the (local) Nash equilibrium concept in the case of simultaneous play and the (local) Stackelberg (equivalently, minmax in zero-sum games) equilibrium concept in the case of hierarchical play (Başar & Olsder, 1998).



[1] Following Fiez et al. (2020), we refer to strict local Stackelberg equilibria as differential Stackelberg equilibria throughout.

