LOCAL CONVERGENCE ANALYSIS OF GRADIENT DESCENT-ASCENT WITH FINITE TIMESCALE SEPARATION

Abstract

We study the role that a finite timescale separation parameter τ has on gradient descent-ascent in non-convex, non-concave zero-sum games, where the learning rate of player 1 is denoted by γ1 and the learning rate of player 2 is defined to be γ2 = τγ1. We provide a non-asymptotic construction of the finite timescale separation parameter τ* such that gradient descent-ascent locally converges to x* for all τ ∈ (τ*, ∞) if and only if x* is a strict local minmax equilibrium. Moreover, we provide explicit local convergence rates given the finite timescale separation. The convergence results we present are complemented by a non-convergence result: given a critical point x* that is not a strict local minmax equilibrium, we present a non-asymptotic construction of a finite timescale separation τ0 such that gradient descent-ascent with timescale separation τ ∈ (τ0, ∞) does not converge to x*. Finally, we extend the results to gradient penalty regularization methods for generative adversarial networks and empirically demonstrate on CIFAR-10 and CelebA the significant impact timescale separation has on training performance.

1. INTRODUCTION

In this paper we study learning in zero-sum games of the form min_{x1 ∈ X1} max_{x2 ∈ X2} f(x1, x2), where the objective function f is assumed to be sufficiently smooth and potentially non-convex and non-concave in the strategy spaces X1 and X2 respectively, with each Xi a precompact subset of R^{ni}. This general problem formulation has long been fundamental in game theory (Başar & Olsder, 1998) and recently it has become central to machine learning, with applications in generative adversarial networks (Goodfellow et al., 2014), robust supervised learning (Madry et al., 2018; Sinha et al., 2018), reinforcement and multi-agent reinforcement learning (Rajeswaran et al., 2020; Zhang et al., 2019), imitation learning (Ho & Ermon, 2016), constrained optimization (Cherukuri et al., 2017), and hyperparameter optimization (Lorraine et al., 2020; MacKay et al., 2019). The gradient descent-ascent learning dynamics are widely studied as a potential method for efficiently computing equilibria in game formulations. However, in zero-sum games, a number of past works highlight problems with this learning dynamic, including both non-convergence to meaningful critical points and convergence to critical points devoid of game-theoretic meaning, where common notions of 'meaningful' equilibria include the local Nash and local minmax (Stackelberg) concepts. For instance, in bilinear games, gradient descent-ascent avoids local Nash and Stackelberg equilibria due to the inherent instability of the update rule for this class. Fortunately, in this class of games, regularization or gradient-based learning dynamics that employ different numerical discretization schemes (as compared to forward Euler for gradient descent-ascent) are known to alleviate this issue (Daskalakis et al., 2018; Mertikopoulos et al., 2019; Zhang & Yu, 2020).
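To make the role of the timescale separation parameter concrete, the following is a minimal sketch of gradient descent-ascent with learning rates γ1 and γ2 = τγ1 on a hypothetical quadratic zero-sum objective chosen for illustration (it does not appear in the paper): f(x1, x2) = -x1²/2 + 3·x1·x2 - x2²/2. The origin is a strict local minmax equilibrium of this f but not a local Nash equilibrium, and the sketch illustrates the qualitative phenomenon studied here: with a shared learning rate (τ = 1) the iterates spiral away, while sufficient timescale separation yields local convergence.

```python
import numpy as np

def grad(x1, x2):
    # Partial derivatives of the illustrative objective
    # f(x1, x2) = -x1**2 / 2 + 3 * x1 * x2 - x2**2 / 2.
    return -x1 + 3 * x2, 3 * x1 - x2

def tau_gda(tau, gamma1=0.05, steps=2000, x1=1.0, x2=1.0):
    """Gradient descent-ascent where player 2's learning rate is tau * gamma1.

    Returns the distance of the final iterate from the equilibrium at the origin.
    """
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        x1 = x1 - gamma1 * g1          # player 1 descends on f
        x2 = x2 + tau * gamma1 * g2    # player 2 ascends at a faster timescale
    return np.hypot(x1, x2)

print(tau_gda(tau=1.0))  # shared learning rate: the iterates diverge from the origin
print(tau_gda(tau=4.0))  # separated timescales: the iterates converge to the origin
```

The value τ = 4 here is specific to this quadratic example; the paper's contribution is a non-asymptotic construction of the threshold τ* for general smooth objectives.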
For the more general nonlinear nonconvex-nonconcave class of games, it has been shown that gradient descent-ascent with a shared learning rate is prone to reaching critical points that are neither local Nash equilibria nor local Stackelberg equilibria (Daskalakis & Panageas, 2018; Jin et al., 2020; Mazumdar et al., 2020). While an important negative result, it does not rule out the prospect that gradient

