CHARACTERIZING LOOKAHEAD DYNAMICS OF SMOOTH GAMES

Abstract

As multi-agent systems proliferate in machine learning research, games have attracted much attention as a framework to understand the optimization of multiple interacting objectives. However, a key challenge in game optimization is that, in general, there is no guarantee that the usual gradient-based methods converge to a local solution of the game. Recent work by Chavdarova et al. (2020) reports that the Lookahead optimizer (Zhang et al., 2019) significantly improves the performance of Generative Adversarial Networks (GANs) and reduces the rotational force of bilinear games. While promising, their observations were purely empirical, and Lookahead optimization of smooth games still lacks theoretical understanding. In this paper, we fill this gap by theoretically characterizing the Lookahead dynamics of smooth games. We provide an intuitive geometric explanation of how and when Lookahead can improve game dynamics in terms of stability and convergence. Furthermore, we present sufficient conditions under which Lookahead optimization of bilinear games provably stabilizes or accelerates convergence to a Nash equilibrium of the game. Finally, we show that the Lookahead optimizer preserves locally asymptotically stable equilibria of the base dynamics, and can either stabilize or accelerate the local convergence to a given equilibrium under proper assumptions. We verify our theoretical predictions by conducting numerical experiments on two-player zero-sum (non-linear) games.

1. INTRODUCTION

Recently, a plethora of learning problems have been formulated as games between multiple interacting agents, including Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Brock et al., 2019; Karras et al., 2019), adversarial training (Goodfellow et al., 2015; Madry et al., 2018), self-play (Silver et al., 2018; Bansal et al., 2018), inverse reinforcement learning (RL) (Fu et al., 2018) and multi-agent RL (Lanctot et al., 2017; Vinyals et al., 2019). However, the optimization of interdependent objectives is a non-trivial problem, in terms of both computational complexity (Daskalakis et al., 2006; Chen et al., 2009) and convergence to an equilibrium (Goodfellow, 2017; Mertikopoulos et al., 2018; Mescheder et al., 2018; Hsieh et al., 2020). In particular, gradient-based optimization methods often fail to converge and oscillate around a (local) Nash equilibrium of the game even in very simple settings (Mescheder et al., 2018; Daskalakis et al., 2018; Mertikopoulos et al., 2019; Gidel et al., 2019b;a). To tackle such non-convergent game dynamics, a huge effort has been devoted to developing efficient optimization methods with convergence guarantees in smooth games (Mescheder et al., 2017; 2018; Daskalakis et al., 2018; Balduzzi et al., 2018; Gidel et al., 2019b;a; Schäfer & Anandkumar, 2019; Yazici et al., 2019; Loizou et al., 2020). Meanwhile, Chavdarova et al. (2020) have recently reported that the Lookahead optimizer (Zhang et al., 2019) significantly improves the empirical performance of GANs and reduces the rotational force of bilinear game dynamics. Specifically, they demonstrate that class-unconditional GANs trained with a Lookahead optimizer can outperform the class-conditional BigGAN (Brock et al., 2019) trained with Adam (Kingma & Ba, 2015), even with a model that has 1/30 of the parameters, at negligible computational overhead.
They also show that Lookahead optimization of a stochastic bilinear game tends to be more robust against large gradient variance than other popular first-order methods, and converges to a Nash equilibrium of the game where the other methods fail. Despite its great promise, the study of Chavdarova et al. (2020) relied on purely empirical observations, and the dynamics of Lookahead game optimization still lacks theoretical understanding. Specifically, many open questions, such as the convergence properties of Lookahead dynamics and the impact of its hyperparameters on convergence, remain unanswered. In this work, we fill this gap by theoretically characterizing the Lookahead dynamics of smooth games. Our contributions are summarized as follows:

• We provide an intuitive geometric explanation of how and when Lookahead can improve game dynamics in terms of stability and convergence to an equilibrium.
• We analyze the convergence of Lookahead dynamics in bilinear games and present sufficient conditions under which the base dynamics can be either stabilized or accelerated.
• We characterize the limit points of Lookahead dynamics in terms of their stability and local convergence rates. Specifically, we show that Lookahead (i) preserves locally asymptotically stable equilibria of base dynamics and (ii) can either stabilize or accelerate the local convergence to a given equilibrium by carefully choosing its hyperparameters.
• Each of our theoretical predictions is verified with numerical experiments on two-player zero-sum (non-linear) smooth games.

2. PRELIMINARIES

We briefly review the objective of smooth game optimization, first-order game dynamics, and the Lookahead optimizer. Finally, we discuss previous work on game optimization. We summarize the notation used throughout this paper in Table A.1.

2.1. SMOOTH GAMES

Following Balduzzi et al. (2018), a smooth game between players i = 1, . . . , n can be defined as a set of smooth scalar functions {f_i}_{i=1}^n with f_i : R^d → R, where d = Σ_{i=1}^n d_i. Each f_i represents the cost of player i's strategy x_i ∈ R^{d_i} with respect to the other players' strategies x_{−i}. The goal of game optimization is to find a (local) Nash equilibrium of the game (Nash, 1951), i.e., a strategy profile where no player has a unilateral incentive to change its own strategy.

Definition 1 (Nash equilibrium). Let {f_i}_{i=1}^n be a smooth game with strategy spaces {R^{d_i}}_{i=1}^n such that d = Σ_{i=1}^n d_i. Then x* ∈ R^d is a local Nash equilibrium of the game if, for each i = 1, . . . , n, there is a neighborhood U_i of x*_i such that f_i(x_i, x*_{−i}) ≥ f_i(x*) holds for any x_i ∈ U_i. Such x* is said to be a global Nash equilibrium of the game when U_i = R^{d_i} for each i = 1, . . . , n.

A straightforward computational approach to finding a (local) Nash equilibrium of a smooth game is to carefully design a gradient-based strategy update rule for each player. Such update rules, which define iterative plays between players, are referred to as dynamics of the game.

Definition 2 (Dynamics of a game). A dynamics of a smooth game {f_i}_{i=1}^n is a differentiable operator F : R^d → R^d that describes the players' iterative strategy updates as x^(t+1) = F(x^(t)).

One might expect that a simple myopic game dynamics, such as gradient descent, would suffice to find a (local) Nash equilibrium of a game, as in traditional minimization problems. However, in general, gradient descent optimization of smooth games often fails to converge and oscillates around an equilibrium of the game (Daskalakis et al., 2018; Gidel et al., 2019b;a; Letcher et al., 2019).
Such non-convergent behavior of game dynamics is mainly due to the (non-cooperative) interaction between multiple cost functions, and is considered a key challenge in game optimization (Mescheder et al., 2017; 2018; Mazumdar et al., 2019; Hsieh et al., 2020).

2.2. FIRST-ORDER METHODS FOR SMOOTH GAME OPTIMIZATION

We introduce well-known first-order methods for smooth game optimization. To ease notation, we use ∇_x f(•) to denote the concatenated partial derivatives (∇_{x_1} f_1(•), . . . , ∇_{x_n} f_n(•)) of a smooth game {f_i}_{i=1}^n, where ∇_{x_i} f_i(•) is the partial derivative of player i's cost function with respect to its own strategy.

Gradient Descent (GD) minimizes the cost function of each player via gradient descent. Its simultaneous dynamics F_GDSim of a smooth game {f_i}_{i=1}^n with a learning rate η > 0 is given by

x^(t+1) = F_GDSim(x^(t)) := x^(t) − η ∇_x f(x^(t)). (1)

On the other hand, its alternating dynamics F_GDAlt is described by

x^(t+1) = F_GDAlt(x^(t)) := F_1 ∘ . . . ∘ F_n(x^(t)), where F_i(x) := (. . . , x_{i−1}, x_i − η ∇_{x_i} f_i(x), x_{i+1}, . . .). (2)

Proximal Point (PP) (Martinet, 1970) computes an update by solving a proximal problem at each iteration. Its simultaneous dynamics F_PPSim of a smooth game {f_i}_{i=1}^n with a learning rate η > 0 is

x^(t+1) = F_PPSim(x^(t)) := x^(t) − η ∇_x f(x^(t+1)). (4)

Note that this update rule is implicit, in the sense that x^(t+1) appears on both sides of the equation; hence it requires solving the proximal subproblem for x^(t+1) at each iteration.

Extra Gradient (EG) (Korpelevich, 1976) computes an update using an extrapolated gradient. Its simultaneous dynamics F_EGSim of a smooth game {f_i}_{i=1}^n with a learning rate η > 0 is

x^(t+1) = F_EGSim(x^(t)) := x^(t) − η ∇_x f(x^(t+1/2)), where x^(t+1/2) := x^(t) − η ∇_x f(x^(t)). (5)
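To make these update rules concrete, the sketch below instantiates the simultaneous GD, alternating GD, and simultaneous EG dynamics on the simple bilinear game min_{x1} max_{x2} x1·x2, where each method reduces to a fixed linear map x ↦ Mx (the function names and parameter values are our own illustrative choices; PP is implicit and omitted here):

```python
import numpy as np

# Minimal sketch of three dynamics on min_{x1} max_{x2} x1*x2, where
# each update rule reduces to a linear map x -> M x (names are our own).
def gd_sim(eta):
    # simultaneous GD: x1 <- x1 - eta*x2, x2 <- x2 + eta*x1
    return np.array([[1.0, -eta], [eta, 1.0]])

def gd_alt(eta):
    # alternating GD: player 2 uses the already-updated x1
    return np.array([[1.0, -eta], [eta, 1.0 - eta**2]])

def eg_sim(eta):
    # EG: extrapolate to x^{t+1/2}, then update with the extrapolated gradient
    return np.array([[1.0 - eta**2, -eta], [eta, 1.0 - eta**2]])

def iterate(M, x0, t):
    x = np.array(x0, dtype=float)
    for _ in range(t):
        x = M @ x
    return x

eta, x0, T = 0.1, (1.0, 1.0), 200
print(np.linalg.norm(iterate(gd_sim(eta), x0, T)))  # grows: GD-sim diverges
print(np.linalg.norm(iterate(gd_alt(eta), x0, T)))  # bounded: GD-alt oscillates
print(np.linalg.norm(iterate(eg_sim(eta), x0, T)))  # shrinks: EG converges
```

Iterating these maps reproduces the qualitative behavior discussed below: simultaneous GD spirals outward, alternating GD cycles on a bounded orbit, and EG spirals into the equilibrium.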

2.3. LOOKAHEAD OPTIMIZER

Lookahead (Zhang et al., 2019) is a recently proposed optimizer that wraps around a base optimizer and takes a backward synchronization step after every k forward steps. Given a dynamics F_A induced by a base optimization method A, the Lookahead dynamics G_LA-A with a synchronization period k ∈ N and a rate α ∈ (0, 1) is

x^(t+1) = G_LA-A(x^(t)) := (1 − α) x^(t) + α F_A^k(x^(t)).
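A minimal sketch of this wrapper, assuming the base optimizer is exposed as a plain update function (the names and the divergent-GD example are our own):

```python
import numpy as np

# Hedged sketch of the Lookahead wrapper: k fast steps of the base update,
# then interpolation back toward the slow iterate.
def lookahead(base_step, k, alpha):
    def G(x):
        fast = np.array(x, dtype=float)
        for _ in range(k):                        # k forward steps of the base dynamics
            fast = base_step(fast)
        return (1 - alpha) * x + alpha * fast     # backward synchronization step
    return G

# Example: wrapping divergent simultaneous GD on min_{x1} max_{x2} x1*x2.
eta = 0.1
M = np.array([[1.0, -eta], [eta, 1.0]])
la_gd = lookahead(lambda x: M @ x, k=20, alpha=0.5)

x = np.array([1.0, 1.0])
for _ in range(100):
    x = la_gd(x)
print(np.linalg.norm(x))  # near zero: Lookahead stabilizes the divergent GD
```

Even though the base GD iterates spiral away from the equilibrium, the wrapped dynamics contracts, which previews the spectral analysis of Section 3.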

2.4. RELATED WORK

The convergence analysis of first-order smooth game dynamics dates back several decades and was established in the context of saddle-point problems (Rockafellar, 1976; Korpelevich, 1976; Tseng, 1995), a special case of zero-sum games. For example, Rockafellar (1976) showed the linear convergence of PP in bilinear and strongly-convex-strongly-concave (SCSC) saddle-point problems. Tseng (1995) and Facchinei & Pang (2003) proved the linear convergence of EG in the same problems, and Nemirovski (2004) did so in the convex-concave problem over compact sets. As many learning problems have been formulated as games in recent years (Goodfellow et al., 2014; Madry et al., 2018; Silver et al., 2018; Fu et al., 2018; Vinyals et al., 2019), game optimization has regained considerable attention from the research community. Optimistic gradient descent (OGD) (Popov, 1980), which can be seen as an efficient approximation of EG, was recently rediscovered in the context of GAN training (Daskalakis et al., 2018). Recent work of Liang & Stokes (2019) and Gidel et al. (2019a) proved linear convergence of OGD in bilinear and SCSC games. Mokhtari et al. (2020) established a unifying theoretical framework for analyzing PP, EG and OGD dynamics. Zhang & Yu (2020) presented exact and optimal conditions for PP, EG and OGD dynamics to converge in bilinear games. While there has been growing interest in incorporating second-order information into game dynamics (Mescheder et al., 2017; Balduzzi et al., 2018; Mazumdar et al., 2019; Schäfer & Anandkumar, 2019; Loizou et al., 2020) to remedy non-convergent behaviors, first-order optimization still dominates in practice (Brock et al., 2019; Donahue & Simonyan, 2019) due to the computational and memory cost of second-order methods. Lately, Chavdarova et al.
(2020) reported that the recently developed Lookahead optimizer (Zhang et al., 2019) significantly improves the empirical performance of GANs and reduces the rotational force of bilinear game dynamics. However, this study relied on purely empirical observation and lacked theoretical understanding of Lookahead optimization of smooth games. Although Wang et al. (2020) proved that the Lookahead optimizer globally converges to a stationary point in minimization problems, its convergence in smooth games still remains an open question.

Figure 1: λ± are the eigenvalues of each base dynamics, and 1 − α + αλ±^k are the eigenvalues of the associated Lookahead dynamics. The k forward steps of a Lookahead procedure first rotate the eigenvalues λ± of the dynamics' Jacobian matrix. A backward synchronization step then pulls them into a circle with a radius smaller than their maximal modulus. This results in a reduced spectral radius of the Jacobian matrix, which improves stability and convergence to an equilibrium.

3. SPECTRAL CONTRACTION EFFECT OF LOOKAHEAD IN BILINEAR GAMES

In this section, we show that Lookahead can either stabilize or accelerate the convergence of its base dynamics by reducing the spectral radius of the underlying Jacobian matrix. We highlight this spectral contraction effect by analyzing the convergence of Lookahead dynamics in a simple bilinear game (Section 3.1), and extend the results to general bilinear games (Section 3.2).

3.1. LOOKAHEAD DYNAMICS OF A SIMPLE BILINEAR GAME

We begin with a simple exemplar bilinear game that has a unique Nash equilibrium (0, 0):

min_{x1∈R} max_{x2∈R} x1 · x2. (8)

This game has been extensively studied as a representative toy example of game optimization by Gidel et al. (2019a) due to its oscillating dynamics. The following proposition demonstrates the stabilization effect of Lookahead on Equation 8.

Proposition 1. Simultaneous GD dynamics F_GDSim with a learning rate η > 0 diverges from the Nash equilibrium of Equation 8. However, its Lookahead dynamics G_LA-GDSim with a synchronization period k ∈ N and a rate α ∈ (0, 1) globally converges to the Nash equilibrium if ℜ((1 + iη)^k) < 1 and α is small enough.

Proposition 1 shows that the Lookahead optimizer can stabilize the divergent dynamics of Equation 8. However, such a stabilization effect raises a natural question: is there any advantage to using Lookahead when its base dynamics is already stable? Proposition 2 analyzes the well-known convergent PP dynamics of Equation 8 and presents an affirmative answer. Specifically, it shows that Lookahead dynamics (i) preserves the convergence of its base dynamics, and (ii) can further accelerate the convergence with proper hyperparameter choices.

Proposition 2. Simultaneous PP Lookahead dynamics G_LA-PPSim with a learning rate η ∈ (0, 1), a synchronization period k ∈ N and a rate α ∈ (0, 1) globally converges to the Nash equilibrium of Equation 8. Furthermore, the rate of convergence is improved upon its base dynamics F_PPSim if ℜ((1 + iη)^(−k)) < (1 + η²)^(−k) and α is large enough.

We provide a geometric interpretation of the Lookahead procedure in Figure 1. Intuitively, the Lookahead optimizer either stabilizes or accelerates its base dynamics by pulling the eigenvalues of the dynamics' Jacobian matrix into a circle with a small radius.
Specifically, the k forward steps of a Lookahead procedure first rotate the eigenvalues, and a backward synchronization step then pulls them into a circle with a radius smaller than their maximal modulus. This results in a reduction of the spectral radius of the dynamics' Jacobian matrix, which is known to be crucial for stability (Slotine & Li, 1991). Such spectral contraction effect of Lookahead dynamics is captured by the following lemma.

Lemma 3 (Spectral contraction effect of Lookahead). Let k ∈ N, α ∈ (0, 1) and define a function f : R^{m×m} → R^{m×m} by f(X) = (1 − α)I + αX^k. Define θ(λ) := Arg(λ^k − 1) and φ(λ) := arcsin(sin(θ(λ)) / ρ(X)^k). Then, the following statements hold:
• For ρ(X) = 1, ρ(f(X)) < 1 if λ^k ≠ 1, ∀λ ∈ λ_max(X).
• For ρ(X) > 1, ρ(f(X)) < 1 if ℜ(λ^k) < 1 and α < 2 cos(π − θ(λ)) / |λ^k − 1|, ∀λ ∈ λ_{≥1}(X).
• For ρ(X) < 1, ρ(f(X)) < ρ(X)^k if ℜ(λ^k) < ρ(X)^{2k}, ∀λ ∈ λ_max(X), and α > 1 − 2ρ(X)^k cos(π − φ(λ_i)) / |λ_i^k − 1|, ∀λ_i ∈ λ(X).

In short, Lemma 3 suggests that Lookahead can reduce the spectral radius of a matrix by choosing a proper α and a k such that all the radius-supporting eigenvalues (e.g., λ_{≥1}(X), λ_max(X)) are rotated far enough to the left. However, such a k may not exist, especially when these eigenvalues are not tightly clustered together. To help understand when Lookahead can actually reduce the spectral radius, we present Lemma 4, a sufficient condition for a set of eigenvalues to admit a k that rotates them into the left half-plane.

Lemma 4 (Left-rotatable eigenvalues). Let X, J ∈ R^{m×m} be such that X = I − ηJ for some η > 0, and let S ⊆ λ(X). Assume that each element of S has its conjugate pair in S. Then we have ℜ(λ^k) < 0 for each λ ∈ S if k ∈ (π / (2θ_min(S)), 3π / (2θ_max(S))) and every element of S has non-zero imaginary part. The existence of such k ∈ N is guaranteed for a small enough η when ℑ_max(S) / ℑ_min(S) < 3.
Note that the Jacobian matrix of most well-known gradient-based dynamics can be written in the form I − ηJ, where η > 0 is a learning rate and J is the underlying Jacobian matrix of the game. Intuitively, Lemma 4 suggests that, for a small enough learning rate, any subset of the eigenvalues of a dynamics with imaginary conditioning less than 3 admits a k that rotates them far enough to the left. For such k, Lookahead can reduce the spectral radius of the dynamics by choosing a proper α, as stated in Lemma 3. The joint usage of Lemmas 3 and 4 plays a central role in the proofs of our main results in Section 3.2 and Section 4. To summarize, Lemmas 3 and 4 together highlight when Lookahead can actually improve the game dynamics, and show that the imaginary conditioning of the radius-supporting eigenvalues is crucial for determining whether the dynamics is improvable.
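The spectral contraction described by Lemmas 3 and 4 can be checked numerically. The sketch below applies the Lookahead eigenvalue map 1 − α + αλ^k to the eigenvalues of divergent simultaneous GD on the simple bilinear game (η, α and the rule for choosing k are our own illustrative choices):

```python
import numpy as np

# Numerical check of the spectral contraction picture: a base eigenvalue
# lam maps to 1 - alpha + alpha * lam**k under Lookahead.
eta, alpha = 0.1, 0.1
lams = np.array([1 + 1j * eta, 1 - 1j * eta])   # divergent GD-sim eigenvalues
rho_base = np.abs(lams).max()                   # sqrt(1 + eta^2) > 1

# pick the first k in the admissible range of Lemma 4, so lam**k is
# rotated into the left half-plane
theta = np.angle(1 + 1j * eta)
k = int(np.ceil(np.pi / (2 * theta)))
la_eigs = 1 - alpha + alpha * lams**k
rho_la = np.abs(la_eigs).max()
print(rho_base, rho_la)  # base radius > 1, Lookahead radius < 1
```

Rotating λ^k past the imaginary axis makes the interpolation toward 1 land strictly inside the unit circle, exactly the geometry of Figure 1.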

3.2. LOOKAHEAD DYNAMICS OF GENERAL BILINEAR GAMES

In this section, we extend the analysis of Lookahead dynamics to a general bilinear game

min_{x1∈R^m} max_{x2∈R^n} x1^T A x2 − b1^T x1 − b2^T x2 (9)

for some A ∈ R^{m×n}, b1 ∈ R^m and b2 ∈ R^n such that there exist x*1 ∈ R^m, x*2 ∈ R^n with A^T x*1 = b2 and A x*2 = b1. The existence of x*1, x*2 allows us to rewrite the game as

min_{x1∈R^m} max_{x2∈R^n} (x1 − x*1)^T U [Σ_r 0; 0 0] V^T (x2 − x*2),

where U [Σ_r 0; 0 0] V^T is the SVD of A with r := rank(A). Therefore, we can analyze the dynamics of Equation 9 by inspecting the simpler problem min_{x1∈R^r} max_{x2∈R^r} x1^T Σ_r x2, as the two are equivalent up to rotations and translations. This reduction is a well-known technique and has been used by Gidel et al. (2019b;a) and Zhang & Yu (2020) to simplify the analysis of Equation 9. We now present sufficient conditions on the Lookahead hyperparameters under which each first-order base dynamics, namely GD_Alt, GD_Sim, PP_Sim and EG_Sim, is either stabilized or accelerated. The first two theorems show that Lookahead can provably stabilize the non-convergent GD dynamics of general bilinear games.

Theorem 5 (Convergence of G_LA-GDAlt). Lookahead dynamics G_LA-GDAlt with a learning rate η ∈ (0, 2/σ_max), a synchronization period k ∈ N and a rate α ∈ (0, 1) converges to a Nash equilibrium of Equation 9 if k arccos(1 − η²σ_i²/2) mod 2π ≠ 0 for any σ_i ∈ σ(A).

Theorem 6 (Convergence of G_LA-GDSim). Lookahead dynamics G_LA-GDSim with a learning rate η > 0, a synchronization period k ∈ N and a rate α ∈ (0, 1) converges to a Nash equilibrium of Equation 9 if k ∈ (π / (2 arctan ησ_min), 3π / (2 arctan ησ_max)) and α is small enough.

Roughly, Theorem 5 suggests that almost any configuration of Lookahead makes GD_Alt converge to a Nash equilibrium of the bilinear game. On the other hand, the existence of a k satisfying the condition of Theorem 6 is guaranteed for a small enough η if σ_max / σ_min < 3 holds.
This highlights a limitation of the convergence guarantee for GD_Sim: it holds only for well-conditioned games. The next two theorems show that Lookahead preserves the convergence of PP_Sim and EG_Sim in bilinear games, and can further accelerate their convergence under proper hyperparameter choices.

Theorem 7 (Acceleration of G_LA-PPSim). Lookahead dynamics G_LA-PPSim with a learning rate η > 0, a synchronization period k ∈ N and a rate α ∈ (0, 1) converges to a Nash equilibrium of Equation 9. Furthermore, the rate of convergence is accelerated upon its base dynamics F_PPSim if k ∈ (π / (2 arctan ησ_min), 3π / (2 arctan ησ_min)) and α is large enough.

Theorem 8 (Acceleration of G_LA-EGSim). Lookahead dynamics G_LA-EGSim with a learning rate η ∈ (0, 1/σ_max), a synchronization period k ∈ N and a rate α ∈ (0, 1) converges to a Nash equilibrium of Equation 9. Furthermore, the rate of convergence is accelerated upon its base dynamics F_EGSim if η ∈ (0, 1/(2σ_max)), k ∈ (π / (2 arctan(ησ_min / (1 − ησ_min))), 3π / (2 arctan(ησ_min / (1 − ησ_min)))) and α is large enough.

Note that the existence of a k satisfying the acceleration conditions of Theorems 7 and 8 is always guaranteed for a small enough η. This contrasts Theorems 7 and 8 with Theorem 6, which only applies to well-conditioned games, and suggests that they apply to a wide range of bilinear games, including ill-conditioned ones.
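As a numerical illustration of the Theorem-6-style condition, the sketch below builds a random, well-conditioned bilinear game, picks an integer k inside the predicted interval, and checks that the Lookahead-wrapped simultaneous-GD operator becomes a contraction (the matrix A, the seed, and the hyperparameters are our own toy choices, not the paper's experiment):

```python
import numpy as np

# Theorem-6-style stabilization check for simultaneous GD on a random,
# well-conditioned bilinear game (toy setup of our own).
rng = np.random.default_rng(0)
n, eta, alpha = 5, 0.1, 0.1
A = np.eye(n) + 0.05 * rng.standard_normal((n, n))
s = np.linalg.svd(A, compute_uv=False)
assert s.max() / s.min() < 3          # conditioning requirement of Theorem 6

# Jacobian of simultaneous GD: x1 <- x1 - eta*A x2, x2 <- x2 + eta*A^T x1
M = np.block([[np.eye(n), -eta * A], [eta * A.T, np.eye(n)]])
lo = np.pi / (2 * np.arctan(eta * s.min()))
hi = 3 * np.pi / (2 * np.arctan(eta * s.max()))
k = int(np.floor(lo)) + 1             # an integer strictly inside (lo, hi)
assert lo < k < hi

G = (1 - alpha) * np.eye(2 * n) + alpha * np.linalg.matrix_power(M, k)
rho = np.abs(np.linalg.eigvals(G)).max()
print(rho)  # < 1: the Lookahead-wrapped GD-sim operator is a contraction
```

The base operator M has spectral radius above 1 for any η > 0, while the wrapped operator G contracts once k falls inside the interval and α is small.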

4. THE LIMIT POINTS OF LOOKAHEAD DYNAMICS

In this section, we characterize the limit points of Lookahead dynamics and reveal the connections between their stability and the hyperparameters of Lookahead. We start by defining a few stability concepts that are standard in dynamical systems theory (Slotine & Li, 1991).

Definition 3 (Lyapunov stability). Let F be a smooth vector field on R^n. Then x ∈ R^n is Lyapunov stable in F if for any ε > 0, there exists δ > 0 such that for any y ∈ R^n, ‖x − y‖ < δ implies ‖F^t(x) − F^t(y)‖ < ε for all t ∈ N.

Definition 4 (Asymptotic stability). A Lyapunov stable equilibrium x* ∈ R^n of a smooth vector field F is said to be asymptotically stable if there exists δ > 0 such that ‖x − x*‖ < δ implies lim_{t→∞} ‖F^t(x) − x*‖ = 0. Such x* is said to be locally asymptotically stable if δ < ∞.

We show that any Lyapunov stable equilibrium (SE) of a dynamics is a locally asymptotically stable equilibrium (LASE) of its Lookahead dynamics. Furthermore, we show that Lookahead can either stabilize or accelerate the local convergence to an equilibrium when the radius-supporting eigenvalues of the equilibrium satisfy certain assumptions on their imaginary parts.

Theorem 9 (SE_A ⊆ LASE_LA-A). Let x* ∈ R^n be a Lyapunov stable equilibrium of a dynamics F. Then x* is a LASE of its Lookahead dynamics G with a synchronization period k ∈ N and a rate α ∈ (0, 1) if λ_i^k ≠ 1 for each λ_i ∈ λ(∇_x F(x*)).

Theorem 10 (One-point local stabilization). Let x* ∈ R^n be an equilibrium of a dynamics F with ρ(∇_x F(x*)) > 1. Assume that every element of λ_{≥1}(∇_x F(x*)) has non-zero imaginary part. Then x* is a LASE of its Lookahead dynamics G with a synchronization period k ∈ N and a rate α ∈ (0, 1) if k ∈ (π / (2θ_min(λ_{≥1}(∇_x F(x*)))), 3π / (2θ_max(λ_{≥1}(∇_x F(x*))))) and α is small enough.

Theorem 11 (One-point local acceleration). Let x* ∈ R^n be an equilibrium of a dynamics F with ρ(∇_x F(x*)) < 1. Assume that every element of λ_max(∇_x F(x*)) has non-zero imaginary part.
Then, the local convergence rate to x* under a Lookahead dynamics G with a synchronization period k ∈ N and a rate α ∈ (0, 1) is accelerated upon F if k ∈ (π / (2θ_min(λ_max(∇_x F(x*)))), 3π / (2θ_max(λ_max(∇_x F(x*))))) and α is large enough.

Intuitively, Theorem 9 shows that Lookahead preserves the stability of its base dynamics, and Theorems 10 and 11 suggest that Lookahead can either stabilize or accelerate the local convergence to an equilibrium. Note that stabilization and acceleration are guaranteed when λ_{≥1}(∇_x F(x*)) and λ_max(∇_x F(x*)) contain no real eigenvalues and have imaginary conditioning less than 3; otherwise, a k satisfying the conditions of Theorems 10 and 11 may not exist (see Appendix E.10-E.11). An additional, but important, consequence of Theorem 10 is that the inclusion relationship implied by Theorem 9 is strict in general. In the context of Nash equilibrium (NE) computation, such a stabilization effect of Lookahead can be helpful when unstable NE are stabilized (e.g., in bilinear games). However, the stabilization effect also carries the possibility of introducing non-Nash LASE, which is harmful for NE computation (Mazumdar et al., 2019). Hence, the overall impact of Theorem 10 on the computation of NE depends on the global structure of the game and the base dynamics. Note that Theorems 10 and 11 require the radius-supporting eigenvalues to have non-zero imaginary parts and therefore do not apply to fully-cooperative (FC) games (i.e., minimization problems), which exhibit only real eigenvalues. To give an understanding of Lookahead dynamics in FC games, we present Propositions 12 and 13, which together imply that the iterates of Lookahead dynamics almost surely avoid unstable equilibria of the base dynamics in FC games (e.g., avoid local maxima).

Proposition 12 (Avoids unstable points). Let F be an L-Lipschitz smooth dynamics for some L > 0, and let G be its Lookahead dynamics with a synchronization period k ∈ N and a rate α ∈ (0, 1/(1 + L^k)).
Then the randomly initialized iterates of G almost surely avoid any equilibrium x* of G with ρ(∇_x G(x*)) > 1, provided that ρ(∇_x G(x₀)) ≠ 1 holds for every equilibrium x₀ of G.

Proposition 13 (Preserves unstable points in FC games). Let x* ∈ R^n be an equilibrium of a dynamics F with ρ(∇_x F(x*)) > 1, and assume that ∇_x F(x*) is a symmetric matrix with positive eigenvalues. Then ρ(∇_x G(x*)) > 1 holds for any Lookahead dynamics G with a synchronization period k ∈ N and a rate α ∈ (0, 1).

5. EXPERIMENTS

Bilinear game. We test our theoretical predictions from Section 3.2 (Theorems 5-8) on a bilinear game

min_{x1∈R^n} max_{x2∈R^n} x1^T A x2 with A := I_n + ε · E_n, (12)

where each element of E_n ∈ R^{n×n} is sampled from N(0, 1). We report our results using n = 10 and ε = 0.05, which gives a sample of A with σ_max = 1.195 and σ_min = 0.852, hence σ_max / σ_min = 1.401 < 3. For a fixed η = 0.1, we use Theorems 5-8 to derive a range of k and an approximate scale of α that guarantee stabilization and acceleration of convergence to a Nash equilibrium (NE) of Equation 12. We provide the derivations of the theoretically recommended values and the actual configurations used for the experiment in Appendix D. Figure 2(a) shows that the hyperparameters predicted by our theorems, denoted by LA-GD Alt/Sim+ and LA-EG Sim+, indeed stabilize and accelerate the convergence to a NE. We also test hyperparameters chosen against our theorems, denoted LA-GD Alt/Sim− and LA-EG Sim−. Specifically, we choose a k smaller than the lower bound predicted by our theorems, and use a large α for unstable base dynamics and a small α for stable base dynamics. The result in Figure 2(a) suggests that Lookahead can fail to stabilize, or even worse, slow down the convergence when its hyperparameters are configured badly.

Nonlinear game. We verify our theoretical predictions from Section 4 (Theorems 10 and 11) on the non-linear game proposed by Hsieh et al. (2020):

min_{x1∈R} max_{x2∈R} x1 · x2 + ε φ(x2), where φ(x) := (1/2) x² − (1/4) x⁴ with ε > 0. (13)

This game has an unstable critical point (0, 0) surrounded by an attractive internally chain-transitive (ICT) set, which may contain arbitrarily long trajectories. Hsieh et al. (2020) demonstrate that most first-order methods fail to converge in this game due to the instability of the equilibrium and the existence of the ICT set. For fixed ε = 0.01 and η = 0.05, we use Theorems 10 and 11 to derive a range of k and an approximate scale of α that guarantee local stabilization and acceleration to the equilibrium of Equation 13.
We provide the detailed derivations of the theoretically recommended values and the configurations in Appendix D. The results of this experiment are reported in Figure 2.
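The nonlinear experiment can be sketched numerically. The following is our own minimal simulation (the alternating step rule, the initial point, k = 50 and α = 0.3 are our illustrative choices, not the paper's exact configuration):

```python
import numpy as np

# Our own minimal simulation of the nonlinear game of Equation 13,
# comparing alternating GD with its Lookahead wrapper.
eps, eta = 0.01, 0.05

def gd_alt_step(x):
    x1, x2 = x
    x1 = x1 - eta * x2                             # descent step of player 1
    x2 = x2 + eta * (x1 + eps * (x2 - x2**3))      # ascent step of player 2
    return np.array([x1, x2])

def la_step(x, k=50, alpha=0.3):                   # k inside (31.16, 93.49)
    fast = x
    for _ in range(k):
        fast = gd_alt_step(fast)
    return (1 - alpha) * x + alpha * fast

x_base = np.array([0.1, 0.1])
x_la = np.array([0.1, 0.1])
for _ in range(200):
    x_base = gd_alt_step(x_base)
    x_la = la_step(x_la)
print(np.linalg.norm(x_base), np.linalg.norm(x_la))
# base GD-alt stays away from the equilibrium; Lookahead drives it to (0, 0)
```

With k chosen inside the range recommended by Theorem 10 (see Appendix D), the Lookahead iterate contracts to the unstable critical point that the base dynamics spirals away from.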

6. CONCLUSION

In this work, we derived, for the first time, theoretical results on convergence guarantees and acceleration of Lookahead dynamics in smooth games. Specifically, we derived sufficient conditions on the hyperparameters of the Lookahead optimizer under which the convergence of bilinear games is either stabilized or accelerated. Furthermore, we proved that the Lookahead optimizer preserves locally asymptotically stable equilibria of smooth games. Finally, we showed that Lookahead can either stabilize or accelerate the local convergence to a given equilibrium under proper assumptions. Our results point to several future research directions. Lemma 4 suggests that the imaginary conditioning of the radius-supporting eigenvalues is crucial for the performance gain of Lookahead. Therefore, developing an optimizer that exhibits a small imaginary conditioning could improve the convergence of its Lookahead dynamics. Another interesting application of our theoretical results would be designing an adaptive mechanism for the Lookahead hyperparameters by applying our theorems to a local bilinear approximation (Schäfer & Anandkumar, 2019) of the game at each step.

A NOTATION

x_i : the i-th element of a vector x = (x_1, . . . , x_n), or the i-th vector of concatenated vectors x = (x_1, . . . , x_n)
x_{−i} : (x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n)
∇_x f(x') : the derivative of a function f evaluated at x'
S_r : the zero-centered circle of radius r > 0 in C
ℜ(z) : the real part of z ∈ C
ℑ(z) : the imaginary part of z ∈ C
Arg(z) : the angle between z ∈ C and the real axis of the complex plane
σ(A) : the set of singular values of A ∈ R^{m×n}
ρ(A) : the spectral radius of A ∈ R^{m×m}
λ(A) : the set of eigenvalues of A ∈ R^{m×m}
λ_{≥a}(A) : the set of eigenvalues of A with modulus larger than or equal to a ∈ R
λ_max(A) : the set of eigenvalues of A with the largest modulus
ℜ(S) : {ℜ(c) | c ∈ S} for S ⊆ C
ℜ_min(S) : min ℜ(S)
ℜ_max(S) : max ℜ(S)
ℑ(S) : {ℑ(c) | c ∈ S} for S ⊆ C
ℑ_{≥0}(S) : {v ∈ ℑ(S) | v ≥ 0}
ℑ_min(S) : min ℑ_{≥0}(S)
ℑ_max(S) : max ℑ_{≥0}(S)
θ(S) : {Arg(c) | c ∈ S} for S ⊆ C
θ_{≥0}(S) : {θ_i ∈ θ(S) | θ_i ≥ 0}
θ_min(S) : min θ_{≥0}(S)
θ_max(S) : max θ_{≥0}(S)

B USEFUL FACTS

B.1 STANDARD RESULTS ON CONVERGENCE

Lemma 14 (Bertsekas (1999)). Let F : R^m → R^m be continuously differentiable, and let x* ∈ R^m be such that F(x*) = x*. Assume that ρ(∇_x F(x*)) < 1. Then there is an open neighborhood U_{x*} of x* such that for any x ∈ U_{x*}, ‖F^t(x) − x*‖₂ ∈ O(ρ(∇_x F(x*))^t) for t → ∞.

Lemma 15 (Gidel et al. (2019b)). Let M ∈ R^{m×m} and let u^(t) be a sequence of iterates such that u^(t+1) = M u^(t). Then we have three cases of interest for the spectral radius ρ(M):
• If ρ(M) < 1 and M is diagonalizable, then ‖u^(t)‖₂ ∈ O(ρ(M)^t ‖u^(0)‖₂).
• If ρ(M) > 1, then there exists u^(0) such that ‖u^(t)‖₂ ∈ Ω(ρ(M)^t ‖u^(0)‖₂).
• If |λ_i| = 1, ∀λ_i ∈ λ(M), and M is diagonalizable, then ‖u^(t)‖₂ ∈ Θ(‖u^(0)‖₂).

B.2 CHARACTERISTIC EQUATIONS OF FIRST-ORDER DYNAMICS IN BILINEAR GAMES

Recent work of Zhang & Yu (2020) provides exact and optimal conditions for popular first-order methods to converge in zero-sum bilinear games, when possible. Besides the exact conditions and the choice of optimal hyperparameters, they also derive the characteristic equation of each first-order dynamics in zero-sum bilinear games. Since our proofs of the theorems in Section 3.2 rely heavily on these characteristic equations, we restate a somewhat simplified version of them for Equation 9 in our notation:

GD_Alt: (λ_i − 1)² + η²σ_i² λ_i = 0. (14)
GD_Sim: (λ_i − 1)² + η²σ_i² = 0. (15)
PP_Sim: (1/λ_i − 1)² + η²σ_i² = 0. (16)
EG_Alt: (λ_i − 1)² + (η² + 2η)σ_i² (λ_i − 1) + (η²σ_i² + η²σ_i⁴) = 0. (17)
EG_Sim: (λ_i − 1)² + 2η²σ_i² (λ_i − 1) + η²σ_i² + η⁴σ_i⁴ = 0. (18)

Here σ_i denotes the singular values of the matrix A in Equation 9, and λ_i denotes the eigenvalues of each dynamics' Jacobian matrix. Note that Zhang & Yu (2020) also derive characteristic equations of memory-augmented first-order methods, such as OGD (Popov, 1980) and the momentum method, which we do not cover in this paper.
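These characteristic equations can be sanity-checked against numerically computed eigenvalues of the per-mode Jacobians. The 2×2 per-singular-value blocks below for simultaneous GD and EG are our own derivation, not code from the cited work:

```python
import numpy as np

# Sanity check of characteristic equations (15) and (18) on a single
# singular-value mode sigma (per-mode 2x2 Jacobians derived by hand).
eta, sigma = 0.1, 0.7
gd = np.array([[1.0, -eta * sigma], [eta * sigma, 1.0]])
eg = np.array([[1.0 - (eta * sigma)**2, -eta * sigma],
               [eta * sigma, 1.0 - (eta * sigma)**2]])

for lam in np.linalg.eigvals(gd):
    assert abs((lam - 1)**2 + eta**2 * sigma**2) < 1e-10            # Eq. (15)
for lam in np.linalg.eigvals(eg):
    res = ((lam - 1)**2 + 2 * eta**2 * sigma**2 * (lam - 1)
           + eta**2 * sigma**2 + eta**4 * sigma**4)
    assert abs(res) < 1e-10                                         # Eq. (18)
print("characteristic equations verified")
```

Each numerically computed eigenvalue is a root of the corresponding characteristic polynomial up to floating-point error, confirming the per-mode reduction used in Section 3.2.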

C OMITTED RESULTS

Proposition 16. Alternating GD dynamics $F_{\mathrm{GDAlt}}$ with a learning rate $\eta \in (0, 2)$ fails to converge and oscillates around the Nash equilibrium of the game in Equation 8. However, its Lookahead dynamics $G_{\mathrm{LA\text{-}GDAlt}}$ with a synchronization period $k \in \mathbb{N}$ and a rate $\alpha \in (0, 1)$ globally converges to the Nash equilibrium if $\big(1 - \frac{\eta^2}{2} + i\frac{\eta\sqrt{4-\eta^2}}{2}\big)^k \neq 1$.

Proof. One can easily check from Equation 2 that the dynamics $F_{\mathrm{GDAlt}}$ can be written as
$$F_{\mathrm{GDAlt}}(x_1^{(t)}, x_2^{(t)}) = \begin{pmatrix} 1 & -\eta \\ \eta & 1-\eta^2 \end{pmatrix} \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Defining $M \stackrel{\mathrm{def}}{=} \begin{pmatrix} 1 & -\eta \\ \eta & 1-\eta^2 \end{pmatrix}$, the Lookahead dynamics $G_{\mathrm{LA\text{-}GDAlt}}$ can be written as
$$G_{\mathrm{LA\text{-}GDAlt}}(x_1^{(t)}, x_2^{(t)}) = \big((1-\alpha)I + \alpha M^k\big) \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
It follows that the eigenvalues of $\nabla_x G_{\mathrm{LA\text{-}GDAlt}}$ can be written as $1-\alpha+\alpha\lambda_\pm^k$ with $\lambda_\pm \stackrel{\mathrm{def}}{=} 1 - \frac{\eta^2}{2} \pm i\frac{\eta\sqrt{4-\eta^2}}{2} \in \lambda(M)$ for any $\eta \in (0, 2)$. However, $1-\alpha+\alpha\lambda_\pm^k$ is an interpolation between two distinct points on $S^1$ since $|\lambda_\pm| = 1$ and $\lambda_\pm^k \neq 1$, implying $|1-\alpha+\alpha\lambda_\pm^k| < 1$. Therefore, we conclude from Lemma 14 that the iterates of $G_{\mathrm{LA\text{-}GDAlt}}$ converge to the Nash equilibrium $(0, 0)$ of the game with convergence rate $O(|1-\alpha+\alpha\lambda_\pm^k|^{t/k})$, assuming the amortization of its computation over $k$ forward steps. The proof for the oscillation of $F_{\mathrm{GDAlt}}$ follows from Lemma 15 and can be found in Gidel et al. (2019a).

Proposition 17. Simultaneous EG Lookahead dynamics $G_{\mathrm{LA\text{-}EGSim}}$ with a learning rate $\eta \in (0, 1)$, a synchronization period $k \in \mathbb{N}$ and a rate $\alpha \in (0, 1)$ globally converges to the Nash equilibrium of Equation 8. Furthermore, the rate of convergence is improved upon its base dynamics $F_{\mathrm{EGSim}}$ if $\Re\big((1-\eta^2+i\eta)^k\big) < (1-\eta^2+\eta^4)^k$ and $\alpha$ is large enough.

Proof. Using simple algebra on Equation 5, the dynamics $F_{\mathrm{EGSim}}$ can be written as
$$F_{\mathrm{EGSim}}(x_1^{(t)}, x_2^{(t)}) = \begin{pmatrix} 1-\eta^2 & -\eta \\ \eta & 1-\eta^2 \end{pmatrix} \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Defining $M \stackrel{\mathrm{def}}{=} \begin{pmatrix} 1-\eta^2 & -\eta \\ \eta & 1-\eta^2 \end{pmatrix}$, its Lookahead dynamics $G_{\mathrm{LA\text{-}EGSim}}$ with a synchronization period $k \in \mathbb{N}$ and a rate $\alpha \in (0, 1)$ can be written as
$$G_{\mathrm{LA\text{-}EGSim}}(x_1^{(t)}, x_2^{(t)}) = \big((1-\alpha)I + \alpha M^k\big) \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
It follows that the eigenvalues of $\nabla_x G_{\mathrm{LA\text{-}EGSim}}$ are $1-\alpha+\alpha\lambda_\pm^k$ with $\lambda_\pm \stackrel{\mathrm{def}}{=} 1-\eta^2 \pm i\eta \in \lambda(M)$. However, $1-\alpha+\alpha\lambda_\pm^k$ is an interpolation between two distinct points on/inside $S^1$ since $|\lambda_\pm|^k < 1$ for any $\eta \in (0, 1)$. It follows that $|1-\alpha+\alpha\lambda_\pm^k| < 1$, from which we conclude from Lemma 14 that the iterates of $G_{\mathrm{LA\text{-}EGSim}}$ converge to the Nash equilibrium $(0, 0)$ of the game with convergence rate $O(|1-\alpha+\alpha\lambda_\pm^k|^{t/k})$, assuming the amortization of its computation over $k$ forward steps.

Now we show that the convergence is accelerated upon its base dynamics $F_{\mathrm{EGSim}}$ if $\Re\big((1-\eta^2+i\eta)^k\big) < (1-\eta^2+\eta^4)^k$, i.e. $\Re(\lambda_\pm^k) < |\lambda_\pm|^{2k}$, and $\alpha$ is large enough. Figure 1(c) intuitively shows that the line segment between $(1, 0)$ and $\lambda_\pm^k$ contains a line segment inside $S_{|\lambda_\pm|^k}$ when $k$ is such that $\Re(\lambda_\pm^k) < |\lambda_\pm|^{2k}$. Therefore, for a large enough $\alpha$, the interpolation $1-\alpha+\alpha\lambda_\pm^k$ lies inside $S_{|\lambda_\pm|^k}$. This implies that the convergence rate $O(|1-\alpha+\alpha\lambda_\pm^k|^{t/k})$ of $G_{\mathrm{LA\text{-}EGSim}}$ is accelerated upon the rate $O(|\lambda_\pm|^t)$ of its base dynamics.

Proposition 18 (Equilibrium of Lookahead dynamics). Let $F$ be a dynamics and $G$ be its associated Lookahead dynamics with a synchronization period $k \in \mathbb{N}$. Then any equilibrium of $F$ is an equilibrium of $G$, and any equilibrium of $G$ is a periodic point of $F$.

Proof. Let $k \in \mathbb{N}$ and $\alpha \in (0, 1)$ be the synchronization period and synchronization rate of $G$, respectively. It is trivial to see that $G(x^*) = ((1-\alpha)\mathrm{id} + \alpha F^k)(x^*) = (1-\alpha)x^* + \alpha x^* = x^*$ if $F(x^*) = x^*$. Conversely, one can easily check that $G(x^*) = (1-\alpha)x^* + \alpha F^k(x^*) = x^*$ implies $F^k(x^*) = x^*$.
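The mechanism of Proposition 16 can be checked numerically. The sketch below (ours, with arbitrary hyperparameters satisfying the proposition's condition) shows that the base matrix $M$ of alternating GD has spectral radius exactly 1, while the Lookahead map $(1-\alpha)I + \alpha M^k$ is a strict contraction.

```python
import numpy as np

eta, k, alpha = 0.5, 5, 0.5             # illustrative values with lambda**k != 1
M = np.array([[1.0, -eta],
              [eta, 1.0 - eta ** 2]])   # alternating GD on min max x1*x2

rho_base = np.abs(np.linalg.eigvals(M)).max()   # = 1: pure oscillation
G = (1 - alpha) * np.eye(2) + alpha * np.linalg.matrix_power(M, k)
rho_la = np.abs(np.linalg.eigvals(G)).max()     # < 1: global convergence
```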

D EXPERIMENTAL DETAILS

We report the actual hyperparameters used for the experiments of Section 5 in Tables D.2 and D.3. Furthermore, we also provide the detailed derivations of the theoretically recommended range of the synchronization period $k \in \mathbb{N}$.

LA-GD Alt

From Equation 2, the Jacobian of the dynamics $F_{\mathrm{GDAlt}}$ of Equation 13 can be derived as
$$\nabla_x F_{\mathrm{GDAlt}}(x_1, x_2) = \begin{pmatrix} 1 & -\eta \\ \eta & 1 - \eta^2 + \eta\epsilon(1 - 3x_2^2) \end{pmatrix},$$
and it is trivial to see that the dynamics has an equilibrium at $(0, 0)$. By plugging in $\epsilon = 0.01$ and $\eta = 0.05$, we obtain $\nabla_x F_{\mathrm{GDAlt}}(0, 0) = \begin{pmatrix} 1 & -0.05 \\ 0.05 & 0.998 \end{pmatrix}$ with eigenvalues $\lambda_\pm \stackrel{\mathrm{def}}{=} 0.999 \pm 0.05i$. Note that $|\lambda_\pm| = 1.0003 > 1$ and $\nabla_x F_{\mathrm{GDAlt}}(0, 0)$ has an imaginary conditioning of 1, which implies that the origin is an unstable equilibrium of GD Alt that can be locally stabilized by a Lookahead dynamics. By plugging the eigenvalues and $\theta_{\min}(\nabla_x F_{\mathrm{GDAlt}}(0, 0)) = \theta_{\max}(\nabla_x F_{\mathrm{GDAlt}}(0, 0)) = \arctan\frac{0.05}{0.99} \approx 0.0504$ into Theorem 10, we obtain the theoretically recommended range of $k$ as $(31.16, 93.49)$.

LA-GD Sim

From Equation 1, the Jacobian of the dynamics $F_{\mathrm{GDSim}}$ of Equation 13 can be derived as
$$\nabla_x F_{\mathrm{GDSim}}(x_1, x_2) = \begin{pmatrix} 1 & -\eta \\ \eta & 1 + \eta\epsilon(1 - 3x_2^2) \end{pmatrix},$$
and it is trivial to see that the dynamics has an equilibrium at $(0, 0)$. By plugging in $\epsilon = 0.01$ and $\eta = 0.05$, we obtain $\nabla_x F_{\mathrm{GDSim}}(0, 0) = \begin{pmatrix} 1 & -0.05 \\ 0.05 & 1.005 \end{pmatrix}$ with eigenvalues $\lambda_\pm \stackrel{\mathrm{def}}{=} 1.0025 \pm 0.0499i$. Note that $|\lambda_\pm| = 1.0037 > 1$ and $\nabla_x F_{\mathrm{GDSim}}(0, 0)$ has an imaginary conditioning of 1, which implies that the origin is an unstable equilibrium of GD Sim that can be locally stabilized by a Lookahead dynamics. By plugging the eigenvalues and $\theta_{\min}(\nabla_x F_{\mathrm{GDSim}}(0, 0)) = \theta_{\max}(\nabla_x F_{\mathrm{GDSim}}(0, 0)) = \arctan\frac{0.0499}{1.0025} = 0.0497$ into Theorem 10, we obtain the theoretically recommended range of $k$ as $(31.6, 94.81)$.

LA-EG Sim

From Equation 5, the dynamics $F_{\mathrm{EGSim}}$ of Equation 13 can be derived as
$$F_{\mathrm{EGSim}}(x_1, x_2) = \begin{pmatrix} x_1 - \eta\tilde{x}_2 \\ x_2 + \eta\big(\tilde{x}_1 + \epsilon(\tilde{x}_2 - \tilde{x}_2^3)\big) \end{pmatrix}, \quad \text{where} \quad \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{pmatrix} = \begin{pmatrix} x_1 - \eta x_2 \\ x_2 + \eta\big(x_1 + \epsilon(x_2 - x_2^3)\big) \end{pmatrix}.$$
By computing the derivatives at $x_1 = 0$, $x_2 = 0$ with $\epsilon = 0.01$ and $\eta = 0.05$, we obtain $\nabla_x F_{\mathrm{EGSim}}(0, 0) = \begin{pmatrix} 0.9975 & -0.05 \\ 0.05 & 0.9005 \end{pmatrix}$ with eigenvalues $\lambda_\pm = 0.949 \pm 0.0122i$. Note that $|\lambda_\pm| = 0.949 < 1$ and $\nabla_x F_{\mathrm{EGSim}}(0, 0)$ has an imaginary conditioning of 1, which implies that the origin is a stable equilibrium of EG Sim whose local convergence can be accelerated by a Lookahead dynamics. By plugging the eigenvalues and $\theta_{\min}(\nabla_x F_{\mathrm{EGSim}}(0, 0)) = \theta_{\max}(\nabla_x F_{\mathrm{EGSim}}(0, 0)) = \arctan\frac{0.0122}{0.949} = 0.0129$ into Theorem 11, we obtain the theoretically recommended range of $k$ as $(121.76, 365.30)$.
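The derivations above reduce to a few lines of linear algebra. The snippet below (our sketch, not the authors' code) reproduces the LA-GD Sim numbers: the eigenvalues of the Jacobian at the origin and the recommended range $(\pi/2\theta, 3\pi/2\theta)$ of the synchronization period.

```python
import numpy as np

J = np.array([[1.0, -0.05],
              [0.05, 1.005]])                 # Jacobian of GD Sim at the origin
lam = np.linalg.eigvals(J)
lam_top = lam[np.argmax(lam.imag)]            # branch with positive imaginary part
theta = np.arctan2(lam_top.imag, lam_top.real)
k_lo, k_hi = np.pi / (2 * theta), 3 * np.pi / (2 * theta)
```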

E PROOFS

E.1 PROOF OF PROPOSITION 1

Proof. One can easily check from Equation 1 that the dynamics $F_{\mathrm{GDSim}}$ can be written as
$$F_{\mathrm{GDSim}}(x_1^{(t)}, x_2^{(t)}) = \begin{pmatrix} 1 & -\eta \\ \eta & 1 \end{pmatrix} \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Defining $M \stackrel{\mathrm{def}}{=} \begin{pmatrix} 1 & -\eta \\ \eta & 1 \end{pmatrix}$, its Lookahead dynamics $G_{\mathrm{LA\text{-}GDSim}}$ can be written as
$$G_{\mathrm{LA\text{-}GDSim}}(x_1^{(t)}, x_2^{(t)}) = \big((1-\alpha)I + \alpha M^k\big) \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
It follows that the eigenvalues of $\nabla_x G_{\mathrm{LA\text{-}GDSim}}$ can be written as $1-\alpha+\alpha\lambda_\pm^k$ with $\lambda_\pm \stackrel{\mathrm{def}}{=} 1 \pm i\eta \in \lambda(M)$. Assuming $\Re\big((1+i\eta)^k\big) < 1$, the line segment between $(1, 0)$ and $\lambda_\pm^k$ contains a line segment inside $S^1$, as in Figure 1(b). Therefore, for a small enough $\alpha$, the interpolation $1-\alpha+\alpha\lambda_\pm^k$ lies inside $S^1$, implying $|1-\alpha+\alpha\lambda_\pm^k| < 1$. We thus conclude from Lemma 15 that the iterates of $G_{\mathrm{LA\text{-}GDSim}}$ converge to the Nash equilibrium $(0, 0)$ of the game. The proof for the divergence of $F_{\mathrm{GDSim}}$ follows from Lemma 15 and can be found in Gidel et al. (2019a).
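The geometry of this proof is easy to verify numerically. In the sketch below (ours, with arbitrary values), $\eta$ and $k$ are chosen so that $\Re((1+i\eta)^k) < 1$, and a small $\alpha$ pulls the Lookahead eigenvalue inside the unit circle even though $|\lambda_\pm| > 1$.

```python
eta = 0.3
lam = 1 + 1j * eta                 # eigenvalue of the GD Sim Jacobian
k = 8                              # chosen so that (lam ** k).real < 1
alpha = 0.05                       # a "small enough" synchronization rate
la_eig = 1 - alpha + alpha * lam ** k   # Lookahead eigenvalue, inside S^1
```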

E.2 PROOF OF PROPOSITION 2

Proof. Using simple algebra on Equation 4, the dynamics $F_{\mathrm{PPSim}}$ can be written as
$$F_{\mathrm{PPSim}}(x_1^{(t)}, x_2^{(t)}) = \frac{1}{1+\eta^2}\begin{pmatrix} 1 & -\eta \\ \eta & 1 \end{pmatrix} \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Defining $M \stackrel{\mathrm{def}}{=} \frac{1}{1+\eta^2}\begin{pmatrix} 1 & -\eta \\ \eta & 1 \end{pmatrix}$, its Lookahead dynamics $G_{\mathrm{LA\text{-}PPSim}}$ with a synchronization period $k \in \mathbb{N}$ and a rate $\alpha \in (0, 1)$ can be written as
$$G_{\mathrm{LA\text{-}PPSim}}(x_1^{(t)}, x_2^{(t)}) = \big((1-\alpha)I + \alpha M^k\big) \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
It follows that the eigenvalues of $\nabla_x G_{\mathrm{LA\text{-}PPSim}}$ are $1-\alpha+\alpha\lambda_\pm^k$ with $\lambda_\pm \stackrel{\mathrm{def}}{=} \frac{1 \pm i\eta}{1+\eta^2} \in \lambda(M)$. We know that $1-\alpha+\alpha\lambda_\pm^k$ is an interpolation between two distinct points on/inside $S^1$ since $|\lambda_\pm|^k < 1$ for any $\eta \in (0, 1)$. It follows that $|1-\alpha+\alpha\lambda_\pm^k| < 1$, from which we conclude from Lemma 14 that the iterates of $G_{\mathrm{LA\text{-}PPSim}}$ converge to the Nash equilibrium $(0, 0)$ of the game with convergence rate $O(|1-\alpha+\alpha\lambda_\pm^k|^{t/k})$, assuming the amortization of its computation over $k$ forward steps.

Now we show that the convergence is accelerated upon the base dynamics $F_{\mathrm{PPSim}}$ if $\Re\big((1+i\eta)^k\big) < 1$, i.e. $\Re(\lambda_\pm^k) < |\lambda_\pm|^{2k}$, and $\alpha$ is large enough. Figure 1(c) intuitively shows that the line segment between $(1, 0)$ and $\lambda_\pm^k$ contains a line segment inside $S_{|\lambda_\pm|^k}$ when $k$ is such that $\Re(\lambda_\pm^k) < |\lambda_\pm|^{2k}$. Therefore, the interpolation $1-\alpha+\alpha\lambda_\pm^k$ lies inside $S_{|\lambda_\pm|^k}$ for a large enough $\alpha$. This implies that the convergence rate $O(|1-\alpha+\alpha\lambda_\pm^k|^{t/k})$ of $G_{\mathrm{LA\text{-}PPSim}}$ is accelerated upon the rate $O(|\lambda_\pm|^t)$ of its base dynamics.
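The acceleration can be observed directly on the eigenvalues. In this sketch (ours, with arbitrary hyperparameters in the regime of the proposition), the Lookahead rate, amortized over the $k$ forward steps, beats the per-step contraction of the base PP Sim dynamics.

```python
eta, k, alpha = 0.3, 8, 0.9
lam = (1 + 1j * eta) / (1 + eta ** 2)   # eigenvalue of the PP Sim map
g = 1 - alpha + alpha * lam ** k        # corresponding Lookahead eigenvalue

base_rate = abs(lam)                    # per-step contraction of PP Sim
la_rate = abs(g) ** (1.0 / k)           # Lookahead rate amortized over k steps
```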

E.3 PROOF OF LEMMA 3

Proof. We prove each of the cases in their order.

Case $\rho(X) = 1$. Assume that $\lambda_{\max}^k \neq 1$ for any $\lambda_{\max} \in \lambda_{\max}(X)$. Then we can immediately conclude $\rho(f(X)) < 1$, since $1-\alpha+\alpha\lambda_i^k \in \lambda(f(X))$ is an interpolation between two distinct points $(1, 0)$ and $\lambda_i^k$ on/inside $S^1$ for any $\lambda_i \in \lambda(X)$.

Case $\rho(X) > 1$. Assume that $\Re(\lambda^k) < 0$ for any $\lambda \in \lambda_{\geq 1}(X)$. Then for any such $\lambda$,
$$\overline{AC} = \alpha|\lambda^k - 1| < 2\cos(\pi - \theta(\lambda)) = \overline{AD} \qquad (34)$$
is sufficient to place $1-\alpha+\alpha\lambda^k$ inside $S^1$, which holds for a small enough $\alpha$ (see Figure 4(a)). Furthermore, for any $\lambda \in \lambda(X)$ such that $|\lambda| < 1$, $1-\alpha+\alpha\lambda^k$ lies inside $S^1$ since $1-\alpha+\alpha\lambda^k$ is an interpolation between two distinct points on/inside $S^1$. Therefore we conclude $\rho(f(X)) < 1$.

Case $\rho(X) < 1$. Assume that $\Re(\lambda^k) < \rho(X)^{2k}$ for any $\lambda \in \lambda_{\max}(X)$. Then for any $\lambda_i \in \lambda(X)$, $\lambda_i^k$ can be visualized as point $B$ in Figure 4(b), since the existence of point $D$ is guaranteed by $\Re(\lambda^k) < \rho(X)^{2k}$, and $\sin(\varphi(\lambda_i)) = \sin(\theta(\lambda_i))/\rho(X)^k$ follows from the law of sines. Therefore we can intuitively see from the figure that
$$\overline{BC} = (1-\alpha)|\lambda_i^k - 1| < 2\rho(X)^k\cos(\pi - \varphi(\lambda_i)) = \overline{BD} \qquad (35)$$
is sufficient to place $1-\alpha+\alpha\lambda_i^k$ inside $S_{\rho(X)^k}$, which holds for a large enough $\alpha$. Therefore we conclude $\rho(f(X)) < \rho(X)^k$.

E.4 PROOF OF LEMMA 4

Proof. Using simple algebra, we can see that $\theta_{\max} < f(\theta_{\min})$ for $f : \mathbb{R} \to \mathbb{R}$ defined by $f(x) = \frac{3\pi x}{\pi + 2x}$ is equivalent to $\frac{3\pi}{2\theta_{\max}} - \frac{\pi}{2\theta_{\min}} > 1$, implying a nonempty $\mathbb{N} \cap \big(\frac{\pi}{2\theta_{\min}}, \frac{3\pi}{2\theta_{\max}}\big)$. Therefore it suffices to show that $\theta_{\max} < f(\theta_{\min})$ holds for a small enough $\eta > 0$ when $\frac{\max(S)}{\min(S)} < 3$. Let us define a function $H : \mathbb{R} \to \mathbb{R}$ given by
$$H(\eta) \stackrel{\mathrm{def}}{=} \Big(1 + \frac{2\theta_{\max+}}{\pi}\Big) \cdot \frac{1 + \eta\max(S)}{1 + \eta\min(S)} \cdot \Big(\frac{1 + 2\sec\theta_{\min-}}{1 + 2\sec\theta_{\max+}} + b\Big), \quad b \stackrel{\mathrm{def}}{=} \frac{(1 + 2\sec\theta_{\min-})\tan^4\theta_{\max+}}{540},$$
where $\tan\theta_{\min-} \stackrel{\mathrm{def}}{=} \frac{\eta\min(S)}{1+\eta\max(S)}$ and $\tan\theta_{\max+} \stackrel{\mathrm{def}}{=} \frac{\eta\max(S)}{1+\eta\min(S)}$. We show that the inequality
$$\frac{\max(S)}{\min(S)} < \frac{3}{H(\eta)} \qquad (37)$$
implies $\theta_{\max} < f(\theta_{\min})$, and conclude the proof by showing that there exists a small enough $\eta > 0$ that satisfies Equation 37 when $\frac{\max(S)}{\min(S)} < 3$. Note that the inequalities $\theta_{\min-} \leq \theta_{\min}$ and $\theta_{\max} \leq \theta_{\max+}$ directly follow from the definitions of $\theta_{\min-}$ and $\theta_{\max+}$.

Furthermore, using the Shafer-type double inequalities (Mortici & Srivastava, 2014) for $\arctan(\cdot)$, we obtain
$$\theta_{\min-} \geq \frac{3\tan\theta_{\min-}}{1 + 2\sqrt{1+\tan^2\theta_{\min-}}} = \frac{3\eta\min(S)}{(1+\eta\max(S))(1+2\sec\theta_{\min-})}, \qquad (38)$$
$$\theta_{\max+} \leq \frac{3\tan\theta_{\max+}}{1 + 2\sqrt{1+\tan^2\theta_{\max+}}} + \frac{\tan^5\theta_{\max+}}{180} = \frac{3\eta\max(S)}{(1+\eta\min(S))(1+2\sec\theta_{\max+})} + \frac{\eta\max(S)\tan^4\theta_{\max+}}{180(1+\eta\min(S))}, \qquad (39)$$
from which it follows that
$$\frac{\theta_{\max}}{\theta_{\min}} \leq \frac{\theta_{\max+}}{\theta_{\min-}} \leq \frac{\max(S)}{\min(S)} \cdot \frac{1+\eta\max(S)}{1+\eta\min(S)} \cdot \Big(\frac{1+2\sec\theta_{\min-}}{1+2\sec\theta_{\max+}} + b\Big). \qquad (40)$$
However, assuming inequality 37, we can derive
$$\frac{\max(S)}{\min(S)} \cdot \frac{1+\eta\max(S)}{1+\eta\min(S)} \cdot \Big(\frac{1+2\sec\theta_{\min-}}{1+2\sec\theta_{\max+}} + b\Big) < \frac{3\pi}{\pi + 2\theta_{\max+}} = \frac{f(\theta_{\max+})}{\theta_{\max+}}. \qquad (41)$$
Furthermore, since $f'(x) = \frac{3\pi^2}{(\pi+2x)^2}$, we know that $f$ is both concave and monotonically increasing. Hence it follows that $\frac{f(\theta_{\max+})}{\theta_{\max+}} \leq \frac{f(\theta_{\min})}{\theta_{\min}}$, from which we obtain $\theta_{\max} < f(\theta_{\min})$ by combining Equations 40 and 41. Finally, we prove that Equation 37 holds for a small enough $\eta > 0$ when $\frac{\max(S)}{\min(S)} < 3$. Assume $\frac{\max(S)}{\min(S)} < 3$ and let $\epsilon \stackrel{\mathrm{def}}{=} 3 - \frac{\max(S)}{\min(S)} > 0$. By the continuity of $\frac{3}{H(\cdot)}$ at $\eta = 0$ and the fact that $H(0) = 1$, there exists $\delta > 0$ such that $\big|3 - \frac{3}{H(\eta)}\big| < \epsilon$ holds for any $\eta \in (0, \delta)$. Therefore we have $\frac{\max(S)}{\min(S)} = 3 - \epsilon < \frac{3}{H(\eta)}$ for any $\eta \in (0, \delta)$, concluding the proof.
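The rotation argument behind Lemma 4 can be checked directly: for an eigenvalue with argument $\theta$, every integer $k$ in $(\pi/2\theta, 3\pi/2\theta)$ places $k\theta$ in $(\pi/2, 3\pi/2)$, so $\lambda^k$ lands in the left half-plane. A minimal check (ours, with an arbitrary $\theta = 0.05$):

```python
import numpy as np

theta = 0.05
lam = np.exp(1j * theta)                    # unit-modulus eigenvalue, argument theta
k_lo = int(np.pi / (2 * theta)) + 1         # smallest integer above pi/(2*theta)
k_hi = int(3 * np.pi / (2 * theta))         # largest integer below 3*pi/(2*theta)
ks = np.arange(k_lo, k_hi + 1)
reals = np.real(lam ** ks)                  # all strictly negative
```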

E.5 PROOF OF THEOREM 5

Proof. From Equation 2, the dynamics $F_{\mathrm{GDAlt}}$ of Equation 11 can be derived as
$$F_{\mathrm{GDAlt}}(x_1^{(t)}, x_2^{(t)}) = \begin{pmatrix} I_r & -\eta\Sigma_r \\ \eta\Sigma_r & I_r - \eta^2\Sigma_r^2 \end{pmatrix} \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Defining $M \stackrel{\mathrm{def}}{=} \begin{pmatrix} I_r & -\eta\Sigma_r \\ \eta\Sigma_r & I_r - \eta^2\Sigma_r^2 \end{pmatrix}$, its Lookahead dynamics $G_{\mathrm{LA\text{-}GDAlt}}$ with a synchronization period $k \in \mathbb{N}$ and a rate $\alpha \in (0, 1)$ can be written as
$$G_{\mathrm{LA\text{-}GDAlt}}(x_1^{(t)}, x_2^{(t)}) = \big((1-\alpha)I + \alpha M^k\big) \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Together with Equation 14, we can see that the eigenvalues of $\nabla_x G_{\mathrm{LA\text{-}GDAlt}}$ can be written as $1-\alpha+\alpha\lambda_{\pm i}^k$ with $\lambda_{\pm i} \stackrel{\mathrm{def}}{=} 1 - \frac{\eta^2\sigma_i^2}{2} \pm i\eta\sigma_i\sqrt{1 - \frac{\eta^2\sigma_i^2}{4}} \in \lambda(M)$ for any $\eta \in \big(0, \frac{2}{\sigma_{\max}}\big)$. In the meanwhile, a simple calculation gives us $|\lambda_{\pm i}| = 1$, which implies $\rho(M) = 1$. Now assume $k \in \mathbb{N}$ is such that $k \arccos\big(1 - \frac{\eta^2\sigma_i^2}{2}\big) \bmod 2\pi \neq 0$ for any $\sigma_i$. Then it follows that $\lambda_{\pm i}^k \neq 1$ for any $\lambda_{\pm i} \in \lambda(M)$, from which we obtain $\rho(\nabla_x G_{\mathrm{LA\text{-}GDAlt}}) < 1$ from Lemma 3. It follows from Lemma 15 that the iterates converge to the origin, and we conclude the proof by observing that the transformations $x_1 \to U[x_1; 0_{m-r}] + x_1^*$ and $x_2 \to V[x_2; 0_{n-r}] + x_2^*$ of $(0, 0) \in \mathbb{R}^r \times \mathbb{R}^r$ give $(x_1^*, x_2^*) \in \mathbb{R}^m \times \mathbb{R}^n$, which is a Nash equilibrium of Equation 9.
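Theorem 5 can also be simulated directly on the reduced bilinear dynamics. The sketch below (ours, with arbitrary singular values and hyperparameters satisfying the theorem's condition on $k$) shows the Lookahead iterates contracting to the equilibrium even though $\rho(M) = 1$.

```python
import numpy as np

sigma = np.array([1.0, 0.5])                 # singular values of A (our choice)
eta, k, alpha = 0.4, 7, 0.5                  # k*arccos(1 - eta^2 s^2 / 2) != 0 mod 2*pi
r = len(sigma)
S = np.diag(sigma)

M = np.block([[np.eye(r), -eta * S],
              [eta * S, np.eye(r) - eta ** 2 * S @ S]])
rho_M = np.abs(np.linalg.eigvals(M)).max()   # = 1: base dynamics oscillates

G = (1 - alpha) * np.eye(2 * r) + alpha * np.linalg.matrix_power(M, k)
x = np.ones(2 * r)                           # arbitrary initial point
for _ in range(400):
    x = G @ x
dist = np.linalg.norm(x)                     # distance to the Nash equilibrium
```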

E.6 PROOF OF THEOREM 6

Proof. From Equation 1, the dynamics $F_{\mathrm{GDSim}}$ of Equation 11 can be derived as
$$F_{\mathrm{GDSim}}(x_1^{(t)}, x_2^{(t)}) = \begin{pmatrix} I_r & -\eta\Sigma_r \\ \eta\Sigma_r & I_r \end{pmatrix} \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Let us define $J \stackrel{\mathrm{def}}{=} \begin{pmatrix} 0_r & \Sigma_r \\ -\Sigma_r & 0_r \end{pmatrix}$ and $M \stackrel{\mathrm{def}}{=} I - \eta J$. Then its Lookahead dynamics $G_{\mathrm{LA\text{-}GDSim}}$ with a synchronization period $k \in \mathbb{N}$ and a rate $\alpha \in (0, 1)$ can be written as
$$G_{\mathrm{LA\text{-}GDSim}}(x_1^{(t)}, x_2^{(t)}) = \big((1-\alpha)I + \alpha M^k\big) \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Together with Equation 15, we can see that the eigenvalues of $\nabla_x G_{\mathrm{LA\text{-}GDSim}}$ can be written as $1-\alpha+\alpha\lambda_{\pm i}^k$ with $\lambda_{\pm i} \stackrel{\mathrm{def}}{=} 1 \pm i\eta\sigma_i \in \lambda(M)$. In the meanwhile, one can easily see that $|\lambda_{\pm i}| > 1$, implying $\rho(M) > 1$. Now assume that $k \in \big(\frac{\pi}{2\arctan(\eta\sigma_{\min})}, \frac{3\pi}{2\arctan(\eta\sigma_{\max})}\big)$. Then since $\tan\theta_{\min}(\lambda(M)) = \eta\sigma_{\min}$ and $\tan\theta_{\max}(\lambda(M)) = \eta\sigma_{\max}$, we have $k \in \big(\frac{\pi}{2\theta_{\min}(\lambda(M))}, \frac{3\pi}{2\theta_{\max}(\lambda(M))}\big)$. It follows from Lemma 4 that $\Re(\lambda_{\pm i}^k) < 0$ for any $\lambda_{\pm i} \in \lambda(M)$, and the existence of such $k$ is guaranteed for a small enough $\eta$ when $\frac{\max(S)}{\min(S)} = \frac{\sigma_{\max}}{\sigma_{\min}} < 3$. Then it follows from Lemma 3 that $\rho(\nabla_x G_{\mathrm{LA\text{-}GDSim}}) < 1$ holds for a small enough $\alpha$. Therefore, by Lemma 15, the iterates converge to the origin, and we conclude the proof by observing that the transformations $x_1 \to U[x_1; 0_{m-r}] + x_1^*$ and $x_2 \to V[x_2; 0_{n-r}] + x_2^*$ of $(0, 0) \in \mathbb{R}^r \times \mathbb{R}^r$ give $(x_1^*, x_2^*) \in \mathbb{R}^m \times \mathbb{R}^n$, which is a Nash equilibrium of Equation 9.

E.7 PROOF OF THEOREM 7

Proof. From Equation 4, the dynamics $F_{\mathrm{PPSim}}$ of Equation 11 can be derived as
$$F_{\mathrm{PPSim}}(x_1^{(t)}, x_2^{(t)}) = \begin{pmatrix} I_r & \eta\Sigma_r \\ -\eta\Sigma_r & I_r \end{pmatrix}^{-1} \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Let us define $J \stackrel{\mathrm{def}}{=} \begin{pmatrix} 0_r & \Sigma_r \\ -\Sigma_r & 0_r \end{pmatrix}$ and $M \stackrel{\mathrm{def}}{=} (I + \eta J)^{-1}$. Then its Lookahead dynamics $G_{\mathrm{LA\text{-}PPSim}}$ with a synchronization period $k \in \mathbb{N}$ and a rate $\alpha \in (0, 1)$ can be written as
$$G_{\mathrm{LA\text{-}PPSim}}(x_1^{(t)}, x_2^{(t)}) = \big((1-\alpha)I + \alpha M^k\big) \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Together with Equation 16, we can see that the eigenvalues of $\nabla_x G_{\mathrm{LA\text{-}PPSim}}$ can be written as $1-\alpha+\alpha\lambda_{\pm i}^k$ with $\lambda_{\pm i} \stackrel{\mathrm{def}}{=} \frac{1 \pm i\eta\sigma_i}{1+\eta^2\sigma_i^2} \in \lambda(M)$. In the meanwhile, we can easily see that $|\lambda_{\pm i}| < 1$ holds for any $\eta > 0$. Therefore, $1-\alpha+\alpha\lambda_{\pm i}^k$ is an interpolation between two distinct points $(1, 0)$ and $\lambda_{\pm i}^k$ on/inside $S^1$, implying $\rho(\nabla_x G_{\mathrm{LA\text{-}PPSim}}) < 1$. Hence it follows from Lemma 15 that the iterates converge to the origin. Moreover, the transformations $x_1 \to U[x_1; 0_{m-r}] + x_1^*$ and $x_2 \to V[x_2; 0_{n-r}] + x_2^*$ of $(0, 0) \in \mathbb{R}^r \times \mathbb{R}^r$ give $(x_1^*, x_2^*) \in \mathbb{R}^m \times \mathbb{R}^n$, which is a Nash equilibrium of Equation 9.

Now we show that $G_{\mathrm{LA\text{-}PPSim}}$ can accelerate the convergence upon its base dynamics $F_{\mathrm{PPSim}}$. Assume $k \in \big(\frac{\pi}{2\theta_{\min}(\lambda_{\max}(M^{-1}))}, \frac{3\pi}{2\theta_{\max}(\lambda_{\max}(M^{-1}))}\big)$. Note that $M^{-1}$ shares the same eigenvalues with the Jacobian of $F_{\mathrm{GDSim}}$. Therefore we have $\tan\theta_{\min}(\lambda_{\max}(M^{-1})) = \tan\theta_{\max}(\lambda_{\max}(M^{-1})) = \eta\sigma_{\min}$. Then it follows from Lemma 4 that $\Re(\lambda_{\pm i}^{-k}) < 0$ for any $\lambda_{\pm i}^{-1} \in \lambda_{\max}(M^{-1})$, and the existence of such $k \in \mathbb{N}$ is guaranteed for a small enough $\eta$. Then we have $\Re(\lambda_{\pm i}^k) < 0$ for any $\lambda_{\pm i} \in \lambda_{\max}(M)$, since the reciprocal of a complex number preserves the sign of the real part. Hence it follows from Lemma 3 that $\rho(\nabla_x G_{\mathrm{LA\text{-}PPSim}}) < \rho(M)^k$ holds for a large enough $\alpha$. We conclude the proof by noting that the convergence rate $O(\rho(\nabla_x G_{\mathrm{LA\text{-}PPSim}})^{t/k})$ of $G_{\mathrm{LA\text{-}PPSim}}$ provided by Lemma 15 is faster than the rate $O(\rho(M)^t)$ of $F_{\mathrm{PPSim}}$, assuming the amortization of computations over $k$ forward steps.

E.8 PROOF OF THEOREM 8

Proof. From Equation 5, the dynamics $F_{\mathrm{EGSim}}$ of Equation 11 can be derived as
$$F_{\mathrm{EGSim}}(x_1^{(t)}, x_2^{(t)}) = \begin{pmatrix} I_r - \eta^2\Sigma_r^2 & -\eta\Sigma_r \\ \eta\Sigma_r & I_r - \eta^2\Sigma_r^2 \end{pmatrix} \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}. \qquad (51)$$
Let us define $J \stackrel{\mathrm{def}}{=} \begin{pmatrix} 0_r & \Sigma_r \\ -\Sigma_r & 0_r \end{pmatrix}$ and $M \stackrel{\mathrm{def}}{=} I - \eta J + \eta^2 J^2$. Then its Lookahead dynamics $G_{\mathrm{LA\text{-}EGSim}}$ with a synchronization period $k \in \mathbb{N}$ and a rate $\alpha \in (0, 1)$ can be written as
$$G_{\mathrm{LA\text{-}EGSim}}(x_1^{(t)}, x_2^{(t)}) = \big((1-\alpha)I + \alpha M^k\big) \begin{pmatrix} x_1^{(t)} \\ x_2^{(t)} \end{pmatrix}.$$
Together with Equation 18, we can see that the eigenvalues of $\nabla_x G_{\mathrm{LA\text{-}EGSim}}$ can be written as $1-\alpha+\alpha\lambda_{\pm i}^k$ with $\lambda_{\pm i} \stackrel{\mathrm{def}}{=} 1 - \eta^2\sigma_i^2 \pm i\eta\sigma_i \in \lambda(M)$. In the meanwhile, we can easily see that $|\lambda_{\pm i}| < 1$ for any $\eta \in \big(0, \frac{1}{\sigma_{\max}}\big)$, implying $\rho(M) < 1$. Therefore, $1-\alpha+\alpha\lambda_{\pm i}^k$ is an interpolation between two distinct points $(1, 0)$ and $\lambda_{\pm i}^k$ on/inside $S^1$, implying $\rho(\nabla_x G_{\mathrm{LA\text{-}EGSim}}) < 1$. Hence it follows from Lemma 15 that the iterates converge to the origin. Moreover, the transformations $x_1 \to U[x_1; 0_{m-r}] + x_1^*$ and $x_2 \to V[x_2; 0_{n-r}] + x_2^*$ of $(0, 0) \in \mathbb{R}^r \times \mathbb{R}^r$ give $(x_1^*, x_2^*) \in \mathbb{R}^m \times \mathbb{R}^n$, which is a Nash equilibrium of Equation 9.

Now we show that $G_{\mathrm{LA\text{-}EGSim}}$ can accelerate the convergence upon its base dynamics $F_{\mathrm{EGSim}}$. Assume $k \in \Big(\frac{\pi}{2\arctan\frac{\eta\sigma_{\min}}{1-\eta^2\sigma_{\min}^2}}, \frac{3\pi}{2\arctan\frac{\eta\sigma_{\min}}{1-\eta^2\sigma_{\min}^2}}\Big)$ and $\eta \in \big(0, \frac{1}{2\sigma_{\max}}\big)$. Note that $|\lambda_{\pm i}|^2 = \big(\eta^2\sigma_i^2 - \frac{1}{2}\big)^2 + \frac{3}{4}$ holds for each $\lambda_{\pm i} \in \lambda(M)$. This implies $\lambda_{\max}(M) = \{1 - \eta^2\sigma_{\min}^2 \pm i\eta\sigma_{\min}\}$ for any $\eta \in \big(0, \frac{1}{2\sigma_{\max}}\big)$, hence $k \in \big(\frac{\pi}{2\theta_{\min}(\lambda_{\max}(M))}, \frac{3\pi}{2\theta_{\max}(\lambda_{\max}(M))}\big)$. It follows from Lemma 4 that $\Re(\lambda_{\pm i}^k) < 0$ holds for any $\lambda_{\pm i} \in \lambda_{\max}(M)$, and the existence of such $k$ is guaranteed for a small enough $\eta$. Then by Lemma 3 we have $\rho(\nabla_x G_{\mathrm{LA\text{-}EGSim}}) < \rho(M)^k$ for a large enough $\alpha$. We conclude the proof by noting that the convergence rate $O(\rho(\nabla_x G_{\mathrm{LA\text{-}EGSim}})^{t/k})$ of $G_{\mathrm{LA\text{-}EGSim}}$ provided by Lemma 15 is faster than the rate $O(\rho(M)^t)$ of $F_{\mathrm{EGSim}}$, assuming the amortization of computations over $k$ forward steps.

E.9 PROOF OF THEOREM 9

Proof. From Equation 7, the Jacobian of $G$ evaluated at $x^*$ can be written as
$$\nabla_x G(x^*) = \nabla_x\big((1-\alpha)\,\mathrm{id} + \alpha F^k\big)(x^*) = (1-\alpha)I + \alpha\nabla_x F^k(x^*) \qquad (52)$$
$$= (1-\alpha)I + \alpha\prod_{i=1}^k \nabla_x F\big(F^{i-1}(x^*)\big) = (1-\alpha)I + \alpha\big(\nabla_x F(x^*)\big)^k, \qquad (53)$$

We fix $\eta = 0.05$ throughout the experiments, and use Theorem 8 to derive the theoretically recommended (+) hyperparameters $k = 300$, $\alpha = 0.9$ for the LA-EG Sim dynamics. We use $k = 50$, $\alpha = 0.1$ to represent hyperparameters of LA-EG Sim chosen against (-) the theorem. For LA-GD Alt, we use $k = 300$, $\alpha = 0.1$. We use the momentum factor $\beta = -0.1$ for negative (NM) and $\beta = 0.1$ for positive (PM) momentum methods. Figure 6(a) shows that Theorems 5 and 8 indeed hold even for an ill-conditioned game. Furthermore, Figure 6(b) suggests that Lookahead can significantly accelerate the convergence of momentum methods that provably perform well on bilinear games, including gradient descent with negative momentum (Gidel et al., 2019b) and extragradient with momentum (Azizian et al., 2020).
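Complementing the simulations of Figure 6, Theorem 8's acceleration claim can also be verified on the eigenvalues alone. Below is our sketch with arbitrary singular values, $\eta < 1/(2\sigma_{\max})$, $k$ inside the recommended range, and a large $\alpha$.

```python
import numpy as np

sigma = np.array([1.0, 0.8])
eta, k, alpha = 0.2, 15, 0.95                         # eta < 1 / (2 * sigma_max)
lam = 1 - eta ** 2 * sigma ** 2 + 1j * eta * sigma    # EG Sim eigenvalues (one per block)
g = 1 - alpha + alpha * lam ** k                      # Lookahead eigenvalues

base_rate = np.abs(lam).max()            # rho(M), attained at sigma_min
la_rate = np.abs(g).max() ** (1.0 / k)   # Lookahead rate amortized over k steps
```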



Actually, M does not have to be diagonalizable; see Theorem 5.4 and Theorem 5.D4 in Chen (1995).



Figure 1: The spectral contraction effect of Lookahead dynamics in Equation 8. $\lambda_\pm$ are the eigenvalues of each base dynamics, and $1-\alpha+\alpha\lambda_\pm^k$ are the eigenvalues of the associated Lookahead dynamics. The $k$ forward steps of a Lookahead procedure first rotate the eigenvalues $\lambda_\pm$ of the dynamics' Jacobian matrix. Then, a synchronization backward step pulls them into a circle with a radius smaller than their maximal modulus. This results in a reduced spectral radius of the Jacobian matrix, which improves stability and convergence to an equilibrium.


(a) Convergence to the NE of the bilinear game. (b) Convergence to (0, 0) in the nonlinear game.

Figure 2: Optimization progress of multiple first-order methods with hyperparameters chosen by (+) and against our theorems (-) in the bilinear and nonlinear games.

(a) Trajectories of each Lookahead dynamics with k and α predicted by Theorems 10 and 11. (b) Trajectories of each Lookahead dynamics with k and α chosen against Theorems 10 and 11.

Figure 3: Visualized trajectories of each dynamics around an equilibrium (0, 0) of the nonlinear game.

Figure 2(b) and Figure 3(a) show that the hyperparameters predicted by our theorems, denoted by LA-GD Alt/Sim+ and LA-EG Sim+, indeed stabilize and accelerate the convergence to the equilibrium. In contrast, the hyperparameters chosen against our theorems, denoted by LA-GD Alt/Sim- and LA-EG Sim- in Figure 2(b) and Figure 3(b), fail to either stabilize or accelerate the convergence to the equilibrium.

Figure 4: Visualized eigenvalues of $(1-\alpha)I + \alpha X^k$.




Figure 5: Visualized top 20 eigenvalues of ∇ x F GDSim before (blue) and after (orange) training GANs with two different loss functions on MNIST.

Figure 6: Optimization progress of first-order methods in an ill-conditioned bilinear game. (a) Comparison between convergence of Lookahead dynamics with hyperparameters chosen by (+) and against (-) Theorems 5 and 8. (b) Comparison between convergence of Lookahead dynamics with positive (PM) and negative (NM) momentum.

Table 1: List of mathematical notations used in the paper.

Table D.2: Hyperparameters used for the experiment on the bilinear game.

DERIVATION OF THEORETICALLY RECOMMENDED RANGE OF k IN EQUATION

Hence $\Re(\lambda_i^k) < 0$ for any $\lambda_i \in S$ such that $\Im(\lambda_i) > 0$. Since every element of $S$ has its conjugate pair in $S$ by the assumption, we conclude $\Re(\lambda_i^k) < 0$ for any $\lambda_i \in S$.



where the chain rule is used in the third equality with a slight abuse of notation $F^0 \stackrel{\mathrm{def}}{=} \mathrm{id}$, and the fact that $x^*$ is an equilibrium of the dynamics $F$ is used in the last equality. It is easy to see from Equation 53 that the eigenvalues of $\nabla_x G(x^*)$ can be written as $1-\alpha+\alpha\lambda_i^k$ for each $\lambda_i \in \lambda(\nabla_x F(x^*))$. However, $\lambda_i^k$ is on or inside $S^1$ since $|\lambda_i| \leq 1$ for each $i$ due to the Lyapunov stability of $x^*$ in $F$. Therefore, $1-\alpha+\alpha\lambda_i^k$ is an interpolation between the point $(1, 0) \in S^1$ and a point $\lambda_i^k$ on or inside $S^1$; hence $|1-\alpha+\alpha\lambda_i^k| \leq 1$. By the assumption that $\lambda_i^k \neq (1, 0)$ for each $\lambda_i \in \lambda(\nabla_x F(x^*))$, the inequality is strict, i.e. $|1-\alpha+\alpha\lambda_i^k| < 1$, implying the local asymptotic stability of $x^*$ in $G$ by Proposition 14.
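The eigenvalue identity used here, namely that the eigenvalues of the Lookahead Jacobian $(1-\alpha)I + \alpha J^k$ are $1-\alpha+\alpha\lambda^k$ for each eigenvalue $\lambda$ of the base Jacobian $J$, is easy to confirm numerically. In this sketch (ours), a matrix with known real eigenvalues stands in for $\nabla_x F(x^*)$.

```python
import numpy as np

eigs = np.array([0.9, 0.5, -0.4, 0.2])      # chosen eigenvalues of grad F at x*
rng = np.random.default_rng(1)
P = rng.standard_normal((4, 4))             # generic (invertible) change of basis
J = P @ np.diag(eigs) @ np.linalg.inv(P)    # stand-in for grad F(x*)
k, alpha = 6, 0.4

JG = (1 - alpha) * np.eye(4) + alpha * np.linalg.matrix_power(J, k)
lhs = np.sort(np.linalg.eigvals(JG).real)   # eigenvalues of the Lookahead Jacobian
rhs = np.sort(1 - alpha + alpha * eigs ** k)  # predicted by the identity
err = np.abs(lhs - rhs).max()
```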

E.10 PROOF OF THEOREM 10

Proof. From Equation 7, the Jacobian of $G$ evaluated at $x^*$ can be written as
$$\nabla_x G(x^*) = \nabla_x\big((1-\alpha)\,\mathrm{id} + \alpha F^k\big)(x^*) = (1-\alpha)I + \alpha\nabla_x F^k(x^*) = (1-\alpha)I + \alpha\prod_{i=1}^k \nabla_x F\big(F^{i-1}(x^*)\big) = (1-\alpha)I + \alpha\big(\nabla_x F(x^*)\big)^k, \qquad (55)$$
where the chain rule is used in the third equality with a slight abuse of notation $F^0 \stackrel{\mathrm{def}}{=} \mathrm{id}$, and the fact that $x^*$ is an equilibrium of the dynamics $F$ is used in the last equality. It is easy to see from Equation 55 that the eigenvalues of $\nabla_x G(x^*)$ can be written as $1-\alpha+\alpha\lambda_i^k$ for each $\lambda_i \in \lambda(\nabla_x F(x^*))$. Now assume that every element of $\lambda_{\geq 1}(\nabla_x F(x^*))$ has a non-zero imaginary part, and let $k \in \big(\frac{\pi}{2\theta_{\min}(\lambda_{\geq 1}(\nabla_x F(x^*)))}, \frac{3\pi}{2\theta_{\max}(\lambda_{\geq 1}(\nabla_x F(x^*)))}\big)$. Then by Lemma 4, $\Re(\lambda_i^k) < 0$ holds for any $\lambda_i \in \lambda_{\geq 1}(\nabla_x F(x^*))$, and the existence of such $k \in \mathbb{N}$ is guaranteed for a small enough $\eta$ when $\frac{\max(\lambda_{\geq 1}(\nabla_x F(x^*)))}{\min(\lambda_{\geq 1}(\nabla_x F(x^*)))} < 3$. Then it follows from the second case of Lemma 3 that $\rho(\nabla_x G(x^*)) < 1$ holds for a small enough $\alpha$. By Proposition 14, this implies the local asymptotic stability of $x^*$ in $G$, concluding the proof.

E.11 PROOF OF THEOREM 11

Proof. From Equation 7, the Jacobian of $G$ evaluated at $x^*$ can be written as
$$\nabla_x G(x^*) = \nabla_x\big((1-\alpha)\,\mathrm{id} + \alpha F^k\big)(x^*) = (1-\alpha)I + \alpha\nabla_x F^k(x^*) = (1-\alpha)I + \alpha\prod_{i=1}^k \nabla_x F\big(F^{i-1}(x^*)\big) = (1-\alpha)I + \alpha\big(\nabla_x F(x^*)\big)^k, \qquad (57)$$
where the chain rule is used in the third equality with a slight abuse of notation $F^0 \stackrel{\mathrm{def}}{=} \mathrm{id}$, and the fact that $x^*$ is an equilibrium of the dynamics $F$ is used in the last equality. It is easy to see from Equation 57 that the eigenvalues of $\nabla_x G(x^*)$ can be written as $1-\alpha+\alpha\lambda_i^k$ for each $\lambda_i \in \lambda(\nabla_x F(x^*))$. Now assume that every element of $\lambda_{\max}(\nabla_x F(x^*))$ has a non-zero imaginary part, and let $k \in \big(\frac{\pi}{2\theta_{\min}(\lambda_{\max}(\nabla_x F(x^*)))}, \frac{3\pi}{2\theta_{\max}(\lambda_{\max}(\nabla_x F(x^*)))}\big)$. Then by Lemma 4, $\Re(\lambda_i^k) < 0$ holds for any $\lambda_i \in \lambda_{\max}(\nabla_x F(x^*))$, and the existence of such $k \in \mathbb{N}$ is guaranteed for a small enough $\eta$ when $\frac{\max(\lambda_{\max}(\nabla_x F(x^*)))}{\min(\lambda_{\max}(\nabla_x F(x^*)))} < 3$. Then it follows from the third case of Lemma 3 that $\rho(\nabla_x G(x^*)) < \rho(\nabla_x F(x^*))^k$ holds for a large enough $\alpha$. We conclude the proof by noting that this implies that the upper bound $O(\rho(\nabla_x G(x^*))^{t/k})$ on the rate of local convergence provided by Proposition 14 is faster than $O(\rho(\nabla_x F(x^*))^t)$.

E.12 PROOF OF PROPOSITION 12

Proof. We directly follow the proofs of Lemma 2.1 and Lemma 3.1 in Daskalakis & Panageas (2018) and show that $\alpha \in \big(0, \frac{1}{1+L^k}\big)$ guarantees a locally diffeomorphic Lookahead dynamics, i.e., one that is locally invertible at any given point. Note from Equation 7 that the Jacobian of $G$ evaluated at $x$ can be written as
$$\nabla_x G(x) = (1-\alpha)I + \alpha\nabla_x F^k(x) = (1-\alpha)I + \alpha\prod_{i=1}^k \nabla_x F\big(F^{i-1}(x)\big),$$
where we have used the chain rule in the last equality with a slight abuse of notation $F^0 \stackrel{\mathrm{def}}{=} \mathrm{id}$. Now assume that $\alpha \in \big(0, \frac{1}{1+L^k}\big)$ and consider the following inequalities:
$$\rho\big(\alpha\nabla_x F^k(x)\big) \leq \alpha\big\|\nabla_x F^k(x)\big\| \leq \alpha\prod_{i=1}^k \big\|\nabla_x F\big(F^{i-1}(x)\big)\big\| \leq \alpha L^k,$$
where the first and second inequalities hold for any operator norm and the last inequality is due to the $L$-Lipschitzness of $F$. Then it follows from the assumption that $\rho(\alpha\nabla_x F^k(x)) \leq \alpha L^k < 1-\alpha$. Therefore, we conclude that $G$ is locally diffeomorphic, since $\rho(\alpha\nabla_x F^k(x)) < 1-\alpha$ implies that $\nabla_x G(x) = (1-\alpha)I + \alpha\nabla_x F^k(x)$ is invertible at any given $x$.

Now let us define the set of unstable equilibria of $G$ as
$$U \stackrel{\mathrm{def}}{=} \big\{x^* : G(x^*) = x^*,\ \rho(\nabla_x G(x^*)) > 1\big\}.$$

Then it directly follows from the locally diffeomorphic $G$ and the arguments of Lee et al. (2019) and Daskalakis & Panageas (2018) that the set $\{x^{(0)} : \lim_{t\to\infty} G^t(x^{(0)}) \in U\}$ is of measure zero, which concludes the proof. We refer the readers to Appendix A of Daskalakis et al. (2018) for the detailed derivation of the measure-zero arguments.
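The invertibility mechanism behind Proposition 12 can be illustrated with a worst-case matrix of spectral norm $L^k$ (our construction, with arbitrary $L$ and $k$): for $\alpha < 1/(1+L^k)$, every eigenvalue of $(1-\alpha)I + \alpha\nabla_x F^k$ stays bounded away from zero.

```python
import numpy as np

L, k = 1.5, 3
alpha = 0.9 / (1 + L ** k)               # strictly inside (0, 1 / (1 + L**k))

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
A *= L ** k / np.linalg.norm(A, 2)       # worst case: ||grad F^k|| = L**k
JG = (1 - alpha) * np.eye(3) + alpha * A
smallest = np.abs(np.linalg.eigvals(JG)).min()   # >= (1 - alpha) - alpha * L**k
```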

E.13 PROOF OF PROPOSITION 13

Proof. From Equation 7, the Jacobian of $G$ evaluated at $x^*$ can be written as
$$\nabla_x G(x^*) = (1-\alpha)I + \alpha\big(\nabla_x F(x^*)\big)^k, \qquad (63)$$
where the chain rule is used with a slight abuse of notation $F^0 \stackrel{\mathrm{def}}{=} \mathrm{id}$, together with the fact that $x^*$ is an equilibrium of the dynamics $F$. It is easy to see from Equation 63 that the eigenvalues of $\nabla_x G(x^*)$ can be written as $1-\alpha+\alpha\lambda_i^k$ for each $\lambda_i \in \lambda(\nabla_x F(x^*))$. However, by the assumption, there exists a $\lambda \in \lambda(\nabla_x F(x^*))$ such that $|\lambda| > 1$. Since $\lambda$ is a positive real number, we have $\lambda^k > 1$ and hence $|1-\alpha+\alpha\lambda^k| > 1$, concluding the proof.
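Proposition 13's obstruction is elementary to see numerically: for a real eigenvalue $\lambda > 1$ (toy value below, ours), $1-\alpha+\alpha\lambda^k > 1$ for every $\alpha \in (0, 1)$ and $k \in \mathbb{N}$, so no choice of Lookahead hyperparameters contracts it.

```python
lam = 1.1                                 # a real eigenvalue with lam > 1
vals = [1 - a + a * lam ** k
        for a in (0.1, 0.5, 0.9)
        for k in (1, 5, 20)]
closest = min(vals)                       # still strictly greater than 1
```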

F ADDITIONAL EXPERIMENTS

F.1 EIGENVALUES OF GAN DYNAMICS

Theorems 10-11 assume the radius-supporting eigenvalues, namely $\lambda_{\geq 1}(\nabla_x F)$ and $\lambda_{\max}(\nabla_x F)$, to have non-zero imaginary parts and an imaginary conditioning less than 3; otherwise, a $k$ that satisfies the sufficient conditions of Theorems 10-11 may not exist. We verify whether such assumptions are realistic in practical settings. Specifically, we train GANs on the MNIST dataset with two different loss functions, non-saturating (Goodfellow et al., 2014) and WGAN-GP (Gulrajani et al., 2017), and visualize the top 20 eigenvalues of $\nabla_x F_{\mathrm{GDSim}}$ for each loss function in Figure 5. Figure 5 suggests that most of the radius-supporting eigenvalues of $\nabla_x F_{\mathrm{GDSim}}$ at a well-performing point (Inception Score (IS) (Salimans et al., 2016) of 9) are distributed along the imaginary axis, and have non-zero imaginary parts with an imaginary conditioning less than 3. This suggests that our assumptions on the eigenvalues are not unrealistic and that Theorems 10-11 can be applied to a practical non-linear game like GANs.

