CAN WE FIND NASH EQUILIBRIA AT A LINEAR RATE IN MARKOV GAMES?

Abstract

We study decentralized learning in two-player zero-sum discounted Markov games, where the goal is to design a policy optimization algorithm for either agent satisfying two properties. First, the player does not need to know the policy of the opponent to update its policy. Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game. To this end, we construct a meta algorithm, dubbed Homotopy-PO, which provably finds a Nash equilibrium at a global linear rate. In particular, Homotopy-PO interweaves two base algorithms, Local-Fast and Global-Slow, via homotopy continuation. Local-Fast is an algorithm that enjoys local linear convergence, while Global-Slow is an algorithm that converges globally but at a slower sublinear rate. By switching between these two base algorithms, Global-Slow essentially serves as a "guide" which identifies a benign neighborhood where Local-Fast enjoys fast convergence. However, since the exact size of such a neighborhood is unknown, we apply a doubling trick to switch between these two base algorithms. The switching scheme is delicately designed so that the aggregated performance of the algorithm is driven by Local-Fast. Furthermore, we prove that Local-Fast and Global-Slow can both be instantiated by variants of the optimistic gradient descent/ascent (OGDA) method, which is of independent interest.

Our goal is to design efficient policy optimization methods to find Nash equilibria in zero-sum Markov games. This task is usually formulated as a nonconvex-nonconcave minimax optimization problem. Prior work shows that Nash equilibria in matrix games, a special class of zero-sum Markov games with convex-concave structure, can be found at a linear rate (Gilpin et al., 2012; Wei et al., 2020). However, due to the nonconvexity-nonconcavity, theoretical understanding of zero-sum Markov games is sparser. Existing methods have either sublinear rates for finding Nash equilibria, or linear rates for finding regularized Nash equilibria, such as quantal response equilibria, which are approximations of Nash equilibria (Alacaoglu et al.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL), which studies how a group of agents interact with each other and make decisions in a shared environment (Zhang et al., 2021a), has received much attention in recent years due to its wide applications in games (Lanctot et al., 2019; Silver et al., 2017; Vinyals et al., 2019), robust reinforcement learning (Pinto et al., 2017; Tessler et al., 2019; Zhang et al., 2021b), and robotics (Shalev-Shwartz et al., 2016; Matignon et al., 2012), among many others. Problems in MARL are frequently formulated as Markov games (Littman, 1994; Shapley, 1953). In this paper, we focus on one important class of Markov games: two-player zero-sum Markov games. In such a game, the two players compete against each other in an environment where the state transition and reward depend on both players' actions. Furthermore, in Markov games it is desirable to design decentralized algorithms; that is, when a player updates its policy, it does not need to know the policies of the other agents, as such information is usually unavailable, especially when the game is competitive in nature. Other desiderata in MARL include symmetric updates and rationality. Symmetry means that the algorithm employed by each player is the same, and their updates differ only through the different local information possessed by each player. Rationality means that if the other players adopt stationary policies, the algorithm converges to the best-response policy (Sayin et al., 2021; Wei et al., 2021); in other words, the algorithm finds the optimal policy of the player. This motivates the following question:

Q1: Can we find Nash equilibria of two-player zero-sum Markov games at a global linear rate?

In decentralized learning, each player observes dynamic local information due to the changes in the other players' policies, which makes it more challenging to design efficient algorithms (Daskalakis et al., 2020; Hernandez-Leal et al., 2017; Sayin et al., 2021). Symmetric updates also pose challenges for convergence: Condon (1990) shows that multiple variants of value iteration with symmetric updates can cycle and fail to find NEs, and gradient descent/ascent (GDA) with symmetric updates can cycle even in matrix games (Daskalakis et al., 2018; Mertikopoulos et al., 2018). Thus, an even more challenging question is:

Q2: Can we further answer Q1 with a decentralized algorithm that is symmetric and rational?

In this paper, we give the first affirmative answers to Q1 and Q2. Specifically, we propose a meta algorithm Homotopy-PO which provably converges to a Nash equilibrium (NE) with two base algorithms, Local-Fast and Global-Slow. Homotopy-PO is a homotopy continuation style algorithm that switches between Local-Fast and Global-Slow, where Global-Slow behaves as a "guide" which identifies a benign neighborhood in which Local-Fast enjoys linear convergence. A novel switching scheme is designed to achieve global linear convergence without knowing the size of such a neighborhood. Next, we propose the averaging independent optimistic gradient descent/ascent (Averaging OGDA) method and the independent optimistic policy gradient descent/ascent (OGDA) method. We then instantiate Homotopy-PO by proving that Averaging OGDA and OGDA satisfy the conditions of Global-Slow and Local-Fast, respectively. This yields the first algorithm which provably finds Nash equilibria in zero-sum Markov games at a global linear rate. In addition, Homotopy-PO is decentralized, symmetric, rational and last-iterate convergent.

Our contribution.

Our contribution is two-fold. First, we propose a meta algorithm, Homotopy-PO, which is shown to converge to Nash equilibria of two-player zero-sum Markov games at a global linear rate whenever its two base algorithms satisfy certain benign properties. Moreover, Homotopy-PO is a decentralized algorithm and enjoys additional desiderata in MARL, including symmetric updates, rationality and last-iterate convergence. Second, we instantiate Homotopy-PO by designing two base algorithms based on variants of the GDA method, which are proved to satisfy the conditions required by Homotopy-PO. In particular, we prove that the example base algorithm OGDA enjoys local linear convergence to Nash equilibria, which may be of independent interest.

1.1. RELATED WORK

A more comprehensive literature review is deferred to Appendix A due to space limitations. Of particular relevance are the two decentralized algorithms of Daskalakis et al. (2020) and Wei et al. (2021). Daskalakis et al. (2020) consider an independent policy gradient descent/ascent algorithm which is a natural extension of single-agent policy gradient descent to two-player zero-sum Markov games. They utilize the two-sided gradient dominance condition to prove a sublinear convergence rate for the gradient descent/ascent (GDA) method. This is the first non-asymptotic convergence result of GDA for finding Nash equilibria in Markov games. However, their method is asymmetric, in that one player takes much smaller steps than its opponent, and their convergence results are based on average policies with no explicit guarantee of last-iterate convergence. Wei et al. (2021) propose an actor-critic optimistic policy gradient descent/ascent algorithm that is simultaneously decentralized, symmetric, rational and has an O(1/√t) last-iterate convergence rate to the Nash equilibrium set. They use a critic which averages the approximate value functions from past iterations to tame the nonstationarity in the approximate Q-functions and obtain better approximations of the policy gradients. The critic uses the classical averaging stepsizes of Jin et al. (2018) so that errors accumulate slowly and last-iterate convergence is obtained. However, using this choice of stepsizes throughout the game also destroys linear convergence, since relatively large errors from past iterations cannot decay fast enough and harm the accuracy of more recent steps.

2. NOTATIONS AND PRELIMINARIES

For integers $n \le n'$, we denote $[n:n'] = \{n, n+1, \cdots, n'\}$ and $[n] = \{1, \cdots, n\}$. We use $\|\cdot\|$ to denote the Euclidean norm, and $\|\cdot\|_p$ denotes the $\ell_p$-norm. For any vector $x \in \mathbb{R}^d$ and closed convex set $\mathcal{C} \subseteq \mathbb{R}^d$, let $P_{\mathcal{C}}(x)$ denote the unique projection of $x$ onto $\mathcal{C}$; the distance between $x$ and $\mathcal{C}$ is $\mathrm{dist}(x, \mathcal{C}) = \|x - P_{\mathcal{C}}(x)\|$.

Markov game. A two-player zero-sum discounted Markov game is denoted by a tuple $\mathrm{MG} = (\mathcal{S}, \mathcal{A}, \mathcal{B}, \mathbb{P}, R, \gamma)$, where $\mathcal{S} = [S]$ is the state space; $\mathcal{A} = [A]$ and $\mathcal{B} = [B]$ are the action spaces of the min-player and the max-player, respectively; $\mathbb{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{B} \to \Delta_{\mathcal{S}}$ is the transition kernel; $R = \{R_s\}_{s \in \mathcal{S}} \subseteq [0,1]^{A \times B}$ is the reward function; and $\gamma$ is the discount factor. Specifically, when the min-player takes action $a$ and the max-player takes action $b$ at state $s$, $\mathbb{P}(s' \mid s, a, b)$ is the probability that the next state is $s'$, $R_s(a, b)$ is the reward received by the max-player, and the min-player receives the loss $-R_s(a, b)$. We assume without loss of generality that the rewards are bounded in $[0, 1]$. Let $x = \{x_s\}_{s \in \mathcal{S}}$ and $y = \{y_s\}_{s \in \mathcal{S}}$ denote the policies of the min-player and the max-player, where $x_s \in \Delta_{\mathcal{A}}$ and $y_s \in \Delta_{\mathcal{B}}$. The policy spaces of the min-player and the max-player are denoted by $\mathcal{X} = (\Delta_{\mathcal{A}})^{\mathcal{S}}$ and $\mathcal{Y} = (\Delta_{\mathcal{B}})^{\mathcal{S}}$, and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ denotes the product policy space. The policy $x \in \mathcal{X}$ ($y \in \mathcal{Y}$) is treated as an $AS$-dimensional ($BS$-dimensional) vector, and the policy pair $z = (x, y)$ is treated as an $(A+B)S$-dimensional vector, where $z_s = (x_s, y_s)$ is the $(A+B)$-dimensional vector obtained by concatenating $x_s$ and $y_s$. The value function under the policy pair $(x, y)$ is the $S$-dimensional vector whose entries are the expected cumulative rewards $V^{x,y}(s) = \mathbb{E}_{x,y}\big[\sum_{t=0}^{+\infty} \gamma^t R_{s_t}(a_t, b_t) \mid s_0 = s\big]$. Define $V^{x,\dagger}$ ($V^{\dagger,y}$) as the value function of $x$ ($y$) against its best response, i.e., $V^{x,\dagger}(s) = \max_{y' \in \mathcal{Y}} V^{x,y'}(s)$ and $V^{\dagger,y}(s) = \min_{x' \in \mathcal{X}} V^{x',y}(s)$. For state $s \in \mathcal{S}$, define the Bellman target operator $\mathcal{Q}_s: \mathbb{R}^S \to \mathbb{R}^{A \times B}$ such that for any vector $v \in \mathbb{R}^S$, $\mathcal{Q}_s[v](a, b) = R_s(a, b) + \gamma \sum_{s' \in \mathcal{S}} \mathbb{P}(s' \mid s, a, b)\, v(s')$. The Q-function $Q^{x,y} = \{Q^{x,y}_s\}_{s \in \mathcal{S}}$ is the collection of $A$-by-$B$ matrices with $Q^{x,y}_s = \mathcal{Q}_s[V^{x,y}]$. The (state) visitation distribution is $d^{x,y}_s(s') = (1-\gamma)\sum_{t=0}^{+\infty} \gamma^t \Pr_{x,y}[s_t = s' \mid s_0 = s]$. For any distribution $\rho \in \Delta_{\mathcal{S}}$, we abbreviate $V^{x,y}(\rho) = \sum_{s \in \mathcal{S}} \rho(s) V^{x,y}(s)$ and $d^{x,y}_{\rho}(s) = \sum_{s' \in \mathcal{S}} \rho(s') d^{x,y}_{s'}(s)$.

Nash equilibrium. The minimax game value of state $s$ is defined as $v^*(s) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} V^{x,y}(s) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} V^{x,y}(s)$. A policy pair $(x, y)$ is called a Nash equilibrium (NE) if and only if $V^{x,\dagger}(s) = V^{\dagger,y}(s) = v^*(s)$ for every $s \in \mathcal{S}$. Define the minimax Q-functions as $Q^*_s = \mathcal{Q}_s[v^*]$, and define the sets
$$\mathcal{X}^*_s = \arg\min_{x'_s \in \Delta_{\mathcal{A}}} \max_{y'_s \in \Delta_{\mathcal{B}}} \langle x'_s, Q^*_s y'_s \rangle, \qquad \mathcal{Y}^*_s = \arg\max_{y'_s \in \Delta_{\mathcal{B}}} \min_{x'_s \in \Delta_{\mathcal{A}}} \langle x'_s, Q^*_s y'_s \rangle.$$
Then $\mathcal{X}^*_s$ and $\mathcal{Y}^*_s$ are non-empty, closed and convex. Denote $\mathcal{Z}^*_s = \mathcal{X}^*_s \times \mathcal{Y}^*_s$, and let $\mathcal{X}^* = \prod_{s \in \mathcal{S}} \mathcal{X}^*_s$, $\mathcal{Y}^* = \prod_{s \in \mathcal{S}} \mathcal{Y}^*_s$, $\mathcal{Z}^* = \prod_{s \in \mathcal{S}} \mathcal{Z}^*_s$. From Gilpin et al. (2012), there is a problem-dependent constant $c_+ > 0$ such that for any policy pair $z = (x, y) \in \mathcal{Z}$ and $s \in \mathcal{S}$,
$$\max_{y'_s \in \Delta_{\mathcal{B}}} x_s^\top Q^*_s y'_s - \min_{x'_s \in \Delta_{\mathcal{A}}} {x'_s}^\top Q^*_s y_s \ge c_+ \cdot \mathrm{dist}(z_s, \mathcal{Z}^*_s). \tag{1}$$
A policy pair $(x^*, y^*)$ attains a Nash equilibrium if and only if $(x^*, y^*) \in \mathcal{Z}^*$, i.e., $(x^*_s, y^*_s) \in \mathcal{Z}^*_s$ for every $s \in \mathcal{S}$ (Başar & Olsder, 1998; Filar & Vrieze, 2012). We denote the closed neighborhood of the NE set by $\mathcal{B}(\mathcal{Z}^*, c) = \{z \in \mathcal{Z} : \mathrm{dist}(z, \mathcal{Z}^*) \le c\}$.

Interaction protocol. In each iteration, each player plays a policy and observes the marginal reward function and the marginal transition kernel. That is, in iteration $t$, the min-player plays $x^t \in \mathcal{X}$ while the max-player plays $y^t \in \mathcal{Y}$. The min-player receives the marginal reward function $r^t_x: \mathcal{S} \times \mathcal{A} \to [0, 1]$ with $r^t_x(s, a) = \sum_{b \in \mathcal{B}} y^t_s(b) R_s(a, b)$ and the marginal transition kernel $\mathbb{P}^t_x: \mathcal{S} \times \mathcal{A} \to \Delta_{\mathcal{S}}$ with $\mathbb{P}^t_x(s' \mid s, a) = \sum_{b \in \mathcal{B}} y^t_s(b) \mathbb{P}(s' \mid s, a, b)$, while the max-player receives $r^t_y$ and $\mathbb{P}^t_y$, which are defined analogously. Each player is oblivious to its opponent's policy.
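To make these definitions and the interaction protocol concrete, the following is a minimal sketch (Python with NumPy) for a small tabular game: it builds the marginal reward $r^t_x$ and marginal transition $\mathbb{P}^t_x$ observed by the min-player, evaluates $V^{x,y}$ by solving the Bellman equation as a linear system, and forms $Q^{x,y}$. The random game instance, array shapes and variable names are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, B, gamma = 4, 3, 2, 0.9

# Illustrative tabular game: R[s, a, b] in [0, 1]; P[s, a, b, :] is a distribution over next states.
R = rng.uniform(size=(S, A, B))
P = rng.uniform(size=(S, A, B, S))
P /= P.sum(axis=-1, keepdims=True)

# Product policies: x[s] in Delta_A (min-player), y[s] in Delta_B (max-player).
x = rng.dirichlet(np.ones(A), size=S)
y = rng.dirichlet(np.ones(B), size=S)

# Marginal reward and transition observed by the min-player (the opponent's policy is marginalized out).
r_x = np.einsum('sab,sb->sa', R, y)        # r_x[s, a] = sum_b y[s, b] * R[s, a, b]
P_x = np.einsum('sabt,sb->sat', P, y)      # P_x[s, a, s'] = sum_b y[s, b] * P[s, a, b, s']

# Value function V^{x,y}: marginalize both policies, then solve (I - gamma * P_xy) V = r_xy.
r_xy = np.einsum('sa,sa->s', r_x, x)
P_xy = np.einsum('sat,sa->st', P_x, x)
V = np.linalg.solve(np.eye(S) - gamma * P_xy, r_xy)

# Q-function: Q^{x,y}_s(a, b) = R_s(a, b) + gamma * sum_{s'} P(s'|s, a, b) * V(s').
Q = R + gamma * np.einsum('sabt,t->sab', P, V)
print("V^{x,y} =", V)
```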

3. A HOMOTOPY CONTINUATION ALGORITHM WITH GLOBAL LINEAR CONVERGENCE

We propose a decentralized algorithm with global linear convergence by (1) proposing a meta algorithm which achieves global linear convergence given two suitable base algorithms, and (2) providing examples of the base algorithms. The analysis of the example base algorithms is in Section 4 and Section 5.

3.1. A HOMOTOPY CONTINUATION META ALGORITHM

We present a homotopy continuation meta algorithm. It achieves global linear convergence by switching between two base algorithms: a Global-Slow base algorithm (Global-Slow) and a Local-Fast base algorithm (Local-Fast). Global-Slow is globally convergent, but only attains an $O(1/T)$ rate. Local-Fast is not necessarily globally convergent but attains a linear convergence rate in a neighborhood of the Nash equilibrium set.

Global-Slow base algorithm: by calling Global-Slow($[T_1 : T_2]$, $\tilde z$, $\eta'$) during the time interval $[T_1 : T_2]$, where $\tilde z = (\tilde x, \tilde y)$ is the initial policy pair, the players play a policy pair $z^t = (x^t, y^t)$ in each iteration $t \in [T_1 : T_2]$ and compute an average policy pair $\bar z^{[T_1:T_2]} = (\bar x^{[T_1:T_2]}, \bar y^{[T_1:T_2]})$ at the end of iteration $T_2$, such that $z^t$ and $\bar z^{[T_1:T_2]}$ satisfy the following two properties: (2) global convergence, $\mathrm{dist}(\bar z^{[T_1:T_2]}, \mathcal{Z}^*) \le C' \log(T_2 - T_1 + 1)/\big(\eta'(T_2 - T_1 + 1)\big)$ for a problem-dependent constant $C' > 0$; and (3) geometric boundedness, i.e., $\mathrm{dist}(z^t, \mathcal{Z}^*)$ and $\mathrm{dist}(\bar z^{[T_1:T_2]}, \mathcal{Z}^*)$ exceed $\mathrm{dist}(\tilde z, \mathcal{Z}^*)$ by at most a factor growing geometrically in $t - T_1$ with base a problem-dependent constant $D_0 \ge 1$ (see Theorem 4 for the precise form).

Local-Fast base algorithm: by calling Local-Fast($[T_1 : T_2]$, $\tilde z$, $\eta$) with initial policy pair $\tilde z$, the players play a policy pair $z^t$ in each iteration $t \in [T_1 : T_2]$ satisfying local linear convergence: there are problem-dependent constants $c_0 \in (0, 1)$, $\delta_0 > 0$ and $\Gamma_0 > 0$ such that if $\mathrm{dist}^2(\tilde z, \mathcal{Z}^*) \le \delta_0 \eta^4$, then $\mathrm{dist}^2(z^t, \mathcal{Z}^*) \le \Gamma_0 \big(1 - \tfrac{c_0 \eta^2}{48}\big)^{t - T_1} \mathrm{dist}^2(\tilde z, \mathcal{Z}^*)$ for all $t \in [T_1 : T_2]$ (see Theorem 5 for the OGDA instantiation).

Since $\mathcal{Z}^*$ is unknown, it is impossible to tell whether the algorithm has reached the benign neighborhood in which Local-Fast enjoys the linear rate; thus, we cannot decide when to switch from Global-Slow to Local-Fast.

Algorithm 1: Homotopy-PO: a meta-algorithm with global linear convergence
Input: iterations $[0 : T]$, initial policy pair $z^0 \in \mathcal{Z}$, stepsizes $\eta, \eta' > 0$
set $k = 1$, $\bar I^0_{\mathrm{lf}} = -1$, $z^{-1} = z^0$
while $\bar I^{k-1}_{\mathrm{lf}} < T$ do
    $\underline I^k_{\mathrm{gs}} = \bar I^{k-1}_{\mathrm{lf}} + 1$, $\bar I^k_{\mathrm{gs}} = \min\{\underline I^k_{\mathrm{gs}} + 2^k - 1, T\}$, $\underline I^k_{\mathrm{lf}} = \bar I^k_{\mathrm{gs}} + 1$, $\bar I^k_{\mathrm{lf}} = \min\{\underline I^k_{\mathrm{lf}} + 4^k - 1, T\}$
    during the time interval $[\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}]$, run Global-Slow($[\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}]$, $z^{\bar I^{k-1}_{\mathrm{lf}}}$, $\eta'$) and compute an average policy $\bar z^{[\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}]}$
    during the time interval $[\underline I^k_{\mathrm{lf}} : \bar I^k_{\mathrm{lf}}]$, run Local-Fast($[\underline I^k_{\mathrm{lf}} : \bar I^k_{\mathrm{lf}}]$, $\bar z^{[\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}]}$, $\eta$)
    $k \leftarrow k + 1$
end

To overcome this problem, we propose a homotopy continuation method Homotopy-PO which smartly switches between Global-Slow and Local-Fast. The pseudocode is in Algorithm 1. In Homotopy-PO, we split $[0 : T]$ into the segments $[0 : T] = [\underline I^1_{\mathrm{gs}} : \bar I^1_{\mathrm{gs}}] \cup [\underline I^1_{\mathrm{lf}} : \bar I^1_{\mathrm{lf}}] \cup \cdots \cup [\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}] \cup [\underline I^k_{\mathrm{lf}} : \bar I^k_{\mathrm{lf}}] \cup \cdots$, where $[\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}]$ is the time interval of the $k$-th call to Global-Slow with length $2^k$, and $[\underline I^k_{\mathrm{lf}} : \bar I^k_{\mathrm{lf}}]$ is the time interval of the $k$-th call to Local-Fast with length $4^k$. Define $k^* = \max\{k^*_1, k^*_2\}$, where $2^{k^*_1} = O\big(1/(\sqrt{\delta_0}\,\eta^2)\big)$ and $2^{k^*_2} = O\big(\tfrac{1}{c_0 \eta^2}\log(D_0\Gamma_0)\big) = \tilde O\big(1/(c_0\eta^2)\big)$. The value of $k^*$ is unknown to the players.

Hidden Phase I. In the beginning, Global-Slow behaves like a "guide" in the sense that its average policy $\bar z^{[\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}]}$ gets closer to the NE set as $k$ grows. For small $k$, $\mathrm{dist}(z^t, \mathcal{Z}^*)$ could possibly increase when running Local-Fast. However, since the average policy $\bar z^{[\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}]}$ is the initial policy of the $k$-th call to Local-Fast, by the global convergence property (2), for $k \ge k^*_1$ the average $\bar z^{[\underline I^k_{\mathrm{gs}} : \bar I^k_{\mathrm{gs}}]}$ reaches $\mathcal{B}(\mathcal{Z}^*, \sqrt{\delta_0\eta^4})$. Thus, once $k \ge k^*_1$, each time we switch to Local-Fast it exhibits linear convergence during the time interval $[\underline I^k_{\mathrm{lf}} : \bar I^k_{\mathrm{lf}}]$.

Hidden Phase II. After $k \ge k^*_1$, Local-Fast enjoys fast linear convergence and becomes the main contributor to the convergence (see segments AB, CD in Figure 1). Thanks to the fast convergence of Local-Fast, in this phase $\mathrm{dist}(z^t, \mathcal{Z}^*)$ can be much smaller than $C'/t$. Note that we use $z^{\bar I^{k-1}_{\mathrm{lf}}}$ as the initial policy of the $k$-th call to Global-Slow; thus, Global-Slow could possibly cause $\mathrm{dist}(z^t, \mathcal{Z}^*)$ to increase. However, instead of bounding $\mathrm{dist}(z^t, \mathcal{Z}^*)$ by (2), property (3) now provides a tighter bound for $\mathrm{dist}(z^t, \mathcal{Z}^*)$ when calling Global-Slow during Hidden Phase II.
(3) implies that $\mathrm{dist}(z^t, \mathcal{Z}^*)$ increases at most geometrically when running Global-Slow (see segments BC, DE in Figure 1). After $2^k \ge O\big(\tfrac{1}{c_0\eta^2}\log(D_0\Gamma_0)\big)$ (i.e., $k \ge k^*_2$), the possible increase of $\mathrm{dist}(z^t, \mathcal{Z}^*)$ caused by Global-Slow is much less than the decrease caused by Local-Fast.

[Figure 1: evolution of $\log(\mathrm{dist}(z^t, \mathcal{Z}^*))$ under Homotopy-PO, alternating Local-Fast and Global-Slow segments and tracking the global linear rate.]

Theorem 1 Let $\{z^t = (x^t, y^t)\}_{t \in [0:T]}$ be the policy pairs played when running Homotopy-PO (Algorithm 1). Then, there exists a problem-dependent constant $D \le O\big(\mathrm{poly}(C', 1/c_0, 1/\delta_0, 1/\eta, 1/\eta')\big)$ such that for any $t \in [0:T]$, we have
$$\mathrm{dist}^2(z^t, \mathcal{Z}^*) \le 2S \max\{\Gamma_0, 1\} \cdot \Big(1 - \frac{c_0\eta^2}{48}\Big)^{t - D},$$
where the values of $C'$, $c_0$, $\delta_0$, $\Gamma_0$ can be found in the definitions of Global-Slow and Local-Fast. As $D$ is independent of $t$, Theorem 1 guarantees the global linear convergence of Homotopy-PO.
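To make the switching schedule of Algorithm 1 concrete, here is a small sketch of the doubling scheme in isolation, with the two base algorithms left abstract; the function name and the horizon are illustrative, while the interval arithmetic mirrors the pseudocode above.

```python
def homotopy_schedule(T):
    """Time intervals of Algorithm 1: round k runs Global-Slow for 2**k steps,
    then Local-Fast for 4**k steps, until the horizon T is exhausted."""
    rounds = []
    lf_end, k = -1, 1
    while lf_end < T:
        gs_start = lf_end + 1
        gs_end = min(gs_start + 2 ** k - 1, T)
        lf_start = gs_end + 1
        lf_end = min(lf_start + 4 ** k - 1, T)
        # Global-Slow restarts from the last Local-Fast iterate; Local-Fast restarts
        # from the average policy produced by the preceding Global-Slow call.
        rounds.append({"k": k, "global_slow": (gs_start, gs_end), "local_fast": (lf_start, lf_end)})
        k += 1
    return rounds

for r in homotopy_schedule(T=500):
    print(r)
```

Because the Local-Fast segments grow as $4^k$ while the Global-Slow segments grow as $2^k$, the fraction of time spent in Global-Slow vanishes, which is why the aggregated rate is driven by Local-Fast.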

3.2. EXAMPLES OF BASE ALGORITHMS

We introduce the averaging independent optimistic gradient descent/ascent (Averaging OGDA) method and the independent optimistic policy gradient descent/ascent (OGDA) method, which will serve as examples for Global-Slow and Local-Fast, respectively. Both Averaging OGDA and OGDA are symmetric, rational and decentralized algorithms. The pseudocodes are in Appendix G.

Example of Global-Slow base algorithm (Averaging OGDA). By running Averaging-OGDA($[T_1:T_2]$, $\tilde z$, $\eta'$) with initial policy $\tilde z = (\tilde x, \tilde y)$, the min-player initializes $x^{T_1} = \hat x^{T_1} = \tilde x$ and $\underline V^{T_1}(s) = V^{\dagger, \tilde y}(s)$, the max-player initializes $y^{T_1} = \hat y^{T_1} = \tilde y$ and $\overline V^{T_1}(s) = V^{\tilde x, \dagger}(s)$, and they update for $t \in [T_1+1 : T_2]$ as follows:
$$\underline V^t(s) = \min_{a \in \mathcal{A}} \sum_{j=T_1}^{t-1} \alpha^{j-T_1+1}_{t-T_1}\, \underline q^j_s(a), \qquad \overline V^t(s) = \max_{b \in \mathcal{B}} \sum_{j=T_1}^{t-1} \alpha^{j-T_1+1}_{t-T_1}\, \overline q^j_s(b),$$
$$x^t_s = P_{\Delta_{\mathcal{A}}}\big(\hat x^{t-1}_s - \eta' \underline q^{t-1}_s\big), \quad y^t_s = P_{\Delta_{\mathcal{B}}}\big(\hat y^{t-1}_s + \eta' \overline q^{t-1}_s\big), \quad \hat x^t_s = P_{\Delta_{\mathcal{A}}}\big(\hat x^{t-1}_s - \eta' \underline q^t_s\big), \quad \hat y^t_s = P_{\Delta_{\mathcal{B}}}\big(\hat y^{t-1}_s + \eta' \overline q^t_s\big), \tag{6}$$
where $\underline q^j_s = \mathcal{Q}_s[\underline V^j]\, y^j_s$, $\overline q^j_s = \mathcal{Q}_s[\overline V^j]^\top x^j_s$, and $\mathcal{Q}_s[\cdot]$ is the Bellman target operator defined in the introduction. The min-player and the max-player compute the average policies
$$\bar x^{[T_1:T_2]} = \sum_{t=T_1}^{T_2} \alpha^{t-T_1+1}_{T_2-T_1+1}\, x^t, \qquad \bar y^{[T_1:T_2]} = \sum_{t=T_1}^{T_2} \alpha^{t-T_1+1}_{T_2-T_1+1}\, y^t. \tag{7}$$
We use the classical averaging stepsizes $\{\alpha^j_t\}$ from Jin et al. (2018):
$$\alpha_t = \frac{H+1}{H+t}, \qquad \alpha^j_t = \alpha_j \prod_{k=j+1}^{t} (1 - \alpha_k) \;\; (1 \le j \le t-1), \qquad \alpha^t_t = \alpha_t, \qquad \text{with } H = \frac{1+\gamma}{1-\gamma}.$$

Example of Local-Fast base algorithm (OGDA). By running OGDA($[T_1:T_2]$, $\tilde z$, $\eta$) with initial policy $\tilde z = (\tilde x, \tilde y)$, the min-player initializes $x^{T_1} = \hat x^{T_1} = \tilde x$, the max-player initializes $y^{T_1} = \hat y^{T_1} = \tilde y$, and they update for $t \in [T_1+1 : T_2]$ as follows:
$$x^t_s = P_{\Delta_{\mathcal{A}}}\big(\hat x^{t-1}_s - \eta Q^{t-1}_s y^{t-1}_s\big), \quad y^t_s = P_{\Delta_{\mathcal{B}}}\big(\hat y^{t-1}_s + \eta (Q^{t-1}_s)^\top x^{t-1}_s\big), \quad \hat x^t_s = P_{\Delta_{\mathcal{A}}}\big(\hat x^{t-1}_s - \eta Q^t_s y^t_s\big), \quad \hat y^t_s = P_{\Delta_{\mathcal{B}}}\big(\hat y^{t-1}_s + \eta (Q^t_s)^\top x^t_s\big), \tag{8}$$
where we abbreviate $Q^t_s = Q^{x^t, y^t}_s$ for $t \in [T_1:T_2]$. In OGDA, $x^t, y^t$ are the policies played at iteration $t \in [T_1:T_2]$, while $\hat x^t, \hat y^t$ are local auxiliary variables. OGDA can be considered a natural extension of the classical optimistic gradient descent/ascent to Markov games: when there is only one state ($S = 1$), it reduces to the classical OGDA method for matrix games. The proof of local linear convergence of OGDA is of independent interest and is given in Section 5.
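To illustrate the update rule (8), the sketch below runs OGDA on a single-state game ($S = 1$), where $Q^t_s$ is a constant payoff matrix and the method reduces to the classical optimistic gradient descent/ascent for matrix games, as noted above. The $2\times 2$ payoff matrix, step size and iteration count are illustrative choices, and the simplex projection uses the standard sorting-based routine; none of these specifics are prescribed by the paper.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto the probability simplex (sorting-based routine)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    cond = u - (css - 1.0) / idx > 0
    rho = idx[cond][-1]
    theta = (css[cond][-1] - 1.0) / rho
    return np.maximum(v - theta, 0.0)

def ogda_matrix_game(Q, eta=0.05, T=2000):
    """OGDA for min_x max_y <x, Q y> over the simplices; returns the played iterates.
    The played x^t uses the previous gradient Q y^{t-1}; the auxiliary x_hat^t uses the current one."""
    A, B = Q.shape
    x_hat, y_hat = np.ones(A) / A, np.ones(B) / B     # hatted (auxiliary) iterates
    x_prev, y_prev = x_hat.copy(), y_hat.copy()       # previously played policies
    traj = []
    for _ in range(T):
        x = proj_simplex(x_hat - eta * Q @ y_prev)        # played min-policy
        y = proj_simplex(y_hat + eta * Q.T @ x_prev)      # played max-policy
        x_hat = proj_simplex(x_hat - eta * Q @ y)         # auxiliary update with the fresh gradient
        y_hat = proj_simplex(y_hat + eta * Q.T @ x)
        x_prev, y_prev = x, y
        traj.append((x, y))
    return traj

# Illustrative 2x2 game; its unique mixed NE is x* = (1/2, 1/2), y* = (1/4, 3/4), value 5/2.
Q = np.array([[1.0, 3.0], [4.0, 2.0]])
x_star, y_star = np.array([0.5, 0.5]), np.array([0.25, 0.75])
traj = ogda_matrix_game(Q)
for t in [10, 100, 1000, 1999]:
    x, y = traj[t]
    print(t, np.linalg.norm(x - x_star) + np.linalg.norm(y - y_star))
```

On such an instance one should observe the last-iterate distance to the equilibrium shrink over iterations, which is the matrix-game analogue of the local behavior analyzed for Markov games in Section 5.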

3.3. GLOBAL LINEAR CONVERGENCE

We can instantiate the meta algorithm Homotopy-PO by using OGDA (8) as Local-Fast and Averaging OGDA (6) as Global-Slow. This gives the provable global linear convergence for zero-sum discounted Markov games stated in Theorem 2. In practice, Homotopy-PO also exhibits linear convergence; our numerical experiments are in Appendix I due to space limitations.

Theorem 2 (Global Linear Convergence) Let $\{z^t = (x^t, y^t)\}_{t \in [0:T]}$ be the policy pairs played when running Homotopy-PO (Algorithm 1), where Local-Fast uses OGDA with $\eta \le \frac{(1-\gamma)^{5/2}}{32\sqrt{S(A+B)}}$ and Global-Slow uses Averaging OGDA with $\eta' \le \frac{1-\gamma}{16\max\{A, B\}}$. Then, for any $t \in [0:T]$,
$$\mathrm{dist}^2(z^t, \mathcal{Z}^*) \le \frac{16S^2}{(1-\gamma)\,\eta^4\eta'^2} \cdot \big(1 - c\eta^2\big)^{t - M\log^2(SAB/(c_+\eta\eta'))},$$
where $c = \Omega\big(c_+^2/\mathrm{poly}(S, A, B, 1/(1-\gamma))\big)$ and $M = \mathrm{poly}(S, A, B, 1/(1-\gamma), 1/c_+)$.

Decentralized implementation. Since both OGDA and Averaging OGDA are symmetric, rational and decentralized, our instantiation of Homotopy-PO is naturally a symmetric, rational and decentralized algorithm. Pseudocodes and more details can be found in Appendix G.

Linear rate comparison with matrix games. For convex-concave matrix games, Gilpin et al. (2012) and Wei et al. (2020) propose centralized/decentralized methods with global linear rates of $(1 - O(\varphi(G)))^t$ and $(1 - O(\varphi(G)^2))^t$ respectively, where $\varphi(G)$ is a certain condition measure of the matrix $G$ (details of $\varphi(G)$ are in Lemma 22). The constant $c_+$ in (1) can be naturally defined as $c_+ = \min_{s \in \mathcal{S}} \varphi(Q^*_s)$ (see Corollary 2). Thus, the global linear convergence rate for zero-sum Markov games in Theorem 2 is comparable to that of solving matrix games, up to polynomials in $S, A, B, 1/(1-\gamma)$.

4. GLOBAL CONVERGENCE AND GEOMETRIC BOUNDEDNESS OF AVERAGING OGDA

We show that the Averaging OGDA method (6) has an $O(\log T / T)$ global convergence rate and geometric boundedness. Thus, Averaging OGDA can serve as Global-Slow in Homotopy-PO.

Global convergence. The proof of global convergence of Averaging OGDA adapts several standard techniques from Markov games (Zhang et al., 2022; Wei et al., 2021). We include its proof in Appendix D.1 for completeness.

Theorem 3 (Global Convergence) Let $\bar z^{[T_1:T_2]} = (\bar x^{[T_1:T_2]}, \bar y^{[T_1:T_2]})$ be the average policy (7) generated by running Averaging-OGDA($[T_1:T_2]$, $\tilde z$, $\eta'$) with $\eta' \le \frac{1-\gamma}{16\max\{A,B\}}$. There is a problem-dependent constant $C' = O\big(\frac{\sqrt{S(A+B)}}{c_+(1-\gamma)^6}\big)$ such that $\bar z^{[T_1:T_2]}$ satisfies
$$\mathrm{dist}\big(\bar z^{[T_1:T_2]}, \mathcal{Z}^*\big) \le C' \cdot \frac{\log(T_2 - T_1 + 1)}{\eta'(T_2 - T_1 + 1)}.$$
This gives the $O(\log T / T)$ global convergence rate of Global-Slow. This property guarantees that Global-Slow can serve as a "guide" in Hidden Phase I, as described in Section 3.1.

Geometric boundedness. The proof of geometric boundedness mainly relies on the stability of projected gradient descent with respect to the NE set (Appendix B). We prove that the increase of $\mathrm{dist}(z^t, \mathcal{Z}^*)$ is at most geometric by providing mutual bounds among $\{\mathrm{dist}(z^t, \mathcal{Z}^*)\}$, $\{\mathrm{dist}(\hat z^t, \mathcal{Z}^*)\}$, $\{\|\overline V^t - \underline V^t\|_\infty\}$ and $\{\max_b \overline q^t_s(b) - \min_a \underline q^t_s(a)\}$ inductively. The formal proof is in Appendix D.2.

Theorem 4 (Geometric Boundedness) Let $\{z^t\}_{t \in [T_1:T_2]}$ and $\bar z^{[T_1:T_2]}$ be the policy pairs played and the average policy pair generated by running Averaging-OGDA($[T_1:T_2]$, $\tilde z$, $\eta'$) with $\eta' \le 1$. Then there is a problem-dependent constant $D_0 = O\big(\frac{S(A+B)^2}{(1-\gamma)^4}\big)$ (possibly $D_0 > 1$) such that for any $t \in [T_1:T_2]$,
$$\mathrm{dist}^2(z^t, \mathcal{Z}^*) \le D_0^{\,t-T_1} \cdot \mathrm{dist}^2(\tilde z, \mathcal{Z}^*), \qquad \mathrm{dist}^2\big(\bar z^{[T_1:T_2]}, \mathcal{Z}^*\big) \le D_0^{\,T_2-T_1} \cdot \mathrm{dist}^2(\tilde z, \mathcal{Z}^*).$$
This property is important in the proof of our main theorem (Theorem 2). It means that while running Global-Slow in Hidden Phase II, although $\mathrm{dist}(z^t, \mathcal{Z}^*)$ can increase because $D_0 > 1$, it can increase at most geometrically (see segments BC, DE in Figure 1).

5. LOCAL LINEAR CONVERGENCE OF OGDA

We show that OGDA (8) enjoys local linear convergence; thus, OGDA can serve as the base algorithm Local-Fast in Homotopy-PO. To prove the local linear convergence, we provide a novel analysis of OGDA which starts from the following two observations.

Observation I (Lemma 7): saddle-point metric subregularity (SP-MS) can be generalized to Markov games, i.e., for any policy pair $z \in \mathcal{Z}$ and $s \in \mathcal{S}$,
$$V^{x,\dagger}(s) - V^{\dagger,y}(s) \ge c_+ \cdot \mathrm{dist}(z_s, \mathcal{Z}^*_s). \tag{13}$$
Observation I guarantees that the progress of projected gradient descent/ascent (PGDA) is substantial.

Observation II (Appendix B, Lemma 10): when running OGDA (8), the change in the policy pair becomes smaller as $z^t, \hat z^t$ approach the NE set, i.e.,
$$\|z^{t+1} - z^t\|^2 + \|\hat z^t - \hat z^{t-1}\|^2 \le O\big(\mathrm{dist}^2(\hat z^{t-1}, \mathcal{Z}^*) + \|\hat z^{t-1} - z^{t-1}\|^2\big). \tag{14}$$
Observation II implies the stability of the state visitation distributions. Intuitively, it allows us to relate Markov games to matrix games in a neighborhood of the NE set.

Theorem 5 (Local Linear Convergence) Let $\{z^t\}_{t \in [T_1:T_2]}$ be the policy pairs played when running OGDA($[T_1:T_2]$, $\tilde z$, $\eta$) with stepsize $\eta \le \frac{(1-\gamma)^{5/2}}{32\sqrt{S(A+B)}}$. Then, there are problem-dependent constants $c_0 \in (0, 1)$ and $\delta_0 > 0$ such that if $\mathrm{dist}^2(\tilde z, \mathcal{Z}^*) \le \delta_0\eta^4$, then for any $t \ge T_1$,
$$\mathrm{dist}^2(z^t, \mathcal{Z}^*) \le \frac{8S}{1-\gamma}\Big(1 - \frac{c_0\eta^2}{48}\Big)^{t-T_1} \mathrm{dist}^2(\tilde z, \mathcal{Z}^*),$$
where $c_0 = \Omega\big(c_+^2/\mathrm{poly}(S, A, B, 1/(1-\gamma))\big)$ and $\delta_0 = \Omega\big(c_+^4/\mathrm{poly}(S, A, B, 1/(1-\gamma))\big)$.

We provide a proof sketch below; the formal proof is in Appendix C.

Proof sketch of Theorem 5. We denote the projections by $\hat x^{t*} = P_{\mathcal{X}^*}(\hat x^t)$, $\hat y^{t*} = P_{\mathcal{Y}^*}(\hat y^t)$, $\hat z^{t*} = P_{\mathcal{Z}^*}(\hat z^t)$, and let $\rho_0$ be the uniform distribution on $\mathcal{S}$. Our proof of the local linear convergence of OGDA has the following steps.

Step I: One-step analysis (Appendix C.1). The main obstacle in adapting the standard analysis for normal-form games to Markov games lies in the fact that Markov games are nonconvex-nonconcave. The one-step analysis yields
$$\Lambda^{t+1} \le \Lambda^t + \underbrace{\Theta^t - \hat\Theta^t}_{\text{Step III: stability of } d^t_x,\, d^t_y} - \frac{C_\Lambda}{2}\|\hat z^t - z^t\|^2 - \underbrace{C_\Lambda\big(\|\hat z^{t+1} - z^{t+1}\|^2 + \|z^{t+1} - \hat z^t\|^2\big)}_{\text{Step II: progress of PGDA}}, \tag{17}$$
where $C_\Lambda = \frac{1-\gamma}{4S}$, $\Theta^t = \sum_{s \in \mathcal{S}} d^t_x(s)\,\mathrm{dist}^2(\hat x^t_s, \mathcal{X}^*_s) + d^t_y(s)\,\mathrm{dist}^2(\hat y^t_s, \mathcal{Y}^*_s)$, $\hat\Theta^t = \sum_{s \in \mathcal{S}} d^{t+1}_x(s)\,\mathrm{dist}^2(\hat x^t_s, \mathcal{X}^*_s) + d^{t+1}_y(s)\,\mathrm{dist}^2(\hat y^t_s, \mathcal{Y}^*_s)$, and $\Lambda^t = \Theta^t + C_\Lambda\|\hat z^t - z^t\|^2$. As $d^t_x(s), d^t_y(s) \ge \frac{1-\gamma}{S}$, to show the local linear convergence of OGDA, it suffices to show the linear convergence of $\Lambda^t$.

Step II: Progress of projected gradient descent (Appendix C.2). We combine (13) from Observation I with a standard analysis of projected gradient descent (Lemma 5) to show that there exists a problem-dependent constant $c'_+ = O\big(c_+^2\eta^2/\mathrm{poly}(S, A, B, 1/(1-\gamma))\big)$ such that
$$C_\Lambda\big(\|\hat z^{t+1} - z^{t+1}\|^2 + \|z^{t+1} - \hat z^t\|^2\big) \ge c'_+ \cdot \Theta^t. \tag{18}$$

Step III: Stability of visitation distributions near the NE set (Appendix C.3). Using (14) from Observation II and the non-expansiveness of projections onto convex sets, we show that $\|z^{t+1} - z^t\|^2 + \|\hat z^{t*} - \hat z^{(t-1)*}\|^2 \le O(\Lambda^{t-1})$. Then, as the visitation distributions $d^t_x(s), d^t_y(s)$ in (16) depend continuously on the policies, we can find a problem-dependent constant $\delta = O\big(c_+^4\eta^4/\mathrm{poly}(S, A, B, 1/(1-\gamma))\big)$ such that if $\Lambda^{t-1} \le \delta$, then
$$\Theta^t - \hat\Theta^t \le \frac{c'_+}{2}\,\Theta^t. \tag{19}$$

Step IV: Induction (Appendix C.4).
By (17), (18), (19) from Steps I, II, III above, we can deduce that when $\Lambda^{t-1} \le \delta$, a "one-step linear convergence" is achieved:
$$\Lambda^{t+1} \le \Lambda^t + \frac{c'_+}{2}\Theta^t - \frac{C_\Lambda}{2}\|\hat z^t - z^t\|^2 - c'_+\Theta^t = \Lambda^t - \frac{c'_+}{2}\Theta^t - \frac{C_\Lambda}{2}\|\hat z^t - z^t\|^2 \le \Lambda^t - \min\Big\{\frac{c'_+}{2}, \frac{1}{2}\Big\}\big(\Theta^t + C_\Lambda\|\hat z^t - z^t\|^2\big) = \Big(1 - \frac{c'_+}{2}\Big)\Lambda^t,$$
where the last equality uses $c'_+ \le 1$.

By a coupled induction with Step III, given the initial policy $\tilde z$ in the neighborhood $\mathcal{B}(\mathcal{Z}^*, \sqrt{\delta})$ of the NE set, the policy pair $z^t$ always stays in $\mathcal{B}(\mathcal{Z}^*, \sqrt{\delta})$. Then $\Lambda^t$ converges linearly, which yields the local linear convergence of OGDA as in Theorem 5.

ACKNOWLEDGEMENT

We would like to thank anonymous reviewers for their helpful comments. 

A FURTHER RELATED WORK

Sampling-based two-player zero-sum Markov games. Finding Nash equilibria of zero-sum Markov games in the sampling-based/online setting has received extensive study in recent years (Zhang et al., 2020; Liu et al., 2021; Bai et al., 2020; Bai & Jin, 2020; Brafman & Tennenholtz, 2002; Sidford et al., 2020; Tian et al., 2021; Wei et al., 2017; Xie et al., 2020; Chen et al., 2022; Li et al., 2022). In this paper, we are more concerned with the known-model or perfect-recall setting; specifically, our focus is on how to design efficient policy optimization methods to solve the minimax optimization problem formulated by zero-sum Markov games. Therefore, these works are not directly relevant to us.

Minimax optimization. Zero-sum Markov games are usually studied as minimax optimization problems. Finding Nash equilibria/saddle points in convex-concave and nonconvex-concave problems has been extensively studied (Lin et al., 2020b; Tseng, 1995; Mokhtari et al., 2020a;b; Thekumparampil et al., 2019; Lu et al., 2020; Nouiehed et al., 2019; Kong & Monteiro, 2021; Lin et al., 2020a). Due to the nonconvexity-nonconcavity of zero-sum Markov games, existing tools from convex-concave and nonconvex-concave optimization are difficult to adapt here. For matrix games, Nash equilibria can be found at a linear rate (Gilpin et al., 2012; Wei et al., 2020). The linear convergence of their algorithms relies on the following fact: the duality gap of a policy pair can be lower bounded by its distance to the NE set multiplied by a matrix condition measure (see Lemma 22 for more details). This property is called saddle-point metric subregularity (SP-MS) in Wei et al. (2020). Similar techniques have been extended to extensive-form games to obtain linear convergence (Lee et al., 2021; Piliouras et al., 2022).

Averaging techniques. Averaging techniques are usually used to tame nonstationarity in approximate Q-functions, where the players utilize information from past iterations to obtain better approximations of value functions and policy gradients. Wei et al. (2021) propose an actor-critic OGDA method which uses averaged value functions to obtain more accurate policy gradients, and thereby achieves an O(1/√t) last-iterate convergence rate to the NE set. Zhang et al. (2022) propose a modified OFTRL method, where the min-player and the max-player maintain a lower and an upper bound on the value functions, respectively; the lower and upper bounds are computed from approximate Q-functions of past iterations. Their method has an O(1/t) convergence rate for the average policy. Yang & Ma (2022) show that the average policy of an OFTRL method, whose approximate Q-functions are also averaged from past estimates, can find Nash equilibria at a rate of O(1/t) with no logarithmic factors.

Regularized Markov games. Adding a regularizer can greatly refine the structure of matrix games and Markov games and is considered a powerful tool to tackle the nonconvexity-nonconcavity of zero-sum Markov games. Cen et al. (2021) study entropy-regularized matrix games and achieve dimension-free last-iterate linear convergence to the quantal response equilibrium, which is an approximation of the Nash equilibrium. They further connect value iteration with matrix games and use the contraction property of the Bellman operator to prove linear convergence to the quantal response equilibrium of Markov games. By choosing small regularization weights, their method can find an ϵ-Nash equilibrium in O(1/ϵ) iterations. Zeng et al. (2022) also consider adding entropy regularization to help find Nash equilibria in zero-sum Markov games.
They prove an O(t^{-1/3}) convergence rate for a variant of GDA by driving the regularization weights to zero dynamically. However, to obtain Nash equilibria, the regularization weights have to be reduced to zero during the learning process. The time complexities of existing regularized methods are usually inversely proportional to the regularization weights, and reducing these weights to zero can thus lead to sublinear rates.

B STABILITY OF PROJECTED GRADIENT DESCENT/ASCENT WITH RESPECT TO THE NASH EQUILIBRIUM SET

In this section, we show the stability of the distance to the Nash equilibrium set after one step of projected gradient descent/ascent. The results in this section are important in our proofs of the local linear convergence of OGDA and the geometric boundedness of Averaging OGDA. The following lemma shows that projected gradient descent/ascent is very "stable" on the NE set: if the players have attained a Nash equilibrium, then their policies remain invariant under one step of projected gradient descent/ascent.

Lemma 1 For any Nash equilibrium $z = (x, y) \in \mathcal{Z}^*$, let $x^+, y^+$ be the variables after one step of projected policy gradient descent/ascent with stepsize $\eta > 0$, i.e., for $s \in \mathcal{S}$,
$$x^+_s = P_{\Delta_{\mathcal{A}}}\big(x_s - \eta Q^*_s y_s\big), \qquad y^+_s = P_{\Delta_{\mathcal{B}}}\big(y_s + \eta (Q^*_s)^\top x_s\big).$$
Let $z^+ = (x^+, y^+)$; then $z^+ = z$.

Proof of Lemma 1. Let $u^*_s = Q^*_s y_s$. By Lemma 19, $x_s \in \arg\min_{x'_s \in \Delta_{\mathcal{A}}} \langle x'_s, Q^*_s y_s\rangle$. Equivalently, $\mathrm{supp}(x_s) \subseteq \arg\min_a u^*_s(a)$, where $\mathrm{supp}(x_s)$ is the index set of the nonzero entries of $x_s$. Next, we show $x^+_s = x_s$. Since $x^+_s$ is the projection onto $\Delta_{\mathcal{A}}$ and Slater's condition holds for the simplex constraint, the KKT conditions give
$$x^+_s(a) - x_s(a) + \eta u^*_s(a) - \lambda_0 + \lambda_a = 0, \quad \lambda_a x^+_s(a) = 0, \quad \lambda_a \ge 0, \quad x^+_s(a) \ge 0, \;\; \forall a \in [A], \qquad \sum_{a \in [A]} x^+_s(a) = 1.$$
Then, for $a \in [A]$, $\lambda_a > 0$ only if $x^+_s(a) = 0$; otherwise $x^+_s(a) = x_s(a) - \eta u^*_s(a) + \lambda_0$. Thus $x^+_s(a) = \max\{x_s(a) - \eta u^*_s(a) + \lambda_0,\, 0\}$. If $\lambda_0 = \eta \cdot \min_{a \in [A]} u^*_s(a)$, then combining with $\mathrm{supp}(x_s) \subseteq \arg\min_a u^*_s(a)$ we get $\max\{x_s(a) - \eta u^*_s(a) + \lambda_0, 0\} = x_s(a)$, and hence $\sum_a \max\{x_s(a) - \eta u^*_s(a) + \lambda_0, 0\} = 1$. For $\lambda_0 > \eta \cdot \min_{a \in [A]} u^*_s(a)$ or $\lambda_0 < \eta \cdot \min_{a \in [A]} u^*_s(a)$, we would have $\sum_a \max\{x_s(a) - \eta u^*_s(a) + \lambda_0, 0\} > 1$ or $< 1$, respectively. To meet the condition $\sum_{a \in [A]} x^+_s(a) = 1$, we must take $\lambda_0 = \eta \cdot \min_{a \in [A]} u^*_s(a)$. Now $x^+_s(a) = \max\{x_s(a) - \eta u^*_s(a) + \lambda_0, 0\} = x_s(a)$ for all $a \in \mathcal{A}$. Analogously, $y^+_s = y_s$. □

The following lemma is a perturbed version of Lemma 1.

Lemma 2 For any $z = (x, y) \in \mathcal{Z}$, $\hat z = (\hat x, \hat y) \in \mathcal{Z}$ and matrices $\{Q_s, \hat Q_s\}_{s \in \mathcal{S}} \subseteq \mathbb{R}^{A \times B}$, let $x^+, y^+$ be the position after one step of projected policy gradient descent/ascent with stepsize $\eta > 0$, i.e., for $s \in \mathcal{S}$,
$$x^+_s = P_{\Delta_{\mathcal{A}}}\big(\hat x_s - \eta Q_s y_s\big), \qquad y^+_s = P_{\Delta_{\mathcal{B}}}\big(\hat y_s + \eta \hat Q_s^\top x_s\big).$$
Let $z^+ = (x^+, y^+)$; then
$$\|z^+ - \hat z\|^2 \le 8\,\mathrm{dist}^2(\hat z, \mathcal{Z}^*) + 4\eta^2 \sum_{s \in \mathcal{S}} B \max_{(a,b) \in \mathcal{A}\times\mathcal{B}} |Q_s(a,b) - Q^*_s(a,b)|^2 + 4\eta^2 \sum_{s \in \mathcal{S}} A \max_{(a,b) \in \mathcal{A}\times\mathcal{B}} |\hat Q_s(a,b) - Q^*_s(a,b)|^2 + \frac{4\eta^2 \max\{A, B\}^2}{(1-\gamma)^2}\, \mathrm{dist}^2(z, \mathcal{Z}^*).$$

Proof of Lemma 2. Denote $x_* = P_{\mathcal{X}^*}(x)$, $y_* = P_{\mathcal{Y}^*}(y)$, $z_* = (x_*, y_*)$; $\hat x_* = P_{\mathcal{X}^*}(\hat x)$, $\hat y_* = P_{\mathcal{Y}^*}(\hat y)$, $\hat z_* = (\hat x_*, \hat y_*)$. Let $u_s = Q_s y_s$ and $u^*_s = Q^*_s y_{*s}$; then
$$\|u^*_s - u_s\| \le \sqrt{B}\,\|u^*_s - u_s\|_\infty \le \sqrt{B}\Big(\max_{(a,b)} |Q_s(a,b) - Q^*_s(a,b)|\,\|y_s\|_1 + \max_{(a,b)} |Q^*_s(a,b)|\,\|y_s - y_{*s}\|_1\Big) \le \sqrt{B}\max_{(a,b)} |Q_s(a,b) - Q^*_s(a,b)| + \frac{B}{1-\gamma}\,\mathrm{dist}(y_s, \mathcal{Y}^*_s),$$
i.e.,
$$\|u^*_s - u_s\|^2 \le 2\Big(B\max_{(a,b)} |Q_s(a,b) - Q^*_s(a,b)|^2 + \frac{B^2}{(1-\gamma)^2}\,\mathrm{dist}^2(y_s, \mathcal{Y}^*_s)\Big). \tag{20}$$
By Lemma 19, $(\hat x_*, y_*)$ is also a Nash equilibrium. Denote $\hat x^+_{*s} = P_{\Delta_{\mathcal{A}}}\big(\hat x_{*s} - \eta Q^*_s y_{*s}\big)$. Then, by Lemma 1,
$$\hat x^+_{*s} = \hat x_{*s}. \tag{21}$$
By the triangle inequality, we have
$$\|x^+_s - \hat x_s\| \le \|x^+_s - \hat x^+_{*s}\| + \|\hat x^+_{*s} - \hat x_{*s}\| + \|\hat x_{*s} - \hat x_s\| = \|P_{\Delta_{\mathcal{A}}}(\hat x_s - \eta u_s) - P_{\Delta_{\mathcal{A}}}(\hat x_{*s} - \eta u^*_s)\| + 0 + \mathrm{dist}(\hat x_s, \mathcal{X}^*_s) \le \|\hat x_s - \hat x_{*s}\| + \eta\|u_s - u^*_s\| + \mathrm{dist}(\hat x_s, \mathcal{X}^*_s) = 2\,\mathrm{dist}(\hat x_s, \mathcal{X}^*_s) + \eta\|u_s - u^*_s\|,$$
where the first equality is by (21) and the second inequality comes from the fact that for any $a, b \in \mathbb{R}^A$, $\|P_{\Delta_{\mathcal{A}}}(a) - P_{\Delta_{\mathcal{A}}}(b)\| \le \|a - b\|$. Taking squares, summing over $s \in \mathcal{S}$ and combining with (20) yields
$$\|x^+ - \hat x\|^2 \le 8\,\mathrm{dist}^2(\hat x, \mathcal{X}^*) + 4\eta^2\Big(B\sum_{s \in \mathcal{S}}\max_{(a,b) \in \mathcal{A}\times\mathcal{B}} |Q_s(a,b) - Q^*_s(a,b)|^2 + \frac{B^2}{(1-\gamma)^2}\,\mathrm{dist}^2(y, \mathcal{Y}^*)\Big).$$
Analogously,
$$\|y^+ - \hat y\|^2 \le 8\,\mathrm{dist}^2(\hat y, \mathcal{Y}^*) + 4\eta^2\Big(A\sum_{s \in \mathcal{S}}\max_{(a,b) \in \mathcal{A}\times\mathcal{B}} |\hat Q_s(a,b) - Q^*_s(a,b)|^2 + \frac{A^2}{(1-\gamma)^2}\,\mathrm{dist}^2(x, \mathcal{X}^*)\Big).$$
Then, the result follows by summing the bounds for $\|x^+ - \hat x\|^2$ and $\|y^+ - \hat y\|^2$. □
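As a quick numerical sanity check of Lemma 1, the sketch below takes a single-state game whose Nash equilibrium is known in closed form and verifies that one projected gradient descent/ascent step leaves the equilibrium policies unchanged. The particular 3x2 payoff matrix, the step sizes and the projection routine are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    cond = u - (css - 1.0) / idx > 0
    rho = idx[cond][-1]
    theta = (css[cond][-1] - 1.0) / rho
    return np.maximum(v - theta, 0.0)

# Single-state game, so Q*_s is just the payoff matrix. The min-player's third row is
# dominated; x* = (1/2, 1/2, 0), y* = (1/2, 1/2) is a Nash equilibrium with value 1/2.
Q = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 2.0]])
x_star = np.array([0.5, 0.5, 0.0])
y_star = np.array([0.5, 0.5])

for eta in [0.01, 0.1, 1.0]:
    x_plus = proj_simplex(x_star - eta * Q @ y_star)        # one projected descent step
    y_plus = proj_simplex(y_star + eta * Q.T @ x_star)      # one projected ascent step
    assert np.allclose(x_plus, x_star) and np.allclose(y_plus, y_star)
print("NE policies are fixed points of one PGDA step, as Lemma 1 states.")
```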

C PROOF FOR LOCAL LINEAR CONVERGENCE OF OGDA

In this section, we prove the local linear convergence of OGDA (Theorem 5). For notational simplicity, we assume $T_1 = 0$ in the analysis below. Recall the OGDA algorithm (with $T_1 = 0$): the min-player and the max-player initialize
$$x^0 = \hat x^0 = \tilde x, \qquad y^0 = \hat y^0 = \tilde y, \tag{22}$$
the min-player updates for $t \ge 1$ as
$$x^t_s = P_{\Delta_{\mathcal{A}}}\big(\hat x^{t-1}_s - \eta Q^{t-1}_s y^{t-1}_s\big), \tag{23a}$$
$$\hat x^t_s = P_{\Delta_{\mathcal{A}}}\big(\hat x^{t-1}_s - \eta Q^t_s y^t_s\big), \tag{23b}$$
while the max-player updates for $t \ge 1$ as
$$y^t_s = P_{\Delta_{\mathcal{B}}}\big(\hat y^{t-1}_s + \eta (Q^{t-1}_s)^\top x^{t-1}_s\big), \tag{24a}$$
$$\hat y^t_s = P_{\Delta_{\mathcal{B}}}\big(\hat y^{t-1}_s + \eta (Q^t_s)^\top x^t_s\big). \tag{24b}$$
Here, we denote $Q^t_s = Q^{x^t, y^t}_s$ for all $t \ge 0$. The policies $x^t$ and $y^t$ are played by the min-player and the max-player at iteration $t$, while $\hat x^t, \hat y^t$ are local auxiliary variables that help generate the policies $x^t$ and $y^t$. Since we initialize $x^0 = \tilde x$, $y^0 = \tilde y$, we drop the notation $\tilde x, \tilde y$ below and directly use $x^0, y^0$ to denote the initial policies. The OGDA method is a decentralized algorithm; its decentralized implementation is in Algorithm 2 (for the min-player) and Algorithm 3 (for the max-player) of Appendix G. To prove the local linear convergence of OGDA, we first introduce some notations and auxiliary variables.

Additional notations and auxiliary variables. We use $\mathbf{1}$, $\mathbf{0}$ to denote the all-ones and all-zeros vectors or matrices, whose dimensions are determined from the context. $\mathbf{1}_i$ is the $i$-th standard basis vector of the Euclidean space, i.e., the $i$-th entry of $\mathbf{1}_i$ equals one and the other entries equal zero. The operators $>, \ge, <, \le$ are overloaded for vectors and matrices in the entry-wise sense. We denote the policy pairs $z^t = (x^t, y^t)$, $\hat z^t = (\hat x^t, \hat y^t)$ and denote the projections onto the Nash equilibrium sets by
$$\hat x^{t*}_s = P_{\mathcal{X}^*_s}(\hat x^t_s), \qquad \hat y^{t*}_s = P_{\mathcal{Y}^*_s}(\hat y^t_s), \qquad \hat z^{t*}_s = P_{\mathcal{Z}^*_s}(\hat z^t_s).$$
Since $\hat x^t, \hat y^t, \hat z^t$ are treated as concatenated vectors, we have from the elementary properties of the $\ell_2$-norm that $\hat z^{t*}_s = (\hat x^{t*}_s, \hat y^{t*}_s)$, $\hat x^{t*} = P_{\mathcal{X}^*}(\hat x^t) = \{\hat x^{t*}_s\}_{s \in \mathcal{S}}$, $\hat y^{t*} = P_{\mathcal{Y}^*}(\hat y^t) = \{\hat y^{t*}_s\}_{s \in \mathcal{S}}$, $\hat z^{t*} = P_{\mathcal{Z}^*}(\hat z^t) = \{\hat z^{t*}_s\}_{s \in \mathcal{S}}$, and $\hat z^{t*} = (\hat x^{t*}, \hat y^{t*})$. Let $\rho_0$ be the uniform distribution on $\mathcal{S}$. We denote the state visitation distributions under the policy pairs $(\hat x^{(t-1)*}, y^t)$ and $(x^t, \hat y^{(t-1)*})$ by
$$d^t_x(s) = d^{\hat x^{(t-1)*},\, y^t}_{\rho_0}(s), \qquad d^t_y(s) = d^{x^t,\, \hat y^{(t-1)*}}_{\rho_0}(s). \tag{25}$$
It follows from the definition that for any $s \in \mathcal{S}$, $\frac{1-\gamma}{S} \le d^t_x(s) \le 1$ and $\frac{1-\gamma}{S} \le d^t_y(s) \le 1$. Define the weighted sums of distances
$$\Theta^t = \sum_{s \in \mathcal{S}} d^t_x(s)\,\mathrm{dist}^2(\hat x^t_s, \mathcal{X}^*_s) + d^t_y(s)\,\mathrm{dist}^2(\hat y^t_s, \mathcal{Y}^*_s), \qquad \hat\Theta^t = \sum_{s \in \mathcal{S}} d^{t+1}_x(s)\,\mathrm{dist}^2(\hat x^t_s, \mathcal{X}^*_s) + d^{t+1}_y(s)\,\mathrm{dist}^2(\hat y^t_s, \mathcal{Y}^*_s), \tag{26}$$
and the potential functions
$$\Lambda^0 = \mathrm{dist}^2(z^0, \mathcal{Z}^*) = \mathrm{dist}^2(\tilde z, \mathcal{Z}^*), \qquad \Lambda^t = \Theta^t + \frac{1-\gamma}{4S}\,\|\hat z^t - z^t\|^2, \quad t \ge 1. \tag{27}$$
We will show the linear convergence of $\Lambda^t$ given $\mathrm{dist}^2(z^0, \mathcal{Z}^*) \le \delta_0\eta^4$ for some problem-dependent constant $\delta_0 > 0$.

C.1 ONE-STEP ANALYSIS

Our proof for local linear convergence starts from the following elementary lemma, which is derived by combining a standard analysis of optimistic gradient descent/ascent with the smoothness of Q x ,y s with respect to the policy pair (x , y ). Lemma 3 Let x t , x t , y t , y t be generated from OGDA (23), ( 24). Then, for any t ≥ 0, we have η x t+1 s -x t * s , Q t+1 s y t+1 s ≤ 1 2 x t s -x t * s 2 -x t+1 s -x t * s 2 - 1 4 x t+1 s -x t+1 s 2 - 1 2 x t+1 s -x t s 2 + 16A (A + B) η 2 (1 -γ) 4 z t+1 -z t 2 and η y t * s -y t+1 s , Q t+1 s ⊤ x t+1 s ≤ 1 2 y t s -y t * s 2 -y t+1 s -y t * s 2 - 1 4 y t+1 s -y t+1 s 2 - 1 2 y t+1 s -y t s 2 + 16B (A + B) η 2 (1 -γ) 4 z t+1 -z t 2 . ( ) Proof of Lemma 3. We abbreviate x t * = x * , x t * s = x * s in this proof. By (23b), since x t+1 s is the projection onto ∆ A , we have x * s -x t+1 s , x t+1 s -x t s + ηQ t+1 s y t+1 s ≥ 0, ∀t ≥ 0. Equivalently, η x t+1 s -x * s , Q t+1 s y t+1 s ≤ 1 2 x t s -x * s 2 -x t+1 s -x * s 2 -x t+1 s -x t s 2 . Similarly, from (23a), x t+1 s -x t+1 s , x t+1 s -x t s + ηQ t s y t s ≥ 0, ∀t ≥ 0 i.e., η x t+1 s -x t+1 s , Q t s y t s ≤ 1 2 x t+1 s -x t s 2 -x t+1 s -x t+1 s 2 -x t+1 s -x t s 2 . Then, we have η x t+1 s -x * s , Q t+1 s y t+1 s =η x t+1 s -x * s , Q t+1 s y t+1 s + η x t+1 s -x t+1 s , Q t s y t s + η x t+1 s -x t+1 s , Q t+1 s y t+1 s -Q t s y t s ≤ 1 2 x t s -x * s 2 -x t+1 s -x * s 2 -x t+1 s -x t+1 s 2 -x t+1 s -x t s 2 + η x t+1 s -x t+1 s , Q t+1 s y t+1 s -Q t s y t s ≤ 1 2 x t s -x * s 2 -x t+1 s -x * s 2 - 1 4 x t+1 s -x t+1 s 2 - 1 2 x t+1 s -x t s 2 + 4η 2 A Q t+1 s y t+1 s -Q t s y t s 2 ∞ . By Lemma 21, we have Q t+1 s y t+1 s -Q t s y t s ∞ ≤ max (a,b)∈A×B Q t s (a, b) -Q t+1 s (a, b) y t+1 s 1 + max (a,b)∈A×B Q t s (a, b) y t+1 s -y t s 1 ≤ √ A + B z t+1 -z t (1 -γ) 2 + √ B y t+1 s -y t s 1 -γ ≤ 2 √ A + B (1 -γ) 2 z t+1 -z t . Then, (28) follows by combining ( 30) with (31). And (29) follows by similar arguments. □ We consider weighted sum of ( 28) and (29) using the state visitation distribution d t x (s), d t y (s) defined in (25) as the weighting coefficients. Lemma 4 (One-Step Analysis) Let x t , x t , y t , y t be generated from OGDA with η ≤ (1-γ) 5 2 32 √ S(A+B) . Then, for any t ≥ 0, Θ t+1 + 1 -γ 4S z t+1 -z t+1 2 ≤ Θ t + 1 -γ 8S z t -z t 2 - 1 -γ 4S z t+1 -z t+1 2 + z t+1 -z t 2 . ( ) Proof of Lemma 4. Recall that ρ ρ ρ 0 be the uniform distribution on S. By Lemma 20, V x t+1 , y t * (ρ ρ ρ 0 ) -V x t * ,y t+1 (ρ ρ ρ 0 ) =V x t+1 , y t * (ρ ρ ρ 0 ) -V x t+1 ,y t+1 (ρ ρ ρ 0 ) + V x t+1 ,y t+1 (ρ ρ ρ 0 ) -V x t * ,y t+1 (ρ ρ ρ 0 ) = 1 1 -γ s∈S d t+1 x (s) x t+1 s -x t * s , Q t+1 s y t+1 s -d t+1 y (s) y t+1 s -y t * s , Q t+1 s ⊤ x t+1 s . As x t * ∈ X * , y t * ∈ Y * , by Lemma 19, x t * , y t * also attains Nash equilibrium. Thus, we have V x t+1 , y t * (ρ ρ ρ 0 ) -V x t * ,y t+1 (ρ ρ ρ 0 ) =V x t+1 , y t * (ρ ρ ρ 0 ) -V x t * , y t * (ρ ρ ρ 0 ) + V x t * , y t * (ρ ρ ρ 0 ) -V x t * ,y t+1 (ρ ρ ρ 0 ) ≥ 0. 
Substituting ( 28), ( 29) into (33) yields that η(1 -γ) V x t+1 , y t * (ρ ρ ρ 0 ) -V x t * ,y t+1 (ρ ρ ρ 0 ) ≤ 1 2 s∈S d t+1 x (s) x t s -x t * s 2 + d t+1 y (s) y t s -y t * s 2 - 1 2 s∈S d t+1 x (s) x t+1 s -x t * s 2 + d t+1 y (s) y t+1 s -y t * s 2 - 1 4 s∈S d t+1 x (s) x t+1 s -x t+1 s 2 + d t+1 y (s) y t+1 s -y t+1 s 2 - 1 2 s∈S d t+1 x (s) x t+1 s -x t s 2 + d t+1 y (s) y t+1 s -y t s 2 + 16 (A + B) η 2 (1 -γ) 4 s∈S d t+1 x (s) + d t+1 y (s) z t+1 -z t 2 By combining with the facts that x t+1 s -x t * s ≥ dist x t+1 s , X * s , d t x (s) ≥ 1-γ S , s∈S d t+1 x (s) = 1 and their counterparts for the max-player, we have η(1 -γ) V x t+1 , y t * (ρ ρ ρ 0 ) -V x t * ,y t+1 (ρ ρ ρ 0 ) ≤ 1 2 Θ t - 1 2 Θ t+1 - 1 -γ 4S z t+1 -z t+1 2 - 1 -γ 2S z t+1 -z t 2 + 64 (A + B) η 2 (1 -γ) 4 z t+1 -z t 2 + z t -z t 2 ≤ 1 2 Θ t - 1 2 Θ t+1 - 1 -γ 4S z t+1 -z t+1 2 - 1 -γ 8S z t+1 -z t 2 + 1 -γ 16S z t -z t 2 , where the last inequality is by our condition on η. By combining (34) with ( 35) and rearranging, we have Θ t+1 + 1 -γ 4S z t+1 -z t+1 2 ≤ Θ t + 1 -γ 8S z t -z t 2 - 1 -γ 4S z t+1 -z t+1 2 + z t+1 -z t 2 . □ C.2 PROGRESS OF PROJECTED GRADIENT DESCENT The following lemma is a standard step in the analysis of projected gradient descent. Lemma 5 If η ≤ 1-γ max{ √ A, √ B} , for any t ≥ 0, let ρ ρ ρ 0 be the uniform distribution on S, then η 2 s∈S V x t , † (s) -V †, y t (s) 2 ≤ 36S (1 -γ) 2 z t+1 -z t+1 2 + z t+1 -z t 2 . Proof of Lemma 5. Since x t+1 s is a projection onto ∆ A , for any x ′ s ∈ ∆ A , x t+1 s -x t s + ηQ t+1 s y t+1 s , x ′ s -x t+1 s ≥ 0, i.e., η x t+1 s -x ′ s , Q t+1 s y t+1 s ≤ x t+1 s -x t s , x ′ s -x t+1 s . Then, by combining with the condition on η, η x t+1 s -x ′ s , Q t+1 s y t+1 s ≤ η x t+1 s -x ′ s , Q t+1 s y t+1 s + η x t+1 s -x t+1 s Q t+1 s y t+1 s ≤ x t+1 s -x t s , x ′ s -x t+1 s + η √ A 1 -γ x t+1 s -x t+1 s ≤2 x t+1 s -x t s + x t+1 s -x t+1 s ≤2 x t+1 s -x t s + 3 x t+1 s -x t+1 s . For any s 0 ∈ S and x ′ ∈ X , by Lemma 20 and the fact that s∈S d x ′ ,y t+1 s0 (s) = 1, η V x t+1 ,y t+1 (s 0 ) -V x ′ ,y t+1 (s 0 ) = η 1 -γ s∈S d x ′ ,y t+1 s0 (s) x t+1 s -x ′ s , Q t+1 s y t+1 s ≤ η 1 -γ s∈S d x ′ ,y t+1 s0 (s) sup x ′′ s ∈∆ A x t+1 s -x ′′ s , Q t+1 s y t+1 s ≤ 1 1 -γ max s∈S 2 x t+1 s -x t s + 3 x t+1 s -x t+1 s ≤ 1 1 -γ 2 x t+1 -x t + 3 x t+1 -x t+1 , i.e., η V x t+1 ,y t+1 (s 0 ) -V x ′ ,y t+1 (s 0 ) ≤ 1 1 -γ 2 x t+1 -x t + 3 x t+1 -x t+1 . Similarly, η V x t+1 , † (s 0 ) -V x t+1 ,y t+1 (s 0 ) ≤ 1 1 -γ 2 y t+1 -y t + 3 y t+1 -y t+1 . ( ) By ( 93) and ( 94), we have V x t+1 , † (s 0 ) -V x t , † (s 0 ) ≤ √ A (1 -γ) 2 x t+1 -x t , V †,y t+1 (s 0 ) -V †, y t (s 0 ) ≤ √ B (1 -γ) 2 y t+1 -y t . (38) Then, by combining (36), ( 37), ( 38) and the condition on η, we have η 2 V x t , † (s 0 ) -V †, y t (s 0 ) 2 ≤ 36 (1 -γ) 2 z t+1 -z t+1 2 + z t+1 -z t 2 . The result follows by taking sum over s 0 ∈ S. □ Next, we extend Lemma 4 of Gilpin et al. (2012) and Theorem 5 of Wei et al. (2020) from matrix games to Markov games. Firstly, we prove the following auxiliary lemma, which is used in the proof of Lemma 7. This lemma is straightforward from the contraction and monotonicity of the Bellman operator, we attach its proof for completeness. Lemma 6 For policies x ∈ X and y ∈ Y, if there is a vector v ∈ R S such that for any s ∈ S ⟨x s , Q s [v]y s ⟩ ≥ v(s), then, we have that for any s ∈ S, V x ,y (s) ≥ v(s). Proof of Lemma 6. For any vector u ∈ R S , define the mapping Φ : R S → R S with Φ[u](s) = ⟨x s , Q s [u]y s ⟩ . 
Then, for any u 1 , u 2 ∈ R S , by definition, |Φ[u 1 ](s) -Φ[u 2 ](s)| ≤γ s ′ ∈S (a,b)∈A×B P(s ′ |s, a, b)x s (a)y s (b) |u 1 (s ′ ) -u 2 (s ′ )| ≤γ ∥u 1 -u 2 ∥ ∞ . Thus, we have ∥Φ[u 1 ] -Φ[u 2 ]∥ ∞ ≤ γ ∥u 1 -u 2 ∥ ∞ , i.e., Φ is a contraction mapping. Define v 1 = Φ[v] and v k+1 = Φ[v k ], . . . Then, by (39), we have ∥v k+1 -v k ∥ ∞ ≤ γ ∥v k -v k-1 ∥ ∞ ≤ γ k ∥v 1 -v∥ ∞ . Then, the limit of v k exists and we denote the limit v * = lim k→∞ v k . Obviously, v * is a fixed point of Φ because v * = lim k→∞ v k = lim k→∞ Φ[v k-1 ] = Φ[ lim k→∞ v k-1 ] = Φ[v * ]. As V x ,y (s) = ⟨x s , Q s [V x ,y ]y s ⟩, we have Φ[V x ,y ] = V x ,y , i.e., V x ,y is a fixed point of Φ. By the contraction property of Φ as in (39), its fixed point is unique. Thus, V x ,y = v * . By definition, for any u 1 , u 2 ∈ R S , if u 1 ≥ u 2 in entry-wise sense, then Φ[u 1 ] ≥ Φ[u 2 ] in entry- wise sense. Since the condition ⟨x s , Q s [v]y s ⟩ ≥ v(s) for any s ∈ S is equivalent to v 1 ≥ v in entry-wise sense. By induction, we have v k (s) is non-decreasing in k. Combining with the fact that v * = lim k→∞ v k , we have that for any s ∈ S, V x ,y (s) = v * (s) ≥ v(s).

□

The following lemma is an extension of Lemma 4 of Gilpin et al. (2012) and Theorem 5 of Wei et al. (2020) for matrix games to Markov games, it plays an important role in lower bounding the progress of gradient descent/ascent. Lemma 7 There exists a problem-dependent constant c + > 0 such that for any z = (x , y ) ∈ Z and s ∈ S, V x , † (s) -V †,y (s) ≥ c + • dist(z s , Z * s ). Proof of Lemma 7. Recall that v * (s) is the minimax game value at state s and Q * s = Q s [v * ]. For any s ∈ S, choose y s ∈ arg max y ′ s ∈∆ B ⟨x s , Q * s y ′ s ⟩ , x s ∈ arg min x ′ s ∈∆ A ⟨x ′ s , Q * s y s ⟩ . Then, by Shapley's theorem (Lemma 19), X * s × Y * s is the NE set for the matrix game min x ′ max y ′ x ′ ⊤ Q * s y ′ . ⟨x s , Q * s y s ⟩ ≥ v * (s), ⟨ x s , Q * s y s ⟩ ≤ v * (s). Then, by Lemma 22, for any s ∈ S, there exists a constant φ(Q * s ) > 0 depending only on the matrix Q * s such that ⟨x s , Q * s y s ⟩ -⟨ x s , Q * s y s ⟩ ≥ φ(Q * s ) • dist(z s , Z * s ). Define the policies x = { x s } s∈S and y = { y s } s∈S . Combining (40) with Lemma 6 yields that for any s ∈ S, V x , y (s) ≥ v * (s), V x ,y (s) ≤ v * (s). Then, by definition, in entry-wise sense, Q x , y s = Q s [V x , y ] ≥ Q s [v * ]. By combining the above equations, we have for any s ∈ S, V x , † (s) -V †,y (s) ≥ V x , y (s) -V x ,y (s) = x s , Q x , y s y s -x s , Q x ,y s y s = x s , Q s [V x , y ] y s -x s , Q s [V x ,y ]y s ≥ ⟨x s , Q s [v * ] y s ⟩ -⟨ x s , Q s [v * ]y s ⟩ = ⟨x s , Q * s y s ⟩ -⟨ x s , Q * s y s ⟩ ≥φ(Q * s ) • dist(z s , Z * s ) , where the second last inequality is by (40), the last inequality is by (41). Then, the proof is completed by (1). □ By combining Lemma 5 and Lemma 7, we provide lower bound for the progress of projected gradient descent (PGD). Lemma 8 (Progress of PGD) Let z t , z t t≥0 be generated from OGDA with η ≤ 1γ, then for any t ≥ 0, we have z t+1 -z t+1 2 + z t+1 -z t 2 ≥ (1 -γ) 2 η 2 c 2 + 36S Θ t . Proof of Lemma 8. By Lemma 5 and Lemma 7, we have z t+1 -z t+1 2 + z t+1 -z t 2 ≥η 2 s∈S V x t , † (s) -V †, y t (s) 2 ≥ (1 -γ) 2 η 2 c 2 + 36S dist 2 ( z t , Z * ) ≥ (1 -γ) 2 η 2 c 2 + 36S Θ t , where the last inequality above comes from the fact that d t x (s) ≤ 1, d t y (s) ≤ 1 for any s ∈ S. □

C.3 STABILITY OF STATE VISITATION DISTRIBUTION NEAR THE NASH EQUILIBRIUM SET

The main motivation behind the proofs in this section is Lemma 1, which shows that projected gradient descent is very "stable" on the NE set. The following lemma is a perturbed version of Lemma 1. It is extensively used in the proof of Lemma 10. Its proof follows by Lemma 2 and Lemma 21 with a simplification of coefficients. Lemma 9 For any z = (x , y ) ∈ Z and z = ( x , y ) ∈ Z, let x + , y + be the policy after one step of projected policy gradient descent/ascent with sepsize η > 0, i.e., for s ∈ S x + s = P ∆ A ( x s -ηQ x ,y s y s ) , y + s = P ∆ B y s + η (Q x ,y s ) ⊤ x s . Let z + = (x + , y + ), then, z + -z 2 ≤ 8dist 2 ( z , Z * ) + 8S (A + B) 2 η 2 (1 -γ) 4 dist 2 (z , Z * ) . Proof of Lemma 9. Denote x * = P X * (x ), y * = P Y * (y ) and z * = (x * , y * ). By Lemma 19, (x * , y * ) attains Nash equilibrium and Q x * ,y * s = Q * s . By (91), we have max (a,b)∈A×B |Q x ,y s (a, b) -Q * s (a, b)| ≤ √ A + B ∥z -z * ∥ (1 -γ) 2 . Then, by combining with Lemma 2, we have z + -z 2 ≤ 8dist 2 ( z , Z * ) + 8S (A + B) 2 η 2 (1 -γ) 4 dist 2 (z , Z * ) .

□

The following lemma uses Lemma 9 to show that when Λ t is close to 0, z t+1z t , z t+1z t will be small, which implies the difference between Θ t and Θ t will also be small. Lemma 10 Consider the sequence z t , z t generated from OGDA with stepsize η ≤ (1-γ) 2 2 √ 2S(A+B) . There is a problem-dependent constant δ 1 = O (1-γ) 5 S 3 (A+B) > 0 such that for any τ > 0 and t ≥ 1, if Λ t-1 ≤ τ 2 δ 1 , we have Θ t -Θ t ≤ τ Θ t . Proof of Lemma 10. By the condition on η, we have 8S(A+B) 2 η 2 (1-γ) 4 ≤ 1. Denote c ′ = S 1-γ , c ′′ = 8S 1-γ and define the problem-dependent constant δ 1 = (1 -γ) 4 S 2 (A + B)(1704c ′ + 226c ′′ ) = O (1 -γ) 5 S 3 (A + B) . ( ) We also denote δ = τ 2 δ 1 below. The positive constants c 1 , c 2 , • • • , c 7 below are all polynomials in S, A, B, 1/(1γ), the definition for each of them follows from the line it first occurs. Since d t x (s), d t y (s) ≥ 1-γ S = c ′ , the condition Λ t-1 ≤ τ 2 δ 1 = δ implies that dist 2 z t-1 , Z * ≤ c ′ δ, z t-1 -z t-1 2 ≤ c ′′ δ. Then, dist 2 z t-1 , Z * ≤ 2dist 2 z t-1 , Z * + 2 z t-1 -z t-1 2 ≤ 2(c ′ + c ′′ )δ. By applying Lemma 9 with z := z t-1 , z := z t-1 , we have z t -z t-1 2 ≤ (8c ′ + 2(c ′ + c ′′ )) δ def = c 1 δ. ( ) Thus, dist 2 z t , Z * ≤ 2dist 2 z t-1 , Z * + 2 z t -z t-1 2 ≤ (2c ′ + 2c 1 ) δ def = c 2 δ. By setting z := z t-1 , z := z t in Lemma 9, we have z t -z t-1 2 ≤ (8c ′ + c 2 ) δ def = c 3 δ. ( ) Therefore, dist 2 z t , Z * ≤ 2dist 2 z t-1 , Z * + 2 z t -z t-1 2 ≤ (2c ′ + 2c 3 ) δ def = c 4 δ. Again, utilize Lemma 9 with z := z t , z := z t , we have z t+1 -z t 2 ≤ (8c 4 + c 2 ) δ def = c 5 δ. Thus, z t+1 -z t 2 ≤3 z t+1 -z t 2 + z t -z t-1 2 + z t-1 -z t 2 ≤3 (c 5 + c 3 + c 1 ) δ def = c 6 δ. Now we can bound z t -z t-1 2 + z t+1 -z t 2 ≤ (c 3 + c 6 ) δ def = c 7 δ. ( ) Since X * s is a convex set, the projection onto it is non-expansive, i.e., x t * sx (t-1) * s = P X * s x t s -P X * s x t-1 s ≤ x t s -x t-1 s . Then, x t * , y t+1 -x (t-1) * , y t 2 ≤ x t * -x (t-1) * 2 + y t+1 -y t 2 ≤ x t -x t-1 2 + y t+1 -y t 2 ≤ z t -z t-1 2 + z t+1 -z t 2 ≤ c 7 δ. Analogously, x t+1 , y t * -x t , y (t-1) * 2 ≤ z t -z t-1 2 + z t+1 -z t 2 ≤ c 7 δ. By Lemma 21 and ( 25), for any s ∈ S, d t+1 x (s) -d t x (s) = d x t * ,y t+1 ρ ρ ρ0 (s) -d x (t-1) * ,y t ρ ρ ρ0 (s) ≤ √ A + B x t * , y t+1 -x (t-1) * , y t 1 -γ ≤ (A + B) c 7 δ 1 -γ . ( ) Similarly, we also have for any s ∈ S, d t+1 y (s) -d t y (s) ≤ (A + B) c 7 δ 1 -γ . ( ) What remains is to bound the term √ (A+B)c7δ 1-γ on the RHS of ( 50) and ( 51). Using ( 43)-( 49), we have • (by (43)) c 1 = 10c ′ + 2c ′′ • (by (44)) c 2 = 22c ′ + 4c ′′ • (by (45)) c 3 = 30c ′ + 4c ′′ • (by (46)) c 4 = 62c ′ + 8c ′′ • (by (47)) c 5 = 518c ′ + 68c ′′ • (by (48)) c 6 = 1674c ′ + 222c ′′ • (by (49)) c 7 = 1704c ′ + 226c ′′ By the definition of δ 1 in (42) and our notation δ = τ 2 δ 1 , we have (A + B) c 7 δ 1 -γ = τ (1 -γ) S . Then, by combining with (50), we have d t x (s) -d t+1 x (s) ≤ τ (1-γ)

/S. By combining with the fact that $d^t_x(s) \ge \frac{1-\gamma}{S}$, we have $d^t_x(s) - d^{t+1}_x(s) \le \tau\, d^t_x(s)$. Analogously, for any $s \in \mathcal{S}$, $d^t_y(s) - d^{t+1}_y(s) \le \tau\, d^t_y(s)$. Then, the result follows from the definitions of $\Theta^t$ and $\hat\Theta^t$ in (26). □

C.4 PROOF OF THEOREM 5

By Lemma 8, we have 1 -γ 4S z t+1 -z t+1 2 + z t+1 -z t 2 ≥ c 8 η 2 Θ t , where c 8 = c 2 + (1 -γ) 3 144S 2 . ( ) Then, combining (52) with Lemma 4 and the definitions of Λ t , Θ t , Θ t in ( 27), ( 26) yield that for any t ≥ 1, Λ t+1 ≤Λ t + Θ t -Θ t - 1 -γ 8S z t -z t 2 - 1 -γ 4S z t+1 -z t 2 + z t -z t 2 ≤Λ t + Θ t -Θ t - 1 -γ 8S z t -z t 2 -c 8 η 2 Θ t . Define c 0 = min (1 -γ)c 8 S , c 8 2 , 1 2 > 0. ( ) Recall the problem-dependent constant δ 1 > 0 defined in Lemma 10, we define δ 0 = c 2 0 δ 1 > 0. Now, we prove Λ t+1 ≤ (1c 0 ) t Λ t by induction. For the case t = 0, firstly, by the definitions of Λ 0 , Θ 0 , Θ 0 in ( 27), ( 26) and the fact that 1-γ S ≤ d t x (s) ≤ 1, 1-γ S ≤ d t y (s) ≤ 1, we have Λ 0 ≥ Θ 0 , Θ 0 ≥ 1 -γ S Λ 0 . Then, by combining with Lemma 4, ( 52) and the fact that z 0 = z 0 , we have Λ 1 ≤ Λ 0 -c 8 η 2 Θ 0 ≤ 1 - (1 -γ)c 8 η 2 S Λ 0 ≤ 1 -c 0 η 2 Λ 0 . If we have shown Λ j+1 ≤ (1c 0 ) j Λ j for j = 0, • • • , t -1, we next prove it for t. By induction hypothesis, Λ t-1 ≤ Λ 0 ≤ δ 0 η 4 = (c 0 η 2 ) 2 δ 1 . By Lemma 10, Θ t ≤ 1 + c 0 η 2 Θ t . Then, by combining with (54) and the fact that c 0 ≤ c 8 /2, c 0 ≤ 1/2 from the definition of c 0 in (55), we have Λ t+1 ≤Λ t + c 0 η 2 Θ t - 1 -γ 16S z t -z t 2 -2c 0 η 2 Θ t ≤Λ t -min c 0 η 2 , 1 2 Θ t + 1 -γ 8S z t -z t 2 = 1 -c 0 η 2 Λ t . By induction, we have for any t ≥ 0, Λ t ≤ 1 -c 0 η 2 t Λ 0 . Using the fact that 1-γ S ≤ d t x (s) ≤ 1, 1-γ S ≤ d t y (s) ≤ 1 and the definition of Λ t in (27), we have dist 2 z t , Z * ≤ 2 dist 2 z t , Z * + z t -z t 2 ≤ 8S 1 -γ Λ t ≤ 8S 1 -γ 1 -c 0 η 2 t Λ 0 ≤ 8S 1 -γ 1 -c 0 η 2 t dist 2 z 0 , Z * = 8S 1 -γ 1 -c 0 η 2 t dist 2 ( z , Z * ) , where z = ( x , y ) is the initial policy pair ( 22). This completes the proof for local linear convergence of OGDA. As for the order of c 0 and δ 0 , by ( 53) and ( 55), c 0 = O (1 -γ) 4 c 2 + S 3 . ( ) By Lemma 10, δ 1 = O (1-γ) 5 S 3 (A+B) . Then, by ( 57) and ( 56), δ 0 = O (1 -γ) 13 c 4 + S 9 (A + B) . ( ) Since we need η ≤ O( (1-γ) √ S(A+B) ), we have the linear convergence rate 1 -c 0 η 2 = 1 -O (1 -γ) 9 c 2 + S 4 (A + B) 2 and to have linear convergence, dist(z T1 , Z * ) needs to satisfy dist(z T1 , Z * ) ≤ δ 0 η 4 = O (1 -γ) 23 2 c 2 + S 11 2 (A + B) .

D PROOFS FOR GLOBAL CONVERGENCE AND GEOMETRIC BOUNDEDNESS OF AVERAGING OGDA

In this section, we prove that the Averaging OGDA method introduced in (6) of Section 3.2 can serve as Global-Slow in the meta algorithm Homotopy-PO. The proof of global convergence (Theorem 3) is in Appendix D.1, and the proof of geometric boundedness (Theorem 4) is in Appendix D.2. To begin with, let us recall the Averaging OGDA method. The min-player initializes x^{T_1} = x̂^{T_1} = x̃ and V^{T_1}(s) = V^{†,ỹ}(s) = V^{†,y^{T_1}}(s), while the max-player initializes y^{T_1} = ŷ^{T_1} = ỹ and V̄^{T_1}(s) = V^{x̃,†}(s) = V^{x^{T_1},†}(s). For t > T_1, the min-player updates
V^t(s) = min_{a∈A} Σ_{j=T_1}^{t-1} α^{j-T_1+1}_{t-T_1} q^j_s(a),
x^t_s = P_{∆_A}( x̂^{t-1}_s - η q^{t-1}_s ),
x̂^t_s = P_{∆_A}( x̂^{t-1}_s - η q^t_s ),
where q^t_s = Q_s[V^t] y^t_s and Q_s[•] is the Bellman target operator defined in the introduction. Meanwhile, the max-player updates for t > T_1
V̄^t(s) = max_{b∈B} Σ_{j=T_1}^{t-1} α^{j-T_1+1}_{t-T_1} q̄^j_s(b),
y^t_s = P_{∆_B}( ŷ^{t-1}_s + η q̄^{t-1}_s ),
ŷ^t_s = P_{∆_B}( ŷ^{t-1}_s + η q̄^t_s ),
where q̄^t_s = Q_s[V̄^t]^⊤ x^t_s. At the end of iteration T_2, the min-player and the max-player compute the average policies x̄^{[T_1:T_2]} = Σ_{t=T_1}^{T_2} α^{t-T_1+1}_{T_2-T_1+1} x^t and ȳ^{[T_1:T_2]} = Σ_{t=T_1}^{T_2} α^{t-T_1+1}_{T_2-T_1+1} y^t, respectively. The min-player plays policy x^t and the max-player plays policy y^t at iteration t; the variables x̂^t, V^t and ŷ^t, V̄^t are local auxiliary variables used to generate the policies x^t and y^t. Averaging OGDA is a decentralized algorithm; its decentralized implementation is given in Algorithm 4 (for the min-player) and Algorithm 5 (for the max-player) of Appendix G.
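The value estimate V^t(s) = min_a Σ_j α^{j-T_1+1}_{t-T_1} q^j_s(a) need not be recomputed from scratch at every step: the weighted sum of past q's can be maintained online by the recursion S_t = (1 - α_t) S_{t-1} + α_t q_t. The short sketch below is our own illustration, with a generic sequence of q's and an arbitrary H; it checks that the recursion reproduces the explicit weights α_t^j = α_j Π_{k=j+1}^t (1 - α_k).

```python
import numpy as np

H = 4                                    # arbitrary choice for illustration
alpha = lambda t: (H + 1.0) / (H + t)    # alpha_t = (H + 1) / (H + t)

def explicit_weights(t):
    """alpha_t^j = alpha_j * prod_{k=j+1}^t (1 - alpha_k), for j = 1, ..., t."""
    return np.array([alpha(j) * np.prod([1.0 - alpha(k) for k in range(j + 1, t + 1)])
                     for j in range(1, t + 1)])

rng = np.random.default_rng(0)
T, A = 30, 3
q = rng.uniform(size=(T, A))             # a generic sequence of q_s^j vectors
S_run = np.zeros(A)
for t in range(1, T + 1):                # online recursion S_t = (1 - alpha_t) S_{t-1} + alpha_t q_t
    S_run = (1.0 - alpha(t)) * S_run + alpha(t) * q[t - 1]

print(np.abs(S_run - explicit_weights(T) @ q).max())   # ~ 0: recursion matches the weights
print(S_run.min())                                      # corresponding estimate min_a sum_j alpha_T^j q^j(a)
```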

D.1 GLOBAL CONVERGENCE RATE OF AVERAGING OGDA

Our task in this section is to prove the global convergence of Averaging OGDA (Theorem 3). To this end, we need to bound dist 2 z [T1:T2] , Z * by O(log(T 2 -T 1 )/(T 2 -T 1 )). Our roadmap can be depicted as follows: dist 2 z [T1:T2] , Z * Lemma 11 ← V t -V t ∞ Lemma 12 ← Reg T1:t Lemma 14 ≤ O(1/(T 2 -T 1 )) The regrets above are defined as Reg T1:t x (s) = min x ′ s ∈∆ A t j=T1 α j-T1+1 t-T1+1 x ′ s -x j s , Q s [V j ]y j s , Reg T1:t y (s) = max y ′ s ∈∆ B t j=T1 α j-T1+1 t-T1+1 x j s , Q s [V j ] y ′ s -y j s , Reg T1:t = max s∈S Reg T1:t y (s) -Reg T1:t x (s) . More specifically, we bound the distance dist( z [T1:T2] , Z * ) in the following steps: 1. (Lemma 11) bounding dist( z [T1:T2] , Z * ) by O( V T2+1 -V T2+1 ∞ ): dist z [T1:T2] , Z * ≤ O V T2+1 -V T2+1 ∞ 2. (Lemma 12) bounding O( V T2+1 -V T2+1 ∞ ) by regrets: V T2+1 -V T2+1 ∞ ≤ Reg T1:T2 + O( 1 T 2 -T 1 ) • ( T2 t=T1 Reg T1:t + ∥V T1 -V T1 ∥ ∞ ) 3. (Lemma 14) bounding the regrets: Reg T1:t ≤ O 1 η(t -T 1 ) The following fact about α j t can be found in Section 4 of Jin et al. (2018) . It will be used extensively in our proofs below. Fact 1 The stepsize α j t satisfy: (i) ∞ t=j α j t = 1 + 1 H , ∀ t ≥ 1. (ii) t j=1 α j t = 1, ∀ t ≥ 1. (iii) α j t ≤ α t and α j t+1 ≤ α j t , ∀ t ≥ 1, 1 ≤ j ≤ t. Firstly, we show that the local auxiliary variables V t (s), V t (s) are lower and upper bounds for v * (s). Then, to bound V t -v * ∞ and V t -v * ∞ , it suffices to bound V t -V t ∞ . Fact 2 For any t ∈ [T 1 : T 2 ] and s ∈ S, 0 ≤ V t (s) ≤ v * (s) ≤ V t (s) ≤ 1 1 -γ , 0 ≤ min a∈A q t s (a) ≤ v * (s) ≤ max b∈B q t s (b) ≤ 1 1 -γ . Proof of Fact 2. By (61a), we have V T1 (s) = V †,y T 1 (s) ≤ v * (s). By the definition of q t s in (62), min a∈A q t s (a) = min x ′ s ∈∆ A x ′ s , Q s [V t ]y t s . Recall that by Lemma 19, v * (s) = min x s max y s ⟨x s , Q * s y s ⟩ and Q * s = Q s [v * ]. Suppose V j (s) ≤ v * (s) for any s ∈ S and j ∈ [T 1 : t], then we have min a∈A q j s (a) = min x ′ s ∈∆ A x ′ s , Q s [V j ]y j s ≤ min x ′ s ∈∆ A x ′ s , Q s [v * ]y t s ≤ min x ′ s max y ′ s ⟨x ′ s , Q * s y ′ s ⟩ = v * (s), which leads to V t+1 (s) ≤ v * (s) for any s ∈ S. Then, it follows by induction that V t (s) ≤ v * (s), min a∈A q t s (a) ≤ v * (s) for any t ∈ [T 1 : T 2 ] and s ∈ S. Analogously, V t (s) ≥ v * (s), max b∈B q t s (a) ≥ v * (s) for any t ∈ [T 1 : T 2 ] and s ∈ S. It also follows by induction directly that the value of V t (s), V t (s), min a∈A q t s (a), max b∈B q t s (a) stays in [0, 1 1-γ ] . □ The following lemma shows that to bound dist 2 z [T1:T2] , Z * , it suffices to bound V t -V t ∞ . Lemma 11 There is a problem-dependent constant C = √ S c+ > 0 such that the average policy z [T1:T2] = x [T1:T2] , y [T1:T2] satisfies dist z [T1:T2] , Z * ≤ C • V T2+1 -V T2+1 ∞ . Proof of Lemma 11. Recall that Q * s = Q s [v * ] . By (62) and Fact 2, min a∈A T2 t=T1 α t-T1+1 T2-T1 q t s (a) = min x ′ s ∈∆ A T2 t=T1 α t-T1+1 T2-T1 x ′ s , Q s [V t ]y t s ≤ min x ′ s ∈∆ A T2 t=T1 α t-T1+1 T2-T1 x ′ s , Q * s y t s . Analogously, max b∈B T2 t=T1 α t-T1+1 T2-T1 q t s (b) = max y ′ s ∈∆ B T2 t=T1 α t-T1+1 T2-T1 x t s , Q s [V t ]y ′ s ≥ max y ′ s ∈∆ B T2 t=T1 α t-T1+1 T2-T1 x t s , Q * s y ′ s . Thus, V T2+1 (s) -V T2+1 (s) = max b∈B T2 t=T1 α t-T1+1 T2-T1 q t s (b) -min a∈A T2 t=T1 α t-T1+1 T2-T1 q t s (a) ≥ max y ′ s ∈∆ B T2 t=T1 α t-T1+1 T2-T1 x t s , Q * s y ′ s -min x ′ s ∈∆ A T2 t=T1 α t-T1+1 T2-T1 x ′ s , Q * s y t s = max y ′ s ∈∆ B x [T1:T2] s , Q * s y ′ s -min x ′ s ∈∆ A x ′ s , Q * s y [T1:T2] s . 
By (1), max y ′ s ∈∆ B x [T1:T2] s , Q * s y ′ s -min x ′ s ∈∆ A x ′ s , Q * s y [T1:T2] s ≥ c + • dist z [T1:T2] s , Z * s . The positiveness of c + is guaranteed by Lemma 22. Let C = √ S c+ , then, V T2+1 -V T2+1 ∞ ≥ max s∈S max y ′ s ∈∆ B x [T1:T2] , Q * s y ′ s -min x ′ s ∈∆ A x ′ s , Q * s y [T1:T2] ≥ max s∈S c + • dist z [T1:T2] s , Z * s ≥ 1 C • dist z [T1:T2] , Z * .

□

The following lemma mainly uses Fact 1 (i) and an induction argument to show that ∥V̄^{T_2+1} - V^{T_2+1}∥_∞ can be bounded by a weighted sum of the regrets.

Lemma 12

The value functions V T2+1 , V T2+1 satisfies V T2+1 -V T2+1 ∞ ≤ Reg T1:T2 + 2γ (H + 1) (1 -γ) (T 2 -T 1 + 1) ( T2 t=T1 Reg T1:t + ∥V T1 -V T1 ∥ ∞ ). Proof of Lemma 12. By Fact 2 and the definition of the operator Q s [•], we have max (a,b)∈A×B Q s [V t ](a, b) -Q s [V t ](a, b) ≤ γ V t -V t ∞ . ( ) The following relation follows by definitions of V t in (61a) and q j s in (62), V t (s) = min a∈A t j=T1 α j-T1+1 t-T1+1 q j s (a) = min x ′ s ∈∆ A t j=T1 α j-T1+1 t-T1+1 x ′ s , Q s [V j ]y j s . Analogously, V t (s) = max b∈B t j=T1 α j-T1+1 t-T1+1 q j s (b) = max y ′ s ∈∆ B t j=T1 α j-T1+1 t-T1+1 x j s , Q s [V j ]y ′ s . Summing up the above two equations yields that V t+1 (s) -V t+1 (s) = max y ′ s ∈∆ B t j=T1 α j-T1+1 t-T1+1 x j s , Q s [V j ]y ′ s -min x ′ s ∈∆ A t j=T1 α j-T1+1 t-T1+1 x ′ s , Q s [V j ]y j s ≤ max y ′ s ∈∆ B t j=T1 α j-T1+1 t-T1+1 x j s , Q s [V j ] y ′ s -y j s -min x ′ s ∈∆ A t j=T1 α j-T1+1 t-T1+1 x ′ s -x j s , Q s [V j ]y j s + t j=T1 α j-T1+1 t-T1+1 x j s , Q s [V j ] -Q s [V j ] y j s ≤Reg T1:t y (s) -Reg T1:t x (s) + γ t j=T1 α j-T1+1 t-T1+1 V j -V j ∞ , where the last inequality is by (64). Thus, V t+1 -V t+1 ∞ ≤ Reg T1:t + γ t j=T1 α j-T1+1 t-T1+1 V j -V j ∞ . Taking sum on both sides of the above equation and combining with Fact 1 (i) yield that T2 t=T1 V t+1 -V t+1 ∞ ≤ T2 t=T1 Reg T1:t + γ T2 t=T1 t j=T1 α j-T1+1 t-T1+1 V j -V j ∞ ≤ T2 t=T1 Reg T1:t + γ T2 j=T1 T2 t=j α j-T1+1 t-T1+1 V j -V j ∞ ≤ T2 t=T1 Reg T1:t + γ T2 j=T1 1 + 1 H V j -V j ∞ ≤ T2 t=T1 Reg T1:t + γ 1 + 1 H T2 j=T1 V j -V j ∞ ≤ T2 t=T1 Reg T1:t + 2γ 1 + γ T2 j=T1 V j -V j ∞ , where the last inequality is from the fact that H = 1+γ 1-γ . After rearranging, we have T2 t=T1 V t+1 -V t+1 ∞ ≤ 1 + γ 1 -γ T2 t=T1 Reg T1:t + 2γ 1 + γ V T1 -V T1 ∞ . ( ) Since α j-T1+1 T2-T1+1 ≤ α T2-T1+1 ≤ H+1 T2-T1+1 for any j ∈ [T 1 : t], by setting t := T 2 in (65) and substituting (66), we have V T2+1 -V T2+1 ∞ ≤ Reg T1:T2 + γ H + 1 T 2 -T 1 + 1 T2 j=T1 V j -V j ∞ ≤Reg T1:T2 + γ H + 1 T 2 -T 1 + 1 • 1 + γ 1 -γ T2 t=T1 Reg T1:t + 2γ 1 + γ V T1 -V T1 ∞ + V T1 -V T1 ∞ ≤Reg T1:T2 + 2γ (H + 1) (1 -γ) (T 2 -T 1 + 1) T2 t=T1 Reg T1:t + V T1 -V T1 ∞ .

□

The next lemma is used to derive Lemma 14. Lemma 13 For any t ∈ [T 1 : T 2 -1] and s ∈ S, q t s -q t+1 s 2 ≤ 8Bγ 2 (α t-T1+1 ) 2 (1 -γ) 2 + 2B 2 (1 -γ) 2 y t s -y t+1 s 2 q t s -q t+1 s 2 ≤ 8Aγ 2 (α t-T1+1 ) 2 (1 -γ) 2 + 2A 2 (1 -γ) 2 x t s -x t+1 s 2 . Proof of Lemma 13. By (62) and Fact 2, we have q t s -q t+1 s 2 ≤2B max (a,b)∈A×B Q s [V t ](a, b) -Q s [V t+1 ](a, b) 2 y t s 2 1 + 2B 2 max (a,b)∈A×B Q s [V t+1 ] 2 y t s -y t+1 s 2 ≤2Bγ 2 V t -V t+1 2 ∞ + 2B 2 (1 -γ) 2 y t s -y t+1 s 2 . ( ) By Fact 2, q t s ∞ ≤ 1 1-γ . Then, by the definition of V t in (61a), for any s ∈ S, V t+1 (s) -V t (s) ≤ t j=T1 α j-T1+1 t-T1+1 q j s - t-1 j=T1 α j-T1+1 t-T1 q j s ∞ ≤α t-T1+1 t-T1+1 q t+1 s ∞ + t-1 j=T1 α j-T1+1 t-T1 -α j-T1+1 t-T1+1 q j s ∞ ≤ 1 1 -γ (α t-T1+1 + 1 -(1 -α t-T1+1 )) ≤ 2α t-T1+1 1 -γ , where the third inequality uses the facts that t j ′ =1 α j ′ t = 1 and α j t+1 ≤ α j t , α j t ≤ α t for any 1 ≤ j ≤ t. Thus, V t+1 -V t ∞ ≤ 2α t-T1+1 1 -γ . ( ) By substituting ( 68) into (67), we have q t s -q t+1 s 2 ≤ 8Bγ 2 (α t-T1+1 ) 2 (1 -γ) 2 + 2B 2 (1 -γ) 2 y t s -y t+1 s 2 . The bound for q t sq t+1 s 2 follows analogously. □ We bound the regrets in the following lemma. Its proof is mainly from combining standard analysis in RVU property (see for instance Rakhlin & Sridharan (2013) ; Syrgkanis et al. (2015) ) and with Lemma 13. Lemma 14 For any t ∈ [T 1 : T 2 ], if η ≤ 1-γ 8 √ 2 max{A,B} , we have Reg T1:t ≤ 136 (A + B) H η (1 -γ) 2 α t-T1+1 . Proof of Lemma 14. Choose an arbitrary point x * s from ∆ A . Since x t+1 ≤ 1 2 x t s -x * s 2 -x t+1 s -x * s 2 -x t+1 s -x t+1 s 2 -x t+1 s -x t s 2 + η x t+1 s -x t+1 s , q t+1 s -q t s ≤ 1 2 x t s -x * s 2 -x t+1 s -x * s 2 + ∆ t+1 x , where ∆ t+1 x = - 1 4 x t+1 s -x t+1 s 2 - 1 2 x t+1 s -x t s 2 + 4η 2 q t+1 s -q t s 2 . By taking sum on both sides of the above equation, we have η T2 t=T1 α t-T1+1 T2-T1+1 x t s -x * s , q t s ≤ α 1 T2-T1+1 2 x T1 s -x * s 1 q T1 s ∞ + α 2 T2-T1+1 2 x T1 -x * s 2 + T2-1 t=T1+1 α t-T1+2 T2-T1+1 -α t-T1+1 T2-T1+1 2 x t -x * s 2 + T2-1 t=T1 α t-T1+2 T2-T1+1 ∆ t+1 x ≤ α 1 T2-T1+1 1 -γ + α 2 T2-T1+1 + T2-1 t=T1+1 α t-T1+2 T2-T1+1 -α t-T1+1 T2-T1+1 + T2-1 t=T1 α t-T1+2 T2-T1+1 ∆ t+1 x ≤ α T2-T1+1 1 -γ + 2α T2-T1+1 + T2-1 t=T1 α t-T1+2 T2-T1+1 ∆ t+1 x . Analogously, for any y * s ∈ ∆ B , η T2 t=T1 α t-T1+1 T2-T1+1 y t s -y * s , q t s ≤ α T2-T1+1 1 -γ + 2α T2-T1+1 + T2-1 t=T1 α t-T1+2 T2-T1+1 ∆ t+1 y , where ∆ t+1 y = - 1 4 y t+1 s -y t+1 s 2 - 1 2 y t+1 s -y t s 2 + 4η 2 q t+1 s -q t s 2 . Since H ≥ 1, we have α t-T1+2 T2-T1+1 /α t-T1+1 T2-T1+1 ≤ 2. Then, by combining with the condition on η and the fact that x t+1x t 2 ≤ 2 x t+1x t 2 + 2 x tx t

2

, we have - α t-T1+2 T2-T1+1 2 x t+1 -x t 2 - α t-T1+1 T2-T1+1 4 x t -x t 2 + 8α t-T1+2 T2-T1+1 max A 2 , B 2 η 2 (1 -γ) 2 x t+1 -x t 2 ≤ - α t-T1+2 T2-T1+1 16 -2 x t+1 -x t 2 -2 x t -x t 2 + x t+1 -x t 2 ≤ 0. Then, by combining the definitions of ∆ t+1 x and ∆ t+1 y with Lemma 13, we have T2-1 t=T1 α t-T1+2 T2-T1+1 ∆ t+1 x + ∆ t+1 y ≤ 8α T2-T1+1 max A 2 , B 2 η 2 (1 -γ) 2 x T1+1 -x T1 2 + y T1+1 -y T1 2 + T2-1 t=T1 α t-T1+2 T2-T1+1 32 (A + B) γ 2 (α t-T1+1 ) 2 (1 -γ) 2 ≤2α T2-T1+1 + 32 (A + B) γ 2 (1 -γ) 2 T2-1 t=T1 α T2-T1+1 H + 1 H + t -T 1 + 1 2 ≤ 2 + 32 (A + B) γ 2 (1 -γ) 2 • (H + 1) 2 H α T2-T1+1 , where the first inequality also uses (71) and the max-player's counterpart of (71), the second inequality is by the condition on η and Fact 1. By combining (69), ( 70), (72), Reg T1:T2 ≤ 1 η 2 1 -γ + 6 + 32 (A + B) γ 2 (1 -γ) 2 • (H + 1) 2 H α T2-T1+1 ≤ 136 (A + B) H η (1 -γ) 2 α T2-T1+1 . The bound of Reg T1:t for t ∈ [T 1 : T 2 ] follows by similar arguments. □ Now, we can prove the global convergence of Averaging OGDA (Theorem 3) by combining Lemma 11, Lemma 12 and Lemma 14. Proof of Theorem 3. By Lemma 12, Lemma 14, we have V T2+1 -V T2+1 ∞ ≤ 136 (A + B) H η (1 -γ) 2 α T2-T1+1 + 2γ (H + 1) (1 -γ) (T 2 -T 1 + 1) T2 t=T1 136 (A + B) H η (1 -γ) 2 α t-T1+1 + V T1 -V T1 ∞ . Since T2 t=T1 α t-T1+1 ≤ (H+1) log(T2-T1+1) T2-T1+1 , we have V T2+1 -V T2+1 ∞ ≤ 408(H + 1) 3 (A + B) log (T 2 -T 1 + 1) η (1 -γ) 3 (T 2 -T 1 + 1) + 2γ (H + 1) (1 -γ) 2 (T 2 -T 1 + 1) . By Lemma 11, we have dist z [T1:T2] , Z * ≤ C • 408(H + 1) 3 (A + B) log(T 2 -T 1 + 1) η(1 -γ) 3 (T 2 -T 1 + 1) + 2γ (H + 1) (1 -γ) 2 (T 2 -T 1 + 1) ≤ C ′ log(T 2 -T 1 + 1) η(T 2 -T 1 + 1) , where C ′ = 3280 C(A + B) (1 -γ) 6 = 3280 √ S(A + B) c + (1 -γ) 6 . □ Remark 1 The initialization V T1 = V †,y T 1 and V T1 = V x T 1 , † is only used to show the geometric boundedness in Theorem 4. When Averaging OGDA is used independently rather than called in Homotopy-PO (Algorithm 1), we can simply choose V T1 (s) = 0 and V T1 = 1 1-γ for any s ∈ S. The global convergence rate in Theorem 3 still holds.
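The weight properties in Fact 1, which the bounds in this subsection use repeatedly, can be checked numerically. The following small sketch is our own, with H = 4 as an arbitrary choice; it verifies properties (i) and (ii).

```python
import numpy as np

H = 4                                    # arbitrary illustrative choice
alpha = lambda t: (H + 1.0) / (H + t)

def alpha_tj(t, j):
    """alpha_t^j = alpha_j * prod_{k=j+1}^t (1 - alpha_k)."""
    return alpha(j) * np.prod([1.0 - alpha(k) for k in range(j + 1, t + 1)])

# Fact 1 (ii): sum_{j=1}^t alpha_t^j = 1 for every t (exact, since alpha_1 = 1).
print(max(abs(sum(alpha_tj(t, j) for j in range(1, t + 1)) - 1.0) for t in range(1, 60)))

# Fact 1 (i): sum_{t >= j} alpha_t^j = 1 + 1/H (approximated by a long finite sum).
j = 3
print(sum(alpha_tj(t, j) for t in range(j, 5000)), 1.0 + 1.0 / H)
```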

D.2 GEOMETRIC BOUNDEDNESS OF AVERAGING OGDA

In this section, we prove the geometric boundedness of Averaging OGDA (Theorem 4). The geometric boundedness of averaging OGDA essentially relies on the stability of projected gradient descent/ascent characterized in Lemma 2. Intuitively, when z j j∈[T1:t] are close to the Nash equilibrium set, V j (s), V j (s) j∈[T1:t] will be close to v * (s). Thus, min a q t s (a), max b q t s (b) will also be close to v * (s). Then, by Lemma 2, z t+1 will not be far away from the Nash equilibrium set. Our proofs in this section can be summarized as: providing mutual bounds among {dist (z t , Z * )}, dist z t , Z * , V t -V t ∞ , max b q t s (b)min a q t s (a) by induction. The following fact shows that V T1 -V T1 ∞ can be bounded by dist z T1 , Z * . Lemma 15 The approximate value functions V T1 , V T1 satisfy V T1 -V T1 ∞ ≤ max √ 2A, √ 2B (1 -γ) 2 dist z T1 , Z * . Proof of Lemma 15. By Fact 2, V T1 (s) ≤ v * (s) ≤ V T1 (s). By Lemma 19, V †,y t * (s) = v * (s). Since the min-player initializes V T1 (s) = V †,y T 1 (s), by combining with Lemma 21, we have v * (s) -V T1 (s) = V †,y t * (s) -V †,y T 1 (s) ≤ √ B y T1 -y T1 * (1 -γ) 2 ≤ √ Bdist y T1 , Z * (1 -γ) 2 . Analogously, V T1 (s) -v * (s) ≤ √ Adist x T1 , X * (1 -γ) 2 . The result follows by summing the above two equations and combining with the fact that dist(z T1 , Z * ) ≤ √ 2dist(x T1 , X * ) + √ 2dist(y T1 , Y * ). □ The following lemma follows directly by the definition of V t , V t in (61a), (63a) and the fact that t j=1 α j t = 1. Lemma 16 For any t ∈ [T 1 : T 2 -1] and s ∈ S V t+1 (s) -V t+1 (s) ≤ max j∈[T1:t] max b∈B q j s (b) -min a∈A q j s (a) . The following lemma bound the expansion of dist (z t , Z * ). Its proof mainly uses Lemma 2. Lemma 17 For any t ∈ [T 1 + 1 : T 2 -1], we have dist 2 z t , Z * ≤18dist 2 z t-1 , Z * + 8η 2 S max {A, B} V t -V t 2 ∞ + 8η 2 max {A, B} 2 (1 -γ) 2 dist 2 z t , Z * , dist 2 z t+1 , Z * ≤324dist 2 z t-1 , Z * + 152η 2 S max {A, B} V t -V t 2 ∞ + 152η 2 max {A, B} 2 (1 -γ) 2 dist 2 z t , Z * . In addition, dist 2 z T1+1 , Z * ≤ 8 + 8η 2 S max A 2 , B 2 (1 -γ) 4 + 4η 2 max {A, B} 2 (1 -γ) 2 dist 2 z T1 , Z * . Proof of Lemma 17. By Fact 2, we have V t -v * 2 ∞ + V t -v * 2 ∞ ≤ V t -V t 2 ∞ . Then, B max (a,b)∈A×B Q s [V t ](a, b) -Q * s (a, b) 2 + A max (a,b)∈A×B Q s [V t ](a, b) -Q * s (a, b) 2 ≤γ 2 max {A, B} V t -v * 2 ∞ + V t -v * 2 ∞ ≤ max {A, B} V t -V t 2 ∞ . Then, by Lemma 2, we have z t -z t-1 2 ≤8dist 2 z t-1 , Z * + 4η 2 S max {A, B} V t -V t 2 ∞ + 4η 2 max {A, B} 2 (1 -γ) 2 dist 2 z t , Z * . z t+1 -z t 2 ≤8dist 2 z t , Z * + 4η 2 S max {A, B} V t -V t 2 ∞ + 4η 2 max {A, B} 2 (1 -γ) 2 dist 2 z t , Z * . z T1+1 -z T1 2 ≤8dist 2 z T1 , Z * + 4η 2 S max {A, B} V T1 -V T1 2 ∞ + 4η 2 max {A, B} 2 (1 -γ) 2 dist 2 z T1 , Z * . The bound of dist 2 ( z t , Z * ) follows by the fact that dist 2 z t , Z * ≤ 2dist 2 z t-1 , Z * + 2 z t -z t-1 2 . The bound of dist 2 (z t+1 , Z * ) follows by the fact that dist 2 z t+1 , Z * ≤ 2dist 2 z t , Z * + 2 z t+1 -z t 2 . The bound of dist 2 (z T1+1 , Z * ) follows by combining with Lemma 15. □ The following lemma is straightforward from the definitions of q t s and q t s . Lemma 18 For any t ∈ [T 1 : T 2 ] and s ∈ S, max b∈B q t s (b) -min a∈A q t s (a) ≤ V t -V t ∞ + max √ 2A, √ 2B 1 -γ dist z t s , Z * s . Proof of Lemma 18. 
For any s ∈ S, we have v * (s) -min a∈A q t s (a) = min a∈A Q s [v * ]y t * s (a) -min a∈A (Q s [V t ]y t s )(a) ≤ Q s [v * ]y t * s -Q s [V t ]y t s ∞ ≤ max (a,b)∈A×B Q s [v * ](a, b) -Q s [V t ](a, b) y t * s 1 + max (a,b)∈A×B Q s [V t ] y t s -y t * s 1 ≤ v * -V t ∞ + √ B 1 -γ dist y t s , Y * s . Analogously, max b∈B q t s (b) -v * (s) ≤ V t -v * ∞ + √ A 1 -γ dist x t s , X * s . Then, the proof is completed by combining the above two equations with the facts that dist(z s , Z * s ) ≤ √ 2dist(x s , X * ) + √ 2dist(y s , Y * ). □ Now, we can prove the geometric boundedness of Averaging OGDA (Theorem 4) by combining Lemma 16, Lemma 17, Lemma 18 inductively. Proof of Theorem 4. By Lemma 15, V T1 -V T1 ∞ ≤ C 1 dist z T1 , Z * , ( ) where C 1 = max √ 2A, √ (1γ)

2

. By Lemma 18, max s∈S max b∈B q t s (b) -min a∈A q t s (a) ≤ V t -V t ∞ + C 2 dist z T1 s , Z * s , where C 2 = max √ 2A, √ 1γ . By Lemma 17 and the fact that √ A 1 + A 2 + A 3 ≤ √ A 1 + √ A 2 + √ A 3 , we have dist z T1+1 , Z * ≤ D 1 dist z T1 , Z * , ( ) dist z t+1 , Z * ≤ D 2 dist z t , Z * + C 3 V t+1 -V t+1 ∞ + C 4 dist z t+1 , Z * , ( ) dist z t+2 , Z * ≤ D 3 dist z t , Z * + C 5 V t+1 -V t+1 ∞ + C 6 dist z t+1 , Z * 2 , ( ) where D 1 = 8 + 8η 2 S max {A 2 , B 2 } (1 -γ) 4 + 4η 2 max {A, B} 2 (1 -γ) 2 , D 2 = √ 18, C 3 = η 8S max {A, B}, C 4 = √ 8η max {A, B} 1 -γ , D 3 = √ 324, C 5 = η 152S max {A, B}, C 6 = √ 152η max {A, B} 1 -γ . Define D = max {D 1 , C 1 + C 2 , 1 + C 2 , D 2 + C 3 + C 4 , D 3 + C 5 + C 6 } . Next, we prove (78) by induction max dist z j+1 s , Z * s , dist z j s , Z * s , V j -V j ∞ , max s∈S max b∈B q j s (b) -min a∈A q j s (a) ≤ D j-T1+1 • dist z T1 , Z * . ( ) The case of j = T 1 follows by ( 73), ( 74), ( 75). Now, suppose that we have shown (78) for j ∈ [T 1 : t]. Then, by Lemma 16 and the induction hypothesis ( 78), V t+1 -V t+1 ∞ ≤ D t-T1+1 • dist z T1 , Z * . By combining the above equation with ( 74) and the induction hypothesis (78), max s∈S max b∈B q t+1 s (b) -min a∈A q t+1 s (a) ≤ V t -V t + C 2 dist z t s , Z * s ∞ ≤ (1 + C 2 ) D t-T1+1 dist z T1 , Z * . By combining the above two equations with ( 76), ( 77) and the induction hypothesis (78), dist z t+1 , Z * ≤ (D 2 + C 3 + C 4 ) D t-T1+1 dist z T1 , Z * , dist z t+2 , Z * ≤ (D 3 + C 5 + C 6 ) D t-T1+1 dist z T1 , Z * . By the definition of D, we have proved (78) for t + 1. By induction, (78) holds for any t ∈ [T 1 : T 2 ]. The following relation is implied by (78) directly dist z t , Z * ≤ D t-T1 • dist z T1 , Z * = D t-T1 • dist (z , Z * ) , where z = (x , ỹ ) is the initial policy pair ( 59), (60). Then, (11) follows by setting D 0 = D 2 . By definition, we have D 0 = O(S(A + B) 2 /(1 -γ) 4 ) under the condition η ≤ 1. By Shapley's theorem (Lemma 19), Z * s = X * s × Y * s is the set of Nash equilibria of a matrix game. Thus, Z * s is convex, then, Z * is also convex. Thus, we have dist T2 t=T1 α t-T1+1 T2-T1+1 z t , Z * ≤ T2 t=T1 α t-T1+1 T2-T1+1 dist z t , Z * . As D 0 ≥ 1 in our definition, we have dist z [T1:T2] , Z * ≤dist T2 t=T1 α t-T1+1 T2-T1+1 z t , Z * ≤ T2 t=T1 α t-T1+1 T2-T1+1 dist z t , Z * ≤ D 0 T2-T1 dist z T1 , Z * = D 0 T2-T1 dist (z , Z * ) . This gives (12) . □

E PROOFS FOR GLOBAL LINEAR CONVERGENCE

Proof of Theorem 1. Recall the constants c_0 and δ_0 defined in the local linear convergence property of Local-Fast, D_0 defined in the geometric boundedness property of Global-Slow, and C′ defined in the global convergence property of Global-Slow in Section 3.1.

Define

M * 1 = min t ≥ 1 : C ′ log(t) η ′ t ≤ δ 0 η 4 , M * 2 = max 3 c 0 η 2 ⌈log Γ 0 ⌉ , 0 + 1, M * 3 = 6 c 0 η 2 (⌈log max {D 0 , 1}⌉ + 1) . Let M * = max (M * 1 ) 2 , M * 2 , (M * 3 ) 2 . Then, the order of M * M * ≤ O C ′ 2 log 2 (C ′ /(δ 0 ηη ′ )) δ 0 η 4 η ′ 2 + log 2 (D 0 + 1) c 2 0 η 4 . ( ) For simplicity we denote z k = z [I k gs : I k gs ] . Note that z k = z [I k gs : I k gs ] is the initial policy pair of the k-th call to Local-Fast. Define k * as k * = min k ∈ Z + : 2 k ≥ M * 1 , 4 k ≥ M * 2 , 2 k ≥ M * 3 . Then, 2 k * -1 ≤ M * 1 , 4 k * -1 ≤ M * 2 , 2 k * -1 ≤ M * 3 , i.e., 4 k * ≤ 4 max (M * 1 ) 2 , M * 2 , (M * 3 ) 2 = 4M * . ( ) Firstly, we provide bounds for z k after k ≥ k * . For any k ≥ k * , since 2) and the definition of M * 1 , the policy pair z k satisfies I k gs -I k gs + 1 = 2 k ≥ 2 k * ≥ M * 1 , by ( dist 2 z k , Z * ≤ C ′ log(2 k * ) η ′ • 2 k * 2 ≤ δ 0 η 4 . Since z k is the initial policy pair of Local-Fast in time interval [I k lf : I k lf ], by (5), for t ∈ [I k lf : I k lf ], dist 2 z t , Z * ≤Γ 0 • 1 -c 0 η 2 t-I k lf dist 2 z k , Z * . Since 4 k ≥ 4 k * ≥ M * 2 , Γ 0 • 1 - c 0 η 2 3 4 k -1 ≤ 1. Since 2 k ≥ 2 k * ≥ M * 3 , we have 1 - c 0 η 2 3 4 k -1 ≤ 1 - c 0 η 2 3 2 k+1 •(2 k-1 -1) ≤ 1 max {D 0 , 1} 2 k+1 . Then, by combining the above three equations, dist 2 z I k lf , Z * ≤ Γ 0 • 1 -c 0 η 2 I k lf -I k lf dist 2 z k , Z * =Γ 0 • 1 -c 0 η 2 4 k -1 dist 2 z k , Z * ≤ Γ 0 • 1 - c 0 η 2 3 3•(4 k -1) dist 2 z k , Z * ≤ 1 max {D 0 , 1} 2 k+1 1 - c 0 η 2 3 4 k -1 dist 2 z k , Z * . ( ) By ( 4) and the fact that z I k lf is the initial policy pair of the (k + 1)-th call to Global-Slow, dist 2 z k+1 , Z * ≤ D I k+1 gs -I k+1 gs 0 dist 2 z I k+1 gs , Z * = D 2 k+1 -1 0 dist 2 z I k lf , Z * . ( ) Then, by combining ( 82) and ( 83), we have dist 2 z k+1 , Z * ≤D 2 k+1 -1 0 • 1 max {D 0 , 1} 2 k+1 1 - c 0 η 2 3 4 k -1 dist 2 z k , Z * ≤ 1 - c 0 η 2 3 4 k -1 dist 2 z k , Z * . (84) Next, we give a rough bound of dist 2 (z t , Z * ) for t ∈ [I k lf : I k+1 gs ]. For t ∈ [I k lf : I k lf ], by (5), dist 2 z t , Z * ≤Γ 0 • 1 -c 0 η 2 t-I k lf dist 2 z k , Z * ≤ Γ 0 dist 2 z k , Z * . For t ∈ [I k+1 gs : I k+1 gs ], since z I k lf is the initial policy pair of the (k + 1)-th call to Global-Slow, it follows by (3) that dist 2 z t , Z * ≤D t-I k+1 gs 0 dist 2 z I k lf ,Z * ≤ max {D 0 , 1} 2 k+1 dist 2 z I k lf , Z * ≤ 1 - c 0 η 2 3 4 k -1 dist 2 z k , Z * ≤ dist 2 z k , Z * , where the first inequality is from (3); the second inequality is from the fact that [I k+1 gs : I k+1 gs ] = 2 k+1 ; the third inequality is by (82). Thus, for any t ∈ [I k lf : I k+1 gs ], dist 2 z t , Z * ≤ Γ 0 dist 2 z k , Z * . ( ) Now, we are ready to bound dist(z t , Z * ) for each t ∈ [0 : T ]. Firstly, we fix a k ′ ≥ k * + 1 and a t ′ ∈ [I k ′ lf : I k ′ +1 gs ]. Then, the time interval [0 : t ′ ] can be divided into: [0 : t ′ ] = [0 : I k * gs ] ∪ [I k * lf : I k * +1 gs ] ∪ • • • ∪ [I k ′ -1 lf : I k ′ gs ] ∪ [I k ′ lf : t ′ ]. By (84), we have dist 2 z k ′ , Z * ≤ 1 - c 0 η 2 3 k ′ -1 k=k * (4 k -1) dist 2 z k * , Z * ≤ 2S 1 - c 0 η 2 3 k ′ -1 k=k * (4 k -1) By combining with (85), we have 1) . dist 2 z t ′ , Z * ≤ (2SΓ 0 ) • 1 - c 0 η 2 3 k ′ -1 k=k * (4 k - By (81), I k * gs ≤ 2 k * + k * -1 k=1 2 k + 4 k ≤ 2 k * k=0 4 k ≤ 8 3 • 4 k * ≤ 32M * 3 . (86) Thus, k ′ -1 k=k * 4 k -1 ≥ 1 2 k ′ -1 k=k * 4 k ≥ 1 4 k ′ -1 k=k * 4 k + 2 k+1 ≥ 1 16 k ′ k=k * 4 k + 2 k+1 = 1 16 k ′ k=k * I k+1 gs -I k gs = I k ′ +1 gs -I k * gs 16 ≥ t ′ -I k * gs 16 ≥ t ′ -32M * /3 16 . 
Then, for the time t ′ we have fixed, dist 2 z t ′ , Z * ≤(2SΓ 0 ) • 1 - c 0 η 2 3 k ′ -1 k=k * (4 k -1) ≤ (2SΓ 0 ) • 1 - c 0 η 2 3 t-32M * /3 16 . ( ) Since the above arguments can be applied to any k ′ ≥ k * + 1 and t ∈ [I k ′ lf : I k ′ +1 gs ], we have that ( 87) holds for any t ≥ I k * +1 lf . By similar arguments to (86), we have I k * +1 gs ≤ 128M * /3. Then, for any t ∈ [0 : I k * +1 gs ], dist(z t , Z * ) ≤2S ≤ 2S max {Γ 0 , 1} • 1 - c 0 η 2 3 t-I k * +1 gs 16 ≤2S max {Γ 0 , 1} • 1 - c 0 η 2 3 t-128M * /3 16 . Then, by combining with (87), for any t ∈ [0 : T ], More specifically, by Theorem 1 and (57), the constant c in (9) satisfies c > 0 and it is of order dist 2 z t , Z * ≤2S max {Γ 0 , 1} • 1 - c 0 η 2 3 t-128M * /3 16 ≤2S max {Γ 0 , 1} • 1 - c 0 η 2 48 t-128M * /3 . ( c = c 0 48 = O (1 -γ) 4 c 2 + S 3 . By combining (80) with Theorem 3, Theorem 4, ( 57), ( 58), the constant M in ( 9) is of order √ S(A+B) ), then the linear convergence rate is M = O S 10 (A + B) 3 log 2 (SAB/(c + (1 -γ))) 1 -cη 2 = 1 -O (1 -γ) 9 c 2 + S 4 (A + B) 2 . If we set η = O( (1-γ) 5 2 √ S(A+B) ) for OGDA and η ′ = O( 1-γ A+B ) for Averaging OGDA, then the length of Hidden Phase I is of order M log 2 (SAB/(c + ηη ′ )) η 4 η ′ 2 = O S 12 (A + B) 9 log 2 (SAB/(c + (1 -γ))) (1 -γ) 37 c 6 + .
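For readability, the switching schedule analyzed in this proof can be summarized by the following schematic sketch. Here global_slow and local_fast are placeholders for a 2^k-step call to Global-Slow (returning its average policy) and a 4^k-step call to Local-Fast (returning its last iterate); they are not the actual implementations.

```python
def homotopy_po(global_slow, local_fast, z0, T):
    """Skeleton of the switching schedule used in the proof of Theorem 1:
    the k-th call to Global-Slow runs 2^k iterations and outputs an average
    policy, which warm-starts a 4^k-iteration call to Local-Fast; the last
    iterate of Local-Fast warm-starts the next Global-Slow call."""
    z, t, k = z0, 0, 1
    while t < T:
        n_gs = min(2 ** k, T - t)
        z, t = global_slow(z, n_gs), t + n_gs    # returns the average policy of the call
        if t >= T:
            break
        n_lf = min(4 ** k, T - t)
        z, t = local_fast(z, n_lf), t + n_lf     # returns the last iterate of the call
        k += 1
    return z
```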

F AUXILIARY LEMMAS

The following lemma gives a characterization of Nash equilibrium. Its proof can be found in Section 3.9 of Filar & Vrieze (2012) . Lemma 19 Consider Markov game G = (S, A, B, r, P, γ). Given the minimax game value v * (s) = min x ∈X max y ∈Y V x ,y (s). A policy pair (x * , y * ) ∈ X × Y is a Nash equilibrium if and only if it holds for any s ∈ S that (x * s , y * s ) is a Nash equilibrium of the matrix game min x s∈∆A max y s ∈∆ B x ⊤ s Q * s y s , where Q * s is an A-by-B matrix with Q * s (a, b) = R s (a, b)+γ s ′ ∈S v * (s ′ )P(s ′ |s, a, b). In addition, the minimax game value and the Nash equilibrium set of the matrix game (89) are v * (s) and Z * s = X * s × Y * s , respectively. Then, the Nash equilibrum set of Markov game G is Z * = s∈S Z * s . The following lemma is known as "performance difference lemma" (Kakade & Langford, 2002) . It is used extensively throughout this paper. Lemma 20 (Performance Difference Lemma) For any policies x , x ′ ∈ X , y ∈ Y and state s 0 ∈ S, we have V x ′ ,y (s 0 ) -V x ,y (s 0 ) = 1 1 -γ s∈S d x ′ ,y s0 (s) ⟨x ′ s -x s , Q x ,y s y s ⟩ . The following lemma is standard. We provide its proof for completeness. Lemma 21 For any policies x , x ′ ∈ X , y , y ′ ∈ Y and state s ∈ S, state distribution ρ ∈ ∆ S , action pair (a, b) ∈ A × B. Let z = (x , y ) and z ′ = (x ′ , y ′ ), then V x ,y (s) -V x ′ ,y ′ (s) ≤ √ A + B ∥z -z ′ ∥ (1 -γ) 2 , Q x ,y s (a, b) -Q x ′ ,y ′ s (a, b) ≤ γ √ A + B ∥z -z ′ ∥ (1 -γ) 2 , ( ) d x ,y ρ ρ ρ (s) -d x ′ ,y ′ ρ ρ ρ (s) ≤ √ A + B ∥z -z ′ ∥ 1 -γ , V x , † (s) -V x ′ , † (s) ≤ √ A ∥x -x ′ ∥ (1 -γ) 2 , ( ) V †,y (s) -V †,y ′ (s) ≤ √ B ∥y -y ′ ∥ (1 -γ) 2 . ( ) Proof of Lemma 21. By performance difference lemma (Lemma 20), V x ,y (s) -V x ′ ,y (s) ≤ 1 1 -γ s ′ ∈S d x ′ ,y s (s ′ ) ∥x s ′ -x ′ s ′ ∥ 1 ∥Q x ,y s ′ y s ∥ ∞ ≤ 1 (1 -γ) 2 s ′ ∈S d x ′ ,y s (s ′ ) ∥x s ′ -x ′ s ′ ∥ 1 Similarly, V x ′ ,y (s) -V x ′ ,y ′ (s) ≤ 1 1 -γ s ′ ∈S d x ′ ,y s (s ′ ) ∥y s ′ -y ′ s ′ ∥ 1 Q x ′ ,y ′ s ′ ⊤ x ′ s ∞ ≤ 1 (1 -γ) 2 s ′ ∈S d x ′ ,y s (s ′ ) ∥y s ′ -y ′ s ′ ∥ 1 . Then, by triangle inequality and the fact that s ′ ∈S d x ′ ,y s (s ′ ) = 1, we have V x ,y (s) -V x ′ ,y ′ (s) ≤ 1 (1 -γ) 2 s ′ ∈S d x ′ ,y s (s ′ ) ∥z s ′ -z ′ s ′ ∥ 1 ≤ √ A + B max s ′ ∈S ∥z ′ s -z ′ s ′ ∥ (1 -γ) 2 ≤ √ A + B ∥z -z ′ ∥ (1 -γ) 2 . Then, (91) follows by combining (90) with the definition Q x ,y s = Q s [V x ,y ]. To bound the difference of state visitation distribution, we fix s, s ′ ∈ S. Let P ∈ R S×S be the transition matrix of policy pair (x , y ), i.e., P(s, s 1 ) = a∈A b∈B x s (a)y s (b)P (s 1 |s, a, b) . Similarly, define P ′ as the transition matrix of (x ′ , y ′ ). Then, d x ,y s (s 1 ) is the (s, s 1 )-th entry of (1γ) (I -P) -1 ; d x ′ ,y ′ s (s 1 ) is the (s, s 1 )-th entry of I -P ′ -1 . By definition, for any s, s 1 ∈ S,  ≤ ∥z s -z ′ s ∥ 1 . Thus, we have P -P ′ ∞ ≤ max s∈S ∥z s -z ′ s∥ 1 . By combining with the fact that (I -P) -1 ∞ ≤ ∞ i=0 γ i P i ∞ ≤ 1 1-γ , we have d x ,y s (s 1 ) -d x ′ ,y ′ s (s 1 ) = (1 -γ) 1 s , (I -P) -1 P -P ′ I -P ′ -1 1 s1 ≤ (1 -γ) (I -P) -1 ∞ P -P ′ ∞ I -P ′ -1 ∞ ≤ √ A + B max s ′ ∈S ∥z ′ s -z ′ s ′ ∥ 1 -γ ≤ √ A + B ∥z -z ′ ∥ 1 -γ . Then, d x ,y ρ ρ ρ (s) -d x ′ ,y ′ ρ ρ ρ (s) ≤ s0∈S ρ(s 0 ) d x ,y s0 (s) -d x ′ ,y ′ s0 (s) ≤ √ A + B ∥z -z ′ ∥ 1 -γ . To show (93), we choose y ∈ arg max y V x ,y (s), then, by performance difference lemma (Lemma 20), V x , y (s) -V x ′ , y (s) ≤ 1 1 -γ s ′ ∈S d x , y s (s ′ ) ∥x s ′ -x ′ s ′ ∥ 1 Q x ′ , y s ′ y s ∞ ≤ max s∈S ∥x s -x ′ s∥ 1 (1 -γ) 2 ≤ √ A ∥x -x ′ ∥ (1 -γ) 2 . 
Analogously, V x ′ , † (s) -V x , † (s) ≤ √ A∥x -x ′ ∥ (1-γ) 2 . Thus, V x , † (s) -V x ′ , † (s) ≤ √ A∥x -x ′ ∥ (1-γ) 2 . The inequality (94) follows similarly. □ As a direct corollary of ( 93), (94), we can bound the Nash gap max s∈S V x , † (s) -V †,y (s) by dist(z , Z * ). Corollary 1 For any z = (x , y ) ∈ Z, max s∈S V x , † (s) -V †,y (s) ≤ max{ √ 2A, √ 2B} (1 -γ) 2 • dist(z , Z * ). Proof of Corollary 1. Denote P X * (x ) = x * , P Y * (y ) = y * , then z * = (x * , y * ) = P Z * (z ) . By the definition of Nash equilibria, V x * ,y * (s) = V x * , † (s) = V †,y * (s). Then, by combining with ( 93), ( 94), for any s ∈ S, max s∈S V x , † (s) -V †,y (s) = max s∈S V x , † (s) -V x * , † (s) + V †,y * (s) -V †,y (s) ≤ max s∈S √ A ∥x -x * ∥ (1 -γ) 2 + √ B ∥y -y * ∥ (1 -γ) 2 ≤ max{ √ 2A, √ 2B}dist(z , Z * ) (1 -γ) 2 . This completes the proof. □ The following lemma is paraphrased from Lemma 4 of Gilpin et al. (2012) and is also known as saddle-point metric subregularity of matrix games as in Theorem 5 of Wei et al. (2020) . It essentially shows that in matrix game min x max y x ⊤ Gy , the suboptimality of any policy pair can be lower bounded by a certain condition measure φ(G) of the matrix G multiplied by the policy pair's distance to the Nash equilibrium set of the matrix game. Lemma 22 (Lemma 4 of Gilpin et al. (2012) , Theorem 5 of Wei et al. (2020) ) For any matrix G ∈ R A×B , let X * (G) = arg min x ′ ∈∆ A (max y ′ ∈∆ B x ′ ⊤ Gy ′ ) and Y * (G) = arg max y ′ ∈∆ B (min x ′ ∈∆ A x ′ ⊤ Gy ′ ). Then, it holds that for any x ∈ ∆ A and y ∈ ∆ B , max y ′ ∈∆ B x ⊤ Qy ′ -min x ′ ∈∆ A x ′ ⊤ Qy ≥ φ(Q) • dist 2 (x , X * (G)) + dist 2 (y , Y * (G)), where φ(Q) > 0 is a certain condition measure of the matrix Q. As a direct corollary of Lemma 22, we can instantiate the value of c + in (1). Corollary 2 Let c + = min s∈S φ(Q * s ), then, for any policy pair z = (x , y ) ∈ Z and s ∈ S, max y ′ s ∈∆ B x ⊤ s Q * s y ′ s -min x ′ s ∈∆ A x ′ s ⊤ Q * s y s ≥ c + • dist(z s , Z * s ).
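As a quick sanity check on the auxiliary lemmas above, the following sketch (our own toy instance and helper names) numerically verifies the performance difference lemma (Lemma 20) on a small random Markov game.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, B, gamma = 3, 2, 2, 0.9
R = rng.uniform(size=(S, A, B))                         # rewards in [0, 1]
P = rng.uniform(size=(S, A, B, S))
P /= P.sum(-1, keepdims=True)                           # P(s'|s,a,b)

def value(x, y):
    """V^{x,y} from the Bellman equation (I - gamma * P_{x,y}) V = r_{x,y}."""
    r = np.einsum('sa,sb,sab->s', x, y, R)
    Pz = np.einsum('sa,sb,sabt->st', x, y, P)
    return np.linalg.solve(np.eye(S) - gamma * Pz, r)

def q_fun(x, y):
    """Q_s^{x,y}(a,b) = R_s(a,b) + gamma * sum_{s'} P(s'|s,a,b) V^{x,y}(s')."""
    return R + gamma * np.einsum('sabt,t->sab', P, value(x, y))

def visit(x, y, s0):
    """d_{s0}^{x,y}(s) = (1-gamma) * sum_t gamma^t Pr(s_t = s | s_0 = s0)."""
    Pz = np.einsum('sa,sb,sabt->st', x, y, P)
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * Pz.T, np.eye(S)[s0])

def rand_policy(n):
    u = rng.uniform(size=(S, n))
    return u / u.sum(-1, keepdims=True)

x, xp, y, s0 = rand_policy(A), rand_policy(A), rand_policy(B), 0
lhs = value(xp, y)[s0] - value(x, y)[s0]
Q = q_fun(x, y)
rhs = sum(visit(xp, y, s0)[s] * (xp[s] - x[s]) @ Q[s] @ y[s] for s in range(S)) / (1 - gamma)
print(abs(lhs - rhs))   # ~ 0 up to floating-point error
```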

G DECENTRALIZED IMPLEMENTATION OF THE ALGORITHMS

Recall that in our interaction protocol, the min-player only has access to its marginal reward function r t x and marginal transition kernel P t x , while the max-player only has access to its marginal reward function r t y and marginal transition kernel P t y . Equivalently, in each iteration, the min-player receives full information of the Markov Decision Process (MDP) M t x = (S, A, P t x , r t x , γ), the max-player receives M t y = S, B, P t y , r t y , γ . The marginal rewards and transition kernels are defined as r t x (s, a) = b∈B y t s (b)R s (a, b), P t x (s ′ |s, a) = b∈B y t s (b)P(s ′ |s, a, b), r t y (s, b) = a∈A x t s (a)R s (a, b), P t y (s ′ |s, a) = a∈A x t s (a)P(s ′ |s, a, b). The value function of the policy x in the MDP M t x is defined as an S-dimensional vector containing the expected cumulative rewards of each state, i.e., V x ,M t x (s) = E x ,y t   +∞ j=0 γ j r t x s j , a j |s 0 = s   . The q-function q x ,M t x = {q x ,M t x s } s∈S is defined as a collection of A-dimensional vector with q x ,M t x s (a) = r t x (s, a) + γ s ′ ∈S P t x (s ′ |s, a) V x ,M t x (s ′ ). The counterparts V M t y ,y (s), q M t y ,y s for the max-player are defined similarly. In the pseudocodes below, for any set C, I C denotes its indicator. The decentralized implementation of OGDA ( 8) is in Algorithm 2 (min-player's perspective) and Algorithm 3 (max-player's perspective). The decentralized implementation of Averaging OGDA ( 6) is in Algorithm 4 (min-player's perspective) and Algorithm 5 (max-player's perspective). Then, our instantiation of the meta algorithm Homotopy-PO which uses Averaging OGDA as Global-Slow and OGDA as Local-Fast is naturally a decentralized algorithm. The pseudocodes are presented in Algorithm 6 (min-player's perspective) and Algorithm 7 (max-player's perspective). • Equivalence between OGDA (8) and Algorithm 2, 3 To prove the equivalence between OGDA (8) and Algorithm 2, 3, it suffices to show that q x t ,M t x s = Q t s y t s . Actually, both q x t ,M t x s and Q t s y t s equals the marginal q-function of the local MDP M t x = {S, A, P t x , r t x , γ} observed by the min-player at iteration t. By definition, we have for any s ∈ S, V x t ,y t (s) = V x t ,M t x (s) = V M t y ,y t (s). Then, we have q x t ,M T x s (a) = b∈B R s (a, b)y t s (b) + b∈B s ′ ∈S P t x (s ′ |s, a, b)V x t ,M t x (s ′ )y t s (b) = b∈B R s (a, b)y t s (b) + b∈B s ′ ∈S P t x (s ′ |s, a, b)V x t ,y t (s ′ )y t s (b) = 1 a , Q x t ,y t s y t s = 1 a , Q t s y t s . Thus, q x t ,M T x s = Q t s y t s . Analogously, q y t ,M T y s = Q t s ⊤ x t s . This gives the equivalence between OGDA (8) and Algorithm 2, 3. 
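The identity q^{x^t, M^t_x}_s = Q^t_s y^t_s at the heart of the equivalence argument can also be checked numerically. The sketch below is our own toy instance: it builds the marginal MDP seen by the min-player and compares the marginal q-function with Q^t_s y^t_s.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, B, gamma = 3, 2, 2, 0.9
R = rng.uniform(size=(S, A, B))
P = rng.uniform(size=(S, A, B, S))
P /= P.sum(-1, keepdims=True)
x = rng.dirichlet(np.ones(A), size=S)                    # current policies x^t, y^t
y = rng.dirichlet(np.ones(B), size=S)

# Marginal MDP seen by the min-player: r_x^t(s,a) and P_x^t(s'|s,a).
r_x = np.einsum('sb,sab->sa', y, R)
P_x = np.einsum('sb,sabt->sat', y, P)

# Value and q-function of policy x in the marginal MDP M_x^t.
r_pol = np.einsum('sa,sa->s', x, r_x)
P_pol = np.einsum('sa,sat->st', x, P_x)
V = np.linalg.solve(np.eye(S) - gamma * P_pol, r_pol)          # V^{x, M_x^t} = V^{x,y}
q_marginal = r_x + gamma * np.einsum('sat,t->sa', P_x, V)      # q^{x, M_x^t}

# Joint-game counterpart: Q_s^{x,y}(a,b) = R + gamma * P V, then Q_s y_s.
Q_joint = R + gamma * np.einsum('sabt,t->sab', P, V)
print(np.abs(q_marginal - np.einsum('sab,sb->sa', Q_joint, y)).max())   # ~ 0
```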
Algorithm 2: x-OGDA Input: time interval: [T 1 : T 2 ], initial policy: x ∈ X , stepsize: η Initialize x T1 = x for t = T 1 , • • • , T 2 do play policy x t receive r t x and P t x compute the q-function q x t ,M t x s s∈S in the MDP M t x = (S, A, P t x , r t x , γ) optimistic gradient descent x t s = I {t=T1} • x T1 s + I {t>T1} • P ∆ A x t-1 s -ηq x t ,M t x s x t+1 s = P ∆ A x t s -ηq x t ,M t x s end Algorithm 3: y-OGDA Input: time interval: [T 1 : T 2 ], initial policy: y ∈ Y, stepsize: η, Initialize y T1 = y for t = T 1 , • • • , T 2 do play policy y t receive r t y and P t y compute the q-function q M t y ,y t s s∈S in the MDP M t y = S, B, P t y , r t y , γ optimistic gradient ascent y t s = I {t=T1} • y T1 s + I {t>T1} • P ∆ B y t-1 s + ηq M t y ,y t s y t+1 s = P ∆ B y t s + ηq M t y ,y t s end Algorithm 4: x-Averaging-OGDA Input: time interval: [T 1 : T 2 ], initial policy x ∈ X , stepsize: η Initialize x T1 = x for t = T 1 , • • • , T 2 do play policy x t receive r t x and P t x if t == T 1 then solve the MDP M T1 x = S, A, P T1 x , r T1 x , γ to compute V T1 (s) = min x ′ ∈X V x ′ ,M T 1 x (s) for any s ∈ S end compute for (s, a) ∈ S × A, q t s (a) = r t x (s, a) + γ s ′ ∈S P t x (s ′ |s, a) V t (s ′ ) optimistic gradient descent x t s = I {t=T1} • x T1 s + I {t>T1} • P ∆ A x t-1 s -ηq t s x t+1 s = P ∆ A x t s -ηq t s update value function V t+1 (s) = min a∈A t j=T1 α j-T1+1 t-T1+1 q j s (a) end Compute the average policy x [T1:T2] = T2 t=T1 α t-T1+1 T2-T1+1 x t Algorithm 5: y-Averaging-OGDA Input: time interval: [T 1 : T 2 ], initial policy ỹ ∈ Y, stepsize: η Initialize y T1 = ỹ for t = T 1 , • • • , T 2 do play policy y t receive r t y and P t y if t == T 1 then solve the MDP M T1 y = S, B, P T1 y , r T1 y , γ to compute V T1 (s) = max y ′ ∈Y V M T 1 y ,y ′ (s) for any s ∈ S end compute for (s, b) ∈ S × B, q t s (b) = r t y (s, b) + γ s ′ ∈S P t y (s ′ |s, b) V t (s ′ ) optimistic gradient ascent y t s = I {t=T1} • y T1 s + I {t>T1} • P ∆ B y t-1 s + ηq t s y t+1 s = P ∆ B y t s + ηq t s update value function V t+1 (s) = max b∈B t j=T1 α j-T1+1 t-T1+1 q j s (b) end Compute the average policy y [T1:T2] = T2 t=T1 α t-T1+1 T2-T1+1 y t • Equivalence between Averaging OGDA (6) and Algorithm 4, 5 Firstly, it follows by definition that V †,y t (s) = min x ′′ ∈X V x ′′ ,M t x (s), V x t , † (s) = max y ′′ ∈Y V M t y ,y ′′ (s). Thus, the initiation steps in Averaging OGDA (6) and Algorithm 4, 5 are equivalent. Thus, V T1 in Averaging OGDA (6) equals that in Algorithm 4. Consider the variable q t s (a) defined in Algorithm 4, q t s (a) = r t x (s, a) + γ s ′ ∈S P t x (s ′ |s, a) V t (s ′ ) . By substituting (95) into (97) and combining the definition of the Bellman target operator in the introduction, we have q t s (a) = b∈B R s (a, b)y t s (b) + b∈B V t (s ′ )P(s ′ |s, a, b)y t s (b) = 1 a , Q s [V t ]y t s , The RHS of ( 98) is exactly our definition for q t s in Averaging OGDA (6) in Section 3.2. Analogously, the definition for q t s equals in (6) and Algorithm 5. Then, by induction, q t s , q t s , V t (s), V t (s) t∈[T1:T2],s∈S has the same value in Averaging OGDA (6) and Algorithm 4, 5. This gives the equivalence of Averaging OGDA (6) and Algorithm 4, 5. •Decentralized implementation of Homotopy-PO Recall that we have shown in Section 4 and Section 5 that Averaging OGDA (6) and OGDA (8) can serve as the base algorithm Global-Slow and Local-Fast in the meta algorithm Homotopy-PO, respectively. 
Thus, we can instantiate Global-Slow with Averaging OGDA and Local-Fast with OGDA in the meta algorithm Homotopy-PO to obtain a globally linearly convergent algorithm. We have shown that Averaging OGDA can be implemented in a decentralized manner (Algorithm 4, 5), and OGDA is also a decentralized algorithm (Algorithm 2, 3). In addition, the inputs of Algorithm 2, 3 and Algorithm 4, 5 only require local information (the min-player only needs x^{T_1}, the max-player only needs y^{T_1}), with no knowledge of the opponent's policies. Thus, the algorithm constructed by plugging Averaging OGDA and OGDA into Homotopy-PO is naturally a decentralized algorithm. The pseudocodes from the min-player's and the max-player's perspectives are given in Algorithm 6 and Algorithm 7, respectively. We close with the remark that our instantiation of Homotopy-PO is symmetric and rational. Since the min-player and the max-player use the same stepsize η for OGDA and the same stepsize η′ for Averaging OGDA, the players have symmetric roles in our algorithms. Rationality means that a player converges to the best-response set when its opponent chooses a stationary policy. This property is naturally possessed by decentralized and symmetric algorithms; similar arguments for rationality can be found for some existing decentralized algorithms, see for instance Sayin et al. (2021); Wei et al. (2021). We attach the proof of rationality here for completeness. In addition, since our instantiation of Homotopy-PO has linear convergence, it is not only rational but also guarantees linear convergence to the best-response set. Theorem 6 (Rationality) If the max-player chooses a stationary policy y = {y_s}_{s∈S} ∈ Y and the min-player runs the instantiation of Homotopy-PO (Algorithm 6), then x^t converges to the best-response set {x ∈ X : V^{x,y}(s) = V^{†,y}(s), ∀s ∈ S} at a linear rate. Analogously, if the min-player chooses a stationary policy x = {x_s}_{s∈S} ∈ X and the max-player runs the instantiation of Homotopy-PO (Algorithm 7), then y^t converges to the best-response set {y ∈ Y : V^{x,y}(s) = V^{x,†}(s), ∀s ∈ S} at a linear rate. Proof of Theorem 6. Since the min-player and the max-player are symmetric, without loss of generality, we let the max-player choose a stationary policy y = {y_s}_{s∈S} ∈ Y. Then, we define a new Markov game MG′ = (S, A, B, P, R, γ), where S, A, γ have the same meaning as in the original Markov game. Now, the action set of the max-player only has one action, B = {1}; P(s′|s, a, 1) = Σ_{b∈B} P(s′|s, a, b) y_s(b) represents the probability of transitioning to state s′ when the min-player takes action a and the max-player plays the stationary policy y. Similarly, R_s(a, 1) = Σ_{b∈B} R_s(a, b) y_s(b) is the marginal reward that the min-player receives when its opponent plays the stationary policy y. Denote the one-sided NE set of the min-player in the new Markov game MG′ by X*(MG′). By definition, the minimax game values v* of MG′ satisfy v*(s) = V^{†,y}(s). Then, for any x* ∈ X*(MG′), V^{x*,y}(s) = V^{†,y}(s) for any s ∈ S; equivalently, X*(MG′) is the best-response set of y. By applying Theorem 2 to the new Markov game MG′, the policy x^t played by the min-player converges at a global linear rate to X*(MG′), which is the best-response set of y. Similar arguments hold for the max-player. This gives the rationality.
□ Algorithm 6: Instantiation of Homotopy-PO with Averaging OGDA and OGDA (min-player's perspective) Input: iterations: [0 : T ], initial policy:  x 0 ∈ X , stepsizes: η, η ′ > 0 set k = 1, I 0 lf = -1, x -1 = x 0 while I k-1 lf < T do I k gs = I k-1 lf + 1, I k gs = min{I k gs + 2 k -1, T }, I k lf = I k gs + 1, I k lf = min{I k lf + 4 k -1, : I k lf ], x [I k gs : I k gs ] , η) (Algorithm 2) k ← k + 1 end T ], we have dist 2 (z t , Z * ) ≤ 2S max {Γ 0 , 1} • 1 - c 0 η 2 48 t-128 M * /3 , ( ) where the value of C ′ , c 0 , δ 0 , Γ 0 can be found in the definitions of Gen-Global-Slow and Local-Fast and M * = O        C ′ 2 log 2p3 ( C ′ /(δ 0 ηη ′ )) δ 0 η 4 η ′ 2p2    1 p 1 + log 2 ( D 0 + 1) c 2 0 η 4     . ( ) Proof of Theorem 7. Let c 0 , δ 0 be defined in the local linear convergence of Local-Fast, D 0 defined in the geometric boundedness of Gen-Global-Slow, C ′ defined in the global convergence of Gen-Global-Slow.

Define

M*_1 = min{ t ≥ 1 : C′ log^{p_3}(t) / (η′^{p_2} t^{p_1}) ≤ δ_0 η^4 }, while M*_2 and M*_3 are defined in the same way as in the proof of Theorem 1 in Appendix E. Analogously to the proof of Theorem 1, we also let M* = max{ (M*_1)^2, M*_2, (M*_3)^2 }. This gives the order of M* in (101). Notice that the global linear rate only depends on the local linear rate of Local-Fast and the geometric boundedness of Global-Slow; the global convergence rate of Gen-Global-Slow is only relevant to the length of Hidden Phase I, i.e., M*_1 only affects the length of Hidden Phase I. The rest of this proof then follows from Theorem 1 directly. Analogously to (88), we also have dist^2(z^t, Z*) ≤ 2S max{Γ_0, 1} · (1 - c_0 η^2/48)^{t - 128 M*/3}. This gives the convergence result of Homotopy-PO when equipped with Gen-Global-Slow and Local-Fast. □

H.2 ANOTHER EXAMPLE OF GLOBAL-SLOW BASE ALGORITHM

Next, we show that the algorithm in Wei et al. (2021) with a slightly modified initialization can serve as an example of Gen-Global-Slow. It is shown in Theorem 2 of Wei et al. (2021) that Algorithm 1 therein has a sub-linear last-iterate global convergence rate which satisfies the RHS of (99) with p 1 = 1 2 , p 2 = 2, p 3 = 0. To instantiate that Algorithm 1 in Wei et al. (2021) can be an example of Gen-Global-Slow, it suffices to prove its geometric boundedness. We remark that geometric boundedness may not hold for the original Algorithm 1 in Wei et al. (2021) since its initialization V 0 (s) = 0 may cause the policy gradients in the first step to deviate largely. However, this problem can be fixed simply by changing the initialization to V 0 (s) = V x 1 ,y 1 (s). When running Algorithm 1 of Wei et al. (2021) in the full-information setting (with the different initialization discussed above) during the time interval [T 1 : T 2 ], the min-player and the max-player initialize x T1 = x T1 = x , y T1 = y T1 = ỹ and V T1-1 (s) = V x T 1 ,y T 1 (s) and update for t ≥ T 1 and any s ∈ S x t+1 s = P X x t s -η Q t s y t s , x t+1 s = P X x t+1 s -η Q t s y t s , y t+1 s = P Y y t s + η Q t s ⊤ x t s , y t+1 s = P Y y t+1 s + η Q t s ⊤ x t s , V t (s) = (1 -β t-T1+1 ) V t-1 (s) + β t-T1+1 x t s , Q t s y t s , where Q t s = Q s [V t-1 ] and β t = H0+1 H0+t with H 0 = 2 1-γ . Recall that Q s [•] is the Bellman target operator defined in the introduction. When using the algorithm (104) with initialization (103), the output policy can be set as x [T1:T2] = x T2 , y [T1:T2] = y T2 . We also denote z t = (x t , y t ), z [T1:T2] = ( x [T1:T2] , y [T1:T2] ), x t * = P X * (x t ), y t * = P Y * (y t ), z t * = P Z * (z ) = (x t * , y t * ) in the analysis below. Next, we proceed to show the geometric boundedness of the algorithm of Wei et al. (2021) with the slightly modified initialization in a similar way with Appendix D.2. We first provide mutual bounds among {dist(z t , Z * )} and {∥V t (s)v * (s)∥ ∞ } in Lemma 23 and Lemma 24 below. Lemma 23 Let {z t , V t } be generated from (104) with initialization (103). For any t ≥ T 1 , V t -v * ∞ ≤ max j∈[T1:t] √ A + Bdist(z j , Z * ) 1 -γ + max j∈[T1-1:t-1] V j -v * ∞ . Proof of Lemma 23. Firstly, define β j t = β j Π t k=j+1 (1β k ) for 0 ≤ j ≤ t -1 and β t t = β t . Since β 0 t = 0, by (104e), for any t ≥ T 1 V t (s) = t j=T1 β j-T1+1 t-T1+1 x j s , Q j s y j s . By the definition of Q j s , we have max s,a,b Q j s (a, b) -Q * s (a, b) = max s,a,b Q j s (a, b) -Q s [v * ](a, b) ≤ V j-1 -v * ∞ . By Lemma 19, v * (s) = x j * s , Q s [v * ]y j * s . Thus, for any t ≥ T 1 and s ∈ S, by combining the above equations, we have V t (s) -v * (s) ≤ t j=T1 β j-T1+1 t-T1+1 x j s , Q j s y j s -x j * s , Q s [v * ]y j * s ≤ t j=T1 β j-T1+1 t-T1+1 x j * s 1 • max (a,b)∈A×B |Q s [v * ](a, b)| • y j s -y j * s 1 + t j=1 β j-T1+1 t-T1+1 x j * s 1 • max (a,b)∈A×B Q j s (a, b) -Q s [v * ](a, b) • y j s 1 + t j=T1 β j-T1+1 t-T1+1 x j s -x j * s 1 Q j s y j s ∞ ≤ t j=T1 β j-T1+1 t-T1+1 z j s -z j * s 1 1 -γ + V j-1 -v * ∞ . 0 • dist 2 (z , Z * ) ≤ D t+1-T1 0 • dist 2 (z , Z * ). Thus, we have shown (112) for j = t + 1. By induction, (112) holds for any j ∈ [T 1 : T 2 ], which implies (106) directly. This completes the proof for the geometric boundedness of the algorithm (104) with the initialization (103). □ Remark 3 When the meta algorithm Homotopy-PO switches between Algorithm 1 of Wei et al. 
(2021) (with the slightly modified initialization) and OGDA (8), then by Theorem 7 and (57), c 0 48 = O (1 -γ) 4 c 2 + S 3 . Then, if η = O( (1-γ) 5 2 √ S(A+B) ) for OGDA, the linear rate is 1 - c 0 η 2 48 = 1 -O (1 -γ) 9 c 2 + S 4 (A + B) 2 . As in Algorithm 1 of Wei et al. (2021) , the stepsize η therein has to satisfy η ≤ (1-γ) for each s ∈ S, a ∈ A. Then, we normalize x 0 s = u s / ∥u s ∥ 1 for each s ∈ S. The initial policy y 0 s s∈S of the max-player is generated independently in the same way. Algorithm implementation. In all the experiments below, we set the stepsizes η = 0.1 in OGDA and also η ′ = 0.1 in Averaging OGDA. We find our algorithm has linear convergence in all the experiments with these stepsizes. Performance metric. We measure the closeness of z t to the Nash equilibria set by the Nash gap max s∈S V x t , † (s) -V †,y t (s). By combining Lemma 7 and Corollary 1 with the fact that dist(z , Z * ) ≤ √ S max s∈S dist(z s , Z * s ), we have the following relation between the Nash gap max s∈S V x , † (s) -V †,y (s) and the distance to the NE set dist(z , Z * ): for any z = (x , y ) ∈ Z, c + √ S • dist(z , Z * ) ≤ max s∈S V x , † (s) -V †,y (s) ≤ max{ √ 2A, √ 2B} (1 -γ) 2 • dist(z , Z * ). Thus, the linear convergence of dist(z t , Z * ) is equivalent to the linear convergence of the Nash gap max s∈S V x t , † (s) -V †,y t (s) up to problem-dependent constants. In the figures below, y-axis represents the logarithmic of the Nash gap log max s∈S V x t , † (s) -V †,y t (s) , x-axis represents the iteration number. Remark 4 As we can see, there are discontinuities when switching from Averaging OGDA to OGDA in the figures below. This is because Averaging OGDA is an averaging style method. Recall that the y-axis represents log max s∈S V x t , † (s) -V †,y t (s) . However, the initial policy pair of the k-th call of OGDA is the average policy z [I k gs : I k gs ] = I k gs t=I k gs α t-I k gs +1 2 k z t . Since it is quite possible that z [I k gs : I k gs ] ̸ = z I k gs , there can be some discontinuities in the figures below when switching from Averaging OGDA to OGDA. On the other hand, our theoretical bound in Figure 1  is continuous because by setting t = T 2 in (3), theoretically dist 2 (z T2 , Z * ) ≤ D T2-T1 0 • dist 2 (z , Z * ) whose bound equals the bound for dist 2 ( z [T1:T2] , Z * ) on the RHS of (4). We remark that in practice, it is predictable that z I k gs ̸ = z [I k gs : I k gs ] in most cases.
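For concreteness, one possible way to evaluate the Nash gap max_s ( V^{x^t,†}(s) - V^{†,y^t}(s) ) used as the performance metric is best-response value iteration against each fixed policy. The sketch below is our own and is not the implementation used to produce the figures.

```python
import numpy as np

def best_response_value(R, P, gamma, x=None, y=None, iters=3000):
    """Value iteration for V^{x,dagger} (max-player best-responds to a fixed x)
    when x is given, or V^{dagger,y} (min-player best-responds to a fixed y)
    when y is given."""
    S = R.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * np.einsum('sabt,t->sab', P, V)     # Q_s[V](a, b)
        if x is not None:
            V = np.einsum('sa,sab->sb', x, Q).max(-1)       # fix x, maximize over b
        else:
            V = np.einsum('sab,sb->sa', Q, y).min(-1)       # fix y, minimize over a
    return V

def nash_gap(R, P, gamma, x, y):
    """max_s ( V^{x,dagger}(s) - V^{dagger,y}(s) ), the quantity whose logarithm
    is plotted on the y-axis in the figures."""
    return (best_response_value(R, P, gamma, x=x)
            - best_response_value(R, P, gamma, y=y)).max()
```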

Numerical performance.

We validate the linear convergence of our instantiation of Homotopy-PO, where Global-Slow and Local-Fast are instantiated by Averaging OGDA and OGDA respectively. Figure 2 shows the performance when the min-player and max-player run Algorithm 6 and Algorithm 7, respectively. We do 10 random and independent trials and the algorithm exhibits linear convergence in every trial. The plot shows the average trajectory and standard deviation of the 10 random and independent trials. The vertical dotted line is at the end of 7-th call to OGDA (iteration t = 22098). As we can see, on the RHS of the dotted line (after t > 22098), the algorithm converges linearly and the Nash gap is less than 10 -5 after 2 × 10 5 iterations. The standard deviation of the 10 random trials is illustrated by the shadow area. Since the switching pattern is 2 k iterations of Averaging OGDA followed by 4 k iterations of OGDA, Averaging OGDA is only run for 1022 iterations in the total 2 × 10 5 iterations. Thus, Averaging OGDA is hardly seen in Figure 2 . We magnify the trajectory of the 9-th call to Averaging OGDA as a subfigure in Figure 2 . We can find that Averaging OGDA increases in its 9-th call. This has been predicted in our theoretical bounds (see segment BC in Figure 1 ). The 8-th call to OGDA has 4 8 iterations, while the 9-th call to Averaging OGDA only has 2 9 iterations. We have 4 8 /2 9 = 128, i.e., the iterations of OGDA are hundreds of times more than those in the successive call to Averaging OGDA. Then the increase caused by Averaging OGDA can be naturally "omitted" compared with the decrease from OGDA. This aligns with our theoretical bounds in Figure 1 (see the relation between the segments AB and BC in Figure 1 ). The trajectory is the average of 10 random and independent trials. The x-axis represents the iteration number, while the y-axis represents the logarithm of the Nash gap. The shadow area shows the standard deviations of these trials. The vertical dotted line is drawn at the end of the 7-th call to OGDA (iteration t = 22098). On the RHS of the dotted line (equivalently, after t > 22098), the algorithm exhibits fast linear convergence. In our switching pattern, 2 k ≪ 4 k when k is large. Thus, Averaging OGDA is almost "invisible". We magnify the 9-th call to Averaging OGDA as a subfigure. Though Averaging OGDA can increase, its increase is negligible by the decrease from hundreds of times more steps of OGDA. This aligns with our theoretical guarantees (see the relation between segments AB and BC in Figure 1 ). To avoid the problem that the iterations of Averaging OGDA is too few to be "visible", we do another group of trials by generalizing the switching pattern slightly. The numerical performance of Homotopy-PO with a slightly generalized switching scheme. In the new switching scheme, the k-th call to Averaging OGDA has 2 k steps and the k-th call to OGDA has 2.1 k steps. In this way, there are more iterations of Averaging OGDA so that the switches between them can be seen more clearly. The trajectory is the average of 10 random and independent trials with this switching pattern. The shadow area shows the standard deviation of these trials. The x-axis represents the iteration number, while the y-axis represents the logarithm of the Nash gap. We show the trajectories of the first 15 calls of Averaging OGDA and OGDA (iterations 1 ≤ t ≤ 195592) in this figure. 
The discontinuity in the trajectory is because Averaging OGDA is an averaging style method where z [I k gs : I k gs ] may not equal z I k gs (see Remark 4). A vertical dotted line is drawn at the end of the 12-th call to OGDA (iteration t = 22237). It can be observed that on the RHS of the dotted line (iteration t > 22237), the algorithm exhibits linear convergence. This aligns with our theoretical bounds illustrated in Figure 1 , where Averaging OGDA can increase but its increase can be "omitted" compared with the decrease from the more steps of OGDA so that the algorithm still has linear convergence. Figure 3 , where the iterations of Averaging OGDA are drawn in red while those of OGDA are drawn in blue. We show the trajectories of the first 15 calls of Averaging OGDA and OGDA (iterations 1 ≤ t ≤ 195592) in Figure 3 . The discontinuity of the trajectory is because Averaging OGDA is an averaging style method and OGDA uses the average policy z [I k gs : I k gs ] rather than z I k gs as the initial policy (see Remark 4). We draw a vertical dotted line at the end of the 12-th call to OGDA (iteration t = 22237). It can be observed that on the RHS of the dotted line (after t > 22237), the algorithm exhibits linear convergence. On the RHS of the dotted line, the performance of Averaging OGDA is generally inferior to OGDA. Averaging OGDA can even increase in some iterations. This coincides with our theoretical bounds (see the segment BC in Figure 1 ). Thanks to the fast and efficient linear convergence of OGDA together with the fact that the iterations of Averaging OGDA take up less and less proportion in the total iterations, the algorithm can exhibit linear convergence on the RHS of the vertical dotted line. This also aligns with our theoretical bounds illustrated in Figure 1 . To see the switches between Averaging OGDA and OGDA clearly in each trial, in Figure 4 and Figure 5 below, we present the 10 random trials of the changed switching pattern (2 k iterations of Averaging OGDA followed by 2.1 k iterations of OGDA). We illustrate the trajectories of the first 15 calls of Averaging OGDA and OGDA (iterations 1 ≤ t ≤ 195592) in Figure 4 and Figure 5 . In each subplots, we draw a vertical dotted line at the end of the 12-th call to OGDA (iteration t = 22237). It can be observed that on the RHS of the dotted line (after t > 22237), the algorithm has linear convergence in each trial. In some of the trials, Averaging OGDA can increase in some iterations. This is predicted (see segment BC in Figure 1 ). Since OGDA converges linearly and Averaging OGDA takes less and less proportion in the total iterations, the algorithm can still exhibit linear convergence on the RHS of the dotted line (t > 22237). This aligns with our theoretical bounds (see the relation between segments AB and BC in Figure 1 ). Even in the worst case (the 8-th trial), the Nash gap is less than 10 -3 after 2 × 10 5 iterations. And in some fast cases such as the 3-rd, 4-th, 5-th, 9-th, 10-th trials, the Nash gap can be less than 10 -6 or even 10 -8 in about 2 × 10 5 iterations. Figure 4 : The first 4 trajectories of 10 random and independent trials with the switching pattern described for Figure 3 . The rest 6 trajectories are illustrated in Figure 5 below. In these trials, the k-th call to Averaging OGDA and OGDA have 2 k and 2.1 k iterations respectively so that the switches between them can be seen more clearly. The x-axis represents the iteration number, while the y-axis represents the logarithm of the Nash gap. 
We show the trajectories of the first 15 calls of Averaging OGDA and OGDA (iterations 1 ≤ t ≤ 195592) in these subfigures. The vertical dotted line is drawn at the end of the 12-th call to OGDA (iteration t = 22237). As we can see, on the RHS of the vertical dotted line (t > 22237), all trajectories exhibit linear convergence. The discontinuities arise because Averaging OGDA is an averaging-style method (see Remark 4). The trajectories coincide with our theoretical bounds in Figure 1: although Averaging OGDA can cause an increase, this increase is "omitted" by the many more steps of decrease from OGDA. We also compare our algorithm with Alg. 1 in Wei et al. (2021). We choose the stepsizes of both our Homotopy-PO and Alg. 1 in Wei et al. (2021) to be 0.1. We choose the discount factor γ = 0.5, and the remaining settings are the same as those in the experiments above. The switching scheme is chosen to be the same as that in Figure 3 above. The comparison between Homotopy-PO and Alg. 1 in Wei et al. (2021) is illustrated in Figure 6, where the curves are drawn by taking the average over 5 random and independent trials. Figure 6: Comparison between Homotopy-PO with a slightly generalized switching scheme and Alg. 1 in Wei et al. (2021). In the new switching scheme, the k-th call to Averaging OGDA has 2^k steps and the k-th call to OGDA has 2.1^k steps. The curves are computed from the average of 5 random and independent trials. The curves are the segments connecting the points at the time points at which Homotopy-PO switches between Averaging OGDA and OGDA.

J DISCUSSIONS

J.1 POSSIBLE TRANSLATION TO SAMPLE-BASED ALGORITHMS

We remark that it is possible to translate our algorithm into sample-based algorithms. Here, we tentatively discuss the analogues of the local linear convergence of OGDA under the following two cases and give an intuitive analysis for each case. The analogues of the global convergence and geometric boundedness of Averaging OGDA can be discussed similarly.

• Case 1: Assuming access to a simulator (generative model). If there is a simulator (generative model) and the players can draw many samples in one iteration, then it is possible to obtain linear convergence with respect to the iteration number. More specifically, at iteration $t$, for each $s$, $N_t$ samples are drawn from the distributions $\mathbb{P}(\cdot|s, a_{t,j}, b_{t,j})$, where $1 \le j \le N_t$ and $a_{t,j} \sim x_s^t$, $b_{t,j} \sim y_s^t$. We define a very small variable $\delta_t = O\big(c_0\eta^2(1-c_0\eta^2)^t\big)$ and the truncated simplex $\Delta_A^t = \{x \in \mathbb{R}^A : x(a) \ge \delta_t \ \forall a,\ \sum_a x(a) = 1\}$; $\Delta_B^t$ is defined analogously. At iteration $t$, we replace the projection operators $P_{\Delta_A}(\cdot)$ and $P_{\Delta_B}(\cdot)$ with $P_{\Delta_A^t}(\cdot)$ and $P_{\Delta_B^t}(\cdot)$ (a small projection sketch is given right after this two-case discussion). This guarantees that each action is taken with probability at least $\delta_t$. Then, by Hoeffding's inequality, each action $a$ is taken by the min-player at least $O(N_t\delta_t)$ times with high probability (w.h.p.). Consequently, the empirical marginal reward $\hat r_x^t$ and marginal transition kernel $\hat P_x^t$ observed by the min-player satisfy, w.h.p., $\|\hat r_x^t - r_x^t\|_\infty \le O\big(\sqrt{1/(N_t\delta_t)}\big)$ and $\|\hat P_x^t(\cdot|s,a) - P_x^t(\cdot|s,a)\|_1 \le O\big(\sqrt{1/(N_t\delta_t)}\big)$. In this remark, $O(\cdot)$ suppresses logarithmic terms and problem parameters such as $S, A, B, 1/(1-\gamma)$ for simplicity. Thus, we have $\|\hat V^{x^t,y^t} - V^{x^t,y^t}\|_\infty \le O\big(\sqrt{1/(N_t\delta_t)}\big)$, $\|\hat V^{\dagger,y^t} - V^{\dagger,y^t}\|_\infty \le O\big(\sqrt{1/(N_t\delta_t)}\big)$, $\|\hat V^{x^t,\dagger} - V^{x^t,\dagger}\|_\infty \le O\big(\sqrt{1/(N_t\delta_t)}\big)$, and $|\hat Q_s^{x^t,y^t}(a,b) - Q_s^{x^t,y^t}(a,b)| \le O\big(\sqrt{1/(N_t\delta_t)}\big)$ for any $(s,a,b)$, where we use $\hat{\cdot}$ to indicate the empirical quantities. The replacement of $P_{\Delta_A}(\cdot)$ by $P_{\Delta_A^t}(\cdot)$ adds an error term whose $\ell_\infty$-norm is at most $\delta_t$. Thus, in each iteration $t$, new error terms of order $O\big(\sqrt{1/(N_t\delta_t)} + \delta_t\big)$ are added. At iteration $t$, let $\bar x^{t+1}$ be the ideal variable computed from $\{x^j\}_{j\le t}$ with the exact values of the marginal information $r_x^t$ and $P_x^t$, and let $x^{t+1}$ be the real variable computed in the learning process; $\bar y^{t+1}$ and $y^{t+1}$ are defined similarly. Then, by the Cauchy-Schwarz inequality, $\mathrm{dist}^2(z^{t+1}, Z^*) \le (1 + c_0\eta^2/2)\,\mathrm{dist}^2(\bar z^{t+1}, Z^*) + (1 + 1/(2c_0\eta^2))\, O\big(1/(N_t\delta_t) + \delta_t^2\big)$. After adding these error terms to the proof of Theorem 5, the bound for the potential function $\Lambda_t$ defined in (27) becomes $\Lambda_{t+1} \le (1-c_0\eta^2)(1+c_0\eta^2/2)\,\Lambda_t + O\big(\tfrac{1}{c_0\eta^2}\big(\tfrac{1}{N_t\delta_t} + \delta_t^2\big)\big)$. Then, by setting $N_t = O\big(\tfrac{1}{c_0^3\eta^6(1-c_0\eta^2)^{2t}}\big)$, we can show by induction that $\Lambda_t \le O\big((1-c_0\eta^2/3)^t\big)$. This gives the local linear convergence of OGDA when the players can draw many samples in one iteration.

• Case 2: Using an ergodicity assumption. When no simulator is available, we consider translating our algorithm into a sample-based algorithm under an ergodicity assumption: there exists a constant $L_0 > 0$ such that for any policy pair $z = (x, y)$, if the min-player and the max-player play policies $x$ and $y$ respectively during $L_0$ successive iterations $t \in [T_0 : T_0 + L_0 - 1]$, then, for any initial state $s_{T_0}$ and any state $s \in S$, there exists $t' \in [T_0 : T_0 + L_0 - 1]$ such that $s_{t'} = s$. Briefly, this assumption requires that when the players follow a stationary policy pair for $L_0$ successive iterations, every state is visited at least once during these $L_0$ iterations. Under this assumption, our strategy is to regard $L_0 \times N_k$ successive iterations as one virtual iteration $k$. In this way, we divide $[1:T]$ into $[1:T] = [T_1 : T_2] \cup [T_3 : T_4] \cup \cdots \cup [T_{2k-1} : T_{2k}] \cup \cdots$, where $T_{2k} - T_{2k-1} = L_0 \times N_k$. Then, in the time interval $[T_{2k-1} : T_{2k}]$, each state $s$ has been visited at least $N_k$ times. This is similar to the case where we have a simulator and $N_k$ samples are drawn for each state $s$ at iteration $k$. Hence, by applying our algorithm and analysis for the simulator case (Case 1 above), we can show local linear convergence with respect to the virtual iteration number $k$.
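To make the truncated-simplex projection from Case 1 concrete, here is a minimal numpy sketch. The helper names and the reduction to an ordinary simplex projection (shift by delta, project onto a simplex of mass 1 - n*delta, shift back) are our own; the paper only specifies the constraint set $\Delta_A^t$.

```python
import numpy as np

def project_simplex(v, mass=1.0):
    """Euclidean projection of v onto {x >= 0, sum(x) = mass} (sorting-based method)."""
    n = len(v)
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - mass
    rho = np.nonzero(u * np.arange(1, n + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def project_truncated_simplex(v, delta):
    """Projection onto the truncated simplex {x : x(a) >= delta for all a, sum_a x(a) = 1}."""
    n = len(v)
    assert n * delta <= 1.0, "delta must satisfy n * delta <= 1 for a nonempty set"
    # Substituting y = x - delta*1 reduces the problem to projecting v - delta*1
    # onto the simplex of total mass 1 - n*delta.
    return project_simplex(v - delta, mass=1.0 - n * delta) + delta
```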

J.2 DISCUSSIONS ABOUT THE INTUITIONS BEHIND AVERAGING OGDA AND OGDA

• In Markov games, the main challenge in finding an NE is to estimate the minimax game values $\{v^*(s)\}$. If $\{v^*(s)\}$ were already known, the players could use $Q_s[v^*]\,y_s^t$ and $Q_s[v^*]^\top x_s^t$ as policy gradients to perform optimistic gradient descent/ascent. Then finding an NE reduces to solving the $S$ matrix games $\min_{x_s\in\Delta_A}\max_{y_s\in\Delta_B} x_s^\top Q_s[v^*]\, y_s$ separately. Approximating $\{v^*(s)\}_{s\in S}$ is difficult because (1) zero-sum Markov games are nonconvex-nonconcave problems, so solving for $\{v^*(s)\}$ is almost as hard as our goal of finding an NE; and (2) the players only know the marginal reward and marginal transition kernel, so from each player's perspective the observed marginal MDP is dynamic. More specifically, errors in the approximations of $\{v^*(s)\}$ cause inaccuracy in the approximate Q-functions and policy gradients; inaccurate policy gradients push $z^t$ away from the NE set, which in turn induces further errors in the approximations of $\{v^*(s)\}$. This easily generates a vicious circle and makes the errors blow up.

• Averaging OGDA and OGDA employ different ways to approximate $\{v^*(s)\}_{s\in S}$ in each iteration (see the sketch after this list). OGDA directly uses $V^{x^t,y^t}(s)$ to approximate $v^*(s)$: the min-player uses $Q_s^t y_s^t = Q_s[V^{x^t,y^t}]\, y_s^t$ as the approximate Q-function to perform optimistic gradient descent, while the max-player uses $(Q_s^t)^\top x_s^t = Q_s[V^{x^t,y^t}]^\top x_s^t$ to perform optimistic gradient ascent. On the other hand, in Averaging OGDA, the min-player and the max-player use $\underline{V}^t(s)$ and $\overline{V}^t(s)$ respectively to approximate $v^*(s)$: the min-player uses $\underline{q}_s^t = Q_s[\underline{V}^t]\, y_s^t$ as the approximate Q-function for optimistic gradient descent, while the max-player uses $\overline{q}_s^t = Q_s[\overline{V}^t]^\top x_s^t$ for optimistic gradient ascent. $\underline{V}^t(s)$ and $\overline{V}^t(s)$ are computed through the averaging technique. Taking the min-player as an example, $\underline{V}^t(s) = \min_{a\in\mathcal{A}} \sum_{j=T_1}^{t-1} \alpha_{t-T_1}^{j-T_1+1}\, \underline{q}_s^j(a)$ is a step of value iteration on the average of past Q-function approximations. We will elaborate on the intuition behind $\underline{V}^t(s)$ below, but before that, we first explain why the averaging step is needed to achieve global convergence (equivalently, the disadvantage of using $V^{x^t,y^t}(s)$ to approximate $v^*(s)$).

• Technical challenge in the analysis of OGDA. The main difficulty in the analysis of OGDA is the nonconvex-nonconcave nature of zero-sum Markov games discussed in Section 5.1. To address the two questions that arise here, namely whether the time-varying coefficients $d_x^t(s), d_y^t(s)$ in (16) are "stable" near the NE set, and whether the difference between $Q_s^t$ and $Q_s^*$ prevents the local linear convergence, we mainly use the two geometric observations in Section 5. Observation I guarantees that the progress of projected gradient descent/ascent is substantial, which means that the difference between $Q_s^t$ and $Q_s^*$ is not troublesome in deriving the local linear convergence. Observation II implies the stability of the state visitation distribution: the time-varying coefficients $d_x^t(s), d_y^t(s)$ are "stable" when $z^t, \bar z^t$ approach the NE set. In other words, we can find a problem-dependent neighborhood in which the time-varying coefficients $d_x^t(s), d_y^t(s)$ possess some "stability". We remark that since $d_x^t(s), d_y^t(s)$ can vary rapidly when $z^t$ is far from the NE set, our analysis for OGDA only holds in a small neighborhood of the NE set. Whether OGDA has global convergence is still an open problem.
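For concreteness, the following numpy sketch illustrates (i) one optimistic projected-gradient step for the min-player at a single state, written in a common "past-gradient" form of OGDA, and (ii) the averaging-style value update used by Averaging OGDA. The function names, the specific OGDA form, and the projection helper are our own simplifications; the paper's actual updates maintain additional auxiliary iterates.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sorting-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def ogda_min_player_step(x_s, g_prev, Q_s_approx, y_s, eta):
    """One optimistic projected-gradient step for the min-player at a single state s.

    g = Q_s_approx @ y_s plays the role of the approximate Q-function Q_s[V] y_s^t;
    the optimistic update extrapolates with 2*g - g_prev ("past-gradient" form).
    """
    g = Q_s_approx @ y_s
    x_next = project_simplex(x_s - eta * (2.0 * g - g_prev))
    return x_next, g

def averaging_value_update(q_history, weights):
    """Averaging-style value estimate: min over actions of the weighted average of past q_s^j."""
    q_avg = sum(w * q for w, q in zip(weights, q_history))
    return float(np.min(q_avg))
```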



$[\underline{I}_{\mathrm{lf}}^k : \overline{I}_{\mathrm{lf}}^k]$ is the time interval of the $k$-th call to Local-Fast and $|[\underline{I}_{\mathrm{lf}}^k : \overline{I}_{\mathrm{lf}}^k]| = 4^k$. The switching scheme of the Homotopy-PO method can be summarized as follows, starting from $k = 1$:
• (Step 1) during the time interval $[\underline{I}_{\mathrm{gs}}^k : \overline{I}_{\mathrm{gs}}^k]$, run Global-Slow for $|[\underline{I}_{\mathrm{gs}}^k : \overline{I}_{\mathrm{gs}}^k]| = 2^k$ iterations with the initial policy $z^{\overline{I}_{\mathrm{lf}}^{k-1}}$ (for $k \ge 1$, it is the last-iterate policy of the last call to Local-Fast);
• (Step 2) during the time interval $[\underline{I}_{\mathrm{lf}}^k : \overline{I}_{\mathrm{lf}}^k]$, run Local-Fast for $|[\underline{I}_{\mathrm{lf}}^k : \overline{I}_{\mathrm{lf}}^k]| = 4^k$ iterations with the initial policy $\bar z^{[\underline{I}_{\mathrm{gs}}^k : \overline{I}_{\mathrm{gs}}^k]}$, i.e., the average policy of the last call to Global-Slow;
• (Step 3) $k \leftarrow k + 1$, go to Step 1.
A minimal sketch of this switching loop is given right after this paragraph. Now, we elaborate on how Homotopy-PO achieves global linear convergence given a Global-Slow base algorithm and a Local-Fast base algorithm. Specifically, there are two hidden phases which are oblivious to the players and are only used for the analysis. The two phases are split by $k$
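As referenced above, here is a minimal Python sketch of the doubling switching loop. The callable names and their return conventions (Global-Slow returning an average policy, Local-Fast returning its last iterate) are our own assumptions for illustration, not the paper's exact interfaces.

```python
def homotopy_po(z0, global_slow, local_fast, num_rounds):
    """Sketch of the Homotopy-PO switching scheme (all names are placeholders).

    global_slow(T, z_init) runs the Global-Slow base algorithm for T iterations
    from z_init and returns (last_iterate, average_policy).
    local_fast(T, z_init) runs the Local-Fast base algorithm for T iterations
    from z_init and returns its last iterate.
    """
    z_lf_last = z0
    for k in range(1, num_rounds + 1):
        # Step 1: 2^k iterations of Global-Slow, warm-started at the last
        # iterate of the previous Local-Fast call (or at z0 when k = 1).
        _, z_gs_avg = global_slow(2 ** k, z_lf_last)
        # Step 2: 4^k iterations of Local-Fast, warm-started at the average
        # policy returned by the Global-Slow call.
        z_lf_last = local_fast(4 ** k, z_gs_avg)
    return z_lf_last
```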

In Hidden Phase II, the increase caused by Global-Slow is dominated by the decrease from Local-Fast, and thus, can be "omitted". More specifically, in segment AB, $\mathrm{dist}^2(z^t, Z^*)$ contracts at a rate of $(1 - c_0\eta^2)$ for $|[\underline{I}_{\mathrm{lf}}^k : \overline{I}_{\mathrm{lf}}^k]| = 4^k$ iterations, while in segment BC, $\mathrm{dist}^2(z^t, Z^*)$ can grow at a rate of $D_0$ for $|[\underline{I}_{\mathrm{gs}}^{k+1} : \overline{I}_{\mathrm{gs}}^{k+1}]| = 2^{k+1}$ iterations. Then, since $4^k / 2^{k+1} = 2^{k-1}$, if one step of increase from Global-Slow is much smaller than $2^{k-1}$ steps of decrease from Local-Fast, i.e., $D_0 (1 - c_0\eta^2/2)^{2^{k-1}} \ll 1$, then we obtain the global linear convergence (see the line AC in Figure 1).
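As a rough consistency check (our own heuristic rearrangement, suppressing the constants and slack factors handled in the formal proof), chaining one Local-Fast call with the subsequent Global-Slow call gives

\[
\mathrm{dist}^2\big(z^{\overline{I}^{\,k+1}_{\mathrm{gs}}}, Z^*\big)
\;\lesssim\; D_0^{\,2^{k+1}}\,(1 - c_0\eta^2)^{4^k}\,
\mathrm{dist}^2\big(z^{\underline{I}^{\,k}_{\mathrm{lf}}}, Z^*\big)
\;=\;\Big[D_0\,(1 - c_0\eta^2)^{2^{k-1}}\Big]^{2^{k+1}}
\mathrm{dist}^2\big(z^{\underline{I}^{\,k}_{\mathrm{lf}}}, Z^*\big),
\]

so the bracketed factor being far below one is essentially the condition $D_0(1-c_0\eta^2/2)^{2^{k-1}} \ll 1$ stated above, up to the slack between $c_0\eta^2$ and $c_0\eta^2/2$.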


Figure 1: An illustration of the upper bound for $\log(\mathrm{dist}(z^t, Z^*))$ in Hidden Phase II.

), and Global-Slow uses Averaging OGDA with $\eta' \le \frac{1-\gamma}{16\max\{A,B\}}$. Then, there exist problem-dependent constants $c \in (0, 1)$ and $M > 0$ such that for any $t \in [0$

)) in Theorem 5, by setting $\eta = O((1-\gamma)^{5/2}$

Proof of Theorem 2. By Theorem 3 and Theorem 4, Averaging OGDA can serve as the base algorithm Global-Slow in the meta algorithm Homotopy-PO. By Theorem 5, OGDA can serve as the base algorithm Local-Fast in the meta algorithm Homotopy-PO. Then, by Theorem 1, we obtain the global linear convergence of the instantiation of Homotopy-PO with OGDA and Averaging OGDA.

$(1-\gamma)^{25} c_+^{6}$. This completes the proof of the global linear convergence of our instantiation of Homotopy-PO. □ Remark 2 Theorem 2 requires $\eta \le O((1-\gamma)\cdots)$ for OGDA and $\eta' \le O\big(\frac{1-\gamma}{A+B}\big)$ for Averaging OGDA. If we set $\eta = O((1-\gamma)^{5/2}$

$\sum_{s_1\in S}\big|P(s, s_1) - P'(s, s_1)\big| \le \sum_{s_1\in S}\sum_{a\in A}\sum_{b\in B} |x_s(a) - x'_s(a)|\, y_s(b)\, P(s_1|s, a, b) + \sum_{s_1\in S}\sum_{a\in A}\sum_{b\in B} x'_s(a)\, |y_s(b) - y'_s(b)|\, P(s_1|s, a, b)$

$T\}$: during the time interval $[\underline{I}_{\mathrm{gs}}^k : \overline{I}_{\mathrm{gs}}^k]$, run x-Averaging-OGDA($[\underline{I}_{\mathrm{gs}}^k : \overline{I}_{\mathrm{gs}}^k]$, $x^{\overline{I}_{\mathrm{lf}}^{k-1}}$, $\eta'$) and compute an average policy $\bar x^{[\underline{I}_{\mathrm{gs}}^k : \overline{I}_{\mathrm{gs}}^k]}$ (Algorithm 4); during the time interval $[\underline{I}_{\mathrm{lf}}^k : \overline{I}_{\mathrm{lf}}^k]$, run x-OGDA($[\underline{I}_{\mathrm{lf}}^k$

combining Theorem 1 of Wei et al. (2021) with (100), (101), (57), (58), if $\eta = O((1-\gamma)$

)) for OGDA and $\eta' = O\big(\frac{(1-\gamma)^{5/2}}{\sqrt{S}}\big)$ for Algorithm 1 in Wei et al. (2021), then the length of Hidden Phase I is of order $128 M^*/3 = O\Big(\frac{S^{32}(A+B)^{10}}{c_+^{16}(1-\gamma)^{74}}\Big)$.

I NUMERICAL EXPERIMENTS

In this section, we evaluate the numerical performance of Homotopy-PO where Local-Fast and Global-Slow are instantiated with OGDA and Averaging OGDA respectively.

Markov game model. We generate a sequence of zero-sum Markov games randomly and independently in the way described below and test the performance of Homotopy-PO on each of the games. In each generated Markov game, the number of states is S = 10, the min-player and max-player have A = B = 10 actions respectively, and the discount factor is γ = 0.99. The reward functions $\{R_s(a, b)\}_{s\in S, a\in A, b\in B}$ are generated from the uniform distribution on [0, 1] independently. To generate the transition kernel, for each (s, a, b), we first choose an integer $i_{s,a,b}$ uniformly at random from [S]. Then, we choose a random subset $M_{s,a,b} \subseteq S$ with $|M_{s,a,b}| = i_{s,a,b}$. For each $s' \in M_{s,a,b}$, we draw $\widetilde{P}(s'|s, a, b)$ from the uniform distribution on [0, 1] independently, and for $s' \in S \setminus M_{s,a,b}$, we set $\widetilde{P}(s'|s, a, b) = 0$. Finally, we normalize $P(s'|s, a, b) = \widetilde{P}(s'|s, a, b) / \sum_{s''\in S}\widetilde{P}(s''|s, a, b)$ for each (s, a, b) to obtain the transition kernel. For the initial policies, we first generate $\{u_s\}_{s\in S}$ with $u_s(a)$ chosen from the uniform distribution on [0, 1]
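The generation procedure above is straightforward to reproduce. The following numpy sketch is our own illustrative implementation of the reward and transition-kernel generation (the function name and the seeding convention are assumptions; the initial-policy construction is omitted since its description is cut off above):

```python
import numpy as np

def random_zero_sum_markov_game(S=10, A=10, B=10, seed=0):
    """Generate rewards R_s(a, b) and a transition kernel P(s'|s, a, b) as described above."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(0.0, 1.0, size=(S, A, B))            # rewards in [0, 1]
    P = np.zeros((S, A, B, S))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                i = rng.integers(1, S + 1)                        # support size i_{s,a,b} from [S]
                support = rng.choice(S, size=i, replace=False)    # random subset M_{s,a,b}
                P[s, a, b, support] = rng.uniform(0.0, 1.0, size=i)
                P[s, a, b] /= P[s, a, b].sum()                    # normalize to a distribution
    return R, P
```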

Figure 2: The numerical performance of Homotopy-PO when Global-Slow and Local-Fast are instantiated by Averaging OGDA and OGDA. The trajectory is the average of 10 random and independent trials. The x-axis represents the iteration number, while the y-axis represents the logarithm of the Nash gap. The shaded area shows the standard deviations of these trials. The vertical dotted line is drawn at the end of the 7-th call to OGDA (iteration t = 22098). On the RHS of the dotted line (equivalently, for t > 22098), the algorithm exhibits fast linear convergence. In our switching pattern, 2^k ≪ 4^k when k is large, so Averaging OGDA is almost "invisible". We magnify the 9-th call to Averaging OGDA as a subfigure. Though Averaging OGDA can cause an increase, this increase is negligible compared with the decrease from hundreds of times more steps of OGDA. This aligns with our theoretical guarantees (see the relation between segments AB and BC in Figure 1).
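Since the y-axis above is the logarithm of the Nash gap, it may help to spell out one way to compute that gap for a tabular game. The following numpy sketch (the function name, the einsum-based marginalization, and the use of plain value iteration for the best responses are our own choices, not the paper's code) evaluates $V^{x,\dagger}(\rho_0) - V^{\dagger,y}(\rho_0)$:

```python
import numpy as np

def nash_gap(R, P, gamma, x, y, rho0, iters=2000):
    """Nash gap V^{x,dagger}(rho0) - V^{dagger,y}(rho0) for a policy pair (x, y).

    R: (S, A, B) rewards, P: (S, A, B, S) transitions, x: (S, A), y: (S, B),
    rho0: (S,) initial distribution.  Best responses are computed by value
    iteration on the induced single-agent MDPs (a sketch, not optimized).
    """
    S = R.shape[0]

    def best_response_value(maximize):
        # Marginalize out the fixed opponent to get an induced MDP, then run VI.
        if maximize:   # max-player best-responds to the fixed x
            r = np.einsum('sa,sab->sb', x, R)       # (S, B)
            p = np.einsum('sa,sabt->sbt', x, P)     # (S, B, S)
        else:          # min-player best-responds to the fixed y
            r = np.einsum('sb,sab->sa', y, R)       # (S, A)
            p = np.einsum('sb,sabt->sat', y, P)     # (S, A, S)
        V = np.zeros(S)
        for _ in range(iters):
            Q = r + gamma * (p @ V)                 # (S, num_actions)
            V = Q.max(axis=1) if maximize else Q.min(axis=1)
        return V

    return float(rho0 @ (best_response_value(True) - best_response_value(False)))
```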

Figure 5: As a complement to Figure 4, this figure shows the remaining 6 trajectories of the 10 random and independent trials with the switching pattern described for Figure 3. The caption of this figure has been integrated into that of Figure 4.


In Averaging OGDA, $x^t, y^t$ are the policies played at iteration $t \in [T_1 : T_2]$, while the auxiliary iterates $\bar x^t, \bar y^t$ and the value estimates $\underline V^t, \overline V^t$ are local auxiliary variables that help to generate the sequences $x^t, y^t$. The global convergence and geometric boundedness of Averaging OGDA are shown in Section 4. The RHS of (2) in the definition of Global-Slow can be directly extended to different convergence rates, and more algorithms, such as Algorithm 1 in Wei et al. (2021) with a different initialization, can serve as the generalized Global-Slow. More details are given in Appendix H.

Recall that in Algorithm 1, the $k$-th call to Global-Slow has $2^k$ iterations while the $k$-th call to Local-Fast has $4^k$ iterations. It is worth noting that the choices of $2^k$ and $4^k$ in Algorithm 1 are only for simplicity. The proofs of linear convergence of Homotopy-PO can be directly generalized to the case where the $k$-th calls to Global-Slow and Local-Fast have $u^k$ and $v^k$ iterations, respectively.


In this section, we mainly (1) show the convergence results of Homotopy-PO when the Global-Slow base algorithm has different rates on the RHS of (2); and (2) provide another example of a Global-Slow base algorithm with generalized global convergence rates, by proving the geometric boundedness of Algorithm 1 in Wei et al. (2021) with a slightly modified initialization.

H.1 CONVERGENCE RESULT OF HOMOTOPY-PO WHEN GLOBAL-SLOW HAS DIFFERENT CONVERGENCE RATES

To avoid abuse of notation, we refer to the Global-Slow algorithm with more general global convergence rates as the Generalized Global-Slow base algorithm.

Generalized Global-Slow base algorithm: by calling Gen-Global-Slow($[T_1 : T_2]$, $\tilde z$, $\eta'$) during the time interval $[T_1 : T_2]$, where $\tilde z = (\tilde x, \tilde y)$ is the initial policy pair, the players play the policy pair $z^t = (x^t, y^t)$ at each iteration $t \in [T_1 : T_2]$ and compute a policy pair $\bar z^{[T_1:T_2]} = (\bar x^{[T_1:T_2]}, \bar y^{[T_1:T_2]})$ at the end of iteration $T_2$, such that $z^t$ and $\bar z^{[T_1:T_2]}$ satisfy the following two properties:
• global convergence: there is a problem-dependent constant $C' > 0$ and real numbers $p_1 > 0$ and $p_2, p_3 \ge 0$ such that (99) holds;
• geometric boundedness: there exists a problem-dependent constant $D_0 \ge 1$ such that the corresponding geometric-boundedness bound holds.
The main differences between Gen-Global-Slow and Global-Slow are that (1) the RHS of (99) in the definition of Gen-Global-Slow allows more flexibility in the power numbers than condition (2) in the definition of Global-Slow; and (2) $\bar z^{[T_1:T_2]}$ need not be an average policy. In the example (104) below, we can simply set $\bar z^{[T_1:T_2]} = z^{T_2}$.

By arguments similar to those for Theorem 1, we have the following convergence rates for Homotopy-PO with the generalized Global-Slow.

Theorem 7 Let $\{z^t = (x^t, y^t)\}_{t\in[0:T]}$ be the policy pairs played when running Homotopy-PO (Algorithm 1) where Global-Slow is replaced by Gen-Global-Slow. Then, for any $t \in [0 : T]$, ...

Then, the proof is completed by the facts that $\max_{s\in S}\|z_s^j - z_s^{j*}\|_1 \le \sqrt{A+B}\,\mathrm{dist}(z^j, Z^*)$ and $\sum_{j=T_1}^{t}\beta_{t-T_1+1}^{j-T_1+1} = 1$. □

Lemma 24 Let $\{z^t, V^t\}$ be generated from (104) with initialization (103). Then, for any $t \ge T_1$, ...

Proof of Lemma 24. By applying Lemma 2 to (104) and substituting (105), we obtain a bound on $\mathrm{dist}^2(\bar z^{t+1}, Z^*)$. The bound on $\mathrm{dist}^2(z^{t+1}, Z^*)$ then follows from the fact that $\mathrm{dist}^2(z^{t+1}, Z^*) \le 2\,\mathrm{dist}^2(\bar z^{t+1}, Z^*) + 2\|z^{t+1} - \bar z^{t+1}\|^2$ and substituting the bound for $\mathrm{dist}^2(\bar z^{t+1}, Z^*)$. □

Next, we show the geometric boundedness of Algorithm 1 in Wei et al. (2021) with the initialization (103).

Theorem 8 (Geometric Boundedness of Algorithm 1 in Wei et al. (2021)) Let $\{z^t\}_{t\in[T_1:T_2]}$ be the policy pairs played by running the algorithm (104) with initialization (103). Then, (106) holds.

Proof of Theorem 8. We will show (106) by proving (112) inductively. First, we define some constants which are used in the definition of $D_0$. By Lemma 23, for $t \ge T_1$, ... By Lemma 24 and the condition $\eta \le 1$, for $t \ge T_1$, ... For the initialization (103), by (91), ... We then define $D_0$ accordingly; by definition, $D_0 \le O\big(S(A+B)^2/(1-\gamma)^4\big)$. Now, we proceed to prove (112) by induction. The case $j = T_1$ follows from (110) and the initialization $z^{T_1} = \bar z^{T_1} = \tilde z$. Suppose we have shown (112) for $j \in [T_1 : t]$. Then, by (107), the fact $D_0 \ge 1$, and the induction hypothesis, ... By (108) and the induction hypothesis, ... Analogously, by (109) and the induction hypothesis, ...

Let $\boldsymbol{\rho}_0$ denote the uniform distribution on $S$. As $(x^{t*}, y^{t*})$ attains a Nash equilibrium, ... $V^{x^{t+1},y^{t+1}}(\boldsymbol{\rho}_0)$. Then, by applying the performance difference lemma (Lemma 20), we obtain a bound as in (16), which can be regarded as a variant of the MVI property with time-varying coefficients. In order to utilize (16) to obtain local linear convergence, we still need to tackle the following two problems:
• whether we can find a neighborhood of the NE set such that the time-varying coefficients $d_x^t(s)$, $d_y^t(s)$ in (16) are "stable"?
• if the time-varying coefficients $d_x^t(s)$, $d_y^t(s)$ in (16) can be "stable" in a small neighborhood of the NE set, will the difference between $Q_s^t$ and $Q_s^*$ prevent the local linear convergence?

