POLICY OPTIMIZATION IN ZERO-SUM MARKOV GAMES: FICTITIOUS SELF-PLAY PROVABLY ATTAINS NASH EQUILIBRIA

Abstract

Fictitious Self-Play (FSP) has achieved significant empirical success in solving extensive-form games. From a theoretical perspective, however, it remains unknown whether FSP is guaranteed to converge to Nash equilibria in Markov games. As an initial attempt, we propose an FSP algorithm for two-player zero-sum Markov games, dubbed smooth FSP, in which both agents adopt an entropy-regularized policy optimization method against each other. Smooth FSP builds upon a connection between smooth fictitious play and the policy optimization framework. Specifically, in each iteration, each player infers the policy of the opponent implicitly via policy evaluation and improves its current policy by taking a smoothed best response via a proximal policy optimization (PPO) step. Moreover, to tame the non-stationarity caused by the opponent, we incorporate entropy regularization into PPO for algorithmic stability. When both players adopt smooth FSP simultaneously, i.e., with self-play, in a class of games with Lipschitz continuous transitions and rewards, we prove that the sequence of joint policies converges to a neighborhood of a Nash equilibrium at a sublinear $\tilde{O}(1/T)$ rate, where $T$ is the number of iterations. To the best of our knowledge, this is the first finite-time convergence guarantee for FSP-type algorithms in zero-sum Markov games.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) (Bu et al., 2008; Sutton & Barto, 2018) has achieved great empirical success, e.g., in playing the game of Go (Silver et al., 2016; 2017), Dota 2 (Berner et al., 2019), and StarCraft 2 (Vinyals et al., 2019), all driven by policy optimization algorithms that iteratively update policies parameterized by deep neural networks. Empirically, the popularity of policy optimization algorithms for MARL is attributed to the observation that they usually converge faster than value-based methods, which iteratively update value functions (Mnih et al., 2016; O'Donoghue et al., 2016). Compared with their empirical success, the theoretical aspects of policy optimization algorithms in the MARL setting (Littman, 1994; Hu & Wellman, 2003; Conitzer & Sandholm, 2007; Pérolat et al., 2016; Zhang et al., 2018) remain less understood. Although convergence guarantees for various policy optimization algorithms have been established in the single-agent RL setting (Sutton et al., 2000; Konda & Tsitsiklis, 2000; Kakade, 2002; Agarwal et al., 2019; Wang et al., 2019), extending those guarantees to arguably one of the simplest MARL settings, the two-player zero-sum Markov game, faces challenges in the following two aspects. First, in such a Markov game, each agent interacts with the opponent as well as the environment. From the perspective of each agent, the environment is altered by the actions of the opponent. As a result, due to the existence of an opponent, the policy optimization problem of each agent has a time-varying objective function, which is in stark contrast with value-based methods such as value iteration (Shapley, 1953; Littman, 1994), where a central controller specifies the policies of both players.
When the joint policy of both players is considered, solving for the optimal value function corresponds to finding the fixed point of the Bellman operator, which is defined independently of the players' policies. Second, when viewing policy optimization in a zero-sum Markov game as a joint optimization problem for both players, although the objective function is fixed, the problem is a minimax optimization with a non-convex non-concave objective. Even in classical optimization, such problems remain less understood (Cherukuri et al., 2017; Rafique et al., 2018; Daskalakis & Panageas, 2018; Mertikopoulos et al., 2018). It is observed that first-order methods such as gradient descent might fail to converge (Balduzzi et al., 2018; Mazumdar & Ratliff, 2018). As an initial step towards studying policy optimization for MARL, we propose a novel policy optimization algorithm for any player of a multi-player Markov game, dubbed smooth fictitious self-play (FSP). Specifically, when a player adopts smooth FSP, in each iteration it first solves a policy evaluation problem that estimates the value function associated with the current joint policy of all players. It then updates its own policy via an entropy-regularized proximal policy optimization (PPO) (Schulman et al., 2017) step, where the update direction is obtained from the estimated value function. This algorithm can be viewed as an extension of the fictitious play (FP) algorithm, originally designed for normal-form games (Von Neumann & Morgenstern, 2007; Shapley, 1953) and extensive-form games (Heinrich et al., 2015; Perolat et al., 2018), to Markov games. FP is a general algorithmic framework for solving games in which an agent first infers the policies of the opponents and then adopts a policy that best responds to the inferred opponents.
When viewing our algorithm as an FP method, instead of estimating the policies of the opponents directly, the agent infers the opponents implicitly by estimating the value function. Besides, the policy update corresponds to a smoothed best-response policy (Swenson & Poor, 2019) based on the inferred value function. To examine the theoretical merits of the proposed algorithm, we focus on two-player zero-sum Markov games and let both players follow smooth FSP, i.e., with self-play. Moreover, we restrict to a class of Lipschitz games (Radanovic et al., 2019) in which the impact of each player's policy change on the environment is Lipschitz continuous with respect to the magnitude of the policy change. For such a Markov game, we tackle the challenge of non-stationarity by imposing entropy regularization, which brings algorithmic stability. In addition, to establish convergence to a Nash equilibrium, we explicitly characterize the geometry of the policy optimization problem from a functional perspective. Specifically, we prove that the objective function, as a bivariate function of the two players' policies, despite being non-convex and non-concave, satisfies a one-point strong monotonicity condition (Facchinei & Pang, 2007) at a Nash equilibrium. Thanks to such benign geometry, we prove that smooth FSP converges to a neighborhood of a Nash equilibrium at a sublinear $\tilde{O}(1/T)$ rate, where $T$ is the number of policy iterations and $\tilde{O}$ hides logarithmic factors. Moreover, as a byproduct of our analysis, if either of the two players deviates from the proposed algorithm, we show that the other player, which follows smooth FSP, exploits such a deviation by finding the best-response policy at the same sublinear rate. Such a Hannan consistency property exhibited by our algorithm is related to Hennes et al. (2020), which focuses on normal-form games. Thus, our results also serve as a first step towards connecting regret minimization in normal-form/extensive-form games and Markov games. Contribution. 
Our contribution is two-fold. First, we propose a novel policy optimization algorithm for Markov games, which can be viewed as a generalization of FP. Second, when applied to a class of two-player zero-sum Markov games satisfying a Lipschitz regularity condition, our algorithm provably enjoys global convergence to a neighborhood of a Nash equilibrium at a sublinear rate. To the best of our knowledge, this is the first provable FSP-type algorithm with a finite-time convergence guarantee for zero-sum Markov games. Related Work. There is a large body of literature on value-based methods for zero-sum Markov games (Lagoudakis & Parr, 2012; Pérolat et al., 2016; Zhang et al., 2018; Zou et al., 2019). More recently, Perolat et al. (2018) prove that actor-critic fictitious play asymptotically converges to the Nash equilibrium, while our work provides a finite-time convergence guarantee to a neighborhood of a Nash equilibrium. In addition, Zhang et al. (2020) study the sample complexity of a planning algorithm in the model-based MARL setting, as opposed to the model-free setting with function approximation in this paper. Closely related to the smooth FSP proposed in this paper, there is a line of work on best-response algorithms (Heinrich et al., 2015; Heinrich & Silver, 2016), which have also shown great empirical performance (Dudziak, 2006; Xiao et al., 2013; Kawamura et al., 2017). However, they are only applicable to extensive-form games and do not directly apply to stochastic games. Also, our smooth FSP is related to Swenson & Poor (2019), which focuses on potential games. That work does not enforce entropy regularization and only provides an asymptotic convergence guarantee to a neighborhood of the Nash equilibrium for smooth fictitious play in multi-player two-action potential games. 
Moreover, our work also falls into the realm of regularizing and smoothing techniques in reinforcement learning (Dai et al., 2017; Geist et al., 2019; Shani et al., 2019; Cen et al., 2020) , which focus on the single-agent setting.

2. BACKGROUND

In this section, we briefly introduce the general setting of reinforcement learning for two-player zero-sum Markov games.

Zero-Sum Markov Games.

We consider the two-player zero-sum Markov game $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma)$, where $\mathcal{S} \subset \mathbb{R}^d$ is a compact state space, $\mathcal{A}^1$ and $\mathcal{A}^2$ are the finite action spaces of Player 1 and Player 2, respectively, $P: \mathcal{S} \times \mathcal{S} \times \mathcal{A}^1 \times \mathcal{A}^2 \to [0, 1]$ is the Markov transition kernel, $r: \mathcal{S} \times \mathcal{A}^1 \times \mathcal{A}^2 \to [-1, 1]$ is the reward function of Player 1, which implies that the reward function of Player 2 is $-r$, and $\gamma \in (0, 1)$ is the discount factor. Let $r_1 = r$ and $r_2 = -r$ be the reward functions of Player 1 and Player 2, respectively. For notational simplicity, throughout this paper, we write Player $-i$ for Player $i$'s opponent, where $i \in \{1, 2\}$. In the rest of this paper, we omit $i \in \{1, 2\}$ where it is clear from the context. Also, we denote by $\mathbb{E}_{\pi^i, \pi^{-i}}[\,\cdot\,]$ the expectation over the trajectory induced by the policy pair $[\pi^i; \pi^{-i}]$. Given a policy $\pi^{-i}: \mathcal{A}^{-i} \times \mathcal{S} \to [0, 1]$ of Player $-i$, the performance of a policy $\pi^i: \mathcal{A}^i \times \mathcal{S} \to [0, 1]$ of Player $i$ is evaluated by its state-value function ($V_i$-function) $V_i^{\pi^i, \pi^{-i}}: \mathcal{S} \to \mathbb{R}$, which is defined as
$$V_i^{\pi^i, \pi^{-i}}(s) = \mathbb{E}_{\pi^i, \pi^{-i}}\Bigl[\sum_{t=0}^{\infty} \gamma^t \cdot r_i(s_t, a_t^i, a_t^{-i}) \,\Big|\, s_0 = s\Bigr]. \qquad (2.1)$$
Correspondingly, the performance of a policy $\pi^i$ of Player $i$ is also evaluated by its action-value function ($Q_i$-function) $Q_i^{\pi^i, \pi^{-i}}: \mathcal{S} \times \mathcal{A}^i \times \mathcal{A}^{-i} \to \mathbb{R}$, which is defined by the following Bellman equation,
$$Q_i^{\pi^i, \pi^{-i}}(s, a^i, a^{-i}) = r_i(s, a^i, a^{-i}) + \gamma \cdot \mathbb{E}_{s' \sim P(\cdot \,|\, s, a^i, a^{-i})}\bigl[V_i^{\pi^i, \pi^{-i}}(s')\bigr].$$
We denote by $\nu_{\pi^i, \pi^{-i}}(s)$ and $\sigma_{\pi^i, \pi^{-i}}(s, a^i, a^{-i}) = \pi^i(a^i \,|\, s) \cdot \pi^{-i}(a^{-i} \,|\, s) \cdot \nu_{\pi^i, \pi^{-i}}(s)$ the stationary state distribution and the stationary state-action distribution associated with the policy pair $[\pi^i; \pi^{-i}]$, respectively. Correspondingly, we denote by $\mathbb{E}_{\sigma_{\pi^i, \pi^{-i}}}[\,\cdot\,]$ and $\mathbb{E}_{\nu_{\pi^i, \pi^{-i}}}[\,\cdot\,]$ the expectations $\mathbb{E}_{(s, a^i, a^{-i}) \sim \sigma_{\pi^i, \pi^{-i}}}[\,\cdot\,]$ and $\mathbb{E}_{s \sim \nu_{\pi^i, \pi^{-i}}}[\,\cdot\,]$, respectively. Throughout this paper, we denote by $\langle \cdot, \cdot \rangle$ the inner product between vectors.
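For concreteness, the definitions above can be checked numerically in a small tabular instance. The following Python sketch (the dimensions and the random game are illustrative, not from the paper) evaluates the $V_i$-function in (2.1) by solving the linear Bellman system for a fixed policy pair and verifies the Bellman equation for the $Q_i$-function:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA1, nA2, gamma = 4, 3, 3, 0.9  # illustrative sizes

# Random transition kernel P(s' | s, a1, a2) and reward r(s, a1, a2) in [-1, 1].
P = rng.random((nS, nA1, nA2, nS)); P /= P.sum(-1, keepdims=True)
r = rng.uniform(-1.0, 1.0, (nS, nA1, nA2))

# Random stochastic policies pi1(a1 | s) and pi2(a2 | s).
pi1 = rng.random((nS, nA1)); pi1 /= pi1.sum(-1, keepdims=True)
pi2 = rng.random((nS, nA2)); pi2 /= pi2.sum(-1, keepdims=True)

# Marginalize over the joint policy to get the induced chain and reward.
P_pi = np.einsum('sa,sb,sabt->st', pi1, pi2, P)   # state transition kernel
r_pi = np.einsum('sa,sb,sab->s', pi1, pi2, r)     # expected one-step reward

# V-function of Player 1, eq. (2.1): V = r_pi + gamma * P_pi V.
V1 = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Q-function via the Bellman equation: Q = r + gamma * E_{s'}[V(s')].
Q1 = r + gamma * np.einsum('sabt,t->sab', P, V1)
assert np.allclose(np.einsum('sa,sb,sab->s', pi1, pi2, Q1), V1)
```

Since the reward is bounded in $[-1, 1]$, the resulting $V_i$-function is bounded in magnitude by $1/(1-\gamma)$, which the sketch confirms.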

Let $[\pi^1_*, \pi^2_*]$ be a Nash equilibrium of the two-player zero-sum Markov game $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma)$, which exists (Shapley, 1953) and satisfies
$$J(\pi^1, \pi^2_*) \le J(\pi^1_*, \pi^2_*) \le J(\pi^1_*, \pi^2)$$
for all policy pairs $[\pi^1; \pi^2]$. Here we define the performance function as
$$J(\pi^1, \pi^2) = \mathbb{E}_{\nu_*}\bigl[V_1^{\pi^1, \pi^2}(s)\bigr], \qquad (2.2)$$
where $\nu_*$ is the stationary distribution $\nu_{\pi^1_*, \pi^2_*}$.

Regularized Markov Games. Based on the definition of the two-player zero-sum Markov game $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma)$, we define its entropy-regularized counterpart $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma, \lambda_1, \lambda_2)$, where $\lambda_1, \lambda_2 \ge 0$ are the regularization parameters. Specifically, $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma, \lambda_1, \lambda_2)$ is defined as the two-player general-sum Markov game with the reward function of Player $i$ replaced by its entropy-regularized counterpart $r^{\pi^i, \pi^{-i}}_{(i)}: \mathcal{S} \times \mathcal{A}^i \times \mathcal{A}^{-i} \to \mathbb{R}$, which is defined as
$$r^{\pi^i, \pi^{-i}}_{(i)}(s, a^i, a^{-i}) = r_i(s, a^i, a^{-i}) - \lambda_i \cdot \log \pi^i(a^i \,|\, s). \qquad (2.3)$$
With a slight abuse of notation, we write
$$r^{\pi^i, \pi^{-i}}_i(s) = \mathbb{E}_{\pi^i, \pi^{-i}}\bigl[r_i(s, a^i, a^{-i})\bigr], \qquad r^{\pi^i, \pi^{-i}}_{(i)}(s) = \mathbb{E}_{\pi^i, \pi^{-i}}\bigl[r^{\pi^i, \pi^{-i}}_{(i)}(s, a^i, a^{-i})\bigr] = r^{\pi^i, \pi^{-i}}_i(s) + \lambda_i \cdot H\bigl(\pi^i(\cdot \,|\, s)\bigr)$$
for the state-reward function and the entropy-regularized state-reward function, respectively. Here $H(\pi^i(\cdot \,|\, s)) = -\sum_{a^i \in \mathcal{A}^i} \pi^i(a^i \,|\, s) \cdot \log \pi^i(a^i \,|\, s)$ is the Shannon entropy. For Player $i$, the entropy-regularized state-value function ($V_{(i)}$-function) $V^{\pi^i, \pi^{-i}}_{(i)}: \mathcal{S} \to \mathbb{R}$ and the entropy-regularized action-value function ($Q_{(i)}$-function) $Q^{\pi^i, \pi^{-i}}_{(i)}: \mathcal{S} \times \mathcal{A}^i \times \mathcal{A}^{-i} \to \mathbb{R}$ are defined as
$$V^{\pi^i, \pi^{-i}}_{(i)}(s) = \mathbb{E}_{\pi^i, \pi^{-i}}\Bigl[\sum_{t=0}^{\infty} \gamma^t \cdot r^{\pi^i, \pi^{-i}}_{(i)}(s_t, a^i_t, a^{-i}_t) \,\Big|\, s_0 = s\Bigr], \qquad (2.4)$$
$$Q^{\pi^i, \pi^{-i}}_{(i)}(s, a^i, a^{-i}) = r_i(s, a^i, a^{-i}) + \gamma \cdot \mathbb{E}_{s' \sim P(\cdot \,|\, s, a^i, a^{-i})}\bigl[V^{\pi^i, \pi^{-i}}_{(i)}(s')\bigr], \qquad (2.5)$$
respectively. By the definition of $r^{\pi^i, \pi^{-i}}_{(i)}$ in (2.3), we have that, for all policy pairs $[\pi^i; \pi^{-i}]$ and $s \in \mathcal{S}$, $\mathbb{E}_{\pi^i, \pi^{-i}}[r^{\pi^i, \pi^{-i}}_{(i)}(s, a^i, a^{-i})] \le 1 + \lambda_i \cdot \log |\mathcal{A}^i|$, which, by (2.4) and (2.5), implies that, for all policy pairs $[\pi^i; \pi^{-i}]$ and $(s, a^i, a^{-i}) \in \mathcal{S} \times \mathcal{A}^i \times \mathcal{A}^{-i}$,
$$V^{\pi^i, \pi^{-i}}_{(i)}(s) \le V^{\max}_{(i)} = \frac{1 + \lambda_i \cdot \log |\mathcal{A}^i|}{1 - \gamma}, \qquad (2.6)$$
$$Q^{\pi^i, \pi^{-i}}_{(i)}(s, a^i, a^{-i}) \le Q^{\max}_{(i)} = 1 + \frac{\gamma \cdot (1 + \lambda_i \cdot \log |\mathcal{A}^i|)}{1 - \gamma}. \qquad (2.7)$$
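The bound in (2.6) can likewise be verified numerically. The sketch below (again on an illustrative random game, symmetric action spaces for brevity) forms the entropy-regularized state reward from (2.3) and checks that the regularized $V_{(i)}$-function respects $V^{\max}_{(i)}$:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, lam = 4, 3, 0.9, 0.5  # lam plays the role of lambda_i

P = rng.random((nS, nA, nA, nS)); P /= P.sum(-1, keepdims=True)
r = rng.uniform(-1.0, 1.0, (nS, nA, nA))
pi1 = rng.random((nS, nA)); pi1 /= pi1.sum(-1, keepdims=True)
pi2 = rng.random((nS, nA)); pi2 /= pi2.sum(-1, keepdims=True)

# Entropy-regularized state reward, eqs. (2.3) and below: r_pi + lam * H(pi1).
H1 = -(pi1 * np.log(pi1)).sum(-1)                        # Shannon entropy of pi1
r_pi = np.einsum('sa,sb,sab->s', pi1, pi2, r) + lam * H1

P_pi = np.einsum('sa,sb,sabt->st', pi1, pi2, P)
V_reg = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)  # regularized V, eq. (2.4)

V_max = (1.0 + lam * np.log(nA)) / (1.0 - gamma)          # bound (2.6)
assert np.all(np.abs(V_reg) <= V_max + 1e-9)
```

The bound holds because the per-state regularized reward is at most $1 + \lambda_i \log|\mathcal{A}^i|$ (entropy is maximized by the uniform policy).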

3. FICTITIOUS SELF-PLAY FOR ZERO-SUM MARKOV GAMES

In this section, we introduce smooth fictitious self-play (FSP) for two-player zero-sum Markov games.

3.1. FSP: FROM MATRIX GAMES TO MARKOV GAMES

FSP is an algorithmic framework for finding the Nash equilibria of games. It consists of two building blocks: (I) inferring the opponent's policy by playing against each other, namely fictitious play, and (II) improving the two players' policies with symmetric updating rules, namely self-play. Specifically, Player $i$ best responds to a mixed policy of Player $-i$, which is a weighted average of Player $-i$'s historical policies. Here playing a mixed policy $\pi^{-i} = \alpha \cdot \tilde\pi^{-i} + (1 - \alpha) \cdot \hat\pi^{-i}$ means that, at the beginning of the game, the player chooses to play the policy $\tilde\pi^{-i}$ with probability $\alpha$ and the policy $\hat\pi^{-i}$ with probability $1 - \alpha$. FSP was originally developed for normal-form games (Von Neumann & Morgenstern, 2007; Shapley, 1953) and extensive-form games (Heinrich et al., 2015; Heinrich & Silver, 2016). In (entropy-regularized) two-player zero-sum matrix games, which are the special cases of (entropy-regularized) two-player zero-sum Markov games with $|\mathcal{S}| = 1$ and no state transition, mixing two policies $\tilde\pi^{-i}$ and $\hat\pi^{-i}$ with probabilities $\alpha$ and $1 - \alpha$, respectively, is equivalent to averaging the corresponding $Q_i$-functions, i.e.,
$$Q_i^{\pi^i, \alpha \cdot \tilde\pi^{-i} + (1 - \alpha) \cdot \hat\pi^{-i}} = \alpha \cdot Q_i^{\pi^i, \tilde\pi^{-i}} + (1 - \alpha) \cdot Q_i^{\pi^i, \hat\pi^{-i}}.$$
In other words, in a two-player zero-sum matrix game, Player $i$ equivalently best responds to a weighted average of the historical $Q_i$-functions by taking the corresponding greedy action. To generalize FSP to the two-player zero-sum Markov game $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma)$, we propose to let Player $i$ best respond to the following weighted average of the historical marginalized $Q_{(i)}$-functions at the $t$-th iteration,
$$\bar{Q}_{t+1,(i)}(s, a^i) = (1 - \alpha_{t,(i)}) \cdot \bar{Q}_{t,(i)}(s, a^i) + \alpha_{t,(i)} \cdot \bar{Q}^{\pi^i_t, \pi^{-i}_t}_{(i)}(s, a^i), \qquad (3.1)$$
where $\alpha_{t,(i)} \in [0, 1]$ is the mixing rate. Here the marginalized $Q_{(i)}$-function $\bar{Q}^{\pi^i, \pi^{-i}}_{(i)}(s, a^i)$ is defined as
$$\bar{Q}^{\pi^i, \pi^{-i}}_{(i)}(s, a^i) = \mathbb{E}_{\pi^{-i}}\bigl[Q^{\pi^i, \pi^{-i}}_{(i)}(s, a^i, a^{-i})\bigr]. \qquad (3.2)$$
Recursively applying the symmetric updating rule in (3.1), we obtain
$$\bar{Q}_{t+1,(i)}(s, a^i) = \sum_{\tau=0}^{t} \alpha_{\tau,(i)} \cdot \prod_{k=\tau+1}^{t} (1 - \alpha_{k,(i)}) \cdot \bar{Q}^{\pi^i_\tau, \pi^{-i}_\tau}_{(i)}(s, a^i), \qquad (3.3)$$
which is the weighted average of the historical marginalized $Q_{(i)}$-functions. Here we use the convention that $\prod_{k=t+1}^{t} (1 - \alpha_{k,(i)}) = 1$. Correspondingly, (3.1) induces the following symmetric policy updating rule,
$$\pi^{i,\mathrm{best}}_{t+1}(a^i \,|\, s) = \mathbb{1}\Bigl\{a^i = \mathop{\mathrm{argmax}}_{a^i \in \mathcal{A}^i} \bar{Q}_{t+1,(i)}(s, a^i)\Bigr\},$$
where the obtained policy $\pi^{i,\mathrm{best}}_{t+1}$ best responds to $\bar{Q}_{t+1,(i)}$ defined in (3.1) by taking the corresponding greedy action.
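The equivalence between the recursive update (3.1) and its closed-form weighted average is a telescoping identity; a minimal numerical check (with arbitrary mixing rates and illustrative Q-tables):

```python
import numpy as np

rng = np.random.default_rng(2)
T, nS, nA = 6, 3, 4
alphas = rng.uniform(0.1, 0.9, T)            # mixing rates alpha_{t,(i)}
Qs = rng.uniform(-1.0, 1.0, (T, nS, nA))     # marginalized Q-functions per iteration

# Recursive update (3.1): Qbar_{t+1} = (1 - alpha_t) * Qbar_t + alpha_t * Q_t.
Qbar = np.zeros((nS, nA))
for t in range(T):
    Qbar = (1 - alphas[t]) * Qbar + alphas[t] * Qs[t]

# Closed form: weights alpha_tau * prod_{k > tau} (1 - alpha_k).
weights = np.array([alphas[tau] * np.prod(1 - alphas[tau + 1:]) for tau in range(T)])
Qavg = np.einsum('t,tsa->sa', weights, Qs)
assert np.allclose(Qbar, Qavg)
```

Note that the weights sum to $1 - \prod_{k=0}^{t}(1 - \alpha_{k,(i)}) \le 1$, so the recursion indeed produces a (sub-)convex combination of the historical marginalized Q-functions.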

3.2. MARKOV GAMES: FROM FSP TO SMOOTH FSP

FSP is only known to converge asymptotically even in two-player zero-sum matrix games (Robinson, 1951). Instead, we consider smooth FSP, which uses the following smoothed best response,
$$\pi^i_{t+1}(a^i \,|\, s) \propto \exp\bigl\{E_{t+1,(i)}(s, a^i)\bigr\}. \qquad (3.4)$$
Here the ideal energy function $E_{t+1,(i)}(s, a^i) = \kappa_{t+1,(i)} \cdot \bar{Q}_{t+1,(i)}(s, a^i)$ is proportional to the weighted average of the historical marginalized $Q_{(i)}$-functions defined in (3.1), with the normalization parameter $\kappa_{t+1,(i)} > 0$. In the sequel, we simplify the symmetric updating rules in (3.1) and (3.4). Let the stepsizes be
$$\hat\alpha_{t,(i)} = \kappa_{t+1,(i)} \cdot \alpha_{t,(i)}, \qquad \tilde\alpha_{t,(i)} = \kappa_{t+1,(i)} / \kappa_{t,(i)} \cdot (1 - \alpha_{t,(i)}). \qquad (3.5)$$
Recall that the marginalized $Q_{(i)}$-function $\bar{Q}^{\pi^i_t, \pi^{-i}_t}_{(i)}$ is defined in (3.2). Corresponding to (3.1), we have the following symmetric updating rule for the energy functions,
$$E_{t+1,(i)}(s, a^i) = \tilde\alpha_{t,(i)} \cdot E_{t,(i)}(s, a^i) + \hat\alpha_{t,(i)} \cdot \bar{Q}^{\pi^i_t, \pi^{-i}_t}_{(i)}(s, a^i), \qquad (3.6)$$
which gives the following symmetric policy updating rule equivalent to (3.4),
$$\pi^i_{t+1}(a^i \,|\, s) \propto \pi^i_t(a^i \,|\, s)^{\tilde\alpha_{t,(i)}} \cdot \exp\bigl\{\hat\alpha_{t,(i)} \cdot \bar{Q}^{\pi^i_t, \pi^{-i}_t}_{(i)}(s, a^i)\bigr\}.$$
We call $E_{t+1,(i)}$ the ideal energy function, since it is directly obtained from the symmetric updating rule in (3.3), which operates in the functional space given the marginalized $Q_{(i)}$-functions.
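The equivalence between updating energies and multiplicatively updating policies can be checked directly: applying the softmax in (3.4) to the updated energy function reproduces the multiplicative policy update. A small sketch with illustrative stepsize values:

```python
import numpy as np

def softmax(E):
    z = np.exp(E - E.max(-1, keepdims=True))
    return z / z.sum(-1, keepdims=True)

rng = np.random.default_rng(3)
nS, nA = 3, 4
E_t = rng.normal(size=(nS, nA))      # current energy function
Qbar = rng.normal(size=(nS, nA))     # marginalized Q-function at iteration t
a_hat, a_tilde = 0.3, 0.8            # illustrative stepsizes from (3.5)

# Energy update followed by the softmax in (3.4) ...
pi_from_energy = softmax(a_tilde * E_t + a_hat * Qbar)

# ... equals the multiplicative update pi_t^{a_tilde} * exp(a_hat * Qbar), normalized.
pi_t = softmax(E_t)
unnorm = pi_t ** a_tilde * np.exp(a_hat * Qbar)
pi_mult = unnorm / unnorm.sum(-1, keepdims=True)
assert np.allclose(pi_from_energy, pi_mult)
```

The per-state normalization constant of the softmax cancels when raising $\pi^i_t$ to the power $\tilde\alpha_{t,(i)}$ and renormalizing, which is why the two updates coincide.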

3.3. IMPLEMENTING SMOOTH FSP

In practice, it remains to approximate the ideal energy function $E_{t+1,(i)}$ within a parameterized function class, which is further used to parameterize the policy $\pi^i_{t+1}$. For notational simplicity, we concatenate the parameters of the policies $\pi^i_{t+1}$ and $\pi^{-i}_{t+1}$ into a single parameter $\theta_{t+1} \in \Theta$, which gives the parameterized policy pair $[\pi^i_{\theta_{t+1}}; \pi^{-i}_{\theta_{t+1}}]$. Meanwhile, we need to estimate the marginalized $Q_{(i)}$-function $\bar{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i)$ defined in (3.2). In practice, the parameterizations of the energy function and the marginalized $Q_{(i)}$-function are set to be neural networks, which means that $\Theta = \mathbb{R}^N$ with $N$ being the size of the neural network. To implement smooth FSP, given $\theta_t \in \Theta$, we find the best parameter $\theta_{t+1} \in \Theta$ that minimizes the mean squared error (MSE),
$$\mathbb{E}_{\sigma_t}\Bigl[\sum_{i \in \{1,2\}} \bigl(E_{\theta_{t+1},(i)}(s, a^i) - \widehat{E}_{t+1,(i)}(s, a^i)\bigr)^2\Bigr], \qquad (3.7)$$
where
$$\widehat{E}_{t+1,(i)}(s, a^i) = \tilde\alpha_{t,(i)} \cdot E_{\theta_t,(i)}(s, a^i) + \hat\alpha_{t,(i)} \cdot \widehat{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i) \qquad (3.8)$$
is the estimated ideal energy function. Here $\widehat{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i)$ is the estimator of the marginalized $Q_{(i)}$-function $\bar{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i)$. Such an estimator is obtained based on the data generated by smooth FSP via policy evaluation (Sutton et al., 2000). For notational simplicity, in (3.7) and the rest of the paper, we write the stationary state-action distribution $\sigma_{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}$ and the stationary state distribution $\nu_{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}$ associated with the policy pair $[\pi^i_{\theta_t}; \pi^{-i}_{\theta_t}]$ as $\sigma_t$ and $\nu_t$, respectively. We define the bounded function class $\mathcal{F}_R$ with radius $R > 0$ as $\mathcal{F}_R = \{f : \|f\|_\infty \le R\}$. Algorithm 1 gives the implementation of smooth FSP for two-player zero-sum Markov games. 
Algorithm 1 Smooth FSP for Two-Player Zero-Sum Markov Games
1: Require: two-player zero-sum Markov game $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma)$, number of iterations $T$, regularization parameters $\{\lambda_i\}_{i \in \{1,2\}}$, truncation parameters $\{Q^{\max}_{(i)}, E^{\max}_{(i)}\}_{i \in \{1,2\}}$, and stepsizes $\{\hat\alpha_{t,(i)}, \tilde\alpha_{t,(i)}\}_{0 \le t \le T-1,\, i \in \{1,2\}}$
2: Initialize the energy function $E_{\theta_0,(i)}(s, a^i) \leftarrow 0$ for $i \in \{1, 2\}$
3: For $t = 0, \ldots, T-1$ and $i \in \{1, 2\}$ do
4:   Set the policy $\pi^i_{\theta_t}(\cdot \,|\, s) \propto \exp\{E_{\theta_t,(i)}(s, \cdot)\}$
5:   Generate the marginalized $Q_{(i)}$-function estimator $\widehat{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i) \in \mathcal{F}_{Q^{\max}_{(i)}}$ using the data generated by fictitious play with the policy pair $[\pi^i_{\theta_t}; \pi^{-i}_{\theta_t}]$
6:   Update the estimated ideal energy function $\widehat{E}_{t+1,(i)}(s, a^i) \leftarrow \tilde\alpha_{t,(i)} \cdot E_{\theta_t,(i)}(s, a^i) + \hat\alpha_{t,(i)} \cdot \widehat{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i)$
7:   Minimize (3.7) to obtain the energy function $E_{\theta_{t+1},(i)}(s, a^i) \in \mathcal{F}_{E^{\max}_{(i)}}$
8: End for
9: Output: $\{\pi^i_{\theta_t}\}_{0 \le t \le T-1,\, i \in \{1,2\}}$
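In the matrix-game special case ($|\mathcal{S}| = 1$), Algorithm 1 can be run exactly, with no estimation error. The sketch below instantiates it on matching pennies, with stepsizes chosen so that $(1 - \tilde\alpha)/\hat\alpha = \lambda$ as in Proposition 4.1; the game, $\lambda$, $T$, and the initialization are illustrative. The iterates approach the regularized equilibrium, so the duality gap of the final policy pair becomes small:

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

# Matching pennies: Player 1 maximizes x^T A y, Player 2 minimizes it.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
lam, T = 0.5, 10000                   # regularization parameter and iteration count
E1 = np.array([0.5, -0.5])            # energy functions (nonzero init for a
E2 = np.array([0.3, -0.3])            # nontrivial trajectory)

for t in range(T):
    x, y = softmax(E1), softmax(E2)   # Line 4: pi proportional to exp(E)
    Q1, Q2 = A @ y, -A.T @ x          # exact marginalized Q-functions (Line 5)
    a_hat = 1.0 / (lam * (t + 1))     # stepsizes chosen so that
    a_tilde = t / (t + 1.0)           # (1 - a_tilde) / a_hat = lam
    E1 = a_tilde * E1 + a_hat * Q1    # Line 6 (no function approximation)
    E2 = a_tilde * E2 + a_hat * Q2

x, y = softmax(E1), softmax(E2)
exploit = (A @ y).max() - (x @ A).min()   # duality gap of [x; y]
assert exploit < 0.25
```

With this stepsize schedule, $\lambda E_{t}$ is exactly the running average of the historical marginalized Q-functions, so the loop is smooth fictitious play with a softmax best response at temperature $\lambda$.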

4. MAIN RESULTS

In this section, we establish the convergence of smooth FSP for two-player zero-sum Markov games by casting it as regularized proximal policy optimization (PPO).

4.1. SMOOTH FSP AS REGULARIZED PPO

In the sequel, we connect the energy function update in (3.8) with regularized PPO. Corresponding to the estimated ideal energy function update $\widehat{E}_{t+1,(i)}$ in (3.8), we define the estimated ideal policy update as
$$\widehat{\pi}^i_{t+1}(\cdot \,|\, s) \propto \exp\bigl\{\widehat{E}_{t+1,(i)}(s, \cdot)\bigr\}. \qquad (4.1)$$
The following proposition states the equivalence between smooth FSP and regularized PPO.

Proposition 4.1. For all $0 \le t \le T-1$, let the stepsizes $\hat\alpha_{t,(i)}$ and $\tilde\alpha_{t,(i)}$ of Algorithm 1 satisfy $\lambda_i = (1 - \tilde\alpha_{t,(i)})/\hat\alpha_{t,(i)} > 0$. At the $t$-th iteration of Algorithm 1, the policy update in (4.1) is equivalent to solving the regularized PPO subproblem,
$$\widehat{\pi}^i_{t+1} = \mathop{\mathrm{argmax}}_{\pi^i} \mathbb{E}_{\nu_t}\Bigl[\hat\alpha_{t,(i)} \cdot \bigl\langle \widehat{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, \cdot) - \lambda_i \cdot \log \pi^i_{\theta_t}(\cdot \,|\, s),\; \pi^i(\cdot \,|\, s) - \pi^i_{\theta_t}(\cdot \,|\, s) \bigr\rangle - \mathrm{KL}\bigl(\pi^i(\cdot \,|\, s) \,\big\|\, \pi^i_{\theta_t}(\cdot \,|\, s)\bigr)\Bigr]. \qquad (4.2)$$
Here $\widehat{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i)$ is the estimator of the marginalized $Q_{(i)}$-function $\bar{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i)$.

Proof. See Appendix A for a detailed proof.

Proposition 4.1 implies that smooth FSP proximally improves the policy $\pi^i$ based on the regularized performance function,
$$J_{(i)}(\pi^i, \pi^{-i}) = \mathbb{E}_{\nu_*}\bigl[V^{\pi^i, \pi^{-i}}_{(i)}(s)\bigr]. \qquad (4.3)$$
Proposition C.1 implies that the smaller the regularization parameter $\lambda_i$ is, the closer the regularized performance function $J_{(i)}$ is to the performance function $J$. In the rest of the paper, we show that, with a proper choice of $\lambda_i$, smooth FSP converges to a neighborhood of a Nash equilibrium $[\pi^1_*; \pi^2_*]$ at a sublinear rate of $\tilde{O}(1/T)$.
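Proposition 4.1 rests on the fact that the maximizer of a linear functional minus a KL term over the probability simplex is a multiplicatively reweighted policy. The sketch below verifies, for a single state with illustrative values, that this closed-form smoothed best response dominates the per-state objective of (4.2) over random policies and coincides with the multiplicative update with $\tilde\alpha = 1 - \lambda \hat\alpha$:

```python
import numpy as np

rng = np.random.default_rng(4)
nA, lam, a_hat = 5, 0.5, 0.1                  # illustrative values
pi_t = rng.random(nA); pi_t /= pi_t.sum()     # current policy pi_{theta_t}(.|s)
Qhat = rng.uniform(-1.0, 1.0, nA)             # estimated marginalized Q-function

g = a_hat * (Qhat - lam * np.log(pi_t))       # linear term of (4.2) at one state

def objective(pi):
    # Per-state objective of the regularized PPO subproblem (4.2).
    return g @ (pi - pi_t) - np.sum(pi * np.log(pi / pi_t))

# Closed-form maximizer: pi proportional to pi_t * exp(g).
pi_star = pi_t * np.exp(g); pi_star /= pi_star.sum()

# It coincides with the multiplicative update with a_tilde = 1 - lam * a_hat.
a_tilde = 1.0 - lam * a_hat
unnorm = pi_t ** a_tilde * np.exp(a_hat * Qhat)
assert np.allclose(pi_star, unnorm / unnorm.sum())

# The closed form dominates random policies on the simplex.
for _ in range(1000):
    pi = rng.random(nA); pi /= pi.sum()
    assert objective(pi_star) >= objective(pi) - 1e-12
```

The identity $\pi_t \cdot e^{\hat\alpha(\widehat{Q} - \lambda \log \pi_t)} = \pi_t^{1 - \lambda\hat\alpha} e^{\hat\alpha \widehat{Q}}$ is exactly the stepsize relation $\lambda = (1 - \tilde\alpha)/\hat\alpha$ of Proposition 4.1.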

4.2. CONVERGENCE TO NASH EQUILIBRIUM

Let $\mathbb{P}(s_t = s \,|\, \pi^i, \pi^{-i}, s_0 \sim \nu)$ be the probability that the trajectory, which is generated by the policy pair $[\pi^i; \pi^{-i}]$ with the initial state distribution $s_0 \sim \nu$, reaches the state $s$ at timestep $t$. Correspondingly, let
$$\rho^{\pi^i, \pi^{-i}}_\nu(s) = (1 - \gamma) \cdot \sum_{t=0}^{\infty} \gamma^t \cdot \mathbb{P}(s_t = s \,|\, \pi^i, \pi^{-i}, s_0 \sim \nu) \qquad (4.4)$$
be the visitation measure of $[\pi^i; \pi^{-i}]$ with the initial state distribution $s_0 \sim \nu$. Also, for notational simplicity, we define
$$\rho^{\pi^i, \pi^{-i}}_{\nu, \tilde\pi^i, \tilde\pi^{-i}}(s) = (1 - \gamma) \cdot \sum_{t=0}^{\infty} \gamma^t \cdot \mathbb{P}\bigl(s_{t+1} = s \,\big|\, \pi^i, \pi^{-i}, (s_0, a^i_0, a^{-i}_0) \sim \nu \tilde\pi^i \tilde\pi^{-i}\bigr) \qquad (4.5)$$
as the visitation measure of the policy pair $[\pi^i; \pi^{-i}]$ with the initial state-action distribution $\nu \tilde\pi^i \tilde\pi^{-i}$. We lay out the following assumption on the concentrability coefficient. With a slight abuse of notation, we write $\nu$ and $\tilde\pi^i$ in the subscripts as $s$ and $a^i$, respectively, when they are point masses.

Assumption 4.2 (Concentrability Coefficient). We assume that for the two-player zero-sum Markov game $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma)$, there exists $\zeta > 0$ such that
$$\mathbb{E}_{\nu_*}\Bigl[\bigl(\mathrm{d}\rho^{\pi^i, \pi^{-i}_*}_{s, a^i, \pi^{-i}_*}/\mathrm{d}\nu_*\bigr)^2\Bigr]^{1/2} \le \zeta$$
for all $s \in \mathcal{S}$, $a^i \in \mathcal{A}^i$, and $\pi^i = \pi^i_{\theta_t}$ generated by the policy update in Line 4 of Algorithm 1. Here $\mathrm{d}\rho^{\pi^i, \pi^{-i}_*}_{s, a^i, \pi^{-i}_*}/\mathrm{d}\nu_*$ is the Radon-Nikodym derivative, where $\rho^{\pi^i, \pi^{-i}_*}_{s, a^i, \pi^{-i}_*}$ is defined in (4.5).

The notion of concentrability coefficient in Assumption 4.2 is commonly used in the literature (Munos & Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Tosatto et al., 2017; Yang et al., 2019). For all policy pairs $[\pi^i; \pi^{-i}]$, we define the Markov state transition kernel as
$$P_{\pi^i, \pi^{-i}}(\cdot \,|\, s) = \mathbb{E}_{\pi^i, \pi^{-i}}\bigl[P(\cdot \,|\, s, a^i, a^{-i})\bigr]. \qquad (4.6)$$
With a slight abuse of notation, we write $P_{\pi^i, \pi^{-i}}$ for the Markov state transition operator induced by the kernel defined in (4.6), such that $[P_{\pi^i, \pi^{-i}} \cdot h](s) = \int_{s' \in \mathcal{S}} h(s') P_{\pi^i, \pi^{-i}}(\mathrm{d}s' \,|\, s)$, where $h: \mathcal{S} \to \mathbb{R}$ is an $L_1$-integrable function and the Lebesgue measure over $\mathcal{S} \subset \mathbb{R}^d$ is used. 
Correspondingly, we define the operator norm of an operator $O$ as
$$\|O\|_{\mathrm{op}} = \sup_h \frac{\|O \cdot h\|_{L_1(\mathcal{S})}}{\|h\|_{L_1(\mathcal{S})}} = \sup_{\|h\|_{L_1(\mathcal{S})} \le 1} \|O \cdot h\|_{L_1(\mathcal{S})},$$
where $\|\cdot\|_{L_1(\mathcal{S})}$ is the $L_1$-norm over the state space $\mathcal{S}$. The following assumption characterizes the Lipschitz continuity of $P_{\pi^i, \pi^{-i}}$ and $r^{\pi^i, \pi^{-i}}$ with respect to $\pi^{-i}$.

Assumption 4.3 (Lipschitz Game). We assume that for the two-player zero-sum Markov game $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma)$, there exists $\iota_i > 0$ such that for all $s \in \mathcal{S}$ and $[\pi^i; \pi^{-i}]$,
$$\|P_{\pi^i, \pi^{-i}_*} - P_{\pi^i, \pi^{-i}}\|_{\mathrm{op}} \le \iota_i \cdot \mathbb{E}_{\nu_*}\bigl[\mathrm{KL}\bigl(\pi^{-i}_*(\cdot \,|\, s) \,\big\|\, \pi^{-i}(\cdot \,|\, s)\bigr)\bigr]^{1/2}, \qquad (4.8)$$
$$\bigl|r^{\pi^i, \pi^{-i}_*}(s) - r^{\pi^i, \pi^{-i}}(s)\bigr| \le \iota_i \cdot \mathrm{KL}\bigl(\pi^{-i}_*(\cdot \,|\, s) \,\big\|\, \pi^{-i}(\cdot \,|\, s)\bigr)^{1/2}. \qquad (4.9)$$
The Lipschitz coefficient $\iota_i$ in (4.8) of Assumption 4.3 quantifies the influence of Player $-i$ on the nonstationary environment that Player $i$ faces. Such a notion of influence is proposed by Radanovic et al. (2019) in the tabular setting. In particular, the expected KL-divergence between the policies is used in place of the distance $\max_{s \in \mathcal{S}} \|\pi^{-i}_*(\cdot \,|\, s) - \pi^{-i}(\cdot \,|\, s)\|_1$ in Radanovic et al. (2019). Such an assumption is also related to the linear-quadratic game (LQG) literature (see, e.g., Zhang et al. (2019)), where Lipschitz continuity is established based on the special structure of the LQG model. In Lemma C.2, we show that such a Lipschitz coefficient $\iota_i$ quantifies the Lipschitz continuity of the marginalized $Q_{(i)}$-function of the entropy-regularized two-player Markov game $(\mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma, \lambda_1, \lambda_2)$.

Recall that $\widehat{\pi}^i_{t+1} \propto \exp\{\widehat{E}_{t+1,(i)}\}$ is defined in (4.1), where $\widehat{E}_{t+1,(i)}$ is defined in (3.8). Also, recall that $\pi^i_{\theta_{t+1}} \propto \exp\{E_{\theta_{t+1},(i)}\}$ is defined in Line 4 of Algorithm 1, where $E_{\theta_{t+1},(i)}$ is obtained by minimizing (3.7) in Line 7 of Algorithm 1. Meanwhile, we define the ideal policy update as $\bar{\pi}^i_{t+1}(\cdot \,|\, s) \propto \exp\{\bar{E}_{t+1,(i)}(s, \cdot)\}$, where
$$\bar{E}_{t+1,(i)}(s, a^i) = \tilde\alpha_{t,(i)} \cdot E_{\theta_t,(i)}(s, a^i) + \hat\alpha_{t,(i)} \cdot \bar{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}(s, a^i) \qquad (4.10)$$
is the corresponding ideal energy function update. 
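In the tabular case, the visitation measure in (4.4) admits the closed form $\rho^{\pi^i,\pi^{-i}}_\nu = (1-\gamma)\cdot\nu^\top (I - \gamma P_{\pi^i,\pi^{-i}})^{-1}$, with $P_{\pi^i,\pi^{-i}}$ the kernel in (4.6). A small numerical sketch (on an illustrative random game) cross-checks the closed form against the truncated discounted series:

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nS, nA, nA, nS)); P /= P.sum(-1, keepdims=True)
pi1 = rng.random((nS, nA)); pi1 /= pi1.sum(-1, keepdims=True)
pi2 = rng.random((nS, nA)); pi2 /= pi2.sum(-1, keepdims=True)
nu0 = np.full(nS, 1.0 / nS)                       # initial state distribution

P_pi = np.einsum('sa,sb,sabt->st', pi1, pi2, P)   # kernel (4.6), marginalized

# Visitation measure (4.4): rho = (1 - gamma) * (I - gamma * P_pi^T)^{-1} nu0.
rho = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, nu0)
assert np.isclose(rho.sum(), 1.0) and np.all(rho >= -1e-12)

# Cross-check against the truncated series (1 - gamma) * sum_t gamma^t * nu_t.
rho_series, nu_t = np.zeros(nS), nu0.copy()
for t in range(500):
    rho_series += (1 - gamma) * gamma**t * nu_t
    nu_t = nu_t @ P_pi
assert np.allclose(rho, rho_series, atol=1e-8)
```

The concentrability coefficient of Assumption 4.2 is then a second moment of the ratio between such a visitation measure and the equilibrium stationary distribution $\nu_*$.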
We lay out the following assumption on the errors that arise from the estimation of the marginalized $Q_{(i)}$-function $\bar{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}$ and the minimization of the MSE in (3.7).

Assumption 4.4 (Estimation Error). We assume that there exist $\epsilon_t, \epsilon'_t > 0$ such that for all $0 \le t \le T-1$,
$$\mathbb{E}_{\nu_*}\bigl[\|E_{\theta_{t+1},(i)}(s, \cdot) - \widehat{E}_{t+1,(i)}(s, \cdot)\|_\infty^2\bigr] \le \epsilon_t, \qquad (4.11)$$
$$\mathbb{E}_{\nu_*}\bigl[\bigl\langle E_{\theta_{t+1},(i)}(s, \cdot) - \bar{E}_{t+1,(i)}(s, \cdot),\; \pi^i_*(\cdot \,|\, s) - \pi^i_{\theta_t}(\cdot \,|\, s) \bigr\rangle\bigr] \le \epsilon'_t. \qquad (4.12)$$
Assumption 4.4 characterizes the estimation error through the policy updates in Line 7 of Algorithm 1. In particular, (4.11) upper bounds the error arising from the minimization of the MSE in (3.7), which is zero as long as the representation power of the parameterized class of energy functions is sufficiently strong. Meanwhile, by (3.7) and (4.10), the gap between $E_{\theta_{t+1},(i)}$ and $\bar{E}_{t+1,(i)}$ involves (I) the gap between $\bar{E}_{t+1,(i)}$ and $\widehat{E}_{t+1,(i)}$, which arises from the gap between $\bar{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}$ and $\widehat{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}$, and (II) the gap between $E_{\theta_{t+1},(i)}$ and $\widehat{E}_{t+1,(i)}$, which arises from the minimization of the MSE in (3.7). Hence, $\epsilon'_t$ in (4.12) is zero as long as the estimator $\widehat{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}$ of $\bar{Q}^{\pi^i_{\theta_t}, \pi^{-i}_{\theta_t}}_{(i)}$ is accurate and $\epsilon_t$ is zero. We summarize $\epsilon_t$ and $\epsilon'_t$ into the following total error $\sigma$,
$$\sigma = \sum_{t=0}^{T-1} (t + 1) \cdot (\epsilon_t + \epsilon'_t). \qquad (4.13)$$
We are now ready to present the following theorem on the convergence of the policy sequence $\{[\pi^1_{\theta_t}; \pi^2_{\theta_t}]\}_{0 \le t \le T-1}$ to a neighborhood of a Nash equilibrium $[\pi^1_*; \pi^2_*]$. Recall that $V^{\max}_{(i)}$ and $Q^{\max}_{(i)}$ are defined in (2.6) and (2.7), respectively. Also, recall that $\zeta$ is the concentrability coefficient in Assumption 4.2, $\iota_i$ is the Lipschitz coefficient in Assumption 4.3, and $\sigma$ is defined in (4.13).

Theorem 4.5 (Convergence of Smooth FSP to Nash Equilibrium). Suppose that Assumptions 4.2-4.4 hold. We set the regularization parameter $\lambda_i \ge 2M_i$, where
$$M_i = \Bigl(2 + \sum_{i \in \{1,2\}} (V^{\max}_{(i)} + Q^{\max}_{(i)} \cdot \zeta)/(1 - \gamma)\Bigr) \cdot \iota_i. \qquad (4.14)$$
In Algorithm 1, we set $E^{\max}_{(i)} = Q^{\max}_{(i)}/(\lambda_i - M_i)$ and
$$\hat\alpha_{t,(i)} = \frac{1}{(t + 1) \cdot \min_{i \in \{1,2\}}\{\lambda_i - M_i\}}, \qquad \tilde\alpha_{t,(i)} = 1 - \frac{\lambda_i}{(t + 1) \cdot \min_{i \in \{1,2\}}\{\lambda_i - M_i\}}. \qquad (4.15)$$
For the policy sequence $\{[\pi^1_{\theta_t}; \pi^2_{\theta_t}]\}_{0 \le t \le T-1}$ generated by the policy update in Line 7 of Algorithm 1, we have
$$\frac{1}{T} \cdot \sum_{t=0}^{T-1} \bigl[J(\pi^1_*, \pi^2_{\theta_t}) - J(\pi^1_{\theta_t}, \pi^2_*)\bigr] \le \frac{\sum_{i \in \{1,2\}} \bigl(2 + 2\lambda_i^2/(\lambda_i - M_i)^2\bigr) \cdot (Q^{\max}_{(i)})^2}{(1 - \gamma) \cdot \min_{i \in \{1,2\}}\{\lambda_i - M_i\}} \cdot \frac{\log T}{T} + \frac{2\sigma \cdot \min_{i \in \{1,2\}}\{\lambda_i - M_i\}}{(1 - \gamma) \cdot T} + \sum_{i \in \{1,2\}} \lambda_i \cdot \log |\mathcal{A}^i|. \qquad (4.16)$$

Proof. See Appendix C for a detailed proof. The key to our proof is the convergence of infinite-dimensional mirror descent with the primal and dual errors. In particular, the errors are characterized in Appendix B.

Recall that the Lipschitz coefficient $\iota_i$ is defined in Assumption 4.3. In Lemma C.2, we interpret $\iota_i$ as the Lipschitz coefficient of the marginalized $Q_{(i)}$-function. Meanwhile, recall that Theorem 4.5 requires $\lambda_i \ge 2M_i$, where $M_i$ scales linearly with $\iota_i$. Hence, the smaller the Lipschitz coefficient $\iota_i$ is, the smaller the regularization parameter $\lambda_i$ can be, which in turn leads to a smaller regularization bias, as characterized in Proposition C.1. Thus, the policy sequence $\{[\pi^1_{\theta_t}; \pi^2_{\theta_t}]\}_{0 \le t \le T-1}$ generated by Algorithm 1 converges to a smaller neighborhood of a Nash equilibrium $[\pi^1_*; \pi^2_*]$. We give the following two sufficient conditions on the Lipschitz coefficients. (I) The two players have similar influence on the game, i.e., $\iota_1/\iota_2 = O(1)$: a sufficient requirement on both of the Lipschitz coefficients is
$$\iota_i \le \frac{(1 - \gamma)^2}{8(1 + \gamma) \cdot \log |\mathcal{A}^i|}, \qquad i \in \{1, 2\}.$$
(II) One of the two players (without loss of generality, we assume it is Player 2) has dominant influence on the game compared to the other: let $\iota_1/\iota_2 = z > 0$, in which case we set $M_i$ in (4.14) as
$$M_i = \sqrt{2z} \cdot \Bigl(2 + \sum_{i \in \{1,2\}} (V^{\max}_{(i)} + Q^{\max}_{(i)} \cdot \zeta)/(1 - \gamma)\Bigr) \cdot \iota_2.$$
As $z$ moves towards zero, the convergence guarantee approaches that of the single-controller case. Please see Appendix I for a more detailed illustration of case (II).

We remark in the following that, with stronger assumptions, we can strengthen Theorem 4.5 to satisfy Hannan consistency.

Remark 4.6 (Hannan Consistency). When Assumptions 4.2-4.4 hold for any policy pair $[\pi^i; \pi^{-i}]$ instead of only a Nash equilibrium $[\pi^i_*; \pi^{-i}_*]$, we can prove that, when one of the players does not update its policy as described in Algorithm 1, the opposing player can exploit the strategies it plays. Specifically, for example, when Player 2 plays the policy sequence $\{\tilde\pi^2_t\}_{0 \le t \le T-1}$ while Player 1 updates its policy according to Algorithm 1, we have
$$\sup_{\pi^1} \frac{1}{T} \cdot \sum_{t=0}^{T-1} \bigl[J(\pi^1, \tilde\pi^2_t) - J(\pi^1_{\theta_t}, \tilde\pi^2_t)\bigr] \le \frac{\sigma \cdot (\lambda_1 - M_1)}{(1 - \gamma) \cdot T} + \frac{\bigl(2 + 2\lambda_1^2/(\lambda_1 - M_1)^2\bigr) \cdot (Q^{\max}_{(1)})^2}{(1 - \gamma) \cdot (\lambda_1 - M_1)} \cdot \frac{\log T}{T} + \lambda_1 \cdot \log |\mathcal{A}^1|, \qquad (4.17)$$
which implies that the policy sequence $\{\pi^1_{\theta_t}\}_{0 \le t \le T-1}$ converges to the best policy in hindsight with respect to $\{\tilde\pi^2_t\}_{0 \le t \le T-1}$. As a consequence, we can also replace the left-hand side of (4.16) by the following duality gap,
$$\sup_{\pi^1} \frac{1}{T} \cdot \sum_{t=0}^{T-1} J(\pi^1, \pi^2_{\theta_t}) - \inf_{\pi^2} \frac{1}{T} \cdot \sum_{t=0}^{T-1} J(\pi^1_{\theta_t}, \pi^2). \qquad (4.18)$$
See Appendix J for a more detailed illustration of Remark 4.6.
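The Hannan consistency property of Remark 4.6 can be illustrated in the matrix-game special case: when Player 2 deviates to a fixed exploitable strategy, a Player 1 running the smooth FSP update approaches the best response in hindsight, up to the $\lambda_1 \cdot \log|\mathcal{A}^1|$ bias term in (4.17). A sketch with illustrative values:

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

# Player 2 deviates from Algorithm 1 and plays a fixed exploitable strategy y.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies payoff for Player 1
y = np.array([0.8, 0.2])                    # fixed strategy of the deviating player
lam, T = 0.5, 2000                          # illustrative lambda_1 and horizon
E1 = np.zeros(2)
payoff_sum = 0.0

for t in range(T):
    x = softmax(E1)                         # Player 1's smooth FSP policy
    payoff_sum += x @ A @ y
    Q1 = A @ y                              # exact marginalized Q against y
    a_hat = 1.0 / (lam * (t + 1))           # stepsizes with (1 - a_tilde)/a_hat = lam
    a_tilde = t / (t + 1.0)
    E1 = a_tilde * E1 + a_hat * Q1

avg_payoff = payoff_sum / T
best_in_hindsight = (A @ y).max()           # best fixed response to y
regret = best_in_hindsight - avg_payoff
assert regret <= lam * np.log(2) + 0.05     # within the lambda * log|A^1| bias
```

The residual regret here is exactly the smoothing bias of the entropy-regularized best response; shrinking $\lambda$ shrinks the bias, at the cost of the stability constants in Theorem 4.5.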



As discussed in Lemmas 4.7 and 4.8 of Liu et al. (2019), under Assumption 4.2, when we use sufficiently deep and wide neural networks equipped with the rectified linear unit (ReLU) activation function to parameterize the marginalized $Q_{(i)}$-functions and the energy functions, Assumption 4.4 can be satisfied with $\sigma = O(1)$. See Appendix B for a detailed discussion.

Then one sufficient requirement on the ratio $z$ is
$$z \le \frac{(1 - \gamma)^4}{16(1 + \gamma) \cdot \iota_2 \cdot \log\bigl(|\mathcal{A}^1| \cdot |\mathcal{A}^2|\bigr)^2}.$$

