FASTER LAST-ITERATE CONVERGENCE OF POLICY OPTIMIZATION IN ZERO-SUM MARKOV GAMES

Abstract

Multi-Agent Reinforcement Learning (MARL), where multiple agents learn to interact in a shared dynamic environment, permeates a wide range of critical applications. While there has been substantial progress on understanding the global convergence of policy optimization methods in single-agent RL, the design and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges that remain inadequately addressed by existing theory. In this paper, we focus on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and study equilibrium-finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization. Our convergence results improve upon the best known iteration complexities and lead to a better understanding of policy optimization in competitive Markov games.

1. INTRODUCTION

Policy optimization methods (Williams, 1992; Sutton et al., 2000; Kakade, 2002; Peters and Schaal, 2008; Konda and Tsitsiklis, 2000), which cast sequential decision making as value maximization problems with respect to (parameterized) policies, have been instrumental in enabling recent successes of reinforcement learning (RL); see, e.g., Schulman et al. (2015; 2017); Silver et al. (2016). Despite their empirical popularity, the theoretical underpinnings of policy optimization methods remained elusive until very recently. For single-agent RL problems, a flurry of recent works has made substantial progress on understanding the global convergence of policy optimization methods under the framework of Markov Decision Processes (MDPs) (Agarwal et al., 2020; Bhandari and Russo, 2019; Mei et al., 2020; Cen et al., 2021a; Lan, 2022; Bhandari and Russo, 2020; Zhan et al., 2021; Khodadadian et al., 2021; Xiao, 2022). Despite the nonconcave nature of value maximization, (natural) policy gradient methods are shown to achieve global convergence at a sublinear rate (Agarwal et al., 2020; Mei et al., 2020), or even a linear rate in the presence of regularization (Mei et al., 2020; Cen et al., 2021a; Lan, 2022; Zhan et al., 2021), when the learning rate is constant. Moving beyond single-agent RL, Multi-Agent Reinforcement Learning (MARL), where multiple agents learn to interact in a shared dynamic environment, is the next frontier, permeating critical applications such as multi-agent networked systems, autonomous vehicles, and robotics. The design and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges and new desiderata, which unfortunately remain inadequately addressed by existing theory.

1.1. POLICY OPTIMIZATION FOR COMPETITIVE RL

In this work, we focus on one of the most basic settings of competitive multi-agent RL, namely two-player zero-sum Markov games (Shapley, 1953), and study equilibrium-finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. In particular, our designs gravitate around algorithms that are single-loop and symmetric, with finite-time last-iterate convergence to the Nash Equilibrium (NE) or the Quantal Response Equilibrium (QRE) under bounded rationality, two prevalent solution concepts in game theory. These design principles naturally arise from the pursuit of simple yet efficient algorithms: single-loop updates preclude sophisticated interleaving of rounds between agents; symmetric updates ensure that no agent compromises its rewards in the learning process, which could otherwise be exploited by a faster-updating opponent; in addition, asymmetric updates typically lead to one-sided convergence, i.e., only one of the agents is guaranteed to converge to the minimax equilibrium in a non-asymptotic manner, which is less desirable; moreover, a last-iterate convergence guarantee absolves the need for agents to switch between learning and deployment; last but not least, it is desirable to converge as fast as possible, with non-asymptotic iteration complexities that depend clearly on salient problem parameters. Substantial algorithmic developments have been made for finding equilibria in two-player zero-sum Markov games, where Dynamic Programming (DP) techniques have long served as a fundamental building block, leading to prototypical iterative schemes such as Value Iteration (VI) (Shapley, 1953) and Policy Iteration (PI) (Van Der Wal, 1978; Patek and Bertsekas, 1999). Different from their single-agent counterparts, these methods require solving a two-player zero-sum matrix game for every state per iteration.
A considerable number of recent works (Zhao et al., 2022; Alacaoglu et al., 2022; Cen et al., 2021b; Chen et al., 2021a) are based on these DP iterations, plugging in various (gradient-based) solvers of two-player zero-sum matrix games. However, these methods are inherently nested-loop, which makes them less convenient to implement. In addition, PI-based methods are asymmetric and come with only one-sided convergence guarantees (Patek and Bertsekas, 1999; Zhao et al., 2022; Alacaoglu et al., 2022). Going beyond nested-loop algorithms, single-loop policy gradient methods have been proposed recently for solving two-player zero-sum Markov games. Here, we are interested in finding an ϵ-optimal NE or QRE in terms of the duality gap, i.e., the difference in the value functions when either of the agents deviates from the solution policy.

• For the infinite-horizon discounted setting, Daskalakis et al. (2020) demonstrated that the independent policy gradient method, with direct parameterization and asymmetric learning rates, finds an ϵ-optimal NE within a polynomial number of iterations. Zeng et al. (2022) improved over this rate using an entropy-regularized policy gradient method with softmax parameterization and asymmetric learning rates. On the other end, Wei et al. (2021b) proposed an optimistic gradient descent ascent (OGDA) method (Rakhlin and Sridharan, 2013) with direct parameterization and symmetric learning rates, which achieves last-iterate convergence at a rather pessimistic iteration complexity.

• For the finite-horizon episodic setting, Zhang et al. (2022); Yang and Ma (2022) showed that the weighted average iterate of the optimistic Follow-The-Regularized-Leader (FTRL) method, when combined with slow critic updates, finds an ϵ-optimal NE in a polynomial number of iterations.

A more complete summary of prior results can be found in Table 1 and Table 2.
In brief, while there has been encouraging progress in developing computationally efficient policy gradient methods for solving zero-sum Markov games, achieving fast finite-time last-iterate convergence with single-loop and symmetric update rules remains a challenging goal.

1.2. OUR CONTRIBUTIONS

Motivated by the positive role of entropy regularization in enabling faster convergence of policy optimization in single-agent RL (Cen et al., 2021a; Lan, 2022) and two-player zero-sum games (Cen et al., 2021b), we propose a single-loop policy optimization algorithm for two-player zero-sum Markov games in both the infinite-horizon and finite-horizon settings. The proposed algorithm follows the style of actor-critic (Konda and Tsitsiklis, 2000), with the actor updating the policy via the entropy-regularized optimistic multiplicative weights update (OMWU) method (Cen et al., 2021b) and the critic updating the value function on a slower timescale. Both agents execute multiplicative and symmetric policy updates, where the learning rates are carefully selected to ensure fast last-iterate convergence. In both the infinite-horizon and finite-horizon settings, we prove that the last iterate of the proposed method learns the optimal value function and converges at a linear rate to the unique QRE of the entropy-regularized Markov game, which can be further translated into finding the NE by setting the regularization sufficiently small.

• For the infinite-horizon discounted setting, the last iterate of our method takes at most O(|S| / ((1-γ)^4 τ) · log(1/ϵ)) iterations to find an ϵ-optimal QRE under entropy regularization, where O(·) hides logarithmic dependencies. Here, |S| is the size of the state space, γ is the discount factor, and τ is the regularization parameter. Moreover, this implies last-iterate convergence with an iteration complexity of O(|S| / ((1-γ)^5 ϵ)) for finding an ϵ-optimal NE.

• For the finite-horizon episodic setting, the last iterate of our method takes at most O(H^2 / τ · log(1/ϵ)) iterations to find an ϵ-optimal QRE under entropy regularization, where H is the horizon length. Similarly, this implies last-iterate convergence with an iteration complexity of O(H^3 / ϵ) for finding an ϵ-optimal NE.
Detailed comparisons between the proposed method and prior arts are provided in Table 1 and Table 2. To the best of our knowledge, this work presents the first method that is simultaneously single-loop, symmetric, and achieves fast finite-time last-iterate convergence in terms of the duality gap in both the infinite-horizon and finite-horizon settings. From a technical perspective, the infinite-horizon discounted setting is particularly challenging, where ours is the first single-loop algorithm that guarantees an iteration complexity of O(1/ϵ) for last-iterate convergence in terms of the duality gap, while also offering clear and improved dependencies on other problem parameters. In contrast, several existing works introduce additional problem-dependent constants (Daskalakis et al., 2020; Wei et al., 2021b; Zeng et al., 2022) in the iteration complexity, which can scale rather pessimistically, sometimes even exponentially, with problem dimensions (Li et al., 2021). Our technical developments require novel ingredients that deviate from prior tools such as error propagation analysis for Bellman operators (Perolat et al., 2015; Patek and Bertsekas, 1999) from a dynamic programming perspective, as well as the gradient dominance condition (Daskalakis et al., 2020; Zeng et al., 2022) from a policy optimization perspective. Importantly, at the core of our analysis lies a carefully designed one-step error contraction bound for policy learning, together with a set of recursive error bounds for value learning, all tailored to the non-Euclidean OMWU update rules that have not been well studied in the setting of Markov games.

Table 1: Iteration complexities for the infinite-horizon discounted setting (single-loop / symmetric / last-iterate; * indicates a one-sided guarantee).

ϵ-NE:
  PI-based methods:                      O(∥1/ρ∥_∞ / ((1-γ)^3 ϵ))*                            ✗ / ✗ / ✓
  VI-based methods (Cen et al., 2021b;
    Chen et al., 2021a):                 O(1 / ((1-γ)^3 ϵ))                                   ✗ / ✓ / ✓
  Daskalakis et al. (2020):              polynomial*                                          ✓ / ✗ / ✗
  Zeng et al. (2022):                    O(|S|^2 ∥1/ρ∥_∞^5 / ((1-γ)^14 c^4 ϵ^3))*             ✓ / ✗ / ✓
  Wei et al. (2021b):                    O(|S|^3 / ((1-γ)^9 ϵ^2))                             ✓ / ✓ / ✗
                                         O(|S|^5 (|A|+|B|)^{1/2} / ((1-γ)^16 c^4 ϵ^2))        ✓ / ✓ / ✓
  This work:                             O(|S| / ((1-γ)^5 ϵ))                                 ✓ / ✓ / ✓

ϵ-QRE:
  VI-based methods (Cen et al., 2021b):  O(1 / (1-γ)^3 · log^2(1/ϵ))                          ✗ / ✓ / ✓
  Zeng et al. (2022):                    O(|S|^2 ∥1/ρ∥_∞^5 / ((1-γ)^11 c^4 τ^3) · log(1/ϵ))*  ✓ / ✗ / ✓
  This work:                             O(|S| / ((1-γ)^4 τ) · log(1/ϵ))                      ✓ / ✓ / ✓

Table 2: Iteration complexities for the finite-horizon episodic setting.

ϵ-NE:
  OFTRL (Zhang et al., 2022; Yang and Ma, 2022):  O(H^5 / ϵ)                                  ✓ / ✓ / ✗
  This work:                                      O(H^3 / ϵ)                                  ✓ / ✓ / ✓

ϵ-QRE:
  This work:                                      O(H^2 / τ · log(1/ϵ))                       ✓ / ✓ / ✓

1.3. RELATED WORKS

Learning in two-player zero-sum matrix games. Freund and Schapire (1999) showed that the average iterate of the Multiplicative Weights Update (MWU) method converges to the NE at a rate of O(1/√T), which in principle holds for many other no-regret algorithms as well. Daskalakis et al. (2011) deployed the excessive gap technique of Nesterov and improved the convergence rate to O(1/T), a rate later achieved by Rakhlin and Sridharan (2013) with a simple modification of the MWU method, named Optimistic Mirror Descent (OMD) or, more commonly, OMWU. Moving beyond average-iterate convergence, Bailey and Piliouras (2018) demonstrated that MWU updates, despite converging in an ergodic manner, diverge from the equilibrium. Daskalakis and Panageas (2018); Wei et al. (2021a) explored the last-iterate convergence guarantee of OMWU, assuming uniqueness of the NE. Cen et al. (2021b) established linear last-iterate convergence of entropy-regularized OMWU without the uniqueness assumption. Sokota et al. (2022) showed that the optimistic update is not necessary for achieving linear last-iterate convergence in the presence of regularization, albeit with a stricter restriction on the step size.

Learning in two-player zero-sum Markov games. In addition to the aforementioned works on policy optimization methods (policy-based methods) for two-player zero-sum Markov games (cf. Table 1 and Table 2), a growing body of works has developed model-based methods (Liu et al., 2021; Zhang et al., 2020; Li et al., 2022) and value-based methods (Bai and Jin, 2020; Bai et al., 2020; Chen et al., 2021b; Jin et al., 2021; Sayin et al., 2021; Xie et al., 2020), with a primary focus on learning NE in a sample-efficient manner. Our work, together with the prior literature on policy optimization, focuses instead on learning NE in a computation-efficient manner assuming full information.

Entropy regularization in RL and games. Entropy regularization is a popular algorithmic idea in RL (Williams and Peng, 1991) that promotes exploration of the policy. A recent line of works (Mei et al., 2020; Cen et al., 2021a; Lan, 2022; Zhan et al., 2021) demonstrated that incorporating entropy regularization provably accelerates policy optimization in single-agent MDPs by enabling fast linear convergence. While the positive role of entropy regularization has also been verified in various game-theoretic settings, e.g., two-player zero-sum matrix games (Cen et al., 2021b), zero-sum polymatrix games (Leonardos et al., 2021), and potential games (Cen et al., 2022), the interplay between entropy regularization and policy optimization in Markov games remains largely unexplored, with only a few exceptions (Zeng et al., 2022).

1.4. NOTATIONS

We denote the probability simplex over a set A by ∆(A). We use brackets with subscripts to index the entries of a vector or matrix, e.g., [x]_a for the a-th element of a vector x, or simply x(a) when it is clear from the context. Given two distributions x, y ∈ ∆(A), the Kullback-Leibler (KL) divergence from y to x is denoted by KL(x ∥ y) = Σ_{a∈A} x(a)(log x(a) - log y(a)). Finally, we denote by ∥A∥_∞ the maximum entrywise absolute value of a matrix A, i.e., ∥A∥_∞ = max_{i,j} |A_{i,j}|.
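As a quick illustration, the KL divergence above can be computed directly from its definition. The sketch below (a numpy helper of our own, using the convention that terms with x(a) = 0 contribute zero) is for illustration only.

```python
import numpy as np

def kl(x, y):
    """KL(x || y) = sum_a x(a) * (log x(a) - log y(a)) over a finite set."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    support = x > 0  # entries with x(a) = 0 contribute nothing by convention
    return float(np.sum(x[support] * (np.log(x[support]) - np.log(y[support]))))

uniform = np.array([0.5, 0.5])
skewed = np.array([0.9, 0.1])
print(kl(uniform, uniform))     # 0.0: KL vanishes when the distributions coincide
print(kl(skewed, uniform) > 0)  # True: KL is nonnegative, positive when they differ
```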

2.1. PROBLEM FORMULATION

Two-player zero-sum discounted Markov game. A two-player zero-sum discounted Markov game is defined by a tuple M = (S, A, B, P, r, γ), with finite state space S, finite action spaces A and B of the two players, reward function r : S × A × B → [0, 1], transition probability kernel P : S × A × B → ∆(S), and discount factor 0 ≤ γ < 1. The action selection rule of the max player (resp. the min player) is represented by µ : S → ∆(A) (resp. ν : S → ∆(B)), where the probability of selecting action a ∈ A (resp. b ∈ B) in state s ∈ S is specified by µ(a|s) (resp. ν(b|s)). The probability of transitioning from state s to a new state s′ upon selecting the action pair (a, b) ∈ A × B is given by P(s′|s, a, b).

Value function and Q-function. For a given policy pair (µ, ν), the value of state s ∈ S is the expected discounted sum of rewards with initial state s_0 = s:

∀s ∈ S :  V^{µ,ν}(s) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t, b_t) | s_0 = s ],   (1)

a quantity the max player seeks to maximize and the min player seeks to minimize. Here, the trajectory (s_0, a_0, b_0, s_1, ···) is generated with a_t ∼ µ(·|s_t), b_t ∼ ν(·|s_t), and s_{t+1} ∼ P(·|s_t, a_t, b_t). Similarly, the Q-function evaluates the expected discounted cumulative reward with initial state-action pair (s, a, b):

∀(s, a, b) ∈ S × A × B :  Q^{µ,ν}(s, a, b) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t, b_t) | s_0 = s, a_0 = a, b_0 = b ].   (2)

For notational simplicity, we denote by Q^{µ,ν}(s) ∈ R^{|A|×|B|} the matrix [Q^{µ,ν}(s, a, b)]_{(a,b)∈A×B}, so that

∀s ∈ S :  V^{µ,ν}(s) = µ(s)^⊤ Q^{µ,ν}(s) ν(s).

Shapley (1953) proved the existence of a policy pair (µ⋆, ν⋆) that solves the min-max problem max_µ min_ν V^{µ,ν}(s) for all s ∈ S simultaneously, and that the minimax value is unique. Such an optimal policy pair (µ⋆, ν⋆) is called a Nash equilibrium (NE) of the Markov game.

Entropy-regularized two-player zero-sum Markov game. Entropy regularization is shown to provably accelerate convergence in single-agent RL (Geist et al., 2019; Mei et al., 2020; Cen et al., 2021a) and facilitate the analysis in two-player zero-sum matrix games (Cen et al., 2021b) as well as Markov games (Cen et al., 2021b; Zeng et al., 2022).
The entropy-regularized value function V_τ^{µ,ν}(s) is defined as

∀s ∈ S :  V_τ^{µ,ν}(s) = E[ Σ_{t=0}^∞ γ^t ( r(s_t, a_t, b_t) - τ log µ(a_t|s_t) + τ log ν(b_t|s_t) ) | s_0 = s ],   (3)

where τ ≥ 0 is the regularization parameter. Similarly, the regularized Q-function Q_τ^{µ,ν} is given by

∀(s, a, b) ∈ S × A × B :  Q_τ^{µ,ν}(s, a, b) = r(s, a, b) + γ E_{s′∼P(·|s,a,b)}[ V_τ^{µ,ν}(s′) ].   (4)

It is known (Cen et al., 2021b) that there exists a unique policy pair (µ⋆_τ, ν⋆_τ) that solves the min-max entropy-regularized problem

max_µ min_ν V_τ^{µ,ν}(s),  or equivalently  max_µ min_ν µ(s)^⊤ Q_τ^{µ,ν}(s) ν(s) + τ H(µ(s)) - τ H(ν(s)),   (5)

for all s ∈ S, and we call (µ⋆_τ, ν⋆_τ) the quantal response equilibrium (QRE) (McKelvey and Palfrey, 1995) of the entropy-regularized Markov game. We denote the associated regularized value function and Q-function by V⋆_τ(s) = V_τ^{µ⋆_τ,ν⋆_τ}(s) and Q⋆_τ(s, a, b) = Q_τ^{µ⋆_τ,ν⋆_τ}(s, a, b).

Goal. We seek to find an ϵ-optimal QRE, or ϵ-QRE (resp. an ϵ-optimal NE, or ϵ-NE), ζ = (µ, ν) which satisfies

max_{s∈S, µ′, ν′} ( V_τ^{µ′,ν}(s) - V_τ^{µ,ν′}(s) ) ≤ ϵ   (6)

(resp. max_{s∈S, µ′, ν′} ( V^{µ′,ν}(s) - V^{µ,ν′}(s) ) ≤ ϵ) in a computationally efficient manner. Indeed, the solution concept of ϵ-QRE provides an approximation of ϵ-NE with an appropriate choice of the regularization parameter τ. Basic calculations give

V^{µ′,ν}(s) - V^{µ,ν′}(s) = V_τ^{µ′,ν}(s) - V_τ^{µ,ν′}(s) + ( V^{µ′,ν}(s) - V_τ^{µ′,ν}(s) ) - ( V^{µ,ν′}(s) - V_τ^{µ,ν′}(s) )
                          ≤ V_τ^{µ′,ν}(s) - V_τ^{µ,ν′}(s) + τ (log|A| + log|B|) / (1-γ),

which guarantees that an ϵ/2-QRE is an ϵ-NE as long as τ ≤ (1-γ)ϵ / (2(log|A| + log|B|)). For technical convenience, we assume τ ≤ 1/max{1, log|A| + log|B|} throughout the paper.

Additional notation. For notational convenience, we denote by ζ the concatenation of a policy pair (µ, ν), i.e., ζ = (µ, ν). The QRE of the regularized problem is denoted by ζ⋆_τ = (µ⋆_τ, ν⋆_τ). We use the shorthand notation µ(s) and ν(s) to denote µ(·|s) and ν(·|s).
In addition, we write KL(µ(s) ∥ µ′(s)) and KL(ν(s) ∥ ν′(s)) as KL_s(µ ∥ µ′) and KL_s(ν ∥ ν′), respectively, and let KL_s(ζ ∥ ζ′) = KL_s(µ ∥ µ′) + KL_s(ν ∥ ν′). By a slight abuse of notation, KL_ρ(ζ ∥ ζ′) denotes E_{s∼ρ}[ KL_s(ζ ∥ ζ′) ] for ρ ∈ ∆(S).
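To make the definitions above concrete, note that for a fixed policy pair ζ = (µ, ν), the regularized value function V_τ^{µ,ν} solves a |S|-dimensional linear system, since the log-policy penalties reduce to entropy terms in expectation. The sketch below (array shapes and function names are our own, for illustration) evaluates V_τ^{µ,ν} by a direct linear solve.

```python
import numpy as np

def regularized_value(r, P, mu, nu, gamma, tau):
    """Evaluate V_tau^{mu,nu} for a fixed policy pair by solving
    (I - gamma * P_z) V = r_z, where subscript z marginalizes over (mu, nu).

    r:  (S, A, B) reward tensor, entries in [0, 1]
    P:  (S, A, B, S) transition kernel; P[s, a, b] is a distribution over next states
    mu: (S, A) max-player policy; nu: (S, B) min-player policy
    """
    S = r.shape[0]
    entropy = lambda p: -np.sum(p * np.log(p), axis=-1)
    # one-step regularized reward: E[r] + tau * H(mu(s)) - tau * H(nu(s))
    r_z = np.einsum('sa,sab,sb->s', mu, r, nu) + tau * (entropy(mu) - entropy(nu))
    # induced state-to-state transition matrix under (mu, nu)
    P_z = np.einsum('sa,sabt,sb->st', mu, P, nu)
    return np.linalg.solve(np.eye(S) - gamma * P_z, r_z)
```

Setting tau = 0 recovers the unregularized value function V^{µ,ν}.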

2.2. SINGLE-LOOP ALGORITHM DESIGN

In this section, we propose a single-loop policy optimization algorithm for finding the QRE of the entropy-regularized Markov game, generalizing the entropy-regularized OMWU method (Cen et al., 2021b) for solving entropy-regularized matrix games, with a careful orchestration of the policy update and the value update.

Algorithm 1: Entropy-regularized OMWU for discounted two-player zero-sum Markov games

Input: regularization parameter τ > 0, learning rate η > 0 for policy updates, learning rates {α_t}_{t=1}^∞ for value updates.
Initialization: set µ^(0), µ̄^(0), ν^(0), and ν̄^(0) as uniform policies; set Q^(0) = 0 and V^(0) = τ(log|A| - log|B|).
for t = 0, 1, ··· do
  for all s ∈ S do in parallel
    When t ≥ 1, update the policy pair ζ^(t)(s) as:
      µ^(t)(a|s) ∝ µ^(t-1)(a|s)^{1-ητ} exp(η [Q^(t)(s) ν̄^(t)(s)]_a),
      ν^(t)(b|s) ∝ ν^(t-1)(b|s)^{1-ητ} exp(-η [Q^(t)(s)^⊤ µ̄^(t)(s)]_b).   (9a)
    Update the policy pair ζ̄^(t+1)(s) as:
      µ̄^(t+1)(a|s) ∝ µ^(t)(a|s)^{1-ητ} exp(η [Q^(t)(s) ν̄^(t)(s)]_a),
      ν̄^(t+1)(b|s) ∝ ν^(t)(b|s)^{1-ητ} exp(-η [Q^(t)(s)^⊤ µ̄^(t)(s)]_b).   (9b)
    Update Q^(t+1)(s) and V^(t+1)(s) as:
      Q^(t+1)(s, a, b) = r(s, a, b) + γ E_{s′∼P(·|s,a,b)}[ V^(t)(s′) ],
      V^(t+1)(s) = (1 - α_{t+1}) V^(t)(s) + α_{t+1} ( µ̄^(t+1)(s)^⊤ Q^(t+1)(s) ν̄^(t+1)(s) + τ H(µ̄^(t+1)(s)) - τ H(ν̄^(t+1)(s)) ).   (10)

Review: entropy-regularized OMWU for two-player zero-sum matrix games. We briefly review the algorithm design of the entropy-regularized OMWU method for two-player zero-sum matrix games (Cen et al., 2021b). The problem of interest can be described as

max_{µ∈∆(A)} min_{ν∈∆(B)} µ^⊤ A ν + τ H(µ) - τ H(ν),   (7)

where A ∈ R^{|A|×|B|} is the payoff matrix of the game. The update rule of entropy-regularized OMWU with learning rate η > 0 is defined as follows: for all a ∈ A and b ∈ B,

  µ^(t)(a) ∝ µ^(t-1)(a)^{1-ητ} exp(η [A ν̄^(t)]_a),      µ̄^(t+1)(a) ∝ µ^(t)(a)^{1-ητ} exp(η [A ν̄^(t)]_a),
  ν^(t)(b) ∝ ν^(t-1)(b)^{1-ητ} exp(-η [A^⊤ µ̄^(t)]_b),    ν̄^(t+1)(b) ∝ ν^(t)(b)^{1-ητ} exp(-η [A^⊤ µ̄^(t)]_b).   (8)
We remark that the update rule can alternatively be motivated from the perspective of natural policy gradient (Kakade, 2002; Cen et al., 2021a) or mirror descent (Lan, 2022; Zhan et al., 2021) with optimistic updates. In particular, the midpoint (µ̄^(t+1), ν̄^(t+1)) serves as a prediction of (µ^(t+1), ν^(t+1)) obtained by running one step of mirror descent. Cen et al. (2021b) established that the last iterate of entropy-regularized OMWU converges to the QRE of the matrix game (7) at a linear rate (1 - ητ)^t, as long as the step size η is no larger than min{1/(2∥A∥_∞ + 2τ), 1/(4∥A∥_∞)}.

Single-loop algorithm for two-player zero-sum Markov games. In view of the similarity between the problem formulations (5) and (7), it is tempting to apply the aforementioned method to the Markov game in a state-wise manner, with the Q-function assuming the role of the payoff matrix. It is worth noting, however, that the Q-function depends on the policy pair ζ = (µ, ν) and hence changes concurrently with the update of the policy pair. We take inspiration from Bai et al. (2020); Wei et al. (2021b) and equip the entropy-regularized OMWU method with the following update rule that iteratively approximates the value function in an actor-critic fashion:

Q^(t+1)(s, a, b) = r(s, a, b) + γ E_{s′∼P(·|s,a,b)}[ V^(t)(s′) ],

where V^(t+1) is updated as a convex combination of the previous V^(t) and the regularized game value induced by Q^(t+1) as well as the policy pair ζ̄^(t+1) = (µ̄^(t+1), ν̄^(t+1)):

V^(t+1)(s) = (1 - α_{t+1}) V^(t)(s) + α_{t+1} ( µ̄^(t+1)(s)^⊤ Q^(t+1)(s) ν̄^(t+1)(s) + τ H(µ̄^(t+1)(s)) - τ H(ν̄^(t+1)(s)) ).

The update of V becomes more conservative with a smaller learning rate α_t, hence stabilizing the update of the policies. However, setting α_t too small slows down the convergence of V to V⋆_τ. A key novelty suggested by our analysis is the choice of the constant learning rate α := α_t = ητ, which updates on a slower timescale than the policy since τ < 1. This is in sharp contrast to the vanishing sequence α_t = (2/(1-γ) + 1)/(2/(1-γ) + t) adopted in Wei et al. (2021b), which is essential in their analysis but inevitably leads to much slower convergence. We summarize the detailed procedure in Algorithm 1. Last but not least, it is worth noting that the proposed method accesses the reward only via "first-order information", i.e., each agent can only update its policy with the marginalized value function Q(s)ν(s) or Q(s)^⊤µ(s). Update rules of this kind are instrumental in breaking the curse of multi-agents in the sample complexity when adopting sample-based estimates in (10), as we only need to estimate the marginalized Q-function rather than its full form (Li et al., 2022; Chen et al., 2021a).
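For concreteness, the single-loop procedure of Algorithm 1 can be sketched in a few dozen lines. The code below is an illustrative tabular implementation under our own naming conventions (exact expectations over a known kernel P, no sampling), not the authors' reference code.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p), axis=-1)

def omwu_markov_game(r, P, gamma, tau, eta, T):
    """Tabular sketch of Algorithm 1: entropy-regularized OMWU (actor)
    with a slower value update alpha = eta * tau (critic).

    r: (S, A, B) rewards; P: (S, A, B, S) transition kernel.
    Returns the midpoint policies (mu_bar, nu_bar) and the value estimate V.
    """
    S, A, B = r.shape
    alpha = eta * tau  # critic learning rate, slower than the actor since tau < 1
    mu = np.full((S, A), 1.0 / A)
    nu = np.full((S, B), 1.0 / B)
    mu_bar, nu_bar = mu.copy(), nu.copy()
    Q = np.zeros((S, A, B))
    V = np.full(S, tau * (np.log(A) - np.log(B)))

    def project(logits):
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)

    for t in range(T):
        qa = np.einsum('sab,sb->sa', Q, nu_bar)  # [Q(s) nu_bar(s)]_a
        qb = np.einsum('sab,sa->sb', Q, mu_bar)  # [Q(s)^T mu_bar(s)]_b
        if t >= 1:  # update the policy pair zeta^(t)
            mu = project((1 - eta * tau) * np.log(mu) + eta * qa)
            nu = project((1 - eta * tau) * np.log(nu) - eta * qb)
        # update the midpoint pair zeta_bar^(t+1) from the fresh zeta^(t)
        mu_bar = project((1 - eta * tau) * np.log(mu) + eta * qa)
        nu_bar = project((1 - eta * tau) * np.log(nu) - eta * qb)
        # critic: one-step lookahead Q, then a conservative convex combination
        Q = r + gamma * np.einsum('sabt,t->sab', P, V)
        game_value = np.einsum('sa,sab,sb->s', mu_bar, Q, nu_bar)
        V = (1 - alpha) * V + alpha * (game_value + tau * (entropy(mu_bar) - entropy(nu_bar)))
    return mu_bar, nu_bar, V
```

The theoretical step-size condition η ≤ (1-γ)^3/(32000|S|) is conservative; in practice one may experiment with larger η.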

2.3. THEORETICAL GUARANTEES

Below we present our main results concerning the last-iterate convergence of Algorithm 1 for solving entropy-regularized two-player zero-sum Markov games in the infinite-horizon discounted setting. The proof is postponed to Appendix A.

Theorem 1. Setting 0 < η ≤ (1-γ)^3/(32000|S|) and α_t = ητ, it holds for all t ≥ 0 that

max{ (1/|S|) Σ_{s∈S} KL_s(ζ⋆_τ ∥ ζ^(t)),  (1/(2|S|)) Σ_{s∈S} KL_s(ζ⋆_τ ∥ ζ̄^(t)),  (3η/|S|) Σ_{s∈S} ∥Q^(t)(s) - Q⋆_τ(s)∥_∞ } ≤ (3000/((1-γ)^2 τ)) (1 - (1-γ)ητ/4)^t

and

max_{s∈S, µ, ν} ( V_τ^{µ, ν̄^(t)}(s) - V_τ^{µ̄^(t), ν}(s) ) ≤ (6000|S|/((1-γ)^3 τ)) max{ 8/((1-γ)^2 τ), 1/η } (1 - (1-γ)ητ/4)^t.

Theorem 1 demonstrates that as long as the learning rate η is small enough, the last iterate of Algorithm 1 converges at a linear rate for the entropy-regularized Markov game. Compared with the prior literature on policy optimization, our analysis focuses on the last-iterate convergence of non-Euclidean updates in the presence of entropy regularization, which appears to be the first of its kind. Several remarks are in order, with detailed comparisons in Table 1.

• Linear convergence to the QRE. Theorem 1 demonstrates that the last iterate of Algorithm 1 takes at most O(1/((1-γ)ητ) · log(1/ϵ)) iterations to yield an ϵ-optimal policy in terms of the KL divergence to the QRE, max_{s∈S} KL_s(ζ⋆_τ ∥ ζ̄^(t)) ≤ ϵ, the entrywise error of the regularized Q-function, ∥Q^(t) - Q⋆_τ∥_∞ ≤ ϵ, as well as the duality gap, max_{s∈S, µ, ν} ( V_τ^{µ, ν̄^(t)}(s) - V_τ^{µ̄^(t), ν}(s) ) ≤ ϵ, all at once. Minimizing the bound over the learning rate η, the proposed method is guaranteed to find an ϵ-QRE within O(|S|/((1-γ)^4 τ) · log(1/ϵ)) iterations, which significantly improves upon the one-sided convergence rate of Zeng et al. (2022).

• Last-iterate convergence to an ϵ-optimal NE. By setting τ = (1-γ)ϵ/(2(log|A| + log|B|)), this immediately leads to provable last-iterate convergence to an ϵ-NE, with an iteration complexity of O(|S|/((1-γ)^5 ϵ)), which again outperforms the convergence rate in Wei et al. (2021b).

Remark 1. The learning rate η is constrained to be inversely proportional to |S|, which accounts for the worst case and can potentially be loosened for problems with a small concentrability coefficient. We refer interested readers to Appendix A for details.
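To spell out how the stated complexities follow from Theorem 1, the contraction factor 1 - (1-γ)ητ/4 governs the iteration count; the bookkeeping below tracks only the dependence on |S|, 1-γ, τ, and ϵ (logarithmic factors are absorbed into O(·)).

```latex
% Linear rate to iteration count:
\left(1 - \tfrac{(1-\gamma)\eta\tau}{4}\right)^t \le \epsilon'
\quad\Longleftarrow\quad
t \ge \frac{4}{(1-\gamma)\eta\tau}\,\log\frac{1}{\epsilon'} .
% Taking the largest admissible step size $\eta \asymp (1-\gamma)^3/|S|$ gives
t = O\!\left(\frac{|S|}{(1-\gamma)^4 \tau}\,\log\frac{1}{\epsilon}\right)
\quad \text{for an $\epsilon$-QRE.}
% Choosing $\tau = \frac{(1-\gamma)\epsilon}{2(\log|A|+\log|B|)}$,
% so that an $\epsilon/2$-QRE is an $\epsilon$-NE, then yields
t = O\!\left(\frac{|S|}{(1-\gamma)^5 \epsilon}\right).
```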

3. ALGORITHM AND THEORY: THE EPISODIC SETTING

Episodic two-player zero-sum Markov game. An episodic two-player zero-sum Markov game is defined by a tuple (S, A, B, H, {P_h}_{h=1}^H, {r_h}_{h=1}^H), with S a finite state space, A and B the finite action spaces of the two players, and H > 0 the horizon length. Every step h ∈ [H] admits a transition probability kernel P_h : S × A × B → ∆(S) and a reward function r_h : S × A × B → [0, 1]. Furthermore, µ = {µ_h}_{h=1}^H and ν = {ν_h}_{h=1}^H denote the policies of the two players, where the probability of the max player choosing a ∈ A (resp. the min player choosing b ∈ B) at step h is specified by µ_h(a|s) (resp. ν_h(b|s)).

Entropy-regularized value functions. The value function and Q-function characterize the expected cumulative reward starting from step h by following the policy pair (µ, ν). For conciseness, we only present the definition of the entropy-regularized value functions below, and remark that their unregularized counterparts V_h^{µ,ν} and Q_h^{µ,ν} can be obtained by setting τ = 0. We have

V_{h,τ}^{µ,ν}(s) = E[ Σ_{h′=h}^H ( r_{h′}(s_{h′}, a_{h′}, b_{h′}) - τ log µ_{h′}(a_{h′}|s_{h′}) + τ log ν_{h′}(b_{h′}|s_{h′}) ) | s_h = s ];
Q_{h,τ}^{µ,ν}(s, a, b) = r_h(s, a, b) + E_{s′∼P_h(·|s,a,b)}[ V_{h+1,τ}^{µ,ν}(s′) ].

The solution concepts of NE and QRE are defined in a similar manner with the episodic versions of the value functions. We again denote the unique QRE by ζ⋆_τ = (µ⋆_τ, ν⋆_τ).

Proposed method and convergence guarantee. It is straightforward to adapt Algorithm 1 to the episodic setting with minimal modifications, with the detailed procedure showcased in Algorithm 2 (cf. Appendix B). The analysis, which deviates substantially from the discounted setting, exploits the structure of finite-horizon MDPs and time-inhomogeneous policies, enabling a much larger range of learning rates, as shown in the following theorem.

Theorem 2. Setting 0 < η ≤ 1/(8H) and α_t = ητ, it holds for all h ∈ [H] and t ≥ T_h := (H - h) T_start with T_start = ⌈(1/(ητ)) log H⌉ that

∥Q⋆_{h,τ} - Q_h^(t)∥_∞ ≤ (1 - ητ)^{t - T_h} t^{H-h};   (13a)

max_{s∈S, µ, ν} ( V_{h,τ}^{µ, ν̄^(t)}(s) - V_{h,τ}^{µ̄^(t), ν}(s) ) ≤ 4 (1 - ητ)^{t - T_h} max{ 8H^2/τ, 1/η } ( 8H/τ + 6η t^{H-h+1} ).   (13b)

Theorem 2 implies that the last iterate of Algorithm 2 takes no more than O(H T_start + (H/(ητ)) log(1/ϵ)) = O((H/(ητ)) log(1/ϵ)) iterations to find an ϵ-QRE. Minimizing the bound over the learning rate η, Algorithm 2 is guaranteed to find an ϵ-QRE in O(H^2/τ · log(1/ϵ)) iterations, which translates into an iteration complexity of O(H^3/ϵ) for finding an ϵ-NE in terms of the duality gap, i.e., max_{s∈S, h∈[H], µ, ν} ( V_h^{µ, ν̄^(t)}(s) - V_h^{µ̄^(t), ν}(s) ) ≤ ϵ, by setting τ = O(ϵ/(H(log|A| + log|B|))).
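The same bookkeeping as in the discounted case turns Theorem 2 into the claimed complexities; the steps below are indicative, tracking only the dependence on H, τ, and ϵ.

```latex
% Burn-in plus linear contraction:
T_{\mathrm{total}}
= O\!\left(H\,T_{\mathrm{start}} + \frac{1}{\eta\tau}\log\frac{1}{\epsilon}\right)
= O\!\left(\frac{H}{\eta\tau}\,\log\frac{1}{\epsilon}\right),
\qquad
T_{\mathrm{start}} = \left\lceil \frac{1}{\eta\tau}\log H \right\rceil .
% With the largest admissible step size $\eta \asymp 1/H$:
T_{\mathrm{total}} = O\!\left(\frac{H^2}{\tau}\,\log\frac{1}{\epsilon}\right)
\quad \text{for an $\epsilon$-QRE,}
% and with $\tau = O\!\left(\frac{\epsilon}{H(\log|A|+\log|B|)}\right)$:
T_{\mathrm{total}} = \widetilde{O}\!\left(\frac{H^3}{\epsilon}\right)
\quad \text{for an $\epsilon$-NE.}
```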

4. DISCUSSION

This work develops policy optimization methods for zero-sum Markov games that feature single-loop and symmetric updates with provable last-iterate convergence guarantees. Our approach yields better iteration complexities in both the infinite-horizon and finite-horizon settings by adopting entropy regularization and non-Euclidean policy updates. Important future directions include investigating whether larger learning rates are possible without knowing problem-dependent information a priori, extending the framework to allow function approximation, and designing sample-efficient implementations of the proposed method.

Appendix

A  ANALYSIS FOR THE INFINITE-HORIZON SETTING

Definition 1. Given ρ ∈ ∆(S) with ρ(s) > 0 for all s ∈ S, the concentrability coefficient c_ρ(t) is defined as

c_ρ(t) = sup_{x^(l) ∈ A^S, y^(l) ∈ B^S, 1 ≤ l ≤ t} ∥ (ρ P_{x^(1),y^(1)} ··· P_{x^(t),y^(t)}) / ρ ∥_∞,

where the division is entrywise and P_{x^(l),y^(l)} ∈ R^{|S|×|S|} is the state transition matrix induced by the pair of deterministic policies (x^(l), y^(l)): [P_{x^(l),y^(l)}]_{s,s′} = P(s′ | s, x^(l)(s), y^(l)(s)). Let C_ρ be the maximum value of c_ρ(t) over t ≥ 0: C_ρ = sup_{t≥0} c_ρ(t). In addition, let Γ(ρ) be the set of all possible distributions over S induced by the initial distribution ρ and deterministic policy sequences, i.e.,

Γ(ρ) = ∪_{t=0}^∞ { ρ P_{x^(1),y^(1)} ··· P_{x^(t),y^(t)} : x^(l) ∈ A^S, y^(l) ∈ B^S, ∀l ∈ [t] }.

We note that Theorem 1 is a direct corollary of the following theorems, obtained by setting ρ to the uniform distribution over S, in which case C_ρ admits the trivial upper bound |S|.

Theorem 3. With 0 < η ≤ (1-γ)^3/(32000 C_ρ) and α_t = ητ, we have

max{ KL_ρ(ζ⋆_τ ∥ ζ^(t)),  (1/2) KL_ρ(ζ⋆_τ ∥ ζ̄^(t)),  3η E_{s∼ρ} ∥Q^(t)(s) - Q⋆_τ(s)∥_∞ } ≤ (3000/((1-γ)^2 τ)) (1 - (1-γ)ητ/4)^t.

Definition 2. We define the regularized minimax mismatch coefficient by

C†_{ρ,τ} = max{ max_µ ∥ d_ρ^{µ, ν†_τ(µ)} / ρ ∥_∞,  max_ν ∥ d_ρ^{µ†_τ(ν), ν} / ρ ∥_∞ }.

Here, ν†_τ(µ) denotes the optimal policy of the min player when the max player adopts policy µ: ν†_τ(µ) = argmin_ν V_τ^{µ,ν}(s), and µ†_τ(ν) is defined in a symmetric way. The discounted state visitation distribution d_ρ^{µ,ν} is defined as

d_ρ^{µ,ν}(s) = (1-γ) E_{s_0∼ρ} [ Σ_{t=0}^∞ γ^t P(s_t = s | s_0) ].

Note that this definition parallels that of the (unregularized) minimax mismatch coefficient in Daskalakis et al. (2020).

Theorem 4. With 0 < η ≤ (1-γ)^3/(32000 C_ρ) and α_t = ητ, we have

max_{s∈S, µ, ν} ( V_τ^{µ, ν̄^(t)}(s) - V_τ^{µ̄^(t), ν}(s) ) ≤ (6000 ∥1/ρ∥_∞ / ((1-γ)^3 τ)) max{ 8/((1-γ)^2 τ), 1/η } (1 - (1-γ)ητ/4)^t,
max_{µ, ν} ( V_τ^{µ, ν̄^(t)}(ρ) - V_τ^{µ̄^(t), ν}(ρ) ) ≤ (6000 C†_{ρ,τ} / ((1-γ)^3 τ)) max{ 8/((1-γ)^2 τ), 1/η } (1 - (1-γ)ητ/4)^t.

We start with the following lemma.
The proof can be found in Appendix C.1. For notational simplicity, we set Q^(-1) = 0, ζ̄^(-1) = ζ̄^(0), and α_0 = 1. It follows from the update rule (9a) that ζ̄^(1) = ζ^(0) = ζ̄^(0).

Lemma 1. It holds for any step size 0 < η ≤ 1/τ and t ≥ 0 that

KL_ρ(ζ⋆_τ ∥ ζ^(t+1)) - (1-ητ) KL_ρ(ζ⋆_τ ∥ ζ^(t)) + (1 - ητ - 4η/(1-γ)) KL_ρ(ζ̄^(t+1) ∥ ζ̄^(t)) + ητ KL_ρ(ζ̄^(t+1) ∥ ζ⋆_τ)
  + (1 - 2η/(1-γ)) KL_ρ(ζ^(t+1) ∥ ζ̄^(t+1)) + (1-ητ) KL_ρ(ζ̄^(t) ∥ ζ^(t)) - (2η/(1-γ)) KL_ρ(ζ̄^(t) ∥ ζ̄^(t-1))
≤ E_{s∼ρ}[ 2η ∥Q^(t+1)(s) - Q⋆_τ(s)∥_∞ + (4η^2/(1-γ)) ∥Q^(t)(s) - Q^(t+1)(s)∥_∞ + (12η^2/(1-γ)) ∥Q^(t-1)(s) - Q^(t)(s)∥_∞ ].   (14)

It remains to bound the terms on the right-hand side of (14). By a slight abuse of notation, we denote

∥Q^(t+1) - Q⋆_τ∥_{Γ(ρ)} = sup_{χ∈Γ(ρ)} E_{s∼χ} ∥Q^(t+1)(s) - Q⋆_τ(s)∥_∞,
∥Q^(t+1) - Q^(t)∥_{Γ(ρ)} = sup_{χ∈Γ(ρ)} E_{s∼χ} ∥Q^(t+1)(s) - Q^(t)(s)∥_∞.

The following two lemmas establish a set of recursive bounds that relate {∥Q^(l+1) - Q⋆_τ∥_{Γ(ρ)}}_{l=0,···,t} and {∥Q^(l+1) - Q^(l)∥_{Γ(ρ)}}_{l=0,···,t} to {KL_ρ(ζ̄^(l+1) ∥ ζ̄^(l))}_{l=0,···,t-1}.

Lemma 2. With 0 < η ≤ min{(1-γ)/180, (1-γ)^2/48}, it holds for all t ≥ 1 that

∥Q^(t+1) - Q^(t)∥_{Γ(ρ)} ≤ ((1+γ)/2) Σ_{l=1}^t α_{l,t} ∥Q^(l) - Q^(l-1)∥_{Γ(ρ)} + 4 C_ρ η · Σ_{l=1}^t α_{l,t} KL_ρ(ζ̄^(l) ∥ ζ̄^(l-1)),   (15)

where α_{l,t} is defined as α_{l,t} = α_l Π_{i=l+1}^t (1 - α_i). When t = 0, we have ∥Q^(1) - Q^(0)∥_{Γ(ρ)} ≤ 2.

Proof. The proof can be found in Appendix C.2.

Lemma 3. With 0 < η ≤ (1-γ)^2/16, it holds for all t ≥ 1 that

∥Q^(t+1) - Q⋆_τ∥_{Γ(ρ)} ≤ ((1+γ)/2) · Σ_{l=0}^t α_{l,t} ( ∥Q^(l) - Q⋆_τ∥_{Γ(ρ)} + (2η/(1-γ)) ∥Q^(l) - Q^(l-1)∥_{Γ(ρ)} ) + 2 α_{0,t}.

When t = 0, we have ∥Q^(1) - Q⋆_τ∥_{Γ(ρ)} ≤ 2γ/(1-γ).

Proof. The proof can be found in Appendix C.3.

The following lemma further demystifies the complicated recursive bounds in Lemmas 2 and 3.

Lemma 4. Let λ_{l,t} be defined as λ_{l,t} = α_l Π_{i=l+1}^t (1 - ((1-γ)/4) α_i). Under the assumptions of Lemmas 2 and 3, it holds for all t ≥ 0 that

Σ_{l=0}^t λ_{l+1,t+1} ( η ∥Q⋆_τ - Q^(l+1)∥_{Γ(ρ)} + (12η^2/(1-γ)^2) ∥Q^(l+1) - Q^(l)∥_{Γ(ρ)} )
≤ (6250 η C_ρ / (1-γ)^3) Σ_{l=0}^{t-1} λ_{l+1,t+1} KL_ρ(ζ̄^(l+1) ∥ ζ̄^(l)) + (550 η / (1-γ)^2) λ_{0,t+1}.

Proof.
The proof can be found in Appendix C.4. We are now ready to prove our main results. Averaging ( 14) with weight λ gives t l=0 λ l+1,t+1 KL ρ ζ ⋆ τ ∥ ζ (l+1) -(1 -ητ )KL ρ ζ ⋆ τ ∥ ζ (l) + 1 - 2η 1 -γ KL ρ ζ (l+1) ∥ ζ(l+1) + 3η E s∼ρ Q (l+1) (s) -Q ⋆ τ (s) ∞ + 1 -ητ - 4η 1 -γ KL ρ ζ(l+1) ∥ ζ(l) - 2η 1 -γ KL ρ ζ(l) ∥ ζ(l-1) ≤ t l=0 λ l+1,t+1 E s∼ρ 5η Q (l+1) (s) -Q ⋆ τ (s) ∞ + 4η 2 1 -γ Q (l+1) (s) -Q (l) (s) ∞ + 13η 2 1 -γ Q (l-1) (s) -Q (l) (s) ∞ ≤ 5 t l=0 λ l+1,t+1 E s∼ρ η Q ⋆ τ (s) -Q (l+1) (s) Γ(ρ) + 12η 2 (1 -γ) 2 Q (l+1) (s) -Q (l) (s) Γ(ρ) ≤ 31250ηC ρ (1 -γ) 3 t-1 l=0 λ l+1,t+1 KL ρ ζ(l+1) ∥ ζ(l) + 2750η (1 -γ) 2 λ 0,t+1 for all t ≥ 0. Rearranging terms, we have α t+1 KL ρ ζ ⋆ τ ∥ ζ (t+1) + 1 - 2η 1 -γ KL ρ ζ (t+1) ∥ ζ(t+1) + 3η E s∼ρ Q (t+1) (s) -Q ⋆ τ (s) ∞ + t l=1 (λ l,t+1 -(1 -ητ )λ l+1,t+1 )KL ρ ζ ⋆ τ ∥ ζ (l) + t-1 l=0 λ l+1,t+1 1 -ητ - 4η 1 -γ - 31250ηC ρ (1 -γ) 3 -λ l+2,t+1 2η 1 -γ KL ζ(l+1) ∥ ζ(l) ≤ 2750η (1 -γ) 2 λ 0,t+1 + (1 -ητ )λ 1,t+1 KL ρ ζ ⋆ τ ∥ ζ (0) ≤ 2750η (1 -γ) 2 + η λ 0,t+1 . With 0 < η ≤ (1-γ) 3 32000Cρ , and α i = ητ , we have λ l,t+1 -(1 -ητ )λ l+1,t+1 ≥ 0 (c.f. ( 40)), and λ l+1,t+1 1 -ητ - 4η 1 -γ - 31250ηC ρ (1 -γ) 3 -λ l+2,t+1 2η 1 -γ = ητ t+1 j=l+3 1 - 1 -γ 4 α j (1 - 1 -γ 4 ητ ) 1 -ητ - 4η 1 -γ - 31250ηC ρ (1 -γ) 3 - 2η 1 -γ ≥ 0. It follows that KL ρ ζ ⋆ τ ∥ ζ (t+1) + 1 - 2η 1 -γ KL ρ ζ (t+1) ∥ ζ(t+1) + 3η E s∼ρ Q (t+1) (s) -Q ⋆ τ (s) ∞ ≤ 2750 (1 -γ) 2 τ + 1 τ 1 - (1 -γ)ητ 4 t+1 < 3000 (1 -γ) 2 τ 1 - (1 -γ)ητ 4 t+1 . ( ) This proves the bound of KL ρ ζ ⋆ τ ∥ ζ (t+1) and 3η E s∼ρ Q (t+1) (s) -Q ⋆ τ (s) ∞ in Theorem 3. Note that the bound holds trivially for KL ρ ζ ⋆ τ ∥ ζ (0) and 3η E s∼ρ Q (0) (s) -Q ⋆ τ (s) ∞ . It re- mains to bound KL ζ ⋆ τ ∥ ζ(t+1) . Lemma 5. With 0 < η ≤ (1 -γ)/8, we have 1 2 KL s ζ ⋆ τ ∥ ζ(t+1) + ητ KL s ζ(t+1) ∥ ζ ⋆ τ ≤ (1 -ητ )KL s ζ ⋆ τ ∥ ζ (t) + 2η 1 -γ KL s ζ (t) ∥ ζ(t) + 2η Q (t) (s) -Q ⋆ τ (s) ∞ . Proof. See Appendix C.5. 
Combining the above Lemma with (17) gives 1 2 KL ρ ζ ⋆ τ ∥ ζ(t+1) + ητ KL ρ ζ(t+1) ∥ ζ ⋆ τ ≤ (1 -ητ ) KL ρ ζ ⋆ τ ∥ ζ (t) + 1 - 2η 1 -γ KL ρ ζ (t) ∥ ζ(t) + 3η E s∼ρ Q (t) (s) -Q ⋆ τ (s) ∞ ≤ 3000 (1 -γ) 2 τ 1 - (1 -γ)ητ 4 t+1 . We are now ready to bound the duality gap. Before proceeding, we introduce the following two lemmas: Lemma 6. It holds for any policy pair µ, ν that max µ ′ ,ν ′ V µ ′ ,ν τ (ρ) -V µ,ν ′ τ (ρ) ≤ 2C † ρ,τ 1 -γ E s∼ρ max µ ′ ,ν ′ f s (Q ⋆ τ , µ ′ , ν) -f s (Q ⋆ τ , µ, ν ′ ) (18) and max s∈S,µ ′ ,ν ′ V µ ′ ,ν τ (s) -V µ,ν ′ τ (s) ≤ 2∥1/ρ∥ ∞ 1 -γ E s∼ρ max µ ′ ,ν ′ f s (Q ⋆ τ , µ ′ , ν) -f s (Q ⋆ τ , µ, ν ′ ) . ( ) Proof. Note that ( 19) is a slight generalization of (Wei et al., 2021b, Lemma 32) . The proof can be found in Appendix C.6. Lemma 7 ((Cen et al., 2021b, Lemma 4)). It holds for all s ∈ S and policy pair µ, ν that max µ ′ ,ν ′ f s (Q ⋆ τ , µ ′ , ν) -f s (Q ⋆ τ , µ, ν ′ ) ≤ 4 (1 -γ) 2 τ KL s ζ ⋆ τ ∥ ζ + τ KL s ζ ∥ ζ ⋆ τ . Putting all pieces together, we arrive at max µ,ν V µ,ν (t) τ (ρ) -V μ(t) ,ν τ (ρ) ≤ 2C † ρ,τ 1 -γ 4 (1 -γ) 2 τ KL ρ ζ ⋆ τ ∥ ζ(t+1) + τ KL ρ ζ(t+1) ∥ ζ ⋆ τ ≤ 2C † ρ,τ 1 -γ max 8 (1 -γ) 2 τ , 1 η 1 2 KL ρ ζ ⋆ τ ∥ ζ(t+1) + ητ KL ρ ζ(t+1) ∥ ζ ⋆ τ ≤ 6000C † ρ,τ (1 -γ) 3 τ max 8 (1 -γ) 2 τ , 1 η 1 - (1 -γ)ητ 4 t . We omit the proof for max s∈S,µ,ν V µ,ν (t) τ (s) -V μ(t) ,ν τ (s) as it follows virtually the same argument.
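Specialized to a single-state (matrix) game, the coupled update rules (9a)-(9b) reduce to the entropy-regularized OMWU iteration of Cen et al. (2021b). The sketch below is a hypothetical single-state instance (the payoff matrix, step sizes, and function names are ours, not the paper's); for matching pennies the quantal response equilibrium is the uniform policy pair by symmetry, and the last iterate approaches it.

```python
import numpy as np

def mwu(base, grad, eta, tau):
    """pi' proportional to base^(1 - eta*tau) * exp(grad), computed in log space."""
    logits = (1 - eta * tau) * np.log(base) + grad
    w = np.exp(logits - logits.max())
    return w / w.sum()

def omwu_round(mu, nu, mu_bar, nu_bar, Q, eta, tau):
    """One round of entropy-regularized OMWU for a single-state game.

    Extrapolation step (cf. (9b)): midpoint policies from the previous gradients;
    update step (cf. (9a)): gradients re-evaluated at the new midpoints.
    """
    mu_bar_new = mwu(mu, eta * (Q @ nu_bar), eta, tau)
    nu_bar_new = mwu(nu, -eta * (Q.T @ mu_bar), eta, tau)
    mu_new = mwu(mu, eta * (Q @ nu_bar_new), eta, tau)
    nu_new = mwu(nu, -eta * (Q.T @ mu_bar_new), eta, tau)
    return mu_new, nu_new, mu_bar_new, nu_bar_new

# Matching pennies; by symmetry the quantal response equilibrium is uniform.
Q = np.array([[1.0, -1.0], [-1.0, 1.0]])
eta, tau = 0.1, 0.1                       # hypothetical step size / regularization
mu, nu = np.array([0.8, 0.2]), np.array([0.3, 0.7])
mu_bar, nu_bar = mu.copy(), nu.copy()
for _ in range(2000):
    mu, nu, mu_bar, nu_bar = omwu_round(mu, nu, mu_bar, nu_bar, Q, eta, tau)
print(mu, nu)
```

The log-space normalization in `mwu` mirrors the fact that the updates are defined only up to a global shift, which is exactly the $\overset{1}{=}$ convention used in the proofs.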

B ANALYSIS FOR THE EPISODIC SETTING

Throughout the analysis, we restrict our choice of value update step size to α t = ητ . We start with the following lemma which parallels Lemma 11 in the episodic Markov game setting: Lemma 8. With 0 < η ≤ 1/τ , it holds for all s ∈ S, h ∈ [H] and t ≥ 0 that max μ(t+1) h (s) -µ (t+1) h (s) 1 , ν(t+1) h (s) -ν (t+1) h (s) 1 ≤ 2ηH. In addition, we have max{∥ log ζ (t) h (s)∥ ∞ , ∥ log ζ(t) h (s)∥ ∞ ∥ log ζ ⋆ h,τ (s)∥ ∞ } ≤ 2H τ (23) Lemma 9. With 0 < η ≤ 1 8H , it holds for all 0 ≤ t 1 ≤ t 2 , h ∈ [H] and s ∈ S that KL s ζ ⋆ h,τ ∥ ζ (t2) h + (1 -4ηH)KL s ζ (t2) h ∥ ζ(t2) h ≤ (1 -ητ ) t2-t1 KL s ζ ⋆ h,τ ∥ ζ (t1) h + (1 -4ηH)KL s ζ (t1) h ∥ ζ(t1) h + 4η t2 l=t1 (1 -ητ ) t2-l Q (l) h (s) -Q ⋆ τ (s) ∞ . Proof. See Appendix D.1. Lemma 10. With 0 < η ≤ 1 8H , it holds for all 0 < t 1 ≤ t 2 , 2 ≤ h ≤ H and s ∈ S that Q (t2) h-1 (s, a, b) -Q ⋆ h-1,τ (s, a, b) ≤ 2(1 -ητ ) t2-t1 H + 10ητ E s ′ ∼P h-1 (•|s,a,b) t2-1 l=t1-1 (1 -ητ ) t2-1-l Q (l) h (s) -Q ⋆ h,τ (s) ∞ + τ (1 -ητ ) t2-t1 E s ′ ∼P h-1 (•|s,a,b) KL s ζ ⋆ h,τ ∥ ζ (t1-1) h + (1 -4ηH)KL s ζ (t1-1) h ∥ ζ(t1-1) h . Algorithm 2: Entropy-regularized OMWU for Episodic Two-player Zero-sum Markov Game Input: Regularization parameter τ > 0, learning rate for policy update η > 0, learning rate for value update {α t } ∞ t=1 . Initialization: Set µ (0) , μ(0) , ν (0) and ν(0) as uniform policies; set Q (0) = 0, V (0) = τ (log |A| -log |B|). for t = 0, 1, • • • do for all h ∈ [H], s ∈ S do in parallel When t ≥ 1, update policy pair ζ (t) h (s) as: µ (t) h (a|s) ∝ µ (t-1) h (a|s) 1-ητ exp(η[Q (t) h (s)ν (t) h (s)] a ) ν (t) h (b|s) ∝ ν (t-1) h (b|s) 1-ητ exp(-η[Q (t) h (s) ⊤ μ(t) h (s)] b ) . ( ) Update policy pair ζ(t+1) h (s) as: μ(t+1) h (a|s) ∝ µ (t) h (a|s) 1-ητ exp(η[Q (t) h (s)ν (t) h (s)] a ) ν(t+1) h (b|s) ∝ ν (t) h (b|s) 1-ητ exp(-η[Q (t) h (s) ⊤ μ(t) h (s)] b ) . 
( ) Update Q (t+1) h (s) and V (t+1) h (s) as        Q (t+1) h (s, a, b) = r h (s, a, b) + E s ′ ∼P h (•|s,a,b) V (t) h+1 (s ′ ) V (t+1) h (s) = (1 -α t+1 )V (t) h (s) +α t+1 μ(t+1) h (s) ⊤ Q (t+1) h (s)ν (t+1) h (s) + τ H μ(t+1) h (s) -τ H ν . Proof. See Appendix D.2. We prove Theorem 2 by induction. By definition, we have Q ⋆ H,τ -Q (0) H ∞ = Q ⋆ H,τ ∞ ≤ 1, and Q ⋆ H,τ -Q (t) H ∞ = r H -r H ∞ = 0 for t > 0. So (13a) holds trivially for h = H. When the statement holds for some h, we can invoke Lemma 10 with t 1 = T h + 1 and t 2 = t ≥ T h-1 , which yields Q (t) h-1 -Q ⋆ h-1,τ ≤ 2(1 -ητ ) t-T h -1 H + 10ητ E s ′ ∼P (•|s,a,b) t-1 l=T h (1 -ητ ) t-1-l Q (l) h (s) -Q ⋆ h,τ (s) ∞ + τ (1 -ητ ) t-T h -1 E s ′ ∼P (•|s,a,b) KL s ζ ⋆ h,τ ∥ ζ (T h ) h + (1 -4ηH)KL s ζ (T h ) h ∥ ζ(T h ) h ≤ 2(1 -ητ ) t-T h -1 H + 10ητ E s ′ ∼P (•|s,a,b) t-1 l=T h (1 -ητ ) t-T h -1 l H-h + τ (1 -ητ ) t-T h -1 E s ′ ∼P (•|s,a,b) KL s ζ ⋆ h,τ ∥ ζ (T h ) h + (1 -4ηH)KL s ζ (T h ) h ∥ ζ(T h ) h ≤ (1 -ητ ) t-T h-1 (1 -ητ ) Tstart-1 10H + 10ητ t H-h+1 , where the last step results from τ KL s ζ ⋆ h,τ ∥ ζ (T h ) h + (1 -4ηH)KL s ζ (T h ) h ∥ ζ(T h ) h ≤ τ log µ ⋆ h,τ (s) -log µ (T h ) h (s) ∞ + log ν ⋆ h,τ (s) -log ν (T h ) h (s) ∞ + log µ (T h ) h (s) -log μ(T h ) h (s) ∞ + log ν (T h ) h (s) -log ν(T h ) h (s) ∞ ≤ 8H. Therefore, with T start = ⌈ 1 ητ log H⌉ we can guarantee that Q (t) h-1 -Q ⋆ h-1,τ ≤ 10(1 -ητ ) t-T h-1 (1 -ητ ) Tstart-1 H + ητ t H-h+1 ≤ (1 -ητ ) t-T h-1 t H-h+1 . This completes the proof for (13a). Regarding (13b), we start by the following lemmas, which are simply Lemma 5 and Lemma 7 applied to the episodic setting: Lemma 5A. With 0 < η ≤ 1 8H , we have 1 2 KL s ζ ⋆ h,τ ∥ ζ(t+1) h + ητ KL s ζ(t+1) h ∥ ζ ⋆ h,τ ≤ (1 -ητ )KL s ζ ⋆ h,τ ∥ ζ (t) h + 2ηHKL s ζ (t) h ∥ ζ(t) h + 2η Q (t) h (s) -Q ⋆ h,τ (s) ∞ . Lemma 7A. 
It holds for all h ∈ [H], s ∈ S and policy pair µ, ν that max µ ′ ,ν ′ f s (Q ⋆ h,τ , µ ′ h , ν h ) -f s (Q ⋆ τ , µ h , ν ′ h ) ≤ 4H 2 τ KL s ζ ⋆ h,τ ∥ ζ h + τ KL s ζ h ∥ ζ ⋆ h,τ . Combining Lemma 9 with Lemma 5A and Lemma 7A, we conclude that for 0 ≤ t 1 ≤ t 2 -1, max µ,ν f s (Q ⋆ h,τ , µ h , ν(t2) h ) -f s (Q ⋆ τ , μ(t2) h , ν h ) ≤ 4H 2 τ KL s ζ ⋆ h,τ ∥ ζ(t2) h + τ KL s ζ(t2) h ∥ ζ ⋆ h,τ ≤ max 8H 2 τ , 1 η 1 2 KL s ζ ⋆ h,τ ∥ ζ(t2) h + ητ KL s ζ(t2) h ∥ ζ ⋆ h,τ ≤ max 8H 2 τ , 1 η (1 -ητ )KL s ζ ⋆ h,τ ∥ ζ (t2-1) h + 2ηHKL s ζ (t2-1) h ∥ ζ(t2-1) h + 2η Q (t2-1) h (s) -Q ⋆ h,τ (s) ∞ ≤ max 8H 2 τ , 1 η (1 -ητ ) t2-t1 KL s ζ ⋆ h,τ ∥ ζ (t1) h + (1 -4ηH)KL s ζ (t1) h ∥ ζ(t1) h + 6η t2 l=t1 (1 -ητ ) t2-l Q (l) h (s) -Q ⋆ h,τ (s) ∞ . It is straightforward to verify that the above inequality holds for 0 ≤ t 1 ≤ t 2 , by omitting the third step. Substitution of (13a) into the above inequality yields max µ,ν f s (Q ⋆ h,τ , µ h , ν(t) h ) -f s (Q ⋆ τ , μ(t) h , ν h ) ≤ max 8H 2 τ , 1 η (1 -ητ ) t-T h KL s ζ ⋆ h,τ ∥ ζ (T h ) h + (1 -4ηH)KL s ζ (T h ) h ∥ ζ(T h ) h + 6η t l=T h (1 -ητ ) t-l (1 -ητ ) l-T h l H-h ≤ (1 -ητ ) t-T h max 8H 2 τ , 1 η 8H τ + 6ηt H-h+1 . ( ) We prove the following results instead, where (13b) is a direct conclusion of (25) by summing the two inequalities.      max s∈S,µ V µ,ν (t) h,τ (s) -V ⋆ h,τ (s) ≤ 2(1 -ητ ) t-T h max 8H 2 τ , 1 η 8H τ + 6ηt H-h+1 max s∈S,µ V ⋆ h,τ (s) -V μ(t) ,ν h,τ (s) ≤ 2(1 -ητ ) t-T h max 8H 2 τ , 1 η 8H τ + 6ηt H-h+1 We prove by induction. Note that when h = H, we have V µ,ν H,τ (s) = f s (r H , µ H , ν H ) = f s (Q ⋆ H,τ , µ H , ν H ) and the claim holds by invoking (24). 
When the claim holds for some 2 ≤ h ≤ H, we have V µ,ν (t) h-1,τ (s) -V ⋆ h-1,τ (s) = µ h-1 (s) ⊤ Q µ,ν (t) h-1,τ (s)ν (t) h-1 (s) + τ H µ h-1 (s) -τ H ν(t) h-1 (s) -µ ⋆ h-1,τ (s) ⊤ Q ⋆ h-1,τ (s)ν ⋆ h-1,τ (s) + τ H µ ⋆ h-1,τ (s) -τ H ν ⋆ h-1,τ (s) = f s (Q ⋆ h-1,τ , µ h-1 , ν(t) h-1 ) -f s (Q ⋆ h-1,τ , µ ⋆ h-1,τ , ν ⋆ h-1,τ ) + µ h-1 (s) ⊤ Q µ,ν (t) h-1,τ (s) -Q ⋆ h-1,τ (s) ν(t) h-1 (s) ≤ f s (Q ⋆ h-1,τ , µ h-1 , ν(t) h-1 ) -f s (Q ⋆ h-1,τ , μ(t) h-1 , ν ⋆ h-1,τ ) + max s ′ ∈S V µ,ν (t) h,τ (s ′ ) -V ⋆ h,τ (s ′ ) ≤ max µ ′ h-1 ,ν ′ h-1 f s (Q ⋆ h-1,τ , µ ′ h-1 , ν(t) h-1 ) -f s (Q ⋆ h-1,τ , μ(t) h-1 , ν ′ h-1 ) + max s ′ ∈S V µ,ν (t) h,τ (s ′ ) -V ⋆ h,τ (s ′ ) ≤ (1 -ητ ) t-T h-1 max 8H 2 τ , 1 η 8H τ + 6ηt H-h+2 + 2(1 -ητ ) t-T h max 8H 2 τ , 1 η 8H τ + 6ηt H-h+1 ≤ 2(1 -ητ ) t-T h-1 max 8H 2 τ , 1 η 8H τ + 6ηt H-h+2 . Taking maximum over µ verifies the claim for h -1, thereby finishing the proof. The bound for max s∈S,µ V ⋆ h,τ (s) -V μ(t) ,ν h,τ ) can be established by following a similar argument and is therefore omitted.

C PROOF OF KEY LEMMAS FOR THE DISCOUNTED SETTING

C.1 PROOF OF LEMMA 1

Before proceeding, we introduce the following lemma, which quantifies the distance between two consecutive updates; its proof can be found in Appendix E.1.

Lemma 11. For $0 < \eta \le 1/\tau$, it holds for all $s \in \mathcal{S}$ and $t \ge 0$ that
$$\max\left\{ \big\|\bar\mu^{(t+1)}(s) - \mu^{(t+1)}(s)\big\|_1,\ \big\|\bar\nu^{(t+1)}(s) - \nu^{(t+1)}(s)\big\|_1 \right\} \le \frac{2\eta}{1-\gamma}$$
and that
$$\max\left\{ \big\|\bar\mu^{(t+1)}(s) - \bar\mu^{(t)}(s)\big\|_1,\ \big\|\bar\nu^{(t+1)}(s) - \bar\nu^{(t)}(s)\big\|_1 \right\} \le \frac{6\eta}{1-\gamma}.$$

For notational simplicity, we write $x \overset{1}{=} y$ to denote equivalence up to a global shift for two vectors $x, y$: $x = y + c \cdot 1$ for some constant $c \in \mathbb{R}$. Taking the logarithm on both sides of the update rule (9a), we get
$$\log \mu^{(t+1)}(s) - (1-\eta\tau)\log \mu^{(t)}(s) \overset{1}{=} \eta\, Q^{(t+1)}(s)\bar\nu^{(t+1)}(s), \qquad \log \nu^{(t+1)}(s) - (1-\eta\tau)\log \nu^{(t)}(s) \overset{1}{=} -\eta\, Q^{(t+1)}(s)^\top \bar\mu^{(t+1)}(s). \quad (26)$$
On the other hand, it holds for the optimal policy pair $(\mu^\star_\tau, \nu^\star_\tau)$ that
$$\eta\tau \log \mu^\star_\tau(s) \overset{1}{=} \eta\, Q^\star_\tau(s)\nu^\star_\tau(s), \qquad \eta\tau \log \nu^\star_\tau(s) \overset{1}{=} -\eta\, Q^\star_\tau(s)^\top \mu^\star_\tau(s). \quad (27)$$
Subtracting (27) from (26) and taking the inner product with $\bar\zeta^{(t+1)}(s) - \zeta^\star_\tau(s)$ gives
$$\begin{aligned}
&\big\langle \log \zeta^{(t+1)}(s) - (1-\eta\tau)\log \zeta^{(t)}(s) - \eta\tau \log \zeta^\star_\tau(s),\ \bar\zeta^{(t+1)}(s) - \zeta^\star_\tau(s) \big\rangle \\
&\quad = \eta \big\langle \bar\mu^{(t+1)}(s) - \mu^\star_\tau(s),\ Q^{(t+1)}(s)\bar\nu^{(t+1)}(s) - Q^\star_\tau(s)\nu^\star_\tau(s) \big\rangle - \eta \big\langle \bar\nu^{(t+1)}(s) - \nu^\star_\tau(s),\ Q^{(t+1)}(s)^\top \bar\mu^{(t+1)}(s) - Q^\star_\tau(s)^\top \mu^\star_\tau(s) \big\rangle \\
&\quad = \eta \big\langle \bar\mu^{(t+1)}(s) - \mu^\star_\tau(s),\ \big(Q^{(t+1)}(s) - Q^\star_\tau(s)\big)\bar\nu^{(t+1)}(s) \big\rangle - \eta \big\langle \bar\nu^{(t+1)}(s) - \nu^\star_\tau(s),\ \big(Q^{(t+1)}(s) - Q^\star_\tau(s)\big)^\top \bar\mu^{(t+1)}(s) \big\rangle \\
&\quad = -\eta \big\langle \mu^\star_\tau(s),\ \big(Q^{(t+1)}(s) - Q^\star_\tau(s)\big)\bar\nu^{(t+1)}(s) \big\rangle + \eta \big\langle \nu^\star_\tau(s),\ \big(Q^{(t+1)}(s) - Q^\star_\tau(s)\big)^\top \bar\mu^{(t+1)}(s) \big\rangle \\
&\quad \le 2\eta \big\| Q^{(t+1)}(s) - Q^\star_\tau(s) \big\|_\infty.
\end{aligned}$$
( ) We rewrite the LHS as log ζ (t+1) (s) -(1 -ητ ) log ζ (t) (s) -ητ log ζ ⋆ τ (s), ζ(t+1) (s) -ζ ⋆ τ (s) = -log ζ (t+1) (s) -(1 -ητ ) log ζ (t) (s) -ητ log ζ ⋆ τ (s), ζ ⋆ τ (s) + log ζ(t+1) (s) -(1 -ητ ) log ζ(t) (s) -ητ log ζ ⋆ τ (s), ζ(t+1) (s) + log ζ (t+1) (s) -log ζ(t+1) (s), ζ(t+1) (s) -(1 -ητ ) log ζ (t) (s) -log ζ(t) (s), ζ(t+1) (s) = KL s ζ ⋆ τ ∥ ζ (t+1) -(1 -ητ )KL s ζ ⋆ τ ∥ ζ (t) + (1 -ητ )KL s ζ(t+1) ∥ ζ(t) + ητ KL s ζ(t+1) ∥ ζ ⋆ τ + KL s ζ (t+1) ∥ ζ(t+1) -log ζ(t+1) (s) -log ζ (t+1) (s), ζ(t+1) (s) -ζ (t+1) (s) + (1 -ητ )KL s ζ(t) ∥ ζ (t) -(1 -ητ ) log ζ (t) (s) -log ζ(t) (s), ζ(t+1) (s) -ζ(t) (s) . Rearranging terms, we have KL s ζ ⋆ τ ∥ ζ (t+1) -(1 -ητ )KL s ζ ⋆ τ ∥ ζ (t) + (1 -ητ )KL s ζ(t+1) ∥ ζ(t) + ητ KL s ζ(t+1) ∥ ζ ⋆ τ + KL s ζ (t+1) ∥ ζ(t+1) + (1 -ητ )KL s ζ(t) ∥ ζ (t) -log ζ(t+1) (s) -log ζ (t+1) (s), ζ(t+1) (s) -ζ (t+1) (s) -(1 -ητ ) log ζ (t) (s) -log ζ(t) (s), ζ(t+1) (s) -ζ(t) (s) ≤ 2η Q (t+1) (s) -Q ⋆ τ (s) ∞ . It remains to bound log ζ(t+1) (s) -log ζ (t+1) (s), ζ(t+1) (s) -ζ (t+1) (s) and log ζ (t) (s) -log ζ(t) (s), ζ(t+1) (s) -ζ(t) (s) . Note that log μ(t+1) (s) -log µ (t+1) (s), μ(t+1) (s) -µ (t+1) (s) = η Q (t) (s)ν (t) (s) -Q (t+1) (s)ν (t+1) (s), μ(t+1) (s) -µ (t+1) (s) ≤ η Q (t) (s)ν (t) (s) -Q (t+1) (s)ν (t+1) (s) 1 μ(t+1) (s) -µ (t+1) (s) 1 . We bound Q (t) (s)ν (t) (s) -Q (t+1) (s)ν (t+1) (s) 1 as Q (t) (s)ν (t) (s) -Q (t+1) (s)ν (t+1) (s) 1 ≤ Q (t+1) (s) ν(t) (s) -ν(t+1) (s) 1 + Q (t) (s) -Q (t+1) (s) ν(t) (s) 1 ≤ 2 1 -γ ν(t) (s) -ν(t+1) (s) 1 + Q (t) (s) -Q (t+1) (s) ∞ . Plugging the above inequality into (29) and invoking Young's inequality yields log μ(t+1) (s) -log µ (t+1) (s), μ(t+1) (s) -µ (t+1) (s) ≤ η 1 -γ ν(t+1) (s) -ν(t) (s) 2 1 + μ(t+1) (s) -µ (t+1) (s) 2 1 + η Q (t) (s) -Q (t+1) (s) ∞ μ(t+1) (s) -µ (t+1) (s) 1 ≤ 2η 1 -γ KL s ν(t+1) ∥ ν(t) + 2η 1 -γ KL s µ (t+1) ∥ μ(t+1) + 2η 2 1 -γ Q (t) (s) -Q (t+1) (s) ∞ , where the last step results from Pinsker's inequality and Lemma 11. 
Similarly, we have log ν(t+1) (s) -log ν (t+1) (s), ν(t+1) (s) -ν (t+1) (s) ≤ 2η 1 -γ KL s μ(t+1) ∥ μ(t) + 2η 1 -γ KL s ν (t+1) ∥ ν(t+1) + 2η 2 1 -γ Q (t) (s) -Q (t+1) (s) ∞ . Combining the above two inequalities gives log ζ(t+1) (s) -log ζ (t+1) (s), ζ(t+1) (s) -ζ (t+1) (s) ≤ 2η 1 -γ KL s ζ(t+1) ∥ ζ(t) + 2η 1 -γ KL s ζ (t+1) ∥ ζ(t+1) + 4η 2 1 -γ Q (t) (s) -Q (t+1) (s) ∞ . By a similar argument, when t ≥ 1: log ζ (t) (s) -log ζ(t) (s), ζ(t+1) (s) -ζ(t) (s) = η Q (t) (s)ν (t) (s) -Q (t-1) (s)ν (t-1) (s), μ(t+1) (s) -μ(t) (s) -η Q (t) (s) ⊤ μ(t) (s) -Q (t-1) (s) ⊤ μ(t-1) (s), ν(t+1) (s) -ν(t) (s) ≤ 2η 1 -γ KL s ζ(t) ∥ ζ(t-1) + 2η 1 -γ KL s ζ(t+1) ∥ ζ(t) + η μ(t+1) (s) -μ(t) (s) 1 + ν(t+1) (s) -ν(t) (s) 1 Q (t) (s) -Q (t-1) (s) ∞ ≤ 2η 1 -γ KL s ζ(t) ∥ ζ(t-1) + 2η 1 -γ KL s ζ(t+1) ∥ ζ(t) + 12η 2 1 -γ Q (t) (s) -Q (t-1) (s) ∞ . Note that the above inequality trivially holds for t = 0, since log ζ (0) (s) = log ζ(0) (s). Putting pieces together, we conclude for that KL s ζ ⋆ τ ∥ ζ (t+1) -(1 -ητ )KL s ζ ⋆ τ ∥ ζ (t) + 1 -ητ - 4η 1 -γ KL s ζ(t+1) ∥ ζ(t) + ητ KL s ζ(t+1) ∥ ζ ⋆ τ + 1 - 2η 1 -γ KL s ζ (t+1) ∥ ζ(t+1) + (1 -ητ )KL s ζ(t) ∥ ζ (t) - 2η 1 -γ KL s ζ(t) ∥ ζ(t-1) ≤ 2η Q (t+1) (s) -Q ⋆ τ (s) ∞ + 4η 2 1 -γ Q (t) (s) -Q (t+1) (s) ∞ + 12η 2 1 -γ Q (t-1) (s) -Q (t) (s) ∞ . Averaging s over the distribution ρ completes the proof.
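The identity (26) underlying the proof above says that $\log \mu^{(t+1)} - (1-\eta\tau)\log \mu^{(t)}$ equals the gradient term up to a global shift, namely the log-normalizer absorbed when the update is renormalized. A quick numerical sanity check with placeholder values (the step size, dimension, and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, tau = 0.1, 0.05                 # hypothetical step size and regularization
mu = rng.dirichlet(np.ones(3))       # stands in for mu^(t)(s)
grad = rng.standard_normal(3)        # stands in for eta * Q^(t+1)(s) @ nu_bar^(t+1)(s)

# The update rule (9a), normalized in log space for stability.
logits = (1 - eta * tau) * np.log(mu) + grad
mu_next = np.exp(logits - logits.max())
mu_next /= mu_next.sum()

# (26): the residual below is a constant vector (the normalization shift),
# which is exactly what the "equality up to global shift" notation asserts.
residual = np.log(mu_next) - (1 - eta * tau) * np.log(mu) - grad
print(residual)
```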

C.2 PROOF OF LEMMA 2

Proof. By definition of Q, it holds for t ≥ 1 that Q (t+1) (s, a, b) -Q (t) (s, a, b) ≤ γE s ′ ∼P (•|s,a,b) V (t) (s ′ ) -V (t-1) (s ′ ) . We denote by f s (Q, µ, ν) the one-step entropy-regularized game value at state s, i.e., f s (Q, µ, ν) = µ(s) ⊤ Q(s)ν(s) + τ H(µ(s)) -τ H(ν(s)). We further simplify the notation by introducing f (t) s = f s (Q (t) , μ(t) , ν(t) ) By recursively applying the update rule V (t) (s) = (1 -α t )V (t-1) (s) + α t f (t) s , we get V (t) (s) = α 0,t V (0) + t l=1 α l,t f s (Q (l) , μ(l) , ν(l) ) = t l=0 α l,t f (l) s . Since α 0 = 1, it follows that t l=0 α l,t = α 0 = 1 So we have V (t) (s) -V (t-1) (s) = α t f (t) s -V (t-1) (s) = α t t-1 l=0 α l,t-1 f (t) s -f (l) s ≤ α t t-1 l=0 α l,t-1 t-1 j=l f (j+1) s -f (j) s ( ) The next lemma enables us to upper bound f (t+1) s -f (t) s with Q (t+1) (s) -Q (t) (s) ∞ and KL s ζ(t+1) ∥ ζ(t) (as well as their (t -1)-th iteration counter parts). The proof is postponed to Appendix E.2. Lemma 12. For any t ≥ 0, η ≤ (1 -γ)/180, we have f (t+1) s -f (t) s ≤ Q (t+1) (s) -Q (t) (s) ∞ + 3 η + 4 1 -γ KL s ζ(t+1) ∥ ζ(t) + 12η 1 -γ Q (t) (s) -Q (t-1) (s) ∞ + 2 1 -γ KL s ζ(t) ∥ ζ(t-1) . Plugging the above lemma into (31), V (t) (s) -V (t-1) (s) ≤ α t t-1 l=0 α l,t-1 t-1 j=l Q (j+1) (s) -Q (j) (s) ∞ + 3 η + 4 1 -γ KL s ζ(j+1) ∥ ζ(j) + α t t-1 l=0 α l,t-1 t-1 j=l 12η 1 -γ Q (j) (s) -Q (j-1) (s) ∞ + 2 1 -γ KL s ζ(j) ∥ ζ(j-1) ≤ α t t-1 l=0 α l,t-1 t-1 j=l 1 + 12η 1 -γ Q (j+1) (s) -Q (j) (s) ∞ + 3 η + 6 1 -γ KL s ζ(j+1) ∥ ζ(j) + α t t-1 l=0 α l,t-1 12η 1 -γ Q (l) (s) -Q (l-1) (s) ∞ + 2 1 -γ KL s ζ(l) ∥ ζ(l-1) ≤ t-1 j=0 α j+1 j l=0 α l,t-1 1 + 12η 1 -γ Q (j+1) (s) -Q (j) (s) ∞ + 3 η + 6 1 -γ KL s ζ(j+1) ∥ ζ(j) + α t t-2 l=0 α l+1,t-1 12η 1 -γ Q (l+1) (s) -Q (l) (s) ∞ + 2 1 -γ KL s ζ(l+1) ∥ ζ(l) , where the last step is due to α t ≤ α j for all j ≤ t. 
To continue, by definition of α we have α t α l+1,t-1 ≤ α l+1,t-1 (1 -α t ) = α l+1,t for 0 ≤ l < t, and that α j+1 j l=0 α l,t-1 = α j+1 j l=0 t-1 i=l+1 (1 -α i ) - t-1 i=l (1 -α i ) = α j+1 t-1 i=j+1 (1 -α i ) ≤ α j+1 t i=j+2 (1 -α i ) = α j+1,t . Plugging into the inequality above gives V (t) (s) -V (t-1) (s) ≤ t-1 j=0 α j+1,t 1 + 12η 1 -γ Q (j+1) (s) -Q (j) (s) ∞ + 3 η + 6 1 -γ KL s ζ(j+1) ∥ ζ(j) + t-2 l=0 α l+1,t 12η 1 -γ Q (l+1) (s) -Q (l) (s) ∞ + 2 1 -γ KL s ζ(l+1) ∥ ζ(l) ≤ t-1 l=0 α l+1,t 1 + 24η 1 -γ Q (l+1) (s) -Q (l) (s) ∞ + 4 η KL s ζ(l+1) ∥ ζ(l) . Plugging the above inequality into (30) leads to Q (t+1) (s, a, b) -Q (t) (s, a, b) ≤ γ E s ′ ∼P (•|s,a,b)) t-1 l=0 α l+1,t 1 + 24η 1 -γ Q (l+1) (s ′ ) -Q (l) (s ′ ) ∞ + 4 η KL s ′ ( ζ(l+1) ∥ ζ(l) ) . When η ≤ (1-γ) 2 48γ , we have γ(1 + 24η 1-γ ) ≤ 1+γ 2 and hence that Q (t+1) (s, a, b) -Q (t) (s, a, b) ≤ E s ′ ∼P (•|s,a,b)) 1 + γ 2 t-1 l=0 α l+1,t Q (l+1) (s ′ ) -Q (l) (s ′ ) ∞ + 4 η KL s ′ ( ζ(l+1) ∥ ζ(l) ) . Let x (t+1) ∈ A S and y (t+1) ∈ B S be defined as (x (t+1) (s), y (t+1) (s)) = arg max (a,b)∈A×B Q (t+1) (s, a, b) -Q (t) (s, a, b) . It follows that ∀χ ∈ Γ(ρ), we have χP x (t+1) ,y (t+1) ∈ Γ(ρ) and hence E s∼χ Q (t+1) (s) -Q (t) (s) ∞ = E s∼χ, a=x (t+1) (s), b=y (t+1) (s) Q (t+1) (s, a, b) -Q (t) (s, a, b) ≤ E s ′ ∼χP x (t+1) ,y (t+1) 1 + γ 2 t-1 l=0 α l+1,t Q (l+1) (s ′ ) -Q (l) (s ′ ) ∞ + 4 η KL s ′ ( ζ(l+1) ∥ ζ(l) ) ≤ 1 + γ 2 t-1 l=0 α l+1,t Q (l+1) (s ′ ) -Q (l) (s ′ ) Γ(ρ) + 4 η • χP x (t+1) ,y (t+1) ρ ∞ KL ρ ζ(l+1) ∥ ζ(l) ≤ 1 + γ 2 t-1 l=0 α l+1,t Q (l+1) (s ′ ) -Q (l) (s ′ ) Γ(ρ) + 4C ρ η KL ρ ζ(l+1) ∥ ζ(l) . ( ) Taking supremum over χ ∈ Γ(ρ) completes the proof. When t = 0, we have Q (0) (s) -Q (1) (s) Γ(ρ) = Q (1) (s) Γ(ρ) ≤ 2.
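The weights $\alpha_{l,t} = \alpha_l \prod_{i=l+1}^{t} (1-\alpha_i)$ used throughout this proof form a probability distribution over $l = 0, \dots, t$ whenever $\alpha_0 = 1$, which is what makes $V^{(t)}$ a weighted average of the one-step game values $f^{(l)}_s$. A quick check with a hypothetical step-size value standing in for $\eta\tau$:

```python
# alpha_0 = 1 and alpha_i = eta * tau for i >= 1 (hypothetical value 0.25).
alphas = [1.0] + [0.25] * 10

def alpha_lt(l, t):
    """alpha_{l,t} = alpha_l * prod_{i=l+1}^{t} (1 - alpha_i)."""
    out = alphas[l]
    for i in range(l + 1, t + 1):
        out *= 1 - alphas[i]
    return out

weights = [alpha_lt(l, 5) for l in range(6)]
print(weights, sum(weights))  # the weights telescope and sum to one
```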

C.3 PROOF OF LEMMA 3

Note that if suffices to show for t ≥ 0, s ∈ S, (a, b) ∈ A × B: Q (t+1) (s, a, b) -Q ⋆ τ (s, a, b) ≤ 1 + γ 2 • E s ′ ∼P (s,a,b) t l=0 α l,t Q (l) (s ′ ) -Q ⋆ τ (s ′ ) ∞ + 2η 1 -γ Q (l) (s ′ ) -Q (l-1) (s ′ ) ∞ + 2α 0,t . The remaining step follows a similar argument as (32) and is therefore omitted. For t ≥ 0, we have Q (t+1) (s, a, b) -Q ⋆ τ (s, a, b) = γE s ′ ∼P (•|s,a,b) V (t) (s ′ ) -V ⋆ τ (s ′ ) = γE s ′ ∼P (•|s,a,b) t l=0 α l,t (f (l) s ′ -f ⋆ s ′ ) . ( ) We start by decomposing f (t) s -f ⋆ s as f (t) s -f ⋆ s = f s (Q (t) , μ(t) , ν(t) ) -f s (Q ⋆ τ , µ ⋆ τ , ν ⋆ τ ) = f s (Q (t) , μ(t) , ν(t) ) -f s (Q (t) , μ(t) , ν ⋆ τ ) + f s (Q (t) , μ(t) , ν ⋆ τ ) -f s (Q ⋆ τ , µ ⋆ τ , ν ⋆ τ ) ≤ f s (Q (t) , μ(t) , ν(t) ) -f s (Q (t) , μ(t) , ν ⋆ τ ) + f s (Q ⋆ τ , μ(t) , ν ⋆ τ ) -f s (Q ⋆ τ , µ ⋆ τ , ν ⋆ τ ) + Q (t) (s) -Q ⋆ τ (s) ∞ ≤ f s (Q (t) , μ(t) , ν(t) ) -f s (Q (t) , μ(t) , ν ⋆ τ ) + Q (t) (s) -Q ⋆ τ (s) ∞ . We bound the first two terms with the following lemma: Lemma 13. It holds for all t ≥ 0, s ∈ S and ν(s) ∈ ∆(B) that f s (Q (t) , μ(t) , ν(t) ) -f s (Q (t) , μ(t) , ν) = ν(t) (s) -ν(s), Q (t) (s) ⊤ μ(t) (s) -τ H(ν (t) (s)) + τ H(ν ⋆ τ (s)) ≤ 2η 1 -γ Q (t) (s) -Q (t-1) (s) ∞ + 2 1 -γ KL s μ(t) ∥ µ (t-1) + KL s µ (t-1) ∥ μ(t-1) - 1 η 1 - 4η 1 -γ KL s ν (t) ∥ ν(t) - 1 -ητ η KL s ν(t) ∥ ν (t-1) + 1 -ητ η KL s ν ∥ ν (t-1) - 1 η KL s ν ∥ ν (t) . Proof. See Appendix E.3. Applying Lemma 13 with ν(s) = ν ⋆ τ (s) gives f (t) s -f ⋆ s ≤ Q (t) (s) -Q ⋆ τ (s) ∞ + 2η 1 -γ Q (t) (s) -Q (t-1) (s) ∞ + 1 -ητ η KL s ν ⋆ τ ∥ ν (t-1) - 1 η KL s ν ⋆ τ ∥ ν (t) - 1 η 1 - 4η 1 -γ KL s ν (t) ∥ ν(t) - 1 -ητ η KL s ν(t) ∥ ν (t-1) + 2 1 -γ KL s μ(t) ∥ µ (t-1) + KL s µ (t-1) ∥ μ(t-1) By a similar argument, we can derive f ⋆ s -f (t) s ≤ Q (t) (s) -Q ⋆ τ (s) ∞ + 2η 1 -γ Q (t) (s) -Q (t-1) (s) ∞ + 1 -ητ η KL s µ ⋆ τ ∥ µ (t-1) - 1 η KL s µ ⋆ τ ∥ µ (t) - 1 η 1 - 4η 1 -γ KL s µ (t) ∥ μ(t) - 1 -ητ η KL s μ(t) ∥ µ (t-1) + 2 1 -γ KL s ν(t) ∥ ν (t-1) + KL s ν (t-1) ∥ ν(t-1) . 
Computing ( 35) + 1-γ 4 • (36) gives (1 - 1 -γ 4 )(f (t) s -f ⋆ s ) ≤ (1 + 1 -γ 4 ) Q (t) (s) -Q ⋆ τ (s) ∞ + 2η 1 -γ Q (t) (s) -Q (t-1) (s) ∞ + 1 -ητ η KL s ν ⋆ τ ∥ ν (t-1) + 1 -γ 4 KL s µ ⋆ τ ∥ µ (t-1) - 1 η KL s ν ⋆ τ ∥ ν (t) + 1 -γ 4 KL s µ ⋆ τ ∥ µ (t) + 2 1 -γ KL s µ (t-1) ∥ μ(t-1) + 1 -γ 4 KL s ν (t-1) ∥ ν(t-1) - 1 η 1 - 4η 1 -γ 1 -γ 4 KL s µ (t) ∥ μ(t) + KL s ν (t) ∥ ν(t) + ( 2 1 -γ - 1 -ητ η • 1 -γ 4 )KL s μ(t) ∥ µ (t-1) + ( 2 1 -γ • 1 -γ 4 - 1 -ητ η )KL s ν(t) ∥ ν (t-1) . (37) With 0 < η ≤ (1 -γ) 2 /16, we have ( 2 1-γ -1-ητ η • 1-γ 4 ) ≤ 0, ( 2 1-γ • 1-γ 4 -1-ητ η ) ≤ 0, and 1 η 1 - 4η 1 -γ • 1 -γ 4 ≥ 2 1 -γ • 1 1 -ητ . To proceed, we introduce a shorthand notation G (t) (s) = 1 η KL s ν ⋆ τ ∥ ν (t) + 1 -γ 4 KL s µ ⋆ τ ∥ µ (t) + 2 (1 -γ)(1 -ητ ) KL s µ (t) ∥ μ(t) + KL s ν (t) ∥ ν(t) . We can then write (37) as (1 - 1 -γ 4 )(f (t) s -f ⋆ s ) ≤ (1 + 1 -γ 4 ) Q (t) (s) -Q ⋆ τ (s) ∞ + 2η 1 -γ Q (t) (s) -Q (t-1) (s) ∞ + (1 -ητ )G (t-1) (s) -G (t) (s). ( ) Note that when t = 0, we have f (0) s -f ⋆ s = τ log |A| -τ log |B| -µ ⋆ τ (s) ⊤ Q ⋆ τ (s)ν ⋆ τ (s) -τ H(µ ⋆ τ (s)) + τ H(ν ⋆ τ (s)) = max µ(s) min ν(s) f s (Q (0) , µ, ν) -max µ(s) min ν(s) f s (Q ⋆ τ , µ, ν) ≤ Q (0) (s) -Q ⋆ τ (s) ∞ . ( ) Substitution of ( 38) and ( 39) into (34) gives Q (t+1) (s, a, b) -Q ⋆ τ (s, a, b) = γE s ′ ∼P (•|s,a,b) t l=0 α l,t (f (l) s ′ -f ⋆ s ′ ) ≤ γE s ′ ∼P (s,a,b) α 0,t Q (0) (s ′ ) -Q ⋆ τ (s ′ ) ∞ + γ • 1 + (1 -γ)/4 1 -(1 -γ)/4 E s ′ ∼P (s,a,b) t l=1 α l,t Q (l) (s ′ ) -Q ⋆ τ (s ′ ) ∞ + 2η 1 -γ Q (l) (s ′ ) -Q (l-1) (s ′ ) ∞ + γ 1 -(1 -γ)/4 E s ′ ∼P (s,a,b) (1 -ητ ) t l=1 α l,t G (l-1) (s ′ ) - t l=1 α l,t G (l) (s ′ ) . 
Note that (1 -ητ ) t l=1 α l,t G (l-1) (s ′ ) - t l=1 α l,t G (l) (s ′ ) ≤ t-1 l=1 ((1 -ητ )α l+1,t -α l,t )G (l) (s ′ ) + α 1,t G (0) (s ′ ) ≤ α 1,t G (0) (s ′ ) ≤ 2α 0,t ητ G (0) (s ′ ) ≤ 2α 0,t , where the second step is due to (1 -ητ )α l+1,t -α l,t = ((1 -ητ )α l+1 -α l (1 -α l+1 )) t j=l+2 α j ≤ ((1 -ητ )α l+1 -α l+1 + α l α l+1 ) t j=l+2 α j = α l+1 (α l -ητ ) t j=l+2 α j ≤ 0. So we conclude that Q (t+1) (s, a, b) -Q ⋆ τ (s, a, b) ≤ γ • 1 + (1 -γ)/4 1 -(1 -γ)/4 E s ′ ∼P (s,a,b) t l=0 α l,t Q (l) (s ′ ) -Q ⋆ τ (s ′ ) ∞ + 2η 1 -γ Q (l) (s ′ ) -Q (l-1) (s ′ ) ∞ + 2α 0,t ≤ 1 + γ 2 • E s ′ ∼P (s,a,b) t l=0 α l,t Q (l) (s ′ ) -Q ⋆ τ (s ′ ) ∞ + 2η 1 -γ Q (l) (s ′ ) -Q (l-1) (s ′ ) ∞ + 2α 0,t . The other side of ( 33) can be obtained by computing 1-γ 4 • (35) + ( 36) and following a similar argument, and is therefore omitted. For t = 0, we have Q (1) (s, a, b) -Q ⋆ τ (s, a, b) ≤ γ max s ′ ∈S |f (0) s ′ -f ⋆ s ′ | ≤ 2γ 1-γ . C.4 PROOF OF LEMMA 4 For t ≥ 1, let u t = η Q ⋆ τ (s) -Q (t) (s) Γ(ρ) + 12η 2 (1 -γ) 2 Q (t) (s) -Q (t-1) (s) Γ(ρ) .

It follows that

u 1 ≤ 2γη 1 -γ + 24η 2 (1 -γ) 3 ≤ 1. When t ≥ 1, invoking Lemma 2 and Lemma 3 gives u t+1 ≤ 1 - 1 -γ 2 t l=1 α l,t η Q (l) -Q ⋆ τ Γ(ρ) + 2η 2 1 -γ + 12η 2 (1 -γ) 2 Q (l) -Q (l-1) Γ(ρ) + 48ηC ρ (1 -γ) 2 t l=1 α l,t KL ρ ζ(l) ∥ ζ(l-1) + 2α 0,t η + α 0,t η Q (0) -Q ⋆ τ Γ(ρ) ≤ 1 - 1 -γ 3 t l=1 α l,t u l + 48ηC ρ (1 -γ) 2 t l=1 α l,t KL ρ ζ(l) ∥ ζ(l-1) + 4η 1 -γ α 0,t . (41) Let β l,t = α l t i=l+1 1 - 1 -γ 3 • α i . It follows that for t ≥ 0, t+1 l=1 α l,t+1 u l = (1 -α t+1 ) t l=1 α l,t u l + α t+1 u t+1 ≤ 1 - 1 -γ 3 • α t+1 t l=1 α l,t u l + α t+1 48ηC ρ (1 -γ) 2 • t l=1 α l,t KL ρ ζ(l) ∥ ζ(l-1) + 4η 1 -γ α t+1 α 0,t ≤ t+1 l=2 1 - 1 -γ 3 • α l α 1,1 u 1 + 48ηC ρ (1 -γ) 2 t i=1 β i+1,t+1 i l=1 α l,i KL ρ ζ(l) ∥ ζ(l-1) + 4η 1 -γ t i=1 α 0,i β i+1,t+1 ≤ β 1,t+1 u 1 + 48ηC ρ (1 -γ) 2 t l=1 t i=l α l,i β i+1,t+1 KL ρ ζ(l) ∥ ζ(l-1) + 4η 1 -γ t i=1 α 0,i β i+1,t+1 ≤ 200ηC ρ (1 -γ) 2 t l=1 β l,t+1 KL ρ ζ(l) ∥ ζ(l-1) + 18η 1 -γ β 0,t+1 , where the last step is due to the following lemma. Similar lemma has appeared in prior works (see i.e., (Wei et al., 2021b, Lemma 36) ). Our version features a simpler proof, which is postponed to Appendix E.4. Lemma 14. Let two sequences {δ i }, {ξ i } be defined as δ i = 1 -c 1 α i , ξ i = 1 -c 2 α i , where the constants c 1 , c 2 satisfiy 0 < c 1 < c 2 < 1 2αi . For l ≤ t, let δ l,t = α l d i=l+1 δ i and ξ l,t = α l d i=l+1 ξ i . We have t i=l ξ l,i δ i+1,t ≤ 1 + 2 c 2 -c 1 δ l,t . Substitution of ( 42) into (41) gives u t+1 ≤ 1 - 1 -γ 3 t l=1 α l,t u l + 48η (1 -γ) 2 t l=1 α l,t KL ζ(l) ∥ ζ(l-1) + 4η 1 -γ α 0,t ≤ 200ηC ρ (1 -γ) 2 t l=1 β l,t KL ζ(l) ∥ ζ(l-1) + 18η 1 -γ β 0,t + 48ηC ρ (1 -γ) 2 t l=1 α l,t KL ζ(l) ∥ ζ(l-1) + 4η 1 -γ α 0,t ≤ 250ηC ρ (1 -γ) 2 t l=1 β l,t KL ζ(l) ∥ ζ(l-1) + 22η 1 -γ β 0,t . for t ≥ 1. It is straightforward to verify that the above inequality holds for t = 0 as well. 
So we conclude that t l=0 λ l+1,t+1 u l+1 = t i=0 λ i+1,t+1 u i+1 ≤ t i=0 λ i+1,t+1 250ηC ρ (1 -γ) 2 i l=1 β l,i KL ζ(l) ∥ ζ(l-1) + 22η 1 -γ β 0,i = 250ηC ρ (1 -γ) 2 t l=1 t i=l β l,i λ i+1,t+1 KL ζ(l) ∥ ζ(l-1) + 22η 1 -γ t i=0 β 0,i λ i+1,t+1 ≤ 6250ηC ρ (1 -γ) 3 t l=1 λ l,t+1 KL ζ(l) ∥ ζ(l-1) + 550η (1 -γ) 2 λ 0,t+1 = 6250ηC ρ (1 -γ) 3 t-1 l=0 λ l+1,t+1 KL ζ(l+1) ∥ ζ(l) + 550η (1 -γ) 2 λ 0,t+1 , where the penultimate step invokes Lemma 14.

C.5 PROOF OF LEMMA 5

Taking logarithm on the both sides of the update rule (9b), we get log μ(t+1) (s) -(1 -ητ ) log µ (t) (s) 1 = ηQ (t) (s)ν (t) (s) log ν(t+1) (s) -(1 -ητ ) log ν (t) (s) 1 = -ηQ (t) (s) ⊤ μ(t) (s) . Subtracting ( 27) from ( 43) and taking inner product with ζ(t+1 ) (s) -ζ ⋆ τ (s) gives log ζ(t+1) (s) -(1 -ητ ) log ζ (t) (s) -ητ log ζ ⋆ τ (s), ζ(t+1) (s) -ζ ⋆ τ (s) = η μ(t+1) (s) -µ ⋆ τ (s), Q (t) (s)ν (t) (s) -Q ⋆ τ (s)ν ⋆ τ (s) -η ν(t+1) (s) -ν ⋆ τ (s), Q (t) (s) ⊤ μ(t) (s) -Q ⋆ τ (s) ⊤ µ ⋆ τ (s) ≤ η μ(t+1) (s) -µ ⋆ τ (s), Q (t) (s) ν(t) (s) -ν ⋆ τ (s) -η ν(t+1) (s) -ν ⋆ τ (s), Q (t) (s) ⊤ μ(t) (s) -µ ⋆ τ (s) + 2η Q (t) (s) -Q ⋆ τ (s) ∞ ≤ η μ(t+1) (s) -µ ⋆ τ (s), Q (t) (s) ν(t) (s) -ν(t+1) (s) -η ν(t+1) (s) -ν ⋆ τ (s), Q (t) (s) ⊤ μ(t) (s) -μ(t+1) (s) + 2η Q (t) (s) -Q ⋆ τ (s) ∞ ≤ 2η 1 -γ 2KL s ζ ⋆ τ ∥ ζ(t+1) + KL s ζ(t+1) ∥ ζ (t) + KL s ζ (t) ∥ ζ(t) + 2η Q (t) (s) -Q ⋆ τ (s) ∞ . LHS can be written as log ζ(t+1) (s) -(1 -ητ ) log ζ (t) (s) -ητ log ζ ⋆ τ (s), ζ(t+1) (s) -ζ ⋆ τ (s) = -log ζ(t+1) (s) -(1 -ητ ) log ζ (t) (s) -ητ log ζ ⋆ τ (s), ζ ⋆ τ (s) + log ζ(t+1) (s) -(1 -ητ ) log ζ (t) (s) -ητ log ζ ⋆ τ (s), ζ(t+1) (s) = KL s ζ ⋆ τ ∥ ζ(t+1) -(1 -ητ )KL s ζ ⋆ τ ∥ ζ (t) + (1 -ητ )KL s ζ(t+1) ∥ ζ (t) + ητ KL s ζ(t+1) ∥ ζ ⋆ τ . So we conclude that 1 - 4η 1 -γ KL s ζ ⋆ τ ∥ ζ(t+1) -(1 -ητ )KL s ζ ⋆ τ ∥ ζ (t) + 1 -ητ - 2η 1 -γ KL s ζ(t+1) ∥ ζ (t) + ητ KL s ζ(t+1) ∥ ζ ⋆ τ ≤ 2η 1 -γ KL s ζ (t) ∥ ζ(t) + 2η Q (t) (s) -Q ⋆ τ (s) ∞ . With 0 < η ≤ 1-γ 8 , we have 1 2 KL s ζ ⋆ τ ∥ ζ(t+1) + ητ KL s ζ(t+1) ∥ ζ ⋆ τ ≤ (1 -ητ )KL s ζ ⋆ τ ∥ ζ (t) + 2η 1 -γ KL s ζ (t) ∥ ζ(t) + 2η Q (t) (s) -Q ⋆ τ (s) ∞ . C.6 PROOF OF LEMMA 6 We have V µ,ν τ (s) -V ⋆ τ (s) = µ(s) ⊤ Q µ,ν τ (s)ν(s) + τ H µ(s) -τ H ν(s) -µ ⋆ τ (s) ⊤ Q ⋆ τ (s)ν ⋆ τ (s) -τ H µ ⋆ τ (s) + τ H ν ⋆ τ (s) = µ(s) ⊤ Q µ,ν τ (s)ν(s) -µ(s) ⊤ Q ⋆ τ (s)ν(s) + f s (Q ⋆ τ , µ, ν) -f s (Q ⋆ τ , µ ⋆ τ , ν ⋆ τ ) = γ E a∼µ(•|s), b∼ν(•|s), s ′ ∼P (•|s,a,b) [V µ,ν τ (s ′ ) -V ⋆ τ (s ′ )] + f s (Q ⋆ τ , µ, ν) -f s (Q ⋆ τ , µ ⋆ τ , ν ⋆ τ ). 
Applying the inequality recursively and averaging s over ρ, we arrive at V µ,ν τ (ρ) -V ⋆ τ (ρ) = 1 1 -γ E s ′ ∼d µ,ν ρ [f s ′ (Q ⋆ τ , µ, ν) -f s ′ (Q ⋆ τ , µ ⋆ τ , ν ⋆ τ )] , which is the well-known performance difference lemma applied to the setting of Markov games. It follows that V µ † τ (ν),ν τ (ρ) -V ⋆ τ (ρ) = 1 1 -γ E s ′ ∼d µ † τ (ν),ν ρ f s ′ (Q ⋆ τ , µ † τ (ν), ν) -f s ′ (Q ⋆ τ , µ ⋆ τ , ν ⋆ τ ) ≤ 1 1 -γ E s ′ ∼d µ † τ (ν),ν ρ f s ′ (Q ⋆ τ , µ † τ (ν), ν) -f s ′ (Q ⋆ τ , µ, ν ⋆ τ ) ≤ 1 1 -γ E s ′ ∼d µ † τ (ν),ν ρ max µ ′ ,ν ′ f s ′ (Q ⋆ τ , µ ′ , ν) -f s ′ (Q ⋆ τ , µ, ν ′ ) (45) ≤ C † ρ,τ 1 -γ E s∼ρ max µ ′ ,ν ′ f s (Q ⋆ τ , µ ′ , ν) -f s (Q ⋆ τ , µ, ν ′ ) . A similar argument gives V ⋆ τ (ρ)-V µ,ν † τ (µ) τ (ρ) ≤ C † ρ,τ 1-γ E s∼ρ max µ ′ ,ν ′ f s (Q ⋆ τ , µ ′ , ν) -f s (Q ⋆ τ , µ, ν ′ ) . Summing the two inequalities proves (18). Alternatively, we continue from (45) and show that V µ † τ (ν),ν τ (s) -V ⋆ τ (s) ≤ 1 1 -γ E s ′ ∼d µ † τ (ν),ν s max µ ′ ,ν ′ f s ′ (Q ⋆ τ , µ ′ , ν) -f s ′ (Q ⋆ τ , µ, ν ′ ) ≤ ∥1/ρ∥ ∞ 1 -γ E s∼ρ max µ ′ ,ν ′ f s (Q ⋆ τ , µ ′ , ν) -f s (Q ⋆ τ , µ, ν ′ ) . Summing the inequality with the one for V ⋆ τ (s) -V µ,ν † τ (µ) τ (s) and taking maximum over s ∈ S completes the proof for (19).

D PROOF OF KEY LEMMAS FOR THE EPISODIC SETTING

D.1 PROOF OF LEMMA 9

Following the similar argument of arriving (28), we have log ζ (t+1) h (s) -(1 -ητ ) log ζ (t) h (s) -ητ log ζ ⋆ h,τ (s), ζ(t+1) h (s) -ζ ⋆ h,τ (s) ≤ 2η Q (t+1) h (s) -Q ⋆ h,τ (s) ∞ . We rewrite LHS as log ζ (t+1) h (s) -(1 -ητ ) log ζ (t) h (s) -ητ log ζ ⋆ h,τ (s), ζ(t+1) h (s) -ζ ⋆ h,τ (s) = -log ζ (t+1) h (s) -(1 -ητ ) log ζ (t) h (s) -ητ log ζ ⋆ h,τ (s), ζ ⋆ h,τ (s) + log ζ(t+1) h (s) -(1 -ητ ) log ζ (t) h (s) -ητ log ζ ⋆ h,τ (s), ζ + log ζ (t+1) h (s) -log ζ(t+1) h (s), ζ = KL s ζ ⋆ h,τ ∥ ζ (t+1) h -(1 -ητ )KL s ζ ⋆ h,τ ∥ ζ (t) h + (1 -ητ )KL s ζ(t+1) h ∥ ζ (t) h + ητ KL s ζ(t+1) h ∥ ζ ⋆ h,τ + KL s ζ (t+1) h ∥ ζ(t+1) h -log ζ(t+1) h (s) -log ζ (t+1) h (s), ζ(t+1) h (s) -ζ (t+1) h (s) . Rearranging terms gives KL s ζ ⋆ h,τ ∥ ζ (t+1) h -(1 -ητ )KL s ζ ⋆ h,τ ∥ ζ (t) h + (1 -ητ )KL s ζ(t+1) h ∥ ζ (t) h + ητ KL s ζ(t+1) h ∥ ζ ⋆ h,τ + KL s ζ (t+1) h ∥ ζ(t+1) h -log ζ(t+1) h (s) -log ζ (t+1) h (s), ζ(t+1) h (s) -ζ (t+1) h (s) ≤ 2η Q (t+1) (s) -Q ⋆ τ (s) ∞ . h (s) -Q ( We bound Q (t) h (s)ν (t) h (s) -Q (t+1) h (s)ν (t+1) h (s) 1 as Q (t) h (s)ν (t) h (s) -Q (t+1) h (s)ν (t+1) h (s) 1 ≤ Q (t+1) h (s) ν(t) h (s) - ν(t+1) h (s) 1 + Q (t) h (s) -Q (t+1) h (s) ν(t) h (s) 1 ≤ 2H ν(t) h (s) - ν(t+1) h (s) 1 + Q (t) h (s) -Q (t+1) h (s) ∞ ≤ 2H ν(t+1) h (s) -ν (t) h (s) 1 + 2H ν (t) h (s) - ν(t) h (s) 1 + Q (t) h (s) -Q (t+1) h (s) ∞ . Plugging the above inequality into (47) and invoking Young's inequality yields log μ(t+1) h (s) -log µ (t+1) h (s), μ(t+1) h (s) -µ (t+1) h (s) ≤ ηH ν(t+1) h (s) -ν (t) h (s) 2 1 + ν (t) h (s) - ν(t) h (s) 2 1 + 2 μ(t+1) h (s) -µ (t+1) h (s) 2 1 + η Q (t) h (s) -Q (t+1) h (s) ∞ μ(t+1) h (s) -µ (t+1) h (s) 1 ≤ 2ηHKL s ν(t+1) h ∥ ν (t) h + 2ηHKL s ν (t) h ∥ ν(t) h + 4ηHKL s µ (t+1) h ∥ μ(t+1) h + 2η 2 H Q (t) h (s) -Q (t+1) h (s) ∞ , where the last step results from Pinsker's inequality and Lemma 8. 
Similarly, we have log ν(t+1) h (s) -log ν (t+1) h (s), ν(t+1) h (s) -ν (t+1) h (s) ≤ 2ηHKL s μ(t+1) h ∥ µ (t) h + 2ηHKL s µ (t) h ∥ μ(t) h + 4ηHKL s ν (t+1) h ∥ ν(t+1) h + 2η 2 H Q (t) h (s) -Q (t+1) h (s) ∞ . Summing the above two inequalities gives log ζ(t+1) h (s) -log ζ (t+1) h (s), ζ(t+1) h (s) -ζ (t+1) h (s) ≤ 2ηHKL s ζ(t+1) h ∥ ζ (t) h + 2ηHKL s ζ (t) h ∥ ζ(t) h + 4ηHKL s ζ (t+1) h ∥ ζ(t+1) h + 4η 2 H Q (t) h (s) -Q (t+1) h (s) ∞ ≤ 2ηHKL s ζ(t+1) h ∥ ζ (t) h + 2ηHKL s ζ (t) h ∥ ζ(t) h + 4ηHKL s ζ (t+1) h ∥ ζ(t+1) h + η 2 Q (t) h (s) -Q ⋆ h,τ (s) ∞ + Q (t+1) h (s) -Q ⋆ h,τ (s) ∞ , where the second step invokes triangular inequality and the fact that η ≤ 1 8H . Plugging the above inequality into (46) gives KL s ζ ⋆ h,τ ∥ ζ (t+1) h -(1 -ητ )KL s ζ ⋆ h,τ ∥ ζ (t) h + (1 -η(τ + 2H))KL s ζ(t+1) h ∥ ζ (t) h + ητ KL s ζ(t+1) h ∥ ζ ⋆ h,τ + (1 -4ηH)KL s ζ (t+1) h ∥ ζ(t+1) h -2ηHKL s ζ (t) h ∥ ζ(t) h ≤ 5η 2 Q (t+1) h (s) -Q ⋆ τ (s) ∞ + η 2 Q (t) h (s) -Q ⋆ τ (s) ∞ . With η ≤ 1 8H , we have (1 -ητ )(1 -4ηH) ≥ 2ηH and 1 -η(τ + 2H) ≥ 0. It follows that KL s ζ ⋆ h,τ ∥ ζ (t+1) h + (1 -4ηH)KL s ζ (t+1) h ∥ ζ(t+1) h + ητ KL s ζ(t+1) h ∥ ζ ⋆ h,τ ≤ (1 -ητ )KL s ζ ⋆ h,τ ∥ ζ (t) h + 2ηHKL s ζ (t) h ∥ ζ(t) h + 5η 2 Q (t+1) h (s) -Q ⋆ τ (s) ∞ + η 2 Q (t) h (s) -Q ⋆ τ (s) ∞ ≤ (1 -ητ ) KL s ζ ⋆ h,τ ∥ ζ (t) h + (1 -4ηH)KL s ζ (t) h ∥ ζ(t) h + 5η 2 Q (t+1) h (s) -Q ⋆ τ (s) ∞ + η 2 Q (t) h (s) -Q ⋆ τ (s) ∞ . Therefore, it holds for 0 ≤ t 1 < t 2 that KL s ζ ⋆ h,τ ∥ ζ (t2) h + (1 -4ηH)KL s ζ (t2) h ∥ ζ(t2) h + ητ KL s ζ(t2) h ∥ ζ ⋆ h,τ ≤ (1 -ητ ) t2-t1 KL s ζ ⋆ h,τ ∥ ζ (t1) h + (1 -4ηH)KL s ζ (t1) h ∥ ζt1 h + t2 t ′ =t1+1 (1 -ητ ) t2-l 5η 2 Q (l) h (s) -Q ⋆ τ (s) ∞ + η 2 Q (l-1) h (s) -Q ⋆ τ (s) ∞ ≤ (1 -ητ ) t2-t1 KL s ζ ⋆ h,τ ∥ ζ (t1) h + (1 -4ηH)KL s ζ (t1) h ∥ ζ(t1) h + 4η t2 l=t1 (1 -ητ ) t2-l Q (l) h (s) -Q ⋆ τ (s) ∞ .
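Several steps above (and the analogous steps in Appendix C) invoke Pinsker's inequality, $\|p - q\|_1^2 \le 2\,\mathrm{KL}(p \,\|\, q)$, to convert $\ell_1$ distances between policies into KL divergences. A quick numerical check on random distribution pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
gaps = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    kl = np.sum(p * np.log(p / q))          # KL(p || q)
    l1 = np.abs(p - q).sum()                # ||p - q||_1
    gaps.append(2 * kl - l1 ** 2)           # Pinsker: nonnegative
print("min slack over 1000 pairs:", min(gaps))
```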

D.2 PROOF OF LEMMA 10

For \(t_2 > 0\), we have
\[
\begin{aligned}
Q_{h-1}^{(t_2)}(s,a,b) - Q_{h-1,\tau}^\star(s,a,b) &= \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a,b)}\Big[ V_h^{(t_2-1)}(s') - V_{h,\tau}^\star(s') \Big] \\
&= \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a,b)}\Big[ (1-\eta\tau)^{t_2-t_1} \big( V_h^{(t_1-1)}(s') - V_{h,\tau}^\star(s') \big) + \eta\tau \sum_{l=t_1}^{t_2-1} (1-\eta\tau)^{t_2-1-l} \big( f_{s'}(Q_h^{(l)}, \bar\mu_h^{(l)}, \bar\nu_h^{(l)}) - f_{s'}(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star) \big) \Big] \\
&\le (1-\eta\tau)^{t_2-t_1} \cdot 2H + \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a,b)}\Big[ \eta\tau \sum_{l=t_1}^{t_2-1} (1-\eta\tau)^{t_2-1-l} \big( f_{s'}(Q_h^{(l)}, \bar\mu_h^{(l)}, \bar\nu_h^{(l)}) - f_{s'}(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star) \big) \Big] . \qquad (48)
\end{aligned}
\]
We start by decomposing \(f_s^{(t)} - f_s^\star\) as
\[
\begin{aligned}
f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \bar\nu_h^{(t)}) - f_s(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star)
&= \big( f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \bar\nu_h^{(t)}) - f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \nu_{h,\tau}^\star) \big) + \big( f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \nu_{h,\tau}^\star) - f_s(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star) \big) \\
&\le f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \bar\nu_h^{(t)}) - f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \nu_{h,\tau}^\star) + \big( f_s(Q_{h,\tau}^\star, \bar\mu_h^{(t)}, \nu_{h,\tau}^\star) - f_s(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star) \big) + \big\| Q_h^{(t)}(s) - Q_{h,\tau}^\star(s) \big\|_\infty \\
&\le f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \bar\nu_h^{(t)}) - f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \nu_{h,\tau}^\star) + \big\| Q_h^{(t)}(s) - Q_{h,\tau}^\star(s) \big\|_\infty .
\end{aligned}
\]
Note that Lemma 13 can be applied to the episodic setting by simply replacing \(1/(1-\gamma)\) with \(H\), which yields
\[
\begin{aligned}
f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \bar\nu_h^{(t)}) - f_s(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star)
&\le \big\| Q_h^{(t)}(s) - Q_{h,\tau}^\star(s) \big\|_\infty + 2\eta H \big\| Q_h^{(t)}(s) - Q_h^{(t-1)}(s) \big\|_\infty + \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \nu_{h,\tau}^\star \,\|\, \nu_h^{(t-1)} \big) - \frac{1}{\eta}\,\mathrm{KL}_s\big( \nu_{h,\tau}^\star \,\|\, \nu_h^{(t)} \big) \\
&\quad - \frac{1}{\eta}(1-4\eta H)\,\mathrm{KL}_s\big( \nu_h^{(t)} \,\|\, \bar\nu_h^{(t)} \big) - \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \bar\nu_h^{(t)} \,\|\, \nu_h^{(t-1)} \big) + 2H \Big( \mathrm{KL}_s\big( \bar\mu_h^{(t)} \,\|\, \mu_h^{(t-1)} \big) + \mathrm{KL}_s\big( \mu_h^{(t-1)} \,\|\, \bar\mu_h^{(t-1)} \big) \Big) . \qquad (49)
\end{aligned}
\]
By a similar argument,
\[
\begin{aligned}
f_s(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star) - f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \bar\nu_h^{(t)})
&\le \big\| Q_h^{(t)}(s) - Q_{h,\tau}^\star(s) \big\|_\infty + 2\eta H \big\| Q_h^{(t)}(s) - Q_h^{(t-1)}(s) \big\|_\infty + \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \mu_{h,\tau}^\star \,\|\, \mu_h^{(t-1)} \big) - \frac{1}{\eta}\,\mathrm{KL}_s\big( \mu_{h,\tau}^\star \,\|\, \mu_h^{(t)} \big) \\
&\quad - \frac{1}{\eta}(1-4\eta H)\,\mathrm{KL}_s\big( \mu_h^{(t)} \,\|\, \bar\mu_h^{(t)} \big) - \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \bar\mu_h^{(t)} \,\|\, \mu_h^{(t-1)} \big) + 2H \Big( \mathrm{KL}_s\big( \bar\nu_h^{(t)} \,\|\, \nu_h^{(t-1)} \big) + \mathrm{KL}_s\big( \nu_h^{(t-1)} \,\|\, \bar\nu_h^{(t-1)} \big) \Big) . \qquad (50)
\end{aligned}
\]
Computing \((49) + \frac{2}{3}\cdot(50)\) gives
\[
\begin{aligned}
&\frac13 \Big( f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \bar\nu_h^{(t)}) - f_s(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star) \Big) \\
&\quad \le \frac53 \Big( \big\| Q_h^{(t)}(s) - Q_{h,\tau}^\star(s) \big\|_\infty + 2\eta H \big\| Q_h^{(t)}(s) - Q_h^{(t-1)}(s) \big\|_\infty \Big) + \frac{1-\eta\tau}{\eta} \Big( \mathrm{KL}_s\big( \nu_{h,\tau}^\star \,\|\, \nu_h^{(t-1)} \big) + \frac23 \mathrm{KL}_s\big( \mu_{h,\tau}^\star \,\|\, \mu_h^{(t-1)} \big) \Big) - \frac{1}{\eta} \Big( \mathrm{KL}_s\big( \nu_{h,\tau}^\star \,\|\, \nu_h^{(t)} \big) + \frac23 \mathrm{KL}_s\big( \mu_{h,\tau}^\star \,\|\, \mu_h^{(t)} \big) \Big) \\
&\qquad + 2H \Big( \mathrm{KL}_s\big( \mu_h^{(t-1)} \,\|\, \bar\mu_h^{(t-1)} \big) + \frac23 \mathrm{KL}_s\big( \nu_h^{(t-1)} \,\|\, \bar\nu_h^{(t-1)} \big) \Big) - \frac{1}{\eta}(1-4\eta H) \Big( \frac23 \mathrm{KL}_s\big( \mu_h^{(t)} \,\|\, \bar\mu_h^{(t)} \big) + \mathrm{KL}_s\big( \nu_h^{(t)} \,\|\, \bar\nu_h^{(t)} \big) \Big) \\
&\qquad + \Big( 2H - \frac{1-\eta\tau}{\eta}\cdot\frac23 \Big) \mathrm{KL}_s\big( \bar\mu_h^{(t)} \,\|\, \mu_h^{(t-1)} \big) + \Big( 2H\cdot\frac23 - \frac{1-\eta\tau}{\eta} \Big) \mathrm{KL}_s\big( \bar\nu_h^{(t)} \,\|\, \nu_h^{(t-1)} \big) . \qquad (51)
\end{aligned}
\]
With \(\eta \le \frac{1}{8H}\), we have \(2H - \frac{1-\eta\tau}{\eta}\cdot\frac23 \le 0\), \(2H\cdot\frac23 - \frac{1-\eta\tau}{\eta} \le 0\), and \(\frac{1}{\eta}(1-\eta\tau)(1-4\eta H)\cdot\frac23 \ge 2H\). Let
\[
G_h^{(t)}(s) = \mathrm{KL}_s\big( \nu_{h,\tau}^\star \,\|\, \nu_h^{(t)} \big) + \frac23 \mathrm{KL}_s\big( \mu_{h,\tau}^\star \,\|\, \mu_h^{(t)} \big) + \frac23 (1-4\eta H) \Big( \mathrm{KL}_s\big( \mu_h^{(t)} \,\|\, \bar\mu_h^{(t)} \big) + \mathrm{KL}_s\big( \nu_h^{(t)} \,\|\, \bar\nu_h^{(t)} \big) \Big) .
\]
We can then simplify (51) as
\[
f_s(Q_h^{(t)}, \bar\mu_h^{(t)}, \bar\nu_h^{(t)}) - f_s(Q_{h,\tau}^\star, \mu_{h,\tau}^\star, \nu_{h,\tau}^\star) \le 5 \Big( \big\| Q_h^{(t)}(s) - Q_{h,\tau}^\star(s) \big\|_\infty + 2\eta H \big\| Q_h^{(t)}(s) - Q_h^{(t-1)}(s) \big\|_\infty \Big) + \frac{1-\eta\tau}{\eta} G_h^{(t-1)}(s) - \frac{1}{\eta} G_h^{(t)}(s) .
\]
Plugging the above inequality into (48) gives
\[
\begin{aligned}
Q_{h-1}^{(t_2)}(s,a,b) - Q_{h-1,\tau}^\star(s,a,b)
&\le (1-\eta\tau)^{t_2-t_1} \cdot 2H + \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a,b)}\Big[ 5\eta\tau \sum_{l=t_1}^{t_2-1} (1-\eta\tau)^{t_2-1-l} \Big( \big\| Q_h^{(l)}(s') - Q_{h,\tau}^\star(s') \big\|_\infty + 2\eta H \big\| Q_h^{(l)}(s') - Q_h^{(l-1)}(s') \big\|_\infty \Big) \Big] \\
&\qquad + \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a,b)}\Big[ \tau (1-\eta\tau)^{t_2-t_1} G_h^{(t_1-1)}(s') \Big] \\
&\le (1-\eta\tau)^{t_2-t_1} \cdot 2H + 10\eta\tau\, \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a,b)}\Big[ \sum_{l=t_1-1}^{t_2-1} (1-\eta\tau)^{t_2-1-l} \big\| Q_h^{(l)}(s') - Q_{h,\tau}^\star(s') \big\|_\infty \Big] \\
&\qquad + \tau (1-\eta\tau)^{t_2-t_1}\, \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a,b)}\Big[ \mathrm{KL}_{s'}\big( \zeta_{h,\tau}^\star \,\|\, \zeta_h^{(t_1-1)} \big) + (1-4\eta H)\,\mathrm{KL}_{s'}\big( \zeta_h^{(t_1-1)} \,\|\, \bar\zeta_h^{(t_1-1)} \big) \Big] .
\end{aligned}
\]

E PROOF OF AUXILIARY LEMMAS

E.1 PROOF OF LEMMA 11

We first single out a set of bounds for \(V^{(t)}\) and \(Q^{(t)}\), which can be obtained by a simple induction: for all \((s,a,b) \in S \times A \times B\),
\[
-\frac{\tau \log |B|}{1-\gamma} \le V^{(t)}(s) \le \frac{1+\tau \log |A|}{1-\gamma}, \qquad -\frac{\gamma\tau \log |B|}{1-\gamma} \le Q^{(t)}(s,a,b) \le \frac{1+\gamma\tau \log |A|}{1-\gamma} . \qquad (52)
\]
We invoke the following lemma to bound several key quantities that will be helpful in the analysis.

Lemma 15 ((Mei et al., 2020, Lemma 24)). Let \(\pi, \pi' \in \Delta(A)\) be such that \(\pi(a) \propto \exp(\theta(a))\) and \(\pi'(a) \propto \exp(\theta'(a))\) for some \(\theta, \theta' \in \mathbb{R}^{|A|}\).
It holds that \(\|\pi - \pi'\|_1 \le \|\theta - \theta'\|_\infty\).

With this lemma in mind, for any \(t \ge 0\), it follows that
\[
\big\| \bar\mu^{(t+1)}(s) - \mu^{(t+1)}(s) \big\|_1 \le \min_{c \in \mathbb{R}} \big\| \log \bar\mu^{(t+1)}(s) - \log \mu^{(t+1)}(s) - c \cdot 1 \big\|_\infty \le \eta \big\| Q^{(t)}(s)\bar\nu^{(t)}(s) - Q^{(t+1)}(s)\bar\nu^{(t+1)}(s) \big\|_\infty \le \eta \cdot \frac{1+\gamma\tau(\log|A| + \log|B|)}{1-\gamma} \le \frac{2\eta}{1-\gamma},
\]
and a similar argument reveals that \(\big\| \bar\nu^{(t+1)}(s) - \nu^{(t+1)}(s) \big\|_1 \le \frac{2\eta}{1-\gamma}\). Next, we make note of the fact that when \(t \ge 1\),
\[
\begin{aligned}
\bar\mu^{(t+1)}(a|s) &\propto \mu^{(t)}(a|s)^{1-\eta\tau} \exp\big( \eta Q^{(t)}(s)\bar\nu^{(t)}(s) \big) \\
&\propto \bar\mu^{(t)}(a|s)^{1-\eta\tau} \exp\Big( \eta \Big( Q^{(t)}(s)\bar\nu^{(t)}(s) + (1-\eta\tau)\big( Q^{(t)}(s)\bar\nu^{(t)}(s) - Q^{(t-1)}(s)\bar\nu^{(t-1)}(s) \big) \Big) \Big) \\
&\propto \bar\mu^{(t)}(a|s) \exp\big( \eta w^{(t)}(a) \big),
\end{aligned}
\]
where \(w^{(t)} = Q^{(t)}(s)\bar\nu^{(t)}(s) + (1-\eta\tau)\big( Q^{(t)}(s)\bar\nu^{(t)}(s) - Q^{(t-1)}(s)\bar\nu^{(t-1)}(s) \big) - \tau \log \bar\mu^{(t)}(s)\) satisfies
\[
\big\| w^{(t)} \big\|_\infty \le \big\| Q^{(t)}(s)\bar\nu^{(t)}(s) \big\|_\infty + \tau \big\| \log \bar\mu^{(t)}(s) \big\|_\infty + (1-\eta\tau) \big\| Q^{(t)}(s)\bar\nu^{(t)}(s) - Q^{(t-1)}(s)\bar\nu^{(t-1)}(s) \big\|_\infty \le \frac{2}{1-\gamma} + \frac{2}{1-\gamma} + \frac{2(1-\eta\tau)}{1-\gamma} \le \frac{6}{1-\gamma},
\]
where the second step is due to the following bound: for all \(t \ge 0\) and \(s \in S\),
\[
\max\Big\{ \big\| \log \zeta^{(t)}(s) \big\|_\infty,\; \big\| \log \bar\zeta^{(t)}(s) \big\|_\infty \Big\} \le \frac{2}{(1-\gamma)\tau} . \qquad (54)
\]
Recall that when \(t = 0\), we have \(\bar\mu^{(1)} = \bar\mu^{(0)}\). So we have, for all \(s \in S\) and \(t \ge 0\),
\[
\big\| \bar\mu^{(t+1)}(s) - \bar\mu^{(t)}(s) \big\|_1 \le \frac{6\eta}{1-\gamma} .
\]
It remains to prove the claim (54).

Proof. It is worth noting that \(\mu^{(t)}(s)\) can always be written as \(\mu^{(t)}(a|s) \propto \exp\big( w^{(t)}(a)/\tau \big)\) for some \(w^{(t)} \in \mathbb{R}^{|A|}\) satisfying
\[
-\frac{\gamma\tau \log|B|}{1-\gamma} \le w^{(t)}(a) \le \frac{1+\gamma\tau \log|A|}{1-\gamma}, \qquad \forall a \in A .
\]
To see this, note that the claim trivially holds for \(t = 0\) with \(w^{(0)} = 0\). When the statement holds for some \(t \ge 0\), we have
\[
\mu^{(t+1)}(a|s) \propto \mu^{(t)}(a|s)^{1-\eta\tau} \exp\big( \eta Q^{(t+1)}(s)\bar\nu^{(t+1)}(s) \big) \propto \exp\Big( \big( (1-\eta\tau) w^{(t)} + \eta\tau\, Q^{(t+1)}(s)\bar\nu^{(t+1)}(s) \big)/\tau \Big) \propto \exp\big( w^{(t+1)}/\tau \big),
\]
with \(w^{(t+1)} = (1-\eta\tau) w^{(t)} + \eta\tau\, Q^{(t+1)}(s)\bar\nu^{(t+1)}(s)\). We conclude that the claim holds for \(t+1\) by recalling (52). It then follows straightforwardly that
\[
\frac{\mu^{(t)}(a_1)}{\mu^{(t)}(a_2)} = \exp\Big( \frac{w^{(t)}(a_1) - w^{(t)}(a_2)}{\tau} \Big) \le \exp\Big( \frac{1+\gamma\tau(\log|A| + \log|B|)}{(1-\gamma)\tau} \Big)
\]
for any \(a_1, a_2 \in A\).
This allows us to show that
\[
\min_{a \in A} \mu^{(t)}(a) \ge \frac{\sum_{a \in A} \mu^{(t)}(a)}{|A| \exp\Big( \frac{1+\gamma\tau(\log|A|+\log|B|)}{(1-\gamma)\tau} \Big)} = \frac{1}{|A| \exp\Big( \frac{1+\gamma\tau(\log|A|+\log|B|)}{(1-\gamma)\tau} \Big)},
\]
which gives
\[
\big\| \log \mu^{(t)} \big\|_\infty \le \frac{1+\gamma\tau(\log|A|+\log|B|)}{(1-\gamma)\tau} + \log|A| \le \frac{1}{(1-\gamma)\tau} + \frac{\log|A| + \gamma \log|B|}{1-\gamma} \le \frac{2}{(1-\gamma)\tau} .
\]
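Both Lemma 15 and the softmax lower bound above lend themselves to direct numerical verification. The sketch below is our own illustration (the helper `softmax` is hypothetical, and the check ignores the \(\tau\)-rescaling): it tests, on random instances, that softmax is 1-Lipschitz from \((\mathbb{R}^n, \|\cdot\|_\infty)\) to \((\Delta, \|\cdot\|_1)\), and that logits spanning a range of at most \(M\) force every probability above \(e^{-M}/n\):

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(1)
for _ in range(1000):
    n = random.randint(2, 6)
    theta = [random.uniform(-3.0, 3.0) for _ in range(n)]
    delta = [random.uniform(-0.5, 0.5) for _ in range(n)]
    pi = softmax(theta)
    pi_new = softmax([t + d for t, d in zip(theta, delta)])
    # Lemma 15: the l1 movement of softmax is at most the l-infinity logit movement.
    l1 = sum(abs(a - b) for a, b in zip(pi, pi_new))
    assert l1 <= max(abs(d) for d in delta) + 1e-12
    # Softmax floor: min probability >= exp(-M)/n, giving the ||log pi||_inf bound.
    M = max(theta) - min(theta)
    assert min(pi) >= math.exp(-M) / n - 1e-12
```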

E.2 PROOF OF LEMMA 12

We decompose the term \(f_s(Q^{(t+1)}, \bar\mu^{(t+1)}, \bar\nu^{(t+1)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)})\) as follows:
\[
\begin{aligned}
&f_s(Q^{(t+1)}, \bar\mu^{(t+1)}, \bar\nu^{(t+1)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)}) \\
&\quad = \big( f_s(Q^{(t+1)}, \bar\mu^{(t+1)}, \bar\nu^{(t+1)}) - f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t+1)}) \big) + \big( f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t+1)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)}) \big) \\
&\quad = \bar\mu^{(t+1)}(s)^\top \big( Q^{(t+1)}(s) - Q^{(t)}(s) \big) \bar\nu^{(t+1)}(s) + \big( f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)}) \big) + \big( f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t+1)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)}) \big) \\
&\qquad + \big( f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t+1)}) + f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)}) - f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t+1)}) \big) .
\end{aligned}
\]
Note that
\[
\bar\mu^{(t+1)}(s)^\top \big( Q^{(t+1)}(s) - Q^{(t)}(s) \big) \bar\nu^{(t+1)}(s) \le \big\| Q^{(t+1)}(s) - Q^{(t)}(s) \big\|_\infty .
\]
For the terms in the bracket, we have
\[
f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t+1)}) + f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)}) - f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t+1)}) = \big( \bar\mu^{(t+1)}(s) - \bar\mu^{(t)}(s) \big)^\top Q^{(t)}(s) \big( \bar\nu^{(t+1)}(s) - \bar\nu^{(t)}(s) \big) \le \frac{2}{1-\gamma}\,\mathrm{KL}_s\big( \bar\zeta^{(t+1)} \,\|\, \bar\zeta^{(t)} \big) .
\]
It remains to bound the two difference terms \(f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)})\) and \(f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t+1)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)})\). To proceed, we show that
\[
\begin{aligned}
f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)}) - f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t)})
&= \big\langle \bar\mu^{(t)}(s) - \bar\mu^{(t+1)}(s),\; Q^{(t)}(s)\bar\nu^{(t)}(s) \big\rangle + \tau H(\bar\mu^{(t)}(s)) - \tau H(\bar\mu^{(t+1)}(s)) \\
&= \big\langle \bar\mu^{(t)}(s) - \bar\mu^{(t+1)}(s),\; Q^{(t)}(s)\bar\nu^{(t)}(s) + (1-\eta\tau)\big( Q^{(t)}(s)\bar\nu^{(t)}(s) - Q^{(t-1)}(s)\bar\nu^{(t-1)}(s) \big) \big\rangle + \tau H(\bar\mu^{(t)}(s)) - \tau H(\bar\mu^{(t+1)}(s)) \\
&\qquad - (1-\eta\tau) \big\langle \bar\mu^{(t)}(s) - \bar\mu^{(t+1)}(s),\; Q^{(t)}(s)\bar\nu^{(t)}(s) - Q^{(t-1)}(s)\bar\nu^{(t-1)}(s) \big\rangle \\
&= -\frac{1}{\eta}\,\mathrm{KL}_s\big( \bar\mu^{(t)} \,\|\, \bar\mu^{(t+1)} \big) - \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \bar\mu^{(t+1)} \,\|\, \bar\mu^{(t)} \big) - (1-\eta\tau) \big\langle \bar\mu^{(t)}(s) - \bar\mu^{(t+1)}(s),\; Q^{(t)}(s)\bar\nu^{(t)}(s) - Q^{(t-1)}(s)\bar\nu^{(t-1)}(s) \big\rangle . \qquad (55)
\end{aligned}
\]
Here, the third step results from Lemma 17 along with (53). Recall from the previous discussion (cf. (53)) that \(\bar\mu^{(t+1)}(a|s) \propto \bar\mu^{(t)}(a|s) \exp(\eta w^{(t)}(a))\) for some \(w^{(t)} \in \mathbb{R}^{|A|}\) satisfying \(\|w^{(t)}\|_\infty \le \frac{6}{1-\gamma}\). We can ensure that \(\|\eta w^{(t)}\|_\infty \le 1/30\) with \(\eta^{-1} \ge \frac{180}{1-\gamma}\), and the next lemma guarantees \(\mathrm{KL}_s\big( \bar\mu^{(t)} \,\|\, \bar\mu^{(t+1)} \big) \le 2\,\mathrm{KL}_s\big( \bar\mu^{(t+1)} \,\|\, \bar\mu^{(t)} \big)\) in this case.

Lemma 16.
Let \(w \in \mathbb{R}^{|A|}\) and \(\pi, \pi' \in \Delta(A)\) satisfy, for each \(a \in A\), \(\pi'(a) \propto \pi(a)\exp(w(a))\) with \(\|w\|_\infty \le \frac{1}{30}\). It holds that \(\mathrm{KL}\big( \pi \,\|\, \pi' \big) \le 2\,\mathrm{KL}\big( \pi' \,\|\, \pi \big)\).

Therefore, we can continue (55) by showing that
\[
\begin{aligned}
f_s(Q^{(t)}, \bar\mu^{(t+1)}, \bar\nu^{(t)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)})
&\le \frac{1}{\eta}\,\mathrm{KL}_s\big( \bar\mu^{(t)} \,\|\, \bar\mu^{(t+1)} \big) + \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \bar\mu^{(t+1)} \,\|\, \bar\mu^{(t)} \big) + \big\| \bar\mu^{(t+1)}(s) - \bar\mu^{(t)}(s) \big\|_1 \big\| Q^{(t)}(s)\bar\nu^{(t)}(s) - Q^{(t-1)}(s)\bar\nu^{(t-1)}(s) \big\|_\infty \\
&\le \frac{3}{\eta}\,\mathrm{KL}_s\big( \bar\mu^{(t+1)} \,\|\, \bar\mu^{(t)} \big) + \big\| \bar\mu^{(t+1)}(s) - \bar\mu^{(t)}(s) \big\|_1 \Big( \big\| Q^{(t)}(s) - Q^{(t-1)}(s) \big\|_\infty + \big\| Q^{(t)}(s) \big\|_\infty \big\| \bar\nu^{(t)}(s) - \bar\nu^{(t-1)}(s) \big\|_1 \Big) \\
&\le \Big( \frac{3}{\eta} + \frac{2}{1-\gamma} \Big) \mathrm{KL}_s\big( \bar\mu^{(t+1)} \,\|\, \bar\mu^{(t)} \big) + \frac{2}{1-\gamma}\,\mathrm{KL}_s\big( \bar\nu^{(t)} \,\|\, \bar\nu^{(t-1)} \big) + \frac{6\eta}{1-\gamma} \big\| Q^{(t)}(s) - Q^{(t-1)}(s) \big\|_\infty .
\end{aligned}
\]
One can bound \(f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t+1)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)})\) with a similar argument. Putting all pieces together, we arrive at
\[
f_s(Q^{(t+1)}, \bar\mu^{(t+1)}, \bar\nu^{(t+1)}) - f_s(Q^{(t)}, \bar\mu^{(t)}, \bar\nu^{(t)}) \le \big\| Q^{(t+1)}(s) - Q^{(t)}(s) \big\|_\infty + \Big( \frac{3}{\eta} + \frac{4}{1-\gamma} \Big) \mathrm{KL}_s\big( \bar\zeta^{(t+1)} \,\|\, \bar\zeta^{(t)} \big) + \frac{2}{1-\gamma}\,\mathrm{KL}_s\big( \bar\zeta^{(t)} \,\|\, \bar\zeta^{(t-1)} \big) + \frac{12\eta}{1-\gamma} \big\| Q^{(t)}(s) - Q^{(t-1)}(s) \big\|_\infty .
\]

E.3 PROOF OF LEMMA 13

We have
\[
\begin{aligned}
&\big\langle \bar\nu^{(t)}(s) - \nu_\tau^\star(s),\; Q^{(t)}(s)^\top \bar\mu^{(t)}(s) \big\rangle - \tau H(\bar\nu^{(t)}(s)) + \tau H(\nu_\tau^\star(s)) \\
&\quad = \big\langle \bar\nu^{(t)}(s) - \nu^{(t)}(s),\; Q^{(t)}(s)^\top \bar\mu^{(t)}(s) - Q^{(t-1)}(s)^\top \bar\mu^{(t-1)}(s) \big\rangle + \Big( \big\langle \bar\nu^{(t)}(s) - \nu^{(t)}(s),\; Q^{(t-1)}(s)^\top \bar\mu^{(t-1)}(s) \big\rangle - \tau H(\bar\nu^{(t)}(s)) + \tau H(\nu^{(t)}(s)) \Big) \\
&\qquad + \Big( \big\langle \nu^{(t)}(s) - \nu_\tau^\star(s),\; Q^{(t)}(s)^\top \bar\mu^{(t)}(s) \big\rangle - \tau H(\nu^{(t)}(s)) + \tau H(\nu_\tau^\star(s)) \Big) \\
&\quad = \big\langle \bar\nu^{(t)}(s) - \nu^{(t)}(s),\; Q^{(t)}(s)^\top \bar\mu^{(t)}(s) - Q^{(t-1)}(s)^\top \bar\mu^{(t-1)}(s) \big\rangle + \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \nu^{(t)} \,\|\, \nu^{(t-1)} \big) - \frac{1}{\eta}\,\mathrm{KL}_s\big( \nu^{(t)} \,\|\, \bar\nu^{(t)} \big) - \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \bar\nu^{(t)} \,\|\, \nu^{(t-1)} \big) \\
&\qquad + \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \nu_\tau^\star \,\|\, \nu^{(t-1)} \big) - \frac{1}{\eta}\,\mathrm{KL}_s\big( \nu_\tau^\star \,\|\, \nu^{(t)} \big) - \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \nu^{(t)} \,\|\, \nu^{(t-1)} \big) \\
&\quad \le \big\| \bar\nu^{(t)}(s) - \nu^{(t)}(s) \big\|_1 \big\| Q^{(t)}(s)^\top \bar\mu^{(t)}(s) - Q^{(t-1)}(s)^\top \bar\mu^{(t-1)}(s) \big\|_\infty - \frac{1}{\eta}\,\mathrm{KL}_s\big( \nu^{(t)} \,\|\, \bar\nu^{(t)} \big) - \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \bar\nu^{(t)} \,\|\, \nu^{(t-1)} \big) \\
&\qquad + \frac{1-\eta\tau}{\eta}\,\mathrm{KL}_s\big( \nu_\tau^\star \,\|\, \nu^{(t-1)} \big) - \frac{1}{\eta}\,\mathrm{KL}_s\big( \nu_\tau^\star \,\|\, \nu^{(t)} \big) . \qquad (56)
\end{aligned}
\]
The second step results from the following three-point lemma:

Lemma 17 (Regularized three-point lemma). Let \(x \in \Delta(A)\) be defined as \(x(a) \propto y(a)^{1-\eta\tau} \exp(-\eta w(a))\) for some \(w \in \mathbb{R}^{|A|}\) and \(y \in \Delta(A)\). It holds for all \(z \in \Delta(A)\) that
\[
\frac{\eta}{1-\eta\tau} \Big( \langle x - z, w \rangle - \tau H(x) + \tau H(z) \Big) = \mathrm{KL}\big( z \,\|\, y \big) - \frac{1}{1-\eta\tau}\,\mathrm{KL}\big( z \,\|\, x \big) - \mathrm{KL}\big( x \,\|\, y \big) .
\]
We bound the first term in (56) as follows:
\[
\begin{aligned}
\big\| \bar\nu^{(t)}(s) - \nu^{(t)}(s) \big\|_1 \big\| Q^{(t)}(s)^\top \bar\mu^{(t)}(s) - Q^{(t-1)}(s)^\top \bar\mu^{(t-1)}(s) \big\|_\infty
&\le \big\| \bar\nu^{(t)}(s) - \nu^{(t)}(s) \big\|_1 \Big( \big\| \big( Q^{(t)}(s) - Q^{(t-1)}(s) \big)^\top \bar\mu^{(t-1)}(s) \big\|_\infty + \big\| Q^{(t)}(s)^\top \big( \bar\mu^{(t)}(s) - \bar\mu^{(t-1)}(s) \big) \big\|_\infty \Big) \\
&\le \big\| \bar\nu^{(t)}(s) - \nu^{(t)}(s) \big\|_1 \big\| Q^{(t)}(s) - Q^{(t-1)}(s) \big\|_\infty + \frac{2}{1-\gamma} \big\| \bar\nu^{(t)}(s) - \nu^{(t)}(s) \big\|_1 \big\| \bar\mu^{(t)}(s) - \bar\mu^{(t-1)}(s) \big\|_1 \\
&\le \frac{2\eta}{1-\gamma} \big\| Q^{(t)}(s) - Q^{(t-1)}(s) \big\|_\infty + \frac{1}{1-\gamma} \Big( 2 \big\| \bar\nu^{(t)}(s) - \nu^{(t)}(s) \big\|_1^2 + \big\| \bar\mu^{(t)}(s) - \mu^{(t-1)}(s) \big\|_1^2 + \big\| \mu^{(t-1)}(s) - \bar\mu^{(t-1)}(s) \big\|_1^2 \Big) \\
&\le \frac{2\eta}{1-\gamma} \big\| Q^{(t)}(s) - Q^{(t-1)}(s) \big\|_\infty + \frac{4}{1-\gamma}\,\mathrm{KL}_s\big( \nu^{(t)} \,\|\, \bar\nu^{(t)} \big) + \frac{2}{1-\gamma}\,\mathrm{KL}_s\big( \bar\mu^{(t)} \,\|\, \mu^{(t-1)} \big) + \frac{2}{1-\gamma}\,\mathrm{KL}_s\big( \mu^{(t-1)} \,\|\, \bar\mu^{(t-1)} \big) .
\end{aligned}
\]
Substitution of the above inequality into (56) completes the proof.
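The last two steps combine Young's inequality with Pinsker's inequality. A minimal numerical check of both ingredients (our own sketch, not from the paper):

```python
import math
import random

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def rand_dist(n):
    w = [random.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

random.seed(2)
for _ in range(1000):
    n = random.randint(2, 6)
    p, q = rand_dist(n), rand_dist(n)
    # Pinsker's inequality: ||p - q||_1^2 <= 2 KL(p || q).
    l1 = sum(abs(a - b) for a, b in zip(p, q))
    assert l1 ** 2 <= 2.0 * kl(p, q) + 1e-12
    # Young's inequality in the form used above: a * b <= (a^2 + b^2) / 2.
    a, b = random.random(), random.random()
    assert a * b <= (a * a + b * b) / 2.0 + 1e-15
```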

E.4 PROOF OF LEMMA 14

We have
\[
\begin{aligned}
\delta_{l,t} = \alpha_l \prod_{i=l+1}^{t} (1 - c_1 \alpha_i)
&= \alpha_l \prod_{i=l+1}^{t} \big( 1 - c_2 \alpha_i + (c_2 - c_1)\alpha_i \big) \\
&= \alpha_l (c_2 - c_1)\alpha_{l+1} \prod_{i=l+2}^{t} \big( 1 - c_2 \alpha_i + (c_2 - c_1)\alpha_i \big) + \alpha_l (1 - c_2 \alpha_{l+1}) \prod_{i=l+2}^{t} \big( 1 - c_2 \alpha_i + (c_2 - c_1)\alpha_i \big) \\
&= \alpha_l \sum_{i=l+1}^{t} (c_2 - c_1)\alpha_i \prod_{j=l+1}^{i-1} (1 - c_2 \alpha_j) \prod_{k=i+1}^{t} (1 - c_1 \alpha_k) + \alpha_l \prod_{i=l+1}^{t} (1 - c_2 \alpha_i) \\
&= (c_2 - c_1) \sum_{i=l+1}^{t} \xi_{l,i-1}\, \delta_{i,t} + \xi_{l,t} .
\end{aligned}
\]
Rearranging terms,
\[
\begin{aligned}
\sum_{i=l}^{t} \xi_{l,i}\, \delta_{i+1,t} = \alpha_l \delta_{l+1,t} + \sum_{i=l+1}^{t} \xi_{l,i}\, \delta_{i+1,t}
&= \frac{\alpha_{l+1}}{1 - c_1 \alpha_{l+1}}\, \delta_{l,t} + \sum_{i=l+1}^{t} \xi_{l,i}\, \delta_{i,t} \cdot \frac{\alpha_{i+1}}{\alpha_i (1 - c_1 \alpha_{i+1})} \\
&\le \delta_{l,t} + 2 \sum_{i=l+1}^{t} \xi_{l,i}\, \delta_{i,t} \le \delta_{l,t} + \frac{2}{c_2 - c_1} \big( \delta_{l,t} - \xi_{l,t} \big) \le \Big( 1 + \frac{2}{c_2 - c_1} \Big) \delta_{l,t},
\end{aligned}
\]
where the first inequality is due to \(\alpha_{l+1} \le \alpha_l \le 1/2\) and \(1 - c_1 \alpha_l \ge 1/2\) for all \(l \ge 1\), and the second uses the identity derived above together with \(\xi_{l,i} \le \xi_{l,i-1}\).
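The expansion above is a purely algebraic identity in the step sizes. The sketch below verifies it numerically under the assumed definitions \(\delta_{l,t} = \alpha_l \prod_{i=l+1}^{t}(1-c_1\alpha_i)\) and \(\xi_{l,i} = \alpha_l \prod_{j=l+1}^{i}(1-c_2\alpha_j)\); these come from the statement of Lemma 14, which is not shown in this excerpt, so treat them as our assumption:

```python
import random
from math import prod

def delta_lt(l, t, alpha, c1):
    """delta_{l,t} = alpha_l * prod_{i=l+1}^{t} (1 - c1*alpha_i)  (assumed definition)."""
    return alpha[l] * prod(1 - c1 * alpha[i] for i in range(l + 1, t + 1))

def xi_li(l, i, alpha, c2):
    """xi_{l,i} = alpha_l * prod_{j=l+1}^{i} (1 - c2*alpha_j)  (assumed definition)."""
    return alpha[l] * prod(1 - c2 * alpha[j] for j in range(l + 1, i + 1))

random.seed(3)
c1, c2 = 0.3, 0.9
alpha = [random.uniform(0.05, 0.5) for _ in range(12)]
for l in range(4):
    for t in range(l + 1, len(alpha)):
        lhs = delta_lt(l, t, alpha, c1)
        # Identity: delta_{l,t} = (c2 - c1) * sum_i xi_{l,i-1} delta_{i,t} + xi_{l,t}.
        rhs = (c2 - c1) * sum(
            xi_li(l, i - 1, alpha, c2) * delta_lt(i, t, alpha, c1)
            for i in range(l + 1, t + 1)
        ) + xi_li(l, t, alpha, c2)
        assert abs(lhs - rhs) < 1e-12
```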

E.5 PROOF OF LEMMA 16

Proof. For any \(x > -1\), it holds that
\[
\log(1+x) \le x - \frac{x^2}{2} + \frac{x^3}{3} \le x - \frac{x^2}{2} + \frac{|x|^3}{3} = x - \Big( \frac{1}{2} - \frac{|x|}{3} \Big) x^2 ,
\]
and that
\[
\log(1+x) \ge x - \frac{x^2}{2} + \frac{x^3}{3(1+x)^3} \ge x - \frac{x^2}{2} - \frac{|x|^3}{3(1+x)^3} = x - \Big( \frac{1}{2} + \frac{|x|}{3(1+x)^3} \Big) x^2 .
\]
Therefore, when \(x > -\frac{1}{10}\), we have \((1+x)^3 > \frac{2}{3}\) and thus
\[
x - \Big( \frac{1}{2} + \frac{|x|}{2} \Big) x^2 \le \log(1+x) \le x - \Big( \frac{1}{2} - \frac{|x|}{3} \Big) x^2 . \qquad (57)
\]
Let \(c\) be a shorthand notation for \(\|w\|_\infty\). The following lemma is standard (see, e.g., (Mei et al., 2020, Lemma 23) and (Cen et al., 2021a, Lemma 3)); it ensures that \(\|\log \pi - \log \pi'\|_\infty \le 2c\). Combining (57), (58) and (59) gives \(\mathrm{KL}\big( \pi \,\|\, \pi' \big) \le C(c)\,\mathrm{KL}\big( \pi' \,\|\, \pi \big)\) for a factor \(C(c)\) determined by the bounds above, and it is straightforward to verify that this factor is at most \(2\) when \(c \le 1/30\).

Here, \(\delta^{(t+1)}(s,a,b) \in \mathbb{R}\) represents the error due to approximate evaluation. For simplicity, we focus on the case where the policy update rules (9a) and (9b) remain unchanged. The following theorems reveal that the algorithm converges linearly to the QRE until it reaches an error floor determined by \(\delta^{(i)}_{\Gamma(\rho)}\):

Theorem 5. With \(0 < \eta \le \frac{(1-\gamma)^3}{32000\, C_\rho}\) and \(\alpha_i = \eta\tau\), we have
\[
\max\Big\{ \mathrm{KL}_\rho\big( \zeta_\tau^\star \,\|\, \zeta^{(t)} \big),\; \frac12 \mathrm{KL}_\rho\big( \zeta_\tau^\star \,\|\, \bar\zeta^{(t)} \big),\; 3\eta\, \mathbb{E}_{s\sim\rho} \big\| Q^{(t)}(s) - Q_\tau^\star(s) \big\|_\infty \Big\} \le \frac{3000}{(1-\gamma)^2 \tau} \Big( 1 - \frac{(1-\gamma)\eta\tau}{4} \Big)^t + \frac{1500}{(1-\gamma)\tau} \max_{0 \le i \le t} \delta^{(i)}_{\Gamma(\rho)} .
\]

Theorem 6. With \(0 < \eta \le \frac{(1-\gamma)^3}{32000\, C_\rho}\) and \(\alpha_i = \eta\tau\), we have
\[
\max_{s \in S,\, \mu, \nu} \Big( V^{\mu, \bar\nu^{(t)}}_\tau(s) - V^{\bar\mu^{(t)}, \nu}_\tau(s) \Big) \le \frac{2 \|1/\rho\|_\infty}{1-\gamma} \max\Big\{ \frac{8}{(1-\gamma)^2 \tau},\, \frac{1}{\eta} \Big\} \Bigg( \frac{3000}{(1-\gamma)^2 \tau} \Big( 1 - \frac{(1-\gamma)\eta\tau}{4} \Big)^t + \frac{1500}{(1-\gamma)\tau} \max_{0 \le i \le t} \delta^{(i)}_{\Gamma(\rho)} \Bigg),
\]
\[
\max_{\mu, \nu} \Big( V^{\mu, \bar\nu^{(t)}}_\tau(\rho) - V^{\bar\mu^{(t)}, \nu}_\tau(\rho) \Big) \le \frac{2 C^\dagger_{\rho,\tau}}{1-\gamma} \max\Big\{ \frac{8}{(1-\gamma)^2 \tau},\, \frac{1}{\eta} \Big\} \Bigg( \frac{3000}{(1-\gamma)^2 \tau} \Big( 1 - \frac{(1-\gamma)\eta\tau}{4} \Big)^t + \frac{1500}{(1-\gamma)\tau} \max_{0 \le i \le t} \delta^{(i)}_{\Gamma(\rho)} \Bigg).
\]

We remark that \(\delta^{(t)}_{\Gamma(\rho)}\) can be bounded by \(\epsilon_{\mathrm{stat}}\) or \(C_\rho\, \epsilon_{\mathrm{stat}}\) under the evaluation-error guarantees \(\max_{s \in S} \|\delta^{(t)}(s)\|_\infty \le \epsilon_{\mathrm{stat}}\) and \(\mathbb{E}_{s\sim\rho} \|\delta^{(t)}(s)\|_\infty \le \epsilon_{\mathrm{stat}}\), respectively. The remaining part of this section outlines the proof of the above theorems.
For simplicity, we only highlight the key differences from the previous proof that arise from the evaluation error, and omit the proofs of the corresponding lemmas. We first remark that Lemma 1 depends solely on the policy update rules and hence still holds. The error propagation of \(\{\delta^{(l)}\}\) is captured by Lemmas 19 and 20, which parallel Lemma 2 and Lemma 16. Following a similar argument to that of Lemma 4, we can show the following.

Lemma 21. Under the assumptions of Lemmas 19 and 20, it holds for all \(t \ge 0\) that
\[
\sum_{l=0}^{t} \lambda_{l+1,t+1} \Big( \eta \big\| Q_\tau^\star - Q^{(l+1)} \big\|_{\Gamma(\rho)} + \frac{12\eta^2}{(1-\gamma)^2} \big\| Q^{(l+1)} - Q^{(l)} \big\|_{\Gamma(\rho)} \Big) \le \frac{6250\, \eta C_\rho}{(1-\gamma)^3} \sum_{l=0}^{t-1} \lambda_{l+1,t+1}\, \mathrm{KL}\big( \bar\zeta^{(l+1)} \,\|\, \bar\zeta^{(l)} \big) + \frac{550\eta}{(1-\gamma)^2} \lambda_{0,t+1} + 60\eta \sum_{l=0}^{t} \lambda_{l+1,t+1}\, \alpha_l\, \delta^{(l)}_{\Gamma(\rho)} .
\]
With \(\alpha_l = \eta\tau\) for \(l \ge 1\), it is then straightforward to put the above lemmas together in a similar way to the proof in Appendix A to obtain Theorems 5 and 6.

G FURTHER DISCUSSION REGARDING WEI ET AL. (2021B)

This section demonstrates how the last-iterate convergence result of Wei et al. (2021b, Theorem 2), stated in terms of the Euclidean distance to the set of NEs, can be translated into a guarantee on the duality gap. Given any policy pair \(\zeta = (\mu, \nu)\) and an NE \(\zeta^\star = (\mu^\star, \nu^\star)\), we can invoke the performance difference lemma (44) and obtain
\[
\begin{aligned}
V^{\mu,\nu}(\rho) - V^\star(\rho) &= \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d^{\mu,\nu}_\rho} \Big[ \mu(s')^\top Q^\star(s') \nu(s') - \mu^\star(s')^\top Q^\star(s') \nu^\star(s') \Big] \\
&\le \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d^{\mu,\nu}_\rho} \Big[ \max_{\mu'} \mu'(s')^\top Q^\star(s') \nu(s') - \mu^\star(s')^\top Q^\star(s') \nu^\star(s') \Big] \\
&= \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d^{\mu,\nu}_\rho} \Big[ \max_{\mu'} \mu'(s')^\top Q^\star(s') \nu(s') - \max_{\mu'} \mu'(s')^\top Q^\star(s') \nu^\star(s') \Big] \\
&\le \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d^{\mu,\nu}_\rho} \Big[ \max_{\mu'} \mu'(s')^\top Q^\star(s') \big( \nu(s') - \nu^\star(s') \big) \Big] \\
&\le \frac{1}{(1-\gamma)^2}\, \mathbb{E}_{s' \sim d^{\mu,\nu}_\rho} \big\| \nu(s') - \nu^\star(s') \big\|_1 .
\end{aligned}
\]
Setting \(\mu\) to the best-response policy of \(\nu\), i.e., \(\mu = \mu^\dagger(\nu) := \arg\max_{\mu} V^{\mu,\nu}(\rho)\), we get
\[
\max_{\mu'} V^{\mu',\nu}(\rho) - V^\star(\rho) = V^{\mu^\dagger(\nu),\nu}(\rho) - V^\star(\rho) \le \frac{1}{(1-\gamma)^2}\, \mathbb{E}_{s' \sim d^{\mu^\dagger(\nu),\nu}_\rho} \big\| \nu(s') - \nu^\star(s') \big\|_1 \le \frac{\big\| d^{\mu^\dagger(\nu),\nu}_\rho \big\|_\infty}{(1-\gamma)^2} \sum_{s \in S} \big\| \nu(s) - \nu^\star(s) \big\|_1 .
\]
Similarly, we have
\[
V^\star(\rho) - \min_{\nu'} V^{\mu,\nu'}(\rho) \le \frac{\big\| d^{\mu,\nu^\dagger(\mu)}_\rho \big\|_\infty}{(1-\gamma)^2} \sum_{s \in S} \big\| \mu(s) - \mu^\star(s) \big\|_1 .
\]
Taken together, the duality gap can be bounded by the policy's \(\ell_1\) distance to the NE \((\mu^\star, \nu^\star)\) as
\[
\max_{\mu',\nu'} \big( V^{\mu',\nu}(\rho) - V^{\mu,\nu'}(\rho) \big) \le \frac{1}{(1-\gamma)^2} \sum_{s \in S} \Big( \big\| \nu(s) - \nu^\star(s) \big\|_1 + \big\| \mu(s) - \mu^\star(s) \big\|_1 \Big) \le \frac{|S|^{1/2} (|A|+|B|)^{1/2}}{(1-\gamma)^2} \Bigg( \sum_{s \in S} \Big( \big\| \nu(s) - \nu^\star(s) \big\|_2^2 + \big\| \mu(s) - \mu^\star(s) \big\|_2^2 \Big) \Bigg)^{1/2},
\]



Lemma 18. Let \(\pi, \pi' \in \Delta(A)\) satisfy \(\pi(a) \propto \exp(\theta(a))\) and \(\pi'(a) \propto \exp(\theta'(a))\) for some \(\theta, \theta' \in \mathbb{R}^{|A|}\). It holds that \(\|\log \pi - \log \pi'\|_\infty \le 2 \|\theta - \theta'\|_\infty\).

Since \(c < 1/30\), we have for all \(a \in A\):
\[
\Big| \frac{\pi(a)}{\pi'(a)} - 1 \Big| = \Big| \exp\Big( \log \frac{\pi(a)}{\pi'(a)} \Big) - \exp(0) \Big| \le \big| \log \pi(a) - \log \pi'(a) \big| \cdot \max\Big\{ 1,\, \frac{\pi(a)}{\pi'(a)} \Big\} \le 2c\, e^{2c} \le 3c .
\]
Therefore, we can bound \(\mathrm{KL}\big( \pi \,\|\, \pi' \big)\) in terms of the \(\chi^2\)-divergence \(\chi^2(\pi'; \pi)\), up to a factor involving \(1 + 3c\).
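Lemma 18 and Lemma 16 can both be spot-checked numerically. In the sketch below (our own illustration; `softmax` and `kl` are hypothetical helpers), a small multiplicative tilt \(\exp(w)\) with \(\|w\|_\infty \le 1/30\) is applied to random policies:

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

random.seed(4)
for _ in range(1000):
    n = random.randint(2, 6)
    theta = [random.uniform(-2.0, 2.0) for _ in range(n)]
    w = [random.uniform(-1.0 / 30, 1.0 / 30) for _ in range(n)]
    pi = softmax(theta)
    pi_tilted = softmax([t + v for t, v in zip(theta, w)])  # pi'(a) proportional to pi(a)*exp(w(a))
    # Lemma 18: log-probabilities move at most twice as far as the logits.
    log_gap = max(abs(math.log(a) - math.log(b)) for a, b in zip(pi, pi_tilted))
    assert log_gap <= 2.0 * max(abs(v) for v in w) + 1e-12
    # Lemma 16: with ||w||_inf <= 1/30 the two KL orderings are within a factor of 2.
    assert kl(pi, pi_tilted) <= 2.0 * kl(pi_tilted, pi) + 1e-15
```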

PROOF OF LEMMA 17

Proof. We have
\[
\begin{aligned}
\mathrm{KL}\big( z \,\|\, y \big) &= -H(z) + H(y) - \langle z - y, \log y \rangle \\
&= -H(z) + H(x) - \langle z - x, \log y \rangle - H(x) + H(y) - \langle x - y, \log y \rangle \\
&= -H(z) + H(x) - \langle z - x, \log x \rangle - H(x) + H(y) - \langle x - y, \log y \rangle - \langle z - x, \log y - \log x \rangle \\
&= \mathrm{KL}\big( z \,\|\, x \big) + \mathrm{KL}\big( x \,\|\, y \big) - \frac{\eta}{1-\eta\tau} \big\langle z - x,\; w + \tau \log x \big\rangle .
\end{aligned}
\]
Rearranging terms gives
\[
\frac{\eta}{1-\eta\tau} \langle x - z, w \rangle = \mathrm{KL}\big( z \,\|\, y \big) - \mathrm{KL}\big( z \,\|\, x \big) - \mathrm{KL}\big( x \,\|\, y \big) + \frac{\eta\tau}{1-\eta\tau} \langle z - x, \log x \rangle .
\]
Adding \(\frac{\eta\tau}{1-\eta\tau}\big( -H(x) + H(z) \big)\) to both sides, we are left with
\[
\begin{aligned}
\frac{\eta}{1-\eta\tau} \Big( \langle x - z, w \rangle - \tau H(x) + \tau H(z) \Big)
&= \mathrm{KL}\big( z \,\|\, y \big) - \mathrm{KL}\big( z \,\|\, x \big) - \mathrm{KL}\big( x \,\|\, y \big) + \frac{\eta\tau}{1-\eta\tau} \Big( H(z) - H(x) + \langle z - x, \log x \rangle \Big) \\
&= \mathrm{KL}\big( z \,\|\, y \big) - \frac{1}{1-\eta\tau}\,\mathrm{KL}\big( z \,\|\, x \big) - \mathrm{KL}\big( x \,\|\, y \big) .
\end{aligned}
\]

F FURTHER DISCUSSION REGARDING APPROXIMATE ALGORITHMS

In this section we verify the convergence of the proposed method equipped with inexact value updates in the infinite-horizon setting, where (10) in Algorithm 1 is replaced by
\[
\begin{cases}
Q^{(t+1)}(s,a,b) = r(s,a,b) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a,b)}\big[ V^{(t)}(s') \big] + \alpha_t\, \delta^{(t)}(s,a,b), \\[4pt]
V^{(t+1)}(s) = (1-\alpha_{t+1})\, V^{(t)}(s) + \alpha_{t+1}\Big( \bar\mu^{(t+1)}(s)^\top Q^{(t+1)}(s)\, \bar\nu^{(t+1)}(s) + \tau H\big( \bar\mu^{(t+1)}(s) \big) - \tau H\big( \bar\nu^{(t+1)}(s) \big) \Big),
\end{cases}
\]
or equivalently
\[
Q^{(t+1)}(s,a,b) = (1-\alpha_t)\, Q^{(t)}(s,a,b) + \alpha_t \Big( r(s,a,b) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a,b)}\Big[ \bar\mu^{(t)}(s')^\top Q^{(t)}(s')\, \bar\nu^{(t)}(s') + \tau H\big( \bar\mu^{(t)}(s') \big) - \tau H\big( \bar\nu^{(t)}(s') \big) \Big] + \delta^{(t)}(s,a,b) \Big).
\]
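The regularized three-point identity of Lemma 17, proved above, is exact rather than an inequality, so it is easy to validate numerically. The following sketch (our own; `three_point_gap` is a hypothetical helper) checks that the two sides agree to machine precision:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def three_point_gap(y, w, z, eta, tau):
    """LHS minus RHS of the identity in Lemma 17; should vanish identically."""
    # x(a) proportional to y(a)^{1 - eta*tau} * exp(-eta * w(a))
    x = [yi ** (1 - eta * tau) * math.exp(-eta * wi) for yi, wi in zip(y, w)]
    s = sum(x)
    x = [v / s for v in x]
    lhs = eta / (1 - eta * tau) * (
        sum((xi - zi) * wi for xi, zi, wi in zip(x, z, w))
        - tau * entropy(x) + tau * entropy(z)
    )
    rhs = kl(z, y) - kl(z, x) / (1 - eta * tau) - kl(x, y)
    return lhs - rhs

assert abs(three_point_gap([0.5, 0.5], [1.0, -1.0], [0.3, 0.7], 0.1, 0.5)) < 1e-10
assert abs(three_point_gap([0.2, 0.3, 0.5], [0.4, -0.2, 1.3], [0.1, 0.6, 0.3], 0.05, 0.4)) < 1e-10
```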

where the last step results from the Cauchy-Schwarz inequality. Finally, recall from Wei et al. (2021b, Theorem 2) that it takes at most
\[
O\Big( \frac{|S|^2}{\eta^4 c^4 (1-\gamma)^4 \epsilon^2} \Big)
\]
iterations to ensure
\[
\frac{1}{|S|} \sum_{s \in S} \Big( \big\| \nu(s) - \nu^\star(s) \big\|_2^2 + \big\| \mu(s) - \mu^\star(s) \big\|_2^2 \Big) \le \epsilon^2 ,
\]
with \(\eta^2 = O\big( (1-\gamma)^5 |S|^{-1} \big)\). Putting the pieces together and minimizing the bound over \(\eta\), this leads to an iteration complexity of
\[
O\Big( \frac{|S|^5 (|A|+|B|)^{1/2}}{(1-\gamma)^{16}\, c^4\, \epsilon^2} \Big)
\]
to achieve an \(\epsilon\)-NE in a last-iterate fashion.
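The Cauchy-Schwarz step converting the summed \(\ell_1\) distances into an \(\ell_2\) quantity can be checked directly; the sketch below (our own helper `cs_gap`) verifies on random instances that the right-hand side dominates:

```python
import math
import random

def cs_gap(u, v):
    """RHS minus LHS of the Cauchy-Schwarz step; nonnegative iff the bound holds."""
    S = len(u)
    A, B = len(u[0]), len(v[0])
    lhs = sum(sum(abs(x) for x in row) for row in u) + sum(sum(abs(x) for x in row) for row in v)
    sq = sum(x * x for row in u for x in row) + sum(x * x for row in v for x in row)
    return math.sqrt(S * (A + B)) * math.sqrt(sq) - lhs

random.seed(6)
for _ in range(200):
    S = random.randint(1, 5)
    A, B = random.randint(2, 4), random.randint(2, 4)
    u = [[random.uniform(-1.0, 1.0) for _ in range(A)] for _ in range(S)]
    v = [[random.uniform(-1.0, 1.0) for _ in range(B)] for _ in range(S)]
    assert cs_gap(u, v) >= -1e-9
```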

Table: Comparison of policy optimization methods for finding an ϵ-optimal NE or QRE of two-player zero-sum episodic Markov games in terms of the duality gap.

where the trajectory \((s_0, a_0, b_0, s_1, \dots)\) is generated according to \(a_t \sim \mu(\cdot|s_t)\), \(b_t \sim \nu(\cdot|s_t)\) and \(s_{t+1} \sim P(\cdot|s_t, a_t, b_t)\). Similarly, the Q-function \(Q^{\mu,\nu}(s,a,b)\) evaluates the expected discounted cumulative reward with initial state \(s\) and initial action pair \((a,b)\):
\[
Q^{\mu,\nu}(s,a,b) := \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t, b_t) \,\Big|\, s_0 = s,\; a_0 = a,\; b_0 = b \Big].
\]

C.1 Proof of Lemma 1
C.2 Proof of Lemma 2
C.3 Proof of Lemma 3
C.4 Proof of Lemma 4

ACKNOWLEDGMENTS

The authors would like to thank Gen Li and Zeyuan Allen-Zhu for valuable discussions. Part of this work was completed while S. Cen was an intern at Meta AI Research. S. Cen and Y. Chi are supported in part by the grants ONR N00014-18-1-2142 and N00014-19-1-2404, ARO W911NF-18-1-0303, and NSF CCF-1901199, CCF-2007911, CCF-2106778 and CNS-2148212. S. Cen is also gratefully supported by the Wei Shen and Xuehong Zhang Presidential Fellowship and the Nicholas Minnici Dean's Graduate Fellowship in Electrical and Computer Engineering at Carnegie Mellon University.


Lemma 19. With \(0 < \eta \le \min\{(1-\gamma)/180,\, (1-\gamma)^2/48\}\), the stated error-propagation bound holds for all \(t \ge 1\). When \(t = 0\), we have \(\big\| Q^{(1)} - Q_\tau^\star \big\|_{\Gamma(\rho)} \le \frac{2\gamma}{1-\gamma} + \alpha_0\, \delta^{(0)}_{\Gamma(\rho)}\).

