A SHARP ANALYSIS OF MODEL-BASED REINFORCEMENT LEARNING WITH SELF-PLAY

Anonymous

Abstract

Model-based algorithms (algorithms that explore the environment by building and utilizing an estimated model) are widely used in reinforcement learning practice and are theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm, Optimistic Nash Value Iteration (Nash-VI), for two-player zero-sum Markov games that is able to output an ε-approximate Nash policy in Õ(H^3 SAB/ε^2) episodes of game playing, where S is the number of states, A, B are the numbers of actions for the two players respectively, and H is the horizon length. This significantly improves over the best known model-based guarantee of Õ(H^4 S^2 AB/ε^2), and is the first that matches the information-theoretic lower bound Ω(H^3 S(A+B)/ε^2) except for a min{A, B} factor. In addition, our guarantee compares favorably against the best known model-free algorithm if min{A, B} = o(H^3), and outputs a single Markov policy, while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and to designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.

1. INTRODUCTION

This paper is concerned with the problem of multi-agent reinforcement learning (multi-agent RL), in which multiple agents learn to make decisions in an unknown environment in order to maximize their (own) cumulative rewards. Multi-agent RL has achieved significant recent success in traditionally hard AI challenges, including large-scale strategy games such as Go (Silver et al., 2016; 2017), real-time video games involving team play such as StarCraft and Dota 2 (OpenAI, 2018; Vinyals et al., 2019), as well as behavior learning in complex social scenarios (Baker et al., 2020). Achieving human-like (or super-human) performance in these games using multi-agent RL typically requires a large number of samples (steps of game playing) due to the necessity of exploration, and how to improve the sample complexity of multi-agent RL has been an important research question. One prevalent approach towards solving multi-agent RL is model-based methods: use the existing visitation data to build an estimate of the model (i.e., transition dynamics and rewards), run an offline planning algorithm on the estimated model to obtain a policy, and play that policy in the environment. Such a principle underlies some of the earliest single-agent online RL algorithms such as E3 (Kearns & Singh, 2002) and R-max (Brafman & Tennenholtz, 2002), and is conceptually appealing for multi-agent RL too, since the multi-agent structure does not add complexity to the model estimation part and only requires an appropriate multi-agent planning algorithm (such as value iteration for games (Shapley, 1953)) in a black-box fashion. On the other hand, model-free methods do not directly build estimates of the model, but instead directly estimate the value functions or action-value (Q) functions of the problem at the optimal/equilibrium policies, and play the greedy policies with respect to the estimated value functions.
Model-free algorithms have also

Table 1: Sample complexity (the required number of episodes) for algorithms to find ε-approximate Nash equilibrium policies in zero-sum Markov games: VI-Explore and VI-ULCB by Bai & Jin (2020), OMVI-SM by Xie et al. (2020), and Nash Q/V-Learning by Bai et al. (2020). The lower bound was proved by Jin et al. (2018); Domingues et al. (2020).

• We design an alternative algorithm, Optimistic Value Iteration with Zero Reward (VI-Zero), that is able to perform task-agnostic (reward-free) learning for multiple Markov games sharing the same transition (Section 4). For N > 1 games with the same transition and different (known) rewards, VI-Zero can find ε-approximate Nash policies for all games simultaneously in Õ(H^4 SAB log N/ε^2) episodes of game playing, which scales logarithmically in the number of games.
• We design the first line of sample-efficient algorithms for multi-player general-sum Markov games. In a multi-player game with M players and A_i actions per player, we show that an ε-near-optimal policy can be found in Õ(H^4 S^2 Σ_{i∈[M]} A_i/ε^2) episodes, where the desired optimality notion can be any one of Nash equilibrium, correlated equilibrium (CE), or coarse correlated equilibrium (CCE). We achieve this guarantee by either a multi-player version of Nash-VI or a multi-player version of reward-free value iteration (Section 5 & Appendix C).

Due to space limits, we defer a detailed survey of related works to Appendix A.

2. PRELIMINARIES

In this paper, we consider Markov Games (MGs; Shapley, 1953; Littman, 1994), which are also known as stochastic games in the literature. Markov games are the generalization of standard Markov Decision Processes (MDPs) to the multi-player setting, where each player seeks to maximize her own utility. For simplicity, in this section we describe the important special case of two-player zero-sum games, and return to the general formulation in Appendix C. Formally, we consider the tabular episodic version of a two-player zero-sum Markov game, which we denote as MG(H, S, A, B, P, r). Here H is the number of steps in each episode; S is the set of states with |S| ≤ S; (A, B) are the sets of actions of the max-player and the min-player respectively, with |A| ≤ A and |B| ≤ B; P = {P_h}_{h∈[H]} is a collection of transition matrices, so that P_h(·|s, a, b) gives the distribution of the next state if the action pair (a, b) is taken at state s at step h; and r = {r_h}_{h∈[H]} is a collection of reward functions, where r_h : S × A × B → [0, 1] is the deterministic reward function at step h. This reward represents both the gain of the max-player and the loss of the min-player, making the problem a zero-sum Markov game. In each episode of this MG, we start with a fixed initial state s_1. At each step h ∈ [H], both players observe the state s_h ∈ S and pick their own actions a_h ∈ A and b_h ∈ B simultaneously. Then, both players observe the action of their opponent, receive reward r_h(s_h, a_h, b_h), and the environment transitions to the next state s_{h+1} ∼ P_h(·|s_h, a_h, b_h). The episode ends when s_{H+1} is reached.

Policy, value function. A (Markov) policy μ of the max-player is a collection of H functions {μ_h : S → Δ_A}_{h∈[H]}, each mapping a state to a distribution over actions. (Here Δ_A is the probability simplex over the action set A.) Similarly, a policy ν of the min-player is a collection of H functions {ν_h : S → Δ_B}_{h∈[H]}.
We use the notation μ_h(a|s) and ν_h(b|s) to represent the probability of taking action a or b at state s at step h under the Markov policies μ and ν respectively. We use V^{μ,ν}_h : S → R to denote the value function at step h under policies μ and ν, so that V^{μ,ν}_h(s) gives the expected cumulative reward received under policies μ and ν, starting from s at step h:

V^{μ,ν}_h(s) := E_{μ,ν}[ Σ_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}, b_{h'}) | s_h = s ].

We also define Q^{μ,ν}_h : S × A × B → R to be the Q-value function at step h, so that Q^{μ,ν}_h(s, a, b) gives the cumulative reward received under policies μ and ν, starting from (s, a, b) at step h:

Q^{μ,ν}_h(s, a, b) := E_{μ,ν}[ Σ_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}, b_{h'}) | s_h = s, a_h = a, b_h = b ].

For simplicity, we define the operator P_h by [P_h V](s, a, b) := E_{s' ∼ P_h(·|s,a,b)} V(s'), and the operator D_π by [D_π Q](s) := E_{(a,b) ∼ π(·,·|s)} Q(s, a, b). By the definition of the value functions, we have the Bellman equations

Q^{μ,ν}_h(s, a, b) = (r_h + P_h V^{μ,ν}_{h+1})(s, a, b),   V^{μ,ν}_h(s) = (D_{μ_h×ν_h} Q^{μ,ν}_h)(s)

for all (s, a, b, h) ∈ S × A × B × [H], and at the (H+1)-th step we have V^{μ,ν}_{H+1}(s) = 0 for all s ∈ S.

Best response and Nash equilibrium. For any policy μ of the max-player, there exists a best response of the min-player, which is a policy ν†(μ) satisfying V^{μ,ν†(μ)}_h(s) = inf_ν V^{μ,ν}_h(s) for any (s, h) ∈ S × [H]. We denote V^{μ,†}_h := V^{μ,ν†(μ)}_h. By symmetry, we can also define μ†(ν) and V^{†,ν}_h. It is further known (cf. Filar & Vrieze, 2012) that there exist policies μ*, ν* that are optimal against the best responses of the opponents, in the sense that

V^{μ*,†}_h(s) = sup_μ V^{μ,†}_h(s),   V^{†,ν*}_h(s) = inf_ν V^{†,ν}_h(s),   for all (s, h).

We call these optimal strategies (μ*, ν*) the Nash equilibrium of the Markov game, which satisfies the following minimax equation:

sup_μ inf_ν V^{μ,ν}_h(s) = V^{μ*,ν*}_h(s) = inf_ν sup_μ V^{μ,ν}_h(s).

Intuitively, a Nash equilibrium gives a solution in which no player has anything to gain by changing only her own policy. We further abbreviate the values of the Nash equilibrium V^{μ*,ν*}_h and Q^{μ*,ν*}_h as V*_h and Q*_h.
We refer readers to Appendix D for the Bellman optimality equations for (the value functions of) the best responses and the Nash equilibrium.

Learning Objective. We measure the suboptimality of any pair of general policies (μ̂, ν̂) using the gap between their performance and the performance of the optimal strategy (i.e., the Nash equilibrium) when playing against the respective best responses:

V^{†,ν̂}_1(s_1) − V^{μ̂,†}_1(s_1) = [V^{†,ν̂}_1(s_1) − V*_1(s_1)] + [V*_1(s_1) − V^{μ̂,†}_1(s_1)].

Definition 1 (ε-approximate Nash equilibrium). A pair of general policies (μ̂, ν̂) is an ε-approximate Nash equilibrium if V^{†,ν̂}_1(s_1) − V^{μ̂,†}_1(s_1) ≤ ε.

Definition 2 (Regret). Let (μ^k, ν^k) denote the policies deployed by the algorithm in the k-th episode. After a total of K episodes, the regret is defined as Regret(K) = Σ_{k=1}^{K} (V^{†,ν^k}_1 − V^{μ^k,†}_1)(s_1).

One goal of reinforcement learning is to design algorithms for Markov games that can find an ε-approximate Nash equilibrium using a number of episodes that is small in its dependency on S, A, B, H as well as 1/ε (a PAC sample complexity bound). An alternative goal is to design algorithms for Markov games that achieve regret sublinear in K and polynomial in S, A, B, H (a regret bound). We remark that any sublinear-regret algorithm can be directly converted into a polynomial-sample PAC algorithm via the standard online-to-batch conversion (see e.g., Jin et al. (2018)).
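As a concrete illustration of the value functions and Bellman equations above: when the model (P, r) is known, V^{μ,ν} can be computed exactly by backward induction. The following is a minimal Python sketch; the tabular array layout and the function name `evaluate` are our own illustrative choices, not from the paper.

```python
import numpy as np

def evaluate(P, r, mu, nu):
    """Backward induction for V^{mu,nu} in a tabular zero-sum Markov game.

    P:  (H, S, A, B, S) transitions; P[h, s, a, b] is a distribution over s'.
    r:  (H, S, A, B)    deterministic rewards in [0, 1].
    mu: (H, S, A)       max-player policy; mu[h, s] is a distribution over A.
    nu: (H, S, B)       min-player policy.
    Returns V of shape (H + 1, S), with the convention V[H] = 0.
    """
    H, S, A, B, _ = P.shape
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        # Bellman equation: Q_h = r_h + P_h V_{h+1}; V_h = D_{mu_h x nu_h} Q_h.
        Q = r[h] + P[h] @ V[h + 1]                      # shape (S, A, B)
        V[h] = np.einsum('sa,sb,sab->s', mu[h], nu[h], Q)
    return V
```

For instance, in a one-state game with reward 1 at every step, the sketch returns V_1(s_1) = H, as the Bellman recursion dictates.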

3. OPTIMISTIC NASH VALUE ITERATION

In this section, we present our main algorithm, Optimistic Nash Value Iteration (Nash-VI), and provide its theoretical guarantees.

3.1. ALGORITHM DESCRIPTION

We describe our Nash-VI in Algorithm 1. In each episode, the algorithm can be decomposed into two parts.
• Lines 3-13 (optimistic planning from the estimated model): perform value iteration with bonuses using the empirical estimate P̂ of the transition, and compute a new (joint) policy π which is "greedy" with respect to the estimated value functions;
• Lines 16-19 (play the policy and update the model estimate): execute the policy π, collect samples, and update the estimate P̂ of the transition.
At a high level, this two-phase strategy is standard in the majority of model-based RL algorithms, and also underlies provably efficient model-based algorithms such as UCBVI for the single-agent (MDP) setting (Azar et al., 2017) and VI-ULCB for the two-player Markov game setting (Bai & Jin, 2020). However, VI-ULCB has two undesirable drawbacks: its sample complexity is not tight in any of the H, S, and A, B dependencies, and its computational complexity is PPAD-complete (a complexity class conjectured to be computationally hard (Daskalakis, 2013)). As we elaborate in the following, our Nash-VI algorithm differs from VI-ULCB in a few important technical aspects, which allows it to significantly improve the sample complexity over VI-ULCB and ensures that our algorithm terminates in polynomial time. Before digging into the techniques, we remark that lines 14-15 are only used for computing the output policy: they choose π^out to be the policy of the episode with the minimum gap (V̄_1 − V̲_1)(s_1).
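The model-update part (lines 16-19) amounts to simple count-based bookkeeping: maintain visit counts and normalize them into an empirical transition. A minimal sketch of this bookkeeping follows; the class name and array layout are hypothetical illustrations, not the authors' code.

```python
import numpy as np

class EmpiricalModel:
    """Count-based estimate of P_h(s' | s, a, b), as in lines 16-19 of Nash-VI."""

    def __init__(self, H, S, A, B):
        self.N = np.zeros((H, S, A, B), dtype=int)        # visit counts N_h(s, a, b)
        self.Np = np.zeros((H, S, A, B, S), dtype=int)    # counts N_h(s, a, b, s')
        self.P_hat = np.full((H, S, A, B, S), 1.0 / S)    # empirical transition

    def update(self, h, s, a, b, s_next):
        # One observed transition (s, a, b) -> s_next at step h.
        self.N[h, s, a, b] += 1
        self.Np[h, s, a, b, s_next] += 1
        self.P_hat[h, s, a, b] = self.Np[h, s, a, b] / self.N[h, s, a, b]
```

Initializing unvisited entries to the uniform distribution is our own convention; the algorithm only ever uses P̂ at triples with t > 0.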

3.1.1. OVERVIEW OF TECHNIQUES

Auxiliary bonus γ. The major improvement over VI-ULCB (Bai & Jin, 2020) comes from the use of a different style of bonus term γ (line 8), in addition to the standard bonus β (line 7), in the value iteration steps (lines 9-10). This is also the main technical contribution of our Nash-VI algorithm. The auxiliary bonus γ is computed by applying the empirical transition operator P̂_h to the gap V̄_{h+1} − V̲_{h+1} at the next step. This is very different from the standard bonus β, which is typically designed according to concentration inequalities. The main purpose of the value iteration steps (lines 9-10) is to ensure that the estimated values Q̄_h and Q̲_h are, with high probability, upper and lower bounds on the Q-value of the current policy when facing best responses (see Lemmas 20 and 22 for more details). To do so, prior work (Bai & Jin, 2020) only adds the bonus β, which then needs to be as large as Θ(√(S/t)). In contrast, the inclusion of the auxiliary bonus γ in our algorithm allows a much smaller choice of β, scaling only as Õ(√(1/t)), while still maintaining valid confidence bounds. This technique alone brings the sample complexity down to Õ(H^4 SAB/ε^2), removing an entire S factor compared to VI-ULCB. Furthermore, the coefficient in γ is only c/H for some absolute constant c, which ensures that introducing the error term γ hurts the overall sample complexity only up to a constant factor.

Bernstein concentration. Our Nash-VI allows two choices of the bonus function β = BONUS(t, σ̂^2):

Hoeffding type: c(√(H^2 ι/t) + H^2 Sι/t),   Bernstein type: c(√(σ̂^2 ι/t) + H^2 Sι/t),   (3)

where σ̂^2 is the estimated variance, ι contains the logarithmic factors, and c is an absolute constant. The V̂ in line 7 is the empirical variance operator, defined as V̂_h V = P̂_h V^2 − (P̂_h V)^2 for any V ∈ [0, H]^S. The design of both bonuses stems from the Hoeffding and Bernstein concentration inequalities.
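Both bonus choices in (3) and the empirical variance operator can be written down directly. A minimal sketch follows; the function names and the convention of passing ι and c explicitly are our own, and the paper leaves the absolute constant c unspecified.

```python
import numpy as np

def empirical_variance(P_hat_sab, V):
    """Empirical variance operator: (V_hat_h V)(s,a,b) = P_hat V^2 - (P_hat V)^2.

    P_hat_sab: empirical next-state distribution at a fixed (s, a, b).
    V:         value vector over next states, entries in [0, H].
    """
    return P_hat_sab @ (V ** 2) - (P_hat_sab @ V) ** 2

def bonus(t, sigma2, H, S, iota, c=1.0, kind="bernstein"):
    """BONUS(t, sigma2) from equation (3); c is an unspecified absolute constant."""
    if kind == "hoeffding":
        return c * (np.sqrt(H ** 2 * iota / t) + H ** 2 * S * iota / t)
    return c * (np.sqrt(sigma2 * iota / t) + H ** 2 * S * iota / t)
```

Note that both bonuses share the lower-order H^2 Sι/t term; they differ only in whether the leading √(1/t) term is scaled by H^2 or by the estimated variance σ̂^2.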
Further, the Bernstein bonus uses a sharper concentration, which saves an H factor in the sample complexity compared to the Hoeffding bonus (similar to the single-agent setting (Azar et al., 2017)). This further reduces the sample complexity to Õ(H^3 SAB/ε^2), which matches the lower bound in all of the H, S, ε factors.

Algorithm 1 Optimistic Nash Value Iteration (Nash-VI)
1: Initialize: for any (s, a, b, h), Q̄_h(s, a, b) ← H, Q̲_h(s, a, b) ← 0, ∆ ← H, N_h(s, a, b) ← 0.
2: for episode k = 1, ..., K do
3:   for step h = H, H−1, ..., 1 do
4:     for (s, a, b) ∈ S × A × B do
5:       t ← N_h(s, a, b).
6:       if t > 0 then
7:         β ← BONUS(t, V̂_h[(V̄_{h+1} + V̲_{h+1})/2](s, a, b)).
8:         γ ← (c/H) P̂_h(V̄_{h+1} − V̲_{h+1})(s, a, b).
9:         Q̄_h(s, a, b) ← min{(r_h + P̂_h V̄_{h+1})(s, a, b) + γ + β, H}.
10:        Q̲_h(s, a, b) ← max{(r_h + P̂_h V̲_{h+1})(s, a, b) − γ − β, 0}.
11:     for s ∈ S do
12:       π_h(·, ·|s) ← CCE(Q̄_h(s, ·, ·), Q̲_h(s, ·, ·)).
13:       V̄_h(s) ← (D_{π_h} Q̄_h)(s); V̲_h(s) ← (D_{π_h} Q̲_h)(s).
14:   if (V̄_1 − V̲_1)(s_1) < ∆ then
15:     ∆ ← (V̄_1 − V̲_1)(s_1) and π^out ← π.
16:   for step h = 1, ..., H do
17:     take action (a_h, b_h) ∼ π_h(·, ·|s_h), observe next state s_{h+1}.
18:     add 1 to N_h(s_h, a_h, b_h) and N_h(s_h, a_h, b_h, s_{h+1}).
19:     P̂_h(·|s_h, a_h, b_h) ← N_h(s_h, a_h, b_h, ·)/N_h(s_h, a_h, b_h).
20: Output (μ^out, ν^out): the marginal policies of π^out.

Coarse Correlated Equilibrium (CCE). The prior algorithm VI-ULCB (Bai & Jin, 2020) computes the "greedy" policy with respect to the estimated value functions by directly computing the Nash equilibrium of the Q-values at each step h. However, since the algorithm maintains both upper and lower confidence bounds of the Q-value, this leads to the requirement of computing the Nash equilibrium of a two-player general-sum matrix game, which is in general PPAD-complete (Daskalakis, 2013). To overcome this computational challenge, we instead compute a relaxation of the Nash equilibrium, the Coarse Correlated Equilibrium (CCE), a technique first introduced by Xie et al. (2020) to address reinforcement learning in Markov games. Formally, for any pair of matrices Q̄, Q̲ ∈ [0, H]^{A×B}, CCE(Q̄, Q̲) returns a distribution π ∈ Δ_{A×B} such that

E_{(a,b)∼π} Q̄(a, b) ≥ max_{a'} E_{(a,b)∼π} Q̄(a', b),   E_{(a,b)∼π} Q̲(a, b) ≤ min_{b'} E_{(a,b)∼π} Q̲(a, b').   (4)

Intuitively, in a CCE the players choose their actions in a potentially correlated way such that no one can benefit from a unilateral unconditional deviation. A CCE always exists, since a Nash equilibrium is also a CCE and a Nash equilibrium always exists. Furthermore, a CCE can be computed by linear programming in polynomial time. We remark that, different from a Nash equilibrium, where the policies of the two players are independent, the policies given by a CCE are in general correlated. Therefore, executing such a policy (line 17) requires the cooperation of the two players.
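Since the CCE constraints (4) are linear in π, CCE(Q̄, Q̲) can be computed as a feasibility linear program over the simplex Δ_{A×B}. Below is a minimal sketch using scipy.optimize.linprog; using this particular solver is an illustrative choice of ours, as the paper only requires some polynomial-time LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def cce(Q_up, Q_low):
    """Find pi in Delta_{A x B} satisfying the CCE constraints (4).

    For every deviation a': E_pi Q_up(a, b) >= E_pi Q_up(a', b), and
    for every deviation b': E_pi Q_low(a, b) <= E_pi Q_low(a, b').
    """
    A, B = Q_up.shape
    n = A * B
    rows = []
    # Max-player rows: sum_{a,b} pi(a,b) * (Q_up(a', b) - Q_up(a, b)) <= 0.
    for ap in range(A):
        rows.append((np.tile(Q_up[ap], (A, 1)) - Q_up).ravel())
    # Min-player rows: sum_{a,b} pi(a,b) * (Q_low(a, b) - Q_low(a, b')) <= 0.
    for bp in range(B):
        rows.append((Q_low - np.tile(Q_low[:, bp:bp + 1], (1, B))).ravel())
    res = linprog(
        c=np.zeros(n),                       # pure feasibility problem
        A_ub=np.array(rows), b_ub=np.zeros(A + B),
        A_eq=np.ones((1, n)), b_eq=[1.0],    # pi is a probability distribution
        bounds=[(0, None)] * n,
    )
    return res.x.reshape(A, B)
```

For instance, on the matrix [[1, 0], [0, 1]] (used as both Q̄ and Q̲), the uniform distribution over the four action pairs is one valid CCE, and any distribution returned by the LP satisfies both constraints in (4).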

3.2. THEORETICAL GUARANTEES

Now we are ready to present the theoretical guarantees for Algorithm 1. We let π^k denote the policy computed in line 12 in the k-th episode, and μ^k, ν^k denote the marginal policies of π^k for the two players.

Theorem 3 (Nash-VI with Hoeffding bonus). For any p ∈ (0, 1], letting ι = log(SABT/p), with probability at least 1−p, Algorithm 1 with the Hoeffding-type bonus (3) (for some absolute constant c > 0) achieves:
• (V^{†,ν^out}_1 − V^{μ^out,†}_1)(s_1) ≤ ε, if the number of episodes K ≥ Ω(H^4 SABι/ε^2 + H^3 S^2 ABι^2/ε).
• Regret(K) = Σ_{k=1}^{K} (V^{†,ν^k}_1 − V^{μ^k,†}_1)(s_1) ≤ O(√(H^3 SABTι) + H^3 S^2 ABι^2).

Theorem 3 provides both a sample complexity bound and a regret bound for Nash-VI to find an ε-approximate Nash equilibrium. For small ε ≤ H/(Sι), the sample complexity scales as Õ(H^4 SAB/ε^2). Similarly, for large T ≥ H^3 S^3 ABι^3, the regret scales as Õ(√(H^3 SABT)), where T = KH is the total number of steps played within K episodes. Theorem 3 is significant in that it improves the sample complexity of model-based algorithms for Markov games from S^2 to S (and the regret from S to √S). This is achieved by adding the new auxiliary bonus γ in the value iteration steps, as explained in Section 3.1. The proof of Theorem 3 can be found in Appendix F.1. Our next theorem states that when the Bernstein bonus in (3) is used instead of the Hoeffding bonus, the sample complexity of Nash-VI can be further improved by an H factor in the leading-order term (and the regret by a √H factor).

Theorem 4 (Nash-VI with Bernstein bonus). For any p ∈ (0, 1], letting ι = log(SABT/p), with probability at least 1−p, Algorithm 1 with the Bernstein-type bonus (3) (for some absolute constant c > 0) achieves:
• (V^{†,ν^out}_1 − V^{μ^out,†}_1)(s_1) ≤ ε, if the number of episodes K ≥ Ω(H^3 SABι/ε^2 + H^3 S^2 ABι^2/ε).
• Regret(K) = Σ_{k=1}^{K} (V^{†,ν^k}_1 − V^{μ^k,†}_1)(s_1) ≤ O(√(H^2 SABTι) + H^3 S^2 ABι^2).
Compared with the information-theoretic sample complexity lower bound Ω(H^3 S(A+B)ι/ε^2) and regret lower bound Ω(√(H^2 S(A+B)T)) (Bai & Jin, 2020), when ε is small, Nash-VI with the Bernstein bonus achieves the optimal dependency on all of H, S, ε up to logarithmic factors in both the sample complexity and the regret, and the only gap that remains open is an AB/(A+B) ≤ min{A, B} factor. The proof of Theorem 4 can be found in Appendix F.2.

Comparison with model-free approaches. Different from our model-based approach, the recently proposed model-free algorithm Nash V-Learning (Bai et al., 2020) achieves sample complexity Õ(H^6 S(A+B)ι/ε^2), which has a tight (A+B) dependency on A, B. However, our Nash-VI has the following important advantages over Nash V-Learning: 1. Our sample complexity has a better dependency on the horizon H; 2. Our algorithm outputs a single pair of Markov policies (μ^out, ν^out), while their algorithm outputs a generic history-dependent policy that can only be written as a nested mixture of Markov policies; 3. The model-free algorithms in Bai et al. (2020) cannot be directly modified to obtain a √T regret (their exploration policies can be arbitrarily poor), while our model-based algorithm enjoys the √T-regret guarantee. We comment that although both Nash-VI and Nash V-Learning have polynomial running time, the latter enjoys a better computational complexity, because Nash-VI requires solving LPs to compute CCEs in each episode.

4. REWARD-FREE LEARNING

In this section, we modify our model-based algorithm Nash-VI for the reward-free exploration setting (Jin et al., 2020b), which is also known as the task-agnostic (Zhang et al., 2020b) or reward-agnostic setting. Reward-free learning has two phases. In the exploration phase, the agent first collects a dataset of transitions D = {(s_{k,h}, a_{k,h}, b_{k,h}, s_{k,h+1})}_{(k,h)∈[K]×[H]} from a Markov game M without the guidance of reward information. After the exploration, in the planning phase, for each task i ∈ [N], D is augmented with stochastic reward information to become D^i = {(s_{k,h}, a_{k,h}, b_{k,h}, s_{k,h+1}, r_{k,h})}_{(k,h)∈[K]×[H]}, where r_{k,h} is sampled from an unknown reward distribution with expectation equal to r^i_h(s_{k,h}, a_{k,h}, b_{k,h}). Here, we use r^i to refer to the unknown reward function of the i-th task. The goal is to compute nearly optimal policies for all N tasks under M, given the augmented datasets. There are strong practical motivations for considering the reward-free setting. First, in applications such as robotics, we face multiple tasks in sequential systems with shared transition dynamics (i.e., the world) but very different rewards. There, we prefer to learn the underlying transition independently of reward information. Second, from the algorithm design perspective, decoupling exploration and planning (i.e., performing exploration without reward information) can be valuable for designing new algorithms in more challenging settings (e.g., with function approximation). Due to space limits, we defer the description of our algorithm Optimistic Value Iteration with Zero Reward (VI-Zero, Algorithm 2) to Appendix B and only state its theoretical guarantees here.
Under review as a conference paper at ICLR 2021

The following theorem shows that the empirical transition P̂^out output by VI-Zero is close to the true transition P, in the sense that any Nash equilibrium of M(P̂^out, r̂^i) (i ∈ [N]) is also an approximate Nash equilibrium of the true underlying Markov game M(P, r^i), where r̂^i is the empirical estimate of r^i computed using D^i.

Theorem 5 (Sample complexity of VI-Zero). There exists an absolute constant c such that for any p ∈ (0, 1], ε ∈ (0, H], and N ∈ N, if we choose the bonus β_t = c(√(H^2 ι/t) + H^2 Sι/t) with ι = log(NSABT/p) and K ≥ c(H^4 SABι/ε^2 + H^3 S^2 ABι^2/ε), then with probability at least 1−p, the output P̂^out of Algorithm 2 has the following property: for any N fixed reward functions r^1, ..., r^N, a Nash equilibrium of the Markov game M(P̂^out, r̂^i) is also an ε-approximate Nash equilibrium of the true Markov game M(P, r^i) for all i ∈ [N].

Theorem 5 shows that, when ε is small, VI-Zero needs only Õ(H^4 SAB/ε^2) samples to learn an estimate P̂^out of the transition that is accurate enough to support learning an approximate Nash equilibrium for any N fixed rewards. The most important advantage of reward-free learning is that the sample complexity scales only polylogarithmically with the number of tasks or reward functions N. This is in sharp contrast to reward-aware algorithms (e.g., Nash-VI), which have to be rerun for each different task, so that the total sample complexity scales linearly in N. In exchange for this benefit, compared to Nash-VI, VI-Zero loses a factor of H in the leading term of the sample complexity, since we can no longer use the Bernstein bonus due to the lack of reward information. VI-Zero also does not have a regret guarantee since, again, without reward information the exploration policies are naturally sub-optimal. The proof of Theorem 5 can be found in Appendix G.1.

Connections with reward-free learning in MDPs.
Since MDPs are special cases of Markov games, our algorithm VI-Zero directly applies to the single-agent setting and yields a sample complexity similar to existing results (Zhang et al., 2020b; Wang et al., 2020). However, distinct from existing results, which require both the exploration algorithm and the planning algorithm to be specially designed to work together, our algorithm allows an arbitrary planning algorithm as long as it computes the Nash equilibrium of a Markov game with known transition and reward. Therefore, our results completely decouple exploration from planning.

Lower bound for reward-free learning. Finally, although the sample complexity in Theorem 5 scales as AB instead of A+B, our next theorem states that, unlike in the general reward-aware setting, this AB scaling is unavoidable in the reward-free setting. This reveals an intrinsic gap between reward-free and reward-aware learning: an A+B dependency is only achievable via sampling schemes that are reward-aware. A similar lower bound is also presented in recent work (Zhang et al., 2020a) for the discounted setting with a different hard-instance construction.

Theorem 6 (Lower bound for reward-free learning of Markov games). There exists an absolute constant c > 0 such that for any ε ∈ (0, c], there exists a family of Markov games M(ε) satisfying: for any reward-free algorithm A using K ≤ cH^2 SAB/ε^2 episodes, there exists a Markov game M ∈ M(ε) such that if we run A on M and output policies (μ̂, ν̂), then with probability at least 1/4 we have (V^{†,ν̂}_1 − V^{μ̂,†}_1)(s_1) ≥ ε.

This lower bound shows that the sample complexity in Theorem 5 is optimal in S, A, B, and ε. The proof of Theorem 6 can be found in Appendix G.3.

5. MULTI-PLAYER GENERAL-SUM GAMES

We adapt our analysis to multi-player general-sum games and present the first line of provably efficient algorithms. Concretely, we design two model-based algorithms, Multi-Nash-VI and Multi-VI-Zero (Algorithm 3 and Algorithm 4), that can find an ε-approximate {Nash, CE, CCE} equilibrium for any multi-player general-sum Markov game in Õ(H^4 S^2 Σ_{i=1}^{m} A_i/ε^2) episodes of game playing, where A_i is the number of actions for player i ∈ {1, ..., m} (Theorems 15 and 16). Due to space limits, we defer the detailed setups, algorithms and results to Appendix C.

6. CONCLUSION

In this paper, we provided a sharp analysis of model-based algorithms for Markov games. Our new algorithm Nash-VI can find an ε-approximate Nash equilibrium of a zero-sum Markov game in Õ(H^3 SAB/ε^2) episodes of game playing, which almost matches the sample complexity lower bound except for the AB vs. A+B dependency. We also applied our analysis to derive new efficient algorithms for task-agnostic game playing, as well as the first line of algorithms for multi-player general-sum Markov games. There are a number of compelling future directions for this work. For example, can we achieve A+B instead of AB sample complexity for zero-sum games using model-based approaches (thus closing the gap between the lower and upper bounds)? How can we design more efficient algorithms for general-sum games with better sample complexity (e.g., O(S) instead of O(S^2))? We leave these problems as future work.

Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pp. 4987-4997, 2017.

Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. arXiv preprint arXiv:2002.07066, 2020.

Yasin Abbasi Yadkori, Peter L. Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pp. 2508-2516, 2013.

Kaiqing Zhang, Sham M. Kakade, Tamer Başar, and Lin F. Yang. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. arXiv preprint arXiv:2007.07461, 2020a.

Xuezhou Zhang, Yuzhe Ma, and Adish Singla. Task-agnostic exploration in reinforcement learning. arXiv preprint arXiv:2006.09497, 2020b.

Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. arXiv preprint arXiv:2004.10019, 2020c.

Alexander Zimin and Gergely Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pp. 1583-1591, 2013.

A RELATED WORK

Markov games. Markov games (or stochastic games) were proposed in the early 1950s (Shapley, 1953). They are widely used to model multi-agent RL. Learning the Nash equilibria of Markov games has been studied in Littman (1994; 2001); Hu & Wellman (2003); Hansen et al. (2013); Lee et al. (2020), where the transition matrix and reward are assumed to be known, or in the asymptotic setting where the amount of data goes to infinity. These results do not directly apply to the non-asymptotic setting, where the transition and reward are unknown and only a limited amount of data is available for estimating them. Another line of work assumes certain strong reachability conditions under which sophisticated exploration strategies are not required. A prevalent approach is to assume access to simulators (generative models) that enable the agent to directly sample transition and reward information for any state-action pair. Bai & Jin (2020) and Xie et al. (2020) provide the first line of non-asymptotic sample complexity guarantees for learning Markov games without these reachability assumptions, in which exploration is essential. However, both results suffer from highly suboptimal sample complexity. The results of Xie et al. (2020) also apply to the linear function approximation setting. More recently, two model-free algorithms, Nash Q-Learning and Nash V-Learning, were shown to achieve better sample complexity guarantees (Bai et al., 2020). In particular, Nash V-Learning achieves the near-optimal dependence on S, A and B. However, its dependence on H is worse than our results, and its output policy is a nested mixture, which is hard to implement. We compare our results with existing non-asymptotic guarantees in Table 1. We remark that the classical R-max algorithm (Brafman & Tennenholtz, 2002) also provides provable guarantees for learning Markov games. However, Brafman & Tennenholtz (2002) use a weaker definition of regret (similar to the online setting in Xie et al. (2020)), and consequently their result does not imply any sample complexity result for finding Nash equilibrium policies.

Adversarial MDPs. Another way to model multi-player behavior is via adversarial MDPs. Most work in this line considers the setting with adversarial rewards (Zimin & Neu, 2013; Rosenberg & Mansour, 2019; Jin et al., 2019), where the reward can be manipulated by an adversary arbitrarily and the goal is to compete with the optimal (stationary) policy in hindsight. Adversarial MDPs with changing dynamics are computationally hard even under full-information feedback (Yadkori et al., 2013). Note that these results do not directly imply provable self-play algorithms in our setting, because the opponent in a Markov game can affect both the reward and the transition.

Single-agent RL.

There is a rich literature on reinforcement learning in MDPs (see e.g., Jaksch et al., 2010; Osband et al., 2014; Azar et al., 2017; Dann et al., 2017; Strehl et al., 2006; Jin et al., 2018). An MDP is a special case of a Markov game, where only a single agent interacts with a stochastic environment. For the tabular episodic setting with nonstationary dynamics and no simulators, the best known sample complexity is Õ(H^3 SA/ε^2).

Reward-free and task-agnostic exploration. Jin et al. (2020a) propose a new paradigm for learning an MDP, which they call reward-free exploration. In this setting, the agent goes through a two-stage process: in the exploration phase the agent interacts with the environment without knowing the reward function, and in the planning phase the reward function is given and the agent needs to output a policy. The goal is to make the output policy near-optimal for any given reward function. A closely related setting is task-agnostic learning, where the reward function is determined at the very beginning but not revealed until the planning phase. Note that algorithms for task-agnostic learning can also be transferred to reward-free exploration by taking a union bound over the different possible reward functions. In Table 1, VI-Explore (Bai & Jin, 2020) and Algorithm 2 can also be applied to this setting. Jin et al. (2020a) also propose an algorithm which first finds a covering policy that maximizes the probability of reaching each state separately, and then collects data following this policy. Zhang et al. (2020b) take a different approach: they first run the optimistic Q-learning algorithm (Jin et al., 2018) with zero reward to explore the environment, and then utilize the collected trajectories to compute a policy in an incremental manner. Wang et al. (2020) follow a similar scheme, but study reward-free exploration in linearly-parametrized MDPs.

Algorithm 2 Optimistic Value Iteration with Zero Reward (VI-Zero)
1: Initialize: for any (s, a, b, h), Q̄_h(s, a, b) ← H, ∆ ← H, N_h(s, a, b) ← 0.
2: for episode k = 1, ..., K do
3:   for step h = H, H−1, ..., 1 do
4:     for (s, a, b) ∈ S × A × B do
5:       t ← N_h(s, a, b).
6:       if t > 0 then
7:         Q̄_h(s, a, b) ← min{(P̂_h V̄_{h+1})(s, a, b) + β_t, H}.
8:     for s ∈ S do
9:       π_h(s) ← argmax_{(a,b)∈A×B} Q̄_h(s, a, b).
10:      V̄_h(s) ← (D_{π_h} Q̄_h)(s).
11:   if V̄_1(s_1) < ∆ then
12:     ∆ ← V̄_1(s_1) and P̂^out ← P̂.
13:   for step h = 1, ..., H do
14:     take action (a_h, b_h) ∼ π_h(·, ·|s_h), observe next state s_{h+1}.
15:     add 1 to N_h(s_h, a_h, b_h) and N_h(s_h, a_h, b_h, s_{h+1}).
16:     P̂_h(·|s_h, a_h, b_h) ← N_h(s_h, a_h, b_h, ·)/N_h(s_h, a_h, b_h).
17: Output P̂^out.

B OPTIMISTIC VALUE ITERATION WITH ZERO REWARD -VI-ZERO

We now describe our algorithm for reward-free learning in zero-sum Markov games.

Exploration phase. In the first phase of reward-free learning, we deploy the algorithm Optimistic Value Iteration with Zero Reward (VI-Zero, Algorithm 2). This algorithm differs from the reward-aware Nash-VI (Algorithm 1) in two important aspects. First, we use zero reward in the exploration phase (Line 7), and only maintain an upper bound of the (reward-free) value function instead of both upper and lower bounds. Second, our exploration policy is the maximizing (instead of CCE) policy of the value function (Line 9). We remark that the Q_h(s, a, b) maintained in Algorithm 2 is no longer an upper bound of any actual value function (as it has no reward), but rather a measure of the uncertainty, or the suboptimality that the agent may suffer, if she takes action (a, b) at state s and step h, and makes decisions using the empirical estimate P̂ in the remaining steps (see a rigorous version of this statement in Lemma 27). Finally, the empirical transition P̂ of the episode that minimizes V_1(s_1) is output and passed to the planning phase.

Planning phase. After obtaining the estimate P̂ of the transition, our planning algorithm is rather simple. For the i-th task, let r̂_i be the empirical estimate of r_i computed using the i-th augmented dataset D_i. Then we compute the Nash equilibrium of the Markov game M(P̂, r̂_i) with estimated transition P̂ and reward r̂_i. Since both P̂ and r̂_i are known exactly, this is a pure computational problem without any sampling error, and can be efficiently solved by simple planning algorithms such as vanilla Nash value iteration without optimism (see Appendix G.2 for more details).
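As a concrete illustration, the backward pass of the exploration phase (Lines 3–10 of Algorithm 2) can be sketched as follows. This is our own minimal sketch, not the paper's implementation: the function name `vi_zero_backward`, the array shapes, and the bonus constant `c` are all illustrative assumptions.

```python
import numpy as np

def vi_zero_backward(P_hat, N, H, S, A, B, c, iota):
    """One backward pass of VI-Zero: zero reward, optimistic bonus only.

    P_hat: (H, S, A, B, S) empirical transitions; N: (H, S, A, B) visit counts.
    Returns the greedy exploration policy pi[h, s] = (a, b) and the value V,
    which here measures remaining uncertainty rather than any actual reward.
    """
    V = np.zeros((H + 1, S))              # V_{H+1} = 0
    pi = np.zeros((H, S, 2), dtype=int)
    for h in range(H - 1, -1, -1):
        # Hoeffding-style bonus beta_t = c * sqrt(H^2 * iota / max{t, 1})
        beta = c * np.sqrt(H * H * iota / np.maximum(N[h], 1))
        # Q_h(s, a, b) <- min{ (P_hat_h V_{h+1})(s, a, b) + beta, H }
        Q = np.minimum(P_hat[h] @ V[h + 1] + beta, H)
        Q = np.where(N[h] > 0, Q, H)      # unvisited triples keep Q = H
        for s in range(S):
            a, b = np.unravel_index(np.argmax(Q[s]), (A, B))
            pi[h, s] = (a, b)
            V[h, s] = Q[s, a, b]          # V_h(s) = Q_h(s, pi_h(s))
    return pi, V
```

Note that, unlike Nash-VI, the exploration policy is the plain argmax over joint actions (Line 9), since with zero reward there is no adversarial structure left to exploit.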

C MULTIPLAYER GENERAL-SUM MARKOV GAMES

In this section, we extend both our model-based algorithms (Algorithm 1 and Algorithm 2) to the setting of multiplayer general-sum Markov games, and present corresponding theoretical guarantees.

C.1 PROBLEM FORMULATION

A general-sum Markov game (general-sum MG) with m players is a tuple MG(H, S, {A_i}^m_{i=1}, P, {r_i}^m_{i=1}), where H and S denote the length of each episode and the state space, respectively. Different from the two-player zero-sum setting, we now have m different action spaces, where A_i is the action space of the i-th player and |A_i| = A_i. We let a := (a_1, · · · , a_m) denote the (tuple of) joint actions taken by all m players. P = {P_h}_{h∈[H]} is a collection of transition matrices, so that P_h(·|s, a) gives the distribution of the next state if actions a are taken at state s at step h, and r_i = {r_{h,i}}_{h∈[H]} is a collection of reward functions for the i-th player, so that r_{h,i}(s, a) gives the reward received by the i-th player if actions a are taken at state s at step h. In this section, we consider three versions of equilibrium for general-sum MGs: Nash equilibrium (NE), correlated equilibrium (CE), and coarse correlated equilibrium (CCE), all standard solution concepts in games (Nisan et al., 2007). These three notions coincide in two-player zero-sum games, but are not equivalent to each other in multi-player general-sum games; any one of them could be the desired notion depending on the application at hand. Below we introduce their definitions.

(Approximate) Nash equilibrium in general-sum MGs. The policy of the i-th player is denoted by π_i := {π_{h,i} : S → ∆_{A_i}}_{h∈[H]}. We denote the product policy of all the players by π := π_1 × · · · × π_m, and the policy of all the players except the i-th player by π_{−i}. We define V^π_{h,i}(s) as the expected cumulative reward received by the i-th player if the game starts at state s at step h and all players follow policy π. For any strategy π_{−i}, there also exists a best response of the i-th player, which is a policy µ^†(π_{−i}) satisfying V^{µ^†(π_{−i}), π_{−i}}_{h,i}(s) = sup_{π_i} V^{π_i, π_{−i}}_{h,i}(s) for any (s, h) ∈ S × [H]. We denote V^{†,π_{−i}}_{h,i} := V^{µ^†(π_{−i}), π_{−i}}_{h,i}.
The Q-functions of the best response can be defined similarly. Our first objective is to find an approximate Nash equilibrium of Markov games.

Definition 7 (ε-approximate Nash equilibrium in general-sum MGs). A product policy π is an ε-approximate Nash equilibrium if max_{i∈[m]} (V^{†,π_{−i}}_{1,i} − V^π_{1,i})(s_1) ≤ ε.

The above definition requires the suboptimality gap (V^{†,π_{−i}}_{1,i} − V^π_{1,i})(s_1) to be at most ε for every player i. This is consistent with the two-player case (Definition 1) up to a constant of 2, since in the two-player zero-sum setting we have V^π_{1,1}(s_1) = −V^π_{1,2}(s_1) for any product policy π = (µ, ν), and therefore

(V^{†,ν}_{1,1} − V^{µ,†}_{1,1})(s_1) ≤ 2 max_{i∈[2]} (V^{†,π_{−i}}_{1,i} − V^π_{1,i})(s_1) ≤ 2 (V^{†,ν}_{1,1} − V^{µ,†}_{1,1})(s_1).

We can similarly define the regret.

Definition 8 (Nash-regret in general-sum MGs). Let π^k denote the (product) policy deployed by the algorithm in the k-th episode. After a total of K episodes, the regret is defined as

Regret_Nash(K) = Σ^K_{k=1} max_{i∈[m]} (V^{†,π^k_{−i}}_{1,i} − V^{π^k}_{1,i})(s_1).

(Approximate) CCE in general-sum MGs. The coarse correlated equilibrium (CCE) is a relaxed version of the Nash equilibrium in which we consider general correlated policies instead of product policies. Let A = A_1 × · · · × A_m denote the joint action space.

Definition 9 (CCE in general-sum MGs). A (correlated) policy π := {π_h(s) ∈ ∆_A : (h, s) ∈ [H] × S} is a CCE if max_{i∈[m]} V^{†,π_{−i}}_{h,i}(s) ≤ V^π_{h,i}(s) for all (s, h) ∈ S × [H].

Compared with a Nash equilibrium, a CCE is not necessarily a product policy; that is, we may not have π_h(s) ∈ ∆_{A_1} × · · · × ∆_{A_m}. Similarly, we define ε-approximate CCE and the CCE-regret below.

Definition 10 (ε-approximate CCE in general-sum MGs). A policy π := {π_h(s) ∈ ∆_A : (h, s) ∈ [H] × S} is an ε-approximate CCE if max_{i∈[m]} (V^{†,π_{−i}}_{1,i} − V^π_{1,i})(s_1) ≤ ε.

Definition 11 (CCE-regret in general-sum MGs). Let π^k denote the (correlated) policy deployed by the algorithm in the k-th episode.
After a total of K episodes, the regret is defined as

Regret_CCE(K) = Σ^K_{k=1} max_{i∈[m]} (V^{†,π^k_{−i}}_{1,i} − V^{π^k}_{1,i})(s_1).

(Approximate) CE in general-sum MGs. The correlated equilibrium (CE) is another relaxation of the Nash equilibrium. To define CE, we first introduce the concept of a strategy modification. A strategy modification φ := {φ_{h,s} : (h, s) ∈ [H] × S} for player i is a set of H × S functions from A_i to itself. Let Φ_i denote the set of all possible strategy modifications for player i. One can compose a strategy modification φ with any Markov policy π and obtain a new policy φ ◦ π such that, when policy π chooses to play a := (a_1, . . . , a_m) at state s and step h, policy φ ◦ π plays (a_1, . . . , a_{i−1}, φ_{h,s}(a_i), a_{i+1}, . . . , a_m) instead.

Definition 12 (CE in general-sum MGs). A policy π := {π_h(s) ∈ ∆_A : (h, s) ∈ [H] × S} is a CE if max_{i∈[m]} max_{φ∈Φ_i} V^{φ◦π}_{h,i}(s) ≤ V^π_{h,i}(s) holds for all (s, h) ∈ S × [H].

Similarly, we have an approximate version of CE and the CE-regret.

Definition 13 (ε-approximate CE in general-sum MGs). A policy π := {π_h(s) ∈ ∆_A : (h, s) ∈ [H] × S} is an ε-approximate CE if max_{i∈[m]} max_{φ∈Φ_i} (V^{φ◦π}_{1,i} − V^π_{1,i})(s_1) ≤ ε.

Definition 14 (CE-regret in general-sum MGs). Let π^k denote the policy deployed by the algorithm in the k-th episode. After a total of K episodes, the regret is defined as

Regret_CE(K) = Σ^K_{k=1} max_{i∈[m]} max_{φ∈Φ_i} (V^{φ◦π^k}_{1,i} − V^{π^k}_{1,i})(s_1).

Relationship between Nash, CE, and CCE. For general-sum MGs, we have {Nash} ⊆ {CE} ⊆ {CCE}, so that they form a nested set of notions of equilibrium (Nisan et al., 2007). Indeed, one can easily verify that if we restrict the choice of strategy modification φ to those consisting of only constant functions, i.e., φ_{h,s}(a) independent of a, then Definition 12 reduces to the definition of a CCE policy. In addition, any Nash equilibrium is a CE by definition.
Finally, since a Nash equilibrium always exists, so do CEs and CCEs.
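To make the nesting {Nash} ⊆ {CE} ⊆ {CCE} concrete, here is a brute-force checker for the one-step (matrix-game) versions of Definitions 9 and 12. This is our own illustration: the function names, the tolerance, and the Chicken-game test below are assumptions, not part of the paper.

```python
import itertools
import numpy as np

def is_ce(pi, payoffs):
    """One-step CE check: no player i can gain by remapping each
    recommended action a_i -> phi(a_i) (a strategy modification)."""
    sizes = pi.shape
    for i, R in enumerate(payoffs):
        base = (pi * R).sum()                    # E_{a~pi} r_i(a)
        A_i = sizes[i]
        for phi in itertools.product(range(A_i), repeat=A_i):  # all maps A_i -> A_i
            dev = 0.0
            for a in np.ndindex(*sizes):
                a_mod = list(a); a_mod[i] = phi[a[i]]
                dev += pi[a] * R[tuple(a_mod)]
            if dev > base + 1e-9:
                return False
    return True

def is_cce(pi, payoffs):
    """One-step CCE check: restrict the modifications to constant maps,
    i.e., fixed unilateral deviations (the remark after Definition 12)."""
    for i, R in enumerate(payoffs):
        base = (pi * R).sum()
        for a_dev in range(pi.shape[i]):
            dev = 0.0
            for a in np.ndindex(*pi.shape):
                a_mod = list(a); a_mod[i] = a_dev
                dev += pi[a] * R[tuple(a_mod)]
            if dev > base + 1e-9:
                return False
    return True
```

Since every constant map is a valid strategy modification, any policy passing `is_ce` also passes `is_cce`, mirroring {CE} ⊆ {CCE}.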

C.2 MULTIPLAYER OPTIMISTIC NASH VALUE ITERATION

Here we present the Multi-Nash-VI algorithm (Algorithm 3), an extension of Algorithm 1 to multi-player general-sum Markov games.

The EQUILIBRIUM subroutine. The EQUILIBRIUM subroutine in Line 11 can be taken to be any one of the {NASH, CE, CCE} subroutines for one-step games. When using NASH, we compute a Nash equilibrium of a one-step multi-player game (see, e.g., Berg & Sandholm (2016) for an overview of the available algorithms); the worst-case computational complexity of such a subroutine is PPAD-hard (Daskalakis, 2013). When using CE or CCE, we find CEs or CCEs of the one-step games, respectively, which can be done in polynomial time using linear programming; however, the policies found are not guaranteed to be product policies. We remark that in Algorithm 1 we used the CCE subroutine for finding a Nash equilibrium in two-player zero-sum games, which seemingly contradicts the principle of using the right subroutine for the right equilibrium, but nevertheless works because Nash equilibria and CCEs are equivalent in zero-sum games.

We are now ready to present the theoretical guarantees for Algorithm 3. We let π^k denote the policy computed in Line 11 of Algorithm 3 in the k-th episode.

Theorem 15 (Multi-Nash-VI). There exists an absolute constant c such that the following holds. For any p ∈ (0, 1], let ι = log(SABT/p). Then, with probability at least 1 − p, Algorithm 3 with bonus β_t = c√(SH^2 ι/t) and EQUILIBRIUM being one of {NASH, CE, CCE} satisfies (respectively):

Algorithm 3 Multiplayer Optimistic Nash Value Iteration (Multi-Nash-VI)
1: Initialize: for any (s, a, h, i), Q̄_{h,i}(s, a) ← H, Q̲_{h,i}(s, a) ← 0, Δ ← H, N_h(s, a) ← 0.
2: for episode k = 1, . . . , K do
3:   for step h = H, H − 1, . . . , 1 do
4:     for (s, a) ∈ S × A_1 × · · · × A_m do
5:       t ← N_h(s, a);
6:       if t > 0 then
7:         for player i = 1, 2, . . . , m do
8:           Q̄_{h,i}(s, a) ← min{(r_{h,i} + P̂_h V̄_{h+1,i})(s, a) + β_t, H}.
9:           Q̲_{h,i}(s, a) ← max{(r_{h,i} + P̂_h V̲_{h+1,i})(s, a) − β_t, 0}.
10: for s ∈ S do
11:   π_h(·|s) ← EQUILIBRIUM(Q̄_{h,1}(s, ·), Q̄_{h,2}(s, ·), · · · , Q̄_{h,m}(s, ·)).
12:   for player i = 1, 2, . . . , m do
13:     V̄_{h,i}(s) ← (D_{π_h} Q̄_{h,i})(s); V̲_{h,i}(s) ← (D_{π_h} Q̲_{h,i})(s).
14: if max_{i∈[m]} (V̄_{1,i} − V̲_{1,i})(s_1) < Δ then
15:   Δ ← max_{i∈[m]} (V̄_{1,i} − V̲_{1,i})(s_1) and π_out ← π.
16: for step h = 1, . . . , H do
17:   take action a_h ∼ π_h(·|s_h), observe reward r_h and next state s_{h+1}.
18:   add 1 to N_h(s_h, a_h) and N_h(s_h, a_h, s_{h+1}).
19:   P̂_h(·|s_h, a_h) ← N_h(s_h, a_h, ·)/N_h(s_h, a_h).
20: Output π_out.

• π_out is an ε-approximate {NASH, CE, CCE}, provided the number of episodes K ≥ Ω(H^4 S^2 (∏^m_{i=1} A_i) ι/ε^2).
• Regret_{Nash,CE,CCE}(K) ≤ O(√(H^3 S^2 (∏^m_{i=1} A_i) T ι)).

In the situation where the EQUILIBRIUM subroutine is taken to be NASH, Theorem 15 provides both the sample complexity of Multi-Nash-VI for finding an ε-approximate Nash equilibrium and its regret bound. Compared with our earlier result for two-player zero-sum games (Theorem 3), the sample complexity here scales as S^2 H^4 instead of SH^3; this is because the auxiliary bonus and the Bernstein concentration technique do not apply here. Furthermore, the sample complexity is proportional to ∏^m_{i=1} A_i, which grows exponentially with the number of players.

Runtime of Algorithm 3

We remark that while the Nash guarantee is the strongest of the three guarantees in Theorem 15, the runtime of Algorithm 3 in the Nash case is not guaranteed to be polynomial, and is PPAD-hard in the worst case (due to the hardness of the NASH subroutine). In contrast, the CE and CCE guarantees are weaker, but the corresponding algorithms are guaranteed to run in polynomial time.

C.3 MULTIPLAYER REWARD-FREE LEARNING

We can also generalize VI-Zero to the multiplayer setting and obtain Algorithm 4, Multi-VI-Zero, which is almost the same as VI-Zero except that its exploration bonus β_t is larger than that of VI-Zero by a √S factor. Similar to Theorem 5, we have the following theoretical guarantee, which claims that any {NASH, CCE, CE} of M(P_out, r̂_i) (i ∈ [N]) is also an approximate {NASH, CCE, CE} of the true Markov game M(P, r_i), where P_out is the empirical transition output by Algorithm 4 and r̂_i is the empirical estimate of r_i.

Theorem 16 (Multi-VI-Zero). There exists an absolute constant c such that the following holds. For any p ∈ (0, 1], ε ∈ (0, H], N ∈ ℕ, if we choose bonus β_t = c√(H^2 Sι/t) with ι = log(NSABT/p) and K ≥ c H^4 S^2 (∏^m_{i=1} A_i) ι/ε^2, then with probability at least 1 − p, the output P_out of Algorithm 4 has the following property: for any N fixed reward functions r_1, . . . , r_N, any {NASH, CCE, CE} of the Markov game M(P_out, r̂_i) is also an ε-approximate {NASH, CCE, CE} of the true Markov game M(P, r_i), for all i ∈ [N].

Algorithm 4 Multiplayer Optimistic Value Iteration with Zero Reward (Multi-VI-Zero)
1: Initialize: for any (s, a, h), V_h(s, a) ← H, Δ ← H, N_h(s, a) ← 0.
2: for episode k = 1, . . . , K do
3:   for step h = H, H − 1, . . . , 1 do
4:     for (s, a) ∈ S × A_1 × · · · × A_m do
5:       t ← N_h(s, a).
6:       if t > 0 then
7:         Q_h(s, a) ← min{(P̂_h V_{h+1})(s, a) + β_t, H}.
8:     for s ∈ S do
9:       π_h(s) ← argmax_{a∈A_1×···×A_m} Q_h(s, a).
10:      V_h(s) ← (D_{π_h} Q_h)(s).
11:  if V_1(s_1) < Δ then
12:    Δ ← V_1(s_1) and P_out ← P̂.
13:  for step h = 1, . . . , H do
14:    take action a_h ∼ π_h(·|s_h), observe next state s_{h+1}.
15:    add 1 to N_h(s_h, a_h) and N_h(s_h, a_h, s_{h+1}).
16:    P̂_h(·|s_h, a_h) ← N_h(s_h, a_h, ·)/N_h(s_h, a_h).
17: Output P_out.

The proof of Theorem 16 can be found in Appendix H.2. It is worth mentioning that the empirical Markov game M(P_out, r̂_i) may have multiple {Nash equilibria, CCEs, CEs}, and Theorem 16 ensures that all of them are ε-approximate {Nash equilibria, CCEs, CEs} of the true Markov game.
Also, note that the sample complexity here is quadratic in the number of states because we use the exploration bonus β_t = c√(H^2 Sι/t), which is larger than usual by a √S factor.

D BELLMAN EQUATIONS FOR MARKOV GAMES

In this section, we present the Bellman equations for the different types of values in Markov games.

Fixed policies. For any pair of Markov policies (µ, ν), by the definition of their values in (1) and (2), we have the Bellman equations

Q^{µ,ν}_h(s, a, b) = (r_h + P_h V^{µ,ν}_{h+1})(s, a, b),    V^{µ,ν}_h(s) = (D_{µ_h×ν_h} Q^{µ,ν}_h)(s)

for all (s, a, b, h) ∈ S × A × B × [H], where V^{µ,ν}_{H+1}(s) = 0 for all s ∈ S.

Best responses. For any Markov policy µ of the max-player, by definition, we have the following Bellman equations for the values of its best response:

Q^{µ,†}_h(s, a, b) = (r_h + P_h V^{µ,†}_{h+1})(s, a, b),    V^{µ,†}_h(s) = inf_{ν∈∆_B} (D_{µ_h×ν} Q^{µ,†}_h)(s)

for all (s, a, b, h) ∈ S × A × B × [H], where V^{µ,†}_{H+1}(s) = 0 for all s ∈ S. Similarly, for any Markov policy ν of the min-player, we have the symmetric version:

Q^{†,ν}_h(s, a, b) = (r_h + P_h V^{†,ν}_{h+1})(s, a, b),    V^{†,ν}_h(s) = sup_{µ∈∆_A} (D_{µ×ν_h} Q^{†,ν}_h)(s)

for all (s, a, b, h) ∈ S × A × B × [H], where V^{†,ν}_{H+1}(s) = 0 for all s ∈ S.

Nash equilibria. Finally, by the definition of Nash equilibria in Markov games, we have the Bellman optimality equations

Q_h(s, a, b) = (r_h + P_h V_{h+1})(s, a, b),
V_h(s) = sup_{µ∈∆_A} inf_{ν∈∆_B} (D_{µ×ν} Q_h)(s) = inf_{ν∈∆_B} sup_{µ∈∆_A} (D_{µ×ν} Q_h)(s)

for all (s, a, b, h) ∈ S × A × B × [H], where V_{H+1}(s) = 0 for all s ∈ S.
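The Nash Bellman backup V_h(s) = sup_µ inf_ν (D_{µ×ν} Q_h)(s) is, at each (s, h), the value of a zero-sum matrix game, which can be computed by the classical linear program sketched below. This is our own illustrative sketch (assuming scipy is available), not code from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Value and max-player strategy of the zero-sum game max_mu min_nu mu^T Q nu.

    LP variables: (v, mu_1, ..., mu_A). Maximize v subject to
    (Q^T mu)_b >= v for every column b, and mu in the simplex.
    """
    A, B = Q.shape
    c = np.zeros(A + 1); c[0] = -1.0                 # maximize v == minimize -v
    A_ub = np.hstack([np.ones((B, 1)), -Q.T])        # v - (Q^T mu)_b <= 0
    b_ub = np.zeros(B)
    A_eq = np.hstack([[[0.0]], np.ones((1, A))])     # sum_a mu_a = 1
    b_eq = [1.0]
    bounds = [(None, None)] + [(0, None)] * A        # v free, mu_a >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]
```

Running this backup for h = H, . . . , 1 with Q_h built from r_h + P_h V_{h+1} is exactly vanilla Nash value iteration, the planning routine mentioned in Appendix B.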

E PROPERTIES OF COARSE CORRELATED EQUILIBRIUM

Recall the definition of CCE in our main paper (Eq. (4)); we restate it here after rescaling. For any pair of matrices P, Q ∈ [0, 1]^{n×m}, the subroutine CCE(P, Q) returns a distribution π ∈ ∆_{n×m} that satisfies

E_{(a,b)∼π} P(a, b) ≥ max_{a'} E_{(a,b)∼π} P(a', b),    (5)
E_{(a,b)∼π} Q(a, b) ≤ min_{b'} E_{(a,b)∼π} Q(a, b').

We make three remarks on CCE. First, a CCE always exists, since a Nash equilibrium for the general-sum game with payoff matrices (P, Q) is also a CCE defined by (P, Q), and a Nash equilibrium always exists. Second, a CCE can be computed efficiently, since the constraints (5) can be rewritten as n + m linear constraints on π ∈ ∆_{n×m}, which can be solved by standard linear programming algorithms. Third, a CCE in a general-sum game need not be a Nash equilibrium; however, a CCE in a zero-sum game is guaranteed to induce a Nash equilibrium.

Proposition 17. Let π = CCE(Q, Q), and let (µ, ν) be the marginal distributions over the two players' actions induced by π. Then (µ, ν) is a Nash equilibrium for the payoff matrix Q.

Proof of Proposition 17. Let N be the value of the Nash equilibrium for Q. Since π = CCE(Q, Q), by definition we have

E_{(a,b)∼π} Q(a, b) ≥ max_{a'} E_{(a,b)∼π} Q(a', b) = max_{a'} E_{b∼ν} Q(a', b) ≥ N,
E_{(a,b)∼π} Q(a, b) ≤ min_{b'} E_{(a,b)∼π} Q(a, b') = min_{b'} E_{a∼µ} Q(a, b') ≤ N.

This gives

max_{a'} E_{b∼ν} Q(a', b) = min_{b'} E_{a∼µ} Q(a, b') = N,

which finishes the proof.

Intuitively, the CCE subroutine can be used in Nash Q-learning to find an approximate Nash equilibrium because the upper and lower confidence values (Q̄ and Q̲) eventually become very close, so that the preconditions of Proposition 17 become approximately satisfied.
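As a sketch of the second remark (the n + m linear constraints in (5)) and of Proposition 17, one can compute a CCE by linear programming and check that its marginals form a Nash equilibrium. The code below is our own illustration (assuming scipy is available); the function name `cce` and the feasibility-LP formulation are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def cce(P, Q):
    """Find a feasible point of the CCE constraints (5) for payoff matrices
    P (max-player) and Q (min-player) via linear programming.

    Variables: pi(a, b), flattened row-major; besides the simplex constraints,
    there are exactly n + m linear inequality constraints.
    """
    n, m = P.shape
    A_ub, b_ub = [], []
    exp_P = P.flatten()                   # E_pi P as a linear functional of pi
    exp_Q = Q.flatten()
    for a_dev in range(n):                # max-player: E_pi P(a', b) <= E_pi P
        dev = np.repeat(P[a_dev][None, :], n, axis=0).flatten()
        A_ub.append(dev - exp_P); b_ub.append(0.0)
    for b_dev in range(m):                # min-player: E_pi Q <= E_pi Q(a, b')
        dev = np.repeat(Q[:, b_dev][:, None], m, axis=1).flatten()
        A_ub.append(exp_Q - dev); b_ub.append(0.0)
    A_eq = [np.ones(n * m)]; b_eq = [1.0]  # pi sums to one
    res = linprog(np.zeros(n * m), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (n * m))
    return res.x.reshape(n, m)
```

For a zero-sum game (P = Q), the marginals of any such feasible π are Nash strategies, exactly as Proposition 17 states.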

F PROOF FOR SECTION 3 - OPTIMISTIC NASH VALUE ITERATION

F.1 PROOF OF THEOREM 3

We denote by V^k, Q^k, π^k, µ^k and ν^k the values and policies at the beginning of the k-th episode. In particular, N^k_h(s, a, b) is the number of times the state-action tuple (s, a, b) has been visited at the h-th step before the k-th episode; N^k_h(s, a, b, s') is defined by the same token. Using this notation, we can further define the empirical transition by P̂^k_h(s'|s, a, b) := N^k_h(s, a, b, s')/N^k_h(s, a, b); if N^k_h(s, a, b) = 0, we set P̂^k_h(s'|s, a, b) = 1/S. As a result, the bonus terms can be written as

β^k_h(s, a, b) := C (√(H^2 ι / max{N^k_h(s, a, b), 1}) + H^2 S ι / max{N^k_h(s, a, b), 1}),
γ^k_h(s, a, b) := (C/H) · P̂^k_h(V̄^k_{h+1} − V̲^k_{h+1})(s, a, b),

for some large absolute constant C > 0.

Lemma 18. Let c_1 be some large absolute constant. Define event E_0 to be: for all h, s, a, b, s' and k ∈ [K],

|[(P̂^k_h − P_h) V_{h+1}](s, a, b)| ≤ c_1 √(H^2 ι / max{N^k_h(s, a, b), 1}),
|(P̂^k_h − P_h)(s' | s, a, b)| ≤ c_1 (√(min{P_h(s' | s, a, b), P̂^k_h(s' | s, a, b)} · ι / max{N^k_h(s, a, b), 1}) + ι / max{N^k_h(s, a, b), 1}).

Then P(E_0) ≥ 1 − p.

Proof. The proof is standard and folklore: apply standard concentration inequalities and then take a union bound. For completeness, we provide the proof of the second inequality here. Consider a fixed tuple (s, a, b, h), and consider the following equivalent random process: (a) before the agent starts, the environment samples {s^(1), s^(2), . . . , s^(K)} independently from P_h(· | s, a, b); (b) during the interaction between the agent and the environment, the i-th time the agent reaches (s, a, b, h), the environment makes the agent transit to s^(i). The randomness induced by this interaction procedure is exactly the same as in the original process, which means the probability of any event in this context is the same as in the original problem. Therefore, it suffices to prove the target concentration inequality in this 'easy' context.
Denote by P̂^(t)_h(· | s, a, b) the empirical estimate of P_h(· | s, a, b) calculated from {s^(1), s^(2), . . . , s^(t)}. For fixed t and s', by applying the Bernstein inequality and its empirical version, we have with probability at least 1 − p/(S^2 ABT),

|(P_h − P̂^(t)_h)(s' | s, a, b)| ≤ O(√(min{P_h(s' | s, a, b), P̂^(t)_h(s' | s, a, b)} · ι/t) + ι/t).

Taking a union bound over all s, a, b, h, s' and t ∈ [K], we obtain that with probability at least 1 − p the same bound holds simultaneously for all s, a, b, h, s' and t ∈ [K]. Since the agent can reach each (s, a, b, h) at most K times, this directly implies that the second inequality in E_0 also holds with probability at least 1 − p.

We begin with an auxiliary lemma bounding the lower-order term.

Lemma 19. Suppose event E_0 holds. Then there exists an absolute constant c_2 such that: if a function g satisfies |g(s)| ≤ (V̄^k_{h+1} − V̲^k_{h+1})(s) for all s, then

|(P̂^k_h − P_h) g(s, a, b)| ≤ c_2 ((1/H) · min{P̂^k_h(V̄^k_{h+1} − V̲^k_{h+1})(s, a, b), P_h(V̄^k_{h+1} − V̲^k_{h+1})(s, a, b)} + H^2 S ι / max{N^k_h(s, a, b), 1}).

Proof. By the triangle inequality,

|(P̂^k_h − P_h) g(s, a, b)|
≤ Σ_{s'} |(P̂^k_h − P_h)(s'|s, a, b)| · |g(s')|
≤ Σ_{s'} |(P̂^k_h − P_h)(s'|s, a, b)| (V̄^k_{h+1} − V̲^k_{h+1})(s')
≤(i) O(Σ_{s'} [√(P̂^k_h(s'|s, a, b) ι / max{N^k_h(s, a, b), 1}) + ι / max{N^k_h(s, a, b), 1}] (V̄^k_{h+1} − V̲^k_{h+1})(s'))
≤(ii) O(Σ_{s'} [P̂^k_h(s'|s, a, b)/H + Hι/max{N^k_h(s, a, b), 1}] (V̄^k_{h+1} − V̲^k_{h+1})(s'))
≤ O(P̂^k_h(V̄^k_{h+1} − V̲^k_{h+1})(s, a, b)/H + H^2 S ι / max{N^k_h(s, a, b), 1}),

where (i) is by the second inequality in event E_0 and (ii) is by the AM-GM inequality. This proves the empirical version. Similarly, we can show

|(P̂^k_h − P_h) g(s, a, b)| ≤ O(P_h(V̄^k_{h+1} − V̲^k_{h+1})(s, a, b)/H + H^2 S ι / max{N^k_h(s, a, b), 1}).

Combining the two bounds completes the proof.

We can now prove that the upper and lower bounds are indeed upper and lower bounds on the values of the best responses.

Lemma 20. Suppose event E_0 holds.
Then for all h, s, a, b and k ∈ [K], we have

Q̄^k_h(s, a, b) ≥ Q^{†,ν^k}_h(s, a, b) ≥ Q^{µ^k,†}_h(s, a, b) ≥ Q̲^k_h(s, a, b),
V̄^k_h(s) ≥ V^{†,ν^k}_h(s) ≥ V^{µ^k,†}_h(s) ≥ V̲^k_h(s).

Proof. The proof is by backward induction; the claim holds trivially at step H + 1, where all values are zero. Suppose the bounds hold for the Q-values at step h + 1; we now establish the bounds for the V-values at step h + 1 and the Q-values at step h. For any state s,

V̄^k_{h+1}(s) = (D_{π^k_{h+1}} Q̄^k_{h+1})(s) ≥ max_µ (D_{µ×ν^k_{h+1}} Q̄^k_{h+1})(s) ≥ max_µ (D_{µ×ν^k_{h+1}} Q^{†,ν^k}_{h+1})(s) = V^{†,ν^k}_{h+1}(s).    (9)

Similarly, we can show V̲^k_{h+1}(s) ≤ V^{µ^k,†}_{h+1}(s). Therefore, we have: for all s,

V̄^k_{h+1}(s) ≥ V^{†,ν^k}_{h+1}(s) ≥ V_{h+1}(s) ≥ V^{µ^k,†}_{h+1}(s) ≥ V̲^k_{h+1}(s).

Now consider an arbitrary triple (s, a, b) at step h. We have

(Q̄^k_h − Q^{†,ν^k}_h)(s, a, b)
≥ min{(P̂^k_h V̄^k_{h+1} − P_h V^{†,ν^k}_{h+1} + β^k_h + γ^k_h)(s, a, b), 0}
≥ min{(P̂^k_h V^{†,ν^k}_{h+1} − P_h V^{†,ν^k}_{h+1} + β^k_h + γ^k_h)(s, a, b), 0}
= min{[(P̂^k_h − P_h)(V^{†,ν^k}_{h+1} − V_{h+1})](s, a, b) [term (A)] + [(P̂^k_h − P_h) V_{h+1}](s, a, b) [term (B)] + (β^k_h + γ^k_h)(s, a, b), 0}.    (10)

Invoking Lemma 19 with g = V^{†,ν^k}_{h+1} − V_{h+1},

|(A)| ≤ O(P̂^k_h(V̄^k_{h+1} − V̲^k_{h+1})(s, a, b)/H + H^2 S ι / max{N^k_h(s, a, b), 1}).

By the first inequality in event E_0,

|(B)| ≤ O(√(H^2 ι / max{N^k_h(s, a, b), 1})).

Plugging the two inequalities above back into (10) and recalling the definitions of β^k_h and γ^k_h, we obtain Q̄^k_h(s, a, b) ≥ Q^{†,ν^k}_h(s, a, b). Similarly, we can show Q̲^k_h(s, a, b) ≤ Q^{µ^k,†}_h(s, a, b).

Finally we come to the proof of Theorem 3.

Proof of Theorem 3. Suppose event E_0 holds. We first upper bound the regret. By Lemma 20, the regret can be upper bounded by

Σ_k (V^{†,ν^k}_1(s^k_1) − V^{µ^k,†}_1(s^k_1)) ≤ Σ_k (V̄^k_1(s^k_1) − V̲^k_1(s^k_1)).

For brevity's sake, we define the following notation:

∆^k_h := (V̄^k_h − V̲^k_h)(s^k_h),
ζ^k_h := ∆^k_h − (Q̄^k_h − Q̲^k_h)(s^k_h, a^k_h, b^k_h),
ξ^k_h := P_h(V̄^k_{h+1} − V̲^k_{h+1})(s^k_h, a^k_h, b^k_h) − ∆^k_{h+1}.
Let F^k_h be the σ-field generated by the random variables {(s^j_i, a^j_i, b^j_i, r^j_i)}_{(i,j)∈[H]×[k−1]} ∪ {(s^k_i, a^k_i, b^k_i, r^k_i)}_{i∈[h−1]} ∪ {s^k_h}. It is easy to check that ζ^k_h and ξ^k_h are martingale differences with respect to F^k_h. With a slight abuse of notation, we write β^k_h for β^k_h(s^k_h, a^k_h, b^k_h) and N^k_h for N^k_h(s^k_h, a^k_h, b^k_h) in the remainder of the proof. We have

∆^k_h = ζ^k_h + (Q̄^k_h − Q̲^k_h)(s^k_h, a^k_h, b^k_h)
≤ ζ^k_h + 2β^k_h + 2γ^k_h + P̂^k_h(V̄^k_{h+1} − V̲^k_{h+1})(s^k_h, a^k_h, b^k_h)
≤(i) ζ^k_h + 2β^k_h + 2γ^k_h + P_h(V̄^k_{h+1} − V̲^k_{h+1})(s^k_h, a^k_h, b^k_h) + c_2 (P_h(V̄^k_{h+1} − V̲^k_{h+1})(s^k_h, a^k_h, b^k_h)/H + H^2 S ι / max{N^k_h, 1})
≤(ii) ζ^k_h + 2β^k_h + P_h(V̄^k_{h+1} − V̲^k_{h+1})(s^k_h, a^k_h, b^k_h) + 2c_2 C (P_h(V̄^k_{h+1} − V̲^k_{h+1})(s^k_h, a^k_h, b^k_h)/H + H^2 S ι / max{N^k_h, 1})
≤ ζ^k_h + (1 + 2c_2 C/H) P_h(V̄^k_{h+1} − V̲^k_{h+1})(s^k_h, a^k_h, b^k_h) + 4c_2 C (√(ιH^2 / max{N^k_h, 1}) + H^2 S ι / max{N^k_h, 1})
= ζ^k_h + (1 + 2c_2 C/H) ξ^k_h + (1 + 2c_2 C/H) ∆^k_{h+1} + 4c_2 C (√(ιH^2 / max{N^k_h, 1}) + H^2 S ι / max{N^k_h, 1}),

where (i) and (ii) follow from Lemma 19. Define c_3 := 1 + 2c_2 C and κ := 1 + c_3/H. Recursing this argument for h ∈ [H] and summing over k,

Σ^K_{k=1} ∆^k_1 ≤ Σ^K_{k=1} Σ^H_{h=1} [κ^{h−1} ζ^k_h + κ^h ξ^k_h + O(√(ιH^2 / max{N^k_h, 1}) + H^2 S ι / max{N^k_h, 1})].

By the Azuma-Hoeffding inequality, with probability at least 1 − p,

|Σ^K_{k=1} Σ^H_{h=1} κ^{h−1} ζ^k_h| ≤ O(H√(HKι)) = O(√(H^2 T ι)),
|Σ^K_{k=1} Σ^H_{h=1} κ^h ξ^k_h| ≤ O(H√(HKι)) = O(√(H^2 T ι)).    (12)

By the pigeonhole argument,

Σ^K_{k=1} Σ^H_{h=1} 1/√(max{N^k_h, 1}) ≤ Σ_{(s,a,b,h): N^K_h(s,a,b)>0} Σ^{N^K_h(s,a,b)}_{n=1} 1/√n + HSAB ≤ O(√(HSABT) + HSAB),
Σ^K_{k=1} Σ^H_{h=1} 1/max{N^k_h, 1} ≤ Σ_{(s,a,b,h): N^K_h(s,a,b)>0} Σ^{N^K_h(s,a,b)}_{n=1} 1/n + HSAB ≤ O(HSABι).
Putting everything together, with probability at least 1 − 2p (one p from P(E_0) ≥ 1 − p and the other for equation (12)),

Σ^K_{k=1} (V^{†,ν^k}_1(s^k_1) − V^{µ^k,†}_1(s^k_1)) ≤ O(√(H^3 SABTι) + H^3 S^2 ABι^2).

For the PAC guarantee, recall that we choose π_out = π^{k̂} where k̂ = argmin_k (V̄^k_1 − V̲^k_1)(s_1). As a result,

(V^{†,ν^{k̂}}_1 − V^{µ^{k̂},†}_1)(s_1) ≤ (V̄^{k̂}_1 − V̲^{k̂}_1)(s_1) ≤ (1/K) · O(√(H^3 SABTι) + H^3 S^2 ABι^2),

which concludes the proof.
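The two pigeonhole sums used in the proof above satisfy Σ_{n≤N} 1/√n ≤ 2√N and Σ_{n≤N} 1/n ≤ 1 + log N, which is exactly where the √(HSABT) and HSABι terms come from. A quick numerical sanity check (our own illustration, not part of the paper):

```python
import numpy as np

def pigeonhole_sums(N):
    """Partial sums appearing in the pigeonhole step of the regret proof:
    sum_{n=1}^N 1/sqrt(n) and sum_{n=1}^N 1/n."""
    n = np.arange(1, N + 1)
    return (1.0 / np.sqrt(n)).sum(), (1.0 / n).sum()
```

Both bounds follow from comparing each sum with the corresponding integral, and both are tight up to constant factors.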

F.2 PROOF OF THEOREM 4

We use the same notation as in Appendix F.1, except for the form of the bonus. In addition, we define the empirical variance operator V̂^k_h V(s, a, b) := Var_{s'∼P̂^k_h(·|s,a,b)} V(s') and the true (population) variance operator V_h V(s, a, b) := Var_{s'∼P_h(·|s,a,b)} V(s') for any value function V. If N^k_h(s, a, b) = 0, we simply set V̂^k_h V(s, a, b) := H^2 regardless of the choice of V. As a result, the bonus terms can be written as

β^k_h(s, a, b) := C (√(ι V̂^k_h[(V̄^k_{h+1} + V̲^k_{h+1})/2](s, a, b) / max{N^k_h(s, a, b), 1}) + H^2 S ι / max{N^k_h(s, a, b), 1})

for some absolute constant C > 0.

Lemma 21. Let c_1 be some large absolute constant. Define event E_1 to be: for all h, s, a, b, s' and k ∈ [K],

|[(P̂^k_h − P_h) V_{h+1}](s, a, b)| ≤ c_1 (√(V̂^k_h V_{h+1}(s, a, b) ι / max{N^k_h(s, a, b), 1}) + Hι / max{N^k_h(s, a, b), 1}),
|(P̂^k_h − P_h)(s' | s, a, b)| ≤ c_1 (√(min{P_h(s' | s, a, b), P̂^k_h(s' | s, a, b)} ι / max{N^k_h(s, a, b), 1}) + ι / max{N^k_h(s, a, b), 1}),
‖(P̂^k_h − P_h)(· | s, a, b)‖_1 ≤ c_1 √(S ι / max{N^k_h(s, a, b), 1}).

Then P(E_1) ≥ 1 − p.

The proof of Lemma 21 is highly similar to that of Lemma 18: the first two inequalities follow from basically the same argument as in Lemma 18, and the third one is standard (see, e.g., equation (12) in Azar et al. (2017)). We omit the proof here. Since the proof of Lemma 19 does not depend on the form of the bonus, it also applies in this section. As in Appendix F.1, we first prove that the upper and lower bounds are indeed upper and lower bounds on the values of the best responses.

Lemma 22. Suppose event E_1 holds. Then for all h, s, a, b and k ∈ [K], we have

Q̄^k_h(s, a, b) ≥ Q^{†,ν^k}_h(s, a, b) ≥ Q^{µ^k,†}_h(s, a, b) ≥ Q̲^k_h(s, a, b),
V̄^k_h(s) ≥ V^{†,ν^k}_h(s) ≥ V^{µ^k,†}_h(s) ≥ V̲^k_h(s).

Proof. The proof is by backward induction and is very similar to that of Lemma 20.
Suppose the bounds hold for the Q-values in the (h + 1) th step, we now establish the bounds for the V -values in the (h + 1) th step and Q-values in the h th -step. The proof for the V -values is the same as (9). For the Q-values, the decomposition (10) still holds and (A) is bounded using Lemma 19 as before. The only difference is that we need to bound (B) more carefully. First, by the first inequality in event E 1 , |(B)| ≤ O   V k h V h+1 (s, a, b)ι max{N k h (s, a, b), 1} + Hι max{N k h (s, a, b), 1}   . By the relation of V -values in the (h + 1) th step, |[ V k h (V k h+1 + V k h+1 )/2] -V k h V h+1 |(s, a, b) ≤|[ P k h (V k h+1 + V k h+1 )/2] 2 -( P k h V h+1 ) 2 |(s, a, b) + | P k h [(V k h+1 + V k h+1 )/2] 2 -P k h (V h+1 ) 2 |(s, a, b) ≤4H P k h |(V k h+1 + V k h+1 )/2 -V h+1 |(s, a, b) ≤4H P k h (V k h+1 -V k h+1 )(s, a, b), which implies ι V k h V h+1 (s, a, b) max{N k h (s, a, b), 1} ≤ ι[ V k h [(V k h+1 + V k h+1 )/2] + 4H P k h (V k h+1 -V k h+1 )](s, a, b) max{N k h (s, a, b), 1} ≤ ι V k h [(V k h+1 + V k h+1 )/2](s, a, b) max{N k h (s, a, b), 1} + 4ιH P k h (V k h+1 -V k h+1 )](s, a, b) max{N k h (s, a, b), 1} ≤ ι V k h [(V k h+1 + V k h+1 )/2](s, a, b) max{N k h (s, a, b), 1} + P k h (V k h+1 -V k h+1 ) H + 4H 2 ι max{N k h (s, a, b), 1} , where (i) is by AM-GM inequality. Plugging the above inequalities back into (10) and recalling the definition of β k h and γ k h completes the proof. We need one more lemma to control the error of the empirical variance estimator: Lemma 23. Suppose event E 1 holds. Then for all h, s, a, b and k ∈ [K], we have | V k h [(V k h+1 + V k h+1 )/2] -V h V π k h+1 |(s, a, b) ≤4HP h (V k h+1 -V k h+1 )(s, a, b) + O 1 + H 4 Sι max{N k h (s, a, b), 1} Proof. By Lemma 22, we have V k h (s) ≥ V π k h (s) ≥ V k h (s). 
As a result, | V k h [(V k h+1 + V k h+1 )/2] -V h V π k h+1 |(s, a, b) =|[ P k h (V k h+1 + V k h+1 ) 2 /4 -P h (V π k h+1 ) 2 ](s, a, b) -[( P k h (V k h+1 + V k h+1 )) 2 /4 -(P h V π k h+1 ) 2 ](s, a, b)| ≤[ P k h (V k h+1 ) 2 -P h (V k h+1 ) 2 -( P k h V k h+1 ) 2 + (P h V k h+1 ) 2 ](s, a, b) ≤[|( P k h -P h )(V k h+1 ) 2 | + |P h [(V k h+1 ) 2 -(V k h+1 ) 2 ]| + |( P k h V k h+1 ) 2 -(P h V k h+1 ) 2 | + |(P h V k h+1 ) 2 -(P h V k h+1 ) 2 |](s, a, b) These terms can be bounded separately by using event E 1 : |( P k h -P h )(V k h+1 ) 2 |(s, a, b) ≤ H 2 ( P k h -P h )(• | s, a, b) 1 ≤ O(H 2 Sι max{N k h (s, a, b), 1} ), |P h [(V k h+1 ) 2 -(V k h+1 ) 2 ]|(s, a, b) ≤ 2H[P h (V k h+1 -V k h+1 )](s, a, b), |( P k h V k h+1 ) 2 -(P h V k h+1 ) 2 |(s, a, b) ≤ 2H[( P k h -P h )V k h+1 ](s, a, b) ≤ O(H 2 Sι max{N k h (s, a, b), 1} ), |(P h V k h+1 ) 2 -(P h V k h+1 ) 2 |(s, a, b) ≤ 2H[P h (V k h+1 -V k h+1 )](s, a, b). Combining with H 2 Sι max{N k h (s,a,b),1} ≤ 1 + H 4 Sι max{N k h (s,a,b),1} completes the proof. Finally we come to the proof of Theorem 4. Proof of Theorem 4. Suppose event E 1 holds. We define ∆ k h , ζ k h abd ξ k h as in the proof of Theorem 3. As before we have ∆ k h ≤ζ k h + 1 + c 3 H P h (V k h+1 -V k h+1 ) s k h , a k h , b k h + 4c 2 C    ι V k h [(V k h+1 + V k h+1 )/2](s k h , a k h , b k h ) max{N k h (s k h , a k h , b k h ), 1} + H 2 Sι max{N k h (s k h , a k h , b k h ), 1}    . By Lemma 23, ι V k h [(V k h+1 + V k h+1 )/2](s, a, b) max{N k h (s, a, b), 1} ≤O    ιV h V π k h+1 (s, a, b) + ι max{N k h (s, a, b), 1} + HιP h (V k h+1 -V k h+1 )(s, a, b) max{N k h (s, a, b), 1} + H 2 √ Sι max{N k h (s, a, b), 1}    ≤c 4   ιV h V π k h+1 (s, a, b) + ι max{N k h (s, a, b), 1} + P h (V k h+1 -V k h+1 )(s, a, b) H + H 2 √ Sι max{N k h (s, a, b), 1}   , where c 4 is some absolute constant. Define c 5 := 4c 2 c 4 C + c 3 and κ := 1 + c 5 /H. 
Plugging (18) back into (17), we have ∆ k h ≤κ∆ k h+1 + κξ k h + ζ k h + O ιV h V π k h+1 (s k h , a k h , b k h ) N k h (s k h , a k h , b k h ) + ι N k h (s k h , a k h , b k h ) + H 2 Sι N k h (s k h , a k h , b k h ) . Recursing this argument for h ∈ [H] and summing over k, K k=1 ∆ k 1 ≤ K k=1 H h=1 κ h-1 ζ k h + κ h ξ k h + O   ιV h V π k h+1 (s k h , a k h , b k h ) max{N k h , 1} + ι max{N k h , 1} + H 2 Sι max{N k h , 1}   . The remaining steps are the same as that in the proof of Theorem 3 except that we need to bound the sum of variance term. By Cauchy-Schwarz, K k=1 H h=1 V h V π k h+1 (s k h , a k h , b k h ) max{N k h (s k h , a k h , b k h ), 1} ≤ K k=1 H h=1 V h V π k h+1 (s k h , a k h , b k h ) • K k=1 H h=1 1 max{N k h (s k h , a k h , b k h ), 1} . By the Law of total variation and standard martingale concentration (see Lemma C.5 in Jin et al. (2018) for a formal proof), with probability at least 1 -p, we have K k=1 H h=1 V h V π k h+1 (s k h , a k h , b k h )≤O HT + H 3 ι . Putting all relations together, we obtain that with probability at least 1 -2p (one p comes from P(E 1 ) ≥ 1 -p and the other comes from the inequality for bounding the variance term), Regret(K) = K k=1 (V †,ν k 1 -V µ k , † 1 )(s 1 ) ≤ O( √ H 2 SABT ι + H 3 S 2 ABι 2 ). Rescaling p completes the proof.

G PROOF FOR SECTION 4 -REWARD-FREE LEARNING

G.1 PROOF OF THEOREM 5

In this section, we prove Theorem 5 for the single reward function case, i.e., $N = 1$. The proof for multiple reward functions ($N > 1$) simply follows from taking a union bound, that is, replacing the failure probability $p$ by $Np$.

Let $(\mu^k, \nu^k)$ be an arbitrary Nash equilibrium policy of $\widehat{M}^k := (\widehat{P}^k, \widehat{r}^k)$, where $\widehat{P}^k$ and $\widehat{r}^k$ are our empirical estimates of the transition and the reward at the beginning of the $k$'th episode in Algorithm 2, respectively. We use $N^k_h(s,a,b)$ to denote the number of times we have visited the state-action tuple $(s,a,b)$ at the $h$'th step before the $k$'th episode. The bonus used in the $k$'th episode can be written as
$$\beta^k_h(s,a,b) := C\Bigg(\sqrt{\frac{H^2\iota}{\max\{N^k_h(s,a,b),1\}}} + \frac{H^2S\iota}{\max\{N^k_h(s,a,b),1\}}\Bigg),$$
where $\iota = \log(SABT/p)$ and $C$ is some large absolute constant. We use $\widehat{Q}^k$ and $\widehat{V}^k$ to denote the empirical optimal value functions of $\widehat{M}^k$, defined by
$$\widehat{Q}^k_h(s,a,b) = (\widehat{P}^k_h \widehat{V}^k_{h+1})(s,a,b) + \widehat{r}^k_h(s,a,b), \qquad \widehat{V}^k_h(s) = \max_\mu\min_\nu \mathbb{D}_{\mu\times\nu}\widehat{Q}^k_h(s).$$
Since $(\mu^k,\nu^k)$ is a Nash equilibrium policy of $\widehat{M}^k$, we also have $\widehat{V}^k_h(s) = \mathbb{D}_{\mu^k\times\nu^k}\widehat{Q}^k_h(s)$.

We begin by stating a useful property of matrix games that will be used frequently in our analysis. Since its proof is simple, we omit it here.

Lemma 24. Let $X, Y, Z \in \mathbb{R}^{A\times B}$ and let $\Delta_d$ be the $d$-dimensional simplex. Suppose $|X - Y| \le Z$, where the inequality is entrywise. Then
$$\Big|\max_{\mu\in\Delta_A}\min_{\nu\in\Delta_B} \mu^\top X\nu - \max_{\mu\in\Delta_A}\min_{\nu\in\Delta_B} \mu^\top Y\nu\Big| \le \max_{i,j} Z_{ij}.$$

Lemma 25. Let $c_1$ be some large absolute constant such that $c_1^2 + c_1 \le C$. Define event $E_1$ to be: for all $h, s, a, b, s'$ and $k\in[K]$,
$$\big|[(\widehat{P}^k_h - P_h)V_{h+1}](s,a,b)\big| \le c_1\sqrt{\frac{H^2\iota}{\max\{N^k_h(s,a,b),1\}}},$$
$$\big|(\widehat{r}^k_h - r_h)(s,a,b)\big| \le c_1\sqrt{\frac{H^2\iota}{\max\{N^k_h(s,a,b),1\}}},$$
$$\big|(\widehat{P}^k_h - P_h)(s'\mid s,a,b)\big| \le c_1\Bigg(\sqrt{\frac{\widehat{P}^k_h(s'\mid s,a,b)\,\iota}{\max\{N^k_h(s,a,b),1\}}} + \frac{\iota}{\max\{N^k_h(s,a,b),1\}}\Bigg).$$
We have $\mathbb{P}(E_1)\ge 1-p$.

Proof.
The proof is standard: apply standard concentration inequalities and then take a union bound. For completeness, we provide the proof of the third inequality here. Consider a fixed $(s,a,b,h)$ tuple and the following equivalent random process: (a) before the agent starts, the environment samples $\{s^{(1)}, s^{(2)},\ldots,s^{(K)}\}$ independently from $P_h(\cdot\mid s,a,b)$; (b) during the interaction between the agent and the environment, the $i$'th time the agent reaches $(s,a,b,h)$, the environment transits the agent to $s^{(i)}$. The randomness induced by this interaction procedure is exactly the same as in the original problem, which means the probability of any event in this context is the same as in the original problem. Therefore, it suffices to prove the target concentration inequality in this 'easy' context.

Denote by $\widehat{P}^{(t)}_h(\cdot\mid s,a,b)$ the empirical estimate of $P_h(\cdot\mid s,a,b)$ calculated from $\{s^{(1)},\ldots,s^{(t)}\}$. For fixed $t$ and $s'$, by the empirical Bernstein inequality, with probability at least $1 - p/(S^2ABT)$,
$$\big|(P_h - \widehat{P}^{(t)}_h)(s'\mid s,a,b)\big| \le \mathcal{O}\Bigg(\sqrt{\frac{\widehat{P}^{(t)}_h(s'\mid s,a,b)\,\iota}{t}} + \frac{\iota}{t}\Bigg).$$
Taking a union bound over all $s, a, b, h, s'$ and $t\in[K]$, we obtain that with probability at least $1-p$, the displayed inequality holds simultaneously for all $s, a, b, h, s'$ and $t\in[K]$. Since the agent can reach each $(s,a,b,h)$ at most $K$ times, this directly implies that the third inequality also holds with probability at least $1-p$.

The following lemma states that the empirical optimal value functions are close to the true optimal ones, and that their difference is controlled by the exploration value functions computed in Algorithm 2.

Lemma 26. Suppose event $E_1$ (defined in Lemma 25) holds. Then for all $h, s, a, b$ and $k\in[K]$, we have
$$\big|\widehat{Q}^k_h(s,a,b) - Q_h(s,a,b)\big| \le \widetilde{Q}^k_h(s,a,b), \qquad \big|\widehat{V}^k_h(s) - V_h(s)\big| \le \widetilde{V}^k_h(s).$$

Proof. We prove this by backward induction on $h$.
The case $h = H+1$ holds trivially. Assume the conclusion holds for the $(h+1)$'th step. For the $h$'th step,
$$\widehat{Q}^k_h(s,a,b) - Q_h(s,a,b) \le \min\Big\{\big|[(\widehat{P}^k_h - P_h)V_{h+1}](s,a,b)\big| + \big|(\widehat{r}^k_h - r_h)(s,a,b)\big| + \big[\widehat{P}^k_h(\widehat{V}^k_{h+1} - V_{h+1})\big](s,a,b),\ H\Big\}$$
$$\overset{(i)}{\le} \min\Big\{\beta^k_h(s,a,b) + (\widehat{P}^k_h\widetilde{V}^k_{h+1})(s,a,b),\ H\Big\} \overset{(ii)}{=} \widetilde{Q}^k_h(s,a,b),$$
where (i) follows from the induction hypothesis and event $E_1$, and (ii) follows from the definition of $\widetilde{Q}^k_h$. By Lemma 24, we immediately obtain $|\widehat{V}^k_h(s) - V_h(s)| \le \widetilde{V}^k_h(s)$.

Now, we are ready to establish the key lemma in our analysis using Lemma 26.

Lemma 27. Suppose event $E_1$ (defined in Lemma 25) holds. Then for all $h, s, a, b$ and $k\in[K]$, we have
$$\big|\widehat{Q}^k_h(s,a,b) - Q^{\dagger,\nu^k}_h(s,a,b)\big| \le \alpha_h\widetilde{Q}^k_h(s,a,b), \qquad \big|\widehat{V}^k_h(s) - V^{\dagger,\nu^k}_h(s)\big| \le \alpha_h\widetilde{V}^k_h(s),$$
and
$$\big|\widehat{Q}^k_h(s,a,b) - Q^{\mu^k,\dagger}_h(s,a,b)\big| \le \alpha_h\widetilde{Q}^k_h(s,a,b), \qquad \big|\widehat{V}^k_h(s) - V^{\mu^k,\dagger}_h(s)\big| \le \alpha_h\widetilde{V}^k_h(s),$$
where $\alpha_{H+1} = 0$ and $\alpha_h = (1+\frac{1}{H})\alpha_{h+1} + \frac{1}{H} \le 4$.

Proof. We only prove the first set of inequalities; the second follows in exactly the same way. Again, the proof is by backward induction on $h$. The conclusion trivially holds for the $(H+1)$'th step with $\alpha_{H+1}=0$. Now assume the conclusion holds for the $(h+1)$'th step. For the $h$'th step,
$$\big|\widehat{Q}^k_h(s,a,b) - Q^{\dagger,\nu^k}_h(s,a,b)\big| \le \min\Big\{\big|[(\widehat{P}^k_h-P_h)(V^{\dagger,\nu^k}_{h+1} - V_{h+1})](s,a,b)\big| + \big|(\widehat{P}^k_h-P_h)V_{h+1}(s,a,b)\big| + \big|(\widehat{r}^k_h - r_h)(s,a,b)\big| + \big|[P_h(\widehat{V}^k_{h+1} - V^{\dagger,\nu^k}_{h+1})](s,a,b)\big|,\ H\Big\}$$
$$\le \min\Big\{\underbrace{\big|[(\widehat{P}^k_h-P_h)(V^{\dagger,\nu^k}_{h+1} - V_{h+1})](s,a,b)\big|}_{(T_1)} + c_1\sqrt{\frac{H^2\iota}{\max\{N^k_h(s,a,b),1\}}} + \underbrace{\big|[P_h(\widehat{V}^k_{h+1} - V^{\dagger,\nu^k}_{h+1})](s,a,b)\big|}_{(T_2)},\ H\Big\},$$
where the second inequality follows from the definition of event $E_1$.
We can control the term $(T_1)$ by combining Lemma 26 with the induction hypothesis to bound $|V^{\dagger,\nu^k}_{h+1} - V_{h+1}|$, and then applying the third inequality in event $E_1$:
$$(T_1) \le \sum_{s'}\big|\widehat{P}^k_h(s'\mid s,a,b) - P_h(s'\mid s,a,b)\big|\,\big|V^{\dagger,\nu^k}_{h+1} - V_{h+1}\big|(s')$$
$$\le \sum_{s'}\big|\widehat{P}^k_h(s'\mid s,a,b) - P_h(s'\mid s,a,b)\big|\Big(\big|V^{\dagger,\nu^k}_{h+1} - \widehat{V}^k_{h+1}\big|(s') + \big|\widehat{V}^k_{h+1} - V_{h+1}\big|(s')\Big)$$
$$\le \sum_{s'}\big|\widehat{P}^k_h(s'\mid s,a,b) - P_h(s'\mid s,a,b)\big|\,(\alpha_{h+1}+1)\,\widetilde{V}^k_{h+1}(s')$$
$$\le \frac{\alpha_{h+1}+1}{H}\big(\widehat{P}^k_h\widetilde{V}^k_{h+1}\big)(s,a,b) + \frac{c_1^2(\alpha_{h+1}+1)H^2S\iota}{\max\{N^k_h(s,a,b),1\}}.$$
The term $(T_2)$ is bounded by directly applying the induction hypothesis:
$$(T_2) = \big|[P_h(\widehat{V}^k_{h+1} - V^{\dagger,\nu^k}_{h+1})](s,a,b)\big| \le \alpha_{h+1}\big[P_h\widetilde{V}^k_{h+1}\big](s,a,b).$$
Plugging (29) and (30) into (28), we obtain
$$\widehat{Q}^k_h(s,a,b) - Q^{\dagger,\nu^k}_h(s,a,b) \le \min\Big\{\Big[\big(1+\tfrac{1}{H}\big)\alpha_{h+1} + \tfrac{1}{H}\Big]\big[\widehat{P}^k_h\widetilde{V}^k_{h+1}\big](s,a,b) + c_1\sqrt{\frac{H^2\iota}{\max\{N^k_h(s,a,b),1\}}} + \frac{c_1^2(\alpha_{h+1}+1)H^2S\iota}{\max\{N^k_h(s,a,b),1\}},\ H\Big\}$$
$$\overset{(i)}{\le} \min\Big\{\Big[\big(1+\tfrac{1}{H}\big)\alpha_{h+1} + \tfrac{1}{H}\Big]\big[\widehat{P}^k_h\widetilde{V}^k_{h+1}\big](s,a,b) + \beta^k_h(s,a,b),\ H\Big\} \overset{(ii)}{\le} \Big[\big(1+\tfrac{1}{H}\big)\alpha_{h+1} + \tfrac{1}{H}\Big]\widetilde{Q}^k_h(s,a,b),$$
where (i) follows from the definition of $\beta^k_h$, and (ii) follows from the definition of $\widetilde{Q}^k_h$. Therefore, by (31), choosing $\alpha_h = (1+\frac{1}{H})\alpha_{h+1} + \frac{1}{H}$ suffices for the induction.

Now, let us prove the inequality for the $V$ functions:
$$\big|(\widehat{V}^k_h - V^{\dagger,\nu^k}_h)(s)\big| \overset{(i)}{=} \Big|\max_{\mu\in\Delta_A}\big(\mathbb{D}_{\mu,\nu^k}\widehat{Q}^k_h\big)(s) - \max_{\mu\in\Delta_A}\big(\mathbb{D}_{\mu,\nu^k}Q^{\dagger,\nu^k}_h\big)(s)\Big| \overset{(ii)}{\le} \max_{a,b}\ \alpha_h\widetilde{Q}^k_h(s,a,b) = \alpha_h\widetilde{V}^k_h(s),$$
where (i) follows from the definitions of $\widehat{V}^k_h$ and $V^{\dagger,\nu^k}_h$, and (ii) uses (31) and Lemma 24.

Theorem 28 (Guarantee for UCB-VI from Azar et al. (2017)). For any $p\in(0,1]$, choose the exploration bonus $\beta_t$ in Algorithm 2 as in (20). Then, with probability at least $1-p$,
$$\sum_{k=1}^K \widetilde{V}^k_1(s_1) \le \mathcal{O}\big(\sqrt{H^4SAK\iota} + H^3S^2A\iota^2\big).$$

Proof of Theorem 5. Recall that $\mathrm{out} = \arg\min_{k\in[K]} \widetilde{V}^k_1(s_1)$.
By Lemma 27 and Theorem 28, with probability at least $1-2p$,
$$V^{\dagger,\nu^{\mathrm{out}}}_1(s_1) - V^{\mu^{\mathrm{out}},\dagger}_1(s_1) \le \big|V^{\dagger,\nu^{\mathrm{out}}}_1(s_1) - \widehat{V}^{\mathrm{out}}_1(s_1)\big| + \big|\widehat{V}^{\mathrm{out}}_1(s_1) - V^{\mu^{\mathrm{out}},\dagger}_1(s_1)\big| \le 8\,\widetilde{V}^{\mathrm{out}}_1(s_1) \le \mathcal{O}\Bigg(\sqrt{\frac{H^4SA\iota}{K}} + \frac{H^3S^2A\iota^2}{K}\Bigg).$$
Rescaling $p$ completes the proof.
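As a quick sanity check on how the exploration bonus $\beta^k_h$ above behaves, the following Python sketch computes it directly from its definition. The function name and the default constant $C$ are illustrative choices, not from the paper:

```python
import math

def bonus(N, H, S, A, B, T, p, C=2.0):
    """Exploration bonus from Section G.1:
    beta = C * ( sqrt(H^2 * iota / max(N, 1)) + H^2 * S * iota / max(N, 1) ),
    where iota = log(S * A * B * T / p) and N = N_h^k(s, a, b) is a visit count.
    The first term dominates for large N; the second is a lower-order correction."""
    iota = math.log(S * A * B * T / p)
    n = max(N, 1)
    return C * (math.sqrt(H ** 2 * iota / n) + H ** 2 * S * iota / n)
```

Note that the bonus is flat between $N = 0$ and $N = 1$ because of the $\max\{N,1\}$ clipping, and afterwards decays like $1/\sqrt{N}$, which is what drives the $\sqrt{K}$-type regret sums.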

G.2 VANILLA NASH VALUE ITERATION

Here, we provide an auxiliary algorithm, Vanilla Nash VI, for computing a Nash equilibrium policy of a known model. Its only difference from the value iteration algorithm for MDPs is that the maximum operator is replaced by the minimax operator in Line 7. We remark that a Nash equilibrium of a two-player zero-sum matrix game can be computed in polynomial time, e.g., by linear programming.

6:   for s ∈ S do
7:     (μ̂_h(·|s), ν̂_h(·|s)) ← NASH-ZERO-SUM(Q_h(s,·,·)).
8:     V_h(s) ← μ̂_h(·|s)^⊤ Q_h(s,·,·) ν̂_h(·|s).
9: Output (μ̂, ν̂) ← {(μ̂_h(·|s), ν̂_h(·|s))}_{(h,s)∈[H]×S}.

By recalling the definition of best responses in Appendix D, one can directly see that the output policy (μ̂, ν̂) is a Nash equilibrium for $\widehat{M}$.
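The NASH-ZERO-SUM subroutine in Line 7 can be implemented exactly by linear programming; the sketch below instead uses fictitious play, which converges to an approximate Nash equilibrium in zero-sum matrix games and keeps the example dependency-free. All names here are illustrative, and an LP solver should be preferred when an exact equilibrium is needed:

```python
def nash_zero_sum(Q, iters=20000):
    """Approximate a Nash equilibrium (mu, nu) of the zero-sum matrix game Q
    (row player maximizes, column player minimizes) via fictitious play:
    each player best-responds to the opponent's empirical action frequencies."""
    A, B = len(Q), len(Q[0])
    row_counts, col_counts = [0] * A, [0] * B
    a, b = 0, 0
    for _ in range(iters):
        row_counts[a] += 1
        col_counts[b] += 1
        # best response to the column player's empirical mixture
        a = max(range(A), key=lambda i: sum(Q[i][j] * col_counts[j] for j in range(B)))
        # best response to the row player's empirical mixture
        b = min(range(B), key=lambda j: sum(Q[i][j] * row_counts[i] for i in range(A)))
    mu = [c / iters for c in row_counts]
    nu = [c / iters for c in col_counts]
    return mu, nu
```

On matching pennies, `[[1, -1], [-1, 1]]`, both output mixtures approach the uniform distribution (1/2, 1/2), the unique equilibrium.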

G.3 PROOF OF THEOREM 6

In this section, we first prove an $\Omega(AB/\epsilon^2)$ lower bound for reward-free matrix games, i.e., $S = H = 1$, and then generalize it to $\Omega(SABH^2/\epsilon^2)$ for the Markov game setting.

G.3.1 REWARD-FREE MATRIX GAMES

In the matrix game, let the max-player pick rows and the min-player pick columns. We consider the following family of Bernoulli matrix games:
$$\mathcal{M}(\epsilon) = \Big\{M \in \mathbb{R}^{A\times B} \text{ with } M_{ab} = \tfrac{1}{2} + \epsilon\big(1 - 2\cdot\mathbb{1}\{a \neq a^\star,\ b = b^\star\}\big) :\ (a^\star,b^\star)\in[A]\times[B]\Big\},$$
where in matrix game $M$, the reward is sampled from $\mathrm{Bernoulli}(M_{ab})$ if the max-player picks the $a$'th row and the min-player picks the $b$'th column.

Proof of Lemma 32. Denote by $\pi$ the output policy of the max-player. Denote by $Z$ the collection of pairs $(h,i)\in[H]\times[S]$ such that $\pi_{h+1}(a^\star_{h,i}\mid s_i) \le 2/3$. Observe that each time the max-player picks a suboptimal action, it incurs $2\epsilon/H$ suboptimality in expectation. As a result, if $\pi$ is at most $\epsilon/10^3$-suboptimal, we must have
$$\frac{1}{S}\sum_{(h,i)\in Z}\big(1 - \pi_{h+1}(a^\star_{h,i}\mid s_i)\big)\cdot\frac{2\epsilon}{H} \le \frac{\epsilon}{10^3},$$
which implies $|Z| \le SH/500$; that is, $\pi(a^\star_{h,i}\mid s_i) \le 2/3$ for at most $SH/500$ different $(h,i)$'s. Therefore, we can simply pick $\mathrm{argmax}_a\,\pi_{h+1}(a\mid s_i)$ as the guess for $a^\star_{h,i}$. Since policy $\pi$ is at most $\epsilon/10^3$-suboptimal with probability at least $p$, we can correctly identify the optimal actions for at least $SH - SH/500$ different $(s,h)$ pairs, also with probability no smaller than $p$.

Lemma 32 directly implies that in order to prove the desired lower bound for reward-free Markov games,

Claim 33. For any algorithm $\mathcal{A}$ interacting with the environment for at most $K = SABH^2/(10^4\epsilon^2)$ episodes, there exists $\mathcal{J}\in\mathfrak{J}(\epsilon)$ such that when running $\mathcal{A}$ on $\mathcal{J}$, it outputs a policy that is at least $\epsilon/10^3$-suboptimal for the max-player with probability at least $1/4$,

it suffices to prove the following claim:

Claim 34. For any algorithm $\widehat{\mathcal{A}}$ interacting with the environment for at most $K = SABH^2/(10^4\epsilon^2)$ episodes, there exists $\mathcal{J}\in\mathfrak{J}(\epsilon)$ such that when running $\widehat{\mathcal{A}}$ on $\mathcal{J}$, it fails to identify the optimal actions for at least $SH/500 + 1$ different $(s,h)$ pairs with probability at least $1/4$.

Proof of Claim 34. Denote by $\mathbb{P}$ ($\mathbb{E}$) the probability (expectation) induced by picking $\mathcal{J}(a^\star,b^\star)$ uniformly at random from $\mathfrak{J}(\epsilon)$ and running $\widehat{\mathcal{A}}$ on $\mathcal{J}$.
Denote by $n_{\mathrm{wrong}}$ the number of $(s,h)$ pairs for which $\widehat{\mathcal{A}}$ fails to identify the optimal actions. Denote by $\mathrm{error}_{h,i}$ the indicator of the event that $\widehat{\mathcal{A}}$ fails to identify the optimal action for $(h+1, i)$. We prove by contradiction. Suppose that for every $\mathcal{J}\in\mathfrak{J}(\epsilon)$, $\widehat{\mathcal{A}}$ can identify the optimal actions for at least $SH - SH/500$ different $(s,h)$ pairs with probability larger than $3/4$. Then we have
$$\mathbb{E}[n_{\mathrm{wrong}}] \le \frac{1}{4}\cdot SH + \frac{3}{4}\cdot\frac{SH}{500} \le \frac{101\,SH}{400}.$$
Since $\sum_{(h,i)\in[H]\times[S]}\mathbb{E}[\mathrm{error}_{h,i}] = \mathbb{E}[n_{\mathrm{wrong}}]$, there must exist $(h^\star,i^\star)\in[H]\times[S]$ such that $\mathbb{E}[\mathrm{error}_{h^\star,i^\star}] \le 101/400$. However, in the following we show that for every $(h,i)\in[H]\times[S]$, $\widehat{\mathcal{A}}$ fails to identify the optimal action for the step-state pair $(h+1,i)$ with probability at least $1/3$, which directly implies $\mathbb{E}[\mathrm{error}_{h,i}] \ge 1/3$ for all $(h,i)\in[H]\times[S]$. As a result, we obtain a contradiction, and Claim 34 holds.
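To make the family $\mathcal{M}(\epsilon)$ concrete, the sketch below builds the reward-mean matrix under the convention used in this section: the special column $b^\star$ equals $1/2-\epsilon$ everywhere except at the special row $a^\star$. The function names are illustrative, and the indicator convention is our reading of the definition above:

```python
def make_matrix_game(A, B, a_star, b_star, eps):
    """Reward means of a hard instance from M(eps): entry (a, b) equals
    1/2 + eps unless a != a_star and b == b_star, where it equals 1/2 - eps.
    Row a_star is then the unique maximin-optimal row for the max-player."""
    return [[0.5 + eps * (1 - 2 * (a != a_star and b == b_star)) for b in range(B)]
            for a in range(A)]

def best_row(M):
    """Maximin row: the row whose worst-case entry is largest."""
    return max(range(len(M)), key=lambda a: min(M[a]))
```

Against column $b^\star$, every row other than $a^\star$ yields only $1/2-\epsilon$, while row $a^\star$ guarantees $1/2+\epsilon$ against every column, so identifying $a^\star$ requires locating the single deviating entry among all $AB$ cells.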

Now, let us prove that for every $(h,i)\in[H]\times[S]$, $\mathbb{E}[\mathrm{error}_{h,i}] \ge 1/3$. WLOG, we assume $\widehat{\mathcal{A}}$ is deterministic and runs for exactly $K = SABH^2/(10^4\epsilon^2)$ episodes. In the following, we consider a fixed $(h^\star, i^\star)$ pair. For technical reasons, we define the MDP $\mathcal{J}_{-(h^\star,i^\star)}(a^\star,b^\star)$ as follows:

• States, actions and transitions: same as $\mathcal{J}(a^\star,b^\star)$.

• Rewards: there is no reward in the first step. For the remaining steps $h\in\{2,\ldots,H+1\}$, if the agent takes action $(a,b)$ at state $s_i$ in the $h$'th step with $(h-1,i)\neq(h^\star,i^\star)$, it receives a binary reward sampled from
$$\mathrm{Bernoulli}\Big(\frac{1}{2} + \frac{\epsilon}{H}\big(1 - 2\cdot\mathbb{1}\{a \neq a^\star_{h-1,i},\ b = b^\star_{h-1,i}\}\big)\Big);$$
otherwise it receives a binary reward sampled from
$$\mathrm{Bernoulli}\Big(\frac{1}{2} + \frac{\epsilon}{H}\big(1 - 2\cdot\mathbb{1}\{b = b^\star_{h-1,i}\}\big)\Big).$$

H PROOF FOR APPENDIX C -MULTI-PLAYER GENERAL-SUM MARKOV GAMES

Lemma 35. With probability $1-p$, for any $(s,a,h,k,i)$:
$$\overline{Q}^k_{h,i}(s,a) \ge Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a), \quad \underline{Q}^k_{h,i}(s,a) \le Q^{\pi^k}_{h,i}(s,a), \quad \overline{V}^k_{h,i}(s) \ge V^{\dagger,\pi^k_{-i}}_{h,i}(s), \quad \underline{V}^k_{h,i}(s) \le V^{\pi^k}_{h,i}(s).$$

Proof. For each fixed $k$, we prove this by induction from $h = H+1$ to $h = 1$. For the base case, at the $(H+1)$'th step we have $\overline{V}^k_{H+1,i}(s) = V^{\dagger,\pi^k_{-i}}_{H+1,i}(s) = 0$. Now, assume inequality (40) holds for the $(h+1)$'th step. For the $h$'th step, by the definition of the $Q$ functions,
$$\overline{Q}^k_{h,i}(s,a) - Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a) = \widehat{P}^k_h\overline{V}^k_{h+1,i}(s,a) - P_hV^{\dagger,\pi^k_{-i}}_{h+1,i}(s,a) + \beta_t = \underbrace{\widehat{P}^k_h\big(\overline{V}^k_{h+1,i} - V^{\dagger,\pi^k_{-i}}_{h+1,i}\big)(s,a)}_{(A)} + \underbrace{\big(\widehat{P}^k_h - P_h\big)V^{\dagger,\pi^k_{-i}}_{h+1,i}(s,a)}_{(B)} + \beta_t.$$
By the induction hypothesis, for any $s'$, $\big(\overline{V}^k_{h+1,i} - V^{\dagger,\pi^k_{-i}}_{h+1,i}\big)(s') \ge 0$, and thus $(A)\ge 0$; combining with the bound on $(B)$ below gives $\overline{Q}^k_{h,i}(s,a) - Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a) \ge 0$. The second inequality can be proved similarly.

Now assume inequality (39) holds for the $h$'th step. By the definition of the value functions and of a Nash equilibrium,
$$\overline{V}^k_{h,i}(s) = \mathbb{D}_{\pi^k}\overline{Q}^k_{h,i}(s) = \max_\mu \mathbb{D}_{\mu\times\pi^k_{-i}}\overline{Q}^k_{h,i}(s).$$
By the Bellman equation,
$$V^{\dagger,\pi^k_{-i}}_{h,i}(s) = \max_\mu \mathbb{D}_{\mu\times\pi^k_{-i}}Q^{\dagger,\pi^k_{-i}}_{h,i}(s).$$
Since, by the induction hypothesis, $\overline{Q}^k_{h,i}(s,a) \ge Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a)$ for every $(s,a)$, we also have $\overline{V}^k_{h,i}(s) \ge V^{\dagger,\pi^k_{-i}}_{h,i}(s)$, which is exactly inequality (40) for the $h$'th step. The second inequality can be proved similarly.

Proof of Theorem 15. Let us focus on the $i$'th player and drop the subscript when there is no confusion. To bound
$$\max_i\big(V^{\dagger,\pi^k_{-i}}_{1,i} - V^{\pi^k}_{1,i}\big)(s_1) \le \max_i\big(\overline{V}^k_{1,i} - \underline{V}^k_{1,i}\big)(s_1),$$
we note the following propagation:
$$\big(\overline{Q}^k_{h,i} - \underline{Q}^k_{h,i}\big)(s,a) \le \widehat{P}^k_h\big(\overline{V}^k_{h+1,i} - \underline{V}^k_{h+1,i}\big)(s,a) + 2\beta^k_h(s,a), \qquad \big(\overline{V}^k_{h,i} - \underline{V}^k_{h,i}\big)(s) = \big[\mathbb{D}_{\pi^k_h}\big(\overline{Q}^k_{h,i} - \underline{Q}^k_{h,i}\big)\big](s).$$
We define $\widetilde{Q}^k_h$ and $\widetilde{V}^k_h$ recursively by $\widetilde{V}^k_{H+1} = 0$ and
$$\widetilde{Q}^k_h(s,a) = \widehat{P}^k_h\widetilde{V}^k_{h+1}(s,a) + 2\beta^k_h(s,a), \qquad \widetilde{V}^k_h(s) = \big[\mathbb{D}_{\pi^k_h}\widetilde{Q}^k_h\big](s).$$
Then we can prove inductively that for any $k, h, s$ and $a$,
$$\max_i\big(\overline{Q}^k_{h,i} - \underline{Q}^k_{h,i}\big)(s,a) \le \widetilde{Q}^k_h(s,a), \qquad \max_i\big(\overline{V}^k_{h,i} - \underline{V}^k_{h,i}\big)(s) \le \widetilde{V}^k_h(s).$$
Thus we only need to bound $\sum_{k=1}^K \widetilde{V}^k_1(s_1)$.
Define the shorthand notation
$$\beta^k_h := \beta^k_h(s^k_h,a^k_h), \quad \Delta^k_h := \widetilde{V}^k_h(s^k_h), \quad \zeta^k_h := \big[\mathbb{D}_{\pi^k}\widetilde{Q}^k_h\big](s^k_h) - \widetilde{Q}^k_h(s^k_h,a^k_h), \quad \xi^k_h := P_h\widetilde{V}^k_{h+1}(s^k_h,a^k_h) - \Delta^k_{h+1}.$$
One can check that $\zeta^k_h$ and $\xi^k_h$ are martingale difference sequences. As a result,
$$\Delta^k_h = \big[\mathbb{D}_{\pi^k}\widetilde{Q}^k_h\big](s^k_h) = \zeta^k_h + \widetilde{Q}^k_h(s^k_h,a^k_h) = \zeta^k_h + 2\beta^k_h + \widehat{P}^k_h\widetilde{V}^k_{h+1}(s^k_h,a^k_h) \le \zeta^k_h + 3\beta^k_h + P_h\widetilde{V}^k_{h+1}(s^k_h,a^k_h) = \zeta^k_h + 3\beta^k_h + \xi^k_h + \Delta^k_{h+1}.$$
Recursing this argument for $h\in[H]$ and summing over $k$,
$$\sum_{k=1}^K \Delta^k_1 \le \sum_{k=1}^K\sum_{h=1}^H\big(\zeta^k_h + 3\beta^k_h + \xi^k_h\big) \le \mathcal{O}\Bigg(\sqrt{S\Big(\prod_{i=1}^M A_i\Big)H^3T\iota}\Bigg).$$
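The sum of the bonus terms above is controlled by standard pigeonhole bounds: for a single $(s,a,h)$ triple, $\sum_k 1/\sqrt{\max\{N^k,1\}} \le \mathcal{O}(\sqrt{K})$ and $\sum_k 1/\max\{N^k,1\} \le \mathcal{O}(\log K)$. A small numerical check of the worst case, where one triple is visited in every episode (illustrative code, not from the paper):

```python
import math

def visit_count_sums(K):
    """Worst case for one (s, a, h) triple: it is visited in every episode,
    so the count before episode k is N^k = k - 1.  Returns the pair
    (sum_k 1/sqrt(max(N^k, 1)), sum_k 1/max(N^k, 1))."""
    inv_sqrt = sum(1.0 / math.sqrt(max(n, 1)) for n in range(K))
    inv = sum(1.0 / max(n, 1) for n in range(K))
    return inv_sqrt, inv
```

Summed over all $S\prod_i A_i$ triples and combined with Cauchy–Schwarz, these elementary bounds yield the $\sqrt{T}$-type leading term displayed above.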

H.1.2 CCE VERSION

The proof is very similar to the NE version; the only part there that uses the properties of NE is Lemma 35. We prove a counterpart here.

Lemma 36. With probability $1-p$, for any $(s,a,h,k,i)$:
$$\overline{Q}^k_{h,i}(s,a) \ge Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a), \quad \underline{Q}^k_{h,i}(s,a) \le Q^{\pi^k}_{h,i}(s,a), \quad \overline{V}^k_{h,i}(s) \ge V^{\dagger,\pi^k_{-i}}_{h,i}(s), \quad \underline{V}^k_{h,i}(s) \le V^{\pi^k}_{h,i}(s).$$

Proof. For each fixed $k$, we prove this by induction from $h = H+1$ to $h = 1$. For the base case, at the $(H+1)$'th step we have $\overline{V}^k_{H+1,i}(s) = V^{\dagger,\pi^k_{-i}}_{H+1,i}(s) = 0$. Now, assume inequality (45) holds for the $(h+1)$'th step. For the $h$'th step, by the definition of the $Q$ functions,
$$\overline{Q}^k_{h,i}(s,a) - Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a) = \widehat{P}^k_h\overline{V}^k_{h+1,i}(s,a) - P_hV^{\dagger,\pi^k_{-i}}_{h+1,i}(s,a) + \beta_t = \underbrace{\widehat{P}^k_h\big(\overline{V}^k_{h+1,i} - V^{\dagger,\pi^k_{-i}}_{h+1,i}\big)(s,a)}_{(A)} + \underbrace{\big(\widehat{P}^k_h - P_h\big)V^{\dagger,\pi^k_{-i}}_{h+1,i}(s,a)}_{(B)} + \beta_t.$$
By the induction hypothesis $(A)\ge 0$, and $(B)$ is controlled by uniform concentration, so $\overline{Q}^k_{h,i}(s,a) \ge Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a)$; the second inequality can be proved similarly. Now assume inequality (45) holds for the $h$'th step. By the definition of the value functions and of a CCE, $\overline{V}^k_{h,i}(s) \ge \max_\mu \mathbb{D}_{\mu\times\pi^k_{-i}}\overline{Q}^k_{h,i}(s)$. By the Bellman equation, $V^{\dagger,\pi^k_{-i}}_{h,i}(s) = \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}Q^{\dagger,\pi^k_{-i}}_{h,i}(s)$. Since, by the induction hypothesis, $\overline{Q}^k_{h,i}(s,a) \ge Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a)$ for every $(s,a)$, we also have $\overline{V}^k_{h,i}(s) \ge V^{\dagger,\pi^k_{-i}}_{h,i}(s)$, which is exactly inequality (45) for the $h$'th step. The second inequality can be proved similarly.

H.1.3 CE VERSION

The proof is very similar to the NE version; the only part there that uses the properties of NE is Lemma 35. We prove a counterpart here.

Lemma 37. With probability $1-p$, for any $(s,a,h,k,i)$:
$$\overline{Q}^k_{h,i}(s,a) \ge \max_\phi Q^{\phi\diamond\pi^k}_{h,i}(s,a), \quad \underline{Q}^k_{h,i}(s,a) \le Q^{\pi^k}_{h,i}(s,a), \quad \overline{V}^k_{h,i}(s) \ge \max_\phi V^{\phi\diamond\pi^k}_{h,i}(s), \quad \underline{V}^k_{h,i}(s) \le V^{\pi^k}_{h,i}(s).$$

Proof. For each fixed $k$, we prove this by induction from $h = H+1$ to $h = 1$. The base case at the $(H+1)$'th step is trivial since all value functions there are zero. The inductive step follows the same $(A)/(B)$ decomposition as in Lemma 35, now using the Bellman equation
$$\max_\phi V^{\phi\diamond\pi^k}_{h,i}(s) = \max_\phi\mathbb{D}_{\phi\diamond\pi^k}\max_\phi Q^{\phi\diamond\pi^k}_{h,i}(s)$$
together with the CE property of $\pi^k$.

Proof of Lemma 39. For each fixed $k$, we prove this by induction from $h = H+1$ to $h = 1$. For the base case, at the $(H+1)$'th step, $\widehat{V}^k_{H+1,i} = V^{\pi^k_{-i},\dagger}_{H+1,i} = \widehat{Q}^k_{H+1,i} = Q^{\pi^k_{-i},\dagger}_{H+1,i} = 0$. Now assume the conclusion holds for the $(h+1)$'th step. For the $h$'th step, by the definition of the $Q$ functions,
$$Q^{\pi^k_{-i},\dagger}_{h,i}(s,a) - \widehat{Q}^k_{h,i}(s,a) \le P_h V^{\pi^k_{-i},\dagger}_{h+1,i}(s,a) - \widehat{P}^k_h\widehat{V}^k_{h+1,i}(s,a) + r_h(s,a) - \widehat{r}^k_h(s,a)$$
$$\le \underbrace{\widehat{P}^k_h\big(V^{\pi^k_{-i},\dagger}_{h+1,i} - \widehat{V}^k_{h+1,i}\big)(s,a)}_{(A)} + \underbrace{\big(P_h - \widehat{P}^k_h\big)V^{\pi^k_{-i},\dagger}_{h+1,i}(s,a) + r_h(s,a) - \widehat{r}^k_h(s,a)}_{(B)}.$$
By the induction hypothesis, $(A) \le (\widehat{P}^k_h\widetilde{V}^k_{h+1})(s,a)$. By uniform concentration, $(B) \le \sqrt{SH^2\iota/N^k_h(s,a)} = \beta_t$. Putting everything together, we have
$$Q^{\pi^k_{-i},\dagger}_{h,i}(s,a) - \widehat{Q}^k_{h,i}(s,a) \le \min\big\{(\widehat{P}^k_h\widetilde{V}^k_{h+1})(s,a) + \beta_t,\ H\big\} = \widetilde{Q}^k_h(s,a),$$
which proves the first inequality in (50). It remains to show that the inequality for the $V$ functions also holds at the $h$'th step. Since $\pi^k$ is a Nash equilibrium policy, we have $\widehat{V}^k_{h,i}(s) = \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}\widehat{Q}^k_{h,i}(s)$. By the Bellman equation, $V^{\pi^k_{-i},\dagger}_{h,i}(s) = \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}Q^{\pi^k_{-i},\dagger}_{h,i}(s)$. Combining the two equations above and using the bound just proved for the $Q$ functions, we obtain
$$V^{\pi^k_{-i},\dagger}_{h,i}(s) - \widehat{V}^k_{h,i}(s) \le \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}Q^{\pi^k_{-i},\dagger}_{h,i}(s) - \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}\widehat{Q}^k_{h,i}(s) \le \max_a\widetilde{Q}^k_h(s,a) = \widetilde{V}^k_h(s),$$
which completes the whole proof.

H.2.2 CCE VERSION

The proof is almost the same as that for Nash equilibria. We reuse Lemma 38 and prove an analogue of Lemma 39; the conclusion for CCEs then follows directly by combining the two lemmas, as in the proof of Theorem 5.

Lemma 40. With probability $1-p$, for any $(h,s,a,i,k)$, we have
$$Q^{\pi^k_{-i},\dagger}_{h,i}(s,a) - \widehat{Q}^k_{h,i}(s,a) \le \widetilde{Q}^k_h(s,a), \qquad V^{\pi^k_{-i},\dagger}_{h,i}(s) - \widehat{V}^k_{h,i}(s) \le \widetilde{V}^k_h(s).$$

Proof. For each fixed $k$, we prove this by induction from $h = H+1$ to $h = 1$. For the base case, at the $(H+1)$'th step, $\widehat{V}^k_{H+1,i} = V^{\pi^k_{-i},\dagger}_{H+1,i} = \widehat{Q}^k_{H+1,i} = Q^{\pi^k_{-i},\dagger}_{H+1,i} = 0$. Now assume the conclusion holds for the $(h+1)$'th step. For the $h$'th step, by the definition of the $Q$ functions,
$$Q^{\pi^k_{-i},\dagger}_{h,i}(s,a) - \widehat{Q}^k_{h,i}(s,a) \le P_h V^{\pi^k_{-i},\dagger}_{h+1,i}(s,a) - \widehat{P}^k_h\widehat{V}^k_{h+1,i}(s,a) + r_h(s,a) - \widehat{r}^k_h(s,a)$$
$$\le \underbrace{\widehat{P}^k_h\big(V^{\pi^k_{-i},\dagger}_{h+1,i} - \widehat{V}^k_{h+1,i}\big)(s,a)}_{(A)} + \underbrace{\big(P_h - \widehat{P}^k_h\big)V^{\pi^k_{-i},\dagger}_{h+1,i}(s,a) + r_h(s,a) - \widehat{r}^k_h(s,a)}_{(B)}.$$
By the induction hypothesis, $(A) \le (\widehat{P}^k_h\widetilde{V}^k_{h+1})(s,a)$. By uniform concentration, $(B) \le \sqrt{SH^2\iota/N^k_h(s,a)} = \beta_t$. Putting everything together, we have
$$Q^{\pi^k_{-i},\dagger}_{h,i}(s,a) - \widehat{Q}^k_{h,i}(s,a) \le \min\big\{(\widehat{P}^k_h\widetilde{V}^k_{h+1})(s,a) + \beta_t,\ H\big\} = \widetilde{Q}^k_h(s,a),$$
which proves the first inequality in (51). It remains to show that the inequality for the $V$ functions also holds at the $h$'th step. Since $\pi^k$ is a CCE, we have $\widehat{V}^k_{h,i}(s) \ge \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}\widehat{Q}^k_{h,i}(s)$. Observe that $V^{\pi^k_{-i},\dagger}_{h,i}$ obeys the Bellman optimality equation, so $V^{\pi^k_{-i},\dagger}_{h,i}(s) = \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}Q^{\pi^k_{-i},\dagger}_{h,i}(s)$. Combining the two relations above and using the bound just proved for the $Q$ functions, we obtain
$$V^{\pi^k_{-i},\dagger}_{h,i}(s) - \widehat{V}^k_{h,i}(s) \le \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}Q^{\pi^k_{-i},\dagger}_{h,i}(s) - \max_\mu\mathbb{D}_{\mu\times\pi^k_{-i}}\widehat{Q}^k_{h,i}(s) \le \max_a\widetilde{Q}^k_h(s,a) = \widetilde{V}^k_h(s),$$
which completes the whole proof.

H.2.3 CE VERSION

The proof is almost the same as that for Nash equilibria. We reuse Lemma 38 and prove an analogue of Lemma 39; the conclusion for CEs then follows directly by combining the two lemmas, as in the proof of Theorem 5.

Lemma 41. With probability $1-p$, for any $(h,s,a,i,k)$ and any strategy modification $\phi$ for player $i$, we have
$$Q^{\phi\diamond\pi^k}_{h,i}(s,a) - \widehat{Q}^k_{h,i}(s,a) \le \widetilde{Q}^k_h(s,a), \qquad V^{\phi\diamond\pi^k}_{h,i}(s) - \widehat{V}^k_{h,i}(s) \le \widetilde{V}^k_h(s).$$

Proof. For each fixed $k$, we prove this by induction from $h = H+1$ to $h = 1$. For the base case, at the $(H+1)$'th step, $\widehat{V}^k_{H+1,i} = V^{\phi\diamond\pi^k}_{H+1,i} = \widehat{Q}^k_{H+1,i} = Q^{\phi\diamond\pi^k}_{H+1,i} = 0$. Now assume the conclusion holds for the $(h+1)$'th step. For the $h$'th step, following exactly the same argument as in Lemma 40, we can show
$$Q^{\phi\diamond\pi^k}_{h,i}(s,a) - \widehat{Q}^k_{h,i}(s,a) \le \min\big\{(\widehat{P}^k_h\widetilde{V}^k_{h+1})(s,a) + \beta_t,\ H\big\} = \widetilde{Q}^k_h(s,a),$$
which proves the first inequality in (52). It remains to show that the inequality for the $V$ functions also holds at the $h$'th step. Since $\pi^k$ is a CE, we have
$$\widehat{V}^k_{h,i}(s) = \max_{\phi_{h,s}}\mathbb{D}_{\phi_{h,s}\diamond\pi^k}\widehat{Q}^k_{h,i}(s),$$
where the maximum is taken over all possible injective functions from $\mathcal{A}_i$ to itself. Observe that $V^{\phi\diamond\pi^k}_{h,i}$ obeys the Bellman optimality equation, so
$$V^{\phi\diamond\pi^k}_{h,i}(s) = \max_{\phi_{h,s}}\mathbb{D}_{\phi_{h,s}\diamond\pi^k}Q^{\phi\diamond\pi^k}_{h,i}(s).$$
Combining the two equations above and using the bound just proved for the $Q$ functions, we obtain
$$V^{\phi\diamond\pi^k}_{h,i}(s) - \widehat{V}^k_{h,i}(s) = \max_{\phi_{h,s}}\mathbb{D}_{\phi_{h,s}\diamond\pi^k}Q^{\phi\diamond\pi^k}_{h,i}(s) - \max_{\phi_{h,s}}\mathbb{D}_{\phi_{h,s}\diamond\pi^k}\widehat{Q}^k_{h,i}(s) \le \max_a\widetilde{Q}^k_h(s,a) = \widetilde{V}^k_h(s),$$
which completes the whole proof.



We assume the rewards lie in $[0,1]$ for normalization. Our results directly generalize to randomized reward functions, since learning the transition is more difficult than learning the reward.

The minimax theorem here is different from the one for matrix games, i.e. $\max_\phi\min_\psi \phi^\top A\psi = \min_\psi\max_\phi \phi^\top A\psi$ for any matrix $A$, since here $V^{\mu,\nu}_h(s)$ is in general not bilinear in $(\mu,\nu)$.

We remark that the current policy is stochastic. This is different from the single-agent setting, where the algorithm only seeks to provide an upper bound on the value of the optimal policy, and the optimal policy is not random. Due to this difference, the techniques of Azar et al. (2017) cannot be directly applied here.

Recall that $(\mu^k_h, \nu^k_h)$ are the marginal distributions of $\pi^k_h$.



$[P_h V](s,a,b) := \mathbb{E}_{s'\sim P_h(\cdot\mid s,a,b)} V(s')$ for any value function $V$. We also use the notation $[\mathbb{D}_\pi Q](s) := \mathbb{E}_{(a,b)\sim\pi(\cdot,\cdot\mid s)} Q(s,a,b)$ for any action-value function $Q$.

Vanilla Nash Value Iteration
1: Input: model M̂ = (P̂, r̂).
2: Initialize: for all s, V_{H+1}(s) ← 0.
3: for step h = H, H−1, ..., 1 do
4:   for (s, a, b) ∈ S × A × B do
5:     Q_h(s, a, b) ← [P̂_h V_{h+1}](s, a, b) + r̂_h(s, a, b).

$(s') \ge 0$, and thus $(A)\ge 0$. By uniform concentration (e.g., Lemma 12 in Bai & Jin (2020)), $(B) \le C\sqrt{SH^2\iota/N^k_h(s,a)} = \beta_t$. Putting everything together, we have

$\ge 0$, and thus $(A)\ge 0$. By uniform concentration, $(B) \le C\sqrt{SH^2\iota/N^k_h(s,a)} = \beta_t$. Putting everything together, we have $\overline{Q}^k_{h,i}(s,a) - Q^{\dagger,\pi^k_{-i}}_{h,i}(s,a) \ge 0$. The second inequality can be proved similarly. Now assume inequality (45) holds for the $h$'th step; by the definition of the value functions and of a CCE,

$(s) = 0$. Now, assume inequality (47) holds for the $(h+1)$'th step. For the $h$'th step, by the definition of the $Q$ functions,
$$\overline{Q}^k_{h,i}(s,a) - \max_\phi Q^{\phi\diamond\pi^k}_{h,i}(s,a) = \widehat{P}^k_h\overline{V}^k_{h+1,i}(s,a) - P_h\max_\phi V^{\phi\diamond\pi^k}_{h+1,i}(s,a) + \beta_t.$$
Since $\big(\overline{V}^k_{h+1,i} - \max_\phi V^{\phi\diamond\pi^k}_{h+1,i}\big)(s') \ge 0$, we have $(A)\ge 0$. By uniform concentration, $(B) \le C\sqrt{SH^2\iota/N^k_h(s,a)} = \beta_t$. Putting everything together, we have $\overline{Q}^k_{h,i}(s,a) - \max_\phi Q^{\phi\diamond\pi^k}_{h,i}(s,a) \ge 0$. The second inequality can be proved similarly. Now assume inequality (47) holds for the $h$'th step. By the definition of the value functions and of a CE,
$$\overline{V}^k_{h,i}(s) = \mathbb{D}_{\pi^k}\overline{Q}^k_{h,i}(s) = \max_\phi\mathbb{D}_{\phi\diamond\pi^k}\overline{Q}^k_{h,i}(s).$$

Our final output policies $(\mu^{\mathrm{out}}, \nu^{\mathrm{out}})$ are simply the marginal policies of $\pi^{\mathrm{out}}$. That is, for all $(s,h)\in\mathcal{S}\times[H]$,
$$\mu^{\mathrm{out}}_h(\cdot\mid s) := \sum_{b\in\mathcal{B}} \pi^{\mathrm{out}}_h(\cdot, b\mid s), \qquad \nu^{\mathrm{out}}_h(\cdot\mid s) := \sum_{a\in\mathcal{A}} \pi^{\mathrm{out}}_h(a, \cdot\mid s).$$
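Computing these marginals is a one-liner per player; the following sketch (with illustrative names) marginalizes a joint policy table $\pi_h(a,b\mid s)$ stored as a nested list:

```python
def marginals(pi):
    """Marginal policies (mu, nu) of a joint policy pi[a][b] = pi_h(a, b | s):
    mu sums out the min-player's action b, nu sums out the max-player's action a."""
    A, B = len(pi), len(pi[0])
    mu = [sum(pi[a][b] for b in range(B)) for a in range(A)]
    nu = [sum(pi[a][b] for a in range(A)) for b in range(B)]
    return mu, nu
```

Playing the marginals independently is in general not equivalent to playing the correlated joint policy; it is only in the zero-sum setting considered here that marginalizing the output policy preserves the guarantee.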

Jia et al. (2019); Sidford et al. (2019); Zhang et al. (2020a) provide non-asymptotic bounds on the number of calls to the simulator for finding an approximate Nash equilibrium. Wei et al. (2017) studies Markov games under an alternative assumption: no matter what strategy one agent sticks to, the other agent can always reach all states by playing a certain policy.

achieved by the model-based algorithm in Azar et al. (2017) and the model-free algorithm in Zhang et al. (2020c), respectively, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of each episode. Both of them match the lower bound $\Omega(H^3SA/\epsilon^2)$ (Jaksch et al., 2010; Osband & Van Roy, 2016; Jin et al., 2018).

Algorithm 2 Optimistic Value Iteration with Zero Reward (VI-Zero)

(Fragment of the Algorithm 2 listing: in each episode, starting from $s_1$, play $(a_h, b_h) \sim \pi_h(\cdot,\cdot\mid s_h)$, observe the next state $s_{h+1}$, and update the counts $N_h(s_h, a_h, b_h)$ and $N_h(s_h, a_h, b_h, s_{h+1})$; at the end, output $P^{\mathrm{out}} \leftarrow \widehat{P}$.)


Above, we visualize the hard instance by using $+$ and $-$ to represent $1/2+\epsilon$ and $1/2-\epsilon$, respectively. It is straightforward to see that the optimal policy for the max-player is to always pick the $a^\star$'th row, and the optimal policy for the min-player is to always pick the $b^\star$'th column. If the max-player picks the $a^\star$'th row with probability smaller than $2/3$, it is at least $\epsilon/10$-suboptimal.

Lemma 29. For any fixed matrix game $M$ from $\mathcal{M}(\epsilon)$ and $N\in\mathbb{N}$, if an algorithm $\mathcal{A}$ can output a policy that is at most $\epsilon/10$-suboptimal with probability at least $p$ using at most $N$ samples, then there exists an algorithm $\widehat{\mathcal{A}}$ that can identify the best row of $M$ with probability at least $p$ using at most $N$ samples.

Proof. We simply define $\widehat{\mathcal{A}}$ as running algorithm $\mathcal{A}$ and choosing the row played most by its output policy as the guess for the best row. By a simple calculation, one can show that $\widehat{\mathcal{A}}$ outputs the best row of $M$ with probability at least $p$.

Lemma 29 directly implies that in order to prove the desired lower bound for matrix games,

Claim 30. For any algorithm $\mathcal{A}$ using at most $N = AB/(10^3\epsilon^2)$ samples, there exists a matrix game $M$ in $\mathcal{M}(\epsilon)$ such that when running $\mathcal{A}$ on $M$, it outputs a policy that is at least $\epsilon/10$-suboptimal for the max-player with probability at least $1/4$,

it suffices to prove the following claim:

Claim 31. For any algorithm $\widehat{\mathcal{A}}$ using at most $N = AB/(10^3\epsilon^2)$ samples, there exists a matrix game $M$ in $\mathcal{M}(\epsilon)$ such that when running $\widehat{\mathcal{A}}$ on $M$, it fails to identify the optimal row with probability at least $1/4$.

Proof of Claim 31. WLOG, we assume $\widehat{\mathcal{A}}$ is deterministic. Since this is the reward-free setting, being deterministic means that algorithm $\widehat{\mathcal{A}}$ always pulls each arm $(a,b)$ some fixed number $n(a,b)$ of times and then outputs a guess for $a^\star$ as a function of the revealed rewards. Denote by $L$ the rewards revealed after $\widehat{\mathcal{A}}$'s pulling. Denote by $\mathbb{P}$ the probability induced by picking $M$ uniformly at random from $\mathcal{M}(\epsilon)$ and running $\widehat{\mathcal{A}}$ on $M$.
Denote by $\mathbb{P}_{a,b}$ the probability induced by running $\widehat{\mathcal{A}}$ on the $M$ whose indices of the special row and the special column are $a$ and $b$, respectively. Denote by $\mathbb{P}_{0,b}$ the probability induced by running $\widehat{\mathcal{A}}$ on the $M$ whose $b$'th column entries are all $1/2-\epsilon$ and whose other columns are all $1/2+\epsilon$; we mention that the $M$ used to define $\mathbb{P}_{0,b}$ does not belong to $\mathcal{M}(\epsilon)$. Choosing $N = AB/(10^3\epsilon^2)$ then concludes the proof.

G.3.2 REWARD-FREE MARKOV GAMES

Now let us generalize the $\Omega(AB/\epsilon^2)$ lower bound to $\Omega(SABH^2/\epsilon^2)$ for reward-free Markov games. We define the following family of MDPs $\mathfrak{J}(\epsilon) = \{\mathcal{J}(a^\star,b^\star)\}$, where the MDP $\mathcal{J}(a^\star,b^\star)$ is defined as follows:

• States and actions: $\mathcal{J}(a^\star,b^\star)$ is a finite-horizon MDP with $S+1$ states and horizon $H+1$. There is a fixed initial state $s_0$ in the first step and $S$ states $\{s_1,\ldots,s_S\}$ for the remaining steps. The two players have $A$ and $B$ actions, respectively.

• Rewards: there is no reward in the first step. For the remaining steps $h\in\{2,\ldots,H+1\}$, if the agents take action $(a,b)$ at state $s_i$ in the $h$'th step, they receive a binary reward sampled from
$$\mathrm{Bernoulli}\Big(\frac{1}{2} + \frac{\epsilon}{H}\big(1 - 2\cdot\mathbb{1}\{a\neq a^\star_{h-1,i},\ b = b^\star_{h-1,i}\}\big)\Big).$$

• Transitions: regardless of the current state, actions, and step index, the agents transit to one of $s_1,\ldots,s_S$ uniformly at random.

It is straightforward to see that $\mathcal{J}(a^\star,b^\star)$ is a collection of $SH$ independent matrix games from $\mathcal{M}(\epsilon/H)$. Therefore, the optimal policy for the max-player is to always pick action $a^\star_{h-1,i}$ whenever it reaches state $s_i$ in the $h$'th step ($h\ge 2$); in other words, $a^\star_{h-1,i}$ is the unique optimal action for the step-state pair $(h,i)$.

At a high level, in order to find an $\epsilon$-optimal policy for the above Markov game, we need to identify at least half of the entries of $a^\star$; therefore, the number of episodes should be at least $\Omega(SABH^2/\epsilon^2)$. Below we provide a formal proof of this argument, which is almost the same as that for the setting of reward-free matrix games. We start by proving an analogue of Lemma 29.

Lemma 32. For any fixed game $\mathcal{J}(a^\star,b^\star)$ from $\mathfrak{J}(\epsilon)$ and $N\in\mathbb{N}$, if an algorithm $\mathcal{A}$ can output a policy that is at most $\epsilon/10^3$-suboptimal with probability at least $p$ using at most $N$ samples, then there exists an algorithm $\widehat{\mathcal{A}}$ that can correctly identify at least $SH - SH/500$ entries of $a^\star$ with probability at least $p$ using at most $N$ samples.

Intuitively, $\mathcal{J}_{-(h^\star,i^\star)}(a^\star,b^\star)$ is the same as $\mathcal{J}(a^\star,b^\star)$ except that, for the max-player, all its actions at state $s_{i^\star}$ in the $h^\star$'th step are equivalent.
In other words, $\mathcal{J}_{-(h^\star,i^\star)}(a^\star,b^\star)$ is independent of $a^\star_{h^\star,i^\star}$. To proceed, we define the following notation: denote by $n(a,b)$ the number of times $\widehat{\mathcal{A}}$ picks action $(a,b)$ at state $s_{i^\star}$ in the $(h^\star+1)$'th step; denote by $\mathbb{P}_{(a^\star,b^\star)}$ ($\mathbb{E}_{(a^\star,b^\star)}$) the probability (expectation) induced by running algorithm $\widehat{\mathcal{A}}$ on $\mathcal{J}(a^\star,b^\star)$, and similarly define the corresponding quantities for $\mathcal{J}_{-(h^\star,i^\star)}(a^\star,b^\star)$; also recall that we denote by $\mathbb{P}$ ($\mathbb{E}$) the probability (expectation) induced by picking $\mathcal{J}(a^\star,b^\star)$ uniformly at random from $\mathfrak{J}(\epsilon)$ and running $\widehat{\mathcal{A}}$ on $\mathcal{J}$; denote by $L$ the whole trajectory of states, actions and rewards produced by algorithm $\widehat{\mathcal{A}}$ in $N$ episodes; and, with slight abuse of notation, denote by $\widehat{\mathcal{A}}(L)$ the guess of $\widehat{\mathcal{A}}$ for $a^\star_{h^\star,i^\star}$ based on $L$. Plugging in $K = SABH^2/(10^4\epsilon^2)$ completes the proof.
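The mean rewards of the hard instance $\mathcal{J}(a^\star,b^\star)$ can be written down directly. The sketch below treats $a^\star, b^\star$ as dictionaries keyed by $(h-1, i)$; the names and the indicator convention are our illustrative reconstruction, not code from the paper:

```python
def reward_mean(h, i, a, b, a_star, b_star, eps, H):
    """Mean reward of J(a_star, b_star) when actions (a, b) are taken at state
    s_i in step h: zero in the first step, and otherwise the embedded
    M(eps/H) matrix-game mean at the (h-1, i) pair."""
    if h == 1:
        return 0.0
    key = (h - 1, i)
    # the single "minus" pattern: wrong row against the special column
    bad = (a != a_star[key]) and (b == b_star[key])
    return 0.5 + (eps / H) * (1 - 2 * bad)
```

Since the transition is uniform regardless of actions, each of the $SH$ embedded matrix games collects only a $1/S$ fraction of the samples per step, which is what inflates the matrix-game lower bound by the factor $SH^2$.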

H.2 PROOF OF THEOREM 16

In this section, we prove each theorem for the single reward function case, i.e., N = 1. The proof for the case of multiple reward functions (N > 1) simply follows from taking a union bound, that is, replacing the failure probability p by N p.

H.2.1 NE VERSION

Let $\pi^k$ be an arbitrary Nash equilibrium policy of $\widehat{M}^k := (\widehat{P}^k, \widehat{r}^k)$, where $\widehat{P}^k$ and $\widehat{r}^k$ are our empirical estimates of the transition and the reward at the beginning of the $k$'th episode in Algorithm 4. Given such a Nash equilibrium $\pi^k$ of $\widehat{M}^k$, we use $\widehat{Q}^k_{h,i}$ and $\widehat{V}^k_{h,i}$ to denote its value functions for the $i$'th player at the $h$'th step in $\widehat{M}^k$.

We prove the following two lemmas, which together imply the conclusion about Nash equilibria in Theorem 16, as in the proof of Theorem 5.

Lemma 38. With probability $1-p$, for any $(h,s,a,i,k)$, we have
$$\big|\widehat{Q}^k_{h,i}(s,a) - Q^{\pi^k}_{h,i}(s,a)\big| \le \widetilde{Q}^k_h(s,a), \qquad \big|\widehat{V}^k_{h,i}(s) - V^{\pi^k}_{h,i}(s)\big| \le \widetilde{V}^k_h(s). \qquad (49)$$

Proof. For each fixed $k$, we prove this by induction from $h = H+1$ to $h = 1$. For the base case, at the $(H+1)$'th step, $\widehat{V}^k_{H+1,i} = V^{\pi^k}_{H+1,i} = \widehat{Q}^k_{H+1,i} = Q^{\pi^k}_{H+1,i} = 0$. Now assume the conclusion holds for the $(h+1)$'th step. For the $h$'th step, by the definition of the $Q$ functions,
$$\big|\widehat{Q}^k_{h,i}(s,a) - Q^{\pi^k}_{h,i}(s,a)\big| \le \big|\widehat{P}^k_h\big(\widehat{V}^k_{h+1,i} - V^{\pi^k}_{h+1,i}\big)(s,a)\big| + \big|\big(\widehat{P}^k_h - P_h\big)V^{\pi^k}_{h+1,i}(s,a)\big| + \big|\big(\widehat{r}^k_h - r_h\big)(s,a)\big|.$$
By the induction hypothesis, the first term is at most $(\widehat{P}^k_h\widetilde{V}^k_{h+1})(s,a)$. By uniform concentration (e.g., Lemma 12 in Bai & Jin (2020)), the remaining terms are at most $\sqrt{SH^2\iota/N^k_h(s,a)} = \beta_t$. Putting everything together, we have
$$\big|\widehat{Q}^k_{h,i}(s,a) - Q^{\pi^k}_{h,i}(s,a)\big| \le \min\big\{(\widehat{P}^k_h\widetilde{V}^k_{h+1})(s,a) + \beta_t,\ H\big\} = \widetilde{Q}^k_h(s,a),$$
which proves the first inequality in (49). The inequality for the $V$ functions follows directly by noting that the value functions are computed using the same policy $\pi^k$.

Lemma 39. With probability $1-p$, for any $(h,s,a,i,k)$, we have
$$Q^{\pi^k_{-i},\dagger}_{h,i}(s,a) - \widehat{Q}^k_{h,i}(s,a) \le \widetilde{Q}^k_h(s,a), \qquad V^{\pi^k_{-i},\dagger}_{h,i}(s) - \widehat{V}^k_{h,i}(s) \le \widetilde{V}^k_h(s). \qquad (50)$$

