REPRESENTATION LEARNING FOR GENERAL-SUM LOW-RANK MARKOV GAMES

Abstract

We study multi-agent general-sum Markov games with nonlinear function approximation. We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation. The goal is to design an algorithm that (1) finds an ε-equilibrium policy sample-efficiently without prior knowledge of the environment or the representation, and (2) permits a deep-learning friendly implementation. We leverage representation learning and present a model-based and a model-free approach to construct an effective representation from the collected data. For both approaches, the algorithm achieves a sample complexity of poly(H, d, A, 1/ε), where H is the game horizon, d is the dimension of the feature vector, A is the size of the joint action space, and ε is the optimality gap. When the number of players is large, the above sample complexity can scale exponentially with the number of players in the worst case. To address this challenge, we consider Markov games with a factorized transition structure and present an algorithm that escapes such exponential scaling. To the best of our knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates (non-linear) function approximation. We accompany our theoretical result with a neural network-based implementation of our algorithm and evaluate it against the widely used deep RL baseline, DQN with fictitious play.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) studies the problem where multiple agents learn to make sequential decisions in an unknown environment to maximize their (own) cumulative rewards. Recently, MARL has achieved remarkable empirical success, such as in traditional games like Go (Silver et al., 2016, 2017) and Poker (Moravčík et al., 2017), real-time video games such as StarCraft and Dota 2 (Vinyals et al., 2019; Berner et al., 2019), decentralized control and multi-agent robotics systems (Brambilla et al., 2013), and autonomous driving (Shalev-Shwartz et al., 2016). On the theoretical front, however, provably sample-efficient algorithms for Markov games have been largely restricted to either two-player zero-sum games (Bai et al., 2020; Xie et al., 2020; Chen et al., 2021; Jin et al., 2021c) or general-sum games with small and finite state and action spaces (Bai and Jin, 2020; Liu et al., 2021; Jin et al., 2021b). These algorithms typically do not permit a scalable implementation applicable to real-world games: either (1) they only work for tabular or linear Markov games, which are too restrictive to model real-world games, or (2) those that do handle rich non-linear function approximation (Jin et al., 2021c) are not computationally efficient. This motivates us to ask the following question: Can we design an efficient algorithm that (1) provably learns multi-player general-sum Markov games with rich nonlinear function approximation and (2) permits scalable implementations? This paper presents the first positive answer to the above question. In particular, we make the following contributions: 1. We design a new centralized self-play meta-algorithm for multi-agent low-rank Markov games: General Representation Learning for Multi-player General-sum Markov Game (GERL_MG2).
We present a model-based and a model-free instantiation of GERL_MG2, which differ in how function approximation is used, together with a clean and unified analysis for both approaches. 2. We show that the model-based variant requires access to an MLE oracle and a NE/CE/CCE oracle for matrix games, and enjoys a $\tilde{O}(H^6 d^4 A^2 \log(|\Phi||\Psi|)/\varepsilon^2)$ sample complexity for learning an ε-NE/CE/CCE equilibrium policy, where d is the dimension of the feature vector, A is the size of the joint action space, H is the game horizon, and Φ and Ψ are the function classes for the representation and the emission process. The model-free variant replaces model learning with solving a minimax optimization problem, and enjoys a sample complexity of $\tilde{O}(H^6 d^4 A^3 M \log(|\Phi|)/\varepsilon^2)$ for a slightly restricted class of Markov games with latent block structure. 3. Both of the above algorithms have sample complexities scaling with the joint action space size, which is exponential in the number of players. This unfavorable scaling is referred to as the curse of multi-agent, and is unavoidable in the worst case under general function approximation. We consider a spatial factorization structure in which the transition of each player's local state is directly affected by at most $L = O(1)$ players in its adjacency. Given this additional structure, we provide an algorithm that achieves $\tilde{O}(M^4 H^6 d^{2(L+1)^2} \tilde{A}^{2(L+1)}/\varepsilon^2)$ sample complexity, where $\tilde{A}$ is the size of a single player's action space, thus escaping the exponential scaling with the number of agents. 4. Finally, we provide an efficient implementation of our reward-free algorithm, and show that it achieves superior performance against traditional deep RL baselines that lack principled representation learning.

1.1. RELATED WORKS

Markov games Markov games (Shapley, 1953; Littman, 1994) are an extensively used framework for game playing with sequential decision making. Previous works (Littman, 1994; Hu and Wellman, 2003; Hansen et al., 2013) studied how to find the Nash equilibrium of a Markov game when the transition matrix and reward functions are known. When the dynamics of the Markov game are unknown, recent works provide a line of finite-sample guarantees for learning Nash equilibria in two-player zero-sum Markov games (Bai and Jin, 2020; Xie et al., 2020; Bai et al., 2020; Zhang et al., 2020; Liu et al., 2021; Jin et al., 2021c; Huang et al., 2021) and for learning various equilibria (including NE, CE, and CCE, which are standard solution notions in games (Roughgarden, 2010)) in general-sum Markov games (Liu et al., 2021; Bai et al., 2021; Jin et al., 2021b). Some of the analyses in these works build on techniques for learning single-agent Markov Decision Processes (MDPs) (Azar et al., 2017; Jin et al., 2018, 2020). RL with Function Approximation Function approximation in reinforcement learning has been extensively studied in recent years. For the single-agent Markov decision process, function approximation is adopted to achieve a sample complexity that depends on the complexity of the function approximators rather than the size of the state-action space. For example, (Yang and Wang, 2019; Jin et al., 2020; Zanette et al., 2020) considered the linear MDP model, where the transition probability function and reward function are linear in some feature mapping over state-action pairs. Another line of works (see, e.g., Jiang et al., 2017; Jin et al., 2021a; Du et al., 2021; Foster et al., 2021) studied MDPs with general nonlinear function approximation. When it comes to Markov games, (Chen et al., 2021; Xie et al., 2020; Jia et al., 2019) studied Markov games with linear function approximation.
Recently, (Huang et al., 2021) and (Jin et al., 2021c) proposed the first algorithms for two-player zero-sum Markov games with general function approximation, with sample complexities governed by the minimax Eluder dimension. However, technical difficulties prevent extending these results to multi-player general-sum Markov games with nonlinear function approximation. The results for linear function approximation assume a known state-action feature, and cannot handle Markov games with a more general non-linear approximation where both the feature and the function parameters are unknown. As for the works on general function classes, their approaches rely heavily on the two-player structure, and it is not clear how to apply their methods to the general multi-player setting. Representation Learning in RL Our work is closely related to representation learning in single-agent RL, where the study mainly focuses on low-rank MDPs. A low-rank MDP is strictly more general than a linear MDP, which assumes the representation is known a priori. Several related works studied low-rank MDPs with provable sample complexities. (Agarwal et al., 2020b; Ren et al., 2021) and (Uehara et al., 2021) consider the model-based setting, where the algorithm learns the representation given the model class of the transition probability. (Modi et al., 2021) provided a representation learning algorithm in the model-free setting and proved its sample efficiency when the MDP satisfies a minimal reachability assumption. (Zhang et al., 2022) proposed a model-free method for the more restricted class of Block MDPs, also studied in (Du et al., 2019) and (Misra et al., 2020), that does not rely on the reachability assumption. A concurrent work (Qiu et al., 2022) studies representation learning in RL with contrastive learning and extends the algorithm to the Markov game setting. However, their method requires strong data assumptions and does not provide a practical implementation in the Markov game setting.

2. PROBLEM SETTINGS

A general-sum Markov game with $M$ players is defined by a tuple $(\mathcal{S}, \{\mathcal{A}_i\}_{i=1}^M, P^\star, \{r_i\}_{i=1}^M, H, d_1)$. Here $\mathcal{S}$ is the state space, $\mathcal{A}_i$ is the action space for player $i$, $H$ is the time horizon of each episode, and $d_1$ is the initial state distribution. We let $\mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_M$ and use $a = (a_1, a_2, \dots, a_M)$ to denote the joint action of all $M$ players. Denote $\tilde{A} = \max_i |\mathcal{A}_i|$ and $A = |\mathcal{A}|$. $P^\star = \{P^\star_h\}_{h=1}^H$ is a collection of transition probabilities, so that $P^\star_h(\cdot \mid s, a)$ gives the distribution of the next state if the joint action $a$ is taken at state $s$ and step $h$. And $r_i = \{r_{h,i}\}_{h=1}^H$ is a collection of reward functions, so that $r_{h,i}(s, a)$ gives the reward received by player $i$ when the joint action $a$ is taken at state $s$ and step $h$.
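To make the tuple concrete, here is a minimal synthetic instantiation in Python, assuming small finite spaces; the names (`TabularMarkovGame`, `step`) are illustrative, not part of the paper's algorithm.

```python
import numpy as np

# A minimal tabular container for a general-sum Markov game
# (S, {A_i}, P*, {r_i}, H, d_1); all dynamics are randomly generated.
class TabularMarkovGame:
    def __init__(self, n_states, action_sizes, horizon, seed=0):
        rng = np.random.default_rng(seed)
        self.M = len(action_sizes)                  # number of players
        self.H = horizon
        self.nS = n_states
        self.nA = int(np.prod(action_sizes))        # joint action space size A
        # P[h][s, a] is a distribution over next states
        self.P = rng.dirichlet(np.ones(n_states), size=(horizon, n_states, self.nA))
        # r[i][h, s, a] in [0, 1] is player i's reward
        self.r = [rng.random((horizon, n_states, self.nA)) for _ in range(self.M)]
        self.d1 = rng.dirichlet(np.ones(n_states))  # initial state distribution

    def step(self, h, s, a_joint, rng):
        """Sample s' ~ P*_h(.|s, a) and return every player's reward."""
        s_next = rng.choice(self.nS, p=self.P[h, s, a_joint])
        rewards = [self.r[i][h, s, a_joint] for i in range(self.M)]
        return s_next, rewards

game = TabularMarkovGame(n_states=4, action_sizes=[2, 3], horizon=5)
print(game.nA)  # joint action space size: 2 * 3 = 6
```

Note how `nA` is the product of the individual action space sizes, which is the source of the exponential scaling discussed in Section 5.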

2.1. SOLUTION CONCEPTS

The policy of player $i$ is denoted as $\pi_i := \{\pi_{h,i} : \mathcal{S} \to \Delta_{\mathcal{A}_i}\}_{h \in [H]}$. We denote the product policy of all players as $\pi := \pi_1 \times \dots \times \pi_M$; here "product" means that, conditioned on the same state, each player's action is sampled independently according to its own policy. We denote the policy of all players except player $i$ as $\pi_{-i}$. We define $V^\pi_{h,i}(s)$ as the expected cumulative reward received by player $i$ when starting at state $s$ at step $h$ and all players follow policy $\pi$. For any strategy $\pi_{-i}$, there exists a best response of player $i$, which is a policy $\mu^\dagger(\pi_{-i})$ satisfying $V^{\mu^\dagger(\pi_{-i}), \pi_{-i}}_{h,i}(s) = \max_{\pi_i} V^{\pi_i, \pi_{-i}}_{h,i}(s)$ for any $(s, h) \in \mathcal{S} \times [H]$. We denote $V^{\dagger, \pi_{-i}}_{h,i} := V^{\mu^\dagger(\pi_{-i}), \pi_{-i}}_{h,i}$. Let $v^{\dagger, \pi_{-i}}_i := \mathbb{E}_{s \sim d_1}\big[V^{\dagger, \pi_{-i}}_{1,i}(s)\big]$ and $v^\pi_i := \mathbb{E}_{s \sim d_1}\big[V^\pi_{1,i}(s)\big]$.

Definition 2.1 (NE). A product policy $\pi$ is a Nash equilibrium (NE) if $v^\pi_i = v^{\dagger, \pi_{-i}}_i$ for all $i \in [M]$. We call $\pi$ an $\varepsilon$-approximate NE if $\max_{i \in [M]} \{v^{\dagger, \pi_{-i}}_i - v^\pi_i\} < \varepsilon$.

The coarse correlated equilibrium (CCE) is a relaxed version of the Nash equilibrium in which we consider general correlated policies instead of product policies.
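Definition 2.1 can be checked directly in the one-step case ($H = 1$), where the game is a matrix game. The sketch below computes the NE gap $\max_i \{v^{\dagger,\pi_{-i}}_i - v^\pi_i\}$ (the exploitability) for two players; the function name and payoff matrices are illustrative.

```python
import numpy as np

# Exploitability of a product policy in a one-step two-player general-sum game:
# pi is an eps-NE iff max_i (best-response value - value) < eps.
def exploitability(R1, R2, pi1, pi2):
    """R1, R2: payoff matrices (n1 x n2); pi1, pi2: mixed strategies."""
    v1 = pi1 @ R1 @ pi2                # player 1's value under (pi1, pi2)
    v2 = pi1 @ R2 @ pi2                # player 2's value
    br1 = np.max(R1 @ pi2)             # player 1's best-response value
    br2 = np.max(pi1 @ R2)             # player 2's best-response value
    return max(br1 - v1, br2 - v2)

# Matching pennies: uniform play is the NE, so the gap is 0.
R = np.array([[1., -1.], [-1., 1.]])
u = np.array([0.5, 0.5])
print(exploitability(R, -R, u, u))  # 0.0
```

The same quantity, estimated by training a best-responding "exploiter", is the evaluation metric used in the experiments of Section 6.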

Definition 2.2 (CCE). A correlated policy $\pi$ is a CCE if $V^{\dagger, \pi_{-i}}_{h,i}(s) \le V^\pi_{h,i}(s)$ for all $s \in \mathcal{S}$, $h \in [H]$, $i \in [M]$. We call $\pi$ an $\varepsilon$-approximate CCE if $\max_{i \in [M]} \{v^{\dagger, \pi_{-i}}_i - v^\pi_i\} < \varepsilon$.

The correlated equilibrium (CE) is another relaxation of the Nash equilibrium. To define CE, we first introduce the concept of strategy modification: a strategy modification $\omega_i := \{\omega_{h,i}\}_{h \in [H]}$ for player $i$ is a set of $H$ functions from $\mathcal{S} \times \mathcal{A}_i$ to $\mathcal{A}_i$. Let $\Omega_i := \{\Omega_{h,i}\}_{h \in [H]}$ denote the set of all possible strategy modifications for player $i$. One can compose a strategy modification $\omega_i$ with any Markov policy $\pi$ to obtain a new policy $\omega_i \circ \pi$ such that, when $\pi$ chooses to play $a := (a_1, \dots, a_M)$ at state $s$ and step $h$, the policy $\omega_i \circ \pi$ plays $(a_1, \dots, a_{i-1}, \omega_{h,i}(s, a_i), a_{i+1}, \dots, a_M)$ instead.

Definition 2.3 (CE). A correlated policy $\pi$ is a CE if $\max_{i \in [M]} \max_{\omega_i \in \Omega_i} V^{\omega_i \circ \pi}_{h,i}(s) \le V^\pi_{h,i}(s)$ for all $(s, h) \in \mathcal{S} \times [H]$. We call $\pi$ an $\varepsilon$-approximate CE if $\max_{i \in [M]} \{\max_{\omega_i \in \Omega_i} v^{\omega_i \circ \pi}_i - v^\pi_i\} < \varepsilon$.

Remark 2.1. For general-sum Markov games, we have $\{\mathrm{NE}\} \subseteq \{\mathrm{CE}\} \subseteq \{\mathrm{CCE}\}$, so that they form a nested set of notions of equilibria (Roughgarden, 2010). While there exist algorithms to approximately compute the Nash equilibrium (Berg and Sandholm, 2017), computing an NE of a general-sum game is PPAD-hard in the worst case (Daskalakis, 2013). On the other hand, CCE and CE can be computed in polynomial time using linear programming (examples include Papadimitriou and Roughgarden (2008); Blum et al. (2008)). Therefore, in this paper we study both NE and these weaker equilibrium concepts, which permit more computationally efficient solutions.

Algorithm 1 General Representation Learning for Multi-player General-sum Low-Rank Markov Game with UCB-driven Exploration (GERL_MG2)
1: Input: regularizer $\lambda$, number of iterations $N$, parameters $\{\alpha^{(n)}\}_{n=1}^N$, $\{\zeta^{(n)}\}_{n=1}^N$.
2: Initialize $\pi^{(0)}$ to be uniform; set $\mathcal{D}^{(0)}_h = \emptyset$, $\tilde{\mathcal{D}}^{(0)}_h = \emptyset$ for all $h \in [H]$.
3: for episode $n = 1, 2, \dots, N$ do
4:   Set $\overline{V}^{(n)}_{H+1,i} \leftarrow 0$, $\underline{V}^{(n)}_{H+1,i} \leftarrow 0$.
5:   for step $h = H, H-1, \dots, 1$ do
6:     Collect two triples $(s, a, s')$ and $(\tilde{s}', \tilde{a}', \tilde{s}'')$ with $s \sim d^{\pi^{(n-1)}}_{P^\star,h}$, $a \sim U(\mathcal{A})$, $s' \sim P^\star_h(\cdot \mid s, a)$, and $\tilde{s} \sim d^{\pi^{(n-1)}}_{P^\star,h-1}$, $\tilde{a} \sim U(\mathcal{A})$, $\tilde{s}' \sim P^\star_{h-1}(\cdot \mid \tilde{s}, \tilde{a})$, $\tilde{a}' \sim U(\mathcal{A})$, $\tilde{s}'' \sim P^\star_h(\cdot \mid \tilde{s}', \tilde{a}')$.
7:     Update datasets: $\mathcal{D}^{(n)}_h = \mathcal{D}^{(n-1)}_h \cup \{(s, a, s')\}$, $\tilde{\mathcal{D}}^{(n)}_h = \tilde{\mathcal{D}}^{(n-1)}_h \cup \{(\tilde{s}', \tilde{a}', \tilde{s}'')\}$.
8:     Learn the representation via the model-based or model-free method: $(\hat{\phi}^{(n)}_h, \hat{P}^{(n)}_h) = \mathrm{MBREPLEARN}(\mathcal{D}^{(n)}_h \cup \tilde{\mathcal{D}}^{(n)}_h, h)$ or $\mathrm{MFREPLEARN}(\mathcal{D}^{(n)}_h \cup \tilde{\mathcal{D}}^{(n)}_h, h, \lambda)$.
9:     Compute $\hat{\beta}^{(n)}_h$ from equation 5; for each $(s, a) \in \mathcal{S} \times \mathcal{A}$, $i \in [M]$, set
       $\overline{Q}^{(n)}_{h,i}(s, a) \leftarrow r_{h,i}(s, a) + \hat{P}^{(n)}_h \overline{V}^{(n)}_{h+1,i}(s, a) + \hat{\beta}^{(n)}_h(s, a)$,
       $\underline{Q}^{(n)}_{h,i}(s, a) \leftarrow r_{h,i}(s, a) + \hat{P}^{(n)}_h \underline{V}^{(n)}_{h+1,i}(s, a) - \hat{\beta}^{(n)}_h(s, a)$.
10:    Compute $\pi^{(n)}_h$ from equation 2, equation 3, or equation 4. For each $s \in \mathcal{S}$, $i \in [M]$, set $\overline{V}^{(n)}_{h,i}(s) \leftarrow \mathbb{D}_{\pi^{(n)}_h} \overline{Q}^{(n)}_{h,i}(s)$ and $\underline{V}^{(n)}_{h,i}(s) \leftarrow \mathbb{D}_{\pi^{(n)}_h} \underline{Q}^{(n)}_{h,i}(s)$.

11:  end for
12:  Let $\Delta^{(n)} = \max_{i \in [M]} \{\overline{v}^{(n)}_i - \underline{v}^{(n)}_i\} + 2H\sqrt{A\zeta^{(n)}}$, where $\overline{v}^{(n)}_i = \int_{\mathcal{S}} \overline{V}^{(n)}_{1,i}(s)\, d_1(s)\, ds$ and $\underline{v}^{(n)}_i = \int_{\mathcal{S}} \underline{V}^{(n)}_{1,i}(s)\, d_1(s)\, ds$.
13: end for
14: Return $\hat{\pi} = \pi^{(n^\star)}$, where $n^\star = \arg\min_{n \in [N]} \Delta^{(n)}$.

Model-free Representation Learning In the model-free setting, we are only given the function class of the feature vectors, $\Phi_h$, which we assume also includes the true feature $\phi^\star_h$. Given the dataset $\mathcal{D} := \mathcal{D}^{(n)}_h \cup \tilde{\mathcal{D}}^{(n)}_h$, MFREPLEARN aims to learn a feature vector that is able to linearly fit the Bellman backup of any function $f(s')$ in an appropriately chosen discriminator function class $\mathcal{F}_h$. To be precise, we aim to optimize the following objective:
$$\min_{\phi \in \Phi_h} \max_{f \in \mathcal{F}_h} \min_{\theta} \mathbb{E}_{\mathcal{D}}\big[\big(\phi(s, a)^\top \theta - f(s')\big)^2\big] - \min_{\tilde{\theta}, \tilde{\phi} \in \Phi_h} \mathbb{E}_{\mathcal{D}}\big[\big(\tilde{\phi}(s, a)^\top \tilde{\theta} - f(s')\big)^2\big],$$
where the first term is the empirical squared loss and the second term is the conditional expectation of $f(s')$ given $(s, a)$, subtracted for the purpose of bias reduction. Once we obtain an estimate $\hat{\phi}^{(n)}_h$, we can construct a non-parametric transition model defined as:
$$\hat{P}^{(n)}_h(s' \mid s, a) = \hat{\phi}^{(n)}_h(s, a)^\top \Big(\sum_{(\tilde{s}, \tilde{a}) \in \mathcal{D}} \hat{\phi}^{(n)}_h(\tilde{s}, \tilde{a})\, \hat{\phi}^{(n)}_h(\tilde{s}, \tilde{a})^\top + \lambda I_d\Big)^{-1} \sum_{(\tilde{s}, \tilde{a}, \tilde{s}') \in \mathcal{D}} \hat{\phi}^{(n)}_h(\tilde{s}, \tilde{a})\, \mathbb{1}_{s' = \tilde{s}'}. \quad (1)$$
We show that running Least-Squares Value Iteration (LSVI) is equivalent to model-based planning inside $\hat{P}^{(n)}_h$ (line 10 of Alg. 1), and thus the model-free algorithm can be analyzed in the same way as the model-based algorithm. In practice, for applications where the raw observation states are high-dimensional, e.g. images, estimating the transition is often much harder than estimating the one-directional feature function. In such cases, we expect the $\Psi$ class to be much larger than the $\Phi$ class and the model-free approach to be more efficient.
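As a sanity check on equation 1, the following numpy sketch builds the non-parametric transition estimate from a batch of learned features and next-state indices; the feature values and data here are synthetic placeholders, not a learned representation.

```python
import numpy as np

# Sketch of the non-parametric model of equation 1: the next-state distribution
# is a ridge-regression combination of indicator targets 1{s' = s'_k}.
def nonparametric_transition(phis, next_states, n_states, lam=1.0):
    """phis: (n, d) array of phi(s_k, a_k); next_states: length-n array of s'_k."""
    n, d = phis.shape
    Sigma = phis.T @ phis + lam * np.eye(d)   # empirical covariance + lam * I_d
    # Sum over the data of phi(s_k, a_k) * 1{s' = s'_k}, one column per s'
    Y = np.zeros((d, n_states))
    for k in range(n):
        Y[:, next_states[k]] += phis[k]
    W = np.linalg.solve(Sigma, Y)             # Sigma^{-1} @ Y
    # P_hat(.|s, a) = phi(s, a)^T W; unnormalized in general due to the ridge
    return lambda phi_sa: phi_sa @ W

rng = np.random.default_rng(0)
phis = rng.random((50, 3))
s_next = rng.integers(0, 4, size=50)
P_hat = nonparametric_transition(phis, s_next, n_states=4)
print(P_hat(phis[0]).shape)  # (4,)
```

As the text notes, planning by LSVI against the learned feature is equivalent to model-based planning inside this estimated transition model.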

3.2. PLANNING

Based on the feature vector and transition probability computed in the representation learning phase, a new policy $\pi^{(n+1)}$ is computed by the planning module.

Algorithm 3 MFREPLEARN
2: Define the loss $\mathcal{L}_{\lambda, \mathcal{D}}(\phi, \theta, f) := \mathbb{E}_{\mathcal{D}}\big[\big(\phi(s, a)^\top \theta - f(s')\big)^2\big] + \lambda \|\theta\|_2^2$.
3: Compute $\hat{\phi} = \arg\min_{\phi \in \Phi_h} \max_{f \in \mathcal{F}_h} \big[\min_{\theta} \mathcal{L}_{\lambda, \mathcal{D}}(\phi, \theta, f) - \min_{\tilde{\phi} \in \Phi_h, \tilde{\theta}} \mathcal{L}_{\lambda, \mathcal{D}}(\tilde{\phi}, \tilde{\theta}, f)\big]$.
4: Return $\hat{\phi}, \hat{P}$, where $\hat{P}$ is calculated from equation 1.

The planning phase is conducted with an Upper Confidence Bound (UCB) style approach: we maintain both an optimistic and a pessimistic estimate of the value functions and Q-value functions $\overline{V}^{(n)}_{h,i}, \underline{V}^{(n)}_{h,i}, \overline{Q}^{(n)}_{h,i}, \underline{Q}^{(n)}_{h,i}$, which are computed recursively through the Bellman equation with the bonus function $\hat{\beta}^{(n)}_h$ (lines 9 and 10 of Alg. 1). Here the operator $\mathbb{D}$ is defined by $(\mathbb{D}_\pi f)(s) := \mathbb{E}_{a \sim \pi(s)}[f(s, a)]$ for all $f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and $\pi^{(n)}_h$ is the policy computed from the $M$ induced Q-value functions $\tilde{Q}^{(n)}_{h,i}$. For the model-based setting, we simply let $\tilde{Q}^{(n)}_{h,i}$ be the optimistic estimator $\overline{Q}^{(n)}_{h,i}$. For the model-free setting, for technical reasons, we instead let $\tilde{Q}^{(n)}_{h,i}$ be the nearest neighbor of $\overline{Q}^{(n)}_{h,i}$ in $\mathcal{N}_h$ with respect to the $\|\cdot\|_\infty$ metric, where $\mathcal{N}_h \subseteq \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ is a properly designed set of functions whose construction is deferred to the appendix. Depending on the problem setting, the policy $\pi^{(n)}_h$ takes one of the following formulations:
• For NE, we compute $\pi^{(n)}_h = (\pi^{(n)}_{h,1}, \pi^{(n)}_{h,2}, \dots, \pi^{(n)}_{h,M})$ such that for all $s \in \mathcal{S}$, $i \in [M]$, $\pi^{(n)}_{h,i}(\cdot \mid s) = \arg\max_{\pi_{h,i}} \mathbb{D}_{\pi_{h,i}, \pi^{(n)}_{h,-i}} \tilde{Q}^{(n)}_{h,i}(s)$.
• For CCE, we compute $\pi^{(n)}_h$ such that for all $s \in \mathcal{S}$, $i \in [M]$, $\max_{\pi_{h,i}} \mathbb{D}_{\pi_{h,i}, \pi^{(n)}_{h,-i}} \tilde{Q}^{(n)}_{h,i}(s) \le \mathbb{D}_{\pi^{(n)}_h} \tilde{Q}^{(n)}_{h,i}(s)$.
• For CE, we compute $\pi^{(n)}_h$ such that for all $s \in \mathcal{S}$, $i \in [M]$, $\max_{\omega_{h,i} \in \Omega_{h,i}} \mathbb{D}_{\omega_{h,i} \circ \pi^{(n)}_h} \tilde{Q}^{(n)}_{h,i}(s) \le \mathbb{D}_{\pi^{(n)}_h} \tilde{Q}^{(n)}_{h,i}(s)$.
Without loss of generality, we assume the solution to the above formulations is unique; if there are multiple solutions, one can always adopt a deterministic selection rule that outputs the same policy given the same inputs. Note that although the policy is computed using only the optimistic estimates, we still maintain a pessimistic estimator, which is used to estimate the optimality gap $\Delta^{(n)}$ of the current policy. The algorithm's output policy $\hat{\pi}$ is chosen to be the one with the minimum estimated optimality gap.
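The per-state CCE condition used in the planning step can be verified numerically. Below is a minimal numpy check for a one-shot two-player game, assuming synthetic Q tables and an illustrative function name; for a zero-sum game the uniform correlated policy at the NE gives a gap of zero.

```python
import numpy as np

# CCE deviation gap for a correlated policy pi over joint actions: for each
# player i, compare its value under pi with the best fixed deviation while the
# other player keeps its marginal of pi.
def cce_gap(Q_list, pi_joint):
    """Q_list: [Q1, Q2], each (n1, n2); pi_joint: (n1, n2) correlated policy."""
    value0 = float((pi_joint * Q_list[0]).sum())   # player 1's value under pi
    value1 = float((pi_joint * Q_list[1]).sum())   # player 2's value under pi
    marg2 = pi_joint.sum(axis=0)                   # marginal over player 2's actions
    marg1 = pi_joint.sum(axis=1)                   # marginal over player 1's actions
    gap1 = np.max(Q_list[0] @ marg2) - value0      # player 1 deviates to a row
    gap2 = np.max(marg1 @ Q_list[1]) - value1      # player 2 deviates to a column
    return max(gap1, gap2)

# Matching pennies at its NE: the CCE condition holds with gap 0.
R = np.array([[1., -1.], [-1., 1.]])
pi = np.outer([0.5, 0.5], [0.5, 0.5])
print(cce_gap([R, -R], pi))  # 0.0
```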

The bonus term $\hat{\beta}^{(n)}_h$ is a linear bandit style bonus computed using the learned feature $\hat{\phi}$:
$$\hat{\beta}^{(n)}_h(s, a) := \min\Big\{\alpha^{(n)} \big\|\hat{\phi}^{(n)}_h(s, a)\big\|_{(\hat{\Sigma}^{(n)}_h)^{-1}},\ H\Big\},$$
where $\hat{\Sigma}^{(n)}_h := \sum_{(s,a) \in \mathcal{D}^{(n)}_h} \hat{\phi}^{(n)}_h(s, a)\, \hat{\phi}^{(n)}_h(s, a)^\top + \lambda I_d$ is the empirical covariance matrix.
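For concreteness, the clipped elliptical bonus can be computed as follows; the data and parameter values are synthetic, and `ucb_bonus` is an illustrative name.

```python
import numpy as np

# Linear-bandit style bonus: min(alpha * ||phi(s,a)||_{Sigma^{-1}}, H), where
# Sigma is the regularized empirical feature covariance over the dataset.
def ucb_bonus(phi_sa, phis_data, alpha=1.0, H=10.0, lam=1.0):
    d = phi_sa.shape[0]
    Sigma = phis_data.T @ phis_data + lam * np.eye(d)
    quad = phi_sa @ np.linalg.solve(Sigma, phi_sa)   # phi^T Sigma^{-1} phi
    return min(alpha * np.sqrt(quad), H)

rng = np.random.default_rng(1)
data = rng.random((100, 5))   # 100 feature vectors phi(s, a) of dimension 5
b = ucb_bonus(rng.random(5), data)
print(0.0 <= b <= 10.0)  # True: the bonus is clipped into [0, H]
```

The bonus shrinks for directions of feature space that the dataset already covers well, which is what drives the UCB-style exploration in Alg. 1.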

4. THEORETICAL RESULTS

In this section, we provide the theoretical guarantees of the proposed algorithm for both the model-based and model-free approaches. We denote $|\mathcal{M}| := \max_{h \in [H]} |\mathcal{M}_h|$ and $|\Phi| := \max_{h \in [H]} |\Phi_h|$. The first theorem provides a guarantee on the sample complexity of the model-based method.

Algorithm 4 Model-based representation learning for factored Markov games
2: Set $(\hat{w}_i, \hat{\phi}_i) := \arg\max_{(w, \phi) \in \mathcal{M}_{h,i}} \mathbb{E}_{\mathcal{D}}\big[\log w(s'_i)^\top \phi(s[Z_i], a_i)\big]$ for each $i \in [M]$.
3: Return $\{\hat{\phi}_i\}_{i=1}^M$ and $\hat{P}$, where $\hat{P}(s' \mid s, a) = \prod_{i=1}^M \hat{w}_i(s'_i)^\top \hat{\phi}_i(s[Z_i], a_i)$.

Theorem 4.1 (PAC guarantee of Algorithm 1, model-based). When Alg. 1 is applied with the model-based representation learning algorithm Alg. 2, with parameters $\lambda = \Theta\big(d \log(NH|\Phi|/\delta)\big)$, $\alpha^{(n)} = \Theta\big(Hd\sqrt{A \log(|\mathcal{M}|HN/\delta)}\big)$, $\zeta^{(n)} = \Theta\big(n^{-1} \log(|\mathcal{M}|HN/\delta)\big)$, by setting the number of episodes $N$ to be at most $O\big(H^6 d^4 A^2 \varepsilon^{-2} \log^2(HdA|\mathcal{M}|/\delta\varepsilon)\big)$, with probability $1 - \delta$, the output policy $\hat{\pi}$ is an $\varepsilon$-approximate {NE, CCE, CE}.

Theorem 4.1 shows that GERL_MG2 can find an $\varepsilon$-approximate {NE, CCE, CE} by running the algorithm for at most $\tilde{O}(H^6 d^4 A^2 \varepsilon^{-2})$ episodes, which depends polynomially on the parameters $H, d, A, \varepsilon^{-1}$ and only logarithmically on the cardinality of the model class $|\mathcal{M}|$. In particular, when reducing the Markov game to the single-agent MDP setting, the sample complexity of the model-based approach matches the result provided in (Uehara et al., 2021), which is known to have the best sample complexity among all oracle-efficient algorithms for low-rank MDPs. For model-free representation learning, we have the following guarantee:

Theorem 4.2 (PAC guarantee of Algorithm 1, model-free). When Alg. 1 is applied with the model-free representation learning algorithm Alg. 3, with parameters $\lambda = \Theta\big(d \log(NH|\Phi|/\delta)\big)$, $\alpha^{(n)} = \Theta\big(HAd\sqrt{M \log(dNHAM|\Phi|/\delta)}\big)$, $\zeta^{(n)} = \Theta\big(d^2 A n^{-1} \log(dNHAM|\Phi|/\delta)\big)$, by setting the number of episodes $N$ to be at most $O\big(H^6 d^4 A^3 M \varepsilon^{-2} \log^2(HdAM|\Phi|/\delta\varepsilon)\big)$, for an appropriately designed function class $\{\mathcal{N}_h\}_{h=1}^H$ and discriminator class $\{\mathcal{F}_h\}_{h=1}^H$, with probability $1 - \delta$, the output policy $\hat{\pi}$ is an $\varepsilon$-approximate {NE, CCE, CE}.
For the model-free block Markov game setting, the number of episodes required to find an $\varepsilon$-approximate {NE, CCE, CE} becomes $\tilde{O}(H^6 d^4 A^3 M \varepsilon^{-2})$. While this has a worse dependency than the model-based approach, the advantage of the model-free approach is that it does not require the full model class of the transition probability, only the model class of the feature vector, which applies to a wider range of RL problems. The proofs of Theorem 4.1 and Theorem 4.2 are deferred to Appendices B and C. Theorem 4.1 and Theorem 4.2 show that GERL_MG2 learns low-rank Markov games in a statistically efficient and oracle-efficient manner. We also remark that our modular analysis can be of independent theoretical interest. Unlike prior works that make heavy distinctions between model-based and model-free approaches, e.g. (Liu et al., 2021), we show that both approaches can be analyzed in a unified manner.

5. FACTORED MARKOV GAMES

The result in Theorem 4.1 is tractable in games with a moderate number of players. However, in applications with a large number of players, such as autonomous traffic control, the total number of players in the game can be so large that the joint action space size $A = \tilde{A}^M$ dominates all other factors in the sample complexity bound. This exponential scaling with the number of players is sometimes referred to as the curse of multi-player. The only known class of algorithms that overcomes this challenge in Markov games is V-learning (see, e.g., Bai et al., 2020; Jin et al., 2021b), a value-based method that fits the V-function rather than the Q-function, thus removing the dependency on the action space size. However, V-learning only works for tabular Markov games with finite state and action spaces. Extending V-learning to the function approximation setting is highly non-trivial, because even in the single-agent setting, no known algorithm can achieve sample-efficient learning in MDPs while performing function approximation only on the V-function. In this section we take a different approach, which relies on the following observation. In a setting where the number of agents is large, there is often a spatial correlation among the agents, such that each agent's local state is immediately affected only by the agent's own action and the states of agents in its adjacency. For example, in smart traffic control, a vehicle's local environment is immediately affected only by the states of the vehicles around it; it takes time for the course of actions of a vehicle from afar to propagate its influence to the vehicle of reference. Such spatial structure motivates the definition of a factored Markov game. In a factored Markov game, each agent $i$ has a local state $s_i$, whose transition is affected by agent $i$'s action $a_i$ and the states of the agents in its neighborhood $Z_i$.
We remark that the factored Markov game structure still allows an agent to be affected by all other agents in the long run, as long as the directed graph defined by the neighborhood sets $Z_i$ is connected. In particular, we have:

Definition 5.1 (Low-Rank Factored Markov Game). We call a Markov game a low-rank factored Markov game if for any $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$, $h \in [H]$, we have
$$P^\star_h(s' \mid s, a) = \prod_{i=1}^M \phi^\star_{h,i}(s[Z_i], a_i)^\top w^\star_{h,i}(s'_i),$$
where $Z_i \subseteq [M]$, $\phi^\star_{h,i}(s[Z_i], a_i), w^\star_{h,i}(s'_i) \in \mathbb{R}^d$, $\|\phi^\star_{h,i}(s[Z_i], a_i)\|_2 \le 1$, and $\|w^\star_{h,i}(s'_i)\|_2 \le \sqrt{d}$ for all $(s[Z_i], a_i, s'_i)$. We assume $|Z_i| \le L$ for all $i \in [M]$. And we are given a group of model classes $\mathcal{M}_{h,i}$, $h \in [H]$, $i \in [M]$, such that $(\phi^\star_{h,i}, w^\star_{h,i}) \in \mathcal{M}_{h,i}$.

We are now ready to present our algorithm and result in the low-rank factored Markov game setting. Surprisingly, the same algorithm GERL_MG2 works in this setting, with the representation learning module Alg. 2 replaced by Alg. 4, and a few changes of variables. For simplicity, we focus on the model-based version. Define $\hat{\phi}^{(n)}_{h,i}(s, a) = \bigotimes_{j \in Z_i} \hat{\phi}^{(n)}_{h,j}(s[Z_j], a_j) \in \mathbb{R}^{d^{|Z_i|}}$, where $\otimes$ denotes the Kronecker product. Let
$$\hat{\beta}^{(n)}_h(s, a) := \sum_{i=1}^M \min\Big\{\alpha^{(n)} \big\|\hat{\phi}^{(n)}_{h,i}(s, a)\big\|_{(\hat{\Sigma}^{(n)}_{h,i})^{-1}},\ H\Big\}, \qquad \Delta^{(n)} := \max_{i \in [M]} \{\overline{v}^{(n)}_i - \underline{v}^{(n)}_i\} + 2HM\sqrt{\tilde{A}\zeta^{(n)}},$$
where $\hat{\Sigma}^{(n)}_{h,i} := \sum_{(s,a) \in \mathcal{D}^{(n)}_h} \hat{\phi}_{h,i}(s, a)\, \hat{\phi}_{h,i}(s, a)^\top + \lambda I_{d^{|Z_i|}}$. Then, GERL_MG2 with $\hat{\phi}$ and the newly defined $\hat{\beta}^{(n)}, \Delta^{(n)}$ achieves the following guarantee:

Theorem 5.1 (PAC guarantee of GERL_MG2 in low-rank factored Markov games). When Alg. 1 is applied with the model-based representation learning algorithm Alg. 4, with $L = O(1)$ and parameters $\lambda = \Theta\big(Ld^L \log(NHM|\Phi|/\delta)\big)$, $\alpha^{(n)} = \Theta\big(H\tilde{A}d^L\sqrt{L \log(|\mathcal{M}|HNM/\delta)}\big)$, $\zeta^{(n)} = \Theta\big(n^{-1} \log(|\mathcal{M}|HNM/\delta)\big)$, by setting the number of episodes $N$ to be at most $O\big(M^4 H^6 d^{2(L+1)^2} \tilde{A}^{2(L+1)} \varepsilon^{-2} \log^2(HdALM|\mathcal{M}|/\delta\varepsilon)\big)$, with probability $1 - \delta$, the output policy $\hat{\pi}$ is an $\varepsilon$-approximate {NE, CCE, CE}. Remark 5.1.
This sample complexity scales only with $\exp(L)$, where $L$ is the degree of the connection graph, which is assumed to be $O(1)$ in Definition 5.1 and is in general much smaller than the total number of agents in practice. We remark that the factored structure has also been studied in single-agent tabular MDPs (examples include Chen et al. (2020); Kearns and Koller (1999); Guestrin et al. (2002, 2003); Strehl et al. (2007)). Chen et al. (2020) provided a lower bound showing that the exponential dependency on $L$ is unimprovable in the worst case. Therefore, our bound here is also nearly tight, up to polynomial factors.
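The neighborhood feature $\hat{\phi}^{(n)}_{h,i}$ used above is a Kronecker product over $Z_i$, so its dimension is $d^{|Z_i|} \le d^L$ regardless of the total number of players $M$. A minimal sketch with synthetic per-neighbor features:

```python
import numpy as np
from functools import reduce

# Kronecker aggregation of neighborhood features: the feature of player i is
# the Kronecker product of the features phi_j for j in Z_i, so its dimension
# is d ** |Z_i| and never grows with the total player count M.
def factored_feature(neighbor_feats):
    """neighbor_feats: list of |Z_i| feature vectors, each in R^d."""
    return reduce(np.kron, neighbor_feats)

d, L = 3, 2
feats = [np.random.default_rng(k).random(d) for k in range(L)]
phi_i = factored_feature(feats)
print(phi_i.shape)  # (9,) since d ** L = 9
```

This is how the factored algorithm sidesteps the curse of multi-player: the covariance matrices and bonuses operate in dimension $d^L$, not $\tilde{A}^M$.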

6. EXPERIMENT

In this section we investigate our algorithm with proof-of-concept empirical studies. We design our test bed using rich-observation Markov games with arbitrary latent transitions and rewards. To

Zero-sum Markov game

In this section we first show the empirical evaluations in the two-player zero-sum Markov game setting. For an environment with horizon H, the randomly generated matrix R denotes the reward for player 1 and -R⊤ denotes the reward for player 2, respectively. For the zero-sum setting, we designed two variants of Block Markov games: one with a short horizon (H = 3) and one with a long horizon (H = 10). We show in the following that GERL_MG2 works in both settings, whereas the baseline only works in the short-horizon setting.

Baseline

We adopt an open-sourced implementation of DQN (Silver et al., 2016) with fictitious self-play (Heinrich et al., 2015). We track the exploitability of the returned strategy to evaluate the practical performance of the baselines. In the zero-sum setting, we only need to fix one agent (e.g., agent 2), train the other agent (the exploiter) to maximize its corresponding return until convergence, and report the difference between the exploiter's return and the final return of the final policies. We report the exploitability in Table 1 and provide training curves in Appendix F.3 for completeness. We note that, compared with the deep RL baseline, GERL_MG2 shows faster and more stable convergence in both environments, whereas the baseline is unstable during training and has much larger exploitability. General-sum Markov game. In this section we move on to the general-sum setting. To the best of our knowledge, our algorithm is the only principled algorithm that can be implemented at scale in the general-sum setting. For the general-sum setting, we cannot simply compare our returned value to the oracle NE values, because multiple NE/CCE values may exist. Instead, we track the exploitability of the policy and plot the training curve of the exploitability in Fig. 2 (deferred to Appendix F). Note that in this case we need to test both policies, since their reward matrices are independently sampled.

7. DISCUSSION AND FUTURE WORKS

In this paper, we present the first algorithm that solves general-sum Markov games under function approximation. We provide both a model-based and a model-free variant of the algorithm and present a unified analysis. Empirically, we show that our algorithm outperforms existing deep RL baselines on a general benchmark with rich observations. Future work includes evaluating on more challenging benchmarks and extending beyond the low-rank Markov game structure.

REPRODUCIBILITY STATEMENT

For theory, we provide proofs and additional results in the Appendix. For empirical results, we provide implementation and environment details and hyperparameters in the Appendix. We also submit anonymous code in the supplemental materials.

A ADDITIONAL NOTATIONS

Define
$$\rho^{(n)}_h(s, a) = \frac{1}{n}\sum_{i=1}^n d^{\pi^{(i)}}_{P^\star,h}(s)\, u_{\mathcal{A}}(a), \qquad \tilde{\rho}^{(n)}_h(s, a) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\tilde{s} \sim d^{\pi^{(i)}}_{P^\star,h-1},\, \tilde{a} \sim U(\mathcal{A})}\big[P^\star(s \mid \tilde{s}, \tilde{a})\, u_{\mathcal{A}}(a)\big], \qquad \gamma^{(n)}_h(s, a) = \frac{1}{n}\sum_{i=1}^n d^{\pi^{(i)}}_{P^\star,h}(s, a).$$
When we use the expectation $\mathbb{E}_{(s,a) \sim \rho}[f(s, a)]$ (or $\mathbb{E}_{s \sim \rho}[f(s)]$) for some (possibly unnormalized) distribution $\rho$ and function $f$, we simply mean $\sum_{s \in \mathcal{S}, a \in \mathcal{A}} \rho(s, a) f(s, a)$ (or $\sum_{s \in \mathcal{S}} \rho(s) f(s)$), so that the expectation extends naturally to unnormalized distributions. For an iteration $n$, a distribution $\rho$, and a feature $\phi$, we denote the expected feature covariance by $\Sigma_{n, \rho, \phi} = n\, \mathbb{E}_{(s,a) \sim \rho}\big[\phi(s, a)\phi(s, a)^\top\big] + \lambda I_d$. Meanwhile, define the empirical covariance by $\hat{\Sigma}^{(n)}_{h,\phi} := \sum_{(s,a) \in \mathcal{D}^{(n)}_h} \phi(s, a)\phi(s, a)^\top + \lambda I_d$.

B ANALYSIS OF THE MODEL-BASED METHOD B.1 HIGH PROBABILITY EVENTS

We define the following events:
$$\mathcal{E}_1: \ \forall n \in [N],\ h \in [H],\ \rho \in \{\rho^{(n)}_h, \tilde{\rho}^{(n)}_h\}: \quad \mathbb{E}_{(s,a)\sim\rho}\big\|\hat{P}^{(n)}_h(\cdot \mid s, a) - P^\star_h(\cdot \mid s, a)\big\|_1^2 \le \zeta^{(n)},$$
$$\mathcal{E}_2: \ \forall n \in [N],\ h \in [H],\ \phi_h \in \Phi_h,\ s \in \mathcal{S},\ a \in \mathcal{A}: \quad \|\phi_h(s, a)\|_{(\hat{\Sigma}^{(n)}_{h,\phi_h})^{-1}} = \Theta\Big(\|\phi_h(s, a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\phi_h}}\Big),$$
and $\mathcal{E} := \mathcal{E}_1 \cap \mathcal{E}_2$. To prove that $\mathcal{E}$ holds with high probability, we first introduce the following MLE guarantee, whose original version can be found in (Agarwal et al., 2020b):

Lemma B.1 (MLE guarantee). For a fixed episode $n$ and any step $h$, with probability $1 - \delta$,
$$\mathbb{E}_{(s,a)\sim \frac{1}{2}\rho^{(n)}_h + \frac{1}{2}\tilde{\rho}^{(n)}_h}\big\|\hat{P}^{(n)}_h(\cdot \mid s, a) - P^\star_h(\cdot \mid s, a)\big\|_1^2 \lesssim \frac{1}{n}\log\frac{|\mathcal{M}|}{\delta}.$$
As a straightforward corollary, with probability $1 - \delta$, for all $n \in \mathbb{N}_+$ and $h \in [H]$,
$$\mathbb{E}_{(s,a)\sim \frac{1}{2}\rho^{(n)}_h + \frac{1}{2}\tilde{\rho}^{(n)}_h}\big\|\hat{P}^{(n)}_h(\cdot \mid s, a) - P^\star_h(\cdot \mid s, a)\big\|_1^2 \lesssim \frac{1}{n}\log\frac{nH|\mathcal{M}|}{\delta}.$$
Proof. See (Agarwal et al., 2020b).

Lemma B.3 (One-step back inequality for the learned model). Consider a set of functions $\{g_h\}_{h=1}^H$ with $g_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$ such that $\|g_h\|_\infty \le B$. Then for any given policy $\pi$, under event $\mathcal{E}$,
$$\mathbb{E}_{(s,a)\sim d^\pi_{\hat{P}^{(n)},h}}[g_h(s, a)] \le \begin{cases} \sqrt{A\, \mathbb{E}_{(s,a)\sim\rho^{(n)}_1}[g_1^2(s, a)]}, & h = 1, \\[4pt] \mathbb{E}_{(\tilde{s},\tilde{a})\sim d^\pi_{\hat{P}^{(n)},h-1}}\Big[\min\Big\{\|\hat{\phi}^{(n)}_{h-1}(\tilde{s}, \tilde{a})\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat{\phi}^{(n)}_{h-1}}}\sqrt{nA\, \mathbb{E}_{(s,a)\sim\tilde{\rho}^{(n)}_h}[g_h^2(s, a)] + B^2\lambda d + B^2 n\zeta^{(n)}},\ B\Big\}\Big], & h \ge 2. \end{cases}$$
Recall $\Sigma_{n,\rho^{(n)}_h,\hat{\phi}^{(n)}_h} = n\, \mathbb{E}_{(s,a)\sim\rho^{(n)}_h}\big[\hat{\phi}^{(n)}_h(s, a)\hat{\phi}^{(n)}_h(s, a)^\top\big] + \lambda I_d$.

Proof. For step $h = 1$, we have
$$\mathbb{E}_{(s,a)\sim d^\pi_{\hat{P}^{(n)},1}}[g_1(s, a)] = \mathbb{E}_{s\sim d_1, a\sim\pi_1(s)}[g_1(s, a)] \le \sqrt{\max_{(s,a)}\frac{d_1(s)\pi_1(a \mid s)}{\rho^{(n)}_1(s, a)}\, \mathbb{E}_{(s',a')\sim\rho^{(n)}_1}[g_1^2(s', a')]} = \sqrt{\max_{(s,a)}\frac{d_1(s)\pi_1(a \mid s)}{d_1(s)u_{\mathcal{A}}(a)}\, \mathbb{E}_{(s',a')\sim\rho^{(n)}_1}[g_1^2(s', a')]} \le \sqrt{A\, \mathbb{E}_{(s,a)\sim\rho^{(n)}_1}[g_1^2(s, a)]}.$$
For steps $h = 2, \dots, H-1$, we observe the following one-step-back decomposition:
$$\mathbb{E}_{(s,a)\sim d^\pi_{\hat{P}^{(n)},h}}[g_h(s, a)] = \mathbb{E}_{(\tilde{s},\tilde{a})\sim d^\pi_{\hat{P}^{(n)},h-1},\, s\sim\hat{P}^{(n)}_{h-1}(\tilde{s},\tilde{a}),\, a\sim\pi_h(s)}[g_h(s, a)] = \mathbb{E}_{(\tilde{s},\tilde{a})\sim d^\pi_{\hat{P}^{(n)},h-1}}\Big[\hat{\phi}^{(n)}_{h-1}(\tilde{s}, \tilde{a})^\top \int_{\mathcal{S}}\sum_{a\in\mathcal{A}}\hat{w}^{(n)}_{h-1}(s)\pi_h(a \mid s)g_h(s, a)\, ds\Big]$$
$$= \mathbb{E}_{(\tilde{s},\tilde{a})\sim d^\pi_{\hat{P}^{(n)},h-1}}\Big[\min\Big\{\hat{\phi}^{(n)}_{h-1}(\tilde{s}, \tilde{a})^\top \int_{\mathcal{S}}\sum_{a\in\mathcal{A}}\hat{w}^{(n)}_{h-1}(s)\pi_h(a \mid s)g_h(s, a)\, ds,\ B\Big\}\Big]$$
$$\le \mathbb{E}_{(\tilde{s},\tilde{a})\sim d^\pi_{\hat{P}^{(n)},h-1}}\Big[\min\Big\{\|\hat{\phi}^{(n)}_{h-1}(\tilde{s}, \tilde{a})\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat{\phi}^{(n)}_{h-1}}}\Big\|\int_{\mathcal{S}}\sum_{a\in\mathcal{A}}\hat{w}^{(n)}_{h-1}(s)\pi_h(a \mid s)g_h(s, a)\, ds\Big\|_{\Sigma_{n,\rho^{(n)}_{h-1},\hat{\phi}^{(n)}_{h-1}}},\ B\Big\}\Big],$$
where we use the fact that $g_h$ is bounded by $B$. Then,
$$\Big\|\int_{\mathcal{S}}\sum_{a\in\mathcal{A}}\hat{w}^{(n)}_{h-1}(s)\pi_h(a \mid s)g_h(s, a)\, ds\Big\|^2_{\Sigma_{n,\rho^{(n)}_{h-1},\hat{\phi}^{(n)}_{h-1}}}$$
$$\le \Big(\int_{\mathcal{S}}\sum_{a\in\mathcal{A}}\hat{w}^{(n)}_{h-1}(s)\pi_h(a \mid s)g_h(s, a)\, ds\Big)^\top\Big(n\, \mathbb{E}_{(s,a)\sim\rho^{(n)}_{h-1}}\big[\hat{\phi}^{(n)}_{h-1}(s, a)\hat{\phi}^{(n)}_{h-1}(s, a)^\top\big] + \lambda I_d\Big)\Big(\int_{\mathcal{S}}\sum_{a\in\mathcal{A}}\hat{w}^{(n)}_{h-1}(s)\pi_h(a \mid s)g_h(s, a)\, ds\Big)$$
$$\le n\, \mathbb{E}_{(\tilde{s},\tilde{a})\sim\rho^{(n)}_{h-1}}\Big[\Big(\int_{\mathcal{S}}\sum_{a\in\mathcal{A}}\hat{w}^{(n)}_{h-1}(s)^\top\hat{\phi}^{(n)}_{h-1}(\tilde{s}, \tilde{a})\pi_h(a \mid s)g_h(s, a)\, ds\Big)^2\Big] + B^2\lambda d \qquad \big(\|\textstyle\sum_{a\in\mathcal{A}}\pi_h(a \mid s)g_h(s, a)\|_\infty \le B \text{ and by assumption } \|\hat{w}^{(n)}_{h-1}(s)\|_2 \le \sqrt{d}\big)$$
$$= n\, \mathbb{E}_{(\tilde{s},\tilde{a})\sim\rho^{(n)}_{h-1}}\Big[\big(\mathbb{E}_{s\sim\hat{P}^{(n)}_{h-1}(\tilde{s},\tilde{a}),\, a\sim\pi_h(s)}[g_h(s, a)]\big)^2\Big] + B^2\lambda d$$
$$\le n\, \mathbb{E}_{(\tilde{s},\tilde{a})\sim\rho^{(n)}_{h-1}}\Big[\big(\mathbb{E}_{s\sim P^\star_{h-1}(\tilde{s},\tilde{a}),\, a\sim\pi_h(s)}[g_h(s, a)]\big)^2\Big] + B^2\lambda d + B^2 n\zeta^{(n)} \qquad (\text{event } \mathcal{E})$$
$$\le n\, \mathbb{E}_{(\tilde{s},\tilde{a})\sim\rho^{(n)}_{h-1},\, s\sim P^\star_{h-1}(\tilde{s},\tilde{a}),\, a\sim\pi_h(s)}\big[g_h^2(s, a)\big] + B^2\lambda d + B^2 n\zeta^{(n)} \qquad (\text{Jensen})$$
$$\le nA\, \mathbb{E}_{(\tilde{s},\tilde{a})\sim\rho^{(n)}_{h-1},\, s\sim P^\star_{h-1}(\tilde{s},\tilde{a}),\, a\sim U(\mathcal{A})}\big[g_h^2(s, a)\big] + B^2\lambda d + B^2 n\zeta^{(n)} \qquad (\text{importance sampling})$$
$$\le nA\, \mathbb{E}_{(s,a)\sim\tilde{\rho}^{(n)}_h}\big[g_h^2(s, a)\big] + B^2\lambda d + B^2 n\zeta^{(n)}. \qquad (\text{definition of } \tilde{\rho}^{(n)}_h)$$
Combining the above results together, we get
$$\mathbb{E}_{(s,a)\sim d^\pi_{\hat{P}^{(n)},h}}[g_h(s, a)] \le \mathbb{E}_{(\tilde{s},\tilde{a})\sim d^\pi_{\hat{P}^{(n)},h-1}}\Big[\min\Big\{\|\hat{\phi}^{(n)}_{h-1}(\tilde{s}, \tilde{a})\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat{\phi}^{(n)}_{h-1}}}\sqrt{nA\, \mathbb{E}_{(s,a)\sim\tilde{\rho}^{(n)}_h}[g_h^2(s, a)] + B^2\lambda d + B^2 n\zeta^{(n)}},\ B\Big\}\Big],$$
which finishes the proof.

Lemma B.4 (One-step back inequality for the true model). Consider a set of functions $\{g_h\}_{h=1}^H$ with $g_h:\mathcal S\times\mathcal A\to\mathbb R_+$ and $\|g_h\|_\infty\le B$. Then for any given policy $\pi$, we have
\[
\mathbb E_{(s,a)\sim d^\pi_{P^\star,h}}[g_h(s,a)]\le
\begin{cases}
\sqrt{A\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}[g_1^2(s,a)]}, & h=1,\\[4pt]
\mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{P^\star,h-1}}\Big[\big\|\phi^\star_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h-1},\phi^\star_{h-1}}}\Big]\sqrt{nA\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}[g_h^2(s,a)]+B^2\lambda d}, & h\ge2.
\end{cases}
\]
Recall $\Sigma_{n,\gamma^{(n)}_h,\phi^\star_h}=n\,\mathbb E_{(s,a)\sim\gamma^{(n)}_h}\big[\phi^\star_h(s,a)\phi^\star_h(s,a)^\top\big]+\lambda I_d$.

Proof. For step $h=1$, we have
\begin{align*}
\mathbb E_{(s,a)\sim d^\pi_{P^\star,1}}[g_1(s,a)]&=\mathbb E_{s\sim d_1,a\sim\pi_1(s)}[g_1(s,a)]\le\sqrt{\max_{(s,a)}\frac{d_1(s)\pi_1(a|s)}{\rho^{(n)}_1(s,a)}\,\mathbb E_{(s',a')\sim\rho^{(n)}_1}[g_1^2(s',a')]}\\
&=\sqrt{\max_{(s,a)}\frac{d_1(s)\pi_1(a|s)}{d_1(s)u_{\mathcal A}(a)}\,\mathbb E_{(s',a')\sim\rho^{(n)}_1}[g_1^2(s',a')]}\le\sqrt{A\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}[g_1^2(s,a)]}.
\end{align*}
For step $h\ge2$, we observe the following one-step-back decomposition:
\begin{align*}
\mathbb E_{(s,a)\sim d^\pi_{P^\star,h}}[g_h(s,a)]&=\mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{P^\star,h-1},\,s\sim P^\star_{h-1}(\tilde s,\tilde a),\,a\sim\pi_h(s)}[g_h(s,a)]\\
&=\mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{P^\star,h-1}}\Big[\phi^\star_{h-1}(\tilde s,\tilde a)^\top\int_{\mathcal S}\sum_{a\in\mathcal A}w^\star_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds\Big]\\
&\le\mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{P^\star,h-1}}\Big[\big\|\phi^\star_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h-1},\phi^\star_{h-1}}}\Big]\cdot\Big\|\int_{\mathcal S}\sum_{a\in\mathcal A}w^\star_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds\Big\|_{\Sigma_{n,\gamma^{(n)}_{h-1},\phi^\star_{h-1}}}.
\end{align*}
Then,
\begin{align*}
&\Big\|\int_{\mathcal S}\sum_{a\in\mathcal A}w^\star_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds\Big\|^2_{\Sigma_{n,\gamma^{(n)}_{h-1},\phi^\star_{h-1}}}\\
&\quad=\Big(\int_{\mathcal S}\sum_aw^\star_{h-1}\pi_hg_h\,ds\Big)^\top\Big(n\,\mathbb E_{(s,a)\sim\gamma^{(n)}_{h-1}}\big[\phi^\star_{h-1}\phi^{\star\top}_{h-1}\big]+\lambda I_d\Big)\Big(\int_{\mathcal S}\sum_aw^\star_{h-1}\pi_hg_h\,ds\Big)\\
&\quad\le n\,\mathbb E_{(\tilde s,\tilde a)\sim\gamma^{(n)}_{h-1}}\Big(\int_{\mathcal S}\sum_{a\in\mathcal A}w^\star_{h-1}(s)^\top\phi^\star_{h-1}(\tilde s,\tilde a)\pi_h(a|s)g_h(s,a)\,ds\Big)^2+B^2\lambda d\\
&\qquad\qquad\big(\text{using }\big\|\textstyle\sum_a\pi_h(a|s)g_h(s,a)\big\|_\infty\le B\text{ and }\|w^\star_{h-1}(s)\|_2\le\sqrt d\big)\\
&\quad=n\,\mathbb E_{(\tilde s,\tilde a)\sim\gamma^{(n)}_{h-1}}\Big(\mathbb E_{s\sim P^\star_{h-1}(\tilde s,\tilde a),a\sim\pi_h(s)}[g_h(s,a)]\Big)^2+B^2\lambda d\\
&\quad\le n\,\mathbb E_{(\tilde s,\tilde a)\sim\gamma^{(n)}_{h-1},\,s\sim P^\star_{h-1}(\tilde s,\tilde a),\,a\sim\pi_h(s)}\big[g_h^2(s,a)\big]+B^2\lambda d &&(\text{Jensen})\\
&\quad\le nA\,\mathbb E_{(\tilde s,\tilde a)\sim\gamma^{(n)}_{h-1},\,s\sim P^\star_{h-1}(\tilde s,\tilde a),\,a\sim U(\mathcal A)}\big[g_h^2(s,a)\big]+B^2\lambda d &&(\text{importance sampling})\\
&\quad\le nA\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[g_h^2(s,a)\big]+B^2\lambda d. &&(\text{definition of }\rho^{(n)}_h)
\end{align*}
Combining the above results, we get
\[
\mathbb E_{(s,a)\sim d^\pi_{P^\star,h}}[g_h(s,a)]\le\mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{P^\star,h-1}}\Big[\big\|\phi^\star_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h-1},\phi^\star_{h-1}}}\Big]\sqrt{nA\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}[g_h^2(s,a)]+B^2\lambda d},
\]
which finishes the proof.
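Both one-step-back lemmas hinge on the weighted Cauchy–Schwarz inequality $x^\top y\le\|x\|_{\Sigma^{-1}}\|y\|_\Sigma$ for positive definite $\Sigma$, with $x$ the feature vector and $y$ the integrated weight vector. A minimal numerical check of this inequality (illustrative sizes, random data; not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 5, 1.0

# Sigma = sum of feature outer products + lam * I, as in Sigma_{n, rho, phi}
Phi = rng.random((50, d))
Sigma = Phi.T @ Phi + lam * np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)

x = rng.random(d)  # plays the role of phi_{h-1}(s~, a~)
y = rng.random(d)  # plays the role of the integrated weight vector

# x^T y = (Sigma^{-1/2} x)^T (Sigma^{1/2} y) <= ||x||_{Sigma^{-1}} * ||y||_{Sigma}
lhs = x @ y
rhs = np.sqrt(x @ Sigma_inv @ x) * np.sqrt(y @ Sigma @ y)
assert lhs <= rhs + 1e-12
```

Since the bound is exactly Cauchy–Schwarz after the change of variables $x\mapsto\Sigma^{-1/2}x$, $y\mapsto\Sigma^{1/2}y$, it holds for any positive definite $\Sigma$.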

Lemma B.5 (Optimism for NE and CCE). Consider an episode $n\in[N]$ and set $\alpha^{(n)}=\Theta\big(H\sqrt{nA\zeta^{(n)}+d\lambda}\big)$. When the event $\mathcal E$ holds and the policy $\pi^{(n)}$ is computed by solving NE or CCE, we have
\[
\overline v^{(n)}_i-v^{\dagger,\pi^{(n)}_{-i}}_i\ge-H\sqrt{A\zeta^{(n)}},\qquad\forall n\in[N],\ i\in[M].
\]
Proof. Define $\tilde\mu^{(n)}_{h,i}(\cdot|s):=\arg\max_\mu\big(\mathbb D_{\mu\times\pi^{(n)}_{h,-i}}Q^{\dagger,\pi^{(n)}_{-i}}_{h,i}\big)(s)$ as the best-response policy of player $i$ at step $h$, and let $\tilde\pi^{(n)}_h=\tilde\mu^{(n)}_{h,i}\times\pi^{(n)}_{h,-i}$. Let $f^{(n)}_h(s,a)=\|\hat P^{(n)}_h(\cdot|s,a)-P^\star_h(\cdot|s,a)\|_1$; according to the event $\mathcal E$, we have
\[
\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]\le\zeta^{(n)},\quad \mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]\le\zeta^{(n)},\qquad\forall n\in[N],\ h\in[H],
\]
\[
\big\|\phi_h(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\phi_h})^{-1}}=\Theta\Big(\big\|\phi_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\phi_h}}\Big),\qquad\forall n\in[N],\ h\in[H],\ \phi_h\in\Phi_h.
\]
A direct consequence of the event $\mathcal E$ is that there exists an absolute constant $c$ such that
\[
\hat\beta^{(n)}_h(s,a)=\min\Big\{\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\hat\phi^{(n)}_h})^{-1}},\,H\Big\}\ge\min\Big\{c\,\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},\,H\Big\},\qquad\forall n\in[N],\ h\in[H].
\]
Next, we prove by induction that
\[
\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline V^{(n)}_{h,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{h,i}(s)\Big]\ge\sum_{h'=h}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h'}}\Big[\hat\beta^{(n)}_{h'}(s,a)-H\min\{f^{(n)}_{h'}(s,a),1\}\Big],\qquad\forall h\in[H].\tag{7}
\]
First, notice that for all $h\in[H]$,
\begin{align*}
\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline V^{(n)}_{h,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{h,i}(s)\Big]
&=\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\big(\mathbb D_{\pi^{(n)}_h}\overline Q^{(n)}_{h,i}\big)(s)-\big(\mathbb D_{\tilde\pi^{(n)}_h}Q^{\dagger,\pi^{(n)}_{-i}}_{h,i}\big)(s)\Big]\\
&\ge\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\big(\mathbb D_{\tilde\pi^{(n)}_h}\overline Q^{(n)}_{h,i}\big)(s)-\big(\mathbb D_{\tilde\pi^{(n)}_h}Q^{\dagger,\pi^{(n)}_{-i}}_{h,i}\big)(s)\Big]\\
&=\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline Q^{(n)}_{h,i}(s,a)-Q^{\dagger,\pi^{(n)}_{-i}}_{h,i}(s,a)\Big],
\end{align*}
where the inequality uses the fact that $\pi^{(n)}_h$ is the NE (or CCE) solution for $\{\overline Q^{(n)}_{h,i}\}_{i=1}^M$. Now we are ready to prove equation 7:

• When $h=H$, we have
\[
\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},H}}\Big[\overline V^{(n)}_{H,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{H,i}(s)\Big]\ge\mathbb E_{(s,a)}\Big[\overline Q^{(n)}_{H,i}-Q^{\dagger,\pi^{(n)}_{-i}}_{H,i}\Big]=\mathbb E_{(s,a)}\big[\hat\beta^{(n)}_H(s,a)\big]\ge\mathbb E_{(s,a)}\big[\hat\beta^{(n)}_H(s,a)-H\min\{f^{(n)}_H(s,a),1\}\big],
\]
with all expectations over $(s,a)$ taken under $d^{\tilde\pi^{(n)}}_{\hat P^{(n)},H}$.
• Suppose the statement is true for step $h+1$; then for step $h$,
\begin{align*}
\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline V^{(n)}_{h,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{h,i}(s)\Big]
&\ge\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline Q^{(n)}_{h,i}(s,a)-Q^{\dagger,\pi^{(n)}_{-i}}_{h,i}(s,a)\Big]\\
&=\mathbb E_{(s,a)}\Big[\hat\beta^{(n)}_h+\hat P^{(n)}_h\overline V^{(n)}_{h+1,i}-P^\star_hV^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}\Big]\\
&=\mathbb E_{(s,a)}\Big[\hat\beta^{(n)}_h+\hat P^{(n)}_h\big(\overline V^{(n)}_{h+1,i}-V^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}\big)+\big(\hat P^{(n)}_h-P^\star_h\big)V^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}\Big]\\
&=\mathbb E_{(s,a)}\Big[\hat\beta^{(n)}_h+\big(\hat P^{(n)}_h-P^\star_h\big)V^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}\Big]+\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h+1}}\Big[\overline V^{(n)}_{h+1,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}(s)\Big]\\
&\ge\mathbb E_{(s,a)}\Big[\hat\beta^{(n)}_h-H\min\{f^{(n)}_h,1\}\Big]+\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h+1}}\Big[\overline V^{(n)}_{h+1,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}(s)\Big]\\
&\ge\sum_{h'=h}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h'}}\Big[\hat\beta^{(n)}_{h'}(s,a)-H\min\{f^{(n)}_{h'}(s,a),1\}\Big],
\end{align*}
where we use the fact
\[
\Big|\big(\hat P^{(n)}_h-P^\star_h\big)V^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}(s,a)\Big|\le\min\Big\{H,\big\|\hat P^{(n)}_h(\cdot|s,a)-P^\star_h(\cdot|s,a)\big\|_1\big\|V^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}\big\|_\infty\Big\}\le H\min\{1,f^{(n)}_h(s,a)\},
\]
and the last line uses the induction hypothesis. Therefore, we have proved equation 7.

Applying equation 7 with $h=1$ gives
\[
\mathbb E_{s\sim d_1}\Big[\overline V^{(n)}_{1,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{1,i}(s)\Big]\ge\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_h(s,a)\big]-H\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h(s,a),1\}\big].
\]
Next we bound the second term. Let $g_h(s,a)=\min\{f^{(n)}_h(s,a),1\}$ and apply Lemma B.3 to $g_h$. For $h=1$,
\[
\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},1}}\big[\min\{f^{(n)}_1,1\}\big]\le\sqrt{A\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}\big[f^{(n)}_1(s,a)^2\big]}\le\sqrt{A\zeta^{(n)}},
\]
and for all $h\ge2$,
\begin{align*}
\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h,1\}\big]
&\le\mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\big\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\sqrt{nA\,\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]+d\lambda+n\zeta^{(n)}},\,1\Big\}\Big]\\
&\lesssim\mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\big\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\sqrt{nA\zeta^{(n)}+d\lambda+n\zeta^{(n)}},\,1\Big\}\Big],
\end{align*}
where we use the facts $\min\{f^{(n)}_h(s,a),1\}\le1$, $\mathbb E_{(s,a)\sim\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$ and $\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$. Then, according to our choice of $\alpha^{(n)}$,
\[
\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h,1\}\big]\le\mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\tfrac{c\alpha^{(n)}}{H}\big\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}},\,1\Big\}\Big].
\]
Combining everything,
\begin{align*}
\overline v^{(n)}_i-v^{\dagger,\pi^{(n)}_{-i}}_i&=\mathbb E_{s\sim d_1}\Big[\overline V^{(n)}_{1,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{1,i}(s)\Big]\\
&\ge\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_h(s,a)\big]-H\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h(s,a),1\}\big]\\
&\ge\sum_{h=1}^{H-1}\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\hat\beta^{(n)}_h(s,a)-\min\Big\{c\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},H\Big\}\Big]-H\sqrt{A\zeta^{(n)}}\\
&\ge-H\sqrt{A\zeta^{(n)}},
\end{align*}
which proves the inequality.

Lemma B.6 (Optimism for CE). Consider an episode $n\in[N]$ and set $\alpha^{(n)}=\Theta\big(H\sqrt{nA\zeta^{(n)}+d\lambda}\big)$. When the event $\mathcal E$ holds, we have
\[
\overline v^{(n)}_i-\max_{\omega\in\Omega_i}v^{\omega\circ\pi^{(n)}}_i\ge-H\sqrt{A\zeta^{(n)}},\qquad\forall n\in[N],\ i\in[M].
\]
Proof. Denote $\tilde\omega^{(n)}_{h,i}=\arg\max_{\omega_h\in\Omega_{h,i}}\big(\mathbb D_{\omega_h\circ\pi^{(n)}_h}\max_{\omega\in\Omega_i}Q^{\omega\circ\pi^{(n)}}_{h,i}\big)(s)$ and let $\tilde\pi^{(n)}_h=\tilde\omega^{(n)}_{h,i}\circ\pi^{(n)}_h$. Let $f^{(n)}_h(s,a)=\|\hat P^{(n)}_h(\cdot|s,a)-P^\star_h(\cdot|s,a)\|_1$. As in the proof of Lemma B.5, the event $\mathcal E$ gives $\mathbb E_{(s,a)\sim\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$, $\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$, and an absolute constant $c$ such that
\[
\hat\beta^{(n)}_h(s,a)=\min\Big\{\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\hat\phi^{(n)}_h})^{-1}},\,H\Big\}\ge\min\Big\{c\,\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},\,H\Big\},\qquad\forall n\in[N],\ h\in[H].
\]
Next, we prove by induction that
\[
\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline V^{(n)}_{h,i}(s)-\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h,i}(s)\Big]\ge\sum_{h'=h}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h'}}\Big[\hat\beta^{(n)}_{h'}(s,a)-H\min\{f^{(n)}_{h'}(s,a),1\}\Big],\qquad\forall h\in[H].\tag{8}
\]
First, notice that for all $h\in[H]$,
\begin{align*}
\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline V^{(n)}_{h,i}(s)-\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h,i}(s)\Big]
&=\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\big(\mathbb D_{\pi^{(n)}_h}\overline Q^{(n)}_{h,i}\big)(s)-\big(\mathbb D_{\tilde\pi^{(n)}_h}\max_{\omega\in\Omega_i}Q^{\omega\circ\pi^{(n)}}_{h,i}\big)(s)\Big]\\
&\ge\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\big(\mathbb D_{\tilde\pi^{(n)}_h}\overline Q^{(n)}_{h,i}\big)(s)-\big(\mathbb D_{\tilde\pi^{(n)}_h}\max_{\omega\in\Omega_i}Q^{\omega\circ\pi^{(n)}}_{h,i}\big)(s)\Big]\\
&=\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline Q^{(n)}_{h,i}(s,a)-\max_{\omega\in\Omega_i}Q^{\omega\circ\pi^{(n)}}_{h,i}(s,a)\Big],
\end{align*}
where the inequality uses the fact that $\pi^{(n)}_h$ is the CE solution for $\{\overline Q^{(n)}_{h,i}\}_{i=1}^M$. Now we are ready to prove equation 8:

• When $h=H$, we have
\[
\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},H}}\Big[\overline V^{(n)}_{H,i}(s)-\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{H,i}(s)\Big]\ge\mathbb E_{(s,a)}\Big[\overline Q^{(n)}_{H,i}-\max_{\omega\in\Omega_i}Q^{\omega\circ\pi^{(n)}}_{H,i}\Big]=\mathbb E_{(s,a)}\big[\hat\beta^{(n)}_H(s,a)\big]\ge\mathbb E_{(s,a)}\big[\hat\beta^{(n)}_H(s,a)-H\min\{f^{(n)}_H(s,a),1\}\big].
\]
• Suppose the statement is true for step $h+1$; then for step $h$,
\begin{align*}
\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline V^{(n)}_{h,i}(s)-\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h,i}(s)\Big]
&\ge\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\overline Q^{(n)}_{h,i}-\max_{\omega\in\Omega_i}Q^{\omega\circ\pi^{(n)}}_{h,i}\Big]\\
&=\mathbb E_{(s,a)}\Big[\hat\beta^{(n)}_h+\hat P^{(n)}_h\overline V^{(n)}_{h+1,i}-P^\star_h\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h+1,i}\Big]\\
&=\mathbb E_{(s,a)}\Big[\hat\beta^{(n)}_h+\big(\hat P^{(n)}_h-P^\star_h\big)\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h+1,i}\Big]+\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h+1}}\Big[\overline V^{(n)}_{h+1,i}(s)-\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h+1,i}(s)\Big]\\
&\ge\mathbb E_{(s,a)}\Big[\hat\beta^{(n)}_h-H\min\{f^{(n)}_h,1\}\Big]+\mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h+1}}\Big[\overline V^{(n)}_{h+1,i}(s)-\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h+1,i}(s)\Big]\\
&\ge\sum_{h'=h}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h'}}\Big[\hat\beta^{(n)}_{h'}(s,a)-H\min\{f^{(n)}_{h'}(s,a),1\}\Big],
\end{align*}
where we use the fact
\[
\Big|\big(\hat P^{(n)}_h-P^\star_h\big)\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h+1,i}(s,a)\Big|\le\min\Big\{H,\big\|\hat P^{(n)}_h(\cdot|s,a)-P^\star_h(\cdot|s,a)\big\|_1\big\|\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{h+1,i}\big\|_\infty\Big\}\le H\min\{1,f^{(n)}_h(s,a)\},
\]
and the last line uses the induction hypothesis. Therefore, we have proved equation 8.
Applying equation 8 with $h=1$ gives
\[
\mathbb E_{s\sim d_1}\Big[\overline V^{(n)}_{1,i}(s)-\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{1,i}(s)\Big]\ge\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_h(s,a)\big]-H\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h(s,a),1\}\big].
\]
To bound the second term, let $g_h(s,a)=\min\{f^{(n)}_h(s,a),1\}$ and apply Lemma B.3 to $g_h$. For $h=1$,
\[
\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},1}}\big[\min\{f^{(n)}_1,1\}\big]\le\sqrt{A\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}\big[f^{(n)}_1(s,a)^2\big]}\le\sqrt{A\zeta^{(n)}},
\]
and for all $h\ge2$,
\begin{align*}
\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h,1\}\big]
&\le\mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\big\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\sqrt{nA\,\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]+d\lambda+n\zeta^{(n)}},\,1\Big\}\Big]\\
&\lesssim\mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\big\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\sqrt{nA\zeta^{(n)}+d\lambda+n\zeta^{(n)}},\,1\Big\}\Big],
\end{align*}
where we use the facts $\min\{f^{(n)}_h(s,a),1\}\le1$, $\mathbb E_{(s,a)\sim\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$ and $\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$. Then, according to our choice of $\alpha^{(n)}$,
\[
\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h,1\}\big]\le\mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\tfrac{c\alpha^{(n)}}{H}\big\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}},\,1\Big\}\Big].
\]
Combining everything,
\begin{align*}
\overline v^{(n)}_i-\max_{\omega\in\Omega_i}v^{\omega\circ\pi^{(n)}}_i&=\mathbb E_{s\sim d_1}\Big[\overline V^{(n)}_{1,i}(s)-\max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{1,i}(s)\Big]\\
&\ge\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_h(s,a)\big]-H\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h(s,a),1\}\big]\\
&\ge\sum_{h=1}^{H-1}\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\hat\beta^{(n)}_h(s,a)-\min\Big\{c\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},H\Big\}\Big]-H\sqrt{A\zeta^{(n)}}\\
&\ge-H\sqrt{A\zeta^{(n)}},
\end{align*}
which proves the inequality.

Lemma B.7 (Pessimism). Consider an episode $n\in[N]$ and set $\alpha^{(n)}=\Theta\big(H\sqrt{nA\zeta^{(n)}+d\lambda}\big)$. When the event $\mathcal E$ holds, we have
\[
\underline v^{(n)}_i-v^{\pi^{(n)}}_i\le H\sqrt{A\zeta^{(n)}},\qquad\forall n\in[N],\ i\in[M].
\]
Proof.
Let $f^{(n)}_h(s,a)=\|\hat P^{(n)}_h(\cdot|s,a)-P^\star_h(\cdot|s,a)\|_1$. As in the proof of Lemma B.5, the event $\mathcal E$ gives $\mathbb E_{(s,a)\sim\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$ and $\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$ for all $n\in[N]$, $h\in[H]$, together with an absolute constant $c$ such that
\[
\hat\beta^{(n)}_h(s,a)=\min\Big\{\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\hat\phi^{(n)}_h})^{-1}},\,H\Big\}\ge\min\Big\{c\,\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},\,H\Big\},\qquad\forall n\in[N],\ h\in[H].
\]
Again, we prove the following inequality by induction:
\[
\mathbb E_{s\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\underline V^{(n)}_{h,i}(s)-V^{\pi^{(n)}}_{h,i}(s)\Big]\le\sum_{h'=h}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h'}}\Big[-\hat\beta^{(n)}_{h'}(s,a)+H\min\{f^{(n)}_{h'}(s,a),1\}\Big],\qquad\forall h\in[H].
\]
• When $h=H$, we have
\[
\mathbb E_{s\sim d^{\pi^{(n)}}_{\hat P^{(n)},H}}\Big[\underline V^{(n)}_{H,i}(s)-V^{\pi^{(n)}}_{H,i}(s)\Big]=\mathbb E_{(s,a)}\Big[\underline Q^{(n)}_{H,i}-Q^{\pi^{(n)}}_{H,i}\Big]=\mathbb E_{(s,a)}\big[-\hat\beta^{(n)}_H(s,a)\big]\le\mathbb E_{(s,a)}\big[-\hat\beta^{(n)}_H(s,a)+H\min\{f^{(n)}_H(s,a),1\}\big].
\]
• Suppose the statement is true for step $h+1$; then for step $h$,
\begin{align*}
\mathbb E_{s\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\underline V^{(n)}_{h,i}(s)-V^{\pi^{(n)}}_{h,i}(s)\Big]
&=\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\underline Q^{(n)}_{h,i}(s,a)-Q^{\pi^{(n)}}_{h,i}(s,a)\Big]\\
&=\mathbb E_{(s,a)}\Big[-\hat\beta^{(n)}_h+\hat P^{(n)}_h\underline V^{(n)}_{h+1,i}-P^\star_hV^{\pi^{(n)}}_{h+1,i}\Big]\\
&=\mathbb E_{(s,a)}\Big[-\hat\beta^{(n)}_h+\big(\hat P^{(n)}_h-P^\star_h\big)V^{\pi^{(n)}}_{h+1,i}\Big]+\mathbb E_{s\sim d^{\pi^{(n)}}_{\hat P^{(n)},h+1}}\Big[\underline V^{(n)}_{h+1,i}(s)-V^{\pi^{(n)}}_{h+1,i}(s)\Big]\\
&\le\mathbb E_{(s,a)}\Big[-\hat\beta^{(n)}_h+H\min\{f^{(n)}_h,1\}\Big]+\mathbb E_{s\sim d^{\pi^{(n)}}_{\hat P^{(n)},h+1}}\Big[\underline V^{(n)}_{h+1,i}(s)-V^{\pi^{(n)}}_{h+1,i}(s)\Big]\\
&\le\sum_{h'=h}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h'}}\Big[-\hat\beta^{(n)}_{h'}(s,a)+H\min\{f^{(n)}_{h'}(s,a),1\}\Big],
\end{align*}
where we use the fact
\[
\Big|\big(\hat P^{(n)}_h-P^\star_h\big)V^{\pi^{(n)}}_{h+1,i}(s,a)\Big|\le\min\Big\{H,\big\|\hat P^{(n)}_h(\cdot|s,a)-P^\star_h(\cdot|s,a)\big\|_1\big\|V^{\pi^{(n)}}_{h+1,i}\big\|_\infty\Big\}\le H\min\{1,f^{(n)}_h(s,a)\},
\]
and the last line uses the induction hypothesis.
The remaining steps are exactly the same as in the proofs of Lemma B.5 and Lemma B.6: we may prove
\[
\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},1}}\big[\min\{f^{(n)}_1(s,a),1\}\big]\le\sqrt{A\zeta^{(n)}},
\]
\[
\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_h(s,a),1\}\big]\le\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\tfrac{c\alpha^{(n)}}{H}\big\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}},\,1\Big\}\Big],\qquad\forall h\ge2.
\]
Combining everything, we get
\begin{align*}
\underline v^{(n)}_i-v^{\pi^{(n)}}_i&=\mathbb E_{s\sim d_1}\Big[\underline V^{(n)}_{1,i}(s)-V^{\pi^{(n)}}_{1,i}(s)\Big]\le\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\Big[-\hat\beta^{(n)}_h(s,a)+H\min\{f^{(n)}_h(s,a),1\}\Big]\\
&\le\sum_{h=1}^{H-1}\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\Big[-\hat\beta^{(n)}_h(s,a)+\min\Big\{c\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},H\Big\}\Big]+H\sqrt{A\zeta^{(n)}}\le H\sqrt{A\zeta^{(n)}},
\end{align*}
which finishes the proof.

Published as a conference paper at ICLR 2023

Lemma B.8. For the model-based algorithm, when we pick $\lambda=\Theta\big(d\log\frac{NH|\Phi|}{\delta}\big)$, $\zeta^{(n)}=\Theta\big(\frac1n\log\frac{|\mathcal M|HN}{\delta}\big)$ and $\alpha^{(n)}=\Theta\big(H\sqrt{nA\zeta^{(n)}+d\lambda}\big)$, with probability at least $1-\delta$ we have
\[
\sum_{n=1}^N\Delta^{(n)}\lesssim H^3d^2AN^{\frac12}\log\frac{|\mathcal M|HN}{\delta}.
\]
Proof. With our choice of $\lambda$ and $\zeta^{(n)}$, according to Lemma B.2, the event $\mathcal E$ holds with probability at least $1-\delta$. Furthermore, we have
\[
\alpha^{(n)}=\Theta\Big(H\sqrt{A\log\tfrac{|\mathcal M|HN}{\delta}+d^2\log\tfrac{NH|\Phi|}{\delta}}\Big)=O\Big(dH\sqrt{A\log\tfrac{|\mathcal M|HN|\Phi|}{\delta}}\Big).
\]
Let $f^{(n)}_h(s,a)=\|\hat P^{(n)}_h(\cdot|s,a)-P^\star_h(\cdot|s,a)\|_1$. According to the definition of the event $\mathcal E$, we have
\[
\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]\le\zeta^{(n)},\qquad \big\|\phi_h(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\phi_h})^{-1}}=\Theta\Big(\big\|\phi_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\phi_h}}\Big),\qquad\forall n\in[N],\ h\in[H],\ \phi_h\in\Phi_h.
\]
By definition, $\Delta^{(n)}=\max_{i\in[M]}\big(\overline v^{(n)}_i-\underline v^{(n)}_i\big)+2H\sqrt{A\zeta^{(n)}}$.
For each fixed $i\in[M]$, $h\in[H]$ and $n\in[N]$, we have
\begin{align*}
\mathbb E_{s\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\overline V^{(n)}_{h,i}(s)-\underline V^{(n)}_{h,i}(s)\Big]
&=\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\overline Q^{(n)}_{h,i}(s,a)-\underline Q^{(n)}_{h,i}(s,a)\Big]\\
&=\mathbb E_{(s,a)}\Big[2\hat\beta^{(n)}_h(s,a)+\hat P^{(n)}_h\big(\overline V^{(n)}_{h+1,i}-\underline V^{(n)}_{h+1,i}\big)(s,a)\Big]\\
&=\mathbb E_{(s,a)}\Big[2\hat\beta^{(n)}_h+\big(\hat P^{(n)}_h-P^\star_h\big)\big(\overline V^{(n)}_{h+1,i}-\underline V^{(n)}_{h+1,i}\big)\Big]+\mathbb E_{s\sim d^{\pi^{(n)}}_{P^\star,h+1}}\Big[\overline V^{(n)}_{h+1,i}(s)-\underline V^{(n)}_{h+1,i}(s)\Big]\\
&\le\mathbb E_{(s,a)}\Big[2\hat\beta^{(n)}_h(s,a)+2H^2f^{(n)}_h(s,a)\Big]+\mathbb E_{s\sim d^{\pi^{(n)}}_{P^\star,h+1}}\Big[\overline V^{(n)}_{h+1,i}(s)-\underline V^{(n)}_{h+1,i}(s)\Big].
\end{align*}
Note that we use the fact that $\overline V^{(n)}_{h+1,i}(s)-\underline V^{(n)}_{h+1,i}(s)$ is upper bounded by $2H^2$, which can be proved easily by induction using $\hat\beta^{(n)}_h(s,a)\le H$. Applying the above formula recursively, we obtain
\[
\mathbb E_{s\sim d^{\pi^{(n)}}_{P^\star,1}}\Big[\overline V^{(n)}_{1,i}(s)-\underline V^{(n)}_{1,i}(s)\Big]\le\underbrace{2\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{P^\star,h}}\big[\hat\beta^{(n)}_h(s,a)\big]}_{(a)}+\underbrace{2H^2\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{P^\star,h}}\big[f^{(n)}_h(s,a)\big]}_{(b)}.\tag{11}
\]
First, we bound the term (a) in inequality 11. Following Lemma B.4 and noting that the bonus $\hat\beta^{(n)}_h$ is $O(H)$, we have
\begin{align*}
\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{P^\star,h}}\big[\hat\beta^{(n)}_h(s,a)\big]
&\lesssim\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\min\Big\{\alpha^{(n)}\big\|\hat\phi^{(n)}_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},H\Big\}\Big] &&(\text{from equation 10})\\
&\lesssim\sum_{h=1}^{H-1}\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\big\|\phi^\star_h(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}}\Big]\sqrt{nA(\alpha^{(n)})^2\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}\Big[\big\|\hat\phi^{(n)}_h(s,a)\big\|^2_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}}\Big]+H^2d\lambda}\\
&\qquad+\sqrt{A(\alpha^{(n)})^2\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}\Big[\big\|\hat\phi^{(n)}_1(s,a)\big\|^2_{\Sigma^{-1}_{n,\rho^{(n)}_1,\hat\phi^{(n)}_1}}\Big]}.
\end{align*}
Note that we use the fact that $B=H$ when applying Lemma B.4. In addition, we have
\[
n\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}\Big[\big\|\hat\phi^{(n)}_h(s,a)\big\|^2_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}}\Big]=n\,\mathrm{Tr}\Big(\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[\hat\phi^{(n)}_h\hat\phi^{(n)\top}_h\big]\Big(n\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[\hat\phi^{(n)}_h\hat\phi^{(n)\top}_h\big]+\lambda I_d\Big)^{-1}\Big)\le d.
\]
Then,
\[
\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{P^\star,h}}\big[\hat\beta^{(n)}_h(s,a)\big]\lesssim\sum_{h=1}^{H-1}\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\big\|\phi^\star_h(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}}\Big]\sqrt{dA(\alpha^{(n)})^2+H^2d\lambda}+\sqrt{dA(\alpha^{(n)})^2/n}.
\]
Second, we bound the term (b) in inequality 11. Following Lemma B.4 and noting that $f^{(n)}_h(s,a)$ is upper bounded by $2$ (i.e., $B=2$ in Lemma B.4), we have
\begin{align*}
\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\pi^{(n)}}_{P^\star,h}}\big[f^{(n)}_h(s,a)\big]
&\le\sum_{h=1}^{H-1}\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\big\|\phi^\star_h(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}}\Big]\sqrt{nA\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]+d\lambda}+\sqrt{A\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}\big[f^{(n)}_1(s,a)^2\big]}\\
&\le\sum_{h=1}^{H-1}\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\big\|\phi^\star_h(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}}\Big]\sqrt{nA\zeta^{(n)}+d\lambda}+\sqrt{A\zeta^{(n)}}\\
&\lesssim\frac{\alpha^{(n)}}H\sum_{h=1}^{H-1}\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\big\|\phi^\star_h(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}}\Big]+\sqrt{A\zeta^{(n)}},
\end{align*}
where the second inequality uses $\mathbb E_{(s,a)\sim\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$, and the last line uses $\sqrt{nA\zeta^{(n)}+d\lambda}\lesssim\alpha^{(n)}/H$. Combining the bounds on the terms (a) and (b) in inequality 11, we have
\begin{align*}
\overline v^{(n)}_i-\underline v^{(n)}_i&=\mathbb E_{s\sim d^{\pi^{(n)}}_{P^\star,1}}\Big[\overline V^{(n)}_{1,i}(s)-\underline V^{(n)}_{1,i}(s)\Big]\\
&\lesssim\sum_{h=1}^{H-1}\Big(\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\big\|\phi^\star_h(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}}\Big]\sqrt{dA(\alpha^{(n)})^2+H^2d\lambda}+\sqrt{\tfrac{dA(\alpha^{(n)})^2}n}\Big)\\
&\qquad+H^2\Big(\frac{\alpha^{(n)}}H\sum_{h=1}^{H-1}\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\big\|\phi^\star_h(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}}\Big]+\sqrt{A\zeta^{(n)}}\Big).
\end{align*}
Taking the maximum over $i$ on both sides and using the definition of $\Delta^{(n)}$, we get the same bound for $\Delta^{(n)}=\max_{i\in[M]}\big(\overline v^{(n)}_i-\underline v^{(n)}_i\big)+2H\sqrt{A\zeta^{(n)}}$, with the additional $2H\sqrt{A\zeta^{(n)}}$ absorbed into the $H^2\sqrt{A\zeta^{(n)}}$ term. Next, note that
\begin{align*}
\sum_{n=1}^N\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\big\|\phi^\star_h(\tilde s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}}\Big]
&\le\sqrt{N\sum_{n=1}^N\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\Big[\phi^\star_h(\tilde s,\tilde a)^\top\Sigma^{-1}_{n,\gamma^{(n)}_h,\phi^\star_h}\phi^\star_h(\tilde s,\tilde a)\Big]} &&(\text{Cauchy--Schwarz})\\
&\lesssim\sqrt{N\Big(\log\det\Big(\sum_{n=1}^N\mathbb E_{(\tilde s,\tilde a)\sim d^{\pi^{(n)}}_{P^\star,h}}\big[\phi^\star_h\phi^{\star\top}_h\big]+\lambda I_d\Big)-\log\det(\lambda I_d)\Big)} &&(\text{Lemma E.2})\\
&\le\sqrt{dN\log\Big(1+\frac N{d\lambda}\Big)}. &&(\text{potential function bound, Lemma E.3, using }\|\phi^\star_h(s,a)\|_2\le1)
\end{align*}
Finally,
\begin{align*}
\sum_{n=1}^N\Delta^{(n)}&\lesssim H\Big(\sqrt{dN\log\big(1+\tfrac N{d\lambda}\big)}\sqrt{dA(\alpha^{(N)})^2+H^2d\lambda}+\sum_{n=1}^N\sqrt{\tfrac{dA(\alpha^{(n)})^2}n}\Big)+H^3\Big(\tfrac1H\sqrt{dN\log\big(1+\tfrac N{d\lambda}\big)}\,\alpha^{(N)}+\sum_{n=1}^N\sqrt{A\zeta^{(n)}}\Big)\\
&\lesssim H^2d\sqrt{NA\log\big(1+\tfrac N{d\lambda}\big)}\,\alpha^{(N)} &&(\text{some algebra: we keep the dominating term, noting that }\alpha^{(n)}\text{ is increasing in }n)\\
&\lesssim H^3d^2AN^{\frac12}\log\frac{|\mathcal M|HN}{\delta}.
\end{align*}
This concludes the proof.

Proof of Theorem 4.1.

Proof. For any fixed episode $n$ and agent $i$, by Lemma B.5, Lemma B.6 and Lemma B.7, we have
\[
v^{\dagger,\pi^{(n)}_{-i}}_i-v^{\pi^{(n)}}_i\quad\Big(\text{or }\max_{\omega\in\Omega_i}v^{\omega\circ\pi^{(n)}}_i-v^{\pi^{(n)}}_i\Big)\ \le\ \overline v^{(n)}_i-\underline v^{(n)}_i+2H\sqrt{A\zeta^{(n)}}\ \le\ \Delta^{(n)}.
\]
Taking the maximum over $i$ on both sides, we have
\[
\max_{i\in[M]}\Big(v^{\dagger,\pi^{(n)}_{-i}}_i-v^{\pi^{(n)}}_i\Big)\quad\Big(\text{or }\max_{i\in[M]}\max_{\omega\in\Omega_i}\big(v^{\omega\circ\pi^{(n)}}_i-v^{\pi^{(n)}}_i\big)\Big)\ \le\ \Delta^{(n)}.
\]
From Lemma B.8, with probability at least $1-\delta$ we can ensure $\sum_{n=1}^N\Delta^{(n)}\lesssim H^3d^2AN^{\frac12}\log\frac{|\mathcal M|HN}{\delta}$. Therefore, according to Lemma E.4, when we pick $N=O\big(\frac{H^6d^4A^2}{\varepsilon^2}\log^2\frac{HdA|\mathcal M|}{\delta\varepsilon}\big)$, we have $\frac1N\sum_{n=1}^N\Delta^{(n)}\le\varepsilon$. On the other hand, from equation 12, we have
\[
\max_{i\in[M]}\Big(v^{\dagger,\pi_{-i}}_i-v^\pi_i\Big)\quad\Big(\text{or }\max_{i\in[M]}\max_{\omega\in\Omega_i}\big(v^{\omega\circ\pi}_i-v^\pi_i\big)\Big)=\max_{i\in[M]}\Big(v^{\dagger,\pi^{(n^\star)}_{-i}}_i-v^{\pi^{(n^\star)}}_i\Big)\quad\Big(\text{or }\max_{i\in[M]}\max_{\omega\in\Omega_i}\big(v^{\omega\circ\pi^{(n^\star)}}_i-v^{\pi^{(n^\star)}}_i\big)\Big)\le\Delta^{(n^\star)}=\min_{n\in[N]}\Delta^{(n)}\le\frac1N\sum_{n=1}^N\Delta^{(n)}\le\varepsilon,
\]
which finishes the proof.
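The potential-function step in Lemma B.8 (Lemmas E.2/E.3) is the standard elliptical-potential argument: for unit-norm features and $\lambda\ge1$, the summed squared bonus norms grow only logarithmically in $N$. An illustrative simulation of that bound (sizes and feature distribution are made up for the check):

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam, N = 4, 1.0, 500

Sigma = lam * np.eye(d)      # Sigma_0 = lam * I_d
total = 0.0
for _ in range(N):
    phi = rng.random(d)
    phi /= np.linalg.norm(phi)                 # ||phi||_2 <= 1
    total += phi @ np.linalg.inv(Sigma) @ phi  # ||phi_n||^2 in Sigma_{n-1}^{-1} norm
    Sigma += np.outer(phi, phi)                # rank-one update

# elliptical potential: sum_n ||phi_n||^2_{Sigma_{n-1}^{-1}} <= 2 d log(1 + N / (d * lam))
bound = 2 * d * np.log(1 + N / (d * lam))
assert total <= bound
```

Since each increment is at most $\|\phi\|_2^2/\lambda\le1$ and $\log(1+u)\ge u/2$ on $[0,1]$, the log-determinant telescoping makes the assertion hold for any unit-norm feature sequence, not just this random one.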

C ANALYSIS OF THE MODEL-FREE METHOD

For the model-free method, throughout this section we assume the Markov game is a block Markov game.
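To make the block structure concrete: in a block Markov game, each observed state decodes to one of finitely many latent states, so the true feature $\phi^\star_h(s,a)$ is a one-hot vector over (latent state, joint action) pairs, and $P^\star_h(s'|s,a)=\phi^\star_h(s,a)^\top w^\star_h(s')$ is low-rank. The following sketch (all sizes and the random emission model are illustrative assumptions, not the paper's construction) builds such a transition kernel and checks the low-rank factorization:

```python
import numpy as np

rng = np.random.default_rng(3)
n_latent, n_obs, A = 3, 9, 2
d = n_latent * A  # one-hot feature dimension over (latent, action) pairs

decode = rng.integers(0, n_latent, size=n_obs)  # each observed state maps to a latent state
T = rng.random((n_latent, A, n_obs))            # latent transition composed with emission
T /= T.sum(axis=2, keepdims=True)               # normalize to a valid kernel

def phi(s, a):
    """One-hot feature phi(s, a) indexed by (decode(s), a)."""
    e = np.zeros(d)
    e[decode[s] * A + a] = 1.0
    return e

# w(s') stacks the latent rows so that P(s'|s,a) = phi(s,a)^T w(s')
W = T.reshape(d, n_obs)
P = np.array([[phi(s, a) @ W for a in range(A)] for s in range(n_obs)])

assert np.allclose(P.sum(axis=2), 1.0)                            # valid transition kernel
assert np.linalg.matrix_rank(P.reshape(n_obs * A, n_obs)) <= d    # hidden low-rank structure
```

Because every observed state sharing a latent state has the same transition row, the flattened kernel has at most $d$ distinct rows, which is exactly the low-rank assumption the model-free analysis exploits.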

C.1 CONSTRUCTION OF $\mathcal N_h$ AND $\mathcal F_h$

Let $\mathcal C_h=\big\{\Sigma_h:\Sigma_h=\lambda I_d+\sum_{k=1}^l\phi_h(s_k,a_k)\phi_h(s_k,a_k)^\top\ \big|\ \phi_h\in\Phi_h,\ l\in[N],\ s_k\in\mathcal S,\ a_k\in\mathcal A,\ \forall k\in[l]\big\}$. Fix a parameter $L$. For each $h\in[H]$, define a function class $\tilde{\mathcal F}_h\subset\mathbb R^{\mathcal S\times\mathcal A}$ by
\[
\tilde{\mathcal F}_h=\Big\{\tilde f(s,a):=r_{h,i}(s,a)+\phi_h(s,a)^\top\theta+\min\big\{c\|\phi_h(s,a)\|_{\Sigma_h^{-1}},H\big\}\ \Big|\ i\in[M],\ \phi_h\in\Phi_h,\ \|\theta\|_2\le2H^2\sqrt d,\ c\in[0,L],\ \Sigma_h\in\mathcal C_h\Big\}.
\]
For a given parameter $\varepsilon$, let $\mathcal N_h$ be an $\varepsilon$-net of $\tilde{\mathcal F}_h$ under the $\|\cdot\|_\infty$ metric. Define $\Pi_h$ as the set of all possible policies produced by equation 2 (or equation 3 or equation 4, according to the problem setting). We then define the discriminator function class $\mathcal F_h$ as follows:
\begin{align*}
\mathcal F_{1,h}&:=\Big\{f(s):=\mathbb E_{a\sim U(\mathcal A)}\big[\phi_h(s,a)^\top\theta-\phi'_h(s,a)^\top\theta'\big]\ \Big|\ \phi_h,\phi'_h\in\Phi_h,\ \max\{\|\theta\|_2,\|\theta'\|_2\}\le\sqrt d\Big\},\\
\mathcal F_{2,h}&:=\Big\{f(s):=\mathbb E_{a\sim\pi_{h+1}(s)}\Big[\tfrac{r_{h+1,i}(s,a)}H+\phi_{h+1}(s,a)^\top\theta\Big]\ \Big|\ i\in[M],\ \pi_{h+1}\in\Pi_{h+1},\ \phi_{h+1}\in\Phi_{h+1},\ \|\theta\|_2\le\sqrt d\Big\},\\
\mathcal F_{3,h}&:=\Big\{f(s):=\max_{\tilde\mu_{h+1,i}}\mathbb E_{a\sim(\tilde\mu_{h+1,i}\times\pi_{h+1,-i})(s)}\Big[\tfrac{r_{h+1,i}(s,a)}H+\phi_{h+1}(s,a)^\top\theta\Big]\ \Big|\ i\in[M],\ \pi_{h+1}\in\Pi_{h+1},\ \phi_{h+1}\in\Phi_{h+1},\ \|\theta\|_2\le\sqrt d\Big\}\quad(\text{for NE and CCE}),\\
\mathcal F_{3,h}&:=\Big\{f(s):=\max_{\omega_{h+1,i}\in\Omega_{h+1,i}}\mathbb E_{a\sim(\omega_{h+1,i}\circ\pi_{h+1})(s)}\Big[\tfrac{r_{h+1,i}(s,a)}H+\phi_{h+1}(s,a)^\top\theta\Big]\ \Big|\ i\in[M],\ \pi_{h+1}\in\Pi_{h+1},\ \phi_{h+1}\in\Phi_{h+1},\ \|\theta\|_2\le\sqrt d\Big\}\quad(\text{for CE}),\\
\mathcal F_{4,h}&:=\Big\{f(s):=\mathbb E_{a\sim\pi_{h+1}(s)}\Big[\tfrac{\min\{c\|\phi_{h+1}(s,a)\|_{\Sigma_{h+1}^{-1}},H\}}{H^2}+\phi_{h+1}(s,a)^\top\theta\Big]\ \Big|\ c\in[0,L],\ \pi_{h+1}\in\Pi_{h+1},\ \Sigma_{h+1}\in\mathcal C_{h+1},\ \phi_{h+1}\in\Phi_{h+1},\ \|\theta\|_2\le\sqrt d\Big\},\\
\mathcal G&:=\{f:\mathcal S\to[0,1]\},\qquad \mathcal F_h:=(\mathcal F_{1,h}\cup\mathcal F_{2,h}\cup\mathcal F_{3,h}\cup\mathcal F_{4,h})\cap\mathcal G.
\end{align*}
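The nets used in the covering lemmas below are standard grid covers; for instance, the $\ell_\infty$-cover $\mathcal W$ of $[0,L]$ at scale $\varepsilon$ can be taken to be a uniform grid of size $O(L/\varepsilon)$. A quick illustrative sketch of this cover and its two defining properties (constants are made up for the check):

```python
import numpy as np

L, eps = 5.0, 0.1

# l_infinity cover of [0, L] at scale eps: a uniform grid with spacing eps suffices
grid = np.arange(0.0, L + eps, eps)
assert len(grid) <= L / eps + 2  # cover size is O(L / eps)

# every point of [0, L] is within eps (in fact eps/2) of some grid point
for c in np.linspace(0.0, L, 1000):
    assert np.min(np.abs(grid - c)) <= eps / 2 + 1e-9
```

The $\ell_2$-covers of the parameter balls $\{\theta:\|\theta\|_2\le R\}$ used alongside $\mathcal W$ follow the same idea in $d$ dimensions, which is where the $(R/\varepsilon)^d$ factors in the covering-number bounds come from.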

C.2 HIGH PROBABILITY EVENTS

We define the following events:
\[
\mathcal E_1:\ \forall n\in[N],\ h\in[H],\ \rho\in\{\rho^{(n)}_h,\tilde\rho^{(n)}_h\},\ f\in\mathcal F_h:\quad \mathbb E_{(s,a)\sim\rho}\Big[\big(\big(\hat P^{(n)}_h-P^\star_h\big)f\big)(s,a)^2\Big]\le\zeta^{(n)},
\]
\[
\mathcal E_2:\ \forall n\in[N],\ h\in[H],\ \phi_h\in\Phi_h:\quad \big\|\phi_h(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\phi_h})^{-1}}=\Theta\Big(\big\|\phi_h(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\phi_h}}\Big),
\]
\[
\mathcal E:=\mathcal E_1\cap\mathcal E_2.
\]
Similar to the model-based case, we first prove a few lemmas which lead to the conclusion that $\mathcal E$ holds with high probability.

Lemma C.1. For any $n\in[N]$, $h\in[H]$, we have $\hat P^{(n)}_h(s'|s,a)=\hat\phi^{(n)}_h(s,a)^\top\hat w^{(n)}_h(s')$ for some $\hat w^{(n)}_h:\mathcal S\to\mathbb R^d$. For any function $f:\mathcal S\to[0,1]$ and $n\in[N]$, $h\in[H]$, we have $\big\|\int_{\mathcal S}\hat w^{(n)}_h(s')f(s')\,ds'\big\|_2\le\sqrt d$, and there exist $\theta,\tilde\theta\in\mathbb R^d$ such that $(P^\star_hf)(s,a)=\phi^\star_h(s,a)^\top\theta$, $(\hat P^{(n)}_hf)(s,a)=\hat\phi^{(n)}_h(s,a)^\top\tilde\theta$ and $\max\{\|\theta\|_2,\|\tilde\theta\|_2\}\le\sqrt d$. Furthermore, we have $\|\tilde\theta\|_\infty\le1$.

Proof. By definition, we have
\[
(P^\star_hf)(s,a)=\int_{\mathcal S}P^\star_h(s'|s,a)f(s')\,ds'=\phi^\star_h(s,a)^\top\int_{\mathcal S}w^\star_h(s')f(s')\,ds'=\phi^\star_h(s,a)^\top\theta,
\]
where $\theta=\int_{\mathcal S}w^\star_h(s')f(s')\,ds'$. Furthermore, since $\|f\|_\infty\le1$, the assumption on $w^\star_h$ gives $\big\|\int_{\mathcal S}w^\star_h(s')f(s')\,ds'\big\|_2\le\sqrt d$, which implies $\|\theta\|_2\le\sqrt d$. For $\hat P^{(n)}_hf$, let
\[
\hat w^{(n)}_h(s'):=\Big(\sum_{(\bar s,\bar a)\in\mathcal D^{(n)}_h\cup\tilde{\mathcal D}^{(n)}_h}\hat\phi^{(n)}_h(\bar s,\bar a)\hat\phi^{(n)}_h(\bar s,\bar a)^\top+\lambda I_d\Big)^{-1}\sum_{(\bar s,\bar a,\bar s')\in\mathcal D^{(n)}_h\cup\tilde{\mathcal D}^{(n)}_h}\hat\phi^{(n)}_h(\bar s,\bar a)\,\mathbb 1\{\bar s'=s'\}.
\]
Since $\hat\phi^{(n)}_h(s,a)$ is a one-hot vector, one has $\|\hat w^{(n)}_h(s')\|_\infty\le1$ for all $s'\in\mathcal S$. It follows that $\big\|\int_{\mathcal S}\hat w^{(n)}_h(s')f(s')\,ds'\big\|_\infty\le1$, and therefore $\big\|\int_{\mathcal S}\hat w^{(n)}_h(s')f(s')\,ds'\big\|_2\le\sqrt d$. By definition, we have
\[
(\hat P^{(n)}_hf)(s,a)=\int_{\mathcal S}\hat P^{(n)}_h(s'|s,a)f(s')\,ds'=\hat\phi^{(n)}_h(s,a)^\top\int_{\mathcal S}\hat w^{(n)}_h(s')f(s')\,ds'=\hat\phi^{(n)}_h(s,a)^\top\tilde\theta,
\]
where $\tilde\theta=\int_{\mathcal S}\hat w^{(n)}_h(s')f(s')\,ds'$. By the properties just derived for $\hat w^{(n)}_h$, and similarly to the argument for the true model, we have $\|\tilde\theta\|_2\le\sqrt d$ and $\|\tilde\theta\|_\infty\le1$.

Lemma C.2 (Covering number of $\tilde{\mathcal F}_h$).
When $\Phi_h$ is the set of one-hot vectors and $\lambda\ge1$, it is possible to construct the $\varepsilon$-net $\mathcal N_h$ such that
\[
|\mathcal N_h|\le M\Big(\frac{12H^2L^2d}\varepsilon\Big)^{3d}|\Phi|,\qquad\forall h\in[H].
\]
Furthermore, we have $|\Pi_h|\le|\mathcal N_h|^M\le M^M\big(\frac{12H^2L^2d}\varepsilon\big)^{3Md}|\Phi|^M$.

Proof. Recall that
\[
\tilde{\mathcal F}_h=\Big\{\tilde f(s,a):=r_{h,i}(s,a)+\phi_h(s,a)^\top\theta+\min\big\{c\|\phi_h(s,a)\|_{\Sigma_h^{-1}},H\big\}\ \Big|\ i\in[M],\ \phi_h\in\Phi_h,\ \|\theta\|_2\le2H^2\sqrt d,\ c\in[0,L],\ \Sigma_h\in\mathcal C_h\Big\}.
\]
Note that when $\Phi_h$ is the set of one-hot vectors, $\Sigma_h$ is a diagonal matrix. In this case, $\tilde{\mathcal F}_h$ is a subset of the function class
\[
\tilde{\mathcal F}'_h:=\Big\{\tilde f(s,a):=r_{h,i}(s,a)+\min\{c\,\phi_h(s,a)^\top\theta',H\}+\phi_h(s,a)^\top\theta\ \Big|\ i\in[M],\ \phi_h\in\Phi_h,\ 0\le c\le L,\ \max\{\|\theta\|_2,\|\theta'\|_2\}\le2H^2\sqrt d\Big\}.
\]
Let $\Theta$ be an $\ell_2$-cover of $\{\theta\in\mathbb R^d:\|\theta\|_2\le2H^2\sqrt d\}$ at scale $\varepsilon$, so that $|\Theta|\le\big(\frac{4H^2\sqrt d}\varepsilon\big)^d$. Let $\mathcal W$ be an $\ell_\infty$-cover of $[0,L]$ at scale $\varepsilon':=\frac\varepsilon{2H^2\sqrt d}$, so that $|\mathcal W|\le\frac{2H^2L\sqrt d}\varepsilon$. Define the covering set
\[
\bar{\mathcal F}_h:=\Big\{\bar f(s,a):=r_{h,i}(s,a)+\min\{\bar c\,\phi_h(s,a)^\top\bar\theta',H\}+\phi_h(s,a)^\top\bar\theta\ \Big|\ i\in[M],\ \phi_h\in\Phi_h,\ \bar c\in\mathcal W,\ \bar\theta,\bar\theta'\in\Theta\Big\}.
\]
For any $\tilde f\in\tilde{\mathcal F}'_h$ of the form $\tilde f(s,a)=r_{h,i}(s,a)+\min\{c\,\phi_h(s,a)^\top\theta',H\}+\phi_h(s,a)^\top\theta$ with $0\le c\le L$ and $\max\{\|\theta\|_2,\|\theta'\|_2\}\le2H^2\sqrt d$, we can find $\bar\theta,\bar\theta'\in\Theta$ and $\bar c\in\mathcal W$ such that $\|\theta-\bar\theta\|_2\le\varepsilon$, $\|\theta'-\bar\theta'\|_2\le\varepsilon$ and $|c-\bar c|\le\varepsilon'$. Let $\bar f(s,a):=r_{h,i}(s,a)+\min\{\bar c\,\phi_h(s,a)^\top\bar\theta',H\}+\phi_h(s,a)^\top\bar\theta$; then
\begin{align*}
|\tilde f(s,a)-\bar f(s,a)|&\le\|\phi_h(s,a)\|_2\|\theta-\bar\theta\|_2+\|\phi_h(s,a)\|_2\|c\theta'-\bar c\bar\theta'\|_2\\
&\le\varepsilon+|c-\bar c|\,\|\bar\theta'\|_2+c\,\|\theta'-\bar\theta'\|_2\le\varepsilon+2H^2\sqrt d\,\varepsilon'+L\varepsilon\le3L\varepsilon,
\end{align*}
which implies that $\bar{\mathcal F}_h$ is a $3L\varepsilon$-covering of $\tilde{\mathcal F}'_h$ (and therefore of $\tilde{\mathcal F}_h$), and we have
\[
|\bar{\mathcal F}_h|\le M\Big(\frac{4H^2Ld}\varepsilon\Big)^{3d}|\Phi|.
\]
Replacing $\varepsilon$ by $\frac\varepsilon{3L}$, we get an $\varepsilon$-covering of $\tilde{\mathcal F}_h$ whose size is no larger than $M\big(\frac{12H^2L^2d}\varepsilon\big)^{3d}|\Phi|$. For $\Pi_h$, since each policy is determined by $M$ members of $\mathcal N_h$, we have $|\Pi_h|\le|\mathcal N_h|^M$, which finishes the proof.

Lemma C.3 (Covering number of $\mathcal F_h$). Suppose $\Phi_h$ is the set of one-hot vectors and $\lambda\ge1$.
The $\gamma$-covering number of $\mathcal F_h$ is at most $4M|\Pi_{h+1}|\big(\frac{6L^2d}\gamma\big)^{3d}|\Phi|^2$.

Proof. We cover $\mathcal F_{1,h},\mathcal F_{2,h},\mathcal F_{3,h},\mathcal F_{4,h}$ separately. For $\mathcal F_{1,h}$, let $\Theta$ be an $\ell_2$-cover of $\{\theta\in\mathbb R^d:\|\theta\|_2\le\sqrt d\}$ at scale $\gamma$, so that $|\Theta|\le\big(\frac{2\sqrt d}\gamma\big)^d$. Define the covering set of $\mathcal F_{1,h}$ as
\[
\bar{\mathcal F}_{1,h}:=\Big\{\bar f(s):=\mathbb E_{a\sim U(\mathcal A)}\big[\phi_h(s,a)^\top\bar\theta-\phi'_h(s,a)^\top\bar\theta'\big]\ \Big|\ \phi_h,\phi'_h\in\Phi_h,\ \bar\theta,\bar\theta'\in\Theta\Big\}.
\]
For any $f\in\mathcal F_{1,h}$ with $f(s)=\mathbb E_{a\sim U(\mathcal A)}\big[\phi_h(s,a)^\top\theta-\phi'_h(s,a)^\top\theta'\big]$ and $\max\{\|\theta\|_2,\|\theta'\|_2\}\le\sqrt d$, we can find $\bar\theta,\bar\theta'\in\Theta$ such that $\|\theta-\bar\theta\|_2\le\gamma$ and $\|\theta'-\bar\theta'\|_2\le\gamma$. Letting $\bar f(s):=\mathbb E_{a\sim U(\mathcal A)}\big[\phi_h(s,a)^\top\bar\theta-\phi'_h(s,a)^\top\bar\theta'\big]$, we have
\[
|f(s)-\bar f(s)|\le\frac1A\sum_{a\in\mathcal A}\|\phi_h(s,a)\|_2\|\theta-\bar\theta\|_2+\frac1A\sum_{a\in\mathcal A}\|\phi'_h(s,a)\|_2\|\theta'-\bar\theta'\|_2\le2\gamma,
\]
which implies that $\bar{\mathcal F}_{1,h}$ is a $2\gamma$-covering of $\mathcal F_{1,h}$, with $|\bar{\mathcal F}_{1,h}|\le\big(\frac{2\sqrt d}\gamma\big)^{2d}|\Phi|^2$.

For $\mathcal F_{2,h}$, we construct
\[
\bar{\mathcal F}_{2,h}:=\Big\{\bar f(s):=\mathbb E_{a\sim\pi_{h+1}(s)}\Big[\tfrac{r_{h+1,i}(s,a)}H+\phi_{h+1}(s,a)^\top\bar\theta\Big]\ \Big|\ i\in[M],\ \phi_{h+1}\in\Phi_{h+1},\ \bar\theta\in\Theta,\ \pi_{h+1}\in\Pi_{h+1}\Big\}.
\]
Similar to the argument for $\mathcal F_{1,h}$, one can verify that $\bar{\mathcal F}_{2,h}$ is a $\gamma$-covering of $\mathcal F_{2,h}$, with $|\bar{\mathcal F}_{2,h}|\le M|\Pi_{h+1}|\big(\frac{2\sqrt d}\gamma\big)^d|\Phi|$.

For $\mathcal F_{3,h}$, we only prove the case of NE and CCE; the case of CE can be proved in the same manner. We construct
\[
\bar{\mathcal F}_{3,h}:=\Big\{\bar f(s):=\max_{\tilde\mu_{h+1,i}}\mathbb E_{a\sim(\tilde\mu_{h+1,i}\times\pi_{h+1,-i})(s)}\Big[\tfrac{r_{h+1,i}(s,a)}H+\phi_{h+1}(s,a)^\top\bar\theta\Big]\ \Big|\ i\in[M],\ \phi_{h+1}\in\Phi_{h+1},\ \bar\theta\in\Theta,\ \pi_{h+1}\in\Pi_{h+1}\Big\}.
\]
For any $f\in\mathcal F_{3,h}$ with $f(s)=\max_{\tilde\mu_{h+1,i}}\mathbb E_{a\sim(\tilde\mu_{h+1,i}\times\pi_{h+1,-i})(s)}\big[\frac{r_{h+1,i}(s,a)}H+\phi_{h+1}(s,a)^\top\theta\big]$ and $\|\theta\|_2\le\sqrt d$, we can find $\bar\theta\in\Theta$ such that $\|\theta-\bar\theta\|_2\le\gamma$.
Let $\bar f(s)=\max_{\tilde\mu_{h+1,i}}\mathbb E_{a\sim(\tilde\mu_{h+1,i}\times\pi_{h+1,-i})(s)}\big[\frac{r_{h+1,i}(s,a)}H+\phi_{h+1}(s,a)^\top\bar\theta\big]$. Since a maximum of differences upper-bounds the difference of maxima in both directions, we have
\[
f(s)-\bar f(s)\le\max_{\tilde\mu_{h+1,i}}\mathbb E_{a\sim(\tilde\mu_{h+1,i}\times\pi_{h+1,-i})(s)}\big[\phi_{h+1}(s,a)^\top\theta-\phi_{h+1}(s,a)^\top\bar\theta\big]\le\|\theta-\bar\theta\|_2\le\gamma,
\]
and symmetrically $\bar f(s)-f(s)\le\gamma$, which implies $|f(s)-\bar f(s)|\le\gamma$. Therefore $\bar{\mathcal F}_{3,h}$ is a $\gamma$-covering of $\mathcal F_{3,h}$, and $|\bar{\mathcal F}_{3,h}|\le M|\Pi_{h+1}|\big(\frac{2\sqrt d}\gamma\big)^d|\Phi|$.

For $\mathcal F_{4,h}$, note again that when $\Phi_h$ is the set of one-hot vectors, $\Sigma_h$ is a diagonal matrix. In this case, $\mathcal F_{4,h}$ is a subset of the function class
\[
\mathcal F'_{4,h}:=\Big\{f(s):=\mathbb E_{a\sim\pi_{h+1}(s)}\Big[\tfrac{\min\{c\,\phi_{h+1}(s,a)^\top\theta',H\}}{H^2}+\phi_{h+1}(s,a)^\top\theta\Big]\ \Big|\ 0\le c\le L,\ \pi_{h+1}\in\Pi_{h+1},\ \max\{\|\theta\|_2,\|\theta'\|_2\}\le\sqrt d,\ \phi_{h+1}\in\Phi_{h+1}\Big\}.
\]
In this case, let $\mathcal W$ be an $\ell_\infty$-cover of $[0,L]$ at scale $\tilde\gamma:=\frac\gamma{\sqrt d}$, so that $|\mathcal W|\le\frac{L\sqrt d}\gamma$. Let
\[
\bar{\mathcal F}_{4,h}:=\Big\{\bar f(s):=\mathbb E_{a\sim\pi_{h+1}(s)}\Big[\tfrac{\min\{\bar c\,\phi_{h+1}(s,a)^\top\bar\theta',H\}}{H^2}+\phi_{h+1}(s,a)^\top\bar\theta\Big]\ \Big|\ \bar c\in\mathcal W,\ \pi_{h+1}\in\Pi_{h+1},\ \bar\theta,\bar\theta'\in\Theta,\ \phi_{h+1}\in\Phi_{h+1}\Big\}.
\]
Then, for any $f\in\mathcal F'_{4,h}$ of the above form, we can find $\bar\theta,\bar\theta'\in\Theta$ and $\bar c\in\mathcal W$ such that $\|\theta-\bar\theta\|_2\le\gamma$, $\|\theta'-\bar\theta'\|_2\le\gamma$ and $|c-\bar c|\le\tilde\gamma$.
Letting $\bar f(s):=\mathbb E_{a\sim\pi_{h+1}(s)}\big[\frac{\min\{\bar c\,\phi_{h+1}(s,a)^\top\bar\theta',H\}}{H^2}+\phi_{h+1}(s,a)^\top\bar\theta\big]$, we have
\begin{align*}
|f(s)-\bar f(s)|&\le\mathbb E_{a\sim\pi_{h+1}(s)}\big[\|\phi_{h+1}(s,a)\|_2\big]\|\theta-\bar\theta\|_2+\frac1{H^2}\mathbb E_{a\sim\pi_{h+1}(s)}\big[\|\phi_{h+1}(s,a)\|_2\big]\|c\theta'-\bar c\bar\theta'\|_2\\
&\le\gamma+\frac1{H^2}\big(|c-\bar c|\,\|\bar\theta'\|_2+c\,\|\theta'-\bar\theta'\|_2\big)\le\gamma+\frac{\sqrt d}{H^2}\tilde\gamma+\frac L{H^2}\gamma\le3L\gamma,
\end{align*}
which implies that $\bar{\mathcal F}_{4,h}$ is a $3L\gamma$-covering of $\mathcal F_{4,h}$, and we have $|\bar{\mathcal F}_{4,h}|\le|\Pi_{h+1}|\big(\frac{2Ld}\gamma\big)^{3d}|\Phi|$.

In summary, $\bar{\mathcal F}_h:=\bar{\mathcal F}_{1,h}\cup\bar{\mathcal F}_{2,h}\cup\bar{\mathcal F}_{3,h}\cup\bar{\mathcal F}_{4,h}$ is a $3L\gamma$-covering of $\mathcal F_h$, and $|\bar{\mathcal F}_h|\le4M|\Pi_{h+1}|\big(\frac{2Ld}\gamma\big)^{3d}|\Phi|^2$. Replacing $\gamma$ by $\frac\gamma{3L}$, we get a $\gamma$-covering of $\mathcal F_h$ whose size is no larger than $4M|\Pi_{h+1}|\big(\frac{6L^2d}\gamma\big)^{3d}|\Phi|^2$, which finishes the proof.

Below we omit the superscript $n$ and subscript $h$ when clear from the context. Denote
\begin{align}
\mathcal L_{\lambda,\mathcal D}(\phi,\theta,f)&=\frac1{|\mathcal D|}\sum_{(s,a,s')\in\mathcal D}\big(\phi(s,a)^\top\theta-f(s')\big)^2+\frac\lambda{|\mathcal D|}\|\theta\|_2^2,\tag{13}\\
\mathcal L_{\mathcal D}(\phi,\theta,f)&=\frac1{|\mathcal D|}\sum_{(s,a,s')\in\mathcal D}\big(\phi(s,a)^\top\theta-f(s')\big)^2,\tag{14}\\
\mathcal L_\rho(\phi,\theta,f)&=\mathbb E_{(s,a)\sim\rho,\,s'\sim P^\star(s,a)}\big(\phi(s,a)^\top\theta-f(s')\big)^2.\tag{15}
\end{align}

Lemma C.4 (Uniform convergence for the square loss). Let $\mathcal D:=\{(s_i,a_i,s'_i)\}_{i=1}^n$ be a dataset collected over $n$ episodes. Denote the data-generating distribution in episode $i$ by $d_i$, and let $\rho=\frac1n\sum_{i=1}^nd_i$; note that $d_i$ can depend on the randomness in episodes $1,\dots,i-1$. For a finite feature class $\Phi$ and a discriminator class $\mathcal F\subset\{f:\mathcal S\to[0,1]\}$ with $\gamma$-covering number $\|\mathcal F\|_\gamma$, we will show that, with probability at least $1-\delta$,
\[
\Big|\mathcal L_\rho(\phi,\theta,f)-\mathcal L_\rho(\phi^\star,\theta^\star_f,f)-\big(\mathcal L_{\mathcal D}(\phi,\theta,f)-\mathcal L_{\mathcal D}(\phi^\star,\theta^\star_f,f)\big)\Big|\le\frac12\big(\mathcal L_\rho(\phi,\theta,f)-\mathcal L_\rho(\phi^\star,\theta^\star_f,f)\big)+\frac{64\log\Big(\frac{2(4n)^d\,|\Phi|\,\|\mathcal F\|_{1/2n}}\delta\Big)}n
\]
for all $\phi\in\Phi$, $\|\theta\|_\infty\le1$ and $f\in\mathcal F$, where recall that $\phi^\star$ is the true feature and $\theta^\star_f$ is defined by $\mathbb E_{s'\sim P^\star(s,a)}[f(s')]=\langle\phi^\star(s,a),\theta^\star_f\rangle$.

Proof. To start, we focus on a given $f\in\mathcal F$ and first give a high-probability bound on the deviation term
\[
\mathcal L_\rho(\phi,\theta,f)-\mathcal L_\rho(\phi^\star,\theta^\star_f,f)-\big(\mathcal L_{\mathcal D}(\phi,\theta,f)-\mathcal L_{\mathcal D}(\phi^\star,\theta^\star_f,f)\big).
\]
Denote $g(s_i,a_i)=\phi(s_i,a_i)^\top\theta$ and $g^\star(s_i,a_i)=\phi^\star(s_i,a_i)^\top\theta^\star_f$.
At episode i, let F i-1 be the σ-field generated by all the random variables over the first i -1 episodes, for the random variable Y i := (g(s i , a i ) -f (s ′ i )) 2 -(g ⋆ (s i , a i ) -f (s ′ i )) 2 , we have E[Y i |F i-1 ] =E (g(s i , a i ) -f (s ′ i )) 2 -(g ⋆ (s i , a i ) -f (s ′ i )) 2 =E [(g(s i , a i ) + g ⋆ (s i , a i ) -2f (s ′ i )) (g(s i , a i ) -g ⋆ (s i , a i ))] =E (g(s i , a i ) -g ⋆ (s i , a i )) 2 . Here the conditional expectation is taken according to the distribution d i |F i-1 . The last equality is due to the fact that E [(g ⋆ (s i , a i ) -f (s ′ i )) (g(s i , a i ) -g ⋆ (s i , a i ))] =E si,ai E s ′ i [(g ⋆ (s i , a i ) -f (s ′ i )) (g(s i , a i ) -g ⋆ (s i , a i )) |s i , a i ] =0. Next, for the conditional variance of the random variable, we have: -4, 4] . Applying Lemma 1 in (Foster and Rakhlin, 2020) , we get with probability at least 1 -δ ′ , we can bound the deviation term above as: V[Y i |F i-1 ] ≤E Y 2 i |F i-1 = E (g(s i , a i ) + g ⋆ (s i , a i ) -2f (s ′ i )) 2 (g(s i , a i ) -g ⋆ (s i , a i )) 2 |F i-1 ≤16E (g(s i , a i ) -g ⋆ (s i , a i )) 2 |F i-1 ≤16E[Y i |F i-1 ]. Noticing Y i ∈ [ L ρ (ϕ, θ, f ) -L ρ (ϕ ⋆ , θ ⋆ f , f ) -L D (ϕ, θ, f ) -L D (ϕ ⋆ , θ ⋆ f , f ) ≤ 2 n i=1 V[Y i |F i-1 ] log 2 δ ′ n 2 + 16 log 2 δ ′ 3n ≤ 32 n i=1 E[Y i |F i-1 ] log 2 δ ′ n 2 + 16 log 2 δ ′ 3n , Further, consider a finite point-wise cover of the function class . Let F be a γ-covering set of F. For any f ∈ F, there exists f ∈ F such that ∥f -f ∥ ∞ ≤ γ. Then, applying a union bound over elements in Φ × W × F, with probability 1 -|Φ||W|| F|δ ′ , for all θ ∈ W, f ∈ F, we have: G := {g(s, a) = ϕ(s, a) ⊤ θ : ϕ ∈ Φ, ∥θ∥ ∞ ≤ 1}. 
Note that $\bar{\mathcal W}$ is an $\ell_\infty$-cover of $\mathcal W = \{\theta : \|\theta\|_\infty\le1\}$ at scale $\gamma$; let $\bar\theta\in\bar{\mathcal W}$ and $\bar f\in\bar{\mathcal F}$ be the nearest cover elements of $\theta$ and $f$. Then
$$\mathcal L_\rho(\phi,\theta,f) - \mathcal L_\rho(\phi^\star,\theta^\star_f,f) - \big(\mathcal L_{\mathcal D}(\phi,\theta,f) - \mathcal L_{\mathcal D}(\phi^\star,\theta^\star_f,f)\big)$$
$$\le \mathcal L_\rho(\phi,\bar\theta,\bar f) - \mathcal L_\rho(\phi^\star,\theta^\star_{\bar f},\bar f) - \big(\mathcal L_{\mathcal D}(\phi,\bar\theta,\bar f) - \mathcal L_{\mathcal D}(\phi^\star,\theta^\star_{\bar f},\bar f)\big) + 16\gamma$$
$$\le \sqrt{\frac{32\sum_{i=1}^n\mathbb E[\bar Y_i|\mathcal F_{i-1}]\log\frac2{\delta'}}{n^2}} + \frac{16\log\frac2{\delta'}}{3n} + 16\gamma \le \frac{1}{2n}\sum_{i=1}^n\mathbb E[\bar Y_i|\mathcal F_{i-1}] + \frac{16\log\frac2{\delta'}}{n} + \frac{16\log\frac2{\delta'}}{3n} + 16\gamma$$
$$\le \frac{1}{2n}\sum_{i=1}^n\mathbb E[Y_i|\mathcal F_{i-1}] + \frac{16\log\frac2{\delta'}}{n} + \frac{16\log\frac2{\delta'}}{3n} + 32\gamma \le \frac12\big(\mathcal L_\rho(\phi,\theta,f) - \mathcal L_\rho(\phi^\star,\theta^\star_f,f)\big) + \frac{32\log\frac2{\delta'}}{n} + 32\gamma$$
$$\le \frac12\big(\mathcal L_\rho(\phi,\theta,f) - \mathcal L_\rho(\phi^\star,\theta^\star_f,f)\big) + \frac{64\log\frac2{\delta'}}{n} \qquad (\text{setting }\gamma = 1/n),$$
where $\bar Y_i := \big(\phi(s_i,a_i)^\top\bar\theta - \bar f(s'_i)\big)^2 - \big(\phi(s_i,a_i)^\top\theta^\star_{\bar f} - \bar f(s'_i)\big)^2$. Finally, setting $\delta' = \delta/\big(|\Phi||\bar{\mathcal W}||\bar{\mathcal F}|\big)$ and using $|\bar{\mathcal W}|\le(4n)^d$, we get $\log\frac2{\delta'} \le \log\frac{2(4n)^d|\Phi||\bar{\mathcal F}|}{\delta}$. This completes the proof.

Lemma C.5 (Deviation Bounds for Alg. 3). Let $\varepsilon' = \frac{128\log(2(4n)^d\cdot|\Phi|\cdot\|\mathcal F\|_{1/2n}/\delta)}{n}$. If Alg. 3 is called with a dataset $\mathcal D$ of size $n$, then with probability at least $1-\delta$, for any $f\in\mathcal F\subset[0,1]^{\mathcal S}$ we have
$$\mathbb E_\rho\Big[\big(\hat\phi(s,a)^\top\hat\theta_f - \phi^\star(s,a)^\top\theta^\star_f\big)^2\Big] \le \varepsilon' + \frac{2\lambda d}{n}.$$
Proof. We begin with the result of Lemma C.4: with probability at least $1-\delta$, for all $\|\theta\|_\infty\le1$, $\phi\in\Phi$ and $f\in\mathcal F$, we have
$$\mathcal L_\rho(\phi,\theta,f) - \mathcal L_\rho(\phi^\star,\theta^\star_f,f) - \big(\mathcal L_{\mathcal D}(\phi,\theta,f) - \mathcal L_{\mathcal D}(\phi^\star,\theta^\star_f,f)\big) \le \frac12\big(\mathcal L_\rho(\phi,\theta,f) - \mathcal L_\rho(\phi^\star,\theta^\star_f,f)\big) + \varepsilon'/2.$$
Thus, with probability at least $1-\delta$,
$$\mathbb E_\rho\Big[\big(\hat\phi(s,a)^\top\hat\theta_f - \phi^\star(s,a)^\top\theta^\star_f\big)^2\Big] = \mathcal L_\rho(\hat\phi,\hat\theta_f,f) - \mathcal L_\rho(\phi^\star,\theta^\star_f,f) \qquad (\text{since }\mathbb E_{s'\sim P^\star(s,a)}[f(s')] = \phi^\star(s,a)^\top\theta^\star_f)$$
$$\le 2\big(\mathcal L_{\mathcal D}(\hat\phi,\hat\theta_f,f) - \mathcal L_{\mathcal D}(\phi^\star,\theta^\star_f,f)\big) + \varepsilon' \qquad (\text{Lemma C.4, and }\|\hat\theta_f\|_\infty\le1\text{ by the proof of Lemma C.1})$$
$$\le 2\Big(\mathcal L_{\lambda,\mathcal D}(\hat\phi,\hat\theta_f,f) - \mathcal L_{\lambda,\mathcal D}(\phi^\star,\theta^\star_f,f) + \frac\lambda n\|\theta^\star_f\|_2^2\Big) + \varepsilon' \le \varepsilon' + \frac{2\lambda d}{n} \qquad (\text{by the optimality of }(\hat\phi,\hat\theta_f)\text{ under }\mathcal L_{\lambda,\mathcal D}(\cdot,\cdot,f)),$$
which is the inequality in the lemma statement. Here we use $\|\theta^\star_f\|_2^2\le d$.

Lemma C.6. When $\hat P^{(n)}_h$ is computed using Alg. 3 and the Markov game is a block Markov game, if we set
$$\lambda = \Theta\Big(d\log\frac{NH|\Phi|}{\delta}\Big),\qquad \zeta^{(n)} = \Theta\Big(\frac{d^2M\log\frac{dNHML|\Phi|}{\delta\bar\varepsilon}}{n}\Big),$$
then $\mathcal E$ holds with probability at least $1-\delta$.

Proof. Combining Lemma C.5 and Lemma C.3, we have
$$\max_{f\in\mathcal F_h}\mathbb E_\rho\Big[\big(\hat\phi(s,a)^\top\hat\theta_f - \phi^\star(s,a)^\top\theta^\star_f\big)^2\Big] \le \varepsilon' + \frac{2\lambda d}{n} \le \zeta^{(n)} := \Theta\Big(\frac{d^2M\log\frac{dNHML|\Phi|}{\delta\bar\varepsilon}}{n}\Big),$$
which shows that $\mathcal E_1$ holds with high probability. Combining this result with Lemma E.1 proves Lemma C.6.
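The two facts driving Lemma C.4 above — that $\mathbb E[Y_i]$ equals the excess risk $\mathbb E[(g-g^\star)^2]$, and that the variance of $Y_i$ is controlled by its mean — can be sanity-checked by Monte Carlo. The sketch below is a toy simulation under arbitrary assumed distributions (uniform noise, $[0,1]$-valued predictors); it is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# g*(s,a) is the Bayes predictor of f(s'): f(s') = g*(s,a) + noise with
# E[noise | s,a] = 0. Ranges are chosen so all values stay in [0, 1].
g_star = rng.uniform(0.2, 0.8, size=N)                 # g*(s_i, a_i)
g = g_star + rng.uniform(-0.2, 0.2, size=N)            # competing predictor
f_next = g_star + rng.uniform(-0.2, 0.2, size=N)       # f(s'_i)

Y = (g - f_next) ** 2 - (g_star - f_next) ** 2
excess = ((g - g_star) ** 2).mean()

# E[Y] = E[(g - g*)^2]: the cross term vanishes since E[f(s') | s,a] = g*(s,a).
assert abs(Y.mean() - excess) < 1e-2
# Bernstein-style self-bounding: Var(Y) <= 16 E[(g - g*)^2] for bounded g, g*, f.
assert Y.var() <= 16 * excess
```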

C.3 STATISTICAL GUARANTEES

To ensure that the algorithm is well-defined, we first prove the following lemma, which implies that the optimistic Q-value estimators always belong to the function class $\mathcal F_h$.

Lemma C.7. When $\alpha^{(n)}\le L$, we have $Q^{(n)}_{h,i}\in\mathcal F_h$ for all $h\in[H]$, $i\in[M]$ and $n\in[N]$.

Proof. Because $\hat\beta^{(n)}_h$ is upper bounded by $H$, by induction one can easily get $V^{(n)}_{h+1,i}\le 2H^2$. Then, according to the result of Lemma C.1, $(\hat P^{(n)}_h V^{(n)}_{h+1,i})(s,a) = \hat\phi^{(n)}_h(s,a)^\top\theta$ with $\|\theta\|_2\le 2H^2\sqrt d$. We conclude $Q^{(n)}_{h,i}\in\mathcal F_h$. We will show later that our choice of $\alpha^{(n)}$ and $L$ always satisfies the condition $\alpha^{(n)}\le L$.

Lemma C.8. We have
- For NE and CCE, $\max_{\pi_{h,i}}\mathbb D_{\pi_{h,i}\times\pi^{(n)}_{h,-i}}Q^{(n)}_{h,i}(s) \le \mathbb D_{\pi^{(n)}_h}Q^{(n)}_{h,i}(s) + 2\bar\varepsilon$;
- For CE, $\max_{\omega_{h,i}\in\Omega_{h,i}}\mathbb D_{\omega_{h,i}\circ\pi^{(n)}_h}Q^{(n)}_{h,i}(s) \le \mathbb D_{\pi^{(n)}_h}Q^{(n)}_{h,i}(s) + 2\bar\varepsilon$.

Proof. We only prove the case of NE and CCE; the case of CE can be proved similarly. Let $\bar Q^{(n)}_{h,i}$ be the nearest neighbour of $Q^{(n)}_{h,i}$ in $\mathcal N_h$. We have
$$\max_{\pi_{h,i}}\mathbb D_{\pi_{h,i}\times\pi^{(n)}_{h,-i}}Q^{(n)}_{h,i}(s) \le \max_{\pi_{h,i}}\mathbb D_{\pi_{h,i}\times\pi^{(n)}_{h,-i}}\bar Q^{(n)}_{h,i}(s) + \bar\varepsilon \le \mathbb D_{\pi^{(n)}_h}\bar Q^{(n)}_{h,i}(s) + \bar\varepsilon \le \mathbb D_{\pi^{(n)}_h}Q^{(n)}_{h,i}(s) + 2\bar\varepsilon,$$
where the second inequality follows from the definition of $\pi^{(n)}_h$. This completes the proof.

Lemma C.9 (One-step-back inequality for the learned model). Suppose the event $\mathcal E$ holds. Consider a set of functions $\{g_h\}_{h=1}^H$ with $g_h:\mathcal S\times\mathcal A\to\mathbb R_+$ and $\|g_h\|_\infty\le B$. For a given policy $\pi$, suppose $\mathbb E_{a\sim U(\mathcal A)}[g_h(\cdot,a)]\in\mathcal F_{1,h}$; then
$$\mathbb E_{(s,a)\sim d^\pi_{\hat P^{(n)},h}}[g_h(s,a)] \le \begin{cases}\sqrt{A\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}[g_1^2(s,a)]}, & h=1,\\[3pt] \mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{\hat P^{(n)},h-1}}\Big[\min\Big\{\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\sqrt{nA^2\,\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}[g_h^2(s,a)] + B^2\lambda d + nA^2\zeta^{(n)}},\; B\Big\}\Big], & h\ge2.\end{cases}$$
Recall $\Sigma_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h} = n\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[\hat\phi^{(n)}_h(s,a)\hat\phi^{(n)}_h(s,a)^\top\big] + \lambda I_d$.

Proof. For step $h=1$, we have
$$\mathbb E_{(s,a)\sim d^\pi_{\hat P^{(n)},1}}[g_1(s,a)] = \mathbb E_{s\sim d_1,a\sim\pi_1(s)}[g_1(s,a)] \le \sqrt{\max_{(s,a)}\frac{d_1(s)\pi_1(a|s)}{\rho^{(n)}_1(s,a)}}\sqrt{\mathbb E_{(s',a')\sim\rho^{(n)}_1}[g_1^2(s',a')]} = \sqrt{\max_{(s,a)}\frac{d_1(s)\pi_1(a|s)}{d_1(s)u_{\mathcal A}(a)}}\sqrt{\mathbb E_{(s',a')\sim\rho^{(n)}_1}[g_1^2(s',a')]} \le \sqrt{A\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}[g_1^2(s,a)]}.$$
For steps $h = 2, \dots$
$, H-1$, we observe the following one-step-back decomposition:
$$\mathbb E_{(s,a)\sim d^\pi_{\hat P^{(n)},h}}[g_h(s,a)] = \mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{\hat P^{(n)},h-1},\,s\sim\hat P^{(n)}_{h-1}(\tilde s,\tilde a),\,a\sim\pi_h(s)}[g_h(s,a)] = \mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{\hat P^{(n)},h-1}}\Big[\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)^\top\int_{\mathcal S}\sum_{a\in\mathcal A}\hat w^{(n)}_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds\Big]$$
$$= \mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{\hat P^{(n)},h-1}}\Big[\min\Big\{\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)^\top\int_{\mathcal S}\sum_{a\in\mathcal A}\hat w^{(n)}_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds,\; B\Big\}\Big] \le \mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{\hat P^{(n)},h-1}}\Big[\min\Big\{\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\Big\|\int_{\mathcal S}\sum_{a\in\mathcal A}\hat w^{(n)}_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds\Big\|_{\Sigma_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}},\; B\Big\}\Big],$$
where we use the fact that $g_h$ is bounded by $B$. Then,
$$\Big\|\int_{\mathcal S}\sum_{a\in\mathcal A}\hat w^{(n)}_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds\Big\|^2_{\Sigma_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}} \le \Big(\int_{\mathcal S}\sum_{a}\hat w^{(n)}_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds\Big)^\top\Big(n\,\mathbb E_{(s,a)\sim\rho^{(n)}_{h-1}}\big[\hat\phi^{(n)}_{h-1}(s,a)\hat\phi^{(n)}_{h-1}(s,a)^\top\big] + \lambda I_d\Big)\Big(\int_{\mathcal S}\sum_{a}\hat w^{(n)}_{h-1}(s)\pi_h(a|s)g_h(s,a)\,ds\Big)$$
$$\le n\,\mathbb E_{(\tilde s,\tilde a)\sim\rho^{(n)}_{h-1}}\Big[\Big(\int_{\mathcal S}\sum_{a\in\mathcal A}\hat w^{(n)}_{h-1}(s)^\top\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\,\pi_h(a|s)g_h(s,a)\,ds\Big)^2\Big] + B^2\lambda d \qquad \big(\|\textstyle\sum_a\pi_h(a|\cdot)g_h(\cdot,a)\|_\infty\le B\text{ and, by Lemma C.1, }\|\int_{\mathcal S}\hat w^{(n)}_{h-1}(s)l(s)\,ds\|_2\le\sqrt d\text{ for any }l:\mathcal S\to[0,1]\big)$$
$$= n\,\mathbb E_{(\tilde s,\tilde a)\sim\rho^{(n)}_{h-1}}\Big[\big(\mathbb E_{s\sim\hat P^{(n)}_{h-1}(\tilde s,\tilde a),\,a\sim\pi_h(s)}[g_h(s,a)]\big)^2\Big] + B^2\lambda d \le nA^2\,\mathbb E_{(\tilde s,\tilde a)\sim\rho^{(n)}_{h-1}}\Big[\big(\mathbb E_{s\sim\hat P^{(n)}_{h-1}(\tilde s,\tilde a),\,a\sim U(\mathcal A)}[g_h(s,a)]\big)^2\Big] + B^2\lambda d \qquad (\text{importance sampling})$$
$$\le nA^2\,\mathbb E_{(\tilde s,\tilde a)\sim\rho^{(n)}_{h-1}}\Big[\big(\mathbb E_{s\sim P^\star_{h-1}(\tilde s,\tilde a),\,a\sim U(\mathcal A)}[g_h(s,a)]\big)^2\Big] + B^2\lambda d + nA^2\zeta^{(n)} \qquad (\text{assumption on }g_h)$$
$$\le nA^2\,\mathbb E_{(\tilde s,\tilde a)\sim\rho^{(n)}_{h-1},\,s\sim P^\star_{h-1}(\tilde s,\tilde a),\,a\sim U(\mathcal A)}\big[g_h^2(s,a)\big] + B^2\lambda d + nA^2\zeta^{(n)} \qquad (\text{Jensen})$$
$$= nA^2\,\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}\big[g_h^2(s,a)\big] + B^2\lambda d + nA^2\zeta^{(n)}. \qquad (\text{definition of }\tilde\rho^{(n)}_h)$$
Combining the above results together, we get
$$\mathbb E_{(s,a)\sim d^\pi_{\hat P^{(n)},h}}[g_h(s,a)] \le \mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{\hat P^{(n)},h-1}}\Big[\min\Big\{\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\sqrt{nA^2\,\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}[g_h^2(s,a)] + B^2\lambda d + nA^2\zeta^{(n)}},\; B\Big\}\Big],$$
which completes the proof. The following lemma is an exact copy of Lemma B.4; we restate it here for completeness.
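The key step in the one-step-back decompositions above is the elliptical Cauchy–Schwarz inequality $\phi^\top v \le \|\phi\|_{\Sigma^{-1}}\|v\|_{\Sigma}$ for positive definite $\Sigma$. A minimal numeric check (the random $\Sigma$, $\phi$, $v$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)        # positive definite, like n E[phi phi^T] + lam I
phi = rng.normal(size=d)
v = rng.normal(size=d)

lhs = phi @ v
# ||phi||_{Sigma^{-1}} * ||v||_{Sigma}
rhs = np.sqrt(phi @ np.linalg.solve(Sigma, phi)) * np.sqrt(v @ Sigma @ v)
assert lhs <= rhs + 1e-9
```

The inequality follows from writing $\phi^\top v = (\Sigma^{-1/2}\phi)^\top(\Sigma^{1/2}v)$ and applying ordinary Cauchy–Schwarz.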
Lemma C.10 (One-step back inequality for the true model). Consider a set of functions {g h } H h=1 that satisfies g h ∈ S × A → R + , s.t. ∥g h ∥ ∞ ≤ B. Then for any given policy π, we have E (s,a)∼d π P ⋆ ,h [g h (s, a)] ≤          AE (s,a)∼ρ (n) 1 [g 2 1 (s, a)], h = 1 E (s,ã)∼d π P ⋆ ,h-1 ϕ ⋆ h-1 (s, ã) Σ -1 n,γ (n) h-1 ,ϕ ⋆ h-1 nAE (s,a)∼ ρ(n) h [g 2 h (s, a)] + B 2 λd, h ≥ 2 Recall Σ n,γ (n) h ,ϕ ⋆ h = nE (s,a)∼γ (n) h ϕ ⋆ h (s, a)ϕ ⋆ h (s, a) ⊤ + λI d . Lemma C.11 (Optimism for NE and CCE). Consider an episode n ∈ [N ] and set α (n) = Θ H nA 2 ζ (n) + dλ . When the event E holds and the policy π (n) is computed by solving NE or CCE, we have v (n) i (s) -v †,π (n) -i i (s) ≥ -H Aζ (n) -2H ε, ∀n ∈ [N ], i ∈ [M ]. Proof. Denote μ(n) h,i (•|s) := arg max µ D µ,π (n) h,-i Q †,π (n) -i h,i (s) and let π(n) h = μ(n) h,i × π (n) h,-i . Let f (n) h (s, a) = 1 H P (n) h -P ⋆ h V †,π (n) -i h+1,i (s, a), note that by definition, we have 1 H V †,π (n) -i h+1,i (s) is bounded by 1, and 1 H V †,π (n) -i h+1,i (s) =E a∼π (n) h (s) r h+1,i (s, a) H + 1 H P ⋆ h+1 V †,π (n) -i h+2,i (s, a) = max µ h+1,i E a∼(µ h+1,i ×π (n) h+1,-i )(s) r h+1,i (s, a) H + 1 H P ⋆ h+1 V †,π (n) -i h+2,i (s, a) ∈ F 3,h . where we use the result of Lemma C.1 and get 1 H P ⋆ h+1 V †,π (n) -i h+2,i (s, a) is a linear function in ϕ ⋆ h+1 and the 2-norm of the weight is upper bounded by √ d. Then according to the event E, we have E (s,a)∼ρ (n) h f (n) h (s, a) 2 ≤ ζ (n) , E (s,a)∼ ρ(n) h f (n) h (s, a) 2 ≤ ζ (n) , ∀n ∈ [N ], h ∈ [H] ∥ϕ h (s, a)∥ Σ(n) h,ϕ h -1 = Θ ∥ϕ h (s, a)∥ Σ -1 n,ρ (n) h ,ϕ h , ∀n ∈ [N ], h ∈ [H], ϕ h ∈ Φ h . A direct conclusion of the event E is we can find an absolute constant c, such that β (n) h (s, a) = min      α (n) φ(n) h (s, ã) Σ (n) h, φ(n) h -1 , H      ≥ min    cα (n) φ(n) h (s, ã) Σ -1 n,ρ (n) , φ(n) h , H    , ∀n ∈ [N ], h ∈ [H]. 
Next, we prove by induction that E s∼d π(n) P (n) ,h V (n) h,i (s) -V †,π (n) -i h,i ≥ H h ′ =h E (s,a)∼d π(n) P (n) ,h ′ β(n) h ′ (s, a) -Hf (n) h ′ (s, a) -2(H -h + 1)ε, ∀h ∈ [H]. First, notice that ∀h ∈ [H], E s∼d π(n) P (n) ,h V (n) h,i (s) -V †,π (n) -i h,i (s) =E s∼d π(n) P (n) ,h D π (n) h Q (n) h,i (s) -D π(n) h Q †,π (n) -i h,i ≥E s∼d π(n) P (n) ,h D π(n) h Q (n) h,i (s) -D π(n) h Q †,π (n) -i h,i (s) -2ε =E (s,a)∼d π(n) P (n) ,h Q (n) h,i (s, a) -Q †,π (n) -i h,i (s, a) -2ε, where the inequality uses the result of Lemma C.8. Now we are ready to prove equation 16, • When h = H, we have E s∼d π(n) P (n) ,H V (n) H,i (s) -V †,π (n) -i H,i (s) ≥E (s,a)∼d π(n) P (n) ,H Q (n) H,i (s, a) -Q †,π (n) -i H,i (s, a) -2ε =E (s,a)∼d π(n) P (n) ,H β(n) h (s, a) -2ε ≥E (s,a)∼d π(n) P (n) ,H β(n) h (s, a) -Hf (n) H (s, a) -2ε. • Suppose the statement is true for h + 1, then for step h, we have E s∼d π(n) P (n) ,h V (n) h,i (s) -V †,π (n) -i h,i (s) ≥E (s,a)∼d π(n) P (n) ,h Q (n) h,i (s, a) -Q †,π (n) -i h,i (s, a) -2ε =E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) + P (n) h V (n) h+1,i (s, a) -P ⋆ h V †,π (n) -i h+1,i (s, a) -2ε =E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) + P (n) h V (n) h+1,i -V †,π (n) -i h+1,i (s, a) + P (n) h -P ⋆ h V †,π (n) -i h+1,i (s, a) -2ε =E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) + P (n) h -P ⋆ h V †,π (n) -i h+1,i (s, a) + E s∼d π(n) P (n) ,h+1 V (n) h+1,i (s) -V †,π (n) -i h+1,i (s) -2ε ≥E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) -Hf (n) h (s, a) + E s∼d π(n) P (n) ,h+1 V (n) h+1,i (s) -V †,π (n) -i h+1,i (s) -2ε ≥ H h ′ =h E (s,a)∼d π(n) P (n) ,h ′ β(n) h ′ (s, a) -Hf (n) h ′ (s, a) -2(H -h + 1)ε, where the last row uses the induction assumption. Therefore, we have proved equation 16. 
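The induction just completed has a simple action-free, tabular analogue: if the per-step bonus dominates the model error $|(\hat P_h - P^\star_h)V_{h+1}|$, then the bonus-augmented values computed under the learned model upper-bound the true values. The sketch below uses an oracle bonus equal to the exact model error purely for illustration (in the algorithm the bonus $\hat\beta$ is built from $\|\hat\phi\|_{\Sigma^{-1}}$ instead); all quantities are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
S, H = 4, 5
P_true = rng.dirichlet(np.ones(S), size=(H, S))              # P*_h(s' | s)
P_hat = 0.9 * P_true + 0.1 * rng.dirichlet(np.ones(S), size=(H, S))
r = rng.uniform(size=(H, S))

V_true = np.zeros(S)   # V*_{H+1} = 0
V_opt = np.zeros(S)    # optimistic values under the learned model
for h in reversed(range(H)):
    # Oracle bonus: exactly the model error |(hat P_h - P*_h) V*_{h+1}|.
    bonus = np.abs((P_hat[h] - P_true[h]) @ V_true)
    V_opt = r[h] + bonus + P_hat[h] @ V_opt
    V_true = r[h] + P_true[h] @ V_true

assert np.all(V_opt >= V_true - 1e-9)   # optimism holds at every state
```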
We then apply $h=1$ to equation 16 and get
$$\mathbb E_{s\sim d_1}\Big[V^{(n)}_{1,i}(s) - V^{\dagger,\pi^{(n)}_{-i}}_{1,i}(s)\Big] = \mathbb E_{s\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},1}}\Big[V^{(n)}_{1,i}(s) - V^{\dagger,\pi^{(n)}_{-i}}_{1,i}(s)\Big] \ge \sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\hat\beta^{(n)}_h(s,a) - Hf^{(n)}_h(s,a)\Big] - 2H\bar\varepsilon = \sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_h(s,a)\big] - H\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[f^{(n)}_h(s,a)\big] - 2H\bar\varepsilon.$$
For the second term: since $\frac1H\hat P^{(n)}_hV^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}$ is linear in $\hat\phi^{(n)}_h$ and $\frac1HP^\star_hV^{\dagger,\pi^{(n)}_{-i}}_{h+1,i}$ is linear in $\phi^\star_h$, with the 2-norms of both weights bounded by $\sqrt d$ by Lemma C.1, we have $\mathbb E_{a\sim U(\mathcal A)}[f^{(n)}_h(\cdot,a)]\in\mathcal F_{1,h}$. By Lemma C.9, for $h=1$,
$$\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},1}}\big[f^{(n)}_1(s,a)\big] \le \sqrt{A\,\mathbb E_{(s,a)\sim\rho^{(n)}_1}\big[f^{(n)}_1(s,a)^2\big]} \le \sqrt{A\zeta^{(n)}},$$
and for all $h\ge2$,
$$\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[f^{(n)}_h(s,a)\big] \le \mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\sqrt{nA^2\,\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big] + d\lambda + nA^2\zeta^{(n)}},\,1\Big\}\Big] \lesssim \mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}}\sqrt{nA^2\zeta^{(n)} + d\lambda},\,1\Big\}\Big].$$
Here we use $f^{(n)}_h(s,a)\le1$, $\mathbb E_{(s,a)\sim\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$ and $\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$. Then, according to our choice of $\alpha^{(n)}$, we get
$$\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[f^{(n)}_h(s,a)\big] \le \mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\frac{c\alpha^{(n)}}{H}\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}},\,1\Big\}\Big].$$
Combining all things together,
$$v^{(n)}_i - v^{\dagger,\pi^{(n)}_{-i}}_i = \mathbb E_{s\sim d_1}\Big[V^{(n)}_{1,i}(s) - V^{\dagger,\pi^{(n)}_{-i}}_{1,i}(s)\Big] \ge \sum_{h=1}^H\mathbb E\big[\hat\beta^{(n)}_h(s,a)\big] - H\sum_{h=1}^H\mathbb E\big[f^{(n)}_h(s,a)\big] - 2H\bar\varepsilon \ge \sum_{h=1}^{H-1}\mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\hat\beta^{(n)}_h(\tilde s,\tilde a) - \min\Big\{c\alpha^{(n)}\|\hat\phi^{(n)}_h(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},\,H\Big\}\Big] - H\sqrt{A\zeta^{(n)}} - 2H\bar\varepsilon = -H\sqrt{A\zeta^{(n)}} - 2H\bar\varepsilon,$$
which proves the inequality.

Lemma C.12 (Optimism for CE). Consider an episode $n\in[N]$ and set $\alpha^{(n)} = \Theta\big(H\sqrt{nA^2\zeta^{(n)} + d\lambda}\big)$. When the event $\mathcal E$ holds, we have
$$v^{(n)}_i(s) - \max_{\omega\in\Omega_i}v^{\omega\circ\pi^{(n)}}_i(s) \ge -H\sqrt{A\zeta^{(n)}} - 2H\bar\varepsilon,\qquad\forall n\in[N],\ i\in[M].$$
Proof.
Denote ω(n) h,i = arg max ω h ∈Ω h,i D ω h •π (n) h max ω∈Ωi Q ω•π (n) h,i (s) and let π(n) h = ωh,i • π (n) h . Let f (n) h (s, a) = 1 H P (n) h -P ⋆ h max ω∈Ωi V ω•π (n) h+1,i (s, a), note that by definition, we have 1 H max ω∈Ωi V ω•π (n) h+1,i (s) is bounded by 1, and 1 H max ω∈Ωi V ω•π (n) h+1,i (s) = max ω h+1,i ∈Ω h+1,i E a∼(ω h+1,i •π h )(s) r h+1,i (s, a) H + 1 H P ⋆ h+1 max ω∈Ωi V ω•π (n) h+2,i (s, a) ∈ F 3,h . where we use the result of Lemma C.1 and get 1  H P ⋆ h+1 max ω∈Ωi V ω•π (n) h+2,i E (s,a)∼ρ (n) h f (n) h (s, a) 2 ≤ ζ (n) , E (s,a)∼ ρ(n) h f (n) h (s, a) 2 ≤ ζ (n) , ∀n ∈ [N ], h ∈ [H] ∥ϕ h (s, a)∥ Σ(n) h,ϕ h -1 = Θ ∥ϕ h (s, a)∥ Σ -1 n,ρ (n) h ,ϕ h , ∀n ∈ [N ], h ∈ [H], ϕ h ∈ Φ h . A direct conclusion of the event E is we can find an absolute constant c, such that β (n) h (s, a) = min      α (n) φ(n) h (s, ã) Σ (n) h, φ(n) h -1 , H      ≥ min    cα (n) φ(n) h (s, ã) Σ -1 n,ρ (n) h , φ(n) h , H    , ∀n ∈ [N ], h ∈ [H]. Next, we prove by induction that E s∼d π(n) P (n) ,h V (n) h,i (s) -max ω∈Ωi V ω•π (n) h,i ≥ H h ′ =h E (s,a)∼d π(n) P (n) ,h ′ β(n) h ′ (s, a) -Hf (n) h ′ (s, a) -2(H -h + 1)ε, ∀h ∈ [H]. First, notice that ∀h ∈ [H], E s∼d π(n) P (n) ,h V (n) h,i (s) -max ω∈Ωi V ω•π (n) h,i (s) =E s∼d π(n) P (n) ,h D π (n) h Q (n) h,i (s) -D π(n) h max ω∈Ωi Q ω•π (n) h,i ≥E s∼d π(n) P (n) ,h D π(n) h Q (n) h,i (s) -D π(n) h max ω∈Ωi Q ω•π (n) h,i (s) -2ε =E (s,a)∼d π(n) P (n) ,h Q (n) h,i (s, a) -max ω∈Ωi Q ω•π (n) h,i (s, a) -2ε. where the inequality uses the result of Lemma C.8. Now we are ready to prove equation 17, • When h = H, we have E s∼d π(n) P (n) ,H V (n) H,i (s) -max ω∈Ωi V ω•π (n) H,i (s) ≥E (s,a)∼d π(n) P (n) ,H Q (n) H,i (s, a) -max ω∈Ωi Q ω•π (n) H,i (s, a) =E (s,a)∼d π(n) P (n) ,H β(n) h (s, a) ≥E (s,a)∼d π(n) P (n) ,H β(n) h (s, a) -Hf (n) H (s, a) -2ε. 
• Suppose the statement is true for h + 1, then for step h, we have E s∼d π(n) P (n) ,h V (n) h,i (s) -max ω∈Ωi V ω•π (n) h,i ≥E (s,a)∼d π(n) P (n) ,h Q (n) h,i (s, a) -max ω∈Ωi Q ω•π (n) h,i (s, a) -2ε =E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) + P (n) h V (n) h+1,i (s, a) -P ⋆ h max ω∈Ωi V ω•π (n) -i h+1,i a) -2ε =E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) + P (n) h V (n) h+1,i -max ω∈Ωi V ω•π (n) h+1,i (s, a) - P (n) h -P ⋆ h max ω∈Ωi V ω•π (n) h+1,i -2ε =E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) - P (n) h -P ⋆ h max ω∈Ωi V ω•π (n) -i h+1,i (s, a) + E s∼d π(n) P (n) ,h+1 V (n) h+1,i (s) -max ω∈Ωi V ω•π (n) h+1,i (s) -2ε ≥E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) -Hf (n) h (s, a) + E s∼d π(n) P (n) ,h+1 V (n) h+1,i (s) -max ω∈Ωi V ω•π (n) h+1,i (s) -2ε ≥ H h ′ =h E (s,a)∼d π(n) P (n) ,h ′ β(n) h ′ (s, a) -Hf (n) h ′ (s, a) -2(H -h + 1)ε, where the last row uses the induction assumption. Therefore, we have proved equation 17. We then apply h = 1 to equation 17, and get E s∼d1 V (n) 1,i (s) -max ω∈Ωi V ω•π (n) 1,i =E s∼d π(n) P (n) ,1 V (n) 1,i (s) -max ω∈Ωi V ω•π (n) 1,i ≥ H h=1 E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) -Hf (n) h (s, a) -2H ε = H h=1 E (s,a)∼d π(n) P (n) ,h β(n) h (s, a) -H H h=1 E (s,a)∼d π(n) P (n) ,h f (n) h (s, a) -2H ε. For the second term, since 1 H P (n) h max ω∈Ωi V ω•π (n) h+1,i is linear in φ(n) h and 1 H P ⋆ h max ω∈Ωi V ω•π (n) h+1,i is linear in ϕ ⋆ h , and according to the result of Lemma C.1, the 2-norm of their weights are both upper bounded by √ d. Therefore, we have E a∼U (A) f (n) h (•, a) ∈ F 1,h . By Lemma C.9, we have for h = 1, E (s,a)∼d π(n) P (n) ,1 f (n) 1 (s, a) ≤ AE (s,a)∼ρ (n) 1 f (n) 1 (s, a) 2 ≤ Aζ (n) . And ∀h ≥ 2, we have E (s,a)∼d π(n) P (n) ,h f (n) h (s, a) ≤E (s,ã)∼d π(n) P (n) ,h-1   min    φ(n) h-1 (s, ã) Σ -1 n,ρ (n) h-1 , φ(n) h-1 nA 2 E (s,a)∼ ρ(n) h f (n) h (s, a) 2 + dλ + nA 2 ζ (n) , 1     ≲E (s,ã)∼d π(n) P (n) ,h-1   min    φ(n) h-1 (s, ã) Σ -1 n,ρ (n) h-1 , φ(n) h-1 nA 2 ζ (n) + dλ, 1      . 
Here we use $f^{(n)}_h(s,a)\le1$, $\mathbb E_{(s,a)\sim\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$ and $\mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}[f^{(n)}_h(s,a)^2]\le\zeta^{(n)}$. Then, according to our choice of $\alpha^{(n)}$, we get
$$\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[f^{(n)}_h(s,a)\big] \le \mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\frac{c\alpha^{(n)}}{H}\|\hat\phi^{(n)}_{h-1}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\hat\phi^{(n)}_{h-1}}},\,1\Big\}\Big].$$
Combining all things together,
$$v^{(n)}_i - \max_{\omega\in\Omega_i}v^{\omega\circ\pi^{(n)}}_i = \mathbb E_{s\sim d_1}\Big[V^{(n)}_{1,i}(s) - \max_{\omega\in\Omega_i}V^{\omega\circ\pi^{(n)}}_{1,i}(s)\Big] \ge \sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_h(s,a)\big] - H\sum_{h=1}^H\mathbb E_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[f^{(n)}_h(s,a)\big] - 2H\bar\varepsilon \ge \sum_{h=1}^{H-1}\mathbb E_{(\tilde s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\hat\beta^{(n)}_h(\tilde s,\tilde a) - \min\Big\{c\alpha^{(n)}\|\hat\phi^{(n)}_h(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},\,H\Big\}\Big] - H\sqrt{A\zeta^{(n)}} - 2H\bar\varepsilon = -H\sqrt{A\zeta^{(n)}} - 2H\bar\varepsilon,$$
which proves the inequality.

Lemma C.13 (Pessimism). Consider an episode $n\in[N]$ and set $\alpha^{(n)} = \Theta\big(H\sqrt{nA^2\zeta^{(n)} + d\lambda}\big)$. When the event $\mathcal E$ holds, we have
$$\underline v^{(n)}_i(s) - v^{\pi^{(n)}}_i(s) \le H\sqrt{A\zeta^{(n)}},\qquad\forall n\in[N],\ i\in[M].$$
Proof. Let $f^{(n)}_h(s,a) = \frac1H\big(\hat P^{(n)}_h - P^\star_h\big)V^{\pi^{(n)}}_{h+1,i}(s,a)$. Note that, by definition, $\frac1HV^{\pi^{(n)}}_{h+1,i}(s)$ is bounded by $1$, and
$$\frac1HV^{\pi^{(n)}}_{h+1,i}(s) = \mathbb E_{a\sim\pi^{(n)}_{h+1}(s)}\Big[\frac{r_{h+1,i}(s,a)}{H} + \frac1HP^\star_{h+1}V^{\pi^{(n)}}_{h+2,i}(s,a)\Big] \in \mathcal F_{2,h},$$
where we use the result of Lemma C.1: $\frac1HP^\star_{h+1}V^{\pi^{(n)}}_{h+2,i}(s,a)$ is a linear function of $\phi^\star_{h+1}$ whose weight has 2-norm at most $\sqrt d$. Then, according to the event $\mathcal E$, we have
$$\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]\le\zeta^{(n)},\quad \mathbb E_{(s,a)\sim\tilde\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]\le\zeta^{(n)},\qquad\forall n\in[N],\ h\in[H],$$
$$\|\phi_h(s,a)\|_{(\hat\Sigma^{(n)}_{h,\phi_h})^{-1}} = \Theta\big(\|\phi_h(s,a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\phi_h}}\big),\qquad\forall n\in[N],\ h\in[H],\ \phi_h\in\Phi_h.$$
A direct consequence of the event $\mathcal E$ is that there exists an absolute constant $c$ such that
$$\hat\beta^{(n)}_h(s,a) = \min\Big\{\alpha^{(n)}\|\hat\phi^{(n)}_h(s,a)\|_{(\hat\Sigma^{(n)}_{h,\hat\phi^{(n)}_h})^{-1}},\,H\Big\} \ge \min\Big\{c\alpha^{(n)}\|\hat\phi^{(n)}_h(s,a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\hat\phi^{(n)}_h}},\,H\Big\},\qquad\forall n\in[N],\ h\in[H].$$
Again, we prove the following inequality by induction: E s∼d π P (n) ,h V (n) h,i (s) -V π (n) h,i (s) ≤ H h ′ =h E (s,a)∼d π (n) P (n) ,h ′ - β(n) h ′ (s, a) + Hf (n) h ′ (s, a) , ∀h ∈ [H]. • When h = H, we have E s∼d π (n) P (n) ,H V (n) H,i (s) -V π (n) H,i (s) =E (s,a)∼d π (n) P (n) ,H Q (n) H,i (s, a) -Q π (n) H,i (s, a) =E (s,a)∼d π (n) P (n) ,H - β(n) H (s, a) ≤E (s,a)∼d π (n) P (n) ,H - β(n) H (s, a) + Hf (n) H (s, a) • Suppose the statement is true for h + 1, then for step h, we have E s∼d π (n) P (n) ,h V (n) h,i (s) -V π (n) h,i (s) =E (s,a)∼d π (n) P (n) ,h Q (n) h,i (s, a) -Q π (n) h,i (s, a) =E (s,a)∼d π (n) P (n) ,h - β(n) h (s, a) + P (n) h V (n) h+1,i (s, a) -P ⋆ h V π (n) h+1,i (s, a) =E (s,a)∼d π (n) P (n) ,h - β(n) h (s, a) + P (n) h V (n) h+1,i -V π (n) h+1,i (s, a) + P (n) h -P ⋆ h V π (n) h+1,i (s, a) =E (s,a)∼d π (n) P (n) ,h - β(n) h (s, a) + P (n) h -P ⋆ h V π (n) h+1,i (s, a) + E s∼d π (n) P (n) ,h+1 V (n) h+1,i -V π (n) h+1,i (s) ≤E (s,a)∼d π (n) P (n) ,h - β(n) h (s, a) + Hf (n) h (s, a) + E s∼d π (n) P (n) ,h+1 V (n) h+1,i -V π (n) h+1,i (s) ≤ H h ′ =h E (s,a)∼d π (n) P (n) ,h ′ - β(n) h ′ (s, a) + Hf (n) h ′ (s, a) . where the last row uses the induction assumption. The remaining steps are exactly the same as the proof in Lemma C.11 or Lemma C.12, we may prove E (s,a)∼d π (n) P (n) ,1 min f (n) 1 (s, a), 1 ≤ Aζ (n) , E (s,a)∼d π (n) P (n) ,h f (n) h (s, a) ≤ E (s,ã)∼d π (n) P (n) ,h-1   min    cα (n) H φ(n) h-1 (s, ã) Σ -1 n,ρ (n) h-1 , φ(n) h-1 , 1      , ∀h ≥ 2. Combining all things together, we get v (n) i -v π (n) i =E s∼d1 V (n) 1,i (s) -V π (n) 1,i (s) ≤ H h=1 E (s,a)∼d π (n) P (n) ,h - β(n) h (s, a) + Hf (n) h (s, a) ≤ H-1 h=1 E (s,a)∼d π (n) P (n) ,h   - β(n) h (s, a) + min    cα (n) φ(n) h (s, ã) Σ -1 n,ρ (n) h , φ(n) h , H      + H Aζ (n) ≤H Aζ (n) , which has finished the proof. Lemma C.14. 
For the model-free algorithm, suppose $N$ is large enough. When we pick
$$\lambda = \Theta\Big(d\log\frac{NH|\Phi|}{\delta}\Big),\quad \zeta^{(n)} = \Theta\Big(\frac{d^2M}{n}\log\frac{dNHML|\Phi|}{\bar\varepsilon\delta}\Big),\quad L = \Theta(NHAMd),\quad \bar\varepsilon = \frac{1}{2HN},\quad \alpha^{(n)} = \Theta\big(H\sqrt{nA^2\zeta^{(n)} + d\lambda}\big),$$
with probability $1-\delta$ we have
$$\sum_{n=1}^N\Delta^{(n)} \lesssim H^3d^2A^{\frac32}N^{\frac12}M^{\frac12}\log\frac{dNHAM|\Phi|}{\delta}.$$
Proof. With our choice of $\lambda$ and $\zeta^{(n)}$, by Lemma C.6 the event $\mathcal E$ holds with probability $1-\delta$. Furthermore, with a proper choice of the absolute constants, we have
$$\alpha^{(n)} = \Theta\Big(H\sqrt{d^2A^2M\log\tfrac{dNHML|\Phi|}{\delta} + d^2\log\tfrac{NH|\Phi|}{\delta}}\Big) \le O\Big(HdA\sqrt{M\log\tfrac{dNHAM|\Phi|}{\delta}}\Big) \le O(NHAMd) \le L.$$
Let $f^{(n)}_h(s,a) = \frac{1}{2H^2}\big(\hat P^{(n)}_h - P^\star_h\big)\big(\bar V^{(n)}_{h+1,i} - \underline V^{(n)}_{h+1,i}\big)(s,a)$. We first verify $\frac{1}{2H^2}\big(\bar V^{(n)}_{h+1,i} - \underline V^{(n)}_{h+1,i}\big)\in\mathcal F_{4,h}$. By definition, we have
$$\frac{1}{2H^2}\big(\bar V^{(n)}_{h+1,i} - \underline V^{(n)}_{h+1,i}\big) = \mathbb E_{a\sim\pi^{(n)}_h(s)}\Big[\frac{1}{H^2}\hat\beta^{(n)}_{h+1}(s,a) + \frac{1}{2H^2}P^\star_{h+1}\big(\bar V^{(n)}_{h+2,i} - \underline V^{(n)}_{h+2,i}\big)(s,a)\Big].$$
The first term is equal to $\frac{1}{H^2}\min\big\{\alpha^{(n)}\sqrt{\hat\phi^{(n)}_h(s,a)^\top(\hat\Sigma^{(n)}_h)^{-1}\hat\phi^{(n)}_h(s,a)},\,H\big\}$, which is exactly the same as that in the definition of $\mathcal F_{4,h}$ (note that we use the property $\alpha^{(n)}\le L$ for all $n\in[N]$). For the second term, note that $0\le\frac{1}{2H^2}\big(\bar V^{(n)}_{h,i} - \underline V^{(n)}_{h,i}\big)\le1$ for all $h$; therefore, by Lemma C.1, $\frac{1}{2H^2}P^\star_{h+1}\big(\bar V^{(n)}_{h+2,i} - \underline V^{(n)}_{h+2,i}\big)(s,a)$ is a linear function of $\phi^\star_{h+1}$ whose weight has 2-norm at most $\sqrt d$. Combining the above arguments, we conclude $\frac{1}{2H^2}\big(\bar V^{(n)}_{h+1,i} - \underline V^{(n)}_{h+1,i}\big)\in\mathcal F_{4,h}$. According to the definition of the event $\mathcal E$, we have
$$\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[f^{(n)}_h(s,a)^2\big]\le\zeta^{(n)},\qquad \|\phi_h(s,a)\|_{(\hat\Sigma^{(n)}_{h,\phi_h})^{-1}} = \Theta\big(\|\phi_h(s,a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\phi_h}}\big),\qquad\forall n\in[N],\ \phi_h\in\Phi_h,\ h\in[H].$$
By definition, we have $\Delta^{(n)} = \max_{i\in[M]}\big(\bar v^{(n)}_i - \underline v^{(n)}_i\big) + 2H\sqrt{A\zeta^{(n)}}$.
For each fixed i ∈ [M ], h ∈ [H] and n ∈ [N ], we have E s∼d π (n) P ⋆ ,h V (n) h,i (s) -V (n) h,i (s) =E s∼d π (n) P ⋆ ,h D π (n) h Q (n) h,i (s) -D π (n) h Q (n) h,i (s) =E (s,a)∼d π (n) P ⋆ ,h Q (n) h,i (s, a) -Q (n) h,i (s, a) =E (s,a)∼d π (n) P ⋆ ,h 2 β(n) h + P (n) h V (n) h+1,i -V (n) h+1,i (s, a) =E (s,a)∼d π (n) P ⋆ ,h 2 β(n) h + P (n) h -P ⋆ h V (n) h+1,i -V (n) h+1,i (s, a) + E s∼d π (n) P ⋆ ,h+1 V (n) h+1,i (s) -V (n) h+1,i (s) ≤E (s,a)∼d π (n) P ⋆ ,h 2 β(n) h (s, a) + 2H 2 f (n) h (s, a) + E s∼d π (n) P ⋆ ,h+1 V (n) h+1,i (s) -V (n) h+1,i (s) ≤ . . . ≤2 H h ′ =h E (s,a)∼d π (n) P ⋆ ,h ′ β(n) h ′ (s, a) + H 2 f (n) h ′ (s, a) , where the last inequality is calculated using induction. In particular, E s∼d π (n) P ⋆ ,1 V (n) 1,i (s) -V (n) 1,i (s) ≤ 2 H h=1 E (s,a)∼d π (n) P ⋆ ,h β(n) h (s, a) (a) +2H 2 H h=1 E (s,a)∼d π (n) P ⋆ ,h f (n) h (s, a) (b) . First, we calculate the first term (a) in Inequality equation 20. Following Lemma C.10 and noting the bonus β(n) h is O(H), we have H h=1 E (s,a)∼d π (n) P ⋆ ,h β(n) h (s, a) ≲ H h=1 E (s,a)∼d π (n) P ⋆ ,h   min   α (n) φ(n) h (s, a) Σ -1 n,ρ (n) h , φ(n) h , H     (From equation 19) ≲ H-1 h=1 E (s,ã)∼d π (n) P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h nA α (n) 2 E (s,a)∼ρ (n) h   φ(n) h (s, a) 2 Σ -1 n,ρ (n) h , φ(n) h   + H 2 dλ + A α (n) 2 E (s,a)∼ρ (n) 1   φ(n) 1 (s, a) 2 Σ -1 n,ρ (n) 1 , φ(n) 1   . Note that we use the fact that B = H when applying Lemma D.3. In addition, we have nE (s,a)∼ρ (n) h   φ(n) h (s, a) 2 Σ -1 n,ρ (n) h , φ(n) h   =nTr E (s,a)∼ρ (n) h φ(n) h (s, a) φ(n) h (s, a) ⊤ nE (s,a)∼ρ (n) h φ(n) h (s, a) φ(n) h (s, a) ⊤ + λI d -1 ≤d. Then, H h=1 E (s,a)∼d π (n) Second, we calculate the term (b) in inequality equation 23. 
Following Lemma D.3 and noting f (n) h (s, a) 2 is upper-bounded by 1 (i.e., B = 1 in Lemma D.3), we have H h=1 E (s,a)∼d π (n) P ⋆ ,h [f (n) h (s, a)] ≤ H-1 h=1 E (s,ã)∼d π (n) P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h nAE (s,a)∼ρ (n) h f (n) h (s, a) 2 + dλ + AE (s,a)∼ρ (n) h f (n) 1 (s, a) 2 ≤ H-1 h=1 E (s,ã)∼d π (n) P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h nAζ (n) + dλ + Aζ (n) ≲ α (n) H H-1 h=1 E (s,ã)∼d πn P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h + Aζ (n) , where in the second inequality, we use n) , and in the last line, recall nAζ (n) + dλ ≲ α (n) /H. Then, by combining the above calculation of the term (a) and term (b) in inequality equation 23, we have: E (s,a)∼ρ (n) h f (n) h (s, a) 2 ≤ ζ ( v (n) i -v (n) i =E s∼d π (n) P ⋆ ,1 V (n) 1,i (s) -V (n) 1,i (s) ≲ H-1 h=1   E (s,ã)∼d π (n) P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h dA α (n) 2 + H 2 dλ + dA α (n) 2 n   + H 2 H-1 h=1 α (n) H E (s,ã)∼d π (n) P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h + Aζ (n) . Taking maximum over i on both sides and use the definition of ∆ (n) , we get ∆ (n) = max i∈[M ] v (n) i -v (n) i + 2H Aζ (n) ≲ H-1 h=1   E (s,ã)∼d π (n) P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h dA α (n) 2 + H 2 dλ + dA α (n) 2 n   + H 2 H-1 h=1 α (n) H E (s,ã)∼d π (n) P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h + Aζ (n) . Hereafter, we take the dominating term out. Note that N n=1 E (s,ã)∼d π (n) P ⋆ ,h ∥ϕ ⋆ h (s, ã)∥ Σ -1 n,γ (n) h ,ϕ ⋆ h ≤ N N n=1 E (s,ã)∼d π (n) P ⋆ ,h ϕ ⋆ h (s, ã) ⊤ Σ -1 n,γ (n) h ,ϕ ⋆ h ϕ ⋆ h (s, ã) (CS inequality) ≲ N log det N n=1 E (s,ã)∼d π (n) P ⋆ ,h [ϕ ⋆ h (s, ã)ϕ ⋆ h (s, ã) ⊤ ] -log det(λI d ) (Lemma E.2) ≤ dN log 1 + N dλ . (Potential function bound, Lemma E.3 noting ∥ϕ ⋆ h (s, a)∥ 2 ≤ 1 for any (s, a).) Finally, N n=1 ∆ (n) ≲H   dN log 1 + N d dA α (N ) 2 + H 2 dλ + N n=1 dA α (n) 2 n   + H 3 1 H dN log 1 + N dλ α (N ) + N n=1 Aζ (n) + 2HN ε ≲H 2 d N A log 1 + N dλ α (N ) (Some algebra. We take the dominating term out. 
Note that $\alpha^{(n)}$ is increasing in $n$.) Thus
$$\sum_{n=1}^N\Delta^{(n)} \lesssim H^3d^2A^{\frac32}N^{\frac12}M^{\frac12}\log\frac{dNHAM|\Phi|}{\delta}.$$
This concludes the proof.

Proof of Theorem 4.2. For any fixed episode $n$ and agent $i$, by Lemma C.11, Lemma C.12 and Lemma C.13 we have
$$v^{\dagger,\pi^{(n)}_{-i}}_i - v^{\pi^{(n)}}_i \quad\Big(\text{or } \max_{\omega\in\Omega_i}v^{\omega\circ\pi^{(n)}}_i - v^{\pi^{(n)}}_i\Big) \le \bar v^{(n)}_i - \underline v^{(n)}_i + 2H\sqrt{A\zeta^{(n)}} + 2H\bar\varepsilon \le \Delta^{(n)} + 2H\bar\varepsilon.$$
Taking the maximum over $i$ on both sides, we have
$$\max_{i\in[M]}\Big(v^{\dagger,\pi^{(n)}_{-i}}_i - v^{\pi^{(n)}}_i\Big) \quad\text{or}\quad \max_{i\in[M]}\max_{\omega\in\Omega_i}\Big(v^{\omega\circ\pi^{(n)}}_i - v^{\pi^{(n)}}_i\Big) \le \Delta^{(n)} + 2H\bar\varepsilon.$$
From Lemma C.14, with probability $1-\delta$ we can ensure
$$\sum_{n=1}^N\big(\Delta^{(n)} + 2H\bar\varepsilon\big) \lesssim H^3d^2A^{\frac32}N^{\frac12}M^{\frac12}\log\frac{dNHAM|\Phi|}{\delta}.$$
Therefore, according to Lemma E.4, when we pick $N$ to be $O\big(\frac{H^6d^4A^3M}{\varepsilon^2}\log^2\frac{HdAM|\Phi|}{\delta\varepsilon}\big)$, we have $\frac1N\sum_{n=1}^N(\Delta^{(n)} + 2H\bar\varepsilon)\le\varepsilon$. On the other hand, from equation 21, we have
$$\max_{i\in[M]}\Big(v^{\dagger,\hat\pi_{-i}}_i - v^{\hat\pi}_i\Big) \quad\text{or}\quad \max_{i\in[M]}\max_{\omega\in\Omega_i}\Big(v^{\omega\circ\hat\pi}_i - v^{\hat\pi}_i\Big) = \max_{i\in[M]}\Big(v^{\dagger,\pi^{(n^\star)}_{-i}}_i - v^{\pi^{(n^\star)}}_i\Big) \quad\text{or}\quad \max_{i\in[M]}\max_{\omega\in\Omega_i}\Big(v^{\omega\circ\pi^{(n^\star)}}_i - v^{\pi^{(n^\star)}}_i\Big) \le \Delta^{(n^\star)} + 2H\bar\varepsilon = \min_{n\in[N]}\Delta^{(n)} + 2H\bar\varepsilon \le \frac1N\sum_{n=1}^N\big(\Delta^{(n)} + 2H\bar\varepsilon\big) \le \varepsilon,$$
which completes the proof.

Define the events
$$\mathcal E_1:\ \forall n\in[N],\ h\in[H],\ i\in[M],\ \rho\in\{\rho^{(n)}_h,\tilde\rho^{(n)}_h\},\quad \mathbb E_\rho\Big[\big\|\hat P^{(n)}_{h,i}(\cdot\,|\,s[Z_i],a_i) - P^\star_{h,i}(\cdot\,|\,s[Z_i],a_i)\big\|_1^2\Big] \le \zeta^{(n)},$$
$$\mathcal E_2:\ \forall n\in[N],\ h\in[H],\ i\in[M],\ \tilde\phi_{h,i}\in\tilde\Phi_{h,i},\quad \|\tilde\phi_{h,i}(s,a)\|_{(\hat\Sigma^{(n)}_{h,\tilde\phi_{h,i}})^{-1}} = \Theta\big(\|\tilde\phi_{h,i}(s,a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_h,\tilde\phi_{h,i}}}\big),$$
and $\mathcal E := \mathcal E_1\cap\mathcal E_2$. The following lemma shows that the event $\mathcal E$ holds with high probability under proper choices of the parameters.

Lemma D.1. When $\hat P^{(n)}_h$ is computed using Alg. 2, if we set
$$\lambda = \Theta\Big(Ld_L\log\frac{NHM|\Phi|}{\delta}\Big),\qquad \zeta^{(n)} = \Theta\Big(\frac1n\log\frac{|\mathcal M|HNM}{\delta}\Big),$$
then $\mathcal E$ holds with probability at least $1-\delta$. The proof of Lemma D.1 follows a similar procedure to that of Lemma B.2, with minor changes in notation as well as some modifications to the union bound.
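The $\sqrt N$ rate in the bound above ultimately comes from the elliptical potential argument (Lemmas E.2 and E.3): the summed squared $\Sigma^{-1}$-norms of the features grow only logarithmically with the number of episodes. A numeric check under the assumptions $\|\phi\|_2\le1$ and $\lambda=1$ (the random unit-capped features are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, lam = 5, 500, 1.0
Sigma = lam * np.eye(d)
total = 0.0
for _ in range(N):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))          # enforce ||phi||_2 <= 1
    total += phi @ np.linalg.solve(Sigma, phi)    # ||phi||^2 in the Sigma^{-1} norm
    Sigma += np.outer(phi, phi)

# Potential bound: sum_n ||phi_n||^2_{Sigma_{n-1}^{-1}} <= 2 d log(1 + N / (d lam))
assert total <= 2 * d * np.log(1 + N / (d * lam))
```

The bound holds deterministically here: each summand is at most 1 (since $\Sigma\succeq I$), $u\le2\log(1+u)$ on $[0,1]$, and the log-det telescopes to $d\log(1+N/(d\lambda))$ by AM–GM on the eigenvalues.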

D.2 STATISTICAL GUARANTEES

Lemma D.2 (One-step back inequality for the learned model). Suppose the event E holds. Consider a set of functions {g h } H h=1 that satisfies g h ∈ S[Z i ] × A i → R + , s.t. ∥g h ∥ ∞ ≤ B. For a given policy π, we have E (s,a)∼d π P (n) ,h [g h (s[Zi], ai)] ≤              ÃE (s,a)∼ρ (n) 1 [g 2 1 (s[Zi], ai)], h = 1 E (s,ã)∼d π P (n) ,h-1   min      Ã φ(n) h-1,i (s, ã) Σ -1 n,ρ (n) h-1 , φ(n) h-1,i nE (s,a)∼ ρ(n) h [g 2 h (s[Zi], ai)] + B 2 λd L + nB 2 ζ (n) , B         , h ≥ 2 where φ(n) h,i (s, a) := j∈Zi φ(n) h-1,j (s[Z j ], a j ), Σ n,ρ (n) h , φ(n) h,i = nE (s,a)∼ρ (n) h φ(n) h,i (s, a) φ(n) h,i (s, a) ⊤ + λI d |Z i | . Proof. For step h = 1, we have E (s,ai)∼d π P (n) ,1 [g 1 (s[Z i ], a i )] =E s∼d1,ai∼π1(s) [g 1 (s[Z i ], a i )] ≤ max (s,ai) d 1 (s)π 1 (a i |s) ρ (n) 1 (s, a i ) E (s ′ ,a ′ i )∼ρ (n) 1 [g 2 1 (s ′ [Z i ], a ′ i )] = max (s,ai) d 1 (s)π 1 (a i |s) d 1 (s)u A (a i ) E (s ′ ,a ′ i )∼ρ (n) 1 [g 2 1 (s ′ [Z i ], a ′ i )] ≤ ÃE (s,ai)∼ρ (n) 1 [g 2 1 (s[Z i ], a i )]. 
For h ≥ 2, we observe the following one-step-back decomposition: E (s,ã i )∼d π P (n) ,h [g h (s[Zi], ai)] =E (s,ã)∼d π P (n) ,h-1 ,s∼ P (n) h-1 (s,ã),a i ∼π h (s) [g h (s[Zi], ai)] =E (s,ã)∼d π P (n) ,h-1   S M j=1 φ(n) h-1,j (s[Zj], ãj) ⊤ ŵ(n) h-1,j (sj) a i ∈A i π h (ai|s)g h (s[Zi], ai)ds   =E (s,ã)∼d π P (n) ,h-1   min    S M j=1 φ(n) h-1,j (s[Zj], ãj) ⊤ ŵ(n) h-1,j (sj) a i ∈A i π h (ai|s)g h (s[Zi], ai)ds, B      ≤E (s,ã)∼d π P (n) ,h-1   min    Ã S M j=1 φ(n) h-1,j (s[Zj], ãj) ⊤ ŵ(n) h-1,j (sj) 1 |Ai| a i ∈A i g h (s[Zi], ai)ds, B      =E (s,ã)∼d π P (n) ,h-1   min    Ã S[Z i ] j∈Z i φ(n) h-1,j (s[Zj], ãj) ⊤ ŵ(n) h-1,j (sj) 1 |Ai| a i ∈A i g h (s[Zi], ai)ds[Zi], B      =E (s,ã)∼d π P (n) ,h-1   min    Ã S[Z i ]   j∈Z i φ(n) h-1,j (s[Zj], ãj)   ⊤   j∈Z i ŵ(n) h-1,j (sj)   1 |Ai| a i ∈A i g h (s[Zi], ai)ds[Zi], B      =E (s,ã)∼d π P (n) ,h-1   min    Ã S[Z i ] φ(n) h-1,i (s, ã) ⊤   j∈Z i ŵ(n) h-1,j (sj)   1 |Ai| a i ∈A i g h (s[Zi], ai)ds[Zi], B      ≤E (s,ã)∼d π P (n) ,h-1 min Ã φ(n) h-1,i (s, ã) Σ -1 n,ρ (n) h-1 , φ(n) h-1,i S[Z i ] 1 |Ai| a i ∈A i   j∈Z i ŵ(n) h-1,j (sj)   g h (s[Zi], ai)ds[Zi] Σ n,ρ (n) h-1 , φ(n) h-1,i , B . Then, S[Zi] 1 |A i | ai∈A   j∈Zi ŵ(n) h-1,j (s j )   g h (s[Z i ], a i )ds[Z i ] 2 Σ n,ρ (n) h-1 , φ(n) h-1 ,i ≤nE (s,ã)∼ρ (n) h-1      S[Zi] 1 |A i | ai∈A j∈Zi ŵ(n) h-1,j (s j ) ⊤ φ(n) h-1,j (s, ãj ) g h (s[Z i ], a i )ds[Z i ]   2    + B 2 λd L ( 1 |Ai| ai∈Ai g h (s[Z i ], a i ) ∞ ≤ B and ŵ(n) h-1,i (s i ) 2 ≤ √ d.) =nE (s,ã)∼ρ (n) h-1 E s∼ P (n) h-1 (s,ã),ai∼U (Ai) [g h (s[Z i ], a i )] 2 + B 2 λd L ≤nE (s,ã)∼ρ (n) h-1 E s∼P ⋆ h-1 (s,ã),ai∼U (Ai) [g h (s[Z i ], a i )] 2 + B 2 λd L + nB 2 ξ (n) (Event E) ≤nE (s,ã)∼ρ (n) h-1 ,s∼P ⋆ h-1 (s,ã),ai∼U (Ai) g 2 h (s[Z i ], a i ) + B 2 λd L + B 2 nξ (n) . (Jensen) =nE (s,ai)∼ ρ(n) h g 2 h (s[Z i ], a i ) + B 2 λd L + B 2 nζ (n) . 
(definition of $\tilde\rho^{(n)}_h$). Combining the above results together, we get
$$\mathbb E_{(s,\tilde a_i)\sim d^\pi_{\hat P^{(n)},h}}\big[g_h(s[Z_i],a_i)\big] \le \mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{\hat P^{(n)},h-1}}\Big[\min\Big\{\tilde A\,\|\tilde\phi^{(n)}_{h-1,i}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\tilde\phi^{(n)}_{h-1,i}}}\sqrt{n\,\mathbb E_{(s,a_i)\sim\tilde\rho^{(n)}_h}\big[g_h^2(s[Z_i],a_i)\big] + B^2\lambda d_L + B^2n\zeta^{(n)}},\; B\Big\}\Big],$$
which completes the proof.

Lemma D.3 (One-step-back inequality for the true model). Consider a set of functions $\{g_h\}_{h=1}^H$ with $g_h:\mathcal S[Z_i]\times\mathcal A_i\to\mathbb R_+$ and $\|g_h\|_\infty\le B$. Then for any policy $\pi$, we have
$$\mathbb E_{(s,a_i)\sim d^\pi_{P^\star,h}}\big[g_h(s[Z_i],a_i)\big] \le \begin{cases}\sqrt{\tilde A\,\mathbb E_{(s,a_i)\sim\rho^{(n)}_1}\big[g_1^2(s[Z_i],a_i)\big]}, & h=1,\\[3pt] \tilde A\,\mathbb E_{(\tilde s,\tilde a)\sim d^\pi_{P^\star,h-1}}\Big[\|\tilde\phi^\star_{h-1,i}(\tilde s,\tilde a)\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h-1},\tilde\phi^\star_{h-1,i}}}\Big]\sqrt{n\,\mathbb E_{(s,a)\sim\rho^{(n)}_h}\big[g_h^2(s[Z_i],a_i)\big] + B^2\lambda d_L}, & h\ge2,\end{cases}$$
where $\tilde\phi^\star_{h,i}(s,a) := \bigotimes_{j\in Z_i}\phi^\star_{h-1,j}(s[Z_j],a_j)$ and $\Sigma_{n,\gamma^{(n)}_h,\tilde\phi^\star_{h,i}} = n\,\mathbb E_{(s,a)\sim\gamma^{(n)}_h}\big[\tilde\phi^\star_{h,i}(s,a)\tilde\phi^\star_{h,i}(s,a)^\top\big] + \lambda I_{d^{|Z_i|}}$.

Proof. For step $h=1$, we have
$$\mathbb E_{(s,a)\sim d^\pi_{P^\star,1}}\big[g_1(s[Z_i],a_i)\big] = \mathbb E_{s\sim d_1,a_i\sim\pi_1(s)}\big[g_1(s[Z_i],a_i)\big] \le \sqrt{\max_{(s,a_i)}\frac{d_1(s)\pi_1(a_i|s)}{\rho^{(n)}_1(s,a_i)}}\sqrt{\mathbb E_{(s',a'_i)\sim\rho^{(n)}_1}\big[g_1^2(s'[Z_i],a'_i)\big]} = \sqrt{\max_{(s,a_i)}\frac{d_1(s)\pi_1(a_i|s)}{d_1(s)u_{\mathcal A_i}(a_i)}}\sqrt{\mathbb E_{(s',a'_i)\sim\rho^{(n)}_1}\big[g_1^2(s'[Z_i],a'_i)\big]} \le \sqrt{\tilde A\,\mathbb E_{(s,a_i)\sim\rho^{(n)}_1}\big[g_1^2(s[Z_i],a_i)\big]}.$$
For steps $h = 2, \dots$
, H -1, we observe the following one-step-back decomposition: E (s,ãi)∼d π P ⋆ ,h [g h (s[Z i ], a i )] =E (s,ã)∼d π P ⋆ ,h-1 ,s∼P ⋆ h-1 (s,ã),ai∼π h (s) [g h (s[Z i ], a i )] =E (s,ã)∼d π P ⋆ ,h-1      j∈Zi ϕ ⋆ h-1,j (s[Z j ], ãj )   ⊤ S ai∈Ai   j∈Zi w ⋆ h-1,j (s j )   π h (a i |s)g h (s[Z i ], a i )ds    ≤ ÃE (s,ã)∼d π P ⋆ ,h-1      j∈Zi ϕ ⋆ h-1,j (s[Z j ], ãj )   ⊤ S ai∈Ai 1 |A i |   j∈Zi w ⋆ h-1,j (s j )   g h (s[Z i ], a i )ds[Z i ]    ≤ ÃE (s,ã)∼d π P ⋆ ,h-1      j∈Zi ϕ ⋆ h-1,j (s[Z j ], ãj ) Σ -1 n,γ (n) h-1 , φ⋆ h-1,i      • S ai∈Ai 1 |A i |   j∈Zi w ⋆ h-1,j (s j )   g h (s[Z i ], a i )ds[Z i ] Σ n,γ (n) h-1 , φ⋆ h-1 ,i . Then, S a i ∈A i 1 |Ai|   j∈Z i w ⋆ h-1,j (sj)   g h (s[Zi], ai)ds[Zi] 2 Σ n,γ (n) h-1 , φ⋆ h-1,i ≤nE (s,ã)∼γ (n) h-1      S a i ∈A i 1 |Ai|   j∈Z i w ⋆ h-1,j (sj)   ⊤   j∈Z i ϕ ⋆ h-1,j (s[Zj], ãj)   g h (s[Zi], ai)ds[Zi]   2    + B 2 λd L (Use the assumption a i ∈A i 1 |A i | g h (s[Zi], ai) ∞ ≤ B and w ⋆ h-1,i (si) 2 ≤ √ d.) =nE (s,ã)∼γ (n) h-1 E s∼P ⋆ h-1 (s,ã),a i ∼U (A i ) [g h (s[Zi], ai)] 2 + B 2 λd L ≤nE (s,ã)∼γ (n) h-1 ,s∼P ⋆ h-1 (s,ã),a i ∼U (A i ) g 2 h (s[Zi], ai) + B 2 λd L (Jensen) ≤nE (s,a i )∼ρ (n) h g 2 h (s[Zi], ai) + B 2 λd L , (Definition of ρ (n) h ) which has finished the proof. Lemma D.4 (One-step back inequality for the true model). Consider a set of functions {g h } H h=1 that satisfies g h ∈ S[∪ j∈Zi Z j ] × A[Z i ] → R, s.t. ∥g h ∥ ∞ ≤ B. Then for any policy π, we have E (s,a)∼d π P ⋆ ,h [g h (s[∪ j∈Zi Z j ], a[Z i ])] ≤            ÃL E (s,a)∼ρ (n) 1 [g 2 1 (s[∪ j∈Zi Z j ], a[Z i ])], h = 1, ÃL E (s,ã)∼d π P ⋆ ,h-1   φ⋆ h-1,i (s, ã) Σ -1 n,γ (n) h-1 , φ⋆ h-1,i   nE (s,a)∼ρ (n) h [g 2 h (s[∪ j∈Zi Z j ], a[Z i ])] + B 2 λd L 2 , h ≥ 2, where φ⋆ h,i (s, a) := k∈∪ j∈Z i Zj ϕ ⋆ h-1,j (s[Z k ], a k ), and Σ n,γ (n) h , φ⋆ h,i = nE (s,a)∼γ (n) h φ⋆ h,i (s, a) φ⋆ h,i (s, a) ⊤ + λI d |∪ j∈Z i Z j | . Proof. 
This lemma can be proved using similar steps as those in the proof of Lemma D.3, noting that in this case the dimension of $\tilde\varphi^{\star}_{h,i}$ is at most $d^{L^2}$.

Lemma D.5 (Optimism for NE and CCE). Consider an episode $n\in[N]$ and set $\alpha^{(n)}=\Theta\big(H\sqrt{\tilde A\,(n\zeta^{(n)}+d^{L}\lambda)}\big)$. When the event $\mathcal E$ holds and the policy $\pi^{(n)}$ is computed by solving NE or CCE, we have
\[
\hat v^{(n)}_{i}(s)-v^{\dagger,\pi^{(n)}_{-i}}_{i}(s)\ge -HM\sqrt{\tilde A\zeta^{(n)}},\qquad\forall n\in[N],\ i\in[M].
\]
Proof. Denote $\hat\mu^{(n)}_{h,i}(\cdot|s):=\arg\max_{\mu}\mathbb D_{\mu\times\pi^{(n)}_{h,-i}}Q^{\dagger,\pi^{(n)}_{-i}}_{h,i}(s)$ and let $\tilde\pi^{(n)}_{h}=\hat\mu^{(n)}_{h,i}\times\pi^{(n)}_{h,-i}$. Let $f^{(n)}_{h}(s,a)=\big\|\hat P^{(n)}_{h}(\cdot|s,a)-P^{\star}_{h}(\cdot|s,a)\big\|_{1}$ and $f^{(n)}_{h,i}(s[Z_i],a_i)=\big\|\hat P^{(n)}_{h,i}(\cdot|s[Z_i],a_i)-P^{\star}_{h,i}(\cdot|s[Z_i],a_i)\big\|_{1}$. Then, according to the event $\mathcal E$, we have
\[
\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]\le\zeta^{(n)},\quad
\mathbb{E}_{(s,a)\sim\hat\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]\le\zeta^{(n)},\qquad\forall n\in[N],\ h\in[H],\ i\in[M],
\]
\[
\big\|\tilde\phi_{h,i}(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\tilde\phi_{h,i}})^{-1}}=\Theta\Big(\big\|\tilde\phi_{h,i}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi_{h,i}}}\Big),\qquad\forall n\in[N],\ h\in[H],\ \tilde\phi_{h,i}\in\tilde\Phi_{h,i},\ i\in[M].
\]
A direct conclusion of the event $\mathcal E$ is that we can find an absolute constant $c$ such that
\[
\hat\beta^{(n)}_{h}(s,a)=\sum_{i=1}^{M}\min\Big\{\alpha^{(n)}\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\tilde\phi^{(n)}_{h,i}})^{-1}},\,H\Big\}
\ge\sum_{i=1}^{M}\min\Big\{c\,\alpha^{(n)}\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,i}}},\,H\Big\},\qquad\forall n\in[N],\ h\in[H].
\]
Next, similar to the proof of Lemma B.5, we may prove
\[
\mathbb{E}_{s\sim d_{1}}\Big[\hat V^{(n)}_{1,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{1,i}(s)\Big]\ge\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_{h}(s,a)\big]-H\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_{h}(s,a),1\}\big].
\]
For the second term, note that we have the relation $\min\{f^{(n)}_{h}(s,a),1\}\le\sum_{i=1}^{M}\min\{f^{(n)}_{h,i}(s[Z_i],a_i),1\}$. By Lemma D.2, we have for $h=1$,
\[
\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},1}}\big[\min\{f^{(n)}_{1,i}(s[Z_i],a_i),1\}\big]\le\sqrt{\tilde A\,\mathbb{E}_{(s,a)\sim\rho^{(n)}_{1}}\big[f^{(n)}_{1,i}(s[Z_i],a_i)^{2}\big]}\le\sqrt{\tilde A\zeta^{(n)}}.
\]
And for all $h\ge2$, we have
\[
\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_{h,i}(s[Z_i],a_i),1\}\big]
\lesssim\mathbb{E}_{(s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\sqrt{\tilde A}\,\big\|\tilde\phi^{(n)}_{h-1,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\tilde\phi^{(n)}_{h-1,i}}}\sqrt{n\,\mathbb{E}_{(s,a)\sim\hat\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]+d^{L}\lambda+n\zeta^{(n)}},\ 1\Big\}\Big]
\]
\[
\lesssim\mathbb{E}_{(s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\sqrt{\tilde A}\,\big\|\tilde\phi^{(n)}_{h-1,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\tilde\phi^{(n)}_{h-1,i}}}\sqrt{n\zeta^{(n)}+d^{L}\lambda},\ 1\Big\}\Big].
\]
Note that here we use $\min\{f^{(n)}_{h,i}(s[Z_i],a_i),1\}\le1$ (so $B=1$ in Lemma D.2), $\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}]\le\zeta^{(n)}$ and $\mathbb{E}_{(s,a)\sim\hat\rho^{(n)}_{h}}[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}]\le\zeta^{(n)}$. Then, according to our choice of $\alpha^{(n)}$, we get
\[
\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_{h,i}(s[Z_i],a_i),1\}\big]\le\mathbb{E}_{(s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\min\Big\{\frac{c\,\alpha^{(n)}}{H}\big\|\tilde\phi^{(n)}_{h-1,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\tilde\phi^{(n)}_{h-1,i}}},\ 1\Big\}\Big].
\]
Combining all things together,
\[
\hat v^{(n)}_{i}-v^{\dagger,\pi^{(n)}_{-i}}_{i}
=\mathbb{E}_{s\sim d_{1}}\Big[\hat V^{(n)}_{1,i}(s)-V^{\dagger,\pi^{(n)}_{-i}}_{1,i}(s)\Big]
\ge\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_{h}(s,a)\big]-H\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_{h}(s,a),1\}\big]
\]
\[
\ge\sum_{h=1}^{H-1}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\hat\beta^{(n)}_{h}(s,a)-\sum_{j=1}^{M}\min\Big\{c\,\alpha^{(n)}\big\|\tilde\phi^{(n)}_{h,j}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,j}}},\,H\Big\}\Big]-HM\sqrt{\tilde A\zeta^{(n)}}
\ge-HM\sqrt{\tilde A\zeta^{(n)}},
\]
which proves the inequality.

Lemma D.6 (Optimism for CE). Consider an episode $n\in[N]$ and set $\alpha^{(n)}=\Theta\big(H\sqrt{\tilde A\,(n\zeta^{(n)}+d^{L}\lambda)}\big)$. When the event $\mathcal E$ holds, we have
\[
\hat v^{(n)}_{i}(s)-\max_{\omega\in\Omega_{i}}v^{\omega\circ\pi^{(n)}}_{i}(s)\ge-HM\sqrt{\tilde A\zeta^{(n)}},\qquad\forall n\in[N],\ i\in[M].
\]
Proof. Denote $\hat\omega^{(n)}_{h,i}=\arg\max_{\omega_{h}\in\Omega_{h,i}}\mathbb D_{\omega_{h}\circ\pi^{(n)}_{h}}\max_{\omega\in\Omega_{i}}Q^{\omega\circ\pi^{(n)}}_{h,i}(s)$ and let $\tilde\pi^{(n)}_{h}=\hat\omega^{(n)}_{h,i}\circ\pi^{(n)}_{h}$. Let $f^{(n)}_{h}(s,a)=\big\|\hat P^{(n)}_{h}(\cdot|s,a)-P^{\star}_{h}(\cdot|s,a)\big\|_{1}$ and $f^{(n)}_{h,i}(s[Z_i],a_i)=\big\|\hat P^{(n)}_{h,i}(\cdot|s[Z_i],a_i)-P^{\star}_{h,i}(\cdot|s[Z_i],a_i)\big\|_{1}$. Then, according to the event $\mathcal E$, we have
\[
\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]\le\zeta^{(n)},\quad
\mathbb{E}_{(s,a)\sim\hat\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]\le\zeta^{(n)},\qquad\forall n\in[N],\ h\in[H],\ i\in[M],
\]
\[
\big\|\tilde\phi_{h,i}(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\tilde\phi_{h,i}})^{-1}}=\Theta\Big(\big\|\tilde\phi_{h,i}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi_{h,i}}}\Big),\qquad\forall n\in[N],\ h\in[H],\ \tilde\phi_{h,i}\in\tilde\Phi_{h,i},\ i\in[M].
\]
A direct conclusion of the event $\mathcal E$ is that we can find an absolute constant $c$ such that
\[
\hat\beta^{(n)}_{h}(s,a)=\min\Big\{\alpha^{(n)}\sum_{i=1}^{M}\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\tilde\phi^{(n)}_{h,i}})^{-1}},\,H\Big\}
\ge c\min\Big\{\alpha^{(n)}\sum_{i=1}^{M}\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,i}}},\,H\Big\},\qquad\forall n\in[N],\ h\in[H].
\]
Next, similar to the proof of Lemma B.6, we may prove
\[
\mathbb{E}_{s\sim d_{1}}\Big[\hat V^{(n)}_{1,i}(s)-\max_{\omega\in\Omega_{i}}V^{\omega\circ\pi^{(n)}}_{1,i}(s)\Big]\ge\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_{h}(s,a)\big]-H\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_{h}(s,a),1\}\big].
\]
Note that we can use exactly the same steps as in the proof of Lemma D.5 to bound the second term, and we get for $h=1$,
\[
\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},1}}\big[\min\{f^{(n)}_{1,i}(s[Z_i],a_i),1\}\big]\le\sqrt{\tilde A\zeta^{(n)}},
\]
and for all $h\ge2$,
\[
\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_{h,i}(s[Z_i],a_i),1\}\big]\le\frac{c\,\alpha^{(n)}}{H}\,\mathbb{E}_{(s,\tilde a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\big\|\tilde\phi^{(n)}_{h-1,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\tilde\phi^{(n)}_{h-1,i}}}\Big].
\]
Combining all things together,
\[
\hat v^{(n)}_{i}-\max_{\omega\in\Omega_{i}}v^{\omega\circ\pi^{(n)}}_{i}
=\mathbb{E}_{s\sim d_{1}}\Big[\hat V^{(n)}_{1,i}-\max_{\omega\in\Omega_{i}}V^{\omega\circ\pi^{(n)}}_{1,i}\Big]
\ge\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\hat\beta^{(n)}_{h}(s,a)\big]-H\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_{h}(s,a),1\}\big]
\]
\[
\ge\sum_{h=1}^{H-1}\mathbb{E}_{(s,a)\sim d^{\tilde\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\hat\beta^{(n)}_{h}(s,a)-\sum_{j=1}^{M}\min\Big\{c\,\alpha^{(n)}\big\|\tilde\phi^{(n)}_{h,j}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,j}}},\,H\Big\}\Big]-HM\sqrt{\tilde A\zeta^{(n)}}
\ge-HM\sqrt{\tilde A\zeta^{(n)}},
\]
which proves the inequality.

Lemma D.7 (Pessimism). Consider an episode $n\in[N]$ and set $\alpha^{(n)}=\Theta\big(H\sqrt{\tilde A\,(n\zeta^{(n)}+d^{L}\lambda)}\big)$. When the event $\mathcal E$ holds, we have
\[
\check v^{(n)}_{i}(s)-v^{\pi^{(n)}}_{i}(s)\le HM\sqrt{\tilde A\zeta^{(n)}},\qquad\forall n\in[N],\ i\in[M].
\]
Proof. Let $f^{(n)}_{h}(s,a)=\big\|\hat P^{(n)}_{h}(\cdot|s,a)-P^{\star}_{h}(\cdot|s,a)\big\|_{1}$ and $f^{(n)}_{h,i}(s[Z_i],a_i)=\big\|\hat P^{(n)}_{h,i}(\cdot|s[Z_i],a_i)-P^{\star}_{h,i}(\cdot|s[Z_i],a_i)\big\|_{1}$. Then, according to the event $\mathcal E$, we have
\[
\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]\le\zeta^{(n)},\quad
\mathbb{E}_{(s,a)\sim\hat\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]\le\zeta^{(n)},\qquad\forall n\in[N],\ h\in[H],\ i\in[M],
\]
\[
\big\|\tilde\phi_{h,i}(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\tilde\phi_{h,i}})^{-1}}=\Theta\Big(\big\|\tilde\phi_{h,i}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi_{h,i}}}\Big),\qquad\forall n\in[N],\ h\in[H],\ \tilde\phi_{h,i}\in\tilde\Phi_{h,i},\ i\in[M].
\]
A direct conclusion of the event $\mathcal E$ is that we can find an absolute constant $c$ such that
\[
\hat\beta^{(n)}_{h}(s,a)=\sum_{i=1}^{M}\min\Big\{\alpha^{(n)}\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\tilde\phi^{(n)}_{h,i}})^{-1}},\,H\Big\}
\ge\sum_{i=1}^{M}\min\Big\{c\,\alpha^{(n)}\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,i}}},\,H\Big\},\qquad\forall n\in[N],\ h\in[H].
\]
Next, similar to the proof of Lemma B.7, we may prove
\[
\mathbb{E}_{s\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\Big[\check V^{(n)}_{h,i}(s)-V^{\pi^{(n)}}_{h,i}(s)\Big]\le\sum_{h'=h}^{H}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h'}}\Big[-\hat\beta^{(n)}_{h'}(s,a)+H\min\{f^{(n)}_{h'}(s,a),1\}\Big],\qquad\forall h\in[H],
\]
and we get for $h=1$,
\[
\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},1}}\big[\min\{f^{(n)}_{1,i}(s[Z_i],a_i),1\}\big]\le\sqrt{\tilde A\zeta^{(n)}},
\]
and for all $h\ge2$,
\[
\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\big[\min\{f^{(n)}_{h,i}(s[Z_i],a_i),1\}\big]\le\frac{c\,\alpha^{(n)}}{H}\,\mathbb{E}_{(s,\tilde a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h-1}}\Big[\big\|\tilde\phi^{(n)}_{h-1,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h-1},\tilde\phi^{(n)}_{h-1,i}}}\Big].
\]
Finally, we get
\[
\check v^{(n)}_{i}-v^{\pi^{(n)}}_{i}
=\mathbb{E}_{s\sim d_{1}}\Big[\check V^{(n)}_{1,i}(s)-V^{\pi^{(n)}}_{1,i}(s)\Big]
\le\sum_{h=1}^{H-1}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{\hat P^{(n)},h}}\Big[-\hat\beta^{(n)}_{h}(s,a)+\sum_{j=1}^{M}\min\Big\{c\,\alpha^{(n)}\big\|\tilde\phi^{(n)}_{h,j}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,j}}},\,H\Big\}\Big]+HM\sqrt{\tilde A\zeta^{(n)}}
\le HM\sqrt{\tilde A\zeta^{(n)}},
\]
which finishes the proof.

Lemma D.8. When the event $\mathcal E$ holds and $\alpha^{(n)}=\Theta\big(H\sqrt{\tilde A\,(n\zeta^{(n)}+d^{L}\lambda)}\big)$ satisfies $\alpha^{(1)}\le\alpha^{(2)}\le\dots\le\alpha^{(N)}$, we have
\[
\sum_{n=1}^{N}\Delta^{(n)}\lesssim H^{2}M^{2}\sqrt{d^{L^{2}}\tilde A^{L}N\log\Big(1+\frac{N}{d\lambda}\Big)}\,\alpha^{(N)}.
\]
Proof. Let $f^{(n)}_{h}(s,a)=\big\|\hat P^{(n)}_{h}(\cdot|s,a)-P^{\star}_{h}(\cdot|s,a)\big\|_{1}$ and $f^{(n)}_{h,i}(s[Z_i],a_i)=\big\|\hat P^{(n)}_{h,i}(\cdot|s[Z_i],a_i)-P^{\star}_{h,i}(\cdot|s[Z_i],a_i)\big\|_{1}$. Then, according to the event $\mathcal E$, we have
\[
\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]\le\zeta^{(n)},\quad
\mathbb{E}_{(s,a)\sim\hat\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]\le\zeta^{(n)},\qquad\forall n\in[N],\ h\in[H],\ i\in[M],
\]
\[
\big\|\tilde\phi_{h,i}(s,a)\big\|_{(\hat\Sigma^{(n)}_{h,\tilde\phi_{h,i}})^{-1}}=\Theta\Big(\big\|\tilde\phi_{h,i}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi_{h,i}}}\Big),\qquad\forall n\in[N],\ h\in[H],\ \tilde\phi_{h,i}\in\tilde\Phi_{h,i},\ i\in[M].
\]
By definition, we have $\Delta^{(n)}=\max_{i\in[M]}\big(\hat v^{(n)}_{i}-\check v^{(n)}_{i}\big)+2HM\sqrt{\tilde A\zeta^{(n)}}$. With similar steps as those in the proof of Lemma B.8 (note that $\hat V^{(n)}_{h,i}(s)-\check V^{(n)}_{h,i}(s)$ is upper bounded by $2H^{2}M$), we have
\[
\mathbb{E}_{s\sim d^{\pi^{(n)}}_{P^{\star},1}}\Big[\hat V^{(n)}_{1,i}(s)-\check V^{(n)}_{1,i}(s)\Big]\le\underbrace{2\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\big[\hat\beta^{(n)}_{h}(s,a)\big]}_{(a)}+\underbrace{2H^{2}M\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\big[f^{(n)}_{h}(s,a)\big]}_{(b)}.\tag{23}
\]
First, we bound the term $(a)$ in inequality (23). Following Lemma D.4 (note that we use the fact that $B=H$ when applying it), we have
\[
\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\big[\hat\beta^{(n)}_{h}(s,a)\big]
\lesssim\sum_{h=1}^{H}\sum_{i=1}^{M}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\Big[\min\Big\{\alpha^{(n)}\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,i}}},\,H\Big\}\Big]
\]
\[
\lesssim\sum_{h=1}^{H-1}\sum_{i=1}^{M}\sqrt{\tilde A^{L}}\,\mathbb{E}_{(s,\tilde a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\Big[\big\|\tilde\varphi^{\star}_{h,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\tilde\varphi^{\star}_{h,i}}}\Big]\sqrt{n(\alpha^{(n)})^{2}\,\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\Big[\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|^{2}_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,i}}}\Big]+H^{2}d^{L^{2}}\lambda}
+\sqrt{\tilde A^{L}(\alpha^{(n)})^{2}\,\mathbb{E}_{(s,a)\sim\rho^{(n)}_{1}}\Big[\big\|\tilde\phi^{(n)}_{1,i}(s,a)\big\|^{2}_{\Sigma^{-1}_{n,\rho^{(n)}_{1},\tilde\phi^{(n)}_{1,i}}}\Big]}.
\]
In addition, we have
\[
n\,\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\Big[\big\|\tilde\phi^{(n)}_{h,i}(s,a)\big\|^{2}_{\Sigma^{-1}_{n,\rho^{(n)}_{h},\tilde\phi^{(n)}_{h,i}}}\Big]
=n\,\mathrm{Tr}\Big(\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\big[\tilde\phi^{(n)}_{h,i}\tilde\phi^{(n)\top}_{h,i}\big]\Big(n\,\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\big[\tilde\phi^{(n)}_{h,i}\tilde\phi^{(n)\top}_{h,i}\big]+\lambda I_{d^{|Z_i|}}\Big)^{-1}\Big)\le d^{L}.
\]
Then,
\[
\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\big[\hat\beta^{(n)}_{h}(s,a)\big]\le\sum_{h=1}^{H-1}\sum_{i=1}^{M}\sqrt{\tilde A^{L}}\,\mathbb{E}_{(s,\tilde a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\Big[\big\|\tilde\varphi^{\star}_{h,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\tilde\varphi^{\star}_{h,i}}}\Big]\sqrt{d^{L}(\alpha^{(n)})^{2}+H^{2}d^{L^{2}}\lambda}+\sqrt{d^{L}\tilde A^{L}(\alpha^{(n)})^{2}/n}.
\]
Second, we bound the term $(b)$ in inequality (23). Following Lemma D.3 and noting that $f^{(n)}_{h,i}(s[Z_i],a_i)$ is upper bounded by $2$ (i.e., $B=2$ in Lemma D.3), we have
\[
\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\big[f^{(n)}_{h}(s,a)\big]\le\sum_{i=1}^{M}\sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)\big]
\]
\[
\le\sum_{i=1}^{M}\Big(\sum_{h=1}^{H-1}\sqrt{\tilde A}\,\mathbb{E}_{(s,\tilde a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\Big[\big\|\tilde\phi^{\star}_{h,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\tilde\phi^{\star}_{h,i}}}\Big]\sqrt{n\,\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\big[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}\big]+d^{L}\lambda}+\sqrt{\tilde A\,\mathbb{E}_{(s,a)\sim\rho^{(n)}_{1}}\big[f^{(n)}_{1,i}(s[Z_i],a_i)^{2}\big]}\Big)
\]
\[
\le\sum_{i=1}^{M}\Big(\sum_{h=1}^{H-1}\sqrt{\tilde A}\,\mathbb{E}_{(s,\tilde a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\Big[\big\|\tilde\phi^{\star}_{h,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\tilde\phi^{\star}_{h,i}}}\Big]\sqrt{n\zeta^{(n)}+d^{L}\lambda}+\sqrt{\tilde A\zeta^{(n)}}\Big)
\lesssim\frac{\alpha^{(n)}}{H}\sum_{i=1}^{M}\sum_{h=1}^{H-1}\mathbb{E}_{(s,\tilde a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\Big[\big\|\tilde\phi^{\star}_{h,i}(s,\tilde a)\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\tilde\phi^{\star}_{h,i}}}\Big]+M\sqrt{\tilde A\zeta^{(n)}},
\]
where in the second inequality we use $\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}[f^{(n)}_{h,i}(s[Z_i],a_i)^{2}]\le\zeta^{(n)}$, and in the last line we recall $\sqrt{\tilde A\,(n\zeta^{(n)}+d^{L}\lambda)}\lesssim\alpha^{(n)}/H$.
Then, by combining the above bounds on the terms $(a)$ and $(b)$ in inequality (23), we have
\[
\hat v^{(n)}_{i}-\check v^{(n)}_{i}
=\mathbb{E}_{s\sim d^{\pi^{(n)}}_{P^{\star},1}}\Big[\hat V^{(n)}_{1,i}(s)-\check V^{(n)}_{1,i}(s)\Big]
\lesssim\sum_{i=1}^{M}\sum_{h=1}^{H-1}\Big(\sqrt{\tilde A^{L}}\,\mathbb{E}\Big[\big\|\tilde\varphi^{\star}_{h,i}\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\tilde\varphi^{\star}_{h,i}}}\Big]\sqrt{d^{L}(\alpha^{(n)})^{2}+H^{2}d^{L^{2}}\lambda}+\sqrt{\frac{d^{L}\tilde A^{L}(\alpha^{(n)})^{2}}{n}}\Big)
+H^{2}M\sum_{i=1}^{M}\sum_{h=1}^{H-1}\frac{\alpha^{(n)}}{H}\,\mathbb{E}\Big[\big\|\tilde\phi^{\star}_{h,i}\big\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\tilde\phi^{\star}_{h,i}}}\Big]+H^{2}M^{2}\sqrt{\tilde A\zeta^{(n)}}.
\]
Taking the maximum over $i$ on both sides and using the definition of $\Delta^{(n)}$, the same bound holds for $\Delta^{(n)}=\max_{i\in[M]}\big(\hat v^{(n)}_{i}-\check v^{(n)}_{i}\big)+2HM\sqrt{\tilde A\zeta^{(n)}}$. Summing over $n$ and applying the elliptical potential argument (Lemmas E.2 and E.3, which give $\sum_{n=1}^{N}\mathbb{E}\big[\|\tilde\varphi^{\star}_{h,i}\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\tilde\varphi^{\star}_{h,i}}}\big]\lesssim\sqrt{d^{L^{2}}N\log(1+\frac{N}{d\lambda})}$), we obtain
\[
\sum_{n=1}^{N}\Delta^{(n)}\lesssim HM\Big(\sqrt{d^{L^{2}}N\log\Big(1+\frac{N}{d\lambda}\Big)}\sqrt{\tilde A^{L}\big(d^{L}(\alpha^{(N)})^{2}+H^{2}d^{L^{2}}\lambda\big)}+\sum_{n=1}^{N}\sqrt{\frac{d^{L}\tilde A^{L}(\alpha^{(n)})^{2}}{n}}\Big)
+H^{3}M^{2}\cdot\frac{1}{H}\sqrt{d^{L^{2}}N\log\Big(1+\frac{N}{d\lambda}\Big)}\,\alpha^{(N)}+\sum_{n=1}^{N}\sqrt{\tilde A\zeta^{(n)}}
\lesssim H^{2}M^{2}\sqrt{d^{L^{2}}\tilde A^{L}N\log\Big(1+\frac{N}{d\lambda}\Big)}\,\alpha^{(N)}.
\]
(Some algebra; we take the dominating term out, noting that $\alpha^{(n)}$ is increasing in $n$.) This concludes the proof.

Proof of Lemma D.9. The result of Lemma D.1 implies that, with our choice of $\lambda$ and $\zeta^{(n)}$, the event $\mathcal E$ holds with probability at least $1-\delta$. In this case, we have
\[
\alpha^{(n)}=\Theta\Big(H\sqrt{\tilde A\Big(\log\frac{|\mathcal M|HNM}{\delta}+Ld^{2L}\log\frac{NHM|\Phi|}{\delta}\Big)}\Big),
\]
which is a constant unrelated to $n$. Therefore, using the result of Lemma D.8, we get
\[
\sum_{n=1}^{N}\Delta^{(n)}\lesssim H^{2}M^{2}\sqrt{d^{L^{2}}\tilde A^{L}N\log\Big(1+\frac{N}{d\lambda}\Big)}\,\alpha^{(N)}\lesssim H^{3}M^{2}d^{(L+1)^{2}}A^{L+1}L^{\frac12}N^{\frac12}\log\frac{|\mathcal M|HNM}{\delta},
\]
which finishes the proof.

Proof of Theorem 4.1. For any fixed episode $n$ and agent $i$, by Lemma D.5, Lemma D.6 and Lemma D.7, we have
\[
v^{\dagger,\pi^{(n)}_{-i}}_{i}-v^{\pi^{(n)}}_{i}\ \text{ or }\ \max_{\omega\in\Omega_{i}}v^{\omega\circ\pi^{(n)}}_{i}-v^{\pi^{(n)}}_{i}\ \le\ \hat v^{(n)}_{i}-\check v^{(n)}_{i}+2HM\sqrt{\tilde A\zeta^{(n)}}\le\Delta^{(n)}.
\]
Taking the maximum over $i$ on both sides, we have
\[
\max_{i\in[M]}\Big(v^{\dagger,\pi^{(n)}_{-i}}_{i}-v^{\pi^{(n)}}_{i}\Big)\ \text{ or }\ \max_{i\in[M]}\max_{\omega\in\Omega_{i}}\Big(v^{\omega\circ\pi^{(n)}}_{i}-v^{\pi^{(n)}}_{i}\Big)\le\Delta^{(n)}.
\]
From Lemma D.9, with probability $1-\delta$, we can ensure
\[
\sum_{n=1}^{N}\Delta^{(n)}\lesssim H^{3}M^{2}d^{(L+1)^{2}}A^{L+1}L^{\frac12}N^{\frac12}\log\frac{|\mathcal M|HNM}{\delta}.
\]
Therefore, according to Lemma E.4, when we pick $N$ to be
\[
O\Big(\frac{L^{5}M^{4}H^{6}d^{2(L+1)^{2}}\tilde A^{2(L+1)}}{\varepsilon^{2}}\log^{2}\frac{HdALM|\mathcal M|}{\delta\varepsilon}\Big),
\]
we have $\frac{1}{N}\sum_{n=1}^{N}\Delta^{(n)}\le\varepsilon$. On the other hand, from equation (25), we have
\[
\max_{i\in[M]}\Big(v^{\dagger,\pi_{-i}}_{i}-v^{\pi}_{i}\Big)\ \text{ or }\ \max_{i\in[M]}\max_{\omega\in\Omega_{i}}\Big(v^{\omega\circ\pi}_{i}-v^{\pi}_{i}\Big)
=\max_{i\in[M]}\Big(v^{\dagger,\pi^{(n^{\star})}_{-i}}_{i}-v^{\pi^{(n^{\star})}}_{i}\Big)\ \text{ or }\ \max_{i\in[M]}\max_{\omega\in\Omega_{i}}\Big(v^{\omega\circ\pi^{(n^{\star})}}_{i}-v^{\pi^{(n^{\star})}}_{i}\Big)
\le\Delta^{(n^{\star})}=\min_{n\in[N]}\Delta^{(n)}\le\frac{1}{N}\sum_{n=1}^{N}\Delta^{(n)}\le\varepsilon,
\]
which finishes the proof, noting our assumption that $L=O(1)$.

E AUXILIARY LEMMAS

and $\hat\Sigma^{(n)}_{h,\phi}=\sum_{i=1}^{n}\phi(s^{(i)}_{h},a^{(i)}_{h})\phi^{\top}(s^{(i)}_{h},a^{(i)}_{h})+\lambda^{(n)}I_{d}$. With probability $1-\delta$, we have, for all $n\in\mathbb N_{+}$, $h\in[H]$, and $\phi\in\Phi$,
\[
c_{1}\,\|\phi(s,a)\|_{\Sigma^{-1}_{\rho^{(n)}_{h},\phi}}\le\|\phi(s,a)\|_{(\hat\Sigma^{(n)}_{h,\phi})^{-1}}\le c_{2}\,\|\phi(s,a)\|_{\Sigma^{-1}_{\rho^{(n)}_{h},\phi}}.
\]
Since we have $\sum_{i}\sigma_{i}=\mathrm{Tr}(M_{N})\le d\lambda_{0}+NB^{2}$, the statement is concluded.

Lemma E.4. For parameters $A,B,\varepsilon$ such that $\frac{A^{2}B}{\varepsilon^{2}}$ is larger than some absolute constant, when we pick
\[
N=\frac{A^{2}}{\varepsilon^{2}}\log^{2}\frac{A^{4}B^{2}}{\varepsilon^{4}}=O\Big(\frac{A^{2}}{\varepsilon^{2}}\log^{2}\frac{AB}{\varepsilon}\Big),
\]
we have $\frac{A}{\sqrt N}\log(BN)\le\varepsilon$.

Proof. We have
\[
\frac{A}{\sqrt N}\log(BN)=\varepsilon\cdot\frac{\log\big(\frac{A^{2}B}{\varepsilon^{2}}\log^{2}\frac{A^{4}B^{2}}{\varepsilon^{4}}\big)}{\log\frac{A^{4}B^{2}}{\varepsilon^{4}}}.
\]
Note that
\[
\frac{A^{2}B}{\varepsilon^{2}}\log^{2}\frac{A^{4}B^{2}}{\varepsilon^{4}}\le\frac{A^{4}B^{2}}{\varepsilon^{4}}
\iff\log^{2}\frac{A^{4}B^{2}}{\varepsilon^{4}}\le\frac{A^{2}B}{\varepsilon^{2}},
\]
where the right-hand side always holds once $\frac{A^{2}B}{\varepsilon^{2}}$ is larger than some absolute constant. Therefore, we get $\frac{A}{\sqrt N}\log(BN)\le\varepsilon$.
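As a quick numeric sanity check of Lemma E.4 (a minimal sketch; the values of A, B, and ε below are arbitrary test parameters, not quantities from the paper):

```python
import math

def prescribed_N(A, B, eps):
    # N = (A^2 / eps^2) * log^2(A^4 B^2 / eps^4), as prescribed in Lemma E.4.
    return (A / eps) ** 2 * math.log(A ** 4 * B ** 2 / eps ** 4) ** 2

def lhs(A, B, eps):
    # The quantity A / sqrt(N) * log(B N) that the lemma bounds by eps.
    N = prescribed_N(A, B, eps)
    return A / math.sqrt(N) * math.log(B * N)

# Once A^2 B / eps^2 is moderately large, the bound holds.
for A, B, eps in [(10.0, 5.0, 0.1), (100.0, 2.0, 0.01)]:
    assert lhs(A, B, eps) <= eps
```

This is exactly the role the lemma plays in the proof of Theorem 4.1: it converts the $\sqrt N\log(BN)$-type regret bound into a concrete episode count N achieving average gap ε.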

F EXPERIMENT DETAILS

F.1 DETAILED ENVIRONMENT SETUP

In this section we introduce the details of the environment construction of the Block Markov games. For completeness, we repeat certain details already introduced in the main text. We design our Block Markov game by first randomly generating a tabular Markov game with horizon H, 3 states, and 2 players each with 3 actions, with a random reward matrix $R_h\in(0,1)^{3\times 3^2\times H}$ and a random transition matrix $T_h(s_h,a_h)\in\Delta(S_{h+1})$. For the reward generation, we assign each entry $r(s,a,s')$ of the reward matrix a random number sampled uniformly from $[-1,1]$. For the transition matrix generation, for each conditional distribution $T(\cdot|s,a)$, we randomly sample 3 numbers uniformly from $[0,1]$ and form the probability simplex by normalization. For the generation of rich observations (the emission distribution), we follow the experimental design of Misra et al. (2020): the dimension of the observation is $2^{\lceil\log(H+|S|+1)\rceil}$. For an observation $o$ emitted from state $s$ at time step $h$, we concatenate the one-hot vectors of $s$ and $h$, add i.i.d. Gaussian noise $\mathcal N(0,0.1)$ to each entry, append zeros at the end if necessary, and finally multiply by a Hadamard matrix. In our setting, we have variants with different horizons H.

Here $a$ denotes the one-hot encoding in the joint action space. Different from Zhang et al. (2022), we solve the optimization problem by directly solving the min-max-min problem instead of using an iterative method. We show the implementation in Algorithm 5. We first perform minibatch stochastic gradient descent aggressively on the discriminator selection step (line 5, on φ and f) and the feature selection step (line 6, on ϕ), where in each step we first compute the linear weights $w$ and $\hat w$ in closed form and then perform gradient descent/ascent on the features and discriminators. Note that the number of iterations T here is very small.
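The environment construction above can be sketched in a few lines of code (a minimal sketch with our own helper names; we use the Sylvester construction for the Hadamard matrix and interpret the noise as Gaussian with scale 0.1, since the paper does not pin down these details):

```python
import math
import random

def sylvester_hadamard(k):
    """Build the 2^k x 2^k Hadamard matrix by the Sylvester construction."""
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def make_tabular_game(H_horizon, n_states=3, n_actions=3, n_players=2):
    """Random rewards in (-1, 1) and random transition kernels on the simplex."""
    n_joint = n_actions ** n_players
    rewards = [[[random.uniform(-1, 1) for _ in range(n_joint)]
                for _ in range(n_states)] for _ in range(H_horizon)]
    trans = []
    for _ in range(H_horizon):
        step = []
        for _ in range(n_states * n_joint):
            w = [random.random() for _ in range(n_states)]
            z = sum(w)
            step.append([x / z for x in w])  # normalize onto the simplex
        trans.append(step)
    return rewards, trans

def emit_observation(state, h, H_horizon, n_states=3, noise=0.1):
    """One-hot (state, step) vectors + Gaussian noise, zero-padded, times Hadamard."""
    k = math.ceil(math.log2(H_horizon + n_states + 1))
    dim = 2 ** k
    x = [0.0] * dim
    x[state] = 1.0
    x[n_states + h] = 1.0
    x = [v + random.gauss(0.0, noise) for v in x]
    Had = sylvester_hadamard(k)
    return [sum(Had[r][c] * x[c] for c in range(dim)) for r in range(dim)]
```

For H = 3 and 3 states this gives 8-dimensional observations, matching the $2^{\lceil\log(H+|S|+1)\rceil}$ formula.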
For solving the Markov games, in addition to following Algorithm 1, to solve line 10 (i.e., solving equation 2, 3, or 4), we implement the NE/CCE solvers based on the public repository https://github.com/quantumiracle/MARS. Note that the essential difference lies in that Xie et al. (2020) assume the algorithm has access to the ground-truth feature, whereas our algorithm needs to utilize the different features we learn at each iteration. We also adopt the deep RL baseline from the same public repository.
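For intuition, the fictitious-play component of the baseline can be illustrated on a one-shot zero-sum matrix game (a minimal self-contained sketch with a hypothetical payoff matrix; the actual baseline wraps this idea around DQN on the full Markov game):

```python
def fictitious_play(payoff, iters=2000):
    """Fictitious play on a zero-sum matrix game: each player best-responds
    to the opponent's empirical action frequencies."""
    n, m = len(payoff), len(payoff[0])
    count_row, count_col = [0] * n, [0] * m
    count_row[0] += 1  # arbitrary initial plays
    count_col[0] += 1
    for _ in range(iters):
        # Row player maximizes payoff against the column empirical mixture.
        row_vals = [sum(payoff[i][j] * count_col[j] for j in range(m)) for i in range(n)]
        count_row[row_vals.index(max(row_vals))] += 1
        # Column player minimizes payoff against the row empirical mixture.
        col_vals = [sum(payoff[i][j] * count_row[i] for i in range(n)) for j in range(m)]
        count_col[col_vals.index(min(col_vals))] += 1
    t_r, t_c = sum(count_row), sum(count_col)
    return [c / t_r for c in count_row], [c / t_c for c in count_col]

# Matching pennies: the empirical mixtures approach the NE (1/2, 1/2).
x, y = fictitious_play([[1, -1], [-1, 1]])
```

The empirical frequencies converge to a Nash equilibrium in zero-sum matrix games, which is why fictitious play is a natural equilibrium-seeking baseline.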

F.3 ZERO-SUM EXPERIMENT TRAINING CURVES

In this section we provide the training curves of GERL_MG2 and the deep RL baseline in the zero-sum setting in Figure 1.

F.4 GENERAL-SUM EXPERIMENT DETAILS

In this section we complete the remaining details for the general-sum experiment. We include the training curve in Fig. 2 . We evaluate each method over 5 random seeds and report the mean and standard deviation of the moving average of evaluation returns, wherein for each evaluation we perform 1000 runs. We use "Oracle" to denote the ground truth NE values of the Markov game. The x-axis denotes the number of episodes and the y-axis denotes the value of returns. 

F.5 HYPERPARAMETERS

In this section, we include the hyperparameters for GERL_MG2 in Table 2, and the hyperparameters for DQN in Table 3 and Table 4.



\[
\mathbb{E}_{(s,a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\big[\hat\beta^{(n)}_{h}(s,a)\big]\le\mathbb{E}_{(s,\tilde a)\sim d^{\pi^{(n)}}_{P^{\star},h}}\Big[\|\phi^{\star}_{h}(s,\tilde a)\|_{\Sigma^{-1}_{n,\gamma^{(n)}_{h},\phi^{\star}_{h}}}\Big]\sqrt{dA(\alpha^{(n)})^{2}+H^{2}d\lambda}+\sqrt{dA(\alpha^{(n)})^{2}/n}.
\]



one gets the following result (more formally, one can prove it by induction, as in Lemma B.5, Lemma B.6 and Lemma B.7):

scale $\gamma$, we have: for all $(s,a)$ and $\phi\in\Phi$, there exists $\bar\theta\in\mathcal W$ such that $|\langle\phi(s,a),\theta-\bar\theta\rangle|\le\gamma$, and we have $|\mathcal W|=\big(\tfrac{2}{\gamma}\big)^{d}$

$(s,a)$ is a linear function of $\phi^{\star}_{h}$, and the 2-norm of the weight is upper bounded by $\sqrt d$. Then, according to the event $\mathcal E$, we have

which finishes the proof.

D ANALYSIS OF THE FACTORED MARKOV GAMES

D.1 HIGH PROBABILITY EVENTS

Define the set $\tilde\Phi_{h,i}=\big\{\tilde\phi_{h,i}(s,a):=\otimes_{j\in Z_i}\phi_{h,j}(s[Z_j],a_j)\;\big|\;\phi_{h,j}\in\Phi_{h,j}\big\}$. Let $|\Phi|=\max_{h,j}|\Phi_{h,j}|$ and $|\tilde\Phi|=\max_{h,i}|\tilde\Phi_{h,i}|$. Clearly, we have $|\tilde\Phi|\le|\Phi|^{L}$. Define the following event

D.2 PROOF OF THE MAIN THEOREMS

Lemma D.9. For the model-based algorithm, when we pick $\lambda=\Theta\big(Ld^{L}\log\frac{NHM|\Phi|}{\delta}\big)$, $\alpha^{(n)}=\Theta\big(H\sqrt{\tilde A\,(n\zeta^{(n)}+d^{L}\lambda)}\big)$ and $\zeta^{(n)}=\Theta\big(\frac{1}{n}\log\frac{|\mathcal M|HNM}{\delta}\big)$, with probability $1-\delta$, we have
\[
\sum_{n=1}^{N}\Delta^{(n)}\lesssim H^{3}M^{2}d^{(L+1)^{2}}A^{L+1}L^{\frac12}N^{\frac12}\log\frac{|\mathcal M|HNM}{\delta}.
\]

Lemma E.1 (Concentration of the bonus term (Zanette et al. (2021), Lemma 39)). Set $\lambda^{(n)}\ge\Theta(d\log(nH|\Phi|/\delta))$ for any $n$. Define $\Sigma_{n,\rho^{(n)}_{h},\phi}=n\,\mathbb{E}_{(s,a)\sim\rho^{(n)}_{h}}\big[\phi(s,a)\phi^{\top}(s,a)\big]+\lambda^{(n)}I_{d}$,

Lemma E.2 (Agarwal et al. (2020a), Lemma G.2). Consider the following process. For $n=1,\dots,N$, $M_{n}=M_{n-1}+G_{n}$ with $M_{0}=\lambda_{0}I$ and $G_{n}$ a positive semidefinite matrix with eigenvalues upper bounded by 1. We have
\[
2\log\det(M_{N})-2\log\det(\lambda_{0}I)\ge\sum_{n=1}^{N}\mathrm{Tr}\big(G_{n}M_{n-1}^{-1}\big).
\]

Lemma E.3 (Potential function lemma). Suppose $\mathrm{Tr}(G_{n})\le B^{2}$. Then
\[
\log\det(M_{N})-\log\det(\lambda_{0}I)\le d\log\Big(1+\frac{NB^{2}}{d\lambda_{0}}\Big).
\]
Proof. Let $\sigma_{1},\dots,\sigma_{d}$ be the singular values of $M_{N}$, recalling that $M_{N}$ is positive semidefinite. Then, by the AM–GM inequality,
\[
\log\frac{\det(M_{N})}{\det(\lambda_{0}I)}=\log\prod_{i=1}^{d}\frac{\sigma_{i}}{\lambda_{0}}\le\log\Big(\frac{1}{d}\sum_{i=1}^{d}\frac{\sigma_{i}}{\lambda_{0}}\Big)^{d}.
\]
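As a numeric sanity check of the potential-function lemma (a minimal sketch restricted to diagonal PSD updates, so determinants are just products of diagonal entries; the particular dimensions and constants are arbitrary):

```python
import math
import random

def logdet_diag(diag):
    """log-determinant of a diagonal matrix given by its diagonal entries."""
    return sum(math.log(x) for x in diag)

d, N, lam0, B = 4, 50, 1.0, 1.0
M = [lam0] * d  # M_0 = lambda_0 * I, kept diagonal for simplicity
for _ in range(N):
    # Diagonal PSD update G_n with eigenvalues <= 1 and Tr(G_n) <= B^2.
    g = [random.random() for _ in range(d)]
    z = sum(g)
    g = [min(1.0, x * B ** 2 / z) for x in g]
    M = [m + x for m, x in zip(M, g)]

lhs = logdet_diag(M) - d * math.log(lam0)
rhs = d * math.log(1 + N * B ** 2 / (d * lam0))
assert lhs <= rhs  # log det(M_N) - log det(lam0 I) <= d log(1 + N B^2 / (d lam0))
```

The inequality holds for any such sequence of updates, by the AM-GM step in the proof above.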

F.2 IMPLEMENTATION DETAILS

For the implementation of GERL_MG2, we break down the introduction into two parts: the implementation of Algorithm 3 and the implementation of the game-solving algorithm with the current features (lines 10 and 11 of Algorithm 1). For the implementation of Algorithm 3, we follow the same function approximation as Zhang et al. (2022) and adapt their open-sourced code at https://github.com/yudasong/briee. We include an overview of the function class for completeness: we adopt a two-layer neural network with tanh non-linearity as the discriminator class. For the decoder, we let ψ(o) = softmax(A⊤o), where A ∈ R^{|O|×3}, and we let ϕ(o, a) = ψ(o) ⊗ a.
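The decoder and feature map just described can be sketched as follows (a minimal pure-Python sketch for a single observation; in the actual implementation A is a learned weight matrix and these operations run on batched tensors):

```python
import math

def softmax(v):
    """Numerically stable softmax of a list of logits."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]

def decoder_psi(A, o):
    """psi(o) = softmax(A^T o): map an observation to a distribution over 3 latent states."""
    logits = [sum(A[i][k] * o[i] for i in range(len(o))) for k in range(len(A[0]))]
    return softmax(logits)

def feature_phi(A, o, a_onehot):
    """phi(o, a) = psi(o) kron a, with a the one-hot joint-action encoding."""
    psi = decoder_psi(A, o)
    return [p * x for p in psi for x in a_onehot]
```

Because ψ(o) lies on the simplex and a is one-hot, the resulting feature ϕ(o, a) also sums to one, i.e., it is a valid "soft" indicator over (latent state, joint action) pairs.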

Figure1: Training curve in the zero-sum setting. We evaluate each method over 5 random seeds and report the mean and standard deviation of the moving average of evaluation returns, wherein for each evaluation we perform 1000 runs. We use "Oracle" to denote the ground truth NE values of the Markov game. The x-axis denotes the number of episodes and the y-axis denotes the value of returns.

Figure 2: Training curve of GERL_MG2 in the general sum setting. In this setting, the y-axis denotes exploitability instead of raw returns.

Algorithm 2 Model-based Representation Learning, MBRepLearn

Algorithm 4 Model-based Representation Learning for Factored MG (MBREPLEARN_FACTOR)

Top: Short Horizon (H=3) exploitability of the final policy of DQN and GERL_MG2. [...] random reward matrix $R_h\in(0,1)^{3\times 3^2\times H}$ and random transition matrix $T_h(s_h,a_h)\in\Delta(S_{h+1})$. We provide more details (e.g., generation of rich observations) in Appendix F.1.


\[
\lesssim N\log\det\big(\lambda I_{d^{|\cup_{j\in Z_i}Z_j|}}+\cdots\big)
\]
(using $\|\tilde\varphi(s[Z_i],a_i)\|_{2}\le1$ for any $(s,a)$.) Similarly, we have


ACKNOWLEDGMENTS

Mengdi Wang acknowledges the support by NSF grants DMS-1953686, IIS-2107304, CMMI1653435, ONR grant 1006977, and http://C3.AI. Chi Jin gratefully acknowledges the support by Office of Naval Research Grant N00014-22-1-2253.


Algorithm 5 Model-free Representation Learning in Practice
1: Input: dataset D, step h, regularization λ, iterations T.
2: Denote the least-squares loss: [...]
   Discriminator selection: [...]
   end for
8: Return $\hat\phi$, $\hat P$, where $\hat P$ is calculated from (1).
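The alternating scheme in Algorithm 5 can be illustrated with a drastically simplified, one-dimensional sketch (our own illustrative names; the discriminator step is omitted, the feature is a scalar tanh unit, the ridge weight is computed in closed form as in the text, and the gradient step uses a crude finite difference in place of backpropagation):

```python
import math

def ridge_w(theta, data, lam):
    """Closed-form ridge weight for the scalar feature phi_theta(x) = tanh(theta * x)."""
    num = sum(math.tanh(theta * x) * y for x, y in data)
    den = sum(math.tanh(theta * x) ** 2 for x, _ in data) + lam
    return num / den

def loss(theta, data, lam):
    """Squared prediction error with the closed-form weight plugged in."""
    w = ridge_w(theta, data, lam)
    return sum((w * math.tanh(theta * x) - y) ** 2 for x, y in data)

def fit_feature(data, lam=0.1, steps=200, lr=0.05, eps=1e-4):
    """Alternate: solve w in closed form, then take a finite-difference
    gradient step on the feature parameter theta."""
    theta = 0.5
    for _ in range(steps):
        grad = (loss(theta + eps, data, lam) - loss(theta - eps, data, lam)) / (2 * eps)
        theta -= lr * grad
    return theta, ridge_w(theta, data, lam)
```

On data generated as y = 2·tanh(3x), this alternating procedure drives the loss well below its initial value, mirroring the "closed-form weights, gradient steps on features" structure described above.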

