DECENTRALIZED OPTIMISTIC HYPERPOLICY MIRROR DESCENT: PROVABLY NO-REGRET LEARNING IN MARKOV GAMES

Abstract

We study decentralized policy learning in Markov games where we control a single agent to play against nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result [33], we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS), which achieves √K-regret in the context of general function approximation, where K is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a hyperpolicy, which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation. Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) studies how each agent learns to maximize its cumulative rewards by interacting with the environment as well as with other agents, where the state transitions and rewards are affected by the actions of all the agents. Equipped with powerful function approximators such as deep neural networks [31], MARL has achieved significant empirical success in various domains including the game of Go [47], StarCraft [50], DOTA2 [5], Atari [38], multi-agent robotic systems [8] and autonomous driving [45]. Compared with the centralized setting, where a central controller collects the information of all agents and coordinates their behaviors, decentralized algorithms [19, 42], where each agent autonomously chooses its action based on its own local information, are often more desirable in MARL applications. Specifically, decentralized methods (1) are easier to implement and enjoy better scalability, (2) are more robust to possible adversaries, and (3) require less communication overhead [21, 22, 9, 59, 18].

In this work, we aim to design a provably efficient decentralized reinforcement learning (RL) algorithm in the online setting with function approximation. In the sequel, for ease of presentation, we refer to the controllable agent as the player and regard the rest of the agents as a meta-agent, called the opponent, which specifies its policies arbitrarily. Our goal is to maximize the cumulative rewards of the player in the face of a possibly adversarial opponent, in the online setting where the policies of the player and the opponent can be based on adaptively gathered local information.

From a theoretical perspective, arguably the most distinctive challenge of the decentralized setting is nonstationarity. That is, from the perspective of any agent, the state transitions are affected by the policies of the other agents in an unpredictable and potentially adversarial way, and are thus nonstationary.
This is in stark contrast to the centralized setting, which can be regarded as a standard RL problem for the central controller that decides the actions of all the players. Furthermore, in the online setting, as the environment is unknown, to achieve sample efficiency the player needs to strike a balance between exploration and exploitation in the context of function approximation and in the presence of an adversarial opponent. The dual challenges of nonstationarity and efficient exploration are thus intertwined, making it difficult to develop provably efficient decentralized MARL algorithms.

Consequently, there seems to be only limited theoretical understanding of the decentralized MARL setting with a possibly adversarial opponent. Most of the existing algorithms [7, 53, 49, 27, 23] can only compete against the Nash value of the Markov game when faced with an arbitrary opponent. This is a much weaker baseline compared with the results in classic matrix games [17, 1], where the player is required to compete against the best fixed policy in hindsight. Meanwhile, [33] appears to be the only work we are aware of that achieves no-regret learning in MARL against the best policy in hindsight; it focuses on the policy revealing setting, where the player observes the policies played by the opponent in previous episodes. However, the algorithm and theory in that work are limited to tabular cases and fail to deal with large or even continuous state and action spaces. To this end, we would like to answer the following question:

Can we design a decentralized MARL algorithm that provably achieves no-regret against the best fixed policy in hindsight in the context of function approximation?

In this work, we provide a positive answer to the above question under the policy revealing setting with general function approximation. Specifically, we propose an actor-critic-type algorithm [29] called DORIS, which maintains a distribution over the policy space, named the hyperpolicy, for decision-making.
To combat the nonstationarity, DORIS updates the hyperpolicy via mirror descent (or equivalently, Hedge [16]). Furthermore, to encourage exploration, the descent directions of mirror descent are obtained by solving optimistic variants of policy evaluation subproblems with general function approximation, which only involve the local information of the player. Under standard regularity assumptions on the underlying function classes, we prove that DORIS achieves a sublinear regret in the presence of an adversarial opponent. In addition, when all the agents adopt DORIS independently, we prove that their average policy constitutes an approximate coarse correlated equilibrium. At the core of our analysis is a new complexity measure of function classes that is tailored to the decentralized MARL setting. Furthermore, to demonstrate the power of DORIS, we adapt it to solve the constrained Markov decision process (CMDP) and the vector-valued Markov decision process (VMDP), both of which can be formulated as a zero-sum Markov game with a fictitious opponent.

Our Contributions. Our contributions are four-fold. First, we propose a new decentralized policy optimization algorithm, DORIS, that provably achieves no-regret in the context of general function approximation; as a result, when all agents adopt DORIS, their average policy converges to a CCE of the Markov game. Second, we propose a new complexity measure named the Bellman Evaluation Eluder dimension, which generalizes the Bellman Eluder dimension [25] from single-agent MDPs to decentralized learning in Markov games and might be of independent interest. Third, we modify DORIS to solve CMDPs with general function approximation, which is shown to achieve sublinear regret and constraint violation. Finally, we extend DORIS to the approachability task [36] in VMDPs and attain a near-optimal solution.
To the best of our knowledge, DORIS is the first provably efficient decentralized algorithm for achieving no-regret in MARL with general function approximation.

Notations. In this paper we let [n] = {1, · · · , n} for any integer n. We denote the set of probability distributions over any set S by ∆_S or ∆(S). We also let ∥·∥ denote the ℓ_2-norm by default.

Related works. Our work is related to the bodies of literature on decentralized learning with an adversarial opponent, finding equilibria in self-play Markov games, CMDPs and VMDPs. These works either consider the centralized setting or do not allow function approximation in the decentralized online setting. Due to the page limit, we compare with these works in Appendix B.

2. PRELIMINARIES

2.1. MARKOV GAMES

Let us consider an n-agent general-sum Markov game (MG) M_MG = (S, {A_i}_{i=1}^n, {P_h}_{h=1}^H, {r_{h,i}}_{h=1,i=1}^{H,n}, H), where S is the state space, A_i is the action space of the i-th agent, P_h : S × ∏_{i=1}^n A_i → ∆(S) is the transition function at the h-th step, r_{h,i} : S × ∏_{i=1}^n A_i → R_+ is the reward function of the i-th agent at the h-th step, and H is the length of each episode. We assume each episode starts at a fixed start state s_1 and terminates at s_{H+1}. At step h ∈ [H], each agent i observes the state s_h and takes its action a_{h,i} simultaneously. After that, agent i receives its own reward r_{h,i}(s_h, a_h), where a_h := (a_{h,1}, · · · , a_{h,n}) is the joint action, and the environment transits to a new state s_{h+1} ∼ P_h(·|s_h, a_h).

Policy. A policy of the i-th agent µ_i = {µ_{h,i} : S → ∆_{A_i}}_{h∈[H]} specifies the action selection probability of agent i in each state at each step. In the following discussion we drop the h in µ_{h,i} when it is clear from the context. We use π to represent the joint policy of all agents and µ_{-i} to denote the joint policy of all agents other than i. Further, we assume each agent i chooses its policy from a policy class Π_i. Similarly, let Π_{-i} := ∏_{j≠i} Π_j denote the product of all agents' policy classes excluding the i-th agent.

Value functions and Bellman operators. Given any joint policy π, the i-th agent's value function V^π_{h,i} : S → R and action-value (or Q) function Q^π_{h,i} : S × ∏_{i=1}^n A_i → R characterize its expected cumulative rewards given a state or a state-action pair, defined as

V^π_{h,i}(s) := E_π[ Σ_{l=h}^H r_{l,i}(s_l, a_l) | s_h = s ],  Q^π_{h,i}(s, a) := E_π[ Σ_{l=h}^H r_{l,i}(s_l, a_l) | s_h = s, a_h = a ],

where the expectation is with respect to the distribution of the trajectory induced by executing the joint policy π in M_MG. Here we suppose the action-value function is bounded: Q^π_{h,i}(s, a) ≤ V_max for all s, a, h, i, π.
Notice that when the reward function is bounded in [0, 1], V max = H naturally.
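As a concrete illustration, when a tabular game is small and fully known, the value functions above can be computed by backward induction through the Bellman equation. The sketch below does this for a randomly generated two-agent game; all sizes, the random game, and the fixed joint policy are illustrative assumptions, not objects from the paper.

```python
import numpy as np

# Backward induction for V^{mu x nu} in a tiny two-agent tabular Markov game.
rng = np.random.default_rng(0)
S, A, B, H = 3, 2, 2, 4                      # states, actions of each agent, horizon

# P[h][s, a, b] is a distribution over next states; r[h][s, a, b] is agent 1's reward.
P = [rng.dirichlet(np.ones(S), size=(S, A, B)) for _ in range(H)]
r = [rng.uniform(0, 1, size=(S, A, B)) for _ in range(H)]

# A fixed joint policy: mu[h][s] in Delta(A) for the player, nu[h][s] in Delta(B).
mu = [rng.dirichlet(np.ones(A), size=S) for _ in range(H)]
nu = [rng.dirichlet(np.ones(B), size=S) for _ in range(H)]

V = np.zeros(S)                              # V_{H+1} = 0
for h in reversed(range(H)):
    Q = r[h] + P[h] @ V                      # Q_h = r_h + E_{s'}[V_{h+1}(s')]
    # V_h(s) = E_{a~mu_h(s), b~nu_h(s)}[Q_h(s, a, b)]
    V = np.einsum('sab,sa,sb->s', Q, mu[h], nu[h])

print(V[0])                                  # V^{mu x nu}_1(s_1) for s_1 = state 0
```

Since the rewards here lie in [0, 1] and H = 4, the computed value is bounded by H, matching the remark above that V_max = H in this case.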

2.2. DECENTRALIZED POLICY LEARNING

In this paper we consider the decentralized learning setting [27, 23, 33] where only one agent is under our control, which we call the player, and the other agents can be adversarial. Without loss of generality, assume that we can only control agent 1 and view the other agents as a meta-opponent. To simplify writing, we use a_h, A, r_h, µ, Π, V^π_h, Q^π_h to denote a_{h,1}, A_1, r_{h,1}, µ_1, Π_1, V^π_{h,1}, Q^π_{h,1} respectively. We also use b_h, B, ν, Π′ to represent the action, the action space, the policy and the policy class of the meta-opponent. By decentralized we mean that during the episodes, the player can only observe its own rewards and actions together with some information about the opponent specified by the protocol, i.e., {s^t_h, a^t_h, J^t_h, r^t_h}_{h=1}^H, where {J_h}_{h=1}^H is the information revealed by the opponent in each episode, which we will specify later. At the beginning of the t-th episode, the player chooses a policy µ^t from its policy class Π based only on its local information collected from previous episodes, without any coordination from a centralized controller. Meanwhile, the opponent selects ν^t from Π′ secretly and possibly adversarially. The learning objective is to minimize the regret of the player by comparing its performance against the best fixed policy in hindsight, as is standard in the online learning literature [1, 20]:

Definition 1 (Regret). Suppose (µ^t, ν^t) are the policies played by the player and the opponent in the t-th episode. Then the regret over K episodes is defined as

Regret(K) = max_{µ∈Π} Σ_{t=1}^K V^{µ×ν^t}_1(s_1) − Σ_{t=1}^K V^{µ^t×ν^t}_1(s_1),

where µ × ν denotes the joint policy in which the player and the opponent play µ and ν independently. We also use π^t to denote µ^t × ν^t. Achieving low regret as defined in (1) indicates that if we sample a policy µ uniformly at random from {µ^t}_{t=1}^K, the resulting mixture policy will be close to the best fixed policy in hindsight.

Relation between Definition 1 and equilibria.
An inspiration for our definition of regret comes from the tight connection between low regret and equilibria in matrix games [17, 6, 10]. By viewing each policy in the policy class as a pure strategy in a matrix game, we can naturally generalize the notion of equilibria from matrix games to Markov games. In particular, a correlated mixed strategy profile π can be defined as a mixture of the joint policies of all agents, i.e., π ∈ ∆(∏_{i∈[n]} Π_i). Suppose the marginal distribution of π over the policy of agent i is µ_i; then µ_i is a mixture of the policies in Π_i. For a correlated profile, the agents might not play their mixed policies µ_i independently, which means that π might not be the product of the µ_i. A coarse correlated equilibrium (CCE) is a correlated profile from which no agent has an incentive to deviate by playing a different independent policy:

Definition 2 (ε-approximate coarse correlated equilibrium (CCE) for n-player MG). A correlated strategy profile π is an ε-approximate coarse correlated equilibrium if for all i ∈ [n] we have

V^π_{1,i}(s_1) ≥ max_{µ′_i∈Π_i} V^{µ′_i×µ_{-i}}_{1,i}(s_1) − ε,

where µ_{-i} is the marginal distribution of π over the joint policy of all agents other than i.

Remark 1. Our definition of correlated strategy profiles and CCEs is slightly different from [35] because we work with policy classes while [35] does not. In fact, our definition is stricter in the sense that a correlated profile satisfying our definition must also satisfy theirs. In particular, if a CCE π satisfies π = ∏_{i∈[n]} µ_i, it is also called a Nash Equilibrium (NE). We will use our algorithm as an example to show that if a decentralized algorithm achieves low regret under Definition 1, we can find a CCE by running the algorithm independently for each agent, just as in classic matrix games.
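To make Definition 2 concrete, the following sketch checks the CCE condition in its simplest instance: a one-step (H = 1) two-player game, where the value of a profile is an expected payoff and a deviation is a best response against the other agent's marginal. The payoff matrices and the helper name `cce_gap` are illustrative assumptions.

```python
import numpy as np

def cce_gap(u1, u2, pi):
    """u1, u2: payoff matrices of shape (A, B); pi: joint distribution over A x B.
    Returns the largest gain any agent gets by an independent deviation."""
    # Agent 1 deviates against the marginal of agent 2's actions.
    gap1 = (u1 @ pi.sum(axis=0)).max() - (pi * u1).sum()
    # Agent 2 deviates against the marginal of agent 1's actions.
    gap2 = (pi.sum(axis=1) @ u2).max() - (pi * u2).sum()
    return max(gap1, gap2)

u1 = np.array([[1., -1.], [-1., 1.]])      # matching pennies
u2 = -u1
pi_uniform = np.full((2, 2), 0.25)
print(cce_gap(u1, u2, pi_uniform))         # 0.0: uniform play is an exact CCE here
```

A profile is an ε-approximate CCE exactly when this gap is at most ε; for the correlated profile that puts all mass on the joint action (0, 0), the gap is strictly positive, since the second player would deviate.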

2.3. FUNCTION APPROXIMATION

To deal with the potentially large or even infinite state and action spaces, we consider learning with general value function approximation in this paper [24, 25]. We assume the player is given a function class F = F_1 × · · · × F_{H+1} (F_h ⊆ (S × A × B → [0, V_max])) to approximate action-value functions. Since there is no reward in state s_{H+1}, we let f_{H+1}(s, a, b) = 0 for all s ∈ S, a ∈ A, b ∈ B, f ∈ F. To measure the size of F, we use |F| to denote its cardinality when F is finite. For infinite function classes, we use the ε-covering number to measure the size, which is defined as follows:

Definition 3 (ε-covering number). The ε-covering number of F, denoted by N_F(ε), is the minimum integer n such that there exists a subset F′ ⊂ F with |F′| = n and for any f ∈ F there exists f′ ∈ F′ such that max_{h∈[H]} ∥f_h − f′_h∥_∞ ≤ ε.

In addition to the size, we also need to impose some complexity assumption on the structure of the function class to achieve small generalization error. Here we introduce one such structural complexity measure, called the Distributional Eluder (DE) dimension [25], which we will utilize in our subsequent analysis. First let us define independence between distributions:

Definition 4 (ε-independence between distributions). Let W be a function class defined on X, and ρ, ρ_1, · · · , ρ_n be probability measures over X. We say ρ is ε-independent of {ρ_1, · · · , ρ_n} with respect to W if there exists w ∈ W such that √(Σ_{i=1}^n (E_{ρ_i}[w])²) ≤ ε but |E_ρ[w]| > ε.

From the definition we can see that a probability distribution ρ is independent of {ρ_1, · · · , ρ_n} if some function in W has small expectations under all of ρ_1, · · · , ρ_n but a large expectation under ρ. The DE dimension is then the length of the longest sequence of such mutually independent distributions:

Definition 5 (Distributional Eluder dimension). The Distributional Eluder dimension dim_DE(W, Q, ε) is the length of the longest sequence {ρ_1, · · · , ρ_n} ⊂ Q such that there exists ε′ ≥ ε where ρ_i is ε′-independent of {ρ_1, · · · , ρ_{i−1}} for all i ∈ [n].

The Eluder dimension, another commonly-used complexity measure proposed by [43], is a special case of the DE dimension.
If we choose Q = {δ_x(·) | x ∈ X}, where δ_x(·) is the Dirac measure centered at x, then the Eluder dimension can be formulated as dim_E(W, ε) = dim_DE(W − W, Q, ε), where W − W := {w_1 − w_2 : w_1, w_2 ∈ W}. Many function classes in MDPs are known to have low Eluder dimension, including linear MDPs [28], generalized linear complete models [52] and kernel MDPs [25]. We also assume the existence of an auxiliary function class G = G_1 × · · · × G_H (G_h ⊆ (S × A × B → [0, V_max])) to capture the results of applying the Bellman operators to F, as in [25, 27]. When F satisfies completeness (Assumption 3), we can simply choose G = F.
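For intuition on Definition 3, a greedy pass over a finite function class produces an ε-cover whose size upper-bounds N_F(ε) (greedy selection is not guaranteed to be minimal). Functions are represented below by their values on a finite grid; the random class is an illustrative assumption.

```python
import numpy as np

def greedy_cover(functions, eps):
    """functions: array of shape (num_f, num_points), each row a function
    evaluated on a grid. Returns a subset such that every row of `functions`
    is within eps of some selected row in the sup-norm."""
    cover = []
    for f in functions:
        if not any(np.max(np.abs(f - g)) <= eps for g in cover):
            cover.append(f)          # f is not eps-close to the cover yet; keep it
    return cover

rng = np.random.default_rng(2)
F = rng.uniform(0, 1, size=(200, 10))    # 200 functions on a 10-point grid
cover = greedy_cover(F, eps=0.5)
print(len(cover))                         # an upper bound on N_F(0.5) for this class
```

The same sup-norm distance `max |f - g|` over the grid plays the role of max_h ∥f_h − f′_h∥_∞ in the definition.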

3. ALGORITHM: DORIS

Policy revealing setting. Recall that in the decentralized policy learning setting, the player is also able to observe some information about the opponent, denoted by J_h, aside from its own actions and rewards. There have been works studying the cases J_h = ∅ [49] and J_h = b_h [27, 23] in two-player zero-sum games. However, their benchmark is the Nash value of the Markov game, i.e., V^{µ*×ν*}_1(s_1) where µ* × ν* is an NE, which is strictly weaker than our benchmark max_{µ∈Π} Σ_{t=1}^K V^{µ×ν^t}_1(s_1) in two-player zero-sum games. In fact, [33] has shown that achieving low regret under Definition 1 is exponentially hard in tabular cases when the opponent's policies are not revealed (see Appendix E.2 for details). Therefore, in this paper we let J_h = {b_h, ν_h}, just like [33], and call this information structure the policy revealing setting. That said, even in the policy revealing setting, the challenge of nonstationarity still exists because the opponent's policy can be adversarial and is only revealed after the player plays a policy. Thus from the perspective of the player, the transition kernel P^ν_h(·|s, a) := E_{b∼ν_h(s)} P_h(·|s, a, b) still changes in an unpredictable way across episodes. In addition, the problem of how to balance exploration and exploitation with general function approximation also remains due to the unknown transition probability. In this section we propose DORIS, an algorithm that is capable of handling both of these challenges and achieving a √K regret upper bound in the policy revealing setting.

Remark 2. When the opponent's policy is not revealed but changes slowly, we can infer the opponent's policy approximately via the procedures in [39, 46], which can be viewed as an approximate policy revealing condition in practice.
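The nonstationarity discussed above is easy to see numerically: the player-side kernel P^ν_h(·|s, a) = E_{b∼ν_h(s)} P_h(·|s, a, b) changes whenever the opponent's policy ν changes. The shapes and random kernels below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, B = 4, 2, 3
P_h = rng.dirichlet(np.ones(S), size=(S, A, B))   # P_h[s, a, b] in Delta(S)

def induced_kernel(P_h, nu_h):
    """Marginalize out the opponent: P^nu_h[s, a] = sum_b nu_h[s, b] * P_h[s, a, b]."""
    return np.einsum('sabn,sb->san', P_h, nu_h)

nu1 = rng.dirichlet(np.ones(B), size=S)           # two different opponent policies
nu2 = rng.dirichlet(np.ones(B), size=S)
K1, K2 = induced_kernel(P_h, nu1), induced_kernel(P_h, nu2)
print(np.abs(K1 - K2).max())                       # the player's effective kernel moved
```

Each induced kernel is still a valid transition function (rows sum to one), but from the player's viewpoint it drifts across episodes as ν^t varies.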

DORIS.

Intuitively, our algorithm is an actor-critic / mirror descent (Hedge) algorithm where each policy µ in Π is regarded as an expert, and the performance of expert µ at episode t is given by the value function V^{µ×ν^t}_1(s_1). We call it Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS). DORIS possesses three important features, whose details are shown in Algorithm 1:

• Hyperpolicy and Hedge: Motivated by the adversarial bandit literature [1, 20, 30] and the no-regret learning work [33], DORIS maintains a distribution p over the policies in Π, which we call the hyperpolicy, to combat the nonstationarity. The hyperpolicy is updated using Hedge, with the reward of each policy µ being an estimate of the value function V^{µ×ν^t}_1(s_1). This is equivalent to running a mirror ascent algorithm over the policy space Π with the gradient being V^{µ×ν^t}_1(s_1).

• Optimism: However, we do not have access to the exact value function since the transition probability is unknown, which forces us to deal with the exploration-exploitation tradeoff. Here we utilize the Optimism in the Face of Uncertainty principle [2, 28, 25, 27, 23] and choose our estimate V_t(µ) to be optimistic with respect to the true value V^{µ×ν^t}_1(s_1). In this way DORIS prefers policies with more uncertainty and thus encourages exploration in the Markov game.

• Optimistic policy evaluation with general function approximation: Finally, we need an efficient method to obtain such an optimistic estimate V_t(µ) with general function approximation. We propose OptLSPE to accomplish this task. In short, OptLSPE constructs a confidence set for the target action-value function Q^{µ×ν} based on the player's local information and chooses an optimistic estimate from the confidence set, as shown in Algorithm 2.
The construction of the confidence set utilizes the fact that Q^{µ×ν}_h satisfies the Bellman equation [41]:

Q^{µ×ν}_h(s, a, b) = (T^{µ,ν}_h Q^{µ×ν}_{h+1})(s, a, b) := r_h(s, a, b) + E_{s′∼P_h(·|s,a,b)}[Q^{µ×ν}_{h+1}(s′, µ, ν)],

where Q^{µ×ν}_{h+1}(s′, µ, ν) := E_{a′∼µ(·|s′), b′∼ν(·|s′)}[Q^{µ×ν}_{h+1}(s′, a′, b′)]. We call T^{µ,ν}_h the Bellman operator induced by µ × ν at the h-th step. The construction rule of B_D(µ, ν) is then based on least-squares policy evaluation with slackness β:

B_D(µ, ν) ← { f ∈ F : L_D(f_h, f_{h+1}, µ, ν) ≤ inf_{g∈G} L_D(g_h, f_{h+1}, µ, ν) + β, ∀h ∈ [H] },

where L_D is the empirical Bellman residual on D:

L_D(ξ_h, ζ_{h+1}, µ, ν) = Σ_{(s_h,a_h,b_h,r_h,s_{h+1})∈D} [ξ_h(s_h, a_h, b_h) − r_h − ζ_{h+1}(s_{h+1}, µ, ν)]².
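For intuition, the empirical Bellman residual L_D above can be written out directly for a tabular critic. The shapes and the helper below are illustrative assumptions; the confidence set B_D(µ, ν) then collects every f whose loss is within β of the least-squares minimum over G.

```python
import numpy as np

def bellman_residual_loss(xi_h, zeta_next, mu, nu, D):
    """Sum over D of [xi_h(s,a,b) - r - zeta_{h+1}(s', mu, nu)]^2, where xi_h and
    zeta_next are tabular Q-estimates of shape (S, A, B) and mu, nu are the
    next-step policies of shapes (S, A) and (S, B)."""
    total = 0.0
    for s, a, b, r, s_next in D:
        # zeta_{h+1}(s', mu, nu) = E_{a'~mu(s'), b'~nu(s')} zeta_{h+1}(s', a', b')
        backup = mu[s_next] @ zeta_next[s_next] @ nu[s_next]
        total += (xi_h[s, a, b] - r - backup) ** 2
    return total

rng = np.random.default_rng(3)
S, A, B = 3, 2, 2
xi = rng.uniform(0, 1, size=(S, A, B))
zeta = rng.uniform(0, 1, size=(S, A, B))
mu = rng.dirichlet(np.ones(A), size=S)
nu = rng.dirichlet(np.ones(B), size=S)
D = [(0, 1, 0, 0.5, 2), (1, 0, 1, 0.2, 0)]     # transitions (s, a, b, r, s')
print(bellman_residual_loss(xi, zeta, mu, nu, D))
```

Note that the squared-residual loss is nonnegative and vanishes on an empty dataset, matching the definition of L_D as a sum over D.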

Algorithm 1 DORIS

Input: learning rate η, confidence parameter β.
Initialize p_1 ∈ ∆_Π to be uniform over Π.
for t = 1, · · · , K do
  Collect samples: the player samples µ^t from p_t, runs µ^t against the opponent, and collects D_t = {s^t_1, a^t_1, b^t_1, r^t_1, · · · , s^t_{H+1}}.

Update policy distribution:

  The opponent reveals its policy ν^t to the player.
  V_t(µ) ← OptLSPE(µ, ν^t, D_{1:t−1}, F, G, β) for all µ ∈ Π.
  p_{t+1}(µ) ∝ p_t(µ) · exp(η · V_t(µ)) for all µ ∈ Π.
end for

Algorithm 2 OptLSPE(µ, ν, D, F, G, β)

Construct B_D(µ, ν) based on D via (3).
Select V ← max_{f∈B_D(µ,ν)} f_1(s_1, µ, ν).
return V.

Decentralized algorithm. Here we want to highlight that DORIS is a decentralized algorithm: the player can run DORIS based only on its local information, i.e., {s_h, a_h, J_h, r_h}, and we do not make any assumptions on the policies of the opponent. We also discuss the computational complexity of DORIS in Appendix E.3.

Comparison with OPMD [33]. The hyperpolicy and Hedge components of DORIS indeed follow OPMD, proposed by [33], which is an algorithm for no-regret learning in tabular MGs. The novelty of DORIS lies in the new policy evaluation algorithm, specially designed for the policy revealing setting, that can tackle general function approximation. We also propose new techniques to analyze the performance of DORIS, which is more involved than in tabular cases. See Appendix E.1 for a more detailed comparison.
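The hyperpolicy update in Algorithm 1 is a multiplicative-weights step once the optimistic values are available. The sketch below stubs out OptLSPE with a placeholder returning hypothetical value estimates (an assumption, since the real subroutine requires the confidence-set optimization); only the Hedge update itself follows the algorithm, computed in log space for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(4)
n_policies, K, V_max = 8, 50, 1.0
eta = np.sqrt(np.log(n_policies) / (K * V_max**2))   # a standard Hedge learning rate

def opt_lspe_stub(t):
    """Placeholder for OptLSPE: one optimistic estimate V_t(mu) per policy."""
    return rng.uniform(0, V_max, size=n_policies)

p = np.full(n_policies, 1.0 / n_policies)       # uniform hyperpolicy p_1 over Pi
for t in range(K):
    mu_t = rng.choice(n_policies, p=p)          # sample the episode's policy mu^t
    V_t = opt_lspe_stub(t)                      # values after nu^t is revealed
    logits = np.log(p) + eta * V_t              # p_{t+1}(mu) \propto p_t(mu) exp(eta V_t(mu))
    p = np.exp(logits - logits.max())
    p /= p.sum()

print(p.sum())                                   # p_{K+1} is still a distribution
```

Every policy keeps strictly positive mass under this update, which is what lets Hedge recover quickly when the adversarial opponent shifts.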

3.1. DORIS IN SELF-PLAY SETTING

Apart from the decentralized learning setting with a possibly adversarial opponent, we are also interested in the self-play setting, where we control all the agents and need to find an equilibrium of the n-agent general-sum Markov game. Inspired by the existing relationships between no-regret learning and CCEs in matrix games [17, 6, 10], a natural idea is to simply let all agents run DORIS independently. To achieve this, we assume each agent i is given a value function class F_i = F_{1,i} × · · · × F_{H,i} and an auxiliary function class G_i = G_{1,i} × · · · × G_{H,i} as in DORIS, and runs DORIS with learning rate η_i and confidence parameter β_i by viewing the other agents as its opponent. Suppose the policies played by agent i during the K episodes are {µ^t_i}_{t=1}^K; then we output the final joint policy as a uniform mixture: π ∼ Unif({∏_{i∈[n]} µ^1_i, · · · , ∏_{i∈[n]} µ^K_i}). See Algorithm 3 for more details.

Remark 3. Algorithm 3 is also a decentralized algorithm since every agent runs its local algorithm independently without coordination. The only step that requires centralized control is the output process, where all the agents need to share the same iteration index, which is also required in existing decentralized algorithms [35, 26].

4. THEORETICAL GUARANTEES

4.1. BELLMAN EVALUATION ELUDER DIMENSION

For any h ∈ [H], define (I − T^{Π,Π′}_h)F to be the Bellman residuals induced by the policies in Π and Π′:

(I − T^{Π,Π′}_h)F := {f_h − T^{µ,ν}_h f_{h+1} : f ∈ F, µ ∈ Π, ν ∈ Π′}.

Then the Bellman Evaluation Eluder (BEE) dimension is the DE dimension of the Bellman residuals induced by the policy classes Π and Π′ on the function class F:

Definition 6. The ε-Bellman Evaluation Eluder dimension of function class F on distribution family Q with respect to the policy class Π × Π′ is defined as dim_BEE(F, ε, Π, Π′, Q) := max_{h∈[H]} dim_DE((I − T^{Π,Π′}_h)F, Q_h, ε).

The BEE dimension is able to capture the generalization error of evaluating the value function V^{µ×ν} with µ ∈ Π, ν ∈ Π′, which is one of the most essential tasks in decentralized policy space optimization, as shown in DORIS. Similar to [25, 27, 23], we mainly consider two distribution families for Q:

• Q_1 = {Q_{1,h}}_{h∈[H]}: the collection of all probability measures over S × A × B at each step generated by executing some (µ, ν) ∈ Π × Π′.

• Q_2 = {Q_{2,h}}_{h∈[H]}: the collection of all probability measures that put measure 1 on a single state-action pair (s, a, b) at each step.

We also use dim_BEE(F, ε, Π, Π′) to denote min{dim_BEE(F, ε, Π, Π′, Q_1), dim_BEE(F, ε, Π, Π′, Q_2)} for simplicity in the following discussion.

Relation with Eluder dimension. To illustrate the generality of the BEE dimension, we show that all function classes with low Eluder dimension also have low BEE dimension, as long as completeness (Assumption 3) is satisfied. More specifically, we have the following proposition, whose proof is deferred to Appendix F:

Proposition 1. Assume F satisfies completeness, i.e., T^{µ,ν}_h f_{h+1} ∈ F_h for all f ∈ F, µ ∈ Π, ν ∈ Π′, h ∈ [H]. Then for all ε > 0, we have dim_BEE(F, ε, Π, Π′) ≤ max_{h∈[H]} dim_E(F_h, ε).

Inequality (4) shows that the BEE dimension is always upper bounded by the Eluder dimension when completeness is satisfied. With Proposition 1, Appendix H validates that kernel Markov games (including tabular Markov games and linear Markov games) and generalized linear complete models all have small Bellman Evaluation Eluder dimension.
Furthermore, in this case the upper bound on the BEE dimension does not depend on Π and Π′, which is a desirable property when Π and Π′ are large.

Comparison with multi-agent BE dimension. Jin et al. [27] and Huang et al. [23] also propose a variant of the BE dimension for Markov games, called the multi-agent BE dimension. However, this complexity measure and its analysis techniques are not applicable in our setting, because DORIS differs from their algorithms in terms of confidence set construction and optimism. See Appendix E.1 for details.

4.2. DECENTRALIZED POLICY LEARNING REGRET

Next we present the regret analysis for DORIS in the decentralized policy learning setting. For simplicity, we focus on finite Π here:

Assumption 1 (Finite player's policy class). We assume Π is finite.

We consider two cases separately: the oblivious opponent (i.e., the opponent determines {ν^t}_{t=1}^K secretly before the game starts) and the adaptive opponent (i.e., the opponent determines its policy adaptively as the game goes on). The difference between these two cases lies in the policy evaluation step in DORIS. The policy ν^t of an oblivious opponent does not depend on the collected dataset D_{1:t−1}, and thus V^{µ×ν^t} is easier to evaluate. For an adaptive opponent, however, ν^t is chosen adaptively based on D_{1:t−1}, and we need to introduce an additional union bound over the opponent's policy class when analyzing the evaluation error of V^{µ×ν^t}.

Oblivious opponent. To attain accurate value function estimation, we first need to introduce two standard assumptions on F and G, realizability and generalized completeness [25, 27]. Here realizability means that all the true action-value functions belong to F, and generalized completeness means that G contains all the results of applying the Bellman operators to the functions in F.

Assumption 2 (Realizability and generalized completeness). Assume that for any

h ∈ [H], µ ∈ Π, ν ∈ {ν^1, · · · , ν^K}, f_{h+1} ∈ F_{h+1}, we have Q^{µ×ν}_h ∈ F_h and T^{µ,ν}_h f_{h+1} ∈ G_h.

Remark 4. Some existing works [57, 23] assume the completeness assumption, which can also be generalized to our setting:

Assumption 3. Assume for any h ∈ [H], µ ∈ Π, ν ∈ Π′, f_{h+1} ∈ F_{h+1}, we have T^{µ,ν}_h f_{h+1} ∈ F_h.

We want to clarify that Assumption 3 is stronger than the generalized completeness in Assumption 2, since if Assumption 3 holds, we can simply let G = F to satisfy generalized completeness. Appendix G shows that realizability and generalized completeness are satisfied in many examples, including tabular MGs, linear MGs and kernel MGs. With the above assumptions, we have Theorem 1 to characterize the regret of DORIS when the opponent is oblivious, whose proof sketch is deferred to Appendix I. To simplify writing, we use the following notations in Theorem 1:

d_BEE := dim_BEE(F, 1/K, Π, Π′),  N_cov := N_{F∪G}(V_max/K) · KH.

Theorem 1 (Regret against an oblivious opponent). Under Assumptions 1 and 2, there exists an absolute constant c such that for any δ ∈ (0, 1], K ∈ N, if we choose β = cV_max² log(N_cov|Π|/δ) and η = √(log|Π|/(KV_max²)) in DORIS, then with probability at least 1 − δ, we have:

Regret(K) ≤ O(HV_max √(K · d_BEE · log(N_cov|Π|/δ))).

The √K regret bound in Theorem 1 is consistent with the rate in the tabular case [33] and suggests that the uniform mixture of the output policies {µ^t}_{t=1}^K is an ε-approximate best policy in hindsight when K = O(1/ε²). The complexity of the problem affects the regret through the covering number and the BEE dimension, implying that the BEE dimension indeed captures the essence of this problem. Further, in the oblivious setting, the regret bound in (5) does not depend on Π′ directly (the upper bound on the BEE dimension is also independent of Π′ in some special cases, as shown in Proposition 1), and thus Theorem 1 can still hold when Π′ is infinite, as long as Assumption 2 is satisfied.

Adaptive opponent.
In the adaptive setting, we first need to modify Assumption 2 to hold for all ν ∈ Π′, since ν^t depends on the collected data (recall that Π is the player's policy class and Π′ is the opponent's policy class):

Assumption 4 (Uniform realizability and generalized completeness). Assume that for any h ∈ [H], µ ∈ Π, ν ∈ Π′, f_{h+1} ∈ F_{h+1}, we have Q^{µ×ν}_h ∈ F_h and T^{µ,ν}_h f_{h+1} ∈ G_h.

Further, as mentioned before, we need to introduce a union bound over the policies in Π′ in our analysis, and thus we also assume Π′ to be finite for simplicity.

Assumption 5 (Finite opponent's policy class). We assume Π′ is finite.

Remark 5. It is straightforward to generalize our analysis to infinite Π′ by replacing |Π′| with the covering number of Π′. However, the regret still depends on the size of Π′, which is not the case in the tabular setting [33]. This dependency originates from our model-free type of policy evaluation algorithm (Algorithm 2) and is inevitable for DORIS in general. That said, when the Markov game has special structure (e.g., the Markov games in Appendix C and D), we can avoid this dependency.

With the above assumptions, we have Theorem 2 to show that DORIS can still achieve sublinear regret in the adaptive setting, whose proof is deferred to Appendix I:

Theorem 2 (Regret against an adaptive opponent). Under Assumptions 1, 4 and 5, there exists an absolute constant c such that for any δ ∈ (0, 1], K ∈ N, choosing β = cV_max² log(N_cov|Π||Π′|/δ) and η = √(log|Π|/(KV_max²)) in DORIS, with probability at least 1 − δ we have:

Regret(K) ≤ O(HV_max √(K · d_BEE · log(N_cov|Π||Π′|/δ))).

We can see that in the adaptive setting the regret also scales with √K, implying that DORIS can still find an ε-approximate best policy in hindsight with O(1/ε²) episodes even when the opponent is adaptive. Compared to Theorem 1, Theorem 2 has an additional log|Π′| term in the upper bound (6), which comes from the union bound over Π′ in the analysis.

Intuitions on the regret bounds.
The regrets in Theorem 1 and Theorem 2 can be decomposed into two parts: the online learning error incurred by Hedge and the cumulative value function estimation error incurred by OptLSPE. From the online learning literature [20], the online learning error is O(V_max √(K log|Π|)), obtained by viewing the policies in Π as experts and V_t(µ) as the reward function of expert µ. For the estimation error, we utilize the BEE dimension to bridge V_t(µ^t) − V^{π^t}_1(s_1) with the function's empirical Bellman residuals on D_{1:t−1}, which incurs an additional O(V_max √(K · d_BEE)) in the results. Our technical contribution mainly lies in bounding the cumulative value function estimation error with the newly proposed BEE dimension, which is different from [25], where the focus is on bounding the cumulative distance from the optimal value function.

Comparison with existing works. There have been works studying decentralized policy learning. However, most of them [7, 53, 49, 27, 23] only compete against the Nash value in two-player zero-sum games, which is a much weaker baseline than ours. [33] achieves √K regret under Definition 1, but it is restricted to tabular cases, and its bound becomes vacuous in more complicated cases such as the linear MGs and kernel MGs in Appendix H. DORIS is the first algorithm that achieves √K regret under Definition 1 with general function approximation and is capable of tackling all models with low BEE dimension, including linear MGs, kernel MGs and generalized linear complete models. More details are deferred to Appendix E.1.
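A sketch of the decomposition described above, using optimism (V_t(µ) ≥ V^{µ×ν^t}_1(s_1) for all µ) in the first inequality and suppressing logarithmic and martingale concentration terms:

```latex
\mathrm{Regret}(K)
  = \max_{\mu\in\Pi}\sum_{t=1}^{K} V^{\mu\times\nu^t}_1(s_1) - \sum_{t=1}^{K} V^{\pi^t}_1(s_1)
  \le \underbrace{\max_{\mu\in\Pi}\sum_{t=1}^{K} V_t(\mu) - \sum_{t=1}^{K} V_t(\mu^t)}_{\text{Hedge: } O(V_{\max}\sqrt{K\log|\Pi|})}
   \;+\; \underbrace{\sum_{t=1}^{K}\Big( V_t(\mu^t) - V^{\pi^t}_1(s_1)\Big)}_{\text{OptLSPE: } O(V_{\max}\sqrt{K\, d_{\mathrm{BEE}}})}
```

The first bracket is the expert-advice regret over Π; the second is the cumulative optimistic evaluation error, which the BEE dimension controls.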

4.3. SELF-PLAY SAMPLE COMPLEXITY

Our previous discussion assumes the opponent is arbitrary or even adversarial. A natural question is whether there are additional guarantees if the player and the opponent run DORIS simultaneously, which is exactly the self-play setting. The following corollary answers this question affirmatively and shows that Algorithm 3 can find an approximate CCE π efficiently:

Corollary 1. Suppose Assumptions 1 and 4 hold for all agents i and their corresponding F_i, G_i, Π_i, Π_{−i}. Then for any δ ∈ (0, 1] and ε > 0, if we choose

K ≥ O( H² V²_max · max_{i∈[n]} d_BEE,i · ( log N_cov,i + Σ_{j=1}^n log|Π_j| + log(n/δ) ) / ε² ),    (7)

where d_BEE,i and N_cov,i are defined respectively as d_BEE,i := dim_BEE(F_i, 1/K, Π_i, Π_{−i}) and N_cov,i := N_{F_i∪G_i}(V_max/K) · KH, and set β_i = c V²_max log(N_cov,i |Π_i| |Π_{−i}| n / δ) and η_i = √(log|Π_i| / (K V²_max)), then with probability at least 1 − δ, π is an ε-approximate CCE.

The proof is deferred to Appendix K. Corollary 1 shows that if we run DORIS independently for each agent, we are able to find an ε-approximate CCE with O(1/ε²) samples. This can be regarded as a counterpart in Markov games to the classic connection between no-regret learning algorithms and equilibria in matrix games. Note that this guarantee would not hold for an algorithm that only achieves low regret with respect to the Nash values.

Avoiding the curse of multi-agents. The sample complexity in (7) avoids exponential scaling with the number of agents n and only scales with max_{i∈[n]} d_BEE,i, max_{i∈[n]} N_cov,i and Σ_{j=1}^n log|Π_j|, suggesting that Algorithm 3 statistically escapes the curse-of-multi-agents problem in the literature [26]. Nevertheless, the input dimension of the functions in F_i and G_i may scale with the number of agents, leading to the computational inefficiency of OptLSPE. We remark that finding computationally efficient algorithms is beyond the scope of this paper and leave it to future work.

Comparison with existing algorithms.
There have been many works studying how to find equilibria in Markov games. However, most of them focus on centralized algorithms for two-player zero-sum games [3, 56, 27, 23] rather than decentralized algorithms. For decentralized algorithms, the existing literature mainly handles potential Markov games [60, 32, 12] and two-player zero-sum games [11, 44, 54]. [35, 26] are able to tackle decentralized multi-agent general-sum Markov games, but their algorithms are restricted to tabular cases. Algorithm 3 can deal with more general cases with function approximation and policy classes in multi-agent general-sum games. Furthermore, compared to the above works, DORIS has the additional advantage of robustness to adversaries, since all the benign agents can exploit the opponents and achieve no-regret learning.

Extensions. Although Theorem 1, Theorem 2 and Corollary 1 are aimed at Markov games, DORIS can be applied to a much larger scope of problems. Two such problems are finding the optimal policy in constrained MDPs and in vector-valued MDPs. We investigate these two problems in Appendix C and D, where we demonstrate how to convert such problems into Markov games with a fictitious opponent by duality, so that DORIS is ready to use.

A DORIS IN SELF-PLAY SETTING

Here we present the pseudocode of Algorithm 3.

Algorithm 3 DORIS in self-play setting

Input: learning rates {η_i}_{i=1}^n, confidence parameters {β_i}_{i=1}^n.
Initialize p^1_i ∈ ∆_{|Π_i|} to be uniform over Π_i for all i ∈ [n].
for t = 1, …, K do
    Collect samples:
        Agent i samples µ^t_i from p^t_i.
        Run µ^t = ∏_{i=1}^n µ^t_i and collect D_{t,i} = {s^t_1, a^t_1, r^t_{1,i}, …, s^t_{H+1}} for each agent i.
    Update policy distribution:
        All agents reveal their policies µ^t_i.
        V̂^t_i(µ_i) ← OptLSPE(µ_i, µ^t_{−i}, D_{1:t−1,i}, F_i, G_i, β_i), ∀µ_i ∈ Π_i, i ∈ [n].
        p^{t+1}_i(µ_i) ∝ p^t_i(µ_i) · exp(η_i · V̂^t_i(µ_i)), ∀µ_i ∈ Π_i, i ∈ [n].
end for
Output: π ∼ Unif({∏_{i∈[n]} µ^1_i, …, ∏_{i∈[n]} µ^K_i}).
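To illustrate the self-play dynamics behind Algorithm 3 in the simplest possible case, the following toy Python sketch runs independent Hedge updates in a 2×2 zero-sum matrix game (matching pennies), replacing OptLSPE with exact expected payoffs. The payoff matrix, step size and starting point are illustrative assumptions; in line with the no-regret-to-equilibrium connection, the time-averaged play approaches the unique equilibrium (1/2, 1/2).

```python
import math

def hedge(p, gains, eta):
    """Exponential-weights update over a finite set of pure strategies."""
    w = [pi * math.exp(eta * g) for pi, g in zip(p, gains)]
    s = sum(w)
    return [x / s for x in w]

# Matching pennies: payoff to player 1 is A[i][j], to player 2 is -A[i][j].
A = [[1.0, -1.0], [-1.0, 1.0]]
K = 5000
eta = math.sqrt(math.log(2) / K)
p, q = [0.6, 0.4], [0.5, 0.5]        # slightly perturbed start
avg_p, avg_q = [0.0, 0.0], [0.0, 0.0]
for t in range(K):
    # Exact expected payoffs stand in for the OptLSPE value estimates.
    g1 = [sum(q[j] * A[i][j] for j in range(2)) for i in range(2)]
    g2 = [-sum(p[i] * A[i][j] for i in range(2)) for j in range(2)]
    p, q = hedge(p, g1, eta), hedge(q, g2, eta)
    avg_p = [a + pi / K for a, pi in zip(avg_p, p)]
    avg_q = [a + qi / K for a, qi in zip(avg_q, q)]
```

The per-iterate strategies cycle, but the mixture over all K rounds (the analogue of the output π of Algorithm 3) concentrates near the equilibrium.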

B RELATED WORKS

In this section we supplement the related literature.

Decentralized learning with an adversarial opponent. There have been a few works studying decentralized policy learning in the presence of a possibly adversarial opponent. [7] proposes R-max, which attains an average game value close to the Nash value in tabular MGs. More recently, [53, 49] improve the regret bounds in tabular cases and [27, 23] extend the results to the general function approximation setting. However, these works only compete against the Nash value of the game and are unable to exploit the opponent. A more closely related paper is [33], which develops a provably efficient algorithm that achieves sublinear regret against the best fixed policy in hindsight, but their results are limited to the tabular case. Our work extends the results in [33] to the setting with general function approximation, which requires novel technical analysis.

Finding equilibria in self-play Markov games. Our work is closely related to the recent literature on finding equilibria in Markov games via reinforcement learning. Most of the existing works focus on two-player zero-sum games and consider centralized algorithms with unknown model dynamics. For example, [53, 3] utilize optimism to tackle the exploration-exploitation tradeoff and find Nash equilibria in tabular cases, and [56, 27, 23] extend the results to the linear and general function approximation settings. Furthermore, under the decentralized setting with well-explored data, [11, 60, 44, 54, 32, 12] utilize independent policy gradient algorithms to deal with potential Markov games and two-player zero-sum games. Meanwhile, under the online setting, [4, 35, 26] design algorithms named V-learning, which are able to find CCE in multi-agent general-sum games; however, their results are limited to the tabular case.

Constrained Markov decision process.
[15, 13] propose a series of primal-dual algorithms for CMDPs which achieve √K bounds on regret and constraint violation in the tabular and linear approximation cases. [34] reduces the constraint violation to O(1) by adding slackness to the algorithm and achieves zero violation when a strictly safe policy is known; [55] further removes this requirement at the price of worsened regret. Nevertheless, these improvements are only established in the tabular case.

Approachability for vector-valued Markov decision processes. [36] first introduces the approachability task for VMDPs but does not provide an algorithm with polynomial sample complexity. [58] then proposes a couple of primal-dual algorithms for this task and achieves O(ε^{−2}) sample complexity in the tabular case. More recently, [37] utilizes reward-free reinforcement learning to tackle the problem and studies both the tabular and linear approximation cases, achieving roughly the same sample complexity as [58].

C EXTENSION: CONSTRAINED MARKOV DECISION PROCESS

Although DORIS is designed to solve Markov games, there are many other problems that DORIS can tackle with small adaptations. In this section we investigate an important practical scenario, the constrained Markov decision process (CMDP). By converting CMDPs into a maximin problem via a Lagrangian multiplier, we can view them as zero-sum Markov games and apply DORIS readily.

Constrained Markov decision process. Consider the CMDP [13] M_CMDP = (S, A, {P_h}_{h=1}^H, {r_h}_{h=1}^H, {g_h}_{h=1}^H, H), where S is the state space, A is the action space, H is the length of each episode, P_h : S × A → ∆(S) is the transition function at the h-th step, r_h : S × A → R_+ is the reward function and g_h : S × A → [0, 1] is the utility function at the h-th step. We assume the reward r_h is also bounded in [0, 1] for simplicity, and thus V_max = H. Given a policy µ = {µ_h : S → ∆(A)}_{h∈[H]}, we define the value function V^µ_{r,h} and action-value function Q^µ_{r,h} with respect to the reward function r as follows:

V^µ_{r,h}(s) = E_µ[ Σ_{l=h}^H r_l(s_l, a_l) | s_h = s ],  Q^µ_{r,h}(s, a) = E_µ[ Σ_{l=h}^H r_l(s_l, a_l) | s_h = s, a_h = a ].

The value function V^µ_{g,h} and action-value function Q^µ_{g,h} with respect to the utility function g are defined similarly. Another related concept is the state-action visitation distribution, defined as d^µ_h(s, a) = Pr_µ[(s_h, a_h) = (s, a)], where Pr_µ denotes the distribution of the trajectory induced by executing policy µ in M_CMDP.

Learning objective. In CMDPs, the player aims to solve a constrained problem where the objective is the expected total reward and the constraint is on the expected total utility:

Problem 1: Optimization problem of CMDP
max_{µ∈Π} V^µ_{r,1}(s_1) subject to V^µ_{g,1}(s_1) ≥ b,    (8)

where b ∈ (0, H] to avoid triviality.
Denote the optimal policy for (8) by µ*_CMDP; the regret is then defined as the performance gap with respect to µ*_CMDP:

Regret(K) = Σ_{t=1}^K ( V^{µ*_CMDP}_{r,1}(s_1) − V^{µ^t}_{r,1}(s_1) ).    (9)

However, since the utility information is only revealed after a policy is executed, it is impossible to guarantee that every deployed policy satisfies the constraint. Therefore, like [13], we allow each policy to violate the constraint in each episode and focus on minimizing the total constraint violation over K episodes:

Violation(K) = Σ_{t=1}^K [ b − V^{µ^t}_{g,1}(s_1) ]_+.    (10)

Achieving sublinear violation in (10) implies that if we sample a policy uniformly from {µ^t}_{t=1}^K, its constraint violation can be made arbitrarily small for large enough K. Therefore, an algorithm that simultaneously achieves sublinear regret in (9) and sublinear violation in (10) finds a good approximation to µ*_CMDP.
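The two performance measures above are simple bookkeeping over episodes; the following Python sketch computes them from per-episode values (all numbers are hypothetical).

```python
def cmdp_metrics(v_r_star, v_r, v_g, b):
    """Regret (9) and cumulative constraint violation (10).

    v_r_star: the optimal reward value V^{mu*}_{r,1}(s_1);
    v_r, v_g: per-episode reward/utility values of the executed policies.
    """
    regret = sum(v_r_star - v for v in v_r)
    violation = sum(max(b - v, 0.0) for v in v_g)  # [.]_+ keeps only true violations
    return regret, violation

regret, violation = cmdp_metrics(
    v_r_star=0.9, v_r=[0.5, 0.7, 0.9], v_g=[0.4, 0.6, 0.7], b=0.5)
# regret = 0.4 + 0.2 + 0.0; only episode 1 violates the constraint: [0.5 - 0.4]_+
```

Note that because of the clipping [·]_+, episodes that satisfy the constraint with slack cannot cancel out episodes that violate it.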

C.1 ALGORITHM: DORIS-C

To solve Problem 1 with DORIS, we first need to convert it into a Markov game. A natural idea is to apply a Lagrangian multiplier Y ∈ R_+ to Problem 1, which brings about the equivalent maximin problem below:

max_{µ∈Π} min_{Y≥0} L_CMDP(µ, Y) := V^µ_{r,1}(s_1) + Y (V^µ_{g,1}(s_1) − b).    (11)

Although Problem 1 is non-concave in µ, existing works indicate that strong duality still holds for Problem 1 when the policy class is described by a good parametrization [40]. Therefore, here we assume strong duality holds; it is straightforward to generalize our analysis to the case where there exists a duality gap.

Assumption 6 (Strong duality). Strong duality holds for (11), and we denote the optimal dual variable by Y*.

Assumption 7 (Slater condition). There exist λ_sla > 0 and a policy µ̄ ∈ Π such that V^{µ̄}_{g,1}(s_1) ≥ b + λ_sla.

The following lemma shows that Assumption 7 implies a bounded optimal dual variable; its proof is deferred to Appendix L.1:

Lemma 1. Suppose Assumptions 6 and 7 hold. Then 0 ≤ Y* ≤ H/λ_sla.

Now we are ready to adapt DORIS into a primal-dual algorithm that solves Problem 1. Notice that the maximin problem (11) can be viewed as a zero-sum Markov game where the player's policy is µ and the player's reward function is r_h(s, a) + Y g_h(s, a). The opponent's action is Y ∈ R_+, which remains the same throughout a single episode. With this formulation, we can simply run DORIS on the player, assuming the player is given function classes {F^r, G^r} and {F^g, G^g} to approximate Q^µ_{r,h} and Q^µ_{g,h} respectively. Meanwhile, we run online projected gradient descent on the opponent so that its action Y captures the total violation so far. This new algorithm, called DORIS-C, is shown in Algorithm 4 and consists of the three steps below in each iteration. For the policy evaluation task in the second step, DORIS-C runs a single-agent version of OptLSPE to estimate V^µ_{r,1}(s_1) and V^µ_{g,1}(s_1) separately, which is essential for DORIS-C to deal with the infinity of the opponent's policy class, i.e., R_+.

• The player plays a policy µ^t sampled from its hyperpolicy p^t and collects a trajectory.
• The player runs OptLSPE-C to obtain optimistic value function estimations V̂^t_r(µ), V̂^t_g(µ) for all µ ∈ Π and updates the hyperpolicy using Hedge with reward V̂^t_r(µ) + Y^t V̂^t_g(µ). The construction rule for B_D(µ) is still based on relaxed least-squares policy evaluation:

B_D(µ) ← {f ∈ F : L_D(f_h, f_{h+1}, µ) ≤ inf_{g∈G} L_D(g_h, f_{h+1}, µ) + β, ∀h ∈ [H]},    (13)

where L_D is the empirical Bellman residual on D:

L_D(ξ_h, ζ_{h+1}, µ) = Σ_{(s_h, a_h, x_h, s_{h+1}) ∈ D} [ ξ_h(s_h, a_h) − x_h − ζ_{h+1}(s_{h+1}, µ) ]².

• The dual variable is updated using online projected gradient descent.
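The confidence-set construction in the second step can be caricatured in a few lines of Python. The sketch below uses a one-step problem (H = 1, so there is no next-state term and the empirical Bellman residual reduces to a squared regression loss) and a tiny finite class of constant value functions; all numbers are hypothetical, and this is an illustration of the relaxed least-squares idea rather than the paper's implementation.

```python
# One-step caricature of the confidence set (13): keep every candidate whose
# empirical loss is within beta of the best fit, then return the largest value
# in the set (global optimism, as in OptLSPE-C).
def optimistic_estimate(F, data, beta):
    def loss(f):
        return sum((f - r) ** 2 for r in data)
    best = min(loss(f) for f in F)
    confidence_set = [f for f in F if loss(f) <= best + beta]
    return max(confidence_set)

data = [0.9, 1.1, 1.0, 1.05]   # observed rewards for the single (s_1, a) pair
F = [0.5, 0.9, 1.0, 1.1, 1.5]  # candidate constant value functions
```

A tighter β shrinks the confidence set toward the least-squares fit, while a looser β admits more candidates and therefore yields a more optimistic estimate.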

Algorithm 4 DORIS-C

Input: learning rates η, α, confidence parameters β_r, β_g, projection length χ.
Initialize p^1 ∈ R^{|Π|} to be uniform over Π, Y^1 ← 0.
for t = 1, …, K do
    Collect samples:
        The player samples µ^t from p^t. Run µ^t and collect D^r_t = {s^t_1, a^t_1, r^t_1, …, s^t_{H+1}}, D^g_t = {s^t_1, a^t_1, g^t_1, …, s^t_{H+1}}.
    Update policy distribution:
        V̂^t_r(µ) ← OptLSPE-C(µ, D^r_{1:t−1}, F^r, G^r, β_r), ∀µ ∈ Π.
        V̂^t_g(µ) ← OptLSPE-C(µ, D^g_{1:t−1}, F^g, G^g, β_g), ∀µ ∈ Π.
        p^{t+1}(µ) ∝ p^t(µ) · exp(η · (V̂^t_r(µ) + Y^t V̂^t_g(µ))), ∀µ ∈ Π.
    Update dual variable:
        Y^{t+1} ← Proj_{[0, χ]}(Y^t + α(b − V̂^t_g(µ^t))).
end for

Algorithm 5 OptLSPE-C(µ, D, F, G, β)

Construct B_D(µ) based on D via (13).
Select V̂ ← max_{f∈B_D(µ)} f(s_1, µ).
return V̂.
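The dual update in Algorithm 4 is a one-dimensional projected gradient step: the multiplier grows when the constraint is violated and shrinks otherwise. A minimal Python sketch (with hypothetical values):

```python
def dual_step(Y, v_g_hat, b, alpha, chi):
    """Online projected gradient step for the Lagrange multiplier in DORIS-C."""
    Y = Y + alpha * (b - v_g_hat)  # move up when the constraint is violated (V_g < b)
    return min(max(Y, 0.0), chi)   # projection onto [0, chi]

Y = 0.0
# Constraint violated (V_g = 0.3 < b = 0.5): the multiplier grows,
# increasing the weight of the utility term in the Hedge reward.
Y = dual_step(Y, v_g_hat=0.3, b=0.5, alpha=1.0, chi=2.0)
Y_after_violation = Y
# Constraint satisfied with slack (V_g = 0.9 > b): the multiplier is
# pushed back down and clipped at zero.
Y = dual_step(Y, v_g_hat=0.9, b=0.5, alpha=1.0, chi=2.0)
```

The projection interval [0, χ] with χ = 2H/λ_sla is what makes Lemma 1's bound on the optimal dual variable usable in the analysis.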

C.2 THEORETICAL GUARANTEES

Next we provide the regret and constraint violation bounds for DORIS-C. Here we also consider the case where Π is finite, i.e., Assumption 1 holds. Note that the opponent here is adaptive and its policy class is infinite, so Assumption 5 is violated. Fortunately, since the opponent only affects the reward function, the player can first estimate V^µ_{r,1}(s_1) and V^µ_{g,1}(s_1) separately and then use their weighted sum to approximate the target value function V^µ_{r,1}(s_1) + Y · V^µ_{g,1}(s_1). In this way, DORIS-C circumvents a union bound over Y and thus works even though the number of possible values of Y is infinite. We also need to introduce the realizability and generalized completeness assumptions on the function classes as before:

Assumption 8 (Realizability and generalized completeness in CMDP). Assume that for any h ∈ [H], µ ∈ Π, f^r_{h+1} ∈ F^r_{h+1}, f^g_{h+1} ∈ F^g_{h+1}, we have

Q^µ_{r,h} ∈ F^r_h,  Q^µ_{g,h} ∈ F^g_h,  T^{µ,r}_h f^r_{h+1} ∈ G^r_h,  T^{µ,g}_h f^g_{h+1} ∈ G^g_h.    (14)

Here T^{µ,r}_h is the Bellman operator at step h with respect to r:

(T^{µ,r}_h f_{h+1})(s, a) = r_h(s, a) + E_{s′∼P_h(·|s,a)}[f_{h+1}(s′, µ)], where f_{h+1}(s′, µ) = E_{a′∼µ(·|s′)}[f_{h+1}(s′, a′)],

and T^{µ,g}_h is defined similarly. (14) simply says that all action-value functions with respect to r (resp. g) belong to F^r (resp. F^g), and that G^r (resp. G^g) contains all results of applying the Bellman operator with respect to r (resp. g) to the functions in F^r (resp. F^g). In addition, as a simplified case of Definition 6, the BEE dimension for the single-agent setting can be defined as follows:

Definition 7. The single-agent ε-Bellman Evaluation Eluder dimension of function class F on distribution family Q with respect to the policy class Π and the reward function r is defined as

dim_BEE(F, ε, Π, r, Q) := max_{h∈[H]} dim_DE((I − T^{Π,r}_h)F, Q_h, ε),

where (I − T^{Π,r}_h)F := {f_h − T^{µ,r}_h f_{h+1} : f ∈ F, µ ∈ Π}.
We also let dim_BEE(F, ε, Π, r) denote min{dim_BEE(F, ε, Π, r, Q_1), dim_BEE(F, ε, Π, r, Q_2)} as before. dim_BEE(F, ε, Π, g, Q) and dim_BEE(F, ε, Π, g) are defined similarly, but with respect to the utility function g. Now we can present Theorem 3, which shows that DORIS-C achieves sublinear regret and constraint violation for Problem 1. We use the following notation to simplify writing:

d_BEE,r := dim_BEE(F^r, 1/K, Π, r),  N_cov,r := N_{F^r∪G^r}(H/K) · KH,
d_BEE,g := dim_BEE(F^g, 1/K, Π, g),  N_cov,g := N_{F^g∪G^g}(H/K) · KH.

Theorem 3. Under Assumptions 6, 7, 1 and 8, there exists an absolute constant c such that for any δ ∈ (0, 1] and K ∈ N, if we choose β_r = cH² log(N_cov,r |Π|/δ), β_g = cH² log(N_cov,g |Π|/δ), α = 1/√K, χ = 2H/λ_sla and η = √(log|Π| / (K (χ + 1)² H²)) in DORIS-C, then with probability at least 1 − δ, we have:

Regret(K) ≤ O( (H² + H²/λ_sla) √( K d_BEE,r log(N_cov,r |Π|/δ) ) ),    (15)
Violation(K) ≤ O( (H² + H/λ_sla) √( K ℓ_BEE ) ),    (16)

where ℓ_BEE = max{ d_BEE,r log(N_cov,r |Π|/δ), d_BEE,g log(N_cov,g |Π|/δ) }.

The bounds in (15) and (16) show that both the regret and the constraint violation of DORIS-C scale with √K. This implies that for any ε > 0, if µ̂ is sampled uniformly from {µ^t}_{t=1}^K and K ≥ O(1/ε²), then µ̂ is a near-optimal policy with high probability, in the sense that

V^{µ̂}_{r,1}(s_1) ≥ V^{µ*_CMDP}_{r,1}(s_1) − ε,  V^{µ̂}_{g,1}(s_1) ≥ b − ε.

In addition, compared to the results in Theorem 1 and Theorem 2, (15) and (16) have an extra term scaling with 1/λ_sla. This is because DORIS-C is a primal-dual algorithm and λ_sla characterizes the regularity of this constrained optimization problem. The proof of the regret bound is similar to those of Theorem 1 and Theorem 2, viewing V^µ_{r,1}(s_1) + Y V^µ_{g,1}(s_1) as the target value function and decomposing the regret into cumulative estimation error and online learning error. To bound the constraint violation, we utilize strong duality and the properties of online projected gradient descent.
See Appendix L for more details.

Comparison with existing algorithms. There has been a line of works studying exploration and exploitation in CMDPs. [15, 13] propose a series of algorithms which achieve √K bounds on regret and constraint violation. However, they focus on tabular cases or linear function approximation and do not consider policy classes, whereas DORIS-C can deal with nonlinear function approximation and policy classes. As an interesting follow-up, [34] reduces the constraint violation to O(1) by adding slackness to the algorithm and achieves zero violation when a strictly safe policy is known; [55] further removes this requirement at the price of worsened regret. However, these improvements are all limited to tabular cases, and we leave their general function approximation counterparts to future work.

D EXTENSION: VECTOR-VALUED MARKOV DECISION PROCESS

Another setting where DORIS can play a role is the approachability task for vector-valued Markov decision processes (VMDPs) [36, 58, 37]. Similar to the CMDP case, we convert it into a zero-sum Markov game, here via Fenchel duality, and then adapt DORIS to solve it.

Vector-valued Markov decision process. Consider the VMDP [58] M_VMDP = (S, A, {P_h}_{h=1}^H, r, H), where r = {r_h : S × A → [0, 1]^d}_{h=1}^H is a collection of d-dimensional reward functions and the remaining components are defined as in Appendix C. Given a policy µ ∈ Π, the corresponding d-dimensional value function V^µ_h : S → [0, H]^d and action-value function Q^µ_h : S × A → [0, H]^d are defined as:

V^µ_h(s) = E_µ[ Σ_{l=h}^H r_l(s_l, a_l) | s_h = s ],  Q^µ_h(s, a) = E_µ[ Σ_{l=h}^H r_l(s_l, a_l) | s_h = s, a_h = a ].

Learning objective. We study the approachability task [36] in VMDPs, where the player needs to learn a policy whose expected cumulative reward vector lies in a convex target set C. We consider the more general agnostic version [58, 37], where we do not assume such a policy exists and the player learns to minimize the Euclidean distance between the expected reward vector and the target set C:

Problem 2: Approachability for VMDP
min_{µ∈Π} dist(V^µ_1(s_1), C),

where dist(x, C) is the Euclidean distance between the point x and the set C. Approachability for VMDPs is a natural objective in multi-task reinforcement learning, where each dimension of the reward can be regarded as a task, and it is important in many practical domains such as robotics, autonomous vehicles and recommendation systems [58]. Finding the optimal policy for Problem 2 efficiently is therefore of great significance in modern reinforcement learning.

D.1 ALGORITHM: DORIS-V

To deal with Problem 2, we first convert it into a Markov game, as we did in Appendix C. By Fenchel duality of the distance function, Problem 2 is equivalent to the following minimax problem:

min_{µ∈Π} max_{θ∈B(1)} L_VMDP(µ, θ) := ⟨θ, V^µ_1(s_1)⟩ − max_{x∈C} ⟨θ, x⟩,

where B(r) denotes the d-dimensional Euclidean ball of radius r centered at the origin. Regarding µ as the player's policy and θ as the opponent, we can again view this minimax problem as a Markov game where the player's reward function is ⟨θ, r_h(s, a)⟩. In the general function approximation case, the player is given function classes F := {F^j_h}_{h,j=1}^{H,d} and G := {G^j_h}_{h,j=1}^{H,d} to approximate Q^µ_h (F^j_h and G^j_h are the j-th dimensions of F_h and G_h). We run DORIS for the player, while the opponent updates θ via online projected gradient ascent, just as in DORIS-C. We call this new algorithm DORIS-V; it is shown in Algorithm 6 and also consists of three steps in each iteration. For the policy evaluation task, we apply OptLSPE-V, which constructs a confidence set for each dimension of the function class separately and takes their intersection as the final confidence set. The construction rule for B_D(µ) is therefore:

B_D(µ) ← {f ∈ F : L^j_D(f^j_h, f^j_{h+1}, µ) ≤ inf_{g∈G} L^j_D(g^j_h, f^j_{h+1}, µ) + β, ∀h ∈ [H], j ∈ [d]},    (17)

where for any j ∈ [d] and h ∈ [H],

L^j_D(ξ^j_h, ζ^j_{h+1}, µ) = Σ_{(s_h, a_h, r^j_h, s_{h+1}) ∈ D} [ ξ^j_h(s_h, a_h) − r^j_h − ζ^j_{h+1}(s_{h+1}, µ) ]²,

and r^j_h is the j-th dimension of r_h. In addition, since we now want to minimize the distance, OptLSPE-V outputs a pessimistic estimation of the target value function instead of an optimistic one.

• The player plays a policy µ^t sampled from its hyperpolicy p^t and collects a trajectory.
• The player runs OptLSPE-V to obtain pessimistic value function estimations ⟨θ^t, V̂^t(µ)⟩ for all µ ∈ Π and updates the hyperpolicy using Hedge.
• The dual variable is updated using online projected gradient ascent.
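The Fenchel duality underlying this conversion, dist(x, C) = max_{‖θ‖≤1} (⟨θ, x⟩ − max_{y∈C}⟨θ, y⟩), can be checked numerically for a simple convex set. The Python sketch below uses C = [0, 1]² and a grid search over unit vectors θ (an illustrative brute-force stand-in, not part of the algorithm), comparing the dual value against the distance computed by direct projection.

```python
import math

def support_box(theta):
    """Support function of the unit box C = [0,1]^2: max_{y in C} <theta, y>."""
    return sum(max(t, 0.0) for t in theta)

def dual_distance(x, n_grid=10000):
    """dist(x, C) via Fenchel duality, maximizing over theta on the unit circle."""
    best = 0.0  # theta = 0 always attains 0, so the dual value is at least 0
    for k in range(n_grid):
        a = 2 * math.pi * k / n_grid
        theta = (math.cos(a), math.sin(a))
        best = max(best, x[0] * theta[0] + x[1] * theta[1] - support_box(theta))
    return best

x = (2.0, 0.5)
proj = (min(max(x[0], 0.0), 1.0), min(max(x[1], 0.0), 1.0))  # projection onto C
direct = math.hypot(x[0] - proj[0], x[1] - proj[1])
```

For points inside C the dual value is zero, consistent with dist(x, C) = 0; the max over θ of a linear function of x is also what makes the opponent's gradient-ascent update well defined.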

D.2 THEORETICAL GUARANTEES

We still consider finite policy class Π here. Notice that in the fictitious MG of VMDP, the opponent's policy class is also infinite, i.e., B(1). However, since the player only needs to estimate V µ 1 (s 1 ), which is independent of θ, DORIS-V can also circumvent the union bound on θ just like DORIS-C.

Algorithm 6 DORIS-V

Input: learning rates η, {α_t}, confidence parameter β.
Initialize p^1 ∈ R^{|Π|} to be uniform over Π, θ^1 ← 0.
for t = 1, …, K do
    Collect samples:
        The learner samples µ^t from p^t. Run µ^t and collect D_t = {s^t_1, a^t_1, r^t_1, …, s^t_{H+1}}.
    Update policy distribution:
        V̂^t(µ) ← OptLSPE-V(µ, D_{1:t−1}, F, G, β, θ^t), ∀µ ∈ Π.
        p^{t+1}(µ) ∝ p^t(µ) · exp(−η ⟨V̂^t(µ), θ^t⟩), ∀µ ∈ Π.
    Update dual variable:
        θ^{t+1} ← Proj_{B(1)}(θ^t + α_t(V̂^t(µ^t) − argmax_{x∈C}⟨θ^t, x⟩)).
end for
Output: µ̂ uniformly sampled from {µ^1, …, µ^K}.

Algorithm 7 OptLSPE-V(µ, D, F, G, β, θ)

Construct B_D(µ) based on D via (17).
Select V̂ ← f̂_1(s_1, µ), where f̂ = argmin_{f∈B_D(µ)} ⟨f_1(s_1, µ), θ⟩.
return V̂.

In addition, we need to introduce the realizability and generalized completeness assumptions in this specific setting, which are simply vectorized versions of the previous ones:

Assumption 9 (Realizability and generalized completeness in VMDP). Assume that for any h ∈ [H], j ∈ [d], µ ∈ Π, f_{h+1} ∈ F_{h+1}, we have Q^{µ,j}_h ∈ F^j_h and T^{µ,j}_h f^j_{h+1} ∈ G^j_h, where Q^{µ,j}_h is the j-th dimension of Q^µ_h and T^{µ,j}_h is the j-th dimensional Bellman operator at step h, defined in (18):

(T^{µ,j}_h f^j_{h+1})(s, a) := r^j_h(s, a) + E_{s′∼P_h(·|s,a)}[f^j_{h+1}(s′, µ)].    (18)

The BEE dimension for VMDP can then be defined as the maximum BEE dimension over all d dimensions:

Definition 8. The d-dimensional ε-Bellman Evaluation Eluder dimension of function class F on distribution family Q with respect to the policy class Π is defined as

dim_BEE(F, ε, Π, Q) := max_{j∈[d], h∈[H]} dim_DE((I − T^{Π,j}_h)F^j, Q_h, ε),

where (I − T^{Π,j}_h)F^j := {f^j_h − T^{µ,j}_h f^j_{h+1} : f ∈ F, µ ∈ Π}. We also use dim_BEE(F, ε, Π) to denote min{dim_BEE(F, ε, Π, Q_1), dim_BEE(F, ε, Π, Q_2)} as before.
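The only geometric primitive the opponent's update in Algorithm 6 needs is the Euclidean projection onto the unit ball B(1), which has the closed form below (a short Python sketch):

```python
import math

def proj_unit_ball(theta):
    """Euclidean projection onto B(1): rescale only if the norm exceeds 1."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= 1.0:
        return list(theta)
    return [t / norm for t in theta]

inside = proj_unit_ball([0.3, -0.4])   # norm 0.5: left unchanged
outside = proj_unit_ball([3.0, 4.0])   # norm 5: rescaled to the unit sphere
```

Because the projection preserves direction, the ascent step θ^{t+1} = Proj_{B(1)}(θ^t + α_t(·)) keeps θ pointing toward the part of the reward vector that is currently farthest from the target set.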
The next theorem shows that DORIS-V finds a near-optimal policy for Problem 2 with polynomially many samples, where we use the following notation to simplify writing:

d_BEE,V := dim_BEE(F, 1/K, Π),  N_cov,V := max_{j∈[d]} N_{F^j∪G^j}(H/K) · KH.

Theorem 4. Under Assumptions 1 and 9, there exists an absolute constant c such that for any δ ∈ (0, 1] and K ∈ N, if we choose β = cH² log(N_cov,V |Π| d/δ), α_t = 2/(H√(dt)) and η = √(log|Π| / (K H² d)) in DORIS-V, then with probability at least 1 − δ, we have:

dist(V^{µ̂}_1(s_1), C) ≤ min_{µ∈Π} dist(V^µ_1(s_1), C) + O( H² √( d · d_BEE,V log(N_cov,V |Π| d/δ) / K ) ).    (19)

The bound in (19) shows that for any ε > 0, if K ≥ O(d/ε²), then µ̂ is a near-optimal policy with high probability. Compared to the results in Theorem 1 and Theorem 2, there is an additional factor of d. This is because the reward is d-dimensional and we are in effect evaluating d scalar value functions in OptLSPE-V. The proof is similar to that of Theorem 3 and utilizes the fact that both µ and θ are updated via no-regret online learning algorithms (Hedge for µ and online projected gradient ascent for θ). See Appendix M for more details.

Table 1: Comparison with related works on decentralized learning with an adversarial opponent. [49] focuses on tabular cases and can only compete against the Nash value of the game. Liu et al. [33] also works on tabular cases, but the baseline is much stronger, i.e., the best policy in hindsight. Jin et al. [27] and Huang et al. [23] are able to deal with general function approximation, but they can only compete against the Nash value. In contrast, our work considers general function approximation and the baseline is the strongest (the same as Liu et al. [33]).

Table 2: Comparison with related works on finding equilibria in self-play Markov games. Here "General FA" means the general function approximation setting.

Paper                              | Setting    | Decentralized? | Number of players | Optimism
Jin et al. [26], Mao et al. [35]   | Tabular    | Yes            | ≥ 2               | Local, w.r.t. max_µ V^{µ,ν}_h(s), ∀h, s
Jin et al. [27], Huang et al. [23] | General FA | No             | 2                 | Global, w.r.t. V*_1(s_1)
This work                          | General FA | Yes            | ≥ 2               | Global, w.r.t. V^{µ,ν_k}_1(s_1)

We can see that Jin et al. [26] and Mao et al. [35] study tabular cases; their algorithms are decentralized and still work when the number of players is larger than 2. Jin et al. [27] and Huang et al. [23] work in the general function approximation setting, but their algorithms are centralized and limited to two-player zero-sum games. In comparison, our work can handle multi-agent (≥ 2 players) general-sum games with general function approximation, and the algorithm is decentralized.

Comparison with existing algorithms. [58] has also proposed algorithms for approachability tasks in tabular cases, achieving the same sub-optimality gap with respect to d and K as Theorem 4. [37] studies the tabular and linear approximation cases, also achieving √K regret. Their sample complexity does not scale with d because they normalize the reward vector to lie in B(1) in tabular cases and in B(√d_lin) in d_lin-dimensional linear VMDPs. Compared to the above works, DORIS-V is able to tackle the more general case with nonlinear function approximation and policy classes while retaining sample efficiency.

E.1 COMPARISON WITH CLOSELY-RELATED WORKS

In this section we provide a more detailed comparison between DORIS and some closely related works. First, we summarize the comparison with the decentralized learning literature and the self-play literature in Tables 1 and 2. Next we clarify the novelty of our work relative to these related works.

Novelty given Liu et al. [33]. The idea of maintaining a hyperpolicy and updating it with Hedge in DORIS is inspired by OPMD, proposed in Liu et al. [33]. However, the policy evaluation algorithm in Liu et al. [33] only works in tabular cases, and our novelty lies in the new optimism and policy evaluation step (Algorithm 2) specially designed for the policy-revealing setting. Note that the extension to general function approximation is not trivial: combining existing techniques for reinforcement learning with general function approximation (for example, the Bellman Eluder dimension in Jin et al. [25]) with Liu et al. [33] does not lead to our work, because the optimism in DORIS is (i) global (optimism holds only at s_1, which differs from Liu et al. [33]) and (ii) policy-pair specific (our confidence set is only optimistic with respect to V^{µ,ν_k}_1(s_1), which differs from Jin et al. [25]). More specifically, Liu et al. [33] attains optimism by adding a bonus term β for each step h and state-action tuple (s, a, b) in value iteration:

Q̂^{µ,ν_k}_h(s, a, b) = E_{s′∼P̂(·|s,a,b)}[V̂^{µ,ν_k}_{h+1}(s′)] + r_h(s, a, b) + β,
V̂^{µ,ν_k}_h(s) = E_{a∼µ_h(s), b∼ν_{k,h}(s)}[Q̂^{µ,ν_k}_h(s, a, b)].

This guarantees that Q̂^{µ,ν_k}_h(s, a, b) is optimistic with respect to the true value function Q^{µ,ν_k}_h(s, a, b) for every h, s, a, b. In contrast, DORIS picks the most optimistic estimation from the constructed confidence set directly:

V̂^{µ,ν_k} = max_{f∈B(µ,ν_k)} f(s_1, µ, ν_k).

This only guarantees that V̂^{µ,ν_k} is optimistic with respect to V^{µ,ν_k}_1(s_1).
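The bonus-based (local) optimism of Liu et al. [33] can be caricatured on a toy two-step, single-state problem with a known model and hypothetical rewards; the point of the sketch is that adding β at every backup step makes the estimate optimistic at every (h, s), whereas global optimism only guarantees optimism at s_1.

```python
# Bonus-based value iteration on a 2-step, 1-state chain (hypothetical rewards):
# Q_h = r_h + V_{h+1} + beta, so each backup adds one bonus term.
beta = 0.1
r = {1: 0.4, 2: 0.6}   # per-step rewards of the evaluated policy pair
V3 = 0.0               # value beyond the horizon
V2 = r[2] + V3 + beta  # optimistic estimate at step 2
V1 = r[1] + V2 + beta  # optimistic estimate at step 1
true_V1 = r[1] + r[2]
# The over-estimate accumulates one beta per step (H * beta in general),
# and V2 is already optimistic at step 2, not just at the initial state.
```

DORIS's confidence-set maximum carries no such per-step guarantee, which is why its regret analysis cannot reuse the step-wise optimism arguments.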
Thus bounding the regret is harder in our case, where only global optimism holds. In addition, although Jin et al. [25] also uses global optimism, it is optimistic with respect to the fixed optimal value function V*_1(s_1) and only needs to construct one confidence set for it. In contrast, to tackle the nonstationary optimal policy in the decentralized setting, which keeps changing across episodes due to the adversarial opponent, DORIS needs to construct a confidence set of V^{µ,ν_k}_1(s_1) for each policy µ in the policy class, and the estimation V̂^{µ,ν_k} is only optimistic with respect to V^{µ,ν_k}_1(s_1) for each µ respectively. Therefore, the analysis techniques in Jin et al. [25] cannot be applied directly here; we need to decompose the regret in a different way and propose a new complexity measure (the BEE dimension) to bound it.

Novelty given Jin et al. [27] and Huang et al. [23]. Jin et al. [27] and Huang et al. [23] also design algorithms and analysis techniques based on Bellman-Eluder-type complexity measures (i.e., the multi-agent BE dimension [27, 23]) in Markov games. Although the BEE dimension is likewise inspired by the Bellman Eluder dimension proposed in Jin et al. [25], DORIS is very different from Jin et al. [27] and Huang et al. [23] in terms of confidence set construction and optimism, which makes their analysis techniques inapplicable here as well. Jin et al. [27] and Huang et al. [23] consider zero-sum Markov games in the centralized setting with general function approximation. The algorithms in these works are based on (i) constructing confidence regions of the optimal value function (i.e., the Nash value function in zero-sum games) or of the model, and (ii) solving the Nash equilibrium with respect to the optimistic function/model.
As a result, their algorithms can be regarded as running optimistic greedy policies in games, and the estimated value functions are always optimistic estimates of the optimal value function V*_1(s_1) of the underlying game. In contrast, in the decentralized setting, one unique challenge faced by DORIS is that the optimal policy is indeed changing across episodes, because we cannot control the opponent, which may adversarially adjust its own policy. Therefore, there is no fixed optimal value function to which optimism can be anchored. More importantly, from the view of the single agent, the environment changes adversarially due to the opponent; such nonstationarity does not appear in these works. To deal with this challenge, DORIS is based on (i) constructing confidence regions for policy evaluation problems and (ii) running mirror descent over the space of policies. As a result, DORIS is more like a decentralized policy optimization algorithm, and the value functions maintained by DORIS are only optimistic with respect to the value functions associated with the current policy pair (µ, ν_k), which changes at each iteration. More importantly, this different version of optimism leads to a different regret decomposition. Specifically, in (22), we show that the regret is upper bounded by the policy evaluation error Σ_{t=1}^K (V̂^t(µ^t) − V^{π^t}_1(s_1)) and the online learning error Σ_{t=1}^K (V̂^t(µ*) − V̂^t(µ^t)) induced by mirror descent. Bounding the evaluation error Σ_{t=1}^K (V̂^t(µ^t) − V^{π^t}_1(s_1)) incurred by achieving optimism in policy evaluation has not been considered in Jin et al. [27] or Huang et al. [23].
The multi-agent BE dimension [27, 23] cannot be applied here either, because it measures Bellman residuals of the form $f_h(s,a,b) - r_h(s,a,b) - \min_\nu f_{h+1}(s',\mu,\nu)$ and can only help bound $\sum_{t=1}^K\big(\widehat V^t(\mu^t) - \min_\nu V^{\mu^t,\nu}_1(s_1)\big)$ when the policy $\nu^t$ played by the opponent is a pessimistic best response to $\mu^t$. In our case $\nu^t$ is arbitrary (typically not the best response to $\mu^t$) and the value we want to bound is also different; therefore the multi-agent BE dimension is not applicable, and we have to propose the BEE dimension to evaluate the complexity of policy evaluation tasks with general function approximation. Along with the new measure, we have also identified common function classes with low BEE dimension to illustrate its capacity.

E.2 LOWER BOUND IN LIU ET AL. [33]

Here we present a lower bound from Liu et al. [33]:

Theorem 5 (Liu et al. [33, Theorem 4]). There exists a Markov game with $|S|, |A| = O(H)$ and an opponent who chooses a policy uniformly at random from an unknown set of $H$ Markov policies in each episode, such that when the opponent's policy is not revealed, the regret for competing with the best fixed Markov policy in hindsight is $\Omega(\min\{K, 2^H\}/H)$.

This lower bound shows that if the opponent's policy is not revealed, then even when the opponent only plays a finite number of Markov policies, an exponential regret lower bound for competing with the best Markov policy in hindsight is inevitable, which validates the necessity of the policy-revealing condition.

E.3 COMPUTATIONAL COMPLEXITY OF DORIS

There are mainly two steps in DORIS that require nontrivial computation: optimistic policy evaluation via OptLSPE and the hyperpolicy update via Hedge. Assuming the policy class is finite makes the second step tractable, but even with a finite policy class, OptLSPE is still computationally inefficient. This is due to the global optimism step in OptLSPE, i.e., constructing the confidence set (Equation (3)) and finding the most optimistic estimate. This is a common issue for algorithms with general function approximation, even in single-agent MDPs. For example, the global optimism steps of the algorithms in [24, 25, 14, 23, 27] are all computationally inefficient and hard to implement. However, if we only consider linear MGs, computationally efficient algorithms are possible, since we can use local optimism and implement OptLSPE by an analogue of LSVI-UCB [28], which is computationally efficient. In addition, if there were a computationally efficient solver for optimistic policy evaluation with general function approximation in single-agent MDPs, we believe it could also be utilized here, since the confidence set update rule (Equation (3)) is similar to the single-agent case. That said, in this work we mainly focus on the statistical complexity of learning the Markov game, and computationally efficient algorithms are left as future work.
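To make the global optimism step concrete, here is a minimal sketch of OptLSPE-style selection over a finite function class. The function name `opt_lspe`, the dictionary-valued candidates, and the data are all illustrative assumptions: the squared-residual loss merely stands in for the empirical Bellman error appearing in Equation (3), so this is a toy, not the paper's implementation.

```python
# Hypothetical sketch of OptLSPE-style global optimism over a finite function
# class: keep every candidate whose squared-residual loss is within beta of the
# minimum (the confidence set), then report the most optimistic value at s1.

def opt_lspe(candidates, data, s1, beta):
    """candidates: list of dicts mapping a state to a scalar value estimate.
    data: list of (state, regression_target) pairs from past episodes."""
    def loss(f):
        return sum((f[s] - y) ** 2 for s, y in data)
    best = min(loss(f) for f in candidates)
    confidence_set = [f for f in candidates if loss(f) <= best + beta]
    return max(f[s1] for f in confidence_set)  # global optimism over the set

candidates = [{"s1": 0.2}, {"s1": 0.5}, {"s1": 0.9}]
data = [("s1", 0.45), ("s1", 0.55)]
print(opt_lspe(candidates, data, "s1", beta=0.3))  # 0.5: {"s1": 0.9} is excluded
```

Note how the cost is linear in the number of candidates here, whereas for a general (infinite) function class both the confidence-set construction and the inner maximization become intractable, which is exactly the inefficiency discussed above.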

F PROOF OF PROPOSITION 1

From the completeness assumption, we know that for every $f \in F$, $\mu \in \Pi$, $\nu \in \Pi'$ there exists $g_h \in F_h$ such that $g_h = T^{\mu,\nu}_h f_{h+1}$, which implies that $f_h - T^{\mu,\nu}_h f_{h+1} \in F_h - F_h$ for all $f \in F$, $\mu \in \Pi$, $\nu \in \Pi'$. In other words, $(I - T^{\Pi,\Pi'}_h)F \subseteq F_h - F_h$. Therefore, from the definition of $\dim_{BEE}(F,\epsilon,\Pi,\Pi')$ we have
$$\dim_{BEE}(F,\epsilon,\Pi,\Pi') \le \dim_{BEE}(F,\epsilon,\Pi,\Pi',Q^2) = \max_{h\in[H]} \dim_{DE}\big((I-T^{\Pi,\Pi'}_h)F,\, Q^2_h,\, \epsilon\big) \le \max_{h\in[H]} \dim_{DE}\big(F_h - F_h,\, Q^2_h,\, \epsilon\big) = \max_{h\in[H]} \dim_E(F_h,\epsilon),$$
where the last step comes from the definition of $\dim_E$ and $Q^2_h$ is the Dirac distribution family. This concludes our proof.

G EXAMPLES FOR REALIZABILITY, GENERALIZED COMPLETENESS AND COVERING NUMBER

In this section we present practical examples where realizability and generalized completeness hold while the covering number remains bounded. Specifically, we consider tabular MGs, linear MGs, and kernel MGs.

G.1 TABULAR MGS

For tabular MGs, we let $F_h = \{f \mid f: S \times A \times B \to [0, V_{\max}]\}$ and $G_h = F_h$ for all $h \in [H]$. Then it is clear that $Q^{\mu\times\nu}_h \in F_h$ and $T^{\mu,\nu}_h f_{h+1} \in G_h$ for any $f \in F$, $h \in [H]$, $\mu$, $\nu$, so realizability and generalized completeness are satisfied. In addition, in this case we have $\log N_{F_h}(\epsilon) = \log N_{G_h}(\epsilon) \le |S||A||B| \log(V_{\max}/\epsilon)$. This shows that the sizes of $F$ and $G$ are also not too large.
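The coordinate-wise grid behind this covering-number bound can be checked numerically. The helper below is an illustrative assumption, not from the paper: a grid of midpoints with spacing $2\epsilon$ in each of the $|S||A||B|$ coordinates gives a sup-norm $\epsilon$-cover, whose log-size matches the stated bound.

```python
import math

# Sanity check of the tabular covering-number bound: gridding each of the
# |S||A||B| coordinates of f : S x A x B -> [0, Vmax] with spacing 2*eps gives
# a sup-norm eps-cover, so log N <= |S||A||B| * log(Vmax/eps) for eps < Vmax/2.

def log_cover_size(n_coords, vmax, eps):
    points_per_coord = math.ceil(vmax / (2 * eps))  # grid midpoints per coordinate
    return n_coords * math.log(points_per_coord)

S, A, B, vmax, eps = 3, 2, 2, 1.0, 0.05
print(log_cover_size(S * A * B, vmax, eps) <= S * A * B * math.log(vmax / eps))  # True
```

The explicit grid is of course exponentially large in $|S||A||B|$; the point is only that its logarithm is linear, which is what enters the regret bound.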

G.2 LINEAR MGS

In this subsection we consider linear MGs. Here we generalize the definition of linear MDPs [28] to Markov games.

Definition 9 (Linear MGs). We say an MG is linear of dimension $d$ if for each $h \in [H]$ there exist a feature mapping $\phi_h: S\times A\times B \to \mathbb R^d$, $d$ unknown signed measures $\psi_h = (\psi^{(1)}_h, \cdots, \psi^{(d)}_h)$ over $S$, and an unknown vector $\theta_h \in \mathbb R^d$ such that $P_h(\cdot|s,a,b) = \phi_h(s,a,b)^\top \psi_h(\cdot)$ and $r_h(s,a,b) = \phi_h(s,a,b)^\top \theta_h$ for all $(s,a,b) \in S\times A\times B$.

Without loss of generality, we assume $\|\phi_h(s,a,b)\| \le 1$ for all $s\in S, a\in A, b\in B$ and $\|\psi_h(S)\| \le \sqrt d$, $\|\theta_h\| \le \sqrt d$ for all $h$. Let $F_h = G_h = \{\phi_h(\cdot)^\top w \mid w \in \mathbb R^d,\ \|w\| \le (H-h+1)\sqrt d,\ 0 \le \phi_h(\cdot)^\top w \le H-h+1\}$.

Realizability. For any $\mu, \nu$ we have
$$Q^{\mu\times\nu}_h(s,a,b) = r_h(s,a,b) + \mathbb E_{s'\sim P_h(\cdot|s,a,b)}\big[V^{\mu\times\nu}_{h+1}(s')\big] = \big\langle \phi_h(s,a,b),\, \theta_h\big\rangle + \Big\langle \phi_h(s,a,b),\, \int_S V^{\mu\times\nu}_{h+1}(s')\,d\psi_h(s')\Big\rangle = \big\langle \phi_h(s,a,b),\, w^{\mu\times\nu}_h\big\rangle,$$
where $w^{\mu\times\nu}_h = \theta_h + \int_S V^{\mu\times\nu}_{h+1}(s')\,d\psi_h(s')$ and thus $\|w^{\mu\times\nu}_h\| \le (H-h+1)\sqrt d$. Therefore $Q^{\mu\times\nu}_h \in F_h$, i.e., realizability holds.

Generalized completeness. For any $f_{h+1} \in F_{h+1}$, we have
$$T^{\mu,\nu}_h f_{h+1}(s,a,b) = r_h(s,a,b) + \mathbb E_{s'\sim P_h(\cdot|s,a,b)}\big[f_{h+1}(s',\mu,\nu)\big] = \Big\langle \phi_h(s,a,b),\, \theta_h + \int_S f_{h+1}(s',\mu,\nu)\,d\psi_h(s')\Big\rangle.$$
Since $\|f_{h+1}\|_\infty \le H-h$, we have $\|\theta_h + \int_S f_{h+1}(s',\mu,\nu)\,d\psi_h(s')\| \le (H-h+1)\sqrt d$, which indicates $T^{\mu,\nu}_h f_{h+1} \in G_h$, and thus generalized completeness is satisfied.

Covering number. From the literature [51], the covering number of an $\ell_2$-norm ball can be bounded as $\log N_{B((H-h+1)\sqrt d)}(\epsilon) \le d\log(3H\sqrt d/\epsilon)$. Therefore, there exists $W \subset B((H-h+1)\sqrt d)$ with $\log |W| \le d\log(3H\sqrt d/\epsilon)$ such that for any $w \in B((H-h+1)\sqrt d)$ there exists $w' \in W$ satisfying $\|w - w'\| \le \epsilon$. Now let $F'_h = \{\phi_h(\cdot)^\top w \mid w \in W\}$. For any $f_h \in F_h$, write $f_h(\cdot) = \phi_h(\cdot)^\top w_{f_h}$. Then there exists $f'_h(\cdot) = \phi_h(\cdot)^\top w_{f'_h} \in F'_h$ with $\|w_{f_h} - w_{f'_h}\| \le \epsilon$, which implies $|f_h(s,a,b) - f'_h(s,a,b)| \le \|\phi_h(s,a,b)\|\,\|w_{f_h} - w_{f'_h}\| \le \epsilon$. Therefore $\log N_{F_h}(\epsilon) \le \log|F'_h| = \log|W| \le d\log(3H\sqrt d/\epsilon)$.

G.3 KERNEL MGS

In this subsection we show that kernel MGs also satisfy realizability and generalized completeness naturally. In addition, when a kernel MG has bounded effective dimension, its covering number is also bounded. First we generalize the definition of kernel MDPs [25] to MGs as follows.

Definition 10 (Kernel MGs). In a kernel MG, for each step $h \in [H]$ there exist feature mappings $\phi_h: S\times A\times B \to \mathcal H$ and $\psi_h: S \to \mathcal H$, where $\mathcal H$ is a separable Hilbert space, such that $P_h(s'|s,a,b) = \langle \phi_h(s,a,b), \psi_h(s')\rangle_{\mathcal H}$ for all $s\in S, a\in A, b\in B, s'\in S$. Besides, the reward function is linear in $\phi$, i.e., $r_h(s,a,b) = \langle \phi_h(s,a,b), \theta_h\rangle_{\mathcal H}$ for some $\theta_h \in \mathcal H$. Moreover, a kernel MG satisfies the following regularization conditions:
• $\|\theta_h\|_{\mathcal H} \le 1$ and $\|\phi_h(s,a,b)\|_{\mathcal H} \le 1$ for all $s\in S, a\in A, b\in B, h\in[H]$;
• $\|\sum_{s\in S} V(s)\psi_h(s)\|_{\mathcal H} \le 1$ for all functions $V: S \to [0,1]$ and all $h\in[H]$.

Remark 7. Tabular and linear MGs are special cases of kernel MGs; therefore the following discussion applies to tabular and linear MGs as well.

Then we let $F_h = G_h = \{\phi_h(\cdot)^\top w \mid w \in B_{\mathcal H}(H-h+1)\}$, where $B_{\mathcal H}(r)$ is the ball of radius $r$ in $\mathcal H$. Following the same arguments as for linear MGs, we can verify that realizability and generalized completeness are satisfied in kernel MGs.

Covering number. Before bounding the covering number of $F_h$, we need to introduce a new measure of the complexity of a Hilbert space, since $\mathcal H$ may be infinite dimensional. Here we use the effective dimension [14, 25], defined as follows:

Definition 11 ($\epsilon$-effective dimension of a set). The $\epsilon$-effective dimension of a set $X$ is the minimum integer $d_{\mathrm{eff}}(X,\epsilon) = n$ such that
$$\sup_{x_1,\cdots,x_n \in X}\ \frac 1 n \log\det\Big(I + \frac 1{\epsilon^2}\sum_{i=1}^n x_i x_i^\top\Big) \le e^{-1}.$$

Remark 8. When $X$ is finite dimensional with dimension $d$, its effective dimension can be upper bounded by $O\big(d\log(1 + R^2/\epsilon)\big)$, where $R$ is the norm bound of $X$ [14]. Even when $X$ is infinite dimensional, if the eigenspectrum of the covariance matrices concentrates in a low-dimensional subspace, the effective dimension of $X$ can still be small [48]. We say a kernel MG is of effective dimension $d(\epsilon)$ if $d_{\mathrm{eff}}(X_h,\epsilon) \le d(\epsilon)$ for all $h$, where $X_h = \{\phi_h(s,a,b): (s,a,b)\in S\times A\times B\}$.

Proof (of Proposition 2). Suppose $\dim_E(F_h,\epsilon) = n$. By the definition of eluder dimension, there exists a sequence $\{\phi_i\}_{i=1}^n \subset X_h$ such that for any $w_1, w_2 \in B_{\mathcal H}(H-h+1)$ and $\phi \in X_h$, if $\sum_{i=1}^n (\langle \phi_i, w_1 - w_2\rangle)^2 \le \epsilon^2$ then $|\langle \phi, w_1 - w_2\rangle| \le \epsilon$. Therefore, covering $F_h$ reduces to covering the projection of $B_{\mathcal H}(H-h+1)$ onto the space spanned by $\{\phi_i\}_{i=1}^n$, whose dimension is at most $n$. From the literature [51], the covering number of such a space is $O(n\log(1+nH/\epsilon))$, which implies $\log N_{F_h}(\epsilon) \le O\big(n\log(1+nH/\epsilon)\big)$. Finally, by the proof of Proposition 3, we know $n \le d(\epsilon/2H)$, which concludes the proof.
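Definition 11 can be evaluated by brute force for a tiny finite feature set. The sketch below is an illustrative assumption (two one-hot features in $\mathbb R^2$, a made-up `effective_dimension` helper): it searches for the smallest $n$ at which the worst-case normalized log-determinant drops below $e^{-1}$.

```python
import itertools
import math
import numpy as np

# Brute-force evaluation of Definition 11 for a tiny finite feature set X:
# d_eff(X, eps) is the smallest n such that, over the worst choice x_1..x_n in X,
# (1/n) * logdet(I + (1/eps^2) * sum_i x_i x_i^T) <= 1/e.

def effective_dimension(X, eps, n_max=12):
    for n in range(1, n_max + 1):
        worst = max(
            np.linalg.slogdet(
                np.eye(len(X[0])) + sum(np.outer(x, x) for x in xs) / eps**2
            )[1] / n
            for xs in itertools.product(X, repeat=n)  # all ordered choices
        )
        if worst <= math.exp(-1):
            return n
    return None

X = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(effective_dimension(X, eps=1.0))  # 10 for these two one-hot features
```

The exhaustive search over $X^n$ is exponential and only feasible for toy sets; for realistic feature maps one would bound the log-determinant analytically, as Remark 8 does.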

H EXAMPLES FOR BEE DIMENSION

In this section we show that kernel MGs (including tabular and linear MGs) and generalized linear complete models have low BEE dimension.

Proof (of Proposition 3). First, in Appendix G we showed that $F$ satisfies completeness. By Proposition 1, we have $\dim_{BEE}(F,\epsilon,\Pi,\Pi') \le \max_{h\in[H]}\dim_E(F_h,\epsilon)$, so we only need to bound $\dim_E(F_h,\epsilon)$ for each $h\in[H]$. Suppose $\dim_E(F_h,\epsilon) = k > d(\epsilon/2H)$. By the definition of eluder dimension, there exist a sequence $\phi_1,\cdots,\phi_k$ and $\{w_{1,i}\}_{i=1}^k$, $\{w_{2,i}\}_{i=1}^k$, where $\phi_i \in X_h = \{\phi_h(s,a,b): (s,a,b)\in S\times A\times B\}$ and $w_{1,i}, w_{2,i} \in B_{\mathcal H}(H-h+1)$ for all $i$, such that for any $t\in[k]$:
$$\sum_{i=1}^{t-1} \big(\langle \phi_i,\, w_{1,t} - w_{2,t}\rangle\big)^2 \le (\epsilon')^2, \qquad (20)$$
$$\big|\langle \phi_t,\, w_{1,t} - w_{2,t}\rangle\big| \ge \epsilon', \qquad (21)$$
where $\epsilon' \ge \epsilon$. Let $\Sigma_t = \sum_{i=1}^{t-1}\phi_i\phi_i^\top + \frac{\epsilon^2}{4H^2}\,I$. Then for any $t\in[k]$, $\|w_{1,t}-w_{2,t}\|^2_{\Sigma_t} \le (\epsilon')^2 + \epsilon^2$. On the other hand, by the Cauchy-Schwarz inequality, $\|\phi_t\|_{\Sigma_t^{-1}}\,\|w_{1,t}-w_{2,t}\|_{\Sigma_t} \ge |\langle \phi_t, w_{1,t}-w_{2,t}\rangle| \ge \epsilon'$. This implies that for all $t\in[k]$, $\|\phi_t\|_{\Sigma_t^{-1}} \ge \epsilon'/\sqrt{\epsilon^2 + (\epsilon')^2} \ge 1/\sqrt 2$. Therefore, applying the elliptical potential lemma (e.g., Lemma 5.6 and Lemma F.3 in [14]), we have for any $t\in[k]$
$$\log\det\Big(I + \frac{4H^2}{\epsilon^2}\sum_{i=1}^t \phi_i\phi_i^\top\Big) = \sum_{i=1}^t \log\big(1 + \|\phi_i\|^2_{\Sigma_i^{-1}}\big) \ge t\log\frac 3 2.$$
However, by the definition of effective dimension, when $n = d_{\mathrm{eff}}(X_h, \frac{\epsilon}{2H})$,
$$\sup_{\phi_1,\cdots,\phi_n}\ \log\det\Big(I + \frac{4H^2}{\epsilon^2}\sum_{i=1}^n \phi_i\phi_i^\top\Big) \le n e^{-1}.$$
This is a contradiction, since $n \le d(\epsilon/2H) < k$ and $\log\frac 3 2 > e^{-1}$.
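The elliptical potential identity invoked above can be checked numerically. The sketch below uses synthetic Gaussian features and an illustrative regularizer $\lambda$ in place of $\epsilon^2/4H^2$; the names and numbers are assumptions for the demonstration only.

```python
import numpy as np

# Checking the elliptical-potential identity:
# logdet(I + (1/lam) * sum_i phi_i phi_i^T) = sum_i log(1 + ||phi_i||^2_{Sigma_i^{-1}}),
# where Sigma_t = sum_{i<t} phi_i phi_i^T + lam * I. Synthetic features, lam = 0.25.

rng = np.random.default_rng(0)
d, T, lam = 3, 20, 0.25
phis = rng.normal(size=(T, d))

Sigma = lam * np.eye(d)
rhs = 0.0
for phi in phis:
    rhs += np.log(1 + phi @ np.linalg.solve(Sigma, phi))  # ||phi||^2 in Sigma^{-1}
    Sigma += np.outer(phi, phi)

lhs = np.linalg.slogdet(np.eye(d) + phis.T @ phis / lam)[1]
print(np.isclose(lhs, rhs))  # True: the two sides agree up to float error
```

The identity is exact (each rank-one update multiplies the determinant by $1 + \|\phi\|^2_{\Sigma^{-1}}$), which is why the proof can convert a per-step lower bound on $\|\phi_t\|_{\Sigma_t^{-1}}$ into a lower bound on the log-determinant.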

H.2 GENERALIZED LINEAR COMPLETE MODELS

An important variant of linear MDPs is the class of generalized linear complete models proposed by [52]. Here we also generalize it to Markov games:

Definition 12 (Generalized linear complete models). In $d$-dimensional generalized linear complete models, for each step $h\in[H]$ there exist a feature mapping $\phi_h: S\times A\times B \to \mathbb R^d$ and a link function $\sigma$ such that:
• for the generalized linear function class $F_h = \{\sigma(\phi_h(\cdot)^\top w) \mid w \in W\}$, where $W \subset \mathbb R^d$, realizability and completeness are both satisfied;
• the link function is strictly monotone, i.e., there exist $0 < c_1 < c_2 < \infty$ such that $\sigma'(x) \in [c_1, c_2]$;
• $\phi_h$ and $w$ satisfy the regularization conditions $\|\phi_h(s,a,b)\| \le R$ and $\|w\| \le R$ for all $s, a, b, h$, where $R > 0$ is a constant.

When the link function is $\sigma(x) = x$, generalized linear complete models reduce to linear complete models, which contain instances such as linear MGs and LQRs. The following proposition shows that generalized linear complete models also have low BEE dimension:

Proposition 4. If a generalized linear complete model has dimension $d$, then for any policy classes $\Pi$ and $\Pi'$, its BEE dimension can be bounded as $\dim_{BEE}(F,\epsilon,\Pi,\Pi') \le O(d\,c_2^2/c_1^2)$.

Proof. The proof is similar to that of Proposition 3, except that (20) and (21) become
$$\sum_{i=1}^{t-1} c_1^2\big(\langle\phi_i,\, w_{1,t}-w_{2,t}\rangle\big)^2 \le \sum_{i=1}^{t-1}\big(\sigma(\phi_i^\top w_{1,t}) - \sigma(\phi_i^\top w_{2,t})\big)^2 \le (\epsilon')^2,$$
$$c_2\,\big|\langle\phi_t,\, w_{1,t}-w_{2,t}\rangle\big| \ge \big|\sigma(\phi_t^\top w_{1,t}) - \sigma(\phi_t^\top w_{2,t})\big| \ge \epsilon'.$$

I PROOF OF THEOREM 1 AND THEOREM 2

In this section we present the proofs of Theorem 1 and Theorem 2. We first consider the oblivious setting. Let $\mu^* = \arg\max_{\mu\in\Pi}\sum_{t=1}^K V^{\mu\times\nu^t}_1(s_1)$. We can decompose the regret into the following terms:
$$\max_{\mu\in\Pi}\sum_{t=1}^K V^{\mu\times\nu^t}_1(s_1) - \sum_{t=1}^K V^{\pi^t}_1(s_1) = \underbrace{\sum_{t=1}^K V^{\mu^*\times\nu^t}_1(s_1) - \sum_{t=1}^K \widehat V^t(\mu^*)}_{(1)} + \underbrace{\sum_{t=1}^K \widehat V^t(\mu^*) - \sum_{t=1}^K \langle\widehat V^t, p^t\rangle}_{(2)} + \underbrace{\sum_{t=1}^K \langle\widehat V^t, p^t\rangle - \sum_{t=1}^K \widehat V^t(\mu^t)}_{(3)} + \underbrace{\sum_{t=1}^K \widehat V^t(\mu^t) - \sum_{t=1}^K V^{\pi^t}_1(s_1)}_{(4)}. \qquad (22)$$
Our proof bounds these terms separately and mainly consists of three steps:
• Prove that $\widehat V^t(\mu)$ is an optimistic estimate of $V^{\mu\times\nu^t}_1(s_1)$ for all $t\in[K]$ and $\mu\in\Pi$, which implies that term (1) $\le 0$.
• Bound term (4), the cumulative estimation error $\sum_{t=1}^K\big(\widehat V^t(\mu^t) - V^{\pi^t}_1(s_1)\big)$. In this step we utilize the newly proposed complexity measure, the BEE dimension, to bridge the cumulative estimation error and the empirical Bellman residuals incurred in OptLSPE.
• Bound term (2) using existing results on the online learning error induced by Hedge, and bound term (3) by noticing that it is a martingale difference sequence.

I.1 STEP 1: PROVE OPTIMISM

First we show that the constructed set $B_{D_{1:t-1}}(\mu,\nu^t)$ is not vacuous, in the sense that the true action-value function $Q^{\mu,\nu^t}$ belongs to it with high probability.

Lemma 2. With probability at least $1-\delta/4$, we have for all $t\in[K]$ and $\mu\in\Pi$, $Q^{\mu,\nu^t} \in B_{D_{1:t-1}}(\mu,\nu^t)$.

Proof. See Appendix J.1.

Then since $\widehat V^t(\mu) = \max_{f\in B_{D_{1:t-1}}(\mu,\nu^t)} f(s_1,\mu,\nu^t)$, we know that for all $t\in[K]$ and $\mu\in\Pi$, $\widehat V^t(\mu) \ge Q^{\mu,\nu^t}(s_1,\mu,\nu^t) = V^{\mu\times\nu^t}_1(s_1)$. In particular, for all $t\in[K]$, $\widehat V^t(\mu^*) \ge V^{\mu^*\times\nu^t}_1(s_1)$. (23)

I.2 STEP 2: BOUND THE ESTIMATION ERROR

Next we show that the estimation error $\sum_{t=1}^K\big(\widehat V^t(\mu^t) - V^{\pi^t}_1(s_1)\big)$ is small. Let $f^{t,\mu} = \arg\max_{f\in B_{D_{1:t-1}}(\mu,\nu^t)} f(s_1,\mu,\nu^t)$. Using standard concentration inequalities, we obtain the following lemma, which says that the empirical Bellman residuals are indeed close to the true residuals with high probability. Recall that $\pi^k = \mu^k\times\nu^k$.
Lemma 3. With probability at least $1-\delta/4$, we have for all $t\in[K]$, $h\in[H]$ and $\mu\in\Pi$,
$$\text{(a)}\quad \sum_{k=1}^{t-1}\mathbb E_{\pi^k}\big[f^{t,\mu}_h(s_h,a_h,b_h) - (T^{\mu,\nu^t}_h f^{t,\mu}_{h+1})(s_h,a_h,b_h)\big]^2 \le O(\beta), \qquad (24)$$
$$\text{(b)}\quad \sum_{k=1}^{t-1}\big[f^{t,\mu}_h(s^k_h,a^k_h,b^k_h) - (T^{\mu,\nu^t}_h f^{t,\mu}_{h+1})(s^k_h,a^k_h,b^k_h)\big]^2 \le O(\beta). \qquad (25)$$
Proof. See Appendix J.2.

Besides, using the performance difference lemma we can easily relate $\widehat V^t(\mu^t) - V^{\pi^t}_1(s_1)$ to Bellman residuals; the proof is deferred to Appendix J.3.

Lemma 4. For any $t\in[K]$, we have $\widehat V^t(\mu^t) - V^{\pi^t}_1(s_1) = \sum_{h=1}^H \mathbb E_{\pi^t}\big[(f^{t,\mu^t}_h - T^{\mu^t,\nu^t}_h f^{t,\mu^t}_{h+1})(s_h,a_h,b_h)\big]$.

Therefore, from Lemma 4 we obtain
$$\sum_{t=1}^K\big(\widehat V^t(\mu^t) - V^{\pi^t}_1(s_1)\big) = \sum_{h=1}^H\sum_{t=1}^K \mathbb E_{\pi^t}\big[(f^{t,\mu^t}_h - T^{\mu^t,\nu^t}_h f^{t,\mu^t}_{h+1})(s_h,a_h,b_h)\big]. \qquad (26)$$
Notice that in (26) we need to bound the Bellman residuals of $f^{t,\mu^t}_h$ weighted by policy $\pi^t$, while Lemma 3 only bounds the Bellman residuals weighted by $\pi^{1:t-1}$. Fortunately, we can utilize the inherently low BEE dimension to bridge these two quantities with the help of the following technical lemma:

Lemma 5 ([25]). Given a function class $\Phi$ defined on $X$ with $|\phi(x)| \le C$ for all $(\phi,x)\in\Phi\times X$, and a family of probability measures $Q$ over $X$, suppose the sequences $\{\phi_t\}_{t=1}^K\subset\Phi$ and $\{\rho_t\}_{t=1}^K\subset Q$ satisfy that for all $t\in[K]$, $\sum_{k=1}^{t-1}(\mathbb E_{\rho_k}[\phi_t])^2 \le \beta$. Then for all $t\in[K]$ and $w>0$,
$$\sum_{k=1}^t \big|\mathbb E_{\rho_k}[\phi_k]\big| \le O\Big(\sqrt{\dim_{DE}(\Phi,Q,w)\,\beta t} + \min\{t, \dim_{DE}(\Phi,Q,w)\}\,C + tw\Big).$$
Invoking Lemma 5 with $Q = Q^1_h$, $\Phi = (I - T^{\Pi,\Pi'}_h)F$ and $w = 1/K$, and conditioning on the event that (24) in Lemma 3 holds, we have
$$\sum_{t=1}^K \mathbb E_{\pi^t}\big[(f^{t,\mu^t}_h - T^{\mu^t,\nu^t}_h f^{t,\mu^t}_{h+1})(s_h,a_h,b_h)\big] \le O\Big(\sqrt{V^2_{\max} K\dim_{BEE}\big(F,1/K,\Pi,\Pi',Q^1\big)\log\big(N_{F\cup G}(V_{\max}/K)KH|\Pi|/\delta\big)}\Big). \qquad (27)$$
Similarly, invoking Lemma 5 with $Q = Q^2_h$, $\Phi = (I - T^{\Pi,\Pi'}_h)F$ and $w = 1/K$, and conditioning on the event that (25) in Lemma 3 holds, we have with probability at least $1-\delta/4$,
$$\sum_{t=1}^K \mathbb E_{\pi^t}\big[(f^{t,\mu^t}_h - T^{\mu^t,\nu^t}_h f^{t,\mu^t}_{h+1})(s_h,a_h,b_h)\big] \le \sum_{t=1}^K\big(f^{t,\mu^t}_h(s^t_h,a^t_h,b^t_h) - (T^{\mu^t,\nu^t}_h f^{t,\mu^t}_{h+1})(s^t_h,a^t_h,b^t_h)\big) + O\big(\sqrt{K\log(K/\delta)}\big) \le O\Big(\sqrt{V^2_{\max} K\dim_{BEE}\big(F,1/K,\Pi,\Pi',Q^2\big)\log\big(N_{F\cup G}(V_{\max}/K)KH|\Pi|/\delta\big)}\Big), \qquad (28)$$
where the first inequality comes from standard martingale difference concentration. Therefore, combining (27) and (28), we have
$$\sum_{t=1}^K \mathbb E_{\pi^t}\big[(f^{t,\mu^t}_h - T^{\mu^t,\nu^t}_h f^{t,\mu^t}_{h+1})(s_h,a_h,b_h)\big] \le O\Big(\sqrt{V^2_{\max} K\dim_{BEE}\big(F,1/K,\Pi,\Pi'\big)\log\big(N_{F\cup G}(V_{\max}/K)KH|\Pi|/\delta\big)}\Big). \qquad (29)$$
Substituting the above bounds into (26), we have
$$\sum_{t=1}^K\big(\widehat V^t(\mu^t) - V^{\pi^t}_1(s_1)\big) \le O\Big(H V_{\max}\sqrt{K\dim_{BEE}\big(F,1/K,\Pi,\Pi'\big)\log\big(N_{F\cup G}(V_{\max}/K)KH|\Pi|/\delta\big)}\Big).$$

I.3 STEP 3: BOUND THE REGRET

Now we only need to bound the online learning error. Notice that $p^t$ is updated using Hedge with reward $\widehat V^t$. Since $0 \le \widehat V^t \le V_{\max}$ and there are $|\Pi|$ policies, we have from the online learning literature [20] that
$$\sum_{t=1}^K \widehat V^t(\mu^*) - \sum_{t=1}^K \langle\widehat V^t, p^t\rangle \le V_{\max}\sqrt{K\log|\Pi|}. \qquad (30)$$
In addition, let $\mathcal F^k$ denote the filtration induced by $\{\nu^1\}\cup\big(\cup_{i=1}^k\{\mu^i, D^i, \nu^{i+1}\}\big)$. Then we observe that $\langle\widehat V^t, p^t\rangle - \widehat V^t(\mu^t) \in \mathcal F^t$. Moreover, $\widehat V^t \in \mathcal F^{t-1}$ since the estimation of $\widehat V^t$ only utilizes $D^{1:t-1}$, which implies $\mathbb E\big[\langle\widehat V^t, p^t\rangle - \widehat V^t(\mu^t)\,\big|\,\mathcal F^{t-1}\big] = 0$. Therefore term (3) is a martingale difference sequence, and by the Azuma-Hoeffding inequality we have with probability at least $1-\delta/4$,
$$\sum_{t=1}^K \langle\widehat V^t, p^t\rangle - \sum_{t=1}^K \widehat V^t(\mu^t) \le O\big(V_{\max}\sqrt{K\log(1/\delta)}\big). \qquad (31)$$
Substituting (23), (29), (30), and (31) into (22) concludes our proof of Theorem 1. For the adaptive setting, we can simply repeat the above arguments. The only difference is that now $\nu^t$ can depend on $D^{1:t-1}$, so we need to introduce a union bound over $\Pi'$ when proving Lemma 2 and Lemma 3. This incurs an additional $\log|\Pi'|$ factor in $\beta$ and thus also in the regret bound.
This concludes our proof.
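The Hedge step in Step 3 can be sketched in a few lines. Everything below (the reward draws, the step-size constant, the factor 2 that absorbs the constant in the standard Hedge analysis) is synthetic and illustrative, not the paper's algorithm verbatim.

```python
import math
import numpy as np

# Minimal Hedge (exponential weights) sketch: the hyperpolicy p_t over a finite
# policy class is updated multiplicatively with reward vector Vhat_t, and the
# realized regret against the best fixed policy is compared with the
# O(V_max * sqrt(K log |Pi|)) bound from the online learning literature.

rng = np.random.default_rng(1)
n_policies, K, v_max = 5, 400, 1.0
eta = math.sqrt(math.log(n_policies) / K) / v_max  # standard Hedge step size

weights = np.ones(n_policies)
rewards = rng.uniform(0, v_max, size=(K, n_policies))
hedge_value = 0.0
for r in rewards:
    p = weights / weights.sum()     # p_t: the hyperpolicy distribution
    hedge_value += p @ r            # accumulates <Vhat_t, p_t>
    weights *= np.exp(eta * r)      # multiplicative weight update

best_fixed = rewards.sum(axis=0).max()
regret = best_fixed - hedge_value
bound = 2 * v_max * math.sqrt(K * math.log(n_policies))
print(regret <= bound)  # True: realized regret is within the Hedge bound
```

In DORIS the reward vector fed to Hedge is the optimistic evaluation $\widehat V^t(\mu)$ for each $\mu$ in the class, which is why the online learning error in (30) takes exactly this $V_{\max}\sqrt{K\log|\Pi|}$ form.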

J PROOFS OF LEMMAS IN APPENDIX I

J.1 PROOF OF LEMMA 2

Let $V_\rho$ be a $\rho$-cover of $G$ with respect to $\|\cdot\|_\infty$. Consider an arbitrary fixed tuple $(\mu,t,h,g)\in\Pi\times[K]\times[H]\times G$. Define $W_{t,k}(h,g,\mu)$ as
$$W_{t,k}(h,g,\mu) := \big(g_h(s^k_h,a^k_h,b^k_h) - r^k_h - Q^{\mu,\nu^t}_{h+1}(s^k_{h+1},\mu,\nu^t)\big)^2 - \big(Q^{\mu,\nu^t}_h(s^k_h,a^k_h,b^k_h) - r^k_h - Q^{\mu,\nu^t}_{h+1}(s^k_{h+1},\mu,\nu^t)\big)^2,$$
and let $\mathcal F_{k,h}$ be the filtration induced by $\{\nu^1,\cdots,\nu^K\}\cup\{s^i_1,a^i_1,b^i_1,r^i_1,\cdots,s^i_{H+1}\}_{i=1}^{k-1}\cup\{s^k_1,a^k_1,b^k_1,r^k_1,\cdots,s^k_h,a^k_h,b^k_h\}$. Then we have for all $k\le t-1$,
$$\mathbb E\big[W_{t,k}(h,g,\mu)\,\big|\,\mathcal F_{k,h}\big] = \big[(g_h - Q^{\mu,\nu^t}_h)(s^k_h,a^k_h,b^k_h)\big]^2.$$
By Freedman's inequality, with probability at least $1-\delta$,
$$\Big|\sum_{k=1}^{t-1} W_{t,k}(h,g,\mu) - \sum_{k=1}^{t-1}\big[(g_h - Q^{\mu,\nu^t}_h)(s^k_h,a^k_h,b^k_h)\big]^2\Big| \le O\Big(V_{\max}\sqrt{\log\tfrac 1\delta\cdot\sum_{k=1}^{t-1}\big[(g_h - Q^{\mu,\nu^t}_h)(s^k_h,a^k_h,b^k_h)\big]^2} + V^2_{\max}\log\tfrac 1\delta\Big).$$

By taking a union bound over $\Pi\times[K]\times[H]\times V_\rho$ and using the non-negativity of $\sum_{k=1}^{t-1}\big[(g_h - Q^{\mu,\nu^t}_h)(s^k_h,a^k_h,b^k_h)\big]^2$, we have with probability at least $1-\delta/4$, for all $(\mu,t,h,g)\in\Pi\times[K]\times[H]\times V_\rho$,
$$-\sum_{k=1}^{t-1} W_{t,k}(h,g,\mu) \le O\big(V^2_{\max}\iota\big),$$
where $\iota = \log(HK|V_\rho||\Pi|/\delta)$. This implies that for all $(\mu,t,h,g)\in\Pi\times[K]\times[H]\times G$,
$$\sum_{k=1}^{t-1}\big(Q^{\mu,\nu^t}_h(s^k_h,a^k_h,b^k_h) - r^k_h - Q^{\mu,\nu^t}_{h+1}(s^k_{h+1},\mu,\nu^t)\big)^2 \le \sum_{k=1}^{t-1}\big(g_h(s^k_h,a^k_h,b^k_h) - r^k_h - Q^{\mu,\nu^t}_{h+1}(s^k_{h+1},\mu,\nu^t)\big)^2 + O\big(V^2_{\max}\iota + V_{\max}t\rho\big).$$
Choosing $\rho = V_{\max}/K$, we conclude that with probability at least $1-\delta/4$, for all $\mu\in\Pi$ and $t\in[K]$, $Q^{\mu,\nu^t}\in B_{D_{1:t-1}}(\mu,\nu^t)$. This concludes our proof.

J.2 PROOF OF LEMMA 3

Let $Z_\rho$ be a $\rho$-cover of $F$ with respect to $\|\cdot\|_\infty$. Consider an arbitrary fixed tuple $(\mu,t,h,f)\in\Pi\times[K]\times[H]\times F$. Let
$$X_{t,k}(h,f,\mu) := \big(f_h(s^k_h,a^k_h,b^k_h) - r^k_h - f_{h+1}(s^k_{h+1},\mu,\nu^t)\big)^2 - \big((T^{\mu,\nu^t}_h f_{h+1})(s^k_h,a^k_h,b^k_h) - r^k_h - f_{h+1}(s^k_{h+1},\mu,\nu^t)\big)^2,$$
and let $\mathcal F_{k,h}$ be the filtration induced by $\{\nu^1,\cdots,\nu^K\}\cup\{s^i_1,a^i_1,b^i_1,r^i_1,\cdots,s^i_{H+1}\}_{i=1}^{k-1}\cup\{s^k_1,a^k_1,b^k_1,r^k_1,\cdots,s^k_h,a^k_h,b^k_h\}$. Then we have for all $k\le t-1$,
$$\mathbb E\big[X_{t,k}(h,f,\mu)\,\big|\,\mathcal F_{k,h}\big] = \big[(f_h - T^{\mu,\nu^t}_h f_{h+1})(s^k_h,a^k_h,b^k_h)\big]^2, \quad \mathrm{Var}\big[X_{t,k}(h,f,\mu)\,\big|\,\mathcal F_{k,h}\big] \le 4V^2_{\max}\,\mathbb E\big[X_{t,k}(h,f,\mu)\,\big|\,\mathcal F_{k,h}\big].$$
By Freedman's inequality, with probability at least $1-\delta$,
$$\Big|\sum_{k=1}^{t-1} X_{t,k}(h,f,\mu) - \sum_{k=1}^{t-1}\big[(f_h - T^{\mu,\nu^t}_h f_{h+1})(s^k_h,a^k_h,b^k_h)\big]^2\Big| \le O\Big(V_{\max}\sqrt{\log\tfrac 1\delta\cdot\sum_{k=1}^{t-1}\big[(f_h - T^{\mu,\nu^t}_h f_{h+1})(s^k_h,a^k_h,b^k_h)\big]^2} + V^2_{\max}\log\tfrac 1\delta\Big).$$

By taking a union bound over $\Pi\times[K]\times[H]\times Z_\rho$, we have with probability at least $1-\delta$, for all $(\mu,t,h,f)\in\Pi\times[K]\times[H]\times Z_\rho$,
$$\Big|\sum_{k=1}^{t-1} X_{t,k}(h,f,\mu) - \sum_{k=1}^{t-1}\big[(f_h - T^{\mu,\nu^t}_h f_{h+1})(s^k_h,a^k_h,b^k_h)\big]^2\Big| \le O\Big(V_{\max}\sqrt{\iota\cdot\sum_{k=1}^{t-1}\big[(f_h - T^{\mu,\nu^t}_h f_{h+1})(s^k_h,a^k_h,b^k_h)\big]^2} + V^2_{\max}\iota\Big), \qquad (32)$$
where $\iota = \log(HK|Z_\rho||\Pi|/\delta)$. Conditioning on the above event, consider an arbitrary tuple $(h,t,\mu)\in[H]\times[K]\times\Pi$. By the definition of $B_{D_{1:t-1}}(\mu,\nu^t)$ and Assumption 2, we have
$$\sum_{k=1}^{t-1} X_{t,k}(h,f^{t,\mu},\mu) \le \sum_{k=1}^{t-1}\Big[\big(f^{t,\mu}_h(s^k_h,a^k_h,b^k_h) - r^k_h - f^{t,\mu}_{h+1}(s^k_{h+1},\mu,\nu^t)\big)^2 - \inf_{g\in G}\big(g_h(s^k_h,a^k_h,b^k_h) - r^k_h - f^{t,\mu}_{h+1}(s^k_{h+1},\mu,\nu^t)\big)^2\Big] \le \beta. \qquad (33)$$
Let $l^{t,\mu} = \arg\min_{l\in Z_\rho}\max_{h\in[H]}\|f^{t,\mu}_h - l_h\|_\infty$. By the definition of $Z_\rho$, we have $\sum_{k=1}^{t-1} X_{t,k}(h,l^{t,\mu},\mu) \le O(V_{\max}t\rho + \beta)$; combining this with (32) bounds $\sum_{k=1}^{t-1}\big[(l^{t,\mu}_h - T^{\mu,\nu^t}_h l^{t,\mu}_{h+1})(s^k_h,a^k_h,b^k_h)\big]^2$. (34)

Step 1: Prove optimism. First we show that the constructed sets $B_{D^r_{1:t-1}}(\mu)$ and $B_{D^g_{1:t-1}}(\mu)$ are not vacuous, in the sense that the true action-value functions $Q^\mu_r$ and $Q^\mu_g$ belong to them with high probability:

Lemma 6. With probability at least $1-\delta/4$, we have for all $t\in[K]$ and $\mu\in\Pi$, $Q^\mu_r \in B_{D^r_{1:t-1}}(\mu)$ and $Q^\mu_g \in B_{D^g_{1:t-1}}(\mu)$.

Proof. The proof is almost the same as that of Lemma 2 and is thus omitted.

Then since $\widehat V^t_r(\mu) = \max_{f\in B_{D^r_{1:t-1}}(\mu)} f(s_1,\mu)$, we know for all $t\in[K]$ and $\mu\in\Pi$ that $\widehat V^t_r(\mu) \ge Q^\mu_r(s_1,\mu) = V^\mu_{r,1}(s_1)$. Similarly, $\widehat V^t_g(\mu) \ge V^\mu_{g,1}(s_1)$.

Step 2: Bound the estimation error. Next we need to show the estimation error is small. Proof. The proof is almost the same as that of Lemma 3 and is thus omitted.
Besides, using the performance difference lemma we can relate $\widehat V^t_r(\mu^t) - V^{\mu^t}_{r,1}(s_1)$ and $\widehat V^t_g(\mu^t) - V^{\mu^t}_{g,1}(s_1)$ to Bellman residuals (proof omitted):

Lemma 8. For any $t\in[K]$, we have $\widehat V^t_r(\mu^t) - V^{\mu^t}_{r,1}(s_1) = \sum_{h=1}^H \mathbb E_{\mu^t}\big[(f^{t,\mu^t,r}_h - T^{\mu^t,r}_h f^{t,\mu^t,r}_{h+1})(s_h,a_h)\big]$, and analogously for the utility $g$.

In fact, updating the dual variable $Y_t$ with projected gradient descent guarantees the following:

Lemma 9. Suppose the events in Lemma 6 hold. Then
$$-\sum_{t=1}^K Y_t\big(\widehat V^t_g(\mu^*_{\mathrm{CMDP}}) - \widehat V^t_g(\mu^t)\big) \le \frac{\alpha H^2 K}2 = \frac{H^2\sqrt K}2.$$
Proof. See Appendix L.2.

Lemma 10 gives, for any $Y\in[0,\chi]$, $\sum_{t=1}^K (Y - Y_t)\big(b - \widehat V^t_g(\mu^t)\big) \le \frac{(H^2+\chi^2)\sqrt K}2$. Substituting Lemma 10 into (38), noticing that $b \le V_{\max}$, and combining with (37), we have
$$\sum_{t=1}^K\big(V^{\mu^*_{\mathrm{CMDP}}}_{r,1}(s_1) - V^{\mu^t}_{r,1}(s_1)\big) + Y\sum_{t=1}^K\big(b - V^{\mu^t}_{g,1}(s_1)\big) \le O\Big(\frac{H^2}{\lambda^2_{sla}} + \frac{H^3}{\lambda_{sla}}\sqrt{K\cdot\mathrm{BEE}}\Big),$$
where $\mathrm{BEE} = \max\big\{\dim_{BEE}(F^r,1/K,\Pi)\log\big(N_{F^r\cup G^r}(H/K)KH|\Pi|/\delta\big),\ \dim_{BEE}(F^g,1/K,\Pi)\log\big(N_{F^g\cup G^g}(H/K)KH|\Pi|/\delta\big)\big\}$. Choosing
$$Y = \begin{cases} 0 & \text{if } \sum_{t=1}^K\big(b - V^{\mu^t}_{g,1}(s_1)\big) < 0,\\ \chi & \text{otherwise},\end{cases}$$
we can bound the sum of the regret and the constraint violation as
$$\Big(V^{\mu^*_{\mathrm{CMDP}}}_{r,1}(s_1) - \frac 1 K\sum_{t=1}^K V^{\mu^t}_{r,1}(s_1)\Big) + \chi\Big[b - \frac 1 K\sum_{t=1}^K V^{\mu^t}_{g,1}(s_1)\Big]_+ \le O\Big(\Big(H^2 + \frac H{\lambda_{sla}}\Big)\sqrt{\frac{\mathrm{BEE}}K}\Big).$$
This concludes the proof. Moreover, rearranging the display in the proof of Lemma 1 gives $Y^* \le \frac{V^{\mu^*_{\mathrm{CMDP}}}_{r,1}(s_1) - V^{\bar\mu}_{r,1}(s_1)}{\lambda_{sla}} \le \frac H{\lambda_{sla}}$, which concludes that proof.

L.2 PROOF OF LEMMA 9

Notice that
$$0 \le Y^2_{K+1} = \sum_{t=1}^K\big(Y^2_{t+1} - Y^2_t\big) = \sum_{t=1}^K\Big(\mathrm{Proj}_{[0,\chi]}\big(Y_t + \alpha(b - \widehat V^t_g(\mu^t))\big)^2 - Y^2_t\Big) \le \sum_{t=1}^K\Big(\big(Y_t + \alpha(b - \widehat V^t_g(\mu^t))\big)^2 - Y^2_t\Big) = \sum_{t=1}^K 2\alpha Y_t\big(b - \widehat V^t_g(\mu^t)\big) + \sum_{t=1}^K \alpha^2\big(b - \widehat V^t_g(\mu^t)\big)^2 \le \sum_{t=1}^K 2\alpha Y_t\big(\widehat V^t_g(\mu^*_{\mathrm{CMDP}}) - \widehat V^t_g(\mu^t)\big) + \alpha^2 K H^2,$$
where the last step is due to optimism and $V^{\mu^*_{\mathrm{CMDP}}}_{g,1}(s_1) \ge b$. This implies
$$-\sum_{t=1}^K Y_t\big(\widehat V^t_g(\mu^*_{\mathrm{CMDP}}) - \widehat V^t_g(\mu^t)\big) \le \frac{\alpha H^2 K}2 = \frac{H^2\sqrt K}2.$$
This concludes our proof.

L.3 PROOF OF LEMMA 10

Notice that for any $t\in[K]$ and $Y\in[0,\chi]$,
$$|Y_{t+1} - Y|^2 \le \big|Y_t + \alpha\big(b - \widehat V^t_g(\mu^t)\big) - Y\big|^2 = (Y_t - Y)^2 + 2\alpha\big(b - \widehat V^t_g(\mu^t)\big)(Y_t - Y) + \alpha^2\big(b - \widehat V^t_g(\mu^t)\big)^2 \le (Y_t - Y)^2 + 2\alpha\big(b - \widehat V^t_g(\mu^t)\big)(Y_t - Y) + \alpha^2 H^2.$$
Repeating this expansion, we have
$$0 \le |Y_{K+1} - Y|^2 \le (Y_1 - Y)^2 + 2\alpha\sum_{t=1}^K\big(b - \widehat V^t_g(\mu^t)\big)(Y_t - Y) + \alpha^2 H^2 K,$$
which is equivalent to
$$\sum_{t=1}^K\big(b - \widehat V^t_g(\mu^t)\big)(Y - Y_t) \le \frac 1{2\alpha}\big((Y_1 - Y)^2 + \alpha^2 H^2 K\big) \le \frac{(H^2+\chi^2)\sqrt K}2.$$
This concludes our proof.
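The projected-gradient argument above can be replayed numerically. In the sketch below, the utility estimates and the constants $K$, $H$, $\chi$, $b$ are synthetic assumptions; the code runs the update $Y_{t+1} = \mathrm{Proj}_{[0,\chi]}(Y_t + \alpha(b - \widehat V^t_g(\mu^t)))$ and checks the Lemma 10-style inequality for a grid of comparators $Y$.

```python
import numpy as np

# Projected-gradient-ascent sketch of the dual update:
# Y_{t+1} = Proj_[0, chi](Y_t + alpha * (b - Vg_t)), then verify
# sum_t (b - Vg_t) * (Y - Y_t) <= (Y_1 - Y)^2/(2*alpha) + alpha*H^2*K/2
# for every comparator Y in [0, chi]. All quantities here are synthetic.

rng = np.random.default_rng(2)
K, H, chi, b = 200, 4.0, 3.0, 0.5
alpha = 1.0 / np.sqrt(K)
vg = rng.uniform(0, H, size=K)  # stand-ins for the estimates Vhat_g^t(mu^t)

Y, Ys = 0.0, []
for g in vg:
    Ys.append(Y)
    Y = min(max(Y + alpha * (b - g), 0.0), chi)  # projection onto [0, chi]

ok = True
for Y_cmp in np.linspace(0, chi, 7):
    lhs = sum((b - g) * (Y_cmp - y) for g, y in zip(vg, Ys))
    rhs = (0.0 - Y_cmp) ** 2 / (2 * alpha) + alpha * H**2 * K / 2
    ok = ok and (lhs <= rhs + 1e-9)
print(ok)  # True: the bound holds for every comparator on the grid
```

This is exactly the telescoping inequality in the proof: expanding $|Y_{t+1}-Y|^2$ and summing gives the right-hand side, independently of how the $\widehat V^t_g(\mu^t)$ sequence behaves.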

L.4 PROOF OF LEMMA 11

First we extend $\Pi$ in a natural way that makes the policy class more structured without changing its optimal policy. Define the set of state-action visitation distributions induced by the policy class $\Pi$, denoted $P_\Pi$, as below. Notice that there exists a one-to-one mapping between state-action visitation distributions and policies [41]. Let $\mathrm{conv}(\Pi)$ denote the policy class that induces $\mathrm{conv}(P_\Pi)$; then there exists $\bar\mu$ such that $d = d^{\bar\mu}$, which implies $V^{\bar\mu}_{r,1}(s_1) = \frac 1 K\sum_{t=1}^K V^{\mu^t}_{r,1}(s_1)$ and $V^{\bar\mu}_{g,1}(s_1) = \frac 1 K\sum_{t=1}^K V^{\mu^t}_{g,1}(s_1)$. Therefore, the condition of this lemma says $\big(V^{\mu^*_{\mathrm{CMDP}}}_{r,1}(s_1) - V^{\bar\mu}_{r,1}(s_1)\big) + C^*\big[b - V^{\bar\mu}_{g,1}(s_1)\big]_+ \le \delta$. Next we show that $\mu^*_{\mathrm{CMDP}}$ is still the optimal policy in $\mathrm{conv}(\Pi)$ when Assumption 6, i.e., strong duality, holds. First notice that



Further, let $D(Y) := \max_{\mu\in\Pi} L_{\mathrm{CMDP}}(\mu, Y)$ denote the dual function, and suppose the optimal dual variable is $Y^* = \arg\min_{Y\ge 0} D(Y)$. To ensure that $Y^*$ is bounded, we assume the standard Slater condition holds: there exist $\lambda_{sla} > 0$ and $\bar\mu \in \Pi$ such that $V^{\bar\mu}_{g,1}(s_1) \ge b + \lambda_{sla}$.

$X_h = \{\phi_h(s,a,b) : (s,a,b) \in S\times A\times B\}$. Then the following proposition shows that the covering number of $F_h$ is upper bounded via the effective dimension of the kernel MG:

Proposition 2. If the kernel MG has effective dimension $d(\epsilon)$, then $\log N_{F_h}(\epsilon) \le O\big(d(\epsilon/2H)\log(1 + Hd(\epsilon/2H)/\epsilon)\big)$.

H.1 KERNEL MGS

Consider the kernel MG defined in Definition 10 and $F_h = \{\phi_h(\cdot)^\top w \mid w \in B_{\mathcal H}(H-h+1)\}$. We have the following proposition showing that the BEE dimension of a kernel MG is upper bounded by its effective dimension (Definition 11):

Proposition 3. If the kernel MG has effective dimension $d(\epsilon)$, then for any policy classes $\Pi$ and $\Pi'$, we have $\dim_{BEE}(F,\epsilon,\Pi,\Pi') \le d(\epsilon/2H)$.

Therefore we have $\dim_E(F_h,\epsilon) \le d(\epsilon/2H)$ for all $h\in[H]$, which implies $\dim_{BEE}(F,\epsilon,\Pi,\Pi') \le d(\epsilon/2H)$. This concludes our proof.

Tabular MGs. Tabular MGs are a special case of kernel MGs where the feature vectors are $|S||A||B|$-dimensional one-hot vectors. From the standard elliptical potential lemma, we know $d(\epsilon) = O(|S||A||B|)$ for tabular MGs, so their BEE dimension is also upper bounded by $O(|S||A||B|)$.

Linear MGs. When the feature vectors are $d$-dimensional, we recover linear MGs. Similarly, by the standard elliptical potential lemma, the BEE dimension of linear MGs is upper bounded by $O(d)$.

CMDP ), we have for anyY ∈ [0, χ ], -V µ t r,1 (s 1 )) + YKd BEE,r log (N cov,r |Π|/δ) .

PROOF OF LEMMA 1

Notice that
$$D(Y^*) = V^{\mu^*_{\mathrm{CMDP}}}_{r,1}(s_1) \quad\text{and}\quad D(Y^*) \ge L_{\mathrm{CMDP}}(\bar\mu, Y^*) = V^{\bar\mu}_{r,1}(s_1) + Y^*\big(V^{\bar\mu}_{g,1}(s_1) - b\big) \ge V^{\bar\mu}_{r,1}(s_1) + Y^*\lambda_{sla}.$$

$P_\Pi = \{(d^\mu_h(s,a))_{h\in[H], s\in S, a\in A} \in (\Delta_{|S|\times|A|})^H : \mu\in\Pi\}$. Let $\mathrm{conv}(P_\Pi)$ denote the convex hull of $P_\Pi$, i.e., for any $d\in\mathrm{conv}(P_\Pi)$ there exist weights $\{w_\mu\}_{\mu\in\Pi} \ge 0$ summing to one such that for any $h\in[H]$, $s\in S$, $a\in A$, we have $d_h(s,a) = \sum_{\mu\in\Pi} w_\mu d^\mu_h(s,a)$. As a special case, there exists $d\in\mathrm{conv}(P_\Pi)$ such that for any $h\in[H]$, $s\in S$, $a\in A$, $d_h(s,a) = \frac 1 K\sum_{t=1}^K d^{\mu^t}_h(s,a)$.

CMDP (d, Y).

However, given $Y\ge 0$, $L_{\mathrm{CMDP}}(d, Y)$ is linear in $d$, which means the maximum is always attained at a vertex of $\mathrm{conv}(P_\Pi)$, i.e., in $P_\Pi$. Therefore we know $\max_{\mu\in\mathrm{conv}(\Pi)} L_{\mathrm{CMDP}}(\mu, Y) = D(Y)$,

Assumption 6 (Strong duality). Assume strong duality holds for Problem 1, i.e., (12). Remark 6. One example where strong duality (12) holds is when the policy class $\Pi$ satisfies global realizability. Let $\mu^*_{glo}$ denote the solution to $\max_{\mu_h(\cdot|s)\in\Delta_A}\min_{Y\ge 0} L_{\mathrm{CMDP}}(\mu, Y)$. [13] shows that $\max_{\mu\in(\Delta_A)^{|S|H}}\min_{Y\ge 0} L_{\mathrm{CMDP}}(\mu, Y)$ satisfies strong duality, and thus as long as $\mu^*_{glo}\in\Pi$, Problem 1 also enjoys strong duality.
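Strong duality on the convexified class can be illustrated with a two-policy toy problem. All numbers below are made up for the demonstration, and the grids are only fine enough to contain the exact saddle point; this is a sketch of the $\max\min = \min\max$ phenomenon, not the paper's CMDP.

```python
import numpy as np

# Toy strong-duality check: maximize r(mu) s.t. g(mu) >= b over mixtures of two
# base policies, with Lagrangian L(mu, Y) = r(mu) + Y * (g(mu) - b). On the
# convex hull, the primal value max_mu min_Y L equals the dual min_Y max_mu L.

r = np.array([1.0, 0.0])   # rewards of the two base policies (illustrative)
g = np.array([0.0, 1.0])   # utilities of the two base policies (illustrative)
b = 0.5

ws = np.linspace(0, 1, 201)   # mixture weight on policy 1
Ys = np.linspace(0, 10, 201)  # dual variable grid

def L(w, Y):
    return w * r[0] + (1 - w) * r[1] + Y * (w * g[0] + (1 - w) * g[1] - b)

primal = max(min(L(w, Y) for Y in Ys) for w in ws)
dual = min(max(L(w, Y) for w in ws) for Y in Ys)
print(round(primal, 3), round(dual, 3))  # 0.5 0.5: the saddle point coincides
```

Over the two base policies alone (without mixing) the primal value would drop to 0, which is precisely why the proof of Lemma 11 passes to $\mathrm{conv}(\Pi)$ before invoking strong duality.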

Comparison with related works on decentralized learning with an adversarial opponent. Here General FA means general function approximation setting. From the table we can see that Tian et al.


By Freedman's inequality, with probability at least 1 -δ/4, we have

$V^{\mu^t}_{r,1}(s_1)$ and $V^{\mu^t}_{g,1}(s_1)$ are small. Let $f^{t,\mu,r} = \arg\max_{f\in B_{D^r_{1:t-1}}(\mu)} f(s_1,\mu)$ and $f^{t,\mu,g} = \arg\max_{f\in B_{D^g_{1:t-1}}(\mu)} f(s_1,\mu)$. Then we have:

Lemma 7. With probability at least $1-\delta/4$, we have for all $t\in[K]$, $h\in[H]$ and $\mu\in\Pi$,

r f t,µ t ,r h+1 )(s h , a h )], -T µ t ,g f t,µ t ,g h+1 )(s h , a h )]. Kd BEE,r log (N cov,r |Π|/δ) .

2. Substituting Lemma 9 into (38), we can obtain the bound on $\mathrm{Regret}(K)$:
$$\sum_{t=1}^K\big(V^{\mu^*_{\mathrm{CMDP}}}_{r,1}(s_1) - V^{\mu^t}_{r,1}(s_1)\big) \le O\Big(H^2 + \frac{H^2}{\lambda_{sla}}\sqrt{K\,d_{BEE,r}\log\big(N_{cov,r}|\Pi|/\delta\big)}\Big).$$
Step 4: Constraint violation analysis. Next we need to bound the constraint violation. First notice the following lemma, whose proof is deferred to Appendix L.3:

Lemma 10. For any $Y\in[0,\chi]$, we have

$_{g,1}(s_1)$. Further, when Assumption 6 and Assumption 7 hold, we have the following lemma showing that an upper bound on $\big(V^{\mu^*_{\mathrm{CMDP}}}_{r,1}(s_1) - \cdots\big)_+$ implies an upper bound on $\big[b - \cdots\big]_+$:

Lemma 11. Suppose Assumption 6 and Assumption 7 hold and $2Y^* \le C^*$. If $\{\mu^t\}_{t=1}^K \subseteq \Pi$ satisfies ... (see Appendix L.4 for the proof). Combining Lemma 11, Lemma 1 and (39), we have


Combining (33) and (34), we obtain a bound on $\sum_{k=1}^{t-1}\big[(l^{t,\mu}_h - T^{\mu,\nu^t}_h l^{t,\mu}_{h+1})(s^k_h,a^k_h,b^k_h)\big]^2$, which implies the corresponding bound for $f^{t,\mu}$. Choosing $\rho = V_{\max}/K$ yields (b). For (a), simply let $\mathcal F_{k,h}$ be the filtration induced by the history up to episode $k-1$ together with $\mu^k$, and repeat the above arguments, which concludes our proof.

J.3 PROOF OF LEMMA 4

First notice that $\widehat V^t(\mu^t) = f^{t,\mu^t}_1(s_1,\mu^t,\nu^t)$. Therefore, expanding one step of the Bellman operator and taking expectations under $\pi^t$, then repeating the procedure for $h = 1,\cdots,H$, we obtain Lemma 4. This concludes our proof.

K PROOF OF COROLLARY 1

From Theorem 2, we have with probability at least $1-\delta$, for all $i\in[n]$, the stated regret bound. By the definition of $\bar\pi$, this is equivalent to the CCE condition, where $\mu_{-i}$ is uniformly sampled from $\{\mu^t_{-i}\}_{t=1}^K$ and is thus the marginal distribution of $\bar\pi$ over the agents other than $i$. Therefore, by the definition of CCE in (2), $\bar\pi$ is an $\epsilon$-approximate CCE with probability at least $1-\delta$, which concludes our proof.

L PROOF OF THEOREM 3

In this section we present the proof of Theorem 3. Our proof mainly consists of four steps:
• Prove that $\widehat V^t_r(\mu)$ and $\widehat V^t_g(\mu)$ are optimistic estimates of $V^\mu_{r,1}(s_1)$ and $V^\mu_{g,1}(s_1)$ for all $t\in[K]$ and $\mu\in\Pi$.
• Bound the total estimation error.
• Bound the regret by decomposing it into the estimation error and the online learning error induced by Hedge.
• Bound the constraint violation by strong duality.

Similar to Section I, from Lemma 5, conditioning on the event in Lemma 7, we have with probability at least $1-\delta/4$ the analogous estimation-error bounds; substituting them into (35) yields the bound for the reward, and similarly for the utility. Step 3: Bound the regret. We first decompose the fictitious total regret into the terms in (36), analogous to (22). From Lemma 6, we know term (1) $\le 0$. Since $p^t$ is updated using Hedge, term (2) is bounded with probability at least $1-\delta/4$. Finally, Step 2 has bounded term (4) in (36). By strong duality, combining (41), (42) and (43), all the inequalities must hold with equality. Besides, strong duality also holds for $\max_{\mu\in\mathrm{conv}(\Pi)}\min_{Y\ge 0} L_{\mathrm{CMDP}}(\mu, Y)$, where the third step comes from strong duality. Therefore, for any $\mu\in\mathrm{conv}(\Pi)$ and $\tau\in\mathbb R$ satisfying $V^\mu_{g,1}(s_1) \ge b + \tau$, we can compare $V^\mu_{r,1}(s_1)$ against the optimum. On the other hand, (40) is equivalent to the desired inequality, which concludes our proof.

M PROOF OF THEOREM 4

In this section we present the proof of Theorem 4. Our proof mainly consists of four steps.

Step 1: Prove pessimism. First we show that the true action-value function $Q^\mu$ belongs to the constructed set $B_{D_{1:t-1}}(\mu)$ with high probability:

Lemma 12. With probability at least $1-\delta/4$, we have for all $t\in[K]$ and $\mu\in\Pi$, $Q^\mu \in B_{D_{1:t-1}}(\mu)$.

Proof. Repeat the arguments in the proof of Lemma 2 for each dimension $j\in[d]$; the lemma follows directly.

Then since $\widehat V^t(\mu) = f_1(s_1,\mu)$ where $f = \arg\min_{f\in B_{D_{1:t-1}}(\mu)}\langle f_1(s_1,\mu), \theta^t\rangle$, the corresponding pessimism property holds for all $t\in[K]$ and $\mu\in\Pi$.

Step 2: Bound the estimation error. Next we show the estimation error is small. Let $f^{t,\mu,j}$ denote the $j$-th dimension of $f^{t,\mu}$. Then we have:

Lemma 13. With probability at least $1-\delta/4$, the analogous Bellman-residual bounds hold for all $t\in[K]$, $h\in[H]$, $j\in[d]$ and $\mu\in\Pi$.

Proof. Repeat the arguments in the proof of Lemma 3 for each dimension $j\in[d]$; the lemma follows directly.

Besides, using the performance difference lemma, we have:

Lemma 14. For any $t\in[K]$ and $j\in[d]$, the $j$-th dimension $\widehat V^{t,j}(\mu^t)$ of $\widehat V^t(\mu^t)$ satisfies the analogous Bellman-residual identity.

Therefore, from Lemma 14 we obtain the per-dimension decomposition for any $t\in[K]$ and $j\in[d]$. Similar to Section I, from Lemma 5, conditioning on the event in Lemma 13, with probability at least $1-\delta/4$ we have the corresponding bound for any $j\in[d]$ and $h\in[H]$; substituting into (44) bounds the estimation error for each $j\in[d]$ whenever the event in Lemma 13 holds.

Step 3: Bound the distance. Now we can bound the distance $\mathrm{dist}(V^{\bar\mu}(s_1), C)$. First, since $\bar\mu$ is sampled uniformly from $\{\mu^t\}_{t=1}^K$, we can write $V^{\bar\mu}(s_1)$ as the average of the $V^{\mu^t}(s_1)$. By Fenchel duality, $\mathrm{dist}(V^{\bar\mu}(s_1), C) = \max_{\|\theta\|\le 1}\big(\langle\theta, V^{\bar\mu}(s_1)\rangle - \max_{x\in C}\langle\theta, x\rangle\big)$. By the Cauchy-Schwarz inequality and Step 2, the estimation error enters this expression linearly. Recall that we update $\theta^t$ using online gradient descent; using conclusions from the online learning literature [20], the resulting online learning error is bounded. Further, since $p^t$ is updated via Hedge with loss function $\langle\theta^t, \widehat V^t(\mu)\rangle$, similarly to the analysis in Section I we have, with probability at least $1-\delta$, the stated bound, where $\mu^*_{\mathrm{VMDP}} = \arg\min_{\mu\in\Pi}\mathrm{dist}(V^\mu_1(s_1), C)$. Combining the above bounds concludes our proof.
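The Fenchel-duality expression for $\mathrm{dist}(\cdot, C)$ used in Step 3 can be checked on a toy target set. Below, $C$ is the unit box in $\mathbb R^2$, the point and grid resolution are arbitrary assumptions, and the dual value is approximated by scanning unit directions $\theta$.

```python
import numpy as np

# Fenchel-duality view of the distance to a convex set:
# dist(x, C) = max_{||theta||<=1} ( <theta, x> - max_{c in C} <theta, c> ).
# For C the unit box [0,1]^2, the inner max (support function) is
# sum_i max(theta_i, 0), so we can compare the dual value against the
# direct Euclidean distance via projection.

x = np.array([2.0, 0.5])
proj = np.clip(x, 0.0, 1.0)        # Euclidean projection onto the box
direct = np.linalg.norm(x - proj)  # the primal distance

angles = np.linspace(0, 2 * np.pi, 5000, endpoint=False)
best = -np.inf
for a in angles:
    theta = np.array([np.cos(a), np.sin(a)])
    support = np.clip(theta, 0, None).sum()  # max over the box of <theta, c>
    best = max(best, theta @ x - support)
print(round(direct, 3), round(best, 3))  # 1.0 1.0: dual value matches the distance
```

The maximizing direction is $(x - \mathrm{Proj}_C(x))/\|x - \mathrm{Proj}_C(x)\|$, which is exactly the gradient direction that the online gradient descent update on $\theta^t$ tracks in Theorem 4.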

