NEARLY MINIMAX OPTIMAL OFFLINE REINFORCEMENT LEARNING WITH LINEAR FUNCTION APPROXIMATION: SINGLE-AGENT MDP AND MARKOV GAME

Abstract

Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, minimax optimality has only been (nearly) established for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose a new pessimism-based algorithm for offline linear MDPs. At the core of our algorithm is an uncertainty decomposition via a reference function, which is new in the literature on offline RL under linear function approximation. Theoretical analysis demonstrates that our algorithm matches the performance lower bound up to logarithmic factors. We also extend our techniques to two-player zero-sum Markov games (MGs) and establish a new performance lower bound for MGs, which tightens the existing result and verifies the nearly minimax optimality of the proposed algorithm. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.

* The first two authors contribute equally.

1. INTRODUCTION

Reinforcement learning (RL) has achieved tremendous empirical success in both single-agent (Kober et al., 2013) and multi-agent scenarios (Silver et al., 2016; 2017). Two components play a critical role: function approximation and efficient simulators. For RL problems with a large (or even infinite) number of states, storing a table as in classical Q-learning is generally infeasible. In these cases, practical algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2015; 2017; Haarnoja et al., 2018) approximate the true value function or policy by a function class (e.g., neural networks). Meanwhile, an efficient simulator allows for learning a policy in an online trial-and-error fashion using millions or even billions of trajectories. However, due to the limited availability of data samples in many practical applications, e.g., healthcare (Wang et al., 2018) and autonomous driving (Pan et al., 2017), instead of collecting new trajectories, we may have to extrapolate knowledge only from past experiences, i.e., a pre-collected dataset. This type of RL problem is usually referred to as offline RL or batch RL (Lange et al., 2012; Levine et al., 2020).

An offline RL algorithm is usually measured by the sample complexity it needs to achieve the desired statistical accuracy. A line of works (Xie et al., 2021b; Shi et al., 2022; Li et al., 2022) demonstrates that near-optimal sample complexity is attainable in tabular single-agent MDPs. However, these algorithms cannot solve problems with large or infinite state spaces, where function approximation is involved. To the best of our knowledge, existing algorithms cannot attain the statistical limit even for linear function approximation, which is arguably the simplest function approximation setting. Specifically, for linear function approximation, Jin et al. (2021c) proposes the first efficient algorithm for offline linear MDPs, but their upper bound is suboptimal compared with the existing lower bounds in Jin et al. (2021c); Zanette et al. (2021). Recently, Yin et al. (2022) tries to improve the result by incorporating variance information in the algorithmic design of offline MDPs with linear function approximation. However, a careful examination reveals a technical gap, and some additional assumptions may be needed to fix it (cf. Section 3 and Section 5). Beyond single-agent MDPs, Zhong et al. (2022) studies Markov games (MGs) with linear function approximation and provides the only provably efficient algorithm, whose result is also suboptimal. Therefore, the following problem remains open:

Can we design computationally efficient offline RL algorithms for problems with linear function approximation that are nearly minimax optimal?

In this paper, we first answer this question affirmatively under linear MDPs (Jin et al., 2020) and then extend our results to two-player zero-sum Markov games (MGs) (Xie et al., 2020). Our contributions are summarized as follows:

• We identify an implicit and restrictive assumption required by existing approaches in the literature, which originates from omitting the complicated temporal dependency between different time steps. See Section 3 for a detailed explanation.

• We handle the temporal dependency by an uncertainty decomposition technique via a reference function, thus closing the gap to the information-theoretic lower bound without the restrictive independence assumption.
The uncertainty decomposition serves to avoid a √d-amplification of the value function error, and also the measurability issue that arises when incorporating variance information to improve the H-dependence, where d and H are the feature dimension and planning horizon, respectively. To the best of our knowledge, this technique is new in the literature on offline learning under linear function approximation.

• We further generalize the developed techniques to two-player zero-sum linear MGs (Xie et al., 2020), thus demonstrating the broad adaptability of our methods. Meanwhile, we establish a new performance lower bound for MGs, which tightens the existing results and verifies the nearly minimax optimality of the proposed algorithm.

1.1 RELATED WORK

Due to the space limit, we defer a comprehensive review of related work to Appendix A.2 and focus here on the works that are most related to our problem setup and algorithmic designs.

Offline RL with Linear Function Approximation. Jin et al. (2021c) and Zhong et al. (2022) provide the first results for offline linear MDPs and two-player zero-sum linear MGs, respectively. However, their algorithms are based on Least-Squares Value Iteration (LSVI) and establish pessimism by adding bonuses at every time step, thus suffering from a √d-amplification relative to the lower bound (Zanette et al., 2021; Zhong et al., 2022). The amplification results from the statistical dependency between different time steps. Subsequently, Min et al. (2021) studies offline policy evaluation (OPE) in linear MDPs under an additional independence assumption that the data samples between different time steps are independent, thus circumventing this core issue. Yin et al. (2022) studies policy optimization in linear MDPs, which also (implicitly) requires the independence assumption. Another line of work addresses the error amplification from temporal dependency with different algorithmic designs. Zanette et al. (2020) designs an actor-critic-based algorithm and establishes pessimism via direct perturbations of the parameter vectors in a linear function approximation scheme. Xie et al. (2021a); Uehara and Sun (2021) establish pessimism only at the initial state, but at the expense of computational tractability. These algorithmic ideas are fundamentally different from ours and do not apply to LSVI-type algorithms. We compare the proposed algorithms with them in Section 5.

3. PRELIMINARIES AND TECHNICAL CHALLENGES

Notations. Given a positive semi-definite matrix Λ and a vector u, we denote √(u^⊤Λu) as ‖u‖_Λ. The 2-norm of a vector w is ‖w‖₂. We denote by λ_min(A) the smallest eigenvalue of a matrix A. The subscript in f(x)_{[0,M]} means that we clip the value f(x) to the range [0, M], i.e., f(x)_{[0,M]} = max{0, min{M, f(x)}}. Given a set X, we denote the set of probability measures on it by ∆_X. We use the shorthand notations φ_h = φ(x_h, a_h), φ_h^τ = φ(x_h^τ, a_h^τ), r_h = r_h(x_h, a_h), and r_h^τ = r_h(x_h^τ, a_h^τ) (formally defined below). With a slight abuse of notation, we also use similar shorthands (e.g., φ_h = φ(x_h, a_h, b_h)) for MGs, which shall be clear from the context. Y ≲ X means Y ≤ CX for some constant C > 0. To improve readability, we also provide a summary of notations in Appendix A.

Markov Decision Process. We consider an episodic MDP, denoted as M(S, A, H, P, r), where S and A are the state and action spaces, H is the episode length, and P = {P_h}_{h=1}^H and r = {r_h}_{h=1}^H are the state transition kernels and reward functions, respectively. For each h ∈ [H], P_h(·|x, a) is the distribution of the next state given the state-action pair (x, a) at step h, and r_h(x, a) ∈ [0, 1] is the deterministic reward given the state-action pair (x, a) at step h.

Policy and Value Function.
A policy π = {π_h}_{h=1}^H is a collection of mappings from a state x ∈ S to a distribution over the action space, π_h(·|x) ∈ ∆_A. For any policy π, we define the Q-value function Q_h^π(x, a) = E_π[Σ_{h′=h}^H r_{h′}(x_{h′}, a_{h′}) | (x_h, a_h) = (x, a)] and the V-value function V_h^π(x) = E_π[Σ_{h′=h}^H r_{h′}(x_{h′}, a_{h′}) | x_h = x], where a_{h′} ∼ π_{h′}(·|x_{h′}) and x_{h′+1} ∼ P_{h′}(·|x_{h′}, a_{h′}). For any function V : S → R, we denote the conditional mean as (P_h V)(x, a) := Σ_{x′∈S} P_h(x′|x, a)V(x′) and the conditional variance as [Var_h V](x, a) := [P_h V²](x, a) − ([P_h V](x, a))². The Bellman operator is defined as (T_h V)(x, a) := r_h(x, a) + (P_h V)(x, a). We consider MDPs whose rewards and transitions possess a linear structure (Jin et al., 2020).

Definition 1 (Linear MDP). MDP(S, A, H, P, r) is a linear MDP with a (known) feature map φ : S × A → R^d if, for any h ∈ [H], there exist d unknown signed measures µ_h = (µ_h^{(1)}, ..., µ_h^{(d)}) over S and an unknown vector θ_h ∈ R^d such that for any (x, a) ∈ S × A, we have P_h(·|x, a) = ⟨φ(x, a), µ_h(·)⟩ and r_h(x, a) = ⟨φ(x, a), θ_h⟩. Without loss of generality, we assume that ‖φ(x, a)‖ ≤ 1 for all (x, a) ∈ S × A, and max{‖µ_h(S)‖, ‖θ_h‖} ≤ √d for all h ∈ [H].

For linear MDPs, we have the following result, whose proof can be found in Jin et al. (2020).

Lemma 1. For any function V : S → [0, V_max − 1] and h ∈ [H], there exist vectors β_h, w_h ∈ R^d with ‖β_h‖, ‖w_h‖ ≤ 2√d·V_max such that the conditional expectation and the Bellman equation are both linear in the feature: for all (x, a) ∈ S × A,

(P_h V)(x, a) = φ(x, a)^⊤β_h, and (T_h V)(x, a) = φ(x, a)^⊤w_h. (1)

Offline RL. In offline RL, the algorithm needs to learn a near-optimal policy or to approximate a Nash equilibrium (NE) from a pre-collected dataset, without further interacting with the environment. We suppose that we have access to a batch dataset D = {(x_h^τ, a_h^τ, r_h^τ) : h ∈ [H], τ ∈ [K]}, where each trajectory is independently sampled by a behavior policy µ. The induced distribution of the state-action pair at step h is denoted as d_h^b. We make the following standard dataset coverage assumption for offline RL with linear function approximation (Wang et al., 2020; Duan et al., 2020):

Assumption 1. We assume κ = min_{h∈[H]} λ_min(E_{d_h^b}[φ(x, a)φ(x, a)^⊤]) > 0 for MDPs.

This assumption requires the behavior policy to explore the state-action space well. It is not information-theoretically necessary, as Jin et al. (2021c) does not require it; however, we make it so that we can employ the variance information, which seems challenging otherwise. We remark that the assumption is also made by the existing works that employ variance information for offline RL with linear function approximation (Min et al., 2021; Yin et al., 2022).

Pessimistic Value Iteration (PEVI) is proposed in the seminal work of Jin et al. (2021c). It constructs value function estimates {V̂_h(·)}_{h=1}^H and Q-value estimates {Q̂_h(·,·)}_{h=1}^H backward from h = H to h = 1, with the initialization V̂_{H+1} = 0. Specifically, given V̂_{h+1}, PEVI approximates the Bellman equation by ridge regression:

Q̂_h(x, a) ← (T̂_h V̂_{h+1})(x, a) − β‖φ(x, a)‖_{Λ_h^{-1}},

where the linear approximation (T̂_h V̂_{h+1})(·,·) = φ(·,·)^⊤ŵ_h is the solution of

ŵ_h = argmin_{w∈R^d} Σ_{τ∈D} (r_h^τ + V̂_{h+1}(x_{h+1}^τ) − (φ_h^τ)^⊤w)² + λ‖w‖₂² = Λ_h^{-1} Σ_{τ∈D} φ_h^τ (r_h^τ + V̂_{h+1}(x_{h+1}^τ)),

with Λ_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤ + λI_d. Here Γ_h(x, a) = β‖φ(x, a)‖_{Λ_h^{-1}} is a bonus function such that |(T̂_h V̂_{h+1} − T_h V̂_{h+1})(x, a)| ≤ Γ_h(x, a) for all (x, a) ∈ S × A with high probability.
Intuitively, pessimism means that we use a lower confidence bound: we subtract the bonus Γ_h(·,·) to penalize uncertainty, so that Q̂_h(x, a) ≤ (T_h V̂_{h+1})(x, a). If pessimism is achieved at all steps h ∈ [H], then we have the following key result (formally presented in Lemma 2):

V_1^*(x) − V_1^{π̂}(x) ≤ 2 Σ_{h=1}^H E_{π^*}[Γ_h(x_h, a_h) | x_1 = x].

Therefore, to establish a sharper suboptimality bound, it suffices to construct a smaller bonus function Γ_h that still ensures pessimism. There remains, however, a gap of Õ(√d·H) between the suboptimality bound of PEVI and the minimax lower bound in Zanette et al. (2021).

Self-normalized Process and Uniform Concentration. Given a function f_{h+1}, to construct the bonus function we can bound |T̂_h f_{h+1} − T_h f_{h+1}| as follows (formally presented in Lemma 3):

|(T̂_h f_{h+1})(x, a) − (T_h f_{h+1})(x, a)| ≲ ‖Σ_{τ∈D} φ(x_h^τ, a_h^τ)·ξ_h^τ(f_{h+1})‖_{Λ_h^{-1}} · ‖φ(x, a)‖_{Λ_h^{-1}} ≤ β‖φ(x, a)‖_{Λ_h^{-1}},

where (A) := ‖Σ_{τ∈D} φ(x_h^τ, a_h^τ)·ξ_h^τ(f_{h+1})‖_{Λ_h^{-1}} and ξ_h^τ(f_{h+1}) := r_h^τ + f_{h+1}(x_{h+1}^τ) − (T_h f_{h+1})(x_h^τ, a_h^τ). Bounding (A) is referred to as the concentration of a self-normalized process in the literature (Abbasi-Yadkori et al., 2011). For any fixed V_{h+1}, Lemma 9 ensures a high-probability upper bound of Õ(H√d) on (A). However, in the backward iteration, V̂_{h+1} is computed by ridge regression at the later steps [h+1, H] and thus inevitably depends on {x_{h+1}^τ}_{τ∈D}, which is also used to estimate the Bellman equation at step h. Consequently, the concentration inequality cannot be applied directly, since the martingale filtration in Lemma 9 is not well defined.

To resolve this measurability issue, the standard approach is to establish a uniform concentration result over an ε-covering of the following function class containing V̂_{h+1}:

V_{h+1} := { max_a {φ(·, a)^⊤w − β‖φ(·, a)‖_{Λ^{-1}}}_{[0, H−h+1]} : ‖w‖ ≤ R, β ∈ [0, B], Λ ⪰ λ·I }.

The self-normalized process is then bounded by the uniform bound over the ε-covering, plus some approximation error. Tuning the parameter ε > 0, we obtain ((B.20) of Jin et al. (2021c)):

‖Σ_{τ∈D} φ(x_h^τ, a_h^τ)·ξ_h^τ(V̂_{h+1})‖²_{Λ_h^{-1}} ≤ cH²·(log(H·N_{h+1}(ε)/δ) + d log(1 + K) + d²) ≤ cH²·(d² + d + d²),

where the three terms in the last bound correspond to the uniform concentration, the concentration for a fixed V, and the approximation error, respectively; here we use an upper bound on the covering number (Lemma 11) and omit all logarithmic terms for a clear comparison. Therefore, the uniform concentration incurs an extra √d from the logarithmic covering number log N_{h+1}(ε). The prior work Yin et al. (2022) omits the dependency between V̂_{h+1} and {(x_h^τ, a_h^τ, x_{h+1}^τ)}_{τ∈D}, thus circumventing the uniform concentration. To fix this gap, one might make the additional assumption that the dataset is independent across different time steps h, as in Min et al. (2021); however, this is not realistic in practice, because the behavior policy collects trajectories by playing whole episodes starting from the initial state.
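To make the construction above concrete, here is a minimal NumPy sketch of one backward step of PEVI-style pessimistic value iteration, with the ridge regression and bonus as described in this section. The function name and interface are ours for illustration, not from the paper; the confidence width `beta` is whatever scale the analysis prescribes (Õ(dH) for vanilla PEVI, Õ(√d·H) for LinPEVI-ADV in Section 4).

```python
import numpy as np

def pevi_step(Phi, rewards, V_next, beta, clip_max, lam=1.0):
    """One backward step of pessimistic value iteration with ridge regression.

    Phi      : (K, d) array, rows are phi(x_h^tau, a_h^tau)
    rewards  : (K,)  array of r_h^tau
    V_next   : (K,)  array of V_hat_{h+1}(x_{h+1}^tau)
    beta     : scalar confidence width for the bonus
    clip_max : clipping level H - h + 1
    """
    K, d = Phi.shape
    Lambda = Phi.T @ Phi + lam * np.eye(d)                       # Lambda_h
    w_hat = np.linalg.solve(Lambda, Phi.T @ (rewards + V_next))  # ridge solution
    Lambda_inv = np.linalg.inv(Lambda)

    def Q_pessimistic(phi_xa):
        # bonus Gamma_h(x, a) = beta * ||phi(x, a)||_{Lambda_h^{-1}}
        bonus = beta * np.sqrt(phi_xa @ Lambda_inv @ phi_xa)
        # lower-confidence Q estimate, clipped to [0, H - h + 1]
        return float(np.clip(phi_xa @ w_hat - bonus, 0.0, clip_max))

    return Q_pessimistic
```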

4. REFERENCE-ADVANTAGE DECOMPOSITION UNDER LINEAR FUNCTION APPROXIMATION

In this section, we introduce the reference-advantage decomposition under linear function approximation, which serves to avoid a √d amplification of the error due to uniform concentration.

We first define (P̂_h g_{h+1})(·,·) to be an estimator of the conditional expectation, obtained by setting r_h^τ = 0 in (T̂_h g_{h+1})(·,·). We observe that the following affine properties hold:

(T̂_h(f_{h+1} + g_{h+1}))(·,·) = φ(·,·)^⊤Λ_h^{-1} Σ_{τ∈D} φ_h^τ (r_h^τ + f_{h+1}(x_{h+1}^τ)) + φ(·,·)^⊤Λ_h^{-1} Σ_{τ∈D} φ_h^τ g_{h+1}(x_{h+1}^τ) = (T̂_h f_{h+1})(·,·) + (P̂_h g_{h+1})(·,·),

(T_h(f_{h+1} + g_{h+1}))(x, a) = ⟨φ(x, a), θ_h⟩ + ∫_S f_{h+1}(x′)⟨φ(x, a), dµ_h(x′)⟩ + ∫_S g_{h+1}(x′)⟨φ(x, a), dµ_h(x′)⟩ = (T_h f_{h+1})(x, a) + (P_h g_{h+1})(x, a).

Instead of directly bounding the uncertainty as in Eqn. (4), we now make the following decomposition:

(T̂_h V̂_{h+1})(x, a) − (T_h V̂_{h+1})(x, a)
= (T̂_h(V̂_{h+1} + V*_{h+1} − V*_{h+1}))(x, a) − (T_h(V̂_{h+1} + V*_{h+1} − V*_{h+1}))(x, a)
= [(T̂_h V*_{h+1})(x, a) − (T_h V*_{h+1})(x, a)] + [(P̂_h(V̂_{h+1} − V*_{h+1}))(x, a) − (P_h(V̂_{h+1} − V*_{h+1}))(x, a)], (5)

where the first bracket is the reference uncertainty, bounded by b_{0,h}(x, a), and the second bracket is the advantage uncertainty, bounded by b_{1,h}(x, a). We have the following key observations about the reference part and the advantage part.

Circumventing the Uniform Concentration for the Reference Function. As V*_{h+1} is deterministic, we can directly invoke standard concentration inequalities (Lemmas 3 and 9) to set b_{0,h}(x, a) = Õ(√d·H)‖φ(x, a)‖_{Λ_h^{-1}}, which avoids the √d amplification due to uniform concentration.

High-order Error from the Correlated Advantage Function. Although we still need the standard uniform concentration argument to analyze the advantage function, and thus obtain a suboptimal d-dependency there, under Assumption 1 a carefully crafted induction procedure shows that ‖V̂_{h+1} − V*_{h+1}‖_∞ = Õ(√d·H²/√(Kκ)). This much smaller range implies that we can invoke Lemma 9 to set b_{1,h}(x, a) = Õ(d^{3/2}H²/√(Kκ))‖φ(x, a)‖_{Λ_h^{-1}}, which is non-dominating when K ≥ Ω(d²H²/κ). The detailed proof can be found in Appendix D.

Based on the above reasoning, we conclude that we can set Γ_h(x, a) = Õ(√d·H)‖φ(x, a)‖_{Λ_h^{-1}} in the original PEVI algorithm (Jin et al., 2021c) and obtain a new algorithm, referred to as LinPEVI-ADV. Together with a refined analysis, we have the following theoretical guarantee.

Theorem 1 (LinPEVI-ADV). Under Assumption 1, if K > Ω(d²H²/κ) and we set λ = 1 and β₁ = Õ(√d·H) in Algorithm 2, then with probability at least 1 − δ, for any x ∈ S, we have

V_1^*(x) − V_1^{π̂}(x) ≤ Õ(√d·H) Σ_{h=1}^H E_{π^*}[‖φ(x_h, a_h)‖_{Λ_h^{-1}} | x_1 = x],

where Λ_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤ + λI_d.

We remark that LinPEVI-ADV shares the same pseudocode as PEVI, except for a √d-improvement in the choice of β. For completeness, we present the code in Algorithm 2.
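As a rough illustration of how the reference-advantage split changes only the bonus, the sketch below assembles Γ_h = b_{0,h} + b_{1,h} with the Θ-scalings from this section; the constants and names are ours and purely indicative, not the paper's implementation.

```python
import numpy as np

def linpevi_adv_bonus(phi_xa, Lambda_inv, d, H, K, kappa, iota):
    """Gamma_h = b_{0,h} + b_{1,h} from the reference-advantage decomposition.

    b0 covers the deterministic reference V*_{h+1} (no covering number needed),
    b1 covers the correlated advantage V_hat_{h+1} - V*_{h+1}; both constants
    are placeholders for the Theta-scalings in Section 4 (iota = log factor).
    """
    norm = np.sqrt(phi_xa @ Lambda_inv @ phi_xa)        # ||phi(x,a)||_{Lambda_h^{-1}}
    b0 = np.sqrt(d) * H * np.sqrt(iota) * norm          # reference uncertainty
    b1 = d ** 1.5 * H ** 2 * iota / np.sqrt(K * kappa) * norm  # advantage uncertainty
    # For K >= Omega(d^2 H^2 / kappa), b1 is dominated by b0, so the overall
    # bonus scales as O(sqrt(d) * H) * ||phi||_{Lambda^{-1}}.
    return b0 + b1
```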
Algorithm 1 LinPEVI-ADV+
1: Initialize: Input dataset D, β₂; V̂_{H+1}(·) = 0.
2: Construct the variance estimators {σ̂_h²(·,·)}_{h=1}^H via Eqn. (8) using an independent dataset D′.
3: for h = H, ..., 1 do
4: Σ̂_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤/σ̂_h²(x_h^τ, a_h^τ) + λI_d;
5: ŵ_h = Σ̂_h^{-1} Σ_{τ∈D} φ(x_h^τ, a_h^τ)(r_h^τ + V̂_{h+1}(x_{h+1}^τ))/σ̂_h²(x_h^τ, a_h^τ);
6: Γ_h(·,·) ← β₂‖φ(·,·)‖_{Σ̂_h^{-1}};
7: Q̂_h(·,·) ← {φ(·,·)^⊤ŵ_h − Γ_h(·,·)}_{[0, H−h+1]};
8: π̂_h(·|·) ← argmax_{π_h}⟨Q̂_h(·,·), π_h(·|·)⟩_A, V̂_h(·) ← ⟨Q̂_h(·,·), π̂_h(·|·)⟩_A;
9: end for
10: Output: π̂ = {π̂_h}_{h=1}^H.

Compared to Xie et al. (2021b). To the best of our knowledge, the reference-advantage decomposition is new in the literature on offline linear MDPs, but it is relatively well studied in tabular MDPs (Azar et al., 2017; Zanette and Brunskill, 2019; Zhang et al., 2020; Xie et al., 2021b). Among them, Xie et al. (2021b) studies offline tabular MDPs and is most related to ours. While we share similar algorithmic ideas in terms of the uncertainty decomposition, we comment on our differences as follows. First, in terms of algorithmic design and uncertainty estimation, Xie et al. (2021b) estimates the uncertainty of each state-action pair separately by counting its frequency. On the contrary, in the linear setting we deal with regression, and the state-action pairs are coupled with each other because the analysis of the self-normalized process involves the estimated covariance matrix, hence all the samples at step h (cf. Eqn. (4)). This brings distinct challenges to the linear setting, both in the analysis and in the coverage condition needed for a sharp bound. Finally, the theoretical analyses are distinctly different due to the previous points. For instance, we adopt a carefully crafted induction procedure to control the advantage function, while Xie et al. (2021b) further introduces dataset splitting techniques to handle the temporal dependency in the advantage function. Moreover, for linear MDPs, the way in which the variance is introduced is also different (see Section 5).

5. LEVERAGE THE VARIANCE INFORMATION

The variance-weighted ridge regression technique was introduced to the RL literature by Zhou et al. (2021) and later first adapted to offline RL under linear function approximation by Min et al. (2021); Yin et al. (2022). Specifically, given a fixed function f_{h+1} : S → [0, H−1] as the target, we perform the following variance-weighted ridge regression to estimate w_h:

argmin_{w∈R^d} Σ_{τ∈D} [(φ_h^τ)^⊤w − r_h^τ − f_{h+1}(x_{h+1}^τ)]²/σ̂_h²(x_h^τ, a_h^τ) + λ‖w‖₂² = Σ̂_h^{-1} Σ_{τ∈D} φ(x_h^τ, a_h^τ)·(r_h^τ + f_{h+1}(x_{h+1}^τ))/σ̂_h²(x_h^τ, a_h^τ),

where Σ̂_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤/σ̂_h²(x_h^τ, a_h^τ) + λI_d and σ̂_h²(·,·) ∈ [1, H²] is an independent variance estimator. In this case, we can bound the uncertainty by Lemma 3 as follows:

|T̂_h f_{h+1} − T_h f_{h+1}|(x, a) ≲ ‖Σ_{τ∈D} φ(x_h^τ, a_h^τ)·ξ_h^τ(f_{h+1})‖_{Σ̂_h^{-1}} · ‖φ(x, a)‖_{Σ̂_h^{-1}},

with ξ_h^τ(f_{h+1}) = (r_h^τ + f_{h+1}(x_{h+1}^τ) − (T_h f_{h+1})(x_h^τ, a_h^τ))/σ̂_h(x_h^τ, a_h^τ). The high-level idea is that the normalized ξ_h^τ(f_{h+1}) has conditional variance O(1), so a Bernstein-type inequality (Lemma 10) gives a bound of β₂ := Õ(√d) for the self-normalized process, instead of β₁ = Õ(H√d) from the Hoeffding-type one. Therefore, instead of depending on the planning horizon H explicitly, for β₂‖φ(x, a)‖_{Σ̂_h^{-1}} the H factor is "hidden" in the covariance matrix Σ̂_h. Furthermore, as Σ̂_h^{-1} ⪯ H²Λ_h^{-1}, the new bonus function is never worse, because β₂‖φ(x, a)‖_{Σ̂_h^{-1}} ≲ β₁‖φ(x, a)‖_{Λ_h^{-1}}. Combining this observation with (3), we conclude that PEVI with variance-weighted regression is superior as long as we can construct an independent and approximately accurate variance estimator σ̂_h²(·,·). To this end, we again carefully handle the temporal dependency, for two reasons: (i) we need to avoid the measurability issue described in Section 4; and (ii) the conditional variance of ξ_h^τ(f_{h+1}) would otherwise be hard to estimate. To further illustrate the idea, we now discuss the limitations of existing approaches when the temporal dependency is not omitted, thus motivating our corresponding modifications.

Limitation of the Existing Approach. In Yin et al. (2022), the variance estimator σ̂_h is constructed from the same dataset D, so the weighted features {φ(x_h^τ, a_h^τ)/σ̂_h(x_h^τ, a_h^τ)}_{τ=1}^K are dependent, and the concentration of the covariance matrix via Lemma 12 (Lemma C.3 of Yin et al. (2022)) does not apply. Moreover, a similar measurability issue arises from the statistical dependency between σ̂_h(·,·) and D_h, so the concentration of the self-normalized process also fails. Finally, the conditional variance of V̂_{h+1}(x_{h+1}^τ)/σ̂_h(x_h^τ, a_h^τ) is hard to control because the numerator and denominator are tightly coupled with each other.

Variance Estimator. Equipped with the reference-advantage decomposition, it suffices to focus on the reference function V*_{h+1}. If we can construct a σ̂_h²(·,·) that is independent of D, then the ξ_h^τ(V*_{h+1}) = (r_h^τ + V*_{h+1}(x_{h+1}^τ) − (T_h V*_{h+1})(x_h^τ, a_h^τ))/σ̂_h(x_h^τ, a_h^τ) will be independent of each other and easy to deal with. To this end, we first use an independent dataset D′ to run Algorithm 2 and construct {V̂′_h}_{h=1}^H (this only incurs a factor of 2 in the final sample complexity). Then, by Lemma 1, we know that there exist β_{h,1}, β_{h,2} ∈ R^d such that [P_h(V̂′_{h+1})²](x, a) = ⟨φ(x, a), β_{h,2}⟩ and ([P_h V̂′_{h+1}](x, a))² = ⟨φ(x, a), β_{h,1}⟩².
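Before constructing the moment estimates, here is a minimal sketch of the variance-weighted ridge regression above, assuming the weights σ̂² are already available; the names and interface are illustrative only.

```python
import numpy as np

def weighted_ridge(Phi, targets, sigma2, lam):
    """Variance-weighted ridge regression (Section 5).

    Phi     : (K, d) features phi(x_h^tau, a_h^tau)
    targets : (K,)  regression targets r_h^tau + f_{h+1}(x_{h+1}^tau)
    sigma2  : (K,)  variance estimates sigma_hat_h^2 in [1, H^2]
    Returns (w_hat, Sigma_hat): the weighted solution and covariance matrix.
    """
    K, d = Phi.shape
    Phi_w = Phi / sigma2[:, None]                  # each row divided by sigma^2
    Sigma_hat = Phi_w.T @ Phi + lam * np.eye(d)    # weighted covariance Sigma_hat_h
    w_hat = np.linalg.solve(Sigma_hat, Phi_w.T @ targets)
    return w_hat, Sigma_hat
```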
These two moment vectors can be approximated via ridge regression on the independent dataset D′:

β̂_{h,2} = argmin_{β∈R^d} Σ_{τ∈D′} [⟨φ(x_h^τ, a_h^τ), β⟩ − (V̂′_{h+1})²(x_{h+1}^τ)]² + λ‖β‖₂²,
β̂_{h,1} = argmin_{β∈R^d} Σ_{τ∈D′} [⟨φ(x_h^τ, a_h^τ), β⟩ − V̂′_{h+1}(x_{h+1}^τ)]² + λ‖β‖₂². (7)

We then employ the following variance estimator:

σ̂_h²(x, a) := max{1, [φ(x, a)^⊤β̂_{h,2}]_{[0,H²]} − ([φ(x, a)^⊤β̂_{h,1}]_{[0,H]})² − Õ(dH³/√(Kκ))}. (8)

With a proof essentially similar to that of PEVI, we can show that with high probability (cf. Lemma 5),

[V_h V*_{h+1}](x, a) − Õ(dH³/√(Kκ)) ≤ σ̂_h²(x, a) ≤ [V_h V*_{h+1}](x, a),

where [V_h V*_{h+1}](x, a) = max{1, [Var_h V*_{h+1}](x, a)} is the truncated conditional variance of V*_{h+1}(·). Therefore, σ̂_h²(·,·) defined in (8) approximates the conditional variance of ξ_h^τ(V*_{h+1}) well when K exceeds a threshold. Moreover, since the target function V*_{h+1} is deterministic, hence measurable, and σ̂_h is independent of D, we have

Var_{x_{h+1}^τ | x_h^{1:τ}, a_h^{1:τ}, x_{h+1}^{1:τ−1}}[V*_{h+1}(x_{h+1}^τ)/σ̂_h(x_h^τ, a_h^τ)] = Var_{x_{h+1}^τ | x_h^τ, a_h^τ}[V*_{h+1}(x_{h+1}^τ)/σ̂_h(x_h^τ, a_h^τ)] = Var_{x_{h+1}^τ | x_h^τ, a_h^τ}[V*_{h+1}(x_{h+1}^τ)]/σ̂_h²(x_h^τ, a_h^τ) ≈ O(1),

where x_h^{1:τ}, a_h^{1:τ}, x_{h+1}^{1:τ−1} is short for {(x_h^i, a_h^i) : 1 ≤ i ≤ τ} ∪ {x_{h+1}^i : 1 ≤ i ≤ τ−1}. Therefore, the high-level idea stated at the beginning of this section is realized by the variance estimator defined in (8). Combining the variance-weighted regression with the reference-advantage decomposition, we propose LinPEVI-ADV+ (Algorithm 1), which enjoys the following theoretical guarantee.

Theorem 2 (LinPEVI-ADV+). Under Assumption 1, if K ≥ Ω(d²H⁶/κ) and we set λ = 1/H² and β₂ = Õ(√d) in Algorithm 1, then with probability at least 1 − δ, for any x ∈ S, we have

V_1^*(x) − V_1^{π̂}(x) ≤ Õ(√d)·Σ_{h=1}^H E_{π^*}[‖φ(x_h, a_h)‖_{Σ_h^{*−1}} | x_1 = x],

where Σ_h^* = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤/[V_h V*_{h+1}](x_h^τ, a_h^τ) + λI_d.

Table 1: A comparison with existing results, where Λ_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤ + λI_d and Σ_h^* = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤/[V_h V*_{h+1}](x_h^τ, a_h^τ) + λI_d. The independence assumption refers to the assumption that the data samples are independent across different time steps h.

Method | Independence assumption | Suboptimality bound
Jin et al. (2021c) | No | Õ(dH)·Σ_{h=1}^H E_{π*}[‖φ(x_h, a_h)‖_{Λ_h^{-1}} | x_1 = x]
Yin et al. (2022) | Yes | Õ(√d)·Σ_{h=1}^H E_{π*}[‖φ(x_h, a_h)‖_{Σ_h^{*−1}} | x_1 = x]
Theorem 1 | No | Õ(√d·H)·Σ_{h=1}^H E_{π*}[‖φ(x_h, a_h)‖_{Λ_h^{-1}} | x_1 = x]
Theorem 2 | No | Õ(√d)·Σ_{h=1}^H E_{π*}[‖φ(x_h, a_h)‖_{Σ_h^{*−1}} | x_1 = x]

We also remark that, in the newer version of Jin et al. (2021c), they adopt a dataset-splitting technique that splits the dataset into H independent subsets. This technique essentially shares the same spirit as the independence assumption, at the expense of an additional H factor in the final sample complexity.
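For concreteness, a sketch of evaluating the estimator in Eqn. (8) at a single (x, a), assuming the two ridge estimates β̂_{h,1}, β̂_{h,2} have been fit on the independent dataset D′ and the downward correction is supplied as a precomputed scalar:

```python
import numpy as np

def sigma2_hat(phi_xa, beta1_hat, beta2_hat, H, correction):
    """Evaluate the variance estimator of Eqn. (8) at one state-action pair.

    beta1_hat, beta2_hat : ridge estimates of the first/second moments of
                           V'_{h+1}, fit on the independent dataset D'.
    correction           : the Tilde-O(d H^3 / sqrt(K kappa)) shift, passed in
                           as a given scalar (its constant is analysis-dependent).
    """
    second = np.clip(phi_xa @ beta2_hat, 0.0, H ** 2)  # [phi^T beta2]_{[0, H^2]}
    first = np.clip(phi_xa @ beta1_hat, 0.0, H)        # [phi^T beta1]_{[0, H]}
    return max(1.0, second - first ** 2 - correction)  # clip below at 1
```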
Interpretation of the result. Since [V_h V*_{h+1}](·,·) ∈ [1, H²], we know that Σ_h^{*−1} ⪯ H²Λ_h^{-1}. This implies that LinPEVI-ADV+ is never worse than LinPEVI-ADV, and thus also superior to PEVI. The improvement is strict when the problem reduces to the tabular setting. Specifically, let d_h^*(x, a) denote the visitation distribution of (x, a) at step h under the optimal policy π*. LinPEVI-ADV gives a bound Õ(√d·H Σ_{h,x,a} d_h^*(x, a)√(1/(K·d_h^b(x, a)))), which has a horizon dependence of H². On the other hand, LinPEVI-ADV+ gives a bound of the form Õ(√d Σ_{h,x,a} d_h^*(x, a)([V_h V*_{h+1}](x, a)/(K·d_h^b(x, a)))^{1/2}), which has a horizon dependence of H^{3/2} by the law of total variance. Meanwhile, LinPEVI-ADV+ enjoys the same rate as VAPVI (Yin et al., 2022) without the additional independence assumption on the offline dataset, as summarized in Table 1.

Compared to other methods. Zanette et al. (2020) and Xie et al. (2021a) also achieve a √d improvement, but their algorithmic ideas are fundamentally different from ours. Specifically, the actor-critic-based Zanette et al. (2020) establishes pessimism via direct perturbations of the parameter vectors, and Xie et al. (2021a) establishes pessimism only at the initial value and is only information-theoretic. We develop techniques to resolve the issue within the LSVI framework because it possesses an appealing feature: we can assign sample-dependent weights in the regression, thus obtaining a sharp H-dependence. To the best of our knowledge, no similar result is available in the other two frameworks. When rescaling the range of the V-value to [0, H], their bounds become suboptimal. Moreover, their bounds depend on the cardinality of the action space, while the LSVI-based one can deal with an infinite action space.

To further interpret our result, we state the following lower bound for offline linear MDPs.

Theorem 3 (Lower Bound for MDP). Fix the episode length H, dimension d, probability δ, and sample size K ≥ Ω(d⁴). There exists a class M of linear MDPs and an offline dataset D with |D| = K such that for any policy π̂, it holds with probability at least 1 − δ that, for some universal constant c > 0,

sup_{M∈M} E_M[V_1^*(x_1) − V_1^{π̂}(x_1)] ≥ c√d·Σ_{h=1}^H E_{π^*}‖φ(x_h, a_h)‖_{Σ_h^{*−1}}.

The lower bound matches Theorem 2, thus establishing the optimality of LinPEVI-ADV+ for sufficiently large K. We also remark that Yin et al. (2022) establishes the lower bound c√d·Σ_{h=1}^H ‖E_{π^*}φ(x_h, a_h)‖_{Σ_h^{*−1}}, which is smaller than our lower bound due to Jensen's inequality.
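The H^{3/2} horizon dependence claimed above follows from Cauchy-Schwarz and the law of total variance; in our notation (and using that rewards lie in [0, 1], so that V* ∈ [0, H]), the calculation reads:

```latex
% Cauchy--Schwarz over the H steps:
\sum_{h=1}^{H}\mathbb{E}_{\pi^*}\sqrt{[\mathbb{V}_h V^*_{h+1}](x_h,a_h)}
  \le \sqrt{H\sum_{h=1}^{H}\mathbb{E}_{\pi^*}[\mathbb{V}_h V^*_{h+1}](x_h,a_h)}\,,
% law of total variance: the per-step conditional variances of V^* accumulate
% (up to the clipping constant 1 per step) to the variance of the total return
% under \pi^*, which is at most H^2:
\sum_{h=1}^{H}\mathbb{E}_{\pi^*}[\mathrm{Var}_h V^*_{h+1}](x_h,a_h)
  = \mathrm{Var}_{\pi^*}\Bigl(\textstyle\sum_{h=1}^{H} r_h\Bigr)\le H^2,
% hence the weighted bound scales as
% \sqrt{d}\cdot\sqrt{H}\cdot\sqrt{\mathcal{O}(H^2)} = \mathcal{O}(\sqrt{d}\,H^{3/2}).
```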

6.1. LINEAR MDP WITH FINITE FEATURE SET

As shown in the seminal work on linear contextual bandits by Chu et al. (2011), one can further improve the suboptimality bound by a factor of √d when the action set is finite. Equipped with the techniques developed above, we can obtain a similar improvement in the case of finitely many features.

Assumption 2 (Finite Feature Set). We assume that |{φ(x, a) ∈ R^d : x ∈ S, a ∈ A}| = M < ∞.

We start with bounding |T̂_h V*_{h+1} − T_h V*_{h+1}|, which is the key to establishing the bonus function:

|(T̂_h V*_{h+1} − T_h V*_{h+1})(x, a)| ≲ |Σ_{τ∈D} ⟨φ(x, a), Λ_h^{-1}φ_h^τ⟩ ξ_h^τ(V*_{h+1})| ≤ ‖φ(x, a)‖_{Λ_h^{-1}} ‖Σ_{τ∈D} ξ_h^τ(V*_{h+1})φ_h^τ‖_{Λ_h^{-1}},

where the coefficients ⟨φ(x, a), Λ_h^{-1}φ_h^τ⟩ are constants once we condition on the step-h data; the first inequality (a) is proved in Appendix I and the second inequality (b) follows from the Cauchy-Schwarz inequality. The reasoning proceeds as follows. The key observation is that since V*_{h+1} is deterministic, the {ξ_h^τ(V*_{h+1})}_{τ∈D} are independent conditioned on D_h = {(x_h^τ, a_h^τ)}_{τ∈D}, so the classic Hoeffding inequality applies. To get a high-probability bound for all features, Assumption 2 allows us to bound the middle term in (a) directly, paying a log(M) factor from a union bound instead of a √d factor from Lemma 9. We remark that the decomposition trick is necessary for this reasoning: one cannot condition on D_h otherwise, because doing so would influence the distribution of {ξ_h^τ(V̂_{h+1})}_{τ∈D}. Combining this with (3), we have the following result.

Theorem 4 (LinPEVI-ADV with Finite Feature Set). Suppose Assumptions 1 and 2 hold. If K ≥ Ω(d²H²/κ) and we set β₁ = O(H(log(2H²M/δ))^{1/2}) and λ = 1/d in Algorithm 2, then with probability at least 1 − δ, for any x ∈ S, with Λ_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤ + λI_d, we have

V_1^*(x) − V_1^{π̂}(x) ≤ O(H(log(2H²M/δ))^{1/2}) Σ_{h=1}^H E_{π^*}[‖φ(x_h, a_h)‖_{Λ_h^{-1}} | x_1 = x].
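The finite-feature bonus is simple to compute; below is a hedged one-function sketch of the union-bound width from Theorem 4 (constants and names are ours):

```python
import numpy as np

def finite_feature_bonus(phi_xa, Lambda_inv, H, M, delta):
    """Bonus under Assumption 2: Hoeffding plus a union bound over M features.

    The sqrt(d) covering-number factor is replaced by sqrt(log M):
    beta_1 = O(H * sqrt(log(2 H^2 M / delta))).
    """
    beta1 = H * np.sqrt(np.log(2.0 * H ** 2 * M / delta))
    return beta1 * np.sqrt(phi_xa @ Lambda_inv @ phi_xa)
```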

6.2. LINEAR TWO-PLAYER ZERO-SUM MARKOV GAME

We consider two-player zero-sum MGs (Xie et al., 2020), where at each step another player simultaneously takes an action from a second action space B, and the reward function and transition kernel are linear in a feature map φ(x, a, b) : S × A × B → R^d. The learning objective is to approximate the Nash equilibrium (NE), i.e., a policy pair (π*, ν*) such that V_h^{π*,ν*}(x) = max_π min_ν V_h^{π,ν}(x), where the V-value function is now defined as V_h^{π,ν}(x) = E_{π,ν}[Σ_{h′=h}^H r_{h′} | x_h = x]. With a slight abuse of notation, we define the Bellman operator for the two-player zero-sum MG as

(T_h V)(x, a, b) := r_h(x, a, b) + Σ_{x′∈S} P_h(x′|x, a, b)V(x′) = φ(x, a, b)^⊤w_h,

where the linear structure of the Bellman equation (i.e., the existence of w_h ∈ R^d) follows from the linearity of the reward and transition. The Pessimistic Minimax Value Iteration (PMVI) proposed in Zhong et al. (2022) also establishes pessimism at every step, and the gap to the NE value satisfies

V_1^*(x) − min_ν V_1^{π̂,ν}(x) ≤ 2 sup_π Σ_{h=1}^H E_{π,ν*}[Γ_h(x, a, b) | x_1 = x],

where Γ_h(x, a, b) is a bonus function such that |T̂_h V̂_{h+1}(x, a, b) − T_h V̂_{h+1}(x, a, b)| ≤ Γ_h(x, a, b) with high probability, and V̂_{h+1} is the estimated NE value of the first player at step h + 1. Although the learning objective is different, the suboptimality bound again essentially reduces to the uncertainty estimation at each step, and Zhong et al. (2022) suffers from exactly the same challenge of statistical dependency between V̂_{h+1} and the data samples used to construct T̂_h V̂_{h+1}. Therefore, our techniques readily extend to the MG setting and improve the result of Zhong et al. (2022). We defer the details to Appendix B.

7. CONCLUSION

In this paper, we study linear MDPs in the offline setting. We identify the complicated statistical dependency between different time steps as the bottleneck of algorithmic design and theoretical analysis. To address this issue, we develop a new reference-advantage decomposition technique under linear function approximation, which serves to avoid a √d-amplification of the value function error due to temporal dependency and is also critical for leveraging variance information to achieve a sharp dependence on the planning horizon H. We further generalize the developed techniques to linear MDPs with finite feature sets and to two-player zero-sum MGs, which demonstrates the broad adaptability of our methods.

A NOTATION TABLE AND COMPARISONS

A.1 NOTATIONS TABLE

We summarize the notations used in this paper in Table 2.

Table 2: Summary of notations.

Notation | Explanation
κ = min_{h∈[H]} λ_min(E_{d_h^b}[φ(x, a)φ(x, a)^⊤]) > 0 | Assumption 1 (MDP) and Assumption 3 (MG)
(P_h V)(x, a) = Σ_{x′∈S} P_h(x′|x, a)V(x′) | conditional expectation
[Var_h V](x, a) = [P_h V²](x, a) − ([P_h V](x, a))² | conditional variance
(T_h V)(x, a) = r_h(x, a) + (P_h V)(x, a) | Bellman equation and Bellman operator
σ̂_h²(·,·) ∈ [1, H²] | empirical variance estimator
[V_h V*_{h+1}](x, a) = max{1, [Var_h V*_{h+1}](x, a)} | clipped conditional variance of V*_{h+1}
Λ_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤ + λI_d | regular covariance estimator
Σ̂_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤/σ̂_h²(x_h^τ, a_h^τ) + λI_d | variance-weighted covariance estimator
Σ_h^* = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤/[V_h V*_{h+1}](x_h^τ, a_h^τ) + λI_d | variance-weighted covariance matrix
ξ_h^τ(f_{h+1}) = (r_h^τ + f_{h+1}(x_{h+1}^τ) − (T_h f_{h+1})(x_h^τ, a_h^τ))/σ̂_h(x_h^τ, a_h^τ) | noise in the self-normalized process

A.2 ADDITIONAL RELATED WORK

We review existing works that are closely related to our paper in this section.

Offline RL. The principle of pessimism was first used by Jin et al. (2021c) to enable efficient offline learning under only partial coverage. It shows that one can design an efficient offline RL algorithm with sufficient coverage over the optimal policy alone, instead of the uniform coverage previously required by Precup (2000); Antos et al. (2008); Levine et al. (2020). After that, a line of work (Rashidinejad et al., 2021; Yin and Wang, 2021; Uehara et al., 2021; Zanette et al., 2021; Xie et al., 2021a; Uehara and Sun, 2021; Shi et al., 2022; Li et al., 2022) leverages the principle of pessimism, either in the tabular case or with function approximation; we elaborate on them separately.

Offline tabular RL. For tabular MDPs, a line of works has incorporated the principle of pessimism to design efficient offline RL algorithms (Rashidinejad et al., 2021; Yin and Wang, 2021; Xie et al., 2021b; Shi et al., 2022; Li et al., 2022). In particular, Xie et al. (2021b) proposes a variance-reduction offline RL algorithm for tabular MDPs which is nearly optimal once the total sample size exceeds a certain threshold. After that, Li et al. (2022) proposes an algorithm that is nearly optimal, introducing a novel subsampling trick to cancel the temporal dependency among time steps. Shi et al. (2022) proposes the first nearly optimal model-free offline RL algorithm.

Offline RL with function approximation. For RL problems with linear function approximation, Jin et al. (2021c) designs the first pessimism-based efficient offline algorithm for linear MDPs. After that, Min et al. (2021) considers the offline policy evaluation problem under linear MDPs and designs a novel offline algorithm that incorporates the variance information of the value function to improve sample efficiency. This technique is later adopted by Yin et al. (2022). However, Min et al. (2021); Yin et al. (2022) depend (explicitly or implicitly) on an assumption that the data samples are independent across different time steps h, so they do not need to handle the temporal dependency, which considerably complicates the analysis. Moreover, such an assumption is not very realistic when the dataset is collected by a behavior policy interacting with the underlying MDP. Therefore, it remained open whether one can design computationally efficient algorithms that achieve minimax optimal sample efficiency for offline learning in linear MDPs. Beyond linear function approximation, Xie et al. (2021a); Uehara and Sun (2021) propose pessimistic offline RL algorithms with general function approximation. However, their works are only information-theoretic, as they require an optimization subroutine over the general function class that is computationally intractable in general.

Offline Markov games. A line of works studies offline two-player zero-sum MGs (e.g., Cui and Du, 2022; Zhong et al., 2022). Among these works, Cui and Du (2022) and Zhong et al. (2022) are most closely related to our algorithm: they study offline two-player zero-sum MGs in the tabular and linear cases, respectively. In terms of the minimal dataset coverage assumption, both identify that unilateral concentration, i.e., good coverage of {(π*, ν), (π, ν*) : (π*, ν*) is an NE and (π, ν) is arbitrary}, is necessary and sufficient for efficient offline learning.
In terms of sample complexity, while Cui and Du (2022) is nearly optimal for the tabular game in terms of the dependency on the number of states, there is a Õ(√d·H) gap between the upper and lower bounds for linear MGs (Zhong et al., 2022).

Algorithm 2 PEVI (LinPEVI-ADV)
1: Initialize: Input dataset D, β₁; V̂_{H+1}(·) = 0.
2: for h = H, ..., 1 do
3: Λ_h ← Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤ + λI_d;
4: ŵ_h ← Λ_h^{-1} Σ_{τ∈D} φ(x_h^τ, a_h^τ)(r_h^τ + V̂_{h+1}(x_{h+1}^τ));
5: Γ_h(·,·) ← β₁‖φ(·,·)‖_{Λ_h^{-1}};
6: Q̂_h(·,·) ← {φ(·,·)^⊤ŵ_h − Γ_h(·,·)}_{[0, H−h+1]};
7: π̂_h(·|·) ← argmax_{π_h}⟨Q̂_h(·,·), π_h(·|·)⟩_A, V̂_h(·) ← ⟨Q̂_h(·,·), π̂_h(·|·)⟩_A;
8: end for
9: Output: π̂ = {π̂_h}_{h=1}^H and V̂ = {V̂_h}_{h=1}^H.

Online RL with function approximation. Jin et al. (2020) and Xie et al. (2020) propose the first provably efficient algorithms for online linear MDPs and linear MGs, respectively. However, there is a gap between their regret bounds and the existing lower bounds, and we remark that similar issues of temporal dependency also exist in the analysis of online algorithms for linear MDPs and linear MGs. Beyond linear function approximation, several works on MDPs (Jiang et al., 2017; Jin et al., 2021a; Dann et al., 2021; Du et al., 2021; Foster et al., 2021) and MGs (Jin et al., 2021b; Huang et al., 2021; Xiong et al., 2022) design algorithms for the online setting with general function approximation. When applied to the linear setting, though their regret bounds are sharper than those of Jin et al. (2020); Xie et al. (2020), their algorithms are only information-theoretic and computationally inefficient.

Variance-weighted Regression. It is known that variance information is essential for a sharp horizon dependence (Azar et al., 2017; Zhang et al., 2020; 2021; Zhou et al., 2021). In particular, for online linear mixture MDPs, Zhou et al. (2021) develops variance-weighted regression and achieves the minimax optimal regret bound; Zhang et al. (2021) considers the time-homogeneous setting and achieves a horizon-free guarantee. This innovative idea was first introduced to the offline setting by Min et al. (2021) and Yin et al. (2022). In this paper, we generalize this technique to the more challenging offline setting without the additional independence assumption required by existing approaches. See Section 5 for details.

B RESULTS FOR MARKOV GAME

We extend our techniques to linear Markov games in this section.

B.1 PROBLEM SETUP

We introduce two-player zero-sum Markov games (MGs) with notation similar to that for single-agent MDPs; this slightly abuses notation but should be clear from the context.

Two-player Zero-sum Markov Game with Linear Function Approximation. The game is defined by a tuple (S, A, B, H, P, r), where S denotes the state space, A and B are the action spaces of the two players, H is the length of each episode, P = {P_h : S × A × B → ∆_S}_{h=1}^H is the transition kernel, and r = {r_h : S × A × B → [0, 1]}_{h=1}^H is the reward function. The first player (referred to as the max-player) takes actions from A aiming to maximize the cumulative reward, while the second player (referred to as the min-player) wants to minimize it. The policy of the max-player is defined as π = {π_h : S → ∆_A}_{h=1}^H; analogously, the policy of the min-player is defined as ν = {ν_h : S → ∆_B}_{h=1}^H.

Value Function and Nash Equilibrium. For any fixed policy pair (π, ν), we define the value function V_h^{π,ν} and the Q-function Q_h^{π,ν} as V_h^{π,ν}(x) = E_{π,ν}[Σ_{h′=h}^H r_{h′} | x_h = x] and Q_h^{π,ν}(x, a, b) = E_{π,ν}[Σ_{h′=h}^H r_{h′} | (x_h, a_h, b_h) = (x, a, b)]. For any function V : S → R, we also define the shorthand notations for the conditional mean, conditional variance, and Bellman operator:

(P_h V)(x, a, b) := Σ_{x′∈S} P_h(x′|x, a, b)V(x′),
[Var_h V](x, a, b) := [P_h V²](x, a, b) − ([P_h V](x, a, b))²,
(T_h V)(x, a, b) := r_h(x, a, b) + (P_h V)(x, a, b).

For any max-player policy π, we define the best response br(π) = argmin_ν V_h^{π,ν}(x) for all (x, h) ∈ S × [H]; similarly, br(ν) = argmax_π V_h^{π,ν}(x) for all (x, h) ∈ S × [H]. We say (π*, ν*) is a Nash equilibrium (NE) if π* and ν* are best responses to each other. For simplicity, we denote V_h^{π,*} = V_h^{π,br(π)}, V_h^{*,ν} = V_h^{br(ν),ν}, and V_h^*(x) = V_h^{π*,ν*}(x). It is well known that (π*, ν*) is the solution to max_π min_ν V_h^{π,ν}(x). We then measure the optimality of a policy pair (π, ν) by the duality gap, defined in Zhong et al. (2022); Xie et al. (2020) as

Gap((π, ν), x) = V_1^{*,ν}(x) − V_1^{π,*}(x). (11)

We consider linear MGs (Xie et al., 2020), which generalize the definition of linear MDPs.

Definition 2 (Linear MG). MG(S, A, B, H, P, r) is a linear MG with a (known) feature map φ : S × A × B → R^d if, for any h ∈ [H], there exist d unknown signed measures µ_h = (µ_h^{(1)}, ..., µ_h^{(d)}) over S and an unknown vector θ_h ∈ R^d such that for any (x, a, b) ∈ S × A × B, we have

P_h(·|x, a, b) = ⟨φ(x, a, b), µ_h(·)⟩, r_h(x, a, b) = ⟨φ(x, a, b), θ_h⟩. (12)

We assume that ‖φ(x, a, b)‖ ≤ 1 for all (x, a, b) ∈ S × A × B, and max{‖µ_h(S)‖, ‖θ_h‖} ≤ √d for all h ∈ [H]. With a slight abuse of notation, we make the following coverage assumption for MGs.

Assumption 3. We assume κ = min_{h∈[H]} λ_min(E_{d_h^b}[φ(x, a, b)φ(x, a, b)^⊤]) > 0 for MGs.

B.2 LINPMVI-ADV+

LinPMVI-ADV+ (Algorithm 3) is a variant of pessimistic minimax value iteration (PMVI) from Zhong et al. (2022). At a high level, LinPMVI-ADV+ constructs pessimistic estimates of the Q-functions for both players and outputs a policy pair by solving two Nash equilibria based on these two estimated value functions. For linear MGs, this can again be done by regression. Suppose we have constructed value functions (V̲_{h+1}, V̄_{h+1}) at step h+1, together with two independent variance estimators σ̲̂_h² and σ̄̂_h², constructed similarly to Section 5. For now, let us focus on the main components of Algorithm 3 and defer the construction of the variance estimators to the next subsection. Given (9), we approximate the Bellman equations T_h V̲_{h+1} and T_h V̄_{h+1} by solving the following regression problems:

w̲_h ← argmin_{w∈R^d} Σ_{τ∈D} [r_h^τ + V̲_{h+1}(x_{h+1}^τ) − (φ_h^τ)^⊤w]²/σ̲̂_h²(x_h^τ, a_h^τ, b_h^τ) + λ‖w‖₂², with (T̂_h V̲_{h+1})(o) := φ(o)^⊤w̲_h,
w̄_h ← argmin_{w∈R^d} Σ_{τ∈D} [r_h^τ + V̄_{h+1}(x_{h+1}^τ) − (φ_h^τ)^⊤w]²/σ̄̂_h²(x_h^τ, a_h^τ, b_h^τ) + λ‖w‖₂², with (T̂_h V̄_{h+1})(o) := φ(o)^⊤w̄_h, (13)

where (o) is short for (x, a, b), and P̂_h g_{h+1}, P̂_h ḡ_{h+1} can be obtained by setting r_h^τ = 0 in T̂_h g_{h+1} and T̂_h ḡ_{h+1}. Denoting the covariance estimators as Σ̲_h = Σ_{τ∈D} φ_h^τ(φ_h^τ)^⊤/σ̲̂_h²(x_h^τ, a_h^τ, b_h^τ) + λI_d and Σ̄_h = Σ_{τ∈D} φ_h^τ(φ_h^τ)^⊤/σ̄̂_h²(x_h^τ, a_h^τ, b_h^τ) + λI_d, we estimate the Q-functions by an LCB for the max-player and a UCB for the min-player, respectively:

Q̲_h(o) ← φ(o)^⊤w̲_h − Γ̲_h(x, a, b), Q̄_h(o) ← φ(o)^⊤w̄_h + Γ̄_h(x, a, b),

which are pessimistic for the max-player and the min-player, respectively. Next, we solve the matrix games with payoffs Q̲_h and Q̄_h:

(π̂_h(·|·), ν′_h(·|·)) ← NE(Q̲_h(·,·,·)), (π′_h(·|·), ν̂_h(·|·)) ← NE(Q̄_h(·,·,·)). (14)

The V-function estimates V̲_h and V̄_h are then given by

V̲_h = E_{a∼π̂_h(·|·), b∼ν′_h(·|·)} Q̲_h(·, a, b) and V̄_h = E_{a∼π′_h(·|·), b∼ν̂_h(·|·)} Q̄_h(·, a, b). (15)

After H steps, the algorithm outputs the policy pair (π̂ = {π̂_h}_{h=1}^H, ν̂ = {ν̂_h}_{h=1}^H) and value functions (V̲ = {V̲_h}_{h=1}^H, V̄ = {V̄_h}_{h=1}^H). As for linear MDPs, a sharper uncertainty bonus leads to a smaller suboptimality gap, and the techniques developed for MDPs can be applied separately to the max-player and the min-player. Specifically, given (V̲_{h+1}, V̄_{h+1}) and denoting the Nash value by V*_{h+1}, we can decompose the uncertainties as follows:

T̂_h V̲_{h+1}(o) − T_h V̲_{h+1}(o) = [T̂_h V*_{h+1}(o) − T_h V*_{h+1}(o)] + [P̂_h(V̲_{h+1} − V*_{h+1})(o) − P_h(V̲_{h+1} − V*_{h+1})(o)],
T̂_h V̄_{h+1}(o) − T_h V̄_{h+1}(o) = [T̂_h V*_{h+1}(o) − T_h V*_{h+1}(o)] + [P̂_h(V̄_{h+1} − V*_{h+1})(o) − P_h(V̄_{h+1} − V*_{h+1})(o)],

where in each line the first bracket is the reference part and the second is the advantage part.
Then, similar to the single-agent MDP, for sufficiently large K the uncertainty is dominated by the reference part involving V*_{h+1}. Moreover, when the independent variance estimators approximate the conditional variances well, we can set Γ̲_h(x, a, b) = Õ(√d)‖φ(x, a, b)‖_{Σ̲_h^{-1}} and Γ̄_h(x, a, b) = Õ(√d)‖φ(x, a, b)‖_{Σ̄_h^{-1}}. The full pseudocode is presented in Algorithm 3, and we have the following theoretical guarantee.

Theorem 5 (LinPMVI-ADV+). Under Assumption 3, if K ≥ Ω(d²H⁶/κ) and we set 0 < λ < κ and β₃ = Õ(√d) in Algorithm 3, then with probability at least 1 − δ, we have

V_1^{*,ν̂}(x) − V_1^{π̂,*}(x) ≤ Õ(√d)·( max_ν Σ_{h=1}^H E_{π*,ν}‖φ(x_h, a_h, b_h)‖_{Σ_h^{*−1}} + max_π Σ_{h=1}^H E_{π,ν*}‖φ(x_h, a_h, b_h)‖_{Σ_h^{*−1}} ),

where (π*, ν*) is an NE and Σ_h^* = Σ_{τ∈D} φ_h^τ(φ_h^τ)^⊤/[V_h V*_{h+1}](x_h^τ, a_h^τ, b_h^τ) + λI_d.

Algorithm 4 PMVI (LinPMVI-ADV)
1: Initialize: Input dataset D, β₄; V̲_{H+1}(·) = V̄_{H+1}(·) = 0.
2: for h = H, ..., 1 do
3: Λ_h ← Σ_{τ∈D} φ(x_h^τ, a_h^τ, b_h^τ)φ(x_h^τ, a_h^τ, b_h^τ)^⊤ + λI_d;
4: w̲_h ← Λ_h^{-1} Σ_{τ∈D} φ(x_h^τ, a_h^τ, b_h^τ)(r_h^τ + V̲_{h+1}(x_{h+1}^τ));
5: w̄_h ← Λ_h^{-1} Σ_{τ∈D} φ(x_h^τ, a_h^τ, b_h^τ)(r_h^τ + V̄_{h+1}(x_{h+1}^τ));
6: Γ_h(·,·,·) ← β₄·(φ(·,·,·)^⊤Λ_h^{-1}φ(·,·,·))^{1/2};
7: Q̲_h(·,·,·) ← {φ(·,·,·)^⊤w̲_h − Γ_h(·,·,·)}_{[0, H−h+1]};
8: Q̄_h(·,·,·) ← {φ(·,·,·)^⊤w̄_h + Γ_h(·,·,·)}_{[0, H−h+1]};
9: (π̂_h(·|·), ν′_h(·|·)) ← NE(Q̲_h(·,·,·));
10: (π′_h(·|·), ν̂_h(·|·)) ← NE(Q̄_h(·,·,·));
11: V̲_h(·) ← ⟨Q̲_h(·,·,·), π̂_h(·|·) × ν′_h(·|·)⟩_{A×B};
12: V̄_h(·) ← ⟨Q̄_h(·,·,·), π′_h(·|·) × ν̂_h(·|·)⟩_{A×B};
13: end for
14: Output: (π̂ = {π̂_h}_{h=1}^H, ν̂ = {ν̂_h}_{h=1}^H).

Similar to the single-agent MDP, LinPMVI-ADV+ replaces the explicit dependence on the planning horizon H in PMVI (Zhong et al., 2022) with an instance-dependent characterization through Σ_h^*. The instance-dependent bound of LinPMVI-ADV+ is never worse than that of PMVI, and the improvement is strict when specialized to the tabular setting.

A tighter lower bound. To further interpret the result, we establish the following nearly matching lower bound, which tightens that of Zhong et al. (2022).

Theorem 6 (Lower Bound for MG). Fix the horizon H, dimension d, probability δ > 0, and sample size K ≥ Ω(d⁴). There exists a class M of linear MGs and an offline dataset D with |D| = K such that for any policy pair (π̂, ν̂), it holds with probability at least 1 − δ that

sup_{M∈M} E_M[V_1^{*,ν̂}(x_1) − V_1^{π̂,*}(x_1)] ≥ c√d·( max_ν Σ_{h=1}^H E_{π*,ν}‖φ_h‖_{Σ_h^{*−1}} + max_π Σ_{h=1}^H E_{π,ν*}‖φ_h‖_{Σ_h^{*−1}} ),

where Σ_h^* = Σ_{τ∈D} φ_h^τ(φ_h^τ)^⊤/[V_h V*_{h+1}](x_h^τ, a_h^τ, b_h^τ) + λI_d and c > 0 is a universal constant.

As [V_h V*_{h+1}](·,·,·) ≥ 1, it holds that ‖φ(x_h, a_h, b_h)‖_{Σ_h^{*−1}} ≥ ‖φ(x_h, a_h, b_h)‖_{Λ_h^{-1}}. Therefore, Theorem 6 improves the lower bound in Zhong et al. (2022) by at least a factor of √d. Moreover, LinPMVI-ADV+ matches this lower bound up to logarithmic factors and is thus nearly minimax optimal once K exceeds the threshold specified in the theorem.
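The NE(·) subroutine in Algorithms 3 and 4 solves a zero-sum matrix game at each state, which can be done by a standard linear program. Below is a hedged sketch of that standard LP formulation for the max-player (not the paper's code); SciPy's `linprog` is used for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_ne(Q):
    """Max-player's NE strategy for a zero-sum matrix game with payoff Q.

    Q : (|A|, |B|) payoff matrix (max-player maximizes, min-player minimizes).
    Solves max_pi min_b sum_a pi(a) Q[a, b] as a linear program.
    """
    nA, nB = Q.shape
    # Variables: (pi_1, ..., pi_nA, v); maximize v  <=>  minimize -v.
    c = np.zeros(nA + 1)
    c[-1] = -1.0
    # Constraints: v - sum_a pi(a) Q[a, b] <= 0 for every column b.
    A_ub = np.hstack([-Q.T, np.ones((nB, 1))])
    b_ub = np.zeros(nB)
    # Simplex constraint: sum_a pi(a) = 1, with pi >= 0 and v free.
    A_eq = np.zeros((1, nA + 1))
    A_eq[0, :nA] = 1.0
    bounds = [(0, None)] * nA + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:nA], res.x[-1]   # (NE strategy pi_hat, game value)
```

The min-player's strategy is obtained symmetrically by applying the same routine to −Q^⊤.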

B.3 LINPMVI-ADV

In this subsection, we present LinPMVI-ADV, whose full pseudocode is given in Algorithm 4. It follows the reasoning of the last subsection but with the trivial variance estimate 1, i.e., with regular ridge regression. The reference-advantage decomposition then allows us to improve PMVI (Zhong et al., 2022) by a factor of √d, by invoking Lemma 9 without a uniform concentration argument for the reference part. The resulting bonus is Γ_h(·,·,·) = Õ(√d·H)‖φ(·,·,·)‖_{Λ_h^{-1}}. LinPMVI-ADV admits the following theoretical guarantee.

Theorem 7 (LinPMVI-ADV). Under Assumption 3, if K > Ω(d²H²/κ) and we set λ = 1 and β₄ = Õ(√d·H) in Algorithm 4, then with probability at least 1 − δ, we have

V_1^{*,ν̂}(x) − V_1^{π̂,*}(x) ≤ Õ(√d·H)·( max_ν Σ_{h=1}^H E_{π*,ν}‖φ_h‖_{Λ_h^{-1}} + max_π Σ_{h=1}^H E_{π,ν*}‖φ_h‖_{Λ_h^{-1}} ),

where (π*, ν*) is an NE and Λ_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ, b_h^τ)φ(x_h^τ, a_h^τ, b_h^τ)^⊤ + λI_d. We now proceed to construct the variance estimators for Algorithm 3.

Construction of the Variance Estimators.

To begin with, we run Algorithm 4 with an independent dataset D′ to construct {V̲′_h, V̄′_h}_{h=1}^H; this only incurs a factor of 2 in the final sample complexity. Similar to Lemma 1, we can show that there exist β_{h,1}(V̲′_{h+1}) and β_{h,2}(V̲′_{h+1}) such that [P_h(V̲′_{h+1})²](o) = ⟨φ(o), β_{h,2}(V̲′_{h+1})⟩ and [P_h V̲′_{h+1}](o) = ⟨φ(o), β_{h,1}(V̲′_{h+1})⟩. We approximate them via ridge regression on D′:

β̂_{h,2} = argmin_{β∈R^d} Σ_{τ∈D′} [⟨φ_h^τ, β⟩ − (V̲′_{h+1})²(x_{h+1}^τ)]² + λ‖β‖₂²,
β̂_{h,1} = argmin_{β∈R^d} Σ_{τ∈D′} [⟨φ_h^τ, β⟩ − V̲′_{h+1}(x_{h+1}^τ)]² + λ‖β‖₂².

The variance estimator is then constructed as

σ̲̂_h²(o) := max{1, [φ(o)^⊤β̂_{h,2}]_{[0,H²]} − ([φ(o)^⊤β̂_{h,1}]_{[0,H]})² − Õ(dH³/√(Kκ))}.

Similarly, we can construct the variance estimator σ̄̂_h²(o) with V̄′_{h+1} and the dataset D′. In particular, as σ̲̂_h(·,·,·) and σ̄̂_h(·,·,·) only depend on the dataset D′, they are independent of D. The variance estimation error is characterized in Lemma 7, and the proofs for MGs are presented in Appendix F.

C AUXILIARY LEMMAS

In this section, we provide several useful lemmas to facilitate the proofs for linear MDPs. The first lemma states that if we adopt the principle of pessimism, the suboptimality bound essentially reduces to the uncertainty estimation, i.e., the construction of the bonuses.

Lemma 2 (Regret Decomposition Lemma for MDP). Suppose that with probability at least 1 − δ, the functions Γ_h : S × A → R in Algorithms 2 and 1 (with Γ_h = b_{0,h} + b_{1,h}) satisfy

|T̂_h V̂_{h+1}(x, a) − T_h V̂_{h+1}(x, a)| ≤ Γ_h(x, a), ∀(x, a, h) ∈ S × A × [H].

Then, with probability at least 1 − δ, for Algorithms 2 and 1 and any x ∈ S,

V_1^*(x) − V_1^{π̂}(x) ≤ V_1^*(x) − V̂_1(x) ≤ 2 Σ_{h=1}^H E_{π^*}[Γ_h(x_h, a_h) | x_1 = x].

Proof. See Lemma 3.1 and Theorem 4.2 in Jin et al. (2021c) for a detailed proof.

Lemma 3 (Decomposition). For a function f_{h+1}, suppose that (T_h f_{h+1})(·,·) = φ(·,·)^⊤w_h, and let

ŵ_h = Σ̂_h^{-1} Σ_{τ∈D} φ(x_h^τ, a_h^τ)·(r_h^τ + f_{h+1}(x_{h+1}^τ))/σ̂_h²(x_h^τ, a_h^τ),

where σ̂_h²(x_h^τ, a_h^τ) can be either 1 (for regular ridge regression) or the estimated variance (for variance-weighted ridge regression), and Σ̂_h = Σ_{τ∈D} φ(x_h^τ, a_h^τ)φ(x_h^τ, a_h^τ)^⊤/σ̂_h²(x_h^τ, a_h^τ) + λI_d. Then we have the following decomposition:

|(T̂_h f_{h+1} − T_h f_{h+1})(x, a)| = |φ(x, a)^⊤ŵ_h − φ(x, a)^⊤w_h| ≤ λ‖w_h‖_{Σ̂_h^{-1}}‖φ(x, a)‖_{Σ̂_h^{-1}} + ‖Σ_{τ∈D} (φ(x_h^τ, a_h^τ)/σ̂_h(x_h^τ, a_h^τ))·ξ_h^τ(f_{h+1})‖_{Σ̂_h^{-1}}‖φ(x, a)‖_{Σ̂_h^{-1}},

where ξ_h^τ(f_{h+1}) = (r_h^τ + f_{h+1}(x_{h+1}^τ) − (T_h f_{h+1})(x_h^τ, a_h^τ))/σ̂_h(x_h^τ, a_h^τ). In particular, if |f_{h+1}| is bounded by H − 1 and ‖Σ_{τ∈D}(φ(x_h^τ, a_h^τ)/σ̂_h(x_h^τ, a_h^τ))·ξ_h^τ(f_{h+1})‖_{Σ̂_h^{-1}} ≤ β, then we can set λ sufficiently small so that √(λd)·H ≤ β, in which case the second term dominates. P̂_h f_{h+1}(·,·) admits similar results by setting r_h ≡ 0.

Proof. See Appendix I for a detailed proof.
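For intuition, the algebra behind Lemma 3 fits in a few lines; the following sketch uses only the definitions above:

```latex
% Since \widehat\Sigma_h \widehat w_h
%   = \sum_{\tau\in\mathcal D}\frac{\phi_h^\tau}{\widehat\sigma_h^2(x_h^\tau,a_h^\tau)}
%     \bigl(r_h^\tau + f_{h+1}(x_{h+1}^\tau)\bigr)
% and \phi^\top w_h = (\mathbb T_h f_{h+1})(x,a), subtracting
% \widehat\Sigma_h w_h = \sum_\tau \frac{\phi_h^\tau}{\widehat\sigma_h^2}(\mathbb T_h f_{h+1})(x_h^\tau,a_h^\tau)
%                        + \lambda w_h gives
\widehat w_h - w_h
  = -\lambda\,\widehat\Sigma_h^{-1} w_h
    + \widehat\Sigma_h^{-1}\sum_{\tau\in\mathcal D}
      \frac{\phi_h^\tau}{\widehat\sigma_h(x_h^\tau,a_h^\tau)}\,\xi_h^\tau(f_{h+1}),
% and Cauchy--Schwarz,
% |\phi^\top\widehat\Sigma_h^{-1}u| \le \|\phi\|_{\widehat\Sigma_h^{-1}}\|u\|_{\widehat\Sigma_h^{-1}},
% applied to each term yields the two summands in the lemma.
```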

D PROOFS OF LINPEVI-ADV

D.1 PROOF OF THEOREM 1

The proof requires a refined induction analysis to deal with the temporal dependency. For instance, when analyzing step h, we cannot condition on the event that V̂_{h+1} − V*_{h+1} is small, which is needed to ensure that the uncertainty of the advantage function is non-dominating: due to the temporal dependency, conditioning on V̂_{h+1} may influence the data distribution at step h. A carefully crafted induction analysis resolves this challenge.

Proof of Theorem 1. We prove the theorem by induction. For h = H and a function g_{h+1} with ‖g_{h+1}‖_∞ ≤ R − 1, we invoke Lemma 3:

|(T̂_h g_{h+1} − T_h g_{h+1})(x, a)| ≤ √(λd)·R‖φ(x, a)‖_{Λ_h^{-1}} (term (a)) + ‖Σ_{τ∈D} φ(x_h^τ, a_h^τ)·ξ_h^τ(g_{h+1})‖_{Λ_h^{-1}}‖φ(x, a)‖_{Λ_h^{-1}} (term (b)).

For the reference part with g_{h+1} = V*_{H+1}: since V*_{H+1} is independent of D, we can directly apply Lemma 9 with λ = 1 to obtain that with probability at least 1 − δ/H²,

|(T̂_H V*_{H+1} − T_H V*_{H+1})(x, a)| ≤ 2√d·H√ι ‖φ(x, a)‖_{Λ_H^{-1}},

where ι = log(2H²K/δ) ≥ 1. To simplify the proof, we set b_{0,H}(x, a) = 3√d·H√ι ‖φ(x, a)‖_{Λ_H^{-1}}, where the extra slack also absorbs the term (a) of the advantage function. We can then focus on the analysis of the term (b) for the advantage function. By construction, the event E_{H+1} = {0 ≤ V*_{H+1} − V̂_{H+1} ≤ 8√d·H·0·√ι/√(Kκ) = 0} holds with probability 1. By Lemma 9, it suffices to set b_{1,H}(x, a) = (8d^{3/2}H²ι/√(Kκ))‖φ(x, a)‖_{Λ_H^{-1}}. It follows that we can set

Γ_H(·,·) = b_{0,H}(·,·) + b_{1,H}(·,·) = (3√d·H√ι + 8d^{3/2}H²ι/√(Kκ))‖φ(·,·)‖_{Λ_H^{-1}} ≤ 4√d·H√ι ‖φ(·,·)‖_{Λ_H^{-1}},

where we use K ≥ Ω(d²H²/κ) in the last inequality. If pessimism at step H is achieved, we know that

Q_H^*(x, a) = (T_H V*_{H+1})(x, a) ≥ (T_H V̂_{H+1})(x, a) ≥ Q̂_H(x, a), ∀(x, a) ∈ S × A,

where the last step is due to |(T̂_H V̂_{H+1} − T_H V̂_{H+1})(x, a)| ≤ Γ_H(x, a). This implies that V_H^*(x) ≥ V̂_H(x) for all x ∈ S, and we proceed to bound the error as follows:

V_H^*(x) − V̂_H(x) = ⟨Q_H^*(x, ·) − Q̂_H(x, ·), π_H^*(·|x)⟩ + ⟨Q̂_H(x, ·), π_H^*(·|x) − π̂_H(·|x)⟩
≤ ⟨T̂_H V̂_{H+1}(x, ·) − T_H V̂_{H+1}(x, ·) + Γ_H(x, ·), π_H^*(·|x)⟩ + ⟨T_H(V*_{H+1} − V̂_{H+1})(x, ·), π_H^*(·|x)⟩
≤ 2E_{π^*}[Γ_H(x_H, a_H) | x_H = x] + 0 ≤ 8√d·H·1·√ι/√(Kκ) := R_H, ∀x ∈ S,

where the last inequality uses Lemma 13. To summarize, the event E_H = {0 ≤ V_H^*(x) − V̂_H(x) ≤ R_H, ∀x ∈ S} holds with probability at least 1 − δ_H = 1 − δ/H². This is the base case.

Now suppose that the event E_{h+1} = {0 ≤ V*_{h+1}(x) − V̂_{h+1}(x) ≤ R_{h+1} := 8√d·H(H−h)√ι/√(Kκ), ∀x ∈ S} holds with probability at least 1 − δ_{h+1}. We are going to establish the result for step h. Clearly, we can still set b_{0,h}(·,·) = 3√d·H√ι ‖φ(·,·)‖_{Λ_h^{-1}}. It remains to determine b_{1,h}(·,·) for the term (b) of the advantage function and to ensure that it is non-dominating. We need to deal with the temporal dependency, which requires a more involved analysis. We first state the following lemma.

Lemma 4 (Lemma B.2 of Jin et al. (2021c)). Let f : S → [0, R − 1] be any fixed function. For any δ ∈ (0, 1), we have

P( ‖Σ_{τ∈D} φ_h^τ·ξ_h^τ(f)‖²_{Λ_h^{-1}} ≥ R²(2 log(1/δ) + d log(1 + K/λ)) ) ≤ δ.

However, as (V̂_{h+1} − V*_{h+1}) is correlated with {(x_h^τ, a_h^τ, x_{h+1}^τ)}_{τ∈D}, we need a uniform concentration argument. In particular, we remark that we cannot directly condition on V*_{h+1} − V̂_{h+1} ≤ R_{h+1}. We consider the function class

V_h(D, B, λ) = {V_h(x; θ, β, Σ) : S → [0, H] with ‖θ‖ ≤ D, β ∈ [0, B], Σ ⪰ λ·I},
where V_h(x; θ, β, Σ) = max_{a∈A} {φ(x, a)^⊤θ − β·√(φ(x, a)^⊤Σ^{-1}φ(x, a))}_{[0, H−h+1]}.

With f denoting V*_{h+1} − V̂_{h+1}, we can estimate D as follows:

‖ŵ_h(f)‖ = ‖Λ_h^{-1} Σ_{τ∈D} φ(x_h^τ, a_h^τ)·(r_h^τ + f(x_{h+1}^τ))‖ ≤ H√(K/λ) √(Σ_{τ∈D} φ(x_h^τ, a_h^τ)^⊤Λ_h^{-1}φ(x_h^τ, a_h^τ)) = H√(K/λ) √(Tr(Λ_h^{-1}(Λ_h − λI_d))) ≤ H√(Kd/λ).

It follows that f = V*_{h+1} − V̂_{h+1} ∈ F_{h+1} := {V*_{h+1} − V_{h+1} : V_{h+1} ∈ V_{h+1}(D₀, B₀, λ)}, where D₀ = H√(Kd/λ), λ = 1, and B₀ = 8√d·H√ι. For any ε > 0, we denote the ε-cover of F_{h+1} with respect to the supremum norm as N_{h+1}(ε) (short for N_{h+1}(ε; D, B, λ)) and its ε-covering number as |N_{h+1}(ε)|. For each f ∈ F_{h+1}, we can find f_ε ∈ N_{h+1}(ε) such that sup_{x∈S} |f(x) − f_ε(x)| ≤ ε. It follows that

‖Σ_{τ∈D} φ_h^τ·ξ_h^τ(f)‖²_{Λ_h^{-1}} 1{‖f‖_∞ ≤ R_{h+1}}
≤ 2‖Σ_{τ∈D} φ_h^τ·ξ_h^τ(f_ε)‖²_{Λ_h^{-1}} 1{‖f_ε‖_∞ ≤ R_{h+1} + ε} + 2‖Σ_{τ∈D} φ_h^τ·(ξ_h^τ(f) − ξ_h^τ(f_ε))‖²_{Λ_h^{-1}}
≤ 2‖Σ_{τ∈D} φ_h^τ·ξ_h^τ(f_ε)‖²_{Λ_h^{-1}} 1{‖f_ε‖_∞ ≤ R_{h+1} + ε} + 2ε²K²/λ,

where the first inequality uses the fact that {‖f‖_∞ ≤ R_{h+1}} implies {‖f_ε‖_∞ ≤ R_{h+1} + ε}, and the second inequality follows from the estimate

2‖Σ_{τ∈D} φ_h^τ·(ξ_h^τ(f) − ξ_h^τ(f_ε))‖²_{Λ_h^{-1}} ≤ 2ε² Σ_{τ,τ′=1}^K |(φ_h^τ)^⊤Λ_h^{-1}φ_h^{τ′}| ≤ 2ε² Σ_{τ,τ′=1}^K ‖φ_h^τ‖‖φ_h^{τ′}‖‖Λ_h^{-1}‖_op ≤ 2ε²K²/λ,

where ‖·‖_op denotes the operator norm and ‖Λ_h^{-1}‖_op ≤ λ^{-1}.
With a union bound over N h+1 (ϵ) and Lemma 4, we obtain that P   sup fϵ∈N h+1 (ϵ) τ ∈D ϕ τ h • ξ τ h (f ϵ ) 2 Λ -1 h 1 {∥f ϵ ∥ ∞ ≤ R h+1 + ϵ} > (R h+1 + ϵ) 2 (2 log( H 2 • |N h+1 (ϵ)| δ ) + d log(1 + K λ )) ≤ δ/H 2 . With probability at least 1 -δ/H 2 , for all f ∈ F h+1 , we have τ ∈D ϕ τ h • ξ τ h (f ) 2 Λ -1 h 1 {∥f ∥ ∞ ≤ R h+1 } ≤ inf ϵ>0    8 √ dH 2 √ ι √ Kκ + ϵ 2 2 log( H 2 • |N h+1 (ϵ)| δ ) + d log(1 + K λ ) + 2ϵ 2 K 2 /λ    ≤ 128dH 4 ι Kκ 2 log(H 2 /δ) + 4d 2 log( 512K 3 ι d 3/2 H 2 ) + 2d 3 H 4 Kκ ≤ 256dH 4 ι Kκ 2 log(H 2 /δ) + 4d 2 log( 512K 3 ι d 3/2 H 2 ) , where the second inequality is because the covering number of F h+1 is bounded by that of V h+1 (D, B, λ) and by Lemma 11, we can take ϵ = d 3/2 H 2 /(K 3/2 √ κ) to obtain log |N h+1 (ϵ)| ≤ d log(1 + 4K 2 √ κ dH ) + d 2 log(1 + 512K 3 κι d 3/2 H 2 ) ≤ 2d 2 log( 512K 3 ι d 3/2 H 2 ), where the second inequality holds when K > √ dH/(128 √ κι). As ι > 1 and log ι ≤ ι, it further holds that 256dH 4 ι Kκ 2 log(H 2 /δ) + 4d 2 log( 512K 3 ι d 3/2 H 2 ) ≤ 256dH 4 ι Kκ 2ι + 4d 2 (ι + log(512)ι + 3ι) ≤ 8704d 3 H 4 ι 2 Kκ , To summarize, it suffices to set b 1,h = 94d 3/2 H 2 ι √ Kκ ∥ϕ(x, a)∥ Λ -1 h and we have P τ ∈D ϕ τ h ξ τ h (V * h+1 -V h+1 ) > 94d 3/2 H 2 ι √ Kκ ≤ P τ ∈D ϕ τ h ξ τ h (V * h+1 -V h+1 ) 1 V * h+1 -V h+1 ∞ ≤ R h+1 > 94d 3/2 H 2 ι √ Kκ + P 1 V * h+1 -V h+1 ∞ > R h+1 ≤ δ/H 2 + δ h+1 := δ h , where δ h+1 is the failure probability at step h + 1. As K > Ω d 2 H 2 /κ , we can set Γ h (•, •) ≤ 3 √ dH √ ι + 94d 3/2 H 2 ι √ Kκ ∥ϕ(•, •)∥ Λ -1 h ≤ 4 √ dH √ ι ∥ϕ(•, •)∥ Λ -1 h , With R h := 8 √ dH(H-h+1) √ ι √ Kκ , we proceed to analyze the failure probability of E h = {0 ≤ V * h+1 (x)- V h+1 (x) ≤ R h , ∀x ∈ S}. First of all, if |T h V h+1 -T h V h+1 | ≤ Γ h and event E h+1 holds, we know that Q * h (x, a) = T h V * h+1 (x, a) ≥ T h V h+1 (x, a) ≥ Q h (x, a), ∀(x, a) ∈ S × A, and thus V * h (x) ≥ V h (x) for all x ∈ S. We also have V * h (x) -V h (x) = Q * h (x, •) -Q h (x, •), π * h (•|x) + Q h (x, •), π * h (•|x) -π h (•|x) ≤ T h V h+1 (x, •) -T h V h+1 (x, •) + Γ h (x, •), π * h (•|x) + T h (V * h+1 -V h+1 )(x, •), π * h (•|x) ≤ 2E π * [Γ h (•, •)|x h = x] + R h+1 , ≤ 8 √ dH • (H -h) √ ι √ Kκ := R h , ∀x ∈ S. Therefore, the failure probability at step h can be upper bounded as P (E c h ) ≤ P E c h+1 ∪ E h+1 ∩ {Γ h (•, •) does not ensures pessimism} ≤ δ h+1 + δ H 2 = δ h . Therefore, we have shown that with probability at least 1 -δ h = 1 -δ h+1 -δ H 2 , pessimism is achieved at step h and E h holds with probability at least 1 -δ h . By induction, and a union bound over h ∈ [H], we know that if we set Γ h = 4 √ dH √ ι ∥ϕ(•, •)∥ Λ -1 h , then with probability at least 1 -(δ H + • • • + δ 1 ) = 1 -δH(H+1) 2H 2 > 1 -δ, |T h V h+1 -T h V h+1 |(x, a) ≤ Γ h (x, a) holds for all (h, x, a) ∈ [H] × S × A. The theorem then follows from Lemma 2.
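To make the scaling in Lemma 4 concrete, the following short Monte Carlo experiment compares the self-normalized quantity with the stated threshold. It is a minimal sanity check rather than part of the proof: the feature distribution, the noise law, and all constants ($d$, $K$, $\lambda$, $R$, $\delta$) are arbitrary choices of ours.

```python
import numpy as np

# Minimal Monte Carlo sanity check of the self-normalized bound in Lemma 4.
# All distributions and constants are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
d, K, lam, R, delta = 8, 2000, 1.0, 1.0, 0.01

def self_normalized_sq(rng):
    phi = rng.normal(size=(K, d))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)  # enforce ||phi_tau|| <= 1
    xi = rng.uniform(-R, R, size=K)                    # bounded mean-zero "residuals"
    Lam = lam * np.eye(d) + phi.T @ phi                # Lambda_h
    s = phi.T @ xi                                     # sum_tau phi_tau * xi_tau
    return s @ np.linalg.solve(Lam, s)                 # squared Lambda^{-1}-norm

samples = [self_normalized_sq(rng) for _ in range(200)]
threshold = R**2 * (2 * np.log(1 / delta) + d * np.log(1 + K / lam))
print(f"99th percentile: {np.quantile(samples, 0.99):.1f} vs threshold: {threshold:.1f}")
```

The empirical quantiles should sit well below the threshold, since the bound is not tight in constants.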

D.2 PROOF OF THEOREM 4

Proof of Theorem 4. The proof follows essentially the same argument as that of Theorem 1, except that we can leverage the finite-feature condition to derive a different bonus term for the dominating reference function. We focus on deriving $\Gamma_h(\cdot,\cdot)$ and omit the other details for simplicity. We first elaborate Eqn. (6.1) via the proof of Lemma 3 (see Section I for details):
$$
\big|\mathbb{T}_hV^*_{h+1}-\widehat{\mathbb{T}}_hV^*_{h+1}\big|(x,a) \le \underbrace{\sqrt{\lambda d}\,H\,\|\phi(x,a)\|_{\Sigma_h^{-1}}}_{(a)} + \underbrace{\Big|\sum_{\tau\in\mathcal{D}}\big\langle\phi(x,a),\Lambda_h^{-1}\phi^\tau_h\big\rangle\,\xi^\tau_h(V^*_{h+1})\Big|}_{(b)},
$$
where $\xi^\tau_h(V^*_{h+1}) = r^\tau_h+V^*_{h+1}(x^\tau_{h+1})-\mathbb{T}_hV^*_{h+1}(x^\tau_h,a^\tau_h)$. We will bound $(b)$ by $\beta\,\|\phi(x,a)\|_{\Lambda_h^{-1}}$ for some $\beta>0$, and we can set $\lambda$ sufficiently small so that $\sqrt{\lambda d}\,H\le\beta$; we therefore focus on the analysis of $(b)$. Denote the state-action pairs at step $h$ by $\mathcal{D}_h = \{(x^\tau_h,a^\tau_h)\}_{\tau\in\mathcal{D}}$. The key observation is that, because $V^*_{h+1}$ is deterministic, conditioned on $\mathcal{D}_h$ the only randomness comes from $x^\tau_{h+1}$, so $\{\xi^\tau_h(V^*_{h+1})\}_{\tau\in\mathcal{D}}$ are still independent and bounded random variables. In particular, for any fixed $\mathcal{D}_h$ and any fixed feature $\phi(x,a)$, Hoeffding's inequality yields that with probability at least $1-\frac{\delta}{H^2M}$,
$$
(b) \le \sqrt{\sum_{\tau\in\mathcal{D}}H^2\big\langle\phi(x,a),\Lambda_h^{-1}\phi(x^\tau_h,a^\tau_h)\big\rangle^2\log\frac{2H^2M}{\delta}} = H\sqrt{\Big(\|\phi(x,a)\|^2_{\Lambda_h^{-1}}-\lambda\|\phi(x,a)\|^2_{\Lambda_h^{-2}}\Big)\log\frac{2H^2M}{\delta}} \le H\sqrt{\log\frac{2H^2M}{\delta}}\;\|\phi(x,a)\|_{\Lambda_h^{-1}},
$$
where the equality uses
$$
\sum_{\tau\in\mathcal{D}}\big\langle\phi(x,a),\Lambda_h^{-1}\phi(x^\tau_h,a^\tau_h)\big\rangle^2 = \sum_{\tau\in\mathcal{D}}\phi(x,a)^\top\Lambda_h^{-1}\phi(x^\tau_h,a^\tau_h)\phi(x^\tau_h,a^\tau_h)^\top\Lambda_h^{-1}\phi(x,a) = \|\phi(x,a)\|^2_{\Lambda_h^{-1}}-\lambda\,\phi(x,a)^\top(\Lambda_h^{-1})^2\phi(x,a).
$$
Then, if we denote the event
$$
\mathcal{E}_h(x,a) := \Big\{\Big|\sum_{\tau\in\mathcal{D}}\big\langle\phi(x,a),\Lambda_h^{-1}\phi^\tau_h\big\rangle\,\xi^\tau_h(V^*_{h+1})\Big| > H\sqrt{\log\tfrac{2H^2M}{\delta}}\;\|\phi(x,a)\|_{\Lambda_h^{-1}}\Big\},
$$
it follows that $\mathbb{P}(\mathcal{E}_h(x,a))\le\frac{\delta}{H^2M}$ for any fixed $\mathcal{D}_h$. By a union bound over the $M$ distinct features, with probability at least $1-\delta/H^2$, for all $(x,a)\in\mathcal{S}\times\mathcal{A}$ we have $(b)\le H\sqrt{\log(2H^2M/\delta)}\,\|\phi(x,a)\|_{\Lambda_h^{-1}}$. By a similar induction procedure, we can set $\Gamma_h(\cdot,\cdot)$ proportional to $H\sqrt{\log(2H^2M/\delta)}\,\|\phi(\cdot,\cdot)\|_{\Lambda_h^{-1}}$, where we require $K\ge\Omega(d^2H^2/\kappa)$ and $\lambda=1/d$ to make the advantage function and $(a)$ non-dominating. The theorem then follows from Lemma 2.

E PROOF OF LINPEVI-ADV+

In this section, we present the proof of Theorem 2.

E.1 ANALYSIS OF THE VARIANCE ESTIMATION ERROR

First, we analyze the estimation error of the conditional variance estimator. Recall from Appendix B.3 that $\widehat{B}_h(\cdot,\cdot)$ denotes the ridge-regression estimate of $[\mathrm{Var}_h\widehat{V}'_{h+1}](\cdot,\cdot)$ constructed from $\mathcal{D}'$, and that $\hat{\sigma}^2_h(\cdot,\cdot) = \max\big\{1,\ \widehat{B}_h(\cdot,\cdot)-\tilde{O}\big(\tfrac{dH^3}{\sqrt{K\kappa}}\big)\big\}$. We have the following lemma.

Lemma 5 (Variance Estimation Error). Under Assumption 1, if $K\ge\Omega(d^2H^2/\kappa)$, then with probability at least $1-\delta$, for all $(x,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$, we have
$$
[\mathbb{V}_hV^*_{h+1}](x,a)-\tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big) \le \hat{\sigma}^2_h(x,a) \le [\mathbb{V}_hV^*_{h+1}](x,a).
$$

Proof. We first bound the difference between $\widehat{B}_h(x,a)$ and $[\mathrm{Var}_h\widehat{V}'_{h+1}](x,a)$. Both quantities involved are analyses of the regular value-target ridge regression, with targets $(\widehat{V}'_{h+1})^2$ and $\widehat{V}'_{h+1}$, respectively. The analysis thus follows the same lines as in Appendix D where we deal with the correlated advantage part, except that we invoke Lemma 9 with range $H$ for $\widehat{V}'_{h+1}$ and range $H^2$ for $(\widehat{V}'_{h+1})^2$. We omit the details and present the result directly: with probability at least $1-\delta/2$, for all $(x,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$,
$$
\big|\widehat{B}_h-[\mathrm{Var}_h\widehat{V}'_{h+1}]\big|(x,a) \le \tilde{O}\Big(\frac{dH^2}{\sqrt{K\kappa}}\Big). \quad (22)
$$
We then use Theorem 1 (applied to the first-stage estimator $\widehat{V}'$, which is computed on $\mathcal{D}'$) and Lemma 13 to show that with probability at least $1-\delta/2$, for all $(x,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$,
$$
\big|\mathrm{Var}_h\widehat{V}'_{h+1}-\mathrm{Var}_hV^*_{h+1}\big|(x,a) \le \big|P_h\big((\widehat{V}'_{h+1})^2-(V^*_{h+1})^2\big)(x,a)\big| + \big|\big(P_h\widehat{V}'_{h+1}\big)^2-\big(P_hV^*_{h+1}\big)^2\big|(x,a)
$$
$$
\le 2H\,\big\|\widehat{V}'_{h+1}-V^*_{h+1}\big\|_\infty + 2H\,\big|P_h\big(\widehat{V}'_{h+1}-V^*_{h+1}\big)(x,a)\big| \le \tilde{O}\Big(\frac{\sqrt{d}H^3}{\sqrt{K\kappa}}\Big). \quad (23)
$$
With a union bound over the estimates in Eqn. (22) and Eqn. (23), with probability at least $1-\delta$ the following derivations hold. First, by the triangle inequality,
$$
\big|\widehat{B}_h-\mathrm{Var}_hV^*_{h+1}\big|(x,a) \le \big|\widehat{B}_h-\mathrm{Var}_h\widehat{V}'_{h+1}\big|(x,a)+\big|\mathrm{Var}_h\widehat{V}'_{h+1}-\mathrm{Var}_hV^*_{h+1}\big|(x,a) \le \tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big),
$$
where we use Eqn. (22) and Eqn. (23) in the last inequality. This shows that $\widehat{B}_h(x,a)-\tilde{O}\big(\tfrac{dH^3}{\sqrt{K\kappa}}\big) \le [\mathrm{Var}_hV^*_{h+1}](x,a)$, and the second inequality of the lemma follows from the fact that $\max\{1,\cdot\}$ preserves the order of two numbers. On the other hand, $\max\{1,\cdot\}$ is non-expansive, meaning that $|\max\{1,a\}-\max\{1,b\}|\le|a-b|$. Then we have
$$
\big|\hat{\sigma}^2_h-[\mathbb{V}_hV^*_{h+1}]\big|(x,a) \le \big|\hat{\sigma}^2_h-[\mathbb{V}_h\widehat{V}'_{h+1}]\big|(x,a)+\big|[\mathbb{V}_h\widehat{V}'_{h+1}]-[\mathbb{V}_hV^*_{h+1}]\big|(x,a)
$$
$$
\le \Big|\widehat{B}_h-\tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big)-\mathrm{Var}_h\widehat{V}'_{h+1}\Big|(x,a)+\big|\mathrm{Var}_h\widehat{V}'_{h+1}-\mathrm{Var}_hV^*_{h+1}\big|(x,a) \le \tilde{O}\Big(\frac{dH^2}{\sqrt{K\kappa}}\Big)+\tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big)+\tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big) = \tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big),
$$
where the third inequality follows from Eqn. (22) and Eqn. (23); this gives the first inequality of the lemma.
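For intuition, here is a compact sketch of how such a conditional-variance estimator can be formed from the auxiliary dataset $\mathcal{D}'$ by two ridge regressions, one with target $(\widehat{V}'_{h+1})^2$ and one with target $\widehat{V}'_{h+1}$. The function names and clipping ranges are ours, and we drop the explicit $\tilde{O}(dH^3/\sqrt{K\kappa})$ correction term; see Appendix B.3 for the actual construction.

```python
import numpy as np

def fit_variance_estimator(Phi, V_next, lam=1.0, H=10):
    """Sketch of the conditional-variance estimator (hypothetical helper).

    Phi:    (K, d) features phi(x_h^tau, a_h^tau) from the auxiliary dataset D'.
    V_next: (K,) reference values V'_{h+1}(x_{h+1}^tau).
    Returns sigma_sq(phi), an estimate of max{1, Var_h V'_{h+1}(x, a)}."""
    d = Phi.shape[1]
    Lam = lam * np.eye(d) + Phi.T @ Phi
    beta2 = np.linalg.solve(Lam, Phi.T @ (V_next ** 2))  # ridge fit of P_h (V')^2
    beta1 = np.linalg.solve(Lam, Phi.T @ V_next)         # ridge fit of P_h V'

    def sigma_sq(phi):
        second = np.clip(phi @ beta2, 0.0, H ** 2)       # clip to [0, H^2]
        first = np.clip(phi @ beta1, 0.0, H)             # clip to [0, H]
        return max(1.0, second - first ** 2)             # truncate below at 1
    return sigma_sq
```

Because $\widehat{V}'$ is computed on $\mathcal{D}'$, the resulting weights are independent of the main dataset $\mathcal{D}$, which is exactly what the filtration argument in the proof of Theorem 2 requires.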

E.2 PROOF OF THEOREM 2

The proof idea is similar to that of LinPEVI-ADV, except that we additionally estimate the conditional variance and apply the Bernstein-type Lemma 10 to the reference function. To simplify the presentation, we focus on the uncertainty of the reference function, which is dominating, and on determining the threshold; we suppress constants and logarithmic factors with $\tilde{O}(\cdot)$ throughout the proof.

Proof of Theorem 2. We start with the same decomposition: for a function $\|g_{h+1}\|_\infty\le R-1$, we invoke Lemma 3:
$$
\big|\mathbb{T}_hg_{h+1}-\widehat{\mathbb{T}}_hg_{h+1}\big|(x,a) \le \underbrace{\sqrt{\lambda d}\,R\,\|\phi(x,a)\|_{\Sigma_h^{-1}}}_{(a)} + \underbrace{\Big\|\sum_{\tau\in\mathcal{D}}\frac{\phi(x^\tau_h,a^\tau_h)}{\hat{\sigma}_h(x^\tau_h,a^\tau_h)}\cdot\xi^\tau_h(g_{h+1})\Big\|_{\Sigma_h^{-1}}}_{(b)}\,\|\phi(x,a)\|_{\Sigma_h^{-1}},
$$
where now $\xi^\tau_h(g_{h+1}) = \frac{r^\tau_h+g_{h+1}(x^\tau_{h+1})-\mathbb{T}_hg_{h+1}(x^\tau_h,a^\tau_h)}{\hat{\sigma}_h(x^\tau_h,a^\tau_h)}$ with $\hat{\sigma}^2_h(\cdot,\cdot) = [\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](\cdot,\cdot)$. We can set $\lambda = 1/H^2$ to ensure that $(a) \le \tilde{O}(\sqrt{d})\,\|\phi(x,a)\|_{\Sigma_h^{-1}}$, so we focus on the analysis of $(b)$.

For the reference function: since $[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](x^\tau_h,a^\tau_h)\ge1$, we know that $|\xi^\tau_h(V^*_{h+1})|\le H$. We then consider the filtration
$$
\mathcal{F}_{\tau-1,h} = \sigma\Big(\{(x^j_h,a^j_h)\}_{j\le\tau,\,j\in\mathcal{D}}\cup\{(r^j_h,x^j_{h+1})\}_{j\le\tau-1,\,j\in\mathcal{D}}\Big),
$$
where $\sigma(\cdot)$ denotes the $\sigma$-algebra generated by the random variables. As $[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](\cdot,\cdot)$ is independent of $\mathcal{D}$, $\xi^\tau_h(V^*_{h+1})$ is mean-zero conditioned on $\mathcal{F}_{\tau-1,h}$. We proceed to estimate the conditional variance:
$$
\mathbb{D}_{\tau-1,h}\big[\xi^\tau_h(V^*_{h+1})\big] = \frac{\mathbb{D}_{\tau-1,h}\big[V^*_{h+1}(x^\tau_{h+1})\big]}{[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](x^\tau_h,a^\tau_h)} \le \frac{[\mathbb{V}_hV^*_{h+1}](x^\tau_h,a^\tau_h)}{[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](x^\tau_h,a^\tau_h)} \le \frac{[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](x^\tau_h,a^\tau_h)+\tilde{O}\big(\tfrac{dH^3}{\sqrt{K\kappa}}\big)}{[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](x^\tau_h,a^\tau_h)} \le 1+2\,\tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big) = O(1),
$$
where the first equality uses the fact that $[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](\cdot,\cdot)$ is independent of $\mathcal{D}$, the second inequality uses Lemma 5, and the last inequality uses $K\ge\Omega(d^2H^6/\kappa)$ to ensure that $[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](x^\tau_h,a^\tau_h)-\tilde{O}\big(\tfrac{dH^3}{\sqrt{K\kappa}}\big)\ge\frac12$. Then we can directly invoke Lemma 10 to obtain
$$
\Big\|\sum_{\tau\in\mathcal{D}}\frac{\phi(x^\tau_h,a^\tau_h)}{\hat{\sigma}_h(x^\tau_h,a^\tau_h)}\cdot\xi^\tau_h(V^*_{h+1})\Big\|_{\Sigma_h^{-1}} \le \tilde{O}(\sqrt{d}).
$$
As in the proof of LinPEVI-ADV, the uncertainty of the advantage function is non-dominating for sufficiently large $K$ (determined below), so we can set $\Gamma_h(\cdot,\cdot) = \tilde{O}(\sqrt{d})\,\|\phi(\cdot,\cdot)\|_{\Sigma_h^{-1}}$. Moreover, by Lemma 5 we have $[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](x^\tau_h,a^\tau_h) \le [\mathbb{V}_hV^*_{h+1}](x^\tau_h,a^\tau_h) \le H^2$, which implies
$$
\Big(\sum_{\tau\in\mathcal{D}}\frac{\phi^\tau_h(\phi^\tau_h)^\top}{[\widehat{\mathbb{V}}_h\widehat{V}'_{h+1}](x^\tau_h,a^\tau_h)}+\lambda I_d\Big)^{-1} \preceq \Big(\sum_{\tau\in\mathcal{D}}\frac{\phi^\tau_h(\phi^\tau_h)^\top}{[\mathbb{V}_hV^*_{h+1}](x^\tau_h,a^\tau_h)}+\lambda I_d\Big)^{-1} \preceq H^2\Big(\sum_{\tau\in\mathcal{D}}\phi^\tau_h(\phi^\tau_h)^\top+\lambda I_d\Big)^{-1}.
$$
This implies $\|\phi(x,a)\|_{\Sigma_h^{-1}} \le \|\phi(x,a)\|_{\Sigma_h^{*-1}} \le H\,\|\phi(x,a)\|_{\Lambda_h^{-1}}$ for all $(x,a)$. Following the same induction procedure as in the proof of Theorem 1, we know that $\|\widehat{V}_{h+1}-V^*_{h+1}\|_\infty \le \tilde{O}\big(\tfrac{\sqrt{d}H^2}{\sqrt{K\kappa}}\big)$. Using the standard $\epsilon$-covering argument and Lemma 9, we can set $b_{1,h}(x,a) = \tilde{O}\big(\tfrac{d^{3/2}H^2}{\sqrt{K\kappa}}\big)\,\|\phi(x,a)\|_{\Sigma_h^{-1}}$; to make it non-dominating, we require $K\ge\Omega(d^2H^4/\kappa)$. Finally, to make $(a)=\sqrt{\lambda d}\,R\,\|\phi(x,a)\|_{\Sigma_h^{-1}}$ non-dominating, we set $\lambda=1/H^2$ as above. The theorem then follows from Lemma 2.
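The computational core of the argument above is the variance-weighted ridge regression together with the bonus $\beta\,\|\phi\|_{\Sigma_h^{-1}}$. The sketch below shows both in matrix form; the function name is ours, and the choice $\beta = \tilde{O}(\sqrt{d})$ comes from Lemma 10 with constants we do not track here.

```python
import numpy as np

def weighted_ridge_and_bonus(Phi, y, sigma_sq, lam, beta):
    """Sketch of the regression step behind LinPEVI-ADV+ (illustrative names).

    Phi: (K, d) features; y: (K,) targets r_h^tau + V_hat_{h+1}(x_{h+1}^tau);
    sigma_sq: (K,) variance estimates (all >= 1); beta: bonus multiplier."""
    d = Phi.shape[1]
    W = Phi / sigma_sq[:, None]                       # rows phi_tau / sigma_tau^2
    Sigma = lam * np.eye(d) + W.T @ Phi               # sum phi phi^T / sigma^2 + lam I
    w_hat = np.linalg.solve(Sigma, W.T @ y)           # weighted least-squares solution
    Sigma_inv = np.linalg.inv(Sigma)

    def bonus(phi):
        return beta * np.sqrt(phi @ Sigma_inv @ phi)  # beta * ||phi||_{Sigma^{-1}}
    return w_hat, bonus
```

Down-weighting by the estimated variance is what allows the Bernstein-type bound to replace the $\tilde{O}(\sqrt{d}H)$ bonus by $\tilde{O}(\sqrt{d})\,\|\phi\|_{\Sigma_h^{-1}}$.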

F PROOFS FOR MARKOV GAMES

Proof of Theorem 5. The techniques developed for MDPs readily extend to MGs by decoupling the estimation into a max-player part and a min-player part, so we start with the following lemma, the counterpart of Lemma 2.

Lemma 6 (Decomposition Lemma for MGs). Suppose that the functions $\underline{\Gamma}_h, \overline{\Gamma}_h : \mathcal{S}\times\mathcal{A}\times\mathcal{B}\to\mathbb{R}$ in Algorithms 3 and 4 satisfy
$$
\big|\mathbb{T}_h\underline{V}_{h+1}(x,a,b)-\widehat{\mathbb{T}}_h\underline{V}_{h+1}(x,a,b)\big| \le \underline{\Gamma}_h(x,a,b), \qquad \big|\mathbb{T}_h\overline{V}_{h+1}(x,a,b)-\widehat{\mathbb{T}}_h\overline{V}_{h+1}(x,a,b)\big| \le \overline{\Gamma}_h(x,a,b),
$$
for all $(x,a,b,h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$. Then, for Algorithms 3 and 4, we have
$$
V^{*,\hat{\nu}}_1(x)-V^*_1(x) \le \overline{V}_1(x)-V^*_1(x) \le 2\sup_{\nu}\sum_{h=1}^H\mathbb{E}_{\pi^*,\nu}\big[\overline{\Gamma}_h(x_h,a_h,b_h)\,\big|\,x_1=x\big],
$$
$$
V^*_1(x)-V^{\hat{\pi},*}_1(x) \le V^*_1(x)-\underline{V}_1(x) \le 2\sup_{\pi}\sum_{h=1}^H\mathbb{E}_{\pi,\nu^*}\big[\underline{\Gamma}_h(x_h,a_h,b_h)\,\big|\,x_1=x\big].
$$
Consequently, for any $x\in\mathcal{S}$,
$$
V^{*,\hat{\nu}}_1(x)-V^{\hat{\pi},*}_1(x) = \big(V^{*,\hat{\nu}}_1(x)-V^*_1(x)\big)+\big(V^*_1(x)-V^{\hat{\pi},*}_1(x)\big) \le 2\sup_{\nu}\sum_{h=1}^H\mathbb{E}_{\pi^*,\nu}\big[\overline{\Gamma}_h\,\big|\,x_1=x\big]+2\sup_{\pi}\sum_{h=1}^H\mathbb{E}_{\pi,\nu^*}\big[\underline{\Gamma}_h\,\big|\,x_1=x\big].
$$
Proof. See Appendix A of Zhong et al. (2022) for a detailed proof.

Therefore, it suffices to determine bonuses $\underline{\Gamma}_h(\cdot,\cdot,\cdot)$ and $\overline{\Gamma}_h(\cdot,\cdot,\cdot)$ that establish pessimism. Before continuing, we first prove Theorem 7, which is required in our subsequent analysis.

Proof of Theorem 7. We first note that Lemma 3 is stated for (weighted) ridge regression and applies to linear MGs after replacing $\phi(x,a)\in\mathbb{R}^d$ with $\phi(x,a,b)$. Therefore, for any function $g_{h+1}:\mathcal{S}\to[0,H-1]$, we have
$$
\big|\mathbb{T}_hg_{h+1}-\widehat{\mathbb{T}}_hg_{h+1}\big|(x,a,b) \lesssim \Big\|\sum_{\tau\in\mathcal{D}}\frac{\phi(x^\tau_h,a^\tau_h,b^\tau_h)}{\underline{\sigma}_h(x^\tau_h,a^\tau_h,b^\tau_h)}\cdot\underline{\xi}^\tau_h(g_{h+1})\Big\|_{\underline{\Sigma}_h^{-1}}\,\|\phi(x,a,b)\|_{\underline{\Sigma}_h^{-1}},
$$
$$
\big|\mathbb{T}_hg_{h+1}-\widehat{\mathbb{T}}_hg_{h+1}\big|(x,a,b) \lesssim \Big\|\sum_{\tau\in\mathcal{D}}\frac{\phi(x^\tau_h,a^\tau_h,b^\tau_h)}{\overline{\sigma}_h(x^\tau_h,a^\tau_h,b^\tau_h)}\cdot\overline{\xi}^\tau_h(g_{h+1})\Big\|_{\overline{\Sigma}_h^{-1}}\,\|\phi(x,a,b)\|_{\overline{\Sigma}_h^{-1}},
$$
where $\underline{\Sigma}_h = \sum_{\tau\in\mathcal{D}}\frac{\phi(x^\tau_h,a^\tau_h,b^\tau_h)\phi(x^\tau_h,a^\tau_h,b^\tau_h)^\top}{\underline{\sigma}^2_h(x^\tau_h,a^\tau_h,b^\tau_h)}+\lambda I_d$, $\underline{\xi}^\tau_h(g_{h+1}) = \frac{r^\tau_h+g_{h+1}(x^\tau_{h+1})-\mathbb{T}_hg_{h+1}(x^\tau_h,a^\tau_h,b^\tau_h)}{\underline{\sigma}_h(x^\tau_h,a^\tau_h,b^\tau_h)}$, and $\overline{\Sigma}_h$, $\overline{\xi}^\tau_h(g_{h+1})$ are defined analogously with $\overline{\sigma}_h(\cdot,\cdot,\cdot)$. Moreover, we again omit the $\sqrt{\lambda d}\,H$ term, as $\lambda$ can be set sufficiently small (determined below).

For LinPMVI-ADV, we set $\underline{\sigma}_h = \overline{\sigma}_h = 1$, so $\underline{\Sigma}_h = \overline{\Sigma}_h = \Lambda_h$. By the reference-advantage decomposition for MGs (Eqn. (B.2)), it suffices to focus on the reference function given by the Nash value $V^*_{h+1}$, to which Lemma 9 applies directly:
$$
\big|\mathbb{T}_h\underline{V}_{h+1}-\widehat{\mathbb{T}}_h\underline{V}_{h+1}\big|(x,a,b) \le \tilde{O}(\sqrt{d}H)\,\|\phi(x,a,b)\|_{\Lambda_h^{-1}} = \underline{\Gamma}_h(x,a,b),
$$
$$
\big|\mathbb{T}_h\overline{V}_{h+1}-\widehat{\mathbb{T}}_h\overline{V}_{h+1}\big|(x,a,b) \le \tilde{O}(\sqrt{d}H)\,\|\phi(x,a,b)\|_{\Lambda_h^{-1}} = \overline{\Gamma}_h(x,a,b).
$$
Following the same induction procedure as in the proof of Theorem 1, we obtain $\|\underline{V}_{h+1}-V^*_{h+1}\|_\infty \le \tilde{O}\big(\tfrac{H(H-h)}{\sqrt{K\kappa}}\big)$ and $\|\overline{V}_{h+1}-V^*_{h+1}\|_\infty \le \tilde{O}\big(\tfrac{H(H-h)}{\sqrt{K\kappa}}\big)$. By the standard $\epsilon$-covering argument and Lemma 9, we can set $b_{1,h}(x,a,b) = O\big(\tfrac{d^{3/2}H^2\iota}{\sqrt{K\kappa}}\big)\,\|\phi(x,a,b)\|_{\Lambda_h^{-1}}$; to make it non-dominating, we require $K\ge\Omega(d^2H^2/\kappa)$. Also, to make $\sqrt{\lambda d}\,H \le \tilde{O}(\sqrt{d}H)$, it suffices to set $\lambda=1$. Now we invoke Lemma 6 with $\underline{\Gamma}_h = \overline{\Gamma}_h = \Gamma_h$ to obtain that for any $x\in\mathcal{S}$,
$$
V^{*,\hat{\nu}}_1(x)-V^{\hat{\pi},*}_1(x) \le 2\sup_{\nu}\sum_{h=1}^H\mathbb{E}_{\pi^*,\nu}[\Gamma_h(x_h,a_h,b_h)\,|\,x_1=x]+2\sup_{\pi}\sum_{h=1}^H\mathbb{E}_{\pi,\nu^*}[\Gamma_h(x_h,a_h,b_h)\,|\,x_1=x]
$$
$$
\le \tilde{O}(\sqrt{d}H)\cdot\Big(\max_{\nu}\sum_{h=1}^H\mathbb{E}_{\pi^*,\nu}\|\phi(x_h,a_h,b_h)\|_{\Lambda_h^{-1}}+\max_{\pi}\sum_{h=1}^H\mathbb{E}_{\pi,\nu^*}\|\phi(x_h,a_h,b_h)\|_{\Lambda_h^{-1}}\Big).
$$

With Theorem 7 in hand, and similar to Lemma 5, the following lemma controls the variance estimation error; we omit its proof for simplicity.

Lemma 7 (Variance Estimation Error for MGs). Under Assumption 1, in Algorithm 3, if $K\ge\Omega(d^2H^2/\kappa)$, then with probability at least $1-\delta$, for all $(x,a,b,h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$, we have
$$
[\mathbb{V}_hV^*_{h+1}](x,a,b)-\tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big) \le \underline{\sigma}^2_h(x,a,b) \le [\mathbb{V}_hV^*_{h+1}](x,a,b),
$$
$$
[\mathbb{V}_hV^*_{h+1}](x,a,b)-\tilde{O}\Big(\frac{dH^3}{\sqrt{K\kappa}}\Big) \le \overline{\sigma}^2_h(x,a,b) \le [\mathbb{V}_hV^*_{h+1}](x,a,b).
$$

Similar to the proof of Theorem 2, if $K\ge\Omega(d^2H^6/\kappa)$, the conditional variances of $\underline{\xi}^\tau_h(V^*_{h+1})$ and $\overline{\xi}^\tau_h(V^*_{h+1})$ are $O(1)$, so it suffices to set $\underline{\Gamma}_h(\cdot,\cdot,\cdot) = \tilde{O}(\sqrt{d})\,\|\phi(\cdot,\cdot,\cdot)\|_{\underline{\Sigma}_h^{-1}}$ and $\overline{\Gamma}_h(\cdot,\cdot,\cdot) = \tilde{O}(\sqrt{d})\,\|\phi(\cdot,\cdot,\cdot)\|_{\overline{\Sigma}_h^{-1}}$. Moreover, because $[\mathbb{V}_hV^*_{h+1}](x,a,b)\le H^2$, we have $\underline{\Sigma}_h^{-1}\preceq(\Sigma^*_h)^{-1}\preceq H^2\Lambda_h^{-1}$ and $\overline{\Sigma}_h^{-1}\preceq(\Sigma^*_h)^{-1}\preceq H^2\Lambda_h^{-1}$. We can similarly establish $\|\underline{V}_{h+1}-V^*_{h+1}\|_\infty \le \tilde{O}\big(\tfrac{\sqrt{d}H^2}{\sqrt{K\kappa}}\big)$ and $\|\overline{V}_{h+1}-V^*_{h+1}\|_\infty \le \tilde{O}\big(\tfrac{\sqrt{d}H^2}{\sqrt{K\kappa}}\big)$. It then suffices to set $\underline{b}_{1,h}(\cdot,\cdot,\cdot) = \tilde{O}\big(d^{3/2}H^2/\sqrt{K\kappa}\big)\,\|\phi(\cdot,\cdot,\cdot)\|_{\underline{\Sigma}_h^{-1}}$ and $\overline{b}_{1,h}(\cdot,\cdot,\cdot) = \tilde{O}\big(d^{3/2}H^2/\sqrt{K\kappa}\big)\,\|\phi(\cdot,\cdot,\cdot)\|_{\overline{\Sigma}_h^{-1}}$, where $K\ge\Omega(d^2H^4/\kappa)$ suffices to make them non-dominating. Moreover, we set $\lambda=1/H^2$ to make $\sqrt{\lambda d}\,H \le \tilde{O}(\sqrt{d})$. The theorem then follows from Lemma 6.
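As a computational side note, at every state the MG algorithms must solve a zero-sum matrix game with respect to the estimated $Q$-functions (the max-player on the pessimistic estimate, the min-player on the optimistic one). The sketch below solves such a matrix game by linear programming; it is our illustration of this standard subroutine, not pseudocode from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game(Q):
    """Value and max-player strategy of the zero-sum game max_p min_q p^T Q q."""
    A, B = Q.shape
    # Variables (p_1, ..., p_A, v); maximize v subject to (Q^T p)_b >= v for all b.
    c = np.zeros(A + 1)
    c[-1] = -1.0                                   # linprog minimizes, so minimize -v
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])      # v - (Q^T p)_b <= 0
    b_ub = np.zeros(B)
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])  # sum_a p_a = 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * A + [(None, None)]      # p >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:A]                    # game value, strategy p

# Example: matching pennies has value 0 and the uniform strategy.
value, p = matrix_game(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```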

G PROOF OF LOWER BOUNDS

We only provide the proof of the lower bound for MGs; the lower bound for MDPs follows from a similar argument (see Remark 1 for details). In particular, we remark that the $\sqrt{d}$-dependency does not contradict Theorem 4, as the feature size of the constructed instances is exponential in $d$.

Proof of Theorem 6. Our proof largely follows Zanette et al. (2021); Yin et al. (2022). We construct a family of MGs $\mathcal{M}=\{M_u\}_{u\in\mathcal{U}}$, where $\mathcal{U}=\{u=(u_1,\ldots,u_H)\,|\,u_h\in\{-1,1\}^{d-2},\ \forall h\in[H]\}$. For any fixed $u\in\mathcal{U}$, the associated MG $M_u$ is defined as follows.

State space. The state space is $\mathcal{S}=\{-1,+1\}$.

Action space. The action spaces are $\mathcal{A}=\mathcal{B}=\{-1,0,1\}^{d-2}$.

Feature map. The feature map $\phi:\mathcal{S}\times\mathcal{A}\times\mathcal{B}\to\mathbb{R}^d$ is defined by
$$
\phi(1,a,b)=\Big(\frac{a^\top}{\sqrt{2d}},\ \frac{1}{\sqrt{2}},\ 0\Big)^\top\in\mathbb{R}^d, \qquad \phi(-1,a,b)=\Big(\frac{a^\top}{\sqrt{2d}},\ 0,\ \frac{1}{\sqrt{2}}\Big)^\top\in\mathbb{R}^d.
$$

Transition kernel. Let $\mu_h(1)=\mu_h(-1)=\big(0_{d-2}^\top,\ \frac{1}{\sqrt{2}},\ \frac{1}{\sqrt{2}}\big)^\top\in\mathbb{R}^d$. By the assumption $P_h(s'\,|\,s,a,b)=\langle\phi(s,a,b),\mu_h(s')\rangle$, the MG reduces to a homogeneous Markov chain with transition matrix
$$
P=\begin{pmatrix}1/2 & 1/2\\ 1/2 & 1/2\end{pmatrix}\in\mathbb{R}^{2\times2}.
$$

Reward observation. Let $\theta_{u,h}=\big(\zeta u_h^\top,\ \frac{1}{\sqrt{3}},\ -\frac{1}{\sqrt{3}}\big)^\top\in\mathbb{R}^d$, where $\zeta\in[0,\frac{1}{\sqrt{3d}}]$. By the assumption $r_h(s,a,b)=\langle\phi(s,a,b),\theta_h\rangle$, we have
$$
r_{u,h}(s,a,b)=\frac{s}{\sqrt{6}}+\frac{\zeta}{\sqrt{2d}}\langle a,u_h\rangle.
$$
We further assume that the reward observation follows a Gaussian distribution: $R_{u,h}(s,a,b)\sim\mathcal{N}\big(\frac{s}{\sqrt{6}}+\frac{\zeta}{\sqrt{2d}}\langle a,u_h\rangle,\ 1\big)$.

Data collection process. Let $\{e_1,\ldots,e_{d-2}\}$ be the canonical basis of $\mathbb{R}^{d-2}$ and $0_{d-2}\in\mathbb{R}^{d-2}$ the zero vector. The behavior policy $\mu:\mathcal{S}\to\Delta_{\mathcal{A}\times\mathcal{B}}$ is defined as
$$
\mu(e_j,0_{d-2}\,|\,s)=\frac{1}{d},\ \forall j\in[d-2], \qquad \mu(0_{d-2},0_{d-2}\,|\,s)=\frac{2}{d}.
$$

By construction, a Nash equilibrium of $M_u$ is $(\pi^*,\nu^*)$ satisfying $\pi^*_h(\cdot)=u_h$ and $\nu^*_h(\cdot)=u_h$. Since the reward and transition are independent of the min-player's policy, we use the notation $u^\pi=(u^\pi_1,\ldots,u^\pi_H)=(\mathrm{sign}(\mathbb{E}_\pi[a_1]),\ldots,\mathrm{sign}(\mathbb{E}_\pi[a_H]))$. Moreover, for any vector $v$, we denote its $i$-th element by $v[i]$. By the proof of Lemma 9 in Zanette et al. (2021), we know
$$
V^*_u-V^{\pi,\nu}_u \ge \frac{\zeta}{\sqrt{2d}}\sum_{h=1}^H\sum_{i=1}^{d-2}\mathbf{1}\{u^\pi_h[i]\ne u_h[i]\} := \frac{\zeta}{\sqrt{2d}}\,D_H(u^\pi,u).
$$
By Assouad's method (cf. Lemma 2.12 in Tsybakov (2009)), we have
$$
\sup_{u\in\mathcal{U}}\mathbb{E}_u[D_H(u^\pi,u)] \ge \frac{(d-2)H}{2}\min_{u,u'\,:\,D_H(u,u')=1}\inf_{\psi}\big[\mathbb{P}_u(\psi\ne u)+\mathbb{P}_{u'}(\psi\ne u')\big],
$$
where $\psi$ is a test function mapping observations to $\{u,u'\}$. Furthermore, by Theorem 2.12 in Tsybakov (2009),
$$
\min_{u,u'\,:\,D_H(u,u')=1}\inf_{\psi}\big[\mathbb{P}_u(\psi\ne u)+\mathbb{P}_{u'}(\psi\ne u')\big] \ge 1-\Big(\frac12\max_{u,u'\,:\,D_H(u,u')=1}\mathrm{KL}(Q_u\|Q_{u'})\Big)^{1/2}, \quad (27)
$$
where $Q_u$ takes the form
$$
Q_u=\prod_{k=1}^K\xi(s^k_1)\prod_{h=1}^H\mu(a^k_h,b^k_h\,|\,s^k_h)\,\big[R_{u,h}(s^k_h,a^k_h,b^k_h)\big](r^k_h)\,P_h(s^k_{h+1}\,|\,s^k_h,a^k_h,b^k_h),
$$
with $\xi=[\frac12,\frac12]$ the initial distribution. We compute
$$
\mathrm{KL}(Q_u\|Q_{u'})=K\sum_{h=1}^H\mathbb{E}_u\Big[\log\big([R_{u,h}(s^1_h,a^1_h,b^1_h)](r^1_h)\big/[R_{u',h}(s^1_h,a^1_h,b^1_h)](r^1_h)\big)\Big] = \frac{K}{d}\sum_{j=1}^{d-2}\mathrm{KL}\Big(\mathcal{N}\big(\tfrac{\zeta}{\sqrt{2d}}\langle e_j,u_h\rangle,1\big)\,\Big\|\,\mathcal{N}\big(\tfrac{\zeta}{\sqrt{2d}}\langle e_j,u'_h\rangle,1\big)\Big)
$$
$$
=\frac{K}{d}\cdot\mathrm{KL}\Big(\mathcal{N}\big(\tfrac{\zeta}{\sqrt{2d}},1\big)\,\Big\|\,\mathcal{N}\big(-\tfrac{\zeta}{\sqrt{2d}},1\big)\Big)=\frac{2K\zeta^2}{d^2},
$$
where the last two equalities use the fact that $D_H(u,u')=1$. Choosing $\zeta=\Theta(d/\sqrt{K})$, we obtain
$$
\inf_{\pi}\sup_{u\in\mathcal{U}}\ \mathbb{E}_u\big[V^*_u-V^{\pi,\nu}_u\big] \gtrsim \frac{d\sqrt{d}H}{\sqrt{K}}. \quad (28)
$$
Here $K\ge\Omega(d^3)$ ensures that $\zeta\le\frac{1}{\sqrt{3d}}$.

On the other hand, since the next state is uniform on $\{-1,+1\}$ regardless of $(s,a,b)$, we have
$$
\mathrm{Var}[V^*_{h+1}](s,a,b) = \frac12\Big(V^*_{h+1}(-1)-\frac{V^*_{h+1}(-1)+V^*_{h+1}(+1)}{2}\Big)^2+\frac12\Big(V^*_{h+1}(+1)-\frac{V^*_{h+1}(-1)+V^*_{h+1}(+1)}{2}\Big)^2 \lesssim \frac{d^2}{K}+\frac14 = O(1),
$$
where we expand $V^*_{h+1}(+1)-V^*_{h+1}(-1)$ by the Bellman equation, use $r_{u,h+1}(1,u_{h+1},u_{h+1})-r_{u,h+1}(-1,u_{h+1},u_{h+1})=\frac{2}{\sqrt{6}}$ together with $\zeta=\Theta(d/\sqrt{K})$, and apply Lemma 13. Moreover, $\mathbb{E}\Sigma^*_h/K$ is invertible with $\|(\mathbb{E}\Sigma^*_h/K)^{-1}\|\le d^2$ (by Gaussian elimination), so for all $(s,a,b)$ and $K\ge\Omega(d^4\log(2Hd/\delta))$, it holds with probability at least $1-\delta$ that $\|\phi(s,a,b)\|_{(\Sigma^*_h)^{-1}}\lesssim\frac{d}{\sqrt{K}}$. Hence,
$$
\max_{\nu}\sum_{h=1}^H\mathbb{E}_{\pi^*,\nu}\|\phi(s_h,a_h,b_h)\|_{(\Sigma^*_h)^{-1}}+\max_{\pi}\sum_{h=1}^H\mathbb{E}_{\pi,\nu^*}\|\phi(s_h,a_h,b_h)\|_{(\Sigma^*_h)^{-1}} \lesssim \frac{dH}{\sqrt{K}} \quad (29)
$$
for any $u\in\mathcal{U}$. Combining Eqn. (28), Eqn. (29), and the fact that $|V^*_u-V^{\pi,\nu}_u| \le |V^{*,\nu}_u-V^{\pi,*}_u|$ for any $u\in\mathcal{U}$, we have with probability at least $1-\delta$ that
$$
\inf_{(\hat{\pi},\hat{\nu})}\sup_{u\in\mathcal{U}}\ \mathbb{E}_u\big[V^{*,\hat{\nu}}_u-V^{\hat{\pi},*}_u\big] \gtrsim \sqrt{d}\,\Big(\max_{\nu}\sum_{h=1}^H\mathbb{E}_{\pi^*,\nu}\|\phi(s_h,a_h,b_h)\|_{(\Sigma^*_h)^{-1}}+\max_{\pi}\sum_{h=1}^H\mathbb{E}_{\pi,\nu^*}\|\phi(s_h,a_h,b_h)\|_{(\Sigma^*_h)^{-1}}\Big),
$$
which concludes our proof.

Remark 1. In our construction, the min-player does not affect the rewards and transitions. By derivations similar to (28) and (29), we can obtain $\inf_{\pi}\max_{u\in\mathcal{U}}\mathbb{E}_u[V^*_u-V^{\pi}_u] \gtrsim \frac{d\sqrt{d}H}{\sqrt{K}}$ and $\sum_{h=1}^H\mathbb{E}_{\pi^*}\|\phi(s_h,a_h)\|_{(\Sigma^*_h)^{-1}} \lesssim \frac{dH}{\sqrt{K}}$. Hence, we can establish the lower bound for MDPs as desired.
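To make the construction concrete, the sketch below draws one offline episode from $M_u$ under the behavior policy $\mu$. It mirrors the definitions above (uniform transitions, Gaussian reward observations); the function name and the representation of $u$ as an $(H, d-2)$ array with $\pm1$ entries are our conventions.

```python
import numpy as np

def sample_episode(u, d, H, zeta, rng):
    """Draw one trajectory from the hard instance M_u under the behavior policy."""
    basis = np.eye(d - 2)
    s = rng.choice([-1, 1])                    # initial state ~ Uniform{-1, +1}
    episode = []
    for h in range(H):
        # mu: a = e_j with prob. 1/d each (j = 1..d-2), a = 0 with prob. 2/d; b = 0.
        j = rng.integers(d)
        a = basis[j] if j < d - 2 else np.zeros(d - 2)
        mean_reward = s / np.sqrt(6) + zeta / np.sqrt(2 * d) * (a @ u[h])
        r = rng.normal(mean_reward, 1.0)       # R_{u,h} ~ N(mean, 1)
        episode.append((s, a, r))
        s = rng.choice([-1, 1])                # uniform transition, independent of (a, b)
    return episode
```

Each coordinate of $u_h$ is probed with probability only $1/d$, so every direction of the feature space is only weakly covered; this is the source of the $K/d$ factor in the KL computation above.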

H NUMERICAL SIMULATIONS

For completeness, we adopt a synthetic linear MDP instance similar to the one used in Min et al. (2021) and Yin et al. (2022), and rerun the experiments to verify the theoretical findings. The adopted MDP has $\mathcal{S}=\{0,1\}$, $\mathcal{A}=\{0,\ldots,99\}$, and $d=10$. For the feature, we apply binary encoding to represent $a\in\mathcal{A}$ by a vector $\mathbf{a}\in\mathbb{R}^8$. For the last two coordinates of the feature, we define $\delta(x,a)=\mathbf{1}(x=0,a=0)$, where $\mathbf{1}$ is the indicator function. The MDP is then characterized as follows. • True measure $\nu_h$: $\nu_h(x')=(0,\ldots,0,(1-x')\oplus\alpha_h,\,x'\oplus\alpha_h)$, where $\{\alpha_h\}_{h\in[H]}$ is a sequence of integers taking values in $\{0,1\}$, generated randomly and then fixed, and $\oplus$ is the XOR operator. The transition is given by $P_h(x'|x,a)=\langle\phi(x,a),\nu_h(x')\rangle$. • Reward function: we define $\theta_h=(0,\ldots,0,r,1-r)\in\mathbb{R}^{10}$ with $r=0.9$ to obtain the mean reward $r_h(x,a)=\langle\phi(x,a),\theta_h\rangle$. The reward is then generated as a Bernoulli random variable. • Behavior policy: choose $a=0$ with probability $p$, and each other action uniformly with probability $(1-p)/99$; we choose $p=0.5$. • The initial state is chosen uniformly from $\mathcal{S}$. • The regularization parameter $\lambda$ is set to $0.1$, as suggested by Yin et al. (2022), and we estimate the value of the learned policy by 1000 i.i.d. trajectories, where we also set the reward to its mean during this evaluation process. Figures 1(a) and 1(b) match our theoretical findings that a sharper bonus function leads to a smaller suboptimality. Accordingly, LinPEVI-ADV+ achieves the best sample complexity, and PEVI performs worst. In particular, as $H$ increases, both LinPEVI-ADV and PEVI degrade significantly, while the convergence rate of LinPEVI-ADV+ remains rather stable. This demonstrates the power of variance information, which has also been observed in previous work on offline linear MDPs (Min et al., 2021; Yin et al., 2022).
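For reproducibility, here is a minimal sketch of the feature map described above, assuming the convention $\mathcal{S}=\{0,1\}$ and the 8-bit binary encoding of actions; the helper name is ours.

```python
import numpy as np

def phi(x, a, d=10):
    """Feature map of the synthetic linear MDP: 8 binary-encoding bits of the
    action a in {0, ..., 99}, followed by delta(x, a) and 1 - delta(x, a)."""
    bits = [float((a >> i) & 1) for i in range(8)]  # binary encoding of a
    delta = float(x == 0 and a == 0)                # delta(x, a) = 1(x = 0, a = 0)
    return np.array(bits + [delta, 1.0 - delta])
```

Together with the measures $\nu_h$ and the reward vector $\theta_h$ above, this fully specifies $P_h$ and $r_h$ as inner products, so the instance is an exact linear MDP rather than an approximation.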

I PROOF OF AUXILIARY LEMMAS

Proof of Lemma 3. Adding and subtracting $\phi(x,a)^\top\Sigma_h^{-1}(\Sigma_h-\lambda I_d)w_h$ in the second equality, we obtain
$$
\big(\mathbb{T}_hf_{h+1}\big)(x,a)-\big(\widehat{\mathbb{T}}_hf_{h+1}\big)(x,a) = \phi(x,a)^\top(w_h-\widehat{w}_h) = \phi(x,a)^\top w_h-\phi(x,a)^\top\Sigma_h^{-1}\sum_{\tau\in\mathcal{D}}\frac{\phi^\tau_h\cdot\big(r^\tau_h+f_{h+1}(x^\tau_{h+1})\big)}{\hat{\sigma}^2_h(x^\tau_h,a^\tau_h)}
$$
$$
=\phi(x,a)^\top w_h-\phi(x,a)^\top\Sigma_h^{-1}(\Sigma_h-\lambda I_d)w_h+\phi(x,a)^\top\Sigma_h^{-1}\Big(\sum_{\tau\in\mathcal{D}}\frac{\phi^\tau_h(\phi^\tau_h)^\top}{\hat{\sigma}^2_h(x^\tau_h,a^\tau_h)}\Big)w_h-\phi(x,a)^\top\Sigma_h^{-1}\sum_{\tau\in\mathcal{D}}\frac{\phi^\tau_h\cdot\big(r^\tau_h+f_{h+1}(x^\tau_{h+1})\big)}{\hat{\sigma}^2_h(x^\tau_h,a^\tau_h)}
$$
$$
=\lambda\,\phi(x,a)^\top\Sigma_h^{-1}w_h+\phi(x,a)^\top\Sigma_h^{-1}\sum_{\tau\in\mathcal{D}}\frac{\phi^\tau_h\cdot\big(\mathbb{T}_hf_{h+1}(x^\tau_h,a^\tau_h)-r^\tau_h-f_{h+1}(x^\tau_{h+1})\big)}{\hat{\sigma}^2_h(x^\tau_h,a^\tau_h)}
$$
$$
\le \underbrace{\lambda\,\|w_h\|_{\Sigma_h^{-1}}\,\|\phi(x,a)\|_{\Sigma_h^{-1}}}_{(i)}+\underbrace{\Big\|\sum_{\tau\in\mathcal{D}}\frac{\phi^\tau_h}{\hat{\sigma}_h(x^\tau_h,a^\tau_h)}\cdot\xi^\tau_h(f_{h+1})\Big\|_{\Sigma_h^{-1}}}_{(ii)}\,\|\phi(x,a)\|_{\Sigma_h^{-1}},
$$
where the last step is Cauchy-Schwarz. This proves the first part of the lemma. Now suppose that $|f_{h+1}|$ is bounded by $H-1$. We have
$$
\|w_h\| = \Big\|\theta_h+\int_{\mathcal{S}}f_{h+1}(x')\,\mu_h(x')\,\mathrm{d}x'\Big\| \le \big(1+\max_{x'}f_{h+1}(x')\big)\sqrt{d} \le H\sqrt{d},
$$
and
$$
\lambda\,\|w_h\|_{\Sigma_h^{-1}} = \lambda\sqrt{w_h^\top\Sigma_h^{-1}w_h} \le \lambda\,\|w_h\|\,\sqrt{\lambda^{-1}} = \sqrt{\lambda}\,\|w_h\| \le \sqrt{\lambda d}\,H,
$$
where we use $\lambda_{\min}(\Sigma_h)\ge\lambda$, i.e., $\|\Sigma_h^{-1}\|_{\mathrm{op}}\le\lambda^{-1}$. Therefore, by setting $\lambda$ sufficiently small such that $\sqrt{\lambda d}\,H\le\beta$, the term $(i)$ is non-dominating and we can focus on bounding $(ii)$.
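Since the proof of Lemma 3 is a purely algebraic manipulation of the weighted ridge solution, the resulting identity can be verified numerically. The snippet below checks $\phi^\top(w_h-\widehat{w}_h)=\lambda\phi^\top\Sigma_h^{-1}w_h+\phi^\top\Sigma_h^{-1}\sum_\tau\phi^\tau_h\big(\mathbb{T}_hf_{h+1}-r^\tau_h-f_{h+1}(x^\tau_{h+1})\big)/\hat{\sigma}^2_h$ on random synthetic data; every quantity is an arbitrary stand-in.

```python
import numpy as np

# Numerical check of the algebraic identity in the proof of Lemma 3.
rng = np.random.default_rng(1)
d, K, lam = 5, 50, 0.7
Phi = rng.normal(size=(K, d))
sig2 = 1.0 + rng.random(K)                      # arbitrary variance weights >= 1
w = rng.normal(size=d)                          # Bellman weight, so that T f = phi^T w
y = Phi @ w + rng.normal(size=K)                # noisy targets r + f(x')
W = Phi / sig2[:, None]
Sigma = lam * np.eye(d) + W.T @ Phi
w_hat = np.linalg.solve(Sigma, W.T @ y)
phi = rng.normal(size=d)

lhs = phi @ (w - w_hat)
residual = Phi @ w - y                          # T f - r - f(x'), per sample
rhs = lam * phi @ np.linalg.solve(Sigma, w) + phi @ np.linalg.solve(Sigma, W.T @ residual)
assert np.isclose(lhs, rhs)
```

The two bound terms $(i)$ and $(ii)$ are then just Cauchy-Schwarz applied to the two summands of the right-hand side.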

J TECHNICAL LEMMAS

Lemma 8 (Hoeffding's inequality (Wainwright, 2019)). Let $X_1,\ldots,X_n$ be mean-zero independent random variables such that $|X_i|\le\xi_i$ almost surely. Then, for any $t>0$, we have
$$
\mathbb{P}\Big(\frac{1}{n}\sum_{i=1}^nX_i \ge t\Big) \le \exp\Big(-\frac{2n^2t^2}{\sum_{i=1}^n\xi_i^2}\Big).
$$

Lemma 9 (Hoeffding-type inequality for self-normalized processes (Abbasi-Yadkori et al., 2011)). Let $\{\eta_t\}_{t=1}^\infty$ be a real-valued stochastic process and let $\{\mathcal{F}_t\}_{t=0}^\infty$ be a filtration such that $\eta_t$ is $\mathcal{F}_t$-measurable. Let $\{x_t\}_{t=1}^\infty$ be an $\mathbb{R}^d$-valued stochastic process where $x_t$ is $\mathcal{F}_{t-1}$-measurable and $\|x_t\|\le L$. Let $\Lambda_t=\lambda I_d+\sum_{s=1}^tx_sx_s^\top$. Assume that, conditioned on $\mathcal{F}_{t-1}$, $\eta_t$ is mean-zero and $R$-sub-Gaussian. Then for any $\delta>0$, with probability at least $1-\delta$, for all $t>0$, we have
$$
\Big\|\sum_{s=1}^tx_s\eta_s\Big\|^2_{\Lambda_t^{-1}} \le R^2\Big(2\log(1/\delta)+d\log\big(1+\tfrac{tL^2}{\lambda}\big)\Big).
$$

Lemma 10 (Bernstein-type inequality for self-normalized processes (Zhou et al., 2021)). Let $\{\eta_t\}_{t=1}^\infty$ be a real-valued stochastic process and let $\{\mathcal{F}_t\}_{t=0}^\infty$ be a filtration such that $\eta_t$ is $\mathcal{F}_t$-measurable. Let $\{x_t\}_{t=1}^\infty$ be an $\mathbb{R}^d$-valued stochastic process where $x_t$ is $\mathcal{F}_{t-1}$-measurable and $\|x_t\|\le L$. Let $\Lambda_t=\lambda I_d+\sum_{s=1}^tx_sx_s^\top$. Assume that $|\eta_t|\le R$, $\mathbb{E}[\eta_t\,|\,\mathcal{F}_{t-1}]=0$, and $\mathbb{E}[\eta_t^2\,|\,\mathcal{F}_{t-1}]\le\sigma^2$. Then for any $\delta>0$, with probability at least $1-\delta$, for all $t>0$, we have
$$
\Big\|\sum_{s=1}^tx_s\eta_s\Big\|_{\Lambda_t^{-1}} \le 8\sigma\sqrt{d\log\big(1+\tfrac{tL^2}{\lambda d}\big)\log\big(\tfrac{4t^2}{\delta}\big)}+4R\log\big(\tfrac{4t^2}{\delta}\big).
$$

Lemma 11 ($\epsilon$-Covering (Jin et al., 2021c)). For all $h\in[H]$ and all $\epsilon>0$, let $\mathcal{N}(\epsilon)$ be the covering number of the function space $\mathcal{V}_h(D,B,\lambda)$ specified in (19). Then
$$
\log|\mathcal{N}(\epsilon)| \le d\log\Big(1+\frac{4D}{\epsilon}\Big)+d^2\log\Big(1+\frac{8\sqrt{d}B^2}{\lambda\epsilon^2}\Big).
$$

Lemma 12 (Lemma H.…). For any $\delta\in(0,1)$, with probability at least $1-\delta$, it holds that …

Lemma 13. Suppose $\lambda_{\min}(\Sigma_h)\ge\kappa>0$, where $\Sigma_h$ denotes the population feature covariance under the behavior distribution at step $h$. For any $\delta\in(0,1)$, if $K\ge\Omega\big(\log(2dH/\delta)/\kappa^2\big)$, then with probability at least $1-\delta$, it holds simultaneously for all $u\in\mathbb{R}^d$ that
$$
\|u\|_{\Lambda_h^{-1}} \le \frac{2\|u\|_2}{\sqrt{K\kappa}}.
$$
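The bound of Lemma 13 (as reconstructed above) is what converts every self-normalized bonus into a $1/\sqrt{K\kappa}$ rate. The following quick simulation illustrates it under an arbitrary, well-conditioned sampling distribution: once $K$ is large, $\Lambda_h$ concentrates around $K\Sigma_h+\lambda I$ and the weighted norm drops below $2\|u\|_2/\sqrt{K\kappa}$.

```python
import numpy as np

# Empirical illustration of ||u||_{Lambda^{-1}} <= 2 ||u||_2 / sqrt(K * kappa).
rng = np.random.default_rng(2)
d, K, lam = 6, 5000, 1.0
cov = np.diag(np.linspace(0.2, 1.0, d)) / d     # population covariance Sigma_h
kappa = np.linalg.eigvalsh(cov).min()           # kappa = lambda_min(Sigma_h)
Phi = rng.multivariate_normal(np.zeros(d), cov, size=K)
Lam = lam * np.eye(d) + Phi.T @ Phi
u = rng.normal(size=d)
lhs = np.sqrt(u @ np.linalg.solve(Lam, u))
rhs = 2 * np.linalg.norm(u) / np.sqrt(K * kappa)
print(f"||u||_Lambda^-1 = {lhs:.4f} <= {rhs:.4f}")
```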



The results readily generalize to stochastic rewards, as the uncertainty of the reward is non-dominating compared with that of the state transition. See page 17 of Zanette et al. (2020) for a discussion of the error amplification issue.



Offline MGs. Existing works studying sample-efficient equilibrium finding in offline MARL include Zhong et al. (2021); Chen et al. (2021); Cui and Du (2022); Zhong et al. (2022).

For instance, in Lemma C.3 of Jin et al. (2020), one also needs to analyze the self-normalized process, where the uniform concentration leads to an extra factor of $\sqrt{d}$. The recent work of Hu et al. (2022) leverages similar ideas of reference-advantage decomposition to improve the regret bound for linear MDPs, but focuses on the online setting, so both its algorithmic design (optimism vs. pessimism, choices of weights) and its proof techniques (online vs. offline) are different from ours. Chen et al. (2022) considers linear mixture MGs, a model different from the one considered in this paper.



Figure 1: Suboptimality vs. the number of trajectories $K$, for horizons (a) $H=20$ and (b) $H=40$. The results are averaged over 100 independent trials; the mean is plotted as a solid line, and the shaded area corresponds to the standard deviation.




[Algorithm excerpt] Input: datasets $\mathcal{D}$, $\mathcal{D}'$, and $\beta$; initialize $\overline{V}_{H+1}(\cdot)=\underline{V}_{H+1}(\cdot)=0$; construct variance estimators $\overline{\sigma}^2_h$ and $\underline{\sigma}^2_h$ as in Appendix B.3 with $\mathcal{D}'$; for $h=H,\ldots,1$ do …


ACKNOWLEDGEMENTS

Wei Xiong and Tong Zhang acknowledge the funding support from GRF 16310222 and GRF 16201320. Chengshuai Shi and Cong Shen acknowledge the funding support from the US National Science Foundation under Grants ECCS-2029978, ECCS-2033671, ECCS-2143559, and CNS-2002902, and from the Bloomberg Data Science Ph.D. Fellowship. Liwei Wang is supported by the National Key R&D Program of China (2022ZD0114900) and the National Science Foundation of China (NSFC 62276005).


