WHAT ARE THE STATISTICAL LIMITS OF OFFLINE RL WITH LINEAR FUNCTION APPROXIMATION?

Abstract

Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions. This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: 1) we have realizability in that the true value function of every policy is linear in a given set of features and 2) our off-policy data has good coverage over all features (under a strong spectral condition), any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon to nontrivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).

1. INTRODUCTION

Offline methods (also known as off-policy methods or batch methods) are a promising methodology for alleviating the sample complexity burden in challenging reinforcement learning (RL) settings, particularly those where sample efficiency is paramount (Mandel et al., 2014; Gottesman et al., 2018; Wang et al., 2018; Yu et al., 2019). Off-policy methods are often applied together with function approximation schemes; such methods take sample transition data and reward values as inputs, and approximate the value of a target policy or the value function of the optimal policy. Indeed, many practical deep RL algorithms find their prototypes in the literature of offline RL. For example, when running on off-policy data (sometimes termed "experience replay"), deep Q-networks (DQN) (Mnih et al., 2015) can be viewed as an analog of Fitted Q-Iteration (Gordon, 1999) with neural networks as the function approximators. More recently, there are an increasing number of both model-free (Laroche et al., 2019; Fujimoto et al., 2019; Jaques et al., 2020; Kumar et al., 2019; Agarwal et al., 2020) and model-based (Ross & Bagnell, 2012; Kidambi et al., 2020) offline RL methods, with steady improvements in performance (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2020; Kidambi et al., 2020). However, despite the importance of these methods, the extent to which data reuse is possible, especially when off-policy methods are combined with function approximation, is not well understood. For example, deep Q-networks require millions of samples to solve certain Atari games (Mnih et al., 2015). Also important is that in some safety-critical settings, we seek guarantees that offline-trained policies will be effective (Thomas, 2014; Thomas et al., 2019). A basic question here is whether there are fundamental statistical limits on such methods, where sample-efficient offline RL is simply not possible without further restrictions on the problem.
In the context of supervised learning, it is well known that empirical risk minimization is sample-efficient if the hypothesis class has bounded complexity. For example, suppose the agent is given a d-dimensional feature extractor, and the ground truth labeling function is a (realizable) linear function with respect to the feature mapping. Here, it is well known that a number of samples polynomial in d suffices for a given target accuracy. Furthermore, in this realizable case, provided the training data has good feature coverage, we will have good accuracy against any test distribution. In the more challenging offline RL setting, it is unclear if sample-efficient methods are possible, even under analogous assumptions. This is our motivation for considering the following question:

What are the statistical limits for offline RL with linear function approximation?

Here, one may hope that value estimation for a given policy is possible in the offline RL setting under the analogous set of assumptions that enable sample-efficient supervised learning, i.e., 1) (realizability) the features can perfectly represent the value functions and 2) (good coverage) the feature covariance matrix of our off-policy data has lower bounded eigenvalues. The extant body of provable methods on offline RL either makes representational assumptions that are far stronger than realizability or assumes distribution shift conditions that are far stronger than having coverage with regards to the spectrum of the feature covariance matrix of the data distribution. For example, Szepesvári & Munos (2005) analyze offline RL methods by assuming a representational condition where the features satisfy (approximate) closedness under Bellman updates, which is a far stronger representation condition than realizability. Recently, Xie & Jiang (2020a) propose an offline RL algorithm that only requires realizability as the representation condition.
However, the algorithm in (Xie & Jiang, 2020a) requires a more stringent data distribution condition. Whether it is possible to design a sample-efficient offline RL method under the realizability assumption and a reasonable data coverage assumption (an open problem in (Chen & Jiang, 2019)) is the focus of this work.

Our Contributions. Perhaps surprisingly, our main result shows that, under only the above two assumptions, it is information-theoretically not possible to design a sample-efficient algorithm to non-trivially estimate the value of a given policy. The following theorem is an informal version of the result in Section 4.

Theorem 1.1 (Informal). In the offline RL setting, suppose the data distributions have (polynomially) lower bounded eigenvalues, and the Q-functions of every policy are linear with respect to a given feature mapping. Any algorithm requires a number of samples exponential in the horizon H to output a non-trivially accurate estimate of the value of any given policy π, with constant probability.

This hardness result states that even if the Q-functions of all policies are linear with respect to the given feature mapping, we still require an exponential number of samples to evaluate any given policy. Note that this representation condition is significantly stronger than assuming realizability with regards to only a single target policy; it assumes realizability for all policies. Regardless, even under this stronger representation condition, it is hard to evaluate any policy, as specified in our hardness result. This result also formalizes a key issue in offline reinforcement learning with function approximation: geometric error amplification. To better illustrate the issue, in Section 5, we analyze the classical Least-Squares Policy Evaluation (LSPE) algorithm under the realizability assumption, which demonstrates how the error propagates as the algorithm proceeds.
Here, our analysis shows that, if we only rely on the realizability assumption, then a far more stringent condition is required for sample-efficient offline policy evaluation: the off-policy data distribution must be quite close to the distribution induced by the policy to be evaluated. Our results highlight that sample-efficient offline RL is simply not possible unless either the distribution shift condition is sufficiently mild or we have stronger representation conditions that go well beyond realizability. See Section 5 for more details. Furthermore, our hardness result implies an exponential separation in sample complexity between offline RL and supervised learning, since supervised learning (which is equivalent to offline RL with H = 1) is possible with a polynomial number of samples under the same set of assumptions. A few additional points are worth emphasizing with regards to our lower bound construction:

• Our results imply that Least-Squares Policy Evaluation (LSPE, i.e., using Bellman backups with linear regression) will fail. Interestingly, while LSPE will provide an unbiased estimator, our results imply that it will have exponential variance in the problem horizon.

• Our construction is simple and does not rely on having a large state or action space: the size of the state space is only O(d · H), where d is the feature dimension and H is the planning horizon, and the size of the action space is only 2. This stands in contrast to other RL lower bounds, which typically require state spaces that are exponential in the problem horizon (e.g., see (Du et al., 2020)).

• We provide two hard instances, one with a sparse reward (and stochastic transitions) and another with deterministic dynamics (and stochastic rewards). These two hard instances jointly imply that both the estimation error on reward values and the estimation error on the transition probabilities can be geometrically amplified in offline RL.
• Of possibly broader interest is that our hard instances are, to our knowledge, the first concrete examples showing that geometric error amplification is real in RL problems (even with realizability). While this is a known concern in the analysis of RL algorithms, there have been no concrete examples exhibiting such behavior under only a realizability assumption.
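As a point of reference for the separation just described, the following sketch illustrates the H = 1 (supervised) case, where realizability plus feature coverage make ordinary least squares sample-efficient. The dimensions, noise level, and sampling distributions here are our own illustrative choices, not part of the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 5000

# Ground-truth linear labeling function (realizability).
theta_star = rng.normal(size=d)

# Training features with good coverage: the covariance is well conditioned.
X = rng.normal(size=(N, d)) / np.sqrt(d)
y = X @ theta_star + 0.1 * rng.normal(size=N)  # noisy labels

# Ordinary least squares.
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Because the parameter error itself is small (roughly noise * sqrt(d / N)),
# accuracy transfers to ANY test distribution over bounded features.
param_err = np.linalg.norm(theta_hat - theta_star)
print(param_err)
```

The lower bound of Section 4 shows that no such guarantee survives once H > 1, even under the analogous realizability and coverage assumptions.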

2. RELATED WORK

We now survey prior work on offline RL, largely focusing on theoretical results. We also discuss results on the error amplification issue in RL. Concurrent to this work, Xie & Jiang (2020a) propose an offline RL algorithm under the realizability assumption, which requires stronger distribution shift conditions. We will discuss this work shortly.

Existing Algorithms and Analysis. Offline RL with value function approximation is closely related to Approximate Dynamic Programming (Bertsekas & Tsitsiklis, 1995). Existing works (Munos, 2003; Szepesvári & Munos, 2005; Antos et al., 2008; Munos & Szepesvári, 2008; Tosatto et al., 2017; Xie & Jiang, 2020b; Duan & Wang, 2020) that analyze the sample complexity of approximate dynamic programming-based approaches usually make the following two categories of assumptions: (i) representation conditions that assume the function class approximates the value functions well and (ii) distribution shift conditions that assume the given data distribution has sufficient coverage over the state-action space. As mentioned in the introduction, the desired representation condition would be realizability, which only assumes that the value function of the policy to be evaluated lies in the function class (for the case of offline policy evaluation) or that the optimal value function lies in the function class (for the case of finding near-optimal policies); existing works usually make stronger assumptions. For example, Szepesvári & Munos (2005); Duan & Wang (2020) assume (approximate) closedness under Bellman updates, which is much stronger than realizability. Whether it is possible to design a sample-efficient offline RL method under the realizability assumption and a reasonable data coverage assumption is left as an open problem in (Chen & Jiang, 2019). To measure the coverage over the state-action space of the given data distribution, existing works assume the concentrability coefficient (introduced by Munos (2003)) to be bounded.
The concentrability coefficient, informally speaking, is the largest possible ratio between the probability for a state-action pair (s, a) to be visited by a policy and the probability that (s, a) appears under the data distribution. Since we work with linear function approximation in this work, we measure the distribution shift in terms of the spectrum of the feature covariance matrices (see Assumption 2), which is a well-known sufficient condition in the context of supervised learning and is much more natural for the case of linear function approximation. Concurrent to this work, Xie & Jiang (2020a) propose an algorithm that works under the realizability assumption instead of the stronger representation conditions used in prior work. However, the algorithm in (Xie & Jiang, 2020a) requires a much stronger data distribution condition, which assumes a stringent version of the concentrability coefficient introduced by Munos (2003) to be bounded. In contrast, in this work we measure the distribution shift in terms of the spectrum of the feature covariance matrix of the data distribution, which is more natural than the concentrability coefficient for the case of linear function approximation. Recently, there has been great interest in approaching offline policy evaluation (Precup, 2000) via importance sampling. For recent work on this topic, see (Dudík et al., 2011; Mandel et al., 2014; Thomas et al., 2015; Li et al., 2015; Jiang & Li, 2016; Thomas & Brunskill, 2016; Guo et al., 2017; Wang et al., 2017; Liu et al., 2018; Farajtabar et al., 2018; Xie et al., 2019; Kallus & Uehara, 2019; Liu et al., 2019; Uehara & Jiang, 2019; Kallus & Uehara, 2020; Jiang & Huang, 2020; Feng et al., 2020). Offline policy evaluation with importance sampling incurs exponential variance in the planning horizon when the behavior policy is significantly different from the policy to be evaluated.
Bypassing such exponential dependence requires non-trivial function approximation assumptions (Jiang & Huang, 2020; Feng et al., 2020; Liu et al., 2018). Finally, Kidambi et al. (2020) provide a model-based offline RL algorithm, with a theoretical analysis based on hitting times.

Hardness Results. Algorithm-specific hardness results have been known for a long time in the literature on Approximate Dynamic Programming. See Chapter 4 in (Van Roy, 1994) and also (Gordon, 1995; Tsitsiklis & Van Roy, 1996). These works demonstrate that certain approximate dynamic programming-based methods will diverge on hard cases. However, such hardness results only hold for a restricted class of algorithms, and to demonstrate the fundamental difficulty of offline RL, it is more desirable to obtain information-theoretic lower bounds, as initiated by Chen & Jiang (2019). Existing (information-theoretic) exponential lower bounds (Krishnamurthy et al., 2016; Sun et al., 2017; Chen & Jiang, 2019) usually construct unstructured MDPs with an exponentially large state space. Du et al. (2020) prove an exponential lower bound under the assumption that the optimal Q-function is approximately linear. The condition that the optimal Q-function is only approximately linear is crucial for the hardness result in Du et al. (2020). The techniques in (Du et al., 2020) were later generalized to other settings (Kumar et al., 2020; Wang et al., 2020; Mou et al., 2020).

Error Amplification in RL. Error amplification induced by distribution shift and a long planning horizon is a known issue in the theoretical analysis of RL algorithms. See (Gordon, 1995; 1996; Munos & Moore, 1999; Ormoneit & Sen, 2002; Kakade, 2003; Zanette et al., 2019) for papers on this topic and additional assumptions that mitigate this issue. Error amplification in offline RL is also observed in empirical works (see, e.g., (Fujimoto et al., 2019)).
In this work, we provide the first information-theoretic lower bound showing that geometric error amplification is real in offline RL.

3. THE OFFLINE POLICY EVALUATION PROBLEM

Throughout this paper, for a given integer H, we use [H] to denote the set {1, 2, . . . , H}.

Episodic Reinforcement Learning. Let M = (S, A, P, R, H) be a Markov Decision Process (MDP), where S is the state space, A is the action space, P : S × A → ∆(S) is the transition operator which takes a state-action pair and returns a distribution over states, R : S × A → ∆(R) is the reward distribution, and H ∈ Z₊ is the planning horizon. For simplicity, we assume a fixed initial state s_1 ∈ S. A (stochastic) policy π : S → ∆(A) chooses an action a randomly based on the current state s. The policy π induces a (random) trajectory s_1, a_1, r_1, s_2, a_2, r_2, . . . , s_H, a_H, r_H, where a_1 ∼ π_1(s_1), r_1 ∼ R(s_1, a_1), s_2 ∼ P(s_1, a_1), a_2 ∼ π_2(s_2), etc. To streamline our analysis, for each h ∈ [H], we use S_h ⊆ S to denote the set of states at level h, and we assume the sets S_h do not intersect with each other. We assume, almost surely, that r_h ∈ [−1, 1] for all h ∈ [H].

Value Functions. Given a policy π, h ∈ [H] and (s, a) ∈ S_h × A, define

$$Q^\pi_h(s, a) = \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} r_{h'} \,\big|\, s_h = s, a_h = a, \pi\Big] \quad\text{and}\quad V^\pi_h(s) = \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} r_{h'} \,\big|\, s_h = s, \pi\Big].$$

For a policy π, we define V^π = V^π_1(s_1) to be the value of π from the fixed initial state s_1.

Linear Function Approximation. When applying linear function approximation schemes, it is commonly assumed that the agent is given a feature extractor φ : S × A → R^d, which can either be hand-crafted or a pre-trained neural network, that transforms a state-action pair into a d-dimensional embedding, and that the Q-functions can be predicted by linear functions of the features. In this paper, we are interested in the following realizability assumption.

Assumption 1 (Realizable Linear Function Approximation). For every policy π : S → ∆(A), there exist θ^π_1, . . . , θ^π_H ∈ R^d such that for all (s, a) ∈ S × A and h ∈ [H], Q^π_h(s, a) = (θ^π_h)^⊤ φ(s, a).
Note that our assumption is much stronger than assuming realizability with regards to a single policy π (say the policy that we wish to evaluate); our assumption imposes realizability for all policies.

Offline Reinforcement Learning. This paper is concerned with the offline RL setting. In this setting, the agent does not have direct access to the MDP and instead is given access to data distributions {µ_h}_{h=1}^H where for each h ∈ [H], µ_h ∈ ∆(S_h × A). The inputs of the agent are H datasets {D_h}_{h=1}^H, and for each h ∈ [H], D_h consists of i.i.d. samples of the form (s, a, r, s′) ∈ S_h × A × R × S_{h+1}, where (s, a) ∼ µ_h, r ∼ R(s, a), s′ ∼ P(s, a). In this paper, we focus on the offline policy evaluation problem with linear function approximation: given a policy π : S → ∆(A) and a feature extractor φ : S × A → R^d, the goal is to output an accurate estimate of the value of π (i.e., V^π), using the collected datasets {D_h}_{h=1}^H, with as few samples as possible.

Notation. For a vector x ∈ R^d, we use ‖x‖₂ to denote its ℓ₂ norm. For a positive semidefinite matrix A, we use ‖A‖₂ to denote its operator norm, and σ_min(A) to denote its smallest eigenvalue. For two positive semidefinite matrices A and B, we write A ⪰ B to denote the Löwner partial ordering of matrices, i.e., A ⪰ B if and only if A − B is positive semidefinite. For a policy π : S → ∆(A), we use µ^π_h to denote the marginal distribution of s_h under π, i.e., µ^π_h(s) = Pr[s_h = s | π]. For a vector x ∈ R^d and a positive semidefinite matrix A ∈ R^{d×d}, we use ‖x‖_A to denote √(x^⊤ A x).
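The data-collection protocol above can be sketched as follows. The toy tabular MDP and the uniform data distributions µ_h below are hypothetical stand-ins, chosen only to make the format of the tuples (s, a, r, s′) with (s, a) ∼ µ_h concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
H, num_states, num_actions = 3, 4, 2

# Hypothetical toy MDP: tabular transitions and (deterministic) rewards per level.
P = rng.dirichlet(np.ones(num_states), size=(H, num_states, num_actions))
R = rng.uniform(-1, 1, size=(H, num_states, num_actions))

def collect(mu, N):
    """Draw N i.i.d. tuples (s, a, r, s_next) for each level h, with (s, a) ~ mu[h]."""
    data = []
    for h in range(H):
        level = []
        for _ in range(N):
            idx = rng.choice(num_states * num_actions, p=mu[h].ravel())
            s, a = divmod(idx, num_actions)
            r = R[h, s, a]
            s_next = rng.choice(num_states, p=P[h, s, a])
            level.append((s, a, r, s_next))
        data.append(level)
    return data

# Uniform data distributions over state-action pairs at every level.
mu = np.full((H, num_states, num_actions), 1.0 / (num_states * num_actions))
datasets = collect(mu, N=100)
print(len(datasets), len(datasets[0]))  # H levels, N tuples each
```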

4. THE LOWER BOUND: REALIZABILITY AND COVERAGE ARE INSUFFICIENT

We now present our main hardness result for offline policy evaluation with linear function approximation. It should be evident that without feature coverage in our dataset, realizability alone is clearly not sufficient for sample-efficient estimation. Here, we will make the strongest possible assumption with regards to the conditioning of the feature covariance matrix.

Assumption 2 (Feature Coverage). For all (s, a) ∈ S × A, assume our feature map is bounded such that ‖φ(s, a)‖₂ ≤ 1. Furthermore, suppose for each h ∈ [H], the data distribution µ_h satisfies the following minimum eigenvalue condition: σ_min(E_{(s,a)∼µ_h}[φ(s, a)φ(s, a)^⊤]) = 1/d.

Clearly, for the case where H = 1, the realizability assumption (Assumption 1) and the feature coverage assumption (Assumption 2) imply that the ordinary least squares estimator will accurately estimate θ_1. Our main result now shows that these assumptions are not sufficient for offline policy evaluation for long horizon problems.

Theorem 4.1. Suppose Assumption 2 holds. Fix an algorithm that takes as input both a policy and a feature mapping. There exists a (deterministic) MDP satisfying Assumption 1, such that for any policy π : S → ∆(A), the algorithm requires Ω((d/2)^H) samples to output the value of π up to constant additive approximation error with probability at least 0.9.

Although we focus on offline policy evaluation in this work, our hardness result also holds for finding near-optimal policies under Assumption 1 in the offline RL setting. Below we give a simple reduction. At the initial state, if the agent chooses action a_1, then the agent receives a fixed reward value (say 0.5) and terminates. If the agent chooses action a_2, then the agent transits to our hard instance. In order to find a policy with suboptimality at most 0.5, the agent must evaluate the value of the optimal policy in our hard instance up to an error of 0.5, and hence the hardness result holds.

Remark 1 (The sparse reward case).
As stated, the theorem uses a deterministic MDP (with stochastic rewards). See Appendix C for another hard case where the transition is stochastic and the reward is deterministic and sparse (only occurring at two states at h = H).

Remark 2 (Least-Squares Policy Evaluation (LSPE) has exponential variance). For offline policy evaluation with linear function approximation, the most naïve algorithm here would be LSPE, i.e., using ordinary least squares (OLS) to estimate θ^π, starting at level h = H and then proceeding backwards to level h = 1, using the plug-in estimator from the previous level. Here, LSPE will provide an unbiased estimate (provided the feature covariance matrices are full rank, which will occur with high probability). As a direct corollary, the above theorem implies that LSPE has exponential variance in H. See Section 5 for a more detailed discussion of LSPE. More generally, our theorem implies that there is no estimator that can avoid such exponential dependence in the offline setting.

Remark 3 (Least-Squares Value Iteration (LSVI) versus Least-Squares Policy Iteration (LSPI)). In the offline setting, under Assumptions 1 and 2, in order to find a near-optimal policy, the most naïve algorithm would be LSVI, i.e., using ordinary least squares (OLS) to estimate θ^*, starting at level h = H and then proceeding backwards to level h = 1, using the plug-in estimator from the previous level and the Bellman operator. The above theorem implies that LSVI will require an exponential number of samples to find a near-optimal policy. On the other hand, if the regression targets are collected by using rollouts (i.e., on-policy sampling) as in LSPI (Lagoudakis & Parr, 2003), then a polynomial number of samples suffices. See Section D in (Du et al., 2020) for an analysis. Thus, Theorem 4.1 implies an exponential separation in sample complexity between LSVI and LSPI. Of course, LSPI requires adaptive data samples and thus does not work in the offline setting.
One may wonder if Theorem 4.1 still holds when the data distributions {µ_h}_{h=1}^H are induced by a policy. In Appendix C, we prove another exponential sample complexity lower bound under the additional assumption that the data distributions are induced by a fixed policy π. However, under such an assumption, it is impossible to prove a hardness result as strong as Theorem 4.1 (which shows that evaluating any policy is hard), since one can at least evaluate the policy π that induces the data distributions. Nevertheless, we are able to prove the hardness of offline policy evaluation under a weaker version of Assumption 1. See Appendix C for more details.

In the remaining part of this section, we give the hard instance construction and the proof of Theorem 4.1. We use d to denote the feature dimension, and we assume d is even for simplicity. We use d̂ to denote d/2 for convenience. We also provide an illustration of the construction in Figure 1.

State Space, Action Space and Transition Operator. The action space A = {a_1, a_2}. For each h ∈ [H], S_h contains d̂ + 1 states s^1_h, s^2_h, . . . , s^d̂_h and s^{d̂+1}_h. For each h ∈ [H − 1] and each c ∈ {1, 2, . . . , d̂ + 1}, we have P(s^c_h, a_1) = s^{d̂+1}_{h+1} and P(s^c_h, a_2) = s^c_{h+1}.

Feature Mapping. For each h ∈ [H] and c ∈ [d̂], we set φ(s^c_h, a_1) = e_c and φ(s^c_h, a_2) = e_{c+d̂}, where e_1, . . . , e_d are the standard basis vectors of R^d. Moreover, φ(s^{d̂+1}_h, a_1) = φ(s^{d̂+1}_h, a_2) = Σ_{c∈[d̂]} e_c / d̂^{1/2}.

Reward Distributions. Let 0 ≤ r_0 ≤ d̂^{−H/2} be a parameter to be determined. For each (h, c) ∈ [H − 1] × [d̂] and a ∈ A, we set R(s^c_h, a) = 0 and R(s^{d̂+1}_h, a) = r_0 · (d̂^{1/2} − 1) · d̂^{(H−h)/2}. For the last level, for each c ∈ [d̂] and a ∈ A, we set R(s^c_H, a) = +1 with probability (1 + r_0)/2 and −1 with probability (1 − r_0)/2, so that E[R(s^c_H, a)] = r_0. Moreover, for all actions a ∈ A, R(s^{d̂+1}_H, a) = r_0 · d̂^{1/2}.

Verifying Assumption 1. The following lemma shows that Assumption 1 holds for our construction. The formal proof can be found in Appendix A.

Lemma 4.2. For every policy π : S → ∆(A), for each h ∈ [H] and all (s, a) ∈ S_h × A, we have Q^π_h(s, a) = (θ^π_h)^⊤ φ(s, a) for some θ^π_h ∈ R^d.

Figure 1 (caption): We mark the expectation of the reward value when it is stochastic. For each level h ∈ [H], for the data distribution µ_h, the state is chosen uniformly at random from those states in the dashed rectangle, i.e., {s^1_h, s^2_h, . . . , s^d̂_h}, while the action is chosen uniformly at random from {a_1, a_2}. Suppose the initial state is s^{d̂+1}_1. When r_0 = 0, the value of the policy is 0. When r_0 = d̂^{−H/2}, the value of the policy is r_0 · d̂^{H/2} = 1.

The Data Distributions.
For each level h ∈ [H], the data distribution µ_h is the uniform distribution over {(s^1_h, a_1), (s^1_h, a_2), (s^2_h, a_1), (s^2_h, a_2), . . . , (s^d̂_h, a_1), (s^d̂_h, a_2)}. Notice that (s^{d̂+1}_h, a) is not in the support of µ_h for any a ∈ A. It can be seen that E_{(s,a)∼µ_h}[φ(s, a)φ(s, a)^⊤] = (1/d) · I, so Assumption 2 is satisfied.

The Lower Bound. We show that it is information-theoretically hard for any algorithm to distinguish the case r_0 = 0 from the case r_0 = d̂^{−H/2}. We fix the initial state to be s^{d̂+1}_1 and consider any policy π. When r_0 = 0, all reward values are zero, and thus the value of π is zero. On the other hand, when r_0 = d̂^{−H/2}, the value of π is r_0 · d̂^{H/2} = 1. Thus, if the algorithm approximates the value of the policy up to an error of 1/2, then it must distinguish the case r_0 = 0 from the case r_0 = d̂^{−H/2}.

We first notice that in both cases, the data distributions {µ_h}_{h=1}^H, the feature mapping φ : S × A → R^d, the policy π to be evaluated, and the transition operator P are the same. Thus, the only way to distinguish the two cases is to query the reward distributions using samples taken from the data distributions. For all state-action pairs (s, a) in the support of the data distributions of the first H − 1 levels, the reward distributions are identical, because for all s ∈ S_h \ {s^{d̂+1}_h} and a ∈ A, we have R(s, a) = 0. In both cases, for all state-action pairs (s, a) in the support of the data distribution of the last level, R(s, a) = +1 with probability (1 + r_0)/2 and −1 with probability (1 − r_0)/2. Therefore, to distinguish the case r_0 = 0 from r_0 = d̂^{−H/2}, the agent needs to distinguish the two reward distributions

r_1 = { +1 with probability 1/2, −1 with probability 1/2 } and r_2 = { +1 with probability (1 + d̂^{−H/2})/2, −1 with probability (1 − d̂^{−H/2})/2 }.

Now we invoke Lemma B.1 in Section B, setting ε = d̂^{−H/2}/2 and δ = 0.9. By Lemma B.1, in order to distinguish r_1 and r_2 with probability at least 0.9, any algorithm requires Ω(d̂^H) samples.

Algorithm 1 Least-Squares Policy Evaluation
1: Input: policy π to be evaluated, number of samples N, regularization parameter λ > 0
2: Let Q̂_{H+1}(·, ·) = 0 and V̂_{H+1}(·) = 0
3: for h = H, H − 1, . . . , 1 do
4:   Take samples (s^i_h, a^i_h) ∼ µ_h, r^i_h ∼ R(s^i_h, a^i_h) and s̄^i_h ∼ P(s^i_h, a^i_h) for each i ∈ [N]
5:   Let Λ̂_h = Σ_{i∈[N]} φ(s^i_h, a^i_h) φ(s^i_h, a^i_h)^⊤ + λI
6:   Let θ̂_h = Λ̂_h^{−1} Σ_{i=1}^{N} φ(s^i_h, a^i_h) · (r^i_h + V̂_{h+1}(s̄^i_h))
7:   Let Q̂_h(·, ·) = φ(·, ·)^⊤ θ̂_h and V̂_h(·) = Q̂_h(·, π(·))

Remark 4. The key in our construction is the state s^{d̂+1}_h in each level, whose feature vector is defined to be Σ_{c∈[d̂]} e_c / d̂^{1/2}. In each level, s^{d̂+1}_h amplifies the Q-value by a d̂^{1/2} factor, due to the linearity of the Q-function. After all H levels, the value will be amplified by a d̂^{H/2} factor. Since s^{d̂+1}_h is not in the support of the data distribution, the only way to estimate the value of the policy is to estimate the expected reward value in the last level. Our construction forces the estimation error of the last level to be amplified exponentially and thus implies an exponential lower bound.
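To make the amplification concrete, the following sketch (with small, hypothetical values of H and d̂) computes the value of the initial state s^{d̂+1}_1 by backward induction over the construction above. It confirms that the settings r_0 = 0 and r_0 = d̂^{−H/2} produce policy values 0 and 1 respectively, even though the observable per-sample reward distributions differ only by d̂^{−H/2} in mean.

```python
import numpy as np

def value_of_initial_state(r0, H, d_hat):
    """Backward induction on the hard instance. States 0..d_hat-1 stand for
    s^1_h..s^{d_hat}_h; index d_hat is the special state s^{d_hat+1}_h.
    Action a1 moves every state to the special state, and both actions give
    the same Q-value at the special state, so we roll out a1 throughout."""
    V = np.zeros(d_hat + 1)                      # values at level h+1
    for h in range(H, 0, -1):
        R = np.zeros(d_hat + 1)                  # expected rewards at level h
        if h == H:
            R[:d_hat] = r0                       # mean of the +/-1 coin flip
            R[d_hat] = r0 * d_hat ** 0.5
        else:
            R[d_hat] = r0 * (d_hat ** 0.5 - 1) * d_hat ** ((H - h) / 2)
        # Under a1, every state at level h transitions to the special state.
        V = R + (V[d_hat] if h < H else 0.0)
    return V[d_hat]

H, d_hat = 4, 4                                  # hypothetical small sizes
print(value_of_initial_state(0.0, H, d_hat))                   # 0.0
print(value_of_initial_state(d_hat ** (-H / 2), H, d_hat))     # 1.0
```

The gap of 1 between the two cases, against a per-sample signal of d̂^{−H/2}, is exactly what forces the Ω(d̂^H) sample complexity via Lemma B.1.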

5. UPPER BOUNDS: LOW DISTRIBUTION SHIFT OR POLICY COMPLETENESS ARE SUFFICIENT

In order to illustrate the error amplification issue and discuss conditions that permit sample-efficient offline RL, in this section, we analyze Least-Squares Policy Evaluation when applied to the offline policy evaluation problem under the realizability assumption. The algorithm is presented in Algorithm 1. For simplicity here we assume the policy π to be evaluated is deterministic.
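A minimal implementation sketch of Algorithm 1 follows. The sampler `sample_batch` and feature map `phi` are hypothetical interfaces introduced here for illustration, not part of the paper.

```python
import numpy as np

def lspe(pi, phi, sample_batch, H, d, N, lam=1e-3):
    """Least-Squares Policy Evaluation (Algorithm 1 sketch).
    pi(s) -> action, phi(s, a) -> feature vector in R^d, and
    sample_batch(h, N) -> list of (s, a, r, s_next) tuples drawn from mu_h."""
    theta = np.zeros(d)                # theta_hat_{h+1}; zero beyond level H
    for h in range(H, 0, -1):
        batch = sample_batch(h, N)
        Phi = np.stack([phi(s, a) for s, a, _, _ in batch])
        # Regression targets r + V_hat_{h+1}(s'), with V_hat(s') = phi(s', pi(s'))^T theta.
        y = np.array([r + phi(sn, pi(sn)) @ theta for _, _, r, sn in batch])
        Lam_hat = Phi.T @ Phi + lam * np.eye(d)        # regularized covariance
        theta = np.linalg.solve(Lam_hat, Phi.T @ y)    # ridge estimate theta_hat_h
    return theta                       # theta_hat_1; V_hat = phi(s1, pi(s1)) @ theta
```

Each regression step here is individually well posed under Assumption 2, yet Theorem 4.1 shows that no polynomial choice of N makes the final estimate accurate on the hard instance: the per-level errors compound through the plug-in targets.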

Notation. For each h ∈ [H], define Λ_h = E_{(s,a)∼µ_h}[φ(s, a)φ(s, a)^⊤] to be the feature covariance matrix at level h. For each h ∈ [H − 1], define Λ̄_{h+1} = E_{(s,a)∼µ_h, s′∼P(·|s,a)}[φ(s′, π(s′))φ(s′, π(s′))^⊤] to be the feature covariance matrix of the one-step lookahead distribution at level h + 1. Moreover, define Λ̄_1 = φ(s_1, π(s_1))φ(s_1, π(s_1))^⊤. We define Φ_h to be the N × d matrix whose i-th row is φ(s^i_h, a^i_h)^⊤, and define Φ̄_{h+1} to be the N × d matrix whose i-th row is φ(s̄^i_h, π(s̄^i_h))^⊤. For each h ∈ [H] and i ∈ [N], define ξ^i_h = r^i_h + V^π_{h+1}(s̄^i_h) − Q^π_h(s^i_h, a^i_h), and let ξ_h denote the vector whose i-th entry is ξ^i_h.

Now we present a general lemma that characterizes the estimation error of Algorithm 1 by an equality. The proof can be found in Appendix D. Later, we apply this general lemma to special cases.

Lemma 5.1. Suppose λ > 0 in Algorithm 1, and for the given policy π there exist θ_1, θ_2, . . . , θ_H ∈ R^d such that for each h ∈ [H], Q^π_h(s, a) = φ(s, a)^⊤ θ_h for all (s, a) ∈ S_h × A. Then we have

$$\Big(Q^\pi(s_1, \pi(s_1)) - \hat{Q}(s_1, \pi(s_1))\Big)^2 = \bigg\| \sum_{h=1}^{H} \hat{\Lambda}_1^{-1} \Phi_1^\top \bar{\Phi}_2 \, \hat{\Lambda}_2^{-1} \Phi_2^\top \bar{\Phi}_3 \cdots \Big( \hat{\Lambda}_h^{-1} \Phi_h^\top \xi_h - \lambda \hat{\Lambda}_h^{-1} \theta_h \Big) \bigg\|_{\bar{\Lambda}_1}^2. \qquad (1)$$

Now we consider two special cases where the estimation error in Equation (1) can be upper bounded.

Low Distribution Shift. The first special case we focus on is the one where the distribution shift between the data distributions and the distribution induced by the policy to be evaluated is low. To measure the distribution shift formally, our main assumption is as follows.

Assumption 3. We assume that for each h ∈ [H], there exists C_h ≥ 1 such that C_h Λ_h ⪰ Λ̄_h.

Remark 5. For each h ∈ [H], if σ_min(Λ_h) ≥ 1/C_h for some C_h ≥ 1, then since Λ̄_h ⪯ I (recall that ‖φ(s, a)‖₂ ≤ 1), we have Λ̄_h ⪯ I ⪯ C_h Λ_h. Therefore, Assumption 3 can be replaced with the assumption that C_h Λ_h ⪰ I. However, we stick to the original version of Assumption 3 as it gives a tighter characterization of the distribution shift.

Now we state the theoretical guarantee of Algorithm 1. The proof can be found in Appendix D.

Theorem 5.2.
Suppose for the given policy $\pi$, there exist $\theta_1, \theta_2, \ldots, \theta_H \in \mathbb{R}^d$ such that for each $h \in [H]$, $Q_h^\pi(s,a) = \phi(s,a)^\top\theta_h$ for all $(s,a) \in \mathcal{S}_h\times\mathcal{A}$ and $\|\theta_h\|_2 \leq H\sqrt{d}$. Let $\lambda = CH\sqrt{d\log(dH/\delta)N}$ for some $C > 0$. With probability at least $1-\delta$, for some $c > 0$,
$$\big(Q_1^\pi(s_1,\pi(s_1)) - \hat Q_1(s_1,\pi(s_1))\big)^2 \leq c\cdot\prod_{h=1}^{H} C_h\cdot dH^5\cdot\sqrt{d\log(dH/\delta)/N}.$$

Remark 6. The factor $\prod_{h=1}^H C_h$ in Theorem 5.2 implies that the estimation error can be amplified geometrically. To better illustrate the issue, we now discuss how the error is amplified when running Algorithm 1 on the instance in Section 4. If we run Algorithm 1 on that hard instance, when $h = H$, the estimation error on $V(s_H^c)$ is roughly $N^{-1/2}$ for each $c \in [\bar d]$. When using the linear predictor at level $H$ to predict the value of $s_H^*$, the error is amplified by $\bar d^{1/2}$. When $h = H-1$, the dataset contains only $s_{H-1}^c$ for $c \in [\bar d]$, and the estimation error on the value of $s_{H-1}^c$ is the same as that of $s_H^*$, which is roughly $(\bar d/N)^{1/2}$. Again, the estimation error on the value of $s_{H-1}^*$ becomes $(\bar d^2/N)^{1/2}$ when using the linear predictor at level $H-1$. The error is eventually amplified by a factor of $\bar d^{H/2}$, which corresponds to the factor $\prod_{h=1}^H C_h$ in Theorem 5.2.

Policy Completeness. In offline RL, another representation condition is closedness under the Bellman update (Szepesvári & Munos, 2005; Duan & Wang, 2020), which is stronger than realizability. In the context of offline policy evaluation, we have the following policy completeness assumption.

Assumption 4. For the given policy $\pi$, for any $h > 1$ and $\theta \in \mathbb{R}^d$, there exists $\theta' \in \mathbb{R}^d$ such that for any $(s,a) \in \mathcal{S}_{h-1}\times\mathcal{A}$,
$$\mathbb{E}[R(s,a)] + \sum_{s'\in\mathcal{S}_h} P(s'\mid s,a)\,\phi(s',\pi(s'))^\top\theta = \phi(s,a)^\top\theta'.$$

Under Assumption 4 and the assumption that $\sigma_{\min}(\Lambda_h) \geq \lambda_0$ for all $h \in [H]$ for some $\lambda_0 > 0$, Duan & Wang (2020) have shown that for Algorithm 1, by taking $N = \mathrm{poly}(H, d, 1/\varepsilon, 1/\lambda_0)$, we have $\big(Q_1^\pi(s_1,\pi(s_1)) - \hat Q_1(s_1,\pi(s_1))\big)^2 \leq \varepsilon$.
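The geometric amplification described in Remark 6 can be traced with a few lines of arithmetic. The sketch below is our own illustration, not an experiment from the paper: it propagates the per-level estimation error backward, multiplying by $\sqrt{\bar d}$ at each level.

```python
import math

def amplified_error(d_bar, H, N):
    """Backward error propagation on the hard instance, as described in
    Remark 6: the level-H estimation error is ~ N^{-1/2}, and moving one
    level backward multiplies the error by sqrt(d_bar)."""
    err = N ** -0.5                  # estimation error at level H
    for _ in range(H - 1):
        err *= math.sqrt(d_bar)      # one level of amplification
    return err

# With d_bar = 4, H = 5 and N = 10^4 samples, the propagated error is
# d_bar^{(H-1)/2} / sqrt(N) = 16 / 100 = 0.16, and for any fixed N it
# grows exponentially in H.
```

The loop makes explicit why no polynomial sample size $N$ can compensate for the exponential-in-$H$ factor.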
We refer interested readers to (Duan & Wang, 2020) for the details. We remark that the above analysis again implies that geometric error amplification is a real issue in offline RL, and that sample-efficient offline RL is impossible unless the distribution shift is sufficiently low, i.e., $\prod_{h=1}^H C_h$ is bounded, or a strong representation condition (e.g., policy completeness) holds.
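The backward least-squares structure of Algorithm 1 discussed in this section (ridge regression of $r + \hat V_{h+1}$ onto the features at each level, from $h = H$ down to $h = 1$) can be sketched in pure Python as follows. This is our own minimal reading of the algorithm with hypothetical data containers, not the authors' implementation.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def lspe(phi, phibar, rew, lam):
    """Backward least-squares policy evaluation.

    phi[h][i]    : feature vector phi(s_h^i, a_h^i)
    phibar[h][i] : feature vector phi(sbar_h^i, pi(sbar_h^i)) of the sampled
                   next state (all zeros at the last level)
    rew[h][i]    : observed reward r_h^i
    Returns the estimated weight vectors theta_hat_1, ..., theta_hat_H.
    """
    H, d = len(phi), len(phi[0][0])
    theta, thetas = [0.0] * d, [None] * H        # theta_hat_{H+1} = 0
    for h in reversed(range(H)):
        # Lambda_hat = Phi^T Phi + lam*I ; rhs = Phi^T (r + Phibar theta_next)
        A = [[lam * (i == j) + sum(f[i] * f[j] for f in phi[h])
              for j in range(d)] for i in range(d)]
        target = [rew[h][i] + sum(p * t for p, t in zip(phibar[h][i], theta))
                  for i in range(len(phi[h]))]
        b = [sum(phi[h][i][k] * target[i] for i in range(len(phi[h])))
             for k in range(d)]
        theta = solve(A, b)
        thetas[h] = theta
    return thetas
```

When every $Q^\pi_h$ is exactly linear and $\lambda$ is negligible, this pass recovers the true weights; the point of Section 4 and Theorem 5.2 is that noise in the data, not this algebra, is what gets amplified across levels.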

6. CONCLUSION

While the extant body of provable results in the literature largely focuses on sufficient conditions for sample-efficient offline RL, this work focuses on obtaining a better understanding of the necessary conditions, where we seek to understand to what extent mild assumptions can imply sample-efficient offline RL. This work shows that for off-policy evaluation, even if we are given a representation that can perfectly represent the value function of the given policy and the data distribution has good coverage over the features, any algorithm still requires an exponential number of samples to non-trivially approximate the value of the given policy. These results highlight that provable sample-efficient offline RL is simply not possible unless either the distribution shift is sufficiently mild or we have stronger representation conditions that go well beyond realizability.

A PROOF OF LEMMA 4.2

Proof. We first verify that $Q^\pi$ is linear for the first $H-1$ levels. For each $(h,c) \in [H-1]\times[\bar d]$, we have
$$Q_h^\pi(s_h^c, a_1) = R(s_h^c, a_1) + R(s_{h+1}^{\bar d+1}, a_1) + R(s_{h+2}^{\bar d+1}, a_1) + \cdots + R(s_H^{\bar d+1}, a_1) = r_0\cdot\bar d^{(H-h)/2}.$$
Moreover, for all $a \in \mathcal{A}$,
$$Q_h^\pi(s_h^{\bar d+1}, a) = R(s_h^{\bar d+1}, a) + R(s_{h+1}^{\bar d+1}, a_1) + R(s_{h+2}^{\bar d+1}, a_1) + \cdots + R(s_H^{\bar d+1}, a_1) = r_0\cdot\bar d^{(H-h+1)/2}.$$
Therefore, if we define
$$\theta_h^\pi = \sum_{c=1}^{\bar d} r_0\cdot\bar d^{(H-h)/2}\cdot e_c + \sum_{c=1}^{\bar d} Q_h^\pi(s_h^c, a_2)\cdot e_{c+\bar d},$$
then $Q_h^\pi(s,a) = (\theta_h^\pi)^\top\phi(s,a)$ for all $(s,a) \in \mathcal{S}_h\times\mathcal{A}$.

Now we verify that the $Q$-function is linear for the last level. Clearly, for all $c \in [\bar d]$ and $a \in \mathcal{A}$, $Q_H^\pi(s_H^c, a) = r_0$ and $Q_H^\pi(s_H^{\bar d+1}, a) = r_0\cdot\bar d^{1/2}$. Thus, by defining $\theta_H^\pi = \sum_{c=1}^{d} r_0\cdot e_c$, we have $Q_H^\pi(s,a) = (\theta_H^\pi)^\top\phi(s,a)$ for all $(s,a) \in \mathcal{S}_H\times\mathcal{A}$.
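The algebra in this proof can be spot-checked mechanically. The sketch below is ours: it assumes the standard basis for $e_1, \ldots, e_d$ and the Section 4 feature mapping, and leaves the unspecified values $Q_h^\pi(s_h^c, a_2)$ as zeros. It confirms that $(\theta_h^\pi)^\top\phi(s_h^{\bar d+1}, a)$ reproduces $r_0\cdot\bar d^{(H-h+1)/2}$.

```python
def theta_pi(h, H, d, r0, q_a2):
    """theta^pi_h from the proof of Lemma 4.2 (1-indexed h; d_bar = d // 2).
    q_a2[c] plays the role of the unspecified value Q^pi_h(s^c_h, a_2)."""
    d_bar = d // 2
    th = [0.0] * d
    for c in range(d_bar):
        th[c] = r0 * d_bar ** ((H - h) / 2)   # coefficient of e_c
        th[c + d_bar] = q_a2[c]               # coefficient of e_{c + d_bar}
    return th

def q_mix(h, H, d, r0):
    """(theta^pi_h)^T phi(s^{d_bar+1}_h, a), with phi = sum_c e_c / sqrt(d_bar)."""
    d_bar = d // 2
    th = theta_pi(h, H, d, r0, [0.0] * d_bar)
    return sum(th[:d_bar]) / d_bar ** 0.5
```

For every $h$, `q_mix` equals $r_0\cdot\bar d^{(H-h+1)/2}$, matching the second display in the proof.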

B A TECHNICAL LEMMA

We need the following lemma in the proof of our hardness results.

Lemma B.1. Let $\alpha$ be a random variable uniformly distributed on $\{\alpha_+, \alpha_-\}$, where $\alpha_- = 1/2$ and $\alpha_+ = 1/2 + \varepsilon$ with $0 < \varepsilon < 1$. Suppose that $\xi_1, \xi_2, \ldots, \xi_m$ are i.i.d. $\{+1,-1\}$-valued random variables with $\Pr[\xi_i = +1] = \alpha$ for all $i \in [m]$. Let $f$ be a function from $\{+1,-1\}^m$ to $\{\alpha_+, \alpha_-\}$. Suppose $m \leq (C/\varepsilon^2)\log(1/\delta)$ for some fixed constant $C$. Then $\Pr[f(\xi_1, \xi_2, \ldots, \xi_m) \neq \alpha] > \delta$.

To the best of our knowledge, Lemma B.1 was first proved in (Chernoff, 1972) and has numerous applications in statistical learning theory (see, e.g., Chapter 5 in (Anthony & Bartlett, 2009)) and bandits (Mannor & Tsitsiklis, 2004). To prove Lemma B.1, one can first show that the maximum likelihood estimator (MLE) is optimal, and then show via anti-concentration that the MLE requires $\Omega((1/\varepsilon^2)\log(1/\delta))$ samples to correctly output $\alpha$ with probability $1-\delta$. Lemma B.1 can also be proved using information theory; see, e.g., (Kaufmann et al., 2016) for such a proof.
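Lemma B.1 can be checked numerically for the threshold (maximum-likelihood) rule. The sketch below is our own illustration, using exact binomial arithmetic rather than sampling: with $m \approx 1/\varepsilon^2$ observations the rule's error probability stays bounded away from $0$, while $m \gg 1/\varepsilon^2$ drives it down, in line with the $\Omega((1/\varepsilon^2)\log(1/\delta))$ requirement.

```python
from math import comb, ceil

def binom_tail(m, p, k):
    """P[Bin(m, p) >= k], computed exactly from the binomial pmf."""
    return sum(comb(m, i) * p ** i * (1 - p) ** (m - i) for i in range(k, m + 1))

def mle_error(m, eps):
    """Average error probability of the rule that outputs alpha_+ iff the
    number of +1's is at least m * (1/2 + eps/2), with the truth alpha
    uniform on {1/2, 1/2 + eps}."""
    k = ceil(m * (0.5 + eps / 2))
    err_minus = binom_tail(m, 0.5, k)            # truth alpha_-, rule says alpha_+
    err_plus = 1 - binom_tail(m, 0.5 + eps, k)   # truth alpha_+, rule says alpha_-
    return (err_minus + err_plus) / 2
```

For example, with `eps = 0.1`, `m = 100 = 1/eps**2` leaves an error probability of roughly $0.15$, while `m = 900` pushes it below $0.01$.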

C ANOTHER HARD INSTANCE

In this section, we present another hard instance under a weaker version of Assumption 1. Here the transition operator is stochastic, and the reward is deterministic and sparse, meaning that the reward value is non-zero only at the last level. Moreover, the data distributions $\{\mu_h\}_{h=1}^H$ are induced by a fixed policy $\pi_{\mathrm{data}}$. We also illustrate the construction in Figure 2. Throughout this section, we use $d$ to denote the feature dimension, and we assume $d$ is an even integer for simplicity. We use $\bar d$ to denote $d/2 - 1$.

In this section, we adopt the following realizability assumption, which is a weaker version of Assumption 1.

Assumption 5 (Realizability). For the policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ to be evaluated, there exist $\theta_1, \ldots, \theta_H \in \mathbb{R}^d$ such that for all $(s,a) \in \mathcal{S}\times\mathcal{A}$ and $h \in [H]$, $Q_h^\pi(s,a) = \theta_h^\top\phi(s,a)$.

Let $0 \leq r_0 \leq \bar d^{-(H-2)/2}$ be a parameter to be determined. We first define the transition operator for the first level. We have
$$P(s_1, a) = \begin{cases} s_2^c & a = a_c,\ c \in [\bar d] \\ s_2^+ & a = a_{\bar d+1} \\ s_2^- & a = a_{\bar d+2} \end{cases}$$
Clearly, for the feature mapping defined below, $\|\phi(s,a)\|_2 \leq 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$.

Verifying Assumption 5. Now we consider the deterministic policy $\pi : \mathcal{S}\to\mathcal{A}$, which is defined to be $\pi(s) = a_d$ for all $s \in \mathcal{S}$.
We show that Assumption 5 holds. When $h = 1$, define
$$\theta_1 = \sum_{c=1}^{\bar d} r_0\cdot\bar d^{(H-3)/2}\cdot e_c + e_{\bar d+1} - e_{\bar d+2} + \sum_{c=1}^{\bar d} r_0\cdot\bar d^{(H-2)/2}\cdot e_{\bar d+2+c}.$$
For each $h \in \{2, 3, \ldots, H-2\}$, define
$$\theta_h = \sum_{c=1}^{\bar d} r_0\cdot\bar d^{(H-h-1)/2}\cdot e_c + e_{\bar d+1} - e_{\bar d+2} + \sum_{c=1}^{\bar d} r_0\cdot\bar d^{(H-h-2)/2}\cdot e_{\bar d+2+c}.$$
For the second to last level $h = H-1$, define $\theta_{H-1} = \sum_{c=1}^{\bar d} r_0\cdot e_c + e_{\bar d+1} - e_{\bar d+2}$. Finally, for the last level $h = H$, define $\theta_H = e_{\bar d+1} - e_{\bar d+2}$.

It can be verified that for each $h \in [H]$, $Q_h^\pi(s,a) = \theta_h^\top\phi(s,a)$ for all $(s,a) \in \mathcal{S}_h\times\mathcal{A}$.

The Data Distributions. For the first level, the data distribution $\mu_1$ is defined to be the uniform distribution over $\{(s_1, a_c) \mid c \in [d]\}$. For each $h \geq 2$, the data distribution $\mu_h$ is the uniform distribution over
$$\{(s_h^1, a_1), (s_h^2, a_1), \ldots, (s_h^{\bar d}, a_1), (s_h^+, a_1), (s_h^-, a_1), (s_h^{\bar d+1}, a_1), (s_h^{\bar d+1}, a_2), \ldots, (s_h^{\bar d+1}, a_{\bar d})\}.$$
Notice that again $(s_h^{\bar d+1}, a)$ is not in the support of $\mu_h$ for all actions $a \in \{a_{\bar d+1}, a_{\bar d+2}, \ldots, a_d\}$. It can be seen that for all $h \in [H]$,
$$\mathbb{E}_{(s,a)\sim\mu_h}[\phi(s,a)\phi(s,a)^\top] = \frac{1}{d}\sum_{c=1}^{d} e_c e_c^\top = \frac{1}{d}I.$$
Moreover, by defining
$$\pi_{\mathrm{data}}(s) = \begin{cases} \mathrm{Uniform}(\mathcal{A}) & s = s_1 \\ a_1 & s \in \{s_h^c \mid h \in \{2, 3, \ldots, H\},\ c \in [\bar d]\} \\ a_1 & s \in \{s_h^+ \mid h \in \{2, 3, \ldots, H\}\} \\ a_1 & s \in \{s_h^- \mid h \in \{2, 3, \ldots, H\}\} \\ \mathrm{Uniform}(\{a_1, a_2, \ldots, a_{\bar d}\}) & s \in \{s_h^{\bar d+1} \mid h \in \{2, 3, \ldots, H\}\} \end{cases}$$
we have $\mu_h = \mu_h^{\pi_{\mathrm{data}}}$ for all $h \in [H]$.

The Lower Bound. Now we show that it is information-theoretically hard for any algorithm to distinguish the case $r_0 = 0$ from the case $r_0 = \bar d^{-(H-2)/2}$ in the offline setting by taking samples from the data distributions $\{\mu_h\}_{h=1}^H$. Here we consider the policy $\pi$ defined above, which returns action $a_d$ for all input states. Notice that when $r_0 = 0$, the value of the policy is zero. On the other hand, when $r_0 = \bar d^{-(H-2)/2}$, the value of the policy is $r_0\cdot\bar d^{(H-2)/2} = 1$. Therefore, if the algorithm approximates the value of the policy up to an approximation error of $1/2$, then it must distinguish the case $r_0 = 0$ from the case $r_0 = \bar d^{-(H-2)/2}$. We first notice that in these two cases, the data distributions $\{\mu_h\}_{h=1}^H$, the feature mapping $\phi : \mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$, the policy $\pi$ to be evaluated and the reward distributions $R$ are the same.
Thus, in order to distinguish the case $r_0 = 0$ from the case $r_0 = \bar d^{-(H-2)/2}$, the only way is to query the transition operator $P$ using samples drawn from the data distributions. Now, for all state-action pairs $(s,a)$ in the support of the data distributions of the first $H-2$ levels (namely $\mu_1, \mu_2, \ldots, \mu_{H-2}$), the transition operator is identical in the two cases. This is because changing $r_0$ only changes the transition distributions of $(s_h^{\bar d+1}, a_{\bar d+1}), (s_h^{\bar d+1}, a_{\bar d+2}), \ldots, (s_h^{\bar d+1}, a_d)$, and such state-action pairs are not in the support of $\mu_h$ for any $h \in [H-2]$. Moreover, for any $(s,a) \in \{s_{H-1}^+, s_{H-1}^-, s_{H-1}^{\bar d+1}\}\times\mathcal{A}$ in the support of $\mu_{H-1}$, $P(s,a)$ is also identical whether $r_0 = 0$ or $r_0 = \bar d^{-(H-2)/2}$. For those state-action pairs $(s,a)$ in the support of $\mu_{H-1}$ with $s \notin \{s_{H-1}^+, s_{H-1}^-, s_{H-1}^{\bar d+1}\}$, we have
$$P(s,a) = \begin{cases} s_H^+ & \text{with probability } (1+r_0)/2 \\ s_H^- & \text{with probability } (1-r_0)/2 \end{cases}$$
Again, this is because $(s_{H-1}^{\bar d+1}, a)$ is not in the support of $\mu_{H-1}$ for any $a \in \{a_{\bar d+1}, a_{\bar d+2}, \ldots, a_d\}$. Therefore, in order to distinguish the case $r_0 = 0$ from the case $r_0 = \bar d^{-(H-2)/2}$, the agent needs to distinguish the two transition distributions
$$p_1 = \begin{cases} s_H^+ & \text{with probability } 1/2 \\ s_H^- & \text{with probability } 1/2 \end{cases} \qquad\text{and}\qquad p_2 = \begin{cases} s_H^+ & \text{with probability } (1+\bar d^{-(H-2)/2})/2 \\ s_H^- & \text{with probability } (1-\bar d^{-(H-2)/2})/2 \end{cases}$$
Again, by Lemma B.1, in order to distinguish $p_1$ and $p_2$ with probability at least $0.9$, one needs $\Omega(\bar d^{H-2})$ samples. Formally, we have the following theorem.

Theorem C.1. Suppose Assumption 2 holds, and rewards are deterministic and can be non-zero only for state-action pairs at the last level. Fix an algorithm that takes as input both a policy and a feature mapping. There exists an MDP satisfying Assumption 5 such that, for a fixed policy $\pi : \mathcal{S}\to\mathcal{A}$, the algorithm requires $\Omega((d/2-1)^{H-2})$ samples to output the value of $\pi$ up to constant additive approximation error with probability at least $0.9$.
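Two quantitative claims from this section can be checked mechanically: that the data covariance at every level $h \geq 2$ equals $\frac{1}{d}I$, and that the gap between $p_1$ and $p_2$ forces an exponential-in-$H$ sample count. The sketch below is our own (standard basis chosen for $e_1, \ldots, e_d$; the constant from Lemma B.1 is suppressed).

```python
def basis(i, d):
    return [1.0 if j == i else 0.0 for j in range(d)]

def mu_h_covariance(d):
    """E_{mu_h}[phi phi^T] for a level h >= 2: the support of mu_h consists
    of d state-action pairs whose features are the d distinct standard
    basis vectors, each with probability 1/d."""
    d_bar = d // 2 - 1
    feats = ([basis(c, d) for c in range(d_bar)]                 # (s^c_h, a_1) -> e_c
             + [basis(d_bar, d), basis(d_bar + 1, d)]            # (s^+_h, a_1), (s^-_h, a_1)
             + [basis(d_bar + 2 + c, d) for c in range(d_bar)])  # (s^{dbar+1}_h, a_c)
    return [[sum(f[i] * f[j] for f in feats) / d for j in range(d)]
            for i in range(d)]

def distinguishing_samples(d, H):
    """Order of samples needed to tell p_1 from p_2 (Lemma B.1 with the
    bias gap eps = d_bar^{-(H-2)/2}; constants suppressed)."""
    d_bar = d // 2 - 1
    eps = d_bar ** (-(H - 2) / 2)
    return (1 / eps) ** 2        # = d_bar^{H-2}, exponential in H
```

For $d = 10$ (so $\bar d = 4$) and $H = 12$, this is already $4^{10} \approx 10^6$ samples, despite the covariance condition being as benign as possible.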

D ANALYSIS OF ALGORITHM 1

D.1 PROOF OF LEMMA 5.1

Clearly,
$$\hat\theta_h = \hat\Lambda_h^{-1}\sum_{i=1}^{N}\phi(s_h^i, a_h^i)\cdot\big(r_h^i + \hat V_{h+1}(\bar s_h^i)\big) = \hat\Lambda_h^{-1}\sum_{i=1}^{N}\phi(s_h^i, a_h^i)\cdot\big(r_h^i + \phi(\bar s_h^i, \pi(\bar s_h^i))^\top\hat\theta_{h+1}\big)$$
$$= \hat\Lambda_h^{-1}\left(\sum_{i=1}^{N}\phi(s_h^i, a_h^i)\cdot\big(r_h^i + \phi(\bar s_h^i, \pi(\bar s_h^i))^\top\theta_{h+1}\big) + \sum_{i=1}^{N}\phi(s_h^i, a_h^i)\cdot\phi(\bar s_h^i, \pi(\bar s_h^i))^\top\big(\hat\theta_{h+1} - \theta_{h+1}\big)\right).$$
For the first term, we have
$$\hat\Lambda_h^{-1}\sum_{i=1}^{N}\phi(s_h^i, a_h^i)\cdot\big(r_h^i + \phi(\bar s_h^i, \pi(\bar s_h^i))^\top\theta_{h+1}\big) = \hat\Lambda_h^{-1}\sum_{i=1}^{N}\phi(s_h^i, a_h^i)\cdot\big(r_h^i + V^\pi(\bar s_h^i)\big)$$
$$= \hat\Lambda_h^{-1}\sum_{i=1}^{N}\phi(s_h^i, a_h^i)\cdot\big(Q^\pi(s_h^i, a_h^i) + \xi_h^i\big) = \hat\Lambda_h^{-1}\Phi_h^\top\xi_h + \hat\Lambda_h^{-1}\big(\Phi_h^\top\Phi_h\big)\theta_h = \hat\Lambda_h^{-1}\Phi_h^\top\xi_h + \theta_h - \lambda\hat\Lambda_h^{-1}\theta_h,$$
where the last step uses $\Phi_h^\top\Phi_h = \hat\Lambda_h - \lambda I$. Therefore,
$$\hat\theta_1 - \theta_1 = \big(\hat\Lambda_1^{-1}\Phi_1^\top\xi_1 - \lambda\hat\Lambda_1^{-1}\theta_1\big) + \hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\big(\hat\theta_2 - \theta_2\big)$$
$$= \big(\hat\Lambda_1^{-1}\Phi_1^\top\xi_1 - \lambda\hat\Lambda_1^{-1}\theta_1\big) + \hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\big(\hat\Lambda_2^{-1}\Phi_2^\top\xi_2 - \lambda\hat\Lambda_2^{-1}\theta_2\big) + \hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\,\hat\Lambda_2^{-1}\Phi_2^\top\bar\Phi_3\big(\hat\theta_3 - \theta_3\big)$$
$$= \cdots = \sum_{h=1}^{H}\hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\,\hat\Lambda_2^{-1}\Phi_2^\top\bar\Phi_3\cdots\big(\hat\Lambda_h^{-1}\Phi_h^\top\xi_h - \lambda\hat\Lambda_h^{-1}\theta_h\big).$$
Also note that $\big(Q^\pi(s_1, \pi(s_1)) - \hat Q(s_1, \pi(s_1))\big)^2 = \|\theta_1 - \hat\theta_1\|^2_{\bar\Lambda_1}$.

D.2 PROOF OF THEOREM 5.2

By a matrix concentration inequality (Tropp, 2015), we have Lemma D.1, stated below. Note that
$$\big(Q^\pi(s_1, \pi(s_1)) - \hat Q(s_1, \pi(s_1))\big)^2 \leq H\cdot\sum_{h=1}^{H}\Big\|\hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\,\hat\Lambda_2^{-1}\Phi_2^\top\bar\Phi_3\cdots\big(\hat\Lambda_h^{-1}\Phi_h^\top\xi_h - \lambda\hat\Lambda_h^{-1}\theta_h\big)\Big\|^2_{\bar\Lambda_1}$$
$$\leq 2H\cdot\left(\sum_{h=1}^{H}\Big\|\hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\,\hat\Lambda_2^{-1}\Phi_2^\top\bar\Phi_3\cdots\hat\Lambda_h^{-1}\Phi_h^\top\xi_h\Big\|^2_{\bar\Lambda_1} + \sum_{h=1}^{H}\Big\|\hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\,\hat\Lambda_2^{-1}\Phi_2^\top\bar\Phi_3\cdots\lambda\hat\Lambda_h^{-1}\theta_h\Big\|^2_{\bar\Lambda_1}\right).$$
For each $h \in [H]$,
$$\Big\|\hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\,\hat\Lambda_2^{-1}\Phi_2^\top\bar\Phi_3\cdots\hat\Lambda_h^{-1}\Phi_h^\top\xi_h\Big\|^2_{\bar\Lambda_1} \leq \big\|\hat\Lambda_1^{-1/2}\bar\Lambda_1\hat\Lambda_1^{-1/2}\big\|_2\cdot\prod_{h'=1}^{h-1}\Big(\big\|\Phi_{h'}\hat\Lambda_{h'}^{-1}\Phi_{h'}^\top\big\|_2\cdot\big\|\hat\Lambda_{h'+1}^{-1/2}\big(\bar\Phi_{h'+1}^\top\bar\Phi_{h'+1}\big)\hat\Lambda_{h'+1}^{-1/2}\big\|_2\Big)\cdot\|\xi_h\|^2_{\Phi_h\hat\Lambda_h^{-1}\Phi_h^\top}.$$
Similarly,
$$\Big\|\hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\,\hat\Lambda_2^{-1}\Phi_2^\top\bar\Phi_3\cdots\lambda\hat\Lambda_h^{-1}\theta_h\Big\|^2_{\bar\Lambda_1} \leq \big\|\hat\Lambda_1^{-1/2}\bar\Lambda_1\hat\Lambda_1^{-1/2}\big\|_2\cdot\prod_{h'=1}^{h-1}\Big(\big\|\Phi_{h'}\hat\Lambda_{h'}^{-1}\Phi_{h'}^\top\big\|_2\cdot\big\|\hat\Lambda_{h'+1}^{-1/2}\big(\bar\Phi_{h'+1}^\top\bar\Phi_{h'+1}\big)\hat\Lambda_{h'+1}^{-1/2}\big\|_2\Big)\cdot\lambda^2\cdot\|\theta_h\|^2_{\hat\Lambda_h^{-1}}$$
$$\leq \big\|\hat\Lambda_1^{-1/2}\bar\Lambda_1\hat\Lambda_1^{-1/2}\big\|_2\cdot\prod_{h'=1}^{h-1}\Big(\big\|\Phi_{h'}\hat\Lambda_{h'}^{-1}\Phi_{h'}^\top\big\|_2\cdot\big\|\hat\Lambda_{h'+1}^{-1/2}\big(\bar\Phi_{h'+1}^\top\bar\Phi_{h'+1}\big)\hat\Lambda_{h'+1}^{-1/2}\big\|_2\Big)\cdot\lambda\cdot H^2 d.$$
For all $h \in [H]$, we have $\|\Phi_h\hat\Lambda_h^{-1}\Phi_h^\top\|_2 \leq 1$ and
$$\big\|\hat\Lambda_h^{-1/2}\big(\bar\Phi_h^\top\bar\Phi_h\big)\hat\Lambda_h^{-1/2}\big\|_2 \leq N\big\|\hat\Lambda_h^{-1/2}\bar\Lambda_h\hat\Lambda_h^{-1/2}\big\|_2 + \big\|\hat\Lambda_h^{-1/2}\big(\bar\Phi_h^\top\bar\Phi_h - N\bar\Lambda_h\big)\hat\Lambda_h^{-1/2}\big\|_2.$$
Conditioned on the event in Lemma D.1, we have $\hat\Lambda_h \succeq N\Lambda_h \succeq \frac{N}{C_h}\bar\Lambda_h$, which implies $N\|\hat\Lambda_h^{-1/2}\bar\Lambda_h\hat\Lambda_h^{-1/2}\|_2 \leq C_h$. Moreover, conditioned on the event in Lemma D.1, by (Hsu et al., 2012a), with probability $1-\delta/(4H)$, for some constant $C'$, we have $\|\xi_h\|^2_{\Phi_h\hat\Lambda_h^{-1}\Phi_h^\top} \leq C' H^2 d\log(H/\delta)$. Therefore,
$$\Big\|\hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\cdots\hat\Lambda_h^{-1}\Phi_h^\top\xi_h\Big\|^2_{\bar\Lambda_1} + \Big\|\hat\Lambda_1^{-1}\Phi_1^\top\bar\Phi_2\cdots\lambda\hat\Lambda_h^{-1}\theta_h\Big\|^2_{\bar\Lambda_1}$$
$$\leq \frac{C_1}{N}\Big(C_2 + C\sqrt{d\log(d/\delta)N}/\lambda\Big)\times\cdots\times\Big(C_h + C\sqrt{d\log(d/\delta)N}/\lambda\Big)\times\Big(C' H^2 d\log(H/\delta) + \lambda H^2 d\Big)$$
$$\leq \frac{C_1}{N}\big(C_2 + 1/H\big)\times\cdots\times\big(C_h + 1/H\big)\times\Big(C' H^2 d\log(H/\delta) + \lambda H^2 d\Big)$$
$$\leq \frac{e}{N}\,C_1\times C_2\times\cdots\times C_h\times\Big(C' H^2 d\log(H/\delta) + C d H^3\sqrt{d\log(dH/\delta)N}\Big).$$
Let $c > 0$ be a large enough constant. We now have
$$\big(Q_1^\pi(s_1, \pi(s_1)) - \hat Q_1(s_1, \pi(s_1))\big)^2 \leq c\cdot\prod_{h=1}^{H} C_h\cdot dH^5\cdot\sqrt{d\log(dH/\delta)/N}.$$
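The telescoping identity established in D.1 holds exactly for any dataset once $\xi_h^i = r_h^i + V^\pi(\bar s_h^i) - Q^\pi(s_h^i, a_h^i)$, so it can be verified numerically on synthetic data. The pure-Python sketch below is our own check with random features and noise: it compares $\hat\theta_1 - \theta_1$ from the backward regression pass against the closed-form sum.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def gram(Phi, lam):
    d = len(Phi[0])
    return [[lam * (i == j) + sum(f[i] * f[j] for f in Phi) for j in range(d)]
            for i in range(d)]

random.seed(0)
H, d, N, lam = 3, 2, 5, 0.3
rand = lambda: random.uniform(-1, 1)

theta = [[rand() for _ in range(d)] for _ in range(H)] + [[0.0] * d]  # theta_{H+1} = 0
Phi = [[[rand() for _ in range(d)] for _ in range(N)] for _ in range(H)]
Phibar = [[[rand() for _ in range(d)] for _ in range(N)] for _ in range(H)]
Phibar[H - 1] = [[0.0] * d for _ in range(N)]        # no level H+1
xi = [[rand() for _ in range(N)] for _ in range(H)]
# Rewards consistent with xi_h^i = r_h^i + V^pi(sbar) - Q^pi(s, a):
rew = [[xi[h][i]
        + sum(p * t for p, t in zip(Phi[h][i], theta[h]))
        - sum(p * t for p, t in zip(Phibar[h][i], theta[h + 1]))
        for i in range(N)] for h in range(H)]

# Backward least-squares pass (Algorithm 1):
that = [0.0] * d
for h in reversed(range(H)):
    target = [rew[h][i] + sum(p * t for p, t in zip(Phibar[h][i], that))
              for i in range(N)]
    b = [sum(Phi[h][i][k] * target[i] for i in range(N)) for k in range(d)]
    that = solve(gram(Phi[h], lam), b)

# Closed-form error from the proof of Lemma 5.1:
prod = [[float(i == j) for j in range(d)] for i in range(d)]   # identity
err = [0.0] * d
for h in range(H):
    A = gram(Phi[h], lam)
    b = [sum(Phi[h][i][k] * xi[h][i] for i in range(N)) - lam * theta[h][k]
         for k in range(d)]
    err = [e + t for e, t in zip(err, matvec(prod, solve(A, b)))]
    # prod <- prod @ (Lambda_hat^{-1} Phi^T Phibar), assembled column by column
    PtP = [[sum(Phi[h][i][r] * Phibar[h][i][c] for i in range(N))
            for c in range(d)] for r in range(d)]
    cols = [solve(A, [PtP[r][c] for r in range(d)]) for c in range(d)]
    M = [[cols[c][r] for c in range(d)] for r in range(d)]
    prod = [[sum(prod[r][k] * M[k][c] for k in range(d)) for c in range(d)]
            for r in range(d)]

gap = max(abs((that[k] - theta[0][k]) - err[k]) for k in range(d))
```

The gap is at the level of floating-point round-off, confirming the algebra of the telescoping expansion.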



Specifically, if the features have a uniformly bounded norm and if the minimum eigenvalue of the feature covariance matrix of our data is bounded away from $0$, say by $1/\mathrm{poly}(d)$, then we have good accuracy on any test distribution. See Assumption 2 and the comments thereafter. Note that $1/d$ is the largest possible minimum eigenvalue, since for any data distribution $\mu_h$, $\sigma_{\min}(\mathbb{E}_{(s,a)\sim\mu_h}[\phi(s,a)\phi(s,a)^\top]) \leq 1/d$ given that $\|\phi(s,a)\|_2 \leq 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$.

For $H = 1$, the ordinary least squares estimator satisfies $\|\hat\theta_{\mathrm{OLS}} - \theta_1\|_2^2 \leq O(d/n)$ with high probability; see, e.g., (Hsu et al., 2012b).

Without loss of generality, we can work in a coordinate system such that $\|\theta_h\|_2 \leq H\sqrt{d}$ and $\|\phi(s,a)\|_2 \leq 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$. This follows from John's theorem (see, e.g., (Ball, 1997; Bubeck et al., 2012)).



The Feature Mapping. Let $e_1, e_2, \ldots, e_d$ be a set of orthonormal vectors in $\mathbb{R}^d$. Here, one possible choice is to set $e_1, e_2, \ldots, e_d$ to be the standard basis vectors. For each $(h,c) \in [H]\times[\bar d]$, we set $\phi(s_h^c, a_1) = e_c$, $\phi(s_h^c, a_2) = e_{c+\bar d}$, and $\phi(s_h^{\bar d+1}, a) = \sum_{c\in[\bar d]} e_c/\bar d^{1/2}$ for all $a \in \mathcal{A}$.
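As a quick sanity check of this mapping (our own sketch, with $e_c$ the standard basis), every feature vector has unit norm, so the normalization $\|\phi(s,a)\|_2 \leq 1$ used throughout holds:

```python
def features_at_level(d):
    """Feature vectors of the Section 4 instance at any fixed level h
    (standard basis for e_1, ..., e_d; d_bar = d // 2)."""
    d_bar = d // 2
    e = [[1.0 if j == i else 0.0 for j in range(d)] for i in range(d)]
    phi = {}
    for c in range(d_bar):
        phi[("s%d" % (c + 1), "a1")] = e[c]            # phi(s^c_h, a_1) = e_c
        phi[("s%d" % (c + 1), "a2")] = e[c + d_bar]    # phi(s^c_h, a_2) = e_{c+d_bar}
    mix = [sum(e[c][j] for c in range(d_bar)) / d_bar ** 0.5 for j in range(d)]
    for a in ("a1", "a2"):                             # same feature for every action
        phi[("s%d" % (d_bar + 1), a)] = mix            # phi(s^{d_bar+1}_h, a)
    return phi

norms = {k: sum(x * x for x in v) ** 0.5 for k, v in features_at_level(8).items()}
```

Every norm equals $1$, and the feature of $s_h^{\bar d+1}$ is the normalized sum of $e_1, \ldots, e_{\bar d}$, which is exactly what drives the $\sqrt{\bar d}$ amplification in Remark 6.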

Figure 1: An illustration of the hard instance. Recall that $\bar d = d/2$. States at the top are those in the first level ($h = 1$), while states at the bottom are those in the last level ($h = H$). Solid lines (with arrows) correspond to transitions associated with action $a_1$, while dotted lines (with arrows) correspond to transitions associated with action $a_2$. For each level $h \in [H]$, reward values and $Q$-values associated with $s_h^1, s_h^2, \ldots, s_h^{\bar d}$ are marked on the left, while reward values and $Q$-values associated with $s_h^{\bar d+1}$ are marked on the right. Rewards and transitions are all deterministic, except for the reward distributions associated with $s_H^1, s_H^2, \ldots, s_H^{\bar d}$. We mark the expectation of the reward value when it is stochastic. For each level $h \in [H]$, for the data distribution $\mu_h$, the state is chosen uniformly at random from the states in the dashed rectangle, i.e., $\{s_h^1, s_h^2, \ldots, s_h^{\bar d}\}$, while the action is chosen uniformly at random from $\{a_1, a_2\}$. Suppose the initial state is $s_1^{\bar d+1}$.

Moreover, $P(s_1, a) = s_2^{\bar d+1}$ for all $a \in \{a_{\bar d+3}, a_{\bar d+4}, \ldots, a_d\}$.

Now we define the transition operator when $h \in \{2, 3, \ldots, H-2\}$. For each $h \in \{2, 3, \ldots, H-2\}$, $a \in \mathcal{A}$ and $c \in [\bar d]$, we have $P(s_h^c, a) = s_{h+1}^{\bar d+1}$, $P(s_h^+, a) = s_{h+1}^+$ and $P(s_h^-, a) = s_{h+1}^-$. For each $h \in \{2, 3, \ldots, H-2\}$ and $c \in [\bar d]$, we have $P(s_h^{\bar d+1}, a_c) = s_{h+1}^c$. For all $a \in \{a_{\bar d+1}, a_{\bar d+2}, \ldots, a_d\}$, we have
$$P(s_h^{\bar d+1}, a) = \begin{cases} s_{h+1}^+ & \text{with probability } (1 + r_0\cdot\bar d^{(H-h)/2})/2 \\ s_{h+1}^- & \text{with probability } (1 - r_0\cdot\bar d^{(H-h)/2})/2 \end{cases}$$
Now we define the transition operator for the second to last level. For all $c \in [\bar d]$ and $a \in \mathcal{A}$, we have
$$P(s_{H-1}^c, a) = \begin{cases} s_H^+ & \text{with probability } (1+r_0)/2 \\ s_H^- & \text{with probability } (1-r_0)/2 \end{cases}$$
For all $a \in \mathcal{A}$, we have $P(s_{H-1}^+, a) = s_H^+$ and $P(s_{H-1}^-, a) = s_H^-$. For each $c \in [\bar d]$, we have $P(s_{H-1}^{\bar d+1}, a_c) = s_H^c$. For all $a \in \{a_{\bar d+1}, a_{\bar d+2}, \ldots, a_d\}$, we have
$$P(s_{H-1}^{\bar d+1}, a) = \begin{cases} s_H^+ & \text{with probability } (1 + r_0\cdot\bar d^{1/2})/2 \\ s_H^- & \text{with probability } (1 - r_0\cdot\bar d^{1/2})/2 \end{cases}$$
The Reward Distributions. In this hard case, all reward values are deterministic, and reward values can be non-zero only at the last level. Formally, we have
$$R(s,a) = \begin{cases} 1 & s = s_H^+ \\ -1 & s = s_H^- \\ 0 & \text{otherwise} \end{cases}$$
The Feature Mapping. As in the hard instance in Section 4, let $e_1, e_2, \ldots, e_d$ be a set of orthonormal vectors in $\mathbb{R}^d$. For the initial state, for each $c \in [d]$, we have $\phi(s_1, a_c) = e_c$. Now we define the feature mapping when $h \in \{2, 3, \ldots, H\}$. For each $h \in \{2, 3, \ldots, H\}$, $a \in \mathcal{A}$ and $c \in [\bar d]$, we have $\phi(s_h^c, a) = e_c$, $\phi(s_h^+, a) = e_{\bar d+1}$ and $\phi(s_h^-, a) = e_{\bar d+2}$. Moreover,
$$\phi(s_h^{\bar d+1}, a) = \begin{cases} e_{\bar d+2+c} & a = a_c,\ c \in [\bar d] \\ \frac{1}{\bar d^{1/2}}\big(e_1 + e_2 + \cdots + e_{\bar d}\big) & a \in \{a_{\bar d+1}, a_{\bar d+2}, \ldots, a_d\} \end{cases}$$

Lemma D.1. For each $h \in [H]$, with probability $1 - \delta/(4H)$, for some universal constant $C$, we have
$$\Big\|\frac{1}{N}\Phi_h^\top\Phi_h - \Lambda_h\Big\|_2 \leq C\sqrt{d\log(dH/\delta)/N} \quad\text{and}\quad \Big\|\frac{1}{N}\bar\Phi_{h+1}^\top\bar\Phi_{h+1} - \bar\Lambda_{h+1}\Big\|_2 \leq C\sqrt{d\log(dH/\delta)/N}.$$
Therefore, since $\lambda = CH\sqrt{d\log(dH/\delta)N}$, with probability $1 - \delta/(4H)$, we have $\hat\Lambda_h = \Phi_h^\top\Phi_h + \lambda I \succeq N\Lambda_h$.

Figure 2: An illustration of the hard instance. Recall that $\bar d = d/2 - 1$. States at the top are those in the first level ($h = 1$), while states at the bottom are those in the last level ($h = H$). Dotted lines (with arrows) correspond to transitions associated with actions $a_1, a_2, \ldots, a_{\bar d}$, while solid lines (with arrows) correspond to transitions associated with actions $a_{\bar d+1}, a_{\bar d+2}, \ldots, a_d$. We omit the transitions associated with $a_1, a_2, \ldots, a_{\bar d}$ in the figure if all actions give the same transition. For each level $h \in [H]$, the $Q$-values associated with the states are marked in the figure. Consider the fixed policy that returns $a_d$ for all input states. When $r_0 = 0$, the value of the policy is $0$. When $r_0 = \bar d^{-(H-2)/2}$, the value of the policy is $r_0\cdot\bar d^{(H-2)/2} = 1$.

State Space, Action Space and Transition Operator. In this hard case, the action space $\mathcal{A} = \{a_1, a_2, \ldots, a_d\}$ contains $d$ elements. $\mathcal{S}_1$ contains a single state $s_1$. For each $h \geq 2$, $\mathcal{S}_h$ contains the $\bar d + 3$ states $s_h^1, s_h^2, \ldots, s_h^{\bar d}, s_h^+, s_h^-, s_h^{\bar d+1}$.


ACKNOWLEDGMENTS

The authors would like to thank Akshay Krishnamurthy, Alekh Agarwal, Wen Sun, and Nan Jiang for numerous helpful discussions. Sham M. Kakade gratefully acknowledges funding from the ONR award N00014-18-1-2247 and NSF Awards CCF-1703574 and CCF-1740551. Ruosong Wang was supported in part by NSF IIS1763562, US Army W911NF1920104, and ONR Grant N000141812861. Research performed while Ruosong Wang was an intern at Microsoft Research.

