THE ROLE OF COVERAGE IN ONLINE REINFORCEMENT LEARNING

Abstract

Coverage conditions-which assert that the data logging distribution adequately covers the state space-play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing-somewhat surprisingly-that the mere existence of a data distribution with good coverage can enable sample-efficient online RL. Concretely, we show that coverability-that is, existence of a data distribution that satisfies a ubiquitous coverage condition called concentrability-can be viewed as a structural property of the underlying MDP, and can be exploited by standard algorithms for sample-efficient exploration, even when the agent does not know said distribution. We complement this result by proving that several weaker notions of coverage, despite being sufficient for offline RL, are insufficient for online RL. We also show that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, fail to optimally capture coverability, and propose a new complexity measure, the sequential extrapolation coefficient, to provide a unification.

1. INTRODUCTION

The last decade has seen the development of reinforcement learning algorithms with strong empirical performance in domains including robotics (Kober et al., 2013; Lillicrap et al., 2015), dialogue systems (Li et al., 2016), and personalization (Agarwal et al., 2016; Tewari and Murphy, 2017). While there is great interest in applying these techniques to real-world decision making applications, the number of samples (steps of interaction) required to do so is often prohibitive, with state-of-the-art algorithms requiring millions of samples to reach human-level performance in challenging domains. Developing algorithms with improved sample efficiency, which entails efficiently generalizing across high-dimensional states and actions while taking advantage of problem structure as modeled by practitioners, remains a major challenge. Investigation into the design and analysis of algorithms for sample-efficient reinforcement learning has largely focused on two distinct problem formulations:
• Online reinforcement learning, where the learner can repeatedly interact with the environment by executing a policy and observing the resulting trajectory.
• Offline reinforcement learning, where the learner has access to logged transitions and rewards gathered from a fixed behavioral policy (e.g., historical data or expert demonstrations), but cannot directly interact with the underlying environment.
While these formulations share a common goal (learning a near-optimal policy), the algorithms used to achieve this goal and the conditions under which it can be achieved are seemingly quite different.
Focusing on value function approximation, sample-efficient algorithms for online reinforcement learning require both (a) representation conditions, which assert that the function approximator is flexible enough to represent value functions for the underlying MDP (optimal or otherwise), and (b) exploration conditions (or, structural conditions), which limit the amount of exploration required to learn a near-optimal policy, typically by enabling extrapolation across states or limiting the number of effective state distributions (Russo and Van Roy, 2013; Jiang et al., 2017; Sun et al., 2019; Wang et al., 2020b; Du et al., 2021; Jin et al., 2021a; Foster et al., 2021). Algorithms for offline reinforcement learning typically require similar representation conditions. However, since data is collected passively from a fixed logging policy/distribution rather than actively, the exploration conditions used in online RL are replaced with coverage conditions, which assert that the data collection distribution provides sufficient coverage over the state space (Antos et al., 2008; Chen and Jiang, 2019; Xie and Jiang, 2020; 2021; Jin et al., 2021b; Rashidinejad et al., 2021; Foster et al., 2022; Zhan et al., 2022). The aim of both lines of research (online and offline) is to identify the weakest possible conditions under which learning is possible, and to design algorithms that take advantage of these conditions. The two lines have largely evolved in parallel, and it is natural to wonder whether there are deeper connections. Since the conditions for sample-efficient online RL and offline RL mainly differ via exploration versus coverage, this leads us to ask: If an MDP admits a data distribution with favorable coverage for offline RL, what does this imply about our ability to perform online RL efficiently? Beyond its intrinsic theoretical value, this question is motivated by the observation that many real-world applications lie on a spectrum between online and offline.
It is common for the learner to have access to logged/offline data, yet also have the ability to actively interact with the underlying environment, possibly subject to limitations such as an exploration budget (Kalashnikov et al., 2018) . Building a theory of real-world RL that can lead to algorithm design insights for such settings requires understanding the interplay between online and offline RL.

1.1. OUR RESULTS

We investigate connections between coverage conditions in offline RL and exploration in online RL by focusing on the concentrability coefficient, the most ubiquitous notion of coverage in offline RL. Concentrability quantifies the extent to which the data collection distribution uniformly covers the state-action distribution induced by any policy. We introduce a new structural property, coverability, which reflects the best concentrability coefficient that can be achieved by any data distribution, possibly designed by an oracle with knowledge of the underlying MDP. Our main results are as follows:
1. We show (Section 3) that coverability (that is, the mere existence of a distribution with good concentrability) is sufficient for sample-efficient online exploration, even when the learner has no prior knowledge of this distribution. This result requires no additional assumptions on the underlying MDP beyond standard Bellman completeness, and, perhaps surprisingly, is achieved using standard algorithms (Jin et al., 2021a), albeit with analysis ideas that go beyond existing techniques.
2. We show (Section 4) that several weaker notions of coverage in offline RL, including single-policy concentrability (Jin et al., 2021b; Rashidinejad et al., 2021) and conditions based on Bellman residuals (Chen and Jiang, 2019; Xie et al., 2021a), are insufficient for sample-efficient online exploration. This shows that in general, coverage in offline reinforcement learning and exploration in online RL are not compatible, and highlights the need for additional investigation going forward.
Our results serve as a starting point for a systematic study of connections between online and offline learnability in RL. To this end, we provide several secondary results:
1.
We show (Section 5) that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, do not optimally capture coverability, and provide a new complexity measure, the sequential extrapolation coefficient, which unifies these notions.
2. We establish (Appendix C) connections between coverability and reinforcement learning with exogenous noise, with applications to learning in exogenous block MDPs (Efroni et al., 2021; 2022a).
3. We give algorithms for reward-free exploration (Jin et al., 2020a; Chen et al., 2022) under coverability (Appendix G).
While our results primarily concern analysis of existing algorithms rather than algorithm design, they highlight a number of exciting directions for future research, and we are optimistic that the notion of coverability can guide the design of practical algorithms going forward.
Notation. For an integer n ∈ N, we let [n] denote the set {1,...,n}. For a set X, we let ∆(X) denote the set of all probability distributions over X. We adopt standard big-oh notation, and write f = Õ(g) to denote that f = O(g · max{1, polylog(g)}), and a ≲ b as shorthand for a = Õ(b).

2. BACKGROUND: ONLINE/OFFLINE RL, COVERAGE, AND COVERABILITY

Markov decision processes. We consider an episodic reinforcement learning setting. Formally, a Markov decision process M = (X, A, P, R, H, x_1) consists of a (potentially large) state space X, action space A, horizon H, probability transition function P = {P_h}_{h=1}^H, where P_h : X × A → ∆(X), reward function R = {R_h}_{h=1}^H, where R_h : X × A → [0,1], and deterministic initial state x_1 ∈ X.foot_0 A (randomized) policy is a sequence of per-timestep functions π = {π_h : X → ∆(A)}_{h=1}^H. The policy induces a distribution over trajectories (x_1, a_1, r_1), ..., (x_H, a_H, r_H) via the following process. For h = 1, ..., H: a_h ∼ π_h(· | x_h), r_h = R_h(x_h, a_h), and x_{h+1} ∼ P_h(· | x_h, a_h). For notational convenience, we use x_{H+1} to denote a deterministic terminal state with zero reward. We let E_π[·] and P_π[·] denote expectation and probability under this process, respectively. The Q-function for policy π is Q^π_h(x,a) := E_π[ Σ_{h'=h}^H r_{h'} | x_h = x, a_h = a ], the value function for π is V^π_h(x) := E_{a∼π_h(·|x)}[Q^π_h(x,a)], and the expected reward for π is J(π) := V^π_1(x_1). We let π⋆ denote the optimal (deterministic) policy, which maximizes Q^π_h(x,a) for all (x,a) ∈ X × A simultaneously; we define V⋆_h = V^{π⋆}_h and Q⋆_h = Q^{π⋆}_h. We define the occupancy measure for policy π via d^π_h(x,a) := P_π[x_h = x, a_h = a] and d^π_h(x) := P_π[x_h = x]. We let T_h denote the Bellman operator for layer h, defined via [T_h f](x,a) = R_h(x,a) + E_{x'∼P_h(·|x,a)}[max_{a'} f(x',a')] for f : X × A → R. We also assume that rewards are normalized such that Σ_{h∈[H]} r_h ∈ [0,1]. To simplify technical presentation, we assume that X and A are countable; we anticipate that this assumption can be removed.

Online Reinforcement Learning.
Our main results concern online reinforcement learning in an episodic framework, where the learner repeatedly interacts with an unknown MDP by executing a policy and observing the resulting trajectory, with the goal of maximizing total reward. Formally, the protocol proceeds in T rounds, where at each round t = 1, ..., T, the learner: i) selects a policy π^(t) = {π^(t)_h}_{h∈[H]} to execute in the (unknown) underlying MDP M⋆; ii) observes the resulting trajectory (x^(t)_1, a^(t)_1, r^(t)_1), ..., (x^(t)_H, a^(t)_H, r^(t)_H). The learner's goal is to minimize their cumulative regret, defined via Reg := Σ_{t∈[T]} [J(π⋆) − J(π^(t))]. To achieve sample-efficient online reinforcement learning guarantees that do not depend on the size of the state space, one typically appeals to value function approximation methods that take advantage of a function class F ⊂ (X × A → R) that attempts to model the value functions for the underlying MDP M⋆ (optimal or otherwise). An active line of research provides structural conditions under which such approaches succeed (Russo and Van Roy, 2013; Jiang et al., 2017; Sun et al., 2019; Wang et al., 2020b; Du et al., 2021; Jin et al., 2021a; Foster et al., 2021), based on assumptions that control the interplay between the function approximator F and the dynamics of the MDP M⋆. These results require (i) representation conditions, which require that F is flexible enough to model value functions of interest (e.g., Q⋆ ∈ F or T_h F_{h+1} ⊆ F_h), and (ii) exploration conditions, which either explicitly or implicitly limit the amount of exploration required for a deliberate algorithm to learn a near-optimal policy. This is typically accomplished by either enabling extrapolation from states already visited, or by limiting the number of effective state distributions that can be encountered.

Offline Reinforcement Learning and Coverage Conditions. Our aim is to investigate parallels between online and offline reinforcement learning.
In offline reinforcement learning, the learner cannot actively execute policies in the underlying MDP M⋆. Instead, for each layer h, they receive a dataset D_h of n tuples (x_h, a_h, r_h, x_{h+1}) with r_h = R_h(x_h,a_h), x_{h+1} ∼ P_h(· | x_h,a_h), and (x_h,a_h) ∼ µ_h i.i.d., where µ_h ∈ ∆(X × A) is the data collection distribution; we define µ = {µ_h}_{h∈[H]}. The goal of the learner is to use this data to learn an ε-optimal policy π̂, that is, J(π⋆) − J(π̂) ≤ ε. Algorithms for offline reinforcement learning require representation conditions similar to those required for online RL. However, since it is not possible to actively explore the underlying MDP, one dispenses with exploration conditions and instead considers coverage conditions, which require that each data distribution µ_h sufficiently covers the state space. As an example, consider Fitted Q-Iteration (FQI), one of the most well-studied offline reinforcement learning algorithms (Munos, 2007; Munos and Szepesvári, 2008; Chen and Jiang, 2019). The algorithm, which uses least squares to approximate Bellman backups, is known to succeed under (i) a representation condition known as Bellman completeness (or "completeness"), which requires that T_h f ∈ F_h for all f ∈ F_{h+1}, and (ii) a coverage condition called concentrability. To state the result, recall that ∥x∥_∞ := max_i |x_i| for x ∈ R^d.

Definition 1 (Concentrability). The concentrability coefficient for a data distribution µ = {µ_h}_{h=1}^H and policy class Π is given by C_conc(µ) := sup_{π∈Π, h∈[H]} ∥d^π_h / µ_h∥_∞.

Concentrability requires that the data distribution uniformly covers all possible induced state distributions. With concentrabilityfoot_1 and completeness, FQI can learn an ε-optimal policy using poly(C_conc(µ), log|F|, H, ε^{−1}) samples. Importantly, this result scales only with the concentrability coefficient C_conc(µ) and the capacity log|F| of the function class, and has no explicit dependence on the size of the state space.
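As a concrete illustration, Definition 1 can be evaluated directly in a small tabular MDP by computing occupancy measures d^π_h with a forward recursion and then taking the worst-case density ratio. The MDP, policy class, and data distribution below are hypothetical toy inputs chosen for illustration, not objects from the paper; a minimal sketch:

```python
import numpy as np

def occupancies(P, pi, H):
    """d^pi_h(x, a) for h = 1..H via forward recursion, starting from x = 0.

    P[h] has shape (X, A, X) with P[h][x, a, y] = P_h(y | x, a);
    pi[h] has shape (X, A) with pi[h][x, a] = pi_h(a | x).
    """
    X = P[0].shape[0]
    d_state = np.zeros(X)
    d_state[0] = 1.0                          # deterministic initial state x_1
    out = []
    for h in range(H):
        d_sa = d_state[:, None] * pi[h]       # d_h(x, a) = d_h(x) * pi_h(a | x)
        out.append(d_sa)
        d_state = np.einsum("xa,xay->y", d_sa, P[h])
    return out

def concentrability(P, policies, mu, H):
    """C_conc(mu) = sup over pi in Pi and h of ||d^pi_h / mu_h||_inf (Definition 1)."""
    C = 0.0
    for pi in policies:
        for d_h, mu_h in zip(occupancies(P, pi, H), mu):
            with np.errstate(divide="ignore", invalid="ignore"):
                ratio = np.where(d_h > 0, d_h / mu_h, 0.0)
            C = max(C, float(ratio.max()))
    return C

# Hypothetical 2-state, 2-action MDP with H = 2: playing action a moves the
# agent to state a at the next layer, so different policies induce
# genuinely different occupancies.
X, A, H = 2, 2, 2
P = [np.zeros((X, A, X)) for _ in range(H)]
for h in range(H):
    P[h][:, 0, 0] = 1.0   # action 0 leads to state 0
    P[h][:, 1, 1] = 1.0   # action 1 leads to state 1
go0 = [np.array([[1.0, 0.0], [1.0, 0.0]])] * H   # always play action 0
go1 = [np.array([[0.0, 1.0], [0.0, 1.0]])] * H   # always play action 1
mu_unif = [np.full((X, A), 1.0 / (X * A))] * H   # uniform data distribution
C = concentrability(P, [go0, go1], mu_unif, H)
```

On this toy example the uniform distribution attains C_conc = |X|·|A| = 4, the generic worst case; Definition 2 below asks how much better the best choice of µ can do.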
There is a vast literature which provides algorithms with similar, often more refined guarantees (Chen and Jiang, 2019; Xie and Jiang, 2020; 2021; Jin et al., 2021b; Rashidinejad et al., 2021; Foster et al., 2022; Zhan et al., 2022).

The Coverability Coefficient. Having seen that access to a data distribution µ with low concentrability C_conc(µ) is sufficient for sample-efficient offline RL, we now ask what the existence of such a distribution implies about our ability to perform online RL. To this end, we introduce a new structural parameter, the coverability coefficient, whose value reflects the best concentrability coefficient that can be achieved with oracle knowledge of the underlying MDP M⋆.

Definition 2 (Coverability). The coverability coefficient C_cov > 0 for a policy class Π is given by C_cov := inf_{µ_1,...,µ_H ∈ ∆(X×A)} C_conc(µ).

Coverability is an intrinsic structural property of the MDP M⋆ which implicitly restricts the complexity of the set of possible state distributions. While it is always the case that C_cov ≤ |X|·|A|, the coefficient can be significantly smaller (in particular, independent of |X|) for benign MDPs such as block MDPs and MDPs with low-rank structure (Chen and Jiang, 2019, Prop 5); see Appendix C for details. With this definition in mind, we ask: If the MDP M⋆ satisfies low coverability, is sample-efficient online reinforcement learning possible? Note that if the learner were given access to data from the distribution µ that achieves the value of C_cov, it would be possible to simply appeal to offline RL methods such as FQI; but since the learner has no prior knowledge of µ, this question is non-trivial, and requires deliberate exploration.
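To make Definition 2 concrete, the sketch below evaluates the concentrability of one natural candidate for the oracle-designed distribution, namely µ_h(x,a) ∝ max_{π∈Π} d^π_h(x,a); since C_cov is an infimum over µ, any fixed choice gives an upper bound (cf. the cumulative reachability quantity of Section 3.1). The single-layer occupancies below are hypothetical numbers, chosen so that the policies overlap heavily:

```python
import numpy as np

def coverability_upper_bound(occs_by_policy):
    """Upper-bound C_cov by evaluating C_conc at mu_h proportional to
    max_pi d^pi_h(x, a).

    occs_by_policy[i][h] is the (X, A) occupancy array d^{pi_i}_h.
    Because C_cov is an infimum over data distributions (Definition 2),
    any fixed mu yields an upper bound on C_cov.
    """
    H = len(occs_by_policy[0])
    C = 0.0
    for h in range(H):
        envelope = np.max([occ[h] for occ in occs_by_policy], axis=0)
        mu_h = envelope / envelope.sum()       # normalize the pointwise max
        for occ in occs_by_policy:
            with np.errstate(divide="ignore", invalid="ignore"):
                ratio = np.where(occ[h] > 0, occ[h] / mu_h, 0.0)
            C = max(C, float(ratio.max()))
    return C

# Hypothetical single-layer occupancies for two policies over three states
# (one action): the policies agree on 90% of their mass.
d_pi1 = [np.array([[0.9], [0.1], [0.0]])]
d_pi2 = [np.array([[0.9], [0.0], [0.1]])]
C_upper = coverability_upper_bound([d_pi1, d_pi2])
```

Here the envelope-based µ certifies C_cov ≤ 1.1, whereas the uniform distribution over the three states gives C_conc = 2.7 on the same occupancies: heavy overlap between policies is exactly what makes coverability small.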

3. COVERABILITY IMPLIES SAMPLE-EFFICIENT ONLINE EXPLORATION

We now present our main result, which shows that low coverability is sufficient for sample-efficient online exploration. We first describe the algorithm and regret bound, then sketch the proof and give intuition (Section 3.1). We conclude (Section 3.2) by applying the main result to give regret bounds for learning in exogenous block MDPs (Efroni et al., 2021) , highlighting structural properties of coverability.

Function approximation. We work with a value function class

F = F_1 × ··· × F_H, where F_h ⊂ (X × A → [0,1]), with the goal of modeling value functions for the underlying MDP. We adopt the convention that f_{H+1} = 0, and for each f ∈ F, we let π_f denote the greedy policy with π_{f,h}(x) := argmax_{a∈A} f_h(x,a), and we use f_h(x, π_h) := E_{a∼π_h(·|x)}[f_h(x,a)] for any π_h. We take our policy class to be the induced class Π := {π_f | f ∈ F} for the remainder of the paper unless otherwise stated. We make the following standard completeness assumption, which requires that the value function class is closed under Bellman backups (Wang et al., 2020b; Jin et al., 2020b; Wang et al., 2021b; Jin et al., 2021a).

Assumption 1 (Completeness). For all h ∈ [H], we have T_h f_{h+1} ∈ F_h for all f_{h+1} ∈ F_{h+1}.

Completeness implies that F is realizable (that is, Q⋆ ∈ F), but is a stronger assumption in general. We assume for simplicity that |F| < ∞, and our results scale with log|F|; this can be extended to infinite classes via covering numbers using a standard analysis.

Algorithm and main result. Our result is based on a new analysis of the GOLF algorithm of Jin et al. (2021a), which is presented in Algorithm 1 of Appendix D for completeness. GOLF is based on the principle of optimism in the face of uncertainty. At each round, the algorithm restricts to a confidence set F^(t) ⊆ F with the property that Q⋆ ∈ F^(t), and chooses π^(t) = π_{f^(t)} based on the value function f^(t) ∈ F^(t) with the most optimistic estimate f_1(x_1, π_{f,1}(x_1)) for the total reward. The confidence sets F^(t) are based on an empirical proxy for squared Bellman error, and are constructed in a global fashion that entails optimizing over f_h for all layers h ∈ [H] simultaneously (Zanette et al., 2020a). Note that while GOLF was originally introduced to provide regret bounds based on the notion of Bellman-Eluder dimension, we show (Section 5) that coverability cannot be (optimally) captured by this complexity measure, necessitating a new analysis.
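Algorithm 1 itself is deferred to Appendix D; purely for intuition, the following is a minimal finite-class sketch of the global confidence-set construction described above. The dict-based function representation and the toy dataset are our own assumptions, not the paper's pseudocode: a candidate f survives if, at every layer, its empirical squared Bellman residual against its own targets is within β of the best value any function in the class achieves.

```python
def golf_confidence_set(F, datasets, beta, actions):
    """Finite-class sketch of a GOLF-style confidence set F^(t).

    F: candidate value functions; F[i][h] is a dict (x, a) -> value, with the
    convention f_{H+1} = 0. datasets[h]: list of (x, a, r, x_next) tuples.
    A candidate f is kept if, for every layer h, its squared Bellman residual
    loss against targets r + max_a' f_{h+1}(x', a') is within beta of the best
    loss achievable by any g_h in the class; this couples all layers of f,
    mirroring the "global" construction described in the text.
    """
    H = len(datasets)

    def target(f_next, r, x_next):
        # Bellman backup target induced by the candidate's own next layer.
        return r + max(f_next.get((x_next, a), 0.0) for a in actions)

    def loss(g_h, f_next, data):
        # Empirical squared Bellman residual of g_h at one layer.
        return sum((g_h.get((x, a), 0.0) - target(f_next, r, xn)) ** 2
                   for (x, a, r, xn) in data)

    survivors = []
    for f in F:
        ok = True
        for h in range(H):
            f_next = f[h + 1] if h + 1 < H else {}
            L_min = min(loss(g[h], f_next, datasets[h]) for g in F)
            if loss(f[h], f_next, datasets[h]) - L_min > beta:
                ok = False
                break
        if ok:
            survivors.append(f)
    return survivors

# Toy example: one state, two actions, H = 1, so targets are just rewards.
F = [[{(0, 0): 1.0, (0, 1): 0.0}],   # consistent with the data below
     [{(0, 0): 0.0, (0, 1): 0.0}]]   # misfits the reward of action 0
data = [[(0, 0, 1.0, 0), (0, 1, 0.0, 0)]]
kept = golf_confidence_set(F, data, beta=0.5, actions=[0, 1])
```

The optimistic step of GOLF would then select, among the survivors, the f maximizing f_1(x_1, π_{f,1}(x_1)) and play its greedy policy.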
Our main result, Theorem 1, shows that GOLF attains low regret for online reinforcement learning whenever the coverability coefficient is small.

Theorem 1 (Coverability implies sample-efficient online RL). Under Assumption 1, there exists an absolute constant c such that for any δ ∈ (0,1] and T ∈ N+, if we choose β = c · log(TH|F|/δ) in Algorithm 1, then with probability at least 1 − δ, we have

Reg ≤ O( H √( C_cov · T · log(TH|F|/δ) · log(T) ) ),

where C_cov is the coverability coefficient (Definition 2).

Beyond the coverability parameter C_cov, the regret bound in Theorem 1 depends only on standard problem parameters (the horizon H and the function class capacity log|F|). Hence, this result shows that coverability, along with completeness, is sufficient for sample-efficient online RL. Additional features of Theorem 1 are as follows.
• While coverability implies that there exists a distribution µ for which the concentrability coefficient C_conc is bounded, Algorithm 1 has no prior knowledge of this distribution. We find the fact that the GOLF algorithm, which does not explicitly search for such a distribution, succeeds under this condition to be somewhat surprising (recall that given sample access to µ, one can simply run FQI). Our proof shows that despite the fact that GOLF does not explicitly reason about µ, coverability implicitly restricts the set of possible state distributions, and limits the extent to which the algorithm can be "surprised" by substantially new distributions. We anticipate that this analysis will find broader use.
• Ignoring factors logarithmic in T, H, and δ^{−1}, the regret bound in Theorem 1 scales as H √(C_cov · T · log|F|), which is optimal for contextual bandits (where C_cov = |A| and H = 2),foot_2 and hence cannot be improved in general (Agarwal et al., 2012). The dependence on H matches the regret bound for GOLF based on Bellman-Eluder dimension (Jin et al., 2021a).
• GOLF uses confidence sets based on squared Bellman error, but there are similar algorithms which instead work with average Bellman error (Jiang et al., 2017; Du et al., 2021) and, as a result, require only realizability rather than completeness (Assumption 1). While existing complexity measures such as Bellman rank and Bellman-Eluder dimension can be used to analyze both types of algorithm, our results critically use the non-negativity of squared Bellman error, which facilitates certain "change-of-measure" arguments. Consequently, it is unclear whether the completeness assumption can be removed (i.e., whether coverability and realizability alone suffice for sample-efficient online RL).
On the algorithmic side, our results give guarantees for PAC RL via online-to-batch conversion, which we state here for completeness. We also provide an extension to reward-free exploration in Appendix G.

Corollary 2. Under Assumption 1, there exists an absolute constant c such that for any δ ∈ (0,1] and T ∈ N+, if we choose β = c · log(TH|F|/δ) in Algorithm 1, then with probability at least 1 − δ, the policy π̂ output by Algorithm 1 hasfoot_3

J(π⋆) − J(π̂) ≤ O( H √( C_cov · log(TH|F|/δ) · log(T) / T ) ).

3.1. PROOF SKETCH FOR THEOREM 1: WHY IS COVERABILITY SUFFICIENT?

We now sketch the main ideas behind the proof of Theorem 1, highlighting the role of coverability in limiting the complexity of exploration.

Regret decomposition and change of measure. For each t, we define δ^(t)_h(·,·) := f^(t)_h(·,·) − (T_h f^(t)_{h+1})(·,·), which may be viewed as a "test function" at layer h induced by f^(t) ∈ F. We adopt the shorthand d^(t)_h ≡ d^{π^(t)}_h, and we define d̄^(t)_h(x,a) := Σ_{i=1}^{t−1} d^(i)_h(x,a) as the cumulative historical visitation for rounds prior to round t. A standard regret decomposition for optimistic algorithms (Lemma 13) allows us to relate regret to the average Bellman error under the learner's sequence of policies:

Reg ≤ Σ_{t∈[T]} [ f^(t)_1(x_1, π_{f^(t),1}(x_1)) − J(π^(t)) ] = Σ_{t∈[T]} Σ_{h∈[H]} E_{d^(t)_h}[ δ^(t)_h(x,a) ].   (1)

Fix h ∈ [H]. We use a change-of-measure argument to relate the on-policy average Bellman error E_{(x,a)∼d^(t)_h}[δ^(t)_h(x,a)] to the in-sample squared Bellman error under d̄^(t)_h, writing the inner sum in Eq. (1) as

Σ_{t∈[T]} E_{d^(t)_h}[δ^(t)_h(x,a)] = Σ_{t∈[T]} Σ_{x,a} ( d^(t)_h(x,a) / (d̄^(t)_h(x,a))^{1/2} ) · (d̄^(t)_h(x,a))^{1/2} δ^(t)_h(x,a)
  ≤ ( Σ_{t∈[T]} Σ_{x,a} (d^(t)_h(x,a))^2 / d̄^(t)_h(x,a) )^{1/2} · ( Σ_{t∈[T]} Σ_{x,a} d̄^(t)_h(x,a) (δ^(t)_h(x,a))^2 )^{1/2},

by the Cauchy-Schwarz inequality; the first factor, (I) := Σ_{t∈[T]} Σ_{x,a} (d^(t)_h(x,a))^2 / d̄^(t)_h(x,a), is the extrapolation error, while the second factor is the in-sample squared Bellman error, which is controlled by the confidence set construction.

Bounding the extrapolation error using coverability. To proceed, we show that the extrapolation error (I) is controlled by coverability. We have

Σ_{t∈[T]} Σ_{x,a} (d^(t)_h(x,a))^2 / d̄^(t)_h(x,a) ≤ Σ_{t∈[T]} Σ_{x,a} max_{t'∈[T]} d^{(t')}_h(x,a) · ( d^(t)_h(x,a) / d̄^(t)_h(x,a) )
  ≤ ( max_{x,a} Σ_{t∈[T]} d^(t)_h(x,a) / d̄^(t)_h(x,a) ) · ( Σ_{x,a} max_{t∈[T]} d^(t)_h(x,a) ),

where (a) the first factor satisfies max_{x,a} Σ_{t∈[T]} d^(t)_h(x,a)/d̄^(t)_h(x,a) ≲ O(log T), and (b) the second factor satisfies Σ_{x,a} max_{t∈[T]} d^(t)_h(x,a) ≤ C_cov. Here, the inequality (a) uses a scalar variant of the elliptic potential lemma (Lemma 15; cf. Lattimore and Szepesvári (2020)), which we apply on a per-state basis. The inequality (b) uses a key result (Lemma 14 in Appendix D), which shows that coverability is equivalent to a quantity we term cumulative reachability, defined via Σ_{(x,a)∈X×A} sup_{π∈Π} d^π_h(x,a).
Cumulative reachability reflects the variation in visitation probabilities for policies in the class Π, and boundedness of this quantity (which occurs when the state-action pairs visited by policies in Π have large overlap) implies that the contributions from the potentials for different state-action pairs average out. See Figure 1 for an illustration. To conclude, we substitute the preceding bounds into the term (I), which gives

Reg ≤ Σ_{h=1}^{H} Σ_{t∈[T]} E_{(x,a)∼d^(t)_h}[ δ^(t)_h(x,a) ] ≤ O( H √( C_cov · βT · log(T) ) ).

Note that to obtain the expression in term (I), our proof critically uses that the confidence set construction provides a bound on the squared Bellman error E_{(x,a)∼d̄^(t)_h}[(δ^(t)_h(x,a))^2] in the change-of-measure argument. This contrasts with existing works on online RL with general function approximation (e.g., Jiang et al., 2017; Jin et al., 2021a; Du et al., 2021), which typically move from average Bellman error to squared Bellman error as a lossy step, and work with squared Bellman error only because it permits simpler construction of confidence sets. Confidence sets based on average Bellman error would lead to a larger notion of extrapolation error which cannot be controlled using coverability (cf. Section 5).
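The per-state potential step (inequality (a) above) can be sanity-checked numerically. The following is a hedged scalar illustration rather than Lemma 15 verbatim: for any sequence of visitation masses in [0,1], the sum of current-to-cumulative ratios grows only logarithmically in the total mass, regardless of how the sequence is chosen.

```python
import math
import random

def potential_sum(masses):
    """Evaluate sum_t a_t / max(1, sum_{i<t} a_i) for a_t in [0, 1].

    For a fixed state-action pair, a_t plays the role of d^(t)_h(x, a): the
    ratio of current to cumulative visitation can be large only O(log T)
    times in aggregate, which is the content of the per-state potential
    argument (hedged illustration; see Lemma 15 in Appendix D for the
    precise statement used by the proof).
    """
    total, cum = 0.0, 0.0
    for a in masses:
        total += a / max(1.0, cum)
        cum += a
    return total

random.seed(0)
T = 10_000
masses = [random.random() for _ in range(T)]      # arbitrary masses in [0, 1]
value = potential_sum(masses)
bound = 2.0 + 2.0 * math.log(1.0 + sum(masses))   # O(log T), sequence-free
```

The bound follows because terms with cumulative mass below 1 contribute at most 2 in total, and each later term a_t/cum is at most 2·log((cum + a_t)/cum), which telescopes.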

3.2. RICH OBSERVATIONS AND EXOGENOUS NOISE: APPLICATION TO BLOCK MDPS

As an application of Theorem 1, we consider the problem of reinforcement learning in Exogenous Block MDPs (Ex-BMDPs), a problem which has received extensive recent interest (Efroni et al., 2021; 2022a;b; Lamb et al., 2022). Recall that the block MDP (Jiang et al., 2017; Du et al., 2019; Misra et al., 2020) is a model in which the ("observed") state space X is large/high-dimensional, but can be mapped by an (unknown) decoder ϕ⋆ to a small latent state space which governs the dynamics. Exogenous block MDPs generalize this model further by factorizing the latent state space into a small controllable ("endogenous") component S and a large irrelevant ("exogenous") component Ξ, which may be temporally correlated. The main challenge of learning in block MDPs is that the decoder ϕ⋆ is not known to the learner in advance; indeed, given access to the decoder, one can obtain regret poly(H,|S|,|A|). Existing approaches either require deterministic dynamics (Efroni et al., 2021) or allow for stochastic dynamics but heavily restrict the observation process (Efroni et al., 2022a), and existing complexity measures such as Bellman rank and Bellman-Eluder dimension can be arbitrarily large for this setting (see discussion in Section 5). See Appendix C for details and discussion.

4. ARE WEAKER NOTIONS OF COVERAGE SUFFICIENT?

In Section 3, we showed that existence of a distribution with good concentrability (coverability) is sufficient for sample-efficient online RL. However, while concentrability is the most ubiquitous coverage condition in offline RL, there are several weaker notions of coverage which also lead to sample-efficient offline RL algorithms. In this section, we show that analogues of coverability based on these conditions, single-policy concentrability and generalized concentrability for Bellman residuals, do not suffice for sample-efficient online RL. This indicates that in general, the interplay between offline coverage and online exploration is nuanced.

Single-policy concentrability. Single-policy concentrability is a widely used coverage assumption in offline RL which weakens concentrability by requiring only that the state distribution induced by π⋆ is covered by the offline data distribution µ, as opposed to requiring coverage for all policies (Jin et al., 2021b; Rashidinejad et al., 2021).

Definition 3 (Single-policy concentrability). The single-policy concentrability coefficient for a data distribution µ = {µ_h}_{h=1}^H is given by C⋆_conc(µ) := max_{h∈[H]} ∥d^{π⋆}_h / µ_h∥_∞.

For offline RL, algorithms based on pessimism provide sample complexity guarantees that scale with C⋆_conc(µ) (Jin et al., 2021b; Rashidinejad et al., 2021). However, for the online setting, it is trivial to show that an analogous notion of "single-policy coverability" (i.e., existence of a distribution with good single-policy concentrability) is not sufficient for sample-efficient learning, since for any MDP, one can take µ = d^{π⋆} to attain C⋆_conc(µ) = 1. This suggests that any notion of coverage that suffices for online RL must be more uniform in nature.

Generalized concentrability for Bellman residuals.
Another approach to weaker coverage in offline RL is to relax concentrability by only requiring coverage with respect to the Bellman residuals for value functions in F (Chen and Jiang, 2019; Xie et al., 2021a; Cheng et al., 2022); the following definition adapts this notion to the finite-horizon setting.

Definition 4 (Generalized concentrability). We define the generalized concentrability coefficient C_conc(µ,F) for a policy class Π and value function class F as the least constant C > 0 such that the offline data distribution µ = {µ_h}_{h=1}^H satisfies, for all f ∈ F and π ∈ Π,

Σ_{h∈[H]} E_{d^π_h}[ (f_h(x_h,a_h) − (T_h f_{h+1})(x_h,a_h))^2 ] ≤ C · Σ_{h∈[H]} E_{µ_h}[ (f_h(x_h,a_h) − (T_h f_{h+1})(x_h,a_h))^2 ].

Note that C_conc(µ,F) ≤ C_conc(µ) (in particular, they coincide if one chooses F to be the set of all functions over X × A), but in general C_conc(µ,F) can be much smaller. For example, in the linear Bellman-complete setting, it is possible to bound C_conc(µ,F) in terms of feature coverage conditions (Wang et al., 2021a; Zanette et al., 2021). Using offline data from µ, sample complexity guarantees that scale with C_conc(µ,F) can be obtained under Assumption 1 via MSBO (see, e.g., Xie and Jiang, 2020, Section 5) or by running a "one-step" variant of GOLF (Algorithm 1); we provide this result (Proposition 16) in Appendix E for completeness. Given that this notion leads to positive results for offline RL, it is natural to consider a generalized notion of coverability based upon it.

Definition 5 (Generalized coverability). We define the generalized coverability coefficient for a policy class Π and value function class F as C_cov(F) := inf_{µ_1,...,µ_H ∈ ∆(X×A)} C_conc(µ,F).

Unfortunately, we show that this condition does not suffice for sample-efficient online RL, even when the number of actions is constant and Assumption 1 is satisfied.

Theorem 3.
For any X, H, C ∈ N, there exists a family of MDPs with |X| = X, |A| = 2, and horizon H, and a function class F with log|F| ≤ H log(2|X|), such that: i) Assumption 1 (completeness) is satisfied for F and we have C_cov(F) ≤ C, and ii) any online RL algorithm that returns a 0.1-optimal policy with probability at least 0.9 requires at least Ω( min{ X, 2^{Ω(H)}, 2^{Ω(C)} } ) trajectories.

Theorem 3 highlights that in general, notions of coverage that suffice for offline RL, even those that are uniform in nature, can fail to lead to useful structural conditions for online RL. Briefly, the issue is that bounding regret for online RL entails controlling the extent to which a deliberate algorithm that has observed state distributions d^(1)_h, ..., d^(t−1)_h can be "surprised" by a substantially new state distribution d^(t)_h; here, surprise is typically measured in terms of Bellman residuals. The proof of Theorem 3 shows that existence of a distribution with good coverage with respect to Bellman residuals does not suffice to provide meaningful control of such distribution shift. We caution, however, that the lower bound construction makes use of the fact that Definition 5 requires coverage only on average across layers, and it is unclear whether a similar lower bound holds under uniform coverage across layers. Developing a more unified and fine-grained understanding of which coverage conditions lead to efficient exploration is an important question for future research.

5. A NEW STRUCTURAL CONDITION FOR SAMPLE-EFFICIENT ONLINE RL

Having shown that coverability facilitates sample-efficient online RL, an immediate question is whether this structural condition is related to existing complexity measures such as Bellman-Eluder dimension (Jin et al., 2021a) and Bellman/Bilinear rank (Jiang et al., 2017; Du et al., 2021) , which attempt to unify existing approaches to sample-efficient RL. We now show that these complexity measures are insufficient to capture coverability, then provide a new complexity measure, the Sequential Extrapolation Coefficient, which bridges the gap.

5.1. INSUFFICIENCY OF EXISTING COMPLEXITY MEASURES

Bellman-Eluder dimension (Jin et al., 2021a) and Bellman/Bilinear rank (Jiang et al., 2017; Du et al., 2021) can fail to capture coverability for two reasons: (i) insufficiency of average Bellman error (as opposed to squared Bellman error), and (ii) incorrect dependence on scale. To highlight these issues, we focus on the Q-type Bellman-Eluder dimension (Jin et al., 2021a), which subsumes Bellman rank.foot_5 See Appendix F for discussion of other complexity measures. Let D^Π_h := {d^π_h : π ∈ Π} and F_h − T_h F_{h+1} := {f_h − T_h f_{h+1} : f ∈ F}. Following Jin et al. (2021a), we define the (Q-type) Bellman-Eluder dimension as follows.

Definition 6 (Bellman-Eluder dimension). The Bellman-Eluder dimension dim_BE(F,Π,ε,h) for layer h is the largest d ∈ N such that there exist sequences {d^(1)_h, ..., d^(d)_h} ⊆ D^Π_h and {δ^(1)_h, ..., δ^(d)_h} ⊆ F_h − T_h F_{h+1} such that for all t ∈ [d], |E_{d^(t)_h}[δ^(t)_h]| > ε^(t) and √( Σ_{i=1}^{t−1} (E_{d^(i)_h}[δ^(t)_h])^2 ) ≤ ε^(t), for some ε^(1), ..., ε^(d) ≥ ε. We define dim_BE(F,Π,ε) := max_{h∈[H]} dim_BE(F,Π,ε,h).

Issue #1: Insufficiency of average (vs. squared) Bellman error. The Bellman-Eluder dimension reflects the length of the longest sequence of value function pairs for which we can be "surprised" by a large Bellman residual for a new policy even when the value function has low Bellman residual on all preceding policies. Note that via Definition 6, the Bellman-Eluder dimension measures both the size of the surprise and the error on preceding points via average Bellman error (e.g., E_{d^(i)_h}[δ^(t)_h]). On the other hand, the proof of Theorem 1 critically uses squared Bellman error Σ_{i=1}^{t−1} E_{d^(i)_h}[(δ^(t)_h)^2] to bound regret by coverability; this is because the (point-wise) nonnegativity of squared Bellman error facilitates change of measure in a similar fashion to offline reinforcement learning.
The following result shows that this issue is fundamental: the Bellman-Eluder dimension can be exponentially large relative to the regret bound in Theorem 1.

Proposition 4. For any $d \in \mathbb{N}$, there exists an MDP with coverability $C_{\mathrm{cov}} = O(1)$ and a value function class $\mathcal{F}$ satisfying Assumption 1 with $\log|\mathcal{F}| = O(\log d)$ (so that Theorem 1 yields $\sqrt{T}$-regret scaling only logarithmically in $d$), yet we have $\dim_{\mathrm{BE}}(\mathcal{F},\Pi,1/2) = \Omega(d)$.

The construction, which is based on Efroni et al. (2022b, Section B.1), critically leverages cancellations in the average Bellman error; these cancellations are ruled out by squared Bellman error, which is why Theorem 1 gives a regret bound that scales only logarithmically in $d$. Bilinear rank (Du et al., 2021) and V-type Bellman rank suffer from similar drawbacks; see Appendix F for further discussion.

Issue #2: Incorrect dependence on scale. In light of the previous example, a seemingly reasonable fix is to adapt the Bellman-Eluder dimension to consider squared Bellman error rather than average Bellman error (i.e., use $\sum_{i=1}^{t-1}\mathbb{E}_{d_h^{(i)}}[(\delta_h^{(t)})^2] \le \varepsilon^{(t)}$ in Definition 6). We show (Appendix F.1) that while it is possible to bound this modified Bellman-Eluder dimension in terms of the coverability parameter, the dependence on the scale parameter $\varepsilon$ is polynomial, and it is not possible to derive regret bounds better than $T^{2/3}$ under coverability with this approach. Informally, the issue is scale: the Bellman-Eluder dimension only checks whether the average Bellman error violates the threshold $\varepsilon$, and does not consider how far the error exceeds the threshold (e.g., $|\mathbb{E}_{d_h^{(t)}}[\delta_h^{(t)}]| > \varepsilon$ and $|\mathbb{E}_{d_h^{(t)}}[\delta_h^{(t)}]| > 1$ are counted the same).
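To illustrate Issue #1 concretely, here is a minimal numerical sketch (our own toy example, not the construction from the lower bound): a sign-alternating Bellman residual cancels under the average-error test but is fully visible to the squared-error test.

```python
import numpy as np

# A Bellman residual ("test function") over two states: large but sign-alternating.
delta = np.array([1.0, -1.0])

# A data distribution that mixes the two states equally.
d = np.array([0.5, 0.5])

avg_error = abs(np.dot(d, delta))   # average Bellman error: cancels to 0
sq_error = np.dot(d, delta ** 2)    # squared Bellman error: stays at 1

print(avg_error)  # 0.0 -- the residual is invisible to average-error tests
print(sq_error)   # 1.0 -- point-wise nonnegativity rules out cancellation
```

This is exactly the mechanism by which average-error complexity measures can be large while squared-error arguments (as in the proof of Theorem 1) remain well behaved.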

5.2. THE SEQUENTIAL EXTRAPOLATION COEFFICIENT

To address the issues above, we introduce a new complexity measure, the Sequential Extrapolation Coefficient (SEC), which (i) leads to regret bounds via GOLF, and (ii) subsumes both coverability and the Bellman-Eluder dimension. Conceptually, the Sequential Extrapolation Coefficient should be thought of as a minimal abstraction of the main ingredient in regret bounds based on GOLF and other optimistic algorithms: extrapolation from in-sample error to on-policy error. We begin by stating a variant of the Sequential Extrapolation Coefficient for abstract function classes, then specialize it to RL.

Definition 7 (Sequential Extrapolation Coefficient). Let $\mathcal{Z}$ be an abstract set. Given a test function class $\Psi \subset (\mathcal{Z} \to \mathbb{R})$ and a distribution class $\mathcal{D} \subset \Delta(\mathcal{Z})$, the sequential extrapolation coefficient for length $T$ is given by
$$\mathrm{SEC}(\Psi,\mathcal{D},T) := \sup_{\psi^{(1)},\ldots,\psi^{(T)} \in \Psi}\;\sup_{d^{(1)},\ldots,d^{(T)} \in \mathcal{D}}\;\sum_{t\in[T]} \frac{\big(\mathbb{E}_{d^{(t)}}[\psi^{(t)}]\big)^2}{1 \vee \sum_{i=1}^{t-1}\mathbb{E}_{d^{(i)}}[(\psi^{(t)})^2]}.$$

To apply the Sequential Extrapolation Coefficient to RL, we use Bellman residuals for $\mathcal{F}$ as test functions and consider state-action distributions induced by policies in $\Pi$.

Definition 8 (SEC for RL). We define $\mathrm{SEC}_{\mathrm{RL}}(\mathcal{F},\Pi,T) := \max_{h\in[H]} \mathrm{SEC}(\mathcal{F}_h - \mathcal{T}_h\mathcal{F}_{h+1},\mathcal{D}_h^\Pi,T)$.

The following result, which is a near-immediate consequence of the definition, shows that the Sequential Extrapolation Coefficient leads to regret bounds via GOLF; recall that $\Pi = \{\pi_f \mid f \in \mathcal{F}\}$ is the set of greedy policies induced by $\mathcal{F}$.

Theorem 5. Under Assumption 1, there exists an absolute constant $c$ such that for any $\delta \in (0,1]$ and $T \in \mathbb{N}_+$, if we choose $\beta = c\cdot\log(TH|\mathcal{F}|/\delta)$ in Algorithm 1, then with probability at least $1-\delta$, we have
$$\mathrm{Reg} \le O\Big(H\sqrt{\mathrm{SEC}_{\mathrm{RL}}(\mathcal{F},\Pi,T)\cdot T\cdot\log(TH|\mathcal{F}|/\delta)}\Big).$$

We defer the proof of Theorem 5 to Appendix F, and conclude by showing that the Sequential Extrapolation Coefficient subsumes coverability $C_{\mathrm{cov}}$ (Definition 2) and the Bellman-Eluder dimension.

Proposition 6 (Coverability $\Rightarrow$ SEC). $\mathrm{SEC}_{\mathrm{RL}}(\mathcal{F},\Pi,T) \le O(C_{\mathrm{cov}}\cdot\log(T))$.
Proposition 7 (Bellman-Eluder dim. $\Rightarrow$ SEC). $\mathrm{SEC}_{\mathrm{RL}}(\mathcal{F},\Pi,T) \le O(\dim_{\mathrm{BE}}(\mathcal{F},\Pi,1/T)\cdot\log(T))$.

The Sequential Extrapolation Coefficient can likely be generalized further in many directions (e.g., by allowing for different test functions in the vein of Du et al. (2021)). Further unifying these notions is an interesting question for future research; see Appendix F.3 for further discussion.
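For intuition, the SEC of Definition 7 can be computed by brute force for small finite classes. The sketch below (toy test functions and distributions of our own choosing, not an example from the paper) enumerates all length-$T$ sequences:

```python
import itertools
import numpy as np

def sec(psis, dists, T):
    """Brute-force the sequential extrapolation coefficient for finite classes.

    psis:  list of test functions, each an array of values over a finite set Z.
    dists: list of distributions over Z.
    The sup over length-T sequences is taken by exhaustive enumeration.
    """
    best = 0.0
    for psi_seq in itertools.product(psis, repeat=T):
        for d_seq in itertools.product(dists, repeat=T):
            total = 0.0
            for t in range(T):
                num = np.dot(d_seq[t], psi_seq[t]) ** 2
                den = max(1.0, sum(np.dot(d_seq[i], psi_seq[t] ** 2) for i in range(t)))
                total += num / den
            best = max(best, total)
    return best

# Two test functions and two distributions over a three-point set (toy example).
psis = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, -1.0])]
dists = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.5])]
print(sec(psis, dists, T=3))  # 2.5 for this toy instance
```

Note how the denominator's "$1 \vee$" clipping and the squared in-sample error make each term track how far extrapolation can overshoot the data already collected, rather than merely whether a threshold is crossed.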

6. CONCLUSION

This paper initiates a systematic study of parallels between online and offline learnability in reinforcement learning and uncovers surprising new connections. Possible future directions include general theories under weaker notions of coverability or approximation conditions (see Appendix A for open problems), as well as connections to practical algorithm design.

A OPEN PROBLEMS

• Coverage conditions for linear function approximation. For MDPs with linear structure given by a feature map $\phi$, a natural coverage condition posits the existence of distributions $\mu = \{\mu_h\}_{h=1}^H$ such that $\mathbb{E}_{d_h^\pi}\big[\phi(x_h,a_h)\phi(x_h,a_h)^\top\big] \preceq C\cdot\mathbb{E}_{\mu_h}\big[\phi(x_h,a_h)\phi(x_h,a_h)^\top\big]$ for some coverage parameter $C$. Is this condition (or a variant) sufficient for sample-efficient online exploration?
• Further conditions from offline RL. There are many conditions used to provide sample-efficient learning guarantees in offline RL beyond those considered in this paper, including (i) pushforward concentrability (Munos, 2003; Xie and Jiang, 2021), (ii) $L_p$ variants of concentrability (Farahmand et al., 2010; Xie and Jiang, 2020), and (iii) weight function realizability (Xie and Jiang, 2020; Jiang and Huang, 2020; Zhan et al., 2022). Which of these conditions can be adapted for online exploration, and to what extent?
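The spectral domination condition in the first open problem above is straightforward to check numerically. The following sketch (toy second-moment matrices of our own choosing) tests $\Sigma_\pi \preceq C\,\Sigma_\mu$ via the smallest eigenvalue of $C\Sigma_\mu - \Sigma_\pi$:

```python
import numpy as np

def covers(Sigma_pi, Sigma_mu, C, tol=1e-9):
    """Check the spectral domination E_pi[phi phi^T] <= C * E_mu[phi phi^T],
    i.e. that C*Sigma_mu - Sigma_pi is positive semidefinite."""
    gap = C * Sigma_mu - Sigma_pi
    return np.linalg.eigvalsh(gap).min() >= -tol

# Toy second-moment matrices for two feature distributions (illustrative only).
phi_pi = np.array([[1.0, 0.0], [0.0, 0.0]])   # policy visits only direction e1
phi_mu = np.array([[0.5, 0.0], [0.0, 0.5]])   # logging distribution spreads mass

print(covers(phi_pi, phi_mu, C=2.0))   # True: 2*Sigma_mu - Sigma_pi is PSD
print(covers(phi_pi, phi_mu, C=1.0))   # False: the e1 direction is under-covered
```

The smallest admissible $C$ plays the role of the coverage parameter: it measures how much the logging distribution must be inflated before it dominates every on-policy feature direction.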

B ADDITIONAL RELATED WORK

In this section we briefly highlight some relevant related work not otherwise discussed. Online RL with access to offline data. A separate line of work develops algorithms for online reinforcement learning that assume additional access to offline data gathered from a known data distribution µ or a known exploratory policy (Abbasi-Yadkori et al., 2019; Xie et al., 2021b). These results are complementary to our own, since we assume only that a good exploratory distribution exists, but do not assume that such a distribution is known to the learner. Further structural conditions for online RL.

C APPLICATION TO EXOGENOUS BLOCK MDPS

As an application of Theorem 1, we consider the problem of reinforcement learning in Exogenous Block MDPs (Ex-BMDPs). Following Efroni et al. (2021), an Ex-BMDP $M = (\mathcal{X},\mathcal{A},P,R,H,x_1)$ is defined by an (unobserved) latent state space, which consists of an endogenous state $s_h \in \mathcal{S}$ and an exogenous state $\xi_h \in \Xi$, and an observation process which generates the observed state $x_h$. We first describe the dynamics for the latent space. Given initial endogenous and exogenous states $s_1 \in \mathcal{S}$ and $\xi_1 \in \Xi$, the latent states evolve via $s_{h+1} \sim P^{\mathrm{endo}}_h(s_h,a_h)$ and $\xi_{h+1} \sim P^{\mathrm{exo}}_h(\xi_h)$; that is, while both states evolve in a temporally correlated fashion, only the endogenous state $s_h$ evolves as a function of the agent's action. The latent state $(s_h,\xi_h)$ is not observed. Instead, we observe $x_h \sim q_h(s_h,\xi_h)$, where $q_h : \mathcal{S}\times\Xi \to \Delta(\mathcal{X})$ is an emission distribution with the property that $\mathrm{supp}(q_h(s,\xi)) \cap \mathrm{supp}(q_h(s',\xi')) = \emptyset$ if $(s,\xi) \ne (s',\xi')$. This property (decodability) ensures that there exists a unique mapping $\phi^\star_h : \mathcal{X} \to \mathcal{S}$ that maps the observed state $x_h$ to the corresponding endogenous latent state $s_h$. We assume that $R_h(x,a) = R_h(\phi^\star_h(x),a)$, which implies that the optimal policy $\pi^\star$ depends only on the endogenous latent state, i.e., $\pi^\star_h(x) = \pi^\star_h(\phi^\star_h(x))$.

The main challenge of learning in block MDPs is that the decoder $\phi^\star$ is not known to the learner in advance. Indeed, given access to the decoder, one can obtain regret $\mathrm{poly}(H,|\mathcal{S}|,|\mathcal{A}|)\cdot\sqrt{T}$ by applying tabular reinforcement learning algorithms to the latent state space. In light of this, the aim of the Ex-BMDP setting is to obtain sample complexity guarantees that are independent of the size of the observed state space $|\mathcal{X}|$ and exogenous state space $|\Xi|$, and scale as $\mathrm{poly}(|\mathcal{S}|,|\mathcal{A}|,H,\log|\mathcal{F}|)$, where $\mathcal{F}$ is an appropriate class of function approximators (typically either a value function class $\mathcal{F}$ or a class of decoders $\Phi$ that attempts to model $\phi^\star$ directly). Ex-BMDPs present substantial additional difficulties compared to classical block MDPs because we aim to avoid dependence on the size $|\Xi|$ of the exogenous latent state space. Here, the main challenge is that executing policies $\pi$ whose actions depend on $\xi_h$ can lead to spurious correlations between endogenous and exogenous states. In spite of this apparent difficulty, we show that the coverability coefficient for this setting is always bounded in terms of the number of endogenous states and actions.

Proposition 8 (Coverability of Ex-BMDPs). For any Ex-BMDP, we have $C_{\mathrm{cov}} \le |\mathcal{S}||\mathcal{A}|$.

This bound is a consequence of a structural result from Efroni et al. (2021), which shows that for any $(s,a) \in \mathcal{S}\times\mathcal{A}$, all $x \in \mathcal{X}$ with $\phi^\star(x) = s$ admit a common policy that maximizes $d^\pi_h(x,a)$, and this policy is endogenous, i.e., depends only on the endogenous state $s_h = \phi^\star_h(x_h)$. As a corollary, we obtain the following regret bound.

Corollary 9. For the Ex-BMDP setting, under Assumption 1, Algorithm 1 ensures that with probability at least $1-\delta$,
$$\mathrm{Reg} \le O\Big(H\sqrt{|\mathcal{S}||\mathcal{A}|\,T\,\log(TH|\mathcal{F}|/\delta)\log(T)}\Big).$$
Critically, this result scales only with the cardinality $|\mathcal{S}|$ of the endogenous latent state space and with the capacity $\log|\mathcal{F}|$ of the value function class. Let us briefly compare to prior work. For general Ex-BMDPs, existing complexity measures such as Bellman rank and Bellman-Eluder dimension can be arbitrarily large (see the discussion in Section 5).
Existing algorithms either require that the endogenous latent dynamics P endo are deterministic (Efroni et al., 2021) or allow for stochastic dynamics but heavily restrict the observation process (Efroni et al., 2022a) . Corollary 9 is the first result for this setting that allows for stochastic latent dynamics and emission process, albeit with the extra assumption of completeness. This result is best thought of as a "luckiness" guarantee in the sense that it is unclear how to construct a value function class that is complete for every problem instance,foot_6 but the algorithm will succeed whenever F does happen to be complete for a given instance. Understanding whether general Ex-BMDPs are learnable without completeness is an interesting question for future work, and we are hopeful that the perspective of coverability will lead to further insights for this setting.
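To make the factored structure concrete, the following minimal sketch (hypothetical dynamics of our own invention, not from Efroni et al. (2021)) implements an Ex-BMDP-style transition in which only the endogenous state responds to actions:

```python
import random

# A minimal Ex-BMDP-style transition sketch (hypothetical dynamics): the
# endogenous state reacts to the action, while the exogenous state evolves
# on its own. The observation is the (decodable) product of the two.

def step(s, xi, a, rng):
    s_next = (s + a) % 4                       # endogenous: controlled by action
    xi_next = (xi + rng.randint(0, 1)) % 10    # exogenous: action-independent noise
    x_next = (s_next, xi_next)                 # "observation" = decodable pair
    return s_next, xi_next, x_next

def exo_trajectory(actions, seed):
    rng = random.Random(seed)
    s, xi = 0, 0
    traj = []
    for a in actions:
        s, xi, _ = step(s, xi, a, rng)
        traj.append(xi)
    return traj

# The exogenous trajectory is identical under different action sequences when
# the exogenous noise is held fixed (same seed): actions cannot influence it.
print(exo_trajectory([0, 0, 0], seed=1) == exo_trajectory([1, 3, 2], seed=1))  # True
```

The check at the end is the defining property exploited by Proposition 8: no policy can steer the exogenous component, so coverage only needs to account for the endogenous factor.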

C.1 INVARIANCE OF COVERABILITY

Proposition 8 is a consequence of two general invariance properties of coverability, which show that $C_{\mathrm{cov}}$ is unaffected by the following augmentations to the underlying MDP: (i) addition of rich observations, and (ii) addition of exogenous noise. The first property shows that for a given MDP $M$, creating a new block MDP $M'$ by equipping $M$ with a decodable emission process (so that $M$ acts as a latent MDP) does not increase coverability.

Proposition 10 (Invariance to rich observations). Let an MDP $M = (\mathcal{S},\mathcal{A},P,R,H,s_1)$ be given. Let $M' = (\mathcal{X},\mathcal{A},P',R',H,x_1)$ be the MDP defined implicitly by the following process. For each $h \in [H]$:
• $s_{h+1} \sim P_h(s_h,a_h)$ and $r_h = R_h(s_h,a_h)$. Here, $s_h$ is unobserved, and may be thought of as a latent state.
• $x_h \sim q_h(s_h)$, where $q_h : \mathcal{S} \to \Delta(\mathcal{X})$ is an emission distribution with the property that $\mathrm{supp}(q_h(s)) \cap \mathrm{supp}(q_h(s')) = \emptyset$ for $s \ne s'$.
Then, writing $C_{\mathrm{cov}}(M)$ to make the dependence on $M$ explicit, we have $C_{\mathrm{cov}}(M') \le C_{\mathrm{cov}}(M)$.

The second result shows that coverability is also preserved if we expand the state space to include a temporally correlated exogenous state whose evolution does not depend on the agent's actions.

Proposition 11 (Invariance to exogenous noise). Let an MDP $M = (\mathcal{S},\mathcal{A},P,R,H,s_1)$, conditional distribution $P^{\mathrm{exo}} : \Xi \to \Delta(\Xi)$, and $\xi_1 \in \Xi$ be given, where $\Xi$ is an abstract set. Let $\mathcal{X} := \mathcal{S}\times\Xi$, and let $M' = (\mathcal{X},\mathcal{A},P',R',H,x_1)$ be the MDP with state $x_h = (s_h,\xi_h)$ defined implicitly by the following process. For each $h \in [H]$:
• $s_{h+1} \sim P_h(s_h,a_h)$, $r_h = R_h(s_h,a_h)$.
• $\xi_{h+1} \sim P^{\mathrm{exo}}_h(\xi_h)$.
Then we have $C_{\mathrm{cov}}(M') \le C_{\mathrm{cov}}(M)$. This result is non-trivial because policies that act based on both the endogenous state $s_h$ and $\xi_h$ can cause these processes to become coupled (Efroni et al., 2021).

Proof of Proposition 8. Write $z_h := (s_h,\xi_h)$. For each $z = (s,\xi) \in \mathcal{S}\times\Xi$, let $d^\pi_h(z) := \mathbb{P}^\pi(z_h = z)$. Proposition 4 of Efroni et al. (2021) shows that for all $z = (s,\xi)$, if we define $\pi_s = \arg\max_{\pi\in\Pi}\mathbb{P}^\pi(s_h = s)$, then
$$\max_{\pi\in\Pi} d^\pi_h(z) = d^{\pi_s}_h(z). \quad (2)$$
That is, $\pi_s$ maximizes $\mathbb{P}^\pi(z_h = (s,\xi))$ for all $\xi \in \Xi$ simultaneously. With this in mind, let us define $\mu_h(x,a) = \frac{1}{|\mathcal{S}||\mathcal{A}|}\sum_{s\in\mathcal{S}} d^{\pi_s}_h(x)$. We proceed to bound the concentrability coefficient for $\mu$. Fix $\pi \in \Pi$ and $x \in \mathcal{X}$, and let $z = (s,\xi) \in \mathcal{S}\times\Xi$ be the unique latent state such that $x \in \mathrm{supp}(q_h(s,\xi))$. We first observe that
$$\frac{d^\pi_h(x,a)}{\mu_h(x,a)} \le |\mathcal{S}||\mathcal{A}|\cdot\frac{d^\pi_h(x)}{d^{\pi_s}_h(x)}.$$
Next, since $x_h \sim q_h(z_h)$, we have
$$\frac{d^\pi_h(x)}{d^{\pi_s}_h(x)} = \frac{q_h(x\mid z)\,d^\pi_h(z)}{q_h(x\mid z)\,d^{\pi_s}_h(z)} = \frac{d^\pi_h(z)}{d^{\pi_s}_h(z)}.$$
Finally, by Eq. (2), we have
$$\frac{d^\pi_h(z)}{d^{\pi_s}_h(z)} \le \frac{\max_{\pi'} d^{\pi'}_h(z)}{d^{\pi_s}_h(z)} = \frac{d^{\pi_s}_h(z)}{d^{\pi_s}_h(z)} = 1.$$
Since this holds for all $x \in \mathcal{X}$ simultaneously, this choice for $\mu_h$ certifies that $C_{\mathrm{cov}} \le |\mathcal{S}||\mathcal{A}|$.

Proof of Proposition 10. Let $\Pi$ denote the space of all randomized policies acting on the latent state space $\mathcal{S}$, and let $\Pi'$ denote the space of all randomized policies acting on the observed state space $\mathcal{X}$. Let $\mathbb{P}^\pi$ denote the distribution over trajectories in $M$ induced by $\pi \in \Pi$, and let $\mathbb{Q}^{\pi'}$ denote the distribution over trajectories in $M'$ induced by $\pi' \in \Pi'$. Fix $h \in [H]$, and let $\mu_h \in \Delta(\mathcal{S}\times\mathcal{A})$ witness the coverability coefficient for $M$. Define
$$\mu'_h(x,a) = q_h(x \mid \phi^\star_h(x))\,\mu_h(\phi^\star_h(x),a),$$
where $\phi^\star_h : \mathcal{X} \to \mathcal{S}$ is the decoder that maps $x \in \mathcal{X}$ to the unique state $s \in \mathcal{S}$ such that $x \in \mathrm{supp}(q_h(s))$. For any $\pi' \in \Pi'$ and $(x,a) \in \mathcal{X}\times\mathcal{A}$, letting $s = \phi^\star_h(x)$, we have
$$\frac{d^{\pi'}_h(x,a)}{\mu'_h(x,a)} = \frac{q_h(x\mid s)\,\mathbb{Q}^{\pi'}(s_h = s, a_h = a)}{q_h(x\mid s)\,\mu_h(s,a)} = \frac{\mathbb{Q}^{\pi'}(s_h = s, a_h = a)}{\mu_h(s,a)} \le \frac{\max_{\pi'\in\Pi'}\mathbb{Q}^{\pi'}(s_h = s, a_h = a)}{\mu_h(s,a)}.$$
Finally, because the observation process is decodable, we have $\max_{\pi'\in\Pi'}\mathbb{Q}^{\pi'}(s_h = s, a_h = a) = \max_{\pi\in\Pi}\mathbb{P}^\pi(s_h = s, a_h = a)$, and
$$\frac{\max_{\pi\in\Pi}\mathbb{P}^\pi(s_h = s, a_h = a)}{\mu_h(s,a)} \le C_{\mathrm{cov}}(M).$$

Proof of Proposition 11.
Let $\Pi$ denote the space of all randomized policies acting on the latent state space $\mathcal{S}$, and let $\Pi'$ denote the space of all randomized policies acting on the observed state space $\mathcal{X}$. Let $\mathbb{P}^\pi$ denote the distribution over trajectories in $M$ induced by $\pi \in \Pi$, and let $\mathbb{Q}^{\pi'}$ denote the distribution over trajectories in $M'$ induced by $\pi' \in \Pi'$. Fix $h \in [H]$, and let $\mu_h \in \Delta(\mathcal{S}\times\mathcal{A})$ witness the coverability coefficient for $M$. For $x = (s,\xi) \in \mathcal{S}\times\Xi$, let
$$\mu'_h(x,a) = \mathbb{Q}(\xi_h = \xi)\,\mu_h(s,a),$$
where $\mathbb{Q}(\xi_h = \xi)$ is the marginal probability of the event $\xi_h = \xi$ in $M'$, which does not depend on the policy under consideration. For any $\pi' \in \Pi'$ and $(s,\xi,a) \in \mathcal{S}\times\Xi\times\mathcal{A}$, we have
$$\frac{d^{\pi'}_h(x,a)}{\mu'_h(x,a)} = \frac{\mathbb{Q}^{\pi'}(s_h = s, \xi_h = \xi, a_h = a)}{\mathbb{Q}(\xi_h = \xi)\,\mu_h(s,a)} \le \frac{\max_{\pi'\in\Pi'}\mathbb{Q}^{\pi'}(s_h = s, \xi_h = \xi, a_h = a)}{\mathbb{Q}(\xi_h = \xi)\,\mu_h(s,a)}.$$
From Propositions 3 and 4 of Efroni et al. (2021), we have
$$\max_{\pi'\in\Pi'}\mathbb{Q}^{\pi'}(s_h = s, \xi_h = \xi, a_h = a) = \mathbb{Q}(\xi_h = \xi)\cdot\max_{\pi'\in\Pi'}\mathbb{Q}^{\pi'}(s_h = s, a_h = a) = \mathbb{Q}(\xi_h = \xi)\cdot\max_{\pi\in\Pi}\mathbb{P}^\pi(s_h = s, a_h = a),$$
so that
$$\frac{\max_{\pi'\in\Pi'}\mathbb{Q}^{\pi'}(s_h = s, \xi_h = \xi, a_h = a)}{\mathbb{Q}(\xi_h = \xi)\,\mu_h(s,a)} = \frac{\max_{\pi\in\Pi}\mathbb{P}^\pi(s_h = s, a_h = a)}{\mu_h(s,a)} \le C_{\mathrm{cov}}(M).$$

D PROOFS AND ADDITIONAL DETAILS FROM SECTION 3

D.1 GOLF ALGORITHM AND PROOFS FROM SECTION 3

Algorithm 1 GOLF (Jin et al., 2021a)
input: Function class $\mathcal{F}$, confidence width $\beta > 0$.
initialize: $\mathcal{F}^{(0)} \leftarrow \mathcal{F}$, $\mathcal{D}^{(0)}_h \leftarrow \emptyset$ for all $h \in [H]$.
1: for episode $t = 1,2,\ldots,T$ do
2:   Select policy $\pi^{(t)} \leftarrow \pi_{f^{(t)}}$, where $f^{(t)} := \arg\max_{f\in\mathcal{F}^{(t-1)}} f(x_1,\pi_{f,1}(x_1))$.
3:   Execute $\pi^{(t)}$ for one episode and obtain trajectory $(x^{(t)}_1,a^{(t)}_1,r^{(t)}_1),\ldots,(x^{(t)}_H,a^{(t)}_H,r^{(t)}_H)$.
4:   Update dataset: $\mathcal{D}^{(t)}_h \leftarrow \mathcal{D}^{(t-1)}_h \cup \{(x^{(t)}_h,a^{(t)}_h,r^{(t)}_h,x^{(t)}_{h+1})\}$ for all $h \in [H]$.
5:   Compute confidence set:
$$\mathcal{F}^{(t)} \leftarrow \Big\{f \in \mathcal{F} : \mathcal{L}^{(t)}_h(f_h,f_{h+1}) - \min_{f'_h\in\mathcal{F}_h}\mathcal{L}^{(t)}_h(f'_h,f_{h+1}) \le \beta \;\;\forall h\in[H]\Big\},$$
where $\mathcal{L}^{(t)}_h(f,f') := \sum_{(x,a,r,x')\in\mathcal{D}^{(t)}_h}\big(f(x,a) - r - \max_{a'\in\mathcal{A}} f'(x',a')\big)^2$ for all $f,f' \in \mathcal{F}$.
6: Output $\hat\pi = \mathrm{unif}(\pi^{(1:T)})$. // For PAC guarantee only.

Lemma 12 (Jin et al. (2021a, Lemmas 39 and 40)). Suppose Assumption 1 holds. If $\beta > 0$ is selected as in Theorem 1, then with probability at least $1-\delta$, for all $t \in [T]$, Algorithm 1 satisfies:
1. $Q^\star \in \mathcal{F}^{(t)}$.
2. $\sum_{i<t}\mathbb{E}_{(x,a)\sim d^{(i)}_h}\big[(f_h(x,a) - [\mathcal{T}_h f_{h+1}](x,a))^2\big] \le O(\beta)$ for all $f \in \mathcal{F}^{(t)}$ and $h \in [H]$.

Lemma 13 (Jiang et al. (2017, Lemma 1)). For any value function $f = (f_1,\ldots,f_H)$,
$$f_1(x_1,\pi_{f,1}(x_1)) - J(\pi_f) = \sum_{h=1}^H \mathbb{E}_{(x,a)\sim d^{\pi_f}_h}\big[f_h(x,a) - (\mathcal{T}_h f_{h+1})(x,a)\big].$$

Lemma 14 (Equivalence of coverability and cumulative reachability). The following definition is equivalent to Definition 2:
$$C_{\mathrm{cov}} := \max_{h\in[H]}\sum_{(x,a)\in\mathcal{X}\times\mathcal{A}}\sup_{\pi\in\Pi} d^\pi_h(x,a).$$

Proof of Lemma 14. We relate coverability and cumulative reachability for each choice of $h \in [H]$.

Coverability bounds cumulative reachability. It follows immediately from the definition of coverability that if $\mu_h \in \Delta(\mathcal{X}\times\mathcal{A})$ realizes the value of $C_{\mathrm{cov}}$, then
$$\sum_{(x,a)}\max_{\pi\in\Pi} d^\pi_h(x,a) = \sum_{(x,a)}\frac{\max_{\pi\in\Pi} d^\pi_h(x,a)}{\mu_h(x,a)}\,\mu_h(x,a) \le \sum_{(x,a)} C_{\mathrm{cov}}\cdot\mu_h(x,a) = C_{\mathrm{cov}},$$
where the inequality is by Definition 2.

Cumulative reachability bounds coverability. Define $\mu_h(x,a) \propto \max_{\pi\in\Pi} d^\pi_h(x,a)$. Then for any $\pi \in \Pi$ and any $(x,a) \in \mathcal{X}\times\mathcal{A}$, we have
$$\frac{d^\pi_h(x,a)}{\mu_h(x,a)} = \frac{d^\pi_h(x,a)}{\max_{\pi''\in\Pi} d^{\pi''}_h(x,a)\big/\sum_{(x',a')}\max_{\pi'\in\Pi} d^{\pi'}_h(x',a')} \le \sum_{(x',a')}\max_{\pi'\in\Pi} d^{\pi'}_h(x',a').$$
This completes the proof.

Proof of Theorem 1. Equipped with Lemma 14, we prove Theorem 1.

Preliminaries. For each $t$, we define $\delta^{(t)}_h(\cdot,\cdot) := f^{(t)}_h(\cdot,\cdot) - (\mathcal{T}_h f^{(t)}_{h+1})(\cdot,\cdot)$, which may be viewed as a "test function" at layer $h$ induced by $f^{(t)} \in \mathcal{F}$.
We adopt the shorthand $d^{(t)}_h \equiv d^{\pi^{(t)}}_h$, and we define $\bar d^{(t)}_h(x,a) := \sum_{i=1}^{t-1} d^{(i)}_h(x,a)$ and $\mu^\star_h := \arg\min_{\mu_h\in\Delta(\mathcal{X}\times\mathcal{A})}\sup_{\pi\in\Pi}\big\|d^\pi_h/\mu_h\big\|_\infty$. That is, $\bar d^{(t)}_h$ is the unnormalized average of all state visitations encountered prior to episode $t$, and $\mu^\star_h$ is the distribution that attains the value of $C_{\mathrm{cov}}$ for layer $h$.foot_7 Throughout the proof, we perform a slight abuse of notation and write $\mathbb{E}_{\bar d^{(t)}_h}[f] := \sum_{i=1}^{t-1}\mathbb{E}_{d^{(i)}_h}[f]$ for any function $f : \mathcal{X}\times\mathcal{A}\to\mathbb{R}$.

Regret decomposition. As a consequence of completeness (Assumption 1) and the construction of $\mathcal{F}^{(t)}$, a standard concentration argument (Lemma 12) guarantees that with probability at least $1-\delta$, for all $t \in [T]$:
$$\text{(i)}\; Q^\star \in \mathcal{F}^{(t)}, \quad\text{and}\quad \text{(ii)}\; \sum_{x,a}\bar d^{(t)}_h(x,a)\big(\delta^{(t)}_h(x,a)\big)^2 \le O(\beta). \quad (5)$$
We condition on this event going forward. Since $Q^\star \in \mathcal{F}^{(t)}$, we are guaranteed that $f^{(t)}$ is optimistic (i.e., $f^{(t)}_1(x_1,\pi_{f^{(t)},1}(x_1)) \ge Q^\star_1(x_1,\pi_{f^\star,1}(x_1))$), and a regret decomposition for optimistic algorithms (Lemma 13) allows us to relate regret to the average Bellman error under the learner's sequence of policies:
$$\mathrm{Reg} \le \sum_{t=1}^T f^{(t)}_1(x_1,\pi_{f^{(t)},1}(x_1)) - J(\pi^{(t)}) = \sum_{t=1}^T\sum_{h=1}^H \mathbb{E}_{(x,a)\sim d^{(t)}_h}\Big[\underbrace{f^{(t)}_h(x,a) - (\mathcal{T}_h f^{(t)}_{h+1})(x,a)}_{=:\,\delta^{(t)}_h(x,a)}\Big]. \quad (6)$$
To proceed, we use a change-of-measure argument to relate the on-policy average Bellman error $\mathbb{E}_{(x,a)\sim d^{(t)}_h}[\delta^{(t)}_h(x,a)]$ appearing above to the in-sample squared Bellman error $\mathbb{E}_{(x,a)\sim\bar d^{(t)}_h}[(\delta^{(t)}_h(x,a))^2]$; the latter is small as a consequence of Eq. (5). Unfortunately, naive attempts at applying change of measure fail because during the initial rounds of exploration, the on-policy and in-sample visitation probabilities can be very different, making it impossible to relate the two quantities (i.e., any natural notion of extrapolation error will be arbitrarily large).
To address this issue, we introduce the notion of a "burn-in" phase for each state-action pair $(x,a)\in\mathcal{X}\times\mathcal{A}$ by defining
$$\tau_h(x,a) = \min\big\{t \mid \bar d^{(t)}_h(x,a) \ge C_{\mathrm{cov}}\cdot\mu^\star_h(x,a)\big\},$$
which captures the earliest time at which $(x,a)$ has been explored sufficiently; we refer to $t < \tau_h(x,a)$ as the burn-in phase for $(x,a)$. Going forward, let $h\in[H]$ be fixed. We decompose the regret into contributions from the burn-in phase for each state-action pair and contributions from pairs which have been explored sufficiently and reached a "stable phase":
$$\underbrace{\sum_{t=1}^T \mathbb{E}_{(x,a)\sim d^{(t)}_h}\big[\delta^{(t)}_h(x,a)\big]}_{\text{on-policy average Bellman error}} = \underbrace{\sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[\delta^{(t)}_h(x,a)\,\mathbb{1}[t<\tau_h(x,a)]\big]}_{\text{burn-in phase}} + \underbrace{\sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[\delta^{(t)}_h(x,a)\,\mathbb{1}[t\ge\tau_h(x,a)]\big]}_{\text{stable phase}}.$$
We will not show that every state-action pair leaves the burn-in phase. Instead, we use coverability to argue that the contribution from pairs that have not left this phase is small on average. In particular, using that $|\delta^{(t)}_h| \le 1$, we bound
$$\sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[\delta^{(t)}_h(x,a)\,\mathbb{1}[t<\tau_h(x,a)]\big] \le \sum_{x,a}\sum_{t<\tau_h(x,a)} d^{(t)}_h(x,a) = \sum_{x,a}\bar d^{(\tau_h(x,a))}_h(x,a) \le 2C_{\mathrm{cov}}\sum_{x,a}\mu^\star_h(x,a) = 2C_{\mathrm{cov}}, \quad (7)$$
where the last inequality holds by Eq. (4) and the definition of $\tau_h$. For the stable phase, we apply change of measure as follows:
$$\sum_{t=1}^T \mathbb{E}_{d^{(t)}_h}\big[\delta^{(t)}_h(x,a)\,\mathbb{1}[t\ge\tau_h(x,a)]\big] = \sum_{t=1}^T\sum_{x,a}\frac{d^{(t)}_h(x,a)}{(\bar d^{(t)}_h(x,a))^{1/2}}\cdot(\bar d^{(t)}_h(x,a))^{1/2}\,\delta^{(t)}_h(x,a)\,\mathbb{1}[t\ge\tau_h(x,a)]$$
$$\le \underbrace{\sqrt{\sum_{t=1}^T\sum_{x,a}\frac{\mathbb{1}[t\ge\tau_h(x,a)]\,(d^{(t)}_h(x,a))^2}{\bar d^{(t)}_h(x,a)}}}_{\text{(I): extrapolation error}}\;\cdot\;\underbrace{\sqrt{\sum_{t=1}^T\sum_{x,a}\bar d^{(t)}_h(x,a)\,(\delta^{(t)}_h(x,a))^2}}_{\text{(II): in-sample squared Bellman error}},$$
where the last inequality is an application of the Cauchy-Schwarz inequality. Using part (ii) of Eq. (5), we bound the in-sample error above by $\text{(II)} \le O(\sqrt{\beta T})$.
Bounding the extrapolation error using coverability. To proceed, we show that the extrapolation error (I) is controlled by coverability. We begin with a scalar variant of the standard elliptic potential lemma (Lattimore and Szepesvári, 2020); this result is proven in the sequel.

Lemma 15 (Per-state-action elliptic potential lemma). Let $d^{(1)},d^{(2)},\ldots,d^{(T)}$ be an arbitrary sequence of distributions over a set $\mathcal{Z}$ (e.g., $\mathcal{Z}=\mathcal{X}\times\mathcal{A}$), and let $\mu\in\Delta(\mathcal{Z})$ be a distribution such that $d^{(t)}(z)/\mu(z) \le C$ for all $(z,t)\in\mathcal{Z}\times[T]$. Then for all $z\in\mathcal{Z}$, we have
$$\sum_{t=1}^T \frac{d^{(t)}(z)}{\sum_{i<t} d^{(i)}(z) + C\cdot\mu(z)} \le O(\log(T)).$$

We bound the extrapolation error (I) by applying Lemma 15 on a per-state-action basis, then using coverability (and the equivalence to cumulative reachability) to argue that the potentials from different state-action pairs average out. Observe that by the definition of $\tau_h$, for all $t \ge \tau_h(x,a)$ we have
$$\bar d^{(t)}_h(x,a) \ge C_{\mathrm{cov}}\,\mu^\star_h(x,a) \;\implies\; \bar d^{(t)}_h(x,a) \ge \tfrac{1}{2}\big(\bar d^{(t)}_h(x,a) + C_{\mathrm{cov}}\,\mu^\star_h(x,a)\big),$$
which allows us to bound the extrapolation error (I) by
$$\sum_{t=1}^T\sum_{x,a}\frac{\mathbb{1}[t\ge\tau_h(x,a)]\,(d^{(t)}_h(x,a))^2}{\bar d^{(t)}_h(x,a)} \le 2\sum_{t=1}^T\sum_{x,a}\frac{d^{(t)}_h(x,a)\cdot d^{(t)}_h(x,a)}{\bar d^{(t)}_h(x,a) + C_{\mathrm{cov}}\cdot\mu^\star_h(x,a)} \le 2\sum_{t=1}^T\sum_{x,a}\max_{t'\in[T]} d^{(t')}_h(x,a)\cdot\frac{d^{(t)}_h(x,a)}{\bar d^{(t)}_h(x,a) + C_{\mathrm{cov}}\cdot\mu^\star_h(x,a)}$$
$$\le 2\underbrace{\sum_{x,a}\max_{t\in[T]} d^{(t)}_h(x,a)}_{\le\,C_{\mathrm{cov}}\text{ by Lemma 14}}\cdot\underbrace{\max_{(x,a)\in\mathcal{X}\times\mathcal{A}}\sum_{t=1}^T\frac{d^{(t)}_h(x,a)}{\bar d^{(t)}_h(x,a) + C_{\mathrm{cov}}\cdot\mu^\star_h(x,a)}}_{\le\,O(\log(T))\text{ by Lemma 15}} \le O(C_{\mathrm{cov}}\log(T)). \quad (8)$$
To conclude, we substitute Eqs. (7) and (8) into Eq. (6), which gives
$$\mathrm{Reg} \le \sum_{t=1}^T\sum_{h=1}^H \mathbb{E}_{(x,a)\sim d^{(t)}_h}\big[\delta^{(t)}_h(x,a)\big] \le O\Big(H\sqrt{C_{\mathrm{cov}}\cdot\beta T\log(T)}\Big).$$

Proof of Lemma 15. Using the fact that $u \le 2\log(1+u)$ for any $u\in[0,1]$, we have
$$\sum_{t=1}^T \frac{d^{(t)}(z)}{\sum_{i<t} d^{(i)}(z) + C\cdot\mu(z)} \le 2\sum_{t=1}^T \log\bigg(1 + \frac{d^{(t)}(z)}{\sum_{i<t} d^{(i)}(z) + C\cdot\mu(z)}\bigg)$$
(the ratio lies in $[0,1]$ since $d^{(t)}(z)/\mu(z) \le C$ for all $t\in[T]$)
$$= 2\sum_{t=1}^T \log\bigg(\frac{\sum_{i<t+1} d^{(i)}(z) + C\cdot\mu(z)}{\sum_{i<t} d^{(i)}(z) + C\cdot\mu(z)}\bigg) = 2\log\bigg(\frac{\sum_{i=1}^T d^{(i)}(z) + C\cdot\mu(z)}{C\cdot\mu(z)}\bigg) \le 2\log(T+1),$$
where the final inequality again uses $d^{(t)}(z)/\mu(z) \le C$ for all $t\in[T]$. This completes the proof.
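Lemma 15 is easy to sanity-check numerically. The sketch below uses a mixture family of our own choosing that satisfies the density-ratio condition, and verifies the $2\log(T+1)$ bound at every coordinate:

```python
import math
import random

# Numerical check of the per-state-action elliptic potential lemma (Lemma 15):
# for distributions whose density ratio to mu is bounded by C, the potential
# sum at every z stays below 2*log(T + 1).

random.seed(0)
Z, T, C = 5, 200, 4.0
mu = [1.0 / Z] * Z

# Each d^(t) mixes a random point mass with the uniform distribution, so
# d(z)/mu(z) <= 0.5*Z + 0.5 = 3 <= C for every z.
dists = []
for _ in range(T):
    j = random.randrange(Z)
    dists.append([0.5 * (z == j) + 0.5 / Z for z in range(Z)])

bound = 2 * math.log(T + 1)
worst = 0.0
for z in range(Z):
    prefix = 0.0   # running sum_{i<t} d^(i)(z)
    total = 0.0
    for t in range(T):
        total += dists[t][z] / (prefix + C * mu[z])
        prefix += dists[t][z]
    worst = max(worst, total)

print(worst <= bound)  # True: every per-coordinate potential obeys the bound
```

The $C\cdot\mu(z)$ term in the denominator plays the role of the regularizer in the matrix elliptic potential argument; without it, the first term of the sum would blow up.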

E PROOFS AND ADDITIONAL DETAILS FROM SECTION 4 E.1 ADDITIONAL DETAILS: OFFLINE RL

Proposition 16 (Generalized concentrability is sufficient for offline RL). Given access to an offline data distribution $\mu$ satisfying generalized concentrability (Definition 4), if $\mathcal{F}$ satisfies Assumption 1, one can find an $\varepsilon$-optimal policy using $\mathrm{poly}(C_{\mathrm{conc}}(\mu,\mathcal{F}),H,\log|\mathcal{F}|,\varepsilon^{-1})$ samples.

Proof of Proposition 16. Given an offline dataset $D = \{D_h\}_{h=1}^H$ with $n$ samples for each layer $h\in[H]$ drawn from the distribution $\mu_h$, the MSBO algorithm (e.g., Xie and Jiang, 2020) produces a value function $\hat f \in \mathcal{F}$ of the form
$$\hat f \leftarrow \arg\min_{f\in\mathcal{F}}\sum_{h=1}^H\Big(\mathcal{L}_h(f_h,f_{h+1}) - \min_{f'_h\in\mathcal{F}_h}\mathcal{L}_h(f'_h,f_{h+1})\Big), \;\text{ where }\; \mathcal{L}_h(f,f') := \sum_{(x,a,r,x')\in D_h}\big(f(x,a) - r - \max_{a'\in\mathcal{A}} f'(x',a')\big)^2$$
for all $f,f' \in \mathcal{F}$. By adapting the proof of Theorem 5 of Xie and Jiang (2020) (or Lemma 12), one can show that under Assumption 1, with probability at least $1-\delta$, $\hat f$ satisfies
$$\sum_{h=1}^H \mathbb{E}_{(x,a)\sim\mu_h}\big[(\hat f_h(x,a) - (\mathcal{T}_h \hat f_{h+1})(x,a))^2\big] \le H\cdot\sqrt{\frac{\log(|\mathcal{F}|/\delta)}{n}}.$$
The result now follows by applying an adaptation of Xie and Jiang (2020, Corollary 4), which shows that for any $f\in\mathcal{F}$,
$$J(\pi^\star) - J(\pi_f) \le 2\max_{\pi\in\Pi}\sum_{h=1}^H \mathbb{E}_{(x,a)\sim d^\pi_h}\big[|f_h(x,a) - (\mathcal{T}_h f_{h+1})(x,a)|\big]$$
$$\le 2\sqrt{H\,\max_{\pi\in\Pi}\sum_{h=1}^H \mathbb{E}_{(x,a)\sim d^\pi_h}\big[(f_h(x,a) - (\mathcal{T}_h f_{h+1})(x,a))^2\big]}$$
$$\le 2\sqrt{H\,C_{\mathrm{conc}}(\mu,\mathcal{F})\sum_{h=1}^H \mathbb{E}_{(x,a)\sim\mu_h}\big[(f_h(x,a) - (\mathcal{T}_h f_{h+1})(x,a))^2\big]} \quad\text{(by Definition 4)}$$
$$\le 2H\bigg(\frac{C_{\mathrm{conc}}(\mu,\mathcal{F})^2\log(|\mathcal{F}|/\delta)}{n}\bigg)^{1/4}.$$
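For concreteness, the squared Bellman loss $\mathcal{L}_h$ above (used both by MSBO and by GOLF's confidence sets) can be sketched for a finite class as follows; the toy data and function values are our own illustration, not an example from the paper:

```python
# A minimal sketch of the squared Bellman loss L_h used by MSBO (and GOLF's
# confidence sets) on a toy two-layer problem; all data and function values
# here are made up for illustration.

def bellman_loss(f_h, f_next, data):
    # L_h(f_h, f_next) = sum over (x, a, r, x') of
    #   (f_h[x][a] - r - max_a' f_next[x'][a'])^2
    return sum((f_h[x][a] - r - max(f_next[xn])) ** 2 for x, a, r, xn in data)

f_next = {"y": [1.0, 0.0], "z": [0.0, 0.0]}          # layer-(h+1) values
candidates = {
    "consistent":   {"x1": [0.5, 0.0]},              # matches the data below
    "inconsistent": {"x1": [0.0, 0.0]},
}
# Transitions from x1 under action 0: reward 0, successor y or z equally often.
data = [("x1", 0, 0.0, "y"), ("x1", 0, 0.0, "z")]

losses = {name: bellman_loss(f, f_next, data) for name, f in candidates.items()}
print(losses["consistent"] < losses["inconsistent"])  # True
```

The "consistent" candidate predicts the average backed-up value $0.5 = \frac{1}{2}(1+0)$ and therefore incurs only the irreducible transition-noise loss, which is exactly what the $\min_{f'_h}$ subtraction in the MSBO objective is designed to cancel.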

E.2 PROOFS FROM SECTION 4

Proof of Theorem 3. Assume without loss of generality that $H \le \min\{\log_2(X),C\}$; if this does not hold, the result is obtained by applying the argument that follows with $H' = \min\{H,\lfloor\log_2(X)\rfloor,C\}$. We consider a family of deterministic MDPs with horizon $H$. We use a layered state space $\mathcal{X} = \mathcal{X}_1\cup\cdots\cup\mathcal{X}_H$, where only states in $\mathcal{X}_h$ are reachable at layer $h$. The state space is a binary tree of depth $H-1$, which has $\sum_{h=0}^{\log_2(X)-1} 2^h = X-1$ states. There are two actions, left and right, which determine whether the next state is the left or right successor in the tree. For each MDP in the family, a single action at a single leaf at layer $h=H$ has reward $r_H = 1$; all actions in all other states give reward 0. For each such MDP, we use $(x^\star_H,a^\star_H)$ to denote the single state-action pair with $r=1$. We also use $(x^\star_h,a^\star_h)$ for $h\in[H]$ to denote the unique path from $x_1$ to $(x^\star_H,a^\star_H)$. Note that the optimal policy is to follow this path, i.e., $d^{\pi^\star}_h(x,a) = \mathbb{1}[(x,a)=(x^\star_h,a^\star_h)]$. We choose $\mathcal{F}_h$ to be the set of all possible indicator functions for a single state-action pair:
$$\mathcal{F}_h := \{f_h(x',a') = \mathbb{1}(x'=x,\,a'=a) \mid (x,a)\in\mathcal{X}_h\times\mathcal{A}\}.$$
We define $\mathcal{F} = \mathcal{F}_1\times\cdots\times\mathcal{F}_H$. Note that for each $h\in[H]$, $Q^\star_h(x_h,a_h) = \mathbb{1}(x_h=x^\star_h,\,a_h=a^\star_h) \in \mathcal{F}_h$. In addition, we have $\log|\mathcal{F}| \le H\log(2X)$.

Completeness. We first verify that the construction satisfies completeness. Fix $f_h\in\mathcal{F}_h$, and write $f_h(x,a) = \mathbb{1}(x=x_{f,h},\,a=a_{f,h})$ for some $(x_{f,h},a_{f,h})\in\mathcal{X}_h\times\mathcal{A}$. For any $(x_{h-1},a_{h-1})\in\mathcal{X}_{h-1}\times\mathcal{A}$, we consider two cases. First, if $x_{f,h}$ is not the successor of $(x_{h-1},a_{h-1})$, then $(\mathcal{T}_{h-1}f_h)(x_{h-1},a_{h-1}) = 0$. Otherwise,
$$(\mathcal{T}_{h-1}f_h)(x_{h-1},a_{h-1}) = \sum_{x_h} P(x_h\mid x_{h-1},a_{h-1})\max_{a_h} f(x_h,a_h) = P(x_{f,h}\mid x_{h-1},a_{h-1}) = 1,$$
where we have used that $\max_{a_h} f(x_h,a_h) = \mathbb{1}(x_h=x_{f,h})$ and that transitions are deterministic. This means $\mathcal{T}_{h-1}f_h \in \mathcal{F}_{h-1}$, because there exists a single pair $(x_{h-1},a_{h-1})\in\mathcal{X}_{h-1}\times\mathcal{A}$ such that $(\mathcal{T}_{h-1}f_h)(x_{h-1},a_{h-1}) \ne 0$.
Generalized coverability. We now show that the construction satisfies generalized coverability. Fix an MDP in the family with optimal path $\{(x^\star_h,a^\star_h)\}_{h=1}^H$. We will show that for all $f = f_{1:H}\in\mathcal{F}$ with $f_{1:H}\ne Q^\star_{1:H}$, there exists $h'\in[H]$ such that
$$\mathbb{E}_{d^{\pi^\star}_{h'}}\big[(f_{h'}(x_{h'},a_{h'}) - (\mathcal{T}_{h'}f_{h'+1})(x_{h'},a_{h'}))^2\big] = \big(f_{h'}(x^\star_{h'},a^\star_{h'}) - (\mathcal{T}_{h'}f_{h'+1})(x^\star_{h'},a^\star_{h'})\big)^2 = 1. \quad (9)$$
From here, the result will follow by choosing $\mu_h = d^{\pi^\star}_h$ for all $h\in[H]$. Indeed, using the boundedness of $f_{1:H}\in\mathcal{F}$, we have $\sum_{h=1}^H \mathbb{E}_{d^\pi_h}\big[(f_h(x_h,a_h)-(\mathcal{T}_h f_{h+1})(x_h,a_h))^2\big] \le H$ for all $\pi\in\Pi$, meaning that Eq. (9) implies that $C_{\mathrm{conc}}(\mu,\mathcal{F}) \le H \le C$ in this problem instance.

We proceed to prove Eq. (9). Based on the definition of $\mathcal{F}$, we know that $(f_h(x^\star_h,a^\star_h) - (\mathcal{T}_h f_{h+1})(x^\star_h,a^\star_h))^2 \in \{0,1\}$ for all $h\in[H]$. Therefore, if we assume toward a contradiction that $f_{1:H}\ne Q^\star_{1:H}$ and no $h'\in[H]$ satisfies Eq. (9), we must have
$$f_h(x^\star_h,a^\star_h) = (\mathcal{T}_h f_{h+1})(x^\star_h,a^\star_h) \quad \forall h\in[H]. \quad (10)$$
By Eq. (10), we have $(\mathcal{T}_H f_{H+1})(x^\star_H,a^\star_H) = R_H(x^\star_H,a^\star_H) = 1$, which implies that $f_h(x^\star_h,a^\star_h) = 1$ for all $h\in[H]$. From the construction of $\mathcal{F}$, we know $Q^\star_{1:H}$ is the only function with $Q^\star_h(x^\star_h,a^\star_h) = 1$ for all $h\in[H]$, which gives the desired contradiction and proves that such an $h'\in[H]$ must exist, establishing Eq. (9).

Lower bound on sample complexity. A lower bound of $2^{\Omega(H)}$ samples to learn a $0.1$-optimal policy with probability $0.9$ follows from standard lower bounds for binary tree-structured MDPs (Krishnamurthy et al., 2016; Jiang et al., 2017); since there are $2^H/2$ leaves at layer $H$ and only one has non-zero reward, finding a policy with non-trivial regret is no easier than solving a multi-armed bandit problem with $2^H/2$ actions and binary rewards.

F PROOFS AND ADDITIONAL RESULTS FROM SECTION 5 F.1 ADDITIONAL DETAILS: SEQUENTIAL EXTRAPOLATION COEFFICIENT VERSUS BELLMAN-ELUDER DIMENSION

The discussion in Section 5 (in particular, Proposition 4) shows that the Bellman-Eluder dimension and Bellman rank fail to capture coverability as a result of only considering average Bellman error rather than squared Bellman error. In light of this observation, a seemingly reasonable fix is to adapt the Bellman-Eluder dimension to consider squared Bellman error rather than average Bellman error. Consider the following variant.

Definition 9 (Squared Bellman-Eluder dimension). The squared Bellman-Eluder dimension $\dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F},\Pi,\varepsilon,h)$ for layer $h$ is the largest $d\in\mathbb{N}$ such that there exist sequences $\{d^{(1)}_h,d^{(2)}_h,\ldots,d^{(d)}_h\}\subseteq\mathcal{D}^\Pi_h$ and $\{\delta^{(1)}_h,\ldots,\delta^{(d)}_h\}\subseteq\mathcal{F}_h-\mathcal{T}_h\mathcal{F}_{h+1}$ such that for all $t\in[d]$,
$$\big|\mathbb{E}_{d^{(t)}_h}[\delta^{(t)}_h]\big| > \varepsilon^{(t)}, \quad\text{and}\quad \sum_{i=1}^{t-1}\mathbb{E}_{d^{(i)}_h}\big[(\delta^{(t)}_h)^2\big] \le \varepsilon^{(t)}, \quad (11)$$
for some $\varepsilon^{(1)},\ldots,\varepsilon^{(d)} \ge \varepsilon$. We define $\dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F},\Pi,\varepsilon) = \max_{h\in[H]}\dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F},\Pi,\varepsilon,h)$.

This definition is identical to Definition 6, except that the constraint $\sum_{i=1}^{t-1}\big(\mathbb{E}_{d^{(i)}_h}[\delta^{(t)}_h]\big)^2 \le \varepsilon^{(t)}$ in Definition 6 has been replaced by the constraint $\sum_{i=1}^{t-1}\mathbb{E}_{d^{(i)}_h}\big[(\delta^{(t)}_h)^2\big] \le \varepsilon^{(t)}$, which uses squared Bellman error instead of average Bellman error. By adapting the analysis of Jin et al. (2021a), it is possible to show that this definition yields
$$\mathrm{Reg} \le O\Big(H\sqrt{\inf_{\varepsilon>0}\big\{\varepsilon^2 T + \dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F},\Pi,\varepsilon)\big\}\cdot T\log|\mathcal{F}|}\Big).$$
If one could show that $\dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F},\Pi,\varepsilon) \lesssim C_{\mathrm{cov}}\cdot\mathrm{polylog}(\varepsilon^{-1})$, this would recover Theorem 1. Unfortunately, it turns out that in general one can have $\dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F},\Pi,\varepsilon) = \Omega(C_{\mathrm{cov}}/\varepsilon)$, which leads to suboptimal $T^{2/3}$-type regret using the result above. The following result shows that this guarantee cannot be improved without changing the complexity measure under consideration.

Proposition 17. Fix $T\in\mathbb{N}$, and let $\varepsilon_T := T^{-1/3}$. There exist MDP class/policy class/value function class tuples $(\mathcal{M}_1,\Pi_1,\mathcal{F}_1)$ and $(\mathcal{M}_2,\Pi_2,\mathcal{F}_2)$ with the following properties.
1. All MDPs in $\mathcal{M}_1$ (resp. $\mathcal{M}_2$) satisfy Assumption 1 with respect to $\mathcal{F}_1$ (resp. $\mathcal{F}_2$).
In addition, $\log|\mathcal{F}_1| = \log|\mathcal{F}_2| = O(1)$.
2. For all MDPs in $\mathcal{M}_1$, we have $\dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F}_1,\Pi_1,\varepsilon_T) \propto 1/\varepsilon_T$, and any algorithm must have $\mathbb{E}[\mathrm{Reg}] \ge \Omega(T^{2/3})$ for some MDP in the class.
3. For all MDPs in $\mathcal{M}_2$, we also have $\dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F}_2,\Pi_2,\varepsilon_T) \propto 1/\varepsilon_T$, yet $C_{\mathrm{cov}} = O(1)$ and GOLF attains $\mathbb{E}[\mathrm{Reg}] \le O(\sqrt{T})$.

This result shows that there are two classes for which the optimal rate differs polynomially ($\Omega(T^{2/3})$ vs. $O(\sqrt{T})$), yet the squared Bellman-Eluder dimension has the same size, and implies that the squared Bellman-Eluder dimension cannot provide rates better than $\Omega(T^{2/3})$ for classes with low coverability in general. Informally, the reason why the squared Bellman-Eluder dimension fails to capture the optimal rates for the problem instances in Proposition 17 is that the definition in Eq. (11) only checks whether the average Bellman error violates the threshold $\varepsilon$, and does not consider how far the error exceeds the threshold (e.g., $|\mathbb{E}_{d^{(t)}_h}[\delta^{(t)}_h]| > \varepsilon$ and $|\mathbb{E}_{d^{(t)}_h}[\delta^{(t)}_h]| > 1$ are counted the same). In spite of this counterexample, it is possible to show that the Bellman-Eluder dimension with squared Bellman error is always bounded by the Sequential Extrapolation Coefficient up to a $\mathrm{poly}(\varepsilon^{-1})$ factor, and hence can always be bounded by coverability, albeit suboptimally.

Proposition 18. Let $\mathcal{F}$ be a $[0,1]$-valued function class. For all $T\in\mathbb{N}$ and $\varepsilon>0$, we have
$$\min\big\{\dim^{\mathrm{sq}}_{\mathrm{BE}}(\mathcal{F},\Pi,\varepsilon),\,T\big\} \le \frac{\mathrm{SEC}_{\mathrm{RL}}(\mathcal{F},\Pi,T)}{\varepsilon^2}.$$

F.1.1 PROOFS FOR ADDITIONAL DETAILS

Proof of Proposition 17. Let the time horizon $T\in\mathbb{N}$ be fixed. We first construct the class $\mathcal{M}_1$ and verify that it satisfies the properties in the statement of Proposition 17, then do the same for $\mathcal{M}_2$.

Class $\mathcal{M}_1$. We choose $\mathcal{M}_1$ to be a class of bandit problems with $H=1$. Let a parameter $\varepsilon_1\in[0,1/2]$ be fixed, and let $A := \varepsilon_1^{-1}$. We define $\mathcal{M}_1 = \{M^{(1)},\ldots,M^{(A)}\}$, where for each $M^{(i)}$:
• The action space is $\mathcal{A} = \{1,\ldots,A\}$.
• The reward distribution for action $a\in\mathcal{A}$ in state $x_1$ is $\mathrm{Ber}(1/2 + \varepsilon_1\mathbb{1}\{a=i\})$.
For each M^(i), the mean reward function is f^(i)_1(x₁, a) = 1/2 + ε₁·1{a = i}. We define F = {f^(i)}_{i=1}^A and Π = {π_f | f ∈ F}. Note that since H = 1, completeness of F is immediate.

Lower bounding the Bellman-Eluder dimension. Let M^(A) be the underlying instance. We will lower bound the Bellman-Eluder dimension for layer h = 1. Consider the sequences d^(1)_1, ..., d^(A−1)_1, where d^(t)_1 := d^{π_{f^(t)}}_1, and δ^(1)_1, ..., δ^(A−1)_1, where δ^(t)_1 := f^(t)_1 − T₁ f^(t)_2 = f^(t)_1 − f^(A)_1 (recall that we adopt the convention f_{H+1} = 0). Observe that for each t ∈ [A−1], we have

|E_{d^(t)_1}[δ^(t)_1]| = |f^(t)_1(x₁, t) − f^(A)_1(x₁, t)| = ε₁,

yet

Σ_{i<t} E_{d^(i)_1}[(δ^(t)_1)²] = Σ_{i<t} (f^(t)_1(x₁, i) − f^(A)_1(x₁, i))² = 0.

This certifies that dim^sq_BE(F, Π, ε) ≥ A − 1 ≥ ε₁⁻¹/2 for all ε < ε₁.

Lower bounding regret. A standard result (e.g., Lattimore and Szepesvári, 2020) is that for any family of multi-armed bandit instances of the form {M^(1), ..., M^(A)}, where M^(i) has Bernoulli rewards with mean 1/2 + Δ·1{a = i} for Δ ≤ 1/4, any algorithm must have regret

E[Reg] ≥ Ω(1)·min{ΔT, A/Δ}

for some instance. We apply this result with the class M₁, which has Δ = ε₁ and A = ε₁⁻¹, which gives

E[Reg] ≥ Ω(1)·min{ε₁T, 1/ε₁²}.

Choosing ε₁ = ε_T = T^{−1/3} yields E[Reg] ≥ Ω(T^{2/3}) whenever T is greater than an absolute constant.

Class M₂. Let a parameter ε₂ ∈ [0, 1/2] be fixed, and let A := ε₂⁻¹ (we assume without loss of generality that ε₂⁻¹ ∈ ℕ). We define M₂ = {M^(1), ..., M^(A)}, where each MDP M^(i) is defined as follows:

• We have H = 2, and there is a layered state space X = X₁ ∪ X₂, where X₁ = {x₁} and X₂ = {y, z}.
• The action space is A = {1, ..., A}.
• x₁ is the deterministic initial state. Regardless of the action, we transition to z with probability 1 − ε₂ and to y with probability ε₂.
• For each MDP M^(i), all actions have zero reward in states x₁ and z. For state y, action i has reward 1 and all other actions have reward 0.

We let f^(i) denote the optimal Q-function for M^(i), which has:

• f^(i)_1(x₁, ·) = ε₂ and f^(i)_2(z, ·) = 0.
• f^(i)_2(y, a) = 1{a = i}.

We define F = {f^(i)}_{i∈[A]}; it is clear that this class satisfies completeness.
We define Π = {π_f | f ∈ F}; for states where there are multiple optimal actions (i.e., f_h(x, a) = f_h(x, a′)), we take π_{f,h}(x) to be the optimal action with the least index, which implies that π_{f,1}(x₁) = π_{f,2}(z) = 1 for all f ∈ F.

Verifying coverability. We choose μ₁(x, a) = 1{x = x₁, a = 1}. We choose μ₂(z, 1) = 1/2 and μ₂(y, a) = 1/(2A) for all a ∈ A. It is immediate that coverability is satisfied with constant 1 for h = 1. For h = 2, we have that for all π ∈ Π,

d^π_2(z, 1)/μ₂(z, 1) = (1 − ε₂)/(1/2) ≤ 2, and d^π_2(y, a)/μ₂(y, a) ≤ ε₂/μ₂(y, a) ≤ 2Aε₂ ≤ 2.

Hence, we have C_cov ≤ 2; note that this holds for any choice of ε₂.

Lower bounding the Bellman-Eluder dimension. Let M^(A) be the underlying MDP. We will lower bound the Bellman-Eluder dimension for layer h = 2. Consider the sequences d^(1)_2, ..., d^(A−1)_2, where d^(t)_2 := d^{π_{f^(t)}}_2, and δ^(1)_2, ..., δ^(A−1)_2, where δ^(t)_2 := f^(t)_2 − T₂ f^(t)_3 = f^(t)_2 − f^(A)_2 (recall that we adopt the convention f_{H+1} = 0). Observe that for each t ∈ [A−1], we have

|E_{d^(t)_2}[δ^(t)_2]| = ε₂·|f^(t)_2(y, t) − f^(A)_2(y, t)| = ε₂,

yet

Σ_{i<t} E_{d^(i)_2}[(δ^(t)_2)²] = ε₂·Σ_{i<t} (f^(t)_2(y, i) − f^(A)_2(y, i))² = 0.

This certifies that dim^sq_BE(F, Π, ε, 2) ≥ A − 1 ≥ ε₂⁻¹/2 for all ε < ε₂.

Upper bound on regret. To conclude, we set ε₂ = ε_T = T^{−1/3}. With this choice, we have dim^sq_BE(F₂, Π₂, ε_T) ≥ Ω(ε_T⁻¹). Since the construction satisfies completeness (Assumption 1) and has C_cov ≤ 2 and H = 2, Theorem 1 yields

Reg ≤ O(√(T·log(|F|T/δ))) = O(√(T·log(T/(ε₂δ)))) = O(√(T·log(1/δ))).

Proof of Proposition 18. Fix h ∈ [H] and n ∈ ℕ, and consider sequences {d^(1)_h, d^(2)_h, ..., d^(n)_h} and {δ^(1)_h, δ^(2)_h, ..., δ^(n)_h} that witness the value of dim^sq_BE(F, Π, ε, h); that is, for all t ∈ [n], |E_{d^(t)_h}[δ^(t)_h]| > ε^(t) and Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²] ≤ (ε^(t))², with ε^(t) ≥ ε. Since each summand below exceeds 1, we have

dim^sq_BE(F, Π, ε, h) ≤ Σ_{t=1}^n E_{d^(t)_h}[δ^(t)_h]²/(ε^(t))²
≤ Σ_{t=1}^n ((1 + (ε^(t))²)/(ε^(t))²)·E_{d^(t)_h}[δ^(t)_h]²/(1 + Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²])   (by Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²] ≤ (ε^(t))²)
≤ Σ_{t=1}^n (2/(ε^(t))²)·E_{d^(t)_h}[δ^(t)_h]²/(1 + Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²])   (by ε^(t) < |E_{d^(t)_h}[δ^(t)_h]| ≤ 1)
≤ (1/ε²)·Σ_{t=1}^n E_{d^(t)_h}[δ^(t)_h]²/(1 ∨ Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²])   (by ε^(t) ≥ ε)
≤ SEC_RL(F, Π, n)/ε².

This implies that for any T > 0, min{dim^sq_BE(F, Π, ε), T} ≤ SEC_RL(F, Π, T)/ε².

F.2 PROOFS FROM SECTION 5

Proof of Proposition 4. We present a counterexample for both Q-type and V-type Bellman-Eluder dimension.
V-type Bellman-Eluder dimension. We recall that the V-type Bellman-Eluder dimension is defined by replacing F_h − T_h F_{h+1} with V_{F_h − T_h F_{h+1}} and D^Π_h with D^Π_{h,x} in Definition 6, where V_{F_h − T_h F_{h+1}} := {(f_h − T_h f_{h+1})(·, π_{f,h}) : f ∈ F} ⊂ (X → ℝ) and D^Π_{h,x} := {d^π_h(·) : π ∈ Π} ⊂ Δ(X). The construction of Efroni et al. (2022a) has the following properties:

1. For all f ∈ F with f ≠ Q⋆, π_f is 1/8-suboptimal.
2. For all f, f′ ∈ F∖{Q⋆}, we have (note that H = 2)

E_{x∼d^{π_{f′}}_2, a∼π_{f,2}}[f₂(x, a) − R₂(x, a)] = (1/2)·1{f = f′}.

This means that if we take {f^(1), f^(2), ..., f^(d−1)} to be any ordering of the set of functions in F∖{Q⋆}, and then set δ^(t)_h := f^(t)_h − T_h f^(t)_{h+1} and d^(t)_h := d^{π_{f^(t)}}_h, we have that for all t ∈ [d−1],

E_{x∼d^(t)_2, a∼π_{f^(t)},2}[δ^(t)_2] = 1/2, and Σ_{i<t} (E_{x∼d^(i)_2, a∼π_{f^(t)},2}[δ^(t)_2])² = 0.

This implies that the V-type Bellman-Eluder dimension dim_{BE-v}(F, Π_F, ε) is at least d − 1 for all ε ≤ 1/2. It is straightforward to verify that the construction in Efroni et al. (2022a) satisfies Assumption 1 (completeness), because functions in the class have f₁ = T₁ f₂ (that is, zero Bellman error at h = 1). As a result, since H = 2, completeness for this construction is implied by Q⋆ ∈ F.

Q-type Bellman-Eluder dimension. The construction above immediately extends to the Q-type. This is because in the construction, the value of R₂(x, ·) and f₂(x, ·) depends only on x (i.e., is independent of the action) for all f ∈ F (cf. Efroni et al., 2022a, Proposition B.1). Therefore, for any f, g ∈ F, we have

E_{x∼d^{π_f}_2, a∼π_{g,2}}[g₂(x, a) − R₂(x, a)] = E_{(x,a)∼d^{π_f}_2}[g₂(x, a) − R₂(x, a)].

This implies that the Q-type Bellman residual matrix [E_{(x,a)∼d^{π_{f′}}_2}[f₂(x, a) − R₂(x, a)]]_{f,f′∈F∖{Q⋆}} embeds the scaled identity matrix and, via the same argument as for the V-type above, immediately implies that dim_BE(F, Π, ε) ≥ d − 1 for all ε ≤ 1/2. As before, we have C_cov ≤ 6, and F is complete.

Proof of Proposition 19. We now show that OLIVE, a canonical average-Bellman-error-based hypothesis elimination algorithm, also suffers from the lower bound in the construction from Proposition 4. By Eq. (12) (V-type OLIVE) and Eq. (13) (Q-type OLIVE), we know that any suboptimal hypothesis f ∈ F∖{Q⋆} cannot be eliminated until π_f is executed. On the other hand, the construction ensures E[max_a f(s₁, a)] = 7/8 whereas J(π⋆) = 3/4. This means OLIVE will enumerate over F∖{Q⋆} before finding a 0.1-optimal policy for this instance, and hence suffers from complexity Ω(d) (recall that |F| = d).

Proof of Theorem 5. As in Theorem 1, as a consequence of completeness (Assumption 1), the construction of F^(t), and Lemma 12, we have that with probability at least 1 − δ, for all t ∈ [T]: (i) Q⋆ ∈ F^(t), and (ii) for all h ∈ [H], Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²] ≤ O(β), where δ^(t)_h(x, a) := f^(t)_h(x, a) − (T_h f^(t)_{h+1})(x, a). Whenever this event holds,

Reg ≤ Σ_{t=1}^T [f^(t)_1(x₁, π_{f^(t),1}(x₁)) − J(π^(t))] = Σ_{t=1}^T Σ_{h=1}^H E_{(x,a)∼d^(t)_h}[f^(t)_h(x, a) − (T_h f^(t)_{h+1})(x, a)].
To proceed, we have that for all h ∈ [H],

Σ_{t=1}^T E_{d^(t)_h}[δ^(t)_h]
= Σ_{t=1}^T E_{d^(t)_h}[δ^(t)_h]·((1 ∨ Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²])/(1 ∨ Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²]))^{1/2}
≤ √(Σ_{t=1}^T E_{d^(t)_h}[δ^(t)_h]²/(1 ∨ Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²]))·√(Σ_{t=1}^T (1 ∨ Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²]))   (by the Cauchy-Schwarz inequality)
≤ √(Σ_{t=1}^T E_{d^(t)_h}[δ^(t)_h]²/(1 ∨ Σ_{i<t} E_{d^(i)_h}[(δ^(t)_h)²]))·√(βT)
≤ √(SEC_RL(F, Π, T)·βT).   (by Definition 13)

Therefore, we obtain Reg ≤ H·√(SEC_RL(F, Π, T)·βT). Plugging in the choice for β completes the proof.

Proof of Proposition 6. We prove a more general result. Consider a set of distributions D ⊂ Δ(Z) and a set of test functions Ψ ⊂ (Z → [0,1]). We define a generalized form of coverability with respect to D by

C_cov(D) := inf_{μ∈Δ(Z)} sup_{d∈D} ‖d/μ‖_∞.

We will show that, for any T > 0, SEC(Ψ, D, T) ≲ C_cov(D)·log(T), which implies Proposition 6. Going forward, we fix an arbitrary sequence {d^(1), d^(2), ..., d^(T)} ⊂ D as well as an arbitrary sequence {ψ^(1), ψ^(2), ..., ψ^(T)} ⊂ Ψ. Following Eq. (4), we define μ⋆ := argmin_{μ∈Δ(Z)} sup_{d∈D} ‖d/μ‖_∞. In addition, define d̄^(t) := Σ_{i<t} d^(i). For each z ∈ Z, let

τ(z) := min{t : Σ_{i=1}^{t−1} d^(i)(z) ≥ C_cov(D)·μ⋆(z)}.

We decompose E_{d^(t)}[ψ^(t)] as

E_{d^(t)}[ψ^(t)] = E_{d^(t)}[ψ^(t)(z)·1[t < τ(z)]] + E_{d^(t)}[ψ^(t)(z)·1[t ≥ τ(z)]].

Then,

Σ_{t=1}^T E_{d^(t)}[ψ^(t)]²/(1 ∨ Σ_{i<t} E_{d^(i)}[(ψ^(t))²])
≲ Σ_{t=1}^T E_{d^(t)}[ψ^(t)(z)·1[t < τ(z)]]²/(1 ∨ Σ_{i<t} E_{d^(i)}[(ψ^(t))²])   (I)
+ Σ_{t=1}^T E_{d^(t)}[ψ^(t)(z)·1[t ≥ τ(z)]]²/(1 ∨ Σ_{i<t} E_{d^(i)}[(ψ^(t))²]),   (II)

where we use a ≲ b as shorthand for a ≤ O(b). We first bound term (I):

(I) ≤ Σ_{t=1}^T E_{d^(t)}[ψ^(t)(z)·1[t < τ(z)]]²
≤ Σ_{t=1}^T E_{d^(t)}[1[t < τ(z)]]²   (by ψ(·) ∈ [0,1] for all ψ ∈ Ψ)
≤ Σ_{t=1}^T E_{d^(t)}[1[t < τ(z)]]   (by E_{d^(t)}[1[t < τ(z)]] ≤ 1)
= Σ_{z∈Z} Σ_{t=1}^T d^(t)(z)·1[t < τ(z)]
= Σ_{z∈Z} [d̄^(τ(z)−1)(z) + d^(τ(z)−1)(z)]
≤(a) Σ_{z∈Z} 2·C_cov(D)·μ⋆(z) = 2·C_cov(D),

where (a) follows because d̄^(τ(z)−1)(z), d^(τ(z)−1)(z) ≤ C_cov(D)·μ⋆(z) for all z ∈ Z, as a consequence of Eqs. (15) and (16).
We now turn to term (II). First, observe that

Σ_{z∈Z} 1[t ≥ τ(z)]·d^(t)(z)·ψ^(t)(z)
= Σ_{z∈Z} 1[t ≥ τ(z)]·(d^(t)(z)/(Σ_{i<t} d^(i)(z))^{1/2})·(Σ_{i<t} d^(i)(z))^{1/2}·ψ^(t)(z)
≤ √(Σ_{z∈Z} 1[t ≥ τ(z)]·(d^(t)(z))²/Σ_{i<t} d^(i)(z))·√(Σ_{i<t} E_{d^(i)}[(ψ^(t))²]).   (by the Cauchy-Schwarz inequality)

By rearranging this inequality, we have

(II) ≤ Σ_{t=1}^T Σ_{z∈Z} 1[t ≥ τ(z)]·(d^(t)(z))²/Σ_{i<t} d^(i)(z)   (defining 0/0 = 0)
≤ 2·Σ_{t=1}^T Σ_{z∈Z} 1[t ≥ τ(z)]·(d^(t)(z))²/(C_cov(D)·μ⋆(z) + Σ_{i<t} d^(i)(z))   (by Eq. (16))
≲ Σ_{t=1}^T Σ_{z∈Z} (d^(t)(z))²/(C_cov(D)·μ⋆(z) + Σ_{i<t} d^(i)(z))
≤ Σ_{t=1}^T Σ_{z∈Z} (max_{i≤T} d^(i)(z))·d^(t)(z)/(Σ_{i<t} d^(i)(z) + C_cov(D)·μ⋆(z))
≤ C_cov(D)·Σ_{z∈Z} μ⋆(z)·Σ_{t=1}^T d^(t)(z)/(Σ_{i<t} d^(i)(z) + C_cov(D)·μ⋆(z))   (by Eq. (14))
≲ C_cov(D)·Σ_{z∈Z} μ⋆(z)·log(T)   (by Lemma 15)
= C_cov(D)·log(T).

Substituting Eqs. (18) and (19) into Eq. (17), we obtain SEC(Ψ, D, T) ≲ C_cov(D)·log(T).

Proof of Proposition 7. This proof establishes a slightly more general result. Let a set of distributions D ⊂ Δ(Z) and a set of test functions Ψ ⊂ (Z → [0,1]) be given. We consider an abstract version of the Bellman-Eluder dimension with respect to D and Ψ: dim_BE(Ψ, D, ε) is the largest d ∈ ℕ such that there exist sequences {d^(1), d^(2), ..., d^(d)} ⊂ D and {ψ^(1), ψ^(2), ..., ψ^(d)} ⊂ Ψ such that for all t ∈ [d],

|E_{d^(t)}[ψ^(t)]| > ε^(t), and Σ_{i<t} (E_{d^(i)}[ψ^(t)])² ≤ (ε^(t))²,   (20)

for ε^(1), ..., ε^(d) ≥ ε. We will show that, for all T ∈ ℕ,

SEC(Ψ, D, T) ≲ inf_{ε>0} {ε²T + dim_BE(Ψ, D, ε)}·log(T),

which immediately implies Proposition 7.

A generalized definition of ε-dependent sequence. In what follows, we rely on a slightly different notion of an ε-(in)dependent sequence from the one given in Jin et al. (2021a, Definition 6) and Russo and Van Roy (2013). We provide background on both definitions below.

ε-(in)dependent sequence (e.g., Jin et al., 2021a, Definition 6).
A distribution ν ∈ D is ε-dependent on a sequence {ν^(1), ..., ν^(k)} ⊆ D if: whenever |E_ν[ψ]| > ε for some ψ ∈ Ψ, we also have Σ_{i=1}^k (E_{ν^(i)}[ψ])² > ε². Otherwise, ν is ε-independent of the sequence.

Generalized ε-(in)dependent sequence. A distribution ν ∈ D is (generalized) ε-dependent on a sequence {ν^(1), ..., ν^(k)} ⊆ D if: for all ε′ ≥ ε, whenever |E_ν[ψ]| > ε′ for some ψ ∈ Ψ, we also have Σ_{i=1}^k (E_{ν^(i)}[ψ])² > (ε′)². We say that ν is (generalized) ε-independent if this does not hold, i.e., for some ε′ ≥ ε and some ψ ∈ Ψ, we have |E_ν[ψ]| > ε′ but Σ_{i=1}^k (E_{ν^(i)}[ψ])² ≤ (ε′)².

The generalized definition above naturally induces a new implication (which the original definition may not have): if ε′ ≥ ε, then

ε-dependent ⇒ ε′-dependent, or in other words, ε′-independent ⇒ ε-independent.

The definition of the distributional Eluder dimension (see Eq. (20)) can be written in two equivalent ways using the original and generalized definitions of an ε-independent sequence: dim_BE(Ψ, D, ε) is the largest d ∈ ℕ such that there exists a sequence {d^(1), d^(2), ..., d^(d)} ⊂ D such that for all t ∈ [d]:

(i) d^(t) is ε′-independent of {d^(1), d^(2), ..., d^(t−1)} for some ε′ ≥ ε
⇐= [by the implication above] =⇒
(ii) d^(t) is (generalized) ε-independent of {d^(1), d^(2), ..., d^(t−1)}.

This indicates that the distributional Eluder dimension can be equivalently written in terms of generalized independent sequences. Going forward, we only use the generalized ε-(in)dependent definition, and omit the word "generalized".

Setup. Let us use dim_BE(ε) as shorthand for dim_BE(Ψ, D, ε). By Eq.
(20), we know that dim_BE(ε) also upper bounds the length of sequences {d^(1), d^(2), ..., d^(d)} ⊂ D and {ψ^(1), ψ^(2), ..., ψ^(d)} ⊂ Ψ such that for all t ∈ [d],

|E_{d^(t)}[ψ^(t)]| > ε^(t), and Σ_{i<t} E_{d^(i)}[(ψ^(t))²] ≤ (ε^(t))²,

for ε^(1), ..., ε^(d) ≥ ε (note that the square is inside the expectation, which differs from Eq. (20)). Now, for any {d^(1), d^(2), ..., d^(T)} ⊂ D and {ψ^(1), ψ^(2), ..., ψ^(T)} ⊂ Ψ, we define β^(t) := Σ_{i<t} E_{d^(i)}[(ψ^(t))²]. We will study the sequence

E_{d^(1)}[ψ^(1)]²/(1∨β^(1)), E_{d^(2)}[ψ^(2)]²/(1∨β^(2)), ..., E_{d^(T)}[ψ^(T)]²/(1∨β^(T)).   (21)

Fix a parameter α > 0, whose value will be specified later. For the remainder of the proof, for each t ∈ [T], we use L^(t) to denote the number of disjoint (α·√(1∨β^(t)))-dependent subsequences of d^(t) in {d^(1), d^(2), ..., d^(t−1)}.

Step 1. Suppose the t-th term of Eq. (21) is greater than α², so that |E_{d^(t)}[ψ^(t)]| > α·√(1∨β^(t)). From the definition of L^(t), we know there are at least L^(t) disjoint subsequences of {d^(1), ..., d^(t−1)} (denoted by S^(1), ..., S^(L^(t))) such that

Σ_{i=1}^{L^(t)} Σ_{ν∈S^(i)} (E_ν[ψ^(t)])² ≥ L^(t)·(1∨β^(t))·α².   (22)

On the other hand, by the definition of β^(t), we have

Σ_{i=1}^{L^(t)} Σ_{ν∈S^(i)} (E_ν[ψ^(t)])² ≤ Σ_{i<t} E_{d^(i)}[(ψ^(t))²] = β^(t).   (23)

Therefore, combining Eqs. (22) and (23), we obtain that if |E_{d^(t)}[ψ^(t)]| > α·√(1∨β^(t)) for some t ∈ [T], then

β^(t) ≥ L^(t)·(1∨β^(t))·α² ⟹ L^(t) ≤ 1/α².   (24)

Step 2. On the other hand, let {i₁, i₂, ..., i_κ} be the longest subsequence of [T] such that

E_{d^(i_j)}[ψ^(i_j)]²/(1∨β^(i_j)) > α², ∀j ∈ [κ].

For compactness, we use {ν^(1), ν^(2), ..., ν^(κ)} to abbreviate {d^(i₁), d^(i₂), ..., d^(i_κ)}. We now argue that there exists j⋆ ∈ [κ] such that for ν^(j⋆), there exist at least L⋆ ≥ ⌊κ/(dim_BE(α)+1)⌋ ≥ κ/(dim_BE(α)+1) − 1 α-dependent disjoint subsequences in {ν^(1), ν^(2), ..., ν^(j⋆−1)} (the actual number of disjoint subsequences is denoted by L⋆).
This is because we can construct such disjoint subsequences by the following procedure:

⟨1⟩ For j ∈ [L⋆], set S^(j) ← {ν^(j)}. Then set j ← L⋆ + 1.
⟨2⟩ If ν^(j) is α-dependent on each of S^(1), ..., S^(L⋆), terminate the procedure (goal achieved).
⟨3⟩ Otherwise, we know ν^(j) is α-independent of at least one of S^(1), ..., S^(L⋆) (denoted by S⋆). Update S⋆ ← S⋆ ∪ {ν^(j)}, set j ← j + 1, and go to ⟨2⟩.

From the definition of dim_BE(α), we know that if |S^(i)| ≥ dim_BE(α) + 1, then any ν ∈ D must be α-dependent on S^(i) (for each i ∈ [L⋆]). Therefore, the procedure must terminate before or at j^(max) = L⋆·dim_BE(α) + L⋆. Thus, if j^(max) ≤ κ, termination in ⟨2⟩ must happen. This only requires L⋆ to satisfy

L⋆·dim_BE(α) + L⋆ ≤ κ ⟺ L⋆ ≤ κ/(dim_BE(α)+1).

That is, as long as L⋆ ≤ κ/(dim_BE(α)+1), the termination in ⟨2⟩ must happen for some j⋆ ≤ κ.   (25)

Step 3. As we discussed at the beginning, α-dependence implies α′-dependence for all α′ ≥ α. This means that the L⋆ in Step 2 lower bounds max_{t∈[T]} L^(t) in Step 1, because {d^(i₁), d^(i₂), ..., d^(i_κ)} is a subset of {d^(1), d^(2), ..., d^(i_κ)}. Thus, combining Eqs. (24) and (25), we obtain

1/α² ≥ max_{t∈[T]} L^(t) ≥ L⋆ ≥ κ/(dim_BE(α)+1) − 1.

This implies that

κ ≤ (1 + 1/α²)·(dim_BE(α)+1) ≤ 3·dim_BE(α)/α² + 1.   (supposing α ≤ 1)

As a consequence, for any ε ∈ (0,1], by setting α = √ε,

Σ_{t=1}^T 1{E_{d^(t)}[ψ^(t)]²/(1∨β^(t)) > ε} ≤ 3·dim_BE(√ε)/ε + 1.   (26)

Step 4. Let e^(1) ≥ e^(2) ≥ ··· ≥ e^(T) denote the sequence in Eq. (21) reordered in decreasing order. For any parameter w ∈ (0,1] to be specified later, we have

Σ_{t=1}^T E_{d^(t)}[ψ^(t)]²/(1 ∨ Σ_{i<t} E_{d^(i)}[(ψ^(t))²]) = Σ_{t=1}^T e^(t) ≤ Tw + Σ_{t=1}^T e^(t)·1(e^(t) > w).

Observe that for any t ∈ [T] such that e^(t) > w, if 2η ≥ e^(t) > η ≥ w, we have

t ≤ Σ_{i=1}^T 1(e^(i) > η) ≤ (3/η)·dim_BE(√η) + 1   (by Eq. (26))
≤ (3/η)·dim_BE(√w) + 1
⟹ η ≤ 3d/(t−1)   (defining d := dim_BE(√w))
⟹ e^(t) ≤ min{6d/(t−1), 1}.   (since 2η ≥ e^(t) > η)
Therefore,

Σ_{t=1}^T e^(t)·1(e^(t) > w) ≤ d + Σ_{t=d+1}^T 6d/(t−1) ≤ d + 6d·log(T)
⟹ Σ_{t=1}^T e^(t) ≤ Tw + dim_BE(√w) + 6·dim_BE(√w)·log(T).

Selecting w = ε² implies SEC(T) ≲ inf_{ε>0} {ε²T + dim_BE(ε)}·log(T). This completes the proof.

F.3 DISCUSSION: RELATIONSHIP TO ADDITIONAL COMPLEXITY MEASURES

F.3.1 SEQUENTIAL EXTRAPOLATION COEFFICIENT: Q-TYPE VERSUS V-TYPE

The Sequential Extrapolation Coefficient, as defined in Definition 8, can be thought of as a generalization of the Q-type Bellman-Eluder dimension (Jin et al., 2021a). In this section we sketch how one can adapt the Sequential Extrapolation Coefficient so as to generalize the V-type Bellman-Eluder dimension instead. Note that the V-type Bellman-Eluder dimension subsumes the original notion of Bellman rank from Jiang et al. (2017). We define the V-type Sequential Extrapolation Coefficient for RL as follows.

Definition 10 (Sequential Extrapolation Coefficient for RL, V-type). For each h ∈ [H], let D^Π_{h,x} := {d^π_h(·) : π ∈ Π} ⊂ Δ(X) and V_{F_h − T_h F_{h+1}} := {(f_h − T_h f_{h+1})(·, π_{f,h}) : f ∈ F} ⊂ (X → ℝ). Then we define

SEC_{RL-v}(F, Π, T) := max_{h∈[H]} SEC(V_{F_h − T_h F_{h+1}}, D^Π_{h,x}, T).

We recall that the V-type Bellman-Eluder dimension dim_{BE-v}(F, Π, ε) is defined analogously, by replacing F_h − T_h F_{h+1} with V_{F_h − T_h F_{h+1}} and D^Π_h with D^Π_{h,x} in Definition 6. Lastly, we give a V-type generalization of Definition 2 (i.e., coverability w.r.t. states only) for a policy class Π as follows:

C_{cov-v} := inf_{μ₁,...,μ_H ∈ Δ(X)} sup_{π∈Π, h∈[H]} ‖d^π_h/μ_h‖_∞.   (27)

As a simple implication, we have C_{cov-v} ≤ C_cov ≤ C_{cov-v}·|A|. Note that the V-type variants of the sequential extrapolation coefficient, Bellman-Eluder dimension, and coverability differ from their Q-type counterparts only in the choices for the distribution and test function sets. Since our proofs for Propositions 6 and 7 hold for arbitrary distribution and test function sets, we immediately obtain the following V-type extensions of Propositions 6 and 7.
Proposition 20 (Coverability ⟹ SEC, V-type). Let C_{cov-v} be the V-type coverability coefficient (Eq. (27)) with policy class Π. Then for any value function class F,

SEC_{RL-v}(F, Π, T) ≤ O(C_{cov-v}·log(T)).

Proposition 21 (Bellman-Eluder dimension ⟹ SEC, V-type). Let dim_{BE-v}(F, Π, ε) be the V-type Bellman-Eluder dimension with function class F and policy class Π. Then

SEC_{RL-v}(F, Π, T) ≤ O(inf_{ε>0} {ε²T + dim_{BE-v}(F, Π, ε)}·log(T)).

As shown in Jin et al. (2021a), GOLF (Algorithm 1) can be extended to the V-type setting by simply replacing Line 3 in Algorithm 1 with sampling (s_h, a_h, r_h, s_{h+1}) ∼ d^(t)_h × π_unif (that is, s_h ∼ d^(t)_h and a_h ∼ unif(A)) for each h ∈ [H]. By slightly modifying the proof of Theorem 5, one can obtain similar sample complexity guarantees based on the V-type Sequential Extrapolation Coefficient. We omit the details, since the only differences are 1) a V-type analogue of Lemma 12 (provided by Jin et al., 2021a, Lemma 44); and 2) trivially upper bounding the quantity E_{d^(i)_h × π^(t)_h}[(δ^(t)_h)²] (used in SEC_{RL-v}) by |A|·E_{d^(i)_h × π_unif}[(δ^(t)_h)²] (which is controlled by the in-sample error). Note, however, that due to the uniform exploration, this algorithm leads to a sample complexity guarantee of the form

J(π⋆) − J(π̂) ≤ O(H·√(SEC_{RL-v}(F, Π, T)·|A|·log(TH|F|/δ)/T)),

but not a regret bound.
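The relation C_{cov-v} ≤ C_cov ≤ C_{cov-v}·|A| can be checked numerically. The sketch below (hypothetical occupancy measures for a toy layer with three states and two actions; not from the paper) uses the easily verified fact that, for a finite family of distributions D, the infimum over μ is attained at μ⋆ ∝ max_{d∈D} d, so the coverability coefficient equals ‖max_{d∈D} d‖₁:

```python
import numpy as np

# Toy example (hypothetical, for illustration only): 3 states, 2 actions,
# and a small set of policies' occupancy measures d^pi_h(x, a) at a fixed layer h.
# Axes: (policy, state, action); each d^pi sums to 1.
D = np.array([
    [[0.5, 0.1], [0.2, 0.1], [0.05, 0.05]],
    [[0.1, 0.5], [0.1, 0.2], [0.05, 0.05]],
    [[0.2, 0.2], [0.3, 0.2], [0.05, 0.05]],
])
assert np.allclose(D.sum(axis=(1, 2)), 1.0)

# For a finite family, inf_mu sup_d ||d/mu||_inf is attained at mu* proportional
# to max_d d, giving C_cov = || max_d d ||_1: for any mu summing to 1, there is a
# z with mu(z) <= max_d d(z) / || max_d d ||_1, so the sup ratio is at least the norm.
C_cov = D.max(axis=0).sum()                 # Q-type: max over policies, per (x, a)
C_cov_v = D.sum(axis=2).max(axis=0).sum()   # V-type: marginalize actions first

A = D.shape[2]
print(C_cov_v, C_cov, C_cov_v * A)
assert C_cov_v <= C_cov <= C_cov_v * A      # the chain from Eq. (27)'s discussion
```

Here the chain holds because max over policies commutes with the action sum in one direction only; the factor |A| is tight when the per-action maxima are attained by different policies.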

F.3.2 CONNECTION TO BILINEAR CLASSES

The Bilinear class framework (Du et al., 2021) generalizes the notion of Bellman rank (Jiang et al., 2017), and captures various additional structural conditions via a class of discrepancy functions. In this section we sketch how one can generalize the sequential extrapolation coefficient in a similar fashion.

Definition 11 (Generalized Sequential Extrapolation Coefficient). Let Z be an abstract set, let Ψ ⊂ (Z → ℝ) be a function class, let D_Ψ := {d_ψ : ψ ∈ Ψ} ⊂ Δ(Z) and P_Ψ := {p_ψ : ψ ∈ Ψ} ⊂ Δ(Z) be corresponding distribution classes, and let L_Ψ := {ℓ_ψ : ψ ∈ Ψ} ⊂ (Z → ℝ) be a corresponding discrepancy function class. We define

SEC_gen(Ψ, D_Ψ, P_Ψ, L_Ψ, T) := sup_{ψ^(1),...,ψ^(T)∈Ψ} Σ_{t=1}^T E_{d_{ψ^(t)}}[ψ^(t)]²/(1 ∨ Σ_{i<t} E_{p_{ψ^(i)}}[ℓ²_{ψ^(t)}]).   (28)

To apply the generalized SEC to reinforcement learning, one can set (for each layer h) Ψ = F_h − T_h F_{h+1}, D_Ψ = {d^π_h(·,·) : π ∈ Π_F}, and P_Ψ = {(d^π_h × π^{est,ψ}_h)(·,·) : π ∈ Π_F}, where (d × π)(x, a) := d(x)·π(a | x) (for any d ∈ Δ(X), π ∈ (X → Δ(A)), and (x, a) ∈ X × A), and π^{est,ψ}_h denotes an estimation policy depending on ψ_h (e.g., the greedy policy w.r.t. ψ_h, or the uniformly random policy over A). The discrepancy function class L_Ψ can be selected according to the original Bilinear rank so as to cover various structural conditions, and setting L_Ψ = Ψ recovers the original SEC. By combining GOLF and Theorem 5 with the approach from Du et al. (2021), one can provide sample complexity guarantees that scale with the Gen-SEC. We omit the details, but the basic idea is to form the confidence set using the discrepancy function class L_Ψ rather than working with squared Bellman error.

Bounding the generalized SEC by bilinear rank. In what follows, we show that the abstract version of the generalized SEC in Eq. (28) can be bounded by an abstract generalization of the notion of Bilinear rank from Du et al. (2021).

Definition 12 (Bilinear rank, finite dimension (Du et al., 2021)). Let Z be an abstract set. Let Ψ ⊂ (Z → ℝ) be a function class, let D_Ψ := {d_ψ : ψ ∈ Ψ} ⊂ Δ(Z) and P_Ψ := {p_ψ : ψ ∈ Ψ} ⊂ Δ(Z) be two corresponding distribution classes, and let L_Ψ := {ℓ_ψ : ψ ∈ Ψ} ⊂ (Z → ℝ) be a corresponding discrepancy function class.
The class Ψ is said to have Bilinear rank d if there exist ψ⋆ ∈ Ψ and maps X, W : Ψ → ℝ^d such that 1) sup_{ψ∈Ψ} ‖X(ψ)‖₂ ≤ 1 and sup_{ψ∈Ψ} ‖W(ψ)‖₂ ≤ B_W, and 2)

|E_{d_ψ}[ψ]| ≤ |⟨W(ψ) − W(ψ⋆), X(ψ)⟩| ∀ψ ∈ Ψ, and |E_{p_ψ}[ℓ_{ψ′}]| = |⟨W(ψ′) − W(ψ⋆), X(ψ)⟩| ∀ψ, ψ′ ∈ Ψ.

We define dim_bi(Ψ, D_Ψ, P_Ψ, L_Ψ) as the least dimension d for which this property holds.

Proposition 22 (Bilinear rank ⟹ Gen-SEC). Let SEC_gen(Ψ, D_Ψ, P_Ψ, L_Ψ, T) and dim_bi(Ψ, D_Ψ, P_Ψ, L_Ψ) be the Gen-SEC and Bilinear rank defined in Definitions 11 and 12 with respect to function class Ψ, distribution classes D_Ψ and P_Ψ, and discrepancy function class L_Ψ, and write d := dim_bi(Ψ, D_Ψ, P_Ψ, L_Ψ). Then we have

SEC_gen(Ψ, D_Ψ, P_Ψ, L_Ψ, T) ≲ dim_bi(Ψ, D_Ψ, P_Ψ, L_Ψ)·log(1 + 4B²_W·T/d).

Proof of Proposition 22. Throughout the proof, we use d^(t), p^(t), and ℓ^(t) as shorthand for d_{ψ^(t)}, p_{ψ^(t)}, and ℓ_{ψ^(t)}. We study the quantity

Σ_{t=1}^T E_{d^(t)}[ψ^(t)]²/(1 ∨ Σ_{i<t} E_{p^(i)}[(ℓ^(t))²]) ≤ 2·Σ_{t=1}^T E_{d^(t)}[ψ^(t)]²/(1 + Σ_{i<t} E_{p^(i)}[ℓ^(t)]²).

By Definition 12, we have

E_{d^(t)}[ψ^(t)]² ≤ |⟨W(ψ^(t)) − W(ψ⋆), X(ψ^(t))⟩|²,

and

1 + Σ_{i<t} E_{p^(i)}[ℓ^(t)]² = 1 + Σ_{i<t} |⟨W(ψ^(t)) − W(ψ⋆), X(ψ^(i))⟩|²
≥ (W(ψ^(t)) − W(ψ⋆))^⊤ Σ_t (W(ψ^(t)) − W(ψ⋆)) = ‖W(ψ^(t)) − W(ψ⋆)‖²_{Σ_t},

where Σ_t := (1/(4B²_W))·I + Σ_{i<t} X(ψ^(i))X(ψ^(i))^⊤. We bound

E_{d^(t)}[ψ^(t)]² ≤ |⟨W(ψ^(t)) − W(ψ⋆), X(ψ^(t))⟩|² ≤ ‖W(ψ^(t)) − W(ψ⋆)‖²_{Σ_t}·‖X(ψ^(t))‖²_{Σ_t^{-1}},

which implies

Σ_{t=1}^T E_{d^(t)}[ψ^(t)]²/(1 + Σ_{i<t} E_{p^(i)}[ℓ^(t)]²) ≤ Σ_{t=1}^T 1 ∧ ‖X(ψ^(t))‖²_{Σ_t^{-1}} ≤ 2·log(det(Σ_T)/det(Σ_1)) ≤ 2·dim_bi(Ψ, D_Ψ, P_Ψ, L_Ψ)·log(1 + 4B²_W·T/d),

where the last two inequalities follow from the elliptical potential lemma (Lattimore and Szepesvári, 2020, Lemma 19.4). Putting everything together, we obtain

SEC_gen(Ψ, D_Ψ, P_Ψ, L_Ψ, T) ≤ 4·dim_bi(Ψ, D_Ψ, P_Ψ, L_Ψ)·log(1 + 4B²_W·T/d).
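The elliptical potential step at the end of the proof can be checked empirically. The sketch below (random unit vectors standing in for the embeddings X(ψ^(t)); all numbers synthetic) verifies Σ_t min{1, ‖x_t‖²_{Σ_t^{-1}}} ≤ 2·log(det Σ_{T+1}/det Σ_1):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 200
lam = 0.25                      # plays the role of 1/(4 B_W^2) in the proof
xs = rng.normal(size=(T, d))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)   # enforce ||x_t||_2 <= 1

Sigma = lam * np.eye(d)
potential = 0.0
for x in xs:
    # min(1, ||x_t||^2_{Sigma_t^{-1}}) with Sigma_t built from past vectors
    potential += min(1.0, x @ np.linalg.solve(Sigma, x))
    Sigma += np.outer(x, x)

# Elliptical potential lemma:
#   sum_t min(1, ||x_t||^2_{Sigma_t^{-1}}) <= 2 log( det(Sigma_{T+1}) / det(Sigma_1) ),
# since min(1, u) <= 2 log(1 + u) for u >= 0 and the log-dets telescope.
bound = 2 * (np.linalg.slogdet(Sigma)[1] - d * np.log(lam))
print(potential, bound)
assert potential <= bound
```

The bound holds deterministically for any sequence with ‖x_t‖ ≤ 1, so the assertion is not a matter of chance for this seed.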

G EXTENSION: REWARD-FREE EXPLORATION

Reward-free exploration is a problem setting in which 1) the learning agent interacts with an environment without rewards, aiming to gather information, so that 2) in a subsequent offline phase, the information collected can be used to learn near-optimal policies for a wide range of possible reward functions (Jin et al., 2020a; Zhang et al., 2020; Wang et al., 2020a; Zanette et al., 2020b; Chen et al., 2022). This section provides a reward-free extension of our main results, and gives sample complexity bounds based on coverability for a reward-free extension of GOLF.

Function approximation. We assume access to a value function class F, which is used for the offline optimization, and a function class G, which is used for the reward-free exploration phase. Following the normalized-reward assumption, we assume g_h ∈ (X × A → [0,1]) for all (g, h) ∈ G × [H]. We define P_h as the "zero-reward" Bellman operator for layer h ∈ [H]. That is, for any g_{h+1} ∈ G_{h+1} and any h ∈ [H],

(P_h g_{h+1})(x_h, a_h) := Σ_{x_{h+1}} P_h(x_{h+1} | x_h, a_h)·max_{a_{h+1}∈A} g_{h+1}(x_{h+1}, a_{h+1}).

We let R denote the target reward function used in the offline phase, which is not known to the algorithm during the exploration phase. We make the following assumption.

Assumption 2 (Reward-free completeness). Let T_{1:H} be the Bellman operators with the target reward function R, and let F be the function class used to optimize the target reward function. Then for all h ∈ [H]: (a) P_h g_{h+1} ∈ G_h for all g_{h+1} ∈ G_{h+1}; (b) F_h − T_h F_{h+1} ⊆ G_h − P_h G_{h+1}.

Analogous reward-free Sequential Extrapolation Coefficient. The main guarantees for this section are stated in terms of a reward-free variant of the sequential extrapolation coefficient, which we define as follows.

Definition 13 (Sequential Extrapolation Coefficient for Reward-Free RL). For each h ∈ [H], let D^{Π_G}_h := {d^π_h : π ∈ Π_G} and G_h − P_h G_{h+1} := {g_h − P_h g_{h+1} : g ∈ G}.
Then we define

SEC_{RL,rf}(G, Π_G, T) := max_{h∈[H]} SEC(G_h − P_h G_{h+1}, D^{Π_G}_h, T).

Using the same arguments (and the same proofs) as in Section 5.2, the reward-free variant of the sequential extrapolation coefficient can be shown to subsume coverability (as well as a reward-free counterpart of the Bellman-Eluder dimension, which we omit).
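The zero-reward Bellman operator P_h defined above is simply a max-backup through the transition kernel. A minimal numerical sketch (toy transition kernel and values, all hypothetical) illustrates the computation:

```python
import numpy as np

# Toy layered MDP slice (hypothetical): 2 states at layer h, 3 at layer h+1, 2 actions.
# P[x, a, x'] = P_h(x' | x, a); each row over x' is a probability distribution.
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.3, 0.3, 0.4], [0.5, 0.0, 0.5]],
])
g_next = np.array([          # g_{h+1}(x', a'), values in [0, 1]
    [0.2, 0.9],
    [0.5, 0.4],
    [0.0, 1.0],
])

# Zero-reward Bellman backup: (P_h g_{h+1})(x, a) = E_{x'}[ max_{a'} g_{h+1}(x', a') ].
v_next = g_next.max(axis=1)          # max over a': [0.9, 0.5, 1.0]
Pg = P @ v_next                      # shape (2 states, 2 actions)
print(Pg)                            # e.g. (P_h g_{h+1})(x=0, a=0) = 0.7*0.9 + 0.2*0.5 + 0.1*1.0
```

Note that, unlike the reward-based operator T_h, no reward term is added; this is exactly what lets the exploration phase run without knowledge of R.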

G.1 ALGORITHM AND THEORETICAL ANALYSIS

Algorithm 2 Reward-Free Exploration with GOLF
input: Function class for reward-free exploration G.
initialize: D^(0)_{h,rf} ← ∅ for all h ∈ [H]; G^(0) ← G.
1: for episode t = 1, 2, ..., T do
2:   Select policy π^(t) ← π_{g^(t)}, where g^(t) = argmax_{g∈G^(t−1)} g(x₁, π_{g,1}).
3:   Execute π^(t) for one episode and obtain x^(t)_1, a^(t)_1, x^(t)_2, ..., x^(t)_H, a^(t)_H, x^(t)_{H+1}.
4:   Update historical data: D^(t)_{h,rf} ← D^(t−1)_{h,rf} ∪ {(x^(t)_h, a^(t)_h, x^(t)_{h+1})} for all h ∈ [H].
5:   Compute confidence set:
     G^(t) ← {g ∈ G : L^(t)_{h,rf}(g_h, g_{h+1}) − min_{g′_h∈G_h} L^(t)_{h,rf}(g′_h, g_{h+1}) ≤ β_rf, ∀h ∈ [H]},
     where L^(t)_{h,rf}(g, g′) := Σ_{(x,a,x′)∈D^(t)_{h,rf}} (g(x, a) − max_{a′∈A} g′(x′, a′))² for all g, g′ ∈ G.
6: Select t⋆ ← argmin_{t∈[T]} g^(t)_1(x₁, π^(t)_1).
7: Return data D^(t⋆−1)_{h,rf} for all h ∈ [H].

Algorithm 3 Offline GOLF with Exploration Data and Target Reward
input:
• Target reward function R.
• Function class F for offline RL.
• Exploration data from Algorithm 2, denoted by D_{h,rf} for all h ∈ [H].
1: Compute confidence set:
   F^(off) ← {f ∈ F : L^(off)_h(f_h, f_{h+1}) − min_{f′_h∈F_h} L^(off)_h(f′_h, f_{h+1}) ≤ β_off, ∀h ∈ [H]},
   where L^(off)_h(f, f′) := Σ_{(x,a,x′)∈D_{h,rf}} (f(x, a) − R(x, a) − max_{a′∈A} f′(x′, a′))² for all f, f′ ∈ F.
2: Return π̂ ← π_{f̂}, where f̂ = argmax_{f∈F^(off)} f(x₁, π_{f,1}).

Recall that the key ideas in GOLF are: 1) using optimism to relate regret to on-policy average Bellman error; and 2) using squared Bellman error to construct a confidence set, which ensures optimism. In the reward-free setting, one can apply these ideas by running GOLF (Algorithm 2) with rewards set to zero. Intuitively, this strategy ensures exploration because the algorithm must explore to rule out test functions in G. However, a priori it is unclear whether running a standard offline RL algorithm on the exploration data produced by this strategy should lead to a near-optimal policy, especially given that the PAC guarantee for GOLF relies on outputting a uniform mixture of all historical policies (see, e.g., Corollary 2). To address this issue, one can ask: if we knew which of the historical policies is best (say, π^(t⋆) for some t⋆), could running one step of GOLF on the exploration data at time t⋆ (Algorithm 3) be guaranteed to find a good policy? Note that, for the original GOLF algorithm (in the known-reward case), doing so directly reproduces π^(t⋆).
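The confidence-set computation in Line 5 of Algorithm 2 (and its offline analogue in Algorithm 3) reduces to comparing squared temporal-difference losses over a finite version space. A minimal sketch with a hypothetical finite class G_h and toy transition data (all numbers illustrative, not from the paper):

```python
import numpy as np

# Hypothetical toy data: transitions (x, a, x') at layer h, states/actions in {0, 1}.
data = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
g_next = np.array([[0.2, 0.5], [0.8, 0.1]])   # a fixed g_{h+1}(x', a')

def loss(g_h):
    # L^(t)_{h,rf}(g_h, g_{h+1}) = sum over the dataset of
    #   ( g_h(x, a) - max_{a'} g_{h+1}(x', a') )^2
    v = g_next.max(axis=1)                    # [0.5, 0.8]
    return sum((g_h[x, a] - v[xp]) ** 2 for x, a, xp in data)

# A tiny finite class G_h of candidate layer-h functions (hypothetical):
G_h = [np.array([[0.8, 0.5], [0.8, 0.8]]),
       np.array([[0.5, 0.5], [0.5, 0.5]]),
       np.array([[0.0, 0.0], [0.0, 0.0]])]

losses = [loss(g) for g in G_h]
beta_rf = 0.1
# Version space: keep candidates whose excess loss over the empirical minimizer
# is at most beta_rf (Line 5 of Algorithm 2).
version_space = [i for i, l in enumerate(losses) if l - min(losses) <= beta_rf]
print(losses, version_space)
```

Algorithm 3's loss is identical except that the target reward R(x, a) is subtracted from f(x, a) inside the square, since the offline phase fits reward-based Bellman backups.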
Although identifying the best among all historical policies seems impossible in the known-reward case, thanks to the reward-free structure we will show that the value g(x₁, π_{g,1}) directly captures "how bad g is" (akin to the regret in the known-reward case), which allows us to find the best step of the reward-free exploration phase. The following result provides a sample complexity guarantee for this strategy.

Theorem 23. Under Assumptions 1 and 2, there exist absolute constants c₁ and c₂ such that for any δ ∈ (0,1] and T ∈ ℕ₊, if we choose β_off = c₁·log(TH|G|/δ) and β_rf = (c₁ + c₂)·log(TH|G|/δ) in Algorithms 2 and 3, then with probability at least 1 − δ, the policy π̂ output by Algorithm 3 has

J(π⋆) − J(π̂) ≤ O(H·√(SEC_{RL,rf}(G, Π_G, T)·log(TH|G|/δ)/T)).

We defer the proof to Appendix G.2. We also introduce the following two lemmas, which are key to adapting the known-reward results to the reward-free case.

Lemma 24 (Reward-free exploration overestimates regret). For any f ∈ F, let g be defined by g_h = f_h − T_h f_{h+1} + P_h g_{h+1} for all h ∈ [H] (with g_{H+1} := 0). Then for any (x, a, h) ∈ X × A × [H], we have

g_h(x, a) ≥ f_h(x, a) − Q^{π_f}_h(x, a).

Since the Q-functions of all policies in the zero-reward case are identically zero, Lemma 24 guarantees that the regret incurred in the reward-free exploration phase, g₁(x₁, π_{g,1}) − 0, always upper bounds its counterpart in the offline phase, f₁(x₁, π_{f,1}) − Q^{π_f}_1(x₁, π_{f,1}). Equipped with the optimism argument, we can show that if g₁(x₁, π_{g,1}) is small, then the corresponding π_f (for the f with f_h − T_h f_{h+1} = g_h − P_h g_{h+1} for all h ∈ [H]) also has small regret.

Lemma 25 (Reward-free exploration has a larger confidence set). Suppose Assumption 2 holds, and adopt the same conditions as Theorem 23. For any f ∈ F^(off) (defined in Eq. (30)), there must exist g ∈ G^(t⋆−1) (defined in Eq. (29)) such that f_h − T_h f_{h+1} = g_h − P_h g_{h+1} for all h ∈ [H].
Lemma 25 ensures that the reward-free version space G^(t⋆−1) subsumes the offline version space F^(off). Thus, we can use quantities controlled during the reward-free exploration phase to upper bound their counterparts in the offline phase.
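Lemma 24's overestimation property can also be sanity-checked numerically. The sketch below (toy two-layer MDP with hypothetical rewards and transitions) builds g via the recursion g_h = f_h − T_h f_{h+1} + P_h g_{h+1} and verifies g_h ≥ f_h − Q^{π_f}_h pointwise:

```python
import numpy as np

# Toy 2-layer MDP (hypothetical numbers). Layer 1: one state; layer 2: two states.
P1 = np.array([[0.6, 0.4], [0.2, 0.8]])   # P1[a, x2] = P(x2 | x1, a)
R1 = np.array([0.1, 0.3])                  # R1[a] = R_1(x1, a)
R2 = np.array([[0.0, 0.5], [0.7, 0.2]])    # R2[x2, a]

f1 = np.array([0.4, 0.9])                  # an arbitrary f (with f_3 := 0)
f2 = np.array([[0.3, 0.6], [0.8, 0.1]])

# Q^{pi_f}: at layer 2, Q^{pi_f}_2 = R2; at layer 1, roll out pi_f (greedy w.r.t. f).
pi_f2 = f2.argmax(axis=1)                  # greedy action per x2
Q2_pif = R2
V2_pif = R2[np.arange(2), pi_f2]
Q1_pif = R1 + P1 @ V2_pif

# Lemma 24's construction: g_h = f_h - T_h f_{h+1} + P_h g_{h+1}, with g_3 = 0.
g2 = f2 - R2                               # T_2 f_3 = R_2
T1f2 = R1 + P1 @ f2.max(axis=1)            # T_1 f_2 = R_1 + P_1 max_a f_2
g1 = f1 - T1f2 + P1 @ g2.max(axis=1)       # P_1 g_2 is the zero-reward max-backup

# Claimed overestimation: g_h(x, a) >= f_h(x, a) - Q^{pi_f}_h(x, a).
assert np.all(g2 >= f2 - Q2_pif - 1e-12)
assert np.all(g1 >= f1 - Q1_pif - 1e-12)
print(g1, f1 - Q1_pif)
```

At layer H the two sides coincide (g_H = f_H − R_H); the slack at earlier layers comes from replacing the rollout of π_f by the max-backup in P_h.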

G.1.1 RELATED WORK

Our approach adapts techniques for reward-free exploration in nonlinear RL introduced by Chen et al. (2022). In what follows, we discuss the connection to this work in greater detail. We focus on the Q-type results of Chen et al. (2022), but similar arguments are likely to apply to the V-type results. Briefly, Chen et al. (2022) extend the OLIVE algorithm to the reward-free setting using the idea of online exploration with zero rewards. The most important difference here is that, as discussed in Section 5, since OLIVE only considers average Bellman residuals, it cannot capture coverability. Beyond this difference, let us compare the completeness conditions in Assumption 2 to those made in Chen et al. (2022). We will show that the completeness assumption used by Chen et al. (2022) implies Assumption 2. Following Chen et al. (2022, Assumption 2), suppose that F_h = Ψ_h + R_h with P_h(Ψ_{h+1} + R_{h+1}) ⊆ Ψ_h, and take G_h := Ψ_h − Ψ_h. Then

T_h F_{h+1} = R_h + P_h(Ψ_{h+1} + R_{h+1}) ⊆ R_h + Ψ_h = F_h.

For Assumption 2(b):

F_h − T_h F_{h+1} = Ψ_h + R_h − R_h − P_h(Ψ_{h+1} + R_{h+1}) = Ψ_h − P_h(Ψ_{h+1} + R_{h+1}) ⊆ Ψ_h − Ψ_h,   (by Chen et al. (2022, Assumption 2))

while

G_h − P_h G_{h+1} = Ψ_h − Ψ_h − P_h(Ψ_{h+1} − Ψ_{h+1}) ⊇ Ψ_h − Ψ_h,   (since 0 ∈ Ψ_{h+1} − Ψ_{h+1})

which together give F_h − T_h F_{h+1} ⊆ G_h − P_h G_{h+1}.

G.2 PROOFS

We first present the following form of Freedman's inequality for martingales (e.g., Agarwal et al., 2014).

Lemma 26 (Freedman's inequality). Let {X^{(1)}, X^{(2)}, …, X^{(T)}} be a real-valued martingale difference sequence adapted to a filtration {F^{(1)}, F^{(2)}, …, F^{(T)}} (i.e., E[X^{(t)} | F^{(t−1)}] = 0 for all t ∈ [T]). If |X^{(t)}| ≤ R almost surely for all t ∈ [T], then for any η ∈ (0, 1/R), with probability at least 1−δ,

Σ_{t=1}^T X^{(t)} ≤ η·Σ_{t=1}^T E[(X^{(t)})² | F^{(t−1)}] + log(1/δ)/η.

We now provide proofs of the results from Appendix G.1.

Proof of Theorem 23. Throughout this proof, let t⋆ denote the round selected by the algorithm, i.e., the minimizer of g^{(t)}_1(x_1, π^{(t)}_1) over t ∈ [T]. Then,

g^{(t⋆)}_1(x_1, π^{(t⋆)}_1) ≤ (1/T)·Σ_{t=1}^T g^{(t)}_1(x_1, π^{(t)}_1) = (1/T)·Σ_{t=1}^T Σ_{h=1}^H E_{d^{(t)}_h}[δ^{(t)}_{h,rf}]  (by Eq. (33))
≤ √(H·SEC_RL,rf(G, Π_G, T)·β_rf / T),  (34)

where the last inequality follows from Eq. (31). By Lemma 25, we know there exists g ∈ G^{(t⋆−1)} such that f_h − T_h f_{h+1} = g_h − P_h g_{h+1} for all h ∈ [H]. In addition, we can obtain

J(π⋆) − J(π_f) ≤ f_1(x_1, π_{f,1}) − J(π_f)  (by Lemma 12)
≤ g_1(x_1, π_{g,1}).  (by Lemma 24)

Therefore, we have

J(π⋆) − J(π_f) ≤ g_1(x_1, π_{g,1}) ≤ g^{(t⋆)}_1(x_1, π^{(t⋆)}_1) ≤ √(H·SEC_RL,rf(G, Π_G, T)·β_rf / T).  (by Eq. (34))

Plugging in the choice of β_rf completes the proof.

Proof of Lemma 24. We establish the proof by induction over h. For the base case h = H, the inductive hypothesis holds because g_H = f_H − R_H = f_H − Q^{π_f}_H. Now suppose the inductive hypothesis holds at step h+1. Then for any x ∈ X,

g_{h+1}(x,a) ≥ f_{h+1}(x,a) − Q^{π_f}_{h+1}(x,a), ∀a ∈ A
⟹ g_{h+1}(x, π_{f,h+1}) ≥ f_{h+1}(x, π_{f,h+1}) − Q^{π_f}_{h+1}(x, π_{f,h+1})
⟹ max_{a∈A} g_{h+1}(x,a) ≥ f_{h+1}(x, π_{f,h+1}) − Q^{π_f}_{h+1}(x, π_{f,h+1})
⟹ g_{h+1}(x, π_{g,h+1}) ≥ f_{h+1}(x, π_{f,h+1}) − V^{π_f}_{h+1}(x).  (35)

Then, since g_h = f_h − T_h f_{h+1} + P_h g_{h+1}, expanding this identity and applying Eq. (35) gives g_h(x,a) ≥ f_h(x,a) − Q^{π_f}_h(x,a) for any (x,a) ∈ X × A. Therefore, the inductive hypothesis also holds at step h, which completes the induction and the proof.

Proof of Lemma 25.
Throughout this proof, we use d^{(t)}_h as shorthand for d^{π^{(t)}}_h. The proof of this lemma consists of two parts.
(i) There exists a radius β_1 such that for any g ∈ G, if g satisfies Σ_{t=1}^{t⋆−1} E_{d^{(t)}_h}[(g_h − P_h g_{h+1})²] ≤ β_1 for all h ∈ [H], then g ∈ G^{(t⋆−1)}.
(ii) There exists another radius β_2 with β_2 ≤ β_1 such that for any f ∈ F^{(off)}, we have Σ_{t=1}^{t⋆−1} E_{d^{(t)}_h}[(f_h − T_h f_{h+1})²] ≤ β_2 for all h ∈ [H].

Proof of part (i). For any (t,h,g) ∈ [T]×[H]×G, let Y^{(t)}_h(g) be defined as

Y^{(t)}_h(g) := (g_h(x^{(t)}_h, a^{(t)}_h) − g_{h+1}(x^{(t)}_{h+1}, π_{g,h+1}))² − ((P_h g_{h+1})(x^{(t)}_h, a^{(t)}_h) − g_{h+1}(x^{(t)}_{h+1}, π_{g,h+1}))².

Also, let F^{(t)}_h be the filtration induced by {x^{(i)}_1, a^{(i)}_1, x^{(i)}_2, a^{(i)}_2, …, x^{(i)}_{h+1}}_{i≤t}.

Proof of part (ii). Similarly to part (i), for any (t,h,f) ∈ [T]×[H]×F, let X^{(t)}_h(f) be defined as

X^{(t)}_h(f) := (f_h(x^{(t)}_h, a^{(t)}_h) − R(x^{(t)}_h, a^{(t)}_h) − f_{h+1}(x^{(t)}_{h+1}, π_{f,h+1}))² − ((T_h f_{h+1})(x^{(t)}_h, a^{(t)}_h) − R(x^{(t)}_h, a^{(t)}_h) − f_{h+1}(x^{(t)}_{h+1}, π_{f,h+1}))².

Also let X̄^{(t)}_h(f) := E[X^{(t)}_h(f) | F^{(t−1)}_h] − X^{(t)}_h(f), so that {X̄^{(t)}_h(f)}_{t=1}^T is a martingale difference sequence adapted to the filtration {F^{(t)}_h}_{t=1}^T, with |X̄^{(t)}_h(f)| ≤ 2 almost surely. Thus, by the same arguments as in Eqs. (36) and (37) (as well as applying Lemma 26), we have

E[X^{(t)}_h(f) | F^{(t−1)}_h] = E_{d^{(t)}_h}[(f_h − T_h f_{h+1})²],  (39)

and for any (h,f) ∈ [H]×F and any η ∈ (0, 1/2), with probability at least 1−δ,

Σ_{t=1}^{t⋆−1} E[X^{(t)}_h(f) | F^{(t−1)}_h] ≤ η·Σ_{t=1}^{t⋆−1} E[X^{(t)}_h(f) | F^{(t−1)}_h] + log(H|F|/δ)/η + Σ_{t=1}^{t⋆−1} X^{(t)}_h(f)
⟹ (1−η)·Σ_{t=1}^{t⋆−1} E[X^{(t)}_h(f) | F^{(t−1)}_h] ≤ log(H|F|/δ)/η + Σ_{t=1}^{t⋆−1} X^{(t)}_h(f).
Therefore, if f ∈ F^{(off)}, we have

Σ_{t=1}^{t⋆−1} X^{(t)}_h(f) = Σ_{t=1}^{t⋆−1} (f_h(x^{(t)}_h, a^{(t)}_h) − f_{h+1}(x^{(t)}_{h+1}, π_{f,h+1}))² − Σ_{t=1}^{t⋆−1} ((T_h f_{h+1})(x^{(t)}_h, a^{(t)}_h) − f_{h+1}(x^{(t)}_{h+1}, π_{f,h+1}))²
≤ Σ_{t=1}^{t⋆−1} (f_h(x^{(t)}_h, a^{(t)}_h) − f_{h+1}(x^{(t)}_{h+1}, π_{f,h+1}))² − min_{f'_h ∈ F_h} Σ_{t=1}^{t⋆−1} (f'_h(x^{(t)}_h, a^{(t)}_h) − f_{h+1}(x^{(t)}_{h+1}, π_{f,h+1}))²
≤ L^{(off)}_h(f_h, f^{(t⋆)}_{h+1}) − min_{f'_h ∈ F_h} L^{(off)}_h(f'_h, f^{(t⋆)}_{h+1}) ≤ β_off.

We then combine Eqs. (39) to (41) and obtain

Σ_{t=1}^{t⋆−1} E_{d^{(t)}_h}[(f_h − T_h f_{h+1})²] = Σ_{t=1}^{t⋆−1} E[X^{(t)}_h(f) | F^{(t−1)}_h] ≤ 5·log(H|F|/δ) + 2β_off =: β_2.  (e.g., setting η = 1/3)  (42)

Putting everything together. By Eqs. (38) and (42), we only need the following inequality to hold:

5·log(H|F|/δ) + 2β_off ≤ β_rf/3 − log(H|G|/δ),

for which it suffices to take β_rf ≥ 6β_off + 18·log(H|G|/δ). This is satisfied by the choice of β_rf in Theorem 23. Combining (i) and (ii), we obtain, for any h ∈ [H],

{f_h − T_h f_{h+1} : f ∈ F^{(off)}}
⊆ {f_h − T_h f_{h+1} : Σ_{t=1}^{t⋆−1} E_{d^{(t)}_h}[(f_h − T_h f_{h+1})²] ≤ β_2, ∀h ∈ [H], f ∈ F}  (by (ii))
⊆ {g_h − P_h g_{h+1} : Σ_{t=1}^{t⋆−1} E_{d^{(t)}_h}[(g_h − P_h g_{h+1})²] ≤ β_1, ∀h ∈ [H], g ∈ G}  (by Assumption 2 and β_2 ≤ β_1)
⊆ {g_h − P_h g_{h+1} : g ∈ G^{(t⋆−1)}}.  (by (i))

This completes the proof.
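Freedman's inequality (Lemma 26), which drives the concentration steps above, can also be checked by simulation. The sketch below uses a hypothetical bounded martingale difference sequence of our own devising (random signs times predictable scales) and confirms that the empirical violation rate of the bound stays below δ.

```python
import numpy as np

rng = np.random.default_rng(1)
T, R_bound, delta, eta = 200, 1.0, 0.05, 0.25   # eta must lie in (0, 1/R)
n_runs, failures = 2000, 0

for _ in range(n_runs):
    # Bounded MDS: X_t = sign_t * scale_t with E[X_t | past] = 0; the scales
    # are drawn independently of the signs, so they act as predictable bounds.
    signs = rng.choice([-1.0, 1.0], size=T)
    scales = rng.uniform(0.0, R_bound, size=T)
    X = signs * scales
    var_sum = np.sum(scales ** 2)               # sum_t E[(X_t)^2 | F_{t-1}]
    bound = eta * var_sum + np.log(1.0 / delta) / eta
    if np.sum(X) > bound:
        failures += 1

# Freedman guarantees failure with probability at most delta.
assert failures / n_runs <= delta
print(f"empirical failure rate: {failures / n_runs:.4f} (allowed: {delta})")
```

In practice the bound is quite loose for this toy sequence, so the empirical failure rate is far below δ.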



Footnotes:
• While our results assume that the initial state is fixed for simplicity, this assumption is straightforward to relax.
• Specifically, FQI requires concentrability with Π chosen to be the set of all admissible policies (see, e.g., Chen and Jiang, 2019). Other algorithms (Xie and Jiang, 2020) can leverage concentrability w.r.t. smaller policy classes.
• We require H = 2 to apply the result to contextual bandits because we assume a deterministic starting state.
• π̂ is the non-Markov policy obtained by sampling t ∼ [T] and playing π^{(t)}.
• Applying this result formally requires a separate argument to handle early rounds in which pairs (x,a) have been visited very little; this argument is given in Appendix D.
• Q-type and V-type are similar, but define the Bellman residual with respect to different action distributions.
• For example, it is not clear how to construct a complete value function class given access to a class of decoders Φ that contains ϕ⋆.
• If the minimum in Eq. (4) is not attained, we can repeat the argument that follows for each element of a limiting sequence attaining the infimum.
• Technically, the construction in Efroni et al. (2022a) has a stochastic initial state with known distribution. This can be embedded in our framework, which has a deterministic initial state, by lifting the horizon from 2 to 3.
• This definition coincides with the distributional Eluder dimension (see, e.g., Jin et al., 2021a), which differs from the Bellman-Eluder dimension only in the notion of test function. We overload the notation dim_BE throughout this proof for simplicity.



(II): in-sample squared Bellman error, where the inequality is an application of Cauchy-Schwarz. As an immediate consequence of the confidence set construction in Eq. (3), completeness, and a standard concentration argument (Lemma 12 in Appendix D), we can bound the in-sample error by (II) ≤ O(√(βT)).

Figure 1: An example of coverability ⇐⇒ cumulative reachability (which equals the total area of the shaded region, without double-counting overlaps). Here Π = {π1, π2, π3, π4}, and the dashed curves are d^π.

Proposition 4. For any d ∈ N, there exists an MDP M with H = 2 and |A| = 2, a policy class Π with |Π| = d, and a value function class F with |F| = d satisfying completeness, such that C_cov = O(1), but the Bellman-Eluder dimension satisfies dim_BE(F, Π, ε) = Ω(min{|F|, |Π|}) = Ω(d) for any ε ≤ 1/2.

The lower bound in Proposition 4 is realized by an exogenous block MDP (Appendix C), with d representing the number of exogenous states. The result gives an exponential separation between what can be achieved using the Bellman-Eluder dimension and coverability, because GOLF attains Reg ≤ O(√(T·log(d))) (cf. Corollary 9).
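For intuition about the quantity C_cov in Proposition 4, the sketch below computes cumulative reachability, which (per Figure 1) coincides with coverability, by brute force on a small random tabular MDP of our own devising, and checks the tabular bound C_cov ≤ |S||A|. The sup over policies is taken over deterministic nonstationary policies, which suffices pointwise in the tabular case.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n_s, n_a, H = 3, 2, 2
mu0 = rng.dirichlet(np.ones(n_s))                    # initial state distribution
P = rng.dirichlet(np.ones(n_s), size=(H, n_s, n_a))  # transitions P_h(s' | s, a)

def occupancy(policy):
    """d^pi_h(s, a) for a deterministic nonstationary policy (one action map per step)."""
    d = np.zeros((H, n_s, n_a))
    state_dist = mu0.copy()
    for h in range(H):
        for s in range(n_s):
            d[h, s, policy[h][s]] = state_dist[s]
        # propagate: next-state distribution under the chosen actions
        state_dist = np.einsum('s,sx->x', state_dist,
                               P[h, np.arange(n_s), policy[h]])
    return d

# Cumulative reachability: C_h = sum_{s,a} max_pi d^pi_h(s,a); C_cov = max_h C_h.
per_step = list(itertools.product(range(n_a), repeat=n_s))
reach = np.zeros((H, n_s, n_a))
for pol in itertools.product(per_step, repeat=H):
    reach = np.maximum(reach, occupancy(pol))
C_cov = max(reach[h].sum() for h in range(H))

print(f"C_cov = {C_cov:.3f}  (tabular bound |S||A| = {n_s * n_a})")
assert C_cov <= n_s * n_a + 1e-9   # each term max_pi d^pi_h(s,a) is at most 1
```

Since each occupancy sums to 1 at every step, C_cov is also at least 1; the gap between C_cov and |S||A| reflects how hard the hardest-to-reach (s,a) pairs are.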

For any Ex-BMDP, C cov ≤ |S|•|A|.

d^π_h(x,a) ≤ 2·C_cov·µ⋆_h(x,a), which follows from Eq. (

see Appendix F.3.1 or Jin et al. (2021a) for more background on the V-type Bellman-Eluder dimension.

V-type Bellman-Eluder dimension. The hard instance for the V-type Bellman-Eluder dimension is based on the construction of Efroni et al. (2022a, Proposition B.1), which shows that for any d = 2^i (i ∈ N), there exists an exogenous MDP (ExoMDP) with |S| = 3 endogenous states, |A| = 2, H = 2, and d exogenous factors, with the following properties:
1. There exists a function class F such that Q⋆ ∈ F and |F| = d. In addition, for all

C cov ≤ 6; this is a consequence of Proposition 8 and the fact that the ExoMDP model in Efroni et al. (2022a) is a special case of the Ex-BMDP model in Section 3.2.

For any d ∈ N, there exists an MDP M with H = 2 and |A| = 2, a policy class Π with |Π| = d, and a value function class F with |F| = d satisfying completeness, such that C_cov = O(1), yet OLIVE (Jiang et al., 2017) requires at least Ω(d) trajectories to return a 0.1-optimal policy.

(SEC) further by allowing the use of general discrepancy functions to form confidence sets and estimate Bellman residuals, in the vein of Bilinear classes.

Definition 11 (Gen-SEC). Let Z be an abstract set. Let Ψ ⊂ (Z → R) be a function class, let D_Ψ := {d_ψ : ψ ∈ Ψ} ⊂ Δ(Z) and P_Ψ := {p_ψ : ψ ∈ Ψ} ⊂ Δ(Z) be two corresponding distribution classes, and let L_Ψ := {ℓ_ψ : ψ ∈ Ψ} ⊂ (Z → R) be a corresponding class of discrepancy functions. The Gen-SEC for length T is given by SEC_gen(Ψ, D_Ψ, P_Ψ, L_Ψ, T) := sup_{ψ^{(1)},…,ψ^{(T)} ∈ Ψ} Σ_{t=1}^T

The completeness assumption of Chen et al. (2022, Assumption 2) is a sufficient condition for ours (Assumptions 1 and 2). In our notation, Chen et al. (2022) use F := Ψ + R := {ψ_{1:H}(·,·) + R_{1:H}(·,·) : ψ ∈ Ψ} for some function class Ψ in the offline phase, and select G := Ψ − Ψ := {ψ_{1:H}(·,·) − ψ'_{1:H}(·,·) : ψ, ψ' ∈ Ψ} for the reward-free exploration phase. Thus, for any h ∈ [H], we have the following. For Assumption 1 and Assumption 2(a):

P_h G_{h+1} = P_h(Ψ_{h+1} − Ψ_{h+1}) ⊆ Ψ_h − Ψ_h = G_h.  (by Chen et al. (2022, Assumption 2))

g_h(x,a) = f_h(x,a) − R_h(x,a) − E_{x'|x,a}[max_{a'∈A} f_{h+1}(x',a')] + E_{x'|x,a}[max_{a'∈A} g_{h+1}(x',a')]
= f_h(x,a) − R_h(x,a) + E_{x'|x,a}[max_{a'∈A} g_{h+1}(x',a') − max_{a'∈A} f_{h+1}(x',a')]
= f_h(x,a) − R_h(x,a) + E_{x'|x,a}[g_{h+1}(x', π_{g,h+1}) − f_{h+1}(x', π_{f,h+1})]
≥ f_h(x,a) − R_h(x,a) + E_{x'|x,a}[−V^{π_f}_{h+1}(x')]  (by Eq. (35))
= f_h(x,a) − R_h(x,a) − E_{x'|x,a}[V^{π_f}_{h+1}(x')] = f_h(x,a) − Q^{π_f}_h(x,a).

(e.g., setting η = 1/3.) So we only need to guarantee 5·log(H|F|/δ) + 2β_off = β_2 ≤ β_1.

algorithms to the latent state space. In light of this, the aim of the Ex-BMDP setting is to obtain sample complexity guarantees that are independent of the size of the observed state space |X | and exogenous state space |Ξ|, and scale as poly(|S|,|A|,H,log|F|), where F is an appropriate class of function approximators (typically either a value function class F or a class of decoders Φ that attempts to model ϕ ⋆ directly).

Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702–11716, 2021.
Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.
Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933. PMLR, 2019.
Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, 2017.
Andrew Wagenmaker and Kevin Jamieson. Instance-dependent near-optimal policy identification in linear MDPs via online experiment design. arXiv preprint arXiv:2207.02575, 2022.
Andrew J. Wagenmaker, Max Simchowitz, and Kevin Jamieson. Beyond no regret: Instance-dependent PAC reinforcement learning. In Conference on Learning Theory, pages 358–418. PMLR, 2022.
Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pages 2730–2775. PMLR, 2022.
Xuezhou Zhang, Yuzhe Ma, and Adish Singla. Task-agnostic exploration in reinforcement learning.

While we have already discussed connections to Bellman rank, Bilinear classes, and the Bellman-Eluder dimension, another more general complexity measure is the Decision-Estimation Coefficient (Foster et al., 2021). One can show that the Decision-Estimation Coefficient is bounded by coverability, but to apply the algorithm in Foster et al. (2021), one must assume access to a realizable model class M, which leads to regret bounds that scale with log|M| rather than log|F|.

, but holds nonetheless. Proposition 8 can be deduced by combining Propositions 10 and 11 with the observation that any tabular (finite-state/action) MDP with S states and A actions has C_cov ≤ SA. However, Propositions 10 and 11 yield more general results, since they imply that starting with any (potentially non-tabular) class of MDPs M with low coverability and augmenting it with rich observations and exogenous noise preserves coverability.

Proof of Proposition 8. Let h ∈ [H] be fixed. Let z_h

to Assumption 1, Assumption 2(a) is used to control the squared Bellman error with zero reward. Assumption 2(b) guarantees that the class of test functions of interest for the reward-free exploration phase (G h -P h G h+1 for layer h ∈ [H], see Algorithm 2) is sufficiently rich relative to the relevant class of test functions for the offline phase (F h -T h F h+1 for layer h ∈ [H], see Algorithm 3). Without loss of generality, we assume that |G| = max{|F|,|G|}.
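To make the version-space constructions concrete, the following sketch uses a toy finite class of our own devising (not the paper's F or G): it builds an offline version space by thresholding excess empirical squared Bellman loss, in the spirit of Eq. (30), and checks that enlarging the radius, as in passing from β_off to the larger β_rf, can only enlarge the version space, which is the monotonicity underlying Lemma 25.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical finite class: each candidate f = (f_1, f_2) is a pair of
# Q-value tables on a small state/action space, fit to logged transitions.
n_f, n_x, n_a, n_data = 8, 3, 2, 50
F1 = rng.uniform(0, 2, size=(n_f, n_x, n_a))
F2 = rng.uniform(0, 1, size=(n_f, n_x, n_a))
x = rng.integers(n_x, size=n_data)
a = rng.integers(n_a, size=n_data)
r = rng.uniform(size=n_data)
x_next = rng.integers(n_x, size=n_data)

def loss(f1_tab, f2_tab):
    """Empirical squared Bellman loss: sum_t (f_1(x,a) - r - max_a' f_2(x',a'))^2."""
    return np.sum((f1_tab[x, a] - r - f2_tab[x_next].max(axis=1)) ** 2)

def version_space(beta):
    """Indices i with L(f_1^i, f_2^i) - min_{f'} L(f', f_2^i) <= beta."""
    keep = set()
    for i in range(n_f):
        best = min(loss(F1[k], F2[i]) for k in range(n_f))
        if loss(F1[i], F2[i]) - best <= beta:
            keep.add(i)
    return keep

# A larger radius yields a superset: the mechanism that lets the reward-free
# version space (radius beta_rf) subsume the offline one (radius beta_off).
assert version_space(1.0) <= version_space(10.0) <= version_space(100.0)
print(sorted(version_space(10.0)))
```

The containment check is the finite-class analogue of the β_off ≤ β_rf relationship exploited in the proof of Lemma 25.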

δ^{(t)}_{h,rf}(x_h, a_h) := g^{(t)}_h(x_h, a_h) − (P_h g^{(t)}_{h+1})(x_h, a_h), ∀(h,t) ∈ [H]×[T]. By Theorem 5 (setting the reward to zero and replacing everything regarding F with G), we have Σ_{t=1}^T Σ_{h=1}^H E_{d^{(t)}_h}[δ^{(t)}_{h,rf}] ≤ O(√(H·SEC_RL,rf(G, Π_G, T)·β_rf·T)).

|Ȳ^{(t)}_h(g)| ≤ 2 almost surely. Then, by applying Lemma 26 with a union bound, we have for any (h,g) ∈ [H]×G and any η ∈ (0, 1/2), with probability at least 1−δ: if g satisfies Σ_{t=1}^{t⋆−1} E_{d^{(t)}_h}[(g_h − P_h g_{h+1})²] ≤ β_1 for all h ∈ [H], then by Eq. (37), the empirical squared loss of g at each h ∈ [H] is at most 3·(β_1 + log(H|G|/δ)) (e.g., by picking η = 1/3). So we only need to guarantee 3·(β_1 + log(H|G|/δ)) ≤ β_rf.

ACKNOWLEDGEMENTS

Nan Jiang acknowledges funding support from ARL Cooperative Agreement W911NF-17-2-0196, NSF IIS-2112471, NSF CAREER award, and Adobe Data Science Research Award. Sham Kakade acknowledges funding from the Office of Naval Research under award N00014-22-1-2377 and the National Science Foundation Grant under award #CCF-1703574.

