HYBRID RL: USING BOTH OFFLINE AND ONLINE DATA CAN MAKE RL EFFICIENT

Abstract

We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma's Revenge.

1. INTRODUCTION

Learning by interacting with an environment, in the standard online reinforcement learning (RL) protocol, has led to impressive results across a number of domains. State-of-the-art RL algorithms are quite general, employing function approximation to scale to complex environments with minimal domain expertise and inductive bias. However, online RL agents are also notoriously sample inefficient, often requiring billions of environment interactions to achieve suitable performance. This issue is particularly salient when the environment requires sophisticated exploration and a high quality reset distribution is unavailable to help overcome the exploration challenge. As a consequence, the practical success of online RL and related policy gradient/improvement methods has been largely restricted to settings where a high quality simulator is available. To overcome the issue of sample inefficiency, attention has turned to the offline RL setting (Levine et al., 2020) , where, rather than interacting with the environment, the agent trains on a large dataset of experience collected in some other manner (e.g., by a system running in production or an expert). While these methods still require a large dataset, they mitigate the sample complexity concerns of online RL, since the dataset can be collected without compromising system performance. However, offline RL methods can suffer from distribution shift, where the state distribution induced by the learned policy differs significantly from the offline distribution (Wang et al., 2021) . Existing provable approaches for addressing distribution shift are computationally intractable, while empirical approaches rely on heuristics that can be sensitive to the domain and offline dataset (as we will see). In this paper, we focus on a hybrid reinforcement learning setting, which we call Hybrid RL, that draws on the favorable properties of both offline and online settings. 
In Hybrid RL, the agent has both an offline dataset and the ability to interact with the environment, as in the traditional online RL setting. The offline dataset helps address the exploration challenge, allowing us to greatly reduce the number of interactions required. Simultaneously, we can identify and correct distribution shift issues via online interaction. Variants of the setting have been studied in a number of empirical works (Rajeswaran et al., 2017; Hester et al., 2018; Nair et al., 2018; 2020; Vecerik et al., 2017) which mainly focus on using expert demonstrations as offline data. Our algorithmic development is closely related to these works, although our focus is on formalizing the hybrid setting and establishing theoretical guarantees against more general offline datasets. Hybrid RL is closely related to the reset setting, where the agent can interact with the environment starting from a "nice" distribution. A number of simple and effective algorithms, including CPI (Kakade & Langford, 2002) , PSDP (Bagnell et al., 2003) , and policy gradient methods (Kakade, 2001; Agarwal et al., 2020b) -which have further inspired deep RL methods such as TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017) -are provably efficient in the reset setting. Yet, a nice reset distribution is a strong requirement (often tantamount to having access to a detailed simulation) and unlikely to be available in real world applications. Hybrid RL differs from the reset setting in that (a) we have an offline dataset, but (b) our online interactions start from the initial distribution of the environment, which is not assumed to have any nice properties. Both features (offline data and a nice reset distribution) facilitate algorithm design by de-emphasizing the exploration challenge. However, Hybrid RL is much more practical since an offline dataset is much easier to obtain in practice. 
We showcase the Hybrid RL setting with a new algorithm, Hybrid Q-Learning or Hy-Q (pronounced: Haiku). The algorithm is a simple adaptation of the classical fitted Q-iteration algorithm (FQI) and accommodates value-based function approximation. For our theoretical results, we prove that Hy-Q is both statistically and computationally efficient assuming that: (1) the offline distribution covers some high quality policy, (2) the MDP has low bilinear rank, (3) the function approximator is Bellman complete, and (4) we have a least squares regression oracle. The first three are standard statistical assumptions in the RL literature, while the fourth is a widely used computational abstraction for supervised learning. No computationally efficient algorithms are known under these assumptions in pure offline or pure online settings, which highlights the advantages of the hybrid setting. We also implement Hy-Q and evaluate it on two challenging RL benchmarks: a rich observation combination lock (Misra et al., 2020) and Montezuma's Revenge from the Arcade Learning Environment (Bellemare et al., 2013). Starting with an offline dataset that contains some transitions from a high quality policy, our approach outperforms: an online RL baseline with theoretical guarantees, an online deep RL baseline tuned for Montezuma's Revenge, pure offline RL baselines, imitation learning baselines, and existing hybrid methods. Compared to the online methods, Hy-Q requires only a small fraction of the online experience, demonstrating its sample efficiency. Compared to the offline and hybrid methods, Hy-Q performs most favorably when the offline dataset also contains many interactions from low quality policies, demonstrating its robustness. These results reveal the significant benefits that can be realized by combining offline and online data.

2. RELATED WORKS

We discuss related works from four categories: pure online RL, online RL with access to a reset distribution, offline RL, and prior work in hybrid settings. We note that pure online RL refers to the setting where one can only reset the system to the initial state distribution of the environment, which is not assumed to provide any form of coverage.

Pure online RL. Beyond tabular settings, many existing statistically efficient RL algorithms are not computationally tractable, due to the difficulty of implementing optimism. This is true in the linear MDP (Jin et al., 2020) with large action spaces, the linear Bellman complete model (Zanette et al., 2020; Agarwal et al., 2019), and in the general function approximation setting (Jiang et al., 2017; Sun et al., 2019; Du et al., 2021; Jin et al., 2021a). These computational challenges have inspired results on the intractability of aspects of online RL (Dann et al., 2018; Kane et al., 2022). There are several online RL algorithms that aim to tackle the computational issue via stronger structural assumptions and supervised learning-style computational oracles (Misra et al., 2020; Zhang et al., 2022c; Agarwal et al., 2020a; Uehara et al., 2021; Modi et al., 2021; Zhang et al., 2022a; Qiu et al., 2022). Compared to these oracle-based methods, our approach operates in the more general "bilinear rank" setting and relies on a standard supervised learning primitive: least squares regression. Notably, our oracle admits efficient implementation with linear function approximation, so we obtain an end-to-end computational guarantee; this is not true for prior oracle-based methods. There are many deep RL methods for the online setting (e.g., Schulman et al. (2015; 2017); Lillicrap et al. (2016); Haarnoja et al. (2018); Schrittwieser et al. (2020)). Apart from a few exceptions (e.g., Burda et al. (2018); Badia et al. (2020); Guo et al. (2022)), most rely on random exploration and are not capable of strategic exploration.
In our experiments, we test our approach on Montezuma's Revenge, and we pick RND (Burda et al., 2018) as a deep RL exploration baseline due to its effectiveness.

Online RL with reset distributions. When an exploratory reset distribution is available, a number of statistically and computationally efficient algorithms are known. The classic algorithms are CPI (Kakade & Langford, 2002), PSDP (Bagnell et al., 2003), Natural Policy Gradient (Kakade, 2001; Agarwal et al., 2020b), and POLYTEX (Abbasi-Yadkori et al., 2019). Uchendu et al. (2022) recently demonstrated that algorithms like PSDP work well when equipped with modern neural network function approximators. However, these algorithms (and their analyses) rely heavily on the reset distribution to mitigate the exploration challenge, and such a reset distribution is typically unavailable in practice unless one also has a simulator. In contrast, we assume the offline data covers some high quality policy, which helps with exploration, but we do not require an exploratory reset distribution. This makes the hybrid setting much more practically appealing.

Offline RL. Offline RL methods learn policies solely from a given offline dataset, with no interaction whatsoever. When the dataset has global coverage, algorithms such as FQI (Munos & Szepesvári, 2008; Chen & Jiang, 2019) or certainty-equivalence model learning (Ross & Bagnell, 2012) can find near-optimal policies in an oracle-efficient manner, via least squares or model-fitting oracles.
However, with only partial coverage, existing methods either (a) are not computationally efficient, due to the difficulty of implementing pessimism both in linear settings with large action spaces (Jin et al., 2021b; Zhang et al., 2022b; Chang et al., 2021) and in general function approximation settings (Uehara & Sun, 2021; Xie et al., 2021a; Jiang & Huang, 2020; Chen & Jiang, 2022; Zhan et al., 2022), or (b) require strong representation conditions such as policy-based Bellman completeness (Xie et al., 2021a; Zanette et al., 2021). In contrast, in the hybrid setting, we obtain an efficient algorithm under the more natural condition of completeness w.r.t. the Bellman optimality operator only. Among the many empirical offline RL methods (e.g., Kumar et al. (2020); Yu et al. (2021); Kostrikov et al. (2021); Fujimoto & Gu (2021)), we use CQL (Kumar et al., 2020) as a baseline in our experiments, since it has been shown to work in image-based control settings such as Atari games.

Online RL with offline datasets. Ross & Bagnell (2012) developed a model-based algorithm for a similar hybrid setting. In comparison, our approach is model-free and consequently more suitable for high-dimensional state spaces (e.g., raw-pixel images). Xie et al. (2021b) studied hybrid RL and showed that offline data does not yield statistical improvements in tabular MDPs. Our work instead focuses on the function approximation setting and demonstrates computational benefits of hybrid RL. On the empirical side, several works consider combining offline expert demonstrations with online interaction (Rajeswaran et al., 2017; Hester et al., 2018; Nair et al., 2018; 2020; Vecerik et al., 2017). A common challenge in this setting is robustness to low-quality offline data. Previous works mostly focus on expert demonstrations and have no rigorous guarantees of such robustness. In fact, Nair et al.
(2020) showed that such degradation in performance indeed happens in practice with low-quality offline data. In our experiments, we observe that DQfD (Hester et al., 2018) exhibits a similar degradation. Our algorithm, on the other hand, is robust to the quality of the offline data. Note that the core idea of our algorithm is similar to that of Vecerik et al. (2017), who adapt DDPG to the setting of combining RL with expert demonstrations for continuous control. Although Vecerik et al. (2017) provide no theoretical results, it may be possible to combine our theoretical insights with existing analyses of policy gradient methods to establish guarantees for their algorithm in the hybrid RL setting. We also include a detailed comparison with previous empirical work in Appendix D.

3. PRELIMINARIES

We consider a finite-horizon Markov decision process M = (S, A, H, R, P, d_0), where S is the state space, A is the action space, H denotes the horizon, the stochastic reward R(s, a) ∈ Δ([0, 1]) and transition P(s, a) ∈ Δ(S) are the reward and next-state distributions at (s, a), and d_0 ∈ Δ(S) is the initial distribution. We assume the agent can only reset from d_0 (at the beginning of each episode). Since the optimal policy is non-stationary in this setting, we define a policy π := {π_0, ..., π_{H-1}}, where π_h : S → Δ(A). Given π, d^π_h ∈ Δ(S × A) denotes the state-action occupancy induced by π at step h. We define the state and state-action value functions in the usual manner: V^π_h(s) = E[Σ_{τ=h}^{H-1} r_τ | π, s_h = s] and Q^π_h(s, a) = E[Σ_{τ=h}^{H-1} r_τ | π, s_h = s, a_h = a]. Q⋆ and V⋆ denote the optimal value functions. We define the Bellman operator T such that for any f : S × A → R,

    T f(s, a) = E[R(s, a)] + E_{s′ ∼ P(s,a)}[ max_{a′} f(s′, a′) ]   ∀ s, a.

We assume that for each h we have an offline dataset of m_off samples (s, a, r, s′) drawn i.i.d. via (s, a) ∼ ν_h, r ∼ R(s, a), s′ ∼ P(s, a). Here ν = {ν_0, ..., ν_{H-1}} denotes the corresponding offline data distributions. For a dataset D, we use Ê_D[·] to denote the sample average over D. For our theoretical results, we will assume that ν covers some high-quality policy. We consider the value-based function approximation setting, where we are given a function class F = F_0 × ··· × F_{H-1} with F_h ⊂ S × A → [0, V_max] that we use to approximate the value functions of the underlying MDP. For ease of notation, we define f = {f_0, ..., f_{H-1}} and let π^f denote the greedy policy w.r.t. f, which chooses actions as π^f_h(s) = argmax_a f_h(s, a).

Algorithm 1: Hybrid Q-learning using both offline and online data (Hy-Q)

Require: value class F, number of iterations T, offline datasets D^ν_h of size m_off = T for h ∈ [H-1].
1: Initialize f^1_h(s, a) = 0 for all h.
2: for t = 1, ..., T do
3:   Let π^t be the greedy policy w.r.t. f^t, i.e., π^t_h(s) = argmax_a f^t_h(s, a).
4:   For each h, collect m_on = 1 online tuple D^t_h ∼ d^{π^t}_h.   // online collection
5:   Set f^{t+1}_H(s, a) = 0.   // FQI using both online and offline data
6:   for h = H-1, ..., 0 do
7:     Estimate f^{t+1}_h via least squares regression on the aggregated data D_h^t = D^ν_h + Σ_{τ=1}^t D^τ_h:

         f^{t+1}_h ← argmin_{f ∈ F_h} Ê_{D_h^t}[ (f(s, a) - r - max_{a′} f^{t+1}_{h+1}(s′, a′))² ]   (1)

8:   end for
9: end for
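As a concrete reference, the loop structure of Algorithm 1 can be sketched in Python. The environment interface (`reset`/`step`) and the `fit_q` least squares regression oracle are illustrative assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def hy_q(env, offline_data, fit_q, T, H, n_actions):
    """Minimal sketch of Algorithm 1 (Hy-Q).

    offline_data[h]: list of (s, a, r, s_next) tuples drawn from nu_h.
    fit_q(data, targets): assumed regression oracle that returns a
        function q(s, a) -> float fit by least squares.
    """
    # f[h] maps (s, a) -> value; initialize to zero everywhere (line 1).
    f = [lambda s, a: 0.0 for _ in range(H + 1)]
    online_data = [[] for _ in range(H)]

    for t in range(T):
        # Greedy policy w.r.t. the current estimate f (line 3).
        def pi(s, h):
            return int(np.argmax([f[h](s, a) for a in range(n_actions)]))

        # Online collection: one on-policy tuple per step h (line 4).
        s = env.reset()
        for h in range(H):
            a = pi(s, h)
            s_next, r, _ = env.step(a)
            online_data[h].append((s, a, r, s_next))
            s = s_next

        # FQI backward pass over offline + all online data (lines 5-8).
        f_next = [None] * (H + 1)
        f_next[H] = lambda s, a: 0.0
        for h in reversed(range(H)):
            data = offline_data[h] + online_data[h]
            targets = [r + max(f_next[h + 1](sp, ap) for ap in range(n_actions))
                       for (_, _, r, sp) in data]
            f_next[h] = fit_q(data, targets)
        f = f_next
    return f
```

With a tabular `fit_q` (per-(s, a) averaging), this reduces to tabular Q-iteration on the aggregated data.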

4. HYBRID Q-LEARNING

In this section, we present our algorithm, Hybrid Q-Learning (Hy-Q), in Algorithm 1. Hy-Q takes as input an offline dataset D^ν of (s, a, r, s′) tuples and a Q-function class F ⊂ S × A → [0, H], and outputs a policy that optimizes the given reward function. The algorithm is conceptually simple: it iteratively executes the Fitted Q Iteration (FQI) procedure (lines 5-8) using the offline dataset and on-policy samples generated by the learned policies. Specifically, at iteration t, we have an estimate f^t of the Q⋆ function and we set π^t to be the greedy policy for f^t. We execute π^t to collect a dataset D^t_h of online samples in line 4. Then we run FQI, a dynamic-programming-style algorithm, on both the offline dataset D^ν and all previously collected online samples {D^τ_h}^t_{τ=1}. The FQI update works backward from time step H to 0 and computes f^{t+1}_h via least squares regression with input (s, a) and regression target r + max_{a′} f^{t+1}_{h+1}(s′, a′). Let us make several remarks. Intuitively, the FQI updates in Hy-Q try to ensure that the estimate f^t has small Bellman error under both the offline distribution ν and the online distributions d^{π^t}. The standard offline version of FQI ensures the former, but this alone is insufficient when the offline dataset has poor coverage; indeed, FQI may perform poorly in such cases (see examples in Zhan et al., 2022; Chen & Jiang, 2022). The key insight in Hy-Q is to use online interaction to ensure that we also have small Bellman error on d^{π^t}. As we will see, the moment we find an f^t that has small Bellman error on both the offline distribution ν and its own greedy policy's distribution d^{π^t}, FQI guarantees that π^t will be at least as good as any policy covered by ν.
This observation results in an explore-or-terminate phenomenon: either f t has small Bellman error on its distribution and we are done, or d π t must be significantly different from distributions we have seen previously and we make progress. Crucially, no explicit exploration is required for this argument, which is precisely how we avoid the computational difficulties with implementing optimism. Another important point pertains to catastrophic forgetting. We will see that the size of the offline dataset m off should be comparable to the total amount of online data {D τ h } T τ =1 , so that the two terms in Eq. 1 have similar weight and we ensure low Bellman error on ν throughout the learning process. In practice, we implement this by having all model updates use a fixed proportion of offline samples even as we collect more online data, so that we do not "forget" the distribution ν. This is quite different from warm-starting with D ν and then switching to online RL, which may result in catastrophic forgetting due to a vanishing proportion of offline samples being used for model training as we collect more online samples. We note that this balancing scheme is analogous to and inspired by the one used by Ross & Bagnell (2012) in the context of model-based RL with a reset distribution. Previously, similar techniques have also been explored for various applications (for example, see Appendix F.3 of Kalashnikov et al. (2018) ). As in Ross & Bagnell (2012) , a key practical insight from our analysis is that the offline data should be used throughout training to avoid catastrophic forgetting.
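The fixed-proportion balancing described above can be sketched as a minibatch sampler: every update draws a constant fraction of its batch from the offline buffer, no matter how large the online buffer grows. The function name and the 50/50 default are our illustrative choices, not values prescribed by the paper:

```python
import random

def sample_balanced_batch(offline_buffer, online_buffer, batch_size=256,
                          offline_frac=0.5):
    """Draw a minibatch with a fixed proportion of offline samples.

    Keeping offline_frac constant as the online buffer grows is the
    guard against "forgetting" the offline distribution nu; warm-starting
    and then sampling uniformly from the merged buffer would let the
    offline fraction vanish over time.
    """
    n_off = int(batch_size * offline_frac)
    n_on = batch_size - n_off
    # Sample with replacement from each buffer, then shuffle together.
    batch = random.choices(offline_buffer, k=n_off)
    if online_buffer:
        batch += random.choices(online_buffer, k=n_on)
    random.shuffle(batch)
    return batch
```

Note that with uniform sampling from the merged buffer, 1000 online tuples against 10 offline tuples would give the offline data only ~1% of each batch; the sampler above pins it at 50%.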

5. THEORETICAL ANALYSIS: LOW BILINEAR RANK MODELS

In this section we present the main theoretical guarantees for Hy-Q. We start by stating the main assumptions and definitions for the function approximator, the offline data distribution, and the MDP, and then provide some discussion.

Assumption 1 (Realizability and Bellman completeness). For any h, we have Q⋆_h ∈ F_h. Additionally, for any f_{h+1} ∈ F_{h+1}, we have T f_{h+1} ∈ F_h.

Definition 1 (Bellman error transfer coefficient). For any policy π, define the transfer coefficient as

    C_π := max{ 0,  max_{f ∈ F}  ( Σ_{h=0}^{H-1} E_{s,a ∼ d^π_h}[ T f_{h+1}(s, a) - f_h(s, a) ] )
                              / sqrt( Σ_{h=0}^{H-1} E_{s,a ∼ ν_h}[ (T f_{h+1}(s, a) - f_h(s, a))² ] ) }.

Definition 2 (Bilinear model (Du et al., 2021)). We say that the MDP together with the function class F is a bilinear model of rank d if for any h ∈ [H-1], there exist two (unknown) mappings X_h, W_h : F → R^d with max_f ||X_h(f)||_2 ≤ B_X and max_f ||W_h(f)||_2 ≤ B_W such that

    ∀ f, g ∈ F :  | E_{s,a ∼ d^{π_f}_h}[ g_h(s, a) - T g_{h+1}(s, a) ] |  =  | ⟨X_h(f), W_h(g)⟩ |.

All concepts defined above are frequently used in the statistical analysis of RL methods with function approximation. Realizability is the most basic function approximation assumption, but is known to be insufficient for offline RL (Foster et al., 2021) unless other strong assumptions hold (Xie & Jiang, 2021; Zhan et al., 2022; Chen & Jiang, 2022). Completeness is the most standard strengthening of realizability; it is used routinely in both online (Jin et al., 2021a) and offline RL (Munos & Szepesvári, 2008; Chen & Jiang, 2019) and is known to hold in several settings, including the linear MDP and the linear quadratic regulator. These assumptions ensure that the dynamic programming updates of FQI are stable in the presence of function approximation. The transfer coefficient definition above is somewhat non-standard, but is actually weaker than related notions used in prior offline RL results. First, the average Bellman error appearing in the numerator is weaker than the squared Bellman error notion of Xie et al. (2021a); a simple calculation shows that C²_π is upper bounded by their coefficient. Second, by using Bellman errors, both of these are bounded by notions involving density ratios (Kakade & Langford, 2002; Munos & Szepesvári, 2008; Chen & Jiang, 2019). Finally, many works, particularly those that do not employ pessimism (Munos & Szepesvári, 2008; Chen & Jiang, 2019), require "all-policy" analogs, which place a much stronger requirement on the offline data distribution ν. In contrast, we only ask that C_π is small for some high-quality policy that we hope to compete with (see Appendix A.5 for more details). Lastly, the bilinear model was developed in a series of works (Jiang et al., 2017; Jin et al., 2021a; Du et al., 2021) on sample-efficient online RL. The setting is known to capture a wide class of models, including linear MDPs, linear Bellman complete models, low-rank MDPs, reactive POMDPs, and more. As a technical note, the main paper focuses on the "Q-type" version of the bilinear model, but the algorithm and proofs easily extend to the "V-type" version; see Appendix C for details.

Theorem 1 (Cumulative suboptimality). Fix δ ∈ (0, 1), let m_off = T and m_on = 1, and suppose that the function class F satisfies Assumption 1 and, together with the underlying MDP, admits bilinear rank d. Then with probability at least 1 - δ, Algorithm 1 obtains the following bound on cumulative suboptimality w.r.t. any comparator policy π^e:

    Σ_{t=1}^{T} ( V^{π^e} - V^{π^t} ) = O( max{C_{π^e}, 1} · V_max · sqrt( d H² T · log(|F|/δ) ) ),

where π^t = π_{f^t} is the greedy policy w.r.t. f^t at round t. A standard online-to-batch conversion (Shalev-Shwartz & Ben-David, 2014) immediately gives the following sample complexity guarantee for Algorithm 1 for finding an ϵ-suboptimal policy w.r.t. the optimal policy π* of the underlying MDP.

Corollary 1 (Sample complexity). Under the assumptions of Theorem 1, if C_{π*} < ∞, then Algorithm 1 can find an ϵ-suboptimal policy π, i.e. one with V^{π*} - V^π ≤ ϵ, with total sample complexity

    n = O( V²_max C²_{π*} H³ d log(|F|/δ) / ϵ² ).

The results formalize the statistical properties of Hy-Q. In terms of sample complexity, a somewhat unique feature of the hybrid setting is that both the transfer coefficient and the bilinear rank are relevant, whereas these (or related) parameters typically appear in isolation in offline and online RL, respectively. In terms of coverage, Theorem 1 highlights an "oracle property" of Hy-Q: it competes with any policy that is sufficiently covered by the offline dataset. We also highlight the computational efficiency of Hy-Q: it only requires solving least squares problems over the function class F. To our knowledge, no purely online or purely offline methods are known to be efficient in this sense, except under much stronger "uniform" coverage conditions.
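In small tabular problems, the transfer coefficient of Definition 1 can be evaluated exactly. The sketch below does so for a finite function class, given precomputed per-step Bellman errors and occupancy/offline weights; this toy interface is ours, purely for illustration:

```python
import numpy as np

def transfer_coefficient(F_bellman_errors, d_pi, nu):
    """Evaluate the Bellman error transfer coefficient of Definition 1.

    F_bellman_errors: for each candidate f in the finite class F, a list
        of per-step arrays bell[h][s, a] = T f_{h+1}(s, a) - f_h(s, a).
    d_pi, nu: arrays of shape (H, S, A) giving the comparator policy's
        occupancy d^pi_h and the offline weights nu_h.
    """
    best = 0.0
    for bell in F_bellman_errors:
        H = len(bell)
        # Numerator: summed average Bellman error under d^pi.
        num = sum((d_pi[h] * bell[h]).sum() for h in range(H))
        # Denominator: root of summed squared Bellman error under nu.
        den = np.sqrt(sum((nu[h] * bell[h] ** 2).sum() for h in range(H)))
        if den > 0:
            best = max(best, num / den)
    return best  # the outer max{0, .} is the initialization best = 0.0
```

Intuitively, C_π is small whenever every function with large on-policy Bellman error under d^π also has large squared Bellman error under ν, i.e., whenever ν "sees" the errors that matter for π.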

5.1. THE LINEAR BELLMAN COMPLETENESS MODEL

We next showcase one example of a low bilinear rank model: the popular linear Bellman complete model, which captures the linear MDP (Yang & Wang, 2019; Jin et al., 2020), and instantiate the sample complexity bound of Corollary 1.

Definition 3. Given a feature function ϕ : S × A → B_d(1), a model admits linear Bellman completeness if for any w ∈ B_d(B_W), there exists a w′ ∈ B_d(B_W) such that

    ∀ s, a : ⟨w′, ϕ(s, a)⟩ = E[R(s, a)] + E_{s′ ∼ P(s,a)}[ max_{a′} ⟨w, ϕ(s′, a′)⟩ ].

Note that the above condition implies that Q⋆_h(s, a) = ⟨w⋆_h, ϕ(s, a)⟩ with ||w⋆_h||_2 ≤ B_W. Thus, we can define the function class F_h = {⟨w_h, ϕ(s, a)⟩ : w_h ∈ R^d, ||w_h||_2 ≤ B_W}, which by inspection satisfies Assumption 1. Additionally, this model is known to have bilinear rank at most d (Du et al., 2021). Thus, using Corollary 1 we immediately get the following guarantee:

Lemma 1. Let δ ∈ (0, 1), suppose the MDP is linear Bellman complete with C_{π*} < ∞, and consider F_h defined above. Then, with probability 1 - δ, Algorithm 1 finds an ϵ-suboptimal policy with total sample complexity n = O( B²_W C²_{π*} H⁴ d² log(1/δ) / ϵ² ).

Remark 1 (Computational efficiency). For linear Bellman complete models, Algorithm 1 can be implemented efficiently under mild assumptions. For the class F in Lemma 1, the regression problem in (1) reduces to least squares linear regression with a norm constraint on the weight vector. This regression problem can be solved by convex programming with computational cost scaling polynomially in the number of parameters (d here) (Bubeck et al., 2015), whenever max_a f_{h+1}(s, a) (or argmax_a f_{h+1}(s, a)) can be computed efficiently.

Remark 2 (Linear MDPs). Since linear Bellman complete models generalize linear MDPs (Yang & Wang, 2019; Jin et al., 2020), as discussed above, Algorithm 1 can be implemented efficiently whenever max_a f_{h+1}(s, a) can be computed efficiently.
The latter is tractable in the following cases:
• When |A| is small/finite, one can enumerate the actions to compute max_a f_{h+1}(s, a) for any s, and thus (1) can be implemented efficiently. The computational efficiency of Algorithm 1 in this case is comparable to prior works, e.g., Jin et al. (2020).
• When the set {ϕ(s, a) | a ∈ A} is convex and compact, one can use a linear optimization oracle to compute max_a f_{h+1}(s, a) = max_a w⊤_{h+1} ϕ(s, a). This linear optimization problem is itself solvable with computational cost scaling polynomially in d.
Note that even with access to a linear optimization oracle, prior works (e.g., Jin et al. (2020)) rely on bonuses of the form argmax_a ϕ(s, a)⊤w + β sqrt(ϕ(s, a)⊤ Σ ϕ(s, a)), where Σ is some positive definite matrix (e.g., the regularized feature covariance matrix). Computing such bonuses can be NP-hard (in the feature dimension d) without additional assumptions (Dani et al., 2008).

Remark 3 (Relative condition number). A common coverage metric in these linear models is the relative condition number. In Appendix A.5, we show that our coefficient C_π is upper bounded by the relative condition number of π with respect to ν; concretely, C_π ≤ max_h E_{d^π_h}[ ||ϕ||²_{Σ^{-1}_{ν_h}} ], where Σ_{ν_h} = E_{s,a ∼ ν_h}[ ϕ(s, a) ϕ(s, a)⊤ ].
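For the linear class of Lemma 1, the regression in (1) is least squares over a Euclidean norm ball, which projected gradient descent solves directly. A minimal sketch (the step-size rule and iteration budget are arbitrary choices of this sketch):

```python
import numpy as np

def constrained_lstsq(Phi, y, B_W, n_iters=2000, lr=None):
    """min_w ||Phi w - y||^2 / n  subject to  ||w||_2 <= B_W.

    Phi: (n, d) feature matrix; y: (n,) regression targets.
    The problem is convex, so projected gradient descent converges;
    the default step size is based on the spectral norm of Phi.
    """
    n, d = Phi.shape
    if lr is None:
        lr = 1.0 / (np.linalg.norm(Phi, 2) ** 2 / n + 1e-12)
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - y) / n
        w = w - lr * grad
        norm = np.linalg.norm(w)
        if norm > B_W:
            w = w * (B_W / norm)  # project back onto the B_W ball
    return w
```

When B_W is large enough, this recovers the ordinary least squares solution; otherwise the weight vector is projected onto the ball, matching the norm constraint of the class F_h in Lemma 1.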

5.2. WHY DON'T OFFLINE RL METHODS WORK?

One may wonder why pure offline RL methods fail to learn when the transfer coefficient is bounded, and why online access helps. We illustrate with the MDP construction developed by Zhan et al. (2022); Chen & Jiang (2022), visualized in Figure 1. The construction consists of two MDPs, M_1 and M_2, that differ only at state C, with optimal policy π⋆(A) = L and π⋆(B) = π⋆(C) = Uniform({L, R}). With F = {Q⋆_1, Q⋆_2}, where Q⋆_j is the optimal Q function for M_j, one can easily verify that F satisfies Bellman completeness for both MDPs. Finally, with offline distribution ν supported on states A and B only (with no coverage of state C), we have sufficient coverage of d^{π⋆}. However, samples from ν are unable to distinguish between f_1 and f_2 (or M_1 and M_2), since state C is not supported by ν. Unfortunately, adversarial tie-breaking may result in the greedy policies of f_1 and f_2 visiting state C, where we have no information about the correct action. This issue has been documented before, and in order to address it with pure offline RL, existing approaches require additional structural assumptions. For instance, Chen & Jiang (2022) assume that Q⋆ has a gap, which usually does not hold when the action space is large or continuous. Xie et al. (2021a) assume policy-dependent Bellman completeness for every possible policy π ∈ Π (which is much stronger than our assumption), and Zhan et al. (2022) assume a somewhat non-interpretable realizability assumption on a "value" function that does not obey the standard Bellman equation. In contrast, by combining offline and online data, our approach focuses on functions that have small Bellman residual under both the offline distribution and the on-policy distributions, which, together with the offline data coverage assumption, ensures near optimality. It is easy to see that the hybrid approach succeeds in the example of Figure 1.
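The indistinguishability at the heart of this argument can be checked numerically. The Q-tables below are invented for illustration (they are not the exact values of the Figure 1 construction): two candidates agree everywhere the offline distribution has support, yet their greedy policies act differently at the uncovered state:

```python
import numpy as np

# States: 0 = A, 1 = B, 2 = C; actions: 0 = L, 1 = R.
# Invented values, chosen only so that the two candidates
# agree on {A, B} but disagree at the uncovered state C.
q1 = np.array([[1.0, 0.5],
               [1.0, 1.0],
               [1.0, 0.0]])   # greedy at C: L
q2 = np.array([[1.0, 0.5],
               [1.0, 1.0],
               [0.0, 1.0]])   # greedy at C: R

nu_support = [0, 1]           # offline data never visits C

# Any offline loss (e.g., squared Bellman error estimated on nu's
# support) assigns identical values to q1 and q2 ...
assert np.array_equal(q1[nu_support], q2[nu_support])

# ... yet their greedy policies disagree exactly where nu is blind.
assert q1[2].argmax() != q2[2].argmax()
```

A single on-policy rollout that reaches C immediately separates the two candidates, which is precisely the extra signal the hybrid setting provides.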

6. EXPERIMENTS

In this section we discuss empirical results comparing Hy-Q to several representative RL methods on two challenging benchmarks. Our experiments focus on answering the following questions: 1. Can Hy-Q efficiently solve problems that SOTA offline RL methods simply cannot? 2. Can Hy-Q, via the use of offline data, significantly improve the sample efficiency of online RL? 3. Does Hy-Q scale to challenging deep-RL benchmarks? Our empirical results provide positive answers to all of these questions. To study the first two, we consider the diabolical combination lock environment (Misra et al., 2020; Zhang et al., 2022c), a synthetic environment designed to be particularly challenging for online exploration. The synthetic nature allows us to carefully control the offline data distribution to modulate the difficulty of the setup, and also to compare with a provably efficient baseline (Zhang et al., 2022c). To study the third question, we consider the Montezuma's Revenge benchmark from the Arcade Learning Environment, one of the most challenging empirical benchmarks with high-dimensional image inputs, largely due to the difficulty of exploration. Additional details are deferred to Appendix E.

Hy-Q implementation. We largely follow Algorithm 1 in our implementation for the combination lock experiment. In particular, we use function approximation similar to Zhang et al. (2022c), and minibatch Adam updates on Eq. (1) with the same sampling proportions as in the pseudocode. For Montezuma's Revenge, in addition to minibatch optimization, since the horizon of the environment is not fixed, we deploy a discounted version of Hy-Q: the target value in the Bellman error is computed from the output of a periodically updated target network, multiplied by a discount factor. We refer the reader to Appendix E for more details.

Baselines.
We include representative algorithms from four categories: (1) for imitation learning, Behavior Cloning (BC) (Bain & Sammut, 1995); (2) for offline RL, Conservative Q-Learning (CQL) (Kumar et al., 2020); (3) for online RL, BRIEE (Zhang et al., 2022c) for the combination lock and Random Network Distillation (RND) (Burda et al., 2018) for Montezuma's Revenge; and (4) as a hybrid-RL baseline, Deep Q-learning from Demonstrations (DQFD) (Hester et al., 2018). We note that DQFD and prior hybrid RL methods combine expert demonstrations with online interactions, but are not necessarily designed to work with general offline datasets.
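The discounted variant described above replaces the finite-horizon target with a discounted one computed from a frozen, periodically synced target network. A sketch of the target computation (the batch format, `target_q` interface, and γ = 0.99 default are assumptions of this sketch):

```python
import numpy as np

def discounted_targets(batch, target_q, gamma=0.99):
    """Compute r + gamma * (1 - done) * max_a' Q_target(s', a').

    batch: list of (s, a, r, s_next, done) tuples.
    target_q(s): returns the vector of action values from the frozen
        target network (synced with the online network periodically).
    """
    return np.array([r + gamma * (0.0 if done else np.max(target_q(sp)))
                     for (_, _, r, sp, done) in batch])
```

These targets then play the role of r + max_{a′} f^{t+1}_{h+1}(s′, a′) in the regression (1), with the target network standing in for the next-step value function when the horizon is not fixed.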

6.1. COMBINATION LOCK

The combination lock benchmark is depicted in Figure 3 and consists of horizon H = 100, three latent states per time step, and 10 actions in each state. Each state has a single "good" action that advances down a chain of favorable states from which the optimal reward can be obtained; a single incorrect action transitions to an absorbing chain with suboptimal value. The agent operates on high-dimensional observations and must use function approximation to succeed. This is an extremely challenging problem on which many deep RL methods are known to fail (Misra et al., 2020), in part because (uniform) random exploration has only a 10^{-H} probability of obtaining the optimal reward. On the other hand, the model has low bilinear rank, so online RL algorithms that are provably sample-efficient do exist: BRIEE currently obtains state-of-the-art sample complexity. However, its sample complexity is still quite large, and we hope that hybrid RL can address this shortcoming. We are not aware of any experiments with offline RL methods on this benchmark. We construct two offline datasets for the experiments, both derived from the optimal policy π⋆. For the optimal trajectory dataset we collect full trajectories by following π⋆ with ϵ-greedy exploration (ϵ = 1/H). For the optimal occupancy dataset we collect transition tuples from the state-occupancy measure of π⋆ with random actions. Both datasets have bounded concentrability coefficients (and hence transfer coefficients) with respect to π⋆, but the second dataset is much more challenging, since the actions do not directly provide information about π⋆ as they do in the former. The results are presented in Figure 2, where "Offline" denotes the average trajectory reward in the offline dataset and the y-axis shows a moving average over 100 episodes for the methods involving online interaction (note that CQL and BC overlap in the last plot).
First, we observe that Hy-Q reliably solves the task under both offline distributions with relatively low sample complexity (500k offline samples and at most 25M online samples). In comparison, BC fails completely, since both datasets contain random actions. CQL can solve the task using the trajectory-based dataset with a sample complexity comparable to the combined sample size of Hy-Q. However, CQL fails on the occupancy-based dataset, since the actions themselves are not informative. Indeed, the pessimism-inducing regularizer of CQL is constant on this dataset, so the algorithm reduces to FQI. Finally, Hy-Q solves the task with a factor of 5-10 reduction in (online and offline) samples compared with BRIEE. This demonstrates the robustness and sample efficiency provided by hybrid RL.

6.2. MONTEZUMA'S REVENGE

To answer the third question, we turn to Montezuma's Revenge, an extremely challenging image-based benchmark environment with sparse rewards. We follow the setup from Burda et al. (2018) and introduce stochasticity into the original dynamics: with probability 0.25 the environment executes the previous action instead of the current one. For offline datasets, we first train an "expert policy" $\pi^e$ via RND to achieve $V^{\pi^e} \approx 6400$. We create three datasets by mixing samples from $\pi^e$ with those from a random policy: the easy dataset contains only samples from $\pi^e$, the medium dataset mixes in an 80/20 proportion (80% from $\pi^e$), and the hard dataset mixes in a 50/50 proportion. Here we record full trajectories from both policies in the offline dataset, but measure the proportion using the number of transition tuples rather than trajectories. We provide 0.1 million offline samples to the hybrid methods, and 1 million samples to the offline and IL methods. Results are displayed in Figure 4. CQL fails completely on all datasets. DQFD performs well on the easy dataset due to its large-margin loss (Piot et al., 2014), which imitates the policies in the offline dataset. However, DQFD's performance drops as the quality of the offline dataset degrades (medium), and it fails when the offline dataset is low quality (hard). We also observe that BC is a competitive baseline in the first two settings, and thus we view these problems as relatively easy to solve. Hy-Q is the only method that performs well on the hard dataset, where BC's performance is quite poor. We also include a comparison with RND in Figure 5: with only 100k offline samples from any of the three datasets, Hy-Q is over 10x more efficient in terms of online sample complexity.
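The sticky-action stochasticity described above can be sketched as a thin wrapper around any action sequence. This is an illustrative sketch of the mechanism from Burda et al. (2018), not our actual environment code; the helper name and interface are assumptions.

```python
import random

def sticky_rollout(actions, p_sticky=0.25, seed=0):
    """Sketch of sticky actions: with probability p_sticky the environment
    executes the previously executed action instead of the current one.
    Returns the list of actions actually executed."""
    rng = random.Random(seed)
    executed = []
    prev = None
    for a in actions:
        if prev is not None and rng.random() < p_sticky:
            executed.append(prev)   # repeat the previous executed action
        else:
            executed.append(a)
        prev = executed[-1]
    return executed

# With all-distinct intended actions, a mismatch marks exactly one sticky event.
intended = list(range(10000))
executed = sticky_rollout(intended)
sticky_frac = sum(e != i for e, i in zip(executed, intended)) / len(intended)
```

Over a long rollout, roughly a quarter of the executed actions are repeats, which breaks open-loop memorization and forces closed-loop policies.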

7. CONCLUSION

We demonstrate the potential of hybrid RL with Hy-Q, a simple, theoretically principled, and empirically effective algorithm. Our theoretical results showcase how Hy-Q circumvents the computational issues of pure offline or online RL, while our empirical results highlight its robustness and sample efficiency. Yet Hy-Q is perhaps the most natural hybrid algorithm, and we are optimistic that there is much more potential to unlock from the hybrid setting. We look forward to studying this in the future. Reproducibility Statement. For our theoretical results, we provide detailed proofs in the Appendices. For the experiments, we submit anonymous code in the supplemental materials. Our offline datasets can be reproduced with the attached instructions, and our results can be reproduced with the given random seeds. We include implementation, environment, and computation hardware details in the Appendices, along with hyperparameters for both our method and the baselines. We also open-source our code at https://github.com/yudasong/HyQ.

A PROOFS FOR SECTION 5

Additional notation. Throughout the appendix, we define the feature covariance matrix $\Sigma_{t;h}$ as $\Sigma_{t;h} = \sum_{\tau=1}^{t} X_h(f^\tau)(X_h(f^\tau))^\top + \lambda I$. Furthermore, given a distribution $\beta \in \Delta(S \times A)$ and a function $f : S \times A \to \mathbb{R}$, we denote its weighted $\ell_2$ norm by $\|f\|_{2,\beta}^2 := \mathbb{E}_{s,a\sim\beta}\big[f^2(s,a)\big]$.
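These two objects are straightforward to compute on finite data; the following sketch instantiates them with random vectors standing in for the unknown maps $X_h(f^\tau)$ and with a finite state-action space (both are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, lam = 4, 20, 1.0

# Sigma_{t;h} = sum_tau X_h(f^tau) X_h(f^tau)^T + lam * I, with random
# vectors standing in for X_h(f^tau).
X = rng.normal(size=(t, d))
Sigma = X.T @ X + lam * np.eye(d)

# For lam > 0, Sigma is positive definite: X^T X is PSD, so every
# eigenvalue of Sigma is at least lam, and Sigma is invertible.
min_eig = float(np.linalg.eigvalsh(Sigma).min())

# Weighted l2 norm ||f||_{2,beta}^2 = E_{s,a~beta}[f(s,a)^2] over a finite
# space: a beta-weighted average of squared function values.
f_vals = rng.normal(size=100)
beta = rng.dirichlet(np.ones(100))
weighted_norm_sq = float(np.sum(beta * f_vals**2))
```

The regularizer $\lambda I$ is what guarantees invertibility of $\Sigma_{t;h}$ in the lemmas below, even before enough directions have been explored.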

A.1 SUPPORTING LEMMAS FOR THEOREM 1

Before proving Theorem 1, we first present a few useful lemmas. We start with a standard least squares generalization bound, which is used by recalling that Algorithm 1 performs least squares on the empirical Bellman error. We defer the proof of Lemma 2 to Appendix B.

Lemma 2 (Least squares generalization bound). Let $R > 0$ and $\delta \in (0,1)$, and consider a sequential function estimation setting with instance space $\mathcal{X}$ and target space $\mathcal{Y}$. Let $\mathcal{H} : \mathcal{X} \to [-R, R]$ be a class of real-valued functions. Let $D = \{(x_1, y_1), \dots, (x_T, y_T)\}$ be a dataset of $T$ points where $x_t \sim \rho_t := \rho_t(x_{1:t-1}, y_{1:t-1})$, and $y_t$ is sampled via the conditional probability $p(\cdot \mid x_t)$: $y_t = h^*(x_t) + \varepsilon_t$, where the function $h^*$ satisfies approximate realizability, i.e. $\inf_{h\in\mathcal{H}} \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{x\sim\rho_t}\big[(h^*(x) - h(x))^2\big] \le \gamma$, and $\{\varepsilon_t\}_{t=1}^T$ are independent random variables such that $\mathbb{E}[y_t \mid x_t] = h^*(x_t)$. Additionally, suppose that $\max_t |y_t| \le R$ and $\max_x |h^*(x)| \le R$. Then the least squares solution $\hat{h} \leftarrow \arg\min_{h\in\mathcal{H}} \sum_{t=1}^T (h(x_t) - y_t)^2$ satisfies, with probability at least $1-\delta$,
\[ \sum_{t=1}^T \mathbb{E}_{x\sim\rho_t}\big[(\hat{h}(x) - h^*(x))^2\big] \le 3\gamma T + 256 R^2 \log(2|\mathcal{H}|/\delta). \]

The above lemma extends the standard agnostic generalization bound for least squares regression from the i.i.d. setting to the non-i.i.d. case, where the training data forms a martingale sequence. We state the result when realizability only holds approximately, up to the approximation error $\gamma$; however, in all our proofs we invoke this result with $\gamma = 0$. In the next two lemmas, we bound each part of the regret decomposition using the Bellman error of the value function $f$.

Lemma 3 (Performance difference lemma). For any function $f = (f_0, \dots, f_{H-1})$ with $f_h : S \times A \to \mathbb{R}$ for $h \in [H-1]$, we have
\[ \mathbb{E}_{s\sim d_0}\Big[\max_a f_0(s,a) - V^{\pi_f}_0(s)\Big] \le \sum_{h=0}^{H-1} \Big| \mathbb{E}_{s,a\sim d^{\pi_f}_h}\big[f_h(s,a) - \mathcal{T} f_{h+1}(s,a)\big] \Big|, \]
where we define $f_H(s,a) = 0$ for all $s,a$. Proof.
We start the proof by noting that $\pi_f^0(s) = \arg\max_a f_0(s,a)$. Then we have:
\[
\begin{aligned}
\mathbb{E}_{s\sim d_0}\big[\max_a f_0(s,a) - V^{\pi_f}_0(s)\big]
&= \mathbb{E}_{s\sim d_0}\big[\mathbb{E}_{a\sim \pi_f^0(s)} f_0(s,a) - V^{\pi_f}_0(s)\big] \\
&= \mathbb{E}_{s\sim d_0}\big[\mathbb{E}_{a\sim \pi_f^0(s)}[f_0(s,a) - \mathcal{T} f_1(s,a)]\big] + \mathbb{E}_{s\sim d_0}\big[\mathbb{E}_{a\sim \pi_f^0(s)}[\mathcal{T} f_1(s,a) - V^{\pi_f}_0(s)]\big] \\
&= \mathbb{E}_{s,a\sim d^{\pi_f}_0}\big[f_0(s,a) - \mathcal{T} f_1(s,a)\big] \\
&\quad + \mathbb{E}_{s,a\sim d^{\pi_f}_0}\Big[R(s,a) + \mathbb{E}_{s'\sim P(s,a)}\max_{a'} f_1(s',a') - R(s,a) - \mathbb{E}_{s'\sim P(s,a)} V^{\pi_f}_1(s')\Big] \\
&= \mathbb{E}_{s,a\sim d^{\pi_f}_0}\big[f_0(s,a) - \mathcal{T} f_1(s,a)\big] + \mathbb{E}_{s\sim d^{\pi_f}_1}\big[\max_a f_1(s,a) - V^{\pi_f}_1(s)\big]. \qquad (4)
\end{aligned}
\]
Recursively applying the same procedure to the second term in (4), we have
\[ \mathbb{E}_{s\sim d_0}\big[\max_a f_0(s,a) - V^{\pi_f}_0(s)\big] = \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^{\pi_f}_h}\big[f_h(s,a) - \mathcal{T} f_{h+1}(s,a)\big] + \mathbb{E}_{s\sim d^{\pi_f}_H}\big[\max_a f_H(s,a) - V^{\pi_f}_H(s)\big]. \]
Finally, for $h = H$ we recall that we set $f_H(s,a) = 0$ and $V^{\pi_f}_H = 0$ for notational simplicity. Thus,
\[ \mathbb{E}_{s\sim d_0}\big[\max_a f_0(s,a) - V^{\pi_f}_0(s)\big] = \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^{\pi_f}_h}\big[f_h(s,a) - \mathcal{T} f_{h+1}(s,a)\big] \le \sum_{h=0}^{H-1} \Big|\mathbb{E}_{s,a\sim d^{\pi_f}_h}\big[f_h(s,a) - \mathcal{T} f_{h+1}(s,a)\big]\Big|. \]

We now proceed to bound the other half of the regret decomposition:

Lemma 4. Let $\pi^e = (\pi^e_0, \dots, \pi^e_{H-1})$ be a comparator policy, and consider any value function $f = (f_0, \dots, f_{H-1})$ with $f_h : S \times A \to \mathbb{R}$. Then,
\[ \mathbb{E}_{s\sim d_0}\Big[V^{\pi^e}_0(s) - \max_a f_0(s,a)\Big] \le \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^{\pi^e}_h}\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big], \]
where we define $f_H(s,a) = 0$ for all $s,a$. Proof.
The proof is similar to that of Lemma 3, and we start with the fact that $\max_a f(s,a) \ge f(s,a')$ for all $a'$, including actions sampled from $\pi^e$:
\[
\begin{aligned}
\mathbb{E}_{s\sim d_0}\big[V^{\pi^e}_0(s) - \max_a f_0(s,a)\big]
&\le \mathbb{E}_{s,a\sim d^{\pi^e}_0}\big[Q^{\pi^e}_0(s,a) - f_0(s,a)\big] \\
&= \mathbb{E}_{s,a\sim d^{\pi^e}_0}\big[Q^{\pi^e}_0(s,a) - \mathcal{T} f_1(s,a)\big] + \mathbb{E}_{s,a\sim d^{\pi^e}_0}\big[\mathcal{T} f_1(s,a) - f_0(s,a)\big] \\
&= \mathbb{E}_{s,a\sim d^{\pi^e}_0}\Big[\mathbb{E}_{s'\sim P(s,a)}\big[V^{\pi^e}_1(s') - \max_{a'} f_1(s',a')\big]\Big] + \mathbb{E}_{s,a\sim d^{\pi^e}_0}\big[\mathcal{T} f_1(s,a) - f_0(s,a)\big] \\
&= \mathbb{E}_{s\sim d^{\pi^e}_1}\big[V^{\pi^e}_1(s) - \max_a f_1(s,a)\big] + \mathbb{E}_{s,a\sim d^{\pi^e}_0}\big[\mathcal{T} f_1(s,a) - f_0(s,a)\big]. \qquad (5)
\end{aligned}
\]
Again recursively applying the same procedure to the first term in (5), we have
\[ \mathbb{E}_{s\sim d_0}\big[V^{\pi^e}_0(s) - \max_a f_0(s,a)\big] \le \mathbb{E}_{s\sim d^{\pi^e}_H}\big[V^{\pi^e}_H(s) - \max_a f_H(s,a)\big] + \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^{\pi^e}_h}\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big], \]
and recalling that $f_H(s,a) = 0$ and $V^{\pi^e}_H = 0$, we conclude
\[ \mathbb{E}_{s\sim d_0}\big[V^{\pi^e}_0(s) - \max_a f_0(s,a)\big] \le \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^{\pi^e}_h}\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big]. \]

The following result is useful in bilinear models when we want to bound the potential functions. It follows directly from the elliptical potential lemma (Lattimore & Szepesvári, 2020, Lemma 19.4).

Lemma 5. Let $X_h(f^1), \dots, X_h(f^T) \in \mathbb{R}^d$ be a sequence of vectors with $\|X_h(f^t)\| \le B_X < \infty$ for all $t \le T$. Then,
\[ \sum_{t=1}^T \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} \le \sqrt{2dT\log\Big(1 + \frac{T B_X^2}{\lambda d}\Big)}, \]
where the matrix $\Sigma_{t;h} := \sum_{\tau=1}^t X_h(f^\tau)X_h(f^\tau)^\top + \lambda I$ for $t \in [T]$, $\lambda \ge B_X^2$, and the matrix norm is $\|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} = \sqrt{X_h(f^t)^\top \Sigma^{-1}_{t-1;h} X_h(f^t)}$.

Proof. Since $\lambda \ge B_X^2$, we have $\|X_h(f^t)\|^2_{\Sigma^{-1}_{t-1;h}} \le \frac{1}{\lambda}\|X_h(f^t)\|^2 \le 1$. Thus, using the elliptical potential lemma (Lattimore & Szepesvári, 2020, Lemma 19.4), we get
\[ \sum_{t=1}^T \|X_h(f^t)\|^2_{\Sigma^{-1}_{t-1;h}} \le 2d\log\Big(1 + \frac{T B_X^2}{\lambda d}\Big). \]
The desired bound follows from Jensen's inequality, which implies
\[ \sum_{t=1}^T \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} \le \sqrt{T\cdot\sum_{t=1}^T \|X_h(f^t)\|^2_{\Sigma^{-1}_{t-1;h}}} \le \sqrt{2Td\log\Big(1 + \frac{T B_X^2}{\lambda d}\Big)}. \]
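The elliptical potential bound of Lemma 5 can be checked numerically. The sketch below uses random unit-norm vectors as stand-ins for the unknown $X_h(f^t)$ (an illustrative assumption) and verifies that the accumulated potential stays under the stated bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 200
B_X = 1.0
lam = B_X**2  # Lemma 5 requires lam >= B_X^2

# Random unit-norm feature vectors standing in for X_h(f^t).
X = rng.normal(size=(T, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

Sigma = lam * np.eye(d)
lhs = 0.0
for t in range(T):
    x = X[t]
    # ||x||_{Sigma_{t-1}^{-1}} = sqrt(x^T Sigma_{t-1}^{-1} x)
    lhs += float(np.sqrt(x @ np.linalg.solve(Sigma, x)))
    Sigma += np.outer(x, x)  # rank-one update to Sigma_{t;h}

rhs = float(np.sqrt(2 * d * T * np.log(1 + T * B_X**2 / (lam * d))))
```

The left-hand side grows like $\sqrt{T}$ rather than $T$ because repeatedly visited directions inflate $\Sigma_{t;h}$ and shrink subsequent norms; this is exactly what controls the online term in Theorem 1.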

A.2 PROOF OF THEOREM 1

Before delving into the proof, we first state the following generalization bound for FQI.

Lemma 6 (Bellman error bound for FQI). Let $\delta \in (0,1)$, and for $h \in [H-1]$ and $t \in [T]$ let $f^{t+1}_h$ be the value function for time step $h$ estimated via least squares regression on the datasets $D^\nu_h, D^1_h, \dots, D^t_h$ in (1) at iteration $t$ of Algorithm 1. Then, with probability at least $1-\delta$, for all $h \in [H-1]$ and $t \in [T]$,
\[ \big\|f^{t+1}_h - \mathcal{T} f^{t+1}_{h+1}\big\|^2_{2,\nu_h} \le \frac{256 V_{\max}^2 \log(2HT|\mathcal{F}|/\delta)}{m_{\mathrm{off}}} =: \Delta_{\mathrm{off}}, \quad\text{and}\quad \sum_{\tau=1}^t \big\|f^{t+1}_h - \mathcal{T} f^{t+1}_{h+1}\big\|^2_{2,\mu^\tau_h} \le \frac{256 V_{\max}^2 \log(2HT|\mathcal{F}|/\delta)}{m_{\mathrm{on}}} =: \Delta_{\mathrm{on}}, \]
where $\nu_h$ denotes the offline data distribution at time $h$, and the distribution $\mu^\tau_h \in \Delta(S\times A)$ is defined such that $s,a \sim d^{\pi^\tau}_h$.

Proof. Fix $t \in [T]$, $h \in [H-1]$ and $f^{t+1}_{h+1} \in \mathcal{F}_{h+1}$, and consider the regression problem ((1) at iteration $t$ of Algorithm 1):
\[ f^{t+1}_h \leftarrow \arg\min_{f\in\mathcal{F}_h} \widehat{\mathbb{E}}_{D^t_h}\big(f(s,a) - r - \max_{a'} f^{t+1}_{h+1}(s',a')\big)^2, \]
where the dataset $D^t_h = D^\nu_h + \sum_{\tau=1}^t D^\tau_h$ consists of $n = m_{\mathrm{off}} + t\cdot m_{\mathrm{on}}$ samples $\{(x_i, y_i)\}_{i\le n}$ with $x_i = (s^i_h, a^i_h)$ and $y_i = r^i + \max_a f^{t+1}_{h+1}(s^i_{h+1}, a)$. In particular, we order $D$ such that the first $m_{\mathrm{off}}$ samples $\{(x_i,y_i)\}_{i\le m_{\mathrm{off}}}$ are $D^\nu_h$, the next $m_{\mathrm{on}}$ samples $\{(x_i,y_i)\}_{i=m_{\mathrm{off}}+1}^{m_{\mathrm{off}}+m_{\mathrm{on}}}$ are $D^1_h$, and so on, with $\{(x_i,y_i)\}_{i=m_{\mathrm{off}}+(\tau-1)m_{\mathrm{on}}+1}^{m_{\mathrm{off}}+\tau m_{\mathrm{on}}} = D^\tau_h$.
Note that: (a) for any sample $(x = (s_h, a_h),\ y = r + \max_a f^{t+1}_{h+1}(s_{h+1}, a))$ in $D$, we have
\[ \mathbb{E}[y \mid x] = \mathbb{E}_{s_{h+1}\sim P(s_h,a_h),\, r\sim R(s_h,a_h)}\Big[r + \max_a f^{t+1}_{h+1}(s_{h+1}, a)\Big] = \mathcal{T} f^{t+1}_{h+1}(s_h, a_h) = g(s_h, a_h) \]
for some $g \in \mathcal{F}_h$, where the last equality holds since the Bellman completeness assumption implies the existence of such a function $g$; (b) for any sample, $|y| \le V_{\max}$ and $|f(s,a)| \le V_{\max}$ for all $s,a$; (c) our construction of $D$ implies that for each index $i$, the sample $(x_i, y_i)$ is generated as follows: $x_i$ is sampled from a data generation scheme $\rho_i(x_{1:i-1}, y_{1:i-1})$ and $y_i$ is sampled from a conditional probability distribution $p(\cdot \mid x_i)$, as required by Lemma 2; and finally (d) the samples in $D^\nu_h$ are drawn from the offline distribution $\nu_h$, and the samples in $D^\tau_h$ are drawn such that $s_h \sim d^{\pi^\tau}_h$ and $a_h \sim \pi_{f^\tau}(s_h)$. Thus, using Lemma 2, the least squares solution $f^{t+1}_h$ satisfies
\[ \sum_{i=1}^n \mathbb{E}\Big[\big(f^{t+1}_h(s_i,a_i) - \mathcal{T} f^{t+1}_{h+1}(s_i,a_i)\big)^2 \,\Big|\, \rho_i\Big] \le 256 V_{\max}^2 \log(2|\mathcal{F}|/\delta). \]
Using property (d) above, we get
\[ m_{\mathrm{off}}\cdot\big\|f^{t+1}_h - \mathcal{T} f^{t+1}_{h+1}\big\|^2_{2,\nu_h} + m_{\mathrm{on}}\cdot\sum_{\tau=1}^t \big\|f^{t+1}_h - \mathcal{T} f^{t+1}_{h+1}\big\|^2_{2,\mu^\tau_h} \le 256 V_{\max}^2 \log(2|\mathcal{F}|/\delta), \]
where the distribution $\mu^\tau_h \in \Delta(S\times A)$ is defined by sampling $s \sim d^{\pi^\tau}_h$ and $a \sim \pi_{f^\tau}(s)$. Taking a union bound over $h \in [H-1]$ and $t \in [T]$, and bounding each term separately, gives the desired statement.

We next note a change-of-distribution lemma, which allows us to bound the expected Bellman error under the $(s,a)$ distribution generated by $f^t$ in terms of the expected squared Bellman error under the data distributions of the previous policies, which in turn is controlled via regression.

Lemma 7. For any $t \ge 0$ and $h \in [H-1]$, we have
\[ \big|\langle W_h(f^t), X_h(f^t)\rangle\big| \le \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} \sqrt{\sum_{i=1}^{t-1} \mathbb{E}_{s,a\sim d^{f^i}_h}\big(f^t_h - \mathcal{T} f^t_{h+1}\big)^2 + \lambda B_W^2}, \]
where $\Sigma_{t-1;h}$ is defined in (3) and we use the notation $d^{f^i}_h$ to denote $d^{\pi_{f^i}}_h$. Proof.
Using the Cauchy–Schwarz inequality, we get
\[
\begin{aligned}
\big|\langle W_h(f^t), X_h(f^t)\rangle\big|
&\le \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} \|W_h(f^t)\|_{\Sigma_{t-1;h}} \\
&= \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} \sqrt{W_h(f^t)^\top \Big(\sum_{i=1}^{t-1} X_h(f^i)X_h(f^i)^\top + \lambda I\Big) W_h(f^t)} \\
&= \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} \sqrt{\sum_{i=1}^{t-1} \big|\langle W_h(f^t), X_h(f^i)\rangle\big|^2 + \lambda\|W_h(f^t)\|^2} \\
&\le \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} \sqrt{\sum_{i=1}^{t-1} \big|\langle W_h(f^t), X_h(f^i)\rangle\big|^2 + \lambda B_W^2} \qquad (6) \\
&\le \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}} \sqrt{\sum_{i=1}^{t-1} \mathbb{E}_{s,a\sim d^{f^i}_h}\big(f^t_h - \mathcal{T} f^t_{h+1}\big)^2 + \lambda B_W^2},
\end{aligned}
\]
where the second-to-last inequality holds by plugging in the bound on $\|W_h(f^t)\|$, and the last line holds by Definition 2, which implies
\[ \big|\langle W_h(f^t), X_h(f^i)\rangle\big|^2 = \Big(\mathbb{E}_{s,a\sim d^{f^i}_h}\big[f^t_h - \mathcal{T} f^t_{h+1}\big]\Big)^2 \le \mathbb{E}_{s,a\sim d^{f^i}_h}\big(f^t_h - \mathcal{T} f^t_{h+1}\big)^2, \]
where the last inequality is due to Jensen's inequality.

We now have all the tools to prove Theorem 1. We first restate the bound with the exact problem-dependent parameters, assuming that $B_W$ and $B_X$ are constants hidden in the order notation below.

Theorem (Theorem 1 restated). Let $m_{\mathrm{off}} = T$ and $m_{\mathrm{on}} = 1$. Then, with probability at least $1-\delta$, the cumulative suboptimality of Algorithm 1 is bounded as
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} = O\bigg(\max\{C_{\pi^e}, 1\}\, V_{\max}\sqrt{d H^2 T \cdot \log\Big(1 + \frac{T}{d}\Big)\log\frac{HT|\mathcal{F}|}{\delta}}\bigg). \]

Proof of Theorem 1. Let $\pi^e$ be any comparator policy with bounded transfer coefficient, i.e.
\[ C_{\pi^e} := \max\Bigg\{0,\ \max_{f\in\mathcal{F}} \frac{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d^{\pi^e}_h}\big[f_h(s,a) - \mathcal{T} f_{h+1}(s,a)\big]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_h}\big(f_h(s,a) - \mathcal{T} f_{h+1}(s,a)\big)^2}}\Bigg\} < \infty. \qquad (7) \]
We start by noting that
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} = \sum_{t=1}^T \mathbb{E}_{s\sim d_0}\big[V^{\pi^e}_0(s) - V^{\pi_{f^t}}_0(s)\big] = \sum_{t=1}^T \mathbb{E}_{s\sim d_0}\big[V^{\pi^e}_0(s) - \max_a f^t_0(s,a)\big] + \sum_{t=1}^T \mathbb{E}_{s\sim d_0}\big[\max_a f^t_0(s,a) - V^{\pi_{f^t}}_0(s)\big]. \qquad (8) \]
For the first term on the right-hand side of (8), applying Lemma 4 to each $f^t$ for $1 \le t \le T$ gives
\[ \sum_{t=1}^T \mathbb{E}_{s\sim d_0}\big[V^{\pi^e}_0(s) - \max_a f^t_0(s,a)\big] \le \sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^{\pi^e}_h}\big[\mathcal{T} f^t_{h+1}(s,a) - f^t_h(s,a)\big] \le \sum_{t=1}^T C_{\pi^e}\cdot\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_h}\big(f^t_h(s,a) - \mathcal{T} f^t_{h+1}(s,a)\big)^2} \le T C_{\pi^e}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}}, \qquad (9) \]
where the second inequality follows from plugging in the definition of $C_{\pi^e}$ in (7), and the last step follows from Lemma 6. For the second term in (8), applying Lemma 3 to each $f^t$ for $1 \le t \le T$ gives
\[ \sum_{t=1}^T \mathbb{E}_{s\sim d_0}\big[\max_a f^t_0(s,a) - V^{\pi_{f^t}}_0(s)\big] \le \sum_{t=1}^T \sum_{h=0}^{H-1}\Big|\mathbb{E}_{s,a\sim d^{\pi_{f^t}}_h}\big[f^t_h(s,a) - \mathcal{T} f^t_{h+1}(s,a)\big]\Big| \qquad (10) \]
\[ = \sum_{t=1}^T \sum_{h=0}^{H-1}\big|\langle X_h(f^t), W_h(f^t)\rangle\big| \le \sum_{t=1}^T \sum_{h=0}^{H-1} \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}}\sqrt{\Delta_{\mathrm{on}} + \lambda B_W^2}, \]
where the equality follows from Definition 2, and the last step follows from Lemma 7 together with the bound in Lemma 6. Using the bound in Lemma 5 in the above, we get
\[ \sum_{t=1}^T \mathbb{E}_{s\sim d_0}\big[\max_a f^t_0(s,a) - V^{\pi_{f^t}}_0(s)\big] \le \sqrt{2dH^2\log\Big(1 + \frac{T B_X^2}{\lambda d}\Big)\cdot\big(\Delta_{\mathrm{on}} + \lambda B_W^2\big)\cdot T} \le \sqrt{2dH^2\log\Big(1 + \frac{T}{d}\Big)\cdot\big(\Delta_{\mathrm{on}} + B_X^2 B_W^2\big)\cdot T}, \qquad (11) \]
where the second inequality follows by plugging in $\lambda = B_X^2$. Combining the bounds (9) and (11), we get
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} \le T C_{\pi^e}\cdot\sqrt{H\cdot\Delta_{\mathrm{off}}} + \sqrt{2dH^2\log\Big(1 + \frac{T}{d}\Big)\cdot\big(\Delta_{\mathrm{on}} + B_X^2 B_W^2\big)\cdot T}. \]
Plugging in the values of $\Delta_{\mathrm{on}}$ and $\Delta_{\mathrm{off}}$ and using subadditivity of the square root, we get
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} \le 16 V_{\max} C_{\pi^e} T\sqrt{\frac{H}{m_{\mathrm{off}}}\log\frac{2HT|\mathcal{F}|}{\delta}} + 16 V_{\max}\sqrt{\frac{2dH^2 T}{m_{\mathrm{on}}}\log\Big(1+\frac{T}{d}\Big)\log\frac{2HT|\mathcal{F}|}{\delta}} + H B_X B_W\sqrt{2dT\log\Big(1+\frac{T}{d}\Big)}. \]
Setting $m_{\mathrm{off}} = T$ and $m_{\mathrm{on}} = 1$ in the above gives the cumulative suboptimality bound
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} = O\bigg(\max\{C_{\pi^e}, 1\}\, V_{\max}\sqrt{dH^2 T\cdot\log\Big(1+\frac{T}{d}\Big)\log\frac{HT|\mathcal{F}|}{\delta}}\bigg). \qquad (12) \]

Proof of Corollary 1. We next convert the above cumulative suboptimality bound into a sample complexity bound via a standard online-to-batch conversion. Setting $\pi^e = \pi^*$ in (12) and defining the policy $\hat\pi = \mathrm{Uniform}(\pi^1, \dots, \pi^T)$, we get
\[ \mathbb{E}\big[V^{\pi^*} - V^{\hat\pi}\big] = \frac{1}{T}\sum_{t=1}^T V^{\pi^*} - V^{\pi^t} = O\bigg(\max\{C_{\pi^*}, 1\}\, V_{\max}\sqrt{\frac{dH^2}{T}\cdot\log\Big(1+\frac{T}{d}\Big)\log\frac{HT|\mathcal{F}|}{\delta}}\bigg). \]
Thus, for $T \ge O\big(\max\{C^2_{\pi^*}, 1\}\, V_{\max}^2 d H^2 \log(HT|\mathcal{F}|/\delta)/\epsilon^2\big)$, we get $\mathbb{E}[V^{\pi^*} - V^{\hat\pi}] \le \epsilon$. In these $T$ iterations, the total number of offline samples used is
\[ m_{\mathrm{off}} = T = O\bigg(\max\{C^2_{\pi^*}, 1\}\frac{V_{\max}^2 d H^2 \log(HT|\mathcal{F}|/\delta)}{\epsilon^2}\bigg), \]
and the total number of online samples used is
\[ m_{\mathrm{on}}\cdot H\cdot T = O\bigg(\max\{C^2_{\pi^*}, 1\}\frac{V_{\max}^2 d H^3 \log(HT|\mathcal{F}|/\delta)}{\epsilon^2}\bigg), \]
where the additional $H$ factor appears because we collect $m_{\mathrm{on}}$ samples for every $h \in [H]$ in the algorithm.

A.3 V-TYPE BILINEAR RANK

Our previous results focus on the Q-type bilinear model. Here we provide the V-type bilinear rank definition, which is essentially the same as the low Bellman rank model proposed by Jiang et al. (2017).

Algorithm 2 V-type Hy-Q
Require: Value function class $\mathcal{F}$, number of iterations $T$, offline dataset $D^\nu_h$ of size $m_{\mathrm{off}}$ for $h \in [H-1]$.
1: Initialize $f^1_h(s,a) = 0$.
2: for $t = 1, \dots, T$ do
3:   Let $\pi^t$ be the greedy policy w.r.t. $f^t$, i.e., $\pi^t_h(s) = \arg\max_a f^t_h(s,a)$.
     // Online collection
4:   For each $h$, collect $m_{\mathrm{on}}$ online tuples $D^t_h \sim d^{\pi^t}_h \circ \mathrm{Uniform}(A)$.
     // FQI using both online and offline data
5:   Set $f^{t+1}_H(s,a) = 0$.
6:   for $h = H-1, \dots, 0$ do
7:     Estimate $f^{t+1}_h$ using least squares regression on the aggregated data:
       \[ f^{t+1}_h \leftarrow \arg\min_{f\in\mathcal{F}_h} \widehat{\mathbb{E}}_{D^\nu_h}\big(f(s,a) - r - \max_{a'} f^{t+1}_{h+1}(s',a')\big)^2 + \sum_{\tau=1}^t \widehat{\mathbb{E}}_{D^\tau_h}\big(f(s,a) - r - \max_{a'} f^{t+1}_{h+1}(s',a')\big)^2 \qquad (13) \]
8:   end for
9: end for

Definition 4 (V-type bilinear model). Consider any pair of functions $(f, g)$ with $f, g \in \mathcal{F}$. Denote the greedy policy of $f$ by $\pi_f = \{\pi_f^h := \arg\max_a f_h(s,a),\ \forall h\}$. We say that the MDP together with the function class $\mathcal{F}$ admits a bilinear structure of rank $d$ if for any $h \in [H-1]$, there exist two (unknown) mappings $X_h : \mathcal{F} \to \mathbb{R}^d$ and $W_h : \mathcal{F} \to \mathbb{R}^d$ with $\max_f \|X_h(f)\|_2 \le B_X$ and $\max_f \|W_h(f)\|_2 \le B_W$, such that
\[ \forall f, g \in \mathcal{F} : \quad \Big|\mathbb{E}_{s\sim d^{\pi_f}_h,\, a\sim\pi_g(s)}\big[g_h(s,a) - \mathcal{T} g_{h+1}(s,a)\big]\Big| = \big|\langle X_h(f), W_h(g)\rangle\big|. \]

Note that, differently from the Q-type definition, here the action $a$ is taken from the greedy policy with respect to $g$. This way $\max_a g(s,a)$ can serve as an approximation of $V^\star$, hence the name V-type. To make Hy-Q work for the V-type bilinear model, we only need a slight change in the data collection process: when we collect the online batch $D_h$, we sample $s \sim d^{\pi^t}_h$, $a \sim \mathrm{Uniform}(A)$, $s' \sim P(\cdot\mid s,a)$; that is, the action is taken uniformly at random. We provide the pseudocode in Algorithm 2, and refer the reader to Du et al. (2021); Jin et al. (2021a) for a detailed discussion.
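The main loop of Algorithm 2 can be sketched in a fully tabular setting, where exact least squares over the tabular class reduces to per-$(s,a)$ empirical means. The toy two-state MDP below is an illustrative assumption (it is not a benchmark from the paper), chosen only so that the offline data, uniform online actions, and backward FQI passes are all visible in a few lines.

```python
import random
from collections import defaultdict

# Toy deterministic MDP (an illustrative assumption): states {0,1},
# actions {0,1}, next state = action, reward 1 only for action 1 at the
# final step h = H-1.
H, S, A = 3, [0, 1], [0, 1]

def step(s, a, h):
    return a, (1.0 if (h == H - 1 and a == 1) else 0.0)

def hyq_vtype(T=5, m_off=40, m_on=4, seed=0):
    rng = random.Random(seed)
    # Offline data: transition tuples per step h from uniformly random (s, a).
    D = {h: [] for h in range(H)}
    for h in range(H):
        for _ in range(m_off):
            s, a = rng.choice(S), rng.choice(A)
            s2, r = step(s, a, h)
            D[h].append((s, a, r, s2))
    f = {h: defaultdict(float) for h in range(H + 1)}  # f_H = 0
    for _ in range(T):
        greedy = lambda s, h: max(A, key=lambda a: f[h][(s, a)])
        # Online collection: roll in with the greedy policy to reach
        # d^{pi_t}_h, then act uniformly (the V-type modification).
        for h in range(H):
            for _ in range(m_on):
                s = 0
                for hh in range(h):
                    s, _ = step(s, greedy(s, hh), hh)
                a = rng.choice(A)          # a ~ Uniform(A)
                s2, r = step(s, a, h)
                D[h].append((s, a, r, s2))
        # Backward FQI pass on offline + online data; tabular least
        # squares = per-(s,a) mean of the regression targets.
        for h in range(H - 1, -1, -1):
            targets = defaultdict(list)
            for (s, a, r, s2) in D[h]:
                targets[(s, a)].append(r + max(f[h + 1][(s2, b)] for b in A))
            for sa, ys in targets.items():
                f[h][sa] = sum(ys) / len(ys)
    return f

f = hyq_vtype()
# The greedy policy at the final step should pick action 1 in every state.
final_actions = {s: max(A, key=lambda a: f[H - 1][(s, a)]) for s in S}
```

The offline tuples give coverage over all $(s,a)$ pairs, while the online roll-ins keep the regression targets accurate on the states the current greedy policy actually visits; this division of labor is exactly the hybrid mechanism the analysis formalizes.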

A.3.1 COMPLEXITY BOUND FOR V-TYPE BILINEAR MODELS

In this section, we give a performance analysis of Algorithm 2 for V-type bilinear models, extending the results developed for Q-type bilinear models in Appendix A.2. We first note the following bound for the FQI estimates in Algorithm 2.

Lemma 8. Let $\delta \in (0,1)$, and for $h \in [H-1]$ and $t \in [T]$ let $f^{t+1}_h$ be the value function for time step $h$ estimated via least squares regression on the datasets $D^\nu_h, D^1_h, \dots, D^t_h$ in (13) at iteration $t$ of Algorithm 2. Then, with probability at least $1-\delta$, for all $h \in [H-1]$ and $t \in [T]$,
\[ \big\|f^{t+1}_h - \mathcal{T} f^{t+1}_{h+1}\big\|^2_{2,\nu_h} \le \frac{256 V_{\max}^2\log(2HT|\mathcal{F}|/\delta)}{m_{\mathrm{off}}} =: \widetilde\Delta_{\mathrm{off}}, \quad\text{and}\quad \sum_{\tau=1}^t \big\|f^{t+1}_h - \mathcal{T} f^{t+1}_{h+1}\big\|^2_{2,\mu^\tau_h} \le \frac{256 V_{\max}^2\log(2HT|\mathcal{F}|/\delta)}{m_{\mathrm{on}}} =: \widetilde\Delta_{\mathrm{on}}, \]
where $\nu_h$ denotes the offline data distribution at time $h$, and the distribution $\mu^\tau_h \in \Delta(S\times A)$ is defined by $s \sim d^{\pi^\tau}_h$ and $a \sim \mathrm{Uniform}(A)$.

The following change-of-distribution lemma is the version of Lemma 7 under the V-type bilinear rank assumption.

Lemma 9. Suppose the underlying model is a V-type bilinear model. Then, for any $t \ge 0$ and $h \in [H-1]$, we have
\[ \big|\langle W_h(f^t), X_h(f^t)\rangle\big| \le \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}}\sqrt{|A|\cdot\sum_{i=1}^{t-1}\mathbb{E}_{s\sim d^{\pi_{f^i}}_h,\, a\sim\mathrm{Uniform}(A)}\big(f^t_h - \mathcal{T} f^t_{h+1}\big)^2 + \lambda B_W^2}, \]
where $\Sigma_{t-1;h}$ is defined in (3). Proof. The proof is essentially identical to the proof of Lemma 7.
Repeating the analysis up to (6), we get
\[
\begin{aligned}
\big|\langle W_h(f^t), X_h(f^t)\rangle\big|
&\le \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}}\sqrt{\sum_{i=1}^{t-1}\big|\langle W_h(f^t), X_h(f^i)\rangle\big|^2 + \lambda B_W^2} \\
&= \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}}\sqrt{\sum_{i=1}^{t-1}\Big(\mathbb{E}_{s\sim d^{\pi_{f^i}}_h,\, a\sim\pi_{f^t}(s)}\big[f^t_h - \mathcal{T} f^t_{h+1}\big]\Big)^2 + \lambda B_W^2} \\
&\le \|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}}\sqrt{|A|\cdot\sum_{i=1}^{t-1}\mathbb{E}_{s\sim d^{\pi_{f^i}}_h,\, a\sim\mathrm{Uniform}(A)}\big(f^t_h - \mathcal{T} f^t_{h+1}\big)^2 + \lambda B_W^2},
\end{aligned}
\]
where the second line follows from the definition of the V-type bilinear model in Definition 4, and the last line holds because
\[ \Big(\mathbb{E}_{s\sim d^{\pi_{f^i}}_h,\, a\sim\pi_{f^t}(s)}\big[f^t_h - \mathcal{T} f^t_{h+1}\big]\Big)^2 \le \mathbb{E}_{s\sim d^{\pi_{f^i}}_h,\, a\sim\pi_{f^t}(s)}\big(f^t_h - \mathcal{T} f^t_{h+1}\big)^2 \le |A|\cdot\mathbb{E}_{s\sim d^{\pi_{f^i}}_h,\, a\sim\mathrm{Uniform}(A)}\big(f^t_h - \mathcal{T} f^t_{h+1}\big)^2, \]
where the first inequality is due to Jensen's inequality, and the second follows from a straightforward upper bound since each term inside the expectation is non-negative.

We are finally ready to state and prove our main result in this section.

Theorem 2 (Cumulative suboptimality bound for V-type bilinear rank models). Let $m_{\mathrm{on}} = |A|$ and $m_{\mathrm{off}} = T$. Then, with probability at least $1-\delta$, the cumulative suboptimality of Algorithm 2 is bounded as
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} = O\bigg(\max\{C_{\pi^e}, 1\}\, V_{\max}\sqrt{dH^2 T\cdot\log\Big(1+\frac{T}{d}\Big)\log\frac{HT|\mathcal{F}|}{\delta}}\bigg). \]

Proof. The proof closely follows that of Theorem 1. Repeating the analysis up to (8) and (9), we get
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} \le T C_{\pi^e}\cdot\sqrt{H\cdot\widetilde\Delta_{\mathrm{off}}} + \sum_{t=1}^T \mathbb{E}_{s\sim d_0}\big[\max_a f^t_0(s,a) - V^{\pi_{f^t}}_0(s)\big]. \]
For the second term above, applying Lemma 3 to each $f^t$ for $1 \le t \le T$ gives
\[ \sum_{t=1}^T \mathbb{E}_{s\sim d_0}\big[\max_a f^t_0(s,a) - V^{\pi_{f^t}}_0(s)\big] \le \sum_{t=1}^T \sum_{h=0}^{H-1}\Big|\mathbb{E}_{s,a\sim d^{\pi_{f^t}}_h}\big[f^t_h(s,a) - \mathcal{T} f^t_{h+1}(s,a)\big]\Big| = \sum_{t=1}^T \sum_{h=0}^{H-1}\big|\langle X_h(f^t), W_h(f^t)\rangle\big| \le \sum_{t=1}^T \sum_{h=0}^{H-1}\|X_h(f^t)\|_{\Sigma^{-1}_{t-1;h}}\sqrt{|A|\cdot\widetilde\Delta_{\mathrm{on}} + \lambda B_W^2}, \]
where the equality follows from Definition 4, and the last step follows from Lemma 9 together with the bound in Lemma 8.
Using the elliptical potential bound of Lemma 5 as in the proof of Theorem 1, we get
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} \le T C_{\pi^e}\cdot\sqrt{H\cdot\widetilde\Delta_{\mathrm{off}}} + \sqrt{2dH^2\log\Big(1+\frac{T}{d}\Big)\cdot\big(|A|\cdot\widetilde\Delta_{\mathrm{on}} + B_X^2 B_W^2\big)\cdot T}. \]
Plugging in the values of $\widetilde\Delta_{\mathrm{on}}$ and $\widetilde\Delta_{\mathrm{off}}$ from Lemma 8 and using subadditivity of the square root, we get
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} \le 16 V_{\max} C_{\pi^e} T\sqrt{\frac{H}{m_{\mathrm{off}}}\log\frac{2HT|\mathcal{F}|}{\delta}} + 16 V_{\max}\sqrt{\frac{2dH^2|A|T}{m_{\mathrm{on}}}\log\Big(1+\frac{T}{d}\Big)\log\frac{2HT|\mathcal{F}|}{\delta}} + H B_X B_W\sqrt{2dT\log\Big(1+\frac{T}{d}\Big)}. \]
Setting $m_{\mathrm{on}} = |A|$ and $m_{\mathrm{off}} = T$, we get the cumulative suboptimality bound
\[ \sum_{t=1}^T V^{\pi^e} - V^{\pi_{f^t}} = O\bigg(\max\{C_{\pi^e}, 1\}\, V_{\max}\sqrt{dH^2 T\cdot\log\Big(1+\frac{T}{d}\Big)\log\frac{HT|\mathcal{F}|}{\delta}}\bigg). \]

Corollary 2 (Sample complexity). Under the assumptions of Theorem 2, if $C_{\pi^*} < \infty$, then Algorithm 2 can find an $\epsilon$-suboptimal policy $\hat\pi$, for which $V^{\pi^*} - V^{\hat\pi} \le \epsilon$, with total sample complexity
\[ n = O\bigg(\max\{C^2_{\pi^*}, 1\}\frac{V_{\max}^2 d H^3 |A|\log(HT|\mathcal{F}|/\delta)}{\epsilon^2}\bigg). \]

Proof. The result follows from a standard online-to-batch conversion. Setting $\pi^e = \pi^*$ in the cumulative suboptimality bound of Theorem 2 and defining the policy $\hat\pi = \mathrm{Uniform}(\pi^1, \dots, \pi^T)$, we get
\[ \mathbb{E}\big[V^{\pi^*} - V^{\hat\pi}\big] = \frac{1}{T}\sum_{t=1}^T V^{\pi^*} - V^{\pi^t} = O\bigg(\max\{C_{\pi^*}, 1\}\, V_{\max}\sqrt{\frac{dH^2}{T}\cdot\log\Big(1+\frac{T}{d}\Big)\log\frac{HT|\mathcal{F}|}{\delta}}\bigg). \]
Thus, the policy returned after $T \ge O\big(\max\{C^2_{\pi^*}, 1\}\, V_{\max}^2 dH^2\log(HT|\mathcal{F}|/\delta)/\epsilon^2\big)$ iterations satisfies $\mathbb{E}[V^{\pi^*} - V^{\hat\pi}] \le \epsilon$. In these $T$ iterations, the total number of offline samples used is
\[ m_{\mathrm{off}} = T = O\bigg(\max\{C^2_{\pi^*}, 1\}\frac{V_{\max}^2 dH^2\log(HT|\mathcal{F}|/\delta)}{\epsilon^2}\bigg), \]
and the total number of online samples collected is
\[ m_{\mathrm{on}}\cdot H\cdot T = O\bigg(\max\{C^2_{\pi^*}, 1\}\frac{V_{\max}^2 dH^3 |A|\log(HT|\mathcal{F}|/\delta)}{\epsilon^2}\bigg), \]
where the additional $H$ factor appears because we collect $m_{\mathrm{on}}$ samples for every $h \in [H]$ in the algorithm.

A.4 LOW-RANK MDP

In this section, we briefly introduce the low-rank MDP model (Du et al., 2021), which is captured by the V-type bilinear model discussed in Appendix A.3. Unlike the linear MDP model discussed in Section 5.1, a low-rank MDP does not assume the feature map $\phi$ is known a priori.

Definition 5 (Low-rank MDP). An MDP is called a low-rank MDP if there exist $\mu^\star : S \to \mathbb{R}^d$ and $\phi^\star : S\times A \to \mathbb{R}^d$ such that the transition dynamics satisfy $P(s'\mid s,a) = \mu^\star(s')^\top\phi^\star(s,a)$ for all $s,a,s'$. We additionally assume that we are given a realizable representation class $\Phi$ such that $\phi^\star \in \Phi$, that $\sup_{s,a}\|\phi^\star(s,a)\|_2 \le 1$, and that $\big\|\int f(s)\,\mu^\star(s)\,ds\big\|_2 \le \sqrt{d}$ for any $f : S \to [-1,1]$.

Consider the function class
\[ \mathcal{F}_h = \big\{w^\top\phi(s,a) : \phi\in\Phi,\ w\in\mathcal{B}_d(B_W)\big\}, \]
and note that through the bilinear decomposition we have $B_W \le 2\sqrt{d}$. By inspection, this function class satisfies Assumption 1. Furthermore, it is well known that the low-rank MDP model has V-type bilinear rank at most $d$ (Du et al., 2021). Invoking the sample complexity bound given in Corollary 2 for V-type bilinear models, we get the following result.

Lemma 10. Let $\delta \in (0,1)$ and let $\Phi$ be a given representation class. Suppose that the MDP is a rank-$d$ MDP w.r.t. some $\phi^\star \in \Phi$, that $C_{\pi^*} < \infty$, and consider the class $\mathcal{F}_h$ defined above. Then, with probability $1-\delta$, Algorithm 2 finds an $\epsilon$-suboptimal policy with total sample complexity (offline + online)
\[ O\bigg(\max\{C^2_{\pi^*}, 1\}\frac{V_{\max}^2 d^2 H^4 |A|\log(HTd|\Phi|/\epsilon\delta)}{\epsilon^2}\bigg). \]

Proof sketch of Lemma 10. The proof follows by invoking the result of Corollary 2 for a discretization of the class $\mathcal{F}$, denoted $\mathcal{F}_\epsilon = \mathcal{F}_{0,\epsilon}\times\cdots\times\mathcal{F}_{H-1,\epsilon}$, where $\mathcal{F}_{h,\epsilon} = \{w^\top\phi(s,a) : \phi\in\Phi,\ w\in\mathcal{B}_{d,\epsilon}(B_W)\}$ and $\mathcal{B}_{d,\epsilon}(B_W)$ is an $\epsilon$-net of $\mathcal{B}_d(B_W)$ under the $\ell_\infty$ distance, containing $O((B_W/\epsilon)^d)$ elements. Thus, we get $\log|\mathcal{F}_\epsilon| = O(Hd\log(B_W|\Phi|/\epsilon))$. For low-rank MDPs, the transfer coefficient $C_\pi$ is upper bounded by a relative-condition-number-style quantity defined via the unknown ground-truth feature $\phi^\star$ (see Lemma 13). On the computational side, Algorithm 1 (with the modification $a\sim\mathrm{Uniform}(A)$ in the online data collection step) requires solving a least squares regression problem at every round. The objective of this regression problem is a convex functional of the hypothesis $f$ over the constraint set $\mathcal{F}$. While this is not fully efficiently implementable due to the potentially non-convex constraint set $\mathcal{F}$ (e.g., $\phi$ could be complicated), our regression problem is still much simpler than the oracle models considered in prior works for this model (Agarwal et al., 2020a; Sekhari et al., 2021; Uehara et al., 2021; Modi et al., 2021).

A.5 BOUNDS ON TRANSFER COEFFICIENT

Note that $C_\pi$ takes both the distribution shift and the function class into consideration, and is smaller than existing density-ratio-based concentrability coefficients (Kakade & Langford, 2002; Munos & Szepesvári, 2008; Chen & Jiang, 2019) as well as the existing Bellman-error-based concentrability coefficient of Xie et al. (2021a). We formalize this in the following lemma.

Lemma 11. For any $\pi$ and offline distribution $\nu$,
\[ C_\pi \le \max_{f,h}\frac{\|f_h - \mathcal{T} f_{h+1}\|_{2,d^\pi_h}}{\|f_h - \mathcal{T} f_{h+1}\|_{2,\nu_h}} \le \sup_{h,s,a}\sqrt{\frac{d^\pi_h(s,a)}{\nu_h(s,a)}}. \]

Proof. Using Jensen's inequality, we get
\[ C_\pi \le \max_f\frac{\sum_{h=0}^{H-1}\|f_h - \mathcal{T} f_{h+1}\|_{2,d^\pi_h}}{\sum_{h=0}^{H-1}\|f_h - \mathcal{T} f_{h+1}\|_{2,\nu_h}} \le \max_{f,h}\frac{\|f_h - \mathcal{T} f_{h+1}\|_{2,d^\pi_h}}{\|f_h - \mathcal{T} f_{h+1}\|_{2,\nu_h}} \le \sup_{h,s,a}\sqrt{\frac{d^\pi_h(s,a)}{\nu_h(s,a)}} \le \sup_{h,s,a}\frac{d^\pi_h(s,a)}{\nu_h(s,a)}, \]
where the second inequality follows from the mediant inequality, and the last inequality holds whenever $\sup_{h,s,a}\frac{d^\pi_h(s,a)}{\nu_h(s,a)} \ge 1$.

Next we show that in the linear Bellman complete setting, $C_\pi$ is bounded by a relative condition number in the linear features.

Lemma 12. Consider the linear Bellman complete setting (Definition 3) with known feature $\phi$. Suppose that the feature covariance matrix induced by the offline distribution $\nu$, $\Sigma_{\nu_h} := \mathbb{E}_{s,a\sim\nu_h}[\phi(s,a)\phi(s,a)^\top]$, is invertible. Then for any policy $\pi$, we have
\[ C_\pi \le \max_h\sqrt{\mathbb{E}_{s,a\sim d^\pi_h}\|\phi(s,a)\|^2_{\Sigma^{-1}_{\nu_h}}}. \]

Proof. Recall that in the linear Bellman complete setting we can write $f_h$ as $w_h^\top\phi$, and for any $w_h$ that defines $f_h$ there exists $w'_h$ such that $\mathcal{T} f_{h+1} = w'^\top_h\phi$. Repeating the argument in Lemma 11, we have
\[ C_\pi \le \max_{f,h}\frac{\|f_h - \mathcal{T} f_{h+1}\|_{2,d^\pi_h}}{\|f_h - \mathcal{T} f_{h+1}\|_{2,\nu_h}} = \max_{w,w',h}\frac{\|(w_h - w'_h)^\top\phi\|_{2,d^\pi_h}}{\|(w_h - w'_h)^\top\phi\|_{2,\nu_h}} \le \max_{w,w',h}\frac{\|w_h - w'_h\|_{\Sigma_{\nu_h}}\sqrt{\mathbb{E}_{s,a\sim d^\pi_h}\|\phi(s,a)\|^2_{\Sigma^{-1}_{\nu_h}}}}{\|(w_h - w'_h)^\top\phi\|_{2,\nu_h}} = \max_h\sqrt{\mathbb{E}_{s,a\sim d^\pi_h}\|\phi(s,a)\|^2_{\Sigma^{-1}_{\nu_h}}}, \]
where the inequality uses the pointwise Cauchy–Schwarz bound $\big((w_h - w'_h)^\top\phi\big)^2 \le \|w_h - w'_h\|^2_{\Sigma_{\nu_h}}\|\phi\|^2_{\Sigma^{-1}_{\nu_h}}$, and the final equality uses $\|(w_h - w'_h)^\top\phi\|^2_{2,\nu_h} = \|w_h - w'_h\|^2_{\Sigma_{\nu_h}}$. Now we proceed to low-rank MDPs, where the feature is unknown. We show that for low-rank MDPs, $C_\pi$ is bounded by a partial feature coverage condition under the unknown ground-truth feature. Lemma 13.
Consider the low-rank MDP setting (Definition 5), where the transition dynamics are given by $P(s'\mid s,a) = \langle\mu^\star(s'), \phi^\star(s,a)\rangle$ for some $\mu^\star, \phi^\star$ mapping into $\mathbb{R}^d$. Suppose that the offline distribution $\nu = (\nu_0,\dots,\nu_{H-1})$ satisfies $\max_h\max_{s,a}\frac{\pi_h(a\mid s)}{\nu_h(a\mid s)} \le \alpha$, that $\nu$ is induced via trajectories, i.e. $\nu_0(s) = d_0(s)$ and $\nu_h(s) = \mathbb{E}_{\bar s,\bar a\sim\nu_{h-1}}[P(s\mid\bar s,\bar a)]$ for any $h \ge 1$, and that the feature covariance matrix $\Sigma_{\nu_{h-1},\phi^\star} := \mathbb{E}_{s,a\sim\nu_{h-1}}[\phi^\star(s,a)\phi^\star(s,a)^\top]$ is invertible. Then for any policy $\pi$,
\[ C_\pi \le \sqrt{\alpha}\sum_{h=1}^H \mathbb{E}_{s,a\sim d^\pi_{h-1}}\|\phi^\star(s,a)\|_{\Sigma^{-1}_{\nu_{h-1},\phi^\star}} + \sqrt{\alpha}. \]

Proof. We first upper bound the numerator. For $h = 0$,
\[ \mathbb{E}_{s,a\sim d^\pi_0}\big[\mathcal{T} f_1(s,a) - f_0(s,a)\big] \le \sqrt{\mathbb{E}_{s\sim d_0,\, a\sim\pi(\cdot\mid s)}\big(\mathcal{T} f_1(s,a) - f_0(s,a)\big)^2} \le \sqrt{\max_{s,a}\frac{d^\pi_0(s,a)}{\nu_0(s,a)}}\cdot\sqrt{\mathbb{E}_{s,a\sim\nu_0}\big(\mathcal{T} f_1(s,a) - f_0(s,a)\big)^2} \le \sqrt{\alpha\cdot\mathbb{E}_{s,a\sim\nu_0}\big(\mathcal{T} f_1(s,a) - f_0(s,a)\big)^2}, \qquad (16) \]
where the last inequality follows from our assumption, since $\max_{s,a}\frac{d^\pi_0(s,a)}{\nu_0(s,a)} = \max_{s,a}\frac{\pi_0(a\mid s)}{\nu_0(a\mid s)} \le \alpha$. Next, for any $h \ge 1$, backing up one step and considering the pair $\bar s, \bar a$ that leads to the state $s$, we get
\[
\begin{aligned}
\mathbb{E}_{s,a\sim d^\pi_h}\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big]
&= \mathbb{E}_{\bar s,\bar a\sim d^\pi_{h-1},\, s\sim P(\bar s,\bar a),\, a\sim\pi(s)}\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big] \\
&= \mathbb{E}_{\bar s,\bar a\sim d^\pi_{h-1}}\bigg[\phi^\star(\bar s,\bar a)^\top \int_s \mu^\star(s)\sum_a\pi(a\mid s)\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big]\,ds\bigg] \\
&\le \mathbb{E}_{\bar s,\bar a\sim d^\pi_{h-1}}\Bigg[\|\phi^\star(\bar s,\bar a)\|_{\Sigma^{-1}_{\nu_{h-1},\phi^\star}}\bigg\|\int_s \mu^\star(s)\sum_a\pi(a\mid s)\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big]\,ds\bigg\|_{\Sigma_{\nu_{h-1},\phi^\star}}\Bigg], \qquad (17)
\end{aligned}
\]
where the last line follows from an application of the Cauchy–Schwarz inequality.
For the term inside the expectation on the right-hand side above, we note that
\[
\begin{aligned}
\bigg\|\int_s \mu^\star(s)\sum_a\pi(a\mid s)\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big]\,ds\bigg\|^2_{\Sigma_{\nu_{h-1},\phi^\star}}
&\overset{(i)}{=} \mathbb{E}_{\bar s,\bar a\sim\nu_{h-1}}\Bigg[\bigg(\int_s \mu^\star(s)^\top\phi^\star(\bar s,\bar a)\sum_a\pi(a\mid s)\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)\,ds\bigg)^2\Bigg] \\
&= \mathbb{E}_{\bar s,\bar a\sim\nu_{h-1}}\Big[\big(\mathbb{E}_{s\sim P(\bar s,\bar a),\, a\sim\pi(s)}\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big]\big)^2\Big] \\
&\overset{(ii)}{\le} \mathbb{E}_{\bar s,\bar a\sim\nu_{h-1},\, s\sim P(\bar s,\bar a),\, a\sim\pi(s)}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2 \\
&\overset{(iii)}{=} \mathbb{E}_{s\sim\nu_h,\, a\sim\pi(s)}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2 \\
&\overset{(iv)}{\le} \alpha\cdot\mathbb{E}_{s,a\sim\nu_h}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2, \qquad (18)
\end{aligned}
\]
where (i) follows by expanding the norm, (ii) follows from an application of Jensen's inequality, (iii) is due to our assumption that the offline dataset is generated from trajectories so that $\nu_h(s) = \mathbb{E}_{\bar s,\bar a\sim\nu_{h-1}}[P(s\mid\bar s,\bar a)]$, and (iv) follows from the definition of $\alpha$. Plugging (18) into (17), we get that for $h \ge 1$,
\[ \mathbb{E}_{s,a\sim d^\pi_h}\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big] \le \mathbb{E}_{\bar s,\bar a\sim d^\pi_{h-1}}\Big[\|\phi^\star(\bar s,\bar a)\|_{\Sigma^{-1}_{\nu_{h-1},\phi^\star}}\Big]\sqrt{\alpha\cdot\mathbb{E}_{s,a\sim\nu_h}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2}. \qquad (19) \]
We are now ready to bound the transfer coefficient. First, using (16), for any $f$,
\[ \frac{\mathbb{E}_{s,a\sim d^\pi_0}\big[\mathcal{T} f_1(s,a) - f_0(s,a)\big]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_h}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2}} \le \frac{\sqrt{\alpha\cdot\mathbb{E}_{s,a\sim\nu_0}\big(\mathcal{T} f_1(s,a) - f_0(s,a)\big)^2}}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_h}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2}} \le \sqrt{\alpha}. \]
Furthermore, for any $f$, using (19), we get
\[ \frac{\sum_{h=1}^{H-1}\mathbb{E}_{s,a\sim d^\pi_h}\big[\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big]}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_h}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2}} \le \sum_{h=1}^{H-1}\mathbb{E}_{\bar s,\bar a\sim d^\pi_{h-1}}\Bigg[\|\phi^\star(\bar s,\bar a)\|_{\Sigma^{-1}_{\nu_{h-1},\phi^\star}}\frac{\sqrt{\alpha\cdot\mathbb{E}_{s,a\sim\nu_h}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2}}{\sqrt{\sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim\nu_h}\big(\mathcal{T} f_{h+1}(s,a) - f_h(s,a)\big)^2}}\Bigg] \le \sqrt{\alpha}\sum_{h=1}^{H}\mathbb{E}_{\bar s,\bar a\sim d^\pi_{h-1}}\|\phi^\star(\bar s,\bar a)\|_{\Sigma^{-1}_{\nu_{h-1},\phi^\star}}, \]
where the last step holds for an appropriate choice of $\lambda$ (e.g. $\lambda = 0$). Combining the above two bounds with the definition of $C_\pi$, we get
\[ C_\pi \le \sqrt{\alpha}\sum_{h=1}^H \mathbb{E}_{\bar s,\bar a\sim d^\pi_{h-1}}\|\phi^\star(\bar s,\bar a)\|_{\Sigma^{-1}_{\nu_{h-1},\phi^\star}} + \sqrt{\alpha}. \]
Note that in the above result, the transfer coefficient is upper bounded by the relative coverage under the unknown feature $\phi^\star$ plus a term related to the action coverage $\alpha$, i.e., $\max_h\max_{s,a}\frac{\pi_h(a\mid s)}{\nu_h(a\mid s)} \le \alpha$.
This matches the coverage condition used in prior offline RL work for low-rank MDPs (Uehara & Sun, 2021).
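The key single-step inequality underlying Lemma 11 — that the ratio of expected Bellman error under the visitation distribution to the root-mean-square error under the offline distribution is bounded by the square root of the density ratio — can be sanity-checked numerically. The finite space, random distributions, and random error functions below are synthetic stand-ins for $d^\pi_h$, $\nu_h$, and $f_h - \mathcal{T} f_{h+1}$ (all illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # size of a finite state-action space

# Random visitation distribution d and offline distribution nu, both
# supported everywhere, plus random "Bellman error" functions g.
d = rng.dirichlet(np.ones(n))
nu = rng.dirichlet(np.ones(n))
ratio_bound = float(np.sqrt(np.max(d / nu)))

worst = 0.0
for _ in range(200):
    g = rng.normal(size=n)
    # E_d[g] / sqrt(E_nu[g^2]): the per-step transfer ratio.
    transfer = float((d @ g) / np.sqrt(nu @ g**2))
    worst = max(worst, transfer)
```

The chain behind the check is $\mathbb{E}_d[g] \le \sqrt{\mathbb{E}_d[g^2]} \le \sqrt{\sup(d/\nu)}\cdot\sqrt{\mathbb{E}_\nu[g^2]}$, so the worst observed ratio never exceeds `ratio_bound` regardless of how adversarially $g$ is chosen.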

B AUXILIARY LEMMAS

In this section, we provide a few results, along with their proofs, that we used in the previous sections. We begin with the following form of Freedman's inequality, which is a modification of a similar inequality in Beygelzimer et al. (2011).

Lemma 14 (Freedman's inequality). Let $\{X_1, \dots, X_T\}$ be a sequence of non-negative random variables, where each $X_t$ is sampled from some process that depends on all previous instances, i.e. $X_t \sim \rho_t = \rho_t(X_{1:t-1})$. Further, suppose that $|X_t| \le R$ almost surely for all $t \le T$. Then, for any $\delta > 0$ and $\lambda \in [0, 1/2R]$, with probability at least $1-\delta$,
\[ \sum_{t=1}^T X_t - \mathbb{E}[X_t \mid \rho_t] \le \lambda\sum_{t=1}^T\Big(2R\,\big|\mathbb{E}[X_t\mid\rho_t]\big| + \mathbb{E}[X_t^2\mid\rho_t]\Big) + \frac{\log(2/\delta)}{\lambda}. \]

Proof. Define the random variable $Z_t = X_t - \mathbb{E}[X_t\mid\rho_t]$. Clearly, $\{Z_t\}_{t=1}^T$ is a martingale difference sequence. Furthermore, for any $t$ we have $|Z_t| \le 2R$ and
\[ \mathbb{E}[Z_t^2\mid\rho_t] = \mathbb{E}\big[(X_t - \mathbb{E}[X_t\mid\rho_t])^2\mid\rho_t\big] \le 2R\,\big|\mathbb{E}[X_t\mid\rho_t]\big| + \mathbb{E}[X_t^2\mid\rho_t], \qquad (20) \]
where the inequality holds because $|X_t| \le R$. Using the form of Freedman's inequality in Beygelzimer et al. (2011, Theorem 1), for any $\lambda \in [0, 1/2R]$,
\[ \sum_{t=1}^T Z_t \le \lambda\sum_{t=1}^T\mathbb{E}[Z_t^2\mid\rho_t] + \frac{\log(2/\delta)}{\lambda}. \]
Plugging in the form of $Z_t$ and using (20), we get the desired statement.

Next we give a formal proof of Lemma 2, which gives a generalization bound for least squares regression when the samples are adapted to an increasing filtration (and are not necessarily i.i.d.). The proof follows similarly to Agarwal et al. (2019, Lemma A.11).

Lemma 15 (Lemma 2 restated: least squares generalization bound). Let $R > 0$ and $\delta \in (0,1)$, and consider a sequential function estimation setting with instance space $\mathcal{X}$ and target space $\mathcal{Y}$. Let $\mathcal{H} : \mathcal{X} \to [-R, R]$ be a class of real-valued functions. Let $D = \{(x_1, y_1), \dots, (x_T, y_T)\}$ be a dataset of $T$ points where $x_t \sim \rho_t = \rho_t(x_{1:t-1}, y_{1:t-1})$, and $y_t$ is sampled via the conditional probability $p(\cdot\mid x_t)$: $y_t = h^*(x_t) + \varepsilon_t$, where the function $h^*$ satisfies approximate realizability, i.e. $\inf_{h\in\mathcal{H}}\frac{1}{T}\sum_{t=1}^T\mathbb{E}_{x\sim\rho_t}\big(h^*(x) - h(x)\big)^2 \le \gamma$, the $\{\varepsilon_t\}$ are independent random variables with $\mathbb{E}[y_t\mid x_t] = h^*(x_t)$, and $\max_t|y_t| \le R$ and $\max_x|h^*(x)| \le R$. Then the least squares solution $\hat h \leftarrow \arg\min_{h\in\mathcal{H}}\sum_{t=1}^T(h(x_t) - y_t)^2$ satisfies, with probability at least $1-\delta$,
\[ \sum_{t=1}^T\mathbb{E}_{x\sim\rho_t}\big(\hat h(x) - h^*(x)\big)^2 \le 3\gamma T + 256 R^2\log(2|\mathcal{H}|/\delta). \]

Proof. Consider any fixed function $h \in \mathcal{H}$ and define the random variable
\[ Z^h_t := (h(x_t) - y_t)^2 - (h^*(x_t) - y_t)^2. \]
Write $\mathbb{E}[\cdot\mid\rho_t]$ for $\mathbb{E}_{x_t\sim\rho_t}[\cdot]$, and note that
\[ \mathbb{E}[Z^h_t\mid\rho_t] = \mathbb{E}_{x_t\sim\rho_t}\big[(h(x_t) - h^*(x_t))(h(x_t) + h^*(x_t) - 2y_t)\big] = \mathbb{E}_{x_t\sim\rho_t}\big[(h(x_t) - h^*(x_t))^2\big], \qquad (21) \]
where the last equality holds because $\mathbb{E}[y_t\mid x_t] = h^*(x_t)$. Furthermore, we also have
\[ \mathbb{E}\big[(Z^h_t)^2\mid\rho_t\big] = \mathbb{E}_{x_t\sim\rho_t}\big[(h(x_t) - h^*(x_t))^2(h(x_t) + h^*(x_t) - 2y_t)^2\big] \le 16R^2\,\mathbb{E}_{x_t\sim\rho_t}\big[(h(x_t) - h^*(x_t))^2\big]. \qquad (22) \]
The sequence of random variables $Z^h_1, \dots, Z^h_T$ satisfies the conditions of Lemma 14 with $|Z^h_t| \le 4R^2$. Thus, for any $\lambda \in [0, 1/8R^2]$ and $\delta > 0$, with probability at least $1-\delta$,
\[ \sum_{t=1}^T Z^h_t - \mathbb{E}[Z^h_t\mid\rho_t] \le \lambda\sum_{t=1}^T\Big(8R^2\,\mathbb{E}[Z^h_t\mid\rho_t] + \mathbb{E}\big[(Z^h_t)^2\mid\rho_t\big]\Big) + \frac{\log(2/\delta)}{\lambda} \le 32\lambda R^2\sum_{t=1}^T\mathbb{E}_{x_t\sim\rho_t}\big(h(x_t) - h^*(x_t)\big)^2 + \frac{\log(2/\delta)}{\lambda}, \]
where the last inequality uses (21) and (22). Setting $\lambda = 1/64R^2$ in the above, and taking a union bound over $h \in \mathcal{H}$, we get that for all $h \in \mathcal{H}$ and $\delta > 0$, with probability at least $1-\delta$,
\[ \sum_{t=1}^T Z^h_t - \mathbb{E}[Z^h_t\mid\rho_t] \le \frac{1}{2}\sum_{t=1}^T\mathbb{E}_{x_t\sim\rho_t}\big(h(x_t) - h^*(x_t)\big)^2 + 64R^2\log(2|\mathcal{H}|/\delta). \]
Rearranging terms and using (21), this implies
\[ \sum_{t=1}^T Z^h_t \le \frac{3}{2}\sum_{t=1}^T\mathbb{E}_{x_t\sim\rho_t}\big(h(x_t) - h^*(x_t)\big)^2 + 64R^2\log(2|\mathcal{H}|/\delta), \quad\text{and}\quad \sum_{t=1}^T\mathbb{E}_{x_t\sim\rho_t}\big(h(x_t) - h^*(x_t)\big)^2 \le 2\sum_{t=1}^T Z^h_t + 128R^2\log(2|\mathcal{H}|/\delta). \qquad (23) \]
For the rest of the proof, we condition on the event that (23) holds for all $h \in \mathcal{H}$.

Define the function h

:= argmin h∈H T t=1 E xt∼ρt (h(x t ) -h * (x t )) 2 . Using (23), we get that T t=1 Z h t ≤ 3 2 T t=1 E xt∼ρt ( h(x t ) -h * (x t )) 2 + 64R 2 log(2|H|/δ) ≤ 3 2 γT + 64R 2 log(2|H|/δ), where the last inequality follows from the approximate realizability assumption. Let h denote the least squares solution on dataset {(x t , y t )} t≤T . By definition, we have that T t=1 Z h t = ( h(x t ) -y t ) 2 -(h * (x t ) -y t ) 2 ≤ ( h(x t ) -y t ) 2 -(h * (x t ) -y t ) 2 = T t=1 Z h t . Combining the above two relations, we get that T t=1 Z h t ≤ 3 2 γT + 64R 2 log(2|H|/δ). Finally, using (23) for the function h, we get that T t=1 E xt∼ρt ( h(x t ) -h * (x t )) 2 ≤ 2 T t=1 Z h t + 128R 2 log(2|H|/δ) ≤ 3γT + 256R 2 log(2|H|/δ), where the last inequality uses the relation ( 24).
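As a quick numerical sanity check of Lemma 15 (illustrative only, not part of the proof), the sketch below draws non-i.i.d. data whose sampling distribution depends on the past, fits the best function in a small finite class by least squares, and verifies that the (empirical proxy for the) population squared error stays below the \(3\gamma T + 256R^2\log(2|\mathcal{H}|/\delta)\) bound. The function class, constants, and drift process are all our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, R, delta = 500, 2.0, 0.05

# A small finite class H: clipped linear maps indexed by slope.
slopes = np.linspace(-1.0, 1.0, 21)
def h(slope, x):
    return np.clip(slope * x, -R, R)

true_slope = 0.512  # h* is only approximately realizable (closest slope is 0.5)
xs, ys = np.empty(T), np.empty(T)
x_prev = 0.0
for t in range(T):
    # rho_t depends on the history: the sampling distribution drifts with x_{t-1}.
    xs[t] = np.tanh(x_prev) + rng.uniform(-1, 1)
    ys[t] = h(true_slope, xs[t]) + rng.normal(0, 0.1)  # E[y_t | x_t] = h*(x_t)
    x_prev = xs[t]

# Least squares over the finite class.
losses = [np.sum((h(s, xs) - ys) ** 2) for s in slopes]
s_hat = slopes[int(np.argmin(losses))]

# Empirical proxies for the population error and for gamma (mis-specification).
err = np.mean((h(s_hat, xs) - h(true_slope, xs)) ** 2) * T
gamma = min(np.mean((h(s, xs) - h(true_slope, xs)) ** 2) for s in slopes)
bound = 3 * gamma * T + 256 * R**2 * np.log(2 * len(slopes) / delta)
print(err <= bound)  # the bound is loose, so this should hold comfortably
```

The bound is dominated by the \(256R^2\log(2|\mathcal{H}|/\delta)\) term here, so it holds by a wide margin; the point is only that all quantities in the lemma are directly computable.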

C LOW BELLMAN ELUDER DIMENSION PROBLEMS

In this section, we consider problems with low Bellman Eluder dimension (Jin et al., 2021a). This complexity measure is a distributional version of the Eluder dimension applied to the class of Bellman residuals w.r.t. \(\mathcal{F}\). We show that our algorithm Hy-Q gives a similar performance guarantee for problems with small Bellman Eluder dimension, demonstrating that Hy-Q applies to all general model-free RL frameworks known in the literature so far. We first introduce the key definitions.

Definition 6 (\(\varepsilon\)-independence between distributions (Jin et al., 2021a)). Let \(\mathcal{G}\) be a class of functions defined on a space \(\mathcal{X}\), and \(\nu, \mu_1, \dots, \mu_n\) be probability measures over \(\mathcal{X}\). We say \(\nu\) is \(\varepsilon\)-independent of \(\{\mu_1, \dots, \mu_n\}\) with respect to \(\mathcal{G}\) if there exists \(g \in \mathcal{G}\) such that \(\sqrt{\sum_{i=1}^n (\mathbb{E}_{\mu_i}[g])^2} \leq \varepsilon\), but \(|\mathbb{E}_\nu[g]| > \varepsilon\).

Definition 7 (Distributional Eluder (DE) dimension). Let \(\mathcal{G}\) be a function class defined on \(\mathcal{X}\), and \(\mathcal{P}\) be a family of probability measures over \(\mathcal{X}\). The distributional Eluder dimension \(\dim_{DE}(\mathcal{G}, \mathcal{P}, \varepsilon)\) is the length of the longest sequence \(\{\rho_1, \dots, \rho_n\} \subset \mathcal{P}\) such that there exists \(\varepsilon' \geq \varepsilon\) where \(\rho_i\) is \(\varepsilon'\)-independent of \(\{\rho_1, \dots, \rho_{i-1}\}\) for all \(i \in [n]\).

Definition 8 (Bellman Eluder (BE) dimension (Jin et al., 2021a)). Given a value function class \(\mathcal{F}\), let \(\mathcal{G}_h := \{f_h - \mathcal{T} f_{h+1} \mid f_h \in \mathcal{F}_h, f_{h+1} \in \mathcal{F}_{h+1}\}\) be the set of Bellman residuals induced by \(\mathcal{F}\) at step \(h\), and \(\mathcal{P} = \{\mathcal{P}_h\}_{h=1}^H\) be a collection of \(H\) families of probability measures over \(\mathcal{X} \times \mathcal{A}\). The \(\varepsilon\)-Bellman Eluder dimension of \(\mathcal{F}\) with respect to \(\mathcal{P}\) is defined as \(\dim_{BE}(\mathcal{F}, \mathcal{P}, \varepsilon) := \max_{h \in [H]} \dim_{DE}(\mathcal{G}_h, \mathcal{P}_h, \varepsilon)\).

We also note the following lemma, which controls the rate at which Bellman error accumulates.

Lemma 16 (Lemma 41, Jin et al. (2021a)). Given a function class \(\mathcal{G}\) defined on a space \(\mathcal{X}\) with \(\sup_{g \in \mathcal{G}, x \in \mathcal{X}} |g(x)| \leq C\), and a family of probability measures \(\mathcal{P}\) over \(\mathcal{X}\), suppose that the sequences \(\{g_k\}_{k=1}^K \subset \mathcal{G}\) and \(\{\mu_k\}_{k=1}^K \subset \mathcal{P}\) satisfy \(\sum_{t=1}^{k-1} (\mathbb{E}_{\mu_t}[g_k])^2 \leq \beta\) for all \(k \in [K]\). Then, for all \(k \in [K]\) and \(\gamma > 0\),
\[
\sum_{t=1}^k \big|\mathbb{E}_{\mu_t}[g_t]\big| \leq O\Big( \sqrt{\dim_{DE}(\mathcal{G}, \mathcal{P}, \gamma)\, \beta\, k} + \min\{k, \dim_{DE}(\mathcal{G}, \mathcal{P}, \gamma)\}\, C + k\gamma \Big).
\]
We next state our main theorem, whose proof is similar to that of Theorem 1.

Theorem 3 (Cumulative suboptimality). Fix \(\delta \in (0, 1)\), \(m_{\mathrm{off}} = HT/d\) and \(m_{\mathrm{on}} = H^2\), and suppose that the underlying MDP admits Bellman Eluder dimension \(d\) and that the function class \(\mathcal{F}\) satisfies Assumption 1. Then, with probability at least \(1 - \delta\), Algorithm 1 obtains the following bound on cumulative suboptimality w.r.t. any comparator policy \(\pi_e\):
\[
\sum_{t=1}^T V^{\pi_e} - V^{\pi_t} = \widetilde O\Big( V_{\max} \max\{C_{\pi_e}, 1\} \sqrt{dT \cdot \log(H|\mathcal{F}|/\delta)} \Big),
\]
where \(\pi_t = \pi_{f^t}\) is the greedy policy w.r.t. \(f^t\) at round \(t\), and \(d = \dim_{BE}(\mathcal{F}, \mathcal{P}_{\mathcal{F}}, 1/\sqrt{T})\). Here \(\mathcal{P}_{\mathcal{F}}\) is the class of occupancy measures that can be induced by greedy policies w.r.t. value functions in \(\mathcal{F}\).

Proof. Repeating the analysis up to (10) in the proof of Theorem 1, we get
\[
\sum_{t=1}^T V^{\pi_e} - V^{\pi_t} \leq T \sqrt{C_{\pi_e} \cdot H \cdot \Delta_{\mathrm{off}}} + \sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^{\pi_{f^t}}_h}\big[ f^t_h(s,a) - \mathcal{T}_h f^t_{h+1}(s,a) \big].
\]
Using the bound in Lemma 6 and Lemma 16 in the above, we get
\[
\sum_{t=1}^T V^{\pi_e} - V^{\pi_t} \lesssim T \sqrt{C_{\pi_e} \cdot H \cdot \Delta_{\mathrm{off}}} + \sum_{h=0}^{H-1} \Big( \sqrt{\dim_{DE}(\mathcal{G}_h, \mathcal{P}_{\mathcal{F};h}, \gamma)\, \Delta_{\mathrm{on}}\, T} + \min\{T, \dim_{DE}(\mathcal{G}_h, \mathcal{P}_{\mathcal{F};h}, \gamma)\}\, C + T\gamma \Big),
\]
where \(\mathcal{G}_h := \{f_h - \mathcal{T} f_{h+1} \mid f_h \in \mathcal{F}_h, f_{h+1} \in \mathcal{F}_{h+1}\}\) denotes the set of Bellman residuals induced by \(\mathcal{F}\) at step \(h\), and \(\{\mathcal{P}_{\mathcal{F};h}\}_{h=1}^H\) is the collection of occupancy measures at step \(h\) induced by greedy policies w.r.t. value functions in \(\mathcal{F}\). We set \(\gamma = 1/\sqrt{T}\) and define \(d = \dim_{BE}(\mathcal{F}, \mathcal{P}_{\mathcal{F}}, \gamma) = \max_h \dim_{DE}(\mathcal{G}_h, \mathcal{P}_{\mathcal{F};h}, \gamma)\). Ignoring the lower order terms, we get
\[
\sum_{t=1}^T V^{\pi_e} - V^{\pi_t} \lesssim T \sqrt{C_{\pi_e} \cdot H \cdot \Delta_{\mathrm{off}}} + H \sqrt{d\, \Delta_{\mathrm{on}}\, T} \lesssim T \sqrt{C_{\pi_e} V_{\max} \cdot H \cdot \frac{\log(HT|\mathcal{F}|/\delta)}{m_{\mathrm{off}}}} + H V_{\max} \sqrt{\frac{dT \cdot \log(HT|\mathcal{F}|/\delta)}{m_{\mathrm{on}}}},
\]
where \(\lesssim\) hides lower order terms, multiplicative constants, and log factors. Setting \(m_{\mathrm{off}} = HT/d\) and \(m_{\mathrm{on}} = H^2\), we get
\[
\sum_{t=1}^T V^{\pi_e} - V^{\pi_t} = \widetilde O\Big( C_{\pi_e} V_{\max} \sqrt{dT \log(HT|\mathcal{F}|/\delta)} \Big).
\]

D COMPARISON WITH PREVIOUS WORKS

As mentioned in the main text, many previous empirical works consider combining offline expert demonstrations with online interaction (Rajeswaran et al., 2017; Hester et al., 2018; Nair et al., 2018; 2020; Vecerik et al., 2017; Lee et al., 2022; Jia et al., 2022; Niu et al., 2022). The idea of running an RL algorithm on both offline data (expert demonstrations) and online data has thus been explored before: for example, Vecerik et al. (2017) run DDPG on both the online and expert data, and Hester et al. (2018) use DQN on both, with an additional supervised loss. Since we already compared with Hester et al. (2018) in the experiments, here we focus on the comparison with Vecerik et al. (2017). We first emphasize that Vecerik et al. (2017) focus exclusively on expert demonstrations, and their experiments rely entirely on them, while we consider more general offline datasets that do not necessarily come from experts. That said, the DDPG-based algorithm of Vecerik et al. (2017) could potentially be used when the offline data is not from experts. Although the algorithm of Vecerik et al. (2017) and Hy-Q share the same high-level intuition, that one should perform RL on both datasets, there are a few differences: (1) Hy-Q uses Q-learning instead of deterministic policy gradients; note that deterministic policy gradient methods cannot be directly applied to discrete-action settings. (2) Hy-Q does not require n-step TD-style updates; in the off-policy case, without proper importance weighting, n-step TD can incur strong bias. While proper tuning of n can balance bias and variance, no such tuning is needed in Hy-Q. (3) The idea of keeping a non-zero ratio for sampling from the offline dataset is also proposed in Vecerik et al. (2017). Our buffer ratio is derived from our theoretical analysis, which in turn supports the similar heuristic applied in Vecerik et al. (2017).
(4) In their experiments, Vecerik et al. (2017) only consider expert demonstrations. In our experiments, we consider offline datasets with varying amounts of transitions from very low-quality policies and show that Hy-Q is robust to low-quality transitions in the offline data. Note that some of these differences may seem minor at the implementation level, but they can be important for the theory. Regarding the experiments, our experimental evaluation adds the following insights over those in Vecerik et al. (2017): (i) hybrid methods can succeed without expert data; (ii) hybrid methods can succeed in hard-exploration, discrete-action tasks; (iii) the core algorithm (Q-learning vs. DDPG) is not essential, although some details may matter. Given the similarity between the two methods, we believe some of these insights may also transfer to Vecerik et al. (2017), and we expect the choice between Hy-Q and Hy-DDPG to be environment specific, as it is for the purely online versions of these methods. In some situations, the fact that Q-learning works does not immediately imply that deterministic policy gradient methods work, nor vice versa. Nevertheless, rigorously verifying this claim is beyond the scope of this paper, and we consider the study of actor-critic algorithms in the hybrid RL setting an interesting future direction.

E EXPERIMENT DETAILS E.1 COMBINATION LOCK

In this section we provide a detailed description of the combination lock experiment. The combination lock environment has a horizon \(H\) and 10 actions at each state. There are three latent states \(z_{i,h}\), \(i \in \{0, 1, 2\}\), for each timestep \(h\), where \(z_{i,h}\), \(i \in \{0, 1\}\), are good states and \(z_{2,h}\) is the bad state. For each good state, we randomly pick a good action \(a_{i,h}\) such that, in latent state \(z_{i,h}\), \(i \in \{0, 1\}\), taking the good action \(a_{i,h}\) leads to \(z_{0,h+1}\) with probability 0.5 and to \(z_{1,h+1}\) with probability 0.5, while taking any other action leads to \(z_{2,h+1}\) with probability 1. From \(z_{2,h}\), all actions transition deterministically to \(z_{2,h+1}\). For the reward, we give an optimal reward of 1 for landing in \(z_{i,H}\), \(i \in \{0, 1\}\). We also give an anti-shaped reward of 0.1 for every transition from a good state to a bad state. All other transitions have reward 0. The initial distribution is uniform over \(z_{0,0}\) and \(z_{1,0}\). The observation space has dimension \(2^{\lceil \log(H+1) \rceil}\), created by concatenating a one-hot representation of the latent state and a one-hot representation of the horizon (padding with 0 if necessary). Random noise from \(\mathcal{N}(0, 0.1)\) is added to each dimension, and finally the observation is multiplied by a Hadamard matrix. Note that in this environment the agent needs to act optimally for all \(H\) timesteps to reach a final good state and collect the optimal reward of 1. Once the agent chooses a bad action, it stays in the bad state until the end, receiving at most the 0.1 reward obtained while transiting from a good state to a bad state.
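The dynamics above can be sketched as a small gym-style environment. This is an illustrative re-implementation based solely on the description in this section, not the authors' released code; the class name, seeding, and the exact padding of the observation (we round the dimension up so that both one-hot blocks fit) are our own choices.

```python
import numpy as np

class CombinationLock:
    """Toy combination lock: 3 latent states per timestep, 10 actions, horizon H."""

    def __init__(self, H=10, n_actions=10, seed=0):
        self.H, self.n_actions = H, n_actions
        self.rng = np.random.default_rng(seed)
        # One randomly chosen good action per good latent state and timestep.
        self.good_action = self.rng.integers(n_actions, size=(2, H))
        # The paper states dimension 2^ceil(log(H+1)); we pad up to 2^ceil(log2(H+4))
        # so the 3-dim latent one-hot and the (H+1)-dim horizon one-hot both fit.
        self.dim = 2 ** int(np.ceil(np.log2(H + 4)))
        self.hadamard = self._hadamard(self.dim)

    def _hadamard(self, n):
        M = np.array([[1.0]])
        while M.shape[0] < n:  # Sylvester construction, n is a power of two
            M = np.block([[M, M], [M, -M]])
        return M

    def reset(self):
        self.h = 0
        self.z = int(self.rng.integers(2))  # uniform over the two good states
        return self._obs()

    def step(self, a):
        reward = 0.0
        if self.z == 2:                            # bad state: absorbing
            next_z = 2
        elif a == self.good_action[self.z, self.h]:
            next_z = int(self.rng.integers(2))     # 0.5 / 0.5 over good states
        else:
            next_z = 2
            reward = 0.1                           # anti-shaped reward for falling off
        self.z, self.h = next_z, self.h + 1
        done = self.h == self.H
        if done and self.z in (0, 1):
            reward = 1.0                           # optimal reward at a final good state
        return self._obs(), reward, done

    def _obs(self):
        x = np.zeros(self.dim)
        x[self.z] = 1.0            # one-hot latent state (3 dims)
        x[3 + self.h] = 1.0        # one-hot horizon, zero-padded
        x += self.rng.normal(0, 0.1, self.dim)
        return self.hadamard @ x
```

Following the good actions for all \(H\) steps yields total reward 1, while any single mistake absorbs into the bad state with at most the 0.1 anti-shaped reward, matching the description above.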

E.2 IMPLEMENTATION DETAILS OF COMBINATION LOCK EXPERIMENT

We train \(H\) separate Q-functions, one for each of the \(H\) timesteps. Our function class consists of an encoder and a decoder. For the encoder, we feed the observation into one linear layer with 3 outputs, followed by a softmax layer, to obtain a state representation; this design is intended to learn a one-hot representation of the latent state. We take the Kronecker product of the state representation and the action, and feed the result to a linear layer with a single output, which is our Q-value. To stabilize training, at each iteration we warm-start the Q-function for timestep \(h - 1\) with the encoder from the timestep-\(h\) Q-function of the current iteration and the decoder from the timestep-\((h-1)\) Q-function of the previous iteration. One remark is that, since the combination lock belongs to the class of Block MDPs, we require a V-type algorithm instead of the Q-type algorithm shown in the main text. The only difference lies in the online sampling process: instead of sampling from \(d^{\pi_t}_h\) for each \(h\), we sample from \(d^{\pi_t}_h \circ \text{Uniform}(\mathcal{A})\), i.e., we first roll in with \(\pi_t\) to timestep \(h - 1\), then take a random action, observe the transition, and collect that tuple. We provide Algorithm 2 for completeness; note that the only difference is in line 4. For CQL, we implemented a CQL-DQN variant and report the peak of the learning curve in the main paper (so it should represent an upper bound on the performance of CQL).
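A minimal numpy sketch of the forward pass of this Q-function architecture (encoder, softmax, Kronecker product with a one-hot action, linear decoder). Shapes and initialization are illustrative; in the actual experiments these layers are trained by gradient descent, and the class and function names below are our own.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

class ComblockQ:
    """Q(s, a) = w_dec^T (softmax(W_enc @ obs) kron onehot(a)); one net per timestep."""

    def __init__(self, obs_dim, n_actions=10, n_latent=3, seed=0):
        rng = np.random.default_rng(seed)
        self.n_actions = n_actions
        self.W_enc = rng.normal(0, 0.1, (n_latent, obs_dim))   # encoder: obs -> 3 logits
        self.w_dec = rng.normal(0, 0.1, n_latent * n_actions)  # decoder: 3*|A| -> scalar

    def q_value(self, obs, action):
        state_repr = softmax(self.W_enc @ obs)     # soft one-hot latent-state code
        a_onehot = np.eye(self.n_actions)[action]
        feats = np.kron(state_repr, a_onehot)      # Kronecker product feature
        return float(self.w_dec @ feats)

    def greedy_action(self, obs):
        return int(np.argmax([self.q_value(obs, a) for a in range(self.n_actions)]))
```

Because the softmax output is a probability vector, the Kronecker feature selects (softly) one block of the decoder weights per latent state, mirroring the intended one-hot representation.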

E.3 IMPLEMENTATION DETAILS OF MONTEZUMA'S REVENGE EXPERIMENT

In this section we provide the detailed algorithm for the discounted setting. The overall algorithm is described in Algorithm 3. For the function approximation, we use a class of convolutional neural networks (parameterized by class \(\Theta\)), as in the original DQN paper. We include several standard empirical design choices that have been shown in practice to stabilize training: we use Prioritized Experience Replay (Schaul et al., 2015) for our buffer, and we add Double DQN (Van Hasselt et al., 2016) and Dueling DQN (Wang et al., 2016) to our Q-update. We also observe that a decaying schedule on the offline sample ratio \(\beta\) and the exploration rate \(\epsilon\) helps provide better performance. Note that an annealed \(\beta\) does not contradict our comment in Section 4 on catastrophic forgetting, because we only decay \(\beta\) to a small value after our online trajectory distribution covers \(d^{\pi_e}\). In addition, we perform per-step updates instead of per-episode updates, since this is the popular design choice and leads to better efficiency in practice.
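The decaying offline-sample ratio \(\beta\) can be sketched as follows. The linear decay schedule, the function names, and the plain uniform batch mixing below are our own illustrative choices; the actual experiment additionally uses prioritized replay and per-step updates.

```python
import random

def beta_schedule(step, beta_start=0.5, beta_end=0.05, decay_steps=100_000):
    """Linearly anneal the offline sampling ratio from beta_start to beta_end."""
    frac = min(step / decay_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)

def sample_hybrid_batch(offline_buffer, online_buffer, batch_size, beta, rng=random):
    """Draw ~beta fraction of the minibatch from the offline dataset, rest online."""
    n_off = min(int(round(beta * batch_size)), len(offline_buffer))
    n_on = batch_size - n_off
    batch = rng.sample(offline_buffer, n_off) + \
            [rng.choice(online_buffer) for _ in range(n_on)]
    rng.shuffle(batch)
    return batch
```

Keeping \(\beta\) strictly positive ensures the offline data is never fully discarded, which is the property the theory relies on; annealing it only shifts weight toward online data as coverage improves.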

E.4.1 COMBINATION LOCK

We use the open-sourced implementation https://github.com/BY571/CQL/tree/main/CQL-DQN for CQL. For BRIEE, we use the official code released by the authors, https://github.com/yudasong/briee, and we also rely on the combination lock environment implemented there.

E.4.2 MONTEZUMA'S REVENGE

We use the open-sourced implementation https://github.com/jcwleo/random-network-distillation-pytorch for RND. For CQL, we use https://github.com/takuseno/d3rlpy for their implementation of CQL for Atari. We use https://github.com/felix-kerkhoff/DQfD for DQfD. For all baselines, we keep the hyperparameters used in these public repositories. For CQL and DQfD, we provide the offline datasets described in the main text instead of using the offline datasets provided in the public repositories. 7 All baselines are tested in the same stochastic environment setup as in Burda et al. (2018).

E.5 HARDWARE INFRASTRUCTURE

We run our experiments on a cluster of compute nodes with Nvidia RTX 3090 GPUs and various CPUs; the hardware does not introduce any randomness into the results.



Footnotes:
- We use Q-learning and Q-iteration interchangeably, although they are, strictly speaking, not the same algorithm. Our theoretical results analyze Q-iteration, but we use an algorithm with an online/mini-batch flavor that is closer to Q-learning for our experiments.
- Note that FQI and Hy-Q extend to the infinite-horizon discounted setting (Munos & Szepesvári, 2008).
- Jin et al. (2021a) consider the Bellman Eluder dimension, which is related to but distinct from the Bilinear model. However, our proofs can be easily translated to this setting; see Appendix C for more details.
- We note that BRIEE is currently the state-of-the-art method for the combination lock environment. In particular, Misra et al. (2020) show that many deep RL baselines fail in this environment.
- Formally, we sample \(h \sim \text{Unif}([H])\), \(s \sim d^{\pi^\star}_h\), \(a \sim \text{Unif}(\mathcal{A})\), \(r \sim R(s, a)\), \(s' \sim P(s, a)\).
- This is for notational simplicity; we emphasize that we do not assume the eigenvalues are lower bounded. In other words, the eigenvalues of this feature covariance matrix may approach \(0^+\).
- We note that CQL also fails completely with the original offline dataset (with 1 million samples) provided in the public repository.



Figure 1: A hard instance for offline RL (Zhan et al., 2022, reproduced with permission). Consider two MDPs \(\{M_1, M_2\}\) with \(H = 2\), three states \(\{A, B, C\}\), two actions \(\{L, R\}\), and the fixed start state \(A\). The two MDPs have the same dynamics but different rewards. In both, actions from state \(B\) yield reward 1. In \(M_1\), \((C, R)\) yields reward 1, while in \(M_2\), \((C, L)\) yields reward 1. All other rewards are 0. In both \(M_1\) and \(M_2\), an optimal policy is \(\pi^*(A) = L\) and \(\pi^*(B) = \pi^*(C) = \text{Uniform}(\{L, R\})\). With \(\mathcal{F} = \{Q^\star_1, Q^\star_2\}\), where \(Q^\star_j\) is the optimal Q-function for \(M_j\), one can easily verify that \(\mathcal{F}\) satisfies Bellman completeness for both MDPs. Finally, with an offline distribution \(\nu\) supported on states \(A\) and \(B\) only (with no coverage of state \(C\)), we have sufficient coverage of \(d^{\pi^\star}\). However, samples from \(\nu\) are unable to distinguish between \(f_1\) and \(f_2\) (or \(M_1\) and \(M_2\)), since state \(C\) is not supported by \(\nu\). Unfortunately, adversarial tie-breaking may result in the greedy policies of \(f_1\) and \(f_2\) visiting state \(C\), where we have no information about the correct action.
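The hard instance can be written out by hand in a few lines. The transition structure below (from \(A\), action \(L\) leads to \(B\) and action \(R\) leads to \(C\)) is our reading of the figure, which is not reproduced here; the check confirms that the two optimal Q-functions agree everywhere \(\nu\) has support but disagree on state \(C\).

```python
# Optimal Q-functions for the two MDPs in Figure 1, enumerated by hand.
# Assumed dynamics: from start state A, action L leads to B and R leads to C;
# the episode ends after one action from B or C.

# Q at the final step: just the reward of the (state, action) pair.
Q1 = {("B", "L"): 1, ("B", "R"): 1, ("C", "L"): 0, ("C", "R"): 1}  # M1
Q2 = {("B", "L"): 1, ("B", "R"): 1, ("C", "L"): 1, ("C", "R"): 0}  # M2

# Q at the first step from A: immediate reward 0 plus the next state's optimal value.
for Q in (Q1, Q2):
    Q[("A", "L")] = max(Q[("B", a)] for a in ("L", "R"))
    Q[("A", "R")] = max(Q[("C", a)] for a in ("L", "R"))

# On the support of the offline distribution nu (states A and B only),
# the two candidate value functions are identical...
support = [("A", "L"), ("A", "R"), ("B", "L"), ("B", "R")]
assert all(Q1[k] == Q2[k] for k in support)
# ...but they disagree on state C, which nu never covers.
assert Q1[("C", "R")] != Q2[("C", "R")]
# The Q-values at A tie, so adversarial tie-breaking can send the greedy
# policy to C, where the offline data says nothing about the right action.
assert Q1[("A", "L")] == Q1[("A", "R")] == 1
```

The three assertions are exactly the three claims in the caption: indistinguishability on the support of \(\nu\), disagreement on \(C\), and the tie at \(A\) that lets the greedy policy wander into the uncovered state.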

Figure 2: The learning curve for combination lock with H = 100. The plots show the median and 80th/20th quantile for 5 replicates. Pure offline and IL methods are visualized as dashed horizontal lines (in the left plot, CQL overlaps with BC). Note that we report the number of samples while Zhang et al. (2022c) report the number of episodes.

Figure 3: The combination lock (Zhang et al., 2022c), reproduced with permission.

Figure 4: The learning curve for Montezuma's Revenge. The plots show the median and 80th/20th quantiles over 5 replicates. Pure offline methods, IL methods, and dataset qualities are visualized as dashed horizontal lines. "Expert" denotes \(V^{\pi_e}\) and "Offline" denotes the average trajectory reward in the offline dataset. The y-axis denotes the (moving) average over 100 episodes for the methods involving online interactions. Note that CQL and BC overlap in the last plot.

Figure 5: Learning curves of Hy-Q and RND. Metric follows Figure 4.


ACKNOWLEDGEMENT

AS thanks Karthik Sridharan for useful discussions. WS acknowledges funding support from NSF IIS-2154711. We thank Simon Zhao for their careful reading of the manuscript and for suggestions that improved the technical correctness of our paper. We also thank Uri Sherman for discussions on the computational efficiency of the original draft.


Fragment of Algorithm 3:
Let \(\pi\) be the \(\epsilon\)-greedy policy w.r.t. \(f_\theta\), i.e., \(\pi(s) = \operatorname{argmax}_a f_\theta(s, a)\) with probability \(1 - \epsilon\) and \(\pi(s) = U(\mathcal{A})\) with probability \(\epsilon\). // Online collection
7: Interact with the environment for one step: \(a = \pi(s)\), \(s' \sim P(s, a)\), \(r \sim R(s, a)\).
⋮
Perform one-step gradient descent on \(D\).
12: end if
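The \(\epsilon\)-greedy collection step above can be sketched as follows, assuming a generic `q_values(state)` callable (a hypothetical name, standing in for \(f_\theta(s, \cdot)\)) that returns one value per action.

```python
import random

def epsilon_greedy_action(q_values, state, n_actions, epsilon, rng=random):
    """With prob. 1 - epsilon act greedily w.r.t. the Q-function, else uniformly."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)                    # pi(s) = U(A)
    qs = q_values(state)
    return max(range(n_actions), key=lambda a: qs[a])      # argmax_a f_theta(s, a)
```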

