HYBRID RL: USING BOTH OFFLINE AND ONLINE DATA CAN MAKE RL EFFICIENT

Abstract

We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q-learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma's Revenge.

1. INTRODUCTION

Learning by interacting with an environment, in the standard online reinforcement learning (RL) protocol, has led to impressive results across a number of domains. State-of-the-art RL algorithms are quite general, employing function approximation to scale to complex environments with minimal domain expertise and inductive bias. However, online RL agents are also notoriously sample inefficient, often requiring billions of environment interactions to achieve suitable performance. This issue is particularly salient when the environment requires sophisticated exploration and a high-quality reset distribution is unavailable to help overcome the exploration challenge. As a consequence, the practical success of online RL and related policy gradient/improvement methods has been largely restricted to settings where a high-quality simulator is available. To overcome the issue of sample inefficiency, attention has turned to the offline RL setting (Levine et al., 2020), where, rather than interacting with the environment, the agent trains on a large dataset of experience collected in some other manner (e.g., by a system running in production or an expert). While these methods still require a large dataset, they mitigate the sample complexity concerns of online RL, since the dataset can be collected without compromising system performance. However, offline RL methods can suffer from distribution shift, where the state distribution induced by the learned policy differs significantly from the offline distribution (Wang et al., 2021). Existing provable approaches for addressing distribution shift are computationally intractable, while empirical approaches rely on heuristics that can be sensitive to the domain and offline dataset (as we will see). In this paper, we focus on a hybrid reinforcement learning setting, which we call Hybrid RL, that draws on the favorable properties of both the offline and online settings.
In Hybrid RL, the agent has both an offline dataset and the ability to interact with the environment, as in the traditional online RL setting. The offline dataset helps address the exploration challenge, allowing us to greatly reduce the number of interactions required. Simultaneously, we can identify and correct distribution shift issues via online interaction. Variants of this setting have been studied in a number of empirical works (Rajeswaran et al., 2017; Hester et al., 2018; Nair et al., 2018; 2020; Vecerik et al., 2017), which mainly focus on using expert demonstrations as offline data. Our algorithmic development is closely related to these works, although our focus is on formalizing the hybrid setting and establishing theoretical guarantees against more general offline datasets. Hybrid RL is closely related to the reset setting, where the agent can interact with the environment starting from a "nice" distribution. A number of simple and effective algorithms, including CPI (Kakade & Langford, 2002), PSDP (Bagnell et al., 2003), and policy gradient methods (Kakade, 2001; Agarwal et al., 2020b), which have further inspired deep RL methods such as TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017), are provably efficient in the reset setting. Yet, a nice reset distribution is a strong requirement (often tantamount to having access to a detailed simulation) and is unlikely to be available in real-world applications. Hybrid RL differs from the reset setting in that (a) we have an offline dataset, but (b) our online interactions start from the initial distribution of the environment, which is not assumed to have any nice properties. Both features (offline data and a nice reset distribution) facilitate algorithm design by de-emphasizing the exploration challenge. However, Hybrid RL is much more practical, since an offline dataset is much easier to obtain in practice.
We showcase the Hybrid RL setting with a new algorithm, Hybrid Q-Learning or Hy-Q (pronounced: Haiku). The algorithm is a simple adaptation of the classical fitted Q-iteration algorithm (FQI) and accommodates value-based function approximation.¹ For our theoretical results, we prove that Hy-Q is both statistically and computationally efficient assuming that: (1) the offline distribution covers some high-quality policy, (2) the MDP has low bilinear rank, (3) the function approximator is Bellman complete, and (4) we have a least-squares regression oracle. The first three assumptions are standard statistical assumptions in the RL literature, while the fourth is a widely used computational abstraction for supervised learning. No computationally efficient algorithms are known under these assumptions in pure offline or pure online settings, which highlights the advantages of the hybrid setting. We also implement Hy-Q and evaluate it on two challenging RL benchmarks: a rich-observation combination lock (Misra et al., 2020) and Montezuma's Revenge from the Arcade Learning Environment (Bellemare et al., 2013). Starting with an offline dataset that contains some transitions from a high-quality policy, our approach outperforms: an online RL baseline with theoretical guarantees, an online deep RL baseline tuned for Montezuma's Revenge, pure offline RL baselines, imitation learning baselines, and existing hybrid methods. Compared to the online methods, Hy-Q requires only a small fraction of the online experience, demonstrating its sample efficiency. Compared to the offline and hybrid methods, Hy-Q performs most favorably when the offline dataset also contains many interactions from low-quality policies, demonstrating its robustness. These results reveal the significant benefits that can be realized by combining offline and online data.
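To make the idea concrete, the following is a minimal tabular sketch of hybrid fitted Q-iteration in the spirit described above: each iteration rolls out the current greedy policy from the environment's initial distribution, aggregates the new transitions with the fixed offline dataset, and refits Q by regression onto Bellman backup targets. The interface names (`env_reset`, `env_step`) and the finite-horizon tabular setup are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def hy_q(offline_data, env_step, env_reset, H, n_states, n_actions,
         n_iters=50, online_per_iter=10):
    """Tabular sketch of hybrid fitted Q-iteration.

    offline_data: list of (h, s, a, r, s_next) transitions (fixed).
    env_reset(): sample an initial state; env_step(s, a): return (s_next, r).
    These hypothetical callables stand in for real environment access.
    """
    online_data = []
    Q = np.zeros((H, n_states, n_actions))
    for _ in range(n_iters):
        # (1) Collect online transitions by rolling out the greedy policy
        #     w.r.t. the current Q, starting from the initial distribution.
        for _ in range(online_per_iter):
            s = env_reset()
            for h in range(H):
                a = int(np.argmax(Q[h, s]))
                s_next, r = env_step(s, a)
                online_data.append((h, s, a, r, s_next))
                s = s_next
        # (2)+(3) Refit each Q_h on the hybrid dataset via least squares;
        #         with tabular features this reduces to per-(s, a) means.
        data = offline_data + online_data
        for h in reversed(range(H)):
            targets = np.zeros((n_states, n_actions))
            counts = np.zeros((n_states, n_actions))
            for (hh, s, a, r, s_next) in data:
                if hh != h:
                    continue
                backup = r + (Q[h + 1, s_next].max() if h + 1 < H else 0.0)
                targets[s, a] += backup
                counts[s, a] += 1
            Q[h] = np.divide(targets, np.maximum(counts, 1))
    return Q
```

Note that, matching the description above, data collection is purely greedy: there is no explicit exploration bonus, and coverage of good behavior is supplied by the offline dataset rather than by the rollouts themselves.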

2. RELATED WORK

We discuss related work from four categories: pure online RL, online RL with access to a reset distribution, offline RL, and prior work in hybrid settings. We note that pure online RL refers to the setting where one can only reset the system to the initial state distribution of the environment, which is not assumed to provide any form of coverage.

Pure online RL. Beyond tabular settings, many existing statistically efficient RL algorithms are not computationally tractable, due to the difficulty of implementing optimism. This is true in the linear MDP (Jin et al., 2020) with large action spaces, in the linear Bellman complete model (Zanette et al., 2020; Agarwal et al., 2019), and in the general function approximation setting (Jiang et al., 2017; Sun et al., 2019; Du et al., 2021; Jin et al., 2021a). These computational challenges have inspired intractability results for aspects of online RL (Dann et al., 2018; Kane et al., 2022). Several online RL algorithms aim to tackle the computational issue via stronger structural assumptions and supervised-learning-style computational oracles (Misra et al., 2020; Zhang et al., 2022c; Agarwal et al., 2020a; Uehara et al., 2021; Modi et al., 2021; Zhang et al., 2022a; Qiu et al., 2022). Compared to these oracle-based methods, our approach operates in the more general

¹ We use Q-learning and Q-iteration interchangeably, although they are not, strictly speaking, the same algorithm. Our theoretical results analyze Q-iteration, but for our experiments we use an algorithm with an online/mini-batch flavor that is closer to Q-learning.

