THE ROLE OF COVERAGE IN ONLINE REINFORCEMENT LEARNING

Abstract

Coverage conditions, which assert that the data logging distribution adequately covers the state space, play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing, somewhat surprisingly, that the mere existence of a data distribution with good coverage can enable sample-efficient online RL. Concretely, we show that coverability (that is, the existence of a data distribution that satisfies a ubiquitous coverage condition called concentrability) can be viewed as a structural property of the underlying MDP, and can be exploited by standard algorithms for sample-efficient exploration, even when the agent does not know said distribution. We complement this result by proving that several weaker notions of coverage, despite being sufficient for offline RL, are insufficient for online RL. We also show that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, fail to optimally capture coverability, and we propose a new complexity measure, the sequential extrapolation coefficient, to provide a unification.

1. INTRODUCTION

The last decade has seen the development of reinforcement learning algorithms with strong empirical performance in domains including robotics (Kober et al., 2013; Lillicrap et al., 2015), dialogue systems (Li et al., 2016), and personalization (Agarwal et al., 2016; Tewari and Murphy, 2017). While there is great interest in applying these techniques to real-world decision making applications, the number of samples (steps of interaction) required to do so is often prohibitive, with state-of-the-art algorithms requiring millions of samples to reach human-level performance in challenging domains. Developing algorithms with improved sample efficiency, which entails efficiently generalizing across high-dimensional states and actions while taking advantage of problem structure as modeled by practitioners, remains a major challenge. Investigation into the design and analysis of algorithms for sample-efficient reinforcement learning has largely focused on two distinct problem formulations:

• Online reinforcement learning, where the learner can repeatedly interact with the environment by executing a policy and observing the resulting trajectory.

• Offline reinforcement learning, where the learner has access to logged transitions and rewards gathered from a fixed behavioral policy (e.g., historical data or expert demonstrations), but cannot directly interact with the underlying environment.

While these formulations share a common goal (learning a near-optimal policy), the algorithms used to achieve this goal and the conditions under which it can be achieved are seemingly quite different.
Focusing on value function approximation, sample-efficient algorithms for online reinforcement learning require both (a) representation conditions, which assert that the function approximator is flexible enough to represent value functions for the underlying MDP (optimal or otherwise), and (b) exploration conditions (or, structural conditions), which limit the amount of exploration required to learn a near-optimal policy, typically by enabling extrapolation across states or limiting the number of effective state distributions (Russo and Van Roy, 2013; Jiang et al., 2017; Sun et al., 2019; Wang et al., 2020b; Du et al., 2021; Jin et al., 2021a; Foster et al., 2021). Algorithms for offline reinforcement learning typically require similar representation conditions. However, since data is collected passively from a fixed logging policy/distribution rather than actively, the exploration conditions used in online RL are replaced with coverage conditions, which assert that the data collection distribution provides sufficient coverage over the state space (Antos et al., 2008; Chen and Jiang, 2019; Xie and Jiang, 2020; 2021; Jin et al., 2021b; Rashidinejad et al., 2021; Foster et al., 2022; Zhan et al., 2022). The aim for both lines of research (online and offline) is to identify the weakest possible conditions under which learning is possible, and to design algorithms that take advantage of these conditions. The two lines have largely evolved in parallel, and it is natural to wonder whether there are deeper connections. Since the conditions for sample-efficient online RL and offline RL mainly differ via exploration versus coverage, this leads us to ask:

If an MDP admits a data distribution with favorable coverage for offline RL, what does this imply about our ability to perform online RL efficiently?

Beyond its intrinsic theoretical value, this question is motivated by the observation that many real-world applications lie on a spectrum between online and offline.
It is common for the learner to have access to logged/offline data, yet also have the ability to actively interact with the underlying environment, possibly subject to limitations such as an exploration budget (Kalashnikov et al., 2018) . Building a theory of real-world RL that can lead to algorithm design insights for such settings requires understanding the interplay between online and offline RL.
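For concreteness, the concentrability condition referenced above is typically stated as a uniform bound on occupancy-measure ratios. A standard L∞ formulation (the notation below is our own shorthand, not verbatim from any single reference) is:

```latex
% d^\pi: state-action occupancy measure of policy \pi;
% \mu: the offline data distribution; \Pi: the policy class.
C_{\mathrm{conc}}(\mu) \;=\; \sup_{\pi \in \Pi}\, \max_{s,a}\, \frac{d^{\pi}(s,a)}{\mu(s,a)},
```

and offline RL guarantees require $C_{\mathrm{conc}}(\mu)$ to be bounded for the given data distribution $\mu$; coverability, introduced in the next section, asks only that some $\mu$ with bounded $C_{\mathrm{conc}}(\mu)$ exists.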

1.1. OUR RESULTS

We investigate connections between coverage conditions in offline RL and exploration in online RL by focusing on the concentrability coefficient, the most ubiquitous notion of coverage in offline RL. Concentrability quantifies the extent to which the data collection distribution uniformly covers the state-action distribution induced by any policy. We introduce a new structural property, coverability, which reflects the best concentrability coefficient that can be achieved by any data distribution, possibly designed by an oracle with knowledge of the underlying MDP. Our main results are as follows:

1. We show (Section 3) that coverability (that is, the mere existence of a distribution with good concentrability) is sufficient for sample-efficient online exploration, even when the learner has no prior knowledge of this distribution. This result requires no additional assumptions on the underlying MDP beyond standard Bellman completeness, and, perhaps surprisingly, is achieved using standard algorithms (Jin et al., 2021a), albeit with analysis ideas that go beyond existing techniques.

2. We show (Section 4) that several weaker notions of coverage in offline RL, including single-policy concentrability (Jin et al., 2021b; Rashidinejad et al., 2021) and conditions based on Bellman residuals (Chen and Jiang, 2019; Xie et al., 2021a), are insufficient for sample-efficient online exploration. This shows that, in general, coverage in offline reinforcement learning and exploration in online RL are not compatible, and highlights the need for additional investigation going forward.

Our results serve as a starting point for a systematic study of connections between online and offline learnability in RL. To this end, we provide several secondary results:

1. We show (Section 5) that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, do not optimally capture coverability, and provide a new complexity measure, the sequential extrapolation coefficient, which unifies these notions.

2. We establish (Appendix C) connections between coverability and reinforcement learning with exogenous noise, with applications to learning in exogenous block MDPs (Efroni et al., 2021; 2022a).

3. We give algorithms for reward-free exploration (Jin et al., 2020a; Chen et al., 2022) under coverability (Appendix G).

While our results primarily concern analysis of existing algorithms rather than algorithm design, they highlight a number of exciting directions for future research, and we are optimistic that the notion of coverability can guide the design of practical algorithms going forward.

Notation. For an integer n ∈ N, we let [n] denote the set {1,...,n}. For a set X, we let ∆(X) denote the set of all probability distributions over X. We adopt standard big-oh notation, and write f = Õ(g) to denote that f = O(g · max{1, polylog(g)}), and a ≲ b as shorthand for a = O(b).
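Although the results above are stated abstractly, coverability is easy to compute exactly in small tabular MDPs, which can help build intuition. The sketch below (the toy MDP and all names are our own illustration, not from this paper) uses a standard calculation: per layer, inf over data distributions μ of the worst-case ratio max_{s,a} d_h^π(s,a)/μ(s,a) over policies π equals the sum over (s,a) of sup_π d_h^π(s,a), with the optimal μ proportional to that pointwise supremum.

```python
import itertools

import numpy as np

# Hypothetical toy layered MDP (H layers, S states, A actions), for illustration only.
H, S, A = 3, 4, 2
rng = np.random.default_rng(0)
# P[h][s, a] is the distribution over next states at layer h.
P = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(H)]
init = np.zeros(S)
init[0] = 1.0  # deterministic start state


def occupancy(policy):
    """Occupancy measures d_h(s, a) for a deterministic policy given as an (H, S) action table."""
    d = np.zeros((H, S, A))
    state_dist = init.copy()
    for h in range(H):
        for s in range(S):
            d[h, s, policy[h, s]] = state_dist[s]
        state_dist = sum(state_dist[s] * P[h][s, policy[h, s]] for s in range(S))
    return d


# g[h, s, a] = sup over policies of d_h(s, a).  For a single (s, a) objective the
# supremum is attained by a deterministic Markov policy, so brute-force
# enumeration of the A**(H*S) such policies is exact on this toy instance.
g = np.zeros((H, S, A))
for flat in itertools.product(range(A), repeat=H * S):
    g = np.maximum(g, occupancy(np.array(flat).reshape(H, S)))

# Per-layer coverability is inf_mu max_{s,a} g[h, s, a] / mu(s, a).  Taking mu
# proportional to g[h] achieves g[h].sum(), and no mu can do better, since for
# any mu:  max_{s,a} g/mu >= sum_{s,a} mu(s,a) * (g(s,a)/mu(s,a)) = sum g.
coverability_per_layer = [g[h].sum() for h in range(H)]
cov = max(coverability_per_layer)
print(f"coverability = {cov:.3f}  (always between 1 and S*A = {S * A})")
```

Note that the brute-force enumeration is only for transparency; the same quantity can be obtained by solving one max-reachability planning problem per state-action pair.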

