THE ROLE OF COVERAGE IN REINFORCEMENT LEARNING

Abstract

Coverage conditions, which assert that the data logging distribution adequately covers the state space, play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing, somewhat surprisingly, that the mere existence of a data distribution with good coverage can enable sample-efficient online RL. Concretely, we show that coverability (that is, the existence of a data distribution that satisfies a ubiquitous coverage condition called concentrability) can be viewed as a structural property of the underlying MDP, and can be exploited by standard algorithms for sample-efficient exploration, even when the agent does not know said distribution. We complement this result by proving that several weaker notions of coverage, despite being sufficient for offline RL, are insufficient for online RL. We also show that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, fail to optimally capture coverability, and propose a new complexity measure, the sequential extrapolation coefficient, to provide a unification.
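To make the abstract's central definitions concrete, the following is a minimal numerical sketch in the finite (tabular) case. The function names and the toy occupancy measures are illustrative choices of ours, not notation from this paper: concentrability of a data distribution mu is the worst-case ratio of a policy's state-action occupancy measure to mu, and coverability is the best concentrability achievable by any data distribution. For finitely many occupancy measures, taking mu proportional to their pointwise maximum attains the optimum.

```python
import numpy as np

def concentrability(occupancies, mu):
    """sup over policies and state-action pairs of d^pi(s, a) / mu(s, a)."""
    mu = np.asarray(mu, dtype=float)
    return max(np.max(np.asarray(d, dtype=float) / mu) for d in occupancies)

def coverability(occupancies):
    """inf over data distributions mu of concentrability(occupancies, mu).

    Taking mu proportional to the pointwise maximum m of the occupancy
    measures gives concentrability ||m||_1, and for any mu the max ratio
    is at least its mu-weighted average, which also equals ||m||_1 -- so
    this choice is exactly optimal.
    """
    m = np.max(np.stack([np.asarray(d, dtype=float) for d in occupancies]), axis=0)
    mu_star = m / m.sum()
    return concentrability(occupancies, mu_star), mu_star

# Toy example: two policies over three state-action pairs.
d1 = np.array([0.8, 0.1, 0.1])
d2 = np.array([0.1, 0.1, 0.8])
C, mu = coverability([d1, d2])  # C = ||max(d1, d2)||_1 = 1.7
```

Note that no single policy's occupancy measure achieves this bound here: using d1 itself as the data distribution gives concentrability 8 (from d2's mass on the last pair), while the mixture-like mu attains 1.7.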

1. INTRODUCTION

The last decade has seen the development of reinforcement learning algorithms with strong empirical performance in domains including robotics (Kober et al., 2013; Lillicrap et al., 2015), dialogue systems (Li et al., 2016), and personalization (Agarwal et al., 2016; Tewari and Murphy, 2017). While there is great interest in applying these techniques to real-world decision making applications, the number of samples (steps of interaction) required to do so is often prohibitive, with state-of-the-art algorithms requiring millions of samples to reach human-level performance in challenging domains. Developing algorithms with improved sample efficiency, which entails efficiently generalizing across high-dimensional states and actions while taking advantage of problem structure as modeled by practitioners, remains a major challenge.

Investigation into the design and analysis of algorithms for sample-efficient reinforcement learning has largely focused on two distinct problem formulations:

• Online reinforcement learning, where the learner can repeatedly interact with the environment by executing a policy and observing the resulting trajectory.

• Offline reinforcement learning, where the learner has access to logged transitions and rewards gathered from a fixed behavioral policy (e.g., historical data or expert demonstrations), but cannot directly interact with the underlying environment.

While these formulations share a common goal (learning a near-optimal policy), the algorithms used to achieve this goal and the conditions under which it can be achieved are seemingly quite different. Focusing on value function approximation, sample-efficient algorithms for online reinforcement learning require both (a) representation conditions, which assert that the function approximator is flexible enough to represent value functions for the underlying MDP (optimal or otherwise), and (b) exploration conditions

