OFFLINE REINFORCEMENT LEARNING FROM HETERO-SKEDASTIC DATA VIA SUPPORT CONSTRAINTS

Abstract

Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. Both theoretically and empirically, we show that typical offline RL methods, which are based on distribution constraints, fail to learn from data with such non-uniform variability, because they require the learned policy to stay close to the behavior policy to the same extent across the state space. Ideally, the learned policy should be free to choose per state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning (CQL) to obtain an approximate support constraint formulation. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method, CQL (ReDS), is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
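The reweighting idea in the abstract can be illustrated with a minimal sketch. The code below is a simplified, hypothetical rendering for a single state with discrete actions (the function name `cql_reds_penalty` and the closed-form "mining" policy are our illustrative assumptions; the paper's setting uses sampled continuous actions and a learned mining policy): the reweighted distribution is a 50/50 mixture of the current policy and a distribution that up-weights actions that are likely under the behavior policy but have low Q-value.

```python
import numpy as np

def cql_reds_penalty(q_values, pi, behavior, beta=1.0):
    """Sketch of a CQL (ReDS)-style regularizer for one state with
    discrete actions (an illustrative simplification, not the paper's
    exact objective).

    q_values : (A,) Q-values for each action at this state
    pi       : (A,) current policy's action probabilities
    behavior : (A,) behavior policy's action probabilities
    beta     : inverse temperature of the "mining" distribution

    The reweighted distribution rho mixes the current policy pi with an
    auxiliary policy mu that mines poor-but-in-support actions,
    mu(a|s) proportional to behavior(a|s) * exp(-beta * Q(s, a)).
    Pushing Q-values down under rho (rather than under pi alone)
    penalizes only actions inside the behavior policy's support,
    approximating a support constraint.
    """
    # Mining distribution: behavior-likely, low-value actions.
    mu_logits = np.log(behavior + 1e-8) - beta * q_values
    mu = np.exp(mu_logits - mu_logits.max())
    mu /= mu.sum()

    rho = 0.5 * pi + 0.5 * mu  # reweighted mixture distribution

    # CQL-style penalty: push Q down under rho, up under the data.
    return rho @ q_values - behavior @ q_values
```

In this sketch, a policy that concentrates on an action with a high Q-value but near-zero behavior density yields a large positive penalty, driving that Q-value down; a policy that stays on the behavior support incurs little penalty, so the agent remains free to pick the best in-support action at each state.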

1. INTRODUCTION

Recent advances in offline RL (Levine et al., 2020; Lange et al., 2012) hint at exciting possibilities in learning high-performing policies, entirely from offline datasets, without requiring dangerous (García & Fernández, 2015) or expensive (Kalashnikov et al., 2018) active interaction with the environment. Analogously to the importance of data diversity in supervised learning (Deng et al., 2009), the practical benefits of offline RL depend heavily on the coverage of behavior in the offline datasets (Kumar et al., 2022). Intuitively, the dataset must illustrate the consequences of a diverse range of behaviors, so that an offline RL method can determine what behaviors lead to high returns, ideally returns that are significantly higher than the best single behavior in the dataset. We posit that combining many realistic sources of data can provide this kind of coverage, but doing so can lead to the variety of demonstrated behaviors varying in highly non-uniform ways across the state space, i.e., heteroskedastic datasets. For example, a dataset of humans driving cars might show very high variability in driving habits, with some drivers being timid and others more aggressive, while remaining remarkably consistent in critical states (e.g., human drivers are extremely unlikely to swerve on an empty road or drive off a bridge). A good offline RL algorithm should combine the best parts of each behavior in the dataset; e.g., in the above example, the algorithm should produce a policy that is as good as the best human in each situation, which would be better than any human driver overall. At the same time, the learned policy should not attempt to extrapolate to novel actions in subsets of the state space where the distribution of demonstrated behaviors is narrow (e.g., the algorithm should not attempt to drive off a bridge). How effectively can current offline RL methods selectively choose on a per-state basis how closely to stick to the behavior policy?
Most existing methods (Kumar et al., 2019; 2020; Kostrikov et al., 2021b; a; Wu et al., 2019; Fujimoto et al., 2018a; Jaques et al., 2019) constrain the learned policy to stay close to the behavior policy, an approach referred to as "distributional constraints". Our first contribution consists of empirical and theoretical evidence demonstrating that distributional constraints are insufficient when the heteroskedas-

