D4RL: DATASETS FOR DEEP DATA-DRIVEN REINFORCEMENT LEARNING

Abstract

The offline reinforcement learning (RL) setting (also known as full batch RL), where a policy is learned from a static dataset, is compelling as progress enables RL methods to take advantage of large, previously-collected datasets, much like how the rise of large datasets has fueled results in supervised learning. However, existing online RL benchmarks are not tailored towards the offline setting and existing offline RL benchmarks are restricted to data generated by partially-trained agents, making progress in offline RL difficult to measure. In this work, we introduce benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL. With a focus on dataset collection, examples of such properties include: datasets generated via hand-designed controllers and human demonstrators, multitask datasets where an agent performs different tasks in the same environment, and datasets collected with mixtures of policies. By moving beyond simple benchmark tasks and data collected by partially-trained RL agents, we reveal important and unappreciated deficiencies of existing algorithms. To facilitate research, we have released our benchmark tasks and datasets with a comprehensive evaluation of existing algorithms, an evaluation protocol, and open-source examples. This serves as a common starting point for the community to identify shortcomings in existing offline RL methods and a collaborative route for progress in this emerging area.

1. INTRODUCTION

Impressive progress across a range of machine learning applications has been driven by high-capacity neural network models with large, diverse training datasets (Goodfellow et al., 2016). While reinforcement learning (RL) algorithms have also benefited from deep learning (Mnih et al., 2015), active data collection is typically required for these algorithms to succeed, limiting the extent to which large, previously-collected datasets can be leveraged. Offline RL (Lange et al., 2012) (also known as full batch RL), where agents learn from previously-collected datasets, provides a bridge between RL and supervised learning. The promise of offline RL is leveraging large, previously-collected datasets in the context of sequential decision making, where reward-driven learning can produce policies that reason over temporally extended horizons. This could have profound implications for a range of application domains, such as robotics, autonomous driving, and healthcare.

Current offline RL methods have not yet fulfilled this promise. While recent work has investigated technical reasons for this (Fujimoto et al., 2018a; Kumar et al., 2019; Wu et al., 2019), a major challenge in addressing these issues has been the lack of standard evaluation benchmarks. Ideally, such a benchmark should: a) be composed of tasks that reflect challenges in real-world applications of data-driven RL, b) be widely accessible to researchers and define clear evaluation protocols for reproducibility, and c) span a range of difficulty that differentiates between algorithms, especially on challenges particular to the offline RL setting. Most recent works (Fujimoto et al., 2018b; Wu et al., 2019; Kumar et al., 2019; Peng et al., 2019; Agarwal et al., 2019b) use existing online RL benchmark domains and data collected from training runs of online RL methods.
However, these benchmarks were not designed with offline RL in mind, and such datasets do not reflect the heterogeneous nature of data collected in practice. Wu et al. (2019) find that existing benchmark datasets are not sufficient to differentiate between simple baseline approaches and recently proposed algorithms. Furthermore, the aforementioned works do not propose a standard evaluation protocol, which makes comparing methods challenging.

Why simulated environments? While relying on existing real-world datasets is appealing, evaluating a candidate policy is challenging because it weights actions differently than the data collection policy and may take actions that do not appear in the data. Thus, evaluating a candidate policy requires either collecting additional data from the real-world system, which is hard to standardize and make broadly available, or employing off-policy evaluation, which is not yet reliable enough (e.g., the NeurIPS 2017 Criteo Ad Placement Challenge used off-policy evaluation; however, despite an unprecedentedly large dataset, the variance of the estimator meant that top entries were not statistically distinguishable from the baseline). Both options are at odds with a widely-accessible and reproducible benchmark. As a compromise, we use high-quality simulators that have been battle-tested in prior domain-specific work, such as robotics and autonomous driving. These simulators allow researchers to evaluate candidate policies accurately.

Our primary contribution is the introduction of Datasets for Deep Data-Driven Reinforcement Learning (D4RL): a suite of tasks and datasets for benchmarking progress in offline RL. We focus our design around tasks and data collection strategies that exercise dimensions of the offline RL problem likely to occur in practical applications, such as partial observability, passively logged data, and human demonstrations.
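The variance problem with importance-sampling-based off-policy evaluation can be made concrete with a small simulation. The sketch below is purely illustrative and not part of the benchmark: the policies, probabilities, and horizons are invented for the example, and a "trajectory" simply multiplies per-step likelihood ratios. Although the estimator is unbiased, its spread grows rapidly with horizon, which is the effect that made the Criteo challenge's top entries statistically indistinguishable.

```python
import random
import statistics

random.seed(0)

# Toy setup (illustrative numbers): at every step the logging (behavior)
# policy picks between two actions uniformly, while the candidate (target)
# policy prefers action 1 with probability 0.9.
BEHAVIOR_P = [0.5, 0.5]
TARGET_P = [0.1, 0.9]

def trajectory_weight(horizon):
    """Product of per-step likelihood ratios pi(a_t|s_t) / beta(a_t|s_t)."""
    w = 1.0
    for _ in range(horizon):
        a = 0 if random.random() < BEHAVIOR_P[0] else 1
        w *= TARGET_P[a] / BEHAVIOR_P[a]
    return w

def is_estimate(horizon, n_traj):
    """Ordinary importance-sampling estimate of the target policy's value.
    Every trajectory is given return 1, so the true value is exactly 1."""
    return statistics.mean(trajectory_weight(horizon) for _ in range(n_traj))

# The estimator stays centered near 1, but its standard deviation across
# repeated evaluations blows up as the horizon grows.
for horizon in (1, 10):
    estimates = [is_estimate(horizon, 100) for _ in range(500)]
    print(horizon,
          round(statistics.mean(estimates), 3),
          round(statistics.stdev(estimates), 3))
```

Since each per-step ratio has mean 1 but second moment greater than 1, the variance of the trajectory weight grows exponentially with horizon, so dataset size alone cannot rescue the estimator at long horizons.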
To serve as a reference, we benchmark state-of-the-art offline RL algorithms (Haarnoja et al., 2018b; Kumar et al., 2019; Wu et al., 2019; Agarwal et al., 2019b; Fujimoto et al., 2018a; Nachum et al., 2019; Peng et al., 2019; Kumar et al., 2020) and provide reference implementations as a starting point for future work. While previous studies (e.g., Wu et al. (2019)) found that all methods, including simple baselines, performed well on the limited set of tasks used in prior work, we find that most algorithms struggle on tasks with properties crucial to real-world applications, such as passively logged data, narrow data distributions, and limited human demonstrations. By moving beyond simple benchmark tasks and data collected by partially-trained RL agents, we reveal important and unappreciated deficiencies of existing algorithms. To facilitate adoption, we provide an easy-to-use API for tasks and datasets, and a collection of benchmark implementations of existing algorithms (https://sites.google.com/view/d4rl-anonymous/). This serves as a common starting point for the community to identify shortcomings in existing offline RL methods, and provides a meaningful metric for progress in this emerging area.
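As a rough illustration of how such a dataset API is typically consumed, the sketch below assumes an offline dataset exposed as a dictionary of time-aligned arrays and converts it into transition tuples for uniform minibatch sampling. The keys, shapes, and helper names here are assumptions made for the example, not the benchmark's exact interface.

```python
import random

# Hypothetical stand-in for a loaded offline dataset: a dictionary of
# equal-length, time-aligned arrays, one entry per environment step.
dataset = {
    "observations": [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]],
    "actions":      [[1.0], [0.5], [-0.5], [0.0]],
    "rewards":      [0.0, 1.0, 0.0, 1.0],
    "terminals":    [False, False, False, True],
}

def to_transitions(data):
    """Convert aligned arrays into (s, a, r, s', done) tuples.

    The final logged step has no successor observation in the dataset,
    so iteration stops one step short of the end.
    """
    transitions = []
    for t in range(len(data["rewards"]) - 1):
        transitions.append((
            data["observations"][t],
            data["actions"][t],
            data["rewards"][t],
            data["observations"][t + 1],
            data["terminals"][t],
        ))
    return transitions

def sample_batch(transitions, batch_size):
    """Sample a minibatch uniformly, as an offline RL learner would."""
    return random.sample(transitions, batch_size)

transitions = to_transitions(dataset)
batch = sample_batch(transitions, 2)
print(len(transitions), len(batch))  # prints "3 2"
```

Because learning is entirely offline, the sampler never touches the environment; all gradient updates are computed from this fixed pool of transitions.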

2. RELATED WORK

Recent work in offline RL has primarily used datasets generated by a previously trained behavior policy, ranging from a random initial policy to a near-expert online-trained policy. This approach has been used for continuous control in robotics (Fujimoto et al., 2018a; Kumar et al., 2019; Wu et al., 2019; Gulcehre et al., 2020), navigation (Laroche et al., 2019), industrial control (Hein et al., 2017), and Atari video games (Agarwal et al., 2019b). To standardize the community around common datasets, several recent works have proposed benchmarks for offline RL algorithms. Agarwal et al. (2019b) and Fujimoto et al. (2019) propose benchmarking based on the discrete Atari domain. Concurrently to our work, Gulcehre et al. (2020) proposed a benchmark based on locomotion and manipulation tasks with perceptually challenging input and partial observability. While these are important contributions, both benchmarks suffer from the same shortcomings as prior evaluation protocols: they rely on data collected from online RL training runs. In contrast, with D4RL, in addition to collecting data from online RL training runs, we focus on a range of dataset collection strategies.



Website with code, examples, tasks, and data is available at https://sites.google.com/view/d4rl-anonymous/

Note that the benchmark proposed by Gulcehre et al. (2020) contains the Real-World RL Challenges benchmark (Dulac-Arnold et al., 2020), based on (Dulac-Arnold et al., 2019), and also uses data collected from partially-trained RL agents.



Figure 1: A selection of proposed benchmark tasks.

