BENCHMARKING OFFLINE REINFORCEMENT LEARNING ON REAL-ROBOT HARDWARE

Abstract

Learning policies from previously recorded data is a promising direction for real-world robotics tasks, as online learning is often infeasible. Dexterous manipulation in particular remains an open problem in its general form. The combination of offline reinforcement learning with large, diverse datasets, however, has the potential to lead to a breakthrough in this challenging domain, analogous to the rapid progress made in supervised learning in recent years. To coordinate the efforts of the research community toward tackling this problem, we propose a benchmark including: i) a large collection of data for offline learning from a dexterous manipulation platform on two tasks, obtained with capable RL agents trained in simulation; ii) the option to execute learned policies on a real-world robotic system and in a simulation for efficient debugging. We evaluate prominent open-source offline reinforcement learning algorithms on the datasets and provide a reproducible experimental setup for offline reinforcement learning on real systems. Visit https://sites.google.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton et al., 1998) holds great potential for robotic manipulation and other real-world decision-making problems, as it can solve tasks autonomously by learning from interactions with the environment. When data can be collected during learning, RL in combination with high-capacity function approximators can solve challenging high-dimensional problems (Mnih et al., 2015; Lillicrap et al., 2016; Silver et al., 2017; Berner et al., 2019). However, in many cases online learning is not feasible because collecting a large amount of experience with a partially trained policy is either prohibitively expensive or unsafe (Dulac-Arnold et al., 2020). Examples include autonomous driving, where suboptimal policies can lead to accidents; robotic applications, where the hardware is likely to get damaged without additional safety mechanisms; and collaborative robotic scenarios, where humans are at risk of being harmed. Offline reinforcement learning (offline RL or batch RL) (Lange et al., 2012) tackles this problem by learning a policy from prerecorded data generated by experts or by handcrafted controllers respecting the system's constraints. Independently of how the data is collected, it is essential to make the best possible use of it and to design algorithms whose performance improves as the amount of available data grows. This property has led to unexpected generalization in computer vision (Krizhevsky et al., 2012; He et al., 2016; Redmon et al., 2016) and natural language tasks (Floridi & Chiriatti, 2020; Devlin et al., 2018) when massive datasets are employed. With the motivation to learn similarly capable decision-making systems from data, the field of offline RL has gained considerable attention. Progress is currently measured by benchmarking algorithms on simulated domains, both in terms of data collection and evaluation.
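To make the offline setting concrete, the following minimal Python sketch trains a policy purely from a prerecorded dataset, with no environment interaction during training. It uses behavior cloning as the simplest offline baseline; it is not one of the algorithms evaluated in this paper, and the dataset shapes, network sizes, and random data are illustrative assumptions.

```python
# Minimal sketch of the offline-RL setting: learn a policy from logged
# (observation, action) pairs alone; no new experience is collected.
import numpy as np
import torch
import torch.nn as nn

# Hypothetical prerecorded dataset, e.g. logged from a robot by an
# expert or a handcrafted controller (shapes are assumptions).
rng = np.random.default_rng(0)
observations = rng.standard_normal((10_000, 24)).astype(np.float32)
actions = rng.uniform(-1.0, 1.0, (10_000, 9)).astype(np.float32)

policy = nn.Sequential(
    nn.Linear(24, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 9), nn.Tanh(),  # actions assumed normalized to [-1, 1]
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs_t = torch.from_numpy(observations)
act_t = torch.from_numpy(actions)

for step in range(1_000):
    idx = torch.randint(0, len(obs_t), (256,))  # minibatch of logged data
    loss = nn.functional.mse_loss(policy(obs_t[idx]), act_t[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained policy would then be evaluated on the real system (or in
# simulation); training itself never queries the environment.
```

Full offline RL methods extend this recipe with value estimation and mechanisms that keep the learned policy close to the data distribution, which is what the algorithms benchmarked in this paper address.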

