BENCHMARKING OFFLINE REINFORCEMENT LEARNING ON REAL-ROBOT HARDWARE

Abstract

Learning policies from previously recorded data is a promising direction for real-world robotics tasks, as online learning is often infeasible. Dexterous manipulation in particular remains an open problem in its general form. The combination of offline reinforcement learning with large, diverse datasets, however, has the potential to lead to a breakthrough in this challenging domain, analogous to the rapid progress made in supervised learning in recent years. To coordinate the efforts of the research community toward tackling this problem, we propose a benchmark including: i) a large collection of data for offline learning from a dexterous manipulation platform on two tasks, obtained with capable RL agents trained in simulation; ii) the option to execute learned policies on a real-world robotic system and in a simulated environment for efficient debugging. We evaluate prominent open-source offline reinforcement learning algorithms on the datasets and provide a reproducible experimental setup for offline reinforcement learning on real systems. Visit https://sites.google.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton et al., 1998) holds great potential for robotic manipulation and other real-world decision-making problems, as it can solve tasks autonomously by learning from interactions with the environment. When data can be collected during learning, RL in combination with high-capacity function approximators can solve challenging high-dimensional problems (Mnih et al., 2015; Lillicrap et al., 2016; Silver et al., 2017; Berner et al., 2019). However, in many cases online learning is not feasible because collecting a large amount of experience with a partially trained policy is either prohibitively expensive or unsafe (Dulac-Arnold et al., 2020). Examples include autonomous driving, where suboptimal policies can lead to accidents; robotic applications, where the hardware is likely to get damaged without additional safety mechanisms; and collaborative robotic scenarios, where humans are at risk of being harmed. Offline reinforcement learning (offline RL or batch RL) (Lange et al., 2012) tackles this problem by learning a policy from prerecorded data generated by experts or by handcrafted controllers that respect the system's constraints. Independently of how the data is collected, it is essential to make the best possible use of it and to design algorithms whose performance improves as more data becomes available. Scaling with data in this way has led to unexpected generalization in computer vision (Krizhevsky et al., 2012; He et al., 2016; Redmon et al., 2016) and natural language tasks (Devlin et al., 2018; Floridi & Chiriatti, 2020) when massive datasets are employed. With the motivation to learn similarly capable decision-making systems from data, the field of offline RL has gained considerable attention. Progress is currently measured by benchmarking algorithms on simulated domains, both in terms of data collection and evaluation.
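As a minimal illustration of the offline setting described above (not part of the proposed benchmark), its simplest instantiation is behavior cloning: fitting a policy to a fixed dataset of observation-action pairs with no further environment interaction. The sketch below uses a synthetic linear dataset; the dimensions and names are hypothetical and do not reflect the actual TriFinger datasets.

```python
import numpy as np

# Hypothetical offline dataset of expert transitions
# (shapes chosen for illustration only; the real datasets differ).
rng = np.random.default_rng(0)
obs = rng.normal(size=(1000, 8))   # 1000 transitions, 8-dim observations
true_w = rng.normal(size=(8, 3))   # unknown expert mapping
actions = obs @ true_w             # 3-dim expert actions

def behavior_cloning(obs, actions):
    """Fit a linear policy a = o @ W by least squares on the fixed dataset."""
    w, *_ = np.linalg.lstsq(obs, actions, rcond=None)
    return w

w_hat = behavior_cloning(obs, actions)
policy = lambda o: o @ w_hat       # learned policy, queried only at deployment

# On in-distribution data the cloned policy matches the expert closely;
# offline RL methods go beyond this by also exploiting reward information.
err = np.abs(policy(obs) - actions).max()
```

Note that this baseline can only imitate the data-collection policy; the offline RL algorithms benchmarked in this work additionally use rewards to improve upon it.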
Since real-world data differs from simulated data, it is important to put offline RL algorithms to the test on real systems. We propose challenging robotic manipulation datasets recorded on real robots for two tasks: object pushing and object lifting with reorientation on the TriFinger platform (Wüthrich et al., 2021). To study the differences between real and simulated environments, we also provide datasets collected in simulation. Our benchmark of state-of-the-art offline RL algorithms on these datasets reveals that they are able to solve the moderately difficult pushing task, while their performance on the more challenging lifting task leaves room for improvement. In particular, the gap between the performance of the expert policy and offline-learned policies is much larger on the real system than in simulation. This underscores the importance of real-world benchmarks for offline RL. We furthermore study the impact of adding suboptimal trajectories to expert data and find that all algorithms are 'distracted' by them, i.e., their success rate drops significantly. This identifies an important open challenge for the offline RL community: robustness to suboptimal trajectories. Importantly, a cluster of TriFinger robots is set up for the evaluation of offline-learned policies; remote access to it can be requested for research purposes. With our dataset and evaluation platform, we therefore aim to provide a breeding ground for future offline RL algorithms.
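The mixed-data setting described above can be sketched as follows; the trajectory format, return ranges, and mixing proportions here are hypothetical stand-ins for the actual datasets, intended only to show how suboptimal rollouts enter the training data.

```python
import random

import numpy as np

rng = np.random.default_rng(42)

def make_trajectories(n, return_range):
    """Hypothetical rollouts: each holds transitions plus its episode return."""
    low, high = return_range
    return [{"return": float(r), "transitions": rng.normal(size=(10, 8))}
            for r in rng.uniform(low, high, size=n)]

expert = make_trajectories(80, (0.8, 1.0))      # capable RL agent
suboptimal = make_trajectories(20, (0.0, 0.3))  # e.g. a half-trained policy

# A mixed dataset simply concatenates both sources and shuffles episodes;
# the offline learner cannot tell expert from suboptimal behavior a priori.
mixed = expert + suboptimal
random.Random(0).shuffle(mixed)

mean_return = float(np.mean([t["return"] for t in mixed]))
```

A robust offline RL algorithm should recover close-to-expert performance from such a mixture rather than averaging over both behavior modes.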

2. THE TRIFINGER PLATFORM

We use a robot cluster that was initially developed and built for the Real Robot Challenge in 2020 and 2021 (Bauer et al., 2022). The robots that constitute the cluster are an industrial-grade adaptation of a robotic platform called TriFinger, an open-source hardware and software design introduced by Wüthrich et al. (2021); see Fig. 1. Each robot has three arms with 3 degrees of freedom (DoF) each, mounted in a radially symmetric arrangement 120 degrees apart. The arms are actuated by brushless outrunner motors with a 1:9 belt drive, yielding high agility, low friction, and good force feedback (details on the actuator modules can be found in Grimminger et al. (2020)). Pressure sensors inside the elastic fingertips provide basic tactile feedback. The working area, where objects can be manipulated, is enclosed by a high barrier to ensure that the object stays inside the arena even during aggressive motions. This is essential for operation without human supervision. The robot sits inside a closed housing that is well lit by top-mounted LED panels, making the images taken by three high-speed global-shutter cameras consistent. The cameras are distributed between the arms to ensure that objects are always seen by at least one camera. We study dexterous manipulation of a cube whose pose is estimated by a visual tracking system at 10 Hz.



Figure 1: The TriFinger manipulation platform (Wüthrich et al., 2021; Bauer et al., 2022). Left: The robot has 3 arms with 3 DoF each. The cube is constrained by a bowl-shaped arena, allowing for unattended data collection. Right: A cluster of these robots for parallel data collection and evaluation.

