PRE-TRAINING FOR ROBOTS: LEVERAGING DIVERSE MULTITASK DATA VIA OFFLINE RL

Abstract

Recent progress in deep learning highlights the tremendous potential of utilizing diverse datasets for achieving effective generalization, making it enticing to consider leveraging broad datasets for attaining more robust generalization in robotic learning as well. However, in practice, we will likely want to learn a new skill in a new environment that is unlikely to be contained in the prior data. Therefore, we ask: how can we leverage existing diverse offline datasets in combination with small amounts of task-specific data to solve new tasks, while still enjoying the generalization benefits of training on large amounts of data? In this paper, we demonstrate that end-to-end offline RL can be an effective approach for doing this, without the need for any representation learning or vision-based pre-training. We present pre-training for robots (PTR), a framework based on offline RL that attempts to effectively learn new tasks by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task, with as few as 10 demonstrations. At its core, PTR applies an existing offline RL method, such as conservative Q-learning (CQL), but extends it to include several crucial design decisions that enable PTR to actually work and outperform a variety of prior methods. To the best of our knowledge, PTR is the first offline RL method that succeeds at learning new tasks in a new domain on a real WidowX robot with as few as 10 task demonstrations, by effectively leveraging an existing dataset of diverse multi-task robot data collected in a variety of toy kitchens.

1. INTRODUCTION

Robotic learning methods based on reinforcement learning (RL) or imitation learning (IL) have led to a number of impressive results (Levine et al., 2016; Kalashnikov et al., 2018; Young et al., 2020; Kalashnikov et al., 2021; Ahn et al., 2022), but the generalization abilities of policies learned in this way are typically limited by the quantity and breadth of the data available to train them. In practice, the cost of real-world data collection for each task means that such methods often use smaller datasets, which leads to more limited generalization. A natural way to circumvent this limitation is to incorporate existing diverse robotic datasets into the training pipeline of a robot learning algorithm, analogously to how pre-training on diverse prior datasets has enabled rapid fine-tuning in supervised learning fields, such as computer vision and NLP. But how can we devise algorithms that enable effective pre-training for robotic RL? In most cases, answering this question requires a method that can pre-train on existing data from a wide range of tasks and domains, and then provide a good starting point for efficiently learning a new task in a new domain. Prior approaches utilize such existing data by running imitation learning (Young et al., 2020; Ebert et al., 2021; Shafiullah et al., 2022) or by using representation learning (Nair et al., 2022) methods for pre-training and then fine-tuning with imitation learning. However, this may not necessarily lead to representations that can reason about the consequences of actions. In contrast, end-to-end RL can offer a more general paradigm that can be effective for both pre-training and fine-tuning, and is applicable even when assumptions in prior work are violated. Therefore, we ask: can we devise a simple and unified framework where both the pre-training and fine-tuning process uses RL?
Doing so presents significant challenges: leveraging large amounts of offline multi-task data requires high-capacity models, and training such models with offline RL can be very difficult (Bjorck et al., 2021). In this paper, we show that multi-task offline RL pre-training on diverse multi-task demonstration data, followed by offline RL fine-tuning on a very small number of trajectories (as few as 10 and at most 15 trials), can indeed be made into an effective robotic learning strategy that in practice can significantly outperform methods based on imitation learning pre-training as well as RL-based methods that do not employ pre-training. This is surprising and significant, since prior work (Mandlekar et al., 2021) has claimed that IL methods are superior to offline RL when provided with human demonstrations. Our framework, which we call PTR (pre-training for robots), is based on the CQL algorithm (Kumar et al., 2020), but introduces a number of design decisions that we show are critical for good performance and enable large-scale pre-training. These choices include a specific choice of architecture that provides high capacity while preserving spatial information, the use of group normalization, and an approach for feeding actions into the model that ensures that actions are used properly for value prediction. We experimentally validate these design decisions and show that PTR benefits from increasing the network capacity, even with large ResNet-50 architectures, which have not previously been shown to work with offline RL. Our experiments utilize the bridge dataset (Ebert et al., 2021), an extensive previously collected dataset consisting of thousands of trials for a very large number of robotic manipulation tasks in multiple environments. A schematic of our framework is shown in Figure 1.
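For reference, the conservative Q-learning objective that PTR builds on augments the standard Bellman error with a regularizer that penalizes Q-values on out-of-distribution actions. In standard CQL notation (following Kumar et al., 2020; the symbols $\alpha$, $\hat{\pi}_\beta$, and $\hat{\mathcal{B}}^{\pi}\bar{Q}$ denote the regularizer weight, the behavior policy, and the Bellman backup under a target network, and are not defined in this excerpt):

```latex
\min_{Q}\;\; \alpha \,\mathbb{E}_{s \sim \mathcal{D}}\!\left[\log \sum_{a} \exp Q(s, a)
  - \mathbb{E}_{a \sim \hat{\pi}_\beta(a \mid s)}\big[Q(s, a)\big]\right]
  + \frac{1}{2}\,\mathbb{E}_{(s, a, s') \sim \mathcal{D}}\!\left[\Big(Q(s, a)
  - \hat{\mathcal{B}}^{\pi} \bar{Q}(s, a)\Big)^{2}\right]
```

The first term pushes down Q-values on actions sampled broadly (via the log-sum-exp) while pushing them up on actions in the dataset; the second term is the usual TD error on the offline data.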
The main contribution of this work is a demonstration that PTR can enable offline RL pre-training on diverse real-world robotic data, and that these pre-trained policies can be fine-tuned to learn new tasks with just 10-15 demonstrations. This is a significant improvement over prior RL-based pre-training and fine-tuning methods, which typically require hundreds or even thousands of trials (Singh et al., 2020; Kalashnikov et al., 2021; Julian et al., 2020; Chebotar et al., 2021; Lee et al., 2022a). We present a detailed analysis of the design decisions that enable offline RL to provide an effective pre-training framework, and show empirically that these design decisions are crucial for good performance. Although the individual components that constitute PTR are based closely on prior work, we show that the novel combination of these components in PTR is important to make offline RL into a viable pre-training tool that can outperform other pre-training approaches and other RL-based policy learning strategies.

2. RELATED WORK

A number of prior works have proposed algorithms for offline RL (Fujimoto et al., 2018; Kumar et al., 2019; 2020; Kostrikov et al., 2021a;b; Wu et al., 2019; Jaques et al., 2019; Fujimoto & Gu, 2021; Siegel et al., 2020). In particular, many prior works study offline RL with multi-task data and devise techniques that perform parameter sharing (Wilson et al., 2007; Parisotto et al., 2015; Teh et al., 2017; Espeholt et al., 2018; Hessel et al., 2019) or data sharing and relabeling (Yu et al., 2021; Andrychowicz et al., 2017; Yu et al., 2022; Kalashnikov et al., 2021; Xie & Finn, 2021). In this paper, our goal is not to develop new offline RL algorithms, but to show that existing offline RL algorithms can be an effective tool for pre-training from prior data and then fine-tuning to new tasks, and we illustrate the design decisions required to get such methods to work well. Unlike methods for fine-tuning from a learned initialization (Nair et al., 2020; Kostrikov et al., 2021b; Lee et al., 2022c), which typically perform online interaction, we consider a setting where we do not use any online interaction and do not require access to a reward function. This resembles the problem setting considered by offline meta-RL methods (Li et al., 2019; Dorfman & Tamar, 2020; Mitchell et al., 2021; Pong et al., 2021; Lin et al., 2022). However, our approach is simpler, as we fine-tune the very same offline RL algorithm, and our method, PTR, outperforms the meta-RL method from Mitchell et al. (2021). Xu et al. (2022) present a prompting-based few-shot adaptation approach based on decision transformers that is also related to our work.



Figure 1: Overview of PTR: We first perform general offline pre-training on diverse multi-task robot data and subsequently fine-tune on one or several target tasks, mixing batches between the prior data and the target dataset with a batch mixing ratio of τ.
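The batch mixing described in Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation; in particular, it assumes that τ denotes the fraction of each fine-tuning batch drawn from the small target-task dataset (the caption only states that batches are mixed with ratio τ), and the function name `mix_batch` is our own.

```python
import numpy as np

def mix_batch(prior_data, target_data, batch_size, tau, rng):
    """Assemble a fine-tuning batch from two datasets.

    Assumption (not stated in the excerpt): `tau` is the fraction of the
    batch sampled from the small target-task dataset; the remaining
    (1 - tau) fraction is sampled from the large prior dataset.
    """
    n_target = int(round(tau * batch_size))
    n_prior = batch_size - n_target
    # Sample with replacement, since the target dataset may hold as few
    # as 10-15 demonstrations, far fewer than a typical batch size.
    target_idx = rng.integers(0, len(target_data), size=n_target)
    prior_idx = rng.integers(0, len(prior_data), size=n_prior)
    return np.concatenate([np.asarray(target_data)[target_idx],
                           np.asarray(prior_data)[prior_idx]])
```

Oversampling the target task this way keeps the new task's gradient signal from being drowned out by the much larger prior dataset, while the prior data continues to regularize the fine-tuned Q-function.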

