MULTI-ENVIRONMENT PRETRAINING ENABLES TRANSFER TO ACTION LIMITED DATASETS

Abstract

Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions; for example, videos of gameplay are far more abundant than sequences of frames paired with the logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a target environment of interest with fully-annotated datasets from various other source environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining yields substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that it significantly improves game performance and generalization capability compared to other approaches, even when using annotated datasets equivalent to only 12 minutes of gameplay.

1. INTRODUCTION

The training of large-scale models on large and diverse data has become a standard approach in natural language and computer vision applications (Devlin et al., 2019; Brown et al., 2020; Mahajan et al., 2018; Zhai et al., 2021). Recently, a number of works have shown that a similar approach can be applied to tasks more often tackled by reinforcement learning (RL), such as robotics and game-playing. For example, Reed et al. (2022) suggest combining large datasets of expert behavior from a variety of RL domains in order to train a single generalist agent, while Lee et al. (2022) demonstrate a similar result using non-expert (offline RL) data from a suite of Atari game-playing environments and a decision transformer (DT) sequence modeling objective (Chen et al., 2021b).

Applying large-scale training necessarily relies on the ability to gather sufficiently large and diverse datasets. For RL domains, this can be a challenge, as the most easily available data, such as videos of a human playing a video game or completing a predefined task, often does not contain labelled actions, i.e., game controls or robot joint controls. We call such datasets action limited, because little or none of the dataset is annotated with action information. Transferring the success of approaches like DT to such tasks is therefore bottlenecked by the ability to acquire action labels, which can be expensive and time-consuming (Zolna et al., 2020).

Some recent works have explored approaches to mitigate the issue of action limited datasets. For example, Video PreTraining (VPT) (Baker et al., 2022) gathers a small amount (2k hours of video) of manually labelled data, which is used to train an inverse dynamics model (IDM) (Nguyen-Tuong et al., 2008); the IDM is then used to provide action labels on a much larger video-only dataset (70k hours). This method is shown to achieve human-level performance in Minecraft.
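To make the notion of inverse dynamics modelling concrete, the following is a minimal toy sketch: in a deterministic one-dimensional chain environment, the action that produced a transition can be recovered exactly from two consecutive states. This is purely illustrative; real IDMs such as the one in VPT learn this mapping with a neural network over video frames, and all names below are our own.

```python
def step(state: int, action: int) -> int:
    """Deterministic 1-D chain dynamics: action -1 moves left, +1 moves right."""
    return state + action

def inverse_dynamics(state: int, next_state: int) -> int:
    """Infer the action that produced a (state, next_state) transition."""
    return next_state - state

# Label an unannotated state sequence with inferred actions, as an IDM
# would label action-free video frames.
states = [0, 1, 2, 1, 2, 3]
inferred = [inverse_dynamics(s, s2) for s, s2 in zip(states, states[1:])]
print(inferred)  # [1, 1, -1, 1, 1]
```

The key point is that recovering the action given both the preceding and following state is a much easier problem than predicting which action an agent *should* take given only the past.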
While VPT shows promising results, it still requires over 2k hours of manually labelled data; thus, a similar amount of expensive labelling is potentially necessary to extend VPT to other environments. In this paper, we propose an orthogonal but related approach to VPT: leveraging a large set of labeled data from various source domains to learn an agent policy on a limited action dataset of a target evaluation environment.

To tackle this setting, we propose Action Limited PreTraining (ALPT), which relies on the hypothesis that shared structure between environments can be exploited by non-causal (i.e., bidirectional) transformer IDMs, which can attend to both past and future frames to infer actions. Because environment dynamics are typically far simpler than the multifaceted human behavior exhibited in the same environment, IDMs have been argued, and empirically shown, to be more data efficient than behavior models (Baker et al., 2022). ALPT thus uses the multi-environment source datasets to pretrain an IDM, which is then finetuned on the action-limited data of the target environment in order to provide labels for the unlabelled target data; the relabelled data is then used to train a DT agent.

Through various experiments and ablations, we demonstrate that leveraging the generalization capabilities of IDMs is critical to the success of ALPT, as opposed to, for example, pretraining the DT model alone on the multi-environment datasets or training the IDM only on the target environment. On a benchmark game-playing environment, we show that ALPT yields as much as a 5x improvement in performance with as few as 10k labelled samples (i.e., 0.01% of the original labels), derived from only 12 minutes of labelled gameplay (Ye et al., 2021). We show that these benefits hold even when the source and target environments use distinct action spaces, i.e., the environments share similar states but no common actions, further demonstrating the power of IDM pretraining.
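The four-stage recipe described above (pretrain IDM on source data, finetune on the small labelled target slice, pseudo-label the unlabelled target data, train a DT) can be summarized as a short sketch. All function and variable names here are illustrative placeholders, not the authors' actual implementation; the toy instantiation at the bottom plugs in trivial stubs purely to show the data flow.

```python
def alpt(source_datasets, target_labeled, target_unlabeled,
         pretrain_idm, finetune_idm, train_dt):
    """Hypothetical sketch of the ALPT pipeline (names are our own)."""
    # 1. Pretrain a bidirectional IDM on fully-labelled source environments.
    idm = pretrain_idm(source_datasets)
    # 2. Finetune the IDM on the small labelled slice of the target env.
    idm = finetune_idm(idm, target_labeled)
    # 3. Pseudo-label the unlabelled target trajectories with the IDM.
    pseudo_labeled = [
        [(s, idm(s, s_next)) for s, s_next in zip(traj, traj[1:])]
        for traj in target_unlabeled
    ]
    # 4. Train the decision transformer on labelled + pseudo-labelled data.
    return train_dt(target_labeled + pseudo_labeled)

# Toy instantiation with deterministic 1-D chain dynamics, where the
# "learned" IDM is just the state difference and train_dt returns its data.
idm_fn = lambda s, s2: s2 - s
policy = alpt(
    source_datasets=[],            # unused by the stubs below
    target_labeled=[(0, 1)],       # one labelled (state, action) pair
    target_unlabeled=[[0, 1, 2]],  # one unlabelled state trajectory
    pretrain_idm=lambda data: idm_fn,
    finetune_idm=lambda idm, data: idm,
    train_dt=lambda data: data,
)
print(policy)  # [(0, 1), [(0, 1), (1, 1)]]
```

Note that only stages 1 and 2 ever touch action labels; the DT in stage 4 is trained as if the full target dataset were annotated.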
While ALPT is, algorithmically, a straightforward application of existing offline RL approaches, our results provide a new perspective on large-scale training for RL. Namely, our results suggest that the most efficient path to large-scale RL methods may be via generalist inverse dynamics modelling paired with specialized agent finetuning, instead of generalist agent training alone.

2. RELATED WORK

In this section, we briefly review relevant works in multi-task RL, meta-learning for RL, semi-supervised learning, and transfer learning.

Multi-Task RL. It is commonly assumed that similar tasks share similar structure and properties (Caruana, 1997; Ruder, 2017; Zhang et al., 2014; Radford et al., 2019a). Many multi-task RL works leverage this assumption by learning a shared low-dimensional representation across all tasks (Calandriello et al., 2014; Borsa et al., 2016; D'Eramo et al., 2020). These methods have also been extended to tasks whose action spaces do not align completely (Bräm et al., 2020). Other methods assume a universal dynamics model when the reward structure is shared but the dynamics are not (Zhang et al., 2021a). Multi-task RL has generally relied on a task identifier (ID) to provide contextual information, but recent methods have explored using additional side information available in the task meta-data to establish a richer context (Sodhani et al., 2021). ALPT can be seen as multi-task RL, given that we train both the sequence model and the IDM on multiple different environments, but we do not explicitly model context information or have access to task IDs.

Meta RL. Meta-learning is a family of approaches for learning to learn that leverages a set of meta-training tasks (Schmidhuber, 1987; Bengio et al., 1991), from which an agent can learn either parts of the learning algorithm (e.g., how to tune the learning rate) or the entire algorithm (Lacombe et al., 2021; Kalousis, 2002). In this setting, meta-learning can be used to learn policies (Duan et al., 2017; Finn et al., 2017) or dynamics models (Clavera et al., 2019). A distribution of tasks is assumed to be available for sampling, in order to provide additional contextual information to the policy. One such method models contextual information as probabilistic context variables which condition



Figure 1: The dynamics-model pretraining procedure of ALPT, using the source set of environments along with the limited-action target-environment dataset.

