MULTI-ENVIRONMENT PRETRAINING ENABLES TRANSFER TO ACTION LIMITED DATASETS

Abstract

Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions; for example, videos of gameplay are much more available than sequences of frames paired with the logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a target environment of interest with fully-annotated datasets from various other source environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that it significantly improves game performance and generalization capability compared to other approaches, even when using annotated datasets equivalent to only 12 minutes of gameplay.

1. INTRODUCTION

The training of large-scale models on large and diverse data has become a standard approach in natural language and computer vision applications (Devlin et al., 2019; Brown et al., 2020; Mahajan et al., 2018; Zhai et al., 2021). Recently, a number of works have shown that a similar approach can be applied to tasks more often tackled by reinforcement learning (RL), such as robotics and game-playing. For example, Reed et al. (2022) suggest combining large datasets of expert behavior from a variety of RL domains in order to train a single generalist agent, while Lee et al. (2022) demonstrate a similar result using non-expert (offline RL) data from a suite of Atari game-playing environments and a decision transformer (DT) sequence modeling objective (Chen et al., 2021b).

Applying large-scale training necessarily relies on the ability to gather sufficiently large and diverse datasets. For RL domains, this can be a challenge, as the most easily available data (for example, videos of a human playing a video game or completing a predefined task) often does not contain labelled actions, i.e., game controls or robot joint controls. We call such datasets action limited, because little or none of the dataset is annotated with action information. Transferring the success of approaches like DT to such tasks is therefore bottlenecked by the ability to acquire action labels, which can be expensive and time-consuming (Zolna et al., 2020).

Some recent works have explored approaches to mitigate the issue of action limited datasets. For example, Video PreTraining (VPT) (Baker et al., 2022) proposes manually gathering a small amount (2k hours of video) of labelled data, which is used to train an inverse dynamics model (IDM) (Nguyen-Tuong et al., 2008); the IDM is then used to provide action labels on a much larger video-only dataset (70k hours). This method is shown to achieve human-level performance in Minecraft.
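The IDM pseudo-labelling loop described above can be illustrated with a toy sketch. The helper names (`train_idm`, `pseudo_label`) and the nearest-centroid classifier over state deltas are illustrative assumptions, not the pixel-based neural IDM used by VPT; the sketch only shows the pipeline structure: fit an inverse dynamics model on a small labelled dataset, then use it to annotate a larger action-free trajectory.

```python
import numpy as np

def train_idm(states, actions):
    """Toy inverse dynamics model: for each action, store the mean observed
    state delta (s_{t+1} - s_t) from the labelled transitions."""
    deltas = states[1:] - states[:-1]
    return {a: deltas[actions == a].mean(axis=0) for a in np.unique(actions)}

def pseudo_label(idm, states):
    """Annotate an action-free trajectory by matching each transition's state
    delta to the nearest action centroid (the IDM inference step)."""
    deltas = states[1:] - states[:-1]
    acts = list(idm)
    cents = np.stack([idm[a] for a in acts])
    dists = np.linalg.norm(deltas[:, None, :] - cents[None, :, :], axis=-1)
    return np.array([acts[i] for i in dists.argmin(axis=1)])

# Small labelled dataset: 1-D states, action 0 moves -1, action 1 moves +1.
rng = np.random.default_rng(0)
acts_lab = rng.integers(0, 2, size=50)
states_lab = np.concatenate([[0.0], np.cumsum(2 * acts_lab - 1)]).reshape(-1, 1)
idm = train_idm(states_lab, acts_lab)

# Much larger unlabelled trajectory from the same dynamics, pseudo-labelled.
acts_true = rng.integers(0, 2, size=500)
states_unlab = np.concatenate([[0.0], np.cumsum(2 * acts_true - 1)]).reshape(-1, 1)
acts_hat = pseudo_label(idm, states_unlab)
print((acts_hat == acts_true).mean())  # 1.0 in this noiseless toy setting
```

In this deterministic setting the IDM recovers every action exactly; the point of VPT (and of ALPT's multi-environment pretraining) is that a sufficiently general IDM can do this approximately from raw video, so the recovered labels can feed a downstream policy-learning objective.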
While VPT shows promising results, it still requires over 2k hours of manually-labelled data; thus, a similar amount of expensive labelling is potentially necessary to extend VPT to other environments. In this paper, we propose an orthogonal but related approach to VPT: leveraging a large set of labelled data from various source domains to learn an agent policy on an action-limited dataset of a target evaluation environment. To tackle this setting, we propose Action Limited Pretraining (ALPT),

