BECOME A PROFICIENT PLAYER WITH LIMITED DATA THROUGH WATCHING PURE VIDEOS

Abstract

Recently, reinforcement learning (RL) has shown strong ability on visually complex tasks. However, it suffers from low sample efficiency and poor generalization, which prevent RL from being useful in real-world scenarios. Inspired by the huge success of unsupervised pre-training methods in the language and vision domains, we propose to improve sample efficiency via a novel pre-training method for model-based RL. Instead of using pre-recorded agent trajectories that come with their own actions, we consider the setting where the pre-training data are action-free videos, which are more common and more readily available in the real world. We introduce a two-phase training pipeline: in the pre-training phase, we implicitly extract hidden action embeddings from videos and pre-train the visual representation and the environment dynamics network through a novel forward-inverse cycle consistency (FICC) objective based on vector quantization; for downstream tasks, we fine-tune with a small amount of task data based on the learned models. Our framework significantly improves sample efficiency on Atari games with data from only one hour of game playing. We achieve 118.4% mean human performance and 36.0% median performance with only 50k environment steps, which is 85.6% and 65.1% better than the EfficientZero model trained from scratch. We believe such a pre-training approach can provide an option for solving real-world RL problems. The code is available at https://github.com/YeWR/FICC.git.

1. INTRODUCTION

Recently, deep reinforcement learning algorithms have achieved great success on various tasks, including simulated games, robotic manipulation, protein structure analysis, and even controlling nuclear fusion (Schrittwieser et al., 2020; Jumper et al., 2021; Degrave et al., 2022). However, the success of these RL algorithms rests on huge amounts of data. For example, AlphaZero requires over 20 million games of Go (Silver et al., 2017), and Liu et al. (2021) use millions of samples for playing simulated humanoid football. In real applications or complex tasks, it is impossible to acquire such amounts of data through interactions with the environment.

To maintain strong performance while requiring much less data, some researchers propose model-based reinforcement learning (MBRL) algorithms. These build world models of the environment to assist planning and thereby increase sample efficiency, and experiments have demonstrated the high sample efficiency of MBRL (Kaiser et al., 2019; Hafner et al., 2019). This shows great potential for handling sequential decision-making problems in complex simulated environments and the real world (Hafner et al., 2020; Ye et al., 2021).

Although MBRL has improved sample efficiency considerably, it still requires a non-trivial number of interactions to finish each task (Moerland et al., 2020; Schrittwieser et al., 2020), and each task is learned from scratch, which makes quick deployment on different downstream tasks difficult. Consequently, a good approach to improving this is to pre-train the world model on some data first. However, datasets equipped with actions are hard to obtain at scale: collecting a large-scale, high-quality dataset with actions requires a good policy in the first place, which becomes a chicken-and-egg problem. Instead, pure videos without action labels are more accessible and affordable in the real world.
There is a huge amount of video data without action labels on the Internet. Thus, in this paper, we study how to pre-train world models for MBRL from action-free videos. We propose to pre-train a latent dynamics model based on inverse latent action prediction from pure videos without any action labels. We introduce a novel cycle consistency loss by chaining the forward dynamics and the inverse dynamics, as shown in Figure 1. This loss pre-trains both the forward models and the inverse models of visual MBRL algorithms. Afterward, we fine-tune on downstream tasks based on the pre-trained models. Experiments show that our method builds sound pre-trained representation and dynamics models for downstream tasks: we achieve 118.4% mean human performance and 36.0% median performance with only 50k environment steps, which is 85.6% and 65.1% better than the EfficientZero model trained from scratch. Our contributions are the following:

• We systematically study the problem of pre-training from action-free videos for model-based RL, which could be the foundation for future sample-efficient, robust, and few-shot generalizable robots in the physical world.

• We propose a forward-inverse dynamics cycle consistency pre-training method that jointly infers latent actions from video and trains the representation function as well as the dynamics function. We also propose a practical fine-tuning scheme that achieves high performance on many downstream tasks.

• Our framework achieves state-of-the-art results on Atari games with 60 minutes of gameplay data and significantly outperforms other methods. Experiments show that a model pre-trained jointly on data from distinct environments can be fine-tuned well to the corresponding environments without re-pre-training.
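To make the forward-inverse cycle concrete, the following is a minimal PyTorch sketch of the idea: an inverse dynamics network infers a latent action from a pair of consecutive latent states, the action is vector-quantized against a learned codebook, and a forward dynamics network must reconstruct the next latent state from the current state and the quantized action. All module architectures, dimensions, and loss weights here are illustrative placeholders, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FICCSketch(nn.Module):
    """Illustrative forward-inverse cycle consistency model over action-free
    observation pairs. Architectures and sizes are hypothetical."""

    def __init__(self, obs_dim=64, state_dim=32, n_actions=16):
        super().__init__()
        # representation: observation -> latent state
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, state_dim))
        # inverse dynamics: (s_t, s_{t+1}) -> continuous action embedding
        self.inverse = nn.Sequential(
            nn.Linear(2 * state_dim, 128), nn.ReLU(), nn.Linear(128, state_dim))
        # forward dynamics: (s_t, quantized action) -> predicted s_{t+1}
        self.forward_dyn = nn.Sequential(
            nn.Linear(2 * state_dim, 128), nn.ReLU(), nn.Linear(128, state_dim))
        # codebook of discrete latent actions for vector quantization
        self.codebook = nn.Embedding(n_actions, state_dim)

    def quantize(self, z):
        # nearest codebook entry; straight-through estimator keeps gradients
        # flowing into the inverse dynamics network
        dist = torch.cdist(z, self.codebook.weight)       # (B, n_actions)
        idx = dist.argmin(dim=-1)
        q = self.codebook(idx)                            # (B, state_dim)
        return z + (q - z).detach(), z, q

    def loss(self, obs_t, obs_t1):
        s_t, s_t1 = self.encoder(obs_t), self.encoder(obs_t1)
        a_cont = self.inverse(torch.cat([s_t, s_t1], dim=-1))
        a_st, z, q = self.quantize(a_cont)
        s_t1_pred = self.forward_dyn(torch.cat([s_t, a_st], dim=-1))
        # cycle consistency: forward(s_t, inverse(s_t, s_{t+1})) ~= s_{t+1}
        cycle = F.mse_loss(s_t1_pred, s_t1.detach())
        # standard VQ codebook + commitment terms
        vq = F.mse_loss(q, z.detach()) + 0.25 * F.mse_loss(z, q.detach())
        return cycle + vq
```

Training then amounts to minimizing `FICCSketch().loss(obs_t, obs_t1)` over consecutive video frames; no action labels ever enter the objective, since the "action" is whatever discrete code best explains the observed transition.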

2. RELATED WORKS

Unsupervised Pre-training in NLP and CV In recent work, researchers have found that language models pre-trained with unsupervised learning can generalize quickly and well to downstream language tasks (Devlin et al., 2018; Yang et al., 2019). Some studies show that a pre-trained model can be a good few-shot learner (Brown et al., 2020) or multi-task learner (Radford et al., 2019). More importantly, this two-stage training procedure has become standard for large models in NLP, such as Transformers (Vaswani et al., 2017; Brown et al., 2020). In computer vision, similar unsupervised pre-training methods build sound representation models for various visual tasks based on transformers (Li et al., 2019; Dosovitskiy et al., 2020). Contrastive learning and reconstruction are two common techniques for achieving these goals (He et al., 2020; Grill et al., 2020; He et al., 2022). Generally, all these methods aim to model a universal representation function that can be fine-tuned well for specific vision tasks, e.g., classification or detection.

Unsupervised Representation Learning in RL Inspired by the great success of unsupervised pre-training in NLP and CV, researchers have attempted to learn representations for visual RL in an unsupervised manner. Contrastive learning in online visual RL helps extract good latent states for robotic control tasks (Laskin et al., 2020; Schwarzer et al., 2020). Ye et al.



Figure 1: The forward-inverse cycle consistency builds an unsupervised training objective for model-based reinforcement learning from pure videos.

(2021) propose to improve the sample efficiency of model-based RL through temporal contrastive learning. Furthermore, Stooke et al. (2021) introduce a new unsupervised learning task to decouple representation learning from policy learning. Xiao et al. (2022) show that masked visual pre-training on real-world images improves motor control. Besides, some researchers attempt pre-training and fine-tuning for downstream RL tasks. Parisi et al. (2022) find that

