BECOME A PROFICIENT PLAYER WITH LIMITED DATA THROUGH WATCHING PURE VIDEOS

Abstract

Recently, reinforcement learning (RL) has shown strong performance on visually complex tasks. However, it suffers from low sample efficiency and poor generalization, which prevent it from being useful in real-world scenarios. Inspired by the huge success of unsupervised pre-training methods in the language and vision domains, we propose to improve sample efficiency via a novel pre-training method for model-based RL. Instead of using pre-recorded agent trajectories that come with their own actions, we consider the setting where the pre-training data are action-free videos, which are far more common and accessible in the real world. We introduce a two-phase training pipeline: in the pre-training phase, we implicitly extract hidden action embeddings from videos and pre-train the visual representation and the environment dynamics network through a novel forward-inverse cycle consistency (FICC) objective based on vector quantization; for downstream tasks, we finetune the learned models with a small amount of task data. Our framework significantly improves sample efficiency on Atari games using pre-training data equivalent to only one hour of game play. We achieve 118.4% mean human performance and 36.0% median performance with only 50k environment steps, which is 85.6% and 65.1% better than the EfficientZero model trained from scratch. We believe such a pre-training approach can provide an option for solving real-world RL problems.
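To make the forward-inverse cycle concrete, the sketch below runs one pass of the cycle on a toy transition: an inverse model infers a latent action from a pair of states, vector quantization snaps it to a discrete codebook entry, and a forward model predicts the next state from the state and the quantized action. All names (`inverse_model`, `quantize`, `forward_model`, `ficc_loss`), the linear stand-ins for the networks, and the dimensions are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the real models are learned visual encoders).
STATE_DIM, ACTION_DIM, CODEBOOK_SIZE = 8, 4, 16

# Codebook of discrete latent actions for vector quantization.
codebook = rng.normal(size=(CODEBOOK_SIZE, ACTION_DIM))

# Linear stand-ins for the inverse and forward dynamics networks.
W_inv = rng.normal(size=(2 * STATE_DIM, ACTION_DIM)) * 0.1
W_fwd = rng.normal(size=(STATE_DIM + ACTION_DIM, STATE_DIM)) * 0.1

def inverse_model(s_t, s_next):
    """Infer a continuous latent action from a state transition."""
    return np.concatenate([s_t, s_next]) @ W_inv

def quantize(z):
    """Snap a latent action to its nearest codebook entry (vector quantization)."""
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

def forward_model(s_t, a):
    """Predict the next state from the current state and a quantized latent action."""
    return np.concatenate([s_t, a]) @ W_fwd

def ficc_loss(s_t, s_next):
    # Inverse direction: transition -> discrete latent action code.
    z = inverse_model(s_t, s_next)
    idx, a = quantize(z)
    # Forward direction: state + latent action -> predicted next state.
    s_pred = forward_model(s_t, a)
    # Cycle check: the predicted transition should map back to the same code.
    idx_cycle, _ = quantize(inverse_model(s_t, s_pred))
    recon = float(np.mean((s_pred - s_next) ** 2))
    return recon, idx, idx_cycle

s_t, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
recon, idx, idx_cycle = ficc_loss(s_t, s_next)
print(f"reconstruction error: {recon:.4f}, code: {idx} -> cycle code: {idx_cycle}")
```

In actual training the reconstruction error and code consistency would be turned into losses that update the networks and the codebook; here the models are frozen random matrices, so the sketch only traces the data flow of one cycle.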

1. INTRODUCTION

Recently, deep reinforcement learning algorithms have achieved great success on various tasks, including simulated games, robotic manipulation, protein structure analysis, and even the control of nuclear fusion (Schrittwieser et al., 2020; Jumper et al., 2021; Degrave et al., 2022). However, the great success of these RL algorithms rests on huge amounts of data. For example, AlphaZero requires over 20 million games of Go (Silver et al., 2017), and Liu et al. (2021) use millions of samples to play simulated humanoid football. But in real applications or complex tasks, it is impossible to acquire such amounts of data through interactions with the environment. To keep strong performance while requiring much less data, some researchers propose to use model-based reinforcement learning (MBRL) algorithms. They build environmental world models in assis-



Figure 1: The forward-inverse cycle consistency builds an unsupervised training objective for model-based reinforcement learning from pure videos.

