UNSUPERVISED ACTIVE PRE-TRAINING FOR REINFORCEMENT LEARNING

Abstract

We introduce a new unsupervised pre-training method for reinforcement learning called APT, which stands for Active Pre-Training. APT learns a representation and a policy initialization by actively searching for novel states in reward-free environments. We use the contrastive learning framework to learn the representation from collected transitions. The key novel idea is to collect data during pre-training by maximizing a particle-based entropy computed in the learned latent representation space. By performing particle-based entropy maximization, we alleviate the need for challenging density modeling and are thus able to scale our approach to image observations. APT successfully learns meaningful representations as well as policy initializations without using any reward. We empirically evaluate APT on the Atari game suite and the DMControl suite by exposing task-specific rewards to the agent after a long unsupervised pre-training phase. On Atari games, APT achieves human-level performance on 12 games and obtains highly competitive performance compared to canonical fully supervised RL algorithms. On the DMControl suite, APT beats all baselines in terms of asymptotic performance and data efficiency and dramatically improves performance on tasks that are extremely difficult to train from scratch. Importantly, the pre-trained models can be fine-tuned to solve different tasks as long as the environment does not change. Finally, we also pre-train multi-environment encoders on data from multiple environments and show generalization to a broad set of RL tasks.
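The particle-based entropy maximization described above is commonly instantiated with a k-nearest-neighbor estimator: each latent state receives an intrinsic reward that grows with its distance to its k-th nearest neighbor among the other collected latents, so visiting sparsely populated regions of representation space is rewarded. The sketch below illustrates this idea; the function name, the choice of Euclidean distance, and the `log(1 + d)` shaping are illustrative assumptions, not necessarily APT's exact formulation.

```python
import numpy as np

def particle_entropy_reward(z, k=3):
    """Intrinsic reward from a particle-based entropy estimate.

    z : (n, d) array of latent representations of visited states.
    Each state's reward is the (log-shaped) distance to its k-th
    nearest neighbor: larger distance means a sparser, more novel
    region of latent space.  Illustrative sketch, not APT's exact form.
    """
    # Pairwise squared Euclidean distances in latent space.
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point's self-distance
    # Distance to the k-th nearest neighbor (index k-1 after partitioning).
    knn_dist = np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])
    return np.log(1.0 + knn_dist)  # log keeps the reward well-scaled

# Example: three clustered latents plus one outlier.
z = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
r = particle_entropy_reward(z, k=1)
# The outlier at (5, 5) sits in the sparsest region and gets the
# largest reward, so an entropy-maximizing policy is driven toward it.
```

Because the estimate depends only on distances between sampled "particles," no explicit density model of the (image) observation distribution is required, which is what lets the approach scale to pixel observations.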

1. INTRODUCTION

Deep reinforcement learning (RL) provides a general framework for solving challenging sequential decision-making problems and has achieved remarkable success in advancing the frontier of AI technologies thanks to scalable and efficient learning algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2015; 2017). These landmarks include outperforming humans in board games (Silver et al., 2016; 2018; Schrittwieser et al., 2019) and computer games (Mnih et al., 2015; Berner et al., 2019; Schrittwieser et al., 2019; Vinyals et al., 2019; Badia et al., 2020a), and solving complex robotic control tasks (Andrychowicz et al., 2017; Akkaya et al., 2019). Despite these successes, a key challenge of Deep RL is that it requires a huge number of interactions with the environment before it learns effective policies, and it must do so for each encountered task. Environments must also have carefully designed task-specific reward functions to guide the RL algorithms (Andrychowicz et al., 2017; Ng et al., 1999), which further limits the wide application of Deep RL. This is in contrast to how intelligent creatures learn in the absence of external supervisory signals, acquiring abilities in a task-agnostic manner by exploring the environment.

Unsupervised pre-training, a framework that trains models without expert supervision, has obtained promising results in computer vision (Oord et al., 2018; He et al., 2019; Chen et al., 2020b; Caron et al., 2020; Grill et al., 2020) and natural language modeling (Vaswani et al., 2017; Devlin et al., 2018; Peters et al., 2018; Brown et al., 2020). The key insight of unsupervised pre-training techniques is learning a good representation or initialization from a massive amount of unlabeled data, such as ImageNet (Deng et al., 2009), the Instagram image set (He et al., 2019), Wikipedia, and WebText (Radford et al., 2019), which are easier to collect and scale to millions or trillions of data points.
As a result, the learned representation, when fine-tuned on downstream tasks, can solve them efficiently with little or no supervision, often in a few-shot manner.

