UNSUPERVISED ACTIVE PRE-TRAINING FOR REINFORCEMENT LEARNING

Abstract

We introduce a new unsupervised pre-training method for reinforcement learning called APT, which stands for Active Pre-Training. APT learns a representation and a policy initialization by actively searching for novel states in reward-free environments. We use the contrastive learning framework to learn the representation from collected transitions. The key novel idea is to collect data during pre-training by maximizing a particle-based entropy computed in the learned latent representation space. By maximizing particle-based entropy, we alleviate the need for challenging density modeling and are thus able to scale our approach to image observations. APT successfully learns meaningful representations as well as policy initializations without using any reward. We empirically evaluate APT on the Atari game suite and the DMControl suite by exposing task-specific rewards to the agent only after a long unsupervised pre-training phase. On Atari games, APT achieves human-level performance on 12 games and obtains highly competitive performance compared to canonical fully supervised RL algorithms. On the DMControl suite, APT beats all baselines in terms of asymptotic performance and data efficiency, and dramatically improves performance on tasks that are extremely difficult to learn from scratch. Importantly, the pre-trained models can be fine-tuned to solve different tasks as long as the environment does not change. Finally, we also pre-train multi-environment encoders on data from multiple environments and show generalization to a broad set of RL tasks.

1. INTRODUCTION

Deep reinforcement learning (RL) provides a general framework for solving challenging sequential decision-making problems, and it has achieved remarkable success in advancing the frontier of AI thanks to scalable and efficient learning algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2015; 2017). These landmarks include outperforming humans in board games (Silver et al., 2016; 2018; Schrittwieser et al., 2019) and computer games (Mnih et al., 2015; Berner et al., 2019; Schrittwieser et al., 2019; Vinyals et al., 2019; Badia et al., 2020a), and solving complex robotic control tasks (Andrychowicz et al., 2017; Akkaya et al., 2019). Despite these successes, a key challenge of deep RL is that it requires a huge number of interactions with the environment before it learns effective policies, and it must do so for each encountered task. Environments must be equipped with carefully designed task-specific reward functions to guide the RL algorithm (Andrychowicz et al., 2017; Ng et al., 1999), which further limits the wide application of deep RL. This is in contrast to how intelligent creatures learn in the absence of external supervisory signals, acquiring abilities in a task-agnostic manner by exploring the environment. Unsupervised pre-training, a framework that trains models without expert supervision, has obtained promising results in computer vision (Oord et al., 2018; He et al., 2019; Chen et al., 2020b; Caron et al., 2020; Grill et al., 2020) and natural language modeling (Vaswani et al., 2017; Devlin et al., 2018; Peters et al., 2018; Brown et al., 2020). The key insight of unsupervised pre-training is to learn a good representation or initialization from massive amounts of unlabeled data, such as ImageNet (Deng et al., 2009), the Instagram image set (He et al., 2019), Wikipedia, and WebText (Radford et al., 2019), which are easier to collect and scale to millions or trillions of data points.
As a result, the learned representation, when fine-tuned on downstream tasks, can solve them efficiently without needing much supervision, or even in a few-shot manner. Driven by the massive abundance of unlabeled data relative to labeled data, we pose the following question: is enabling efficient unsupervised pre-training for deep RL as easy as increasing the amount of unlabeled data? Unlike in the computer vision or language domains, in reinforcement learning it is not obvious where to extract large pools of unlabeled data. A natural choice is pre-training on ImageNet and transferring the encoder to reinforcement learning tasks. We experimented with using ImageNet data for unsupervised representation learning to initialize the encoder of a deep RL agent; specifically, we used momentum contrast (He et al., 2019; Chen et al., 2020c), one of the state-of-the-art methods for representation learning, with DrQ (Kostrikov et al., 2020) as the RL optimization algorithm. The results on DMControl are shown in Figure 1. We can see that using ImageNet pre-trained representations does not lead to any significant improvement over training from scratch. We also experimented with using supervised pre-trained ResNet features as initialization, similar to Levine et al. (2016) (details in the Appendix), but the results are no different. This is in contrast to the preeminent success of ImageNet pre-trained models on various computer vision downstream tasks (see e.g. Krizhevsky et al., 2012; Zeiler & Fergus, 2014; Hendrycks et al., 2019; Chen et al., 2020a). On the other hand, previous research in robotics also found that ImageNet pre-training did not help (Julian et al., 2020). We hypothesize that the reason for the discrepancy is that the ImageNet data distribution is far from the sample distribution induced during RL training. It is therefore necessary to collect data from the distribution induced by the RL agent itself.
To investigate this hypothesis, we also experimented with training RL agents by 'exhaustively' collecting data during reward-free interaction. Specifically, during the pre-training phase, the only reward signal is defined by count-based exploration (Bellemare et al., 2016; Ostrovski et al., 2017), one of the state-of-the-art methods for exploration (Taïga et al., 2019), with PixelCNN (Van den Oord et al., 2016) as the density estimation model. The results of using the resulting pre-trained policy as initialization are shown in Figure 1. We can see that pre-training in the Cheetah environment does not improve significantly over random initialization on the Cheetah Run task. Similarly, pre-training in the Hopper environment only leads to a small improvement over the baseline. The reason for this ineffectiveness is that density modeling at the pixel level is difficult, especially in the low-data and non-stationary regime. The results on DMControl demonstrate that simply increasing the amount of unlabeled data does not work well; we therefore need a more systematic strategy that caters to RL. In this paper, we address this issue by actively collecting novel data through exploring unknown areas of the task-agnostic environment. Our approach is to maximize the entropy of the visited state distribution subject to prior constraints. The entropy maximization principle (Jaynes, 1957) originated in statistical mechanics, where Jaynes showed that entropy in statistical mechanics and in information theory are equivalent. Our motivation is that the resulting representation and initialization will encode prior information while being as agnostic as possible, and can thus be adapted to various downstream tasks. While the entropy maximization principle seems simple, it is practically difficult to compute the Shannon entropy (Shannon, 2001), as a density model is needed.
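As a concrete illustration of the count-based exploration signal used in this baseline, the sketch below rewards the agent with 1/sqrt(N(s)) from tabular visit counts. Note this is a simplified, hypothetical version: the baseline above derives pseudo-counts from a PixelCNN density model over pixels, whereas this toy assumes hashable (e.g. discretized) states, and the class name is our own.

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular count-based exploration bonus (illustrative sketch only).

    The actual baseline uses PixelCNN-derived pseudo-counts; this toy
    version keeps an explicit table N(s) over hashable states and pays
    a bonus of 1 / sqrt(N(s)) that decays as states become familiar.
    """

    def __init__(self):
        self.counts = defaultdict(int)  # N(s): visits per state

    def reward(self, state):
        self.counts[state] += 1
        # Novel states earn a large bonus; repeated visits earn less.
        return 1.0 / math.sqrt(self.counts[state])
```

The first visit to a state yields a bonus of 1.0 and the n-th visit yields 1/sqrt(n), so in the reward-free phase the objective is dominated by reaching states the agent has rarely seen.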
To remedy this, we resort to the particle-based entropy estimator (Singh et al., 2003; Beirlant, 1997), which has wide applications in various areas of machine learning (Sricharan et al., 2013; Pál et al., 2010; Jiao et al., 2018). The particle-based entropy estimator is known to be asymptotically unbiased and consistent. Specifically, it averages the Euclidean distance of each sample to its nearest neighbors. We compute the entropy in the latent representation space; for this, we adapt the idea of contrastive learning (Hadsell et al., 2006; Gutmann & Hyvärinen, 2010; Mnih & Kavukcuoglu, 2013; He et al., 2019; Chen et al., 2020b) to encode image observations into the representation space.
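To make the estimator concrete, below is a minimal NumPy sketch of a k-nearest-neighbor particle-based entropy estimate over a batch of latent vectors. The constant c inside the logarithm and the choice k=3 are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import numpy as np

def particle_entropy(z, k=3, c=1.0):
    """kNN particle-based entropy estimate (sketch).

    z: (n, d) array of latent representations.
    Returns the average log of (c + mean distance to the k nearest
    neighbors), a quantity that grows as the particles spread out.
    """
    # Pairwise Euclidean distances between all latent vectors.
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :k]      # k nearest neighbors per particle
    return float(np.mean(np.log(c + knn.mean(axis=1))))
```

Because the estimate depends only on nearest-neighbor distances among the particles themselves, no density model is needed: maximizing it as an intrinsic reward pushes the agent toward latent states far from everything it has visited.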



Figure 1: Unsupervised pre-training for deep RL on DMControl. After pre-training (e.g. on ImageNet or in the reward-free Cheetah environment), the agent fine-tunes the pre-trained representation or initialization to achieve higher task-specific rewards (e.g. make the Cheetah run faster). ImageNet pre-training denotes training MoCo on downsampled ImageNet. Count-based pre-training means training an RL agent with only a count-based exploration signal. Training details are in Appendix Section F.1. The results show that neither method outperforms training from scratch.

