UNSUPERVISED ACTIVE PRE-TRAINING FOR REINFORCEMENT LEARNING

Abstract

We introduce a new unsupervised pre-training method for reinforcement learning called APT, which stands for Active Pre-Training. APT learns a representation and a policy initialization by actively searching for novel states in reward-free environments. We use the contrastive learning framework to learn the representation from collected transitions. The key novel idea is to collect data during pre-training by maximizing a particle-based entropy computed in the learned latent representation space. Particle-based entropy maximization alleviates the need for challenging density modeling and thus allows our approach to scale to image observations. APT successfully learns meaningful representations as well as policy initializations without using any reward. We empirically evaluate APT on the Atari game suite and the DMControl suite by exposing task-specific rewards to the agent after a long unsupervised pre-training phase. On Atari games, APT achieves human-level performance on 12 games and obtains highly competitive performance compared to canonical fully supervised RL algorithms. On the DMControl suite, APT beats all baselines in terms of asymptotic performance and data efficiency and dramatically improves performance on tasks that are extremely difficult for training from scratch. Importantly, the pre-trained models can be fine-tuned to solve different tasks as long as the environment does not change. Finally, we also pre-train multi-environment encoders on data from multiple environments and show generalization to a broad set of RL tasks.

1. INTRODUCTION

Deep reinforcement learning (RL) provides a general framework for solving challenging sequential decision-making problems, and it has achieved remarkable success in advancing the frontier of AI technologies thanks to scalable and efficient learning algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Schulman et al., 2015; 2017). These landmarks include outperforming humans in board games (Silver et al., 2016; 2018; Schrittwieser et al., 2019) and computer games (Mnih et al., 2015; Berner et al., 2019; Schrittwieser et al., 2019; Vinyals et al., 2019; Badia et al., 2020a), and solving complex robotic control tasks (Andrychowicz et al., 2017; Akkaya et al., 2019). Despite these successes, a key challenge of deep RL is that it requires a huge number of interactions with the environment before it learns effective policies, and it needs to do so for each encountered task. Environments are required to have carefully designed task-specific reward functions to guide the RL algorithms (Andrychowicz et al., 2017; Ng et al., 1999), which further limits the wide application of deep RL. This is in contrast to how intelligent creatures learn in the absence of external supervisory signals, acquiring abilities in a task-agnostic manner by exploring the environment. Unsupervised pre-training, a framework that trains models without expert supervision, has obtained promising results in computer vision (Oord et al., 2018; He et al., 2019; Chen et al., 2020b; Caron et al., 2020; Grill et al., 2020) and natural language modeling (Vaswani et al., 2017; Devlin et al., 2018; Peters et al., 2018; Brown et al., 2020). The key insight of unsupervised pre-training techniques is learning a good representation or initialization from a massive amount of unlabeled data such as ImageNet (Deng et al., 2009), the Instagram image set (He et al., 2019), Wikipedia, and WebText (Radford et al., 2019), which are easier to collect and scale to millions or trillions of data points.
As a result, the learned representation, when fine-tuned on downstream tasks, can solve them efficiently without needing any supervision, or in a few-shot manner. Driven by the massive abundance of unlabeled data relative to labeled data, we pose the following question: is enabling efficient unsupervised pre-training for deep RL as easy as increasing the amount of unlabeled data? Unlike in the computer vision or language domains, in reinforcement learning it is not obvious where to extract large pools of unlabeled data. A natural choice is pre-training on ImageNet and transferring the encoder to reinforcement learning tasks. We experimented with using ImageNet data for unsupervised representation learning as initialization of the encoder in a deep RL agent; specifically, we used momentum contrast (He et al., 2019; Chen et al., 2020c), one of the state-of-the-art methods for representation learning, with DrQ (Kostrikov et al., 2020) as the RL optimization algorithm. The results on DMControl are shown in Figure 1. We can see that using ImageNet pre-trained representations does not lead to any significant improvement over training from scratch. We also experimented with using supervised pre-trained ResNet features as initialization, similar to Levine et al. (2016) (details in Appendix), but the results are no different. This is in contrast to the preeminent successes of ImageNet pre-trained models on various computer vision downstream tasks (see e.g. Krizhevsky et al., 2012; Zeiler & Fergus, 2014; Hendrycks et al., 2019; Chen et al., 2020a). On the other hand, previous research in robotics also found that ImageNet pre-training did not help (Julian et al., 2020). We hypothesize that the reason for the discrepancy is that the ImageNet data distribution is far from the sample distribution induced during RL training. It is therefore necessary to collect data from the RL-agent-induced distribution.
To investigate this hypothesis, we also experimented with training RL agents by 'exhaustively' collecting data during reward-free interaction. Specifically, during the pre-training phase, the only reward signal is defined by count-based exploration (Bellemare et al., 2016; Ostrovski et al., 2017), one of the state-of-the-art methods for exploration (Taïga et al., 2019), with PixelCNN (Van den Oord et al., 2016) as the density estimation model. The results of using the resulting pre-trained policy as initialization are shown in Figure 1. We can see that pre-trained initialization in the Cheetah environment does not improve significantly over random initialization on the Cheetah Run task. Similarly, pre-trained initialization in the Hopper environment leads only to a small improvement over the baseline. The reason for this ineffectiveness is that density modeling at the pixel level is difficult, especially in the low-data and non-stationary regime. The results on DMControl demonstrate that simply increasing the amount of unlabeled data does not work well; we therefore need a more systematic strategy that caters to RL. In this paper, we address the issue by actively collecting novel data through exploring unknown areas of the task-agnostic environment. We do so by maximizing the entropy of the visited state distribution subject to some prior constraints. The entropy maximization principle (Jaynes, 1957) originated in statistical mechanics, where Jaynes showed that entropy in statistical mechanics and information theory are equivalent. Our motivation is that the resulting representation and initialization will encode prior information while being as agnostic as possible, and can be adapted to various downstream tasks. While the entropy maximization principle seems simple, it is practically difficult to compute the Shannon entropy (Shannon, 2001), as a density model is needed.
To remedy this, we resort to the particle-based entropy estimator (Singh et al., 2003; Beirlant, 1997), which has wide applications in various areas of machine learning (Sricharan et al., 2013; Pál et al., 2010; Jiao et al., 2018). The particle-based entropy estimator is known to be asymptotically unbiased and consistent. Specifically, it computes the average of the Euclidean distances of each sample to its nearest neighbors. We compute the entropy in the latent representation space; for this, we adapt the idea of contrastive learning (Hadsell et al., 2006; Gutmann & Hyvärinen, 2010; Mnih & Kavukcuoglu, 2013; He et al., 2019; Chen et al., 2020b) to encode image observations into the representation space. Our approach alternates between training the encoder via contrastive learning and RL-style optimization of the expected reward, where the reward is defined by the particle-based entropy. After the pre-training phase, we can either fine-tune the encoder representation for test tasks that have a different action space dimension, or fine-tune the policy initialization for tasks with the same action space dimension. Since our method actively collects data during the pre-training phase, we name it Active Pre-Training (APT). We empirically evaluate APT on the Atari game suite and the DMControl suite by exposing task-specific rewards to the agent after a long unsupervised pre-training phase. On the full suite of Atari games, fine-tuning APT pre-trained models achieves human-level performance on 12 games. On the Atari 100k benchmark (Kaiser et al., 2019), our fine-tuning achieves 1.3× higher human-median scores than state-of-the-art training from scratch and 4× higher scores than state-of-the-art pre-training RL. On the DMControl suite, fine-tuning APT pre-trained models beats all baselines in terms of asymptotic performance and data efficiency and solves tasks that are extremely difficult for training from scratch.
The contributions of our paper can be summarized as follows: (i) We propose a new approach for pre-training in RL. (ii) We show that our pre-training method significantly improves the data efficiency of solving downstream tasks on the DMControl and Atari suites. (iii) We demonstrate that pre-training with particle-based entropy maximization in a contrastive representation space significantly outperforms prior count-based approaches that rely on density modeling.

2.1. INTRINSIC MOTIVATION AND EXPLORATION

The learning process of RL agents becomes highly inefficient in sparse-supervision tasks when relying on standard exploration techniques. This issue can be alleviated by introducing intrinsic motivation, i.e., denser reward signals that can be automatically computed. These rewards are generally task-agnostic and might come from state visitation count bonuses (Bellemare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017; Zhao & Tresp, 2019), learning to predict environment dynamics (Meyer & Wilson, 1991; Pathak et al., 2017; Burda et al., 2018a; Sekar et al., 2020), distilling random neural networks (Burda et al., 2018b; Choi et al., 2018), hindsight relabeling (Andrychowicz et al., 2017), learning options (Sutton et al., 1999) through mutual information (Jung et al., 2011; Mohamed & Rezende, 2015), information gain (Lindley, 1956; Sun et al., 2011; Houthooft et al., 2016), successor features (Kulkarni et al., 2016; Machado et al., 2018), maximizing mutual information between behaviors and some aspect of the corresponding trajectory (Gregor et al., 2016; Florensa et al., 2017; Warde-Farley et al., 2018; Hausman et al., 2018; Shyam et al., 2019), using imitation learning to return to the furthest discovered states (Ecoffet et al., 2019), self-play curricula (Schmidhuber, 2013; Sukhbaatar et al., 2017; Liu et al., 2019), exploration in latent space (Vezzani et al., 2019), injecting noise in parameter space (Fortunato et al., 2017; Plappert et al., 2017), learning to imitate oneself (Oh et al., 2018), predicting an improvement measure (Schmidhuber, 1991; Oudeyer et al., 2007; Lopes et al., 2012; Achiam & Sastry, 2017), and unsupervised auxiliary tasks (Jaderberg et al., 2016). The work by Badia et al. (2020b) also considers a k-nearest-neighbor-based intrinsic reward to incentivize exploration, and shows improved exploration in sparse-reward games.
Our work differs in that we consider reward-free settings and our intrinsic reward is based on particle-based entropy instead of a count bonus. The work closest to ours is Hazan et al. (2019), which presents provably efficient exploration algorithms under certain conditions. However, their method directly estimates state visitations through a density model, which is difficult to scale. In contrast, our work turns to particle-based entropy maximization in a contrastive representation space. Concurrent work by Mutti et al. (2020) shows that maximizing particle-based entropy can improve data efficiency in solving downstream continuous control tasks. However, their method relies on importance sampling and on-policy RL, which suffers from high variance and is difficult to scale. In contrast, our work resorts to a biased but lower-variance entropy estimator that is scalable to high-dimensional observations and suitable for off-policy RL optimization.

2.2. DATA EFFICIENCY IN RL

Deep RL algorithms are sample inefficient compared to intelligent biological creatures, which can quickly learn to complete new tasks. To close this data-efficiency gap, various methods have been proposed: Kaiser et al. (2019) introduce a model-based agent (SimPLe) and show that it compares favorably to standard RL algorithms when data is limited. Hessel et al. (2018); Kielak (2020); van Hasselt et al. (2019) show that combining existing RL algorithms (Rainbow) can boost data efficiency. Srinivas et al. (2020) proposed combining a contrastive loss with image augmentation, while follow-up results from Laskin et al. (2020) suggest that most of the benefits come from the use of image augmentation. Laskin et al. (2020); Kostrikov et al. (2020) demonstrate that applying modest image augmentation can substantially improve data efficiency in vision-based RL. Our work improves the data efficiency of RL in an orthogonal direction, through unsupervised pre-training; the above advances can be used inside APT to obtain better RL optimization in both the pre-training and fine-tuning phases.

3. METHOD

Our method, shown in Figure 2, consists of contrastive representation learning and particle-based entropy maximization in the learned representation space. Consider an agent that sees an observation x_t, takes an action a_t, and transitions to the next state with observation x_{t+1} following unknown environment dynamics. We want to incentivize this agent with a reward r_t relating to how informative the transition was. The goal of the agent is to learn a good representation f_θ(x) of the observation x, or a good initialization of the policy π(a|x), by interacting with the environment in a reward-free manner, such that fine-tuning on downstream tasks achieves higher long-term cumulative task-specific reward than training from scratch.


Learning contrastive representations. Within each batch of transitions sampled from the replay buffer, we apply data augmentation to each data point; the augmented observations are encoded into a small latent space where a contrastive loss is applied. Our contrastive learning is based on SimCLR (Chen et al., 2020b), chosen for its simplicity:

\min_{\theta, \phi} \; -\mathbb{E}\left[\log \frac{\exp(z_i^\top z_j)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(z_i^\top z_k)}\right], \quad (1)

where x is an n-dimensional data point in the observation space X ⊆ R^n; z_i and z_j are normalized d_Z-dimensional vectors obtained from two random augmentations x_i and x_j of the data point x through the deterministic mapping g_φ(f_θ(·)); f_θ: R^n → R^{d_Y} is the representation encoder; g_φ: R^{d_Y} → R^{d_Z} is a projection head; and N is the batch size. This objective tries to maximally distinguish an input x_i from the alternative inputs x_k. The intuition is that, by doing so, the representation captures the important information shared between similar data points and therefore improves performance on downstream tasks.

Particle-based entropy maximization. Let the distribution of observations be p(x). The entropy of the observations is given by H(p) = -\mathbb{E}_{x \sim p(x)}[\log p(x)]. However, in high-dimensional spaces it is challenging to estimate the density, preventing us from directly maximizing the exact entropy. To remedy this issue, we resort to the particle-based entropy estimator (Singh et al., 2003; Beirlant, 1997), which is based on k-nearest neighbors (kNN).
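As a concrete illustration, a minimal sketch of a SimCLR-style batch loss in PyTorch is given below. The function name is ours, and the sketch assumes the two views are already projected and L2-normalized; the paper's actual implementation may differ in details such as the temperature handling.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """SimCLR-style InfoNCE loss over a batch of paired augmentations.

    z1, z2: (N, d) latent vectors of two augmentations of the same
    observations, assumed already L2-normalized.
    """
    N = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)            # (2N, d) stacked views
    sim = z @ z.t() / temperature             # (2N, 2N) similarities
    # Mask out self-similarity so each row competes only with the other 2N-1.
    sim.fill_diagonal_(float("-inf"))
    # Row i's positive is its augmented counterpart at index i+N (mod 2N).
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)])
    return F.cross_entropy(sim, targets)
```

The cross-entropy over the similarity rows is exactly the negative log ratio in equation (1), averaged over both views of the batch.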
The particle-based entropy estimate is given by

H_k(p) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{k}{N \cdot \mathrm{Vol}_i^k} + \log k - \Psi(k) \;\propto\; \sum_{i=1}^{N} \log \mathrm{Vol}_i^k, \quad (2)

where \Psi is the digamma function, \log k - \Psi(k) is a bias correction term, and \mathrm{Vol}_i^k is the volume of the hyper-sphere of radius R_i = \|x_i - x_i^{kNN}\|, the Euclidean distance between x_i and its k-th nearest neighbor x_i^{kNN}. The volume is given by

\mathrm{Vol}_i^k = \frac{\|x_i - x_i^{kNN}\|^n \cdot \pi^{n/2}}{\Gamma(n/2 + 1)}, \quad (3)

where \Gamma is the gamma function and n is the dimension of X. Putting equation (2) and equation (3) together, we simplify the entropy to

H_k(p) \;\propto\; \sum_{i=1}^{N} \log \|x_i - x_i^{kNN}\|^n. \quad (4)

When the target data distribution p'(x) (as in the case of off-policy RL, which we use for sample efficiency) is different from the sampling distribution p(x), in principle importance sampling is needed to correct the bias (Ajgl & Šimandl, 2011). However, we find empirically that the biased approximation in equation (2) works well and avoids estimating the high-variance importance ratios, as shown in Section B (Appendix). Given a batch of transitions sampled from the replay buffer, we use the particle-based entropy estimate as the intrinsic reward and use an off-policy RL algorithm to maximize the expected reward. With the particle-based entropy objective in equation (4), we may be able to maximize state entropy in continuous control, but it is still not applicable to visual RL agents in high-dimensional domains like DMControl and Atari games. To remedy this issue, we maximize the entropy in our learned lower-dimensional representation space: we jointly learn representations by contrastive learning (equation (1)) and explore by particle-based entropy maximization (equation (4)).
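The simplified estimator in equation (4), up to additive constants, can be sketched as follows. The function name and tensor shapes are our own; it assumes a batch of continuous-valued points with no exact duplicates (a duplicate would give a zero distance and a divergent logarithm).

```python
import torch

def particle_entropy(x, k=3):
    """kNN (particle-based) entropy estimate, up to constants:
    H_k(p) is proportional to sum_i log ||x_i - x_i^kNN||.

    x: (N, d) batch of points treated as particles.
    k: which nearest neighbor to use (excluding the point itself).
    """
    dists = torch.cdist(x, x)  # (N, N) pairwise Euclidean distances
    # Sorted ascending, index 0 is the self-distance (0), so index k
    # is the distance to the k-th nearest neighbor.
    knn_dist = dists.topk(k + 1, largest=False).values[:, k]
    return knn_dist.log().sum()
```

Scaling the point cloud up spreads the particles apart, so the estimate increases, matching the intuition that a more dispersed distribution has higher entropy.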
Specifically, for a batch of transitions {(x_t, a_t, x_{t+1})} sampled from the replay buffer, each x_{t+1} is treated as a particle and we associate each transition with an intrinsic reward given by

r(x_t, a_t, x_{t+1}) := \log\left(\|y_{t+1} - y_{t+1}^{kNN}\|^n + c\right), \quad (5)

where y = f_θ(x) is the representation (i.e., we estimate the entropy in the latent space) and c is a constant for numerical stability (fixed to 1 in all our experiments). To keep the rewards on a consistent scale, we normalize the intrinsic reward by dividing it by a running estimate of its standard deviation. A detailed computation of the intrinsic reward in PyTorch can be found in Algorithm 2. With the intrinsic reward defined in equation (5), we can show that the intrinsic reward decreases to 0 as most of the state space is visited, which is a favorable property for pre-training.

Lemma 1. Assume an episodic MDP with a finite state space X ⊆ R^n, a buffer of observed states (x_1, ..., x_T) with total sample size T, a deterministic representation encoder f_θ: R^n → R^{d_Y}, an intrinsic reward defined as in equation (5) with k ∈ N, and an optimal policy that maximizes the intrinsic reward. Then the intrinsic reward vanishes in the limit of infinite sample size:

\lim_{T \to \infty} r(x, a, x') = 0, \quad \forall x \in X. \quad (6)

Proof. Since the intrinsic reward r(x, a, x') defined in equation (5) depends on the k-th nearest neighbor in latent space and the encoder f_θ is deterministic, it suffices to show that the visitation count c(x) of each x exceeds k as T goes to infinity. Because the MDP is episodic, as T → ∞ all states communicate and c(x) → ∞; thus \lim_{T \to \infty} c(x) ≥ k, ∀k ∈ N, ∀x ∈ X.

While the assumption of a finite state space may not hold for large, complex environments like Atari games, Lemma 1 gives more insight into using this particular intrinsic reward for pre-training.
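The paper refers to Algorithm 2 for the exact PyTorch computation; as a hedged sketch (the class, parameter names, and the exponential-moving-average form of the running standard deviation are our own simplifications), the batch-wise reward of equation (5) with normalization might look like:

```python
import torch

class IntrinsicReward:
    """Particle-based intrinsic reward r = log(||y - y^kNN||^n + c),
    normalized by a running estimate of its standard deviation.
    Sketch only: the paper's Algorithm 2 is authoritative."""

    def __init__(self, k=3, c=1.0, n=2, momentum=0.99, eps=1e-8):
        self.k, self.c, self.n = k, c, n
        self.momentum, self.eps = momentum, eps
        self.run_std = 1.0  # running std estimate of the raw reward

    def __call__(self, y):
        # y: (N, d) latent representations of next observations in the batch.
        dists = torch.cdist(y, y)
        # Index k of the ascending-sorted row skips the self-distance 0.
        knn_dist = dists.topk(self.k + 1, largest=False).values[:, self.k]
        reward = torch.log(knn_dist ** self.n + self.c)
        # Update the running std estimate, then normalize the batch reward.
        self.run_std = (self.momentum * self.run_std
                        + (1 - self.momentum) * reward.std().item())
        return reward / (self.run_std + self.eps)
```

With c = 1 the raw reward is non-negative, and it shrinks toward 0 as neighboring particles in latent space get closer, consistent with Lemma 1.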
We use n = 2 in our implementation because of the structure we imposed on the contrastive representations. In principle, the kNN should be computed over the entire buffer, which scales linearly with sample size; instead, we compute the intrinsic reward within the current batch to trade off computation. We fix k = 3 in all our experiments, as we found it works well in initial experiments. APT alternates between minimizing the contrastive loss in equation (1) and maximizing the expected intrinsic reward in equation (5). The pseudocode of APT is shown in Algorithm 1, and the full algorithm is summarized in Algorithm 3 (Appendix). The diagram of APT is shown in Figure 2.

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

The evaluation benchmarks are the DeepMind Control Suite (DMControl; Tassa et al., 2020) and the Atari suite (Bellemare et al., 2013) from OpenAI Gym (Brockman et al., 2016). For DMControl, we use pixel observations instead of states as input. For all of the experiments, we use DrQ as the underlying RL optimization algorithm. Unless stated otherwise, all curves are the average of three runs with different seeds, and the shaded areas are standard errors of the mean. The results reported on the Atari games suite are averaged over five runs with different seeds. In addition to the existing tasks in DMControl, we also design several new sparse-reward tasks: (1) {HalfCheetah, Hopper, Walker} Jump Sparse: the agent receives a positive reward of 1 for jumping above a given height, otherwise the reward is 0. (2) {HalfCheetah, Hopper, Walker} Reach Sparse: the agent receives a positive reward of 1 for reaching a given target location, otherwise the reward is 0. (3) Walker Escape Sparse: the Walker starts upside down and receives a reward of 1 for successfully turning itself over, otherwise 0. In all the considered tasks, the episode ends when the goal is reached. We conduct a full evaluation of APT on a diverse set of tasks in the single-environment setting: the models are pre-trained on Cheetah, Hopper, and Walker for a long period of interaction without reward supervision (5M steps), then fine-tuned on 15 downstream RL tasks, such as controlling the Walker to turn itself over. Fine-tuning representations is a common practice in deep learning (Krizhevsky et al., 2012; He et al., 2016). In our RL experiments, APT stands for fine-tuning both the representation and the RL agent initialization. The main baseline is count-based exploration (Bellemare et al., 2014; 2016; Ostrovski et al., 2017), which estimates counts in high-dimensional state spaces via density estimation; the agent is then encouraged to visit states with a low visitation count.
We followed the implementation of Ostrovski et al. (2017) for the count-based exploration baseline. Results on DMControl, shown in Figure 3, demonstrate that APT beats all baselines on all tasks across different environments, while the sparse-reward tasks are extremely difficult for training from scratch. In some cases, APT allows for very rapid fine-tuning, indicating that APT learns a reward-free representation and a meaningful RL initialization. The significantly superior performance of APT empirically shows that maximizing particle-based entropy can drive the RL agent to collect diverse samples and learn reward-free initializations that are effective for downstream tasks. The evaluation on the full suite of 57 Atari games (Bellemare et al., 2013) follows the setting of VISR (Hansen et al., 2020). First, we evaluate APT in a two-phase setup: agents are allowed a long unsupervised pre-training phase without access to rewards, followed by a short test phase (100k steps). DIAYN (Eysenbach et al., 2018) is a skill-discovery method that maximizes the mutual information between latent-variable policies and their behavior in terms of state visitation. The main baseline is VISR (and its variant GPI VISR), which combines skill discovery with universal successor approximators (Borsa et al., 2018) to enable fast task inference at both training and test phases (Barreto et al., 2017; 2018). Due to the high computational cost, the count-based baseline is evaluated not on the entire 57-game suite but on the 26-game subset (Kaiser et al., 2019). Second, we contrast APT with canonical RL algorithms in the low-data regime, following the setting of Kaiser et al. (2019). The compared algorithms include DrQ, SimPLe (Kaiser et al., 2019), proximal policy optimization (PPO) (Schulman et al., 2017), and OTRainbow (Kielak, 2020). Results shown in Table 1 demonstrate that APT significantly outperforms all baselines, buying performance equivalent to hundreds of millions of sampling steps.
Note that it may be possible to further improve performance by directly applying VISR on top of the pre-trained models learned by APT; we leave this for future work. A further discussion of the connections and differences between APT and VISR, which gives more intuition, can be found in Section C of the Appendix. We investigated the difference between fine-tuning only the representation and fine-tuning both the representation and the policy initialization. APT denotes fine-tuning both the representation and the RL policy initialization, while APT (representation) stands for fine-tuning the encoder from pre-trained models while randomly initializing the policy. The notable difference is that APT (representation) decouples the action space dimension from the pre-trained models. We applied count-based exploration on top of contrastive-learning representations to eliminate the potential effect of representation-learning differences; this baseline is shown as contrastive count-based pre-training, where we train a VAE (Kingma & Welling, 2013) on the learned contrastive representations to estimate state visitation.

4.3. REPRESENTATION AND RL AGENT INITIALIZATION FINE-TUNE

As we show in Figure 4, APT (representation) beats all of the supervised RL and pre-training RL baselines, and fine-tuning both the representation and the RL initialization further improves performance. In some cases, APT allows for more rapid performance improvement than APT (representation) within a small fraction of the total number of samples, indicating that APT learns meaningful reward-free and task-agnostic behavior. Both APT and APT (representation) significantly outperform the ablated contrastive count-based pre-training baseline, confirming that particle-based entropy maximization is a crucial component. Table 2 shows the results of fine-tuning pre-trained models for 100k timesteps on Atari games. VISR tends to be more data efficient than APT and APT (representation) on easy-exploration games, potentially due to the explicit reward regression and successor features in VISR. On hard-exploration games, e.g., Freeway, APT has a significant advantage, achieving an order of magnitude higher scores than VISR while maintaining very high scores across the remaining games. APT significantly outperforms APT (representation) on Atari games, showing that the learned exploratory policy is crucial for learning with a very limited number of interactions.

Figure 5: Comparison between fine-tuning representations learned on a single environment and on multiple environments. We apply APT to all three environments.

We completed an initial exploration of APT (representation) in the multi-environment setting on the DMControl suite, with results shown in Table 5. In this setup, APT simultaneously learns a pre-trained representation from the Hopper, Cheetah, and Walker environments. We evaluate the pre-trained model by using it in separate RL agents, each learning a downstream task from every environment.
As shown in Figure 5, we found that multi-environment pre-training outperforms training from scratch on sparse-reward tasks and performs on par with or better than training from scratch on dense-reward tasks, showing that multi-environment pre-training is effective. Compared with single-environment APT, multi-environment pre-training tends to have mixed results, indicating interference between representation learning in different environments. We remark that although the multi-environment variant of APT is outperformed by single-environment APT, multi-environment pre-training is a novel research direction, and our careful implementation considerations and extensive experimental results should allow the method to be widely adopted.

5. CONCLUSION

A new unsupervised pre-training method for RL is introduced to address reward-free pre-training for visual RL. On the DMControl suite and Atari games, our method dramatically improves performance on tasks that are extremely difficult for training from scratch. Our method matches the results of fully supervised canonical RL algorithms using a small fraction of the total samples and outperforms data-efficient supervised RL methods. Our major contribution is an efficient algorithm for maximizing particle-based entropy in the latent representation space, allowing the same task-agnostic pre-trained model to successfully tackle a broad set of RL tasks.

C. CONNECTIONS BETWEEN APT AND VISR

VISR and related skill-discovery methods maximize the mutual information between a latent variable z and the states s visited by the latent-conditioned policy. Expanding the definition and deriving a variational lower bound (Barber & Agakov, 2003) gives

J(\theta) = I(z; s) \quad (7)
= H(z) - H(z|s) \quad (8)
= \mathbb{E}_{\pi,z}[\log q_\phi(z|s)] + \mathbb{E}_s[\mathrm{KL}(p(\cdot|s) \,\|\, q_\phi(\cdot|s))] + H(z) \quad (9)
\geq \mathbb{E}_{\pi,z}[\log q_\phi(z|s)] + H(z), \quad (10)

where q_\phi(z|s) is a variational approximation. In practice, sampling z from a fixed distribution yields better and more stable results (Eysenbach et al., 2018; Hansen et al., 2020), which simplifies the objective to maximizing the conditional log-likelihood,

L(\theta, \phi) = \mathbb{E}_{\pi,z}[\log q_\phi(z|s)]. \quad (12)

The optimization of equation (12) is then accomplished by an RL algorithm with the reward defined as

r(s, a, s') = \log q_\phi(z|s). \quad (13)

Equation (12) has been shown to be effective in RL, from learning skills in state-based control in DIAYN (Eysenbach et al., 2018) and EDL (Campos et al., 2020) to combining successor features with skill discovery in VISR (Hansen et al., 2020). Comparing the variational intrinsic reward in equation (13) with the intrinsic reward used in APT (equation (5)), we can see that the particle-based reward does not need to learn a parametric probabilistic density and thus gracefully scales to high-dimensional vision-based RL. We see that APT performs considerably better on downstream RL tasks.

F.1 IMAGENET PRE-TRAINING DETAILS

In Figure 1, the ImageNet (Deng et al., 2009) pre-trained model is based on MoCo (He et al., 2019).
The policies are represented by the Impala convolutional residual network as in Espeholt et al. (2018), with the LSTM (Hochreiter & Schmidhuber, 1997) part excluded. Images sampled from ImageNet are downsampled to 84 × 84, followed by frame-stacking and data augmentation. We experiment with the data augmentation methods used in He et al. (2019); Chen et al. (2020c) and the simpler random crop used in RL (Kostrikov et al., 2020). The results on DMControl show no benefit from ImageNet unsupervised pre-trained models. To improve the quality of the pre-trained representations, we also consider initializing the filters in the first layer with weights from the model of He et al. (2016), which is trained on ImageNet classification. Similar to fine-tuning MoCo-trained models, we observe no difference in performance between fine-tuning supervised pre-trained models and training from scratch.

F.2 GENERAL IMPLEMENTATION DETAILS

The encoder network f is a ReLU convolutional neural network followed by a fully-connected layer normalized by LayerNorm (Ba et al., 2016), with a tanh nonlinearity applied to the output of the fully-connected layer. The data augmentation is a simple random shift, which has been shown effective in visual-domain RL in DrQ (Kostrikov et al., 2020) and RAD (Laskin et al., 2020). Specifically, the images are padded on each side by 4 pixels (by repeating boundary pixels) and a random 84 × 84 crop is then selected, yielding an image of the original size. This procedure is repeated every time an image is sampled from the replay buffer. The learning rate of contrastive learning is 0.001 and the temperature is 0.1. We incorporate the memory mechanism (with a moving average of weights for stabilization) from He et al. (2019); Chen et al. (2020c). We use DrQ as the RL optimization algorithm in both the pre-training and fine-tuning phases. The batch size of contrastive learning is 1024 and the batch size of RL optimization is 512. The pre-training phase consists of 5M environment steps on DMControl and 250M environment steps on Atari games. The replay buffer size is 100K. The projection network is a two-layer MLP with a hidden size of 128 and an output size of 64. All hyperparameters are listed in Table 7 and Table 8.
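The pad-then-crop augmentation described above can be sketched as follows. The function name is ours, and the sketch samples one shift per image rather than vectorizing over the batch; the DrQ/RAD implementations may differ in details.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """Random-shift augmentation: pad each side by `pad` pixels by
    replicating boundary pixels, then take a random crop at the
    original resolution (a sketch of the DrQ/RAD-style augmentation).

    imgs: (B, C, H, W) float image batch.
    """
    b, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(b):
        # Pick the top-left corner of the crop uniformly at random.
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out
```

Each call produces a slightly shifted view of the same images, which is why the procedure is re-applied every time a batch is drawn from the replay buffer.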

F.3 MULTI-ENVIRONMENT PRE-TRAINING DETAILS

For the experiments on multi-environment pre-training (results shown in Figure 5), we use one separate replay buffer for each of the three environments, and compute an environment-specific loss using data sampled from each environment's own replay buffer. The contrastive loss and RL loss are then summations of the environment-specific losses.

Table 6: The action repeat hyper-parameter used for each environment.
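The per-environment buffering and loss summation can be sketched as follows. The `ReplayBuffer` class and function names here are hypothetical, chosen only to illustrate that each environment's loss is computed from its own buffer before the losses are summed.

```python
import random

class ReplayBuffer:
    """Minimal illustrative per-environment buffer (hypothetical API)."""
    def __init__(self, transitions):
        self.transitions = list(transitions)

    def sample(self, batch_size):
        # Uniform sampling without replacement from this env's data only.
        return random.sample(self.transitions, batch_size)

def multi_env_loss(buffers, loss_fn, batch_size):
    """Sum environment-specific losses, each computed on a batch drawn
    from that environment's own replay buffer."""
    return sum(loss_fn(buf.sample(batch_size)) for buf in buffers)
```

In practice the same pattern would be applied twice per update, once with the contrastive loss and once with the RL loss, before backpropagating through the shared encoder.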

F.5 ATARI HYPERPARAMETERS

For the Atari 100k experiments, we largely reuse the hyper-parameters from DrQ (Kostrikov et al., 2020). The evaluation is done for 125k environment steps at the end of training for 100k environment steps.



Figure 1: Unsupervised pre-training for deep RL on DMControl. After pre-training (e.g. on ImageNet or in the reward-free Cheetah environment), the agent fine-tunes the pre-trained representation or initialization to achieve higher task-specific rewards (e.g. making the Cheetah run faster). ImageNet pre-training denotes training MoCo on downsampled ImageNet. Count-based pre-training means training an RL agent with only a count-based exploration signal. The training details are in Appendix Section F.1. The results show that neither method outperforms training from scratch.

Figure 2: Diagram of the proposed method Unsupervised Active Pre-Training: it consists of contrastive representation learning on data collected by the agent (equation (1)) and RL optimization to maximize particle based entropy (equation (5)). After pre-training, the task-agnostic encoder f θ and the RL policy initialization can be fine-tuned for different downstream tasks to maximize task-specific reward.

Figure 4: Comparison of fine-tuning the representation, fine-tuning both the representation and the RL agent, and ablated baselines. Models are pre-trained on Hopper and subsequently fine-tuned on downstream tasks. The suffix 'sparse' denotes sparse reward. Both variants of APT outperform training from scratch and the other baselines.




Table 1: Evaluation on Atari games. @N represents the amount of RL interaction utilized. Mdn is the median of human-normalized scores, M is the mean, > 0 is the number of games with better-than-random performance, and > H is the number of games with human-level performance. On each subset, we mark the highest score in bold. Since different papers report different results for supervised RL, e.g. SimPLe, we choose the best available results and contrast them with APT's results. The results of VISR are cited from Hansen et al. (2020) as the source code is not publicly available. Raw scores of each Atari game are given in Table 5 (Appendix). Top: data-limited RL.

Figure 3: Evaluation on the DeepMind Control suite. Models are pre-trained on Cheetah, Hopper, and Walker, and subsequently fine-tuned on the respective downstream tasks. The suffix 'sparse' denotes sparse reward. The scores for each environment are given in Table 4 (Appendix).

Comparison of fine-tuning the representation versus fine-tuning both the representation and the RL agent on Atari games. The results are obtained after fine-tuning for 100k timesteps and are averaged over five random seeds.



Comparison of fine-tuning on DMControl. Models are pre-trained on Cheetah, Hopper, and Walker, and subsequently fine-tuned on the respective downstream tasks. The suffix 'sparse' denotes sparse reward. APT significantly outperforms the baselines on most sparse reward tasks.

Comparison of raw scores of each method on Atari games. Results are averaged over five random seeds. Dense-reward hard-exploration games and sparse-reward hard-exploration games are marked in the table. @N represents the amount of RL interaction utilized in the fine-tuning phase.

B EVALUATION OF ENTROPY MAXIMIZATION

We conducted experiments to evaluate APT's performance in maximizing entropy in state-based continuous control tasks from OpenAI Gym (Brockman et al., 2016). We compared APT with state-of-the-art state entropy maximization methods, including (1) MaxEnt (Hazan et al., 2019), a provably efficient exploration method for maximizing state entropy that improves the efficiency of exploring the state space in continuous control; (2) MEPOL (Mutti et al., 2020), a recent state-of-the-art exploration method for continuous control whose particle-based entropy estimate is based on importance sampling and is unbiased; and (3) APT-like MEPOL, which denotes training the APT objective with MEPOL's trust region optimization. Following MaxEnt (Hazan et al., 2019), we compute the entropy index by discretizing the state space. The Ant-v2 Jump and Humanoid-v2 Standup tasks are the two sparse reward tasks given in MEPOL (Mutti et al., 2020). Figure 6 shows the entropy achieved by each method, and Table 3 shows the results of fine-tuning pre-trained models for 0.5M timesteps. MaxEnt performs poorly in high-dimensional continuous control because state density modeling is difficult. APT-like MEPOL outperforms MEPOL, indicating that the APT objective balances variance and bias. APT significantly outperforms all baselines in maximizing entropy and also has the highest data efficiency in solving sparse reward tasks, confirming that the off-policy nature of APT is crucial for both pre-training and fine-tuning.
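The particle-based entropy being maximized here can be illustrated with a minimal k-nearest-neighbor sketch: each point's contribution is the log of its distance to its k-th nearest neighbor, so spreading particles apart increases the estimate. This is a simplified proxy; the exact form used by APT (equation (5) in the main text) and the full kNN estimator include additional constants that do not affect maximization, and the function name is ours.

```python
import numpy as np

def particle_entropy(z, k=3):
    """Nonparametric particle-based entropy proxy: the mean log distance
    from each point in `z` (shape [n, d]) to its k-th nearest neighbor.
    Omits the dimension-dependent constants of the full kNN estimator."""
    # Pairwise squared Euclidean distances between all particles.
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude each particle's self-distance
    # Distance to the k-th nearest neighbor for each particle.
    kth = np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])
    return float(np.mean(np.log(kth + 1e-8)))
```

Because only relative distances matter, the estimate is computed in the learned latent space rather than on raw observations, which is what sidesteps explicit density modeling.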

C ANALYSIS OF VARIATIONAL UNSUPERVISED RL METHODS

We contrast APT with variational unsupervised RL algorithms to give more intuition beyond the empirical evaluation in Table 1. Utilizing a strong inductive bias that is likely to yield features relevant to the rewards of possible downstream tasks has been a central goal of unsupervised RL research. One such widely used bias, proposed by Achiam & Sastry (2017); Gregor et al. (2016), is to represent only the subset of the observation space that the agent can control. This can be accomplished by maximizing the mutual information between a policy-conditioning variable and the agent's behavior. Formally, the goal is to learn latent-conditioned policies π θ (a|s, z) and define skills as the policies obtained when conditioning π on a fixed value of z ∈ Z. Many algorithms optimize the policy parameters θ to maximize this mutual information through various means (see e.g. Eysenbach et al., 2018; Hansen et al., 2020; Warde-Farley et al., 2018; Sharma et al., 2019). The quantity can be derived by

