COVERAGE AS A PRINCIPLE FOR DISCOVERING TRANSFERABLE BEHAVIOR IN REINFORCEMENT LEARNING

Abstract

Designing agents that acquire knowledge autonomously and use it to solve new tasks efficiently is an important challenge in reinforcement learning. Unsupervised learning provides a useful paradigm for autonomous acquisition of task-agnostic knowledge. In supervised settings, representations discovered through unsupervised pre-training offer important benefits when transferred to downstream tasks. Given the nature of the reinforcement learning problem, we explore how to transfer knowledge through behavior instead of representations. The behavior of pre-trained policies may be used for solving the task at hand (exploitation), as well as for collecting useful data to solve the problem (exploration). We argue that pre-training policies to maximize coverage will result in behavior that is useful for both strategies. When using these policies for both exploitation and exploration, our agents discover solutions that lead to larger returns. The largest gains are generally observed in domains requiring structured exploration, including settings where the behavior of the pre-trained policies is misaligned with the downstream task.

[Figure 1: learning curves (return vs. environment steps, up to 5B) on montezuma_revenge and space_invaders; legend: NGU, no transfer, behavior, behavior + representation, representation, fine-tuning.]

1. INTRODUCTION

Unsupervised representation learning techniques have led to unprecedented results in domains like computer vision (Hénaff et al., 2019; He et al., 2019) and natural language processing (Devlin et al., 2019; Radford et al., 2019). These methods are commonly composed of two stages: an initial unsupervised phase, followed by supervised fine-tuning on downstream tasks. The self-supervised nature of the learning objective makes it possible to leverage large collections of unlabelled data in the first stage. This produces models that extract task-agnostic features that are well suited for transfer to downstream tasks. In reinforcement learning (RL), auxiliary representation learning objectives provide denser signals that result in data efficiency gains (Jaderberg et al., 2017) and even bridge the gap between learning from true state and from pixel observations (Laskin et al., 2020). However, RL applications have not yet seen the advent of the two-stage setting where task-agnostic pre-training is followed by efficient transfer to downstream tasks. We argue that there are two reasons explaining this lag with respect to supervised counterparts. First, these methods traditionally focus on transferring representations (Lesort et al., 2018). While this is enough in supervised scenarios, we argue that leveraging pre-trained behavior is far more important in RL domains requiring structured exploration. Second, what type of self-supervised objectives enable the acquisition of transferable, task-agnostic knowledge is still an open question. Defining these objectives in the RL setting is complex, as they should account for the fact that the distribution of the input data will be defined by the behavior of the agent. Transfer in deep learning is often performed through parameter initialization followed by fine-tuning.
Figure 1: Comparison of transfer strategies on Montezuma's Revenge (hard exploration) and Space Invaders (dense reward) from a task-agnostic policy pre-trained with NGU (Puigdomènech Badia et al., 2020b). Transferring representations provides a significant boost in dense reward games, but it does not seem to help in hard exploration ones. Leveraging the behavior of the pre-trained policy provides important gains in hard exploration problems when compared to standard fine-tuning, and is complementary to transferring representations. We refer the reader to Appendix F for details on the network architecture.

The most widespread procedure consists of initializing all weights in the neural network using those from the pre-trained model, and then adding an output layer with random parameters (Girshick et al., 2014; Devlin et al., 2019). Depending on the amount of available data, pre-trained parameters can either be fine-tuned or kept fixed. This builds on the intuition that the pre-trained model will map inputs to a feature space where the downstream task is easy to perform. In the RL setting, this procedure completely dismisses the pre-trained policy and falls back to a random one when collecting experience. Given that complex RL problems require structured and temporally-extended behaviors, we argue that representation alone is not enough for efficient transfer in challenging domains. Pre-trained representations do indeed provide data efficiency gains in domains with dense reward signals (Finn et al., 2017; Yarats et al., 2019; Stooke et al., 2020a), but our experiments show that the standard fine-tuning procedure falls short in hard exploration problems (cf. Figure 1). We observe this limitation even when fine-tuning the pre-trained policy, which is aligned with findings from previous works (Finn et al., 2017).
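The standard representation-transfer recipe described above (copy all pre-trained weights, attach a randomly initialised output head, optionally freeze the torso) can be sketched as follows. This is a framework-agnostic illustration with hypothetical names, not the paper's implementation:

```python
import numpy as np

def build_finetune_params(pretrained, feature_dim, num_actions, rng, freeze_torso=False):
    """Standard transfer recipe: copy every pre-trained weight, then attach
    a randomly initialised output head for the downstream task. Returns the
    parameter dictionary and the set of parameter names to train."""
    params = {name: w.copy() for name, w in pretrained.items()}  # reuse torso
    params["head"] = rng.normal(0.0, 0.01, size=(feature_dim, num_actions))
    trainable = {"head"} if freeze_torso else set(params)
    return params, trainable
```

Whether the torso is fine-tuned or frozen (`freeze_torso`) mirrors the choice discussed above, which typically depends on the amount of downstream data.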
Learning in the downstream task can lead to catastrophically forgetting the pre-trained policy, something that depends on many difficult-to-measure factors such as the similarity between the tasks. We address the problem of leveraging arbitrary pre-trained policies when solving downstream tasks, a requirement towards enabling efficient transfer in RL. Defining unsupervised RL objectives remains an open problem, and existing solutions are often influenced by how the acquired knowledge will be used for solving downstream tasks. Model-based approaches can learn world models from unsupervised interaction (Ha & Schmidhuber, 2018). However, the diversity of the training data will impact the accuracy of the model (Sekar et al., 2020), and deploying this type of approach in visually complex domains like Atari remains an open problem (Hafner et al., 2019). Unsupervised RL has also been explored through the lens of empowerment (Salge et al., 2014; Mohamed & Rezende, 2015), which studies agents that aim to discover intrinsic options (Gregor et al., 2016; Eysenbach et al., 2019). While these options can be leveraged by hierarchical agents (Florensa et al., 2017) or integrated within the universal successor features framework (Barreto et al., 2017; 2018; Borsa et al., 2019; Hansen et al., 2020), their lack of coverage generally limits their applicability to complex downstream tasks (Campos et al., 2020). We argue that maximizing coverage is a good objective for task-agnostic RL, as agents that succeed at this task will need to develop complex behaviors in order to efficiently explore the environment (Kearns & Singh, 2002). This problem can be formulated as that of finding policies that induce maximally entropic state distributions, which might become extremely inefficient in high-dimensional state spaces without proper priors (Hazan et al., 2019; Lee et al., 2019).
In practice, exploration is often encouraged through intrinsic curiosity signals that incorporate priors in order to quantify how different the current state is from those already visited (Bellemare et al., 2016; Houthooft et al., 2016; Ostrovski et al., 2017; Puigdomènech Badia et al., 2020b). Agents that maximize these novelty-seeking signals have been shown to discover useful behaviors in unsupervised settings (Pathak et al., 2017; Burda et al., 2018a), but little research has been conducted towards leveraging the acquired knowledge once the agent is exposed to extrinsic reward. We show that coverage-seeking objectives are a good proxy for acquiring knowledge in task-agnostic settings, as leveraging the behaviors discovered in an unsupervised pre-training stage provides important gains when solving downstream tasks. Our contributions can be summarized as follows. (1) We study how to transfer knowledge in RL through behavior by re-using pre-trained policies, an approach that is complementary to re-using representations. We argue that pre-trained behavior can be used for both exploitation and exploration, and present techniques to achieve both goals. (2) We propose coverage as a principle for discovering behavior that is suitable for both exploitation and exploration. While coverage is naturally aligned with exploration, we show that this objective also leads to the discovery of behavior that is useful for exploitation. (3) We propose Coverage Pre-training for Transfer (CPT), a method that implements the aforementioned hypotheses, and provide extensive experimental evaluation to support them. Our results show that leveraging the behavior of policies pre-trained to maximize coverage provides important benefits when solving downstream tasks. CPT obtains the largest gains in hard exploration games, where it almost doubles the median human normalized score achieved by our strongest baseline.
Importantly, these benefits are observed even when the pre-trained policies are misaligned with the task being solved, confirming that the benefits do not come from a fortuitous alignment between our pre-training objective and the task reward. Furthermore, we show that CPT is able to leverage a single task-agnostic policy to solve multiple tasks in the same environment.


Figure 2: Intuition behind CPT on a simple maze, where the agent needs to collect treasure chests (positive reward) while avoiding skulls (negative reward). Trajectories that a policy π_p trained to maximize coverage could produce are depicted in orange. Left (unsupervised stage): while π_p ignores some of the rewarding objects, many learning opportunities appear when following it during training. Right (downstream task): combining primitive actions (red) with actions from π_p (orange) side-steps the need to learn behavior that is already available through π_p when solving downstream tasks.

We follow a setup similar to that proposed by Hansen et al. (2020). In an initial pre-training stage, agents are allowed as many interactions with the environment as needed, as long as they are not exposed to task-specific rewards. Rewards are reinstated in a second stage, where the knowledge acquired during unsupervised pre-training should be leveraged in order to enable efficient learning. This is analogous to the evaluation setting for unsupervised learning methods, where pre-training on classification benchmarks with labels removed is evaluated after fine-tuning on small sets of annotated examples. The two-stage setup introduces two main challenges: defining pretext tasks in the absence of reward, and efficiently leveraging knowledge once rewards are reinstated. Our proposed method, Coverage Pre-training for Transfer (CPT), relies on coverage maximization as a pretext task for task-agnostic pre-training in order to produce policies whose behavior can be leveraged for both exploitation and exploration when solving downstream tasks in the same environment. Figure 2 provides intuition about the potential benefits of CPT.

3. LEVERAGING PRE-TRAINED POLICIES

Transfer in supervised domains often exploits the fact that related tasks might be solved using similar representations. This practice mitigates the data inefficiency of training large neural networks with stochastic gradient descent. However, there is an additional source of data inefficiency when training RL agents: unstructured exploration. If the agent fails to discover reward while exploring, it will struggle even when fitting simple function approximators on top of the true state of the MDP. These two strategies are complementary, as they address different sources of inefficiency, which motivates the study of techniques for leveraging pre-trained behavior (i.e. policies). Our approach relies on off-policy learning methods in order to leverage arbitrary pre-trained policies. We make use of the mapping from observations to actions of such policies (i.e. their behavior), and do not transfer knowledge through pre-trained neural network weights. We consider value-based methods with experience replay that estimate action-value functions and derive greedy policies from them. The presented formulation considers a single pre-trained policy, π_p, but it is straightforward to extend it to multiple such policies. No assumptions are made on how the pre-trained policy is obtained, and it is only used for acting. We propose using the behavior of the pre-trained policy for two complementary purposes: exploitation and exploration. Figure 2 provides intuition about the potential benefits of these two approaches in a simple environment, and pseudo-code for the proposed methods is included in Appendix A.

Exploitation. When the behavior of π_p is aligned with the downstream task, it can be used for zero-shot transfer. However, we are concerned with the more realistic scenario where only some of the behaviors of π_p might be aligned with downstream tasks (cf. Figure 2, right).
We propose to leverage π_p for exploitation by letting the agent combine primitive actions with the behavior of π_p. This is achieved by considering an expanded action set A+ = A ∪ {π_p(s)}, so that the agent can fall back to π_p for one step when taking the additional action. Intuitively, this new state-dependent action should enable faster convergence when the pre-trained policy discovered behaviors that are useful for the task, while letting the agent ignore it otherwise. The return of taking action a ∼ π_p(s) is used as a target to fit both Q(s, π_p(s)) and Q(s, a), which implements the observation that they refer to the same executed action and thus lead to the same outcomes.

Exploration. Following the pre-trained policy might bring the agent to states that are unlikely to be visited with unstructured exploration techniques such as ε-greedy. This property has the potential to accelerate learning even when the behavior of the pre-trained policy is not aligned with the downstream task, as it will effectively shorten the path between otherwise distant states (Liu & Brunskill, 2018). As we rely on off-policy methods that can learn from experience collected by arbitrary policies, we propose to perform temporally-extended exploration with π_p, which we will refer to as flights. Inspired by εz-greedy and its connection to Lévy flights (Viswanathan et al., 1996), a class of ecological models for animal foraging, these flights are started randomly and their duration is sampled from a heavy-tailed distribution. Our proposal can be understood as a variant of εz-greedy where pre-trained policies are used as exploration options. An exploratory flight might be started at any step with some probability. The duration of the flight is sampled from a heavy-tailed distribution, and control is handed over to π_p for the complete flight.
When not in a flight, the exploitative policy that maximizes the extrinsic reward is derived from the estimated Q-values using the ε-greedy operator. This ensures that all state-action pairs will be visited given enough time, as exploring only with π_p does not guarantee that property. Note that this is not needed in εz-greedy, which reduces to standard ε-greedy exploration when sampling a flight duration of one step.
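A minimal sketch of the behavioral policy described above, combining ε-greedy exploitation with exploratory flights under the pre-trained policy. As in εz-greedy, we assume a zipf (power-law) distribution over flight durations; all names and constants are illustrative:

```python
import numpy as np

def sample_flight_duration(rng, mu=2.0, max_len=100):
    """Heavy-tailed flight length, zipf-distributed as in epsilon-z-greedy
    (Dabney et al., 2020); capped for safety."""
    return int(min(rng.zipf(mu), max_len))

def act(state, q_values, pretrained_policy, rng, flight_steps_left,
        epsilon=0.01, flight_prob=0.01):
    """One step of the behavioral policy. Returns (action, flight_steps_left).
    During a flight, control is handed to the pre-trained policy; otherwise
    the agent acts epsilon-greedily on its Q-value estimates."""
    if flight_steps_left > 0:                      # inside a flight
        return pretrained_policy(state), flight_steps_left - 1
    if rng.random() < flight_prob:                 # start a new flight
        duration = sample_flight_duration(rng)
        return pretrained_policy(state), duration - 1
    if rng.random() < epsilon:                     # epsilon-greedy fallback
        return int(rng.integers(len(q_values))), 0
    return int(np.argmax(q_values)), 0             # greedy exploitation
```

The ε-greedy branch guarantees that all state-action pairs are eventually visited, as noted above; flights only bias the visitation distribution towards states reachable through π_p.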

4. COVERAGE AS A GOAL FOR UNSUPERVISED PRE-TRAINING

So far we have considered strategies for leveraging the behavior of arbitrary policies; we now discuss how to train such policies in an initial pre-training stage with rewards removed. In such a setting, it is common practice to derive objectives for proxy tasks in order to drive learning. As we proposed to take advantage of pre-trained policies for both exploitation and exploration, it might seem unlikely that a single pre-training objective will produce policies that are useful for both purposes. However, we hypothesize that there exists a single criterion that will produce policies that can be used for both exploration and exploitation: coverage. This objective aims at visiting as many states as possible and is naturally aligned with exploration (Kearns & Singh, 2002). Long episodes where the agent visits as many different states as possible result in high returns in some domains, such as video games, locomotion and navigation (Pathak et al., 2017; Burda et al., 2018a). We argue that pre-training for coverage will bring benefits beyond these particular domains, as it fosters mastery over the environment. This leads to the discovery of skills and behaviors that can be exploited by the agent when solving downstream tasks, even if the pre-trained policy does not obtain high returns. Policies that maximize coverage should visit as many states as possible within a single episode, which differs from traditional exploration strategies employed when solving a single task. The goal of the latter is discovering potentially rewarding states, and the drive for exploration fades as strategies that lead to high returns are discovered. The proposed objective is closely related to methods for task-agnostic exploration that train policies to induce maximally entropic state visitation distributions (Hazan et al., 2019; Lee et al., 2019).
However, since the problems we are interested in involve large state spaces where states are rarely visited more than once, we instead propose to consider only the controllable aspects of the state space. This enables disentangling observations from states and gives rise to a more scalable, and thus more easily covered, notion of the state space. We choose Never Give Up (NGU) (Puigdomènech Badia et al., 2020b) as a means for training policies that maximize coverage. NGU defines an intrinsic reward that combines per-episode and life-long novelty over controllable aspects of the state space. It can be derived directly from observations, unlike other approaches that make use of privileged information (Conti et al., 2018) or require estimating state visitation distributions (Hazan et al., 2019), making it suitable for environments that involve high-dimensional observations and partial observability. The intrinsic NGU reward maintains exploration throughout the entire training process, a property that makes it suitable for driving learning in task-agnostic settings. This contrasts with other intrinsic reward signals, which generally vanish as training progresses (Ecoffet et al., 2019). NGU was originally designed to solve hard-exploration problems by learning a family of policies with different degrees of exploratory behavior. Thanks to weight sharing, the knowledge discovered by exploratory policies enabled positive transfer to exploitative ones, obtaining impressive results when applied to large-scale domains (Puigdomènech Badia et al., 2020a). We instead propose to use NGU as a pre-training strategy in the absence of reward, transferring knowledge to downstream tasks in the form of behavior rather than weight sharing.
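As a rough illustration of the kind of per-episode signal NGU computes, the following is a heavily simplified episodic-novelty sketch: a kernel over distances from the current controllable-state embedding to an episodic memory of past embeddings. The constants and normalisation are illustrative and do not match NGU's exact formulation:

```python
import numpy as np

def episodic_novelty(embedding, memory, k=10, eps=1e-3):
    """Simplified NGU-style per-episode novelty. States whose embedding is
    close to many stored embeddings get a small reward; states far from the
    episodic memory get a large one."""
    if not memory:
        return 1.0                                  # first state of the episode
    dists = np.sort([np.sum((embedding - m) ** 2) for m in memory])[:k]
    dists = dists / (np.mean(dists) + eps)          # normalise by running scale
    kernel = eps / (dists + eps)                    # near neighbours -> ~1
    return 1.0 / np.sqrt(np.sum(kernel) + 1e-8)     # more near neighbours -> less reward
```

In NGU the embeddings come from an inverse-dynamics model so that only controllable aspects of the observation matter, and this episodic term is modulated by a life-long novelty signal; both refinements are omitted here.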

5. CPT: COVERAGE PRE-TRAINING FOR TRANSFER

CPT consists of two stages: (1) pre-training a task-agnostic policy π_p using the intrinsic NGU reward, and (2) solving downstream tasks in the same environment by leveraging the pre-trained behavior.

• We define a Q-function Q(s, a): S × A+ → R over an extended action set A+ = A ∪ {a+}.
• When the agent selects the extra action a+, it executes the action given by the pre-trained policy:
    π(s) = arg max_a Q(s, a)   if arg max_a Q(s, a) ≠ a+
    π(s) = π_p(s)              if arg max_a Q(s, a) = a+
• We parameterise Q using a neural network with random initialization.
• We use Lévy flights as the behavioral policy when interacting with the environment. See Algorithm 3 for details.
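The greedy rule above can be sketched as follows, assuming the extra action a+ is represented by the last index of the Q-value vector (an illustrative convention, not the paper's):

```python
import numpy as np

def select_action(state, q_values, pretrained_policy):
    """Greedy action on the extended set A+ = A ∪ {a+}: if the argmax is the
    extra action a+ (last index), fall back to the pre-trained policy for one
    step; otherwise execute the primitive action."""
    a_plus = len(q_values) - 1                     # index reserved for a+
    greedy = int(np.argmax(q_values))
    if greedy == a_plus:
        return pretrained_policy(state)            # one step of pi_p
    return greedy
```

When learning, the return obtained by following a+ can be used as a target for both Q(s, a+) and the Q-value of the primitive action π_p(s) actually executed, since both labels denote the same outcome.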

6. EXPERIMENTS

We evaluate CPT in the Atari suite (Bellemare et al., 2013), a benchmark that presents a variety of challenges and is often used to measure the competence of agents. All our experiments are run using the distributed R2D2 agent (Kapturowski et al., 2019). A detailed description of the full distributed setting is provided in Appendix J. We use the same hyperparameters as in Agent57 (Puigdomènech Badia et al., 2020a), which are reported in Appendix B. All reported results are the average over three random seeds.

6.1. UNSUPERVISED STAGE

Unsupervised RL methods are often evaluated by measuring the amount of task reward collected by the discovered policies (Burda et al., 2018a; Hansen et al., 2020), and we use this metric to evaluate the quality of our unsupervised policies. We pre-train our agents using 16B frames in order to guarantee the discovery of meaningful exploration policies, as it is common to let agents interact with the environment for as long as needed in this unsupervised stage (Hansen et al., 2020). We compare the results of our unsupervised pre-training stage against other unsupervised approaches, standard RL algorithms in the low-data regime, and methods that perform unsupervised pre-training followed by an adaptation stage. We select some of the top performing methods in the literature, and refer the reader to Appendix C for a more extensive list of baselines. Since the NGU reward is non-negative, we consider a baseline where the agent obtains a constant positive reward at each step, in order to measure the performance of policies that seek to stay alive for as long as possible. Table 1 shows that unsupervised CPT outperforms all baselines by a large margin, confirming the intuition that coverage is a good pre-training objective for the Atari benchmark. These results suggest that there is a strong correlation between exploration and the goals established by game designers (Burda et al., 2018a). In spite of the strong results, it is worth noting that unsupervised CPT achieves lower scores than random policies in some games, and it is quite inefficient at collecting rewards in some environments (e.g. it needs long episodes to obtain high scores). These observations motivate the development of techniques to leverage these pre-trained policies without compromising performance, even when there exists a misalignment between objectives.
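The median human-normalized score used to aggregate results across games is the standard Atari metric: per game, (score − random) / (human − random), then the median across games. A minimal sketch, with the per-game random and human reference scores assumed given:

```python
import numpy as np

def human_normalized_score(agent, random_ref, human_ref):
    """Standard Atari human-normalized score: 0 matches a random policy,
    1 matches the human reference."""
    return (agent - random_ref) / (human_ref - random_ref)

def median_hns(agent_scores, random_scores, human_scores):
    """Median across games of the per-game human-normalized scores."""
    hns = [human_normalized_score(a, r, h)
           for a, r, h in zip(agent_scores, random_scores, human_scores)]
    return float(np.median(hns))
```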
We now evaluate the proposed strategies for leveraging pre-trained policies once the reward function is reinstated, by training R2D2-based agents (Kapturowski et al., 2019) for 5B frames. This is a relatively small budget for these distributed agents with hundreds of actors (Puigdomènech Badia et al., 2020a). We compare the proposed method against ε-greedy and εz-greedy (Dabney et al., 2020) exploration strategies. Policies are evaluated using five parallel evaluator threads, and we report the average return over the last 300 evaluation episodes. Table 2 reports results on the full Atari suite, which confirm the benefits of leveraging the behavior of a policy trained to maximize coverage. Our approach is most beneficial in the set of hard exploration games, where unstructured exploration generally precludes the discovery of high-performing policies. It should be noted that our εz-greedy ablation under-performs relative to Dabney et al. (2020). This is due to our hyper-parameters and setting being derived from Puigdomènech Badia et al. (2020b), which adopts the standard Atari pre-processing (e.g. gray-scale images and frame stacking). In contrast, Dabney et al. (2020) use color images, no frame stacking, a larger neural network and different hyper-parameters (e.g. a smaller replay buffer). Studying whether the performance of both NGU and CPT is preserved in this setting is an important direction for future work. We suspect that improving the performance of our εz-greedy ablation will also improve our method, since exploration flights are central to both.

Ablation studies. We run experiments on a subset of games in order to gain insight into the individual contribution of each of the proposed ways of leveraging the pre-trained policy. The subset is composed of 12 games, obtained by combining those used to tune hyperparameters by Hansen et al. (2020) with games where εz-greedy provides clear gains over ε-greedy as per Dabney et al.
(2020). This results in a set of games that require different amounts of exploration and feature both dense and sparse rewards. Figure 3 shows that both strategies obtain similar median scores across the 12 games, but combining them results in an important performance gain. This suggests that the gains they provide are complementary, and that both are responsible for the strong performance of CPT. Note that CPT also outperforms a fine-tuning baseline, where the policy is initialized using the pre-trained weights rather than random ones. We believe that the benefits of both approaches can be combined by training via CPT a policy initialized with pre-trained weights.

Effect of the pre-trained policy. The behavior of the pre-trained policy will likely have a strong impact on the final performance of agents. We consider the amount of pre-training as a proxy for the exploration capabilities of the task-agnostic policies. Intuitively, policies trained for longer time spans will develop more complex behaviors that enable visiting a larger number of states. Figure 4 reports the end performance of agents before and after transfer under different lengths of the pre-training phase, and shows that the impact depends on the nature of the task. Montezuma's Revenge requires structured exploration for efficient learning, and longer pre-training times provide dramatic improvements in end performance. Note that these improvements do not correlate with the task performance of the task-agnostic policy, which suggests that the gains are due to a more efficient exploration of the state space. On the other hand, the final score in Pong is independent of the amount of pre-training. Simple exploration is enough to discover optimal policies, so the behaviors discovered by the unsupervised policy do not play an important role in this game.

Transfer to multiple tasks. An appealing property of task-agnostic knowledge is that it can be leveraged to solve multiple tasks.
In the RL setting, this can be evaluated by leveraging a single task-agnostic policy for solving multiple tasks (i.e. reward functions) in the same environment. We evaluate whether the unsupervised NGU policies can be useful beyond the standard Atari tasks by creating two alternative versions of Ms Pacman and Hero with different levels of difficulty. The goal in the modified version of Ms Pacman is to eat vulnerable ghosts, with pac-dots giving 0 (easy version) or -10 (hard version) points. In the modified version of Hero, saving miners gives a fixed return of 1000 points and dynamiting walls gives either 0 (easy version) or -300 (hard version) points. The rest of the rewards are removed, e.g. eating fruit in Ms Pacman or the bonus for unused power units in Hero. Note that even in the easy version of the games exploration is harder than in their original counterparts, as there are no small rewards guiding the agent towards its goals. In the hard version of the games exploration is even more challenging, as the intermediate rewards act as a deceptive signal that takes the agent away from its actual goal. In this case, finding rewarding behaviors requires a stronger commitment to an exploration strategy. In this setting, the exploratory policies often achieve very low or even negative rewards, which contrasts with the strong performance they showed when evaluated under the standard game reward. Even in this adversarial scenario, the results in Figure 5 show that leveraging pre-trained exploration policies provides important gains. These results suggest that the strong performance observed under the standard game rewards is not due to an alignment between the NGU reward and the game goals, but due to an efficient usage of pre-trained exploration policies.

Figure 5: Final scores per task in the Atari games of Ms Pacman (top) and Hero (bottom) with modified reward functions.
We train a single task-agnostic policy per environment, and leverage it to solve three different tasks: the standard game reward, a task with sparse rewards (easy), and a variant of the same task with deceptive rewards (hard). Although the pre-trained policy might obtain low or even negative scores in some of the tasks, committing to its exploratory behavior eventually lets the agent discover strategies that lead to high returns.

Towards the low-data regime. So far we have considered R2D2-based agents tuned for end-performance on massively distributed setups. Some applications might require higher efficiency in the low-data regime, even if this comes at the cost of a drop in end performance. The data efficiency of our method can be boosted by reusing representations from the pre-trained convolutional torso of the NGU policy (cf. Figure 6 for details on the architecture), as shown in Figure 1. We observe that data efficiency can be boosted further by decreasing the number of parallel actors. Figure 10 in the appendix showcases the improved data efficiency on Montezuma's Revenge when using 16 actors (instead of 256 as in previous experiments), obtaining superhuman scores in less than 50M frames. We note that this is around two times faster than the best results in the benchmark by Taïga et al. (2019), even though they consider single-threaded Rainbow-based agents (Hessel et al., 2018) that were designed for data efficiency.

7. RELATED WORK

Our work uses the experimental methodology presented in Hansen et al. (2020), but whereas that work only considered a simplified adaptation process that limited the final performance on the downstream task, our focus is on the more general case of using a previously trained policy to aid in solving the full reinforcement learning problem. Specifically, VISR (Hansen et al., 2020) uses successor features to identify which of the pre-trained tasks best matches the true reward structure, which has previously been shown to work well for multi-task transfer (Barreto et al., 2018). Gupta et al. (2018) provide an alternative method to meta-learn a solver for reinforcement learning problems from unsupervised reward functions. This method utilizes gradient-based meta-learning (Finn et al., 2017), which makes the adaptation process consist of standard reinforcement learning updates. This means that even if the downstream reward is far outside of the training distribution, final performance would not necessarily be affected. However, these methods are hard to scale to the larger networks considered here, and follow-up work (Jabri et al., 2019) switched to memory-based meta-learning (Duan et al., 2016), which relies on information about rewards staying in the recurrent state. This makes it unsuitable for the sort of hard exploration problems our method excels at. Recent work has shown success in transferring representations learned in an unsupervised setting to reinforcement learning tasks (Stooke et al., 2020b). Our representation transfer experiments suggest that this should handicap final performance, but the possibility also exists that different unsupervised objectives should be used for representation transfer and policy transfer. Concurrent work by Bagot et al. (2020) also augments an agent with the ability to utilize another policy. However, their work treats the unsupervised policy as an option, only callable for an extended duration.
In contrast, we only perform extended calls to the unsupervised policy during exploratory Lévy flights, and augment the action space to allow for single time-step calls. This difference between exploratory and exploitative calls to the unsupervised policy is critical to overall performance, as illustrated in Figure 3. In addition, in Bagot et al. (2020) the unsupervised policy is learned in tandem based on an intrinsic reward function. This is a promising direction that is complementary to our work, as it handles the case where there is no unsupervised pre-training phase. However, their work only considers tabular domains, so it is unclear how this approach would fare in the high-dimensional state spaces considered here.

8. DISCUSSION

We studied the problem of transferring pre-trained behavior in reinforcement learning, an approach that is complementary to the common practice of transferring representations. Depending on the behavior of the pre-trained policies, we argued that they might be useful for exploitation, exploration, or both. We proposed methods to make use of pre-trained behavior for both purposes: exploiting with the pre-trained policy by making it available to the agent as an extra action, and performing temporally-extended exploration with it. While we make no assumption on the nature of the pre-trained policies, this raises the question of how to discover behaviors that are suitable for transfer. We proposed coverage as a principle for pre-training task-agnostic policies that are suitable for both exploitation and exploration. We chose NGU in our experiments for its scalability, but note that our approach could be combined with any other strategy for maximizing coverage. We found that unsupervised training with this objective produces strongly performing policies in the Atari suite, likely due to the way in which the goals in some of these tasks were designed (Burda et al., 2018a). Our transfer experiments demonstrate that these pre-trained policies can be used to boost the performance of agents trained to maximize reward, providing the most important gains in hard exploration tasks. These benefits are not due to an alignment between our pre-training and downstream tasks, as we also observed positive transfer in games where the pre-trained policy obtained low scores. In order to provide further evidence for this claim, we designed alternative tasks for Atari games involving hard exploration and deceptive rewards. Our transfer strategy outperformed all considered baselines in these settings, even when the pre-trained policy obtained very low or even negative scores, demonstrating the generality of the method.
Besides disambiguating the role of the alignment between pre-training and downstream tasks, these experiments demonstrate the utility of a single task-agnostic policy for solving multiple tasks in the same environment.

A PSEUDO-CODE

Algorithm 1 provides pseudo-code for the flight logic that controls how the pre-trained policy is used for exploration purposes. At each step, a flight is started with probability ε_levy. The duration of the flight is sampled from a heavy-tailed distribution, n_dist, similarly to εz-greedy (cf. Appendix B for more details). When not in a flight, the exploitative policy that maximizes the extrinsic reward is derived from the estimated Q-values using the ε-greedy operator. This ensures that all state-action pairs will be visited given enough time, as exploring only with π_p does not guarantee such a property. Note that this is not needed in εz-greedy, which reduces to standard ε-greedy exploration when the sampled flight duration equals 1. Algorithm 2 provides pseudo-code for the actor logic when using the augmented action set, A+ = A ∪ {π_p(s)}. It derives an ε-greedy policy over |A| + 1 actions, where the (|A| + 1)-th action is resolved by sampling from π_p(s). Finally, Algorithm 3 provides pseudo-code for the actor in the full CPT method that combines Algorithms 1 and 2.
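As a concrete illustration, the flight logic above can be sketched in Python. This is a minimal sketch, not the paper's implementation: the truncated power-law used for the flight-length distribution and the default ε values are placeholder assumptions.

```python
import random

def sample_flight_length(alpha=2.0, max_len=100):
    # Heavy-tailed flight lengths via a truncated power law
    # (a stand-in for the paper's n_dist; alpha and max_len are assumptions).
    weights = [k ** (-alpha) for k in range(1, max_len + 1)]
    return random.choices(range(1, max_len + 1), weights=weights)[0]

def act(q_values, pretrained_action, flight_steps_left, eps_levy=0.01, eps=0.01):
    """One step of the exploration-only CPT actor (sketch of Algorithm 1).

    q_values: Q-value estimates of the exploitative policy for each action.
    pretrained_action: action proposed by the pre-trained policy pi_p.
    flight_steps_left: remaining duration of the current flight (0 if none).
    Returns (action, updated flight_steps_left).
    """
    # When not on a flight, start one with probability eps_levy.
    if flight_steps_left == 0 and random.random() < eps_levy:
        flight_steps_left = sample_flight_length()
    if flight_steps_left > 0:
        # Follow pi_p for the duration of the flight.
        return pretrained_action, flight_steps_left - 1
    # Otherwise act epsilon-greedily w.r.t. the extrinsic Q-values.
    if random.random() < eps:
        return random.randrange(len(q_values)), 0
    return max(range(len(q_values)), key=lambda a: q_values[a]), 0
```

The ε-greedy branch is what guarantees that all state-action pairs are eventually visited, even when π_p never takes some actions.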

E ALTERNATIVE REWARD FUNCTIONS

MsPacman: eating ghosts
• Pac-dots: 0 points (easy) or -10 points (hard)
• Eating vulnerable ghosts:
  - #1 in succession: 200 points
  - #2 in succession: 400 points
  - #3 in succession: 800 points
  - #4 in succession: 1600 points
• Other actions: 0 points

Hero: rescuing miners
• Dynamiting walls: 0 points (easy) or -300 points (hard)
• Rescuing a miner: 1000 points
• Other actions: 0 points

F Q-NETWORK ARCHITECTURE

All policies use the same Q-network architecture as Agent57 (Puigdomènech Badia et al., 2020a), which is composed of a convolutional torso followed by an LSTM (Hochreiter & Schmidhuber, 1997) and a dueling head (Wang et al., 2016). When leveraging the behavior of the pre-trained policy to solve new tasks, one can train a policy from scratch or share some of the components for increased efficiency (cf. Figure 6). Shared weights are kept fixed in order to preserve the behavior of the pre-trained policy.

G SCORES PER GAME

Learner

• Use the Q-network to learn from (r_t, x, a) with Peng's Q(λ) (Peng & Williams, 1994), using the procedure used by R2D2.

Actor

• (once per episode) Sample ε_levy.
• Obtain x_t.
• If not on a flight, start one with probability ε_levy.
• If on a flight, compute a forward pass with π_p to obtain a_t. Otherwise, compute a forward pass of R2D2 to obtain a_t. If a_t = |A| + 1, a_t ← π_p(x_t).
• Insert x_t, a_t and r_t in the replay buffer.
• Step on the environment with a_t.

Evaluator

• Obtain x_t.
• Compute a forward pass of R2D2 to obtain a_t. If a_t = |A| + 1, a_t ← π_p(x_t).
• Step on the environment with a_t.

Distributed training

As in R2D2, we train the agent with a single GPU-based learner and a fixed discount factor γ. All actors collect experience using the same policy, but with a different value of ε. In the replay buffer, we store fixed-length sequences of (x, a, r) tuples. These sequences never cross episode boundaries. Given a single batch of trajectories, we unroll both online and target networks on the same sequence of states to generate value estimates. We use prioritized experience replay, following the same prioritization scheme proposed in Kapturowski et al. (2019): a mixture of max and mean of the TD-errors with priority exponent η = 1.0.
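The sequence priority described above can be sketched as a small function. This is an illustration of the max/mean mixture from Kapturowski et al. (2019), operating on plain lists rather than batched tensors:

```python
def sequence_priority(td_errors, eta=1.0):
    """Replay priority for a sequence as a mixture of the max and mean
    absolute TD-errors (Kapturowski et al., 2019). With eta = 1.0, as
    used here, the mixture reduces to the max absolute TD-error."""
    abs_errors = [abs(e) for e in td_errors]
    return eta * max(abs_errors) + (1.0 - eta) * sum(abs_errors) / len(abs_errors)
```

Setting η = 1.0 makes replay sampling focus entirely on the single largest error within each stored sequence.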



The pre-training budget was not tuned, but we observe that competitive policies arise early in training. This observation suggests that smaller budgets are feasible as well.

montezuma_revenge, pitfall, private_eye, venture, gravitar, solaris

asterix, bank_heist, frostbite, gravitar, jamesbond, montezuma_revenge, ms_pacman, pong, private_eye, space_invaders, tennis, up_n_down



The agent interacts with a Markov Decision Process (MDP) defined by the tuple (S, A, P, r_NGU, γ), with S being the state space, A being the action space, P the state-transition distribution, γ ∈ (0, 1] the discount factor, and r_NGU the intrinsic reward used in NGU (Puigdomènech Badia et al., 2020b).

• We use a value-based agent with a Q-function, Q_NGU(s, a) : S × A → R, parameterised with a neural network as defined in Appendix F.
• We train Q_NGU to maximise the NGU intrinsic reward, obtaining a deterministic policy given by π_p(s) = arg max_a Q_NGU(s, a).
• We use ε-greedy as the behavioral policy when interacting with the environment.

Transfer

• We are now given a new MDP (S, A, P, r, γ), where the only change with respect to the pre-training stage is a new extrinsic reward function r : S × A → R.
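The extraction of π_p from the pre-trained value function can be sketched as follows. This is a minimal illustration, where `q_ngu` stands in for the trained Q_NGU network as any callable mapping a state to a sequence of action values:

```python
import random

def greedy_policy(q_ngu, state):
    """Deterministic pre-trained policy pi_p(s) = argmax_a Q_NGU(s, a)."""
    values = q_ngu(state)
    return max(range(len(values)), key=lambda a: values[a])

def behavior_policy(q_ngu, state, eps=0.01):
    """Epsilon-greedy behavioral policy used when interacting with the
    environment during pre-training (eps value is a placeholder)."""
    values = q_ngu(state)
    if random.random() < eps:
        return random.randrange(len(values))
    return greedy_policy(q_ngu, state)
```

Only `greedy_policy` is carried over to the transfer stage; the ε-greedy wrapper is just the behavioral policy for data collection.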

Figure 3: Ablation results. Using the task-agnostic policy for exploitation and exploration seems to provide complementary benefits, as combining the two techniques results in important gains.

Figure 4: Effect of the pre-training budget, before and after adaptation, on Montezuma's Revenge (hard exploration) and Pong (dense reward).

Figure 6: Q-Network architecture for the reinforcement learning stage. The pre-trained policy can be leveraged without transferring representations (left), but sharing weights generally provides efficiency gains early in training (right).

Figure 8: Training curves for ablation experiments after 5B frames. Shading shows maximum and minimum over 3 runs, while dark lines indicate the mean. Both methods offer benefits over the baselines, but in different sets of games. Combining them retains the best of both methods, and boosts performance even further in some games.

Figure 9: Alternative reward functions for MsPacman (top) and Hero (bottom). We report training curves for the standard game reward (left), a variant with sparse rewards (center), and a task with deceptive rewards (right). Even though the pre-trained policy might obtain low or even negative scores in some of the tasks, committing to its exploratory behavior eventually lets the agent discover strategies that lead to high returns.

Figure 10: Training curves after 50M frames on Montezuma's Revenge, using 16 actors and the CNN encoder from the pre-trained policy. Pre-trained weights are not fine-tuned.



Atari Suite comparisons for R2D2-based agents. @N represents the amount of RL interaction with reward utilized, with four frames observed at each iteration. Mdn, M and CM are median, mean and mean capped human normalized scores, respectively.

Algorithm 1: Actor pseudo-code for CPT (exploration only)
Input: Q-value estimate for the current policy, Q^π(s, a)
Input: Pre-trained policy, π_p
Input: Probability of starting a flight, ε_levy
Input: Flight length distribution, n_dist

Algorithm 2: Actor pseudo-code for CPT (exploitation only)
Input: Pre-trained policy, π_p
Input: Q-value estimate for the current policy, Q^π(s, a) ∀a ∈ A ∪ {π_p(s)}

Results per game at 5B training frames.

Final scores per game in our ablation study after 5B frames. We consider versions of CPT where the pre-trained policy is used for exploitation, exploration, or both.


Algorithm 3: Actor pseudo-code for CPT
Input: Action set, A
Input: Pre-trained policy, π_p
Input: Q-value estimate for the current policy, Q^π(s, a) ∀a ∈ A ∪ {π_p(s)}
Input: Probability of taking an exploratory action, ε
Input: Probability of starting a flight, ε_levy
Input: Flight length distribution, n_dist
while True do
  n ← 0 // flight length
  while episode not ended do
    Observe state s
    if n == 0 and random() < ε_levy then
      n ∼ n_dist // start a flight
    …
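The combined logic of Algorithm 3 (its remainder is truncated in this copy) can be sketched in Python. This is an illustrative sketch based on the prose description in Appendix A, not the paper's code; the augmented action is assumed to be stored last in `q_values`:

```python
import random

def cpt_step(q_values, pretrained_action, n, eps, eps_levy, sample_flight_length):
    """One step of the full CPT actor (exploration + exploitation sketch).

    q_values: Q-estimates over the augmented action set A + {pi_p(s)},
              with the extra action stored at the last index.
    pretrained_action: primitive action proposed by pi_p in the current state.
    n: remaining flight length (0 when not on a flight).
    Returns (primitive_action, updated_n).
    """
    if n == 0 and random.random() < eps_levy:
        n = sample_flight_length()          # start a levy flight
    if n > 0:
        return pretrained_action, n - 1     # follow pi_p during the flight
    # Epsilon-greedy over the augmented action set.
    if random.random() < eps:
        a = random.randrange(len(q_values))
    else:
        a = max(range(len(q_values)), key=lambda i: q_values[i])
    if a == len(q_values) - 1:              # extra action resolves to pi_p(s)
        a = pretrained_action
    return a, 0
```

Note that the extra action gives the agent a single-step, exploitative call to π_p, while flights commit to π_p for an extended duration.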

B HYPERPARAMETERS

Table 3 summarizes the main hyperparameters of our method. The pre-trained policies were optimized using Retrace (Munos et al., 2016). Transfer was performed with Peng's Q(λ) (Peng & Williams, 1994) instead, which we found to be much more data efficient in our experiments. The reason for this difference is that the benefits of Q(λ) were only observed once unsupervised policies had already been trained on all Atari games; we suspect that the data efficiency gains would transfer to the pre-training stage as well.

C EXTENDED UNSUPERVISED RL RESULTS

Table 4 compares unsupervised CPT with all the methods reported by Hansen et al. (2020).

Table 4: Atari Suite comparisons, adapted from Hansen et al. (2020). @N represents the amount of RL interaction with reward utilized, with four frames observed at each iteration. Mdn and M are median and mean human normalized scores, respectively; > 0 is the number of games with better-than-random performance; and > H is the number of games with human-level performance as defined in Mnih et al. (2015).

J DISTRIBUTED SETTING

All experiments are run using a distributed setting. The evaluation is identical to that of R2D2 (Kapturowski et al., 2019): parallel evaluation workers, which share weights with actors and learners, run the Q-network against the environment. Evaluation workers and actor workers are the only two types of workers that draw samples from the environment. For Atari, we apply the standard DQN pre-processing, as used in R2D2. The next subsections describe how actors, evaluators, and the learner are run in each stage.

J.1 UNSUPERVISED STAGE

The computation of the intrinsic NGU reward, r NGU t , follows the method described in Puigdomènech Badia et al. (2020b, Appendix A.1) . In particular, we use the version that combines episodic intrinsic rewards with intrinsic reward from Random Network Distillation (RND) (Burda et al., 2018b) .
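The combination of episodic and RND rewards can be sketched as follows. The multiplicative clipping form and the cap L = 5 are taken from the NGU paper, not confirmed from this work, so treat them as assumptions:

```python
def ngu_intrinsic_reward(episodic_reward, rnd_modulator, max_modulator=5.0):
    """Sketch of the NGU reward combination (Puigdomenech Badia et al., 2020b):
    the episodic novelty reward is scaled by the life-long (RND) novelty
    modulator, clipped to the range [1, L]. L = 5 is an assumption taken
    from the NGU paper."""
    return episodic_reward * min(max(rnd_modulator, 1.0), max_modulator)
```

Clipping from below at 1 means the life-long modulator can only amplify, never suppress, the episodic novelty signal.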

Learner

• Sample from the replay buffer a sequence of intrinsic rewards r_NGU_t, observations x and actions a.
• Use the Q-network to learn from (r_NGU_t, x, a) with Retrace (Munos et al., 2016), using the procedure used by R2D2.
• Use the last 5 frames of the sampled sequences to train the action prediction network in NGU. This means that, for every batch of sequences, all time steps are used to train the RL loss, whereas only 5 time steps per sequence are used to optimize the action prediction loss.
• Use the last 5 frames of the sampled sequences to train the predictor of RND.
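The selection of the last 5 time steps can be expressed as a loss mask. This is an illustrative sketch on plain lists; the actual implementation would operate on batched tensors:

```python
def last_k_mask(seq_len, k=5):
    """Mask selecting the last k time steps of a replay sequence: the RL
    loss uses all time steps, while the action-prediction and RND losses
    only use the steps where this mask is 1."""
    return [1.0 if t >= seq_len - k else 0.0 for t in range(seq_len)]
```

Multiplying the per-step auxiliary losses by this mask restricts them to the 5 most recent frames of each sampled sequence.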

Evaluator and Actor

• Obtain x_t and r_NGU_{t-1}.
• With these inputs, compute a forward pass of R2D2 to obtain a_t.
• With x_t, compute r_NGU_t using the embedding network in NGU.
• (actor) Insert x_t, a_t and r_NGU_t in the replay buffer.
• Step on the environment with a_t.

Distributed training

As in R2D2, we train the agent with a single GPU-based learner and a fixed discount factor γ. All actors collect experience using the same policy, but with a different value of ε. This differs from the original NGU agent, where each actor runs a policy with a different degree of exploratory behavior and discount factor. In the replay buffer, we store fixed-length sequences of (x, a, r) tuples. These sequences never cross episode boundaries. Given a single batch of trajectories, we unroll both online and target networks on the same sequence of states to generate value estimates. We use prioritized experience replay, following the same prioritization scheme proposed in Kapturowski et al. (2019): a mixture of max and mean of the TD-errors with priority exponent η = 1.0.
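One common way to assign a different ε to each actor is the schedule introduced with Ape-X and reused by R2D2. The base value 0.4 and exponent α = 7 below are the defaults from those works, used here as assumptions rather than values confirmed by this paper:

```python
def actor_epsilons(num_actors, base_eps=0.4, alpha=7.0):
    """Per-actor exploration rates following the Ape-X/R2D2 convention:
    eps_i = base_eps ** (1 + alpha * i / (N - 1)), so actor 0 explores the
    most and the last actor is nearly greedy. base_eps and alpha are the
    commonly used defaults, assumed here."""
    if num_actors == 1:
        return [base_eps]
    return [base_eps ** (1 + alpha * i / (num_actors - 1))
            for i in range(num_actors)]
```

This spreads exploration across the actor pool while keeping a single learned policy.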

Learner

• Sample from the replay buffer a sequence of extrinsic rewards r_t, observations x and actions a.
• (expanded action set) Duplicate transitions collected with π_p and relabel the duplicates with the primitive action taken by π_p when acting.
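The relabeling step for the expanded action set can be sketched as follows. The transition fields used here are illustrative, not the paper's data layout:

```python
def expand_transitions(transitions):
    """Sketch of the relabeling step: every transition collected while
    executing the extra action (pi_p) is duplicated, and the duplicate is
    labeled with the primitive action that pi_p actually took. Each
    transition is a dict with (hypothetical) keys:
      'action'           - action index stored by the actor
      'primitive_action' - action executed in the environment
      'extra_action'     - index of the pi_p action in the augmented set
    """
    out = []
    for t in transitions:
        out.append(t)
        if t['action'] == t['extra_action']:
            duplicate = dict(t)
            duplicate['action'] = t['primitive_action']
            out.append(duplicate)
    return out
```

This lets the learner credit both the abstract "call π_p" action and the primitive action it resolved to from the same piece of experience.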

