COVERAGE AS A PRINCIPLE FOR DISCOVERING TRANSFERABLE BEHAVIOR IN REINFORCEMENT LEARNING

Abstract

Designing agents that acquire knowledge autonomously and use it to solve new tasks efficiently is an important challenge in reinforcement learning. Unsupervised learning provides a useful paradigm for autonomous acquisition of task-agnostic knowledge. In supervised settings, representations discovered through unsupervised pre-training offer important benefits when transferred to downstream tasks. Given the nature of the reinforcement learning problem, we explore how to transfer knowledge through behavior instead of representations. The behavior of pre-trained policies may be used for solving the task at hand (exploitation), as well as for collecting useful data to solve the problem (exploration). We argue that pre-training policies to maximize coverage will result in behavior that is useful for both strategies. When using these policies for both exploitation and exploration, our agents discover solutions that lead to larger returns. The largest gains are generally observed in domains requiring structured exploration, including settings where the behavior of the pre-trained policies is misaligned with the downstream task.

1. INTRODUCTION

Unsupervised representation learning techniques have led to unprecedented results in domains like computer vision (Hénaff et al., 2019; He et al., 2019) and natural language processing (Devlin et al., 2019; Radford et al., 2019). These methods are commonly composed of two stages: an initial unsupervised phase, followed by supervised fine-tuning on downstream tasks. The self-supervised nature of the learning objective makes it possible to leverage large collections of unlabelled data in the first stage. This produces models that extract task-agnostic features that are well suited for transfer to downstream tasks. In reinforcement learning (RL), auxiliary representation learning objectives provide denser signals that result in data efficiency gains (Jaderberg et al., 2017) and even bridge the gap between learning from true state and pixel observations (Laskin et al., 2020). However, RL applications have not yet seen the advent of the two-stage setting where task-agnostic pre-training is followed by efficient transfer to downstream tasks. We argue that there are two reasons explaining this lag with respect to their supervised counterparts. First, these methods traditionally focus on transferring representations (Lesort et al., 2018). While this is enough in supervised scenarios, we argue that leveraging pre-trained behavior is far more important in RL domains requiring structured exploration. Second, what type of self-supervised objectives enable the acquisition of transferable, task-agnostic knowledge is still an open question. Defining these objectives in the RL setting is complex, as they should account for the fact that the distribution of the input data will be defined by the behavior of the agent. Transfer in deep learning is often performed through parameter initialization followed by fine-tuning.
The most widespread procedure consists of initializing all weights in the neural network using those from the pre-trained model, and then adding an output layer with random parameters (Girshick et al., 2014; Devlin et al., 2019). Depending on the amount of available data, pre-trained parameters can either be fine-tuned or kept fixed. This builds on the intuition that the pre-trained model will map inputs to a feature space where the downstream task is easy to perform. In the RL setting, this procedure completely dismisses the pre-trained policy and falls back to a random one when collecting experience. Given that complex RL problems require structured and temporally-extended behaviors, we argue that representation alone is not enough for efficient transfer in challenging domains. Pre-trained representations do indeed provide data efficiency gains in domains with dense reward signals (Finn et al., 2017; Yarats et al., 2019; Stooke et al., 2020a), but our experiments show that the standard fine-tuning procedure falls short in hard exploration problems (c.f. Figure 1). We observe this limitation even when fine-tuning the pre-trained policy, which is aligned with findings from previous works (Finn et al., 2017). Learning on the downstream task can lead to catastrophically forgetting the pre-trained policy, something that depends on many difficult-to-measure factors such as the similarity between the tasks.
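As a concrete illustration, the standard transfer recipe described above can be sketched as follows. This is a minimal, framework-agnostic sketch in plain Python; the layer names and shapes are hypothetical, and real implementations would operate on the parameter tensors of a deep learning framework:

```python
import random

def transfer_parameters(pretrained, head_name="output", head_shape=(4, 16), seed=0):
    """Standard fine-tuning initialization: copy every pre-trained layer,
    then replace the task-specific output head with random parameters."""
    rng = random.Random(seed)
    # Deep-copy all pre-trained layers except the old output head.
    downstream = {name: [row[:] for row in weights]
                  for name, weights in pretrained.items() if name != head_name}
    # Add a fresh, randomly initialized head for the downstream task.
    rows, cols = head_shape
    downstream[head_name] = [[rng.gauss(0.0, 0.01) for _ in range(cols)]
                             for _ in range(rows)]
    return downstream

# Toy pre-trained "network": two feature layers plus an old policy head.
pretrained = {
    "conv1": [[0.5] * 16],
    "fc1":   [[0.1] * 16],
    "output": [[0.9] * 16],  # discarded at transfer time
}
downstream = transfer_parameters(pretrained)
```

Note that, in the RL setting, the randomly re-initialized head is precisely what erases the pre-trained behavior: the features transfer, but the policy used to collect experience starts out effectively random.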
We address the problem of leveraging arbitrary pre-trained policies when solving downstream tasks, a requirement towards enabling efficient transfer in RL. Defining unsupervised RL objectives remains an open problem, and existing solutions are often influenced by how the acquired knowledge will be used for solving downstream tasks. Model-based approaches can learn world models from unsupervised interaction (Ha & Schmidhuber, 2018). However, the diversity of the training data will impact the accuracy of the model (Sekar et al., 2020), and deploying this type of approach in visually complex domains like Atari remains an open problem (Hafner et al., 2019). Unsupervised RL has also been explored through the lens of empowerment (Salge et al., 2014; Mohamed & Rezende, 2015), which studies agents that aim to discover intrinsic options (Gregor et al., 2016; Eysenbach et al., 2019). While these options can be leveraged by hierarchical agents (Florensa et al., 2017) or integrated within the universal successor features framework (Barreto et al., 2017; 2018; Borsa et al., 2019; Hansen et al., 2020), their lack of coverage generally limits their applicability to complex downstream tasks (Campos et al., 2020). We argue that maximizing coverage is a good objective for task-agnostic RL, as agents that succeed at this task will need to develop complex behaviors in order to efficiently explore the environment (Kearns & Singh, 2002). This problem can be formulated as that of finding policies that induce maximally entropic state distributions, which might become extremely inefficient in high-dimensional state spaces without proper priors (Hazan et al., 2019; Lee et al., 2019). In practice, exploration is often encouraged through intrinsic curiosity signals that incorporate priors in order to quantify how different the current state is from those already visited (Bellemare et al., 2016; Houthooft et al., 2016; Ostrovski et al., 2017; Puigdomènech Badia et al., 2020b).
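A simple instance of such a novelty signal, in the spirit of count-based exploration (Bellemare et al., 2016), rewards the agent in inverse proportion to how often a state has been visited. The sketch below assumes a small, discrete state space where exact counts are feasible; pixel-based domains like Atari instead require density models or pseudo-counts, and the state names here are purely illustrative:

```python
import math
from collections import Counter

class CountBasedBonus:
    """Intrinsic reward r_int(s) = 1 / sqrt(N(s)), where N(s) counts visits.

    Novel states yield large bonuses; frequently visited states yield
    diminishing ones, pushing the policy towards broad state coverage.
    """
    def __init__(self):
        self.counts = Counter()

    def __call__(self, state):
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])

bonus = CountBasedBonus()
r_first = bonus("room_1")  # first visit: maximal bonus
r_again = bonus("room_1")  # repeat visit: smaller bonus
r_novel = bonus("room_2")  # unseen state: maximal bonus again
```

An agent trained purely on this signal, with no extrinsic reward, is incentivized to keep reaching states it has rarely seen, which is the coverage-seeking behavior this paper proposes to pre-train and then transfer.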
Agents that maximize these novelty-seeking signals have been shown to discover useful behaviors in unsupervised settings (Pathak et al., 2017; Burda et al., 2018a), but little research has been conducted towards leveraging the acquired knowledge once the agent is exposed to extrinsic reward. We show that coverage-seeking objectives are a good proxy for acquiring knowledge in task-agnostic settings, as leveraging the behaviors discovered in an unsupervised pre-training stage provides important gains when solving downstream tasks.
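The idea of using a pre-trained policy for both exploitation and exploration can be illustrated with a simple behavioral mixture: instead of falling back to uniformly random actions when exploring, the agent occasionally hands control to the pre-trained, coverage-seeking policy. This is only an illustrative sketch; the mixing scheme and function names are hypothetical and do not correspond to the specific method evaluated in this paper:

```python
import random

def act(state, task_policy, pretrained_policy, explore_prob=0.1, rng=random):
    """Select an action for the current state.

    With probability 1 - explore_prob, exploit the policy being learned on
    the downstream task; otherwise explore by delegating to the pre-trained
    coverage policy, rather than taking a uniformly random action.
    """
    if rng.random() < explore_prob:
        return pretrained_policy(state)  # structured exploration
    return task_policy(state)            # exploitation

# Toy policies over a discrete action set.
task_policy = lambda s: "LEFT"        # stand-in for the fine-tuned policy
pretrained_policy = lambda s: "JUMP"  # stand-in for the coverage policy
```

Replacing uniform action noise with temporally-extended, coverage-seeking behavior is what makes exploration "structured": a random policy is unlikely to ever cross a room in Montezuma's Revenge, whereas a policy pre-trained to maximize coverage already knows how.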



Figure 1: Comparison of transfer strategies on Montezuma's Revenge (hard exploration) and Space Invaders (dense reward) from a task-agnostic policy pre-trained with NGU (Puigdomènech Badia et al., 2020b). Transferring representations provides a significant boost on dense reward games, but it does not seem to help in hard exploration ones. Leveraging the behavior of the pre-trained policy provides important gains in hard exploration problems when compared to standard fine-tuning and is complementary to transferring representations. We refer the reader to Appendix F for details on the network architecture.

