COVERAGE AS A PRINCIPLE FOR DISCOVERING TRANSFERABLE BEHAVIOR IN REINFORCEMENT LEARNING

Abstract

Designing agents that acquire knowledge autonomously and use it to solve new tasks efficiently is an important challenge in reinforcement learning. Unsupervised learning provides a useful paradigm for autonomous acquisition of task-agnostic knowledge. In supervised settings, representations discovered through unsupervised pre-training offer important benefits when transferred to downstream tasks. Given the nature of the reinforcement learning problem, we explore how to transfer knowledge through behavior instead of representations. The behavior of pre-trained policies may be used for solving the task at hand (exploitation), as well as for collecting useful data to solve the problem (exploration). We argue that pre-training policies to maximize coverage will result in behavior that is useful for both strategies. When using these policies for both exploitation and exploration, our agents discover solutions that lead to larger returns. The largest gains are generally observed in domains requiring structured exploration, including settings where the behavior of the pre-trained policies is misaligned with the downstream task.

1. INTRODUCTION

Unsupervised representation learning techniques have led to unprecedented results in domains like computer vision (Hénaff et al., 2019; He et al., 2019) and natural language processing (Devlin et al., 2019; Radford et al., 2019). These methods are commonly composed of two stages: an initial unsupervised phase, followed by supervised fine-tuning on downstream tasks. The self-supervised nature of the learning objective makes it possible to leverage large collections of unlabelled data in the first stage. This produces models that extract task-agnostic features well suited for transfer to downstream tasks. In reinforcement learning (RL), auxiliary representation learning objectives provide denser signals that result in data efficiency gains (Jaderberg et al., 2017) and even bridge the gap between learning from true state and from pixel observations (Laskin et al., 2020). However, RL applications have not yet seen the advent of the two-stage setting where task-agnostic pre-training is followed by efficient transfer to downstream tasks. We argue that there are two reasons explaining this lag with respect to their supervised counterparts. First, these methods traditionally focus on transferring representations (Lesort et al., 2018). While this is enough in supervised scenarios, we argue that leveraging pre-trained behavior is far more important in RL domains requiring structured exploration. Second, it is still an open question which self-supervised objectives enable the acquisition of transferable, task-agnostic knowledge. Defining these objectives in the RL setting is complex, as they should account for the fact that the distribution of the input data will be determined by the behavior of the agent itself.

Transfer in deep learning is often performed through parameter initialization followed by fine-tuning.
The most widespread procedure consists of initializing all weights in the neural network with those from the pre-trained model, and then adding an output layer with random parameters (Girshick et al., 2014; Devlin et al., 2019). Depending on the amount of available data, the pre-trained parameters can either be fine-tuned or kept fixed. This builds on the intuition that the pre-trained model will map inputs to a feature space where the downstream task is easy to perform. In the RL setting, however, this procedure completely dismisses the pre-trained policy and falls back to a random one when collecting experience. Given that complex RL problems require structured and temporally-extended behaviors, we argue that representations alone are not enough for efficient transfer in challenging
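The standard initialize-and-fine-tune recipe described above can be sketched as follows. This is a minimal illustration, not an implementation from the paper; the function and parameter names are hypothetical, and plain Python containers stand in for actual network weights:

```python
import random

def transfer_init(pretrained, feat_dim, n_actions, freeze_backbone=True):
    """Hypothetical sketch of the common transfer recipe: copy the
    pre-trained weights, attach a randomly initialized output head,
    and mark which parameters remain trainable."""
    params = dict(pretrained)  # reuse the pre-trained backbone weights as-is
    # Fresh output layer with small random parameters (always trainable).
    params["head"] = [[random.gauss(0.0, 0.01) for _ in range(feat_dim)]
                      for _ in range(n_actions)]
    # Depending on the data regime, the backbone is fine-tuned or kept fixed.
    trainable = {name: (not freeze_backbone) or name == "head"
                 for name in params}
    return params, trainable

# Usage: only the new head is trainable when the backbone is frozen.
backbone = {"layer1": [0.3, -1.2], "layer2": [0.7]}
params, trainable = transfer_init(backbone, feat_dim=2, n_actions=3)
```

Note that under this recipe the output head that actually selects actions starts from random parameters, which is precisely why the resulting policy behaves randomly during early data collection in RL.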

