META-REINFORCEMENT LEARNING WITH INFORMED POLICY REGULARIZATION

Abstract

Meta-reinforcement learning aims at finding a policy able to generalize to new environments. When facing a new environment, this policy must explore to identify its particular characteristics and then exploit this information to collect reward. We consider the online adaptation setting, where the agent must trade off between these two types of behaviour within the same episode. Even though policies based on recurrent neural networks can be used in this setting by training them on multiple environments, they often fail to model this trade-off, or solve it at a very high computational cost. In this paper, we propose a new algorithm that uses privileged information, in the form of a task descriptor available at train time, to improve the learning of recurrent policies. Our method learns an informed policy (i.e., a policy receiving the description of the current task as input) that is used both to construct task embeddings from the descriptors and to regularize the training of the recurrent policy through parameter sharing and an auxiliary objective. By focusing on the relevant characteristics of the task and exploiting them efficiently, this approach significantly reduces the sample complexity of learning without altering the representational power of RNNs. We evaluate our algorithm on a variety of environments that require sophisticated exploration/exploitation strategies and show that it outperforms vanilla RNNs, Thompson sampling, and task-inference approaches to meta-reinforcement learning.
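To make the described setup concrete, below is a minimal PyTorch sketch of this regularization scheme, under assumptions not stated in the abstract: the informed policy embeds the task descriptor mu with a small MLP, the recurrent policy summarizes the history with a GRU of the same width, the two policies share an action head, and an auxiliary L2 term pulls the recurrent state towards the (detached) task embedding. All module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class InformedPolicy(nn.Module):
    """Receives the task descriptor mu; only usable at train time."""
    def __init__(self, obs_dim, mu_dim, embed_dim, n_actions):
        super().__init__()
        self.embedder = nn.Sequential(
            nn.Linear(mu_dim, 64), nn.Tanh(), nn.Linear(64, embed_dim))
        # Action head shared with the recurrent policy below.
        self.head = nn.Linear(obs_dim + embed_dim, n_actions)

    def forward(self, obs, mu):
        z = self.embedder(mu)               # task embedding from the descriptor
        return self.head(torch.cat([obs, z], dim=-1)), z

class RecurrentPolicy(nn.Module):
    """Must infer the task from the observation history alone."""
    def __init__(self, obs_dim, embed_dim, shared_head):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, embed_dim, batch_first=True)
        self.head = shared_head             # parameter sharing

    def forward(self, obs_seq):             # obs_seq: (batch, time, obs_dim)
        h_seq, _ = self.rnn(obs_seq)        # recurrent belief about the task
        return self.head(torch.cat([obs_seq, h_seq], dim=-1)), h_seq

def auxiliary_loss(h_seq, z):
    # Push the recurrent state towards the informed embedding; the
    # stop-gradient (detach) keeps the embedding a fixed target.
    return ((h_seq - z.detach().unsqueeze(1)) ** 2).mean()
```

At train time this auxiliary term would be added, with some coefficient, to the usual RL loss of the recurrent policy; the abstract does not specify the underlying RL algorithm, so that part is omitted here.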

1. INTRODUCTION

Deep Reinforcement Learning has been used to successfully train agents on a range of challenging environments such as Atari games (Mnih et al., 2013; Bellemare et al., 2013; Hessel et al., 2017) or continuous control (Peng et al., 2017; Schulman et al., 2017). Nonetheless, in these problems, RL agents explore to discover the environment and learn a policy tailored to a single task. Whenever the task changes, RL agents generalize poorly and the whole process of exploration and learning restarts from scratch. In contrast, we expect an intelligent agent to fully master a problem when it is able to generalize from a few instances (tasks) and achieve the objective of the problem under many variations of the environment. For instance, children know how to ride a bike (i.e., the problem) when they can reach their destination irrespective of the specific bike they are riding, which requires adapting to the weight of the bike, the friction of the brakes and tires, and the road conditions (i.e., the tasks).

How to enable agents to generalize across tasks has been studied in Multi-task Reinforcement Learning (e.g., Wilson et al., 2007; Teh et al., 2017), Transfer Learning (e.g., Taylor & Stone, 2011; Lazaric, 2012) and Meta-Reinforcement Learning (Finn et al., 2017; Hausman et al., 2018; Rakelly et al., 2019; Humplik et al., 2019). These works fall into two categories. Learning-to-learn approaches aim at speeding up learning on new tasks, by pre-training feature extractors or learning good initializations of policy weights (Raghu et al., 2019). In contrast, in this paper we study the online adaptation setting, where a single policy is trained for a fixed family of tasks. When facing a new task, the policy must balance exploration (or probing), to reduce the uncertainty about the current task, and exploitation, to maximize the cumulative reward of the task. Agents are evaluated on their ability to manage this trade-off within a single episode of the same task.

The online adaptation setting is a special case of a partially observable Markov decision process (POMDP), where the unobserved variables are the descriptors of the current task. It is thus possible to address it with recurrent policies that aggregate the history of observations.
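As a concrete, purely illustrative instance of this POMDP view, the sketch below hides a goal position (the task descriptor) from the agent's observations: within a single episode, a recurrent policy must probe, by moving and reading rewards, to infer the goal before exploiting that knowledge. The environment, its dynamics, and all names are hypothetical assumptions, not taken from the paper.

```python
import numpy as np

class HiddenGoalEnv:
    """2D point navigation where the goal (the task descriptor) is unobserved."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.mu = self.rng.uniform(-1.0, 1.0, size=2)  # hidden task descriptor
        self.pos = np.zeros(2)
        return self.pos.copy()                         # mu is NOT observed

    def step(self, action):
        self.pos += 0.1 * np.clip(action, -1.0, 1.0)
        # Only the reward carries information about mu: the agent must
        # probe (move around) to localize the goal, then head towards it.
        reward = -np.linalg.norm(self.pos - self.mu)
        return self.pos.copy(), reward, False, {}
```

A memoryless policy cannot do better than a fixed strategy here, whereas a recurrent policy can accumulate the reward signal in its hidden state to identify mu, which is exactly the belief that the informed policy's task embedding is meant to supervise.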

