META-REINFORCEMENT LEARNING WITH INFORMED POLICY REGULARIZATION

Abstract

Meta-reinforcement learning aims at finding a policy able to generalize to new environments. When facing a new environment, this policy must explore to identify its particular characteristics and then exploit this information for collecting reward. We consider the online adaptation setting where the agent needs to trade off between the two types of behaviour within the same episode. Even though policies based on recurrent neural networks can be used in this setting by training them on multiple environments, they often fail to model this trade-off, or solve it at a very high computational cost. In this paper, we propose a new algorithm that uses privileged information in the form of a task descriptor at train time to improve the learning of recurrent policies. Our method learns an informed policy (i.e., a policy receiving the description of the current task as input) that is used both to construct task embeddings from the descriptors, and to regularize the training of the recurrent policy through parameter sharing and an auxiliary objective. This approach significantly reduces the learning sample complexity without altering the representational power of RNNs, by focusing on the relevant characteristics of the task and by exploiting them efficiently. We evaluate our algorithm in a variety of environments that require sophisticated exploration/exploitation strategies and show that it outperforms vanilla RNNs, Thompson sampling and task-inference approaches to meta-reinforcement learning.

1. INTRODUCTION

Deep Reinforcement Learning has been used to successfully train agents on a range of challenging environments such as Atari games (Mnih et al., 2013; Bellemare et al., 2013; Hessel et al., 2017) or continuous control (Peng et al., 2017; Schulman et al., 2017). Nonetheless, in these problems, RL agents perform exploration strategies to discover the environment and implement algorithms to learn a policy that is tailored to solving a single task. Whenever the task changes, RL agents generalize poorly and the whole process of exploration and learning restarts from scratch. On the other hand, we expect an intelligent agent to fully master a problem when it is able to generalize from a few instances (tasks) and achieve the objective of the problem under many variations of the environment. For instance, children know how to ride a bike (i.e., the problem) when they can reach their destination irrespective of the specific bike they are riding, which requires adapting to the weight of the bike, the friction of the brakes and tires, and the road conditions (i.e., the tasks). How to enable agents to generalize across tasks has been studied in Multi-task Reinforcement Learning (e.g., Wilson et al., 2007; Teh et al., 2017), Transfer Learning (e.g., Taylor & Stone, 2011; Lazaric, 2012) and Meta-Reinforcement Learning (Finn et al., 2017; Hausman et al., 2018; Rakelly et al., 2019; Humplik et al., 2019). These works fall into two categories. Learning-to-learn approaches aim at speeding up learning on new tasks, by pre-training feature extractors or learning good initializations of policy weights (Raghu et al., 2019). In contrast, we study in this paper the online adaptation setting where a single policy is trained for a fixed family of tasks. When facing a new task, the policy must then balance exploration (or probing), to reduce the uncertainty about the current task, and exploitation, to maximize the cumulative reward of the task.
Agents are evaluated on their ability to manage this trade-off within a single episode of the same task. The online adaptation setting is a special case of a partially observable Markov decision process, where the unobserved variables are the descriptors of the current task. It is thus possible to rely on recurrent neural networks (RNNs) (Bakker, 2001; Heess et al., 2015), since they can theoretically represent optimal policies in POMDPs if given enough capacity. Unfortunately, training RNN policies often has prohibitive sample complexity and may converge to suboptimal local minima. To overcome this drawback, efficient online adaptation methods leverage knowledge of the task at training time. The main approach is to pair an exploration strategy with the training of informed policies, i.e., policies taking the description of the current task as input. Probe-then-Exploit (PTE) algorithms (e.g., Zhou et al., 2019) operate in two stages: they first rely on an exploration policy to identify the task, and then commit to the identified task by playing the associated informed policy. Thompson Sampling (TS) approaches (Thompson, 1933; Osband et al., 2016; 2019) maintain a distribution over plausible tasks and play the informed policy of a task sampled from the posterior, following a predefined schedule. PTE and TS are expected to be sample-efficient relative to RNNs, as learning informed policies is a fully observable problem. However, as we discuss in Section 3, PTE and TS cannot represent effective exploration/exploitation policies in many environments. Humplik et al. (2019) proposed an alternative approach, Task Inference (TI), which trains a full RNN policy with prediction of the current task as an auxiliary loss. TI avoids the suboptimality of PTE/TS by not constraining the structure of the exploration/exploitation policy. However, in TI, the task descriptors are used as targets rather than as inputs, so TI tries to reconstruct even irrelevant features of the task descriptor and does not leverage the faster learning of informed policies.
In this paper, we introduce IMPORT (InforMed POlicy RegularizaTion), a novel policy architecture for efficient online adaptation that combines the rich expressivity of RNNs with the efficient learning of informed policies. At train time, a shared policy head receives as input the current observation, together with either a (learned) embedding of the current task or the hidden state of an RNN, so that the informed policy and the RNN policy are learned simultaneously. At test time, the hidden state of the RNN replaces the task embedding, and the agent acts without having access to the current task. This leads to several advantages: 1) IMPORT benefits from the informed policy to speed up learning; 2) it avoids reconstructing features of the task descriptor that are irrelevant for learning; and, as a consequence, 3) it adapts faster to unknown environments, showing better generalization capabilities. We evaluate IMPORT against the main approaches to online adaptation on environments that require sophisticated exploration/exploitation strategies. We confirm that TS suffers from its limited expressivity, and show that the policy regularization of IMPORT significantly speeds up learning compared to TI. Moreover, the learnt task embeddings of IMPORT make it robust to irrelevant or minimally informative task descriptors, and able to generalize when learning from few training tasks.
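The two-pathway architecture described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the dimensions, the single-layer tanh recurrence standing in for a full RNN, and the function names (`informed_forward`, `recurrent_forward`, `policy_head`) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS, TASK, HID, ACTIONS = 8, 4, 16, 3  # sizes chosen arbitrarily for the sketch

# Task-embedding network: maps the privileged descriptor mu to an embedding z
# (available at train time only).
W_task = rng.normal(size=(HID, TASK))

# Recurrent network: maps (observation, previous hidden state) to a new hidden
# state that plays the role of z at test time.
W_in = rng.normal(size=(HID, OBS))
W_rec = rng.normal(size=(HID, HID))

# Shared policy head: the same parameters serve both pathways, which is the
# parameter-sharing part of the regularization.
W_head = rng.normal(size=(ACTIONS, OBS + HID))

def policy_head(obs, z):
    """Shared head: consumes the observation plus either a task embedding
    or an RNN hidden state, and returns action logits."""
    return W_head @ np.concatenate([obs, z])

def informed_forward(obs, mu):
    """Train-time informed pathway: z is computed from the task descriptor mu."""
    z = np.tanh(W_task @ mu)
    return policy_head(obs, z)

def recurrent_forward(obs, h_prev):
    """Test-time recurrent pathway: the hidden state replaces the task embedding."""
    h = np.tanh(W_in @ obs + W_rec @ h_prev)
    return policy_head(obs, h), h

obs, mu, h0 = rng.normal(size=OBS), rng.normal(size=TASK), np.zeros(HID)
logits_informed = informed_forward(obs, mu)
logits_recurrent, h1 = recurrent_forward(obs, h0)
assert logits_informed.shape == logits_recurrent.shape == (ACTIONS,)
```

In training, both pathways would be optimized jointly, with an auxiliary objective encouraging the recurrent representation to match the informed one; at test time only `recurrent_forward` is used, since the task descriptor is unavailable.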

2. SETTING

Let M be the space of possible tasks. Each µ ∈ M is associated with an episodic µ-MDP M_µ = (S, A, p_µ, r_µ, γ) whose dynamics p_µ and rewards r_µ are task dependent, while the state and action spaces are shared across tasks and γ is the discount factor. The descriptor µ can be a simple id (µ ∈ N) or a set of parameters (µ ∈ R^d). When the reward function and the transition probabilities are unknown, RL agents need to devise a strategy that balances exploration, to gather information about the system, and exploitation, to maximize the cumulative reward. Such a strategy can be defined as the solution of a partially observable MDP (POMDP), where the hidden variable is the descriptor µ of the MDP. Given a trajectory τ_t = (s_1, a_1, r_1, ..., s_{t-1}, a_{t-1}, r_{t-1}, s_t), a POMDP policy π(a_t | τ_t) maps the trajectory to actions. In particular, the optimal policy in a POMDP is a history-dependent policy that uses τ_t to construct a belief state b_t, which describes the uncertainty about the task at hand, and then maps it to the action that maximizes the expected sum of rewards (e.g., Kaelbling et al., 1998). In this case,
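With these definitions, the online adaptation objective can be written as maximizing the expected return over the task distribution; the notation below is ours (in particular the task distribution ρ over M, which the text does not name explicitly), kept consistent with the symbols defined above:

```latex
\pi^\star = \arg\max_{\pi} \;
\mathbb{E}_{\mu \sim \rho}\,
\mathbb{E}_{\tau \sim (\pi, M_\mu)}
\Big[ \sum_{t \ge 1} \gamma^{t-1}\, r_\mu(s_t, a_t) \Big]
```

Here the inner expectation is over trajectories τ generated by running the history-dependent policy π(a_t | τ_t) in M_µ, so a single π must both reduce uncertainty about µ and collect reward within the episode.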



Figure 1: An environment with two tasks: the goal location (G1 or G2) changes at each episode. The sign reveals the location of the goal. Optimal informed policies are shortest paths from start to either G1 or G2, which never visit the sign. Thompson sampling cannot represent the optimal exploration/exploitation policy (go to the sign first) since going to the sign is not feasible by any informed policy.

