META-REINFORCEMENT LEARNING ROBUST TO DISTRIBUTIONAL SHIFT VIA MODEL IDENTIFICATION AND EXPERIENCE RELABELING

Abstract

Reinforcement learning algorithms can acquire policies for complex tasks autonomously. However, the number of samples required to learn a diverse set of skills can be prohibitively large. While meta-reinforcement learning methods have enabled agents to leverage prior experience to adapt quickly to new tasks, their performance depends crucially on how close the new task is to the previously experienced tasks. Current approaches are either not able to extrapolate well, or can do so at the expense of requiring extremely large amounts of data for on-policy meta-training. In this work, we present model identification and experience relabeling (MIER), a meta-reinforcement learning algorithm that is both efficient and extrapolates well when faced with out-of-distribution tasks at test time. Our method is based on a simple insight: we recognize that dynamics models can be adapted efficiently and consistently with off-policy data, more easily than policies and value functions. These dynamics models can then be used to continue training policies and value functions for out-of-distribution tasks without using meta-reinforcement learning at all, by generating synthetic experience for the new task.
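The core idea above can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation: it assumes hypothetical linear dynamics s' = sA + aB, adapts the model to a new task by gradient descent on a small off-policy batch, and then uses the adapted model to generate synthetic transitions for continued policy training. All names (`adapt_model`, `generate_synthetic`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def adapt_model(A, B, states, actions, next_states, lr=0.1, steps=300):
    """Adapt the dynamics model via gradient descent on one-step
    prediction error -- a consistent procedure for any batch of
    off-policy transitions."""
    n = len(states)
    for _ in range(steps):
        err = states @ A + actions @ B - next_states
        A = A - lr * states.T @ err / n
        B = B - lr * actions.T @ err / n
    return A, B

def generate_synthetic(A, B, states, policy):
    """Relabel experience: predict next states under the adapted model,
    producing synthetic transitions for policy/value training."""
    actions = policy(states)
    return actions, states @ A + actions @ B

# True (unknown) dynamics of an out-of-distribution task.
A_true = np.array([[0.9, 0.1], [0.0, 1.1]])
B_true = np.array([[0.5, -0.3]])

# Small off-policy batch collected on the new task.
states = rng.normal(size=(64, 2))
actions = rng.normal(size=(64, 1))
next_states = states @ A_true + actions @ B_true

# Adapt from a (here: zero) initialization.
A, B = adapt_model(np.zeros((2, 2)), np.zeros((1, 2)),
                   states, actions, next_states)

# Synthetic experience generated by the adapted model.
syn_actions, syn_next = generate_synthetic(
    A, B, states, policy=lambda s: rng.normal(size=(len(s), 1)))
```

Because the model is fit by ordinary supervised regression, the adaptation step works with arbitrary off-policy data, unlike policy-gradient adaptation, which requires on-policy samples.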

1. INTRODUCTION

Recent advances in reinforcement learning (RL) have enabled agents to autonomously acquire policies for complex tasks, particularly when combined with high-capacity representations such as neural networks (Lillicrap et al., 2015; Schulman et al., 2015; Mnih et al., 2015; Levine et al., 2016). However, the number of samples required to learn these tasks is often very large. Meta-reinforcement learning (meta-RL) algorithms can alleviate this problem by leveraging experience from previously seen related tasks (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017a), but the performance of these methods on new tasks depends crucially on how close these tasks are to the meta-training task distribution. Meta-trained agents can adapt quickly to tasks that are similar to those seen during meta-training, but lose much of their benefit when adapting to tasks that are too far from the meta-training set. This places a significant burden on the user to carefully construct meta-training task distributions that sufficiently cover the kinds of tasks that may be encountered at test time. Many meta-RL methods either utilize a variant of model-agnostic meta-learning (MAML) and adapt to new tasks with gradient descent (Finn et al., 2017a; Rothfuss et al., 2018; Nagabandi et al., 2018), or use an encoder-based formulation that adapts by encoding experience with recurrent models (Duan et al., 2016; Wang et al., 2016), attention mechanisms (Mishra et al., 2017), or variational inference (Rakelly et al., 2019). The encoder-based methods struggle when adapting to out-of-distribution tasks, because the adaptation procedure is entirely learned and carries no guarantees for out-of-distribution inputs (as with any learned model).
Methods that utilize gradient-based adaptation have the potential to handle out-of-distribution tasks more effectively, since gradient descent corresponds to a well-defined and consistent learning process that is guaranteed to improve regardless of the task (Finn & Levine, 2018). However, in the RL setting, these methods (Finn et al., 2017a; Rothfuss et al., 2018) utilize on-policy policy gradient methods for meta-training, which require a very large number of meta-training samples (Rakelly et al., 2019). In this paper, we aim to develop a meta-RL algorithm that can both adapt effectively to out-of-distribution tasks and be meta-trained efficiently via off-policy value-based algorithms. One approach
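The consistency property of gradient-based adaptation can be illustrated with a toy example. The sketch below is purely illustrative: it uses a hypothetical quadratic task loss, with `theta0` standing in for a meta-learned initialization and `ood_target` for a task far outside the meta-training distribution. A few gradient steps reduce the task loss regardless of where the target lies, which is exactly the guarantee that learned (encoder-based) adaptation procedures lack.

```python
import numpy as np

def task_loss(theta, target):
    """Hypothetical per-task loss: squared distance to the task's optimum."""
    return 0.5 * float(np.sum((theta - target) ** 2))

def adapt(theta, target, lr=0.1, steps=5):
    """Gradient descent on task_loss; the gradient is (theta - target)."""
    for _ in range(steps):
        theta = theta - lr * (theta - target)
    return theta

theta0 = np.zeros(3)                       # stand-in for a meta-learned init
ood_target = np.array([10.0, -7.0, 3.0])   # far outside the training tasks

theta = adapt(theta0, ood_target)
# The adapted loss is strictly lower than the initial loss, for any target.
```

No matter how the initialization was obtained, each step moves in the direction of the task's own gradient, so improvement on the new task does not depend on the task resembling the meta-training distribution.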

