META-REINFORCEMENT LEARNING ROBUST TO DISTRIBUTIONAL SHIFT VIA MODEL IDENTIFICATION AND EXPERIENCE RELABELING

Abstract

Reinforcement learning algorithms can acquire policies for complex tasks autonomously. However, the number of samples required to learn a diverse set of skills can be prohibitively large. While meta-reinforcement learning methods have enabled agents to leverage prior experience to adapt quickly to new tasks, their performance depends crucially on how close the new task is to the previously experienced tasks. Current approaches are either not able to extrapolate well, or can do so at the expense of requiring extremely large amounts of data for on-policy meta-training. In this work, we present model identification and experience relabeling (MIER), a meta-reinforcement learning algorithm that is both efficient and extrapolates well when faced with out-of-distribution tasks at test time. Our method is based on a simple insight: we recognize that dynamics models can be adapted efficiently and consistently with off-policy data, more easily than policies and value functions. These dynamics models can then be used to continue training policies and value functions for out-of-distribution tasks without using meta-reinforcement learning at all, by generating synthetic experience for the new task.

1. INTRODUCTION

Recent advances in reinforcement learning (RL) have enabled agents to autonomously acquire policies for complex tasks, particularly when combined with high-capacity representations such as neural networks (Lillicrap et al., 2015; Schulman et al., 2015; Mnih et al., 2015; Levine et al., 2016). However, the number of samples required to learn these tasks is often very large. Meta-reinforcement learning (meta-RL) algorithms can alleviate this problem by leveraging experience from previously seen related tasks (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017a), but the performance of these methods on new tasks depends crucially on how close these tasks are to the meta-training task distribution. Meta-trained agents can adapt quickly to tasks that are similar to those seen during meta-training, but lose much of their benefit when adapting to tasks that are too far away from the meta-training set. This places a significant burden on the user to carefully construct meta-training task distributions that sufficiently cover the kinds of tasks that may be encountered at test time. Many meta-RL methods either utilize a variant of model-agnostic meta-learning (MAML) and adapt to new tasks with gradient descent (Finn et al., 2017a; Rothfuss et al., 2018; Nagabandi et al., 2018), or use an encoder-based formulation that adapts by encoding experience with recurrent models (Duan et al., 2016; Wang et al., 2016), attention mechanisms (Mishra et al., 2017), or variational inference (Rakelly et al., 2019). The encoder-based methods struggle when adapting to out-of-distribution tasks, because the adaptation procedure is entirely learned and carries no guarantees for out-of-distribution inputs (as with any learned model).
Methods that utilize gradient-based adaptation have the potential to handle out-of-distribution tasks more effectively, since gradient descent corresponds to a well-defined and consistent learning process that is guaranteed to improve regardless of the task (Finn & Levine, 2018). However, in the RL setting, these methods (Finn et al., 2017a; Rothfuss et al., 2018) utilize on-policy policy gradient methods for meta-training, which require a very large number of meta-training samples (Rakelly et al., 2019). In this paper, we aim to develop a meta-RL algorithm that can both adapt effectively to out-of-distribution tasks and be meta-trained efficiently via off-policy value-based algorithms.

Figure 1: Overview of our approach. The model context variable (φ) is adapted using gradient descent, and the adapted context variable (φ_T) is fed to the policy alongside the state so the policy can be trained with standard RL (Model Identification). The adapted model is used to relabel the data from other tasks by predicting the next state and reward, generating synthetic experience to continue improving the policy (Experience Relabeling).

One approach might be to directly develop a value-based off-policy meta-RL method that uses gradient-based meta-learning. However, this is very difficult, since the fixed point iteration used in value-based RL algorithms does not correspond to gradient descent, and to our knowledge no prior method has successfully adapted MAML to the off-policy value-based setting. We further discuss this difficulty in Appendix A. Instead, we propose to leverage a simple insight: dynamics and reward models can be adapted consistently, using gradient-based update rules with off-policy data, even if policies and value functions cannot. These models can then be used to train policies for out-of-distribution tasks without using meta-RL at all, by generating synthetic experience for the new tasks.
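To make this insight concrete, the following minimal sketch adapts the task context variable of a dynamics model with gradient descent on off-policy transitions. The linear model, the function names, and all hyperparameters here are illustrative assumptions for exposition, not the architecture used in the paper.

```python
import numpy as np

# Illustrative linear dynamics model: next_state = W @ [state; action; phi].
# W is meta-trained and frozen; only the context phi is adapted per task.
def predict(W, state, action, phi):
    return W @ np.concatenate([state, action, phi])

def adapt_context(W, phi, transitions, lr=0.05, steps=100):
    """Gradient descent on the mean squared prediction error w.r.t. phi."""
    d = len(phi)
    for _ in range(steps):
        grad = np.zeros(d)
        for s, a, s_next in transitions:
            err = predict(W, s, a, phi) - s_next
            grad += 2.0 * W[:, -d:].T @ err  # d(err^2)/d(phi)
        phi = phi - lr * grad / len(transitions)
    return phi
```

Because the update is plain gradient descent, additional steps at test time keep reducing the model's prediction error even on tasks far from the meta-training distribution, which is the consistency property the argument above relies on.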
Based on this observation, we propose model identification and experience relabeling (MIER), a meta-RL algorithm that makes use of two independent novel concepts: model identification and experience relabeling. Model identification refers to the process of identifying a particular task from a distribution of tasks, which requires determining its transition dynamics and reward function. We use a gradient-based supervised meta-learning method to learn a dynamics and reward model together with a (latent) model context variable, such that the model quickly adapts to new tasks after a few steps of gradient descent on the context variable. The context variable must contain sufficient information about the task to accurately predict dynamics and rewards. The policy can then be conditioned on this context (Schaul et al., 2015; Kaelbling, 1993) and therefore does not need to be meta-trained or adapted. Hence it can be learned with any standard RL algorithm, avoiding the complexity of meta-reinforcement learning. We illustrate the model identification process in Figure 1 (left).

When adapting to out-of-distribution tasks at meta-test time, the adapted context variable may itself be out of distribution, and the context-conditioned policy might perform poorly. However, since MIER adapts the model with gradient descent, we can continue to improve the model using more gradient steps. To continue improving the policy, we leverage all data collected from other tasks during meta-training, using the learned model to relabel the next state and reward on every previously seen transition and thereby obtaining synthetic data with which to continue training the policy. We call this process, shown in the right part of Figure 1, experience relabeling. It enables MIER to adapt to tasks outside the meta-training distribution, outperforming prior meta-reinforcement learning methods in this setting.
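The relabeling step described above can be sketched as follows: old transitions keep their states and actions, while the adapted model supplies the next state and reward for the new task. This is a minimal illustration with hypothetical names; in the paper the model is a learned neural network that jointly predicts next state and reward.

```python
import numpy as np

def relabel(replay_buffer, model, phi_adapted):
    """Generate synthetic experience for a new task.

    replay_buffer holds (s, a, s_next, r) tuples from other tasks; the
    original s_next and r are discarded and replaced by the adapted
    model's predictions, so the policy can keep training off-policy.
    """
    synthetic = []
    for s, a, _, _ in replay_buffer:
        s_next, r = model(s, a, phi_adapted)
        synthetic.append((s, a, s_next, r))
    return synthetic
```

The synthetic tuples can then be fed to any standard off-policy RL algorithm exactly as if they had been collected on the new task.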

2. PRELIMINARIES

Formally, the reinforcement learning problem is defined by a Markov decision process (MDP). We adopt the standard definition of an MDP, T = (S, A, p, µ_0, r, γ), where S is the state space, A is the action space, p(s'|s, a) is the unknown probability of transitioning to state s' when the agent takes action a in state s, µ_0(s) is the initial state distribution, r(s, a) is the reward function, and γ ∈ (0, 1) is the discount factor. An agent acts according to some policy π(a|s), and the learning objective is to maximize the expected return, E_{s_t, a_t ∼ π}[Σ_t γ^t r(s_t, a_t)].

We further define the meta-reinforcement learning problem. Meta-training uses a distribution over MDPs, ρ(T), from which tasks are sampled. Given a specific task T, the agent is allowed to collect a small amount of data D_adapt^(T) and adapt the policy to obtain π_T. The objective of meta-training is to maximize the expected return of the adapted policy, E_{T ∼ ρ(T), s_t, a_t ∼ π_T}[Σ_t γ^t r(s_t, a_t)].
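As a concrete reading of the objective, the discounted return Σ_t γ^t r(s_t, a_t) of a single sampled trajectory can be computed as below; the expectations above average this quantity over trajectories (and, for meta-RL, over tasks). This is a trivial sketch for exposition only.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for one trajectory's reward sequence."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total
```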

