OFFLINE META-REINFORCEMENT LEARNING WITH ADVANTAGE WEIGHTING

Abstract

This paper introduces the offline meta-reinforcement learning (offline meta-RL) problem setting and proposes an algorithm that performs well in this setting. Offline meta-RL is analogous to the widely successful supervised learning strategy of pretraining a model on a large batch of fixed, pre-collected data (possibly from various tasks) and fine-tuning the model to a new task with relatively little data. That is, in offline meta-RL, we meta-train on fixed, pre-collected data from several tasks and adapt to a new task with a very small amount of data (fewer than five trajectories) from the new task. By nature of being offline, algorithms for offline meta-RL can utilize the largest possible pool of training data available and eliminate potentially unsafe or costly data collection during meta-training. This setting inherits the challenges of offline RL, but it differs significantly because offline RL does not generally consider a) transfer to new tasks or b) limited data from the test task, both of which we face in offline meta-RL. Targeting the offline meta-RL setting, we propose Meta-Actor Critic with Advantage Weighting (MACAW). MACAW is an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loops of meta-training. On offline variants of common meta-RL benchmarks, we empirically find that this approach enables fully offline meta-reinforcement learning and achieves notable gains over prior methods.

1. INTRODUCTION

Meta-reinforcement learning (meta-RL) has emerged as a promising strategy for tackling the high sample complexity of reinforcement learning algorithms when the goal is to ultimately learn many tasks. Meta-RL algorithms exploit shared structure among tasks during meta-training, amortizing the cost of learning across tasks and enabling rapid adaptation to new tasks during meta-testing from only a small amount of experience. Yet unlike in supervised learning, where large amounts of pre-collected data can be pooled from many sources to train a single model, existing meta-RL algorithms assume the ability to collect millions of environment interactions online during meta-training. Offline meta-RL methods would, in principle, be able to leverage existing data from any source, making them easier to scale to real-world problems where large amounts of data might be necessary to generalize broadly. To this end, we propose the offline meta-RL problem setting and a corresponding algorithm that uses only offline (or batch) experience from a set of training tasks to enable efficient transfer to new tasks, without any further interaction with either the training or testing environments. See Figure 1 for a comparison of offline meta-RL and standard meta-RL. Because the offline setting does not allow additional data collection during training, it highlights the desirability of a consistent meta-RL algorithm. A meta-RL algorithm is consistent if, given enough diverse data on the test task, adaptation can find a good policy for the task regardless of the training task distribution. Such an algorithm would provide a) rapid adaptation to new tasks from the same distribution as the training tasks while b) allowing for improvement even on out-of-distribution test tasks.
However, designing a consistent meta-RL algorithm in the offline setting is difficult: the consistency requirement suggests we might aim to extend the model-agnostic meta-learning (MAML) algorithm (Finn et al., 2017a), since it directly corresponds to fine-tuning at meta-test time. However, existing MAML approaches use online policy gradients, and only value-based approaches have proven effective in the offline setting. Yet combining MAML with value-based RL subroutines is not straightforward: the higher-order optimization in MAML-like methods demands stable and efficient gradient-descent updates, while TD backups are both somewhat unstable and require a large number of steps to propagate reward information across long time horizons. To address these challenges, one might combine MAML with a supervised, bootstrap-free RL subroutine such as advantage-weighted regression (AWR) (Peters and Schaal, 2007; Peng et al., 2019) for both the inner and outer loops of a gradient-based meta-learning algorithm, yielding a 'MAML+AWR' algorithm. However, as we discuss in Section 4 and find empirically in Section 5, naïvely combining MAML and AWR in this way does not provide satisfactory performance because the AWR policy update is not sufficiently expressive. Motivated by prior work that studies the expressive power of MAML (Finn and Levine, 2018), we increase the expressive power of the meta-learner by introducing a carefully chosen policy update in the inner loop. We theoretically prove that this change increases the richness of the policy's update and find empirically that it dramatically improves adaptation performance and stability in some settings. We further observe that standard feedforward neural network architectures used in reinforcement learning are not well-suited to optimization-based meta-learning, and we suggest an alternative that proves critical for good performance across four different environments.
We call the resulting meta-RL algorithm and architecture Meta-Actor Critic with Advantage Weighting, or MACAW. Our main contributions are the offline meta-RL problem setting itself and MACAW, an offline meta-reinforcement learning algorithm that possesses three key properties: sample efficiency, offline meta-training, and consistency at meta-test time. To our knowledge, MACAW is the first algorithm to successfully combine gradient-based meta-learning and off-policy value-based RL. Our evaluations include experiments on offline variants of standard continuous control meta-RL benchmarks as well as settings specifically designed to test the robustness of an offline meta-learner when training tasks are scarce. In all of these settings, MACAW significantly outperforms fully offline variants of state-of-the-art off-policy RL and meta-RL baselines.

2. PRELIMINARIES

In reinforcement learning, an agent interacts with a Markov Decision Process (MDP) to maximize its cumulative reward. An MDP is a tuple (S, A, T, r) consisting of a state space S, an action space A, stochastic transition dynamics T : S × A × S → [0, 1], and a reward function r. At each time step, the agent receives reward r_t = r(s_t, a_t, s_{t+1}). The agent's objective is to maximize the expected return (i.e., the discounted sum of rewards) R = Σ_t γ^t r_t, where γ ∈ [0, 1] is a discount factor. To extend this setting to meta-RL, we consider tasks drawn from a distribution T_i ∼ p(T), where each task T_i = (S, A, p_i, r_i) represents a different MDP. Both the dynamics and reward function may vary across tasks. The tasks are generally assumed to exhibit some (unknown) shared structure. During meta-training, the agent is presented with tasks sampled from p(T); at meta-test time, the agent's objective is to rapidly find a high-performing policy for a (potentially unseen) task T ∼ p(T). That is, with only a small amount of experience on T, the agent should find a policy that achieves high expected return on that task. During meta-training, the agent meta-learns parameters or update rules that enable such rapid adaptation at test time.

Model-agnostic meta-learning. One class of algorithms for addressing the meta-RL problem (as well as meta-supervised learning) comprises variants of the MAML algorithm (Finn et al., 2017a), which involves a bi-level optimization that aims to achieve fast adaptation via a few gradient updates. Specifically, MAML optimizes a set of initial policy parameters θ such that a few gradient-descent steps from θ lead to policy parameters that achieve good task performance. At each meta-training step, the inner loop adapts θ to a task T by computing θ' = θ − α∇_θ L_T(θ), where L_T is the loss function for task T and α is the step size (in general, θ' might be computed from multiple gradient steps, rather than just one as written here).
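As a concrete illustration of this bi-level update, the following is a minimal sketch of MAML on a toy family of one-dimensional quadratic tasks, where each task's loss is L_i(θ) = (θ − c_i)². The task family, step sizes, and function names are illustrative assumptions for exposition, not part of the method described here:

```python
# Minimal MAML sketch on a toy task family: each task i has loss
# L_i(theta) = (theta - c_i)^2, so gradients are available in closed form.

def grad(c, theta):
    # Gradient of the quadratic task loss (theta - c)^2.
    return 2.0 * (theta - c)

def maml_step(theta, task_centers, alpha=0.1, beta=0.05):
    """One meta-training step: per-task inner adaptation, then an outer
    update of the initialization theta through the inner-loop gradient."""
    outer_grad = 0.0
    for c in task_centers:
        theta_adapted = theta - alpha * grad(c, theta)  # inner loop: theta' = theta - alpha * grad L
        # Chain rule through the inner step: for this quadratic loss,
        # d L(theta') / d theta = grad(c, theta') * (1 - 2 * alpha).
        outer_grad += grad(c, theta_adapted) * (1.0 - 2.0 * alpha)
    return theta - beta * outer_grad / len(task_centers)  # outer loop

# Meta-train the initialization over a small set of tasks.
theta = 0.0
for _ in range(200):
    theta = maml_step(theta, task_centers=[-1.0, 1.0, 3.0])
```

For these symmetric quadratic tasks, the learned initialization approaches the point minimizing the average post-adaptation loss (here, the mean of the task optima); practical MAML implementations compute the same chain rule automatically with an autodiff library rather than by hand.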
The outer loop updates the initial parameters as θ ← θ − β∇_θ L'_T(θ'), where L'_T is a loss function for task T, which may or may not be the same as the inner-loop loss function L_T, and β is the step size. MAML has been previously instantiated with policy gradient updates in the inner and outer loops (Finn et al., 2017a; Rothfuss et al., 2018), which can only be applied to on-policy meta-RL settings; we address this shortcoming in this work.

Advantage-weighted regression. To develop an offline meta-RL algorithm, we build upon advantage-weighted regression (AWR) (Peng et al., 2019), a simple offline RL method. The AWR policy objective is given by

L_AWR(ϑ, ϕ, B) = E_{s,a∼B}[ −log π_ϑ(a|s) exp( (1/T) (R_B(s, a) − V_ϕ(s)) ) ],

where B = {s_j, a_j, s'_j, r_j} can be an arbitrary dataset of transition tuples sampled from some behavior policy, and R_B(s, a) is the return recorded in the dataset for performing action a in state
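To make the AWR objective concrete, below is a minimal sketch of the policy loss computed over a batch of transitions. The function name and batch contents are hypothetical; the inputs are per-transition log-probabilities under the current policy, dataset returns, and baseline value estimates, with temperature playing the role of T in the objective:

```python
import math

# Minimal sketch of the AWR policy objective:
# L_AWR = E[-log pi(a|s) * exp((R_B(s, a) - V(s)) / T)].
# All numeric values below are illustrative.

def awr_loss(log_probs, returns, values, temperature=1.0):
    """Average advantage-weighted negative log-likelihood over a batch."""
    total = 0.0
    for logp, ret, val in zip(log_probs, returns, values):
        weight = math.exp((ret - val) / temperature)  # exponentiated advantage
        total += -logp * weight
    return total / len(log_probs)

# A transition with zero advantage (return equals the value estimate) is
# weighted by exp(0) = 1, so it contributes just its negative log-probability.
loss = awr_loss(log_probs=[math.log(0.5)], returns=[1.0], values=[1.0])
```

Note that practical AWR implementations typically clip or normalize the exponentiated weights for numerical stability; that detail is omitted in this sketch.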

