FOCAL: EFFICIENT FULLY-OFFLINE META-REINFORCEMENT LEARNING VIA DISTANCE METRIC LEARNING AND BEHAVIOR REGULARIZATION

Abstract

We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interactions with the environments, making RL truly practical in many real-world applications. This problem is still not fully understood, and two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors of out-of-distribution state-actions, which lead to divergence of value functions. Second, meta-RL requires efficient and robust task inference learned jointly with the control policy. In this work, we enforce behavior regularization on the learned policy as a general approach to offline RL, combined with a deterministic context encoder for efficient task inference. We propose a novel negative-power distance metric on a bounded context embedding space, whose gradient propagation is detached from the Bellman backup. We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches involving meta-RL and distance metric learning. To the best of our knowledge, our method is the first model-free and end-to-end OMRL algorithm, which is computationally efficient and demonstrated to outperform prior algorithms on several meta-RL benchmarks.

1. INTRODUCTION

Applications of reinforcement learning (RL) to real-world problems have proven successful in many domains such as games (Silver et al., 2017; Vinyals et al., 2019; Ye et al., 2020) and robot control (Johannink et al., 2019). However, the implementations so far usually rely on interactions with either real or simulated environments. In other areas like healthcare (Gottesman et al., 2019), autonomous driving (Shalev-Shwartz et al., 2016) and controlled-environment agriculture (Binas et al., 2019), where RL shows promise conceptually or in theory, exploration in real environments is evidently risky, and building a high-fidelity simulator can be costly. Therefore a key step towards more practical RL algorithms is the ability to learn from static data. Such a paradigm, termed "offline RL" or "batch RL", would enable better generalization by incorporating diverse prior experience. Moreover, by leveraging and reusing previously collected data, off-policy algorithms such as SAC (Haarnoja et al., 2018) have been shown to achieve far better sample efficiency than on-policy methods. The same applies to offline RL algorithms since they are by nature off-policy. The aforementioned design principles motivated a surge of recent works on offline/batch RL (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020). These papers propose remedies by regularizing the learner to stay close to the policy that generated the logged transitions of the training datasets, namely the behavior policy, in order to mitigate the effect of bootstrapping error (Kumar et al., 2019), where evaluation errors of out-of-distribution state-action pairs are never corrected and hence easily diverge due to the inability to collect new data samples for feedback. There exist claims that offline RL can be implemented successfully without explicit correction for distribution mismatch, given sufficiently large and diverse training data (Agarwal et al., 2020).
However, we find this assumption unrealistic in many practical settings, including our experiments. In this paper, to tackle the out-of-distribution problem in offline RL in general, we adopt the behavior regularization approach proposed by Wu et al. (2019). For practical RL, besides the ability to learn without exploration, it is also desirable to have an algorithm that can generalize to various scenarios. To solve real-world challenges in the multi-task setting, such as treating different diseases, driving under various road conditions or growing diverse crops in autonomous greenhouses, a robust agent is expected to quickly transfer and adapt to unseen tasks, especially when the tasks share common structures. Meta-learning methods (Vilalta & Drissi, 2002; Thrun & Pratt, 2012) address this problem by learning an inductive bias from experience collected across a distribution of tasks, which can be naturally extended to the context of reinforcement learning. Under the umbrella of this so-called meta-RL, almost all current methods require on-policy data either during both the meta-training and testing phases (Wang et al., 2016; Duan et al., 2016; Finn et al., 2017) or at least during the testing stage (Rakelly et al., 2019) for adaptation. An efficient and robust method which incorporates both fully-offline learning and meta-learning in RL, despite a few attempts (Li et al., 2019b; Dorfman & Tamar, 2020), has not been fully developed and validated. In this paper, under the first principle of maximizing the practicality of RL algorithms, we propose an efficient method that integrates task inference with RL algorithms in a fully-offline fashion. Our fully-offline context-based actor-critic meta-RL algorithm, or FOCAL, achieves excellent sample efficiency and fast adaptation with limited logged experience, on a range of deterministic continuous control meta-environments.
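To make the adopted behavior-regularization idea concrete, the sketch below shows one way a behavior-regularized actor objective can be computed; the function name, the choice of a sample-based KL estimate, and the penalty weight `alpha` are illustrative assumptions, with the KL being one of the divergence choices discussed by Wu et al. (2019), not necessarily the exact form used in FOCAL.

```python
import numpy as np

def behavior_regularized_actor_loss(q_values, policy_log_probs,
                                    behavior_log_probs, alpha=1.0):
    """Hypothetical sketch of a behavior-regularized actor objective.

    The actor maximizes Q-values while being penalized for deviating
    from the behavior policy that generated the offline data. Here the
    divergence is estimated per-sample as log pi(a|s) - log pi_b(a|s),
    a simple single-sample KL estimate.

    Args (all 1-D arrays over a minibatch):
        q_values: critic estimates Q(s, a) for actions a ~ pi(.|s)
        policy_log_probs: log pi(a|s) under the learned policy
        behavior_log_probs: log pi_b(a|s) under the behavior policy
        alpha: regularization strength (assumed hyperparameter)
    """
    kl_estimate = policy_log_probs - behavior_log_probs
    # Actor minimizes: -Q + alpha * divergence penalty
    return np.mean(-q_values + alpha * kl_estimate)
```

When the learned policy matches the behavior policy, the penalty vanishes and the objective reduces to plain Q-maximization; increasing `alpha` trades off return against staying in-distribution.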
The primary contribution of this work is the design of the first end-to-end and model-free offline meta-RL algorithm, which is computationally efficient and effective without any prior knowledge of task identity or reward/dynamics. To achieve efficient task inference, we propose an inverse-power loss for effective learning and clustering of task latent variables, in analogy to the Coulomb potential in electromagnetism, which has not appeared in previous work. We also shed light on the specific design choices customized for the OMRL problem through theoretical and empirical analyses.
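The Coulomb-potential analogy above can be sketched as a pairwise loss on task embeddings: same-task pairs are attracted by a squared-distance term, while different-task pairs feel an inverse-power repulsion that grows sharply as embeddings collide. The exact functional form, the exponent `n`, and the constants `eps` and `beta` below are assumptions for illustration, not the paper's precise formulation.

```python
import numpy as np

def inverse_power_dml_loss(z_i, z_j, same_task, n=2, eps=1e-3, beta=1.0):
    """Illustrative sketch of an inverse-power distance metric loss.

    Args:
        z_i, z_j: arrays of shape (batch, dim), paired task embeddings
        same_task: boolean array of shape (batch,), True if the pair
            comes from the same task
        n, eps, beta: assumed hyperparameters (repulsion exponent,
            numerical floor, repulsion scale)
    """
    sq_dist = np.sum((z_i - z_j) ** 2, axis=-1)
    attract = sq_dist                           # same-task pairs: pull together
    repel = beta / (sq_dist ** (n / 2) + eps)   # different-task pairs: push apart
    return np.mean(np.where(same_task, attract, repel))
```

Unlike a bounded contrastive margin, the inverse-power repulsion diverges as different-task embeddings approach each other, which encourages well-separated task clusters in the embedding space.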

2. RELATED WORK

Meta-RL Our work FOCAL builds upon the meta-learning framework in the context of reinforcement learning. Among all paradigms of meta-RL, this paper is most related to the context-based and metric-based approaches. Context-based meta-RL employs models with memory such as recurrent (Duan et al., 2016; Wang et al., 2016; Fakoor et al., 2019), recursive (Mishra et al., 2017) or probabilistic (Rakelly et al., 2019) structures to achieve fast adaptation by aggregating experience into a latent representation on which the policy is conditioned. The design of the context usually leverages the temporal or Markov properties of RL problems. Metric-based meta-RL focuses on learning effective task representations to facilitate task inference and conditioned control policies, by employing techniques such as distance metric learning (Yang & Jin, 2006). Koch et al. (2015) proposed the first metric-based meta-algorithm for few-shot learning, in which a Siamese network (Chopra et al., 2005) is trained with a triplet loss to compare the similarity between a query and supports in the embedding space. Many metric-based meta-RL algorithms extend these works (Snell et al., 2017; Sung et al., 2018; Li et al., 2019a). Among all aforementioned meta-learning approaches, this paper is most related to the context-based PEARL algorithm (Rakelly et al., 2019) and metric-based prototypical networks (Snell et al., 2017). PEARL achieves SOTA performance for off-policy meta-RL by introducing a probabilistic permutation-invariant context encoder, along with a design which disentangles task inference and control through different sampling strategies. However, it requires exploration during meta-testing. Prototypical networks employ a similar design of context encoder as well as a Euclidean distance metric on a deterministic embedding space, but tackle meta-learning of classification tasks with a squared-distance loss, as opposed to the inverse-power loss in FOCAL for the more complex OMRL problem.
Offline/Batch RL To address the bootstrapping error (Kumar et al., 2019) problem of offline RL, this paper adopts behavior regularization directly from Wu et al. (2019), which provides a relatively unified framework for several recent offline or off-policy RL methods (Haarnoja et al.,

