CORRECTING DATA DISTRIBUTION MISMATCH IN OFFLINE META-REINFORCEMENT LEARNING WITH FEW-SHOT ONLINE ADAPTATION

Abstract

Offline meta-reinforcement learning (offline meta-RL) extracts knowledge from a given dataset of multiple tasks and achieves fast adaptation to new tasks. Recent offline meta-RL methods typically use task-dependent behavior policies (e.g., training RL agents on each individual task) to collect a multi-task dataset and learn an offline meta-policy. However, these methods always require extra information for fast adaptation, such as offline context for testing tasks or oracle reward functions. Offline meta-RL with few-shot online adaptation remains an open problem. In this paper, we first formally characterize a unique challenge in this setting: reward and transition distribution mismatch between offline training and online adaptation. This distribution mismatch may lead to unreliable offline policy evaluation and cause the regular adaptation methods of online meta-RL to suffer. To address this challenge, we introduce a novel data distribution correction mechanism, which ensures consistency between offline and online evaluation by filtering out out-of-distribution episodes during online adaptation. Because few-shot out-of-distribution episodes usually have lower returns, we propose a Greedy Context-based data distribution Correction approach, called GCC, which greedily infers how to solve new tasks. GCC diversely samples "task hypotheses" from the current posterior belief and selects greedy hypotheses with higher returns to update the task belief. Our method is the first to provide effective online adaptation without additional information, and it can be combined with off-the-shelf context-based offline meta-training algorithms. Empirical experiments show that GCC achieves state-of-the-art performance on the Meta-World ML1 benchmark compared to baselines with and without offline adaptation.

1. INTRODUCTION

Human intelligence is capable of learning a wide variety of skills from past experience and can adapt to new environments by transferring skills with limited experience. Current reinforcement learning (RL) has surpassed human-level performance (Mnih et al., 2015; Silver et al., 2017; Hafner et al., 2019) but requires a vast amount of experience. Moreover, in many real-world applications, RL encounters two major challenges: multi-task efficiency and costly online interactions. In multi-task settings, such as robotic manipulation or locomotion (Yu et al., 2020b), agents are expected to solve new tasks with few-shot adaptation using previously learned knowledge. Collecting sufficient exploratory interactions is usually expensive or dangerous in robotics (Rafailov et al., 2021), autonomous driving (Yu et al., 2018), and healthcare (Gottesman et al., 2019). One popular paradigm for breaking this practical barrier is offline meta-reinforcement learning (offline meta-RL; Li et al., 2020; Mitchell et al., 2021), which trains a meta-RL agent on pre-collected offline multi-task datasets and enables fast policy adaptation to unseen tasks.

Recent offline meta-RL methods have been proposed to utilize a multi-task dataset collected by task-dependent behavior policies (Li et al., 2020; Dorfman et al., 2021). They show promise by solving new tasks with few-shot adaptation. However, existing offline meta-RL approaches require additional information or assumptions for fast adaptation. For example, FOCAL (Li et al., 2020) and MACAW (Mitchell et al., 2021) conduct offline adaptation using extra offline contexts for meta-testing tasks. BOReL (Dorfman et al., 2021) and SMAC (Pong et al., 2022) employ few-shot online adaptation, in which the former assumes known reward functions, and the latter focuses on offline meta-training with unsupervised online samples (without rewards).
Therefore, achieving effective few-shot online adaptation relying purely on online experience remains an open problem for offline meta-RL. One particular challenge of meta-RL compared to meta-supervised learning is the need to learn how to explore in the testing environments (Finn & Levine, 2019). In offline meta-RL, the gap in reward and transition distributions between offline training and online adaptation presents a unique conundrum for meta-policy learning, namely data distribution mismatch. Note that this data distribution mismatch differs from the distributional shift in offline single-task RL (Levine et al., 2020), which refers to the shift between the learned offline policy and the behavior policy, not a gap between reward and transition distributions. As illustrated in Figure 1, when we collect an offline dataset using the expert policy for each task, the robot is meta-trained on all successful trajectories. The robot aims to adapt quickly to new tasks during meta-testing. However, when it tries the middle path and hits a stone, this failed adaptation trajectory does not match the offline training data distribution, which can lead to false task inference or adaptation. To formally characterize this phenomenon, we adopt the perspective of Bayesian RL (BRL; Duff, 2002; Zintgraf et al., 2019; Dorfman et al., 2021), which maintains a task belief given the context history and learns a meta-policy on belief states. Our theoretical analysis shows that task-dependent data collection may lead the agent to out-of-distribution belief states during meta-testing and result in an inconsistency between offline meta-policy evaluation and online adaptation evaluation. This contrasts with the policy evaluation consistency of offline single-task RL (Fujimoto et al., 2019). To deal with this inconsistency, we can choose either to trust the offline dataset or to trust new experience and continue online exploration.
The latter may not collect sufficient data during few-shot adaptation to learn a good policy from online experience alone. Therefore, we adopt the former strategy and introduce an online data distribution correction mechanism, in which meta-policies with Thompson sampling (Strens, 2000) can filter out out-of-distribution episodes, yielding a theoretical consistency guarantee for policy evaluation. To realize these theoretical implications in practical settings, we propose a context-based offline meta-RL algorithm with a novel online adaptation mechanism, called Greedy Context-based data distribution Correction (GCC). To align the adaptation context with the meta-training distribution, GCC utilizes greedy task inference, which diversely samples "task hypotheses" and selects hypotheses with higher returns to update the task belief. In Figure 1, the robot can sample other "task hypotheses" (i.e., try other paths), and the failed adaptation trajectory (middle) will not be used for task inference because it is out-of-distribution and has a lower return. To the best of our knowledge, our method is the first to design a delicate context mechanism that achieves effective online adaptation for offline meta-RL, and it has the advantage of combining directly with off-the-shelf context-based offline meta-training algorithms. Our main contribution is to formalize a specific challenge (i.e., data distribution mismatch) in offline meta-RL with online adaptation and to propose a greedy context mechanism with theoretical motivation. We extensively evaluate GCC on didactic problems proposed by prior work (Rakelly et al., 2019; Zhang et al., 2021) and on the Meta-World ML1 benchmark with 50 tasks (Yu et al., 2020b). In the didactic problems, GCC demonstrates that our context mechanism can accurately infer task identity, whereas the original online adaptation methods suffer due to out-of-distribution data.
Empirical results on the more challenging Meta-World ML1 benchmark show that GCC significantly outperforms baselines with few-shot online adaptation and achieves performance better than or comparable to that of offline adaptation baselines with expert context.
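The greedy task-inference loop described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the posterior sampler, rollout, and belief update are stubbed with hypothetical toy functions, and a task hypothesis is reduced to a single number.

```python
def greedy_context_correction(belief, sample_hypothesis, rollout, update_belief,
                              num_hypotheses=5):
    """One round of greedy task inference (illustrative interface).

    sample_hypothesis(belief) -> a task hypothesis z from the posterior
    rollout(z)                -> (episode, episodic_return) for the policy
                                 conditioned on hypothesis z
    update_belief(belief, ep) -> posterior updated with episode ep
    """
    # Diversely sample task hypotheses from the current posterior belief.
    candidates = [sample_hypothesis(belief) for _ in range(num_hypotheses)]
    # Collect one online episode per hypothesis.
    results = [rollout(z) for z in candidates]
    # Greedily keep the highest-return episode; low-return episodes are
    # treated as out-of-distribution and excluded from task inference.
    best_episode, best_return = max(results, key=lambda r: r[1])
    return update_belief(belief, best_episode), best_return

# Toy demo with stubbed components (all names here are illustrative):
true_goal = 3                                    # unknown task parameter
hypotheses = iter([0, 5, 3, 1, 4])               # stubbed posterior samples
sample = lambda belief: next(hypotheses)
rollout = lambda z: ([z], -abs(z - true_goal))   # return peaks at the goal
update = lambda belief, ep: belief + ep
belief, ret = greedy_context_correction([], sample, rollout, update)
# Only the episode from the matching hypothesis (z = 3) updates the belief.
```

The key design choice mirrored here is that belief updates use only the greedily selected high-return episode, so failed (out-of-distribution) trajectories never corrupt task inference.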

2.1. STANDARD META-RL

The goal of meta-RL (Finn et al., 2017; Rakelly et al., 2019) is to train a meta-policy that can quickly adapt to new tasks using N adaptation episodes. The standard meta-RL setting deals with a distribution



Figure 1: Illustration of data distribution mismatch between offline training and online adaptation.

