CORRECTING DATA DISTRIBUTION MISMATCH IN OFFLINE META-REINFORCEMENT LEARNING WITH FEW-SHOT ONLINE ADAPTATION

Abstract

Offline meta-reinforcement learning (offline meta-RL) extracts knowledge from a given dataset of multiple tasks and achieves fast adaptation to new tasks. Recent offline meta-RL methods typically use task-dependent behavior policies (e.g., training RL agents on each individual task) to collect a multi-task dataset and learn an offline meta-policy. However, these methods require extra information for fast adaptation, such as offline contexts for the testing tasks or oracle reward functions; offline meta-RL with few-shot online adaptation remains an open problem. In this paper, we first formally characterize a unique challenge in this setting: the mismatch between the reward and transition distributions of offline training and online adaptation. This distribution mismatch can make offline policy evaluation unreliable, causing the standard adaptation methods of online meta-RL to suffer. To address this challenge, we introduce a novel data distribution correction mechanism, which ensures consistency between offline and online evaluation by filtering out out-of-distribution episodes during online adaptation. Because few-shot out-of-distribution episodes usually have lower returns, we propose a Greedy Context-based data distribution Correction approach, called GCC, which greedily infers how to solve new tasks. GCC diversely samples "task hypotheses" from the current posterior belief and selects the greedy hypotheses with higher returns to update the task belief. Our method is the first to provide effective online adaptation without additional information, and it can be combined with off-the-shelf context-based offline meta-training algorithms. Empirical experiments show that GCC achieves state-of-the-art performance on the Meta-World ML1 benchmark compared to baselines with and without offline adaptation.
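The greedy, return-based filtering described above can be illustrated with a minimal Python sketch. This is not the paper's implementation; all function names (`sample_hypothesis`, `rollout_return`, `update_belief`) and the specific interfaces are illustrative assumptions, showing only the generic sample-evaluate-filter-update loop of one adaptation step.

```python
def gcc_adaptation_step(sample_hypothesis, rollout_return, update_belief,
                        belief, num_samples=8, top_k=2):
    """One adaptation step of a greedy context-based correction scheme
    (a simplified sketch; names are illustrative, not from the paper).

    1. Diversely sample task hypotheses from the current posterior belief.
    2. Roll out each hypothesis online and record its episode return.
    3. Keep only the greedy (highest-return) hypotheses, filtering out
       low-return episodes, which are more likely out-of-distribution.
    4. Update the task belief using the surviving hypotheses' contexts.
    """
    hypotheses = [sample_hypothesis(belief) for _ in range(num_samples)]
    scored = [(rollout_return(h), h) for h in hypotheses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    greedy = [h for _, h in scored[:top_k]]
    return update_belief(belief, greedy)
```

The design choice here mirrors the abstract's observation: since few-shot out-of-distribution episodes tend to have lower returns, keeping only the top-return hypotheses approximates filtering out episodes that would make offline and online evaluation inconsistent.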

1. INTRODUCTION

Human intelligence is capable of learning a wide variety of skills from past experience and can adapt to new environments by transferring those skills with only limited additional experience. Current reinforcement learning (RL) methods have surpassed human-level performance (Mnih et al., 2015; Silver et al., 2017; Hafner et al., 2019) but require vast amounts of experience. In many real-world applications, RL encounters two major challenges: multi-task efficiency and costly online interactions. In multi-task settings, such as robotic manipulation or locomotion (Yu et al., 2020b), agents are expected to solve new tasks with few-shot adaptation using previously learned knowledge. Moreover, collecting sufficient exploratory interactions is usually expensive or dangerous in robotics (Rafailov et al., 2021), autonomous driving (Yu et al., 2018), and healthcare (Gottesman et al., 2019). One popular paradigm for breaking this practical barrier is offline meta-reinforcement learning (offline meta-RL; Li et al., 2020; Mitchell et al., 2021), which trains a meta-RL agent with pre-collected offline multi-task datasets and enables fast policy adaptation to unseen tasks.

Recent offline meta-RL methods have been proposed to utilize a multi-task dataset collected by task-dependent behavior policies (Li et al., 2020; Dorfman et al., 2021). They show promise in solving new tasks with few-shot adaptation. However, existing offline meta-RL approaches require additional information or assumptions for fast adaptation. For example, FOCAL (Li et al., 2020) and MACAW (Mitchell et al., 2021) conduct offline adaptation using extra offline contexts for meta-testing tasks. BOReL (Dorfman et al., 2021) and SMAC (Pong et al., 2022) employ few-shot online adaptation, in which the former assumes known reward functions, and the latter focuses on offline meta-training

