CORRECTING DATA DISTRIBUTION MISMATCH IN OFFLINE META-REINFORCEMENT LEARNING WITH FEW-SHOT ONLINE ADAPTATION

Abstract

Offline meta-reinforcement learning (offline meta-RL) extracts knowledge from a given dataset of multiple tasks and achieves fast adaptation to new tasks. Recent offline meta-RL methods typically use task-dependent behavior policies (e.g., training RL agents on each individual task) to collect a multi-task dataset and learn an offline meta-policy. However, these methods always require extra information for fast adaptation, such as offline context for testing tasks or oracle reward functions. Offline meta-RL with few-shot online adaptation remains an open problem. In this paper, we first formally characterize a unique challenge under this setting: reward and transition distribution mismatch between offline training and online adaptation. This distribution mismatch can make offline policy evaluation unreliable and cause the standard adaptation methods of online meta-RL to fail. To address this challenge, we introduce a novel mechanism of data distribution correction, which ensures consistency between offline and online evaluation by filtering out out-of-distribution episodes during online adaptation. As few-shot out-of-distribution episodes usually have lower returns, we propose a Greedy Context-based data distribution Correction approach, called GCC, which greedily infers how to solve new tasks. GCC diversely samples "task hypotheses" from the current posterior belief and greedily selects hypotheses with higher returns to update the task belief. Our method is the first to provide effective online adaptation without additional information, and it can be combined with off-the-shelf context-based offline meta-training algorithms. Empirical experiments show that GCC achieves state-of-the-art performance on the Meta-World ML1 benchmark compared to baselines with/without offline adaptation.

1. INTRODUCTION

Human intelligence is capable of learning a wide variety of skills from past experience and can adapt to new environments by transferring skills with limited data. Current reinforcement learning (RL) has surpassed human-level performance (Mnih et al., 2015; Silver et al., 2017; Hafner et al., 2019) but requires a vast amount of experience. In many real-world applications, RL encounters two major challenges: multi-task efficiency and costly online interactions. In multi-task settings, such as robotic manipulation or locomotion (Yu et al., 2020b), agents are expected to solve new tasks with few-shot adaptation using previously learned knowledge. Moreover, collecting sufficient exploratory interactions is usually expensive or dangerous in robotics (Rafailov et al., 2021), autonomous driving (Yu et al., 2018), and healthcare (Gottesman et al., 2019). One popular paradigm for breaking this practical barrier is offline meta-reinforcement learning (offline meta-RL; Li et al., 2020; Mitchell et al., 2021), which trains a meta-RL agent with pre-collected offline multi-task datasets and enables fast policy adaptation to unseen tasks. Recent offline meta-RL methods utilize a multi-task dataset collected by task-dependent behavior policies (Li et al., 2020; Dorfman et al., 2021). They show promise by solving new tasks with few-shot adaptation. However, existing offline meta-RL approaches require additional information or assumptions for fast adaptation. For example, FOCAL (Li et al., 2020) and MACAW (Mitchell et al., 2021) conduct offline adaptation using extra offline contexts for meta-testing tasks. BOReL (Dorfman et al., 2021) and SMAC (Pong et al., 2022) employ few-shot online adaptation, in which the former assumes known reward functions, and the latter focuses on offline meta-training with unsupervised online samples (without rewards).
Therefore, achieving effective few-shot online adaptation relying purely on online experience remains an open problem for offline meta-RL. One particular challenge of meta-RL compared to meta-supervised learning is the need to learn how to explore in the testing environments (Finn & Levine, 2019). In offline meta-RL, the gap between the reward and transition distributions of offline training and online adaptation presents a unique conundrum for meta-policy learning, namely data distribution mismatch. Note that this data distribution mismatch differs from the distributional shift in offline single-task RL (Levine et al., 2020), which refers to the shift between the learned policy and the behavior policy rather than a gap in reward and transition distributions. As illustrated in Figure 1, when we collect an offline dataset using the expert policy for each task, the robot is meta-trained only on successful trajectories. The robot aims to adapt quickly to new tasks during meta-testing. However, when it tries the middle path and hits a stone, this failed adaptation trajectory does not match the offline training data distribution, which can lead to incorrect task inference or failed adaptation. To formally characterize this phenomenon, we adopt the perspective of Bayesian RL (BRL; Duff, 2002; Zintgraf et al., 2019; Dorfman et al., 2021), which maintains a task belief given the context history and learns a meta-policy on belief states. Our theoretical analysis shows that task-dependent data collection may lead the agent to out-of-distribution belief states during meta-testing, resulting in an inconsistency between offline meta-policy evaluation and online adaptation evaluation. This contrasts with the policy evaluation consistency of offline single-task RL (Fujimoto et al., 2019). To deal with this inconsistency, we can choose either to trust the offline dataset or to trust new experience and continue online exploration.
The latter cannot collect sufficient data during few-shot adaptation to learn a good policy from online data alone. Therefore, we adopt the former strategy and introduce an online data distribution correction mechanism: meta-policies with Thompson sampling (Strens, 2000) can filter out out-of-distribution episodes, with a theoretical consistency guarantee on policy evaluation. To realize our theoretical implications in practical settings, we propose a context-based offline meta-RL algorithm with a novel online adaptation mechanism, called Greedy Context-based data distribution Correction (GCC). To align the adaptation context with the meta-training distribution, GCC utilizes greedy task inference, which diversely samples "task hypotheses" and selects hypotheses with higher returns to update the task belief. In Figure 1, the robot can sample other "task hypotheses" (i.e., try other paths), and the failed adaptation trajectory (middle) will not be used for task inference because its low return marks it as out-of-distribution. To the best of our knowledge, our method is the first to design a dedicated context mechanism that achieves effective online adaptation for offline meta-RL, and it has the advantage of directly combining with off-the-shelf context-based offline meta-training algorithms. Our main contributions are to formalize a specific challenge (i.e., data distribution mismatch) in offline meta-RL with online adaptation and to propose a greedy context mechanism with theoretical motivation. We extensively evaluate GCC on didactic problems proposed by prior work (Rakelly et al., 2019; Zhang et al., 2021) and on the Meta-World ML1 benchmark with 50 task sets (Yu et al., 2020b). On the didactic problems, GCC demonstrates that our context mechanism can accurately infer task identity, whereas the original online adaptation methods suffer from out-of-distribution data.
Empirical results on the more challenging Meta-World ML1 benchmark show that GCC significantly outperforms baselines with few-shot online adaptation, and achieves performance better than or comparable to offline adaptation baselines with expert context.

2.1. STANDARD META-RL

The goal of meta-RL (Finn et al., 2017; Rakelly et al., 2019) is to train a meta-policy that can quickly adapt to new tasks using N adaptation episodes. The standard meta-RL setting deals with a distribution p(κ) over MDPs, in which each task κ_i sampled from p(κ) is a finite-horizon MDP (Zintgraf et al., 2019; Yin & Wang, 2021). κ_i is defined by a tuple (S, A, R, H, P_{κ_i}, R_{κ_i}), including state space S, action space A, reward space R, planning horizon H, transition function P_{κ_i}(s'|s, a), and reward function R_{κ_i}(r|s, a). Denote by K the space of tasks κ_i. In this paper, we assume the dynamics function P and reward function R may vary across tasks while sharing a common structure. Meta-RL algorithms repeatedly sample batches of tasks to train a meta-policy. In meta-testing, agents aim to rapidly adapt a good policy to new tasks drawn from p(κ). From the perspective of Bayesian RL (BRL; Ghavamzadeh et al., 2015), recent meta-RL methods (Zintgraf et al., 2019) utilize a Bayes-adaptive Markov decision process (BAMDP; Duff, 2002) to formalize standard meta-RL. A BAMDP is the belief MDP (Kaelbling et al., 1998) of a special partially observable MDP (POMDP; Astrom, 1965) whose unobserved state information represents the unknown task identity during the N adaptation episodes. A BAMDP is defined as a tuple M^+ = (S^+, A, R, H^+, P_0^+, P^+, R^+) (Zintgraf et al., 2019), in which S^+ = S × B is the hyper-state space, B is the space of task beliefs over meta-RL MDPs, A is the action space, R is the reward space, H^+ = N × H is the planning horizon across adaptation episodes, P_0^+(s_0^+) is the initial hyper-state distribution representing the task distribution p(κ), P^+(s_{t+1}^+ | s_t^+, a_t, r_t) is the transition function, and R^+(r_t | s_t^+, a_t) is the reward function. A meta-policy π^+(a_t | s_t^+) on a BAMDP prescribes a distribution over actions for each hyper-state s_t^+ = (s_t, b_t).
The objective of meta-RL agents is to find a meta-policy π + that maximizes expected return, i.e., online policy evaluation denoted by J M + (π + ). Formal descriptions are deferred to Appendix A.1.3.
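To make the belief component b_t of a hyper-state concrete, here is a minimal sketch (our illustration, not an equation from the paper) of one Bayesian belief update over a finite task set after observing a transition; the two-task setup and the `belief_update` helper are hypothetical.

```python
import numpy as np

def belief_update(belief, likelihoods):
    """One Bayes step on the task belief b_t.

    belief      : shape (num_tasks,), current posterior over tasks
    likelihoods : shape (num_tasks,), P_kappa(r_t, s_{t+1} | s_t, a_t) per task
    """
    posterior = belief * likelihoods
    total = posterior.sum()
    if total == 0.0:
        # The observation is impossible under every hypothesis, so the
        # belief cannot be updated (an out-of-distribution observation).
        raise ValueError("zero-probability observation")
    return posterior / total

# Two tasks differing only in reward: kappa_0 rewards the taken action with
# r = 1, kappa_1 with r = 0. Observing r = 1 collapses the belief onto kappa_0.
prior = np.array([0.5, 0.5])
posterior = belief_update(prior, likelihoods=np.array([1.0, 0.0]))
print(posterior)   # -> [1. 0.]
```

The hyper-state then advances as s_{t+1}^+ = (s_{t+1}, b_{t+1}) with this updated belief.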

2.2. OFFLINE META-RL

In the offline meta-RL setting, a meta-learner only has access to an offline multi-task dataset D^+ and is not allowed to interact with the environment during meta-training (Li et al., 2020). Recent offline meta-RL methods (Dorfman et al., 2021) typically use task-dependent behavior policies p(µ|κ), the random variable of the behavior policy µ(a|s) conditioned on the random variable of the task κ. For brevity, we write [µ] = p(µ|κ). Similar to related work on offline RL (Yin & Wang, 2021), we assume that D^+ consists of multiple i.i.d. trajectories collected by executing the task-dependent policies [µ] in M^+. We define the reward and transition distribution of task-dependent data collection by P_{M^+,[µ]} (Jin et al., 2021), i.e., for each step t in a trajectory,

P_{M^+,[µ]}(r_t, s_{t+1} | s_t^+, a_t) ∝ E_{κ_i ∼ p(κ), µ_i ∼ p(µ|κ_i)} [ P_{κ_i}(r_t, s_{t+1} | s_t, a_t) · p_{M^+}(s_t^+ | κ_i, µ_i) ],   (1)

where the reward and transition distribution of data collection with µ_i in a task κ_i is defined as

P_{κ_i}(r_t, s_{t+1} | s_t, a_t) = R_{κ_i}(r_t | s_t, a_t) · P_{κ_i}(s_{t+1} | s_t, a_t),   (2)

and p_{M^+}(s_t^+ | κ_i, µ_i) denotes the probability of reaching s_t^+ when executing µ_i in task κ_i. Note that the offline dataset D^+ can be narrow, leaving a large number of state-action pairs uncovered. These unseen state-action pairs may be erroneously estimated to have unrealistic values, a phenomenon called extrapolation error (Fujimoto et al., 2019). To overcome extrapolation error in offline RL, related work (Fujimoto et al., 2019) introduces batch-constrained RL, which restricts the action space to force the agent's policy selection to respect a given dataset. A policy π^+ is batch-constrained by D^+ if π^+(a | s^+) = 0 for all (s^+, a) tuples not contained in D^+.
Offline RL (Liu et al., 2020; Chen & Jiang, 2019) approximates policy evaluation for a batch-constrained policy π^+ by sampling from an offline dataset D^+, denoted by J_{D^+}(π^+) and called Approximate Dynamic Programming (ADP; Bertsekas & Tsitsiklis, 1995). During meta-testing, RL agents perform online adaptation using a meta-policy π^+ in new tasks drawn from the meta-RL task distribution. The reward and transition distribution of data collection with π^+ in M^+ during adaptation is defined by

P_{M^+}(r_t, s_{t+1} | s_t^+, a_t) = R^+(r_t | s_t^+, a_t) · P^+(s_{t+1} | s_t^+, a_t),   (3)

where R^+ and P^+ are defined in M^+. Detailed formulations are deferred to Appendix A.1.4.
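As a toy sketch of the batch-constrained property (our illustration; representing D^+ as a set of visited state-action pairs and the helper name are assumptions, not the paper's implementation):

```python
import numpy as np

def batch_constrained_action(q_values, state, dataset_pairs):
    """Greedy action among those appearing with `state` in the offline data,
    so that pi+(a | s+) = 0 for every (s+, a) tuple not contained in D+."""
    allowed = [a for a in range(len(q_values)) if (state, a) in dataset_pairs]
    if not allowed:
        raise ValueError(f"state {state!r} is not covered by the dataset")
    return max(allowed, key=lambda a: q_values[a])

# The dataset only covers actions 0 and 2 in state "s0"; action 1's high value
# is an extrapolation-error artifact, and the constraint prevents picking it.
D = {("s0", 0), ("s0", 2)}
q = np.array([0.3, 9.9, 0.7])
print(batch_constrained_action(q, "s0", D))   # -> 2
```

The same restriction is what guarantees, in Section 3, that any tuple reachable by π^+ is also covered by the dataset.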

3. THEORY: DATA DISTRIBUTION MISMATCH CORRECTION

Consistency between training and testing conditions is an important principle in machine learning (Finn & Levine, 2019). Offline meta-RL with task-dependent behavior policies (Dorfman et al., 2021) faces a special challenge: the reward and transition distributions of offline training and online adaptation may not match. We first build a theory of data distribution mismatch to understand this phenomenon. Our theoretical analysis is based on Bayesian RL (BRL; Zintgraf et al., 2019) and demonstrates that data distribution mismatch may lead to out-of-distribution belief states during meta-testing, resulting in a large gap between online and offline policy evaluation. To address this challenge, we propose a new mechanism to correct the data distribution, which transforms the BAMDPs (Duff, 2002) by enriching the overall beliefs with information about the task-dependent behavior policies. The transformed BAMDPs provide reliable policy evaluation by filtering out out-of-distribution episodes.

3.1. DATA DISTRIBUTION MISMATCH IN OFFLINE META-RL

We define data distribution mismatch in offline meta-RL as follows.

Definition 1 (Data Distribution Mismatch). In a BAMDP M^+, for task-dependent behavior policies [µ] and a batch-constrained meta-policy π^+, the data distribution mismatch between π^+ and [µ] is defined by: ∃ (s_t^+, a_t) such that p_{M^+}^{π^+}(s_t^+, a_t) > 0 and P_{M^+}(r_t, s_{t+1} | s_t^+, a_t) ≠ P_{M^+,[µ]}(r_t, s_{t+1} | s_t^+, a_t), where p_{M^+}^{π^+}(s_t^+, a_t) is the probability of reaching the tuple (s_t^+, a_t) while executing π^+ in M^+ (formal definition deferred to Appendix A.2), and P_{M^+}, P_{M^+,[µ]} are the reward and transition distributions of data collection with π^+ and [µ] defined in Eq. (1) and (3), respectively.

The data distributions induced by π^+ and [µ] mismatch when their reward and transition distributions differ on a tuple (s_t^+, a_t) that the agent can reach by executing π^+ in M^+, i.e., p_{M^+}^{π^+}(s_t^+, a_t) > 0. Note that if π^+ can reach a tuple (s_t^+, a_t), this tuple is guaranteed to be covered by the offline dataset, i.e., p_{M^+}^{[µ]}(s_t^+, a_t) > 0, because a batch-constrained policy π^+ will not select an action outside of the dataset collected by [µ], as introduced in Section 2.2.

Theorem 1. There exists a BAMDP M^+ with task-dependent behavior policies [µ] such that, for any batch-constrained meta-policy π^+, the data distribution induced by π^+ and [µ] does not match.

Figure 2: A concrete example with v meta-RL tasks, one state s_0, v actions, v behavior policies, horizon H = 1 per episode, and v adaptation episodes, where v ≥ 3; task κ_i gives r_{κ_i}(a_i) = 1 and r_{κ_i}(a_j) = 0 for j ≠ i, with p(κ_i) = 1/v, p(µ_i|κ_i) = 1, and µ_i(a_i|s_0) = 1.

As a concrete example, we construct the offline meta-RL setting shown in Figure 2. There are v meta-RL tasks K = {κ_1, . . . , κ_v} and v behavior policies {µ_1, . . . , µ_v}, where v ≥ 3. Each task κ_i has one state s_0, v actions, and horizon H = 1. In task κ_i, the agent receives reward 1 for performing action a_i. During adaptation, the agent can interact with the environment for v episodes. The task distribution is uniform, the behavior policy of task κ_i is µ_i, and each µ_i always performs a_i. When any batch-constrained meta-policy π^+ selects an action ã at the initial state s_0^+, we find that P_{M^+}(r = 1 | s_0^+, ã) = 1/v ≠ P_{M^+,[µ]}(r = 1 | s_0^+, ã) = 1: there is probability 1/v of sampling the testing task whose reward for ã is 1, whereas every reward in the offline dataset collected by [µ] is 1.

Fact 1. With any probability 1 − δ ∈ [0, 1), there exists a BAMDP M^+ with task-dependent behavior policies [µ] such that, for any batch-constrained meta-policy π^+, the agent will visit out-of-distribution belief states during meta-testing due to data distribution mismatch.

In the example shown in Figure 2, an offline multi-task dataset D^+ is drawn from the task-dependent data collection P_{M^+,[µ]}.
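The mismatch in the Figure 2 example can also be checked numerically; the sketch below (our illustration) simulates the offline data collection P_{M^+,[µ]} and the online adaptation distribution P_{M^+} for a fixed first action:

```python
import random

def offline_reward(v, rng):
    """A step logged in D+: a task kappa_i is sampled, its behavior policy
    mu_i always plays a_i, and r_{kappa_i}(a_i) = 1 by construction."""
    _ = rng.randrange(v)       # sample a task; mu_i then deterministically plays a_i
    return 1

def online_reward(v, action, rng):
    """A meta-test step: the task is drawn uniformly, so a fixed action
    a~ succeeds only when it matches the sampled task."""
    task = rng.randrange(v)
    return 1 if action == task else 0

rng, v, n = random.Random(0), 10, 100_000
assert all(offline_reward(v, rng) == 1 for _ in range(n))   # P_{M+,[mu]}(r=1 | s0+, a~) = 1
hits = sum(online_reward(v, 0, rng) for _ in range(n))
print(hits / n)                                             # ~ 1/v = 0.1
```

Every logged reward equals 1, while a fixed first action only earns reward with frequency about 1/v online, matching the mismatch stated above.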
Since all rewards in D^+ are 1, the belief states in D^+ are of two types: (i) the uniform belief at s_0^+, under which every task is possible, and (ii) beliefs that identify task i after receiving reward 1 for action a_i. For any batch-constrained meta-policy π^+ selecting an action ã_j at s_0^+ during meta-testing, there is probability 1 − 1/v of receiving reward 0, after which the belief state becomes "excluding task j", which is not contained in D^+ when v ≥ 3. For any δ ∈ (0, 1], choosing v > 1/δ, the agent visits out-of-distribution belief states during adaptation with probability at least 1 − δ. This phenomenon directly leads to the following proposition, since RL agents that visit out-of-distribution belief states during meta-testing in BAMDPs induce unreliable offline policy evaluation.

Proposition 1. There exists a BAMDP M^+ with task-dependent behavior policies [µ] such that, for any batch-constrained meta-policy π^+, the gap of policy evaluation of π^+ between offline meta-training and online adaptation is at least (H^+ − 1)/2, where H^+ is the planning horizon of M^+.

In Figure 2, the offline dataset D^+ only contains reward 1, thus for each batch-constrained meta-policy π^+, the offline evaluation of π^+ on D^+ is J_{D^+}(π^+) = H^+ = vH, as defined in Section 2.2. The optimal meta-policy π^{+,*} in this example enumerates a_1, . . . , a_v until the task identity is revealed by the action with reward 1. Even π^{+,*} needs to explore in the testing environment, and its online policy evaluation is J_{M^+}(π^{+,*}) = (H^+ + 1)/2. The detailed proof is deferred to Appendix A.2. Thus, the evaluation gap between offline meta-training and online adaptation is J_{D^+}(π^+) − J_{M^+}(π^+) ≥ J_{D^+}(π^+) − J_{M^+}(π^{+,*}) = (H^+ − 1)/2. Proposition 1 suggests that offline policy evaluation can fail when visiting out-of-distribution belief states. In few-shot online adaptation, we choose to trust the offline dataset, and it is imperative to build a connection between online and offline evaluation for effective offline meta-policy training.
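The evaluation gap of Proposition 1 can be verified exactly on the Figure 2 example; the small sketch below (our illustration, with H = 1 so that H^+ = v) computes both sides:

```python
from fractions import Fraction

def offline_evaluation(v):
    """J_{D+}(pi+) in the Figure 2 example: the dataset only contains
    reward-1 steps, so offline ADP credits reward 1 on every one of the
    H+ = v * H = v steps (H = 1 here)."""
    return Fraction(v)

def online_evaluation_optimal(v):
    """J_{M+}(pi+*): enumerate a_1, ..., a_v; if the true task is kappa_i
    (probability 1/v each), the total return over the v episodes is
    1 + (v - i) = v - i + 1."""
    return Fraction(sum(v - i + 1 for i in range(1, v + 1)), v)

v = 9                                        # so H+ = v = 9
gap = offline_evaluation(v) - online_evaluation_optimal(v)
print(online_evaluation_optimal(v), gap)     # -> 5 4, i.e. (H+ + 1)/2 and (H+ - 1)/2
```

The averaged enumeration return is (1/v) Σ_{i=1}^{v} (v − i + 1) = (v + 1)/2, so the gap is exactly (H^+ − 1)/2.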

3.2. DATA DISTRIBUTION CORRECTION MECHANISM

To correct the data distribution mismatch, we introduce transformed BAMDPs, which maintain an overall belief about the task and the behavior policies given the current context history. Compared to the original BAMDPs stated in Section 2.2, transformed BAMDPs incorporate additional information about the offline data collection into the beliefs. This transformation introduces an extra condition on the online adaptation process, as stated by the following fact.

Fact 2. For feasible Bayesian belief updating, transformed BAMDPs confine the agent to in-distribution belief states during meta-testing.

During online adaptation, RL agents construct a hyper-state s̃_t^+ = (s_t, b̃_t) from the context history and execute a meta-policy π̃^+(a_t | s̃_t^+). The new belief b̃_t accounts for the uncertainty over both task MDPs and task-dependent behavior policies. In contrast to Fact 1, transformed BAMDPs do not allow the agent to visit belief states that are out-of-distribution with respect to D^+. Otherwise, the context history would conflict with the belief about behavior policies, i.e., RL agents cannot update the belief b̃_t after observing an event that they believe to have probability zero. In the example of Figure 2, suppose the agent takes action a_1 and receives reward 0 during meta-testing. After observing action a_1, the posterior belief would become p(κ_1, µ_1 | b̃_t) = 1, which is contradictory because the reward of a_1 in κ_1 is 1. To support feasible Bayesian belief updating, transformed BAMDPs require RL agents to filter out out-of-distribution episodes, inducing the following theorem.

Theorem 2. In a transformed BAMDP M̃^+, for each task-dependent behavior policy [µ] and batch-constrained meta-policy π̃^+, the data distribution induced by π̃^+ and [µ] matches after filtering out out-of-distribution episodes in online adaptation.
Moreover, the policy evaluation of π̃^+ in offline meta-training and online adaptation is asymptotically consistent as the offline dataset grows. The proof of consistent policy evaluation in Theorem 2 is similar to that of offline single-task RL (Fujimoto et al., 2019) and is deferred to Appendix A.3. This theorem indicates that we can design a dedicated context mechanism to correct the data distribution and guarantee the final performance of online adaptation by maximizing the expected future return during offline meta-training. Moreover, we prove that meta-policies with Thompson sampling (Strens, 2000) filter out out-of-distribution episodes in M̃^+ with high probability as the offline dataset D^+ grows (see Appendix A.5); such meta-policies can then use in-distribution context to infer how to solve meta-testing tasks.

4. METHOD: GREEDY CONTEXT-BASED DATA DISTRIBUTION CORRECTION

However, in practice, accurate out-of-distribution quantification is challenging (Yu et al., 2021; Wang et al., 2021). For example, we evaluate a popular empirical measure from offline RL, uncertainty estimation via an ensemble (Yu et al., 2020c; Kidambi et al., 2020), and find that such quantification works poorly in offline meta-RL (see Section 5.1). To address this challenge, we introduce a novel Greedy Context-based data distribution Correction (GCC) method, which elaborates a greedy online adaptation mechanism and can be directly combined with off-the-shelf offline meta-RL algorithms. GCC consists of two main components: (i) an off-the-shelf context-based offline meta-training method that extracts meta-knowledge from a given multi-task dataset, and (ii) a greedy context-based online adaptation that realizes a selective context mechanism to infer how to solve meta-testing tasks. The whole algorithm is illustrated in Algorithm 1.

4.1. CONTEXT-BASED OFFLINE META-TRAINING

To support effective offline meta-training, we employ a state-of-the-art off-the-shelf context-based algorithm, FOCAL (Li et al., 2020), which follows the algorithmic framework of the popular meta-RL approach PEARL (Rakelly et al., 2019). In this meta-training paradigm, the task identity κ is modeled by a latent task embedding z. GCC meta-trains a context encoder q(z|c), a policy π(a|s, z), and a value function Q(s, a, z), where the context c comprises states, actions, rewards, and next states. The encoder q infers a task belief over the latent task variable z based on the received context. We use q(z) to denote the prior distribution, i.e., the case c = ∅. The policy π and value function Q are conditioned on the latent task variable z, whose representation can be trained end-to-end on the RL losses of π or Q. In addition to the gradients from π or Q, recent offline meta-RL methods (Li et al., 2020; Yuan & Lu, 2022) also use a contrastive loss to help the representation of z distinguish different tasks. We argue that meta-training the inference network q on a given dataset implicitly incorporates information about the offline data collection into the belief, as discussed in Definition 3. Formally, in the offline setting, the prior distribution q(z) approximately represents a distribution of the latent task variable z over meta-training tasks, i.e., a sample z drawn from q(z) is equivalent to sampling a meta-training task and inferring z from its offline context in the dataset. During meta-testing, few-shot out-of-distribution episodes may confuse the context encoder q(z|c), so RL agents need to use in-distribution context to obtain reliable task beliefs.
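As a concrete sketch of context-based task inference, the snippet below shows a product-of-Gaussians posterior in the style of PEARL (Rakelly et al., 2019); this is an illustrative parameterization, not necessarily the exact encoder used by FOCAL or GCC, and `encode` stands in for the learned network mapping one transition to a Gaussian factor.

```python
import numpy as np

def posterior_over_z(context, encode, prior_mu, prior_var):
    """Product-of-Gaussians posterior q(z | c) over a latent task variable:
    the prior and one Gaussian factor per context transition are multiplied,
    which for diagonal Gaussians amounts to summing precisions and
    precision-weighted means."""
    precisions = [1.0 / prior_var]
    weighted_means = [prior_mu / prior_var]
    for transition in context:
        mu, var = encode(transition)
        precisions.append(1.0 / var)
        weighted_means.append(mu / var)
    post_var = 1.0 / np.sum(precisions, axis=0)
    post_mu = post_var * np.sum(weighted_means, axis=0)
    return post_mu, post_var

# With an empty context c, the posterior reduces to the prior q(z).
mu0, var0 = np.zeros(2), np.ones(2)
mu, var = posterior_over_z([], None, mu0, var0)
print(mu, var)     # -> [0. 0.] [1. 1.]

# One confident factor at mean 2 pulls the posterior halfway from the prior.
enc = lambda tr: (np.full(2, 2.0), np.ones(2))
mu2, var2 = posterior_over_z([("s", "a", 1.0, "s'")], enc, mu0, var0)
print(mu2, var2)   # -> [1. 1.] [0.5 0.5]
```

This permutation-invariant structure is what lets a few out-of-distribution transitions corrupt the inferred belief, motivating the selective context mechanism below.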

4.2. GREEDY CONTEXT-BASED META-TESTING

GCC is a context-based meta-RL algorithm (Rakelly et al., 2019) whose adaptation protocol follows the framework of Thompson sampling (Strens, 2000). The RL agent iteratively updates the task belief by interacting with the environment and improves its policy based on the "task hypothesis" sampled from the current belief. This adaptation paradigm generalizes Bayesian inference (Thompson, 1933) and has a solid background in RL theory (Agrawal & Goyal, 2012). To ensure feasible Bayesian belief updating, we adopt a heuristic from offline RL (Fujimoto et al., 2019): few-shot out-of-distribution episodes generated by an offline-learned meta-policy usually have lower returns, since the offline training paradigm does not well-optimize meta-policies on out-of-distribution states. The contrapositive is that episodes with higher returns have a higher probability of being in-distribution, so the true return of episodes can serve as a good out-of-distribution quantification. GCC thus performs a greedy posterior belief update in two steps at each iteration: (i) diverse sampling of latent task variables and (ii) a greedy context mechanism that selects a task embedding with higher return to update the belief. For each iteration t, denote the current task belief by b_t, the meta-testing task by κ_test, and the number of belief-updating iterations by n_it. Diverse sampling of latent task variables generates n_t^z candidates of the task embedding z_t, denoted by Z_t = {z_t^i}_{i=0}^{n_t^z − 1}, to provide varied policies π(a|s, z_t^i) for the subsequent context selection mechanism. Due to the contrastive loss applied to the representation of z (Li et al., 2020), a closer task embedding z tends to yield a more similar policy.
Hence, to encourage policy diversity, each candidate z_t^i ∈ Z_t is chosen as

z_t^i = argmax_{z ∈ Ẑ_t^i} min( min_{j<i} ||z − z_t^j||_2, min_{t'<t, k<n_{t'}^z} ||z − z_{t'}^k||_2 ),  where Ẑ_t^i = {z ∼ b_t}_{u=0}^{n_z − 1},

i.e., z_t^i maximizes the minimum distance to all previously selected embeddings among n_z fresh samples from the belief. This greedy method is similar to farthest-point clustering (Gonzalez, 1985), a 2-approximation algorithm for the NP-hard minimax facility location problem (Fowler et al., 1981), which seeks a set of locations minimizing the maximum distance from the other facilities to the set. The greedy selective context mechanism then selects a latent task variable z_t ∈ Z_t with higher return to update the task belief b_t. To evaluate the return of each z_t^i, GCC uses the policy π(a|s, z_t^i) to draw n_e episodes in κ_test, denoted by E_t^i = {e_t^{i,j}}_{j=0}^{n_e − 1}. The policy evaluation of z_t^i is approximated by sampling:

J_{κ_test}(z_t^i) = J_{κ_test}(π(a|s, z_t^i)) ≈ J_{κ_test}(E_t^i) = (1/n_e) Σ_{j=0}^{n_e − 1} Σ_{k=0}^{H−1} r_k(e_t^{i,j}),

where J_{κ_test}(E_t^i) is the average return of the episodes E_t^i and r_k(e_t^{i,j}) is the reward at the k-th step of episode e_t^{i,j}. The task belief update in GCC consists of two phases: (i) an initial stage and (ii) an iterative optimization process. In the initial stage, GCC seeks a reliable initial task inference z_0 using a large number n_0^z of diverse candidates: the initial context is c_0 = argmax_{E_0 ∈ {E_0^i}} J_{κ_test}(E_0), GCC keeps the corresponding task embedding z_0, and derives the posterior belief b_1 = q(z|c_0). In the following iterations, GCC runs an iterative optimization to maximize final performance during few-shot online adaptation: for t ≥ 1, let n_t^z = 1; if J_{κ_test}(E_t^0) > J_{κ_test}(c_{t−1}), set c_t = E_t^0 and update the posterior belief b_{t+1} = q(z | ∪_{t'≤t} c_{t'}); otherwise set c_t = c_{t−1} and keep the belief b_{t+1} = b_t.
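The diverse-sampling step can be sketched as greedy farthest-point selection over embeddings drawn from the belief; the helper below is our illustration, with `sample_from_belief` standing in for drawing z ∼ b_t:

```python
import numpy as np

def diverse_candidates(sample_from_belief, n_candidates, n_proposals, previous, rng):
    """Greedy farthest-point selection of task-embedding candidates.

    For each slot, draw `n_proposals` embeddings from the current belief and
    keep the one maximizing its minimum L2 distance to every embedding chosen
    so far (earlier slots plus embeddings from previous iterations)."""
    chosen = []
    for _ in range(n_candidates):
        proposals = [sample_from_belief(rng) for _ in range(n_proposals)]
        pool = previous + chosen
        if pool:
            best = max(proposals,
                       key=lambda z: min(np.linalg.norm(z - z_old) for z_old in pool))
        else:
            best = proposals[0]        # nothing chosen yet: take any sample
        chosen.append(best)
    return chosen

rng = np.random.default_rng(0)
sample = lambda r: r.normal(size=2)    # stands in for z ~ b_t
Z = diverse_candidates(sample, n_candidates=4, n_proposals=16, previous=[], rng=rng)
print(len(Z))                          # -> 4 well-spread embeddings
```

As in farthest-point clustering, each new candidate is pushed away from everything selected before, so the induced policies cover distinct task hypotheses.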
The final policy π(a|s, z_t) is conditioned on the optimal task embedding z_t with the highest return.

Algorithm 1 GCC: Greedy Context-based data distribution Correction
1: Require: an offline multi-task dataset D^+, a meta-testing task κ_test ∼ p(κ), the number of iterations n_it, the number of initial candidates n_0^z, and a context-based offline meta-training algorithm A (i.e., FOCAL)
2: Randomly initialize a context encoder q(z|c), a policy π(a|s, z), and a value function Q(s, a, z)
3: Offline meta-train q, π, and Q with algorithm A on the dataset D^+  ▷ Offline meta-training
4: Generate a prior task distribution q(z) using the dataset D^+  ▷ Start online meta-testing
5: Collect diverse adaptation episodes {E_0^i}_{i=0}^{n_0^z − 1} using q(z) and π in κ_test
6: Compute the greedy context c_0, task embedding z_0, and posterior belief b_1  ▷ Initial stage
7: for t = 1 . . . n_it − 1 do  ▷ Iterative optimization process
8:     Collect a diverse episode E_t^0 using b_t and π(a|s, z_t) in κ_test
9:     Compute the greedy context c_t and posterior belief b_{t+1}
10: Derive the final policy π_out(a|s, z_t) with the optimal task embedding z_t
11: Return: π_out
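The control flow of Algorithm 1's meta-testing phase can be sketched as follows (our illustration; `propose` stands in for diverse sampling from the current belief, `evaluate` for rolling out π(a|s, z) in κ_test and averaging episode returns, and belief re-inference via q is elided):

```python
def gcc_adaptation(propose, evaluate, n_init, n_it):
    """Sketch of GCC meta-testing: an initial stage keeps the highest-return
    candidate among n_init diverse embeddings; the iterative stage proposes
    one candidate per step and accepts it only when its return beats the
    incumbent context's return."""
    candidates = propose(n_init)           # initial stage (lines 5-6)
    best_z = max(candidates, key=evaluate)
    best_ret = evaluate(best_z)
    for _ in range(1, n_it):               # iterative stage (lines 7-9)
        z = propose(1)[0]
        ret = evaluate(z)
        if ret > best_ret:                 # greedy context update
            best_z, best_ret = z, ret
    return best_z                          # condition the final policy on z (line 10)

# Deterministic toy run: embeddings are scalars and the rollout return peaks
# at the (unknown) true task value 0.7.
evaluate = lambda z: 1.0 - abs(z - 0.7)
batches = iter([[0.1, 0.9, 0.6, 0.3], [0.65], [0.2], [0.72]])
z_star = gcc_adaptation(lambda k: next(batches), evaluate, n_init=4, n_it=4)
print(z_star)   # -> 0.72
```

Because low-return (likely out-of-distribution) episodes never replace the incumbent context, they never contaminate the task belief, which is the data distribution correction in practice.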

5. EXPERIMENTS

In this section, we first study a didactic example to analyze the out-of-distribution problem and show how GCC alleviates it via its greedy selective mechanism. Then we conduct large-scale experiments on Meta-World ML1 (Yu et al., 2020a), a popular meta-RL benchmark consisting of 50 robot arm manipulation task sets. Finally, we perform ablation studies to analyze GCC's sensitivity to hyper-parameter settings and dataset quality. Following FOCAL (Li et al., 2020), we use expert-level datasets sampled by policies trained with SAC on the corresponding tasks. We compare against FOCAL (Li et al., 2020) and MACAW (Mitchell et al., 2021), as well as their online adaptation variants. We also compare against BOReL (Dorfman et al., 2021), an offline meta-RL algorithm built upon VariBAD (Zintgraf et al., 2019). For a fair comparison, we evaluate a variant of BOReL that does not utilize oracle reward functions, as introduced in the original paper (Dorfman et al., 2021). FOCAL is built upon PEARL (Rakelly et al., 2019) and uses contrastive losses to learn context embeddings, while MACAW is a MAML-based (Finn et al., 2017) algorithm incorporating AWR (Peng et al., 2019). Both FOCAL and MACAW were originally proposed for the offline adaptation setting (i.e., with expert context). For online adaptation, we use online experience instead of expert contexts and adopt the adaptation protocols of PEARL and MAML, respectively. Evaluation results are averaged over six random seeds, and variance is measured by a 95% bootstrapped confidence interval. Detailed hyper-parameter settings are deferred to Appendix B.

5.1. DIDACTIC EXAMPLE

We introduce Point-Robot, a simple 2D navigation task set commonly used in meta-RL (Rakelly et al., 2019; Zhang et al., 2021) . Figure 3 

5.2. MAIN RESULTS

We evaluate on Meta-World ML1 (Yu et al., 2020a), a popular meta-RL benchmark consisting of 50 robot arm manipulation task sets, each containing 50 tasks with different goals. For each task set, we use 40 tasks for meta-training and reserve the remaining 10 tasks for meta-testing. As shown in Table 1, GCC significantly outperforms baselines under the online context setting. With expert contexts, FOCAL and MACAW both achieve reasonable performance. GCC achieves performance better than or comparable to baselines with expert contexts, which implies that expert contexts may not be necessary for offline meta-RL. Under online contexts, FOCAL fails due to the data distribution mismatch between offline training and online adaptation. MACAW can fine-tune online since it is based on MAML, but it also suffers from the distribution mismatch problem, and online fine-tuning can hardly improve its performance within a few adaptation episodes. BOReL fails on most of the tasks, since BOReL without oracle reward functions also suffers from the distribution mismatch problem, which is consistent with the results in the original paper. We also test the uncertainty estimation method on several representative tasks, where it performs similarly to FOCAL; detailed results are deferred to Table 5 in Appendix E.1. Table 2 shows performance on 20 representative Meta-World ML1 task sets, as well as a sparse-reward version of Point-Robot and Cheetah-Vel, which are popular meta-RL tasks (Li et al., 2020). GCC achieves remarkable performance on most tasks and may fail on some hard tasks where offline meta-training itself is difficult. We also find that GCC achieves performance better than or comparable to baselines with expert contexts on 33 out of the 50 task sets. Detailed performance on all 50 task sets, as well as comparisons to baselines with expert contexts, is deferred to Appendix E.2.
We further perform ablation studies on hyper-parameter settings and dataset qualities, with results deferred to Appendix F. The results demonstrate that GCC is generally robust to the choice of hyper-parameters and performs well on medium-quality datasets.

6. RELATED WORK

In the literature, offline meta-RL methods utilize a context-based (Rakelly et al., 2019) or gradient-based (Finn et al., 2017) meta-RL framework to solve new tasks with few-shot adaptation. They utilize the techniques of contrastive learning (Li et al., 2020; Yuan & Lu, 2022), more expressive power (Mitchell et al., 2021), or reward relabeling (Dorfman et al., 2021; Pong et al., 2022), together with various popular offline single-task RL tricks, e.g., using KL divergence (Wu et al., 2019; Peng et al., 2019; Nair et al., 2020) or explicitly constraining the policy to be close to the dataset (Fujimoto et al., 2019; Zhou et al., 2020). However, these methods always require extra information for fast adaptation, such as offline contexts for testing tasks (Li et al., 2020; Mitchell et al., 2021; Yuan & Lu, 2022), oracle reward functions (Dorfman et al., 2021), or free interactions without reward supervision (Pong et al., 2022). To address this challenge, we propose GCC, a greedy context mechanism with theoretical motivation, to perform effective online adaptation without requiring additional information. The concepts of distribution shift in z-space in Pong et al. (2022) and MDP ambiguity in Dorfman et al. (2021) are related to the data distribution mismatch proposed in this paper. We reveal that task-dependent behavior policies induce different reward and transition distributions between offline meta-training and online adaptation; this mismatch, combined with the "policy" distribution shift of offline single-task RL (Levine et al., 2020), constitutes two essential factors in the phenomenon of distribution shift in z-space (Pong et al., 2022). After filtering out these out-of-distribution data, GCC can maintain an overall belief about the task with behavior policies to address MDP ambiguity.

7. CONCLUSION

This paper formalizes data distribution mismatch in offline meta-RL with online adaptation and introduces GCC, a novel context-based online adaptation approach. Inspired by our theoretical implications, GCC adopts a greedy context mechanism that filters out out-of-distribution episodes with lower returns for online data distribution correction. We demonstrate that GCC performs accurate task inference and achieves state-of-the-art performance on the Meta-World ML1 benchmark with 50 task sets. Compared to offline adaptation baselines with expert contexts, GCC also performs better or comparably, suggesting that offline contexts may not be necessary for the testing environments. One potential future direction is to extend GCC to gradient-based online adaptation methods with data distribution correction, and to deal with the task distribution shift problem in offline meta-RL (which is not considered in GCC).

A THEORY

A.1 BACKGROUND

Throughout this paper, for a given non-negative integer $N \in \mathbb{Z}^+$, we use $[N]$ to denote the set $\{0, 1, \ldots, N-1\}$. For any object that is a function of/distribution over $\mathcal{S}$, $\mathcal{S} \times \mathcal{A}$, $\mathcal{S} \times \mathcal{A} \times \mathcal{S}$, or $\mathcal{S} \times \mathcal{A} \times \mathcal{R}$, we will treat it as a vector whenever convenient.

A.1.1 FINITE-HORIZON SINGLE-TASK RL

In single-task RL, an agent interacts with a Markov Decision Process (MDP) to maximize its cumulative reward (Sutton & Barto, 2018). A finite-horizon MDP is defined as a tuple $M = (\mathcal{S}, \mathcal{A}, \mathcal{R}, H, P, R)$ (Zintgraf et al., 2019; Du et al., 2019), where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{R}$ is the reward space, $H \in \mathbb{Z}^+$ is the planning horizon, $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, which takes a state-action pair and returns a distribution over states, and $R: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{R})$ is the reward distribution. In particular, we consider finite state, action, and reward spaces in the theoretical analysis, i.e., $|\mathcal{S}| < \infty$, $|\mathcal{A}| < \infty$, $|\mathcal{R}| < \infty$. Without loss of generality, we assume a fixed initial state $s_0$. A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ prescribes a distribution over actions for each state. The policy $\pi$ induces a (random) $H$-horizon trajectory $\tau_H^\pi = (s_0, a_0, r_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}, r_{H-1})$, where $a_0 \sim \pi(s_0)$, $r_0 \sim R(s_0, a_0)$, $s_1 \sim P(s_0, a_0)$, $a_1 \sim \pi(s_1)$, etc. To streamline our analysis, for each $h \in [H]$, we use $\mathcal{S}_h \subseteq \mathcal{S}$ to denote the set of states at the $h$-th timestep, and we assume the sets $\mathcal{S}_h$ do not intersect with each other. To simplify notation, we assume that the transition from any state in $\mathcal{S}_{H-1}$ under any action leads back to the initial state $s_0$, i.e., $\forall s \in \mathcal{S}_{H-1}, a \in \mathcal{A}$, we have $P(s_0|s, a) = 1$. We also assume $r_t \in [0, 1]$, $\forall t \in [H]$ almost surely. The probability of $\tau_H^\pi$ is
$$p(\tau_H^\pi) = \left( \prod_{t \in [H]} \pi(a_t|s_t) \cdot R(r_t|s_t, a_t) \right) \prod_{t \in [H-1]} P(s_{t+1}|s_t, a_t). \tag{9}$$
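As a concrete illustration of the trajectory distribution above, the sketch below samples an $H$-horizon trajectory in a tiny hypothetical tabular MDP; the states, transitions, and rewards are invented for the example and are not from the paper.

```python
import random

# Hypothetical finite-horizon MDP: states are pairs (h, i) with timestep h,
# so the per-step state sets S_h are disjoint by construction.
H = 3
ACTIONS = [0, 1]

def transition(s, a):
    """Deterministic P(.|s, a): action a selects the next-step state (h+1, a)."""
    h, _ = s
    return (h + 1, a)

def reward(s, a):
    """Deterministic R(.|s, a): reward 1 when the action matches the state index."""
    _, i = s
    return 1.0 if i == a else 0.0

def rollout(seed=0):
    """Sample tau_H = (s_0, a_0, r_0, ..., s_{H-1}, a_{H-1}, r_{H-1})."""
    rng = random.Random(seed)
    s, tau = (0, 0), []
    for _ in range(H):
        a = rng.choice(ACTIONS)      # a_t ~ pi(. | s_t), here a uniform policy
        r = reward(s, a)             # r_t ~ R(. | s_t, a_t)
        tau.append((s, a, r))
        s = transition(s, a)         # s_{t+1} ~ P(. | s_t, a_t)
    return tau

tau = rollout()
```

The trajectory probability $p(\tau_H^\pi)$ is exactly the product of the policy, reward, and transition probabilities encountered in this loop.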
For any policy $\pi$, we define a value function $V^\pi: \mathcal{S} \to \mathbb{R}$ as: $\forall h \in [H]$, $\forall s \in \mathcal{S}_h$,
$$V^\pi(s) = \mathbb{E}_{s_h = s,\, a_t \sim \pi(\cdot|s_t),\, r_t \sim R(\cdot|s_t, a_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \sum_{t=h}^{H-1} r_t \right] \tag{10}$$
$$= \begin{cases} \sum_{a \in \mathcal{A}} \pi(a|s) \cdot \mathbb{E}_{r \sim R(\cdot|s,a)}[r], & \text{if } h = H-1, \\ \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathbb{E}_{r \sim R(\cdot|s,a)}[r] + \sum_{s' \in \mathcal{S}_{h+1}} P(s'|s,a) V^\pi(s') \right), & \text{otherwise,} \end{cases}$$
and a visitation distribution of $\pi$, $\rho^\pi \in \Delta(\mathcal{S})$, is defined by: $\forall h \in [H]$, $\forall s \in \mathcal{S}_h$,
$$\rho^\pi(s) = \begin{cases} \frac{1}{H}, & \text{if } h = 0 \text{ and } s = s_0, \\ \sum_{\tilde{s} \in \mathcal{S}_{h-1},\, \tilde{a} \in \mathcal{A}} \rho^\pi(\tilde{s}) \cdot \pi(\tilde{a}|\tilde{s}) \cdot P(s|\tilde{s}, \tilde{a}), & \text{if } h > 0, \\ 0, & \text{otherwise,} \end{cases}$$
and $\forall s \in \mathcal{S}, a \in \mathcal{A}, r \in \mathcal{R}$, $\rho^\pi(s, a) = \rho^\pi(s) \cdot \pi(a|s)$ and $\rho^\pi(s, a, r) = \rho^\pi(s) \cdot \pi(a|s) \cdot R(r|s, a)$. The expected total reward induced by policy $\pi$, i.e., the policy evaluation of $\pi$, is defined by
$$J_M(\pi) = V^\pi(s_0) = H \sum_{s \in \mathcal{S}, a \in \mathcal{A}} \rho^\pi(s, a) \cdot \mathbb{E}_{r \sim R(\cdot|s,a)}[r].$$
The goal of RL is to find a policy $\pi$ that maximizes its expected return $J_M(\pi)$.
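The case analysis in Eq. (10) is a backward induction over timesteps, which can be sketched directly; the tabular MDP below is hypothetical and only meant to show the recursion.

```python
# Exact policy evaluation by backward induction (Eq. (10)-style) on a
# hypothetical tabular MDP whose states are pairs (h, i), h the timestep.
H = 3
STATES = [(h, i) for h in range(H) for i in range(2)]
ACTIONS = [0, 1]

def P(s, a, s2):
    """P(s'|s, a): deterministic move from (h, i) to (h+1, a)."""
    h, _ = s
    return 1.0 if s2 == (h + 1, a) else 0.0

def mean_reward(s, a):
    """E_{r ~ R(.|s,a)}[r] for the toy reward model."""
    _, i = s
    return 1.0 if i == a else 0.5

def pi(a, s):
    """Uniform policy."""
    return 0.5

V = {s: 0.0 for s in STATES}
for h in reversed(range(H)):                      # backward over timesteps
    for s in (s for s in STATES if s[0] == h):
        V[s] = sum(
            pi(a, s) * (mean_reward(s, a)
                        + sum(P(s, a, s2) * V[s2]
                              for s2 in STATES if s2[0] == h + 1))
            for a in ACTIONS)

J = V[(0, 0)]   # J_M(pi) = V^pi(s_0)
```

At $h = H-1$ the inner sum over next states is empty, which reproduces the first case of Eq. (10).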

A.1.2 OFFLINE FINITE-HORIZON SINGLE-TASK RL

We consider the offline finite-horizon single-task RL setting, in which a learner only has access to a dataset $D$ consisting of $K$ trajectories $\{(s_t^k, a_t^k, r_t^k)\}_{k \in [K], t \in [H]}$ (i.e., $|D| = KH$ tuples) and is not allowed to interact with the environment for additional online exploration. The data can be collected through multi-source logging policies; we denote the unknown behavior policy by $\mu$. Similar to related work (Ren et al., 2021; Yin et al., 2020; Yin & Wang, 2021; Yin et al., 2021; Shi et al., 2022), we assume that $D$ is collected by interacting for $K$ i.i.d. episodes using policy $\mu$ in $M$. Define the reward and transition distribution of data collection with $\mu$ in $M$ by $P_M$ (Jin et al., 2021), i.e., $\forall t \in [H]$ in each episode,
$$P_M(r_t, s_{t+1}|s_t, a_t) = R_M(r_t|s_t, a_t) \cdot P_M(s_{t+1}|s_t, a_t),$$
where the action $a_t$ is drawn from the behavior policy $\mu$. We write $D \sim (P_M, \mu)$ for a dataset collected following this i.i.d. data collecting process. Note that the offline dataset $D$ collected by some behavior policy $\mu$ can be narrow, and a large number of state-action pairs may not be contained in $D$. These unseen state-action pairs will be erroneously estimated to have unrealistic values, a phenomenon called extrapolation error (Fujimoto et al., 2019). To overcome extrapolation error in policy learning for finite MDPs, Fujimoto et al. (2019) introduce batch-constrained RL, which restricts the action space in order to force the agent's policy selection towards a subset of the given data. Thus, we define the batch-constrained policy set as
$$\Pi_D = \{\pi \mid \pi(a|s) = 0 \text{ whenever } (s, a) \notin D\}, \tag{15}$$
where we write $(s, a) \in D$ if there exists a trajectory containing $(s, a)$ in the dataset $D$, and similarly for $s \in D$, $(s, a, r) \in D$, or $(s, a, r, s') \in D$. The batch-constrained policy set $\Pi_D$ consists of the policies such that, for any state $s$ observed in the dataset $D$, the agent will not select an action outside of the dataset.
Thus, for any batch-constrained policy $\pi \in \Pi_D$, define the approximate value function $V_\pi^D: \mathcal{S} \to \mathbb{R}$ estimated from $D$ (Fujimoto et al., 2019; Liu et al., 2020) as: $\forall h \in [H]$, $\forall s \in \mathcal{S}_h$,
$$V_\pi^D(s) = \mathbb{E}_{s_h = s,\, a_t \sim \pi(\cdot|s_t),\, (s_t, a_t, r_t, s_{t+1}) \sim D} \left[ \sum_{t=h}^{H-1} r_t \right] \tag{16}$$
$$= \begin{cases} \sum_{a \in \mathcal{A}} \pi(a|s)\, \mathbb{E}_{(s,a,r) \in D}[r], & \text{if } h = H-1, \\ \sum_{a \in \mathcal{A}} \pi(a|s)\, \mathbb{E}_{(s,a,r,s') \in D}\left[ r + V_\pi^D(s') \right], & \text{otherwise.} \end{cases}$$
This is called Approximate Dynamic Programming (ADP) (Bertsekas & Tsitsiklis, 1995); such methods take sampled data as input and approximate the value function (Liu et al., 2020; Chen & Jiang, 2019). In addition, define the approximate policy evaluation of $\pi$ estimated from $D$ as $J_D(\pi) = V_\pi^D(s_0)$. The offline RL literature (Fujimoto et al., 2019; Liu et al., 2020; Chen & Jiang, 2019; Kumar et al., 2019; 2020) aims to utilize the approximate expected total reward $J_D(\pi)$ with various conservatism regularizations (e.g., policy constraints, policy penalties, uncertainty penalties) (Levine et al., 2020) to find a good policy within the batch-constrained policy set $\Pi_D$. Similar to offline finite-horizon single-task RL theory (Ren et al., 2021; Yin et al., 2020; Yin & Wang, 2021; Yin et al., 2021; Shi et al., 2022), define
$$d_\mu^M = \min \{\rho_\mu(s, a) \mid \rho_\mu(s, a) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\}, \tag{19}$$
which is the minimal visitation state-action distribution induced by the behavior policy $\mu$ in $M$ and is an intrinsic quantity required by theoretical offline learning (Yin et al., 2020). Note that, different from recent offline episodic RL theory (Ren et al., 2021; Yin et al., 2020; Yin & Wang, 2021; Yin et al., 2021; Shi et al., 2022), we do not assume any weak or uniform coverage assumption on the dataset, because we focus on the policy evaluation of all batch-constrained policies in $\Pi_D$ rather than the optimal policy in the MDP $M$.
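The batch-constrained restriction itself is easy to make concrete: collect the actions seen with each state in the dataset and forbid everything else. The dataset, Q-values, and names below are a hypothetical toy example, not the paper's implementation.

```python
# Batch-constrained action filtering (Pi_D): for any state seen in the dataset,
# the policy may only pick actions that appear with that state in D.
from collections import defaultdict

D = [  # hypothetical logged transitions (s, a, r, s')
    ("s0", "a0", 0.0, "s1"),
    ("s0", "a1", 1.0, "s2"),
    ("s1", "a0", 0.5, "s0"),
]

seen_actions = defaultdict(set)
for s, a, r, s2 in D:
    seen_actions[s].add(a)

def batch_constrained_greedy(q, s):
    """Greedy action restricted to actions observed with s in D."""
    allowed = seen_actions[s]
    return max(allowed, key=lambda a: q.get((s, a), float("-inf")))

# "a2" has the highest Q-value but never appears with "s0" in D, so it is
# excluded; this is exactly the pi(a|s) = 0 constraint of Eq. (15).
q = {("s0", "a0"): 0.2, ("s0", "a1"): 0.9, ("s0", "a2"): 5.0}
a = batch_constrained_greedy(q, "s0")
```

Restricting the argmax this way prevents the extrapolation error that unseen state-action pairs would otherwise introduce.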

A.1.3 STANDARD META-RL

The goal of meta-RL (Finn et al., 2017; Rakelly et al., 2019) is to train a meta-policy that can quickly adapt to new tasks using $N$ adaptation episodes. The standard meta-RL setting deals with a distribution $p(\kappa)$ over MDPs, in which each task $\kappa_i$ sampled from $p(\kappa)$ presents a finite-horizon MDP (Zintgraf et al., 2019; Du et al., 2019). $\kappa_i$ is defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, H, P_{\kappa_i}, R_{\kappa_i})$, including state space $\mathcal{S}$, action space $\mathcal{A}$, reward space $\mathcal{R}$, planning horizon $H$, transition function $P_{\kappa_i}(s'|s,a)$, and reward function $R_{\kappa_i}(r|s,a)$. Denote by $\mathcal{K}$ the space of tasks $\kappa_i$. In this paper, we assume the dynamics function $P$ and reward function $R$ may vary across tasks while sharing a common structure. Meta-RL algorithms repeatedly sample batches of tasks to train a meta-policy. In meta-testing, agents aim to rapidly adapt a good policy for new tasks drawn from $p(\kappa)$.

POMDPs. We can formalize meta-RL with few-shot adaptation as a specific finite-horizon Partially Observable Markov Decision Process (POMDP), which is defined by a tuple $\overline{M} = (\overline{\mathcal{S}}, \mathcal{A}, \mathcal{R}, \Omega, \overline{H}, \overline{P}, \overline{P}_0, O, \overline{R})$, where $\overline{\mathcal{S}} = \mathcal{S} \times \mathcal{K}$ is the state space, $\mathcal{A}$ and $\mathcal{R}$ are the same action and reward spaces as in the finite-horizon MDP $M$ defined in Appendix A.1.1, $\Omega = \mathcal{S}$ is the observation space, and $\overline{H} = N \times H$ is the planning horizon, representing $N$ adaptation episodes for a single meta-RL MDP $\kappa_i$, as discussed in Zintgraf et al. (2019). $\overline{P}: \overline{\mathcal{S}} \times \mathcal{A} \to \Delta(\overline{\mathcal{S}})$ is the transition function: $\forall \hat{s}, \hat{s}' \in \overline{\mathcal{S}}, a \in \mathcal{A}$, denoting $\hat{s} = (s, \kappa_i)$ and $\hat{s}' = (s', \kappa_j)$,
$$\overline{P}(\hat{s}'|\hat{s}, a) = \begin{cases} P_{\kappa_i}(s'|s, a), & \text{if } \kappa_i = \kappa_j, \\ 0, & \text{otherwise;} \end{cases}$$
$\overline{P}_0 \in \Delta(\overline{\mathcal{S}})$ is the initial state distribution: $\forall \hat{s} = (s, \kappa_i) \in \overline{\mathcal{S}}$,
$$\overline{P}_0(\hat{s}) = \begin{cases} p(\kappa_i), & \text{if } s = s_0, \\ 0, & \text{otherwise;} \end{cases}$$
$O: \overline{\mathcal{S}} \to \Delta(\Omega)$ is the observation probability distribution conditioned on a state: $\forall \hat{s} = (s, \kappa_i) \in \overline{\mathcal{S}}, o \in \Omega$,
$$O(o|\hat{s}) = \begin{cases} 1, & \text{if } o = s, \\ 0, & \text{otherwise;} \end{cases}$$
and $\overline{R}: \overline{\mathcal{S}} \times \mathcal{A} \to \Delta(\mathcal{R})$ is the reward distribution: $\forall \hat{s} = (s, \kappa_i) \in \overline{\mathcal{S}}, a \in \mathcal{A}, r \in \mathcal{R}$, $\overline{R}(r|\hat{s}, a) = R_{\kappa_i}(r|s, a)$.
Denote by $c_t = (a_t, r_t, s_{t+1})$ the experience collected at timestep $t$, and by $c_{:t} = \langle s_0, c_0, \ldots, c_{t-1} \rangle \in \mathcal{C}_t \equiv \Omega \times (\mathcal{A} \times \mathcal{R} \times \Omega)^t$ all experiences collected during the first $t$ timesteps. Note that $t$ may be larger than $H$; when it is, $c_{:t}$ represents experiences collected across episodes in the single meta-RL MDP $\kappa_i$. Denote the entire context space $\mathcal{C} = \bigcup_{t=0}^{\overline{H}-1} \mathcal{C}_t$. A meta-policy $\pi: \mathcal{C} \to \Delta(\mathcal{A})$ (Wang et al., 2016; Duan et al., 2016) prescribes a distribution over actions for each context. The goal of meta-RL is to find a meta-policy on history contexts $\pi$ that maximizes the expected return within $N$ adaptation episodes:
$$J_{\overline{M}}(\pi) = \mathbb{E}_{\hat{s}_0 \sim \overline{P}_0,\, o_t \sim O(\cdot|\hat{s}_t),\, a_t \sim \pi(\cdot|c_{:t}),\, r_t \sim \overline{R}(\cdot|\hat{s}_t, a_t),\, \hat{s}_{t+1} \sim \overline{P}(\cdot|\hat{s}_t, a_t)} \left[ \sum_{t=0}^{\overline{H}-1} r_t \right] \tag{24}$$
$$= \mathbb{E}_{\kappa_i \sim p(\kappa)} \left[ \sum_{j=0}^{N-1} \mathbb{E}_{a_t \sim \pi(\cdot|c_{:(jH+t)}),\, r_t \sim R_{\kappa_i}(\cdot|s_t, a_t),\, s_{t+1} \sim P_{\kappa_i}(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} r_t \right] \right]. \tag{25}$$

BAMDPs. A Markovian belief state allows a POMDP to be formulated as a Markov decision process where every belief is a state (Cassandra et al., 1994).
We can transform the finite-horizon POMDP $\overline{M}$ into a finite-horizon belief MDP, called a Bayes-Adaptive MDP (BAMDP) in the literature (Zintgraf et al., 2019; Ghavamzadeh et al., 2015; Dorfman et al., 2021), which is defined by a tuple $M^+ = (\mathcal{S}^+, \mathcal{A}, \mathcal{R}, H^+, P^+, P_0^+, R^+)$. Here $\mathcal{S}^+ = \mathcal{S} \times \mathcal{B}$ is the hyper-state space, where $\mathcal{B} = \{p(\kappa|c) \mid c \in \mathcal{C}\}$ is the set of belief states over the meta-RL MDPs, the prior $b_0^\kappa = p(\kappa|c_{:0}) = p(\kappa)$ is the meta-RL MDP distribution, and $\forall t \in [\overline{H}-1]$, $\forall c_{:(t+1)} \in \mathcal{C}$, denoting $b_t^\kappa = p(\kappa|c_{:t})$,
$$b_{t+1}^\kappa = p(\kappa|c_{:(t+1)}) = p(p(\kappa|c_{:t})|c_{:(t+1)}) = p(p(\kappa|c_{:t})|s_t, c_t) = p(b_t^\kappa|s_t, c_t) \tag{27}$$
$$\propto p(b_t^\kappa, c_t|s_t) = p(c_t|s_t, b_t^\kappa)\, p(b_t^\kappa|s_t) = p(c_t|s_t, b_t^\kappa)\, b_t^\kappa \tag{28}$$
$$= \mathbb{E}_{\kappa_i \sim b_t^\kappa}\left[R_{\kappa_i}(r_t|s_t, a_t) \cdot P_{\kappa_i}(s_{t+1}|s_t, a_t)\right] \cdot b_t^\kappa \tag{29}$$
is the posterior over the MDPs given the context $c_{:(t+1)}$. $\mathcal{A}$ and $\mathcal{R}$ are the same action and reward spaces as in the finite-horizon POMDP $\overline{M}$, and $H^+ = N \times H$ is the planning horizon across adaptation episodes. $P^+: \mathcal{S}^+ \times \mathcal{A} \times \mathcal{R} \to \Delta(\mathcal{S}^+)$ is the transition function:
$$P^+(s_{t+1}, b_{t+1}^\kappa \mid s_t, b_t^\kappa, a_t, r_t) = \mathbb{E}_{\kappa_i \sim p(b_t^\kappa|s_t, a_t, r_t)}\left[P_{\kappa_i}(s_{t+1}|s_t, a_t)\right] \cdot \mathbb{1}\left[b_{t+1}^\kappa = p(b_t^\kappa|s_t, c_t)\right];$$
$P_0^+ \in \Delta(\mathcal{S}^+)$ is the initial hyper-state distribution, i.e., a deterministic initial hyper-state $s_0^+ = (s_0, b_0^\kappa) = (s_0, p(\kappa)) \in \mathcal{S}^+$; and $R^+: \mathcal{S}^+ \times \mathcal{A} \to \Delta(\mathcal{R})$ is the reward distribution: $\forall s^+ = (s, b^\kappa) \in \mathcal{S}^+, a \in \mathcal{A}, r \in \mathcal{R}$,
$$R^+(r|s^+, a) = R^+(r|s, b^\kappa, a) = \mathbb{E}_{\kappa_i \sim b^\kappa}\left[R_{\kappa_i}(r|s, a)\right]. \tag{34}$$
In a BAMDP, the belief is over the transition and reward functions, which are constant for a given task. A meta-policy on the BAMDP $\pi^+: \mathcal{S}^+ \to \Delta(\mathcal{A})$ prescribes a distribution over actions for each hyper-state. The agent's objective is now to find a meta-policy on hyper-states $\pi^+$ that maximizes the expected return in the BAMDP,
$$J_{M^+}(\pi^+) = \mathbb{E}_{a_t \sim \pi^+(\cdot|s_t^+),\, r_t \sim R^+(\cdot|s_t^+, a_t),\, s_{t+1}^+ \sim P^+(\cdot|s_t^+, a_t)} \left[ \sum_{t=0}^{\overline{H}-1} r_t \right] \tag{35}$$
$$= \mathbb{E}_{\kappa_i \sim p(\kappa)} \left[ \sum_{j=0}^{N-1} \mathbb{E}_{a_t \sim \pi^+(\cdot|s_{jH+t}^+),\, r_t \sim R_{\kappa_i}(\cdot|s_t, a_t),\, s_{t+1} \sim P_{\kappa_i}(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} r_t \right] \right]. \tag{36}$$
For any meta-policy on hyper-states $\pi^+$, denote the corresponding meta-policy on history contexts by $f_{\pi^+}: \mathcal{C} \to \Delta(\mathcal{A})$, i.e., $\forall t \in [\overline{H}-1]$, $\forall c_{:t} \in \mathcal{C}_t$, $f_{\pi^+}(\cdot|c_{:t}) = \pi^+(\cdot|s_t^+)$, where $s_t^+ = (s_t, b_t^\kappa) = (s_t, p(\kappa|c_{:t}))$, and we have
$$J_{\overline{M}}(f_{\pi^+}) = \mathbb{E}_{\kappa_i \sim p(\kappa)} \left[ \sum_{j=0}^{N-1} \mathbb{E}_{a_t \sim f_{\pi^+}(\cdot|c_{:(jH+t)}),\, r_t \sim R_{\kappa_i}(\cdot|s_t, a_t),\, s_{t+1} \sim P_{\kappa_i}(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} r_t \right] \right] \tag{37}$$
$$= J_{M^+}(\pi^+). \tag{38}$$
The belief MDP is such that an optimal policy for it, coupled with the correct state estimator, gives rise to optimal behavior for the original POMDP (Astrom, 1965; Smallwood & Sondik, 1973; Kaelbling et al., 1998), which indicates that $J_{M^+}(\pi^{+,*}) = J_{\overline{M}}(f_{\pi^{+,*}}) = J_{\overline{M}}(\pi^*)$, where $\pi^{+,*}$ and $\pi^*$ are the optimal policies for the BAMDP $M^+$ and the POMDP $\overline{M}$, respectively. Thus, the agent can find a policy $\pi^+$ maximizing the expected return in the BAMDP $M^+$ to address the POMDP $\overline{M}$ via the transformed policy $f_{\pi^+}$.
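The Bayesian belief update that drives a BAMDP can be sketched in a few lines; the two tasks below (one rewarding action "a0", the other "a1") are hypothetical stand-ins used only to show the posterior computation.

```python
# Bayesian task-belief update: after observing (s, a, r, s'),
# b_{t+1}(kappa) is proportional to R_kappa(r|s,a) * P_kappa(s'|s,a) * b_t(kappa).

def likelihood(task, s, a, r, s2):
    """R_kappa(r|s,a) * P_kappa(s'|s,a) for two hypothetical one-state tasks."""
    rewarded = {"kappa1": "a0", "kappa2": "a1"}[task]
    expected_r = 1.0 if a == rewarded else 0.0
    r_prob = 1.0 if r == expected_r else 0.0   # deterministic reward model
    p_prob = 1.0                                # shared single-state dynamics
    return r_prob * p_prob

def update_belief(belief, transition):
    s, a, r, s2 = transition
    post = {k: likelihood(k, s, a, r, s2) * b for k, b in belief.items()}
    z = sum(post.values())
    # If the observation has belief-probability zero, the update is infeasible;
    # here we simply keep the prior in that degenerate case.
    return {k: v / z for k, v in post.items()} if z > 0 else belief

b = {"kappa1": 0.5, "kappa2": 0.5}
b = update_belief(b, ("s0", "a0", 1.0, "s0"))   # reward 1 on a0 identifies kappa1
```

A single informative transition collapses the uniform prior onto the consistent task, which is exactly how the posterior $b_{t+1}^\kappa$ concentrates in the BAMDP.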

A.1.4 OFFLINE META-RL

In the offline meta-RL setting, a meta-learner only has access to an offline multi-task dataset $D^+$ and is not allowed to interact with the environment during meta-training (Li et al., 2020). Recent offline meta-RL methods (Dorfman et al., 2021) typically utilize task-dependent behavior policies $p(\mu|\kappa)$, which represents the random variable of the behavior policy $\mu(a|s)$ conditioned on the random variable of the task $\kappa$. For brevity, we write $[\mu] = p(\mu|\kappa)$. Similar to related work on offline RL (Shi et al., 2022), we assume that $D^+$ is collected by interacting for multiple i.i.d. trajectories using task-dependent policies $[\mu]$ in $M^+$. Define the reward and transition distribution of the task-dependent data collection by $P_{M^+,[\mu]}$ (Jin et al., 2021), i.e., for each step $t$ in a trajectory,
$$P_{M^+,[\mu]}(r_t, s_{t+1} \mid s_t^+, a_t) \propto \mathbb{E}_{\kappa_i \sim p(\kappa),\, \mu_i \sim p(\mu|\kappa_i)} \left[ P_{\kappa_i}(r_t, s_{t+1}|s_t, a_t) \cdot p_{M^+}(s_t^+|\kappa_i, \mu_i) \right], \tag{40}$$
where $P_{\kappa_i}$ is the reward and transition distribution of $\kappa_i$ defined in Eq. (14), and $p_{M^+}(s_t^+|\kappa_i, \mu_i)$ denotes the probability of reaching $s_t^+$ when executing $\mu_i$ in task $\kappa_i$, i.e.,
$$p_{M^+}(s_t^+|\kappa_i, \mu_i) = \sum_{c_{:t} \in \mathcal{C}_t} p_{\kappa_i}^{\mu_i}(c_{:t}) \cdot \mathbb{1}\left[b_t^\kappa = p(\kappa|c_{:t})\right], \tag{41}$$
where the state in $c_{:t}$ is $s_t$ and $p_{\kappa_i}^{\mu_i}(c_{:t})$ is defined in Eq. (9). Similar to offline single-task RL (see Appendix A.1.2), the offline dataset $D^+$ can be narrow, and a large number of state-action pairs may not be contained in it. These unseen state-action pairs will be erroneously estimated to have unrealistic values, a phenomenon called extrapolation error (Fujimoto et al., 2019). To overcome extrapolation error in offline RL, related works (Fujimoto et al., 2019) introduce batch-constrained RL, which restricts the action space in order to force the agent's policy selection towards the given dataset. Define a policy $\pi^+$ to be batch-constrained by $D^+$ if $\pi^+(a|s^+) = 0$ whenever a tuple $(s^+, a)$ is not contained in $D^+$.
Offline RL (Liu et al., 2020; Chen & Jiang, 2019) approximates the policy evaluation of a batch-constrained policy $\pi^+$ by sampling from the offline dataset $D^+$, which is denoted by $J_{D^+}(\pi^+)$ and called Approximate Dynamic Programming (ADP; Bertsekas & Tsitsiklis, 1995). During meta-testing, RL agents perform online adaptation using a meta-policy $\pi^+$ in new tasks drawn from the meta-RL task distribution. The reward and transition distribution of data collection with $\pi^+$ in $M^+$ during adaptation is defined by $P_{M^+}(r_t, s_{t+1}|s_t^+, a_t)$. Data distribution mismatch arises when there exists a tuple $(s_t^+, a_t)$ with $p_{M^+}^{\pi^+}(s_t^+, a_t) > 0$ and
$$P_{M^+}(r_t, s_{t+1}|s_t^+, a_t) \neq P_{M^+,[\mu]}(r_t, s_{t+1}|s_t^+, a_t),$$
where $p_{M^+}^{\pi^+}(s_t^+, a_t)$ is the probability of reaching the tuple $(s_t^+, a_t)$ while executing $\pi^+$ in $M^+$ (formal definition deferred to Appendix A.2), and $P_{M^+}$, $P_{M^+,[\mu]}$ are the reward and transition distributions of data collection with $\pi^+$ and $[\mu]$ defined in Eq. (1) and (3), respectively.

More specifically,

$$p_{M^+}^{\pi^+}(s_t^+, a_t) = p_{M^+}^{\pi^+}(s_t^+) \cdot \pi^+(a_t|s_t^+),$$
where $p_{M^+}^{\pi^+}(s_t^+)$ is defined in Eq. (9).

Theorem 1. There exists a BAMDP $M^+$ with task-dependent behavior policies $[\mu]$ such that, for any batch-constrained meta-policy $\pi^+$, the data distribution induced by $\pi^+$ and $[\mu]$ does not match.

Proof. As a concrete example, we construct the offline meta-RL setting shown in Figure 4. In this example, there are $v$ meta-RL tasks $\mathcal{K} = \{\kappa_1, \ldots, \kappa_v\}$ and $v$ behavior policies $\{\mu_1, \ldots, \mu_v\}$, where $v \geq 3$. Each task $\kappa_i$ has one state $\mathcal{S} = \{s_0\}$, $v$ actions $\mathcal{A} = \{a_1, \ldots, a_v\}$, and horizon $H = 1$ in an episode. In task $\kappa_i$, the agent receives reward 1 for performing action $a_i$ and reward 0 for any other action, i.e., $r_{\kappa_i}(a_i) = 1$ and $r_{\kappa_j}(a_i) = 0$ for $j \neq i$. During adaptation, the RL agent can interact with the environment for $v$ episodes. The task distribution is uniform, $p(\kappa_i) = \frac{1}{v}$; the behavior policy of task $\kappa_i$ is $\mu_i$, i.e., $p(\mu_i|\kappa_i) = 1$; and each behavior policy $\mu_i$ deterministically performs $a_i$, i.e., $\mu_i(a_i|s_0) = 1$. When a batch-constrained meta-policy $\pi^+$ selects an action $\tilde{a}$ in the initial hyper-state $s_0^+$, we find that
$$P_{M^+}(r = 1 \mid s_0^+, \tilde{a}) = \frac{1}{v} \neq 1 = P_{M^+,[\mu]}(r = 1 \mid s_0^+, \tilde{a}),$$
since there is probability $\frac{1}{v}$ of sampling the corresponding testing task, whose reward for $\tilde{a}$ is 1, whereas the rewards in the offline dataset collected by $[\mu]$ are all 1. $\square$

Fact 1. For any probability $1 - \delta \in [0, 1)$, there exists a BAMDP $M^+$ with task-dependent behavior policies $[\mu]$ such that, for any batch-constrained meta-policy $\pi^+$, the agent will visit out-of-distribution belief states during meta-testing with probability at least $1 - \delta$, due to data distribution mismatch.

Proof. In the example shown in Figure 4, an offline multi-task dataset $D^+$ is drawn from the task-dependent data collection $P_{M^+,[\mu]}$.
Since the rewards in $D^+$ are all 1, the belief states in $D^+$ are of two types: (i) the prior belief at $s_0^+$, under which all tasks are possible, and (ii) beliefs identifying task $i$ after receiving reward 1 for action $a_i$. For any batch-constrained meta-policy $\pi^+$ selecting an action $\tilde{a}_j$ at $s_0^+$ during meta-testing, there is probability $1 - \frac{1}{v}$ of receiving reward 0, after which the belief state becomes "excluding task $j$", which is not contained in $D^+$ when $v \geq 3$. For any $\delta \in (0, 1]$, let $v > \frac{1}{\delta}$; then with probability at least $1 - \delta$, the agent will visit out-of-distribution belief states during adaptation. $\square$

Proposition 1. There exists a BAMDP $M^+$ with task-dependent behavior policies $[\mu]$ such that, for any batch-constrained meta-policy $\pi^+$, the gap of policy evaluation of $\pi^+$ between offline meta-training and online adaptation is at least $\frac{H^+ - 1}{2}$, where $H^+$ is the planning horizon of $M^+$.

Proof. In the example shown in Figure 4, the offline dataset $D^+$ only contains reward 1; thus, for each batch-constrained meta-policy $\pi^+$, the offline evaluation of $\pi^+$ in $D^+$ is $J_{D^+}(\pi^+) = H^+ = vH$. The optimal meta-policy $\pi^{+,*}$ in this example is to enumerate $a_1, \ldots, a_v$ until the task identity is inferred from an action with a reward of 1. Such a meta-policy $\pi^{+,*}$ needs to explore in the testing environment, and its online policy evaluation (with $N = v$ adaptation episodes) is
$$J_{M^+}(\pi^{+,*}) = \sum_{k=0}^{N-1} \frac{N-k}{v-k} \prod_{j=0}^{k-1} \left(1 - \frac{1}{v-j}\right) = \sum_{k=0}^{N-1} \frac{N-k}{v-k} \cdot \frac{v-k}{v} = \sum_{k=0}^{N-1} \frac{N-k}{v} = \frac{N(N+1)}{2v} = \frac{v+1}{2} = \frac{H^+ + 1}{2},$$
so the gap is $J_{D^+}(\pi^{+,*}) - J_{M^+}(\pi^{+,*}) = vH - \frac{v+1}{2} = \frac{v-1}{2} = \frac{H^+ - 1}{2}$. $\square$

In a transformed BAMDP $\widetilde{M}^+$, the overall belief is over the task-dependent behavior policies, transition function, and reward function, which are constant for a given task. A meta-policy on $\widetilde{M}^+$ is $\widetilde{\pi}^+: \widetilde{\mathcal{S}}^+ \to \Delta(\mathcal{A})$, prescribing a distribution over actions for each hyper-state. With feasible Bayesian belief updating, the objective of RL agents is now to find a meta-policy on hyper-states $\widetilde{\pi}^+$ that maximizes the expected return in the transformed BAMDP,
$$J_{\widetilde{M}^+}(\widetilde{\pi}^+) = \mathbb{E}_{a_t \sim \widetilde{\pi}^+(\cdot|\widetilde{s}_t^+),\, r_t \sim \widetilde{R}^+(\cdot|\widetilde{s}_t^+, a_t),\, \widetilde{s}_{t+1}^+ \sim \widetilde{P}^+(\cdot|\widetilde{s}_t^+, a_t)} \left[ \sum_{t=0}^{\overline{H}-1} r_t \right] \tag{60}$$
$$= \mathbb{E}_{(\kappa_i, \mu_i) \sim p(\kappa, \mu)} \left[ \sum_{j=0}^{N-1} \mathbb{E}_{a_t \sim \widetilde{\pi}^+(\cdot|\widetilde{s}_{jH+t}^+),\, r_t \sim R_{\kappa_i}(\cdot|s_t, a_t),\, s_{t+1} \sim P_{\kappa_i}(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} r_t \right] \right] \tag{61}$$
$$= \mathbb{E}_{\kappa_i \sim p(\kappa)} \left[ \sum_{j=0}^{N-1} \mathbb{E}_{a_t \sim \widetilde{\pi}^+(\cdot|\widetilde{s}_{jH+t}^+),\, r_t \sim R_{\kappa_i}(\cdot|s_t, a_t),\, s_{t+1} \sim P_{\kappa_i}(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} r_t \right] \right]. \tag{62}$$
For any meta-policy on hyper-states $\widetilde{\pi}^+$, denote the corresponding meta-policy on history contexts $f_{\widetilde{\pi}^+}: \mathcal{C} \to \Delta(\mathcal{A})$, i.e.,
$$J_{\overline{M}}(f_{\widetilde{\pi}^+}) = \mathbb{E}_{\kappa_i \sim p(\kappa)} \left[ \sum_{j=0}^{N-1} \mathbb{E}_{a_t \sim f_{\widetilde{\pi}^+}(\cdot|c_{:(jH+t)}),\, r_t \sim R_{\kappa_i}(\cdot|s_t, a_t),\, s_{t+1} \sim P_{\kappa_i}(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} r_t \right] \right] \tag{63}$$
$$= J_{\widetilde{M}^+}(\widetilde{\pi}^+). \tag{64}$$

Fact 2. With feasible Bayesian belief updating, transformed BAMDPs confine the agent to in-distribution belief states during meta-testing.

Proof.
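The numbers in this counterexample can be checked directly. The snippet below is a standalone sketch (with $v$ chosen arbitrarily) verifying both the reward-distribution mismatch of Theorem 1 and the online return and gap of Proposition 1.

```python
from fractions import Fraction

v = 5  # number of tasks/actions; any v >= 3 works

# Theorem 1: offline data logged by mu_i always records reward 1, while a fixed
# action a_tilde at meta-test time earns reward 1 only on the matching task,
# which is drawn uniformly with probability 1/v.
p_offline_reward_1 = Fraction(1)
p_online_reward_1 = Fraction(1, v)
assert p_online_reward_1 != p_offline_reward_1   # the distributions mismatch

# Proposition 1: the enumerating meta-policy finds the rewarded action at
# episode k with probability 1/v and then collects reward 1 in each of the
# remaining N - k episodes (N = v, H = 1).
def expected_online_return(v):
    N = v
    return sum(Fraction(1, v) * (N - k) for k in range(N))

J_online = expected_online_return(v)   # equals (v + 1) / 2
J_offline = Fraction(v)                # J_{D+}(pi+) = H+ = v * H
gap = J_offline - J_online             # equals (v - 1) / 2 = (H+ - 1) / 2
```

Exact rational arithmetic via `Fraction` makes the closed forms $(v+1)/2$ and $(v-1)/2$ verifiable without floating-point error.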
During online adaptation, RL agents construct a hyper-state $\widetilde{s}_t^+ = (s_t, \widetilde{b}_t)$ from the context history and execute a meta-policy $\widetilde{\pi}^+(a_t|\widetilde{s}_t^+)$. The new belief $\widetilde{b}_t$ accounts for the uncertainty over both task MDPs and task-dependent behavior policies. In contrast with Fact 1, under feasible Bayesian belief updating, transformed BAMDPs do not allow the agent to visit out-of-distribution belief states: otherwise, the context history would conflict with the belief about behavior policies, i.e., RL agents cannot update their belief $\widetilde{b}_t$ when they observe an event that they believe to have probability zero. $\square$

Lemma 1. In an MDP $M$, for each behavior policy $\mu$ and batch-constrained policy $\pi$ with a collected dataset $D$, the gap between the approximate offline policy evaluation $J_D(\pi)$ and the accurate policy evaluation $J_M(\pi)$ asymptotically approaches 0 as the offline dataset $D$ grows.

From a given dataset $D$, an abstract MDP $M_D$ can be estimated (Fujimoto et al., 2019; Yin & Wang, 2021; Szepesvári, 2022). By concentration bounds, the estimated transition and reward functions asymptotically approach those of $M$ (Yin & Wang, 2021) on the support of $D$. Then, using the simulation lemma (Alekh Agarwal, 2017; Szepesvári, 2022), the gap between $J_D(\pi)$ and $J_M(\pi)$ asymptotically approaches 0 as the offline dataset $D$ grows. Formal proofs are deferred to Appendix A.4.

Theorem 2. In a transformed BAMDP $\widetilde{M}^+$, for each task-dependent behavior policy $[\mu]$ and batch-constrained meta-policy $\widetilde{\pi}^+$, the data distribution induced by $\widetilde{\pi}^+$ and $[\mu]$ matches after filtering out out-of-distribution episodes in online adaptation. Moreover, the policy evaluation of $\widetilde{\pi}^+$ in offline meta-training and online adaptation will be asymptotically consistent as the offline dataset grows.

Proof. We assume feasible Bayesian belief updating in this proof.
Here $b_t^{\kappa,\mu}$ is defined in Eq. (51), and
$$p(\kappa_i, \mu_i \mid b_t^{\kappa,\mu}) \propto \mathbb{E}_{\kappa_i \sim p(\kappa),\, \mu_i \sim p(\mu|\kappa_i)} \mathbb{E}_{c_{:t} \sim p_{\widetilde{M}^+}(c_{:t}|\kappa_i, \mu_i)} \left[ \mathbb{1}\left[b_t^{\kappa,\mu} = p(\kappa, \mu|c_{:t})\right] \right],$$
where $p_{\widetilde{M}^+}(c_{:t}|\kappa_i, \mu_i)$ is defined in Eq. (9). According to Eq. (40) and (41),
$$P_{\widetilde{M}^+,[\mu]}(r_t, s_{t+1} \mid \widetilde{s}_t^+, a_t) \propto \mathbb{E}_{\kappa_i \sim p(\kappa),\, \mu_i \sim p(\mu|\kappa_i)} \left[ P_{\kappa_i}(r_t, s_{t+1}|s_t, a_t) \cdot p_{\widetilde{M}^+}(\widetilde{s}_t^+|\kappa_i, \mu_i) \right] \tag{71}$$
$$= \mathbb{E}_{\kappa_i \sim p(\kappa),\, \mu_i \sim p(\mu|\kappa_i)} \mathbb{E}_{c_{:t} \sim p_{\widetilde{M}^+}(c_{:t}|\kappa_i, \mu_i)} \left[ P_{\kappa_i}(r_t, s_{t+1}|s_t, a_t) \cdot \mathbb{1}\left[b_t^{\kappa,\mu} = p(\kappa, \mu|c_{:t})\right] \right] \tag{72}$$
$$= \mathbb{E}_{(\kappa_i, \mu_i) \sim b_t^{\kappa,\mu}} \left[ P_{\kappa_i}(r_t, s_{t+1}|s_t, a_t) \right] = P_{\widetilde{M}^+}(r_t, s_{t+1} \mid \widetilde{s}_t^+, a_t). \tag{73}$$
Thus, the data distribution induced by $\widetilde{\pi}^+$ and $[\mu]$ matches. Applying Lemma 1 directly in the transformed BAMDP $\widetilde{M}^+$, which is a belief MDP and hence a type of MDP, the policy evaluation of $\widetilde{\pi}^+$ in offline meta-training and online adaptation will be asymptotically consistent as the offline dataset grows. $\square$

A.4 OMITTED PROOF OF LEMMA 1

Definition 4 (Dataset-Induced Finite-Horizon MDPs). In a finite-horizon MDP $M$ with an offline dataset $D$, the dataset-induced finite-horizon MDP is defined by $M_D = (\mathcal{S}, \mathcal{A}, \mathcal{R}, H, P_{M_D}, R_{M_D})$, with the same state space, action space, reward space, and horizon as $M$. The transition function is defined as follows: $\forall s, s' \in \mathcal{S}, a \in \mathcal{A}$,
$$P_{M_D}(s'|s, a) = \begin{cases} \dfrac{N(s, a, s')}{N(s, a)}, & \text{if } N(s, a) > 0, \\ 0, & \text{otherwise,} \end{cases}$$
where $N(s, a, s')$ and $N(s, a)$ are the numbers of times the tuples $(s, a, s')$ and $(s, a)$ are observed in $D$, respectively. The reward function is defined by: $\forall s \in \mathcal{S}, a \in \mathcal{A}, r \in \mathcal{R}$,
$$R_{M_D}(r|s, a) = \begin{cases} \dfrac{N(s, a, r)}{N(s, a)}, & \text{if } N(s, a) > 0, \\ 0, & \text{otherwise,} \end{cases}$$
where $N(s, a, r)$ is the number of times the tuple $(s, a, r)$ is observed in $D$. The offline policy evaluation in $D$ equals the policy evaluation in $M_D$, i.e., for any batch-constrained policy $\pi$, $J_D(\pi) = J_{M_D}(\pi)$. Note that dataset-induced finite-horizon MDPs $M_D$ are not defined on supports outside of the dataset $D$; for simplicity, we set all undefined quantities to 0 in the transition and reward functions.

Lemma 2 (Simulation Lemma for Offline Finite-Horizon MDPs).
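The count-based estimates of Definition 4 can be sketched directly; the toy dataset below is a hypothetical example, not data from the paper.

```python
# Dataset-induced MDP (Definition 4): empirical transition and reward
# frequencies N(s,a,s') / N(s,a) and N(s,a,r) / N(s,a) from an offline dataset.
from collections import Counter

D = [  # hypothetical logged transitions (s, a, r, s')
    ("s0", "a0", 1.0, "s1"),
    ("s0", "a0", 0.0, "s1"),
    ("s0", "a0", 1.0, "s2"),
    ("s0", "a1", 1.0, "s1"),
]

n_sa, n_sas, n_sar = Counter(), Counter(), Counter()
for s, a, r, s2 in D:
    n_sa[(s, a)] += 1
    n_sas[(s, a, s2)] += 1
    n_sar[(s, a, r)] += 1

def P_hat(s2, s, a):
    """P_{M_D}(s'|s, a); 0 outside the support of D, as in Definition 4."""
    return n_sas[(s, a, s2)] / n_sa[(s, a)] if n_sa[(s, a)] else 0.0

def R_hat(r, s, a):
    """R_{M_D}(r|s, a); 0 outside the support of D."""
    return n_sar[(s, a, r)] / n_sa[(s, a)] if n_sa[(s, a)] else 0.0
```

Evaluating a batch-constrained policy in this empirical model is exactly the offline evaluation $J_D(\pi) = J_{M_D}(\pi)$.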
In an MDP $M$ with an offline dataset $D$, for any batch-constrained policy $\pi \in \Pi_D$, if
$$\max_{s \in \mathcal{S}, a \in \mathcal{A}:\ \rho_\pi^{M_D}(s,a) > 0} \left\| P_{M_D}(\cdot|s, a) - P_M(\cdot|s, a) \right\|_1 \leq \epsilon_P, \qquad \max_{s \in \mathcal{S}, a \in \mathcal{A}:\ \rho_\pi^{M_D}(s,a) > 0} \left| r_{M_D}(s, a) - r_M(s, a) \right| \leq \epsilon_r,$$
$$\max_{s \in \mathcal{S}, a \in \mathcal{A}} \max \left( r_{M_D}(s, a),\, r_M(s, a) \right) \leq r_{\max},$$
where $r_M(s, a) = \mathbb{E}_{r \sim R_M(s,a)}[r]$ and $r_{M_D}(s, a) = \mathbb{E}_{r \sim R_{M_D}(s,a)}[r]$, then
$$|J_{M_D}(\pi) - J_M(\pi)| \leq H \epsilon_r + \frac{H(H-1) r_{\max}}{2} \epsilon_P.$$

Proof. Similar to the well-known simulation lemma in finite-horizon MDPs (Alekh Agarwal, 2017), the proof is as follows. Recall the value function: $\forall h \in [H-1]$, $\forall s \in \mathcal{S}_h$ (see Eq. (10)),
$$V_\pi^M(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( r_M(s, a) + \sum_{s' \in \mathcal{S}_{h+1}} P_M(s'|s, a) V_\pi^M(s') \right), \qquad J_M(\pi) = V_\pi^M(s_0),$$
and $\forall h \in [H]$, $\max_{s \in \mathcal{S}_h} V_\pi^M(s) \leq (H-h) r_{\max}$. We will prove, by induction: $\forall h \in [H]$, $\forall s \in \mathcal{S}_h$ with $\rho_\pi^{M_D}(s) > 0$,
$$\left| V_\pi^{M_D}(s) - V_\pi^M(s) \right| \leq (H-h)\epsilon_r + \frac{(H-h)(H-h-1) r_{\max}}{2} \epsilon_P. \tag{84}$$
When $h = H-1$, for all $s \in \mathcal{S}_h$ with $\rho_\pi^{M_D}(s) > 0$,
$$\left| V_\pi^{M_D}(s) - V_\pi^M(s) \right| = \left| \sum_{a \in \mathcal{A}} \pi(a|s)\, r_{M_D}(s, a) - \sum_{a \in \mathcal{A}} \pi(a|s)\, r_M(s, a) \right| \leq \epsilon_r \tag{85}$$
holds. And $\forall h \in [H-1]$, $\forall s \in \mathcal{S}_h$ with $\rho_\pi^{M_D}(s) > 0$,
$$\left| V_\pi^{M_D}(s) - V_\pi^M(s) \right| = \left| \sum_{a \in \mathcal{A}} \pi(a|s) \left( r_{M_D}(s, a) + \sum_{s' \in \mathcal{S}_{h+1}} P_{M_D}(s'|s, a) V_\pi^{M_D}(s') \right) - \sum_{a \in \mathcal{A}} \pi(a|s) \left( r_M(s, a) + \sum_{s' \in \mathcal{S}_{h+1}} P_M(s'|s, a) V_\pi^M(s') \right) \right|$$
$$\leq \sum_{a \in \mathcal{A}} \pi(a|s) \left| r_{M_D}(s, a) - r_M(s, a) \right| + \sum_{a \in \mathcal{A}} \pi(a|s) \left| \sum_{s' \in \mathcal{S}_{h+1}} \left( P_{M_D}(s'|s, a) V_\pi^{M_D}(s') - P_M(s'|s, a) V_\pi^M(s') \right) \right|$$
$$\leq \epsilon_r + \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}_{h+1}} \left| P_{M_D}(s'|s, a) - P_M(s'|s, a) \right| V_\pi^M(s') + \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}_{h+1}} P_{M_D}(s'|s, a) \left| V_\pi^{M_D}(s') - V_\pi^M(s') \right|$$
$$\leq \epsilon_r + (H-(h+1)) r_{\max} \epsilon_P + (H-(h+1))\epsilon_r + \frac{(H-(h+1))(H-(h+1)-1) r_{\max}}{2} \epsilon_P$$
$$= (H-h)\epsilon_r + \frac{(H-h)(H-h-1) r_{\max}}{2} \epsilon_P.$$
Thus,
$$|J_{M_D}(\pi) - J_M(\pi)| = \left| V_\pi^{M_D}(s_0) - V_\pi^M(s_0) \right| \leq H \epsilon_r + \frac{H(H-1) r_{\max}}{2} \epsilon_P. \qquad \square$$

Lemma 3.
In an MDP $M$ with an offline dataset $D$ collected by a behavior policy $\mu$, for any batch-constrained policy $\pi \in \Pi_D$ and any $\delta \in (0, 1]$, with probability $1 - \delta$,
$$|J_{M_D}(\pi) - J_M(\pi)| \leq H^2 |\mathcal{S}| \sqrt{\frac{\log \frac{1}{\delta} + \log \left( 2|\mathcal{S}|^2 |\mathcal{A}| \right)}{K\, d_\mu^{M_D}}}, \tag{97}$$
where $K$ is the number of trajectories in the dataset $D$ and $d_\mu^{M_D}$ is the minimal visitation state-action distribution induced by the behavior policy $\mu$ in $M_D$ (see Eq. (19)).

Proof. $\forall s, s' \in \mathcal{S}, a \in \mathcal{A}$ with $\rho_\pi^{M_D}(s, a) > 0$, note that $\rho_\pi^{M_D}(s, a) \geq d_\mu^{M_D}$, and by the binomial theorem and Hoeffding's inequality, $\forall \epsilon \in [0, 1]$,
$$\mathbb{P}\left( \left| P_{M_D}(s'|s, a) - P_M(s'|s, a) \right| \geq \epsilon \right) \leq 2 \left( 1 - d_\mu^{M_D} + d_\mu^{M_D} \exp(-2\epsilon^2) \right)^K \leq 2 \exp\left( -d_\mu^{M_D} \epsilon^2 K \right) \tag{98, 99}$$
and
$$\mathbb{P}\left( \left| r_{M_D}(s, a) - r_M(s, a) \right| \geq \epsilon \right) \leq 2 \exp\left( -d_\mu^{M_D} \epsilon^2 K \right). \tag{100}$$
Thus, using a union bound, $\forall \delta \in (0, 1]$, with probability $1 - \delta$,
$$\epsilon_P = \max_{s, a:\ \rho_\pi^{M_D}(s,a) > 0} \left\| P_{M_D}(\cdot|s, a) - P_M(\cdot|s, a) \right\|_1 \leq \max_{s, a:\ \rho_\pi^{M_D}(s,a) > 0} |\mathcal{S}| \left\| P_{M_D}(\cdot|s, a) - P_M(\cdot|s, a) \right\|_\infty \leq |\mathcal{S}| \sqrt{\frac{\log \frac{1}{\delta} + \log(2|\mathcal{S}|^2|\mathcal{A}|)}{K\, d_\mu^{M_D}}}, \tag{101-103}$$
$$\epsilon_r = \max_{s, a:\ \rho_\pi^{M_D}(s,a) > 0} \left| r_{M_D}(s, a) - r_M(s, a) \right| \leq \sqrt{\frac{\log \frac{1}{\delta} + \log(2|\mathcal{S}||\mathcal{A}|)}{K\, d_\mu^{M_D}}}, \tag{104}$$
and $r_{\max} = 1$. Thus, according to Lemma 2 (a variant of the simulation lemma for offline RL), we have
$$|J_{M_D}(\pi) - J_M(\pi)| \leq \left( H + \frac{H(H-1)}{2} |\mathcal{S}| \right) \sqrt{\frac{\log \frac{1}{\delta} + \log(2|\mathcal{S}|^2|\mathcal{A}|)}{K\, d_\mu^{M_D}}} \leq H^2 |\mathcal{S}| \sqrt{\frac{\log \frac{1}{\delta} + \log(2|\mathcal{S}|^2|\mathcal{A}|)}{K\, d_\mu^{M_D}}}. \qquad \square \tag{105-106}$$

Lemma 1 (restated). In an MDP $M$, for each behavior policy $\mu$ and batch-constrained policy $\pi$ with a collected dataset $D$, the gap between the approximate offline policy evaluation $J_D(\pi)$ and the accurate policy evaluation $J_M(\pi)$ asymptotically approaches 0 as the offline dataset $D$ grows.

Proof. From Lemma 3, as the size of the offline dataset $|D| = KH$ grows, the gap between the approximate offline policy evaluation $J_D(\pi)$ and the accurate policy evaluation $J_M(\pi)$ asymptotically approaches 0. $\square$

A sub-dataset collected by a behavior policy $\mu_i$ in a task $\kappa_i$ is denoted $D_{\kappa_i, \mu_i}$. Note that an offline multi-task dataset $D^+$ is the union of sub-datasets $D_{\kappa_i, \mu_i}$, i.e., $D^+ = \bigcup_{\kappa_i, \mu_i} D_{\kappa_i, \mu_i}$.
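Lemma 1's convergence is easy to observe numerically. The sketch below uses a hypothetical one-step ($H = 1$) MDP with Bernoulli rewards, where the dataset-induced evaluation reduces to an empirical mean; all names and values are invented for the illustration.

```python
# Empirical illustration of Lemma 1: on a one-state, two-action, H = 1 MDP,
# J_D(pi) is the empirical mean reward of the action pi plays, which converges
# to J_M(pi) as the number of logged episodes K grows.
import random

rng = random.Random(0)
TRUE_MEAN = {"a0": 0.7, "a1": 0.3}            # r_M(s0, a) = E[r | s0, a]

def collect(K):
    """Log K episodes with a uniform behavior policy mu."""
    data = []
    for _ in range(K):
        a = "a0" if rng.random() < 0.5 else "a1"
        r = 1.0 if rng.random() < TRUE_MEAN[a] else 0.0
        data.append((a, r))
    return data

def J_D(data, action="a0"):
    """ADP evaluation of the deterministic batch-constrained policy pi(s0) = action."""
    rewards = [r for a, r in data if a == action]
    return sum(rewards) / len(rewards)

# With K = 20000 episodes (~10^4 visits of (s0, a0)), Hoeffding's inequality
# makes a gap below 0.05 overwhelmingly likely, matching Lemma 3's rate.
gap = abs(J_D(collect(20000)) - TRUE_MEAN["a0"])
```

The $1/\sqrt{K d_\mu}$ rate of Lemma 3 predicts that quadrupling the dataset roughly halves this gap.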
For each sub-dataset $D_{\kappa_i, \mu_i}$, we can define a batch-constrained policy set for the single task $(\kappa_i, \mu_i)$ as $\Pi_{D_{\kappa_i, \mu_i}}$ (see the definition in Eq. (15)).

Definition 6 (Meta-Policy with Thompson Sampling). For each transformed BAMDP $\widetilde{M}^+$, a meta-policy with Thompson sampling on $\widetilde{M}^+$ is defined by $\widetilde{\pi}^{+,T}: \mathcal{S} \times \widetilde{\mathcal{B}} \times \mathcal{K} \times \Pi_{[\mu]} \to \Delta(\mathcal{A})$, where $\widetilde{\mathcal{B}}$ is the space of beliefs over meta-RL MDPs with behavior policies, $\mathcal{K}$ is the space of tasks $\kappa$, and $\Pi_{[\mu]}$ is the space of task-dependent behavior policies. In each episode, $\widetilde{\pi}^{+,T}$ samples a task hypothesis $(\kappa_i, \mu_i)$ from the current belief $b_{t'}^{\kappa,\mu}$, where $t'$ is the starting step of this episode. During this episode, $\widetilde{\pi}^{+,T}(\cdot|s_t, b_{t'}^{\kappa,\mu}, \kappa_i, \mu_i)$ prescribes a distribution over actions for each state $s_t$, belief $b_{t'}^{\kappa,\mu}$, and task hypothesis $(\kappa_i, \mu_i)$. The belief $b_{t'}^{\kappa,\mu}$ and the task hypothesis $(\kappa_i, \mu_i)$ are periodically updated after each episode. In the deep-learning-based implementation, a context-based meta-RL algorithm, PEARL (Rakelly et al., 2019), utilizes a meta-policy with Thompson sampling (Strens, 2000) to iteratively update the task belief by interacting with the environment and improves the meta-policy based on the "task hypothesis" sampled from the current belief. We adopt this adaptation protocol to design practical offline meta-RL algorithms for transformed BAMDPs.

Definition 7 (Batch-Constrained Meta-Policy Set with Thompson Sampling). For each transformed BAMDP $\widetilde{M}^+$ with an offline multi-task dataset $D^+$, the batch-constrained meta-policy set with Thompson sampling is defined by
$$\Pi_{D^+, T} = \left\{ \widetilde{\pi}^{+,T} \,\middle|\, \widetilde{\pi}^{+,T}(a_t|s_t, b_{t'}^{\kappa,\mu}, \kappa_i, \mu_i) = 0 \text{ whenever } (s_t, a_t) \notin D_{\kappa_i, \mu_i},\ \forall b_{t'}^{\kappa,\mu} \right\}, \tag{108}$$
where we write $(s_t, a_t) \in D_{\kappa_i, \mu_i}$ if there exists a trajectory containing $(s_t, a_t)$ in the sub-dataset $D_{\kappa_i, \mu_i}$.
The batch-constrained meta-policy set with Thompson sampling $\Pi_{D^+, T}$ consists of the meta-policies such that, for any state $s_t$ observed in the hypothesis dataset $D_{\kappa_i, \mu_i}$, the agent never selects an action outside of the dataset. Note that in each episode with a task hypothesis $(\kappa_i, \mu_i)$, a batch-constrained meta-policy with Thompson sampling $\pi^{+,T}$ is batch-constrained within the sub-dataset $D_{\kappa_i, \mu_i}$, i.e., $\forall b^{\kappa,\mu}_{t'}$, we have $\pi^{+,T}(\cdot \mid s_t, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i) \in \Pi_{D_{\kappa_i, \mu_i}}$.

Definition 8 (Probability that a Policy Leaves the Dataset). In an MDP $M$ with an arbitrary offline dataset $D$, for each policy $\pi: S \to \Delta(A)$, the probability that executing $\pi$ in $M$ leaves the dataset $D$ within an episode is defined by
$$
p^{M,D}_{\text{out}}(\pi) = \sum_{\tau_H} p^M_\pi(\tau_H) \, \mathbb{1}\left[ \tau_H \text{ leaves } D \right] = \sum_{\tau_H} p^M_\pi(\tau_H) \, \mathbb{1}\left[ \exists t \in [H] \text{ s.t. } s_t \notin D \text{ or } (s_t, a_t, r_t) \notin D \right],
$$
where $p^M_\pi(\tau_H)$ is the probability of executing $\pi$ in $M$ to generate an $H$-horizon trajectory $\tau_H$ (see the definition in Eq. (9)); $s_t \in D$ denotes that there exists a trajectory containing $s_t$ in the dataset $D$, and similarly for $(s_t, a_t, r_t) \in D$.

When we aim to confine the agent to in-distribution states with high probability as the offline dataset $D$ grows, it is equivalent to bounding $p^{M,D}_{\text{out}}(\pi)$, the probability that executing a policy $\pi$ in $M$ leaves the dataset $D$ within an episode.

Theorem 3. In a transformed BAMDP $M^+$ with an offline multi-task dataset $D^+$ collected by task-dependent behavior policies $[\mu]$, consider any batch-constrained meta-policy with Thompson sampling $\pi^{+,T} \in \Pi_{D^+, T}$ (see Definitions 6 and 7). For each adaptation episode in a meta-testing task $\kappa_{\text{test}} \sim p(\kappa)$ with current belief $b^{\kappa,\mu}_{t'}$, there exists a task hypothesis $(\kappa_i, \mu_i)$ in $b^{\kappa,\mu}_{t'}$ such that executing $\pi^{+,T}$ with $(b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i)$ in $\kappa_{\text{test}}$ confines the agent to in-distribution belief states with high probability, as the offline dataset $D^+$ grows.
From Theorem 3, for each adaptation episode in $\kappa_{\text{test}}$ with the current belief $b^{\kappa,\mu}_{t'}$, we can sample a task hypothesis $(\kappa_i, \mu_i) \sim b^{\kappa,\mu}_{t'}$ and execute $\pi^{+,T}$ with $(b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i)$ to interact with the environment until finding an in-distribution episode. Thus, we prove that meta-policies with Thompson sampling can filter out out-of-distribution episodes in $M^+$ with high probability, as the offline dataset $D^+$ grows. Note that Theorem 3 holds for an arbitrary task distribution $p(\kappa)$, since the distance between the closest meta-training task $\kappa_{i^*}$ and $\kappa_{\text{test}}$ asymptotically approaches zero with high probability as the set of i.i.d. offline meta-training tasks $K_{\text{train}}$ sampled from $p(\kappa)$ in $D^+$ grows. The detailed proofs are as follows.

Proof. For each batch-constrained meta-policy with Thompson sampling $\pi^{+,T} \in \Pi_{D^+, T}$ with $(b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i)$, analogously to Definition 8, define the probability that executing $\pi^{+,T}(\cdot \mid s_t, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i)$ leaves the dataset $D^+$ in an adaptation episode of a meta-testing task $\kappa_{\text{test}} \sim p(\kappa)$:
$$
p^{\kappa_{\text{test}}, D^+}_{\text{out}}\left( \pi^{+,T}, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i \right) = \sum_{\tau^+_H} p^{\kappa_{\text{test}}}_{\pi^{+,T}}\left( \tau^+_H \,\middle|\, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i \right) \mathbb{1}\left[ \tau^+_H \text{ leaves } D^+ \right],
$$
where we can transform $\tau^+_H$ to $\tau_H$ with the same probability, since the belief $b^{\kappa,\mu}_{t'}$ and the task hypothesis $(\kappa_i, \mu_i)$ are updated only after each episode. Therefore,
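The adaptation protocol implied by Theorem 3 can be sketched as a simple loop (an illustrative toy implementation under our own assumptions, not the paper's code; the dictionary belief representation, `env_step`, and the toy posterior update are hypothetical stand-ins):

```python
import random

def adapt_with_thompson_sampling(belief, env_step, dataset,
                                 num_episodes=10, horizon=5):
    """Per episode: sample a task hypothesis from the current belief, roll out
    the hypothesis-conditioned (batch-constrained) policy, and update the belief
    only with episodes that stay inside the dataset's support, i.e., the
    in-distribution test of Definition 8."""
    belief = dict(belief)  # work on a copy of the belief weights
    accepted = []
    for _ in range(num_episodes):
        hypothesis = random.choices(list(belief), weights=belief.values())[0]
        episode = [env_step(hypothesis, t) for t in range(horizon)]
        if all(transition in dataset for transition in episode):
            accepted.append((hypothesis, episode))
            # Toy posterior update: up-weight hypotheses that produced an
            # in-distribution episode, then renormalize.
            belief[hypothesis] *= 2.0
            total = sum(belief.values())
            belief = {h: w / total for h, w in belief.items()}
    return belief, accepted
```

Out-of-distribution episodes are simply discarded rather than used to update the belief, mirroring the filtering argument in the proof.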
$$
p^{\kappa_{\text{test}}, D^+}_{\text{out}}\left( \pi^{+,T}, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i \right) = \sum_{\tau^+_H} p^{\kappa_{\text{test}}}_{\pi^{+,T}}\left( \tau_H \mid b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i \right) \mathbb{1}\left[ \tau^+_H \text{ leaves } D^+ \right] \le \sum_{\tau_H} p^{\kappa_{\text{test}}}_{\pi^{+,T}}\left( \tau_H \mid b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i \right) \mathbb{1}\left[ \tau_H \text{ leaves } D_{\kappa_{i^*}, \mu_{i^*}} \right],
$$
where $\kappa_{i^*}$ is the closest offline meta-training task to $\kappa_{\text{test}}$,
$$
\kappa_{i^*} = \arg\min_{\kappa_i \in K_{\text{train}}} \left\| \kappa_i - \kappa_{\text{test}} \right\|_\infty = \arg\min_{\kappa_i \in K_{\text{train}}} \max\left( \left\| P_{\kappa_i}(s, a, s') - P_{\kappa_{\text{test}}}(s, a, s') \right\|_\infty, \left\| R_{\kappa_i}(s, a, r) - R_{\kappa_{\text{test}}}(s, a, r) \right\|_\infty \right).
$$
For the first adaptation episode in a meta-testing task $\kappa_{\text{test}} \sim p(\kappa)$ with the prior belief $p(\kappa, \mu) = p(\kappa) p(\mu|\kappa)$, the task hypothesis $(\kappa_{i^*}, \mu_{i^*})$ lies in the support of the prior $p(\kappa, \mu)$. Then, since $\pi^{+,T}(\cdot \mid s_t, p(\kappa, \mu), \kappa_{i^*}, \mu_{i^*}) \in \Pi_{D_{\kappa_{i^*}, \mu_{i^*}}}$ from Definition 7 and
$$
p^{\kappa_{\text{test}}, D^+}_{\text{out}}\left( \pi^{+,T}, p(\kappa, \mu), \kappa_{i^*}, \mu_{i^*} \right) \le p^{\kappa_{\text{test}}, D_{\kappa_{i^*}, \mu_{i^*}}}_{\text{out}}\left( \pi^{+,T}, p(\kappa, \mu), \kappa_{i^*}, \mu_{i^*} \right),
$$
as the offline dataset $D^+$ grows, executing $\pi^{+,T}$ with $(p(\kappa, \mu), \kappa_{i^*}, \mu_{i^*})$ for the first episode in $\kappa_{\text{test}}$ confines the agent to in-distribution belief states with high probability. In the subsequent adaptation episodes with current belief $b^{\kappa,\mu}_{t'}$ in $\kappa_{\text{test}}$, the task hypothesis $(\kappa_{i^*}, \mu_{i^*})$ remains in the belief $b^{\kappa,\mu}_{t'}$ by induction. Therefore, for each adaptation episode with current belief $b^{\kappa,\mu}_{t'}$ in $\kappa_{\text{test}}$, there exists a task hypothesis $(\kappa_i, \mu_i)$ from $b^{\kappa,\mu}_{t'}$, e.g., $(\kappa_{i^*}, \mu_{i^*})$, such that executing $\pi^{+,T}$ with $(b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i)$ in $\kappa_{\text{test}}$ confines the agent to in-distribution belief states with high probability, as the offline dataset $D^+$ grows.

A.6 OMITTED LEMMAS FOR THEOREM 3

Lemma 4. In an MDP $M$ with an arbitrary offline dataset $D$, for each policy $\pi$,
$$
p^{M,D}_{\text{out}}(\pi) \le H \left( \left\| \rho^M_\pi(s) - \rho^{M_D}_\pi(s) \right\|_1 + \left\| \rho^M_\pi(s, a, r) - \rho^{M_D}_\pi(s, a, r) \right\|_1 \right),
$$
where $\rho^M_\pi(s, a, r)$ is the visitation distribution of $(s, a, r)$ in $M$, as defined in Eq. (12).

Proof. For each $\tau_H$ that leaves $D$, we use the first outlier datum ($s_t \notin D$ or $(s_t, a_t, r_t) \notin D$) to represent $\tau_H$. Thus,
$$
p^{M,D}_{\text{out}}(\pi) = \sum_{\tau_H} p^M_\pi(\tau_H) \, \mathbb{1}\left[ \exists t \in [H] \text{ s.t. } s_t \notin D \text{ or } (s_t, a_t, r_t) \notin D \right] \le H \Bigg( \sum_{\substack{s \in S \\ \rho^{M_D}_\pi(s) = 0}} \rho^M_\pi(s) + \sum_{\substack{s \in S, a \in A, r \in R \\ \rho^{M_D}_\pi(s, a, r) = 0}} \rho^M_\pi(s, a, r) \Bigg) \le H \left( \left\| \rho^M_\pi(s) - \rho^{M_D}_\pi(s) \right\|_1 + \left\| \rho^M_\pi(s, a, r) - \rho^{M_D}_\pi(s, a, r) \right\|_1 \right).
$$

Then, for any batch-constrained policy $\pi$ in $D_{\kappa_{i^*}, \mu_{i^*}}$, where $D_{\kappa_{i^*}, \mu_{i^*}}$ is a sub-dataset collected in $D^+$ (see Definition 5), i.e., $\forall \pi \in \Pi_{D_{\kappa_{i^*}, \mu_{i^*}}}$,
$$
p^{\kappa_{\text{test}}, D_{\kappa_{i^*}, \mu_{i^*}}}_{\text{out}}(\pi) \le 2 H^2 |S|^2 |A| |R| \Bigg( \underbrace{\sqrt{\frac{\log \frac{1}{\delta} + \log \left( 2 |S|^2 |A| \right)}{K_{\kappa_{i^*}, \mu_{i^*}} \cdot d^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\mu}}}_{\text{asymptotically approaches zero when } D_{\kappa_{i^*}, \mu_{i^*}} \text{ is sufficiently large}} + \underbrace{2 \left( \frac{\log \frac{1}{\delta}}{|K_{\text{train}}|} \right)^{\frac{1}{|S| |A| (|S| + |R|)}}}_{\text{asymptotically approaches zero when } K_{\text{train}} \text{ is sufficiently large}} \Bigg),
$$
where $K_{\kappa_{i^*}, \mu_{i^*}}$ is the number of trajectories in the sub-dataset $D_{\kappa_{i^*}, \mu_{i^*}}$, $d^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\mu$ is the minimal visitation state-action distribution induced by the behavior policy $\mu$ in $M_{D_{\kappa_{i^*}, \mu_{i^*}}}$ (see Eq. (19)), and $|K_{\text{train}}|$ is the number of i.i.d. offline meta-training tasks sampled from $p(\kappa)$ in $D^+$. By Lemma 4,
$$
p^{\kappa_{\text{test}}, D_{\kappa_{i^*}, \mu_{i^*}}}_{\text{out}}(\pi) \le H \left( \left\| \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s) - \rho^{\kappa_{\text{test}}}_\pi(s) \right\|_1 + \left\| \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s, a, r) - \rho^{\kappa_{\text{test}}}_\pi(s, a, r) \right\|_1 \right) \le H \left( \left\| \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s) - \rho^{\kappa_{i^*}}_\pi(s) \right\|_1 + \left\| \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s, a, r) - \rho^{\kappa_{i^*}}_\pi(s, a, r) \right\|_1 \right) + H \left( \left\| \rho^{\kappa_{i^*}}_\pi(s) - \rho^{\kappa_{\text{test}}}_\pi(s) \right\|_1 + \left\| \rho^{\kappa_{i^*}}_\pi(s, a, r) - \rho^{\kappa_{\text{test}}}_\pi(s, a, r) \right\|_1 \right).
$$

Part I. Similarly to Lemma 3, $\forall \delta \in (0, 1]$, with probability $1 - \delta$,
$$
\epsilon_P = \max_{\substack{s \in S, a \in A \\ \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s, a) > 0}} \left\| P_{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}(\cdot|s, a) - P_{\kappa_{i^*}}(\cdot|s, a) \right\|_1 \le |S| \sqrt{\frac{\log \frac{1}{\delta} + \log \left( 2 |S|^2 |A| \right)}{K_{\kappa_{i^*}, \mu_{i^*}} \cdot d^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\mu}},
$$
$$
\epsilon_r = \max_{\substack{s \in S, a \in A \\ \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s, a) > 0}} \left| r_{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}(s, a) - r_{\kappa_{i^*}}(s, a) \right| \le \sqrt{\frac{\log \frac{1}{\delta} + \log \left( 2 |S| |A| \right)}{K_{\kappa_{i^*}, \mu_{i^*}} \cdot d^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\mu}};
$$
thus, from Lemma 6, we have
$$
H \left( \left\| \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s) - \rho^{\kappa_{i^*}}_\pi(s) \right\|_1 + \left\| \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s, a, r) - \rho^{\kappa_{i^*}}_\pi(s, a, r) \right\|_1 \right) \le 2 H^2 |S|^2 |A| |R| \sqrt{\frac{\log \frac{1}{\delta} + \log \left( 2 |S|^2 |A| \right)}{K_{\kappa_{i^*}, \mu_{i^*}} \cdot d^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\mu}}.
$$
Part II. From Lemma 10, $\forall \delta \in (0, 1]$, with probability $1 - \delta$,
$$
\bar\epsilon_P = \max_{s \in S, a \in A} \left\| P_{\kappa_{i^*}}(\cdot|s, a) - P_{\kappa_{\text{test}}}(\cdot|s, a) \right\|_1 \le |S| \left\| \kappa_{i^*} - \kappa_{\text{test}} \right\|_\infty \le 2 |S| \left( \frac{\log \frac{1}{\delta}}{|K_{\text{train}}|} \right)^{\frac{1}{|S| |A| (|S| + |R|)}},
$$
$$
\bar\epsilon_R = \max_{s \in S, a \in A, r \in R} \left| R_{\kappa_{i^*}}(r|s, a) - R_{\kappa_{\text{test}}}(r|s, a) \right| \le \left\| \kappa_{i^*} - \kappa_{\text{test}} \right\|_\infty \le 2 \left( \frac{\log \frac{1}{\delta}}{|K_{\text{train}}|} \right)^{\frac{1}{|S| |A| (|S| + |R|)}};
$$
then, from Lemma 8, we have
$$
H \left( \left\| \rho^{\kappa_{i^*}}_\pi(s) - \rho^{\kappa_{\text{test}}}_\pi(s) \right\|_1 + \left\| \rho^{\kappa_{i^*}}_\pi(s, a, r) - \rho^{\kappa_{\text{test}}}_\pi(s, a, r) \right\|_1 \right) \le 4 H^2 |S|^2 |A| |R| \left( \frac{\log \frac{1}{\delta}}{|K_{\text{train}}|} \right)^{\frac{1}{|S| |A| (|S| + |R|)}}.
$$

Overall. Combining Part I and Part II, we have
$$
p^{\kappa_{\text{test}}, D_{\kappa_{i^*}, \mu_{i^*}}}_{\text{out}}(\pi) \le H \left( \left\| \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s) - \rho^{\kappa_{i^*}}_\pi(s) \right\|_1 + \left\| \rho^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\pi(s, a, r) - \rho^{\kappa_{i^*}}_\pi(s, a, r) \right\|_1 \right) + H \left( \left\| \rho^{\kappa_{i^*}}_\pi(s) - \rho^{\kappa_{\text{test}}}_\pi(s) \right\|_1 + \left\| \rho^{\kappa_{i^*}}_\pi(s, a, r) - \rho^{\kappa_{\text{test}}}_\pi(s, a, r) \right\|_1 \right) \le 2 H^2 |S|^2 |A| |R| \left( \sqrt{\frac{\log \frac{1}{\delta} + \log \left( 2 |S|^2 |A| \right)}{K_{\kappa_{i^*}, \mu_{i^*}} \cdot d^{M_{D_{\kappa_{i^*}, \mu_{i^*}}}}_\mu}} + 2 \left( \frac{\log \frac{1}{\delta}}{|K_{\text{train}}|} \right)^{\frac{1}{|S| |A| (|S| + |R|)}} \right).
$$

If additionally
$$
\max_{\substack{s \in S, a \in A, r \in R \\ \rho^{M_D}_\pi(s, a) > 0}} \left| R_{M_D}(r|s, a) - R_M(r|s, a) \right| \le \epsilon_R,
$$
we have
$$
\left\| \rho^M_\pi(s) - \rho^{M_D}_\pi(s) \right\|_1 + \left\| \rho^M_\pi(s, a, r) - \rho^{M_D}_\pi(s, a, r) \right\|_1 \le |S| (|A| |R| + 1) \frac{H - 1}{2} \epsilon_P + |S| |A| |R| \, \epsilon_R.
$$

Proof. For each $\hat s \in S$, $\hat a \in A$, create an auxiliary reward function $\hat r(s, a): S \times A \to [0, 1]$:
$$
\forall s \in S, a \in A, \quad \hat r(s, a) = \begin{cases} \frac{1}{H}, & \text{if } s = \hat s \text{ and } a = \hat a, \\ 0, & \text{otherwise.} \end{cases}
$$

$$
\Pr\left[ \max_{i \in [n]} |X_i - Y_i| \ge \epsilon \right] \le 1 - \left( \frac{\epsilon}{2} \right)^n.
$$

Proof. Denote an auxiliary set
$$
V = \left\{ x \in \mathbb{R}^n \,\middle|\, \max_{i \in [n]} |x_i| < \frac{\epsilon}{2} \right\};
$$
then, if $X, Y \in V$, we must have $\max_{i \in [n]} |X_i - Y_i| < \epsilon$. For any $c \in \mathbb{N}^n$, denote $V_c = V + v_c$, where $v_{c,i} = \left( c_i + \frac{1}{2} \right) \epsilon$, $\forall i \in [n]$. We may construct a set of such cosets of $V$ as follows:
$$
\mathcal{S} = \left\{ V_c \mid c \in C \right\}, \quad \text{where } C = \left\{ c \in \mathbb{N}^n \,\middle|\, c_i \in \left[ \left\lceil \tfrac{1}{\epsilon} \right\rceil \right] \right\}.
$$
There are several properties related to these constructions:
• For any $c \in \mathbb{N}^n$, if $X, Y \in V_c$, then $\max_{i \in [n]} |X_i - Y_i| < \epsilon$.
• The union of the sets in $\mathcal{S}$ contains $[0, 1]^n$.
• Any two different sets in $\mathcal{S}$ are disjoint.
The only loophole is that we have not considered points on the boundaries $\partial V_c$ ($V_c \in \mathcal{S}$). These boundaries can be decomposed into a disjoint union of hyperplanes in $\mathbb{R}^n$. For each of these hyperplanes, arbitrarily designate it to an adjacent $V_c \in \mathcal{S}$; the new $V_c$'s are the union of the original one and the hyperplanes designated to it. Note that $\sum_{c \in C} \mathbb{1}[X \in V_c] = 1$.

Therefore,
$$
\Pr\left[ \max_{i \in [n]} |X_i - Y_i| \ge \epsilon \right] \le 1 - \sum_{c \in C} \Pr[X \in V_c] \Pr[Y \in V_c] = 1 - \sum_{c \in C} \Pr[X \in V_c]^2 \le 1 - \frac{1}{|C|} \left( \sum_{c \in C} \Pr[X \in V_c] \right)^2 = 1 - \frac{1}{|C|}.
$$
Since $\frac{1}{\epsilon} \ge 1$, we have $|C| = \left\lceil \frac{1}{\epsilon} \right\rceil^n < \left( 1 + \frac{1}{\epsilon} \right)^n \le \left( \frac{2}{\epsilon} \right)^n$, and therefore
$$
\Pr\left[ \max_{i \in [n]} |X_i - Y_i| \ge \epsilon \right] \le 1 - \left( \frac{\epsilon}{2} \right)^n.
$$

Lemma 10. In a transformed BAMDP $M^+$ with an offline multi-task dataset $D^+$, for any meta-testing task $\kappa_{\text{test}} \sim p(\kappa)$, $\forall \delta \in (0, 1]$, with probability $1 - \delta$, we have
$$
\left\| \kappa_{i^*} - \kappa_{\text{test}} \right\|_\infty = \max\left( \left\| P_{\kappa_{i^*}}(s, a, s') - P_{\kappa_{\text{test}}}(s, a, s') \right\|_\infty, \left\| R_{\kappa_{i^*}}(s, a, r) - R_{\kappa_{\text{test}}}(s, a, r) \right\|_\infty \right) \le 2 \left( \frac{\log \frac{1}{\delta}}{|K_{\text{train}}|} \right)^{\frac{1}{|S| |A| (|S| + |R|)}},
$$
where $\kappa_{i^*} \in K_{\text{train}}$ is the closest offline meta-training task to $\kappa_{\text{test}}$ (see Eq. (125)), $\left\| \kappa_{i^*} - \kappa_{\text{test}} \right\|_\infty$ is the distance between $\kappa_{i^*}$ and $\kappa_{\text{test}}$ (see Eq. (126)), and $K_{\text{train}}$ is the set of i.i.d. offline meta-training tasks sampled from $p(\kappa)$ in $D^+$.

Proof. From Lemma 9, setting $n = |S| |A| (|S| + |R|)$, we have $\forall \epsilon \in (0, 1]$, $\forall \kappa_i \in K_{\text{train}}$,
$$
\Pr\left[ \left\| \kappa_i - \kappa_{\text{test}} \right\|_\infty \ge \epsilon \right] \le 1 - \left( \frac{\epsilon}{2} \right)^n.
$$
Hence,
$$
\Pr\left[ \min_{\kappa_i \in K_{\text{train}}} \left\| \kappa_i - \kappa_{\text{test}} \right\|_\infty \ge \epsilon \right] = \prod_{\kappa_i \in K_{\text{train}}} \Pr\left[ \left\| \kappa_i - \kappa_{\text{test}} \right\|_\infty \ge \epsilon \right] \le \left( 1 - \left( \frac{\epsilon}{2} \right)^n \right)^{|K_{\text{train}}|}.
$$
Therefore, $\forall \delta \in (0, 1]$, with probability $1 - \delta$,
$$
\min_{\kappa_i \in K_{\text{train}}} \left\| \kappa_i - \kappa_{\text{test}} \right\|_\infty \le 2 \left( \frac{\log \frac{1}{\delta}}{|K_{\text{train}}|} \right)^{\frac{1}{|S| |A| (|S| + |R|)}}.
$$
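The bound in Lemma 9 is easy to sanity-check numerically. The sketch below (our own illustration, with uniform random vectors as one admissible distribution for $X, Y$) compares a Monte Carlo estimate of $\Pr[\max_i |X_i - Y_i| \ge \epsilon]$ against $1 - (\epsilon/2)^n$:

```python
import numpy as np

def lemma9_monte_carlo(n=2, eps=0.5, trials=100_000, seed=0):
    """Monte Carlo estimate of P[max_i |X_i - Y_i| >= eps] for i.i.d.
    X, Y uniform on [0, 1]^n, together with the 1 - (eps/2)^n bound."""
    rng = np.random.default_rng(seed)
    X = rng.random((trials, n))
    Y = rng.random((trials, n))
    p_hat = float((np.abs(X - Y).max(axis=1) >= eps).mean())
    bound = 1.0 - (eps / 2.0) ** n
    return p_hat, bound

p_hat, bound = lemma9_monte_carlo()
```

For $n = 2$, $\epsilon = 0.5$, the exact probability in the uniform case is $1 - (3/4)^2 = 0.4375$, comfortably below the bound $1 - (1/4)^2 = 0.9375$; the bound is loose here, but it must hold for an arbitrary distribution of $X$, which is what the covering argument above buys.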

C ANALYSIS AND VISUALIZATION ON UNCERTAINTY ESTIMATION

To further investigate why the uncertainty estimation method (Kidambi et al., 2020) fails to identify in-distribution trajectories, we plot the ensemble's prediction errors and uncertainty estimates on the first 10 trajectories. As shown in Figure 5, the uncertainty estimation method cannot accurately estimate the distance to the dataset and fails to identify in-distribution data. In contrast, GCC successfully selects in-distribution data with its greedy selection mechanism.
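For concreteness, the two selection rules contrasted in this appendix can be sketched as follows (a schematic comparison under our own assumptions, not the authors' implementation; the ensemble interface and the episode format are hypothetical):

```python
import numpy as np

def ensemble_disagreement(models, states):
    """Ensemble-based uncertainty proxy in the spirit of Kidambi et al. (2020):
    maximum pairwise disagreement between ensemble members' predictions
    at each queried state."""
    preds = np.stack([m(states) for m in models])           # (E, N, d)
    diffs = preds[:, None] - preds[None, :]                 # (E, E, N, d)
    return np.linalg.norm(diffs, axis=-1).max(axis=(0, 1))  # (N,)

def greedy_select(episodes):
    """GCC-style greedy correction: trust the observed return rather than an
    uncertainty score, and keep the highest-return candidate episode."""
    return max(episodes, key=lambda ep: ep["return"])
```

When the ensemble's disagreement is miscalibrated, as in Figure 5(b), thresholding `ensemble_disagreement` picks the wrong trajectory, while `greedy_select` still recovers the in-distribution one, because few-shot out-of-distribution episodes usually have lower returns.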

D ADDITIONAL VISUALIZATION RESULTS

Figure 6 visualizes the adaptation (episodes 11-20) of GCC, FOCAL, and the uncertainty estimation method (Kidambi et al., 2020) after the initial exploration phase of adaptation (episodes 1-10). The results demonstrate that while GCC is able to identify in-distribution data and achieve superior adaptation performance, FOCAL utilizes all 10 trajectories for adaptation and cannot correctly update the task belief due to the out-of-distribution issue. The uncertainty estimation method, as discussed in Appendix C, fails to correctly select the in-distribution trajectory and thus cannot successfully reach the goal.



GCC: GREEDY CONTEXT-BASED DATA DISTRIBUTION CORRECTION

In this section, we investigate a scheme to address data distribution mismatch in offline meta-RL with few-shot online adaptation. Inspired by the theoretical results discussed in Section 3, we aim to distinguish whether an adaptation episode is in the distribution of the given meta-training dataset.

Footnote: Some papers assume the initial state is sampled from a distribution $P_1$. Note that this is equivalent to assuming a fixed initial state $s_0$ by setting $P(s_0, a) = P_1$ for all $a \in A$; our state $s_1$ then corresponds to the initial state in their assumption. The transition from states in $S_{H-1}$ does not affect learning in the finite-horizon MDP $M$. For clarity, we denote $c^{\kappa_i}_{:0} = s_0$.



Figure 1: Illustration of data distribution mismatch between offline training and online adaptation.

Figure 3: (a) Illustration of data distribution mismatch between offline meta-training (blue) and online adaptation (green and red trajectories). (b) The agent fails to identify in-distribution trajectories via uncertainty estimation. Trajectories are colored with the corresponding normalized uncertainty estimates. (c) Adaptation performance of GCC, FOCAL, FOCAL with expert context, and selecting trajectories with uncertainty estimation.

There is a severe data distribution mismatch between offline meta-training and online adaptation, as the dataset is collected by task-dependent behavior policies. As a consequence, directly performing adaptation with the online collected trajectories leads to poor adaptation performance (see FOCAL in Figure 3(c)). Figure 3(b) shows that existing methods that estimate uncertainty with ensembles (Kidambi et al., 2020) cannot correctly detect OOD data, as the uncertainty estimates carry large errors. GCC fixes this problem by greedily selecting trajectories. As shown in Figure 3(a), at the end of the initial stage, GCC updates its belief only with the red trajectory, as it has the highest return. After the initial stage, GCC iteratively optimizes the posterior belief to obtain the final policy. As shown in Figure 3(c), GCC achieves comparable performance to FOCAL with expert context and significantly outperforms FOCAL with online adaptation. Visualizations of GCC's and FOCAL's adaptation trajectories after the initial stage are deferred to Appendix D. Further analysis and visualization of the uncertainty estimation method are deferred to Appendix C.
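The two-stage procedure described above can be sketched as a short skeleton (an illustrative sketch under our own assumptions, not the released code; `sample_z`, `run_episode`, `episode_return`, and `update_posterior` are hypothetical callbacks):

```python
def gcc_adaptation(sample_z, run_episode, episode_return, update_posterior,
                   n_initial=10, n_iters=10):
    """Two-stage GCC sketch. Initial stage: roll out diverse task hypotheses z
    and keep only the highest-return (likely in-distribution) episode to form
    the first posterior. Second stage: iteratively refine the posterior with
    the best episode found so far."""
    candidates = [run_episode(sample_z()) for _ in range(n_initial)]
    best = max(candidates, key=episode_return)  # greedy distribution correction
    posterior = update_posterior(None, best)
    for _ in range(n_iters):
        episode = run_episode(sample_z())
        if episode_return(episode) > episode_return(best):
            best = episode
        posterior = update_posterior(posterior, best)
    return posterior, best
```

Only the greedily selected episode ever enters the posterior update, which is what keeps out-of-distribution trajectories from corrupting the task belief.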

Figure 4: A concrete example with $v$ meta-RL tasks, one state, $v$ actions, $v$ behavior policies, horizon $H = 1$ per episode, and $v$ adaptation episodes, where $v \ge 3$.

HOW TO FILTER OUT OUT-OF-DISTRIBUTION EPISODES IN TRANSFORMED BAMDPS DURING META-TESTING?

Definition 5 (Sub-Datasets Collected by Single-Task Data Collection). In a transformed BAMDP $M^+$, an offline multi-task dataset $D^+$ is drawn from the task-dependent data collection $P_{M^+, [\mu]}$.

In an MDP $M$ with an offline dataset $D$, for any batch-constrained policy $\pi \in \Pi_D$, if
$$
\max_{\substack{s \in S, a \in A \\ \rho^{M_D}_\pi(s, a) > 0}} \left\| P_{M_D}(\cdot|s, a) - P_M(\cdot|s, a) \right\|_1 \le \epsilon_P
$$
and
$$
\max_{\substack{s \in S, a \in A, r \in R \\ \rho^{M_D}_\pi(s, a) > 0}} \left| R_{M_D}(r|s, a) - R_M(r|s, a) \right| \le \epsilon_R,
$$

Denote $\hat M = \langle S, A, R, H, P_M, \hat r \rangle$ and $\hat M_D = \langle S, A, R, H, P_{M_D}, \hat r \rangle$. Using the offline simulation lemma shown in Lemma 2, for any batch-constrained policy $\pi \in \Pi_D$,
$$
\rho^M_\pi(\hat s, \hat a) = J_{\hat M}(\pi) \quad \text{and} \quad \rho^{M_D}_\pi(\hat s, \hat a) = J_{\hat M_D}(\pi).
$$
Thus, $\epsilon_r = 0$, $r_{\max} = \frac{1}{H}$, and
$$
\left| \rho^M_\pi(\hat s, \hat a) - \rho^{M_D}_\pi(\hat s, \hat a) \right| \le \left| J_{\hat M}(\pi) - J_{\hat M_D}(\pi) \right|.
$$
Moreover, $\forall s \in S, a \in A, r \in R$,
$$
\left| \rho^M_\pi(s, a, r) - \rho^{M_D}_\pi(s, a, r) \right| = \left| \rho^M_\pi(s, a) R_M(r|s, a) - \rho^{M_D}_\pi(s, a) R_{M_D}(r|s, a) \right| \le \left| \rho^M_\pi(s, a) - \rho^{M_D}_\pi(s, a) \right| R_M(r|s, a) + \rho^{M_D}_\pi(s, a) \left| R_M(r|s, a) - R_{M_D}(r|s, a) \right| \le \left| \rho^M_\pi(s, a) - \rho^{M_D}_\pi(s, a) \right| + \left| R_M(r|s, a) - R_{M_D}(r|s, a) \right|.
$$
Therefore,
$$
\left\| \rho^M_\pi(s) - \rho^{M_D}_\pi(s) \right\|_1 + \left\| \rho^M_\pi(s, a, r) - \rho^{M_D}_\pi(s, a, r) \right\|_1 \le |S| \frac{H - 1}{2} \epsilon_P + |S| |A| |R| \left( \frac{H - 1}{2} \epsilon_P + \epsilon_R \right) = |S| (|A| |R| + 1) \frac{H - 1}{2} \epsilon_P + |S| |A| |R| \, \epsilon_R.
$$

A.6.2 DETAILED LEMMAS (PART II)

Lemma 7 (Simulation Lemma for Finite-Horizon MDPs). Given a pair of finite-horizon MDPs $M_1$ and $M_2$ with the same state space $S$, the same action space $A$, the same reward space $R$, and the same horizon $H$, if
$$
\max_{s \in S, a \in A} \left\| P_{M_1}(\cdot|s, a) - P_{M_2}(\cdot|s, a) \right\|_1 \le \bar\epsilon_P, \quad \max_{s \in S, a \in A} \left| r_{M_1}(s, a) - r_{M_2}(s, a) \right| \le \bar\epsilon_r, \quad \max_{s \in S, a \in A} \max\left( r_{M_1}(s, a), r_{M_2}(s, a) \right) \le r_{\max},
$$
where $r_{M_1}(s, a) = \mathbb{E}_{r \sim R_{M_1}(s, a)}[r]$ and $r_{M_2}(s, a) = \mathbb{E}_{r \sim R_{M_2}(s, a)}[r]$, we have

Figure 5: Visualization of GCC and uncertainty estimation method's trajectory selection. (a) GCC successfully selects the in-distribution trajectory. (b) Uncertainty estimation method cannot identify in-distribution data, as its uncertainty estimation is not accurate.

Figure 6: Visualization of GCC, FOCAL and uncertainty estimation method's adaptation in Episode 11-20 on the Point-Robot environment. (a) GCC successfully selects the in-distribution trajectory and keeps improving in adaptation. (b) FOCAL suffers from the out-of-distribution issue, and cannot correctly update posterior belief, leading to poor performance. (c) Uncertainty estimation method cannot identify in-distribution data, and also suffers from the out-of-distribution issue.

Algorithms' normalized scores averaged over 50 Meta-World ML1 task sets. Scores are normalized by expert-level policy return.

Performance on example tasks, a selection of Meta-World ML1 tasks with normalized scores. For Meta-World tasks, "-V2" is omitted for brevity.

is the transition function: $\forall s^+_t, s^+_{t+1} \in S^+$, $a_t \in A$, $r_t \in R$, where denoting $s^+_t = (s_t, b^{\kappa,\mu}_t)$,
$$
P^+\left( s^+_{t+1}, r_t \mid s^+_t, a_t \right) = \mathbb{E}_{(\kappa_i, \mu_i) \sim b^{\kappa,\mu}_t}\left[ P_{\kappa_i}(r_t, s_{t+1} \mid s_t, a_t) \right].
$$

$p^{\kappa_{\text{test}}}_{\pi^{+,T}}\left( \tau^+_H \mid b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i \right)$ is the probability of executing $\pi^{+,T}(\cdot \mid s_t, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i)$ in $\kappa_{\text{test}}$ to generate an $H$-horizon trajectory $\tau^+_H$ in an adaptation episode, i.e.,
$$
\prod_{t} \pi^{+,T}\left( a_t \mid s_t, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i \right) \cdot R_{\kappa_{\text{test}}}(r_t \mid s_t, a_t) \, P_{\kappa_{\text{test}}}(s_{t+1} \mid s_t, a_t, r_t).
$$

$p^{\kappa_{\text{test}}, D_{\kappa_{i^*}, \mu_{i^*}}}_{\text{out}}\left( \pi^{+,T}, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i \right)$ is the probability that executing $\pi^{+,T}(\cdot \mid s_t, b^{\kappa,\mu}_{t'}, \kappa_i, \mu_i)$ in $\kappa_{\text{test}}$ leaves the sub-dataset $D_{\kappa_{i^*}, \mu_{i^*}}$ within an episode (see Definition 8), where $D_{\kappa_{i^*}, \mu_{i^*}}$ is a sub-dataset collected in $D^+$.

in which the i.i.d. offline meta-training tasks sampled from $p(\kappa)$ in $D^+$ are denoted by $K_{\text{train}}$. From Lemma 5, as the offline dataset $D^+$ grows, $D_{\kappa_{i^*}, \mu_{i^*}}$ and $K_{\text{train}}$ grow monotonically; hence, for any batch-constrained policy $\pi$ in $D_{\kappa_{i^*}, \mu_{i^*}}$, i.e., $\forall \pi \in \Pi_{D_{\kappa_{i^*}, \mu_{i^*}}}$, when executing $\pi$ in an episode of $\kappa_{\text{test}} \sim p(\kappa)$, the probability of leaving the dataset $D_{\kappa_{i^*}, \mu_{i^*}}$

$|K_{\text{train}}|$ is the number of i.i.d. offline meta-training tasks sampled from $p(\kappa)$ in $D^+$.

$$
\left| J_{M_1}(\pi) - J_{M_2}(\pi) \right| \le H \bar\epsilon_r + \frac{H(H - 1)}{2} r_{\max} \bar\epsilon_P.
$$

Proof. Similarly to the famous simulation lemma for finite-horizon MDPs (Alekh Agarwal, 2017) and its offline variant shown in Lemma 2, we prove by backward induction that $\forall h \in [H]$, $\forall s \in S_h$, the value gap $\left| V^{M_1}_\pi(s) - V^{M_2}_\pi(s) \right|$ is bounded. When $h = H - 1$, we have $\forall s \in S_h$, $\left| V^{M_1}_\pi(s) - V^{M_2}_\pi(s) \right| \le \bar\epsilon_r$, and the claim follows for all $h \in [H - 1]$, $\forall s \in S_h$, by backward induction.

Lemma 8. Given a pair of finite-horizon MDPs $M_1$ and $M_2$ with the same state space, the same action space, the same reward space, and the same horizon, if
$$
\max_{s \in S, a \in A} \left\| P_{M_1}(\cdot|s, a) - P_{M_2}(\cdot|s, a) \right\|_1 \le \bar\epsilon_P \quad \text{and} \quad \max_{s \in S, a \in A, r \in R} \left| R_{M_1}(r|s, a) - R_{M_2}(r|s, a) \right| \le \bar\epsilon_R,
$$
then
$$
\left\| \rho^{M_1}_\pi(s) - \rho^{M_2}_\pi(s) \right\|_1 + \left\| \rho^{M_1}_\pi(s, a, r) - \rho^{M_2}_\pi(s, a, r) \right\|_1 \le |S| (|A| |R| + 1) \frac{H - 1}{2} \bar\epsilon_P + |S| |A| |R| \, \bar\epsilon_R.
$$

Proof. Similarly to Lemma 6, $\forall s \in S, a \in A, r \in R$, using the simulation lemma (Lemma 7), we obtain the stated bound.

Lemma 9. Let $X, Y$ be two i.i.d. random vectors taking values in $[0, 1]^n$, $n \in \mathbb{N}^+$. For any $\epsilon \in (0, 1]$, we have
$$
\Pr\left[ \max_{i \in [n]} |X_i - Y_i| \ge \epsilon \right] \le 1 - \left( \frac{\epsilon}{2} \right)^n.
$$


where $N = v$ is the number of adaptation episodes. Thus, we obtain the gap of policy evaluation of $\pi^+$ between offline meta-training and online adaptation.

A.3 MAIN RESULTS IN SECTION 3.2

Definition 3 (Transformed BAMDPs (formal version of Definition 2)). A transformed BAMDP is defined as a tuple $M^+ = \langle S^+, A, R, H^+, P^+, P^+_0, R^+ \rangle$, where $S^+ = S \times B$ is the hyper-state space, $B$ is the space of overall beliefs over meta-RL MDPs with behavior policies, $A$, $R$, $H^+$ are the same action space, reward space, and planning horizon as the original BAMDP, respectively, $P^+_0$ is the initial hyper-state distribution representing the joint distribution of tasks and behavior policies $p(\kappa, \mu) = p(\kappa) p(\mu|\kappa)$, and $P^+$, $R^+$ are the transition and reward functions. The goal of meta-RL agents is to find a meta-policy $\pi^+(a_t \mid s^+_t)$ that maximizes the online policy evaluation $J_{M^+}(\pi^+)$. Denote the reward and transition distribution of the task-dependent data collection in a transformed BAMDP $M^+$ by $P_{M^+, [\mu]}$, as defined in Eq. (40), and denote the offline multi-task dataset collected by the task-dependent data collection $P_{M^+, [\mu]}$ by $D^+$.

More specifically, a finite-horizon transformed BAMDP is defined by a tuple in which $B$ is the space of beliefs over meta-RL MDPs with behavior policies; the prior $b^{\kappa,\mu}_0$ is the distribution of meta-RL MDPs with behavior policies, and $\forall t \in [H - 1]$, $\forall c_{:(t+1)} \in C$, the belief $b^{\kappa,\mu}_{t+1}$ is the posterior over the meta-RL MDPs with behavior policies given the context $c_{:(t+1)}$; $A$, $R$, and $H$ are the same action space, reward space, and planning horizon as the finite-horizon BAMDP, respectively; and $P^+_0 \in \Delta(S^+)$ is the initial hyper-state distribution, i.e., a deterministic initial hyper-state.

B HYPER-PARAMETER SETTINGS

Environment Settings. Table 3 shows the hyper-parameter settings for the task sets used in our experiments. Most hyper-parameters are adopted from previous works (Li et al., 2020; Mitchell et al., 2021).
For all task sets, 80% of the tasks are meta-training tasks and the remaining 20% are meta-testing tasks.

GCC hyper-parameter settings. Table 4 shows GCC's hyper-parameter settings. Most hyper-parameters are adopted from FOCAL (Li et al., 2020). We set $n_e$ to 1, as the evaluation environments are all nearly deterministic.

E.1 UNCERTAINTY ESTIMATION METHOD ON REPRESENTATIVE TASKS

Table 5 shows the uncertainty estimation method's online adaptation performance on several representative tasks. The results demonstrate that the uncertainty estimation method achieves similar performance to FOCAL and underperforms GCC. This is because the uncertainty estimation method cannot correctly identify in-distribution trajectories, as discussed in Section 5.1 and Appendices C and D. Tables 7 and 8 show baselines' online adaptation and offline performance, respectively, on all 50 Meta-World ML1 task sets. GCC significantly outperforms baselines with online adaptation and achieves better or comparable performance to baselines with offline adaptation.

F ABLATION STUDY

In this section, we conduct various ablation studies to investigate the robustness of GCC to dataset quality and hyper-parameters.

Initial stage length. Table 9 shows GCC's performance with different initial stage lengths. The total number of adaptation episodes is 20. We find that GCC performs well with initial stages of 10-15 episodes, i.e., 50%-75% of the total number of adaptation episodes. A small initial stage length (5) may lead to an unreliable task belief and cause a degradation in performance. A 19-episode initial stage leaves no room for the iterative optimization process, and the task belief updates will not converge.

Number of latent task variables sampled in the initial phase. $n_z$ controls the number of diverse samples used to produce the task embedding candidates $z^i_t$. As shown in Table 10, GCC is robust to changes of $n_z$ and works in a large range from 5 to 20.

Dataset quality. We test GCC and baselines on several "medium" datasets, which are collected by periodically evaluating policies during SAC training. As shown in Table 11, GCC still significantly outperforms the baseline algorithms on medium datasets, which demonstrates GCC's ability to learn from datasets of varying quality. We do not test BOReL, as it already fails on the easier expert-level datasets.

Peg-Insert-Side: 0.30 ± 0.04, 0.08 ± 0.03, 0.00 ± 0.00
Peg-Insert-Side-Medium: 0.30 ± 0.14, 0.10 ± 0.07, 0.00 ± 0.00
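For reference, the adaptation-time settings discussed in this section can be collected into a single configuration sketch (the key names are our own, not the codebase's; the values mirror the settings reported in Tables 4, 9, and 10):

```python
# Hypothetical GCC adaptation configuration; field names are illustrative,
# but the values follow the ablations discussed in this section.
gcc_adaptation_config = {
    "num_adaptation_episodes": 20,  # total online adaptation budget
    "initial_stage_length": 10,     # 50%-75% of the budget works well
    "n_z": 10,                      # latent task samples; robust in [5, 20]
    "n_e": 1,                       # evaluation rollouts; 1 suffices for
                                    # nearly deterministic environments
}
```

A shorter initial stage trades belief reliability for more iterative-refinement episodes, which is the tension the Table 9 ablation quantifies.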

