FOCAL: EFFICIENT FULLY-OFFLINE META-REINFORCEMENT LEARNING VIA DISTANCE METRIC LEARNING AND BEHAVIOR REGULARIZATION

Abstract

We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interaction with the environment, making RL truly practical in many real-world applications. This problem is still not fully understood, and two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors of out-of-distribution state-actions, which lead to divergence of value functions. Second, meta-RL requires efficient and robust task inference learned jointly with the control policy. In this work, we enforce behavior regularization on the learned policy as a general approach to offline RL, combined with a deterministic context encoder for efficient task inference. We propose a novel negative-power distance metric on a bounded context embedding space, whose gradient propagation is detached from the Bellman backup. We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches involving meta-RL and distance metric learning. To the best of our knowledge, our method is the first model-free and end-to-end OMRL algorithm, which is computationally efficient and demonstrated to outperform prior algorithms on several meta-RL benchmarks.

1. INTRODUCTION

Applications of reinforcement learning (RL) in real-world problems have been proven successful in many domains such as games (Silver et al., 2017; Vinyals et al., 2019; Ye et al., 2020) and robot control (Johannink et al., 2019). However, the implementations so far usually rely on interactions with either real or simulated environments. In other areas like healthcare (Gottesman et al., 2019), autonomous driving (Shalev-Shwartz et al., 2016) and controlled-environment agriculture (Binas et al., 2019), where RL shows promise conceptually or in theory, exploration in real environments is evidently risky, and building a high-fidelity simulator can be costly. Therefore a key step towards more practical RL algorithms is the ability to learn from static data. Such a paradigm, termed "offline RL" or "batch RL", would enable better generalization by incorporating diverse prior experience. Moreover, by leveraging and reusing previously collected data, off-policy algorithms such as SAC (Haarnoja et al., 2018) have been shown to achieve far better sample efficiency than on-policy methods. The same applies to offline RL algorithms, since they are by nature off-policy. The aforementioned design principles motivated a surge of recent work on offline/batch RL (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020). These papers propose remedies by regularizing the learner to stay close to the logged transitions of the training datasets, namely the behavior policy, in order to mitigate the effect of bootstrapping error (Kumar et al., 2019), where evaluation errors of out-of-distribution state-action pairs are never corrected and hence easily diverge due to the inability to collect new data samples for feedback. There exist claims that offline RL can be implemented successfully without explicit correction for distribution mismatch, given sufficiently large and diverse training data (Agarwal et al., 2020).
However, we find this assumption unrealistic in many practical settings, including our experiments. In this paper, to tackle the out-of-distribution problem in offline RL in general, we adopt the proposal of behavior regularization by Wu et al. (2019). For practical RL, besides the ability to learn without exploration, it is also desirable to have an algorithm that can generalize to various scenarios. To solve real-world challenges in the multi-task setting, such as treating different diseases, driving under various road conditions or growing diverse crops in autonomous greenhouses, a robust agent is expected to quickly transfer and adapt to unseen tasks, especially when the tasks share common structures. Meta-learning methods (Vilalta & Drissi, 2002; Thrun & Pratt, 2012) address this problem by learning an inductive bias from experience collected across a distribution of tasks, which can be naturally extended to the context of reinforcement learning. Under the umbrella of this so-called meta-RL, almost all current methods require on-policy data either during both the meta-training and testing phases (Wang et al., 2016; Duan et al., 2016; Finn et al., 2017) or at least at the testing stage (Rakelly et al., 2019) for adaptation. An efficient and robust method which incorporates both fully-offline learning and meta-learning in RL has, despite a few attempts (Li et al., 2019b; Dorfman & Tamar, 2020), not been fully developed and validated. In this paper, under the first principle of maximizing the practicality of RL algorithms, we propose an efficient method that integrates task inference with RL algorithms in a fully-offline fashion. Our fully-offline context-based actor-critic meta-RL algorithm, FOCAL, achieves excellent sample efficiency and fast adaptation with limited logged experience on a range of deterministic continuous-control meta-environments.
The primary contribution of this work is designing the first end-to-end and model-free offline meta-RL algorithm which is computationally efficient and effective without any prior knowledge of task identity or reward/dynamics. To achieve efficient task inference, we propose an inverse-power loss for effective learning and clustering of task latent variables, in analogy to the Coulomb potential in electromagnetism, which is unseen in previous work. We also shed light on the specific design choices customized for the OMRL problem through theoretical and empirical analyses.

2. RELATED WORK

Meta-RL Our work FOCAL builds upon the meta-learning framework in the context of reinforcement learning. Among all paradigms of meta-RL, this paper is most related to the context-based and metric-based approaches. Context-based meta-RL employs models with memory such as recurrent (Duan et al., 2016; Wang et al., 2016; Fakoor et al., 2019), recursive (Mishra et al., 2017) or probabilistic (Rakelly et al., 2019) structures to achieve fast adaptation by aggregating experience into a latent representation on which the policy is conditioned. The design of the context usually leverages the temporal or Markov properties of RL problems. Metric-based meta-RL focuses on learning effective task representations to facilitate task inference and conditioned control policies, by employing techniques such as distance metric learning (Yang & Jin, 2006). Koch et al. (2015) proposed the first metric-based meta-algorithm for few-shot learning, in which a Siamese network (Chopra et al., 2005) is trained with a triplet loss to compare the similarity between a query and supports in the embedding space. Many metric-based meta-RL algorithms extend these works (Snell et al., 2017; Sung et al., 2018; Li et al., 2019a). Among all aforementioned meta-learning approaches, this paper is most related to the context-based PEARL algorithm (Rakelly et al., 2019) and metric-based prototypical networks (Snell et al., 2017). PEARL achieves SOTA performance for off-policy meta-RL by introducing a probabilistic permutation-invariant context encoder, along with a design which disentangles task inference and control via different sampling strategies. However, it requires exploration during meta-testing. The prototypical networks employ a similar design of context encoder as well as a Euclidean distance metric on a deterministic embedding space, but tackle meta-learning of classification tasks with a squared-distance loss, as opposed to the inverse-power loss in FOCAL for the more complex OMRL problem.
Offline RL Our work adopts behavior regularization, one of several recent approaches to constraining bootstrapping error in offline RL (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019). It incorporates a divergence function between distributions over state-actions in the actor-critic objectives. As with SAC (Haarnoja et al., 2018), one limitation of the algorithm is its sensitivity to reward scale and regularization strength. In our experiments, we indeed observed a wide spread of optimal hyper-parameters across different meta-RL environments, shown in Table 4. Offline Meta-RL To the best of our knowledge, despite attracting more and more attention, the offline meta-RL problem is still understudied. We are aware of a few papers that tackle the same problem from different angles (Li et al., 2019b; Dorfman & Tamar, 2020). Li et al. (2019b) focus on a specific scenario where biased datasets make the task inference module prone to overfitting the state-action distributions, ignoring the reward/dynamics information. This so-called MDP ambiguity problem occurs when datasets of different tasks do not have significant overlap in their state-action visitation frequencies, and is exacerbated by sparse rewards. Their method, MBML, requires training offline BCQ (Fujimoto et al., 2019) and reward/dynamics models for each task, which is computationally demanding, whereas our method is end-to-end and model-free. Dorfman & Tamar (2020), on the other hand, formulate OMRL as a Bayesian RL (Ghavamzadeh et al., 2016) problem and employ a probabilistic approach for Bayes-optimal exploration. We therefore consider their methodology tangential to ours.

3.1. NOTATIONS AND PROBLEM STATEMENT

We consider fully-observed Markov Decision Processes (MDPs) (Puterman, 2014) in deterministic environments such as MuJoCo (Todorov et al., 2012). An MDP can be modeled as $M = (\mathcal{S}, \mathcal{A}, P, R, \rho_0, \gamma)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $P(s'|s,a)$, bounded reward function $R(s,a)$, initial state distribution $\rho_0(s)$ and discount factor $\gamma \in (0,1)$. The goal is to find a policy $\pi(a|s)$ that maximizes the cumulative discounted reward starting from any state. We introduce the notion of the multi-step state marginal of policy $\pi$, $\mu_\pi^t(s)$, which denotes the distribution over the state space after rolling out $\pi$ for $t$ steps starting from state $s$. The notation $R^\pi(s)$ denotes the expected reward at state $s$ when following policy $\pi$: $R^\pi(s) = \mathbb{E}_{a \sim \pi}[R(s,a)]$. The state-value function (a.k.a. value function) and action-value function (a.k.a. Q-function) are therefore

$$V^\pi(s) = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_{s_t \sim \mu_\pi^t(s)}\left[R^\pi(s_t)\right], \qquad Q^\pi(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(s'|s,a)}\left[V^\pi(s')\right]$$

Q-learning algorithms are implemented by iterating the Bellman optimality operator $\mathcal{B}$, defined as

$$(\mathcal{B}Q)(s,a) := R(s,a) + \gamma\, \mathbb{E}_{P(s'|s,a)}\left[\max_{a'} Q(s',a')\right]$$

When the state space is large or continuous, $Q$ is used as a hypothesis from a set of function approximators (e.g., neural networks). In the offline context of this work, given a distribution of tasks $p(\mathcal{T})$ where every task is an MDP, we study off-policy meta-learning from collections of static datasets of transitions $\mathcal{D}_i = \{(s_{i,t}, a_{i,t}, s'_{i,t}, r_{i,t}) \mid t = 1, ..., N\}$ generated by a set of behavior policies $\{\beta_i(a|s)\}$ associated with each task index $i$. A key underlying assumption of meta-learning is that the tasks share some common structure. By the definition of MDP, in this paper we restrict our attention to tasks that share state and action spaces but differ in transition and reward functions. We define the meta-optimization objective as $L(\theta) = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}[L_{\mathcal{T}_i}(\theta)]$, where $L_{\mathcal{T}_i}(\theta)$ is the objective evaluated on transition samples drawn from task $\mathcal{T}_i$.
A common choice of $p(\mathcal{T})$ is the uniform distribution over the set of given tasks $\{\mathcal{T}_i \mid i = 1, ..., n\}$. In this case, meta-training amounts to minimizing the average loss across all training tasks:

$$\theta_{meta} = \arg\min_\theta \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\left[L_k(\theta)\right]$$
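As a concrete illustration of the Bellman optimality operator $\mathcal{B}$ defined above, the following NumPy sketch runs value iteration on a toy two-state MDP; the MDP and its numbers are made-up for illustration, not from the paper.

```python
import numpy as np

def bellman_backup(Q, R, P, gamma=0.9):
    """One application of the Bellman optimality operator B on a tabular Q.

    Q: (|S|, |A|) array, R: (|S|, |A|) rewards, P: (|S|, |A|, |S|) transitions.
    (B Q)(s, a) = R(s, a) + gamma * E_{s'~P(.|s,a)}[max_a' Q(s', a')].
    """
    return R + gamma * P @ Q.max(axis=1)

# Toy 2-state, 2-action deterministic MDP (hypothetical numbers).
R = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.zeros((2, 2, 2))
P[0, 0, 1] = P[0, 1, 0] = P[1, 0, 0] = P[1, 1, 1] = 1.0  # deterministic moves

Q = np.zeros((2, 2))
for _ in range(200):          # value iteration: repeated Bellman backups
    Q = bellman_backup(Q, R, P)

# With gamma = 0.9, collecting reward 1 forever yields 1 / (1 - 0.9) = 10.
assert np.allclose(Q.max(axis=1), 10.0, atol=1e-6)
```

The backup contracts toward the unique fixed point $Q^*$, which is why the loop converges regardless of the initialization of `Q`.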

3.2. BEHAVIOR REGULARIZED ACTOR CRITIC (BRAC)

Similar to SAC, to constrain the bootstrapping error in offline RL, for each individual task $T_i$, behavior regularization (Wu et al., 2019) introduces a divergence measure between the learner $\pi_\theta$ and the behavior policy $\pi_b$ in the value and target Q-functions. For simplicity, we omit the task index in this section:

$$V_D^\pi(s) = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_{s_t \sim \mu_\pi^t(s)}\left[R^\pi(s_t) - \alpha D\big(\pi_\theta(\cdot|s_t), \pi_b(\cdot|s_t)\big)\right] \tag{6}$$

$$\bar{Q}_D^\psi(s, a) = \bar{Q}^\psi(s, a) - \gamma\alpha\, \hat{D}\big(\pi_\theta(\cdot|s), \pi_b(\cdot|s)\big) \tag{7}$$

where $\bar{Q}$ denotes a target Q-function without gradients and $\hat{D}$ denotes a sample-based estimate of the divergence function $D$. In the actor-critic framework, the loss functions of Q-value and policy learning are given by, respectively,

$$L_{critic} = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D},\, a' \sim \pi_\theta(\cdot|s')}\left[\left(r + \gamma \bar{Q}_D^\psi(s', a') - Q_\psi(s, a)\right)^2\right] \tag{8}$$

$$L_{actor} = -\mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\mathbb{E}_{a'' \sim \pi_\theta(\cdot|s)}\left[Q_\psi(s, a'')\right] - \alpha \hat{D}\right] \tag{9}$$

3.3. CONTEXT-BASED META-RL

Context-based meta-RL algorithms aggregate context information, typically in the form of task-specific transitions, into a latent space $Z$. This can be viewed as a special form of RL on a partially-observed MDP (Kaelbling et al., 1998) in which a latent representation $z$, the unobserved part of the state, needs to be inferred. Once given complete information of $z$ and $s$ combined as the full state, the learning of the universal policy $\pi_\theta(s, z)$ and value function $V^\pi(s, z)$ (Schaul et al., 2015) becomes RL on a regular MDP, and properties of regular RL such as the existence of optimal policy and value functions hold naturally. We therefore formulate the context-based meta-RL problem as solving a task-augmented MDP (TA-MDP). The formal definitions are provided in Appendix B.
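As a hypothetical sketch of the behavior-regularized critic loss above, the snippet below estimates the squared Bellman error with a closed-form Gaussian KL as the divergence $D$; function names and constants are illustrative, and a real implementation would use automatic differentiation (e.g. PyTorch) rather than NumPy.

```python
import numpy as np

def kl_gaussian(mu_p, std_p, mu_q, std_q):
    """Closed-form KL(p || q) between diagonal Gaussians (one choice of D)."""
    return (np.log(std_q / std_p)
            + (std_p**2 + (mu_p - mu_q)**2) / (2 * std_q**2) - 0.5).sum(-1)

def critic_loss(Q, Q_target, batch, policy, behavior, alpha=0.1, gamma=0.99):
    """Behavior-regularized squared Bellman error, following Sec. 3.2.

    Q, Q_target: callables (s, a) -> value; policy/behavior map s -> (mu, std).
    The divergence penalty is subtracted from the target Q before the backup.
    """
    s, a, r, s2 = batch
    a2 = policy(s2)[0]                              # mean action a' ~ pi_theta
    D = kl_gaussian(*policy(s2), *behavior(s2))     # sample-based D estimate
    qd = Q_target(s2, a2) - gamma * alpha * D       # penalized target Q
    return np.mean((r + gamma * qd - Q(s, a)) ** 2)

# Toy usage: identical policies (D = 0), zero Q-networks, unit rewards.
rng = np.random.default_rng(0)
s = rng.normal(size=(4, 3)); a = rng.normal(size=(4, 2))
r = np.ones(4); s2 = rng.normal(size=(4, 3))
zero_Q = lambda s, a: np.zeros(len(s))
gauss = lambda s: (np.zeros((len(s), 2)), np.ones((len(s), 2)))
loss = critic_loss(zero_Q, zero_Q, (s, a, r, s2), gauss, gauss)
assert np.isclose(loss, 1.0)   # D = 0, so targets = r = 1 while Q = 0
```

When the learner drifts from the behavior policy, `D` grows and the target Q-values shrink, which is exactly the pessimism that keeps the bootstrap from exploiting out-of-distribution actions.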

4. METHOD

Based on our formulation of context-based meta-RL problem, FOCAL first learns an effective representation of meta-training tasks on latent space Z, then solves the offline RL problem on TA-MDP with behavior regularized actor critic method. We illustrate our training procedure in Figure 1 and describe the detailed algorithm in Appendix A. We assume that pre-collected datasets are available for both training and testing phases, making our algorithm fully offline. Our method consists of three key design choices: deterministic context encoder, distance metric learning on latent space as well as decoupled training of task inference and control.

4.1. DETERMINISTIC CONTEXT ENCODER

Similar to Rakelly et al. (2019), we introduce an inference network $q_\phi(z|c)$, parameterized by $\phi$, to infer task identity from context $c \sim C$. In terms of context encoder design, recent meta-RL methods either employ recurrent neural networks (Duan et al., 2016; Wang et al., 2016) to capture temporal correlations, or use probabilistic models (Rakelly et al., 2019) for uncertainty estimation. These design choices are proven effective in on-policy and partially-offline off-policy algorithms. However, since our approach aims to address the fully-offline meta-RL problem, we argue that a deterministic context encoder works better in this scenario, given a few assumptions. First, we consider only deterministic MDPs in this paper, where the transition function is a Dirac delta distribution. We assume that all meta-learning tasks in this paper are deterministic MDPs, which is satisfied by common RL benchmarks such as MuJoCo. The formal definitions are detailed in Appendix B. Second, we assume all tasks share the same state and action space, while each is characterized by a unique combination of transition and reward functions. Mathematically, this means there exists an injective function $f: \mathcal{T} \to \mathcal{P} \times \mathcal{R}$, where $\mathcal{P}$ and $\mathcal{R}$ are functional spaces of transition probabilities $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \{0, 1\}$ and bounded rewards $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, respectively.

Figure 1: Meta-training procedure. The inference network $q_\phi$ uses context data $c$ to compute the latent context variable $z$, which conditions the actor and critic, and is optimized by the distance metric learning (DML) objective. The learning of the context encoder ($L_{dml}$) and the control policy ($L_{actor}$, $L_{critic}$) is decoupled in terms of gradients.

A stronger condition of this injective property is that for any state-action pair $(s, a)$, the corresponding transition and reward are point-wise unique across all tasks, which brings the following assumption: Assumption 1 (Task-Transition Correspondence).
We consider meta-RL with a task distribution $p(\mathcal{T})$ to satisfy task-transition correspondence if and only if $\forall\, T_1, T_2 \sim p(\mathcal{T})$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$:

$$P_1(\cdot|s,a) = P_2(\cdot|s,a),\ R_1(s,a) = R_2(s,a) \iff T_1 = T_2 \tag{10}$$

Under the deterministic MDP assumption, the transition probability function $P(\cdot|s,a)$ is associated with the transition map $t: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ (Definition B.3). The task-transition correspondence suggests that, given a state-action pair $(s, a)$ and task $T$, there exists a unique transition-reward pair $(s', r)$. Based on these assumptions, one can define a task-specific map $f_T: \mathcal{S} \times \mathcal{A} \to \mathcal{S} \times \mathbb{R}$ on the set of transitions $\mathcal{D}_T$:

$$f_T(s_t, a_t) = (s'_t, r_t), \quad \forall\, T \sim p(\mathcal{T}),\ (s_t, a_t, s'_t, r_t) \in \mathcal{D}_T \tag{11}$$

Recall that all tasks defined in this paper share the same state-action space, hence $\{f_T \mid T \sim p(\mathcal{T})\}$ forms a function family defined on the transition space $\mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathbb{R}$, which is also by definition the context space $C$. This lends a new interpretation that, as a task inference module, the context encoder $q_\phi(z|c)$ enforces an embedding of the task-specific map $f_T$ in the latent space $Z$, i.e. $q_\phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathbb{R} \to Z$. Following Assumption 1, every transition $(s_i, a_i, s'_i, r_i)$ corresponds to a unique task $T_i$, which means that, in principle, task identity can be inferred from any single transition tuple. This implies the context encoder should be permutation-invariant and deterministic, since the embedding of the context neither depends on the order of the transitions nor involves any uncertainty. This observation is crucial since it provides a theoretical basis for few-shot learning (Snell et al., 2017; Sung et al., 2018) in our setting. In particular, when learning in a fully-offline fashion, a meta-RL algorithm cannot adapt by exploration at test time. The theoretical guarantee that a few randomly-chosen transitions enable effective task inference ensures that FOCAL is feasible and efficient.
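A minimal sketch of such a permutation-invariant, deterministic encoder: a shared embedding per transition followed by mean pooling onto a normalized (bounded) latent space. The names `encode_context`, `W`, `b` are hypothetical stand-ins for the paper's MLP $q_\phi$.

```python
import numpy as np

def encode_context(transitions, W, b):
    """Deterministic, permutation-invariant context encoder sketch.

    Each transition (s, a, s', r) is embedded by a shared linear map and the
    per-transition embeddings are mean-pooled, so the output is independent
    of transition order and involves no sampling.
    """
    x = np.concatenate(transitions, axis=-1)   # (N, dim_s + dim_a + dim_s + 1)
    z = np.tanh(x @ W + b)                     # per-transition embedding
    z = z.mean(axis=0)                         # permutation-invariant pooling
    return z / np.linalg.norm(z)               # project onto bounded space

rng = np.random.default_rng(1)
N, ds, da, dz = 8, 3, 2, 5
batch = (rng.normal(size=(N, ds)), rng.normal(size=(N, da)),
         rng.normal(size=(N, ds)), rng.normal(size=(N, 1)))
W, b = rng.normal(size=(2 * ds + da + 1, dz)), np.zeros(dz)

z1 = encode_context(batch, W, b)
perm = rng.permutation(N)
z2 = encode_context(tuple(t[perm] for t in batch), W, b)
assert np.allclose(z1, z2)   # shuffling the transitions changes nothing
```

Because pooling is a mean, the same encoder accepts any context size N, which is what makes few-shot task inference from a handful of random transitions possible.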

4.2. DISTANCE METRIC LEARNING (DML) OF LATENT VARIABLES

In light of our analysis of the context encoder design, the goal of task inference is to learn a robust and effective representation of context for better discrimination of task identities. Unlike PEARL, which requires Bellman gradients to train the inference network, our insight is to disentangle the learning of the context encoder from the learning of the control policy. As explained in the previous reasoning about the deterministic encoder, the latent variable is a representation of task properties involving only dynamics and rewards, which in principle should be completely captured by the transition datasets. Given continuous neural networks as function approximators, the learned value functions conditioned on the latent variable $z$ cannot distinguish between tasks if the corresponding embedding vectors are too close (Appendix C). Therefore, for implementation, we formulate the latent variable learning problem as obtaining an embedding $q_\phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathbb{R} \to Z$ of transition data $\mathcal{D}_i = \{(s_{i,t}, a_{i,t}, s'_{i,t}, r_{i,t}) \mid t = 1, ..., N\}$ that clusters similar data (same task) while pushing away dissimilar samples (different tasks) in the embedding space $Z$, which is essentially distance metric learning (DML) (Sohn, 2016). A common loss function in DML is the contrastive loss (Chopra et al., 2005; Hadsell et al., 2006). Given input data $x_i, x_j \in \mathcal{X}$ with labels $y \in \{1, ..., L\}$, it is written as

$$L_{cont}^m(x_i, x_j; q) = \mathbf{1}\{y_i = y_j\}\, \|q_i - q_j\|_2^2 + \mathbf{1}\{y_i \neq y_j\}\, \max\left(0,\ m - \|q_i - q_j\|_2\right)^2 \tag{12}$$

where $m$ is a constant margin parameter and $q_i = q_\phi(x_i)$ is the embedding vector of $x_i$. For data points of different tasks/labels, the contrastive loss rewards the distance between their embedding vectors by the $L_2$ norm, which is weak when the distance is small, as is the case when $z$ is normalized and $q_\phi$ is randomly initialized.
Empirically, we observe that objectives with positive powers of distance lead to degenerate representations of tasks, forming clusters that contain embedding vectors of multiple tasks (Figure 2a). Theoretically, this is due to the fact that an accumulated $L_2$ loss of distance between data points is proportional to the dataset variance, which may be attained by degenerate distributions such as a Bernoulli distribution. This is proven in Appendix B. To build a robust and efficient task inference module, we conjecture that it is crucial to ensure every task embedding cluster is separated from the others. We therefore introduce a negative-power variant of the contrastive loss:

$$L_{dml}(x_i, x_j; q) = \mathbf{1}\{y_i = y_j\}\, \|q_i - q_j\|_2^2 + \mathbf{1}\{y_i \neq y_j\}\, \beta \cdot \frac{1}{\|q_i - q_j\|_2^n + \epsilon} \tag{13}$$

where $\epsilon > 0$ is a small hyperparameter added to avoid division by zero, and the power $n$ can be any non-negative number. Note that when $n = 2$, Eqn 13 takes a form analogous to the Cauchy graph embedding introduced by Luo et al. (2011), which was proven to better preserve local topology and similarity relationships compared to Laplacian embeddings. We experimented with $n = 1$ (inverse) and $n = 2$ (inverse-square) in this paper and compare with the classical $L_1$, $L_2$ metrics in Figure 2 and §5.2.1.
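The two losses can be compared directly in a small NumPy sketch (function names and constants are illustrative). Note how the inverse-power term keeps a large repulsive signal when two embeddings of different tasks are nearly coincident, exactly the regime where the hinge term of the contrastive loss is weak:

```python
import numpy as np

def contrastive_loss(qi, qj, same, m=1.0):
    """Classical contrastive loss: L2 pull for same task, hinge push otherwise."""
    d = np.linalg.norm(qi - qj)
    return d**2 if same else max(0.0, m - d)**2

def dml_loss(qi, qj, same, beta=1.0, n=2, eps=1e-3):
    """Negative-power variant: the inverse-power term diverges as d -> 0,
    so near-coincident embeddings of different tasks are pushed apart hard."""
    d = np.linalg.norm(qi - qj)
    return d**2 if same else beta / (d**n + eps)

# Two almost-identical embeddings that belong to *different* tasks:
qi, qj = np.array([0.10, 0.0]), np.array([0.11, 0.0])
assert contrastive_loss(qi, qj, same=False) < 1.0    # bounded hinge penalty
assert dml_loss(qi, qj, same=False) > 100.0          # strong repulsive penalty
```

In gradient terms, the hinge penalty is bounded by the margin $m$, whereas the inverse-power penalty grows without bound as the pair collapses, which is the "repulsive force" intuition of §5.2.1.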

5. EXPERIMENTS

In our experiments, we assess the performance of FOCAL by comparing it with several baseline algorithms on meta-RL benchmarks, for which return curves are averaged over 3 random seeds. Specific design choices are examined through 3 ablations and supplementary experiments are provided in Appendix E.

5.1. SAMPLE EFFICIENCY AND ASYMPTOTIC PERFORMANCE

We evaluate FOCAL on 6 continuous control meta-environments of robotic locomotion, 4 of which are simulated via the MuJoCo simulator (Todorov et al., 2012). For OMRL, there are two natural baselines. The first is obtained by naively modifying PEARL to train and test from logged data without exploration, which we term Batch PEARL. The second is Contextual BCQ, which incorporates the latent variable $z$ in the state and performs a task-augmented variant of the offline BCQ algorithm (Fujimoto et al., 2019). Like PEARL, its task inference module is trained using Bellman gradients. Lastly, we include a comparison with the MBML algorithm proposed by Li et al. (2019b). Although, as discussed earlier, MBML is a model-based, two-stage method as opposed to our model-free and end-to-end approach, we consider it by far the most competitive and related OMRL algorithm to FOCAL, due to the lack of other OMRL methods. As shown in Figure 3, we observe that FOCAL outperforms other offline meta-RL methods across almost all domains. In Figure 4b, we also compare FOCAL to other algorithm variants, including a more competitive variant of Batch PEARL obtained by applying the same behavior regularization. In both trials, FOCAL with our proposed design achieves the best overall sample efficiency and asymptotic performance. We started experiments with expert-level datasets. However, for some tasks such as Ant and Walker, we observed that a diverse training set results in a better meta policy (Table 2). We conjecture that mixed datasets, despite containing sub-optimal actions, provide a broader support for the state-action distributions, making it easier for the context encoder to learn the correct correlation between task identity and transition tuples (i.e., transition/reward functions). When using expert trajectories, there might be little overlap between state-action distributions across tasks (Figure 8), which may cause the agent to overfit to spurious correlations. This is the exact problem Li et al.
(2019b) aim to address, termed MDP ambiguity. Such overfitting to state-action distributions leads to suboptimal latent representations and poor robustness to distribution shift (Table 6), which can be interpreted as a special form of the memorization problem in classical meta-learning (Yin et al., 2019). The MDP ambiguity problem is addressed in an extension of FOCAL (Li et al., 2021).

5.2. ABLATIONS

Based on our previous analysis, we examine and validate three key design choices of FOCAL by the following ablations. The main results are illustrated in Figure 4 and 5. 

5.2.1. POWER LAW OF DISTANCE METRIC LOSS

To show the effectiveness of our proposed negative-power distance metrics for the OMRL problem, we tested context embedding losses with different powers of distance, from $L^{-2}$ to $L^2$. A t-SNE (Van der Maaten & Hinton, 2008) visualization of the high-dimensional embedding space in Figure 2a demonstrates that distance metric losses with negative powers are more effective in separating embedding vectors of different tasks, whereas positive powers exhibit degenerate behaviors, leading to less robust and effective conditioned policies. By a physical analogy, the inverse-power losses provide "repulsive forces" that drive apart all data points, regardless of the initial distribution. In electromagnetism, consider the latent space as a 3D metal cube and the embedding vectors as positions of charges of the same polarity. By Gauss's law, at the equilibrium state, all charges are distributed on the surface of the cube with densities positively related to the local curvature of the surface. Indeed, we observe from the "Inverse-square" and "Inverse" trials that almost all vectors are located near the edges of the latent space, with higher concentration around the vertices, which have the highest local curvatures (Figure 7). To evaluate the effectiveness of different powers of the DML loss, we define a metric called the effective separation rate (ESR), which computes the percentage of embedding vector pairs of different tasks whose squared distance in latent space $Z$ is larger than the expected squared distance between randomly distributed vector pairs, i.e., $2l/3$ on $(-1, 1)^l$. Table 1 demonstrates that DML losses of negative power are more effective in maintaining distance between embeddings of different tasks, while no significant distinction is shown in terms of RMS distance. This is aligned with our insight that RMS, or effectively the classical $L_2$ objective, can be optimized by degenerate distributions (Lemma B.1), which is the core challenge addressed by our proposed inverse-power loss.
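A sketch of how ESR might be computed, assuming the $2l/3$ threshold is the expected squared $L_2$ distance between points drawn uniformly from $(-1,1)^l$ (per dimension, $\mathbb{E}[(u-v)^2] = 2\,\mathrm{Var}(U) = 2/3$); the toy embeddings and labels below are illustrative.

```python
import numpy as np

def effective_separation_rate(Z, labels):
    """Fraction of inter-task embedding pairs whose squared L2 distance
    exceeds 2l/3, the expected squared distance of random pairs on (-1,1)^l."""
    l = Z.shape[1]
    thresh = 2.0 * l / 3.0
    hits, total = 0, 0
    for i in range(len(Z)):
        for j in range(i + 1, len(Z)):
            if labels[i] != labels[j]:          # only pairs of different tasks
                total += 1
                hits += np.sum((Z[i] - Z[j]) ** 2) > thresh
    return hits / total

# Two well-separated task clusters in a 2-D latent space (toy data).
Z = np.array([[-0.9, -0.9], [-0.8, -0.9], [0.9, 0.9], [0.8, 0.9]])
labels = [0, 0, 1, 1]
assert effective_separation_rate(Z, labels) == 1.0
```

An ESR near 1 means essentially all inter-task pairs are further apart than chance would place them, i.e. the embedding clusters are genuinely separated rather than overlapping.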

5.2.2. DETERMINISTIC VS. PROBABILISTIC CONTEXT ENCODER

Despite the abundant success of probabilistic/variational inference models in previous work (Kingma & Welling, 2013; Alemi et al., 2016; Rakelly et al., 2019), by comparing FOCAL with deterministic and probabilistic context encoders in Figure 4b, we observe experimentally that the former performs significantly better on tasks that differ in either reward or transition dynamics in the fully-offline setting. Intuitively, by our design principles, this is because: 1. Offline meta-RL does not require exploration; moreover, when Assumption 1 is satisfied, there is no need for reasoning about uncertainty during adaptation. 2. The deterministic context encoder in FOCAL is trained with a carefully designed metric-based learning objective, detached from the Bellman update, which provides better efficiency and stability for meta-learning. Moreover, the advantage of our encoder design motivated by Assumption 1 is also reflected in Figure 4a, as our proposed method is the only variant that achieves effective clustering of task embeddings. The connection between context embeddings and RL performance is elaborated in Appendix C.

5.2.3. CONTEXT ENCODER TRAINING STRATEGIES

The last design choice of FOCAL is the decoupled training of the context encoder and the control policy illustrated in Figure 1. To show the necessity of this design, in Figure 4 we compare our proposed FOCAL with a variant that allows backpropagation of the Bellman gradients to the context encoder. Figure 5a shows that our proposed strategy achieves effective clustering of task context and therefore a better control policy, whereas training with Bellman gradients does not. As a consequence, the corresponding performance gap is evident in Figure 5b. We conjecture that on complex tasks where behavior regularization is necessary to ensure convergence, without careful tuning of hyperparameters, the Bellman gradients often dominate over the contribution of the distance metric loss. Eventually, the context embedding collapses and fails to learn effective representations. However, we also observed that some design choices of the behavior regularization, particularly the value penalty and policy regularization in BRAC (Wu et al., 2019), can substantially affect the optimal training strategy. We provide a more detailed discussion in Appendix E.2.

6. CONCLUSION

In this paper, we propose a novel fully-offline meta-RL algorithm, FOCAL, in pursuit of more practical RL. Our method involves distance metric learning of a deterministic context encoder for efficient task inference, combined with an actor-critic apparatus with behavior regularization to effectively learn from static data. By re-formulating the meta-RL tasks as task-augmented MDPs under the task-transition correspondence assumption, we shed light on the effectiveness of our design choices in both theory and experiments. Our approach achieves superior performance compared to existing OMRL algorithms on a diverse set of continuous control meta-RL domains. Despite the success, the strong assumption we made regarding task inference from transitions can potentially limit FOCAL's robustness to common challenges in meta-RL such as distribution shift, sparse reward and stochastic environments, which opens up avenues for future work of more advanced OMRL algorithms.

Appendices

A PSEUDO-CODE

Algorithm 1: FOCAL Meta-training
Given:
• Pre-collected batch $\mathcal{D}_i = \{(s_{i,j}, a_{i,j}, s'_{i,j}, r_{i,j})\}_{j=1...N}$ for a set of training tasks $\{T_i\}_{i=1...n}$ drawn from $p(\mathcal{T})$
• Learning rates $\alpha_1, \alpha_2, \alpha_3$
1  Initialize context replay buffer $C_i$ for each task $T_i$
2  Initialize inference network $q_\phi(z|c)$, learning policy $\pi_\theta(a|s,z)$ and Q-network $Q_\psi(s,z,a)$ with parameters $\phi$, $\theta$ and $\psi$
3  while not done do
4      for each $T_i$ do
5          for $t = 0, ..., T-1$ do
6              Sample mini-batches of $B$ transitions $\{(s_{i,t}, a_{i,t}, s'_{i,t}, r_{i,t})\}_{t=1...B} \sim \mathcal{D}_i$ and update $C_i$
7              Sample RL batch $b_i \sim \mathcal{D}_i$ and context batch $c_i \sim C_i$
8              for each $T_j$, $j \neq i$ do
9                  Sample context batch $c_j \sim C_j$
10                 $L_{dml}^{ij} = L_{dml}(q_\phi(c_i), q_\phi(c_j))$
11             end
12             $L_{actor}^i = L_{actor}(b_i, q_\phi(c_i))$
13             $L_{critic}^i = L_{critic}(b_i, q_\phi(c_i))$
14         end
15     end
16     $\phi \leftarrow \phi - \alpha_1 \nabla_\phi \sum_{ij} L_{dml}^{ij}$
17     $\theta \leftarrow \theta - \alpha_2 \nabla_\theta \sum_i L_{actor}^i$
18     $\psi \leftarrow \psi - \alpha_3 \nabla_\psi \sum_i L_{critic}^i$
19 end

Algorithm 2: FOCAL Meta-testing
Given:
• Pre-collected batch $\mathcal{D}_{i'} = \{(s_{i',j'}, a_{i',j'}, s'_{i',j'}, r_{i',j'})\}_{j'=1...M}$ for a set of testing tasks $\{T_{i'}\}_{i'=1...m}$ drawn from $p(\mathcal{T})$
1 Initialize context replay buffer $C_{i'}$ for each task $T_{i'}$
2 for each $T_{i'}$ do
3     for $t = 0, ..., T-1$ do
4         Sample a mini-batch of $B$ transitions $c_{i'} = \{(s_{i',t}, a_{i',t}, s'_{i',t}, r_{i',t})\}_{t=1...B} \sim \mathcal{D}_{i'}$ and update $C_{i'}$
5         Compute $z_{i'} = q_\phi(c_{i'})$
6         Roll out policy $\pi_\theta(a|s, z_{i'})$ for evaluation
7     end
8 end

B DEFINITIONS AND PROOFS

Lemma B.1. The contrastive loss of a given dataset $\mathcal{X} = \{x_i \mid i = 1, ..., N\}$ is proportional to the variance of the random variable $X \sim \mathcal{X}$.

Proof. Consider the contrastive loss $\sum_{i \neq j}(x_i - x_j)^2$, which consists of $N(N-1)$ pairs of distinct samples $(x_i, x_j)$ drawn from $\mathcal{X}$. It can be written as

$$\sum_{i \neq j}(x_i - x_j)^2 = 2\left[(N-1)\sum_i x_i^2 - \sum_{i \neq j} x_i x_j\right] \tag{14}$$

The variance of $X \sim \mathcal{X}$ is expressed as

$$\mathrm{Var}(X) = \overline{(X - \bar{X})^2} = \overline{X^2} - (\bar{X})^2 = \frac{1}{N}\sum_i x_i^2 - \frac{1}{N^2}\Big(\sum_i x_i\Big)^2 = \frac{1}{N^2}\left[(N-1)\sum_i x_i^2 - \sum_{i \neq j} x_i x_j\right] \tag{15-18}$$

where $\bar{X}$ denotes the expectation of $X$. Substituting Eqn 18 into Eqn 14 gives

$$\sum_{i \neq j}(x_i - x_j)^2 = 2N^2\,\mathrm{Var}(X) \tag{19}$$
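Lemma B.1 is easy to verify numerically; the following check confirms $\sum_{i \neq j}(x_i - x_j)^2 = 2N^2\,\mathrm{Var}(X)$ on random data (the data itself is arbitrary).

```python
import numpy as np

# Numerical check of Lemma B.1: the accumulated pairwise squared distance
# over a dataset equals 2 N^2 Var(X).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
N = len(x)

# Summing over all ordered pairs; the i = j terms contribute zero.
pairwise = sum((xi - xj) ** 2 for xi in x for xj in x)
assert np.isclose(pairwise, 2 * N**2 * x.var())   # np.var uses ddof=0
```

This is why a positive-power loss can be "optimized" by degenerate solutions: any configuration with the right variance, including one with all points collapsed onto two locations, attains the same pairwise loss.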

C IMPORTANCE OF DISTANCE METRIC LEARNING FOR META-RL ON TASK-AUGMENTED MDPS

We provide an informal argument that enforcing distance metric learning (DML) is crucial for meta-RL on task-augmented MDPs (TA-MDPs). Consider a classical continuous neural network $N_\theta$ parameterized by $\theta$ with $L \in \mathbb{N}$ layers, $n_l \in \mathbb{N}$ nodes at the $l$-th hidden layer for $l = 1, ..., L$, input dimension $n_0$, output dimension $n_{L+1}$ and a nonlinear continuous activation function $\sigma: \mathbb{R} \to \mathbb{R}$. It can be expressed as

$$N_\theta(x) := A_{L+1} \circ \sigma_L \circ A_L \circ \cdots \circ \sigma_1 \circ A_1(x) \tag{22}$$

where $A_l: \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$ is an affine linear map defined by $A_l(x) = W_l x + b_l$ for an $n_l \times n_{l-1}$ weight matrix $W_l$ and an $n_l$-dimensional bias vector $b_l$, and $\sigma_l: \mathbb{R}^{n_l} \to \mathbb{R}^{n_l}$ is the element-wise nonlinear continuous activation map $\sigma_l(z) := (\sigma(z_1), ..., \sigma(z_{n_l}))$. Since every affine and activation map is continuous, their composition $N_\theta$ is also continuous, which means by definition of continuity:

$$\forall \epsilon > 0,\ \exists \eta > 0 \text{ s.t. } |x_1 - x_2| < \eta \Rightarrow |N_\theta(x_1) - N_\theta(x_2)| < \epsilon \tag{23}$$

where $|\cdot|$ in principle denotes any valid metric defined on Euclidean space; a classical example is the Euclidean distance. Now consider $N_\theta$ as the value function on a TA-MDP with deterministic embedding, approximated by a neural network parameterized by $\theta$:

$$\hat{Q}_\theta(s, a, z) \approx Q_\theta(s, a, z) = R_z(s, a) + \gamma\, \mathbb{E}_{s' \sim P_z(s'|s,a)}\left[V_\theta(s')\right] \tag{25}$$

The continuity of the neural network implies that for a pair of sufficiently close embedding vectors $(z_i, z_j)$, there exist sufficiently small $\eta > 0$ and $\epsilon > 0$ such that

$$z_i, z_j \in Z,\ |z_i - z_j| < \eta \Rightarrow |\hat{Q}_\theta(s, a, z_i) - \hat{Q}_\theta(s, a, z_j)| < \epsilon \tag{26}$$

Eqn 26 implies that for a pair of different tasks $(T_i, T_j) \sim p(\mathcal{T})$, if their embedding vectors are sufficiently close in the latent space $Z$, the mapped values of meta-learned functions approximated by continuous neural networks are sufficiently close too.
However, by Eqn 25, due to the different transition functions $P_{z_i}(s'|s,a)$, $P_{z_j}(s'|s,a)$ and reward functions $R_{z_i}(s,a)$, $R_{z_j}(s,a)$ of $(T_i, T_j)$, the distance between the true values of the two Q-functions, $|Q_\theta(s,a,z_i) - Q_\theta(s,a,z_j)|$, is not guaranteed to be small. This suggests that a meta-RL algorithm with a suboptimal representation of the context embedding $z = q_\phi(c)$, one which fails to maintain an effective distance between two distinct tasks $T_i, T_j$, is unlikely to accurately learn the value functions (or any policy-related functions) for both tasks simultaneously. The conclusion generalizes naturally to the multi-task meta-RL setting.
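The continuity argument above can be made quantitative for ReLU networks: the Lipschitz constant of the composition is bounded by the product of the spectral norms of the weight matrices, so nearby embeddings necessarily map to nearby outputs. The following sketch (our own illustration, with randomly initialized weights) checks this bound numerically:

```python
import numpy as np

# A ReLU MLP is Lipschitz with constant at most the product of the spectral
# norms of its weight matrices (ReLU itself is 1-Lipschitz), so two close
# embeddings z1, z2 cannot produce distant outputs.
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(16, 8)),
      rng.normal(size=(16, 16)),
      rng.normal(size=(1, 16))]

def mlp(x):
    for W in Ws[:-1]:
        x = np.maximum(W @ x, 0.0)   # ReLU activation
    return Ws[-1] @ x

# Product of spectral norms (largest singular values) bounds the Lipschitz
# constant of the whole network.
lip = np.prod([np.linalg.norm(W, 2) for W in Ws])

z1 = rng.normal(size=8)
z2 = z1 + 1e-3 * rng.normal(size=8)   # a nearby embedding
gap = np.linalg.norm(mlp(z1) - mlp(z2))
assert gap <= lip * np.linalg.norm(z1 - z2) + 1e-12
```

This is exactly why, as argued above, distinct tasks must be kept well-separated in the embedding space if their true Q-values differ substantially.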

D EXPERIMENTAL DETAILS

D.1 DETAILS OF THE MAIN EXPERIMENTAL RESULT (FIGURES 3 AND 4)

The main experimental result in the paper is the comparative study of the performance of FOCAL and three baseline OMRL algorithms: Batch PEARL, Contextual BCQ and MBML, shown in Figure 3. In Figure 6 we plot the same data for the full number of steps sampled in our experiments. Some of the baseline experiments only lasted for 10^6 steps due to a limited computational budget, but are sufficient to support the claims made in the main text. We directly adopted the Contextual BCQ and MBML implementations from MBML's official source code and performed the experiments on our own dataset generated by the SAC algorithm. The DML loss used in the experiments in Figure 3 is inverse-squared, which gives the best performance among the four power laws we experimented with in Figure 2. In addition, we provide details on the offline datasets used to produce the result. The performance levels of the training/testing data for the experiments are given in Table 2, selected for the best test-time performance over four levels: expert, medium, random, and mixed (consisting of all logged trajectories of trained SAC models from beginning (random quality) to end (expert quality)). For mixed data, the diversity of samples is optimal but the average performance level is lower than expert. A summary of the fixed datasets used for producing Figures 3 and 6 is given in Table 3. Lastly, as shown in Figure 7, we also present a faithful 3D projection (not processed by t-SNE) of the latent embeddings in Figure 4a. Evidently, our proposed method is the only algorithm which achieves effective clustering of different task embeddings. Validating our intuition about the analogy between the DML loss and electromagnetism discussed in §5.2.1, the learned embeddings do cluster around the corners and edges of the bounded 3D-projected latent space, which are the locations of highest local curvature.

D.2 ENVIRONMENTS

• Sparse-Point-Robot: A 2D navigation problem introduced in PEARL (Rakelly et al., 2019).
Starting from the origin, each task is to guide the agent to a specific goal located on the unit circle centered at the origin. The non-sparse reward is defined as the negative distance from the current location to the goal. In the sparse-reward scenario, the reward is truncated to 0 when the agent is outside a neighborhood of the goal controlled by the goal radius. While inside the neighborhood, the agent is rewarded by 1 - distance at each step, which is a positive value.
• Point-Robot-Wind: A variant of Sparse-Point-Robot in which tasks differ only in the transition function. Each task is associated with the same reward but a distinct "wind" sampled uniformly from [-l, l]^2. Every time the agent takes a step, it drifts by the wind vector. We use l = 0.05 in this paper.
• Half-Cheetah-Fwd-Back: Control a Cheetah robot to move forward or backward. The reward function depends on the walking direction.
• Half-Cheetah-Vel: Control a Cheetah robot to achieve a target velocity running forward. The reward function depends on the target velocity.
• Ant-Fwd-Back: Control an Ant robot to move forward or backward. The reward function depends on the walking direction.
• Walker-2D-Params: The agent is initialized with some randomized system dynamics parameters and must move forward. It is unique among these MuJoCo environments in that tasks differ in the transition function, which depends on randomized task-specific parameters such as mass, inertia and friction coefficients.
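The Point-Robot-Wind dynamics described above can be sketched as a tiny environment class. This is an illustrative reconstruction from the description, not the authors' code; the class name, goal radius default, and reward shaping are our assumptions:

```python
import numpy as np

# Minimal sketch of Point-Robot-Wind: every task shares the sparse reward of
# Sparse-Point-Robot but adds a fixed, task-specific "wind" drift each step.
class PointRobotWind:
    def __init__(self, goal, wind_limit=0.05, goal_radius=0.2, seed=0):
        rng = np.random.default_rng(seed)
        self.goal = np.asarray(goal, dtype=float)
        # Task-specific wind, sampled uniformly from [-l, l]^2 with l = 0.05.
        self.wind = rng.uniform(-wind_limit, wind_limit, size=2)
        self.goal_radius = goal_radius
        self.state = np.zeros(2)

    def reset(self):
        self.state = np.zeros(2)       # agent starts at the origin
        return self.state.copy()

    def step(self, action):
        # Agent moves by its action, then drifts by the task's wind vector.
        self.state = self.state + np.asarray(action, dtype=float) + self.wind
        dist = np.linalg.norm(self.state - self.goal)
        # Sparse reward: 1 - distance inside the goal radius, 0 outside.
        reward = (1.0 - dist) if dist < self.goal_radius else 0.0
        return self.state.copy(), reward
```

Because two tasks with identical rewards can differ only in `self.wind`, the context encoder must infer the task from transition tuples rather than from rewards alone.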

D.3 HYPERPARAMETER SETTINGS

The important hyperparameters used to produce the experimental results in the paper are presented in Tables 4 and 5. We found that on complex tasks such as Ant, the value penalty usually requires an extremely large regularization strength (Table 4) to converge. Since the regularization is added to the value/Q-function, this results in very large negative Q-values (Figure 10) and exploding Bellman gradients. In this scenario, training the context embedding with backpropagated Bellman gradients often yields sub-optimal latent representations and policy performance (Figure 5), which leads to our design of the decoupled training strategy discussed in §5.2.3. For policy regularization, however, the learned value/Q-function approximates the real value (Figure 11a), leading to a comparable order of magnitude for the three losses L_dml, L_actor and L_critic. In this case, coupled training of the context encoder, actor and critic may give competitive or even better performance due to end-to-end optimization, as shown in Figure 9.

F IMPLEMENTATION

We build our algorithm on top of PEARL and BRAC, both of which are derivatives of the SAC algorithm. SAC is an off-policy actor-critic method with a maximum-entropy RL objective which encourages exploration and learning a stochastic policy. Although exploration is not needed in fully-offline scenarios, we found empirically that the maximum-entropy augmentation is still beneficial for OMRL, likely because in environments such as Ant, different actions can result in the same next state and reward, which favors a stochastic policy. All function approximators in FOCAL are implemented as neural networks with MLP structures. For normalization, the last activation layer of the context encoder and policy networks is an invertible squashing operator (tanh), making Z a bounded Euclidean space (-1, 1)^l, which is reflected in Figure 7. As in Figure 1, the whole FOCAL pipeline involves three main objectives. The DML loss for training the inference network q_φ(z|c) is given by Eqn 13, computed on mini-batches of transitions drawn from the training datasets: x_i ∼ D_i, x_j ∼ D_j. The embedding vectors q_i, q_j are computed as the average embedding over x_i and x_j. The actor and critic losses are the task-augmented versions of Eqns 8 and 9, given in Eqns 27 and 28, where Q̄ is a target network and z̄ indicates that gradients are not computed through it. As discussed in (Kumar et al., 2019; Wu, Tucker, and Nachum, 2019), the divergence function D can take the form of kernel MMD (Gretton et al., 2012), Wasserstein divergence (Arjovsky, Chintala, and Bottou, 2017) or f-divergences (Nowozin et al., 2016) such as the KL divergence. In this paper, we use the dual form (Nowozin, Cseke, and Tomioka, 2016) of the KL divergence, which learns a discriminator g with minimax optimization to circumvent the need for a cloned policy for density estimation.
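A hedged sketch of the inverse-power DML objective in the spirit of Eqn 13 (the exact coefficients and weighting in the paper may differ; `dml_loss` and its defaults are our naming) looks as follows — same-task embedding pairs are attracted via a squared distance, while different-task pairs are repelled by an inverse-power term:

```python
import numpy as np

# Illustrative inverse-power DML loss on a pair of averaged context
# embeddings q_i, q_j in the bounded latent space (-1, 1)^l.
def dml_loss(q_i, q_j, same_task, n=1, beta=1.0, eps=1e-3):
    sq_dist = float(np.sum((q_i - q_j) ** 2))
    if same_task:
        # Attractive term: pull embeddings of the same task together.
        return sq_dist
    # Repulsive term: push embeddings of distinct tasks apart; eps keeps
    # the loss bounded when embeddings coincide. n = 1 gives the
    # inverse-squared law used for Figure 3.
    return beta / (sq_dist ** n + eps)
```

Note the repulsive term decays with distance, which (per the electromagnetism analogy in §5.2.1) drives distinct task embeddings toward the corners and edges of the bounded latent space.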
In principle, as a core design choice of PEARL, the context used to infer q_φ(z|c) can be sampled with a different strategy than the data used to compute the actor-critic losses. In OMRL, however, we found this treatment unnecessary since there is no exploration. The mini-batches for the DML and actor-critic objectives are therefore randomly sampled from the same dataset, which forms an end-to-end algorithm described in Algorithms 1 and 2.



https://github.com/Ji4chenLi/Multi-Task-Batch-RL For sparse-reward environments like Sparse-Point-Robot, we observed no adaptation at test time (return stays zero) for both Contextual BCQ and MBML, which might be due to an incorrect implementation or simply that both algorithms fail to adapt in sparse-reward scenarios. To avoid drawing conclusions too hastily, we chose not to present Contextual BCQ and MBML results for Sparse-Point-Robot at the moment.



Figure 2: (a) t-SNE visualization of embedding vectors drawn from 20 randomized tasks on Half-Cheetah-Vel. Inverse-power distance metric learning (DML) losses achieve better clustering. Data points are color-coded according to task identity. (b) FOCAL trained with inverse-power DML losses outperforms the linear and square distance losses.

Our experiments cover standard MuJoCo meta-RL benchmarks, plus variants of a 2D navigation problem called Point-Robot. Four (Sparse-Point-Robot, Half-Cheetah-Vel, Half-Cheetah-Fwd-Back, Ant-Fwd-Back) and two (Point-Robot-Wind, Walker-2D-Params) environments require adaptation via reward and transition functions, respectively.

Figure 3: Performance vs. number of transition steps sampled for training. Top: Average episodic testing return of FOCAL vs. other baselines on 4 meta-environments with different reward functions across tasks. Bottom: Average episodic testing return of FOCAL vs. other baselines on 2 meta-environments with different transition dynamics across tasks. These meta-RL benchmarks were previously introduced by Finn et al. (2017) and Rakelly et al. (2019), with detailed descriptions in Appendix D. For data generation, we train stochastic SAC models for every single task and roll out policies saved at each checkpoint to collect trajectories. The offline training datasets are selections of the saved trajectories, which facilitates tuning of the performance level and state-action distributions of the datasets for each task.

Figure 4: Comparative study of 4 algorithm variants: FOCAL with deterministic/probabilistic context encoder, Batch PEARL with/without behavior regularization. (a) t-SNE visualization of the embedding vectors drawn from 20 randomized tasks on Walker-2D-Params. Data points are color-coded according to task identity. (b) Return curves on tasks with different reward functions (Half-Cheetah-Vel) and transition dynamics (Walker-2D-Params).

Figure 5: FOCAL vs. FOCAL with coupled gradients. (a) t-SNE visualization of the embedding vectors drawn from 20 randomized tasks on Walker-2D-Params. Data points are color-coded according to task identity. (b) Return curves on Walker-2D-Params.

Sample mini-batches of M tasks ∼ p(T)
for step in training steps do
    for each T_i do
        Sample mini-batches c_i and b_i ∼ C_i for context encoder and policy training
        for each T_j do
            Sample mini-batches c_j from C_j
            L_dml^{ij} = L_dml(c_i, c_j; q)
        end
    end
end
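The nested task-pair sampling above can be sketched in Python; `datasets`, `encode`, and the loss callable are illustrative placeholders standing in for the per-task buffers C_i, the context encoder q_φ, and L_dml, not the authors' implementation:

```python
import numpy as np

# Draw a random mini-batch of rows from a per-task offline dataset.
def sample(dataset, batch_size, rng):
    idx = rng.integers(len(dataset), size=batch_size)
    return dataset[idx]

# One DML training step: accumulate pairwise losses over all task pairs
# (T_i, T_j), mirroring the nested loop in the algorithm above.
def dml_training_step(datasets, encode, dml_loss, batch_size=16, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for i, D_i in enumerate(datasets):           # each task T_i
        c_i = sample(D_i, batch_size, rng)       # context batch from C_i
        for j, D_j in enumerate(datasets):       # each task T_j
            c_j = sample(D_j, batch_size, rng)
            total += dml_loss(encode(c_i), encode(c_j), same_task=(i == j))
    return total
```

The O(M^2) pairwise loop explains why a larger meta batch size speeds convergence but increases compute, as noted in the hyperparameter tables.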

Definition B.1 (Task-Augmented MDP). A task-augmented Markov Decision Process (TA-MDP) can be modeled as M = (S, Z, A, P, R, ρ_0, γ) where
• S: state space
• Z: contextual latent space
• A: action space
• P: transition function, P(s', z'|s, z, a) = P_z(s'|s, a) if there is no intra-task transition
• R: reward function, R(s, z, a) = R_z(s, a)
• ρ_0(s, z): joint initial state and task distribution
• γ ∈ (0, 1): discount factor

Definition B.2. The Bellman optimality operator B_z on a TA-MDP is defined as

$$(B_z Q)(s, z, a) := R(s, z, a) + \gamma \mathbb{E}_{P(s', z'|s, z, a)}\big[\max_{a'} Q(s', z', a')\big] \tag{20}$$

Definition B.3 (Deterministic MDP). For a deterministic MDP, a transition map $t: S \times A \to S$ exists such that

$$P(s'|s, a) = \delta(s' - t(s, a)) \tag{21}$$

where $\delta(x - y)$ is the Dirac delta function, which is zero almost everywhere except at $x = y$.
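A minimal sketch of the TA-MDP idea in Definition B.1 is a wrapper that augments every observation with the task's fixed latent vector z, so a single policy or Q-function can condition on (s, z). The class and its `reset`/`step` interface are our illustrative assumptions, not the paper's code:

```python
import numpy as np

# Wrap a per-task base environment so that observations carry the task
# latent z. With no intra-task transitions, z stays fixed for the episode,
# matching P(s', z' | s, z, a) = P_z(s' | s, a) in Definition B.1.
class TaskAugmentedEnv:
    def __init__(self, env, z):
        self.env = env
        self.z = np.asarray(z, dtype=float)

    def _augment(self, s):
        # Concatenate the raw state with the (fixed) task latent.
        return np.concatenate([np.asarray(s, dtype=float), self.z])

    def reset(self):
        return self._augment(self.env.reset())

    def step(self, action):
        s, r = self.env.step(action)
        return self._augment(s), r
```

In FOCAL, z would be produced by the learned context encoder rather than given, but the augmented-state view above is what the Bellman operator in Definition B.2 acts on.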

Figure 6: Average episodic testing return of FOCAL vs. other baselines on five meta-environments.

Figure 7: 3D projection of the embedding vectors ∈ (-1, 1) l drawn from 20 randomized tasks on Walker-2D-Params. Data points are color-coded according to task identity.

(b) The DML loss weight β and coefficient (defined in Eqn 13) used in the experiments of Figure 2a to match the scale of the objective functions of different power laws. The weights are chosen such that all terms are equal when the average distance of x_i and x_j per dimension is 0.5, a reasonable value given x ∈ (-1, 1)^l.

E SENSITIVITY TO DISTRIBUTION SHIFT

Since in OMRL all datasets are static and fixed, many challenges from classical supervised learning, such as over-fitting, persist. In developing FOCAL, we are also interested in its sensitivity to distribution shift, for a better understanding of OMRL algorithms. Since for each task T_i our data-generating behavior policies β_i(a|s) are trained from random to expert level, we select three performance levels (expert, medium, random) of datasets to study how combinations of training/testing sets with different qualities/distributions affect performance. An illustration of the three quality levels on Sparse-Point-Robot is shown in Figure 8.

Figure 8: Distribution of rollout trajectories of trained SAC policies of three performance levels: random, medium and expert. Since reward is sparse, only states that lie in the red circle are given non-zero rewards, making meta-learning more challenging and sensitive to data distributions.

Figure 9: FOCAL vs. FOCAL with coupled gradients and policy regularization. The task representation alone of the coupled training scheme might not be superior, but the policy performance can be improved due to end-to-end optimization. (a) t-SNE visualization of the embedding vectors drawn from 20 randomized tasks on Walker-2D-Params. Data points are color-coded according to task identity. (b) Return curves on Walker-2D-Params.

DIVERGENCE OF Q-FUNCTIONS IN THE OFFLINE SETTING

The necessity of applying behavior regularization on environments like Ant-Fwd-Back and Walker-2D-Params to prevent divergence of value functions is demonstrated in Figures 10 and 11.

Figure 10: FOCAL with value penalty vs. Batch PEARL on Ant-Fwd-Back. The Q-function learned by Batch PEARL diverges (> 10^11) whereas the Q-function of FOCAL, despite its large order of magnitude due to the value penalty, converges eventually given proper regularization (α = 10^6).

$$L_{critic} = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D},\, a' \sim \pi_\theta(\cdot|s')}\Big[\big(r + \gamma \bar{Q}^D_\psi(s', \bar{z}, a') - Q_\psi(s, \bar{z}, a)\big)^2\Big] \tag{27}$$

$$L_{actor} = -\mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\mathbb{E}_{a' \sim \pi_\theta(\cdot|s)}\big[Q_\psi(s, \bar{z}, a')\big] - \alpha \hat{D}\Big] \tag{28}$$
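The following scalar sketch (our own illustration; `critic_loss`, `actor_loss`, and all numeric stand-ins for the networks and the learned divergence estimate D̂ are hypothetical) shows where the behavior-regularization term enters in Eqns 27-28, and how the value-penalty variant differs from pure policy regularization:

```python
# Behavior-regularized losses for a single transition, with scalars standing
# in for Q_psi, the target network, and the divergence estimate `div`.
def critic_loss(r, gamma, q_target_next, q_current, alpha, div,
                value_penalty=True):
    # With the value penalty, alpha * div is subtracted inside the Bellman
    # target, which is what drives Q-values strongly negative for large alpha.
    penalty = alpha * div if value_penalty else 0.0
    target = r + gamma * (q_target_next - penalty)
    return (target - q_current) ** 2

def actor_loss(q_value, alpha, div):
    # Policy regularization: maximize Q while penalizing divergence from
    # the behavior policy (minimize the negative).
    return -(q_value - alpha * div)
```

With `value_penalty=True` and a very large alpha (e.g. the 10^6 used for Ant-Fwd-Back), the target, and hence the learned Q, becomes hugely negative, matching the behavior observed in Figure 10.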



Table 2: Quality of data used for best test-time performance. We maintain the same quality of data for training and testing due to the algorithm's sensitivity to distribution shift. From our experiments, we observe that for some environments/tasks the highest-performance datasets generate the best testing results, whereas for others the diversity of data matters most.

Table 3: Details of the fixed datasets used for producing Figures 3 and 6. The three numbers in the "Checkpoints" column stand for starting epoch : ending epoch : checkpoint spacing.

Table 4: Hyperparameters used to produce Figure 3. Meta batch size refers to the number of tasks used for computing the DML loss L_dml^{ij} at a time. A larger meta batch size leads to faster convergence but requires greater computing power.

Table 5: Hyperparameters used to produce Figure 2a. Compared to the Half-Cheetah-Vel experiment in Table 4, the latent space dimension was reduced to speed up computation. Also, the value penalty is used in behavior regularization.



Table 6 shows the average return at test time for various training and testing distributions. Sensitivity to distribution shift is confirmed, since training and testing on similar data distributions results in relatively higher performance. This is particularly significant in the sparse-reward scenario, where Assumption 1 is no longer satisfied. With severe over-fitting and the MDP ambiguity problem elaborated in the last paragraph of §5.1, the performance of the meta-RL policy is inevitably compromised by distribution mismatch between training and testing datasets.

7. ACKNOWLEDGEMENTS

The authors are grateful to Yao Yao, Zhicheng An and Yuanhao Huang for running part of the baseline experiments. Special thanks to Yu Rong and Peilin Zhao for providing insightful comments and being helpful throughout the project.


† Work done while an intern

