LEARNING ROBUST STATE ABSTRACTIONS FOR HIDDEN-PARAMETER BLOCK MDPS

Abstract

Many control tasks exhibit similar dynamics that can be modeled as having common latent structure. Hidden-Parameter Markov Decision Processes (HiP-MDPs) explicitly model this structure to improve sample efficiency in multi-task settings. However, this setting makes strong assumptions on the observability of the state that limit its application in real-world scenarios with rich observation spaces. In this work, we leverage ideas of common structure from the HiP-MDP setting, and extend it to enable robust state abstractions inspired by Block MDPs. We derive instantiations of this new framework for both multi-task reinforcement learning (MTRL) and meta-reinforcement learning (Meta-RL) settings. Further, we provide transfer and generalization bounds based on task and state similarity, along with sample complexity bounds that depend on the aggregate number of samples across tasks, rather than the number of tasks, a significant improvement over prior work that uses the same environment assumptions. To further demonstrate the efficacy of the proposed method, we empirically compare against and show improvement over multi-task and meta-reinforcement learning baselines.

1. INTRODUCTION

A key open challenge in AI research is how to train agents that learn behaviors that generalize across tasks and environments. When there is common structure underlying the tasks, multi-task reinforcement learning (MTRL), where the agent learns a set of tasks simultaneously, has definite advantages (in terms of robustness and sample efficiency) over the single-task setting, where the agent learns each task independently. There are two ways in which learning multiple tasks can accelerate learning: the agent can learn a common representation of observations, and the agent can learn a common way to behave. Prior work in MTRL has leveraged this idea by sharing representations across tasks (D'Eramo et al., 2020) or providing per-task sample complexity results that show improved sample efficiency from transfer (Brunskill & Li, 2013). However, explicit exploitation of the shared structure across tasks via a unified dynamics model has been lacking. Prior works that make use of shared representations use a naive unification approach that posits all tasks lie in a shared domain (Figure 1, left). On the other hand, in the single-task setting, research on state abstractions has a much richer history, with several works on improved generalization through the aggregation of behaviorally similar states (Ferns et al., 2004; Li et al., 2006; Luo et al., 2019; Zhang et al., 2020b). In this work, we propose to leverage rich state abstraction models from the single-task setting, and explore their potential for the more general multi-task setting. We frame the problem as a structured super-MDP with a shared state space and universal dynamics model conditioned on a task-specific hidden parameter (Figure 1, right). This additional structure gives us better sample efficiency, both

2. BACKGROUND

In this section, we introduce the base environment as well as notation and additional assumptions about the latent structure of the environments and multi-task setup considered in this work.

A finite, discrete-time Markov Decision Process (MDP) (Bellman, 1957; Puterman, 1995) is a tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $T : \mathcal{S} \times \mathcal{A} \to \mathrm{Dist}(\mathcal{S})$ is the environment transition probability function, and $\gamma \in [0, 1)$ is the discount factor. At each time step, the learning agent perceives a state $s_t \in \mathcal{S}$, takes an action $a_t \in \mathcal{A}$ drawn from a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, and with probability $T(s_{t+1} \mid s_t, a_t)$ enters next state $s_{t+1}$, receiving a numerical reward $R_{t+1}$ from the environment. The value function of policy $\pi$ is defined as $V^\pi(s) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s\big]$. The optimal value function $V^*$ is the maximum value function over the class of stationary policies.

Hidden-Parameter MDPs (HiP-MDPs) (Doshi-Velez & Konidaris, 2013) can be defined by a tuple $\mathcal{M} : \langle \mathcal{S}, \mathcal{A}, \Theta, T_\theta, R, \gamma, P_\Theta \rangle$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ a finite action space, $T_\theta$ describes the transition distribution for a specific task described by task parameter $\theta \sim P_\Theta$, $R$ is the reward function, $\gamma$ is the discount factor, and $P_\Theta$ the distribution over task parameters. This defines a family of MDPs, where each MDP is described by the parameter $\theta \sim P_\Theta$. We assume that this parameter $\theta$ is fixed for an episode and indicated by an environment id given at the start of the episode.

Block MDPs (Du et al., 2019) are described by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{X}, p, q, R \rangle$ with an unobservable state space $\mathcal{S}$, action space $\mathcal{A}$, and observable space $\mathcal{X}$. $p$ denotes the latent transition distribution $p(s' \mid s, a)$ for $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$; $q$ is the (possibly stochastic) emission mapping that emits the observations $q(x \mid s)$ for $x \in \mathcal{X}$, $s \in \mathcal{S}$; and $R$ is the reward function. We are interested in the setting where this mapping $q$ is one-to-many.
This is common in many real-world problems with many tasks where the underlying states and dynamics are the same, but the observation space that the agent perceives can be quite different, e.g., navigating a house of the same layout but different decorations and furnishings.

Assumption 1 (Block structure (Du et al., 2019)). Each observation $x$ uniquely determines its generating state $s$. That is, the observation space $\mathcal{X}$ can be partitioned into disjoint blocks $\mathcal{X}_s$, each containing the support of the conditional distribution $q(\cdot \mid s)$.

Assumption 1 gives the Markov property in $\mathcal{X}$, a key difference from partially observable MDPs (Kaelbling et al., 1998; Zhang et al., 2019), which have no guarantee of determining the generating state from the history of observations. This assumption allows us to compute reasonable bounds for our algorithm in k-order MDPs (which describe many real-world problems) and avoids the intractability of true POMDPs, which have no guarantees on providing enough information to sufficiently predict future rewards. A relaxation of this assumption entails providing less information in the observation for predicting future reward, which will degrade performance. We show empirically that our method is still more robust to a relaxation of this assumption compared to other MTRL methods.

Bisimulation is a strict form of state abstraction, where two states are bisimilar if they are behaviorally equivalent. Bisimulation metrics (Ferns et al., 2011) define a distance between states as follows:

Definition 1 (Bisimulation Metric (Theorem 2.6 in Ferns et al. (2011))). Let $(\mathcal{S}, \mathcal{A}, P, r)$ be a finite MDP and $\mathfrak{met}$ the space of bounded pseudometrics on $\mathcal{S}$ equipped with the metric induced by the uniform norm. Define $F : \mathfrak{met} \to \mathfrak{met}$ by

$$F(d)(s, s') = \max_{a \in \mathcal{A}} \big( |r^a_s - r^a_{s'}| + \gamma W(d)(P^a_s, P^a_{s'}) \big),$$

where $W(d)$ is the Wasserstein distance between transition probability distributions. Then $F$ has a unique fixed point $\tilde{d}$, which is the bisimulation metric.
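To make Definition 1 concrete, the following is a minimal sketch of the fixed-point iteration $d \leftarrow F(d)$, assuming a toy MDP with deterministic transitions. Under that assumption each next-state distribution is a point mass, so the Wasserstein term reduces to the current distance between the two successor states. The function name `bisimulation_metric` and the three-state example are illustrative and not part of the original work.

```python
from itertools import product


def bisimulation_metric(rewards, next_state, gamma, tol=1e-10, max_iters=10_000):
    """Iterate d <- F(d) for a finite MDP with DETERMINISTIC transitions.

    rewards[s][a] is the reward and next_state[s][a] the successor state.
    With point-mass transitions, W(d)(P_s^a, P_t^a) = d(next(s, a), next(t, a)).
    """
    n_states = len(rewards)
    n_actions = len(rewards[0])
    d = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(max_iters):
        new_d = [
            [
                max(
                    abs(rewards[s][a] - rewards[t][a])
                    + gamma * d[next_state[s][a]][next_state[t][a]]
                    for a in range(n_actions)
                )
                for t in range(n_states)
            ]
            for s in range(n_states)
        ]
        change = max(
            abs(new_d[s][t] - d[s][t])
            for s, t in product(range(n_states), repeat=2)
        )
        d = new_d
        if change < tol:  # F is a contraction, so the iterates converge
            break
    return d


# Toy example (hypothetical): states 0 and 1 behave identically, state 2 is absorbing.
rewards = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
next_state = [[1, 2], [0, 2], [2, 2]]
metric = bisimulation_metric(rewards, next_state, gamma=0.9)
print(metric[0][1], metric[0][2])
```

In this example states 0 and 1 are bisimilar, so their distance is zero, while the distance from either of them to state 2 solves $x = 1 + \gamma x$. Stochastic transitions would require solving an optimal transport problem for the Wasserstein term at each iteration.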
A nice property of this metric $\tilde{d}$ is that the difference in optimal value between two states is bounded by their distance under the metric.

Theorem 1 ($V^*$ is Lipschitz with respect to $\tilde{d}$ (Ferns et al., 2004)). Let $V^*$ be the optimal value function for a given discount factor $\gamma$. Then $V^*$ is Lipschitz continuous with respect to $\tilde{d}$ with Lipschitz constant $\frac{1}{1-\gamma}$,

$$|V^*(s) - V^*(s')| \le \frac{1}{1-\gamma} \tilde{d}(s, s').$$

Therefore, we see that bisimulation metrics give us a Lipschitz value function with respect to $\tilde{d}$.

For downstream evaluation of the representations we learn, we use Soft Actor Critic (SAC) (Haarnoja et al., 2018), an off-policy actor-critic method that uses the maximum entropy framework for soft policy iteration. At each iteration, SAC performs soft policy evaluation and improvement steps. The policy evaluation step fits a parametric soft Q-function $Q(s_t, a_t)$ using transitions sampled from the replay buffer $\mathcal{D}$ by minimizing the soft Bellman residual,

$$J(Q) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \Big[ \big( Q(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1}) \big)^2 \Big].$$

The target value function $\bar{V}$ is approximated via a Monte-Carlo estimate of the following expectation,

$$\bar{V}(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi} \big[ \bar{Q}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1}) \big],$$

where $\bar{Q}$ is the target soft Q-function parameterized by a weight vector obtained from an exponential moving average of the Q-function weights to stabilize training. The policy improvement step then attempts to project a parametric policy $\pi(a_t \mid s_t)$ by minimizing the KL divergence between the policy and a Boltzmann distribution induced by the Q-function, producing the following objective,

$$J(\pi) = \mathbb{E}_{s_t \sim \mathcal{D}} \Big[ \mathbb{E}_{a_t \sim \pi} \big[ \alpha \log \pi(a_t \mid s_t) - Q(s_t, a_t) \big] \Big].$$
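As an illustration of the three SAC quantities above, the sketch below evaluates them on scalar samples in plain Python. It is not the SAC implementation used in this work: the function names are hypothetical, the Q-values and log-probabilities are passed in as plain numbers rather than produced by networks, and no gradients are computed.

```python
import math


def soft_value_estimate(target_q_values, log_probs, alpha):
    """Monte-Carlo estimate of E_{a'~pi}[Q_bar(s', a') - alpha * log pi(a'|s')]."""
    samples = [q - alpha * lp for q, lp in zip(target_q_values, log_probs)]
    return sum(samples) / len(samples)


def soft_bellman_residual(q_value, reward, gamma, next_value):
    """Squared soft Bellman residual for a single transition."""
    return (q_value - reward - gamma * next_value) ** 2


def policy_objective(q_values, log_probs, alpha):
    """Sample estimate of E_{a~pi}[alpha * log pi(a|s) - Q(s, a)] for one state."""
    samples = [alpha * lp - q for q, lp in zip(q_values, log_probs)]
    return sum(samples) / len(samples)


# Two sampled next actions, each with probability 0.5 under the policy (toy numbers).
log_probs = [math.log(0.5), math.log(0.5)]
v_next = soft_value_estimate([1.0, 3.0], log_probs, alpha=0.2)
residual = soft_bellman_residual(q_value=2.0, reward=1.0, gamma=0.5, next_value=v_next)
policy_loss = policy_objective([1.0, 3.0], log_probs, alpha=0.2)
print(v_next, residual, policy_loss)
```

When the same Q-values are used in both places, the policy objective is the negative of the soft value estimate, which shows that minimizing $J(\pi)$ maximizes the entropy-regularized value.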

3. THE HIP-BMDP SETTING

The HiP-MDP setting (as defined in Section 2) assumes full observability of the state space. However, in most real-world scenarios, we only have access to high-dimensional, noisy observations, which often contain information irrelevant to the reward. We combine the Block MDP and HiP-MDP settings to introduce the Hidden-Parameter Block MDP setting (HiP-BMDP), where states are latent, and transition distributions change depending on the task parameters $\theta$. This adds a dimension of complexity to our problem: we first want to learn an amenable state space $\mathcal{S}$, and then a universal dynamics model in that representation. In this section, we formally define the HiP-BMDP family in Section 3.1, propose an algorithm for learning HiP-BMDPs in Section 3.2, and finally provide theoretical analysis for the setting in Section 3.3.

3.1. THE MODEL

A HiP-BMDP family can be described by the tuple $\langle \mathcal{S}, \mathcal{A}, \Theta, T_\theta, R, \gamma, P_\Theta, \mathcal{X}, q \rangle$, with a graphical model of the framework found in Figure 2. We are given a label $k \in \{1, ..., N\}$ for each of $N$ environments. We plan to learn a candidate $\Theta$ that unifies the transition dynamics across all environments, effectively finding $T(\cdot, \cdot, \theta)$. For two environment settings $\theta_i, \theta_j \in \Theta$, we define a distance metric:

$$d(\theta_i, \theta_j) := \max_{s \in \mathcal{S}, a \in \mathcal{A}} W\big( T_{\theta_i}(s, a), T_{\theta_j}(s, a) \big). \quad (1)$$

The Wasserstein-1 metric can be written as $W_d(P, Q) = \sup_{f \in \mathcal{F}_d} \big\| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \big\|_1$, where $\mathcal{F}_d$ is the set of 1-Lipschitz functions under metric $d$ (Müller, 1997). We omit $d$ but use $d(x, y) = \|x - y\|_1$ in our setting. This ties the distance between values of $\theta$ to the maximum difference in the next-state distribution over all state-action pairs in the MDP.

Given a HiP-BMDP family $\mathcal{M}_\Theta$, we assume a multi-task setting where environments with specific $\theta \in \Theta$ are sampled from this family. We do not have access to $\theta$, and instead get environment labels $I_1, I_2, ..., I_N$. The goal is to learn a latent space for the hidden parameters $\theta$. We want $\theta$ to be smooth with respect to changes in dynamics from environment to environment, which we can set explicitly through the following objective:

$$\|\psi(I_1) - \psi(I_2)\|_1 = \max_{s_t \in \mathcal{S}, a_t \in \mathcal{A}} W_2\big( p(s_{t+1} \mid s_t, a_t, \psi(I_1)),\; p(s_{t+1} \mid s_t, a_t, \psi(I_2)) \big), \quad (2)$$

given environment labels $I_1, I_2$ and $\psi : \mathbb{Z}^+ \to \mathbb{R}^d$, the encoder that maps from an environment label (a positive integer) to $\theta$.
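The task distance in Equation (1) can be illustrated with a small sketch, under the simplifying assumptions that next states are one-dimensional and that each next-state distribution is represented by an equal number of samples. In that case the Wasserstein-1 distance between two empirical distributions is the mean absolute difference of their sorted samples. The helper names and the toy data are hypothetical.

```python
def wasserstein1_1d(samples_p, samples_q):
    """W1 between two equal-size 1-D empirical distributions."""
    if len(samples_p) != len(samples_q):
        raise ValueError("this sketch assumes equal sample counts")
    pairs = zip(sorted(samples_p), sorted(samples_q))
    return sum(abs(a - b) for a, b in pairs) / len(samples_p)


def task_distance(next_states_i, next_states_j):
    """Max over shared (state, action) keys of W1 between next-state samples."""
    return max(
        wasserstein1_1d(next_states_i[key], next_states_j[key])
        for key in next_states_i
    )


# Toy next-state samples for two tasks, keyed by (state, action).
task_i = {(0, 0): [0.0, 1.0], (0, 1): [2.0, 2.0]}
task_j = {(0, 0): [0.5, 1.5], (0, 1): [2.0, 5.0]}
print(task_distance(task_i, task_j))
```

The maximum is attained at the state-action pair where the two tasks disagree most, which is why a single strongly differing transition is enough to separate two tasks under this metric.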

3.2. LEARNING HIP-BMDPS

The premise of our work is that the HiP-BMDP formulation will improve sample efficiency and generalization performance on downstream tasks. We examine two settings, multi-task reinforcement learning (MTRL) and meta-reinforcement learning (meta-RL). In both settings, we have access to N training environments and a held-out set of M evaluation environments, both drawn from a defined family. In the MTRL setting, we evaluate model performance across all N training environments and ability to adapt to new environments. Adaptation performance is evaluated in both the few-shot regime, where we collect a small number of samples from the evaluation environments to learn each hidden parameter θ, and the zero-shot regime, where we average θ over all training tasks. We evaluate against ablations and other MTRL methods. In the meta-RL setting, the goal for the agent is to leverage knowledge acquired from the previous tasks to adapt quickly to a new task. We evaluate performance in terms of how quickly the agent can achieve a minimum threshold score in the unseen evaluation environments (by learning the correct θ for each new environment). Learning a HiP-BMDP approximation of a family of MDPs requires the following components: i) an encoder that maps observations from state space to a learned, latent representation, φ : X → Z, ii) an environment encoder ψ that maps an environment identifier to a hidden parameter θ, iii) a universal dynamics model T conditioned on task parameter θ. Figure 2 shows how the components interact during training. In practice, computing the maximum Wasserstein distance over the entire state-action space is computationally infeasible. Therefore, we relax this requirement by taking the expectation over Wasserstein distance with respect to the marginal state distribution of the behavior policy. 
We train a probabilistic universal dynamics model $T$ to output the desired next-state distributions as Gaussians, for which the 2-Wasserstein distance has a closed form:

$$W_2\big( \mathcal{N}(m_1, \Sigma_1), \mathcal{N}(m_2, \Sigma_2) \big)^2 = \|m_1 - m_2\|_2^2 + \big\| \Sigma_1^{1/2} - \Sigma_2^{1/2} \big\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm. Given that we do not have access to the true universal dynamics function across all environments, it must be learned. The objective in Equation (2) is accompanied by an additional objective to learn $T$, giving a final loss function:

$$\mathcal{L}(\psi, T) = \underbrace{\mathrm{MSE}\Big( \big\| \psi(I_1) - \psi(I_2) \big\|_2,\; W_2\big( T(s^{I_1}_t, \pi(s^{I_1}_t), \psi(I_1)),\; T(s^{I_2}_t, \pi(s^{I_2}_t), \psi(I_2)) \big) \Big)}_{\Theta \text{ learning error}} + \underbrace{\mathrm{MSE}\big( T(s^{I_1}_t, a^{I_1}_t, \psi(I_1)),\; s^{I_1}_{t+1} \big) + \mathrm{MSE}\big( T(s^{I_2}_t, a^{I_2}_t, \psi(I_2)),\; s^{I_2}_{t+1} \big)}_{\text{Model learning error}},$$

where red (in the typeset equation) indicates that gradients are stopped. Transitions $\{s^{I_1}_t, a^{I_1}_t, s^{I_1}_{t+1}, I_1\}$ and $\{s^{I_2}_t, a^{I_2}_t, s^{I_2}_{t+1}, I_2\}$ from two different environments ($I_1 \neq I_2$) are sampled randomly from a replay buffer. In practice, we scale the $\Theta$ learning error, our task bisimulation metric loss, using a scalar value denoted as $\alpha_\psi$.
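The loss above can be sketched numerically as follows. This is an illustration only, under several assumptions not stated in the text: covariances are taken to be diagonal (so $\Sigma^{1/2}$ is the vector of standard deviations), the MSE of two scalars is their squared difference, the model prediction compared to the next state is the predicted mean, and the stop-gradient is not represented because plain Python has no automatic differentiation. All function and argument names are hypothetical.

```python
import math


def w2_diag_gaussians(mean1, std1, mean2, std2):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean1, mean2))
    cov_term = sum((a - b) ** 2 for a, b in zip(std1, std2))
    return math.sqrt(mean_term + cov_term)


def mse(prediction, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(prediction)


def hip_bmdp_loss(theta1, theta2, policy_pred1, policy_pred2,
                  model_mean1, next_state1, model_mean2, next_state2,
                  alpha_psi=1.0):
    """alpha_psi * (Theta learning error) + (model learning error).

    policy_pred1 / policy_pred2 are (mean, std) pairs predicted for the
    policy's action in each environment; model_mean1 / model_mean2 are the
    predicted next-state means for the actions actually taken.
    """
    embed_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta1, theta2)))
    w2 = w2_diag_gaussians(*policy_pred1, *policy_pred2)
    theta_error = (embed_dist - w2) ** 2
    model_error = mse(model_mean1, next_state1) + mse(model_mean2, next_state2)
    return alpha_psi * theta_error + model_error


# Toy numbers: the embedding distance already matches the W2 distance (both 5).
w2_example = w2_diag_gaussians([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
loss_example = hip_bmdp_loss(
    theta1=[0.0, 0.0], theta2=[3.0, 4.0],
    policy_pred1=([0.0, 0.0], [1.0, 1.0]), policy_pred2=([3.0, 4.0], [1.0, 1.0]),
    model_mean1=[0.1, 0.0], next_state1=[0.0, 0.0],
    model_mean2=[3.0, 4.0], next_state2=[3.0, 4.0],
)
print(w2_example, loss_example)
```

In the toy call the $\Theta$ learning error vanishes, so the remaining loss comes entirely from the small next-state prediction error in the first environment.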

3.3. THEORETICAL ANALYSIS

In this section, we provide value bounds and sample complexity analysis of the HiP-BMDP approach. We have additional new theoretical analysis of the simpler HiP-MDP setting in Appendix B. We first define three additional error terms associated with learning an $(\epsilon_R, \epsilon_T, \epsilon_\theta)$-bisimulation abstraction,

$$\epsilon_R := \sup_{a \in \mathcal{A},\; x_1, x_2 \in \mathcal{X},\; \phi(x_1) = \phi(x_2)} \big| R(x_1, a) - R(x_2, a) \big|,$$

$$\epsilon_T := \sup_{a \in \mathcal{A},\; x_1, x_2 \in \mathcal{X},\; \phi(x_1) = \phi(x_2)} \big\| \Phi T(x_1, a) - \Phi T(x_2, a) \big\|_1,$$

$$\epsilon_\theta := \|\hat{\theta} - \theta\|_1.$$

$\Phi T$ denotes the lifted version of $T$, where we take the next-step transition distribution from observation space $\mathcal{X}$ and lift it to latent space $\mathcal{S}$. We can think of $\epsilon_R, \epsilon_T$ as describing a new MDP which is close (but not necessarily identical, if $\epsilon_R, \epsilon_T > 0$) to the original Block MDP. These two error terms can be computed empirically over all training environments and are therefore not task-specific. $\epsilon_\theta$, on the other hand, is measured as a per-task error. Similar methods are used in Jiang et al. (2015) to bound the loss of a single abstraction, which we extend to the HiP-BMDP setting with a family of tasks.

Value Bounds. We first evaluate how the error in $\theta$ prediction and the learned bisimulation representation affect the optimal $Q^*_{\bar{\mathcal{M}}_{\hat{\theta}}}$ of the learned MDP, by first bounding its distance from the optimal $Q^*$ of the true MDP for a single task.

Theorem 2 (Q error). Given an MDP $\bar{\mathcal{M}}_{\hat{\theta}}$ built on an $(\epsilon_R, \epsilon_T, \epsilon_\theta)$-approximate bisimulation abstraction of an instance of a HiP-BMDP $\mathcal{M}_\theta$, we denote the evaluation of the optimal Q function of $\bar{\mathcal{M}}_{\hat{\theta}}$ on $\mathcal{M}_\theta$ as $[Q^*_{\bar{\mathcal{M}}_{\hat{\theta}}}]_{\mathcal{M}_\theta}$. The value difference with respect to the optimal $Q^*_{\mathcal{M}_\theta}$ is upper bounded by

$$\big\| Q^*_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}_{\hat{\theta}}}]_{\mathcal{M}_\theta} \big\|_\infty \le \epsilon_R + \gamma (\epsilon_T + \epsilon_\theta) \frac{R_{\max}}{2(1 - \gamma)}.$$

Proof in Appendix C.

As in the HiP-MDP setting, we can measure the transferability of a specific policy $\pi$ learned on one task to another, now taking into account error from the learned representation.

Theorem 3 (Transfer bound). Given two MDPs $\mathcal{M}_{\theta_i}$ and $\mathcal{M}_{\theta_j}$, we can bound the difference in $Q^\pi$ between the two MDPs for a given policy $\pi$ learned under an $(\epsilon_R, \epsilon_T, \epsilon_{\theta_i})$-approximate abstraction of $\mathcal{M}_{\theta_i}$ and applied to $\mathcal{M}_{\theta_j}$:

$$\big\| Q^*_{\mathcal{M}_{\theta_j}} - [Q^*_{\bar{\mathcal{M}}_{\hat{\theta}_i}}]_{\mathcal{M}_{\theta_j}} \big\|_\infty \le \epsilon_R + \gamma \big( \epsilon_T + \epsilon_{\theta_i} + \|\theta_i - \theta_j\|_1 \big) \frac{R_{\max}}{2(1 - \gamma)}.$$

This result follows directly from Theorem 2. Given a policy learned for task $i$, Theorem 3 gives a bound on how far from optimal that policy is when applied to task $j$. Intuitively, the more similar in behavior tasks $i$ and $j$ are, as measured by $\|\theta_i - \theta_j\|_1$, the better $\pi$ performs on task $j$.

Finite Sample Analysis. In MDPs (or families of MDPs) with large state spaces, it can be unrealistic to assume that all states are visited at least once in the finite sample regime. Abstractions are useful in this regime for their generalization capabilities. We can instead perform a counting analysis based on the number of samples of any abstract state-action pair. We compute a loss bound with abstraction $\phi$ which depends on the size of the replay buffer $\mathcal{D}$, collected over all tasks. Specifically, we define the minimal number of visits to an abstract state-action pair, $n_\phi(\mathcal{D}) = \min_{x \in \phi(\mathcal{X}), a \in \mathcal{A}} |\mathcal{D}_{x,a}|$. This sample complexity bound relies on a Hoeffding-style inequality, and therefore requires that the samples in $\mathcal{D}$ be independent, which is usually not the case when trajectories are sampled.

Theorem 4 (Sample Complexity). For any $\phi$ which defines an $(\epsilon_R, \epsilon_T, \epsilon_\theta)$-approximate bisimulation abstraction on a HiP-BMDP family $\mathcal{M}_\Theta$, we define the empirical measurement of $Q^*_{\bar{\mathcal{M}}_{\hat{\theta}}}$ over $\mathcal{D}$ to be $Q^*_{\bar{\mathcal{M}}^{\mathcal{D}}_{\hat{\theta}}}$. Then, with probability $\ge 1 - \delta$,

$$\big\| Q^*_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}^{\mathcal{D}}_{\hat{\theta}}}]_{\mathcal{M}_\theta} \big\|_\infty \le \epsilon_R + \gamma (\epsilon_T + \epsilon_\theta) \frac{R_{\max}}{2(1 - \gamma)} + \frac{R_{\max}}{(1 - \gamma)^2} \sqrt{\frac{1}{2 n_\phi(\mathcal{D})} \log \frac{2 |\phi(\mathcal{X})| |\mathcal{A}|}{\delta}}.$$

This performance bound applies to all tasks in the family and has two terms that are affected by using a state abstraction: the number of samples $n_\phi(\mathcal{D})$, and the size of the state space $|\phi(\mathcal{X})|$.
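To see how the two parts of the Theorem 4 bound behave, the sketch below evaluates them numerically. It reads the bound as an approximation term $\epsilon_R + \gamma(\epsilon_T + \epsilon_\theta) \frac{R_{\max}}{2(1-\gamma)}$ plus an estimation term $\frac{R_{\max}}{(1-\gamma)^2}\sqrt{\frac{1}{2 n_\phi(\mathcal{D})}\log\frac{2|\phi(\mathcal{X})||\mathcal{A}|}{\delta}}$; the function names and the numbers plugged in are illustrative assumptions.

```python
import math


def approximation_term(eps_r, eps_t, eps_theta, r_max, gamma):
    """Error due to the approximate abstraction and the estimate of theta."""
    return eps_r + gamma * (eps_t + eps_theta) * r_max / (2.0 * (1.0 - gamma))


def estimation_term(r_max, gamma, n_min, n_abstract_states, n_actions, delta):
    """Hoeffding-style term; n_min is the minimal abstract state-action count."""
    log_term = math.log(2.0 * n_abstract_states * n_actions / delta)
    return r_max / (1.0 - gamma) ** 2 * math.sqrt(log_term / (2.0 * n_min))


def value_error_bound(eps_r, eps_t, eps_theta, r_max, gamma,
                      n_min, n_abstract_states, n_actions, delta):
    """Sum of the approximation and estimation terms."""
    return (approximation_term(eps_r, eps_t, eps_theta, r_max, gamma)
            + estimation_term(r_max, gamma, n_min, n_abstract_states,
                              n_actions, delta))


# Illustrative numbers: the estimation term shrinks as 1 / sqrt(n_min).
approx = approximation_term(0.1, 0.05, 0.05, r_max=1.0, gamma=0.5)
est_small = estimation_term(1.0, 0.5, n_min=100, n_abstract_states=10,
                            n_actions=2, delta=0.1)
est_large = estimation_term(1.0, 0.5, n_min=10_000, n_abstract_states=10,
                            n_actions=2, delta=0.1)
print(approx, est_small, est_large)
```

Increasing the minimal visit count by a factor of 100 reduces the estimation term by a factor of 10, while the approximation term is unaffected by the amount of data. The count is taken in aggregate over all tasks, which is the source of the improvement discussed in the text.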
We know that $|\phi(\mathcal{X})| \le |\mathcal{X}|$ as behaviorally equivalent states are grouped together under bisimulation, and $n_\phi(\mathcal{D})$ is the minimal number of visits to any abstract state-action pair, in aggregate over all training environments. This is an improvement over the sample complexity of applying single-task learning without transfer over all tasks, and over the method proposed in Brunskill & Li (2013), both of which would rely on the number of tasks or number of MDPs seen.

4. EXPERIMENTS

We consider two setups for evaluation: i) an interpolation setup and ii) an extrapolation setup, where the changes in the dynamics function are interpolations and extrapolations, respectively, between the changes in the dynamics function of the training environments. This dual-evaluation setup provides a more nuanced understanding of how well the learned model transfers across the environments. Implementation details can be found in Appendix D and sample videos of policies at https://sites.google.com/view/hip-bmdp.

Environments. We create a family of MDPs using the existing environment-task pairs from the DeepMind Control Suite (DMC) and change one environment parameter to sample different MDPs. We denote this parameter as the perturbation-parameter. We consider the following HiP-BMDPs:

1. Cartpole-Swingup-V0: the mass of the pole varies,
2. Cheetah-Run-V0: the size of the torso varies,
3. Walker-Run-V0: the friction coefficient between the ground and the walker's legs varies,
4. Walker-Run-V1: the size of the left foot of the walker varies, and
5. Finger-Spin-V0: the size of the finger varies.

We show an example of the different pixel observations for Cheetah-Run-V0 in Figure 3. Additional environment details are in Appendix D. We sample 8 MDPs from each MDP family by sampling different values for the perturbation-parameter. The MDPs are arranged in order of increasing values of the perturbation-parameter such that we can induce an order over the family of MDPs. We denote the ordered MDPs as A-H.
MDPs {B, C, F, G} are training environments and {D, E} are used for evaluating the model in the interpolation setup (i.e., the value of the perturbation-parameter can be obtained by interpolation). MDPs {A, H} are for evaluating the model in the extrapolation setup (i.e., the value of the perturbation-parameter can be obtained by extrapolation). We evaluate the learning agents by computing the average reward (over 10 episodes) achieved by the policy after training for a fixed number of steps. All experiments are run for 10 seeds, with mean and standard error reported in the plots.

Multi-Task Setting. We first consider a multi-task setup where the agent is trained on four related, but different environments with pixel observations. We compare our method, HiP-BMDP, with the following baselines and ablations: i) DeepMDP (Gelada et al., 2019), where we aggregate data across all training environments, ii) HiP-BMDP-nobisim, HiP-BMDP without the task bisimulation metric loss on task embeddings, iii) Distral, an ensemble of policies trained using the Distral algorithm (Teh et al., 2017) with SAC-AE (Yarats et al., 2019) as the underlying policy, iv) PCGrad (Yu et al., 2020), and v) GradNorm (Chen et al., 2018). For all models, the agent sequentially performs one update per environment. For fair comparison, we ensure that baselines have at least as many parameters as HiP-BMDP. Distral has more parameters as it trains one policy per environment. Additional implementation details about baselines are in Appendix D.1.

In Figures 4 and 10 (in Appendix), we observe that for all the models, performance deteriorates when evaluated on interpolation/extrapolation environments. We only report extrapolation results in the main paper because of space constraints, as they were very similar to the interpolation performance. The gap between the HiP-BMDP model and other baselines also widens, showing that the proposed approach is relatively more robust to changes in environment dynamics.
At training time (Figure 9 in Appendix), we observe that HiP-BMDP consistently outperforms other baselines on all the environments. The success of our proposed method cannot be attributed to task embeddings alone, as HiP-BMDP-nobisim also uses task embeddings. Moreover, only incorporating the task embeddings is not guaranteed to improve performance in all the environments (as can be seen in the case of Cheetah-Run-V0). We also note that the multi-task learning baselines like Distral, PCGrad, and GradNorm sometimes lag behind even the DeepMDP baseline, perhaps because they do not leverage a shared global dynamics model.

Meta-RL Setting. We consider the Meta-RL setup for evaluating the few-shot generalization capabilities of our proposed approach on proprioceptive state, as meta-RL techniques are too time-intensive to train on pixel observations directly. Specifically, we use PEARL (Rakelly et al., 2019), an off-policy meta-learning algorithm that uses probabilistic context variables, and is shown to outperform common meta-RL baselines like MAML-TRPO (Finn et al., 2017) and ProMP (Rothfuss et al., 2019) on proprioceptive state. We incorporate our proposed approach in PEARL by training the inference network $q_\phi(z \mid c)$ with our additional HiP-BMDP loss. The algorithm pseudocode can be found in Appendix D. In Figure 5 we see that the proposed approach (blue) converges faster to a threshold reward (green) than the baseline for Cartpole-Swingup-V0 and Walker-Walk-V1. We provide additional results in Appendix E.

Evaluating the Universal Transition Model. We investigate how well the transition model performs in an unseen environment by only adapting the task parameter $\theta$. We instantiate a new MDP, sampled from the family of MDPs, and use a behavior policy to collect transitions. These transitions are used to update only the $\theta$ parameter, and the transition model is evaluated by unrolling it for $k$ steps.
We report the average per-step model error in latent space, averaged over 10 environments. While we expect both the proposed setup and baseline setups to adapt to the new environment, we expect the proposed setup to adapt faster because it exploits the underlying structure. In Figure 6, we indeed observe that the proposed HiP-BMDP model adapts much faster than the ablation HiP-BMDP-nobisim.

Relaxing the Block MDP Assumption. We incorporate sticky observations into the environment to determine how HiP-BMDP behaves when the Block MDP assumption is relaxed. With some probability $p$ (set to 0.1 in practice), the current observation is dropped, and the agent sees the previous observation again. In Figure 7, we see that even in this setting the proposed HiP-BMDP model outperforms the other baseline models.
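The sticky-observation mechanism described above can be sketched as a thin wrapper around an environment. This is a hypothetical illustration, not the wrapper used in the experiments: it assumes an environment object exposing `reset()` returning an observation and `step(action)` returning `(observation, reward, done)`, and it assumes that the "previous observation" is the last observation the agent was shown. `CountingEnv` is a toy stand-in for a real environment.

```python
import random


class CountingEnv:
    """Toy environment (hypothetical): the observation is the step counter."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 0.0, False


class StickyObservations:
    """With probability p, drop the new observation and repeat the last one shown."""

    def __init__(self, env, p=0.1, seed=None):
        self.env = env
        self.p = p
        self._rng = random.Random(seed)
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, action):
        obs, reward, done = self.env.step(action)
        if self._rng.random() < self.p:
            obs = self._last_obs      # observation dropped: agent sees the old one
        else:
            self._last_obs = obs      # observation delivered: remember it
        return obs, reward, done


def rollout(env, n_steps):
    """Collect the observations seen by the agent over n_steps steps."""
    observations = [env.reset()]
    for _ in range(n_steps):
        obs, _, _ = env.step(None)
        observations.append(obs)
    return observations


print(rollout(StickyObservations(CountingEnv(), p=0.1, seed=0), 10))
```

Only the observation is affected; the underlying state, reward, and termination signal continue to evolve, so the observation no longer uniquely determines the generating state and Assumption 1 is violated.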

5. RELATED WORK

Multi-task learning has been extensively studied in RL with assumptions around common properties of different tasks, e.g., reward and transition dynamics. Much work has focused on considering tasks as MDPs and learning optimal policies for each task while maximizing shared knowledge. However, in most real-world scenarios, the parameters governing the dynamics are not observed. Moreover, it is not explicitly clear how changes in dynamics across tasks are controlled. The HiP-BMDP setting provides a principled way to change dynamics across tasks via a latent variable.

[Figure caption, partially recovered: the green line shows a threshold reward; 100 steps are used for adaptation to the evaluation environments.]

[...] a discrete and tractably small state space. Tirinzoni et al. (2020) assume access to a generative model for any state-action pair, and their bound scales with the minimum of the number of tasks or the size of the state space. In the rich observation setting, this minimum will almost always be the number of tasks. Similar to our work, Perez et al. (2020) also treat the multi-task setting as a HiP-MDP by explicitly designing latent variable models to model the latent parameters, but require knowledge of the structure upfront, whereas our approach does not make any such assumptions.

Meta-learning, or learning to learn, is also a related framework with a different approach. We focus here on context-based approaches, which are more similar to the shared representation approaches of MTRL and our own method. Rakelly et al. (2019) model and learn latent contexts upon which a universal policy is conditioned. However, no explicit assumption of a universal structure is leveraged. Amit & Meir (2018) and Yin et al. (2020) give a PAC-Bayes bound for meta-learning generalization that relies on the number of tasks $n$. Our setting is quite different from the typical assumptions of the meta-learning framework, which stresses that the tasks must be mutually exclusive to ensure a single model cannot solve all tasks.
Instead, we assume a shared latent structure underlying all tasks, and seek to exploit that structure for generalization. We find that under this setting, our method indeed outperforms policies initialized through meta-learning.

The ability to extract meaningful information through state abstractions provides a means to generalize across tasks with a common structure. Abel et al. (2018) learn transitive and PAC state abstractions for a distribution over tasks, but they concentrate on finite, tabular MDPs. One approach to form such abstractions is via bisimulation metrics (Givan et al., 2003; Ferns et al., 2004), which formalize a concrete way to group behaviorally equivalent states. Prior work also leverages bisimulation for transfer (Castro & Precup, 2010), but on the policy level. Our work instead focuses on learning a latent state representation and establishes theoretical results for the MTRL setting. Recent work (Gelada et al., 2019) also learns a latent dynamics model and demonstrates connections to bisimulation metrics, but does not address multi-task learning.

6. DISCUSSION

In this work, we advocate for a new framework, HiP-BMDP, to address the multi-task reinforcement learning setting. Like previous methods, HiP-BMDP assumes a shared state and action space across tasks, but additionally assumes latent structure in the dynamics. We exploit this structure through learning a universal dynamics model with latent parameter $\theta$, which captures the behavioral similarity across tasks. We provide error and value bounds for the HiP-MDP (in the appendix) and HiP-BMDP settings, showing improvements in sample complexity over prior work by producing a bound that depends on the number of samples in aggregate over tasks, rather than the number of tasks seen at training time. Our work relies on the assumption that we have access to an environment id, or knowledge of when we have switched environments. This assumption could be relaxed by incorporating an environment identification procedure at training time to cluster incoming data into separate environments. Further, our bounds rely on $L_\infty$ norms for measuring error in the value and transfer bounds. In future work we will investigate tightening these bounds with $L_p$ norms.

A BISIMULATION BOUNDS

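We first look at the Block MDP case only (Zhang et al., 2020a), which can be thought of as the single-task setting in a HiP-BMDP. We can compute approximate error bounds in this setting by denoting $\phi$ an $(\epsilon_R, \epsilon_T)$-approximate bisimulation abstraction, where

$$\epsilon_R := \sup_{a \in \mathcal{A},\; x_1, x_2 \in \mathcal{X},\; \phi(x_1) = \phi(x_2)} \big| R(x_1, a) - R(x_2, a) \big|,$$

$$\epsilon_T := \sup_{a \in \mathcal{A},\; x_1, x_2 \in \mathcal{X},\; \phi(x_1) = \phi(x_2)} \big\| \Phi T(x_1, a) - \Phi T(x_2, a) \big\|_1.$$

$\Phi T$ denotes the lifted version of $T$, where we take the next-step transition distribution from observation space $\mathcal{X}$ and lift it to latent space $\mathcal{S}$.

Theorem 5. Given an MDP $\bar{\mathcal{M}}$ built on an $(\epsilon_R, \epsilon_T)$-approximate bisimulation abstraction of Block MDP $\mathcal{M}$, we denote the evaluation of the optimal Q function of $\bar{\mathcal{M}}$ on $\mathcal{M}$ as $[Q^*_{\bar{\mathcal{M}}}]_{\mathcal{M}}$. The value difference with respect to the optimal $Q^*_{\mathcal{M}}$ is upper bounded by

$$\big\| Q^*_{\mathcal{M}} - [Q^*_{\bar{\mathcal{M}}}]_{\mathcal{M}} \big\|_\infty \le \epsilon_R + \gamma \epsilon_T \frac{R_{\max}}{2(1 - \gamma)}.$$

Proof. From Theorem 2 in Jiang (2018).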

B THEORETICAL RESULTS FOR THE HIP-MDP SETTING

We explore the HiP-MDP setting, where a low-dimensional state space is given, to highlight the results that can be obtained just from assuming this hierarchical structure of the dynamics.

B.1 VALUE BOUNDS

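Given a family of environments $\mathcal{M}_\Theta$, we bound the difference in expected value between two sampled MDPs, $\mathcal{M}_{\theta_i}, \mathcal{M}_{\theta_j} \in \mathcal{M}_\Theta$, using $d(\theta_i, \theta_j)$. Additionally, we make the assumption that we have a behavior policy $\pi$ that is near both optimal policies $\pi^*_{\theta_i}, \pi^*_{\theta_j}$. We use KL divergence to define this neighborhood for $\pi^*_{\theta_i}$,

$$d_{KL}(\pi, \pi^*_{\theta_i}) = \mathbb{E}_{s \sim \rho^\pi} \Big[ KL\big( \pi(\cdot \mid s), \pi^*_{\theta_i}(\cdot \mid s) \big)^{1/2} \Big]. \quad (5)$$

We start with a bound for a specific policy $\pi$. One way to measure the difference between two tasks $\mathcal{M}_{\theta_i}, \mathcal{M}_{\theta_j}$ is to measure the difference in value when that policy is applied in both settings. We show the relationship between the learned $\theta$ and this difference in value. The following results are similar to error bounds in approximate value iteration (Munos, 2005; Bertsekas & Tsitsiklis, 1996), but instead of tracking model error, we apply these methods to compare tasks with differences in dynamics.

Theorem 6. Given policy $\pi$, the difference in expected value between two MDPs drawn from the family of MDPs, $\mathcal{M}_{\theta_i}, \mathcal{M}_{\theta_j} \in \mathcal{M}_\Theta$, is bounded by

$$|V^\pi_{\theta_i} - V^\pi_{\theta_j}| \le \frac{\gamma}{1 - \gamma} \|\theta_i - \theta_j\|_1.$$

Proof. We use a telescoping sum to prove this bound, which is similar to Luo et al. (2019). First, we let $Z_k$ denote the discounted sum of rewards if the first $k$ steps are in $\mathcal{M}_{\theta_i}$, and all steps $t \ge k$ are in $\mathcal{M}_{\theta_j}$,

$$Z_k := \mathbb{E}_{\substack{\forall t \ge 0,\; a_t \sim \pi(s_t) \\ \forall\, 0 \le t < k,\; s_{t+1} \sim T_{\theta_i}(s_t, a_t) \\ \forall t \ge k,\; s_{t+1} \sim T_{\theta_j}(s_t, a_t)}} \Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \Big].$$

By definition, we have $Z_\infty = V^\pi_{\theta_i}$ and $Z_0 = V^\pi_{\theta_j}$. Now, the value function difference can be written as a telescoping sum,

$$V^\pi_{\theta_i} - V^\pi_{\theta_j} = \sum_{k=0}^{\infty} (Z_{k+1} - Z_k). \quad (7)$$

Published as a conference paper at ICLR 2021

Each term can be simplified to

$$Z_{k+1} - Z_k = \gamma^{k+1}\, \mathbb{E}_{s_k, a_k \sim \pi, T_{\theta_i}} \Big[ \mathbb{E}_{s_{k+1} \sim T_{\theta_j}(\cdot \mid s_k, a_k),\; s'_{k+1} \sim T_{\theta_i}(\cdot \mid s_k, a_k)} \big[ V^\pi_{\theta_j}(s'_{k+1}) - V^\pi_{\theta_j}(s_{k+1}) \big] \Big].$$

Plugging this back into Equation (7), and summing the geometric series $\sum_{k=0}^{\infty} \gamma^{k+1} = \frac{\gamma}{1-\gamma}$,

$$V^\pi_{\theta_i} - V^\pi_{\theta_j} = \frac{\gamma}{1 - \gamma}\, \mathbb{E}_{s \sim \rho^\pi_{\theta_i},\; a \sim \pi(s)} \Big[ \mathbb{E}_{s' \sim T_{\theta_i}(\cdot \mid s, a)} V^\pi_{\theta_j}(s') - \mathbb{E}_{s'' \sim T_{\theta_j}(\cdot \mid s, a)} V^\pi_{\theta_j}(s'') \Big].$$

This expected value difference is bounded by the Wasserstein distance between $T_{\theta_i}$ and $T_{\theta_j}$,

$$|V^\pi_{\theta_i} - V^\pi_{\theta_j}| \le \frac{\gamma}{1 - \gamma} W(T_{\theta_i}, T_{\theta_j}) = \frac{\gamma}{1 - \gamma} \|\theta_i - \theta_j\|_1,$$

using Equation (1).

Another comparison to make is how different the optimal policies in different tasks are with respect to the distance $\|\theta_i - \theta_j\|_1$.

Theorem 7. The difference in expected optimal value between two MDPs $\mathcal{M}_{\theta_i}, \mathcal{M}_{\theta_j} \in \mathcal{M}_\Theta$ is bounded by

$$|V^*_{\theta_i} - V^*_{\theta_j}| \le \frac{\gamma}{(1 - \gamma)^2} \|\theta_i - \theta_j\|_1.$$

Proof.

$$|V^*_{\theta_i}(s) - V^*_{\theta_j}(s)| = \big| \max_a Q^*_{\theta_i}(s, a) - \max_{a'} Q^*_{\theta_j}(s, a') \big| \le \max_a \big| Q^*_{\theta_i}(s, a) - Q^*_{\theta_j}(s, a) \big|.$$

We can bound the RHS with

$$\sup_{s,a} |Q^*_{\theta_i}(s, a) - Q^*_{\theta_j}(s, a)| \le \sup_{s,a} |r_{\theta_i}(s, a) - r_{\theta_j}(s, a)| + \gamma \sup_{s,a} \big| \mathbb{E}_{s' \sim T_{\theta_i}(\cdot \mid s, a)} V^*_{\theta_i}(s') - \mathbb{E}_{s'' \sim T_{\theta_j}(\cdot \mid s, a)} V^*_{\theta_j}(s'') \big|.$$

All MDPs in $\mathcal{M}_\Theta$ have the same reward function, so the first term is 0. Then

$$\begin{aligned}
\sup_{s,a} \big| Q^*_{\theta_i}(s, a) - Q^*_{\theta_j}(s, a) \big|
&\le \gamma \sup_{s,a} \big| \mathbb{E}_{s' \sim T_{\theta_i}(\cdot \mid s, a)} V^*_{\theta_i}(s') - \mathbb{E}_{s'' \sim T_{\theta_j}(\cdot \mid s, a)} V^*_{\theta_j}(s'') \big| \\
&= \gamma \sup_{s,a} \Big| \mathbb{E}_{s' \sim T_{\theta_i}(\cdot \mid s, a)} \big[ V^*_{\theta_i}(s') - V^*_{\theta_j}(s') \big] + \mathbb{E}_{s'' \sim T_{\theta_j}(\cdot \mid s, a),\; s' \sim T_{\theta_i}(\cdot \mid s, a)} \big[ V^*_{\theta_j}(s') - V^*_{\theta_j}(s'') \big] \Big| \\
&\le \gamma \sup_{s,a} \Big| \mathbb{E}_{s' \sim T_{\theta_i}(\cdot \mid s, a)} \big[ V^*_{\theta_i}(s') - V^*_{\theta_j}(s') \big] \Big| + \gamma \sup_{s,a} \Big| \mathbb{E}_{s'' \sim T_{\theta_j}(\cdot \mid s, a),\; s' \sim T_{\theta_i}(\cdot \mid s, a)} \big[ V^*_{\theta_j}(s') - V^*_{\theta_j}(s'') \big] \Big| \\
&\le \gamma \sup_{s,a} \Big| \mathbb{E}_{s' \sim T_{\theta_i}(\cdot \mid s, a)} \big[ V^*_{\theta_i}(s') - V^*_{\theta_j}(s') \big] \Big| + \frac{\gamma}{1 - \gamma} \|\theta_i - \theta_j\|_1 \\
&\le \gamma \max_s \big| V^*_{\theta_i}(s) - V^*_{\theta_j}(s) \big| + \frac{\gamma}{1 - \gamma} \|\theta_i - \theta_j\|_1 \\
&= \gamma \max_s \big| \max_a Q^*_{\theta_i}(s, a) - \max_{a'} Q^*_{\theta_j}(s, a') \big| + \frac{\gamma}{1 - \gamma} \|\theta_i - \theta_j\|_1 \\
&\le \gamma \sup_{s,a} \big| Q^*_{\theta_i}(s, a) - Q^*_{\theta_j}(s, a) \big| + \frac{\gamma}{1 - \gamma} \|\theta_i - \theta_j\|_1.
\end{aligned}$$

Solving for $\sup_{s,a} |Q^*_{\theta_i}(s, a) - Q^*_{\theta_j}(s, a)|$,

$$\sup_{s,a} \big| Q^*_{\theta_i}(s, a) - Q^*_{\theta_j}(s, a) \big| \le \frac{\gamma}{(1 - \gamma)^2} \|\theta_i - \theta_j\|_1.$$

Plugging this back in,

$$|V^*_{\theta_i}(s) - V^*_{\theta_j}(s)| \le \frac{\gamma}{(1 - \gamma)^2} \|\theta_i - \theta_j\|_1.$$

Both these results lend more intuition for casting the multi-task setting under the HiP-MDP formalism. The difference in the optimal performance between any two environments is controlled by the distance between the hidden parameters for the corresponding environments. One can interpret the hidden parameter as a knob that allows precise changes across the tasks.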

B.2 EXPECTED ERROR BOUNDS

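In MTRL, we are concerned with the performance over a family of tasks. The empirical risk is typically defined as follows for $T$ tasks (Maurer et al., 2016):

$$\epsilon_{avg}(\theta) = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\ell\big(f_t(h(w_t(X))), Y\big)\big].$$

Consequently, we bound the expected loss over the family of environments $E$ with respect to $\theta$. In particular, we are interested in the average approximation error and define it as the absolute model error averaged across all environments:

$$\epsilon_{avg}(\theta) = \frac{1}{|E|}\sum_{i=1}^{E} \big|V^*_{\hat\theta_i}(s) - V^*_{\theta_i}(s)\big|.$$

Theorem 8. Given a family of environments $\mathcal{M}_\Theta$, each parameterized with an underlying true hidden parameter $\theta_1, \theta_2, \cdots, \theta_E$, let $\hat\theta_1, \hat\theta_2, \cdots, \hat\theta_E$ be their respective approximations. Then the average approximation error across all environments is bounded as follows:

$$\epsilon_{avg}(\theta) \le \frac{\gamma\,\epsilon_\theta}{(1-\gamma)^2},$$

where each environment's parameter $\theta_i$ is $\epsilon_\theta$-close to its approximation $\hat\theta_i$, i.e. $d(\hat\theta_i, \theta_i) \le \epsilon_\theta$, where $d$ is the distance metric defined in Eq. 1.

Proof. We here consider the approximation error averaged across all environments as follows:

$$
\begin{aligned}
\epsilon_{avg}(\theta) &= \frac{1}{E}\sum_{i=1}^{E} \big|V^*_{\hat\theta_i}(s) - V^*_{\theta_i}(s)\big| \\
&= \frac{1}{E}\sum_{i=1}^{E} \big|\max_a Q^*_{\hat\theta_i}(s,a) - \max_{a'} Q^*_{\theta_i}(s,a')\big| \\
&\le \frac{1}{E}\sum_{i=1}^{E} \max_a \big|Q^*_{\hat\theta_i}(s,a) - Q^*_{\theta_i}(s,a)\big| \quad (12)
\end{aligned}
$$

For each environment, the argument from the proof of Theorem 7 applies: the reward function is shared across the family, so the reward term vanishes and

$$
\begin{aligned}
\sup_{s,a}\big|Q^*_{\hat\theta_i}(s,a) - Q^*_{\theta_i}(s,a)\big|
&\le \gamma \sup_{s,a}\Big|\mathbb{E}_{s' \sim T_{\hat\theta_i}(\cdot|s,a)} V^*_{\hat\theta_i}(s') - \mathbb{E}_{s'' \sim T_{\theta_i}(\cdot|s,a)} V^*_{\theta_i}(s'')\Big| \\
&= \gamma \sup_{s,a}\Big|\mathbb{E}_{s' \sim T_{\hat\theta_i}(\cdot|s,a)}\big[V^*_{\hat\theta_i}(s') - V^*_{\theta_i}(s')\big] + \mathbb{E}_{s'' \sim T_{\theta_i}(\cdot|s,a),\ s' \sim T_{\hat\theta_i}(\cdot|s,a)}\big[V^*_{\theta_i}(s') - V^*_{\theta_i}(s'')\big]\Big| \\
&\le \gamma \sup_{s,a}\Big|\mathbb{E}_{s' \sim T_{\hat\theta_i}(\cdot|s,a)}\big[V^*_{\hat\theta_i}(s') - V^*_{\theta_i}(s')\big]\Big| + \gamma \sup_{s,a}\Big|\mathbb{E}_{s'' \sim T_{\theta_i}(\cdot|s,a),\ s' \sim T_{\hat\theta_i}(\cdot|s,a)}\big[V^*_{\theta_i}(s') - V^*_{\theta_i}(s'')\big]\Big| \\
&\le \gamma \sup_{s,a}\Big|\mathbb{E}_{s' \sim T_{\hat\theta_i}(\cdot|s,a)}\big[V^*_{\hat\theta_i}(s') - V^*_{\theta_i}(s')\big]\Big| + \frac{\gamma}{1-\gamma}\,|\hat\theta_i - \theta_i| \\
&\le \gamma \max_{s}\big|V^*_{\hat\theta_i}(s) - V^*_{\theta_i}(s)\big| + \frac{\gamma}{1-\gamma}\,|\hat\theta_i - \theta_i| \\
&= \gamma \max_{s}\big|\max_a Q^*_{\hat\theta_i}(s,a) - \max_{a'} Q^*_{\theta_i}(s,a')\big| + \frac{\gamma}{1-\gamma}\,|\hat\theta_i - \theta_i| \\
&\le \gamma \sup_{s,a}\big|Q^*_{\hat\theta_i}(s,a) - Q^*_{\theta_i}(s,a)\big| + \frac{\gamma}{1-\gamma}\,|\hat\theta_i - \theta_i|
\end{aligned}
$$

Solving for $\sup_{s,a}|Q^*_{\hat\theta_i}(s,a) - Q^*_{\theta_i}(s,a)|$,

$$\sup_{s,a}\big|Q^*_{\hat\theta_i}(s,a) - Q^*_{\theta_i}(s,a)\big| \le \frac{\gamma}{(1-\gamma)^2}\,|\hat\theta_i - \theta_i| \quad (13)$$

Plugging Eq. 13 back into Eq. 12,

$$\epsilon_{avg}(\theta) \le \frac{1}{E}\sum_{i=1}^{E} \frac{\gamma}{(1-\gamma)^2}\,|\hat\theta_i - \theta_i| = \frac{\gamma}{E(1-\gamma)^2}\Big[|\hat\theta_1 - \theta_1| + |\hat\theta_2 - \theta_2| + \cdots + |\hat\theta_E - \theta_E|\Big]$$

We now consider that the distance between the approximated $\hat\theta_i$ and the underlying hidden parameter $\theta_i \in \mathcal{M}_E$ is defined as in Eq. 1, such that $d(\hat\theta_i, \theta_i) \le \epsilon_\theta$. Plugging this back concludes the proof,

$$\epsilon_{avg}(\theta) \le \frac{\gamma\,\epsilon_\theta}{(1-\gamma)^2}.$$

It is interesting to note that the average approximation error across all environments is independent of the number of environments and primarily governed by the error in approximating the hidden parameter $\theta$ for each environment.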
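The independence from the number of environments can be checked numerically. The sketch below assumes scalar hidden parameters (a simplification made only for this illustration) and compares the averaged per-environment bound with the worst-case bound obtained from the largest per-environment error; the function name is hypothetical.

```python
def average_error_bounds(gamma, theta_hat, theta_true):
    """Return (averaged bound, worst-case bound) on the average approximation error.

    averaged bound:   gamma / (1 - gamma)^2 * mean_i |theta_hat_i - theta_i|
    worst-case bound: gamma / (1 - gamma)^2 * max_i  |theta_hat_i - theta_i|
    Scalar hidden parameters are an assumption of this sketch.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if len(theta_hat) != len(theta_true) or not theta_hat:
        raise ValueError("need one estimate per environment")
    scale = gamma / (1.0 - gamma) ** 2
    errors = [abs(h - t) for h, t in zip(theta_hat, theta_true)]
    averaged = scale * sum(errors) / len(errors)
    worst_case = scale * max(errors)
    return averaged, worst_case


if __name__ == "__main__":
    # Three hypothetical environments with estimation errors 0.5, 0.0 and 0.25.
    print(average_error_bounds(0.5, [1.0, 2.0, 3.0], [1.5, 2.0, 2.75]))
```

Adding more environments changes the averaged bound only through the mean estimation error; the worst-case bound depends on the largest error alone, not on how many environments there are.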

C ADDITIONAL RESULTS AND PROOFS FOR HIP-BMDP RESULTS

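We first compute $L_\infty$ norm bounds for Q error under approximate abstractions and transfer bounds.

Theorem 9 (Q error). Given an MDP $\bar{\mathcal{M}}_{\hat\theta}$ built on a $(\epsilon_R, \epsilon_T, \epsilon_\theta)$-approximate bisimulation abstraction of an instance of a HiP-BMDP $\mathcal{M}_\theta$, we denote the evaluation of the optimal Q function of $\bar{\mathcal{M}}_{\hat\theta}$ on $\mathcal{M}$ as $[Q^*_{\bar{\mathcal{M}}_{\hat\theta}}]_{\mathcal{M}_\theta}$. The value difference with respect to the optimal $Q^*_{\mathcal{M}}$ is upper bounded by

$$\big\|Q^*_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}_{\hat\theta}}]_{\mathcal{M}_\theta}\big\|_\infty \le \epsilon_R + \gamma(\epsilon_T + \epsilon_\theta)\frac{R_{max}}{2(1-\gamma)}.$$

Proof. In the HiP-BMDP setting, we have a global encoder $\phi$ over all tasks, but the difference in transition distribution also includes $\theta$. The reward functions are the same across tasks, so there is no change to $\epsilon_R$. However, we now must incorporate difference in dynamics in $\epsilon_T$. Assuming we have two environments with hidden parameters $\theta_i, \theta_j \in \Theta$, we can compute $\epsilon_T^{\theta_i,\theta_j}$ across those two environments by joining them into a super-MDP:

$$
\begin{aligned}
\epsilon_T^{\theta_i,\theta_j} &= \sup_{\substack{a \in \mathcal{A},\ x_1, x_2 \in \mathcal{X} \\ \phi(x_1) = \phi(x_2)}} \big\|\Phi T_{\theta_i}(x_1, a) - \Phi T_{\theta_j}(x_2, a)\big\|_1 \\
&\le \sup_{\substack{a \in \mathcal{A},\ x_1, x_2 \in \mathcal{X} \\ \phi(x_1) = \phi(x_2)}} \Big(\big\|\Phi T_{\theta_i}(x_1, a) - \Phi T_{\theta_i}(x_2, a)\big\|_1 + \big\|\Phi T_{\theta_i}(x_2, a) - \Phi T_{\theta_j}(x_2, a)\big\|_1\Big) \\
&\le \sup_{\substack{a \in \mathcal{A},\ x_1, x_2 \in \mathcal{X} \\ \phi(x_1) = \phi(x_2)}} \big\|\Phi T_{\theta_i}(x_1, a) - \Phi T_{\theta_i}(x_2, a)\big\|_1 + \sup_{\substack{a \in \mathcal{A},\ x_1, x_2 \in \mathcal{X} \\ \phi(x_1) = \phi(x_2)}} \big\|\Phi T_{\theta_i}(x_2, a) - \Phi T_{\theta_j}(x_2, a)\big\|_1 \\
&= \epsilon_T^{\theta_i} + \|\theta_i - \theta_j\|_1
\end{aligned}
$$

This result is intuitive in that with a shared encoder learning a per-task bisimulation relation, the distance between bisimilar states from another task depends on the change in transition distribution between those two tasks. We can now extend the single-task bisimulation bound (Theorem 5) to the HiP-BMDP setting by denoting approximation error of $\theta$ as $\|\hat\theta - \theta\|_1 < \epsilon_\theta$.

Theorem 4. For any $\phi$ which defines an $(\epsilon_R, \epsilon_T, \epsilon_\theta)$-approximate bisimulation abstraction on a HiP-BMDP family $\mathcal{M}_\Theta$, we define the empirical measurement of $Q^*_{\bar{\mathcal{M}}_{\hat\theta}}$ over $D$ to be $Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}$. Then, with probability $\ge 1 - \delta$,

$$\big\|Q^*_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}]_{\mathcal{M}_\theta}\big\|_\infty \le \epsilon_R + \gamma(\epsilon_T + \epsilon_\theta)\frac{R_{max}}{2(1-\gamma)} + \frac{R_{max}}{(1-\gamma)^2}\sqrt{\frac{1}{2 n_\phi(D)} \log\frac{2|\phi(\mathcal{X})||\mathcal{A}|}{\delta}}.$$

Proof.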
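$$
\begin{aligned}
\big\|Q^*_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}]_{\mathcal{M}_\theta}\big\|_\infty
&\le \big\|Q^*_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}_{\hat\theta}}]_{\mathcal{M}_\theta}\big\|_\infty + \big\|[Q^*_{\bar{\mathcal{M}}_{\hat\theta}}]_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}]_{\mathcal{M}_\theta}\big\|_\infty \\
&= \big\|Q^*_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}_{\hat\theta}}]_{\mathcal{M}_\theta}\big\|_\infty + \big\|Q^*_{\bar{\mathcal{M}}_{\hat\theta}} - Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}\big\|_\infty
\end{aligned}
$$

The first term is bounded by Theorem 2, so we only need to bound the second term using McDiarmid's inequality and the knowledge that the value function of a bisimulation representation is $\frac{1}{1-\gamma}$-Lipschitz from Theorem 1. First, we write this difference as a deviation from an expectation in order to apply the concentration inequality.

$$
\begin{aligned}
\big\|Q^*_{\bar{\mathcal{M}}_{\hat\theta}} - Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}\big\|_\infty
&= \big\|Q^*_{\bar{\mathcal{M}}_{\hat\theta}} - \mathcal{T}^D_\phi Q^*_{\bar{\mathcal{M}}_{\hat\theta}} + \mathcal{T}^D_\phi Q^*_{\bar{\mathcal{M}}_{\hat\theta}} - \mathcal{T}^D_\phi Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}\big\|_\infty \\
&\le \big\|Q^*_{\bar{\mathcal{M}}_{\hat\theta}} - \mathcal{T}^D_\phi Q^*_{\bar{\mathcal{M}}_{\hat\theta}}\big\|_\infty + \gamma \big\|Q^*_{\bar{\mathcal{M}}_{\hat\theta}} - Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}\big\|_\infty \\
&\le \frac{1}{1-\gamma}\big\|\mathcal{T}^D_\phi Q^*_{\bar{\mathcal{M}}_{\hat\theta}} - \mathcal{T}_\phi Q^*_{\bar{\mathcal{M}}_{\hat\theta}}\big\|_\infty
\end{aligned}
$$

Now we can apply McDiarmid's inequality,

$$P_D\Big[\big|Q^*_{\bar{\mathcal{M}}_{\hat\theta}} - Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}\big| \ge t\Big] \le 2\exp\left(-\frac{2t^2\,|D_{\phi(x),a}|}{R_{max}^2/(1-\gamma)^2}\right).$$

Solve for the $t$ that makes this inequality hold for all $(\phi(x), a) \in \mathcal{X} \times \mathcal{A}$ with a union bound over all $|\phi(\mathcal{X})||\mathcal{A}|$ abstract states,

$$t > \frac{R_{max}}{1-\gamma}\sqrt{\frac{1}{2 n_\phi(D)} \log\frac{2|\phi(\mathcal{X})||\mathcal{A}|}{\delta}}.$$

Combine to get

$$\big\|Q^*_{\mathcal{M}_\theta} - [Q^*_{\bar{\mathcal{M}}^D_{\hat\theta}}]_{\mathcal{M}_\theta}\big\|_\infty \le \epsilon_R + \gamma(\epsilon_T + \epsilon_\theta)\frac{R_{max}}{2(1-\gamma)} + \frac{R_{max}}{(1-\gamma)^2}\sqrt{\frac{1}{2 n_\phi(D)} \log\frac{2|\phi(\mathcal{X})||\mathcal{A}|}{\delta}}.$$

Meta-RL Algorithm. The meta-RL algorithm for the HiP-MDP setting can be found in Algorithm 3. We take the PEARL algorithm (Rakelly et al., 2019) and incorporate our HiP-MDP objective (text shown in red color).

D.1 BASELINES

For PCGrad, the authors recommend projecting the gradient with respect to all previous tasks. In practice, that leads to very poor training. Instead, we observe that it is better to project the gradients with respect to any one task (randomly selected per update). We use this scheme in all the experiments. For GradNorm, we observe that the learned weights $w_i$ (for weighing per-task loss) can become negative for some tasks (which means the model tries to unlearn those tasks). In practice, we clamp the $w_i$ values so that they do not become smaller than a threshold.

| Hyperparameter | Value |
| … | $10^{-3}$ |
| $\alpha_\psi$ (i.e., $\alpha$ for $\psi$) for Finger-Spin-V0 | 0.1 |
| $\alpha_\psi$ (i.e., $\alpha$ for $\psi$) for Cheetah-Run-V0 | 0.1 |
| $\alpha_\psi$ (i.e., $\alpha$ for $\psi$) for Walker-Run-V0 | 1.0 |
| $\alpha_\psi$ (i.e., $\alpha$ for $\psi$) for Walker-Run-V1 | 0.01 |
| $\alpha_{distral}$ (i.e., $\alpha$ for Distral) for Finger-Spin-V0 | 0.05 |
| $\alpha_{distral}$ (i.e., $\alpha$ for Distral) for Cheetah-Run-V0 | 0.01 |
| $\alpha_{distral}$ (i.e., $\alpha$ for Distral) for Walker-Run-V0 | 0.05 |
| $\alpha_{distral}$ (i.e., $\alpha$ for Distral) for Walker-Run-V1 | 0.01 |
| $\beta_{distral}$ (i.e., $\beta$ for Distral) | 1.0 |

Table 1: A complete overview of used hyper parameters.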
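The two baseline modifications described in Section D.1 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: gradients are plain Python lists, each task's gradient is projected against one other task drawn at random, the projection is the standard PCGrad rule (remove the component along the other gradient only when the two conflict), and the clamping threshold `min_weight` is a hypothetical argument because the text does not state its value.

```python
import random


def dot(u, v):
    """Inner product of two equally sized gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(g_i, g_j):
    """PCGrad rule: if g_i conflicts with g_j (negative inner product),
    remove from g_i its component along g_j; otherwise return g_i unchanged."""
    d = dot(g_i, g_j)
    if d >= 0.0:
        return list(g_i)
    norm_sq = dot(g_j, g_j)  # strictly positive whenever d < 0
    return [a - d / norm_sq * b for a, b in zip(g_i, g_j)]


def pcgrad_one_random_task(task_grads, rng):
    """Project each task gradient against ONE randomly chosen other task
    (instead of all other tasks) and sum the projected gradients."""
    n = len(task_grads)
    if n < 2:
        raise ValueError("need at least two tasks")
    projected = []
    for i, g_i in enumerate(task_grads):
        j = rng.choice([k for k in range(n) if k != i])
        projected.append(project_if_conflicting(g_i, task_grads[j]))
    return [sum(column) for column in zip(*projected)]


def clamp_task_weights(weights, min_weight):
    """Keep GradNorm-style per-task loss weights from dropping below min_weight."""
    return [max(w, min_weight) for w in weights]


if __name__ == "__main__":
    grads = [[1.0, 0.0], [-1.0, 1.0]]  # two conflicting toy gradients
    print(pcgrad_one_random_task(grads, random.Random(0)))
    print(clamp_task_weights([-0.2, 0.5], min_weight=0.1))
```

With only two tasks the random choice is forced, which makes the toy example deterministic; with more tasks the selected partner changes from update to update.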

D.2.2 METARL ALGORITHM

For MetaRL, we use the same hyperparameters as PEARL (Rakelly et al., 2019). We set $\alpha_\phi = 0.01$ for all environments except the Walker-Stand environments, where $\alpha_\phi = 0.001$.

E ADDITIONAL RESULTS

Along with the environments described in Section 4, we considered the following additional environments:
1. Walker-Stand-V0: Walker-Stand task where the friction coefficient between the ground and the walker's leg varies across different environments.
2. Walker-Walk-V0: Walker-Walk task where the friction coefficient between the ground and the walker's leg varies across different environments.
3. Walker-Stand-V1: Walker-Stand task where the size of the walker's left foot varies across different environments.



We use this assumption only for theoretical results, but our method can be applied to continuous domains.
Any k-order MDP can be made Markov by stacking the previous k observations and actions together.
We overload notation here since the true state space is latent.
We again overload notation here to refer to the learned hyperparameters as $\theta$, as the true ones are latent.
This is not a restrictive assumption to make, as any distribution can be mapped to a Gaussian with an encoder of sufficient capacity.

EXPERIMENTS & RESULTS

We use environments from Deepmind Control Suite (DMC) (Tassa et al., 2018) to evaluate our method for learning HiP-BMDPs for both multi-task RL and meta-reinforcement learning settings.



Figure 1: Visualizations of the typical MTRL setting and the HiP-MDP setting.

Figure 2: Graphical model of HiP-BMDP setting (left). Flow diagram of learning a HiP-BMDP (right). Two environment ids are selected by permuting a randomly sampled batch of data from the replay buffer, and the loss objective requires computing the Wasserstein distance of the predicted next-step distribution for those states.
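The pairing step described in the caption of Figure 2 can be sketched as follows. The sketch assumes that the predicted next-step distributions are diagonal Gaussians, in which case the 2-Wasserstein distance has the closed form $W_2^2 = \|\mu_p - \mu_q\|_2^2 + \|\sigma_p - \sigma_q\|_2^2$. The function names are hypothetical, and the full loss of Equation (3) is not reproduced here.

```python
import math
import random


def w2_diag_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """2-Wasserstein distance between N(mu_p, diag(sigma_p^2)) and
    N(mu_q, diag(sigma_q^2)), using the closed form for diagonal Gaussians."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_p, sigma_q))
    return math.sqrt(mean_term + std_term)


def permuted_pairs(batch_size, rng):
    """Pair every batch index with an index drawn from a random permutation
    of the same batch, so paired samples usually come from different tasks."""
    perm = list(range(batch_size))
    rng.shuffle(perm)
    return list(zip(range(batch_size), perm))


def pairwise_transition_distances(mus, sigmas, rng):
    """Distance between predicted next-step distributions for each permuted pair."""
    return [
        w2_diag_gaussian(mus[i], sigmas[i], mus[j], sigmas[j])
        for i, j in permuted_pairs(len(mus), rng)
    ]


if __name__ == "__main__":
    # Hypothetical predicted means and standard deviations for a batch of three.
    mus = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
    sigmas = [[1.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    print(pairwise_transition_distances(mus, sigmas, random.Random(0)))
```

A random permutation can map an index to itself, in which case the pair contributes a distance of zero; this sketch does not filter out such pairs.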

Figure 3: Variation in Cheetah-Run-V0 tasks.

Figure 4: Multi-Task Setting. Zero-shot generalization performance on the extrapolation tasks. We see that our method, HiP-BMDP, performs best against all baselines across all environments. Note that the environment steps (on the x-axis) denote the environment steps for each task. Since we are training over four environments, the actual number of steps is approx. 3.2 million.

Figure 6: Average per-step model error (in latent space) after unrolling the transition model for 100 steps.

Figure 5: Few-shot generalization performance on the interpolation (2 left) and extrapolation (2 right) tasks.

Figure 7: Zero-shot generalization performance on the evaluation tasks in the MTRL setting with partial observability. HiP-BMDP (ours) consistently outperforms other baselines (left). We also show decreasing performance by HiP-BMDP as p increases (right). 10 seeds, 1 stderr shaded.

Let us consider an environment $\theta_i \in \mathcal{M}_E$, for which we can bound the RHS with

$$\sup_{s,a}\big|Q^*_{\hat\theta_i}(s,a) - Q^*_{\theta_i}(s,a)\big| \le \sup_{s,a}\big|r_{\hat\theta_i}(s,a) - r_{\theta_i}(s,a)\big| + \gamma \sup_{s,a}\Big|\mathbb{E}_{s' \sim T_{\hat\theta_i}(\cdot|s,a)} V^*_{\hat\theta_i}(s') - \mathbb{E}_{s'' \sim T_{\theta_i}(\cdot|s,a)} V^*_{\theta_i}(s'')\Big|$$

Since all environments in the family $\mathcal{M}_E$ share the same known reward function, the first term is 0.

ψ, M) = $L_i(\psi, M, i, t)$ using Equation (3)
3: $\psi \leftarrow \psi - \alpha_1 \nabla_{\theta_i} L$
4: $M \leftarrow M - \alpha_2 \nabla_{\theta_i} L$
5: end for

D ADDITIONAL IMPLEMENTATION DETAILS

In Figure 8, we show the variation in the left foot of the walker.

Figure 8: Variation in Walker (V1) across different tasks.

Algorithm 1 HiP-BMDP training for the Multi-task RL setting.
Require: Along with DeepMDP components (Actor, Critic, Dynamics Model ($M$)), an additional environment encoder $\psi$ to generate task-specific $\theta$ parameters.
1: for each timestep $t = 1..T$ do

Batches of data for the different tasks $\{T_i\}_{i=1...T}$ sampled from the Replay Buffer $D$, learning rates $\alpha_1$ and $\alpha_2$, index of the current task $i$, Transition Model $M$, environment encoder $\psi$.
1: for each batch of dataset $t = 1..T$, $t \neq i$ do

D.2 HYPER PARAMETERS

D.2.1 MTRL ALGORITHM

All the hyper parameters (for MTRL algorithm) are listed in Table 1.


Algorithm 3 HiP-MDP training for the meta-RL setting.
Require: Batch of training tasks $\{T_i\}_{i=1...T}$ from $p(T)$, learning rates $\alpha_1, \alpha_2, \alpha_3, \alpha_\phi$
1: Initialize replay buffers $B_i$ for each training task
2: while not done do
3:   for each $T_i$ do
4:     Initialize context $C_i = \{\}$
5:     for $k = 1, \ldots, K$ do
6:       Sample $z \sim q_\phi(z|C_i)$
7:       Gather data from $\pi_\theta(a|s, z)$ and add to $B_i$
8:       Update $C_i = \{(s_j, a_j, s'_j, r_j)\}_{j:1...N} \sim B_i$
9:     end for
10:   end for
11:   for step in training steps do
12:     for each $T_i$ do
13:       Sample context $C_i \sim S_c(B_i)$ and RL batch $b_i \sim B_i$
14:       Sample $z \sim q_\phi(z|C_i)$
15:       $L^i_{actor} = L_{actor}(b_i, z)$
16:       Sample a RL batch $b_j$ from any other task $j$
19:       Compute $L^i_{BiSim} = L_i(q, T, i, j)$ using Equation (3)
20:     end for
      end for
25: end while

4. Walker-Walk-V1: Walker-Walk task where the size of the walker's left foot varies across different environments.

E.1 MULTI-TASK SETTING

In Figure 10, we observe that the HiP-BMDP method consistently outperforms other baselines when evaluated on the interpolation environments (zero-shot transfer). As noted previously, the effectiveness of our proposed model cannot be attributed to task embeddings alone, as the HiP-BMDP-nobisim model uses the same architecture as the HiP-BMDP model but does not include the task bisimulation metric loss. We hypothesise that the Distral-Ensemble baseline behaves poorly because it cannot leverage a shared global dynamics model.

results on the interpolation setup and in Figure 14, we show the results on the extrapolation setup. In some environments (e.g., Walker-Walk-V1), the proposed approach (blue) converges faster to a threshold reward (green) than the baseline. In the other environments, the gains are quite small.

E.3 EVALUATING THE UNIVERSAL TRANSITION MODEL.

We investigate how well the transition model performs in an unseen environment by only adapting the task parameter $\theta$. We instantiate a new MDP, sampled from the family of MDPs, and use a behavior policy to collect transitions. These transitions are used to update only the $\theta$ parameter, and the transition model is evaluated by unrolling it for k steps. In Figures 11 and 12, we report the average per-step model error in latent space, averaged over 10 environments, for 5-step and 100-step unrolls respectively. While we expect both the proposed setup and the baseline setup to adapt to the new environment, we expect the proposed setup to adapt faster because it exploits the underlying structure. We indeed observe that for both 5-step and 100-step unrolls, the proposed HiP-BMDP model adapts much faster than the baseline HiP-BMDP-nobisim (Figures 11 and 12).

Walker-Stand-V0, Cheetah-Run-V0 (top row), Walker-Run-V1, Walker-Walk-V1 and Walker-Stand-V1 (bottom row) respectively. We note that for Walker-Walk-V1, the proposed approach (blue) converges faster to a threshold reward (green) than the baseline. In other environments, the gains are quite small.
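The evaluation protocol above (adapt only $\theta$ with the model frozen, then unroll for k steps and average the per-step latent error) can be sketched as follows. Everything here is a toy stand-in: `toy_model` is a hypothetical one-dimensional latent transition $z' = z + \theta a$, and the adaptation step is a grid search over candidate values, swapped in for whatever gradient-based update is actually used.

```python
import math


def unroll_model_error(model, theta, z0, actions, true_latents):
    """Average per-step L2 error in latent space when the model is unrolled
    open-loop from z0, feeding its own predictions back in."""
    z = list(z0)
    total = 0.0
    for a, z_true in zip(actions, true_latents):
        z = model(z, a, theta)
        total += math.sqrt(sum((p - q) ** 2 for p, q in zip(z, z_true)))
    return total / len(actions)


def adapt_theta(model, candidates, transitions):
    """Pick the candidate theta with the lowest one-step squared error on
    collected (z, a, z_next) transitions; the model itself stays fixed."""
    def one_step_loss(theta):
        loss = 0.0
        for z, a, z_next in transitions:
            pred = model(z, a, theta)
            loss += sum((p - q) ** 2 for p, q in zip(pred, z_next))
        return loss
    return min(candidates, key=one_step_loss)


def toy_model(z, a, theta):
    """Hypothetical latent dynamics used only for this illustration."""
    return [zi + theta * a for zi in z]


if __name__ == "__main__":
    true_theta, actions = 0.5, [1.0, 1.0, 1.0, 1.0]
    # Ground-truth latents generated by the toy dynamics with the true theta.
    z, true_latents, transitions = [0.0], [], []
    for a in actions:
        z_next = toy_model(z, a, true_theta)
        transitions.append((z, a, z_next))
        true_latents.append(z_next)
        z = z_next
    theta_hat = adapt_theta(toy_model, [0.0, 0.25, 0.5, 0.75], transitions)
    print(theta_hat, unroll_model_error(toy_model, theta_hat, [0.0], actions, true_latents))
```

Because the rollout is open-loop, errors compound with the horizon, which is why a 100-step unroll is a much stricter test of the adapted parameter than a 5-step one.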

