PERFORMANCE BOUNDS FOR MODEL AND POLICY TRANSFER IN HIDDEN-PARAMETER MDPS

Abstract

In the Hidden-Parameter MDP (HiP-MDP) framework, a family of reinforcement learning tasks is generated by varying hidden parameters specifying the dynamics and reward function for each individual task. The HiP-MDP is a natural model for families of tasks in which meta- and lifelong-reinforcement learning approaches can succeed. Given a learned context encoder that infers the hidden parameters from previous experience, most existing algorithms fall into two categories: model transfer and policy transfer, depending on which function the hidden parameters are used to parameterize. We characterize the robustness of model and policy transfer algorithms with respect to hidden parameter estimation error. We first show that the value function of HiP-MDPs is Lipschitz continuous under certain conditions. We then derive regret bounds for both settings through the lens of Lipschitz continuity. Finally, we empirically corroborate our theoretical analysis by varying the hyper-parameters governing the Lipschitz constants of two continuous control problems; the resulting performance is consistent with our theoretical results.



Hidden-parameter Markov Decision Processes (HiP-MDPs) (Doshi-Velez & Konidaris, 2016) describe a family of related tasks by modeling task variations with a set of low-dimensional hidden parameters. HiP-MDPs are widely used in recent meta-Reinforcement Learning (RL) and lifelong RL works, in which an agent needs to quickly adapt to new tasks by transferring knowledge from previous tasks. To solve HiP-MDPs, most existing algorithms first infer the hidden parameters from previous experience (e.g. by learning a context encoder that maps trajectories to a hidden parameter estimate $\hat{\theta}$), then use the estimated hidden parameters to solve new tasks with either model transfer (Killian et al., 2017; Lee et al., 2020; Fu et al., 2022) or policy transfer (Yao et al., 2018; Rakelly et al., 2019) algorithms. As shown in Figure 1, in model transfer, the inferred hidden parameters are used to build a simulator of the environment, which can then be used for planning. In policy transfer, the inferred hidden parameters are used to parameterize the policy of the new task directly. Previous works have observed mixed empirical evidence for when each approach performs better (Yao et al., 2018; Lee et al., 2020; Fu et al., 2022). In this work, we take a step in the theoretical direction by studying the regret bounds of model and policy transfer algorithms, which helps characterize the robustness of these algorithms.

Two main factors affect the performance of HiP-MDP algorithms: (1) the estimation accuracy of the hidden parameter used as an input to either the learned model or policy, and (2) the quality of the learned model or policy (i.e., given an accurately estimated hidden parameter, whether the agent can reach the optimal performance). In practice, the HiP-MDP tasks considered in most recent meta-RL papers (e.g., changing the physical properties of MuJoCo-simulated robots) are relatively easy to solve when the hidden parameters are given.
We therefore assume the learned model and policy are optimal and theoretically analyze how hidden parameter estimation error affects the final performance of model and policy transfer algorithms. Our contributions are as follows. We first derive the conditions under which the value function of HiP-MDPs is Lipschitz continuous. We then derive upper bounds for regret in model and policy transfer algorithms, respectively, when the estimation error of the hidden parameters is bounded. We also give an upper bound for multi-step prediction error in model transfer. We further prove that the derived bounds are tight when the transition dynamics, reward function and policy are linear and deterministic. As far as we are aware, these are the first theoretical results about policy and model transfer performance bounds in HiP-MDPs. Given the same hidden parameter estimation error bound, our results characterize when model or policy transfer can be more robust than the other: a slower increase in the Lipschitz constant with respect to the hidden parameter implies more robustness. In addition to theoretical analysis, we empirically study the performance of model and policy transfer algorithms in two continuous control domains. For each domain, we control the Lipschitz constants of the HiP-MDPs by altering the hyper-parameters of the environments. The results are consistent with our theoretical understanding of how the regrets of model and policy transfer algorithms scale with respect to the estimation error of the hidden parameters: a slower increase in Lipschitz constant with respect to the hidden parameter implies a smaller performance decay.

2. BACKGROUND AND RELATED WORK

HiP-MDPs model the variations in the transition dynamics and reward functions by assigning each task a hidden parameter θ, drawn from the distribution $P_\Omega$. The agent neither observes θ nor has access to the distribution $P_\Omega$ that generates the task family. For a given task, parameterized by θ ∈ Θ, the stochastic dynamics are given by $T(s'|s, a; \theta)$ and the deterministic reward function by $R(s, a; \theta)$. We consider continuous state and action spaces in this work (s ∈ S, a ∈ A). Upon encountering a new task, the agent estimates the new dynamics, $\hat{T}$, and the new reward function, $\hat{R}$, by inferring a distribution $\hat{\omega}(\theta)$ over the hidden parameter. If our estimation of θ is accurate, then $\hat{\omega}(\theta)$ should be close to the true distribution $\omega(\theta)$, i.e. peaking at the true θ for this specific task. We assume the learned dependency of T and R on θ is accurate. Thus, if the estimation of θ is also accurate, the agent can solve the task completely.

The HiP-MDP is an important setting widely used in recent meta/lifelong RL papers (see Appendix G). A number of approaches have been proposed to infer the hidden parameter θ of HiP-MDPs in the latent representation space, such as using Bayesian models to leverage prior knowledge (Killian et al., 2017; Yao et al., 2018; Fu et al., 2022), or training a context encoder that maps trajectories to the latent parameter (Rakelly et al., 2019; Zintgraf et al., 2020; Lee et al., 2020). Given the inferred hidden parameters, various RL algorithms have been used to adapt to the new tasks. These downstream algorithms include training an off-policy actor-critic that takes the estimated parameter as an additional input (Rakelly et al., 2019; Fakoor et al., 2020; Fu et al., 2021), leveraging recurrent neural networks (Zintgraf et al., 2020; Duan et al., 2016), and planning with a learned transition model that also takes the inferred parameter as input (Lee et al., 2020; Mendonca et al., 2020).
Yet, in practice, there is no clear winner among these algorithms, leaving open the question of which factors affect the performance of different algorithms, and how. Nair & Doshi-Velez study a similar problem but in a different setting where the hidden parameters are given (contextual MDP) and derive sample complexity bounds for model-based learning. In this work, we quantify how errors and uncertainty in estimating θ affect the performance of model and policy transfer methods through the lens of Lipschitz continuity. Given a distance metric $d_M$ on a space $M$, Lipschitz continuity quantifies the smoothness of a function as follows.

Definition 2.1. A function $f : M_1 \to M_2$ is uniformly Lipschitz continuous if

$$K^{M_1,M_2}_f := \sup_{x_1, x_2} \frac{d_{M_2}(f(x_1), f(x_2))}{d_{M_1}(x_1, x_2)} \tag{1}$$

is finite. [...] can be implied from the context.

In the following theoretical analysis, we assume the HiP-MDPs are Lipschitz. Intuitively, this implies that when the hidden parameters of two MDPs are close, their transition dynamics and reward functions are similar, and that for a given task, transition dynamics and rewards do not change much under small changes in the states and actions. Similar assumptions have commonly been made in previous works (Hinderer, 2005; Asadi et al., 2018; Gottesman et al., 2021; Gelada et al., 2019; Ren et al., 2019; Luo et al., 2019). We extend the smoothness assumption to the hidden parameters to better understand the role of hidden parameter estimation error in the final performance. In addition to the smoothness assumptions about the HiP-MDPs, we make the following assumptions about the smoothness of policies.

Definition 2.3. A policy $\pi(a|s, \omega(\theta))$ is $(K^S_\pi, K^\Omega_\pi)$-Lipschitz continuous if

$$K^S_\pi := \sup_{\omega} \sup_{s_1, s_2} \frac{W(\pi(\cdot|s_1, \omega(\theta)), \pi(\cdot|s_2, \omega(\theta)))}{d_S(s_1, s_2)}, \qquad K^\Omega_\pi := \sup_{s} \sup_{\omega_1, \omega_2} \frac{W(\pi(\cdot|s, \omega_1(\theta)), \pi(\cdot|s, \omega_2(\theta)))}{W(\omega_1(\theta), \omega_2(\theta))}$$

are finite.
We evaluate policies by calculating their expected discounted reward given a distribution of initial states, µ(s): $J^\mu_\pi := \int_s V^\pi_g(s, \omega(\theta))\, \mu(s)\, ds$.

Lipschitz continuity in MDPs has been studied by many previous works (Hinderer, 2005; Rachelson & Lagoudakis, 2010; Tang et al., 2020; Gottesman et al., 2021). Pirotta et al. (2015) propose to use the Lipschitz continuity property of MDPs to speed up policy gradient algorithms. Asadi et al. (2018) show that the performance of model-based RL algorithms is influenced by the magnitude of the Lipschitz constant of the transition model. Gelada et al. (2019) leverage Lipschitz continuity assumptions in regular MDPs to learn state abstractions. However, none of them conduct theoretical analysis on HiP-MDPs.
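As a rough illustration of Definition 2.1, the supremum can be probed numerically by taking the largest ratio over a finite set of sampled pairs. The sketch below is not part of the paper's method: the helper name `empirical_lipschitz`, the toy reward, and the sample grid are all assumptions made for illustration. Because only finitely many pairs are checked, the result is a lower estimate of the true constant.

```python
import itertools
import math


def empirical_lipschitz(f, points, d_in, d_out):
    """Largest observed ratio d_out(f(x1), f(x2)) / d_in(x1, x2) over all pairs."""
    best = 0.0
    for x1, x2 in itertools.combinations(points, 2):
        gap = d_in(x1, x2)
        if gap == 0.0:
            continue  # identical inputs carry no information about the slope
        best = max(best, d_out(f(x1), f(x2)) / gap)
    return best


def abs_dist(u, v):
    """Absolute-difference metric on the real line."""
    return abs(u - v)


# Hypothetical 1-D reward R(s; theta) = -|s - theta|, viewed as a function of theta
# at the fixed state s = 0.5. Its sensitivity to theta plays the role of K^Omega_R.
grid = [i / 10 for i in range(-20, 21)]
k_reward_theta = empirical_lipschitz(lambda th: -abs(0.5 - th), grid, abs_dist, abs_dist)
print(k_reward_theta)
```

The same helper can be pointed at a state or action argument instead of θ to probe the corresponding state or action constants of a simulator, under whatever metric is considered appropriate for that space.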

3. LIPSCHITZ VALUE FUNCTION OF HIP-MDPS

Our first main result establishes the conditions under which the value functions of HiP-MDPs are Lipschitz continuous. We show this through a combination of Theorem 3.1 and Theorems 4.3 and 5.2 in the following sections. Theorem 3.1 studies Lipschitz continuity with respect to S and A while the hidden parameter is fixed. This can be seen as a generalization of theorems derived in prior papers (Rachelson & Lagoudakis, 2010; Pirotta et al., 2015; Asadi et al., 2018; Gelada et al., 2019) that study Lipschitz value functions for a regular MDP.

Theorem 3.1. For a $(K^S_R, K^S_T, K^A_R, K^A_T, K^\Omega_R, K^\Omega_T)$-Lipschitz HiP-MDP with a $(K^S_\pi, K^\Omega_\pi)$-Lipschitz policy π, if $\gamma(K^S_T + K^S_\pi K^A_T) < 1$, the value function is Lipschitz continuous with respect to S and A with constants bounded by:

$$K^A_{Q^\pi} := \sup_{\omega} \sup_{s} \sup_{a_1, a_2} \frac{|Q(s, a_1, \omega) - Q(s, a_2, \omega)|}{d_A(a_1, a_2)} \le \frac{K^A_R - \gamma (K^A_R K^S_T - K^S_R K^A_T)}{1 - \gamma (K^S_T + K^S_\pi K^A_T)},$$

$$K^S_{Q^\pi} := \sup_{\omega} \sup_{a} \sup_{s_1, s_2} \frac{|Q(s_1, a, \omega) - Q(s_2, a, \omega)|}{d_S(s_1, s_2)} \le \frac{K^S_R - \gamma K^S_\pi (K^S_R K^A_T - K^A_R K^S_T)}{1 - \gamma (K^S_T + K^S_\pi K^A_T)},$$

$$K^S_{V^\pi} := \sup_{\omega} \sup_{a} \sup_{s_1, s_2} \frac{|V(s_1, \omega) - V(s_2, \omega)|}{d_S(s_1, s_2)} \le K^S_{Q^\pi} + K^A_{Q^\pi} K^S_\pi.$$

Proof (sketch). We derive the value difference bound by decomposing it into a combination of the Q-value differences with respect to S and A, using our previous definitions and the Bellman update rule (9). We then obtain the bounds for $K^A_{Q^\pi}$ and $K^S_{Q^\pi}$ by applying the dual form of the Wasserstein metric (Equation 2) and fixed-point iteration. See Appendix E for all proofs.

The above theorem shows the conditions under which the value functions of HiP-MDPs are Lipschitz continuous with respect to states and actions. We get the following corollary when the policy is optimal:

Corollary 3.2. If the policy π is optimal and $\gamma K^S_T < 1$, then:

$$K^S_{Q^*} \le \frac{K^S_R}{1 - \gamma K^S_T}, \qquad K^A_{Q^*} \le \frac{K^A_R + \gamma K^S_R K^A_T - \gamma K^S_T K^A_R}{1 - \gamma K^S_T}.$$
Note that in contrast to $K^S_{Q^*}$, $K^A_{Q^*}$ is affected by the smoothness of the transition dynamics and reward functions with respect to the action space ($K^A_T$ and $K^A_R$). Theorem 3.1 and Corollary 3.2 also directly apply to a $(K^S_R, K^S_T, K^A_R, K^A_T)$-Lipschitz regular MDP with a $K^S_\pi$-Lipschitz policy, as the Lipschitz continuity assumptions with respect to hidden parameters are used in neither the final results nor the proofs. Specifically, the upper bound for $K^S_{Q^*}$ in Corollary 3.2 recovers the results in Asadi et al. (2018); Gelada et al. (2019). If we assume the Lipschitz constants for state and action are the same, we get Corollary E.1, which recovers the results in Rachelson & Lagoudakis (2010); Pirotta et al. (2015). For regular MDPs, our results are more general than prior work because (1) we do not assume that the Lipschitz constants for state and action are the same, which is usually not the case in practice; and (2) our theory applies to all Lipschitz policies rather than only the optimal policy.

One of our major goals in this work is to derive the conditions for Lipschitz value functions of HiP-MDPs. Theorem 3.1 shows that the value functions in HiP-MDPs are Lipschitz continuous with respect to S and A under certain conditions. To complete the theorem for Lipschitz value functions in HiP-MDPs, we must also explore the conditions under which the value function is Lipschitz continuous with respect to the remaining input of the Q-function in HiP-MDPs, ω. As we show in the next two sections, exploring this results in different bounds for policy transfer and model transfer. By leveraging the results on Lipschitz value functions, we then determine how the estimation error $d(\hat{\omega}(\theta), \omega(\theta)) \le \epsilon$ of the hidden parameter at test time affects the performance of model and policy transfer algorithms, respectively. We quantify the performance gap by deriving an upper bound on the regret as a function of the error in estimating the hidden parameter.
Note that throughout the following analysis, we abstract out the discussion of how ω(θ) is estimated, and assume it is given with an estimation error bounded by ϵ.
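The closed-form bounds of Theorem 3.1 and Corollary 3.2 are straightforward to evaluate numerically. The following is a minimal sketch that transcribes them as stated above; the function names and the example constants are hypothetical and are not taken from the paper's code. Algebraically, setting $K^S_\pi = 0$ in the Theorem 3.1 expressions reproduces the Corollary 3.2 expressions, which makes a convenient sanity check on the transcription.

```python
import math


def q_lipschitz_bounds(k_r_s, k_t_s, k_r_a, k_t_a, k_pi_s, gamma):
    """Upper bounds on (K^S_Q, K^A_Q, K^S_V) as stated in Theorem 3.1."""
    contraction = gamma * (k_t_s + k_pi_s * k_t_a)
    if contraction >= 1.0:
        raise ValueError("Theorem 3.1 needs gamma * (K^S_T + K^S_pi * K^A_T) < 1")
    denom = 1.0 - contraction
    k_q_a = (k_r_a - gamma * (k_r_a * k_t_s - k_r_s * k_t_a)) / denom
    k_q_s = (k_r_s - gamma * k_pi_s * (k_r_s * k_t_a - k_r_a * k_t_s)) / denom
    k_v_s = k_q_s + k_q_a * k_pi_s
    return k_q_s, k_q_a, k_v_s


def optimal_q_lipschitz_bounds(k_r_s, k_t_s, k_r_a, k_t_a, gamma):
    """Upper bounds on (K^S_Q*, K^A_Q*) as stated in Corollary 3.2."""
    if gamma * k_t_s >= 1.0:
        raise ValueError("Corollary 3.2 needs gamma * K^S_T < 1")
    denom = 1.0 - gamma * k_t_s
    k_q_s = k_r_s / denom
    k_q_a = (k_r_a + gamma * k_r_s * k_t_a - gamma * k_t_s * k_r_a) / denom
    return k_q_s, k_q_a


# Hypothetical constants, chosen only so that the contraction condition holds.
print(q_lipschitz_bounds(k_r_s=2.0, k_t_s=0.5, k_r_a=1.0, k_t_a=0.3, k_pi_s=0.5, gamma=0.9))
print(optimal_q_lipschitz_bounds(k_r_s=2.0, k_t_s=0.5, k_r_a=1.0, k_t_a=0.3, gamma=0.9))
```

Raising an error when the contraction condition fails mirrors the theorem's hypothesis: outside that regime the denominators are non-positive and the expressions no longer represent bounds.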

4. PERFORMANCE BOUNDS OF MODEL TRANSFER IN HIP-MDPS

We now focus on the regret of model transfer in HiP-MDPs induced by the hidden parameter estimation error, $d(\hat{\omega}(\theta), \omega(\theta)) \le \epsilon$. We assume that the agent has learnt an accurate transition function $\hat{T}_g(s'|s, a, \omega(\theta))$ and reward function $\hat{R}_g(s, a, \omega(\theta))$. Upon encountering a new task, the agent first estimates the hidden parameter and then feeds it into the learned transition function $\hat{T}_g$ and reward function $\hat{R}_g$. The agent then uses them to generate samples and train a task-specific policy. The algorithm framework can be found in Algorithm 2 and Figure 1. Note that a similar paradigm has been widely used (Lee et al., 2020; Mendonca et al., 2020) in complex continuous control tasks. We first derive a bound on the compounding error of dynamics prediction given the estimation error ϵ in the hidden parameter at test time, under the Lipschitz model class (Asadi et al., 2018) assumption (we include a definition in Appendix E):

Theorem 4.1. Assume the estimation error for the hidden parameter θ is bounded by ϵ, that is, $W(\hat{\omega}(\theta), \omega(\theta)) \le \epsilon$. Further assume an accurately learned $\hat{T}_g$ induced by a Lipschitz model class $F_g$ with Lipschitz constants $K^\Omega_T$ and $K^S_T$, a fixed sequence of actions $a_0, \dots, a_{n-1}$, and an initial state distribution µ(s). Then for all n ≥ 1, the n-step prediction error ξ(n) satisfies

$$\xi(n) := W\big(\hat{T}^n_g(\cdot|\mu, \hat{\omega}),\, T^n_g(\cdot|\mu, \omega)\big) \le K^\Omega_T\, \epsilon \sum_{i=0}^{n-1} (K^S_T)^i,$$

where $\hat{T}^n_g(\cdot|\mu, \hat{\omega}) := \hat{T}_g(\cdot|\hat{T}_g(\cdot|\dots \hat{T}_g(\cdot|\mu, a_0, \hat{\omega}) \dots, a_{n-2}, \hat{\omega}), a_{n-1}, \hat{\omega})$ is a generalized n-step transition function, and $T^n_g(\cdot|\mu, \omega)$ is defined similarly.

The result shows how the multi-step prediction error ξ of a generalized transition function in a HiP-MDP scales with the hidden parameter estimation error ϵ. The smoothness of the transition function with respect to the state also affects the multi-step prediction error through the geometric sum. As we show in the experiments, this multi-step prediction bound can have a large impact on the planning performance of model transfer.
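To make the compounding effect in Theorem 4.1 concrete, the sketch below rolls out a toy scalar system twice, once with the true hidden parameter and once with an estimate that is off by ϵ, and compares the resulting gap with the right-hand side of the bound. The dynamics `s' = k_s * s + a + k_theta * theta`, the constants, and the function names are assumptions chosen for illustration. For this linear, deterministic toy the state and hidden-parameter Lipschitz constants are `k_s` and `k_theta`, and the Wasserstein distance between two point masses reduces to the distance between the points.

```python
import math


def n_step_bound(k_t_omega, k_t_s, eps, n):
    """Right-hand side of Theorem 4.1: K^Omega_T * eps * sum_{i<n} (K^S_T)^i."""
    return k_t_omega * eps * sum(k_t_s ** i for i in range(n))


def rollout(s0, actions, theta, k_s, k_theta):
    """Toy linear, deterministic dynamics: s' = k_s * s + a + k_theta * theta."""
    s = s0
    for a in actions:
        s = k_s * s + a + k_theta * theta
    return s


k_s, k_theta, eps = 0.9, 2.0, 0.05  # hypothetical constants
actions = [0.3, -0.1, 0.4, 0.0, 0.2]  # fixed action sequence, as in the theorem
theta_true = 1.0

s_true = rollout(0.0, actions, theta_true, k_s, k_theta)
s_pred = rollout(0.0, actions, theta_true + eps, k_s, k_theta)

# Both rollouts are deterministic, so the n-step error is a plain absolute difference.
error = abs(s_pred - s_true)
bound = n_step_bound(k_theta, k_s, eps, len(actions))
print(error, bound)
```

In this toy the gap obeys the recursion `d_n = k_s * d_{n-1} + k_theta * eps` with `d_0 = 0`, so the error and the bound coincide up to floating-point rounding.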
Instead of planning directly, some methods further learn a policy using the learned model. We investigate the value estimation difference and regret of model transfer induced by the hidden parameter estimation error. Note that the remaining theoretical results in this paper do not involve the Lipschitz model class assumption.

Lemma 4.2. Given a HiP-MDP with learned $(K^\Omega_R, K^\Omega_T)$-Lipschitz transition model $\hat{T}$ and reward function $\hat{R}$, in model transfer, the generalized value function is Lipschitz continuous with respect to Ω with a constant bounded by:

$$K^\Omega_{V^\pi} \le K^\Omega_R + \gamma K^\Omega_{Q^\pi} + \gamma K^S_{Q^\pi} K^\Omega_T.$$

We obtain the above bound for the Lipschitz value function when assuming Lipschitz continuity of the transition and reward functions with respect to only the hidden parameter. The lemma shows the relationship between $K^\Omega_{V^\pi}$ and $K^\Omega_{Q^\pi}$ in the model transfer case. Then, by further assuming the Lipschitz continuity of the dynamics and policy in Lemma 9, we obtain the following bound by leveraging the dual form of the Wasserstein metric (Equation 2) and computing the fixed point of the recurrence.

Theorem 4.3. Given a HiP-MDP with learned $(K^S_R, K^S_T, K^A_R, K^A_T, K^\Omega_R, K^\Omega_T)$-Lipschitz transition dynamics $\hat{T}$ and reward function $\hat{R}$, in model transfer, if $\gamma(K^S_T + K^S_\pi K^A_T) < 1$, the generalized value function corresponding to the $K^S_\pi$-Lipschitz policy π(a|s) is Lipschitz continuous with respect to Ω with a constant bounded by:

$$K^\Omega_{V^\pi} \le \frac{K^\Omega_R + \gamma K^S_{Q^\pi} K^\Omega_T}{1 - \gamma}.$$

If we further use the bounds derived in Theorem 3.1 to substitute $K^S_{Q^\pi}$ (for Theorem 5.2 we substitute $K^A_{Q^\pi}$), we get a bound on $K^\Omega_{V^\pi}$ that does not depend on $K_{Q^\pi}$. Compared to Lemma 4.2, the bound derived in Theorem 4.3 does not depend on $K^\Omega_{Q^\pi}$, which is achieved by leveraging the Bellman equation (the details can be found in our proofs). In general, Theorem 4.3 characterizes how the value estimation difference in model transfer is affected by the robustness of the reward and transition functions to the hidden parameter.
Theorems 4.3 and 3.1 together show that the value function of a HiP-MDP for model transfer is Lipschitz continuous under certain conditions. We can then derive the regret of model transfer given hidden parameter estimation error ϵ. Here we further assume the agent obtains the optimal policy in the "simulated environment" created by the learned transition and reward functions. We first introduce the following lemma:

Lemma 4.4. Given a $(K^S_R, K^S_T, K^\Omega_R, K^\Omega_T)$-Lipschitz HiP-MDP, for the optimal policy $\pi^*$,

$$|V^{\pi^*}(s, \omega_1(\theta)) - V^{\pi^*}(s, \omega_2(\theta))| \le \max_{a \in A} |Q^{\pi^*}(s, a, \omega_1(\theta)) - Q^{\pi^*}(s, a, \omega_2(\theta))|.$$

Given the optimal policy, this lemma relates the value estimation difference to the Q-value estimation difference. Combining this with the results in Corollary 3.2, we have:

Lemma 4.5. Given a HiP-MDP with learned $(K^S_R, K^S_T, K^\Omega_R, K^\Omega_T)$-Lipschitz transition model $\hat{T}$ and reward function $\hat{R}$, and the optimal policy $\hat{\pi}^*(a|s)$ for $\hat{T}$ and $\hat{R}$, starting from state distribution µ(s), if $\gamma K^S_T < 1$, the regret of model transfer given hidden parameter estimation error ϵ is bounded by:

$$|J^\mu_{\pi^*} - J^\mu_{\hat{\pi}^*}| \le \frac{K^\Omega_R}{1 - \gamma} \cdot \epsilon + \frac{\gamma K^S_R K^\Omega_T}{(1 - \gamma)(1 - \gamma K^S_T)} \cdot \epsilon.$$

Besides the smoothness of the reward and transition functions with respect to the hidden parameter θ, the derived bound also shows that the distance between the expected return of the learned policy $\hat{\pi}^*$ and that of the environment's true optimal policy increases with $K^S_R$ and $K^S_T$, given the hidden parameter estimation error ϵ. In other words, the performance of model transfer algorithms decreases as the sensitivity of the dynamics to changes in state increases, which aligns with the intuition that model-based methods need to accurately predict state transitions and expected rewards given states. Furthermore, for the theorems derived in this section, we show that (proofs included in the appendix):

Claim 4.6. Given a linear and deterministic transition function, the bound derived in Theorem 4.1 is tight.
Given a linear and deterministic transition function and reward function, Lemma 4.5 is tight.
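The coefficient in Lemma 4.5 can be read as the result of plugging the Corollary 3.2 bound on $K^S_{Q^*}$ into an expression of the form given in Theorem 4.3. The following is only informal bookkeeping of constants under that reading; the complete argument, including the treatment of the optimal policy through Lemma 4.4, is the one in the appendix.

```latex
\begin{align*}
K^{\Omega}_{V^{*}}
  &\le \frac{K^{\Omega}_{R} + \gamma K^{S}_{Q^{*}} K^{\Omega}_{T}}{1-\gamma}
  && \text{(form of Theorem 4.3)} \\
  &\le \frac{1}{1-\gamma}\left(K^{\Omega}_{R}
       + \frac{\gamma K^{S}_{R} K^{\Omega}_{T}}{1-\gamma K^{S}_{T}}\right)
  && \text{(Corollary 3.2, } \gamma K^{S}_{T} < 1\text{)} \\
  &= \frac{K^{\Omega}_{R}}{1-\gamma}
     + \frac{\gamma K^{S}_{R} K^{\Omega}_{T}}{(1-\gamma)(1-\gamma K^{S}_{T})}.
\end{align*}
```

Multiplying the last line by the estimation error ϵ gives the right-hand side of Lemma 4.5. The first term is driven only by the reward's sensitivity to the hidden parameter, while the second term couples the transition's sensitivity to the hidden parameter with the state sensitivities $K^S_R$ and $K^S_T$.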

5. PERFORMANCE BOUNDS OF POLICY TRANSFER IN HIP-MDPS

We now focus on the regret bound induced by policy transfer algorithms in HiP-MDPs. We assume the error bound of the hidden parameters is the same (i.e., $d(\hat{\omega}(\theta), \omega(\theta)) \le \epsilon$), and that the agent has learnt an accurate joint policy $\pi(a|s, \omega(\theta))$ during training. Upon encountering a new task, the agent feeds the estimated hidden parameter into the learned policy instead of the model. The agent then directly uses the joint policy to interact with the environment. The algorithm framework can be found in Algorithm 1 and Figure 1. A similar policy transfer paradigm has also been widely used (Yao et al., 2018; Rakelly et al., 2019). We investigate how the estimation error for the hidden parameter affects the performance of the joint policy on a new task and compare with the performance bounds derived above for the model transfer case.

Lemma 5.1. Given a HiP-MDP with learned $K^\Omega_\pi$-Lipschitz joint policy $\pi(a|s, \omega(\theta))$, in policy transfer, the generalized value function is Lipschitz continuous with respect to Ω with a constant bounded by:

$$K^\Omega_{V^\pi} \le K^\Omega_{Q^\pi} + K^A_{Q^\pi} K^\Omega_\pi.$$

We obtain the above bound for the Lipschitz value function when assuming the Lipschitz continuity of the learned policy with respect to only the hidden parameter. The lemma shows the relationship between $K^\Omega_{V^\pi}$ and $K^\Omega_{Q^\pi}$ in the policy transfer case. Then, similar to model transfer, if we further assume the Lipschitz continuity of the dynamics and the policy and compute the fixed point of the recurrence of the Bellman update, we get the following bound:

Theorem 5.2. Given a $(K^A_R, K^A_T, K^S_R, K^S_T)$-Lipschitz HiP-MDP, in policy transfer, the generalized value function corresponding to the pretrained $(K^S_\pi, K^\Omega_\pi)$-Lipschitz joint policy $\pi(a|s, \omega(\theta))$, if $\gamma(K^S_T + K^S_\pi K^A_T) < 1$, is Lipschitz continuous with respect to Ω with a constant bounded by

$$K^\Omega_{V^\pi} \le \frac{K^\Omega_\pi K^A_{Q^\pi}}{1 - \gamma}.$$
Compared to Lemma 5.1, the bound derived in Theorem 5.2 does not depend on $K^\Omega_{Q^\pi}$, which is achieved by leveraging the Bellman equation. Theorem 5.2 characterizes how the value estimation difference in policy transfer is affected by the policy's robustness to the hidden parameter. Theorems 5.2 and 3.1 together show that the value function of a HiP-MDP for policy transfer is Lipschitz continuous under certain conditions. In contrast to the bound in Theorem 4.3 for model transfer, for policy transfer methods this value estimation error is affected by the policy's robustness with respect to the hidden parameters, as well as by the Q-value's robustness to the actions (Theorem 4.3 involves only robustness with respect to states), with no dependence on the smoothness of the reward and transition functions with respect to the hidden parameters.

We can further derive the Lipschitz constant of the expected discounted reward (regret) starting from the state distribution µ. As in the model transfer case, if we also assume the learned joint policy over ω(θ) is optimal, then by leveraging the results in Corollary 3.2, we have:

Lemma 5.3. Given a $(K^A_R, K^A_T, K^S_R, K^S_T)$-Lipschitz HiP-MDP and a pretrained $K^\Omega_\pi$-Lipschitz optimal joint policy $\pi(a|s, \omega(\theta))$, starting from state distribution µ(s), if $\gamma K^S_T < 1$, the regret of policy transfer induced by hidden parameter estimation error ϵ is bounded by:

$$|J^\mu_{\pi^*} - J^\mu_{\hat{\pi}}| \le \frac{\gamma K^\Omega_\pi (K^A_R + \gamma K^S_R K^A_T - \gamma K^S_T K^A_R)}{(1 - \gamma)(1 - \gamma K^S_T)} \cdot \epsilon.$$

Comparing the derived performance bound with the one in Lemma 4.5, we find that the distance between the expected return of the learned joint policy $\hat{\pi}$ and that of the environment's true optimal policy $\pi^*$ increases with $K^A_T$ and $K^A_R$, in addition to $K^S_T$ and $K^S_R$. In other words, the performance of policy transfer decreases as the sensitivity of the dynamics to changes in both states and actions increases, while the model-based method is far less sensitive to changes of the dynamics in actions.
This result is reasonable, as a policy-based method is influenced mostly by the direct impact of actions. Regarding the tightness of the derived bound, we get similar results as for model transfer:

Claim 5.4. Given a linear and deterministic transition function, reward function and policy, the bounds derived in Lemma 5.3 are tight.

Summary. Besides separately characterizing the robustness of model and policy transfer methods with respect to hidden parameter estimation error, Lemmas 4.5 and 5.3 also imply that policy transfer methods are expected to be more robust when different actions have similar effects, whereas model transfer methods are expected to perform better when neighboring states have relatively similar dynamics. One direct practical implication of these bounds for HiP-MDP algorithms is that we can infer the performance trend of policy transfer and model transfer algorithms by estimating, either qualitatively or quantitatively, the sensitivity of the dynamics and reward functions to states and actions. This can further help determine which method is likely to be more advantageous for a specific HiP-MDP problem. Note that $K^\Omega_T$ and $K^\Omega_\pi$ can in general depend on the model class (architecture, e.g. neural networks) used to parameterize the policy or model.

6. EMPIRICAL EVALUATION

We now corroborate our theoretical results in relatively large scenarios where neural networks are needed to parameterize models and policies. We evaluate model and policy transfer methods on two continuous scenarios, ball-goal and ball-wind, where we can quantitatively estimate the influence of changing environment hyperparameters on the different Lipschitz constants. By investigating whether the change in empirical performance is consistent with what we expect from the theoretical results, we can check whether our derived theorems match what happens in practice.

For both scenarios, the goal of the agent is to control a ball to quickly reach a target position. In ball-goal, the angle of the goal direction is the hidden parameter. That is, the goal direction changes across tasks, and the agent obtains maximum cumulative reward if it figures out the right angle for the current task and keeps moving in that direction until reaching the goal. In ball-wind, the goal position is fixed across tasks but there is wind across the plane with different directions. The angle of the wind direction is the hidden parameter in this scenario. The reward function consists of a dense reward proportional to how much closer the ball moves towards the goal compared to the last step, a control cost, and a larger final reward for reaching the goal. Thus for ball-goal, only the reward function changes with the hidden parameter, while in ball-wind, both the reward and transition functions change. More details about the environments can be found in Appendix B.

For each HiP-MDP, we manually calculate approximate values of the different Lipschitz constants given the actual reward and transition functions for both scenarios. We further calculate the bounds derived in Lemma 4.5 and Lemma 5.3 using the results. The Lipschitz constants are given in Table 1.
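The calculation of the two regret coefficients from the individual Lipschitz constants can be sketched as follows. The functions transcribe the right-hand sides of Lemma 4.5 and Lemma 5.3, and the ball-goal constants are the ones listed in Table 1; the function names, the dictionary layout, and the values chosen for γ, v and g are assumptions made for illustration.

```python
import math


def k_j_model(k_r_omega, k_t_omega, k_r_s, k_t_s, gamma):
    """Coefficient of eps in Lemma 4.5 (model transfer)."""
    assert gamma * k_t_s < 1.0, "Lemma 4.5 needs gamma * K^S_T < 1"
    return (k_r_omega / (1.0 - gamma)
            + gamma * k_r_s * k_t_omega / ((1.0 - gamma) * (1.0 - gamma * k_t_s)))


def k_j_policy(k_pi_omega, k_r_a, k_t_a, k_r_s, k_t_s, gamma):
    """Coefficient of eps in Lemma 5.3 (policy transfer)."""
    assert gamma * k_t_s < 1.0, "Lemma 5.3 needs gamma * K^S_T < 1"
    numerator = gamma * k_pi_omega * (k_r_a + gamma * k_r_s * k_t_a - gamma * k_t_s * k_r_a)
    return numerator / ((1.0 - gamma) * (1.0 - gamma * k_t_s))


def ball_goal_constants(v, g):
    """Ball-goal row of Table 1, as a dict keyed by hypothetical names."""
    return dict(k_t_s=1.0, k_t_a=v, k_t_omega=0.0,
                k_r_s=2.0, k_r_a=v, k_r_omega=2.0 * g, k_pi_omega=g)


gamma = 0.9  # hypothetical discount factor
for v, g in [(0.1, 1.0), (0.2, 1.0), (0.1, 2.0)]:
    c = ball_goal_constants(v, g)
    model = k_j_model(c["k_r_omega"], c["k_t_omega"], c["k_r_s"], c["k_t_s"], gamma)
    policy = k_j_policy(c["k_pi_omega"], c["k_r_a"], c["k_t_a"],
                        c["k_r_s"], c["k_t_s"], gamma)
    print(f"v={v} g={g}  K_J(model)={model:.2f}  K_J(policy)={policy:.2f}")
```

For ball-goal the model transfer coefficient does not involve the step size v because the transition function does not depend on the hidden parameter, whereas the policy transfer coefficient scales with the product of v and g.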
We mainly investigate the effect of the step size v, goal distance g, and state accelerator m (ball-wind). If our derived bounds are reasonably tight and close to what happens in practice, the results indicate how the different Lipschitz constants change as we vary these environment hyper-parameters.

Env       | K^S_T | K^A_T | K^Ω_T | K^S_R | K^A_R | K^Ω_R | K^Ω_π                   | K_J (Model transfer) | K_J (Policy transfer)
ball-goal | 1     | v     | 0     | 2     | v     | 2g    | g                       | 2g/(1-γ)             | γ(1+γ)vg/(1-γ)^2
ball-wind | m     | 1     | v     | m+1   | 1     | v     | √2v/(√2v-m(g-1)+g)      | (v+γ)/((1-γ)(1-γm))  | K^Ω_π • γ(1+γ)/((1-γm)(1-γ))

Table 1: Lipschitz constants for ball-goal and ball-wind.

-K_J vs. v and g: how -K_J is expected to change based on our theoretical results in Table 1. Each plot in the first row is associated with two plots below with corresponding color. The performance of the algorithm is expected to decrease if -K_J decreases.

6.1 EMPIRICAL RESULTS CORRESPONDING TO LEMMA 4.5 AND LEMMA 5.3

Both Lemma 4.5 and Lemma 5.3 can be written in the form $|J_\pi - J_{\pi^*}| \le K_J \cdot \epsilon$. We therefore investigate how $K_J$ changes as we vary different environment hyperparameters in two ways. Theoretically, we manually compute the Lipschitz constants given the environment's dynamics (Appendix B) and the two lemmas (results shown in Table 1). Empirically, we investigate how the algorithm's performance changes as we alter the environment's hyperparameters ($K_J$ is expected to be inversely related to the expected return). We then check whether the theoretical and empirical results are consistent. For empirical training, we use the state-of-the-art meta-RL algorithm PEARL (Rakelly et al., 2019) as the policy transfer method to learn the joint policy, and CaDM (Lee et al., 2020) to learn the dynamics model. The context encoder (probabilistic encoder) is updated with the loss from Q-value prediction as in PEARL, together with the loss from dynamics model prediction as in CaDM.
Given a new task, we let the agent collect transitions by interacting with the environment for only one episode with a randomly initialized hidden-parameter distribution. Then the agent uses the learned context encoder to infer a distribution over the hidden parameter of the current task. This distribution over the hidden parameter is fixed afterwards for the current task, i.e. ϵ is fixed and the same for both methods. Then, for policy transfer, we sample from the hidden parameter distribution and directly feed the inferred latent parameter into the learned joint policy. We evaluate the average return of this policy on the current task. For model transfer, we feed the inferred latent parameter into the learned transition model and reward function. We use the models to generate samples and run SAC (Haarnoja et al., 2018) to learn the task-specific policy for this "simulated environment".

The empirical results are shown in Figure 2. We plot $-K_J$ vs. v/g using the results in Table 1. Recall that $K_J$ is expected to be inversely related to the average return: an increase in $-K_J$ implies a decrease in the regret of the algorithm, so the average return is expected to increase if our theoretical results match what happens in practice. Almost all of the results show that the change of the expected return is inversely related to the change of the approximated Lipschitz constant $K_J$. The environment parameters directly affect $K_J$ and the trend of the performance change in a way that is highly consistent with our theoretical analysis. In general, we find that a slower increase in the Lipschitz constant with respect to the hidden parameter implies a smaller performance decay. The empirical results imply that for real-world HiP-MDP tasks, if we can quantitatively estimate the Lipschitz constant $K_J$ for model and policy transfer methods given the environment parameters, we can estimate which one is expected to have better final performance.
For instance, in ball-wind, if g is quite large, given the computed $K_J$, we expect model transfer to perform better, as it is less affected by the value of g than the policy transfer method. Note that in our scenarios, the performance drop is mainly induced by the estimation error of the hidden parameters, and the learned dynamics and policies are near optimal. We show the empirical evidence in Appendix C.

We also show how the bound on the multi-step prediction error derived in Theorem 4.1 is related to the performance of another planning method used in model transfer, Model Predictive Control (MPC) (Garcia et al., 1989). MPC is widely used in recent deep model-based RL works (Chua et al., 2018; Nagabandi et al., 2019). Unlike the previous approaches, MPC does not need to explicitly estimate the value function, and its performance is directly affected by the multi-step prediction error. With the results in Table 1, we can get the multi-step prediction upper bound for ball-wind using Theorem 4.1:

$$\xi(n) \le v \sum_{i=0}^{n-1} m^i \cdot \epsilon.$$

The bound implies that if the theorem matches what happens in practice, the n-step prediction error will increase as we increase v and m, and thus the performance is expected to drop as we increase v and m. (The performance of MPC and the multi-step prediction error are inversely correlated.) As in the previous subsection, we investigate whether the empirical performance of MPC matches our predictions from the theoretical results. During empirical evaluation, we first let the agent infer the hidden parameter and feed it into the learned transition function and reward function. Then at each time step, we let the agent randomly sample a large number of actions, use the learned model to predict N steps into the future for each of them, and choose the action with the highest predicted cumulative reward. We investigate how v and m affect the algorithm's final performance empirically.
The results are shown in Figure 3: the average return of the algorithm on test tasks decreases as we increase the values of $v$ and $m$, which is consistent with our theoretical result. We also find that the variance of the performance (the width of the error bars) increases as we use a longer planning horizon for MPC.
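The ball-wind bound above can be checked numerically in the deterministic case. The following is a minimal sketch, not the experimental code: it assumes the hidden parameter is estimated as a point value with error $\epsilon = |\hat{\theta} - \theta|$, so that the Wasserstein distance reduces to a Euclidean distance between predicted states. The function names and the sample action sequence are illustrative assumptions.

```python
import math

def ball_wind_rollout(start, actions, theta, m, v):
    """Apply s' = m*s + (cos a, sin a) - v*(cos theta, sin theta) once per action."""
    x, y = start
    for a in actions:
        x = m * x + math.cos(a) - v * math.cos(theta)
        y = m * y + math.sin(a) - v * math.sin(theta)
    return (x, y)

def prediction_error(start, actions, theta, eps, m, v):
    """Distance between rollouts under the true and the misestimated wind direction."""
    true_state = ball_wind_rollout(start, actions, theta, m, v)
    predicted_state = ball_wind_rollout(start, actions, theta + eps, m, v)
    return math.dist(true_state, predicted_state)

def multi_step_bound(n, eps, m, v):
    """Right-hand side of the ball-wind bound: v * eps * sum_{i=0}^{n-1} m^i."""
    return v * eps * sum(m ** i for i in range(n))

if __name__ == "__main__":
    actions = [0.3, -1.2, 2.0, 0.7, 1.5]  # arbitrary fixed action sequence (assumption)
    for m, v in [(0.5, 0.5), (1.0, 1.0), (1.5, 2.0)]:
        err = prediction_error((0.0, 0.0), actions, theta=0.8, eps=0.1, m=m, v=v)
        print(m, v, err, multi_step_bound(len(actions), 0.1, m, v))
```

In this setting the error equals $2 v |\sin(\epsilon/2)| \sum_{i=0}^{n-1} m^i$, which stays just below the bound and grows with both $v$ and $m$.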

7. CONCLUSION

Assuming the learned model/policy is optimal, we investigated from a theoretical perspective how the hidden parameter estimation error affects the robustness of model and policy transfer algorithms. We show the conditions under which the value functions of HiP-MDPs are Lipschitz continuous. We further derive regret bounds for model and policy transfer, which are proved to be tight in linear and deterministic cases. Our empirical results are consistent with the theoretical results and indicate that a faster increase in the Lipschitz constant with respect to the hidden parameter implies a larger performance decay. We note that in real-world HiP-MDP problems, especially when applying deep RL algorithms, the suboptimality of the learned model/policy can still play an important role in the potential performance drop on a new task, independently of the estimation error of the hidden parameter. In many cases, policy transfer performs better than model transfer simply because the downstream model-free deep RL algorithms are consistently better in regular MDPs.

A MODEL TRANSFER AND POLICY TRANSFER

Algorithm 1: Policy transfer
Input: Test tasks $\{\tau_i\}_{i=1:K}$ with hidden parameters $\theta_i \sim P_\Omega$
for each task $\tau_i$ do
    Roll out policy $\pi(a|s, \omega(\theta))$ with randomly generated Gaussian $\omega(\theta)$ to collect data $c = \{s_1, a_1, r_1, \cdots, s_n, a_n, r_n\}$
    Infer the distribution $\omega(\theta)$ over the hidden parameter of task $\tau_i$ with the learned encoder $q(\omega(\theta)|c)$
    Roll out policy $\pi(a|s, \omega(\theta))$ to interact with the environment (evaluation)
end for

Algorithm 2: Model transfer
Input: Test tasks $\{\tau_i\}_{i=1:K}$ with hidden parameters $\theta_i \sim P_\Omega$
for each task $\tau_i$ do
    Roll out policy $\pi(a|s, \omega(\theta))$ with randomly generated Gaussian $\omega(\theta)$ to collect data $c = \{s_1, a_1, r_1, \cdots, s_n, a_n, r_n\}$
    Infer the distribution $\omega(\theta)$ over the hidden parameter of task $\tau_i$ with the learned encoder $q(\omega(\theta)|c)$
    Plan with the learned transition function $T_g(s'|s, a, \omega(\theta))$ and reward function $R_g(s, a, \omega(\theta))$ (evaluation)
end for

Empirically, in typical deep meta-RL settings (Finn et al., 2017; Rakelly et al., 2019), the true values of the hidden parameters are not known to the agent during either the training phase or the evaluation phase. We follow this setting in our experiments and train a context encoder, as in previous approaches, to infer the hidden parameter in a latent representation space. The true distribution over the hidden parameter is then also assumed to be implicitly mapped into the latent representation space. Our experiments suggest that our theory extends to hidden parameter estimation in the latent space: the results show patterns consistent with the theoretical results.
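The control flow of Algorithms 1 and 2 can be sketched as follows. This is only an illustration under simplifying assumptions: `env_step`, `policy`, `encoder`, and `planner` are hypothetical callables standing in for the environment, the joint policy, the context encoder, and a planner built on the learned transition and reward models; the latent is treated as a single value rather than a distribution.

```python
def collect_context(env_step, policy, latent, start_state, horizon):
    """Roll out `policy` for one episode and record (state, action, reward) tuples."""
    context, state = [], start_state
    for _ in range(horizon):
        action = policy(state, latent)
        next_state, reward = env_step(state, action)
        context.append((state, action, reward))
        state = next_state
    return context

def episode_return(env_step, act, start_state, horizon):
    """Undiscounted return of the state-to-action map `act` over one episode."""
    total, state = 0.0, start_state
    for _ in range(horizon):
        state, reward = env_step(state, act(state))
        total += reward
    return total

def policy_transfer(env_step, policy, encoder, prior_latent, start_state, horizon):
    """Algorithm 1 (sketch): the inferred latent conditions the joint policy directly."""
    context = collect_context(env_step, policy, prior_latent, start_state, horizon)
    latent = encoder(context)
    return episode_return(env_step, lambda s: policy(s, latent), start_state, horizon)

def model_transfer(env_step, policy, encoder, planner, prior_latent, start_state, horizon):
    """Algorithm 2 (sketch): the inferred latent conditions the learned models used for planning."""
    context = collect_context(env_step, policy, prior_latent, start_state, horizon)
    latent = encoder(context)
    act = planner(latent)  # hypothetical: returns a state-to-action map planned on the learned models
    return episode_return(env_step, act, start_state, horizon)
```

Both procedures share the data collection and inference steps and differ only in which learned function receives the inferred hidden parameter.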

B EXPERIMENT SETTINGS

Ball-goal: The state space consists of the ball's x, y coordinates. The action space consists of the ball's x, y velocity. $v$ controls the step size in the direction of the x axis in the transition function. The reward is proportional to how far the agent has moved towards the goal's position $(g\cos\phi_g, g\sin\phi_g)$. We fix the values of $v$ and $g$ and use $\phi_g$ as the hidden parameter to create HiP-MDPs. Details are shown below:
• State space: $\{s_x, s_y\}$
• Action space: $\{a_x, a_y\}$
• Transition function: $s'_x = s_x + a_x \cdot v$, $s'_y = s_y + a_x$
• Reward function: $R = \sqrt{(s_x - g\cos\phi_g)^2 + (s_y - g\sin\phi_g)^2} - \sqrt{(s'_x - g\cos\phi_g)^2 + (s'_y - g\sin\phi_g)^2}$

Ball-wind: The state space consists of the ball's x, y coordinates. The action (one dimension) describes the direction of the agent's next move. In the transition function, $m$ describes the value of the state accelerator, $v$ controls the step size of the wind, and $\theta$ describes the direction of the wind. The reward is proportional to how far the agent has moved towards the goal's position $(g\cos\phi_g, g\sin\phi_g)$, plus a goal reward on success and a control penalty ($-0.5$). We fix the values of $v$, $g$, $m$, $\phi_g$ and use $\theta$ as the hidden parameter to create HiP-MDPs. Details are shown below:
• State space: $\{s_x, s_y\}$
• Action space: $\{a\}$
• Transition function: $s'_x = m s_x + \cos a - v\cos\theta$, $s'_y = m s_y + \sin a - v\sin\theta$
• Reward function: $R = \sqrt{(s_x - g\cos\phi_g)^2 + (s_y - g\sin\phi_g)^2} - \sqrt{(s'_x - g\cos\phi_g)^2 + (s'_y - g\sin\phi_g)^2} + \mathbb{1}\{\sqrt{(s'_x - g\cos\phi_g)^2 + (s'_y - g\sin\phi_g)^2} \le 0.1\} \cdot 20 - 0.5$

When computing the Lipschitz constants for Ball-wind, we only consider the distance-difference part of the reward function for ease of calculation. We further show that in these two scenarios, the performance drop as we change environment parameters is mainly induced by the estimation error of the hidden parameters, and the learned transition/reward functions and policies are near optimal.
As shown in Figure 5, when we let the agent interact with the environments for more trajectories and keep using the collected data to infer the hidden parameter, changing the environment's Lipschitz constant affects the performance much less than it does when the estimate of the hidden parameter is far less accurate.
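For concreteness, the Ball-wind dynamics and reward from the experiment settings can be written as a small environment sketch. This is an illustrative reading of the description, not the authors' implementation: it assumes the reward uses Euclidean distances to the goal, and the constant names are hypothetical.

```python
import math

GOAL_BONUS = 20.0       # goal reward on success
GOAL_RADIUS = 0.1       # success threshold on the distance to the goal
CONTROL_PENALTY = 0.5   # constant per-step control penalty

def ball_wind_transition(state, action, theta, m, v):
    """s'_x = m*s_x + cos(a) - v*cos(theta), s'_y = m*s_y + sin(a) - v*sin(theta)."""
    sx, sy = state
    return (m * sx + math.cos(action) - v * math.cos(theta),
            m * sy + math.sin(action) - v * math.sin(theta))

def ball_wind_reward(state, next_state, g, phi_g):
    """Progress towards the goal, plus a bonus inside the goal radius, minus the control penalty."""
    goal = (g * math.cos(phi_g), g * math.sin(phi_g))
    before = math.dist(state, goal)
    after = math.dist(next_state, goal)
    bonus = GOAL_BONUS if after <= GOAL_RADIUS else 0.0
    return before - after + bonus - CONTROL_PENALTY
```

Here `theta` plays the role of the hidden parameter, while `m`, `v`, `g`, and `phi_g` are the fixed environment parameters that are varied across experiments.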

D COMPARISON OF THEOREM 3.1 WITH PREVIOUS RESULTS

Assumption 1. The policy is optimal.
Assumption 2. S and A share the same Lipschitz constant.

| Theorems | Value difference bound | Assumption 1 | Assumption 2 |
|---|---|---|---|
| Theorem 3.1 | $\frac{K^S_R + K^A_R K^S_\pi}{1 - \gamma(K^S_T + K^A_T K^S_\pi)}$ | | |
| (Asadi et al., 2018; Gelada et al., 2019) | $\frac{K^S_R}{1 - \gamma K^S_T}$ | ✓ | |
| (Rachelson & Lagoudakis, 2010; Pirotta et al., 2015) | $\frac{K^S_R (1 + K^S_\pi)}{1 - \gamma K^S_T (1 + K^S_\pi)}$ | | ✓ |

Table 2: Comparison of Theorem 3.1 with previous results in regular MDPs

Theorem 3.1 and Corollary 3.2 also directly apply to a $(K^S_R, K^S_T, K^A_R, K^A_T)$-Lipschitz regular MDP with a $K^S_\pi$-Lipschitz policy, as the Lipschitz continuity assumptions with respect to hidden parameters are used in neither the final results nor the proofs. In this sense, we show a comparison of Theorem 3.1 with previous results in regular MDPs in Table 2. Note that in Theorem 3.1, we have:

$$K^S_{V^\pi} \le K^S_{Q^\pi} + K^A_{Q^\pi} K^S_\pi \le \frac{K^S_R + K^A_R K^S_\pi}{1 - \gamma(K^S_T + K^A_T K^S_\pi)}$$

Specifically, the upper bound for $K^S_{Q^*}$ in Corollary 3.2 recovers the results in (Asadi et al., 2018; Gelada et al., 2019). If we assume the Lipschitz constants for state and action are the same, we get Corollary E.1, which recovers the results in (Rachelson & Lagoudakis, 2010; Pirotta et al., 2015). For regular MDPs, our results are more general than prior works because (1) we do not assume that the Lipschitz constants for state and action are the same, which is usually not the case in practice; and (2) our theory applies to all Lipschitz policies rather than only the optimal policy.
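The relation between the Theorem 3.1 bounds and the table entries can be checked numerically. The sketch below, with illustrative function names and arbitrary constants, evaluates the closed-form bounds and also iterates the recurrence from the proof of Theorem 3.1, $K^A_{Q,n+1} \le K^A_R + \gamma K^A_T (K^S_{Q,n} + K^A_{Q,n} K^S_\pi)$ and $K^S_{Q,n+1} \le K^S_R + \gamma K^S_T (K^S_{Q,n} + K^A_{Q,n} K^S_\pi)$, treating the inequalities as equalities to find the fixed point.

```python
def closed_form_bounds(k_s_r, k_a_r, k_s_t, k_a_t, k_s_pi, gamma):
    """Closed-form bounds of Theorem 3.1: returns (K_Q^A, K_Q^S, K_V^S)."""
    contraction = gamma * (k_s_t + k_s_pi * k_a_t)
    if contraction >= 1.0:
        raise ValueError("requires gamma * (K_T^S + K_pi^S * K_T^A) < 1")
    denom = 1.0 - contraction
    k_a_q = (k_a_r - gamma * (k_a_r * k_s_t - k_s_r * k_a_t)) / denom
    k_s_q = (k_s_r - gamma * k_s_pi * (k_s_r * k_a_t - k_a_r * k_s_t)) / denom
    return k_a_q, k_s_q, k_s_q + k_a_q * k_s_pi

def iterate_recurrence(k_s_r, k_a_r, k_s_t, k_a_t, k_s_pi, gamma, steps=500):
    """Fixed-point iteration of the recurrence, starting from zero constants."""
    k_a_q = k_s_q = 0.0
    for _ in range(steps):
        k_v = k_s_q + k_a_q * k_s_pi  # bound on K_V^S at the current iterate
        k_a_q, k_s_q = k_a_r + gamma * k_a_t * k_v, k_s_r + gamma * k_s_t * k_v
    return k_a_q, k_s_q, k_s_q + k_a_q * k_s_pi
```

Setting `k_s_pi = 0` reduces the state bound to $K^S_R / (1 - \gamma K^S_T)$, and setting equal state and action constants gives the third row of Table 2 for the value bound.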

E PROOFS AND MORE THEORY RESULTS

Using the generalized functions defined in Section 2, we can easily extend the Bellman update rule to HiP-MDPs:

$$Q_{n+1}(s, a, \omega(\theta)) \leftarrow R_g(s, a, \omega(\theta)) + \gamma \int T_g(s'|s, a, \omega(\theta))\, V^\pi_n(s', \omega(\theta))\, ds', \tag{9}$$

where $Q_{n+1}$ converges to $Q^*$ as $n \to \infty$.

Theorem 3.1. For a $(K^S_R, K^S_T, K^A_R, K^A_T, K^\Omega_R, K^\Omega_T)$-Lipschitz HiP-MDP with a $(K^S_\pi, K^\Omega_\pi)$-Lipschitz policy $\pi$, if $\gamma(K^S_T + K^S_\pi K^A_T) < 1$, the value function is Lipschitz continuous with respect to $S$ and $A$ with constants bounded by:

$$K^A_{Q^\pi} := \sup_\omega \sup_s \sup_{a_1,a_2} \frac{|Q(s, a_1, \omega) - Q(s, a_2, \omega)|}{d_A(a_1, a_2)} \le \frac{K^A_R - \gamma(K^A_R K^S_T - K^S_R K^A_T)}{1 - \gamma(K^S_T + K^S_\pi K^A_T)}$$

$$K^S_{Q^\pi} := \sup_\omega \sup_a \sup_{s_1,s_2} \frac{|Q(s_1, a, \omega) - Q(s_2, a, \omega)|}{d_S(s_1, s_2)} \le \frac{K^S_R - \gamma K^S_\pi (K^S_R K^A_T - K^A_R K^S_T)}{1 - \gamma(K^S_T + K^S_\pi K^A_T)}$$

$$K^S_{V^\pi} := \sup_\omega \sup_{s_1,s_2} \frac{|V(s_1, \omega) - V(s_2, \omega)|}{d_S(s_1, s_2)} \le K^S_{Q^\pi} + K^A_{Q^\pi} K^S_\pi.$$

Proof. The proof is mainly based on the Bellman update rule (9), the dual form of the Wasserstein metric (Equation 2), and fixed-point iteration. Recall that:

$$Q_{n+1}(s, a, \omega(\theta)) \leftarrow R_g(s, a, \omega(\theta)) + \gamma \int T_g(s'|s, a, \omega(\theta))\, V^\pi_n(s', \omega(\theta))\, ds'$$

Now let:

$$\begin{aligned}
K^A_{Q^\pi,n+1} &:= \sup_\omega \sup_s \sup_{a_1,a_2} \frac{|Q_{n+1}(s, a_1, \omega) - Q_{n+1}(s, a_2, \omega)|}{d_A(a_1, a_2)} \\
&\le \sup_\omega \sup_s \sup_{a_1,a_2} \frac{|R_g(s, a_1, \omega) - R_g(s, a_2, \omega)|}{d_A(a_1, a_2)} + \gamma \sup_\omega \sup_s \sup_{a_1,a_2} \frac{\left|\int_{s'} (T_g(s'|s, a_1, \omega) - T_g(s'|s, a_2, \omega))\, V^\pi_n(s', \omega)\, ds'\right|}{d_A(a_1, a_2)} \\
&= K^A_R + \gamma \sup_\omega \sup_s \sup_{a_1,a_2} K^S_{V^\pi,n} \frac{\left|\int_{s'} (T_g(s'|s, a_1, \omega) - T_g(s'|s, a_2, \omega))\, \frac{V^\pi_n(s', \omega)}{K^S_{V^\pi,n}}\, ds'\right|}{d_A(a_1, a_2)} \\
&\le K^A_R + \gamma \sup_\omega \sup_{f: K^S_f \le 1} \sup_s \sup_{a_1,a_2} K^S_{V^\pi,n} \frac{\left|\int_{s'} (T_g(s'|s, a_1, \omega) - T_g(s'|s, a_2, \omega))\, f(s', \omega)\, ds'\right|}{d_A(a_1, a_2)} \\
&\le K^A_R + \gamma K^S_{V^\pi,n} K^A_T
\end{aligned} \tag{10}$$

The last inequality holds according to Equation 2.
Similarly, we can get:

$$\begin{aligned}
K^S_{Q^\pi,n+1} &:= \sup_\omega \sup_a \sup_{s_1,s_2} \frac{|Q_{n+1}(s_1, a, \omega) - Q_{n+1}(s_2, a, \omega)|}{d_S(s_1, s_2)} \\
&\le \sup_\omega \sup_a \sup_{s_1,s_2} \frac{|R_g(s_1, a, \omega) - R_g(s_2, a, \omega)|}{d_S(s_1, s_2)} + \gamma \sup_\omega \sup_a \sup_{s_1,s_2} \frac{\left|\int_{s'} (T_g(s'|s_1, a, \omega) - T_g(s'|s_2, a, \omega))\, V^\pi_n(s', \omega)\, ds'\right|}{d_S(s_1, s_2)} \\
&= K^S_R + \gamma \sup_\omega \sup_a \sup_{s_1,s_2} K^S_{V^\pi,n} \frac{\left|\int_{s'} (T_g(s'|s_1, a, \omega) - T_g(s'|s_2, a, \omega))\, \frac{V^\pi_n(s', \omega)}{K^S_{V^\pi,n}}\, ds'\right|}{d_S(s_1, s_2)} \\
&\le K^S_R + \gamma \sup_\omega \sup_{f: K^S_f \le 1} \sup_a \sup_{s_1,s_2} K^S_{V^\pi,n} \frac{\left|\int_{s'} (T_g(s'|s_1, a, \omega) - T_g(s'|s_2, a, \omega))\, f(s', \omega)\, ds'\right|}{d_S(s_1, s_2)} \\
&\le K^S_R + \gamma K^S_{V^\pi,n} K^S_T
\end{aligned} \tag{11}$$

We need to further derive the upper bound of $K^S_{V^\pi}$:

$$\begin{aligned}
K^S_{V^\pi} &:= \sup_\omega \sup_{s_1,s_2} \frac{|V(s_1, \omega) - V(s_2, \omega)|}{d_S(s_1, s_2)} = \sup_\omega \sup_{s_1,s_2} \frac{|Q(s_1, \pi(s_1), \omega) - Q(s_2, \pi(s_2), \omega)|}{d_S(s_1, s_2)} \\
&\le \sup_\omega \sup_{s_1,s_2} \frac{|Q(s_1, \pi(s_1), \omega) - Q(s_1, \pi(s_2), \omega)|}{d_S(s_1, s_2)} + \sup_\omega \sup_{s_1,s_2} \frac{|Q(s_1, \pi(s_2), \omega) - Q(s_2, \pi(s_2), \omega)|}{d_S(s_1, s_2)} \\
&\le K^A_Q K^S_\pi + K^S_Q
\end{aligned} \tag{12}$$

Plugging Eqn 12 into Eqn 10 and Eqn 11, we get:

$$K^A_{Q,n+1} \le K^A_R + \gamma K^A_T (K^S_{Q,n} + K^A_{Q,n} K^S_\pi)$$

$$K^S_{Q,n+1} \le K^S_R + \gamma K^S_T (K^S_{Q,n} + K^A_{Q,n} K^S_\pi)$$

By computing the fixed point of the recurrence, we get:

$$K^A_{Q^\pi} = \lim_{n\to\infty} K^A_{Q,n+1} \le \frac{K^A_R - \gamma(K^A_R K^S_T - K^S_R K^A_T)}{1 - \gamma(K^S_T + K^S_\pi K^A_T)}, \qquad K^S_{Q^\pi} = \lim_{n\to\infty} K^S_{Q,n+1} \le \frac{K^S_R - \gamma K^S_\pi (K^S_R K^A_T - K^A_R K^S_T)}{1 - \gamma(K^S_T + K^S_\pi K^A_T)}.$$

Corollary E.1. If we assume the Lipschitz constants for state and action are the same ($K^A_R = K^S_R = K^{S,A}_R$, $K^A_T = K^S_T = K^{S,A}_T$):

$$|R_g(s_1, a_1, \omega) - R_g(s_2, a_2, \omega)| \le K^{S,A}_R\, d_{S,A}((s_1, a_1), (s_2, a_2))$$

$$W(T_g(\cdot|s_1, a_1, \omega), T_g(\cdot|s_2, a_2, \omega)) \le K^{S,A}_T\, d_{S,A}((s_1, a_1), (s_2, a_2)),$$

then:

$$K^{S,A}_{Q^\pi} = K^S_{Q^\pi} = K^A_{Q^\pi} \le \frac{K^{S,A}_R}{1 - \gamma K^{S,A}_T (1 + K^S_\pi)}$$

Proof. Let $K^A_R = K^S_R = K^{S,A}_R$ and $K^A_T = K^S_T = K^{S,A}_T$ in the bounds of Theorem 3.1.

Corollary 3.2.
If the policy $\pi$ is optimal and $\gamma K^S_T < 1$, then:

$$K^S_{Q^*} \le \frac{K^S_R}{1 - \gamma K^S_T}, \qquad K^A_{Q^*} \le \frac{K^A_R + \gamma K^S_R K^A_T - \gamma K^S_T K^A_R}{1 - \gamma K^S_T}.$$

Proof. When the policy is optimal, we have (Asadi et al., 2018; Gelada et al., 2019):

$$|V^{\pi^*}(s_1, \omega) - V^{\pi^*}(s_2, \omega)| = \left|\max_{a\in A} Q^{\pi^*}(s_1, a, \omega) - \max_{a\in A} Q^{\pi^*}(s_2, a, \omega)\right| \le \max_{a\in A} |Q^{\pi^*}(s_1, a, \omega) - Q^{\pi^*}(s_2, a, \omega)|$$

Thus,

$$K^S_V \le K^S_Q = \sup_\omega \sup_a \sup_{s_1,s_2} \frac{|Q(s_1, a, \omega) - Q(s_2, a, \omega)|}{d_S(s_1, s_2)} \tag{17}$$

Plugging Eqn 17 into Eqn 10 and Eqn 11, we get:

$$K^A_{Q,n+1} \le K^A_R + \gamma K^S_{Q,n} K^A_T \tag{18}$$

$$K^S_{Q,n+1} \le K^S_R + \gamma K^S_{Q,n} K^S_T \tag{19}$$

First computing the fixed point of the recurrence in Eqn 19, we get:

$$K^S_{Q^*} = \lim_{n\to\infty} K^S_{Q,n+1} \le \frac{K^S_R}{1 - \gamma K^S_T} \tag{20}$$

Now plugging Eqn 20 back into Eqn 18:

$$K^A_{Q^*} = \lim_{n\to\infty} K^A_{Q,n+1} \le K^A_R + \gamma \frac{K^S_R}{1 - \gamma K^S_T} K^A_T = \frac{K^A_R + \gamma K^S_R K^A_T - \gamma K^S_T K^A_R}{1 - \gamma K^S_T}$$

Definition E.2. Given a metric state space $(S, d_S)$, an action space $A$, and a metric hidden-parameter space $(\Theta, d_\Theta)$, we define $F_g$ as a collection of functions $F_g = \{f : S \times \Theta \to S\}$ distributed according to $g(f|a)$, where $a \in A$. We say that $(F^S_g, F^\Theta_g)$ is a Lipschitz model class in HiP-MDPs if $K^S_T := \sup_{f \in F^S_g} K^{S,S}_f$ and $K^\Theta_T := \sup_{f \in F^\Theta_g} K^{\Theta,S}_f$ are finite. The transition function associated with a Lipschitz model class can then be defined by:

$$T(s'|s, a, \theta) = \sum_f \mathbb{1}(f(s, \theta) = s')\, g(f|a),$$

and the generalized transition function associated with a Lipschitz model class is:

$$T_g(s'|s, a, \omega) = \int_\theta \sum_f \mathbb{1}(f(s, \theta) = s')\, g(f|a)\, \omega(\theta)\, d\theta$$

We introduce the following two lemmas regarding the Lipschitz model class (Asadi et al., 2018) (we also include a definition in the appendix):

Lemma E.3. (Asadi et al., 2018) A generalized transition function $T_g$ induced by a Lipschitz model class $F_g$ and fixed $(\theta, a)$ is Lipschitz with a constant:

$$K^\mu_T = K^{A,\Theta}_{W,W}(T_g) := \sup_{\theta\in\Theta} \sup_{a\in A} \sup_{\mu_1,\mu_2} \frac{W(T_g(\cdot|\mu_1, a, \theta), T_g(\cdot|\mu_2, a, \theta))}{W(\mu_1, \mu_2)} \le K^S_T$$

Lemma E.4.
A generalized transition function $T_g$ induced by a Lipschitz model class $F_g$ and fixed $(s, a)$ is Lipschitz with a constant:

$$K^\Omega_T = K^{A,S}_{W,W}(T_g) := \sup_{s\in S} \sup_{a\in A} \sup_{\omega_1,\omega_2} \frac{W(T_g(\cdot|s, a, \omega_1), T_g(\cdot|s, a, \omega_2))}{W(\omega_1, \omega_2)} \le K^\Theta_T \tag{27}$$

Intuitively, the above lemmas give a bound on the differences in distributions an agent transitions to, as a function of how different the distributions of states it transitions from are, or how different the distributions of the hidden parameters are. It is theoretically pleasing that these bounds are given as the Lipschitz constants for the corresponding differences between point distributions of either states or hidden parameters.

Proof of Lemma E.4.

$$\begin{aligned}
W(T_g(\cdot|s, a, \omega_1), T_g(\cdot|s, a, \omega_2)) &= \sup_{h: K_{d_S,\mathbb{R}}(h) \le 1} \int_{s'} (T_g(s'|s, a, \omega_1) - T_g(s'|s, a, \omega_2))\, h(s')\, ds' \\
&= \sup_{h: K_{d_S,\mathbb{R}}(h) \le 1} \int_{s'} \int_\theta T(s'|s, a, \theta)(\omega_1(\theta) - \omega_2(\theta))\, h(s')\, ds'\, d\theta \\
&= \sup_{h: K_{d_S,\mathbb{R}}(h) \le 1} \int_{s'} \int_\theta \sum_t g(t|a)\, \mathbb{1}(t(s, \theta) = s')(\omega_1(\theta) - \omega_2(\theta))\, h(s')\, ds'\, d\theta \\
&= \sup_{h: K_{d_S,\mathbb{R}}(h) \le 1} \sum_t g(t|a) \int_\theta \int_{s'} \mathbb{1}(t(s, \theta) = s')(\omega_1(\theta) - \omega_2(\theta))\, h(s')\, ds'\, d\theta \\
&= \sup_{h: K_{d_S,\mathbb{R}}(h) \le 1} \sum_t g(t|a) \int_\theta (\omega_1(\theta) - \omega_2(\theta))\, h(t(s, \theta))\, d\theta \\
&\le \sum_t g(t|a) \sup_{h: K_{d_S,\mathbb{R}}(h) \le 1} \int_\theta (\omega_1(\theta) - \omega_2(\theta))\, h(t(s, \theta))\, d\theta \\
&= K^\Theta_T \sum_t g(t|a) \sup_{h: K_{d_S,\mathbb{R}}(h) \le 1} \int_\theta (\omega_1(\theta) - \omega_2(\theta))\, \frac{h(t(s, \theta))}{K^\Theta_T}\, d\theta \\
&\le K^\Theta_T \sum_t g(t|a) \sup_{q: K_{d_S,\mathbb{R}}(q) \le 1} \int_\theta (\omega_1(\theta) - \omega_2(\theta))\, q(s, \theta)\, d\theta \\
&= K^\Theta_T \sum_t g(t|a)\, W(\omega_1, \omega_2) = K^\Theta_T\, W(\omega_1, \omega_2)
\end{aligned}$$

Using Lemma E.3, we can derive the bound on the compounding error of dynamics prediction given the estimation error $\epsilon$ in the hidden parameter at test time (Theorem 4.1).

Theorem 4.1. Assume that the estimation error for the hidden parameter $\theta$ is bounded by $\epsilon$, that is, $W(\hat{\omega}(\theta), \omega(\theta)) \le \epsilon$, and that we are given an accurately learned $T_g$ induced by a Lipschitz model class $F_g$ with Lipschitz constants $K^\Omega_T$ and $K^S_T$, a fixed sequence of actions $a_0, \cdots, a_{n-1}$, and a start state distribution $\mu$.
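Then $\forall n \ge 1$:

$$\xi(n) := W(T^n_g(\cdot|\mu, \hat{\omega}), T^n_g(\cdot|\mu, \omega)) \le K^\Omega_T\, \epsilon \sum_{i=0}^{n-1} (K^S_T)^i,$$

where $T^n_g(\cdot|\mu, \omega) := T_g(\cdot|T_g(\cdot|\ldots T_g(\cdot|\mu, a_0, \omega)\ldots, a_{n-2}, \omega), a_{n-1}, \omega)$, and $T^n_g(\cdot|\mu, \hat{\omega})$ is defined similarly.

Proof. We first derive the bound for the one-step prediction error given Lemma E.3.

$$\begin{aligned}
\xi(1) &:= W(T_g(\cdot|\mu, \hat{\omega}), T_g(\cdot|\mu, \omega)) := \sup_f \int\!\!\int (T_g(s'|s, a_0, \hat{\omega}) - T_g(s'|s, a_0, \omega))\, f(s')\, \mu(s)\, ds\, ds' \\
&\le \int \sup_f \int (T_g(s'|s, a_0, \hat{\omega}) - T_g(s'|s, a_0, \omega))\, f(s')\, ds'\, \mu(s)\, ds \\
&= \int W(T_g(\cdot|s, a_0, \hat{\omega}), T_g(\cdot|s, a_0, \omega))\, \mu(s)\, ds \\
&\le \int K^\Omega_T\, \epsilon\, \mu(s)\, ds = K^\Omega_T\, \epsilon
\end{aligned}$$

Then, we can prove the bound for $\xi(n)$:

$$\begin{aligned}
\xi(n) &:= W(T^n_g(\cdot|\mu, \hat{\omega}), T^n_g(\cdot|\mu, \omega)) \\
&\le W(T^n_g(\cdot|\mu, \hat{\omega}), T^n_g(\cdot|T^{n-1}_g(\cdot|\mu, \omega), \hat{\omega})) + W(T^n_g(\cdot|T^{n-1}_g(\cdot|\mu, \omega), \hat{\omega}), T^n_g(\cdot|\mu, \omega)) \\
&= W(T^n_g(\cdot|T^{n-1}_g(\cdot|\mu, \hat{\omega}), \hat{\omega}), T^n_g(\cdot|T^{n-1}_g(\cdot|\mu, \omega), \hat{\omega})) + W(T^n_g(\cdot|T^{n-1}_g(\cdot|\mu, \omega), \hat{\omega}), T^n_g(\cdot|T^{n-1}_g(\cdot|\mu, \omega), \omega)) \\
&\le K^S_T\, W(T^{n-1}_g(\cdot|\mu, \hat{\omega}), T^{n-1}_g(\cdot|\mu, \omega)) + K^\Omega_T\, \epsilon \\
&= K^S_T\, \xi(n-1) + K^\Omega_T\, \epsilon \\
&\le K^\Omega_T\, \epsilon \sum_{i=0}^{n-1} (K^S_T)^i
\end{aligned}$$

$$K^\Omega_{V^\pi} \le K^\Omega_R + \gamma K^\Omega_{Q^\pi} + \gamma K^S_{Q^\pi} K^\Omega_T.$$

Proof.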
Recall that: V π g (s) = R g (s, a, ω(θ)) + γ s ′ Tg (s ′ |s, a, ω(θ))Q π g (s, π(s), ω(θ))ds ′ Then we have: K Ω V π = sup s∈S sup ω1,ω2 |V π (s, ω 1 (θ)) -V π (s, ω 2 (θ))| d Ω (ω 1 , ω 2 ) ≤ sup s∈S sup ω1,ω2 | Rg (s, π(s), ω 1 (θ)) -Rg (s, π(s), ω 2 (θ))| d Ω (ω 1 , ω 2 ) +γ sup s∈S sup ω1,ω2 | s ′ 1 Tg (s ′ 1 |s, π(s), ω 1 (θ))Q π g (s ′ 1 , π(s), ω 1 (θ))ds ′ 1 -s ′ 2 Tg (s ′ 2 |s, π(s), ω 2 (θ))Q π g (s ′ 2 , π(s), ω 2 (θ))ds ′ 2 | d Ω (ω 1 , ω 2 ) = K Ω R + γ sup s∈S sup ω1,ω2 | s ′ 1 Tg (s ′ 1 |s, π(s), ω 1 )Q π g (s ′ 1 , π(s), ω 1 )ds ′ 1 -s ′ 2 Tg (s ′ 2 |s, π(s), ω 2 )Q π g (s ′ 2 , π(s), ω 2 )ds ′ 2 | d Ω (ω 1 , ω 2 ) By computing the limit of both sides: K Ω Q π = lim n→∞ K Q,n+1 ≤ lim n→∞ n i=0 γ i K Ω R + lim n→∞ n+1 i=1 γ i K S Q π • K Ω T + lim n→∞ γ n K Q,0 = K Ω R 1 -γ + γK S Q π • K Ω T 1 -γ + 0 From Lemma 4.2 we have: K Ω V π ≤ K Ω R + γK Ω Q π + γK S Q π K Ω T Thus, K Ω V π ≤ K Ω R + γK S Q π K Ω T 1 -γ Lemma 4.4. Given a (K S R , K S T , K Ω R , K Ω T )-Lipschitz HiP-MDP, for the optimal policy π * , |V π * (s, ω 1 (θ)) -V π * (s, ω 2 (θ))| ≤ max a∈A |Q π * (s, a, ω 1 (θ)) -Q π * (s, a, ω 2 (θ))|. Proof. |V π * (s, ω 1 (θ)) -V π * (s, ω 2 (θ))| = | max a∈A Q π * (s, a, ω 1 (θ)) -max a∈A Q π * (s, a, ω 2 (θ))| ≤ max a∈A |Q π * (s, a, ω 1 (θ)) -Q π * (s, a, ω 2 (θ))| Lemma 4.5.Given a HiP-MDP with learned (K S R, K S T , K Ω R , K Ω T )-Lipschitz transition model T , R, and the optimal policy π * (a|s) for T and R, starting from state distribution µ, the expected discounted reward difference induced by hidden parameter estimation error ϵ is bounded by: |J µ π * -J µ π * | ≤ K Ω R 1 -γ • ϵ + γK S RK Ω T (1 -γ)(1 -γK S T ) • ϵ, if γK S T < 1. Proof. Recall that: J µ π := s V π g (s, ω(θ))µ(s)ds. 
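We have:

$$\begin{aligned}
K^\Omega_J &= \sup_{\omega_1,\omega_2} \frac{\int_s |V^{\pi^*}(s, \omega_1(\theta)) - V^{\pi^*}(s, \omega_2(\theta))|\, \mu(s)\, ds}{d(\omega_1(\theta), \omega_2(\theta))} \\
&\le \sup_a \sup_{\omega_1,\omega_2} \frac{\int_s |Q^{\pi^*}(s, a, \omega_1(\theta)) - Q^{\pi^*}(s, a, \omega_2(\theta))|\, \mu(s)\, ds}{d(\omega_1(\theta), \omega_2(\theta))} \\
&\le K^\Omega_{Q^{\pi^*}} = \lim_{n\to\infty} K_{Q,n+1} \\
&\le \lim_{n\to\infty} \sum_{i=0}^{n} \gamma^i K^\Omega_R + \lim_{n\to\infty} \sum_{i=1}^{n+1} \gamma^i K^S_{Q^*} \cdot K^\Omega_T + \lim_{n\to\infty} \gamma^n K_{Q,0} \\
&= \frac{K^\Omega_R}{1 - \gamma} + \frac{\gamma K^S_{Q^*} \cdot K^\Omega_T}{1 - \gamma} + 0
\end{aligned}$$

From Corollary 3.2, we have:

$$K^S_{Q^*} \le \frac{K^S_R}{1 - \gamma K^S_T}$$

Thus,

$$K^\Omega_J \le \frac{K^\Omega_R (1 - \gamma K^S_T) + \gamma K^S_R K^\Omega_T}{(1 - \gamma)(1 - \gamma K^S_T)}$$

Claim 4.6. Given a linear and deterministic transition function and reward function, the bounds derived in Theorem 4.1 and Lemma 4.5 are tight.

Proof. Assume a linear transition function $T$ for a new task with hidden parameter $\theta$ defined as:

$$T(s, a) = As + Ba + C\theta,$$

and a reward function defined as:

$$R(s, a) = Ds + Ea + F\theta.$$

Assume there is an estimation error $\epsilon$ for $\theta$; then the transition function and reward function become:

$$\hat{T}(s, a) = As + Ba + C(\theta + \epsilon), \qquad \hat{R}(s, a) = Ds + Ea + F(\theta + \epsilon).$$

First observe that:

$$\forall s, a \quad |\hat{T}(s, a) - T(s, a)| = C\epsilon,$$

and that for $n = 2$:

$$\forall s \quad |T(T(s, a_0), a_1) - \hat{T}(\hat{T}(s, a_0), a_1)| = |A(C(\epsilon + \theta) - C\theta) + Ba_1 - Ba_1 + C(\epsilon + \theta) - C\theta| = |AC\epsilon + C\epsilon| = C\epsilon \sum_{i=0}^{1} A^i \tag{42}$$

More generally, for the $n$-step compounding error of dynamics prediction $T^n$ and $\hat{T}^n$, given a fixed sequence of actions $a_0, a_1, \cdots, a_n$:

$$\forall s \quad |T^n(s, a_n) - \hat{T}^n(s, a_n)| = C\epsilon \sum_{i=0}^{n} A^i$$

Thus, the bound in Theorem 4.1 is tight. Now consider the state $s = 0$ and an action space consisting of only one action $a = 0$ (thus the policy is optimal). We can calculate the value of $s$ predicted with $\hat{T}$ and $\hat{R}$:

$$\begin{aligned}
\hat{V}(s = 0) &= \hat{R}(0) + \gamma \hat{R}(0 + C(\theta + \epsilon)) + \gamma^2 \hat{R}(AC(\theta + \epsilon) + C(\theta + \epsilon)) + \cdots \\
&= F(\theta + \epsilon) \sum_{j=0}^{\infty} \gamma^j + DC(\theta + \epsilon) \sum_{n=1}^{\infty} \gamma^n \sum_{i=0}^{n-1} A^i \\
&= \frac{F(\theta + \epsilon)}{1 - \gamma} + \frac{DC\gamma(\theta + \epsilon)}{(1 - \gamma)(1 - \gamma A)} \\
&= \frac{(F(1 - \gamma A) + DC\gamma)(\theta + \epsilon)}{(1 - \gamma)(1 - \gamma A)}
\end{aligned}$$

Thus:

$$|\hat{V}(s = 0) - V(s = 0)| = \frac{F(1 - \gamma A) + DC\gamma}{(1 - \gamma)(1 - \gamma A)} \cdot \epsilon$$

The result matches the bound derived in Lemma 4.5.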
(In deterministic cases, we assume we directly estimate the value of θ, so K Ω T & K Ω R becomes K Θ T & K Θ R in this case and the Wasserstein distance becomes the L1 distance | θ -θ|.) Lemma 5.1. Given a HiP-MDP with learned K Ω π -Lipschitz joint policy π(a|s, ω(θ)), in Policy Transfer, the generalized value function with respect to Ω with a constant bounded by: K Ω V π ≤ K Ω Q π + K A Q π K Ω π . Proof. Given a new test task, recall that: V π(•|ω) g (s, ϕ) = Q π(•|ω) g (s, π(s, ω), ϕ) where we use ϕ to denote the true distribution ϕ(θ) over hidden parameter for the current task (which should be a Dirac δ-function in practice) of the environment. Then we have: K Ω V π = sup s sup ϕ sup ω1,ω2 |V π(•|ω1) g (s, ϕ) -V π(•|ω2) g (s, ϕ)| d(ω 1 , ω 2 ) ≤ sup s sup ϕ sup ω1,ω2 |Q π(•|ω1) g (s, π(s ′ , ω 1 ), ϕ) -Q π(•|ω2) g (s, π(s ′ , ω 2 ), ϕ)| d(ω 1 , ω 2 ) ≤ sup s sup ϕ sup ω1,ω2 |Q π(•|ω1) g (s, π(s, ω 1 ), ϕ(θ)) -Q π(•|ω2) g (s, π(s, ω 1 ), ϕ(θ))| d(ω 1 , ω 2 ) + |Q π(•|ω2) g (s, π(s, ω 1 ), ϕ(θ)) -Q π(•|ω2) g (s, π(s, ω 2 ), ϕ(θ))| d(ω 1 , ω 2 ) ≤ sup s sup ϕ sup ω1,ω2 K Ω Q π d(ω 1 , ω 2 ) d(ω 1 , ω 2 ) + K A Q π d(π(ω 1 ), π(ω 2 ))ds ′ d(ω 1 , ω 2 ) ≤ K Ω Q π + K A Q π K Ω π d(ω 1 , ω 2 ) d(ω 1 , ω 2 ) ≤ K Ω Q π + K A Q π K Ω π Theorem 5.2. Given a (K A R , K A T , K S R , K S T )-Lipschitz HiP-MDP, in Policy Transfer, the generalized value function corresponding to the pretrained (K S π , K Ω π )-Lipschitz joint policy π(a|s, ω(θ)), if γ(K S T + K S π K A T ) < 1, is Lipschitz continuous with respect to Ω with a constant bounded by K Ω V π ≤ K Ω π K A Q π 1 -γ . ( ) Proof. Similar to model transfer, if we further assume the Lipschitz continuity of dynamics & policy in Lemma 5.1, we can get the result bound leveraging the dual form of Wasserstein Metric (Equation 2), triangle inequality, and computing the fixed point of recurrence. 
Recall that: Q(s, π(s, ω), ϕ) ← R g (s, π(s, ω), ϕ) + γ Tg (s ′ |s, π(s, ω), ϕ)V π (s ′ , ϕ)ds ′ K Ω Q π = sup s sup ϕ sup ω1,ω2 |Q π(•|ω1) n+1 (s, π(s, ω 1 ), ϕ) -Q π(•|ω2) n+1 (s, π(s, ω 2 ), ϕ)| d(ω 1 , ω 2 ) ≤ sup s sup ϕ sup ω1,ω2 |Q π(•|ω1) n+1 (s, π(s, ω 1 ), ϕ) -Q π(•|ω1) n+1 (s, π(s, ω 2 ), ϕ)| d(ω 1 , ω 2 ) + sup s sup ϕ sup ω1,ω2 |Q π(•|ω1) n+1 (s, π(s, ω 2 ), ϕ) -Q π(•|ω2) n+1 (s, π(s, ω 2 ), ϕ)| d(ω 1 , ω 2 ) ≤ sup s sup ϕ sup ω1,ω2 |Q π(•|ω1) n+1 (s, π(s, ω 1 ), ϕ) -Q π(•|ω1) n+1 (s, π(s, ω 2 ), ϕ)| d(π(•|ω 1 ), π(•|ω 2 )) • d(π(•|ω 1 ), π(•|ω 2 )) d(ω 1 , ω 2 ) + sup s sup ϕ sup ω1,ω2 |Q π(•|ω1) n+1 (s, π(s, ω 2 ), ϕ) -Q π(•|ω2) n+1 (s, π(s, ω 2 ), ϕ)| d(ω 1 , ω 2 ) ≤K A Q π K Ω π + sup s sup ϕ sup ω1,ω2 |Q π(•|ω1) n+1 (s, π(s, ω 2 ), ϕ) -Q π(•|ω2) n+1 (s, π(s, ω 2 ), ϕ)| d(ω 1 , ω 2 ) (47) Now let: K * Q π ,n+1 = sup s sup a sup ϕ sup ω1,ω2 |Q π(•|ω1) n+1 (s, a, ϕ) -Q π(•|ω2) n+1 (s, a, ϕ)| d(ω 1 , ω 2 ) ≤ 0 + γ • sup s sup a sup ϕ sup ω1,ω2 s ′ T g (s ′ |s, a, ϕ(θ))|Q π(•|ω1) n (s ′ , π(s ′ , ω 1 ), ϕ) -Q π(•|ω2) n (s ′ , π(s ′ , ω 2 ), ϕ)|ds ′ d(ω 1 , ω 2 ) ≤ γ • sup s sup a sup ϕ sup ω1,ω2 s ′ T g (s ′ |s, a, ϕ)|Q π(•|ω1) n (s ′ , π(s ′ , ω 1 ), ϕ) -Q π(•|ω2) n (s ′ , π(s ′ , ω 1 ), ϕ)|ds ′ d(ω 1 , ω 2 ) + s ′ T g (s ′ |s, a, ϕ)|Q π(•|ω2) n (s ′ , π(s ′ , ω 1 ), ϕ) -Q π(•|ω2) n (s ′ , π(s ′ , ω 2 ), ϕ)|ds ′ d(ω 1 , ω 2 ) ≤ γ • sup s sup a sup ϕ sup ω1,ω2 s ′ T g (s ′ |s, a, ϕ)K * Q π ,n d(ω 1 , ω 2 )ds ′ d(ω 1 , ω 2 ) + s ′ T g (s ′ |s, a, ϕ)K A Q π d(π(ω 1 ), π(ω 2 ))ds ′ d(ω 1 , ω 2 ) ≤ γ • sup s sup a sup ϕ sup ω1,ω2 s ′ T g (s ′ |s, a, ϕ)K * Q π ,n d(ω 1 , ω 2 )ds ′ d(ω 1 , ω 2 ) + s ′ T g (s ′ |s, a, ϕ)K A Q π K Ω π d(ω 1 , ω 2 )ds ′ d(ω 1 , ω 2 ) ≤ γ(K * Q π ,n + K A Q π K Ω π ) Equivalently: K * Q π ,n+1 ≤ γ n K * Q π ,0 + n+1 i=1 γ i K A Q π K Ω π Computing the limit of both sides: K * Q π = lim n→∞ K * Q π ,n+1 ≤ lim n→∞ γ n K * Q π ,0 + lim n→∞ n+1 i=1 γ i K A Q π K Ω π = 0 + γK A Q π K Ω π 1 -γ Plugging the results back in to Equation 
47, we have: K Ω V π = K Ω Q π ≤ K A Q π K Ω π 1 -γ Lemma 5.3. Given a (K A R , K A T , K S R , K S T )-Lipschitz HiP-MDP, a pretrained K Ω π -Lipschitz optimal joint policy π(a|s, ω(θ)), starting from state distribution µ, the expected discounted reward difference induced by hidden parameter estimation error ϵ is bounded by: |J µ π * -J µ π | ≤ γK Ω π (K A R + γK S R K A T -γK S T K A R ) (1 -γ)(1 -γK S T ) • ϵ, if γK S T ≤ 1. Proof. When the policy is optimal, we have:  K Ω Q π ,n+1 = sup s ′ T g (s ′ |s, a, ϕ)K Q π ,n d(ω 1 , ω 2 )ds ′ d(ω 1 , ω 2 ) + s ′ T g (s ′ |s, a, ϕ)K A Q π K Ω π d(ω 1 , ω 2 )ds ′ d(ω 1 , ω 2 ) ≤ γ(K Ω Q π ,n + K A Q π K Ω π ) Equivalently: K Ω Q π ,n+1 ≤ γ n K Ω Q π ,0 + n+1 i=1 γ i K A Q π K Ω π Computing the limit of both sides: K Ω Q π = lim n→∞ K Q π ,n+1 ≤ lim n→∞ γ n K Ω Q π ,0 + lim n→∞ n+1 i=1 γ i K A Q π K Ω π = 0 + γK A Q π K θ π 1 -γ If the policy is optimal, we have: |R(s ′ 1 , π(s ′ 1 ), ω) -R(s ′ 2 , π(s ′ 2 ), ω)| d S (s 1 , s 2 ) ≤ K S R + K A R K S π + γ sup ω sup s1,s2 |R(s ′ 1 , π(s ′ 1 ), ω) -R(s ′ 2 , π(s ′ 2 ), ω)| d S (s 1 , s 2 ) ≤ K S R + K A R K S π + γ sup ω sup s1,s2 |R(s ′ 1 , π(s ′ 1 ), ω) -R(s ′ 2 , π(s ′ 1 ), ω)| + |R(s ′ 2 , π(s ′ 1 ), ω) -R(s ′ 2 , π(s ′ 2 ), ω)| d S (s 1 , s 2 ) ≤ K S R + K A R K S π + γ(K S R + K A R K S π ) sup ω sup s1,s2 |T (s 1 , π(s 1 ), ω) -T (s 2 , π(s 2 ), ω)| d S (s 1 , s 2 ) ≤ K S R + K A R K S π + γ(K S R + K A R K S π )(K S T + K A T K S π ) If the agent is at step N -2, similarly, we have: K S V ≤ K S R + K A R K S π + γ(K S R + K A R K S π )(K S T + K A T K S π ) + γ 2 (K S R + K A R K S π )(K S T + K A T K S π ) 2 By induction, if the agent is at step N -n, we have: K S V ≤ (K S R + K A R K S π ) • 1 -γ n (K S T + K A T K S π ) n 1 -γ(K S T + K A T K S π ) Thus, the assumption that γ(K S T + K S π K A T ) < 1 is unnecessary. 
Similarly, when the policy is optimal, at step $N-1$:

$$K^S_V = \sup_\omega \sup_{s_1,s_2} \frac{\left|\sum_{i=0}^{1} \gamma^i (R_1 - R_2)\right|}{d_S(s_1, s_2)} \le K^S_R + \gamma K^S_R \sup_\omega \sup_a \sup_{s_1,s_2} \frac{|T(s_1, a, \omega) - T(s_2, a, \omega)|}{d_S(s_1, s_2)} \le K^S_R + \gamma K^S_R K^S_T$$

At step $N-n$, we have:

$$K^S_V \le K^S_R \cdot \frac{1 - \gamma^n (K^S_T)^n}{1 - \gamma K^S_T}$$

Thus, the assumption that $\gamma K^S_T < 1$ is unnecessary. Intuitively, for tasks with infinite horizons, the assumption requires that the future states generated by close states are not too divergent. The threshold of "divergent" depends on how farsighted the agent is.
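The finite-horizon argument can be illustrated with a short numerical sketch. It assumes the bound is the partial geometric sum $K^S_R \sum_{i=0}^{n-1} (\gamma K^S_T)^i$, which equals the closed form above whenever $\gamma K^S_T \ne 1$; the function names are illustrative.

```python
import math

def finite_horizon_bound(k_r, k_t, gamma, n):
    """Partial geometric sum K_R * sum_{i=0}^{n-1} (gamma * K_T)^i."""
    ratio = gamma * k_t
    return k_r * sum(ratio ** i for i in range(n))

def finite_horizon_closed_form(k_r, k_t, gamma, n):
    """Closed form K_R * (1 - (gamma*K_T)^n) / (1 - gamma*K_T), with the ratio-one case handled."""
    ratio = gamma * k_t
    if math.isclose(ratio, 1.0):
        return k_r * n  # every term of the sum equals K_R
    return k_r * (1.0 - ratio ** n) / (1.0 - ratio)

if __name__ == "__main__":
    # gamma * K_T = 1.8 > 1: the bound is still finite for any fixed horizon.
    print(finite_horizon_bound(1.0, 2.0, 0.9, 5))
    # gamma * K_T = 0.45 < 1: the bound approaches K_R / (1 - gamma * K_T) as n grows.
    print(finite_horizon_bound(1.0, 0.5, 0.9, 200), 1.0 / (1.0 - 0.45))
```

For a fixed horizon the sum has finitely many terms, so it is bounded regardless of whether $\gamma K^S_T$ is below one; only the infinite-horizon limit needs the contraction condition.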

G APPLICATIONS OF HIP-MDPS

The HiP-MDP is an important setting widely used in recent meta-RL papers. It provides a natural testbed for meta/lifelong RL algorithms, as the difference between tasks can be controlled by a low-dimensional latent vector. A commonly-used meta-RL benchmark (almost in every recent meta-RL



We make this assumption in many of our following theorems. See appendix F for explanations. The hidden parameters here and the inferred hidden parameters mentioned below are all referring to the mapping of the hidden parameter in the latent representation space through the context encoder. We assume the true distribution over the hidden parameters is also implicitly mapped into the latent space.



Figure 1: Difference between model and policy transfer in HiP-MDPs.

Figure 2: First row: Average return on test tasks vs. environmental properties ($v$ and $g$). Second row: $-K_J$ vs. $v$ and $g$: how $-K_J$ is expected to change based on our theoretical results computed in Table 1. Each plot in the first row is associated with two plots below with the corresponding color. The performance of the algorithm is expected to decrease if $-K_J$ decreases.

Figure 3: Average return on test tasks v.s. environmental properties (m & v) that affect the Lipschitz constants for MPC method on ball-wind.

Figure 4: The Ball environment used in our experiments.

Figure 5: Average return on test tasks v.s. environmental properties (v & g) for different estimation accuracy of the hidden parameter in ball-goal & ball-wind.

31)We can get the same results if we replaceT n g (•| T n-1 g (•|µ, ω), ω) with T n g (•| T n-1 g (•|µ, ω), ω) in the triangle inequality.Lemma 4.2.Given a HiP-MDP with learned (K Ω R , K Ω T )-Lipschitz transition model T , R, in Model Transfer, the generalized value function is Lipschitz continuous with respect to Ω with a constant bounded by:

g (s ′ |s, a, ϕ(θ))|Q π(•|ω1) n (s ′ , π(s ′ , ω 1 ), ϕ) -Q π(•|ω2) n (s ′ , π(s ′ , ω 2 ), ϕ)|ds ′ d(ω 1 , ω 2 ) s ′ T g (s ′ |s, a, ϕ)|Q π(•|ω1) n (s ′ , π(s ′ , ω 1 ), ϕ) -Q π(•|ω2) n (s ′ , π(s ′ , ω 1 ), ϕ)|ds ′ d(ω 1 , ω 2 ) + s ′ T g (s ′ |s, a, ϕ)|Q π(•|ω2) n (s ′ , π(s ′ , ω 1 ), ϕ) -Q π(•|ω2) n (s ′ , π(s ′ , ω 2 ), ϕ)|ds ′ d(ω 1 , ω 2 ) g (s ′ |s, a, ϕ)K Q π ,n d(ω 1 , ω 2 )ds ′ d(ω 1 , ω 2 ) + s ′ T g (s ′ |s, a, ϕ)K A Q π d(π(ω 1 ), π(ω 2 ))ds ′ d(ω 1 , ω 2 )

|V π(ω1) (s, ϕ(θ)) -V π(ω2) (s, ϕ(θ))| ≤| max a Q π(ω1) (s, a, ϕ(θ)) -max a Q π(ω2) (s, a, ϕ(θ))| ≤ max a |Q π(ω1) (s, a, ϕ(θ)) -Q π(ω2) (s, a, ϕ(θ))|Then for the Lipschitz constant of the expected return function, we have:|V π(ω1) (s, ϕ(θ)) -V π(ω2) (s, ϕ(θ))|µ(s)ds d(ω 1 (θ), ω 2 (θ)) |Q π(ω1) (s, a, ϕ(θ)) -Q π(ω2) (s, a, ϕ(θ))|µ(s)ds d(ω 1 (θ), ω 2 (θ)) = K Ω Q π If the agent is at step N -1: 1 , π(s 1 ), ω) -R(s 2 , π(s 2 ), ω)| + γ|R(s ′ 1 , π(s ′ 1 ), ω) -R(s ′ 2 , π(s ′ 2 ), ω)| d S (s 1 , s 2 ) 1 , π(s 1 ), ω) -R(s 2 , π(s 1 ), ω)| + |R(s 2 , π(s 1 ), ω) -R(s 2 , π(s 2 ), ω)| d S (s 1 , s 2 )

s 1 , s 2 ) ′ 2 , a, ω)| d S (s 1 , s 2 )

8. ACKNOWLEDGEMENT

The authors would like to thank Kavosh Asadi, Saket Tiwari, Michael Littman for discussions and helpful feedback, and the anonymous reviewers for valuable feedback that improved the paper substantially. This work was supported in part by an NSF Graduate Research Fellowship under grant #2040433, NSF grants #1717569 #1955361 #IIS-2007076 and CAREER award #1844960, DARPA grant W911NF1820268, and ONR contracts N00014-17-1-2699, and was conducted using computational resources and services at the Center for Computation and Visualization, Brown University. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The content is solely the responsibility of the authors and does not necessarily represent the official views of DARPA, the NSF, the ONR, or the AFOSR.

annex

Recall that we quantify the distance between ω 1 (θ) and ω 2 (θ) using Wasserstein Metric W (ω 1 , ω 2 ), then< 1, the generalized value function corresponding to the K S π -Lipschitz policy π(a|s) is Lipschitz continuous with respect to Ω with a constant bounded by:Proof. By further assuming the Lipschitz continuity of dynamics & policy in Lemma 4.2, we can get the result bound leveraging the dual form of Wasserstein Metric (Equation 2), triangle inequality, and computing the fixed point of recurrence.Recall that:It is well known that as n → ∞, Q n+1 converges to Q * , now let:(33) Recall that we quantify the distance between ω 1 (θ) and ω 2 (θ) using Wasserstein Metric W (ω 1 , ω 2 ), then.Equivalently:Using the results from Corollary 3.2, we get:Claim 5.4. Given a linear and deterministic transition function, reward function and policy, the bounds derived in Lemma 5.3 are tight.Proof. Assume a linear transition function T for a new task defined as:a reward function defined as: R(s, a) = Ds + Ea, (50) And the optimal policy corresponding to hidden parameter θ defined as:Assume there's an estimation error ϵ for θ, then the policy becomes:Now consider the state s = 0, we can calculate the value of s predicted with T , R and π:The result matches the bound derived in Lemma 5.3. (In deterministic cases, we assume we directly estimate the value of θ, so K Ω π becomes K Θ π in this case and the Wasserstein distance becomes the L1 distance | θ -θ|.)Similar assumptions have been made in most of previous papers discussing Lipschitz continuity in RL (Rachelson & Lagoudakis, 2010; Pirotta et al., 2015; Asadi et al., 2018; Gelada et al., 2019) .Here, we make the following claim: Claim F.1. When a HiP-MDP has a fixed horizon N , the assumption that γ(K S T + K S π K A T ) < 1 and the assumption (when the policy is optimal) γK S T < 1 are both unnecessary.Proof.Firstly, if the agent is at the last step N of one episode:

