PERFORMANCE BOUNDS FOR MODEL AND POLICY TRANSFER IN HIDDEN-PARAMETER MDPS

Abstract

In the Hidden-Parameter MDP (HiP-MDP) framework, a family of reinforcement learning tasks is generated by varying hidden parameters specifying the dynamics and reward function for each individual task. The HiP-MDP is a natural model for families of tasks in which meta- and lifelong-reinforcement learning approaches can succeed. Given a learned context encoder that infers the hidden parameters from previous experience, most existing algorithms fall into two categories: model transfer and policy transfer, depending on which function the hidden parameters are used to parameterize. We characterize the robustness of model and policy transfer algorithms with respect to hidden parameter estimation error. We first show that the value function of HiP-MDPs is Lipschitz continuous under certain conditions. We then derive regret bounds for both settings through the lens of Lipschitz continuity. Finally, we empirically corroborate our theoretical analysis by varying the hyper-parameters governing the Lipschitz constants of two continuous control problems; the resulting performance is consistent with our theoretical results.



1. INTRODUCTION

Hidden-parameter Markov Decision Processes (HiP-MDPs) (Doshi-Velez & Konidaris, 2016) describe a family of related tasks by modeling task variations with a set of low-dimensional hidden parameters. HiP-MDPs are widely used in recent meta-Reinforcement Learning (RL) and lifelong RL works, in which an agent needs to quickly adapt to new tasks by transferring knowledge from previous tasks. To solve HiP-MDPs, most existing algorithms first infer the hidden parameters from previous experience (e.g., by learning a context encoder that maps trajectories to a hidden parameter estimate θ̂), then use the estimated hidden parameters to solve new tasks with either model transfer (Killian et al., 2017; Lee et al., 2020; Fu et al., 2022) or policy transfer (Yao et al., 2018; Rakelly et al., 2019) algorithms. As shown in Figure 1, in model transfer, the inferred hidden parameters are used to build a simulator of the environment, which can then be used for planning. In policy transfer, the inferred hidden parameters are used to parameterize the policy of the new task directly. Previous works have observed mixed empirical evidence for when each approach performs better (Yao et al., 2018; Lee et al., 2020; Fu et al., 2022). In this work, we take a step in the theoretical direction by studying the regret bounds of model and policy transfer algorithms, respectively, which helps characterize the robustness of these algorithms. Two main factors affect the performance of HiP-MDP algorithms: (1) the estimation accuracy of the hidden parameter used as an input to either the learned model or policy, and (2) the quality of the learned model or policy (i.e., given an accurately estimated hidden parameter, whether the agent can reach optimal performance). In practice, the HiP-MDP tasks considered in most recent meta-RL papers (e.g., changing the physical properties of MuJoCo-simulated robots) are relatively easy to solve when the hidden parameters are given. We therefore assume the learned model and policy are optimal, and theoretically analyze how hidden parameter estimation error affects the final performance of model and policy transfer algorithms. Our contributions are as follows. We first derive the conditions under which the value function of HiP-MDPs is Lipschitz continuous. We then derive upper bounds on the regret of model and policy transfer algorithms, respectively, when the estimation error of the hidden parameters is bounded. We also give an upper bound on the multi-step prediction error in model transfer. We further prove that the derived bounds are tight when the transition dynamics, reward function, and policy are linear and deterministic.
As far as we are aware, these are the first theoretical results on policy and model transfer performance bounds in HiP-MDPs. Given the same bound on hidden parameter estimation error, our results characterize when model or policy transfer is more robust than the other: a slower increase in the Lipschitz constant with respect to the hidden parameter implies greater robustness. In addition to the theoretical analysis, we empirically study the performance of model and policy transfer algorithms in two continuous control domains. For each domain, we control the Lipschitz constants of the HiP-MDPs by altering the hyper-parameters of the environments. The results are consistent with our theoretical understanding of how the regret of model and policy transfer algorithms scales with the estimation error of the hidden parameters: a slower increase in the Lipschitz constant with respect to the hidden parameter implies a smaller performance decay.
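The two transfer pipelines contrasted above can be sketched in code. This is a minimal illustration under our own assumptions (a discrete candidate-action set, and random-shooting planning on the model-transfer side); all function names here are hypothetical, not from the paper:

```python
import numpy as np

# Given an estimate theta_hat from a context encoder, the two pipelines
# use it differently: to parameterize a learned simulator (model transfer)
# or to condition a learned policy directly (policy transfer).

def model_transfer_action(s, theta_hat, learned_model, learned_reward,
                          candidate_actions, horizon=5, n_rollouts=32):
    """Model transfer: plug theta_hat into a learned simulator, then plan
    (here: random shooting over a discrete candidate-action set)."""
    rng = np.random.default_rng(0)
    best_a, best_ret = None, -np.inf
    for a0 in candidate_actions:
        avg_ret = 0.0
        for _ in range(n_rollouts):
            state, a, ret = s, a0, 0.0
            for _ in range(horizon):
                ret += learned_reward(state, a, theta_hat)
                state = learned_model(state, a, theta_hat)
                a = rng.choice(candidate_actions)  # random continuation
            avg_ret += ret / n_rollouts
        if avg_ret > best_ret:
            best_a, best_ret = a0, avg_ret
    return best_a

def policy_transfer_action(s, theta_hat, learned_policy):
    """Policy transfer: feed theta_hat into a theta-conditioned policy."""
    return learned_policy(s, theta_hat)
```

In both cases the action depends on theta_hat, so the sensitivity of the learned model or policy to estimation error in theta_hat is exactly what governs the regret.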

2. BACKGROUND AND RELATED WORK

HiP-MDPs model the variations in transition dynamics and reward functions by assigning each task a hidden parameter θ, drawn from a distribution P_Ω. The agent neither observes θ nor has access to the distribution P_Ω that generates the task family. For a given task, parameterized by θ ∈ Θ, the stochastic dynamics are given by T(s′|s, a; θ) and the deterministic reward function by R(s, a; θ). We consider continuous state and action spaces in this work (s ∈ S, a ∈ A). Upon encountering a new task, the agent estimates the new dynamics, T, and the new reward function, R, by inferring a distribution ω̂(θ) over the hidden parameter. If our estimate of θ is accurate, then ω̂(θ) should be close to the true distribution ω(θ), i.e., it should peak at the true θ for this specific task. We assume the learned dependency of T and R on θ is accurate; thus, if the estimate of θ is also accurate, the agent can solve the task completely. The HiP-MDP is an important setting widely used in recent meta/lifelong RL papers (see Appendix G). A number of approaches have been proposed to infer the hidden parameter θ of HiP-MDPs in a latent representation space, such as using Bayesian models to leverage prior knowledge (Killian et al., 2017; Yao et al., 2018; Fu et al., 2022), or training a context encoder that maps trajectories to the latent parameter (Rakelly et al., 2019; Zintgraf et al., 2020; Lee et al., 2020). Given the inferred hidden parameters, various RL algorithms have been used to adapt to new tasks, including training an off-policy actor-critic that takes the estimated parameter as an additional input (Rakelly et al., 2019; Fakoor et al., 2020; Fu et al., 2021).
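As a concrete toy instance of this setup, the following sketch defines a 1-D HiP-MDP family in which θ ~ P_Ω scales the dynamics and shifts the reward. All names and functional forms here are our own illustration, not from the paper:

```python
import numpy as np

class HiPMDPFamily:
    """Toy 1-D HiP-MDP family: each task is an MDP whose dynamics T(s'|s,a; theta)
    and reward R(s,a; theta) depend on a hidden parameter theta drawn from P_Omega
    (here a uniform prior over an interval)."""

    def __init__(self, theta_low=0.5, theta_high=1.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta_low, self.theta_high = theta_low, theta_high

    def sample_task(self):
        # theta ~ P_Omega; the agent never observes this value directly
        return self.rng.uniform(self.theta_low, self.theta_high)

    def transition(self, s, a, theta, noise_std=0.01):
        # Stochastic dynamics: theta scales how strongly actions move the state
        return s + theta * a + self.rng.normal(0.0, noise_std)

    def reward(self, s, a, theta):
        # Deterministic reward: the goal position shifts with theta
        return -(s - theta) ** 2

family = HiPMDPFamily()
theta = family.sample_task()          # hidden from the agent
s_next = family.transition(0.0, 0.1, theta)
r = family.reward(s_next, 0.1, theta)
```

A context encoder would be trained to recover ω̂(θ) from trajectories (s, a, r, s′) collected in such tasks; here θ is exposed only so the environment itself can be simulated.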



Figure 1: Difference between model and policy transfer in HiP-MDPs.

Other approaches leverage recurrent neural networks (Zintgraf et al., 2020; Duan et al., 2016), or plan with a learned transition model that also takes in the inferred parameter (Lee et al., 2020; Mendonca et al., 2020). Yet, in practice, there is no clear winner among these algorithms, leaving open the question of what factors affect the performance of the different algorithms, and how. Nair & Doshi-Velez study a similar problem but in a different setting where the hidden parameters are given (contextual MDP) and derive sample complexity bounds for model-based learning. In this work, we quantify how errors and uncertainty in estimating θ affect the performance of model and policy transfer methods through the lens of Lipschitz continuity.

Given a distance metric d_M on a space M, Lipschitz continuity quantifies the smoothness of a function as follows.

Definition 2.1. A function f : M_1 → M_2 is uniformly Lipschitz continuous if

K_f^{M_1,M_2} := sup_{x_1 ≠ x_2} d_{M_2}(f(x_1), f(x_2)) / d_{M_1}(x_1, x_2)

is finite.
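Definition 2.1 can be illustrated numerically: sampling input pairs and taking the maximum distance ratio gives a crude empirical lower bound on the Lipschitz constant. This toy sketch (our own, using Euclidean distance on the reals) applies it to sin, whose true Lipschitz constant is 1 since |cos| ≤ 1:

```python
import numpy as np

def empirical_lipschitz(f, samples, d_in=None, d_out=None):
    """Lower-bound the Lipschitz constant of f by the maximum ratio
    d_out(f(x1), f(x2)) / d_in(x1, x2) over all sampled pairs."""
    d_in = d_in or (lambda a, b: abs(a - b))
    d_out = d_out or (lambda a, b: abs(a - b))
    best = 0.0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            denom = d_in(samples[i], samples[j])
            if denom > 0:
                best = max(best, d_out(f(samples[i]), f(samples[j])) / denom)
    return best

xs = np.linspace(-1.0, 1.0, 201)
K = empirical_lipschitz(np.sin, xs)  # approaches the true constant 1 from below
```

Being a supremum over sampled pairs only, the estimate never exceeds the true constant; denser sampling near the steepest region tightens it.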

