PERFORMANCE BOUNDS FOR MODEL AND POLICY TRANSFER IN HIDDEN-PARAMETER MDPS

Abstract

In the Hidden-Parameter MDP (HiP-MDP) framework, a family of reinforcement learning tasks is generated by varying hidden parameters that specify the dynamics and reward function of each individual task. The HiP-MDP is a natural model for families of tasks in which meta- and lifelong reinforcement learning approaches can succeed. Given a learned context encoder that infers the hidden parameters from previous experience, most existing algorithms fall into two categories, model transfer and policy transfer, depending on which function the hidden parameters are used to parameterize. We characterize the robustness of model and policy transfer algorithms with respect to hidden-parameter estimation error. We first show that the value function of a HiP-MDP is Lipschitz continuous under certain conditions. We then derive regret bounds for both settings through the lens of Lipschitz continuity. Finally, we empirically corroborate our theoretical analysis by varying the hyperparameters governing the Lipschitz constants of two continuous control problems; the resulting performance is consistent with our theoretical results.



As shown in Figure 1, in model transfer, the inferred hidden parameters are used to build a simulator of the environment, which can then be used for planning. In policy transfer, the inferred hidden parameters are used to parameterize the policy of the new task directly. Previous works have observed mixed empirical evidence for when each approach performs better (Yao et al., 2018; Lee et al., 2020; Fu et al., 2022). In this work, we take a step in the theoretical direction by studying the regret bounds of model and policy transfer algorithms, which helps characterize the robustness of these algorithms. Two main factors affect the performance of HiP-MDP algorithms: (1) the estimation accuracy of the hidden parameter used as an input to either the learned model or policy, and (2) the quality of the learned model or policy (i.e., given an accurately estimated hidden parameter, whether the agent can reach optimal performance). In practice, the HiP-MDP tasks considered in most recent meta-RL papers (e.g., changing the physical properties of MuJoCo-simulated robots) are relatively easy to
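To make the distinction concrete, here is a minimal sketch on a hypothetical one-dimensional HiP-MDP (not from the paper) whose hidden parameter theta scales the dynamics. Model transfer plans in a simulator built from the estimated theta, while policy transfer feeds the estimate directly into a theta-conditioned policy; in both cases, performance under the true dynamics degrades as the estimate drifts from the true value. The dynamics, reward, and all function names are illustrative assumptions.

```python
import numpy as np

# Toy HiP-MDP (an illustrative assumption, not the paper's benchmark):
# transition s' = s + theta * a, reward -|s'| (drive the state to zero).
def step(s, a, theta):
    s_next = s + theta * a
    return s_next, -abs(s_next)

# Model transfer: plan one step ahead in a simulator parameterized by theta_hat.
def model_transfer_action(s, theta_hat, candidates=np.linspace(-1, 1, 201)):
    rewards = [step(s, a, theta_hat)[1] for a in candidates]
    return float(candidates[int(np.argmax(rewards))])

# Policy transfer: a theta-conditioned policy evaluated at theta_hat
# (here the closed-form optimum for the toy task, a = -s / theta).
def policy_transfer_action(s, theta_hat):
    return float(np.clip(-s / theta_hat, -1.0, 1.0))

theta_true, s = 0.8, 0.5
for theta_hat in (0.8, 0.6):  # exact vs. misestimated hidden parameter
    _, r_model = step(s, model_transfer_action(s, theta_hat), theta_true)
    _, r_policy = step(s, policy_transfer_action(s, theta_hat), theta_true)
    print(f"theta_hat={theta_hat}: model-transfer reward={r_model:.3f}, "
          f"policy-transfer reward={r_policy:.3f}")
```

With an exact estimate both approaches are near-optimal; with theta_hat = 0.6 both overshoot, mirroring the sensitivity to estimation error that the regret bounds quantify.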



Figure 1: Difference between model and policy transfer in HiP-MDPs.

Hidden-Parameter Markov Decision Processes (HiP-MDPs) (Doshi-Velez & Konidaris, 2016) describe a family of related tasks by modeling task variations with a set of low-dimensional hidden parameters. HiP-MDPs are widely used in recent meta-Reinforcement Learning (RL) and lifelong RL works, in which an agent needs to quickly adapt to new tasks by transferring knowledge from previous tasks. To solve HiP-MDPs, most existing algorithms first infer the hidden parameters from previous experience (e.g., by learning a context encoder that maps trajectories to a hidden parameter estimate θ), then use the estimated hidden parameters to solve new tasks with either model transfer (Killian et al., 2017; Lee et al., 2020; Fu et al., 2022) or policy transfer (Yao et al., 2018; Rakelly et al., 2019) algorithms.
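The inference step can likewise be sketched with a deliberately simple, hypothetical estimator: for toy dynamics s' = s + theta * a, the hidden parameter is recoverable from a batch of observed transitions by least squares. This stands in for the learned context encoder described above; the dynamics and the function name `infer_theta` are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Hypothetical "context encoder" for toy dynamics s' = s + theta * a:
# estimate theta from transitions (s, a, s') by least squares.
def infer_theta(transitions):
    s, a, s_next = (np.asarray(x, dtype=float) for x in zip(*transitions))
    # (s' - s) = theta * a  =>  theta = <a, s' - s> / <a, a>
    return float(a @ (s_next - s) / (a @ a))

# Collect a short trajectory under slightly noisy dynamics.
rng = np.random.default_rng(0)
theta_true, s = 0.7, 0.0
transitions = []
for _ in range(20):
    a = rng.uniform(-1, 1)
    s_next = s + theta_true * a + rng.normal(scale=0.01)
    transitions.append((s, a, s_next))
    s = s_next

theta_hat = infer_theta(transitions)
print(f"estimated theta: {theta_hat:.3f}")  # should land close to 0.7
```

The gap between theta_hat and the true parameter is exactly the estimation error whose effect on downstream model and policy transfer the paper's Lipschitz-based bounds characterize.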

