HIERARCHICAL PROTOTYPES FOR UNSUPERVISED DYNAMICS GENERALIZATION IN MODEL-BASED REINFORCEMENT LEARNING

Abstract

Generalization remains a central challenge in model-based reinforcement learning (MBRL). Recent works attempt to model an environment-specific factor and incorporate it into the dynamics prediction to enable generalization to different contexts. However, because earlier research estimates environment-specific factors from historical transitions alone, it is unable to clearly distinguish the factors of different environments, resulting in poor performance. To address this issue, we introduce a set of environment prototypes to represent the environment-specific representation of each environment. By encouraging learned environment-specific factors to more closely resemble their assigned environmental prototypes, the discrimination of factors between different environments is enhanced. To learn such prototypes in an unsupervised manner, we propose a hierarchical prototypical method that first builds trajectory embeddings according to trajectory label information, and then hierarchically constructs environmental prototypes from trajectory prototypes that share similar semantics. Experiments demonstrate that environment-specific factors estimated by our method have superior clustering performance and consistently improve the generalization performance of MBRL across six environments.

1. INTRODUCTION

Reinforcement learning (RL) has achieved great success in solving sequential decision-making problems, e.g., board games (Silver et al., 2016; 2017; Schrittwieser et al., 2020), computer games (Mnih et al., 2013; Silver et al., 2018; Vinyals et al., 2019), and robotics (Levine & Abbeel, 2014; Bousmalis et al., 2018), but it still suffers from low sample efficiency, making it challenging to solve real-world problems, especially those with limited or expensive data (Gottesman et al., 2018; Lu et al., 2018; 2020; Kiran et al., 2020). In contrast, model-based reinforcement learning (MBRL) (Janner et al., 2019; Kaiser et al., 2019; Schrittwieser et al., 2020; Zhang et al., 2019; van Hasselt et al., 2019; Hafner et al., 2019b;a; Lenz et al., 2015) has recently received wider attention because it explicitly builds a predictive model and can generate samples for learning the RL policy, alleviating the sample-inefficiency problem. As a sample-efficient alternative, MBRL derives a policy from a learned model of the environmental dynamics; the dynamics model's prediction accuracy is therefore highly correlated with policy quality (Janner et al., 2019). However, it has been evidenced that the learned dynamics prediction model is not robust to changes in environmental dynamics (Lee et al., 2020; Seo et al., 2020; Guo et al., 2021), and thus agents trained with model-based RL algorithms generalize poorly to environments with different dynamics. Such vulnerability to changes in environmental dynamics makes model-based RL methods unreliable in real-world applications, where the factors that affect the dynamics are only partially observed.
For example, the friction coefficient of the ground is usually difficult to measure, yet changes in it can substantially alter the dynamics of a robot walking on that ground, degrading the performance of an agent trained with model-based RL (Yang et al., 2019; Gu et al., 2017; Nagabandi et al., 2018b). Recent studies (Seo et al., 2020; Nagabandi et al., 2018a; Lee et al., 2020; Guo et al., 2021) have demonstrated that incorporating an environmental factor Z into dynamics prediction facilitates the generalization of model-based RL methods to unseen environments. However, environmental factors are unobservable in the majority of applications; for instance, the friction coefficient is not available to a robot. Therefore, estimating a semantically meaningful Z for each environment is the first step toward generalization in model-based RL. This is not easy to implement, because environments are hard to label; for example, it is impractical to measure the friction coefficient of every road. Without environment labels, the Zs estimated by previous methods (Seo et al., 2020; Nagabandi et al., 2018a; Lee et al., 2020; Guo et al., 2021) cannot form clear clusters for different environments, as Figure 3 shows. These entangled Zs cannot represent distinct environment-specific information, and thus may cause the learned dynamics prediction function to deviate from the true one, resulting in poor generalization ability. In this paper, we propose a hierarchical prototypical method (HPM) with the objective of learning an environment-specific representation that forms distinct clusters. By representing environment-specific information in a semantically meaningful way, HPM learns a more generalizable dynamics prediction function. To achieve this, our method constructs a set of environmental prototypes to capture the environment-specific information of each environment.
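To make the idea of estimating an environment-specific factor from historical transitions concrete, the following is a minimal pure-Python sketch of a context encoder. The mean-pooled weighted embedding and all names (`encode_transition`, `estimate_z`, `weights`) are illustrative assumptions, not the architecture used by the methods cited above:

```python
def encode_transition(state, action, next_state, weights):
    """Embed one (s, a, s') transition as an elementwise weighting of its features.

    A stand-in for a learned per-transition encoder; `weights` plays the
    role of learned parameters (hypothetical, for illustration only).
    """
    features = state + action + next_state  # concatenate feature lists
    return [w * f for w, f in zip(weights, features)]

def estimate_z(past_transitions, weights):
    """Mean-pool per-transition embeddings into one context vector Z-hat.

    Mirrors the common design in context-conditioned dynamics models:
    Z-hat summarizes a window of recent transitions and is then fed to
    the dynamics model alongside the current state and action.
    """
    embeddings = [encode_transition(s, a, s2, weights)
                  for (s, a, s2) in past_transitions]
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings)
            for i in range(dim)]
```

In a full model, the dynamics predictor would take the concatenation of (state, action, Ẑ) as input, so that environments with different hidden factors (e.g., friction) yield different predictions from the same observed state and action.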
By enforcing the estimated Ẑ to be more similar to its respective environmental prototype and dissimilar to the other prototypes, the estimated Ẑs can form compact clusters, which supports learning a generalizable dynamics prediction function. Because environment labels are not available, we cannot construct environmental prototypes directly. To address this issue, we begin by building easily learned trajectory prototypes based on trajectory label information. Environmental prototypes are then created by merging trajectory prototypes that share similar semantics, as suggested by the natural hierarchical relationship between trajectories and environments. With this hierarchical prototypical structure, we further propose a prototypical relational loss to learn Z from past transitions. Specifically, we not only aggregate Ẑs with similar causal effects by optimizing the relational loss (Guo et al., 2021), but also aggregate each Ẑ with its corresponding trajectory and environmental prototypes via the relational loss. In addition, to alleviate the over-penalization of semantically similar prototypes, we propose to penalize prototypes adaptively according to intervention similarity. In the experiments, we evaluate our method on a range of tasks in OpenAI Gym (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012). The experimental results show that our method forms clearer and tighter clusters of Ẑs, and that such Ẑs improve the generalization ability of model-based RL methods, achieving state-of-the-art performance in new environments with different dynamics without any adaptation step.
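The prototype-attraction objective described above can be sketched as a contrastive loss over prototypes. The snippet below is a simplified stand-in, not the paper's exact prototypical relational loss: cosine similarity, the temperature value, and the `1 - similarity` negative weighting are assumptions made for illustration of the "pull toward the assigned prototype, penalize dissimilar prototypes more than similar ones" idea:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prototypical_loss(z, prototypes, assigned, temperature=0.1):
    """Pull Z-hat toward its assigned prototype, push it from the others.

    A plain InfoNCE-style contrastive loss over prototypes: negative
    log-probability of the assigned prototype under a softmax of
    temperature-scaled similarities.
    """
    logits = [cosine(z, p) / temperature for p in prototypes]
    m = max(logits)  # subtract max for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[assigned]

def adaptive_weights(prototypes, assigned):
    """Penalize other prototypes less when they resemble the assigned one.

    Sketches the 'avoid over-penalizing semantically similar prototypes'
    idea: the weight for prototype k is 1 - sim(p_assigned, p_k), so a
    near-duplicate prototype is barely pushed away.
    """
    anchor = prototypes[assigned]
    return [0.0 if k == assigned else 1.0 - cosine(anchor, prototypes[k])
            for k in range(len(prototypes))]
```

In this sketch the loss decreases as Ẑ aligns with its assigned prototype, and `adaptive_weights` would scale each negative term so that merging-candidate prototypes with similar semantics are not driven apart.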

2. RELATED WORK

Model-based reinforcement learning By learning a dynamics prediction model, model-based reinforcement learning (MBRL) attains high data efficiency. The learned prediction model can generate samples for training the policy (Du & Narasimhan, 2019; Whitney et al., 2019) or for planning ahead at inference time (Atkeson & Santamaria, 1997; Lenz et al., 2015; Tassa et al., 2012). The performance of MBRL therefore relies heavily on the prediction accuracy of the dynamics model. Several methods have been proposed to improve this accuracy, such as ensembles (Chua et al., 2018), latent dynamics models (Hafner et al., 2019b;a; Schrittwieser et al., 2020), and bidirectional prediction (Lai et al., 2020). However, current predictive methods still struggle to generalize to unseen dynamics, which hinders the application of MBRL to real-world problems.

Dynamics generalization in model-based reinforcement learning

To adapt MBRL to unknown dynamics, meta-learning methods (Nagabandi et al., 2018a;b; Saemundsson et al., 2018) adapt model parameters with a small number of gradient updates (Finn et al., 2017) or through the hidden representations of a recurrent model (Doshi-Velez & Konidaris, 2016). Using multiple-choice learning, Lee et al. (2020) and Seo et al. (2020) attempted to learn a generalized dynamics model by incorporating environment-specific information or by clustering dynamics implicitly, with the goal of adapting to any dynamics without further training. Through relational learning and causal effect estimation, RIA (Guo et al., 2021) aims to explicitly learn meaningful environment-specific information. However, the dynamics change learned by RIA still suffers from high variance.

Prototypical methods By learning an encoder to embed data in a low-dimensional representation space, prototypical methods obtain a set of prototypical embeddings, which are referred to as

