HIERARCHICAL PROTOTYPES FOR UNSUPERVISED DYNAMICS GENERALIZATION IN MODEL-BASED REINFORCEMENT LEARNING

Abstract

Generalization remains a central challenge in model-based reinforcement learning (MBRL). Recent works attempt to model an environment-specific factor and incorporate it into dynamics prediction to enable generalization across different contexts. However, by estimating environment-specific factors only from historical transitions, earlier methods cannot clearly distinguish the factors of different environments, resulting in poor performance. To address this issue, we introduce a set of environment prototypes to represent the environment-specific representation of each environment. By encouraging learned environment-specific factors to more closely resemble their assigned environment prototypes, the discrimination between factors from different environments is enhanced. To learn such prototypes in an unsupervised manner, we propose a hierarchical prototypical method that first builds trajectory embeddings according to trajectory label information, and then hierarchically constructs environment prototypes from trajectory prototypes sharing similar semantics. Experiments demonstrate that the environment-specific factors estimated by our method have superior clustering performance and consistently improve the generalization performance of MBRL in six environments.
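To illustrate the core idea of pulling environment-specific factors toward their assigned prototypes, the following is a minimal sketch (not the paper's implementation): it assumes cosine similarity between factors and prototypes, hard nearest-prototype assignment, and a softmax cross-entropy objective; the function name and temperature parameter are illustrative choices, not taken from the paper.

```python
import numpy as np

def prototype_alignment_loss(z, prototypes, temperature=0.1):
    """Encourage each environment-specific factor z[i] to resemble its
    nearest environment prototype (hypothetical sketch of the objective).

    z          : (N, D) array of estimated environment-specific factors
    prototypes : (K, D) array of environment prototypes
    """
    # Cosine similarity: normalize factors and prototypes to unit length.
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    p_n = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = z_n @ p_n.T / temperature          # (N, K) scaled similarities

    # Hard assignment: each factor is matched to its closest prototype.
    assign = sim.argmax(axis=1)

    # Cross-entropy over prototypes: maximizing the softmax probability of
    # the assigned prototype pulls the factor toward it and away from others.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), assign].mean()
```

Factors already clustered around distinct prototypes incur a near-zero loss, while diffuse factors are penalized, which is what sharpens the discrimination between environments.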

1. INTRODUCTION

Reinforcement learning (RL) has achieved great success in solving sequential decision-making problems, e.g., board games (Silver et al., 2016; 2017; Schrittwieser et al., 2020), computer games (Mnih et al., 2013; Silver et al., 2018; Vinyals et al., 2019), and robotics (Levine & Abbeel, 2014; Bousmalis et al., 2018), but it still suffers from low sample efficiency, making it challenging to solve real-world problems, especially those with limited or expensive data (Gottesman et al., 2018; Lu et al., 2018; 2020; Kiran et al., 2020). In contrast, model-based reinforcement learning (MBRL) (Janner et al., 2019; Kaiser et al., 2019; Schrittwieser et al., 2020; Zhang et al., 2019; van Hasselt et al., 2019; Hafner et al., 2019b;a; Lenz et al., 2015) has recently received wider attention, because it explicitly builds a predictive model of the environment and can generate samples for learning the RL policy, alleviating the sample-inefficiency problem. As a sample-efficient alternative, MBRL derives its policy from the learned dynamics prediction model, so the model's prediction accuracy is highly correlated with the quality of the policy (Janner et al., 2019). However, it has been shown that the learned dynamics prediction model is not robust to changes in environmental dynamics (Lee et al., 2020; Seo et al., 2020; Guo et al., 2021), and thus agents trained by model-based RL algorithms generalize poorly to environments with different dynamics. Such vulnerability to changes in environmental dynamics makes model-based RL methods unreliable in real-world applications, where the factors that affect dynamics are only partially observed.
For example, the friction coefficient of the ground is usually difficult to measure, yet changes in it can largely affect the dynamics of a robot walking on that ground, degrading the performance of an agent trained by model-based RL methods (Yang et al., 2019; Gu et al., 2017; Nagabandi et al., 2018b). Recent studies (Seo et al., 2020; Nagabandi et al., 2018a; Lee et al., 2020; Guo et al., 2021) have demonstrated that incorporating an environmental factor Z into dynamics prediction facilitates the generalization of model-based RL methods to unseen environments. However, environmental factors

