ϵ-INVARIANT HIERARCHICAL REINFORCEMENT LEARNING FOR BUILDING GENERALIZABLE POLICY

Anonymous authors
Paper under double-blind review

Abstract

Goal-conditioned Hierarchical Reinforcement Learning (HRL) has shown remarkable potential for solving complex control tasks. However, existing methods struggle in tasks that require generalization, since the learned subgoals are highly task-specific and therefore hardly reusable. In this paper, we propose a novel HRL framework called ϵ-Invariant HRL that uses abstract, task-agnostic subgoals reusable across tasks, resulting in a more generalizable policy. Although such subgoals are reusable, a transition mismatch problem, caused by inevitably incorrect value evaluation of subgoals, can lead to non-stationary learning and even collapse. We mitigate this mismatch problem by training the high-level policy to be adaptable to stochasticity manually injected into the low-level policy. As a result, our framework can leverage reusable subgoals to constitute a hierarchical policy that effectively generalizes to unseen new tasks. Theoretical analysis and experimental results on continuous control navigation tasks and challenging zero-shot generalization tasks show that our approach significantly outperforms state-of-the-art methods.

1. INTRODUCTION

Goal-conditioned Hierarchical Reinforcement Learning (HRL) methods have shown remarkable potential to solve complex tasks (Nachum et al., 2018b; Kim et al., 2021; Costales et al., 2021), such as continuous control tasks that require long-horizon navigation (Li et al., 2021; Gürtler et al., 2021). Goal-conditioned HRL uses subgoals to decompose the original task into several sub-tasks, training a high-level policy to output subgoals and a low-level policy that executes raw actions conditioned on those subgoals. The performance of goal-conditioned HRL therefore relies mainly on a well-designed subgoal space. In particular, in many complex realistic scenarios such as robotics, reusable subgoals are necessary for building a generalizable policy that is widely applicable to different tasks.

Prevailing strategies usually utilize task-specific subgoals, such as representations extracted from essential states or trajectories (Nachum et al., 2018a; Li et al., 2021; Jiang et al., 2019b), or select a subgoal space based on prior knowledge, such as (a subspace of) the raw state space (Nachum et al., 2018b; Zhang et al., 2020). Such subgoals depend on the specific task and thus often cannot be reused across tasks. For instance, in a maze navigation task, a subgoal representing coordinates in one maze may not be reachable in another maze, even though the two mazes are similar. As a result, policies based on these subgoals cannot be reused in different tasks.

To construct reusable subgoals, a readily applicable choice is to use invariant abstract physical quantities as subgoals, such as directions in a navigation task. These abstract subgoals are usually task-agnostic and thereby naturally reusable. However, building a generalizable policy on top of these abstract subgoals remains a challenge due to the transition mismatch problem, which is common in HRL (Zhang et al., 2022; Levy et al., 2018) and can lead to a non-stationary learning process.
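The two-level decomposition described above can be sketched as follows. This is a minimal illustration in our own notation, not the paper's implementation: the function names, the re-planning interval k, and the negative-distance intrinsic reward are all illustrative assumptions commonly used in goal-conditioned HRL.

```python
import numpy as np

def intrinsic_reward(state, subgoal):
    # Low-level reward: negative distance from the current state to the subgoal
    # (an illustrative choice, common in goal-conditioned HRL).
    return -np.linalg.norm(state - subgoal)

def rollout(env_step, high_policy, low_policy, s0, horizon=20, k=5):
    """Run one episode of a two-level goal-conditioned hierarchy.

    Every k steps the high-level policy emits a subgoal; in between,
    the low-level policy outputs raw actions conditioned on (state, subgoal).
    """
    s = s0
    transitions = []
    subgoal = high_policy(s)
    for t in range(horizon):
        if t > 0 and t % k == 0:
            subgoal = high_policy(s)          # re-plan at a coarser timescale
        a = low_policy(s, subgoal)            # raw action conditioned on subgoal
        s_next = env_step(s, a)
        r_lo = intrinsic_reward(s_next, subgoal)
        transitions.append((s, subgoal, a, r_lo, s_next))
        s = s_next
    return transitions
```

For example, with a toy 2D point environment `env_step(s, a) = s + clip(a, -1, 1)` and a low-level policy that moves toward the subgoal, the agent converges to the commanded subgoal within a few steps.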
In general, the transition mismatch problem emerges when the high-level policy evaluates subgoals with incorrect transitions and rewards due to an inadequately trained low-level policy. To introduce the problem clearly, we illustrate it in a challenging task commonly considered by prior HRL methods, i.e., a long-horizon maze navigation problem with a legged robot, although our idea is not limited to this particular task. Suppose the high-level policy produces a subgoal directing the agent to the position with coordinates (x, y), but the immature low-level policy instead reaches (x′, y′). The high-level policy will then evaluate the subgoal using the incorrect reward obtained at (x′, y′), which will lead to
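To make the mismatch concrete, the following sketch (our notation, not the paper's method) shows how the same subgoal receives two different value estimates depending on the low-level policy's competence: the high level can only credit a subgoal with the reward obtained at the state the low level actually reached. The goal location and the immature policy's systematic error are illustrative assumptions.

```python
import numpy as np

GOAL = np.array([8.0, 8.0])   # task goal position (illustrative)

def high_level_reward(reached):
    # Extrinsic reward the high level observes: negative distance from the
    # state the low level actually reached to the task goal.
    return -np.linalg.norm(reached - GOAL)

def evaluate_subgoal(subgoal, low_level):
    # The high level credits the subgoal with whatever the low level achieves.
    reached = low_level(subgoal)
    return high_level_reward(reached)

# A mature low level reaches the commanded subgoal (x, y) exactly;
# an immature one lands at an off-target (x', y') instead.
mature = lambda g: g
immature = lambda g: g + np.array([-3.0, 2.0])   # systematic error (illustrative)

subgoal = np.array([8.0, 8.0])
v_true = evaluate_subgoal(subgoal, mature)       # correct value of this subgoal
v_seen = evaluate_subgoal(subgoal, immature)     # distorted value of the same subgoal
```

Because `v_seen < v_true`, an objectively good subgoal appears bad to the high level early in training; as the low level improves, the same subgoal's apparent value shifts, which is the source of the non-stationarity discussed above.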

