ϵ-INVARIANT HIERARCHICAL REINFORCEMENT LEARNING FOR BUILDING GENERALIZABLE POLICY

Anonymous authors
Paper under double-blind review

Abstract

Goal-conditioned Hierarchical Reinforcement Learning (HRL) has shown remarkable potential for solving complex control tasks. However, existing methods struggle on tasks that require generalization, since the learned subgoals are highly task-specific and therefore hardly reusable. In this paper, we propose a novel HRL framework called ϵ-Invariant HRL that uses abstract, task-agnostic subgoals reusable across tasks, resulting in a more generalizable policy. Although such subgoals are reusable, a transition mismatch problem caused by the inevitable incorrect value evaluation of subgoals can lead to non-stationary learning and even collapse. We mitigate this mismatch problem by training the high-level policy to be adaptable to stochasticity manually injected into the low-level policy. As a result, our framework can leverage reusable subgoals to build a hierarchical policy that effectively generalizes to unseen new tasks. Theoretical analysis and experimental results on continuous control navigation tasks and challenging zero-shot generalization tasks show that our approach significantly outperforms state-of-the-art methods.

1. INTRODUCTION

Hierarchical Reinforcement Learning (HRL) methods have shown remarkable potential for solving complex tasks (Nachum et al., 2018b; Kim et al., 2021; Costales et al., 2021), such as continuous control tasks that require long-horizon navigation (Li et al., 2021; Gürtler et al., 2021). Goal-conditioned HRL uses subgoals to decompose the original task into several sub-tasks, training a high-level policy that outputs subgoals and a low-level policy that executes raw actions conditioned on those subgoals. The performance of goal-conditioned HRL therefore relies mainly on a well-designed subgoal space. In particular, in many complex realistic scenarios such as robotics, reusable subgoals are necessary for building a generalizable policy that is widely applicable to different tasks.

Prevailing strategies usually utilize task-specific subgoals, such as representations extracted from essential states or trajectories (Nachum et al., 2018a; Li et al., 2021; Jiang et al., 2019b), or select a subgoal space based on prior knowledge, such as (a subspace of) the raw state space (Nachum et al., 2018b; Zhang et al., 2020). Such subgoals depend on the specific task and thus often cannot be reused across tasks. For instance, in a maze navigation task, a subgoal representing coordinates in one maze may not be reachable in another maze, even though the two mazes can be similar. As a result, policies based on these subgoals cannot be reused in different tasks.

To construct reusable subgoals, a readily applicable choice is to use invariant abstract physical quantities as subgoals, such as directions in a navigation task. These abstract subgoals are usually task-agnostic and thereby naturally reusable. However, how to build a generalizable policy based on such abstract subgoals remains a challenge due to the transition mismatch problem, which is common in HRL (Zhang et al., 2022; Levy et al., 2018) and can lead to a non-stationary learning process.
In general, the transition mismatch problem emerges when the high-level policy evaluates subgoals with incorrect transitions and rewards due to an inadequately trained low-level policy. To introduce the problem clearly, we illustrate it in a challenging task widely considered by prior HRL methods, namely a long-horizon maze navigation problem based on a legged robot, although our idea is not limited to this particular task. Suppose the high-level policy produces a subgoal to go to the position with coordinates (x, y), but the immature low-level policy instead reaches (x′, y′). The high-level policy will then evaluate the subgoal using the incorrect reward obtained at (x′, y′), which leads to inaccurate value estimation and can even collapse the hierarchical policy learning process (Igl et al., 2020b; Wang et al., 2021).

Previous work proposes to alleviate the mismatch problem in single-task settings by refining the wrong transition, e.g., relabeling it with the correct transition obtained from the replay buffer (Levy et al., 2018; Nachum et al., 2018b; Kim et al., 2021). However, these methods fail to solve the transition mismatch problem in the multi-task setting, where it can be exacerbated by the confusion of changing tasks that share the same abstract subgoals. Since successful trajectories are not identical across different tasks, relabeling with sampled trajectories may aggravate the mismatch.

In this paper, we propose a novel HRL framework called ϵ-Invariant HRL to leverage task-agnostic abstract subgoals while alleviating the influence of the transition mismatch problem. We propose a method called ϵ-Invariant Randomization that injects controllable stochasticity into the low-level policy during the training of the high-level policy. The key idea is to view the mismatch as a form of randomness and train the high-level policy to be adaptable to this randomness.
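The randomization idea above can be sketched as a thin wrapper around the low-level policy. This is an illustrative sketch only, not the paper's implementation: the names `EpsilonRandomizedLowLevel` and `GOALS`, and the choice of replacing the commanded subgoal with a uniformly random one, are our assumptions.

```python
import random

GOALS = ["x+", "x-", "y+", "y-"]  # abstract directional subgoals (illustrative)

class EpsilonRandomizedLowLevel:
    """Hypothetical sketch: with probability epsilon, the low-level policy
    executes a random subgoal instead of the commanded one, mimicking an
    imperfect controller so the high-level policy learns under mismatch."""

    def __init__(self, low_level_act, epsilon=0.1):
        self.low_level_act = low_level_act  # callable: (state, goal) -> action
        self.epsilon = epsilon

    def act(self, state, goal):
        # Inject controllable stochasticity into subgoal execution.
        if random.random() < self.epsilon:
            goal = random.choice(GOALS)
        return self.low_level_act(state, goal)
```

During high-level training, the wrapper stands in for the true low-level policy, so the high-level value estimates already account for execution randomness.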
We term the subgoals introduced with this randomness ϵ-Invariant subgoals (defined in Section 3.1). Based on this framework, we propose a parallel synchronous algorithm derived from A2C (Mnih et al., 2016) that evaluates the online policy and updates the network with an expected gradient over multiple trajectories sampled with the same parameters, rather than with a single gradient. This alleviates the mismatch problem through the expected transition, reducing the influence of incorrect transitions and rewards. Such parallel algorithms have been shown to handle changing scenarios and environmental randomness (Hou et al., 2022; Espeholt et al., 2018). With the mismatch problem addressed, our HRL policy can leverage general task-agnostic subgoals and generalize even to unseen tasks.

To demonstrate the superiority of our HRL framework, we extend the widely-used benchmark based on (Duan et al., 2016) using the MuJoCo simulator (Todorov et al., 2012). These experiments control high-dimensional robots to solve long-horizon maze navigation tasks in different mazes with sparse rewards (see details in Section 4.1). Some tasks take place in stochastic environments with changing structures to test the robustness of the policy. On these difficult tasks, our method achieves state-of-the-art results compared with the most advanced RL and HRL methods. Our method also exhibits novel generalization abilities on tasks with unseen maze structures, even with unseen pre-trained robots. To the best of our knowledge, such complex generalization tasks can hardly be solved by previous methods, and we are the first to build an HRL policy that generalizes across different mazes and different robots.

In summary, our contributions are three-fold:
1. We devise an HRL framework for generalization in high-dimensional maze navigation control tasks, with a theoretical guarantee.
2. We propose a randomization method for the transition mismatch problem in generalization tasks, along with an algorithm that enables stable learning.
3. We provide a new benchmark for evaluating approaches to high-dimensional maze navigation tasks, on which our method outperforms SOTA algorithms. To the best of our knowledge, we are the first to build policies that generalize to such zero-shot navigation tasks with unseen mazes and unseen robots.
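The expected-gradient update described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: gradients are represented as plain lists of floats, and each trajectory step is assumed to carry a REINFORCE-style score-function term `grad_logpi` and a reward.

```python
def trajectory_gradient(trajectory, gamma=0.99):
    """Per-trajectory policy-gradient estimate: sum_t G_t * grad_logpi_t,
    where G_t is the discounted return from step t.
    Each step is a pair (grad_logpi, reward), grad_logpi a list of floats."""
    n_params = len(trajectory[0][0])
    grad = [0.0] * n_params
    ret = 0.0
    for grad_logpi, reward in reversed(trajectory):
        ret = reward + gamma * ret  # discounted return-to-go
        for i in range(n_params):
            grad[i] += ret * grad_logpi[i]
    return grad

def expected_gradient(trajectories, gamma=0.99):
    """Synchronous update: average the gradient over several trajectories
    sampled with the SAME parameters, so a single incorrect transition
    contributes only 1/N of the update."""
    grads = [trajectory_gradient(t, gamma) for t in trajectories]
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]
```

Averaging over N synchronously collected trajectories is what damps the noise from occasional mismatched transitions, in the spirit of synchronous A2C.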

2. PRELIMINARIES

We formulate the task in this paper as a goal-conditioned Markov decision process (MDP) (Sutton & Barto, 2018), defined as a tuple ⟨S, G, A, P, R, γ⟩. S is the state space, A is the action space, and G is the goal space, a set of consistent invariant actions. In this paper we focus on maze-navigation tasks, so we choose the relative displacement directions "up, down, left, right" (or "x+, x-, y+, y-" in the coordinate system of the environment) as the goals (see details in Section 3.1). P is the transition probability matrix, where P(s′|s, a) is the one-step transition probability. R(s, a) is the reward function, and γ ∈ [0, 1) is the discount factor.

Goal-conditioned HRL. We consider a framework consisting of two hierarchies: a high-level policy π_H = π(g|s; θ_h) and a low-level policy π_L = π(a|s, g; θ_l), where θ_h and θ_l are the parameters of the two policies, each parameterized by a neural network. At a high-level timestep t, the high-level policy generates a high-level action, i.e., a subgoal g_t ∈ G, by g_t ∼ π(g|s_t; θ_h). The
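The two-level interaction under this goal-conditioned MDP can be sketched as a simple rollout loop. This is a minimal sketch under our own assumptions: the policies and `env_step` are stand-ins (plain callables, not the paper's networks), and the subgoal-resampling interval `k` is illustrative.

```python
GOALS = ["x+", "x-", "y+", "y-"]  # abstract directional subgoal space G

def rollout(high_policy, low_policy, env_step, s0, horizon=20, k=5):
    """high_policy(s) -> g in G; low_policy(s, g) -> raw action a;
    env_step(s, a) -> (s', r). The high-level policy is queried every
    k environment steps (a high-level timestep); the low-level policy
    acts at every step, conditioned on the current subgoal g."""
    s, total_reward, goal = s0, 0.0, None
    for t in range(horizon):
        if t % k == 0:                 # high-level timestep: resample subgoal
            goal = high_policy(s)
        a = low_policy(s, goal)        # raw action conditioned on the subgoal
        s, r = env_step(s, a)
        total_reward += r
    return total_reward
```

For example, with a 1-D state, a high-level policy that always emits "x+", and a low-level policy that steps +1 for "x+", the loop accumulates one unit of reward per step under a reward of 1 for rightward motion.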

