ϵ-INVARIANT HIERARCHICAL REINFORCEMENT LEARNING FOR BUILDING GENERALIZABLE POLICY
Anonymous authors
Paper under double-blind review

Abstract

Goal-conditioned Hierarchical Reinforcement Learning (HRL) has shown remarkable potential for solving complex control tasks. However, existing methods struggle in tasks that require generalization, since the learned subgoals are highly task-specific and therefore hardly reusable. In this paper, we propose a novel HRL framework called ϵ-Invariant HRL that uses abstract, task-agnostic subgoals reusable across tasks, resulting in a more generalizable policy. Although such subgoals are reusable, the transition mismatch problem caused by inevitably inaccurate value estimates of subgoals can lead to non-stationary learning and even collapse. We mitigate this problem by training the high-level policy to adapt to stochasticity deliberately injected into the low-level policy. As a result, our framework can leverage reusable subgoals to build a hierarchical policy that generalizes effectively to unseen tasks. Theoretical analysis and experimental results on continuous-control navigation tasks and challenging zero-shot generalization tasks show that our approach significantly outperforms state-of-the-art methods.

1. INTRODUCTION

Goal-conditioned Hierarchical Reinforcement Learning (HRL) methods have shown remarkable potential to solve complex tasks (Nachum et al., 2018b; Kim et al., 2021; Costales et al., 2021), such as continuous control tasks that require long-horizon navigation (Li et al., 2021; Gürtler et al., 2021). Goal-conditioned HRL uses subgoals to decompose the original task into several sub-tasks, training a high-level policy that outputs subgoals and a low-level policy that executes raw actions conditioned on the subgoals. The performance of goal-conditioned HRL therefore relies mainly on a well-designed subgoal space. In particular, in many complex realistic scenarios such as robotics, reusable subgoals are necessary for building a generalizable policy that is widely applicable to different tasks. Prevailing strategies usually utilize task-specific subgoals such as representations extracted from essential states or trajectories (Nachum et al., 2018a; Li et al., 2021; Jiang et al., 2019b), or select a subgoal space based on prior knowledge such as (a subspace of) the raw state space (Nachum et al., 2018b; Zhang et al., 2020). Such subgoals depend on the specific task and thus can rarely be reused across tasks. For instance, a subgoal representing coordinates in one maze may not be reachable in another maze, even when the two mazes are similar; consequently, policies based on these subgoals cannot be reused across tasks. To construct reusable subgoals, a readily applicable choice is to use invariant abstract physical quantities as subgoals, such as directions in a navigation task. These abstract subgoals are usually task-agnostic and thereby naturally reusable. However, building a generalizable policy on top of such abstract subgoals remains challenging due to the transition mismatch problem, which is common in HRL (Zhang et al., 2022; Levy et al., 2018) and can lead to a non-stationary learning process.
In general, the transition mismatch problem emerges when the high-level policy evaluates subgoals with incorrect transitions and rewards due to an inadequately trained low-level policy. To introduce the problem clearly, we illustrate it in a challenging task widely considered by prior HRL methods, namely long-horizon maze navigation with a legged robot, although our idea is not limited to this particular task. Suppose the high level produces a subgoal to go to the position with coordinates (x, y), but the immature low-level policy reaches (x′, y′). The high level will then evaluate the subgoal with the incorrect reward obtained at (x′, y′), which leads to inaccurate value estimation and can even collapse the hierarchical policy learning process (Igl et al., 2020b; Wang et al., 2021). Previous work alleviates the mismatch problem in single-task settings by refining the wrong transition, e.g., relabeling it with the correct transition obtained from the replay buffer (Levy et al., 2018; Nachum et al., 2018b; Kim et al., 2021). However, these methods fail to solve the transition mismatch problem in the multi-task setting, where it can be exacerbated by the confusion of changeable tasks sharing the same abstract subgoals. Since successful trajectories are not identical across tasks, relabeling with sampled trajectories may even aggravate the mismatch. In this paper, we propose a novel HRL framework called ϵ-Invariant HRL that leverages task-agnostic abstract subgoals while alleviating the influence of the transition mismatch problem. We propose a method called ϵ-Invariant Randomization that injects controllable stochasticity into the low-level policy during the training of the high-level policy. The key idea is to view the mismatch as a form of randomness and train the high-level policy to adapt to this randomness.
We term the subgoals introduced with randomness ϵ-Invariant subgoals (defined in Section 3.1). Based on this framework, we propose a parallel synchronous algorithm derived from A2C (Mnih et al., 2016) that evaluates the online policy and updates the network with an expected gradient over multiple trajectories sampled with the same parameters, instead of a single-trajectory gradient. This alleviates the mismatch problem through the expected transition, reducing the influence of incorrect transitions and rewards. Parallel algorithms have been shown to cope well with changeable scenarios and environmental randomness (Hou et al., 2022; Espeholt et al., 2018). With the mismatch problem addressed, our HRL policy can leverage general task-agnostic subgoals and generalize even to unseen tasks. To demonstrate the superiority of our HRL framework, we extend the widely used benchmark based on the work of Duan et al. (2016) using the MuJoCo simulator (Todorov et al., 2012). These experiments require controlling high-dimensional robots to solve long-horizon maze navigation tasks in different mazes with sparse rewards (see Section 4.1 for details). Some tasks take place in a stochastic environment with changeable structures to test the robustness of the policy. On these difficult tasks, our method achieves state-of-the-art results compared with the most advanced RL and HRL methods. Our method also shows novel generalization abilities on tasks with unseen maze structures, even with unseen pre-trained robots. To the best of our knowledge, such complex generalization tasks can hardly be solved by previous methods, and we are the first to build an HRL policy that generalizes across different mazes and different robots. In summary, our contributions are three-fold:
1. We devise an HRL framework that generalizes in high-dimensional maze navigation control tasks, with a theoretical guarantee.
2. We propose a randomization method for the transition mismatch problem in generalization tasks, along with an algorithm that enables stable learning.
3. We provide a new benchmark for evaluating approaches to high-dimensional maze navigation tasks, on which our method outperforms SOTA algorithms. To the best of our knowledge, we are the first to build policies that generalize zero-shot to navigation tasks with unseen mazes and unseen robots.

2. PRELIMINARIES

We formulate the task in this paper as a goal-conditioned Markov decision process (MDP) (Sutton & Barto, 2018), defined as a tuple $\langle S, G, A, P, R, \gamma \rangle$. $S$ is the state space, $A$ is the action space, and $G$ is the goal space, a set of consistent invariant actions. In this paper we focus on maze-navigation tasks, so we choose the relative displacements of the directions "up, down, left, right" (or "x+, x-, y+, y-" in the coordinate system of the environment) as the goals (see details in Section 3.1). $P$ is the transition probability matrix and $P(s'|s, a)$ is the one-step transition probability. $R(s, a)$ is the reward function and $\gamma \in [0, 1)$ is the discount factor.
Goal-conditioned HRL. We consider a framework consisting of two hierarchies: a high-level policy $\pi_H = \pi(g|s; \theta_h)$ and a low-level policy $\pi_L = \pi(a|s, g; \theta_l)$, where $\theta_h$ and $\theta_l$ are the parameters of the two policies, each parameterized by a neural network. At high-level timestep $t$, the high-level policy generates a high-level action, i.e., a subgoal $g_t \in G$ by $g_t \sim \pi(g|s_t; \theta_h)$. The high-level policy gives a goal every $k$ steps, and the low-level policy executes the subgoal $g_t$ over those $k$ steps. The high-level policy then receives the accumulated external reward from the environment, $r^h_t = \sum_{i=tk}^{tk+k-1} R(s_i, a_i)$. The goal of the high level is to maximize the expected return $\mathbb{E}[\sum_{t=0}^{H} \gamma^t r^h_t]$ by driving the low-level policy. The low-level policy receives an additional intrinsic reward $r^l$ for efficient learning:
$$r^l(s_{tk}, g_t, a_{tk}, s_{tk+1}) = \alpha \cdot \cos\langle g_t, \phi(s_{tk+1}) - \phi(s_{tk}) \rangle \cdot \|\phi(s_{tk+1}) - \phi(s_{tk})\| \quad (1)$$
where $g_t$ is an invariant direction vector from $\{x+, x-, y+, y-\}$, $\phi$ is the coordinate extracting function from state $s$, and $\alpha$ is a constant coefficient. This reward means that the further the agent moves towards the goal direction, the more reward it obtains.
Unlike current HRL methods (Kim et al., 2021; Nachum et al., 2018b), we do not use the relative distance to specific coordinates as the intrinsic reward; instead we use the fixed displacement towards invariant, orthogonal goal directions. With this reward function the agent learns to walk towards an abstract direction given by the subgoal rather than towards a specific location, which preserves the generalization potential across tasks.
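As a concrete illustration, the intrinsic reward of Equation 1 can be sketched as follows. This is a minimal sketch assuming $\phi(s)$ is already extracted as a 2-D coordinate vector; the function name and the default for α are ours, not from the paper.

```python
import numpy as np

def intrinsic_reward(phi_s, phi_s_next, g, alpha=1.0):
    """Directional intrinsic reward: pay the low level for displacement along
    the subgoal direction g rather than for reaching a specific coordinate.
    Note that cos<g, delta> * ||delta|| equals the projection of delta onto
    the unit vector along g."""
    delta = phi_s_next - phi_s                 # displacement in coordinates
    norm = np.linalg.norm(delta)
    if norm == 0.0:
        return 0.0
    cos_angle = np.dot(g, delta) / (np.linalg.norm(g) * norm)
    return alpha * cos_angle * norm
```

Moving one unit along "x+" under subgoal g = (1, 0) yields reward α, a purely sideways move yields 0, and a backwards move is penalized.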

3. APPROACH WITH ANALYSIS

Figure 2: HRL framework illustration. With the ϵ-invariant subgoals, the high-level policy can adapt to different maze tasks and different pre-trained low-level robots.
Motivation. We aim to build an HRL policy that solves maze-navigation and generalization tasks with high-dimensional continuous-control robots. The difficulty of these tasks usually comes from the large exploration space and sparse rewards. We therefore seek a subgoal setting that not only reduces the difficulty of exploration but is also reusable across tasks. Many previous RL works study tabular mazes, such as the BabyAI platform (Chevalier-Boisvert et al., 2018); if a maze with a continuous state space can be discretized into several blocks, the learning process becomes far more efficient. We therefore define the subgoal space with abstract, task-agnostic subgoals such as the directions "front, back, left, right". Each subgoal instructs the agent to move a fixed step in the given direction. In this setting, the maze is divided into blocks according to the movement distance from the high-level perspective. The high-level policy can thus make decisions over a finite discrete space, which significantly improves exploration efficiency. Meanwhile, such subgoals are physically invariant and can be reused in any maze, so a policy based on them can also generalize across tasks.
Challenges. Although the motivation is simple, implementing it requires overcoming two challenges. (1) Will the hierarchical policy in our setting seriously degrade the performance of the original optimal policy? How can the error be evaluated, and is it controllable? (2) In HRL, a fixed oracle subgoal setting with separate learning processes often leads to collapse due to the mismatch between the two hierarchies. How can the agent learn stably and safely while overcoming the mismatch?
In the following sections, we answer these two questions respectively: a theoretical analysis discusses the effectiveness of our method, and the framework and algorithm design enable stable policy learning.

3.1. THEORETICAL ANALYSIS

Due to space limitations, this section presents only the main idea of the proof; details are given in Appendix A. Our main idea for proving the effectiveness of our method is to show that the error between the optimal policy and ours, in a suitable metric, is bounded and controllable, so that our policy can serve as a suboptimal approximation of the optimal one. The results in stochastic MDPs show that the error between our hierarchical policy and the optimal policy can be bounded. First, we give the mathematical definition of ϵ-invariant subgoals (as high-level abstract actions).
Definition 3.1. (ϵ-invariant subgoals) For every environment $E$ and an ϵ-invariant subgoal $g_\Delta \in G_\Delta$, with transitions of the subgoal from $s$ to $s'$ and from $s'$ to $s''$ under the optimal low-level policy, for any $s$:
$$\mathbb{E}_{s'}[\phi(s) - \phi(s')] = \mathbb{E}_{s''}[\mathbb{E}_{s'}[\phi(s')] - \phi(s'')] \quad (2)$$
where $\sigma_{\phi(s'),x}, \sigma_{\phi(s'),y} \le \epsilon$.
Here $\pi_H(g|s)$ is the high-level policy, $s'$ and $s''$ are the states achieved by the ϵ-invariant subgoal $g_\Delta$ under the optimal low-level policy, $\phi$ is the coordinate extracting function, and $\phi(s)$ is the coordinate vector of observation $s$. $\sigma_{\phi(s'),x}$ and $\sigma_{\phi(s'),y}$ are the variances of the achieved state in the $x$ and $y$ directions. We consider high-level policy learning in a stochastic MDP, so $s'$ is a random variable depending on $s$ and $g_\Delta$, and we bound the variance of $s'$ by a small constant $\epsilon$. The equation states that such subgoals yield an expected invariant direction in the coordinate system with a small amount of randomness.
Error Bound in Stochastic MDP. As shown in Fig. 3, the trajectories of the flat RL method, a traditional HRL method, and our method differ; accordingly, we build our high-level policy in a stochastic MDP with stochastic subgoals. We consider goal-conditioned HRL where the high-level policy makes a decision every $k$ steps.
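Definition 3.1 can be checked empirically: given sampled transitions for one subgoal, estimate the mean achieved displacement and the per-axis spread, and compare the spread against ε. The following is a hypothetical Monte-Carlo sketch of such a check, not the paper's procedure; all names and numbers are illustrative.

```python
import numpy as np

def is_eps_invariant(phi_pairs, eps):
    """Empirical check of Definition 3.1 for a single subgoal.

    phi_pairs: array of shape (n, 2, 2) holding n sampled (phi(s), phi(s'))
    pairs for the same subgoal executed by the low-level policy.  Returns the
    mean achieved displacement and whether the per-axis spread stays <= eps."""
    deltas = phi_pairs[:, 1] - phi_pairs[:, 0]   # achieved displacements
    sigma = deltas.std(axis=0)                   # spread along x and y
    return deltas.mean(axis=0), bool(np.all(sigma <= eps))

# Simulated "x+" subgoal: an expected unit move along x plus a small
# Gaussian deviation, starting from arbitrary positions in the maze.
rng = np.random.default_rng(0)
start = rng.uniform(-5.0, 5.0, size=(1000, 2))
end = start + np.array([1.0, 0.0]) + rng.normal(0.0, 0.05, size=(1000, 2))
mean_delta, ok = is_eps_invariant(np.stack([start, end], axis=1), eps=0.1)
```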
To compare with the HRL method, the optimal value function of the flat RL method in a deterministic MDP can be rewritten over $k$ steps as:
$$V^*_k(s_t) = \mathbb{E}_{\tau_k}[R(\tau_k) + \gamma^k V(s_{t+k})] \quad (3)$$
where $\tau_k$ is a $k$-step trajectory under the optimal policy $\pi^*(\tau_k|s_t)$ and $R(\tau_k)$ is the accumulated discounted reward over those $k$ steps. Equation 3 can be rewritten as $V^*_k(s_t) = \sum_{\tau_k} P_\pi(\tau_k|s_t)[R(\tau_k) + \gamma^k V(s_{t+k})]$. With Equation 3 we can define the error between the flat RL method and ours: $|V^*_k(s_t) - V^\epsilon_H(s_t)|$. To analyze and control this error, we introduce the value $V^*_H(s_t)$ of the optimal stochastic HRL with coordinates as subgoals:
$$|V^*_k(s_t) - V^\epsilon_H(s_t)| \le \underbrace{|V^*_k(s_t) - V^*_H(s_t)|}_{\text{HRL error}} + \underbrace{|V^*_H(s_t) - V^\epsilon_H(s_t)|}_{\text{subgoal error}}$$
Here we only state the conclusions of our analysis:
$$\|V^*_k - V^*_H\|_\infty \le \frac{\nu_k k R_{\max}}{2(1-\gamma^k)} + \frac{\gamma^k \nu_k R_{\max}}{2(1-\gamma^k)^2}$$
and
$$\|V^*_H - V^\epsilon_H\|_\infty \le \frac{L_\phi (2\delta_{\max} + 3\sqrt{2}\epsilon) R_{\max}}{2(1-\gamma^k)} \left(k + \frac{\gamma^k}{1-\gamma^k}\right) \quad (6)$$
where $\delta_{\max} = \max_{g, g_\Delta}\{\|g\|, \|g_\Delta\|\}$, $R_{\max}$ is the bound of the reward function, $\nu_k$ is the transition mismatch rate between the optimal RL method and the HRL method with stochastic coordinate subgoals, and $L_\phi$ is the Lipschitz constant. The proof is given in Appendix A. The result shows that the error between our method and the optimal policy can indeed be bounded and controlled, so our policy can serve as a suboptimal approximation of the original optimal one.
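To make the dependence of these bounds on $\nu_k$ and $\epsilon$ concrete, the two right-hand sides can be evaluated numerically. The function names and parameter values below are ours, chosen only for illustration.

```python
import math

def hrl_error_bound(nu_k, k, r_max, gamma):
    """Right-hand side of the HRL-error bound: grows linearly with the
    k-step transition mismatch rate nu_k, and vanishes when nu_k = 0."""
    gk = gamma ** k
    return (nu_k * k * r_max / (2.0 * (1.0 - gk))
            + gk * nu_k * r_max / (2.0 * (1.0 - gk) ** 2))

def subgoal_error_bound(l_phi, delta_max, eps, k, r_max, gamma):
    """Right-hand side of Equation 6: the subgoal error shrinks linearly in
    the randomness bound eps, reaching its minimum as eps -> 0."""
    gk = gamma ** k
    return (l_phi * (2.0 * delta_max + 3.0 * math.sqrt(2.0) * eps) * r_max
            / (2.0 * (1.0 - gk))) * (k + gk / (1.0 - gk))
```

For instance, a deterministic high level ($\nu_k = 0$) makes the HRL error vanish, while tightening $\epsilon$ monotonically tightens the subgoal error.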

3.2. FRAMEWORK

In this section, we show how to build a hierarchical framework that handles the abstract ϵ-invariant subgoals. In general, the high-level policy decides on a direction as the subgoal from pixel observations, and the low-level policy receives the desired direction and walks in that direction for a fixed distance. Thanks to the abstractness of the subgoals defined in Section 3.1, although the high-level and low-level policies could learn together, we train them separately to accelerate learning. Our method thus decomposes the learning process into two stages with discrete general subgoals: the high level learns from external rewards and the low level learns from intrinsic rewards. Essentially, the discrete general subgoals are invariant displacements in fixed directions, which reduce the difficulty of exploration and improve learning efficiency by discretizing the state space into small blocks. Meanwhile, they are reusable and general across maze-navigation tasks, preserving the generalization abilities of the high-level policy.
Modular Pixel Observation. As we use directions as abstract subgoals for the high-level policy, their effect can hardly be perceived by the agent from the robot's posture alone. We add a top-down pixel observation looking down at the agent (see Figure 2): a camera following the robot observes a region of fixed size. From this perspective, the effect of a direction subgoal is invariant and consistent across maze environments, and the agent can observe changes in the environment close to the robot. Moreover, pixel observations are more adaptable than raw posture data, which improves generalization. We provide the pixel observation to the high-level policy to decide which direction to go.
ϵ-Invariant Randomization.
As we train the two hierarchies separately with abstract subgoals, we train the high level using movement executed directly by the simulator rather than the actual walking of the robot, for faster learning. However, abstract subgoals bring a mismatch problem between the two levels: controlling a legged robot to move strictly along a direction is very hard for RL methods, so even a well-learned walking policy retains a slight deviation perpendicular to the goal direction. Such inevitable deviation (a manifestation of the mismatch) causes incorrect evaluation of the subgoals, which leads to non-stationarity and even collapse in a changeable environment. To alleviate the problem, we introduce the ϵ-Invariant Randomization method, which casts the high-level learning process as a stochastic MDP. When training the high-level policy, the simulator moves the robot with random postures and random positions. The postures are sampled from the walking postures of a well-learned low-level policy under random subgoals, recorded as offline data. The random position means a random deviation is added to the original movement: for instance, if the high-level policy gives the "x+" direction as the subgoal, the simulator moves the robot towards "x+" with a small offset $\Delta_x, \Delta_y \sim \mathcal{N}(0, \sigma)$, where $\sigma \le \epsilon$ is much smaller than the moving distance. As a result, if the high-level policy can overcome this randomness, a wrong execution by the low-level policy is simply another sample from the randomness distribution, which improves the robustness of the high-level policy. The network structure and further details are given in Appendix C.
Algorithm. As we introduce randomness into the learning process to alleviate the mismatch problem, the high level must overcome the stochasticity in the environment to obtain a well-performing policy.
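A minimal sketch of the randomized oracle transition described above, assuming the robot's planar position is directly settable in the simulator; the function name and default step length are illustrative, not from the paper.

```python
import numpy as np

# Unit direction vectors for the four invariant subgoals.
DIRECTIONS = {"x+": (1.0, 0.0), "x-": (-1.0, 0.0),
              "y+": (0.0, 1.0), "y-": (0.0, -1.0)}

def randomized_oracle_move(pos, subgoal, step=1.0, sigma=0.05, rng=None):
    """During high-level training, teleport the robot one step along the
    chosen direction plus a Gaussian offset (sigma <= eps, much smaller than
    the step), imitating the imperfect low-level walker."""
    rng = rng if rng is not None else np.random.default_rng()
    direction = np.asarray(DIRECTIONS[subgoal])
    offset = rng.normal(0.0, sigma, size=2)   # Delta_x, Delta_y ~ N(0, sigma)
    return np.asarray(pos, dtype=float) + step * direction + offset
```

A high-level policy trained against this transition treats any small execution error of the real low-level walker as just another draw from the offset distribution.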
To handle this randomness, we propose a parallel algorithm called parallel expected gradient advantage actor-critic (PEG-A2C), since parallel algorithms have been shown to adapt well to changeable environments (Hou et al., 2022; Espeholt et al., 2018).
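The expected-gradient idea can be sketched on a toy objective: several workers roll out with the same parameters, and the update applies the mean of their gradient estimates rather than a single noisy one. The worker and gradient function below are hypothetical stand-ins, not the paper's actual PEG-A2C implementation.

```python
import numpy as np

class ToyWorker:
    """Stand-in for one parallel rollout worker.  Here a 'trajectory' is just
    the noise it contributes to a gradient estimate (a toy, not an actual
    environment rollout)."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def collect_trajectory(self, params):
        return self.rng.normal(0.0, 0.5, size=params.shape)

def peg_update(params, workers, grad_fn, lr=0.1):
    """Expected-gradient step: all workers sample with the SAME parameters,
    and the update uses the mean of their gradients instead of a single
    trajectory's gradient, damping incorrect transitions and rewards."""
    grads = [grad_fn(params, w.collect_trajectory(params)) for w in workers]
    return params - lr * np.mean(grads, axis=0)  # synchronous averaged step

# Toy check on f(p) = ||p||^2 (true gradient 2p); each worker's estimate is
# corrupted by trajectory noise, and averaging across workers tames it.
grad_fn = lambda p, noise: 2.0 * p + noise
params = np.array([4.0, -4.0])
workers = [ToyWorker(seed=i) for i in range(16)]
for _ in range(100):
    params = peg_update(params, workers, grad_fn)
```

The synchronous average plays the same role as the expected transition in the text: a single worker's bad sample only perturbs the step by 1/N of its magnitude.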

4. EXPERIMENTS

In our experiments we aim to answer the following questions: (1) Can our method learn stably in complex maze-navigation tasks with a robot that has a high-dimensional action space? (2) How does our method perform compared with advanced RL and HRL algorithms? (3) Can our method generalize to unseen maze tasks, and even unseen robots, without retraining?

4.1. ENVIRONMENTS SETTING

We evaluate on a suite of MuJoCo (Todorov et al., 2012) maze-navigation tasks modified from the benchmark of Duan et al. (2016). Unlike the settings of Kim et al. (2021) and Nachum et al. (2018b), we use sparse external rewards to make the tasks more challenging. The agent must control an ant robot to reach a specified area of the maze far from the initial position. For completeness, we also add dense-reward experiments, where the per-step reward is $1/(1+d)$, with $d$ the Euclidean distance between the agent and the current goal in the coordinate system. The mazes are Ant ⊃-Maze, Ant Random Square Maze, Ant S-shaped Maze, Ant Spiral Maze, and Generalization Maze (see Fig. 4). All are difficult mazes with long navigation horizons; the goal is to pass a door or reach a specific region. In particular, Ant Random Square Maze is a stochastic environment with random initial positions for both the agent and the door. Generalization Maze consists of three unseen mazes with different structures. Ant S-shaped Maze and Ant Spiral Maze are extremely long-horizon mazes requiring at least thousands of steps even under the optimal policy. More details are given in Appendix D.
Implementation. Our HRL agent consists of two policies. The high-level policy is trained with our PEG-A2C algorithm, and the low-level policy is a goal-conditioned policy modified from the DroQ algorithm (Hiraoka et al., 2021), with subgoals being the directions 'x+', 'y+', 'x-', 'y-'. As the MuJoCo simulator allows direct movement of coordinates, we train the two policies separately in two stages: the high-level policy learns with direct movement plus randomness, and the low level learns with the intrinsic reward defined in Equation 1.
Since the subgoals and the low-level policy can be reused across tasks, we mainly report the high-level training curves and success rates on different tasks, using the same well-learned low-level policy for the ant robot.
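The dense-reward variant mentioned above can be written directly; a small sketch, with the function name ours.

```python
import math

def dense_reward(agent_xy, goal_xy):
    """Per-step dense reward 1 / (1 + d), where d is the Euclidean distance
    between the agent and the current goal in the coordinate system."""
    d = math.dist(agent_xy, goal_xy)
    return 1.0 / (1.0 + d)
```

The reward is 1 exactly at the goal and decays smoothly with distance, which gives the baselines a shaped signal in the dense-reward experiments.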

4.2. BASELINES

We compare our method with several state-of-the-art (SOTA) model-free RL and HRL algorithms on the high-dimensional continuous-control tasks described above.
DroQ. The SOTA RL algorithm for high-dimensional continuous control (Hiraoka et al., 2021), and among the most sample-efficient RL methods. It is effective for both simple robots like Hopper and complex robots with many degrees of freedom like Humanoid.
HIGL. The SOTA HRL algorithm for high-dimensional maze navigation with both sparse and dense rewards in the MuJoCo suite (Kim et al., 2021).
HESS. One of the most advanced subgoal-learning HRL methods for continuous control tasks (Li et al., 2021).

RAND-H.

A variant of our HRL method in which the high-level policy is random and the low-level policy is well-learned. This baseline serves as an ablation to show the capabilities of our subgoal setting.
our-oracle. A variant of our HRL method in which the high-level policy is trained without the low-level policy: the robot moves via oracle coordinate movements executed directly by the simulator instead of the low-level policy. This baseline serves as an ablation to show the efficiency of the high-level policy.

4.3. RESULTS OF COMPARATIVE EXPERIMENTS

For a fair comparison, we report evaluation reward curves of the full hierarchy (both the high level and the low level), even though our hierarchical policies can be trained separately. The curves are shown in Figure 5: our method outperforms both the SOTA RL and HRL methods. Curves are truncated at the same episode to align results across methods, and all curves are smoothed with a sliding window; more details are in Appendix D.
Robustness to Stochasticity and Mismatch. Among these tasks, 'Ant Random Square' is a square maze with random initial positions for the robot and the goal. Here the mismatch problem becomes critical because the successful trajectory changes from episode to episode, so correction based on historical samples is invalid. As Figure 5 shows, HIGL obtains reward at first but its return gradually declines, whereas our method learns with a steadily increasing return. This shows that our method adapts to such stochastic environments and learns stably, confirming that our algorithm can indeed overcome randomness, while other methods can hardly adapt.
Comparative Result. The non-hierarchical DroQ performs poorly on all sparse-reward tasks, demonstrating the strength of hierarchical structures for long-horizon tasks with sparse rewards. HESS also performs poorly, accumulating little return, due to the large number of exploration steps required under sparse rewards, which is consistent with the results reported in its paper (Li et al., 2021). The SOTA HIGL method outperforms the other baselines but underperforms ours: its average reward rises gradually but more slowly than ours, showing the superiority of our method. For tasks with dense rewards (Figure 6), we use average per-step reward and goal-reaching success rate to evaluate performance.
Dense rewards improve the previous methods considerably, but their success-rate curves remain below ours.
Ablation Study. RAND-H and our-oracle give the ablation results. The high-level policy with oracle movement (our-oracle) learns stably with our algorithm. Even with a random high-level policy, the low-level policy (RAND-H) obtains rewards at a certain frequency under our subgoal setting, preserving stable feedback for learning. These two ablation curves explain why our method learns faster.
Zero-shot Robot. In this experiment, we test the high-level policy alone with oracle movement and the whole HRL policy, respectively. Table 2 shows the zero-shot result of the high-level policy alone in 'Maze ⊃ shape', demonstrating the adaptive capability of our high-level policy with our subgoal setting. The results show that such generalization tasks can hardly be solved by previous RL and HRL methods due to their specific subgoal or goal settings: they can neither adapt to different maze tasks nor switch the low-level policy without retraining the high-level policy. Our HRL method uses invariant abstract subgoals and can adapt to different scenarios. It is also flexible and can adapt to different unseen robots, with the low-level policy receiving the same subgoals, at a certain success rate.
Visualization Results. To show the generalization ability of our HRL policy, Figure 7 presents trajectory heatmaps of our method and HIGL. The HIGL agent mainly moves near the initial position and is blocked by unseen structures, whereas our agent reaches the door far more often.

5. RELATED WORK

Goal-conditioned HRL methods. In recent years, hierarchical reinforcement learning has been widely studied for complex high-dimensional control and long-horizon tasks. Many goal-conditioned HRL methods address subgoal learning, handling, and discovery for such tasks (Nachum et al., 2018b; Li et al., 2021; Kim et al., 2021; Gürtler et al., 2021; Li et al., 2020; Zhang et al., 2020). These methods usually leverage task-specific subgoals such as specific coordinates (Nachum et al., 2018b), sampled abstract or original states (Nachum et al., 2018a), or abstract trajectory descriptions (Jiang et al., 2019a) to complete complex tasks. In contrast, our framework uses ϵ-invariant subgoals (directions with randomness) as abstract high-level actions, which represent not task-specific descriptions but the invariant relative effects of action sequences. Such subgoals can be reused regardless of state changes across tasks, making them more general.
Generalizable Policy Learning. Constructing policies that can be reused in (or generalize to) new, even unseen, tasks is a long-standing challenge studied from different perspectives (Igl et al., 2020a; Xu et al., 2022; Wang et al., 2019; Wang et al., 2020). Recent work mainly follows three lines: (1) learning reusable or transferable skills, options, or policies for different tasks (Shah et al., 2021; Nam et al., 2021; Klissarov & Precup, 2021); (2) learning policies that adapt to a distribution of environments or tasks with different dynamics via risk-sensitive objective functions (Lyle et al., 2022; Lei & Ying, 2020; Kirsch et al., 2019); (3) learning to extract reusable abstract representations such as language, logical symbols, or graphs from visual observations or state-action trajectories (Jain et al., 2020; Agarwal et al., 2020; Vaezipoor et al., 2021).
Our method belongs to the third line: it uses abstract representations to improve the generalization abilities of the policy. The difference is that we do not use representations expressing tasks within a distribution or invariant abstract trajectories; instead, our abstract representation encodes physically invariant relative locomotion. Such representations are generic across maze-navigation tasks, leading to strong generalization abilities.

6. CONCLUSION

In this paper, we propose a novel HRL framework for building a generalizable hierarchical policy with abstract task-agnostic subgoals. We provide a theoretical analysis of the effectiveness of our method and design an algorithm that overcomes the transition mismatch problem in generalization tasks, yielding a stable policy learning process. Strong results on challenging experiments show the superiority of our method. Moreover, our method achieves zero-shot generalization on unseen tasks that previous methods cannot handle. We believe the idea behind our method can be applied beyond our task setting. In future work, we will focus on more general policy learning for more complex tasks, not limited to HRL.

A THEORETICAL ANALYSIS

With Equation 3, we can define the error between the flat RL method and ours: $|V^*_k(s_t) - V^\epsilon_H(s_t)|$. To analyze and control the error, we introduce the value $V^*_H(s_t)$ of the optimal stochastic HRL with coordinates as subgoals:
$$|V^*_k(s_t) - V^\epsilon_H(s_t)| \le \underbrace{|V^*_k(s_t) - V^*_H(s_t)|}_{\text{HRL error}} + \underbrace{|V^*_H(s_t) - V^\epsilon_H(s_t)|}_{\text{subgoal error}}$$
so we can control the error in two parts: (1) the error from the RL policy to the stochastic HRL policy, and (2) the error from coordinate subgoals to ϵ-invariant subgoals.
HRL Error. For the former (denoted $e_H = |V^*_k(s_t) - V^*_H(s_t)|$), the error bound follows from a corollary of Zhang et al. (2022). We first formalize a critical factor, the transition mismatch rate between the RL policy and the stochastic HRL policy. (The difference from Zhang et al. is that in our work the factor measures the difference between an RL policy and an HRL policy, whereas in their work it measures the difference between two HRL methods.)
Definition A.1. (k-step transition mismatch) For a goal-conditioned MDP $M$ with $P_k(s_{t+k}|s_t, g_t)$ as its $k$-step transition probability under an optimal goal-conditioned policy $\pi_H(g_t|s_t)$, the $k$-step transition mismatch rate of $M$ is defined as:
$$\nu_k \triangleq \max_{s_t, g_t, s_{t+k}} |P_\pi(\tau_k|s_t) - \pi_H(g_t|s_t) P_k(s_{t+k}|s_t, g_t)| \quad (8)$$
The mismatch rate $\nu_k$ captures the difference between the RL policy and the HRL policy over $k$-step trajectory segments. The less uncertainty the high-level policy exhibits, the smaller $\nu_k$; a deterministic high-level MDP yields $\nu_k = 0$. For the error $e_H$, we then have the following conclusion:
Lemma A.2. (Corollary of Theorem 2 in Zhang et al. (2022)) With discount factor $\gamma$, bounded reward function $R_{\max} = \max_{s,a} R(s, a)$, and $\nu_k$ defined in Equation 8,
$$\|V^*_k - V^*_H\|_\infty \le \frac{\nu_k k R_{\max}}{2(1-\gamma^k)} + \frac{\gamma^k \nu_k R_{\max}}{2(1-\gamma^k)^2}$$
Proof.
See Appendix A.1.

Lemma A.2 means that the error between the RL method and the stochastic HRL method is controlled by $\nu_k$. If the randomness of the subgoals is concentrated in a small region with small variance, the HRL policy can approximate the optimal RL policy with a controllable error.

Subgoal Error. To analyze the error caused by our subgoals (denoted $e_S = |V^*_H(s_t) - V^\epsilon_H(s_t)|$), we introduce a new metric. We first consider the regions actually reached under traditional coordinate subgoals and under our ϵ-invariant subgoals. If the reached regions and trajectories of the two HRL policies are close to each other in the coordinate system, then, to some extent, the policies and transitions are similar. Conversely, to measure the influence of the traditional subgoals versus the ϵ-invariant subgoals, the distances and trajectories in the coordinate system directly reflect the difference between them. Thus, we assume:

Assumption A.3 (Lipschitz condition). In the k-step reachable region under an optimal goal-conditioned policy, for a traditional goal-conditioned MDP $M$ with subgoal $g_t$ and k-step transition probability $P_k(s_{t+k} \mid s_t, g_t)$, and for an $M^\epsilon$ with ϵ-invariant subgoals $g_\Delta$ and transitions $P_k(s'_{t+k} \mid s_t, g_\Delta)$, there is:

$$|P_k(s_{t+k} \mid s_t, g_t) - P_k(s'_{t+k} \mid s_t, g_\Delta)| \leq L_\varphi \,\|\varphi(s_{t+k}) - \varphi(s'_{t+k})\| \qquad (10)$$

where $L_\varphi$ is the Lipschitz constant and $\varphi$ is the coordinate-extracting function defined above. The Lipschitz condition is common and widely used in many other RL and HRL works (Wang et al., 2019; Shah et al., 2020; Gogianu et al., 2021). Assumption A.3 means that, if the low-level policy learns well and can execute the subgoals given by the high-level policy, the difference between the two HRL policies can be measured by the Euclidean distance of displacement in the coordinate system. The distance is taken within the region reachable in k steps by the optimal low-level policy, from the same start state $s_t$.
If the reached states are close, the policies are considered similar. With the setting above, we have the following theorem, which provides a suboptimality upper bound for the error $e_S$:

Theorem A.4. With discount factor $\gamma$, bounded reward function $R_{\max} = \max_{s,a} R(s, a)$, Lipschitz constant $L_\varphi$, and variance bound $\epsilon$ defined above, with high probability there is

$$\|V^*_H - V^\epsilon_H\|_\infty \leq \frac{L_\varphi (2\delta_{\max} + 3\sqrt{2}\,\epsilon) R_{\max}}{2(1 - \gamma^k)} \left(k + \frac{\gamma^k}{1 - \gamma^k}\right)$$

where $\delta_{\max} = \max_{g, g_\Delta}\{\|g\|, \|g_\Delta\|\}$.

Proof. See Appendix A.2.

Theorem A.4 means that the error between the traditional HRL method with coordinates as subgoals and our method is constrained by the similarity metric of the two policies. By the conclusions above, our method has a theoretical guarantee of approximating the original optimal policy with a controllable error bound. Meanwhile, in our method, the subgoals can be reused in different maze environments, which leads to superiority in generalization tasks.

A.1 PROOF OF LEMMA 1

Proof. Consider a state $s_t \in S$. For the k-step value function:

$$V^*_k(s_t) = \sum_{\tau_k} P_\pi(\tau_k \mid s_t)\big[R(\tau_k \mid s_t) + \gamma^k V^*(s_{t+k})\big] \qquad (12)$$

where $P_\pi(\tau_k \mid s_t) = P_\pi(s_{t+k} \mid s_t)$ is the k-step transition probability under policy $\pi$, $R(\tau_k \mid s_t) = R(s_{t+k}, s_t) = \sum_{i=t+1,\, s_i \in \tau_k}^{t+k-1} \gamma^i R(s_i)$ is the bounded accumulated return along $\tau_k$, and $\tau_k$ is a possible k-step trajectory obtained by executing the optimal policy $\pi^*$ for k steps. The optimal policy maximizes the external reward from the environment:

$$\pi^* = \arg\max_\pi \mathbb{E}_{(s,a)\sim\pi}\Big[\sum_{t=1}^{T} \gamma^t R(s_t, a_t)\Big]$$

Equation 12 can be rewritten in vector form for any state $s_t \in S$:

$$V^*_k(s_t) = \big\langle P_\pi(s_t),\; R(s_t) + \gamma^k V^* \big\rangle$$

where $\langle \cdot, \cdot \rangle$ is the inner product of vectors, $P_\pi(s_t) \in [0,1]^{|S|}$ with $\|P_\pi(s_t)\|_1 = 1$ for all $s \in S$ denotes the probability distribution vector of the k-step trajectory starting from $s_t$ under the policy $\pi^*$, and $R(s_t) \in [0, kR_{\max}]^{|S|}$ is the accumulated-reward vector of the k-step trajectory.
$V^*$ is the optimal value vector over all states. Similarly, the value of the HRL method with stochastic subgoals can be written as:

$$V^*_H(s_t) = \sum_{g \in G} \pi_H(g_t \mid s_t) \sum_{s_{t+k} \in S} P_k(s_{t+k} \mid s_t, g_t)\big[R(s_{t+k}, s_t) + \gamma^k V^*_H(s_{t+k})\big] = \Big\langle \sum_{g \in G} \pi_H(g_t \mid s_t) P_k(g_t, s_t),\; R(s_t) + \gamma^k V^*_H \Big\rangle$$

where $\pi_H(g_t \mid s_t)$ is the high-level policy that chooses subgoal $g_t$, $P_k(s_{t+k} \mid s_t, g_t)$ is the k-step transition probability of the low-level policy with subgoal $g_t$, $R(s_{t+k}, s_t)$ is again the k-step accumulated reward, and $P_k(g_t, s_t)$ is the probability distribution vector of the k-step trajectory starting from $s_t$ with subgoal $g_t$. Denote $\bar{P}_H(s_t, g_t) \triangleq \sum_{g \in G} \pi_H(g_t \mid s_t) P_k(g_t, s_t)$, so that Definition A.1 can be rewritten as:

$$\nu_k = \max_{s_t, g_t} \big\| P_\pi(s_t) - \bar{P}_H(s_t, g_t) \big\|_\infty$$

Then for every $s_t \in S$:

$$|V^*(s_t) - V^*_H(s_t)| = \Big| \big\langle P_\pi(s_t),\, R(s_t) + \gamma^k V^* \big\rangle - \big\langle \bar{P}_H(s_t, g_t),\, R(s_t) + \gamma^k V^*_H \big\rangle \Big| \leq \Big| \big\langle P_\pi(s_t) - \bar{P}_H(s_t, g_t),\, R(s_t) \big\rangle \Big| + \gamma^k \Big| \big\langle P_\pi(s_t), V^* \big\rangle - \big\langle \bar{P}_H(s_t, g_t), V^*_H \big\rangle \Big| \qquad (17)$$

Note that $\langle P_\pi(s_t), \mathbf{1} \rangle = 1$ and $\langle \bar{P}_H(s_t, g_t), \mathbf{1} \rangle = 1$, where $\mathbf{1}$ is the all-one vector of dimension $|S|$. For the first term of equation 17, we have

$$\big\langle P_\pi(s_t) - \bar{P}_H(s_t, g_t),\, R(s_t) \big\rangle = \Big\langle P_\pi(s_t) - \bar{P}_H(s_t, g_t),\; R(s_t) - \frac{kR_{\max}}{2}\,\mathbf{1} \Big\rangle \qquad (18)$$

and by the Hölder inequality:

$$(18) \leq \big\| P_\pi(s_t) - \bar{P}_H(s_t, g_t) \big\|_1 \Big\| R(s_t) - \frac{kR_{\max}}{2}\,\mathbf{1} \Big\|_\infty \leq \max_{s_t, g_t} \big\| P_\pi(s_t) - \bar{P}_H(s_t, g_t) \big\|_\infty \Big\| R(s_t) - \frac{kR_{\max}}{2}\,\mathbf{1} \Big\|_\infty = \frac{\nu_k k R_{\max}}{2} \qquad (19)$$

The second term can be similarly bounded:

$$\gamma^k \Big| \big\langle P_\pi(s_t), V^* \big\rangle - \big\langle \bar{P}_H(s_t, g_t), V^*_H \big\rangle \Big| \leq \gamma^k \Big| \big\langle P_\pi(s_t), V^* \big\rangle - \big\langle P_\pi(s_t), V^*_H \big\rangle \Big| + \gamma^k \Big| \big\langle P_\pi(s_t), V^*_H \big\rangle - \big\langle \bar{P}_H(s_t, g_t), V^*_H \big\rangle \Big| \qquad (20)$$

where

$$\gamma^k \Big| \big\langle P_\pi(s_t), V^* \big\rangle - \big\langle P_\pi(s_t), V^*_H \big\rangle \Big| \leq \gamma^k \|V^* - V^*_H\|_\infty \qquad (21)$$

and

$$\gamma^k \Big| \big\langle P_\pi(s_t), V^*_H \big\rangle - \big\langle \bar{P}_H(s_t, g_t), V^*_H \big\rangle \Big| \leq \gamma^k \big\| P_\pi(s_t) - \bar{P}_H(s_t, g_t) \big\|_1 \Big\| V^*_H - \frac{kR_{\max}}{2(1 - \gamma^k)}\,\mathbf{1} \Big\|_\infty \leq \gamma^k \max_{s_t, g_t} \big\| P_\pi(s_t) - \bar{P}_H(s_t, g_t) \big\|_\infty \Big\| V^*_H - \frac{kR_{\max}}{2(1 - \gamma^k)}\,\mathbf{1} \Big\|_\infty \leq \frac{\gamma^k \nu_k k R_{\max}}{2(1 - \gamma^k)} \qquad (22)$$

Combining (19), (21), and (22):

$$|V^*(s_t) - V^*_H(s_t)| \leq \frac{\nu_k k R_{\max}}{2} + \frac{\gamma^k \nu_k k R_{\max}}{2(1 - \gamma^k)} + \gamma^k \|V^* - V^*_H\|_\infty \qquad (23)$$

Since (23) holds for all $s \in S$:

$$\|V^* - V^*_H\|_\infty \leq \frac{\nu_k k R_{\max}}{2(1 - \gamma^k)} + \frac{\gamma^k \nu_k k R_{\max}}{2(1 - \gamma^k)^2}$$

A.2 PROOF OF THEOREM 1

Proof. Consider Assumption A.3; there is:

$$\varphi(s_{t+k}) - \varphi(s'_{t+k}) = \varphi(s_{t+k}) - \varphi(s_t) + \varphi(s_t) - \varphi(s'_{t+k}) \qquad (25)$$

Recall that $\varphi(s_t)$ is the coordinate of state $s_t$, so $\varphi(s_{t+k}) - \varphi(s_t)$ is the direction vector of the movement from step $t$ to $t+k$. By Definition 3.1, we denote the relative displacement $\Delta_t = \|g_\Delta\| = \big\|\mathbb{E}_{s'_{t+k}}[\varphi(s_t) - \varphi(s'_{t+k})]\big\|$ for any state $s_t$ with ϵ-invariant subgoal $g_\Delta$. The displacement vector can then be written as $\vec{g}_{\Delta_t} = \vec{\Delta}_t + \vec{\Delta}_\epsilon$, where $\vec{\Delta}_t$ is a fixed vector in the expected direction and $\vec{\Delta}_\epsilon$ is a random vector with $\Delta_{\epsilon,x}, \Delta_{\epsilon,y} \sim N(0, \sigma)$, where $\sigma \leq \epsilon$ is much less than the moving distance $\Delta_t$. Thus the norm of equation 25 can be rewritten as:

$$\big\| \varphi(s_{t+k}) - \varphi(s_t) - (\vec{\Delta}_t + \vec{\Delta}_\epsilon) \big\| \leq \big\| \varphi(s_{t+k}) - \varphi(s_t) - \vec{\Delta}_t \big\| + \big\| \vec{\Delta}_\epsilon \big\| \qquad (26)$$

For a normal distribution, the random variable falls into the region $[-3\sigma, 3\sigma]$ with high probability. So with high probability, $\|\vec{\Delta}_\epsilon\| \leq \sqrt{2 \times (3\sigma)^2} \leq 3\sqrt{2}\,\epsilon$. Thus Assumption A.3 can be rewritten as:

$$|P_k(s_{t+k} \mid s_t, g_t) - P_k(s'_{t+k} \mid s_t, g_\Delta)| \leq L_\varphi (2\delta_{\max} + 3\sqrt{2}\,\epsilon) \qquad (27)$$

where $\delta_{\max} = \max_{g, g_\Delta}\{\|g\|, \Delta_t\}$. Equation 27 can be seen as the HRL transition mismatch rate between our method and the optimal HRL method with stochastic coordinate subgoals; it is bounded by the maximal k-step movement of the agent together with the variance of the random variable. With Lemma A.2, the analogous conclusion is obtained by replacing the k-step transition mismatch with equation 27, which yields the result of Theorem A.4.
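As a numerical illustration, the two bounds above can be evaluated directly, and the high-probability step $\|\vec{\Delta}_\epsilon\| \leq 3\sqrt{2}\,\sigma$ can be checked empirically by sampling. This is a minimal sketch; all function names and parameter values below are illustrative, not part of the paper.

```python
import math
import random

def hrl_error_bound(nu_k, k, gamma, r_max):
    """Right-hand side of Lemma A.2:
    nu_k*k*R_max/(2(1-gamma^k)) + gamma^k*nu_k*k*R_max/(2(1-gamma^k)^2).
    Requires 0 <= gamma < 1 so that 1 - gamma^k > 0."""
    g_k = gamma ** k
    return (nu_k * k * r_max) / (2 * (1 - g_k)) \
        + (g_k * nu_k * k * r_max) / (2 * (1 - g_k) ** 2)

def subgoal_error_bound(eps, delta_max, l_phi, k, gamma, r_max):
    """Right-hand side of Theorem A.4:
    L_phi*(2*delta_max + 3*sqrt(2)*eps)*R_max/(2(1-gamma^k)) * (k + gamma^k/(1-gamma^k))."""
    g_k = gamma ** k
    mismatch = l_phi * (2 * delta_max + 3 * math.sqrt(2) * eps)
    return mismatch * r_max / (2 * (1 - g_k)) * (k + g_k / (1 - g_k))

def noise_norm_within_bound(sigma, n_samples=10_000, seed=0):
    """Empirical frequency of ||Delta_eps|| <= 3*sqrt(2)*sigma when both
    components of Delta_eps are i.i.d. N(0, sigma), i.e. the
    high-probability step in the proof of Theorem A.4."""
    rng = random.Random(seed)
    bound = 3 * math.sqrt(2) * sigma
    hits = sum(
        math.hypot(rng.gauss(0, sigma), rng.gauss(0, sigma)) <= bound
        for _ in range(n_samples)
    )
    return hits / n_samples
```

A deterministic high-level policy gives $\nu_k = 0$ and hence a zero HRL-error bound, and both bounds grow linearly in $\nu_k$ and $\epsilon$ respectively, matching the claim that the error is controllable.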

B ALGORITHM

Our algorithm for high-level policy learning is as follows (Algorithm 1):

Algorithm 1 PEG-A2C
    ...
    Synchronize and update parameters
20: end for
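Only the final synchronization step of Algorithm 1 is legible above. As a minimal sketch (not the authors' exact implementation), the "Synchronize and update parameters" step of a synchronous multi-process A2C learner can be realized by averaging worker gradients and applying one gradient-descent step; the names `params`, `worker_grads`, and `learning_rate` are illustrative assumptions.

```python
def synchronize_and_update(params, worker_grads, learning_rate=1e-3):
    """Average the gradients from the n worker processes (one gradient
    list per worker, aligned with `params`) and apply a single
    gradient-descent step to the shared parameters."""
    n = len(worker_grads)
    return [
        p - learning_rate * sum(grads[j] for grads in worker_grads) / n
        for j, p in enumerate(params)
    ]
```

In a synchronous scheme, all workers then reload the returned shared parameters before collecting the next batch of trajectories, which keeps the per-process policies θ_a^i and value functions θ_v^i consistent.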

D DETAILS OF EXPERIMENTS

Ant ⊃-Maze. In this task, the agent must navigate from the bottom to the top. Unlike the setting in previous works, the maze is larger, and the agent receives reward '1' only once, when it passes the corner and reaches the final region. Ant Random Square Maze. This is an empty room with a door and a robot. The agent starts at a random initial position every episode and must walk towards the yellow door, which also has a random position on the wall. The agent obtains a reward only when it passes the door. Ant S-shaped Maze. In this task, the agent starts in the left region and must pass three doors. The trajectory is circuitous and long-horizon; in particular, the final door is harder to reach than the final region in Ant ⊃-Maze. Each first-time transit of a door gives the agent a reward. Ant Spiral Maze. This task takes place in a large maze with spiral routes and five doors, and the agent starts in the middle region. Such a long-horizon task requires at least thousands of steps to reach the final region. Again, each first-time transit of a door gives the agent a reward. Generalization Maze. This task includes three fixed mazes, 'Maze-g1', 'Maze-g2', and 'Maze-g3' (figure 1), which are variants of 'Ant Random Square Maze' with different unseen structures. The agent obtains a reward only when it passes the door.
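The "reward on every first-time transit of a door" mechanic shared by the S-shaped and Spiral mazes can be sketched as follows. The axis-aligned door regions are a hypothetical detection scheme for illustration, not the benchmark's exact geometry.

```python
class FirstTransitReward:
    """Grants reward '1' only on the first transit of each door.
    Door regions are assumed axis-aligned boxes (x_min, y_min, x_max, y_max)."""

    def __init__(self, door_regions):
        self.door_regions = list(door_regions)
        self.passed = [False] * len(self.door_regions)

    def step(self, agent_xy):
        """Return the reward for the agent's current position."""
        x, y = agent_xy
        reward = 0.0
        for i, (x0, y0, x1, y1) in enumerate(self.door_regions):
            if not self.passed[i] and x0 <= x <= x1 and y0 <= y <= y1:
                self.passed[i] = True  # each door rewards at most once
                reward += 1.0
        return reward
```

Re-entering a door region yields no further reward, which matches the sparse, first-time-only signal described above.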

D.1 STEPS OF EVERY EPISODE OF DIFFERENT TASKS

Table 3 shows the maximum number of steps per episode for every task.



Figure 1: Illustration of the Zero-shot Generalization Maze task. From left to right: Maze-g2, Maze-g3, Maze-g1. Details are in Section 4.1.

Figure 3: Comparison of different frameworks. ϵ-invariant subgoals (invariable directions with little randomness) can be reused in different navigation tasks.

Figure 4: An illustration of the shape of maze environments of our benchmark. They are Random Square, ⊃ Maze, S-shaped Maze, and Spiral Maze respectively. The blue arrow is the successful trajectory.

Figure 5: Comparative experiment results against strong baselines in sparse-reward tasks. The mean and variance are calculated over 3 runs.

Figure 6: Comparative experiment results in dense-reward tasks. (a) shows the average reward per step; (b) shows the average success rate (achieved goals / total goals).

Figure 7: Visualization results of the zero-shot maze generalization tasks. The red and blue points represent the positions reached in the maze; the lighter a position is, the more frequently the agent reached it. The histogram directly shows the visit frequency of every position.

Figure 8: Structure of the network.

Generalization Task Result for Different Mazes (Zero-shot Success Rate %). 'Maze-ori' is the trained maze; the other three are zero-shot generalization mazes. In these experiments, we use the policy trained in 'Ant Square Maze' and test it in new unseen mazes of fixed shapes without retraining. We compare our method with HIGL using the best-performing policy from the training process in the Ant Random Square environment with sparse reward. Results are shown in Table 1. Since the test mazes have fixed shapes without randomness, the test result can sometimes be better than the training result.

Generalization Task Result for Different Robots (Zero-shot Success Rate %). Except for 'Ant', the other robots are unseen. Average success rate = achieved goals / total goals.

$\delta_{\max} = \max_{g, g_\Delta}\{\|g\|, \|g_\Delta\|\}$, where $\|g\|$ and $\|g_\Delta\|$ are the distances of the relative displacement from any $s_t$ to $s_{t+k}$ under subgoals $g_t$ and $g_\Delta$, respectively.

Algorithm 1 (excerpt):
1: Initialize multi-process actor parameters θ_a^i for i ∈ [1, n]
2: Initialize multi-process value parameters θ_v^i for i ∈ [1, n]
3: for episode in 1, ..., M do
       Perform a_t according to policy π(a_t | s_t)


For the sparse reward, the agent obtains the reward when it goes across a door, i.e., when the agent's coordinates fall into the door region. For the dense reward, the agent receives 1/(1 + d) at every step, where d is the Euclidean distance between the agent and the current goal in the coordinate system. Once the agent reaches the current goal and collects the reward, the goal is updated, and the reward is then calculated with respect to the new goal. As a result, the curves of average reward in these tasks may sometimes decline.
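The dense reward described above can be sketched as a one-line function. This is a minimal sketch of the stated formula 1/(1 + d); the function name is illustrative.

```python
import math

def dense_reward(agent_xy, goal_xy):
    """Dense reward 1 / (1 + d), where d is the Euclidean distance
    between the agent and the current goal in the coordinate system."""
    d = math.dist(agent_xy, goal_xy)
    return 1.0 / (1.0 + d)
```

The reward is 1 exactly at the goal and decays smoothly with distance, so when the goal updates to a farther waypoint the per-step reward can drop, explaining the occasional declines in the average-reward curves.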

