CRISP: CURRICULUM INDUCING PRIMITIVE INFORMED SUBGOAL PREDICTION FOR HIERARCHICAL REINFORCEMENT LEARNING

Abstract

Hierarchical reinforcement learning is a promising approach that uses temporal abstraction to solve complex long-horizon problems. However, simultaneously learning a hierarchy of policies is unstable, as it is challenging to train the higher-level policy when the lower-level primitive is non-stationary. In this paper, we propose to generate a curriculum of achievable subgoals for the evolving lower-level primitive using reinforcement learning and imitation learning. The lower-level primitive periodically performs data relabeling on a handful of expert demonstrations using our primitive-informed parsing. We provide expressions that bound the sub-optimality of our method and develop a practical algorithm for hierarchical reinforcement learning. Since our approach requires only a handful of expert demonstrations, it is suitable for most robotic control tasks. Experimental results on complex maze navigation and robotic manipulation environments show that inducing hierarchical curriculum learning significantly improves sample efficiency and results in better learning of goal-conditioned policies in temporally extended tasks.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have made significant progress in solving continuous control tasks like robotic arm manipulation (Levine et al., 2015; Vecerík et al., 2017) and dexterous manipulation (Rajeswaran et al., 2017). However, the success of RL algorithms on complex long-horizon continuous tasks has been limited by issues like long-term credit assignment and inefficient exploration (Nachum et al., 2019; Kulkarni et al., 2016), especially in sparse reward scenarios (Andrychowicz et al., 2017). Hierarchical reinforcement learning (HRL) (Dayan & Hinton, 1993; Sutton et al., 1999; Parr & Russell, 1998) promises the benefits of temporal abstraction and efficient exploration for solving tasks that require long-term planning. In the goal-conditioned hierarchical framework, the high-level policy predicts subgoals for the lower primitive, which in turn performs primitive actions directly on the environment (Nachum et al., 2018; Vezhnevets et al., 2017; Levy et al., 2017). However, simultaneously learning multi-level policies has been found to be challenging in practice due to non-stationary higher-level state transition and reward functions. Prior works have leveraged expert demonstrations to bootstrap learning (Nair et al., 2017; Rajeswaran et al., 2017; Hester et al., 2017). Some approaches leverage expert demonstrations via fixed parsing and consequently bootstrap the multi-level hierarchical RL policy using imitation learning (Gupta et al., 2019). Generating an efficient subgoal transition dataset is crucial in such tasks. In this work, we propose an adaptive parsing technique for leveraging expert demonstrations and show that it outperforms fixed-parsing-based approaches on tasks that require long-term planning. Ideally, a good subgoal should properly balance the task split between the hierarchical levels according to the current goal-reaching ability of the lower primitive, thus avoiding degenerate solutions.
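The goal-conditioned hierarchical framework described above can be sketched as a two-level rollout: every k steps the high-level policy emits a subgoal, and between subgoal updates the lower primitive acts on the environment conditioned on that subgoal. This is a minimal illustrative sketch of the generic framework, not CRISP's implementation; the names `env`, `high_policy`, and `low_policy` are assumed placeholders.

```python
def hierarchical_rollout(env, high_policy, low_policy, goal, horizon=500, k=10):
    """Generic two-level goal-conditioned rollout (illustrative sketch).

    The high-level policy predicts a subgoal every k steps (temporal
    abstraction); the lower primitive executes primitive actions toward the
    current subgoal. env/high_policy/low_policy are assumed interfaces.
    """
    state = env.reset()
    trajectory = []
    subgoal = None
    for t in range(horizon):
        if t % k == 0:  # high-level decision point: pick a new subgoal
            subgoal = high_policy(state, goal)
        action = low_policy(state, subgoal)  # primitive conditioned on subgoal
        next_state, reward, done = env.step(action)
        trajectory.append((state, subgoal, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```

The high level is trained on the sparse task reward, while the lower primitive is typically rewarded for reaching the subgoal; non-stationarity arises because the lower primitive's behavior, and hence the high level's effective transition function, changes during training.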
As the lower primitive improves, the subgoals provided to it should become progressively more difficult, such that (i) the subgoals are always achievable by the current lower-level primitive, (ii) the task split is properly balanced between hierarchical levels, and (iii) reasonable progress is made towards achieving the final goal. In this work, we introduce hierarchical curriculum learning to deal with the non-stationarity issue. We build upon these ideas and propose a generally applicable HRL approach: Curriculum Inducing Primitive Informed Subgoal Prediction (CRISP).
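The adaptive-parsing idea can be sketched as follows: walking along an expert demonstration, select as the next subgoal the furthest future demonstration state that the current lower primitive is judged able to reach. This is a hedged sketch of the general idea, not CRISP's exact procedure; `low_value_fn` is an assumed stand-in for whatever reachability estimate the primitive provides (e.g. its value function).

```python
def parse_demonstration(demo_states, low_value_fn, threshold):
    """Primitive-informed parsing sketch: relabel a demonstration into
    (state, subgoal) pairs, choosing each subgoal as the furthest future
    demo state the current lower primitive can reliably reach, as judged
    by low_value_fn(state, candidate_goal) >= threshold (assumed interface).
    """
    subgoal_dataset = []
    i = 0
    while i < len(demo_states) - 1:
        j = i + 1
        # advance the candidate subgoal while it still looks achievable
        while (j + 1 < len(demo_states)
               and low_value_fn(demo_states[i], demo_states[j + 1]) >= threshold):
            j += 1
        subgoal_dataset.append((demo_states[i], demo_states[j]))
        i = j
    return subgoal_dataset
```

Because the parse depends on the primitive's current ability, re-running it periodically yields subgoals that are spaced further apart as the primitive improves, inducing a curriculum that satisfies criteria (i)-(iii) above.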

