CRISP: CURRICULUM INDUCING PRIMITIVE INFORMED SUBGOAL PREDICTION FOR HIERARCHICAL REINFORCEMENT LEARNING

Abstract

Hierarchical reinforcement learning is a promising approach that uses temporal abstraction to solve complex long-horizon problems. However, simultaneously learning a hierarchy of policies is unstable, as it is challenging to train the higher-level policy when the lower-level primitive is non-stationary. In this paper, we propose to generate a curriculum of achievable subgoals for the evolving lower-level primitive using reinforcement learning and imitation learning. The lower-level primitive periodically performs data relabeling on a handful of expert demonstrations using our primitive informed parsing. We derive expressions that bound the sub-optimality of our method and develop a practical algorithm for hierarchical reinforcement learning. Since our approach requires only a handful of expert demonstrations, it is suitable for most robotic control tasks. Experimental results on complex maze navigation and robotic manipulation environments show that inducing hierarchical curriculum learning significantly improves sample efficiency and yields better goal-conditioned policies on temporally extended tasks.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have made significant progress in solving continuous control tasks such as robotic arm manipulation (Levine et al., 2015; Vecerík et al., 2017) and dexterous manipulation (Rajeswaran et al., 2017). However, the success of RL algorithms on complex long-horizon continuous tasks has been limited by issues such as long-term credit assignment and inefficient exploration (Nachum et al., 2019; Kulkarni et al., 2016), especially in sparse-reward scenarios (Andrychowicz et al., 2017). Hierarchical reinforcement learning (HRL) (Dayan & Hinton, 1993; Sutton et al., 1999; Parr & Russell, 1998) promises the benefits of temporal abstraction and efficient exploration for solving tasks that require long-term planning. In the goal-conditioned hierarchical framework, the high-level policy predicts subgoals for the lower primitive, which in turn performs primitive actions directly in the environment (Nachum et al., 2018; Vezhnevets et al., 2017; Levy et al., 2017). However, simultaneously learning multi-level policies has been found to be challenging in practice due to non-stationary higher-level state transition and reward functions. Prior works have leveraged expert demonstrations to bootstrap learning (Nair et al., 2017; Rajeswaran et al., 2017; Hester et al., 2017). Some approaches parse expert demonstrations with a fixed scheme and then bootstrap a multi-level hierarchical RL policy using imitation learning (Gupta et al., 2019). Generating an efficient subgoal transition dataset is crucial in such tasks. In this work, we propose an adaptive parsing technique for leveraging expert demonstrations and show that it outperforms fixed-parsing approaches on tasks that require long-term planning. Ideally, a good subgoal should properly balance the task split between the hierarchical levels according to the current goal-reaching ability of the lower primitive, thus avoiding degenerate solutions.
As the lower primitive improves, the subgoals provided to it should become progressively more difficult, such that (i) the subgoals are always achievable by the current lower-level primitive, (ii) the task split is properly balanced between hierarchical levels, and (iii) reasonable progress is made towards achieving the final goal. In this work, we introduce hierarchical curriculum learning to deal with the non-stationarity issue. We build upon these ideas and propose a generally applicable HRL approach: Curriculum inducing primitive informed subgoal prediction (CRISP). CRISP parses a handful of expert demonstrations using our novel subgoal relabeling method, primitive informed parsing (PIP). In PIP, the current lower primitive is used to relabel the expert demonstration dataset and generate efficient subgoal supervision for the higher-level policy. Since the lower primitive itself performs the relabeling, this approach does not require explicit labeling or segmentation of demonstrations by an expert. The periodically generated higher-level subgoal dataset is used with an additional imitation learning (IL) objective to provide curriculum-based regularization for the higher-level policy. For imitation learning, we devise an inverse reinforcement learning (IRL) regularizer (Ghasemipour et al., 2020; Kostrikov et al., 2018; Ho & Ermon, 2016), which constrains the state marginal of the learned policy to be similar to that of the expert demonstrations. The details of CRISP, PIP, and the IRL objective are given in Section 3. We also derive sub-optimality bounds in Section 3.2 to theoretically justify the benefits of curriculum learning in a hierarchical framework. Finally, we provide a practical approach to perform hierarchical reinforcement learning. Since our approach uses only a handful of expert demonstrations, it is generally applicable to most complex long-horizon tasks.
We perform experiments on random maze navigation and complex robotic pick-and-place environments, and empirically verify that the proposed approach clearly outperforms baseline approaches on long-horizon tasks.

2. BACKGROUND

We consider the Universal Markov Decision Process (UMDP) setting (Schaul et al., 2015), which is a Markov Decision Process (MDP) augmented with a goal space $\mathcal{G}$. A UMDP is represented as a 6-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \mathcal{G})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $P(s' \mid s, a) = \mathbb{P}(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the transition function that describes the probability of reaching state $s'$ when the agent takes action $a$ in the current state $s$. The reward function $R$ generates a reward $r$ at every timestep, $\gamma$ is the discount factor, and $\mathcal{G}$ is the goal space. In the UMDP setting, a fixed goal $g$ is selected for an episode, and $\pi(a \mid s, g)$ denotes the goal-conditioned policy. $d^{\pi}(s) = (1-\gamma) \sum_{t=0}^{T} \gamma^{t} P(s_t = s \mid \pi)$ represents the discounted future state distribution, and $d^{\pi}_{c}(s) = (1-\gamma^{c}) \sum_{t=0}^{T} \gamma^{tc} P(s_{tc} = s \mid \pi)$ represents the $c$-step future state distribution for policy $\pi$. The overall objective is to learn a policy $\pi(a \mid s, g)$ that maximizes the expected future discounted reward objective $J = (1-\gamma)^{-1}\, \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi(a \mid s, g),\, g \sim \mathcal{G}}\left[r(s_t, a_t, g)\right]$. Let $s$ be the current state and $g$ be the final goal for the current episode. In our goal-conditioned hierarchical RL setup, the overall policy $\pi$ is divided into multi-level policies. The higher-level policy $\pi^{H}(s_g \mid s, g)$ predicts subgoals (Dayan & Hinton, 1993) $s_g$ for the lower-level primitive $\pi^{L}(a \mid s, s_g)$, which in turn executes primitive actions $a$ directly in the environment. The lower primitive $\pi^{L}$ tries to achieve the subgoal $s_g$ within $c$ timesteps by maximizing the intrinsic reward $r^{in}$ provided by the higher-level policy. The higher-level policy $\pi^{H}$ receives an extrinsic reward $r^{ex}$ from the environment and predicts the next subgoal $s_g$ for the lower primitive. This process continues until either the final goal $g$ is achieved or the episode terminates.
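The alternation between the two levels described above can be sketched as a simple rollout loop. This is a minimal illustration under assumed interfaces, not the actual CRISP implementation: `pi_hi`, `pi_lo`, and the gym-style `env` are hypothetical placeholders.

```python
def hierarchical_rollout(env, pi_hi, pi_lo, g, c=10, max_steps=500):
    """Roll out a two-level goal-conditioned policy.

    pi_hi(s, g) -> subgoal s_g; pi_lo(s, s_g) -> primitive action a.
    Every c steps the high level re-plans a new subgoal; the low level
    acts on the environment in between. Returns accumulated extrinsic reward.
    """
    s = env.reset()
    total_extrinsic = 0.0
    t = 0
    while t < max_steps:
        s_g = pi_hi(s, g)                 # high level predicts a subgoal
        for _ in range(c):                # low level acts for c timesteps
            a = pi_lo(s, s_g)
            s, r_ex, done, _ = env.step(a)
            total_extrinsic += r_ex
            t += 1
            if done or t >= max_steps:
                return total_extrinsic
    return total_extrinsic
```

In an actual agent, each `(s, s_g, a, r_in, s')` transition would also be stored for training the lower primitive, and `(s, g, s_g, r_ex, s')` for the higher-level policy.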
We consider a sparse-reward setting, where the lower primitive receives a sparse intrinsic reward $r^{in}$ if the agent reaches within a distance $\delta_L$ of the predicted subgoal $s_g$, $r^{in} = \mathbb{1}(\|s_t - s_g\|_2 \le \delta_L)$, and the higher-level policy receives a sparse extrinsic reward $r^{ex}$ if the achieved goal is within a distance $\delta_H$ of the final goal $g$, $r^{ex} = \mathbb{1}(\|s_t - g\|_2 \le \delta_H)$. We assume access to a handful of expert demonstrations $\mathcal{D} = \{e_i\}_{i=1}^{N}$, where $e_i = (s^{e}_{0}, s^{e}_{1}, \ldots, s^{e}_{T-1})$. We only assume access to demonstration states $s^{e}_{i}$ (and not demonstration actions), which can be obtained in most robotic control tasks.
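The two sparse indicator rewards can be written out directly. A minimal sketch, assuming states and goals are numpy arrays; `delta_lo` and `delta_hi` stand in for the thresholds $\delta_L$ and $\delta_H$, whose values are placeholders here.

```python
import numpy as np

def intrinsic_reward(s_t, s_g, delta_lo=0.5):
    # Lower-primitive reward: 1 if within delta_lo of the subgoal, else 0.
    return float(np.linalg.norm(s_t - s_g) <= delta_lo)

def extrinsic_reward(s_t, g, delta_hi=0.5):
    # Higher-level reward: 1 if within delta_hi of the final goal, else 0.
    return float(np.linalg.norm(s_t - g) <= delta_hi)
```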

3. METHODOLOGY

In this section, we explain our hierarchical curriculum learning based approach, CRISP. An overview of the method is depicted in Figure 1. First, we formulate our primitive informed parsing method, PIP, which periodically performs data relabeling on expert demonstrations to populate the subgoal transition dataset. Then, we explain how this dataset is used to learn the high-level policy using reinforcement learning together with an additional inverse reinforcement learning (IRL) based regularization objective.
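To illustrate the parsing idea, the sketch below relabels one demonstration by letting a reachability check stand in for the current lower primitive: from each state, it advances to the farthest demonstration state the primitive can still reach and records that state as a subgoal label. This is an assumed simplification, not the paper's exact procedure; the `reachable` predicate (e.g. a short primitive rollout or a value-function threshold) and the transition-tuple format are hypothetical.

```python
def primitive_informed_parse(demo_states, reachable):
    """Relabel one expert demonstration into (state, goal, subgoal) tuples.

    demo_states: sequence of states from a single expert trajectory.
    reachable(s, s_g): hypothetical predicate, True when the current lower
    primitive can reach s_g from s within its subgoal horizon.
    """
    subgoal_dataset = []
    g = demo_states[-1]                   # final demo state is the episode goal
    i = 0
    while i < len(demo_states) - 1:
        # Greedily pick the farthest demo state still achievable from s_i.
        j = i + 1
        while j + 1 < len(demo_states) and reachable(demo_states[i], demo_states[j + 1]):
            j += 1
        subgoal_dataset.append((demo_states[i], g, demo_states[j]))
        i = j                             # continue parsing from the subgoal
    return subgoal_dataset
```

Because the parse depends on the current lower primitive, re-running it periodically yields progressively harder subgoal labels as the primitive improves, which is the curriculum effect described above.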

