IMITATING GRAPH-BASED PLANNING WITH GOAL-CONDITIONED POLICIES

Abstract

Recently, graph-based planning algorithms have gained much attention for solving goal-conditioned reinforcement learning (RL) tasks: they provide a sequence of subgoals to reach the target-goal, and the agents learn to execute subgoal-conditioned policies. However, the sample-efficiency of such RL schemes remains a challenge, particularly for long-horizon tasks. To address this issue, we present a simple yet effective self-imitation scheme that distills a subgoal-conditioned policy into the target-goal-conditioned policy. Our intuition here is that to reach a target-goal, an agent should pass through a subgoal, so the target-goal- and subgoal-conditioned policies should be similar to each other. We also propose a novel scheme of stochastically skipping executed subgoals in a planned path, which further improves performance. Unlike prior methods that utilize graph-based planning only in the execution phase, our method transfers knowledge from the planner, along with the graph, into policy learning. We empirically show that our method can significantly boost the sample-efficiency of existing goal-conditioned RL methods on various long-horizon control tasks.¹

1. INTRODUCTION

Many sequential decision-making problems can be expressed as reaching a given goal, e.g., navigating a walking robot (Schaul et al., 2015; Nachum et al., 2018) or fetching an object using a robot arm (Andrychowicz et al., 2017). Goal-conditioned reinforcement learning (GCRL) aims to solve this problem by training a goal-conditioned policy that guides an agent toward the target-goal. In contrast to many other reinforcement learning frameworks, GCRL is capable of solving different problems (i.e., different goals) using a single policy. An intriguing characteristic of GCRL is its optimal substructure property: any sub-path of an optimal goal-reaching path is an optimal path for its own endpoint (Figure 1a). This implies that a goal-conditioned policy is interchangeable with a policy conditioned on a "subgoal" lying between the goal and the agent. Based on this insight, researchers have investigated graph-based planning to construct a goal-reaching path by (a) proposing a series of subgoals and (b) executing policies conditioned on the nearest subgoal (Savinov et al., 2018; Eysenbach et al., 2019; Huang et al., 2019). Since nearby subgoals are easier to reach than the faraway goal, such planning improves the success ratio of the agent reaching the target-goal during sample collection.

In this paper, we aim to make existing GCRL algorithms even more faithful to the optimal substructure property. Specifically, we first incorporate the optimal substructure property into the training objective of GCRL to improve the sample-collection algorithm. Next, when executing a policy, we consider using all the proposed subgoals as endpoints of sub-paths, instead of using just the subgoal nearest to the agent (Figure 1b).

Contribution. We present Planning-guided self-Imitation learning for Goal-conditioned policies (PIG), a novel and generic framework that builds upon existing GCRL frameworks that use graph-based planning.
Figure 1: Illustration of (a) the optimal substructure property and (b) the sub-paths considered in previous works and in our approach for guiding the training of a goal-reaching agent. (a) If (l_1, l_2, l_3, l_4, l_5) is an optimal l_5-reaching path, all of its sub-paths are optimal for reaching l_5. (b) Previous works guide the agent using the l_2-reaching sub-path; our work uses all the possible sub-paths that reach l_2, l_3, l_4, and l_5.

PIG consists of the following key ingredients (see Figure 2):

• Training with self-imitation: We propose a new training objective that encourages the target-goal-conditioned policy to imitate the subgoal-conditioned policy. Our intuition is that policies conditioned on nearby subgoals are more likely to be accurate than policies conditioned on a faraway goal. In particular, we consider imitating the policies conditioned on all the subgoals proposed by the graph-based planning algorithm.

• Execution² with subgoal skipping: As an additional technique that fits our self-imitation loss, we propose subgoal skipping, which randomizes the subgoals proposed by graph-based planning to further improve sample-efficiency. During both the sample-collection and deployment stages, the policy randomly "skips" conditioning on some of the subgoals proposed by the planner when the learned policy is likely to reach the proposed subgoals anyway. This procedure is based on our intuition that an agent may find a better goal-reaching path by ignoring some planner-proposed subgoals once the policy is sufficiently trained with our loss.

We demonstrate the effectiveness of PIG on various long-horizon continuous control tasks based on the MuJoCo simulator (Todorov et al., 2012). In our experiments, PIG significantly boosts the sample-efficiency of an existing GCRL method, mapping state space (MSS) (Huang et al., 2019),³ particularly in long-horizon tasks.
For example, MSS + PIG achieves a success rate of 57.41% in the Large U-shaped AntMaze environment, while MSS alone achieves only 19.08%. Intriguingly, we also find that the PIG-trained policy performs competitively even without any planner; this could be useful in real-world scenarios where planning cost (in time or memory) is expensive (Bency et al., 2019; Qureshi et al., 2019).
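To make the self-imitation idea concrete, below is a minimal sketch (not the authors' implementation) of distilling subgoal-conditioned actions into the target-goal-conditioned policy. It assumes a deterministic toy actor and a squared-error distillation term; the names `policy` and `self_imitation_loss` are illustrative.

```python
import numpy as np

def policy(state, goal, weights):
    """Toy deterministic goal-conditioned actor: action = W @ [state; goal]."""
    return weights @ np.concatenate([state, goal])

def self_imitation_loss(weights, state, target_goal, subgoals):
    """Distillation term: the target-goal-conditioned action should match the
    actions produced when conditioning on each subgoal along the planned path.
    The subgoal-conditioned actions serve as fixed teacher targets (an actual
    implementation would apply a stop-gradient to them)."""
    target_action = policy(state, target_goal, weights)
    per_subgoal = []
    for sg in subgoals:
        teacher_action = policy(state, sg, weights)  # teacher: subgoal-conditioned
        per_subgoal.append(np.sum((target_action - teacher_action) ** 2))
    # Average over all planner-proposed subgoals, not just the nearest one.
    return float(np.mean(per_subgoal))

state = np.array([0.0, 0.0])
target_goal = np.array([1.0, 1.0])
subgoals = [np.array([0.5, 0.0]), np.array([0.5, 0.5])]  # from the planner
w = np.ones((2, 4))
loss = self_imitation_loss(w, state, target_goal, subgoals)
```

Minimizing this term pushes the faraway-goal-conditioned policy toward the presumably more accurate subgoal-conditioned behavior, which is the intuition behind the self-imitation objective.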

2. RELATED WORK

Goal-conditioned reinforcement learning (GCRL). GCRL aims to solve multiple tasks associated with target-goals (Andrychowicz et al., 2017; Kaelbling, 1993; Schaul et al., 2015). Typically, GCRL algorithms rely on the universal value function approximator (UVFA) (Schaul et al., 2015), a single neural network that estimates the true value function given not just the state but also the target-goal. Researchers have also investigated goal-exploring algorithms (Mendonca et al., 2021; Pong et al., 2020) to avoid local optima when training the goal-conditioned policy.

Graph-based planning for GCRL. To solve long-horizon GCRL problems, graph-based planning can guide the agent to condition its policy on a series of subgoals that are easier to reach than the faraway target-goal (Eysenbach et al., 2019; Hoang et al., 2021; Huang et al., 2019; Laskin et al., 2020; Savinov et al., 2018; Zhang et al., 2021). Specifically, the corresponding frameworks build a graph whose nodes and edges correspond to states and inter-state distances, respectively. Given a shortest path between the two nodes representing the current state and the target-goal, the policy conditions on the subgoal represented by the subsequent node in the path. To apply graph-based planning to complex environments, recent progress has mainly been made in building a graph that represents the visited state space well while remaining scalable to large environments. For example, Huang et al. (2019) and Hoang et al. (2021) limit the number of nodes in a graph while making the nodes cover the visited state space sufficiently, by keeping nodes that are far from each other in terms of L2 distance or successor-feature similarity, respectively. Moreover, graph sparsification via
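The shortest-path planning scheme described above can be sketched as follows; this is a generic illustration (not any particular cited method), using Dijkstra's algorithm over a toy landmark graph whose edge weights stand in for learned inter-state distance estimates.

```python
import heapq

def plan_subgoals(graph, start, goal):
    """Dijkstra shortest path over a graph whose nodes are landmark states and
    whose edge weights are estimated inter-state distances. Returns the node
    sequence from start to goal; the policy conditions on the next node."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            break
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if goal != start and goal not in prev:
        return None  # goal unreachable in the current graph
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path

# Toy landmark graph: the long direct edges are dominated by chains of
# short edges, so planning proposes a sequence of nearby subgoals.
graph = {
    "s0": [("s1", 1.0), ("s2", 4.0)],
    "s1": [("s2", 1.0), ("s3", 5.0)],
    "s2": [("s3", 1.0)],
}
path = plan_subgoals(graph, "s0", "s3")  # ["s0", "s1", "s2", "s3"]
next_subgoal = path[1]  # the policy first conditions on "s1"
```

In the frameworks surveyed above, the graph is built from visited states and a learned distance (or value) estimate; only the planner's output, the next subgoal on the shortest path, is consumed by the policy at execution time.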



¹ Code is available at https://github.com/junsu-kim97/PIG
² In this paper, we use the term "execution" to denote both (1) the roll-out in the training phase and (2) the deployment in the test phase.
³ We note that PIG is a generic framework that can also be incorporated into any planning-based GCRL method other than MSS. Nevertheless, we choose MSS because it is one of the most representative GCRL works, as most recent works (Hoang et al., 2021; Zhang et al., 2021) can be considered variants of MSS.

