LEARNING TO REACH GOALS VIA ITERATED SUPERVISED LEARNING

Abstract

Current reinforcement learning (RL) algorithms can be brittle and difficult to use, especially when learning goal-reaching behaviors from sparse rewards. Although supervised imitation learning provides a simple and stable alternative, it requires access to demonstrations from a human supervisor. In this paper, we study RL algorithms that use imitation learning to acquire goal-reaching policies from scratch, without the need for expert demonstrations or a value function. In lieu of demonstrations, we leverage the property that any trajectory is a successful demonstration for reaching the final state in that same trajectory. We propose a simple algorithm in which an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. At each iteration, the agent collects new trajectories using the latest policy, and maximizes the likelihood of the actions along these trajectories under the goal that was actually reached, so as to improve the policy. We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds for the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms on several benchmark tasks.

1. INTRODUCTION

Reinforcement learning (RL) provides an elegant framework for agents to learn general-purpose behaviors supervised by only a reward signal. When combined with neural networks, RL has enabled many notable successes, but our most successful deep RL algorithms are far from a turnkey solution. Despite striving for data efficiency, RL algorithms, especially those using temporal difference learning, are highly sensitive to hyperparameters (Henderson et al., 2018) and face challenges of stability and optimization (Tsitsiklis & Van Roy, 1997; van Hasselt et al., 2018; Kumar et al., 2019b), making such algorithms difficult to use in practice. If agents are supervised not with a reward signal but rather with demonstrations from an expert, the resulting class of algorithms is significantly more stable and easier to use. Imitation learning via behavioral cloning provides a simple paradigm for training control policies: maximizing the likelihood of optimal actions via supervised learning. Imitation learning algorithms using deep learning are mature and robust; these algorithms have demonstrated success in acquiring behaviors reliably from high-dimensional sensory data such as images (Bojarski et al., 2016; Lynch et al., 2019). Although imitation learning via supervised learning is not a replacement for RL, since the paradigm is limited by the difficulty of obtaining kinesthetic demonstrations from a supervisor, the idea of learning policies via supervised learning can serve as inspiration for RL agents that learn behaviors from scratch. In this paper, we present a simple RL algorithm for learning goal-directed policies that leverages the stability of supervised imitation learning without requiring an expert supervisor.
We show that when learning goal-directed behaviors using RL, demonstrations of optimal behavior can be generated from sub-optimal data in a fully self-supervised manner using the principle of data relabeling: every trajectory is a successful demonstration for the state that it actually reaches, even if it is sub-optimal for the goal that was originally commanded to generate the trajectory. A similar observation about hindsight relabeling was originally made by Kaelbling (1993) and more recently popularized in the deep RL literature for learning with off-policy value-based methods (Andrychowicz et al., 2017) and policy-gradient methods (Rauber et al., 2017). When relabeling goals, these algorithms recompute the received rewards as though a different goal had been commanded. In this work, we instead observe that relabeling the goal to the final state in the trajectory allows an algorithm to re-interpret an action collected by a sub-optimal agent as though it were collected by an expert agent, just for a different goal. This leads to a substantially simpler algorithm that relies only on a supervised imitation learning primitive, avoiding the challenges of value function estimation. By generating demonstrations using hindsight relabeling, we are able to apply goal-conditioned imitation learning primitives (Gupta et al., 2019; Ding et al., 2019) to data collected by sub-optimal agents, not just by an expert supervisor. We instantiate these ideas as an algorithm that we call goal-conditioned supervised learning (GCSL). At each iteration, trajectories are collected by commanding the current goal-conditioned policy with some set of desired goals, and then relabeled in hindsight to be optimal for the goals that were actually reached. Supervised imitation learning on this generated "expert" data is used to train an improved goal-conditioned policy for the next iteration.
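The relabeling step at the heart of this procedure can be sketched as a small data-generation primitive. The following is a minimal Python sketch, not code from the paper: the function name and the trajectory format (parallel lists of states and actions) are illustrative assumptions. The text above describes relabeling to the goal actually reached at the end of the episode; relabeling to every later state in the trajectory, as done here, is a common variant and is an assumption of this sketch.

```python
def relabel_trajectory(states, actions):
    """Hindsight relabeling: every action a_t in a trajectory is an
    'expert' action for reaching any state s_{t+h} that the trajectory
    actually visits h steps later.

    states:  [s_0, ..., s_T]     (length T + 1)
    actions: [a_0, ..., a_{T-1}] (length T)
    Returns a list of (state, goal, horizon, action) tuples for
    goal-conditioned behavioral cloning.
    """
    T = len(actions)
    assert len(states) == T + 1
    data = []
    for t in range(T):
        for k in range(t + 1, T + 1):
            # a_t is treated as optimal for reaching goal s_k
            # with remaining horizon h = k - t.
            data.append((states[t], states[k], k - t, actions[t]))
    return data
```

Each GCSL iteration would collect trajectories with the current policy, pass them through a relabeling step like this, and then fit the policy by maximizing the likelihood of the relabeled actions.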
Interestingly, this simple procedure provably optimizes a lower bound on a well-defined RL objective: by performing self-imitation on all of its own trajectories, an agent can iteratively improve its own policy and learn optimal goal-reaching behaviors without requiring any external demonstrations and without learning a value function. While self-imitation RL algorithms typically choose a small subset of trajectories to imitate (Oh et al., 2018; Hao et al., 2019) or learn a separate value function to reweight past experience (Neumann & Peters, 2009; Abdolmaleki et al., 2018; Peng et al., 2019), we show that GCSL learns efficiently while training on every previous trajectory without reweighting, thereby maximizing data reuse. The main contribution of our work is GCSL, a simple goal-reaching RL algorithm that uses supervised learning to acquire policies from scratch. We show, both formally and empirically, that any trajectory taken by the agent can be turned into an optimal one using hindsight relabeling, and that imitating these trajectories provably enables an agent to iteratively learn goal-reaching behaviors. That iteratively imitating all the data from a sub-optimal agent leads to optimal behavior is a non-trivial conclusion; we formally verify that the procedure optimizes a lower bound on a goal-reaching RL objective, and we derive performance bounds for the case where the supervised learning objective is sufficiently minimized. In practice, GCSL is simpler, more stable, and less sensitive to hyperparameters than value-based methods, while still retaining the benefits of off-policy learning. Moreover, GCSL can leverage demonstrations (if available) to accelerate learning. We demonstrate that GCSL outperforms value-based and policy-gradient methods on several challenging robotic domains.
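In symbols, the surrogate that this self-imitation procedure maximizes can be written as follows. This is a sketch of the objective the text describes (log-likelihood of actions under the goal actually reached), not the paper's exact statement; the notation $\pi_{\mathrm{old}}$ for the data-collecting policy is ours.

```latex
% Supervised surrogate: imitate the actions of trajectories tau collected
% by the previous policy, with the goal relabeled to the reached state s_T
% and the horizon relabeled to the remaining steps T - t.
J_{\mathrm{GCSL}}(\pi) \;=\;
  \mathbb{E}_{\tau \sim \pi_{\mathrm{old}}}
  \Big[ \sum_{t=0}^{T-1} \log \pi\big(a_t \mid s_t,\; g = s_T,\; h = T - t\big) \Big]
```

The paper's formal claim is that maximizing a surrogate of this form optimizes a lower bound on the goal-reaching objective of Equation (1) in Section 2.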

2. PRELIMINARIES

Goal reaching. The goal-reaching problem is characterized by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \rho(s_0), T, p(g) \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $\mathcal{T}(s' \mid s, a)$ is the transition function, $\rho(s_0)$ is the initial state distribution, $T$ is the horizon length, and $p(g)$ is the distribution over goal states $g \in \mathcal{S}$. We aim to find a time-varying goal-conditioned policy $\pi(\cdot \mid s, g, h) : \mathcal{S} \times \mathcal{S} \times [T] \to \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the probability simplex over the action space $\mathcal{A}$ and $h$ is the remaining horizon. We will say that a goal is achieved if the agent has reached the goal at the end of the episode. Correspondingly, the learning problem is to acquire a policy that maximizes the probability of achieving the desired goal:

$$J(\pi) = \mathbb{E}_{g \sim p(g)} \left[ P^{g}_{\pi}(s_T = g) \right], \qquad (1)$$

where $P^{g}_{\pi}(s_T = g)$ denotes the probability that the final state equals $g$ when the policy is commanded with goal $g$. Notice that unlike a shortest-path objective, this final-timestep objective provides no incentive to find the shortest path to the goal. We shall see in Section 3 that this notion of optimality is more than a simple design choice: hindsight relabeling for optimality emerges naturally when maximizing the probability of achieving the goal, but does not when minimizing the time to reach the goal. The final-timestep objective is especially useful in practical applications where reaching a particular goal is challenging, but once the goal is reached, it is possible to remain there. When reaching the goal is itself challenging, forcing the agent to reach the goal as quickly as possible can make the learning problem unduly difficult. In contrast, this objective only requires the agent to eventually reach the goal and then stay there, a more straightforward learning problem. In addition, the final-timestep objective is useful when trying to learn robust solutions that potentially take longer over shorter solutions that
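The final-timestep objective in Equation (1) can be estimated by Monte-Carlo rollouts. Below is a minimal Python sketch; `rollout_final_state` (one episode commanding goal g, returning the final state s_T) and `sample_goal` (a draw from p(g)) are hypothetical stand-ins for an environment and goal distribution, not an API from the paper.

```python
def estimate_objective(rollout_final_state, sample_goal, n_episodes=1000):
    """Monte-Carlo estimate of J(pi) = E_{g ~ p(g)}[ P_pi^g(s_T = g) ]:
    the fraction of episodes whose final state equals the commanded goal."""
    hits = 0
    for _ in range(n_episodes):
        g = sample_goal()              # g ~ p(g)
        s_T = rollout_final_state(g)   # run pi(.|s, g, h) for T steps
        hits += (s_T == g)             # success only if s_T matches g exactly
    return hits / n_episodes
```

Note that only the final state is scored, matching the objective's indifference to how quickly the goal is reached.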

