GOAL-DRIVEN IMITATION LEARNING FROM OBSERVATION BY INFERRING GOAL PROXIMITY

Anonymous

Abstract

Humans can effectively learn to estimate how close they are to completing a desired task simply by watching others fulfill the task. To solve the task, they can then take actions towards states with higher estimated proximity to the goal. From this intuition, we propose a simple yet effective method for imitation learning that learns a goal proximity function from expert demonstrations and online agent experience, and then uses the learned proximity to provide a dense reward signal for training a policy to solve the task. By predicting task progress as the temporal distance to the goal, the goal proximity function improves generalization to unseen states over methods that aim to directly imitate expert behaviors. We demonstrate that our proposed method efficiently learns a set of goal-driven tasks from state-only demonstrations in navigation, robotic arm manipulation, and locomotion tasks.

1. INTRODUCTION

Humans are capable of effectively leveraging demonstrations from experts to solve a variety of tasks. Specifically, by watching others perform a task, we can learn to infer how close we are to completing the task, and then take actions towards states closer to the goal of the task. For example, after watching a few tutorial videos on chair assembly, we learn to infer how close an intermediate configuration of a chair is to completion. With the guidance of such a task progress estimate, we can efficiently learn to assemble the chair by progressively getting closer to, and eventually reaching, the fully assembled chair. Can machines likewise first learn an estimate of progress towards a goal from demonstrations and then use this estimate as guidance to move closer to and eventually reach the goal?

Typical learning from demonstration (LfD) approaches (Pomerleau, 1989; Pathak et al., 2018; Finn et al., 2016) greedily imitate the expert policy and therefore suffer from accumulated errors, causing a drift away from states seen in the demonstrations. On the other hand, adversarial imitation learning approaches (Ho & Ermon, 2016; Fu et al., 2018) encourage the agent to imitate expert trajectories with a learned reward that distinguishes agent and expert behaviors. However, such adversarially learned reward functions often overfit to the expert demonstrations and do not generalize to states not covered in the demonstrations (Zolna et al., 2019), leading to unsuccessful policy learning.

Inspired by how humans leverage demonstrations to measure progress and complete tasks, we devise an imitation learning from observation (LfO) method that learns a task progress estimator and uses the estimated progress as a dense reward signal for training a policy, as illustrated in Figure 1. To measure the progress of a goal-driven task, we define goal proximity as an estimate of the temporal distance to the goal (i.e., the number of actions required to reach the goal).
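As a concrete illustration, goal proximity labels can be derived from a state-only expert demonstration using only the temporal index of each state: the final (goal) state receives maximum proximity and earlier states decay with their distance to the goal. The sketch below assumes an exponentially discounted proximity with a hypothetical discount factor `delta`; the exact parameterization (exponential vs. linear decay) is a design choice, not something fixed by the text above.

```python
import numpy as np

def proximity_labels(traj_len, delta=0.95):
    """Goal proximity targets for each state of one expert trajectory.

    The last state of the demonstration is treated as the goal and gets
    proximity 1.0; a state t steps before the goal gets delta ** t, so
    proximity increases monotonically along the trajectory.
    """
    t = np.arange(traj_len)
    return delta ** (traj_len - 1 - t)
```

A proximity function can then be trained by regressing network predictions on these labels; for a 5-step demonstration, `proximity_labels(5)` yields targets rising from `0.95**4` at the start to `1.0` at the goal state.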
In contrast to prior adversarial imitation learning algorithms, by having the additional supervision of task progress and learning to predict it, the goal proximity function can acquire more structured task-relevant information, and hence generalize better to unseen states and provide better reward signals. However, the goal proximity function can still output inaccurate predictions on states not covered by the demonstrations, which results in unstable policy training. To improve the accuracy of the goal proximity function, we continually update it with trajectories from both the expert and the agent. In addition, we penalize the reward with the uncertainty of the goal proximity prediction, which prevents the policy from exploiting high proximity estimates that have high uncertainty. As a result, by leveraging the agent experience and predicting the uncertainty of the proximity function, our method achieves more efficient and stable policy learning.
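One common way to obtain such an uncertainty estimate is the disagreement of an ensemble of proximity predictors; the sketch below assumes this choice, with a hypothetical penalty coefficient `penalty`, and is only an illustration of the reward shaping described above, not the paper's exact formulation.

```python
import numpy as np

def proximity_reward(ensemble_preds, penalty=1.0):
    """Uncertainty-penalized dense reward for a visited state.

    ensemble_preds: proximity estimates for the state from each member
    of an ensemble of proximity functions. The reward is the mean
    estimate minus a penalty proportional to the ensemble's standard
    deviation, so states where the predictors disagree (high
    uncertainty) yield lower reward and cannot be exploited.
    """
    preds = np.asarray(ensemble_preds, dtype=float)
    return preds.mean() - penalty * preds.std()
```

For example, an agreeing ensemble such as `[0.8, 0.8, 0.8]` yields a strictly higher reward than a disagreeing one with the same mean, such as `[0.5, 0.8, 1.1]`, which is exactly the behavior the uncertainty penalty is meant to produce.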

