VISUAL IMITATION WITH REINFORCEMENT LEARNING USING RECURRENT SIAMESE NETWORKS

Abstract

It would be desirable for a reinforcement learning (RL) agent to learn behaviour merely by watching a demonstration. However, defining rewards that facilitate this goal within the RL paradigm remains a challenge. Here we address this problem with Siamese networks, trained to compute distances between observed behaviours and an agent's behaviours. We use a recurrent neural network (RNN)-based comparator model to learn such distances in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we have also found that including multi-task data and an additional image-encoding loss helps enforce temporal consistency and improves policy learning. These two components appear to balance reward for matching a specific instance of a behaviour against matching that behaviour in general. Furthermore, we focus here on a particularly challenging form of this problem in which only a single expert demonstration is provided for a given task. We demonstrate our approach on simulated humanoid, dog, and raptor agents in 2D, and on quadruped and humanoid agents in 3D. In these environments, we show that our method outperforms the state of the art: Generative Adversarial Imitation from Observation (GAIfO), i.e. Generative Adversarial Imitation Learning (GAIL) without access to actions, and Time-Contrastive Networks (TCN).

1. INTRODUCTION

In nature, many intelligent beings (agents) can imitate their peers (experts) by watching them. To learn from observation alone, the agent must compare its own behaviour to the expert's while mimicking their movements (Blakemore & Decety, 2001). While this process seems to come as second nature to humans and many animals, formulating a framework and metrics that can measure how different an expert's demonstration is from an agent's reenactment is challenging. Robots have access to their internal state information, but humans and animals simply observe others performing tasks, relying only on visual perception of the demonstrations and creating a mental representation of the target motion. In this work we ask: can agents learn such representations in order to learn imitative policies from a single demonstration?

One of the core problems of imitation learning is how to align a demonstration in space and time with the agent's own state. To address this, the imitation framework must learn a distance function between the agent and the expert. The distance function in our work makes use of positive and negative examples, including types of adversarial examples, similar to GAIL (Ho & Ermon, 2016) and GAIfO (Torabi et al., 2018b). These methods require expert policies to generate large amounts of demonstration data, and GAIL needs action information as well; both train a discriminator to recognize in-distribution examples. In this work, we extend these techniques by learning distances between motions from noisy visual data without action information, and by using the distance function as a reward signal to train RL policies. Figure 1b outlines our method for visual imitation. As we show in the paper, this formulation can be extended so that multi-task data assists in training the distance function, which improves the model's accuracy and enables its reuse on different tasks.
Additionally, while previous methods have focused on computing distances between single states, we construct a cost function that accounts for the demonstration's ordering as well as its states, using a recurrent Siamese network to learn smoother distances between motions.
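The core idea above can be sketched in a few lines. The following is a minimal illustrative example, not the paper's trained model: a tiny untrained tanh RNN with random weights stands in for the learned recurrent Siamese encoder, the two branches share the same weights (the Siamese property), and the reward is a decreasing function of the embedding distance between the agent's motion and the expert demonstration. The dimensions, the `exp(-d)` reward shaping, and all function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's architecture).
OBS_DIM, HID_DIM = 8, 16
W_in = rng.standard_normal((HID_DIM, OBS_DIM)) * 0.1  # input-to-hidden weights
W_h = rng.standard_normal((HID_DIM, HID_DIM)) * 0.1   # hidden-to-hidden weights


def encode(seq):
    """Shared recurrent encoder: run a tanh RNN over an observation
    sequence and return the final hidden state as the motion embedding.
    Both Siamese branches call this same function (shared weights)."""
    h = np.zeros(HID_DIM)
    for obs in seq:
        h = np.tanh(W_in @ obs + W_h @ h)
    return h


def imitation_reward(agent_seq, expert_seq):
    """Reward shrinks as the learned distance between the two motions
    grows; an RL policy maximizing this reward minimizes the distance."""
    d = np.linalg.norm(encode(agent_seq) - encode(expert_seq))
    return float(np.exp(-d))


# Toy data: one expert motion clip and one unrelated motion.
expert = rng.standard_normal((20, OBS_DIM))
random_motion = rng.standard_normal((20, OBS_DIM))

print(imitation_reward(expert, expert))         # identical motions -> reward 1.0
print(imitation_reward(random_motion, expert))  # dissimilar motions -> lower reward
```

In the paper's setting the encoder would be trained (with positive/negative pairs and the auxiliary losses described above) rather than random, and the reward would be computed per step during policy rollouts; the sketch only shows how a sequence-level Siamese distance turns into a scalar reward.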

