VISUAL IMITATION WITH REINFORCEMENT LEARNING USING RECURRENT SIAMESE NETWORKS

Abstract

It would be desirable for a reinforcement learning (RL)-based agent to learn behaviour by merely watching a demonstration. However, defining rewards that facilitate this goal within the RL paradigm remains a challenge. Here we address this problem with Siamese networks, trained to compute distances between observed behaviours and an agent's behaviours. We use a recurrent neural network (RNN)-based comparator model to learn such distances in space and time between motion clips, while training an RL policy to minimize this distance. Through experimentation, we have also found that the inclusion of multi-task data and an additional image-encoding loss helps enforce temporal consistency and improves policy learning. These two components appear to balance reward for matching a specific instance of a behaviour versus that behaviour in general. Furthermore, we focus here on a particularly challenging form of this problem in which only a single expert demonstration is provided for a given task. We demonstrate our approach on simulated humanoid, dog and raptor agents in 2D, and on a 3D quadruped and humanoid. In these environments, we show that our method outperforms the state-of-the-art methods Generative Adversarial Imitation from Observation (GAIfO) (i.e., Generative Adversarial Imitation Learning (GAIL) without access to actions) and Time-Contrastive Networks (TCN).

1. INTRODUCTION

In nature, many intelligent beings (agents) can imitate their peers (experts) by watching them. To learn from observation alone, the agent must compare its own behaviour to the expert's while mimicking their movements (Blakemore & Decety, 2001). While this process seems to come as second nature to humans and many animals, formulating a framework and metrics that can measure how different an expert's demonstration is from an agent's reenactment in this setting is challenging. While robots have access to their state information, humans and animals simply observe others performing tasks, relying only upon visual perception of the demonstrations and creating a mental representation of the target motion. In this work we ask: can agents learn such representations in order to learn imitative policies from a single demonstration?

One of the core problems of imitation learning is how to align a demonstration in space and time with the agent's own state. To address this, the imitation framework has to learn a distance function between the agent and the expert. The distance function in our work makes use of positive and negative examples, including types of adversarial examples, similar to GAIL (Ho & Ermon, 2016) and GAIfO (Torabi et al., 2018b). These works require expert policies to generate large amounts of demonstration data, and GAIL additionally needs access to actions; both train a discriminator to recognize in-distribution examples. In this work, we extend these techniques by learning distances between motions from noisy visual data without action information, and by using the distance function as a reward signal to train RL policies. Figure 1b outlines our method for visual imitation. As we show in the paper, this new formulation can be extended so that the distance function is trained with multi-task data, which improves the model's accuracy and enables its reuse on different tasks.
Additionally, while previous methods have focused on computing distances between single states, we construct a cost function that takes the demonstration ordering into account as well as the state, using a recurrent Siamese network to learn smoother distances between motions. Our contribution, Visual Imitation with RL (VIRL), proposes and explores these recurrent Siamese networks as a way to address a critical problem in defining the reward structure for imitation learning from video for deep RL agents. We do so using simulated robots in a physics simulation environment and for the challenging setting of imitation learning from a single expert demonstration. Our approach enables us to train agents that can imitate many types of behaviours, including walking, running and jumping. We perform experiments on multiple simulated robots in both 2D and 3D, including recent Sim2Real quadruped robots and a humanoid with 38 degrees of freedom (DoF), a particularly challenging problem domain.
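To make the core idea concrete, the sketch below shows how a shared (Siamese) encoder can turn the embedding distance between an agent observation and a reference observation into a per-step reward. This is an illustrative sketch only: the linear `tanh` encoder, the Gaussian-style kernel, and the `sigma` parameter are stand-ins for the paper's convolutional/LSTM comparator, not its actual architecture.

```python
import numpy as np

def encode(obs, W):
    """Shared Siamese encoder: both inputs pass through the SAME weights W.
    A single tanh layer stands in for the paper's conv/LSTM encoder."""
    return np.tanh(obs @ W)

def siamese_reward(agent_obs, expert_obs, W, sigma=1.0):
    """Map embedding distance to a reward in (0, 1]: identical embeddings
    give reward 1, distant embeddings give reward near 0."""
    d = np.linalg.norm(encode(agent_obs, W) - encode(expert_obs, W))
    return float(np.exp(-d ** 2 / sigma))
```

In training, a reward of this form would be computed at every simulation step and passed to an off-the-shelf RL algorithm in place of a hand-designed reward.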

2. PRELIMINARIES

Here we provide a brief review of fundamental methods related to the new approach we present.

Reinforcement Learning (RL) is frequently formulated within the framework of a Markov decision process (MDP), where at every time step $t$ the world (including the agent) exists in a state $s_t \in S$, the agent performs an action $a_t \in A$, and states and actions are discrete. The action is chosen according to a policy $\pi(a_t|s_t)$, which results in a new state $s_{t+1} \in S$ and a reward $r_t = R(s_t, a_t, s_{t+1})$ according to the transition probability function $T(r_t, s_{t+1}|s_t, a_t)$. The policy is optimized to maximize the expected future discounted reward $\mathbb{E}_{r_0, \ldots, r_T}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $T$ is the maximum time horizon and $\gamma$ is the discount factor, indicating the length of the planning horizon. This formulation generalizes to continuous states and actions, which is the situation for the agents we consider in our work.

Imitation Learning is typically cast as the process of training a new policy to reproduce the behaviour of some expert policy. Behavioural cloning is a fundamental method for imitation learning. Given an expert policy $\pi_E$, possibly represented as a collection of trajectories $\tau = \{(s_0, a_0), \ldots, (s_T, a_T)\}$, a new policy $\pi$ can be learned to match these trajectories using supervised learning, by maximizing the expectation $\mathbb{E}_{\pi_E}\left[\sum_{t=0}^{T} \log \pi(a_t|s_t, \theta_\pi)\right]$. While this simple method can work well, it often suffers from distribution-mismatch issues, leading to compounding errors as the learned policy deviates from the expert's behaviour (Ross et al., 2011b). Inverse reinforcement learning avoids this issue by extracting a reward function from observed optimal behaviour (Ng et al., 2000). In our approach, we learn a distance function that allows an agent to compare an observed behaviour to its own current behaviour in order to define its reward $r_t$ at a given time step.
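The discounted-return objective above can be made concrete in a few lines of plain Python (an illustrative sketch, not code from the paper):

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T} gamma^t * r_t for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps of reward 1.0 with gamma = 0.9
# gives 1.0 + 0.9 + 0.81 = 2.71
```

A discount factor near 1 weights long-term rewards almost as heavily as immediate ones, while a small gamma makes the agent effectively short-sighted.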
Our comparison is performed with respect to a reference activity, but the comparator network can be trained across a collection of different behaviours. Further, we do not assume the demonstration data to be optimal. See Appendix 7.2 for further discussion of the connections between our work and inverse reinforcement learning.



Figure 1: Overview of our method: we aim to learn a distance function (1a) and then use that distance function as a reward function for RL (1b). At the current timestep, observations ($o$) of the reference motion and the agent are encoded ($e$) and fed into LSTMs (leading to hidden states $h$). Fig. 1a shows how the reward model is trained using both Siamese and AE losses: a VAE reconstruction loss on static images ($L_{VAE_I}$); sequence-to-sequence AE losses ($L_{RAE_S}$), one for the reference and one for the agent (the latter not shown, to simplify the figure); a Siamese loss between encoded images ($L_{SN_I}$); and a Siamese loss computed between encoded states over time ($L_{SN_S}$). Fig. 1b shows how the reward is calculated at every timestep: it consists of the distance between encoded images and the distance between encoded LSTM hidden states.
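The per-timestep reward described in the caption can be sketched as follows, combining an image-embedding distance with an LSTM hidden-state distance. The weights `w_img` and `w_seq` and the exponential mapping are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def timestep_reward(e_agent, e_expert, h_agent, h_expert,
                    w_img=0.5, w_seq=0.5):
    """Combine two distances into one reward: the distance between encoded
    images (e) and the distance between LSTM hidden states (h). The hidden
    states carry sequence context, so this term rewards temporal alignment
    with the demonstration, not just per-frame similarity."""
    d_img = np.linalg.norm(np.asarray(e_agent) - np.asarray(e_expert))
    d_seq = np.linalg.norm(np.asarray(h_agent) - np.asarray(h_expert))
    return float(np.exp(-(w_img * d_img + w_seq * d_seq)))
```

When agent and reference embeddings coincide, both distances vanish and the reward is 1; as either the static-image term or the sequence term grows, the reward decays toward 0.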

