LEARNING FROM DEMONSTRATIONS WITH ENERGY BASED GENERATIVE ADVERSARIAL IMITATION LEARNING

Abstract

Traditional reinforcement learning methods are usually applied to tasks with explicit reward signals. However, in the vast majority of real-world settings, the environment does not feed back a reward signal immediately, which has become a bottleneck for applying modern reinforcement learning approaches to more realistic scenarios. Recently, inverse reinforcement learning (IRL) has made great progress in exploiting expert demonstrations to recover a reward signal for reinforcement learning, and generative adversarial imitation learning is one promising approach. In this paper, we propose a new architecture for generative adversarial imitation learning, called energy-based generative adversarial imitation learning (EB-GAIL). It views the discriminator as an energy function that assigns low energies to regions near the expert demonstrations and high energies to other regions. The generator can then be seen as a reinforcement learning procedure that samples trajectories with minimal energy (cost), while the discriminator is trained to assign high energies to these generated trajectories. Concretely, EB-GAIL uses an auto-encoder architecture in place of the discriminator, with the energy being the reconstruction error. Theoretical analysis shows that EB-GAIL matches the occupancy measure of the expert policy during training. Moreover, experiments show that EB-GAIL outperforms other state-of-the-art methods while its training process is more stable.

1. INTRODUCTION

Motivated by applying reinforcement learning algorithms to more realistic tasks, we observe that most realistic environments cannot feed an explicit reward signal back to the agent immediately. This has become a bottleneck for applying traditional reinforcement learning methods to more realistic scenarios, so inferring the latent reward function from expert demonstrations is of great significance. Recently, much excellent work has been proposed to solve this problem, with successful applications to scientific inquiries, such as the Stanford autonomous helicopter Abbeel et al. (2006); Abbeel et al. (2007); Ng et al. (2004); Coates et al. (2008); Abbeel et al. (2008a); Abbeel et al. (2010), as well as practical challenges such as navigation Ratliff et al. (2006); Abbeel et al. (2008b); Ziebart et al. (2008); Ziebart et al. (2010) and intelligent building controls Barrett & Linder (2015).

The goal of imitation learning is to mimic expert behavior from expert demonstrations without access to a reinforcement signal from the environment. Algorithms in this field fall into two broad categories: behavioral cloning and inverse reinforcement learning. Behavioral cloning formulates the problem as supervised learning, mapping the state-action pairs of expert trajectories to a policy. These methods suffer from compounding errors (covariate shift): they only imitate the actions of the expert without reasoning about what the expert is trying to achieve. By contrast, inverse reinforcement learning recovers the reward function from expert demonstrations and then optimizes the policy under this recovered reward function.

In this paper, we propose energy-based generative adversarial imitation learning (EB-GAIL), which views the discriminator as an energy function without an explicit probabilistic interpretation. The energy computed by the discriminator can be viewed as a trainable cost function for the generator, while the discriminator is trained to assign low energy values to the regions of expert demonstrations and high energy values to other regions.
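To make this architecture concrete, the following is a minimal PyTorch sketch of the discriminator side, not the paper's implementation: it assumes each sample is a (state, action) pair concatenated into a single vector of size obs_act_dim, and the class name, network sizes, and margin value are illustrative assumptions. The discriminator loss follows the energy-based GAN hinge objective, pulling the reconstruction energy down on expert samples and pushing it up (toward a margin) on generated samples; the policy would then be updated by an ordinary reinforcement learning algorithm using the negative energy as its reward.

import torch
import torch.nn as nn

class EnergyDiscriminator(nn.Module):
    """Auto-encoder discriminator: the energy of a sample is its reconstruction error."""

    def __init__(self, obs_act_dim, hidden_dim=64, latent_dim=16):  # sizes are assumptions
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_act_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, obs_act_dim),
        )

    def energy(self, x):
        # Per-sample mean squared reconstruction error serves as the energy.
        recon = self.decoder(self.encoder(x))
        return ((recon - x) ** 2).mean(dim=-1)

def discriminator_loss(disc, expert_batch, policy_batch, margin=1.0):
    # Hinge objective: low energy on expert data, energy pushed up to `margin`
    # on state-action pairs sampled from the current policy.
    return disc.energy(expert_batch).mean() + \
        torch.relu(margin - disc.energy(policy_batch)).mean()

def policy_reward(disc, state_action_batch):
    # The generator (a reinforcement learning procedure) maximizes the negative
    # energy, i.e., it seeks trajectories that the discriminator reconstructs well.
    with torch.no_grad():
        return -disc.energy(state_action_batch)

# Usage on random placeholder data (batch of 32, obs_act_dim = 10):
# disc = EnergyDiscriminator(obs_act_dim=10)
# loss = discriminator_loss(disc, torch.randn(32, 10), torch.randn(32, 10))

One appeal of this design is that the reconstruction error is non-negative and unbounded above, which matches its interpretation as a surrogate cost for the reinforcement learning step.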

