PRIMAL WASSERSTEIN IMITATION LEARNING

Abstract

Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. We present a reward function that is derived offline and requires little fine-tuning, in contrast with recent adversarial IL algorithms that learn a reward function through interactions with the environment. We show that we can recover expert behavior on a variety of continuous control tasks of the MuJoCo domain in a manner that is sample-efficient in terms of both agent interactions and expert interactions with the environment. Finally, we show that the behavior of the agent we train matches the behavior of the expert as measured by the Wasserstein distance, rather than by the commonly used proxy of performance.
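
For concreteness, the primal (Kantorovich) form referred to above can be sketched as follows; the notation is illustrative: $\hat{\rho}_e$ and $\hat{\rho}_{\pi}$ denote the expert and agent state-action distributions, $d$ a metric on the state-action space, and $\Theta(\hat{\rho}_{\pi}, \hat{\rho}_e)$ the set of couplings between the two distributions.

$$ W_1(\hat{\rho}_{\pi}, \hat{\rho}_e) = \inf_{\theta \in \Theta(\hat{\rho}_{\pi}, \hat{\rho}_e)} \int d\big((s, a), (s', a')\big) \, \mathrm{d}\theta\big((s, a), (s', a')\big) $$

The term "primal" refers to optimizing directly over the coupling $\theta$, as opposed to estimating the distance through a dual, discriminator-based formulation as in adversarial approaches.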

1. INTRODUCTION

Reinforcement Learning (RL) has solved a number of difficult tasks, whether in games (Tesauro, 1995; Mnih et al., 2015; Silver et al., 2016) or robotics (Abbeel & Ng, 2004; Andrychowicz et al., 2020). However, RL relies on the existence of a reward function, which can be either hard to specify or too sparse to be used in practice. Imitation Learning (IL) is a paradigm that applies to these environments with hard-to-specify rewards: we seek to solve a task by learning a policy from a fixed number of demonstrations generated by an expert. IL methods typically fall into two paradigms: Behavioral Cloning, or BC (Pomerleau, 1991; Bagnell et al., 2007; Ross & Bagnell, 2010), and Inverse Reinforcement Learning, or IRL (Russell, 1998; Ng et al., 2000). In BC, we seek to recover the expert's behavior by directly learning a policy that reproduces the expert's actions, typically through supervised learning on the demonstrated state-action pairs. In IRL, we assume that the demonstrations come from an agent acting optimally with respect to an unknown reward function that we seek to recover, in order to subsequently train an agent on it.

Although IRL methods introduce an intermediary problem (i.e., recovering the environment's reward), they are less sensitive to distributional shift (Pomerleau, 1991), they generalize to environments with different dynamics (Piot et al., 2013), and they can recover a near-optimal agent from suboptimal demonstrations (Brown et al., 2019; Jacq et al., 2019). However, IRL methods are usually based on an iterative process alternating between reward estimation and RL, which might result in poor sample efficiency. Earlier IRL methods (Ng et al., 2000; Abbeel & Ng, 2004; Ziebart et al., 2008) require multiple calls to a Markov decision process solver (Puterman, 2014), whereas recent adversarial IL approaches (Finn et al., 2016; Ho & Ermon, 2016; Fu et al., 2018) interleave the learning of the reward function with the learning process of the agent.

Adversarial IL methods are based on an adversarial training paradigm similar to Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), where the learned reward function can be thought of as the confusion of a discriminator that learns to differentiate expert transitions from non-expert ones. These methods are well suited to the IL problem since they implicitly minimize an f-divergence between the state-action distribution of the expert and the state-action distribution of the learning agent (Ghasemipour et al., 2019; Ke et al., 2019). However, the interaction between a generator (the policy) and a discriminator (the reward function) makes this a min-max optimization problem, which comes with practical challenges that can include training instability, sensitivity to hyperparameters, and poor sample efficiency.
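
As a point of comparison, the adversarial IL objective can be written schematically in a GAN form (cf. Goodfellow et al., 2014; Ho & Ermon, 2016), up to the sign convention of the discriminator and the causal-entropy regularizer of the original formulation; here $D$ is a discriminator over state-action pairs, and $\hat{\rho}_e$ and $\hat{\rho}_{\pi}$ again denote the expert and agent state-action distributions.

$$ \min_{\pi} \max_{D} \; \mathbb{E}_{(s, a) \sim \hat{\rho}_e}\big[\log D(s, a)\big] + \mathbb{E}_{(s, a) \sim \hat{\rho}_{\pi}}\big[\log\big(1 - D(s, a)\big)\big] $$

At the inner optimum over $D$, this objective coincides, up to constants, with the Jensen-Shannon divergence between the two state-action distributions; the alternating optimization of $\pi$ and $D$ is the source of the min-max difficulties mentioned above.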

