PRIMAL WASSERSTEIN IMITATION LEARNING

Abstract

Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In the present work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which builds on the primal form of the Wasserstein distance between the expert and the agent state-action distributions. We present a reward function which is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and which requires little fine-tuning. We show that we can recover expert behavior on a variety of continuous control tasks of the MuJoCo domain in a sample-efficient manner, both in terms of agent interactions and of expert interactions with the environment. Finally, we show that the behavior of the agent we train matches the behavior of the expert as measured by the Wasserstein distance, rather than by the commonly used proxy of performance.

1. INTRODUCTION

Reinforcement Learning (RL) has solved a number of difficult tasks, whether in games (Tesauro, 1995; Mnih et al., 2015; Silver et al., 2016) or robotics (Abbeel & Ng, 2004; Andrychowicz et al., 2020). However, RL relies on the existence of a reward function, which can be either hard to specify or too sparse to be used in practice. Imitation Learning (IL) is a paradigm that applies to environments with such hard-to-specify rewards: we seek to solve a task by learning a policy from a fixed number of demonstrations generated by an expert. IL methods can typically be folded into two paradigms: Behavioral Cloning, or BC (Pomerleau, 1991; Bagnell et al., 2007; Ross & Bagnell, 2010), and Inverse Reinforcement Learning, or IRL (Russell, 1998; Ng et al., 2000). In BC, we seek to recover the expert's behavior by directly learning a policy that matches the expert behavior in some sense. In IRL, we assume that the demonstrations come from an agent that acts optimally with respect to an unknown reward function, which we seek to recover in order to subsequently train an agent on it. Although IRL methods introduce an intermediary problem (i.e., recovering the environment's reward), they are less sensitive to distributional shift (Pomerleau, 1991), they generalize to environments with different dynamics (Piot et al., 2013), and they can recover a near-optimal agent from suboptimal demonstrations (Brown et al., 2019; Jacq et al., 2019). However, IRL methods are usually based on an iterative process alternating between reward estimation and RL, which might result in poor sample efficiency. Earlier IRL methods (Ng et al., 2000; Abbeel & Ng, 2004; Ziebart et al., 2008) require multiple calls to a Markov decision process solver (Puterman, 2014), whereas recent adversarial IL approaches (Finn et al., 2016; Ho & Ermon, 2016; Fu et al., 2018) interleave the learning of the reward function with the learning process of the agent.
Adversarial IL methods are based on an adversarial training paradigm similar to Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), where the learned reward function can be thought of as the confusion of a discriminator that learns to differentiate expert transitions from non-expert ones. These methods are well suited to the IL problem since they implicitly minimize an f-divergence between the state-action distribution of an expert and the state-action distribution of the learning agent (Ghasemipour et al., 2019; Ke et al., 2019). However, the interaction between a generator (the policy) and the discriminator (the reward function) makes it a min-max optimization problem, which therefore comes with practical challenges including training instability, sensitivity to hyperparameters, and poor sample efficiency. In this work, we use the Wasserstein distance as a measure between the state-action distributions of the expert and of the agent. Contrary to f-divergences, the Wasserstein distance is a true distance, it is smooth, and it is based on the geometry of the metric space it operates on. The Wasserstein distance has gained popularity in GAN approaches (Arjovsky et al., 2017) through its dual formulation, which comes with challenges of its own (see Section 5). Our approach is novel in that we consider the problem of minimizing the Wasserstein distance through its primal formulation. Crucially, the primal formulation avoids the min-max optimization problem and requires little fine-tuning. We introduce a reward function computed offline, based on an upper bound of the primal form of the Wasserstein distance. As the Wasserstein distance requires a distance between state-action pairs, we show that it can be hand-defined for locomotion tasks, and that it can be learned from pixels for a hand manipulation task.
The reward function is inferred from demonstrations, as in adversarial IL methods, but it is not re-evaluated as the agent interacts with the environment; in other words, the reward function we define is computed offline. We present a true distance to compare the behavior of the expert and the behavior of the agent, rather than relying on the common proxy of performance with respect to the true return of the task we consider (which is unknown in general). Our method recovers expert behavior comparably to existing state-of-the-art methods while being based on significantly fewer hyperparameters; it operates even in the extreme low data regime of demonstrations, and is the first method that makes Humanoid run with a single (subsampled) demonstration.

2. BACKGROUND AND NOTATIONS

Markov decision processes. We describe environments as episodic Markov Decision Processes (MDPs) with finite time horizon (Sutton & Barto, 2018) $(S, A, P, r, \gamma, \rho_0, T)$, where $S$ is the state space, $A$ is the action space, $P$ is the transition kernel, $r$ is the reward function, $\gamma$ is the discount factor, $\rho_0$ is the initial state distribution and $T$ is the time horizon. We denote the dimensionalities of $S$ and $A$ by $|S|$ and $|A|$ respectively. A policy $\pi$ is a mapping from states to distributions over actions; we denote the space of all policies by $\Pi$. In RL, the goal is to learn a policy $\pi^*$ that maximizes the expected sum of discounted rewards it encounters, that is, the expected return. Depending on the context, we might use the concept of a cost $c$ rather than a reward $r$ (Puterman, 2014), which essentially moves the goal of the policy from maximizing its return to minimizing its cumulative cost. State-action distributions. Suppose a policy $\pi$ visits the successive states and actions $s_1, a_1, \ldots, s_T, a_T$ during an episode; we define the empirical state-action distribution $\hat\rho_\pi$ as $\hat\rho_\pi = \frac{1}{T}\sum_{t=1}^{T} \delta_{s_t, a_t}$, where $\delta_{s_t, a_t}$ is a Dirac distribution centered on $(s_t, a_t)$. Similarly, suppose we have a set of expert demonstrations $\mathcal{D} = \{(s^e_i, a^e_i)\}_{i=1}^{D}$ of size $D$; then the associated empirical expert state-action distribution $\hat\rho_e$ is defined as $\hat\rho_e = \frac{1}{D}\sum_{(s,a)\in\mathcal{D}} \delta_{s,a}$. Wasserstein distance. Suppose we have a metric space $(M, d)$, where $M$ is a set and $d$ is a metric on $M$. Given two distributions $\mu$ and $\nu$ on $M$ with finite moments, the $p$-th order Wasserstein distance (Villani, 2008) is defined as $W_p^p(\mu, \nu) = \inf_{\theta \in \Theta(\mu,\nu)} \int_{M \times M} d(x, y)^p \, d\theta(x, y)$, where $\Theta(\mu,\nu)$ is the set of all couplings between $\mu$ and $\nu$. In the remainder, we only consider distributions with finite support. A coupling between two distributions of support cardinalities $T$ and $D$ is a doubly stochastic matrix of size $T \times D$.
We denote by $\Theta$ the set of all doubly stochastic matrices of size $T \times D$: $\Theta = \{\theta \in \mathbb{R}_+^{T \times D} \mid \forall j \in [1{:}D], \sum_{i=1}^{T} \theta[i, j] = \frac{1}{D},\ \forall i \in [1{:}T], \sum_{j=1}^{D} \theta[i, j] = \frac{1}{T}\}$. The Wasserstein distance between distributions of state-action pairs requires the definition of a metric $d$ on the space $S \times A$. Defining a metric in an MDP is non-trivial (Ferns et al., 2004; Mahadevan & Maggioni, 2007); we show an example where the metric is learned from demonstrations in Section 4.4. For now, we assume the existence of a metric $d : (S \times A) \times (S \times A) \to \mathbb{R}_+$.
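To make the primal formulation concrete, the following sketch solves the optimal transport linear program over the coupling set $\Theta$ between two finite state-action point sets with uniform weights $1/T$ and $1/D$. This is an illustrative implementation, not the paper's algorithm: it assumes a Euclidean ground metric on concatenated state-action vectors and uses `scipy.optimize.linprog` as the LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_primal(agent_sa, expert_sa, p=1):
    """Primal p-Wasserstein distance between empirical distributions.

    agent_sa: (T, dim) array of agent state-action pairs.
    expert_sa: (D, dim) array of expert state-action pairs.
    Assumes a Euclidean ground metric (an illustrative choice;
    the paper allows any metric d on the state-action space).
    """
    T, D = len(agent_sa), len(expert_sa)
    # Ground cost matrix: pairwise distances raised to the power p.
    cost = np.linalg.norm(
        agent_sa[:, None, :] - expert_sa[None, :, :], axis=-1) ** p

    # Marginal constraints on the flattened coupling theta (T*D variables):
    # each row of theta sums to 1/T, each column sums to 1/D.
    A_eq, b_eq = [], []
    for i in range(T):
        row = np.zeros((T, D))
        row[i, :] = 1.0
        A_eq.append(row.ravel())
        b_eq.append(1.0 / T)
    for j in range(D):
        col = np.zeros((T, D))
        col[:, j] = 1.0
        A_eq.append(col.ravel())
        b_eq.append(1.0 / D)

    # Minimize <cost, theta> subject to the marginal constraints, theta >= 0.
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)
```

If the agent visits exactly the expert's state-action pairs, the optimal coupling is supported on the diagonal and the distance is zero; as the agent's visitation drifts away from the expert's, the distance grows with the ground metric. For large $T$ and $D$, dedicated OT solvers (e.g., the POT library) are far more efficient than a generic LP.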

3. METHOD

We present the theoretical motivation of our approach: the minimization of the Wasserstein distance between the state-action distributions of the agent and the expert. We introduce a reward based on an upper-bound of the primal form of the Wasserstein distance inferred from a relaxation of

