ROBUST IMITATION VIA DECISION-TIME PLANNING

Anonymous authors
Paper under double-blind review

Abstract

The goal of imitation learning is to mimic expert behavior from demonstrations, without access to an explicit reward signal. A popular class of approaches infers the (unknown) reward function via inverse reinforcement learning (IRL), followed by maximizing this reward function via reinforcement learning (RL). The policies learned via these approaches, however, are brittle in practice and deteriorate quickly even under small test-time perturbations due to compounding errors. We propose Imitation with Planning at Test-time (IMPLANT), a new algorithm for imitation learning that utilizes decision-time planning to correct for the compounding errors of any base imitation policy. In contrast to existing approaches, we retain both the imitation policy and the reward model at decision-time, thereby benefiting from the learning signals of both components. Empirically, we demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments and excels at zero-shot generalization when subject to challenging perturbations in test-time dynamics.

1. INTRODUCTION

The objective of imitation learning is to optimize agent policies directly from demonstrations of expert behavior. Such a learning paradigm sidesteps reward engineering, which is a key bottleneck for applying reinforcement learning (RL) in many real-world domains, e.g., autonomous driving and robotics. Given only a finite dataset of expert demonstrations, however, a key challenge with current approaches is that the learned policies can quickly deviate from the intended expert behavior, leading to compounding errors at test-time (Osa et al., 2018). Moreover, it has been observed that imitation policies can be brittle and deteriorate drastically in performance under even small perturbations to the dynamics during execution (Christiano et al., 2016; de Haan et al., 2019).

A predominant class of approaches to imitation learning is based on inverse reinforcement learning (IRL) and involves the successive application of two steps: (a) an IRL step, where the agent infers the (unknown) reward function of the expert, followed by (b) an RL step, where the agent maximizes the inferred reward function via a policy optimization algorithm. For example, many popular IRL approaches consider an adversarial learning framework (Goodfellow et al., 2014), where the reward function is inferred by a discriminator that distinguishes expert demonstrations from roll-outs of an imitation policy [IRL step], and the imitation agent maximizes the inferred reward function to best match the expert policy [RL step] (Ho & Ermon, 2016; Fu et al., 2017). In this sense, reward inference is only an intermediary step towards learning the expert policy and is discarded after training of the imitation agent.

We introduce Imitation with Planning at Test-time (IMPLANT), a new algorithm for imitation learning that incorporates decision-time planning within an IRL algorithm.
During training, we can use any standard IRL approach to estimate a reward function and a stochastic imitation policy, along with an additional value function. The value function can be learned explicitly, or is often a byproduct of standard RL algorithms that involve policy evaluation, such as actor-critic methods (Konda & Tsitsiklis, 2000; Peters & Schaal, 2008). At decision-time, we use the learned imitation policy in conjunction with a closed-loop planner. For any given state, the imitation policy proposes a set of candidate actions, and the planner estimates the return of each action by performing fixed-horizon rollouts. The rollout returns are estimated using the learned reward and value functions. Finally, the agent picks the action with the highest estimated return, and the process is repeated at each subsequent timestep. Conceptually, IMPLANT aims to counteract the imperfections of policy optimization in the RL step by using the reward function (along with a value function) estimated in the IRL step for decision-time planning.

We demonstrate strong empirical improvements with this approach over benchmark imitation learning algorithms in a variety of settings derived from the MuJoCo-based benchmarks in OpenAI Gym (Todorov et al., 2012; Brockman et al., 2016). In the default evaluation setup, where train and test environments match, we observe that IMPLANT improves by 16.5% on average over the closest baseline. We also consider transfer setups where the imitation agent is deployed in test dynamics that differ from the train dynamics, and the test dynamics are inaccessible to the agent during both training and decision-time planning.
In particular, we consider the following three setups: (a) "causal confusion", where the agent observes nuisance variables in the state representation during training (de Haan et al., 2019); (b) motor noise, which adds noise to the executed actions during testing (Christiano et al., 2016); and (c) transition noise, which adds noise to the next-state distribution during testing. In all these setups, we observe that IMPLANT consistently and robustly transfers to test environments, with improvements of 35.2% on average over the closest baseline.
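The decision-time planning loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the interfaces (`policy.sample`, `env_model.step`) and all hyperparameter names are our hypothetical placeholders, and we assume scalar-valued learned reward and value functions from the IRL step.

```python
import numpy as np

def implant_act(state, env_model, policy, reward_fn, value_fn,
                n_candidates=10, horizon=5, gamma=0.99):
    """Decision-time planning sketch: score candidate first actions proposed
    by the imitation policy via fixed-horizon rollouts, using the learned
    reward function for per-step rewards and the learned value function to
    bootstrap the return beyond the planning horizon."""
    best_action, best_return = None, -np.inf
    for _ in range(n_candidates):
        a0 = policy.sample(state)              # candidate from imitation policy
        s, a, ret, discount = state, a0, 0.0, 1.0
        for _ in range(horizon):
            ret += discount * reward_fn(s, a)  # learned reward from the IRL step
            s = env_model.step(s, a)           # simulate one transition
            a = policy.sample(s)               # continue rollout with the policy
            discount *= gamma
        ret += discount * value_fn(s)          # bootstrap with learned value
        if ret > best_return:
            best_action, best_return = a0, ret
    return best_action
```

In a closed-loop deployment, this routine would be called once per timestep, with only the selected first action executed before re-planning from the next observed state.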

2. PRELIMINARIES

Problem Setup. We consider the framework of Markov Decision Processes (MDPs) (Puterman, 1990). An MDP is denoted by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, p_0, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ are the stochastic transition dynamics, $p_0 : \mathcal{S} \to \mathbb{R}_{\geq 0}$ is the initial state distribution, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. We assume an infinite-horizon setting. At any given state $s \in \mathcal{S}$, an agent makes decisions via a stochastic policy $\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{\geq 0}$. We denote a trajectory as a sequence of state-action pairs $\tau = (s_0, a_0, s_1, a_1, \cdots)$. Any policy $\pi$, along with the MDP parameters, induces a distribution over trajectories, which can be expressed as $p_\pi(\tau) = p_0(s_0) \prod_{t=0}^{\infty} \pi(a_t \mid s_t)\, \mathcal{T}(s_{t+1} \mid s_t, a_t)$. The return of a trajectory is the discounted sum of rewards $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$. In reinforcement learning (RL), the goal is to learn a parameterized policy $\pi_\theta$ that maximizes the expected return w.r.t. the trajectory distribution. Maximizing such an objective requires interaction with the underlying MDP for simulating trajectories and querying rewards. However, in many high-stakes scenarios, the reward function is not directly accessible and is hard to manually design. In imitation learning, we do not assume access to the reward function. Instead, we have access to a finite set $\tau_E$ of $D$ trajectories (a.k.a. demonstrations) sampled from an expert policy $\pi_E$. Every trajectory $\tau \in \tau_E$ consists of a finite-length sequence of state-action pairs $\tau = (s_0, a_0, s_1, a_1, \cdots)$, where $s_0 \sim p_0(s)$, $a_t \sim \pi_E(\cdot \mid s_t)$, and $s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)$. Our goal is to learn a parameterized policy $\pi_\theta$ that best approximates the expert policy given access to $\tau_E$. Next, we discuss the two major families of techniques for imitation learning.
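As a concrete illustration of the return definition above, $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$ can be computed for a finite-length trajectory with a backward recursion; this is a generic sketch, not code from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Return of a finite-length trajectory: R(tau) = sum_t gamma^t * r_t.
    The backward recursion ret_t = r_t + gamma * ret_{t+1} avoids computing
    explicit powers of gamma."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

For example, with rewards `[1, 1, 1]` and `gamma = 0.5`, this yields $1 + 0.5 + 0.25 = 1.75$.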

2.1. BEHAVIORAL CLONING

Behavioral cloning (BC) casts imitation learning as a supervised learning problem over the state-action pairs provided in the expert demonstrations (Pomerleau, 1991). In particular, we learn the policy parameters by solving a regression problem with the states $s_t$ and actions $a_t$ as the features and target labels, respectively. Formally, we minimize the following objective: $\mathcal{L}_{\text{BC}}(\theta) := \sum_{(s_t, a_t) \in \tau_E} \|a_t - \pi_\theta(s_t)\|_2^2$. In practice, BC agents suffer from distribution shift in high dimensions, where small deviations in the learned policy quickly accumulate during deployment and lead to a significantly different trajectory distribution relative to the expert (Ross & Bagnell, 2010; Ross et al., 2011).
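When the policy is linear in the state, the BC objective above reduces to ordinary least squares. The following minimal sketch fits such a policy to expert state-action pairs; the linear parameterization is our illustrative assumption (in practice $\pi_\theta$ is typically a neural network trained by gradient descent on the same loss):

```python
import numpy as np

def fit_bc_linear(states, actions):
    """Behavioral cloning sketch for a linear policy pi_theta(s) = W^T s:
    minimizes sum_t ||a_t - W^T s_t||_2^2 over expert (s, a) pairs via
    least squares."""
    S = np.asarray(states)   # shape (N, state_dim)
    A = np.asarray(actions)  # shape (N, action_dim)
    W, *_ = np.linalg.lstsq(S, A, rcond=None)
    return W                 # predict actions with: states @ W
```

For deterministic linear experts this recovers the expert mapping exactly; the distribution-shift problem discussed above arises at deployment time, outside this fitting step.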

2.2. INVERSE REINFORCEMENT LEARNING

An alternative, indirect approach to imitation learning is based on inverse reinforcement learning (IRL). Here, the goal is to infer a reward function for the expert and subsequently maximize the inferred reward to obtain a policy. For brevity, we focus on adversarial imitation learning approaches

