IMITATION WITH NEURAL DENSITY MODELS

Abstract

We propose a new framework for Imitation Learning (IL) via density estimation of the expert's occupancy measure followed by Maximum Occupancy Entropy Reinforcement Learning (RL) using the density as a reward. Our approach maximizes a non-adversarial model-free RL objective that provably lower bounds reverse Kullback-Leibler divergence between occupancy measures of the expert and imitator. We present a practical IL algorithm, Neural Density Imitation (NDI), which obtains state-of-the-art demonstration efficiency on benchmark control tasks.

1. INTRODUCTION

Imitation Learning (IL) algorithms aim to learn optimal behavior by mimicking expert demonstrations. Perhaps the simplest IL method is Behavioral Cloning (BC) (Pomerleau, 1991), which ignores the dynamics of the underlying Markov Decision Process (MDP) that generated the demonstrations and treats IL as a supervised learning problem of predicting optimal actions given states. Prior work showed that if the learned policy incurs a small BC loss, the worst-case performance gap between the expert and imitator grows quadratically with the number of decision steps (Ross & Bagnell, 2010; Ross et al., 2011a). The crux of their argument is that policies that are "close" as measured by BC loss can induce disastrously different distributions over states when deployed in the environment. One family of solutions for mitigating such compounding errors is Interactive IL (Ross et al., 2011b; 2013; Guo et al., 2014), which involves running the imitator's policy and collecting corrective actions from an interactive expert. However, interactive expert queries can be expensive and are seldom available. Another family of approaches (Ho & Ermon, 2016; Fu et al., 2017; Ke et al., 2020; Kostrikov et al., 2020; Kim & Park, 2018; Wang et al., 2017) that has gained much traction is to directly minimize a statistical distance between the state-action distributions induced by the policies of the expert and imitator, i.e., the occupancy measures ρ_{π_E} and ρ_{π_θ}. As ρ_{π_θ} is an implicit distribution induced by the policy and environment¹, distribution matching with ρ_{π_θ} typically requires likelihood-free methods involving sampling. Sampling from ρ_{π_θ} entails running the imitator policy in the environment, which was not required by BC. While distribution matching IL requires additional access to an environment simulator, it has been shown to drastically improve demonstration efficiency, i.e., the number of demonstrations needed to succeed at IL (Ho & Ermon, 2016).
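To make the supervised-learning view of BC concrete, the following is a minimal sketch that fits an imitator policy by regressing expert actions on expert states. The linear expert, the synthetic data, and the least-squares policy class are all illustrative assumptions, not part of the paper's method; the point is only that BC never touches the MDP dynamics.

```python
import numpy as np

# Hypothetical expert demonstrations: N states (4-dim) and matching actions (2-dim).
rng = np.random.default_rng(0)
states = rng.normal(size=(100, 4))
true_W = rng.normal(size=(4, 2))   # assumed (linear) expert mapping, for illustration only
actions = states @ true_W          # expert actions for each demonstrated state

# Behavioral cloning: treat IL as supervised regression of actions on states,
# ignoring the environment dynamics entirely.
W_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)

def bc_policy(s):
    """Imitator policy learned purely by supervised regression."""
    return s @ W_bc

# The cloned policy matches the expert on the demonstration states, but nothing here
# constrains the state distribution it induces when actually deployed in the MDP.
train_err = np.abs(bc_policy(states) - actions).max()
```

This is exactly where compounding errors enter: a small regression error on demonstrated states says nothing about the states the imitator itself visits at deployment time.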
A wide suite of distribution matching IL algorithms use adversarial methods to match ρ_{π_θ} and ρ_{π_E}, which requires alternating between reward (discriminator) and policy (generator) updates (Ho & Ermon, 2016; Fu et al., 2017; Ke et al., 2020; Kostrikov et al., 2020; Kim et al., 2019). A key drawback of such Adversarial Imitation Learning (AIL) methods is that they inherit the instability of alternating min-max optimization (Salimans et al., 2016; Miyato et al., 2018), which is generally not guaranteed to converge (Jin et al., 2019). Furthermore, this instability is exacerbated in the IL setting, where generator updates involve high-variance policy optimization, and leads to sub-optimal demonstration efficiency. To alleviate this instability, several works (Wang et al., 2019; Brantley et al., 2020; Reddy et al., 2017) have proposed to do RL with fixed heuristic rewards. Wang et al. (2019), for example, use a heuristic reward that estimates the support of ρ_{π_E}, which discourages the imitator from visiting out-of-support states. While having the merit of simplicity, these approaches have no guarantee of recovering the true expert policy.

In this work, we propose a new framework for IL via obtaining a density estimate q of the expert's occupancy measure ρ_{π_E}, followed by Maximum Occupancy Entropy Reinforcement Learning (MaxOccEntRL) (Lee et al., 2019; Islam et al., 2019). In the MaxOccEntRL step, the density estimate q is used as a fixed reward for RL and the occupancy entropy H(ρ_{π_θ}) is simultaneously maximized, leading to the objective max_θ E_{ρ_{π_θ}}[log q(s, a)] + H(ρ_{π_θ}). Intuitively, our approach encourages the imitator to visit high-density state-action pairs under ρ_{π_E} while maximally exploring the state-action space.

¹ We assume only samples can be taken from the environment dynamics and that its density is unknown.
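The density-as-reward idea can be sketched as follows. The paper fits a neural density model to the expert's occupancy measure; purely for illustration we substitute a diagonal Gaussian fit to expert (state, action) samples, and use its log-density log q(s, a) as the fixed RL reward. The data, the Gaussian model, and all variable names are assumptions for this sketch; the occupancy-entropy bonus H(ρ_{π_θ}) would be added separately by the RL algorithm.

```python
import numpy as np

# Hypothetical samples from the expert's occupancy measure:
# each row is a concatenated (state, action) vector.
rng = np.random.default_rng(1)
expert_sa = rng.normal(loc=1.0, scale=0.5, size=(500, 3))

# Stand-in density model q(s, a): a diagonal Gaussian fit to the expert samples.
# (The paper uses neural density models; this Gaussian is an illustrative substitute.)
mu = expert_sa.mean(axis=0)
var = expert_sa.var(axis=0)

def log_q(sa):
    """Log-density of the fitted diagonal Gaussian, used as a fixed RL reward."""
    return -0.5 * np.sum((sa - mu) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

# Expert-like imitator visits receive a higher reward than out-of-support visits,
# so an RL algorithm maximizing E[log q(s, a)] is pushed toward the expert's occupancy.
imitator_sa = rng.normal(loc=1.0, scale=0.5, size=(10, 3))
outlier_sa = rng.normal(loc=5.0, scale=0.5, size=(10, 3))
r_in = log_q(imitator_sa).mean()
r_out = log_q(outlier_sa).mean()
```

Because log q is fixed once fitted, the RL step is an ordinary (non-adversarial) policy optimization, avoiding the alternating min-max updates of AIL.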

