IMITATION WITH NEURAL DENSITY MODELS

Abstract

We propose a new framework for Imitation Learning (IL) via density estimation of the expert's occupancy measure followed by Maximum Occupancy Entropy Reinforcement Learning (RL) using the density as a reward. Our approach maximizes a non-adversarial model-free RL objective that provably lower bounds reverse Kullback-Leibler divergence between occupancy measures of the expert and imitator. We present a practical IL algorithm, Neural Density Imitation (NDI), which obtains state-of-the-art demonstration efficiency on benchmark control tasks.

1. INTRODUCTION

Imitation Learning (IL) algorithms aim to learn optimal behavior by mimicking expert demonstrations. Perhaps the simplest IL method is Behavioral Cloning (BC) (Pomerleau, 1991), which ignores the dynamics of the underlying Markov Decision Process (MDP) that generated the demonstrations and treats IL as a supervised learning problem of predicting optimal actions given states. Prior work showed that if the learned policy incurs a small BC loss, the worst-case performance gap between the expert and imitator grows quadratically with the number of decision steps (Ross & Bagnell, 2010; Ross et al., 2011a). The crux of their argument is that policies that are "close" as measured by BC loss can induce disastrously different distributions over states when deployed in the environment. One family of solutions for mitigating such compounding errors is Interactive IL (Ross et al., 2011b; 2013; Guo et al., 2014), which involves running the imitator's policy and collecting corrective actions from an interactive expert. However, interactive expert queries can be expensive and are seldom available. Another family of approaches (Ho & Ermon, 2016; Fu et al., 2017; Ke et al., 2020; Kostrikov et al., 2020; Kim & Park, 2018; Wang et al., 2017) that has gained much traction is to directly minimize a statistical distance between the state-action distributions induced by the policies of the expert and imitator, i.e. the occupancy measures ρ_{π_E} and ρ_{π_θ}. As ρ_{π_θ} is an implicit distribution induced by the policy and environment[1], distribution matching with ρ_{π_θ} typically requires likelihood-free methods involving sampling. Sampling from ρ_{π_θ} entails running the imitator policy in the environment, which was not required by BC. While distribution matching IL requires additional access to an environment simulator, it has been shown to drastically improve demonstration efficiency, i.e. the number of demonstrations needed to succeed at IL (Ho & Ermon, 2016).
A wide suite of distribution matching IL algorithms use adversarial methods to match ρ_{π_θ} and ρ_{π_E}, which requires alternating between reward (discriminator) and policy (generator) updates (Ho & Ermon, 2016; Fu et al., 2017; Ke et al., 2020; Kostrikov et al., 2020; Kim et al., 2019). A key drawback of such Adversarial Imitation Learning (AIL) methods is that they inherit the instability of alternating min-max optimization (Salimans et al., 2016; Miyato et al., 2018), which is generally not guaranteed to converge (Jin et al., 2019). Furthermore, this instability is exacerbated in the IL setting, where generator updates involve high-variance policy optimization, and leads to sub-optimal demonstration efficiency. To alleviate this instability, Wang et al. (2019), Brantley et al. (2020), and Reddy et al. (2017) proposed to do RL with fixed heuristic rewards. Wang et al. (2019), for example, use a heuristic reward that estimates the support of ρ_{π_E}, which discourages the imitator from visiting out-of-support states. While having the merit of simplicity, these approaches have no guarantee of recovering the true expert policy. In this work, we propose a new framework for IL via obtaining a density estimate q of the expert's occupancy measure ρ_{π_E}, followed by Maximum Occupancy Entropy Reinforcement Learning (MaxOccEntRL) (Lee et al., 2019; Islam et al., 2019). In the MaxOccEntRL step, the density estimate q is used as a fixed reward for RL and the occupancy entropy H(ρ_{π_θ}) is simultaneously maximized, leading to the objective max_θ E_{ρ_{π_θ}}[log q(s, a)] + H(ρ_{π_θ}). Intuitively, our approach encourages the imitator to visit high-density state-action pairs under ρ_{π_E} while maximally exploring the state-action space. There are two main challenges to this approach. First, we require accurate density estimation of ρ_{π_E}, which is particularly challenging when the state-action space is high dimensional and the number of expert demonstrations is limited.
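To make the fixed-reward idea concrete, the following toy sketch (our own construction, not the paper's implementation) fits a density model q on synthetic expert state-action pairs and exposes log q(s, a) as the reward handed to the RL step. A diagonal Gaussian stands in for the neural density estimator; the paper uses far more expressive models, and all names here (`expert_sa`, `log_q`) are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake expert demonstrations: 500 state-action pairs in R^2,
# concentrated around (s, a) = (1.0, -1.0).
expert_sa = rng.normal(loc=[1.0, -1.0], scale=0.3, size=(500, 2))

# "Density estimation" step: fit a diagonal Gaussian to the demos
# (a stand-in for a neural density model such as an autoregressive model).
mu = expert_sa.mean(axis=0)
var = expert_sa.var(axis=0) + 1e-6

def log_q(sa):
    """Log-density of the fitted Gaussian, used as a *fixed* RL reward."""
    return -0.5 * np.sum((sa - mu) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

# The imitator's reward is high near expert-visited pairs, low elsewhere.
rewards = log_q(np.array([[1.0, -1.0], [3.0, 3.0]]))
print(rewards)
```

Because q is frozen after fitting, the subsequent policy optimization is ordinary RL against a stationary reward, which is what lets the framework sidestep the discriminator-generator alternation of AIL.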
Second, in contrast to Maximum Entropy RL (MaxEntRL), MaxOccEntRL requires maximizing the entropy of an implicit density ρ_{π_θ}. We address the former challenge by leveraging advances in density estimation (Germain et al., 2015; Du & Mordatch, 2018; Song et al., 2019). For the latter challenge, we derive a non-adversarial model-free RL objective that provably maximizes a lower bound to the occupancy entropy. As a byproduct, we also obtain a model-free RL objective that lower bounds the reverse Kullback-Leibler (KL) divergence between ρ_{π_θ} and ρ_{π_E}. The contribution of our work is introducing a novel family of distribution matching IL algorithms, named Neural Density Imitation (NDI), that (1) optimizes a principled lower bound to the additive inverse of the reverse KL divergence, thereby avoiding adversarial optimization, and (2) advances state-of-the-art demonstration efficiency in IL.

2. IMITATION LEARNING VIA DENSITY ESTIMATION

We model an agent's decision making process as a discounted infinite-horizon Markov Decision Process (MDP) M = (S, A, P, P_0, r, γ). Here S, A are state-action spaces, P : S × A → Ω(S) is a transition dynamics where Ω(S) is the set of probability measures on S, P_0 : S → R is an initial state distribution, r : S × A → R is a reward function, and γ ∈ [0, 1) is a discount factor. A parameterized policy π_θ : S → Ω(A) distills the agent's decision making rule, and {s_t, a_t}_{t=0}^∞ is the stochastic process realized by sampling an initial state s_0 ∼ P_0(s) and then running π_θ in the environment, i.e. a_t ∼ π_θ(·|s_t), s_{t+1} ∼ P(·|s_t, a_t). We denote by p_{θ,t:t+k} the joint distribution of the states {s_t, s_{t+1}, ..., s_{t+k}}, where setting k = 0 recovers the marginal p_{θ,t} of s_t. The (unnormalized) occupancy measure of π_θ is defined as ρ_{π_θ}(s, a) = Σ_{t=0}^∞ γ^t p_{θ,t}(s) π_θ(a|s). Intuitively, ρ_{π_θ}(s, a) quantifies the frequency of visiting the state-action pair (s, a) when running π_θ for a long time, with more emphasis on earlier states. We denote policy performance as J(π_θ, r̄) = E_{π_θ}[Σ_{t=0}^∞ γ^t r̄(s_t, a_t)] = E_{(s,a)∼ρ_{π_θ}}[r̄(s, a)], where r̄ is a (potentially) augmented reward function and E denotes the generalized expectation operator extended to non-normalized densities p : X → R_+ and functions f : X → Y, so that E_p[f(x)] = Σ_x p(x) f(x). The choice of r̄ depends on the RL framework. In standard RL, we simply have r̄ = r, while in Maximum Entropy RL (MaxEntRL) (Haarnoja et al., 2017), we have r̄(s, a) = r(s, a) − log π_θ(a|s). We denote the entropy of ρ_{π_θ}(s, a) as H(ρ_{π_θ}) = E_{ρ_{π_θ}}[− log ρ_{π_θ}(s, a)] and overload notation to denote the γ-discounted causal entropy of the policy π_θ as H(π_θ) = E_{π_θ}[−Σ_{t=0}^∞ γ^t log π_θ(a_t|s_t)] = E_{ρ_{π_θ}}[− log π_θ(a|s)]. Note that we use a generalized notion of entropy where the domain is extended to non-normalized densities.
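These definitions can be checked numerically in a tiny tabular MDP. The sketch below (our own illustration, with made-up transition and reward tables) accumulates the unnormalized occupancy measure ρ_{π}(s, a) = Σ_t γ^t p_t(s) π(a|s) by iterating the state marginals, then verifies the identity J(π, r) = E_{(s,a)∼ρ_π}[r(s, a)]:

```python
import numpy as np

gamma = 0.9
# P[s, a, s']: transition probabilities of a 2-state, 2-action MDP (made up).
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.9, 0.1]]])
p0 = np.array([1.0, 0.0])      # initial state distribution P_0
pi = np.array([[0.7, 0.3],     # pi[s, a] = pi(a|s)
               [0.4, 0.6]])
r = np.array([[1.0, 0.0],      # r[s, a]
              [0.0, 2.0]])

# Accumulate rho(s, a) = sum_t gamma^t p_t(s) pi(a|s), truncating the sum
# once gamma^t is negligible.
rho = np.zeros((2, 2))
p_t = p0.copy()
for t in range(1000):
    rho += (gamma ** t) * p_t[:, None] * pi
    # p_{t+1}(s') = sum_{s,a} p_t(s) pi(a|s) P(s'|s,a)
    p_t = np.einsum('s,sa,sap->p', p_t, pi, P)

# rho is unnormalized: it sums to Z = 1 / (1 - gamma), not to 1.
print(round(rho.sum(), 3))

# Policy performance via the generalized expectation: J(pi, r) = E_rho[r].
J = (rho * r).sum()
print(round(J, 4))
```

The same J can be cross-checked against the closed form p_0ᵀ(I − γP_π)⁻¹r_π, which is a useful sanity test when implementing occupancy-based objectives.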
We can then define the Maximum Occupancy Entropy RL (MaxOccEntRL) (Lee et al., 2019; Islam et al., 2019) objective as J(π_θ, r̄ = r) + H(ρ_{π_θ}). Note the key difference between MaxOccEntRL and MaxEntRL: entropy regularization is on the occupancy measure instead of the policy, i.e. it seeks state diversity instead of action diversity. We will later show in Section 2.2 that a lower bound on this objective reduces to a completely model-free RL objective with an augmented reward r̄. Let π_E, π_θ denote an expert and imitator policy, respectively. Given only demonstrations D = {(s, a)_i}_{i=1}^k ∼ π_E of state-action pairs sampled from the expert, Imitation Learning (IL) aims to learn a policy π_θ which matches the expert, i.e. π_θ = π_E. Formally, IL can be recast as a distribution matching problem (Ho & Ermon, 2016; Ke et al., 2020) between the occupancy measures ρ_{π_θ} and ρ_{π_E}:

minimize_θ d(ρ_{π_θ}, ρ_{π_E})    (1)

where d(p, q) is a generalized statistical distance defined on the extended domain of (potentially) non-normalized probability densities p(x), q(x) with the same normalization factor Z > 0, i.e. ∫_x p(x)/Z = ∫_x q(x)/Z = 1. For ρ_{π_θ} and ρ_{π_E}, we have Z = 1/(1 − γ). As we are only able to take samples from the transition kernel and its density is unknown, ρ_{π_θ} is an implicit distribution[2]. Thus, optimizing Eq. 1 typically requires likelihood-free approaches leveraging samples from ρ_{π_θ}, i.e. running π_θ in the environment. Current state-of-the-art IL approaches use likelihood-free adversarial methods to approximately optimize Eq. 1 for various choices of d, such as reverse Kullback-Leibler (KL) divergence (Fu et al., 2017; Kostrikov et al., 2020) and Jensen-Shannon (JS) divergence (Ho & Ermon, 2016). However, adversarial methods are known to suffer from optimization instability, which is exacerbated in the IL setting where one step of the alternating optimization involves RL.
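The shared normalizer Z = 1/(1 − γ) is what makes the generalized distance well behaved: for reverse KL, the generalized divergence on unnormalized occupancies equals Z times the ordinary KL between their normalized versions, so minimizing one minimizes the other. A small numerical check of this (our own toy construction, with random stand-ins for the two occupancy measures):

```python
import numpy as np

gamma = 0.9
Z = 1.0 / (1.0 - gamma)  # shared normalizer of both occupancy measures

# Random stand-ins for imitator and expert occupancies over 6 state-action
# pairs, each rescaled to sum to Z (i.e. unnormalized, like rho).
rng = np.random.default_rng(1)
rho_theta = rng.random(6)
rho_theta *= Z / rho_theta.sum()
rho_E = rng.random(6)
rho_E *= Z / rho_E.sum()

# Generalized reverse KL on the unnormalized densities.
d = np.sum(rho_theta * np.log(rho_theta / rho_E))

# Ordinary reverse KL between the normalized versions; the log-ratio is
# unchanged because Z cancels inside it, so d = Z * d_normalized exactly.
d_normalized = np.sum((rho_theta / Z) * np.log((rho_theta / Z) / (rho_E / Z)))
print(np.isclose(d, Z * d_normalized))
```

This is why the distribution matching objective in Eq. 1 can be stated directly on the unnormalized occupancy measures without tracking Z.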



[1] We assume only samples can be taken from the environment dynamics and that its density is unknown.
[2] Implicit distributions are probability models that have potentially intractable density functions, but can be sampled from to estimate expectations and gradients of expectations with respect to model parameters (Huszár, 2017).

