IMITATION WITH NEURAL DENSITY MODELS

Abstract

We propose a new framework for Imitation Learning (IL) via density estimation of the expert's occupancy measure followed by Maximum Occupancy Entropy Reinforcement Learning (RL) using the density as a reward. Our approach maximizes a non-adversarial model-free RL objective that provably lower bounds reverse Kullback-Leibler divergence between occupancy measures of the expert and imitator. We present a practical IL algorithm, Neural Density Imitation (NDI), which obtains state-of-the-art demonstration efficiency on benchmark control tasks.

1. INTRODUCTION

Imitation Learning (IL) algorithms aim to learn optimal behavior by mimicking expert demonstrations. Perhaps the simplest IL method is Behavioral Cloning (BC) (Pomerleau, 1991), which ignores the dynamics of the underlying Markov Decision Process (MDP) that generated the demonstrations and treats IL as a supervised learning problem of predicting optimal actions given states. Prior work showed that if the learned policy incurs a small BC loss, the worst-case performance gap between the expert and imitator grows quadratically with the number of decision steps (Ross & Bagnell, 2010; Ross et al., 2011a). The crux of their argument is that policies that are "close" as measured by BC loss can induce disastrously different distributions over states when deployed in the environment. One family of solutions for mitigating such compounding errors is Interactive IL (Ross et al., 2011b; 2013; Guo et al., 2014), which involves running the imitator's policy and collecting corrective actions from an interactive expert. However, interactive expert queries can be expensive and are seldom available.
Another family of approaches (Ho & Ermon, 2016; Fu et al., 2017; Ke et al., 2020; Kostrikov et al., 2020; Kim & Park, 2018; Wang et al., 2017) that has gained much traction directly minimizes a statistical distance between the state-action distributions induced by the policies of the expert and imitator, i.e. the occupancy measures $\rho_{\pi_E}$ and $\rho_{\pi_\theta}$. As $\rho_{\pi_\theta}$ is an implicit distribution induced by the policy and environment (footnote 1), distribution matching with $\rho_{\pi_\theta}$ typically requires likelihood-free methods involving sampling. Sampling from $\rho_{\pi_\theta}$ entails running the imitator policy in the environment, which was not required by BC. While distribution matching IL requires additional access to an environment simulator, it has been shown to drastically improve demonstration efficiency, i.e. the number of demonstrations needed to succeed at IL (Ho & Ermon, 2016).
A wide suite of distribution matching IL algorithms use adversarial methods to match $\rho_{\pi_\theta}$ and $\rho_{\pi_E}$, which requires alternating between reward (discriminator) and policy (generator) updates (Ho & Ermon, 2016; Fu et al., 2017; Ke et al., 2020; Kostrikov et al., 2020; Kim et al., 2019). A key drawback of such Adversarial Imitation Learning (AIL) methods is that they inherit the instability of alternating min-max optimization (Salimans et al., 2016; Miyato et al., 2018), which is generally not guaranteed to converge (Jin et al., 2019). Furthermore, this instability is exacerbated in the IL setting, where generator updates involve high-variance policy optimization, and it leads to sub-optimal demonstration efficiency. To alleviate this instability, Wang et al. (2019), Brantley et al. (2020), and Reddy et al. (2017) have proposed to do RL with fixed heuristic rewards. Wang et al. (2019), for example, use a heuristic reward that estimates the support of $\rho_{\pi_E}$, which discourages the imitator from visiting out-of-support states. While having the merit of simplicity, these approaches have no guarantee of recovering the true expert policy.
In this work, we propose a new framework for IL via obtaining a density estimate $q_\phi$ of the expert's occupancy measure $\rho_{\pi_E}$ followed by Maximum Occupancy Entropy Reinforcement Learning (MaxOccEntRL) (Lee et al., 2019; Islam et al., 2019). In the MaxOccEntRL step, the density estimate $q_\phi$ is used as a fixed reward for RL and the occupancy entropy $H(\rho_{\pi_\theta})$ is simultaneously maximized, leading to the objective $\max_\theta \mathbb{E}_{\rho_{\pi_\theta}}[\log q_\phi(s, a)] + H(\rho_{\pi_\theta})$. Intuitively, our approach encourages the imitator to visit high density state-action pairs under $\rho_{\pi_E}$ while maximally exploring the state-action space. There are two main challenges to this approach. First, we require accurate density estimation of $\rho_{\pi_E}$, which is particularly challenging when the state-action space is high dimensional and the number of expert demonstrations is limited.
Second, in contrast to Maximum Entropy RL (MaxEntRL), MaxOccEntRL requires maximizing the entropy of an implicit density $\rho_{\pi_\theta}$. We address the former challenge by leveraging advances in density estimation (Germain et al., 2015; Du & Mordatch, 2018; Song et al., 2019). For the latter challenge, we derive a non-adversarial model-free RL objective that provably maximizes a lower bound to occupancy entropy. As a byproduct, we also obtain a model-free RL objective that lower bounds the additive inverse of the reverse Kullback-Leibler (KL) divergence between $\rho_{\pi_\theta}$ and $\rho_{\pi_E}$. The contribution of our work is a novel family of distribution matching IL algorithms, named Neural Density Imitation (NDI), that (1) optimizes a principled lower bound to the additive inverse of reverse KL, thereby avoiding adversarial optimization, and (2) advances state-of-the-art demonstration efficiency in IL.

2. IMITATION LEARNING VIA DENSITY ESTIMATION

We model an agent's decision making process as a discounted infinite-horizon Markov Decision Process (MDP) $M = (S, A, P, P_0, r, \gamma)$. Here $S, A$ are the state and action spaces, $P : S \times A \to \Omega(S)$ is the transition dynamics where $\Omega(S)$ is the set of probability measures on $S$, $P_0 : S \to \mathbb{R}$ is the initial state distribution, $r : S \times A \to \mathbb{R}$ is a reward function, and $\gamma \in [0, 1)$ is a discount factor. A parameterized policy $\pi_\theta : S \to \Omega(A)$ distills the agent's decision making rule and $\{s_t, a_t\}_{t=0}^{\infty}$ is the stochastic process realized by sampling an initial state $s_0 \sim P_0(s)$ and then running $\pi_\theta$ in the environment, i.e. $a_t \sim \pi_\theta(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$. We denote by $p_{\theta,t:t+k}$ the joint distribution of states $\{s_t, s_{t+1}, ..., s_{t+k}\}$, where $p_{\theta,t}$ (the case $k = 0$) recovers the marginal of $s_t$. The (unnormalized) occupancy measure of $\pi_\theta$ is defined as
$$\rho_{\pi_\theta}(s, a) = \sum_{t=0}^{\infty} \gamma^t p_{\theta,t}(s)\, \pi_\theta(a|s).$$
Intuitively, $\rho_{\pi_\theta}(s, a)$ quantifies the frequency of visiting the state-action pair $(s, a)$ when running $\pi_\theta$ for a long time, with more emphasis on earlier states.
We denote policy performance as
$$J(\pi_\theta, \bar{r}) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t \bar{r}(s_t, a_t)\Big] = \bar{\mathbb{E}}_{(s,a) \sim \rho_{\pi_\theta}}[\bar{r}(s, a)]$$
where $\bar{r}$ is a (potentially) augmented reward function and $\bar{\mathbb{E}}$ denotes the generalized expectation operator extended to non-normalized densities $\hat{p} : X \to \mathbb{R}^+$ and functions $f : X \to Y$ so that $\bar{\mathbb{E}}_{\hat{p}}[f(x)] = \sum_x \hat{p}(x) f(x)$. The choice of $\bar{r}$ depends on the RL framework. In standard RL, we simply have $\bar{r} = r$, while in Maximum Entropy RL (MaxEntRL) (Haarnoja et al., 2017), we have $\bar{r}(s, a) = r(s, a) - \log \pi_\theta(a|s)$. We denote the entropy of $\rho_{\pi_\theta}(s, a)$ as $\bar{H}(\rho_{\pi_\theta}) = \bar{\mathbb{E}}_{\rho_{\pi_\theta}}[-\log \rho_{\pi_\theta}(s, a)]$ and overload notation to denote the $\gamma$-discounted causal entropy of policy $\pi_\theta$ as $\bar{H}(\pi_\theta) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^{\infty} -\gamma^t \log \pi_\theta(a_t|s_t)] = \bar{\mathbb{E}}_{\rho_{\pi_\theta}}[-\log \pi_\theta(a|s)]$. Note that we use a generalized notion of entropy where the domain is extended to non-normalized densities.
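To make the occupancy measure concrete, the following is a minimal sketch that computes $\rho_{\pi_\theta}$ for a small tabular MDP and checks numerically that the discounted return equals the generalized expectation of the reward under $\rho_{\pi_\theta}$. The MDP here is randomly generated, the infinite sum is truncated at a finite horizon, and the function names (`occupancy_measure`, `discounted_return`) are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def occupancy_measure(P, P0, pi, gamma, horizon=2000):
    """Unnormalized occupancy rho[s, a] = sum_t gamma^t p_t(s) pi(a|s).

    P:  (S, A, S) array with P[s, a, s'] = transition probability.
    P0: (S,) initial state distribution.
    pi: (S, A) policy, each row sums to 1.
    The infinite sum is truncated at `horizon` steps (an approximation).
    """
    p_t = P0.copy()                       # marginal state distribution at time t
    rho = np.zeros_like(pi)
    for t in range(horizon):
        rho += (gamma ** t) * p_t[:, None] * pi
        # p_{t+1}(s') = sum_{s,a} p_t(s) pi(a|s) P(s'|s,a)
        p_t = np.einsum("s,sa,sap->p", p_t, pi, P)
    return rho

def discounted_return(P, P0, pi, r, gamma, horizon=2000):
    """J = E[sum_t gamma^t r(s_t, a_t)], computed time step by time step."""
    p_t, J = P0.copy(), 0.0
    for t in range(horizon):
        J += (gamma ** t) * np.sum(p_t[:, None] * pi * r)
        p_t = np.einsum("s,sa,sap->p", p_t, pi, P)
    return J

# Hypothetical random MDP used only for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # shape (S, A, S)
P0 = rng.dirichlet(np.ones(S))
pi = rng.dirichlet(np.ones(A), size=S)        # shape (S, A)
r = rng.normal(size=(S, A))

rho = occupancy_measure(P, P0, pi, gamma)
J_direct = discounted_return(P, P0, pi, r, gamma)
J_occupancy = np.sum(rho * r)                 # generalized expectation under rho
```

In this sketch `rho` sums to approximately $1/(1-\gamma)$, which is why the text treats it as a non-normalized density.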
We can then define the Maximum Occupancy Entropy RL (MaxOccEntRL) (Lee et al., 2019; Islam et al., 2019) objective as $J(\pi_\theta, \bar{r} = r) + \bar{H}(\rho_{\pi_\theta})$. Note the key difference between MaxOccEntRL and MaxEntRL: entropy regularization is on the occupancy measure instead of the policy, i.e. it seeks state diversity instead of action diversity. We will later show in Section 2.2 that a lower bound on this objective reduces to a completely model-free RL objective with an augmented reward $\bar{r}$.
Let $\pi_E, \pi_\theta$ denote an expert and imitator policy, respectively. Given only demonstrations $D = \{(s, a)_i\}_{i=1}^{k} \sim \pi_E$ of state-action pairs sampled from the expert, Imitation Learning (IL) aims to learn a policy $\pi_\theta$ which matches the expert, i.e. $\pi_\theta = \pi_E$. Formally, IL can be recast as a distribution matching problem (Ho & Ermon, 2016; Ke et al., 2020) between occupancy measures $\rho_{\pi_\theta}$ and $\rho_{\pi_E}$:
$$\text{maximize}_\theta \; -d(\rho_{\pi_\theta}, \rho_{\pi_E}) \tag{1}$$
where $d(\hat{p}, \hat{q})$ is a generalized statistical distance defined on the extended domain of (potentially) non-normalized probability densities $\hat{p}(x), \hat{q}(x)$ with the same normalization factor $Z > 0$, i.e. $\int_x \hat{p}(x)/Z = \int_x \hat{q}(x)/Z = 1$. For $\rho_{\pi_\theta}$ and $\rho_{\pi_E}$, we have $Z = \frac{1}{1-\gamma}$. As we are only able to take samples from the transition kernel and its density is unknown, $\rho_{\pi_\theta}$ is an implicit distribution (footnote 2). Thus, optimizing Eq. 1 typically requires likelihood-free approaches leveraging samples from $\rho_{\pi_\theta}$, i.e. running $\pi_\theta$ in the environment. Current state-of-the-art IL approaches use likelihood-free adversarial methods to approximately optimize Eq. 1 for various choices of $d$ such as reverse Kullback-Leibler (KL) divergence (Fu et al., 2017; Kostrikov et al., 2020) and Jensen-Shannon (JS) divergence (Ho & Ermon, 2016). However, adversarial methods are known to suffer from optimization instability, which is exacerbated in the IL setting where one step in the alternating optimization involves RL. We instead derive a non-adversarial objective for IL.
In this work, we choose $d$ to be the (generalized) reverse-KL divergence and leave derivations for alternate choices of $d$ to future work.
$$-\bar{D}_{KL}(\rho_{\pi_\theta} \| \rho_{\pi_E}) = \bar{\mathbb{E}}_{\rho_{\pi_\theta}}[\log \rho_{\pi_E}(s, a) - \log \rho_{\pi_\theta}(s, a)] = J(\pi_\theta, \bar{r} = \log \rho_{\pi_E}) + \bar{H}(\rho_{\pi_\theta}) \tag{2}$$
We see that maximizing negative reverse-KL with respect to $\pi_\theta$ is equivalent to Maximum Occupancy Entropy RL (MaxOccEntRL) with $\log \rho_{\pi_E}$ as the fixed reward. Intuitively, this objective drives $\pi_\theta$ to visit states that are most likely under $\rho_{\pi_E}$ while maximally spreading out probability mass so that if two state-action pairs are equally likely, the policy visits both. There are two main challenges associated with this approach, which we address in the following sections.
1. $\log \rho_{\pi_E}$ is unknown and must be estimated from the demonstrations $D$. Density estimation remains a challenging problem, especially when there are a limited number of samples and the data is high dimensional (Liu et al., 2007). Note that simply extracting the conditional $\pi(a|s)$ from an estimate of the joint $\rho_{\pi_E}(s, a)$ is an alternate way to do BC and does not resolve the compounding error problem (Ross et al., 2011a).
2. $\bar{H}(\rho_{\pi_\theta})$ is hard to maximize as $\rho_{\pi_\theta}$ is an implicit density. This challenge is similar to the difficulty of entropy regularizing generators (Mohamed & Lakshminarayanan, 2016; Belghazi et al., 2018; Dieng et al., 2019) for Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and most existing approaches (Dieng et al., 2019; Lee et al., 2019) use adversarial optimization.
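The decomposition in Eq. 2 can be checked numerically on a toy example. The sketch below assumes a discrete state-action space with 4 states and 3 actions and two arbitrary non-normalized densities that share the normalization factor $Z = 1/(1-\gamma)$; the densities are random stand-ins for $\rho_{\pi_\theta}$ and $\rho_{\pi_E}$, and the helper names are hypothetical.

```python
import numpy as np

def generalized_entropy(rho):
    """H(rho) = sum_x rho(x) * (-log rho(x)) for a possibly unnormalized rho."""
    return -np.sum(rho * np.log(rho))

def generalized_reverse_kl(rho_theta, rho_E):
    """D_KL(rho_theta || rho_E) = sum_x rho_theta(x) log(rho_theta(x) / rho_E(x))."""
    return np.sum(rho_theta * (np.log(rho_theta) - np.log(rho_E)))

rng = np.random.default_rng(0)
gamma = 0.9
Z = 1.0 / (1.0 - gamma)                       # shared normalization factor

# Random stand-ins for the imitator and expert occupancy measures.
rho_theta = Z * rng.dirichlet(np.ones(12)).reshape(4, 3)
rho_E = Z * rng.dirichlet(np.ones(12)).reshape(4, 3)

# J(pi_theta, r = log rho_E): generalized expectation of the log-density reward.
J_term = np.sum(rho_theta * np.log(rho_E))

neg_kl = -generalized_reverse_kl(rho_theta, rho_E)
rhs = J_term + generalized_entropy(rho_theta)  # right-hand side of Eq. 2
```

Both sides agree up to floating point error, and `neg_kl` is non-positive because the two densities have equal total mass.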

2.1. ESTIMATING THE EXPERT OCCUPANCY MEASURE

We seek to learn a parameterized density model $q_\phi(s, a)$ of $\rho_{\pi_E}$ from samples. We consider two canonical families of density models: Autoregressive models and Energy-based models (EBMs).
Autoregressive Models (Germain et al., 2015; Papamakarios et al., 2017): An autoregressive model $q_\phi(x)$ for $x = (s, a)$ learns a factorized distribution of the form $q_\phi(x) = \prod_i q_{\phi_i}(x_i | x_{<i})$. For instance, each factor $q_{\phi_i}$ could be a mapping from $x_{<i}$ to a Gaussian density over $x_i$. When given a prior over the true dependency structure of $\{x_i\}$, this can be incorporated by refactoring the model. Autoregressive models are typically trained via Maximum Likelihood Estimation (MLE).
Energy-based Models (EBM) (Du & Mordatch, 2018; Song et al., 2019): Let $E_\phi : S \times A \to \mathbb{R}$ be an energy function. An energy-based model is a parameterized Boltzmann distribution of the form $q_\phi(s, a) = \frac{1}{Z(\phi)} e^{-E_\phi(s, a)}$, where $Z(\phi) = \int_{S \times A} e^{-E_\phi(s, a)}\, ds\, da$ denotes the partition function. Energy-based models are desirable for high dimensional density estimation due to their expressivity, but are typically difficult to train due to the intractability of computing the partition function. However, our IL objective in Eq. 1 conveniently only requires a non-normalized density estimate, as policy optimality is invariant to constant shifts in the reward. Thus, we opted to perform non-normalized density estimation with EBMs using score matching, which allows us to directly learn $E_\phi$ without having to estimate $Z(\phi)$.
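As an illustration of the autoregressive factorization, here is a minimal sketch in which each conditional is a Gaussian whose mean is a linear function of the preceding coordinates. This linear-Gaussian parameterization is an assumption chosen for brevity (it is not MADE and not the model used in the paper); the names `W`, `b`, and `std` are hypothetical parameters.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at x."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def autoregressive_logpdf(x, W, b, std):
    """log q(x) = sum_i log N(x_i; W[i, :i] @ x[:i] + b[i], std[i]^2).

    Only the strictly lower-triangular part of W is used, so the i-th
    conditional depends on x_{<i} alone.
    """
    total = 0.0
    for i in range(len(x)):
        mean_i = W[i, :i] @ x[:i] + b[i]      # empty dot product is 0 for i = 0
        total += gaussian_logpdf(x[i], mean_i, std[i])
    return total

# Two-dimensional example: x_0 ~ N(b0, s0^2), x_1 | x_0 ~ N(w * x_0 + b1, s1^2).
w, b0, b1, s0, s1 = 0.7, 0.2, -0.5, 1.3, 0.4
W = np.array([[0.0, 0.0], [w, 0.0]])
b = np.array([b0, b1])
std = np.array([s0, s1])
x = np.array([0.9, -0.1])

log_q = autoregressive_logpdf(x, W, b, std)
```

For this linear-Gaussian case the product of conditionals is itself a joint Gaussian, which gives a closed-form value to compare against. Because the IL objective is invariant to constant shifts in the reward, adding any constant to `log_q` would leave the optimal policy unchanged.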

2.2. MAXIMUM OCCUPANCY ENTROPY REINFORCEMENT LEARNING

In general, maximizing the entropy of implicit distributions is challenging because there is no analytic form for the density function. Prior works have proposed using adversarial methods involving noise injection (Dieng et al., 2019) and fictitious play (Brown, 1951; Lee et al., 2019). We instead propose to maximize a novel lower bound to the additive inverse of an occupancy divergence, which we prove is equivalent to maximizing a non-adversarial model-free RL objective. We first make clear the assumptions on the MDPs considered henceforth.
Assumption 1 All considered MDPs have deterministic dynamics governed by a transition function $P : S \times A \to S$. Furthermore, $P$ is injective with respect to $a \in A$, i.e. $\forall s, a, a'$ it holds that $a \neq a' \Rightarrow P(s, a) \neq P(s, a')$.
We note that Assumption 1 holds for most continuous robotics and physics environments, as they are deterministic and inverse dynamics functions $P^{-1} : S \times S \to A$ have been successfully used in benchmark RL environments such as Mujoco (Todorov et al., 2012; Todorov, 2014) and Atari (Pathak et al., 2017). Next we introduce a crucial ingredient in deriving our occupancy entropy lower bound, which is a tractable lower bound to Mutual Information (MI) first proposed by Nguyen, Wainwright, and Jordan (Nguyen et al., 2010), also known as the f-GAN KL (Nowozin et al., 2016) and MINE-f (Belghazi et al., 2018). For random variables $X, Y$ distributed according to $p_{\theta_{xy}}(x, y), p_{\theta_x}(x), p_{\theta_y}(y)$ where $\theta = (\theta_{xy}, \theta_x, \theta_y)$, and any critic function $f : X \times Y \to \mathbb{R}$, it holds that $I(X; Y|\theta) \geq I^f_{NWJ}(X; Y|\theta)$ where
$$I^f_{NWJ}(X; Y|\theta) := \mathbb{E}_{p_{\theta_{xy}}}[f(x, y)] - e^{-1}\, \mathbb{E}_{p_{\theta_x}}[\mathbb{E}_{p_{\theta_y}}[e^{f(x, y)}]]$$
This bound is tight when $f$ is chosen to be the optimal critic $f^*(x, y) = \log \frac{p_{\theta_{xy}}(x, y)}{p_{\theta_x}(x)\, p_{\theta_y}(y)} + 1$. We are now ready to state a lower bound to the occupancy entropy.
Theorem 1 Let MDP $M$ satisfy Assumption 1 (App. A). For any critic $f : S \times S \to \mathbb{R}$, it holds that $\bar{H}(\rho_{\pi_\theta}) \geq \bar{H}^f(\rho_{\pi_\theta})$ where
$$\bar{H}^f(\rho_{\pi_\theta}) := H(s_0) + (1 + \gamma)\bar{H}(\pi_\theta) + \gamma \sum_{t=0}^{\infty} \gamma^t I^f_{NWJ}(s_{t+1}; s_t|\theta)$$
Henceforth we refer to $\bar{H}^f(\rho_{\pi_\theta})$ as the SAELBO. Policy entropy alone also yields a bound, $\bar{H}(\pi_\theta) \leq \bar{H}(\rho_{\pi_\theta})$, but this bound has more slack and is limited to discrete state-spaces (see Appendix A for details).
Since occupancy entropy maximization is also a desirable exploration strategy in sparse environments (Hazan et al., 2019; Lee et al., 2019), another interpretation of the SAELBO is as a surrogate objective for state-action level exploration. Furthermore, we posit that maximizing the SAELBO is more effective for state-action level exploration, i.e. occupancy entropy maximization, than solely maximizing policy entropy. This is because, in discrete state-spaces, the SAELBO is a tighter lower bound to occupancy entropy than policy entropy, i.e. $\bar{H}(\pi_\theta) \leq \bar{H}^f(\rho_{\pi_\theta}) \leq \bar{H}(\rho_{\pi_\theta})$, and in continuous state-spaces, where Assumption 1 holds, the SAELBO is still a lower bound while policy entropy alone is neither a lower nor an upper bound to occupancy entropy. Please see Appendix C.1 for experiments that show how SAELBO maximization can improve state-action level exploration over just policy entropy maximization. Next, we show that the gradient of the SAELBO is equivalent to the gradient of a model-free RL objective.
Theorem 2 Let $q_\pi(a|s)$ and $\{q_t(s)\}_{t \geq 0}$ be probability densities such that $\forall s, a \in S \times A$ they satisfy $q_\pi(a|s) = \pi_\theta(a|s)$ and $q_t(s) = p_{\theta,t}(s)$. Then for all $f : S \times S \to \mathbb{R}$,
$$\nabla_\theta \bar{H}^f(\rho_{\pi_\theta}) = \nabla_\theta J(\pi_\theta, \bar{r} = r_\pi + r_f)$$
where
$$r_\pi(s_t, a_t) = -(1 + \gamma) \log q_\pi(a_t|s_t) \tag{7}$$
$$r_f(s_t, a_t, s_{t+1}) = \gamma f(s_t, s_{t+1}) - \frac{\gamma}{e}\, \mathbb{E}_{\tilde{s}_t \sim q_t,\, \tilde{s}_{t+1} \sim q_{t+1}}\big[e^{f(\tilde{s}_t, s_{t+1})} + e^{f(s_t, \tilde{s}_{t+1})}\big] \tag{8}$$
See Appendix A.2 for the proof. Theorem 2 shows that maximizing the SAELBO is equivalent to maximizing a discounted model-free RL objective with the reward $r_\pi + r_f$, where $r_\pi$ contributes to maximizing $\bar{H}(\pi_\theta)$ and $r_f$ contributes to maximizing $\sum_{t=0}^{\infty} \gamma^t I^f_{NWJ}(s_{t+1}; s_t|\theta)$.
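As a sanity check on the NWJ bound that underlies Theorem 1, the sketch below evaluates it exactly on a small discrete joint distribution, where all expectations are finite sums. The joint table is random and purely illustrative; `nwj_bound` and `mutual_information` are hypothetical helper names. It shows that the bound holds for arbitrary critics and becomes an equality at the optimal critic $f^*$.

```python
import numpy as np

def marginals(p_xy):
    """Row and column marginals, kept 2-D so they broadcast against p_xy."""
    return p_xy.sum(axis=1, keepdims=True), p_xy.sum(axis=0, keepdims=True)

def mutual_information(p_xy):
    """I(X; Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )."""
    p_x, p_y = marginals(p_xy)
    return np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

def nwj_bound(p_xy, f):
    """I_NWJ^f = E_{p_xy}[f] - e^{-1} E_{p_x p_y}[e^f], with f given as a table f[x, y]."""
    p_x, p_y = marginals(p_xy)
    return np.sum(p_xy * f) - np.exp(-1.0) * np.sum(p_x * p_y * np.exp(f))

rng = np.random.default_rng(0)
p_xy = rng.dirichlet(np.ones(20)).reshape(4, 5)   # random joint over 4 x 5 outcomes

mi = mutual_information(p_xy)

# Optimal critic: log density ratio plus one.
p_x, p_y = marginals(p_xy)
f_star = np.log(p_xy / (p_x * p_y)) + 1.0
tight = nwj_bound(p_xy, f_star)

# Arbitrary critics give valid but looser lower bounds.
loose = [nwj_bound(p_xy, rng.normal(size=p_xy.shape)) for _ in range(100)]
```

In continuous state-spaces the two expectations would instead be estimated from samples of consecutive states and of independently drawn states, respectively.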
Note that evaluating $r_f$ entails estimating expectations with respect to $q_t, q_{t+1}$. This can be accomplished by rolling out multiple trajectories with the current policy and collecting the states from time-steps $t, t+1$. Alternatively, if we assume that the policy is changing slowly, we can simply take samples of states from time-steps $t, t+1$ from the replay buffer. Combining the results of Theorems 1 and 2, we end the section with a lower bound on the original distribution matching objective from Eq. 1 and show that maximizing this lower bound is, again, equivalent to maximizing a model-free RL objective.
Corollary 1 Let MDP $M$ satisfy Assumption 1 (App. A). For any critic $f : S \times S \to \mathbb{R}$, it holds that
$$-\bar{D}_{KL}(\rho_{\pi_\theta} \| \rho_{\pi_E}) \geq J(\pi_\theta, \bar{r} = \log \rho_{\pi_E}) + \bar{H}^f(\rho_{\pi_\theta}) \tag{9}$$
Furthermore, let $r_\pi, r_f$ be defined as in Theorem 2. Then,
$$\nabla_\theta \big( J(\pi_\theta, \bar{r} = \log \rho_{\pi_E}) + \bar{H}^f(\rho_{\pi_\theta}) \big) = \nabla_\theta J(\pi_\theta, \bar{r} = \log \rho_{\pi_E} + r_\pi + r_f) \tag{10}$$
In the following section we derive a practical distribution matching IL algorithm combining all the ingredients from this section.
Algorithm (NDI, MaxOccEntRL phase). Repeat:
1. Collect $(s_t, a_t, s_{t+1}, \bar{r}) \sim \pi_\theta$ and add to replay buffer $B$, where $\bar{r} = \log q_\phi + \lambda_\pi r_\pi + \lambda_f r_f$,
$$r_\pi(s_t, a_t) = -(1 + \gamma) \log \pi_\theta(a_t|s_t)$$
$$r_f(s_t, a_t, s_{t+1}) = \gamma f(s_t, s_{t+1}) - \frac{\gamma}{e}\, \mathbb{E}_{\tilde{s}_t \sim B_t,\, \tilde{s}_{t+1} \sim B_{t+1}}\big[e^{f(\tilde{s}_{t+1}, s_t)} + e^{f(s_{t+1}, \tilde{s}_t)}\big]$$
and the critic is computed by
$$f(s_{t+1}, s_t) = \log \frac{e^{-\|s_{t+1} - s_t\|_2^2}}{\mathbb{E}_{B_t, B_{t+1}}\big[e^{-\|s_{t+1} - s_t\|_2^2}\big]} + 1$$
2. Update $\pi_\theta$ using Soft Actor-Critic (SAC) (Haarnoja et al., 2018).

3. NEURAL DENSITY IMITATION (NDI)

From the previous section's results, we propose Neural Density Imitation (NDI), which works in two phases.
Phase 1: Density estimation. We leverage Autoregressive models and EBMs for density estimation of the expert's occupancy measure $\rho_{\pi_E}$ from samples. As in (Ho & Ermon, 2016; Fu et al., 2017), we take the state-action pairs in the demonstration set $D = \{(s, a)_i\}_{i=1}^{N} \sim \pi_E$ to approximate samples from $\rho_{\pi_E}$ and fit $q_\phi$ on $D$.
For Autoregressive models, we use Masked Autoencoders for Density Estimation (MADE) (Germain et al., 2015), where the entire collection of conditional density models $\{q_{\phi_i}\}$ is parameterized by a single masked autoencoder network. Specifically, we use a Gaussian mixture variant (Papamakarios et al., 2017) of MADE where each of the conditionals $q_{\phi_i}$ maps inputs $x_{<i}$ to the mean and covariance of a Gaussian mixture distribution over $x_i$. The MADE model is trained via Maximum Likelihood Estimation. With EBMs, we perform non-normalized log density estimation and thus directly parameterize the energy function $E_\phi$ with neural networks, since $\log q_\phi = -E_\phi - \log Z(\phi)$ and the $\log Z(\phi)$ term is constant in $(s, a)$. We use Sliced Score Matching (Song et al., 2019) to train the EBM.
Phase 2: MaxOccEntRL. After we have acquired a log density estimate $\log q_\phi$ from the previous phase, we perform RL with entropy regularization on the occupancy measure. Inspired by Corollary 1, we propose the following RL objective
$$\max_\theta J(\pi_\theta, \bar{r} = \log q_\phi + \lambda_\pi r_\pi + \lambda_f r_f) \tag{11}$$
where $\lambda_\pi, \lambda_f > 0$ are weights introduced to control the influence of the occupancy entropy regularization. In practice, Eq. 11 can be maximized using any RL algorithm by simply setting the reward function to be $\bar{r}$ from Eq. 11. In this work, we use Soft Actor-Critic (SAC) (Haarnoja et al., 2018). Note that SAC already includes a policy entropy bonus, so we do not separately include one. For our critic $f$, we fix it to be a normalized RBF kernel for simplicity,
$$f(s_{t+1}, s_t) = \log \frac{e^{-\|s_{t+1} - s_t\|_2^2}}{\mathbb{E}_{q_t, q_{t+1}}\big[e^{-\|s_{t+1} - s_t\|_2^2}\big]} + 1 \tag{12}$$
but future works could explore learning the critic to match the optimal critic. While simple, our choice of $f$ emulates two important properties of the optimal critic $f^*(x, y) = \log \frac{p(x|y)}{p(x)} + 1$:
(1) It follows the same "form" of a log-density ratio plus a constant.
(2) Consecutively sampled states from the joint, i.e. $s_t, s_{t+1} \sim p_{\theta,t:t+1}$, have high value under our $f$ since they are likely to be close to each other under smooth dynamics, while samples from the marginals $s_t, s_{t+1} \sim q_t, q_{t+1}$ are likely to have lower value under $f$ since they can be arbitrarily different states.
To estimate the expectations with respect to $q_t, q_{t+1}$ in Eq. 8, we simply take samples of previously visited states at time $t, t+1$ from the replay buffer.
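The reward computation in Phase 2 can be sketched as follows. This is a minimal NumPy illustration under several assumptions: the arrays `buffer_t` and `buffer_tp1` stand in for states stored in the replay buffer at time-steps $t$ and $t+1$, the log density `log_q` and policy log-probability `log_pi` are supplied as plain numbers rather than produced by trained models, and all function names are hypothetical. It is not the authors' implementation, and when SAC supplies the policy entropy bonus itself the $r_\pi$ term would be omitted from the reward.

```python
import numpy as np

def rbf_log_kernel(s_next, s):
    """log k(s', s) = -||s' - s||_2^2, broadcasting over leading dimensions."""
    return -np.sum((s_next - s) ** 2, axis=-1)

def make_critic(buffer_t, buffer_tp1):
    """Normalized RBF critic f(s', s) = log(k(s', s) / E[k]) + 1, as in Eq. 12.

    E[k] is estimated over all pairs drawn independently from the two
    buffers, i.e. from the product of the marginals.
    """
    log_k = rbf_log_kernel(buffer_tp1[:, None, :], buffer_t[None, :, :])
    log_norm = np.log(np.mean(np.exp(log_k)))
    return lambda s_next, s: rbf_log_kernel(s_next, s) - log_norm + 1.0

def mi_reward(s_t, s_tp1, f, buffer_t, buffer_tp1, gamma):
    """r_f: gamma * f on the observed transition minus the marginal terms."""
    joint_term = f(s_tp1, s_t)
    marg_t = np.mean(np.exp(f(s_tp1[None, :], buffer_t)))      # resample s_t
    marg_tp1 = np.mean(np.exp(f(buffer_tp1, s_t[None, :])))    # resample s_{t+1}
    return gamma * joint_term - (gamma / np.e) * (marg_t + marg_tp1)

def ndi_reward(log_q, log_pi, r_f, gamma, lam_pi, lam_f):
    """Augmented reward log q + lam_pi * r_pi + lam_f * r_f."""
    r_pi = -(1.0 + gamma) * log_pi
    return log_q + lam_pi * r_pi + lam_f * r_f

# Synthetic buffers: next states are small perturbations of current states.
rng = np.random.default_rng(0)
buffer_t = rng.normal(size=(64, 3))
buffer_tp1 = buffer_t + 0.1 * rng.normal(size=(64, 3))
f = make_critic(buffer_t, buffer_tp1)

gamma = 0.99
s_t = buffer_t[0]
r_near = mi_reward(s_t, s_t + 0.05, f, buffer_t, buffer_tp1, gamma)  # smooth step
r_far = mi_reward(s_t, s_t + 5.0, f, buffer_t, buffer_tp1, gamma)    # large jump

# Placeholder values for log q(s, a) and log pi(a|s).
reward = ndi_reward(log_q=-1.2, log_pi=-0.7, r_f=r_near,
                    gamma=gamma, lam_pi=0.2, lam_f=0.005)
```

With this critic a transition between nearby states receives a larger MI reward than a large jump, which matches property (2) above.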

4. TRADE-OFFS BETWEEN DISTRIBUTION MATCHING IL ALGORITHMS

Adversarial Imitation Learning (AIL) methods find a policy that maximizes an upper bound to the additive inverse of an f-divergence between the expert and imitator occupancies (Ghasemipour et al., 2019; Ke et al., 2020). For example, if the f-divergence is reverse KL, then for any $D : S \times A \to \mathbb{R}$,
$$\max_{\pi_\theta} -\bar{D}_{KL}(\rho_{\pi_\theta} \| \rho_{\pi_E}) \leq \max_{\pi_\theta} \log \mathbb{E}_{\pi_E}[e^{D(s, a)}] - \mathbb{E}_{\pi_\theta}[D(s, a)]$$
where the bound is tight at $D(s, a) = \log \frac{\rho_{\pi_\theta}(s, a)}{\rho_{\pi_E}(s, a)} + C$ for any constant $C$. AIL alternates between
$$\min_D \log \mathbb{E}_{\pi_E}[e^{D(s, a)}] - \mathbb{E}_{\pi_\theta}[D(s, a)], \qquad \max_{\pi_\theta} -\mathbb{E}_{\pi_\theta}[D(s, a)]$$
The discriminator update step in AIL minimizes the upper bound with respect to $D$, tightening the estimate of reverse KL, and the policy update step maximizes the tightened bound. We thus see that by using an upper bound, AIL inevitably ends up with alternating min-max optimization where policy and discriminator updates act in opposing directions. The key issue with such adversarial optimization lies not in coordinate descent itself, but in its application to a min-max objective, which is widely known to give rise to optimization instability (Salimans et al., 2016).
The key insight of NDI is to instead derive an objective that lower bounds the additive inverse of reverse KL. Recall from Eq. 9 that NDI maximizes the lower bound with the SAELBO $\bar{H}^f(\rho_{\pi_\theta})$:
$$\max_{\pi_\theta} -\bar{D}_{KL}(\rho_{\pi_\theta} \| \rho_{\pi_E}) \geq \max_{\pi_\theta} J(\pi_\theta, \bar{r} = \log \rho_{\pi_E}) + \bar{H}^f(\rho_{\pi_\theta})$$
Unlike the AIL upper bound, this lower bound is not tight. With critic $f$ updates, NDI alternates
$$\max_f \sum_{t=0}^{\infty} \gamma^t I^f_{NWJ}(s_{t+1}; s_t|\theta), \qquad \max_{\pi_\theta} J(\pi_\theta, \bar{r} = \log \rho_{\pi_E}) + (1 + \gamma)\bar{H}(\pi_\theta) + \gamma \sum_{t=0}^{\infty} \gamma^t I^f_{NWJ}(s_{t+1}; s_t|\theta)$$
The critic update step in NDI maximizes the lower bound with respect to $f$, tightening the estimate of reverse KL, and the policy update step maximizes the tightened bound. In other words, for AIL, the policy $\pi_\theta$ and discriminator $D$ seek to push the upper bound in opposing directions, while in NDI the policy $\pi_\theta$ and critic $f$ push the lower bound in the same direction.
Unlike AIL, NDI does not perform alternating min-max but instead alternating max-max! While NDI enjoys non-adversarial optimization, it comes at the cost of having to use a non-tight lower bound to the occupancy divergence. On the other hand, AIL optimizes a tight upper bound at the cost of unstable alternating min-max optimization. Support matching IL algorithms also avoid min-max, but their objective is neither an upper nor a lower bound to the occupancy divergence. Table 1 summarizes the trade-offs between different families of algorithms for distribution matching IL.
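The AIL upper bound discussed in this section can be verified on a toy discrete example. The sketch below assumes normalized distributions over 12 state-action pairs as stand-ins for the two occupancy measures (a simplification of the generalized, non-normalized setting in the text); the function names are hypothetical. It checks that the bound holds for arbitrary discriminators and is tight at $D = \log(\rho_{\pi_\theta}/\rho_{\pi_E}) + C$.

```python
import numpy as np

def reverse_kl(rho_theta, rho_E):
    """KL(rho_theta || rho_E) for normalized discrete distributions."""
    return np.sum(rho_theta * np.log(rho_theta / rho_E))

def ail_upper_bound(rho_theta, rho_E, D):
    """log E_{rho_E}[e^D] - E_{rho_theta}[D], an upper bound on -KL(rho_theta || rho_E)."""
    return np.log(np.sum(rho_E * np.exp(D))) - np.sum(rho_theta * D)

rng = np.random.default_rng(0)
rho_theta = rng.dirichlet(np.ones(12))    # stand-in for the imitator occupancy
rho_E = rng.dirichlet(np.ones(12))        # stand-in for the expert occupancy

neg_kl = -reverse_kl(rho_theta, rho_E)

# Arbitrary discriminators: the bound always sits above -KL.
random_bounds = [ail_upper_bound(rho_theta, rho_E, rng.normal(size=12))
                 for _ in range(100)]

# Optimal discriminator (any constant shift C leaves the bound unchanged).
C = 3.0
D_star = np.log(rho_theta / rho_E) + C
tight_bound = ail_upper_bound(rho_theta, rho_E, D_star)
```

Minimizing over the discriminator tightens this bound from above, whereas the NWJ-based SAELBO is tightened from below by maximizing over the critic, which is the min-max versus max-max distinction drawn in the text.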

5. RELATED WORKS

Prior literature on Imitation learning (IL) in the absence of an interactive expert revolves around Behavioral Cloning (BC) (Pomerleau, 1991; Wu et al., 2019) , distribution matching IL (Ho & Ermon, 2016; Song et al., 2018; Ke et al., 2020; Ghasemipour et al., 2019; Kostrikov et al., 2020; Kim et al., 2019) , and Inverse Reinforcement Learning (Fu et al., 2017; Uchibe, 2018; Brown et al., 2019) . Many approaches in the latter category minimize statistical divergences using adversarial methods to solve a min-max optimization problem, alternating between reward (discriminator) and policy (generator) updates. ValueDICE, a more recently proposed adversarial IL approach, formulates reverse KL divergence into a completely off-policy objective thereby greatly reducing the number of environment interactions. A key issue with such Adversarial Imitation Learning (AIL) approaches is optimization instability (Miyato et al., 2018; Jin et al., 2019) . Recent works have sought to avoid adversarial optimization by instead performing RL with a heuristic reward function that estimates the support of the expert occupancy measure. Random Expert Distillation (RED) (Wang et al., 2019) and Disagreement-regularized IL (Brantley et al., 2020) are two representative approaches in this family. A key limitation of these approaches is that support estimation is insufficient to recover the expert policy and thus they require an additional behavioral cloning step. Unlike AIL, we maximize a non-adversarial RL objective and unlike heuristic reward approaches, our objective provably lower bounds reverse KL between occupancy measures of the expert and imitator. Density estimation with deep neural networks is an active research area, and much progress has been made towards modeling high-dimensional structured data like images and audio. 
Most successful approaches parameterize a normalized probability model and estimate it with maximum likelihood, e.g., autoregressive models (Uria et al., 2013; 2016; Germain et al., 2015; van den Oord et al., 2016) and normalizing flow models (Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018). Some other methods explore estimating non-normalized probability models with MCMC (Du & Mordatch, 2019; Yu et al., 2020) or training with alternative statistical divergences such as score matching (Hyvärinen, 2005; Song et al., 2019; Song & Ermon, 2019) and noise contrastive estimation (Gutmann & Hyvärinen, 2010; Gao et al., 2019). Related to MaxOccEntRL, recent works (Lee et al., 2019; Hazan et al., 2019; Islam et al., 2019) on exploration in RL have investigated state-marginal occupancy entropy maximization. To do so, Hazan et al. (2019) require access to a robust planning oracle, while Lee et al. (2019) use fictitious play, an alternative adversarial algorithm that is guaranteed to converge. Unlike these works, our approach maximizes the SAELBO, which requires neither a planning oracle nor min-max optimization, and is trivial to implement with existing RL algorithms.

6. EXPERIMENTS

Environment: Following prior work, we run experiments on benchmark Mujoco (Todorov et al., 2012; Brockman et al., 2016) tasks: Hopper (11, 3), HalfCheetah (17, 6), Walker (17, 6), Ant (111, 8), and Humanoid (376, 17), where the (observation, action) dimensions are noted in parentheses.
Pipeline: We train expert policies using SAC (Haarnoja et al., 2018). All of our results are averaged across five random seeds, where for each seed we randomly sample a trajectory from an expert, perform density estimation, and then run MaxOccEntRL. Performance for each seed is averaged across 50 trajectories. For each seed we save the best imitator as measured by our augmented reward $\bar{r}$ from Eq. 11 and report its performance with respect to the ground truth reward. We do not perform sparse subsampling on the data as in (Ho & Ermon, 2016), since real world demonstration data typically are not subsampled to such an extent and using full trajectories was sufficient to compare performance.
Architecture: We experiment with two variants of our method, NDI+MADE and NDI+EBM, where the only difference lies in the density model. Across all experiments, our density model $q_\phi$ is a two-layer MLP with 256 hidden units. For hyperparameters related to the MaxOccEntRL step, $\lambda_\pi = 0.2$ is fixed; for $\lambda_f$ see Section 6.3. For full details on architecture see Appendix B.
Baselines: We compare our method against the following baselines:
(1) Behavioral Cloning (BC) (Pomerleau, 1991): learns a policy via direct supervised learning on $D$.
(2) Random Expert Distillation (RED) (Wang et al., 2019): estimates the support of the expert policy using a predictor and target network (Burda et al., 2018), followed by RL using this heuristic reward.
(3) Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016): on-policy adversarial IL method which alternates reward and policy updates.
(4) ValueDICE (Kostrikov et al., 2020): current state-of-the-art adversarial IL method that works off-policy.
See Appendix B for baseline implementation details.
Figure 1: We randomly sample test states $s$ and multiple test actions $a_s$ per test state, both from a uniform distribution, then visualize the log marginal $\log q_\phi(s) = \log \sum_{a_s} q_\phi(s, a_s)$ projected onto two state dimensions: one corresponding to forward velocity and the other a random selection. Much like the true reward function in Mujoco environments, we found that the log marginal positively correlates with forward velocity on 4/5 tasks.
[...] marks when provided one demonstration and outperforms all baselines on all Mujoco benchmarks. NDI+MADE achieves expert level performance on 4/5 tasks but fails on Ant. We found spurious modes in the density learned by MADE for Ant, and the RL algorithm was converging to these local maxima. We found that baselines are commonly unable to solve Humanoid with one demonstration (the most difficult task considered). RED is unable to perform well on all tasks without pretraining with BC as done in Wang et al. (2019). For fair comparisons with methods that do not use pretraining, we also do not use pretraining for RED. See Appendix C.4 for results with a BC pretraining step added to all algorithms. GAIL and ValueDICE perform comparably with each other, both outperforming behavioral cloning. We note that these results are somewhat unsurprising given that ValueDICE (Kostrikov et al., 2020) did not claim to improve demonstration efficiency over GAIL (Ho & Ermon, 2016), but rather focused on reducing the number of environment interactions. Both methods notably under-perform the expert on Ant-v3 and Humanoid-v3, which have the largest state-action spaces. Although minimizing the number of environment interactions was not a targeted goal of this work, we found that NDI requires roughly an order of magnitude fewer environment interactions than GAIL. Please see Appendix C.5 for full environment sample complexity comparisons.

6.2. DENSITY EVALUATION

In this section, we examine the learned density model $q_\phi$ for NDI+EBM and show that it highly correlates with the true Mujoco rewards, which are linear functions of forward velocity. We randomly sample test states $s$ and multiple test actions $a_s$ per test state, both from a uniform distribution with boundaries at the minimum/maximum state-action values in the demonstration set. We then visualize the log marginal $\log q_\phi(s) = \log \sum_{a_s} q_\phi(s, a_s)$ projected onto two state dimensions: one corresponding to the forward velocity of the robot and the other a random selection, e.g. the knee joint angle. Each point in Figure 1 corresponds to a projection of a sampled test state $s$, and the colors scale with the value of $\log q_\phi(s)$. For all environments besides Humanoid, we found that the density estimate positively correlates with velocity even on uniformly drawn state-actions which were not contained in the demonstrations. We found similar correlations for Humanoid on states in the demonstration set. Intuitively, a good density estimate should indeed have such correlations, since the true expert occupancy measure should positively correlate with forward velocity due to the expert attempting to consistently maintain high velocity.
In discrete state-spaces, the SAELBO is a tighter lower bound to occupancy entropy than policy entropy, i.e. $\bar{H}(\pi_\theta) \leq \bar{H}^f(\rho_{\pi_\theta}) \leq \bar{H}(\rho_{\pi_\theta})$, and in continuous state-spaces, where Assumption 1 holds, the SAELBO is still a lower bound while policy entropy alone is neither a lower nor an upper bound to occupancy entropy. As an artifact, we found that SAELBO maximization ($\lambda_f > 0$) leads to better occupancy distribution matching than sole policy entropy maximization ($\lambda_f = 0$). Table 3 shows the effect of varying $\lambda_f$ on task (reward) and imitation performance (KL), i.e. similarities between $\pi, \pi_E$ measured as $\mathbb{E}_{s \sim \pi}[D_{KL}(\pi(\cdot|s) \| \pi_E(\cdot|s))]$. Setting $\lambda_f$ too large ($\geq 0.1$) hurts both task and imitation performance as the MI reward $r_f$ dominates the RL objective.
Setting $\lambda_f$ too small ($\le 0.0001$), i.e. only maximizing policy entropy $H(\pi_\theta)$, turns out to benefit task performance, sometimes enabling the imitator to outperform the expert by concentrating most of its trajectory probability mass on the mode of the expert's trajectory distribution. However, the boosted task performance comes at the cost of suboptimal imitation performance, e.g. the imitator cheetah running faster than the expert. We found that a middle point of $\lambda_f = 0.005$ simultaneously achieves expert-level task performance and good imitation performance. In summary, these results show that maximizing the SAELBO $H^f$ ($\lambda_f > 0$) improves distribution matching between $\pi$ and $\pi_E$ over maximizing policy entropy $H(\pi_\theta)$ ($\lambda_f = 0$), but distribution matching may not be ideal for task performance maximization, e.g. in apprenticeship learning settings. See Appendices C.1 and C.3 for extended ablation studies.
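The imitation metric used in this ablation, $\mathbb{E}_{s \sim \pi}[D_{KL}(\pi(\cdot|s) \,\|\, \pi_E(\cdot|s))]$, has a closed form when both policies are conditional Gaussians. The following is a minimal sketch of that computation, assuming diagonal covariances. The function names and the policy interface (a callable returning per-state means and standard deviations) are hypothetical and are not taken from the NDI implementation.

```python
import numpy as np


def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2))), summed over action dims."""
    var_p, var_q = std_p ** 2, std_q ** 2
    per_dim = (
        np.log(std_q / std_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )
    return per_dim.sum(axis=-1)


def imitation_kl(policy, expert, states):
    """Average KL(policy || expert) over a batch of visited states.

    `policy` and `expert` are assumed to map a (N, state_dim) array to a
    (mean, std) pair of (N, action_dim) arrays; this interface is illustrative.
    """
    mu_p, std_p = policy(states)
    mu_q, std_q = expert(states)
    return diag_gaussian_kl(mu_p, std_p, mu_q, std_q).mean()


def normalized_imitation_kl(policy, random_policy, expert, states, random_states):
    """Divide by the average KL between a random policy and the expert."""
    return imitation_kl(policy, expert, states) / imitation_kl(
        random_policy, expert, random_states
    )
```

In this sketch the states passed to `imitation_kl` would be collected by rolling out the imitator in the environment, so the expectation is taken under the imitator's own state distribution.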

7. DISCUSSION AND OUTLOOK

This work's main contribution is a new principled framework for IL and an algorithm that obtains state-of-the-art demonstration efficiency. One future direction is to apply NDI to harder visual IL tasks, for which AIL is known to perform poorly. While the focus of this work is to improve demonstration efficiency, another important IL performance metric is environment sample complexity. Future work could explore combining off-policy RL or model-based RL with NDI to improve on this front. Finally, there is a rich space of questions to answer regarding the effectiveness of the SAELBO reward $r_f$. We posit that, for example, in video game environments $r_f$ may be crucial for success, since state-action entropy maximization has been shown to be far more effective than policy entropy maximization (Burda et al., 2018). Furthermore, one could improve the tightness of the SAELBO by incorporating negative samples (Van Den Oord et al., 2018) and learning the critic function $f$ so that it is close to the optimal critic.



We assume only samples can be taken from the environment dynamics and its density is unknown.

Probability models that have potentially intractable density functions, but can be sampled from to estimate expectations and gradients of expectations with respect to model parameters (Huszár, 2017).



Neural Density Imitation (NDI)
Require: Demonstrations $D \sim \pi_E$, reward weights $\lambda_\pi, \lambda_f$, fixed critic $f$
Phase 1. Density estimation: learn $q(s, a)$ from $D$ using MADE or EBMs
Phase 2. MaxOccEntRL: for $k = 1, 2, \ldots$ do
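As a rough illustration of how the two phases connect, the sketch below fits a density to expert state-action pairs and then exposes its log-density as a reward signal for a downstream RL algorithm. It rests on several assumptions: a diagonal Gaussian is used as a simple stand-in for the MADE or EBM density models, the reward is taken to be the log-density, the entropy-related terms weighted by the reward weights are omitted, and all function names are hypothetical.

```python
import numpy as np


def fit_gaussian_density(demo_sa):
    """Phase 1 stand-in: fit a diagonal Gaussian to expert (state, action) rows.

    A real run would train a MADE or EBM model here; the Gaussian is only a
    placeholder that keeps the example self-contained.
    """
    mean = demo_sa.mean(axis=0)
    std = demo_sa.std(axis=0) + 1e-6  # avoid division by zero

    def log_q(sa):
        z = (sa - mean) / std
        return (-0.5 * z ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)).sum(axis=-1)

    return log_q


def density_reward(log_q, state, action):
    """Phase 2 reward term (assumed form): log-density of the state-action pair."""
    return log_q(np.concatenate([state, action], axis=-1))


# Toy demonstrations clustered around 1.0 in every state-action dimension.
rng = np.random.default_rng(0)
demo_sa = rng.normal(loc=1.0, scale=0.1, size=(200, 3))
log_q = fit_gaussian_density(demo_sa)
```

With this toy setup, a state-action pair close to the demonstrations receives a higher reward than one far from them, which is the behavior an RL agent would then exploit.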

Figure 1: Learned density visualization. We randomly sample test states $s$ and multiple test actions $a_s$ per test state, both from a uniform distribution, then visualize the log marginal $\log q(s) = \log \sum_{a_s} q(s, a_s)$ projected onto two state dimensions: one corresponding to forward velocity and the other a random selection. Much like the true reward function in MuJoCo environments, we found that the log marginal positively correlates with forward velocity on 4/5 tasks.
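The sampling and marginalization steps described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_log_q` is a made-up density whose value grows with the first state dimension (standing in for forward velocity), and the helper names, the number of sampled actions, and the seeds are all hypothetical choices rather than settings from the paper.

```python
import numpy as np


def sample_uniform_states(demo_states, n, seed=0):
    """Draw n states uniformly within the per-dimension min/max of the demos."""
    rng = np.random.default_rng(seed)
    low, high = demo_states.min(axis=0), demo_states.max(axis=0)
    return rng.uniform(low, high, size=(n, demo_states.shape[1]))


def log_marginal(log_q, states, a_low, a_high, n_actions=32, seed=0):
    """Approximate log q(s) = log sum_{a_s} q(s, a_s) with uniformly sampled actions."""
    rng = np.random.default_rng(seed)
    a_low = np.asarray(a_low, dtype=float)
    a_high = np.asarray(a_high, dtype=float)
    out = np.empty(len(states))
    for i, s in enumerate(states):
        actions = rng.uniform(a_low, a_high, size=(n_actions, a_low.shape[0]))
        s_rep = np.tile(s, (n_actions, 1))
        vals = log_q(s_rep, actions)  # shape (n_actions,)
        m = vals.max()
        out[i] = m + np.log(np.exp(vals - m).sum())  # stable log-sum-exp
    return out


def toy_log_q(s, a):
    """Hypothetical log-density: increases with s[:, 0], penalizes large actions."""
    return s[:, 0] - 0.5 * (a ** 2).sum(axis=1)
```

Plotting the returned values against the velocity dimension and one other state dimension, colored by the log marginal, would give a scatter plot of the kind shown in the figure.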

See Appendix A.1 for the proof and a discussion of the bound tightness. From here onwards, we refer to $H^f(\rho_{\pi_\theta})$ from Theorem 1 as the State-Action Entropy Lower Bound (SAELBO). The SAELBO mainly decomposes into policy entropy $H(\pi_\theta)$ and Mutual Information (MI) between consecutive states $I^f_{NWJ}(s_{t+1}; s_t \mid \theta)$. When Assumption 1 does not hold, we may still obtain a SAELBO with only the policy entropy term, i.e. $H^f(\rho_{\pi_\theta})$

Comparison between different families of distribution matching IL algorithms

compares the ground truth reward acquired by agents trained with various IL algorithms when one demonstration is provided by the expert (see Appendix C.2 for performance comparisons with varying demonstrations). NDI+EBM achieves expert-level performance on all MuJoCo benchmarks.

Task Performance when provided with one demonstration. NDI (orange rows) outperforms all baselines on all tasks. See Appendix C.2 for results with varying demonstrations.

Effect of varying the MI reward weight $\lambda_f$ on (1) task performance of NDI+EBM (top row) and (2) imitation performance of NDI+EBM (bottom row), measured as the average KL divergence between $\pi$ and $\pi_E$ on states $s$ sampled by running $\pi$ in the true environment, i.e. $\mathbb{E}_{s \sim \pi}[D_{KL}(\pi(\cdot|s) \,\|\, \pi_E(\cdot|s))]$, normalized by the average $D_{KL}$ between the random and expert policies. $D_{KL}(\pi \,\|\, \pi_E)$ can be computed analytically since $\pi$ and $\pi_E$ are conditional Gaussians. The density model $q$ is trained with one demonstration. Setting $\lambda_f$ too large hurts task performance, while setting it too small is suboptimal for matching the expert occupancy. A middle point of $\lambda_f = 0.005$ achieves a balance between the two metrics.

As intuited in Section 2.2, maximizing the SAELBO can be more effective for occupancy entropy maximization than solely maximizing policy entropy (see Appendix C.1 for experiments that support this). This is because in discrete state-spaces the SAELBO $H^f(\rho_{\pi_\theta})$ is a tighter lower bound to occupancy entropy $H(\rho_{\pi_\theta})$ than policy entropy $H(\pi_\theta)$, i.e. $H(\pi_\theta) \le H^f(\rho_{\pi_\theta}) \le H(\rho_{\pi_\theta})$.

