LEARNING FROM DEMONSTRATIONS WITH ENERGY BASED GENERATIVE ADVERSARIAL IMITATION LEARNING

Abstract

Traditional reinforcement learning methods usually deal with the tasks with explicit reward signals. However, for vast majority of cases, the environment wouldn't feedback a reward signal immediately. It turns out to be a bottleneck for modern reinforcement learning approaches to be applied into more realistic scenarios. Recently, inverse reinforcement learning (IRL) has made great progress in making full use of the expert demonstrations to recover the reward signal for reinforcement learning. And generative adversarial imitation learning is one promising approach. In this paper, we propose a new architecture for training generative adversarial imitation learning which is so called energy based generative adversarial imitation learning (EB-GAIL). It views the discriminator as an energy function that attributes low energies to the regions near the expert demonstrations and high energies to other regions. Therefore, a generator can be seen as a reinforcement learning procedure to sample trajectories with minimal energies (cost), while the discriminator is trained to assign high energies to these generated trajectories. In detail, EB-GAIL uses an auto-encoder architecture in place of the discriminator, with the energy being the reconstruction error. Theoretical analysis shows our EB-GAIL could match the occupancy measure with expert policy during the training process. Meanwhile, the experiments depict that EB-GAIL outperforms other SoTA methods while the training process for EB-GAIL can be more stable.

1. INTRODUCTION

Motivated by applying reinforcement learning algorithms into more realistic tasks, we find that most realistic environments cannot feed an explicit reward signal back to the agent immediately. It becomes a bottleneck for traditional reinforcement learning methods to be applied into more realistic scenarios. So how to infer the latent reward function from expert demonstrations is of great significance. Recently, a lot of great work have been proposed to solve this problem. They are also successfully applied in scientific inquiries, such as Stanford autonomous helicopter Abbeel et al. (2006) Abbeel et al. (2007) Ng et al. (2004) Coates et al. (2008) Abbeel et al. (2008a) Abbeel et al. (2010) , as well as practical challenges such as navigation Ratliff et al. (2006) Abbeel et al. (2008b ) Ziebart et al. (2008) Ziebart et al. (2010) and intelligent building controls Barrett & Linder (2015) . The goal of imitation learning is to mimic the expert behavior from expert demonstrations without access to a reinforcement signal from the environment. The algorithms in this field can be divided into two board categories: behavioral cloning and inverse reinforcement learning. Behavioral cloning formulate this problem as a supervised learning problem which aims at mapping state action pairs from expert trajectories to policy. These methods suffer from the problem of compounding errors (covariate shift) which only learn the actions of the expert but not reason about what the expert is trying to achieve. By the contrast, inverse reinforcement learning recovers the reward function from expert demonstrations and then optimize the policy under such an unknown reward function. In this paper, we propose energy-based generative adversarial imitation learning which views the discriminator as an energy function without explicit probabilistic interpretation. The energy function computed by the discriminator can be viewed as a trainable cost function for the generator, while the discriminator is trained to assign low energy values to the regions of expert demonstrations, and higher energy values outside these regions. We use an auto-encoder to represent the discriminator, and the reconstruction error is thought to be the energy. There are many other choices to learn the energy function, but an auto-encoder is quite efficient. Our main contributions are summarized as follows. • An EB-GAIL framework with the discriminator using an auto-encoder architecture in which the energy is the reconstruction error. • Theoretical analysis shows that the policy rolls out the trajectories that are indistinguishable from the distribution of the expert demonstrations by matching the occupancy measure with the expert policy. • Experiments show that EB-GAIL outperforms several SoTA imitation learning algorithms while the training process for EB-GAIL can be more stable.

2. BACKGROUND

In this section, we'll briefly introduce the basic concepts in direct reinforcement learning Sutton & Barto (1998 ) Sugiyama (2015) , inverse reinforcement learning Ng & Russell (2000) , imitation learning Bagnell (2015) , and energy based models LeCun et al. (2006) .

2.1. DIRECT REINFORCEMENT LEARNING

Reinforcement Learning Sutton & Barto (1998) Sugiyama (2015) , which is usually used for sequential decision making problems, can help us to learn from the interactions with the environment. The process for direct reinforcement learning is that at each time step, the agent receives a state s t and chooses an action a t from the action space A, following a stochastic policy π(a t |s t ). After that, the environment transit to a new state s t+1 and gives a scalar reward r t back to the agent. This process continues until the agent reaches a terminal state. In this case, the training model can be thought as a Markov decision process (MDP), which is a tuple (S, A, T, γ, D, R). In this tuple, S is the state space; A is the action space; T = P sa is a probability matrix for state transitions, owing to the environment dynamics; γ ∈ (0, 1] is a discount factor; D is the initial-state transition distribution; and R : S → A is the reward function, which is assumed to be bounded in absolute value by 1. In addition, we also define that a policy π is a mapping from states to probability distributions over actions, which is also called a stochastic policy. For a certain task, the goal of direct reinforcement learning is to maximize the total future reward. For simplification, we define the value function to be a prediction of the total future reward, which can be shown as a discounted future reward: V = ∞ t=0 γ t R t . Besides, we also define the action value function as Q(s, a) = E[R t |s t = s, a t = a], which is the expected return for selecting action a in state s. According to the Bellman Optimality, an optimal action value function Q * (s, a) is the maximum action value achievable by any policy for state s and action a.

2.2. INVERSE REINFORCEMENT LEARNING

The goal of inverse reinforcement learning is to infer the reward signal with respect to the expert demonstrations which are assumed to be the observations of optimal behaviors Ng & Russell (2000) . In the past decade, a lot of great work have been proposed towards enlarging the ability of reward function representation. Take some traditional methods for example, in 2010, FIRL Levine & Popvić (2010) was proposed to learn a set of composites features based on logical conjunctions with nonlinear functions for the reward signal. Later, non-parametric methods based on Gaussian Process(GP) Rasmussen & Williams (2006) are proposed to enlarge the function space of latent reward to allow for non-linearity in Levine & Popvić (2010) . For undertaking the learning of an abstract structure with smaller data sets, Jin et al. Jin & Spanos (2015) combined deep belief networks and gaussian process to optimize the existing algorithm GP-IRL. To solve the substantial noise in the sensor measurements, some work have been applied here such as bayesian programming, probabilistic graphical model and so on. For example, in 2007, Ramachandran et al. proposed a Bayesian nonparametric approach Ramachandran & Amir (2007) to construct the reward function features in IRL, which is so-called Bayesian IRL. Later, Choi et al. Choi & Kim (2014) extend this algorithm by defining a prior on the composite features, which are defined to be the logical conjunctions of the atomic features. By assuming that each trajectory can be generated by multiple locally consistent reward functions, Nguyen et al. used an expectation-maximization (EM) algorithm to learn the different reward functions and the stochastic transitions between them in order to jointly improve the likelihood of the expert's demonstrated trajectories Nguyen et al. (2015) . As a result, the most likely partition of a trajectory into segments that are generated from different locally consistent reward functions selected by EM can be derived. Experiments show that the algorithm outperforms the SoTA EM clustering with maximum likelihood IRL.

2.3. IMITATION LEARNING

Imitation learning is a study of algorithms that can mimic the experts' demonstrations or a teachers' behavior. Unlike inverse reinforcement learning, the goal of imitation learning is to obtain a policy from teachers' behaviors rather than to recover the reward function for some certain tasks. The algorithms in imitation learning can be classified into two categories: behavioral cloning, and inverse reinforcement learning. One fatal weakness for behavioral cloning is that these methods can only learn the actions from the teacher rather than learn the motivation from teachers' behaviors. To solve this problem, inverse reinforcement learning was proposed to recover the reward function for decision making problems. By assuming that the teachers' behavior is optimal, these methods tend to recover the reward function in a Markov decision process. So when combined with direct reinforcement learning methods, inverse reinforcement learning can realize the process for mimicking the teachers' behaviors.

2.4. ENERGY BASED MODEL

The essence of the energy based model LeCun et al. (2006) is to build a function that maps each point of an input space to a single scalar, which is called "energy". The learning phase is a data driven process that shapes the energy surface in such a way that the desired configurations get assigned low energies, while the incorrect ones are given high energies. Supervised learning falls into this framework: for each x in the training set, the energy of the pair (x, y) takes low values when y is the correct label and higher values for incorrect y's. Similarly, when modeling x alone within an unsupervised learning setting, lower energy is attributed to the data manifold. The term contrastive sample is often used to refer to a data point causing an energy pull up, such as the incorrect y's in supervised learning and points from low data density regions in unsupervised learning. Denoted that the energy function is ε, the connection between probability and energy can be built through Gibbs distribution: P (y|x) = exp(-βε(y, x)) y∈Y exp(-βε(y, x)) , the denominator here is the partition function which represents the total energy in the data space and β is an arbitrary positive constant. The formulation of this connection might seem arbitrary, but other formulation can be obtained by re-defining the energy function.

3. ENERGY BASED GENERATIVE ADVERSARIAL IMITATION LEARNING (EB-GAIL)

3.1 METHODOLOGY The output of the discriminator goes through an objective functional in order to shape the energy function, assigning low energy to the regions near the expert demonstrations and higher energy to the other regions. In this work, we use an auto-encoder to represent the discriminator and the reconstruction error of the auto-encoder is assumed to be the energy. Meanwhile, we use a margin loss function to train EB-GAIL while one loss function is to train the discriminator and the other loss function is assumed to be the reward function for the reinforcement learning procedure. Given a positive margin margin, a state-action pair χ E sampled from expert demonstrations, and a state-action pair χ i rolled out by a trained policy, the discriminator loss function L D is formally defined by: L D = D(χ E ) + [margin -D(χ i )] + where [•] + = max(0, •). Meanwhile, the reward function for the reinforcement learning procedure is: r(χ i ) = -D(χ i ) Maximizing the total reward for the reinforcement learning is similar to minimizing the second term of L D . In practice, we observe that the loss function can effectively avoid gradient vanishing and mode collapse problems. In this work, we propose that the discriminator is structured as an auto-encoder, which assigns low energy to the regions near the expert demonstrations and high energy to the other regions.

3.2. USING AUTO-ENCODER

The discriminator is defined as: D(x) = Dec(Enc(x)) -x Algorithm 1 Energy based Generative Adversarial Imitation Learning (EB-GAIL) Require: Initial parameters of policy, discriminator θ 0 , w 0 . Expert trajectories τ E ∼ π E . Choose a value for margin. Ensure: 1: for i = 0 to N do 2: Sample trajectories: τ i ∼ π θi with current policy π θi . 3: Sample state-action pairs χ i ∼ τ i and χ E ∼ π E with same batch size.

4:

Update w i to w i+1 by decreasing with the gradient: Êχi [∇ wi D wi (s, a)] + Êχ E [∇ wi [margin -D wi (s, a)] + ] 5: Take a policy step from θ i to θ i+1 , using the TRPO update rule with the reward function -D wi (s, a), and the objective function for TRPO is: Êτi [(-D wi (s, a))] -λH(π θi ). 6: end for 2006), but the auto-encoder is an efficient one. Algorithm 1 depicts the procedure for training EB-GAIL. The first step is to sample the stateaction pairs from expert demonstrations χ E ∼ π E . The second step is to sample the state-action pairs from the trajectories rolled out by current policy χ i ∼ τ i with the same batch size. Then we update the discriminator by decreasing with the gradient: Êχi [∇ wi D wi (s, a)]+ Êχ E [∇ wi [margin- D wi (s, a)] + ] . The fourth step is to update the policy assuming that the reward function is -D wi (s, a). We runs these steps with N iterations until the policy converges.

3.3. THEORETICAL ANALYSIS

In this section, we introduce a theoretical analysis for EB-GAIL. We show that if EB-GAIL reaches a Nash equilibrium, then the policy perfectly match the occupancy measure with the expert policy. This section is done in a non-parametric setting. We also assume that D and G have infinite capacity. Firstly, we provide a definition for occupancy measure for inverse reinforcement learning. Definition 1. (occupancy measure) The agent rolls out the trajectories which can be divided into state-action pairs (s, a) with policy π. This leads to the definition of occupancy measure ρ s,a π (s, a) and ρ s π (s) as the density of occurrence of states or state-action pairs: ρ s,a π = ∞ t=0 γ t P (s t = s, a t = s|π) = π(a|s) ∞ t=0 γ t P (s t = s|π) = π(a|s)ρ s π (s), where we assume γ = 1 for simplicity in the following paragraph. Here we are justified in writing π ρ to denote the unique policy for an occupancy measure ρ. Now, let us define an IRL primitive procedure, which finds a reward function such that the expert performs better than all other policies, with the reward regularized by φ: IRL φ (π E ) = arg min r∈R S×A φ(r) + (max π∈Π H(π) -E π [r(s, a)]) + E π E [r(s, a)] We are interested in the policy given by running reinforcement learning procedure on the reward function which is the output of IRL. And now we are ready to acquire the policy learned by RL under the reward function recovered by IRL: Proposition 2. (Proposition 3.2 in Ho & Ermon (2016)) RL • IRL φ (π E ) = arg min π∈Π -H(π) + φ * (ρ π -ρ π E ) Remark 1. Proposition 2 tells that φ-regularized inverse reinforcement learning seeks a policy whose occupancy measure is close to the expert's, which is measured by the convex function φ * . We present that the discriminator is to minimize d φ (ρ π , ρ π E )-H(π), where d φ (ρ π , ρ π E ) = φ * (ρ π - ρ π E ) by modifying the IRL regularizer φ, so that d φ (ρ π , ρ π E ) smoothly penalizes violations in difference between the occupancy measures. And the generator is to maximize the total reward for a policy, which is E[ (s, a) ] (we assume γ = 1 for simplicity). T t=0 γ t r(s, a)] = E[ T t=0 D Proposition 3. Our choice (EB-GAIL) of φ is: φ(ρ π -ρ π E ) = min E π E [D(s, a)] + E π [(margin -D(s, a)) + ] The proof of this proposition is in Appendix A.1.1. Remark 2. Proposition 3 tells that in our paper, the φ is determined by the loss functional, which is the margin based loss function. And we can further prove that this formulation of φ effectively help the algorithm to match the occupancy measure with expert policy during the training process. Considered the loss functionals in section 3.1 and proposition 3, we define: V (G, D) = χi,χ E φ(ρ π -ρ π E )ρ π ρ π E dχ i dχ E = χi,χ E {E π E [D(χ E )] + E π [(margin -D(χ i )) + ]} L D ρ π ρ π E dχ i dχ E (9) U (G, D) = χi D(χ i ) -r(χi) ρ π dχ i (10) Obviously, we train the discriminator D to minimize the quantity V and the generator G to minimize the quantity U . A Nash equilibrium of EB-GAIL is a pair (G * , D * ) that satisfies: V (G * , D * ) ≤ V (G * , D) ∀D U (G * , D * ) ≤ U (G, D * ) ∀G Theorem 1. If the system (EB-GAIL) reaches a Nash equilibrium, then ρ π = ρ π E , and V (G * , D * ) = m, where m represents margin. The proof of this theorem is in Appendix A.1.2. Remark 3. The proof follows the idea that won't violate the equation 11 (get conclusion that m ≤ V (G * , D * ) ≤ m). And therefore, the occupancy measure can be matched (ρ π = ρ π E ) when the system reaches a Nash equilibrium. Theorem 1 tells that using EB-GAIL loss functionals can help the trained policy to match the occupancy measure with the expert policy and V (G, D) will converge to margin(m) which indicate that the discriminator will converge while the system reaches a Nash equilibrium. Theorem 2. A Nash equilibrium of this system exists and is characterized by (1)ρ π = ρ π E , and (2) there exists a constant γ ∈ [0, m] such that D * (χ) = γ almost everywhere, where m represents margin. The proof of this theorem is in Appendix A.1.3. Remark 4. Theorem 2 tells that if the Nash equilibrium of EB-GAIL exists, then it is characterized by that the trained policy could match the occupancy measure with expert policy (the trajectories rolled by the trained policy will be indistinguishable with the expert demonstrations) and D * (χ) = γ almost everywhere which indicate that the discriminator could converge.

4.1. TASKS

The tasks in the presented benchmark can be divided into two categories: basic tasks and locomotion tasks. We briefly describe them in this section. We choose to implement all tasks using physics simulators rather than symbolic equations, since the former approach is less error-prone and permits easy modification of each task. Tasks with simple dynamics are implemented using Box2D, an open-source, freely available 2D physics simulator. Tasks with more complicated dynamics, such as locomotion, are implemented using MuJoCo, a 3D physics simulator with better modeling of contacts.

4.1.1. BASIC TASKS

We implement five basic tasks that have been widely analyzed in reinforcement learning and imitation learning literature: Cart-Pole Balancing Stephenson (1908) Schulman et al. (2015b), and Full Humanoid Tassa et al. (2012) . The goal for all these tasks is to move forward as quickly as possible. These tasks are more challenging than the basic tasks due to high degrees of freedom. In addition, a great amount of exploration is needed to learn to move forward without getting stuck at local optima. Since we penalize for excessive controls as well as falling over, during the initial stage of learning, when the robot is not yet able to move forward for a sufficient distance without falling, apparent local optima exist including staying at the origin or diving forward slowly.

4.2. BASELINES

In this section, we will introduce the the baseline methods used in detail. Behavioral Cloning (BC): learning a mapping from state space to action space, with supervised learning methods. Specifically, the algorithm cannot get more information from the expert any more except the expert demonstrations. So the problem of compounding errors will occur and the performance might be poor. But in fact, the performance is quite promising. Guided Cost Learning (GCL): the algorithm of Finn et al. (2016) , which is actually a sampling based maximum entropy inverse reinforcement learning method with neural networks as the cost function. The reward function search process and policy update process are in the inner loop of the algorithm. So it will need more computation resources. Generative Adversarial Imitation Learning (GAIL): the algorithm of Ho & Ermon (2016) which use a GAN architecture for policy improvement and reward function fitting. GAIL use the discriminator to compute the reward for some state-action pair, and then using the reward function to update the policy.

4.3. TRAINING SETTING

We used all the algorithms to train policies of the same neural network architecture for all tasks: two hidden layers of 100 units each, with tanh nonlinearities in between. All networks were always initialized randomly at the start of each trial. For each task, we gave BC, GCL, GAIL and our EB-GAIL exactly the same amount of environment interaction for training. We ran all algorithms 10 times over different random seeds in all environments. More information is depicted in Appendix A.2.

4.4. RESULTS AND DISCUSSION

Figure 2 depicts the results which are the scaled rewards for different imitation learning methods. We set the expert trajectory reward is 1.0 and random policy's reward is 0.0. In basic tasks, BC achieves comparable results compared with GCL and GAIL. And our proposed method EB-GAIL achieves expert performance on these tasks. Obviously, our EB-GAIL outperforms other SoTA methods on this evaluation metric. In locomotion tasks, GCL and GAIL achieve higher rewards than BC except for HalfCheetah and Ant. In these tasks, EB-GAIL obtains better performance than BC, GCL and GAIL even with limited expert trajectories (we only use 1 ∼ 10 trajectories in basic tasks and 4 ∼ 25 trajectories in locomotion tasks except for humanoid tasks). And our proposed method EB-GAIL still achieves expert performance and outperform all the compared algorithms. These results definitely show our EB-GAIL outperforms other SoTA methods on basic tasks and MuJoCo environments. Meanwhile, the deviation for our proposed EB-GAIL is also roughly lower than other SoTA methods. It shows the stability of training our EB-GAIL. Experiment results tells that our EB-GAIL achieves expert performance on basic tasks and locomotion tasks, and it basically outperforms other SoTA methods. Meanwhile the training process for EB-GAIL is more stable than other SoTA methods.

5. CONCLUSION

In this paper, we present an energy based method for generative adversarial imitation learning, which is so-called EB-GAIL. It views the discriminator as an energy function that attributes low energies to the regions near the expert demonstrations and high energies to other regions. To learn the energy of the state-action space, we use the mean square error of an auto-encoder to represent the energy function (there are still many other choices for us to represent the energy function, but the auto-encoder is quite efficient). Theoretical analysis depicts the system could match the occupancy measure with expert policy when the system reaches a Nash equilibrium. Meanwhile it also tells that if the Nash equilibrium of EB-GAIL exists, then it is characterized by that the trained policy could match the occupancy measure with expert policy and the discriminator could converge to a number between 0 and the margin. Experiment results show that our EB-GAIL achieves SoTA performance, for learning a policy that can imitate and even outperform the human experts. Meanwhile, the training procedure for EB-GAIL is more stable than other SoTA methods. As we demonstrated, our method is also quite sample efficient by learning from limited expert demonstrations. We hope that our work can further be applied into more realistic scenarios.

A APPENDIX

A.1 OMITTED PROOFS A.1.1 PROOF OF PROPOSITION 3 Our choice (EB-GAIL) of φ is: φ(ρ π -ρ π E ) = min E π E [D(s, a)] + E π [(margin -D(s, a)) + ] Proof. This proof will take the JS divergence as an example. And we will prove that the cost regularizer φ GA (c) = E π E [g(c(s, a))] if c < 0 +∞ otherwise where g(x) = -x -log(1 -e x ) if x < 0 +∞ otherwise satisfies: φ * GA (ρ π -ρ π E ) = max E π [log(D(s, a))] + E π E [log(1 -D(s, a))]. Using the logistic loss log(1 + exp(-x)), we see that applying Proposition A.1 in Ho & Ermon (2016) , we get: φ * GA (ρ π -ρ π E ) = -R φ (ρ π , ρ π E ) = s,a max γ∈R ρ π (s, a)log( 1 1 + exp(-γ) ) + ρ π E (s, a)log( 1 1 + exp(γ) ) = s,a max γ∈R ρ π (s, a)log( 1 1 + exp(-γ) ) + ρ π E (s, a)log(1 - 1 1 + exp(-γ) ) = s,a max γ∈R ρ π (s, a)log(δ(γ)) + ρ π E (s, a)log(1 -δ(γ)), where δ(x) = 1/(1 + exp(-x)) is the sigmoid function. Because the range of δ is (0, 1), we can write: φ * GA (ρ π -ρ π E ) = s,a max ρ π (s, a)logd + ρ π E (s, a)log(1 -d) = max s,a ρ π (s, a)log(D(s, a)) + ρ π E (s, a)log(1 -D(s, a)) which is the desired expression. Obviously, in our EB-GAIL, we choose the expression for φ is: min E π E [D(s, a)] + E π [(margin -D(s, a)) + ] This completes the proof. Proof. The function φ is defined on [0, +∞), its derivative is defined on [0, +∞)\{m} and φ (x) = a -b if x ∈ [0, m) and φ (x) = a if x ∈ (m, +∞). So when a < b, the function is decreasing on [0, m) and increasing on (m, +∞). Since it is continuous, it has a minimum in m. On the other hand, if a ≥ b the function φ is increasing on [0, +∞), so 0 is a minimum. This completes the proof. Lemma 2. Zhao et al. (2016) If p and q are probability densities, and the function 1 A (x) = 1 if x ∈ A otherwise 1 A (x) = 0, then x 1 p(x)<q(x) dx = 0 if and only if x 1 p(x) =q(x) dx = 0. Proof. Let's assume that x 1 p(x)<q(x) dx = 0. Then x 1 p(x)>q(x) (p(x) -q(x))dx = x (1 -1 p(x)≤q(x) )(p(x) -q(x))dx = x p(x)dx - x q(x)dx + x 1 p(x)≤q(x) (p(x) -q(x))dx =1 -1 + x (1 p(x)<q(x) + 1 p(x)=q(x) )(p(x) -q(x))dx = x 1 p(x)<q(x) (p(x) -q(x))dx+ x 1 p(x)=q(x) (p(x) -q(x))dx =0 So x 1 p(x)>q(x) (p(x) -q(x))dx = 0 and since the term in the integral is always non-negative, 1 p(x)>q(x) (p(x) -q(x)) = 0 for almost all x. And p(x) -q(x) = 0 implies 1 p(x)>q(x) = 0, so 1 p(x)>q(x) = 0 almost everywhere. This completes the proof. Theorem 1: If the system (EB-GAIL) reaches a Nash equilibrium, then ρ π = ρ π E , and V (G * , D * ) = m, where m represents margin. Proof. First we observe that  V (G * , D * ) = χ E ρ π E (χ E )D(χ E )dχ E + χi ρ π (χ i )[m -D(χ i )] + dχ i = χ E (ρ π E (χ E )D(χ E ) + ρ π * (χ E )[m -D(χ E )] + )dχ E . V (G * , D * ) = m χ E 1 ρπ E (x)<ρ π * (χ E ) ρ π E (χ E )dχ E + m χ E 1 ρπ E (χ E )≤ρ π * (χ E ) ρ π * (χ E )dχ E = m χ E (1 ρπ E (χ E )<ρ π * (χ E ) ρ π E (χ E )+ (1 -1 ρπ E (χ E )<ρ π * (χ E ) )ρ π * (χ E ))dχ E = m χ E ρ π * (χ E )dχ E + m χ E 1 ρπ E (χ E )<ρ π * (χ E ) (ρ π E (χ E ) -ρ π * )dχ E = m + m χ E 1 ρπ E (χ E )<ρ π * (χ E ) (ρ π E (χ E ) -ρ π * (χ E ))dχ E . The second term in equation 21 is non-positive, so V (G * , D * ) ≤ m. Under review as a conference paper at ICLR 2021 By putting the ideal generator that generates p data into the right side of equation 11, we get χ E ρ π * (χ E )D * (χ E )dχ E ≤ χ E ρ π E (χ E )D * (χ E )dχ E . Thus by equation 20, This completes the proof. χ E ρ π * (χ E )D * (χ E )dχ E + χ E ρ π * (χ E )[m -D * (χ E )] + dχ E ≤ V (G * , D * ) A.1.3 PROOF OF THEOREM 2 A Nash equilibrium of this system exists and is characterized by (1)ρ π = ρ π E , and (2) there exists a constant γ ∈ [0, m] such that D * (x) = γ (almost everywhere). Proof. The sufficient conditions are obvious. The necessary condition on π * comes from theorem 1, and the necessary condition on D * (χ) ≤ m is from the proof of theorem 1. Let us now assume that D * (χ) is not constant almost everywhere and find a contradiction. If it is not, then there exists a constant C and a set S of non-zero measure such that ∀χ ∈ S, D * (χ) ≤ C and ∀χ / ∈ S, D * (χ) > C. In addition we can choose S such that there exists a subset S ⊂ S of non-zero measure such that ρ π E (χ) > 0 on S . We can build a generator policy ρ 0 such that ρ π0 (χ) ≤ ρ π E (χ) over S and ρ π0 (χ) < ρ π E (χ) over S . We compute For GCL, we use neural network to represent the reward function, and the learning rate for policy update is 0.0001, the learning rate for reward update is 0.001, with minibatches of 64 examples. For GAIL, the discriminator networks used two hidden layers of 100 units each, with tanh nonlinearities as its architecture. And the learning rate for policy update is 0.0001, and the learning rate for discriminator update is 0.001, with minibatches of 64 examples. For our EB-GAIL, the discriminator (auto-encoder) networks used four hidden layers of 100 units each, with tanh nonlinearities as its architecture. And the learning rate for policy update is 0.0001, the learning rate for discriminator (auto-encoder) update is 0.001, and margin is 5, with minibatches of 64 examples.



Figure 1: The architecture for EB-GAIL.

Figure1depicts the architecture for EB-GAIL. The reinforcement learning component is trained to roll out trajectories τ i , which are the sequences of state-action pairs. The discriminator D takes either expert or generated state-action pairs, and estimates the energy value accordingly. Here, we assume

basic result is that the set of valid occupancy measures D = {ρ π : π ∈ Π} can be written as a feasible set of affine constraints: if p 0 (s) is the distribution of starting states and P (s |s, a) is the dynamics model, then D = {ρ : ρ ≥ 0 and a ρ(s, a) = p 0 (s) + γ s ,a P (s|s , a)ρ(s , a) ∀s ∈ S}. Proposition 1. (Theorem 2 in Syed et al. (2008)) If ρ ∈ D, then ρ is the occupancy measure for π ρ (a|s) = ρ(s,a) a ρ(s,a ) , and π ρ is the only policy whose occupancy measure is ρ.

Figure 2: Comparisons for EB-GAIL and other SoTA methods on different tasks.

Zhao et al. (2016) Let a, b ≥ 0, φ(x) = ax + b[m -x] + . The minimum of φ on [0, +∞)exists and is reached in m if a < b, and it is reached in 0 otherwise.

Lemma 1 shows: (1) D * (χ) ≤ m almost everywhere. To verify it, let us assume that there exists a set of measure non-zero such that D * (χ) > m. Let D(χ) = min(D * (χ), m). Then V (G * , D) < V (G * , D * ) which violates equation 11. (2) The function φ reaches its minimum in m if a < b and in 0 otherwise. So V (G * , D) reaches its minimum when we replace D * (x) by these values. We obtain

)and since D * (χ) ≤ m, we get m ≤ V (G * , D * ). Thus, m ≤ V (G * , D * ) ≤ m, so V (G * , D * ) = m.According to equation 21, we see that can only happen if χ 1 ρπ E (χ)<ρπ(χ)dχ = 0. According to Lemma2, it is true if and only if ρ π = ρ π E .

Figure 3 depicts locomotion tasks' environments.

Figure 3: Illustrations for locomotion tasks: (a) Swimmer; (b) Hopper; (c) Walker; (d) Half-Cheetah; (e) Ant; (f) Simple Humanoid; and (g) Full Humanoid.

