UNBIASED LEARNING WITH STATE-CONDITIONED REWARDS IN ADVERSARIAL IMITATION LEARNING

Abstract

Adversarial imitation learning (AIL) has emerged as a general and scalable framework for automatic reward acquisition. However, we point out that previous methods commonly exploit occupancy-dependent reward learning formulations, which hinder the reconstruction of the optimal decision as an energy-based model. Despite their theoretical justification, occupancy measures tend to cause issues in practice because of their high variance and high vulnerability to domain shifts. Another reported problem is the termination bias induced by the rewarding and regularization schemes applied around terminal states. To address these issues, this work presents a novel algorithm called causal adversarial inverse reinforcement learning. Our formulation draws a strong connection between adversarial learning and energy-based reinforcement learning; thus, the architecture is capable of recovering a reward function that induces a multi-modal policy. In experiments, we demonstrate that our approach outperforms prior methods on challenging continuous control tasks, even under significant variation in the environments.

1. INTRODUCTION

Inverse reinforcement learning (IRL) is the problem of recovering the ground-truth reward function from observed behavior (Ng & Russell, 2000). IRL algorithms, followed by appropriate reinforcement learning (RL) algorithms, can optimize a policy through farsighted cumulative value measures in the given system (Sutton & Barto, 2018); hence they can usually achieve more satisfying results than mere supervision. While a few studies have investigated recovering reward functions in continuous spaces (Babes et al., 2011; Levine & Koltun, 2012), IRL algorithms often fail to find the ground-truth reward function in high-dimensional, complex domains (Finn et al., 2016b). The notion of a ground-truth reward requires elaboration, since IRL is an ill-posed problem: there can be numerous reward functions inducing the same optimal policy (Ng et al., 1999; Ng & Russell, 2000). Recently, adversarial imitation learning (AIL) has shown promising results as a reward acquisition method (Ho & Ermon, 2016). One of its distinctive strengths is scalability through parameterized non-linear functions such as neural networks. The maximum causal entropy principle is widely regarded as the solution when the optimal control problem is modeled as probabilistic inference (Ziebart et al., 2010; Haarnoja et al., 2017). In particular, probabilistic modeling with a continuous energy function forms a representation called an energy-based model (EBM). We highlight the following advantages of energy-based IRL:
• It provides a unified framework for learning stochastic policies; most probabilistic models can be viewed as special types of EBMs (LeCun et al., 2006).
• It rationalizes the stochasticity of behavior; this provides robustness in the face of uncertain dynamics (Ziebart et al., 2010) and a natural way of modeling complex multi-modal distributions.
AIL reward functions seem to be exceptions to these arguments: the AIL framework produces distinct types of rewards that are ever-changing and intended for discriminating joint densities. We argue that these characteristics hinder proper information projection onto the optimal decision. This work points out that two kinds of biases remain in AIL. Established AIL algorithms are typically formalized over cumulative densities called occupancy measures. We claim that these accumulated measures contain biases unrelated to modeling purposeful behavior, and that the formulation is vulnerable to distributional shifts of an MDP. Empirically, they act as dominant noise in training because of the formulation's innately high variance. The other bias is the implicit survival or early-termination bias caused by reward formulations that do not account for the terminal states of finite episodes. Such unnormalized rewards often provoke sub-optimal behaviors in which the agent learns to maliciously exploit time-aware strategies. This paper proposes an adversarial IRL method called causal adversarial inverse reinforcement learning (CAIRL). We primarily associate the reward acquisition method with approaches for energy-based RL and IRL; the CAIRL reward function can induce complex probabilistic behaviors with multiple modalities. We then show that learning with a dual discriminator architecture provides stepwise, state-conditioned rewards. For handling biases induced by finite horizons, the model postulates that the reward function satisfies a Bellman equation including "self-looping" terminal states. As a result, it learns a reward function satisfying the properties of EBMs.
Noteworthy contributions of this work are 1) a model-free, energy-based IRL algorithm that is effective in high-dimensional environments, 2) a dual discriminator architecture for recovering a robust state-conditioned reward function, 3) an effective approach for handling terminal states, and 4) extensive experiments and comparison studies against state-of-the-art algorithms across a variety of topics.

2. RELATED WORKS

Imitation learning is a fundamental approach for modeling intelligent behavior from an expert at specific tasks (Pomerleau, 1991; Zhang et al., 2018). In the standard framework called Behavioral Cloning, learning from demonstrations is treated as supervised learning on a trajectory dataset. IRL, on the other hand, aims to recover the reward function of the underlying system, which characterizes the expert. From this perspective, training a policy with an IRL reward function is a branch of imitation learning specialized in sequential decision-making problems, recovering a concise representation of a task (Ng & Russell, 2000; Abbeel & Ng, 2004). For modeling stochastic expert policies, Boltzmann distributions appeared in early IRL research, such as Bayesian IRL, natural gradient IRL, and maximum likelihood IRL (Ramachandran & Amir, 2007; Neu & Szepesvári, 2012; Babes et al., 2011). Notably, maximum entropy IRL (Ziebart et al., 2008) is explicitly formulated on the principle of maximum entropy. The framework has also been derived from causal entropy; the resulting algorithm can model the purposeful distribution of the optimal policy as a reward function (Ziebart et al., 2010). Our work draws significant inspiration from these prior works and aims to redeem the perspective of probabilistic causality. Recently, AIL methods (Ho & Ermon, 2016; Fu et al., 2017; Ghasemipour et al., 2020) have shown great success on continuous control benchmarks. Each of them provides a unique divergence minimization scheme through its architecture. In particular, our work shares major components with AIRL. It has been argued that that algorithm does not recover the energy of the expert policy (Liu et al., 2020). We stress that our work introduces the essential concepts needed to correctly draw an energy-based representation of the expert policy.
The discriminator design is based on the rich energy-based interpretation of GANs (Zhao et al., 2016; Azadi et al., 2018; Che et al., 2020) and on numerous studies with multiple discriminators (Chongxuan et al., 2017; Gan et al., 2017; Choi et al., 2018). The issues of finite-horizon tasks were initially raised in RL during the discussion of time limits in MDP benchmarks (Pardo et al., 2017; Tucker et al., 2018). It turned out that time limits, or even the mere existence of terminal states, significantly affect the value learning procedure of RL compared to infinite-horizon MDPs. IRL suffers from the same problem: reward learning on finite episodes is unstable for tasks outside of appropriate benchmarks. Kostrikov et al. (2018) suggested explicitly adding a self-repeating absorbing state (Sutton & Barto, 2018) after the terminal state; consequently, AIL discriminators can evaluate termination frequencies.

3. BACKGROUND

Markov Decision Process (MDP). We define an MDP as a tuple M = (S, A, P, r, p_0, γ), where S and A denote the state and action spaces and γ is the discount factor. The transition distribution P, the deterministic state-action reward function r, and the initial state distribution p_0 are unknown. Let τ_π and τ_E be finite sequences of states and actions (s_0, a_0, . . . , a_{T-1}, s_T) obtained by a policy π and the expert policy π_E, respectively. The term ρ_π denotes the occupancy measure induced by π, defined as ρ_π(s, a) = π(a|s) Σ_{t=0}^∞ γ^t Pr(s_t = s | π). With a slight abuse of notation, we refer to the occupancy measures of states as ρ_E(s) and ρ_π(s). The expectation of π for an arbitrary function c denotes an infinite-horizon expected return: E_π[c(s, a)] ≜ E[Σ_{t=0}^∞ γ^t c(s_t, a_t) | π].

Table 1: The objectives of AIL algorithms, written as the minimization of statistical divergences.

Behavioral Cloning: E_{π_E}[D_KL(π_E(a|s) ‖ π(a|s))] = −E_{π_E}[log π(a|s)] + const
GAIL (Ho & Ermon, 2016): D_JS(ρ_π(s, a), ρ_E(s, a)) − H(π(·|s))
AIRL (Fu et al., 2017): D_KL(ρ_π(s, a) ‖ ρ_E(s, a)) = −E_π[log ρ_E(s, a)] − H(ρ_π)
FAIRL (Ghasemipour et al., 2020): D_KL(ρ_E(s, a) ‖ ρ_π(s, a)) = −E_{π_E}[log ρ_π(s, a)] − H(ρ_E)
CAIRL (this work): E_π[D_KL(π(·|s) ‖ π_E(·|s))] = −E_π[r(s, a) + H(π(·|s))] + const

Maximum Entropy IRL (MaxEnt IRL). Ziebart (2010) and Haarnoja et al. (2017) defined the optimality of a stochastic policy through an entropy-regularized RL objective:
π* = argmax_{π∈Π} Σ_t E_{(s_t,a_t)∼ρ_π}[r(s_t, a_t) + αH(a_t|s_t)],
where H denotes the causal entropy function. If π_E is the MaxEnt RL policy, the softmax Bellman optimality equations can be defined by the following recursion:
Q*(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼P(·|s_t,a_t)}[V*(s_{t+1})],
V*(s_t) = E_{a_t∼π_E(·|s_t)}[Q*(s_t, a_t) − log π_E(a_t|s_t)].
MaxEnt IRL algorithms (Ziebart et al., 2008; 2010) are energy-based interpretations of IRL that aim to find behavior abiding by the MaxEnt principle. Such algorithms, however, are difficult to compute when the given spaces are continuous or the dynamics are unknown (Finn et al., 2016a).

Adversarial Imitation Learning. Ho & Ermon (2016) considered adversarial learning as a model-free, sampling-based approximation to MaxEnt IRL.
Instead of exhaustively solving the problem, GAIL performs imitation learning by minimizing the divergence between the state-action occupancy measures of expert and learner through the following logistic objective:
min_{π∈Π} max_D E_{π_E}[log D(s, a)] + E_π[log(1 − D(s, a))] − H(π),
where D ∈ (0, 1)^{|S||A|} indicates a binary classifier trained to distinguish between τ_π and τ_E. The AIRL discriminator tries to disentangle a reward function that is invariant to dynamics. It takes a particular form: D_θ(s, a) = exp(f_{θ,ψ}(s, a)) / (exp(f_{θ,ψ}(s, a)) + π_φ(a|s)). Learning with AIRL can be regarded as minimizing the reverse KL divergence between occupancy measures. Ghasemipour et al. (2020) proposed the FAIRL algorithm as an adversarial method for the forward KL divergence.
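The two discriminator forms above can be sketched in a few lines. This is a minimal illustration with scalar stand-ins, not the paper's implementation: `f_value` and `log_pi` are hypothetical placeholders for the learned f_{θ,ψ}(s, a) and log π_φ(a|s).

```python
import math

def gail_discriminator(d_logit):
    """GAIL: D(s, a) = sigmoid(d(s, a)), a plain binary-classifier logit."""
    return 1.0 / (1.0 + math.exp(-d_logit))

def airl_discriminator(f_value, log_pi):
    """AIRL: D(s, a) = exp(f) / (exp(f) + pi(a|s))."""
    return math.exp(f_value) / (math.exp(f_value) + math.exp(log_pi))

def airl_reward(f_value, log_pi):
    """The AIL surrogate reward log D - log(1 - D) reduces to f - log pi."""
    d = airl_discriminator(f_value, log_pi)
    return math.log(d) - math.log(1.0 - d)
```

The last function makes the entropy-regularized structure of the AIRL reward explicit: the surrogate reward is exactly f(s, a) − log π(a|s).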

4. UNBIASED REINFORCEMENT SIGNALS FOR ENERGY-BASED MODELS

Our aim in this section is to investigate unbiased probabilistic modeling of the causality of decisions using the MaxEnt framework. In Sec. 4.1, we discuss the energy-based reward function. Sec. 4.2 and Sec. 4.3 introduce methods for modeling this particular reward function.

4.1. ENERGY-BASED REWARD REPRESENTATION

To accurately manage the MaxEnt framework, the EBM of the expert is set to π_E(a_t|s_t) ∝ exp{−E(s_t, a_t)} = exp{Q(s_t, a_t)}, such that the likelihood of the distribution is proportional to the soft Q-function. The MaxEnt RL process can be interpreted as minimizing the expected KL divergence using the information projection (Haarnoja et al., 2017; 2018): J(π) = E_{s∼ρ_π}[D_KL(π(·|s) ‖ exp{Q(s,·)}/Z(s))]. In IRL, since the ground-truth reference is the expert policy, the general objective of imitation learning, π* = argmin_π E_π[D_KL(π(·|s) ‖ π_E(·|s))], is indeed a special type of MaxEnt RL objective. We describe a state-conditioned, energy-based reward as a representation:
• If γ ≈ 0, it formulates the "myopic" 1-step conditional KL divergence: D_KL[π(·|s) ‖ π_E(·|s)].
• If γ ≈ 1, assuming the dynamics are identical, it leads to the "far-sighted" cumulative densities: E[Σ_{t=0}^∞ D_KL[π(·|s_t) ‖ π_E(·|s_t)] | π] = D_KL[Pr(a_0, s_1, . . . | s_0 = s, π) ‖ Pr(a_0, s_1, . . . | s_0 = s, π_E)].
The discount factor γ thus controls how many subsequent steps we want to model. Therefore, energy-based rewards generalize learning probabilistic inference of conditional decisions. Note that the AIL reward functions curated in Table 1 usually do not retain these properties. For example, the AIRL reward function, namely f(s, a) = log(ρ_E(s, a)/ρ_π(s)), recovers the expert policy only in the trivial case γ = 0, as π_E(a|s) = exp(f(s, a)) / ∫_{a'} exp(f(s, a')) da'. For γ > 0, the projection generally is not π_E, and it is also difficult to analyze. We also highlight that standard MaxEnt RL with AIL rewards may not precisely recover the EBM. Further discussion is given in Appendix B.2.
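The role of γ as a dial between the myopic and far-sighted divergences can be checked numerically. The sketch below uses an illustrative single self-looping state, so the per-step conditional KL is constant and the discounted sum has a closed form; all numbers are made up for the example.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# One self-looping state: the per-step conditional KL between learner
# and expert is the same at every step.
pi_agent  = [0.7, 0.3]
pi_expert = [0.5, 0.5]
kl_1 = kl(pi_agent, pi_expert)

def discounted_kl(gamma, horizon=1000):
    """Truncated sum of gamma^t * KL_t; gamma -> 0 recovers the 1-step KL."""
    return sum(gamma**t * kl_1 for t in range(horizon))
```

At γ = 0 the sum collapses to the myopic 1-step divergence, while larger γ weights the future per-step divergences toward the trajectory-level KL.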

4.2. STATE-CONDITIONED REWARDS ON THE PRINCIPLE OF MAXIMUM ENTROPY

We clarify a valid candidate set of reward functions when entropy regularization is applied. One trivial solution is the log-likelihood of the expert, r*(s, a) = log π_E(a|s), under the condition V(s) = V(s') = 0. The 1-step expectation of r* then draws a conditional KL divergence:
E_{a∼π(·|s)}[r*(s, a)] + H(π(·|s)) = −D_KL(π(·|s) ‖ π_E(·|s)) ≤ 0.   (3)
By the property of the KL divergence, the above expression is at most 0, with equality if and only if π(a|s) = π_E(a|s), ∀a ∈ A. Since the optimal value function outputs zero, reward shaping (Ng et al., 1999) of r* is not required. As a result, the optimality of an action is determined instantaneously, without consideration of future rewards. This leaves the following remark.
Remark 4.1. For arbitrary discount rates, r*(s, a) = log π_E(a|s) is the optimally shaped reward function for learning efficiency under Shannon entropy regularization.
From this insight, the log-likelihood of the expert policy can be considered a desirable state-conditioned reward function for MaxEnt RL. However, directly modeling the log density (as BC methods do) is often impractical due to the limited number of samples. We can instead relax the likelihood estimation by using a state potential function Ψ (Ng et al., 1999):
R* = { r | r(s, a) + γΨ(s') − Ψ(s) = log π_E(a|s), Ψ : S → R, ∀(s, a, s') ∈ S × A × S }.   (4)
Akin to the deterministic case, we formalize the point that Eq. (4) does not change the learning objective.
Proposition 4.1. Let r be a function satisfying Eq. (4). Then the expected cumulative reward of π has the following property: E_π[r(s, a) + H(π(·|s))] = −E_π[D_KL(π(·|s) ‖ π_E(·|s))] + E_{s∼p_0}[Ψ(s)].
Note that the term E_{s∼p_0}[Ψ(s)] is independent of π. Compared to AIRL, any function r ∈ R* can be applied for training arbitrary policies, since the overlapping state densities log(ρ_E(s)/ρ_π(s)) are detached. Learning with r then provides an information projection, the closest estimation of Pr(a_0, . . . | s_0 = s, π_E) for the current state.
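Proposition 4.1 can be verified numerically on a toy problem. The sketch below uses a one-state, self-looping MDP with two actions and an arbitrary potential Ψ, so the occupancy-weighted expectations have closed forms; the numbers are purely illustrative.

```python
import math

# One-state, self-looping MDP: s' = s for every action.
gamma, psi = 0.9, 3.7          # discount factor and potential Psi(s)
pi  = [0.8, 0.2]               # learner policy over two actions
piE = [0.6, 0.4]               # expert policy

kl = sum(p * math.log(p / q) for p, q in zip(pi, piE))
entropy = -sum(p * math.log(p) for p in pi)

# A shaped reward from the family of Eq. (4):
# r(s, a) = log piE(a|s) - gamma * Psi(s') + Psi(s), with s' = s here.
r = [math.log(q) - gamma * psi + psi for q in piE]

per_step = sum(p * ri for p, ri in zip(pi, r)) + entropy
lhs = per_step / (1.0 - gamma)          # E_pi[ sum_t gamma^t (r + H) ]
rhs = -kl / (1.0 - gamma) + psi         # -E_pi[D_KL] + E_{p0}[Psi]
```

The two sides agree for any choice of `psi`, matching the claim that the shaping term only contributes the policy-independent constant E_{p_0}[Ψ(s)].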

4.3. HANDLING FINITE-HORIZON BIASES VIA KL REGULARIZATION ASSUMPTION

In order to mitigate the termination biases depicted in Fig. 1(a), Kostrikov et al. (2018) highlighted the concept of absorbing states and suggested adding synthetic transitions into the trajectories (see Fig. 1(b)), which explicitly promotes learning of termination frequencies. In contrast, we focus on another intriguing point: the absorbing state makes all episodes virtually have the properties of an infinite horizon. Consider that the expert represents MaxEnt behavior even when a self-looping state is encountered. Since the action selection can change neither the reward nor the transition state, the "expert" would represent completely random behavior, which is identical to the uniform distribution.

Figure 1: Visualization of entropy regularization methods and biases around terminal states: (a) entropy bonuses in a finite-horizon MDP; (b) entropy bonuses including an absorbing state; (c) entropy bonuses with a looping terminal state; (d) KL penalties with a looping terminal state. The red vertices and edges represent absorbing states and maximum-entropy action selection. Each regularization is defined as H_t = H(π(·|s_t)) and KL_t = −D_KL[π(·|s_t) ‖ p_unif(·)], respectively.

Under the KL penalty of Fig. 1(d), which does not require explicit manipulation of trajectories, the terminal state value is always zero instead of Σ_{k=0}^∞ γ^k H_max. In other words, merely switching the regularization scheme from Fig. 1(a) eliminates the termination bias. Consequently, the KL penalties can be seamlessly integrated into finite-horizon MDPs without any pre-processing around terminal states. With the same reasoning as in Sec. 4.2, the term log(π_E(a|s)/u) can be understood as the optimally shaped reward under KL regularization, where the constant u denotes the likelihood of p_unif (i.e., u ≜ p_unif(a), ∀a ∈ A).
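The contrast between Fig. 1(b) and Fig. 1(d) can be made concrete with a short calculation (illustrative numbers): under an entropy bonus, the absorbing state accumulates H_max forever, whereas the KL penalty against the uniform prior vanishes for the uniform policy the "expert" plays there.

```python
import math

gamma, n_actions = 0.99, 4
h_max = math.log(n_actions)     # entropy of the uniform policy

# Entropy bonus: the absorbing state keeps collecting H_max forever,
# sum_k gamma^k * H_max, which biases the agent's attitude to termination.
absorbing_value_entropy = h_max / (1.0 - gamma)

# KL penalty to the uniform prior: the uniform policy at the absorbing
# state gives -D_KL(unif || unif) = 0, so the absorbing value vanishes.
p_unif = [1.0 / n_actions] * n_actions
kl_term = sum(p * math.log(p / q) for p, q in zip(p_unif, p_unif))
absorbing_value_kl = -kl_term / (1.0 - gamma)
```

This is exactly the claim in the text: switching the regularizer, rather than rewriting the trajectories, zeroes out the terminal state value.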
To this end, we set the learning objective of the shaped reward function:
f_{θ,ψ}(s, a, s') = r_θ(s, a) + γ h_ψ̄(s') − h_ψ(s) = log(π_E(a|s)/u),   (5)
where r_θ and h_ψ denote reward and potential networks. h_ψ̄ denotes a target potential network with parameters identical to ψ, but its gradient computation is disconnected during training, analogous to constructing target value networks in the deep RL domain (Mnih et al., 2015). For every terminal state, we fix the outputs to the optimal solution: r_θ(s_T, a) = h_ψ(s_T) = 0, ∀a ∈ A.
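A minimal tabular sketch of the shaped-reward computation of Eq. (5), with the terminal-state clamping described above. The dict-based "networks" `r_theta` and `h_psi` and the state/action names are hypothetical stand-ins for the learned function approximators; the gradient-detachment of the target potential has no analogue in this plain-Python version.

```python
gamma = 0.99

# Hypothetical tabular stand-ins for the reward and potential networks.
r_theta = {("s0", "a0"): 0.1, ("s1", "a0"): -0.2}
h_psi   = {"s0": 0.5, "s1": 0.3, "sT": 0.0}   # h_psi(s_T) clamped to 0
terminal_states = {"sT"}

def shaped_reward(s, a, s_next):
    """f(s, a, s') = r(s, a) + gamma * h(s') - h(s); terminals output 0."""
    if s in terminal_states:
        return 0.0                  # r(s_T, a) = h(s_T) = 0 for all a
    h_next = 0.0 if s_next in terminal_states else h_psi[s_next]
    return r_theta[(s, a)] + gamma * h_next - h_psi[s]
```

The clamp implements the boundary condition from the text: once the self-looping terminal state is reached, both the reward and the potential are pinned to zero.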

5. CAUSAL ADVERSARIAL INVERSE REINFORCEMENT LEARNING

5.1. A DUAL DISCRIMINATOR ARCHITECTURE

The methodology is grounded on the finding that an AIL discriminator also contains an EBM, and on the property that an EBM on joint densities can be decomposed into multiple EBMs (Zhao et al., 2016; Che et al., 2020; Azadi et al., 2018). Suppose that the GAIL discriminator is nearly optimal:
D(s, a) = 1 / (1 + exp(−d(s, a))) ≈ ρ_E(s, a) / (ρ_E(s, a) + ρ_π(s, a)),
where d(s, a) denotes the logit of D(s, a). We disentangle the logit function into the following form:
d(s, a) ≈ log(ρ_E(s, a)/ρ_π(s, a)) = log(ρ_E(s)/ρ_π(s)) + log(π_E(a|s)/π(a|s)),
and then we have two log-ratios: one of state occupancy measures and one of policy distributions. The model substitutes the log-ratio of state occupancy measures with a state-only discriminator D_ϕ whose nearly optimal logit score is d_ϕ(s) ≈ log(ρ_E(s)/ρ_π(s)). The role of D_ϕ is to nullify the difference between state densities by pre-applying it. We propose the discriminator architecture
D_{θ,ψ}(s, a, s') = exp(f_{θ,ψ}(s, a, s')) / ( exp(f_{θ,ψ}(s, a, s')) + exp(−d_ϕ(s) + log(π_φ(a|s)/u)) ),   (7)
where d_ϕ(s) and π_φ(a|s) are pre-computed. Using D_ϕ and π_φ as scaffolds, the shaped reward function f_{θ,ψ} converges to Eq. (5) and becomes specialized in evaluating conditional decisions, so that learning with the reward function correctly projects onto the optimal policy. The discriminators D_ϕ and D_{θ,ψ} are trained to maximize the following objectives, respectively:
J(D_ϕ) = E_{π_E}[log D_ϕ(s)] + E_{π_φ}[log(1 − D_ϕ(s))],
J(D_{θ,ψ}) = E_{π_E}[log D_{θ,ψ}(s, a, s')] + E_{π_φ}[log(1 − D_{θ,ψ}(s, a, s'))] − λ E_{π_φ,π_E}[‖χ_ψ(s, s')‖²₂],
where χ_ψ(s, s') = γ h_ψ(s') − h_ψ(s) denotes the shaping function. To keep r_θ close to the case log(π_E(a|s)/u), we regularize the function by minimizing the squared L2-norm with λ ∈ R⁺. The regularization on the shaping function eventually makes it converge to zero, but it achieves more stable results than regularizing ‖h_ψ(s)‖²₂.
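The algebra behind Eq. (7) can be sanity-checked with scalar stand-ins. The sketch below (hypothetical values; `f_val`, `d_phi_state`, `log_pi`, and `log_u` are placeholders for f_{θ,ψ}, d_ϕ(s), log π_φ(a|s), and log u) shows that at the assumed optimum the logit of the dual discriminator reassembles the GAIL logit log(ρ_E(s, a)/ρ_π(s, a)) = f + d_ϕ − log(π/u).

```python
import math

def dual_discriminator(f_val, d_phi_state, log_pi, log_u):
    """Eq. (7): D = exp(f) / (exp(f) + exp(-d_phi(s) + log(pi(a|s)/u)))."""
    competitor = math.exp(-d_phi_state + log_pi - log_u)
    return math.exp(f_val) / (math.exp(f_val) + competitor)

def logit(d):
    """Recover the discriminator logit log D - log(1 - D)."""
    return math.log(d) - math.log(1.0 - d)
```

Because the state-only score d_ϕ(s) and the policy log-density are pre-applied inside the competitor term, all the remaining discriminative pressure on f_{θ,ψ} concerns the conditional decision log(π_E(a|s)/u), which is exactly the target of Eq. (5).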
The algorithm provides an IRL reward as α · r_θ(s, a); ideally, the performance is invariant to α if all the processes share the same temperature. In some benchmarks, the reward function is constrained by using the softmax activation function. Because terminal states receive the highest entropy bonuses, this constraint does not exploit awareness of terminal states; however, it has the practical advantage of preventing overly pessimistic rewards when the IRL model is not yet sufficiently trained. We defer additional implementation details to Appendix D.

5.2. ANALYSES ON THE CAUSAL AIRL ALGORITHM

Entropy-regularized IRL. We draw a connection between entropy-regularized policy gradient algorithms and our method as an adversarial training method for the policy network.
Proposition 5.1. If Eq. (5) is satisfied, the following equality holds:
∇_φ E_{π_φ}[D_KL(π_φ(·|s) ‖ π_E(·|s))] = −E_{π_φ}[Q^{π_φ}(s, a) ∇_φ log π_φ(a|s) + ∇_φ H(π_φ(·|s))],
where Q^π(s, a) ≜ r_θ(s, a) + E[Σ_{t=1}^∞ γ^t (r_θ(s_t, a_t) − log π_φ(a_t|s_t)) | s_0 = s, a_0 = a, π].
The proposition again shows the strong relationship between entropy-regularized RL and AIL. Unlike deterministic RL, the policy of entropy-regularized RL provably converges to a unique fixed point under regularized conditions (Geist et al., 2019; Yang et al., 2019). Thus, we deduce that the learning scheme of CAIRL leads the policy to converge to the fixed point of the expert.

Application to Transfer Learning. Suppose that the expert resides in another MDP: M_E = (S, A, P_E, r, p̂_0, γ). Extending our formulation, we can deduce that the function f_{θ,ψ}(s, a, s') converges to log( (P_E(s'|s, a)/P(s'|s, a)) · (π_E(a|s)/u) ), and the expectation of the function with the KL penalty gives
E_{a∼π(·|s), s'∼P(·|s,a)}[f_{θ,ψ}(s, a, s')] − D_KL(π(·|s) ‖ p_unif(·)) = −D_KL(p_{π_φ}(a, s'|s) ‖ p_E(a, s'|s)),
where p_{π_φ}(a, s'|s) and p_E(a, s'|s) denote the agent's and the expert's conditional probability distributions of action and transition state; learning with r_θ(s, a) promotes the identical effect. If the distributions p̂_0 and P_E are different, the optimal behavior has to adapt to the gap between domains. The energy-based reward function r_θ(s, a) is robust to dynamics misalignment, as it learns the conditional joint distribution of actions and transition states, namely D_KL[Pr(a_0, s_1, . . . | s_0 = s, π) ‖ Pr(a_0, s_1, . . . | s_0 = s, π_E)].

Objective of Potential Networks. The reward shaping can be viewed as a value iteration scheme. We relate Eq. (5) to an entropy-regularized operator called mellowmax, defined as mm_α(X) = log( (1/n) Σ_{i=1}^n exp(α x_i) ) / α, and show that r_θ and h_ψ satisfy the mellowmax Bellman optimality.
Proposition 5.2. If f_{θ,ψ}(s, a, s') = log(π_E(a|s)/u) for all states, actions, and transition states, then
h_ψ(s) = log( u · ∫_A exp( r_θ(s, a) + γ E_{s'|s,a}[h_ψ(s')] ) da ).
Thus h_ψ corresponds to the mellowmax optimal value function of r_θ with α = 1. According to the analyses of Asadi & Littman (2017), the standard softmax operator may lead value learning to multiple fixed points when γ ≈ 1. As mellowmax is a non-expansion operator, the use of KL penalties achieves relatively stable potential function optimization.
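The mellowmax operator and the non-expansion property invoked above are easy to check numerically. The sketch below implements mm_α directly from its definition and compares the operator's output gap against the max-norm gap of two arbitrary value vectors (all numbers illustrative).

```python
import math

def mellowmax(xs, alpha=1.0):
    """mm_alpha(X) = log( (1/n) * sum_i exp(alpha * x_i) ) / alpha."""
    n = len(xs)
    return math.log(sum(math.exp(alpha * x) for x in xs) / n) / alpha

# Non-expansion check: the operator never magnifies the max-norm
# distance between two value vectors.
xs, ys = [1.0, -0.5, 2.0], [0.7, 0.1, 1.5]
gap_in  = max(abs(x - y) for x, y in zip(xs, ys))
gap_out = abs(mellowmax(xs) - mellowmax(ys))
```

Note also that mellowmax of a constant vector returns that constant, unlike the log-sum-exp softmax, which would add a log n offset; this is the normalization by 1/n in the definition.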

6. EXPERIMENTAL RESULTS

Our experiments aim to understand CAIRL and verify the effectiveness of the algorithm with respect to our claims. We evaluate our approach on three topics whose settings are motivated by previous works (Haarnoja et al., 2017; 2018; Fu et al., 2017). For the RL algorithms, we implemented an algorithm based on OpenAI-PPO2 (Schulman et al., 2017) and use the KL regularization in order to eliminate the survival biases in MaxEnt RL, as addressed in Sec. 4.3.

Multi-Modal Behavior. The first experimental setting is a multi-goal environment. From the initial agent position, four goals are located in the four cardinal directions. While a deterministic policy commits to a single goal at the earliest attempt, we hypothesize that the optimal MaxEnt policy distribution represents multi-modal behavior, reaching all four goals at the same rate. We evaluated algorithms in two settings. In the 2D setting, the agent is a point mass. The ground-truth reward function is defined as the difference between Gaussian mixture model values of points: GMM(x_{t+1}) − GMM(x_t), where x_t is a 2D representation of the state. In the 3D setting, the agent is a simulated robot whose state is defined by its position and joint values. The detailed setting and expert of each environment are provided in Appendix C.1. For evaluating survival-bias handling, we implemented a similar task, called an asymmetric multi-goal environment, where the right-side goal is located substantially further away than the others. Table 2 displays an ablation study that reports episode time steps and the ratios at which an IRL agent reaches the right-side goal. The results show that CAIRL induces a more uniform multi-modal distribution than the AIRL variants and that the algorithm is robust to the survival bias. Notably, the CAIRL algorithm with both activation functions achieved high performance; the difference between the two cases was not significant. Fig. 3 shows that CAIRL outperforms AIRL in the quality of generated trajectories. CAIRL reached all four goals with the robot, which prominently shows that our algorithm can reconstruct rewards from multi-modal policies in complex control tasks.

Imitation Learning. The second experiment comprises imitation learning tasks in continuous domains from the Gym benchmark suite (Brockman et al., 2016). We evaluated our algorithm on five challenging tasks, including the Humanoid benchmark with 21 action dimensions.

Transfer Learning. The third experiment comprises transfer learning tasks. The setup is inspired by Fu et al. (2017), but the settings are quite different: we trained each IRL network in the target (test) environment. Put simply, our experiment aims to measure the flexibility of the reward learning process. The formulation was designed to show that knowledge transfer requires adaptation; we wanted to find out whether each algorithm provides helpful rewards without pretraining. It is also natural to assume that only a chunk of trajectories is available in a realistic problem. We implemented three transfer learning tasks called SlopeHopper, SlopeWalker2d, and CrippledAnt. In the SlopeHopper and SlopeWalker2d tasks, the agent has the same configuration as the original 3D models, but the ground is tilted by degrees in the range [1, 15]. In the CrippledAnt task, the robot has noticeably shorter forelegs, colored red in Fig. 5. We additionally restricted the joint angles of the forelegs by factors of {0.01, 0.25, 0.50, 0.75, 1.0} compared to the original model. Given the expert trajectories from the source environments, the results of the transfer learning tasks are shown in Fig. 5. For each task, we repeated all runs 5 times and report results averaged over scores from the last 0.5 million training steps. In the transfer learning setting, CAIRL outperformed the other algorithms considerably, achieving the highest performance in every experiment. The results imply that the algorithm extracted informative rewards for new environments with different dynamics, such that our reward acquisition method is robust to variation between tasks. The other algorithms failed considerably; this empirically validates our hypothesis that the cumulative sum of state occupancy measures hinders robust learning when domain shift happens. The experiments verify that, with proper consideration of temporal dependencies, AIL algorithms can be extended toward transfer learning problems.

7. CONCLUSION

In this paper, we have proposed a causal AIRL algorithm that recovers a robust reward function. We have provided theoretical analyses, including reward shaping in entropy-regularized MDPs and the connection between adversarial learning and energy-based RL. We have proposed a novel dual discriminator architecture, which learns a reward and a value function of regularized Bellman optimality equations. Our model can efficiently disentangle biases originating from state occupancy and terminal states. We have verified that the proposed IRL method has clear advantages over AIRL for learning multi-modal behaviors and handling termination biases. The proposed method recovers state-conditioned rewards, which gives it advantages over AIL algorithms in terms of the robustness of imitation performance in challenging continuous domains. Furthermore, the proposed method outperformed other methods in domain adaptation in the transfer learning experiments.

A PROOFS

Proposition 4.1. Let r be a function satisfying Eq. (4). Then the expected cumulative reward of π has the following property: E_π[r(s, a) + H(π(·|s))] = −E_π[D_KL(π(·|s) ‖ π_E(·|s))] + E_{s∼p_0}[Ψ(s)].

Proof. ρ_π(s) denotes the state-only occupancy measure of state s.
E_π[r(s, a) + H(π(·|s))]
= E[Σ_{t=0}^∞ γ^t (r(s_t, a_t) − log π(a_t|s_t))]
= E[Σ_{t=0}^∞ γ^t (log π_E(a_t|s_t) − log π(a_t|s_t) − γΨ(s_{t+1}) + Ψ(s_t))]
= E[Σ_{t=0}^∞ γ^t log(π_E(a_t|s_t)/π(a_t|s_t)) − Σ_{t=0}^∞ γ^{t+1} Ψ(s_{t+1}) + Σ_{t=0}^∞ γ^t Ψ(s_t)]
= E[Σ_{t=0}^∞ γ^t log(π_E(a_t|s_t)/π(a_t|s_t)) − Σ_{t=1}^∞ γ^t Ψ(s_t) + Σ_{t=0}^∞ γ^t Ψ(s_t)]
= E[Σ_{t=0}^∞ γ^t log(π_E(a_t|s_t)/π(a_t|s_t)) + Ψ(s_0)]
= ∫_S ∫_A ρ_π(s, a) log(π_E(a|s)/π(a|s)) da ds + ∫_S p_0(s) Ψ(s) ds
= ∫_S ρ_π(s) ∫_A π(a|s) log(π_E(a|s)/π(a|s)) da ds + E_{s∼p_0}[Ψ(s)]
= −∫_S ρ_π(s) D_KL(π(·|s) ‖ π_E(·|s)) ds + E_{s∼p_0}[Ψ(s)]
= −E_π[D_KL(π(·|s) ‖ π_E(·|s))] + E_{s∼p_0}[Ψ(s)].
Therefore, the reward function r(s, a) provides the same global objective as the likelihood log π_E(a|s). ∎

Proposition 5.1. If Eq. (5) is satisfied, the following equality holds:
∇_φ E_{π_φ}[D_KL(π_φ(·|s) ‖ π_E(·|s))] = −E_{π_φ}[Q^{π_φ}(s, a) ∇_φ log π_φ(a|s) + ∇_φ H(π_φ(·|s))],
where Q^π(s, a) ≜ r_θ(s, a) + E[Σ_{t=1}^∞ γ^t (r_θ(s_t, a_t) − log π_φ(a_t|s_t)) | s_0 = s, a_0 = a, π].

Proof. By the policy gradient theorem (Sutton & Barto, 2018), we can derive the gradient with respect to φ when the initial state is s:
−∇_φ E_{π_φ}[D_KL(π_φ(·|s) ‖ π_E(·|s)) | s_0 = s]
= ∇_φ (E_{π_φ}[r_θ(s, a) − log π_φ(a|s) | s_0 = s] + const)
= ∇_φ E[Σ_{t=0}^∞ γ^t (r_θ(s_t, a_t) − log π_φ(a_t|s_t)) | s_0 = s]
= ∇_φ V^{π_φ}(s),
where V^π(s) ≜ E[Σ_{t=0}^∞ γ^t (r_θ(s_t, a_t) − log π_φ(a_t|s_t)) | s_0 = s, π]. By the product rule, we get
∇_φ V^{π_φ}(s) = ∫_A [ ∇_φ π_φ(a|s)(Q^π(s, a) − log π_φ(a|s)) + π_φ(a|s) ∇_φ(Q^π(s, a) − log π_φ(a|s)) ] da
= ∫_A [ ∇_φ π_φ(a|s) Q^π(s, a) + π_φ(a|s) ∇_φ H(π_φ(·|s)) + π_φ(a|s) ∫_S γ P(s'|s, a) ∇_φ V^{π_φ}(s') ds' ] da.
By repeatedly unrolling ∇_φ V^π(s_t), we can derive the following form:
∇_φ V^π(s) = ∫_S Σ_{k=0}^∞ γ^k Pr(s → x, k, π_φ) ∫_A [ ∇_φ π_φ(a|x) Q^π(x, a) + π_φ(a|x) ∇_φ H(π_φ(·|x)) ] da dx,
where Pr(s → x, k, π) is the probability of transitioning from state s to state x in k steps under policy π.
Thus, using the log-derivative trick, we can derive the remaining equalities as follows:
∇_φ E_{π_φ}[D_KL(π_φ(·|s) ‖ π_E(·|s))]
= ∫_S p_0(s) ∇_φ E_{π_φ}[D_KL(π_φ(·|s) ‖ π_E(·|s)) | s_0 = s] ds
= −∫_S Σ_{t=0}^∞ γ^t Pr(s_t = s | π) ∫_A [ ∇_φ π_φ(a|s) Q^π(s, a) + π_φ(a|s) ∇_φ H(π_φ(·|s)) ] da ds
= −∫_S ρ_{π_φ}(s) ∫_A [ ∇_φ π_φ(a|s) Q^π(s, a) + π_φ(a|s) ∇_φ H(π_φ(·|s)) ] da ds
= −∫_S ρ_{π_φ}(s) ∫_A π_φ(a|s) [ Q^π(s, a) ∇_φ log π_φ(a|s) + ∇_φ H(π_φ(·|s)) ] da ds
= −E_{π_φ}[Q^π(s, a) ∇_φ log π_φ(a|s) + ∇_φ H(π_φ(·|s))]. ∎

Proposition 5.2. If f_{θ,ψ}(s, a, s') = log(π_E(a|s)/u) for all states, actions, and transition states, then
h_ψ(s) = log( u · ∫_A exp( r_θ(s, a) + γ E_{s'|s,a}[h_ψ(s')] ) da ).
Thus h_ψ corresponds to the optimal value function of the mellowmax regularizer with α = 1.

Proof.
r_θ(s, a) + γ h_ψ(s') − h_ψ(s) = log π_E(a|s) − log u, ∀s ∈ S
⟹ r_θ(s, a) + γ E_{s'|s,a}[h_ψ(s')] − h_ψ(s) = log π_E(a|s) − log u
⟹ u exp(r_θ(s, a) + γ E_{s'|s,a}[h_ψ(s')] − h_ψ(s)) = π_E(a|s)
⟹ u exp(r_θ(s, a) + γ E_{s'|s,a}[h_ψ(s')]) / exp(h_ψ(s)) = π_E(a|s)
⟹ ∫_A u · exp(r_θ(s, a) + γ E_{s'|s,a}[h_ψ(s')]) / exp(h_ψ(s)) da = ∫_A π_E(a|s) da = 1
⟹ h_ψ(s) = log( u · ∫_A exp( r_θ(s, a) + γ E_{s'∼P(·|s,a)}[h_ψ(s')] ) da ). ∎
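The algebra of Proposition 5.2 can be verified numerically on a small example. The sketch below uses an illustrative two-state, two-action MDP with deterministic transitions: it derives r_θ from the identity of Eq. (5) for an arbitrary potential h, then checks that h satisfies the stated mellowmax fixed-point equation.

```python
import math

# Illustrative two-state, two-action MDP with deterministic transitions.
gamma, u = 0.9, 0.5                    # u = 1/|A| for two actions
h   = {0: 0.4, 1: -1.2}               # arbitrary potential h_psi
piE = {0: [0.3, 0.7], 1: [0.9, 0.1]}  # expert policy per state
T   = {0: [1, 0], 1: [0, 0]}          # next state for each (s, a)

# Derive r_theta from f(s, a, s') = log piE(a|s) - log u  (Eq. 5).
r = {s: [math.log(piE[s][a]) - math.log(u) - gamma * h[T[s][a]] + h[s]
         for a in range(2)] for s in range(2)}

# Right-hand side of Proposition 5.2:
# log( u * sum_a exp(r(s, a) + gamma * h(s')) ), which should equal h(s).
mm = {s: math.log(u * sum(math.exp(r[s][a] + gamma * h[T[s][a]])
                          for a in range(2))) for s in range(2)}
```

Since the derivation only uses exponentiation and the normalization of π_E, the check passes for any choice of `h`, `piE`, and transitions, mirroring the proof step by step.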

B DISCUSSIONS WITH MAXENT IRL METHODS

In this section, we address similarities between the maximum causal entropy IRL framework and causal AIRL.

B.1 MAXIMUM CAUSAL ENTROPY FRAMEWORK

Following the notation of Kramer (1998), the causal entropy can be defined as
\[
H(A^T \,\|\, S^T) \triangleq \mathbb{E}_{A,S}\big[-\log P(A^T \,\|\, S^T)\big] = \sum_{t=1}^{T} H(A_t \mid S_{1:t}, A_{1:t-1}).
\]
The objective of maximum causal entropy IRL can be formulated as the following optimization (Ziebart et al., 2010):
\[
\begin{aligned}
\underset{P(A_t \mid S_{1:t}, A_{1:t-1})}{\arg\max}\;\; & H(A^T \,\|\, S^T) \\
\text{such that: }\; & \mathbb{E}_{P(S,A)}\big[\mathcal F(S,A)\big] = \mathbb{E}_{\tilde p(S,A)}\big[\mathcal F(S,A)\big] \\
& \textstyle\sum_{a_t} P(A_t \mid S_{1:t}, A_{1:t-1}) = 1 \quad \forall s_t,
\end{aligned}
\]
where the overall dynamics P(S^T ‖ A^{T−1}) are given. From the perspective of energy-based RL, we draw the following implications from the solution:
• p_θ(a_t|s_t) can be understood as the unique fixed point of the MaxEnt framework, which can be achieved by optimization via soft RL algorithms.
• r_θ(s_t, a_t) = θ^T F_{s_t,a_t} can be understood as a reward function that is subject to the linear constraints on the feature function.
• log β(s_t, a_t) and log β(s_t) can be understood as the (optimal) soft values satisfying the Bellman equation.
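To make the Markov-policy case concrete, the following sketch computes the causal entropy Σ_t E[H(A_t | S_t)] of a random stationary policy in a small tabular MDP; all sizes, the horizon, and the random distributions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, T = 4, 3, 5

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)    # P(A_t | S_t)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=-1, keepdims=True)  # dynamics
p0 = np.ones(nS) / nS                                             # initial states

# Causal entropy of a Markov policy: sum_t E_{S_t}[ H(A_t | S_t) ].
h_per_state = -(pi * np.log(pi)).sum(axis=1)
causal_entropy, d = 0.0, p0
for _ in range(T):
    causal_entropy += float(d @ h_per_state)
    d = np.einsum("s,sa,san->n", d, pi, P)  # state marginal at the next step

# Bounded by T * log|A|, attained by the uniform policy at every state.
assert 0.0 < causal_entropy <= T * np.log(nA) + 1e-9
```

Unlike the full joint entropy, each term conditions only on past states and actions, which is why the marginal d is propagated forward step by step.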

B.2 VANILLA AIRL REWARD FUNCTION DOES NOT FORMULATE AN EBM

Adversarial inverse reinforcement learning (AIRL) (Fu et al., 2017) is a well-known IRL method that applies an adversarial architecture to the IRL problem. Formally, AIRL constructs the discriminator as
\[
D(s,a) = \frac{\exp\{f(s,a)\}}{\exp\{f(s,a)\} + \pi(a|s)}.
\]
This is strongly motivated by the earlier GAN-GCL work (Finn et al., 2016b), which proposes that one can apply a GAN to train the discriminator as
\[
D(\tau) = \frac{\tfrac{1}{Z}\exp(c(\tau))}{\tfrac{1}{Z}\exp(c(\tau)) + \pi(\tau)},
\]
where c(τ) denotes the cost function over trajectories. Following the AIL formulation, AIRL uses a surrogate reward
\[
r(s,a) = \log D(s,a) - \log\big(1 - D(s,a)\big) = f(s,a) - \log\pi(a|s),
\]
where f(s,a) = log(ρ_E(s,a)/ρ_π(s)), so the overall reward can be seen as an entropy-regularized reward function. If γ = 0, AIRL is evidently identical to standard adversarial learning without temporal sequences. As a result, it makes sense to directly optimize the policy by taking the energy model as the target policy instead of the reward function, which leads to the optimal solution
\[
\pi^*(a|s) = \frac{\exp(Q^{\pi}(s,a))}{\int_{a'}\exp(Q^{\pi}(s,a'))} = \frac{\exp(f(s,a))}{\int_{a'}\exp(f(s,a'))} = \frac{\rho_{\pi_E}(s,a)}{\int_{a'}\rho_{\pi_E}(s,a')}, \quad \gamma = 0,
\]
where the last equality follows from the fundamental property of occupancy measures (Theorem 2 of Syed et al., 2008). However, in the general case, the projection of the Q-value function cannot be made close to π_E as in standard probabilistic inference with energy-based models.
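The algebraic identity behind the AIRL surrogate reward can be checked numerically; the logits and log-likelihoods below are synthetic stand-ins, not learned quantities.

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.standard_normal(100)                 # stand-in for f(s, a)
log_pi = -np.abs(rng.standard_normal(100))   # stand-in for log pi(a|s)

# AIRL discriminator D(s, a) = exp(f) / (exp(f) + pi(a|s)).
D = np.exp(f) / (np.exp(f) + np.exp(log_pi))
surrogate = np.log(D) - np.log(1.0 - D)

# The surrogate reward reduces exactly to f(s, a) - log pi(a|s).
assert np.allclose(surrogate, f - log_pi, atol=1e-8)
```

This makes explicit why the log π term acts as an entropy regularizer: it is subtracted from f inside the reward itself.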

B.3 CAUSAL ADVERSARIAL INVERSE REINFORCEMENT LEARNING FORMULATION

As a direct interpretation of the maximum causal entropy framework, the goal of CAIRL can be seen as training a parameterized distribution over the expert trajectories with the following maximum-likelihood objective:
\[
\max_\theta \mathcal J(\theta) = \max_\theta \mathbb{E}_{\pi_E}\big[\log p_\theta(a|s)\big],
\]
where the conditional distribution p_θ(a_t|s_t) is parameterized as p_θ(a_t|s_t) ∝ exp(Q_θ(s_t, a_t)). We can compute the gradient with respect to θ as follows:
\[
\frac{\partial}{\partial\theta}\mathcal J(\theta)
= \mathbb{E}_{\pi_E}\Big[\frac{\partial}{\partial\theta}\log p_\theta(a|s)\Big]
= \mathbb{E}_{\pi_E}\Big[\frac{\partial}{\partial\theta}Q_\theta(s,a) - \frac{\partial}{\partial\theta}\log Z_\theta(s)\Big]
= \mathbb{E}_{\pi_E}\Big[\frac{\partial}{\partial\theta}Q_\theta(s,a)\Big] - \mathbb{E}_{\pi_E}\mathbb{E}_{a'\sim p_\theta(\cdot|s)}\Big[\frac{\partial}{\partial\theta}Q_\theta(s,a')\Big], \tag{13}
\]
where Z_θ(s) is the partition function that normalizes the distribution p_θ. Since sampling with π_E and p_θ is difficult, one may consider substituting the formulation with the following adversarial learning form:
\[
\mathbb{E}_{\pi_E}\Big[\frac{\partial}{\partial\theta}r_\theta(s,a)\Big] - \mathbb{E}_{\pi}\Big[\frac{\partial}{\partial\theta}r_\theta(s,a)\Big]. \tag{14}
\]
Although the formulation in Eq. (14) resembles a standard adversarial framework with i.i.d. data, the approximation is not a safe choice because of unbounded divergences among π, p_θ, and π_E. Therefore, appropriate adversarial learning in sequential decision problems is essentially different from GANs for joint distribution matching. In CAIRL, we replace the reward learning objective with training a logistic discriminator
\[
D_\theta(s,a) = \frac{\exp[f_\theta(s,a)]}{\exp[f_\theta(s,a)] + \kappa(s,a)}, \quad \text{where } \kappa(s,a) = \frac{\rho_\pi(s,a)}{\rho_E(s)\cdot u}.
\]
The objective of the discriminator is to maximize the generalized Jensen–Shannon divergence between the generated samples:
\[
\begin{aligned}
\mathcal J(D_\theta) &= \mathbb{E}_{\pi_E}\big[\log D_\theta(s,a)\big] + \mathbb{E}_{\pi}\big[\log(1 - D_\theta(s,a))\big] \\
&= \mathbb{E}_{\pi_E}\Big[\log\frac{\exp[f_\theta(s,a)]}{\exp[f_\theta(s,a)] + \kappa(s,a)}\Big] + \mathbb{E}_{\pi}\Big[\log\frac{\kappa(s,a)}{\exp[f_\theta(s,a)] + \kappa(s,a)}\Big] \\
&= \mathbb{E}_{\pi_E}\big[f_\theta(s,a)\big] + \mathbb{E}_{\pi}\big[\log\kappa(s,a)\big] - 2\cdot\mathbb{E}_{\bar\pi}\big[\log\big(\exp[f_\theta(s,a)] + \kappa(s,a)\big)\big],
\end{aligned}
\]
where the operator E_{π̄} denotes the expectation under the mixture occupancy measure ρ̄_π(s,a) = (ρ_E(s,a) + ρ_π(s,a))/2.
Taking the derivative with respect to θ,
\[
\begin{aligned}
\frac{\partial}{\partial\theta}\mathcal J(D_\theta)
&= \mathbb{E}_{\pi_E}\Big[\frac{\partial}{\partial\theta}f_\theta(s,a)\Big] - 2\cdot\mathbb{E}_{\bar\pi}\Big[\frac{\exp[f_\theta(s,a)]}{\exp[f_\theta(s,a)] + \kappa(s,a)}\,\frac{\partial}{\partial\theta}f_\theta(s,a)\Big] \\
&= \mathbb{E}_{\pi_E}\Big[\frac{\partial}{\partial\theta}f_\theta(s,a)\Big] - \mathbb{E}_{\bar\pi}\Big[\frac{\rho_E(s)\,\hat p_\theta(a|s)}{\tfrac{1}{2}\rho_E(s)\,\hat p_\theta(a|s) + \tfrac{1}{2}\rho_\pi(s,a)}\,\frac{\partial}{\partial\theta}f_\theta(s,a)\Big] \\
&= \mathbb{E}_{\pi_E}\Big[\frac{\partial}{\partial\theta}f_\theta(s,a)\Big] - \mathbb{E}_{\bar\pi}\Big[\sum_{a'\in\mathcal A}\hat p_\theta(a'|s)\cdot\delta(s,a')\cdot\frac{\partial}{\partial\theta}f_\theta(s,a')\Big],
\end{aligned}
\]
where p̂_θ(a|s) = u · exp(f_θ(s,a)) and
\[
\delta(s,a) = \frac{\rho_E(s,a) + \rho_\pi(s,a)}{\rho_E(s)\,\hat p_\theta(a|s) + \rho_\pi(s,a)}.
\]
The optimal θ can easily be found by setting ∂J(D_θ)/∂θ = 0, which holds when p̂_θ(a|s) = π_E(a|s) with δ(s,a) = 1 for all states and actions. Moreover, for the right-hand side of the expression, we can draw the following properties of δ:
• If ρ_π(s,a) ≫ ρ_E(s,a), meaning the supports are disjoint, δ(s,a) ≈ 1.
• If ρ_π(s,a) ≈ ρ_E(s,a), where p̂_θ(a|s) is normally closer to π_E than π, δ(s,a) ≈ 1.
Therefore, the expression matches Eq. (13). CAIRL approximates the maximum causal entropy framework without exhaustively computing the recursive equation, and also recovers a plausible reward function on both supports of ρ_π and ρ_E.
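Two claims above lend themselves to a direct numeric check: the pointwise maximizer of the logistic objective satisfies exp(f*)/κ = ρ_E/ρ_π, and δ equals 1 both at the optimum and under disjoint supports. The occupancies below are synthetic stand-ins for ρ_E, ρ_π, and κ.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
p = rng.random(n); p /= p.sum()   # stand-in for expert occupancy rho_E(s, a)
q = rng.random(n); q /= q.sum()   # stand-in for policy occupancy rho_pi(s, a)
kappa = rng.random(n) + 0.1       # stand-in for kappa(s, a)

def objective(f):
    """Logistic objective J(D) with D = exp(f) / (exp(f) + kappa)."""
    D = np.exp(f) / (np.exp(f) + kappa)
    return np.sum(p * np.log(D) + q * np.log(1.0 - D))

# Pointwise maximizer: exp(f*) / kappa = p / q (the objective is concave in f),
# so random perturbations of f* can only decrease the objective.
f_star = np.log(kappa) + np.log(p / q)
for _ in range(100):
    assert objective(f_star + 0.1 * rng.standard_normal(n)) <= objective(f_star)

def delta(rho_E_sa, rho_pi_sa, rho_E_s, p_hat):
    return (rho_E_sa + rho_pi_sa) / (rho_E_s * p_hat + rho_pi_sa)

# Matched case with p_hat = pi_E: rho_E(s) * p_hat = rho_E(s, a), so delta = 1.
assert abs(delta(0.1, 0.1, 0.4, 0.25) - 1.0) < 1e-12
# Disjoint supports: the expert terms vanish and delta = 1 again.
assert abs(delta(0.0, 0.3, 0.0, 0.5) - 1.0) < 1e-12
```

The perturbation loop is a crude substitute for verifying the first-order condition analytically, but it confirms the closed-form maximizer.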

C EXPERIMENTAL DETAILS

C.1 MULTI-GOAL ENVIRONMENTS

• 2D Point Environment: Let the 2D coordinate denote the position of a point mass in the environment. The agent is spawned according to the normal distribution N(0, (0.1)²I). The four goals are located at (6, 0), (-6, 0), (0, 6), and (0, -6), and the agent can move at most 1 unit per timestep along each coordinate. The ground-truth reward is given by the difference between successive values of a Gaussian mixture, depicted in Fig. 6. The asymmetric multi-goal environment has similar settings, except that the scale is five times bigger and the east-side goal is located farther away, at (60, 0).
• 3D Ant Environment: Let the 2D coordinate representation denote the orthogonal projection of the position of the Ant robot's torso. The simulated robot is spawned near the origin. The four goals are located at (30, 0), (-30, 0), (0, 30), and (0, -30), and the agent has to control four legs to reach one of the goals. An expert requires approximately 150 timesteps to reach one of the goals from the initial position. A vector of the current position and the robot's joint values represents the state. Since it is hard to train a single expert model that precisely represents the desired multi-modal behavior, we evenly merged 2,000 trajectory samples from 8 uni-modal policies, each specialized in moving to one of the fixed positions.
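The ground-truth reward of the 2D point environment can be sketched as a potential difference over a Gaussian-mixture log-density centered at the four goals; the mixture bandwidth SIGMA is our own assumption, since the paper only shows the function in Fig. 6.

```python
import numpy as np

# Four goals at the cardinal directions, as in the 2D point environment.
GOALS = np.array([[6.0, 0.0], [-6.0, 0.0], [0.0, 6.0], [0.0, -6.0]])
SIGMA = 1.0  # assumed mixture bandwidth (not specified in the text)

def potential(s):
    """Log-density of an equal-weight Gaussian mixture centered at the goals."""
    sq = np.sum((GOALS - np.asarray(s)) ** 2, axis=1)
    return float(np.log(np.mean(np.exp(-sq / (2.0 * SIGMA ** 2)))))

def ground_truth_reward(s, s_next):
    """Reward = difference between successive potential values."""
    return potential(s_next) - potential(s)

# Moving toward a goal should yield a positive shaped reward.
assert ground_truth_reward([3.0, 0.0], [4.0, 0.0]) > 0.0
```

Because the reward is a potential difference, it leaves the optimal policy unchanged while densifying the learning signal, which is the usual motivation for this construction.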

C.2 TRANSFER LEARNING ENVIRONMENTS

The experiment setting was designed to show that imitation learning can handle realistic tasks regardless of domain shifts. We trained each IRL network in the target (test) environment. Simply put, our experiment aims to measure the flexibility of the reward learning process. We used 1,000 trajectories from the source task as transfer learning data.
• SlopeHopper & SlopeWalker2d: We used the same 3D models as the Hopper-v3 and Walker2d-v3 MuJoCo benchmarks. For each task, the ground is tilted by an angle in [1, 15] degrees. For computing the state vector, the model's height is adjusted to the vertical distance from the slope.
• CrippledAnt: Compared to the original Ant model, we shortened the length of two forelegs by half. We additionally restricted the joint angles of the forward legs to the range {.01, .25, .50, .75, 1.0}. The objective of the agent is to run to the right side, and the reward is calculated from the velocity of the robot along the x-axis.

D IMPLEMENTATION DETAILS

D.1 ALGORITHMS

Algorithm 1 summarizes the overall IRL procedure: at each step, the critic functions are updated first, and then the policy π_φ is updated with the reward α · r_θ(s, a) using a maximum entropy policy optimization method; the procedure returns π_φ and r_θ. The term "REG" in Line 7 of the algorithm refers to the shaping regularization minimizing the squared L2-norm \(\|\gamma h_\psi(s') - h_\psi(s)\|_2^2\) with a regularization parameter λ ∈ ℝ₊, in order to make the overall algorithm well defined. For stochastic action distributions, the temperature parameter α has to be multiplied in for exact matching between conditional distributions. To ensure Lipschitz continuity, every critic network is regularized by a gradient penalty (Zhou et al., 2019). In the clipped surrogate objective \(\mathcal L^{\mathrm{CLIP}}_{\pi_{\mathrm{old}}}(\phi)\) below, ε is the clipping range of the PPO algorithm, ensuring that the updated policy does not diverge too far from the previous distribution. One thing to be careful about in the implementation is that the KL penalty terms depend on the policy distribution. Therefore, for a PPO algorithm with the KL regularization, the term \(A^{\pi_{\mathrm{old}}}_\alpha(s,a)\) can be computed as follows:
\[
\begin{aligned}
V^{\pi_{\mathrm{old}}}(s) &= \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r_\theta(s_t,a_t) - \alpha\log(\pi_\phi(a_t|s_t)/u)\big) \,\Big|\, s_0 = s, \pi_{\mathrm{old}}\Big], \\
Q^{\pi_{\mathrm{old}}}(s,a) &= r_\theta(s,a) + \mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t\big(r_\theta(s_t,a_t) - \alpha\log(\pi_\phi(a_t|s_t)/u)\big) \,\Big|\, s_0 = s, a_0 = a, \pi_{\mathrm{old}}\Big], \qquad (17) \\
A^{\pi_{\mathrm{old}}}(s,a) &= Q^{\pi_{\mathrm{old}}}(s,a) - V^{\pi_{\mathrm{old}}}(s), \qquad \text{(advantage estimate)} \qquad (18) \\
A^{\pi_{\mathrm{old}}}_\alpha(s,a) &= A^{\pi_{\mathrm{old}}}(s,a) - \alpha.
\end{aligned}
\]
We refer to the work of Shi et al. (2019) for the detailed derivation of entropy-regularized policy gradient algorithms.
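The entropy-regularized discounted return appearing in the V and Q definitions above can be sketched as a plain Monte-Carlo computation over a single trajectory; the trajectory values and log u = 0 (a normalized reference density) are illustrative assumptions.

```python
import numpy as np

def entropy_regularized_return(rewards, log_pi, gamma=0.99, alpha=1.0, log_u=0.0):
    """Discounted return with the entropy correction -alpha * log(pi(a|s)/u),
    matching the summands inside the V and Q definitions above."""
    aug = np.asarray(rewards) - alpha * (np.asarray(log_pi) - log_u)
    disc = gamma ** np.arange(len(aug))
    return float(np.sum(disc * aug))

# A deterministic policy (log_pi = 0) incurs no entropy correction:
# return = 1.0 + 0.5 * 1.0 = 1.5 for gamma = 0.5.
ret = entropy_regularized_return([1.0, 1.0], [0.0, 0.0], gamma=0.5, alpha=1.0)
assert abs(ret - 1.5) < 1e-12
```

In practice these returns would be bootstrapped with a value network rather than summed to the horizon, but the correction term enters in exactly the same way.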

D.2 NETWORK ARCHITECTURES AND HYPERPARAMETERS

For all policy and discriminator networks, we use a 2-layer MLP with 256-dimensional layers and ReLU activations. For the input layer, we normalize the inputs (state and action vectors) using exponential moving averages of the mean and variance. Details for Policy Networks: For multi-goal environments, the policy is represented by a Gaussian mixture with four modes, whereas a single diagonal Gaussian is used for the other uni-modal tasks. To enforce action bounds, we apply an invertible squashing function (tanh) to the raw action samples and compute the likelihoods of the bounded actions, i.e., tanh(Normal(µ, σ)) (Haarnoja et al., 2018). In our implementation, we separated the distribution network into a mean network and a standard deviation network. The gradient penalty regularization is applied to the value network.
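The tanh-squashed Gaussian likelihood mentioned above follows from the change-of-variables formula; the sketch below verifies that the resulting density integrates to one over the bounded action interval. The particular µ, σ, and grid resolution are illustrative choices.

```python
import numpy as np

def tanh_gaussian_logp(mu, sigma, raw_action):
    """Log-likelihood of a = tanh(x) with x ~ Normal(mu, sigma)."""
    # Gaussian log-density of the pre-squash sample x.
    logp = (-0.5 * ((raw_action - mu) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    # Jacobian correction: log |d tanh(x)/dx| = log(1 - tanh(x)^2).
    logp -= np.log(1.0 - np.tanh(raw_action) ** 2 + 1e-8)
    return logp

# Sanity check: the squashed density integrates to ~1 over a in (-1, 1).
a = np.linspace(-0.999, 0.999, 20001)
x = np.arctanh(a)  # pre-squash samples corresponding to the action grid
dens = np.exp(tanh_gaussian_logp(0.3, 0.5, x))
integral = float(np.sum(dens) * (a[1] - a[0]))
assert abs(integral - 1.0) < 1e-2
```

The small constant inside the logarithm guards against numerical underflow near the action bounds, a common trick in squashed-Gaussian implementations.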



For the rest of the paper, we occasionally omit the temperature parameter with α = 1 for simplicity of derivation.



Fig. 1(c) and (d) demonstrate two possible simpler ways to turn finite-length episodes into infinite-horizon ones by assuming self-looping terminal states. The two instances handle the biases alike, since H_t = KL_t + const, but the KL regularization has a slight advantage around terminal states.


Figure 2: IRL trajectories of 2D Multi-goal environments.

Figure 4: Training curves of stochastic policies on imitation learning benchmarks.

Figure 5: Illustrations of transfer learning and total distance traveled by transfer learning agents.


Figure 6: Illustrations of multi-goal environments. From the initial agent position, four goals are symmetrically located at the four cardinal directions. The ground-truth energy function of the 2D environment and the expert trajectory samples for each environment are displayed.

For policy optimization in the experiments, we used a PPO implementation that trains the policy with the following clipped surrogate objective:
\[
\mathcal L^{\mathrm{CLIP}}_{\pi_{\mathrm{old}}}(\phi) = \mathbb{E}_{\pi_{\mathrm{old}}}\bigg[\min\Big(\frac{\pi_\phi(a|s)}{\pi_{\mathrm{old}}(a|s)}\,A^{\pi_{\mathrm{old}}}_\alpha(s,a),\;\mathrm{clip}\Big(\frac{\pi_\phi(a|s)}{\pi_{\mathrm{old}}(a|s)},\,1-\varepsilon,\,1+\varepsilon\Big)\,A^{\pi_{\mathrm{old}}}_\alpha(s,a)\Big)\bigg]
\]
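The clipped surrogate with the entropy-adjusted advantage A_α = A − α can be sketched directly; the sample log-probabilities and advantages below are illustrative, not from any experiment.

```python
import numpy as np

def ppo_clip_objective(log_pi_new, log_pi_old, adv, alpha=1.0, eps=0.2):
    """Clipped surrogate objective using the adjusted advantage A_alpha = A - alpha."""
    adv_alpha = np.asarray(adv) - alpha
    ratio = np.exp(np.asarray(log_pi_new) - np.asarray(log_pi_old))
    unclipped = ratio * adv_alpha
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv_alpha
    return float(np.mean(np.minimum(unclipped, clipped)))

# With identical policies the ratio is 1, so both branches agree and the
# objective is just the mean adjusted advantage.
logp = np.log(np.array([0.2, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 2.0])
assert np.isclose(ppo_clip_objective(logp, logp, adv, alpha=1.0),
                  np.mean(adv - 1.0))
```

The pessimistic min over the two branches is what prevents the ratio from drifting outside [1 − ε, 1 + ε] in the direction that would inflate the objective.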

Statistics of CAIRL and AIRL in the asymmetric multi-goal environment.


Algorithm 1 Causal adversarial inverse reinforcement learning
1: Input: expert trajectory dataset τ_E
2: Initialize policy π_φ and critic functions D_ϕ and D_{θ,ψ}
3: for step i in {1, . . . , N} do
4:   Collect τ_π by executing π_φ
5:   {(s_t, a_t, ·, s'_t)}_{t=1}^T = τ_π,  {(s̄_t, ā_t, ·, s̄'_t)}_{t=1}^T = τ_E
6:   Update D_ϕ by maximizing log(1 − D_ϕ(s_t)) + log D_ϕ(s̄_t) with GP
7:   Update D_{θ,ψ} by maximizing log(1 − D_{θ,ψ}(s_t, a_t, s'_t)) + log D_{θ,ψ}(s̄_t, ā_t, s̄'_t) with REG+GP
8:   Update π_φ with α · r_θ(s, a) using a maximum entropy policy optimization method
9: return π_φ, r_θ


Details for CAIRL Networks: For CAIRL networks, the reward network uses a softplus activation at the final output layer, while the potential network has a linear activation. The gradient penalty is also used in both networks. We do not explicitly construct a separate target potential network; instead, we reuse the same potential network, but the target does not involve any gradient computation. Also, to discard the learning of terminal-state values, the target output h̄_ψ(s) is computed as: h̄_ψ(s) = 0 if s is a terminal state not caused by time limits, and no_grad(h_ψ(s)) otherwise. Table 3 shows the hyperparameters of the conducted experiments. The results suggest that a sufficiently low gradient penalty is preferred for achieving the desired performance. Compared to the other hyperparameters, CAIRL is robust to the temperature α.
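The terminal-state masking of the target potential can be sketched with boolean flags; this is a minimal sketch of the masking described above, where the zero target value for true terminal states and the use of an array copy as a stand-in for a stop-gradient are our assumptions.

```python
import numpy as np

def target_potential(h_values, terminal, timeout):
    """Target potential values for a batch of states.

    True terminal states (terminal AND NOT timeout) get a fixed target of 0,
    so no terminal-state value is learned; all other entries are treated as
    constants (the copy stands in for a stop-gradient in an autodiff framework).
    """
    h = np.array(h_values, dtype=float)
    h[np.logical_and(terminal, np.logical_not(timeout))] = 0.0
    return h

h = np.array([0.7, -1.2, 0.4])
terminal = np.array([True, True, False])   # second entry terminated via time limit
timeout = np.array([False, True, False])
out = target_potential(h, terminal, timeout)
assert out[0] == 0.0 and out[1] == -1.2 and out[2] == 0.4
```

Distinguishing time-limit truncation from genuine termination is what avoids the termination biases discussed in the main text.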

E.4 VISUALIZATION OF CRIPPLEDANT LOCOMOTION

We provide a visualization of generated samples from an agent trained by the CAIRL algorithm in Fig. 9. Even though the dynamics of the environment are considerably misaligned, the algorithm can still teach the agent to move in the desired direction. We believe this work also provides an effective transfer learning approach for sequential decision problems. Figure 9: Visualization of the trained agent in the CrippledAnt environment. The model maximizes the movement speed toward the right side, without showing biased locomotion such as drifting forward or backward. Also, the orientation of the torso is preserved throughout the episode, implying that the algorithm is capable of imitating experts even under variation in the dynamics.

E.5 VISUALIZATION OF AIL REWARD FUNCTIONS

We provide visualizations of the CAIRL and AIRL rewards in the 2D multi-goal environment. To effectively represent the two functions that map arguments (s, a) ∈ S × A into the space S, we calculated a two-dimensional vector representation of the reward for each state by averaging over all possible actions in A, i.e., v_i = (1/|A|) Σ_{a∈A} [r(s, a) · a_i], i ∈ {0, 1}. As a result, we plotted contour maps of the relative reward values and the corresponding vector fields of the reward function over a state grid, shown in Figs. 10 and 11. It can be observed that the CAIRL reward function is much more analogous to the MaxEnt likelihood depicted by Haarnoja et al. (2017). More importantly, CAIRL prominently shows that our reward modeling provides a more informative reward that approximately advises the best direction to reach one of the goals from each state.
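The action-averaged vector representation v_i = (1/|A|) Σ_a r(s, a) · a_i can be sketched as follows; the discrete four-action set and the toy reward function are illustrative assumptions standing in for the learned rewards being visualized.

```python
import numpy as np

# A small discrete action set (unit moves) standing in for A.
ACTIONS = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

def reward_vector(state, reward_fn):
    """v_i = (1/|A|) * sum_a r(s, a) * a_i: the action-averaged reward direction."""
    rs = np.array([reward_fn(state, a) for a in ACTIONS])
    return rs @ ACTIONS / len(ACTIONS)

# Toy reward that prefers actions aligned with the direction to a goal at (6, 0).
def toy_reward(s, a):
    goal = np.array([6.0, 0.0])
    return float(a @ (goal - s))

v = reward_vector(np.array([0.0, 0.0]), toy_reward)
assert v[0] > 0.0 and abs(v[1]) < 1e-12  # the field points toward the goal
```

Evaluating `reward_vector` over a grid of states yields exactly the kind of vector field plotted in Figs. 10 and 11.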

