REGULARIZED INVERSE REINFORCEMENT LEARNING

Abstract

Inverse Reinforcement Learning (IRL) aims to facilitate a learner's ability to imitate expert behavior by acquiring reward functions that explain the expert's decisions. Regularized IRL applies strongly convex regularizers to the learner's policy in order to avoid the expert's behavior being rationalized by arbitrary constant rewards, also known as degenerate solutions. Existing methods are restricted to the maximum-entropy IRL framework, limiting them to Shannon-entropy regularizers, and the solutions they propose are intractable in practice. In contrast, we propose tractable solutions, and practical methods to obtain them, for regularized IRL. We present theoretical backing for our proposed IRL method's applicability to both discrete and continuous controls, and we empirically validate its performance on a variety of tasks.

1. INTRODUCTION

Reinforcement learning (RL) has been successfully applied to many challenging domains, including games (Mnih et al., 2015; 2016) and robot control (Schulman et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). Advanced RL methods often employ policy regularization motivated by, e.g., boosting exploration (Haarnoja et al., 2018) or safe policy improvement (Schulman et al., 2015). While Shannon entropy is often used as a policy regularizer (Ziebart et al., 2008), Geist et al. (2019) recently proposed a theoretical foundation of regularized Markov decision processes (MDPs)-a framework that uses strongly convex functions as policy regularizers. Here, one crucial advantage is that an optimal policy is shown to uniquely exist, whereas multiple optimal policies may exist in the absence of policy regularization. Meanwhile, since RL requires a given or known reward function (which can often involve non-trivial reward engineering), Inverse Reinforcement Learning (IRL) (Russell, 1998; Ng et al., 2000)-the problem of acquiring a reward function that promotes expert-like behavior-is more generally adopted in practical scenarios like robotic manipulation (Finn et al., 2016b), autonomous driving (Sharifzadeh et al., 2016; Wu et al., 2020) and clinical motion analysis (Li et al., 2018). In these scenarios, defining a reward function beforehand is particularly challenging, and IRL is simply more pragmatic. However, IRL in unregularized MDPs suffers from the issue of degeneracy, where any constant function can rationalize the expert's behavior (Ng et al., 2000). Fortunately, Geist et al. (2019) show that IRL in regularized MDPs-regularized IRL-does not admit such degenerate solutions due to the uniqueness of the optimal policy in regularized MDPs.
Despite this, no tractable solutions of regularized IRL-other than maximum-Shannon-entropy IRL (MaxEntIRL) (Ziebart et al., 2008; Ziebart, 2010; Ho & Ermon, 2016; Finn et al., 2016a; Fu et al., 2018)-have been proposed. Solutions for regularized IRL were introduced in Geist et al. (2019). However, they are generally intractable since they require a closed-form relation between the policy and the optimal value function, as well as knowledge of the model dynamics. Furthermore, practical algorithms for solving regularized IRL problems have not yet been proposed. We summarize our contributions as follows. Unlike the solutions in Geist et al. (2019), we propose tractable solutions for regularized IRL problems that can be derived from the policy regularizer and its gradient in discrete control problems (Section 3.1). We additionally show that our solutions are tractable for Tsallis entropy regularization with multi-variate Gaussian policies in continuous control problems (Section 3.2). We devise Regularized Adversarial Inverse Reinforcement Learning (RAIRL), a practical sample-based method for policy imitation and reward learning in regularized MDPs, which generalizes adversarial IRL (AIRL; Fu et al. (2018)) (Section 4). Finally, we empirically validate RAIRL on both discrete and continuous control tasks, evaluating it via episodic scores and from a divergence-minimization perspective (Ke et al., 2019; Ghasemipour et al., 2019; Dadashi et al., 2020) (Section 5).

2. PRELIMINARIES

Notation. For finite sets X and Y, Y^X is the set of functions from X to Y. ∆_X (∆_X^Y) is the set of (conditional) probabilities over X (conditioned on Y). In particular, for a conditional probability p_{X|Y} ∈ ∆_X^Y, we write p_{X|Y}(·|y) ∈ ∆_X for y ∈ Y. R is the set of real numbers. For functions f_1, f_2 ∈ R^X, we define ⟨f_1, f_2⟩_X := Σ_{x∈X} f_1(x) f_2(x).

Regularized Markov Decision Processes and Reinforcement Learning. We consider sequential decision-making problems where an agent sequentially chooses its action after observing the state of the environment, and the environment in turn emits a reward with a state transition. Such an interaction between the agent and the environment is modeled as an infinite-horizon Markov Decision Process (MDP) M_r := ⟨S, A, P_0, P, r, γ⟩ together with the agent's policy π ∈ ∆_A^S. The terms of the MDP are defined as follows: S is a finite state space, A is a finite action space, P_0 ∈ ∆_S is an initial state distribution, P ∈ ∆_S^{S×A} is a state transition probability, r ∈ R^{S×A} is a reward function, and γ ∈ [0, 1) is the discount factor. We also define an MDP without reward as M_- := ⟨S, A, P_0, P, γ⟩. The normalized state-action visitation distribution d_π ∈ ∆_{S×A} associated with π is defined as the expected discounted state-action visitation of π, i.e., d_π(s, a) := (1 - γ) · E_π[Σ_{i=0}^∞ γ^i I{s_i = s, a_i = a}], where the subscript π on E means that a trajectory (s_0, a_0, s_1, a_1, ...) is randomly generated from M_- and π, and I{·} is an indicator function. Note that d_π satisfies the transposed Bellman recurrence (Boularias & Chaib-Draa, 2010; Zhang et al., 2019): d_π(s, a) = (1 - γ)P_0(s)π(a|s) + γπ(a|s) Σ_{s̄,ā} P(s|s̄, ā) d_π(s̄, ā). We consider RL in regularized MDPs (Geist et al., 2019), where the policy is optimized with a causal policy regularizer.
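As an illustrative sketch (not part of the paper), the transposed Bellman recurrence above can be checked numerically on a small made-up MDP: solving for d_π by fixed-point iteration and verifying it is a probability distribution satisfying the recurrence. The 2-state/2-action MDP below is a hypothetical example.

```python
# Numerically check the transposed Bellman recurrence for the normalized
# state-action visitation distribution d_pi on a toy 2-state, 2-action MDP.
import itertools

gamma = 0.9
S, A = 2, 2
P0 = [0.7, 0.3]                                # initial state distribution
# P[s][a][s2] = transition probability to s2 from (s, a)
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.6, 0.4], [0.2, 0.8]]                  # pi[s][a]

# Fixed-point iteration of:
# d(s,a) = (1-gamma) P0(s) pi(a|s) + gamma pi(a|s) sum_{s2,a2} P(s|s2,a2) d(s2,a2)
d = [[0.25, 0.25], [0.25, 0.25]]
for _ in range(500):
    d = [[(1 - gamma) * P0[s] * pi[s][a]
          + gamma * pi[s][a] * sum(P[s2][a2][s] * d[s2][a2]
                                   for s2, a2 in itertools.product(range(S), range(A)))
          for a in range(A)] for s in range(S)]

total = sum(d[s][a] for s in range(S) for a in range(A))
print(round(total, 6))  # d_pi sums to 1.0, as a probability distribution must
```

Since γ < 1, the iteration is a contraction and converges to the unique visitation distribution.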
Mathematically, for an MDP M_r and a strongly convex function Ω : ∆_A → R, the objective in regularized MDPs is to seek π that maximizes the expected discounted sum of rewards (return, in short) with policy regularizer Ω:

arg max_{π∈∆_A^S} J_Ω(r, π) := E_π[Σ_{i=0}^∞ γ^i {r(s_i, a_i) - Ω(π(·|s_i))}] = (1/(1-γ)) E_{(s,a)~d_π}[r(s, a) - Ω(π(·|s))].   (1)

It turns out that the optimal solution of Eq. (1) is unique (Geist et al., 2019), whereas multiple optimal policies may exist in unregularized MDPs (see Appendix A for a detailed explanation). In later work (Yang et al., 2019), Ω(p) = -λE_{a~p}[φ(p(a))], p ∈ ∆_A, was considered for λ > 0 and φ : (0, 1] → R satisfying some mild conditions. For example, RL with Shannon entropy regularization (Haarnoja et al., 2018) is recovered by φ(x) = -log x, while RL with Tsallis entropy regularization (Lee et al., 2020) is recovered by φ(x) = (k/(q-1))(1 - x^{q-1}) for k > 0, q > 1. The optimal policy π* for Eq. (1) with Ω from Yang et al. (2019) is shown to be

π*(a|s) = max{g_φ((μ*(s) - Q*(s, a))/λ), 0},
Q*(s, a) = r(s, a) + γE_{s'~P(·|s,a)}[V*(s')],
V*(s) = μ*(s) - λ Σ_{a∈A} π*(a|s)² φ'(π*(a|s)),

where φ'(x) = ∂φ(x)/∂x, g_φ is the inverse function of f'_φ for f_φ(x) := xφ(x), x ∈ (0, 1], and μ* is a normalization term such that Σ_{a∈A} π*(a|s) = 1. Note that we still need to find μ* to acquire a closed-form relation between the optimal policy π* and the value function Q*. However, to the best of our knowledge, such relations have not been discovered except for Shannon-entropy regularization (Haarnoja et al., 2018) and specific instances (q = 1, 2, ∞) of Tsallis-entropy regularization (Lee et al., 2019).

Inverse Reinforcement Learning. Given a set of demonstrations from an expert policy π_E, IRL (Russell, 1998; Ng et al., 2000) is the problem of seeking a reward function from which we can recover π_E through RL.
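The regularizer family Ω(p) = -λE_{a~p}[φ(p(a))] is easy to evaluate for discrete policies. The sketch below (illustrative, not from the paper; the distribution is made up) shows that the Shannon and Tsallis choices of φ recover the corresponding negative entropies.

```python
# Sketch of the regularizer family Omega(p) = -lam * E_{a~p}[phi(p(a))]
# from Yang et al. (2019), evaluated for a discrete policy p.
import math

def omega(p, phi, lam=1.0):
    """Policy regularizer Omega(p) = -lam * sum_a p(a) * phi(p(a))."""
    return -lam * sum(pa * phi(pa) for pa in p if pa > 0)

phi_shannon = lambda x: -math.log(x)                 # recovers Shannon entropy
def phi_tsallis(k, q):                               # recovers Tsallis entropy
    return lambda x: k / (q - 1) * (1 - x ** (q - 1))

p = [0.1, 0.2, 0.3, 0.4]
# With phi(x) = -log x, Omega(p) is the negative Shannon entropy of p.
print(omega(p, phi_shannon))
# With phi(x) = k/(q-1)*(1 - x^(q-1)) and k=1, q=2, Omega(p) = sum_a p(a)^2 - 1.
print(omega(p, phi_tsallis(1.0, 2.0)))
```

For k = 1, q = 2 the Tsallis entropy reduces to 1 - Σ_a p(a)², which is why the second value equals -0.7 for this distribution.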
However, IRL in unregularized MDPs has been shown to be an ill-defined problem since (1) any constant reward function can rationalize every expert and (2) multiple rewards meet the criteria of being a solution (Ng et al., 2000). Maximum entropy IRL (MaxEntIRL) (Ziebart et al., 2008; Ziebart, 2010) solves the first issue by seeking a reward function that maximizes the expert's return along with the Shannon entropy of the expert policy. Mathematically, for the RL objective J_Ω in Eq. (1) and Ω = -H with negative Shannon entropy, H(p) = E_{a~p}[-log p(a)] (Ho & Ermon, 2016), the objective of MaxEntIRL is

MaxEntIRL(π_E) := arg max_{r∈R^{S×A}} [J_{-H}(r, π_E) - max_{π∈∆_A^S} J_{-H}(r, π)].   (4)

Another commonly used IRL method is Adversarial Inverse Reinforcement Learning (AIRL) (Fu et al., 2018), which involves generative adversarial training (Goodfellow et al., 2014; Ho & Ermon, 2016) to acquire a solution of MaxEntIRL. AIRL considers the structured discriminator (Finn et al., 2016a)

D(s, a) = σ(r(s, a) - log π(a|s)) = e^{r(s,a)} / (e^{r(s,a)} + π(a|s))   (5)

for σ(x) := 1/(1 + e^{-x}), and iteratively optimizes the following objectives:

max_{r∈R^{S×A}} E_{(s,a)~d_{π_E}}[log D_{r,π}(s, a)] + E_{(s,a)~d_π}[log(1 - D_{r,π}(s, a))],
max_{π∈∆_A^S} E_{(s,a)~d_π}[log D_{r,π}(s, a) - log(1 - D_{r,π}(s, a))] = max_{π∈∆_A^S} E_{(s,a)~d_π}[r(s, a) - log π(a|s)].
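The two forms of the structured discriminator in Eq. (5) are algebraically identical, since σ(r - log π) = 1/(1 + π e^{-r}) = e^r/(e^r + π). A small numeric sketch (values illustrative):

```python
# Check the two equivalent forms of AIRL's structured discriminator:
# D = sigma(r - log pi) and D = e^r / (e^r + pi).
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def D_logit_form(r, pi_a):
    return sigma(r - math.log(pi_a))

def D_ratio_form(r, pi_a):
    return math.exp(r) / (math.exp(r) + pi_a)

r, pi_a = -1.3, 0.25
print(D_logit_form(r, pi_a), D_ratio_form(r, pi_a))  # identical values
```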

3. INVERSE REINFORCEMENT LEARNING IN REGULARIZED MDPS

In this section, we propose solutions of IRL in regularized MDPs-regularized IRL-and present relevant properties in Section 3.1. We then discuss a specific instance of our proposed solution in Section 3.2, where Tsallis entropy regularizers and multi-variate Gaussian policies are used in continuous action spaces.

3.1. SOLUTIONS OF REGULARIZED IRL

We consider regularized IRL that generalizes MaxEntIRL in Eq. (4) to IRL with a class of strongly convex policy regularizers:

IRL_Ω(π_E) := arg max_{r∈R^{S×A}} [J_Ω(r, π_E) - max_{π∈∆_A^S} J_Ω(r, π)].   (6)

For any strongly convex policy regularizer Ω, regularized IRL does not suffer from degenerate solutions since there is a unique optimal policy in any regularized MDP (Geist et al., 2019). While Geist et al. (2019) proposed solutions of regularized IRL, those solutions are intractable (see Appendix F.1 for a detailed explanation). In the following lemma, we propose tractable solutions that only require the evaluation of the policy regularizer (Ω) and its gradient (∇Ω), which are more manageable in practice.

Lemma 1. For Ω'(s, a; π) := [∇Ω(π(·|s))](a), define

t(s, a; π) := Ω'(s, a; π) - E_{a'~π(·|s)}[Ω'(s, a'; π)] + Ω(π(·|s)).   (7)

Then, using r(s, a) = t(s, a; π_E) as a reward, the RL objective in Eq. (1) is equal to

arg min_{π∈∆_A^S} E_π[Σ_{i=0}^∞ γ^i D_Ω^A(π(·|s_i)||π_E(·|s_i))],   (8)

where D_Ω^A is the Bregman divergence (Bregman, 1967) defined by D_Ω^A(p_1||p_2) = Ω(p_1) - Ω(p_2) - ⟨∇Ω(p_2), p_1 - p_2⟩_A for p_1, p_2 ∈ ∆_A. Due to the non-negativity of the Bregman divergence, π = π_E is a solution of Eq. (8), and it is unique since Eq. (1) has a unique solution for arbitrary reward functions (Geist et al., 2019). In particular, for any policy regularizer Ω represented by an expectation over the policy (Yang et al., 2019), Lemma 1 reduces to the following solution in Corollary 1:

Corollary 1. For Ω(p) = -λE_{a~p}[φ(p(a))] with p ∈ ∆_A (Yang et al., 2019), Eq. (7) becomes

t(s, a; π) = -λ(f'_φ(π(a|s)) - E_{a'~π(·|s)}[f'_φ(π(a'|s)) - φ(π(a'|s))])   (9)

for f'_φ(x) = ∂(xφ(x))/∂x. The proof is in Appendix B. Throughout the paper, we refer to the expectation E_{a~π(·|s)}[f'_φ(π(a|s)) - φ(π(a|s))] as the reward baseline. Note that for continuous control tasks with Ω(p) = -λE_{a~p}[φ(p(a))], we can obtain the same form of the reward as in Eq. (9) (the proof is in Appendix D). Although the reward baseline is generally intractable in continuous control tasks, we derive a tractable reward baseline for a special case (see Section 3.2).
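As a sanity-check sketch of Corollary 1 (illustrative, not part of the paper's code), the reward t(s, a; π) can be computed directly from φ and its derivative for a discrete policy; for λ = 1 and φ(x) = -log x it should collapse to t = log π(a|s). The policy below is a made-up example.

```python
# Reward t from Corollary 1 for a discrete policy:
# t = -lam * (f'_phi(pi(a|s)) - E_{a'~pi}[f'_phi(pi(a'|s)) - phi(pi(a'|s))]).
import math

def t_reward(pi_s, a, phi, dphi, lam=1.0):
    # f_phi(x) = x*phi(x)  =>  f'_phi(x) = phi(x) + x*phi'(x)
    fprime = lambda x: phi(x) + x * dphi(x)
    baseline = sum(p * (fprime(p) - phi(p)) for p in pi_s)   # reward baseline
    return -lam * (fprime(pi_s[a]) - baseline)

phi = lambda x: -math.log(x)      # Shannon case
dphi = lambda x: -1.0 / x

pi_s = [0.1, 0.2, 0.3, 0.4]
for a in range(4):
    assert abs(t_reward(pi_s, a, phi, dphi) - math.log(pi_s[a])) < 1e-12
print("t(s,a;pi) == log pi(a|s) for the Shannon regularizer")
```

For the Shannon case, f'_φ(x) = -log x - 1 and the baseline equals -1, so the two -1 terms cancel and t = log π(a|s), matching AIRL's reward.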
Additionally, when λ = 1 and φ(x) = -log x, it can be shown that t(s, a; π) = log π(a|s)-for both discrete and continuous control problems-which was used as a reward objective in previous work (Fu et al., 2018), and that the Bregman divergence in Eq. (8) becomes the KL divergence KL(π(·|s)||π_E(·|s)). Additional solutions of regularized IRL can be found by shaping t(s, a; π_E), as stated in the following lemma:

Lemma 2 (Potential-based reward shaping). Let π* be the solution of Eq. (1) in a regularized MDP M_r with a regularizer Ω : ∆_A → R and a reward function r ∈ R^{S×A}. Then for Φ ∈ R^S, using either r(s, a) + γΦ(s') - Φ(s) or r(s, a) + γE_{s'~P(·|s,a)}[Φ(s')] - Φ(s) as a reward does not change the solution of Eq. (1).

The proof is in Appendix E. From Lemma 1 and Lemma 2, we prove a sufficient condition for rewards to be solutions of the IRL problem. However, the necessary condition-that this set of solutions contains the only possible solutions for an arbitrary MDP-is not proved (Ng et al., 1999). In the following lemma, we relate the proposed solution to the normalized state-visitation distribution, which can be discussed from the distribution-matching perspective on imitation learning problems (Ho & Ermon, 2016; Fu et al., 2018; Ke et al., 2019; Ghasemipour et al., 2019):

Lemma 3. Let Ω̃ be a strongly convex regularizer defined on visitation distributions. When Ω̃(d) is strictly convex and a solution t(s, a; π_E) = [∇Ω̃(d_{π_E})](s, a) of IRL in Eq. (10) is used as a reward, the RL objective in Eq. (1) is equal to arg min_{π∈∆_A^S} D_{Ω̃}^{S×A}(d_π||d_{π_E}), where D_{Ω̃}^{S×A} is the Bregman divergence among visitation distributions defined by D_{Ω̃}^{S×A}(d_1||d_2) = Ω̃(d_1) - Ω̃(d_2) - ⟨∇Ω̃(d_2), d_1 - d_2⟩ for visitation distributions d_1 and d_2.

The proof is in Appendix G. Note that the strict convexity of Ω̃ is required for D_{Ω̃}^{S×A} to become a valid Bregman divergence.
Although the strict convexity of a policy regularizer Ω does not guarantee the strict convexity of Ω̃, the latter has been shown to hold for the Shannon entropy regularizer (Ω̃ = -H of Lemma 3.1 in Ho & Ermon (2016)) and for the Tsallis entropy regularizer with constants k = 1/2, q = 2 (Ω̃ = -W of Theorem 3 in Lee et al. (2018)).
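The Bregman divergence used throughout Section 3.1 can be computed directly from its definition. The sketch below (illustrative distributions, not from the paper) checks that for λ = 1 and φ(x) = -log x it coincides with the KL divergence, as stated above.

```python
# Bregman divergence D_Omega(p1||p2) for Omega(p) = -lam * E_p[phi(p)],
# computed from Omega(p1) - Omega(p2) - <grad Omega(p2), p1 - p2>.
import math

def bregman(p1, p2, phi, dphi, lam=1.0):
    fprime = lambda x: phi(x) + x * dphi(x)        # f'_phi
    omega = lambda p: -lam * sum(x * phi(x) for x in p)
    grad2 = [-lam * fprime(x) for x in p2]          # [grad Omega(p2)](a)
    inner = sum(g * (a1 - a2) for g, a1, a2 in zip(grad2, p1, p2))
    return omega(p1) - omega(p2) - inner

p1 = [0.2, 0.5, 0.3]
p2 = [0.4, 0.4, 0.2]
kl = sum(a * math.log(a / b) for a, b in zip(p1, p2))
d = bregman(p1, p2, phi=lambda x: -math.log(x), dphi=lambda x: -1.0 / x)
print(round(d, 6), round(kl, 6))  # the two values coincide
```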

3.2. IRL WITH TSALLIS ENTROPY REGULARIZATION AND GAUSSIAN POLICIES

For continuous controls, multi-variate Gaussian policies are often used in practice (Schulman et al., 2015; 2017), and we consider IRL problems with those policies in this subsection. In particular, we consider IRL with the Tsallis entropy regularizer Ω(p) = -T_q^k(p) = -E_{a~p}[(k/(q-1))(1 - p(a)^{q-1})] (Lee et al., 2018; Yang et al., 2019; Lee et al., 2020) for a multi-variate Gaussian policy π(·|s) = N(μ(s), Σ(s)) with μ(s) = [μ_1(s), ..., μ_d(s)]^T and Σ(s) = diag{(σ_1(s))², ..., (σ_d(s))²}. In this case, we can obtain tractable forms of the following quantities.

Tsallis entropy. The tractable form of the Tsallis entropy of a multi-variate Gaussian policy is

T_q^k(π(·|s)) = k(1 - e^{(1-q)R_q(π(·|s))})/(q - 1),  R_q(π(·|s)) = Σ_{i=1}^d log(√(2π)σ_i(s)) - d log q/(2(1 - q))

for the Renyi entropy R_q. Its derivation is given in Appendix I.

Reward baseline. The reward baseline term E_{a~π(·|s)}[f'_φ(π(a|s)) - φ(π(a|s))] in Corollary 1 is generally intractable except for either discrete control problems or Shannon entropy regularization, where the reward baseline is equal to -1. Interestingly, as long as the tractable form of the Tsallis entropy can be derived, that of the corresponding reward baseline can also be derived, since the reward baseline satisfies

E_{a~π(·|s)}[f'_φ(π(a|s)) - φ(π(a|s))] = (q - 1)E_{a~π(·|s)}[φ(π(a|s))] - k = (q - 1)T_q^k(π(·|s)) - k.

Here, the first equality holds with f'_φ(x) = (k/(q-1))(1 - qx^{q-1}) = qφ(x) - k for Tsallis entropy regularization.

Bregman divergence associated with Tsallis entropy. For two different multi-variate Gaussian policies, we derive the tractable form of the Bregman divergence (associated with the Tsallis entropy) between the two policies. The resultant divergence has a complicated form, so we leave its derivation to Appendix I.3.

Figure 1: Bregman divergence D_Ω(π||π_E) associated with the Tsallis entropy (Ω = -T_q^k) between two uni-variate Gaussian distributions π = N(μ, σ²) and π_E = N(0, (e^{-3})²) (green point in each subplot).
In each subplot, we normalized the Bregman divergence so that the maximum value becomes 1. Note that for q = 1, D_Ω(π||π_E) becomes the KL divergence KL(π||π_E).

For a deeper understanding of the Tsallis entropy and its associated Bregman divergence in regularized MDPs, we consider an example in Figure 1. We first assume that the learning agent's and expert's policies follow uni-variate Gaussian distributions π = N(μ, σ²) and π_E = N(0, (e^{-3})²), respectively. We then evaluate the Bregman divergence in Figure 1 by using its tractable form and varying q from 1.0-which corresponds to the KL divergence-to 2.0. We observe that the constant q from the Tsallis entropy affects the sensitivity of the associated Bregman divergence w.r.t. the mean and standard deviation of the learning agent's policy π. Specifically, as q increases, the size of the valley-the relatively red region in Figure 1-across the μ-axis and log σ-axis decreases. This suggests that for larger q, minimizing the Bregman divergence requires more tightly matching the means and variances of π and π_E.

Published as a conference paper at ICLR 2021

Algorithm 1: Regularized Adversarial Inverse Reinforcement Learning (RAIRL)
1: Input: A set D_E of expert demonstrations generated by expert policy π_E, a reward approximator r_θ, a policy π_ψ for neural network parameters θ and ψ
2: for each iteration do
3:   Sample rollout trajectories by using the learner's policy π_ψ.

4:   Optimize θ with the discriminator D_{r_θ,π_ψ} and the learning objective in Eq. (11).
5:   Optimize ψ with Regularized Actor-Critic by using r_θ as a reward function.
6: end for
7: Output: π_ψ ≈ π_E, r_θ(s, a) ≈ t(s, a; π_E) (a solution of the IRL problem in Lemma 1)
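As an illustration of the loop above in the simplest possible setting (a single-state bandit with the Shannon regularizer, λ = 1, exact expectations instead of samples and neural networks), the analytically optimal discriminator returns r̂(a) = t(a; π) + log(π_E(a)/π(a)) = log π_E(a), and the regularized RL step for a bandit is the softmax of r̂, so the learner recovers π_E. This sketch is our own illustration, not the paper's implementation.

```python
# Toy, exact-expectation sketch of the RAIRL loop in a single-state bandit
# with the Shannon regularizer (lam = 1, phi = -log). Policies are made up.
import math

pi_E = [0.1, 0.2, 0.3, 0.4]          # expert policy (illustrative)
pi = [0.25, 0.25, 0.25, 0.25]        # learner initialization

for _ in range(3):
    # "Discriminator" step with its analytic optimum:
    t = [math.log(p) for p in pi]                        # t(a; pi) = log pi(a)
    r_hat = [t_a + math.log(pe / p) for t_a, pe, p in zip(t, pi_E, pi)]
    # Regularized RL step (soft-optimal bandit policy): pi = softmax(r_hat)
    z = sum(math.exp(r) for r in r_hat)
    pi = [math.exp(r) / z for r in r_hat]

print([round(p, 6) for p in pi])  # converges to pi_E
```

With exact expectations the fixed point is reached after a single iteration; in practice both steps are approximated with neural networks and samples, which is what Algorithm 1 does.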

4. ALGORITHMIC CONSIDERATION

Based on a solution for regularized IRL from the previous section, we now focus on developing a practical IRL algorithm. In particular, to recover the reward function t(s, a; π_E) in Lemma 1, we design an adversarial training objective as follows. Motivated by AIRL (Fu et al., 2018), we consider the following structured discriminator associated with π, r and t in Lemma 1:

D_{r,π}(s, a) = σ(r(s, a) - t(s, a; π)),  σ(z) = 1/(1 + e^{-z}), z ∈ R.

Note that we recover the discriminator of AIRL in Eq. (5) when t(s, a; π) = log π(a|s) (φ(x) = -log x and λ = 1). Then, we consider the following optimization objective for the discriminator, which is the same as that of AIRL:

t̂(s, a; π) := arg max_{r∈R^{S×A}} E_{(s,a)~d_{π_E}}[log D_{r,π}(s, a)] + E_{(s,a)~d_π}[log(1 - D_{r,π}(s, a))].   (11)

Since the function x ↦ a log σ(x) + b log(1 - σ(x)) attains its maximum at σ(x) = a/(a+b), or equivalently at x = log(a/b) (Goodfellow et al., 2014; Mescheder et al., 2017), it can be shown that

t̂(s, a; π) = t(s, a; π) + log(d_{π_E}(s, a)/d_π(s, a)).   (12)

When π = π_E in Eq. (12), we have t̂(s, a; π_E) = t(s, a; π_E) since d_π = d_{π_E}, which means the maximizer t̂ becomes the solution of IRL after the agent successfully imitates the expert policy π_E. To do so, we consider the following iterative algorithm. Assuming that we find the optimal reward approximator t̂(s, a; π^{(i)}) in Eq. (12) for the policy π^{(i)} of the i-th iteration, we get the policy π^{(i+1)} by optimizing the following objective with gradient ascent:

maximize_{π∈∆_A^S} E_{(s,a)~d_π}[t̂(s, a; π^{(i)}) - Ω(π(·|s))].   (13)

The expectation in Eq. (13) can be decomposed into the following two terms:

E_{(s,a)~d_π}[t̂(s, a; π^{(i)}) - Ω(π(·|s))] = E_{(s,a)~d_π}[t(s, a; π^{(i)}) - Ω(π(·|s))] - KL(d_π||d_{π_E}) = -E_{(s,a)~d_π}[D_Ω^A(π(·|s)||π^{(i)}(·|s))] (I) - KL(d_π||d_{π_E}) (II),   (14)

where the second equality follows since Lemma 1 tells us that t(s, a; π^{(i)}) is a reward function that makes π^{(i)} an optimal policy in the Ω-regularized MDP.
Minimizing term (II) in Eq. (14) makes π^{(i+1)} close to π_E, while minimizing term (I) can be regarded as conservative policy optimization around the policy π^{(i)} (Schulman et al., 2015). In practice, we parameterize our reward and policy approximators with neural networks and train them using an off-policy Regularized Actor-Critic (RAC) (Yang et al., 2019), as described in Algorithm 1. Below, we evaluate our Regularized Adversarial Inverse Reinforcement Learning (RAIRL) approach across various scenarios.
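The key fact behind Eq. (12)-that x ↦ a log σ(x) + b log(1 - σ(x)) is maximized at x = log(a/b)-can be checked numerically with a simple grid search (the values of a and b are illustrative stand-ins for d_{π_E}(s, a) and d_π(s, a)):

```python
# Grid-search check that a*log(sigma(x)) + b*log(1 - sigma(x))
# peaks at x = log(a/b).
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def objective(x, a, b):
    return a * math.log(sigma(x)) + b * math.log(1.0 - sigma(x))

a, b = 0.3, 0.7
xs = [i / 1000.0 for i in range(-5000, 5001)]     # grid on [-5, 5]
best = max(xs, key=lambda x: objective(x, a, b))
print(round(best, 3), round(math.log(a / b), 3))  # the two values agree
```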

5. EXPERIMENTS

We summarize the experimental setup as follows. In our experiments, we consider Ω(p) = -λE_{a~p}[φ(p(a))] with the following regularizers from Yang et al. (2019): (1) Shannon entropy (φ(x) = -log x), (2) Tsallis entropy (φ(x) = (k/(q-1))(1 - x^{q-1})), (3) exp regularizer (φ(x) = e - e^x), (4) cos regularizer (φ(x) = cos(πx/2)), (5) sin regularizer (φ(x) = 1 - sin(πx/2)). We chose these regularizers since, to the best of our knowledge, other regularizers have not been empirically validated, and they make our empirical analysis more tractable. In addition, we model the reward approximator of RAIRL as a neural network with one of the following models: (1) the non-structured model (NSM)-a simple feed-forward neural network that outputs real values, as used in AIRL (Fu et al., 2018)-and (2) the density-based model (DBM)-a model using a neural network for π (softmax for discrete controls and a multi-variate Gaussian model for continuous controls) of the solution in Eq. (1) (see Appendix J.2 for a detailed explanation). For the RL algorithm of RAIRL, we implement Regularized Actor-Critic (RAC) (Yang et al., 2019) on top of the SAC implementation from rlpyt (Stooke & Abbeel, 2019). Other settings are summarized in Appendix J. For all experiments, we use 5 runs and report 95% confidence intervals.

5.1. EXPERIMENT 1: MULTI-ARMED BANDIT (DISCRETE ACTION)

We consider a 4-armed bandit environment as shown in Figure 2 (left). An expert policy π_E is assumed to be either dense (with probabilities 0.1, 0.2, 0.3, 0.4 for a = 0, 1, 2, 3) or sparse (with probabilities 0, 0, 1/3, 2/3 for a = 0, 1, 2, 3). For those experts, we run RAIRL with actions sampled from π_E and compare the learned rewards with the ground-truth reward t(s, a; π_E) in Lemma 1. When π_E is dense, RAIRL successfully acquires the ground-truth rewards irrespective of the choice of reward model. When the sparse π_E is used, however, RAIRL with the non-structured model (RAIRL-NSM) fails to recover the rewards for a = 0, 1-where π_E(a) = 0-due to the lack of samples at the end of imitation. On the other hand, RAIRL with the density-based model (RAIRL-DBM) recovers the correct rewards due to its softmax layer, which keeps the sum over the outputs equal to 1. Therefore, we argue that using DBM is necessary for correct reward acquisition, since a set of demonstrations is generally sparse. In the following experiment, we show that the choice of reward model indeed affects the quality of the learned rewards.

5.2. EXPERIMENT 2: BERMUDA WORLD (CONTINUOUS STATE, DISCRETE ACTION)

We consider an environment with a 2-dimensional continuous state space as described in Figure 3. At each episode, the learning agent is initialized uniformly on the x-axis between -5 and 5, and there are 8 possible actions-an angle in {-π, -3π/4, ..., π/2, 3π/4} that determines the direction of movement. An expert in Bermuda World considers 3 target positions (-5, 10), (0, 10), (5, 10) and behaves stochastically. We state how we mathematically define the expert policy π_E in Appendix J.3. During RAIRL's training (Figure 3, top row), we use 1000 demonstrations sampled from the expert and periodically measure the mean Bregman divergence (1/N) Σ_{i=1}^N D_Ω^A(π(·|s_i)||π_E(·|s_i)), where D_Ω^A(p_1||p_2) = E_{a~p_1}[f'_φ(p_2(a)) - φ(p_1(a))] - E_{a~p_2}[f'_φ(p_2(a)) - φ(p_2(a))].
Here, the states s_1, ..., s_N come from 30 evaluation trajectories that are stochastically sampled from the agent's policy π-which is fixed during evaluation-in a separate evaluation environment. During the evaluation of the learned reward (Figure 3, bottom row), we train randomly initialized agents with RAC and the rewards acquired from RAIRL's training, and check whether the mean Bregman divergence is properly minimized. We measure the mean Bregman divergence as was done during RAIRL's training. RAIRL-DBM is shown to minimize the target divergence more effectively than RAIRL-NSM during reward evaluation, although both achieve comparable performance during RAIRL's training. Moreover, we substitute λ with 1, 5, 10 and observe that learning with λ larger than 1 returns better rewards-only λ = 1 was considered in AIRL (Fu et al., 2018). Note that in all cases, the minimum divergence achieved by RAIRL is comparable to that of behavioral cloning (BC). This is because BC performs sufficiently well when many demonstrations are given. We think the divergence of BC may be near the optimal divergence that can be achieved with our policy neural network model.
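The expectation form of the Bregman divergence used above is equivalent to the definition from Section 3.1; a small sketch (illustrative distributions, Tsallis regularizer with k = 1, q = 2, λ = 1) checks the two computations against each other:

```python
# Check the expectation form of the Bregman divergence,
#   D(p1||p2) = E_{a~p1}[f'(p2(a)) - phi(p1(a))] - E_{a~p2}[f'(p2(a)) - phi(p2(a))],
# against the definition Omega(p1) - Omega(p2) - <grad Omega(p2), p1 - p2>.
k, q = 1.0, 2.0
phi = lambda x: k / (q - 1) * (1 - x ** (q - 1))
fprime = lambda x: k / (q - 1) * (1 - q * x ** (q - 1))   # d/dx [x*phi(x)]

def D_expect(p1, p2):
    e1 = sum(a * (fprime(b) - phi(a)) for a, b in zip(p1, p2))
    e2 = sum(b * (fprime(b) - phi(b)) for b in p2)
    return e1 - e2

def D_def(p1, p2):
    omega = lambda p: -sum(x * phi(x) for x in p)
    inner = sum(-fprime(b) * (a - b) for a, b in zip(p1, p2))
    return omega(p1) - omega(p2) - inner

p1 = [0.2, 0.5, 0.3]
p2 = [0.4, 0.4, 0.2]
print(abs(D_expect(p1, p2) - D_def(p1, p2)) < 1e-12, D_def(p1, p2) >= 0)
```

For this choice of k and q the divergence reduces to the squared Euclidean distance Σ_a (p_1(a) - p_2(a))².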

5.3. EXPERIMENT 3: MUJOCO (CONTINUOUS OBSERVATION AND ACTION)

We validate RAIRL on MuJoCo continuous control tasks (Hopper-v2, Walker-v2, HalfCheetah-v2, Ant-v2) as follows. We assume multi-variate Gaussian policies (with diagonal covariance matrices) for both the learner's policy π and the expert policy π_E. Instead of the tanh-squashed policy of Soft Actor-Critic (Haarnoja et al., 2018), we use hyperbolized environments-where tanh is regarded as part of the environment-with additional engineering on the policy networks (see Appendix J.4 for details). We use 100 demonstrations stochastically sampled from π_E to validate RAIRL. In the MuJoCo experiments, we focus on Tsallis entropy regularizers (Ω = -T_q^1) with q = 1, 1.5, 2, where the Tsallis entropy becomes the Shannon entropy for q = 1. We then exploit the tractable quantities for multi-variate Gaussian distributions in Section 3.2 to stabilize RAIRL and check its performance in terms of the mean Bregman divergence, similar to the previous experiment. Note that since both π and π_E are multi-variate Gaussian and can be evaluated, we can evaluate the individual Bregman divergence D_Ω^A(π(·|s)||π_E(·|s)) at each s by using the derivation in Appendix I.3. The performance during RAIRL's training is described as follows. We report π with both episodic scores (Figure 4) and mean Bregman divergences with respect to three types of Tsallis entropies T_{q'}^1 with q' = 1, 1.5, 2 (Figure 5). Note that the objective of RAIRL with Ω = -T_q^1 is to minimize the corresponding mean Bregman divergence with q' = q. In Figure 4, both RAIRL-DBM and RAIRL-NSM are shown to achieve the expert performance, irrespective of q, in Hopper-v2, Walker-v2, and HalfCheetah-v2. In contrast, RAIRL in Ant-v2 fails to achieve the expert's performance within 2,000,000 steps, and RAIRL-NSM substantially outperforms RAIRL-DBM in our setting.
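The tractable Tsallis entropy of a diagonal Gaussian from Section 3.2 can be sanity-checked against a Monte-Carlo estimate of E_{a~N}[φ(p(a))]. The sketch below is our own illustration (standard deviations and q are made-up values, k = 1):

```python
# Closed-form Tsallis entropy of a diagonal Gaussian,
#   T_q^k = k * (1 - exp((1-q) * R_q)) / (q - 1),
#   R_q   = sum_i log(sqrt(2*pi)*sigma_i) - d*log(q) / (2*(1-q)),
# versus a Monte-Carlo estimate of E_{a~N}[phi(p(a))].
import math, random

def tsallis_gaussian(sigmas, k=1.0, q=1.5):
    d = len(sigmas)
    R_q = sum(math.log(math.sqrt(2 * math.pi) * s) for s in sigmas) \
          - d * math.log(q) / (2 * (1 - q))
    return k * (1 - math.exp((1 - q) * R_q)) / (q - 1)

def tsallis_mc(sigmas, k=1.0, q=1.5, n=100000, seed=0):
    rng = random.Random(seed)
    def pdf(a):
        return math.prod(math.exp(-x * x / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)
                         for x, s in zip(a, sigmas))
    phi = lambda p: k / (q - 1) * (1 - p ** (q - 1))
    total = 0.0
    for _ in range(n):
        a = [rng.gauss(0.0, s) for s in sigmas]
        total += phi(pdf(a))
    return total / n

sigmas = [0.5, 1.2]
print(round(tsallis_gaussian(sigmas), 4), round(tsallis_mc(sigmas), 4))
```

The two estimates agree up to Monte-Carlo noise, which is what makes the reward baseline (q - 1)T_q^k - k computable in closed form for Gaussian policies.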
Although the episodic scores are comparable for all methods in Hopper-v2, Walker-v2, and HalfCheetah-v2, the respective divergences are highly different from one another, as shown in Figure 5. RAIRL with q = 2 in most cases achieves the minimum mean Bregman divergence (for all three divergences with q' = 1, 1.5, 2), whereas RAIRL with q = 1-which corresponds to AIRL (Fu et al., 2018)-achieves the maximum divergence in most cases. This result is in alignment with our intuition from Section 3.2: as q increases, minimizing the Bregman divergence requires much tighter matching between π and π_E. Unfortunately, while evaluating the acquired reward-RAC with a randomly initialized agent and the acquired reward-the target divergence is not properly decreased in continuous controls. We believe this is because π is a probability density function in continuous controls, which causes large variance during training, whereas π is a mass function and is well bounded in discrete control problems.

Figure 4: Episodic scores during RAIRL's training in MuJoCo environments. RAIRL with the Tsallis entropy regularizer with q = 1, 1.5, 2 is considered.

Figure 5: Bregman divergences with Tsallis entropy T_{q'}^1 with q' = 1, 1.5, 2 during RAIRL's training in MuJoCo environments. We consider RAIRL with the Tsallis entropy regularizer T_q^1 with q = 1, 1.5, 2.

6. DISCUSSION AND FUTURE WORKS

We consider the problem of IRL in regularized MDPs (Geist et al., 2019), assuming a class of strongly convex policy regularizers. We theoretically derive its solution (a set of reward functions) and show that learning with these rewards is equivalent to a specific instance of imitation learning-namely, one that minimizes the Bregman divergence associated with the policy regularizer. We propose RAIRL-a practical sample-based IRL algorithm in regularized MDPs-and evaluate its applicability to policy imitation (for discrete and continuous controls) and reward acquisition (for discrete control). Finally, recent advances in imitation learning and IRL are built from the perspective of regarding imitation learning as a statistical divergence minimization problem (Ke et al., 2019; Ghasemipour et al., 2019). Although the Bregman divergence is known to cover various divergences, it does not include some divergence families such as the f-divergence (Csiszár, 1963; Amari, 2009). Therefore, we believe that considering RL with policy regularization different from Geist et al. (2019), together with its inverse problem, is a possible way of finding links between imitation learning and various statistical distances.

A BELLMAN OPERATORS, VALUE FUNCTIONS IN REGULARIZED MDPS

Let a policy regularizer Ω : ∆_A → R be strongly convex, and define the convex conjugate Ω* : R^A → R of Ω as

Ω*(Q(s, ·)) = max_{π(·|s)∈∆_A} ⟨π(·|s), Q(s, ·)⟩_A - Ω(π(·|s)),  Q ∈ R^{S×A}, s ∈ S.   (15)

Then, Bellman operators, equations and value functions in regularized MDPs are defined as follows.

Definition 1 (Regularized Bellman operators). For V ∈ R^S, let us define Q(s, a) = r(s, a) + γE_{s'~P(·|s,a)}[V(s')]. The regularized Bellman evaluation operator is defined as [T_π V](s) := ⟨π(·|s), Q(s, ·)⟩_A - Ω(π(·|s)), s ∈ S, and T_π V = V is called the regularized Bellman equation. Also, the regularized Bellman optimality operator is defined as [T_* V](s) := max_{π(·|s)∈∆_A} [T_π V](s) = Ω*(Q(s, ·)), s ∈ S, and T_* V = V is called the regularized Bellman optimality equation.

Definition 2 (Regularized value functions). The unique fixed point V_π of the operator T_π is called the state value function, and Q_π(s, a) = r(s, a) + γE_{s'~P(·|s,a)}[V_π(s')] is called the state-action value function. Also, the unique fixed point V_* of the operator T_* is called the optimal state value function, and Q_*(s, a) = r(s, a) + γE_{s'~P(·|s,a)}[V_*(s')] is called the optimal state-action value function.

It should be noted that Proposition 1 in Geist et al. (2019) tells us that ∇Ω*(Q(s, ·)) is the policy that uniquely maximizes Eq. (15). For example, when Ω(π(·|s)) = Σ_{a∈A} π(a|s) log π(a|s) (negative Shannon entropy), ∇Ω*(Q(s, ·)) is the softmax policy, i.e., ∇Ω*(Q(s, ·)) = exp(Q(s, ·)) / Σ_{a'∈A} exp(Q(s, a')). Due to this property, the optimal policy π* of a regularized MDP is uniquely found from the optimal state-action value function Q*, which is itself uniquely defined as the fixed point of T_*.
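For the negative Shannon entropy, the conjugate in Eq. (15) is the log-sum-exp function and its gradient is the softmax policy. A small numeric sketch (the Q values are illustrative) checks that the softmax policy attains the conjugate value and that other policies do worse:

```python
# For Omega(pi) = sum_a pi(a) log pi(a):
#   Omega*(Q) = log sum_a exp(Q(a)),  grad Omega*(Q) = softmax(Q).
import math

Q = [1.0, -0.5, 0.3]

def obj(pi):  # <pi, Q> - Omega(pi)
    return sum(p * q for p, q in zip(pi, Q)) - sum(p * math.log(p) for p in pi if p > 0)

z = sum(math.exp(q) for q in Q)
logsumexp = math.log(z)
softmax = [math.exp(q) / z for q in Q]

print(abs(obj(softmax) - logsumexp) < 1e-9)   # softmax attains Omega*(Q)
uniform = [1 / 3, 1 / 3, 1 / 3]
print(obj(uniform) < logsumexp)               # a suboptimal policy does worse
```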

B PROOF OF LEMMA 1

Let us write π^s = π(·|s). For r(s, a) = t(s, a; π_E), the RL objective in Eq. (1) satisfies

E_π[Σ_{i=0}^∞ γ^i {r(s_i, a_i) - Ω(π^{s_i})}]
(i) = E_π[Σ_{i=0}^∞ γ^i {t(s_i, a_i; π_E) - Ω(π^{s_i})}]
(ii) = E_π[Σ_{i=0}^∞ γ^i (Ω'(s_i, a_i; π_E) - E_{a~π_E^{s_i}}[Ω'(s_i, a; π_E)] + Ω(π_E^{s_i}) - Ω(π^{s_i}))]
(iii) = E_π[Σ_{i=0}^∞ γ^i (E_{a~π^{s_i}}[Ω'(s_i, a; π_E)] - E_{a~π_E^{s_i}}[Ω'(s_i, a; π_E)] + Ω(π_E^{s_i}) - Ω(π^{s_i}))]
(iv) = E_π[Σ_{i=0}^∞ γ^i (⟨∇Ω(π_E^{s_i}), π^{s_i}⟩_A - ⟨∇Ω(π_E^{s_i}), π_E^{s_i}⟩_A + Ω(π_E^{s_i}) - Ω(π^{s_i}))]
= -E_π[Σ_{i=0}^∞ γ^i (Ω(π^{s_i}) - Ω(π_E^{s_i}) - ⟨∇Ω(π_E^{s_i}), π^{s_i} - π_E^{s_i}⟩_A)]
(v) = -E_π[Σ_{i=0}^∞ γ^i D_Ω^A(π^{s_i}||π_E^{s_i})],   (16)

where (i) follows from the assumption r(s, a) = t(s, a; π_E) in Lemma 1, (ii) follows from the definition of t(s, a; π) in Eq. (7), (iii) follows since taking the inner expectation first does not change the overall expectation, (iv) follows from the definition of Ω' in Lemma 1 and Σ_{a∈A} p(a)[∇Ω(p)](a) = ⟨∇Ω(p), p⟩_A, and (v) follows from the definition of the Bregman divergence, i.e., D_Ω^A(p_1||p_2) = Ω(p_1) - Ω(p_2) - ⟨∇Ω(p_2), p_1 - p_2⟩_A. Due to the non-negativity of D_Ω^A, Eq. (16) is less than or equal to zero, and this maximum value is achieved when π = π_E.
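The identity proved above can be checked numerically in the simplest case of a single state with γ = 0, where the objective reduces to E_{a~π}[t(a; π_E)] - Ω(π) and should equal -D_Ω(π||π_E). The sketch below (our own illustration) uses the "exp" regularizer φ(x) = e - e^x from the experiments with λ = 1 and made-up policies:

```python
# Single-state, gamma = 0 check of Lemma 1:
#   E_{a~pi}[t(a; pi_E)] - Omega(pi) == -D_Omega(pi || pi_E).
import math

phi = lambda x: math.e - math.exp(x)
dphi = lambda x: -math.exp(x)
fprime = lambda x: phi(x) + x * dphi(x)      # f'_phi for f_phi(x) = x*phi(x)

def t_reward(pi_ref, a):
    baseline = sum(p * (fprime(p) - phi(p)) for p in pi_ref)
    return -(fprime(pi_ref[a]) - baseline)

def omega(p):
    return -sum(x * phi(x) for x in p)

def bregman(p1, p2):
    inner = sum(-fprime(b) * (a - b) for a, b in zip(p1, p2))
    return omega(p1) - omega(p2) - inner

pi_E = [0.1, 0.2, 0.3, 0.4]
pi = [0.4, 0.3, 0.2, 0.1]
lhs = sum(p * t_reward(pi_E, a) for a, p in enumerate(pi)) - omega(pi)
rhs = -bregman(pi, pi_E)
print(abs(lhs - rhs) < 1e-12)
```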

C PROOF OF COROLLARY 1

Let a ∈ {1, ..., |A|} and write π_a = π(a) for simplicity. For Ω(π) = -λE_{a~π}[φ(π_a)] = -λΣ_{a∈A} π_a φ(π_a) = -λΣ_{a∈A} f_φ(π_a) with f_φ(x) = xφ(x), we have

∇Ω(π) = -λ∇_{π_1,...,π_{|A|}} Σ_{a∈A} f_φ(π_a) = -λ[f'_φ(π_1), ..., f'_φ(π_{|A|})]^T

for f'_φ(x) = ∂(xφ(x))/∂x. Therefore, for π^s = π(·|s) we have

t(s, a; π) = [∇Ω(π^s)](a) - ⟨∇Ω(π^s), π^s⟩_A + Ω(π^s)
= -λf'_φ(π^s_a) - Σ_{a'∈A} π^s_{a'}(-λf'_φ(π^s_{a'})) + (-λΣ_{a'∈A} π^s_{a'} φ(π^s_{a'}))
= -λ(f'_φ(π^s_a) - Σ_{a'∈A} π^s_{a'} f'_φ(π^s_{a'}) + Σ_{a'∈A} π^s_{a'} φ(π^s_{a'}))
= -λ(f'_φ(π^s_a) - Σ_{a'∈A} π^s_{a'}[f'_φ(π^s_{a'}) - φ(π^s_{a'})])
= -λ(f'_φ(π^s_a) - E_{a'~π^s}[f'_φ(π^s_{a'}) - φ(π^s_{a'})]).

D PROOF OF OPTIMAL REWARDS ON CONTINUOUS CONTROLS

Note that for two continuous distributions $P_1$ and $P_2$ with probability density functions $p_1(x)$ and $p_2(x)$, respectively, the Bregman divergence can be defined as (Guo et al., 2017; Jones & Byrne, 1990)
$$D_\omega^{\mathcal{X}}(P_1\|P_2) := \int_{\mathcal{X}}\big\{\omega(p_1(x)) - \omega(p_2(x)) - \omega'(p_2(x))\,(p_1(x) - p_2(x))\big\}\,dx, \tag{17}$$
where $\omega'(x) := \tfrac{\partial}{\partial x}\omega(x)$ and the divergence is measured pointwise on $x\in\mathcal{X}$. Let us assume
$$\Omega(\pi) = \int_{\mathcal{A}}\omega(\pi(a))\,da \tag{18}$$
for a probability density function $\pi$ on $\mathcal{A}$, and define
$$t(s,a;\pi) := \omega'(\pi^s(a)) - \int_{\mathcal{A}}\big[\pi^s(a')\,\omega'(\pi^s(a')) - \omega(\pi^s(a'))\big]\,da' \tag{19}$$
for $\pi^s = \pi(\cdot|s)$. For $r(s,a) = t(s,a;\pi_E)$, the RL objective in Eq.(1) satisfies
$$\begin{aligned}
\mathbb{E}_\pi\Big[\sum_{i=0}^\infty \gamma^i\{r(s_i,a_i) - \Omega(\pi^{s_i})\}\Big]
&\overset{(i)}{=} \mathbb{E}_\pi\Big[\sum_{i=0}^\infty \gamma^i\{t(s_i,a_i;\pi_E) - \Omega(\pi^{s_i})\}\Big] \\
&\overset{(ii)}{=} \mathbb{E}_\pi\Big[\sum_{i=0}^\infty \gamma^i\Big(\omega'(\pi_E^{s_i}(a_i)) - \int_{\mathcal{A}}\big[\pi_E^{s_i}(a)\omega'(\pi_E^{s_i}(a)) - \omega(\pi_E^{s_i}(a))\big]da - \int_{\mathcal{A}}\omega(\pi^{s_i}(a))\,da\Big)\Big] \\
&= \mathbb{E}_\pi\Big[\sum_{i=0}^\infty \gamma^i\int_{\mathcal{A}}\big(\pi^{s_i}(a)\omega'(\pi_E^{s_i}(a)) - \pi_E^{s_i}(a)\omega'(\pi_E^{s_i}(a)) + \omega(\pi_E^{s_i}(a)) - \omega(\pi^{s_i}(a))\big)\,da\Big] \\
&= -\mathbb{E}_\pi\Big[\sum_{i=0}^\infty \gamma^i\int_{\mathcal{A}}\big(\omega(\pi^{s_i}(a)) - \omega(\pi_E^{s_i}(a)) - \omega'(\pi_E^{s_i}(a))\big[\pi^{s_i}(a) - \pi_E^{s_i}(a)\big]\big)\,da\Big] \\
&\overset{(iii)}{=} -\mathbb{E}_\pi\Big[\sum_{i=0}^\infty \gamma^i D_\omega^{\mathcal{A}}(\pi^{s_i}\,\|\,\pi_E^{s_i})\Big],
\end{aligned}$$
where (i) follows from $r(s,a) = t(s,a;\pi_E)$, (ii) follows from Eq.(18) and Eq.(19), and (iii) follows from the definition of the Bregman divergence in Eq.(17). Due to the non-negativity of $D_\omega$, Eq.(24) is less than or equal to zero, which is achieved when $\pi = \pi_E$. Moreover, $\pi = \pi_E$ is the unique solution since Eq.(1) has a unique solution for arbitrary reward functions.
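The continuous counterpart of the key step, $\mathbb{E}_{a\sim\pi}[t(s,a;\pi_E)] - \Omega(\pi) = -D_\omega^{\mathcal{A}}(\pi\|\pi_E)$, can be checked by quadrature. A sketch of ours, assuming $\omega(x) = x\log x$ (for which the pointwise Bregman divergence is the KL divergence between densities) and univariate Gaussian policies:

```python
import numpy as np

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# omega(x) = x log x, omega'(x) = log x + 1.
omega = lambda x: x * np.log(x)
domega = lambda x: np.log(x) + 1.0

x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
p  = gauss(x, 0.3, 0.8)   # agent policy density pi(.|s)
pE = gauss(x, 0.0, 1.0)   # expert policy density pi_E(.|s)

# t(a; pi_E) = omega'(pi_E(a)) - int [pi_E omega'(pi_E) - omega(pi_E)] da'   (Eq. 19)
t = domega(pE) - np.sum((pE * domega(pE) - omega(pE)) * dx)

# E_{a~pi}[t] - Omega(pi) should equal -D_omega(pi || pi_E)   (Eq. 17)
lhs = np.sum(p * t * dx) - np.sum(omega(p) * dx)
breg = np.sum((omega(p) - omega(pE) - domega(pE) * (p - pE)) * dx)
assert abs(lhs + breg) < 1e-6
assert breg >= 0.0
```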

E PROOF OF LEMMA 2

Since Lemma 2 was mentioned but not proved in Geist et al. (2019), we derive a rigorous proof of Lemma 2 in this subsection, following the proof idea in Ng et al. (1999). First, let us consider an MDP $M_r$ with a reward $r$ and corresponding optimal policy $\pi_*$. From Definition 1, the optimal state value function $V_{*,r}$ and its corresponding state-action value function $Q_{*,r}(s,a) := r(s,a) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V_{*,r}(s')]$ should satisfy the regularized Bellman optimality equation
$$V_{*,r}(s) = [T_{*,r}V_{*,r}](s) = \Omega^*(Q_{*,r}(s,\cdot)) = \max_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,\big\langle\pi(\cdot|s),\, r(s,\cdot) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,\cdot)}[V_{*,r}(s')]\big\rangle_{\mathcal{A}} - \Omega(\pi(\cdot|s)), \tag{21}$$
where we make the dependence on $r$ explicit. Moreover, the optimal policy $\pi_*$ is the unique maximizer of the above maximization.

Now, let us consider the shaped rewards $r(s,a) + \Phi(s') - \Phi(s)$ and $r(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}[\Phi(s')] - \Phi(s)$. Note that for both rewards, the expectation over $s'$ for given $(s,a)$ is equal to $\bar r(s,a) := r(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}[\Phi(s')] - \Phi(s)$, and thus it is sufficient to consider the optimality for $\bar r$ only. By subtracting $\Phi(s)$ from the regularized Bellman optimality equation for $r$ in Eq.(21), we have
$$\begin{aligned}
V_{*,r}(s) - \Phi(s) &= \max_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,\big\langle\pi(\cdot|s),\, r(s,\cdot) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,\cdot)}[\Phi(s')] - \Phi(s) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,\cdot)}[V_{*,r}(s') - \Phi(s')]\big\rangle_{\mathcal{A}} - \Omega(\pi(\cdot|s)) \\
&= \max_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,\big\langle\pi(\cdot|s),\, \bar r(s,\cdot) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,\cdot)}[V_{*,r}(s') - \Phi(s')]\big\rangle_{\mathcal{A}} - \Omega(\pi(\cdot|s)) \\
&= [T_{*,\bar r}(V_{*,r} - \Phi)](s).
\end{aligned}$$
That is, $V_{*,r} - \Phi$ is the fixed point of the regularized Bellman optimality operator $T_{*,\bar r}$ associated with the shaped reward $\bar r$. Also, the maximizer of the above maximization is still $\pi_*$, since subtracting the constant $\Phi(s)$ from Eq.(21) does not change the unique maximizer $\pi_*$.

F COMPARISON BETWEEN OUR SOLUTION AND EXISTING SOLUTION IN GEIST ET AL. (2019)

F.1 ISSUES WITH THE SOLUTION IN GEIST ET AL. (2019)

In Proposition 5 of Geist et al. (2019), a solution of regularized IRL is given, and we restate the relevant result in this subsection. Let us consider a regularized IRL problem, where $\pi_E \in \Delta_{\mathcal{A}}^{\mathcal{S}}$ is an expert policy. Assuming that both the model (dynamics, discount factor, and regularizer) and the expert policy are known, Geist et al.
(2019) proposed a solution of regularized IRL as follows.

Lemma 4 (A solution of regularized IRL from Geist et al. (2019)). Let $Q_E \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ be a function such that $\pi_E(\cdot|s) = \nabla\Omega^*(Q_E(s,\cdot))$, $s\in\mathcal{S}$. Also, define
$$r_E(s,a) := Q_E(s,a) - \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[\Omega^*(Q_E(s',\cdot))] = Q_E(s,a) - \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[\langle\pi_E(\cdot|s'), Q_E(s',\cdot)\rangle_{\mathcal{A}} - \Omega(\pi_E(\cdot|s'))\big].$$
Then, in the $\Omega$-regularized MDP with the reward $r_E$, $\pi_E$ is an optimal policy.

Proof. Although a brief idea of the proof is given in Geist et al. (2019), we write out the proof more rigorously as follows. Let us define $V_E(s) = \langle\pi_E(\cdot|s), Q_E(s,\cdot)\rangle_{\mathcal{A}} - \Omega(\pi_E(\cdot|s))$. Then, $r_E(s,a) = Q_E(s,a) - \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V_E(s')]$, and thus $Q_E(s,a) = r_E(s,a) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V_E(s')]$. Using this and the regularized Bellman optimality operator, we have
$$[T_* V_E](s) \overset{(i)}{=} \Omega^*(Q_E(s,\cdot)) \overset{(ii)}{=} \max_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,\langle\pi(\cdot|s), Q_E(s,\cdot)\rangle_{\mathcal{A}} - \Omega(\pi(\cdot|s)) \overset{(iii)}{=} \langle\pi_E(\cdot|s), Q_E(s,\cdot)\rangle_{\mathcal{A}} - \Omega(\pi_E(\cdot|s)) = V_E(s),$$
where (i) and (ii) follow from Definition 1, and (iii) follows since $\pi_E$ is the unique maximizer. Thus, $V_E$ is the fixed point of $T_*$, and $\pi_E(\cdot|s) = \nabla\Omega^*(Q_E(s,\cdot))$, $s\in\mathcal{S}$, becomes an optimal policy.

For example, when negative Shannon entropy is used as a regularizer, we can get $r_E(s,a) = \log\pi_E(a|s)$ by setting $Q_E(s,a) = \log\pi_E(a|s)$. However, the solution proposed in Lemma 4 has two crucial issues:

Issue 1. It requires knowledge of the model dynamics, which is generally intractable.

Issue 2. We need to find a $Q_E$ that satisfies $\pi_E(\cdot|s) = \nabla\Omega^*(Q_E(s,\cdot))$, a condition that comes from the relationship between the optimal value function and the optimal policy (Geist et al., 2019).

In the following subsection, we show how our solution in Lemma 1 is related to the solution from Geist et al. (2019) in Lemma 4.

F.2 RELATION BETWEEN THE SOLUTION OF GEIST ET AL. (2019) AND OUR TRACTABLE SOLUTION

Let us consider the expert policy $\pi_E$ and functions $Q_E$ and $r_E$ satisfying the conditions in Lemma 4. From Lemma 2, a regularized MDP with the shaped reward
$$\tilde r_E(s,a) := r_E(s,a) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[\Phi(s')] - \Phi(s)$$
for $\Phi \in \mathbb{R}^{\mathcal{S}}$ has $\pi_E$ as its optimal policy. Since $\Phi$ can be arbitrarily chosen, let us assume $\Phi(s) = \Omega^*(Q_E(s,\cdot))$.
Then, we have
$$\tilde r_E(s,a) = r_E(s,a) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[\Phi(s')] - \Phi(s) = Q_E(s,a) - \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[\Omega^*(Q_E(s',\cdot))] + \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[\Omega^*(Q_E(s',\cdot))] - \Omega^*(Q_E(s,\cdot)) = Q_E(s,a) - \Omega^*(Q_E(s,\cdot)). \tag{22}$$
Note that the reward in Eq.(22) does not require knowledge of the model dynamics, which addresses Issue 1 in Appendix F.1. Also, by using $V_E(s) = \Omega^*(Q_E(s,\cdot))$ from the proof of Lemma 4, the reward in Eq.(22) can be written as $\tilde r_E(s,a) = Q_E(s,a) - V_E(s)$, which means $\tilde r_E(s,a)$ is an advantage function for the optimal policy $\pi_E$ in the $\Omega$-regularized MDP. However, Issue 2 in Appendix F.1 remains, since $\Omega^*$ in Eq.(22) is generally intractable (Geist et al., 2019), which prevents us from finding an appropriate $Q_E(s,a)$.

Interestingly, we show that for all $s\in\mathcal{S}$ and $Q_E(s,\cdot) = \nabla\Omega(\pi_E(\cdot|s))$,
$$\begin{aligned}
\nabla\Omega^*(Q_E(s,\cdot)) &= \arg\max_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,\langle\pi(\cdot|s), Q_E(s,\cdot)\rangle_{\mathcal{A}} - \Omega(\pi(\cdot|s)) \\
&= \arg\max_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,\langle\pi(\cdot|s), \nabla\Omega(\pi_E(\cdot|s))\rangle_{\mathcal{A}} - \Omega(\pi(\cdot|s)) \\
&= \arg\min_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,\Omega(\pi(\cdot|s)) - \langle\nabla\Omega(\pi_E(\cdot|s)),\, \pi(\cdot|s)\rangle_{\mathcal{A}} \\
&= \arg\min_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,\Omega(\pi(\cdot|s)) - \Omega(\pi_E(\cdot|s)) - \langle\nabla\Omega(\pi_E(\cdot|s)),\, \pi(\cdot|s) - \pi_E(\cdot|s)\rangle_{\mathcal{A}} \\
&= \arg\min_{\pi(\cdot|s)\in\Delta_{\mathcal{A}}}\,D_\Omega^{\mathcal{A}}(\pi(\cdot|s)\,\|\,\pi_E(\cdot|s)) = \pi_E(\cdot|s),
\end{aligned}$$
where the last equality holds since the Bregman divergence $D_\Omega^{\mathcal{A}}(\pi(\cdot|s)\|\pi_E(\cdot|s))$ is non-negative and attains its minimum value zero when $\pi(\cdot|s) = \pi_E(\cdot|s)$. This means that when $Q_E(s,\cdot) = \nabla\Omega(\pi_E(\cdot|s))$ is used, the condition $\pi_E(\cdot|s) = \nabla\Omega^*(Q_E(s,\cdot))$ in Lemma 4 is satisfied without knowing a tractable form of $\Omega^*$ or $\nabla\Omega^*$. Thus, Issue 2 in Appendix F.1 is also addressed; we instead require knowledge of the gradient $\nabla\Omega$ of the policy regularizer $\Omega$, which is practically far more tractable.
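A small numerical illustration (ours, not from the paper) of the identity above for the Shannon case, where $\nabla\Omega(\pi) = \log\pi + 1$ and $\nabla\Omega^*$ is the softmax: choosing $Q_E = \nabla\Omega(\pi_E)$ satisfies the condition of Lemma 4, and Eq.(22) then recovers $\tilde r_E = \log\pi_E$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Shannon case: Omega(pi) = sum_a pi(a) log pi(a), so grad Omega(pi) = log(pi) + 1,
# Omega*(q) = log-sum-exp(q), and grad Omega*(q) = softmax(q).
def grad_omega(p):
    return np.log(p) + 1.0

def grad_omega_star(q):
    e = np.exp(q - q.max())
    return e / e.sum()

for _ in range(100):
    pi_E = rng.dirichlet(np.ones(5))
    q_E = grad_omega(pi_E)                      # choose Q_E(s, .) = grad Omega(pi_E(.|s))
    assert np.allclose(grad_omega_star(q_E), pi_E)   # condition of Lemma 4 holds

    r_tilde = q_E - np.log(np.sum(np.exp(q_E)))      # Eq.(22): Q_E - Omega*(Q_E)
    assert np.allclose(r_tilde, np.log(pi_E))        # recovers log pi_E for Shannon entropy
```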
Finally, by substituting $Q_E(s,\cdot) = \Omega'(s,\cdot;\pi_E)$ for $\Omega'(s,\cdot;\pi) := \nabla\Omega(\pi(\cdot|s))$, $s\in\mathcal{S}$, into Eq.(22), we have
$$\begin{aligned}
\tilde r_E(s,a) &= \Omega'(s,a;\pi_E) - \Omega^*(\Omega'(s,\cdot;\pi_E)) \\
&= \Omega'(s,a;\pi_E) - \big\{\langle\pi_E(\cdot|s), \Omega'(s,\cdot;\pi_E)\rangle_{\mathcal{A}} - \Omega(\pi_E(\cdot|s))\big\} \\
&= \Omega'(s,a;\pi_E) - \mathbb{E}_{a'\sim\pi_E(\cdot|s)}[\Omega'(s,a';\pi_E)] + \Omega(\pi_E(\cdot|s)) = t(s,a;\pi_E),
\end{aligned}$$
where $t(s,a;\pi_E)$ is our proposed solution in Lemma 1.

G PROOF OF LEMMA 3

RL objective in regularized MDPs w.r.t. normalized visitation distributions. For a reward function $r\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ and a strongly convex function $\Omega:\Delta_{\mathcal{A}}\to\mathbb{R}$, maximizing the RL objective $J_\Omega(r,\pi)$ in Eq.(1) is equivalent to maximizing
$$\tilde J_\Omega(r, d_\pi) := \langle r, d_\pi\rangle_{\mathcal{S}\times\mathcal{A}} - \tilde\Omega(d_\pi), \tag{23}$$
where, for the set of normalized visitation distributions (Syed et al., 2008)
$$\mathcal{D} := \Big\{ d\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}} : d \ge 0,\ \sum_{a'} d(s',a') = (1-\gamma)P_0(s') + \gamma\sum_{s,a} P(s'|s,a)\,d(s,a),\ \forall s'\in\mathcal{S} \Big\},$$
we define $\tilde\Omega(d) := \mathbb{E}_{(s,a)\sim d}[\Omega(\pi_d(\cdot|s))]$ and $\pi_d(\cdot|s) := \frac{d(s,\cdot)}{\sum_{a'} d(s,a')}$ for $d\in\mathcal{D}$, and use $\pi_{d_\pi}(\cdot|s) = \pi(\cdot|s)$ for all $s\in\mathcal{S}$. For $\tilde\Omega:\mathcal{D}\to\mathbb{R}$, its convex conjugate $\tilde\Omega^*$ is
$$\tilde\Omega^*(r) := \max_{d\in\mathcal{D}}\tilde J_\Omega(r,d) = \max_{d\in\mathcal{D}}\langle r,d\rangle_{\mathcal{S}\times\mathcal{A}} - \tilde\Omega(d) \overset{(i)}{=} \max_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}}\langle r, d_\pi\rangle_{\mathcal{S}\times\mathcal{A}} - \tilde\Omega(d_\pi) = \max_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}}\sum_{s,a} d_\pi(s,a)\big[r(s,a) - \Omega(\pi(\cdot|s))\big] = (1-\gamma)\cdot\max_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}} J_\Omega(r,\pi), \tag{24}$$
where (i) follows from the one-to-one correspondence between policies and visitation distributions (Syed et al., 2008; Ho & Ermon, 2016). Note that Eq.(24) is equal to the optimal discounted average return in regularized MDPs.

IRL objective in regularized MDPs w.r.t. normalized visitation distributions. By using the RL objective in Eq.(23), we can rewrite the IRL objective in Eq.(6) w.r.t. the normalized visitation distributions as the maximization of the following objective over $r\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$:
$$\begin{aligned}
(1-\gamma)\cdot\Big[J_\Omega(r,\pi_E) - \max_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}} J_\Omega(r,\pi)\Big] &= \tilde J_\Omega(r, d_{\pi_E}) - \max_{d\in\mathcal{D}}\tilde J_\Omega(r,d) = \min_{d\in\mathcal{D}}\big[\tilde J_\Omega(r,d_{\pi_E}) - \tilde J_\Omega(r,d)\big] \\
&= \min_{d\in\mathcal{D}}\big[\langle r,d_{\pi_E}\rangle_{\mathcal{S}\times\mathcal{A}} - \tilde\Omega(d_{\pi_E}) - \langle r,d\rangle_{\mathcal{S}\times\mathcal{A}} + \tilde\Omega(d)\big] \\
&= \min_{d\in\mathcal{D}}\big[\tilde\Omega(d) - \tilde\Omega(d_{\pi_E}) - \langle r,\, d - d_{\pi_E}\rangle_{\mathcal{S}\times\mathcal{A}}\big]. \tag{25}
\end{aligned}$$
Note that if $\nabla\tilde\Omega(d)$ is well-defined and $r = \nabla\tilde\Omega(d_{\pi_E})$ for any strictly convex $\tilde\Omega$, Eq.(25) is equal to
$$\min_{d\in\mathcal{D}}\,\tilde\Omega(d) - \tilde\Omega(d_{\pi_E}) - \langle\nabla\tilde\Omega(d_{\pi_E}),\, d - d_{\pi_E}\rangle_{\mathcal{S}\times\mathcal{A}} = \min_{d\in\mathcal{D}} D_{\tilde\Omega}^{\mathcal{S}\times\mathcal{A}}(d\,\|\,d_{\pi_E}),$$
where the equality comes from the definition of the Bregman divergence.

Proof of $t(s,a;\pi_d) = \nabla[\tilde\Omega(d)](s,a)$. For simpler notation, we use matrix-vector notation, considering discrete state and action spaces $\mathcal{S} = \{1,\dots,|\mathcal{S}|\}$ and $\mathcal{A} = \{1,\dots,|\mathcal{A}|\}$. For a normalized visitation distribution $d\in\mathcal{D}$, let us define $d^s_a := d(s,a)$, $s\in\mathcal{S}$, $a\in\mathcal{A}$, $\mathbf{d}^s := [d^s_1,\dots,d^s_{|\mathcal{A}|}]^T \in \mathbb{R}^{\mathcal{A}}$, $s\in\mathcal{S}$, and
$$\mathbf{D} := [\mathbf{d}^1,\dots,\mathbf{d}^{|\mathcal{S}|}]^T = \begin{bmatrix} d^1_1 & \cdots & d^1_{|\mathcal{A}|} \\ \vdots & \ddots & \vdots \\ d^{|\mathcal{S}|}_1 & \cdots & d^{|\mathcal{S}|}_{|\mathcal{A}|} \end{bmatrix} \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}, \qquad \boldsymbol{\pi}(\mathbf{x}) := \frac{\mathbf{x}}{\mathbf{1}_{\mathcal{A}}^T\mathbf{x}} = \frac{1}{\sum_{a\in\mathcal{A}} x_a}\,[x_1,\dots,x_{|\mathcal{A}|}]^T \in \mathbb{R}^{\mathcal{A}},$$
for $\mathbf{x} := [x_1,\dots,x_{|\mathcal{A}|}]^T \in \mathbb{R}^{\mathcal{A}}$, where $\mathbf{1}_{\mathcal{A}} = [1,\dots,1]^T \in \mathbb{R}^{\mathcal{A}}$ is the $|\mathcal{A}|$-dimensional all-one vector. With these notations, the original $\tilde\Omega$ can be rewritten as
$$\tilde\Omega(\mathbf{D}) = \sum_{s,a} d^s_a\,\Omega(\boldsymbol{\pi}(\mathbf{d}^s)) = \sum_{s\in\mathcal{S}} \mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s\,\Omega(\boldsymbol{\pi}(\mathbf{d}^s)).$$
The gradient of $\tilde\Omega$ w.r.t. $\mathbf{D}$ (using denominator-layout notation) is $\nabla_{\mathbf{D}}\tilde\Omega(\mathbf{D}) = \big[\frac{\partial\tilde\Omega(\mathbf{D})}{\partial\mathbf{d}^1},\dots,\frac{\partial\tilde\Omega(\mathbf{D})}{\partial\mathbf{d}^{|\mathcal{S}|}}\big]^T \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, where each element satisfies
$$\frac{\partial\tilde\Omega(\mathbf{D})}{\partial\mathbf{d}^s} = \frac{\partial}{\partial\mathbf{d}^s}\sum_{s'\in\mathcal{S}}\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^{s'}\,\Omega(\boldsymbol{\pi}(\mathbf{d}^{s'})) = \Omega(\boldsymbol{\pi}(\mathbf{d}^s))\,\mathbf{1}_{\mathcal{A}} + \mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s\,\frac{\partial\boldsymbol{\pi}(\mathbf{d}^s)}{\partial\mathbf{d}^s}\frac{\partial\Omega(\boldsymbol{\pi}(\mathbf{d}^s))}{\partial\boldsymbol{\pi}(\mathbf{d}^s)} \tag{26}$$
for $\frac{\partial\boldsymbol{\pi}(\mathbf{d}^s)}{\partial\mathbf{d}^s} = \big[\frac{\partial\pi_1(\mathbf{d}^s)}{\partial\mathbf{d}^s},\dots,\frac{\partial\pi_{|\mathcal{A}|}(\mathbf{d}^s)}{\partial\mathbf{d}^s}\big]$, $\pi_a(\mathbf{d}^s) := \frac{d^s_a}{\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s}$. Note that each element of $\frac{\partial\pi_a(\mathbf{d}^s)}{\partial\mathbf{d}^s}$ satisfies
$$\frac{\partial\pi_a(\mathbf{d}^s)}{\partial d^s_{a'}} = \frac{\partial d^s_a}{\partial d^s_{a'}}(\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s)^{-1} + d^s_a\,\frac{\partial(\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s)^{-1}}{\partial d^s_{a'}} = \mathbb{I}\{a = a'\}(\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s)^{-1} - \pi_a(\mathbf{d}^s)(\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s)^{-1} = (\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s)^{-1}\big[\mathbb{I}\{a = a'\} - \pi_a(\mathbf{d}^s)\big],$$
and thus
$$\frac{\partial\boldsymbol{\pi}(\mathbf{d}^s)}{\partial\mathbf{d}^s} = (\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s)^{-1}\big[\mathbf{I}_{\mathcal{A}\times\mathcal{A}} - \mathbf{1}_{\mathcal{A}}[\boldsymbol{\pi}(\mathbf{d}^s)]^T\big]. \tag{27}$$
By substituting Eq.(27) into Eq.(26), we have
$$\begin{aligned}
\frac{\partial\tilde\Omega(\mathbf{D})}{\partial\mathbf{d}^s} &= \Omega(\boldsymbol{\pi}(\mathbf{d}^s))\,\mathbf{1}_{\mathcal{A}} + \mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s\,(\mathbf{1}_{\mathcal{A}}^T\mathbf{d}^s)^{-1}\big[\mathbf{I}_{\mathcal{A}\times\mathcal{A}} - \mathbf{1}_{\mathcal{A}}[\boldsymbol{\pi}(\mathbf{d}^s)]^T\big]\frac{\partial\Omega(\boldsymbol{\pi}(\mathbf{d}^s))}{\partial\boldsymbol{\pi}(\mathbf{d}^s)} \\
&= \frac{\partial\Omega(\boldsymbol{\pi}(\mathbf{d}^s))}{\partial\boldsymbol{\pi}(\mathbf{d}^s)} - \Big([\boldsymbol{\pi}(\mathbf{d}^s)]^T\frac{\partial\Omega(\boldsymbol{\pi}(\mathbf{d}^s))}{\partial\boldsymbol{\pi}(\mathbf{d}^s)}\Big)\mathbf{1}_{\mathcal{A}} + \Omega(\boldsymbol{\pi}(\mathbf{d}^s))\,\mathbf{1}_{\mathcal{A}}. \tag{28}
\end{aligned}$$
In function notation, Eq.(28) can be written as
$$\nabla[\tilde\Omega(d)](s,a) = [\nabla\Omega(\pi_d(\cdot|s))](a) - \mathbb{E}_{a'\sim\pi_d(\cdot|s)}\big[[\nabla\Omega(\pi_d(\cdot|s))](a')\big] + \Omega(\pi_d(\cdot|s)) = t(s,a;\pi_d)$$
for $t$ of Eq.(7) in Lemma 1.

H DERIVATION OF BREGMAN-DIVERGENCE-BASED MEASURE IN CONTINUOUS CONTROLS

Recall from Eq.(17) that the Bregman divergence in the continuous-control task is defined as
$$D_\omega^{\mathcal{X}}(P_1\|P_2) := \int_{\mathcal{X}}\big\{\omega(p_1(x)) - \omega(p_2(x)) - \omega'(p_2(x))(p_1(x) - p_2(x))\big\}\,dx. \tag{29}$$
Note that we consider $\Omega(p) = \int_{\mathcal{X}}\omega(p(x))\,dx = \int_{\mathcal{X}}[-f_\varphi(p(x))]\,dx$ for $f_\varphi(x) = x\varphi(x)$, which makes Eq.(29) equal to
$$\begin{aligned}
&\int_{\mathcal{X}}\big\{-p_1(x)\varphi(p_1(x)) + p_2(x)\varphi(p_2(x)) + f'_\varphi(p_2(x))(p_1(x) - p_2(x))\big\}\,dx \\
&= \int_{\mathcal{X}} p_1(x)\big[f'_\varphi(p_2(x)) - \varphi(p_1(x))\big]\,dx - \int_{\mathcal{X}} p_2(x)\big[f'_\varphi(p_2(x)) - \varphi(p_2(x))\big]\,dx \\
&= \mathbb{E}_{x\sim p_1}\big[f'_\varphi(p_2(x)) - \varphi(p_1(x))\big] - \mathbb{E}_{x\sim p_2}\big[f'_\varphi(p_2(x)) - \varphi(p_2(x))\big].
\end{aligned}$$
Thus, by considering a learning agent's policy $\pi^s = \pi(\cdot|s)$, the expert policy $\pi_E^s = \pi_E(\cdot|s)$, and the objective in Eq.(8) characterized by the Bregman divergence, we can use the following measure between expert and agent policies:
$$\mathbb{E}_{s\sim d_\pi}\big[D_\Omega^{\mathcal{A}}(\pi^s\|\pi_E^s)\big] = \mathbb{E}_{s\sim d_\pi}\Big[\mathbb{E}_{a\sim\pi^s}\big[f'_\varphi(\pi_E^s(a)) - \varphi(\pi^s(a))\big] - \mathbb{E}_{a\sim\pi_E^s}\big[f'_\varphi(\pi_E^s(a)) - \varphi(\pi_E^s(a))\big]\Big]. \tag{30}$$

I TSALLIS ENTROPY AND ASSOCIATED BREGMAN DIVERGENCE AMONG MULTI-VARIATE GAUSSIAN DISTRIBUTIONS

Based on the derivation in Nielsen & Nock (2011), we derive the Tsallis entropy and the associated Bregman divergence as follows. We first consider distributions in the exponential family
$$\pi(x) = \exp\big(\langle\theta, t(x)\rangle - F(\theta) + k(x)\big);$$
with suitable choices of $t$, $F$, and $k$, we can recover the multivariate Gaussian distribution (Nielsen & Nock, 2011). For two distributions $\pi$ and $\tilde\pi$ in the same family with $k(x) = 0$ that share $t$ and $F$, it can be shown that
$$I(\pi,\tilde\pi;\alpha,\beta) := \int\pi(x)^\alpha\,\tilde\pi(x)^\beta\,dx = \exp\big(F(\alpha\theta + \beta\tilde\theta) - \alpha F(\theta) - \beta F(\tilde\theta)\big),$$
since
$$\begin{aligned}
\int\pi(x)^\alpha\tilde\pi(x)^\beta\,dx &= \int\exp\big(\alpha\langle\theta,t(x)\rangle - \alpha F(\theta) + \beta\langle\tilde\theta,t(x)\rangle - \beta F(\tilde\theta)\big)\,dx \\
&= \exp\big(F(\alpha\theta+\beta\tilde\theta) - \alpha F(\theta) - \beta F(\tilde\theta)\big)\int\exp\big(\langle\alpha\theta+\beta\tilde\theta,\, t(x)\rangle - F(\alpha\theta+\beta\tilde\theta)\big)\,dx \\
&= \exp\big(F(\alpha\theta+\beta\tilde\theta) - \alpha F(\theta) - \beta F(\tilde\theta)\big).
\end{aligned}$$

I.1 TSALLIS ENTROPY

For $\varphi(x) = \frac{k}{q-1}(1 - x^{q-1})$ and $k = 1$, the Tsallis entropy of $\pi$ can be written as
$$T_q(\pi) := \mathbb{E}_{x\sim\pi}[\varphi(\pi(x))] = \int\pi(x)\frac{1 - \pi(x)^{q-1}}{q-1}\,dx = \frac{1 - \int\pi(x)^q\,dx}{q-1} = \frac{1}{q-1}\big(1 - I(\pi,\pi;q,0)\big) = \frac{1 - \exp\big(F(q\theta) - qF(\theta)\big)}{q-1}.$$
If $\pi$ is a multivariate Gaussian distribution, we have
$$F(q\theta) = \frac{q}{2}\mu^T\Sigma^{-1}\mu + \frac{1}{2}\log(2\pi)^d|\Sigma| - \frac{1}{2}\log q^d, \qquad qF(\theta) = \frac{q}{2}\mu^T\Sigma^{-1}\mu + \frac{q}{2}\log(2\pi)^d|\Sigma|,$$
$$F(q\theta) - qF(\theta) = \frac{1-q}{2}\log(2\pi)^d|\Sigma| - \frac{1}{2}\log q^d = (1-q)\Big[\frac{d}{2}\log 2\pi + \frac{1}{2}\log|\Sigma| - \frac{d\log q}{2(1-q)}\Big].$$
For $\Sigma = \mathrm{diag}\{\sigma_1^2,\dots,\sigma_d^2\}$, we have
$$F(q\theta) - qF(\theta) = (1-q)\Big[\frac{d}{2}\log 2\pi + \frac{1}{2}\log\prod_{i=1}^d\sigma_i^2 - \frac{d\log q}{2(1-q)}\Big] = (1-q)\sum_{i=1}^d\Big[\frac{\log 2\pi}{2} + \log\sigma_i - \frac{\log q}{2(1-q)}\Big].$$
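The closed form above can be checked against direct numerical integration. The following is our sketch for the univariate case ($d = 1$), not code from the paper:

```python
import numpy as np

def tsallis_gaussian_closed_form(q, sigmas):
    """T_q of a diagonal Gaussian via T_q = (1 - exp(F(q*theta) - q*F(theta))) / (q - 1),
    with F(q*theta) - q*F(theta) = (1-q) * sum_i [log(2*pi)/2 + log(sigma_i) - log(q)/(2*(1-q))]."""
    expo = (1 - q) * np.sum(np.log(2 * np.pi) / 2 + np.log(sigmas) - np.log(q) / (2 * (1 - q)))
    return (1 - np.exp(expo)) / (q - 1)

# Direct numerical check of T_q = (1 - int pi(x)^q dx) / (q - 1) for a 1-d Gaussian.
sigma, q = 0.7, 1.5
x = np.linspace(-12.0, 12.0, 400001)
dx = x[1] - x[0]
pdf = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
numeric = (1 - np.sum(pdf ** q) * dx) / (q - 1)
assert abs(numeric - tsallis_gaussian_closed_form(q, np.array([sigma]))) < 1e-8
```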

I.2 TRACTABLE FORM OF BASELINE

For $\varphi(x) = \frac{k}{q-1}(1 - x^{q-1})$, we have
$$f'_\varphi(x) = \frac{k}{q-1}(1 - qx^{q-1}) = \frac{k}{q-1}\big(q - qx^{q-1} - (q-1)\big) = \frac{qk}{q-1}(1 - x^{q-1}) - k = q\varphi(x) - k.$$
Therefore, the baseline can be rewritten as
$$\mathbb{E}_{x\sim\pi}\big[-f'_\varphi(\pi(x)) + \varphi(\pi(x))\big] = \mathbb{E}_{x\sim\pi}\big[k - q\varphi(\pi(x)) + \varphi(\pi(x))\big] = (1-q)\,T_q(\pi) + k.$$
For a multivariate Gaussian distribution $\pi$, a tractable form of this baseline follows directly from the tractable form of the Tsallis entropy $T_q(\pi)$ derived above.
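Both identities ($f'_\varphi = q\varphi - k$ and the baseline rewriting) can be checked numerically; a quick sketch of ours, with $q = 2$, $k = 1$, and a discrete distribution standing in for $\pi$:

```python
import numpy as np

rng = np.random.default_rng(4)
q, k = 2.0, 1.0

phi  = lambda x: k / (q - 1) * (1 - x ** (q - 1))
fphi = lambda x: x * phi(x)

# f'_phi(x) = q*phi(x) - k, checked against a central-difference derivative of f_phi.
xs = rng.uniform(0.05, 0.95, size=100)
h = 1e-6
num_deriv = (fphi(xs + h) - fphi(xs - h)) / (2 * h)
assert np.allclose(num_deriv, q * phi(xs) - k, atol=1e-6)

# Baseline identity: E_{a~pi}[-f'_phi(pi_a) + phi(pi_a)] = (1 - q) * T_q(pi) + k.
pi = rng.dirichlet(np.ones(6))
T_q = pi @ phi(pi)
baseline = pi @ (-(q * phi(pi) - k) + phi(pi))
assert abs(baseline - ((1 - q) * T_q + k)) < 1e-12
```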

I.3 BREGMAN DIVERGENCE WITH TSALLIS ENTROPY REGULARIZATION

In Eq.(30), we consider the following form of the Bregman divergence:
$$\int\pi(x)\big\{f'_\varphi(\tilde\pi(x)) - \varphi(\pi(x))\big\}\,dx - \int\tilde\pi(x)\big\{f'_\varphi(\tilde\pi(x)) - \varphi(\tilde\pi(x))\big\}\,dx.$$
For $\varphi(x) = \frac{k}{q-1}(1 - x^{q-1})$, $f'_\varphi(x) = \frac{k}{q-1}(1 - qx^{q-1}) = q\varphi(x) - k$, and $k = 1$, the above form is equal to
$$\int\pi(x)\frac{1 - q\tilde\pi(x)^{q-1}}{q-1}\,dx - T_q(\pi) - (q-1)T_q(\tilde\pi) + 1 = \frac{q}{q-1} - \frac{q}{q-1}\int\pi(x)\tilde\pi(x)^{q-1}\,dx - T_q(\pi) - (q-1)T_q(\tilde\pi).$$
For multivariate Gaussians $\pi(x) = \mathcal{N}(x;\mu,\Sigma)$ with $\mu = [\nu_1,\dots,\nu_d]^T$, $\Sigma = \mathrm{diag}(\sigma_1^2,\dots,\sigma_d^2)$, and $\tilde\pi(x) = \mathcal{N}(x;\tilde\mu,\tilde\Sigma)$ with $\tilde\mu = [\tilde\nu_1,\dots,\tilde\nu_d]^T$, $\tilde\Sigma = \mathrm{diag}(\tilde\sigma_1^2,\dots,\tilde\sigma_d^2)$, we have
$$\int\pi(x)\tilde\pi(x)^{q-1}\,dx = I(\pi,\tilde\pi;1,q-1) = \exp\big(F(\theta') - F(\theta) - (q-1)F(\tilde\theta)\big),$$
where
$$\theta = \Big(\Sigma^{-1}\mu,\ -\tfrac{1}{2}\Sigma^{-1}\Big), \qquad \tilde\theta = \Big(\tilde\Sigma^{-1}\tilde\mu,\ -\tfrac{1}{2}\tilde\Sigma^{-1}\Big), \qquad \theta' = \theta + (q-1)\tilde\theta = \Big(\Sigma^{-1}\mu + (q-1)\tilde\Sigma^{-1}\tilde\mu,\ -\tfrac{1}{2}\big(\Sigma^{-1} + (q-1)\tilde\Sigma^{-1}\big)\Big),$$
and, for the diagonal case,
$$F(\theta') = \sum_{i=1}^d\Bigg[\frac{1}{2}\cdot\frac{\Big(\frac{\nu_i}{\sigma_i^2} + (q-1)\frac{\tilde\nu_i}{\tilde\sigma_i^2}\Big)^2}{\frac{1}{\sigma_i^2} + (q-1)\frac{1}{\tilde\sigma_i^2}} + \frac{\log 2\pi}{2} + \frac{1}{2}\log\frac{1}{\frac{1}{\sigma_i^2} + (q-1)\frac{1}{\tilde\sigma_i^2}}\Bigg].$$

Published as a conference paper at ICLR 2021

J.5 HYPERPARAMETERS



It turns out that AIRL minimizes the divergence between visitation distributions $d_\pi$ and $d_{\pi_E}$ by solving $\min_{\pi\in\Delta_{\mathcal{A}}^{\mathcal{S}}}\mathrm{KL}(d_\pi\|d_{\pi_E})$ for the Kullback-Leibler (KL) divergence $\mathrm{KL}(\cdot\|\cdot)$ (Ghasemipour et al., 2019).

Given the policy regularizer $\Omega$, let us define $\tilde\Omega(d) := \mathbb{E}_{(s,a)\sim d}[\Omega(\pi_d(\cdot|s))]$ for an arbitrary normalized state-action visitation distribution $d \in \Delta_{\mathcal{S}\times\mathcal{A}}$ and the policy $\pi_d(a|s) := \frac{d(s,a)}{\sum_{a'\in\mathcal{A}} d(s,a')}$ induced by $d$. Then, Eq.(7) is equal to $t(s,a;\pi_d) = [\nabla\tilde\Omega(d)](s,a)$.
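This gradient identity can be spot-checked by finite differences; below is our sketch for the Shannon regularizer on a small discrete space (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
S, A = 3, 4

omega = lambda p: float(np.sum(p * np.log(p)))   # negative Shannon entropy

def omega_tilde(d):
    """Omega_tilde(d) = E_{(s,a)~d}[Omega(pi_d(.|s))] = sum_s (sum_a d(s,a)) * Omega(pi_d(.|s))."""
    return sum(d[s].sum() * omega(d[s] / d[s].sum()) for s in range(S))

def t_reward(pi_row):
    """t(s, .; pi) for the Shannon regularizer: grad - <grad, pi> + Omega(pi)."""
    g = np.log(pi_row) + 1.0
    return g - g @ pi_row + omega(pi_row)

# A normalized visitation distribution with entries bounded away from zero.
d = rng.uniform(0.5, 1.5, size=(S, A))
d /= d.sum()

# Finite-difference gradient of Omega_tilde matches t(s, a; pi_d) entrywise.
h = 1e-6
for s in range(S):
    pi_d = d[s] / d[s].sum()
    for a in range(A):
        dp = d.copy(); dp[s, a] += h
        dm = d.copy(); dm[s, a] -= h
        fd = (omega_tilde(dp) - omega_tilde(dm)) / (2 * h)
        assert abs(fd - t_reward(pi_d)[a]) < 1e-5
```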

Figure 1: Bregman divergence $D_\Omega(\pi\|\pi_E)$ associated with the Tsallis entropy ($\Omega(p) = -T_q(p)$) between two univariate Gaussian distributions $\pi = \mathcal{N}(\mu,\sigma^2)$ and $\pi_E = \mathcal{N}(0, (e^{-3})^2)$ (green point in each subplot). In each subplot, the Bregman divergence is normalized so that the maximum value is 1. Note that for $q = 1$, $D_\Omega(\pi\|\pi_E)$ becomes the KL divergence $\mathrm{KL}(\pi\|\pi_E)$.

Figure 2: Expert policy (Left) and rewards learned by RAIRL with different types of policy regularizers (Right) in Multi-armed Bandit. Either a dense (Top row) or a sparse (Bottom row) expert policy $\pi_E$ is considered.

Figure 3: Mean Bregman divergence during training (Top row) and the divergence during reward evaluation (Bottom row) in Bermuda World. In each column, different policy regularizers and their respective target divergences are considered. The results are normalized by the divergence of a uniform random policy, and the divergence of behavioral cloning (BC) is reported for comparison.

Figure 4: Averaged episodic score during RAIRL training in MuJoCo environments, for RAIRL with $T_q$ regularization.






Table 3 and Table 4 list the hyperparameters used in our Bandit, Bermuda World, and MuJoCo experiments.

Hyperparameters for Bandit environments.

Hyperparameters for Bermuda World environment.

Hyperparameters for MuJoCo environments.

ACKNOWLEDGMENTS

We would like to thank Daewoo Kim at Waymo for his valuable comments on this work.


Table 1: Policy regularizers $\varphi$ and their corresponding $f_\varphi$ (Yang et al., 2019).

By exploiting the knowledge of the reward in Corollary 1, we consider the density-based model (DBM). Here, $f_\varphi$ is a function that can be known a priori, and $\pi_{\theta_1}$ is a neural network defined separately from the policy network $\pi_\psi$. The DBM for discrete control problems is depicted in Figure 6 (Left): the model outputs rewards over all actions in parallel, where a softmax is used for $\pi_{\theta_1}(\cdot|s)$, $-f_\varphi$ is applied elementwise to the softmax outputs, and $B_{\theta_2}(s)$ is then added elementwise. For continuous control (Figure 6, Right), we use a network architecture similar to that for discrete control, with a multivariate Gaussian distribution in place of the softmax layer.

For Bermuda World, we assume a stochastic expert defined for $s = (x,y)$ with $\bar s^{(1)} = (\bar x^{(1)}, \bar y^{(1)}) = (-5, 10)$, $\bar s^{(2)} = (\bar x^{(2)}, \bar y^{(2)}) = (0, 10)$, $\bar s^{(3)} = (\bar x^{(3)}, \bar y^{(3)}) = (5, 10)$, $\epsilon = 10^{-4}$, and an operator $\mathrm{Proj}(\theta) : \mathbb{R} \to \mathcal{A}$ that maps $\theta$ to the closest angle in $\mathcal{A}$. Figure 7 depicts the expert policy.

For MuJoCo, instead of directly using environments with the tanh-squashed policies proposed in Soft Actor-Critic (SAC) (Haarnoja et al., 2018), we move the tanh into the environment (named hyperbolized environments for short) and assume Gaussian policies. Specifically, after an action $a$ is sampled from the policy, we pass $\tanh(a)$ to the environment, and we consider a multivariate Gaussian policy for all $i = 1,\dots,d$. Instead of clipping, we use tanh-activated outputs and scale them to fit in the above ranges, which empirically improves performance.
Also, instead of the potential-based reward shaping used in AIRL (Fu et al., 2018), we track a moving mean of the intermediate reward values and update the value network with mean-subtracted rewards, so that the value network receives approximately mean-zero rewards, which stabilizes the RL part of RAIRL. Note that this is motivated by Lemma 2, from which we can guarantee that any constant shift of the reward function does not change optimality.

