REGULARIZED INVERSE REINFORCEMENT LEARNING

Abstract

Inverse Reinforcement Learning (IRL) aims to facilitate a learner's ability to imitate expert behavior by acquiring reward functions that explain the expert's decisions. Regularized IRL applies strongly convex regularizers to the learner's policy in order to avoid the expert's behavior being rationalized by arbitrary constant rewards, also known as degenerate solutions. Current methods are restricted to the maximum-entropy IRL framework, limiting them to Shannon-entropy regularizers, and the solutions they propose are intractable in practice. We propose tractable solutions for regularized IRL, along with practical methods to obtain them. We present theoretical backing for our proposed IRL method's applicability to both discrete and continuous control, and empirically validate its performance on a variety of tasks.

1. INTRODUCTION

Reinforcement learning (RL) has been successfully applied to many challenging domains including games (Mnih et al., 2015; 2016) and robot control (Schulman et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). Advanced RL methods often employ policy regularization motivated by, e.g., boosting exploration (Haarnoja et al., 2018) or safe policy improvement (Schulman et al., 2015). While Shannon entropy is often used as a policy regularizer (Ziebart et al., 2008), Geist et al. (2019) recently proposed a theoretical foundation of regularized Markov decision processes (MDPs), a framework that uses strongly convex functions as policy regularizers. One crucial advantage is that an optimal policy is shown to uniquely exist, whereas multiple optimal policies may exist in the absence of policy regularization. Meanwhile, since RL requires a given or known reward function (which can often involve non-trivial reward engineering), Inverse Reinforcement Learning (IRL) (Russell, 1998; Ng et al., 2000), the problem of acquiring a reward function that promotes expert-like behavior, is more generally adopted in practical scenarios like robotic manipulation (Finn et al., 2016b), autonomous driving (Sharifzadeh et al., 2016; Wu et al., 2020) and clinical motion analysis (Li et al., 2018). In these scenarios, defining a reward function beforehand is particularly challenging and IRL is simply more pragmatic. However, a complication with IRL in unregularized MDPs is the issue of degeneracy, where any constant function can rationalize the expert's behavior (Ng et al., 2000). Fortunately, Geist et al. (2019) show that IRL in regularized MDPs, i.e., regularized IRL, does not admit such degenerate solutions, due to the uniqueness of the optimal policy in regularized MDPs.
Despite this, no tractable solutions for regularized IRL, other than maximum-Shannon-entropy IRL (MaxEntIRL) (Ziebart et al., 2008; Ziebart, 2010; Ho & Ermon, 2016; Finn et al., 2016a; Fu et al., 2018), have been proposed. Geist et al. (2019) introduced solutions for regularized IRL, but they are generally intractable in practice since they require a closed-form relation between the policy and the optimal value function, as well as knowledge of the model dynamics. Furthermore, practical algorithms for solving regularized IRL problems have not yet been proposed. We summarize our contributions as follows. Unlike the solutions in Geist et al. (2019), we propose tractable solutions for regularized IRL problems that can be derived from the policy regularizer and its gradient in discrete control problems (Section 3.1). We additionally show that our solutions are tractable for Tsallis-entropy regularization with multivariate Gaussian policies in continuous control problems (Section 3.2). We devise Regularized Adversarial Inverse Reinforcement Learning (RAIRL), a practical sample-based method for policy imitation and reward learning in regularized MDPs, which generalizes adversarial IRL (AIRL; Fu et al., 2018) (Section 4). Finally, we empirically validate RAIRL on both discrete and continuous control tasks, evaluating it via episodic scores and from a divergence-minimization perspective (Ke et al., 2019; Ghasemipour et al., 2019; Dadashi et al., 2020) (Section 5).

* Correspondence to: Wonseok Jeon <jeonwons@mila.quebec>

2. PRELIMINARIES

Notation  For finite sets $X$ and $Y$, $Y^X$ is the set of functions from $X$ to $Y$. $\Delta_X$ ($\Delta_X^Y$) is the set of probability distributions over $X$ (of conditional distributions over $X$ conditioned on $Y$). In particular, for a conditional probability $p_{X|Y} \in \Delta_X^Y$, we write $p_{X|Y}(\cdot|y) \in \Delta_X$ for $y \in Y$. $\mathbb{R}$ is the set of real numbers. For functions $f_1, f_2 \in \mathbb{R}^X$, we define $\langle f_1, f_2 \rangle_X := \sum_{x \in X} f_1(x) f_2(x)$.

Regularized Markov Decision Processes and Reinforcement Learning  We consider sequential decision-making problems where an agent sequentially chooses its action after observing the state of the environment, and the environment in turn emits a reward with a state transition. Such an interaction between the agent and the environment is modeled as an infinite-horizon Markov Decision Process (MDP) $M_r := \langle S, A, P_0, P, r, \gamma \rangle$ together with the agent's policy $\pi \in \Delta_A^S$. The terms within the MDP are defined as follows: $S$ is a finite state space, $A$ is a finite action space, $P_0 \in \Delta_S$ is an initial state distribution, $P \in \Delta_S^{S \times A}$ is a state transition probability, $r \in \mathbb{R}^{S \times A}$ is a reward function, and $\gamma \in [0, 1)$ is the discount factor. We also define an MDP without reward as $M_- := \langle S, A, P_0, P, \gamma \rangle$. The normalized state-action visitation distribution $d_\pi \in \Delta_{S \times A}$ associated with $\pi$ is defined as the expected discounted state-action visitation of $\pi$, i.e., $d_\pi(s, a) := (1 - \gamma) \cdot \mathbb{E}_\pi\big[\sum_{i=0}^\infty \gamma^i \mathbb{I}\{s_i = s, a_i = a\}\big]$, where the subscript $\pi$ on $\mathbb{E}$ means that a trajectory $(s_0, a_0, s_1, a_1, \ldots)$ is randomly generated from $M_-$ and $\pi$, and $\mathbb{I}\{\cdot\}$ is an indicator function. Note that $d_\pi$ satisfies the transposed Bellman recurrence (Boularias & Chaib-Draa, 2010; Zhang et al., 2019):

$$d_\pi(s, a) = (1 - \gamma) P_0(s) \pi(a|s) + \gamma \, \pi(a|s) \sum_{\bar{s}, \bar{a}} P(s|\bar{s}, \bar{a}) \, d_\pi(\bar{s}, \bar{a}).$$

We consider RL in regularized MDPs (Geist et al., 2019), where the policy is optimized with a causal policy regularizer. Mathematically, for an MDP $M_r$ and a strongly convex function $\Omega : \Delta_A \to \mathbb{R}$, the objective in regularized MDPs is to seek $\pi$ that maximizes the expected discounted sum of rewards, or return in short, with policy regularizer $\Omega$:

$$\max_{\pi \in \Delta_A^S} J_\Omega(r, \pi) := \mathbb{E}_\pi\Big[\sum_{i=0}^\infty \gamma^i \big\{r(s_i, a_i) - \Omega(\pi(\cdot|s_i))\big\}\Big] = \frac{1}{1 - \gamma} \mathbb{E}_{(s,a) \sim d_\pi}\big[r(s, a) - \Omega(\pi(\cdot|s))\big]. \quad (1)$$

It turns out that the optimal solution of Eq. (1) is unique (Geist et al., 2019), whereas multiple optimal policies may exist in unregularized MDPs (see Appendix A for a detailed explanation). In later work (Yang et al., 2019), the regularizer $\Omega(p) = -\lambda \mathbb{E}_{a \sim p}[\phi(p(a))]$, $p \in \Delta_A$, was considered for $\lambda > 0$ and $\phi : (0, 1] \to \mathbb{R}$ satisfying some mild conditions. For example, RL with Shannon entropy regularization (Haarnoja et al., 2018) can be recovered by $\phi(x) = -\log x$, while RL with Tsallis entropy regularization (Lee et al., 2020) can be recovered from $\phi(x) = \frac{k}{q-1}(1 - x^{q-1})$ for $k > 0$, $q > 1$. The optimal policy $\pi_*$ for Eq. (1) with $\Omega$ from Yang et al. (2019) is shown to be

$$\pi_*(a|s) = \max\Big(g_\phi\Big(\tfrac{\mu_*(s) - Q_*(s, a)}{\lambda}\Big), 0\Big), \quad (2)$$
$$Q_*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[V_*(s')], \qquad V_*(s) = \mu_*(s) - \lambda \sum_{a \in A} \pi_*(a|s)^2 \phi'(\pi_*(a|s)), \quad (3)$$

where $\phi'(x) = \frac{\partial}{\partial x}\phi(x)$, $g_\phi$ is the inverse function of $f_\phi$ for $f_\phi(x) := \phi(x) + x\phi'(x)$, $x \in (0, 1]$, and $\mu_*$ is a normalization term such that $\sum_{a \in A} \pi_*(a|s) = 1$. Note that we still need to determine $\mu_*$ to acquire a closed-form relation between the optimal policy $\pi_*$ and value function $Q_*$. However, such relations
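As a concrete check of the transposed Bellman recurrence, the sketch below builds a small random tabular MDP, solves the recurrence as a linear system for the state marginal of $d_\pi$, and verifies the recurrence elementwise. This is a minimal NumPy illustration on a toy instance, not code from the paper; all variable names (`P_pi`, `rho`, etc.) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

P0 = rng.dirichlet(np.ones(nS))                 # initial state distribution
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s] = action distribution at s

# The recurrence factorizes as d_pi(s, a) = rho(s) * pi(a|s), where the
# state marginal rho satisfies rho = (1 - gamma) * P0 + gamma * P_pi^T rho,
# with P_pi[s, x] = sum_a P(x | s, a) pi(a|s) the state kernel under pi.
P_pi = np.einsum('sax,sa->sx', P, pi)
rho = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, (1 - gamma) * P0)
d = rho[:, None] * pi                           # d_pi(s, a)

# Verify the transposed Bellman recurrence elementwise, and that d is normalized.
rhs = (1 - gamma) * P0[:, None] * pi + gamma * pi * (P_pi.T @ rho)[:, None]
assert np.allclose(d, rhs)
assert np.isclose(d.sum(), 1.0)
```

Because the recurrence is linear in $d_\pi$, solving this $|S|$-dimensional system is exact and, for small tabular MDPs, far cheaper than estimating the discounted visitation by Monte Carlo rollouts.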
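As a sanity check on the optimal-policy characterization above (via $g_\phi$ and the normalizer $\mu_*$), the sketch below instantiates the Shannon case $\phi(x) = -\log x$, under the convention $f_\phi(x) = \phi(x) + x\phi'(x) = -\log x - 1$ from Yang et al. (2019), so $g_\phi(y) = \exp(-y - 1)$. The normalizer then has the closed form $\mu_*(s) = \lambda(\log \sum_a \exp(Q_*(s,a)/\lambda) - 1)$, and the resulting policy coincides with the familiar softmax policy of maximum-entropy RL. The $Q$-values here are an arbitrary toy example.

```python
import numpy as np

lam = 0.5
Q = np.array([1.0, 2.0, -0.5])   # toy Q-values for one state

# Shannon case: phi(x) = -log x, so f_phi(x) = phi(x) + x*phi'(x) = -log x - 1,
# whose inverse is g_phi(y) = exp(-y - 1).
g_phi = lambda y: np.exp(-y - 1.0)

# Normalization sum_a g_phi((mu - Q_a)/lam) = 1 gives mu in closed form:
mu = lam * (np.log(np.exp(Q / lam).sum()) - 1.0)

# Optimal policy: pi(a|s) = max(g_phi((mu - Q_a)/lam), 0)
pi = np.maximum(g_phi((mu - Q) / lam), 0.0)

# This should coincide with the softmax policy of maximum-entropy RL.
softmax = np.exp(Q / lam) / np.exp(Q / lam).sum()
assert np.allclose(pi, softmax)
assert np.isclose(pi.sum(), 1.0)
```

For the Tsallis case $\phi(x) = \frac{k}{q-1}(1 - x^{q-1})$ with $q = 2$, the same recipe instead yields a sparsemax-style policy in which low-value actions receive exactly zero probability, which is why the outer $\max(\cdot, 0)$ in the policy expression matters beyond the Shannon case.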

