REGULARIZED INVERSE REINFORCEMENT LEARNING

Abstract

Inverse Reinforcement Learning (IRL) aims to facilitate a learner's ability to imitate expert behavior by acquiring reward functions that explain the expert's decisions. Regularized IRL applies strongly convex regularizers to the learner's policy in order to prevent the expert's behavior from being rationalized by arbitrary constant rewards, known as degenerate solutions. Existing methods are restricted to the maximum-entropy IRL framework, limiting them to Shannon-entropy regularizers, and the solutions they propose are intractable in practice. We propose tractable solutions for regularized IRL, along with practical methods to obtain them. We present theoretical backing for our proposed IRL method's applicability to both discrete and continuous control, and empirically validate its performance on a variety of tasks.

1. INTRODUCTION

Reinforcement learning (RL) has been successfully applied to many challenging domains including games (Mnih et al., 2015; 2016) and robot control (Schulman et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). Advanced RL methods often employ policy regularization motivated by, e.g., boosting exploration (Haarnoja et al., 2018) or safe policy improvement (Schulman et al., 2015). While Shannon entropy is often used as a policy regularizer (Ziebart et al., 2008), Geist et al. (2019) recently proposed a theoretical foundation of regularized Markov decision processes (MDPs)-a framework that uses strongly convex functions as policy regularizers. Here, one crucial advantage is that an optimal policy is shown to uniquely exist, whereas multiple optimal policies may exist in the absence of policy regularization. Meanwhile, since RL requires a given or known reward function (which can often involve non-trivial reward engineering), Inverse Reinforcement Learning (IRL) (Russell, 1998; Ng et al., 2000)-the problem of acquiring a reward function that promotes expert-like behavior-is more generally adopted in practical scenarios like robotic manipulation (Finn et al., 2016b), autonomous driving (Sharifzadeh et al., 2016; Wu et al., 2020) and clinical motion analysis (Li et al., 2018). In these scenarios, defining a reward function beforehand is particularly challenging, and IRL is simply more pragmatic. However, complications with IRL in unregularized MDPs relate to the issue of degeneracy, where any constant function can rationalize the expert's behavior (Ng et al., 2000). Fortunately, Geist et al. (2019) show that IRL in regularized MDPs-regularized IRL-does not contain such degenerate solutions, due to the uniqueness of the optimal policy for regularized MDPs.
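To make the regularized-MDP framework concrete, the learner's objective can be written as the usual discounted return penalized state-wise by a strongly convex regularizer on the policy; the following is a sketch in standard notation (our own illustration, not an equation taken from a specific paper):

```latex
% Regularized MDP objective: \Omega is strongly convex over the
% probability simplex, so the maximizer \pi^* is unique.
J_{\Omega}(\pi) \;=\;
\mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty}
  \gamma^{t}\,\Big( r(s_t, a_t) \;-\; \Omega\big(\pi(\cdot \mid s_t)\big) \Big)\right].

% Choosing the negative Shannon entropy,
%   \Omega\big(\pi(\cdot \mid s)\big) = \sum_{a} \pi(a \mid s)\,\log \pi(a \mid s),
% recovers maximum-entropy RL, whose optimal policy is the softmax
% (Boltzmann) policy over the regularized action-value function:
\pi^{*}(a \mid s) \;\propto\; \exp\!\big( Q^{*}_{\Omega}(s, a) \big).
```

The strong convexity of \Omega is what rules out ties among optimal policies, which in turn is why regularized IRL avoids the degenerate constant-reward solutions of the unregularized setting.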
Despite this, no tractable solutions for regularized IRL-other than maximum-Shannon-entropy IRL (MaxEntIRL) (Ziebart et al., 2008; Ziebart, 2010; Ho & Ermon, 2016; Finn et al., 2016a; Fu et al., 2018)-have been proposed. Geist et al. (2019) introduced solutions for regularized IRL; however, they are generally intractable in practice since they require a closed-form relation between the policy and the optimal value function, as well as knowledge of the model dynamics. Furthermore, practical algorithms for solving regularized IRL problems have not yet been proposed. We summarize our contributions as follows. Unlike the solutions in Geist et al. (2019), we propose tractable solutions for regularized IRL problems that can be derived from policy regularization and

