STOCHASTIC INVERSE REINFORCEMENT LEARNING

Abstract

The goal of the inverse reinforcement learning (IRL) problem is to recover the reward function from expert demonstrations. However, the IRL problem, like any ill-posed inverse problem, suffers from the congenital defect that a policy may be optimal for many reward functions, and expert demonstrations may be optimal for many policies. In this work, we generalize the IRL problem to a well-posed expectation optimization problem, stochastic inverse reinforcement learning (SIRL), which recovers a probability distribution over reward functions. We adopt the Monte Carlo expectation-maximization (MCEM) method to estimate the parameters of this probability distribution, giving the first solution to the SIRL problem. The solution is succinct, robust, and transferable for a learning task and can generate alternative solutions to the IRL problem. Our formulation makes it possible to observe intrinsic properties of the IRL problem from a global viewpoint, and our approach achieves considerable performance on the objectworld.

1. INTRODUCTION

The IRL problem addresses an inverse problem in which a set of expert demonstrations determines a reward function over a Markov decision process (MDP) whose model dynamics are known Russell (1998); Ng et al. (2000). The recovered reward function provides a succinct, robust, and transferable definition of the learning task and completely determines the optimal policy. However, the IRL problem is ill-posed: a policy may be optimal for many reward functions, and expert demonstrations may be optimal for many policies. For example, every policy is optimal for a constant reward function. A further challenge is that, in real-world scenarios, experts often act sub-optimally or inconsistently. To overcome these limitations, two classes of probabilistic approaches to the IRL problem have been proposed: Bayesian inverse reinforcement learning (BIRL) Ramachandran & Amir (2007), based on Bayesian maximum a posteriori (MAP) estimation, and maximum entropy IRL (MaxEnt) Ziebart et al. (2008); Ziebart (2010), based on frequentist maximum likelihood estimation (MLE). BIRL solves for the distribution of reward functions without assuming that experts behave optimally, and encodes external a priori information in the choice of a prior distribution. However, BIRL suffers from the practical limitation that a large number of algorithmic iterations is required by the Markov chain Monte Carlo (MCMC) procedure that samples the posterior over reward functions. Advanced techniques, for example the kernel technique Michini & How (2012) and the gradient method Choi & Kim (2011), have been proposed to improve the efficiency and tractability of this procedure. MaxEnt employs the principle of maximum entropy to resolve the ambiguity in choosing demonstrations over a policy. This class of methods, inheriting its merits from previous non-probabilistic IRL approaches including Ng et al. (2000); Abbeel & Ng (2004); Ratliff et al. (2006); Abbeel et al. (2008); Syed & Schapire (2008); Ho et al. (2016), imposes regular structures on reward functions as combinations of hand-selected features.
Formally, the reward function is a linear or nonlinear combination of feature basis functions, a set of real-valued functions {φ_i(s, a)} hand-selected by experts. The goal of this approach is to find the best-fitting weights of the feature basis functions through MLE. Wulfmeier et al. (2015) and Levine et al. (2011) use deep neural networks and Gaussian processes, respectively, to fit the parameters based on demonstrations, but still suffer from the problem that the true reward is shaped by the changing environment dynamics. Influenced by the work of Finn et al. (2016a;b), Fu et al. (2017) propose a framework called adversarial IRL (AIRL) that recovers robust reward functions under changing dynamics based on adversarial learning, and achieves superior results. Compared with AIRL, another adversarial method, generative adversarial imitation learning (GAIL) Ho & Ermon (2016), seeks to directly recover the expert's policy rather than a reward function. Many follow-up methods enhance and extend GAIL for multiple purposes in various application scenarios Li et al. (2017); Hausman et al. (2017); Wang et al. (2017). However, GAIL lacks an explanation of the expert's behavior and a portable representation for knowledge transfer, which are the merits of the MaxEnt class of approaches, because the MaxEnt approach is equipped with "transferable" regular structures over reward functions. In this paper, under the framework of the MaxEnt approach, we propose a generalized perspective on the IRL problem called stochastic inverse reinforcement learning (SIRL). It is formulated as an expectation optimization problem that aims to recover a probability distribution over reward functions from expert demonstrations. The solution of SIRL is succinct and robust for the learning task in the sense that it can generate more than one weight vector over the feature basis functions, and these weight vectors compose alternative solutions to the IRL problem. Benefiting from the MaxEnt class of methods, the solution to our generalized SIRL problem is also transferable. Because of the intractable integration in our formulation, we employ the Monte Carlo expectation-maximization (MCEM) approach Wei & Tanner (1990) to give the first solution to the SIRL problem in a model-based environment. In general, the solutions to the IRL problem obtained by previous approaches are not always best-fitting, because a highly nonlinear inverse problem with limited information is very likely to get trapped in a secondary maximum during the recovery. Taking advantage of the Monte Carlo mechanism of a global exhaustive search, our MCEM approach avoids secondary maxima and is theoretically convergent, as demonstrated in the literature Caffo et al. (2005); Chan & Ledolter (1995).
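The linear regular structure of the reward function can be sketched in code as follows; the indicator features and weight values below are illustrative placeholders, not the paper's hand-selected basis.

```python
import numpy as np

# Hand-selected feature basis functions phi_i(s, a); these particular
# indicator features are illustrative placeholders, not the paper's choices.
def feature_vector(s, a, n_states, n_actions):
    phi = np.zeros(n_states + n_actions)
    phi[s] = 1.0                 # state indicator feature
    phi[n_states + a] = 1.0      # action indicator feature
    return phi

def linear_reward(w, s, a, n_states, n_actions):
    # R(s, a) = sum_i w_i * phi_i(s, a)
    return float(w @ feature_vector(s, a, n_states, n_actions))

w = np.array([0.5, -0.2, 1.0, 0.1, 0.0])  # example weights (2 states, 3 actions)
print(linear_reward(w, 1, 2, 2, 3))       # -0.2
```

Finding the best-fitting weight vector w by MLE, or, in SIRL, a distribution over such vectors, is then the object of the inverse problem.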
Our approach also converges quickly because of the preset simple geometric configuration over the weight space, which we approximate with a Gaussian mixture model (GMM). Hence, our approach works well in real-world scenarios with small and variable sets of expert demonstrations. In particular, the contributions of this paper are threefold:
1. We generalize the IRL problem to a well-posed expectation optimization problem, SIRL.
2. We provide the first solution to the SIRL problem, with theoretical convergence, via the MCEM approach.
3. We show the effectiveness of our approach by comparing the performance of the proposed method to those of previous algorithms on the objectworld.
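The MCEM loop over the weight space can be sketched schematically as follows; for brevity it uses a single diagonal Gaussian rather than a full GMM, and a toy stand-in for the demonstration likelihood, so it illustrates only the sample, reweight, and refit structure, not the paper's actual objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the demonstration log-likelihood of a weight vector w;
# a real implementation would evaluate the MaxEnt trajectory likelihood.
def demo_log_likelihood(w):
    return -0.5 * np.sum((w - 1.0) ** 2)

def mcem_step(mu, sigma, n_samples=500):
    # Monte Carlo E-step: draw candidate weight vectors from the
    # current distribution over the weight space.
    samples = rng.normal(mu, sigma, size=(n_samples, mu.size))
    log_w = np.array([demo_log_likelihood(w) for w in samples])
    weights = np.exp(log_w - log_w.max())
    weights /= weights.sum()
    # M-step: re-estimate the distribution's parameters from the
    # likelihood-weighted samples.
    new_mu = weights @ samples
    new_sigma = np.sqrt(weights @ (samples - new_mu) ** 2 + 1e-8)
    return new_mu, new_sigma

mu, sigma = np.zeros(3), np.ones(3)
for _ in range(20):
    mu, sigma = mcem_step(mu, sigma)
print(mu)  # each component approaches the toy optimum near 1.0
```

Because each E-step draws fresh samples across the whole weight space, the iteration retains the global-search flavor that the paper credits for avoiding secondary maxima.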

2. PRELIMINARY

An MDP is a tuple M := ⟨S, A, T, R, γ⟩, where S is the set of states, A is the set of actions, and the transition function (a.k.a. model dynamics) T := P(s_{t+1} = s' | s_t = s, a_t = a), for s, s' ∈ S and a ∈ A, is the probability of being in current state s, taking action a, and arriving at next state s'. The reward function R(s, a) is a real-valued function, and γ ∈ [0, 1) is the discount factor. A policy π : S → A is deterministic or stochastic, where the deterministic one is written as a = π(s), and the stochastic one as a conditional distribution π(a|s). Sequential decisions are recorded in a series of episodes which consist of states s, actions a, and rewards r. The goal of reinforcement learning is to obtain an optimal policy π* maximizing the expected total reward, i.e., π* := argmax_π E[ Σ_{t=0}^{∞} γ^t · R(s_t, a_t) | π ]. We are given an MDP without a reward function R, i.e., MDP\R = ⟨S, A, T⟩, and m expert demonstrations ζ_E := {ζ_1, …, ζ_m}, where each expert demonstration ζ_i is a sequence of state-action pairs. The goal of the IRL problem is to estimate the unknown reward function R(s, a) from the expert demonstrations ζ_E; the recovered reward function completes the MDP, and the completed MDP yields an optimal policy that acts as closely as possible to the expert demonstrations.
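The preliminaries above can be made concrete with a small value-iteration example that recovers an optimal policy from a complete MDP; the two-state MDP below is illustrative, not taken from the paper.

```python
import numpy as np

# A tiny two-state, two-action MDP (illustrative numbers).
n_states, n_actions, gamma = 2, 2, 0.9
# T[s, a, s'] = P(s' | s, a)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])  # R(s, a): state 1 is rewarding

# Value iteration: V(s) = max_a [ R(s, a) + gamma * sum_{s'} T(s, a, s') V(s') ]
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * T @ V   # Q[s, a], summing T over the next-state axis
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = Q.argmax(axis=1)   # deterministic policy a = pi(s)
print(policy)               # [1 1]: both states steer toward rewarding state 1
```

The IRL problem runs this pipeline in reverse: given trajectories generated by such a policy, it must estimate the R that would reproduce them.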





3. REGULAR STRUCTURE OF REWARD FUNCTIONS

In this section, we provide a formal definition of the regular (linear/nonlinear) structure of reward functions. The linear structure Ng et al. (2000); Ziebart et al. (2008); Syed & Schapire (2008); Ho

