STOCHASTIC INVERSE REINFORCEMENT LEARNING

Abstract

The goal of the inverse reinforcement learning (IRL) problem is to recover the reward function from expert demonstrations. However, the IRL problem, like any ill-posed inverse problem, suffers from the congenital defect that a policy may be optimal for many reward functions, and expert demonstrations may be optimal for many policies. In this work, we generalize the IRL problem to a well-posed expectation optimization problem, stochastic inverse reinforcement learning (SIRL), which recovers a probability distribution over reward functions. We adopt the Monte Carlo expectation-maximization (MCEM) method to estimate the parameters of this probability distribution, giving the first solution to the SIRL problem. The solution is succinct, robust, and transferable for a learning task, and can generate alternative solutions to the IRL problem. Our formulation makes it possible to observe the intrinsic properties of the IRL problem from a global viewpoint, and our approach achieves considerable performance on the objectworld environment.

1. INTRODUCTION

The IRL problem addresses the inverse problem of determining a reward function over a Markov decision process (MDP) from a set of expert demonstrations when the model dynamics are known Russell (1998); Ng et al. (2000). The recovered reward function provides a succinct, robust, and transferable definition of the learning task and completely determines the optimal policy. However, the IRL problem is ill-posed in that a policy may be optimal for many reward functions and expert demonstrations may be optimal for many policies. For example, every policy is optimal for a constant reward function. A further challenge is that, in real-world scenarios, experts often act sub-optimally or inconsistently. To overcome these limitations, two classes of probabilistic approaches to the IRL problem have been proposed: Bayesian inverse reinforcement learning (BIRL) Ramachandran & Amir (2007), based on the Bayesian maximum a posteriori (MAP) estimate, and maximum entropy IRL (MaxEnt) Ziebart et al. (2008); Ziebart (2010), based on the frequentist maximum likelihood estimate (MLE). BIRL solves for a distribution over reward functions without assuming that experts behave optimally, and encodes external a priori information in the choice of a prior distribution. However, BIRL suffers from the practical limitation that a large number of algorithmic iterations is required by the Markov chain Monte Carlo (MCMC) procedure to sample the posterior over reward functions. Advanced techniques, such as the kernel technique Michini & How (2012) and gradient methods Choi & Kim (2011), have been proposed to improve its efficiency and tractability. MaxEnt employs the principle of maximum entropy to resolve the ambiguity in choosing a distribution over demonstrations for a policy. This class of methods, inheriting its merits from earlier non-probabilistic IRL approaches including Ng et al. (2000); Abbeel & Ng (2004); Ratliff et al. (2006); Abbeel et al. (2008); Syed & Schapire (2008); Ho et al. (2016), imposes a regular structure on reward functions as a combination of hand-selected features. Formally, the reward function is a linear or nonlinear combination of feature basis functions, a set of real-valued functions {φ_i(s, a)} hand-selected by experts. The goal of this approach is to find the best-fitting weights of the feature basis functions via MLE. Wulfmeier et al. (2015) and Levine et al. (2011)
use deep neural networks and Gaussian processes, respectively, to fit the parameters from demonstrations, but still suffer from the problem that the true reward is shaped by changing environment dynamics. Influenced by the work of Finn et al. (2016a;b), Fu et al. (2017) propose a framework called adversarial IRL (AIRL) that recovers robust reward functions under changing dynamics via adversarial learning and achieves superior results. Compared with AIRL, another adversarial method called generative

