SCALABLE BAYESIAN INVERSE REINFORCEMENT LEARNING

Abstract

Bayesian inference over the reward presents an ideal solution to the ill-posed nature of the inverse reinforcement learning problem. Unfortunately, current methods generally do not scale well beyond small tabular settings, due to the need for an inner-loop MDP solver; even the non-Bayesian methods that do scale often require extensive interaction with the environment to perform well, making them inappropriate for high-stakes or costly applications such as healthcare. In this paper we introduce Approximate Variational Reward Imitation Learning (AVRIL), a method that addresses both of these issues by jointly learning an approximate posterior distribution over the reward, which scales to arbitrarily complicated state spaces, alongside an appropriate policy, in a completely offline manner through a variational treatment of the latent reward. Applying our method to real medical data alongside classic control simulations, we demonstrate Bayesian reward inference in environments beyond the scope of current methods, as well as task performance competitive with focused offline imitation learning algorithms.

1. INTRODUCTION

Deploying learning agents in complicated, high-stakes environments often means operating in the minimal possible setting: no access to the environment dynamics or the intrinsic reward, and no ability to interact with the environment to test policies. Learning and inference must then be done solely on the basis of logged trajectories from a competent demonstrator, showing only the states visited and the action taken in each. Clinical decision making is an important example: there is great interest in learning policies from medical professionals, but it is completely impractical and unethical to deploy policies on patients mid-training. Moreover, in this area we are interested not only in the policies, but also in the demonstrator's preferences and goals. While imitation learning (IL) generally deals with producing appropriate policies to match a demonstrator, the added layer of understanding motivations is usually approached through inverse reinforcement learning (IRL): attempting to learn the assumed underlying reward driving the demonstrator, before secondarily learning a policy that is optimal with respect to that reward using some forward reinforcement learning (RL) technique. Composing the RL and IRL procedures in order to perform IL yields apprenticeship learning (AL), which introduces its own challenges, particularly in the offline setting. Notably, for any given set of demonstrations there are (infinitely) many rewards for which the actions would be optimal (Ng et al., 2000). Max-margin (Abbeel & Ng, 2004) and max-entropy (Ziebart et al., 2008) methods for heuristically differentiating between plausible rewards do so at the cost of potentially dismissing the true reward for not possessing the desired qualities.
A Bayesian approach to IRL (BIRL), on the other hand, is more conceptually satisfying: taking a probabilistic view of the reward, we are interested in its posterior distribution having seen the demonstrations (Ramachandran & Amir, 2007), accounting for all possibilities. BIRL is not without its own drawbacks, though. As noted in Brown & Niekum (2019), it is typically inappropriate for modern complicated environments, assuming linear rewards and small, solvable environments, and requiring repeated, inner-loop calls to forward RL. The main contribution of this paper is a method for advancing BIRL beyond these obstacles, allowing approximate reward inference using an arbitrarily flexible class of functions, in any environment, without costly inner-loop operations, and, importantly, entirely offline. This leads to our algorithm, AVRIL, depicted in Figure 1, which represents a framework for jointly learning a variational posterior distribution over the reward alongside an imitator policy in an auto-encoder-esque manner. In what follows we review modern methods for offline IRL/IL (Section 2), with a focus on the Bayesian approach to IRL and the issues it faces when confronted with challenging environments. We then address these issues by introducing our contributions (Section 3), and demonstrate the gains of our algorithm on real medical data and simulated control environments, notably that it is now possible to achieve Bayesian reward inference in such settings (Section 4). Finally, we wrap up with some concluding thoughts and directions (Section 5). Code for AVRIL and our experiments is available at https://github.com/XanderJC/scalable-birl and https://github.com/vanderschaarlab/mlforhealthlabpub.
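To make the framework concrete, the following is a minimal tabular numpy sketch of the general idea: a diagonal-Gaussian variational posterior over the reward, an imitator Q-table whose Boltzmann policy must explain the demonstrated actions, and a term encouraging the Q-function's one-step temporal-difference residual to be probable under that posterior. The exact objective is developed later in the paper, so the specific loss terms, weights, and names here (`avril_style_loss`, `beta`) are our own illustrative assumptions, not the definitive method.

```python
import numpy as np

def softmax_logprob(q_row, a):
    """Log-probability of action a under a Boltzmann policy over q_row."""
    z = q_row - q_row.max()  # shift for numerical stability
    return z[a] - np.log(np.exp(z).sum())

def avril_style_loss(Q, r_mean, r_log_std, data, gamma=0.99, beta=1.0):
    """Negative-ELBO-style objective on (s, a, s', a') index tuples.

    Q:        (S, A) imitator Q-table.
    r_mean,
    r_log_std:(S, A) diagonal-Gaussian variational posterior over the
              reward, with a unit-Gaussian prior (an assumption here).
    """
    nll, td_loglik = 0.0, 0.0
    for s, a, s2, a2 in data:
        nll -= softmax_logprob(Q[s], a)      # imitate demonstrator actions
        td = Q[s, a] - gamma * Q[s2, a2]     # reward implied by the Q-table
        var = np.exp(2 * r_log_std[s, a])
        # log-density of the implied reward under the variational posterior
        td_loglik += -0.5 * ((td - r_mean[s, a]) ** 2 / var
                             + np.log(2 * np.pi * var))
    # closed-form KL(q || N(0, 1)) for a diagonal Gaussian posterior
    kl = 0.5 * (r_mean ** 2 + np.exp(2 * r_log_std) - 2 * r_log_std - 1).sum()
    return nll - beta * td_loglik + kl
```

In practice the tables would be replaced by neural networks over the state space and the loss minimised by gradient descent, which is what allows the approach to scale beyond tabular settings.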

2. APPROACHING APPRENTICESHIP AND IMITATION OFFLINE

Preliminaries. We consider the standard Markov decision process (MDP) environment, with states s ∈ S, actions a ∈ A, transitions T ∈ ∆(S)^{S×A}, rewards R ∈ ℝ^{S×A},¹ and discount γ ∈ [0, 1]. For a policy π ∈ ∆(A)^S, let ρ_π(s, a) = E_{π,T}[Σ_{t=0}^∞ γ^t 1{s_t = s, a_t = a}] be the induced unique occupancy measure, alongside the state-only occupancy measure ρ_π(s) = Σ_{a∈A} ρ_π(s, a). Despite this full environment model, the only information available to us is the MDP\{R, T}: we have access to neither the underlying reward nor the transitions, and our lack of knowledge of the transitions is strong in the sense that we are also unable to simulate the environment to sample them. The learning signal is instead given by access to m trajectories of a demonstrator assumed to be acting optimally w.r.t. the MDP, following a policy π_D, making up a data set D_raw = {(s_1^{(i)}, a_1^{(i)}, ..., s_{τ^{(i)}}^{(i)}, a_{τ^{(i)}}^{(i)})}_{i=1}^m, where s_t^{(i)} is the state and a_t^{(i)} the action taken at step t of the ith demonstration, and τ^{(i)} is the (maximum) time horizon of the ith demonstration. Given the Markov assumption, though, it is sufficient and convenient to consider the demonstrations simply as a collection of n state, action, next-state, next-action tuples, D = {(s_i, a_i, s'_i, a'_i)}_{i=1}^n with n = Σ_{i=1}^m (τ^{(i)} − 1).

Apprenticeship through rewards. Typically AL proceeds by first inferring an appropriate reward function with an IRL procedure (Ng et al., 2000; Ramachandran & Amir, 2007; Rothkopf & Dimitrakakis, 2011; Ziebart et al., 2008) before running forward RL to obtain an appropriate policy. This allows for easy mix-and-match procedures, swapping in different standard RL and IRL methods depending on the situation. These algorithms, however, depend on either knowledge of T in order to solve the MDP exactly or the ability to perform roll-outs in the environment, with little previous work focusing on the entirely offline setting.
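The flattening of D_raw into the tuple dataset D can be sketched as below; the function name is our own, but the construction follows the definition above, with each trajectory of length τ contributing τ − 1 tuples.

```python
def trajectories_to_tuples(raw_trajectories):
    """Flatten m trajectories [(s_1, a_1), ..., (s_tau, a_tau)] into
    n = sum_i (tau_i - 1) transition tuples (s, a, s', a')."""
    tuples = []
    for traj in raw_trajectories:
        for t in range(len(traj) - 1):
            (s, a), (s_next, a_next) = traj[t], traj[t + 1]
            tuples.append((s, a, s_next, a_next))
    return tuples

# Two demonstrations of lengths 3 and 2 yield n = (3-1) + (2-1) = 3 tuples.
demos = [
    [("s0", 0), ("s1", 1), ("s2", 0)],
    [("s0", 1), ("s3", 0)],
]
dataset = trajectories_to_tuples(demos)
assert len(dataset) == 3
```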
One simple solution is through attempting to learn the dynamics



¹We define a state-action reward here, as is usual in the literature. Extensions to a state-only reward are simple, and indeed can be preferable, as we will see later.



Figure 1: Overview. AVRIL is a framework for BIRL that works through an approximation in the variational Bayesian framework, considering the reward to be a latent representation of behaviour. A distribution over the reward, which is amortised over the demonstration space, is learnt that then informs an imitator Q-function policy. The dotted line represents a departure from a traditional auto-encoder, as the input, alongside the latent reward, informs the decoder.

