UNDERSTANDING AND ADOPTING RATIONAL BEHAVIOR BY BELLMAN SCORE ESTIMATION

Abstract

We are interested in solving a class of problems that seek to understand and adopt rational behavior from demonstrations. These problems may be broadly classified into four categories: reward identification, counterfactual analysis, behavior imitation, and behavior transfer. In this work, we make a key observation that knowing how changes in the underlying rewards affect the optimal behavior allows one to solve a variety of the aforementioned problems. To a local approximation, this quantity is precisely captured by what we term the Bellman score, i.e., the gradient of the log probabilities of the optimal policy with respect to the reward. We introduce the Bellman score operator, which provably converges to the gradient of the infinite-horizon optimal Q-values with respect to the reward; this gradient can then be used to directly estimate the score. Guided by our theory, we derive a practical score-learning algorithm that can be used for score estimation in high-dimensional state-action spaces. We show that score-learning can be used to reliably identify rewards, perform counterfactual predictions, achieve state-of-the-art behavior imitation, and transfer policies across environments.

1. INTRODUCTION

A hallmark of intelligence is the ability to achieve goals with rational behavior. For sequential decision making problems, rational behavior is often formalized as a policy that is optimal with respect to a Markov Decision Process (MDP). In other words, intelligent agents are postulated to learn rational behavior for reaching goals by maximizing the reward of an underlying MDP (Doya, 2007; Neftci & Averbeck, 2019; Niv, 2009). The reward encodes information about the goal, while the remainder of the MDP characterizes the interplay between the agent's decisions and the environment. In this work, we are interested in solving a class of problems that seek to understand and adopt rational (i.e., optimal) behavior from demonstrations of sequential decisions. We may broadly classify them into the categories of reward identification, counterfactual analysis, behavior imitation, and behavior transfer (see Appendix D for detailed definitions).

These four problems of understanding and adopting rationality arise across a wide spectrum of scientific fields. For example, in econometrics, the field of Dynamic Discrete Choice (DDC) seeks algorithms for fitting reward functions to human decision making behavior observed in labor (Keane & Wolpin, 1994) or financial markets (Arcidiacono & Miller, 2011). The identified utilities are leveraged to garner deeper insight into people's decision strategies (Arcidiacono & Miller, 2020), make counterfactual predictions about how their choices will change in response to market interventions that alter the reward function (Keane & Wolpin, 1997), and train machine learning models that can make rational decisions like humans (Kalouptsidi et al., 2015). In animal psychology and neuroscience, practitioners model the decision strategies of animals by fitting reward models to their observed behavior.
The fitted reward is analyzed and then used to train AI models that simulate animal movements (Yamaguchi et al., 2018; Schafer et al., 2022) in order to gain a better understanding of ecologically and evolutionarily significant phenomena such as habitat selection and migration (Hirakawa et al., 2018). In robot learning, Inverse Reinforcement Learning (IRL) is used to infer reward functions from teleoperated robots, and the learned rewards are fed through an RL algorithm to teach a robot how to perform the same task without human controls (Fu et al., 2018; Finn et al., 2016b; Chan & van der Schaar, 2021). As rewards for a task are often invariant to perturbations in the environment (e.g., walking can be described by the same reward function, encouraging forward velocity and stability, regardless of whether the agent is walking on ice or grass), the same task can be learned in various environmental conditions by optimizing the same inferred reward (Fu et al., 2018; Zhang et al., 2018).

We make a key observation that knowing how changes in the underlying rewards affect the optimal behavior (policy) allows one to solve a variety of the aforementioned problems. To a local approximation, this quantity is precisely captured by what we term the Bellman score, i.e., the gradient of the log probabilities of the optimal policy with respect to the reward. Prior works that study quantities related to the Bellman score are scarce, and they relied on full knowledge of the environment dynamics or on small discrete state-action spaces (Neu & Szepesvári, 2012; Vroman, 2014; Li et al., 2017). We introduce the Bellman score operator, which provably converges to the gradient of the infinite-horizon optimal Q-values with respect to the reward. This Q-gradient can then be used to estimate the score. We further show that the Q-gradient is equivalent to the conditional state-action visitation counts under the optimal policy.
With these results, we derive the gradient of the Maximum Entropy IRL (Ziebart et al., 2008; Finn et al., 2016b; Fu et al., 2018) objective in the general setting with stochastic dynamics and non-linear reward models. Guided by the theory, we propose a powerful score-learning algorithm that can be used for model-free score estimation in continuous, high-dimensional state-action spaces, as well as an effective IRL algorithm named Gradient Actor-Critic (GAC). Our experiments demonstrate that score-learning can be used to reliably identify rewards, make counterfactual predictions, imitate behaviors, and transfer policies across environments.
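As a toy sanity check of the equivalence between the Q-gradient and conditional visitation counts, the sketch below (not the paper's code; the problem sizes and the soft value iteration routine are our own illustrative choices) computes soft-optimal Q-values for a random tabular MDP under the maximum-entropy notion of optimality used in the MaxEnt IRL works cited above, and verifies by finite differences that $\partial Q^*(x,a) / \partial r(x',a')$ matches the discounted conditional occupancy $(I - \gamma P_{\pi^*})^{-1}$ of the soft-optimal policy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a = 3, 2          # illustrative sizes: |X| = 3 states, |A| = 2 actions
n_xa = n_x * n_a         # state-action pairs flattened to one index (x * n_a + a)
gamma = 0.9

# Random transition matrix P in R^{|XxA| x |X|} and reward vector r in R^{|XxA|}.
P = rng.random((n_xa, n_x)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_xa)

def soft_q(r, iters=2000):
    """Soft (max-ent) value iteration: Q = r + gamma * P V, V(x) = logsumexp_a Q(x,a)."""
    Q = np.zeros(n_xa)
    for _ in range(iters):
        V = np.log(np.exp(Q.reshape(n_x, n_a)).sum(axis=1))
        Q = r + gamma * (P @ V)
    return Q

Q = soft_q(r)
V = np.log(np.exp(Q.reshape(n_x, n_a)).sum(axis=1))
pi = np.exp(Q.reshape(n_x, n_a) - V[:, None])   # soft-optimal policy pi*(a|x)

# Conditional occupancy of pi*: rho(.|x,a) stacked row-wise is (I - gamma P_pi)^{-1}.
P_pi = np.einsum('ks,sa->ksa', P, pi).reshape(n_xa, n_xa)
rho = np.linalg.inv(np.eye(n_xa) - gamma * P_pi)

# Finite-difference Q-gradient: column j holds dQ(.)/dr_j.
eps = 1e-5
grad = np.zeros((n_xa, n_xa))
for j in range(n_xa):
    dr = np.zeros(n_xa); dr[j] = eps
    grad[:, j] = (soft_q(r + dr) - soft_q(r - dr)) / (2 * eps)

assert np.allclose(grad, rho, atol=1e-3)
```

The check passes because the log-sum-exp value satisfies $\partial V(x) / \partial Q(x,a) = \pi^*(a|x)$, so differentiating the soft Bellman equation through the fixed point yields exactly the Neumann series $\sum_n \gamma^n P_{\pi^*}^n$.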

2. PRELIMINARIES

A Markov Decision Process (MDP) $M \in \Omega$, where $\Omega$ is the set of all MDPs, is a tuple $M = (\mathcal{X}, \mathcal{A}, P, P_0, r_\theta, \gamma)$ where $\mathcal{X}$ is the discrete[foot_0] state space, $\mathcal{A}$ is the discrete action (decision) space, $P \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times |\mathcal{X}|}$ is the transition probability matrix, $P_0 \in \mathbb{R}^{|\mathcal{X}|}$ is the initial state distribution, $r_\theta \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ is the (stationary) parametric reward with parameters $\theta \in \Theta$, and $\gamma \in [0, 1]$ is the discount factor. $\Theta$ is the parameter space, typically set to a finite-dimensional real vector space $\mathbb{R}^{\dim(\theta)}$. We will use $T \geq 0$ to denote the time horizon for finite-horizon problems. A domain $d$ is an MDP without the reward, $M \setminus r_\theta$. Moving forward, we will alternate between vector and function notation; e.g., $r_\theta(x, a)$ and $P(x'|x, a)$ denote the value of the vector $r_\theta$ at the dimension for the state-action pair $(x, a)$ and the value of the matrix $P$ at the location for $((x, a), x')$. Furthermore, one-dimensional vectors are treated as row vectors, e.g., $\mathbb{R}^{|\mathcal{X} \times \mathcal{A}|} = \mathbb{R}^{1 \times |\mathcal{X} \times \mathcal{A}|}$. A (stationary) policy is a vector $\pi \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ that represents distributions over actions, i.e., $\sum_{a \in \mathcal{A}} \pi(a|x) = 1$ for all $x$. A non-stationary policy for time horizon $T$ is a sequence of policies $\pi = (\pi_t)_{t=0}^{T} \in (\mathbb{R}^{|\mathcal{X} \times \mathcal{A}|})^{T+1}$, where $\pi_t$ is the policy used when there are $t$ environment steps remaining[foot_1], and $\pi_{:k} = (\pi_t)_{t=0}^{k}$ for $k \leq T$ denotes a subsequence. When there is no confusion, we will use $\pi$ to denote both stationary and non-stationary policies. Next, $P_\pi \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times |\mathcal{X} \times \mathcal{A}|}$ denotes the transition matrix of the Markov chain on state-action pairs induced by policy $\pi$, i.e., $P_\pi(x', a'|x, a) = P(x'|x, a)\pi(a'|x')$. Furthermore, $P_\pi^n$ denotes powers of the transition matrix, with $P_\pi^0 = I$ where $I \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times |\mathcal{X} \times \mathcal{A}|}$ is the identity. For non-stationary policies, we define $P_\pi^n = P_{\pi_{T-1}} P_{\pi_{T-2}} \cdots P_{\pi_{T-n}}$ for $n \geq 1$ and $P_\pi^0 = I$. Let $\delta_{x,a} \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ denote the indicator vector which has value 1 at the dimension corresponding to $(x, a)$ and 0 elsewhere.
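To make the notation concrete, here is a minimal sketch (the sizes and random values are our own illustrative choices, not from the paper) that builds the induced state-action transition matrix $P_\pi$ from a transition matrix $P$ and a stationary policy $\pi$, flattening each state-action pair $(x, a)$ to the single index $x \cdot |\mathcal{A}| + a$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy sizes: |X| = 3 states, |A| = 2 actions.
n_x, n_a = 3, 2
n_xa = n_x * n_a  # flattened state-action index

# Transition matrix P in R^{|XxA| x |X|}: each row is a distribution over next states.
P = rng.random((n_xa, n_x))
P /= P.sum(axis=1, keepdims=True)

# Stationary policy pi(a|x): each state's action probabilities sum to 1.
pi = rng.random((n_x, n_a))
pi /= pi.sum(axis=1, keepdims=True)

# Induced chain on state-action pairs: P_pi(x',a'|x,a) = P(x'|x,a) * pi(a'|x').
P_pi = np.zeros((n_xa, n_xa))
for xa in range(n_xa):
    for xp in range(n_x):
        for ap in range(n_a):
            P_pi[xa, xp * n_a + ap] = P[xa, xp] * pi[xp, ap]

# Sanity check: every row of P_pi is itself a probability distribution.
assert np.allclose(P_pi.sum(axis=1), 1.0)
```

The explicit triple loop mirrors the definition $P_\pi(x', a'|x, a) = P(x'|x, a)\pi(a'|x')$ entry by entry; in practice the same matrix is built in one line with `np.einsum('ks,sa->ksa', P, pi).reshape(n_xa, n_xa)`.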
The conditional marginal distribution for both stationary and non-stationary policies $\pi$ after $n$ environment steps is $p_{\pi,n}(\cdot|x, a) = \delta_{x,a} P_\pi^n \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$. The (unnormalized) $t$-step[foot_2] conditional occupancy measure of $\pi$ is the discounted sum of conditional marginals:
$$\rho_{\pi,t}(\cdot|x, a) = \sum_{n=0}^{t} \gamma^n p_{\pi,n}(\cdot|x, a) = \sum_{n=0}^{t} \gamma^n \delta_{x,a} P_\pi^n.$$
Intuitively, $\rho_{\pi,t}(x', a'|x, a)$ quantifies the visitation frequency for $(x', a')$ when an agent starts from $(x, a)$ and runs $\pi$ for $t$ environment steps, with more weight on earlier visits. The infinite-horizon conditional occupancy exists for $\gamma < 1$, and we will denote it by simply omitting the subscript $t$, i.e., $\rho_\pi(\cdot|x, a) = \lim_{t \to \infty} \rho_{\pi,t}(\cdot|x, a)$. The (unconditional) occupancy measure can be recovered by $\rho_{\pi,t}(x', a') = \sum_{x,a} P_0(x) \pi_T(a|x) \rho_{\pi,t}(x', a'|x, a)$. The $t$-step Q-values $Q_{\pi,t} \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ for both stationary and non-stationary policies $\pi$ are defined as $Q_{\pi,t}(x, a) = \mathbb{E}_{x',a' \sim \rho_{\pi,t}(\cdot|x,a)}[r_\theta(x', a')] = \rho_{\pi,t}(\cdot|x, a) \cdot r_\theta$, i.e., the conditional expectation of discounted reward sums when there are $t$ environment steps remaining.
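These definitions can be checked numerically. The sketch below (an illustrative toy, not the paper's code) accumulates $\rho_{\pi,t}(\cdot|x,a) = \sum_{n=0}^{t} \gamma^n \delta_{x,a} P_\pi^n$ for every start pair at once, then confirms that $Q_{\pi,t} = \rho_{\pi,t}(\cdot|x,a) \cdot r_\theta$ agrees with $t+1$ applications of the Bellman recursion $Q \leftarrow r_\theta + \gamma P_\pi Q$:

```python
import numpy as np

# Toy setup (hypothetical sizes and values, not from the paper).
rng = np.random.default_rng(1)
n_x, n_a = 3, 2
n_xa = n_x * n_a
gamma, t = 0.9, 50

P = rng.random((n_xa, n_x)); P /= P.sum(axis=1, keepdims=True)
pi = rng.random((n_x, n_a)); pi /= pi.sum(axis=1, keepdims=True)
P_pi = np.einsum('ks,sa->ksa', P, pi).reshape(n_xa, n_xa)
r = rng.random(n_xa)  # reward vector r_theta in R^{|XxA|}

# Row (x,a) of `rho` is the conditional occupancy rho_{pi,t}(.|x,a):
# sum_{n=0}^{t} gamma^n P_pi^n, since stacking delta_{x,a} over all pairs gives I.
rho = np.zeros((n_xa, n_xa))
P_n = np.eye(n_xa)  # P_pi^0 = I
for n in range(t + 1):
    rho += gamma**n * P_n
    P_n = P_n @ P_pi

# Q_{pi,t}(x,a) = rho_{pi,t}(.|x,a) . r_theta for all (x,a) at once.
Q_from_rho = rho @ r

# Cross-check: t+1 steps of the Bellman recursion starting from Q = 0.
Q = np.zeros(n_xa)
for _ in range(t + 1):
    Q = r + gamma * (P_pi @ Q)

assert np.allclose(Q_from_rho, Q)
```

Note that the rows of $\rho_{\pi,t}$ are unnormalized (they sum to $\sum_{n=0}^{t} \gamma^n$ rather than 1), matching the unnormalized occupancy measure defined above.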



[foot_0] We use discrete spaces for notational brevity later on, but our results can be extended to continuous spaces.
[foot_1] Backwards indexing is standard in dynamic programming RL algorithms (Bertsekas & Tsitsiklis, 1995).
[foot_2] Step refers to the number of environment transitions and not the number of decisions made.

