UNDERSTANDING AND ADOPTING RATIONAL BEHAVIOR BY BELLMAN SCORE ESTIMATION

Abstract

We are interested in solving a class of problems that seek to understand and adopt rational behavior from demonstrations. We may broadly classify these problems into four categories: reward identification, counterfactual analysis, behavior imitation, and behavior transfer. In this work, we make a key observation that knowing how changes in the underlying rewards affect the optimal behavior allows one to solve a variety of the aforementioned problems. To a local approximation, this quantity is precisely captured by what we term the Bellman score, i.e. the gradient of the log probabilities of the optimal policy with respect to the reward. We introduce the Bellman score operator, which provably converges to the gradient of the infinite-horizon optimal Q-values with respect to the reward, which can then be used to directly estimate the score. Guided by our theory, we derive a practical score-learning algorithm which can be used for score estimation in high-dimensional state-action spaces. We show that score-learning can be used to reliably identify rewards, perform counterfactual predictions, achieve state-of-the-art behavior imitation, and transfer policies across environments.

1. INTRODUCTION

A hallmark of intelligence is the ability to achieve goals with rational behavior. For sequential decision making problems, rational behavior is often formalized as a policy that is optimal with respect to a Markov Decision Process (MDP). In other words, intelligent agents are postulated to learn rational behavior for reaching goals by maximizing the reward of an underlying MDP (Doya, 2007; Neftci & Averbeck, 2019; Niv, 2009). The reward encodes information about the goal while the remainder of the MDP characterizes the interplay between the agent's decisions and the environment. In this work, we are interested in solving a class of problems that seek to understand and adopt rational (i.e. optimal) behavior from demonstrations of sequential decisions. We may broadly classify them into the categories of reward identification, counterfactual analysis, behavior imitation, and behavior transfer (see Appendix D for detailed definitions). These four problems of understanding and adopting rationality arise across a wide spectrum of scientific fields. For example, in econometrics, the field of Dynamic Discrete Choice (DDC) seeks algorithms for fitting reward functions to human decision making behavior observed in labor (Keane & Wolpin, 1994) or financial markets (Arcidiacono & Miller, 2011). The identified utilities are leveraged to garner deeper insight into people's decision strategies (Arcidiacono & Miller, 2020), make counterfactual predictions about how their choices will change in response to market interventions that alter the reward function (Keane & Wolpin, 1997), and train machine learning models that can make rational decisions like humans (Kalouptsidi et al., 2015). In animal psychology and neuroscience, practitioners model the decision strategies of animals by fitting reward models to their observed behavior.
The fitted reward is analyzed and then used to train AI models that simulate animal movements (Yamaguchi et al., 2018; Schafer et al., 2022) in order to gain a better understanding of ecologically and evolutionarily significant phenomena such as habitat selection and migration (Hirakawa et al., 2018). In robot learning, Inverse Reinforcement Learning (IRL) is used to infer reward functions from teleoperated robots, and the learned rewards are fed through an RL algorithm to teach a robot how to perform the same task without human controls (Fu et al., 2018; Finn et al., 2016b; Chan & van der Schaar, 2021). As rewards for a task are often invariant to perturbations in the environment (e.g. walking can be described by the same reward function, which encourages forward velocity and stability, regardless of whether the agent is walking on ice or grass), the same task can be learned in various environmental conditions by optimizing the same inferred reward (Fu et al., 2018; Zhang et al., 2018). We make a key observation that knowing how changes in the underlying rewards affect the optimal behavior (policy) allows one to solve a variety of the aforementioned problems. To a local approximation, this quantity is precisely captured by what we term the Bellman score, i.e. the gradient of the log probabilities of the optimal policy with respect to the reward. Prior works studying quantities related to the Bellman score are scarce and rely on full knowledge of the environment dynamics or small discrete state-action spaces (Neu & Szepesvári, 2012; Vroman, 2014; Li et al., 2017). We introduce the Bellman score operator, which provably converges to the gradient of the infinite-horizon optimal Q-values with respect to the reward. This Q-gradient can then be used to estimate the score. We further show that the Q-gradient is equivalent to the conditional state-action visitation counts under the optimal policy.
With these results, we derive the gradient of the Maximum Entropy IRL (Ziebart et al., 2008; Finn et al., 2016b; Fu et al., 2018) objective in the general setting of stochastic dynamics and non-linear reward models. Guided by our theory, we propose a powerful score-learning algorithm that can be used for model-free score estimation in continuous high-dimensional state-action spaces, and an effective IRL algorithm named Gradient Actor-Critic (GAC). Our experiments demonstrate that score-learning can be used to reliably identify rewards, make counterfactual predictions, imitate behaviors, and transfer policies across environments.

2. PRELIMINARIES

A Markov Decision Process (MDP) $M \in \Omega$, where $\Omega$ is the set of all MDPs, is a tuple $M = (\mathcal{X}, \mathcal{A}, P, P_0, r_\theta, \gamma)$ where $\mathcal{X}$ is the discrete[1] state space, $\mathcal{A}$ is the discrete action (decision) space, $P \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times |\mathcal{X}|}$ is the transition probability matrix, $P_0 \in \mathbb{R}^{|\mathcal{X}|}$ is the initial state distribution, $r_\theta \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ is the (stationary) parametric reward with parameters $\theta \in \Theta$, and $\gamma \in [0, 1]$ is the discount factor. $\Theta$ is the parameter space, typically set to a finite-dimensional real vector space $\mathbb{R}^{\dim(\theta)}$. We will use $T \geq 0$ to denote the time horizon for finite-horizon problems. A domain $d$ is an MDP without the reward, $M \setminus r$. Moving forward we will alternate between vector and function notation, e.g. $P(x'|x,a)$ and $r_\theta(x,a)$ denote the value of the vector $r_\theta$ at the dimension for the state-action $(x,a)$ and the value of the matrix $P$ at the location for $((x,a), x')$. Furthermore, one-dimensional vectors are treated as row vectors, e.g. $\mathbb{R}^{|\mathcal{X} \times \mathcal{A}|} = \mathbb{R}^{1 \times |\mathcal{X} \times \mathcal{A}|}$. A (stationary) policy is a vector $\pi \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ that represents distributions over actions, i.e. $\sum_{a \in \mathcal{A}} \pi(a|x) = 1$ for all $x$. A non-stationary policy for time horizon $T$ is a sequence of policies $\pi = (\pi_t)_{t=0}^{T} \in (\mathbb{R}^{|\mathcal{X} \times \mathcal{A}|})^{T+1}$ where $\pi_t$ is the policy used when there are $t$ environment steps remaining,[2] and $\pi_{:k} = (\pi_t)_{t=0}^{k}$ for $k \leq T$ denotes a subsequence. When there is no confusion we will use $\pi$ to denote both stationary and non-stationary policies. Next, $P_\pi \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times |\mathcal{X} \times \mathcal{A}|}$ denotes the transition matrix of the Markov chain on state-action pairs induced by policy $\pi$, i.e. $P_\pi(x', a'|x, a) = P(x'|x,a)\pi(a'|x')$. Furthermore, $P_\pi^n$ denotes powers of the transition matrix with $P_\pi^0 = I$ where $I \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times |\mathcal{X} \times \mathcal{A}|}$ is the identity. For non-stationary policies, we define $P_\pi^n = P_{\pi_{T-1}} P_{\pi_{T-2}} \cdots P_{\pi_{T-n}}$ for $n \geq 1$ and $P_\pi^0 = I$. Let $\delta_{x,a} \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ denote the indicator vector which has value 1 at the dimension corresponding to $(x,a)$ and 0 elsewhere.
The conditional marginal distribution for both stationary and non-stationary policies $\pi$ after $n$ environment steps is $p_{\pi,n}(\cdot|x,a) = \delta_{x,a} P_\pi^n \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$. The (unnormalized) $t$-step[3] conditional occupancy measure of $\pi$ is the discounted sum of conditional marginals:

$$\rho_{\pi,t}(\cdot|x,a) = \sum_{n=0}^{t} \gamma^n p_{\pi,n}(\cdot|x,a) = \sum_{n=0}^{t} \gamma^n \delta_{x,a} P_\pi^n . \qquad (1)$$

Intuitively, $\rho_{\pi,t}(x',a'|x,a)$ quantifies the visitation frequency for $(x',a')$ when an agent starts from $(x,a)$ and runs $\pi$ for $t$ environment steps, with more weight on earlier visits. The infinite-horizon conditional occupancy exists for $\gamma < 1$ and we will denote it by simply omitting the subscript $t$, i.e. $\rho_\pi(\cdot|x,a) = \lim_{t \to \infty} \rho_{\pi,t}(\cdot|x,a)$. The (unconditional) occupancy measure can be recovered by $\rho_{\pi,t}(x',a') = \sum_{x,a} P_0(x) \pi_T(a|x) \rho_{\pi,t}(x',a'|x,a)$. The $t$-step Q-values $Q_{\pi,t} \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ for both stationary and non-stationary policies $\pi$ are defined as $Q_{\pi,t}(x,a) = \mathbb{E}_{x',a' \sim \rho_{\pi,t}(\cdot|x,a)}[r_\theta(x',a')] = \rho_{\pi,t}(\cdot|x,a) \cdot r$, i.e. the conditional expectation of discounted reward sums when there are $t$ environment steps remaining. The theory in this section will be derived in the Maximum Entropy RL (MaxEntRL) setting (Haarnoja et al., 2017; Yarats et al., 2021). Let $Q^{soft}_{\theta,t} \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ denote the $t$-step optimal Q-values defined iteratively by the soft Bellman optimality operator $\mathcal{T}_H : \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|} \to \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ as

$$Q^{soft}_{\theta,t}(x,a) = (\mathcal{T}_H Q^{soft}_{\theta,t-1})(x,a) = r_\theta(x,a) + \gamma \, \mathbb{E}_{x' \sim P(\cdot|x,a)}\Big[\log \sum_{a' \in \mathcal{A}} \exp Q^{soft}_{\theta,t-1}(x',a')\Big] \qquad (2)$$

for all $t > 0$ and $Q^{soft}_{r,0} = r$. Hence the optimal Q-values for a decision problem with horizon $t$ can be obtained by sequentially computing the optimal Q-values for $1, \ldots, t-1$ step decision problems using dynamic programming (Bellman, 1957), i.e. value iteration. The operator $\mathcal{T}_H$ is a contraction in the max-norm (Bertsekas & Tsitsiklis, 1995) whose unique fixed point is the infinite-horizon optimal Q-values $Q^{soft}_{\theta,\infty}$ satisfying $Q^{soft}_{\theta,\infty} = \mathcal{T}_H Q^{soft}_{\theta,\infty}$.
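To make the recursion in Eq. 2 concrete, here is a minimal tabular sketch of soft value iteration and the resulting softmax policy. NumPy, the function names, and the toy MDP layout are our own illustration, not from the paper:

```python
import numpy as np

def soft_value_iteration(P, r, gamma, n_iters=500):
    """Iterate the soft Bellman optimality operator (Eq. 2) on a tabular MDP.

    P: (S, A, S) transition tensor, r: (S, A) reward, gamma: discount < 1.
    Returns the (approximate) infinite-horizon soft Q-values, the fixed
    point of Q = r + gamma * E_{x'}[log sum_{a'} exp Q(x', a')].
    """
    Q = np.zeros_like(r)
    for _ in range(n_iters):
        V = np.log(np.exp(Q).sum(axis=1))  # soft value V(x') = logsumexp_a' Q(x', a')
        Q = r + gamma * (P @ V)            # (S, A, S) @ (S,) -> (S, A)
    return Q

def soft_policy(Q):
    """MaxEnt optimal policy: row-wise softmax of the soft Q-values."""
    expQ = np.exp(Q - Q.max(axis=1, keepdims=True))
    return expQ / expQ.sum(axis=1, keepdims=True)
```

Since the operator is a max-norm contraction with modulus $\gamma$, a few hundred iterations suffice for small $\gamma$.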
For finite-horizon problems, the (non-stationary) optimal policy $\pi^{soft}_\theta = (\pi^{soft}_{\theta,t})_{t=0}^{T}$ is derived by $\pi^{soft}_{\theta,t}(a|x) = \mathrm{softmax}(Q^{soft}_{\theta,t})(x,a) = e^{Q^{soft}_{\theta,t}(x,a)} / \sum_{\bar a} e^{Q^{soft}_{\theta,t}(x,\bar a)}$ and, similarly, the infinite-horizon (stationary) optimal policy is $\pi^{soft}_{\theta,\infty} = \mathrm{softmax}(Q^{soft}_{\theta,\infty})$. We may now define the key quantity of interest, the Bellman score.

Definition 1. The finite-horizon Bellman score $s^{soft}_\theta \in (\mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times \dim(\theta)})^{T+1}$ is the gradient of the log probabilities of the (non-stationary) optimal policy with respect to the reward, $s^{soft}_\theta = (s^{soft}_{\theta,t})_{t=0}^{T} = (\nabla_\theta \log \pi^{soft}_{\theta,t})_{t=0}^{T}$. Similarly, the infinite-horizon Bellman score $s^{soft}_{\theta,\infty} \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times \dim(\theta)}$ is $s^{soft}_{\theta,\infty} = \nabla_\theta \log \pi^{soft}_{\theta,\infty}$.

In words, the finite-horizon score $s^{soft}_{\theta,t}$ is the Jacobian of the log optimal policy vector with respect to the reward parameters, i.e. $s^{soft}_{\theta,t}(x,a) = \nabla_\theta \log \pi^{soft}_{\theta,t}(a|x)$ is the direction in which the reward parameters $\theta$ should be perturbed in order to increase the log probability of action $a$ at state $x$ under the optimal policy for time $t$. To a local linear approximation around $\theta$, the score conveys how the optimal behavior changes as a function of the underlying reward parameters. Next, we will prove useful properties of the score and derive a dynamic programming algorithm to compute it.

3. SCORE ITERATION

Since $\pi^{soft}_{\theta,t}(a|x) = \mathrm{softmax}(Q^{soft}_{\theta,t})(x,a)$, the Bellman score can be written as a difference of the gradients of optimal Q-values with respect to the reward parameters:

$$s^{soft}_{\theta,t}(x,a) = \nabla_\theta \log \pi^{soft}_{\theta,t}(a|x) = \nabla_\theta Q^{soft}_{\theta,t}(x,a) - \mathbb{E}_{a' \sim \pi^{soft}_{\theta,t}(\cdot|x)}\big[\nabla_\theta Q^{soft}_{\theta,t}(x,a')\big] . \qquad (3)$$

We will refer to $\nabla_\theta Q^{soft}_{\theta,t}$ as the Q-gradient or value gradient. Eq. 3 shows that an unbiased sample estimate of the score can be obtained from the Q-gradient and the optimal policy. Thus, we will derive a dynamic programming algorithm termed score iteration which efficiently computes the Q-gradients. Similar to how value iteration proceeds by repeated application of the Bellman optimality operator (Eq. 2), score iteration will rely on the Bellman score operator, which we define now.

Definition 2. The Bellman score operator $G_{\pi,\theta} : \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times \dim(\theta)} \to \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times \dim(\theta)}$ for a policy $\pi$ and reward $r_\theta$ is defined on an input $J \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times \dim(\theta)}$ as $G_{\pi,\theta} J = \nabla_\theta r_\theta + \gamma P_\pi J$.

To gain intuition about the score operator, consider the tabular reward $r_\theta(x,a) = \theta(x,a)$ where $\theta \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$. In this setting, the score operator simply computes the conditional occupancies for $\pi$ (Eq. 1) when applied to the identity matrix $I$. For example, the value of $G_{\pi,\theta} I = \nabla_\theta r_\theta + \gamma P_\pi I = I + \gamma P_\pi$ at the row for state-action $(x,a)$ is $\delta_{x,a} + \gamma P_\pi(\cdot|x,a)$, which corresponds to the one-step conditional occupancy $\rho_{\pi,1}(\cdot|x,a)$. While the score operator has a pleasantly simple form, Theorem 1 will show that the Q-gradients can be computed by its repeated application and, as a result, that the Q-gradient is the conditional occupancy measure of the optimal policy for the underlying reward.
Algorithm 1 Score Iteration: computes the infinite-horizon Bellman score via dynamic programming

Input: $M = (\mathcal{X}, \mathcal{A}, P, P_0, r_\theta, \gamma)$: Markov Decision Process; $g$: randomly initialized Q-gradient vector
procedure SCOREITERATION($M$)
    For $M$, learn the optimal policy $\pi^{soft}_{\theta,\infty}$
    while $g$ is not converged do
        for $x \in \mathcal{X}, a \in \mathcal{A}$ do
            Update Q-gradient: $g(x,a) \leftarrow \nabla_\theta r_\theta(x,a) + \gamma \sum_{x',a' \in \mathcal{X} \times \mathcal{A}} P_{\pi^{soft}_{\theta,\infty}}(x',a'|x,a)\, g(x',a')$
    for $x \in \mathcal{X}, a \in \mathcal{A}$ do
        Compute Bellman score: $s(x,a) \leftarrow g(x,a) - \sum_{a' \in \mathcal{A}} \pi^{soft}_{\theta,\infty}(a'|x)\, g(x,a')$
    return $\pi^{soft}_{\theta,\infty}, g, s$

Theorem 1. For all $t = 1, \ldots, T$ and any matrix $J \in \mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times \dim(\theta)}$,

$$\nabla_\theta Q^{soft}_{\theta,t} = G_{\pi^{soft}_{\theta,t-1},\theta} \cdots G_{\pi^{soft}_{\theta,0},\theta}\big(\nabla_\theta Q^{soft}_{\theta,0}\big), \qquad \nabla_\theta Q^{soft}_{\theta,\infty} = G_{\pi^{soft}_{\theta,\infty},\theta}\big(\nabla_\theta Q^{soft}_{\theta,\infty}\big) = \lim_{k \to \infty} G^{k}_{\pi^{soft}_{\theta,\infty},\theta} J,$$

where $\nabla_\theta Q^{soft}_{\theta,0} = \nabla_\theta r_\theta$. Furthermore, the Q-gradient satisfies:

$$\nabla_\theta Q^{soft}_{\theta,t}(x,a) = \mathbb{E}_{x',a' \sim \rho_{\pi^{soft}_{\theta,:t},t}(\cdot|x,a)}\big[\nabla_\theta r_\theta(x',a')\big], \qquad \nabla_\theta Q^{soft}_{\theta,\infty}(x,a) = \mathbb{E}_{x',a' \sim \rho_{\pi^{soft}_{\theta,\infty}}(\cdot|x,a)}\big[\nabla_\theta r_\theta(x',a')\big].$$

Theorem 1 shows that both finite- and infinite-horizon Q-gradients can be computed by repeatedly applying the score operator of the current reward $r_\theta$ and its optimal policy $\pi^{soft}_\theta$, $\pi^{soft}_{\theta,\infty}$. Similar to how the Bellman optimality operator (Eq. 2) converges any starting vector in $\mathbb{R}^{|\mathcal{X} \times \mathcal{A}|}$ to the infinite-horizon optimal Q-values (Bertsekas & Tsitsiklis, 1995), the Bellman score operator converges any starting matrix in $\mathbb{R}^{|\mathcal{X} \times \mathcal{A}| \times \dim(\theta)}$ to the infinite-horizon Q-gradient. Furthermore, we see that the Q-gradient is in fact the expected reward gradient under the conditional occupancy measure. To gain more intuition, once again consider the tabular reward $r_\theta(x,a) = \theta(x,a)$. Then, $\nabla_\theta Q^{soft}_{\theta,t}(x,a) = \mathbb{E}_{x',a' \sim \rho_{\pi^{soft}_{\theta,:t},t}(\cdot|x,a)}[\delta_{x',a'}] = \rho_{\pi^{soft}_{\theta,:t},t}(\cdot|x,a)$. We see that the Q-gradient is in fact equivalent to the conditional occupancy of the optimal policy, which is consistent with our previous analysis of the score operator.
Thus, in order to increase the optimal Q-values at $(x,a)$, the rewards of state-actions $(x',a')$ should be increased in proportion to their discounted visitation frequencies when starting from $(x,a)$ and following the optimal policy. As an alternative to the operator perspective, one may consider a Linear Programming (LP) interpretation. The optimal Q-values are the solution to an LP over the set of conditional occupancies $\mathcal{D}$ that are feasible under the Markovian environment dynamics, i.e. $Q^{soft}_{\theta,t}(x,a) = \max_{\rho \in \mathcal{D}} r_\theta \cdot \rho$. Loosely speaking, invoking the supremum rule (Boyd et al., 2003) allows one to show that the gradient of the LP solution, i.e. $\nabla_{r_\theta} \max_{\rho \in \mathcal{D}} r_\theta \cdot \rho$, is in fact the conditional occupancy that solves the LP. The main practical contribution of Theorem 1 is that the operator convergence results enable the design of a powerful dynamic programming method for score computation, shown in Algorithm 1 (finite-horizon version in Appendix D). The algorithm computes the optimal policy $\pi^{soft}_{\theta,\infty}$, then proceeds by repeatedly applying the score operator $G_{\pi^{soft}_{\theta,\infty},\theta}$. While highly effective in countable state-action spaces, score iteration has limited applicability to continuous state-action spaces and requires full knowledge of the environment dynamics. Similar to how Q-learning is a model-free instantiation of the value iteration algorithm, we will proceed to propose score-learning, a model-free instantiation of the score iteration algorithm.
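As an illustration of Algorithm 1, the following NumPy sketch (helper names are our own; it assumes a tabular MDP with known dynamics and an already-learned policy) repeatedly applies the score operator $G_{\pi,\theta} J = \nabla_\theta r_\theta + \gamma P_\pi J$ and then forms the score via Eq. 3:

```python
import numpy as np

def score_iteration(P, pi, grad_r, gamma, n_iters=500):
    """Dynamic-programming sketch of score iteration (Algorithm 1).

    P: (S, A, S) dynamics, pi: (S, A) policy, grad_r: (S*A, d) reward
    Jacobian d r_theta / d theta, gamma: discount < 1.
    Returns the Q-gradient g (S*A, d) and the Bellman score s (S*A, d).
    """
    S, A = pi.shape
    d = grad_r.shape[1]
    # P_pi(x', a' | x, a) = P(x'|x, a) * pi(a'|x') as an (S*A, S*A) matrix
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    g = np.zeros((S * A, d))
    for _ in range(n_iters):
        g = grad_r + gamma * P_pi @ g          # one application of G_{pi,theta}
    # Eq. 3: s(x, a) = g(x, a) - E_{a' ~ pi(.|x)}[g(x, a')]
    g_sa = g.reshape(S, A, d)
    s = g_sa - np.einsum('xb,xbd->xd', pi, g_sa)[:, None, :]
    return g, s.reshape(S * A, d)
```

With the tabular reward ($\texttt{grad\_r} = I$), Theorem 1 predicts that each row of $g$ converges to the conditional occupancy $\rho_\pi(\cdot|x,a)$, whose total mass is $\sum_n \gamma^n = 1/(1-\gamma)$.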

4. SCALING UP SCORE ESTIMATION

In order to scale score estimation to model-free settings with (potentially) high-dimensional continuous state-action spaces, we start by parameterizing the policy, Q-gradient network, and score network as neural networks $\pi_\phi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, $g_\psi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{\dim(\theta)}$, and $s_\omega : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{\dim(\theta)}$. Just as Q-learning (Silver et al., 2016) approximates an application of the Bellman optimality operator by learning a Q-network that minimizes a bootstrapped regression loss, score-learning approximates an application of the score operator by learning a Q-gradient network that minimizes the bootstrapped regression loss in Eq. 4. The score network then minimizes the regression loss in Eq. 5, which directly follows by replacing the Q-gradient with our estimate $g_\psi$ in Eq. 3.

Algorithm 2 Score-learning: model-free instantiation of SCOREITERATION

Input: $(\theta, \phi, \psi, \omega)$: weights for the reward, policy, Q-gradient, and score networks; $B$: buffer of transitions (can be either online or offline data); $N, N_\psi$: total number of algorithm iterations, number of Q-gradient update steps per policy update; $\eta_\psi, \eta_\omega$: Q-gradient and score learning rates; $\alpha_g$: target Q-gradient mixing rate; RL: one policy update step of an RL algorithm, e.g. one policy gradient step in SAC (Haarnoja et al., 2018).
procedure SCORELEARNING($\theta, \phi, \psi, \omega, B, N$)
    for $i \in \{1, \ldots, N\}$ do
        # Update policy $\pi_\phi$
        if $\phi$ is not None then update the policy to maximize reward $r_\theta$: RL($\theta, \phi, B$)
        # Update Q-gradient $g_\psi$
        for $j \in \{1, \ldots, N_\psi\}$ do
            Sample batch: $(x, a, x') \sim B$
            if $\phi$ is not None then sample the next action from the current policy: $a' \sim \pi_\phi(\cdot|x')$
            else sample the next action from the buffer: $a' \sim B$
            Update Q-gradient (Eq. 4): $\psi \leftarrow \psi - \eta_\psi \nabla_\psi \big(\nabla_\theta r_\theta(x,a) + \gamma \bar g(x',a') - g_\psi(x,a)\big)^2$
            Update target Q-gradient $\bar g$ with soft mixing rate $\alpha_g$ (Haarnoja et al., 2017)
        # Update Bellman score $s_\omega$
        Sample batch and contrastive action: $(x, a) \sim B$, $a' \sim \pi_\phi(\cdot|x)$
        Update score network (Eq. 5): $\omega \leftarrow \omega - \eta_\omega \nabla_\omega \big(g_\psi(x,a) - g_\psi(x,a') - s_\omega(x,a)\big)^2$
    return $\phi, \psi, \omega$
$$L_\psi = \Big(\nabla_\theta r_\theta(x,a) + \gamma \, \mathbb{E}_{x',a' \sim P_\pi(\cdot|x,a)}\big[\bar g(x',a')\big] - g_\psi(x,a)\Big)^2 \qquad (4)$$

$$L_\omega = \Big(g_\psi(x,a) - \mathbb{E}_{a' \sim \pi(\cdot|x)}\big[g_\psi(x,a')\big] - s_\omega(x,a)\Big)^2 \qquad (5)$$

Note that $\bar g$ is a target Q-gradient network used to stabilize optimization, analogous to the target Q-networks used in Q-learning (Haarnoja et al., 2017). Algorithm 2 shows the full execution flow of score-learning. Instead of sequentially learning each model until convergence, score-learning alternates between making small improvements to the policy, regressing to the Q-gradient with the current estimate of the optimal policy, and updating the score with the current estimate of the Q-gradient. This enables the algorithm to output approximate estimates of all three components when run for a few steps, which will be useful for downstream algorithms that run approximate score-learning in their inner loop. If the buffer $B$ already contains samples from the optimal policy, simply not passing policy weights disables policy updates. Note that score-learning can operate with both online and offline data $B$, depending on the choice of the RL algorithm used to update the policy. We now show how the score-learning algorithm can be used for a variety of downstream tasks.
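For intuition, the two losses can be written as single-sample estimates in a few lines. This is a sketch with our own names; a real implementation would use neural networks and autodiff rather than plain arrays:

```python
import numpy as np

def q_gradient_loss(grad_r_sa, g_target_next, g_sa, gamma):
    """Single-sample bootstrapped regression loss for the Q-gradient (Eq. 4).

    grad_r_sa:     (d,) reward gradient at (x, a)
    g_target_next: (d,) target network output at a sampled (x', a')
    g_sa:          (d,) current Q-gradient prediction at (x, a)
    """
    target = grad_r_sa + gamma * g_target_next  # one-sample score-operator target
    return float(np.sum((target - g_sa) ** 2))

def score_loss(g_sa, g_contrastive, s_sa):
    """Single-sample score regression loss (Eq. 5): the score is the
    Q-gradient minus its policy mean, estimated with one contrastive
    action a' drawn from the current policy."""
    return float(np.sum((g_sa - g_contrastive - s_sa) ** 2))
```

Both losses vanish exactly when the networks reach the fixed points described by Theorem 1 and Eq. 3.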

4.1. MAXIMUM ENTROPY INVERSE REINFORCEMENT LEARNING WITH SCORE-LEARNING

In this section we will show how score-learning can be used for maximum entropy inverse reinforcement learning (MaxEntIRL). Let $\tau = (x_0, a_0, \ldots, x_T, a_T)$ denote a trajectory of state-actions, let $p_\theta(\tau) = P_0(x_0) \prod_{t=0}^{T} \pi^{soft}_{\theta,t}(a_t|x_t) \prod_{t=0}^{T-1} P(x_{t+1}|x_t, a_t)$ denote the trajectory distribution of the MaxEnt optimal policy $\pi^{soft}_\theta$ for reward $r_\theta$, and similarly let $p^*(\tau)$ be the trajectory distribution of the expert policy $\pi^* = (\pi^*_t)_{t=0}^{T}$ from which the demonstrations $D$ are sampled. The goal in MaxEntIRL is to find reward parameters $\theta$ that maximize the log-likelihood of the expert trajectories:

$$\max_\theta \, \mathbb{E}_{\tau \sim p^*}\big[\log p_\theta(\tau)\big] . \qquad (6)$$

The gradient of the objective in Eq. 6 is highly useful as it enables direct gradient ascent to solve the maximization. Previously, the MaxEntIRL gradient had been derived either in the limited setting of deterministic dynamics (Ziebart et al., 2008) or in the general stochastic dynamics setting for the special case of linear reward models (Ziebart et al., 2010). Here, we leverage Theorem 1 to derive the general MaxEntIRL gradient in the stochastic dynamics setting with non-linear rewards.

Theorem 2.
$$\nabla_\theta \, \mathbb{E}_{\tau \sim p^*}\big[\log p_\theta(\tau)\big] = \mathbb{E}_{x \sim P_0, \, a \sim \pi^*_T(\cdot|x), \, a' \sim \pi^{soft}_{\theta,T}(\cdot|x)}\big[\nabla_\theta Q_{\pi^*,T}(x,a) - \nabla_\theta Q^{soft}_{\theta,T}(x,a')\big] \qquad (7)$$
$$= \mathbb{E}_{x,a \sim \rho_{\pi^*,T}}\big[\nabla_\theta r_\theta(x,a)\big] - \mathbb{E}_{x',a' \sim \rho_{\pi^{soft}_\theta,T}}\big[\nabla_\theta r_\theta(x',a')\big] . \qquad (8)$$

Theorem 2 shows that the reward should be updated to increase the Q-values of the expert and decrease the optimal Q-values of the learner (Eq. 7), which is equivalent to increasing the reward on state-actions sampled from the expert and decreasing the reward on state-actions sampled from the learner (Eq. 8). With deterministic dynamics, Eq. 8 simply reduces to the known contrastive divergence gradient (Finn et al., 2016b) for energy-based models. When $r_\theta$ is linear, Eq. 8 reduces to the expected feature gap derived in Ziebart et al. (2010).
(See Appendix A for extended discussion.) Many prior works in MaxEntIRL (Finn et al., 2016b; Fu et al., 2018) take Monte Carlo samples to estimate Eq. 8 and perform stochastic gradient updates on $\theta$. The key challenges with this approach are, first, the difficulty of reusing old samples $(x, a) \sim \pi^{soft}_{\theta_{old}}$ from policies for older rewards $\theta_{old}$ and, second, the high variance of the Monte Carlo estimator. The main practical significance of Theorem 2 comes from Eq. 7, which reveals that we may instead learn Q-gradient networks with score-learning to approximate the expectations in Eq. 8. At the cost of introducing bias into the gradient estimates, this approach resolves the two aforementioned challenges: score-learning works with offline data, which enables old sample reuse, and the Q-gradient network outputs the mean reward gradient, which reduces variance. We name our algorithm Gradient Actor-Critic (GAC) (see Algorithm 3); it uses score-learning to estimate the expert Q-gradient $\nabla_\theta Q_{\pi^*,T}(x,a)$ and the learner Q-gradient $\nabla_\theta Q^{soft}_{\theta,T}(x,a)$ with parametric functions $g_E$ and $g_L$. To reduce computational cost, GAC alternates between approximate Q-gradient estimation, i.e. running score-learning for $N$ steps, and using the rough gradients for reward updates. Our method is analogous to actor-critic methods (Haarnoja et al., 2017) that learn a Q-network to estimate mean policy returns which are then used for policy gradient updates. These methods similarly introduce bias into the policy gradients in exchange for the ability to reuse old policy data and perform lower-variance gradient updates. Just as actor-critic typically outperforms Monte Carlo policy gradient methods, our experiments will demonstrate a similar usefulness in trading off bias for lower variance.
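In the tabular case, the Monte Carlo form of Eq. 8 can be estimated directly from empirical expert and learner occupancies. A minimal sketch under our own naming, with occupancies supplied as (unnormalized) visitation-count vectors:

```python
import numpy as np

def maxent_irl_gradient(grad_r, rho_expert, rho_learner):
    """Monte-Carlo estimate of the MaxEntIRL gradient (Eq. 8):
    E_{expert occupancy}[grad r_theta] - E_{learner occupancy}[grad r_theta].

    grad_r:      (S*A, d) reward Jacobian d r_theta / d theta
    rho_expert:  (S*A,) expert state-action visitation counts
    rho_learner: (S*A,) learner state-action visitation counts
    """
    rho_e = rho_expert / rho_expert.sum()    # normalize to distributions
    rho_l = rho_learner / rho_learner.sum()
    return grad_r.T @ rho_e - grad_r.T @ rho_l  # (d,) ascent direction on theta
```

For the tabular reward ($\texttt{grad\_r} = I$) this reduces to the gap between the two visitation distributions, which is zero exactly when the learner matches the expert's occupancy.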

4.2. COUNTERFACTUAL PREDICTIONS WITH SCORE-LEARNING

Here we describe how score-learning can be used for counterfactual predictions. The goal is to predict how the optimal policy changes when the original reward $r_\theta$ is perturbed to a counterfactual reward $r_{\theta'}$. With access to an efficient environment simulator, one natural solution is to simply re-solve the RL problem for $r_{\theta'}$. However, this is difficult in many domains where solving the RL problem is expensive, particularly when we wish to make many counterfactual predictions for multiple reward alterations. Instead, we may use the score $s^{soft}_{\theta,\infty}$ for the original reward $r_\theta$ to estimate the optimal policy at the new reward $r_{\theta'}$ by simply using the first-order Taylor approximation of the log optimal policy:

$$\log \pi^{soft}_{\theta',\infty}(a|x) \approx \log \pi^{soft}_{\theta,\infty}(a|x) + s^{soft}_{\theta,\infty}(x,a) \cdot (\theta' - \theta) . \qquad (9)$$

Eq. 9 shows that $\log \pi^{soft}_{\theta',\infty}(a|x)$ increases in proportion to how well the direction of change in reward $(\theta' - \theta)$ aligns with the score $s_\theta(x,a)$. We use Algorithm 2 to estimate $s_\theta$ and apply Eq. 9.

Figure 1: Counterfactual prediction performance for two scenarios: education subsidy (top row) and military enrollment incentives (bottom row). "no sub" corresponds to the initial reward from which the scores are estimated, "true sub" shows the probability of actions under the MaxEnt optimal policy obtained by re-solving the RL model for the perturbed rewards, and "pred sub" shows the score-estimated optimal policy probabilities. 2000 and 8000 denote the magnitude of the perturbation to the reward parameters, which corresponds to the amount of subsidy or incentive.
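The prediction in Eq. 9 amounts to a single inner product per state-action, followed by renormalization. A minimal tabular sketch (the function name and renormalization step are our own; it assumes the log policy and score have already been estimated):

```python
import numpy as np

def counterfactual_log_policy(log_pi, score, theta, theta_new):
    """First-order Taylor prediction (Eq. 9) of the log optimal policy
    at a perturbed reward theta_new, using the Bellman score at theta.

    log_pi: (S, A) log probabilities of the optimal policy at theta
    score:  (S, A, d) Bellman score s(x, a) = d log pi(a|x) / d theta
    """
    delta = theta_new - theta                # reward perturbation (d,)
    log_pi_new = log_pi + score @ delta      # (S, A, d) @ (d,) -> (S, A)
    # renormalize each row so it remains a valid action distribution
    log_pi_new = log_pi_new - np.log(np.exp(log_pi_new).sum(axis=1, keepdims=True))
    return log_pi_new
```

As the text notes, the approximation is accurate for small perturbations $\theta' - \theta$ and degrades as the perturbation grows.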

5. EXPERIMENTS

5.1. REWARD IDENTIFICATION AND COUNTERFACTUAL PREDICTIONS

Environment: We experiment with the Keane and Wolpin (KW) environment (Keane & Wolpin, 1994; 1997), a widely used benchmark in the DDC literature. The KW environment simulates a lifetime of occupational decisions where at each time-step an agent selects among five actions: white collar work, blue collar work, military, school, and staying home. The state is a 25-dimensional vector consisting of various features about the agent, such as the time spent in each occupation and their current job. The dynamics capture how the current job decision alters the state. Agents are required to be forward-looking and maximize the rewards summed over a lifetime of occupational choices. The reward parameter $\theta$ is a 63-dimensional vector where each dimension corresponds to either a cost or a benefit, e.g. the cost of attending college or the benefit of transitioning from a white collar job to the military. The specific structure of the reward is omitted for brevity; full details about the environment can be found in Keane & Wolpin (1994; 1997).

Table 1: Reward identification on the KW dataset. Mean squared error between the true reward and the IRL-estimated reward, evaluated on the demonstrations.

                    GCL          MSM          GAC
$\theta_{edu}$      12.1 ± 5.3   0.14 ± 0.4   0.16 ± 0.3
$\theta_{mil}$      20.5 ± 6.8   0.14 ± 0.3   0.08 ± 0.2

Results: We solve the forward MaxEntRL problem in the KW environment and use the optimal policy to generate 100 trajectories of occupational choices for two different reward parameters: $\theta_{edu}$ and $\theta_{mil}$. The first reward simulates a scenario with lower education costs and the latter captures a scenario with high benefits for military enrollment (Keane & Wolpin, 1994; 1997). We test reward identification performance by computing the mean squared error between the ground-truth rewards $r_{\theta_{edu}}, r_{\theta_{mil}}$ and the estimated rewards, evaluated on the demonstrations. As rewards are only identifiable up to additive constant shifts (Kim et al., 2021), we search this equivalence class for the parameters that minimize the mean squared error. Table 1 shows that GAC identifies the rewards to similar precision as the Method of Simulated Moments (MSM) baseline, the state-of-the-art method for identification in the DDC literature. While GAC and GCL optimize the same objective, i.e. Eq. 8, we posit that trading off bias for lower variance (as explained in Section 4.1) accounts for GAC's superior performance. Next we test counterfactual prediction performance using the method in Section 4.2. We simulate two counterfactual scenarios where the government provides a subsidy for college tuition and for military participation. The scenarios are simulated by reducing the cost parameters in the reward for attending college and joining the military. Figure 1 shows how our score-estimated counterfactual predictions of policy changes compare to fully re-solving the RL problem for the perturbed rewards. We are able to accurately estimate the increase in time spent in school and the military. In particular, when the reward perturbations are smaller, the score-based counterfactual predictions are more accurate.
For significantly larger perturbations, we would expect our method's accuracy to drop as the first-order approximation is no longer valid.
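Since rewards are identifiable only up to additive constant shifts, the evaluation above searches that equivalence class for the minimal mean squared error. In the simplest case of a single constant shift, the minimizer has a closed form; a small NumPy sketch (the function name is our own, and the single-shift case is an illustrative simplification of the search described in Section 5.1):

```python
import numpy as np

def shift_aligned_mse(r_true, r_est):
    """MSE between true and estimated rewards on the demonstrations,
    minimized over a single additive constant shift c.

    The c minimizing mean((r_true - (r_est + c))**2) is the mean residual
    c = mean(r_true - r_est), by setting the derivative in c to zero.
    """
    c = np.mean(r_true - r_est)
    return float(np.mean((r_true - (r_est + c)) ** 2))
```

Under this metric, an estimate that differs from the truth by any constant scores a perfect zero, while genuinely misshapen estimates do not.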

5.2. BEHAVIOR IMITATION

Environment and Baselines: We experiment with high-dimensional robotic control environments from the DeepMind Control Suite (DMC) (Tassa et al., 2018), a widely used benchmark for RL and Imitation Learning (IL). For state control problems the agent has access to its proprioceptive states. For visual control the agent sees three consecutive observation images of size $84 \times 84 \times 3$ and uses action-repeat, following the standard DMC protocol for visual RL (Yarats et al., 2022). We compare GAC against Behavioral Cloning (BC) (Pomerleau, 1991), Adversarial IRL (AIRL) (Fu et al., 2018), Guided Cost Learning (GCL) (Finn et al., 2016b), and Discriminator Actor-Critic (DAC) (Kostrikov et al., 2019; Cohen et al., 2021). We report the mean performance and standard deviation across ten random seeds, where the performance of each seed is measured as the mean reward of twenty trajectories sampled from the final policy (see Appendix B for experiment details).

Results: For state control, we experiment in the extreme low-data regime where only one demonstration trajectory (consisting of 1000 state-actions) is provided. The experts for all tasks obtain rewards near 1000. Table 2 shows that GAC outperforms all baselines while achieving expert-level performance. For visual control, Figure 2 shows that GAC outperforms baselines while achieving expert performance on many hard visual control tasks with just one demonstration. In particular, we found that the IRL baselines GCL and AIRL are incapable of performing well, which suggests that trading off bias for lower variance is beneficial in practice. While DAC achieves competitive performance on some tasks such as HOPPER STAND, it does not have the benefit of learning a reward. Some limitations of GAC include its longer compute time and potential optimization instabilities from the deadly triad (Van Hasselt et al., 2018) and double-sampling (Zhu & Ying, 2020).
A more complete discussion of the limitations, as well as hyper-parameter ablation studies, can be found in Appendix C.

5.3. BEHAVIOR TRANSFER

Next we show that learned rewards from GAC robustly transfer behaviors to perturbed environments (Fu et al., 2018). We create variations of the state-based DMC environments by altering the surface and joint frictions to simulate realistic robot learning scenarios where we may want to learn a single reward that can teach robots with different joint mechanics to mobilize on different walking surfaces. For example, the SLIPPERYWALKER environment has about 80% less friction on the walking surface while STIFFCHEETAH has roughly 150% higher friction on the robot joints. Rewards are first learned on the non-perturbed environments using 10 demonstrations, then re-optimized via RL on the perturbed environments. As a sanity check, we also include the DIRECT baseline which simply uses the policy learned in the non-perturbed environment. Table 3 shows that GAC outperforms the baselines. However, GAC does not attain perfect expert-level performance, which we believe is due to the perturbed environments requiring slightly different state visitation strategies.

6. RELATED WORKS

Inverse Reinforcement Learning and Dynamic Discrete Choice: The field of IRL was pioneered by Ng et al. (2000) and has since accumulated a rich history, starting with early works that rely on linear reward models and hand-crafted features (Abbeel & Ng, 2004; Ramachandran & Amir, 2007; Choi & Kim, 2011; Neu & Szepesvári, 2012). Since the introduction of the Maximum Entropy IRL framework (Ziebart et al., 2008; 2010), the field has gravitated towards learning more flexible rewards parameterized by deep neural networks (Wulfmeier et al., 2015) paired with adversarial training (Fu et al., 2018; Ho & Ermon, 2016; Finn et al., 2016b;a), while some works attempt to scale more classical methods such as Bayesian IRL (Chan & van der Schaar, 2021; Mandyam et al., 2021). Most related to our work is gradient-based apprenticeship learning (Neu & Szepesvári, 2012), which showed that the gradient of the infinite-horizon Q-function in the standard RL setting satisfies a linear fixed point equation. Bellman gradient iteration (Li et al., 2017; 2019) and maximum-likelihood IRL (Vroman, 2014) take a computational approach to estimating approximate Q-gradients. Unlike these prior works, we generalize results to the MaxEntRL setting, where we prove new properties of the Bellman score and derive a dynamic programming algorithm that provably converges to the infinite-horizon Q-gradient. Most significantly, we propose a model-free score estimation algorithm which scales to high-dimensional environments and apply it to various applications beyond IRL. The field of econometrics has a rich body of work on identifying Dynamic Discrete Choice (DDC) models (Rust, 1994; Arcidiacono & Miller, 2011; 2020; Abbring & Daljord, 2020), a problem equivalent to MaxEntIRL (Ziebart et al., 2008).
The DDC literature focuses more on counterfactual predictions (Christensen & Connault, 2019), which have various application areas such as labor markets (Keane & Wolpin, 1994), health care (Heckman & Navarro, 2007), and retail competition (Arcidiacono & Miller, 2011).

Imitation Learning: The related field of Imitation Learning (IL) seeks to learn policies from demonstrations (Pomerleau, 1991; Ho & Ermon, 2016; Zhang et al., 2020; Rajaraman et al., 2020; Xu et al., 2020), a viable alternative when the sole goal is behavior adoption. In recent years, IL has shown results superior to IRL for policy performance, particularly in the visual IL space (Samak et al., 2021; Young et al., 2021). Techniques such as data augmentation and encoder sharing that have boosted performance in visual RL (Yarats et al., 2022) have been combined with adversarial IL methods to solve challenging control environments (Cohen et al., 2021).

7. CONCLUSION

We have studied the theoretical properties of the Bellman score, which allowed us to derive a model-free algorithm, score-learning, for score estimation. We showed that score-learning has various applications in IRL, behavior transfer, reward identification, and counterfactual analysis. We look forward to future works that apply score-learning to other problems such as reward design (Hadfield-Menell et al., 2017) and explainable RL (Puiutta & Veith, 2020).



Footnotes:
We use discrete spaces for notational brevity later on, but our results can be extended to continuous spaces.
Backwards indexing is standard in dynamic programming RL algorithms (Bertsekas & Tsitsiklis, 1995).
"Step" refers to the number of environment transitions and not the number of decisions made.



Figure 2: Visual Control Performance when provided with a varying number of demonstrations.

Let us define the t-step discounted conditional entropy of a policy as H_{π,t}(x, a) = E_π[ Σ_{t'=1}^{t} γ^{t'} H(π(·|x_{t'})) | x_0 = x, a_0 = a ], and let Q*_t ∈ R^{|X×A|} denote the t-step optimal Q-values defined iteratively by the soft Bellman optimality operator T_H : R^{|X×A|} → R^{|X×A|}, (T_H Q)(x, a) = r(x, a) + γ E_{x'∼P(·|x,a)}[ log Σ_{a'} exp Q(x', a') ].
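In a small tabular MDP these definitions can be checked directly. The following sketch (our own toy construction, assuming a randomly generated MDP; all names are ours) iterates the soft Bellman optimality operator T_H to the soft-optimal Q-values and verifies that the softmax policy attains them under entropy-regularized evaluation, i.e., the soft value equals discounted reward plus discounted entropy under the soft-optimal policy.

```python
import numpy as np

rng = np.random.default_rng(2)
nX, nA, gamma = 3, 2, 0.9
r = rng.normal(size=(nX, nA))
P = rng.random((nX, nA, nX)); P /= P.sum(-1, keepdims=True)  # transition kernel

def T_H(Q):
    # soft Bellman optimality operator: log-sum-exp replaces the hard max
    V = np.log(np.exp(Q).sum(-1))
    return r + gamma * P @ V

Q = np.zeros((nX, nA))
for _ in range(800):                      # Q_t -> Q* as t -> infinity (contraction)
    Q = T_H(Q)
V_star = np.log(np.exp(Q).sum(-1))
pi = np.exp(Q - V_star[:, None])          # soft-optimal policy: softmax of Q*

# check: V* equals discounted reward plus discounted entropy under pi
V = np.zeros(nX)
for _ in range(800):
    V = (pi * (r + gamma * P @ V - np.log(pi))).sum(-1)
assert np.allclose(V, V_star, atol=1e-6)
```

Since gamma < 1, T_H is a contraction, so the iterates converge geometrically regardless of initialization.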

Algorithm 3 Gradient Actor-Critic (GAC): Inverse RL via Q-gradient Estimation
Input:
(θ, φ, ψ_L, ψ_E): weights for the reward, policy, learner Q-gradient, and expert Q-gradient networks
D: demonstrations of expert behavior
M: total number of GAC iterations
N_ψ, N_θ: number of score iteration steps per GAC iteration; reward update interval
The gradient of the MaxEntIRL objective of Eq. 6 is ∇_θ J(θ) = E_{(x,a)∼D}[∇_θ log π_θ(a|x)], i.e., the expected Bellman score over the demonstrations.
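GAC itself is model-free and network-based, but the Q-gradient it estimates can be illustrated with a tabular dynamic-programming sketch (our own construction, not the paper's implementation; the linear reward parameterization and all names are assumptions): iterate the linear fixed-point equation for ∇_θ Q* under the soft-optimal policy, then check it against finite differences of the soft Q-values with respect to the reward parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nA, gamma = 3, 2, 0.9
P = rng.random((nX, nA, nX)); P /= P.sum(-1, keepdims=True)  # transition kernel
theta = rng.normal(size=(nX, nA))                            # reward params: r_theta(x,a) = theta[x,a]

def soft_q(r, iters=500):
    # soft value iteration to the infinite-horizon soft-optimal Q-values
    Q = np.zeros((nX, nA))
    for _ in range(iters):
        V = np.log(np.exp(Q).sum(-1))
        Q = r + gamma * P @ V
    return Q

Q = soft_q(theta)
pi = np.exp(Q - np.log(np.exp(Q).sum(-1, keepdims=True)))    # soft-optimal policy

# Q-gradient fixed point: G <- dr/dtheta + gamma * P E_pi[G]
G = np.zeros((nX, nA, nX, nA))                               # G[x,a,i,j] = dQ*(x,a)/dtheta[i,j]
dr = np.zeros((nX, nA, nX, nA))
for x in range(nX):
    for a in range(nA):
        dr[x, a, x, a] = 1.0                                 # linear reward: dr/dtheta is one-hot
for _ in range(500):
    EG = np.einsum('xb,xbij->xij', pi, G)                    # expected gradient under pi
    G = dr + gamma * np.einsum('xay,yij->xaij', P, EG)

# finite-difference check on one reward coordinate
eps = 1e-5
e = np.zeros_like(theta); e[1, 0] = eps
fd = (soft_q(theta + e) - soft_q(theta - e)) / (2 * eps)
assert np.allclose(fd, G[:, :, 1, 0], atol=1e-4)
```

The Bellman score then follows from G: ∇_θ log π(a|x) = G[x, a] − Σ_{a'} π(a'|x) G[x, a'], which is the quantity GAC's learner and expert networks approximate in high dimensions.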

Table: State Control Performance when provided with one demonstration.
CARTPOLE SWINGUP | 106.0 ± 50.1 | 709.0 ± 40.9 | 591.4 ± 134.2 | 320.1 ± 105.8 | 970.1 ± 26.1

Table: Reward Transfer when provided with ten demonstrations in the original environment.
… ± 98.3 | 455.2 ± 99.3 | 301.3 ± 89.4 | 668.9 ± 60.2
SLIPPERY CHEETAH | 366.5 ± 29.3 | 591.9 ± 65.7 | 93.1 ± 60.1 | 803.9 ± 98.9

ACKNOWLEDGEMENTS

This research was supported by NSF (#1651565), ARO (W911NF-21-1-0125), ONR (N00014-23-1-2159), CZ Biohub, and HAI. We would like to acknowledge Bertsekas & Tsitsiklis (1995) for their descriptions of the Value Iteration algorithm which sparked crucial intuitions for the theory in this work.

