QUANTIFYING DIFFERENCES IN REWARD FUNCTIONS

Abstract

For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by optimizing a policy for the learned reward and inspecting the resulting behavior. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward. Moreover, it can only tell us about behavior in the evaluation environment: the reward may incentivize very different behavior in even a slightly different deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without a policy optimization step. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be efficiently approximated and is more robust than baselines to the choice of coverage distribution. Finally, we show that EPIC distance bounds the regret of optimal policies even under different transition dynamics, and we confirm empirically that it predicts policy training success. Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.

1. INTRODUCTION

Reinforcement learning (RL) has reached or surpassed human performance in many domains with clearly defined reward functions, such as games [20; 15; 23] and narrowly scoped robotic manipulation tasks [16]. Unfortunately, the reward functions for most real-world tasks are difficult or impossible to specify procedurally. Even a task as simple as peg insertion from pixels has a non-trivial reward function that must usually be learned [22, IV.A]. Tasks involving human interaction can have far more complex reward functions that users may not even be able to introspect on. These challenges have inspired work on learning a reward function, whether from demonstrations [13; 17; 26; 8; 3], preferences [1; 25; 6; 18; 27] or both [10; 4].

Prior work has usually evaluated a learned reward function R̂ using the "rollout method": training a policy π_R̂ to optimize R̂ and then examining rollouts from π_R̂. Unfortunately, using RL to compute π_R̂ is often computationally expensive. Furthermore, the method produces false negatives when the reward R̂ matches user preferences but the RL algorithm fails to optimize with respect to R̂.

The rollout method also produces false positives. Of the many reward functions that induce the desired rollout in a given environment, only a small subset align with the user's preferences. For example, suppose the agent can reach states {A, B, C}. If the user prefers A > B > C, but the agent instead learns A > C > B, the agent will still go to the correct state A. However, if the initial state distribution or transition dynamics change, misaligned rewards may induce undesirable policies. For example, if A is no longer reachable at deployment, the previously reliable agent would misbehave by going to the least-favoured state C.

We propose instead to evaluate learned rewards via their distance from other reward functions, and summarize our desiderata for reward function distances in Table 1.
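The false-positive failure mode above can be made concrete with a toy sketch. The state names, reward values, and function name below are purely illustrative and not from the paper:

```python
# Two reward functions that agree on the optimal state in the evaluation
# environment, but diverge once the deployment environment removes state A.

TRUE_REWARD = {"A": 3.0, "B": 2.0, "C": 1.0}     # user prefers A > B > C
LEARNED_REWARD = {"A": 3.0, "B": 1.0, "C": 2.0}  # misaligned: A > C > B

def best_state(reward, reachable):
    """Greedy policy: go to the reachable state with the highest reward."""
    return max(reachable, key=lambda s: reward[s])

# Evaluation environment: all states reachable, so both rewards pick A.
assert best_state(TRUE_REWARD, {"A", "B", "C"}) == "A"
assert best_state(LEARNED_REWARD, {"A", "B", "C"}) == "A"

# Deployment environment: A unreachable, so the learned reward misbehaves.
assert best_state(TRUE_REWARD, {"B", "C"}) == "B"      # second-best state
assert best_state(LEARNED_REWARD, {"B", "C"}) == "C"   # least-favoured state
```

The rollout method would score both rewards identically in the evaluation environment, despite their divergent deployment behavior.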
For benchmarks, it is usually possible to directly compare a learned reward R̂ to the true reward function R. Alternatively, benchmark creators can train a "proxy" reward function from a large human data set. This proxy can then be used as a stand-in for the true reward R when evaluating algorithms trained on a different or smaller data set.

Table 1: Summary of the desiderata satisfied by each reward function distance. Key: the distance is a pseudometric (section 3); invariant to potential shaping [14] and positive rescaling (section 3); a computationally efficient approximation achieving low error (section 6.1); robust to the choice of coverage distribution (section 6.2); and predictive of the similarity of the trained policies (section 6.3).

[Table 1: columns EPIC, NPEC, ERC; per-criterion entries not preserved.]

Comparison with a ground-truth reward function is rarely possible outside of benchmarks. However, even in this challenging case, comparisons can at least be used to cluster reward models trained using different techniques or data. Larger clusters are more likely to be correct, since multiple methods arrived at a similar result. Moreover, our regret bound (Theorem 4.9) suggests we could use interpretability methods [12] on one model and get some guarantees for models in the same cluster.

We introduce the Equivalent-Policy Invariant Comparison (EPIC) distance, which meets all the criteria in Table 1. We believe EPIC is the first method to quantitatively evaluate reward functions without training a policy. EPIC (section 4) canonicalizes the reward functions to remove potential-based shaping [14], then takes the correlation between the canonicalized rewards over a coverage distribution D of transitions. We also introduce baselines NPEC and ERC (section 5) which partially satisfy the criteria.

EPIC works best when D has support on all realistic transitions. We achieve this in our experiments by using uninformative priors, such as rollouts of a policy taking random actions. Moreover, we find EPIC is robust to the exact choice of distribution D, producing similar results across a range of distributions, whereas ERC and especially NPEC are highly sensitive to the choice of D (section 6.2).

Furthermore, low EPIC distance between a learned reward R̂ and the true reward R predicts low regret. That is, the policies π_R̂ and π_R optimized for R̂ and R obtain similar returns under R. Theorem 4.9 bounds the regret even in unseen environments; by contrast, the rollout method can only determine regret in the evaluation environment. We also confirm this result empirically (section 6.3).
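In the tabular case, the canonicalize-then-correlate pipeline can be sketched in a few lines. The code below is a minimal illustration, not the paper's implementation: it assumes uniform coverage distributions over states and actions, represents a reward as an array `R[s, a, s']`, and the function names are our own:

```python
import numpy as np

def canonicalize(R, gamma):
    """Remove potential shaping from a tabular reward R[s, a, s2], taking
    expectations under uniform state/action distributions (an assumption)."""
    mean_s = R.mean(axis=(1, 2))   # E_{A,S'}[R(s, A, S')] for each state s
    mean_all = R.mean()            # E_{S,A,S'}[R(S, A, S')]
    # C(R)(s,a,s') = R(s,a,s') + g*E[R(s',A,S')] - E[R(s,A,S')] - g*E[R(S,A,S')]
    return R + gamma * mean_s[None, None, :] - mean_s[:, None, None] - gamma * mean_all

def epic_distance(R1, R2, gamma=0.99):
    """Pearson distance sqrt((1 - rho) / 2) between canonicalized rewards,
    evaluated here over all transitions rather than samples from D."""
    c1 = canonicalize(R1, gamma).ravel()
    c2 = canonicalize(R2, gamma).ravel()
    rho = np.corrcoef(c1, c2)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

# Invariance checks: potential shaping and positive rescaling leave EPIC at ~0.
rng = np.random.default_rng(0)
R = rng.normal(size=(5, 3, 5))
phi = rng.normal(size=5)  # an arbitrary potential function
shaped = R + 0.99 * phi[None, None, :] - phi[:, None, None]
assert epic_distance(R, shaped, gamma=0.99) < 1e-6
assert epic_distance(R, 3.0 * R) < 1e-6
```

Pearson correlation is itself invariant to positive affine transformations, so combining it with canonicalization yields invariance to both rescaling and shaping.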

2. RELATED WORK

There exist a variety of methods to learn reward functions. Inverse reinforcement learning (IRL) [13] is a common approach that infers a reward function from demonstrations. The IRL problem is inherently underconstrained: many different reward functions lead to the same demonstrations. Bayesian IRL [17] handles this ambiguity by inferring a posterior over reward functions. By contrast, Maximum Entropy IRL [26] selects the highest-entropy reward function consistent with the demonstrations; this method has scaled to high-dimensional environments [7; 8]. An alternative approach is to learn from preference comparisons between two trajectories [1; 25; 6; 18]. T-REX [4] is a hybrid approach, learning from a ranked set of demonstrations. More directly, Cabi et al. [5] learn from "sketches" of cumulative reward over an episode.

To the best of our knowledge, there is no prior work that focuses on evaluating reward functions directly. The most closely related work is Ng et al. [14], which identifies reward transformations guaranteed not to change the optimal policy. However, a variety of ad-hoc methods have been developed to evaluate reward functions. The rollout method, evaluating rollouts of a policy trained on the learned reward, is evident in the earliest work on IRL [13]. Fu et al. [8] refined the rollout method by testing on a transfer environment, inspiring our experiment in section 6.3. Recent work has compared reward functions by scatterplotting returns [10; 4], inspiring our ERC baseline (section 5.1).
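Ng et al.'s result can be checked numerically: adding a shaping term γΦ(s') − Φ(s) to a reward leaves the optimal policy unchanged. Below is a minimal sketch on a random tabular MDP; the value-iteration helper and all names are illustrative, not from the paper:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Return the greedy optimal policy for transition probabilities
    P[s, a, s2] and reward R[s, a, s2] via tabular value iteration."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)  # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)  # greedy action in each state

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 6, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalize to valid probabilities
R = rng.normal(size=(n_states, n_actions, n_states))
phi = rng.normal(size=n_states)              # arbitrary potential function

# Shaped reward: R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s).
shaped = R + gamma * phi[None, None, :] - phi[:, None, None]

# Shaping shifts Q*(s, a) by -phi(s), so the argmax policy is unchanged.
assert (value_iteration(P, R, gamma) == value_iteration(P, shaped, gamma)).all()
```

This is exactly the class of transformations EPIC canonicalizes away before comparing rewards.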

3. BACKGROUND

This section introduces material needed for the distances defined in subsequent sections. We start by introducing the Markov Decision Process (MDP) formalism, then describe when reward functions induce the same optimal policies in an MDP, and finally define the notion of a distance metric.

