QUANTIFYING DIFFERENCES IN REWARD FUNCTIONS

Abstract

For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by evaluating policies optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward. Moreover, this method can only tell us about behavior in the evaluation environment; the reward may incentivize very different behavior in even a slightly different deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance, which quantifies the difference between two reward functions directly, without a policy optimization step. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be efficiently approximated and is more robust than baselines to the choice of coverage distribution. Finally, we show that EPIC distance bounds the regret of optimal policies even under different transition dynamics, and we confirm empirically that it predicts policy training success. Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.

* Work partially conducted while at DeepMind.

An alternative approach is to learn from preference comparisons between two trajectories [1; 25; 6; 18]. T-REX [4] is a hybrid approach, learning from a ranked set of demonstrations. More directly, Cabi et al. [5] learn from "sketches" of cumulative reward over an episode. To the best of our knowledge, there is no prior work that focuses on evaluating reward functions directly. The most closely related work is Ng et al. [14], which identifies reward transformations guaranteed not to change the optimal policy.
However, a variety of ad-hoc methods have been developed to evaluate reward functions. The rollout method, evaluating rollouts of a policy trained on the learned reward, is evident in the earliest work on IRL [13]. Fu et al. [8] refined the rollout method by testing on a transfer environment, inspiring our experiment in section 6.3. Recent work has compared reward functions by scatterplotting returns [10; 4], inspiring our ERC baseline (section 5.1).

This section introduces material needed for the distances defined in subsequent sections. We start by introducing the Markov Decision Process (MDP) formalism, then describe when reward functions induce the same optimal policies in an MDP, and finally define the notion of a distance metric.

* Note that constant shifts in the reward of an undiscounted MDP would cause the value function to diverge. Fortunately, the shaping γΦ(s') − Φ(s) is unchanged by constant shifts to Φ when γ = 1.

1. INTRODUCTION

Reinforcement learning (RL) has reached or surpassed human performance in many domains with clearly defined reward functions, such as games [20; 15; 23] and narrowly scoped robotic manipulation tasks [16]. Unfortunately, the reward functions for most real-world tasks are difficult or impossible to specify procedurally. Even a task as simple as peg insertion from pixels has a non-trivial reward function that must usually be learned [22, IV.A]. Tasks involving human interaction can have far more complex reward functions that users may not even be able to introspect on. These challenges have inspired work on learning a reward function, whether from demonstrations [13; 17; 26; 8; 3], preferences [1; 25; 6; 18; 27] or both [10; 4].

Prior work has usually evaluated a learned reward function R̂ using the "rollout method": training a policy π_R̂ to optimize R̂ and then examining rollouts from π_R̂. Unfortunately, using RL to compute π_R̂ is often computationally expensive. Furthermore, the method produces false negatives when the reward R̂ matches user preferences but the RL algorithm fails to optimize R̂.

The rollout method also produces false positives. Of the many reward functions that induce the desired rollout in a given environment, only a small subset align with the user's preferences. For example, suppose the agent can reach states {A, B, C}. If the user prefers A > B > C, but the agent instead learns A > C > B, the agent will still go to the correct state A. However, if the initial state distribution or transition dynamics change, misaligned rewards may induce undesirable policies. For example, if A is no longer reachable at deployment, the previously reliable agent would misbehave by going to the least-favoured state C. We propose instead to evaluate learned rewards via their distance from other reward functions, and summarize our desiderata for reward function distances in Table 1.
For benchmarks, it is usually possible to directly compare a learned reward R̂ to the true reward function R. Alternatively, benchmark creators can train a "proxy" reward function from a large human data set. This proxy can then be used as a stand-in for the true reward R when evaluating algorithms trained on a different or smaller data set.

Table 1: Summary of the desiderata satisfied by each reward function distance. Key: the distance is a pseudometric (section 3); invariant to potential shaping [14] and positive rescaling (section 3); a computationally efficient approximation achieving low error (section 6.1); robust to the choice of coverage distribution (section 6.2); and predictive of the similarity of the trained policies (section 6.3).

Distance | Pseudometric | Invariant | Efficient | Robust | Predictive
EPIC     | ✓            | ✓         | ✓         | ✓      | ✓
NPEC     | ✗            | ✓         | ✗         | ✗      | ✓
ERC      | ✓            | ✗         | ✓         | ✗      | ✓

Comparison with a ground-truth reward function is rarely possible outside of benchmarks. However, even in this challenging case, comparisons can at least be used to cluster reward models trained using different techniques or data. Larger clusters are more likely to be correct, since multiple methods arrived at a similar result. Moreover, our regret bound (Theorem 4.9) suggests we could use interpretability methods [12] on one model and get some guarantees for models in the same cluster.

We introduce the Equivalent-Policy Invariant Comparison (EPIC) distance, which meets all the criteria in Table 1. We believe EPIC is the first method to quantitatively evaluate reward functions without training a policy. EPIC (section 4) canonicalizes the reward functions' potential-based shaping [14], then takes the correlation between the canonical rewards over a coverage distribution D of transitions. We also introduce the baselines NPEC and ERC (section 5), which partially satisfy the criteria.

EPIC works best when D has support on all realistic transitions. We achieve this in our experiments by using uninformative priors, such as rollouts of a policy taking random actions. Moreover, we find EPIC is robust to the exact choice of distribution D, producing similar results across a range of distributions, whereas ERC and especially NPEC are highly sensitive to the choice of D (section 6.2).

Finally, low EPIC distance between a learned reward R̂ and the true reward R predicts low regret. That is, the policies π_R̂ and π_R optimized for R̂ and R obtain similar returns under R. Theorem 4.9 bounds the regret even in unseen environments; by contrast, the rollout method can only determine regret in the evaluation environment. We also confirm this result empirically (section 6.3).

2. RELATED WORK

There exist a variety of methods to learn reward functions. Inverse reinforcement learning (IRL) [13] is a common approach that works by inferring a reward function from demonstrations. The IRL problem is inherently underconstrained: many different reward functions lead to the same demonstrations. Bayesian IRL [17] handles this ambiguity by inferring a posterior over reward functions. By contrast, Maximum Entropy IRL [26] selects the highest entropy reward function consistent with the demonstrations; this method has scaled to high-dimensional environments [7; 8].

Definition 3.1. A Markov Decision Process (MDP) M = (S, A, γ, d_0, T, R) consists of a set of states S and a set of actions A; a discount factor γ ∈ [0, 1]; an initial state distribution d_0(s); a transition distribution T(s' | s, a) specifying the probability of transitioning to s' from s after taking action a; and a reward function R(s, a, s') specifying the reward upon taking action a in state s and transitioning to state s'.

A trajectory τ = (s_0, a_0, s_1, a_1, ...) consists of a sequence of states s_i ∈ S and actions a_i ∈ A. The return on a trajectory is defined as the sum of discounted rewards, g(τ; R) = Σ_{t=0}^{|τ|} γ^t R(s_t, a_t, s_{t+1}), where the length of the trajectory |τ| may be infinite. In the following, we assume a discounted (γ < 1) infinite-horizon MDP. The results can be generalized to undiscounted (γ = 1) MDPs subject to regularity conditions needed for convergence.

A stochastic policy π(a | s) assigns probabilities to taking action a ∈ A in state s ∈ S. The objective of an MDP is to find a policy π that maximizes the expected return G(π) = E_{τ(π)}[g(τ; R)], where τ(π) is a trajectory generated by sampling the initial state s_0 from d_0, each action a_t from the policy π(a_t | s_t) and successor states s_{t+1} from the transition distribution T(s_{t+1} | s_t, a_t).
An MDP M has a set of optimal policies π*(M) that maximize the expected return, π*(M) = argmax_π G(π). In this paper, we consider the case where we only have access to an MDP\R, M⁻ = (S, A, γ, d_0, T). The unknown reward function R must be learned from human data. Typically, only the state space S, action space A and discount factor γ are known exactly, with the initial state distribution d_0 and transition dynamics T only observable from interacting with the environment M⁻. Below, we describe an equivalence class whose members are guaranteed to have the same optimal policy set in any MDP\R M⁻ with fixed S, A and γ (allowing the unknown T and d_0 to take arbitrary values).

Definition 3.2. Let γ ∈ [0, 1] be the discount factor, and Φ : S → ℝ a real-valued function. Then R(s, a, s') = γΦ(s') − Φ(s) is a potential shaping reward, with potential Φ [14].

Definition 3.3 (Reward Equivalence). We define two bounded reward functions R_A and R_B to be equivalent, R_A ≡ R_B, for a fixed (S, A, γ) if and only if there exists a constant λ > 0 and a bounded potential function Φ : S → ℝ such that for all s, s' ∈ S and a ∈ A:

R_B(s, a, s') = λR_A(s, a, s') + γΦ(s') − Φ(s). (1)

Proposition 3.4. The binary relation ≡ is an equivalence relation. Let R_A, R_B, R_C : S × A × S → ℝ be bounded reward functions. Then ≡ is reflexive, R_A ≡ R_A; symmetric, R_A ≡ R_B implies R_B ≡ R_A; and transitive, (R_A ≡ R_B) ∧ (R_B ≡ R_C) implies R_A ≡ R_C.

Proof. See section A.3.1 in supplementary material.

The return of potential shaping γΦ(s') − Φ(s) on a trajectory segment (s_0, ..., s_T) is γ^T Φ(s_T) − Φ(s_0). The first term γ^T Φ(s_T) → 0 as T → ∞, while the second term Φ(s_0) only depends on the initial state, and so potential shaping does not change the set of optimal policies.
Moreover, any additive transformation that is not potential shaping will, for some reward R and transition distribution T, produce a set of optimal policies that is disjoint from the original [14]. The set of optimal policies is invariant to constant shifts c ∈ ℝ in the reward; however, this can already be obtained by shifting Φ by c/(γ − 1).* Scaling a reward function by a positive factor λ > 0 scales the expected return of all trajectories by λ, also leaving the set of optimal policies unchanged.

If R_A ≡ R_B for some fixed (S, A, γ), then for any MDP\R M⁻ = (S, A, γ, d_0, T) we have π*((M⁻, R_A)) = π*((M⁻, R_B)), where (M⁻, R) denotes the MDP specified by M⁻ with reward function R. In other words, R_A and R_B induce the same optimal policies for all initial state distributions d_0 and transition dynamics T.

Definition 3.5. Let X be a set and d : X × X → [0, ∞) a function. d is a premetric if d(x, x) = 0 for all x ∈ X. d is a pseudometric if, furthermore, it is symmetric, d(x, y) = d(y, x) for all x, y ∈ X, and satisfies the triangle inequality, d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ X. d is a metric if, furthermore, for all x, y ∈ X, d(x, y) = 0 =⇒ x = y.

We wish for d(R_A, R_B) = 0 whenever the rewards are equivalent, R_A ≡ R_B, even if they are not identical, R_A ≠ R_B. This is forbidden in a metric but permitted in a pseudometric, while retaining other guarantees such as symmetry and the triangle inequality that a metric provides. Accordingly, a pseudometric is usually the best choice for a distance d over reward functions.
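To make the equivalence class concrete, the following sketch checks via value iteration that a reward and a rescaled-and-shaped equivalent (Definition 3.3) induce the same optimal policy. The tiny random MDP and all names here are illustrative, not from the paper:

```python
import numpy as np

# Illustrative 3-state, 2-action MDP: rewards related by
# R_B = lam * R_A + gam * Phi(s') - Phi(s) share optimal policies.
n_s, n_a, gam = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # T[s, a, s'] transition probs
R_A = rng.normal(size=(n_s, n_a, n_s))             # R_A(s, a, s')

lam, Phi = 2.5, rng.normal(size=n_s)
R_B = lam * R_A + gam * Phi[None, None, :] - Phi[:, None, None]

def greedy_policy(R, T, gam, iters=500):
    """Value iteration; returns the greedy (optimal) action in each state."""
    V = np.zeros(n_s)
    for _ in range(iters):
        Q = (T * (R + gam * V[None, None, :])).sum(axis=2)  # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Scaling and shaping change Q-values only by an action-independent amount,
# so the greedy policies coincide.
assert np.array_equal(greedy_policy(R_A, T, gam), greedy_policy(R_B, T, gam))
```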

4. COMPARING REWARD FUNCTIONS WITH EPIC

In this section we introduce the Equivalent-Policy Invariant Comparison (EPIC) pseudometric. This novel distance canonicalizes the reward functions' potential-based shaping, then compares the canonical representatives using Pearson correlation, which is invariant to scale. Together, this construction makes EPIC invariant on reward equivalence classes. See section A.3.2 for proofs.

We define the canonically shaped reward C_{D_S,D_A}(R) as an expectation over some arbitrary distributions D_S and D_A over states S and actions A respectively. This construction means that C_{D_S,D_A}(R) does not depend on the MDP's initial state distribution d_0 or transition dynamics T. In particular, we may evaluate R on transitions that are impossible in the training environment, since these may become possible in a deployment environment with a different d_0 or T.

Definition 4.1 (Canonically Shaped Reward). Let R : S × A × S → ℝ be a reward function. Given distributions D_S ∈ Δ(S) and D_A ∈ Δ(A) over states and actions, let S and S' be random variables independently sampled from D_S and A sampled from D_A. We define the canonically shaped R to be:

C_{D_S,D_A}(R)(s, a, s') = R(s, a, s') + E[γR(s', A, S') − R(s, A, S') − γR(S, A, S')].

Informally, if R' is shaped by potential Φ, then increasing Φ(s) decreases R'(s, a, s') but increases E[−R'(s, A, S')], canceling. Similarly, increasing Φ(s') increases R'(s, a, s') but decreases E[γR'(s', A, S')].

Lemma. Let D_S ∈ Δ(S) and D_A ∈ Δ(A). Let R, ν : S × A × S → ℝ be reward functions, with ν(s, a, s') = λ I[(s, a, s') = (x, u, x')] for some λ ∈ ℝ, x, x' ∈ S and u ∈ A. Let Φ_{D_S,D_A}(R)(s, a, s') = C_{D_S,D_A}(R)(s, a, s') − R(s, a, s'). Then:

‖Φ_{D_S,D_A}(R + ν) − Φ_{D_S,D_A}(R)‖_∞ = |λ|(1 + γD_S(x))D_A(u)D_S(x').

We have canonicalized potential shaping; next, we compare the rewards in a scale-invariant manner.

Definition 4.4. The Pearson distance between random variables X and Y is defined by D_ρ(X, Y) = √(1 − ρ(X, Y))/√2, where ρ(X, Y) is the Pearson correlation between X and Y.

Lemma 4.5. The Pearson distance D_ρ is a pseudometric. Moreover, let a, b ∈ (0, ∞), c, d ∈ ℝ and X, Y be random variables. Then it follows that 0 ≤ D_ρ(aX + c, bY + d) = D_ρ(X, Y) ≤ 1.

We can now define EPIC in terms of the Pearson distance between canonically shaped rewards.

Definition 4.6 (EPIC distance). D_EPIC(R_A, R_B) = D_ρ(C_{D_S,D_A}(R_A)(S, A, S'), C_{D_S,D_A}(R_B)(S, A, S')).

Theorem 4.7. The Equivalent-Policy Invariant Comparison distance is a pseudometric.

Since EPIC is a pseudometric, it satisfies the triangle inequality. To see why this is useful, consider an environment with an expensive-to-evaluate ground-truth reward R. Directly comparing many learned rewards R̂ to R might be prohibitively expensive. We can instead pay a one-off cost: query R a finite number of times and infer a proxy reward R_P with D_EPIC(R, R_P) ≤ ε. The triangle inequality allows us to evaluate R̂ via comparison to R_P, since D_EPIC(R̂, R) ≤ D_EPIC(R̂, R_P) + ε. This is particularly useful for benchmarks, which can be expensive to build but should be cheap to use.

Theorem 4.8. Let R_A, R'_A, R_B, R'_B : S × A × S → ℝ be reward functions such that R_A ≡ R'_A and R_B ≡ R'_B. Then 0 ≤ D_EPIC(R_A, R_B) = D_EPIC(R'_A, R'_B) ≤ 1.

The following is our main theoretical result, showing that the D_EPIC(R_A, R_B) distance gives an upper bound on the difference in returns, under either R_A or R_B, between optimal policies π*_{R_A} and π*_{R_B}. In other words, EPIC bounds the regret under R_A of using π*_{R_B} instead of π*_{R_A}. Moreover, by symmetry, D_EPIC(R_A, R_B) also bounds the regret under R_B of using π*_{R_A} instead of π*_{R_B}.

Theorem 4.9. Let M be a γ-discounted MDP\R with finite state and action spaces S and A. Let R_A, R_B : S × A × S → ℝ be rewards, and π*_A, π*_B be respective optimal policies.
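A minimal sample-based estimate of the Pearson distance, illustrating the affine invariance of Lemma 4.5 (the function name and sample sizes are ours):

```python
import numpy as np

def pearson_distance(x, y):
    """D_rho(X, Y) = sqrt(1 - rho(X, Y)) / sqrt(2), estimated from samples."""
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(1.0 - rho, 0.0)) / np.sqrt(2.0)

rng = np.random.default_rng(0)
x, y = rng.normal(size=1000), rng.normal(size=1000)

# Lemma 4.5: invariant to positive affine transformations of either argument.
assert np.isclose(pearson_distance(3.0 * x + 7.0, 0.5 * y - 2.0),
                  pearson_distance(x, y))
assert pearson_distance(x, 2.0 * x + 1.0) < 1e-6   # perfectly correlated
```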
Let D_π(t, s_t, a_t, s_{t+1}) denote the distribution over transitions S × A × S induced by policy π at time t, and D(s, a, s') be the coverage distribution used to compute D_EPIC. Suppose there exists K > 0 such that KD(s_t, a_t, s_{t+1}) ≥ D_π(t, s_t, a_t, s_{t+1}) for all times t ∈ ℕ, triples (s_t, a_t, s_{t+1}) ∈ S × A × S and policies π ∈ {π*_A, π*_B}. Then the regret under R_A from executing π*_B instead of π*_A is at most

G_{R_A}(π*_A) − G_{R_A}(π*_B) ≤ 16K ‖R_A‖_2 (1 − γ)^{-1} D_EPIC(R_A, R_B),

where G_R(π) is the return of policy π under reward R. We generalize the regret bound to continuous spaces in theorem A.16.

While EPIC upper bounds policy regret, it does not lower bound it. In fact, no reward distance can lower bound regret in arbitrary environments. For example, suppose the deployment environment transitions to a randomly chosen state independent of the action taken. In this case, all policies obtain the same expected return, so the policy regret is always zero, regardless of the reward functions.

To demonstrate EPIC's properties, we compare the gridworld reward functions from Figure 1, reporting the distances between all reward pairs in Figure A.2. Dense is a rescaled and shaped version of Sparse, despite looking dissimilar at first glance, so D_EPIC(Sparse, Dense) = 0. By contrast, D_EPIC(Path, Cliff) = 0.27. In deterministic gridworlds, Path and Cliff have the same optimal policy, so the rollout method could wrongly conclude they are equivalent. But in fact the rewards are fundamentally different: when there is a significant risk of "slipping" in the wrong direction, the optimal policy for Cliff walks along the top instead of the middle row, incurring a −1 penalty to avoid the risk of falling into the −4 "cliff". For this example, we used state and action distributions D_S and D_A uniform over S and A, and coverage distribution D uniform over state-action pairs (s, a), with s' deterministically computed.
It is important these distributions have adequate support. As an extreme example, if D S and D have no support for a particular state then the reward of that state has no effect on the distance. We can compute EPIC exactly in a tabular setting, but in general use a sample-based approximation (section A.1.1).
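The exact tabular computation mentioned above can be sketched as follows. The array layout, helper names, and the uniform product coverage distribution are our assumptions; the check mirrors Theorem 4.8, which says equivalent rewards are EPIC distance zero apart:

```python
import numpy as np

def canonicalize(R, gam, d_s, d_a):
    """Exact canonically shaped reward (Definition 4.1) for a tabular reward
    array R[s, a, s'], with states ~ d_s and actions ~ d_a."""
    m = np.einsum('sux,u,x->s', R, d_a, d_s)   # m(s) = E[R(s, A, S')]
    return R + gam * m[None, None, :] - m[:, None, None] - gam * (d_s @ m)

def epic_distance(R_a, R_b, gam, d_s, d_a):
    """Tabular EPIC under a uniform coverage distribution over (s, a, s')."""
    ca = canonicalize(R_a, gam, d_s, d_a).ravel()
    cb = canonicalize(R_b, gam, d_s, d_a).ravel()
    rho = np.corrcoef(ca, cb)[0, 1]
    return np.sqrt(max(1.0 - rho, 0.0)) / np.sqrt(2.0)

# A rescaled-and-shaped reward (Definition 3.3) is EPIC distance ~0 away.
rng = np.random.default_rng(0)
n_s, n_a, gam = 4, 3, 0.9
R = rng.normal(size=(n_s, n_a, n_s))
Phi = rng.normal(size=n_s)
R_equiv = 1.7 * R + gam * Phi[None, None, :] - Phi[:, None, None]
d_s, d_a = np.full(n_s, 1 / n_s), np.full(n_a, 1 / n_a)
assert epic_distance(R, R_equiv, gam, d_s, d_a) < 1e-6
```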

5. BASELINE APPROACHES FOR COMPARING REWARD FUNCTIONS

Given the lack of established methods, we develop two alternatives as baselines: Episode Return Correlation (ERC) and Nearest Point in Equivalence Class (NPEC). 

5.1. EPISODE RETURN CORRELATION (ERC)

The goal of an MDP is to maximize expected episode return, so it is natural to compare reward functions by the returns they induce. If the return of a reward function R_A is a positive affine transformation of that of another reward R_B, then R_A and R_B have the same set of optimal policies. This suggests using Pearson distance, which is invariant to positive affine transformations.

Definition 5.1 (Episode Return Correlation (ERC) pseudometric). Let D be some distribution over trajectories, and let E be a random variable sampled from D. The Episode Return Correlation distance between reward functions R_A and R_B is the Pearson distance between their episode returns on D: D_ERC(R_A, R_B) = D_ρ(g(E; R_A), g(E; R_B)).

Prior work has produced scatter plots of the return of R_A against R_B over episodes [4, Figure 3] and fixed-length segments [10, section D]. ERC is the Pearson distance of such plots, so is a natural baseline. We approximate ERC by the correlation of episode returns on a finite collection of rollouts.

ERC is invariant to shaping when the initial state s_0 and terminal state s_T are fixed. Let R be a reward function and Φ a potential function, and define the shaped reward R'(s, a, s') = R(s, a, s') + γΦ(s') − Φ(s). The return under the shaped reward on a trajectory τ = (s_0, a_0, ..., s_T) is g(τ; R') = g(τ; R) + γ^T Φ(s_T) − Φ(s_0). Since s_0 and s_T are fixed, γ^T Φ(s_T) − Φ(s_0) is constant. It follows that ERC is invariant to shaping, as Pearson distance is invariant to constant shifts. In fact, for infinite-horizon discounted MDPs only s_0 needs to be fixed, since γ^T Φ(s_T) → 0 as T → ∞.

However, if the initial state s_0 is stochastic, then the ERC distance can take on arbitrary values under shaping. Let R_A and R_B be two arbitrary reward functions. Suppose that there are at least two distinct initial states, s_X and s_Y, with non-zero measure in D.
Choose potential Φ(s) = 0 everywhere except Φ(s_X) = Φ(s_Y) = c, and let R'_A and R'_B denote R_A and R_B shaped by Φ. As c → ∞, the correlation ρ(g(E; R'_A), g(E; R'_B)) → 1. This is because the relative difference tends to zero, even though g(E; R'_A) and g(E; R'_B) continue to have the same absolute difference as c varies. Consequently, the ERC pseudometric D_ERC(R'_A, R'_B) → 0 as c → ∞. By an analogous argument, setting Φ(s_X) = c and Φ(s_Y) = −c gives D_ERC(R'_A, R'_B) → 1 as c → ∞.
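A minimal sketch of the ERC estimate from Definition 5.1, using hypothetical random-walk trajectories in a 1-D chain (all names and the toy rewards are ours); it checks invariance to positive affine transformations of the reward:

```python
import numpy as np

def episode_return(traj, R, gam):
    """g(tau; R) = sum_t gam^t R(s_t, a_t, s_{t+1}) over a list of transitions."""
    return sum(gam ** t * R(s, a, s2) for t, (s, a, s2) in enumerate(traj))

def erc_distance(trajs, R_a, R_b, gam):
    """Pearson distance between episode returns on a finite batch of rollouts."""
    g_a = [episode_return(tau, R_a, gam) for tau in trajs]
    g_b = [episode_return(tau, R_b, gam) for tau in trajs]
    rho = np.corrcoef(g_a, g_b)[0, 1]
    return np.sqrt(max(1.0 - rho, 0.0)) / np.sqrt(2.0)

# Sample fixed-length random walks starting from the origin.
rng = np.random.default_rng(0)
gam, trajs = 0.99, []
for _ in range(200):
    s, traj = 0, []
    for _ in range(20):
        a = int(rng.choice([-1, 1]))
        traj.append((s, a, s + a))
        s += a
    trajs.append(traj)

R_a = lambda s, a, s2: -abs(s2)                    # distance-to-origin penalty
R_b = lambda s, a, s2: 3.0 * R_a(s, a, s2) + 1.0   # positive affine transform
assert erc_distance(trajs, R_a, R_b, gam) < 1e-6
```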

5.2. NEAREST POINT IN EQUIVALENCE CLASS (NPEC)

NPEC takes the minimum L^p distance between equivalence classes. See section A.3.3 for proofs.

Definition 5.2 (L^p distance). Let D be a coverage distribution over transitions (s, a, s') and let p ≥ 1 be a power. The L^p distance between reward functions R_A and R_B is the L^p norm of their difference:

D_{L^p,D}(R_A, R_B) = E_{(s,a,s') ∼ D}[|R_A(s, a, s') − R_B(s, a, s')|^p]^{1/p}.

[Figure 2 caption: the PointMass rewards include or exclude a quadratic penalty on control ẍ²; S is Sparse(x) = 1[|x| < 0.05], D is the shaped Dense(x, x') = Sparse(x) + |x'| − |x|, and M is Magnitude(x) = −|x|.]

The L^p distance is affected by potential shaping and positive rescaling that do not change the optimal policy. A natural solution is to take the distance from the nearest point in the equivalence class: D^U_NPEC(R_A, R_B) = inf_{R'_A ≡ R_A} D_{L^p,D}(R'_A, R_B). Unfortunately, D^U_NPEC is sensitive to R_B's scale. It is tempting to instead take the infimum over both arguments of D_{L^p,D}. However, inf_{R'_A ≡ R_A, R'_B ≡ R_B} D_{L^p,D}(R'_A, R'_B) = 0, since all equivalence classes come arbitrarily close to the origin in L^p space. Instead, we fix this by normalizing D^U_NPEC.

Definition 5.3. NPEC is defined by D_NPEC(R_A, R_B) = D^U_NPEC(R_A, R_B)/D^U_NPEC(Zero, R_B) when D^U_NPEC(Zero, R_B) ≠ 0, and is otherwise given by D_NPEC(R_A, R_B) = 0. If D^U_NPEC(Zero, R_B) = 0 then D^U_NPEC(R_A, R_B) = 0, since R'_A can be scaled arbitrarily close to Zero. Since all policies are optimal for R ≡ Zero, we choose D_NPEC(R_A, R_B) = 0 in this case.

Theorem 5.4. D_NPEC is a premetric on the space of bounded reward functions. Moreover, let R_A, R'_A, R_B, R'_B : S × A × S → ℝ be bounded reward functions such that R_A ≡ R'_A and R_B ≡ R'_B. Then 0 ≤ D_NPEC(R_A, R_B) = D_NPEC(R'_A, R'_B) ≤ 1. Note that D_NPEC may not be symmetric and so is not, in general, a pseudometric: see proposition A.3.

The infimum in D^U_NPEC can be computed exactly in a tabular setting, but in general we must approximate it using gradient descent. This gives an upper bound for D^U_NPEC, but the quotient of upper bounds D_NPEC may be too low or too high. See section A.1.2 for details of the approximation.

6. EXPERIMENTS

We evaluate EPIC and the baselines ERC and NPEC in a variety of continuous control tasks. In section 6.1, we compute the distance between hand-designed reward functions, finding EPIC to be the most reliable. NPEC has substantial approximation error, and ERC sometimes erroneously assigns high distance to equivalent rewards. Next, in section 6.2 we show EPIC is robust to the exact choice of coverage distribution D, whereas ERC and especially NPEC are highly sensitive to the choice of D. Finally, in section 6.3 we find that the distance of learned reward functions to a ground-truth reward predicts the return obtained by policy training, even in an unseen test environment.

6.1. COMPARING HAND-DESIGNED REWARD FUNCTIONS

We compare procedurally specified reward functions in four tasks, finding that EPIC is more reliable than the baselines NPEC and ERC, and more computationally efficient than NPEC. Figure 2 presents results in the proof-of-concept PointMass task. The results for Gridworld, HalfCheetah and Hopper, in section A.2.4, are qualitatively similar.

In PointMass the agent can accelerate left or right on a line. The reward functions include (✓) or exclude (✗) a quadratic control penalty ẍ². The sparse reward (S) gives a reward of 1 in the region ±0.05 from the origin. The dense reward (D) is a shaped version of the sparse reward. The magnitude reward (M) is the negative distance of the agent from the origin.

We find that EPIC correctly identifies the equivalent reward pairs (S✓-D✓ and S✗-D✗) with estimated distance < 1 × 10^{-3}. By contrast, NPEC has substantial approximation error: D_NPEC(D, S) = 0.58. Similarly, D_ERC(D, S) = 0.56 due to ERC's erroneous handling of stochastic initial states. Moreover, NPEC is computationally inefficient: Figure 2(b) took 31 hours to compute. By contrast, the figures for EPIC and ERC were generated in less than two hours, and a lower precision approximation of EPIC finishes in just 17 seconds (see section A.2.6).

6.2. SENSITIVITY OF REWARD DISTANCE TO COVERAGE DISTRIBUTION

Reward distances should be robust to the choice of coverage distribution D. In Table 2 (center), we report distances from the ground-truth reward (GT) to reward functions (rows) across coverage distributions D ∈ {π_uni, π*, Mix} (columns). We find EPIC is fairly robust to the choice of D, with a similar ratio between rows in each column. By contrast, ERC and especially NPEC are substantially more sensitive to the choice of D.

We evaluate in the PointMaze MuJoCo task from Fu et al. [8], where a point mass agent must navigate around a wall to reach a goal. The coverage distributions D are induced by rollouts from three different policies: π_uni takes actions uniformly at random, producing broad support over transitions; π* is an expert policy, yielding a distribution concentrated around the goal; and Mix is a mixture of the two. In EPIC, D_S and D_A are marginalized from D and so also vary with D.

We evaluate four reward learning algorithms: Regression onto reward labels [target method from 6, section 3.3], Preference comparisons on trajectories [6], and adversarial IRL with a state-only (AIRL SO) and state-action (AIRL SA) reward model [8]. All models are trained using synthetic data from an oracle with access to the ground-truth; see section A.2.2 for details.

We find EPIC is robust to varying D when comparing the learned reward models: the distance varies by less than 2×, and the ranking between the reward models is the same across coverage distributions. By contrast, NPEC is highly sensitive to D: the ratio of AIRL SO (817) to Pref (8.51) is 96:1 under π_uni but only 2:1 (2706:1333) under π*. ERC lies somewhere in the middle: the ratio is 22:1 (549:24.9) under π_uni and 3:2 (523:360) under π*. We evaluate the effect of pathological choices of coverage distribution D in Table A.8.
For example, Ind independently samples states and next states, giving physically impossible transitions, while Jail constrains rollouts to a tiny region excluding the goal. We find that the ranking of EPIC changes in only one distribution, whilst the ranking of NPEC changes in two cases and ERC changes in all cases. However, we do find that EPIC is sensitive to D on Mirage, a reward function we explicitly designed to break these methods. Mirage assigns a larger reward when close to a "mirage" state than when at the true goal, but is identical to GT at all other points. The "mirage" state is rarely visited by random exploration π uni as it is far away and on the opposite side of the wall from the agent. The expert policy π * is even less likely to visit it, as it is not on or close to the optimal path to the goal. As a result, the EPIC distance from Mirage to GT (Table 2 , bottom row) is small under π uni and π * . In general, any black-box method for assessing reward models -including the rollout method -only has predictive power on transitions visited during testing. Fortunately, we can achieve a broad support over states with Mix: it often navigates around the wall due to π * , but strays from the goal thanks to π uni . As a result, EPIC under Mix correctly infers that Mirage is far from the ground-truth GT. These empirical results support our theoretically inspired recommendation from section 4: "in general, it is best to choose D to have broad coverage over plausible transitions." Distributions such as π * are too narrow, assigning coverage only on a direct path from the initial state to the goal. Very broad distributions such as Ind waste probability mass on impossible transitions like teleporting. Distributions like Mix strike the right balance between these extremes.

6.3. PREDICTING POLICY PERFORMANCE FROM REWARD DISTANCE

We find that low distance from the ground-truth reward GT (Table 2, center) predicts high GT return (Table 2, right) of policies optimized for that reward. Moreover, the distance is predictive of return not just in PointMaze-Train, where the reward functions were trained and evaluated, but also in the unseen variant PointMaze-Test. This is despite the two variants differing in the position of the wall, such that policies for PointMaze-Train run directly into the wall in PointMaze-Test.

Both Regress and Pref achieve very low distances at convergence, producing near-expert policy performance. The AIRL SO and AIRL SA models have reward distances an order of magnitude higher and poor policy performance. Yet intriguingly, the generator policies for AIRL SO and AIRL SA, trained simultaneously with the reward, perform reasonably in PointMaze-Train. This suggests the learned rewards are reasonable on the subset of transitions taken by the generator policy, yet fail to transfer to the different transitions taken by a policy trained from scratch.

7. CONCLUSION

Our novel EPIC distance compares reward functions directly, without training a policy. We have proved it is a pseudometric, is bounded and invariant to equivalent rewards, and bounds the regret of optimal policies (Theorems 4.7, 4.8 and 4.9). Empirically, we find EPIC correctly infers zero distance between equivalent reward functions that the NPEC and ERC baselines wrongly consider dissimilar. Furthermore, we find the distance of learned reward functions to the ground-truth reward predicts the return of policies optimized for the learned reward, even in unseen environments. Standardized metrics are an important driver of progress in machine learning. Unfortunately, traditional policy-based metrics do not provide any guarantees as to the fidelity of the learned reward function. We believe the EPIC distance will be a highly informative addition to the evaluation toolbox, and would encourage researchers to report EPIC distance in addition to policy-based metrics. Our implementation of EPIC and our baselines, including a tutorial and documentation, are available at https://github.com/HumanCompatibleAI/evaluating-rewards.

A.1.1 SAMPLE-BASED APPROXIMATION FOR EPIC DISTANCE

We approximate EPIC distance (definition 4.6) by estimating Pearson distance on a set of samples, canonicalizing the reward on demand. Specifically, we sample a batch B_V of N_V samples from the coverage distribution D, and a batch B_M of N_M samples from the joint state and action distribution D_S × D_A. For each (s, a, s') ∈ B_V, we approximate the canonically shaped reward (definition 4.1) by taking the mean over B_M:

C_{D_S,D_A}(R)(s, a, s') = R(s, a, s') + E[γR(s', A, S') − R(s, A, S') − γR(S, A, S')] (5)
≈ R(s, a, s') + (γ/N_M) Σ_{(x,u) ∈ B_M} R(s', u, x) − (1/N_M) Σ_{(x,u) ∈ B_M} R(s, u, x) − c.

We drop the constant c from the approximation since it does not affect the Pearson distance; it can also be estimated in O(N_M²) time by c = (γ/N_M²) Σ_{(x,•) ∈ B_M} Σ_{(x',u) ∈ B_M} R(x, u, x'). Finally, we compute the Pearson distance between the approximate canonically shaped rewards on the batch of samples B_V, yielding an O(N_V N_M) time algorithm.

A.1.2 OPTIMIZATION-BASED APPROXIMATION FOR NPEC DISTANCE

D_NPEC(R_A, R_B) (section 5.2) is defined as the infimum of L^p distance over an infinite set of equivalent reward functions R ≡ R_A. We approximate this using gradient descent on the reward model R_{ν,c,w}(s, a, s') = exp(ν)R_A(s, a, s') + c + γΦ_w(s') − Φ_w(s), where ν, c ∈ ℝ are scalar weights and w is a vector of weights parameterizing a deep neural network Φ_w. The constant c ∈ ℝ is unnecessary if Φ_w has a bias term, but its inclusion simplifies the optimization problem. We optimize ν, c, w to minimize the mean of the cost

J(ν, c, w)(s, a, s') = |R_{ν,c,w}(s, a, s') − R_B(s, a, s')|^p (9)

on samples (s, a, s') from a coverage distribution D. Note that E_{(S,A,S') ∼ D}[J(ν, c, w)(S, A, S')]^{1/p} = D_{L^p,D}(R_{ν,c,w}, R_B) upper bounds the true NPEC distance since R_{ν,c,w} ≡ R_A. We found empirically that ν and c need to be initialized close to their optimal values for gradient descent to reliably converge.
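The sample-based approximation can be sketched as follows. The vectorized reward interface, batch layout, and toy rewards are our assumptions; the constant c is dropped as described, since it cannot change the Pearson distance:

```python
import numpy as np

def epic_sample(R_a, R_b, batch_v, batch_m, gam):
    """Sample-based EPIC estimate. batch_v: (N_V, 3) array of transitions
    (s, a, s') from the coverage distribution D; batch_m: (N_M, 2) array of
    (state, action) samples from D_S x D_A. R_a, R_b must broadcast over
    numpy arrays. Names and layout are illustrative."""
    s, a, s2 = batch_v[:, 0], batch_v[:, 1], batch_v[:, 2]
    x, u = batch_m[:, 0], batch_m[:, 1]

    def canon(R):
        # Mean over B_M for each transition in B_V; the constant term
        # gamma * E[R(S, A, S')] is dropped.
        from_s2 = R(s2[:, None], u[None, :], x[None, :]).mean(axis=1)
        from_s = R(s[:, None], u[None, :], x[None, :]).mean(axis=1)
        return R(s, a, s2) + gam * from_s2 - from_s

    rho = np.corrcoef(canon(R_a), canon(R_b))[0, 1]
    return np.sqrt(max(1.0 - rho, 0.0)) / np.sqrt(2.0)

# Even with small batches, a rescaled-and-shaped reward is detected as close.
rng = np.random.default_rng(0)
gam = 0.9
batch_v, batch_m = rng.normal(size=(64, 3)), rng.normal(size=(32, 2))
R_a = lambda s, a, s2: np.sin(s) + s2 ** 2 + 0.1 * a
R_b = lambda s, a, s2: 2.0 * R_a(s, a, s2) + gam * np.cos(s2) - np.cos(s)
assert epic_sample(R_a, R_b, batch_v, batch_m, gam) < 1e-6
```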
To resolve this problem, we initialize the affine parameters to ν ← log λ and c, where λ and c are found by:

argmin_{λ ≥ 0, c ∈ R} E_{s,a,s' ∼ D}[(λR_A(s, a, s') + c - R_B(s, a, s'))^2].

We use the active set method of Lawson & Hanson [11] to solve this constrained least-squares problem. These initial affine parameters minimize the L^p distance D_{L^p,D}(R_{ν,c,0}, R_B) when p = 2, with the potential fixed at Φ_0(s) = 0.
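As an illustration, this two-parameter constrained least squares has a closed-form solution: the unconstrained slope, clamped at zero, with the intercept refit. (The paper uses the Lawson & Hanson active-set solver; the function below is a simplified stand-in, and the names are our own.)

```python
import numpy as np

def init_affine(r_a, r_b):
    """Solve argmin over lam >= 0, c of sum((lam * r_a + c - r_b)**2).

    r_a, r_b: arrays of reward samples R_A(s, a, s'), R_B(s, a, s') on a
    batch drawn from the coverage distribution D.
    """
    # Unconstrained least-squares slope (note np.cov defaults to ddof=1,
    # so the variance must use ddof=1 as well).
    lam = np.cov(r_a, r_b)[0, 1] / np.var(r_a, ddof=1)
    lam = max(lam, 0.0)                 # enforce the lam >= 0 constraint
    c = r_b.mean() - lam * r_a.mean()   # optimal intercept given lam
    return lam, c

# The scale parameter is then initialized as nu = log(lam), guarding lam > 0.
```

When the constraint is inactive this is ordinary linear regression of R_B on R_A; when the unconstrained slope is negative, the optimum sits on the boundary λ = 0 with c equal to the mean of R_B.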

A.1.3 CONFIDENCE INTERVALS

We report confidence intervals to help measure the degree of error introduced by the approximations. Since approximate distances may not be normally distributed, we use bootstrapping to produce a distribution-free confidence interval. For EPIC, NPEC and Episode Return (sometimes reported as regret rather than return), we compute independent approximate distances or returns over different seeds.

For the experiments on learned reward functions (sections 6.2 and 6.3), we trained reward models using adversarial inverse reinforcement learning (AIRL; 8), preference comparison [6] and regression onto the ground-truth reward [target method from 6, section 3.3]. For AIRL, we use an existing open-source implementation [24]. We developed new implementations for preference comparison and regression, available at https://github.com/HumanCompatibleAI/evaluating-rewards.

We also use the RL algorithm proximal policy optimization (PPO; 19) on the ground-truth reward to train expert policies to provide demonstrations for AIRL. We use 9 seeds, taking rollouts from the seed with the highest ground-truth return. Our hyperparameters for PPO in Table A.2 are based on the defaults in Stable Baselines [9]. We only modified the batch size and learning rate, and disabled value function clipping to match the original PPO implementation. Our AIRL hyperparameters in Table A.3 likewise match the defaults, except for increasing the total number of timesteps to 10^6. Due to the high variance of AIRL, we trained 5 seeds, selecting the one with the highest ground-truth return. While this introduces a positive bias for our AIRL results, AIRL nonetheless performed worse than the other algorithms in our tests. Moreover, the goal of this paper is to evaluate distance metrics, not reward learning algorithms. For preference comparison, we performed a sweep over batch size, trajectory length and learning rate to decide on the hyperparameters in Table A.4.
The total number of time steps was selected by stopping once the loss curves showed diminishing returns. The exact value of the regularization weight was found to be unimportant, largely controlling the scale of the output at convergence. Finally, for regression we performed a sweep over batch size, learning rate and total time steps to decide on the hyperparameters in Table A.5. We found batch size and learning rate to be relatively unimportant, with many combinations performing well; however, regression converged slowly but steadily, requiring a relatively large 10 × 10^6 time steps for good performance in our environments.

All algorithms are trained on synthetic data generated from the ground-truth reward function. AIRL is provided with a large demonstration dataset of 100 000 time steps from an expert policy trained on the ground-truth reward (see Table A.3). In preference comparison and regression, each batch is sampled afresh from the coverage distribution specified in Table A.1 and labeled according to the ground-truth reward.

A.2.3 COMPUTING INFRASTRUCTURE

Experiments were conducted on a workstation (Intel i9-7920X CPU with 64 GB of RAM), and a small number of r5.24xlarge AWS VM instances, with 48 CPU cores on an Intel Skylake processor and 768 GB of RAM. It takes less than three weeks of compute on a single r5.24xlarge instance to run all the experiments described in this paper.

A.2.4 COMPARING HAND-DESIGNED REWARD FUNCTIONS

We compute distances between hand-designed reward functions in four environments: GridWorld, PointMass, HalfCheetah and Hopper. The reward functions for GridWorld are described in Figure A.1. We find the (approximate) EPIC distance closely matches our intuitions for similarity between the reward functions. NPEC often produces similar results to EPIC, but unfortunately is dogged by optimization error. This is particularly notable in higher-dimensional environments like HalfCheetah and Hopper, where the NPEC distance often exceeds the theoretical upper bound of 1.0 and the confidence interval width is frequently larger than 0.2. By contrast, ERC distance generally has a tight confidence interval, but systematically fails in the presence of shaping. For example, it confidently assigns large distances between equivalent reward pairs in PointMass such as S-D. However, ERC produces reasonable results in HalfCheetah and Hopper, where rewards are all similarly shaped. In fact, ERC picks up on a detail in Hopper that EPIC misses: whereas EPIC assigns a distance of around 0.71 between all rewards of different types (running vs backflipping), ERC assigns lower distances when the rewards are in the same direction (forward or backward). Given this, ERC may be attractive in some circumstances, especially given its ease of implementation. However, we would caution against using it in isolation due to the likelihood of misleading results in the presence of shaping.
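ERC's failure mode under shaping can be reproduced in a few lines. The toy setup below is our own illustration, not the paper's PointMass implementation: it compares episode returns of a reward and an equivalently shaped version on random-walk trajectories. Because initial states are stochastic, the residual Φ(s_0) term decorrelates the returns, pushing the Pearson distance away from zero even though the rewards induce identical optimal policies.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# 200 random-walk episodes of 21 states with stochastic initial states.
trajs = rng.normal(size=(200, 1)) + np.cumsum(
    rng.normal(scale=0.1, size=(200, 21)), axis=1)

def episode_returns(reward_fn):
    s, s_next = trajs[:, :-1], trajs[:, 1:]
    disc = gamma ** np.arange(s.shape[1])
    return (disc * reward_fn(s, s_next)).sum(axis=1)

def pearson_distance(x, y):
    rho = np.clip(np.corrcoef(x, y)[0, 1], -1.0, 1.0)
    return np.sqrt((1.0 - rho) / 2.0)

r = lambda s, s_next: -np.abs(s_next)       # "reach the origin" reward
phi = lambda s: 10.0 * s                    # an arbitrary potential
r_shaped = lambda s, s_next: r(s, s_next) + gamma * phi(s_next) - phi(s)

# Shaped returns differ from the originals by gamma^T phi(s_T) - phi(s_0),
# which varies across episodes, so ERC sees the rewards as dissimilar.
erc = pearson_distance(episode_returns(r), episode_returns(r_shaped))
```

With this seed `erc` lands well away from zero, whereas EPIC would canonicalize the shaping away and report zero distance.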

A.2.5 COMPARING LEARNED REWARD FUNCTIONS

Previously, we reported the mean approximate distance from a ground-truth reward of four learned reward models in PointMaze (Table 2). Since these distances are approximate, we report 95% lower and upper bounds computed via bootstrapping in Table A.7. We also include the relative difference of the upper and lower bounds from the mean, finding the relative difference to be fairly consistent across reward models for a given algorithm and coverage distribution pair. The relative difference is less than 1% for all EPIC and ERC distances. However, NPEC confidence intervals can be as wide as 50%: this is due to the method's high variance, and the small number of seeds we were able to run because of the method's computational expense.
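The bootstrapped bounds reported here follow the distribution-free procedure of Section A.1.3; a minimal sketch (function and argument names are our own):

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`
    (e.g. approximate distances computed over independent seeds).

    Makes no normality assumption: it resamples the values with
    replacement and reads off quantiles of the resampled means.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return values.mean(), lo, hi
```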

A.2.6 RUNTIME OF DISTANCE METRICS

We report the empirical runtime for EPIC and baselines in Table A.6, performing 25 pairwise comparisons across 5 reward functions in PointMass. These comparisons were run on an unloaded machine running Ubuntu 20.04 (kernel 5.4.0-52) with an Intel i9-7920X CPU and 64 GB of RAM. We report sequential runtimes: runtimes for all methods could be decreased further by parallelizing across seeds. The algorithms were configured to use 8 parallel environments for sampling. Inference and training took place on CPU. All methods used the same TensorFlow configuration, parallelizing operations across threads both within and between operations. We found GPUs offered no performance benefit in this setting, and in some cases even increased runtime. This is due to the fixed cost of CPU-GPU communication, and the relatively small size of the observations.

We find that in just 17 seconds EPIC can provide results with a 95% confidence interval < 0.023, an order of magnitude tighter than NPEC running for over 8 hours. Training policies for all learned rewards in this environment using PPO is comparatively slow, taking over 4 hours even with only 3 seeds. While ERC is relatively fast, it takes a large number of samples to achieve tight confidence intervals. Moreover, since PointMass has stochastic initial states, ERC can take on arbitrary values under shaping, as discussed in sections 5.1 and 6.3.

Key (PointMass rewards): S is Sparse(x) = 1[|x| < 0.05]; D is the shaped Dense(x, x') = Sparse(x) + |x'| - |x|; M is Magnitude(x) = -|x|. Confidence Interval (CI): 95% CI computed by bootstrapping over 10 000 samples.
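For reference, the sample-based estimator of Section A.1.1 whose runtime is discussed above fits in a few lines of NumPy. This simplified sketch (our own naming; one-dimensional states and actions; the constant c is dropped since it does not affect the Pearson distance) is not the paper's implementation:

```python
import numpy as np

def canonicalize(reward_fn, batch, samples, gamma):
    """Approximate the canonically shaped reward C(R) by a mean over B_M."""
    s, a, s_next = batch          # B_V: transitions from the coverage dist. D
    x, u = samples                # B_M: states from D_S, actions from D_A
    n_m = len(x)
    out = reward_fn(s, a, s_next).astype(float)
    for i in range(len(s)):
        # E[gamma R(s', A, S') - R(s, A, S')], estimated over B_M.
        out[i] += gamma * reward_fn(np.full(n_m, s_next[i]), u, x).mean()
        out[i] -= reward_fn(np.full(n_m, s[i]), u, x).mean()
    return out  # the dropped constant only shifts all entries equally

def pearson_distance(x, y):
    """D_rho(X, Y) = sqrt((1 - rho(X, Y)) / 2)."""
    rho = np.clip(np.corrcoef(x, y)[0, 1], -1.0, 1.0)
    return np.sqrt((1.0 - rho) / 2.0)

def epic_distance(r_a, r_b, batch, samples, gamma=0.99):
    return pearson_distance(canonicalize(r_a, batch, samples, gamma),
                            canonicalize(r_b, batch, samples, gamma))
```

Because the same batch B_M is used for both rewards, shaping cancels up to a constant even at finite sample sizes, so equivalent rewards receive distance zero exactly (up to floating point).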
(a) EPIC Median. (b) NPEC Median. [Matrices of pairwise distance values omitted; they are rendered as heatmaps in the corresponding figures.]

Published as a conference paper at ICLR 2021

The relation ≡ is reflexive: R_A ≡ R_A; symmetric: R_A ≡ R_B implies R_B ≡ R_A; and transitive: (R_A ≡ R_B) ∧ (R_B ≡ R_C) implies R_A ≡ R_C.

Proof. R_A ≡ R_A: choosing λ = 1 > 0 and Φ(s) = 0, a bounded potential function, we have R_A(s, a, s') = λR_A(s, a, s') + γΦ(s') - Φ(s) for all s, s' ∈ S and a ∈ A.

Suppose R_A ≡ R_B. Then there exists some λ > 0 and a bounded potential function Φ : S → R such that R_B(s, a, s') = λR_A(s, a, s') + γΦ(s') - Φ(s) for all s, s' ∈ S and a ∈ A. Rearranging:

R_A(s, a, s') = (1/λ) R_B(s, a, s') + γ(-(1/λ)Φ(s')) - (-(1/λ)Φ(s)).

Since 1/λ > 0 and Φ'(s) = -(1/λ)Φ(s) is a bounded potential function, it follows that R_B ≡ R_A. Finally, suppose R_A ≡ R_B and R_B ≡ R_C.
Then there exist some λ_1, λ_2 > 0 and bounded potential functions Φ_1, Φ_2 : S → R such that for all s, s' ∈ S and a ∈ A:

R_B(s, a, s') = λ_1 R_A(s, a, s') + γΦ_1(s') - Φ_1(s),
R_C(s, a, s') = λ_2 R_B(s, a, s') + γΦ_2(s') - Φ_2(s).

Substituting the expression for R_B into the expression for R_C:

R_C(s, a, s') = λ_2 (λ_1 R_A(s, a, s') + γΦ_1(s') - Φ_1(s)) + γΦ_2(s') - Φ_2(s)
= λ_1 λ_2 R_A(s, a, s') + γ(λ_2 Φ_1(s') + Φ_2(s')) - (λ_2 Φ_1(s) + Φ_2(s))
= λR_A(s, a, s') + γΦ(s') - Φ(s),

Proof. Let (s, a, s') ∈ S × A × S. Then, substituting in the definition of R' and using linearity of expectation:

C_{D_S,D_A}(R')(s, a, s') = R'(s, a, s') + E[γR'(s', A, S') - R'(s, A, S') - γR'(S, A, S')]    (18)
= (R(s, a, s') + γΦ(s') - Φ(s)) + E[γR(s', A, S') + γ^2 Φ(S') - γΦ(s')] - E[R(s, A, S') + γΦ(S') - Φ(s)] - E[γR(S, A, S') + γ^2 Φ(S') - γΦ(S)]
= R(s, a, s') + E[γR(s', A, S') - R(s, A, S') - γR(S, A, S')] + γE[Φ(S)] - γE[Φ(S')]
= R(s, a, s') + E[γR(s', A, S') - R(s, A, S') - γR(S, A, S')]
= C_{D_S,D_A}(R)(s, a, s'),

where the penultimate step uses E[Φ(S')] = E[Φ(S)], since S and S' are identically distributed.

‖Φ_{D_S,D_A}(R + ν) - Φ_{D_S,D_A}(R)‖_∞ = |λ| (1 + γD_S(x)) D_A(u) D_S(x').

Proof. Observe that:

Φ_{D_S,D_A}(R)(s, a, s') = E[γR(s', A, S') - R(s, A, S') - γR(S, A, S')],

where S and S' are random variables independently sampled from D_S, and A is independently sampled from D_A. Since Φ_{D_S,D_A} is linear in the reward:

Φ_{D_S,D_A}(R + ν) - Φ_{D_S,D_A}(R) = Φ_{D_S,D_A}(ν).

Now:

‖Φ_{D_S,D_A}(R + ν) - Φ_{D_S,D_A}(R)‖_∞ = max_{s,s' ∈ S} |E[γν(s', A, S') - ν(s, A, S') - γν(S, A, S')]|
= max_{s,s' ∈ S} |λ (γI[x = s'] D_A(u) D_S(x') - I[x = s] D_A(u) D_S(x') - γD_S(x) D_A(u) D_S(x'))|    (28)
= max_{s,s' ∈ S} |λ| D_A(u) D_S(x') |γI[x = s'] - I[x = s] - γD_S(x)|    (29)
= |λ| (1 + γD_S(x)) D_A(u) D_S(x'),

where the final step follows by substituting s = x and taking s' ≠ x (possible since |S| ≥ 2).

Proof.
For a non-constant random variable V, define a standardized (zero mean and unit variance) version:

Z(V) = (V - E[V]) / sqrt(E[(V - E[V])^2]).

The Pearson correlation coefficient of random variables A and B equals the expected product of their standardized versions: ρ(A, B) = E[Z(A)Z(B)]. Let W, X, Y be random variables.

Identity. ρ(X, X) = 1, so D_ρ(X, X) = 0.

Symmetry. ρ(X, Y) = ρ(Y, X) by commutativity of multiplication, so D_ρ(X, Y) = D_ρ(Y, X).

Triangle inequality. For any random variables A, B:

E[(Z(A) - Z(B))^2] = E[Z(A)^2 - 2Z(A)Z(B) + Z(B)^2]    (33)
= E[Z(A)^2] + E[Z(B)^2] - 2E[Z(A)Z(B)]    (34)
= 2 - 2E[Z(A)Z(B)]    (35)
= 2(1 - ρ(A, B))    (36)
= 4 D_ρ(A, B)^2.

Thus D_ρ(A, B) = (1/2) sqrt(E[(Z(A) - Z(B))^2]), and the triangle inequality for D_ρ follows from the triangle inequality for the L^2 norm.

For any reward function R : S × A × S → R:

D_{L^p,D}(R, R_B) = E_{s,a,s' ∼ D}[|R(s, a, s') - R_B(s, a, s')|^p]^{1/p}
= E_{s,a,s' ∼ D}[|R(s, a, s') - (R_C(s, a, s') + γΦ(s') - Φ(s))|^p]^{1/p}
= E_{s,a,s' ∼ D}[|(R(s, a, s') - γΦ(s') + Φ(s)) - R_C(s, a, s')|^p]^{1/p}
= E_{s,a,s' ∼ D}[|f(R)(s, a, s') - R_C(s, a, s')|^p]^{1/p}
= D_{L^p,D}(f(R), R_C).    (62)

Crucially, note f(R) is a bijection on the equivalence class [R]. Now, substituting this into the expression for the NPEC premetric:

D_U^NPEC(R_A, R_B) = inf_{R ≡ R_A} D_{L^p,D}(R, R_B)
= inf_{R ≡ R_A} D_{L^p,D}(f(R), R_C)    (eq. 62)
= inf_{f(R) ≡ R_A} D_{L^p,D}(f(R), R_C)    (f a bijection on [R])
= inf_{R ≡ R_A} D_{L^p,D}(R, R_C)    (f a bijection on [R])
= D_U^NPEC(R_A, R_C).

Scalable in target. First, note that D_{L^p,D} is absolutely scalable in both arguments:

D_{L^p,D}(λR_A, λR_B) = E_{s,a,s' ∼ D}[|λR_A(s, a, s') - λR_B(s, a, s')|^p]^{1/p}
= E_{s,a,s' ∼ D}[|λ|^p |R_A(s, a, s') - R_B(s, a, s')|^p]^{1/p}    (|·| absolutely scalable)
= |λ| E_{s,a,s' ∼ D}[|R_A(s, a, s') - R_B(s, a, s')|^p]^{1/p}    (E linear)
= |λ| D_{L^p,D}(R_A, R_B).    (64)

Now, for λ > 0, applying this to D_U^NPEC:

D_U^NPEC(R_A, λR_B) = inf_{R ≡ R_A} D_{L^p,D}(R, λR_B)    (65)
= inf_{R ≡ R_A} D_{L^p,D}(λR, λR_B)    (R ≡ λR)    (66)
= inf_{R ≡ R_A} λ D_{L^p,D}(R, R_B)    (eq. 64)    (67)
= λ inf_{R ≡ R_A} D_{L^p,D}(R, R_B)    (68)
= λ D_U^NPEC(R_A, R_B).

In the case λ = 0:

D_U^NPEC(R_A, Zero) = inf_{R ≡ R_A} D_{L^p,D}(R, Zero)    (70)
= inf_{R ≡ R_A} D_{L^p,D}((1/2)R, Zero)    (R ≡ (1/2)R)    (71)
= inf_{R ≡ R_A} (1/2) D_{L^p,D}(R, Zero)    (72)
= (1/2) D_U^NPEC(R_A, Zero).    (73)

Rearranging, we have D_U^NPEC(R_A, Zero) = 0.

Bounded. Let d = D_U^NPEC(Zero, R_B). Then for any ε > 0, there exists some potential function Φ : S → R such that the L^p distance of the potential shaping R(s, a, s') = γΦ(s') - Φ(s) from R_B satisfies:

D_{L^p,D}(R, R_B) ≤ d + ε.    (76)

Let λ ∈ [0, 1]. Define:

R_λ(s, a, s') = λR_A(s, a, s') + R(s, a, s').

Now:

D_{L^p,D}(R_λ, R) = E_{s,a,s' ∼ D}[|R_λ(s, a, s') - R(s, a, s')|^p]^{1/p}    (78)
= E_{s,a,s' ∼ D}[|λR_A(s, a, s')|^p]^{1/p}    (79)
= |λ| E_{s,a,s' ∼ D}[|R_A(s, a, s')|^p]^{1/p}    (80)
= |λ| D_{L^p,D}(R_A, Zero).    (81)

Since R_A is bounded, D_{L^p,D}(R_A, Zero) must be finite, so:

lim_{λ → 0+} D_{L^p,D}(R_λ, R) = 0.

It follows that for any ε > 0 there exists some λ > 0 such that:

D_{L^p,D}(R_λ, R) ≤ ε.    (84)

Note that R_A ≡ R_λ for all λ > 0. So:

D_U^NPEC(R_A, R_B) ≤ D_{L^p,D}(R_λ, R_B)    (85)
≤ D_{L^p,D}(R_λ, R) + D_{L^p,D}(R, R_B)    (prop. A.1)
≤ ε + (d + ε)    (eq. 76 and eq. 84)    (87)
= d + 2ε.    (88)

Since ε > 0 can be made arbitrarily small, it follows that D_U^NPEC(R_A, R_B) ≤ d = D_U^NPEC(Zero, R_B), completing the proof.

Theorem 5.4. D_NPEC is a premetric on the space of bounded reward functions. Moreover, let R_A, R_A', R_B, R_B' : S × A × S → R be bounded reward functions such that R_A ≡ R_A' and R_B ≡ R_B'. Then 0 ≤ D_NPEC(R_A, R_B) = D_NPEC(R_A', R_B') ≤ 1.

Proof. We will first prove D_NPEC is a premetric, and then prove it is invariant and bounded.

Premetric

First, we will show that D NPEC is a premetric.

Respects identity: D_NPEC(R_A, R_A) = 0.

If D_U^NPEC(Zero, R_A) = 0, then D_NPEC(R_A, R_A) = 0 as required. Suppose from now on that D_U^NPEC(Zero, R_A) ≠ 0. It follows from prop. A.1 that D_{L^p,D}(R_A, R_A) = 0. Since R_A ≡ R_A, 0 is an upper bound for D_U^NPEC(R_A, R_A). By prop. A.1, D_{L^p,D} is non-negative, so 0 is also a lower bound for D_U^NPEC(R_A, R_A). Hence D_U^NPEC(R_A, R_A) = 0 and:

D_NPEC(R_A, R_A) = D_U^NPEC(R_A, R_A) / D_U^NPEC(Zero, R_A) = 0 / D_U^NPEC(Zero, R_A) = 0.

Well-defined: D_NPEC(R_A, R_B) ≥ 0.

By prop. A.1, D_{L^p,D} is non-negative, so for any reward X:

D_U^NPEC(X, R_B) = inf_{R ≡ X} D_{L^p,D}(R, R_B) ≥ 0.

In the case that D_U^NPEC(Zero, R_B) = 0, then D_NPEC(R_A, R_B) = 0, which is non-negative. From now on, suppose that D_U^NPEC(Zero, R_B) ≠ 0. The quotient of a non-negative value by a positive value is non-negative, so:

D_NPEC(R_A, R_B) = D_U^NPEC(R_A, R_B) / D_U^NPEC(Zero, R_B) ≥ 0.

Invariant and Bounded

Since R_B ≡ R_B', we have R_B' - λR_B ≡ Zero for some λ > 0. By proposition A.2, D_U^NPEC is invariant under scale-preserving ≡ in the target, and scalable in the target. That is, for any reward R:

D_U^NPEC(R, R_B') = D_U^NPEC(R, λR_B) = λ D_U^NPEC(R, R_B).    (93)

In particular, D_U^NPEC(Zero, R_B') = λ D_U^NPEC(Zero, R_B). As λ > 0, it follows that D_U^NPEC(Zero, R_B') = 0 ⟺ D_U^NPEC(Zero, R_B) = 0.

Suppose D_U^NPEC(Zero, R_B) = 0. Then D_NPEC(R, R_B) = 0 = D_NPEC(R, R_B') for any reward R, so the result trivially holds. From now on, suppose D_U^NPEC(Zero, R_B) ≠ 0. By proposition A.2, D_U^NPEC is invariant to ≡ in the source. That is, D_U^NPEC(R_A, R_B) = D_U^NPEC(R_A', R_B), so:

D_NPEC(R_A, R_B) = D_U^NPEC(R_A, R_B) / D_U^NPEC(Zero, R_B) = D_U^NPEC(R_A', R_B) / D_U^NPEC(Zero, R_B) = D_NPEC(R_A', R_B).

By eq. (93):

D_NPEC(R_A', R_B') = λ D_U^NPEC(R_A', R_B) / (λ D_U^NPEC(Zero, R_B)) = D_U^NPEC(R_A', R_B) / D_U^NPEC(Zero, R_B) = D_NPEC(R_A', R_B).

Since D_NPEC is a premetric, it is non-negative. By the boundedness property of proposition A.2, D_U^NPEC(R, R_B) ≤ D_U^NPEC(Zero, R_B), so:

D_NPEC(R_A, R_B) = D_U^NPEC(R_A, R_B) / D_U^NPEC(Zero, R_B) ≤ 1,

which completes the proof.

Note when D_{L^p,D} is a metric, then D_NPEC(X, Y) = 0 if and only if X = Y.

Proposition A.3. D_NPEC is not symmetric in the undiscounted case.

Proof. We provide a counterexample showing that D_NPEC is not symmetric. Choose the state space S to be binary, {0, 1}, and the action space A to be the singleton {0}. Choose the coverage distribution D to be uniform over the self-transitions (s, 0, s) for s ∈ S. Take γ = 1, i.e. undiscounted. Note that as the successor state is always the same as the start state, potential shaping has no effect on D_{L^p,D}, so WLOG we assume potential shaping is always zero. Now take R_A(s) = 2s and R_B(s) = 1, and take p = 1 for the L^p distance.
Observe that D_{L^p,D}(Zero, R_A) = (1/2)(|0| + |2|) = 1 and D_{L^p,D}(Zero, R_B) = (1/2)(|1| + |1|) = 1. Since potential shaping has no effect, D_U^NPEC(Zero, R) = D_{L^p,D}(Zero, R), and so D_NPEC(Zero, R_A) = 1 and D_NPEC(Zero, R_B) = 1. Now:

D_U^NPEC(R_A, R_B) = inf_{λ > 0} D_{L^p,D}(λR_A, R_B)    (97)
= inf_{λ > 0} (1/2)(|1| + |2λ - 1|)    (98)
= 1/2,

with the infimum attained at λ = 1/2. But:

D_U^NPEC(R_B, R_A) = inf_{λ > 0} D_{L^p,D}(λR_B, R_A)    (100)
= inf_{λ > 0} (1/2) f(λ)    (101)
= (1/2) inf_{λ > 0} f(λ),

where:

f(λ) = |λ| + |2 - λ|,  λ > 0.    (103)

Note that f(λ) = 2 for λ ∈ (0, 2], and f(λ) = 2λ - 2 for λ ∈ (2, ∞). So f(λ) ≥ 2 on all of its domain, thus D_U^NPEC(R_B, R_A) = 1. Consequently:

D_NPEC(R_A, R_B) = 1/2 ≠ 1 = D_NPEC(R_B, R_A),

so D_NPEC is not symmetric.
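The two infima in this counterexample are easy to verify numerically by brute-force scanning over λ (our own check, not the paper's code):

```python
import numpy as np

# S = {0, 1}, self-transitions only, gamma = 1, p = 1: shaping has no
# effect, so the infimum over equivalent rewards reduces to a scan over
# the positive scale lambda.
r_a = np.array([0.0, 2.0])   # R_A(s) = 2s evaluated at s = 0, 1
r_b = np.array([1.0, 1.0])   # R_B(s) = 1

def d_l1(x, y):
    return np.mean(np.abs(x - y))   # L^1 distance under the uniform D

lams = np.linspace(1e-4, 10.0, 100001)
d_ab = min(d_l1(l * r_a, r_b) for l in lams)   # attained near lambda = 1/2
d_ba = min(d_l1(l * r_b, r_a) for l in lams)   # flat at 1 on (0, 2]
# d_ab is approximately 0.5 while d_ba is 1.0, matching the asymmetry
# D_NPEC(R_A, R_B) = 1/2 versus D_NPEC(R_B, R_A) = 1.
```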

A.4 FULL NORMALIZATION VARIANT OF EPIC

Previously, we used the Pearson distance D_ρ to compare the canonicalized rewards. Pearson distance is naturally invariant to scaling. An alternative is to explicitly normalize the canonicalized rewards, and then compare them using any metric over functions.

Definition A.4 (Normalized Reward). Let R : S × A × S → R be a bounded reward function, and let ‖·‖ be a norm on the vector space of reward functions over the real field. Then the normalized R is:

R^N(s, a, s') = R(s, a, s') / ‖R‖.

Note that (λR)^N = R^N for any λ > 0, as norms are absolutely homogeneous. We say a reward is standardized if it has been canonicalized and then normalized.

Definition A.5 (Standardized Reward). Let R : S × A × S → R be a bounded reward function. Then the standardized R is:

R^S = (C_{D_S,D_A}(R))^N.

Now, we can define a pseudometric based on the direct distance between the standardized rewards.

Definition A.6 (Direct Distance Standardized Reward). Let D be some coverage distribution over transitions s -a-> s'. Let D_S and D_A be some distributions over states S and actions A respectively. Let S, A, S' be random variables jointly sampled from D. The Direct Distance Standardized Reward pseudometric between two reward functions R_A and R_B is the L^p distance between their standardized versions over D:

D_DDSR(R_A, R_B) = (1/2) D_{L^p,D}(R_A^S(S, A, S'), R_B^S(S, A, S')),

where the normalization step R^N uses the L^p norm. For brevity, we omit the proof that D_DDSR is a pseudometric; this follows from D_{L^p,D} being a pseudometric, in a similar fashion to theorem 4.7. It is additionally invariant on equivalence classes, similarly to EPIC.

Theorem A.7. Let R_A, R_A', R_B and R_B' be reward functions mapping from transitions S × A × S to real numbers R such that R_A ≡ R_A' and R_B ≡ R_B'. Then:

0 ≤ D_DDSR(R_A, R_B) = D_DDSR(R_A', R_B') ≤ 1.

Proof. The invariance under the equivalence class follows from R^S being invariant to potential shaping and scale in R.
The non-negativity follows from D_{L^p,D} being a pseudometric. The upper bound follows from the rewards being normalized to norm 1 and the triangle inequality.

Since both DDSR and EPIC are pseudometrics and invariant on equivalent rewards, it is interesting to consider the connection between them. In fact, under the L^2 norm, DDSR recovers EPIC. First, we will show that canonical shaping centers the reward functions.

D_DDSR(R_A, R_B) = (1/2) ‖R_A^S - R_B^S‖,

where the penultimate step follows since A is identically distributed to U, and S' is identically distributed to X and therefore to X'.

Theorem A.9. D_DDSR with p = 2 is equivalent to D_EPIC. Let R_A and R_B be reward functions mapping from transitions S × A × S to real numbers R. Then:

D_DDSR(R_A, R_B) = D_EPIC(R_A, R_B).

Lemma A.11. Let M be an MDP\R with finite state and action spaces S and A. Let R_A, R_B : S × A × S → R be rewards. Let π*_A and π*_B be policies optimal for rewards R_A and R_B in M. Let D_π(t, s_t, a_t, s_{t+1}) denote the distribution over transitions that policy π induces in M at time step t. Let D(s, a, s') be the (stationary) coverage distribution over transitions S × A × S used to compute D_EPIC. Suppose that there exists some K > 0 such that K D(s_t, a_t, s_{t+1}) ≥ D_π(t, s_t, a_t, s_{t+1}) for all time steps t ∈ N, all triples (s_t, a_t, s_{t+1}) ∈ S × A × S and all policies π ∈ {π*_A, π*_B}. Then the regret under R_A from executing π*_B, optimal for R_B, instead of π*_A is at most:

G_{R_A}(π*_A) - G_{R_A}(π*_B) ≤ (2K / (1 - γ)) D_{L^1,D}(R_A, R_B).

Proof. Noting G_{R_A}(π) is maximized when π = π*_A, it is immediate that:

G_{R_A}(π*_A) - G_{R_A}(π*_B) = |G_{R_A}(π*_A) - G_{R_A}(π*_B)|    (136)
= |(G_{R_A}(π*_A) - G_{R_B}(π*_B)) + (G_{R_B}(π*_B) - G_{R_A}(π*_B))|    (137)
≤ |G_{R_A}(π*_A) - G_{R_B}(π*_B)| + |G_{R_B}(π*_B) - G_{R_A}(π*_B)|.    (138)

We will show that both these terms are bounded above by (K / (1 - γ)) D_{L^1,D}(R_A, R_B), from which the result follows.
First, we will show that for any policy π ∈ {π*_A, π*_B}:

|G_{R_A}(π) - G_{R_B}(π)| ≤ (K / (1 - γ)) D_{L^1,D}(R_A, R_B).    (139)

Let T be the horizon of M. This may be infinite (T = ∞) when γ < 1; note that since S × A × S is bounded, so are R_A and R_B, and therefore the discounted infinite-horizon returns G_{R_A}(π), G_{R_B}(π) converge (as do their differences). Writing τ = (s_0, a_0, s_1, a_1, ...), we have for any policy π:

|G_{R_A}(π) - G_{R_B}(π)| ≤ K Σ_{t=0}^{T} γ^t D_{L^1,D}(R_A, R_B)    (146)
≤ (K / (1 - γ)) D_{L^1,D}(R_A, R_B),

as required. In particular, substituting π = π*_B gives:

|G_{R_B}(π*_B) - G_{R_A}(π*_B)| = |G_{R_A}(π*_B) - G_{R_B}(π*_B)| ≤ (K / (1 - γ)) D_{L^1,D}(R_A, R_B).

A.6 LIPSCHITZ REWARD FUNCTIONS

In this section, we generalize the previous results to MDPs with continuous state and action spaces. The challenge is that even though the spaces may be continuous, the distribution D_{π*} induced by an optimal policy π* may only have support on some measure-zero set of transitions B. However, the expectation over a continuous distribution D is unaffected by the reward on any measure-zero subset of points. Accordingly, the reward can be varied arbitrarily on the transitions B (causing arbitrarily small or large regret) while leaving the EPIC distance fixed.

To rule out this pathological case, we assume the rewards are Lipschitz smooth. This guarantees that if the expected difference between rewards is small on a given region, then all points in this region have bounded reward difference. We start by defining a relaxation of the Wasserstein distance, W_α, in definition A.13. In lemma A.14 we then bound the expected value under a distribution µ in terms of the expected value under an alternative distribution ν plus W_α(µ, ν). Next, in lemma A.15 we bound the regret in terms of the L^1 distance between the rewards plus W_α; this is analogous to lemma A.11 in the finite case. Finally, in theorem A.16 we use the previous results to bound the regret in terms of the EPIC distance plus W_α.
Definition A.13. Let S be some set and let µ, ν be probability measures on S with finite first moment. We define the relaxed Wasserstein distance between µ and ν by:

W_α(µ, ν) = inf_{p ∈ Γ_α(µ,ν)} ∫ ‖x - y‖ dp(x, y),

where Γ_α(µ, ν) is the set of probability measures on S × S satisfying, for all x, y ∈ S, the marginal constraints in eqs. (177) and (178) below. Note that W_1 is equal to the (unrelaxed) Wasserstein distance (in the ℓ^1 norm).
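On a finite set, W_α is a small linear program. The sketch below (our own illustration; the discrete distributions and point locations are arbitrary assumptions) minimizes transport cost subject to an exact row-marginal constraint and a relaxed column-marginal constraint:

```python
import numpy as np
from scipy.optimize import linprog

def relaxed_wasserstein(mu, nu, points, alpha=1.0):
    """W_alpha(mu, nu) for distributions on points of the real line:
    minimize sum |x - y| p(x, y) subject to
    sum_y p(x, y) = mu(x)  and  sum_x p(x, y) <= alpha * nu(y)."""
    n = len(points)
    cost = np.abs(points[:, None] - points[None, :]).ravel()
    a_eq = np.kron(np.eye(n), np.ones(n))   # row sums of p equal mu
    a_ub = np.kron(np.ones(n), np.eye(n))   # column sums at most alpha*nu
    res = linprog(cost, A_ub=a_ub, b_ub=alpha * np.asarray(nu),
                  A_eq=a_eq, b_eq=mu, bounds=(0, None))
    return res.fun

pts = np.array([0.0, 1.0])
w1 = relaxed_wasserstein([1.0, 0.0], [0.0, 1.0], pts, alpha=1.0)
w2 = relaxed_wasserstein([1.0, 0.0], [0.5, 0.5], pts, alpha=2.0)
```

With α = 1 all mass must move from 0 to 1, so `w1` recovers the ordinary Wasserstein distance of 1; with α = 2 the relaxed column constraint lets all mass stay in place, so `w2` drops to 0.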



† In the finite-horizon case, there is also a term γ^T Φ(s_T), where s_T is the fixed terminal state. Since s_T is fixed, it also cancels in eq. 158. This term can be neglected in the discounted infinite-horizon case, as γ^T Φ(s_T) → 0 as T → ∞ for any bounded Φ.



Finally, the E[γR(S, A, S')] term centers the reward, canceling constant shifts.

Proposition 4.2 (The Canonically Shaped Reward is Invariant to Shaping). Let R : S × A × S → R be a reward function and Φ : S → R a potential function. Let γ ∈ [0, 1] be a discount rate, and let D_S ∈ ∆(S) and D_A ∈ ∆(A) be distributions over states and actions. Let R' denote R shaped by Φ: R'(s, a, s') = R(s, a, s') + γΦ(s') - Φ(s). Then the canonically shaped R and R' are equal: C_{D_S,D_A}(R') = C_{D_S,D_A}(R).

Proposition 4.2 holds for arbitrary distributions D_S and D_A. However, in the following proposition we show that the potential shaping applied by the canonicalization C_{D_S,D_A}(R) is more influenced by perturbations of R at transitions (s, a, s') with high joint probability. This suggests choosing D_S and D_A to have broad support, making C_{D_S,D_A}(R) more robust to perturbations of any given transition.

Proposition 4.3. Let S and A be finite, with |S| ≥ 2. Let D_S ∈ ∆(S) and D_A ∈ ∆(A).

Definition 4.6 (Equivalent-Policy Invariant Comparison (EPIC) pseudometric). Let D be some coverage distribution over transitions s -a-> s'. Let S, A, S' be random variables jointly sampled from D. Let D_S and D_A be some distributions over states S and actions A respectively. The Equivalent-Policy Invariant Comparison (EPIC) distance between reward functions R_A and R_B is the Pearson distance between their canonically shaped versions:

D_EPIC(R_A, R_B) = D_ρ(C_{D_S,D_A}(R_A)(S, A, S'), C_{D_S,D_A}(R_B)(S, A, S')).

Figure 1: Heatmaps of four reward functions for a 3 × 3 gridworld. Sparse and Dense look different but are actually equivalent with D EPIC (Sparse, Dense) = 0. By contrast, the optimal policies for Path and Cliff are the same if the gridworld is deterministic but different if it is "slippery". EPIC recognizes this difference with D EPIC (Path, Cliff) = 0.27. Key: Reward R(s, s ) for moving from s to s is given by the triangular wedge in cell s that is adjacent to cell s . R(s, s) is given by the central circle in cell s. Optimal action(s) (deterministic, infinite horizon, discount γ = 0.99) have bold labels. See Figure A.2 for the distances between all reward pairs.

Figure 2: Approximate distances between hand-designed reward functions in PointMass, where the agent moves on a line trying to reach the origin. EPIC correctly assigns 0 distance between equivalent rewards such as (D , S ) while D NPEC (D , S ) = 0.58 and D ERC (D , S ) = 0.56. The coverage distribution D is sampled from rollouts of a policy π uni taking actions uniformly at random. Key: The agent has position x ∈ R, velocity ẋ ∈ R and can accelerate ẍ ∈ R, producing future position x ∈ R.

Figure A.6 shows reward distance and policy regret during reward model training. The lines all closely track each other, showing that the distance to GT is highly correlated with policy regret for intermediate reward checkpoints as well as at convergence. Regress and Pref converge quickly to low distance and low regret, while AIRL SO and AIRL SA are slower and more unstable.

and the distances are reported in Figure A.2. We report the approximate distances and confidence intervals between reward functions in the other environments in Figures A.3, A.4 and A.5.

Figure A.1: Heatmaps of reward functions R(s, a, s ) for a 3×3 deterministic gridworld. R(s, stay, s) is given by the central circle in cell s. R(s, a, s ) is given by the triangular wedge in cell s adjacent to cell s in direction a. Optimal action(s) (for infinite horizon, discount γ = 0.99) have bold labels against a hatched background. See Figure A.2 for the distance between all reward pairs.

Figure A.2: Distances (EPIC, top; NPEC, bottom) between hand-designed reward functions for the 3 × 3 deterministic Gridworld environment. EPIC and NPEC produce similar results, but EPIC more clearly discriminates between rewards whereas NPEC distance tends to saturate. For example, the NPEC distance from Penalty to other rewards lies in the very narrow [0.98, 1.0] range, whereas EPIC uses the wider [0.66, 1.0] range. See Figure A.1 for definitions of each reward. Distances are computed using tabular algorithms. We do not report confidence intervals since these algorithms are deterministic and exact up to floating point error.

Figure A.4: Approximate distances between hand-designed reward functions in HalfCheetah. The coverage distribution D is sampled from rollouts of a policy π_uni taking actions uniformly at random. Key: is a reward proportional to the change in center of mass; moving forward is rewarded when to the right, and moving backward is rewarded when to the left. quadratic control penalty, no control penalty. Confidence Interval (CI): 95% CI computed by bootstrapping over 10 000 samples.

Figure A.5: Approximate distances between hand-designed reward functions in Hopper. The coverage distribution D is sampled from rollouts of a policy π_uni taking actions uniformly at random. Key: is a reward proportional to the change in center of mass and is the backflip reward defined in Amodei et al. [2, footnote]. Moving forward is rewarded when or is to the right, and moving backward is rewarded when or is to the left. quadratic control penalty, no control penalty. Confidence Interval (CI): 95% CI computed by bootstrapping over 10 000 samples.

Comparisons of Regress using all distance algorithms. Comparisons of Pref using all distance algorithms. Comparisons of AIRL SO using all distance algorithms. Comparisons of AIRL SA using all distance algorithms.

Figure A.6: Distance of reward checkpoints from the ground-truth in PointMaze and policy regret for reward checkpoints during reward model training. Each point evaluates a reward function checkpoint from a single seed. EPIC, NPEC and ERC distance use the Mixture distribution. Regret is computed by running RL on the checkpoint. The shaded region represents the bootstrapped 95% confidence interval for the distance or regret at that checkpoint, calculated following Section A.1.3.

Comparisons using ERC on all reward models. Comparisons using Episode Return on all reward models.

Figure A.7: Distance of reward checkpoints from the ground-truth in PointMaze and policy regret for reward checkpoints during reward model training. Each point evaluates a reward function checkpoint from a single seed. EPIC, NPEC and ERC distance use the Mixture distribution. Regret is computed by running RL on the checkpoint. The shaded region represents the bootstrapped 95% confidence interval for the distance or regret at that checkpoint, calculated following Section A.1.3.

where λ = λ_1 λ_2 > 0 and Φ(s) = λ_2 Φ_1(s) + Φ_2(s) is bounded. Thus R_A ≡ R_C.

A.3.2 EQUIVALENT-POLICY INVARIANT COMPARISON (EPIC) PSEUDOMETRIC

Proposition 4.2 (The Canonically Shaped Reward is Invariant to Shaping). Let R : S × A × S → R be a reward function and Φ : S → R a potential function. Let γ ∈ [0, 1] be a discount rate, and let D_S ∈ ∆(S) and D_A ∈ ∆(A) be distributions over states and actions. Let R' denote R shaped by Φ: R'(s, a, s') = R(s, a, s') + γΦ(s') - Φ(s). Then the canonically shaped R and R' are equal: C_{D_S,D_A}(R') = C_{D_S,D_A}(R).

Let S and A be finite, with |S| ≥ 2. Let D_S ∈ Δ(S) and D_A ∈ Δ(A). Let R, ν : S × A × S → ℝ be reward functions, with ν(s, a, s′) = λ·I[(s, a, s′) = (x, u, x′)], where λ ∈ ℝ, x, x′ ∈ S and u ∈ A. Let Φ_{D_S,D_A}(R)(s, a, s′) = C_{D_S,D_A}(R)(s, a, s′) − R(s, a, s′). Then:

The Pearson distance D ρ is a pseudometric. Moreover, let a, b ∈ (0, ∞), c, d ∈ R and X, Y be random variables. Then it follows that 0 ≤ D ρ (aX + c, bY + d) = D ρ (X, Y ) ≤ 1.
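The boundedness and positive affine invariance are easy to sanity-check with a sample-based estimate. Below is a minimal sketch using the definition D_ρ(X, Y) = √((1 − ρ(X, Y))/2); the data here are synthetic and purely illustrative:

```python
import numpy as np

def pearson_distance(x, y):
    """Sample estimate of D_rho(X, Y) = sqrt((1 - rho(X, Y)) / 2)."""
    rho = np.corrcoef(x, y)[0, 1]
    # max() guards against tiny negative values from floating-point error.
    return np.sqrt(max(0.0, 0.5 * (1.0 - rho)))

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = x + 0.5 * rng.normal(size=1000)

d = pearson_distance(x, y)
assert 0.0 <= d <= 1.0                                          # bounded in [0, 1]
assert np.isclose(pearson_distance(3 * x + 2, 0.5 * y - 1), d)  # affine invariant
assert pearson_distance(x, x) < 1e-6                            # identity of pseudometric
```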

Lemma A.8 (The Canonically Shaped Reward is Mean Zero). Let R be a reward function mapping from transitions S × A × S to real numbers ℝ. Then:

E[C_{D_S,D_A}(R)(S, A, S′)] = 0. (114)

Proof. Let X, U and X′ be random variables that are independent of S, A and S′ but identically distributed. Then:

E[C_{D_S,D_A}(R)(S, A, S′)] (115)
= E[R(S, A, S′) + γR(S′, U, X′) − R(S, U, X′) − γR(X, U, X′)] (116)
= E[R(S, A, S′)] + γE[R(S′, U, X′)] − E[R(S, U, X′)] − γE[R(X, U, X′)] (117)
= E[R(S, U, X′)] + γE[R(X, U, X′)] − E[R(S, U, X′)] − γE[R(X, U, X′)] (118)
= 0,

where eq. 118 uses that (S, A, S′) is identically distributed to (S, U, X′), and (S′, U, X′) to (X, U, X′), since all of these random variables are mutually independent with matching marginals.
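The mean-zero property also holds for the Monte Carlo estimator used in practice: sampling all variables independently from the coverage marginals drives the empirical mean of the canonicalized samples towards zero. A small sketch with a made-up continuous reward (the reward function and the uniform distributions are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def reward(s, a, s_next):
    # Hypothetical smooth reward on continuous states and actions.
    return np.sin(s) + 0.3 * a * s_next

gamma, n = 0.95, 20000
# S, A, S' ~ D_S x D_A x D_S (all uniform on [-1, 1] here), sampled independently,
# plus independent copies X, U, X' for the canonicalizing expectations.
S, A, Sp = (rng.uniform(-1, 1, n) for _ in range(3))
X, U, Xp = (rng.uniform(-1, 1, n) for _ in range(3))

canon = (reward(S, A, Sp)
         + gamma * reward(Sp, U, Xp)
         - reward(S, U, Xp)
         - gamma * reward(X, U, Xp))

# E[C(R)(S, A, S')] = 0, so the empirical mean is near zero.
assert abs(canon.mean()) < 0.05
```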

Let Γ_α(µ, ν) denote the set of joint probability densities p on S × S whose first marginal is µ and whose second marginal is bounded above by αν:

∫ p(x, y) dy = µ(x), (177)
∫ p(x, y) dx ≤ αν(y). (178)

Let S be some set and let µ, ν be probability measures on S. Let f : S → ℝ be an L-Lipschitz function on the ℓ₁ norm ‖·‖₁. Then, for any α ≥ 1:

E_{X∼µ}[|f(X)|] ≤ αE_{Y∼ν}[|f(Y)|] + L·W_α(µ, ν). (179)

Proof. Let p ∈ Γ_α(µ, ν). Then:

E_{X∼µ}[|f(X)|] = ∫ |f(x)| dµ(x)   definition of E (180)
= ∫ |f(x)| dp(x, y)   µ is a marginal of p (181)
≤ ∫ (|f(y)| + L‖x − y‖₁) dp(x, y)   f L-Lipschitz (182)
= ∫ |f(y)| dp(x, y) + L ∫ ‖x − y‖₁ dp(x, y) (183)
= ∫ |f(y)| p(x, y) dx dy + L ∫ ‖x − y‖₁ dp(x, y) (184)
≤ ∫ |f(y)| αν(y) dy + L ∫ ‖x − y‖₁ dp(x, y)   eq. 178 (185)
= αE_{Y∼ν}[|f(Y)|] + L ∫ ‖x − y‖₁ dp(x, y)   definition of E. (186)

Since this holds for all choices of p, we can take the infimum of both sides, giving:

E_{X∼µ}[|f(X)|] ≤ αE_{Y∼ν}[|f(Y)|] + L inf_{p∈Γ_α(µ,ν)} ∫ ‖x − y‖₁ dp(x, y) (187)
= αE_{Y∼ν}[|f(Y)|] + L·W_α(µ, ν).

via a Lipschitz assumption, with the relaxed Wasserstein distance W_α replacing the constant K. Importantly, the returns of π*_A and π*_B converge as D_EPIC(R_A, R_B) → 0 in both cases, regardless of which reward function is used for evaluation.

Low reward distance from the ground-truth (GT) in PointMaze-Train predicts high policy return even in the unseen task PointMaze-Test. EPIC distance is robust to the choice of coverage distribution D, with similar values across columns, while ERC and especially NPEC are sensitive to D. Center: approximate distances (1000× scale) of reward functions from GT. The coverage distribution D is computed from rollouts in PointMaze-Train of: a uniform random policy π_uni, an expert π* and a Mixture of these policies. D_S and D_A are computed by marginalizing D. Right: mean GT return over 9 seeds of RL training on the reward in PointMaze-{Train,Test}, and returns for AIRL's generator policy. Confidence Intervals: see Table A.7.

Table A.1: Summary of hyperparameters and distributions used in experiments. The uniform random coverage distribution D_unif samples states and actions uniformly at random, and samples the next state from the transition dynamics. Random policy π_uni takes uniform random actions. The synthetic expert policy π* was trained with PPO on the ground-truth reward. Mixture samples actions from either π_uni or π*, switching between them at each time step with probability 0.05. Warmstart Size is the size of the dataset used to compute the initialization parameters described in section A.1.2.

Table A.4: Hyperparameters for preference comparison used in our implementation of Christiano et al. [6].

Table A.5: Hyperparameters for regression used in our implementation of Christiano et al. [6, target method from section 3.3].

Table A.6: Time and resources taken by different metrics to perform 25 distance comparisons on PointMass, and the confidence interval widths obtained (smaller is better). Methods EPIC, NPEC and ERC correspond to Figures 2(a), (b) and (c) respectively. EPIC Quick is an abbreviated version with fewer samples. RL (PPO) is estimated from the time taken using PPO to train a single policy (16m:23s) until convergence (10^6 time steps). EPIC samples N_M + N_V time steps from the environment and performs N_M·N_V reward queries. In EPIC Quick, N_M = N_V = 4096; in EPIC, N_M = N_V = 302768. Other methods query the reward once per environment time step.

Table A.7: Approximate distances of reward functions from the ground-truth (GT). We report the 95% bootstrapped lower and upper bounds, the mean, and a 95% bound on the relative error from the mean. Distances (1000× scale) use coverage distribution D from rollouts in the PointMaze-Train environment of: a uniform random policy π_uni, an expert π* and a Mixture of these policies. D_S and D_A are computed by marginalizing D. (a) 95% lower bound D_LOW of the approximate distance. The true distance lies within ±D_REL% of the sample mean in Table A.7b with 95% probability.

Table A.8: Approximate distances of reward functions from the ground-truth (GT) under pathological coverage distributions. We report the 95% bootstrapped lower and upper bounds, the mean, and a 95% bound on the relative error from the mean. Distances (1000× scale) use four different coverage distributions D. σ independently samples states, actions and next states from the marginal distributions of rollouts of the uniform random policy π_uni in the PointMaze-Train environment. Ind independently samples the components of states and next states from N(0, 1), and actions from U[−1, 1]. Jail consists of rollouts of π_uni restricted to a small 0.09 × 0.09 "jail" square that excludes the goal state 0.5 distance away. π_bad are rollouts in PointMaze-Train of a policy that goes to the corner opposite the goal state. σ and Ind are not supported by ERC since they do not produce complete episodes. (a) 95% lower bound D_LOW of the approximate distance; results are the same as Table 2. The true distance lies within ±D_REL% of the sample mean in Table A.8b with 95% probability.

|G_{R_A}(π) − G_{R_B}(π)|
= |Σ_{t=0}^∞ γ^t E_{s_t,a_t,s_{t+1}∼D_π}[R_A(s_t, a_t, s_{t+1}) − R_B(s_t, a_t, s_{t+1})]|
≤ Σ_{t=0}^∞ γ^t E_{s_t,a_t,s_{t+1}∼D_π}[|R_A(s_t, a_t, s_{t+1}) − R_B(s_t, a_t, s_{t+1})|]
= Σ_{t=0}^∞ γ^t Σ_{(s_t,a_t,s_{t+1})∈S×A×S} D_π(t, s_t, a_t, s_{t+1}) |R_A(s_t, a_t, s_{t+1}) − R_B(s_t, a_t, s_{t+1})|. (144)

Let π ∈ {π*_A, π*_B}. By assumption, D_π(t, s_t, a_t, s_{t+1}) ≤ K·D(s_t, a_t, s_{t+1}), so:

|G_{R_A}(π) − G_{R_B}(π)| ≤ K Σ_{t=0}^∞ γ^t Σ_{(s_t,a_t,s_{t+1})∈S×A×S} D(s_t, a_t, s_{t+1}) |R_A(s_t, a_t, s_{t+1}) − R_B(s_t, a_t, s_{t+1})|. (145)

ACKNOWLEDGEMENTS

Thanks to Sam Toyer, Rohin Shah, Eric Langlois, Siddharth Reddy and Stuart Armstrong for helpful discussions; to Miljan Martic for code-review; and to David Krueger, Matthew Rahtz, Rachel Freedman, Cody Wild, Alyssa Dayan, Adria Garriga, Jon Uesato, Zac Kenton and Alden Hung for feedback on drafts. This work was supported by Open Philanthropy and the Leverhulme Trust.

APPENDIX

A.2 for the policy training hyperparameters. We repeat each computation over multiple seeds, and then compute a bootstrapped confidence interval for each seed. We use 30 seeds for EPIC, but only 9 seeds for computing Episode Return and 3 seeds for NPEC, due to their greater computational requirements. In ERC, computing the distance is very fast, so we instead apply bootstrapping to the collected episodes, computing the ERC distance for each bootstrapped episode sample.

A.2.1 HYPERPARAMETERS FOR APPROXIMATE DISTANCES

Table A.1 summarizes the hyperparameters and distributions used to compute the distances between reward functions. Most parameters are the same across all environments. We use a coverage distribution of uniform random transitions D_unif in the simple GridWorld environment with known deterministic dynamics. In other environments, the coverage distribution is sampled from rollouts of a policy. We use a random policy π_uni for PointMass, HalfCheetah and Hopper in the hand-designed reward experiments (section 6.1). In PointMaze, we compare three coverage distributions (section 6.2) induced by rollouts of π_uni, an expert policy π* and a Mixture of the two policies, which samples actions from either π_uni or π*, switching between them with probability 0.05 per time step.

Table A.2: Hyperparameters for proximal policy optimisation (PPO) [19]. We used the implementation and default hyperparameters from Hill et al. [9]. PPO was used to train expert policies on the ground-truth reward and to optimize learned reward functions for evaluation.


is an inner product over ℝ, it follows by the Cauchy-Schwarz inequality that Cov(X, Y)² ≤ Var(X)·Var(Y). Taking the square root of both sides: |Cov(X, Y)| ≤ σ(X)·σ(Y), so |ρ(X, Y)| ≤ 1, as required.

Positive Affine Invariant and Bounded. D_ρ(aX + c, bY + d) = D_ρ(X, Y) follows since the Pearson correlation is invariant to positive affine transformations of its arguments, and 0 ≤ D_ρ(X, Y) ≤ 1 follows from −1 ≤ ρ(X, Y) ≤ 1.

Theorem 4.7. The Equivalent-Policy Invariant Comparison distance is a pseudometric.

Proof. The result follows from D_ρ being a pseudometric. Let R_A, R_B and R_C be reward functions mapping from transitions S × A × S to real numbers ℝ.

Identity. We have D_EPIC(R_A, R_A) = 0, since D_ρ(X, X) = 0.

Symmetry. We have D_EPIC(R_A, R_B) = D_EPIC(R_B, R_A), since D_ρ(X, Y) = D_ρ(Y, X).

Triangle Inequality. We have D_EPIC(R_A, R_C) ≤ D_EPIC(R_A, R_B) + D_EPIC(R_B, R_C), since D_ρ satisfies the triangle inequality.

Proof. Since D_EPIC is defined in terms of D_ρ, the bounds 0 ≤ D_EPIC(R_A, R_B) and D_EPIC(R_A, R_B) ≤ 1 are immediate from the bounds in lemma 4.5. Since R′_A ≡ R_A and R′_B ≡ R_B, we can write, for X ∈ {A, B}:

R′_X(s, a, s′) = λ_X R_X(s, a, s′) + γΦ_X(s′) − Φ_X(s),

for some scaling factor λ_X > 0 and potential function Φ_X : S → ℝ. By proposition 4.2, the canonically shaped reward is invariant to the potential shaping. Moreover, since C_{D_S,D_A}(R) is defined as an expectation over R and expectations are linear:

C_{D_S,D_A}(R′_X) = λ_X C_{D_S,D_A}(R_X).

Unrolling the definition of D_EPIC and applying this result together with the positive affine invariance of D_ρ (eqs. 55 and 56) gives D_EPIC(R′_A, R′_B) = D_EPIC(R_A, R_B).

Proof. (1) D_{L^p,D} is a metric in the L^p space, since ‖·‖_{L^p} is a norm on the L^p space, and d(x, y) = ‖x − y‖ is always a metric for a norm ‖·‖. (2) As f = g at all points implies f = g almost everywhere, certainly D_{L^p,D}(R, R) = 0. Symmetry and the triangle inequality do not depend on identity, so they still hold.

Proof. We will show each case in turn.

Invariance under ≡ in source. If R_A ≡ R_B, then:

Let the standardized reward R_S = C_{D_S,D_A}(R)/‖C_{D_S,D_A}(R)‖₂, where the L² norm is taken under the joint distribution D_S × D_A × D_S (treating the reward as a random variable over transitions). Then for any reward functions R_A and R_B:

D_EPIC(R_A, R_B) = ½ ‖R_S_A − R_S_B‖₂.

Proof. Recall from the proof of lemma 4.5 that D_ρ(X, Y) = ½ ‖Z(X) − Z(Y)‖₂, where ‖·‖₂ is the L² norm (treating the random variables as functions on a measure space) and Z(U) denotes U centered (zero mean) and rescaled (unit variance). By lemma A.8, the canonically shaped reward functions are already centered under the joint distribution D_S × D_A × D_S, and normalization by the L² norm also ensures they have unit variance. Consequently:

D_EPIC(R_A, R_B) = D_ρ(C_{D_S,D_A}(R_A)(S, A, S′), C_{D_S,D_A}(R_B)(S, A, S′)) = ½ ‖R_S_A − R_S_B‖₂,

completing the proof.
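This identity between the Pearson-based definition and the L² distance on standardized rewards can be verified numerically. The sketch below uses two arrays of samples as stand-ins for canonicalized rewards C(R_A), C(R_B) evaluated on draws from D_S × D_A × D_S; the data are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
# Stand-ins for canonicalized reward samples under D_S x D_A x D_S.
ca = rng.normal(size=n)
cb = 0.6 * ca + rng.normal(size=n)
ca -= ca.mean()  # lemma A.8: canonicalized rewards are mean zero
cb -= cb.mean()

# Standardize: R_S = C(R) / ||C(R)||_2 (norm under the sampling distribution).
rsa = ca / np.sqrt(np.mean(ca ** 2))
rsb = cb / np.sqrt(np.mean(cb ** 2))

d_pearson = np.sqrt(0.5 * (1.0 - np.corrcoef(ca, cb)[0, 1]))
d_l2 = 0.5 * np.sqrt(np.mean((rsa - rsb) ** 2))
# The Pearson distance equals half the L2 distance on standardized rewards.
assert np.isclose(d_pearson, d_l2)
```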

A.5 REGRET BOUND

In this section, we derive an upper bound on the regret in terms of the EPIC distance. Specifically, given two reward functions R_A and R_B with optimal policies π*_A and π*_B, we show that the regret (under reward R_A) of executing π*_B instead of π*_A is bounded by a function of D_EPIC(R_A, R_B). First, in section A.5.1 we derive a bound for MDPs with finite state and action spaces. In section A.6 we then present an alternative bound for MDPs with arbitrary state and action spaces and Lipschitz reward functions. Finally, in section A.7 we show that in both cases the regret tends to 0 as D_EPIC(R_A, R_B) → 0.
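To make the statement concrete, the sketch below builds a small random tabular MDP, computes an EPIC-style distance between two reward tables, and measures the regret from optimizing the wrong reward. It is a self-contained illustration under simplifying assumptions (exact expectations over a uniform coverage distribution, value iteration in place of RL training), not the paper's implementation:

```python
import numpy as np

def canonicalize(R, Ds, Da, gamma):
    # Tabular canonically shaped reward C_{D_S,D_A}(R).
    m = np.einsum("saz,a,z->s", R, Da, Ds)
    return R + gamma * m[None, None, :] - m[:, None, None] - gamma * (m @ Ds)

def epic_distance(Ra, Rb, Ds, Da, gamma):
    # Pearson distance between canonicalized rewards under D_S x D_A x D_S.
    w = np.einsum("s,a,z->saz", Ds, Da, Ds).ravel()
    ca = canonicalize(Ra, Ds, Da, gamma).ravel()
    cb = canonicalize(Rb, Ds, Da, gamma).ravel()
    ca, cb = ca - w @ ca, cb - w @ cb
    rho = (w @ (ca * cb)) / np.sqrt((w @ ca**2) * (w @ cb**2))
    return np.sqrt(max(0.0, 0.5 * (1.0 - rho)))

def value_iteration(R, T, gamma, iters=500):
    # Returns optimal state values and a greedy (optimal) deterministic policy.
    q = np.zeros(T.shape[:2])
    for _ in range(iters):
        q = np.einsum("saz,saz->sa", T, R + gamma * q.max(axis=1)[None, None, :])
    return q.max(axis=1), q.argmax(axis=1)

def policy_value(R, T, gamma, policy, iters=500):
    # Evaluates a deterministic policy under reward R.
    idx = np.arange(T.shape[0])
    v = np.zeros(T.shape[0])
    for _ in range(iters):
        v = np.einsum("sz,sz->s", T[idx, policy], R[idx, policy] + gamma * v[None, :])
    return v

rng = np.random.default_rng(3)
nS, nA, gamma = 6, 3, 0.9
T = rng.dirichlet(np.ones(nS), size=(nS, nA))  # random dynamics T[s, a, s']
Ra = rng.normal(size=(nS, nA, nS))
phi = rng.normal(size=nS)
# Rb is equivalent to Ra: positive rescaling plus potential shaping.
Rb = 2.0 * (Ra + gamma * phi[None, None, :] - phi[:, None, None])
Ds, Da = np.full(nS, 1 / nS), np.full(nA, 1 / nA)

d = epic_distance(Ra, Rb, Ds, Da, gamma)
v_opt, _ = value_iteration(Ra, T, gamma)
_, pi_b = value_iteration(Rb, T, gamma)
regret = v_opt.mean() - policy_value(Ra, T, gamma, pi_b).mean()
# Equivalent rewards: zero EPIC distance and zero regret (up to numerics).
assert d < 1e-4 and abs(regret) < 1e-6
```

Changing `Rb` to an unrelated random reward typically yields both a large EPIC distance and positive regret, matching the qualitative prediction of the bound.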

A.5.1 DISCRETE MDPS

We start in lemma A.10 by showing that the L² distance upper bounds the L¹ distance. Next, in lemma A.11 we show regret is bounded by the L¹ distance between reward functions, using an argument similar to [21]. Then in lemma A.12 we relate regret bounds for standardized rewards R_S to the original reward R. Finally, in theorem 4.9 we use section A.4 to express D_EPIC in terms of the L² distance on standardized rewards, deriving a bound on regret in terms of the EPIC distance.

Lemma A.10. Let (Ω, F, P) be a probability space and f : Ω → ℝ a measurable function whose absolute value raised to the n-th power, for n ∈ {1, 2}, has a finite expectation. Then the L¹ norm of f is bounded above by the L² norm: ‖f‖₁ ≤ ‖f‖₂.

Proof. Let X be a random variable sampled from P, and consider the variance of |f(X)|:

0 ≤ Var(|f(X)|) = E[f(X)²] − E[|f(X)|]².

Rearranging terms, we have E[|f(X)|]² ≤ E[f(X)²]. Taking the square root of both sides gives ‖f‖₁ ≤ ‖f‖₂, as required.

Rearranging gives: So certainly: By a symmetric argument, substituting π = π*_A gives: Eqs. 150 and 151 respectively give: Substituting inequalities 148 and 152 into eq. 138 yields the required result. Note that if D = D_unif, uniform over S × A × S, then K ≤ |S|²|A|.

Lemma A.12. Let M be an MDP\R with state and action spaces S and A. Let R_A, R_B : S × A × S → ℝ be bounded rewards. Let π*_A and π*_B be policies optimal for rewards R_A and R_B in M. Suppose the regret under the standardized reward R_S_A is bounded. Then the regret under the original reward R_A is bounded by:

Proof. Recall that R_S_A = C_{D_S,D_A}(R_A)/‖C_{D_S,D_A}(R_A)‖₂, where C_{D_S,D_A}(R_A) is simply R_A shaped with some (bounded) potential Φ. It follows that: where s₀ depends only on the initial state distribution d₀.† Since s₀ does not depend on π, the terms cancel when taking the difference in returns: Combining this with eq. 153 gives: Finally, we will bound ‖C_{D_S,D_A}(R_A)‖₂ in terms of ‖R_A‖₂, completing the proof. Recall the definition of C_{D_S,D_A}(R_A), where S and S′ are random variables independently sampled from D_S, and A is sampled from D_A.
By the triangle inequality on the L² norm and linearity of expectations, we have:

‖C_{D_S,D_A}(R)‖₂ ≤ ‖R‖₂ + γ‖f‖₂ + ‖g‖₂ + γ|c|,

where f(s, a, s′) = E[R(s′, A, S′)], g(s, a, s′) = E[R(s, A, S′)] and c = E[R(S, A, S′)]. Letting X be a random variable sampled from D_S independently from S and S′, we have by Jensen's inequality that ‖f‖₂² = E[f(X, A, S′)²] ≤ E[R(X, A, S′)²] = ‖R‖₂². So ‖f‖₂ ≤ ‖R‖₂ and, by an analogous argument, ‖g‖₂ ≤ ‖R‖₂ and |c| ≤ ‖R‖₂. Combining these results, we have ‖C_{D_S,D_A}(R)‖₂ ≤ 2(1 + γ)‖R‖₂. Substituting eq. 170 into eq. 159 yields the claimed bound.

Theorem 4.9. Let M be a γ-discounted MDP\R with finite state and action spaces S and A. Let R_A, R_B : S × A × S → ℝ be rewards, and π*_A, π*_B be respective optimal policies. Let D_π(t, s_t, a_t, s_{t+1}) denote the distribution over transitions S × A × S induced by policy π at time t, and let D(s, a, s′) be the coverage distribution used to compute D_EPIC. Suppose there exists K > 0 such that K·D(s_t, a_t, s_{t+1}) ≥ D_π(t, s_t, a_t, s_{t+1}) for all times t ∈ ℕ, triples (s_t, a_t, s_{t+1}) ∈ S × A × S and policies π ∈ {π*_A, π*_B}. Then the regret under R_A from executing π*_B instead of π*_A is at most the stated bound, where G_R(π) is the return of policy π under reward R.

Proof. Recall from section A.4 that D_EPIC(R_A, R_B) = ½‖R_S_A − R_S_B‖₂. Applying lemma A.10 we obtain:

D_{L¹,D}(R_S_A, R_S_B) ≤ D_{L²,D}(R_S_A, R_S_B) = 2·D_EPIC(R_A, R_B). (173)

Note that π*_A is optimal for R_S_A and π*_B is optimal for R_S_B, since the set of optimal policies for R_S is the same as for R. Applying lemma A.11 and eq. 173 gives a regret bound for the standardized rewards. Since S × A × S is bounded, R_A and R_B must be bounded, so we can apply lemma A.12, completing the proof.

Lemma A.15. Let M be an MDP\R with state and action spaces S and A. Let R_A, R_B : S × A × S → ℝ be L-Lipschitz, bounded rewards on the ℓ₁ norm ‖·‖₁. Let π*_A and π*_B be policies optimal for rewards R_A and R_B in M. Let D_{π,t}(s_t, a_t, s_{t+1}) denote the distribution over transitions that policy π induces in M at time step t. Let D(s, a, s′) be the (stationary) coverage distribution over transitions S × A × S used to compute D_EPIC. Let α ≥ 1. Then the regret under R_A from executing π*_B, optimal for R_B, instead of π*_A is at most:

Proof.
By the same argument as lemma A.11 up to eq. 144, we have for any policy π a bound in terms of f = R_A − R_B; note f is 2L-Lipschitz and bounded since R_A and R_B are both L-Lipschitz and bounded. Now, by lemma A.14, letting µ = D_{π,t} and ν = D, we have the corresponding bound. By the same argument as for eqs. 148 to 152 in lemma A.11, the result follows, completing the proof.

Theorem A.16. Let M be an MDP\R with state and action spaces S and A. Let R_A, R_B : S × A × S → ℝ be bounded, L-Lipschitz rewards on the ℓ₁ norm ‖·‖₁. Let π*_A and π*_B be policies optimal for rewards R_A and R_B in M. Let D_π(t, s_t, a_t, s_{t+1}) denote the distribution over transitions that policy π induces in M at time step t. Let D(s, a, s′) be the (stationary) coverage distribution over transitions S × A × S used to compute D_EPIC. Let α ≥ 1.

Proof. The proof for theorem 4.9 holds in the general setting up to eq. 173. Applying lemma A.15 to eq. 173 gives a regret bound for the standardized rewards. Applying lemma A.12 then yields the claimed bound.

A.7 LIMITING BEHAVIOR OF REGRET

The regret bound for finite MDPs, theorem 4.9, directly implies that, as the EPIC distance tends to 0, the regret also tends to 0. By contrast, our regret bound in theorem A.16 for (possibly continuous) MDPs with Lipschitz reward functions includes the relaxed Wasserstein distance W_α as an additive term. At first glance, it might therefore appear possible for the regret to be positive even with a zero EPIC distance. However, in this section we will show that in fact the regret tends to 0 as D_EPIC(R_A, R_B) → 0 in the Lipschitz case as well as the finite case.

We show in lemma A.17 that if the expectation of a non-negative function over a totally bounded measurable metric space M tends to zero under one distribution with adequate support, then it also tends to zero under all other distributions. For example, taking M to be a hypercube in Euclidean space with the Lebesgue measure satisfies these assumptions. We conclude in theorem A.18 by showing the regret tends to 0 as the EPIC distance tends to 0.

Lemma A.17. Let M = (S, d) be a totally bounded metric space, where d(x, y) = ‖x − y‖. Let (S, A, µ) be a measure space on S with the Borel σ-algebra A and measure µ. Let p, q ∈ Δ(S) be probability density functions on S. Let δ > 0 be such that p(s) ≥ δ for all s ∈ S. Let f_n : S → ℝ be a sequence of L-Lipschitz functions on norm ‖·‖. Suppose lim_{n→∞} E_{X∼p}[|f_n(X)|] = 0. Then lim_{n→∞} E_{Y∼q}[|f_n(Y)|] = 0.

Proof. Since S is totally bounded, for any r > 0 there exists a finite cover C(r) of S by open balls of radius r such that ∪_{B∈C(r)} B = S. It is possible for some balls B_r(c_n) to have measure zero, µ(B_r(c_n)) = 0, such as if S contains an isolated point c_n. Define P(r) to be the subset of C(r) with positive measure, and let p_{r,1}, …, p_{r,Q(r)} denote the centers of the balls in P(r). Since P(r) is a finite collection, it must have a minimum measure: α(r) = min_{B∈P(r)} µ(B). Moreover, by construction of P(r), α(r) > 0.

Let S′(r) be the union only over balls of positive measure: S′(r) = ∪_{B∈P(r)} B. Now, let D(r) = S \ S′(r), comprising the (finite number of) measure-zero balls in C(r). Since measures are countably additive, it follows that D(r) is itself measure zero: µ(D(r)) = 0.
Consequently, ∫_S g dµ = ∫_{S′(r)} g dµ for any measurable function g : S → ℝ. Since lim_{n→∞} E_{X∼p}[|f_n(X)|] = 0, for all r > 0 there exists some N_r ∈ ℕ such that for all n ≥ N_r the expectation is suitably small. By Lipschitz continuity, for any s, s′ ∈ S: |f_n(s′)| ≤ |f_n(s)| + L‖s − s′‖. In particular, any point s ∈ S′(r) is at most r distance from some ball center p_{r,i}, completing the proof.

Theorem A.18. Let M be an MDP\R with state and action spaces S and A. Let R_A, R_B : S × A × S → ℝ be bounded rewards on some norm ‖·‖ on S × A × S. Let π*_A and π*_B be policies optimal for rewards R_A and R_B in M. Let D_π(t, s_t, a_t, s_{t+1}) denote the distribution over transitions that policy π induces in M at time step t. Let D(s, a, s′) be the (stationary) coverage distribution over transitions S × A × S used to compute D_EPIC. Suppose that either:

1. Discrete: S and A are discrete. Moreover, suppose that there exists some K > 0 such that K·D(s_t, a_t, s_{t+1}) ≥ D_π(t, s_t, a_t, s_{t+1}) for all time steps t ∈ ℕ, triples (s_t, a_t, s_{t+1}) ∈ S × A × S and policies π ∈ {π*_A, π*_B}.

2. Lipschitz: (S × A × S, d) is a totally bounded measurable metric space, where d(x, y) = ‖x − y‖. Moreover, R_A and R_B are L-Lipschitz on ‖·‖. Furthermore, suppose there exists some δ > 0 such that D(s, a, s′) ≥ δ for all (s, a, s′) ∈ S × A × S, and that D_π(t, s_t, a_t, s_{t+1}) is a non-degenerate probability density function (i.e. no single point has positive measure).

Then G_{R_A}(π*_A) − G_{R_A}(π*_B) → 0 as D_EPIC(R_A, R_B) → 0.

Proof. In case (1) Discrete, theorem 4.9 bounds the regret above by a quantity proportional to D_EPIC(R_A, R_B). Moreover, by optimality of π*_A we have 0 ≤ G_{R_A}(π*_A) − G_{R_A}(π*_B). So by the squeeze theorem, as D_EPIC(R_A, R_B) → 0, the regret G_{R_A}(π*_A) − G_{R_A}(π*_B) → 0.

From now on, suppose we are in case (2) Lipschitz. By the same argument as lemma A.11 up to eq. 144, we have a bound for any policy π.

Published as a conference paper at ICLR 2021

Applying lemma A.12, and then eq. 173, we know that D_{L¹,D}(R_S_A, R_S_B) → 0 as D_EPIC(R_A, R_B) → 0. By lemma A.17, we know that D_{L¹,D_{π,t}}(R_S_A, R_S_B) → 0 as D_{L¹,D}(R_S_A, R_S_B) → 0. So we can conclude that, as D_EPIC(R_A, R_B) → 0, the regret G_{R_A}(π*_A) − G_{R_A}(π*_B) → 0, as required.

