QUANTIFYING DIFFERENCES IN REWARD FUNCTIONS

Abstract

For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by training policies to optimize them and inspecting the resulting rollouts. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward. Moreover, this method can only tell us about behavior in the evaluation environment: the reward may incentivize very different behavior in even a slightly different deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without a policy optimization step. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be efficiently approximated and is more robust than baselines to the choice of coverage distribution. Finally, we show that EPIC distance bounds the regret of optimal policies even under different transition dynamics, and we confirm empirically that it predicts policy training success. Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.

1. INTRODUCTION

Reinforcement learning (RL) has reached or surpassed human performance in many domains with clearly defined reward functions, such as games [20; 15; 23] and narrowly scoped robotic manipulation tasks [16]. Unfortunately, the reward functions for most real-world tasks are difficult or impossible to specify procedurally. Even a task as simple as peg insertion from pixels has a non-trivial reward function that must usually be learned [22, IV.A]. Tasks involving human interaction can have far more complex reward functions that users may not even be able to introspect on. These challenges have inspired work on learning a reward function, whether from demonstrations [13; 17; 26; 8; 3], preferences [1; 25; 6; 18; 27], or both [10; 4].

Prior work has usually evaluated a learned reward function R̂ using the "rollout method": training a policy π_R̂ to optimize R̂ and then examining rollouts from π_R̂. Unfortunately, using RL to compute π_R̂ is often computationally expensive. Furthermore, the method produces false negatives when the reward R̂ matches user preferences but the RL algorithm fails to optimize R̂.

The rollout method also produces false positives. Of the many reward functions that induce the desired rollout in a given environment, only a small subset align with the user's preferences. For example, suppose the agent can reach states {A, B, C}. If the user prefers A > B > C, but the agent instead learns A > C > B, the agent will still go to the correct state A. However, if the initial state distribution or transition dynamics change, misaligned rewards may induce undesirable policies. For example, if A is no longer reachable at deployment, the previously reliable agent would misbehave by going to the least-favoured state C. We propose instead to evaluate learned rewards via their distance from other reward functions, and summarize our desiderata for reward function distances in Table 1.
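The false-positive scenario above can be made concrete with a minimal sketch (the state names, reward values, and greedy "policy" are illustrative assumptions, not from this paper): two reward functions that agree on the best state look identical under rollouts, yet diverge as soon as that state becomes unreachable.

```python
# Illustrative only: the agent's "policy" is to move to the
# highest-reward state among those currently reachable.

true_reward = {"A": 3, "B": 2, "C": 1}     # user's preference: A > B > C
learned_reward = {"A": 3, "B": 1, "C": 2}  # misaligned:        A > C > B

def optimal_state(reward, reachable):
    """Greedy policy: pick the reachable state with the highest reward."""
    return max(reachable, key=reward.get)

# Training environment: all states reachable. Both rewards induce the
# same behavior, so the rollout method reports no difference.
assert optimal_state(true_reward, {"A", "B", "C"}) == "A"
assert optimal_state(learned_reward, {"A", "B", "C"}) == "A"

# Deployment environment: A is no longer reachable. The misaligned
# reward now sends the agent to the least-preferred state C.
print(optimal_state(true_reward, {"B", "C"}))     # B
print(optimal_state(learned_reward, {"B", "C"}))  # C
```

Comparing the two reward dictionaries directly, rather than the rollouts they induce, exposes the misalignment regardless of which states happen to be reachable; this is the motivation for a reward-function distance.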
For benchmarks, it is usually possible to directly compare a learned reward R̂ to the true reward function R. Alternatively, benchmark creators can train a "proxy" reward function from a large human data set. This proxy can then be used as a stand-in for the true reward R when evaluating algorithms trained on a different or smaller data set.

