BENEFITS OF ASSISTANCE OVER REWARD LEARNING

Anonymous

Abstract

Much recent work has focused on how an agent can learn what to do from human feedback, leading to two major paradigms. The first paradigm is reward learning, in which the agent learns a reward model from human feedback provided from outside the environment. The second is assistance, in which the human is modeled as part of the environment, and the true reward function is modeled as a latent variable of the environment about which the agent may make inferences. The key difference between the two paradigms is that in reward learning, there is by construction a separation between reward learning and control using the learned reward; in assistance, both functions are performed as needed by a single policy. By merging reward learning and control, assistive agents can reason about the impact of control actions on the reward learning process, giving them several advantages over agents based on reward learning. We illustrate these advantages in simple environments by exhibiting desirable qualitative behaviors of assistive agents that cannot be produced by agents based on reward learning.

1. INTRODUCTION

Traditional computer programs are instructions on how to perform a particular task. However, we do not know how to mechanically perform more challenging tasks like translation. The field of artificial intelligence raises the level of abstraction so that we simply specify what the task is, and let the machine figure out how to do it. As task complexity increases, even specifying the task becomes difficult. Several criteria that we might have thought were part of a specification of fairness turn out to be provably impossible to satisfy simultaneously (Kleinberg et al., 2016; Chouldechova, 2017; Corbett-Davies et al., 2017). Reinforcement learning agents often "game" their reward function by finding solutions that technically achieve high reward without doing what the designer intended (Lehman et al., 2018; Krakovna, 2018; Clark & Amodei, 2016). In complex environments, we need to specify what not to change (McCarthy & Hayes, 1981); failure to do so can lead to negative side effects (Amodei et al., 2016). Powerful agents with poor specifications may pursue instrumental subgoals (Bostrom, 2014; Omohundro, 2008) such as resisting shutdown and accumulating resources and power (Turner, 2019). A natural solution is to once again raise the level of abstraction, and create an agent that is uncertain about the objective and infers it from human feedback, rather than directly specifying some particular task(s). Rather than the current model of intelligent agents optimizing for their objectives, we would then have beneficial agents optimizing for our objectives (Russell, 2019). Reward learning (Leike et al., 2018; Jeon et al., 2020; Christiano et al., 2017; Ziebart et al., 2010) attempts to instantiate this by learning a reward model from human feedback, and then using a control algorithm to optimize the learned reward.
Crucially, the control algorithm does not reason about the effects of the chosen actions on the reward learning process, which is external to the environment. In contrast, in the assistance paradigm (Hadfield-Menell et al., 2016; Fern et al., 2014), the human H is modeled as part of the environment and as having some latent goal that the agent R (for robot) does not know. R's goal is to maximize this (unknown) human goal. In this formulation, R must balance between actions that help it learn about the unknown goal and control actions that lead to high reward. Our key insight is that by integrating the reward learning and control modules, assistive agents can take the reward learning process into account when selecting actions. This gives assistive agents a significant advantage over reward learning agents, which cannot perform similar reasoning. The goal of this paper is to clarify and illustrate this advantage. We first precisely characterize the differences between reward learning and assistance by showing that two-phase, communicative assistance is equivalent to reward learning (Section 3). We then give qualitative examples of desirable behaviors that can only be expressed once these restrictions are lifted, and thus are only exhibited by assistive agents (Section 4).

Consider for example the kitchen environment illustrated in Figure 1, in which R must bake a pie for H. R is uncertain about which type of pie H prefers, and currently H is at work and cannot answer R's questions. An assistive R can make the pie crust, but wait to ask H about her preferences over the filling (Section 4.1). R may never need to clarify all of H's preferences: for example, R only needs to know how to dispose of food if it turns out that the ingredients have gone bad (Section 4.2). If H will help with making the pie, R can let H disambiguate her desired pie by watching which filling she chooses (Section 4.3). Vanilla reward learning agents do not show these behaviors.
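The trade-off an assistive agent faces between querying H and acting immediately can be made concrete with a toy value-of-information calculation. The numbers, function names, and query-cost model below are our own illustrative assumptions, not part of the paper's formalism:

```python
# Toy value-of-information sketch (illustrative numbers, not from the paper).
# R holds a uniform prior over which of three pie fillings H prefers.
# Option 1: commit to a filling now. Option 2: pay a small query cost to
# ask H first, then bake the right pie for sure.

def expected_value_guess(prior, reward_correct=1.0, reward_wrong=0.0):
    """Best achievable expected reward when committing without asking."""
    best_prob = max(prior.values())
    return best_prob * reward_correct + (1 - best_prob) * reward_wrong

def expected_value_ask(prior, query_cost=0.1, reward_correct=1.0):
    """Asking reveals H's preference, so R always bakes the right pie."""
    return reward_correct - query_cost

prior = {"apple": 1 / 3, "blueberry": 1 / 3, "cherry": 1 / 3}
v_guess = expected_value_guess(prior)  # 1/3: a blind guess is right 1/3 of the time
v_ask = expected_value_ask(prior)      # 0.9: certain success minus the query cost
```

Because `v_ask > v_guess`, an agent that reasons jointly about learning and control would choose to ask; a pipeline whose control module treats the learned reward as fixed has no way to weigh the query cost against the information gained.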
We do not mean to suggest that all work on reward learning should cease and only research on assistive agents should be pursued. Among other limitations, computing assistive policies is computationally expensive. Our goal is simply to clarify what qualitative benefits an assistive formulation could theoretically provide. Further research is needed to develop efficient algorithms that can capture these benefits. Such algorithms may look like algorithms designed to solve assistance problems as we have formalized them here, but they may also look like modified variants of reward learning, where the modifications are designed to provide the qualitative benefits we identify.

2. BACKGROUND AND RELATED WORK

We introduce the key ideas behind reward learning and assistance. X* denotes the set of finite sequences of elements of X. We use parametric specifications for ease of exposition, but our results apply more generally.

2.1. POMDPS

A partially observable Markov decision process (POMDP) M = ⟨S, A, Ω, O, T, r, P_0, γ⟩ consists of a finite state space S, a finite action space A, a finite observation space Ω, an observation function O : S → Δ(Ω) (where Δ(X) is the set of probability distributions over X), a transition function T : S × A → Δ(S), a reward function r : S × A × S → ℝ, an initial state distribution P_0 ∈ Δ(S), and a discount rate γ ∈ (0, 1). We write o_t for the t-th observation, sampled from O(s_t). A solution to the POMDP is a policy π : (Ω × A)* × Ω → Δ(A) that maximizes the expected sum of discounted rewards

ER(π) = E_{s_0 ∼ P_0, a_t ∼ π(· | o_{0:t}, a_{0:t-1}), s_{t+1} ∼ T(· | s_t, a_t)} [ Σ_{t=0}^∞ γ^t r(s_t, a_t, s_{t+1}) ].
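The tuple ⟨S, A, Ω, O, T, r, P_0, γ⟩ and the objective ER(π) can be sketched directly as a tabular data structure with a Monte Carlo return estimator. The class layout, helper names, and the trivial one-state example below are our own illustration, not the paper's:

```python
# Minimal tabular POMDP sketch matching the tuple <S, A, Omega, O, T, r, P0, gamma>.
import random

class POMDP:
    def __init__(self, S, A, Omega, O, T, r, P0, gamma):
        self.S, self.A, self.Omega = S, A, Omega
        self.O = O          # O[s]: distribution over observations, as {obs: prob}
        self.T = T          # T[s][a]: distribution over next states, as {state: prob}
        self.r = r          # r(s, a, s'): float reward
        self.P0 = P0        # distribution over initial states, as {state: prob}
        self.gamma = gamma  # discount rate in (0, 1)

def sample(dist, rng):
    """Draw one element from a {value: probability} dictionary."""
    values, probs = zip(*dist.items())
    return rng.choices(values, weights=probs)[0]

def estimate_return(pomdp, policy, horizon=50, episodes=100, seed=0):
    """Monte Carlo estimate of E[sum_t gamma^t r(s_t, a_t, s_{t+1})].

    The policy maps an (observation, action) history plus the current
    observation to an action, mirroring pi : (Omega x A)* x Omega -> A.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s = sample(pomdp.P0, rng)
        hist, ret, disc = [], 0.0, 1.0
        for _ in range(horizon):
            o = sample(pomdp.O[s], rng)
            a = policy(hist, o)
            s_next = sample(pomdp.T[s][a], rng)
            ret += disc * pomdp.r(s, a, s_next)
            disc *= pomdp.gamma
            hist.append((o, a))
            s = s_next
        total += ret
    return total / episodes

# Sanity check: one state, one action, reward 1 every step, gamma = 0.5,
# so the truncated return is sum_{t<50} 0.5^t, which is approximately 2.
toy = POMDP(
    S=["s"], A=["a"], Omega=["o"],
    O={"s": {"o": 1.0}}, T={"s": {"a": {"s": 1.0}}},
    r=lambda s, a, s_next: 1.0, P0={"s": 1.0}, gamma=0.5,
)
v = estimate_return(toy, lambda hist, o: "a", episodes=10)
```

Note that because the policy conditions on the observation-action history rather than the state, any belief tracking over hidden state is the policy's responsibility, which is exactly the hook that assistance later exploits for inference about the latent reward.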

2.2. REWARD LEARNING

We consider two variants of reward learning: non-active reward learning, in which R must infer the reward by observing H's behavior, and active reward learning, in which R may choose particular questions to ask H in order to elicit particular feedback. A non-active reward learning problem P = ⟨M\r, C, Θ, r_θ, P_Θ, π_H, k⟩ contains a POMDP without reward M\r = ⟨S, A^R, Ω^R, O^R, T, P_0, γ⟩; instead of observing the reward, R has access to a parameterized reward function r_θ with parameters θ ∈ Θ and a prior P_Θ over Θ.
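Non-active reward learning can be sketched as Bayesian inference over θ given H's observed behavior. The sketch below compresses the components of the tuple into a single observed human choice from a finite choice set, and the Boltzmann-rational human model and all function names are our own illustrative assumptions rather than the paper's definition:

```python
# Sketch of non-active reward learning as Bayesian inference over theta
# (our illustration: one observed choice by a noisily-rational human).
import math

def boltzmann(choice_rewards, beta=2.0):
    """P(choice | theta): softmax over the rewards theta assigns to choices."""
    weights = {c: math.exp(beta * r) for c, r in choice_rewards.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}

def posterior(prior, reward_fns, observed_choice, choices, beta=2.0):
    """P(theta | H chose observed_choice), by Bayes' rule."""
    post = {}
    for theta, p in prior.items():
        likelihood = boltzmann({c: reward_fns[theta](c) for c in choices}, beta)
        post[theta] = p * likelihood[observed_choice]
    z = sum(post.values())
    return {theta: p / z for theta, p in post.items()}

# Example: theta indexes which pie filling H values.
choices = ["apple", "blueberry", "cherry"]
reward_fns = {t: (lambda c, t=t: 1.0 if c == t else 0.0) for t in choices}
prior = {t: 1 / 3 for t in choices}
post = posterior(prior, reward_fns, observed_choice="apple", choices=choices)
# Seeing H pick apple shifts belief toward theta = "apple".
```

In the reward learning paradigm this inference happens in a separate phase, after which the posterior (or its mean reward) is handed to a control algorithm; the control algorithm never reasons about how its own actions would change the posterior.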



Figure 1: R must cook a pie for H, by placing flour on the plate to make the pie dough, filling it with either Apple, Blueberry, or Cherry filling, and finally baking it. However, R does not know which filling H prefers, and H is not available for questions since she is doing something else. What should R do in this situation? On the right, we show the qualitative reasoning we might want R to use to handle the situation.

