BENEFITS OF ASSISTANCE OVER REWARD LEARNING

Anonymous

Abstract

Much recent work has focused on how an agent can learn what to do from human feedback, leading to two major paradigms. The first paradigm is reward learning, in which the agent learns a reward model from human feedback that is provided from outside the environment. The second is assistance, in which the human is modeled as part of the environment, and the true reward function is modeled as a latent variable in the environment that the agent may make inferences about. The key difference between the two paradigms is that in the reward learning paradigm, by construction there is a separation between reward learning and control using the learned reward. In contrast, in assistance these functions are performed as needed by a single policy. By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning. We illustrate these advantages in simple environments by showing desirable qualitative behaviors of assistive agents that cannot be found by agents based on reward learning.

1. INTRODUCTION

Traditional computer programs are instructions on how to perform a particular task. However, we do not know how to mechanically perform more challenging tasks like translation. The field of artificial intelligence raises the level of abstraction so that we simply specify what the task is, and let the machine figure out how to do it. As task complexity increases, even specifying the task becomes difficult. Several criteria that we might have thought were part of a specification of fairness turn out to be provably impossible to simultaneously satisfy (Kleinberg et al., 2016; Chouldechova, 2017; Corbett-Davies et al., 2017). Reinforcement learning agents often "game" their reward function by finding solutions that technically achieve high reward without doing what the designer intended (Lehman et al., 2018; Krakovna, 2018; Clark & Amodei, 2016). In complex environments, we need to specify what not to change (McCarthy & Hayes, 1981); failure to do so can lead to negative side effects (Amodei et al., 2016). Powerful agents with poor specifications may pursue instrumental subgoals (Bostrom, 2014; Omohundro, 2008) such as resisting shutdown and accumulating resources and power (Turner, 2019). A natural solution is to once again raise the level of abstraction, and create an agent that is uncertain about the objective and infers it from human feedback, rather than directly specifying some particular task(s). Rather than using the current model of intelligent agents optimizing for their objectives, we would now have beneficial agents optimizing for our objectives (Russell, 2019). Reward learning (Leike et al., 2018; Jeon et al., 2020; Christiano et al., 2017; Ziebart et al., 2010) attempts to instantiate this by learning a reward model from human feedback, and then using a control algorithm to optimize the learned reward.
Crucially, the control algorithm does not reason about the effects of the chosen actions on the reward learning process, which is external to the environment. In contrast, in the assistance paradigm (Hadfield-Menell et al., 2016; Fern et al., 2014), the human H is modeled as part of the environment and as having some latent goal that the agent R (for robot) does not know. R's goal is to maximize this (unknown) human goal. In this formulation, R must balance between actions that help learn about the unknown goal, and control actions that lead to high reward. Our key insight is that by integrating reward learning and control modules, assistive agents can take into account the reward learning process when selecting actions. This gives assistive agents a significant advantage over reward learning agents, which cannot perform similar reasoning.

Figure 1: R must cook a pie for H, by placing flour on the plate to make the pie dough, filling it with either Apple, Blueberry, or Cherry filling, and finally baking it. However, R does not know which filling H prefers, and H is not available for questions since she is doing something else. What should R do in this situation? On the right, we show the qualitative reasoning we might want R to use to handle the situation: learning about reward, making robust plans, preserving option value when possible, and guessing when feedback is unavailable.

The goal of this paper is to clarify and illustrate this advantage. We first precisely characterize the differences between reward learning and assistance, by showing that two-phase, communicative assistance is equivalent to reward learning (Section 3). We then give qualitative examples of desirable behaviors that can only be expressed once these restrictions are lifted, and thus are only exhibited by assistive agents (Section 4). Consider for example the kitchen environment illustrated in Figure 1, in which R must bake a pie for H. R is uncertain about which type of pie H prefers to have, and currently H is at work and cannot answer R's questions. An assistive R can make the pie crust, but wait to ask H about her preferences over the filling (Section 4.1). R may never clarify all of H's preferences: for example, R only needs to know how to dispose of food if it turns out that the ingredients have gone bad (Section 4.2). If H will help with making the pie, R can allow H to disambiguate her desired pie by watching what filling she chooses (Section 4.3). Vanilla reward learning agents do not show these behaviors.

We do not mean to suggest that all work on reward learning should cease and only research on assistive agents should be pursued. Amongst other limitations, assistive agents are very computationally complex.
Our goal is simply to clarify what qualitative benefits an assistive formulation could theoretically provide. Further research is needed to develop efficient algorithms that can capture these benefits. Such algorithms may look like algorithms designed to solve assistance problems as we have formalized them here, but they may also look like modified variants of reward learning, where the modifications are designed to provide the qualitative benefits we identify.

2. BACKGROUND AND RELATED WORK

We introduce the key ideas behind reward learning and assistance. $X^*$ denotes a finite sequence of elements of $X$. We use parametric specifications for ease of exposition, but our results apply more generally.

2.1. POMDPS

A partially observable Markov decision process (POMDP) $M = \langle S, A, \Omega, O, T, r, P_0, \gamma \rangle$ consists of a finite state space $S$, a finite action space $A$, a finite observation space $\Omega$, an observation function $O : S \to \Delta(\Omega)$ (where $\Delta(X)$ is the set of probability distributions over $X$), a transition function $T : S \times A \to \Delta(S)$, a reward function $r : S \times A \times S \to \mathbb{R}$, an initial state distribution $P_0 \in \Delta(S)$, and a discount rate $\gamma \in (0, 1)$. We write $o_t$ to signify the $t$-th observation $O(s_t)$. A solution to the POMDP is given by a policy $\pi : (\Omega \times A)^* \times \Omega \to \Delta(A)$ that maximizes the expected sum of discounted rewards $ER(\pi) = \mathbb{E}_{s_0 \sim P_0,\; a_t \sim \pi(\cdot \mid o_{0:t}, a_{0:t-1}),\; s_{t+1} \sim T(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1})\right]$.
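As a concrete toy illustration of this objective, the discounted return of a rollout can be sketched in Python. The two-state dynamics, actions, and rewards below are invented for illustration and are not from the paper; the dynamics are deterministic to keep the sketch short.

```python
# A minimal sketch of the POMDP objective from Section 2.1.
# All names (states, actions, the two-state dynamics) are illustrative.
GAMMA = 0.9

def transition(s, a):
    # T(s' | s, a): "go" flips the state, "stay" keeps it (deterministic).
    return "s1" if (s == "s0") == (a == "go") else "s0"

def reward(s, a, s_next):
    # r(s, a, s'): reward 1 for arriving in s1, else 0.
    return 1.0 if s_next == "s1" else 0.0

def discounted_return(policy, s0, horizon=50):
    """sum_t gamma^t * r(s_t, a_t, s_{t+1}) for one rollout of `policy`."""
    s, total = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        s_next = transition(s, a)
        total += GAMMA ** t * reward(s, a, s_next)
        s = s_next
    return total
```

In a real POMDP the policy would condition on the observation history rather than the state, and the return would be an expectation over stochastic transitions; the truncated sum above stands in for the infinite discounted sum.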

2.2. REWARD LEARNING

We consider two variants of reward learning: non-active reward learning, in which R must infer the reward by observing H's behavior, and active reward learning, in which R may choose particular questions to ask H in order to get particular feedback. A non-active reward learning problem $P = \langle M_{\setminus r}, C, \Theta, r_\theta, P_\Theta, \pi^H, k \rangle$ contains a POMDP without reward $M_{\setminus r} = \langle S, A^R, \Omega^R, O^R, T, P_0, \gamma \rangle$; instead, R has access to a parameterized reward space $\langle \Theta, r_\theta, P_\Theta \rangle$. R is able to learn about $\theta^*$ by observing H make $k$ different choices $c$, each chosen from a set of potential choices $C$. In order for R to learn from the human's choices, it also assumes access to the human decision function $\pi^H(c \mid \theta)$ that determines how the human makes choices for different possible reward functions $r_\theta$. Common decision functions include perfect optimality (Ng & Russell, 2000) and Boltzmann rationality (Ziebart et al., 2010). There are many types of choices (Jeon et al., 2020), including demonstrations (Argall et al., 2009; Ng & Russell, 2000; Ziebart et al., 2010; Fu et al., 2017; Gao et al., 2012), comparisons (Zhang et al., 2017; Wirth et al., 2017; Christiano et al., 2017; Sadigh et al., 2017), corrections (Bajcsy et al., 2017), the state of the world (Shah et al., 2019), proxy rewards (Hadfield-Menell et al., 2017b), natural language (Fu et al., 2019), etc. A policy decision function $f(c_{0:k-1})$ produces a policy $\pi^R$ after observing H's choices. A solution is a policy decision function $f$ that maximizes expected reward $\mathbb{E}_{\theta \sim P_\Theta,\; c_{0:k-1} \sim \pi^H}[ER(f(c_{0:k-1}))]$. Since H's choices $c_{0:k-1}$ do not affect the state of the environment that R is acting in, this is equivalent to choosing the $\pi^R$ that maximizes expected reward under the posterior over reward functions, that is, $\mathbb{E}_{\theta \sim P(\theta \mid c_{0:k-1})}[ER(\pi^R)]$. An active reward learning problem $P = \langle M_{\setminus r}, Q, C, \Theta, r_\theta, P_\Theta, \pi^H, k \rangle$ adds the ability for R to ask H particular questions $q \in Q$ in order to get more targeted feedback about $\theta$.
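The posterior $P(\theta \mid c_{0:k-1})$ used above is a straightforward Bayesian update. A minimal sketch, assuming a Boltzmann-rational decision function over a finite choice set and a finite hypothesis space over $\theta$; the candidate reward functions, choice names, and rationality coefficient are all illustrative, not from the paper:

```python
import math

BETA = 2.0  # Boltzmann rationality coefficient (illustrative value)

def boltzmann(choice, theta, choices):
    """pi_H(c | theta), proportional to exp(beta * r_theta(c))."""
    zs = [math.exp(BETA * theta[c]) for c in choices]
    return math.exp(BETA * theta[choice]) / sum(zs)

def posterior(prior, observed, choices):
    """P(theta | c_{0:k-1}) via Bayes' rule over a finite Theta.

    prior: dict name -> (theta, probability), where theta maps choice -> reward.
    """
    post = {}
    for name, (theta, p) in prior.items():
        lik = 1.0
        for c in observed:
            lik *= boltzmann(c, theta, choices)
        post[name] = p * lik
    z = sum(post.values())
    return {name: v / z for name, v in post.items()}
```

For example, with two equally likely hypotheses ("H likes apple" vs. "H likes cherry"), repeatedly observing the choice "apple" concentrates the posterior on the first hypothesis.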
The human decision function $\pi^H(c \mid q, \theta)$ now depends on the question asked. A solution consists of a question policy $\pi^R_Q(q_i \mid q_{0:i-1}, c_{0:i-1})$ and a policy decision function $f(q_{0:k-1}, c_{0:k-1})$ that together maximize expected reward $\mathbb{E}_{\theta \sim P_\Theta,\; q_{0:k-1} \sim \pi^R_Q,\; c_{0:k-1} \sim \pi^H}[ER(f(q_{0:k-1}, c_{0:k-1}))]$. A typical algorithm (Eric et al., 2008; Daniel et al., 2014; Maystre & Grossglauser, 2017; Christiano et al., 2017; Sadigh et al., 2017; Zhang et al., 2017; Wilde et al., 2020) computes and asks the $q \in Q$ that maximizes an active learning criterion such as information gain (Bıyık et al., 2019) or volume removal (Sadigh et al., 2017). The best results are achieved by selecting questions with the highest value of information (Cohn, 2016; Zhang et al., 2017; Mindermann et al., 2018; Wilde et al., 2020), but these are usually much more computationally expensive. R then finds a policy that maximizes expected reward under the inferred distribution over $\theta$, in order to approximately solve the original POMDP. Note that a non-active reward learning problem is equivalent to an active reward learning problem with only one question, since having just a single question means that R has no choice in what feedback to get (see Appendix A.1 for proofs).
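The information-gain criterion mentioned above can be sketched directly from the definitions: it is the expected reduction in entropy of the belief over $\theta$ from asking a question. Here `belief` is a finite distribution over $\theta$ and `answer_model` plays the role of $\pi^H(c \mid q, \theta)$; both are hypothetical interfaces invented for this sketch.

```python
import math

def entropy(dist):
    """Shannon entropy of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def info_gain(question, belief, answer_model, answers):
    """Expected entropy reduction of `belief` over theta from asking `question`.

    answer_model(c, q, theta) = pi_H(c | q, theta).
    """
    h0 = entropy(belief)
    gain = 0.0
    for c in answers:
        # P(c | q) = sum_theta P(theta) * pi_H(c | q, theta)
        pc = sum(p * answer_model(c, question, th) for th, p in belief.items())
        if pc == 0:
            continue
        post = {th: p * answer_model(c, question, th) / pc
                for th, p in belief.items()}
        gain += pc * (h0 - entropy(post))
    return gain
```

A question whose answer perfectly discriminates between two equally likely hypotheses has gain ln 2; a question whose answer is independent of $\theta$ has gain 0, so an active learner would never ask it.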

2.3. ASSISTANCE

The key idea of assistance is that helpful behaviors like reward learning are incentivized when R does not know the true reward r and can only learn about it by observing human behavior. So, we model the human H as part of the environment, leading to a two-agent POMDP, and assume there is some true reward r that only H has access to, while the robot R only has access to a model relating r to H's behavior. Intuitively, as R acts in the environment, it will also observe H's behavior, which it can use to make inferences about the true reward. Following Hadfield-Menell et al. (2016), we define an assistance game $M$ as a tuple $M = \langle S, \{A^H, A^R\}, \{\Omega^H, \Omega^R\}, \{O^H, O^R\}, T, P_S, \gamma, \Theta, r_\theta, P_\Theta \rangle$. Here $S$ is a finite set of states, $A^H$ a finite set of actions for H, $\Omega^H$ a finite set of observations for H, and $O^H : S \to \Delta(\Omega^H)$ an observation function for H (respectively $A^R, \Omega^R, O^R$ for R). The transition function $T : S \times A^H \times A^R \to \Delta(S)$ gives the probability over next states given the current state and both actions. The initial state is sampled from $P_S \in \Delta(S)$. $\Theta$ is a set of possible reward function parameters $\theta$, which parameterize a class of reward functions $r_\theta : S \times A^H \times A^R \times S \to \mathbb{R}$, and $P_\Theta$ is the distribution from which $\theta$ is sampled. $\gamma \in (0, 1)$ is a discount factor. As with POMDPs, policies can depend on history; we use $\tau^R_t \in (\Omega^R \times A^H \times A^R)^t$ to denote R's history at timestep $t$. Both H and R are able to observe each other's actions, and on a given timestep, R acts before H.

How should R select its policy? Should it play a Nash strategy or an optimal strategy pair of the game, and if so, which one? Should it use a non-equilibrium policy, since humans likely do not use equilibrium strategies? This is a key hyperparameter in assistance games, as it determines the communication protocol for H and R. For maximum generality, we can equip the assistance game with a policy-conditioned belief $B : \Pi^R \to \Delta(\Pi^H)$ over $\pi^H$, which specifies how the human responds to the agent's choice of policy (Halpern & Pass, 2018). The agent's goal is to maximize expected reward given this belief.

Prior work on assistance games (Hadfield-Menell et al., 2016; Malik et al., 2018; Woodward et al., 2019) focuses on finding optimal strategy pairs. This corresponds to a belief that H will know and perfectly respond to R's policy (see Appendix A.3). However, our goal is to compare assistance to reward learning. Typical reward learning algorithms assume access to a model of human decision-making: for example, H might be modeled as optimal (Ng & Russell, 2000) or Boltzmann-rational (Ziebart et al., 2010). As a result, we also assume that we have access to a model of human decision-making $\pi^H$. Note that $\pi^H$ depends on $\theta$: we are effectively assuming that we know how H chooses how to behave given a particular reward $r_\theta$. This assumption corresponds to the policy-conditioned belief $B(\pi^R)(\tilde\pi) = \mathbb{1}[\tilde\pi = \pi^H]$. We define an assistance problem P as a pair $\langle M, \pi^H \rangle$, where $\pi^H$ is a human policy for the assistance game $M$. Given an assistance problem, a robot policy $\pi^R$ induces a probability distribution over trajectories $\tau \in (S \times A^H \times A^R)^*$, written $\tau \sim \langle s_0, \theta, \pi^H, \pi^R \rangle$. We denote the support of this distribution by $\mathrm{Traj}(\pi^R)$. The expected reward of a robot policy for $\langle M, \pi^H \rangle$ is given by $ER(\pi^R) = \mathbb{E}_{s_0 \sim P_S,\; \theta \sim P_\Theta,\; \tau \sim \langle s_0, \theta, \pi^H, \pi^R \rangle}\left[\sum_{t=0}^{\infty} \gamma^t r_\theta(s_t, a^H_t, a^R_t, s_{t+1})\right]$. A solution of $\langle M, \pi^H \rangle$ is a robot policy that maximizes expected reward: $\pi^R_* = \operatorname{argmax}_{\pi^R} ER(\pi^R)$.

2.3.1. SOLVING ASSISTANCE PROBLEMS

Once π H is given, H can be thought of as an aspect of the environment, and θ can be thought of as a particularly useful piece of information for estimating how good actions are. This suggests that we can reduce the assistance problem to an equivalent POMDP. Following Desai (2017), the key idea is to embed π H in the transition function T and embed θ in the state. In theory, to embed a potentially non-Markovian π H in T, we need to embed the entire history of the trajectory in the state, but this leads to extremely large POMDPs. In our experiments, we only consider Markovian human policies, for which we do not need to embed the full history, keeping the state space manageable. Thus, the policy can be written as π H (a H | o H , a R , θ). To ensure that R must infer θ from human behavior, as in the original assistance game, the observation function does not reveal θ, but does reveal the previous human action a H .

Proposition 1. Every assistance problem M, π H can be reduced to an equivalent POMDP M′.

The full reduction and proof of equivalence is given in Appendix A.2. When M is fully observable, in the reduced POMDP θ is the only part of the state not directly observable to the robot, making it an instance of a hidden-goal MDP (Fern et al., 2014). For computational tractability, much of the work on hidden goals (Javdani et al., 2015; Fern et al., 2014) selects actions assuming that all goal ambiguity is resolved in one step. This effectively separates reward learning and control in the same way as typical reward learning algorithms, thus negating many of the benefits we highlight in this work. Intention-aware motion planning (Bandyopadhyay et al., 2013) also embeds the human goal in the state in order to avoid collisions with humans during motion planning, but does not consider applications to assistance. Macindoe et al. (2012) use the formulation of a POMDP with a hidden goal to produce an assistive agent in a cops-and-robbers gridworld environment.
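The construction behind Proposition 1 can be sketched as follows, under the simplifying assumption that the dynamics and the Markovian human policy are deterministic (the construction generalizes to distributions). All function names and the concrete dynamics used in the usage example are ours, not the paper's.

```python
# Sketch of the reduction (following Desai, 2017): embed theta in an
# augmented state (s, theta) and fold the human policy pi_H into the
# transition function, so that a standard POMDP solver can be applied.

def reduce_to_pomdp(T, pi_H, r):
    """Build the reduced POMDP's step and observation functions.

    T(s, a_H, a_R)             -> next environment state
    pi_H(s, a_R, theta)        -> human action (Markovian, deterministic here)
    r(theta, s, a_H, a_R, s')  -> reward
    """
    def step(state, a_R):
        s, theta = state
        a_H = pi_H(s, a_R, theta)            # human folded into the dynamics
        s_next = T(s, a_H, a_R)
        rew = r(theta, s, a_H, a_R, s_next)
        return (s_next, theta), a_H, rew     # theta persists across steps

    def observe(state, a_H):
        s, _theta = state                    # the robot observes s and a_H,
        return (s, a_H)                      # but theta is never revealed
    return step, observe
```

Note that `observe` deliberately drops θ, so the only evidence the robot ever gets about the reward parameters is the human action folded into each transition, exactly as in the original assistance game.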
Nikolaidis et al. (2015) assumes a dataset of joint human-robot demonstrations, which they leverage to learn "types" of humans that can then be inferred online using a POMDP framework. This is similar to solving an assistance problem, where we think of the different values of θ as different "types" of humans. Chen et al. (2018) uses an assistance-style framework in which the unknown parameter is the human's trust in the robot (rather than the reward θ). Woodward et al. (2019) uses deep reinforcement learning to solve an assistance game in which the team must collect either plums or lemons. To our knowledge, these are the only prior works that use an assistive formulation in a way that does not ignore the information-gathering aspect of actions. While these works typically focus on algorithms to solve assistance games, we instead focus on the qualitative benefits of using an assistance formulation. Since we can reduce an assistance problem to a regular POMDP, we can use any POMDP solver to find the optimal π R . In our examples for this paper, we use an exact solver when feasible, and point-based value iteration (PBVI) (Pineau et al., 2003) or deep reinforcement learning (DRL) when not. When using DRL, we require recurrent models, since the optimal policy can depend on history. A common confusion is to ask how DRL can be used, given that it requires a reward signal, but by assumption R does not know the reward function. This stems from a misunderstanding of what it means for R "not to know" the reward function. When DRL is run, at the beginning of each episode, a specific value of θ is sampled as part of the initial state. The learned policy π R is not provided with θ: it can only see its observations o R and human actions a H , and so it is accurate to say that π R "does not know" the reward function. However, the reward is calculated by the DRL algorithm, not by π R , and the algorithm can and does use the sampled value of θ for this computation. 
π R can then implicitly learn the correlation between the actions a H chosen by π H and the high reward values that the DRL algorithm computes; this can often be thought of as an implicit estimation of θ in order to choose the right actions.
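The training setup described above can be sketched as a toy episode loop: θ is sampled at episode start and is visible to the reward computation on the algorithm side, but never to the policy, which sees only the history of human and robot actions. The environment, policies, and reward function in this sketch are illustrative stand-ins.

```python
import random

def run_episode(policy_act, pi_H, reward_fn, horizon=5):
    """One training episode. `policy_act(history, a_H)` never receives theta;
    `reward_fn(theta, a_H, a_R)` is computed by the algorithm, which does
    have access to the sampled theta."""
    theta = random.choice(["apple", "cherry"])   # hidden state variable
    history, ret = [], 0.0
    for t in range(horizon):
        a_H = pi_H(theta)                        # observable human action
        a_R = policy_act(history, a_H)           # conditions only on observations
        ret += reward_fn(theta, a_H, a_R)        # algorithm-side reward signal
        history.append((a_H, a_R))
    return ret
```

A policy that learns to mirror the reward-revealing human action gets full return without ever being told θ, which is precisely the implicit estimation described above.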

3. REWARD LEARNING AS TWO-PHASE COMMUNICATIVE ASSISTANCE

There are two key differences between reward learning and assistance. First, reward learning algorithms split reward learning and control into two separate phases, while assistance merges them into a single phase. Second, in reward learning, the human's only role is to communicate reward information to the robot, while in assistance the human can help with the task. These two properties exactly characterize the difference between the two: reward learning problems and communicative assistance problems with two phases can be reduced to each other in a very natural way. A communicative assistance problem is one in which the transition function $T$ and the reward function $r_\theta$ are independent of the choice of human action $a^H$, and the human policy $\pi^H(\cdot \mid o^H, a^R, \theta)$ is independent of the observation $o^H$. Thus, in a communicative assistance problem, H's actions only serve to respond to R, and have no effects on the state or the reward (other than by influencing R). Such problems can be cast as instances of HOP-POMDPs (Rosenthal & Veloso, 2011). For the notion of two phases, we will also need to classify robot actions as communicative or not. We will assume that there is some distinguished action $a^R_{\mathrm{noop}}$ that "does nothing". Then, a robot action $\hat{a}^R$ is communicative if for any $s, a^H, s'$ we have $T(s' \mid s, a^H, \hat{a}^R) = T(s' \mid s, a^H, a^R_{\mathrm{noop}})$ and $r_\theta(s, a^H, \hat{a}^R, s') = r_\theta(s, a^H, a^R_{\mathrm{noop}}, s')$. A robot action is physical if it is not communicative. Now consider a communicative assistance problem $\langle M, \pi^H \rangle$ with noop action $a^R_{\mathrm{noop}}$, and let the optimal robot policy be $\pi^R_*$. Intuitively, we would like to say that there is an initial communication phase in which the only thing that happens is that H responds to questions from R, and then a second action phase in which H does nothing and R acts.
Formally, the assistance problem is two-phase with actions at $t_{\mathrm{act}}$ if it satisfies the following property: $\exists a^H_{\mathrm{noop}} \in A^H.\; \forall \tau \in \mathrm{Traj}(\pi^R_*):\; (\forall t < t_{\mathrm{act}}:\; a^R_t \text{ is communicative}) \land (\forall t \ge t_{\mathrm{act}}:\; a^H_t = a^H_{\mathrm{noop}})$. Thus, in a two-phase assistance problem, every trajectory from an optimal policy can be split into a "communication" phase where R cannot act and an "action" phase where H cannot communicate.

Reducing reward learning to assistance. We can convert an active reward learning problem to a two-phase communicative assistance problem in an intuitive way: we add $Q$ to the set of robot actions, make $C$ the set of human actions, add a timestep counter to the state, and construct the reward such that an optimal policy must switch between the two phases after $k$ questions. A non-active reward learning problem can first be converted to an active reward learning problem.

Proposition 2. Every active reward learning problem $\langle M, Q, C, \Theta, r_\theta, P_\Theta, \pi^H, k \rangle$ can be reduced to an equivalent two-phase communicative assistance problem $\langle M', \pi^{H\prime} \rangle$.

Corollary 3. Every non-active reward learning problem $\langle M, C, \Theta, r_\theta, P_\Theta, \pi^H, k \rangle$ can be reduced to an equivalent two-phase communicative assistance problem $\langle M', \pi^{H\prime} \rangle$.

Reducing assistance to reward learning. The reduction from a two-phase communicative assistance problem to an active reward learning problem is similarly straightforward: we interpret R's communicative actions as questions and H's actions as answers. There is once again a simple generalization to non-active reward learning.

Proposition 4. Every two-phase communicative assistance problem $\langle M, \pi^H, a^R_{\mathrm{noop}} \rangle$ can be reduced to an equivalent active reward learning problem.

Corollary 5. If a two-phase communicative assistance problem $\langle M, \pi^H \rangle$ has only one communicative robot action, it can be reduced to an equivalent non-active reward learning problem.
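The two-phase property above can be checked mechanically on a given trajectory. A minimal sketch, with `communicative` standing in for the predicate on robot actions defined in the text (the action names in the usage example are illustrative):

```python
def is_two_phase(traj, t_act, communicative, a_H_noop):
    """Check the two-phase property on one trajectory: before t_act the
    robot only takes communicative actions; from t_act on, the human noops.

    traj: list of (a_H, a_R) pairs, one per timestep.
    """
    for t, (a_H, a_R) in enumerate(traj):
        if t < t_act and not communicative(a_R):
            return False       # robot took a physical action too early
        if t >= t_act and a_H != a_H_noop:
            return False       # human communicated during the action phase
    return True
```

A full check of the formal property would quantify over every trajectory in Traj(π R*); this sketch is the per-trajectory test inside that quantifier.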

4. QUALITATIVE IMPROVEMENTS FOR GENERAL ASSISTANCE

We have seen that reward learning is equivalent to two-phase communicative assistance problems, where inferring the reward distribution can be separated from control using the reward distribution. However, for general assistance games, it is necessary to merge estimation and control, leading to several new qualitative behaviors. When the two-phase restriction is lifted, we observe relevance-aware active learning and plans conditional on future feedback. When the communicative restriction is lifted, we observe learning from physical actions. We demonstrate these qualitative behaviors in simple environments using point-based value iteration (PBVI) or deep reinforcement learning (DRL). We describe the qualitative results here, deferring detailed explanations of environments and results to Appendix C. For communicative assistance problems, we also consider two baselines:

1. Active reward learning. This is the reward learning paradigm discussed so far.

2. Interactive reward learning. This is a variant of reward learning that aims to recover some of the benefits of interactivity by alternating reward learning and acting phases. During an action phase, R chooses actions that maximize expected reward under its current belief over θ (without "knowing" that its belief may change), while during a reward learning phase, R chooses questions that maximize information gain.

4.1. PLANS CONDITIONAL ON FUTURE FEEDBACK

Here, we show how an assistive agent can make plans that depend on obtaining information about θ in the future. The agent can first take some "preparatory" actions whose results can be used later, once the agent has clarified details about θ. A reward learning agent would not be able to do this, as it would require three phases (acting, then learning, then acting again). We illustrate this with our original kitchen environment (Figure 1), in which R must bake a pie for H, but doesn't know what type of pie H would like: Apple, Blueberry, or Cherry. Each type has a weight specifying the reward for that pie. Assuming people tend to like apple pie the most and cherry pie the least, we have θ A ∼ Uniform[2, 4], θ B ∼ Uniform[1, 3], and θ C ∼ Uniform[0, 2]. We define the questions Q = {q A , q B , q C }, where q X means "What is the value of θ X ?", and thus the answer set is C = R. R can select ingredients to assemble the pie. Eventually, R must use "bake", which bakes the selected ingredients into a finished pie, resulting in reward that depends on what type of pie has been created. H initially starts outside the room, but will return at some prespecified time. r θ assigns a cost of 0.1 to asking a question if H is inside the room, and 3 otherwise. The horizon is 6 timesteps.

Assistance. Notice that, regardless of H's preferences, R will need to use flour to make pie dough. So, R always makes the pie dough first, before querying H about her preferences. Whether R then queries H about her preferences depends on how late H returns. If H arrives home before timestep 5, R will query her about her preferences and then make the appropriate pie as expected. However, if H will arrive later, then there will not be enough time to query her for her preferences and bake a pie. Instead, R bakes an apple pie, since its prior suggests that that's what H wants.
This behavior, where R takes actions (making dough) that are robustly good but waits on actions (adding the filling) whose reward will be clarified in the future, is closely related to conservative agency (Turner et al., 2020), a connection explored in more depth in Appendix D.

Reward learning. The assistance solution requires R to act (to make dough), then to learn preferences, and then to act again (to make pie). A reward learning agent can only have two phases, and so we see one of two suboptimal behaviors. First, R could stay in the learning phase until H returns home, then ask which pie she prefers, and then make the pie from scratch. Second, R could make an apple pie without asking H her preferences. (In this case there would be no learning phase.) Which of these happens depends on the particular method and hyperparameters used.

Interactive reward learning. Adding interactivity is not sufficient to get the correct behavior. Suppose we start with an action phase. The highest-reward plan under R's current belief over θ is to bake an apple pie, so that is what it will do, as long as the phase lasts long enough. Conversely, suppose we start with a learning phase. In this case, R does nothing until H returns, and then asks about her preferences. Once we switch to an action phase, it bakes the appropriate pie from scratch.

4.2. RELEVANCE AWARE ACTIVE LEARNING

Figure 2: The wormy-apples kitchen environment. H wants an apple pie, but R might discover worms in the apples, and have to dispose of them in either the trash or the compost bin.

Once we relax the two-phase restriction, R starts to further optimize whether and when it asks questions. In particular, since R may be uncertain about whether a question's answer will even be necessary, R will only ask questions once they become immediately relevant to the task at hand. In contrast, a reward learning agent would have to decide at the beginning of the episode (during the learning phase) whether or not to ask these questions, and so cannot evaluate how relevant they are. Consider for example a modification to the kitchen environment: R knows that H wants an apple pie, but when R picks up some apples, there is a 20% chance that it finds worms in some of the apples. R is unsure whether H wants her compost bin to have worms, and so does not know whether to dispose of the bad apples in the trash or the compost bin. Since this situation is relatively unlikely, ideally R would only clarify H's preferences when the situation arises.

Assistance. An assistive R only asks about wormy apples when it needs to dispose of one. R always starts by picking up apples. If the apples do not have worms, R immediately uses the apples to bake the pie. If some apples have worms and the cost of asking a question is sufficiently low, R elicits H's preferences and disposes of the apples appropriately. It then bakes the pie with the remaining apples. This behavior, in which questions are asked only if they are useful for constraining future behavior, has been shown previously using probabilistic recipe trees (PRTs) (Kamar et al., 2009), but to our knowledge has not been shown with optimization-based approaches.

Reward learning. A reward learning policy must have only two phases, and so would show one of two undesirable behaviors: either it always asks H where to dispose of wormy apples, or it never asks and instead guesses when it does encounter wormy apples.

Interactive reward learning. This has the same problem as in the previous section. If we start in the action phase and R picks up wormy apples, it will dispose of them in an arbitrary bin without asking H about her preferences, because it doesn't "know" that it will get the opportunity to do so. Alternatively, if we start with a learning phase, R will ask H where to dispose of wormy apples, even if R would never pick up any wormy apples. Note that more complex settings can have many more questions. Should R ask whether H would prefer to use seedless apples, should scientists ever invent them? Perhaps R should ask how H's pie preferences vary based on her emotional state? Asking about all possible situations is not scalable.

4.3. LEARNING FROM PHYSICAL ACTIONS

So far we have considered communicative assistance problems, in which H only provides feedback rather than acting to maximize reward herself. Allowing H to take physical actions enables a greater variety of potential behaviors. Most clearly, when R knows the reward (that is, P Θ puts all its support on a single θ), assistance games become equivalent to human-AI collaboration (Nikolaidis & Shah, 2013; Carroll et al., 2019; Dimitrakakis et al., 2017). With uncertain rewards, we see further interesting qualitative behaviors: R can learn just by observing how H acts in an environment, and then work with H to maximize reward, all within a single episode, as in shared autonomy with intent inference (Javdani et al., 2015; Brooks & Szafir, 2019) and other works that interpret human actions as communicative (Whitney et al., 2017). This can significantly reduce the burden on H in providing reward information to R (or, equivalently, reduce the cost incurred by R in asking questions of H). Some work has shown that in such situations, humans tend to be pedagogic: they knowingly take individually suboptimal actions in order to more effectively convey the goal to the agent (Ho et al., 2016; Hadfield-Menell et al., 2016). An assistive R that knows this can quickly learn what H wants, and help her accomplish her goals.


Figure 3: The cake-or-pie variant of the kitchen environment. H is equally likely to prefer cake or pie. Communication must take place through physical actions alone.

We illustrate this with a variant of our kitchen environment, shown in Figure 3. There are no longer questions and answers. Both H and R can move to an adjacent free space, and pick up and place the various objects. Only R may bake the dessert. R is uncertain whether H prefers cake or cherry pie. For both recipes, it is individually more efficient for H to pick up the dough first. However, we assume H is pedagogic and wants to quickly show R which recipe she wants. So, if she wants cake, she will pick up the chocolate first to signal to R that cake is the preferred dessert. It is not clear how to think about this from a reward learning perspective: there aren't any communicative human actions, since every action alters the state of the environment. In addition, there is no clear way to separate a given trajectory into two phases. This situation cannot be easily coerced into the reward learning paradigm. In contrast, an assistive R can handle this situation perfectly. It initially waits to see which ingredient H picks up first, and then quickly helps H by putting in the ingredients from its side of the environment and baking the dessert. It implicitly learns to make the cake when H picks up chocolate, and to make the pie when H picks up dough. This corresponds to pragmatic reasoning (Goodman & Frank, 2016): "H would have picked up the chocolate if she wanted cake, so the fact that she picked up the dough implies that she wants cherry pie." However, we emphasize that R is not explicitly programmed to reason in this manner: its policy is learned using deep reinforcement learning (Appendix C.3). Note that R is not limited to learning from H's physical actions: R can also use its own physical actions to "query" the human for information (Woodward et al., 2019; Sadigh et al., 2016).
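The pragmatic inference described above can be sketched as a simple Bayesian update, assuming an illustrative pedagogic human model; the specific probabilities below are invented for the example and are not the paper's learned model.

```python
# Sketch of pragmatic goal inference from H's first physical action,
# assuming a pedagogic H: if she wants cake she grabs the chocolate
# (the revealing ingredient), otherwise the dough she needs anyway.

def pedagogic_pi_H(action, goal):
    """P(first action | goal) for the pedagogic human (illustrative values)."""
    if goal == "cake":
        return {"chocolate": 0.9, "dough": 0.1}[action]
    return {"chocolate": 0.0, "dough": 1.0}[action]  # goal == "pie"

def infer_goal(action, prior=None):
    """Posterior over H's desired dessert after observing her first action."""
    prior = prior or {"cake": 0.5, "pie": 0.5}
    post = {g: p * pedagogic_pi_H(action, g) for g, p in prior.items()}
    z = sum(post.values())
    return {g: v / z for g, v in post.items()}
```

Observing "chocolate" makes R certain H wants cake, while observing "dough" makes pie highly likely precisely because a cake-wanting H would have grabbed the chocolate; the policy trained with DRL learns this mapping implicitly rather than via an explicit update like this one.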

5. LIMITATIONS AND FUTURE WORK

Computational complexity. The major limitation of assistance relative to reward learning is computational: assistance problems are significantly more complex to solve, since we treat the unknown reward θ as hidden state of a POMDP. We are hopeful that this can be addressed through deep reinforcement learning. An assistance problem is just like any other POMDP, except that there is one additional unobserved state variable θ and one additional observation a^H. This should not be a huge burden, since deep reinforcement learning has been demonstrated to scale to huge observation and action spaces (OpenAI, 2018; Vinyals et al., 2019). Another avenue for future work is to modify active reward learning algorithms to gain the benefits outlined in Section 4 while maintaining their computational efficiency.

Increased chance of incorrect inferences. In practice, assistive agents extract more information from H than reward learning agents do, and so misspecification of π^H is more costly. We do not see this as a major limitation: to the extent that this is a worry, we can design π^H so that the robot only makes inferences about human behavior in specific situations. For example, by making π^H independent of θ in a given state s, we ensure that the robot does not make any inferences about θ in that state.

Environment design. We have shown that, by including a hidden human goal, we can design environments in which optimal agent behavior is significantly more "helpful". One important direction for future work is to design larger, more realistic environments, in order to spur research into how best to solve them. We would be particularly excited to see a suite of assistance problems become a standard benchmark by which deep reinforcement learning algorithms are assessed.

5.1. LIMITATIONS OF ASSISTANCE AND REWARD LEARNING

While we believe that the assistance framework makes meaningful conceptual progress over reward learning, a number of challenges for reward learning remain unaddressed by assistance:

Human modeling. A major motivation for both paradigms is that reward specification is very difficult. However, we now need to specify a prior over reward functions and the human model π^H, and misspecification of these can still lead to bad results (Armstrong et al., 2020; Carey, 2018). While it should certainly be easier to specify a prior over θ with a "grain of truth" on the true reward θ* than to specify θ* directly, it is less clear that we can specify π^H well. One possibility is to add uncertainty over the human policy π^H. However, this can only go so far: information about θ must come from somewhere. If R is sufficiently uncertain about both θ and π^H, then it cannot learn about the reward at all (Armstrong & Mindermann, 2018). Thus, for good performance we need to model π^H. While imitation learning can lead to good results (Carroll et al., 2019), the best results will likely require insights from the broad range of fields that study human behavior.

Assumption that H knows θ. Both assistance games and reward learning make the assumption that H knows her reward exactly, but in practice human preferences change over time (Allais, 1979; Cyert & DeGroot, 1975; Shogren et al., 2000). We could model this as the human changing her subgoals (Michini & How, 2012; Park et al., 2020), adapting to the robot (Nikolaidis et al., 2017), or learning from experience (Chan et al., 2019).

Dependence on uncertainty. All of the behaviors of Section 4, as well as previously explored benefits such as off-switch corrigibility (Hadfield-Menell et al., 2017a), depend on R expecting to gain information about θ. However, R will eventually exhaust the available information about θ. If everything is perfectly specified, this is not a problem: R will have converged to the true θ*. However, in the case of misspecification, after convergence R is effectively certain in an incorrect θ, which reintroduces many of the problems we sought to avoid in the first place (Yudkowsky, year unknown).

6. CONCLUSION

While much recent work has focused on how we can build agents that learn what they should do from human feedback, there is not yet a consensus on how such agents should be built. In this paper, we contrasted the paradigms of reward learning and assistance. We showed that reward learning problems are equivalent to a special type of assistance problem, in which the human may only provide feedback at the beginning of the episode, and the agent may only act in the environment after the human has finished providing feedback. By relaxing these restrictions, we enable the agent to reason about how its actions in the environment can influence the process by which it solicits and learns from human feedback. This allows the agent to (1) choose questions based on their relevance, (2) create plans whose success depends on future feedback, and (3) learn from physical human actions in addition to communicative feedback.

A REWARD LEARNING AND ASSISTANCE FORMALISMS

A.1 RELATION BETWEEN NON-ACTIVE AND ACTIVE REWARD LEARNING

The key difference between non-active and active reward learning is that in the latter R may ask H questions in order to get more targeted feedback. This matters only when there is more than one question: with a single question there is no choice for R to make, so R cannot have any influence on the feedback that H provides. As a result, non-active reward learning is equivalent to active reward learning with a single question.

Proposition 6. Every non-active reward learning problem ⟨M\r, C, Θ, r_θ, P_Θ, π^H, k⟩ can be reduced to an active reward learning problem.

Proof. We construct the active reward learning problem as ⟨M\r, Q', C, Θ, r_θ, P_Θ, π'^H, k⟩, where Q' = {q_φ} for some dummy question q_φ, and π'^H(c | q, θ) = π^H(c | θ). Suppose the solution to the new problem is ⟨π^R_Q, f'⟩. Since f' is a solution, we have:

  f' = argmax_f E_{θ∼P_Θ, q_{0:k-1}∼π^R_Q, c_{0:k-1}∼π'^H(·|q_i,θ)} [ER(f(q_{0:k-1}, c_{0:k-1}))]
     = argmax_f E_{θ∼P_Θ, q_{0:k-1}=q_φ, c_{0:k-1}∼π'^H(·|q_φ,θ)} [ER(f(q_{0:k-1} = q_φ, c_{0:k-1}))]    (all q_i are q_φ)
     = argmax_f E_{θ∼P_Θ, c_{0:k-1}∼π^H(·|θ)} [ER(f(q_{0:k-1} = q_φ, c_{0:k-1}))].

Thus f(c_{0:k-1}) = f'(q_{0:k-1} = q_φ, c_{0:k-1}) is a maximizer of E_{θ∼P_Θ, c_{0:k-1}∼π^H(·|θ)} [ER(f(c_{0:k-1}))], making it a solution to our original problem.

Proposition 7. Every active reward learning problem ⟨M\r, Q, C, Θ, r_θ, P_Θ, π^H, k⟩ with |Q| = 1 can be reduced to a non-active reward learning problem.

Proof. Let the sole question in Q be q_φ. We construct the non-active reward learning problem as ⟨M\r, C, Θ, r_θ, P_Θ, π'^H, k⟩, with π'^H(c | θ) = π^H(c | q_φ, θ). Suppose the solution to the new problem is f'. Then we can construct a solution to the original problem as follows. First, note that π^R_Q must satisfy π^R_Q(q_i | q_{0:i-1}, c_{0:i-1}) = 1[q_i = q_φ], since q_φ is the only possible question. Then, by inverting the steps in the proof of Proposition 6, we can see that f' is a maximizer of E_{θ∼P_Θ, q_{0:k-1}∼π^R_Q, c_{0:k-1}∼π^H(·|q_i,θ)} [ER(f(q_{0:k-1}, c_{0:k-1}))]. Thus, by defining f(q_{0:k-1}, c_{0:k-1}) = f'(c_{0:k-1}), we get a maximizer of the original objective, making ⟨π^R_Q, f⟩ a solution to the original problem.
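The constructions in Propositions 6 and 7 amount to adding or fixing a single dummy question. A minimal sketch, with hypothetical function names:

```python
# Sketch of the reductions in Propositions 6 and 7: a non-active reward
# learning problem is the same as an active one whose question set contains
# a single dummy question q_phi. All names here are illustrative.

DUMMY_QUESTION = "q_phi"

def lift_to_active(pi_H_nonactive):
    """Proposition 6: feedback ignores the (unique) dummy question."""
    def pi_H_active(question, theta):
        assert question == DUMMY_QUESTION  # there is no other choice for R
        return pi_H_nonactive(theta)
    return pi_H_active

def drop_to_nonactive(pi_H_active):
    """Proposition 7: fix the sole question to q_phi."""
    def pi_H_nonactive(theta):
        return pi_H_active(DUMMY_QUESTION, theta)
    return pi_H_nonactive

# Round-tripping leaves the feedback unchanged, reflecting the equivalence.
pi = lambda theta: f"comparison-under-{theta}"
roundtrip = drop_to_nonactive(lift_to_active(pi))
assert roundtrip("theta_star") == pi("theta_star")
```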

A.2 REDUCING ASSISTANCE PROBLEMS TO POMDPS

Suppose that we have an assistance problem ⟨M, π^H⟩ with M = ⟨S, {A^H, A^R}, {Ω^H, Ω^R}, {O^H, O^R}, T, P_S, γ, Θ, r_θ, P_Θ⟩. Then we can derive a single-player POMDP for the robot, M' = ⟨S', A^R, Ω', O', T', r', P'_0, γ⟩, by embedding the human reward parameter into the state. We must also include the human's previous action a^H in the state, so that the robot can observe it, and so that the reward can be computed. To allow for arbitrary (non-Markovian) human policies π^H, we could encode the full history in the state, in order to embed π^H into the transition function T'. However, in our experiments we only consider human policies that are in fact Markovian. We make the same assumption here, giving a policy π^H(a^H_t | o^H_t, a^R_t, θ) that depends on the current observation and previous robot action. The transformation M → M' is given as follows:

  State space: S' = S × A^H × Θ
  Observation space: Ω' = Ω^R × A^H
  Observation function: O'((o^R, a^H_1) | (s, a^H_2, θ)) = 1[a^H_1 = a^H_2] · O^R(o^R | s)
  Transition function: T'((s_2, a^H_1, θ_2) | (s_1, a^H_0, θ_1), a^R) = T(s_2 | s_1, a^H_1, a^R) · 1[θ_2 = θ_1] · Σ_{o^H ∈ Ω^H} O^H(o^H | s_1) · π^H(a^H_1 | o^H, a^R, θ_1)
  Reward function: r'((s_1, a^H_0, θ), a^R, (s_2, a^H_1, θ)) = r_θ(s_1, a^H_1, a^R, s_2)
  Initial state distribution: P'_0((s, a^H, θ)) = P_S(s) · P_Θ(θ) · 1[a^H = a^H_init], where a^H_init is arbitrary

In the case where the original assistance problem is fully observable, the resulting POMDP is an instance of a Bayes-Adaptive MDP (Martin, 1967; Duff, 2002). Any robot policy π^R can be translated naturally from the assistance problem M to an identical policy on M'. Note that in either case, policies are mappings from (Ω^R × A^H × A^R)* × Ω^R to Δ(A^R). This transformation preserves optimal agent policies:

Proposition 8. A policy π^R is a solution of ⟨M, π^H⟩ if and only if it is a solution of M'.

Proof.
Recall that an optimal policy π* in the POMDP M' is one that maximizes the expected value:

  EV(π) = E_{s'_0∼P'_0, τ'∼⟨s'_0, π⟩} [Σ_{t=0}^∞ γ^t r'(s'_t, a_t, s'_{t+1})] = E_{s'_0∼P'_0, τ'∼⟨s'_0, π⟩} [Σ_{t=0}^∞ γ^t r_θ(s_t, a^H_t, a_t, s_{t+1})],

where the trajectories τ' are sequences of state-action pairs drawn from the distribution induced by the policy, starting from state s'_0. Similarly, an optimal robot policy π^R* in the assistance problem ⟨M, π^H⟩ is one that maximizes its expected reward:

  ER(π^R) = E_{s_0∼P_S, θ∼P_Θ, τ∼⟨s_0, θ, π^R⟩} [Σ_{t=0}^∞ γ^t r_θ(s_t, a^H_t, a^R_t, s_{t+1})].

To show that the optimal policies coincide, it suffices to show that for any π, ER(π) (in M) is equal to EV(π) (in M'). To do this, we show that π induces the "same" distribution over trajectories. For mathematical convenience, we abuse notation and consider trajectories of the form ⟨τ; θ⟩ ∈ (S × A^H × A^R)* × Θ; it is easy to translate trajectories of this form to trajectories in either M or M'. We show by induction on sequence length that the sequence ⟨τ; θ⟩ has the same probability when the robot follows π in both M and M'. First, consider length-1 sequences ⟨τ; θ⟩ = [(s, a^R, a^H)]; θ: as noted in the footnote, their distribution is the same under M and M'. For the inductive step, the next robot action is drawn from π^R(· | o^R_t, τ^R_{t-1}) in both M and M', and by construction of T' the joint probability of the next state and human action also coincides.
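The reduction above can be sketched as a generic environment wrapper. The sampling-style interfaces below (functions returning samples rather than distributions) are an assumption for illustration, as is the tiny instantiation at the end:

```python
# Sketch of the A.2 reduction: the robot's POMDP hides theta (and the
# previous human action) inside the state, and appends a^H to the robot's
# observation. The sampling interfaces are illustrative assumptions.
import random

def make_robot_pomdp(P_S, P_Theta, T, O_R, O_H, pi_H, r_theta, a_H_init):
    def reset():
        # S' = S x A^H x Theta; theta is sampled once and then frozen.
        return (P_S(), a_H_init, P_Theta())

    def step(state, a_R):
        s, _, theta = state
        # H acts from her own observation of the current state ...
        a_H = pi_H(O_H(s), a_R, theta)
        # ... then the environment transitions; theta is unchanged,
        # mirroring the 1[theta_2 = theta_1] factor in T'.
        s2 = T(s, a_H, a_R)
        reward = r_theta(theta, s, a_H, a_R, s2)
        # Omega' = Omega^R x A^H: the robot observes a^H directly.
        obs = (O_R(s2), a_H)
        return (s2, a_H, theta), obs, reward

    return reset, step

# Tiny instantiation: a single state, two candidate rewards, and an H who
# simply announces theta through her action.
reset, step = make_robot_pomdp(
    P_S=lambda: "s0", P_Theta=lambda: random.choice(["cake", "pie"]),
    T=lambda s, a_H, a_R: s, O_R=lambda s: s, O_H=lambda s: s,
    pi_H=lambda o_H, a_R, theta: theta,
    r_theta=lambda theta, s, a_H, a_R, s2: 1.0 if a_R == theta else 0.0,
    a_H_init="noop")
state = reset()
state, obs, reward = step(state, "cake")
```

From the robot's perspective this is an ordinary POMDP, so any POMDP solver (or deep RL on the observation history) can in principle be applied.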

A.3 OPTIMAL STRATEGY PAIRS AS POLICY-CONDITIONED BELIEF

We use the term policy-conditioned belief to refer to a distribution over human policies that depends on the chosen robot policy. We use policy-conditioned beliefs, as opposed to a simple unconditional distribution over human policies, because they allow us to model a wide range of situations, including situations with prior coordination, or in which humans adapt to the robot's policy as a result of prior interactions. Moreover, this presents a unifying framework with prior work on assistance games (Hadfield-Menell et al., 2016). In fact, finding an optimal strategy pair for the assistance game can be thought of as finding the policy which is best when the human adapts optimally, as formalized below:

Proposition 9. Let M = ⟨S, {A^H, A^R}, {Ω^H, Ω^R}, {O^H, O^R}, T, P_S, γ, Θ, r_θ, P_Θ⟩ be an assistance game. Let B(π^R)(π^H) ∝ 1[EJR(π^H, π^R) = max_{π'^H ∈ Π^H} EJR(π'^H, π^R)] be an associated policy-conditioned belief. Let π^R* be the solution to ⟨M, B⟩. Then ⟨B(π^R*), π^R*⟩ is an optimal strategy pair.

Proof. Let ⟨π^H, π^R⟩ be an arbitrary strategy pair. Then EJR(π^H, π^R) ≤ EJR(B(π^R), π^R) by the definition of B, and EJR(B(π^R), π^R) ≤ EJR(B(π^R*), π^R*) by the definition of π^R*. Thus EJR(π^H, π^R) ≤ EJR(B(π^R*), π^R*). Since ⟨π^H, π^R⟩ was assumed to be arbitrary, ⟨B(π^R*), π^R*⟩ is an optimal strategy pair.

B EQUIVALENCE OF RESTRICTED ASSISTANCE AND EXISTING ALGORITHMS

B.1 EQUIVALENCE OF TWO PHASE ASSISTANCE AND REWARD LEARNING

Here we prove the results in Section 3 showing that two-phase communicative assistance problems and reward learning problems are equivalent. We first prove Proposition 4, and then use it to prove the others.

Proposition 4. Every two-phase communicative assistance problem ⟨M, π^H, a^R_noop⟩ can be reduced to an equivalent active reward learning problem.

Proof. Let M = ⟨S, {A^H, A^R}, {Ω^H, Ω^R}, {O^H, O^R}, T, P_S, γ, Θ, r_θ, P_Θ⟩ be the assistance game, and let the assistance problem's action phase start at t_act. Let a^H_φ ∈ A^H be some arbitrary human action and o^H_φ ∈ Ω^H be some arbitrary human observation. We construct the new active reward learning problem ⟨M', Q', C', Θ, r'_θ, P_Θ, π'^H, k⟩ as follows:

  Questions: Q' = {a^R ∈ A^R : a^R is communicative}
  Answers: C' = A^H
  POMDP: M' = ⟨S, A', Ω^R, O^R, T', P'_0, γ⟩
  Physical actions: A' = A^R \ Q'
  Transition function: T'(s' | s, a^R) = T(s' | s, a^H_φ, a^R)
  Number of questions: k = t_act
  Initial state distribution: P'_0(s) = Σ_{s_{0:k} ∈ S} P_M(s_{0:k}, s_{k+1} = s | a^R_{0:k} = a^R_noop, a^H_{0:k} = a^H_φ)
  Reward function: r'_θ(s, a^R, s') = r_θ(s, a^H_φ, a^R, s')
  Human decision function: π'^H(c | q, θ) = π^H(c | o^H_φ, q, θ)

Note that it is fine to use a^H_φ in T' and r'_θ, and to use o^H_φ in π'^H, even though they were chosen arbitrarily: since the assistance problem is communicative, the result does not depend on the choice. The P_M term in the initial state distribution denotes the probability of a trajectory under M, and can be computed as P_M(s_{0:T+1} | a^R_{0:T}, a^H_{0:T}) = P_S(s_0) Π_{t=0}^T T(s_{t+1} | s_t, a^H_t, a^R_t).

Given some pair ⟨π^R_Q, f⟩ for the active reward learning problem, we construct a policy for the assistance problem as:

  π^R(a^R_t | o^R_t, τ^R_{t-1}) = π^R_Q(a^R_t | a^R_{0:t-1}, a^H_{0:t-1})    if t < k and a^R_{0:t} ∈ Q'
                                = f(a^R_{0:k-1}, a^H_{0:k-1})(a^R_t | o^R_{k:t}, a^R_{k:t-1})    if t ≥ k, a^R_{0:k-1} ∈ Q', and a^R_{k:t} ∈ A'
                                = 0    otherwise.

We show that there must exist a solution to the assistance problem that is the analogous policy to some pair. Assume toward contradiction that this is not the case, so that there is a solution π^R* that is not the analogous policy of any pair. Then we have a few cases:

1. π^R* assigns positive probability to a^R_i = a ∉ Q' for some i < k. This contradicts the two-phase assumption.
2. π^R* assigns positive probability to a^R_i = q ∈ Q' for some i ≥ k. This also contradicts the two-phase assumption.
3. π^R*(a^R_t | o^R_t, τ^R_{t-1}) depends on the value of o^R_i for some i < k. Since neither a^H_{0:k-1} nor a^R_{0:k-1} can affect the state or reward (they are communicative), the distribution over o^R_{0:k-1} is fixed and independent of π^R, so there must be some other π^R that is independent of o^R_{0:k-1} and does at least as well. That π^R would be the analogous policy to some pair, giving a contradiction.

Now, suppose we have some pair ⟨π^R_Q, f⟩, and let its analogous policy be π^R.
Then we have:

  E_{θ∼P_Θ, q_{0:k-1}∼π^R_Q, c_{0:k-1}∼π'^H} [ER(f(q_{0:k-1}, c_{0:k-1}))]
    = E_{θ∼P_Θ} E_{q_{0:k-1}∼π^R, c_{0:k-1}∼π'^H} [ER(f(q_{0:k-1}, c_{0:k-1}))]
    = E_{θ∼P_Θ} E_{q_{0:k-1}∼π^R, c_{0:k-1}∼π'^H} E_{s_0∼P'_0, a^R_t∼f(q_{0:k-1}, c_{0:k-1}), s_{t+1}∼T'(·|s_t, a^R_t)} [Σ_{t=0}^∞ γ^t r'_θ(s_t, a^R_t, s_{t+1})]
    = E_{θ∼P_Θ} E_{q_{0:k-1}∼π^R, c_{0:k-1}∼π'^H} E_{s_k∼P'_0, a^R_t∼π^R(·| ⟨c_{0:k-1}, o_{k:t}⟩, ⟨q_{0:k-1}, a_{k:t-1}⟩), s_{t+1}∼T'(·|s_t, a^R_t)} [(1/γ^k) Σ_{t=k}^∞ γ^t r'_θ(s_t, a^R_t, s_{t+1})]
    = E_{θ∼P_Θ} E_{q_{0:k-1}∼π^R, c_{0:k-1}∼π'^H} E_{s_k∼P'_0, a^R_t∼π^R(·| ⟨c_{0:k-1}, o_{k:t}⟩, ⟨q_{0:k-1}, a_{k:t-1}⟩), s_{t+1}∼T'(·|s_t, a^R_t)} [(1/γ^k) Σ_{t=k}^∞ γ^t r_θ(s_t, a^H_φ, a^R_t, s_{t+1})].

However, since all the actions in the first phase are communicative and thus do not impact state or reward, the first k timesteps of the two-phase assistance problem have constant reward in expectation. Let C = E_{s_{0:k}} [Σ_{t=0}^{k-1} γ^t r_θ(s_t, a^H_φ, a^R_noop, s_{t+1})]. This gives us:

  E_{θ∼P_Θ, q_{0:k-1}∼π^R_Q, c_{0:k-1}∼π'^H} [ER(f(q_{0:k-1}, c_{0:k-1}))]
    = E_{s_0∼P_S, θ∼P_Θ, τ∼⟨s_0, θ, π^H, π^R⟩} [(1/γ^k) Σ_{t=0}^∞ γ^t r_θ(s_t, a^H_t, a^R_t, s_{t+1})] − (1/γ^k) C
    = (1/γ^k) (ER(π^R) − C).

Under review as a conference paper at ICLR 2021

Thus, if ⟨π^R_Q, f⟩ is a solution to the active reward learning problem, then π^R is a solution of the two-phase communicative assistance problem.

Corollary 5. If a two-phase communicative assistance problem ⟨M, π^H, a^R_noop⟩ has exactly one communicative robot action, it can be reduced to an equivalent non-active reward learning problem.

Proof. Apply Proposition 4 followed by Proposition 7. (Note that the construction from Proposition 4 does lead to an active reward learning problem with a single question, meeting the precondition for Proposition 7.)

Proposition 2. Every active reward learning problem P = ⟨M, Q, C, Θ, r_θ, P_Θ, π^H, k⟩ can be reduced to an equivalent two-phase communicative assistance problem P' = ⟨M', π'^H⟩.

Proof. Let M = ⟨S, A, Ω, O, T, P_0, γ⟩. Let q_0 ∈ Q be some question and c_0 ∈ C be some (unrelated) choice.
Let N be a set of fresh states {n_0, ..., n_{k-1}}; we will use these to count the number of questions asked so far. Then, we construct the new two-phase communicative assistance problem P' = ⟨M', π'^H, a^R_noop⟩ as follows:

  Assistance game: M' = ⟨S', {C, A^R}, {Ω^H, Ω^R}, {O^H, O^R}, T', P'_S, γ, Θ, r'_θ, P_Θ⟩
  State space: S' = S ∪ N
  Initial state distribution: P'_S(ŝ) = 1[ŝ = n_0]
  Robot actions: A^R = A ∪ Q
  H's observation space: Ω^H = S'
  R's observation space: Ω^R = Ω ∪ N
  H's observation function: O^H(o^H | ŝ) = 1[o^H = ŝ]
  R's observation function: O^R(o^R | ŝ) = 1[o^R = ŝ] if ŝ ∈ N; O(o^R | ŝ) otherwise
  Transition function: T'(ŝ' | ŝ, a^H, a^R) = P_0(ŝ') if ŝ = n_{k-1}; 1[ŝ' = n_{i+1}] if ŝ = n_i with i < k−1; T(ŝ' | ŝ, a^R) if ŝ ∈ S and a^R ∈ A; 1[ŝ' = ŝ] otherwise
  Reward function: r'_θ(ŝ, a^H, a^R, ŝ') = −∞ if ŝ ∈ N and a^R ∉ Q; −∞ if ŝ ∈ S and a^R ∈ Q; 0 if ŝ ∈ N and a^R ∈ Q; r_θ(ŝ, a^R, ŝ') otherwise
  Human policy: π'^H(a^H | o^H, a^R, θ) = π^H(a^H | a^R, θ) if a^R ∈ Q; 1[a^H = c_0] otherwise
  Distinguished noop action: a^R_noop = q_0

Technically r'_θ should not be allowed to return −∞. However, since S and A are finite, r_θ is bounded, and so there exists some large finite negative number that is functionally equivalent to −∞ that we could use instead. Looking at the definitions, we can see that T' and r'_θ are independent of a^H, and π'^H is independent of o^H, making this a communicative assistance problem. By inspection, every q ∈ Q is a communicative robot action. Any a^R ∉ Q must not be a communicative action, because the reward r'_θ differs between a^R and q_0. Thus, the communicative robot actions are Q and the physical robot actions are A. Note that by construction of P'_S and T', we must have s_i = n_i for i ∈ {0, 1, ..., k−1}, after which s_k is sampled from P_0 and all s_t ∈ S for t ≥ k. Given this, by inspecting r'_θ, we can see that an optimal policy must have a^R_{0:k-1} ∈ Q and a^R_{k:} ∉ Q to avoid the −∞ rewards. Since a^R_{k:} ∉ Q, we have a^H_{k:} = c_0.
Thus, setting a^H_noop = c_0, the assistance problem is two-phase with actions starting at t_act = k, as required.

Call a policy π^R for the assistance problem reasonable if it never assigns positive probability to a^R ∈ A when t < k or to a^R ∈ Q when t ≥ k. Then, for any reasonable policy π^R we can construct an analogous pair ⟨π^R_Q, f⟩ for the original problem P as follows:

  π^R_Q(q_i | q_{0:i-1}, c_{0:i-1}) = π^R(q_i | o^R_{0:i-1} = n_{0:i-1}, a^R_{0:i-1} = q_{0:i-1}, a^H_{0:i-1} = c_{0:i-1})
  f(q_{0:k-1}, c_{0:k-1})(a_t | o_{0:t}, a_{0:t-1}) = π^R(a_t | o^R_{0:t+k}, a^R_{0:t+k-1}, a^H_{0:t+k-1}),

where for the second equation we set o^R_{0:k-1} = n_{0:k-1}, a^R_{0:k-1} = q_{0:k-1}, a^H_{0:k-1} = c_{0:k-1}, o^R_{k:t+k} = o_{0:t}, a^R_{k:t+k-1} = a_{0:t-1}, and a^H_{k:t+k-1} = a^H_noop. Note that this is a bijective mapping.

Consider some such policy π^R and its analogous pair ⟨π^R_Q, f⟩. By construction of T', the first k states in any trajectory are n_{0:k-1} and the next state is distributed as P_0(·). By our assumption on π^R, the first k robot actions must be selected from Q and the remaining robot actions must be selected from A, which also implies (based on π'^H) that the remaining human actions must be c_0. Finally, looking at r'_θ, we can see that the first k timesteps get 0 reward. Thus:

  ER_{P'}(π^R) = E_{s_0∼P'_S, θ∼P_Θ, τ∼⟨s_0, θ, π'^H, π^R⟩} [Σ_{t=0}^∞ γ^t r'_θ(s_t, a^H_t, a^R_t, s_{t+1})]
    = E_{θ∼P_Θ, a^R_{0:k-1}∼π^R, a^H_{0:k-1}∼π'^H, s_k∼P_0, τ_{k:}∼⟨s_k, θ, π'^H, π^R⟩} [Σ_{t=k}^∞ γ^t r'_θ(s_t, a^H_t, a^R_t, s_{t+1})]
    = E_{θ∼P_Θ, q_{0:k-1}∼π^R_Q, c_{0:k-1}∼π^H, s_0∼P_0, τ∼⟨s_0, θ, f(q_{0:k-1}, c_{0:k-1})⟩} [γ^k Σ_{t=0}^∞ γ^t r_θ(s_t, a_t, s_{t+1})]
    = γ^k E_{θ∼P_Θ, q_{0:k-1}∼π^R_Q, c_{0:k-1}∼π^H} [ER(f(q_{0:k-1}, c_{0:k-1}))],

which is the objective of the reward learning problem scaled by γ^k.
Since we have a bijection between reasonable policies in P' and pairs in P that preserves the objectives (up to a positive constant), given a solution π^R* to P' (which must be reasonable), its analogous pair ⟨π^R_Q, f⟩ must be a solution to P.

Corollary 3. Every non-active reward learning problem ⟨M, C, Θ, r_θ, P_Θ, π^H, k⟩ can be reduced to an equivalent two-phase communicative assistance problem ⟨M', π'^H⟩.

Proof. Apply Proposition 6 followed by Proposition 2.

B.2 ASSISTANCE WITH NO REWARD INFORMATION

In a communicative assistance problem, once there is no information to be gained about θ, the best thing for R to do is simply to maximize expected reward according to its prior. We show this in the particular case where π^H is independent of θ and thus cannot communicate any information about θ.

Proposition 10. A communicative assistance problem ⟨M, π^H⟩ where π^H is independent of θ can be reduced to a POMDP M' with the same state space.

Proof. Given M = ⟨S, {A^H, A^R}, {Ω^H, Ω^R}, {O^H, O^R}, T, P_S, γ, Θ, r_θ, P_Θ⟩, we define a new POMDP M' = ⟨S, A^R, Ω^R, O^R, T', r', P_S, γ⟩, with T'(s' | s, a^R) = T(s' | s, a^H_φ, a^R) and r'(s, a^R, s') = E_{θ∼P_Θ} [r_θ(s, a^H_φ, a^R, s')]. Here, a^H_φ is some action in A^H; it does not matter which action is chosen, since in a communicative assistance problem human actions have no impact on T and r. Expanding the definition of expected reward for the assistance problem, we get:

  ER(π^R) = E_{s_0∼P_S, θ∼P_Θ, τ∼⟨s_0, θ, π^R⟩} [Σ_{t=0}^∞ γ^t r_θ(s_t, a^H_t, a^R_t, s_{t+1})]
          = E_{s_0∼P_S} E_{θ∼P_Θ} E_{τ∼⟨s_0, θ, π^R⟩} [Σ_{t=0}^∞ γ^t r_θ(s_t, a^H_t, a^R_t, s_{t+1})].

Because π^H(a^H | o^H, a^R, θ) is independent of θ, the robot gains no information about θ, and thus the trajectory distribution is also independent of θ. This means that we have:

  ER(π^R) = E_{s_0∼P_S} E_{θ∼P_Θ} E_{τ∼⟨s_0, π^R⟩} [Σ_{t=0}^∞ γ^t r_θ(s_t, a^H_t, a^R_t, s_{t+1})].

Let r_max = max_{s, a^H, a^R, s'} |r_θ(s, a^H, a^R, s')| (which exists since S, A^H, and A^R are finite). Then Σ_{t=0}^∞ γ^t |r_θ(s_t, a^H_t, a^R_t, s_{t+1})| ≤ Σ_{t=0}^∞ γ^t r_max = r_max / (1 − γ) < ∞, so we can apply Fubini's theorem to swap the expectations and sums. Applying Fubini's theorem twice gives us:

  ER(π^R) = E_{s_0∼P_S} E_{τ∼⟨s_0, π^R⟩} E_{θ∼P_Θ} [Σ_{t=0}^∞ γ^t r_θ(s_t, a^H_t, a^R_t, s_{t+1})]
          = E_{s_0∼P_S} E_{τ∼⟨s_0, π^R⟩} [Σ_{t=0}^∞ γ^t E_{θ∼P_Θ} [r_θ(s_t, a^H_t, a^R_t, s_{t+1})]]
          = E_{s_0∼P_S} E_{τ∼⟨s_0, π^R⟩} [Σ_{t=0}^∞ γ^t r'(s_t, a^R_t, s_{t+1})].
In addition, the trajectories are independent of π^H, since the assistance problem is communicative, and so for a given policy π^R the trajectory distributions for M and M' coincide; thus the expected rewards for π^R also coincide, and the optimal policies must coincide.
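The construction in Proposition 10 amounts to pre-averaging the reward over the prior. A minimal sketch, where the tiny reward table and action names are illustrative:

```python
# Sketch of the Proposition 10 construction: when pi_H carries no
# information about theta, R can pre-average the reward over its prior
# and solve an ordinary POMDP. Names and rewards are illustrative.

def marginalize_reward(P_Theta, r_theta, a_H_phi):
    """r'(s, a_R, s') = E_{theta ~ P_Theta} [ r_theta(s, a_H_phi, a_R, s') ]."""
    def r_prime(s, a_R, s2):
        return sum(p * r_theta(theta, s, a_H_phi, a_R, s2)
                   for theta, p in P_Theta.items())
    return r_prime

P_Theta = {"cake": 0.5, "pie": 0.5}
# Baking the preferred dessert is worth 2; the wrong one costs 1.
r_theta = lambda theta, s, a_H, a_R, s2: 2.0 if a_R == f"bake_{theta}" else -1.0
r = marginalize_reward(P_Theta, r_theta, a_H_phi="noop")

# Under the uniform prior, baking either dessert has the same expected
# reward, so R has no informational reason to prefer one over the other:
assert r("s", "bake_cake", "s") == 0.5 * 2.0 + 0.5 * (-1.0)  # = 0.5
```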

C EXPERIMENTAL DETAILS

C.1 PLANS CONDITIONAL ON FUTURE FEEDBACK

In the environment described in Section 4.1, R needs to bake either apple or blueberry pie (cherry is never preferred over apple) within 6 timesteps, and may query H about her preferences over pies. Making the pie takes 3 timesteps: first R must make flour into dough, then it must add one of the fillings, and finally it must bake the pie. Baking the correct pie gives +2 reward, while baking the wrong one gives a penalty of -1. In addition, H might be away for several timesteps at the start of the episode. Querying H costs 0.1 when she is present and 3 when she is away. The optimal policy for this environment depends on whether H will be home early enough for R to query her and bake the desired pie by the end of the episode. R should always quickly make dough, as that is always required. If H returns home on timestep 4 or earlier, R should wait for her to get home, ask her about her preferences, and then finish the desired pie. If H returns home later, R should make its best guess about what she wants and ensure that there is a pie ready for her to eat: querying H when she is away is too costly, and there is not enough time to wait for H, query her, put in the right filling, and bake the pie.

In the wormy-apple environment described in Section 4.2, the robot has to bring the human some apples in order to make a pie, but there is a 20% chance that the apples have worms in them, and the robot does not yet know how to dispose of soiled apples. The robot gets 2 reward for making an apple pie (regardless of how it disposed of any wormy apples), and gets -2 reward if it disposes of the apples in the wrong container. Additionally, asking a question incurs a cost of 0.1. We solve this environment with exact value iteration. If the environment is two-phase, then with a lower discount rate (λ = 0.9), R's policy never asks questions and instead simply tries to make the apple pie, guessing which bin to dispose of wormy apples in if it encounters any.
Intuitively, since the two-phase policy would have to ask the question at the very beginning, it would always incur the cost of 0.1 and delay the pie by a timestep, resulting in 10% less discounted value; asking is only useful when there turn out to be worms and its guess about which bin to dispose of them in would have been incorrect, which happens only 10% of the time. This ultimately is not worthwhile, and the no-question policy achieves an expected undiscounted reward of 1.8. Removing the two-phase restriction causes R to ask questions mid-trajectory, even with this low discount; the resulting policy achieves the maximal expected undiscounted reward of 1.98. With a higher discount rate of λ = 0.99, the two-phase policy will always ask about which bin to dispose of wormy apples in, achieving 1.9 expected undiscounted reward. This is still less than the policy without the two-phase restriction, which continues to get undiscounted reward 1.98 because it avoids asking a question 80% of the time, and so incurs the cost of asking a question less often.
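The expected undiscounted returns quoted above can be checked with a few lines of arithmetic, using the probabilities and payoffs stated in the text:

```python
# Checking the quoted expected undiscounted returns for the wormy-apple
# environment: worms occur 20% of the time, a wrong-bin guess (half the
# time worms appear) costs 2, pie is worth 2, and a question costs 0.1.

P_WORMS, PIE, WRONG_BIN, QUESTION = 0.2, 2.0, 2.0, 0.1

# Never ask, guess the bin (wrong half the time worms appear):
never_ask = PIE - P_WORMS * 0.5 * WRONG_BIN
# Always ask up front (the two-phase policy with the higher discount):
always_ask = PIE - QUESTION
# Ask mid-trajectory, only when worms actually appear:
ask_if_needed = PIE - P_WORMS * QUESTION

print(never_ask, always_ask, ask_if_needed)  # approximately 1.8, 1.9, 1.98
```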

C.3 LEARNING FROM PHYSICAL ACTIONS: CAKE-OR-PIE EXPERIMENT

In the environment described in Section 4.3, H wants a dessert, but R is unsure whether H prefers cake or pie. Preparing the more desired recipe provides a base value of V = 10, and the less desired recipe provides a base value of V = 1. Since H doesn't want the preparation to take too long, the actual reward when a dessert is made is r_t = V · f(t), with f(t) = 1 − (t/N)^4 and episode horizon N = 20. The experiments use the pedagogic H, who picks up the chocolate first if she wants cake, which allows R to identify the desired recipe early on; this is in contrast with the non-pedagogic H, who does not account for R's beliefs and always goes for the dough first. With the pedagogic H, the optimal R does not move until H picks up or skips the dough: if H skips the dough, this implies the recipe is cake, and R takes the sugar and then the cherries; otherwise it goes directly for the cherries. With the non-pedagogic H, the optimal R goes for the cherries first (since they are a common ingredient), and only then checks whether H went for the chocolate or not, and has to go all the way back to grab the sugar if H got the chocolate. We train R with Deep Q-Networks (DQN; Mnih et al., 2013); we ran 6 seeds for 5M timesteps with a learning rate of 10^-4; results are shown in Figure 4.
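As a quick sanity check of the reward schedule, r_t = V · f(t) with f(t) = 1 − (t/N)^4 and N = 20 gives:

```python
# The cake-or-pie reward schedule: the quartic penalty barely matters
# early in the episode but drives the reward to zero at the horizon.

N = 20  # episode horizon

def dessert_reward(V, t):
    # r_t = V * f(t), with f(t) = 1 - (t / N)**4 penalizing late completion.
    return V * (1 - (t / N) ** 4)

# Finishing the preferred dessert (V = 10) early loses almost nothing,
# while finishing exactly at the horizon is worth nothing:
print(dessert_reward(10, 5))   # 9.9609375
print(dessert_reward(10, 20))  # 0.0
```

This shape rewards R for resolving its uncertainty quickly: waiting a few steps to observe H is cheap, but dithering until late in the episode is not.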

D OPTION VALUE PRESERVATION

In Section 4.1, we showed that R takes actions that are robustly good given its uncertainty over θ, but waits on actions whose reward will be clarified by future information about θ. Effectively, R is preserving its option value: it ensures that it remains capable of achieving any of the plausible reward functions it is uncertain over. A related notion is that of conservative agency (Turner et al., 2020), which likewise aims to preserve an agent's ability to optimize a wide variety of reward functions. This is achieved via attainable utility preservation (AUP). Given an agent optimizing a reward r_spec and a distribution over auxiliary reward functions r_aux, the AUP agent instead optimizes a penalized reward (reproduced below), where the hyperparameter λ determines how much to penalize an action for destroying option value, and a_φ is an action that corresponds to R "doing nothing".

However, the existing AUP penalty is applied to the reward, which means it penalizes any action that is part of a long-term plan that destroys option value, even if the action itself does not destroy option value. For example, in the original kitchen environment of Figure 1 with a sufficiently high λ, any trajectory that ends with baking a pie destroys option value and so would have negative reward. As a result, there is no incentive to make dough: the only reason to make dough is to eventually make a pie, but we have established that the value of making a pie is negative. What we need is to penalize an action only when it is going to immediately destroy option value. This can be done by applying the penalty during action selection, rather than directly to the reward (again reproduced below). After this modification, the agent will correctly make dough, and stop since it does not know what filling to use.

In an assistance problem, R will only preserve option value if it expects to get information that will later resolve its uncertainty: otherwise, it might as well get what reward it can given its uncertainty.
Thus, we might expect to recover existing notions of option value preservation in the case where the agent is initially uncertain over θ but will soon learn the true θ. Concretely, consider a fully observable communicative assistance problem in which the human will reveal θ on her next action. In that case, R's chosen action a gets immediate reward r(s, a) = E_θ[r_θ(s, a)] and future reward E_{θ∼P_Θ, s'∼T(·|s, a)}[V_θ(s')], where V_θ(s) denotes the value of the optimal policy when the reward is known to be r_θ and the initial state is s. Thus, the agent should choose actions that maximize the sum of these two terms. This bears many resemblances to the AUP policy, once we set the distribution over auxiliary rewards to be the distribution over r_θ, with r_spec = r and λ = γ. Nonetheless, there are significant differences, primarily because AUP was designed for the case where r_spec and r_aux can be arbitrarily different, which is not the case for us. In particular, with AUP the agent is penalized for any loss in r_aux from taking the chosen action a relative to doing nothing, while in the assistance problem, the agent is penalized for any loss in r_θ from acting according to r relative to what could be achieved if R knew the true reward. It is intriguing that both of these methods lead to behavior that we would characterize as "preserving option value".
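The modified, action-selection-time AUP penalty discussed above can be sketched as follows; the tiny Q-tables, action names, and λ value are illustrative, not taken from Turner et al. (2020):

```python
# Sketch of AUP-style action selection, with the penalty applied when
# choosing an action rather than added to the reward. The Q-tables and
# lambda value below are illustrative assumptions.

def aup_action(Q_spec, Q_aux_list, state, actions, a_noop, lam):
    """argmax_a Q_spec(s, a) - lam * E_aux[max(Q_aux(s, a_noop) - Q_aux(s, a), 0)]."""
    def penalty(a):
        # Expected immediate loss of attainable auxiliary value vs. doing nothing.
        losses = [max(Q(state, a_noop) - Q(state, a), 0.0) for Q in Q_aux_list]
        return sum(losses) / len(losses)
    return max(actions, key=lambda a: Q_spec(state, a) - lam * penalty(a))

# Baking a pie scores well on the specified reward but destroys option
# value for an auxiliary reward that wanted cake; making dough harms neither.
Q_spec = lambda s, a: {"noop": 0.0, "make_dough": 1.0, "bake_pie": 2.0}[a]
Q_cake = lambda s, a: {"noop": 5.0, "make_dough": 5.0, "bake_pie": 0.0}[a]

best = aup_action(Q_spec, [Q_cake], "s", ["noop", "make_dough", "bake_pie"],
                  a_noop="noop", lam=1.0)
assert best == "make_dough"
```

Because the penalty is computed per action rather than folded into the reward, making dough (which preserves option value) is not punished for merely enabling a later option-value-destroying step.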



Relative to Hadfield-Menell et al. (2016), our definition allows for partial observability and requires that the initial distributions over S and Θ be independent. We also have H choose her action sequentially after R, rather than simultaneously with R, in order to better parallel the reward learning setting.



AUP(s, a) = r_spec(s, a) − λ E_{r_aux}[max(Q_{r_aux}(s, a_φ) − Q_{r_aux}(s, a), 0)]

π_AUP(s) = argmax_a [Q_{r_spec}(s, a) − λ E_{r_aux}[max(Q_{r_aux}(s, a_φ) − Q_{r_aux}(s, a), 0)]]


Note that unlike H, R does not observe the reward parameter θ, and must infer θ much like it does the hidden state. A fully observable assistance game is one in which both H and R can observe the full state. In such cases, we omit Ω H , Ω R , O H and O R . Since we have not yet specified how H behaves, it is not clear what the agent should optimize for.

Under both M and M', s and θ are drawn from P_S and P_Θ respectively. Similarly, a^R and a^H are drawn from π^R(· | o^R_0) and π^H(· | o^H, a^R, θ) respectively. So the distribution of length-1 sequences is the same under both M and M'.

