BENEFITS OF ASSISTANCE OVER REWARD LEARNING Anonymous

Abstract

Much recent work has focused on how an agent can learn what to do from human feedback, leading to two major paradigms. The first paradigm is reward learning, in which the agent learns a reward model from human feedback that is provided from outside the environment. The second is assistance, in which the human is modeled as part of the environment, and the true reward function is modeled as a latent variable of the environment about which the agent may make inferences. The key difference between the two paradigms is that in reward learning, by construction, there is a separation between learning the reward and control using the learned reward, whereas in assistance both functions are performed as needed by a single policy. By merging reward learning and control, assistive agents can reason about the impact of control actions on the reward learning process, leading to several advantages over agents based on reward learning. We illustrate these advantages in simple environments by exhibiting desirable qualitative behaviors of assistive agents that agents based on reward learning cannot display.

1. INTRODUCTION

Traditional computer programs are explicit instructions for how to perform a particular task. However, we do not know how to write such instructions for more challenging tasks like translation. The field of artificial intelligence raises the level of abstraction: we simply specify what the task is, and let the machine figure out how to do it. As task complexity increases, even specifying the task becomes difficult. Several criteria that we might have thought were part of a specification of fairness turn out to be provably impossible to satisfy simultaneously (Kleinberg et al., 2016; Chouldechova, 2017; Corbett-Davies et al., 2017). Reinforcement learning agents often "game" their reward function by finding solutions that technically achieve high reward without doing what the designer intended (Lehman et al., 2018; Krakovna, 2018; Clark & Amodei, 2016). In complex environments, we need to specify what not to change (McCarthy & Hayes, 1981); failure to do so can lead to negative side effects (Amodei et al., 2016). Powerful agents with poor specifications may pursue instrumental subgoals (Bostrom, 2014; Omohundro, 2008) such as resisting shutdown and accumulating resources and power (Turner, 2019).

A natural solution is to once again raise the level of abstraction: rather than directly specifying some particular task(s), create an agent that is uncertain about the objective and infers it from human feedback. Instead of the current model of intelligent agents optimizing for their objectives, we would then have beneficial agents optimizing for our objectives (Russell, 2019). Reward learning (Leike et al., 2018; Jeon et al., 2020; Christiano et al., 2017; Ziebart et al., 2010) attempts to instantiate this by learning a reward model from human feedback, and then using a control algorithm to optimize the learned reward.
Crucially, the control algorithm does not reason about the effects of the chosen actions on the reward learning process, which is external to the environment. In contrast, in the assistance paradigm (Hadfield-Menell et al., 2016; Fern et al., 2014), the human H is modeled as part of the environment and as having some latent goal that the agent R (for robot) does not know. R's goal is to maximize this (unknown) human goal. In this formulation, R must balance actions that help it learn about the unknown goal against control actions that directly lead to high reward. Our key insight is that by integrating reward learning and control, assistive agents can take the reward learning process into account when selecting actions. This gives assistive agents a significant advantage over reward learning agents, which cannot perform similar reasoning.
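The distinction above can be made concrete with a deliberately tiny toy problem (our own illustration, not from the paper; all names and numbers are hypothetical). The true objective theta is one of two hypotheses with equal prior probability; committing to the matching arm yields reward 1, the other arm 0, and the robot may instead "ask" the human, which reveals theta at a fixed cost. A reward-learning agent always runs its learning phase before control, while an assistive agent weighs the value of information from asking against acting on its current belief:

```python
def reward_learning_agent(ask_cost):
    """Separated modules: always query the human to learn theta,
    then act optimally under the learned reward.
    Expected return: 1 (correct arm) minus the fixed query cost."""
    return 1.0 - ask_cost


def assistive_agent(prior=0.5, ask_cost=0.1):
    """Single policy: compares the expected value of asking
    (learn theta, then pick the right arm) against committing
    immediately to the likelier hypothesis."""
    value_if_ask = 1.0 - ask_cost
    value_if_act = max(prior, 1.0 - prior)
    return max(value_if_ask, value_if_act)


# With a cheap query both agents ask; with an expensive one, only
# the assistive agent notices that asking is no longer worthwhile
# and acts on its prior instead.
print(assistive_agent(ask_cost=0.1))   # asks the human
print(assistive_agent(ask_cost=0.7))   # acts on the prior
print(reward_learning_agent(0.7))      # asks regardless of cost
```

The point of the sketch is only that the assistive agent's decision about whether to gather reward information is itself part of control, whereas the reward-learning agent's query behavior is fixed by construction.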

