THE IMPACTS OF KNOWN AND UNKNOWN DEMONSTRATOR IRRATIONALITY ON REWARD INFERENCE

Abstract

Algorithms inferring rewards from human behavior typically assume that people are (approximately) rational. In reality, people exhibit a wide array of irrationalities. Motivated by understanding the benefits of modeling these irrationalities, we analyze the effects that demonstrator irrationality has on reward inference. We propose operationalizing several forms of irrationality in the language of MDPs by altering the Bellman optimality equation, and use this framework to study how these alterations affect inference. We find that incorrectly assuming noisy-rationality for an irrational demonstrator can lead to remarkably poor reward inference accuracy, even in situations where inference with the correct model performs well. This suggests a need to either model irrationalities explicitly or find reward inference algorithms that are more robust to misspecification of the demonstrator model. Surprisingly, we find that if we give the learner access to the correct model of the demonstrator's irrationality, these irrationalities can actually help reward inference. In other words, if we could choose between a world where humans were perfectly rational and the current world where humans have systematic biases, the current world might counterintuitively be preferable for reward inference. We reproduce this effect in several domains. While this finding is mainly conceptual, it is perhaps actionable as well: we might ask human demonstrators for myopic demonstrations instead of optimal ones, as they are more informative for the learner and might be easier for a human to generate.
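To make "altering the Bellman optimality equation" concrete, the sketch below is our own minimal illustration, not the paper's code: tabular value iteration in which the standard max backup can be swapped for a Boltzmann (noisy-rational) softmax backup or a myopic backup that shrinks the discount factor. The function and argument names (`value_iteration`, `backup`, `myopic_gamma`) are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

def value_iteration(T, R, gamma, backup="optimal", beta=5.0,
                    myopic_gamma=0.1, n_iters=500):
    """Tabular value iteration with a swappable Bellman backup.

    T: (S, A, S) transition probabilities; R: (S, A) rewards.
    `backup` chooses how the demonstrator aggregates action values:
      "optimal"   -- V(s) = max_a Q(s, a)                       (rational)
      "boltzmann" -- V(s) = (1/beta) log sum_a exp(beta Q(s,a)) (noisy-rational)
      "myopic"    -- the optimal backup, but with a much smaller discount
    """
    n_states, _, _ = T.shape
    g = myopic_gamma if backup == "myopic" else gamma
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R + g * (T @ V)  # Q[s, a] = R[s, a] + g * E_{s'}[V(s')]
        if backup == "boltzmann":
            V = logsumexp(beta * Q, axis=1) / beta
        else:
            V = Q.max(axis=1)
    return Q

def boltzmann_policy(Q, beta):
    """Action distribution pi(a|s) proportional to exp(beta * Q(s, a))."""
    log_pi = beta * Q - logsumexp(beta * Q, axis=1, keepdims=True)
    return np.exp(log_pi)
```

Under this framing, a demonstrator "type" is simply a choice of backup; the learner's model is misspecified whenever it assumes a different backup than the one that generated the demonstrations.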

1. INTRODUCTION

Motivated by the difficulty of reward specification (Lehman et al., 2018), inverse reinforcement learning (IRL) methods estimate a reward function from human demonstrations (Ng et al., 2000; Abbeel and Ng, 2004; Kalman, 1964; Jameson and Kreindler, 1973; Mombaur et al., 2010). The central assumption behind these methods is that human behavior is rational, i.e., (approximately) maximizing expected cumulative reward. Unfortunately, decades of research in behavioral economics and cognitive science (Chipman, 2014) have unearthed a deluge of irrationalities, i.e., ways in which people deviate from optimal decision making: hyperbolic discounting, scope insensitivity, illusion of control, decision noise, and loss aversion, to name a few. While as a community we are starting to account for some of the irrationalities plaguing demonstrations (Ziebart et al., 2008; 2010; Singh et al., 2017; Reddy et al., 2018; Evans et al., 2016; Shah et al., 2019), we understand relatively little about the effect irrationalities have on the difficulty of inferring the reward. In this work, we seek a systematic analysis of this effect. Do irrationalities make it harder to infer the reward? Is it the case that the more irrational someone is, the harder it is to infer their reward? Do we need to account for the specific irrationality type during learning, or can we get away with the standard noisy-rationality model?

The answers to these questions are important in deciding how to move forward with reward inference. If irrationality, even when well modeled, still makes reward inference very difficult, then we will need alternate ways to specify behaviors. If well-modeled irrationality leads to decent reward inference but we run into problems when we simply make a noisy-rational assumption, that suggests we need to start accounting for irrationality more explicitly, or at least seek assumptions or models that are robust to the many different types of biases people might exhibit. Finally, if the noisy-rational model leads to good inference even for irrational demonstrators, then we can continue to rely on it as-is.
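For concreteness, here is a minimal sketch of reward inference under the standard noisy-rational (Boltzmann) assumption, written against the `value_iteration` helper above. It is our own illustration under stated assumptions: `infer_reward` and `candidate_rewards` are hypothetical names, and in practice one would optimize over reward parameters rather than enumerate a finite candidate set. Misspecification corresponds to `assumed_backup` differing from whatever process actually generated the demonstrations.

```python
import numpy as np
from scipy.special import logsumexp
# Reuses value_iteration from the sketch above.

def demo_log_likelihood(demos, Q, beta):
    """Log-likelihood of observed (state, action) pairs under a
    Boltzmann policy derived from Q."""
    log_pi = beta * Q - logsumexp(beta * Q, axis=1, keepdims=True)
    return sum(log_pi[s, a] for s, a in demos)

def infer_reward(demos, T, candidate_rewards, gamma=0.95, beta=5.0,
                 assumed_backup="optimal"):
    """Maximum-likelihood reward inference over a finite candidate set.

    The learner models the demonstrator as value_iteration with
    `assumed_backup`; inference is misspecified whenever that backup
    differs from the demonstrator's true one."""
    best_R, best_ll = None, -np.inf
    for R in candidate_rewards:
        Q = value_iteration(T, R, gamma, backup=assumed_backup, beta=beta)
        ll = demo_log_likelihood(demos, Q, beta)
        if ll > best_ll:
            best_R, best_ll = R, ll
    return best_R
```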

