THE IMPACTS OF KNOWN AND UNKNOWN DEMONSTRATOR IRRATIONALITY ON REWARD INFERENCE

Abstract

Algorithms inferring rewards from human behavior typically assume that people are (approximately) rational. In reality, people exhibit a wide array of irrationalities. Motivated by understanding the benefits of modeling these irrationalities, we analyze the effects that demonstrator irrationality has on reward inference. We propose operationalizing several forms of irrationality in the language of MDPs, by altering the Bellman optimality equation, and use this framework to study how these alterations affect inference. We find that incorrectly assuming noisy-rationality for an irrational demonstrator can lead to remarkably poor reward inference accuracy, even in situations where inference with the correct model is accurate. This suggests a need to either model irrationalities or find reward inference algorithms that are more robust to misspecification of the demonstrator model. Surprisingly, we find that if we give the learner access to the correct model of the demonstrator's irrationality, these irrationalities can actually help reward inference. In other words, if we could choose between a world where humans were perfectly rational and the current world where humans have systematic biases, the current world might counterintuitively be preferable for reward inference. We reproduce this effect in several domains. While this finding is mainly conceptual, it is perhaps actionable as well: we might ask human demonstrators for myopic demonstrations instead of optimal ones, as they are more informative for the learner and might be easier for a human to generate.

1. INTRODUCTION

Motivated by the difficulty of reward specification (Lehman et al., 2018), inverse reinforcement learning (IRL) methods estimate a reward function from human demonstrations (Ng et al., 2000; Abbeel and Ng, 2004; Kalman, 1964; Jameson and Kreindler, 1973; Mombaur et al., 2010). The central assumption behind these methods is that human behavior is rational, i.e., optimal with respect to their reward (cumulative, in expectation). Unfortunately, decades of research in behavioral economics and cognitive science (Chipman, 2014) have unearthed a deluge of irrationalities, i.e., ways in which people deviate from optimal decision making: hyperbolic discounting, scope insensitivity, illusion of control, decision noise, and loss aversion, to name a few. While as a community we are starting to account for some possible irrationalities plaguing demonstrations in different ways (Ziebart et al., 2008; 2010; Singh et al., 2017; Reddy et al., 2018; Evans et al., 2016; Shah et al., 2019), we understand relatively little about what effect irrationalities have on the difficulty of inferring the reward. In this work, we seek a systematic analysis of this effect. Do irrationalities make it harder to infer the reward? Is it the case that the more irrational someone is, the harder it is to infer their reward? Do we need to account for the specific irrationality type during learning, or can we get away with the standard noisy-rationality model? The answers to these questions are important in deciding how to move forward with reward inference. If irrationality, even when well modelled, still makes reward inference very difficult, then we will need alternative ways to specify behaviors.
If well-modelled irrationality leads to decent reward inference but we run into problems when we just make a noisy-rational assumption, that suggests we need to start accounting for irrationality more explicitly, or at least to seek assumptions or models that are robust to the many different types of biases people might exhibit. Finally, if the noisy-rational model leads to decent inference even when the demonstrator is irrational, then we need not dedicate significant resources to addressing irrationality.

Figure 1: We found that irrationality does not always hinder reward inference (section 3.2); in many cases it is actually helpful. Here, we depict rational and myopic (short-sighted) behavior in a merging environment (section 4.2) for two different rewards, with higher and lower weights on going fast. The rational car (white) exhibits similar behavior under both reward functions, while the myopic car (blue) overtakes on the shoulder when the reward function places a high weight on speed. This makes it easier to differentiate what its reward is. (Panels: Short Planning Horizon, Long Planning Horizon.)

One challenge with conducting such an analysis is that there are many irrationalities in the psychology and behavioral economics literature, with varying degrees of mathematical formalization versus empirical description. To structure the space for our analysis, we operationalize irrationalities in the language of MDPs by systematically enumerating possible deviations from the Bellman equation: imperfect maximization, deviations from the true transition function, and so on. This gives us a formal framework in which we can simulate irrational behavior, run reward inference, and study its performance. Armed with this formalism, we explore the various impacts of irrationality on reward learning in three families of environments: small random MDPs, a more legible gridworld MDP, and an autonomous driving domain drawn from the robotics literature (Sadigh et al., 2016).

Irrationality can help, rather than hinder, reward inference - if it is modelled correctly. We first explore the impacts of demonstrator irrationality when the irrationality is known to the reward inference algorithm. Surprisingly, we find that certain irrationalities actually improve the quality of reward inference - that is, they make the reward easier to learn.
Importantly, this is not compared to assuming the wrong model of the human: our finding is that humans who exhibit (correctly modelled) irrationality are more informative than humans who exhibit (correctly modelled) rationality! This is consistent across all three domains. We explain this theoretically from the perspective of the mutual information between the demonstrator's behavior and the reward parameters, proving that some irrationalities are arbitrarily more informative than rational behavior.

Unmodelled irrationality leads to remarkably poor reward inference. It might seem that we cannot immediately benefit from the knowledge that irrationalities help inference unless we have a comprehensive understanding of human decision-making, and so we should just stick to the status quo of modeling people as rational. However, we find that modeling irrational demonstrators as noisily-rational can lead to worse outcomes than not performing inference at all and simply using the prior (section 5). Encouragingly, we also find evidence that modeling the demonstrator's irrationality even approximately allows a learner to outperform modeling the demonstrator as noisily-rational (section E).

Overall, we contribute 1) a theoretical and empirical analysis of the effects of different irrationalities on reward inference, 2) a way to systematically formalize and cover the space of irrationalities in order to conduct such an analysis, and 3) evidence for the importance and benefit of accounting for irrationality during inference. Our results suggest that modeling people as noisily rational leads to poor reward inference, and that it is important to model the irrationalities of human demonstrators. The good news is that if we manage to do that well, we might be better off even compared to a counterfactual world in which people were actually rational! Of course, modeling irrationality is a long-term endeavour.
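The mutual-information intuition can be illustrated with a toy calculation (our own illustrative sketch, not the paper's actual analysis; the function name `mutual_information` is hypothetical). Suppose the learner holds a uniform prior over two candidate rewards and observes a single demonstrator action; then I(θ; a) measures how much that action narrows down the reward:

```python
import numpy as np

def mutual_information(policies):
    """I(theta; a) in nats, for a uniform prior over reward parameters.

    policies: array of shape (n_rewards, n_actions); row i is the
    demonstrator's action distribution P(a | theta_i).
    """
    policies = np.asarray(policies, dtype=float)
    p_theta = np.full(len(policies), 1.0 / len(policies))
    p_a = p_theta @ policies  # marginal P(a)
    # Only score terms where P(a | theta) > 0, treating 0 * log 0 as 0.
    ratio = np.where(policies > 0,
                     policies / np.where(p_a > 0, p_a, 1.0), 1.0)
    return float(np.sum(p_theta[:, None] * policies * np.log(ratio)))

# A rational demonstrator that takes the same action under both rewards
# reveals nothing; a biased one whose actions differ across rewards
# reveals the reward exactly, mirroring the merging example in Figure 1.
rational = [[1.0, 0.0], [1.0, 0.0]]
biased = [[1.0, 0.0], [0.0, 1.0]]
print(mutual_information(rational))  # 0.0
print(mutual_information(biased))    # log(2) ~= 0.693 nats
```

The point of the toy example is only the qualitative gap: a demonstrator whose behavior varies more sharply with the reward carries more information about it.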
Our near-term good news is twofold. First, irrationalities can be an ally for teaching: for example, we could ask human demonstrators to act more myopically to better communicate their reward to the learner. Second, we need not get the biases exactly right to do better than assuming noisy-rationality; even slightly more realistic models of human irrationality can lead to better inference.
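To make the operationalization concrete, the following is a minimal sketch (our own illustrative code with hypothetical names, not the paper's implementation) of how altering the Bellman backup yields different demonstrator models: truncating the lookahead produces a myopic demonstrator, while a Boltzmann distribution over the resulting Q-values gives the standard noisy-rational model.

```python
import numpy as np

def boltzmann_policy(T, R, gamma=0.9, beta=5.0, horizon=None, iters=200):
    """Value iteration with an optionally altered Bellman backup.

    T: transition probabilities, shape (S, A, S'); R: state rewards, (S,).
    horizon=None runs backups to (near) convergence, i.e. the standard
    rational demonstrator; a small horizon truncates lookahead, modeling
    a myopic one. Actions are chosen Boltzmann-rationally with inverse
    temperature beta. Returns the policy, shape (S, A).
    """
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(iters if horizon is None else horizon):
        Q = R[:, None] + gamma * (T @ V)  # (S, A); backup may stop early
        V = Q.max(axis=1)
    Q = R[:, None] + gamma * (T @ V)
    logits = beta * (Q - Q.max(axis=1, keepdims=True))  # for stability
    policy = np.exp(logits)
    return policy / policy.sum(axis=1, keepdims=True)

# Tiny random MDP: the myopic policy can differ from the rational one,
# and a learner that knows the demonstrator's horizon can exploit this.
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(4), size=(4, 2))  # shape (S=4, A=2, S'=4)
R = rng.standard_normal(4)
rational = boltzmann_policy(T, R)
myopic = boltzmann_policy(T, R, horizon=1)
```

In this framing, each irrationality studied in the paper corresponds to one such perturbation of the backup (horizon, discounting, transition beliefs, or the maximization step), which is what lets the same inference machinery be run against every demonstrator model.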

