LEARNING ZERO-SHOT COOPERATION WITH HUMANS, ASSUMING HUMANS ARE BIASED

Abstract

There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion without using any human data. The typical workflow is to first repeatedly run self-play (SP) to build a policy pool and then train the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized w.r.t. the environment reward function, which implicitly assumes that the testing partners of the adaptive policy will be precisely optimizing the same reward function as well. However, human objectives are often substantially biased according to their own preferences, which can differ greatly from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the reward space as linear functions, HSP adopts an effective technique to generate an augmented policy pool with biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that our HSP method produces higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated as the most assistive policy based on human feedback.

1. INTRODUCTION

Building intelligent agents that can interact with, cooperate with, and assist humans remains a longstanding AI challenge with decades of research efforts (Klien et al., 2004; Ajoudani et al., 2018; Dafoe et al., 2021). Classical approaches are typically model-based: they (repeatedly) build an effective behavior model over human data and plan with that human model (Sheridan, 2016; Carroll et al., 2019; Bobu et al., 2020). Despite great successes, this model-based paradigm requires an expensive and time-consuming data collection process, which can be particularly problematic for complex problems tackled by today's AI techniques (Kidd & Breazeal, 2008; Biondi et al., 2019), and may also suffer from privacy issues (Pan et al., 2019). Recently, multi-agent reinforcement learning (MARL) has become a promising approach for many challenging decision-making problems. Particularly in competitive settings, AIs developed by MARL algorithms based on self-play (SP) have defeated human professionals in a variety of domains (Silver et al., 2018; Vinyals et al., 2019; Berner et al., 2019). This empirical evidence suggests a new direction: developing strong AIs that can directly cooperate with humans in a similar "model-free" fashion, i.e., via self-play. Different from zero-sum games, where simply adopting a Nash equilibrium strategy is sufficient, an obvious issue when training cooperative agents by self-play is convention overfitting. Because a cooperative game admits a large number of possible optimal strategies, SP-trained agents can easily converge to a particular optimum and make decisions solely based on a specific behavior pattern, i.e., a convention (Lowe et al., 2019; Hu et al., 2020), of their co-trainers, leading to poor generalization to unseen partners.
To tackle this problem, recent works proposed a two-staged framework that first develops a diverse policy pool consisting of multiple SP-trained policies, which possibly cover different conventions, and then trains an adaptive policy against this policy pool (Lupu et al., 2021; Strouse et al., 2021; Zhao et al., 2021). Despite the empirical success of this two-staged framework, a fundamental drawback exists. Even though the policy pool prevents convention overfitting, each SP-trained policy in the pool remains an (optimal or sub-optimal) solution to the fixed reward function specified by the underlying cooperative game. This implies a crucial generalization assumption: that any test-time partner will be precisely optimizing the specified game reward. Such an assumption results in a pitfall when cooperating with humans. Human behavior has been widely studied in cognitive science (Griffiths, 2015), economics (Wilkinson & Klaes, 2017) and game theory (Fang et al., 2021). Systematic research has shown that humans' utility functions can be substantially biased even when a clear objective is given (Pratt, 1978; Selten, 1990; Camerer, 2011; Barberis, 2013), suggesting that human behaviors may be subject to an unknown reward function that is very different from the game reward (Nguyen et al., 2013). This fact reveals an algorithmic limitation of the existing SP-based methods. In this work, we propose Hidden-Utility Self-Play (HSP), which extends the SP-based two-staged framework to the assumption of biased humans. HSP explicitly models human bias via an additional hidden reward function in the self-play training objective. Solutions to such a generalized formulation are capable of representing any non-adaptive human strategy. We further present a tractable approximation of the hidden reward function space and perform a random search over this approximated space when building the policy pool in the first stage.
Hence, the enhanced pool can capture a wide range of possible human biases beyond conventions (Hu et al., 2020; Zhao et al., 2021) and skill levels (Dafoe et al., 2021) w.r.t. the game reward. Accordingly, the final adaptive policy derived in the second stage can have a much stronger capability of adapting to unseen humans. We evaluate HSP on a popular human-AI cooperation benchmark, Overcooked (Carroll et al., 2019), a fully observable two-player cooperative game. We conduct comprehensive ablation studies and comparisons with baselines that do not explicitly model human biases. Empirical results show that HSP achieves superior performance when cooperating with behavior models learned from human data. In addition, we also consider a collection of manually scripted biased strategies, which are ensured to be sufficiently distinct from the policy pool, and HSP produces an even larger performance improvement over the baselines. Finally, we conduct studies with real human participants. The collected feedback shows that participants consistently find the agent trained by HSP much more assistive than the baselines. We emphasize that, in addition to our algorithmic contributions, our empirical analysis, which considers learned models, scripted policies, and real humans as diverse testing partners, also provides a more thorough evaluation standard for learning human-assistive AIs.
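To make the first stage concrete, the linear approximation of the hidden reward space can be sketched as sampling a random weight vector over event-based features and adding the resulting utility to the environment reward. The feature names, weight range, and mixing coefficient below are illustrative assumptions, not the exact configuration used by HSP:

```python
import random

# Hypothetical event features for a cooperative game such as Overcooked;
# the actual feature set is an assumption made for illustration.
EVENT_FEATURES = ["onion_pickup", "dish_pickup", "soup_delivery", "movement"]

def sample_hidden_reward(scale=10.0, rng=None):
    """Sample a random linear hidden reward r(s, a) = sum_i w_i * phi_i(s, a),
    represented here by one weight per event feature."""
    rng = rng or random.Random()
    return {f: rng.uniform(-scale, scale) for f in EVENT_FEATURES}

def biased_reward(event_counts, weights, game_reward, alpha=1.0):
    """Combine the environment reward with the hidden linear utility term.
    `event_counts` maps each feature to how often it occurred this step."""
    hidden = sum(weights[f] * event_counts.get(f, 0) for f in EVENT_FEATURES)
    return game_reward + alpha * hidden

# Stage one would train one self-play policy per sampled weight vector,
# yielding a pool of biased partners for the stage-two adaptive policy.
pool_rewards = [sample_hidden_reward(rng=random.Random(seed)) for seed in range(8)]
```

A partner trained under a weight vector that, say, strongly rewards `onion_pickup` but ignores `soup_delivery` behaves like a human who fixates on fetching ingredients, which is exactly the kind of biased strategy a pool of pure game-reward optimizers would never contain.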

2. RELATED WORK

There is a broad literature on improving the zero-shot generalization ability of MARL agents to unseen partners (Kirk et al., 2021). Particularly for cooperative games, this problem is often called ad hoc team play (Stone et al., 2010) or zero-shot cooperation (ZSC) (Hu et al., 2020). Since most existing methods are based on self-play (Rashid et al., 2018; Yu et al., 2021), avoiding convention overfitting becomes a critical challenge in ZSC. Representative works include improved policy representation (Zhang et al., 2020; Chen et al., 2020), randomization over invariant game structures (Hu et al., 2020; Treutlein et al., 2021), population-based training (Long* et al., 2020; Lowe* et al., 2020; Cui et al., 2021) and belief modeling for partially observable settings (Hu et al., 2021; Xie et al., 2021). Fictitious co-play (FCP) (Strouse et al., 2021) proposes a two-stage framework that first creates a pool of self-play policies and their previous versions and then trains an adaptive policy against them. Some techniques improve the diversity of the policy pool (Garnelo et al., 2021; Liu et al., 2021; Zhao et al., 2021; Lupu et al., 2021) to obtain a stronger adaptive policy (Knott et al., 2021). We follow the FCP framework and augment the policy pool with biased strategies. Notably, techniques for learning a robust policy in competitive games, such as policy ensembles (Lowe et al., 2017), adversarial training (Li et al., 2019) and double oracle (Lanctot et al., 2017), are complementary to our focus. Building AIs that can cooperate with humans remains a fundamental challenge in AI (Dafoe et al., 2021). A critical issue is that humans can be systematically biased (Camerer, 2011; Russell, 2019). Hence, great efforts have been made to model human biases, such as irrationality (Selten, 1990; Bobu et al., 2020; Laidlaw & Dragan, 2022), risk aversion (Pratt, 1978; Barberis, 2013), and myopia (Evans et al., 2016).
Many popular models further assume humans have hidden subjective utility functions (Nguyen et al., 2013; Hadfield-Menell et al., 2016; Eckersley, 2019; Shah et al., 2019). Conventional methods for human-AI collaboration require an accurate behavior model built over human data (Ajoudani et al., 2018; Kwon et al., 2020; Kress-Gazit et al., 2021; Wang et al., 2022), while we consider the setting with no human data. Hence, we explicitly model human biases as a hidden utility function in the self-play objective, reflecting possible human biases beyond conventions w.r.t. the game reward. We prove that such a hidden-utility model can represent any strategy of non-adaptive humans. Notably, it is also feasible to generalize our model to capture higher cognitive hierarchies (Camerer et al., 2004), which we leave as a future direction.

