LEARNING ZERO-SHOT COOPERATION WITH HUMANS, ASSUMING HUMANS ARE BIASED

Abstract

There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion without using any human data. The typical workflow is to first repeatedly run self-play (SP) to build a policy pool and then train the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized w.r.t. the environment reward function, which implicitly assumes that the testing partners of the adaptive policy will be precisely optimizing the same reward function as well. However, human objectives are often substantially biased according to their own preferences, which can differ greatly from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the reward space as linear functions, HSP adopts an effective technique to generate an augmented policy pool with biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that our HSP method produces higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated as the most assistive policy based on human feedback.

1. INTRODUCTION

Building intelligent agents that can interact with, cooperate with, and assist humans remains a longstanding AI challenge with decades of research efforts (Klien et al., 2004; Ajoudani et al., 2018; Dafoe et al., 2021). Classical approaches are typically model-based: they (repeatedly) build an effective behavior model from human data and plan with that human model (Sheridan, 2016; Carroll et al., 2019; Bobu et al., 2020). Despite great successes, this model-based paradigm requires an expensive and time-consuming data collection process, which can be particularly problematic for complex problems tackled by today's AI techniques (Kidd & Breazeal, 2008; Biondi et al., 2019), and may also suffer from privacy issues (Pan et al., 2019). Recently, multi-agent reinforcement learning (MARL) has become a promising approach for many challenging decision-making problems. Particularly in competitive settings, AIs developed by MARL algorithms based on self-play (SP) have defeated human professionals in a variety of domains (Silver et al., 2018; Vinyals et al., 2019; Berner et al., 2019). This empirical evidence suggests a new direction: developing strong AIs that can directly cooperate with humans in a similarly "model-free" fashion, i.e., via self-play. Different from zero-sum games, where simply adopting a Nash equilibrium strategy is sufficient, an obvious issue when training cooperative agents by self-play is convention overfitting. Because a cooperative game typically admits a large number of optimal strategies, SP-trained agents can easily converge to a particular optimum and make decisions solely based on a specific behavior pattern, i.e., a convention (Lowe et al., 2019; Hu et al., 2020), of their co-trainers, leading to poor generalization to unseen partners.
To tackle this problem, recent works proposed a two-staged framework: first develop a diverse policy pool consisting of multiple SP-trained policies, which possibly cover different conventions, and then train an adaptive policy against this policy pool (Lupu et al., 2021; Strouse et al., 2021; Zhao et al., 2021). Despite the empirical success of this two-staged framework, a fundamental drawback exists. Even though the policy pool prevents convention overfitting, each SP-trained policy in the pool remains a (possibly sub-optimal) solution to a fixed reward function specified by the underlying cooperative game. This implies a crucial generalization assumption: any test-time partner will be precisely optimizing the specified game reward. Such an assumption results in a pitfall when cooperating with humans. Human behavior has been widely studied in cognitive science (Griffiths, 2015), economics (Wilkinson & Klaes, 2017), and game theory (Fang et al., 2021). Systematic research has shown that humans' utility functions can be substantially biased even when a clear objective is given (Pratt, 1978; Selten, 1990; Camerer, 2011; Barberis, 2013), suggesting that human behaviors may be subject to an unknown reward function that is very different from the game reward (Nguyen et al., 2013). This fact reveals an algorithmic limitation of existing SP-based methods. In this work, we propose Hidden-Utility Self-Play (HSP), which extends the SP-based two-staged framework to the assumption of biased humans. HSP explicitly models human bias via an additional hidden reward function in the self-play training objective. Solutions to this generalized formulation can represent any non-adaptive human strategy. We further present a tractable approximation of the hidden reward function space and perform a random search over this approximated space when building the policy pool in the first stage.
Hence, the enhanced pool can capture a wide range of possible human biases beyond conventions (Hu et al., 2020; Zhao et al., 2021) and skill levels (Dafoe et al., 2021) w.r.t. the game reward. Accordingly, the final adaptive policy derived in the second stage has a much stronger capability to adapt to unseen humans. We evaluate HSP on a popular human-AI cooperation benchmark, Overcooked (Carroll et al., 2019), which is a fully observable two-player cooperative game. We conduct comprehensive ablation studies and comparisons with baselines that do not explicitly model human biases. Empirical results show that HSP achieves superior performance when cooperating with behavior models learned from human data. In addition, we also consider a collection of manually scripted biased strategies, which are ensured to be sufficiently distinct from the policy pool, and HSP produces an even larger performance improvement over the baselines. Finally, we conduct real human studies. The collected feedback shows that human participants consistently find the agent trained by HSP much more assistive than the baselines. We emphasize that, in addition to our algorithmic contributions, our empirical analysis, which considers learned models, scripted policies, and real humans as diverse testing partners, also provides a more thorough evaluation standard for learning human-assistive AIs.

2. RELATED WORK

There is a broad literature on improving the zero-shot generalization ability of MARL agents to unseen partners (Kirk et al., 2021). Particularly for cooperative games, this problem is often called ad hoc team play (Stone et al., 2010) or zero-shot cooperation (ZSC) (Hu et al., 2020). Since most existing methods are based on self-play (Rashid et al., 2018; Yu et al., 2021), avoiding convention overfitting becomes a critical challenge in ZSC. Representative works include improved policy representation (Zhang et al., 2020; Chen et al., 2020), randomization over invariant game structures (Hu et al., 2020; Treutlein et al., 2021), population-based training (Long* et al., 2020; Lowe* et al., 2020; Cui et al., 2021), and belief modeling for partially observable settings (Hu et al., 2021; Xie et al., 2021). Fictitious co-play (FCP) (Strouse et al., 2021) proposes a two-stage framework that first creates a pool of self-play policies and their previous versions and then trains an adaptive policy against them. Several techniques improve the diversity of the policy pool (Garnelo et al., 2021; Liu et al., 2021; Zhao et al., 2021; Lupu et al., 2021) for a stronger adaptive policy (Knott et al., 2021). We follow the FCP framework and augment the policy pool with biased strategies. Notably, techniques for learning a robust policy in competitive games, such as policy ensembles (Lowe et al., 2017), adversarial training (Li et al., 2019), and double oracle (Lanctot et al., 2017), are complementary to our focus. Building AIs that can cooperate with humans remains a fundamental challenge in AI (Dafoe et al., 2021). A critical issue is that humans can be systematically biased (Camerer, 2011; Russell, 2019). Hence, great efforts have been made to model human biases, such as irrationality (Selten, 1990; Bobu et al., 2020; Laidlaw & Dragan, 2022), risk aversion (Pratt, 1978; Barberis, 2013), and myopia (Evans et al., 2016).
Many popular models further assume humans have hidden subjective utility functions (Nguyen et al., 2013; Hadfield-Menell et al., 2016; Eckersley, 2019; Shah et al., 2019). Conventional methods for human-AI collaboration require an accurate behavior model learned from human data (Ajoudani et al., 2018; Kwon et al., 2020; Kress-Gazit et al., 2021; Wang et al., 2022), while we consider a setting with no human data. Hence, we explicitly model human biases as a hidden utility function in the self-play objective to reflect possible human biases beyond conventions w.r.t. optimal rewards. We prove that such a hidden-utility model can represent any strategy of non-adaptive humans. Notably, it is also feasible to generalize our model to capture higher cognitive hierarchies (Camerer et al., 2004), which we leave as a future direction. We approximate the reward space by a linear function space over event-based features. Such a linear representation is typical in inverse reinforcement learning (Ng & Russell, 2000), policy transfer (Barreto et al., 2017b), evolutionary computation (Cully et al., 2015), and game theory (Winterfeldt & Fischer, 1975; Kiekintveld et al., 2013). Event-based rewards are also widely adopted as a general design principle in robot learning (Fu et al., 2018; Zhu et al., 2019; Ahn et al., 2022). We perform randomization over feature weights to produce diverse biased strategies. Similar ideas have been adopted in other settings, such as generating adversaries (Paruchuri et al., 2006), emergent team formation (Baker, 2020), and searching for diverse Nash equilibria in general-sum games (Tang et al., 2020). In our implementation, we use multi-reward signals as an approximate metric to filter out duplicated policies, inspired by the quality-diversity method (Pugh et al., 2016). There are also works utilizing model-based methods for zero-shot cooperation (Wu et al., 2021).
Their focus is orthogonal to ours: they concentrate on constructing an adaptive agent, while our approach aims to find more diverse strategies. Besides, we train the adaptive agent in an end-to-end fashion, which is more general. Lastly, our final adaptive agent assumes a zero-shot setting without any data from its testing partner. This can be further extended by allowing meta-adaptation at test time (Charakorn et al., 2021; Gupta et al., 2021; Nekoei et al., 2021), which we leave as a future direction.

3. PRELIMINARY

Two-Player Human-AI Cooperative Game: A human-AI cooperative game is defined on a world model, i.e., a two-player Markov decision process denoted by M = ⟨S, A, P, R⟩, with one player with policy π_A being an AI and the other with policy π_H being a human. S is a set of world states. A is a set of possible actions for each player. P is a transition function over states given the actions from both players. R is a global reward function. A policy π_i produces an action a_t^{(i)} ∈ A given a world state s_t ∈ S at time step t. We use the expected discounted return J(π_A, π_H) = E_{s_t, a_t^{(i)}}[Σ_t γ^t R(s_t, a_t^{(A)}, a_t^{(H)})] as the objective. Note that J(π_H, π_A) can be similarly defined, and we use J(π_A, π_H) for conciseness without loss of generality. Let P_H : Π → [0, 1] be the unknown distribution of human policies. The goal is to find a policy π_A that maximizes the expected return with an unknown human, i.e., E_{π_H ∼ P_H}[J(π_A, π_H)]. In practice, many works construct or learn a policy distribution P̂_H to approximate real-world human behaviors, leading to an approximated objective for π_A, i.e., E_{π̂_H ∼ P̂_H}[J(π_A, π̂_H)].

Self-Play for Human-AI Cooperation: Self-play (SP) optimizes J(π_1, π_2) with two parametric policies π_1 and π_2 and takes π_1 as π_A without using any human data. However, SP suffers from poor generalization since it converges to a specific optimum and overfits the resulting behavior convention. Population-based training (PBT) improves SP by representing π_i as a mixture of K individual policies {π_i^{(k)}}_{k=1}^{K} and running cross-play between policies to optimize the expected return (Long* et al., 2020; Lowe* et al., 2020; Cui et al., 2021). PBT can be further improved by adding a diversity bonus over the population (Garnelo et al., 2021; Liu et al., 2021; Lupu et al., 2021).

Fictitious Co-Play (FCP): FCP (Strouse et al., 2021) is a recent work on zero-shot human-AI cooperation with strong empirical performance.
FCP extends PBT via a two-stage framework. In the first stage, FCP trains K individual policy pairs {(π_1^{(k)}, π_2^{(k)})}_{k=1}^{K} by optimizing J(π_1^{(k)}, π_2^{(k)}) for each k. Each policy pair (π_1^{(k)}, π_2^{(k)}) may quickly converge to a distinct local optimum. Then FCP constructs a policy pool Π_2 = {π_2^{(k)}, π̄_2^{(k)}}_{k=1}^{K} that also contains past versions of each converged SP policy π_2^{(k)}, denoted by π̄_2^{(k)}. In the second stage, FCP constructs a human proxy distribution P̂_H by randomly sampling from Π_2 and trains π_A by optimizing E_{π̂_H ∼ P̂_H}[J(π_A, π̂_H)]. We remark that, for better cooperation, the adaptive policy π_A should condition on the state-action history in an episode to infer the intention of its partner. Individual SP policies ensure P̂_H contains diverse conventions, while including past versions enables P̂_H to cover different skill levels. Hence, the final policy π_A is forced to adapt to humans with unknown conventions or sub-optimalities. Maximum Entropy Population-based Training (MEP) (Zhao et al., 2021) is the latest variant of FCP, which adds a population entropy bonus in the first stage to improve the generalization of the learned π_A.
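As a schematic illustration of the notation above, the sketch below shows a Monte-Carlo estimate of J(π_A, π_H) and the two-stage FCP recipe of building a checkpoint pool. The `env.reset()`/`env.step(a1, a2)` interface and the `selfplay_run` callback are our own assumptions for illustration, not the paper's implementation:

```python
import random

def rollout_return(env, pi_a, pi_h, gamma=0.99):
    """One episode; both players act on the shared world state s_t."""
    s, done, ret, t = env.reset(), False, 0.0, 0
    while not done:
        s, r, done = env.step(pi_a(s), pi_h(s))
        ret += (gamma ** t) * r  # shared task reward R(s_t, a_t^(A), a_t^(H))
        t += 1
    return ret

def estimate_J(env, pi_a, partner_pool, episodes=100, gamma=0.99):
    """Monte-Carlo estimate of E_{pi_H ~ P_H}[J(pi_A, pi_H)], with the
    unknown human distribution approximated by a finite partner pool."""
    return sum(rollout_return(env, pi_a, random.choice(partner_pool), gamma)
               for _ in range(episodes)) / episodes

def build_fcp_pool(selfplay_run, num_pairs, fracs=(0.5, 1.0)):
    """FCP stage 1: independent self-play runs; keep converged policies plus
    earlier checkpoints so the pool covers conventions and skill levels."""
    pool = []
    for k in range(num_pairs):
        snapshots = selfplay_run(seed=k)  # policy snapshots over training
        pool += [snapshots[int(f * (len(snapshots) - 1))] for f in fracs]
    return pool
```

Stage 2 then repeatedly samples a partner from the pool and performs an RL update on π_A against it.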

4. COOPERATING WITH HUMANS IN Overcooked: A MOTIVATING EXAMPLE

Overcooked Game: Overcooked (Carroll et al., 2019) is a fully observable two-player cooperative game developed as a testbed for human-AI cooperation. In Overcooked, players cooperatively complete different soup orders and serve the soups for rewards. Basic game items include onions, tomatoes, and dishes. An agent can move in the game or "interact" to trigger events, such as grabbing/putting an item, serving soup, etc., depending on the game state. To finish an order, players should put the proper amount of ingredients into a pot and cook for some time. Once a soup is finished, players should pick it up with a dish and serve it to get a reward. Different orders have different cooking times and different rewards. Fig. 1 shows the five layouts we consider, where the first three onion-only layouts are adopted from (Carroll et al., 2019), while the latter two, Distant Tomato and Many Orders, are newly introduced to include tomato orders and make the problem more challenging: an AI needs to carefully decide whether to cook onions or tomatoes according to the other player's actions.

A Concrete Example of Human Preference: Fig. 2 illustrates a motivating example in Distant Tomato (the 4th layout in Fig. 1). There are two orders: one requires three onions, and the other requires three tomatoes. We run FCP on this multi-order scenario, and all the policies in the FCP policy pool converge to the specific pattern of only cooking onion soup (Fig. 2a). Hence, the final adaptive policy by FCP only learns to grab onions and cook onion soups. Cooking tomato soup is a sub-optimal strategy that requires many extra moves, so the onion-only policy pool is exactly the solution to the FCP self-play objective under the environment reward. However, it is perfectly plausible for a human to dislike onions and accordingly only grab tomatoes in a game. To be assistive, the AI policy should adapt its strategy to follow the human's preference for tomatoes.
In contrast, as shown in Fig. 2b, the FCP policy completely ignores the human's moves towards tomatoes and even cooperates poorly by producing worthless wrong orders of mixed onions and tomatoes. Thus, to make an FCP agent human-assistive, the first-stage policy pool should not only contain optimal strategies (i.e., onion soups) under different conventions but also cover diverse human preferences (e.g., tomatoes), even if these preferences are sub-optimal under the environment reward.

5. METHODOLOGY

We introduce a general formulation to model human preferences (Sec. 5.1) and develop a tractable learning objective (Sec. 5.2). The full algorithm, Hidden-Utility Self-Play (HSP), is summarized in Sec. 5.3.

5.1. HIDDEN-UTILITY MARKOV GAME

The key insight from Sec. 4 is that humans may not truthfully behave under the environment reward. Instead, humans are biased and driven by their own utility functions, which we formalize below.

Definition: A two-player hidden-utility Markov game is defined as M′ = ⟨S, A, P, R_w, R_t⟩, where ⟨S, A, P, R_t⟩ corresponds to the original game MDP with R_t being the task reward function, and R_w denotes an additional hidden reward function. There are two players: π_a, whose goal is to maximize the task reward R_t, and π_w, whose goal is to maximize the hidden reward R_w. R_w is only visible to π_w. Let J(π_1, π_2 | R) denote the expected return under reward R with policies π_1 and π_2. During self-play, π_a optimizes J(π_a, π_w | R_t) while π_w optimizes J(π_a, π_w | R_w). A solution policy profile (π*_a, π*_w) to the hidden-utility Markov game is defined as a Nash equilibrium (NE): J(π*_a, π*_w | R_w) ≥ J(π*_a, π′_w | R_w) for all π′_w, and J(π*_a, π*_w | R_t) ≥ J(π′_a, π*_w | R_t) for all π′_a. □

Intuitively, with a suitable hidden reward function R_w, we can obtain any possible (non-adaptive and consistent) human policy by solving the hidden-utility game induced by R_w.

Lemma 5.1. Given an MDP M = ⟨S, A, P, R_t⟩, for any policy π : S × A → [0, 1], there exists a hidden reward function R_w such that the two-player hidden-utility Markov game M′ = ⟨S, A, P, R_w, R_t⟩ has a Nash equilibrium (π*_a, π*_w) where π*_w = π.

Lemma 5.1 connects any human behavior to a hidden reward function. Then the objective of the adaptive policy π_A in Eq. (3) can be formulated over the hidden reward function space R as follows.

Theorem 5.1.
For any ϵ > 0, there exist a mapping π̃_w, where π̃_w(R_w) denotes the policy π*_w in the NE of the hidden-utility Markov game M_w = ⟨S, A, P, R_w, R_t⟩ induced by R_w, and a distribution P_R : R → [0, 1] over the hidden reward space R, such that any adaptive policy π_A ∈ arg max_{π′} E_{R_w ∼ P_R}[J(π′, π̃_w(R_w))] approximately maximizes the ground-truth objective with at most an ϵ gap, i.e., E_{π_H ∼ P_H}[J(π_A, π_H)] ≥ max_{π′} E_{π_H ∼ P_H}[J(π′, π_H)] − ϵ.

Theorem 5.1 indicates that it is possible to derive diverse human behaviors by properly designing a hidden reward distribution P_R, which can have a much lower intrinsic dimension than the policy distribution. In Overcooked, human preferences can typically be described by a few features, such as interactions with objects or certain types of game events, like finishing an order or delivering a soup. By approximating the hidden reward distribution as P̂_R, the learning objective becomes

π_A = arg max_{π′} E_{R_w ∼ P̂_R}[J(π′, π̃_w(R_w))].   (1)

Eq. (1) naturally suggests a two-staged solution: first construct a policy pool {π̃_w(R) : R ∼ P̂_R} from P̂_R, and then train π_A to maximize the game reward w.r.t. the induced pool.

5.2. CONSTRUCT A POLICY POOL OF DIVERSE PREFERENCES

Event-based Reward Function Space: The fundamental question is how to design a proper hidden reward function space R. In general, a valid reward space is intractably large. Inspired by the fact that human preferences are often event-centric, we formulate R as linear functions over event features, namely R = {R_w : R_w(s, a_1, a_2) = ϕ(s, a_1, a_2)^T w, ||w||_∞ ≤ C_max}, where C_max is a bound on the feature weight w and ϕ : S × A × A → R^m records occurrences of different game events when taking joint action (a_1, a_2) at state s.

Derive a Diverse Set of Biased Policies: We simply perform a random search over the feature weight w to derive a set of diverse behaviors. We first draw N samples {w^{(i)}}_{i∈[N]} for the feature weight w, where each coordinate w_j^{(i)} is sampled uniformly from a set of values C_j, leading to a set of hidden reward functions {R_w^{(i)} : R_w^{(i)}(s, a_1, a_2) = ϕ(s, a_1, a_2)^T w^{(i)}}_{i∈[N]}. For each hidden reward function R_w^{(i)}, we find an approximate NE, (π_w^{(i)}, π_a^{(i)}), of the hidden-utility Markov game induced by R_w^{(i)} through self-play. The above process produces a policy pool {π_w^{(i)}}_{i∈[N]} that can cover a wide range of behavior preferences.

Algorithm 1: Greedy Policy Selection
  S ← {i_0} where i_0 ∼ [N];
  for i = 1 → K − 1 do
    k′ ← arg max_{k′ ∉ S} ED(S ∪ {k′});
    S ← S ∪ {k′};
  end

Policy Filtering: We notice that the derived pool often contains many similar policies. This is because the same policy can be optimal for a set of different reward functions, which is typical in multi-objective optimization (Chugh et al., 2019; Tabatabaei et al., 2015). Duplicated policies simply slow down training without helping to learn π_A. For more efficient training, we adopt a behavior metric, i.e., event-based diversity, to keep only distinct policies from the initial pool. For each biased policy π_w^{(i)}, let EC^{(i)} denote the expected event count, i.e., E[Σ_{t=1}^{T} ϕ(s_t, a_t) | π_w^{(i)}, π_a^{(i)}].
We define the event-based diversity of a subset S ⊆ [N] by the normalized pairwise EC differences, i.e., ED(S) = Σ_{i,j∈S} Σ_k c_k · |EC_k^{(i)} − EC_k^{(j)}|, where c_k is a frequency-normalization constant. Finding a subset S* of size K with optimal ED can be expensive, so we simply adopt the greedy method in Algo. 1 to select policies incrementally.
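To make the procedure concrete, here is a minimal, self-contained sketch of the uniform weight search, the linear event-based reward, the diversity metric ED, and the greedy selection of Algo. 1. The variable names and the `event_counts` interface are our own illustration, not the paper's code:

```python
import itertools
import random

def sample_weights(num_candidates, value_sets, rng=random):
    """Random search over feature weights: each w_j drawn uniformly from C_j."""
    return [[rng.choice(c) for c in value_sets] for _ in range(num_candidates)]

def hidden_reward(phi, w):
    """Linear event-based hidden reward R_w(s, a1, a2) = phi(s, a1, a2)^T w."""
    return sum(f * wj for f, wj in zip(phi, w))

def event_diversity(subset, event_counts, c):
    """ED(S): normalized pairwise L1 differences of expected event counts."""
    return sum(c[k] * abs(event_counts[i][k] - event_counts[j][k])
               for i, j in itertools.combinations(subset, 2)
               for k in range(len(c)))

def greedy_select(event_counts, c, k_pool):
    """Algo. 1: start from a random policy and grow S one policy at a time,
    always adding the candidate that maximizes ED of the enlarged subset."""
    n = len(event_counts)
    s = [random.randrange(n)]
    while len(s) < k_pool:
        best = max((i for i in range(n) if i not in s),
                   key=lambda i: event_diversity(s + [i], event_counts, c))
        s.append(best)
    return s
```

In practice, `event_counts[i]` would be estimated from rollouts of the i-th biased self-play pair, as in the EC^{(i)} definition above.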

5.3. HIDDEN-UTILITY SELF-PLAY

Algorithm 2: Hidden-Utility Self-Play
  for i = 1 → N do
    Train π_w^{(i)} and π_a^{(i)} by self-play on the hidden-utility game induced by R_w^{(i)};
  end
  Select K distinct biased policies from {π_w^{(i)}}_{i∈[N]} via Algo. 1;
  Train the adaptive policy π_A against partners sampled from the filtered policy pool;

Given the filtered policy pool, we train the final adaptive policy π_A over rollout games between π_A and randomly sampled policies from the pool, which completes our overall algorithm HSP in Algo. 2. We implement HSP using MAPPO (Yu et al., 2021) as the RL algorithm. In the first stage, we use MLP policies for fast SP convergence. In practice, we use half of the policy pool to train biased policies and the other half to train MEP policies (Zhao et al., 2021) under the game reward. This biases the overall pool towards the game reward, leading to improved empirical performance. For the final adaptive training, as suggested in (Tang et al., 2020), we add the identity of each biased policy as an additional feature to the critic. As event-based features for the reward space, we consider event types including interactions with basic items and events causing non-zero rewards in Overcooked. Full implementation details can be found in Appendix D and E.
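For instance, the identity feature for the critic can be as simple as a one-hot vector appended to the critic's input, under our own naming conventions (a sketch, not the paper's implementation). Only the centralized value function sees this feature, so the actor's execution remains zero-shot:

```python
import numpy as np

def critic_input(obs_feats, partner_id, pool_size):
    """Centralized-critic input: world features plus a one-hot identity of the
    sampled biased partner; the actor input is left unchanged."""
    one_hot = np.zeros(pool_size, dtype=np.float32)
    one_hot[partner_id] = 1.0
    return np.concatenate([np.asarray(obs_feats, dtype=np.float32), one_hot])
```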

6. EXPERIMENTS

Baselines. We compare HSP with other SP-based baselines, including Fictitious Co-Play (FCP), Maximum Entropy Population-based Training (MEP), and Trajectory Diversity-based PBT (TrajDiv). All methods follow a two-stage framework with a final pool size of 36, which we empirically verified to be sufficiently large to avoid performance degradation for all methods. More analysis of the pool size can be found in Appendix F.2.1, and the implementation details of the baselines in Appendix D.2. Each policy is trained for 100M timesteps over 5 random seeds to ensure convergence. Full training details and hyper-parameter settings can be found in Appendix E.1.

Evaluation. We examine whether HSP can cooperate well with (1) learned human models, (2) scripted policies with strong preferences, and (3) real humans. We use both the game reward and human feedback as evaluation metrics. We remark that since a biased human player may play a sub-optimal strategy, the game reward may not fully reflect the performance gap between the baselines and HSP. Our goal is to ensure the learned policy is effective for biased partners/humans; therefore, we consider human feedback the fundamental metric. Ablation studies are also performed to investigate the impact of our design choices in HSP. In tables, maximum returns, and returns within a threshold of 5 of the maximum, are marked in bold. Full results can be found in Appendix F.

6.1. COOPERATION WITH LEARNED HUMAN MODELS IN ONION-ONLY LAYOUTS

For evaluation with learned human models, we adopted the models provided by (Carroll et al., 2019), which only support onion-only layouts, including Asymm. Adv., Coord. Ring and Counter Circ. The results are shown in Tab. 1. For a fair comparison, we reimplement all the baselines, labeled MEP, FCP, and TrajDiv, with the same training steps and policy pool size as HSP. We additionally report the best performance in the existing literature, labeled Existing SOTA in Tab. 1. Our implementation achieves substantially higher scores than Existing SOTA when evaluated with the same human proxy models. HSP further outperforms the other reimplementations in Asymm. Adv. and is comparable with the best baseline in the rest. Full results of the evaluation with learned human models can be found in Appendix F.1. We emphasize that the improvement is marginal because the learned human models have limited representation power to imitate natural human behaviors, which typically cover many behavior modalities. Fig. 8 in Appendix F.1.1 shows that trajectories induced by the learned human models cover only a narrow subspace of the trajectories played by human players. Further analysis of the learned human models can be found in Appendix F.1.1. Furthermore, our implementation of the baselines achieves substantially better results than the original papers (Carroll et al., 2019; Zhao et al., 2021), which also makes the improvement margin smaller.

6.2. ABLATION STUDIES

We investigate the impact of our design choices, including the construction of the final policy pool and the batch size for training the adaptive policy.

Policy Pool Construction: HSP applies two techniques to the policy pool: (1) policy filtering to remove duplicated biased policies and (2) using MEP policies trained under the game reward for half of the pool. We measure the performance with human proxies when turning these options off. For "HSP w.o. Filtering", we keep all policies found by random search in the policy pool, resulting in a larger pool size of 54 (18 MEP policies and a total of 36 random-search ones). For "HSP w.o. MEP", we exclude MEP policies from the policy pool and keep all biased policies without filtering, which leads to the same pool size of 36. The results are shown in Fig. 3, and the detailed numbers can be found in Appendix F.2.2. By excluding MEP policies, the HSP variant (HSP w.o. MEP) performs worse in the more complicated layout Counter Circ. while remaining comparable in the other two simpler ones. So we suggest including a few MEP policies when possible. With policy filtering turned off, even though the policy pool size grows, the performance decays significantly in both the Coord. Ring and Counter Circ. layouts, suggesting that duplicated biased policies can hurt policy generalization.

Batch Size: We measure the training curves of the final adaptive policy under the game reward using different numbers of parallel rollout threads in MAPPO. More parallel threads imply a larger batch size. The results in all five layouts are reported in Fig. 4. In general, we observe that a larger batch size often leads to better training performance. In particular, when the batch size is small, i.e., using 50 or 100 parallel threads, training becomes significantly unstable and even breaks down in three layouts. Note that the biased policies in the HSP policy pool have particularly diverse behaviors, which cause high policy-gradient variance when training the final adaptive policy. Therefore, a sufficiently large training batch size can be critical for stable optimization.
We adopt 300 parallel threads in all our experiments for a fair comparison.

Practical Remark: Overall, we suggest using a pool size of 36 and including a few MEP policies for the best empirical performance. Besides, a sufficiently large training batch size helps stabilize optimization, and we use the same batch size for all methods for a fair comparison.

6.3. COOPERATION WITH SCRIPTED POLICIES WITH STRONG BEHAVIOR PREFERENCES

We empirically notice that human models learned by imitating entire human trajectories cannot capture a wide range of behavior modalities well. So we manually designed a set of scripted policies that encode particular human preferences: Onion/Tomato Placement, which continuously places onions or tomatoes into the pot; Onion/Dish Everywhere, which keeps putting onions or dishes on the counters; and Tomato/Onion Placement and Delivery, which puts tomatoes/onions into the pot half of the time and tries to deliver soup the other half of the time. For a fair comparison, we ensure that all scripted policies are strictly different from the HSP policy pool. More details about the scripted policies and a full evaluation can be found in Appendix D.3. We remark that the scripted policies are only used for evaluation, not for training HSP. Tab. 2 shows the average game reward of all methods when paired with scripted policies; HSP significantly outperforms all baselines. In particular, in Distant Tomato, when cooperating with a policy with a strong tomato preference (Tomato Placement), HSP achieves a 10× higher score than the other baselines, suggesting that tomato-preferring behavior is well captured by HSP.
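As an illustration of how such a preference can be scripted, a "Tomato Placement" partner reduces to a simple two-state loop. The state and action names below are hypothetical placeholders, not the benchmark's actual API:

```python
def tomato_placement(state):
    """Scripted 'Tomato Placement' partner: fetch a tomato if not holding one,
    otherwise drop it into the nearest pot; all other events are ignored."""
    if state.get("holding") != "tomato":
        return "fetch_tomato"
    return "put_tomato_in_pot"
```

A real implementation would additionally plan a path to the tomato dispenser or pot, but the preference itself is fully captured by this loop.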

6.4. COOPERATION WITH HUMAN PARTICIPANTS

We recruited 60 volunteers (28.6% female, 71.4% male, aged 18-30) by posting the experiment advertisement on a public platform and divided them into 5 groups, one per layout. They were provided with a detailed introduction to the basic gameplay and the experiment process. Volunteers were fully aware of all their rights, and the experiments were approved by the department. A detailed description of the human study can be found in Appendix F.4. We note that our user study design differs from that of the original Overcooked paper (Carroll et al., 2019): an additional warm-up stage allows for diverse human behaviors under any possible preference, making a stronger testbed for human-assistive AIs.

6.4.1. RESULTS OF THE WARM-UP STAGE

The warm-up stage is designed to test the performance of AI policies in the face of diverse human preferences. Fig. 5 visualizes the human preferences over different methods reported in the warm-up stage. Each entry is the difference between the percentage of human players who prefer the row partner over the column partner and the percentage who prefer the column partner over the row partner. The detailed calculation can be found in Appendix F.4.3. HSP is preferred by humans by a clear margin. Since humans can freely explore any possible behavior, the results in Fig. 5 imply a strong generalization capability of HSP. We also summarize feedback from human participants in Appendix F.4.2.
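Concretely, each matrix entry of this kind can be computed as a net-preference margin. The per-participant record format below is our own assumption for illustration; the paper's exact procedure is in Appendix F.4.3:

```python
def preference_margin(records, row, col):
    """Net preference for `row` over `col`, in percentage points:
    %(participants preferring row) - %(participants preferring col).
    Each record maps an ordered method pair to the preferred method
    (or 'tie' when the participant had no preference)."""
    prefer_row = sum(1 for r in records if r.get((row, col)) == row)
    prefer_col = sum(1 for r in records if r.get((row, col)) == col)
    return 100.0 * (prefer_row - prefer_col) / len(records)
```

Ties contribute to neither count, so the margin ranges from -100 to +100.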

6.4.2. RESULTS OF THE EXPLOITATION STAGE

The exploitation stage is designed to test the scoring capability of different AIs. Note that a human player may simply adapt to the AI strategy when instructed to achieve high scores. So, in addition to final rewards, we also examine the emergent human-AI behaviors to measure the level of human-AI cooperation. The experiment layouts can be classified into two categories according to whether the layout allows diverse behavior modes. The first category contains simple onion-only layouts taken from (Carroll et al., 2019), including Asymm. Adv., Coord. Ring and Counter Circ. The second category contains the newly introduced layouts with both onions and tomatoes, Distant Tomato and Many Orders, which allow for a much wider range of behavior modes.

Onion-only Layouts: Fig. 6a shows the average reward in onion-only layouts for different methods when paired with humans. All methods achieve comparable episode rewards in the simpler layouts (Asymm. Adv. and Coord. Ring), while HSP is significantly better in the most complex layout, Counter Circ. Fig. 6b shows the frequency of successful onion passing between the human player and the AI player in Counter Circ.: the learned HSP policy is able to use the middle counter for passing onions, while the baseline policies are less capable of this strategy, suggesting better capabilities to assist humans.

Layouts with Both Onions and Tomatoes: The results and behavior analysis in Distant Tomato and Many Orders are as follows.

• Distant Tomato: In Distant Tomato, the optimal strategy is to always cook onion soups, while cooking tomato soups is suboptimal due to the much longer travel time. Interestingly, our human-AI experiments found that humans may have diverse biases over onions and tomatoes. However, all learned baseline policies tend to have a strong bias towards onions and often place onions into a pot that already has tomatoes in it. Tab. 3 reports the average number of such Wrong Placements made by different AI players. HSP makes the fewest wrong placements and is the only method that correctly places additional tomatoes into a pot partially filled with tomatoes, labeled Correct Placements. This suggests that HSP is the only method that effectively cooperates with biased human strategies, e.g., preferring tomatoes. In addition, as shown in Tab. 3, even when humans play the optimal strategy of cooking onion soups, HSP still achieves performance comparable to the other methods.

• Many Orders: In Many Orders, an effective strategy is to utilize all three pots to cook soups. Our experiments found that baseline policies tend to ignore the middle pot. Tab. 4 shows the average number of soups picked up from the middle pot by different AI players. The learned HSP policy is much more active in taking soups from the middle pot, leading to more soup deliveries. Furthermore, HSP achieves a substantially higher episode reward than the other methods, as shown in Tab. 4.

7. CONCLUSION

We developed Hidden-Utility Self-Play (HSP) to tackle zero-shot human-AI cooperation by explicitly modeling human biases as an additional reward function in self-play. HSP first generates a pool of diverse strategies and then trains an adaptive policy against this pool. Experiments verified that agents trained by HSP are more assistive to humans than baselines in Overcooked. Although our work suggests a new research direction for this fundamentally challenging problem, there are still limitations to be addressed. HSP requires domain knowledge to design a suitable set of events. There exists some work on learning reward functions rather than assuming event-based rewards (Shah et al., 2019; Zhou et al., 2021), so a future direction is to utilize learning-based methods to design rewards automatically. Another major limitation is the computation needed to obtain a diverse policy pool. Possible solutions include fast policy transfer and leveraging a prior distribution over reward functions extracted from human data (Barreto et al., 2017a). Learning and inferring policy representations of partners could also provide further improvement. We leave these issues for future work. Please visit https://sites.google.com/view/hsp-iclr for more information.

A THEOREM PROOFS

For simplicity, we assume that the state space and action space in our analysis are both discrete and finite, which is exactly the case for Overcooked, and that the rewards are bounded: |r(s, a)| ≤ R_max for all s ∈ S, a ∈ A.

Lemma 5.1. Given an MDP M = ⟨S, A, P, R_t⟩, for any policy π_w : S × A → [0, 1], there exists a hidden reward function R_w such that the two-player hidden-utility Markov game M′ = ⟨S, A, P, R_w, R_t⟩ has a Nash equilibrium (π*_a, π*_w) where π*_w = π_w.

Proof. Our analysis is based on the maximum-entropy reinforcement learning framework (Haarnoja et al., 2018; Ziebart et al., 2008; Wulfmeier et al., 2015). Given a reward function R and policies π_1 and π_2 of the two players, we consider the following maximum-entropy RL objective for policy π_i (1 ≤ i ≤ 2),

    J_i(π_1, π_2 | R) = E_τ[Σ_t γ^t (R(s_t, a_t^(1), a_t^(2)) + αH(π_i(·|s_t)))],  a_t^(i) ∼ π_i(·|s_t).

We first construct π_a given policy π_w to satisfy J_2(π_w, π_a | R_t) ≥ J_2(π_w, π′_a | R_t) for all π′_a, and then construct R_w such that J_1(π_w, π_a | R_w) ≥ J_1(π′_w, π_a | R_w) for all π′_w.

Step 1: Construct π_a given π_w. Given π_w, let π_a ∈ arg max_π J_2(π_w, π | R_t).

Step 2: Construct R_w such that J_1(π_w, π_a | R_w) ≥ J_1(π′_w, π_a | R_w) for all π′_w, given π_w and π_a. Given a fixed partner π_a, by regarding π_a as part of the environment dynamics, we can view the problem faced by π_w as a single-agent MDP M′ = ⟨S, A, P′, R_w, γ⟩, where S is the state space, A is the action space, P′ denotes the transition probability, and R_w is the reward function to be constructed.
More specifically, P′ is defined as

    P′(s′ | s, a) = Σ_ã P(s′ | s, a, ã) · π_a(ã | s).

In M′, given reward R_w, the objective of π_w becomes

    max_π E_τ[Σ_t γ^t (R_w(s_t, a_t) + αH(π(·|s_t)))],  a_t ∼ π(·|s_t).

The value function and the Q-function can be defined as

    V(s) = E_τ[Σ_t γ^t (R_w(s_t, a_t) + αH(π_w(·|s_t))) | a_t ∼ π_w(·|s_t), s_0 = s]    (4)
         = Σ_a π_w(a|s)(R_w(s, a) + γ E_{s′}[V(s′) | s, a]) + αH(π_w(·|s)),    (5)

    Q(s, a) = R_w(s, a) + γ E_{s′}[V(s′) | s, a].    (6)

It is sufficient to construct R_w such that V(s) is a fixed point of the Bellman backup operator (Sutton & Barto, 2018) T* under some R_w:

    (T*V)(s) = max_{d : Σ_a d(a) = 1} αH(d) + Σ_a d(a)(R_w(s, a) + γ E_{s′}[V(s′) | s, a]).    (7)

Now we assume V(s) is a fixed point of Eq. 7 and construct R_w. For all s ∈ S, π_w(·|s) should be a solution to the following maximization problem,

    max_d αH(d) + Σ_a d(a) Q(s, a)    (8)
    s.t. Σ_a d(a) = 1.    (9)

Applying the KKT conditions to the above optimization problem indicates that

    π_w(·|s) ∝ exp(Q(s, ·)/α),  ∀s.    (10)

Let π*_w(s) = arg max_a π_w(a|s), V*(s) = max_a Q(s, a), and A(s, a) = Q(s, a) − V*(s). By Eq. 10, we also have

    A(s, a) = α(log π_w(a|s) − log π_w(π*_w(s)|s)).    (11)

By the definition of the value function V(s),

    V(s) = Σ_a π_w(a|s) Q(s, a) + αH(π_w(·|s))    (12)
         = Σ_a π_w(a|s)(A(s, a) + V*(s)) + αH(π_w(·|s))    (13)
         = Σ_a π_w(a|s) A(s, a) + V*(s) + αH(π_w(·|s))    (14)
         = Σ_a π_w(a|s) A(s, a) + R_w(s, π*_w(s)) + γ E_{s′}[V(s′) | s′ ∼ P′(s, π*_w(s))] + αH(π_w(·|s))    (15)
         = E_τ[Σ_t γ^t (Σ_{a′} π_w(a′|s_t) A(s_t, a′) + R_w(s_t, a_t) + αH(π_w(·|s_t))) | a_t = π*_w(s_t)].    (16)

Let b(s) = R_w(s, π*_w(s)).
Then V(s) is determined given π_w and b,

    V(s) = E_τ[Σ_t γ^t (Σ_{a′} π_w(a′|s_t) A(s_t, a′) + b(s_t) + αH(π_w(·|s_t))) | a_t = π*_w(s_t)].    (17)

By A(s, a) = α(log π_w(a|s) − log π_w(π*_w(s)|s)) = Q(s, a) − V*(s),

    α(log π_w(a|s) − log π_w(π*_w(s)|s)) = R_w(s, a) + γ E_{s′}[V(s′) | s′ ∼ P′(s, a)] − V*(s),    (18)

    R_w(s, a) = α log(π_w(a|s) / π_w(π*_w(s)|s)) − γ E_{s′}[V(s′) | s′ ∼ P′(s, a)] + V*(s).    (19)

To summarize, for policy π_w, we can construct a valid hidden reward function R_w via the following process:
1. Choose a function b : S → R.
2. Compute A(s, a) by Eq. 11.
3. Compute V(s) and V*(s) by Eq. 17.
4. Construct R_w(s, a) by R_w(s, π*_w(s)) = b(s) and Eq. 19.

Now we show that, for any b : S → R, V(s) is a fixed point of the Bellman backup operator T* under the R_w constructed by the above process. This is straightforward. The constructed R_w ensures that α(log π_w(a|s) − log π_w(π*_w(s)|s)) = Q(s, a) − V*(s) (Eq. 19), and therefore π_w(a|s) ∝ exp(Q(s, a)/α), which means π_w is a solution to the maximization problem (8). So

    (T*V)(s) = max_d αH(d) + Σ_a d(a)(R_w(s, a) + γ E_{s′}[V(s′)])    (20)
             = Σ_a π_w(a|s) Q(s, a) + αH(π_w(·|s)) = V(s).    (21)

□

Theorem 5.1. For any ϵ > 0, there exist a mapping π̂_w, where π̂_w(R_w) denotes the policy π*_w in the NE of the hidden-utility Markov game M_w = ⟨S, A, P, R_w, R_t⟩ induced by R_w, and a distribution P_R : R → [0, 1] over the hidden reward space R, such that, for any adaptive policy π_A ∈ arg max_{π′} E_{R_w ∼ P_R}[J(π′, π̂_w(R_w))], π_A approximately maximizes the ground-truth objective with at most an ϵ gap, i.e.,

    E_{π_H ∼ P_H}[J(π_A, π_H)] ≥ max_{π′} E_{π_H ∼ P_H}[J(π′, π_H)] − ϵ.

Proof. Let K (K > |A|) be a large positive integer. We construct a discretization of the policy space Π by

    Π_K = {π : π(a|s) = i/K, i ∈ [K], ∀s ∈ S, a ∈ A, and Σ_a π(a|s) = 1, ∀s ∈ S}.

Note that Π_K is finite, i.e., |Π_K| ≤ (K + 1)^(|S|·|A|).
Let M = |Π_K| and let π^1, π^2, ..., π^M be an ordering of the policies in Π_K. For simplicity of notation, let δ = |A|/K. Given the discretization Π_K, it is straightforward to specify the nearest policy in Π_K for any policy π ∈ Π. Formally, for any policy π ∈ Π, let

    G(π) = arg min_{i=1,...,M} Σ_{s,a} |π(a|s) − π^i(a|s)|.

An obvious property of G is that, ∀s ∈ S, ‖π(·|s) − G(π)(·|s)‖_∞ ≤ |A|/K = δ.

For two policies π_1 and π_2, consider π_1 playing with π_2 and with G(π_2), respectively. Since the action distributions of π_2 and G(π_2) differ by at most δ at each state, we have

    |J(π_1, π_2) − J(π_1, G(π_2))| ≤ Σ_t γ^t · (1 − δ)^t · δ · 2R_max/(1 − γ) ≤ 2δR_max/(1 − γ)².    (22)

We can then derive a discretized approximation of the ground-truth policy distribution P_H as follows,

    P̂_H(π) = Pr_{π′ ∼ P_H}[π = G(π′)].    (23)

We can show that the difference between the objective under the ground-truth policy distribution P_H and that under the approximated policy distribution P̂_H is bounded. By Eq. 22, for any adaptive policy π_A,

    E_{π_H ∼ P̂_H}[J(π_A, π_H)] − E_{π_H ∼ P_H}[J(π_A, π_H)] = E_{π_H ∼ P_H}[J(π_A, G(π_H)) − J(π_A, π_H)]    (24)
    ≤ 2δR_max/(1 − γ)².    (25)

On the other hand, consider the following iterative process to find hidden reward functions for the policies in Π_K. For i = 1, ..., M, we find a hidden reward function R_w^(i) such that R_w^(i) ∉ {R_w^(j) | 1 ≤ j ≤ i − 1} and R_w^(i) can be constructed from π^i as in Lemma 5.1. Notice that, by the construction rule in Lemma 5.1, such an R_w^(i) must exist since we can specify an arbitrary b : S → R.

Let π̂_w(R_w^(i)) = π^i for all i = 1, ..., M, and let the hidden reward distribution P_R be P_R(R_w^(i)) = P̂_H(π^i), ∀i = 1, ..., M. We immediately see that, for any adaptive policy π_A, the objective is equivalent under the approximated policy distribution P̂_H and the hidden reward function distribution P_R,

    E_{R_w ∼ P_R}[J(π_A, π̂_w(R_w))] = E_{π_H ∼ P̂_H}[J(π_A, π_H)].    (26)

Finally, for any adaptive policy π_A ∈ arg max_{π′} E_{R_w ∼ P_R}[J(π′, π̂_w(R_w))] and any policy π′ ∈ Π,

    E_{π_H ∼ P_H}[J(π_A, π_H)] ≥ E_{π_H ∼ P̂_H}[J(π_A, π_H)] − 2δR_max/(1 − γ)²    (27)
    = E_{R_w ∼ P_R}[J(π_A, π̂_w(R_w))] − 2δR_max/(1 − γ)²    (28)
    ≥ E_{R_w ∼ P_R}[J(π′, π̂_w(R_w))] − 2δR_max/(1 − γ)²    (29)
    = E_{π_H ∼ P̂_H}[J(π′, π_H)] − 2δR_max/(1 − γ)²    (30)
    ≥ E_{π_H ∼ P_H}[J(π′, π_H)] − 4δR_max/(1 − γ)².    (31)

Let K ≥ 4|A|R_max / (ϵ(1 − γ)²), and we have E_{π_H ∼ P_H}[J(π_A, π_H)] ≥ max_{π′} E_{π_H ∼ P_H}[J(π′, π_H)] − ϵ. □

B ENVIRONMENT DETAILS

B.1 DESCRIPTION

The Overcooked environment, first introduced in (Carroll et al., 2019), is based on the popular video game Overcooked, where multiple players cooperate to finish as many orders as possible within a time limit. In this simplified version of the original game, two chefs, each controlled by a player (either human or AI), work in grid-like layouts. Chefs can move between non-table tiles and interact with table tiles by picking up or placing objects. Ingredients (e.g., onions and tomatoes) and empty dishes can be picked up from the corresponding dispenser tiles and placed on empty table tiles or into pots. The typical pipeline for completing an order is: (1) players put the appropriate ingredients into a pot; (2) the pot starts cooking automatically once filled and takes a certain amount of time (depending on the recipe) to finish; (3) a player harvests the cooked soup with an empty dish and delivers it to the serving area. The observation for an agent includes the whole layout, items on the counters and in the pots, player positions, orders, and remaining time. The possible actions are up, down, left, right, no-op, and "interact" with the tile the player is facing. Reward is given to both agents upon successful soup delivery, with the amount varying with the type of soup. An episode terminates when the time limit is reached. The environment used in (Carroll et al., 2019) has only onions as ingredients and onion soups as orders. We evaluate all methods in three of its layouts, namely Asymmetric Advantages, Coordination Ring, and Counter Circuit, each designed to enforce a specific cooperation pattern. Our work introduces two new layouts, Distant Tomato and Many Orders, with new ingredients and order types to make cooperation more challenging. In Distant Tomato, an onion soup takes 20 ticks to finish and gives 20 reward when delivered, while a tomato soup takes 10 ticks and gives the same reward but requires more movement to fetch the ingredient.

The two players need to agree on which type of soup to cook in order to reach a high score; failure to cooperate may result in tomato-onion soups that give no reward. In Many Orders, there are three types of orders: onion, tomato, and 1-onion-2-tomato. To fully utilize the three pots, the players need to work seamlessly in filling not just the pots near each of them but also the pot in the middle. We show all the layouts in Fig. 7 and summarize the cooperation patterns of interest as follows.
• Asymmetric Advantages tests whether the players can choose strategies suited to their strengths.
• Coordination Ring requires the players not to block each other when traveling between the two corners.
• Counter Circuit embeds a non-trivial but efficient strategy of passing onions through the middle counter, which needs close cooperation.
• Distant Tomato and Many Orders both encourage the players to reach an agreement on the fly in order to achieve a high reward.
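The cooking pipeline described above can be sketched as a minimal Python model. This is an illustration, not code from the released environment: a `Pot` auto-starts cooking once it holds three ingredients, counts down a recipe-dependent timer, and yields the delivery reward on harvest; cook times and rewards follow the Distant Tomato description, and a mixed onion-tomato soup is worth nothing.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative values from the Distant Tomato description: onion soup
# cooks in 20 ticks, tomato soup in 10; both deliver 20 reward, while a
# mixed onion-tomato soup gives no reward.
COOK_TIME = {"onion": 20, "tomato": 10, "mixed": 20}
REWARD = {"onion": 20, "tomato": 20, "mixed": 0}

@dataclass
class Pot:
    contents: List[str] = field(default_factory=list)
    cook_timer: Optional[int] = None  # None until cooking starts
    recipe: Optional[str] = None

    def add(self, ingredient: str) -> bool:
        """Place an ingredient; cooking auto-starts at 3 items."""
        if self.cook_timer is not None:
            return False  # already cooking
        self.contents.append(ingredient)
        if len(self.contents) == 3:
            kinds = set(self.contents)
            self.recipe = kinds.pop() if len(kinds) == 1 else "mixed"
            self.cook_timer = COOK_TIME[self.recipe]
        return True

    def tick(self) -> None:
        """Advance the cooking timer by one environment step."""
        if self.cook_timer is not None and self.cook_timer > 0:
            self.cook_timer -= 1

    def harvest(self) -> Optional[int]:
        """Take a ready soup with a dish; returns the delivery reward."""
        if self.cook_timer != 0:
            return None  # nothing ready yet
        reward = REWARD[self.recipe]
        self.contents, self.cook_timer, self.recipe = [], None, None
        return reward

pot = Pot()
for _ in range(3):
    pot.add("onion")
for _ in range(20):
    pot.tick()
print(pot.harvest())  # 20
```

The same model makes the Distant Tomato failure mode concrete: adding an onion to a pot that already holds tomatoes locks in the "mixed" recipe, whose delivery reward is zero.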

B.2 EVENTS

In Overcooked, we consider the following events for the random search in HSP and for reward shaping during training of all methods:
• putting an onion/tomato/dish/soup on the counter,
• picking up an onion/tomato/dish/soup from the counter,
• picking up an onion from the onion dispenser,
• picking up a tomato from the tomato dispenser,
• picking up a dish from the dish dispenser,
• picking up a ready soup from the pot with a dish,
• placing an onion/tomato into the pot,
• valid placement: after the placement, we can still finish an order with a positive reward by placing other ingredients,
• optimal placement: the placement is optimal if the maximum order reward achievable for this particular pot is not decreased by the placement,
• catastrophic placement: the placement is catastrophic if the maximum order reward achievable for this particular pot drops from positive to zero after the placement,
• useless placement: the placement is useless if the maximum order reward achievable for this particular pot is already zero before the placement,
• useful dish pickup: picking up a dish is useful when there are no dishes on the counter and the number of dishes already held by players is less than the total number of unready and ready soups,
• delivering a soup to the serving area.
Additionally, in Distant Tomato, we consider the following events only for reward shaping:
• placing a tomato into an empty pot,
• optimal tomato placement: a placement that is both optimal and a tomato placement,
• useful tomato pickup: the agent picks up a tomato when the partner is not holding a tomato and there is a pot that is not full and contains only tomatoes.
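The event list above induces the event-based feature map ϕ used throughout the paper: each timestep is summarized by per-event occurrence counts, and a hidden reward is a linear function of those counts. A minimal sketch (the event labels here are illustrative shorthand, not identifiers from the released code):

```python
# Sketch of the event-based feature map phi: one counter per event type,
# summed over a timestep's events. Labels below are illustrative names
# for a subset of the events listed above.
EVENTS = [
    "put_onion_on_counter", "pickup_onion_from_counter",
    "pickup_onion_from_dispenser", "pickup_tomato_from_dispenser",
    "pickup_dish_from_dispenser", "pickup_soup_from_pot",
    "place_onion_in_pot", "place_tomato_in_pot",
    "valid_placement", "optimal_placement", "catastrophic_placement",
    "useless_placement", "useful_dish_pickup", "deliver_soup",
]
INDEX = {name: i for i, name in enumerate(EVENTS)}

def phi(step_events):
    """Count the occurrences of each event at one timestep."""
    feats = [0] * len(EVENTS)
    for event in step_events:
        feats[INDEX[event]] += 1
    return feats

def hidden_reward(step_events, w):
    """Linear hidden reward R_w = phi(s, a1, a2)^T w."""
    return sum(f * w_j for f, w_j in zip(phi(step_events), w))
```

A weight vector `w` with a large positive entry on `place_tomato_in_pot`, for example, models a tomato-preferring partner.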

C OVERCOOKED VERSION

In our experiments, we use two versions of Overcooked, both for a fair comparison with prior works and to introduce challenging layouts. One version, in which we test Asymmetric Advantages, Coordination Ring and Counter Circuit, is consistent with the "neurips2019" branch of the released GitHub repository of (Carroll et al., 2019); we remark that MEP (Zhao et al., 2021) also follows this version. Following it also allows us to evaluate with the human proxy models provided in the released code of (Carroll et al., 2019). The other version is an up-to-date version of Overcooked, which supports tomatoes and user-defined orders. We note that a pot automatically starts cooking once there are three items in it in the former version, while the latter version requires an additional "interact" action to start cooking, since it supports orders with different numbers of ingredients. However, this additional "interact" significantly affects a human player's interactive experience. Therefore, we modify the latter version to restrict orders to 3 items and to support auto-cooking when a pot holds 3 items. For more details, please refer to the released code.

D IMPLEMENTATION DETAILS

D.1 HSP

Algorithm 3: Hidden-Utility Self-Play
for i = 1 → N do
    Train π_w^(i) and π_a^(i) under sampled R_w^(i);
end
Run greedy policy selection to keep only K policies;
Initialize policy π_A;
repeat
    Rollout with π_A and a sampled π_w^(i);
    Update π_A;
until enough iterations;

The pseudocode of HSP is shown in Algo. 3. We implement HSP on top of MAPPO (Yu et al., 2021). Following standard practice, we use multiprocessing to collect trajectories in parallel and then update the models. In the first stage, we use MLP policies, which empirically yield better results. In the second stage, we use RNN policies so that the adaptive policy can infer the intention of its partner by observing the partner's history and make decisions accordingly for better adaptation. As suggested in (Tang et al., 2020), we add the identities of the policies in the policy pool as an additional feature to the critic. For better utilization of computational resources, each environment sub-process loads a uniformly sampled policy and performs inference on CPUs, while inference of the adaptive policy is batched across sub-processes on a GPU.
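The two-stage structure of Algo. 3 can be condensed into a short Python sketch. All heavy machinery is stubbed out: `sample_reward`, `train_pair`, `greedy_select`, `init_policy`, `rollout`, and `ppo_update` are hypothetical callables standing in for the MAPPO-based training code, not functions from the released implementation.

```python
import random

def hsp(sample_reward, train_pair, greedy_select, init_policy,
        rollout, ppo_update, n_pairs=36, pool_size=18, n_iters=1000):
    """Two-stage HSP sketch (Algo. 3); all callables are placeholders."""
    # Stage 1: train biased policies under sampled hidden rewards,
    # then greedily keep a diverse subset as the policy pool.
    pool = [train_pair(sample_reward()) for _ in range(n_pairs)]
    pool = greedy_select(pool, pool_size)

    # Stage 2: train the adaptive policy pi_A against partners sampled
    # uniformly from the pool, maximizing the task reward.
    pi_A = init_policy()
    for _ in range(n_iters):
        partner = random.choice(pool)  # one partner per rollout batch
        batch = rollout(pi_A, partner)
        pi_A = ppo_update(pi_A, batch)
    return pi_A
```

In the actual implementation, rollouts are collected by parallel environment sub-processes (each loading its own sampled partner), and each update uses data gathered against many partners at once rather than a single sampled one.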

D.2 BASELINES

For a fair comparison, we implement all baselines as two-staged methods and train layout-specific agents. We remark that our implementation of MEP achieves substantially higher scores than reported in the original paper (Zhao et al., 2021) when evaluated with the same human proxy models as MEP. All baselines are implemented with the techniques stated above: loading policies from the pool per sub-process and adding the identities of the policies in the policy pool as an additional feature. We detail the baselines and their differences from the original papers below.

FCP (Strouse et al., 2021): The differences between our implementation and the original FCP are as follows.
1. The original FCP uses image-based egocentric observations, while we use the feature-based observations provided in Overcooked.
2. The original FCP uses a pool size of 96, while we use 36. We empirically found 36 to be a sufficiently large pool size in our experiments. As shown in Table 21, in the three layouts that have human proxy models, there is no significant difference between a pool size of 36 and one of 72.

MEP (Zhao et al., 2021): The differences between our implementation and the original MEP are as follows.
1. While the released code of MEP uses an MLP policy in the second training stage, we found an RNN policy to work better. Intuitively, for better cooperation, the adaptive policy should infer the intention of its partner by observing the state-action history.
2. MEP uses a pool size of 15, while we use 36.
3. MEP uses prioritized sampling in the second stage, which favors weak policies in the pool, while we adopt uniform sampling for MEP since we found prioritized sampling unhelpful with our carefully tuned implementation (shown in Table 5).
4. In the released code of MEP, policy updates are performed on data collected against only one policy from the pool, while we perform policy updates on data collected against many policies from the pool. This avoids updates being biased towards specific policies.

TrajDiv (Lupu et al., 2021): While the original TrajDiv is tested in hand-crafted MDPs and Hanabi, we test TrajDiv in Overcooked. Although (Lupu et al., 2021) suggests training the adaptive policy and the policy pool together in a single stage, we follow MEP and FCP in using a two-staged design that trains the adaptive policy in the second stage.

D.3 SCRIPTED POLICIES

To evaluate all methods with policies that have strong preferences, we consider the following scripted policies:
• Onion/Tomato/Dish Everywhere continuously tries to put onions, tomatoes or dishes on the counter.
• Onion/Tomato Placement always tries to put an onion or tomato into the pot.
• Delivery delivers a ready soup to the serving area whenever possible.
• Onion/Tomato Placement and Delivery puts tomatoes/onions into the pot half of the time and tries to deliver soups the other half of the time.
For Counter Circuit, we additionally consider a scripted policy, named Onion to Middle Counter, which keeps putting onions randomly on the counter in the middle of the layout. The input to these scripted policies is the ground-truth state of the game, which is accessible via the game simulator. When a scripted policy is unable to perform the event of its interest at some state, it walks to a random empty grid cell; for example, Onion Placement chooses a random walk when all pots are full. We ensure that these scripted policies are strictly different from the policies in the policy pool of HSP. For more details, please refer to the released code.

We also provide evidence that the scripted policies are sufficiently different from those in the training pool, using the expected event counts of scripted and biased policies. Recall that the expected event count for a pair of policies (π_a, π_b) is EC(π_a, π_b) = E[Σ_t ϕ(s_t, a_t) | π_a, π_b]. Let π_HSP be the HSP adaptive policy, {π_w^(n)}_{n∈[N]} be the set of biased policies in the training pool, and {π_s^(m)}_{m∈[M]} be the set of scripted policies. Let Π = {π_w^(n)}_{n∈[N]} ∪ {π_s^(m)}_{m∈[M]} be the union of biased policies and scripted policies. For each policy π′ ∈ Π, we measure how close it is to the rest of the policies in Π in expected event count, i.e., the event-based difference

    EventDiff_Π(π′) = min_{π″ ∈ Π∖{π′}} Σ_k c_k · |EC_k(π′, π_HSP) − EC_k(π″, π_HSP)|,

where c_k is a frequency normalization constant. A large event-based difference then indicates that π′ is sufficiently different from the other policies in Π.
We calculate the event-based difference for all biased and scripted policies. Tab. 6 reports the average event-based difference of biased and scripted policies, respectively. Scripted policies consistently have a larger average event-based difference, indicating that they are sufficiently different from the biased policies used for training the HSP adaptive policy.
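The event-based difference reduces to a nearest-neighbor L1 distance over expected event counts. A small sketch with hypothetical policy names and counts (`ec` maps each policy to its event-count vector when paired with the HSP policy; the numbers are illustrative, not measured values):

```python
# Sketch of EventDiff: each policy is summarized by its expected event
# counts EC_k against the HSP policy, and its event-based difference is
# the frequency-weighted L1 distance to the nearest other policy.
def event_diff(ec, c):
    """Return {policy: min weighted L1 distance to any other policy}."""
    diffs = {}
    for p, ec_p in ec.items():
        diffs[p] = min(
            sum(ck * abs(a - b) for ck, a, b in zip(c, ec_p, ec_q))
            for q, ec_q in ec.items() if q != p
        )
    return diffs

# Hypothetical event counts for two biased and one scripted policy.
ec = {
    "biased_1":   [4.0, 0.0, 2.0],
    "biased_2":   [3.5, 0.5, 2.0],
    "scripted_1": [0.0, 9.0, 0.0],
}
c = [1.0, 0.5, 1.0]  # per-event normalization weights
d = event_diff(ec, c)
print(max(d, key=d.get))  # scripted_1
```

The scripted policy sits far from both biased policies in event space, so its event-based difference is the largest, mirroring the pattern reported in Tab. 6.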

E TRAINING DETAILS

E.1 HYPERPARAMETERS

HSP and all baselines are two-staged solutions that first construct a policy pool and then train an adaptive policy π_A to maximize the game reward w.r.t. the induced pool. The network architecture in both stages is composed of 3 convolution layers with max pooling; hyperparameters of these layers are listed in Table 7. Each convolution layer is followed by a max pooling layer with a kernel size of 2. For MLP policies, we add two linear layers after the convolutions. For RNN policies, we add a 1-layer GRU after the convolutions and two linear layers after the GRU. The hidden sizes of these linear layers and the GRU layer are all 64. We use ReLU as the activation function between layers and LayerNorm after the GRU and all linear layers except the last one. The output is a 6-dimensional vector denoting the categorical action distribution. Common hyperparameters for all methods in the 5 layouts are listed in Table 8 and Table 9. For MEP, we use the hyperparameters suggested in the original paper (Zhao et al., 2021); detailed hyperparameters of MEP are shown in Table 10, where the population entropy coefficient adjusts the importance of the population entropy term. Detailed hyperparameters of TrajDiv are shown in Table 11, where traj. gamma is the discount factor used in the local action kernel and the diversity coefficient adjusts the importance of the diversity term. For each of MEP, FCP and TrajDiv, we train 12 policies in the first stage and, following the convention of MEP (Zhao et al., 2021) and FCP (Strouse et al., 2021), take the initial/middle/final checkpoints of each policy to build the policy pool, leading to a pool size of 36. For HSP, we use a random search to first train 36 biased policies and then select 18 of them. We then combine these biased policies with past checkpoints of 6 policies from the policy pool of MEP to build the policy pool of HSP, again leading to a pool size of 36.
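As a sanity check on the conv/pool stack described above, the spatial size of the feature map can be computed from the standard convolution arithmetic. The kernel sizes, strides, and padding below are assumptions for illustration; Table 7 lists the actual values.

```python
# Spatial-size arithmetic for a stack of 3 conv layers, each followed by
# a 2x2 max pool (as in the paper's feature extractor). Kernel/stride/
# padding values here are assumed, not taken from Table 7.
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of one convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def feature_map_size(h, w, convs=((3, 1, 1),) * 3, pool=2):
    """Apply each (kernel, stride, padding) conv followed by pooling."""
    for kernel, stride, padding in convs:
        h = conv_out(h, kernel, stride, padding) // pool
        w = conv_out(w, kernel, stride, padding) // pool
    return h, w

print(feature_map_size(8, 8))  # (1, 1)
```

With 3x3 kernels and padding 1, three conv+pool stages shrink an 8x8 layout grid to a single cell, after which the 64-unit MLP/GRU heads operate on the flattened channels.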
To construct the policy pool for HSP, we perform a random search over possible hidden reward functions. Each reward function is formulated as a linear function over the event-based features, i.e., R = {R_w : R_w(s, a¹, a²) = ϕ(s, a¹, a²)ᵀw, ‖w‖_∞ ≤ C_max}, where ϕ : S × A × A → Rᵐ specifies the occurrences of different events when taking joint action (a¹, a²) at state s. To perform the random search, instead of directly sampling each w_j from the interval [−C_max, C_max], we sample each w_j from a set of candidate values C_j. Tab. 12 shows C_j in Asymmetric Advantages, Coordination Ring and Counter Circuit; Tab. 13 and Tab. 14 show C_j in Distant Tomato and Many Orders, respectively. A detailed description of the events is given in Sec. B.2. Note that in addition to the events, we also include the order reward as one element of the random search. To filter out duplicated policies, we define an event-based diversity for a subset S of policies, i.e.,

    ED(S) = Σ_{i,j∈S} Σ_k c_k · |EC_k^(i) − EC_k^(j)|.

Candidate values C_j per event (for a layout with tomatoes):
Event | C_j
Picking up an onion from the onion dispenser | −5, 0, 5
Picking up a tomato from the tomato dispenser | 0, 10, 20
Picking up a dish from the dish dispenser | 0, 5
Picking up a soup | −5, 0, 5
Viable placement | −10, 0, 10
Optimal placement | −10, 0
Catastrophic placement | 0, 10
Placing an onion into the pot | −3, 0, 3
Placing a tomato into the pot | −3, 0, 3
Delivery | −10, 0
Order reward | 0, 1

We analyze trajectories by measuring the self-delivery ratio, i.e., the ratio of deliveries by the specific player to the total number of deliveries in a trajectory, and the self-cooking ratio, i.e., the ratio of onions the player places in the pot to the total number of pot placements in a trajectory. The distributions of these trajectories are shown in Fig. 8. From the figure, we observe that the learned human models cannot fully cover human behaviors. This suggests that evaluation results with the learned human models cannot provide a comprehensive comparison among different methods.
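The random search over candidate sets C_j, followed by a diversity-based filter, can be sketched as follows. The candidate sets echo a few rows of the C_j tables; the greedy filter is an illustrative implementation, and, as a stand-in for the paper's ED(S) over expected event counts of trained policies, it measures distance directly between weight vectors.

```python
import random

# A few candidate sets C_j, echoing rows of the C_j tables above.
C = {
    "place_onion_in_pot":  [-3, 0, 3],
    "place_tomato_in_pot": [-3, 0, 3],
    "delivery":            [-10, 0],
    "order_reward":        [0, 1],
}

def sample_w(rng):
    """Draw one hidden-reward weight vector, one value per event."""
    return {event: rng.choice(values) for event, values in C.items()}

def greedy_filter(ws, k, dist):
    """Greedily keep up to k distinct vectors, maximizing the minimum
    distance of each newly added vector to the kept set."""
    kept = [ws[0]]
    pool = [w for w in ws if w not in kept]
    while pool and len(kept) < k:
        best = max(pool, key=lambda w: min(dist(w, u) for u in kept))
        kept.append(best)
        pool = [w for w in pool if w != best]
    return kept

def l1(a, b):
    """Weight-space L1 distance (stand-in for the event-count metric)."""
    return sum(abs(a[e] - b[e]) for e in C)

rng = random.Random(0)
ws = [sample_w(rng) for _ in range(20)]
kept = greedy_filter(ws, 4, l1)
```

In the paper, each sampled `w` would first be used to train a biased policy, and the filter would then compare the resulting expected event counts rather than the raw weights.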

Table 18: Reward shaping for Many Orders in the second stage.
Event | Value
Picking up a dish from the dish dispenser | 3
Picking up a ready soup from the pot | 5

Table 19: Reward shaping for Distant Tomato in the second stage.
Event | Value
Picking up a dish from the dish dispenser | 3
Picking up a ready soup from the pot | 5
Useful tomato pickup | 10
Optimal tomato placement | 5
Placing a tomato into an empty pot | −15

F.4.2 HUMAN FEEDBACK

We collected and analyzed feedback from the participants to see how they felt playing with the AI agents. Here we summarize the typical reflections.
1. In Coordination Ring, the most annoying thing reported is players blocking each other during movement. To maneuver effectively in the ring-like layout, players must reach a temporary agreement on going either clockwise or counterclockwise. HSP is the only AI able to make way for the other player, while the others cannot recover by themselves once stuck. For example, both FCP and TrajDiv players tend to take a plate and wait next to the pot immediately after one pot is filled, but they can neither take a detour when blocked on their way to the dish dispenser nor yield their position to a human player trying to pass through. The video recorded in the human study can be found in Part 4.2 of https://sites.google.com/view/hsp-iclr.
2. In Counter Circuit, one efficient strategy is passing onions via the counter in the middle of the room: a player at the bottom fetches onions and places them on the counter, while the player at the top picks up the onions and puts them into the pots. HSP is the only AI player capable of this strategy in both the top and bottom positions, and it achieves the highest onion-passing frequency when cooperating with human players, as shown in Fig. 6b.
3. In Distant Tomato, one critical point is that mixed (onion-tomato) soups give no reward, which means the two players need to agree on which soup to cook. All AI agents perform well when the other player focuses on onion soups. However, all AI agents except HSP fail to deal with tomato-preferring partners, as shown in Tab. 3. FCP, MEP and TrajDiv agents never actively choose to place tomatoes and keep placing onions even when a pot already has tomatoes in it, resulting in invalid orders. On the contrary, HSP chooses to place tomatoes when there are tomatoes in the pot. Participants commonly agreed that the HSP agent is the best partner to play with in this layout. The video recorded in the human study can be found in Part 4.2 of https://sites.google.com/view/hsp-iclr.
4. In Many Orders, most participants stated that HSP is able to pick up soups from all three pots, while the other AI agents only concentrate on the pot in front of them and ignore the middle pot even when the human player attempts to use it.



Figure 1: Layouts in Overcooked. From left to right are Asymmetric Advantages, Coordination Ring, Counter Circuit, Distant Tomato and Many Orders respectively, with orders shown below.

Figure 2: Motivating example. (a) FCP converges to the optimal onion soup strategy. (b) A failure case of FCP with a human partner: FCP agent corrupts the human's plan of cooking tomato soups.


Figure 4: Average game reward by using different numbers of parallel rollout threads in MAPPO to train the final adaptive policy. More parallel threads imply a larger training batch size.

Figure 5: Human preference in the warm-up stage. The unit denotes the difference between the percentage of human players who prefer row partners over column partners and the percentage who prefer column partners over row partners. HSP is consistently preferred by human participants by a clear margin.

Figure 6: (a) Average episode reward in onion-only layouts of different methods when paired with humans in the exploitation stage. HSP has comparable performance with the baselines in Asymm. Adv. and Coord. Ring, and is significantly better in the most complex Counter Circ. layout. (b) The onion passing frequency in Counter Circ. shows that HSP is the most capable, among the baselines, of passing onions via the counter, suggesting better capabilities to assist humans.

Figure 7: All 5 layouts used in our work (from left to right): Asymmetric Advantages, Coordination Ring, Counter Circuit, Distant Tomato, and Many Orders, each featuring specific cooperation patterns we want to study.




Figure 8: Trajectories induced by the learned human models and human players in Asymmetric Advantages, Coordination Ring and Counter Circuit. Each point or triangle denotes a trajectory, with the X-axis coordinate being the self-cooking ratio, i.e., the ratio of onions the player places in the pot to the total number of placements in the trajectory, and the Y-axis coordinate being the self-delivery ratio, i.e., the ratio of deliveries by the player to the total number of deliveries in the trajectory. Triangles and points denote trajectories induced by human players and learned human models, respectively. Different colors stand for different player indices. "BC" represents the learned human models, and "Human" denotes human players. Clearly, trajectories induced by the learned human models cannot fully cover those of human players.


Average episode reward and standard deviation with unseen testing scripted policies. HSP significantly outperforms all baselines.

Average onion-preferred episode reward and frequency of different emergent behaviors in Distant Tomato during the exploitation stage. Onion-Preferred Episode Reward is the average episode reward when humans prefer onions. Wrong Placements and Correct Placements are the average numbers of wrong and correct placements into a pot partially filled with tomatoes. HSP makes the lowest number of wrong placements and is the only method that can place tomatoes correctly, suggesting that HSP is effective at cooperating with biased human strategies.

Average episode reward and average number of picked-up soups from the middle pot by different AI players in Many Orders during the exploitation stage. HSP achieves significantly better performance and is much more active in taking soups from the middle pot than baselines.

Average episode reward and standard deviation (over 5 seeds) with different sampling methods of MEP in Asymmetric Advantages, Coordination Ring, and Counter Circuit. The Pos. column ("1" or "2") indicates the role played by the AI policy.

The average event-based differences of the biased and scripted policies, respectively.

CNN feature extractor hyperparameters.

Common hyperparameters in the first stage.

E.2 CONSTRUCTING THE POLICY POOL FOR HSP

The coefficient c_k balances the importance of different kinds of events. We simply set c_k as a normalization constant, i.e., c_k = max_{i∈[N]} EC

Common hyperparameters in the second stage.

TrajDiv hyperparameters in the first stage.

C_j for random search in Many Orders.

Reward shaping for Asymmetric Advantages, Coordination Ring and Counter Circuit in the first stage.

Table 4 shows that the HSP agent picks up the most soups from the middle pot while also achieving the highest average episode reward.

F.4.3 HUMAN PREFERENCE ON DIFFERENT AI AGENTS

Figure 9 illustrates human preference for the different AI agents. In all layouts except Coordination Ring, a relatively restricted and simple layout, human players strongly prefer HSP over the other AI agents. In Coordination Ring, though human players rank MEP above HSP, HSP is still significantly preferred over FCP and TrajDiv.

Calculation Method: Human preference for different methods is computed as follows. Assume we are comparing human preference between method A and method B. Let N be the total number of human players attending the experiments in one layout, and N_A be the number of human players who prefer method A.

Average episode reward and standard deviation (over 5 seeds) on 3 layouts for different methods played with human proxy policies. All values within 5 standard deviations of the maximum episode return are marked in bold. The Pos. column indicates the roles played by AI policies.

Average episode reward and standard deviation (over 5 seeds) on 3 layouts for different methods played with human proxy policies. The Pos. column indicates the roles played by AI policies.

Average episode reward and standard deviation (over 5 seeds) on 3 layouts for different methods played with human proxy policies. The Pos. column indicates the roles played by AI policies.


ACKNOWLEDGMENTS

This research was supported by the National Natural Science Foundation of China (No. U19B2019, 62203257, M-0248), the Tsinghua University Initiative Scientific Research Program, the Tsinghua-Meituan Joint Institute for Digital Life, the Beijing National Research Center for Information Science and Technology (BNRist), and the Beijing Innovation Center for Future Chips and 2030 Innovation Megaprojects of China (Programme on New Generation Artificial Intelligence), Grant No. 2021AAA0150000.

Event | C_j
Picking up an onion from onion dispenser | -10, 0, 10
Picking up a dish from dish dispenser | 0, 10
Picking up a ready soup from the pot | -10, 0, 10
Placing an onion into the pot | -10, 0, 10
Delivery | -10, 0
Order reward | 0, 1
Picking up an onion from onion dispenser | -5, 0, 5
Picking up a tomato from tomato dispenser | 0, 10, 20
Picking up a dish from dish dispenser | 0, 10
Picking up a soup | -5, 0, 5
Viable placement | -10, 0, 10
Optimal placement | -10, 0, 10
Catastrophic placement | 0, 10
Placing an onion into the pot | -10, 0, 10
Placing a tomato into the pot | -10, 0, 10
Delivery | -10, 0
Order reward | 0, 1
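The random search over biased reward functions can be sketched as sampling one candidate weight per event from sets like those above. A minimal illustration with hypothetical event keys (the candidate sets for the first, onion-only block are taken from the table; the helper is not the released code):

```python
import random

# Candidate weights per event (from the first block of the table above).
CANDIDATES = {
    "pickup_onion": [-10, 0, 10],
    "pickup_dish": [0, 10],
    "pickup_ready_soup": [-10, 0, 10],
    "place_onion": [-10, 0, 10],
    "delivery": [-10, 0],
    "order_reward": [0, 1],
}

def sample_biased_reward(candidates, seed=None):
    """Draw one hidden-reward weight vector w by picking a value per event."""
    rng = random.Random(seed)
    return {event: rng.choice(values) for event, values in candidates.items()}

w = sample_biased_reward(CANDIDATES, seed=0)
print(w)  # one of 3*2*3*3*2*2 = 216 possible weight vectors
```

Each sampled weight vector defines one biased reward function; training a self-play policy against each sample populates the augmented policy pool.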

F.1 COOPERATION WITH LEARNED HUMAN MODELS

Table 20 shows the average episode reward and standard deviation (over 5 seeds) on 3 layouts for different methods played with human proxy policies. All values within 5 standard deviations of the maximum episode return are marked in bold. These three simple layouts may not fully reflect the performance gap between the baselines and HSP. The results with learned human models are reported for a fair comparison with existing SOTA methods. In addition, our implementation of the baselines achieves substantially better results than reported in their original papers with the same human proxy models, which makes the improvement margin appear smaller. We also remark that the learned human models have limited representational power to imitate natural human behaviors, which typically cover many behavior modalities. Here we give empirical evidence that the learned human models fail to fully reflect human behaviors.

F.1.1 EMPIRICAL EVIDENCE

The original Overcooked paper (Carroll et al., 2019) collected human-play trajectories. We then collect game trajectories played by the learned human models and compare them with the human-play trajectories.

Event | Value
Optimal placement | 3
Picking up a dish from dish dispenser | 3
Picking up a ready soup from the pot | 5

2) the use of MEP policies under the game reward for half of the pool size. We measure the performance with human proxies by turning these options off. For "HSP w.o. Filtering", we keep all the policies found by random search in the policy pool, which results in a larger pool size of 54 (18 MEP policies and a total of 36 random-search ones). For "HSP w.o. MEP", we exclude MEP policies from the policy pool and keep all the biased policies without filtering, which leads to the same pool size of 36. The results are shown in Table 22.

F.3 COOPERATION WITH SCRIPTED POLICIES WITH STRONG BEHAVIOR PREFERENCES

Table 23 illustrates average episode reward and standard deviation (over 5 seeds) in all layouts with scripted policies. All values within a difference of 5 from the maximum value are marked in bold.

F.4 HUMAN-AI EXPERIMENT F.4.1 EXPERIMENT SETTING

We recruited 60 volunteers by posting an advertisement for the experiment on a public platform. They were provided with a detailed introduction to the basic gameplay and the experiment process. The Overcooked game was deployed remotely on a server that the volunteers could access with their browsers. According to their feedback, over 90 percent of the volunteers had no prior experience with Overcooked. We uniformly divided the 60 volunteers into 5 groups, one for each of the 5 layouts, and designed the experiment to last around 30 minutes per volunteer to ensure the validity of the data. Due to the availability of volunteers, the experiments were conducted within two consecutive days.

The experiment has two stages. In the first, warm-up, stage, participants are encouraged to explore the behaviors of the 4 given AI agents without a time limit. Afterwards, they are asked to comment on their game experience, e.g., whether the AI agents are cooperative and comfortable to play with, and to rank the agents accordingly. In the second stage, each participant is instructed to achieve scores as high as possible in 24 games (4 AI agents × 2 player positions × 3 repeats).

We remark that, on the environment side, different from the human-AI experiments performed by prior works in Overcooked (Zhao et al., 2021; Carroll et al., 2019), we slow down the AI agents so that they have a similar speed to human players. More specifically, 7 idle steps are inserted before each step of the AI agent. This is necessary because, in our prior user studies, we found that human players commonly feel uncomfortable when the AI agent is much faster and they can contribute little to the score.

Table 24 shows the average reward per episode during the second stage in all layouts. All methods have comparable episode rewards in Asymm. Adv. and Coord. Ring; there is no room for improvement since all methods have reached the highest possible rewards.
In Counter Circ., the most complex layout in this category, HSP achieves better performance than the baselines: HSP obtains a reward above 155, while the most competitive baseline, MEP, obtains about 134. We remark that the reward difference between HSP and MEP is around 20, which is exactly the value of one onion soup delivery. This implies that the HSP agent can, on average, deliver one more soup per game episode with humans than any baseline, which is a significant improvement.
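The AI slowdown described above can be implemented as a thin wrapper that emits a fixed number of idle actions before each real action of the wrapped policy. A sketch under stated assumptions: the `IDLE` action name and the callable policy interface are hypothetical, not the released environment's API.

```python
IDLE = "stay"  # hypothetical no-op action name in Overcooked

class SlowedPolicy:
    """Insert k idle steps before each real action of the wrapped AI policy."""

    def __init__(self, policy, idle_steps=7):
        self.policy = policy          # callable: observation -> action
        self.idle_steps = idle_steps
        self._countdown = idle_steps

    def step(self, observation):
        if self._countdown > 0:
            self._countdown -= 1
            return IDLE
        self._countdown = self.idle_steps  # reset for the next real action
        return self.policy(observation)

# With idle_steps=7 the agent acts once every 8 environment steps,
# roughly matching human reaction speed.
fast_policy = lambda obs: "interact"
slow = SlowedPolicy(fast_policy, idle_steps=7)
actions = [slow.step(None) for _ in range(16)]
print(actions.count("interact"))  # 2 real actions in 16 steps
```

The same wrapper generalizes to any idle-step count, so the human-AI speed ratio can be tuned per user study.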

