HUMAN-AI COORDINATION VIA HUMAN-REGULARIZED SEARCH AND LEARNING

Abstract

We consider the problem of making AI agents that collaborate well with humans in partially observable, fully cooperative environments given datasets of human behavior. Inspired by piKL, a human-data-regularized search method that improves upon a behavioral cloning policy without diverging far from it, we develop a three-step algorithm that achieves strong performance in coordinating with real humans on the Hanabi benchmark. We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels. Then, we integrate the policy regularization idea into reinforcement learning to train a human-like best response to the human model. Finally, we apply regularized search on top of the best response policy at test time to handle out-of-distribution challenges when playing with humans. We evaluate our method in two large-scale experiments with humans. First, we show that our method outperforms experts when playing with a group of diverse human players in ad hoc teams. Second, we show that our method beats a vanilla best response to a behavioral cloning baseline by having experts play repeatedly with the two agents.

1. INTRODUCTION

One of the most fundamental goals of artificial intelligence research, especially multi-agent research, is to produce agents that can successfully collaborate with humans to achieve common goals. Although search and reinforcement learning (RL) from scratch, without human knowledge, have achieved impressive superhuman performance in competitive games (Silver et al., 2017; Brown & Sandholm, 2019), prior works (Hu et al., 2020; Carroll et al., 2019) have shown that agents produced by vanilla multi-agent reinforcement learning do not collaborate well with humans. A canonical way to obtain agents that collaborate well with humans is to first use behavioral cloning (BC) (Bain & Sammut, 1996) to train a policy that mimics human behavior and then use RL to train a best response (BR) policy to the fixed BC policy. However, this approach has a few issues. The BC policy is hardly a perfect representation of human play; it may struggle to mimic strong players' performance without search (Jacob et al., 2022). The BC policy's response to new conventions developed during BR training is also not well defined, so the BR policy may develop strategies that exploit those undefined behaviors, confusing humans and causing them to diverge from routine behavior or even quit the task because they believe the partner is nonsensical. Recently, Jacob et al. (2022) introduced piKL, a search technique regularized toward BC policies learned from human data that can produce strong yet human-like policies. In some environments, with the proper amount of regularization, piKL achieves better performance while maintaining or even improving its accuracy in predicting human actions. Inspired by piKL, we propose a three-step algorithm to create agents that collaborate well with humans in complex partially observable environments.
In the first step, we repeatedly apply imitation learning and piKL (piKL-IL) with multiple regularization coefficients to model human players of different skill levels. Second, we integrate the regularization idea with RL to train a human-like best response agent (piKL-BR) to the agents from step one. Third and finally, at test time, we apply piKL on top of the trained best response agent to further improve performance. We call our method piKL3. We test our method on the challenging Hanabi benchmark (Bard et al., 2020) through large-scale experiments with real human players. We first show that it outperforms human experts when partnering with a group of unknown human players in an ad hoc setting without prior communication or warmup games; the players were recruited from a diverse pool and have different skill levels. We then evaluate piKL3 when partnered with expert human players, and find that it outperforms an RL best response to a behavioral cloning policy (BR-BC), a strong and established baseline for cooperative agents, in this setting.
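The shared ingredient of all three steps is regularized action selection: a softmax over search (or RL) value estimates plus a log-probability bonus from the behavioral cloning anchor. The sketch below illustrates this general form; the function name, the toy numbers, and the exact parameterization are ours for illustration, not the paper's precise update rule.

```python
import numpy as np

def pikl_action_probs(q_values, bc_log_probs, lam, eta=1.0):
    """Softmax over value estimates plus a log-probability bonus from
    the behavioral-cloning anchor policy. Larger `lam` pulls the policy
    toward human-like play; `lam -> 0` recovers greedy value-based search.
    (Illustrative form and names, not the paper's exact update.)"""
    logits = eta * (q_values + lam * bc_log_probs)
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: search prefers action 0, the human model prefers action 1.
q = np.array([1.0, 0.0])
bc = np.log(np.array([0.1, 0.9]))
greedy = pikl_action_probs(q, bc, lam=0.0, eta=10.0)  # follows search values
humanlike = pikl_action_probs(q, bc, lam=10.0)        # follows the human model
```

Sweeping `lam` from small to large trades off raw strength against similarity to the human anchor, which is how a single search procedure can model players of different skill levels.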

2. RELATED WORK

Research on learning to collaborate with humans can be roughly categorized into two groups based on whether or not it relies on human data. With human data, the most straightforward method is behavioral cloning, which uses supervised learning to predict human moves and executes the move with the highest predicted probability. The datasets often contain sub-optimal decisions and mistakes made by humans, and behavioral cloning inevitably suffers from training on such data. A few methods from the imitation learning and offline RL communities have been proposed to address these issues. For example, conditioning the policy on a reward target (Kumar et al., 2019; Chen et al., 2021) can help guide the policy toward imitating the human behaviors that achieve the maximum future reward at test time. Behavioral cloning with neural networks alone may struggle to model sufficiently strong humans, especially in complex games that require long-term planning (McIlroy-Young et al., 2020). Jacob et al. (2022) address this issue by regularizing search toward a behavioral cloning policy. The proposed method, piKL, not only improves overall performance as most search methods do, but also achieves better accuracy in predicting human moves in a wide variety of games compared to the behavioral cloning policy on which it is based. Human data can also be used in combination with reinforcement learning. Observationally Augmented Self-Play (OSP) (Lerer & Peysakhovich, 2019) augments the normal MARL training procedure with a behavioral cloning loss on a limited amount of data collected from a test-time agent to find an equilibrium policy that may work well with that agent. OSP increases the probability of learning conventions that are compatible with the test-time agents. However, it may not be able to model partners with diverse skill levels given a large aggregation of data from various players.
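The OSP-style combination of objectives amounts to adding a behavioral cloning cross-entropy term on human data to the usual RL loss. The sketch below shows one plausible form of that combined loss; the function name, the `bc_weight` coefficient, and the numpy formulation are our illustrative choices, not taken from the cited work.

```python
import numpy as np

def bc_augmented_loss(rl_loss, policy_logits, human_actions, bc_weight=0.5):
    """Adds a behavioral-cloning cross-entropy term on human data to an
    RL objective, in the spirit of OSP. `bc_weight` trades off reward
    maximization against matching the human dataset. (Illustrative.)"""
    # Log-softmax over the action dimension, computed stably.
    z = policy_logits - policy_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the actions humans actually took.
    bc_loss = -log_probs[np.arange(len(human_actions)), human_actions].mean()
    return rl_loss + bc_weight * bc_loss
```

Minimizing this joint loss biases the learned equilibrium toward conventions already present in the human data, which is the effect OSP relies on.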
We can also use RL to train a best response policy to the behavioral cloning policy (Carroll et al., 2019). This method is the optimal solution given a perfect human model; in practice, however, RL is prone to overfitting to the imperfections of the human model. In addition, RL alone may not be sufficient in practice to learn superhuman strategies (Silver et al., 2018; Brown & Sandholm, 2019). A parallel research direction seeks to achieve better human-AI coordination without using any human data. Some of these methods take inspiration from human behavior or the human learning process. Hu et al. (2020) constrain RL policies to not break the symmetries of the game arbitrarily, a common practice of human players in certain games. Inspired by humans' learning and reasoning processes, off-belief learning (Hu et al., 2021c) and K-level reasoning (Costa-Gomes & Crawford, 2006; Cui et al., 2021b) train sequences of policies with increasing cognitive capabilities. Both methods achieve strong performance with a human proxy model trained with behavioral cloning. Another group of methods uses population-based training and various diversity metrics (Strouse et al., 2021; Lupu et al., 2021; Tang et al., 2021) to first obtain a set of different policies and then train a common best response that may generalize to human partners better than a best response to a single RL policy.
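The best-response idea can be made concrete in a one-shot cooperative game: with the human model frozen, the AI simply maximizes expected shared reward against the model's action distribution. The toy payoffs and distribution below are ours for illustration; they also hint at the failure mode above, since any error in the frozen model is optimized against directly.

```python
import numpy as np

# Toy cooperative coordination game: payoff[i, j] is the shared reward
# when the AI plays action i and the frozen BC partner plays action j.
# (All numbers are made up for illustration.)
payoff = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
bc_partner = np.array([0.8, 0.2])  # fixed human-model action distribution

expected = payoff @ bc_partner            # AI's expected reward per action
best_response = int(np.argmax(expected))  # matches the partner's habitual action
```

In the sequential, partially observable setting the same computation is carried out by RL against the frozen partner rather than by a single matrix product, but the objective is identical.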

3.1. DEC-POMDP AND DEEP REINFORCEMENT LEARNING

We consider human-AI coordination in a decentralized partially observable Markov decision process (Dec-POMDP) (Nayyar et al., 2013). A Dec-POMDP consists of $N$ agents indexed by $1, \dots, N$, a state space $S$, a joint action space $A = A^1 \times \cdots \times A^N$, a transition function $T: S \times A \to S$, a reward function $r: S \times A \to \mathbb{R}$, and an observation function $\Omega^i$, with $o^i = \Omega^i(s)$ for $s \in S$, for each agent $i$. We further assume that the joint actions $a$ and rewards $r$ are observable by all agents. We then define the trajectory of true states up to time step $t$ as $\tau_t = (s_0, a_0, r_0, \dots, s_t)$ and its partially observed counterpart (the action-observation history, AOH) for agent $i$ as $\tau^i_t = (o^i_0, a_0, r_0, \dots, o^i_t)$. An agent's policy $\pi^i(\tau^i_t) = P(a^i_t \mid \tau^i_t)$ maps each possible AOH to a distribution over the action space of

