HUMAN-AI COORDINATION VIA HUMAN-REGULARIZED SEARCH AND LEARNING

Abstract

We consider the problem of making AI agents that collaborate well with humans in partially observable, fully cooperative environments, given datasets of human behavior. Inspired by piKL, a human-data-regularized search method that improves upon a behavioral cloning policy without diverging far from it, we develop a three-step algorithm that achieves strong performance in coordinating with real humans in the Hanabi benchmark. We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels. Then, we integrate the policy-regularization idea into reinforcement learning to train a human-like best response to the human model. Finally, we apply regularized search on top of the best-response policy at test time to handle out-of-distribution challenges when playing with humans. We evaluate our method in two large-scale experiments with humans. First, we show that our method outperforms experts when playing with a group of diverse human players in ad-hoc teams. Second, we show that our method beats a vanilla best response to a behavioral cloning baseline by having experts play repeatedly with the two agents.

1. INTRODUCTION

One of the most fundamental goals of artificial intelligence research, especially multi-agent research, is to produce agents that can successfully collaborate with humans to achieve common goals. Although search and reinforcement learning (RL) from scratch, without human knowledge, have achieved impressive superhuman performance in competitive games (Silver et al., 2017; Brown & Sandholm, 2019), prior works (Hu et al., 2020; Carroll et al., 2019) have shown that agents produced by vanilla multi-agent reinforcement learning do not collaborate well with humans. A canonical way to obtain agents that collaborate well with humans is to first use behavioral cloning (BC) (Bain & Sammut, 1996) to train a policy that mimics human behavior, and then use RL to train a best response (BR) policy to the fixed BC policy. However, this approach has a few issues. The BC policy is hardly a perfect representation of human play; it may struggle to match strong players' performance without search (Jacob et al., 2022). Moreover, the BC policy's response to new conventions developed during BR training is not well defined, so the BR policy may develop strategies that exploit these undefined behaviors and confuse human partners, causing them to diverge from routine behavior or even quit the task because they believe their partner is behaving nonsensically. Recently, Jacob et al. (2022) introduced piKL, a search technique regularized toward BC policies learned from human data that can produce strong yet human-like policies. In some environments, regularized search with the proper amount of regularization achieves better performance while maintaining or even improving accuracy in predicting human actions. Inspired by piKL, we propose a three-step algorithm to create agents that collaborate well with humans in complex, partially observable environments.
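To illustrate the regularization idea: for a single decision with search values Q and a behavioral-cloning anchor π_BC, maximizing E_π[Q] − λ·KL(π‖π_BC) over action distributions has the closed-form solution π(a) ∝ π_BC(a)·exp(Q(a)/λ). The following is a minimal sketch of that closed form; the toy values, λ coefficients, and variable names are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pikl_policy(q, bc, lam):
    """Closed-form maximizer of E_pi[Q] - lam * KL(pi || pi_BC):
    pi(a) proportional to pi_BC(a) * exp(Q(a) / lam).
    Small lam -> near-greedy on the search values Q;
    large lam -> stays close to the human (BC) anchor."""
    logits = np.log(np.asarray(bc, dtype=float) + 1e-12) \
             + np.asarray(q, dtype=float) / lam
    logits -= logits.max()          # numerical stability before exponentiating
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical toy decision: search values and a BC (human) policy.
q = np.array([1.0, 0.6, 0.2])
bc = np.array([0.2, 0.5, 0.3])

# Sweeping lam yields a family of policies of varying strength,
# loosely analogous to modeling players of different skill levels.
skill_levels = {lam: pikl_policy(q, bc, lam) for lam in (0.05, 0.5, 5.0)}
expected_values = {lam: float(p @ q) for lam, p in skill_levels.items()}
```

As λ shrinks, the induced policy's expected search value rises toward the greedy optimum; as λ grows, the policy approaches the BC anchor. Loosely, the first step's use of multiple regularization coefficients corresponds to fitting human models at several points along this spectrum.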
In the first step, we repeatedly apply imitation learning and piKL (piKL-IL) with multiple regularization coefficients to model human players of different skill levels. Second, we integrate the regularization idea with RL to train a human-like best-response agent (piKL-BR) to the agents from step one. Third and finally, at test time, we apply piKL on top of the trained best-response agent to further improve performance. We call our method piKL3. We test our method on the challenging Hanabi benchmark (Bard et al., 2020) through large-scale experiments with real human players. We first show that it outperforms human experts when partnering

