LEARNING ZERO-SHOT COOPERATION WITH HUMANS, ASSUMING HUMANS ARE BIASED

Abstract

There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion without using any human data. The typical workflow is to first repeatedly run self-play (SP) to build a policy pool and then train the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized w.r.t. the environment reward function, which implicitly assumes that the test-time partners of the adaptive policy will also be optimizing precisely that same reward function. However, human objectives are often substantially biased by their own preferences, which can differ greatly from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the reward space with linear functions, HSP adopts an effective technique to generate an augmented policy pool containing biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that HSP achieves higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated as the most assistive policy based on human feedback.

1. INTRODUCTION

Building intelligent agents that can interact with, cooperate with, and assist humans remains a longstanding AI challenge with decades of research effort (Klien et al., 2004; Ajoudani et al., 2018; Dafoe et al., 2021). Classical approaches are typically model-based: they (repeatedly) fit a behavior model to human data and then plan against that model (Sheridan, 2016; Carroll et al., 2019; Bobu et al., 2020). Despite great successes, this model-based paradigm requires an expensive and time-consuming data-collection process, which can be particularly problematic for the complex problems tackled by today's AI techniques (Kidd & Breazeal, 2008; Biondi et al., 2019), and may also raise privacy issues (Pan et al., 2019). Recently, multi-agent reinforcement learning (MARL) has become a promising approach for many challenging decision-making problems. Particularly in competitive settings, AIs developed by MARL algorithms based on self-play (SP) have defeated human professionals in a variety of domains (Silver et al., 2018; Vinyals et al., 2019; Berner et al., 2019). This empirical evidence suggests a new direction: developing strong AIs that cooperate with humans directly, in a similar "model-free" fashion, i.e., via self-play. Unlike zero-sum games, where simply adopting a Nash equilibrium strategy is sufficient, an obvious issue when training cooperative agents by self-play is convention overfitting. Because a cooperative game admits a large number of possible optimal strategies, SP-trained agents can easily converge to one particular optimum and make decisions solely based on a specific behavior pattern, i.e., convention (Lowe et al., 2019; Hu et al., 2020), of their co-trainers, leading to poor generalization to unseen partners.
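The convention-overfitting failure mode can be made concrete with a toy two-action coordination game. The following is a minimal illustrative sketch with hypothetical names (not the actual training setup of any of the cited works): both agents receive reward 1 if they pick the same action and 0 otherwise, so the all-0 and all-1 joint strategies are two equally optimal conventions, and which one a self-play run converges to is arbitrary.

```python
import random

def payoff(a1, a2):
    """Cooperative reward: the two agents must match to score."""
    return 1.0 if a1 == a2 else 0.0

def self_play(seed):
    """Stand-in for a self-play run: the co-trained pair settles on one of
    the two optimal conventions, depending only on the training seed."""
    rng = random.Random(seed)
    convention = rng.choice([0, 1])
    return lambda: convention  # deterministic converged policy

def cross_play_return(pi_a, pi_b, episodes=100):
    """Average return when two independently trained policies are paired."""
    return sum(payoff(pi_a(), pi_b()) for _ in range(episodes)) / episodes

# Any SP policy scores 1.0 when paired with its own co-trainer...
pi = self_play(seed=0)
assert cross_play_return(pi, pi) == 1.0
# ...but pairing policies from runs that converged to different conventions
# yields 0.0: the generalization failure described above.
assert cross_play_return(lambda: 0, lambda: 1) == 0.0
```

An SP-trained pair always achieves the optimal return with itself, yet two independently trained pairs that happen to land on different conventions score nothing together; the same mismatch arises when the unseen partner is a human with their own convention.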
To tackle this problem, recent works proposed a two-stage framework: first develop a diverse policy pool consisting of multiple SP-trained policies, which ideally covers different conventions, and then train an adaptive policy against this pool (Lupu et al., 2021; Strouse et al., 2021; Zhao et al., 2021). Despite the empirical success of this two-stage framework, a fundamental drawback remains. Even though the policy pool mitigates convention overfitting, each SP-trained policy in the pool is still an (optimal or sub-optimal) solution to the fixed reward function specified by the underlying cooperative game. This implies a crucial generalization assumption that any test-time partner

