MASTERING THE GAME OF NO-PRESS DIPLOMACY VIA HUMAN-REGULARIZED REINFORCEMENT LEARNING AND PLANNING

Abstract

No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model.

* Equal first author contribution.
¹ Dota 2 is a two-team zero-sum game, but the presence of full information sharing between teammates makes it equivalent to 2p0s. Beyond 2p0s settings, self-play algorithms have also proven successful in highly adversarial games like six-player poker (Brown & Sandholm, 2019).

1. INTRODUCTION

In two-player zero-sum (2p0s) settings, principled self-play algorithms converge to a minimax equilibrium, which in a balanced game ensures that a player will not lose in expectation regardless of the opponent's strategy (Neumann, 1928). This fact has allowed self-play, even without human data, to achieve remarkable success in 2p0s games like chess (Silver et al., 2018), Go (Silver et al., 2017), poker (Bowling et al., 2015; Brown & Sandholm, 2017), and Dota 2 (Berner et al., 2019).¹ In principle, any finite 2p0s game can be solved via self-play given sufficient compute and memory. However, in games involving cooperation, self-play alone no longer guarantees good performance when playing with humans, even with infinite compute and memory. This is because in complex domains there may be arbitrarily many conventions and expectations for how to cooperate, of which humans may use only a small subset (Lerer & Peysakhovich, 2019). The clearest example of this is language. A self-play agent trained from scratch without human data in a cooperative game involving free-form communication channels would almost certainly not converge to using English as the medium of communication. Obviously, such an agent would perform poorly when paired with a human English speaker. Indeed, prior work has shown that naïve extensions of self-play from scratch without human data perform poorly when playing with humans or human-like agents even in dialogue-free domains that involve cooperation rather than just competition, such as the benchmark games no-press Diplomacy (Bakhtin et al., 2021) and Hanabi (Siu et al., 2021; Cui et al., 2021). Recently, Jacob et al. (2022) introduced piKL, which models human behavior in many games better than pure behavioral cloning (BC) on human data by regularizing inference-time planning toward a BC policy.
In this work, we introduce an extension of piKL, called DiL-piKL, that replaces piKL's single fixed regularization parameter λ with a probability distribution over λ parameters. We then show how DiL-piKL can be combined with self-play reinforcement learning, allowing us to train a strong agent that performs well with humans. We call this algorithm RL-DiL-piKL. Using RL-DiL-piKL, we trained an agent, Diplodocus, to play no-press Diplomacy, a difficult benchmark for multi-agent AI that has been actively studied in recent years (Paquette et al., 2019; Anthony et al., 2020; Gray et al., 2020; Bakhtin et al., 2021; Jacob et al., 2022). We conducted a 200-game no-press Diplomacy tournament with a diverse pool of human players, including expert humans, in which we tested two versions of Diplodocus using different RL-DiL-piKL settings, alongside other baseline agents. All games consisted of one bot and six humans, with all players anonymous for the duration of the game. These two versions of Diplodocus achieved the top two average scores in the tournament among all 48 participants who played more than two games, and ranked first and third overall among all participants according to an Elo ratings model. Our code and models are available at https://github.com/facebookresearch/diplomacy_cicero.
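To make the regularization idea concrete, the sketch below shows the standard KL-regularized best response that piKL builds on — pi(a) proportional to anchor(a)·exp(Q(a)/λ) — plus a DiL-piKL-style step that first samples λ from a distribution. This is a minimal illustration, not the paper's algorithm: the function names, the example λ distribution, and the reduction of the iterative equilibrium computation to a single one-shot response are all assumptions for exposition.

```python
import math
import random

def regularized_best_response(q_values, anchor, lam):
    """KL-regularized best response: pi(a) proportional to anchor(a) * exp(Q(a) / lam).

    Small lam -> nearly greedy reward maximization; large lam -> the policy
    stays close to the human (anchor) imitation policy.
    """
    # Subtract the max exponent for numerical stability.
    m = max(q / lam for q in q_values)
    weights = [p * math.exp(q / lam - m) for p, q in zip(anchor, q_values)]
    total = sum(weights)
    return [w / total for w in weights]

def dil_pikl_step(q_values, anchor, lambda_distribution, rng=random):
    """DiL-piKL-style step (sketch): sample a regularization strength lam
    from a distribution over lambdas, then play the lam-regularized response."""
    lambdas = [lam for lam, _ in lambda_distribution]
    probs = [w for _, w in lambda_distribution]
    lam = rng.choices(lambdas, weights=probs)[0]
    return regularized_best_response(q_values, anchor, lam)
```

For example, with Q-values [1.0, 0.0] and a uniform anchor, λ = 0.01 yields a near-deterministic policy on the high-value action, while λ = 100 yields a policy nearly identical to the anchor — the two regimes that a distribution over λ lets a single agent model simultaneously.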

2. BACKGROUND AND PRIOR WORK

Diplomacy is a benchmark 7-player mixed cooperative/competitive game featuring simultaneous moves and a heavy emphasis on negotiation and coordination. In the no-press variant of the game, there is no cheap-talk communication; instead, players communicate only implicitly through moves. In the game, seven players compete for majority control of 34 "supply centers" (SCs) on a map. On each turn, players simultaneously choose actions consisting of an order for each of their units to hold, move, support, or convoy another unit. If no player controls a majority of SCs and all remaining players agree to a draw, or a turn limit is reached, then the game ends in a draw. In this case, we use a common scoring system in which the score of player i is C_i^2 / Σ_{i'} C_{i'}^2, where C_i is the number of SCs player i owns. A more detailed description of the rules is provided in Appendix A.

Most recent successes in no-press Diplomacy use deep learning to imitate human behavior given a corpus of human games. The first Diplomacy agent to leverage deep imitation learning was Paquette et al. (2019). Subsequent work on no-press Diplomacy has mostly relied on a similar architecture with some modeling improvements (Gray et al., 2020; Anthony et al., 2020; Bakhtin et al., 2021). Gray et al. (2020) proposed an agent that plays an improved policy via one-ply search, using policy and value functions trained on human data to conduct search via regret minimization.

Several works have explored applying self-play to compute improved policies. Paquette et al. (2019) applied an actor-critic approach and found that while the agent plays more strongly in populations of other self-play agents, it plays worse against a population of human-imitation agents. Anthony et al. (2020) used a self-play approach based on a modification of fictitious play in order to reduce drift from human conventions. The resulting policy is stronger than pure imitation learning in both 1vs6 and 6vs1 settings but weaker than agents that use search. Most recently, Bakhtin et al. (2021) combined one-ply search based on equilibrium computation with value iteration to produce an agent called DORA. DORA achieved superhuman performance in a 2p0s version of Diplomacy without human data, but in the full 7-player game it plays poorly with agents other than itself.

Jacob et al. (2022) showed that regularizing inference-time search techniques can produce agents that are not only strong but can also model human behavior well. In no-press Diplomacy, they showed that regularizing hedge (an equilibrium-finding algorithm) with a KL-divergence penalty toward a human imitation learning policy can match or exceed the human action-prediction accuracy of imitation learning while being substantially stronger. KL-regularization toward human behavioral policies has previously been proposed in various forms in single- and multi-agent RL algorithms (Nair et al., 2018; Siegel et al., 2020; Nair et al., 2020), and was notably employed in AlphaStar (Vinyals et al., 2019), but it has typically been used to improve sample efficiency and aid exploration rather than to better model and coordinate with human play. An alternative line of research has attempted to build human-compatible agents without relying on human data (Hu et al., 2020; 2021; Strouse et al., 2021). These techniques have shown some success in simplified settings but have not been shown to be competitive with humans in large-scale collaborative environments.
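The sum-of-squares scoring rule for drawn games can be checked with a short worked example. The SC counts below are hypothetical, chosen only to illustrate the computation (no player reaches the 18-SC majority, so the game can end in a draw):

```python
def sos_scores(sc_counts):
    """Sum-of-squares Diplomacy scoring for a drawn game: player i's share
    is C_i^2 / sum over i' of C_{i'}^2, where C_i is i's supply-center count."""
    denom = sum(c * c for c in sc_counts)
    return [c * c / denom for c in sc_counts]

# Hypothetical final SC counts for the 7 players (34 SCs total, no majority).
scores = sos_scores([16, 7, 6, 5, 0, 0, 0])
```

Note how squaring rewards large positions disproportionately: the leader holds 16 of 34 SCs (under half) but receives 256/366, roughly 70% of the pot, while eliminated players score zero.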

