MASTERING THE GAME OF NO-PRESS DIPLOMACY VIA HUMAN-REGULARIZED REINFORCEMENT LEARNING AND PLANNING

Abstract

No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitationlearned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model.

1. INTRODUCTION

In two-player zero-sum (2p0s) settings, principled self-play algorithms converge to a minimax equilibrium, which in a balanced game ensures that a player will not lose in expectation regardless of the opponent's strategy (Neumann, 1928) . This fact has allowed self-play, even without human data, to achieve remarkable success in 2p0s games like chess (Silver et al., 2018 ), Go (Silver et al., 2017 ), poker (Bowling et al., 2015; Brown & Sandholm, 2017), and Dota 2 (Berner et al., 2019) . 1In principle, any finite 2p0s game can be solved via self-play given sufficient compute and memory. However, in games involving cooperation, self-play alone no longer guarantees good performance when playing with humans, even with infinite compute and memory. This is because in complex domains there may be arbitrarily many conventions and expectations for how to cooperate, of which humans may use only a small subset (Lerer & Peysakhovich, 2019). The clearest example of this is language. A self-play agent trained from scratch without human data in a cooperative game involving free-form communication channels would almost certainly not converge to using English as the medium of communication. Obviously, such an agent would perform poorly when paired with a human English speaker. Indeed, prior work has shown that naïve extensions of self-play from scratch without human data perform poorly when playing with humans or human-like agents even in dialogue-free domains that involve cooperation rather than just competition, such as the benchmark games no-press Diplomacy (Bakhtin et al., 2021) and Hanabi (Siu et al., 2021; Cui et al., 2021) .



Dota is a two-team zero-sum game, but the presence of full information sharing between teammates makes it equivalent to 2p0s. Beyond 2p0s settings, self-play algorithms have also proven successful in highly adversarial games like six-player poker(Brown & Sandholm, 2019).

