HUMAN-LEVEL PERFORMANCE IN NO-PRESS DIPLOMACY VIA EQUILIBRIUM SEARCH

Abstract

Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via regret minimization. Regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and ranks in the top 2% of human players when playing anonymous games on a popular Diplomacy website.

1. INTRODUCTION

A primary goal for AI research is to develop agents that can act optimally in real-world multi-agent interactions (i.e., games). In recent years, AI agents have achieved expert-level or even superhuman performance in benchmark games such as backgammon (Tesauro, 1994), chess (Campbell et al., 2002), Go (Silver et al., 2016; 2017; 2018), poker (Moravčík et al., 2017; Brown & Sandholm, 2017; 2019b), and real-time strategy games (Berner et al., 2019; Vinyals et al., 2019). However, previous large-scale game AI results have focused on either purely competitive or purely cooperative settings. In contrast, real-world games, such as business negotiations, politics, and traffic navigation, involve a far more complex mixture of cooperation and competition. In such settings, the theoretical grounding for the techniques used in previous AI breakthroughs falls apart.

In this paper we augment neural policies trained through imitation learning with regret minimization search techniques, and evaluate on the benchmark game of no-press Diplomacy. Diplomacy is a longstanding benchmark for research that features a rich mixture of cooperation and competition. Like previous researchers, we evaluate on the widely played no-press variant of Diplomacy, in which communication can only occur through the actions in the game (i.e., no cheap talk is allowed).

Specifically, we begin with a blueprint policy that approximates human play in a dataset of Diplomacy games. We then improve upon the blueprint during play by approximating an equilibrium for the current phase of the game, assuming all players (including our agent) play the blueprint for the remainder of the game. Our agent then plays its part of the computed equilibrium. The equilibrium is computed via regret matching (RM) (Blackwell, 1956; Hart & Mas-Colell, 2000). Search via RM has led to remarkable success in poker. However, RM only converges to a Nash equilibrium in two-player zero-sum games and other special cases, and RM had not previously been shown to produce strong policies in a mixed cooperative/competitive game as complex as no-press Diplomacy. Nevertheless, we show that our agent exceeds the performance of prior agents and for the first time convincingly achieves human-level performance in no-press Diplomacy. Specifically, we show that our agent soundly defeats previous agents, that it is far less exploitable than previous agents, that an expert human cannot exploit it even in repeated play, and, most importantly, that it ranks in the top 2% of human players when playing anonymous games on a popular Diplomacy website.
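The regret matching procedure at the heart of the search step can be illustrated on a small two-player zero-sum matrix game, a toy stand-in for the subgame over sampled actions. The payoff matrix and iteration count below are illustrative, not taken from the paper:

```python
import numpy as np

def regret_matching(payoffs, iters=20000):
    """Approximate an equilibrium of a two-player zero-sum matrix game
    via regret matching (Hart & Mas-Colell, 2000). payoffs[i, j] is the
    row player's payoff; the column player receives -payoffs[i, j]."""
    n_rows, n_cols = payoffs.shape
    regrets = [np.zeros(n_rows), np.zeros(n_cols)]
    strategy_sums = [np.zeros(n_rows), np.zeros(n_cols)]
    for _ in range(iters):
        # Each player plays proportionally to positive regret
        # (uniformly at random if no regret is positive).
        strats = []
        for r in regrets:
            pos = np.maximum(r, 0.0)
            strats.append(pos / pos.sum() if pos.sum() > 0
                          else np.full(len(r), 1.0 / len(r)))
        strategy_sums[0] += strats[0]
        strategy_sums[1] += strats[1]
        # Expected payoff of each pure action against the opponent's mix.
        u_row = payoffs @ strats[1]
        u_col = -(strats[0] @ payoffs)
        # Accumulate regret for not having played each pure action.
        regrets[0] += u_row - strats[0] @ u_row
        regrets[1] += u_col - strats[1] @ u_col
    # The *average* strategies, not the final ones, converge to equilibrium.
    return [s / iters for s in strategy_sums]

# Asymmetric matching pennies: by the usual indifference conditions,
# the unique Nash equilibrium is (0.4, 0.6) for both players.
game = np.array([[2.0, -1.0], [-1.0, 1.0]])
row_avg, col_avg = regret_matching(game)
```

Note that regret matching guarantees convergence of the average strategies only in two-player zero-sum games; as discussed above, its use in a seven-player mixed-motive game lacks this guarantee.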

2. BACKGROUND AND RELATED WORK

Search has previously been used in almost every major game AI breakthrough, including backgammon (Tesauro, 1994), chess (Campbell et al., 2002), Go (Silver et al., 2016; 2017; 2018), poker (Moravčík et al., 2017; Brown & Sandholm, 2017; 2019b), and Hanabi (Lerer et al., 2020). A major exception is real-time strategy games (Vinyals et al., 2019; Berner et al., 2019). Similar to SPARTA as used in Hanabi (Lerer et al., 2020), our agent conducts one-ply lookahead search (i.e., it changes the policy only for the current game turn) and thereafter assumes all players play according to the blueprint. Similar to the Pluribus poker agent (Brown & Sandholm, 2019b), our search technique uses regret matching to compute an approximate equilibrium. In a manner similar to the sampled best response algorithm of Anthony et al. (2020), we sample a limited number of actions from the blueprint policy rather than searching over all possible actions, which would be intractable.

Learning effective policies in games involving cooperation and competition has been studied extensively in the field of multi-agent reinforcement learning (MARL) (Shoham et al., 2003). Nash-Q and CE-Q applied Q-learning to general-sum games by using Q values derived from Nash (or correlated) equilibrium values at the target states (Hu & Wellman, 2003; Greenwald et al., 2003). Friend-or-foe Q-learning treats other agents as either cooperative or adversarial, in which case the Nash Q values are well defined (Littman, 2001). The recent focus on "deep" MARL has led to learning rules from game theory, such as fictitious play and regret minimization, being adapted to deep reinforcement learning (Heinrich & Silver, 2016; Brown et al., 2019), as well as to work on game-theoretic challenges of mixed cooperative/competitive settings, such as social dilemmas and multiple equilibria (Leibo et al., 2017; Lerer & Peysakhovich, 2017; 2019).
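The action-sampling idea can be sketched as a minimal one-ply lookahead, here reduced to a plain sampled best response for brevity. The `blueprint` distribution, value table, and action names are hypothetical placeholders; the paper's agent computes an equilibrium over the sampled actions rather than a single best response:

```python
import random

def sampled_best_response(blueprint, value_fn, n_candidates=5, seed=0):
    """One-ply lookahead over a *sampled* candidate set: draw a limited
    number of actions from the blueprint policy (enumerating all actions
    would be intractable) and keep the highest-valued one, where
    value_fn scores an action under continued blueprint play."""
    rng = random.Random(seed)
    actions, probs = zip(*blueprint.items())
    candidates = rng.choices(actions, weights=probs, k=n_candidates)
    return max(candidates, key=value_fn)

# Toy blueprint over three illustrative actions, and a toy value table
# standing in for Monte Carlo rollout estimates under the blueprint.
blueprint = {"hold": 0.5, "move": 0.3, "support": 0.2}
values = {"hold": 0.1, "move": 0.4, "support": 0.2}
best = sampled_best_response(blueprint, values.__getitem__)
```

Sampling candidates from the blueprint biases the search toward human-plausible actions, which keeps the candidate set small without discarding most of the policy's probability mass.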
Diplomacy in particular has served for decades as a benchmark for multi-agent AI research (Kraus & Lehmann, 1988; Kraus et al., 1994; Kraus & Lehmann, 1995; Johansson & Håård, 2005; Ferreira et al., 2015).

2.1. DESCRIPTION OF DIPLOMACY

The rules of no-press Diplomacy are complex; a full description is provided by Paquette et al. (2019). No-press Diplomacy is a seven-player zero-sum board game in which a map of Europe is divided into 75 provinces. 34 of these provinces contain supply centers (SCs), and the goal of the game is for a player to control a majority (18) of the SCs. Each player begins the game controlling three or four SCs and an equal number of units.

The game consists of three types of phases: movement phases in which each player assigns an order to each unit they control, retreat phases in which defeated units retreat to a neighboring province, and adjustment phases in which new units are built or existing units are destroyed. During a movement phase, a unit's order may be to hold (defend its province), move to a neighboring province, convoy a unit over water, or support a neighboring unit's hold or move order. Support may be provided to units of any player.

We refer to a tuple of orders, one for each of a player's units, as an action. That is, each player chooses one action each turn. There are an average of 26 valid orders for each unit (Paquette et al., 2019), so the game's branching factor is massive and on some turns enumerating all actions is intractable.

Importantly, all actions occur simultaneously. In live games, players write down their orders and then reveal them at the same time. This makes Diplomacy an imperfect-information game in which an optimal policy may need to be stochastic in order to prevent predictability. Diplomacy is designed in such a way that cooperation with other players is almost essential for victory, even though only one player can ultimately win.
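A back-of-the-envelope calculation makes the branching-factor claim concrete (the unit counts below are illustrative):

```python
# Average of ~26 valid orders per unit (Paquette et al., 2019).
# A player's action is one order per unit, so a player with u units
# has on the order of 26**u candidate actions.
ORDERS_PER_UNIT = 26

for units in (3, 10, 17):
    n_actions = ORDERS_PER_UNIT ** units
    print(f"{units:2d} units -> ~{n_actions:.1e} candidate actions")
```

Even at the starting three or four units, per-player action spaces number in the tens of thousands, and a player nearing the 18-SC victory threshold faces an astronomically large action set, which is why the search samples candidate actions rather than enumerating them.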

Recently, Paquette et al. (2019) applied imitation learning (IL) via deep neural networks to a dataset of more than 150,000 Diplomacy games. This work greatly improved the state of the art for no-press Diplomacy, previously held by a handcrafted agent (van Hal, 2013). Paquette et al. (2019) also tested reinforcement learning (RL) in no-press Diplomacy via Advantage Actor-Critic (A2C) (Mnih et al., 2016). Anthony et al. (2020) introduced sampled best response policy iteration, a self-play technique that further improved upon the performance of Paquette et al. (2019).

