ADVERSARIAL DIVERSITY IN HANABI

Abstract

Many Dec-POMDPs admit a qualitatively diverse set of "reasonable" joint policies, where reasonableness is indicated by symmetry equivariance, non-sabotaging behaviour, and the graceful degradation of performance when paired with ad-hoc partners. Part of the diversity literature is concerned with generating these policies. Unfortunately, existing methods fail to produce teams of agents that are simultaneously diverse, high-performing, and reasonable. In this work, we propose a novel approach, adversarial diversity (ADVERSITY), which is designed for turn-based Dec-POMDPs with public actions. ADVERSITY relies on off-belief learning to encourage reasonableness and skill, and on "repulsive" fictitious transitions to encourage diversity. We use this approach to generate new agents with distinct but reasonable play styles for the card game Hanabi, and we open-source our agents for future research on (ad-hoc) coordination.

1. INTRODUCTION

A key objective of cooperative multi-agent reinforcement learning (MARL) is to produce agents capable of coordinating with novel partners, including other artificial agents and, ultimately, humans. To make progress on this objective, a number of works have focused on the general challenge of ad-hoc team play, which is to create autonomous agents able to "efficiently and robustly collaborate with previously unknown teammates on tasks to which they are all individually capable of contributing as team members" (Stone et al., 2010). To evaluate such agents, many works on ad-hoc coordination rely on evaluation setups similar to the one proposed by Stone et al. (2010), pairing the agents at test time with partners sampled from a pre-determined pool. The value of such evaluations depends on the size and quality of the pool of partners. A pool that is too small or too homogeneous may not be representative of all possible play styles and may therefore provide an inaccurate evaluation of an agent's coordination capabilities. For this reason, previous works on coordination have relied on various approaches to generate a diverse pool of partners. A first approach is to handcraft policies, either directly or by shaping the reward at train time (Albrecht, 2015; Barrett et al., 2017; Zand et al., 2022), but this requires domain knowledge and scales poorly. Another is to train a population with varying hyperparameters or by deploying multiple RL algorithms on the same task (Nekoei et al., 2021; Zand et al., 2022; Albrecht, 2015). The diversity achieved this way is unclear, since it is a byproduct of the variability of the algorithms used rather than being actively optimized for. Yet other works augment training with a diversity loss (Lupu et al., 2021) or save multiple checkpoints (Strouse et al., 2021), but often do not report the level of diversity achieved.
Measures of diversity based on policy similarity struggle in settings where not all different actions result in "meaningfully" different outcomes. Furthermore, the number of possible trajectories is often so large that it becomes easy to maximize diversity objectives without learning qualitatively different policies: imagine a humanoid robot that wiggles a finger at any single time step in the episode rather than learning different walking styles. This is particularly true in multi-agent settings, where the number of trajectories is exponential in the number of agents. To avoid such pitfalls, another approach, followed by Charakorn et al., is to require distinct policies to be incompatible by training them to obtain a low score when paired in mixed teams. In Section 7, we show that this results in policies that simply identify whether they are playing in self-play (SP, with themselves) or in cross-play (XP, with another agent). In the latter case, they purposely "sabotage" the game by selecting actions that minimize return, such as playing unplayable cards in the card game Hanabi. Adapting to such policies in an ad-hoc pool is a non-goal, since they do not represent meaningfully different policies but rather actively poor and adversarial game play. This is in line with previous findings that partners trained with SP rely on arbitrary conventions and symmetry breaking, making collaboration with them difficult (Hu et al., 2020; 2021b; Lupu et al., 2021). As such, producing strong and meaningfully diverse policies in Dec-POMDPs remains an important unsolved problem. We address this problem in turn-based settings with public actions by introducing adversarial diversity (ADVERSITY), a policy training method which, given a repulser agent, produces an "adversary" whose conventions are fundamentally incompatible with those of the repulser.
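To make the failure mode concrete, consider a generic incompatibility objective of the flavor discussed above: maximize self-play return while penalizing average cross-play return. The sketch below is purely illustrative; the function name, signature, and weighting are our own hypothetical choices and are not reproduced from any published method.

```python
def incompatibility_objective(sp_return, xp_returns, lam=1.0):
    """Sketch of a cross-play incompatibility objective: reward the
    agent's self-play return, penalize its average cross-play return
    against each partner it must be "incompatible" with.

    All names and the weighting scheme are illustrative assumptions.
    """
    xp_mean = sum(xp_returns) / len(xp_returns) if xp_returns else 0.0
    return sp_return - lam * xp_mean

# Degenerate maximizer: once an off-convention partner action reveals
# cross-play, an agent can sabotage (drive its XP return to the minimum)
# while playing identically to the repulser in self-play, so the
# objective is maximized without any meaningful behavioural diversity.
```

Under this objective, a sabotaging agent and a genuinely diverse agent can receive the same score, which is precisely why ADVERSITY instead makes the SP/XP distinction unobservable to the learner.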
The key insight of ADVERSITY is that it prevents the adversary from identifying whether it is currently in SP or playing with the repulser agent by randomizing between the two at every time step. In other words, even if the adversary is in an action-observation history (AOH) that is incompatible with the repulser agent, the adversary is paired with the repulser agent with a fixed probability λ, in which case the next reward is inverted. Likewise, with probability 1 − λ the adversary is instead paired with itself. Crucially, the choice of the current partner determines not only the sign of the reward (positive or negative) and the partner's action, but also how the entire AOH thus far is interpreted. Here, we build on top of the fictitious transition mechanism from off-belief learning (Hu et al., 2021b, OBL) and use the belief model of the repulser policy on the corresponding repulsive transition. In a nutshell, if the adversary is currently paired with the repulser policy, the transition is sampled from a belief distribution that assumes the repulser policy took all actions thus far. When the adversary is paired with itself rather than the repulser, we must avoid the feedback loop between induced beliefs and future actions, which would allow the adversary to form arbitrary conventions. Thus, we train in a hierarchy: we start with the grounded belief, as in OBL, and at each level ℓ we compute the vanilla transitions using the belief model of the level below, ℓ − 1. The adversary is trained to maximize a "difference value function", which estimates the forward-looking discounted difference between adversarial and vanilla transitions under their corresponding beliefs. For the first time, ADVERSITY enables us to produce a number of high-performing, diverse, and symmetry-invariant policies for the challenging collaborative card game Hanabi.
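The per-step randomization above can be sketched as follows. The interfaces (`repulser_belief`, `lower_belief`, `env_model`) are hypothetical placeholders for OBL-style belief models and a resettable environment model; they are not the actual ADVERSITY implementation, which builds on the OBL codebase.

```python
import random

def adversity_transition(aoh, action, lam, repulser_belief, lower_belief, env_model):
    """One ADVERSITY fictitious transition (illustrative sketch).

    Assumed interfaces (hypothetical): each `*_belief(aoh)` samples a
    full environment state consistent with the AOH under the assumption
    that the corresponding policy produced all past actions, and
    `env_model.step(state, action)` returns (reward, next_aoh).
    """
    if random.random() < lam:
        # Repulsive transition: reinterpret the AOH as if the repulser
        # policy had taken every action so far, then invert the reward.
        state = repulser_belief(aoh)
        reward, next_aoh = env_model.step(state, action)
        return -reward, next_aoh
    else:
        # Vanilla OBL transition at level l: resample the state from the
        # level l-1 belief, cutting the belief/action feedback loop that
        # would otherwise let the adversary form arbitrary conventions.
        state = lower_belief(aoh)
        reward, next_aoh = env_model.step(state, action)
        return reward, next_aoh
```

Because the draw between the two branches happens at every time step and the adversary never observes which branch was taken, conditioning behaviour on "am I in SP or XP?" stops being a viable strategy.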
2. RELATED WORK

Stone et al. (2010) and Bowling & McCracken (2005) were among the first to formulate the ad-hoc teamwork ("impromptu team play") setting, requiring autonomous agents to collaborate with novel teammates. Works in the literature have often taken a type-based approach, where potential partners are grouped into a number of possible types, which must be identified at test time. Different types (or classes) of policies have notably been generated through genetic algorithms (Albrecht et al., 2015) to




Figure 1: Standard (a) and repulsive (b) OBL transitions when training π ℓ for n = 2 steps. ADVERSITY trains policy π ℓ on (b) with probability λ, and on (a) otherwise. Differences between the two are in red.

