ADVERSARIAL DIVERSITY IN HANABI

Abstract

Many Dec-POMDPs admit a qualitatively diverse set of "reasonable" joint policies, where reasonableness is indicated by symmetry equivariance, non-sabotaging behaviour, and the graceful degradation of performance when paired with ad-hoc partners. Part of the diversity literature is concerned with generating such policies. Unfortunately, existing methods fail to produce teams of agents that are simultaneously diverse, high-performing, and reasonable. In this work, we propose a novel approach, adversarial diversity (ADVERSITY), which is designed for turn-based Dec-POMDPs with public actions. ADVERSITY relies on off-belief learning to encourage reasonableness and skill, and on "repulsive" fictitious transitions to encourage diversity. We use this approach to generate new agents with distinct but reasonable play styles for the card game Hanabi, and we open-source our agents for future research on (ad-hoc) coordination.

1. INTRODUCTION

A key objective of cooperative multi-agent reinforcement learning (MARL) is to produce agents capable of coordinating with novel partners, including other artificial agents and, ultimately, humans. To make progress on this objective, a number of works have focused on the general challenge of ad-hoc team play, which is to create autonomous agents able to "efficiently and robustly collaborate with previously unknown teammates on tasks to which they are all individually capable of contributing as team members" (Stone et al., 2010). To evaluate such agents, many works on ad-hoc coordination rely on evaluation setups similar to the one proposed by Stone et al. (2010), pairing the agents at test time with partners sampled from a pre-determined pool. The value of such evaluations depends on the size and quality of the pool of partners. A pool that is too small or too homogeneous may not be representative of all possible play styles, and may therefore give an inaccurate picture of an agent's coordination capabilities. For this reason, previous works on coordination have relied on various approaches to generate a diverse pool of partners. A first approach is to handcraft policies, either directly or by shaping the reward at train time (Albrecht, 2015; Barrett et al., 2017; Zand et al., 2022), but this requires domain knowledge and scales poorly. Another is to train a population with varying hyperparameters or by deploying multiple RL algorithms on the same task (Nekoei et al., 2021; Zand et al., 2022; Albrecht, 2015). The diversity achieved this way is unclear, since it is a byproduct of the variability of the algorithms used rather than being actively optimized for. Yet other works augment training with a diversity loss (Lupu et al., 2021) or save multiple checkpoints during training (Strouse et al., 2021), but often do not report the level of diversity achieved.
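To make the diversity-loss idea concrete, the sketch below shows one common instantiation: each policy in a population receives, on top of the task return, a bonus proportional to how much its action distribution diverges from the rest of the population, here measured by Jensen-Shannon divergence at a single observation. The function names and the weight `lambda_div` are our own illustrative choices, not the formulation of Lupu et al. (2021).

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete action
    distributions (base 2, so the result lies in [0, 1])."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence; terms with zero probability contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def diversity_bonus(action_dists, i):
    """Mean divergence of policy i's action distribution from the
    rest of the population, evaluated at one observation."""
    others = [d for j, d in enumerate(action_dists) if j != i]
    return sum(jensen_shannon(action_dists[i], d) for d in others) / len(others)

def shaped_return(task_return, action_dists, i, lambda_div=0.1):
    """Task return plus a weighted diversity bonus; lambda_div trades
    off task performance against population diversity."""
    return task_return + lambda_div * diversity_bonus(action_dists, i)
```

Note that such an objective rewards any divergence in action distributions, whether or not the differing actions lead to meaningfully different outcomes, which is precisely the weakness of similarity-based measures discussed in the next paragraph.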
Measures of diversity based on policy similarity struggle in settings where not all different actions result in "meaningfully" different outcomes¹. Furthermore, the number of possible trajectories is often so large that it becomes easy to maximize diversity objectives without learning qualitatively different policies: imagine a humanoid robot that wiggles a finger at any sin-
https://github.com/facebookresearch/off-belief-learning
¹ While "meaningfully different" is environment-dependent, we elaborate on what we mean in Section 4.

