MAESTRO: OPEN-ENDED ENVIRONMENT DESIGN FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Open-ended learning methods that automatically generate a curriculum of increasingly challenging tasks serve as a promising avenue toward generally capable reinforcement learning agents. Existing methods adapt curricula independently over either environment parameters (in single-agent settings) or co-player policies (in multi-agent settings). However, the strengths and weaknesses of co-players can manifest themselves differently depending on environmental features. It is thus crucial to consider the dependency between the environment and co-player when shaping a curriculum in multi-agent domains. In this work, we use this insight and extend Unsupervised Environment Design (UED) to multi-agent environments. We then introduce Multi-Agent Environment Design Strategist for Open-Ended Learning (MAESTRO), the first multi-agent UED approach for two-player zero-sum settings. MAESTRO efficiently produces adversarial, joint curricula over both environments and co-players and attains minimax-regret guarantees at Nash equilibrium. Our experiments show that MAESTRO outperforms a number of strong baselines on competitive two-player games, spanning discrete and continuous control settings.¹

1. INTRODUCTION

The past few years have seen a series of remarkable achievements in producing deep reinforcement learning (RL) agents with expert (Vinyals et al., 2019; Berner et al., 2019; Wurman et al., 2022) and superhuman (Silver et al., 2016; Schrittwieser et al., 2020) performance in challenging competitive games. Central to these successes are adversarial training processes that result in curricula creating new challenges at the frontier of an agent's capabilities (Leibo et al., 2019; Yang et al., 2021). Such automatic curricula, or autocurricula, can improve the sample efficiency and generality of trained policies (Open Ended Learning Team et al., 2021), as well as induce an open-ended learning process (Balduzzi et al., 2019; Stanley et al., 2017) that continues to endlessly robustify an agent. Autocurricula have been effective in multi-agent RL for adapting to different co-players in competitive games (Leibo et al., 2019; Garnelo et al., 2021; Baker et al., 2019; Bansal et al., 2018; Feng et al., 2021), where it is crucial to play against increasingly stronger opponents (Silver et al., 2018) and to avoid being exploited by other agents (Vinyals et al., 2019). Here, algorithms such as self-play (Silver et al., 2018; Tesauro, 1995) and fictitious self-play (Brown, 1951; Heinrich et al., 2015) have proven especially effective.

Similarly, in single-agent RL, autocurricula methods based on Unsupervised Environment Design (UED, Dennis et al., 2020) have proven effective in producing agents robust to a wide distribution of environments (Wang et al., 2019; 2020; Jiang et al., 2021a; Parker-Holder et al., 2022). UED seeks to adapt distributions over environments to maximise some metric of interest. Minimax-regret UED seeks to maximise the regret of the learning agent, viewing this process as a game between a teacher that proposes challenging environments and a student that learns to solve them.
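As a brief illustration of the minimax-regret objective (our notation here, following the single-agent setup of Dennis et al., 2020): let $U^{\theta}(\pi)$ denote the student's expected return under environment parameters $\theta$. The teacher-student game can then be written as

```latex
% Regret of a student policy \pi on an environment \theta (notation ours):
% the gap to the best policy for that specific environment.
\mathrm{Regret}^{\theta}(\pi) \;=\; \max_{\pi^{*}} U^{\theta}(\pi^{*}) \;-\; U^{\theta}(\pi)

% The teacher proposes regret-maximising environments; at equilibrium
% the student attains the minimax-regret policy over \Theta.
\pi \;\in\; \operatorname*{arg\,min}_{\pi} \; \max_{\theta \in \Theta} \; \mathrm{Regret}^{\theta}(\pi)
```

The outer minimisation is what yields the robustness guarantee: no environment in $\Theta$ can make the student perform much worse than the best policy tailored to that environment.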
At a Nash equilibrium of such games, the student policy provably reaches a minimax-regret policy over the set of possible environments, thereby providing a strong robustness guarantee. However, prior works in UED focus on single-agent RL and do not address the dependency between the environment and the strategies of other agents within it. In multi-agent domains, the behaviour of other agents plays a critical role in modulating the complexity and diversity of the challenges faced by a learning agent. For example, an empty environment with no blocks to hide behind might be most challenging against opponent policies that attack head-on, whereas environments full of winding hallways might be difficult against defensive policies. Robust RL agents should be expected to interact successfully with a wide assortment of other rational agents in their environment (Yang et al., 2021; Mahajan et al., 2022). Therefore, to become widely applicable, UED must be extended to include multi-agent dynamics as part of the environment design process.

We formalise this novel problem as an Underspecified Partially Observable Stochastic Game (UPOSG), which generalises UED to multi-agent settings. We then introduce Multi-Agent Environment Design Strategist for Open-Ended Learning (MAESTRO), the first approach to train generally capable agents in two-player UPOSGs such that they are robust to changes in both the environment and the co-player policies. MAESTRO is a replay-guided approach that explicitly considers the dependence between agents and environments by jointly sampling environment/co-player pairs using a regret-based curriculum and population learning (see Figure 1). In partially observable two-player zero-sum games, we show that at equilibrium, the MAESTRO student policy reaches a Bayes-Nash equilibrium with respect to a regret-maximising distribution over environments.
Furthermore, in fully observable settings, it attains a Nash equilibrium policy in every environment against every rational agent. We assess the curricula induced by MAESTRO and a variety of strong baselines in two competitive two-player games: a sparse-reward, grid-based LaserTag environment with discrete actions (Lanctot et al., 2017) and a dense-reward, pixel-based MultiCarRacing environment with continuous actions (Schwarting et al., 2021). In both cases, MAESTRO produces more robust agents than baseline autocurriculum methods on out-of-distribution (OOD) human-designed environment instances against unseen co-players. Furthermore, we show that MAESTRO agents, trained only on randomised environments and having never seen the target task, can significantly outperform specialist agents trained directly on the target environment. Moreover, in analysing how the student's regret varies across environments and co-players, we find that a joint curriculum, as produced by MAESTRO, is indeed required for finding the highest-regret levels, as necessitated by UED.

In summary, we make the following core contributions: (i) we provide the first formalism for multi-agent learning in underspecified environments; (ii) we introduce MAESTRO, a novel approach that jointly learns autocurricula over environment/co-player pairs, implicitly modelling their dependence; (iii) we prove that MAESTRO inherits from the single-agent setting the theoretical property of implementing a minimax-regret policy at equilibrium, which corresponds to a Bayes-Nash or Nash equilibrium in certain settings; and (iv) by rigorously analysing the curriculum induced by MAESTRO and evaluating MAESTRO agents against strong baselines, we empirically demonstrate the importance of a joint curriculum over environments and co-players.

2. PROBLEM STATEMENT AND PRELIMINARIES

In single-agent domains, the problem of Unsupervised Environment Design (UED) is cast in the framework of an underspecified POMDP (UPOMDP; Dennis et al., 2020), which explicitly augments a standard POMDP with a set of free parameters controlling aspects of the environment that are subject to the design process. We extend this formalism to the multi-agent setting using stochastic games (Shapley, 1953).
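To fix intuition ahead of the formal definition, a UPOSG can be sketched, by direct analogy with the single-agent UPOMDP, as a tuple (the notation below is ours and may differ from the formal treatment):

```latex
% An n-agent UPOSG, extending the UPOMDP of Dennis et al. (2020); notation ours.
\mathcal{M} \;=\; \langle n,\, A,\, O,\, \Theta,\, S,\, \mathcal{T},\, \mathcal{I},\, \mathcal{R},\, \gamma \rangle
% Here \Theta is the set of free environment parameters: fixing a \theta \in \Theta
% yields an ordinary partially observable stochastic game \mathcal{M}_{\theta},
% while the remaining elements (agents, actions, observations, states, transition,
% observation, and reward functions, and discount) play their usual roles.
```

The design process then amounts to choosing distributions over $\Theta$, exactly as in single-agent UED, except that the difficulty of any $\mathcal{M}_{\theta}$ now also depends on the policies of the other agents in it.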



¹Videos of MAESTRO agents are available at maestro.samvelyan.com



Figure 1: A diagram of MAESTRO. MAESTRO maintains a population of co-players, each having an individual buffer of high-regret environments. When new environments are sampled, the student's regret is calculated with respect to the corresponding co-player and added to the co-player's buffer. MAESTRO continually provides high-regret environment/co-player pairs for training the student.
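The curriculum loop described in the caption above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the class and method names (`MaestroCurriculum`, `observe`, `sample`), the scalar stand-in for environment parameters, and the `replay_prob` knob are all our own assumptions.

```python
import random


class MaestroCurriculum:
    """Illustrative sketch of MAESTRO's joint environment/co-player curriculum.

    Each co-player in the population keeps its own buffer of the environments
    on which the student incurred the highest regret. Names and structure are
    hypothetical, not the paper's actual API.
    """

    def __init__(self, population, buffer_size=4, replay_prob=0.5):
        self.population = list(population)   # fixed population of co-player policies
        self.buffer_size = buffer_size       # environments kept per co-player
        self.replay_prob = replay_prob       # chance of replaying a stored environment
        # Per-co-player buffer of (regret, env_params) entries.
        self.buffers = {co: [] for co in self.population}

    def observe(self, co_player, env_params, regret):
        """Record the student's regret on (env, co-player); keep only top entries."""
        buf = self.buffers[co_player]
        buf.append((regret, env_params))
        buf.sort(key=lambda entry: entry[0], reverse=True)  # highest regret first
        del buf[self.buffer_size:]

    def sample(self, rng=random):
        """Propose the next (env_params, co_player) pair for student training."""
        co = rng.choice(self.population)
        buf = self.buffers[co]
        if buf and rng.random() < self.replay_prob:
            return buf[0][1], co             # replay this co-player's hardest environment
        return rng.uniform(0.0, 1.0), co     # otherwise, a freshly sampled environment
```

The key design point this sketch tries to capture is that regret is stored *per co-player* rather than globally: an environment that is trivial against one opponent may be maximally challenging against another, so replay decisions are always made jointly over the pair.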

