BENCHMARKING MULTI-AGENT DEEP REINFORCEMENT LEARNING ALGORITHMS

Abstract

We benchmark commonly used multi-agent deep reinforcement learning (MARL) algorithms on a variety of cooperative multi-agent games. While there has been significant innovation in MARL algorithms, algorithms tend to be tested and tuned on a single domain, and their average performance across multiple domains is less characterized. Furthermore, since the hyperparameters of the algorithms are carefully tuned to the task of interest, it is unclear whether hyperparameters can easily be found that allow the algorithm to be repurposed for other cooperative tasks with different reward structures and environment dynamics. To investigate the consistency of the performance of MARL algorithms, we build an open-source library of multi-agent algorithms including DDPG/TD3/SAC with centralized Q-functions, PPO with centralized value functions, as well as QMix, and test them across a range of tasks that vary in coordination difficulty and agent number. The domains include the Multi-agent Particle World environments, the StarCraftII micromanagement challenges, the Hanabi challenge, and the Hide-And-Seek environments. Finally, we investigate the ease of hyperparameter tuning for each of the algorithms by tuning hyperparameters in one environment per domain and re-using them in the other environments within the domain. The open-source code and more details can be found on our website: https://sites.google.com/view/marlbenchmarks.

1. INTRODUCTION

Widespread availability of high-speed computing, neural network architectures, and advances in reinforcement learning (RL) algorithms have led to a continuing series of interesting results in building cooperative artificial agents: agents collectively playing Hanabi to an expert level (Hu & Foerster, 2019), cooperative StarCraftII bots that outperform hand-designed heuristics (Rashid et al., 2018), and agents that construct emergent languages (Mordatch & Abbeel, 2017). Each of these results has often come with the introduction of a new algorithm, leading to a proliferation of new algorithms that is rapidly advancing the field. However, these algorithms are often designed and tuned to get optimal performance in a particular deployment environment. In particular, it is not unusual for each new algorithm to come with a new proposed benchmark on which it is evaluated. Consequently, it is not obvious that these algorithms can easily be re-purposed for new tasks; subtle interactions between the algorithm, the architecture, and the environment may lead to high asymptotic performance on one task and total failure when applied to a new task. Without examining an algorithm across a range of tasks, it is difficult to assess how general-purpose it is. Furthermore, the high asymptotic rewards that are often presented may hide complexities in using the algorithms in practice. The amount of time researchers spend finding optimal hyperparameters is often obscured, making it unclear how extensive a hyperparameter search was needed to find good hyperparameters. That is, researchers will often report a grid search of hyperparameters but not the prior work that was done to pick out a grid that actually contained good hyperparameters. Furthermore, the amount of computation used to tune the studied algorithm may not be provided to the baseline algorithms that it is compared against.
This can lead to inflated performance of the proposed algorithm relative to the baselines. All these problems can arise without any ill intent on the part of the authors, but they make the problem of assessing algorithms quite challenging. The downstream consequence of this proliferation of algorithms, coupled with an absence of standard benchmarks, is a lack of clarity on the part of practitioners as to which algorithm will give consistent, high performance with minimal tuning. Researchers are often operating under computational constraints that limit how extensive a hyperparameter sweep they can perform; the ease with which good hyperparameters can be found is consequently a useful metric. When tackling a new multi-agent problem, researchers have no clear answer to the questions: 1) which MARL algorithm should I use to maximize performance, and 2) given my computational resources, which algorithm is likeliest to work under my constraints? We present an attempt to evaluate the performance, robustness, and relative ease of use of these algorithms by benchmarking them across a wide variety of environments that vary in agent number, exploration difficulty, and coordination complexity. By exploring a large range of possible environments, we identify algorithms that perform well on average and serve as a strong starting point for a variety of problems. We tackle the question of the relative difficulty of finding hyperparameters by looking at how hyperparameters transfer across settings: tuning hyperparameters on one set of environments and applying them without re-tuning on the remaining environments. Using this procedure, we can provide effective recommendations on algorithm choice for researchers attempting to deploy deep multi-agent reinforcement learning while operating under constrained hyperparameter budgets. We test Proximal Policy Optimization (Schulman et al., 2017) with centralized value functions (MAPPO), Multi-Agent DDPG (MADDPG) (Lowe et al., 2017), Multi-Agent TD3 (MATD3) (Fujimoto et al., 2018a), a multi-agent variant of Soft Actor-Critic (MASAC) (Haarnoja et al., 2018), and QMix (Rashid et al., 2018).

We focus specifically on the performance of these algorithms on fully cooperative tasks, as this avoids game-theoretic issues around computing the distance to Nash equilibria and allows us to characterize performance solely in terms of asymptotic reward.

The contributions of this paper are the following:
• Benchmarking multi-agent variants of single-agent algorithms across a wide range of possible tasks including StarCraftII micromanagement (Rashid et al., 2019), Multi-agent Particle World (Mordatch & Abbeel, 2017), Hanabi (Bard et al., 2020), and the Hide-And-Seek domain (Baker et al., 2019).
• Establishing that under constrained hyperparameter search budgets, the multi-agent variant of PPO appears to be the most consistent algorithm across different domains.
• The design and release of a new multi-agent library of various on/off-policy learning algorithms with recurrent policy support.

2. RELATED WORK

MARL algorithms have a long history but have, until recently, primarily been applied in tabular settings (Littman, 1994; Busoniu et al., 2008). Notions of using a Q-function that operates on the actions of all agents, known as Joint-Action Learners (Claus & Boutilier, 1998), have existed in the literature since its inception, with algorithms like Hyper-Q (Tesauro, 2004) using inferred estimates of other agent strategies in the Q-function. Recent MARL algorithms have built upon these ideas by incorporating neural networks (Tampuu et al., 2017), policy-gradient methods (Foerster et al., 2017), and finding ways to combine local and centralized Q-functions to enable centralized learning with decentralized execution (Lowe et al., 2017; Sunehag et al., 2018).

A variety of new multi-agent environments has also been introduced, including a platform that can efficiently support hundreds of particle agents for cooperative tasks, multi-agent MuJoCo, in which each joint is an independent agent (Schroeder de Witt et al., 2020), and CityFlow (Zhang et al., 2019), which studies large-scale decentralized traffic light control. There has also been a variety of attempts to benchmark MARL algorithms that differ in scope from our paper. Gupta et al. (2017) benchmark a similar set of algorithms to ours on a wide variety of environments; however, they do not consider algorithms that train in a centralized fashion.
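The centralized-learning-with-decentralized-execution pattern discussed above can be illustrated with a minimal sketch. This is not the API of our released library; random linear maps stand in for trained networks, and all shapes and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, OBS_DIM, ACT_DIM = 3, 4, 2
STATE_DIM = N_AGENTS * OBS_DIM  # global state = concatenated local observations

def make_linear(in_dim, out_dim):
    """A random linear map standing in for a trained network."""
    W = rng.normal(size=(in_dim, out_dim))
    return lambda x: x @ W

# Decentralized actors: each agent's policy sees only its own observation.
actors = [make_linear(OBS_DIM, ACT_DIM) for _ in range(N_AGENTS)]

# Centralized critic: the value function conditions on the full global state.
# It is used only during training; execution needs just the actors.
critic = make_linear(STATE_DIM, 1)

obs = rng.normal(size=(N_AGENTS, OBS_DIM))       # one local observation per agent
actions = np.stack([actors[i](obs[i]) for i in range(N_AGENTS)])
value = critic(obs.reshape(-1))                  # centralized value estimate

print(actions.shape)  # (3, 2): one action per agent, from local information only
print(value.shape)    # (1,): a single value for the joint state
```

The asymmetry is the key point: the critic's extra information eases credit assignment during training, while each deployed policy remains executable from local observations alone.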


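The hyperparameter-transfer protocol described in the introduction (tune on a single environment per domain, then re-use the resulting configuration unchanged on that domain's remaining environments) can be sketched as follows. The domain and environment names are illustrative placeholders rather than the paper's exact task lists, and the scoring function is a stand-in for a full training run:

```python
from itertools import product

# Hypothetical per-domain environment lists; the first entry in each list
# is used for tuning, the rest are held out.
domains = {
    "particle": ["spread", "reference", "speaker_listener"],
    "smac": ["3m", "8m", "2s3z"],
}

# A small hyperparameter grid to search over.
grid = {"lr": [1e-4, 3e-4], "batch_size": [256, 1024]}

def evaluate(env, config):
    """Stand-in for training a MARL algorithm on `env` with `config`
    and returning its final average return (a toy deterministic score)."""
    return -abs(config["lr"] - 3e-4) - abs(config["batch_size"] - 1024) / 1e4

results = {}
for domain, envs in domains.items():
    tune_env, *held_out = envs
    # 1) Grid-search hyperparameters on the single tuning environment.
    configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
    best = max(configs, key=lambda c: evaluate(tune_env, c))
    # 2) Re-use the best configuration, untouched, on the held-out environments.
    results[domain] = {env: evaluate(env, best) for env in held_out}

print(results)
```

The held-out scores, aggregated per algorithm, are what such a protocol would report: they measure how well a tuned configuration generalizes within a domain rather than peak performance after per-task tuning.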