BENCHMARKING MULTI-AGENT DEEP REINFORCEMENT LEARNING ALGORITHMS

Abstract

We benchmark commonly used multi-agent deep reinforcement learning (MARL) algorithms on a variety of cooperative multi-agent games. While there has been significant innovation in MARL algorithms, algorithms tend to be tested and tuned on a single domain, and their average performance across multiple domains is less well characterized. Furthermore, since the hyperparameters of these algorithms are carefully tuned to the task of interest, it is unclear whether hyperparameters can easily be found that allow an algorithm to be repurposed for other cooperative tasks with different reward structures and environment dynamics. To investigate the consistency of the performance of MARL algorithms, we build an open-source library of multi-agent algorithms, including DDPG/TD3/SAC with centralized Q functions, PPO with centralized value functions, and QMix, and test them across a range of tasks that vary in coordination difficulty and number of agents. The domains include the Multi-agent Particle World environments, StarCraftII micromanagement challenges, the Hanabi challenge, and the Hide-And-Seek environments. Finally, we investigate the ease of hyperparameter tuning for each algorithm by tuning hyperparameters in one environment per domain and re-using them in the other environments within that domain. The open-source code and more details can be found on our website: https://sites.google.com/view/marlbenchmarks.

1. INTRODUCTION

Widespread availability of high-speed computing, neural network architectures, and advances in reinforcement learning (RL) algorithms have led to a continuing series of interesting results in building cooperative artificial agents: agents collectively playing Hanabi to an expert level (Hu & Foerster, 2019), cooperative StarCraftII bots that outperform hand-designed heuristics (Rashid et al., 2018), and emergent languages constructed between agents (Mordatch & Abbeel, 2017). Each of these results has often come with the introduction of a new algorithm, leading to a proliferation of new algorithms that is rapidly advancing the field. However, these algorithms are often designed and tuned for optimal performance in a particular deployment environment. In particular, it is not unusual for each new algorithm to come with a new proposed benchmark on which it is evaluated. Consequently, it is not obvious that these algorithms can easily be re-purposed for new tasks; subtle interactions between the algorithm, the architecture, and the environment may lead to high asymptotic performance on one task and total failure when applied to a new task. Without examining an algorithm across a range of tasks, it is difficult to assess how general-purpose it is. Furthermore, the high asymptotic rewards that are often presented may hide complexities in using the algorithms in practice. The time that researchers spend finding optimal hyperparameters is often obscured, making it unclear how extensive a hyperparameter search was needed to find good hyperparameters. That is, researchers will often report a grid search over hyperparameters, but not the prior work that went into choosing a grid that actually contained good hyperparameters. Moreover, the computation budget used to tune the proposed algorithm may not be extended to the baseline algorithms it is compared against.
This can lead to inflated performance of the proposed algorithm relative to the baselines. All of these problems can arise without any ill intent on the part of the authors, but they make the problem of assessing algorithms quite challenging.
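The hyperparameter-transfer protocol described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the benchmark's actual code: the names `DOMAINS`, `GRID`, `tune`, and `benchmark` are all hypothetical placeholders, and the grid and environment lists are examples only.

```python
import itertools

# Hypothetical domain -> environment mapping (example names only).
# Hyperparameters are tuned on the FIRST environment of each domain,
# then reused unchanged on the remaining environments of that domain.
DOMAINS = {
    "particle_world": ["simple_spread", "simple_speaker_listener"],
    "starcraft": ["3m", "8m"],
}

# A small illustrative hyperparameter grid (placeholder values).
GRID = {"lr": [1e-3, 3e-4], "batch_size": [256, 1024]}

def tune(env, grid):
    """Placeholder grid search: enumerate configs for `env`.

    In a real study, each config would be trained and evaluated on `env`
    and the best-performing one returned; here we simply take the first.
    """
    configs = [dict(zip(grid, vals))
               for vals in itertools.product(*grid.values())]
    return configs[0]

def benchmark(domains, grid):
    """Tune once per domain, then reuse that config within the domain."""
    results = {}
    for domain, envs in domains.items():
        best = tune(envs[0], grid)   # tune on one environment per domain
        for env in envs:             # reuse the same config everywhere else
            results[env] = best
    return results

print(benchmark(DOMAINS, GRID))
```

The key design point this illustrates is that within a domain every environment shares one configuration, so any performance drop on the non-tuned environments directly measures how transferable the hyperparameters are.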

