CO-EVOLUTION AS MORE THAN A SCALABLE ALTERNATIVE FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In recent years, gradient-based multi-agent reinforcement learning has grown increasingly successful. One contributing factor is the use of shared parameters for learning policy networks. While this approach scales well with the number of agents during execution, it lacks this quality during training, as the number of produced samples grows linearly with the number of agents. For a very large number of agents, this could lead to an inefficient use of the substantial amount of produced samples. Moreover, in single-agent reinforcement learning, policy search with evolutionary algorithms has shown viable success when sampling can be parallelized on a larger scale. The method proposed here not only considers sampling in concurrent environments but further investigates sampling diverse parameters from the population in co-evolution in joint environments during training. This co-evolutionary policy search has proven capable of training a large number of agents. Beyond that, it has been shown to produce results in smaller environments that are competitive with gradient-descent-based methods. This surprising result makes evolutionary algorithms a promising candidate for further research in the context of multi-agent reinforcement learning.

1. INTRODUCTION

The core idea of this work is based on a union of the concepts of parameter sharing for multi-agent policies and policy search with evolutionary algorithms (EA). In general, stochastic gradient descent (SGD) together with back-propagation is a powerful approach for optimizing neural network parameters. For on-policy reinforcement learning with SGD, the generated samples depend on the current policy, which is itself subject to the gradient update. This circumstance makes the vanilla policy gradient high in variance and slow in learning (Sutton et al., 1999). Hence, contemporary policy gradient methods use baseline terms and other remedies to reduce the variance and further increase sample efficiency. One exemplary algorithm of this class is PPO (Schulman et al., 2017b), which additionally clips the objective based on the probability ratio of the old and updated policy. Policy search with evolutionary algorithms is gradient-free, and its use in reinforcement learning already has a long history (Heidrich-Meisner & Igel, 2008; Moriarty et al., 1999). One advantage of this method is the absence of back-propagation and of its computational cost, which scales linearly with the number of samples per iteration. One disadvantage of evolutionary algorithms, as black-box optimization methods, is the absence of step-wise information about state, action, and reward transitions. Furthermore, a key distinction from on-policy policy gradient methods is that an ensemble of policies, the population, is evaluated in each iteration. This increases the number of samples needed per iteration compared to policy gradients. However, with the advent of cloud computing and multi-threaded CPU architectures, parallel sampling has become accessible to a broader audience. Another difficulty of evolutionary algorithms was that they did not scale well with the number of parameters needed for deep learning. One efficient and almost hyperparameter-free algorithm, CMA-ES (Hansen et al., 2003), unfortunately scales quadratically with the number of parameters. Nevertheless, small networks with just 10^4 parameters have been shown to be capable of learning vision-based tasks, as demonstrated by Tang et al. (2020). That deep reinforcement learning is nonetheless possible with evolutionary algorithms is shown by multiple works that use variants of evolution strategies.
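The ratio clipping used by PPO can be sketched as follows. This is a minimal NumPy illustration of the clipped surrogate objective from Schulman et al. (2017b), not the implementation used in this work; the batch of per-action probabilities and advantage estimates is assumed to be given.

```python
import numpy as np

def ppo_clip_objective(new_probs, old_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective of PPO.

    The probability ratio r = pi_new / pi_old is clipped to the
    interval [1 - epsilon, 1 + epsilon], which limits how far a
    single update can move the policy from the sampling policy.
    """
    ratio = new_probs / old_probs
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the element-wise minimum (the pessimistic bound) and
    # average over the batch of transitions.
    return np.mean(np.minimum(unclipped, clipped))
```

For a transition whose ratio has already moved beyond 1 + epsilon with a positive advantage, the clipped term dominates, so further increasing the action's probability yields no additional objective value.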
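By contrast, a gradient-free evolution-strategies policy search of the kind discussed above can be sketched as follows. The fitness function, population size, and noise scale here are illustrative assumptions rather than the setup used in this paper; in a reinforcement learning application, the fitness of a parameter vector would be the episodic return of the corresponding policy.

```python
import numpy as np

def evolution_strategy_search(fitness, dim, population_size=50,
                              sigma=0.1, lr=0.05, iterations=200,
                              seed=0):
    """Minimal evolution strategy in the style of OpenAI-ES.

    Each iteration samples a population of Gaussian perturbations
    around the current parameter vector, evaluates their fitness,
    and moves the parameters along the fitness-weighted average
    perturbation. The evaluations are independent of each other,
    so they parallelize trivially across workers.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(iterations):
        noise = rng.standard_normal((population_size, dim))
        rewards = np.array([fitness(theta + sigma * n) for n in noise])
        # Rank-normalize rewards so the update is invariant to the
        # scale of the fitness values.
        ranks = rewards.argsort().argsort()
        weights = ranks / (population_size - 1) - 0.5
        theta += lr / (population_size * sigma) * noise.T @ weights
    return theta

# Toy usage: maximize a concave quadratic with optimum at (1, ..., 1).
target = np.ones(5)
best = evolution_strategy_search(lambda p: -np.sum((p - target) ** 2), dim=5)
```

Note that only the scalar fitness of each population member enters the update; no step-wise state, action, or reward information is used, which is exactly the black-box property discussed above.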

