CO-EVOLUTION AS MORE THAN A SCALABLE ALTERNATIVE FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In recent years, gradient-based multi-agent reinforcement learning has grown increasingly successful. One contributing factor is the use of shared parameters for learning policy networks. While this approach scales well with the number of agents during execution, it lacks this scalability during training, as the number of produced samples grows linearly with the number of agents. For a very large number of agents, this could lead to an inefficient use of the substantial amount of produced samples. Moreover, in single-agent reinforcement learning, policy search with evolutionary algorithms has shown viable success when sampling can be parallelized on a larger scale. The method proposed here not only considers sampling in concurrent environments but further investigates sampling diverse parameters from the population in co-evolution in joint environments during training. This co-evolutionary policy search has shown to be capable of training a large number of agents. Beyond that, it has been shown to produce competitive results in smaller environments in comparison to gradient-descent-based methods. This surprising result makes evolutionary algorithms a promising candidate for further research in the context of multi-agent reinforcement learning.

1. INTRODUCTION

The core idea of this work is based on a union of two concepts: parameter sharing for multi-agent policies and policy search with evolutionary algorithms (EA). In general, stochastic gradient descent (SGD) together with back-propagation is a powerful approach for optimizing neural network parameters. For on-policy reinforcement learning with SGD, the generated samples depend on the current policy, which is in turn subject to the gradient update. This circumstance makes the vanilla policy gradient high in variance and slow in learning (Sutton et al., 1999). Hence, contemporary policy gradient methods use baseline terms and other remedies to reduce the variance and further increase sample efficiency. One exemplary algorithm of this class is PPO (Schulman et al., 2017b), which further includes a clipping based on the probability ratio of the old and updated policy.

Policy search with evolutionary algorithms is gradient-free, and its use in reinforcement learning already has a long history (Heidrich-Meisner & Igel, 2008; Moriarty et al., 1999). One advantage of this method is the absence of back-propagation and of its computational cost, which scales linearly with the number of samples per iteration. One disadvantage of evolutionary algorithms, as black-box optimization methods, is the absence of step-wise information about state, action, and reward transitions. Furthermore, a key distinction from on-policy policy gradient methods is that an ensemble of policies, the population, is evaluated in each iteration. This increases the number of samples needed per iteration compared to policy gradients. However, with the advent of cloud computing and multi-threaded CPU architectures, parallel sampling has become accessible to a broader audience. Another difficulty of evolutionary algorithms was that they did not scale well with the number of parameters needed for deep learning. The efficient and almost hyperparameter-free algorithm CMA-ES (Hansen et al., 2003), for instance, unfortunately scales quadratically with the number of parameters. Nevertheless, small networks with just 10^4 parameters have proven capable of learning vision-based tasks, as shown by Tang et al. (2020).

That deep reinforcement learning is nonetheless possible with evolutionary algorithms is demonstrated by multiple works that use variants of evolutionary strategies (ES) or genetic algorithms (GA). In particular, the work of Salimans et al. (2017) has validated the capabilities of ES in deep reinforcement learning. Some of their core outcomes were that policy search with ES is relatively easy to scale, with an almost linear reduction in training time, since the policy update is computationally cheap in contrast to the sampling. Further, in their experiments the data efficiency was not substantially worse, within a range of 3-10 times that of the compared policy gradient methods. Another noteworthy implementation of ES in deep reinforcement learning is from Conti et al. (2017), which extends the fitness objective by a novelty term (Lehman & Stanley, 2011) to improve exploration. Besides, the work of Such et al. (2017) has shown that GA, too, are capable of training large networks for more complex tasks like Atari games and continuous control problems. Additionally, they showed a method describing network parameters in a compressed form through a chain of mutation seeds. Moreover, in the contribution of Hausknecht et al. (2014), one subclass of GA, the generative encoding algorithm HyperNEAT (Stanley et al., 2009), was used. It was, besides DQN (Mnih et al., 2013), one of the first solutions for Atari (Brockman et al., 2016) games. While not too popular in single-agent reinforcement learning, policy search with EA has thus shown some competitive results over recent years.

Multi-agent reinforcement learning (MARL) is an interesting extension of reinforcement learning that involves game-theoretic issues. The advantages of MARL compared to RL are obvious for problems that can be separated into multiple agents. The separation of state and action limits the curse of dimensionality and further enables decentralized learning and execution schemes compared to the single-agent case. Nevertheless, MARL introduces additional problems such as the non-stationarity caused by interdependent, non-converged policies.
Some other important problems, dependent on the specific task, are credit assignment, heterogeneity of agents, partially observable states, and communication between agents. The issue of scaling the number of agents is the primary motivation for this work. The study of Gupta et al. (2017) investigated the concept of parameter sharing (PS) for policies. This method can scale in execution to an arbitrary number of agents if the problem itself does not change with the number of agents. Otherwise, curriculum learning (Bengio et al., 2009) was found to enable adjustment to tasks that change with size. Yet not every environment allows scaling in size, and in very large environments the overabundance of samples generated per timestep and agent could further slow down SGD. Still, for the agent-wise smaller 5 vs. 5 player game of DOTA 2, the work of Berner et al. (2019) showed that learning on a larger scale, with batches of about 2 × 10^6 timesteps every 2 seconds, is feasible for parameter-sharing PPO in self-play.
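To make the scaling argument concrete, the following is a minimal sketch of decentralized execution under parameter sharing. The linear policy and all names here are illustrative assumptions for this sketch, not details of Gupta et al. (2017) or Berner et al. (2019):

```python
import numpy as np

def shared_policy_step(shared_weights, observations):
    """One decentralized execution step under parameter sharing (PS).

    shared_weights: a single weight matrix (n_actions x obs_dim) used by
                    every agent; a linear policy stands in for a network.
    observations:   one local observation vector per agent.
    """
    actions = []
    for obs in observations:
        # Every agent queries the same policy with its own local
        # observation, so the parameter count is constant in the
        # number of agents ...
        logits = shared_weights @ obs
        actions.append(int(np.argmax(logits)))
    # ... while the number of (observation, action, reward) training
    # samples per timestep still grows linearly with the agent count.
    return actions
```

The same shared parameter set thus serves any number of agents at execution time; the training-side consequence is the linear growth in samples discussed above.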

2. METHODS

First, let us revisit evolutionary strategies (ES) as a particularly interesting evolutionary algorithm for this work. In ES, the population is sampled from a normal distribution, with its mean and variances as the only describing parameters. For parameter-distribution-based RL, the authors of REINFORCE (Williams, 1992) already described a gradient ascent term. For this gradient ascent term, similar to TRPO (Schulman et al., 2017a), a natural gradient can be defined that accounts for the KL-divergences (Wierstra et al., 2014). A simple but empirically still effective variant of ES with a constant variance is that of Salimans et al. (2017), as sketched below. This simplicity makes that variant a good candidate for large-scale MARL experiments.

The work of Lehman et al. (2018) gives an insightful description of the differences in learning between ES and finite-difference-based methods. They conclude that because ES optimizes the average reward of the population, it seeks parameters that are especially robust to perturbation. Further, it is assumed by the authors that this parameter robustness could be a useful feature for co-evolution and self-play: the pressure to find a distribution whose samples are compatible for co-operation could lead to more stable solutions. Some successful early demonstrations of self-play with EA are found in Chellapilla & Fogel (2001).

All in all, investigating co-evolution on MARL problems seems interesting. But how can co-evolution be realized in the context of collaborative MARL? Parameter sharing is a valid option for policy gradient methods and would seemingly also be a viable option for policy search with EA. Yet EA need to sample multiple policies to estimate their expected fitness values, which increases the sampling demand linearly with the population size. With co-evolutionary sampling, one could reduce this growing complexity, especially for environments with many agents. However, if no local rewards are available, credit assignment could be a problem, and additional runs in different pairings become necessary for a good estimate of the fitness.
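As a concrete illustration of the constant-variance ES variant discussed above, the following minimal sketch shows one generation in NumPy. It stochastically ascends the smoothed objective E[F(theta + sigma * eps)], whose gradient is E[F(theta + sigma * eps) * eps] / sigma. The function names and hyperparameter values are assumptions of this sketch, not a definitive reproduction of Salimans et al. (2017), although mirrored sampling and rank normalization are in the spirit of that work:

```python
import numpy as np

def es_update(theta, fitness_fn, pop_size=50, sigma=0.1, lr=0.01):
    """One generation of a constant-variance ES.

    theta:      current mean parameter vector (np.ndarray)
    fitness_fn: maps a parameter vector to a scalar episodic return
                (one full environment rollout)
    """
    # Mirrored (antithetic) sampling: evaluate +eps and -eps pairs.
    half = np.random.randn(pop_size // 2, theta.size)
    eps = np.concatenate([half, -half])
    # Each evaluation is an independent rollout and parallelizes trivially,
    # which is the source of the near-linear training-time reduction.
    returns = np.array([fitness_fn(theta + sigma * e) for e in eps])
    # Rank normalization makes the update robust to reward scaling.
    ranks = returns.argsort().argsort().astype(np.float64)
    weights = ranks / (len(ranks) - 1) - 0.5
    # Monte-Carlo estimate of the gradient of the expected return.
    grad = weights @ eps / (len(eps) * sigma)
    return theta + lr * grad
```

Note that the update itself is a single matrix-vector product, which matches the observation above that the update cost is cheap relative to the sampling cost.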

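Building on this, co-evolutionary sampling as motivated above can be sketched as follows: instead of evaluating each population member in its own environment, several distinct candidates are drawn into one joint episode, so a single rollout yields fitness information for multiple candidates at once. All names here (run_episode, the random pairing scheme) are hypothetical assumptions for illustration, not the definitive method:

```python
import numpy as np

def coevolution_fitness(theta, sigma, pop_size, n_agents,
                        run_episode, n_pairings=4):
    """Fitness estimation with co-evolutionary sampling.

    run_episode: assumed to map a list of n_agents parameter vectors
                 (one per agent in a joint environment) to a list of
                 n_agents per-agent returns.
    """
    candidates = [theta + sigma * np.random.randn(theta.size)
                  for _ in range(pop_size)]
    fitness = np.zeros(pop_size)
    counts = np.zeros(pop_size)
    for _ in range(n_pairings):
        # Random pairings: one episode evaluates n_agents candidates at
        # once, cutting the episodes per generation by a factor of n_agents.
        order = np.random.permutation(pop_size)
        for group in order.reshape(-1, n_agents):  # assumes pop_size % n_agents == 0
            fitness[group] += run_episode([candidates[i] for i in group])
            counts[group] += 1
    return candidates, fitness / counts
```

This sketch assumes local (per-agent) rewards; with only a global reward, averaging over several random pairings becomes the crude credit-assignment device mentioned above, at the cost of additional runs.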