EVOLVING POPULATIONS OF DIVERSE RL AGENTS WITH MAP-ELITES

Abstract

Quality Diversity (QD) has emerged as a powerful alternative optimization paradigm that aims at generating large and diverse collections of solutions, notably with its flagship algorithm MAP-ELITES (ME), which evolves solutions through mutations and crossovers. While very effective for some unstructured problems, early ME implementations relied exclusively on random search to evolve the population of solutions, rendering them notoriously sample-inefficient for high-dimensional problems, such as when evolving neural networks. Follow-up works addressed these shortcomings by exploiting gradient information to guide the search, through techniques borrowed from either Black-Box Optimization (BBO) or Reinforcement Learning (RL). While mixing RL techniques with ME unlocked state-of-the-art performance for robotics control problems that require a good amount of exploration, it also plagued these ME variants with limitations common among RL algorithms that ME was free of, such as hyperparameter sensitivity, high stochasticity, and training instability, the latter worsening as the population size increases since some components are shared across the population in recent approaches. Furthermore, existing approaches mixing ME with RL tend to be tied to a specific RL algorithm, which effectively prevents their use on problems where the corresponding RL algorithm fails. To address these shortcomings, we introduce a flexible framework that allows the use of any RL algorithm and alleviates the aforementioned limitations by evolving populations of agents (whose definition includes hyperparameters and all learnable parameters) instead of just policies. We demonstrate the benefits brought about by our framework through extensive numerical experiments on a number of robotics control problems, some of them with deceptive rewards, taken from the QD-RL literature. We open-source an efficient JAX-based implementation of our algorithm in the QDax library¹.

1. INTRODUCTION

Drawing inspiration from natural evolution's ability to produce living organisms that are both diverse and high-performing through competition in different niches, Quality Diversity (QD) methods evolve populations of diverse solutions to solve an optimization problem. In contrast to traditional Optimization Theory, where the goal is to find one solution maximizing a given scoring function, QD methods explicitly use a mapping from solutions to a vector space, referred to as a behavior descriptor space, to characterize solutions, and maintain a data structure, referred to as a repertoire, filled with high-performing solutions that cover this space as much as possible, in a process commonly referred to as illumination. This new paradigm has led to breakthroughs over the past decade in many domains ranging from robotics control to engineering design and game generation (Gaier et al., 2018; Sarkar & Cooper, 2021; Gravina et al., 2019; Cully & Demiris, 2018). QD methods have a number of advantages over standard optimization ones. Actively seeking and maintaining diversity in a population of solutions has proved to be an effective exploration strategy when the fitness function has no particular structure, reaching high-performing regions through a series of stepping stones (Gaier et al., 2019). Additionally, having a diverse set of high-performing solutions at one's disposal can be greatly beneficial to a decision maker (Lehman et al., 2020), for instance because the scoring function may fail to model reality accurately (Cully et al., 2015). MAP-ELITES (Mouret & Clune, 2015) has emerged as one of the most widely used algorithms in the QD community for its simplicity and efficacy. It divides the behavior descriptor space into a discrete mesh of cells and strives to populate each cell with a solution whose behavior descriptor falls within it and whose fitness is as high as possible; a minimal sketch of this loop is given below. This algorithm has been used with great success in many applications, such as developing controllers for hexapod robots that can adapt to damage in real time (Cully et al., 2015). However, just like many evolutionary algorithms, it struggles on problems with high-dimensional search spaces, such as when evolving controllers parametrized by neural networks, as it relies on random mutations and crossovers to evolve the population.
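To make this loop concrete, here is a minimal Python sketch of MAP-ELITES under simplifying assumptions (behavior descriptors normalized to [0, 1]^d, a uniform grid, uniform parent selection); `sample_solution`, `variation`, and `evaluate` are user-supplied callables, and all names are illustrative rather than the QDax API.

```python
import numpy as np

def map_elites(sample_solution, variation, evaluate, cells_per_dim, n_iterations, rng):
    repertoire = {}  # maps a cell index (tuple of ints) to its elite (solution, fitness)

    def try_insert(solution):
        fitness, descriptor = evaluate(solution)
        # Discretize the descriptor, assumed to lie in [0, 1]^d, into a mesh cell.
        cell = tuple(min(int(b * cells_per_dim), cells_per_dim - 1) for b in descriptor)
        # The solution becomes the cell's elite if the cell is empty or if it
        # improves on the incumbent's fitness.
        if cell not in repertoire or fitness > repertoire[cell][1]:
            repertoire[cell] = (solution, fitness)

    for _ in range(128):  # seed the repertoire with random solutions
        try_insert(sample_solution(rng))
    for _ in range(n_iterations):  # evolve elites sampled from the repertoire
        keys = list(repertoire)
        parent_a = repertoire[keys[rng.integers(len(keys))]][0]
        parent_b = repertoire[keys[rng.integers(len(keys))]][0]
        try_insert(variation(parent_a, parent_b, rng))
    return repertoire

# Example call (with user-defined sample/variation/evaluate functions):
# repertoire = map_elites(sample, iso_variation, evaluate,
#                         cells_per_dim=16, n_iterations=10_000,
#                         rng=np.random.default_rng(0))
```

When the solutions are neural network parameters, `variation` is exactly where random mutations and crossovers fall short, which motivates the gradient-informed variants discussed next.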
The breakthroughs of Deep Reinforcement Learning in sequential decision-making problems prompted a new line of work in the QD field to make these algorithms capable of dealing with deep neural network parametrizations. These new methods borrow techniques from either Black-Box Optimization (BBO) or Reinforcement Learning (RL) in order to exploit gradient information to guide the search. Methods based on BBO techniques (Colas et al., 2020; Conti et al., 2018) follow the approaches of earlier works on scaling evolutionary algorithms to neuro-evolution, such as Salimans et al. (2017); Stanley & Miikkulainen (2002), and empirically estimate gradients w.r.t. the parameters by stochastically perturbing them by small values a number of times. Methods borrowing tools from RL, such as Nilsson & Cully (2021); Pierrot et al. (2022), exploit the Markov Decision Process structure of the problem and adapt off-policy RL algorithms, such as TD3 (Fujimoto et al., 2018), to evolve the population. This often entails adding components to the evolutionary algorithm (e.g. a replay buffer, critic networks, hyperparameters of the RL agent) and methods differ in the way these components are managed.

RL-based MAP-ELITES approaches have outperformed other MAP-ELITES variants, and even state-of-the-art RL methods, on a variety of robotics control problems that require a substantial amount of exploration due to deceptive or sparse rewards. However, the introduction of RL components in MAP-ELITES has come with a number of downsides: (i) high sensitivity to hyperparameters (Khadka et al., 2019; Zhang et al., 2021), (ii) training instability, (iii) high variability in performance, and, perhaps most importantly, (iv) limited parallelizability, as many components are shared across the population for improved sample efficiency. Furthermore, existing RL-based MAP-ELITES approaches are inflexibly tied to a specific RL algorithm, which effectively prevents their use on problems where that algorithm fails. These newly-introduced downsides are particularly problematic as they negate some of the main advantages of evolutionary methods, the very advantages responsible for their widespread use. Evolutionary methods are notoriously trivial to parallelize, and their convergence speed scales almost linearly with the amount of computational power available, as shown in Lim et al. (2022) for MAP-ELITES. This is all the more relevant with the advent of modern libraries such as JAX (Bradbury et al., 2018), which make it seamless not only to distribute computations, including those taking place in the BRAX physics engine (Freeman et al., 2021), over multiple accelerators, but also to fully leverage their parallelization capabilities through automated vectorization primitives; see Lim et al. (2022); Flajolet et al. (2022); Tang et al. (2022). Evolutionary methods are also notoriously robust to the exact choice of hyperparameters (Khadka et al., 2019), which makes them well suited to tackling new problems. This is in stark contrast with RL algorithms, which tend to require problem-specific hyperparameter tuning to perform well (Khadka et al., 2019; Zhang et al., 2021).

In order to overcome the aforementioned limitations of RL-based MAP-ELITES approaches, we develop a new MAP-ELITES framework that:

1. can be generically and seamlessly compounded with any RL agent;
2. is robust to the exact choice of hyperparameters, by embedding a meta-learning loop within MAP-ELITES;
3. is trivial to scale to large population sizes, which helps alleviate stochasticity and training-stability issues without a priori entering offline RL regimes, by independently evolving populations of entire agents (including all of their components, such as replay buffers) instead of evolving policies only and sharing the other components across the population.

Our method, dubbed PBT-MAP-ELITES, builds on MAP-ELITES and combines standard isoline operators with policy gradient updates to get the best of both worlds; a schematic sketch of this agent-level variation follows this paragraph. We evaluate PBT-MAP-ELITES when used with the SAC (Haarnoja et al., 2018) and TD3 (Fujimoto et al., 2018) agents on a set of five standard robotics control problems taken from the QD literature and show that it either yields performance on par with or outperforms state-of-the-art MAP-ELITES approaches, in some cases by a wide margin, while not being provided with hyperparameters tuned beforehand for these problems. Finally, we open-source an efficient JAX-based implementation of our algorithm that combines the efficient implementation of PBT from Flajolet et al. (2022) with that of MAP-ELITES from Lim et al. (2022). We refer to these two prior works for speed-up data points compared to alternative implementations.
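The following sketch illustrates the agent-level variation just described; it is a schematic illustration under assumed names, not the exact QDax implementation. An offspring copies an entire parent agent, applies the iso-line operator to the policy parameters, and perturbs the RL hyperparameters PBT-style; each agent keeps its own critic and replay buffer, so no component is shared across the population. The `sigma` defaults below are typical values, not prescribed ones.

```python
import copy
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Agent:
    policy_params: dict   # numpy arrays, evolved with the iso-line operator below
    critic_params: dict   # trained by the RL agent, inherited rather than shared
    hyperparams: dict     # e.g. learning rate, discount, SAC temperature
    replay_buffer: list = field(default_factory=list)  # one buffer per agent

def isoline(x1, x2, rng, sigma_iso=0.005, sigma_line=0.05):
    # Iso+LineDD operator (Vassiliades & Mouret, 2018):
    # child = x1 + sigma_iso * eps + sigma_line * eps_line * (x2 - x1),
    # with eps ~ N(0, I) and eps_line ~ N(0, 1).
    return (x1 + sigma_iso * rng.standard_normal(x1.shape)
            + sigma_line * rng.standard_normal() * (x2 - x1))

def evolve(parent_a, parent_b, rng):
    child = copy.deepcopy(parent_a)  # inherits critic and replay buffer from parent_a
    child.policy_params = {
        name: isoline(parent_a.policy_params[name], parent_b.policy_params[name], rng)
        for name in parent_a.policy_params
    }
    # Meta-learning loop: perturb hyperparameters; the usual MAP-ELITES
    # competition within cells then selects for good settings.
    child.hyperparams = {
        name: value * float(rng.choice([0.8, 1.25]))
        for name, value in parent_a.hyperparams.items()
    }
    return child
```

In the full algorithm, each offspring would then perform policy gradient updates with its own RL components and its own replay buffer before being evaluated and reinserted into the repertoire, which is what keeps the population trivially parallelizable and out of offline RL regimes.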



¹ https://github.com/adaptive-intelligent-robotics/QDax




