SAMPLE EFFICIENT QUALITY DIVERSITY FOR NEURAL CONTINUOUS CONTROL

Abstract

We propose a novel Deep Neuroevolution algorithm, QD-RL, that combines the strengths of off-policy reinforcement learning (RL) algorithms and Quality Diversity (QD) approaches to solve continuous control problems with neural controllers. The QD part contributes structural biases by decoupling the search for diversity from the search for high return, resulting in efficient management of the exploration-exploitation trade-off. The RL part contributes sample efficiency by relying on off-policy gradient-based updates of the agents. More precisely, we train a population of off-policy deep RL agents to simultaneously maximize diversity within the population and the return of each individual agent. QD-RL selects agents interchangeably from a Pareto front or from a MAP-Elites grid, resulting in stable and efficient population updates. Our experiments in the ANT-MAZE and ANT-TRAP environments show that QD-RL can solve challenging exploration and control problems with deceptive rewards while being two orders of magnitude more sample efficient than its evolutionary counterpart.
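The Pareto-front selection mentioned above can be sketched as follows. This is a minimal illustration with hypothetical names (`pareto_front`, the toy scores), not the paper's implementation: an agent is kept if no other agent is at least as good on both return and novelty and strictly better on one.

```python
import numpy as np

def pareto_front(returns, novelties):
    """Indices of agents that are non-dominated on (return, novelty).

    Agent i is dominated if some agent j is at least as good on both
    objectives and strictly better on at least one.
    """
    scores = np.stack([returns, novelties], axis=1)
    n = len(scores)
    front = []
    for i in range(n):
        dominated = any(
            np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
            for j in range(n) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Three agents scored on (return, novelty); agent 0 dominates agent 1,
# while agents 0 and 2 trade off return against novelty.
idx = pareto_front(np.array([3.0, 1.0, 2.0]), np.array([0.5, 0.2, 0.9]))
# idx == [0, 2]
```

Selecting parents from this front lets the population pursue both objectives at once instead of collapsing onto the highest-return agent.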

1. INTRODUCTION

Natural evolution has the fascinating ability to produce organisms that are all high-performing in their respective niches. Inspired by this ability to produce a tremendous diversity of living systems within a single run, Quality-Diversity (QD) is a new family of optimization algorithms that aim at searching for a collection of both diverse and high-performing solutions (Pugh et al., 2016). While classic optimization methods focus on finding a single efficient solution, the role of QD optimization is to cover the range of possible solution types and to return the best solution for each type. This process is sometimes referred to as "illumination", in opposition to optimization, as the goal of these algorithms is to reveal (or illuminate) a search space of interest (Mouret & Clune, 2015). QD approaches generally build on black-box optimization methods such as evolutionary algorithms to optimize a population of solutions (Cully & Demiris, 2017). These algorithms often rely on random mutations, which explore small search spaces well but struggle when confronted with higher-dimensional problems. As a result, QD approaches often scale poorly to large and continuous sequential decision problems, where controllers with many parameters, such as deep neural networks, are mandatory (Colas et al., 2020). Besides, while evolutionary methods are most valuable when the policy gradient cannot be applied safely (Cully et al., 2015), in policy search problems that can be formalized as a Markov Decision Process (MDP), Policy Gradient (PG) methods can exploit the analytical structure of neural networks to optimize their parameters more efficiently. It therefore makes sense to exploit these properties when the Markov assumption holds and the controller is a neural network.
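The illumination principle behind QD can be made concrete with a minimal MAP-Elites-style loop: discretize the behavior space into cells and keep the best solution found in each cell. The function names and the one-dimensional toy task below are our own illustrative assumptions, not the paper's setup:

```python
import random

def map_elites(evaluate, mutate, random_solution, cells=10, iters=1000):
    """Minimal MAP-Elites-style loop: discretize a 1-D behavior descriptor
    into cells and keep the best (elite) solution found in each cell."""
    archive = {}  # cell index -> (fitness, solution)
    for _ in range(iters):
        if archive and random.random() < 0.9:
            # Mutate a randomly chosen elite from the archive.
            _, parent = random.choice(list(archive.values()))
            x = mutate(parent)
        else:
            x = random_solution()
        fitness, behavior = evaluate(x)  # behavior assumed in [0, 1)
        cell = min(int(behavior * cells), cells - 1)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, x)
    return archive

# Toy task (our own illustration): solutions are scalars in [0, 1), the
# behavior descriptor is the solution itself, and fitness rewards
# proximity to 0.5.
random.seed(0)  # for reproducibility
toy = map_elites(
    evaluate=lambda x: (-(x - 0.5) ** 2, x),
    mutate=lambda x: min(max(x + random.gauss(0, 0.1), 0.0), 0.999),
    random_solution=random.random,
)
```

Each cell of the returned archive holds the best solution whose behavior fell in that cell, so the grid both covers behavior space and improves locally: the "illumination" view of optimization.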
From the deep reinforcement learning (RL) perspective, the focus on sparse or deceptive rewards led to the realization that maximizing diversity independently from rewards might be a good exploration strategy (Lehman & Stanley, 2011a; Colas et al., 2018; Eysenbach et al., 2018). More recently, it was established that if one can define a small behavior space or outcome space corresponding to what matters to determine success, maximizing diversity in this space might be the optimal strategy to find a sparse reward (Doncieux et al., 2019). In this work, we are the first to combine QD methods with PG methods. On the one hand, our aim is to strongly improve the sample efficiency of QD methods so as to obtain neural controllers solving continuous action space MDPs. On the other hand, it is to strongly improve the exploration capabilities of deep RL methods in the context of sparse rewards or deceptive gradient problems, such as avoiding traps.
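One common way to turn "diversity in a behavior space" into an optimizable score is the novelty measure of novelty search (Lehman & Stanley, 2011a): the mean distance of a behavior descriptor to its k nearest neighbors in an archive of previously seen behaviors. A minimal sketch, with illustrative names and data:

```python
import numpy as np

def novelty(behavior, archive, k=3):
    """Novelty score from novelty search: mean Euclidean distance of a
    behavior descriptor to its k nearest neighbors in a behavior archive."""
    dists = np.linalg.norm(np.asarray(archive) - np.asarray(behavior), axis=1)
    return float(np.sort(dists)[:k].mean())

# Hypothetical 2-D behavior descriptors (e.g. the final (x, y) position
# reached by an agent in a maze).
archive = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
novelty([0.1, 0.0], archive, k=2)    # close to known behaviors: low novelty
novelty([10.0, 10.0], archive, k=2)  # unexplored region: high novelty
```

Agents are rewarded for reaching behaviors far from the archive, which pushes the population to spread across the behavior space regardless of the (possibly deceptive) task reward.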

