SAMPLE EFFICIENT QUALITY DIVERSITY FOR NEURAL CONTINUOUS CONTROL

Abstract

We propose a novel Deep Neuroevolution algorithm, QD-RL, that combines the strengths of off-policy reinforcement learning (RL) algorithms and Quality Diversity (QD) approaches to solve continuous control problems with neural controllers. The QD part contributes structural biases by decoupling the search for diversity from the search for high return, resulting in efficient management of the exploration-exploitation trade-off. The RL part contributes sample efficiency by relying on off-policy gradient-based updates of the agents. More precisely, we train a population of off-policy deep RL agents to simultaneously maximize diversity within the population and the return of each individual agent. QD-RL selects agents interchangeably from a Pareto front or from a Map-Elites grid, resulting in stable and efficient population updates. Our experiments in the ANT-MAZE and ANT-TRAP environments show that QD-RL can solve challenging exploration and control problems with deceptive rewards while being two orders of magnitude more sample efficient than its evolutionary counterparts.

1. INTRODUCTION

Natural evolution has the fascinating ability to produce organisms that are all high-performing in their respective niche. Inspired by this ability to produce a tremendous diversity of living systems within one run, Quality-Diversity (QD) is a new family of optimization algorithms that aim at searching for a collection of both diverse and high-performing solutions (Pugh et al., 2016). While classic optimization methods focus on finding a single efficient solution, the role of QD optimization is to cover the range of possible solution types and to return the best solution for each type. This process is sometimes referred to as "illumination" in opposition to optimization, as the goal of these algorithms is to reveal (or illuminate) a search space of interest (Mouret & Clune, 2015). QD approaches generally build on black-box optimization methods such as evolutionary algorithms to optimize a population of solutions (Cully & Demiris, 2017). These algorithms often rely on random mutations to explore small search spaces but struggle when confronted with higher-dimensional problems. As a result, QD approaches often scale poorly to large and continuous sequential decision problems, where using controllers with many parameters such as deep neural networks is mandatory (Colas et al., 2020). Besides, while evolutionary methods are most valuable when the policy gradient cannot be applied safely (Cully et al., 2015), in policy search problems that can be formalized as a Markov Decision Process (MDP), Policy Gradient (PG) methods can exploit the analytical structure of neural networks to optimize their parameters more efficiently. Therefore, it makes sense to exploit these properties when the Markov assumption holds and the controller is a neural network.
From the deep reinforcement learning (RL) perspective, the focus on sparse or deceptive rewards led to the realization that maximizing diversity independently from rewards might be a good exploration strategy (Lehman & Stanley, 2011a; Colas et al., 2018; Eysenbach et al., 2018). More recently, it was established that if one can define a small behavior space or outcome space corresponding to what matters to determine success, maximizing diversity in this space might be the optimal strategy to find a sparse reward (Doncieux et al., 2019). In this work, we are the first to combine QD methods with PG methods. On the one hand, our aim is to strongly improve the sample efficiency of QD methods to obtain neural controllers solving continuous action space MDPs. On the other hand, it is to strongly improve the exploration capabilities of deep RL methods in the context of sparse rewards or deceptive gradients, such as avoiding traps and dead-ends in navigation tasks. We build on off-policy PG methods to propose a new mutation operator that takes into account the Markovian nature of the problem and analytically exploits the known structure of the neural controller. Our QD-RL algorithm falls within the QD framework described by Cully & Demiris (2017) and takes advantage of its powerful exploration capabilities, but also demonstrates the remarkable sample efficiency brought by off-policy RL methods. We compare QD-RL to several recent algorithms that also combine a diversity objective and a return maximization method, namely the NS-ES family (Conti et al., 2018) and the ME-ES algorithm (Colas et al., 2020), and show that QD-RL is two orders of magnitude more sample efficient.

2. PROBLEM STATEMENT

We consider the general context of a fully observable Markov Decision Process (MDP) (S, A, T, R, γ, ρ_0) where S is the state space, A is the action space, T: S × A → S is the transition function, R: S × A → ℝ is the reward function, γ is a discount factor and ρ_0 is the initial state distribution. We aim to find the parameters θ of a parameterized policy π_θ: S → A so as to maximize the objective function J(θ) = E_τ [Σ_t γ^t r_t], where τ is a trajectory obtained from π_θ starting from a state s_0 ∼ ρ_0 and r_t is the reward obtained along this trajectory at time t. We define the Q-value for policy π, Q^π: S × A → ℝ, as Q^π(s, a) = E_τ [Σ_t γ^t r_t], where τ is a trajectory obtained by starting from s, performing initial action a and then following π. QD aims at evolving a set of solutions θ that are both diverse and high performing. To measure diversity, we first define a Behavior Descriptor (BD) space, which characterizes solutions in functional terms, in addition to their score J(θ). We note bd_θ the BD of a solution θ. The solution BD space is often designed using relevant features of the task. For instance, in robot navigation, a relevant BD is the final position of the robot. In robot locomotion, it may rather be the position and/or velocity of the robot's center of gravity at specific times. From BDs, we define the diversity (or novelty) of a solution as a measure of the difference between its BD and those of the solutions obtained so far. Additionally, we define a state Behavior Descriptor, or state BD, noted bd_t. It is a set of relevant features extracted from a state. From state BDs, we define the BD of a solution θ as a function of all state BDs encountered by policy π_θ when interacting with the environment, as illustrated in Figure 1a. More formally, we note bd_θ = E_τ [f_bd({bd_1, . . . , bd_T})], where T is the trajectory length and f_bd is an aggregation function.
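The objective J(θ) defined above can be estimated by Monte Carlo rollouts: sample trajectories from π_θ and average their discounted returns. The following is a minimal sketch, not the paper's implementation; the `rollout_policy` callable is a hypothetical stand-in for running π_θ in the environment and collecting the reward sequence of one episode.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards along one trajectory: sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

def estimate_J(rollout_policy, n_episodes=10, gamma=0.99):
    """Monte Carlo estimate of J(theta) = E_tau[sum_t gamma^t * r_t].

    `rollout_policy` is a hypothetical closure that runs one episode with
    pi_theta and returns the list of rewards collected along the trajectory.
    """
    returns = [discounted_return(rollout_policy(), gamma) for _ in range(n_episodes)]
    return float(np.mean(returns))
```

The same estimator applied to trajectories that start from a fixed (s, a) pair would yield a Monte Carlo estimate of Q^π(s, a); in practice, off-policy RL methods learn Q^π with a critic network instead.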
For instance, f_bd can average over state BDs or return only the last state BD of the trajectory. If we consider again robot navigation, a state BD bd_t may represent the position of the robot at time t and the solution BD bd_θ may be the final position of the robot. With state BDs, we measure the novelty of a state relative to all other seen states. The way we compute diversity at the solution and state levels is explained in Section 4.
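These two ingredients can be sketched concretely. The snippet below is an illustrative sketch only, not the paper's implementation: `solution_bd` plays the role of f_bd (aggregating per-state BDs into a solution BD), and `novelty` scores a BD by its mean Euclidean distance to its k nearest neighbors in an archive of previously seen BDs, following the standard Novelty Search definition; the function names and the choice of k are assumptions.

```python
import numpy as np

def solution_bd(state_bds, mode="last"):
    """Aggregate per-state BDs into a solution BD (the role of f_bd).

    mode="last" returns the final state BD (e.g. final robot position);
    mode="mean" averages BDs over the whole trajectory.
    """
    state_bds = np.asarray(state_bds, dtype=float)
    return state_bds[-1] if mode == "last" else state_bds.mean(axis=0)

def novelty(bd, archive_bds, k=3):
    """Novelty of a BD: mean Euclidean distance to its k nearest archive BDs."""
    dists = np.linalg.norm(np.asarray(archive_bds, dtype=float) - np.asarray(bd), axis=1)
    return float(np.sort(dists)[:k].mean())
```

The same `novelty` function applies at both levels: fed with solution BDs bd_θ it scores the diversity of an agent within the population, and fed with state BDs bd_t it scores the novelty of a single state against previously visited states.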

3. RELATED WORK

A distinguishing feature of our approach is that we combine diversity seeking at the level of trajectories using solution BDs bd_θ and diversity seeking in the state space using state BDs bd_t. The former is used to select agents from the archive in the QD part of the architecture, whereas the latter is used during policy gradient steps in the RL part, see Figure 1b. We organize the literature review below according to this split between two types of diversity seeking mechanisms. Besides, some families of methods are related to our work to a lesser extent. This is the case of algorithms combining evolutionary approaches and deep RL such as CEM-RL (Pourchot & Sigaud, 2018), ERL (Khadka & Tumer, 2018) and CERL (Khadka et al., 2019), algorithms maintaining a population of RL agents for exploration without an explicit diversity criterion (Jaderberg et al., 2017), and algorithms explicitly looking for diversity, but in the action space rather than in the state space, like ARAC (Doan et al., 2019), P3S-TD3 (Jung et al., 2020) and DvD (Parker-Holder et al., 2020). We include CEM-RL as one of our baselines.

Seeking diversity and performance in the space of solutions. Simultaneously maximizing diversity and performance is the central goal of QD methods (Pugh et al., 2016; Cully & Demiris, 2017). Among the various possible combinations offered by the QD framework, Novelty Search with Local Competition (NSLC) (Lehman & Stanley, 2011b) and MAP-Elites (ME) (Mouret &

