NEUROEVOLUTION IS A COMPETITIVE ALTERNATIVE TO REINFORCEMENT LEARNING FOR SKILL DISCOVERY

Abstract

Deep Reinforcement Learning (RL) has emerged as a powerful paradigm for training neural policies to solve complex control tasks. However, these policies tend to be overfit to the exact specifications of the task and environment they were trained on, and thus do not perform well when conditions deviate slightly or when the policies are composed hierarchically to solve even more complex tasks. Recent work has shown that training a mixture of policies, as opposed to a single one, that are driven to explore different regions of the state-action space can address this shortcoming by generating a diverse set of behaviors, referred to as skills, that can be collectively used to great effect in adaptation tasks or for hierarchical planning. This is typically realized by including a diversity term (often derived from information theory) in the objective function optimized by RL. However, these approaches often require careful hyperparameter tuning to be effective. In this work, we demonstrate that less widely-used neuroevolution methods, specifically Quality Diversity (QD), are a competitive alternative to information-theory-augmented RL for skill discovery. We conduct an extensive empirical evaluation comparing eight state-of-the-art algorithms (four flagship algorithms from each line of work) on the basis of (i) metrics directly evaluating the skills' diversity, (ii) the skills' performance on adaptation tasks, and (iii) the skills' performance when used as primitives for hierarchical planning. We find that QD methods provide equal, and sometimes improved, performance whilst being less sensitive to hyperparameters and more scalable. As no single method is found to provide near-optimal performance across all environments, there is a rich scope for further research, which we support by proposing future directions and providing optimized open-source implementations.

1. INTRODUCTION

In the past decade, Reinforcement Learning (RL) has shown great promise at tackling sequential decision-making problems in a generic fashion, leading to breakthroughs in many fields such as games (Silver et al., 2017), robotics (Andrychowicz et al., 2020), and control in industrial settings (Degrave et al., 2022). However, neural policies trained with RL algorithms tend to be over-tuned to the exact specifications of the tasks and environments they were trained on. Even minor disturbances to the environment, to the starting state, or to the task definition can incur a significant loss of performance (Kumar et al., 2020; Pinto et al., 2017). A standard approach to improve generalization is to introduce more variations during training (Tobin et al., 2017), but this assumes we can foresee all possibilities, which is not always true in the real world. Even in settings where this is feasible, introducing a wide spectrum of variations makes the problem harder to solve, and the resulting policy may not perform as well in the nominal case. Another approach consists of co-training an adversarial agent whose task is to perturb the environment so as to minimize the policy's performance (Pinto et al., 2017). However, adversarial methods are notoriously unstable in Deep Learning (Arjovsky & Bottou, 2017) and can also compromise performance in the nominal scenario. To improve robustness without explicitly identifying all possible variations, jointly training multiple policies to solve the same task in diverse ways has emerged as a promising line of work in the RL literature (Kumar et al., 2020). To motivate the approach, consider the problem of learning a policy to control the joints of a legged robot with the goal of running as fast as possible. Any damage to the robot's legs might severely degrade an optimal policy's ability to make the robot run fast, or prevent it from running at all. Yet, many of the slightly sub-optimal policies for the original problem (e.g.
a policy making the robot hop using only one leg) would perform equally well in this perturbed setting.

Two seemingly opposed lines of work have been pursued to maximize both performance and diversity in a collection of policies. RL-rooted approaches (Eysenbach et al., 2019; Sharma et al., 2019; Kumar et al., 2020) introduce a randomly-generated latent variable and parametrize the policy to be a function of the state as well as this latent variable. At training time, the latent variable is drawn from a static distribution and fed as an input alongside the state to the policy, effectively defining a mixture of policies. To encourage diversity among these policies, a term derived from information theory that depends both on the policy parameters and the latent variable is added to the objective function (hereinafter referred to as the fitness function). This term is typically formulated as the mutual information between the latent variable and a subset of the policy's trajectory, possibly conditioned on observations from the past. Neuroevolution-rooted approaches instead stem from the subfield of Quality Diversity (QD) optimization (Pugh et al., 2016; Cully & Demiris, 2017; Chatzilygeroudis et al., 2021) and combine the tools developed in this space with RL algorithms to get the best of both worlds (Nilsson & Cully, 2021; Pierrot et al., 2022). QD optimization aims at generating and maintaining large and diverse collections of solutions, as opposed to the single optimal solution sought in classical optimization, by imitating the natural evolution of individuals competing for resources in their respective niches. In comparison to traditional Evolutionary Strategies, QD algorithms explicitly use a mapping from each solution to a vector space, referred to as the behavior descriptor space, to characterize solutions, and maintain a data structure, a repertoire, filled with high-performing solutions that cover this space as much as possible.
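To make the information-theoretic term concrete: in DIAYN-style methods (Eysenbach et al., 2019), assuming a uniform prior over skills, the mutual-information objective reduces to a per-step pseudo-reward of the form log q(z|s) − log p(z), where q is a learned discriminator that predicts the latent skill z from the visited state. The following is a minimal sketch of this pseudo-reward; the function and variable names are illustrative and not taken from any particular implementation:

```python
import numpy as np

def diversity_reward(discriminator_logits, skill_id, num_skills):
    """DIAYN-style pseudo-reward: log q(z|s) - log p(z).

    discriminator_logits: unnormalized scores the discriminator assigns
        to each skill given the current state (shape: [num_skills]).
    skill_id: index of the latent skill the policy was conditioned on.
    Assumes a uniform prior p(z) = 1 / num_skills.
    """
    # log-softmax over the discriminator outputs gives log q(z|s)
    log_q = discriminator_logits - np.log(np.sum(np.exp(discriminator_logits)))
    log_p = -np.log(num_skills)
    return log_q[skill_id] - log_p
```

Intuitively, the reward is positive when the discriminator can confidently recover the skill from the state (the skills visit distinguishable states) and zero when it cannot do better than the uniform prior.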
Evolutionary Strategies (possibly hybridized with RL algorithms) have proven to be a competitive alternative to RL algorithms for many common sequential decision-making problems (Pierrot et al., 2022; Salimans et al., 2017). Hence, it is natural to believe that QD algorithms could also be competitive with information-theory-augmented RL approaches at generating diverse populations of high-performing policies in similar settings. Yet, QD approaches remain neglected in skill-discovery studies (Kumar et al., 2020), perhaps because they lack the sample-efficiency of state-of-the-art RL algorithms, sometimes requiring two orders of magnitude more interactions with the environment to solve a task (Pierrot et al., 2022). While this is a significant shortcoming for real-world applications that cannot be accurately described by a computational model, simulators are readily available for many applications. Additionally, when the simulator and the algorithm are implemented using modern vectorized frameworks such as JAX (Bradbury et al., 2018) and BRAX (Freeman et al., 2021), evolutionary approaches are competitive with RL approaches in terms of total training time on an accelerator in spite of their low sample-efficiency (Lim et al., 2022). Our contributions are the following. (1) We provide extensive experimental evidence that QD methods are competitive with RL methods for skill discovery in terms of performance given fixed compute and training time budgets, as well as hyperparameter sensitivity. Specifically, using environments taken from the QD and RL literature, we compare eight state-of-the-art skill-discovery methods from the RL and QD worlds on the basis of (i) metrics directly evaluating the skills' diversity, (ii) the skills' performance on adaptation tasks, and (iii) the skills' performance when used as primitives for hierarchical planning. (2) We open source efficient implementations of all environments and algorithms¹ based on the QDax library².
Armed with these, running any of the experiments, some of which require hundreds of millions of environment steps, takes only 2 hours on a single affordable accelerator. (3) We provide a detailed analysis of the strengths and weaknesses of all methods, show that no single method outperforms all others on all environments, and identify future research directions.

2. PRELIMINARIES AND PROBLEM STATEMENT

We consider sequential decision-making problems formulated as Markov Decision Processes (MDPs), each defined by a tuple (S, A, R, P, γ), where S is the state space, A is the action space, R : S × A → ℝ is the reward function, P is the transition function, with P(s′ | s, a) the probability of transitioning to state s′ after taking action a in state s, and γ ∈ [0, 1] is the discount factor.
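The standard objective in this setting is to maximize the expected discounted return E[Σ_t γ^t R(s_t, a_t)]. For a finite trajectory of rewards, this quantity can be computed with a simple backward recursion; the sketch below is generic and not tied to any particular library:

```python
def discounted_return(rewards, gamma):
    """Compute G = sum_t gamma^t * r_t for a finite trajectory.

    Uses the backward recursion G_t = r_t + gamma * G_{t+1},
    starting from the end of the trajectory.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g. rewards [1, 1, 1] with gamma = 0.5 gives 1 + 0.5 + 0.25 = 1.75
```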



¹ https://github.com/instadeepai/qd-skill-discovery-benchmark
² https://github.com/adaptive-intelligent-robotics/QDax

