NEUROEVOLUTION IS A COMPETITIVE ALTERNATIVE TO REINFORCEMENT LEARNING FOR SKILL DISCOVERY

Abstract

Deep Reinforcement Learning (RL) has emerged as a powerful paradigm for training neural policies to solve complex control tasks. However, these policies tend to overfit to the exact specifications of the task and environment they were trained on, and thus perform poorly when conditions deviate slightly or when the policies are composed hierarchically to solve even more complex tasks. Recent work has shown that training a mixture of policies, rather than a single one, each driven to explore a different region of the state-action space, can address this shortcoming by generating a diverse set of behaviors, referred to as skills, that can be used collectively to great effect in adaptation tasks or for hierarchical planning. This is typically realized by including a diversity term, often derived from information theory, in the objective function optimized by RL. However, these approaches often require careful hyperparameter tuning to be effective. In this work, we demonstrate that less widely-used neuroevolution methods, specifically Quality Diversity (QD), are a competitive alternative to information-theory-augmented RL for skill discovery. We conduct an extensive empirical evaluation comparing eight state-of-the-art algorithms (four flagship algorithms from each line of work) on the basis of (i) metrics directly evaluating the skills' diversity, (ii) the skills' performance on adaptation tasks, and (iii) the skills' performance when used as primitives for hierarchical planning. QD methods are found to provide equal, and sometimes improved, performance while being less sensitive to hyperparameters and more scalable. As no single method provides near-optimal performance across all environments, there is rich scope for further research, which we support by proposing future directions and providing optimized open-source implementations.

1. INTRODUCTION

In the past decade, Reinforcement Learning (RL) has shown great promise at tackling sequential decision-making problems in a generic fashion, leading to breakthroughs in many fields such as games (Silver et al., 2017), robotics (Andrychowicz et al., 2020), and control in industrial settings (Degrave et al., 2022). However, neural policies trained with RL algorithms tend to be over-tuned to the exact specifications of the tasks and environments they were trained on. Even minor disturbances to the environment, to the starting state, or to the task definition can incur a significant loss of performance (Kumar et al., 2020; Pinto et al., 2017). A standard approach to improving generalization is to introduce more variations during training (Tobin et al., 2017), but this assumes we can foresee all possibilities, which is not always true in the real world. Even in settings where this is feasible, introducing a wide spectrum of variations makes the problem harder to solve, and the resulting policy may not perform as well in the nominal case. Another approach consists of co-training an adversarial agent whose task is to perturb the environment so as to minimize the policy's performance (Pinto et al., 2017). However, adversarial methods are notoriously unstable in Deep Learning (Arjovsky & Bottou, 2017) and can also compromise performance in the nominal scenario.

