EFFICIENT EXPLORATION USING MODEL-BASED QUALITY-DIVERSITY WITH GRADIENTS

Anonymous

Abstract

Exploration is a key challenge in Reinforcement Learning, especially in long-horizon, deceptive and sparse-reward environments. For such applications, population-based approaches have proven effective. Methods such as Quality-Diversity (QD) address this by encouraging novel solutions and producing a diversity of behaviours. However, these methods are driven either by undirected sampling (i.e. mutations) or by approximated gradients (i.e. Evolution Strategies) in the parameter space, which makes them highly sample-inefficient. In this paper, we propose a model-based Quality-Diversity approach. It extends existing QD methods to use gradients for efficient exploitation and to leverage perturbations in imagination for efficient exploration. Our approach optimizes all members of a population simultaneously, efficiently maintaining both performance and diversity, by exploiting the effectiveness of QD algorithms as data generators for training deep models. We demonstrate that it retains the divergent search capabilities of population-based approaches on tasks with deceptive rewards while significantly improving their sample efficiency and the quality of their solutions.

1. INTRODUCTION

Reinforcement Learning (RL) has demonstrated tremendous abilities to learn challenging tasks across a range of applications (Mnih et al., 2015; Silver et al., 2016; Akkaya et al., 2019). However, RL agents generally struggle with exploration, as an agent can only gather data by interacting with the environment. On the other hand, population-based learning methods have proven to be very effective (Jaderberg et al., 2017; Vinyals et al., 2019; Ecoffet et al., 2021; Wang et al., 2020). In contrast to single-agent learning, training a population of agents allows diverse behaviors and data to be collected. This results in exploration that can better handle sparse and deceptive rewards (Ecoffet et al., 2021) as well as alleviate catastrophic forgetting (Conti et al., 2018). An effective way to use a population of agents for exploration is novelty search (Lehman & Stanley, 2011a; Conti et al., 2018), where the novelty of the behaviors of new agents is measured with respect to the population. This novelty measure is then used in place of the conventional task reward, similar to curiosity and intrinsic motivation approaches (Oudeyer et al., 2007; Bellemare et al., 2016; Pathak et al., 2017). Quality-Diversity (QD) (Pugh et al., 2016; Cully et al., 2015; Chatzilygeroudis et al., 2021) extends this by also optimizing all members of the population on the task reward while maintaining diversity through novelty.
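For concreteness, the sketch below shows how such a novelty measure is commonly computed: the mean distance of an agent's behavior descriptor to its k nearest neighbors among the behaviors seen so far. This is a minimal illustration, not the implementation of any particular method; the function name, descriptor format, and choice of k are assumptions made here for exposition.

```python
import numpy as np

def novelty_score(bd, archive_bds, k=10):
    """Novelty of behavior descriptor `bd`: mean Euclidean distance to its
    k nearest neighbors among previously observed descriptors.

    `bd`: 1-D array describing one policy's behavior (e.g. final x-y position).
    `archive_bds`: 2-D array, one row per descriptor already collected.
    """
    dists = np.linalg.norm(archive_bds - bd, axis=1)  # distance to every stored descriptor
    nearest = np.sort(dists)[: min(k, len(dists))]    # k closest behaviors
    return float(nearest.mean())

# Example: a behavior far from everything in the archive scores as highly novel.
archive = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(novelty_score(np.array([2.0, 2.0]), archive, k=2))
```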
Beyond exploration, the creativity involved in finding diverse ways to solve a task (i.e. the QD problem) is an interesting aspect of general intelligence that is also associated with adaptability. For instance, discovering diverse walking gaits can enable rapid adaptation to damage (Cully et al., 2015). However, a drawback of conventional population-based approaches is the large number of samples and evaluations required, usually on the order of millions. Methods that utilize Evolution Strategies (ES), and more recently MAP-Elites (Mouret & Clune, 2015) (a common QD algorithm), partly sidestep this issue as they parallelize and scale better with compute (Salimans et al., 2017; Conti et al., 2018; Lim et al., 2022) than their Deep RL counterparts, resulting in faster wall-clock times. Despite this, they still come at the cost of many samples. One of the main reasons for this lies in the underlying optimization operators: QD methods generally rely on undirected search, such as objective-agnostic random perturbations (Mouret & Clune, 2015; Vassiliades & Mouret, 2018), to favor creativity and exploration, as in the loop sketched below.
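The following minimal sketch illustrates this undirected variation with a MAP-Elites-style loop: mutate a random elite with objective-agnostic Gaussian noise and keep the offspring only if it fills an empty behavior-space cell or improves on that cell's incumbent. The `evaluate` function, the cell discretization, and all hyperparameters are placeholders assumed for illustration rather than details from the original algorithm.

```python
import numpy as np

def map_elites(evaluate, init_params, iters=10_000, sigma=0.1, seed=0):
    """Minimal MAP-Elites loop over a behavior-space archive.

    `evaluate(params) -> (fitness, cell)` runs one episode and returns the task
    reward plus the (discretized) behavior-space cell the policy lands in; both
    are problem-specific and assumed given here.
    """
    rng = np.random.default_rng(seed)
    archive = {}  # cell -> (fitness, params): one elite per cell
    fitness, cell = evaluate(init_params)
    archive[cell] = (fitness, init_params)
    for _ in range(iters):
        # Objective-agnostic variation: Gaussian perturbation of a random elite.
        parent_cell = list(archive)[rng.integers(len(archive))]
        child = archive[parent_cell][1] + sigma * rng.standard_normal(init_params.shape)
        fitness, cell = evaluate(child)
        # Keep the child only if its cell is empty or it beats the incumbent.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)
    return archive
```

Note that every accepted or rejected child costs a full episode of environment interaction, which is where the sample inefficiency comes from.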
More directed search such as ES has also been used (Colas et al., 2020), but it relies on a large number of such perturbations (on the order of thousands) to approximate a single step of the natural gradient that directs the improvement of solutions.
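As a rough illustration of why this is expensive, the sketch below implements one such ES update in the style of Salimans et al. (2017): the gradient of the Gaussian-smoothed return is estimated from n random parameter perturbations, so a single step already costs roughly n episode rollouts. The function name, the rank normalization, and all hyperparameters are illustrative assumptions, not details taken from the cited works.

```python
import numpy as np

def es_step(f, theta, n=1000, sigma=0.02, alpha=0.01, seed=0):
    """One ES update: estimate the smoothed-return gradient from n perturbations.

    `f(params) -> float` is the episodic return of a policy with parameters
    `params`; `theta` is a flat 1-D parameter vector.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n, theta.size))            # n perturbation directions
    returns = np.array([f(theta + sigma * e) for e in eps])  # n rollouts per step
    ranks = returns.argsort().argsort() / (n - 1) - 0.5   # rank-normalized returns
    grad = (ranks[:, None] * eps).sum(axis=0) / (n * sigma)
    return theta + alpha * grad
```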

