EFFICIENT EXPLORATION USING MODEL-BASED QUALITY-DIVERSITY WITH GRADIENTS

Anonymous

Abstract

Exploration is a key challenge in Reinforcement Learning, especially in long-horizon, deceptive and sparse-reward environments. For such applications, population-based approaches have proven effective. Methods such as Quality-Diversity (QD) address this by encouraging novel solutions and producing a diversity of behaviours. However, these methods are driven by either undirected sampling (i.e. mutations) or approximated gradients (i.e. Evolution Strategies) in the parameter space, which makes them highly sample-inefficient. In this paper, we propose a model-based Quality-Diversity approach. It extends existing QD methods to use gradients for efficient exploitation and to leverage perturbations in imagination for efficient exploration. Our approach optimizes all members of a population simultaneously to maintain both performance and diversity efficiently, leveraging the effectiveness of QD algorithms as good data generators to train deep models. We demonstrate that it maintains the divergent search capabilities of population-based approaches on tasks with deceptive rewards while significantly improving their sample efficiency and the quality of solutions.

1. INTRODUCTION

Reinforcement Learning (RL) has demonstrated tremendous abilities to learn challenging tasks across a range of applications (Mnih et al., 2015; Silver et al., 2016; Akkaya et al., 2019). However, RL agents generally struggle with exploration, as an agent can only gather data by interacting with the environment. On the other hand, population-based learning methods have proven very effective (Jaderberg et al., 2017; Vinyals et al., 2019; Ecoffet et al., 2021; Wang et al., 2020). In contrast to single-agent learning, training a population of agents allows diverse behaviors and data to be collected. This results in exploration that can better handle sparse and deceptive rewards (Ecoffet et al., 2021) as well as alleviate catastrophic forgetting (Conti et al., 2018). An effective way to use a population of agents for exploration is novelty search (Lehman & Stanley, 2011a; Conti et al., 2018), where the novelty of the behaviors of new agents is measured with respect to the population. This novelty measure is then used in place of the conventional task reward, similar to curiosity and intrinsic motivation approaches (Oudeyer et al., 2007; Bellemare et al., 2016; Pathak et al., 2017). Quality-Diversity (QD) (Pugh et al., 2016; Cully et al., 2015; Chatzilygeroudis et al., 2021) extends this by also optimizing all members of the population on the task reward while maintaining diversity through novelty. Beyond exploration, the creativity involved in finding various ways to solve a problem/task (i.e. the QD problem) is an interesting aspect of general intelligence that is also associated with adaptability. For instance, discovering diverse walking gaits can enable rapid adaptation to damage (Cully et al., 2015). However, a drawback of conventional population-based approaches is the large number of samples and evaluations required, usually on the order of millions.
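To make the novelty measure concrete: novelty search typically scores a new agent by the average distance from its behaviour descriptor to its k nearest neighbours in the population or archive. A minimal sketch (the function name and the use of Euclidean distance in descriptor space are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def novelty_score(bd, archive_bds, k=3):
    """k-nearest-neighbour novelty: average Euclidean distance from a
    behaviour descriptor `bd` to its k closest descriptors in the archive.
    A higher score means the behaviour is further from anything seen so far."""
    dists = np.linalg.norm(np.asarray(archive_bds) - np.asarray(bd), axis=1)
    return float(np.sort(dists)[:k].mean())
```

This score replaces (or complements) the task reward when selecting which agents to keep, driving the population toward unexplored regions of behaviour space.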
Some methods that utilize Evolution Strategies (ES) and, more recently, MAP-Elites (Mouret & Clune, 2015) (a common QD algorithm) sidestep this issue, as they parallelize and scale better with compute (Salimans et al., 2017; Conti et al., 2018; Lim et al., 2022) than their Deep RL counterparts, resulting in faster wall-clock times. Despite this, they still come at the cost of many samples. One of the main reasons lies in the underlying optimization operators. QD methods generally rely on undirected search operators such as objective-agnostic random perturbations (Mouret & Clune, 2015; Vassiliades & Mouret, 2018) to favor creativity and exploration. More directed search such as ES has also been used (Colas et al., 2020) but relies on a large number of such perturbations (~thousands) to approximate a single step of natural gradient to direct the improvement of solutions.

Figure 1: The GDA-QD algorithm can be summarized as follows: (1) the current population Θ is copied into an imagined population Θ̃; (2) Θ̃ undergoes multiple steps of QD optimization fully in imagination using the dynamics model; (3) the critic applies policy-gradient updates to policies sampled from the imagined population; these are concatenated with (4) policies sampled from the resulting imagined population Θ̃; (5) this concatenated batch of policies is evaluated in the environment and used to update the real population of policies for the next optimization loop; the transitions collected in the environment are then used to train the dynamics model and the critic.

In this paper, we introduce an extended version of Dynamics-Aware QD (DA-QD-ext) as well as Gradient and Dynamics-Aware QD (GDA-QD), a new model-based QD method for sample-efficient exploration in RL. GDA-QD optimizes an entire population of diverse policies through a QD process in imagination using a learned dynamics model.
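The five steps summarized in Figure 1 can be sketched as one iteration of the outer loop. This is a toy skeleton under loud assumptions: a "policy" is reduced to a scalar parameter, and `qd_step_in_imagination` / `policy_gradient_update` are placeholder stand-ins for the real QD and critic-based operators, not the paper's implementation:

```python
import copy
import random

# Hypothetical stand-ins for the real operators.
def qd_step_in_imagination(pop, dynamics_model):
    return [p + random.gauss(0, 0.1) for p in pop]      # undirected variation

def policy_gradient_update(p, critic):
    return p + 0.01 * critic(p)                         # directed variation

def gda_qd_iteration(population, dynamics_model, critic, evaluate,
                     qd_steps=10, n_pg=4, n_qd=4):
    imagined = copy.deepcopy(population)                # (1) copy population
    for _ in range(qd_steps):                           # (2) QD in imagination
        imagined = qd_step_in_imagination(imagined, dynamics_model)
    pg = [policy_gradient_update(p, critic)             # (3) critic-guided updates
          for p in random.sample(imagined, n_pg)]
    qd = random.sample(imagined, n_qd)                  # (4) sampled QD policies
    batch = pg + qd
    transitions = [evaluate(p) for p in batch]          # (5) real-env evaluation
    return batch, transitions                           # transitions then train
                                                        # the model and critic
```

The design point is that only step (5) consumes real environment samples; steps (2)-(4) run entirely against the learned dynamics model.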
Additionally, GDA-QD augments the conventional QD optimization operators with policy-gradient updates using a critic network to obtain a more performant population. Beyond their effective exploration capabilities, QD methods are also excellent data generators. We leverage this idea to harvest a diversity of transitions to train the dynamics model and the critic. Thus, GDA-QD combines the powerful function-approximation capabilities of deep neural networks with the directed-search abilities of gradient-based learning and the creativity of population-based approaches. We demonstrate that it successfully outperforms both Deep RL and QD baselines in a hard-exploration task. GDA-QD exceeds the performance of baseline QD algorithms by ~1.5 times, and can reach the same results with 5 times fewer samples.
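The "QD as data generator" idea amounts to standard supervised training of the dynamics model on the transitions the diverse population collects. A minimal sketch, assuming a linear model s' ≈ [s; a] W trained by gradient descent on mean-squared next-state error (the real model is a deep network; the function name is illustrative):

```python
import numpy as np

def dynamics_model_step(W, transitions, lr=0.05):
    """One gradient step on a linear dynamics model s' ~= [s; a] @ W,
    fit to (s, a, s') transitions harvested from the population's rollouts."""
    X = np.array([np.concatenate([s, a]) for s, a, _ in transitions])
    Y = np.array([s_next for _, _, s_next in transitions])
    grad = X.T @ (X @ W - Y) / len(transitions)   # gradient of 0.5 * MSE
    return W - lr * grad
```

Because the population covers diverse behaviours, the training set covers a wide slice of the state-action space, which is what makes long rollouts in imagination usable.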

2. PRELIMINARIES

2.1 REINFORCEMENT LEARNING

Reinforcement Learning (RL) is commonly formalised as a Markov Decision Process (MDP) (Sutton & Barto, 2018) represented by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ and $\mathcal{A}$ are the sets of states and actions. $P(s_{t+1} \mid s_t, a_t)$ is the probability of transitioning from state $s_t$ to $s_{t+1}$ given an action $a_t$, where $s_t, s_{t+1} \in \mathcal{S}$ and $a_t \in \mathcal{A}$. The reward function defines the reward obtained at each timestep, $r_t = r(s_t, a_t, s_{t+1})$, when transitioning from state $s_t$ to $s_{t+1}$ under action $a_t$. An agent acting in the environment selects its next action based on the current state $s_t$ by following a policy $\pi_\theta(a_t \mid s_t)$. The conventional objective in RL is then to optimize the parameters $\theta$ of policy $\pi_\theta$ such that it maximizes the expected cumulative reward $R(\tau) = \sum_{t=1}^{T} r_t$ over the entire episode trajectory $\tau$:

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]$$
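In practice the expectation in the objective above is estimated by averaging the returns of sampled rollouts. A minimal Monte Carlo sketch (the toy environment interface `env_reset` / `env_step` is an assumption for illustration, not a standard API):

```python
def estimate_return(policy, env_step, env_reset, horizon=10, n_episodes=100):
    """Monte Carlo estimate of J(pi) = E_tau[ sum_t r_t ]:
    roll out the policy for `n_episodes` episodes and average the returns."""
    total = 0.0
    for _ in range(n_episodes):
        s = env_reset()
        ep_return = 0.0
        for _ in range(horizon):
            a = policy(s)
            s, r = env_step(s, a)   # transition and reward
            ep_return += r
        total += ep_return
    return total / n_episodes
```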

