QUALITY-SIMILAR DIVERSITY VIA POPULATION BASED REINFORCEMENT LEARNING

Abstract

Diversity is a growing research topic in Reinforcement Learning (RL). Previous research on diversity has mainly focused on promoting diversity to encourage exploration and thereby improve quality (the cumulative reward), maximizing diversity subject to quality constraints, or jointly maximizing quality and diversity, known as the quality-diversity problem. In this work, we present the quality-similar diversity problem that features diversity among policies of similar qualities. In contrast to task-agnostic diversity, we focus on task-specific diversity defined by a set of user-specified Behavior Descriptors (BDs). A BD is a scalar function of a trajectory (e.g., the fire action rate for an Atari game), which delivers the type of diversity the user prefers. To derive the gradient of the user-specified diversity with respect to a policy, which is not trivially available, we introduce a set of BD estimators and connect it with the classical policy gradient theorem. Based on the diversity gradient, we develop a population-based RL algorithm to adaptively and efficiently optimize the population diversity at multiple quality levels throughout training. Extensive results on MuJoCo and Atari demonstrate that our algorithm significantly outperforms previous methods in terms of generating user-specified diverse policies across different quality levels (see Atari and MuJoCo videos).

1. INTRODUCTION

Existing research on policy diversity in deep Reinforcement Learning (RL) can be broadly divided into three categories, according to the role diversity plays. The first category (Hong et al., 2018; Eysenbach et al., 2018; Conti et al., 2018; Parker-Holder et al., 2020; Kumar et al., 2020; Peng et al., 2020; Tang et al., 2020; Han & Sung, 2021; Chenghao et al., 2021; McKee et al., 2022) focuses on maximizing the final quality (the cumulative reward) of a policy, and policy diversity only serves as a means to better fulfill this goal by improving the efficiency of exploration. The diversity measure is therefore preferred to be task-agnostic, as knowledge of what type of task-specific diversity benefits the quality is not accessible in most cases. The second category (Masood & Doshi-Velez, 2019; Zhang et al., 2019; Sun et al., 2020; Ghasemi et al., 2021; Zahavy et al., 2021; Zhou et al., 2022) is concerned with constrained optimization problems, where either diversity is optimized subject to quality constraints or vice versa. Again, existing methods in this category have mainly focused on task-agnostic diversity, so the obtained diversity is often explained in hindsight, i.e., it is unknown what type of policy diversity to expect until the optimization is finished. The third category optimizes quality and diversity simultaneously, which is usually known as the Quality-Diversity (QD) method (Cully et al., 2015; Mouret & Clune, 2015; Pugh et al., 2016; Colas et al., 2020; Fontaine & Nikolaidis, 2021; Nilsson & Cully, 2021; Pierrot et al., 2022; Wang et al., 2021; Tjanaka et al., 2022). In contrast to task-agnostic diversity, most QD methods focus on task-specific diversity, where users are allowed to specify a set of Behavior Descriptors (BDs) of interest. A BD is a scalar function of a trajectory (i.e., the whole game episode) and thus does not have an analytical form with respect to a single policy or state.
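For concreteness, a BD for an Atari game could be the fire action rate of an episode. A minimal sketch of such a trajectory-level function (the action index and the trajectory encoding below are illustrative assumptions, not the paper's interface):

```python
FIRE = 1  # assumed action index, for illustration only

def fire_rate_bd(trajectory):
    """Behavior Descriptor: fraction of steps on which FIRE was chosen.

    trajectory: a list of (state, action) pairs for one whole episode,
    so the BD is a scalar function of the trajectory, not of a single state.
    """
    actions = [a for _, a in trajectory]
    return sum(1 for a in actions if a == FIRE) / max(len(actions), 1)
```

Because the BD is only defined once the full episode is available, its gradient with respect to the policy parameters cannot be read off this definition, which is exactly the difficulty discussed next.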
Therefore, the gradient of a BD with respect to a policy is not trivially available, and this extends to any diversity measure defined on multiple BDs. As a result, previous QD methods (Cully et al., 2015; Mouret & Clune, 2015; Pugh et al., 2016) rely on black-box optimization techniques, such as evolutionary algorithms, to evolve a population of diverse policies. Some recent QD methods (Colas et al., 2020; Fontaine & Nikolaidis, 2021; Nilsson & Cully, 2021; Pierrot et al., 2022; Tjanaka et al., 2022) try to inject gradient information into the evolutionary optimization process. In this work, we formulate the Quality-Similar Diversity (QSD) problem, where the objective is to produce a set of diverse policies at multiple quality levels. We propose a new QD metric, called the QSD score, that clusters policies of similar qualities so that diversity is evaluated at each quality level. In QSD problems, diverse policies of non-optimal qualities are also preferred, which directly meets practical needs in some real-world AI applications. For example, in the field of game AI (Zhang et al., 2021; Fu et al., 2021), it is often desirable to provide diverse accompanying AIs whose qualities are matched to a beginner, an amateur, and a master, respectively; measuring the diversity between a beginner and a master would be of little interest. The QSD problem also connects with adaptive curricula (Wang et al., 2019; Team et al., 2021; Parker-Holder et al., 2022), where the environment gradually increases curriculum levels from simple to complex. Optimizing the intermediate diversity at non-optimal quality levels leads to faster and better convergence of the agent's capabilities than training directly at a complex curriculum level. Moreover, the ability to generate task-specific diversity is superior and complementary to task-agnostic diversity when the user has a clear preference for the type of diversity in practice.
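The idea of evaluating diversity per quality level can be sketched as follows. This is only a simplified illustration of the clustering idea, not the paper's exact QSD score: the equal-width quality bucketing and the mean pairwise distance in BD space are simplifying assumptions.

```python
import itertools

def qsd_score_sketch(policies, quality, bd, num_levels=3):
    """Bucket policies into quality levels, then sum within-level diversity.

    quality(p) -> scalar quality of policy p.
    bd(p)      -> tuple of Behavior Descriptor values of policy p.
    Diversity per level = mean pairwise Euclidean distance in BD space,
    so a beginner and a master are never compared with each other.
    """
    qs = [quality(p) for p in policies]
    lo, hi = min(qs), max(qs)
    width = (hi - lo) / num_levels or 1.0  # guard against identical qualities
    buckets = {}
    for p, q in zip(policies, qs):
        level = min(int((q - lo) / width), num_levels - 1)
        buckets.setdefault(level, []).append(bd(p))
    score = 0.0
    for descs in buckets.values():
        pairs = list(itertools.combinations(descs, 2))
        if pairs:
            score += sum(
                sum((x - y) ** 2 for x, y in zip(d1, d2)) ** 0.5
                for d1, d2 in pairs
            ) / len(pairs)
    return score
```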
For example, diverse hand gestures are of no interest if the user only needs gait diversity in robot locomotion tasks. Hence, in this work, we optimize an explicit diversity measure defined on several user-specified BDs, as opposed to the non-differentiable cell coverage percentage used in most QD methods. To the best of our knowledge, none of the existing methods has obtained the exact gradient of a user-specified BD (defined on trajectories) with respect to a policy, nor has any derived an unbiased estimate of this gradient from state-action samples. In particular, the diversity gradient is approximated by generating samples in the policy parameter space (Colas et al., 2020; Tjanaka et al., 2022), or simply assumed to be available in Fontaine & Nikolaidis (2021), which might not hold in many real-world situations. A set of 'state' BDs (essentially a type of intrinsic reward) is introduced in Pierrot et al. (2022), in the expectation that a positive correlation between state and trajectory BDs might suffice. To fill this gap, we propose a set of BD estimators that predict the corresponding BD value for the current policy. Equipped with these BD estimators, we build on the policy gradient theorem (Sutton et al., 1999; Silver et al., 2014) to derive the gradient of user-specified BDs with respect to a policy for discrete or continuous actions. Building on population-based training (PBT) (Jaderberg et al., 2017), we develop an RL diversity algorithm, named QSD-PBT, that leverages the diversity gradient and adaptively adjusts the diversity loss to preserve similar qualities across the population. QSD-PBT efficiently optimizes the diversity at multiple quality levels in a single run and outperforms previous methods in terms of the QSD score in both MuJoCo and Atari environments. Meanwhile, QSD-PBT demonstrates a strong ability to achieve user-specified diversity by discovering visually distinct policies across a variety of environments.
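As background for why a dedicated construction is useful, note that the generic score-function identity ∇_θ E_{τ∼π_θ}[b(τ)] = E_{τ∼π_θ}[b(τ) Σ_t ∇_θ log π_θ(a_t|s_t)] already yields an unbiased Monte-Carlo gradient of a trajectory-level BD b, but a high-variance one. A minimal sketch for a stateless softmax policy (the two-action setup and the mean-action BD are illustrative assumptions, not the paper's estimator):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def estimate_bd_gradient(theta, bd, horizon=5, num_samples=2000, seed=0):
    """Score-function estimate of d E[b(tau)] / d theta.

    The policy is a single softmax distribution over actions (stateless,
    for illustration); bd maps the list of sampled actions to a scalar.
    Uses grad log pi(a) = one_hot(a) - probs for a softmax policy.
    """
    rng = random.Random(seed)
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        actions = [rng.choices(range(len(theta)), probs)[0]
                   for _ in range(horizon)]
        b = bd(actions)  # trajectory-level BD, e.g. a fire action rate
        for a in actions:
            for i in range(len(theta)):
                grad[i] += b * ((1.0 if i == a else 0.0) - probs[i]) / num_samples
    return grad
```

With a uniform two-action policy and b(τ) equal to the rate of action 1, the true gradient is (−0.25, +0.25); the estimate fluctuates around it, illustrating the variance that motivates learned BD estimators.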
To summarize, the contributions of this work are as follows:
• We formulate the Quality-Similar Diversity (QSD) problem and propose a new performance metric.
• We derive the gradient of user-specified BDs defined on trajectories with respect to a policy.
• We develop a population-based RL algorithm that efficiently optimizes the diversity of multiple quality levels in a single run.

2. PROBLEM DEFINITION

We focus on episodic Markov Decision Processes (MDPs), defined by a tuple (S, A, T, r, γ). S and A stand for the state space and the action space, respectively. T : S × A → S is the environment transition function, and r : S × A → R is the expected reward function. A policy π(s) maps a state s to a probability distribution over A. A trajectory τ is a state-action sequence [s_0, a_0, s_1, a_1, ..., s_T], obtained by executing a policy from the initial step t = 0 to the terminal step T in the environment. The objective of RL is to find a policy π that maximizes its expected cumulative reward (also known as the quality in this work): J(π) = E_{τ∼π}[R(τ)], where R(τ) = Σ_{t=0}^{T} γ^t r(s_t, a_t) is the return of a trajectory and γ ∈ [0, 1] is the discount factor. The state value function V^π(s) = E[Σ_{t=i}^{T} γ^{t−i} r(s_t, a_t) | s_i = s] measures the quality following π from state
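The return R(τ) above is just the discounted sum of per-step rewards; a small helper makes the definition concrete (accumulating backwards turns the sum into the recursion R = r_t + γ·R):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_{t=0}^{T} gamma^t * r(s_t, a_t) for one trajectory.

    rewards: the sequence [r(s_0, a_0), ..., r(s_T, a_T)] of one episode.
    """
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

Averaging `discounted_return` over trajectories sampled from π gives a Monte-Carlo estimate of the quality J(π).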

