QUALITY-SIMILAR DIVERSITY VIA POPULATION BASED REINFORCEMENT LEARNING

Abstract

Diversity is a growing research topic in Reinforcement Learning (RL). Previous research on diversity has mainly focused on promoting diversity to encourage exploration and thereby improve quality (the cumulative reward), maximizing diversity subject to quality constraints, or jointly maximizing quality and diversity, known as the quality-diversity problem. In this work, we present the quality-similar diversity problem that features diversity among policies of similar qualities. In contrast to task-agnostic diversity, we focus on task-specific diversity defined by a set of user-specified Behavior Descriptors (BDs). A BD is a scalar function of a trajectory (e.g., the fire action rate for an Atari game), which delivers the type of diversity the user prefers. To derive the gradient of the user-specified diversity with respect to a policy, which is not trivially available, we introduce a set of BD estimators and connect it with the classical policy gradient theorem. Based on the diversity gradient, we develop a population-based RL algorithm to adaptively and efficiently optimize the population diversity at multiple quality levels throughout training. Extensive results on MuJoCo and Atari demonstrate that our algorithm significantly outperforms previous methods in terms of generating user-specified diverse policies across different quality levels (see Atari and MuJoCo videos).

1. INTRODUCTION

Existing research on policy diversity in deep Reinforcement Learning (RL) can be generally divided into three categories, according to the role diversity plays. The first category (Hong et al., 2018; Eysenbach et al., 2018; Conti et al., 2018; Parker-Holder et al., 2020; Kumar et al., 2020; Peng et al., 2020; Tang et al., 2020; Han & Sung, 2021; Chenghao et al., 2021; McKee et al., 2022) focuses on maximizing the final quality (the cumulative reward) of a policy, and policy diversity only serves as a means to better fulfill this goal via improving the efficiency of exploration. Therefore, the diversity measure is preferred to be task-agnostic as the knowledge of what type of task-specific diversity benefits the quality may not be accessible in most cases. The second category (Masood & Doshi-Velez, 2019; Zhang et al., 2019; Sun et al., 2020; Ghasemi et al., 2021; Zahavy et al., 2021; Zhou et al., 2022) is concerned with constrained optimization problems, where either diversity is optimized subject to quality constraints or vice-versa. Again, existing methods in this category have mainly focused on task-agnostic diversity, thereby the obtained diversity is often explained in hindsight, i.e., it is unknown what type of policy diversity to expect until the optimization is finished. The third category optimizes quality and diversity simultaneously, which is usually known as the Quality-Diversity (QD) method (Cully et al., 2015; Mouret & Clune, 2015; Pugh et al., 2016; Colas et al., 2020; Fontaine & Nikolaidis, 2021; Nilsson & Cully, 2021; Pierrot et al., 2022; Wang et al., 2021; Tjanaka et al., 2022) . In contrast to task-agnostic diversity, most QD methods focus on task-specific diversity, where users are allowed to specify a set of interested Behavior Descriptors (BDs). A BD is a scalar function of a trajectory (i.e., the whole game episode) and thus does not have an analytical function form with respect to a single policy or state. 
Therefore, the gradient of a BD with respect to a policy is not trivially available, and the same holds for a diversity measure defined on multiple BDs. As a result, previous QD methods (Cully et al., 2015; Mouret & Clune, 2015; Pugh et al., 2016) rely on black-box optimization techniques, such as evolutionary algorithms, to evolve a population of diverse policies. Some recent QD methods (Colas et al., 2020; Fontaine & Nikolaidis, 2021; Nilsson & Cully, 2021; Pierrot et al., 2022; Tjanaka et al., 2022) try to inject gradient information into the evolutionary optimization process.

In this work, we formulate the Quality-Similar Diversity (QSD) problem, where the objective is to produce a set of diverse policies at multiple quality levels. We propose a new QD metric called the QSD score, which clusters policies of similar qualities and evaluates the diversity at each quality level. In QSD problems, diverse policies of non-optimal qualities are also preferred, which directly meets practical needs in some real-world AI applications. For example, in the field of game AI (Zhang et al., 2021; Fu et al., 2021), it is often desirable to provide diverse accompanying AIs whose qualities are matched to a beginner, an amateur, and a master, respectively. In contrast, measuring the diversity between a beginner and a master would be of little interest. The QSD problem also connects with adaptive curricula (Wang et al., 2019; Team et al., 2021; Parker-Holder et al., 2022), where the environment gradually increases curriculum levels from simple to complex. Optimizing the intermediate diversity at non-optimal quality levels helps the agent's capabilities converge faster and to a better result than training directly at a complex curriculum level. Moreover, the ability to generate task-specific diversity is superior and supplementary to task-agnostic diversity when the user has a clear preference for the type of diversity in practice.
For example, diverse hand gestures are of no interest if the user only needs gait diversity in robot locomotion tasks. Hence, in this work, we optimize an explicit diversity measure function defined on several user-specified BDs, as opposed to the non-differentiable cell coverage percentage used in most QD methods.

To the best of our knowledge, no existing method has obtained the exact gradient of a user-specified BD (defined on trajectories) with respect to a policy, nor has any derived an unbiased estimate of this gradient using state-action samples. In particular, the diversity gradient is either approximated by generating samples in the policy parameter space (Colas et al., 2020; Tjanaka et al., 2022) or simply assumed to be available (Fontaine & Nikolaidis, 2021), an assumption that might not hold in many real-world situations. A set of 'state' BDs (essentially a type of intrinsic reward) is introduced in Pierrot et al. (2022), with the expectation that a positive correlation between state and trajectory BDs might suffice. To fill this gap, we propose a set of BD estimators that predict the corresponding BD value for the current policy. Equipped with these BD estimators, we build on the policy gradient theorem (Sutton et al., 1999; Silver et al., 2014) to derive the gradient of user-specified BDs with respect to a policy for discrete or continuous actions.

Building on population-based training (PBT) (Jaderberg et al., 2017), we develop an RL diversity algorithm, named QSD-PBT, that leverages the diversity gradient and adaptively adjusts the diversity loss to preserve similar qualities within the population. QSD-PBT efficiently optimizes the diversity of multiple quality levels in a single run and outperforms previous methods in terms of the QSD score in both MuJoCo and Atari environments. Meanwhile, QSD-PBT demonstrates a strong ability to achieve user-specified diversity by discovering visually distinct policies across a variety of environments.
To summarize, the contributions of this work are as follows:
• We formulate the Quality-Similar Diversity (QSD) problem and propose a new performance metric.
• We derive the gradient of user-specified BDs defined on trajectories with respect to a policy.
• We develop a population-based RL algorithm that efficiently optimizes the diversity of multiple quality levels in a single run.

2. PROBLEM DEFINITION

We focus on episodic Markov Decision Processes (MDPs), each defined by a tuple $(S, A, T, r, \gamma)$. $S$ and $A$ denote the state space and the action space, respectively. $T : S \times A \to S$ is the environment transition function, and $r : S \times A \to \mathbb{R}$ is the expected reward function. A policy $\pi(s)$ maps a state $s$ to a probability distribution over $A$. A trajectory $\tau$ is a state-action sequence $[s_0, a_0, s_1, a_1, \ldots, s_T]$, obtained by executing a policy in the environment from the initial step $t = 0$ to the terminal step $T$. The objective of RL is to find a policy $\pi$ that maximizes its expected cumulative reward (referred to as the quality in this work): $J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$, where $R(\tau) = \sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)$ is the return of a trajectory and $\gamma \in [0, 1]$ is the discount factor. The state value function $V^{\pi}(s) = \mathbb{E}\big[\sum_{t=i}^{T} \gamma^{t-i} r(s_t, a_t) \mid s_i = s\big]$ measures the quality of following $\pi$ from state $s$. In particular, we define $V^{\pi}(s_T) = 0$. The Q-function $Q^{\pi}(s, a) = \mathbb{E}\big[\sum_{t=i}^{T} \gamma^{t-i} r(s_t, a_t) \mid s_i = s, a_i = a\big]$ measures the quality of following $\pi$ after taking action $a$ in state $s$.

Behavior descriptor. In this work, we would like to maximize a task-specific diversity measure at different quality levels. Hence, users are allowed to specify a set of BDs, $b_i(\tau)$ $(1 \le i \le L)$, that reflect the type of policy diversity they are interested in. Each $b_i(\tau)$ is an evaluation function of a trajectory that is finite and easy to implement. For instance, a BD could be the ratio between left and right movements in a trajectory of Atari MsPacMan. The trajectory BD is a general form and can be simplified to a state or action BD when the user is only interested in certain states or actions of the environment, e.g., the terminal position in MsPacMan. Accordingly, the BD value of a policy $\pi_\theta$ (parameterized by $\theta$) is defined as $B_i(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[b_i(\tau)]$. We denote all the BD values of a policy $\pi_\theta$ by $B(\pi_\theta) = [B_1(\pi_\theta), B_2(\pi_\theta), \ldots, B_L(\pi_\theta)]$.

Diversity measure function.
Since the BD values of a policy form an $L$-dimensional vector, we need a function $f$ that measures the overall diversity of the population as a scalar:

$$\mathrm{Div}(\Pi) := f\big(B(\pi_{\theta_1}), B(\pi_{\theta_2}), \ldots, B(\pi_{\theta_N})\big),$$

where $\Pi = \{\pi_{\theta_j} \mid 1 \le j \le N\}$ denotes a set of policies. The choice of $f$ should satisfy the following properties: (1) $f$ should be bounded and non-negative for ease of implementation; (2) since we do not define an order over the agents in the population, $f$ should be invariant under any permutation of the policies; (3) $f$ should be differentiable so that we can derive its gradient. Two recommended measure functions are detailed in Section C.2: the mean of all pair-wise Euclidean distances, and the Determinantal Point Process (DPP) (Parker-Holder et al., 2020).

Quality-similar diversity score. Because the quality of a policy for a task is usually real-valued and one-dimensional, obtaining a diverse set of policies at every possible quality level would require an infinite number of policies. Hence, we approximately evaluate an algorithm's QSD performance by partitioning the obtained quality range into $M$ disjoint intervals, and $f$ evaluates only the diversity of policies within the same quality interval. The QSD score is defined as follows:

$$\text{QSD score} := \sum_{m=1}^{M} \mathrm{Div}(\Pi_m), \tag{1}$$

where $\Pi_m$ denotes the set of policies obtained throughout training with qualities in the $m$-th interval.
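As an illustration of the QSD score defined above, the following is a minimal Python sketch rather than the authors' implementation. It assumes that $f$ is the mean pair-wise Euclidean distance, that the $M$ intervals have equal width over the observed quality range, and that an interval holding fewer than two policies contributes zero diversity. The function names `mean_pairwise_distance` and `qsd_score` are hypothetical.

```python
import numpy as np


def mean_pairwise_distance(bd_vectors):
    """Mean Euclidean distance over all unordered pairs of BD vectors."""
    bds = np.asarray(bd_vectors, dtype=float)
    n = len(bds)
    if n < 2:
        # Assumption: fewer than two policies contribute no diversity.
        return 0.0
    diffs = bds[:, None, :] - bds[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(n, k=1)  # each unordered pair counted once
    return float(dists[upper].mean())


def qsd_score(qualities, bd_vectors, num_intervals):
    """Sum of per-interval diversities over equal-width quality intervals."""
    qualities = np.asarray(qualities, dtype=float)
    bds = np.asarray(bd_vectors, dtype=float)
    # Assumption: equal-width intervals spanning the observed quality range.
    edges = np.linspace(qualities.min(), qualities.max(), num_intervals + 1)
    # Using only the interior edges yields interval indices 0 .. M-1.
    interval = np.digitize(qualities, edges[1:-1])
    return sum(
        mean_pairwise_distance(bds[interval == m]) for m in range(num_intervals)
    )
```

For example, with qualities `[0, 1, 9, 10]`, BD vectors `[[0, 0], [3, 4], [1, 1], [1, 1]]`, and two intervals, the low-quality pair contributes a distance of 5 and the high-quality pair contributes 0, so the score is 5.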

3. POPULATION-BASED RL FOR QUALITY-SIMILAR DIVERSITY

The gradient of a user-specified BD with respect to a policy is not directly tractable, since the BD is defined on a trajectory. Hence, one possible solution to our QSD problem would be to adapt derivative-free methods, such as conventional QD methods (Mouret & Clune, 2015). However, derivative-free methods often scale poorly to the large neural networks that are necessary to handle complex inputs, such as image features in Atari. Because our definition of a user-specified BD has a form similar to that of the quality, it is feasible to derive the diversity gradient directly using the policy gradient theorem (Sutton et al., 1999). Given the diversity gradient, there are two choices for obtaining a diverse population: sequential training (Zhang et al., 2019; Zahavy et al., 2021; Zhou et al., 2022) and population-based training (Jaderberg et al., 2017; Jung et al., 2019; Parker-Holder et al., 2020). Population-based training is more appropriate in our case, because the diversity measure is defined on a population and cannot be factorized into incremental diversity settings as in Zhang et al. (2019), Zahavy et al. (2021), and Zhou et al. (2022). Based on the above discussion, we develop an efficient population-based RL algorithm, named QSD-PBT, for optimizing user-specified BD diversity across different quality levels. Each component of QSD-PBT is elaborated below.

3.1. DERIVING THE DIVERSITY GRADIENT

Using the chain rule, we can write the gradient of the diversity of a population $\Pi = \{\pi_{\theta_j} \mid 1 \le j \le N\}$ with respect to the policy parameters $\theta_j$ (without loss of generality, we write $\theta$ hereafter) as:

$$\frac{\partial \mathrm{Div}(\Pi)}{\partial \theta} = \sum_{i=1}^{L} \frac{\partial f}{\partial B_i(\pi_\theta)} \frac{\partial B_i(\pi_\theta)}{\partial \theta}. \tag{2}$$

The partial derivative of $f$ with respect to $B_i(\pi_\theta)$ is easily obtained as long as $f$ is an explicit diversity measure function, such as the mean pair-wise distance or the determinant in DPP. Note that in $B_i(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[b_i(\tau)]$, $b_i(\tau)$ is a scalar function evaluating the $i$-th user-specified BD with respect to a trajectory $\tau$. Following the policy gradient theorem (Sutton et al., 1999), we have:

$$\frac{\partial B_i(\pi_\theta)}{\partial \theta} = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} b_i(\tau) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]. \tag{3}$$

In practice, the gradient in Equation 3 is preferably estimated from samples $[s_t, a_t, s_{t+1}, b_i(\tau)]$. To this end, a set of state and state-action BD estimators parameterized by $\phi_i$ is introduced:

$$V^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t; \phi_i) := \mathbb{E}_{a_t, \tau_{t+1:T} \sim \pi_\theta}[b_i(\tau)], \tag{4}$$

$$Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t; \phi_i) := \mathbb{E}_{\tau_{t+1:T} \sim \pi_\theta}[b_i(\tau)], \tag{5}$$

where $\tau_{i:j}$ denotes the segment of a trajectory from time step $i$ to $j$: $\tau_{i:j} = [s_i, a_i, \ldots, s_j, a_j]$. It is worth noting that $b_i(\tau)$ cannot be factorized into a sum of quantities at each time step. For this reason, $V^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t)$ at state $s_t$ depends on the whole historical state-action sequence $\tau_{0:t-1}$. Similar to the advantage in RL, we define the BD advantage $A^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t) = Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t) - V^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t)$.
Since $V^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t)$ is action-independent, the gradient in Equation 3 can be estimated using a mini-batch of samples $\{(\tau^{(k)}_{0:t-1}, s^{(k)}_t, a^{(k)}_t, \hat{A}^{(k)}_{B_i})\}_{k=1}^{K}$ as:

$$\frac{\partial B_i(\pi_\theta)}{\partial \theta} \approx \frac{1}{K} \sum_{k=1}^{K} \hat{A}^{(k)}_{B_i} \nabla_\theta \log \pi_\theta(a^{(k)}_t \mid s^{(k)}_t), \tag{6}$$

where the sampled BD advantage $\hat{A}^{(k)}_{B_i}$ can be estimated using conventional methods, e.g., the Generalized Advantage Estimator (GAE) (Schulman et al., 2016) or Direct Advantage Estimation (Pan et al., 2021). The above derives the diversity gradient for a discrete-action policy. Based on the deterministic policy gradient theorem (Silver et al., 2014), we have the following proposition for a deterministic policy $\pi_\theta(s)$ in a continuous action space.

Proposition 1. For a deterministic policy in a continuous action space, the derivative of $B_i(\pi_\theta)$ with respect to the policy parameters $\theta$ is:

$$\frac{\partial B_i(\pi_\theta)}{\partial \theta} = \sum_{t=0}^{T} \int_{s_{0:t}} p(s_0 \to s_t) \nabla_\theta \pi_\theta(s_t) \nabla_{a_t} Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t)\big|_{a_t = \pi_\theta(s_t)} \, ds_{0:t}, \tag{7}$$

where $\int_{s_{0:t}}$ and $ds_{0:t}$ are short for $\int_{s_0}\int_{s_1}\cdots\int_{s_t}$ and $ds_0\, ds_1 \cdots ds_t$, respectively, and $p(s_i \to s_j) = p(s_i) \prod_{k=i}^{j-1} p(s_{k+1} \mid s_k, \pi_\theta(s_k))$. Using a mini-batch of samples $\{(\tau^{(k)}_{0:t-1}, s^{(k)}_t, a^{(k)}_t)\}_{k=1}^{K}$, we can develop an estimator of the gradient above that is unbiased up to a constant scale:

$$\frac{\partial B_i(\pi_\theta)}{\partial \theta} \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \pi_\theta(s^{(k)}_t) \nabla_{a^{(k)}_t} Q^{\pi_\theta}_{B_i}(\tau^{(k)}_{0:t-1}, s^{(k)}_t, a^{(k)}_t)\big|_{a^{(k)}_t = \pi_\theta(s^{(k)}_t)}. \tag{8}$$

The proof of Proposition 1 is in Appendix A.1. Based on the proposition, at each training step we can sample a mini-batch $\{(\tau^{(k)}_{0:t-1}, s^{(k)}_t, a^{(k)}_t)\}_{k=1}^{K}$ to estimate the gradient. It is worth noting that the trajectory $\tau_{0:t-1}$ may be a very long sequence, e.g., more than 10k state-action pairs in Atari games. Hence, a feature extractor is needed to encode the trajectory. To train the BD estimators, we use the final BD value of a trajectory as the target and apply the mean squared error loss. More implementation details are provided in Appendix C.4.
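The discrete-action estimate described earlier in this section can be checked numerically in the simplest possible setting. The sketch below is an illustration under strong assumptions, not the QSD-PBT implementation: the episode has a single step, the policy is a softmax over logits $\theta$, and the BD depends only on the chosen action, so that $Q_B(a) = b(a)$ and $V_B = \sum_a \pi(a) b(a)$ are known exactly instead of being learned. The function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def exact_bd_gradient(logits, bd_per_action):
    """Closed form for the one-step case: dB/dtheta_k = pi_k * (b_k - B)."""
    pi = softmax(logits)
    bd_value = pi @ bd_per_action
    return pi * (bd_per_action - bd_value)


def sampled_bd_gradient(logits, bd_per_action, num_samples, rng):
    """Mini-batch estimate (1/K) * sum_k A_k * grad_theta log pi(a_k)."""
    pi = softmax(logits)
    actions = rng.choice(len(pi), size=num_samples, p=pi)
    # One-step assumption: Q_B(a) = b(a), and V_B is the action-independent mean.
    baseline = pi @ bd_per_action
    advantages = bd_per_action[actions] - baseline
    # For a softmax policy, grad of log pi(a) w.r.t. the logits is onehot(a) - pi.
    grad_log_pi = np.eye(len(pi))[actions] - pi
    return (advantages[:, None] * grad_log_pi).mean(axis=0)
```

With a large enough mini-batch, `sampled_bd_gradient` should approach `exact_bd_gradient`. In the multi-step setting of the paper, the advantage would instead come from the learned BD estimators, which also condition on the trajectory history.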

3.2. OVERALL TRAINING SCHEME

In the QSD problem, we focus on maximizing the diversity at different quality levels. Methods previously applied to QD problems, such as constrained optimization subject to quality constraints, are inconsistent with this objective. Another possible solution is to combine the quality loss and the diversity loss with a coefficient and adjust the coefficient using an online learning algorithm, such as bandits (Parker-Holder et al., 2020). However, it is not obvious what the online learning objective should be or how to adapt the coefficient optimally. To solve the QSD problem, we start with its sub-problem, i.e., maximizing diversity at one quality level.

Optimizing diversity at one quality level. Consider a population of $N$ policies, whose diversity and average quality are $\mathrm{Div}(\Pi)$ and $\frac{1}{N}\sum_{j=1}^{N} J(\pi_{\theta_j})$, respectively. Each sub-problem corresponds to optimizing a combination of the quality loss and the diversity loss with a fixed coefficient, namely the target weight $\lambda_\infty$. Instead of training with $\lambda_\infty$ from start to end, we let the coefficient start from a large initial value $\lambda_0$ and gradually decay to $\lambda_\infty$ by the end of training. Specifically, at each training step $t$, the combined loss for the population is

$$L_t(\Pi) = -\frac{1}{N}\sum_{j=1}^{N} J(\pi_{\theta_j}) - \lambda_t \mathrm{Div}(\Pi),$$

where we require $\lambda_{t+1} < \lambda_t$ and $\lim_{t \to \infty} \lambda_t = \lambda_\infty$. This is motivated by the general observation that exploration is important in population-based RL training, and a larger $\lambda$ places more weight on diversity and thus helps exploration. Apart from encouraging exploration throughout training, our decaying method preserves the convergence property under some assumptions, as stated below.

Proposition 2. Let $f(x) : \mathbb{R}^n \to \mathbb{R}$ and $g(x) : \mathbb{R}^n \to \mathbb{R}$ be Lipschitz smooth, convex, bounded functions with bounded derivatives, and let $\{\lambda_t\}_{t=0}^{\infty}$ be a positive decreasing sequence that converges to $\lambda$. Denote $h_t(x) = f(x) + \lambda_t g(x)$.
Then, using the gradient descent algorithm with a properly chosen step size, the algorithm converges to the global minimum of $f(x) + \lambda g(x)$.

The proof is in Appendix A.2. The decaying method thus converges in the convex and Lipschitz smooth case. For non-convex cases, we validate it experimentally in Section 4 and Appendix B.1.

Preserving the quality similarity. Constraining only the average quality of the population at certain levels does not guarantee similar qualities within the population. To encourage quality similarity among agents in a population, we distribute $\lambda_t$ to each agent in the population, denoted as $\lambda^{(j)}_t$ for the $j$-th agent, and adapt each $\lambda^{(j)}_t$ during training. The loss function for the $j$-th agent (policy) is $L^{(j)}_t(\pi_{\theta_j}) = -J(\pi_{\theta_j}) - \lambda^{(j)}_t \mathrm{Div}(\Pi)$. Specifically, policies with better qualities should focus more on diversity optimization, and vice versa. Following this discussion, the schedule of $\lambda^{(j)}_t$ throughout the training process is designed as

$$\lambda^{(j)}_t = \lambda_\infty + (\lambda_0 - \lambda_\infty) \exp\Big(-\frac{t}{t_0} \cdot \frac{\max_j R^{(j)}_t}{R^{(j)}_t}\Big),$$

where $\lambda_0$ is the initial coefficient, $t_0$ is the preferred decay step, $t$ is the current training step, and $R^{(j)}_t$ is the evaluation of the current quality of the $j$-th policy in the population. We assume that the quality of any policy is non-negative and upper bounded. As a result, we have $\lim_{t \to \infty} \lambda^{(j)}_t = \lambda_\infty$.

Optimizing diversity across multiple quality levels. Once a target $\lambda_\infty$ is designated, the objective function $L_t(\Pi)$ corresponds to maximizing diversity at a certain quality level. Since our goal in QSD is to obtain policy diversity at multiple quality levels, a straightforward way is to optimize the objective function $L_t(\Pi)$ multiple times with a set of target weights $\{\lambda_{\infty,h}\}_{h=1}^{H}$. Alternatively, we could obtain policy diversity at multiple quality levels in a single run.
Given two target weights $\lambda_{\infty,1} > \lambda_{\infty,2}$, once the optimization with $\lambda_{\infty,1}$ has finished, we can continue the optimization from $\lambda_{\infty,1}$ to solve the quality level targeted by $\lambda_{\infty,2}$, rather than re-initializing the model and training from the initial $\lambda_0$. This can be extended to multiple target weights if $\lambda$ is decayed slowly enough from a large value $\lambda_0$ to 0. In practice, we find this much more efficient than independent training with multiple target values $\{\lambda_{\infty,h}\}_{h=1}^{H}$. As a result, we apply a single training run with $\lambda$ decaying from $\lambda_0$ to 0 to solve the QSD problem as a whole, which gives the overall training loss for the $j$-th policy:

$$L^{(j)}_t(\pi_{\theta_j}) = -J(\pi_{\theta_j}) - \lambda^{(j)}_t \mathrm{Div}(\Pi), \quad \text{with} \quad \lambda^{(j)}_t = \lambda_0 \cdot \exp\Big(-\frac{t}{t_0} \cdot \frac{\max_j R^{(j)}_t}{R^{(j)}_t}\Big).$$

We employ PPO (Schulman et al., 2017) as the backbone for discrete actions. For scenarios with a continuous action space, the TD3 (Fujimoto et al., 2018) algorithm is applied as the backbone. We term our diversity optimization together with the population-based RL backbones QSD-PBT. The pseudocode is given in Appendix C.5, and the code is open-sourced.
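The adaptive coefficient in the overall training loss above can be sketched in a few lines of Python. This is an illustrative sketch, not the released code. The function name `diversity_coefficients` is hypothetical, the optional `lambda_inf` argument recovers the single-quality-level schedule when it is non-zero, and the sketch additionally assumes strictly positive quality estimates so that the ratio is well defined.

```python
import numpy as np


def diversity_coefficients(step, qualities, lambda_0, decay_step, lambda_inf=0.0):
    """Per-policy diversity weights lambda_t^(j) for the current training step.

    qualities[j] is the current quality estimate R_t^(j) of the j-th policy.
    """
    r = np.asarray(qualities, dtype=float)
    if np.any(r <= 0):
        # Assumption of this sketch: qualities are strictly positive.
        raise ValueError("quality estimates must be strictly positive")
    ratio = r.max() / r  # equals 1 for the best policy, larger for weaker ones
    decay = np.exp(-(step / decay_step) * ratio)
    return lambda_inf + (lambda_0 - lambda_inf) * decay
```

At step 0 every policy receives $\lambda_0$. At later steps the best policy keeps the largest coefficient, so it focuses more on diversity, while lower-quality policies decay faster and focus on improving quality.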

4. EXPERIMENTS

The effectiveness of QSD-PBT is validated on both MuJoCo (Brockman et al., 2016) continuous control tasks and Atari games (Bellemare et al., 2013) with discrete action spaces. For the diversity measure, we use the mean pair-wise Euclidean distance between the BD vectors of the policies. For the specification of BDs, we follow common practice in MuJoCo tasks by incorporating the built-in joint torques as BDs, similar to Parker-Holder et al. (2020). For Atari games, we estimate the advantage with the widely adopted GAE method and follow the PPO (Schulman et al., 2017) implementation, which normalizes the advantage to make training more robust. We design general-purpose BDs from the perspective of human players for the five Atari games. It is worth noting that the design of the BDs does not unfairly favor any particular algorithm. Hence, the comparison among different methods is fair and unbiased (also illustrated in Appendix B.3). We compare QSD-PBT with independent PPO (Schulman et al., 2017) 1 , where we plot the quality-diversity curves throughout training. One important reason for the superiority of QSD-PBT is that its diversity gradient is unbiased with respect to the user-specified diversity and is estimated using samples in the state-action space, whereas other methods are either biased (QD-PG using 'state' BDs and DvD-TD3 using task-agnostic BDs) or estimated using samples in the policy parameter space (EDO-CS). Another reason is that the diversity is better exploited within each quality interval due to the adaptive diversity loss, which is further studied in Appendix B.1. An additional illustration of the diversity at different quality levels is given in Figure 8.

4.2. ATARI GAMES

We select five Atari games. MsPacMan and RiverRaid are PvE (i.e., single-agent) games where players obtain scores by achieving pre-designed goals. FishingDerby, IceHockey, and Boxing are PvP (i.e., multi-agent) games where the final score is the score difference between the player and a built-in AI competitor. Five general-purpose BDs are constructed from the perspective of human players: (1) Game time is the number of time steps in an episode. (2) Fire rate is the frequency of the fire action in an episode. (3) Action continuity measures the continuity of an agent's actions by counting the number of action changes, which is effectively an AI version of APM (actions per minute, widely applied to human players). For example, this BD for the action sequence {noop-noop-left-left-noop} is 2. (4) Two orientation BDs (left_right and up_down) measure the preference of moving directions.

In Atari games, we find that the QD-PG and EDO-CS algorithms reach much lower quality levels than the PPO-based methods; therefore, we exclude them from the comparison and introduce a PPO baseline instead. The results are summarized in Table 2 and Figure 2, and most of them are consistent with those in the MuJoCo experiment. The exploitation strategy in PBT accelerates the training of quality at a considerable cost in diversity. DvD-PPO performs relatively well, even though it optimizes a task-agnostic diversity defined on action probabilities. Yet, the BDs for Atari games are more macroscopic than those for MuJoCo tasks, so the improvement of DvD-PPO over PBT is smaller in Atari games than in MuJoCo tasks. Owing to our diversity gradient and the adaptive diversity loss, QSD-PBT consistently achieves the highest QSD scores, though with more time overhead in MsPacMan and RiverRaid due to a relatively large diversity loss coefficient at the beginning of training.
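The trajectory BDs listed above are simple functions of an episode's action sequence. The sketch below shows how three of them could be computed. It is an illustration only: the function names, the string action labels, and the choice to express fire rate as the fraction of steps that use a fire action are assumptions rather than details taken from the paper.

```python
def game_time(actions):
    """Number of time steps in the episode."""
    return len(actions)


def fire_rate(actions, fire_actions=("fire",)):
    """Fraction of steps whose action is a fire action (assumed definition)."""
    if not actions:
        return 0.0
    return sum(a in fire_actions for a in actions) / len(actions)


def action_continuity(actions):
    """Number of times the action changes between consecutive steps."""
    return sum(1 for prev, cur in zip(actions, actions[1:]) if prev != cur)
```

For the sequence noop-noop-left-left-noop from the text, `action_continuity` returns 2, matching the example given for this BD.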
We further demonstrate the ability of QSD-PBT in generating user-specified diversity, using the two orientation diversity BDs in MsPacMan, as shown in Figure 3 and in the video. In this game, there are shortcuts connecting the left and right ends of the maze, which allows an agent to return to the middle of the maze via consistent (e.g., always left) moving directions. The orientation preference is a macroscopic BD that is significantly different from any task-agnostic diversity investigated in 

5. RELATED WORK

Diversity as a means to improve quality. A diversity-driven approach for exploration (Hong et al., 2018) was proposed by adding a distance regularization between the current policy and a previous policy. Unsupervised learning of diverse policies was studied in Eysenbach et al. (2018) to serve as an effective pretraining mechanism for downstream RL tasks. Novelty search was hybridized with the OpenAI ES to improve exploration in sparse or deceptive deep RL tasks (Conti et al., 2018) . A population determinant diversity measure (Parker-Holder et al., 2020) was proposed to improve exploration. Diverse opponent policies have a large influence on the online performance of opponent modelling Fu et al. (2022) . Diverse behaviors were learned in order to effectively generalize to varying environments that are different from training (Kumar et al., 2020) . A diversity-regularized collaborative exploration strategy was proposed in Peng et al. (2020) . Reward randomization (Tang et al., 2020) was employed to discover diverse policies in multi-agent games with the aim of improving the final policy quality. Trajectory diversity was maximized for better zero-shot coordination in a collaborative multi-agent environment (Lupu et al., 2021) . Diversity was studied in multi-agent open-ended learning (Liu et al., 2021; Perez-Nieves et al., 2021) to improve the final exploitability (another definition of quality). Maximizing diversity subject to high quality constraints or vice-versa. A maximum mean discrepancy regularizer was proposed to produce a set of near-optimal policies having different distributions of trajectories (Masood & Doshi-Velez, 2019) . A two-objective update technique (Zhang et al., 2019) was developed for sequentially obtaining a set of novel policies, each of which solves a given task in the meantime executing distinct action sequences. 
A method based on the Frank-Wolfe algorithm (Frank & Wolfe, 1956) was introduced to compute a set of diverse and near-optimal policies (Ghasemi et al., 2021) . A set of diverse policies in the space of successor features (Barreto et al., 2017) were sequentially obtained by solving a line of constrained MDPs (Zahavy et al., 2021) , where an intrinsic diversity reward was maximized subject to a quality constraint. A reward-switching technique was recently proposed (Zhou et al., 2022) to discover a diverse set of high-quality policies by sequentially solving a novelty-constrained optimization problem for the current policy. Maximizing both quality and diversity via QD-style methods. QD algorithms (Pugh et al., 2016; Cully & Demiris, 2017) are a type of evolutionary algorithms, where the goal is to evolve a set of diverse and high-quality solutions in a single run. A representative QD algorithm is MAP-Elites (Cully et al., 2015; Mouret & Clune, 2015) . In order to scale to large policy neural networks, several recent attempts have been made to combine QD with policy gradient (Cideron et al., 2020; Nilsson & Cully, 2021; Pierrot et al., 2022) or ES (Colas et al., 2020; Wang et al., 2021) . Evolutionary multiobjective deep reinforcement learning was employed (Shen et al., 2020) to generate behavior-diverse game AIs. A kernel-based method with Stein variational gradient descent was proposed (Gangwani et al., 2020) for training a set of QD policies. Assuming that both the quality and the BD are fully differentiable (which is generally not true in the RL settings), a new MAP-Elites algorithm was developed in Fontaine & Nikolaidis (2021) . A subsequent work (Tjanaka et al., 2022) extended differentiable QD (Fontaine & Nikolaidis, 2021) to RL settings, where the quality and BD gradients were estimated using TD3 and ES respectively.

6. DISCUSSION AND CONCLUSION

In this work, motivated by the practical need for generating task-specific diverse policies of similar qualities, we formulate the QSD problem and propose a new performance metric called the QSD score. For the first time, we derive the gradient of the user-defined diversity measure with respect to a policy and approximate the gradient using samples in the state-action space (as opposed to the policy parameter space). Based on our diversity gradient, an efficient population-based RL algorithm (i.e., QSD-PBT) is then developed, and it demonstrates strong performance in maximizing user-specified diversity across different quality levels on both MuJoCo and Atari.

BD definition. We are aware that in some real-world situations it might be non-trivial to express the user-intended diversity via a set of BDs but relatively easy via a similarity function $\mathrm{Sim}(\tau_i, \tau_j)$ between two trajectories. $\mathrm{Sim}(\tau_i, \tau_j)$ is more general than BDs, since we can derive $\mathrm{Sim}(\tau_i, \tau_j)$ from BDs but the opposite is not true. We look forward to extending QSD-PBT to such and more general situations.

BD estimator. In order to estimate our diversity gradient, we need to train the state BD estimators (Equation 4) or the state-action BD estimators (Equation 5). Note that both BD estimators depend on the previous state-action sequence $\tau_{0:t-1}$, which might cause problems for very long trajectories. This is currently handled by applying LSTM models (in MuJoCo) or sufficient statistics (in Atari). More advanced techniques, such as attention (Vaswani et al., 2017), may be needed for more accurate estimation.

Population-based training. We use a separate neural network for each agent in the population, which increases the overall training time and memory overhead linearly with the population size $N$. A more efficient approach may be to share most of the feature extraction part, e.g., the 3-layer convolution model in DQN, among all agents and to build separate policy and value heads for each agent in the population.
This implementation is connected with the ensemble and multi-task learning (An et al., 2021; Flet-Berliac & Preux, 2019) . From experience in previous research, parameter sharing accelerates training convergence yet in our case degrades the diversity of PPO and PBT, since they are optimized without explicit diversity objectives. For fair comparisons, parameter sharing is not implemented in this paper but is recommended in practice. Quality and diversity optimization. In this paper, we break the QSD problem into a set of subproblems (each corresponding to maximizing diversity at a certain quality level) and solve them integrally in a single run, and the resulting algorithm is termed QSD-PBT. In spite of achieving better empirical performance than other baselines, whether QSD-PBT is the most efficient method in optimizing the QSD score defined in Equation 1 is unclear. One possible solution is combining the strength of our diversity gradient (Equation 6 and 8) with existing QD-style methods. Another is regarding the QSD problem as a Pareto front optimization problem and applying continuous exploration (Ma et al., 2020) .

A THEORY ANALYSIS

A.1 PROOF OF PROPOSITION 1

Proof. We prove the correctness of Equation 7 and that Equation 8 is an unbiased estimate of the gradient in Equation 7 up to a constant scale. We denote a segment of a trajectory from time step $i$ to $j$ as $\tau_{i:j} := [s_i, a_i, \ldots, s_j, a_j]$, with the special cases $\tau_{-1} := \emptyset$ and $\tau_0 := [s_0, a_0]$. Recall the definitions of the state BD estimator and the state-action BD estimator:

$$V^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t) := \mathbb{E}_{a_t, \tau_{t+1:T} \sim \pi_\theta}[b_i(\tau)], \qquad Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t) := \mathbb{E}_{\tau_{t+1:T} \sim \pi_\theta}[b_i(\tau)].$$

We consider a deterministic policy for continuous actions, and the output of the policy $\pi_\theta(s_t)$ is a scalar, i.e., $a_t = \pi_\theta(s_t)$. As a result, the following two equations hold between the state BD estimator and the state-action BD estimator:

$$Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t) = \int_{s_{t+1}} p(s_{t+1} \mid s_t, \pi_\theta(s_t)) \, V^{\pi_\theta}_{B_i}(\tau_{0:t}, s_{t+1}) \, ds_{t+1},$$

$$V^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t) = Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, \pi_\theta(s_t)),$$

where $p(s_{t+1} \mid s_t, \pi_\theta(s_t))$ is the transition probability from $s_t$ to $s_{t+1}$ after taking the action $\pi_\theta(s_t)$. We use $p_\pi(s_i \to s_j)$ to represent the probability $p(s_i) \prod_{k=i}^{j-1} p(s_{k+1} \mid s_k, \pi_\theta(s_k))$. We omit the index $\pi$ for clarity in the following. Note that the BD of a policy can be written as:

$$B_i(\pi_\theta) = \int_{s_0} p(s_0) \, Q^{\pi_\theta}_{B_i}(\tau_{-1}, s_0, \pi_\theta(s_0)) \, ds_0,$$

and the gradient is:

$$\frac{\partial B_i(\pi_\theta)}{\partial \theta} = \int_{s_0} p(s_0) \, \nabla_\theta Q^{\pi_\theta}_{B_i}(\tau_{-1}, s_0, \pi_\theta(s_0)) \, ds_0. \tag{15}$$

Further, we have:

$$\begin{aligned}
\nabla_\theta Q^{\pi_\theta}_{B_i}(\tau_{-1}, s_0, \pi_\theta(s_0))
&= \nabla_\theta \int_{s_1} p(s_1 \mid s_0, \pi_\theta(s_0)) \, V^{\pi_\theta}_{B_i}(\tau_0, s_1) \, ds_1 \\
&= \int_{s_1} \nabla_\theta \pi_\theta(s_0) \, \nabla_{a_0} p(s_1 \mid s_0, a_0)\big|_{a_0 = \pi_\theta(s_0)} \, V^{\pi_\theta}_{B_i}(\tau_0, s_1) \, ds_1 + \int_{s_1} p(s_1 \mid s_0, \pi_\theta(s_0)) \, \nabla_\theta V^{\pi_\theta}_{B_i}(\tau_0, s_1) \, ds_1 \\
&= \nabla_\theta \pi_\theta(s_0) \int_{s_1} \nabla_{a_0} p(s_1 \mid s_0, a_0)\big|_{a_0 = \pi_\theta(s_0)} \, V^{\pi_\theta}_{B_i}(\tau_0, s_1) \, ds_1 + \int_{s_1} p(s_1 \mid s_0, \pi_\theta(s_0)) \, \nabla_\theta V^{\pi_\theta}_{B_i}(\tau_0, s_1) \, ds_1 \\
&= \nabla_\theta \pi_\theta(s_0) \, \nabla_{a_0} Q^{\pi_\theta}_{B_i}(\tau_{-1}, s_0, a_0)\big|_{a_0 = \pi_\theta(s_0)} + \int_{s_1} p(s_1 \mid s_0, \pi_\theta(s_0)) \, \nabla_\theta Q^{\pi_\theta}_{B_i}(\tau_0, s_1, \pi_\theta(s_1)) \, ds_1.
\end{aligned} \tag{16}$$

Equation 16 shows the relation between the gradient of the current state-action BD estimator and that of the next state-action BD estimator. Generalizing this result, we have

$$\nabla_\theta Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, \pi_\theta(s_t)) = \nabla_\theta \pi_\theta(s_t) \, \nabla_{a_t} Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t)\big|_{a_t = \pi_\theta(s_t)} + \int_{s_{t+1}} p(s_{t+1} \mid s_t, \pi_\theta(s_t)) \, \nabla_\theta Q^{\pi_\theta}_{B_i}(\tau_{0:t}, s_{t+1}, \pi_\theta(s_{t+1})) \, ds_{t+1}. \tag{17}$$

We can apply Equation 17 to Equation 15 recursively:

$$\begin{aligned}
\frac{\partial B_i(\pi_\theta)}{\partial \theta}
&= \int_{s_0} p(s_0) \, \nabla_\theta \pi_\theta(s_0) \, \nabla_{a_0} Q^{\pi_\theta}_{B_i}(\tau_{-1}, s_0, a_0)\big|_{a_0 = \pi_\theta(s_0)} \, ds_0 + \int_{s_1} \int_{s_0} p(s_0) \, p(s_1 \mid s_0, \pi_\theta(s_0)) \, \nabla_\theta Q^{\pi_\theta}_{B_i}(\tau_0, s_1, \pi_\theta(s_1)) \, ds_0 \, ds_1 \\
&= \int_{s_0} p(s_0) \, \nabla_\theta \pi_\theta(s_0) \, \nabla_{a_0} Q^{\pi_\theta}_{B_i}(\tau_{-1}, s_0, a_0)\big|_{a_0 = \pi_\theta(s_0)} \, ds_0 + \int_{s_{0:1}} p(s_0 \to s_1) \, \nabla_\theta Q^{\pi_\theta}_{B_i}(\tau_0, s_1, \pi_\theta(s_1)) \, ds_{0:1} \\
&= \int_{s_0} p(s_0) \, \nabla_\theta \pi_\theta(s_0) \, \nabla_{a_0} Q^{\pi_\theta}_{B_i}(\tau_{-1}, s_0, a_0)\big|_{a_0 = \pi_\theta(s_0)} \, ds_0 + \int_{s_{0:1}} p(s_0 \to s_1) \, \nabla_\theta \pi_\theta(s_1) \, \nabla_{a_1} Q^{\pi_\theta}_{B_i}(\tau_0, s_1, a_1)\big|_{a_1 = \pi_\theta(s_1)} \, ds_{0:1} \\
&\quad + \int_{s_{0:2}} p(s_0 \to s_2) \, \nabla_\theta Q^{\pi_\theta}_{B_i}(\tau_{0:1}, s_2, \pi_\theta(s_2)) \, ds_{0:2} \\
&= \ldots \\
&= \sum_{t=0}^{T} \int_{s_{0:t}} p(s_0 \to s_t) \, \nabla_\theta \pi_\theta(s_t) \, \nabla_{a_t} Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t)\big|_{a_t = \pi_\theta(s_t)} \, ds_{0:t},
\end{aligned} \tag{18}$$

where $\int_{s_{0:t}}$ is short for $\int_{s_0} \int_{s_1} \ldots \int_{s_t}$ and $ds_{0:t}$ is short for $ds_0 \, ds_1 \ldots ds_t$. The first equality follows from Equation 16, and the subsequent equalities follow from Equation 17.

We further prove that, up to a constant scale, Equation 8 is an unbiased estimate of Equation 7. Note that in practice we obtain sampled trajectories by executing the policy $\pi_\theta$, and the samples in the trajectories are added into a replay buffer. As a result, a sample $(s_t, a_t, \tau_{0:t-1})$ is added into the replay buffer with probability $p(s_0) \prod_{k=0}^{t-1} p(s_{k+1} \mid s_k, \pi_\theta(s_k))$, i.e., $p(s_0 \to s_t)$. In other words, the probability that the sample $(s_t, a_t, \tau_{0:t-1})$ is in the current mini-batch is proportional to $p(s_0 \to s_t)$, which we denote by $\frac{p(s_0 \to s_t)}{C}$ ($C$ is a constant).
The expectation of the sampled gradient is:

$$\mathbb{E}\left[ \nabla_\theta \pi_\theta(s_t) \, \nabla_{a_t} Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t)\big|_{a_t = \pi_\theta(s_t)} \right] = \sum_{t=0}^{T} \int_{s_{0:t}} \nabla_\theta \pi_\theta(s_t) \, \nabla_{a_t} Q^{\pi_\theta}_{B_i}(\tau_{0:t-1}, s_t, a_t)\big|_{a_t = \pi_\theta(s_t)} \, \frac{p(s_0 \to s_t)}{C} \, ds_{0:t} = \frac{1}{C} \frac{\partial B_i(\pi_\theta)}{\partial \theta}, \tag{19}$$

where the last equality follows from Equation 18.

A.2 PROOF OF PROPOSITION 2

Proof. Denote $h(x) = f(x) + \lambda g(x)$. By assumption, $h(x)$ is Lipschitz smooth, i.e., there exists $L > 0$ such that

$$\|\nabla h(x) - \nabla h(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^n. \tag{20}$$

So,

$$\begin{aligned}
\left| h(y) - h(x) - \nabla h(x)^T (y - x) \right|
&= \left| \int_0^1 \nabla h(x + t(y - x))^T (y - x) \, dt - \nabla h(x)^T (y - x) \right| \\
&\le \int_0^1 \left\| \nabla h(x + t(y - x)) - \nabla h(x) \right\| \cdot \|y - x\| \, dt \\
&\le \int_0^1 t L \|y - x\|^2 \, dt = \frac{L}{2} \|y - x\|^2,
\end{aligned} \tag{21}$$

i.e.,

$$h(y) - h(x) \le \nabla h(x)^T (y - x) + \frac{L}{2} \|x - y\|^2. \tag{22}$$

Since $g(x)$ is bounded, we can assume $g(x) > 0$. Otherwise, let $C = \min_x g(x)$; we rewrite $h_t(x) = f(x) + \lambda_t (g(x) - C) + \lambda_t C$ and perform the following analysis on $h_t(x)_{new} = f(x) + \lambda_t (g(x) - C)$ (note that both $h_t(x)_{new}$ and $h_t(x)$ have the same derivatives and minimums). Hence, for each fixed $x$, the sequence $\{h_t(x)\}_{t=0}^{\infty}$, with $h_t(x) = f(x) + \lambda_t g(x)$, is decreasing to $h(x)$.

Consider an optimization algorithm which starts at $x_0$ and, at each time step $t$, updates $x_t$ using the gradient descent step $x_{t+1} = x_t - \eta \nabla h_t(x_t)$. By Equation 22, we have

$$h_t(x_{t+1}) - h_t(x_t) \le \nabla h_t(x_t)^T (x_{t+1} - x_t) + \frac{L}{2} \|x_{t+1} - x_t\|^2 \le -\eta \|\nabla h_t(x_t)\|^2 + \frac{L \eta^2}{2} \|\nabla h_t(x_t)\|^2. \tag{23}$$

Choosing $\eta = \frac{1}{L}$, we have $h_t(x_{t+1}) \le h_t(x_t)$, and since $h_{t+1}(x_{t+1}) \le h_t(x_{t+1})$, we obtain $h_{t+1}(x_{t+1}) \le h_t(x_t)$. Hence $\lim_{t \to \infty} h_t(x_t)$ exists ($h(x)$ is bounded). It is easy to prove that $\{h_t(x)\}_{t=0}^{\infty}$ converges to $h(x)$ uniformly, hence $\lim_{t \to \infty} h_t(x_t) = \lim_{t \to \infty} h(x_t)$. We denote $M = \lim_{t \to \infty} h(x_t)$.

We claim that $\lim_{t \to \infty} \nabla h(x_t) = 0$, hence $M = \min_x h(x)$. We prove the claim by contradiction. Otherwise, there exists $\epsilon_0$ such that for any $T$, $\exists t_0 > T$ s.t. $\|\nabla h(x_{t_0})\| > \epsilon_0$. Similar to Equation 23,

$$\begin{aligned}
h(x_{t+1}) - h(x_t)
&\le \nabla h(x_t)^T (x_{t+1} - x_t) + \frac{L}{2} \|x_{t+1} - x_t\|^2 \\
&\le -\eta \nabla h(x_t)^T \nabla h_t(x_t) + \frac{L \eta^2}{2} \|\nabla h_t(x_t)\|^2 \\
&= \eta \left( \nabla h_t(x_t)^T \nabla h_t(x_t) - \nabla h(x_t)^T \nabla h_t(x_t) \right) + \left( \frac{L \eta^2}{2} - \eta \right) \|\nabla h_t(x_t)\|^2 \\
&= \eta \left( \nabla h_t(x_t) - \nabla h(x_t) \right)^T \nabla h_t(x_t) - \frac{\eta}{2} \|\nabla h_t(x_t)\|^2.
\end{aligned} \tag{24}$$

In the last equality, we choose $\eta = \frac{1}{L}$. Since $\lim_{t \to \infty} \nabla h_t(x) = \nabla h(x)$, we have

$$\lim_{t \to \infty} \left| \left( \nabla h_t(x_t) - \nabla h(x_t) \right)^T \nabla h_t(x_t) \right| \le \lim_{t \to \infty} \|\nabla h_t(x_t) - \nabla h(x_t)\| \cdot \|\nabla h_t(x_t)\| = 0.$$

No matter how large $T$ is, we can find $t_0 > T$ s.t. $\|\nabla h(x_{t_0})\| > \epsilon_0$, so we can find $|h(x_{t_0+1}) - h(x_{t_0})| > \frac{\eta}{4} \epsilon_0$, which contradicts the convergence of $\{h(x_t)\}_{t=0}^{\infty}$ (Cauchy criterion for convergence). Hence the claim is proved. Finally, since $h(x)$ is convex, the algorithm converges to a global minimum of $h(x)$.
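As a numerical illustration of the setting in Proposition 2, the following minimal sketch runs gradient descent with step size 1/L on h_t(x) = f(x) + λ_t g(x) while λ_t decays towards λ. The specific f, g, the exponential schedule, the constants, and every function name are assumptions chosen only so that g is bounded and positive and h_t stays convex and smooth; none of them are taken from the paper.

```python
import math

# Hypothetical 1-D example: f is convex and smooth, g is bounded with 0 < g <= 1.
def f_grad(x):
    return 2.0 * (x - 1.0)            # gradient of f(x) = (x - 1)^2

def g_grad(x):
    return -2.0 * x / (1.0 + x * x) ** 2   # gradient of g(x) = 1 / (1 + x^2)

def lam_schedule(t, lam0=0.5, lam_inf=0.1, t0=50.0):
    # Assumed exponential decay of the coefficient from lam0 towards lam_inf.
    return lam_inf + (lam0 - lam_inf) * math.exp(-t / t0)

def run(x0=-3.0, steps=2000, smoothness=2.25):
    # smoothness bounds |h_t''| for this example: f'' = 2, |g''| <= 2, g'' <= 0.5, lam_t <= 0.5.
    eta = 1.0 / smoothness
    x = x0
    for t in range(steps):
        lam_t = lam_schedule(t)
        x -= eta * (f_grad(x) + lam_t * g_grad(x))   # x_{t+1} = x_t - eta * grad h_t(x_t)
    return x

x_final = run()
# Gradient of the limiting objective h = f + 0.1 * g at the final iterate.
grad_h = f_grad(x_final) + 0.1 * g_grad(x_final)
print(x_final, grad_h)
```

In this toy case the iterate settles at a stationary point of the limiting objective h rather than of the initial h_0, which is the behaviour the proposition describes.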

B.1.1 THE ADAPTIVE DIVERSITY LOSS

We demonstrate the effects of the adaptive diversity loss constrained by the population quality, as described in Section 3.2. The Atari game FishingDerby is used for the ablation study, and three training settings are compared: (1) The baseline setting uses the diversity loss function with a fixed coefficient $\lambda_\infty$. (2) In the decay setting, we apply exponential decay of the coefficient $\lambda_\infty$ with decay step $t_0$. (3) In the adaptive setting, we apply an adaptive coefficient $\lambda_j(t, R_{t,j})$ that considers both the relative quality of the agent and the current training step according to Section 3.2. The hyperparameters for this ablation study are detailed in Table 10.

[...] as shown in Figure 4 (left), which hinders the optimization of diversity among policies of similar qualities. Moreover, as shown in Figure 4 (middle), the qualities of agents improve slowly for both the fixed and the decay training settings, which may have a negative effect on maximizing diversity among policies of near-optimal qualities, e.g., quality levels above 40 in this game. By introducing the adaptive quality constraint, we observe a much faster improvement of the population quality in Figure 4 (middle). The overall QSD performances of the three settings are shown in Figure 4 (right). We can observe that the quality-constrained optimization has slightly better QSD scores at higher quality levels (>60%), although it improves the quality at the cost of diversity degradation at low quality levels. We recommend applying the adaptive diversity loss balanced by the population quality in practice.

We further investigate the effect of the hyperparameters decay rate $t_0$ and initial $\lambda_0$ in the QSD-PBT algorithm. In the Atari game FishingDerby, we first fix the initial $\lambda_0$ to 0.1 and vary the decay rate $t_0$ from 20k to 500k training steps, then fix the decay rate $t_0$ to 500k and vary the initial $\lambda_0$ from 0.05 to 0.4. Ablation results are shown in Figure 5, where computational/time overheads and QSD scores are measured in the same way as in Table 1 and Table 2. From the experimental results, we can conclude that the initial $\lambda_0$ significantly affects the training convergence and demonstrates a trade-off between the final QSD score and the training overhead over a large range (from 0.05 to 0.3). When $\lambda_0$ is too large ($\lambda_0 = 0.4$), QSD-PBT cannot reach the optimal quality within an acceptable time overhead (500k steps); in that case the partial sum over the quality intervals is reported, and the QSD score degrades. Compared with $\lambda_0$, the QSD-PBT algorithm is less sensitive to the decay rate $t_0$. Higher decay rates (20k and 50k) lead to results similar to those of a lower initial $\lambda_0$, while a suitable decay rate (200k) both maintains the QSD score and reduces the time overhead.

only get states as raw pixel images from the Atari environment, we focus on actions, and the implicit BD is designed as the randomly weighted action rate:

$$b_i(\tau) = w_i^T a, \qquad w_i \sim \left\{ (w_i^1, \ldots, w_i^K) \;\middle|\; \sum_{j=1}^{K} w_i^j = 1; \; w_i^j > 0, \; \forall j \right\},$$

where the weight $w_i$ is sampled from the normal distribution and normalized by the softmax function, $K = 6$, and the action rates $a$ form a 6-dimensional vector [fire, noop, up, down, left, right]; the calculation of action rates is given in Table 8. We generate five random BDs, i.e., the weights form a $5 \times 6$ matrix, and provide additional training in the Atari game FishingDerby. The results are shown in Figure 7. Besides, QSD-PBT checkpoints trained on the generally designed BDs in Section 4.2 are cross-evaluated by random BDs, marked as "CE". From the experiment, the results are consistent with Figure 1

B.3 DIVERSITY RESULTS ON EACH INDIVIDUAL BD

We claim that QSD-PBT can generate user-intended diversity as long as a diversity measure function and a set of user-defined BDs are provided. We illustrate this ability of QSD-PBT in the main text using orientation BDs in Figure 3 . We here provide additional results using the four Atari games for each of the five BDs defined for the Atari experiments. Results in Table 2 and Figure 2 are further decomposed into each individual BD in Figure 9 , where we plot the QSD for a single BD one at a time, i.e., B(π θ ) = [B 1 (π θ )]. After being normalized by scale factors in Table 8 , the five BDs have similar ranges. Overall from Figure 9 , we can conclude that QSD-PBT outperforms previous methods in terms of the QSD performance for each of the five BDs defined for the Atari experiments in most cases. We observe that some user-defined BDs are highly correlated with the quality, e.g., the game time in PvE-style games (MsPacMan and RiverRaid), where QSD-PBT achieves only slight improvements compared with other methods. How to adaptively handle the correlation between a BD and the quality remains an interesting question. Besides, for the left_right BD in the game RiverRaid, the river course is narrow (shown in supplementary videos), and there is no shortcut (as in the game MsPacMan) that connects the left and right areas. Therefore, left and right actions are restricted to have similar numbers, and the resulting diversity scores are small for all the methods in this game.

B.4 ADDITIONAL VISUALIZATIONS

We illustrate the diversity at different quality levels in Figure 8 using the Humanoid-v2 task. In this task, an agent can be rewarded by standing or walking, and walking at faster speeds often leads to higher rewards. In Figure 8

For the MuJoCo videos, we focus on demonstrating the diversity at different quality levels. For the Humanoid task, agents at a high quality level (5000-6000) prefer various walking gaits, e.g., ambling, striding, and mincing. Agents at a low quality level (4000-5000) only learn to stand still and balance in various poses. For the Walker2d task, agents at a high quality level (5000-6000) learn to run or stride, while agents at a low quality level (4000-5000) can only walk with small steps or just hop. For the Hopper task, agents at a high quality level (3000-4000) prefer fast hopping with different poses, while agents at a low quality level only learn to stand or jump with small steps.

Generally speaking, quality and diversity trade off against each other: higher quality tends to result in lower diversity. But as shown in Figure 1, Figure 2 and Figure 5, since environments and BDs are so varied, it is hard to characterize the exact correlation at each quality level, and we need to analyze it case by case. Usually it depends on (1) the correlation of the BDs and the quality, and (2) the exploration space and complexity of the environment.

Here we observe the difference between Figure 1 and Figure 2. In MuJoCo tasks, the QSD curve shows the quality-diversity trade-off, while in Atari games the trade-off disappears. We attribute this to two major reasons: (1) The definition of BDs. For MuJoCo tasks, we define most of the BDs on joint torques, which are highly correlated with the pose control of the robots and the resulting score. For Atari games, in contrast, we define relatively coarse-grained BDs that have lower correlations with the quality, e.g., the left or right preference in Figure 3 does not influence the score since most games are mirror symmetric. (2) The exploration space and complexity of the environment. Compared with MuJoCo tasks, Atari games are more complex and have more exploration space. Take the game RiverRaid for example: at a lower quality level (<3000), the initial river course is narrow and the agent can only go straight. When it reaches certain scores, the agent enters new scenes that have a much broader course. The scene switching provides more exploration and diversity space, resulting in the improvement of diversity along with quality.

C IMPLEMENTATION DETAILS

C.1 DESCRIPTIONS OF THE BDS

For MuJoCo tasks, the BDs are defined on the scoring speed and the built-in joint torques respectively. The joint torques are the actions in MuJoCo tasks. The action spaces for each MuJoCo task are shown in Tables 4-6, and more details can be found here. For a trajectory, we divide its total score by the number of frames used to obtain the BD of Scoring speed. For each joint in a trajectory, we sum the actions, i.e., the torques, applied to it and divide the sum by the number of frames to obtain a joint BD. As a result, the number of BDs for each MuJoCo task is the number of joints in that task plus one, i.e., the Scoring speed. We normalize each BD to the range of [0, 1.0]. An overview of all the BDs for each task in the MuJoCo experiment is presented in Table 7.

| Num | Description (torque applied to different joints) | Range |
| --- | --- | --- |
| 1 | the hinge in the y-coordinate of the abdomen | [-0.4, 0.4] |
| 2 | the hinge in the z-coordinate of the abdomen | |
| 3 | the hinge in the x-coordinate of the abdomen | |
| 4 | the rotor between torso/abdomen and the right hip (x-coordinate) | [-0.4, 0.4] |
| 5 | the rotor between torso/abdomen and the right hip (z-coordinate) | |
| 6 | the rotor between torso/abdomen and the right hip (y-coordinate) | |
| 7 | the rotor between the right hip/thigh and the right shin | |
| 8 | the rotor between torso/abdomen and the left hip (x-coordinate) | |
| 9 | the rotor between torso/abdomen and the left hip (z-coordinate) | |
| 10 | the rotor between torso/abdomen and the left hip (y-coordinate) | |
| 11 | the rotor between the left hip/thigh and the left shin | |
| 12 | the rotor between the torso and right upper arm (coordinate -1) | |
| 13 | the rotor between the torso and right upper arm (coordinate -2) | |
| 14 | the rotor between the right upper arm and right lower arm | |
| 15 | the rotor between the torso and left upper arm (coordinate -1) | |
| 16 | the rotor between the torso and left upper arm (coordinate -2) | |
| 17 | the rotor between the left upper arm and left lower arm | |

For Atari games, we define a common set of BDs for each game. There are 5 different BDs in total. The design of these BDs is motivated by covering the different ways in which diverse and meaningful policies could possibly differ. We normalize each BD to a similar range. An overview of all the BDs in the Atari experiment is presented in Table 8.
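The MuJoCo BD computation described earlier in this subsection (scoring speed as total score per frame, and one BD per joint as the mean applied torque) can be sketched as follows. The function name, the input layout, and the min-max normalization by an assumed torque range are illustrative assumptions; the exact normalization used in the paper is given in Table 7, which is not reproduced here, so the scoring speed is left unnormalized.

```python
def mujoco_bds(rewards, actions, torque_low=-0.4, torque_high=0.4):
    """Compute trajectory-level BDs from per-frame data (hypothetical helper).

    rewards: list of per-frame rewards.
    actions: list of per-frame torque lists, one entry per joint.
    torque_low / torque_high: assumed torque range used for min-max scaling.
    """
    n_frames = len(rewards)
    # Scoring speed: total score divided by the number of frames.
    scoring_speed = sum(rewards) / n_frames
    n_joints = len(actions[0])
    # Joint BD: summed torque on each joint divided by the number of frames.
    mean_torques = [sum(a[j] for a in actions) / n_frames for j in range(n_joints)]
    # Assumption: map each mean torque linearly from [low, high] to [0, 1].
    joint_bds = [(m - torque_low) / (torque_high - torque_low) for m in mean_torques]
    return scoring_speed, joint_bds


speed, joints = mujoco_bds(
    rewards=[1.0, 2.0, 3.0],
    actions=[[0.4, -0.4], [0.0, 0.0], [-0.4, 0.4]],
)
print(speed, joints)
```

The number of returned BDs is the number of joints plus one, matching the count stated in the text.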

C.2 THE DIVERSITY MEASURES

We use the mean pair-wise distance as the diversity measure for the main experimental results. For a population of policies $\Pi$ with size $N$, $\Pi = \{\pi_{\theta_j} \mid 1 \le j \le N\}$, the mean pair-wise distance is defined as:

$$\frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left\| B(\pi_{\theta_i}) - B(\pi_{\theta_j}) \right\|_2,$$

where $B(\pi_\theta)$ is the vector of all the BD values of a policy defined in Section 2. As we stated in the main text, our method QSD-PBT can be applied to any explicit diversity measure as long as the measure is differentiable with respect to $B(\pi_\theta)$. Later in Appendix B.1.2, we present results using another diversity measure, i.e., the determinant of a DPP. The DPP determinant of $\Pi = \{\pi_{\theta_j} \mid 1 \le j \le N\}$ is defined as:

$$\det\left[ K\big(B(\pi_{\theta_i}), B(\pi_{\theta_j})\big)_{i,j=1}^{N} \right],$$

where $K$ is a given kernel function. We set $K = \exp\left[ -\frac{\| B(\pi_{\theta_i}) - B(\pi_{\theta_j}) \|_1}{2} \right]$ in our case.
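The two diversity measures in this subsection can be sketched in plain Python as follows. This is an illustrative, non-differentiable evaluation only (training would need an autodiff framework), the function names are hypothetical, and the kernel follows our reading of the formula above as exp(-||u - v||_1 / 2), which is an assumption about the garbled original.

```python
import math
from itertools import combinations


def mean_pairwise_distance(bds):
    """Mean Euclidean distance over all unordered pairs of BD vectors (needs >= 2)."""
    pairs = list(combinations(bds, 2))
    return sum(math.dist(u, v) for u, v in pairs) / len(pairs)


def l1_kernel(u, v):
    """Assumed kernel: exp(-||u - v||_1 / 2)."""
    return math.exp(-sum(abs(a - b) for a, b in zip(u, v)) / 2.0)


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def dpp_determinant(bds, kernel=l1_kernel):
    """Determinant of the kernel (Gram) matrix built from the BD vectors."""
    gram = [[kernel(u, v) for v in bds] for u in bds]
    return determinant(gram)


distinct = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
with_duplicate = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
print(mean_pairwise_distance(distinct), dpp_determinant(distinct))
print(mean_pairwise_distance(with_duplicate), dpp_determinant(with_duplicate))
```

The second population contains two identical BD vectors, so its kernel matrix has two equal rows and the DPP determinant vanishes, while the mean pair-wise distance stays positive.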



Figure 1: The quality similar diversity across 10 quality intervals on MuJoCo tasks.

Figure 2: The quality similar diversity across 10 quality intervals on Atari games.

Figure 4: Comparison of three diversity training settings: the fixed diversity loss (fixed), the decayed diversity loss (decayed), and the adaptive diversity loss with quality constraint (adaptive). Left: typical training curves of the qualities of a population of 10 agents with the fixed setting. Middle: training curves of the qualities of population of 10 agents for the three settings. Right: the QSD performance for the three settings.

Figure 5: The final QSD scores and the time overheads under different settings of decay rate t 0 and initial λ 0 .

Figure 7: The quality similar diversity across 10 quality intervals on random BDs, "CE" means cross-evaluation on these BDs.

(a), agents at a lower quality level (4000-5000) learn to stand and balance in various poses, e.g., with open or closed legs and different hand gestures. In contrast, agents at a higher quality level (5000-6000) prefer various walking gaits, including ambling, striding, and mincing, which is demonstrated in Figure 8(b).

Figure 8: Visualization of diversity at different quality levels on Humanoid-v2.

Figure 9: Detailed diversity results for individual BDs in Atari games

(driven only by the quality), two QD-style methods: QD-PG (Pierrot et al., 2022) (diversity gradient by 'state' BDs) and EDO-CS (Wang et al., 2021) (diversity gradient by evolution strategies), and two population-based RL algorithms: PBT (Jaderberg et al., 2017) (driven only by the quality) and DvD (Parker-Holder et al., 2020) (driven by both the quality and task-agnostic diversity).

For each environment, the range of quality is first estimated by training a state-of-the-art quality-driven method (TD3 for MuJoCo tasks and PPO for Atari games) and then partitioned into M = 10 disjoint intervals of equal width. The highest quality R_max achieved by TD3 or PPO is considered as the 'optimal' quality and shared by all the compared methods. To calculate the QSD score for each method, we save a number of candidate policies within each quality interval during training. The time overhead for each method is estimated by the number of training steps until the average quality of the population reaches a near-optimal quality (0.9 R_max). Error bars plotted in the figures or standard deviations presented in the tables are obtained using 5 independent runs. The population size N is 8 for MuJoCo experiments and 10 for Atari experiments. Other implementation details are presented in Appendix C. Additional results that demonstrate the versatility of QSD-PBT for different user-specified BDs and diversity measure functions are included in Appendix B.3 and B.1.2.

Table 1: The QSD scores on MuJoCo tasks. #step denotes the time overhead of each method.

Table 1 shows each method's QSD score, where we also report the QSD score calculated using only intervals with high quality (above 60% of R_max). PBT gets the lowest 3 out of 6 QSD scores, indicating that methods considering only quality result in considerable degradation of diversity. QD-PG gets trapped in terms of quality, which results in a low QSD score. EDO-CS performs slightly better than QD-PG, which is consistent with the experimental results in Wang et al. (2021). QD-PG and EDO-CS are much more time-consuming than other RL-style methods. The performance of DvD-TD3 (Parker-Holder et al., 2020) comes second, and QSD-PBT consistently achieves the highest QSD scores. Similar conclusions can be arrived at in Figure

Table 2: The QSD scores on Atari games. #step denotes the time overhead of each method.

and Figure 2. QSD-PBT CE is trained with the BD fire rate and the diversity of DvD-PPO is defined by action pairs; therefore both of them favor certain random BDs and perform slightly better than PPO and PBT. Comparing QSD-PBT and QSD-PBT CE, we conclude that our algorithm is capable of "what you will get is what you have defined" since it allows for the direct computation of the diversity gradient.
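The randomly weighted action-rate BDs used for this comparison (normally distributed weights normalized by a softmax over the six action rates [fire, noop, up, down, left, right]) can be sketched as follows. The function names, the seed, and the standard-normal parameters are assumptions for illustration; the action rates themselves are taken as given inputs because their exact calculation is specified in Table 8.

```python
import math
import random

ACTIONS = ["fire", "noop", "up", "down", "left", "right"]


def sample_bd_weights(rng, k=len(ACTIONS)):
    """Draw k standard-normal logits and normalize them with a softmax."""
    logits = [rng.gauss(0.0, 1.0) for _ in range(k)]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]  # positive weights that sum to 1


def random_bd(weights, action_rates):
    """Randomly weighted action rate: the inner product of weights and rates."""
    return sum(w * a for w, a in zip(weights, action_rates))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weight_matrix = [sample_bd_weights(rng) for _ in range(5)]  # five random BDs: 5 x 6

# Hypothetical action rates of one trajectory, in the order of ACTIONS.
rates = [0.30, 0.10, 0.20, 0.20, 0.10, 0.10]
bd_values = [random_bd(w, rates) for w in weight_matrix]
print(bd_values)
```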

Action space for Hopper.

Action space for Humanoid.

Action space for Walker2d.

Table 7: An overview of all BDs for each MuJoCo task. #frame is the total number of frames in a trajectory.

Table 8: An overview of all BDs for Atari games. # indicates the total number of frames, action changes, or specific actions in a trajectory.

ACKNOWLEDGEMENT

We thank Ke Xue (from Nanjing University) and Peng Yang (from Southern University of Science and Technology) for their helpful discussions. Peng Yang and Chao Qian are sponsored by the CCF-Tencent Open Research Fund (CCF-Tencent RAGR20220110). Yaodong Yang is sponsored by the CCF-Tencent Open Research Fund (CCF-Tencent RAGR20220109). We are grateful to the anonymous reviewers for their insightful feedback.

B.1.2 OTHER DIVERSITY MEASURE

In this section, we provide more analysis of the measure function. Our algorithm needs a measure function to quantify the diversity among the policies so that it can be trained to improve that diversity. In the main experiment, we choose the MSE function as the measure function. In theory, our method can be applied to any measure function as long as it is differentiable, and we provide one experiment as an example below. However, the choice of the measure function does affect the performance of the algorithm, which we discuss later.

We claim that QSD-PBT can be applied to any diversity measure as long as the diversity measure is differentiable with respect to the BDs. We here provide additional experimental results using a different (from the main experiments) diversity measure, i.e., the determinant of DPP, using the MuJoCo task Hopper. From the results presented in Table 3, we can conclude that QSD-PBT still outperforms other methods when the diversity measure is the determinant of DPP. The results are consistent with those in Table 1, where the pair-wise distance is employed as the diversity measure. One reason is that QSD-PBT optimizes the user-defined diversity objective directly, while the other methods optimize a task-agnostic diversity measure (PBT, DvD) or quality within different cells of the BD space (EDO-CS).

Besides, QSD scores within intervals above 60% of the maximum qualities in Table 3 drop faster compared to the results on the pair-wise distance diversity measure in Table 1. The reason is that the determinant of DPP is much more sensitive to changes in BDs than the mean pair-wise distance measure. Therefore, we recommend the mean pair-wise Euclidean distance as the first choice for users and for the experiments in this paper.

The choice of measure function will have an impact on the training overheads from two aspects: (1) The measure function itself may have high computational complexity, e.g., $O(n^3)$ in DPP, where $n$ is the population size. However, $n$ is small in practice (8 in MuJoCo and 10 in Atari) and the complexity is negligible compared to neural network models. For example, when we increase the population size $n$ from 10 to 100, we do not observe any decrease of GPU sample speed during training. (2) Different measure functions will affect the convergence of QSD-PBT and hence the computational resources required. In Figure 6, we provide training curves of the MSE and DPP measure functions in MuJoCo tasks. From the figure, we can see that the convergence rate of DPP is slightly slower than that of MSE. The reason is that MSE calculates the pairwise distance of BDs among all policies in the population and averages them with equal weights, while the DPP is more sensitive to the most similar policies and is unstable. For example, the DPP determinant of a population that contains two identical policies is zero, regardless of how diverse the other policies in the population are.

B.2 RESULTS ON RANDOMLY DESIGNED BDS

Since the trajectory BD can be technically anything, the design principles in this paper are focused on generality and practicality. The BDs in previous experiments include various forms, e.g., action BD (MuJoCo joint torques), state BD (Atari game time), and trajectory BD (Atari left/right preference). However, we find that particular choices of BD still favor certain algorithms. For example, a BD defined on actions favors the DvD algorithm since it directly optimizes the KL distance of agents' actions. Hence the improvement of DvD over PBT is more significant in the MuJoCo results (Figure 1) than in the Atari results (Figure 2). Instead of defining meaningful and explicit BDs, we further investigate the performance of QSD-PBT on implicit BDs that would less favor certain algorithms. Since we can

C.3 THE CALCULATION OF THE QSD SCORE

In practice, the number of policies obtained throughout training with qualities lying in the same interval can be much larger than N , which is the population size of Π used for evaluating the diversity measure Div(Π). To make an efficient and fair comparison across different methods in terms of the QSD score, for each quality interval we sample a population Π of size N for 100 times. The way we sample Π is by alternating the training index of the policy till N policies are obtained, because policies with the same training index tend to have similar BD values. The training index of a policy denotes which agent in the population the policy comes from during population-based training. We calculate the diversity measure for each sampled Π and use the sample mean of the 100 evaluations as the diversity measure for the corresponding quality interval, which is then aggregated according to Equation 1 to produce the QSD score.
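The per-interval evaluation described above can be sketched as follows. The round-robin interpretation of "alternating the training index", the data layout (a dictionary from training index to the BD vectors of the policies saved in one quality interval), and all names are assumptions; the final aggregation over intervals via Equation 1 is not shown because that equation is not reproduced in this appendix.

```python
import math
import random
from itertools import combinations


def mean_pairwise_distance(bds):
    """Mean Euclidean distance over all unordered pairs (needs >= 2 vectors)."""
    pairs = list(combinations(bds, 2))
    return sum(math.dist(u, v) for u, v in pairs) / len(pairs)


def sample_population(policies_by_index, n, rng):
    """Pick up to n policies by cycling over training indices (assumed reading)."""
    # Shuffled copy of each non-empty pool, so the inputs are not mutated.
    pools = {idx: rng.sample(p, len(p)) for idx, p in policies_by_index.items() if p}
    order = list(pools)
    rng.shuffle(order)
    population = []
    while len(population) < n and any(pools.values()):
        for idx in order:
            if pools[idx] and len(population) < n:
                population.append(pools[idx].pop())
    return population


def interval_diversity(policies_by_index, n, n_samples=100, seed=0):
    """Sample mean of the diversity measure over n_samples sampled populations."""
    rng = random.Random(seed)
    scores = [
        mean_pairwise_distance(sample_population(policies_by_index, n, rng))
        for _ in range(n_samples)
    ]
    return sum(scores) / n_samples


# Hypothetical 1-D BD values saved from three agents in one quality interval.
saved = {0: [[0.0], [0.1]], 1: [[1.0]], 2: [[0.5], [0.6]]}
print(interval_diversity(saved, n=3))
```

Cycling over indices before drawing a second policy from any one agent reflects the stated motivation that policies with the same training index tend to have similar BD values.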

C.4 HYPERPARAMETERS AND DETAILS OF THE COMPARED BASELINE METHODS

For MuJoCo tasks, we compare QSD-PBT to three population-based training algorithms: EDO-CS, PBT, and DvD. EDO-CS uses the ES for optimizing the quality and a selection mechanism to induce diversity among the defined BDs. PBT uses TD3 as the backbone and mostly focuses on the quality of each policy in the population. The Perturb exploration and Truncation selection exploitation strategies are adopted every 10k training steps for our setting of PBT. The DvD-TD3 in the original paper is implemented. DvD-TD3 optimizes a combined loss of quality and a task-agnostic diversity measure defined on state-action probabilities. All the hyperparameters for each method are listed in Table 9, except for EDO-CS and QD-PG, which we implement with the architecture and hyperparameters suggested in their papers. For QSD-PBT, we use the LSTM as a feature extractor in BD estimators. We pass the trajectory states and trajectory actions to an LSTM, respectively, then concatenate the features and pass them to the MLP. The 'state' BD we used for reproducing QD-PG is the action (a real-valued vector, which has a dimension of 17, 3, and 6 respectively in the three MuJoCo tasks) in a state. We believe this is a fair setting for QD-PG, because the trajectory BD we defined here is the average torques (i.e., actions) applied to the hinge joints over a trajectory.

For Atari games, we replace the baseline method EDO-CS with PPO for training efficiency, where a population of N independent PPO agents is trained. As Atari games have discrete actions and high-dimensional image inputs, all the methods employ the DQN (Mnih et al., 2015) architecture as the backbone model. We find it effective to provide statistics from the beginning of a game to the current state (the number of frames, the number of fire, left, right, up, down actions and the number of action changes) as a sufficient encoding of the trajectory $\tau_{0:j}$. Instead of the LSTM applied in MuJoCo experiments, these additional features are applied in the DQN model to better estimate the BDs defined in Table 8. Accordingly, we use PPO in the implementation of both DvD and PBT. All the hyperparameters for each method are listed in Table 10.
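The trajectory statistics used as the Atari encoding in this subsection (frame count, per-action counts, and the number of action changes from the start of the game to the current step) can be sketched as follows. The function name, the string action labels, and the assumption of exactly one primitive action per frame are illustrative; combined actions and the way these features enter the DQN model are not covered.

```python
def trajectory_statistics(actions):
    """Summarize the actions taken so far (hypothetical helper).

    actions: list of action-name strings, one per frame, from the start of
    the game to the current state.
    """
    tracked = ("fire", "left", "right", "up", "down")
    stats = {"frames": len(actions), "action_changes": 0}
    stats.update({name: 0 for name in tracked})
    previous = None
    for action in actions:
        if action in tracked:
            stats[action] += 1  # count of each tracked action
        if previous is not None and action != previous:
            stats["action_changes"] += 1  # action differs from the previous frame
        previous = action
    return stats


stats = trajectory_statistics(["noop", "fire", "fire", "left", "up"])
print(stats)
```

Because each statistic is a running count, it can be updated incrementally at every step instead of being recomputed from the full history.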

C.5 PSEUDOCODE OF QSD-PBT

The developed QSD-PBT is in general a population-based RL algorithm. In QSD-PBT, we employ parallel actors and learners to simultaneously train a population of $N$ agents. The policy of each agent is trained using the gradient of the loss function defined in Equation 9. Meanwhile, the policy value functions or Q functions are trained using the mean squared error, and so are the state BD estimators or the state-action BD estimators of each agent. Moreover, QSD-PBT maintains a running average estimation of each agent's quality, each agent's BDs, and the population's mean quality. The pseudocode of QSD-PBT is given in Algorithm 1. Note that, without loss of clarity, we present QSD-PBT with TD3 and QSD-PBT with PPO in one algorithm.

by $B(\pi_{\theta_n})$.
Calculate diversity loss according to Equations 6, 8.
Update critic parameters $\theta_n$ using gradients on value loss and BD estimator loss.
Update actor parameters $\theta_n$ using gradients on $L_{total}(\pi_{\theta_n})$ as in Equation 9.

Published as a conference paper at ICLR 2023

