QUALITY-SIMILAR DIVERSITY VIA POPULATION BASED REINFORCEMENT LEARNING

Abstract

Diversity is a growing research topic in Reinforcement Learning (RL). Previous research on diversity has mainly focused on promoting diversity to encourage exploration and thereby improve quality (the cumulative reward), maximizing diversity subject to quality constraints, or jointly maximizing quality and diversity, known as the quality-diversity problem. In this work, we present the quality-similar diversity problem, which features diversity among policies of similar qualities. In contrast to task-agnostic diversity, we focus on task-specific diversity defined by a set of user-specified Behavior Descriptors (BDs). A BD is a scalar function of a trajectory (e.g., the fire action rate for an Atari game), which captures the type of diversity the user prefers. To derive the gradient of the user-specified diversity with respect to a policy, which is not trivially available, we introduce a set of BD estimators and connect them with the classical policy gradient theorem. Based on the diversity gradient, we develop a population-based RL algorithm to adaptively and efficiently optimize the population diversity at multiple quality levels throughout training. Extensive results on MuJoCo and Atari demonstrate that our algorithm significantly outperforms previous methods in generating user-specified diverse policies across different quality levels (see Atari and MuJoCo videos).

1. INTRODUCTION

Existing research on policy diversity in deep Reinforcement Learning (RL) can be generally divided into three categories, according to the role diversity plays. The first category (Hong et al., 2018; Eysenbach et al., 2018; Conti et al., 2018; Parker-Holder et al., 2020; Kumar et al., 2020; Peng et al., 2020; Tang et al., 2020; Han & Sung, 2021; Chenghao et al., 2021; McKee et al., 2022) focuses on maximizing the final quality (the cumulative reward) of a policy, and policy diversity serves only as a means to better fulfill this goal by improving the efficiency of exploration. Therefore, the diversity measure is preferred to be task-agnostic, as knowledge of what type of task-specific diversity benefits the quality may not be accessible in most cases. The second category (Masood & Doshi-Velez, 2019; Zhang et al., 2019; Sun et al., 2020; Ghasemi et al., 2021; Zahavy et al., 2021; Zhou et al., 2022) is concerned with constrained optimization problems, where either diversity is optimized subject to quality constraints or vice versa. Again, existing methods in this category have mainly focused on task-agnostic diversity, and thus the obtained diversity is often explained in hindsight, i.e., it is unknown what type of policy diversity to expect until the optimization is finished. The third category optimizes quality and diversity simultaneously, which is usually known as the Quality-Diversity (QD) method (Cully et al., 2015; Mouret & Clune, 2015; Pugh et al., 2016; Colas et al., 2020; Fontaine & Nikolaidis, 2021; Nilsson & Cully, 2021; Pierrot et al., 2022; Wang et al., 2021; Tjanaka et al., 2022). In contrast to task-agnostic diversity, most QD methods focus on task-specific diversity, where users are allowed to specify a set of Behavior Descriptors (BDs) of interest. A BD is a scalar function of a trajectory (i.e., the whole game episode) and thus does not have an analytical function form with respect to a single policy or state.
Therefore, the gradient of a BD with respect to a policy is not trivially available, and this extends to the diversity measure.
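To make the notion of a BD concrete, the following minimal sketch (not from the paper; the trajectory format and the FIRE action id are assumptions for illustration) computes the fire action rate mentioned above. Note that the BD is a function of the entire episode, so it has no closed form with respect to any single state or the policy parameters, which is why its gradient must be estimated rather than differentiated directly.

```python
# Hypothetical BD example: "fire action rate" for an Atari-style game.
# A BD maps a whole trajectory to a scalar; here it is the fraction of
# steps in an episode on which the agent chose the FIRE action.

FIRE = 1  # assumed action id for FIRE in this illustrative action set

def fire_action_rate(trajectory):
    """BD: fraction of actions in the episode equal to FIRE.

    `trajectory` is a list of (state, action, reward) tuples covering one
    full episode. The value depends on the episode as a whole, not on any
    single state-action pair.
    """
    actions = [action for (_, action, _) in trajectory]
    return sum(a == FIRE for a in actions) / len(actions)

# Toy 4-step episode with 2 FIRE actions -> BD value of 0.5
episode = [(None, 1, 0.0), (None, 0, 1.0), (None, 1, 0.0), (None, 2, 0.0)]
print(fire_action_rate(episode))  # 0.5
```

Because such a BD is a black-box functional of the trajectory distribution induced by the policy, its gradient can only be estimated from sampled episodes, which is where the connection to the classical policy gradient theorem becomes necessary.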

