ITERATIVELY LEARNING NOVEL STRATEGIES WITH DIVERSITY MEASURED IN STATE DISTANCES

Abstract

In complex reinforcement learning (RL) problems, policies with similar rewards may have substantially different behaviors. Yet it remains a challenging problem to not only optimize rewards but also discover as many diverse strategies as possible. A natural approach to this task is constrained population-based training (PBT), which simultaneously learns a collection of policies subject to diversity constraints. However, due to the prohibitive computation cost of PBT, we adopt an alternative approach, iterative learning (IL), which repeatedly learns a single novel policy that is sufficiently different from previous ones. We first analyze these two frameworks and prove that, for any policy pool derived by PBT, we can always use IL to obtain another policy pool with the same rewards and competitive diversity scores. In addition, we present a novel state-based diversity measure with two tractable realizations. Such a metric can impose a stronger and much smoother diversity constraint than existing action-based metrics. Combining IL and the state-based diversity measure, we develop a powerful diversity-driven RL algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), with provable convergence properties. We empirically examine our algorithm in complex multi-agent environments, including the StarCraft Multi-Agent Challenge and Google Research Football. In these environments, SIPO consistently derives strategically diverse and human-interpretable policies that cannot be discovered by existing baselines.

1. INTRODUCTION

A consensus in deep learning (DL) is that most local optima have losses similar to the global optimum (Venturi et al., 2018; Roughgarden, 2020; Ma, 2021). Hence, when training via stochastic gradient descent (SGD), most DL works focus only on the final performance of the learned model without considering which local optimum SGD discovers. However, such a performance-oriented paradigm can be problematic for reinforcement learning (RL), because it is typical in complex RL problems that policies with the same reward have substantially different behaviors. For example, a high-reward agent in a boat-driving game can either carefully drive the boat or keep turning around to exploit an environment bug (Clark & Amodei, 2016); a humanoid football AI can adopt various dribbling or shooting behaviors to score a goal (Liu et al., 2022); a strong StarCraft AI can take very distinct construction and attacking strategies (Vinyals et al., 2019). Thus, it is a fundamental problem for an RL algorithm to not only optimize rewards but also discover as many diverse strategies as possible.

In order to obtain diverse RL strategies, we can naturally extend single-policy learning to population-based training (PBT). The problem can be formulated as a constrained optimization problem that simultaneously learns a collection of policies subject to policy diversity constraints (Parker-Holder et al., 2020b; Lupu et al., 2021). However, since multiple policies are jointly optimized, PBT can be computationally challenging (Omidshafiei et al., 2020). Therefore, a greedy alternative is iterative learning (IL), which repeatedly learns a single novel policy that is sufficiently different from previous ones (Masood & Doshi-Velez, 2019; Zhou et al., 2022). Since only one policy is learned per iteration, IL largely simplifies optimization. However, there have not been any theoretical guarantees on the performance or convergence properties of IL methods.
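To make the structural contrast concrete, the following toy sketch runs an IL-style loop: each iteration produces one new policy constrained to stay diverse from the already-fixed earlier ones, whereas PBT would optimize all policies jointly under pairwise constraints. Here "policies" are 2-D parameter vectors, and `reward`, `distance`, and `train_one_policy` are illustrative stand-ins we introduce for exposition, not components of any algorithm in the paper.

```python
import numpy as np

def reward(theta):
    # Toy reward with two equally good optima at (1, 0) and (-1, 0).
    return -min(np.sum((theta - np.array([1.0, 0.0]))**2),
                np.sum((theta - np.array([-1.0, 0.0]))**2))

def distance(a, b):
    # Stand-in diversity measure between two "policies".
    return float(np.linalg.norm(a - b))

def train_one_policy(prev_policies, delta, candidates):
    # Greedy stand-in for an RL inner loop: pick the best-reward candidate
    # that is at least `delta` away from every previously learned policy.
    feasible = [c for c in candidates
                if all(distance(c, p) >= delta for p in prev_policies)]
    return max(feasible, key=reward)

rng = np.random.default_rng(0)
candidates = [rng.uniform(-2.0, 2.0, size=2) for _ in range(500)]
pool = []
for _ in range(2):  # IL: one new policy per iteration, earlier ones frozen
    pool.append(train_one_policy(pool, delta=1.0, candidates=candidates))
# The two pooled policies end up near the two distinct optima.
```

Because earlier policies are frozen, iteration i faces only i-1 diversity constraints instead of the K(K-1)/2 pairwise constraints that PBT imposes on a population of size K.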
In addition to the computation frameworks, how to quantitatively measure the difference (i.e., diversity) between two policies remains an open question as well. Mutual information (MI) is perhaps the most popular diversity measure (Eysenbach et al., 2019). Although MI reveals great potential for discovering diverse locomotion skills, it is proved in Eysenbach et al. (2022) that maximizing MI will not recover the set of optimal policies w.r.t. the environment reward. Therefore, MI-based methods often serve as a pre-training phase for downstream tasks (Sharma et al., 2020; Campos et al., 2020). Another category of diversity measures is based on action distributions, such as Wasserstein distance (Sun et al., 2020), cross-entropy (Zhou et al., 2022), and Jensen-Shannon divergence (Lupu et al., 2021). Action-based measures are straightforward to evaluate and optimize. However, we will show in Sec. 4.2 that such metrics can completely fail even in simple scenarios.

In this paper, we present comprehensive studies to address the two issues above. First, we provide an in-depth analysis of the two computation frameworks, namely PBT and IL, for learning diverse strategies. We theoretically prove that, in addition to simplified optimization thanks to fewer constraints, IL can discover solutions with the same reward as PBT and at least half of the diversity score. Regarding the diversity measure, we consider two concrete scenarios, i.e., grid-world navigation and Google Research Football (GRF). In the grid-world example, we construct visually different strategies that cannot be distinguished by popular action-based diversity measures. In the GRF example, we show that duplicated actions taken by an idle player can drastically influence the action-based diversity score. Consequently, we argue that an effective diversity measure should focus on state distances instead of action distributions.
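The grid-world failure mode of action-based measures can be reproduced in a few lines. Below, two deterministic policies visit clearly different states, yet their empirical action distributions are identical, so any action-distribution divergence is zero. The state-based score is one illustrative RBF-kernel realization we write for this sketch; the paper's exact formulation may differ in detail.

```python
import numpy as np
from collections import Counter

# Two deterministic grid-world policies starting at (0, 0): policy A goes
# right twice then up twice; policy B goes up twice then right twice.
MOVES = {"R": (1, 0), "U": (0, 1)}

def rollout(actions):
    s, traj = np.zeros(2), [np.zeros(2)]
    for a in actions:
        s = s + np.array(MOVES[a], dtype=float)
        traj.append(s.copy())
    return np.array(traj)

traj_a = rollout(["R", "R", "U", "U"])
traj_b = rollout(["U", "U", "R", "R"])

# Action-based "diversity": total variation between empirical action counts.
def action_tv(acts1, acts2):
    c1, c2 = Counter(acts1), Counter(acts2)
    n1, n2 = len(acts1), len(acts2)
    return 0.5 * sum(abs(c1[k] / n1 - c2[k] / n2) for k in set(c1) | set(c2))

# State-based diversity: mean RBF dissimilarity between time-aligned states.
def state_rbf_distance(t1, t2, sigma=1.0):
    sq = np.sum((t1 - t2)**2, axis=1)
    return float(np.mean(1.0 - np.exp(-sq / (2 * sigma**2))))

print(action_tv(list("RRUU"), list("UURR")))  # 0.0: policies look identical
print(state_rbf_distance(traj_a, traj_b))     # > 0: trajectories clearly differ
```

Both policies use action R twice and U twice, so any measure built on marginal action distributions collapses to zero, while the state trajectories diverge at every intermediate step.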
Combining IL and a state-based diversity measure, we design a generic and effective algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), for discovering diverse RL strategies in an iterative fashion. In each iteration, SIPO learns a single novel policy under state-based diversity constraints w.r.t. the policies learned in previous iterations. We solve this constrained optimization problem via the Lagrangian method and two-timescale gradient descent ascent (GDA) (Lin et al., 2020). Theoretical results show that our algorithm is guaranteed to converge to a neighborhood of an ϵ-approximate KKT point (Dutta et al., 2013). Regarding the state-based measure, we provide two practical realizations: a straightforward version based on the RBF kernel and a more general learning-based variant using the Wasserstein distance.

We validate the effectiveness of our algorithm in two challenging multi-agent environments: the StarCraft Multi-Agent Challenge (Samvelyan et al., 2019) and Google Research Football (Kurach et al., 2020). Specifically, without any domain knowledge, our algorithm successfully discovers 6 distinct human-interpretable strategies in the GRF 3-vs-1 scenario and 4 strategies in each of two 11-player GRF scenarios, namely counter-attack and corner, which are substantially more than existing baselines.
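The Lagrangian plus two-timescale GDA recipe can be sketched on a one-dimensional toy problem. A scalar "policy" theta maximizes a toy reward subject to a diversity constraint d(theta) >= delta against the previously learned policy, via the Lagrangian L(theta, lam) = r(theta) + lam * (d(theta) - delta) with lam >= 0: theta ascends L on a slow timescale while the multiplier is updated faster with a projection onto lam >= 0. The reward, diversity function, step sizes, and gradient clipping here are all illustrative choices for this sketch, not the paper's exact setup.

```python
import numpy as np

theta_prev = 1.0                 # policy found in the previous iteration
delta = 1.0                      # required diversity level

def r(theta):                    # toy reward with two optima at theta = +-1
    return -(theta**2 - 1.0)**2

def grad_r(theta):
    return -4.0 * theta * (theta**2 - 1.0)

def d(theta):                    # squared distance to the previous policy
    return (theta - theta_prev)**2

def grad_d(theta):
    return 2.0 * (theta - theta_prev)

theta, lam = 0.5, 0.0            # initialize near the already-found optimum
eta_theta, eta_lam = 0.01, 0.1   # two timescales: lam moves faster
for _ in range(5000):
    g = grad_r(theta) + lam * grad_d(theta)
    theta += eta_theta * float(np.clip(g, -1.0, 1.0))   # clipped ascent on L
    lam = max(0.0, lam - eta_lam * (d(theta) - delta))  # projected multiplier step

# While the constraint is violated, lam grows and pushes theta away from the
# previous optimum at +1; theta settles near the other optimum at -1, where
# d(theta) >= delta holds and lam decays back to zero.
```

The same mechanism drives each SIPO iteration at scale: the intrinsic diversity reward lam * d is automatically annealed away once the new policy is sufficiently different from its predecessors.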

2. RELATED WORK

Discovering diverse solutions has been a long-established problem (Miller & Shaw, 1996; Deb & Saha, 2010; Lee et al., 2022) with a wide range of applications in robotic control (Cully et al., 2015; Kumar et al., 2020), dialogue systems (Li et al., 2016), game AI (Vinyals et al., 2019; Lupu et al., 2021), design (Gupta et al., 2021), and emergent behaviors (Liu et al., 2019; Baker et al., 2020; Tang et al., 2021). Early works are primarily based on the setting of multi-objective optimization (Mouret & Clune, 2015; Pugh et al., 2016; Ma et al., 2020; Nilsson & Cully, 2021; Pierrot et al., 2022), which assumes a set of reward functions is given in advance. In RL, this is also related to reward shaping (Ng et al., 1999; Babes et al., 2008; Devlin & Kudenko, 2011; Tang et al., 2021). By contrast, we consider learning diverse policies without any domain knowledge.

Population-based training (PBT) is the most popular framework for producing diverse solutions by jointly learning separate policies. Representative algorithms include evolutionary computation (Wang et al., 2019; Long et al., 2020; Parker-Holder et al., 2020b), league training (Vinyals et al., 2019; Jaderberg et al., 2019), computing the Hessian matrix (Parker-Holder et al., 2020a), and constrained optimization with a diversity measure over the policy population (Lupu et al., 2021; Zhao et al., 2021; Li et al., 2021; Liu et al., 2021b). An improvement over PBT is to learn a single latent-variable policy instead of separate ones to improve sample efficiency. Prior works have incorporated different kinds of domain knowledge to design the latent code, such as action clustering (Wang et al., 2021), agent identities (Li et al., 2021), or prosocial level (Peysakhovich & Lerer, 2018; Baker et al., 2020). The latent variable can also be learned in an unsupervised fashion. DIAYN (Eysenbach et al., 2019) and its variants (Kumar et al., 2020; Osa et al., 2022) learn latent-conditioned policies by maxi-

