ITERATIVELY LEARNING NOVEL STRATEGIES WITH DIVERSITY MEASURED IN STATE DISTANCES

Abstract

In complex reinforcement learning (RL) problems, policies with similar rewards may have substantially different behaviors. Yet, to not only optimize rewards but also discover as many diverse strategies as possible remains a challenging problem. A natural approach to this task is constrained population-based training (PBT), which simultaneously learns a collection of policies subject to diversity constraints. However, due to the unaffordable computation cost of PBT, we adopt an alternative approach, iterative learning (IL), which repeatedly learns a single novel policy that is sufficiently different from previous ones. We first analyze these two frameworks and prove that, for any policy pool derived by PBT, we can always use IL to obtain another policy pool of the same rewards and competitive diversity scores. In addition, we also present a novel state-based diversity measure with two tractable realizations. Such a metric can impose a stronger and much smoother diversity constraint than existing action-based metrics. Combining IL and the state-based diversity measure, we develop a powerful diversity-driven RL algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), with provable convergence properties. We empirically examine our algorithm in complex multi-agent environments including StarCraft Multi-Agent Challenge and Google Research Football. In these environments, SIPO is able to consistently derive strategically diverse and human-interpretable policies that cannot be discovered by existing baselines.

1. INTRODUCTION

A consensus in deep learning (DL) is that most local optima have similar losses to the global optimum (Venturi et al., 2018; Roughgarden, 2020; Ma, 2021). Hence, via stochastic gradient descent (SGD), most DL works only focus on the final performance of the learned model without considering which local optimum SGD discovers. However, such a performance-oriented paradigm can be problematic for reinforcement learning (RL) because it is typical in complex RL problems that policies with the same reward may have substantially different behaviors. For example, a high-reward agent in a boat-driving game can either carefully drive the boat or keep turning around to exploit an environment bug (Clark & Amodei, 2016); a humanoid football AI can adopt any dribbling or shooting behaviors to score a goal (Liu et al., 2022); a strong StarCraft AI can take very distinct construction and attacking strategies (Vinyals et al., 2019). Thus, it is a fundamental problem for an RL algorithm to not only optimize rewards but also discover as many diverse strategies as possible.

In order to obtain diverse RL strategies, we can naturally extend single-policy learning to population-based training (PBT). The problem can be formulated as a constrained optimization problem by simultaneously learning a collection of policies subject to policy diversity constraints (Parker-Holder et al., 2020b; Lupu et al., 2021). However, since multiple policies are jointly optimized, PBT can be computationally challenging (Omidshafiei et al., 2020). Therefore, a greedy alternative is iterative learning (IL), which iteratively learns a single novel policy that is sufficiently different from previous ones (Masood & Doshi-Velez, 2019; Zhou et al., 2022). Since only one policy is learned per iteration, IL can largely simplify optimization. However, there have not been any theoretical guarantees on the performance or the convergence properties of IL methods.
In addition to the computation frameworks, how to quantitatively measure the difference (i.e., diversity) between two policies remains an open question as well. Mutual information (MI) is perhaps the most popular diversity measure (Eysenbach et al., 2019). Although MI reveals great potential to discover diverse locomotion skills, it is proved in Eysenbach et al. (2022) that maximizing MI will not recover the set of optimal policies w.r.t. the environment reward. Therefore, MI-based methods often serve as a pre-training phase for downstream tasks (Sharma et al., 2020; Campos et al., 2020). Another category of diversity measure is based on the action distributions, such as Wasserstein distance (Sun et al., 2020), cross-entropy (Zhou et al., 2022), and Jensen-Shannon divergence (Lupu et al., 2021). Action-based measures are straightforward to evaluate and optimize. However, we will show in Sec. 4.2 that such a metric can completely fail in simple scenarios.

In this paper, we present comprehensive studies to address the two issues above. First, we provide an in-depth analysis of the two computation frameworks, namely PBT and IL, for learning diverse strategies. We theoretically prove that, in addition to simplified optimization thanks to fewer constraints, IL can discover solutions with the same reward as PBT with at least half of the diversity score. Regarding the diversity measure, we consider two concrete scenarios, i.e., grid-world navigation and Google Research Football (GRF). In the grid-world example, we construct visually different strategies that cannot be distinguished by popular action-based diversity measures. In the GRF example, we show that duplicated actions taken by an idle player can drastically influence the action-based diversity score. Consequently, we argue that an effective diversity measure should focus on state distances instead of action distributions.
Combining IL and a state-based diversity measure, we design a generic and effective algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), for discovering diverse RL strategies in an iterative fashion. In each iteration, SIPO learns a single novel policy with state-based diversity constraints w.r.t. policies learned in previous iterations. We further solve this constrained optimization problem via the Lagrangian method and two-timescale gradient descent ascent (GDA) (Lin et al., 2020). Theoretical results show that our algorithm is guaranteed to converge to a neighborhood of an ϵ-approximate KKT point (Dutta et al., 2013). Regarding the state-based measure, we provide two practical realizations, including a straightforward version based on the RBF kernel and a more general learning-based variant using Wasserstein distance.

We validate the effectiveness of our algorithm in two challenging multi-agent environments: StarCraft Multi-Agent Challenge (Samvelyan et al., 2019) and Google Research Football (Kurach et al., 2020). Specifically, our algorithm can successfully discover 6 distinct human-interpretable strategies in the GRF 3-vs-1 scenario and 4 strategies in two 11-player GRF scenarios, namely counter-attack and corner, without any domain knowledge, which are substantially more than existing baselines.

2. RELATED WORK

Discovering diverse solutions has been a long-established problem (Miller & Shaw, 1996; Deb & Saha, 2010; Lee et al., 2022) with a wide range of applications in robotic control (Cully et al., 2015; Kumar et al., 2020), dialogues (Li et al., 2016), game AI (Vinyals et al., 2019; Lupu et al., 2021), design (Gupta et al., 2021) and emergent behaviors (Liu et al., 2019; Baker et al., 2020; Tang et al., 2021). Early works are primarily based on the setting of multi-objective optimization (Mouret & Clune, 2015; Pugh et al., 2016; Ma et al., 2020; Nilsson & Cully, 2021; Pierrot et al., 2022), which assumes a set of reward functions is given in advance. In RL, this is also related to reward shaping (Ng et al., 1999; Babes et al., 2008; Devlin & Kudenko, 2011; Tang et al., 2021). We consider learning diverse policies without any domain knowledge.

Population-based training (PBT) is the most popular framework for producing diverse solutions by jointly learning separate policies. Representative algorithms include evolutionary computation (Wang et al., 2019; Long et al., 2020; Parker-Holder et al., 2020b), league training (Vinyals et al., 2019; Jaderberg et al., 2019), computing the Hessian matrix (Parker-Holder et al., 2020a) or constrained optimization with a diversity measure over the policy population (Lupu et al., 2021; Zhao et al., 2021; Li et al., 2021; Liu et al., 2021b). An improvement over PBT is to learn a latent variable policy instead of separate ones to improve sample efficiency. Prior works have incorporated different domain knowledge to design the latent code, such as action clustering (Wang et al., 2021), agent identities (Li et al., 2021) or prosocial level (Peysakhovich & Lerer, 2018; Baker et al., 2020). The latent variable can also be learned in an unsupervised fashion. DIAYN (Eysenbach et al., 2019) and its variants (Kumar et al., 2020; Osa et al., 2022) learn latent-conditioned policies by maximizing the mutual information between states and the latent variable. The discovered behaviors are primarily low-level motion skills rather than high-reward strategies (Eysenbach et al., 2022).

Iterative learning (IL) simplifies PBT by only optimizing a single policy subject to different diversity measures, such as maximum mean discrepancy (Masood & Doshi-Velez, 2019), Wasserstein distance on actions (Sun et al., 2020), and cross-entropy (Zhou et al., 2022), which are often action-based. We adopt a purely state-based measure. Some other works require an expensive clustering process before each optimization iteration (Zhang et al., 2019) or domain-specific features (Zahavy et al., 2021), while we consider measures that can be efficiently optimized in an end-to-end fashion. Besides, Pacchiano et al. (2020) learn a kernel-based score function to guide policy optimization. The score function is conceptually similar to our Wasserstein-distance-based diversity measure but is applied to a parallel setting with more restricted expressive power.

3. PRELIMINARY

Notation: We consider a Partially Observable Markov Decision Process (POMDP) (Spaan, 2012), defined by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, r, P, O, \nu, H \rangle$. $\mathcal{S}$ is the state space. $\mathcal{A}$ and $\mathcal{O}$ are the action and observation spaces. $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function. $O : \mathcal{S} \to \mathcal{O}$ is the observation function. $H$ is the horizon. $P$ is the transition function. For states $s, s' \in \mathcal{S}$ and an action $a \in \mathcal{A}$, the transition probability from $s$ to $s'$ by executing action $a$ is $P(s' \mid s, a)$. At timestep $h$, the agent receives an observation $o_h = O(s_h)$ from the current state $s_h$ and outputs an action $a_h \in \mathcal{A}$ according to its policy $\pi : \mathcal{O} \to \triangle(\mathcal{A})$. The RL objective $J(\pi)$, i.e., the expected return, is defined by

$$J(\pi) = \mathbb{E}_{(s_h, a_h) \sim (P, \pi)} \left[ \sum_{h=1}^{H} r(s_h, a_h) \right].$$

The discount factor is omitted here to simplify notation. The above formulation can be naturally extended to cooperative multi-agent settings, where $\pi$ and $r$ correspond to the joint policy and the shared reward. We follow the standard POMDP notation for conciseness and evaluate our algorithm in complex cooperative multi-agent scenarios, since multi-agent games are substantially more challenging than single-agent ones.

Finally, in order to discover diverse strategies, we aim to learn a set of $M$ policies $\{\pi_i\}_{i=1}^{M}$ such that all of these policies are locally optimal under $J(\cdot)$ but mutually distinct subject to some diversity measure $D(\cdot, \cdot) : \triangle \times \triangle \to \mathbb{R}$, which captures the difference between two policies. We present two popular computation procedures for this purpose.

Population-Based Training (PBT): PBT is a straightforward formulation of the diversity discovery problem that jointly learns $M$ policies $\{\pi_i\}_{i=1}^{M}$ subject to pairwise diversity constraints, i.e.,

$$\max_{\pi_1, \ldots, \pi_M} \sum_{i=1}^{M} J(\pi_i) \quad \text{s.t.} \quad D(\pi_j, \pi_k) \ge \delta, \;\; \forall j, k \in [M],\ j \ne k, \tag{1}$$

where $\delta$ is a threshold. Despite its precise formulation, PBT poses severe optimization challenges.

Iterative Learning (IL): IL is a greedy approximation of PBT that iteratively learns novel policies. In the $i$-th ($1 \le i \le M$) iteration, IL solves the following constrained optimization problem:

$$\pi_i^\star = \arg\max_{\pi_i} J(\pi_i) \quad \text{s.t.} \quad D(\pi_i, \pi_j^\star) \ge \delta, \;\; \forall\, 1 \le j < i. \tag{2}$$

IL runs unconstrained RL at first and then solves incrementally more constrained problems.

Action-Based Diversity Measure: We briefly introduce action-based diversity measures here. Many prior works define $D(\cdot, \cdot)$ over actions, which can be formally summarized by

$$D_A(\pi_i, \pi_j) = \mathbb{E}_{s \sim q(s)} \left[ \tilde{D}_A\big(\pi_i(\cdot \mid s) \,\|\, \pi_j(\cdot \mid s)\big) \right], \tag{3}$$

where $q \in \triangle(\mathcal{S})$ denotes some specific state distribution, and $\tilde{D}_A(\cdot \| \cdot) : \triangle \times \triangle \to \mathbb{R}$ measures the difference between action distributions. $\tilde{D}_A$ can be any probability distance such as Wasserstein distance (Sun et al., 2020), Jensen-Shannon divergence (Lupu et al., 2021), cross-entropy (Zhou et al., 2022), or simply the $L_2$ distance given a continuous action space (Parker-Holder et al., 2020b).
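As a concrete illustration of the action-based template above, the following minimal sketch evaluates $D_A$ for tabular policies with Jensen-Shannon divergence as the per-state distance. The function names, the two toy policies, and the uniform state distribution `q` are assumptions made for this example and are not taken from any of the cited implementations.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, with the convention 0 * log 0 = 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two action distributions."""
    m = [0.5 * (pa + qa) for pa, qa in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def action_diversity(pi_i, pi_j, state_weights):
    """D_A(pi_i, pi_j) = sum_s q(s) * JSD(pi_i(.|s) || pi_j(.|s)).

    pi_i, pi_j: lists of per-state action distributions (tabular policies).
    state_weights: q(s), assumed to be given and to sum to 1.
    """
    return sum(w * js_divergence(a, b)
               for w, a, b in zip(state_weights, pi_i, pi_j))

# Two hypothetical deterministic policies that disagree in every state.
pi_1 = [[1.0, 0.0], [0.0, 1.0]]
pi_2 = [[0.0, 1.0], [1.0, 0.0]]
q = [0.5, 0.5]
d_same = action_diversity(pi_1, pi_1, q)   # identical policies
d_diff = action_diversity(pi_1, pi_2, q)   # disjoint actions everywhere
```

In this toy case the measure is zero for identical policies and reaches its maximum, $\log 2$, whenever the two policies pick different actions in every weighted state, regardless of which states those actions lead to.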

4. ANALYSIS OF EXISTING DIVERSITY-DISCOVERY APPROACHES

In this section, we conduct both theoretical and quantitative analyses of existing approaches to motivate our method. We first compare the two computation frameworks, namely PBT and IL, and then examine the choice of diversity measure in Sec. 4.2.

Theoretical Comparison: We consider a simple motivating example in the setting of linear programming to intuitively illustrate the computational challenges. We assume that $\pi_i$ is a scalar, $J(\pi_i)$ is linear in $\pi_i$, and $D(\pi_i, \pi_j) = |\pi_i - \pi_j|$. In our definition, PBT involves $\Omega(M^2)$ variables in a single constrained optimization problem while IL involves $\Omega(M)$ variables in all. It is well-known that the complexity of linear programming is a high-degree polynomial (degree 3 or higher depending on the algorithm) w.r.t. the number of variables (Bertsimas & Tsitsiklis, 1997). Therefore, even in the linear case, more constraints can pose substantial challenges to the optimization problem. This issue can be more severe in RL due to the complex solution space and large training variance.

Although IL can be optimized efficiently, it remains unclear whether IL, as a greedy approximation of PBT, can obtain solutions of comparable rewards. Fig. 1 shows the worst case in a 1-D setting, where the policies found by IL (green) can indeed have much lower rewards than the PBT solution (red) when subject to the same diversity constraint. However, we will show in the next theorem that IL is guaranteed to have no worse rewards than PBT by trading off half of the diversity.

Theorem 4.1. Assume $D$ is a distance metric. Denote the optimal value of Eq. (1) as $T_1$. Let $T_2 = \sum_{i=1}^{M} J(\tilde{\pi}_i)$ where

$$\tilde{\pi}_i = \arg\max_{\pi_i} J(\pi_i) \quad \text{s.t.} \quad D(\pi_i, \tilde{\pi}_j) \ge \delta/2, \;\; \forall\, 1 \le j < i \tag{4}$$

for $i = 1, \ldots, M$. Then $T_2 \ge T_1$.

Proof. See Appendix E.1.

Figure 1: 1-D worst case of IL. With threshold δ, IL finds solutions with inferior rewards. However, IL can find optimal solutions if the threshold is halved. (Axes: $\pi$ against $J(\pi)$. Legend: optimal solution; solution found by IL with threshold δ/2; worst-case solution found by IL with threshold δ.)

The above theorem provides a quality guarantee for the IL solutions. The proof can be intuitively explained by the 1-D example in Fig. 1. Assume the worst case, where the first IL solution lies in the middle of a plateau with size δ (green 1). The next solution with threshold δ must then lie outside the plateau, where the reward is low. However, if the threshold is halved, the IL solutions are guaranteed to lie in the high-reward area (blue 1 and 2). Thm. 4.1 shows that, for any policy pool derived by PBT, we can always use IL to obtain another policy pool, which has the same rewards and comparable diversity scores. We remark that the worst case in Fig. 1 may not be common for RL environments in practice.

Empirical Results: We empirically compare PBT and IL in a 2-D navigation environment with one agent and $N_L$ landmarks (blue circles), as shown in Fig. 2. The reward is +1 if the agent successfully navigates to a landmark and 0 otherwise. Before training, landmark positions are randomly initialized subject to a pre-specified distance threshold per episode. We train $N_L$ policies using both PBT and IL to discover strategies towards each of these landmarks. Specifically, we simply take $D(\pi_i, \pi_j)$ as the $L_2$ distance between the final states reached by $\pi_i$ and $\pi_j$, i.e., $D(\pi_i, \pi_j) = \| s_H^{\pi_i} - s_H^{\pi_j} \|_2$. We solve this problem via Lagrange multipliers, with details in Appendix D.

Table 1: The number of discovered landmarks by PBT and IL across 6 seeds, with standard deviation in brackets.

Setting      PBT          IL
$N_L = 4$    2.0 (1.0)    3.5 (0.5)
$N_L = 5$    2.2 (0.9)    4.5 (0.5)

Figure 3: (left) A grid-world environment with size $N_G = 5$ and 3 different optimal policies. Intuitively, $D(\pi_1, \pi_2) < D(\pi_1, \pi_3)$ because $\pi_1$ (purple) and $\pi_2$ (blue) both move along the diagonal. However, action-based diversity measures can give $D_A(\pi_1, \pi_2) \ge D_A(\pi_1, \pi_3)$ (right), which motivates our proposal of a state-distance-based diversity measure.

Table 1 shows the number of discovered landmarks by PBT and IL. IL performs consistently better than PBT even in this simple example. We illustrate the learning process of PBT and IL in Fig. 2. IL, due to its computational efficiency, can afford to run more iterations and tolerate larger exploration noise. Hence, it can converge easily to diverse solutions by imposing a large diversity constraint. PBT, however, converges only when exploration is weak; otherwise it diverges or converges too slowly.
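The 1-D worst case behind Thm. 4.1 can be reproduced numerically. The sketch below is a toy construction, not the paper's experiment: it assumes an integer grid of scalar "policies", a single reward plateau whose width equals the threshold, $D(x, y) = |x - y|$, and adversarial tie-breaking that places each IL solution as close to the plateau center as possible. All names are hypothetical.

```python
from itertools import combinations

DELTA = 10
CANDIDATES = range(-20, 31)   # 1-D "policies" on an integer grid (assumption)

def reward(x):
    """J(x): a single high-reward plateau whose width equals DELTA."""
    return 1.0 if 0 <= x <= DELTA else 0.0

def distance(x, y):
    return abs(x - y)

def pbt_value(m, delta):
    """Brute-force optimum of the joint (PBT) problem for m policies."""
    best = 0.0
    for pool in combinations(CANDIDATES, m):
        if all(distance(a, b) >= delta for a, b in combinations(pool, 2)):
            best = max(best, sum(reward(x) for x in pool))
    return best

def il_value(m, delta, center=DELTA / 2):
    """Greedy IL with adversarial tie-breaking toward the plateau center."""
    found = []
    for _ in range(m):
        feasible = [x for x in CANDIDATES
                    if all(distance(x, y) >= delta for y in found)]
        top = max(reward(x) for x in feasible)
        ties = [x for x in feasible if reward(x) == top]
        found.append(min(ties, key=lambda x: abs(x - center)))
    return sum(reward(x) for x in found)

t1 = pbt_value(2, DELTA)              # joint optimum with threshold delta
t_il_full = il_value(2, DELTA)        # IL with threshold delta
t_il_half = il_value(2, DELTA // 2)   # IL with threshold delta / 2
```

With these assumptions the joint problem places both policies on the plateau edges, IL with the full threshold pushes its second policy off the plateau, and IL with the halved threshold recovers the joint optimum's total reward.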

4.2. CHOICE OF DIVERSITY MEASURE: ACTION-BASED OR STATE-BASED?

We then analyze the impact of different diversity measures. We first show that action-based measures can often fail even for very simple tasks.

Action-Based Measure: Although action-based measures are easy to compute and widely used, we present concrete failure cases here. The first example is a single-agent grid-world with size $N_G$, where an agent spawns at the top left and needs to navigate to the bottom right. We consider three different policies shown in Fig. 3: $\pi_1$ (purple) and $\pi_2$ (blue) move along the diagonal while $\pi_3$ (red) moves along the boundary. Humans can naturally conclude that $\pi_3$ is visually different from $\pi_1$ and $\pi_2$, i.e., $D(\pi_1, \pi_2) < D(\pi_1, \pi_3)$, especially when $N_G$ is large. However, the actions of $\pi_1$ and $\pi_2$ along the trajectory are totally disjoint. Consequently, action-based measures will assign a large value to $D(\pi_1, \pi_2)$. We compute $D(\pi_1, \pi_2)$ and $D(\pi_1, \pi_3)$ based on popular action-based diversity measures in Table 2, where the obtained values largely violate human intuition.

Next, we consider a more realistic and complicated multi-agent football scenario in Fig. 4, where an idle player in the backfield takes an arbitrary action, such as "pass", "shoot" or "slide", without being involved in the attack at all. Although the idle player stays still and has no effect on the team strategy, action-based measures can produce high diversity scores when the idle player takes different duplicated actions, leading to visually indistinguishable solutions.

[Figure 4 labels: Dribbling Player; Idle Player]

State-Based Measure: Based on the previous examples, we propose to focus on states rather than actions when designing a diversity measure. Formally, denote the state distribution induced by $\pi$ as $\mu_\pi$. We define the state-distance-based diversity measure as

$$D_S(\pi_i, \pi_j) = \mathbb{E}_{(s, s') \sim \gamma} \left[ g\big(d(s, s')\big) \right]. \tag{5}$$

Here, $d$ is a distance metric over $\mathcal{S} \times \mathcal{S}$, $g : \mathbb{R}^+ \to \mathbb{R}$ is a monotonic function, and $\gamma \in \Gamma(\mu_{\pi_i}, \mu_{\pi_j})$ is a distribution over state pairs. $\Gamma(\mu_{\pi_i}, \mu_{\pi_j})$ denotes the collection of all distributions on $\mathcal{S} \times \mathcal{S}$ with marginals $\mu_{\pi_i}$ and $\mu_{\pi_j}$ on the first and second factors respectively. Our proposed measure is solely defined over states, and such a metric can impose a stronger and much smoother diversity constraint than existing action-based metrics. The state distance in the measure encourages the policies to reach visually different states, leading to the desired diversity. We compute two simple state-based measures, i.e., the $L_2$ norm and the Earth Moving Distance (EMD), for the grid-world example in Table 2, and the results are consistent with human intuition.
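The contrast between the two kinds of measure can be illustrated on a small grid-world. The sketch below uses three hypothetical deterministic policies that mimic the description above (two staircases along the diagonal and one path along the boundary); they are not the exact policies of Fig. 3, and the numbers are not those of Table 2. The state-based score pairs states by timestep, which is one particular coupling $\gamma$ chosen here for simplicity, with $g$ the identity and $d$ the $L_2$ distance.

```python
import math

def rollout(actions):
    """States visited (after each step) when starting at the top-left cell (0, 0)."""
    x, y, states = 0, 0, []
    for a in actions:
        if a == "R":
            x += 1
        else:          # "D"
            y += 1
        states.append((x, y))
    return states

def action_disagreement(acts_a, acts_b):
    """Fraction of timesteps at which the two policies pick different actions."""
    return sum(a != b for a, b in zip(acts_a, acts_b)) / len(acts_a)

def state_distance(acts_a, acts_b):
    """Mean L2 distance between time-aligned states (one particular coupling)."""
    pairs = zip(rollout(acts_a), rollout(acts_b))
    return sum(math.dist(s, t) for s, t in pairs) / len(acts_a)

N_G = 5
steps = N_G - 1
pi_1 = ["R", "D"] * steps             # staircase along the diagonal
pi_2 = ["D", "R"] * steps             # mirrored staircase
pi_3 = ["R"] * steps + ["D"] * steps  # along the boundary

act_12 = action_disagreement(pi_1, pi_2)
act_13 = action_disagreement(pi_1, pi_3)
st_12 = state_distance(pi_1, pi_2)
st_13 = state_distance(pi_1, pi_3)
```

In this toy setting the action-based score ranks the two diagonal policies as the most different pair (they disagree at every step), whereas the state-based score ranks the diagonal and boundary policies as further apart, matching the intuition described above.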

4.3. PRACTICAL REMARK

Based on the analysis in the above subsections, we conclude that PBT can pose severe optimization challenges, and that action-based diversity measures can often fail because they may not correctly reflect behavioral differences. By contrast, IL and state-based diversity measures are free from the above issues and should be preferred in challenging RL applications. Therefore, we consider how to develop a powerful algorithm for discovering diverse policies that can leverage both algorithmic design choices. In the next section, we combine these ideas with a theoretically sound optimization algorithm, Gradient Descent Ascent (GDA), towards an efficient and practical algorithm for learning diverse policies.

5.1. ALGORITHM OVERVIEW

In this section, we develop a powerful diversity-driven RL algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), by combining IL and state-distance-based measures. SIPO runs $M$ iterations to discover $M$ distinct policies. At the $i$-th iteration, we solve Problem (2) by converting it into unconstrained optimization using the Lagrange method. The unconstrained optimization can be written as

$$\min_{\pi_i} \; \max_{\lambda_j \ge 0,\ 1 \le j < i} \; -J(\pi_i) - \sum_{j=1}^{i-1} \lambda_j \big( D_S(\pi_i, \pi_j^\star) - \delta \big), \tag{6}$$

where $\lambda_j$ ($1 \le j < i$) are Lagrange multipliers and $\{\pi_j^\star\}_{j=1}^{i-1}$ are previously obtained policies. We adopt two-timescale Gradient Descent Ascent (GDA) (Lin et al., 2020) to solve the above minimax optimization, i.e., performing gradient descent over $\pi_i$ and gradient ascent over $\lambda_j$ with different learning rates. We also clip the dual variables $\lambda$, which plays an important role both in our theorem and in empirical convergence. However, $D_S(\pi_i, \pi_j^\star)$ cannot be directly optimized through gradient-based methods because it is related to the states visited by $\pi_i$. As a popular solution (Zhou et al., 2022), we cast $D_S(\pi_i, \pi_j^\star)$ as the summation of intrinsic rewards and optimize it via policy gradient. The pseudocode of SIPO can be found in Appendix G.

An important property of SIPO is the convergence guarantee. We present an informal statement in Thm. 5.1 and the formal theorem with proof in Appendix E.2.

Theorem 5.1. (Informal) Under moderate assumptions, SIPO converges to a neighborhood of an ϵ-approximate KKT point.

Remark: Please see the appendix for a detailed description of the assumptions and the proof. We assume that the reward $J$ and the distance $D_S$ are smooth in policies. In practice, this is true if the policy remains in a bounded region and the reward is continuous in state. The key step in the proof is to analyze the role of clipping the dual variables $\lambda$, which stabilizes the algorithm without hurting the optimality condition.
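The descent-ascent update with a clipped dual variable can be sketched on a scalar toy problem. This is only an illustration of the update pattern, not the SIPO implementation: it assumes a scalar "policy" $x$, a reward $J(x) = -x^2$, a single previous policy at $0$, and $D(x, x_{\text{prev}}) = |x - x_{\text{prev}}|$. The learning rates, clipping bound, and function names are hypothetical.

```python
def clip(value, low, high):
    return max(low, min(high, value))

def sign(x):
    return (x > 0) - (x < 0)

def gda_toy(steps=3000, lr_policy=0.05, lr_dual=0.1, lam_max=10.0, delta=1.0):
    """GDA on L(x, lam) = -J(x) - lam * (D(x, x_prev) - delta).

    Toy assumptions: scalar "policy" x, J(x) = -x**2, previous policy
    x_prev = 0, and D(x, x_prev) = |x - x_prev|.
    """
    x_prev = 0.0
    x, lam = 0.1, 0.0   # start slightly off x_prev so that |x| has a gradient
    for _ in range(steps):
        diversity = abs(x - x_prev)
        # dL/dx = 2x - lam * sign(x - x_prev); dL/dlam = -(diversity - delta)
        grad_x = 2.0 * x - lam * sign(x - x_prev)
        grad_lam = -(diversity - delta)
        x = x - lr_policy * grad_x                           # descent on the policy
        lam = clip(lam + lr_dual * grad_lam, 0.0, lam_max)   # clipped ascent on lam
    return x, lam

x_star, lam_star = gda_toy()
```

In this toy problem the unconstrained optimum coincides with the previous policy, so the multiplier grows while the constraint is violated and pushes $x$ away until the diversity constraint holds with equality; the iterates settle at the constrained optimum with a finite multiplier inside the clipping range.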

5.2. REALIZATION OF THE STATE-BASED MEASURE

Instead of directly defining $D_S$, we define intrinsic rewards as illustrated in Sec. 5.1, such that

$$D_S(\pi_i, \pi_j^\star) = \mathbb{E}_{s_h \sim \mu_{\pi_i}} \left[ \sum_{h=1}^{H} r_{\text{int}}(s_h; \pi_i, \pi_j^\star) \right].$$

With this formulation, we can implement the following two types of diversity measures.

RBF Kernel: The most popular realization of Eq. (5) in machine learning is kernel functions. In this paper, we realize Eq. (5) as an RBF kernel on states. Formally, the intrinsic reward is defined by

$$r^{\text{RBF}}_{\text{int}}(s_h; \pi_i, \pi_j^\star) = \frac{1}{H}\, \mathbb{E}_{s' \sim \mu_{\pi_j^\star}} \left[ -\exp\left( -\frac{\| s_h - s' \|^2}{2\sigma^2} \right) \right], \tag{7}$$

where $\sigma$ is a hyperparameter controlling the variance.

Wasserstein Distance: For stronger discrimination power, we realize Eq. (5) as the $L_2$-Wasserstein distance. According to the dual form (Villani, 2009), the intrinsic reward is defined by

$$r^{\text{WD}}_{\text{int}}(s_h; \pi_i, \pi_j^\star) = \frac{1}{H} \sup_{\| f \|_L \le 1} f(s_h) - \mathbb{E}_{s' \sim \mu_{\pi_j^\star}} \left[ f(s') \right], \tag{8}$$

where $f : \mathcal{S} \to \mathbb{R}$ is a 1-Lipschitz function. Following Arjovsky et al. (2017), we implement $f$ as a neural network and clip its parameters to $[-0.01, 0.01]$ to ensure the Lipschitz constraint. $r^{\text{WD}}_{\text{int}}$ utilizes a learnable scoring function $f$ and is more flexible in practice. We name SIPO with $r^{\text{RBF}}_{\text{int}}$ and $r^{\text{WD}}_{\text{int}}$ SIPO-RBF and SIPO-WD respectively.

Implementation: In the $i$-th iteration ($1 \le i \le M$), we learn an actor and a critic with $i$ separate value heads to accurately predict different return terms, including $i - 1$ intrinsic returns for the diversity constraints and the environment reward. The input of $r_{\text{int}}$ is the global state, which contains the state information of all the agents. To incorporate temporal information, we stack the most recent 4 global states to compute intrinsic rewards and normalize the intrinsic rewards to stabilize training. In multi-agent environments, we learn an agent-ID-conditioned policy (Fu et al., 2022) and share parameters across all agents. Our implementation is based on MAPPO (Yu et al., 2021), with more details in Appendix D.
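The RBF intrinsic reward can be estimated from samples. The following minimal sketch replaces the expectation over the earlier policy's state distribution with an average over a finite set of reference states; the function name, the reference states, and the parameter values are assumptions for illustration, and the state stacking and reward normalization mentioned above are omitted.

```python
import math

def rbf_intrinsic_reward(state, reference_states, sigma, horizon):
    """r_int^RBF(s_h) = (1/H) * mean_{s'} [ -exp(-||s_h - s'||^2 / (2 sigma^2)) ].

    reference_states: states sampled from a previously learned policy
    (a Monte Carlo stand-in for the expectation over its state distribution).
    """
    total = 0.0
    for ref in reference_states:
        sq_dist = sum((a - b) ** 2 for a, b in zip(state, ref))
        total += -math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / (len(reference_states) * horizon)

reference = [(0.0, 0.0), (0.0, 1.0)]   # hypothetical states of an earlier policy
r_near = rbf_intrinsic_reward((0.0, 0.0), reference, sigma=1.0, horizon=10)
r_far = rbf_intrinsic_reward((5.0, 5.0), reference, sigma=1.0, horizon=10)
```

The reward is always non-positive: it is most negative when the current state coincides with states visited by the earlier policy and approaches zero as the state moves away, so maximizing it drives the new policy toward unvisited regions at a length scale set by `sigma`.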

6. EXPERIMENTS

We validate the effectiveness of SIPO in two complex multi-agent games: StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019) and Google Research Football (GRF) (Kurach et al., 2020). First, we show that SIPO can efficiently learn diverse strategies in all scenarios and outperform several baseline methods, including DIPG (Masood & Doshi-Velez, 2019), SMERL (Kumar et al., 2020), DvD (Parker-Holder et al., 2020b), and RSPO (Zhou et al., 2022). Then, we qualitatively demonstrate the emergent behaviors learned by SIPO, which are both visually distinguishable and human-interpretable. Finally, we perform an ablation study over the building components of SIPO and show that both the diversity measure and GDA are critical to the performance. All the algorithms, including ablation variants, run for the same number of environment frames on a desktop machine with a single NVIDIA RTX3090 GPU. All the quantitative results are repeated over 3 random seeds with standard deviation shown in brackets. Additional results in continuous control can be found in Appendix B.

6.1. COMPARISON WITH BASELINE METHODS

In SMAC, we only compare SIPO and RSPO, since RSPO outperforms other baselines (Zhou et al., 2022). We run both algorithms on an easy map, 2m_vs_1z, and a hard map, 2c_vs_64zg, across 4 iterations. Both algorithms can discover 4 distinct winning strategies. To perform further comparison, we compute the population diversity score based on $r^{\text{RBF}}_{\text{int}}$ (see definition in Appendix B). The results in Table 3 show that SIPO can discover an even more diverse population than RSPO, even though RSPO explicitly forces all policies to output different actions.

In GRF, we run all algorithms and train a population of 4 in three academy scenarios, specifically "academy_3_vs_1_with_keeper" (3v1), "academy_counterattack_easy" (CA), and "academy_corner" (corner). The GRF environment is more challenging than SMAC due to the large action space and the existence of duplicate actions. Table 4 compares the number of visually distinct policies discovered in the population. We present the population diversity scores in Appendix B. Our algorithm is the most efficient and robust: even in the challenging 11-vs-11 corner and CA scenarios, SIPO can effectively discover different winning strategies in just a few iterations across different seeds. By contrast, baselines adopting action-based measures, e.g., DvD and RSPO, suffer from the issue of duplicate actions and tend to discover policies with only slight distinctions. In addition, the mutual information objective in SMERL is sub-optimal (Eysenbach et al., 2022) and the MMD-based measure of DIPG may not impose a strong adversarial power on policies.

6.2. QUALITATIVE ANALYSIS

For SMAC, we present heatmaps of agent positions in Fig. 5. The heatmaps clearly show that SIPO can consistently learn novel winning strategies to conquer the enemy. Fig. 6 presents the learned behavior by SIPO in the GRF 3v1 scenario, where 3 attackers should collaborate to shoot under the defense of an opponent player and a goalkeeper. We can observe that agents have learned a wide spectrum of collaboration strategies across merely 7 iterations. Visualization results in CA and corner scenarios can be found in Appendix B.

Surprisingly, the strategies discovered by SIPO are both diverse and human-interpretable. We take the 3v1 scenario as an example. In the first iteration, all agents are involved in the attack such that they can distract the defender and obtain a high win rate. Similar strategies are discovered in the 4th and 5th iterations. The 2nd and the 6th iterations demonstrate an efficient pass-and-shoot strategy, where agents quickly elude the defender and score a goal. In the 3rd and the 7th iterations, agents learn smart "one-two" strategies to bypass the defender, which is a common tactic adopted by professional human football players. To the best of our knowledge, SIPO may be the first algorithm that can discover such diverse human-like tactics in complex multi-agent RL environments.

6.3. ABLATION STUDY

We perform ablation studies by
• fixing the Lagrange multiplier (fix-L);
• replacing our proposed diversity measure with cross-entropy (CE);
• replacing GDA with the filtering-based method (filter);
• replacing IL with PBT (PBT).

We apply these changes to SIPO-WD and report the number of visually distinct policies discovered by these methods in Table 5. Comparison between SIPO and CE demonstrates that the action-based cross-entropy measure may suffer from duplicate actions in GRF and produce nearly identical behavior by overly exploiting duplicate actions, especially in the CA and corner scenarios with 11 agents. Besides, the fixed Lagrange coefficient, the filtering-based method, and PBT are all detrimental to our algorithm. These methods also suffer from significant training instability. Overall, both the state-distance-based diversity measure and GDA are critical to the performance of SIPO.

7. CONCLUSION

In this paper, we tackle the problem of discovering diverse high-reward policies in complex RL scenarios. We present a thorough comparison between two popular computation frameworks for this problem, i.e., population-based training (PBT) and iterative learning (IL), and show that, compared with PBT, IL is much easier to optimize and can derive solutions of comparable quality. Moreover, we demonstrate concrete failure cases for popular action-based diversity measures. Motivated by these insights, we combine IL with a diversity measure defined on state distance to develop State-based Intrinsic-reward Policy Optimization (SIPO), which has provable convergence and can efficiently discover a wide spectrum of human-interpretable strategies in challenging multi-agent environments. We emphasize that the contribution of our work goes well beyond the final algorithm SIPO. We believe our analysis of frameworks and diversity measures, with concrete examples and theoretical justifications, can bring useful insights that help the community develop more powerful diversity-driven RL algorithms.

A PROJECT WEBSITE

Check https://sites.google.com/view/diversity-sipo for GIF demonstrations.

B ADDITIONAL RESULTS

B.1 MORE QUALITATIVE RESULTS

We show visualization results in GRF CA and corner in Fig. 7 (left) and Fig. 8 (left). We also evaluate SIPO-WD in the most challenging continuous control environment, Humanoid-v3, across 3 iterations and visualize the learned behavior in Fig. 8 (right). SIPO-WD is able to produce diverse behaviors with different gaits. We additionally remark that the population diversity score is very close to 1 (such as 0.999) even when we repeatedly run PPO (Zhou et al., 2022). Hence, we do not report the population diversity score here.

B.2 STATE-BASED POPULATION DIVERSITY

We define the pairwise difference between policies as

$$K_{ij} = K(\pi_i, \pi_j) = \mathbb{E}_{(s_h^1, s_h^2) \sim (P, \pi_i, \pi_j)} \left[ \exp\left( -\frac{\| s_h^1 - s_h^2 \|^2}{2\sigma^2} \right) \right],$$

where $\sigma$ is a scaling factor. Then, similar to Parker-Holder et al. (2020b), we compute the determinant of the matrix $K$ as the population diversity. We use $\sigma = 1$ for Table 3, $\sigma = 0.4$ for Table 6, and $\sigma = 0.15$ for Table 7.

Similar to Table 3, we present the state-based population diversity score of GRF scenarios in Table 6. GRF scenarios are more challenging than SMAC and the trained policies may not always score a goal in each episode. (See evaluation winning rates in Table 8.) To reduce the variance, we collect 32 winning trajectories and compute population diversity scores on them.

Table footnotes:
1 When conditioned on some specific latent variable, the SMERL policy cannot even collect a single winning trajectory in CA and corner. Therefore, we omit the result here.
2 Training DvD in CA and corner requires >24GB GPU memory, which exceeds our memory limit.
3 Not converged in corner.
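The determinant-based population score can be sketched as follows. This is a minimal illustration under simplifying assumptions: each policy is represented by a single trajectory of states, the expectation in $K_{ij}$ is replaced by an average over time-aligned state pairs, and the trajectories are made up. The helper names are hypothetical.

```python
import math

def policy_kernel(traj_a, traj_b, sigma):
    """K(pi_a, pi_b): mean RBF similarity between time-aligned states."""
    total = 0.0
    for s1, s2 in zip(traj_a, traj_b):
        sq_dist = sum((u - v) ** 2 for u, v in zip(s1, s2))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / len(traj_a)

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def population_diversity(trajectories, sigma):
    """det(K) with K_ij = K(pi_i, pi_j), one trajectory per policy (assumption)."""
    k = [[policy_kernel(ti, tj, sigma) for tj in trajectories]
         for ti in trajectories]
    return determinant(k)

far_apart = [[(0.0, 0.0)] * 3, [(10.0, 0.0)] * 3, [(0.0, 10.0)] * 3]
duplicated = [[(0.0, 0.0)] * 3, [(0.0, 0.0)] * 3, [(0.0, 10.0)] * 3]
div_far = population_diversity(far_apart, sigma=1.0)
div_dup = population_diversity(duplicated, sigma=1.0)
```

When all policies visit well-separated states, $K$ is close to the identity and the score is close to 1; when two policies coincide, two rows of $K$ are identical and the score collapses to 0. The choice of $\sigma$ sets the distance scale at which two policies count as distinct.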

B.3 RESULTS WITH A LARGER POPULATION SIZE

To demonstrate the effectiveness of SIPO, we additionally conduct an experiment in the GRF 3v1 scenario with a population size $M = 10$. Baselines include DIPG and RSPO. We present the results in Table 7. SIPO clearly outperforms these baselines, consistently discovering one or more additional strategies. Empirically, we find that there are 4 "primitive" strategies in the 3v1 scenario: pass-and-shoot (iteration 2 in Fig. 6), double-pass-and-shoot (iteration 1 in Fig. 6), and the corresponding mirror strategies. Across 10 iterations, baseline methods do not discover any strategies beyond these primitives, while SIPO is able to learn additional smart behaviors like "one-two" strategies (iteration 7 in Fig. 6).

B.4 EVALUATION WIN RATE

The evaluation win rates of the demonstrated visualization results (Fig. 5, Fig. 6, Fig. 7, and Fig. 8) are shown in Table 8.

B.5 COMPUTATION OF ACTION-BASED MEASURES IN THE GRID-WORLD EXAMPLE

We consider the policies illustrated in Fig. 9. These policies are all optimal since their actions only include "right" and "down", and actions on non-visited states can be arbitrary. We only mark actions on states visited by any of these 3 policies; actions on other states can be considered the same.

The action-based measures considered here take the form

$$D_A(\pi_i, \pi_j) = \mathbb{E}_{s \sim q(s)}\left[D\left(\pi_i(\cdot \mid s) \,\|\, \pi_j(\cdot \mid s)\right)\right],$$

where $D(\cdot, \cdot): \triangle \times \triangle \to \mathbb{R}$ is a measure over action distributions and $q \in \triangle(S)$ is a state proposal distribution. Here, we consider $q$ to be the joint state distribution visited by $\pi_i$ and $\pi_j$.

KL Divergence. KL divergence is defined by

$$D_{\mathrm{KL}}\left(\pi_i(\cdot \mid s), \pi_j(\cdot \mid s)\right) = \int_A \pi_i(a \mid s) \log \frac{\pi_i(a \mid s)}{\pi_j(a \mid s)} \, da.$$

When $\pi_j(a \mid s) = 0$ at any state $s$, KL divergence is $+\infty$. Since the trajectories of these policies have disjoint states, $D_A^{\mathrm{KL}}(\pi_1, \pi_2) = D_A^{\mathrm{KL}}(\pi_1, \pi_3) = +\infty$. Similar results can be obtained for cross-entropy.

JSD$_\gamma$. JSD$_\gamma$ was defined in Lupu et al. (2021) and we consider two special cases, $\gamma = 0$ and $\gamma = 1$. As illustrated by Lupu et al. (2021), JSD$_0$ measures the expected number of times two policies will "disagree" by selecting different actions. On trajectories induced by $\pi_1$ and $\pi_2$, there are $4 + 4$ states where $\pi_1$ disagrees with $\pi_2$ ($\pi_1$ and $\pi_2$ are symmetric) and $D_A^{\mathrm{JSD}_0}(\pi_1, \pi_2) = 8/16 = 1/2$. Similarly, $\pi_1$ and $\pi_3$ only disagree at the initial state, therefore we have $D_A^{\mathrm{JSD}_0}(\pi_1, \pi_3) = 2/16 = 1/8$. For $\gamma = 1$,

$$\mathrm{JSD}_1(\pi_i, \pi_j) = -\frac{1}{2} \sum_{\tau_i} P(\tau_i \mid \pi_i) \sum_{t=1}^{T} \frac{1}{T} \log \frac{\pi_i(\tau_i) + \pi_j(\tau_i)}{2\pi_i(\tau_i)} - \frac{1}{2} \sum_{\tau_j} P(\tau_j \mid \pi_j) \sum_{t=1}^{T} \frac{1}{T} \log \frac{\pi_i(\tau_j) + \pi_j(\tau_j)}{2\pi_j(\tau_j)}.$$

Since each of the policies considered only induces a single trajectory and $\pi_i(\tau_j) = 0$ for $i \ne j$, we can easily compute $D_A^{\mathrm{JSD}_1}(\pi_1, \pi_2) = D_A^{\mathrm{JSD}_1}(\pi_1, \pi_3) = \log 2$.

Wasserstein Distance. The Wasserstein distance, or Earth Mover's Distance (EMD), is 1 if two policies disagree on a state and 0 otherwise. Therefore, it is equal to $D_A^{\mathrm{JSD}_0}$.
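The two closed-form values above (infinite KL and $\mathrm{JSD}_1 = \log 2$ for deterministic policies with disjoint trajectories) can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the dictionary representation of trajectory distributions are our own assumptions, and the inner average over $t$ is dropped because, in the formula as written, its summand does not depend on $t$.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete action distributions; +inf if q misses p's support."""
    total = 0.0
    for p_a, q_a in zip(p, q):
        if p_a == 0.0:
            continue
        if q_a == 0.0:
            return math.inf
        total += p_a * math.log(p_a / q_a)
    return total

def jsd1(traj_prob_i, traj_prob_j):
    """JSD_1 from dicts mapping a trajectory id to its probability under each policy."""
    def half(own, other):
        acc = 0.0
        for tau, p in own.items():
            if p == 0.0:
                continue
            acc += p * math.log((p + other.get(tau, 0.0)) / (2.0 * p))
        return -0.5 * acc
    return half(traj_prob_i, traj_prob_j) + half(traj_prob_j, traj_prob_i)

# Deterministic action choices at one state: "right" vs "down".
right, down = [1.0, 0.0], [0.0, 1.0]
print(kl_divergence(right, down))

# Each deterministic policy induces a single trajectory (hypothetical ids).
pi_1 = {"tau_1": 1.0}
pi_2 = {"tau_2": 1.0}
print(jsd1(pi_1, pi_2), math.log(2.0))
```

Both $\pi_2$ and $\pi_3$ receive the same value under these two measures, which is the sense in which they fail to rank the policies by how differently they behave.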

B.5.2 ACTION NORM

We embed the action "right" as the vector $[1, 0]$ since it increases the x-coordinate by 1 and the action "down" as the vector $[0, -1]$ since it decreases the y-coordinate by 1. This embedding can be naturally extended to a continuous action space with velocity actions. Following Parker-Holder et al. (2020b), we compute the action norm over a uniform distribution on states. We can see that there are 7 states where $\pi_1$ and $\pi_2$ perform differently and 1 state (the initial state) where $\pi_1$ and $\pi_3$ perform differently. Therefore, we get $D(\pi_1, \pi_2) = \sqrt{7}$ and $D(\pi_1, \pi_3) = 1$.

B.5.3 STATE-DISTANCE-BASED MEASURES

State $L_2$ Norm. Similar to the action $L_2$ norm, we concatenate the coordinates instead of actions as the embedding and compute the $L_2$ norm between embeddings.

Wasserstein Distance. Wasserstein distance is tractable in the grid-world example. We consider 7 states (excluding the initial and final states) in each trajectory and compute the pairwise distances as a $14 \times 14$ matrix $C$. Then we solve the following linear program:

$$\min_{\gamma} \sum_{i,j} (\gamma \odot C)_{i,j} \quad \text{s.t.} \quad \gamma \mathbf{1}_{14} = a, \quad \gamma^{\top} \mathbf{1}_{14} = b, \quad \gamma_{i,j} \ge 0, \ 1 \le i, j \le 14,$$

where $\odot$ denotes element-wise multiplication, $\mathbf{1}_k$ is a $k$-dimensional all-one vector, and $a_{14 \times 1} = [\mathbf{1}_k^{\top}, \mathbf{0}_k^{\top}]^{\top}$ and $b_{14 \times 1} = [\mathbf{0}_k^{\top}, \mathbf{1}_k^{\top}]^{\top}$ (with $k = 7$) are the marginal state distributions of the two policies.
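For intuition, the linear program above can be evaluated without an LP solver in this small case. Because $a$ puts unit mass on the 7 states of the first policy and $b$ on the 7 states of the second, the feasible $\gamma$ is supported on a $7 \times 7$ block with unit row and column sums, so an optimum is attained at a permutation matrix (Birkhoff-von Neumann). The sketch below therefore swaps in a brute-force search over assignments; the trajectories, the Euclidean ground cost, and the function name are assumptions for illustration only.

```python
import itertools
import math

def wasserstein_uniform(states_a, states_b):
    """Optimal-transport cost between two equal-size state sets with unit masses.

    Brute force over all one-to-one assignments (fine for 7 states: 5040 cases).
    """
    assert len(states_a) == len(states_b)
    n = len(states_a)
    cost = [[math.dist(p, q) for q in states_b] for p in states_a]
    return min(
        sum(cost[r][perm[r]] for r in range(n))
        for perm in itertools.permutations(range(n))
    )

# Hypothetical intermediate states (initial and final states excluded) on a 5x5 grid,
# with "right" increasing x and "down" decreasing y.
right_then_down = [(1, 0), (2, 0), (3, 0), (4, 0), (4, -1), (4, -2), (4, -3)]
down_then_right = [(0, -1), (0, -2), (0, -3), (0, -4), (1, -4), (2, -4), (3, -4)]

print(wasserstein_uniform(right_then_down, down_then_right))
print(wasserstein_uniform(right_then_down, right_then_down))
```

For longer trajectories a dedicated assignment or LP solver would be the practical choice; the exhaustive search is used here only to keep the example dependency-free.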

B.6 HOW TO ADJUST CONSTRAINT-RELATED HYPERPARAMETERS

Three hyperparameters are essential in the implementation of the intrinsic reward $r_{\text{int}}$: the threshold $\delta$, the intrinsic reward scale factor $\alpha$, and the variance factor $\sigma$ in $r^{\text{RBF}}_{\text{int}}$. These parameters differ across domains and must be adjusted individually. We find proper parameters by running two iterations without constraints to obtain two similar policies $\pi_0$ and $\pi_1$. We record $r_{\text{int}}$ while training $\pi_1$; the trend is shown in Fig. 10. Not surprisingly, $r_{\text{int}}$ gradually decreases as training proceeds.

Threshold. We set $\delta = c_1 D_S(\pi_0, \pi_1)$. We try several values $c_1 \in \{1, 1.2, 1.4, 1.6, 1.8, 2.0\}$ and find that $c_1 = 1.2$ or $1.4$ works well in all the experimental environments.

Intrinsic Scale Factor. We need to balance the intrinsic reward $r_{\text{int}}$ and the original reward $J$ so that neither of the two rewards dominates the training process. Empirically, the maxima of the two rewards should be of the same order of magnitude, i.e., $\max_\pi J(\pi) = \alpha \times c_2 \lambda_{\max} \delta$, where $c_2 = O(1)$. When $c_2$ is too large, the newly trained policy $\pi_j$ will oscillate near the boundary $D(\pi_i, \pi_j) = \delta$ for some previously trained policy $\pi_i$. Conversely, when $c_2$ is too small, the intrinsic reward $r_{\text{int}}$ cannot yield diverse strategies. In experiments, we set $c_2$ between 0.8 and 1.0.

Variance Factor. We sweep the variance factor across $\{1\mathrm{e}{-3}, 5\mathrm{e}{-3}, 1\mathrm{e}{-2}, 2\mathrm{e}{-2}, 1\mathrm{e}{-3}\}$ by training $\pi_1$ and observe the trend of intrinsic rewards. We find the steepest trend and select the corresponding $\sigma$. Empirically, we find that our algorithm performs robustly well when $\sigma^2 = 0.02$. The $\delta$ and $\alpha$ of GRF and SMAC are listed in Table 9.
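The two rules of thumb for $\delta$ and $\alpha$ amount to simple arithmetic, sketched below. The function names and the numbers in the usage lines are hypothetical; only the relations $\delta = c_1 D_S(\pi_0, \pi_1)$ and $\max_\pi J(\pi) = \alpha \times c_2 \lambda_{\max} \delta$ come from the text.

```python
def diversity_threshold(d_s_unconstrained, c1=1.2):
    """delta = c1 * D_S(pi_0, pi_1), with pi_0 and pi_1 trained without constraints."""
    return c1 * d_s_unconstrained

def intrinsic_scale(max_return, delta, lambda_max, c2=1.0):
    """Solve max_pi J(pi) = alpha * c2 * lambda_max * delta for alpha."""
    return max_return / (c2 * lambda_max * delta)

# Hypothetical measurements: D_S(pi_0, pi_1) = 0.5, best return 1.0, lambda_max = 10.
delta = diversity_threshold(0.5, c1=1.2)
alpha = intrinsic_scale(max_return=1.0, delta=delta, lambda_max=10.0, c2=1.0)
print(delta, alpha)
```

Choosing $c_2$ closer to 0.8 gives a slightly larger $\alpha$ for the same measurements, i.e., a stronger push toward diversity.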

B.7 EVALUATION PROTOCOL OF TABLE 4 AND TABLE 5

In 3v1 and CA, players pass and shoot in the front yard. We consider two strategies to be different if the resulting trajectories of ball movement are different, e.g., the ball is passed to different players or different players take the shot. In Corner, besides ball movement, we further categorize pass-and-shoot strategies according to the position of shooting in the penalty box (e.g., lower/middle/upper spot). All the authors performed independent evaluations based on this criterion, and strong agreement was achieved. Please check our project website for GIF demonstrations.

C ENVIRONMENT DETAILS

C.1 DETAILS OF THE 2D NAVIGATION ENVIRONMENT

The navigation environment has an agent circle with size $a$ and 4 landmark circles with size $b$. We pre-specify a threshold $c$ and require that the distance between final states reaching different landmarks be larger than $c$. Correspondingly, landmark circles are randomly initialized subject to the pairwise distance between centers being larger than $c + 2(a + b)$, so that the final-state constraint is valid. An episode ends if the agent touches any landmark, i.e., the distance $d$ between the center of the agent and the center of the landmark satisfies $d < a + b$, or if 1000 timesteps have elapsed. The observation space includes the positions of the agent and all landmarks, which form a 10-dimensional vector. The action space is a 2-dimensional vector, which is the agent velocity. The time interval is set to $\Delta t = 0.1$, i.e., the next position is computed by $x_{t+1} = x_t + \Delta t \cdot v$. The reward is 1 if the agent touches a landmark and 0 otherwise.
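A minimal sketch of these dynamics is given below. It is not the environment code used in the experiments: the class and function names are hypothetical, landmark positions are passed in rather than sampled, and the reward of 1 on contact is an assumption of this sketch.

```python
import math

class Navigation2D:
    """Agent circle of size a moving among landmark circles of size b."""

    def __init__(self, agent_pos, landmarks, a, b, dt=0.1, max_steps=1000):
        self.pos = list(agent_pos)
        self.landmarks = [tuple(lm) for lm in landmarks]
        self.a, self.b, self.dt, self.max_steps = a, b, dt, max_steps
        self.t = 0

    def observation(self):
        """Agent position followed by all landmark positions (10-dim for 4 landmarks)."""
        obs = list(self.pos)
        for lm in self.landmarks:
            obs.extend(lm)
        return obs

    def step(self, velocity):
        """x_{t+1} = x_t + dt * v; the episode ends on contact or after max_steps."""
        self.pos = [x + self.dt * v for x, v in zip(self.pos, velocity)]
        self.t += 1
        touched = any(math.dist(self.pos, lm) < self.a + self.b for lm in self.landmarks)
        reward = 1.0 if touched else 0.0  # assumed sparse reward on contact
        done = touched or self.t >= self.max_steps
        return self.observation(), reward, done

def valid_layout(landmarks, a, b, c):
    """Check that all landmark centers are more than c + 2(a + b) apart."""
    return all(
        math.dist(p, q) > c + 2 * (a + b)
        for k, p in enumerate(landmarks)
        for q in landmarks[k + 1:]
    )

# Hypothetical layout with one landmark on each side of the agent.
landmarks = [(1.0, 0.0), (-5.0, 0.0), (0.0, 5.0), (0.0, -5.0)]
env = Navigation2D((0.0, 0.0), landmarks, a=0.1, b=0.1)
obs, reward, done = env.step((1.0, 0.0))
print(obs[:2], reward, done)
```

The separation check reflects why the margin is $c + 2(a + b)$: two final agent positions can each lie up to $a + b$ from their landmark centers, so centers more than $c + 2(a + b)$ apart guarantee final states more than $c$ apart.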

C.2 DETAILS OF SMAC, GRF, AND MUJOCO

SMAC. We adopt the SMAC environment in the MAPPO codebase (https://github.com/marlbenchmark/on-policy) with the same configuration as Yu et al. (2021). The input of intrinsic rewards or the diversity measure is the state of all allies, including positions, health, etc.

GRF. We adopt the "simple115v2" representation as the observation with both "scoring" and "checkpoint" rewards. The reward is shared across all agents. The input of intrinsic rewards or the diversity measure is the position and velocity of all attackers and the ball.

MuJoCo. We use the Humanoid-v4 environment in OpenAI gym version 0.21.0 with the default configuration. To remove irrelevant or unchangeable features, we use the first 45-dimension of 

F.1 THE FAILURE CASE OF STATE-BASED DIVERSITY MEASURES

A failure case of state-based diversity measures may arise when the state space includes many irrelevant features. These features cannot reflect behavioral differences. If we run SIPO in such an environment, the learned strategies may be diverse only w.r.t. these features and have little visual distinction. Like the famous noisy TV problem (Burda & Edwards, 2018), the issue of irrelevant features is intrinsically challenging for general RL applications and cannot be resolved by using action-based diversity measures either. Thanks to the advantages we discussed in the paper, we generally find that state-based metrics can be preferred in challenging RL tasks. Meanwhile, since the state dimension can be much higher than the action dimension, RL optimization over states may accordingly be more difficult than over actions. In practice, we can design a feature selector that keeps the features most relevant to visual diversity and run diversity learning over the filtered features. In SMAC and GRF, we utilize the agent features (excluding enemies) as the input of the diversity constraint without further modifications, as discussed in Appendix D. We remark that even after filtering, the agent features remain high-dimensional while our algorithm still works well. Note that using a feature selector is a common practice in many existing domains, such as novelty search (Cully et al., 2015), exploration (Liu et al., 2021a), and curriculum learning (Campero et al., 2021). There are also works studying how to extract useful low-dimensional features from observations (Wu et al., 2019; Ghosh et al., 2019), which are orthogonal to our focus.
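The effect of a feature selector can be seen in a toy sketch. The state layout, index list, and function names below are hypothetical and only illustrate the idea of measuring state distance on a filtered subset of features.

```python
def select_features(state, keep):
    """Keep only the listed feature indices of a state vector."""
    return [state[k] for k in keep]

def filtered_sq_distance(s1, s2, keep):
    """Squared Euclidean distance computed on the selected features only."""
    f1, f2 = select_features(s1, keep), select_features(s2, keep)
    return sum((u - v) ** 2 for u, v in zip(f1, f2))

# Hypothetical layout: [agent_x, agent_y, noise_0, noise_1].
AGENT_FEATURES = [0, 1]
ALL_FEATURES = [0, 1, 2, 3]

same_behavior_a = [1.0, 2.0, 0.3, -0.7]
same_behavior_b = [1.0, 2.0, 9.1, 4.2]   # differs only in the irrelevant features

print(filtered_sq_distance(same_behavior_a, same_behavior_b, ALL_FEATURES))
print(filtered_sq_distance(same_behavior_a, same_behavior_b, AGENT_FEATURES))
```

Without filtering, the two states look far apart even though the agent is in the same place; with filtering, the distance is zero, so a diversity reward computed on it cannot be earned through the irrelevant features.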

F.2 THE DISTANCE METRIC IN STATE-BASED DIVERSITY MEASURES

In Sec. 5, we adopt the two most popular implementations in the machine learning literature, i.e., the RBF kernel and Wasserstein distance, although alternative implementations are equally admissible. For example, we can learn state representations (e.g., auto-encoder, Laplacian, or successor features) and use pairwise distances or norms as a diversity measure. Similar topics have been extensively discussed in the exploration literature (Wu et al., 2019; Machado et al., 2020). We leave them as future directions.

G PSEUDOCODE OF SIPO

The pseudocode of SIPO is shown in Algorithm 1.

Algorithm 1 SIPO (red for SIPO-RBF and blue for SIPO-WD)
    for archive index $j = 1, \dots, i - 1$ do
        $\lambda_j \leftarrow \mathrm{clip}\left(\lambda_j + \eta_\lambda\left(-R^j_{\text{int}} + \delta\right), 0, \lambda_{\max}\right)$    ▷ gradient ascent on $\lambda_j$
        $\phi_j \leftarrow \phi_j + \eta_W$
    $X \leftarrow X \cup \{\chi_i\}$    ▷ for the use of following iterations
end for
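The multiplier update in Algorithm 1 is projected gradient ascent on $\lambda_j$, interleaved with a descent step on the policy. The sketch below runs this gradient descent-ascent scheme on a one-dimensional toy problem that we introduce purely for illustration: the "policy" is a scalar $x$, the return is $J(x) = 1 - x^2$, a single archived policy sits at 0, and the distance is $D(x, 0) = |x|$. None of these choices, nor the step sizes, come from the paper's experiments.

```python
def clip(value, low, high):
    return max(low, min(high, value))

def toy_gda(delta=0.3, lam_max=10.0, eta_x=0.01, eta_lam=0.1, steps=5000, x0=1.0):
    """Descent on x and projected ascent on lam for
    L(x, lam) = -(1 - x^2) - lam * (|x| - delta)."""
    x, lam = x0, 0.0
    for _ in range(steps):
        sign = 1.0 if x >= 0.0 else -1.0
        grad_x = 2.0 * x - lam * sign          # dL/dx
        r_int = abs(x)                         # stands in for R_int, i.e. D(x, 0)
        new_x = x - eta_x * grad_x             # slow step on the "policy"
        # Same form as the update in Algorithm 1: lam + eta * (-R_int + delta), clipped.
        lam = clip(lam + eta_lam * (-r_int + delta), 0.0, lam_max)
        x = new_x
    return x, lam

x_final, lam_final = toy_gda()
print(x_final, lam_final)
```

In this toy problem the unconstrained optimum $x = 0$ violates $|x| \ge \delta$, so the multiplier grows until the iterate settles on the constraint boundary, where stationarity requires $2x = \lambda$.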



1 https://github.com/marlbenchmark/on-policy
2 https://github.com/dtak/DIPG-public
3 https://github.com/footoredo/rspo-iclr-2022



Figure 2: Illustration of the learning process of PBT and IL in a 2-D navigation environment with 4 modes. PBT will not uniformly converge to different landmarks as computation can be either too costly or unstable. By contrast, IL repeatedly excludes a particular landmark, such that the policy in the next iteration can continuously explore until a novel landmark is discovered.

Figure 4: Duplicate actions in multiagent football. For players who are not involved in the attack, actions like "pass", "shoot", and "slide" result in the same consequence. Diversity measures should not focus on these actions.

Figure 5: Heatmaps of agent positions in SMAC across 4 iterations with SIPO-RBF.

Figure 7: Visualization of learned behaviors in GRF CA across a single training trial.

Figure 9: Policies in the grid-world example when N G " 5.

Figure 10: Average intrinsic reward during training π 1 .



Number of visually distinct strategies in GRF discovered by different methods. Population size $M = 4$ in all cases. Details of the evaluation protocol can be found in Appendix B.

Population diversity in SMAC.

Number of visually distinct strategies in GRF discovered

Population diversity in GRF. Mean values averaged over 3 random seeds are shown with standard deviation in the brackets. Population size $M = 4$.

Population diversity and the number of distinct strategies in the 3v1 scenario with population size $M = 10$. Mean values averaged over 3 random seeds are shown with standard deviation in the brackets.

Evaluation win rate (%) of the demonstrated visualization results. Averaged across 3 seeds with standard deviation shown in brackets.

The values of δ and α in different environments.

Hyperparameters in the 2D navigation environment.

Common hyperparameters for SIPO, baselines, and ablations.

Input: Number of Iterations $M$, Number of Training Steps within Each Iteration $T$.
Hyperparameter: Learning Rate $\eta_\pi$, Diversity Threshold $\delta$, Intrinsic Scale Factor $\alpha$, Lagrange Multiplier Upperbound $\lambda_{\max}$, Lagrange Learning Rate $\eta_\lambda$, Wasserstein Critic Learning Rate $\eta_W$, RBF Kernel Variance $\sigma$.
Archived trajectories $X \leftarrow \emptyset$    ▷ to store states visited by previous policies
for iteration $i = 1, \dots, M$ do
    $\phi_j \leftarrow \phi_j + \eta_W \nabla_{\phi_j}\left(f_{\phi_j}(s_h) - \frac{1}{|\chi_j|} \sum_{s' \in \chi_j} f_{\phi_j}(s')\right)$
    $\phi_j \leftarrow \mathrm{clip}(\phi_j, -0.01, 0.01)$
    Update $\pi_{\theta_i}$ with $\{(s_h, a_h, r_h)\}$ by PPO algorithm    ▷ policy gradient on $\theta_i$
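The critic update and weight clipping above can be made concrete with a deliberately simple critic. The sketch assumes a linear critic $f_\phi(s) = \phi^{\top} s$, which is our simplification (the paper's critic is a neural network), so that the gradient of $f_\phi(s_h) - \frac{1}{|\chi_j|}\sum_{s' \in \chi_j} f_\phi(s')$ with respect to $\phi$ is just $s_h$ minus the mean archived state. Function names and numbers are hypothetical.

```python
def mean_state(states):
    """Coordinate-wise mean of a list of state vectors."""
    n = len(states)
    return [sum(s[k] for s in states) / n for k in range(len(states[0]))]

def critic_step(phi, s_h, archive_states, eta_w, clip_value=0.01):
    """One ascent step on f(s_h) - mean f(archive) for a linear critic, then clipping."""
    center = mean_state(archive_states)
    grad = [s - c for s, c in zip(s_h, center)]           # gradient w.r.t. phi
    phi = [p + eta_w * g for p, g in zip(phi, grad)]      # ascent step
    return [max(-clip_value, min(clip_value, p)) for p in phi]  # weight clipping

# Hypothetical 2-D states: one state from the current policy, two archived states.
phi = [0.0, 0.0]
phi = critic_step(phi, s_h=[1.0, -1.0], archive_states=[[0.0, 0.0], [0.0, 0.0]], eta_w=0.001)
print(phi)
```

Clipping the weights to a small box keeps the critic's slope bounded, which is the usual way a Wasserstein critic is kept approximately Lipschitz.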

D IMPLEMENTATION DETAILS

D.1 2D NAVIGATION

We apply PPO to optimize the policy and hyperparameters are summarized in Table 10 . The applied algorithm is the same as SIPO (see Appendix G) except that the intrinsic reward is only computed at the last timestep.

D.2 SIPO

We include all practical tricks mentioned in Yu et al. (2021) because we find them all critical to algorithm performance. We use separate actor and critic networks, both with hidden size 64 and a GRU layer with hidden size 64. The common hyperparameters for SIPO, baselines, and ablations are listed in Table 11. Other environment-specific parameters, such as PPO epochs and mini-batch size, are all the same as in Yu et al. (2021). Besides, Table 9 and Table 12 list some extra hyperparameters for SIPO. Specific hyperparameters for baselines can be found in Appendix D.3.

D.3 BASELINES

We re-implement all baselines with PPO based on the MAPPO (Yu et al., 2021) project. All algorithms run for the same number of environment frames.

SMERL. We implement SMERL (Kumar et al., 2020) with PPO, where the actor and the critic take as input the concatenation of the observation and a one-hot latent variable. The discriminator is a 2-layer feed-forward network with 64 hidden units. The learning rate of the discriminator is the same as the learning rate of the critic network. The input of the discriminator is the same as the input we use for SIPO-WD. The critic has 2 value heads for an accurate estimation of the intrinsic return. Since SMERL trains a single latent-conditioned policy, we train SMERL for $M\times$ more environment steps, such that the total number of environment frames is the same. The scaling factor of intrinsic rewards is 0.1 and the threshold for diversification is $[0.81, 0.45, 0.72]$ ($0.9 \times [0.9, 0.5, 0.8]$) for "3v1", "counterattack", and "corner" respectively.

TrajDi. We also try TrajDi (Lupu et al., 2021) in the GRF domain. We sweep the action discount factor among $\{0.1, 0.5, 0.9\}$ and the coefficient of the TrajDi loss among $\{0.1, 0.01, 0.001\}$. However, TrajDi fails to converge in the "3v1" scenario and exceeds the GPU memory in the "counterattack" and "corner" scenarios. Therefore, we exclude the performance of TrajDi from the main body.

DvD. We concatenate the one-hot actions along a trajectory as the behavioral embedding. The square of the variance factor, i.e., $\sigma^2$ in the RBF kernel, is set to be the length of the behavioral embedding. We also use the same Bayesian bandits as proposed in Parker-Holder et al. (2020b). Training DvD in "counterattack" and "corner" exceeds the GPU memory, and we exclude the results from the main body.

DIPG. For DIPG (Masood & Doshi-Velez, 2019), we follow the open-source implementation (https://github.com/dtak/DIPG-public). We set the same variance factor in the RBF kernel as SIPO-RBF and apply the same state as the input of the RBF kernel. We sweep the coefficient of the MMD loss among $\{0.1, 0.5, 0.9\}$ and find 0.1 the most appropriate (larger values cause training instability). We use the same method to save archived trajectories as SIPO. To improve training efficiency, we only back-propagate the MMD loss at the first PPO epoch, but the training is still much slower (∼17 h/iteration for 3v1) than SIPO-RBF (∼12 h/iteration for 3v1).

RSPO. For RSPO (Zhou et al., 2022), we follow the open-source implementation (https://github.com/footoredo/rspo-iclr-2022) and, for GRF experiments, use the hyperparameters that the original paper uses on the SMAC 2c_vs_64zg map.
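For the DvD baseline described above, the behavioral embedding and kernel bandwidth can be sketched as follows. This is an illustration under our own assumptions rather than the baseline's code: the function names are hypothetical, and we assume the squared-exponential form $\exp(-\|e_1 - e_2\|^2 / (2\sigma^2))$ with $\sigma^2$ set to the embedding length.

```python
import math

def one_hot_embedding(actions, num_actions):
    """Concatenate one-hot encodings of the discrete actions along a trajectory."""
    embedding = []
    for action in actions:
        one_hot = [0.0] * num_actions
        one_hot[action] = 1.0
        embedding.extend(one_hot)
    return embedding

def dvd_rbf_kernel(actions_1, actions_2, num_actions):
    """RBF similarity of two action sequences with sigma^2 = embedding length."""
    e1 = one_hot_embedding(actions_1, num_actions)
    e2 = one_hot_embedding(actions_2, num_actions)
    sigma_sq = float(len(e1))
    sq_dist = sum((u - v) ** 2 for u, v in zip(e1, e2))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

# Two hypothetical 4-step trajectories over 2 actions that disagree at every step.
print(dvd_rbf_kernel([0, 0, 0, 0], [1, 1, 1, 1], num_actions=2))
print(dvd_rbf_kernel([0, 1, 0, 1], [0, 1, 0, 1], num_actions=2))
```

Each disagreeing timestep contributes 2 to the squared distance, so tying $\sigma^2$ to the embedding length keeps the kernel value on a comparable scale across trajectory lengths.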

D.4 ABLATION STUDY DETAILS

For the three ablation studies (fix-L, CE, and filter), we list the specific hyperparameters here:
• fix-L: we set the Lagrange multiplier to be 0.2;
• CE: the threshold is 3.800 and the intrinsic reward scale factor is 1/1000 of that in the WD setting;
• filter: all the hyperparameters are the same as those in the WD setting.

E PROOFS

E.1 PROOF OF THEOREM 4.1

Theorem 4.1. Assume $D$ is a distance metric. Denote the optimal value of Problem 1 as $T_1$. Let $T_2 = \sum_{i=1}^{M} J(\tilde\pi_i)$, where

$$\tilde\pi_i = \arg\max_{\pi_i} J(\pi_i) \quad \text{s.t.} \quad D(\pi_i, \tilde\pi_j) \ge \delta/2, \ \forall 1 \le j < i. \tag{3}$$

Then $T_2 \ge T_1$.

Proof. Suppose the optimal solution of Problem 1 is $\pi_1, \pi_2, \dots, \pi_M$ satisfying $J(\pi_1) \ge J(\pi_2) \ge \dots \ge J(\pi_M)$ and the optimal solution of Problem 4 is $\tilde\pi_1, \tilde\pi_2, \dots, \tilde\pi_M$ satisfying $J(\tilde\pi_1) \ge J(\tilde\pi_2) \ge \dots \ge J(\tilde\pi_M)$.

Assume to the contrary that Thm. 4.1 is not true, which means

$$T_1 = \sum_{i=1}^{M} J(\pi_i) > \sum_{i=1}^{M} J(\tilde\pi_i) = T_2.$$

Then we choose the smallest number $N \le M$ that satisfies

$$\sum_{i=1}^{N} J(\pi_i) > \sum_{i=1}^{N} J(\tilde\pi_i).$$

By $T_1 > T_2$ we know that $N$ exists. In addition, because Problem 4 solves unconstrained RL in the first iteration, we know that $\tilde\pi_1 = \arg\max_\pi J(\pi)$ and then $J(\pi_1) \le J(\tilde\pi_1)$. Therefore, $N \ge 2$.

Suppose $J(\pi_N) \le J(\tilde\pi_N)$. Then we have

$$\sum_{i=1}^{N-1} J(\pi_i) > \sum_{i=1}^{N-1} J(\tilde\pi_i),$$

contradicting the fact that $N$ is the smallest number satisfying that inequality. Hence, we know that $J(\pi_N) > J(\tilde\pi_N)$. Then $J(\pi_1) \ge J(\pi_2) \ge \dots \ge J(\pi_N) > J(\tilde\pi_N)$.

Consider the optimization problem of $\tilde\pi_N$:

$$\tilde\pi_N = \arg\max_{\pi} J(\pi) \quad \text{s.t.} \quad D(\pi, \tilde\pi_j) \ge \delta/2, \ \forall 1 \le j < N.$$

This optimization does not find any of $\{\pi_1, \dots, \pi_N\}$ but finds $\tilde\pi_N$, which means that for each $\pi_i$, $1 \le i \le N$, there exists $1 \le j_i < N$ such that $D(\pi_i, \tilde\pi_{j_i}) < \delta/2$. Otherwise, we would get $\pi_i$ as the solution of the above problem instead of $\tilde\pi_N$.

By the Pigeonhole Principle, we know that there exist two indices $i_1 \in [N]$ and $i_2 \in [N]$ ($i_1 \ne i_2$) such that $j_{i_1} = j_{i_2} = \hat{j}$. Then we have

$$D(\pi_{i_1}, \pi_{i_2}) \le D(\pi_{i_1}, \tilde\pi_{\hat{j}}) + D(\pi_{i_2}, \tilde\pi_{\hat{j}}) < \delta/2 + \delta/2 = \delta,$$

where the first inequality follows from the triangle inequality of the distance function. This contradicts the fact that $D(\pi_{i_1}, \pi_{i_2}) \ge \delta$ in Problem 1. Therefore, the theorem is proved.

E.2 PROOF OF THEOREM 5.1

In this section, we consider the $i$-th iteration of SIPO illustrated in Eq. (2).
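For the sake of simplicity, we write $a \le \lambda \le b$ for a vector $\lambda$ to denote that each component of $\lambda$ satisfies $a \le \lambda_i \le b$, where $a, b \in \mathbb{R}$. We use $\pi$ to denote the policy we are optimizing, and $\pi_j$ ($1 \le j < i$) to denote a previously obtained policy. We denote the Lagrange function as $L(\pi, \lambda) = -J(\pi) - \sum_{j=1}^{i-1} \lambda_j (D(\pi, \pi_j) - \delta)$. To prove Theorem 5.1, we consider the following two optimization problems:

$$(\pi_i, \lambda^\star) = \arg\min_{\pi} \max_{\lambda \ge 0} L(\pi, \lambda) \tag{11}$$

and

$$(\tilde\pi_i, \tilde\lambda^\star) = \arg\min_{\pi} \max_{0 \le \lambda \le \Lambda} L(\pi, \lambda), \tag{12}$$

where $\Lambda = 1/\epsilon_0$ and $\epsilon_0 > 0$ is sufficiently small.

Assumption E.1. $0 \le J(\cdot) \le 1$.

Assumption E.2. $\forall \lambda \ge 0$, $L(\cdot, \lambda)$ is $l$-smooth and $\zeta$-Lipschitz.

Lemma E.3. $J(\pi_i) \le J(\tilde\pi_i)$.

Proof. As the domain of $\lambda$ in Eq. (12) is smaller than in Eq. (11), we have $L(\pi_i, \lambda^\star) \ge L(\tilde\pi_i, \tilde\lambda^\star)$. By the fundamental property of Lagrange duality, we know that $L(\pi_i, \cdot)$ achieves its optimal value when $\lambda = 0$ and the optimal value is $-J(\pi_i)$. By the optimality of $(\tilde\pi_i, \tilde\lambda^\star)$, we know that

$$-\sum_{j=1}^{i-1} \tilde\lambda^\star_j \left(D(\tilde\pi_i, \pi_j) - \delta\right) \ge 0.$$

Then we have

$$-J(\pi_i) = L(\pi_i, \lambda^\star) \ge L(\tilde\pi_i, \tilde\lambda^\star) = -J(\tilde\pi_i) - \sum_{j=1}^{i-1} \tilde\lambda^\star_j \left(D(\tilde\pi_i, \pi_j) - \delta\right) \ge -J(\tilde\pi_i).$$

Lemma E.4. Under Assumption E.1, $D(\tilde\pi_i, \pi_j) \ge \delta - \epsilon_0$, $\forall 1 \le j < i$.

Proof. We prove by contradiction. Suppose there exists $1 \le j_0 < i$ such that $D(\tilde\pi_i, \pi_{j_0}) < \delta - \epsilon_0$. Then we choose $\bar\lambda$ such that

$$\bar\lambda_j = \begin{cases} \Lambda, & j = j_0, \\ 0, & 1 \le j < i, \ j \ne j_0. \end{cases}$$

By Assumption E.1, Eq. (13), and $\Lambda = 1/\epsilon_0$, we have

$$0 \ge -J(\pi_i) = L(\pi_i, \lambda^\star) \ge L(\tilde\pi_i, \tilde\lambda^\star) \ge L(\tilde\pi_i, \bar\lambda) \ge -1 - \Lambda\left(D(\tilde\pi_i, \pi_{j_0}) - \delta\right) > 0.$$

That is a contradiction. So we have proved that $D(\tilde\pi_i, \pi_j) \ge \delta - \epsilon_0$, $\forall 1 \le j < i$.

Lemma E.5. (Lin et al. (2020), Theorem 4.8) Under Assumption E.2, solving Eq. (12) via two-timescale GDA with learning rates $\eta_\pi = \Theta(\epsilon^4 / l^3 \zeta^2 \Lambda^2)$ and $\eta_\lambda = \Theta(1/l)$ converges to an $\epsilon$-stationary point $\pi_i^\star$ within a number of iterations whose bound depends on constants $C_1$ and $C_2$, which in turn depend on the distance between the initial point and the optimal point.

Theorem 5.1. Under moderate assumptions, SIPO converges to a neighbourhood of an $\epsilon$-approximate KKT point.

Proof. At the $i$-th ($1 \le i \le M$) iteration, SIPO solves the following constrained optimization problem:

$$\min_{\pi_i} -J(\pi_i) \quad \text{s.t.} \quad D(\pi_i, \pi_j) \ge \delta, \ \forall 1 \le j < i.$$

Consider the Lagrange function $L(\pi, \lambda) = -J(\pi) - \sum_{j=1}^{i-1} \lambda_j (D(\pi, \pi_j) - \delta)$. Denote the optimal solutions of Eq. (11) and Eq. (12) as $(\pi_i, \lambda^\star)$ and $(\tilde\pi_i, \tilde\lambda^\star)$ respectively. By Lemma E.3 and Lemma E.4 we have

$$J(\pi_i) \le J(\tilde\pi_i), \qquad D(\tilde\pi_i, \pi_j) \ge \delta - \epsilon_0, \ \forall 1 \le j < i,$$

and therefore we only need to consider the following nonconvex-concave optimization problem:

$$\min_{\pi} \max_{0 \le \lambda \le \Lambda} L(\pi, \lambda).$$

Following Lemma E.5, we know that the two-timescale GDA algorithm converges to an $\epsilon$-stationary point $\pi_i^0$. Denote $\Phi(\pi) = \max_{0 \le \lambda \le \Lambda} L(\pi, \lambda)$ and $\|\cdot\|$ as the Euclidean norm. Using the property of the $\epsilon$-stationary point $\pi_i^0$ in Lin et al. (2020) (Lemma 3.8), we know that there exists $\hat\pi_i$ such that

$$\min_{\xi \in \partial \Phi(\hat\pi_i)} \|\xi\| \le \epsilon \quad \text{and} \quad \|\hat\pi_i - \pi_i^0\| \le \epsilon / 2l.$$

From the definition of $L(\hat\pi_i, \lambda)$, we know that $\hat\pi_i$ is an $\epsilon$-approximate KKT point of $J(\pi)$ (Dutta et al., 2013). From the above deduction, the two-timescale GDA algorithm converges to an $\epsilon$-neighbourhood of an $\epsilon$-approximate KKT point of the above problem. The theorem then follows by applying the smoothness assumption.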
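As a numerical sanity check of Theorem 4.1 (Appendix E.1), the sketch below compares the two frameworks on random toy instances. The setting is entirely hypothetical: policies are integer points on a line, $D(x, y) = |x - y|$ (a metric), rewards are random integers, PBT is solved by exhaustive search under the threshold $\delta$, and IL greedily picks the best remaining point under the threshold $\delta/2$.

```python
import itertools
import random

def pbt_optimum(points, rewards, m, delta):
    """Best total reward over m points that are pairwise at least delta apart."""
    best = None
    for combo in itertools.combinations(range(len(points)), m):
        if all(abs(points[a] - points[b]) >= delta
               for a, b in itertools.combinations(combo, 2)):
            total = sum(rewards[k] for k in combo)
            if best is None or total > best:
                best = total
    return best

def il_greedy(points, rewards, m, delta):
    """Iteratively pick the best point at least delta / 2 from all earlier picks."""
    chosen = []
    for _ in range(m):
        feasible = [k for k in range(len(points))
                    if all(abs(points[k] - points[j]) >= delta / 2 for j in chosen)]
        if not feasible:
            return None
        chosen.append(max(feasible, key=lambda k: rewards[k]))
    return sum(rewards[k] for k in chosen)

def check_instance(seed, n=12, m=3, delta=20):
    """Return (T1, T2) for one random instance; T1 is None if PBT is infeasible."""
    rng = random.Random(seed)
    points = rng.sample(range(100), n)
    rewards = [rng.randint(0, 100) for _ in range(n)]
    return pbt_optimum(points, rewards, m, delta), il_greedy(points, rewards, m, delta)

for seed in range(5):
    print(check_instance(seed))
```

Whenever the PBT problem is feasible, the pigeonhole argument in the proof also guarantees that the greedy IL step always has a feasible candidate, so the second entry is never missing in that case.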

