GENERATING DIVERSE COOPERATIVE AGENTS BY LEARNING INCOMPATIBLE POLICIES

Abstract

Training a robust cooperative agent requires diverse partner agents. However, obtaining those agents is difficult. Previous works aim to learn diverse behaviors by changing the state-action distribution of agents. But, without information about the task's goal, the diversified agents are not guided to find other important, albeit sub-optimal, solutions: the agents might learn only variations of the same solution. In this work, we propose to learn diverse behaviors via policy compatibility. Conceptually, policy compatibility measures whether policies of interest can coordinate effectively. We theoretically show that incompatible policies are not similar. Thus, policy compatibility, which has been used exclusively as a measure of robustness, can be used as a proxy for learning diverse behaviors. Then, we incorporate the proposed objective into a population-based training scheme to allow concurrent training of multiple agents. Additionally, we use state-action information to induce local variations of each policy. Empirically, the proposed method consistently discovers more solutions than baseline methods across various multi-goal cooperative environments. Finally, in multi-recipe Overcooked, we show that our method produces populations of behaviorally diverse agents, which enables generalist agents trained with such a population to be more robust.

1. INTRODUCTION

Cooperating with unseen agents (e.g., humans) in multi-agent systems is a challenging problem. Current state-of-the-art cooperative multi-agent reinforcement learning (MARL) techniques can produce highly competent agents in cooperative environments (Kuba et al., 2021; Yu et al., 2021). However, those agents are often overfitted to their training partners and cannot coordinate with unseen agents effectively (Carroll et al., 2019; Bard et al., 2020; Hu et al., 2020; Mahajan et al., 2022).

The problem of working with unseen partners, i.e., the ad-hoc teamwork problem (Stone et al., 2010), has been tackled in many different ways (Albrecht & Stone, 2018; Carroll et al., 2019; Shih et al., 2020; Gu et al., 2021; Rahman et al., 2021; Zintgraf et al., 2021; He et al., 2022; Mirsky et al., 2022; Parekh et al., 2022). These methods allow an agent to learn how to coordinate with unseen agents and, sometimes, humans. However, the success of these methods depends on the quality of the training partners; it has been shown that the diversity of training partners is crucial to the generalization of the agent (Charakorn et al., 2021; Knott et al., 2021; Strouse et al., 2021; McKee et al., 2022; Muglich et al., 2022). In spite of its importance, obtaining a diverse set of partners is still an open problem.

The simplest way to generate training partners is to use hand-crafted policies (Ghosh et al., 2020; Xie et al., 2021; Wang et al., 2022), domain-specific reward shaping (Leibo et al., 2021; Tang et al., 2021; Yu et al., 2023), or multiple runs of the self-play training process (Grover et al., 2018; Strouse et al., 2021). These methods, however, are neither scalable nor guaranteed to produce diverse behaviors. Prior works propose techniques aiming to generate diverse agents by changing the state visitation and action distributions (Lucas & Allen, 2022), or the joint trajectory distribution of the agents (Mahajan et al., 2019; Lupu et al., 2021). However, as discussed by Lupu et al. (2021), there is a potential drawback of using such information from trajectories to diversify behaviors. Specifically, agents that make locally different decisions do not necessarily exhibit different high-level behaviors.

To avoid this potential pitfall, we propose an alternative approach for learning diverse behaviors using information about the task's objective via the expected return. In contrast to previous works that use the joint trajectory distribution to represent behavior, we use policy compatibility instead. Because cooperative environments commonly require all agents to coordinate on the same solution, if the agents have learned different solutions, they cannot coordinate effectively and, thus, are incompatible. Consequently, if an agent discovers a solution that is incompatible with all other agents in a population, then the solution must be unique relative to the population. Based on this reasoning, we introduce a simple but effective training objective that regularizes agents in a population to find solutions that are compatible with their partner agents but incompatible with others in the population. We call this method "Learning Incompatible Policies" (LIPO). We theoretically show that optimizing the proposed objective yields a distinct policy. Then, we extend the objective to a population-based training scheme that allows concurrent training of multiple policies. Additionally, we utilize a mutual information (MI) objective to diversify the local behaviors of each policy.
Empirically, without using any domain knowledge, LIPO can discover more solutions than previous methods under various multi-goal settings. To further study the effectiveness of LIPO in a complex environment, we present a multi-recipe variant of Overcooked and show that LIPO produces behaviorally diverse agents that prefer to complete different cooking recipes. Experimental results across three environments suggest that LIPO is robust to the state and action spaces, the reward structure, and the number of possible solutions. Finally, we find that training generalist agents with a diverse population produced by LIPO yields more robust agents than training with a less diverse baseline population. See our project page at https://bit.ly/marl-lipo 

2. PRELIMINARIES

Our main focus lies in fully cooperative environments modeled as decentralized partially observable Markov decision processes (Dec-POMDP, Bernstein et al. (2002)). In this work, we start our investigation in the two-player variant. A two-player Dec-POMDP is defined by a tuple (S, A^1, A^2, Ω^1, Ω^2, T, O, r, γ, H), where S is the state space, and A ≡ A^1 × A^2 and Ω ≡ Ω^1 × Ω^2 are the joint-action and joint-observation spaces of player 1 and player 2. The transition probability from state s to s' after taking a joint action (a^1, a^2) is given by T(s'|s, a^1, a^2). O(o^1, o^2|s) is the conditional probability of observing a joint observation (o^1, o^2) under state s. All players share a common reward function r(s, a^1, a^2), γ is the reward discount factor, and H is the horizon length. Players, with potentially different observation and action spaces, are controlled by policies π^1 and π^2. At each timestep t, the players observe a joint observation o_t = (o_t^1, o_t^2) and take a joint action a_t = (a_t^1, a_t^2), receiving a shared reward r_t. The return of a joint trajectory τ = (o_0, a_0, r_0, ..., r_{H−1}, o_H) ∈ T ≡ (Ω × A × R)^H can be written as G(τ) = Σ_{t=0}^{H} γ^t r_t. The expected return of a joint policy (π^1, π^2) is J(π^1, π^2) = E_{τ∼ρ(π^1, π^2)}[G(τ)], where ρ(π^1, π^2) is the distribution over trajectories of the joint policy (π^1, π^2) and P(τ|π^1, π^2) is the probability of τ being sampled from the joint policy (π^1, π^2).

We use subscripts to denote different joint policies and superscripts to refer to different player roles. For example, π_A = (π_A^1, π_A^2) is a different joint policy from π_B = (π_B^1, π_B^2), and π_A^i and π_A^j are policies of different roles.¹ Finally, we denote the expected joint return of self-play (SP) trajectories, where both policies are part of the same joint policy π_A, as J_SP(π_A) := J(π_A^1, π_A^2), and the expected joint return of cross-play (XP) trajectories, where the policies are chosen from different joint policies π_A and π_B, as J_XP(π_A, π_B) := J(π_A^1, π_B^2) + J(π_B^1, π_A^2).

¹ Note that LIPO can be applied to environments with more than two players with a slight modification. Specifically, a policy π^j would represent the joint policy of all players except player i, π^j(a_t^j|τ_t^j) = Π_{k≠i} π^k(a_t^k|τ_t^k).

Since we are interested in creating distinct policies for any Dec-POMDP, we need an environment-agnostic measure that captures the similarity of policies. First, we consider a measure that can compute the similarity between policies of the same role i, e.g., π_A^i and π_B^i. We can measure this with the probability of a joint trajectory τ produced by either π_A^i or π_B^i. However, in the two-player setting, we need to pair these policies with a reference policy π_ref^j. Specifically, π_A^i and π_B^i are considered similar if they are likely to produce the same trajectories when paired with an arbitrary reference policy π_ref^j. We define similar policies as follows:

Definition 2.1 (Similar policies). Considering two policies of the same role i, π_A^i and π_B^i, and a reference policy π_ref^j of a different role j, π_A^i is similar to π_B^i up to ϵ if and only if max_{τ∈T} |1 − P(τ|π_A^i, π_ref^j) / P(τ|π_B^i, π_ref^j)| ≤ ϵ, where 0 ≤ ϵ ≤ 1.

Next, we consider an alternate view on assessing the similarity between policies using policy compatibility (Section 3). Policy compatibility measures the performance difference of a joint policy π_B before and after one of its policies π_B^i is substituted by another policy π_A^i. We define compatibility between a policy π_A^i and a joint policy π_B as follows:

Definition 2.2 (Compatible policies). Given a policy π_A^i and a joint policy π_B, π_A^i is compatible with π_B if and only if J(π_A^i, π_B^j) ≥ (1 − ϵ) J_SP(π_B).
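For concreteness, the following minimal Python sketch checks the compatibility criterion of Def. 2.2 from Monte-Carlo return estimates. The helper names (`rollout`, `expected_return`, `is_compatible`), the number of evaluation episodes, and the value of ϵ are illustrative assumptions, not part of the paper's implementation.

```python
from typing import Callable

def expected_return(pi_i, pi_j, rollout: Callable, n_episodes: int = 100) -> float:
    """Monte-Carlo estimate of J(pi^i, pi^j): average discounted return of
    episodes where role i is controlled by pi_i and role j by pi_j."""
    return sum(rollout(pi_i, pi_j) for _ in range(n_episodes)) / n_episodes

def is_compatible(pi_a_i, joint_pi_b, rollout: Callable, eps: float = 0.1) -> bool:
    """Def. 2.2: pi_A^i is compatible with pi_B iff
    J(pi_A^i, pi_B^j) >= (1 - eps) * J_SP(pi_B)."""
    pi_b_i, pi_b_j = joint_pi_b
    j_sp_b = expected_return(pi_b_i, pi_b_j, rollout)   # self-play return of pi_B
    j_cross = expected_return(pi_a_i, pi_b_j, rollout)  # pi_A^i substituted into pi_B
    return j_cross >= (1.0 - eps) * j_sp_b
```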

3. LEARNING INCOMPATIBLE POLICIES (LIPO)

Our goal is to create distinct policies and, therefore, a population of diverse agents. First, we theoretically show that policy compatibility can be used to identify whether two policies are different. Based on this observation, we propose a novel training objective that produces a distinct policy. Then, we extend this objective for training a population of diverse policies. Finally, we incorporate an MI objective that encourages each policy to learn local variations.

3.1. LEARNING A DISTINCT POLICY VIA POLICY COMPATIBILITY

In this section, we motivate our objective by looking at two joint policies: π_A = (π_A^1, π_A^2) and π_B = (π_B^1, π_B^2). The goal is for π_A to learn a different behavior from π_B via the compatibility criterion. Importantly, the compatibility criterion can be computed empirically without direct access to the trajectory distribution, which can be difficult to estimate. Under mild assumptions, we can simplify the setting such that a simple relationship between the similarity measure and the compatibility criterion emerges. By reasoning about the expected return under different pairs of policies, we derive our main result.

Theorem 3.1. If π_A^i is similar to π_B^i, then π_A^i is compatible with π_B. (The proof is in App. A.)

Corollary 3.2. If π_A^i is not compatible with π_B, then π_A^i is not similar to π_B^i.

The result from Corollary 3.2 shows that we can find a policy π_A^i that is not similar to π_B^i by decreasing its compatibility with π_B until they are incompatible, i.e., J(π_A^i, π_B^j) < (1 − ϵ) J_SP(π_B). Additionally, we can ensure that π_A learns a meaningful solution by maximizing J_SP(π_A). Assuming that π_B has learned a solution and is fixed, the optimization objective of π_A can be written as

max_{π_A} J_SP(π_A) subject to J(π_A^i, π_B^j) < (1 − ϵ) J_SP(π_B), ∀i, j ∈ {1, 2}, i ≠ j. (1)

A way to solve such a constrained problem is to convert the constraints into regularization terms. For simplicity, we use a common λ_XP > 0 as a hyperparameter for the constraints. Then, we can write the soft objective of Eq. 1 as

max_{π_A} J_SP(π_A) − λ_XP J_XP(π_A, π_B). (2)

3.2. LEARNING A POPULATION OF DIVERSE POLICIES

To create a population of N diverse policies, P = {π_A | 1 ≤ A ≤ N}, we need an objective that requires each member of the population to have a different behavior relative to the rest of the population. We can write such an objective by expanding the XP term in Eq. 2 to include all other policies in the population. Additionally, we relax the assumption that other policies are fixed to allow concurrent training of all policies. For a policy π_A ∈ P, with an aggregation function f_agg, its objective becomes

max_{π_A} J_LIPO(π_A, P) = J_SP(π_A) − λ_XP J̃_XP(π_A, P), (3)
where J̃_XP(π_A, P) = f_agg(B_A^xp), (4)
B_A^xp = {J_XP(π_A, π_B) | π_B ∈ P_{−A}}, (5)
P_{−A} = P \ {π_A}. (6)

While using the average operation as the aggregation function is plausible, we find that using the max operation helps stabilize the training process and produces more diverse policies. We suspect that the average operation might produce many conflicting gradients and does not prioritize compatible XP pairs. We refer to J_LIPO as the compatibility gap between a policy π_A and a population P. We can see that the compatibility gap objective only uses the expected return (J_SP and J̃_XP) and is insensitive to state and action information. We argue that this distinction between LIPO and previous methods helps the agents discover more solutions in various situations (Sec. 4.1 and 4.5).
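For illustration, a minimal sketch of the compatibility-gap objective (Eqs. 3-6) computed from an empirical cross-play return matrix is given below; the matrix layout and the example value of λ_XP are assumptions made for this snippet rather than details prescribed by the method.

```python
import numpy as np

def lipo_objective(returns: np.ndarray, a: int, lambda_xp: float = 0.3) -> float:
    """Compatibility-gap objective for population member `a` (Eqs. 3-6).

    `returns` is an N x N matrix of empirical expected returns where
    returns[i, j] approximates J(pi_i^1, pi_j^2); the diagonal holds the
    self-play returns J_SP and the off-diagonal entries the cross-play returns.
    """
    n = returns.shape[0]
    j_sp = returns[a, a]
    # J_XP(pi_a, pi_b) sums both role assignments: (pi_a^1, pi_b^2) and (pi_b^1, pi_a^2).
    xp_returns = [returns[a, b] + returns[b, a] for b in range(n) if b != a]
    # f_agg = max: only the most compatible cross-play pair is penalized.
    j_xp_tilde = max(xp_returns)
    return j_sp - lambda_xp * j_xp_tilde
```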

3.3. INDUCING VARIATIONS IN EACH POLICY

It is important to note that, regardless of the population size, there could be policies of role i that are compatible with π_A ∈ P but not similar to π_A^i. We consider those policies to be variations of π_A^i and propose to capture such variations via an MI objective. Specifically, we condition π_A^i on a latent variable z^i such that π_A has the form

π_A(a|τ) = E_{(z^1, z^2) ∼ p(z^1, z^2)} [π_A^1(a^1|τ^1, z^1) π_A^2(a^2|τ^2, z^2)],

where p(z^1, z^2) is a pre-defined prior distribution. We can induce variations of π_A^i by maximizing I({o^i, a^i}; z^i), where I(·; ·) is the MI between two random variables. Intuitively, this objective encourages each policy to observe different observations and perform different actions given different values of the latent variable. However, maximizing I({o^i, a^i}; z^i) directly is intractable; instead, we optimize the variational lower bound of the MI (Jordan et al., 1999) (see App. B for the derivation):

I({o^i, a^i}; z^i) ≥ H(z^i) + E_{z^i, (o^i, a^i)} [log q_{ϕ_A}(z^i|o^i, a^i)], (7)

where q_{ϕ_A}(z^i|o^i, a^i) is an approximation of the true posterior p(z^i|o^i, a^i) parameterized by ϕ_A. So, maximizing I({o^1, a^1}; z^1) and I({o^2, a^2}; z^2) is an optimization problem that can be written as

max_{π_A, ϕ_A} (1/2) Σ_{i=1}^{2} [ H(z^i) + E_{z^i, (o^i, a^i)} [log q_{ϕ_A}(z^i|o^i, a^i)] ]. (8)

In previous work (Mahajan et al., 2019), a shared z (i.e., z^1 = z^2) is used, allowing both policies to collectively switch between different modes of behavior. However, LIPO uses independently sampled z as it utilizes z for a different purpose. Specifically, LIPO maximizes J_LIPO to learn diverse solutions and optimizes the MI objective to learn variations of each solution. That is, the MI objective does not directly impact the diversity between different policies but increases the variations of each individual policy. We note that the MI objective is optional; we show that without the MI objective, LIPO still produces diverse policies (Sec. 4.4).

3.4. IMPLEMENTATION

In practice, we modify the MI objective (Eq. 8) to be differentiable with respect to the policy π_A^i. Specifically, the variational posterior q_{ϕ_A} is modified such that, instead of a sampled action a^i, it takes the action distribution π_A^i(·|o^i, z^i) as an input, i.e., q_{ϕ_A}(z^i|o^i, π_A^i(·|o^i, z^i)). In contrast to previous MI-based approaches (Eysenbach et al., 2018; Sharma et al., 2019; Jiang & Lu, 2021; Lucas & Allen, 2022), we can optimize I({o^i, a^i}; z^i) directly without computing an auxiliary reward (Mahajan et al., 2019; Osa et al., 2022). The loss function of the modified MI objective is

L_MI(π_A, ϕ_A) = −(1/2) Σ_{i=1}^{2} E_{z^i, (o^i, a^i)} [log q_{ϕ_A}(z^i|o^i, π_A^i(·|o^i, z^i))]. (9)

The objective of a policy π_A in a population P becomes

max_{π_A, ϕ_A} J_LIPO(π_A, P) − λ_MI L_MI(π_A, ϕ_A). (10)

We set z as a discrete variable and use the uniform distribution for p(z^1) and p(z^2). At the beginning of each episode, each policy is given an independently sampled z that is used until the end of the episode. We use MAPPO (Yu et al., 2021) for maximizing J_SP and minimizing J̃_XP. More details, including the pseudocode and the extension to more than two players, can be found in App. D.
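A minimal PyTorch sketch of the modified MI term (Eq. 9) is shown below. It only illustrates how the discriminator can take the policy's action distribution as input so that gradients flow back into the policy; the network sizes, tensor shapes, and class names are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LatentDiscriminator(nn.Module):
    """q_phi(z | o, pi(.|o, z)): predicts the latent z from the observation
    and the policy's action distribution (cf. Eq. 9)."""
    def __init__(self, obs_dim: int, n_actions: int, n_latents: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_latents),
        )

    def forward(self, obs: torch.Tensor, action_probs: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, action_probs], dim=-1))  # logits over z

def mi_loss(discriminator: LatentDiscriminator, obs, action_probs, z) -> torch.Tensor:
    """L_MI: cross-entropy of the discriminator's prediction against the latent z
    that conditioned the policy. Because `action_probs` is the policy output,
    gradients reach the policy directly and no auxiliary reward is needed."""
    logits = discriminator(obs, action_probs)
    return nn.functional.cross_entropy(logits, z)
```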

4. EXPERIMENTS

We study the effectiveness of LIPO in three multi-goal cooperative environments in which both players must collectively choose to accomplish one of the available goals. We evaluate the diversity of a population based on the number of distinct goals achieved. We compare LIPO to other cooperative MARL methods that do not require domain knowledge to generate diverse agents. Our baselines are as follows: (i) Multi SP (multiple runs of self-play), (ii) SP MI (a single run of SP with an added MI objective), (iii) MAVEN (Mahajan et al., 2019), and (iv) TrajeDi (Lupu et al., 2021). We also use Multi SP MI and Multi MAVEN as baselines by training SP MI and MAVEN multiple times. We also discuss methods that utilize domain knowledge in Sec. 5.

4.1. DISCOVERING DIVERSE SOLUTIONS

We use two simple environments to study the effectiveness of various methods in discovering solutions: (i) One-Step Cooperative Matrix Game (CMG), in which there are many possible solutions, and (ii) Point Mass Rendezvous (PMR), a temporally extended cooperative navigation environment.

One-Step Cooperative Matrix Game (CMG):

A game of CMG is defined by a tuple (M, {k_m}, {r_m}), where M is the number of solutions. For m ∈ {1, ..., M}, k_m is the number of compatible actions and r_m is the reward of solution m. By choosing the same solution, both players get the reward r_m associated with the chosen solution. We consider two setups of CMG: sub-optimal (CMG-S) and hard-to-find (CMG-H). For CMG-S, we set (M = 32, k_m = 8, r_m = 0.5(1 + (m−1)/(M−1))), which causes each solution to have a different reward, ranging from 0.5 to 1. For CMG-H, we use (M = 32, k_m = m, r_m = 1), which makes solutions with a smaller number of compatible actions harder to find by random exploration. An example payoff matrix is shown in Fig. 2a.

Point Mass Rendezvous (PMR):

The environment is based on the Multi-Agent Particle Environment (Lowe et al., 2017; Terry et al., 2020). The goal of this environment is for the two agents to navigate to a landmark together. There are M = 4 landmarks, and we consider each landmark as a solution in this environment. This environment has two modes: PMR-C and PMR-L. In PMR-C, landmarks are distributed evenly on the circumference of a circle. Thus, all landmarks are equally easy to find and optimal. In PMR-L, landmarks are placed on a line. In this scenario, closer landmarks are easier to find.

We define the population size |P| of each method as follows: For SP MI and MAVEN, |P| is equal to the number of dimensions of the latent variable, |z|. For Multi SP MI and Multi MAVEN, |P| = |z| · n_seed, where n_seed is the number of random seeds and we use |z| = 8. For Multi SP, TrajeDi, and LIPO, |P| is the number of joint policies in the population.

Results: Fig. 3 shows the numbers of learned solutions, averaged over three runs. In all environments, LIPO consistently discovers more solutions than the baselines, given the same population size. The baselines find fewer solutions in CMG-H and PMR-L than they do in CMG-S and PMR-C, whereas LIPO performs similarly across settings. LIPO is also better than the baselines at finding sub-optimal solutions in CMG-S. We note that Multi SP and TrajeDi perform almost ideally in PMR-C, where all solutions are equivalent, but perform worse in other settings. Also, Multi SP MI finds all four solutions in PMR when the population size is bigger than 8. However, it performs poorly in CMG. LIPO's consistency across environments and settings demonstrates that LIPO is still effective when (i) many solutions exist, (ii) solutions are not equally optimal, and (iii) solutions are not equally likely to be found by random exploration. We have also experimented with stronger regularization coefficients for the baselines, which help the baselines discover more solutions. However, if the regularization coefficient is too large, they fail to produce capable policies.

4.2. TRADE-OFF BETWEEN COMPETENCY AND DISSIMILARITY OF JOINT POLICIES

It is possible that optimizing a regularized objective might incur training instability and create incapable policies. Here, we investigate the effect of different combinations of λ_XP and the population size (N) on the competency of the policies. Fig. 4 shows the number of competent joint policies when using different values of N and λ_XP in PMR. Particularly, in PMR, a joint policy is considered competent when both players stay close to a landmark at the end of an episode. We observe that when the population size is larger than the number of solutions (N > M), some surplus policies do not learn to reach a goal. Importantly, the number of competent joint policies depends on the value of λ_XP: lower values of λ_XP yield more capable policies. However, using too low a value of λ_XP will generate policies that share a common solution when N ≤ M, as shown in App. J.1.1. Additionally, when N ≤ M, all trained agents are competent except when λ_XP is too high in PMR-L. These results suggest that there is a trade-off between the number of capable joint policies and policy dissimilarity. When using a larger population size, a small λ_XP should be used to avoid producing incompetent agents, while a bigger λ_XP should be used with smaller population sizes to ensure the dissimilarity between joint policies.

Not only is using bigger values of N more likely to produce incompetent policies, but it is also computationally expensive. Formally, the computational complexity of approximating J̃_XP(·, P) is O(N · n_xp), where n_xp is the number of XP pairs used to approximate J̃_XP(π_A, P). So, we investigate a way to reduce the cost of calculating J̃_XP(π_A, P) by reducing n_xp. According to Eq. 5, the default value is n_xp = N − 1. When n_xp < N − 1, n_xp policies are chosen randomly from P_{−A} by sampling without replacement.

4.3. TRADE-OFF BETWEEN COMPUTATION COST AND DIVERSITY

We observe that, while being computationally cheaper, using n_xp < N − 1 tends to produce less diverse populations, as shown in Fig. 5. Thus, n_xp can be considered a hyperparameter that controls the computation-diversity trade-off. However, as shown by the dashed lines, the effect of n_xp on population diversity is less prominent in PMR-C, where solutions are equally likely to be found. We use n_xp = N − 1 in all other experiments. See App. J.1.2 for results in CMG.

4.5. MULTI-RECIPE OVERCOOKED

Overcooked, a collaborative cooking game, has been used to study the cooperative ability of learned agents in prior works (Carroll et al., 2019; Charakorn et al., 2020; Strouse et al., 2021; McKee et al., 2022). To investigate the usefulness of LIPO in a high-dimensional environment, we implement a more complex version of the game based on the work of Wu et al. (2021); players have to complete and serve one of the six pre-defined recipes as fast as possible, as opposed to delivering a single menu item repeatedly. We emphasize that this environment is much more challenging than the ones in the previous experiments because of several aspects: First, it has a sparse reward signal. Second, there are multiple sub-tasks. Third, different recipes have different sub-tasks. Each of these characteristics of the environment complicates the process of finding diverse solutions. Furthermore, we note that recipes containing a carrot or a tomato are harder to complete than other recipes as they involve an additional coordination step. Particularly, the carrot and the tomato have to be sent over by the agent on the right, unlike the lettuce and the onion. Fig. 7 shows an overview of the game.

The goal in this experiment is to learn a population of behaviorally diverse agents. We choose to quantify the diversity of a population based on the entropy of its recipe distribution. For a population P, we approximate the probability of recipe i being completed as

P(recipe_i | P) ≈ Σ_{π_A ∈ P} m_i(π_A) / Σ_i Σ_{π_A ∈ P} m_i(π_A),

where m_i(π_A) denotes the frequency of recipe i under a joint policy π_A. The recipe frequencies, {m_i(π_A) | 1 ≤ i ≤ 6}, for each joint policy π_A ∈ P are measured by counting the completed recipes from 1,000 self-play episodes. For Multi SP, TrajeDi, and LIPO, we set N = 8. For Multi SP MI and Multi MAVEN, we use n_seed = 8 and |z| = 8.

Figure 8: Recipe distributions of generated populations compared to the uniform distribution (dashed line). We provide a reference for the uniform distribution as it has the highest entropy.

Table 1: The mean and standard error of the entropy of the approximated population recipe distributions.

Method              Entropy
Multi SP            1.16 ± 0.03
TrajeDi             0.98 ± 0.17
Multi SP MI         1.43 ± 0.05
Multi MAVEN         1.53 ± 0.07
LIPO (λ_MI = 0.5)   1.58 ± 0.07
LIPO (λ_MI = 0)     1.26 ± 0.17

Results: Quantitatively, Tab. 1 shows that LIPO has the highest population recipe distribution entropy, averaged over five random seeds. This result indicates that LIPO populations use all recipes more uniformly than the baseline populations, even though some recipes take longer to complete or are harder to find by random exploration. The recipe distribution of the populations produced by each method can be found in Fig. 8. We find that LIPO populations with λ_MI = 0 still, similar to λ_MI = 0.5, consistently learn to use the hard-to-find Tomato & Carrot Salad recipe. However, the frequencies of Chopped Tomato and Chopped Carrot are lowered (Fig. 8f). This means that there are multiple joint policies that learn to complete the same recipe while being incompatible with each other.
We suspect that using λ M I > 0 alleviates this problem by regularizing each joint policy to represent a policy with broader state-action coverage (e.g., learn multiple ways of completing a recipe), indirectly pushing other joint policies to use different recipes in order to be incompatible. Qualitatively, we can see in App. J.2 that the baselines produce agents with similar recipe frequencies. The resulting populations, thus, contain agents with a similar recipe preference. In contrast, LIPO produces agents with distinct recipe frequencies, collectively making the population more diverse than the baselines. We also visualize the behaviors learned by LIPO in App. J.3.
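For reference, the entropy reported in Table 1 can be computed from per-policy recipe counts as sketched below. This assumes natural-log entropy and that completed-recipe counts per joint policy are available; the function and variable names are illustrative.

```python
import numpy as np

def recipe_entropy(recipe_counts: np.ndarray) -> float:
    """`recipe_counts[a, i]` is m_i(pi_a): how often joint policy `a` completed
    recipe `i` over its 1,000 self-play episodes. Returns the entropy of the
    population-level recipe distribution P(recipe_i | P)."""
    totals = recipe_counts.sum(axis=0)   # sum over policies for each recipe
    p = totals / totals.sum()            # P(recipe_i | P)
    p = p[p > 0]                         # ignore recipes that never appear
    return float(-(p * np.log(p)).sum())

# Example with a population of 2 policies and 3 recipes.
counts = np.array([[900, 100, 0],
                   [50, 200, 750]])
print(recipe_entropy(counts))
```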

4.6. TRAINING GENERALIST AGENTS WITH GENERATED POPULATIONS

In addition to evaluating the diversity of agents in Overcooked, we quantify the usefulness of the produced agents by using them as training partners for a generalist agent and testing the agent with held-out populations. Intuitively, more diverse training partners would enable the agent to generalize and coordinate with unseen agents better. Additionally, we include a population of six specialized SP policies, where each policy is trained to complete a specific recipe by adjusting the reward function. The specialist population is created for evaluating the agent when the partner has a strong preference.

Fig. 9a shows that all generalist agents perform similarly when tested with the held-out baseline populations. However, the agents trained with a baseline population perform poorly when matched with the held-out LIPO and specialist populations. In contrast, those trained with a LIPO population perform better in both situations. Specifically, they have a significantly higher success rate when paired with the specialist with a strong preference for Tomato & Carrot Salad, as shown in Fig. 9b. We attribute the difference in success rate to the fact that this recipe has a low completion probability in all populations except LIPO's. As a result, generalist agents trained with a LIPO population perform better in terms of the overall success rate when tested with specialist agents. Overall, training with a LIPO population helps the generalist agents better coordinate with more partner types, as indicated by the harmonic means.


Figure 9 : The mean success rates and standard errors (in parentheses) of trained generalist agents when matched with unseen test agents. Each generalist agent is trained with only one population and evaluated with all test partners, each with 300 episodes. The result in each row is averaged over five generalist agents trained with an independently generated population from the corresponding method. A held-out population from each method is used as a test population. The right-most column shows the harmonic mean of the success rate of all held-out populations (a) and specialist agents (b).

5. RELATED WORK

Learning a collection of diverse agents has been utilized in various contexts (Parker-Holder et al., 2020; Sun et al., 2020; Zahavy et al., 2021; Zhou et al., 2021). In the cooperative domain, Canaan et al. (2019; 2020) use the Quality Diversity (QD) algorithm (Mouret & Clune, 2015; Pugh et al., 2016) to produce a population of behaviorally diverse agents. QD, however, requires domain knowledge to encode different types of behaviors. For example, in CMG, the algorithm requires the mapping between actions and corresponding solutions. In Overcooked, it needs to know all possible recipes beforehand. Without such domain knowledge, it would be difficult to use QD to produce a diverse population. TrajeDi (Lupu et al., 2021) produces a diverse population of agents based on the trajectory distribution. Finally, MEP (Zhao et al., 2021) trains a population of agents with an auxiliary reward based on population entropy. Like TrajeDi and MEP, LIPO does not require domain-specific knowledge. However, to promote behavioral diversity, LIPO utilizes the expected returns of different policy pairs as opposed to state-action information.

The idea of diversifying the empirical return has been explored in the context of finding diverse solutions in non-transitive competitive games (Liu et al., 2021; Balduzzi et al., 2019; Perez-Nieves et al., 2021). In particular, Liu et al. (2021) share some similar ideas with our work. They propose to use the expected returns, when encountering different opponents, together with state-action information to promote the diversity of agents. A concurrent work by Rahman et al. (2022) applies a similar idea of diversifying the expected joint return to generate diverse partners in cooperative settings. LIPO can be thought of as a special case designed specifically for cooperative environments (see App. E).

MI objectives have been used in RL to learn diverse behaviors (Eysenbach et al., 2018; Sharma et al., 2019; Kumar et al., 2020; Osa et al., 2022). In cooperative MARL, MAVEN (Mahajan et al., 2019) optimizes both RL and MI objectives to encourage the agents to explore in a committed manner and discover diverse solutions. Also, Any-play (Lucas & Allen, 2022) uses a similar objective to produce training partners with many solutions for a generalist agent. In contrast, our approach uses the MI objective to regularize each policy to learn local variations of each solution.

6. CONCLUSION

We propose LIPO, a simple and generic method that can create a population of diverse agents in cooperative multi-agent environments. Unlike previous work that uses state-action information from joint trajectories, LIPO utilizes the concept of policy compatibility to create diverse policies. This alternative view of quantifying diversity makes LIPO more robust to the state and action spaces. Also, LIPO uses an MI objective to learn local variations of each solution. Empirically, LIPO consistently produces more diverse populations than the baselines across three multi-goal environments. Finally, in multi-recipe Overcooked, LIPO produces populations of diverse partners that help generalist agents generalize to unseen agents better. We include further discussions and limitations of LIPO in App. F and G.

C ADDITIONAL ENVIRONMENT DETAILS

C.1 ONE-STEP COOPERATIVE MATRIX GAME

A game of CMG is defined by a tuple (M, {k_m}, {r_m}), where M is the number of solutions. For m ∈ {1, ..., M}, k_m is the number of compatible actions and r_m is the reward of solution m. The game is stateless and terminates immediately after both players simultaneously choose an action. By choosing the same solution, both players get the reward r_m associated with the chosen solution. This means that the solutions are not equally optimal if the values in {r_m} are not identical. Similarly, if the values in {k_m} are not identical, then the solutions are not equally likely to be chosen by a uniform joint policy. We consider two setups of CMG: sub-optimal (CMG-S) and hard-to-find (CMG-H). For CMG-S, we set (M = 32, k_m = 8, r_m = 0.5(1 + (m−1)/(M−1))), which causes each solution to have a different reward, ranging from 0.5 to 1. There are 32 solutions, each with 8 compatible actions, so there are 32 × 8 = 256 possible actions for each player. For CMG-H, we use (M = 32, k_m = m, r_m = 1), which makes solutions with a smaller number of compatible actions harder to find by random exploration. There are 32 solutions, each with a different number of compatible actions, ranging from 1 to 32. The number of available actions for each player is Σ_{m=1}^{32} m = 528.
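A minimal sketch of the one-step game under the parameterization above is shown below. The class name, the block layout of actions, and the rule that any two actions within the same solution block are compatible are assumptions made for illustration.

```python
class CMG:
    """One-step cooperative matrix game: solution m has k_m compatible actions
    and reward r_m; both players are rewarded only if their chosen actions
    belong to the same solution block."""
    def __init__(self, k, r):
        self.k, self.r = list(k), list(r)
        self.n_actions = sum(self.k)  # 256 for CMG-S, 528 for CMG-H

    def _solution_of(self, action: int) -> int:
        m, remaining = 0, action
        while remaining >= self.k[m]:
            remaining -= self.k[m]
            m += 1
        return m

    def step(self, a1: int, a2: int) -> float:
        m1, m2 = self._solution_of(a1), self._solution_of(a2)
        return self.r[m1] if m1 == m2 else 0.0

M = 32
cmg_s = CMG(k=[8] * M, r=[0.5 * (1 + (m - 1) / (M - 1)) for m in range(1, M + 1)])
cmg_h = CMG(k=list(range(1, M + 1)), r=[1.0] * M)
```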

C.2 POINT MASS RENDEZVOUS (PMR)

PMR is based on the Multi-Agent Particle Environment (Lowe et al., 2017; Terry et al., 2020). The observation of each agent includes its absolute position, current velocity, and the relative distances to the landmarks and the other agent. These features are concatenated as a 1-D vector of length 14. The possible actions are: no-op and move {up, down, left, right}. In PMR-C, the start positions of the agents are {(0.3, 0), (−0.3, 0)} and the landmark positions are {(1.59, 1.59), (1.59, −1.59), (−1.59, 1.59), (−1.59, −1.59)}. For PMR-L, the start and landmark positions are {(1, 0), (0, 1)} and {(0, 2.25), (0, 0.75), (0, −0.75), (0, −2.25)}, respectively. An episode is terminated after 50 timesteps. The agents are incentivized to go to the same landmark and stay close together with the reward function r_t = 1 − d(p_i, c) − min_{l∈L} d(l, c), where d(·, ·) is the Euclidean distance between two points, p_i is the 2-D coordinate of agent i, c is the average coordinate of all agents, and L is the set of all landmarks.
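The reward above can be sketched as follows. Whether the reward is returned per agent or shared is not specified here, so the per-agent output is an assumption of this snippet.

```python
import numpy as np

def pmr_reward(agent_positions: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """r_t = 1 - d(p_i, c) - min_l d(l, c), where c is the centroid of the agents.

    `agent_positions` has shape (2, 2) and `landmarks` has shape (M, 2)."""
    c = agent_positions.mean(axis=0)                        # average coordinate of all agents
    dist_to_centroid = np.linalg.norm(agent_positions - c, axis=1)
    dist_centroid_to_goal = np.linalg.norm(landmarks - c, axis=1).min()
    return 1.0 - dist_to_centroid - dist_centroid_to_goal   # one value per agent
```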

C.3 MULTI-RECIPE OVERCOOKED

We implement a multi-recipe version of the game based on the work of Wu et al. (2021). In this version of Overcooked, there are four ingredients: lettuce, onion, tomato, and carrot. The ingredients are randomly placed at pre-defined positions in the layout. Particularly, the lettuce and the onion are randomly placed on the left or the middle counter. The tomato and the carrot are randomly placed on the right or the middle counter. These ingredients can be composed into different recipes, making each ingredient unique: four recipes (LettuceSalad, TomatoSalad, ChoppedCarrot, ChoppedOnion) require only a single ingredient, while the other two (TomatoLettuceSalad, TomatoCarrotSalad) require two ingredients. The ingredients have to be chopped at the chopping station before being placed on the plate. After the required ingredients are put on the plate, they must be delivered to the delivery station.

Both players have the same egocentric observation and action spaces. The observation is a set of hand-crafted features that represent a local view of the environment. Specifically, we use the following features: absolute position and facing direction, relative distances to the objects and the other agent, the state of the ingredients, four booleans indicating whether the agent is next to a counter in the four cardinal directions, the currently held items, the state of the held foods, and the type and state of the item in front of the agent. These features are concatenated as a 1-D vector of length 54. At every timestep, each player has to choose one of the six possible actions: no-op, move {up, down, left, right}, and interact. An episode lasts at most 200 timesteps and terminates immediately after a successful delivery. An episode without a delivery is considered unsuccessful.

We incentivize the agents to interact with the objects and deliver as fast as possible with the following reward function: r_t = r_interact + r_progress + r_complete − p, where r_interact is a shaped reward given when an agent interacts with an object for the first time in an episode, r_progress is given when the players progress toward a recipe completion (i.e., chopping required ingredients or putting chopped ingredients on the plate), r_complete is given upon a successful delivery, and p is a penalty. We use r_interact = 0.5, r_progress = 1.0, r_complete = 10, and p = 0.1. We note that recipes with more than one ingredient give only slightly higher rewards (r_interact + r_progress) but are significantly harder to discover by random exploration than those with one ingredient.

Additional experimental details: For specialist agents, the rewards are given when interacting, progressing, or completing a specific recipe. For the held-out populations, we remove incompetent policies, with an expected return of less than zero, from the test populations created by TrajeDi and LIPO. We do not remove those in the training populations. We do this because testing with an incompetent policy does not give any meaningful information, as almost all episodes would be unsuccessful.
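The shaped reward described above could be assembled as sketched below. The three boolean event flags are assumed to come from the environment's event log for the current timestep; they are illustrative names rather than part of the released implementation.

```python
def overcooked_reward(first_interact: bool, progressed: bool, delivered: bool) -> float:
    """Shaped reward r_t = r_interact + r_progress + r_complete - p,
    using the coefficients reported above."""
    r = -0.1                      # per-step penalty p
    if first_interact:
        r += 0.5                  # first interaction with an object this episode
    if progressed:
        r += 1.0                  # chopped a required ingredient / plated it
    if delivered:
        r += 10.0                 # successful delivery of a recipe
    return r
```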

D IMPLEMENTATION DETAILS

Algorithm 1: Training process of LIPO (on-policy). This pseudocode is based on self-play. Blue text is related to the MI objective. LIPO-specific code is highlighted in green.

Input: a population P = {π_A | 1 ≤ A ≤ N}, the number of XP pairs used to approximate J̃_XP(π_A, P) (n_xp), the number of players in an episode (m), and the numbers of SP and XP episodes per iteration (E_SP and E_XP).

while not done do
    for A ∈ {1, ..., N} do
        B_sp ← GetEpisodeRollouts(π_A, π_A, E_SP, m)
        Compute J_SP(π_A) using B_sp
        B_xp ← GetCrossPlayRollouts(π_A, P, n_xp, E_XP, m)
        Compute J̃_XP(π_A, P) using B_xp (Eq. 4)
        Compute L_MI using B_sp and B_xp (Eq. 9)
        θ_A ← θ_A − ∇_{θ_A} [−J_SP + λ_XP J̃_XP + λ_MI L_MI]
        ϕ_A ← ϕ_A − λ_MI ∇_{ϕ_A} L_MI

Algorithm 2: Rollout collection

Function GetEpisodeRollouts(π_A, π_B, E, m):
    B ← {}
    for e ∈ {1, ..., E} do
        z^1, ..., z^m ∼ p(z^1, ..., z^m)
        z^j ← (z^k)_{k≠i}
        π_B^j(·|·, z^j) = Π_{k≠i} π_B^k(·|·, z^k)
        τ ∼ ρ(π_A^i(·|·, z^i), π_B^j(·|·, z^j))
        B ← B ∪ {τ}
    return B

Function GetCrossPlayRollouts(π_A, P, n_xp, E_XP, m):
    B_xp ← {}
    P'_{−A} ← SampleWithoutReplacement(P_{−A}, n_xp)
    for π_B ∈ P'_{−A} do
        B ← GetEpisodeRollouts(π_A, π_B, E_XP / |P'_{−A}|, m)
        B_xp ← B_xp ∪ B
    return B_xp

The pseudocode for LIPO is shown in Algorithms 1 and 2. If there are more than two players (m > 2), π^j represents the joint policy of all players except player i, π^j(a_t^j|τ_t^j) = Π_{k≠i} π^k(a_t^k|τ_t^k). We note that scaling LIPO to more than two players does not increase the training time. It is the same as the two-player setting as long as the numbers of SP and XP episodes are the same. In practice, however, more XP episodes might be needed to better estimate J̃_XP.

We use the parameter-sharing technique for better sample efficiency and faster convergence (Tan, 1993; Foerster et al., 2018; Rashid et al., 2018). Assuming that a policy π_A^i is a neural network parameterized by θ_A^i, this means that for a joint policy (π_A^1, π_A^2), we have θ_A^1 = θ_A^2. Still, π_A^1 and π_A^2 can behave differently as they observe different parts of the environment and have a different player indicator concatenated with their local observations. All methods are implemented on top of MAPPO except MAVEN. The critic, policy, and discriminator are feed-forward neural networks with two hidden layers, each having 64 units. For a fair comparison, we use the same or more environment steps in the policy update of the baselines compared to LIPO.

D.2 MULTI SP

A simple but effective way to produce diverse agents is to train multiple SP agents with different neural network initializations and random seeds. Specifically, each run produces a joint policy π_A that maximizes J_SP(π_A) using MAPPO.

D.3 SP MI

A single run of an SP agent trained with the added MI objective I({o^i, a^i}; z^i). SP MI uses a shared z for both policies and considers each z as a different joint policy. We train SP MI using the same training procedure as LIPO by setting N = 1, λ_XP = 0, and z^1 = z^2. The discriminator takes a local observation o^i and the action distribution π^i(·|o^i) as inputs and outputs the discrete probability of the latent variable. The latent variable of all policies is shared during an episode.

D.4 MAVEN

MAVEN (Mahajan et al., 2019) is explicitly designed for learning diverse solutions in cooperative multi-agent environments. A joint policy is represented as π(·|τ, z), and each mode of behavior is represented by the latent variable z. Similar to SP MI, MAVEN uses a shared z for all policies. We use the same network architecture presented in Mahajan et al. (2019) with recurrent neural networks. However, we do not use the hierarchical policy but sample z from the uniform distribution. The latent variable of all policies is shared.

D.5 MULTI SP MI AND MULTI MAVEN

A population containing joint policies from multiple runs of SP MI and MAVEN. Like Multi SP, each run has different neural network initializations and random seeds. The population size is |P| = n_seed · |z|, where n_seed is the number of runs. Notably, this baseline uses the training data differently from the base algorithms. Instead of training a single long run, this approach allows the policy to "restart" by using different initializations of the neural networks. For example, training a single run with |z| = N, n_seed = 1 might not discover as many solutions as training n_seed runs with |z| = N / n_seed, even though the population size is the same. Empirically, we find that multiple shorter runs can find more solutions than a single long run of the corresponding algorithm. Thus, we omit the results of the base algorithms in multi-recipe Overcooked.

D.6 TRAJEDI

TrajeDi produces a population of diverse agents that also maximize the expected return in cooperative environments. The diversity measure of this method is based on the Jensen-Shannon divergence (JSD) between the trajectory distributions of the policies. Different from the original implementation, we remove the best-response (BR) policy from the population. Since the BR policy might work well with only a subset of solutions, removing BR potentially increases the number of variations in the population. Our modified loss is:

L = −[Σ_{A=1}^{N} J_SP(π_A) + α JSD_γ(π_1, ..., π_N)],

where JSD_γ is the proposed diversity objective of TrajeDi, and α and γ are the hyperparameters of TrajeDi.

D.7 LIPO

LIPO uses the same implementation as SP MI except that LIPO uses an independent latent variable for each policy and λ_XP > 0. Additionally, LIPO uses extra critics for the XP trajectories. In total, LIPO has an SP critic V_sp^{π_A} and N − 1 XP critics {V_xp^{π_A, π_B} | π_B ∈ P_{−A}}. In each training iteration, LIPO collects SP and XP trajectories of all policy combinations. MAPPO is used both for maximizing J_SP and for minimizing J̃_XP. The critics are trained using the SP trajectories and all of the XP trajectories, while the policy is trained using the SP trajectories and the XP trajectories from the XP pair that has the highest joint return, max(B_A^xp).
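A small sketch of how the highest-return XP pair might be selected for the policy update is shown below; the data layout (a mapping from partner id to trajectories and mean joint return) is an assumption for illustration.

```python
def select_xp_batch(xp_batches: dict):
    """Pick the cross-play partner with the highest empirical joint return.

    `xp_batches` maps a partner id B to (trajectories, mean_joint_return) collected
    from the (pi_A, pi_B) cross-play pairs. The critics are trained on every batch,
    but the policy gradient for minimizing J_XP uses only the returned batch,
    i.e. max(B_A^xp) in the paper's notation."""
    best_partner = max(xp_batches, key=lambda b: xp_batches[b][1])
    return best_partner, xp_batches[best_partner][0]
```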

D.8 GENERALIST AGENT

The policy and critic networks of a generalist agent use two 256-unit GRU layers (Cho et al., 2014) followed by a linear layer. The input also includes the reward and action of the previous timestep. We use MAPPO for training a generalist agent. We train both the policy and critic with a batch size of 320,000 timesteps using truncated backpropagation through time (BPTT). The samples are reused for 15 epochs. Each minibatch contains 1,600 sequences with a maximum length of 50 timesteps. We also anneal the learning rate of the generalist agent from 0.005 to 0.003 with linear scheduling. Other hyperparameters are shared with other methods (Tab. 2). A training partner for a generalist agent is sampled uniformly from the training population at the beginning of each episode.

E RELATIONSHIP WITH RAHMAN ET AL. (2022)

Rahman et al. (2022) propose to optimize the self-play returns while maximizing a diversity term Div(C), where C is an N × N cross-play payoff matrix. Specifically, they propose to learn diverse policies via the following objective:

max_C Tr(C) + λ Div(C), (12)

where they use Div(C) = Det(κ(C)) and κ(C) is an N × N matrix with κ_{i,j}(C) being the similarity between policies π_i and π_j. The radial basis function (RBF) kernel of the empirical returns is used to measure the similarity between two policies:

κ_{i,j}(C) = exp(−||C_{i,·} − C_{j,·}||² / σ²),

where C_{i,·} is the i-th row of C. In other words, C_{i,·} is the vector containing the empirical returns of π_i when matched with the other policies in the population. Intuitively, this objective diversifies the policies via the expected returns, similar to LIPO. Using the same notation of the cross-play matrix, we can write the objective of training a LIPO population as:

max_C Tr(C) − λ_XP Σ_i max_{1≤j≤N, j≠i} C_{i,j}.

This objective wants the diagonal (J_SP) to be maximized and the off-diagonal entries of the cross-play matrix (J̃_XP) to be minimized. This is a special case of Eq. 12 where Div(C) is based on policy compatibility.
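For comparison, the two population-level criteria above can be evaluated on the same cross-play payoff matrix as sketched below. The kernel bandwidth σ and the example λ_XP are assumed values chosen only for illustration.

```python
import numpy as np

def div_rahman(C: np.ndarray, sigma: float = 1.0) -> float:
    """Div(C) = Det(kappa(C)) with an RBF kernel over the rows of the
    cross-play payoff matrix C (Rahman et al., 2022)."""
    sq_dists = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    kappa = np.exp(-sq_dists / sigma ** 2)
    return float(np.linalg.det(kappa))

def lipo_population_objective(C: np.ndarray, lambda_xp: float = 0.3) -> float:
    """LIPO's population objective: maximize the diagonal (self-play returns)
    while minimizing, for each row, its largest off-diagonal entry."""
    n = C.shape[0]
    off_diag_max = np.array([max(C[i, j] for j in range(n) if j != i) for i in range(n)])
    return float(np.trace(C) - lambda_xp * off_diag_max.sum())
```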

F DISCUSSIONS

Agents trained with LIPO are incentivized to act adversarially toward agents that behave differently from themselves. This behavior might not be desirable for certain downstream tasks. For example, agents produced by LIPO might not be suitable for interacting with humans as they would refuse to conform with the user. However, as shown in Sec. 4.6, training a generalist agent with these agents has the opposite effect: the generalist agent tries to comply with the current partner's preference.

Since LIPO produces a population of near-optimal solutions, a generalist agent trained with a LIPO population might not coordinate well with significantly sub-optimal agents. In prior work, Strouse et al. (2021) show that augmenting the training population with past checkpoints (FCP) helps the trained generalist agent to effectively coordinate with sub-optimal agents. Since LIPO and FCP are orthogonal, populations created by LIPO can also be augmented in the same way as FCP.

Previous works find incompatible policies to be undesirable since they are generally the result of coordinated symmetry breaking (Bard et al., 2020; Hu et al., 2020; 2021); these policies perform poorly when interacting with unseen partners. However, we show that learning incompatible policies can be useful for generating behaviorally diverse agents in various scenarios. The produced agents can then be used as training partners for a generalist agent. Using LIPO in environments where many solutions are equivalent may produce such undesirable symmetry-breaking conventions. We believe that LIPO can be combined with other techniques, e.g., other-play (Hu et al., 2020) and the equivariant coordinator (Muglich et al., 2022), to avoid learning arbitrary symmetry-breaking conventions. We leave the study of the combination of LIPO and these techniques for future work. A concurrent work by Cui et al. (2023) proposes an extension of LIPO by combining insights from off-belief learning (Hu et al., 2021) to avoid the "sabotaging" behavior of LIPO agents.

G LIMITATIONS

LIPO requires an additional hyperparameter, λ_XP. If λ_XP is too large, it is possible that the main RL objective, J_SP, would be interfered with, resulting in an incompetent joint policy (Sec. 4.2). An adaptive mechanism that selects a suitable value for λ_XP at different stages of training could help increase training stability.

Although LIPO can be fully parallelized, it requires more computation than the baselines to get an accurate approximation of J̃_XP, which makes it harder to scale up to bigger population sizes. Instead of collecting all policy pairs, sampling a portion of the policy pairs to approximate J̃_XP can reduce the computation cost and training time at a potential cost of diversity (Sec. 4.3). Instead of uniform sampling, a mechanism that selects the best pair to sample (e.g., a bandit algorithm) might help mitigate the diversity loss from using a lower n_xp.

J.3 VISUALIZATION OF BEHAVIORS

We visualize the behaviors of the joint policies produced by LIPO in PMR and Overcooked at https://sites.google.com/view/iclr-lipo-2023. Here, we show snapshots of four joint policies that have a distinct recipe preference in Overcooked.



Figure 1: (a) The objective of π A (Eq. 1) in relation to π B . (b, c) Conceptual illustration of Theorem 3.1 and Corollary 3.2. Solid lines represent given relationships, and dotted lines represent implied relationships.

Figure 2: (a) The payoff matrix of a CMG game with (M = 3, k m = m, r m = m). (b, c) The agents (orange) and landmark positions (blue) of PMR-C and PMR-L.


Figure 3: Numbers of discovered solutions. Ideally, if the population size increases by one, one more solution should be discovered, as depicted by the dashed lines (assuming that a joint policy does not produce a multi-modal behavior).

Figure 4: Numbers of competent joint policies using various combinations of N (x-axis) and λ XP (colors) in PMR.

Figure 5: Number of learned solutions using various n xp .

Figure 6: The top and bottom rows show four joint policies produced by a single run of LIPO training with and without the MI objective, respectively. Different colors of the trajectories correspond to different values of z. The orange and green circles show the starting positions. The blue circles represent the landmarks.

Fig. 6 shows the behaviors of the policies produced by LIPO with and without the MI objective in PMR-C. We can see the effect of the MI objective in the variety of the trajectories. Overall, each agent exhibits larger variations given a small MI regularization, λ_MI = 0.5. This result aligns with our motivation of using the MI objective to learn variations of each solution. With or without the MI regularization, LIPO discovers all the landmarks with N = 4.


(a) Recipe preference: Tomato & Carrot Salad (b) Recipe preference: Chopped Lettuce (c) Recipe preference: Chopped Onion (d) Recipe preference: Tomato & Lettuce Salad

Figure 14: Four joint policies from a population of eight joint policies produced by a single run of LIPO. Each row shows snapshots of a joint policy illustrating a distinct recipe preference.


Table 2: Hyperparameters used by the MAPPO algorithm.

Common hyperparameters of methods based on MAPPO are shown in Table 2.

D.1 MAPPO

MAPPO is the base MARL algorithm for all baselines except MAVEN. The policy parameters are shared among all policies. The critic takes a state of the environment and outputs the expected return of the given global state. The global state is provided by the environment and only used during training. For the complete training objectives of MAPPO, we refer the reader to Appendix A of Yu et al. (2021).

ACKNOWLEDGEMENT

This work is partially supported by King Mongkut's Institute of Technology Ladkrabang [2566-02-06-002]. We thank Natchaya Sricom for drawing Fig. 1 and 7. We thank Supasorn Suwajanakorn, Sucha Supittayapornpong, and Maytus Piriyajitakonkij for their suggestions on early draft versions. We also thank the anonymous reviewers for their constructive feedback.

REPRODUCIBILITY STATEMENT

We have included additional information to reproduce the experimental results in the supplementary text:

• Environment details (App. C)
• Pseudocode and implementation details (App. D)
• Hyperparameters used in all experiments (App. H and I)

The source code is available at https://github.com/51616/marl-lipo.

A PROOF FOR THEOREM 3.1

We prove the relationship between similar policies and compatible policies in Theorem 3.1 under the following assumptions.

Assumption A.1 (All joint trajectories are supported by π_B). P(τ|π_B^i, π_B^j) > 0, ∀τ ∈ T.

Assumption A.2 (Shared ϵ). A common 0 ≤ ϵ ≤ 1 is used for Def. 2.1 and 2.2.

Assumption A.3 (Positive return). G(τ) > 0, ∀τ ∈ T.

Proof. Let r(τ) = P(τ|π_A^i, π_B^j) / P(τ|π_B^i, π_B^j), which is well defined by Assumption A.1. Since π_A^i is similar to π_B^i (Def. 2.1 with π_ref^j = π_B^j), we have 1 − ϵ ≤ r(τ) ≤ 1 + ϵ for all τ ∈ T. Because 1 − ϵ ≤ r(τ) ≤ 1 + ϵ and G(τ) > 0, ∀τ ∈ T, the expected return J(π_A^i, π_B^j) = Σ_τ r(τ) P(τ|π_B^i, π_B^j) G(τ) has the following upper and lower bounds:

(1 − ϵ) J_SP(π_B) ≤ J(π_A^i, π_B^j) ≤ (1 + ϵ) J_SP(π_B).

This means that J(π_A^i, π_B^j) ≥ (1 − ϵ) J_SP(π_B), i.e., π_A^i is compatible with π_B (Def. 2.2).

Remark: Assumption A.3 can be satisfied by offsetting the joint return of all trajectories such that min_τ G(τ) > 0. However, since we use MAPPO as the base algorithm, the expected return is subtracted by a baseline to compute the advantage during the policy update, which removes the effect of the offset. In practice, even under environments that do not satisfy G(τ) > 0, LIPO can still discover diverse solutions effectively, as shown in the experiments.

B DERIVATION OF THE LOWER BOUND OF THE MI OBJECTIVE

We provide the derivation of Eq. 7 here. Let p(z^i|{o^i, a^i}) be the true posterior of z^i and q_ϕ be the approximation of p parameterized by ϕ. The lower bound of I({o^i, a^i}; z^i) can be derived as follows:

I({o^i, a^i}; z^i) = H(z^i) − H(z^i|{o^i, a^i})
= H(z^i) + E_{(o^i, a^i), z^i} [log p(z^i|{o^i, a^i})]
= H(z^i) + E_{(o^i, a^i)} [ E_{z^i} [log q_ϕ(z^i|{o^i, a^i})] + D_KL(p(·|{o^i, a^i}) || q_ϕ(·|{o^i, a^i})) ]
≥ H(z^i) + E_{(o^i, a^i), z^i} [log q_ϕ(z^i|{o^i, a^i})],

where the inequality follows from the non-negativity of the KL divergence.

H HYPERPARAMETERS (CMG AND PMR)

We provide the searched values of each method in Tab. 3, 4, 5, 6, 7, and 8. The hyperparameters are searched individually for each population size. We use three random seeds for each set of hyperparameters. We do not use any validation method; instead, we present the results using the best parameters in the main paper. For LIPO, we set λ_MI to 0.0 and 0.5 in CMG and PMR, respectively.

I HYPERPARAMETERS (OVERCOOKED)

For each method, we use the parameters that give the highest entropy to generate five populations for Sec. 4.5 and Sec. 4.6. The searched values of each method are:

• TrajeDi: We perform a grid search with the following hyperparameters: α ∈ {5, 10} and γ ∈ {0, 0.5}. We use α = 5 and γ = 0.5 for the results in the paper.
• Multi SP MI: We perform a grid search of λ_MI ∈ {5, 10}. We use λ_MI = 5 for the results in the paper.
• Multi MAVEN: We perform a grid search of λ_MI ∈ {5, 10}. We use λ_MI = 5 for the results in the paper.
• LIPO: We perform a grid search with the following hyperparameters: λ_XP ∈ {0.2, 0.3} and λ_MI ∈ {0.1, 0.5}. We use λ_XP = 0.3 and λ_MI = 0.5 for the results in the paper.

J ADDITIONAL RESULTS

J.1 ADDITIONAL ABLATION RESULTS

We provide additional results on the numbers of learned solutions with varying N and λ_XP (Fig. 10), the numbers of competent agents in CMG with varying N and λ_XP (Fig. 11), and the numbers of learned solutions with varying n_xp in CMG (Fig. 12). These results are consistent with the analysis presented in Sec. 4.2 and 4.3.

J.2 RECIPE FREQUENCIES

Fig. 13 shows the recipe frequencies of the population with the highest recipe entropy (out of five runs) from each method. The frequencies are calculated based on the completed recipes from 1,000 self-play episodes of each agent, as described in Sec. 4.5. Qualitatively, we can see that agents in a LIPO population have more distinct recipe frequencies. That is, the agents are different from each other in terms of recipe preference.

