GENERATING DIVERSE COOPERATIVE AGENTS BY LEARNING INCOMPATIBLE POLICIES

Abstract

Training a robust cooperative agent requires diverse partner agents. However, obtaining those agents is difficult. Previous works aim to learn diverse behaviors by changing the state-action distributions of agents. However, without information about the task's goal, the diversified agents are not guided to find other important, albeit sub-optimal, solutions: the agents might learn only variations of the same solution. In this work, we propose to learn diverse behaviors via policy compatibility. Conceptually, policy compatibility measures whether the policies of interest can coordinate effectively. We theoretically show that incompatible policies are not similar. Thus, policy compatibility, which has previously been used exclusively as a measure of robustness, can also serve as a proxy for learning diverse behaviors. We then incorporate the proposed objective into a population-based training scheme that allows concurrent training of multiple agents. Additionally, we use state-action information to induce local variations of each policy. Empirically, the proposed method consistently discovers more solutions than baseline methods across various multi-goal cooperative environments. Finally, in multi-recipe Overcooked, we show that our method produces populations of behaviorally diverse agents, which enables generalist agents trained with such a population to be more robust.

1. INTRODUCTION

Cooperating with unseen agents (e.g., humans) in multi-agent systems is a challenging problem. Current state-of-the-art cooperative multi-agent reinforcement learning (MARL) techniques can produce highly competent agents in cooperative environments (Kuba et al., 2021; Yu et al., 2021). However, those agents are often overfitted to their training partners and cannot coordinate with unseen agents effectively (Carroll et al., 2019; Bard et al., 2020; Hu et al., 2020; Mahajan et al., 2022). The problem of working with unseen partners, i.e., the ad hoc teamwork problem (Stone et al., 2010), has been tackled in many different ways (Albrecht & Stone, 2018; Carroll et al., 2019; Shih et al., 2020; Gu et al., 2021; Rahman et al., 2021; Zintgraf et al., 2021; He et al., 2022; Mirsky et al., 2022; Parekh et al., 2022). These methods allow an agent to learn how to coordinate with unseen agents and, sometimes, humans. However, the success of these methods depends on the quality of the training partners; it has been shown that the diversity of training partners is crucial to the generalization of the agent (Charakorn et al., 2021; Knott et al., 2021; Strouse et al., 2021; McKee et al., 2022; Muglich et al., 2022). In spite of its importance, obtaining a diverse set of partners is still an open problem.

The simplest way to generate training partners is to use hand-crafted policies (Ghosh et al., 2020; Xie et al., 2021; Wang et al., 2022), domain-specific reward shaping (Leibo et al., 2021; Tang et al., 2021; Yu et al., 2023), or multiple runs of the self-play training process (Grover et al., 2018; Strouse et al., 2021). These methods, however, are neither scalable nor guaranteed to produce diverse behaviors. Prior works propose techniques that aim to generate diverse agents by changing the state visitation and action distributions (Lucas & Allen, 2022), or the joint trajectory distribution of the agents (Mahajan et al., 2019; Lupu et al., 2021). However, as discussed by Lupu et al. (2021), there is a potential drawback of using such information from trajectories to diversify behaviors: agents that make locally different decisions do not necessarily exhibit different high-level behaviors.

To avoid this potential pitfall, we propose an alternative approach that learns diverse behaviors using information about the task's objective via the expected return. In contrast to previous works that use the joint trajectory distribution to represent behavior, we use policy compatibility. Because cooperative environments commonly require all agents to coordinate on the same solution, if the agents have learned different solutions, they cannot coordinate effectively and, thus, are incompatible. Consequently, if an agent discovers a solution that is incompatible with all other agents in a population, then that solution must be unique relative to the population. Based on this reasoning, we introduce a simple but effective training objective that regularizes agents in a population to find solutions that are compatible with their partner agents but incompatible with the rest of the population. We call this method "Learning Incompatible Policies" (LIPO).

We theoretically show that optimizing the proposed objective yields a distinct policy. We then extend the objective to a population-based training scheme that allows concurrent training of multiple policies. Additionally, we utilize a mutual information (MI) objective to diversify the local behaviors of each policy. Empirically, without using any domain knowledge, LIPO discovers more solutions than previous methods under various multi-goal settings. To further study the effectiveness of LIPO in a complex environment, we present a multi-recipe variant of Overcooked and show that LIPO produces behaviorally diverse agents that prefer to complete different cooking recipes.
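As a minimal sketch of this idea, the toy objective below rewards a joint policy for a high self-play return while penalizing its best cross-play return against the rest of the population. The repeated "matching" game, the closed-form returns, and the weight `lam` are illustrative assumptions for this sketch, not the paper's actual objective or implementation.

```python
# Toy sketch: a "policy" is just P(action = 1) in a repeated matching game
# where both players are rewarded when their actions agree (gamma = 1).
# These choices are illustrative assumptions, not part of the paper.

def expected_return(pi_i, pi_j, horizon=3):
    # Exact expected return: horizon * P(a^1 = a^2).
    p_match = pi_i * pi_j + (1 - pi_i) * (1 - pi_j)
    return horizon * p_match

def j_sp(pi):                      # J_SP(pi_A) = J(pi^1_A, pi^2_A)
    return expected_return(pi, pi)

def j_xp(pi_a, pi_b):              # J_XP = J(pi^1_A, pi^2_B) + J(pi^1_B, pi^2_A)
    return expected_return(pi_a, pi_b) + expected_return(pi_b, pi_a)

def lipo_objective(pi_a, others, lam=0.5):
    # Be compatible with yourself, incompatible with the most similar other.
    return j_sp(pi_a) - lam * max(j_xp(pi_a, pi_b) for pi_b in others)

print(lipo_objective(1.0, others=[0.0]))  # distinct convention: 3 - 0.5*0 = 3.0
print(lipo_objective(1.0, others=[1.0]))  # identical partner penalized: 3 - 0.5*6 = 0.0
```

Agents that have converged on different conventions score zero cross-play return, so the penalty vanishes; agents duplicating an existing convention are penalized.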
Experimental results across three environments suggest that LIPO is robust to variations in the state and action spaces, the reward structure, and the number of possible solutions. Finally, we find that training generalist agents with a diverse population produced by LIPO yields more robust agents than training with a less diverse baseline population. See our project page at https://bit.ly/marl-lipo

2. PRELIMINARIES

Our main focus lies in fully cooperative environments modeled as decentralized partially observable Markov decision processes (Dec-POMDPs; Bernstein et al., 2002). In this work, we start our investigation with the two-player variant. A two-player Dec-POMDP is defined by a tuple (S, A^1, A^2, Ω^1, Ω^2, T, O, r, γ, H), where S is the state space, and A ≡ A^1 × A^2 and Ω ≡ Ω^1 × Ω^2 are the joint-action and joint-observation spaces of player 1 and player 2. The transition probability from state s to s′ after taking a joint action (a^1, a^2) is given by T(s′|s, a^1, a^2). O(o^1, o^2|s) is the conditional probability of observing a joint observation (o^1, o^2) in state s. All players share a common reward function r(s, a^1, a^2), γ is the reward discount factor, and H is the horizon length. The players, who may have different observation and action spaces, are controlled by policies π^1 and π^2. At each timestep t, the players observe o_t = (o^1_t, o^2_t) ∼ O(o^1_t, o^2_t|s_t) under state s_t ∈ S and produce a joint action a_t = (a^1_t, a^2_t) ∈ A sampled from the joint policy π(a_t|τ_t) = π^1(a^1_t|τ^1_t) π^2(a^2_t|τ^2_t), where τ^1_t and τ^2_t contain the trajectory history up to timestep t from the perspective of each agent. All players receive a shared reward r_t = r(s_t, a^1_t, a^2_t). The return of a joint trajectory τ = (o_0, a_0, r_0, ..., r_{H-1}, o_H) ∈ T ≡ (Ω × A × R)^H is G(τ) = Σ_{t=0}^{H-1} γ^t r_t. The expected return of a joint policy (π^1, π^2) is J(π^1, π^2) = E_{τ ∼ ρ(π^1, π^2)}[G(τ)], where ρ(π^1, π^2) is the distribution over trajectories of the joint policy (π^1, π^2) and P(τ|π^1, π^2) is the probability of τ being sampled from the joint policy (π^1, π^2). We use subscripts to denote different joint policies and superscripts to refer to different player roles.
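The rollout and discounted-return computation can be sketched with a toy two-player game. The "matching" reward (players are rewarded when their actions agree), the stateless policies, and the choice of observations are illustrative assumptions for this sketch, not part of the paper.

```python
# Toy two-player rollout computing the discounted return
# G(tau) = sum over t of gamma^t * r_t. The environment and the
# policies below are illustrative assumptions.

def rollout(pi1, pi2, gamma=0.9, horizon=3):
    """Sample one joint trajectory; each policy maps its observation to an action."""
    G, obs = 0.0, (0, 0)              # initial joint observation (o^1_0, o^2_0)
    for t in range(horizon):
        a1, a2 = pi1(obs[0]), pi2(obs[1])
        r = 1.0 if a1 == a2 else 0.0  # shared reward r(s_t, a^1_t, a^2_t)
        G += (gamma ** t) * r         # accumulate gamma^t * r_t
        obs = (a2, a1)                # next observation: the partner's last action
    return G

always0 = lambda o: 0
always1 = lambda o: 1
print(rollout(always0, always0))  # compatible pair: 1 + 0.9 + 0.81 ~= 2.71
print(rollout(always0, always1))  # incompatible pair: 0.0
```

Averaging such rollouts over many sampled trajectories gives a Monte Carlo estimate of J(π^1, π^2).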
For example, π_A = (π^1_A, π^2_A) is a different joint policy from π_B = (π^1_B, π^2_B), and π^i_A and π^j_A are policies of different roles.¹ Finally, we denote the expected joint return of self-play (SP) trajectories, where both policies are part of the same joint policy π_A, as J_SP(π_A) := J(π^1_A, π^2_A), and the expected joint return of cross-play (XP) trajectories, where the policies are drawn from different joint policies π_A and π_B, as J_XP(π_A, π_B) := J(π^1_A, π^2_B) + J(π^1_B, π^2_A).

Since we are interested in creating distinct policies for any Dec-POMDP, we need an environment-agnostic measure that captures the similarity of policies. First, we consider a measure that computes the similarity between policies of the same role i, e.g., π^i_A and π^i_B. We can measure this with the probability of a joint trajectory τ being produced by either π^i_A or π^i_B. However, in the two-player setting, we need to pair these policies with a reference policy π^j_ref. Specifically, π^i_A and π^i_B are considered similar if they are likely to produce the same trajectories when paired with an arbitrary reference policy π^j_ref. We define similar policies as follows:



¹ Note that LIPO can be applied to environments with more than two players with a slight modification. Specifically, a policy π^j would represent the joint policy of all players except player i: π^j(a^j_t|τ^j_t) = Π_{k≠i} π^k(a^k_t|τ^k_t).
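This n-player product construction can be sketched as follows. For simplicity, each policy here is a plain per-player action distribution rather than a trajectory-conditioned network; the distributions and numbers are illustrative assumptions.

```python
# Sketch of the n-player extension: the "partner" policy pi^j for player i
# is the product of all other players' policies. Policies here are toy
# per-player action distributions (an illustrative assumption).
import math

def partner_policy_prob(policies, i, actions_minus_i):
    """P(a^{-i}) = product over k != i of pi^k(a^k)."""
    others = [pi for k, pi in enumerate(policies) if k != i]
    return math.prod(pi[a] for pi, a in zip(others, actions_minus_i))

# Three players, each with an action distribution over {0, 1}:
pis = [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}]
print(partner_policy_prob(pis, 0, (0, 1)))  # pi^2(0) * pi^3(1) ~= 0.72
```

With this factorization, the two-player SP/XP definitions carry over by treating all of player i's co-players as a single joint partner.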

