COORDINATION SCHEME PROBING FOR GENERALIZABLE MULTI-AGENT REINFORCEMENT LEARNING

Anonymous

Abstract

Coordinating with previously unknown teammates without joint learning is a crucial need for real-world multi-agent applications, such as human-AI interaction. An active research topic on this problem is ad hoc teamwork, which improves agents' coordination ability in zero-shot settings. However, previous works only address a single agent's coordination with different teams, which does not cover arbitrary group-to-group coordination in complex multi-agent scenarios. Moreover, they commonly suffer from limited adaptation ability within an episode under the zero-shot setting. To address these problems, we introduce the Coordination Scheme Probing (CSP) approach, which applies a disentangled scheme probing module to represent and classify newly arrived teammates beforehand with limited pre-collected episodic data, and performs multi-agent control accordingly. To achieve generalization, CSP learns, in an end-to-end fashion, a meta-policy with multiple sub-policies that follow distinct coordination schemes, and automatically reuses it to coordinate with unseen teammates. Empirically, we show that the proposed method achieves remarkable performance compared to existing ad hoc teamwork and policy generalization methods in various multi-agent cooperative scenarios.

1. INTRODUCTION

Multi-Agent Reinforcement Learning (MARL) (Gronauer & Diepold, 2021) holds promise in numerous cooperative domains, including resource management (Xi et al., 2018), traffic signal control (Du et al., 2021), and autonomous vehicles (Zhou et al., 2020). A number of methods (Lowe et al., 2017; Wang et al., 2020b; Papoudakis et al., 2019; Christianos et al., 2021) have been proposed to deal with joint policy learning and scalability issues. However, previous works commonly assume a fixed team composition, where agents only need to coordinate with training partners, and do not consider generalization. Such a setting is not in line with real-world applications that require agents to cooperate with unknown teammates whose coordination schemes may not be explicitly available; when coordinating with independently trained teammates, co-trained agents may fail (Gu et al., 2022). Another research domain dedicated to this need is ad hoc teamwork (Stone et al., 2010), which takes a single agent's perspective to adapt to different teams in a zero-shot fashion. However, current methods have three significant limitations: (1) The differences between teammates can be very subtle and lie in only a few critical decisions, in which case zero-shot coordination is not always feasible. As shown in the example in Fig. 1, the teammates' behaviors are indistinguishable before the final critical action. (2) The information for identifying different teams is collected by the same policy that aims to maximize coordination performance, so the exploration-exploitation dilemma causes adverse effects on both sides. (3) Ad hoc teamwork takes a single agent's perspective, which cannot deal with arbitrary group-to-group generalization in complex multi-agent scenarios. To overcome these limits and achieve generalizable coordination, we propose a multi-agent learning framework called Coordination Scheme Probing (CSP).
Instead of doing zero-shot coordination, CSP tries to capture the unknown teammates' coordination scheme beforehand with limited pre-collected episodic data. Concretely, we start with generating a set of teams with high performance and diversified behaviors to discover different solutions to the task. After that, the scheme probing module learns to interact with these teams to reveal their behaviors and represents their policies with dynamics reconstruction (Hospedales et al., 2021) at the team level. Finally, we discover coordination schemes as clusters of the representations and train a multimodal meta-policy to adapt to them, with each sub-policy dedicated to a unique coordination scheme. The whole learning framework of CSP learns multiple coordination schemes from a given task in an end-to-end fashion and automatically reuses the most suitable scheme recognized by the scheme probing module to coordinate with unknown partners. To validate our method, we conducted experiments on four challenging multi-agent cooperative scenarios and found that CSP achieves more effective and robust generalization to unknown teammates compared to various baselines. Our visual analysis further confirms that CSP acquires multiple coordination schemes that are clearly distinguishable and can be precisely recognized.

[Figure 1: An Overcooked example (Carroll et al., 2019a), where the chef in the blue hat needs to cooperate with chefs in other colors. Different chefs put their dishes in different positions after they are done cooking. For example, during training, the chefs in the green and orange hats put their dishes on the red and blue pentagrams as their last action, respectively. When a chef in the pink hat comes, how should the chef in the blue hat work with him? Since the first four actions of different chefs are indistinguishable, detecting his type and coordinating with him is challenging.]
Meanwhile, our ablations show the necessity of probing unknown teammates in advance as well as using a multimodal policy for adaptation.

2. RELATED WORK

Multi-Agent Reinforcement Learning (MARL). Fully cooperative MARL methods (Foerster et al., 2018) mainly focus on deploying a fixed team under the centralized training and decentralized execution (CTDE) setting (Oliehoek et al., 2008). Value factorization methods (Sunehag et al., 2018; Rashid et al., 2018; Wang et al., 2020a) are widely adopted in the CTDE process to address the instability caused by co-learning teammates during training. Our work also utilizes the CTDE setting to learn policies that collaborate with others. However, such a learning process may converge to a fixed mode and may not coordinate well with unseen teammates trained elsewhere. To deal with this issue, Tang et al. (2021) propose a method that finds hard-to-search optimal policies via reward randomization. Among game-theoretic methods, PSRO (Lanctot et al., 2017) tackles the overfitting problem by learning meta-strategies, and α-PSRO (Muller et al., 2020) further utilizes α-Rank as an alternative to the Nash solver to avoid equilibrium selection issues and improve efficiency. In this work, we borrow the idea of breaking local optima to mine diverse coordination behaviors from the environment, instead of training a single optimal policy.

Ad Hoc Teamwork. Ad hoc teamwork aims to learn a single autonomous agent that cooperates well with other agents without pre-coordination (Stone et al., 2010; Mirsky et al., 2022). Existing methods include population-based training (Strouse et al., 2021; Jaderberg et al., 2017; Carroll et al., 2019a; Charakorn et al., 2021), Bayesian belief updates (Barrett et al., 2017; Albrecht et al., 2016), experience recognition (Barrett et al., 2017; Grover et al., 2018b; Chen et al., 2020), and planning (Bowling & McCracken, 2005; Ravula et al., 2019). A recent work, ODITS (Gu et al., 2022), introduces an information-based regularizer to automatically approximate the hidden state of the global encoder with the local encoder.
However, these methods are limited to learning a single ad hoc agent and only make zero-shot adaptation. We take a step further and develop a framework for more complex team-to-team coordination under a few-shot setting.

Policy Representation. Anticipating others' behaviors to weaken the non-stationarity issue is a well-studied topic in multi-agent scenarios (Albrecht & Stone, 2018). DRON (He et al., 2016) uses a modeling network to reconstruct opponents' actions from the full history. Grover et al. (2018b) adopt the concept of an interaction graph (Grover et al., 2018a) to learn a contrastive-style representation with a prepared policy class. DPIQN (Hong et al., 2018) learns "policy features" of other agents and incorporates them into the Q-network for better value estimation. LIAM (Papoudakis et al., 2021a) estimates teammates' current actions and observations based on the learning agent's local history alone. Inspired by dynamics reconstruction (Raileanu et al., 2020) in meta-learning, we use an architecture that directly reconstructs the policy dynamics of a team, instead of predicting its temporal behaviors, to obtain a compact and precise representation.

3. BACKGROUND AND PROBLEM FORMALIZATION

Cooperative Task. We formalize the problem as a Dec-POMDP (Oliehoek & Amato, 2016) with a controllable group, defined as $\mathcal{M} = \langle \mathcal{N}, \mathcal{S}, \{O_i\}_{i=1}^n, \{A_i\}_{i=1}^n, \Omega, P, R, \gamma, d_0, G_1 \rangle$, where $\mathcal{N} = \{1, \dots, n\}$ is the set of agents, $\mathcal{S}$ is the set of global states, $O_i$ is agent $i$'s observation space, $A_i$ is agent $i$'s action space, $\gamma \in [0, 1)$ is the discount factor, and $d_0$ denotes the initial state distribution. At each timestep, agent $i \in \mathcal{N}$ acquires a local observation $o_i \in O_i$ via the observation function $\Omega(s, i)$ and chooses an action $a_i \in A_i$ via its individual policy $\pi_i(a_i \mid \tau_i)$, where $\tau_i$ denotes the input history. The joint action $\mathbf{a} = \langle a_1, \dots, a_n \rangle$ transitions the system to the next global state $s'$ according to the transition function $P(s' \mid s, \mathbf{a})$, and all agents receive a shared reward $r = R(s, \mathbf{a})$. The group describes the control range at test time, where $G_1 \subseteq \mathcal{N}$ is the subset of controllable agents, and its complement $G_{-1}$ contains the uncontrollable teammates that $G_1$ should adapt to. We denote the joint observation, action, and policy for $G_1$ as $o^1 = \langle o_i \rangle$, $a^1 = \langle a_i \rangle$, $\pi^1 = \langle \pi_i \rangle$, $i \in G_1$, and the corresponding parts for $G_{-1}$ are defined similarly. Thus, the (global) joint policy can be written as $\pi = \langle \pi^1, \pi^{-1} \rangle$. Coordination Scheme. We define this term to better describe generalization. Let $\Pi_f$ be the set of all joint policies with high coordination performance. The coordination scheme set $C = \{c_i\}$ is defined as a partition of $\Pi_f$: each coordination scheme $c_i$ is a set of joint policies, where $c_i \cap c_j = \emptyset$ if $i \neq j$, and $\bigcup_{c_i \in C} c_i = \Pi_f$. We assume that coordination performance can be guaranteed if all the agents are under the same coordination scheme, even if they have minor differences. Otherwise, no such guarantee exists in general. Intuitively, $C$ is determined by the coordination task itself, and its different elements reflect distinct high-level joint behaviors.
We may use words like "follows" or "is under" in this paper to describe $\pi \in c$ for ease of expression. Problem Formalization. Our aim is to control $G_1$ to coordinate with $G_{-1}$ under any coordination scheme $c \in C$. We assume that $G_{-1}$ has no adaptation ability and will stick to a fixed scheme no matter how $G_1$ behaves. With a slight abuse of notation, we use $\pi^{-1} \in c$ to denote that $\pi \in c$ and $\pi^{-1}$ is its slice for $G_{-1}$. Formally, the optimal policy $\pi^1_{\theta^*}$ parameterized by $\theta$ for $G_1$ maximizes the discounted cumulative reward:
$$\theta^* = \arg\max_\theta \mathbb{E}_{\pi^{-1} \in c,\, c \in C} \left[ \sum_{t=0}^{H-1} \gamma^t r_{t+1} \,\middle|\, (a^1, a^{-1}) \sim (\pi^1_\theta, \pi^{-1}), P, d_0 \right],$$
where $r_{t+1}$ is the shared reward at timestep $t$ and $H$ is the episode length. Since we do not have direct access to the true scheme set $C$, we create a set of diverse policies $\Pi_{\text{train}}$ and directly sample $\pi^{-1}$ from it as a surrogate of the two-factor sampling, rewriting the objective as:
$$\theta^* = \arg\max_\theta \mathbb{E}_{\pi^{-1} \in \Pi_{\text{train}}} \left[ \sum_{t=0}^{H-1} \gamma^t r_{t+1} \,\middle|\, (a^1, a^{-1}) \sim (\pi^1_\theta, \pi^{-1}), P, d_0 \right].$$
Another diverse set $\Pi_{\text{eval}}$ is created to evaluate the generalization performance.
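The surrogate objective above is estimated by Monte Carlo rollouts: sample a teammate policy from $\Pi_{\text{train}}$, play an episode with it, and average the discounted returns. A minimal numpy sketch of that estimate (function names are ours, not from the paper):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_{t=0}^{H-1} gamma^t * r_{t+1} for one episode."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def surrogate_objective(episode_rewards_per_partner, gamma=0.99):
    """Monte Carlo estimate of E_{pi^{-1} in Pi_train}[discounted return]:
    average per-episode returns over partners sampled from the population."""
    return float(np.mean([discounted_return(r, gamma)
                          for r in episode_rewards_per_partner]))
```

Here each inner list is the reward sequence of one episode played against one sampled partner from the training population.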

4. METHOD

This section describes how the CSP framework addresses the generalizable coordination problem in an end-to-end manner (Fig. 2). Given a cooperative task, CSP learns in three stages: (1) It generates a diverse population of team policies to discover multiple feasible coordination schemes. (2) It trains a scheme probing module to efficiently represent different teams by self-supervised team-dynamics reconstruction. (3) It discovers the underlying coordination schemes by clustering the representations and trains a multimodal meta-policy to adapt to them.

4.1. DIVERSE POPULATION GENERATION

We first need a set of diverse team policies to simulate the coordination schemes required for the training and evaluation phases. Instead of letting each team acquire diversity through different initializations (Carroll et al., 2019a; Strouse et al., 2021) alone, we propose the Soft-Value Diversity (SVD) objective and an alternating optimization process to gain diversity explicitly. We maintain a population of $M$ independent multi-agent teams, each learning with an MARL method (VDN (Sunehag et al., 2018) in this work). While each team performs independent "inner loop" learning, the population periodically performs centralized "outer loop" updates for all teams to maximize SVD:
$$J_{\text{SVD}}(\{\theta_i\}_{i=1}^M) = \mathbb{E}_{\tau, a}\left[ \sum_{i=1}^M \left\| \bar{D}(\tau, a) - D_i(\tau, a) \right\|_2^2 \right], \quad (1)$$
where $\theta_i$ denotes the Q-network parameters of the $i$-th team's policy, $\tau$ and $a$ denote an individual observation history and action, $D_i(\tau, a) = \frac{\exp(Q_{\theta_i}(\tau, a))}{\sum_{a' \in A} \exp(Q_{\theta_i}(\tau, a'))}$ is the normalized value estimation obtained by applying the Boltzmann softmax operator (Asadi & Littman, 2017), and $\bar{D}(\tau, a) = \frac{1}{M} \sum_{i=1}^M D_i(\tau, a)$ is the average observation-action value estimation over all policies in the population $\Pi$. We use the Monte Carlo method to estimate the expectation based on recently collected trajectories in the replay buffer.
In this stage, all agents (i.e., $\mathcal{N} = G_1 \cup G_{-1}$) of each team are trained jointly, but we only store $\pi^{-1}$, the part corresponding to $G_{-1}$, for further use. Maximizing SVD encourages teams within the population to form different value estimations of each observation-action pair by increasing the gap between their current values and the population mean. This diversity in value estimation reflects the various local optima caused by different joint decisions in multi-agent scenarios, where the same action can have different outcomes when teammates' actions change. The "inner" and "outer" loops in the framework are designed to improve individual team performance and the diversity among teams, respectively. We set a hyperparameter to control the learning frequency of these two loops, which trades off the quality and diversity of policies in the population. We obtain $\Pi_{\text{train}}$ and $\Pi_{\text{eval}}$ independently via this process.
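As a concrete illustration, the SVD objective can be estimated on a batch of observation histories shared by all M teams. The sketch below is our own simplification, assuming each team's Q-network outputs are already available as an (M, B, |A|) array; it is not the authors' implementation:

```python
import numpy as np

def softmax(q, axis=-1):
    """Numerically stable Boltzmann softmax along the action axis."""
    z = q - q.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def svd_objective(q_values):
    """Soft-Value Diversity for a population of M Q-functions.

    q_values: array of shape (M, B, |A|) -- each team's Q-estimates
    for a shared batch of B observation(-history) samples.
    Returns the Monte Carlo estimate of
    E_{tau,a}[ sum_i || D_bar(tau,a) - D_i(tau,a) ||_2^2 ].
    """
    D = softmax(np.asarray(q_values, dtype=float), axis=-1)  # (M, B, |A|)
    D_bar = D.mean(axis=0, keepdims=True)                    # (1, B, |A|)
    # Squared L2 distance of each policy's Boltzmann distribution to the
    # population mean, summed over policies and averaged over the batch.
    return float(((D - D_bar) ** 2).sum(axis=(0, 2)).mean())
```

Identical Q-networks give an objective of zero; the outer loop would ascend this quantity (by gradient, in the paper's end-to-end setting) to push the teams' soft values apart.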

4.2. SCHEME PROBING MODULE

After obtaining a diverse team population $\Pi_{\text{train}}$, we train the scheme probing module upon it to efficiently reveal and represent different teams' coordination policies. It has two main parts. One is the scheme probing policy $\pi_{\text{sp}}$, which interacts with a team for an entire episode and gathers the trajectory $\tau_{\text{sp}} = \{s_t, a_t\}_{t=0}^H$ that reveals the current teammates' coordination scheme. The other is the team-dynamics autoencoder $\langle E_c, D_c \rangle$, which learns a representation $z_c$ of team policies based on $\tau_{\text{sp}}$ in a self-supervised manner. Starting from here, we only train policies for $G_1$ and let $G_{-1}$ switch within $\Pi_{\text{train}}$. The teammates can thus be viewed as a non-stationary part of the environment. Team-dynamics Autoencoder. The encoder $E_c$ is parameterized as an LSTM (Hochreiter & Schmidhuber, 1997), which takes $\tau_{\text{sp}}$ as input and outputs an embedding vector $z_c$. The decoder $D_c$ is a feed-forward network that takes as input both $z_c$ and the current state $s_t$, and predicts the teammates' next joint action distribution $\hat{a}^{-1}_t$. Formally,
$$z_c = E_c(\tau_{\text{sp}}; \theta_c), \quad \hat{a}^{-1}_t = D_c(\cdot \mid s_t, z_c; \phi_c). \quad (2)$$
The parameters $\theta_c$ and $\phi_c$ are jointly optimized to minimize the cross-entropy loss (i.e., the reconstruction error in Fig. 2) between $\hat{a}^{-1}_t$ and $a^{-1}_t$, averaged over the entire trajectory $\tau_{\text{sp}}$:
$$L_{\text{tot}} = \frac{1}{H} \sum_{t=1}^H L_{\text{pred}}(t) = -\frac{1}{H} \sum_{t=1}^H \log D_c(a^{-1}_t \mid s_t, z_c; \phi_c). \quad (3)$$
We call this approach team-dynamics reconstruction for the following reasons. Firstly, $D_c$ receives no historical information, so it cannot infer a team's behaviors from the temporal ordering of states and joint actions. Hence, $E_c$ is forced to embed information about the team's dynamics into $z_c$ to enable a good reconstruction. Additionally, since $s_t$ is already an input of $D_c$, $E_c$ has no incentive to embed any information about states. Thus, the embedding $z_c$ contains information only about the policy and not the environment, making it a compact and precise representation of a team. Scheme Probing Policy.
Instead of collecting information passively, we expect $\pi_{\text{sp}}$ to actively guide teammates' behaviors and reveal their coordination schemes. To this end, we use the reconstruction loss above at each timestep as an intrinsic reward for $\pi_{\text{sp}}$, added to the original environmental reward $r^{\text{env}}_t$:
$$r'_t = r^{\text{env}}_t + \alpha L_{\text{pred}}(t), \quad (4)$$
where $\alpha$ is an adjustable hyperparameter that trades off coordination performance against information gain through probing. This additional term encourages $\pi_{\text{sp}}$ to explore states with large behavioral uncertainty across different kinds of teammates, which helps determine their identities and represent their coordination policies.
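A minimal sketch of the per-timestep reconstruction loss and the shaped probing reward, assuming the decoder's output has already been turned into a probability vector over $G_{-1}$'s joint-action space (function names and the small epsilon are illustrative, not from the paper):

```python
import numpy as np

def prediction_loss(pred_dist, true_joint_action):
    """Per-timestep cross-entropy L_pred(t) = -log D_c(a^{-1}_t | s_t, z_c).
    pred_dist: predicted probabilities over G_{-1}'s joint-action space.
    true_joint_action: index of the joint action actually taken."""
    return float(-np.log(pred_dist[true_joint_action] + 1e-12))

def shaped_reward(r_env, pred_dist, true_joint_action, alpha=1e-6):
    """r'_t = r_env_t + alpha * L_pred(t): the probing policy is paid extra
    for visiting states where teammate behavior is hard to predict."""
    return r_env + alpha * prediction_loss(pred_dist, true_joint_action)
```

Note that the loss is *added*, not subtracted: high predictive uncertainty yields a larger intrinsic bonus, steering the probing policy toward informative states.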

4.3. MULTIMODAL COORDINATION POLICY

Once the scheme probing module is trained, it is fixed and used to guide downstream scheme-specific control. Common context-based methods (Hausman et al., 2018; Yang et al., 2020) use embeddings to augment the agent's input space. However, in coordination generalization scenarios, the behaviors under different coordination schemes can vary greatly and conflict with each other. Such a setting requires a single network to acquire multimodal behaviors as the context changes, which increases learning complexity and instability. In contrast, we use embeddings to automatically group similar team policies to discover different schemes, and solve each distinct group with an independent sub-policy to avoid conflicts. Scheme Discovery. We first use the learned scheme probing module to probe and represent all teams in $\Pi_{\text{train}}$ $N$ times, which generates an embedding set $Z$ of size $|Z| = NM$. Probing each team $N$ times yields the distribution of its representation, rather than a single sample point, since the environment and policy are stochastic. Then we perform k-means clustering with Euclidean distances on $Z$ to obtain $k$ clusters with centers $\mu = \langle \mu_1, \dots, \mu_k \rangle$. We use the Silhouette method (Rousseeuw, 1987) to automatically determine the most suitable $k$; details can be found in App. A.2. These clusters reflect the natural structure of coordination schemes, as behaviors under the same scheme should not vary too much if coordination is to be ensured. The number of clusters $k$ will be approximately equal to the number of environmental coordination schemes $|C|$ if $\Pi_{\text{train}}$ already covers all possible schemes, so enlarging $\Pi_{\text{train}}$ will not indefinitely increase the learning complexity. Meta-Policy Learning. We initialize a multimodal meta-policy with $k$ sub-policies, where each sub-policy takes the local observation history as input:
$$\pi_{\text{meta}} = \{\pi_i(a \mid \tau) \mid i = 1, \dots, k\}. \quad (5)$$
During training or deployment, when confronted with a team, we first let the scheme probing policy $\pi_{\text{sp}}$ interact with it to obtain a trajectory $\tau_{\text{sp}}$ and the representation $z_c = E_c(\tau_{\text{sp}})$. Then we classify the team into one of the $k$ classes (discovered schemes) according to the distances between $z_c$ and the cluster centers $\mu$. Once the classification is done, we choose the corresponding sub-policy for the rest of the control. Formally,
$$\pi_{\text{meta}}(a \mid \tau, \tau_{\text{sp}}) = \pi_{i^*}(a \mid \tau), \quad \text{where } i^* = \arg\min_i \| E_c(\tau_{\text{sp}}) - \mu_i \|_2^2. \quad (6)$$
This structural design takes advantage of multimodality to be highly expressive, allowing end-to-end learning of several vastly distinct coordination schemes simultaneously without mutual interference. Moreover, each sub-policy $\pi_i(a \mid \tau)$ only needs to acquire a single, stationary coordination scheme. Compared to common context-based methods that use a single policy $\pi(a \mid \tau, z)$ to acquire all schemes, this reduces learning complexity and improves stability.
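The sub-policy selection step is a nearest-center lookup. A small numpy sketch under the assumption that the probing embedding and the k-means centers are plain arrays (names are ours):

```python
import numpy as np

def select_subpolicy(z_c, centers):
    """Pick sub-policy index i* = argmin_i || z_c - mu_i ||_2^2,
    i.e. classify the probed team into the nearest discovered scheme.

    z_c:     (d,) embedding of the probing trajectory, E_c(tau_sp)
    centers: (k, d) k-means cluster centers mu_1..mu_k
    """
    d2 = ((np.asarray(centers, dtype=float)
           - np.asarray(z_c, dtype=float)) ** 2).sum(axis=1)
    return int(np.argmin(d2))
```

The meta-policy then simply dispatches the rest of the episode to `sub_policies[select_subpolicy(z_c, centers)]`.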

5. EXPERIMENTS

In this section, we design experiments to answer the following questions: (1) How well does CSP perform when generalizing to unknown partners in multiple complex scenarios (Sec. 5.1)? (2) Can the proposed scheme probing process obtain meaningful and distinguishable representations (Sec. 5.2)? (3) Does the team population really hold multiple coordination schemes (Sec. 5.3)? (4) What is the impact of each component of CSP (Sec. 5.4)? We select four multi-agent cooperative environments with six scenarios as benchmarks: Level-based Foraging (LBF) (Papoudakis et al., 2021b) requires agents to find food randomly distributed on the map and eat it together. Predator-Prey (PP) (Böhmer et al., 2020) (Heinrich et al., 2015) in games by constructing training sets with teammates' historical checkpoints. ODITS (Gu et al., 2022) applies a centralized "teamwork situation encoder" for end-to-end learning. Notice that PEARL, PBT, FCP, and ODITS are originally designed for single-agent settings. To make them compatible with multi-agent scenarios, we implement them upon the MARL framework VDN (Sunehag et al., 2018), which follows the CTDE paradigm, to learn team policies for G_1. The other multi-agent methods (i.e., CSP, LIAM, and FIAM) also use VDN as their base MARL framework for a fair comparison. In addition, the original versions of LIAM and FIAM did not use populations and only performed self-play during training, while PBT, FCP, and ODITS used several teams with different random seeds and initializations to construct training populations, which may be unstable and difficult to compare. To standardize the comparison, we reuse the same Π_train generated in CSP's stage 1 for all baselines, reducing the chance that differences in outcome stem from the randomness of population generation. More details of each scenario and baseline are given in App. B-C.

5.1. EVALUATION OF PERFORMANCE

We first investigate how CSP and the baselines perform when coordinating with unknown, diverse teammates. Fig. 3 shows the average performance of coordinating with all teams in Π_eval during training. Since CSP always collects a probing trajectory τ_sp before the adapted coordination, we use the average score of the two trajectories as CSP's performance metric for a fair comparison. All experiments are run with five random seeds, and all curves are shown with 95% confidence intervals as shaded areas. In general, CSP consistently outperforms PEARL, PBT, FCP, and LIAM, and is better than or at least comparable to FIAM and ODITS depending on the task. In LBF, where each agent has a strictly limited observation range, FIAM has a clear advantage because it utilizes global information at every timestep to make decisions. The comparable result of CSP shows that it is possible to achieve near-optimal control with pre-identified coordination scheme information alone, eliminating the need for global states throughout the episode. In PP, Overcooked, and SMAC Fork, the cooperative tasks are challenging and contain several vastly different coordination schemes, which require the coordinator to change its behaviors aggressively and precisely. All baselines show a clear gap to CSP here. We believe the superiority comes from the fact that CSP uses multimodality to isolate the expression of different coordination schemes and avoid mutual interference. In contrast, in SMAC 1c3s5z, CSP and all baselines converge to roughly equivalent performance. The reason is that, without a specially designed symmetry like Fork's, the optimal policy in this task amounts to focusing fire on an enemy and retreating when a unit's health is low. The optimal decision in each state is unambiguous, with no choice among different coordination schemes, and all methods acquire this optimal policy after training long enough. Although PEARL uses additional context data, it performs worse than CSP.
We believe this is because PEARL learns to encode context into a meaningful variable in an end-to-end manner to directly maximize cumulative reward, which is relatively inefficient. It is worth noting that CSP trains k sub-policies but still achieves comparable or even better sample efficiency than the baselines, verifying that isolating different schemes stabilizes training and prevents conflicting schemes from affecting each other.

5.2. SCHEME REPRESENTATION ON SMAC

A meaningful and distinguishable representation of coordination schemes is the basis of CSP's adaptation ability. To demonstrate this, we visualize the embeddings of our scheme probing module in experiments on SMAC Fork (Fig. 4). We let the scheme probing module interact with all teams in Π = Π_train ∪ Π_eval for 64 episodes. One of the baselines, LIAM, which performs agent modeling at every timestep, also runs in parallel as a comparison. Let us first briefly describe the Fork task. There are two symmetrical points on the Up and Down sides, each guarded by several enemies. If all the teammates choose to attack the same point, they have enough force to eliminate all the enemies there and obtain a high reward. Otherwise, neither point has a firepower advantage, and all the teammates are destroyed and fail. Therefore, we can roughly claim that there are two basic coordination schemes C = {Up, Down} in this task. Screenshots at timesteps 10 and 30 (Fig. 4a-b) show the early and middle stages of one episode. Asynchronous decision-making occurs in (a) and (b), where G_1 and G_{-1} respectively select a direction (Up or Down) to attack, and coordination succeeds if their choices agree. Since G_{-1} may be controlled by various policies in Π, which make different decisions, G_1 has to make the right choice based on the partners it meets. The t-SNE projection of CSP's scheme embeddings is shown in Fig. 4e, where different colors represent different teams in Π (named by their observed scheme) and each point represents a single run. There are two main phenomena: (1) Each color forms a relatively compact cluster. (2) Clusters with similar shades (lighter or darker) tend to be close together, while deeper and lighter colors are farther apart. The former indicates that CSP's representations are highly consistent with low variance, which is beneficial for stabilizing downstream MARL learning.
The latter shows that the embedding space holds semantically meaningful information, where teams with similar coordination schemes are packed together. As a comparison, LIAM's embeddings at timesteps 10 and 30 are shown in Fig. 4c-d. We can observe that different teams are mixed up at the early stage; a few local clusters emerge as time goes by, but the teams still cannot be distinguished well. As illustrated in the screenshots (white arrow), G_{-1} always moves right in the first few dozen steps, so the trajectories of different teams during this period are similar and hold insufficient information. LIAM cannot distinguish teams based on such trajectories, let alone make the right choice at the beginning. Fig. 4f presents the mean reconstruction accuracy of G_{-1}'s joint actions throughout one episode. We can observe that CSP has consistently higher accuracy than LIAM, especially at timesteps close to 30. As described above, the observation segment before this point is not informative enough, in which case LIAM fails to predict teammate behaviors when sudden uncertainty occurs. By contrast, CSP has a comprehensive view of the teammates it coordinates with after the probing phase, so it can maintain reliable scheme predictions throughout the coordination phase. More results on other benchmarks are shown in App. D.1.

5.3. SCHEME DIVERSITY

To verify that the populations generated in stage 1 indeed hold multiple coordination schemes, we perform Cross-Play (Hu et al., 2020) experiments on Overcooked's Coordination Ring and SMAC Fork (Fig. 5a). Teams in Π = Π_train ∪ Π_eval are paired to play the roles of G_1 and G_{-1} for all combinations. Firstly, we find that values on the diagonal from the top-left to the bottom-right corner are generally larger than the others. This indicates that each team coordinates well with itself, as the two sides are always under the same coordination scheme. In comparison, the relatively lower performance at other points indicates the inability to coordinate across different schemes. Interestingly, the performance on SMAC Fork shows that it has two basic coordination schemes, where each team can coordinate with exactly half of the teams in Π, namely those with the same coordination scheme. This phenomenon is aligned with our understanding of the task, as described in Sec. 5.2. More results on other benchmarks are provided in App. D.2.

5.4. ABLATION STUDY

We perform ablations on SMAC and Overcooked to demonstrate the importance of CSP's different components: UNI removes multimodality and has only a single sub-policy π(τ); POP(z) uses a single context-based policy π(τ, z) instead of sub-policies; CSP(z) extends each sub-policy to π_i(τ, z); SP does not use Π_train and learns with self-play alone. We report the mean generalization performance on Π_eval with error bars indicating the 95% confidence interval, as shown in Fig. 5b. Firstly, we observe that SP generally performs the worst and has a large variance. This means that without being exposed to diverse partners during training, the policy can only find a single coordination scheme and hardly generalizes. UNI has a lower variance but still performs poorly, which indicates that domain randomization helps make the policy robust to partner changes, but the policy cannot specialize to each scheme without a dedicated adaptability module. The relatively lower performance of POP(z) compared to CSP confirms our earlier claim that it is difficult to adapt within a single network based on different scheme embeddings. In complex multi-agent scenarios, each coordination scheme can imply highly different and even opposite behaviors, and using multiple sub-policies helps to isolate these schemes and keep each one consistent. Finally, CSP(z) does not outperform the original version. This shows that our scheme grouping process already fully uses the information contained in the embedding; further using it as additional input no longer yields a boost but may increase learning difficulty.

6. CLOSING REMARKS

To achieve generalizable coordination in complex multi-agent scenarios and address the limitations of ad hoc teamwork, this paper learns a coordination scheme probing module for teammate recognition and a meta-policy consisting of multiple sub-policies for few-shot coordination with unseen, diverse teams. With the help of the probing module, we reduce few-shot coordination to a multi-task RL problem by clustering the representation space. A multimodal policy is then trained end-to-end to solve it directly. Extensive experiments against strong baselines on various benchmarks validate the effectiveness of the proposed method. We point out two limitations and directions for future work: (1) Co-evolution of the population, as in Quality Diversity (Parker-Holder et al., 2020), would be a more general interaction setting. (2) Effective knowledge transfer between sub-modules, such as Soft Modularization (Yang et al., 2020), could be considered to improve sample efficiency. Finally, instead of training alongside artificial agents, we also hope to study the human-in-the-loop setting to adapt to people's dynamic needs and preferences.

A IMPLEMENTATION DETAILS OF CSP

A.1 ALGORITHMS

We now give more detailed pseudocode corresponding to the different stages of CSP.

Algorithm 2 CSP: Training Stage 1
 1: Initialize teammate policies Π = {π_1, . . . , π_M}
 2: while not done do
 3:   for inner loops do
 4:     for π_j ∈ Π in parallel do
 5:       Control G_1 ∪ G_-1 with π_j; train π_j with any MARL algorithm
 6:     end for
 7:   end for
 8:   for outer loops do
 9:     Update all π_i ∈ Π to maximize J_SVD (Eq. 1)
10:   end for
11: end while
12: return Π

A.2 SILHOUETTE METHOD

We use the Silhouette method (Rousseeuw, 1987) to automatically determine the most suitable k for the embedding set Z. The intuition is that a higher Silhouette value for a data point indicates that the point is placed in the correct cluster; therefore, the cluster number k with the highest mean Silhouette value over all points in the dataset is desirable. Concretely, for the embedding set Z, the Silhouette value SV(i) of each data point i belonging to the I-th cluster Z_I is defined as

    SV(i) = (d_out(i) - d_in(i)) / max{d_out(i), d_in(i)}   if |Z_I| > 1,
    SV(i) = 0                                               if |Z_I| = 1,

where d_in(i) denotes the mean distance between data point i and the other points in the same cluster, and d_out(i) denotes the smallest mean distance from data point i to all points in any other cluster. They are defined as

    d_in(i)  = (1 / (|Z_I| - 1)) Σ_{j ∈ Z_I, j ≠ i} d(i, j),
    d_out(i) = min_{J ≠ I} (1 / |Z_J|) Σ_{j ∈ Z_J} d(i, j),

where d(i, j) is the Euclidean distance between data points i and j. We implement automatic k selection by linearly searching for the maximum mean Silhouette value over all data points:

    k* = arg max_{k ≤ M} (1 / |Z|) Σ_{i ∈ Z} SV(i).
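The k-selection procedure above can be sketched in a few lines of numpy. This is an illustrative implementation under our own assumptions, not the paper's code; the helper names (`kmeans`, `silhouette_value`, `best_k`) are ours.

```python
import numpy as np

def kmeans(Z, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(Z[:, None] - centers[None], axis=-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return centers, labels

def silhouette_value(i, Z, labels):
    """SV(i) = (d_out - d_in) / max(d_out, d_in); 0 for singleton clusters.
    Assumes at least two non-empty clusters exist."""
    same = (labels == labels[i])
    if same.sum() == 1:
        return 0.0
    d = np.linalg.norm(Z - Z[i], axis=1)
    d_in = d[same].sum() / (same.sum() - 1)   # sum includes d(i, i) = 0, so this excludes self
    d_out = min(d[labels == c].mean() for c in set(labels) - {labels[i]})
    return (d_out - d_in) / max(d_out, d_in)

def best_k(Z, M):
    """Linear search k* = argmax_{2 <= k <= M} of the mean Silhouette value."""
    scores = {}
    for k in range(2, M + 1):
        _, labels = kmeans(Z, k)
        scores[k] = np.mean([silhouette_value(i, Z, labels) for i in range(len(Z))])
    return max(scores, key=scores.get)
```

For two well-separated blobs of embeddings, `best_k` recovers k = 2 regardless of the upper bound M.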

A.3 HYPERPARAMETERS AND ARCHITECTURE

In Stage 1, "inner loops" and "outer loops" are set to 32 and 5, respectively. MARL refers to any multi-agent reinforcement learning method; we choose VDN (Sunehag et al., 2018) here. The population size M = |Π_train| = |Π_eval| is 4 for all scenarios except 1c3s5z, where it is set to 5. In Stage 2, α is set to 1 × 10^-6 and |D| is set to 32. In Stage 3, N is set to 64.

Details of the neural network architectures used by CSP are provided in Fig. 6. Each policy used in CSP (i.e., teammate policy π_i ∈ Π, probing policy π_sp, and each sub-policy in {π_1, . . . , π_k}) is a "GRU policy" that takes the local observation history as input and outputs value estimates across its action space. The MARL framework VDN sums all agents' local utilities q_i to form Q_tot, which is updated to approximate the global discounted return.

B DETAILS OF ENVIRONMENTS

Level-Based Foraging (LBF) (Papoudakis et al., 2021b). We use an instance of LBF where the environment is a 20 × 20 grid world with 2 agents and 4 food items. Each agent has a self-centered 5 × 5 observation range and a discrete action space for moving in four directions and collecting food. The goal of the agents is to collect all the food on the map. When food is collected, the environment returns a shared reward proportional to the food level, with the total reward normalized to 1. An episode terminates when all the food is collected or after 50 timesteps. To strengthen the requirement for cooperation, we add the constraint that food can only be collected if both agents are adjacent to it and perform the "collect" action simultaneously. We let CSP or baselines control one player and teammates from Π control the other. The variability of coordination schemes in this environment is reflected in the order in which the 4 food items are collected.

Predator-Prey (PP) (Böhmer et al., 2020). This environment can be considered a more complex version of LBF, where 2 predators with a 5 × 5 observation range are expected to hunt 4 prey in a 10 × 10 grid world.
An episode ends when all prey are captured or after 200 timesteps. The extra difficulty comes from the fact that each prey moves randomly throughout the game, so the predators must constantly chase prey together. We simplify the task by removing the "capture" action: prey are captured automatically once besieged, so there is no punishment for mis-capturing. The coordination schemes in this environment are reflected in the order in which prey are chased and the division of labor in each roundup.
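All policies in these environments are trained with the VDN framework described in Appendix A.3. The following numpy sketch (ours, not the authors' code; function names are assumptions) illustrates VDN's additive mixing and the resulting TD target:

```python
import numpy as np

def q_tot(local_qs, actions):
    """VDN mixing: Q_tot(τ, a) = Σ_i q_i(τ_i, a_i).
    local_qs: (n_agents, n_actions) per-agent utilities at one timestep.
    actions:  chosen action index per agent."""
    return sum(local_qs[i, a] for i, a in enumerate(actions))

def td_target(reward, next_local_qs, gamma=0.99, done=False):
    """y = r + γ · Σ_i max_a q_i'(τ_i', a): summing per-agent greedy maxima
    equals the max over joint actions thanks to the additive structure."""
    if done:
        return reward
    return reward + gamma * next_local_qs.max(axis=1).sum()
```

The key design point is that the per-agent argmax is consistent with the joint argmax, so decentralized greedy execution is exact under the additive decomposition.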

Overcooked (Strouse et al., 2021).

There are 2 agents in the environment sharing a 6-dimensional discrete action space: moving in four directions, interacting with the object they are facing, and doing nothing. Cooking a dish requires a series of actions and a waiting period; delivering a dish requires picking it up, moving to the correct delivery point, and putting it down. The goal of both agents is to complete as many delivery orders as possible within 400 timesteps. We set CSP or baselines to control the blue player and teammates from Π to control the green player. In the layout Coordination Ring (lower), the passage is narrow, and the two agents may clash while pathfinding. In the layout Forced Coordination Hard (upper), only the green agent can reach the cookware, and only the blue agent can reach the delivery point. Coordination is forced in this layout since neither agent can finish the task alone, and they have to adapt to their teammates' preferences.

D.2 CROSS-PLAY OF DIVERSE TEAMMATES

To measure the quality and diversity of our SVD population generation in training Stage 1, we concatenate all the teams in the training set Π_train and the evaluation set Π_eval and cross-play them with each other. In LBF, Overcooked, PP, and SMAC, |Π_train| and |Π_eval| are both 4, except for SMAC 1c3s5z, where they are 5. We normalize the scores to [0, 100) for each scenario and show them as heatmaps in Fig. 11. In most benchmarks, agents in the population cannot coordinate with other teams, as the values on the diagonal (i.e., self-play) are much higher. In comparison, Fork mainly contains two coordination schemes. The heatmaps show that our SVD method can generate a diverse population to simulate the underlying coordination schemes.

We also tested the performance of CSP and baselines on the populations trained without SVD, as shown in Fig. 12b. In this case, CSP still performs better, but the relative advantage is less significant. The result verifies the performance of CSP under random coordination settings and is also in line with what we claimed above: since our goal is to train a policy that can coordinate under any scheme, using the more diverse and less redundant Π_eval trained with SVD makes sense.

We plot the learning curves of ⟨E_c, D_c⟩ in Fig. 13 to show how much extra cost is actually required. As can be seen from the plots, the cross-entropy loss between reconstructed teammate actions and the ground truth drops rapidly in the early stage and decreases slowly as training moves on. This indicates that the self-supervised learning process in Stage 2 is much more sample-efficient than reinforcement learning. Therefore, although we let Stage 2 interact with the environment for as many timesteps as Stage 3 in our experiments, it actually requires only a small portion of those interactions to learn a good representation. We will further investigate how to compress this additional cost in future work.
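The cross-play matrix construction described above can be sketched as follows. `evaluate` is a hypothetical stand-in for running episodes between two team policies, and the min-max scaling into [0, 100) is our assumption about how the scores are normalized:

```python
import numpy as np

def crossplay_matrix(teams, evaluate):
    """Score every ordered pair of teams, then min-max normalize into [0, 100)."""
    n = len(teams)
    scores = np.array([[evaluate(teams[i], teams[j]) for j in range(n)]
                       for i in range(n)], dtype=float)
    lo, hi = scores.min(), scores.max()
    # a tiny epsilon in the denominator keeps the top score strictly below 100
    return (scores - lo) / (hi - lo + 1e-8) * 100.0
```

With a toy `evaluate` that rewards only self-play, the matrix is (nearly) 100 on the diagonal and 0 elsewhere, mirroring the heatmap pattern described for most benchmarks.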



Figure 1: An example from Overcooked (Carroll et al., 2019a), where the chef in the blue hat needs to cooperate with chefs in other colors. Different chefs put their dishes in different positions after cooking. For example, during training, the chefs in the green and orange hats put their dishes at the red and blue pentagrams as their last action, respectively. When a chef in the pink hat arrives, how should the chef in the blue hat work with him? Since the first four actions of the different chefs are indistinguishable, detecting his type and coordinating with him is challenging.


Figure 3: Mean generalization performance on Π eval of all methods.


Figure 4: (a)(b) SMAC Fork at timesteps 10 and 30, when G 1 (left) and G -1 (right) make decisions respectively to attack enemies at point Up (red arrow) or Down (blue arrow). (c)(d) LIAM's embeddings of different teams at timesteps 10 and 30. (e) CSP's embeddings of different teams. (f) CSP and LIAM's mean reconstruction accuracy of G -1 's joint action a -1 t throughout one episode.


Figure 5: (a) The scores of cross-play on the Overcooked Coordination Ring and SMAC Fork benchmarks. (b) Final generalization performance of CSP and ablations for each component.

Algorithm 3 CSP: Training Stage 2
 1: Input: teammate policy population Π_train
 2: Initialize scheme probing policy π_sp, autoencoder ⟨E_c, D_c⟩, and replay buffer B
 3: while not done do
 4:   for π_j ∈ Π_train do
 5:     Control G_1 with π_sp and G_-1 with π_j to generate τ_sp = {o_t, a_t}_{t=0}^H
 6:     Store τ_sp in B
 7:     Sample a batch D from B
 8:     Compute {L_pred(t)}_{t=0}^H based on D and update ⟨E_c, D_c⟩ (Eq. 3)
 9:     Train π_sp by any MARL algorithm, with r'_t = r^env_t + α L_pred(t) (Eq. 4)
10:   end for
11: end while
12: return π_sp, E_c

Algorithm 4 CSP: Training Stage 3
 1: Input: Π_train, π_sp, E_c
 2: Initialize team embedding set Z and multimodal meta-policy π_meta = {π_1, . . . , π_k}
 3: for π_j ∈ Π_train do
 4:   for n in N do
 5:     Control G_1 with π_sp and G_-1 with π_j to generate τ_sp = {o_t, a_t}_{t=0}^H
 6:     Get team embedding z_c = E_c(τ_sp), Z ← Z ∪ {z_c}
 7:   end for
 8: end for
 9: Compute the best k with the Silhouette method on Z (Eq. 8)
10: Compute centers µ = ⟨µ_1, . . . , µ_k⟩ with k-means clustering on Z
11: while not done do
12:   for π_j ∈ Π_train do
13:     Control G_1 with π_sp and G_-1 with π_j to generate τ_sp = {o_t, a_t}_{t=0}^H
14:     Get team embedding z_c = E_c(τ_sp)
15:     Pick sub-policy π_i* based on i* = arg min_i ||z_c - µ_i||_2^2 (Eq. 6)
16:     Control G_1 with π_i* and G_-1 with π_j; train π_i* with any MARL algorithm
17:   end for
18: end while
19: return µ = ⟨µ_1, . . . , µ_k⟩, π_meta = {π_1, . . . , π_k}

Under review as a conference paper at ICLR 2023

Algorithm 5 CSP: Deploying
 1: Input: π_sp, E_c, µ = ⟨µ_1, . . . , µ_k⟩, π_meta = {π_1, . . . , π_k}, new policy π_new to adapt to
 2: Control G_1 with π_sp and G_-1 with π_new to generate τ_sp = {o_t, a_t}_{t=0}^H
 3: Get team embedding z_c = E_c(τ_sp)
 4: Pick sub-policy π_i* based on i* = arg min_i ||z_c - µ_i||_2^2 (Eq. 6)
 5: Execute π_i* to coordinate with π_new
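At deployment time (Algorithm 5 above), adapting to a new team reduces to a nearest-centroid lookup over the cluster centers. A minimal sketch, where `encode` is a hypothetical stand-in for the trained scheme encoder E_c:

```python
import numpy as np

def pick_subpolicy(z_c, centers):
    """i* = argmin_i ||z_c - µ_i||_2^2 (Eq. 6 in the paper)."""
    return int(np.argmin(((centers - z_c) ** 2).sum(axis=1)))

def deploy(probe_traj, encode, centers, sub_policies):
    """Embed one probing episode, then dispatch to the nearest sub-policy."""
    z_c = encode(probe_traj)          # team embedding from the probe episode
    return sub_policies[pick_subpolicy(z_c, centers)]
```

This makes explicit why only one pre-collected probing episode is needed per new team: all adaptation cost is paid at training time, and deployment is a single embedding plus a distance comparison.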

Figure 6: Network architectures of CSP, where o_{i,t} and a_{i,t} represent the local observation and action of agent i at timestep t, and τ_{i,t} is the trajectory of agent i's local observations up to timestep t.

Figure 10: t-SNE of CSP's coordination scheme embeddings in different environments.

Figure 12: (a) The cross-play matrix of populations trained without SVD. (b) Generalization performance of CSP and baselines on the two no SVD populations.

Figure 13: Learning curves of ⟨E c , D c ⟩ in CSP's Stage 2.


SMAC (Samvelyan et al., 2019). It is widely used as a multi-agent benchmark for its high complexity of control. Each agent can move in four cardinal directions, stop, do nothing, or select an entity to interact with (heal or attack, according to its type) at each timestep. Therefore, if there are n_a allies and n_e enemies on the map, the action space for each unit contains n_a + n_e + 6 discrete actions, and the joint action space size is (n_a + n_e + 6)^{n_a}. The map 1c3s5z is a standard map that requires control of 9 agents. We set CSP or baselines to control the first 4 agents, and teammates in Π control the remaining 5 agents. We also specially designed a map called Fork (Fig. 8) for this work, which requires strong coordination and exhibits very different coordination modes. It has 2 ally spawn points on the left and middle-left sides of the map and 2 enemy spawn points in the upper-right and lower-right corners. At the beginning of an episode, each ally spawn point generates 4 marines (a long-range attack unit), and each enemy spawn point generates 6. We let CSP or baselines control the allies at point 1 and teammates from Π control the allies at point 2. If both groups choose to attack the same group of enemies, it will be 8 versus 6, and the fight is easy to win. By contrast, if the two groups attack different groups of enemies, each fight will be 4 versus 6, and neither group will be able to win. Thus, there are intuitively two main cooperation modes, which we refer to as Up and Down in this work, indicating which direction's enemies are attacked first. The population Π is trained with CSP's Stage 1. To make the population unbiased in attacking directions, we manually pick 8 policies from the original population, with 4 Ups and 4 Downs, to form a balanced population.
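The action-space arithmetic above, spelled out as a tiny helper of our own: on the symmetric 9-versus-9 map 1c3s5z, each unit has 24 discrete actions.

```python
def smac_action_space(n_a, n_e):
    """Per-unit and joint action space sizes for a SMAC map with
    n_a allies and n_e enemies (4 moves + stop + no-op + n_a + n_e targets)."""
    per_unit = n_a + n_e + 6
    return per_unit, per_unit ** n_a
```

The exponential joint action space, 24^9 for 1c3s5z, is what makes decomposed value learning such as VDN necessary here.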

C DETAILS OF BASELINES

We compare CSP against six baselines. All of them are implemented with a GRU agent and VDN framework similar to CSP's (Fig. 6), except that the input of the GRU agents is τ_t = {(o, z_c)_{0:t}}, which has an additional embedding z_c at each timestep. We therefore mainly focus on how each baseline builds z_c, with components shown in Fig. 9.

PEARL (Rakelly et al., 2019). This baseline comes from single-agent meta-learning settings. It aims to represent environments by hidden representations, using historical data as context to infer environment features, modeled as a product of Gaussians. Since we use the Dec-POMDP setting, the PEARL module is adopted and optimized on each individual policy.

Population-based training (PBT). This baseline uses simple domain randomization to train G_1 against all the teams in Π_train. The training set Π_train is acquired from CSP's Stage 1; we believe fixing this population increases the fairness of the comparison.

Fictitious co-play (FCP) (Strouse et al., 2021). This baseline uses a domain randomization approach similar to PBT to train G_1, except that the training population is an extended version of Π_train: checkpoints at one-third and two-thirds of the total training timesteps are added to Π_train, representing teammates with different levels of ability.

Local information agent model (LIAM) (Papoudakis et al., 2021a). This baseline equips each agent with an encoder-decoder structure to predict other agents' observations o^{-1}_t and actions a^{-1}_t at the current timestep based on its own local observation history τ_t = {o_{0:t}}. The encoder and decoder are optimized to minimize the mean square error of the observations plus the cross-entropy error of the actions. The original version of LIAM considers only a single controllable agent, and predictions are made upon this agent's local observation.
To fit in with MARL's centralized training, we let all controllable agents make predictions based on their own observations, compute each agent's loss, and use the mean as the final loss.

Full information agent model (FIAM) (Papoudakis et al., 2021a). This baseline is a variant of LIAM that replaces the input trajectory of local observations τ_t = {o_{0:t}} with the trajectory of global states τ_t = {s_{0:t}}.

Online adaptation via inferred teamwork situations (ODITS) (Gu et al., 2022). Unlike the previous two methods, which predict the actual behaviors of teammate agents, ODITS improves zero-shot coordination performance in an end-to-end fashion. It has two variational autoencoder pairs, one global and one local. The global encoder takes in the state trajectory τ_t = {s_{0:t}} and outputs the mean and variance of a Gaussian distribution; a vector z_e indicating the "global teamwork situation" is then sampled from it. The global decoder uses z_e to build the parameters z_h of a hyper-network that maps the ad hoc agent's local utility Q_i into the global utility Q_tot to approximate the global discounted return. The local encoder has a structure similar to the global encoder's, except that its input is replaced with the local trajectory τ_t = {o_{0:t}}; it is updated by maximizing the mutual information between its output ẑ_e and the global z_e. The local decoder further maps ẑ_e into a variable z_c used as the ad hoc agent's input. The whole model is trained end-to-end by maximizing the global return and the mutual information between z_e and ẑ_e. Similar to LIAM, ODITS considers only a single ad hoc agent; we modified the loss to be the mean of all controllable agents' individual losses to fit MARL.

To ensure fairness in the use of hidden variables, we give CSP and all baselines the same width for z_c in each environment. Concretely, |z_c| is set to 8 for Overcooked, LBF, and PP, and 64 for Fork.
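The modified LIAM objective described above, per-agent observation MSE plus teammate-action cross-entropy, averaged over controllable agents, can be sketched in numpy. This is our illustration with placeholder arrays, not the baseline's actual code; the decoder outputs are assumed to be raw logits.

```python
import numpy as np

def liam_loss(pred_obs, true_obs, action_logits, true_actions):
    """pred_obs, true_obs: (n_agents, obs_dim) teammate-observation predictions/targets.
    action_logits: (n_agents, n_actions) predicted teammate-action logits.
    true_actions: (n_agents,) ground-truth teammate action indices."""
    mse = ((pred_obs - true_obs) ** 2).mean(axis=1)             # per-agent observation error
    # log-softmax for a numerically explicit cross-entropy
    logp = action_logits - np.log(np.exp(action_logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(true_actions)), true_actions]      # per-agent action error
    return (mse + ce).mean()                                    # mean over controllable agents
```

With perfect observation reconstruction and uniform action logits over two actions, the loss reduces to the entropy term log 2.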

D.1 EMBEDDINGS ON EACH ENVIRONMENT

To measure the quality of our scheme embeddings, Fig. 10 uses t-SNE to visualize the embeddings of distinguished teams, with each color in the four t-SNE sub-figures representing the embeddings of one team. It shows that our scheme embeddings cluster the same coordination scheme together and separate different schemes, which indicates high quality for team recognition and sub-policy selection.

To further support the claim that Π_eval trained with SVD is a better benchmark for testing generalization performance than randomly trained teammates, we train another 10 teams independently without SVD for PP and SMAC Fork each. Fig. 12a illustrates their cross-play performance. Compared to the original teams trained with SVD, shown in Fig. 11, we find two major phenomena: (1) It is no longer only the diagonal entries that have high values, indicating that each team can coordinate with some teams other than itself. As claimed in Sec. 3, coordination performance across different schemes is generally not guaranteed, so there must be multiple teams following the same coordination scheme in the population, which is redundant and makes the population less diverse. (2) We can still see a clear 2-scheme structure for SMAC Fork, but the distribution is biased (7 teams for one scheme and 3 for the other). An ideal benchmark should be unbiased across all the underlying coordination schemes to best match our goal. SVD encourages different teams to behave as differently as possible, which naturally weakens any potential bias toward a particular scheme.

