COORDINATION SCHEME PROBING FOR GENERALIZABLE MULTI-AGENT REINFORCEMENT LEARNING

Anonymous

Abstract

Coordinating with previously unknown teammates without joint learning is a crucial need for real-world multi-agent applications, such as human-AI interaction. An active research topic on this problem is ad hoc teamwork, which improves agents' coordination ability in zero-shot settings. However, previous works only address a single agent's coordination with different teams, which falls short of the arbitrary group-to-group coordination required in complex multi-agent scenarios. Moreover, in a zero-shot setting they commonly suffer from limited adaptation ability within an episode. To address these problems, we introduce the Coordination Scheme Probing (CSP) approach, which applies a disentangled scheme probing module to represent and classify newly arrived teammates beforehand with limited pre-collected episodic data and performs multi-agent control accordingly. To achieve generalization, CSP learns, in an end-to-end fashion, a meta-policy with multiple sub-policies that follow distinct coordination schemes, and automatically reuses it to coordinate with unseen teammates. Empirically, we show that the proposed method achieves remarkable performance compared to existing ad hoc teamwork and policy generalization methods in various multi-agent cooperative scenarios.

1. INTRODUCTION

Multi-Agent Reinforcement Learning (MARL) (Gronauer & Diepold, 2021) holds promise in numerous cooperative domains, including resource management (Xi et al., 2018), traffic signal control (Du et al., 2021), and autonomous vehicles (Zhou et al., 2020). A number of methods (Lowe et al., 2017; Wang et al., 2020b; Papoudakis et al., 2019; Christianos et al., 2021) have been proposed to deal with joint policy learning and scalability issues. However, previous works commonly assume a fixed team composition, where agents only need to coordinate with their training partners, and do not consider generalization. This assumption is not in line with real-world applications that require agents to cooperate with unknown teammates whose coordination schemes may not be explicitly available; when coordinating with such previously unseen teammates, co-trained agents may fail (Gu et al., 2022).

Another research domain dedicated to this need is ad hoc teamwork (Stone et al., 2010), which takes a single agent's perspective to adapt to different teams in a zero-shot fashion. However, current methods have three significant limitations: (1) The differences between teammates can be very subtle and lie in only a few critical decisions, in which case zero-shot coordination is not always feasible. As the example in Fig. 1 shows, the teammates' behaviors are indistinguishable before the final critical action. (2) The information for identifying different teams is collected by the same policy that aims to maximize coordination performance, so the exploration-exploitation dilemma causes adverse effects on both objectives. (3) Ad hoc teamwork takes a single agent's perspective and therefore cannot deal with arbitrary group-to-group generalization in complex multi-agent scenarios. To overcome these limitations and achieve generalizable coordination, we propose a multi-agent learning framework called Coordination Scheme Probing (CSP).
Instead of performing zero-shot coordination, CSP tries to capture the unknown teammates' coordination scheme beforehand with limited pre-collected episodic data. Concretely, we start by generating a set of teams with high performance and diversified behaviors to discover different solutions to the task. After that, the scheme probing module learns to interact with these teams to reveal their behaviors and represents their policies at the team level via dynamics reconstruction (Hospedales et al., 2021). Finally, we discover coordination schemes as clusters of the representations and train a multimodal meta-policy to adapt to them, with each sub-policy dedicated to a unique coordination scheme. The whole CSP framework learns multiple coordination schemes from a given task in an end-to-end fashion and automatically reuses the most suitable scheme, as recognized by the scheme probing module, to coordinate with unknown partners. To validate our method, we conducted experiments on four challenging multi-agent cooperative scenarios and found that CSP generalizes to unknown teammates more effectively and robustly than various baselines. Our visual analysis further confirms that CSP acquires multiple coordination schemes that are clearly distinguishable and can be precisely recognized. Meanwhile, our ablations show the necessity of probing unknown teammates in advance, as well as of using a multimodal policy for adaptation.
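The probe-then-select procedure at deployment time can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed_team` averaging encoder and the nearest-centroid matching are hypothetical stand-ins for CSP's learned scheme probing module and its discovered scheme clusters.

```python
import numpy as np

def embed_team(episodes):
    """Hypothetical probing encoder: summarize a team's pre-collected
    episodes as the mean transition feature vector. In CSP this role is
    played by a learned, dynamics-reconstruction-based encoder."""
    feats = np.concatenate([np.asarray(ep, dtype=float) for ep in episodes])
    return feats.mean(axis=0)

def select_sub_policy(team_embedding, scheme_centroids):
    """Return the index of the sub-policy whose coordination-scheme
    cluster centroid is closest to the probed team embedding."""
    dists = np.linalg.norm(scheme_centroids - team_embedding, axis=1)
    return int(np.argmin(dists))

# Toy usage: two discovered schemes; a new team is probed with 2 episodes
# whose transition features lie near the second scheme's centroid.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
episodes = [[[9.0, 11.0], [10.0, 10.0]], [[11.0, 9.0]]]
print(select_sub_policy(embed_team(episodes), centroids))  # -> 1
```

The point of the sketch is the separation of concerns: a small amount of probing data is mapped to a scheme label first, and only then is the matching sub-policy of the meta-policy deployed for control.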

2. RELATED WORK

Multi-Agent Reinforcement Learning (MARL). Fully cooperative MARL methods (Foerster et al., 2018) mainly focus on deploying a fixed team under the centralized training with decentralized execution (CTDE) setting (Oliehoek et al., 2008). Value factorization methods (Sunehag et al., 2018; Rashid et al., 2018; Wang et al., 2020a) are widely adopted within CTDE to address the non-stationarity of teammates during training. Our work also adopts the CTDE setting to learn policies that collaborate with others. However, such a learning process may converge to a fixed mode and fail to coordinate well with unseen teammates trained elsewhere. To address this issue, Tang et al. (2021) propose a method that finds hard-to-search optimal policies through reward randomization. Among game-theoretic methods, PSRO (Lanctot et al., 2017) tackles the overfitting problem by learning meta-strategies, and α-PSRO (Muller et al., 2020) further uses α-Rank as an alternative to a Nash solver to avoid equilibrium selection issues and improve efficiency. In this work, we borrow the idea of breaking local optima to mine diverse coordination behaviors from the environment, instead of training a single optimal policy.

Ad Hoc Teamwork. Ad hoc teamwork aims to learn a single autonomous agent that cooperates well with other agents without pre-coordination (Stone et al., 2010; Mirsky et al., 2022). Existing methods include population-based training (Strouse et al., 2021; Jaderberg et al., 2017; Carroll et al., 2019a; Charakorn et al., 2021), Bayesian belief update (Barrett et al., 2017; Albrecht et al., 2016), experience recognition (Barrett et al., 2017; Grover et al., 2018b; Chen et al., 2020), and planning (Bowling & McCracken, 2005; Ravula et al., 2019). A recent work, ODITS (Gu et al., 2022), introduces an information-based regularizer so that a local encoder automatically approximates the hidden state of a global encoder. However, these methods are limited to learning a single ad hoc agent and only make zero-shot adaptation. We take a step further and develop a framework for more complex team-to-team coordination under a few-shot setting.

Policy Representation. Anticipating other agents' behaviors to weaken the non-stationarity issue is a well-studied topic in multi-agent scenarios (Albrecht & Stone, 2018). DRON (He et al., 2016)



Figure 1: An example from Overcooked (Carroll et al., 2019a), where the chef in the blue hat needs to cooperate with chefs wearing other colors. Different chefs put their dishes in different positions after they finish cooking. For example, during training, the chefs in the green and orange hats put their dishes at the red and blue pentagrams as their last action, respectively. When a chef in the pink hat arrives, how should the chef in the blue hat work with it? Since the first four actions of the different chefs are indistinguishable, detecting the newcomer's type and coordinating with it is challenging.

