DISCOVERING GENERALIZABLE MULTI-AGENT COORDINATION SKILLS FROM MULTI-TASK OFFLINE DATA

Abstract

Cooperative multi-agent reinforcement learning (MARL) faces the challenge of adapting to multiple tasks with varying agents and targets. Previous multi-task MARL approaches require costly interactions to simultaneously learn or fine-tune policies in different tasks. However, the setting in which an agent must generalize to multiple tasks given only offline data from a limited set of tasks better matches the needs of real-world applications. Since offline multi-task data contain a variety of behaviors, an effective data-driven approach is to extract informative latent variables that represent universal skills for realizing coordination across tasks. In this paper, we propose a novel Offline MARL algorithm to Discover coordInation Skills (ODIS) from multi-task data. ODIS first extracts task-invariant coordination skills from offline multi-task data and learns to delineate different agent behaviors with the discovered coordination skills. We then train a coordination policy that chooses optimal coordination skills under the centralized training with decentralized execution paradigm. We further demonstrate that the discovered coordination skills induce effective coordinative behaviors and thus significantly enhance generalization to unseen tasks. Empirical results on cooperative MARL benchmarks, including the StarCraft multi-agent challenge, show that ODIS achieves superior performance in a wide range of tasks using only offline data from limited sources.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) has drawn broad attention for addressing problems like video games, sensor networks, and autonomous driving (Peng et al., 2017; Cao et al., 2013; Gronauer & Diepold, 2022; Yun et al., 2022; Xue et al., 2022b). Since recent MARL methods mainly focus on learning policies for a single task in simulated environments (Sunehag et al., 2018; Rashid et al., 2020; Lowe et al., 2017; Foerster et al., 2018), two obstacles arise when applying them to real-world problems. One is poor generalization when facing tasks with varying agents and targets, where the practical demand is to adapt to multiple tasks rather than learn every new task from scratch (Omidshafiei et al., 2017). The other is the potentially high cost and risk of real-world interactions through an under-trained policy (Levine et al., 2020). Multi-agent systems are expected to perform flexibly across general scenarios in which the agents and targets may differ. Multi-task MARL is a promising way to realize such flexibility and generalizability. Previous related works mainly focus on training simultaneously on a pre-defined task set (Omidshafiei et al., 2017; Iqbal et al., 2021) or fine-tuning a pre-trained policy on target tasks (Hu et al., 2021; Zhou et al., 2021; Qin et al., 2022a) in an online manner. Although these approaches exhibit promising performance in some tasks, the expensive cost of online interactions hinders their application to a broader range of tasks. Offline RL (Levine et al., 2020), which aims to learn policies from a static dataset, removes the need for interactions during training. However, most current offline RL methods conservatively regularize the learned policies towards the datasets (Wu et al., 2019; Kumar et al., 2019; Yang et al., 2021; Fujimoto et al., 2019). Although conservatism effectively mitigates the distribution shift issue of offline learning, it restricts the learned policy to be similar to the behavior policy, leading to severe degradation of generalization when facing unseen data (Chen et al., 2021b). Therefore, leveraging multi-agent offline data to train a policy with adequate generalization ability across tasks is in demand.

Figure 1: An illustration of coordination skill discovery from multi-task offline data. Offline data from marine battle source tasks like 5m and 8m of the StarCraft multi-agent challenge contain generalizable coordination skills like focusing fire, moving back, etc. After discovering these coordination skills from source data, a coordination policy learns to appropriately choose coordination skills through an offline RL training process. When facing the unseen task 10m, the agents reuse the discovered coordination skills to achieve coordination and accomplish the task.

This paper finds that underlying generic skills can greatly improve the policy's generalization. Indeed, humans are good at summarizing skills from several tasks and reusing these skills in other similar tasks. Taking Figure 1 as an example, we learn a policy from the StarCraft multi-agent challenge (Samvelyan et al., 2019) tasks 5m and 8m, in which we control five or eight marines, respectively, to beat the same number of enemy marines. We then aim to deploy the learned policy directly, without fine-tuning, on an unseen target task, 10m, with ten marines on each side. One effective way to achieve this is to extract skills from the source tasks, like focusing fire on the same enemy or moving back low-health units, and then apply these skills to the target task. Although these tasks have different state or action spaces, such skills are applicable to a broad range of tasks. We refer to these task-invariant skills as coordination skills since they help realize coordination across different tasks.
In other words, extracting such coordination skills from known tasks facilitates generalization by reusing them in unseen tasks. To learn and reuse coordination skills in a data-driven way, we propose a novel Offline MARL algorithm to Discover coordInation Skills (ODIS), where agents only access a multi-task dataset to discover coordination skills and learn generalizable policies. ODIS first extracts task-invariant coordination skills that delineate agent behaviors from a coordinative perspective. These shared coordination skills help agents perform high-level decision-making without considering specific action spaces. ODIS then learns a coordination policy that selects appropriate coordination skills to maximize the global return under the centralized training with decentralized execution (CTDE) paradigm (Oliehoek et al., 2008). Finally, we deploy the coordination policy directly to unseen tasks. Our proposed coordination skill is distinct from previous work on online multi-agent skill discovery (Yang et al., 2020; He et al., 2020), which uses hierarchically learned skills to improve exploration and data efficiency. Empirical results show that ODIS learns to choose proper coordination skills and generalizes to a wide range of tasks with data from only limited sources. To the best of our knowledge, this is the first attempt towards unseen-task generalization in offline MARL.

2. RELATED WORK

Multi-task MARL. Multi-task RL and transfer learning in MARL can improve sample efficiency through knowledge reuse (da Silva & Costa, 2021). Knowledge reuse across multiple tasks may be impeded by varying populations and input dimensions, calling for policy networks with flexible structures like graph neural networks (Agarwal et al., 2020) or self-attention mechanisms (Hu et al., 2021; Zhou et al., 2021). Recent works consider utilizing policy representations or agent representations to realize multi-task adaptation (Grover et al., 2018). EPL (Long et al., 2020) introduces an evolutionary curriculum learning approach to scale up the number of agents. REFIL (Iqbal et al., 2021) adopts randomized entity-wise factorization for multi-task learning. UPDeT (Hu et al., 2021) utilizes transformer-based value networks to adapt to variable populations and inputs. However, these approaches only consider simultaneous learning or fine-tuning in different tasks; training generalizable policies for deployment in unseen tasks remains a challenge.

Offline MARL. Offline RL attracts tremendous attention for its data-driven training paradigm without interactions with the environment (Levine et al., 2020). Offline MARL is a promising research direction (Zhang et al., 2021; Formanek et al., 2023) that trains policies from a static dataset. Previous work (Fujimoto et al., 2019) discusses the distribution shift issue in offline learning and considers learning behavior-constrained policies to relieve extrapolation error from unseen data estimations (Wu et al., 2019; Kumar et al., 2019).
Existing offline MARL methods often adopt conservative constraints upon current online MARL methods, which either extend policy gradient algorithms to multi-agent cases (Lowe et al., 2017; Foerster et al., 2018; Iqbal & Sha, 2019; Wang et al., 2021b; Xue et al., 2022a) or adopt Q-learning paradigms with value decomposition (Sunehag et al., 2018; Rashid et al., 2020; Wang et al., 2021a; Cao et al., 2021; Yuan et al., 2022b;a). Most offline MARL methods train policies with sufficient conservatism (Yang et al., 2021; Jiang & Lu, 2021; Pan et al., 2022; Guan et al., 2023). These methods can learn an effective policy from offline data, but the conservative learning manner may significantly degrade performance on unseen tasks, as the learned policies fail to generalize to out-of-distribution inputs. Some data-sharing methods in single-agent RL (Li et al., 2020; Yu et al., 2021; 2022) consider properly using multi-task data for offline policy training. However, these single-agent methods cannot efficiently exploit the potential coordination skills in multi-task MARL data.

Skill discovery with hierarchical structures. Hierarchical RL (Barto & Mahadevan, 2003; Tang et al., 2019; Pateria et al., 2021) provides an approach to temporal abstraction and hierarchical organization. Skill discovery methods adopt the hierarchical structure to discover unsupervised high-level skills with state empowerment from information theory (Eysenbach et al., 2019; Campos et al., 2020; Sharma et al., 2020). These diverse skill discovery methods from online RL can also be extended to MARL: HSD (Yang et al., 2020) learns latent skill variables via centralized training, and MASD (He et al., 2020) performs skill discovery in an adversarial way.
However, these methods are usually tailored to improve data efficiency in online RL, and the skills they propose differ from our multi-task reusable skills for generalizable MARL. On the other hand, recent works in offline RL show that high-level latent actions help tackle extrapolation errors from out-of-distribution actions and learn primitive behaviors (Zhou et al., 2020; Ajay et al., 2021). A data-driven approach to extracting and reusing high-level multi-agent behaviors (i.e., our proposed coordination skills) is thus a practical and promising direction toward multi-task generalization.

3.1. COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

A cooperative multi-agent task can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek & Amato, 2016) T = ⟨I, S, A, P, Ω, O, R, γ⟩, where I = {1, . . . , n} is the set of agents, S is the set of global states, A is the set of actions, and Ω is the set of observations. At each time step with state s ∈ S, each agent i ∈ I only acquires an observation o_i ∈ Ω drawn from the observation function O(s, i), and then chooses its action a_i ∈ A. The joint action a = (a_1, . . . , a_n) leads to a next state s' ∼ P(s' | s, a) and a corresponding global reward r = R(s, a). The goal is to find a joint policy π(a | τ) maximizing the discounted return, with joint action-value function Q^π(τ, a) = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_0 = a, π ], where γ ∈ [0, 1) is a discount factor that trades off between current and future rewards. Here τ = (τ_1, . . . , τ_n), where τ_i denotes the trajectory (o_i^1, a_i^1, . . . , o_i^{t-1}, a_i^{t-1}, o_i^t) of agent i.

Most value-based cooperative MARL algorithms apply the CTDE paradigm (Sunehag et al., 2018; Rashid et al., 2020; Wang et al., 2021a), where agents learn a decomposable global value function Q_tot(τ, a), represented by a mixing network in the training phase, and use the decomposed value functions Q_i(τ_i, a_i) for decentralized decisions. The global value function parameterized by θ can be learned by minimizing the squared temporal difference (TD) error (Sutton & Barto, 2018), using experience replay and a target network parameterized by θ^- to stabilize training (Mnih et al., 2015):

min_θ E_{(τ, a, r, τ')} [ ( r + γ max_{a'} Q_tot(τ', a'; θ^-) - Q_tot(τ, a; θ) )^2 ].   (1)

Figure 2: An overview of ODIS. (1) Coordination skill discovery: a state encoder q(z_i | s, a, i) extracts coordination skills from unlabeled multi-task data (s^t, τ_1^t, . . . , τ_n^t, a^t = (a_1^t, . . . , a_n^t)). (2) Coordination policy learning: a coordination policy over the discovered skills is trained with offline RL. (3) In multi-task decentralized execution, ODIS individually chooses a coordination skill according to the coordination policy and decodes specific actions using the action decoder.
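As a concrete illustration of the squared TD error in Equation 1, the toy sketch below computes the loss for a batch, using a VDN-style linear decomposition (Q_tot as the sum of individual Q-values) in place of a full mixing network; all names and shapes are illustrative.

```python
import numpy as np

def vdn_td_loss(q_taken, q_next_max, rewards, gamma=0.99):
    """Squared TD error for a linearly decomposed global value function.

    q_taken:    (batch, n_agents) individual Q-values of the taken actions;
                Q_tot is their sum (VDN-style decomposition for simplicity).
    q_next_max: (batch, n_agents) per-agent maximal Q-values at the next
                step, computed with the frozen target network parameters.
    rewards:    (batch,) global team rewards.
    """
    q_tot = q_taken.sum(axis=1)                        # Q_tot(tau, a; theta)
    target = rewards + gamma * q_next_max.sum(axis=1)  # r + gamma max Q_tot'
    return float(np.mean((target - q_tot) ** 2))

# Toy check: when the current estimate already equals the target, loss is 0.
loss = vdn_td_loss(q_taken=np.array([[1.0, 2.0]]),
                   q_next_max=np.array([[0.0, 0.0]]),
                   rewards=np.array([3.0]))
```

In practice the target network parameters θ^- are a periodically copied snapshot of θ, which is what stabilizes the bootstrapped target.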

3.2. MULTI-TASK MULTI-AGENT REINFORCEMENT LEARNING

Recent multi-task MARL works consider policy learning across two or more cooperative multi-agent tasks (Hu et al., 2021; Iqbal et al., 2021). In our setting, we focus on learning and deploying policies over a static task set {T_i}. A major difficulty in multi-agent multi-task transfer is varying agent numbers and observation/action dimensions. To learn a transferable or universal policy across different tasks, previous methods (Hu et al., 2021) develop a population-invariant network structure based on the transformer (Vaswani et al., 2017). For the individual Q-network, the observation o_i of agent i can be decomposed into self/environmental information o_i^own and other entities' information {o_{i,j}^other}. The network generates embeddings Q, K, V and calculates the attention output according to the attention mechanism:

Q = MLP_Q([o_i^own, o_{i,1}^other, . . .]),  K = MLP_K([o_i^own, o_{i,1}^other, . . .]),  V = MLP_V([o_i^own, o_{i,1}^other, . . .]),

[e_i^own, e_{i,1}^other, . . .] = softmax(Q K^T / √d_K) V,  where d_K = dim(K).

The output representations e_i^own, e_{i,1}^other, . . . can be further utilized to derive Q-values, where e_i^own is used for actions without interactions with other entities and e_{i,j}^other is used for interactive actions with entity j. To handle partial observability, we append an additional historical embedding h_i^{t-1} from the last time step t - 1 to the transformer input sequence and acquire the corresponding h_i^t. Notably, this decomposition technique with transformer models can also be applied to process the state information in mixing networks or other modules (Zhou et al., 2021; Qin et al., 2022a) to effectively handle multi-task inputs with varying data shapes.
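The attention computation above can be sketched in a few lines of numpy. The single-head layer below accepts any number of entity rows, which is what makes the Q-network population-invariant; the random projection matrices stand in for the learned MLP_Q, MLP_K, and MLP_V, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def entity_attention(entities, d_k=8):
    """Single-head attention over one agent's entity sequence.

    entities: (n_entities, d) rows [o_own, o_other_1, o_other_2, ...];
    the same layer applies whatever n_entities is, so tasks with
    different unit counts need no architectural change.
    """
    d = entities.shape[1]
    W_q, W_k, W_v = (rng.standard_normal((d, d_k)) for _ in range(3))
    Q, K, V = entities @ W_q, entities @ W_k, entities @ W_v
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V  # (n_entities, d_k) entity representations

# The same layer processes 5-entity and 10-entity inputs alike.
out_5 = entity_attention(rng.standard_normal((5, 16)))
out_10 = entity_attention(rng.standard_normal((10, 16)))
```

The output keeps one representation per entity, so e_own and each e_other can be routed to different Q-value heads as the text describes.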

4. METHOD

In this section, we describe our proposed Offline MARL algorithm to Discover coordInation Skills (ODIS) from a static multi-task dataset D, a fully data-driven approach to learning generalizable policies for multiple tasks. As shown in Figure 2, ODIS begins with unsupervised coordination skill discovery, leveraging a state encoder q(z_i | s, a, i) to extract the coordination skills z_1, . . . , z_n for the agents using the unlabeled portion of D without reward information. The discovered coordination skill z_i represents an abstraction of coordinative behaviors and can be decoded into a task-related action with local information τ_i by an individual action decoder p(a_i | τ_i, z_i). Both the state encoder and the action decoder are trained in an unsupervised manner to discover effective and distinct coordination skills. After that, ODIS learns a coordination policy that selects appropriate coordination skills via offline RL training on the entire dataset D. This offline MARL training process adopts the CTDE paradigm with a conservative learning manner. In the decentralized execution phase, each agent uses its individual coordination policy to choose the best coordination skill and decodes it into a specific action with the pre-trained action decoder. The high-level decision-making of ODIS promotes generalization across different tasks, especially unseen tasks with varying targets and controllable agents. To tackle the variable data shapes of multi-task settings, we design flexible and powerful network structures based on the transformer model to process input data, whose details are given in Appendix C.

4.1. COORDINATION SKILL DISCOVERY

Coordination skill discovery is expected to extract effective coordination skills from offline data. We assume the coordination skill z_i for agent i is a discrete variable from a finite set Z, where |Z| is a hyper-parameter that may vary among tasks. We leverage a state encoder q(z_i | s, a, i) that uses the sampled state s and joint action a to obtain a coordination skill z_i for agent i. By feeding states and joint actions as input, we ensure the state encoder outputs appropriate coordination skills from a global perspective. In practice, we implement the state encoder with a transformer model, which simultaneously outputs the coordination skills z = (z_1, . . . , z_n) of all agents in one sequence. Moreover, we need to derive task-related actions from the coordination skill during decentralized execution, where acquiring global information is impractical for individual agents. Thus we further introduce an action decoder that predicts a task-related action â_i ∼ p(· | τ_i, z_i) from agent i's local information τ_i and the chosen coordination skill z_i. The training objective is to maximize the likelihood of the real action a_i from the data, with the Kullback-Leibler divergence between q(z_i | s, a, i) and a uniform prior p(z_i) as a regularization, following β-VAE (Higgins et al., 2017). Regularizing towards a uniform distribution over coordination skills prevents the state encoder from choosing similar coordination skills for all inputs, thereby helping to discover distinguishable coordination skills. The objective for coordination skill learning is:

L_s(θ_a, φ_s) = -E_{(s,τ,a)∼D} Σ_{i=1}^{n} ( E_{z_i∼q(·|s,a,i)} [log p(a_i | τ_i, z_i)] - β D_KL( q(· | s, a, i) ∥ p(·) ) ),   (2)

where φ_s and θ_a denote the parameters of the state encoder and the action decoder, respectively, and β is the regularization coefficient.
Our implementations of the state encoder and the action decoder utilize the transformer structure to handle variable dimensions of states, observations, and actions in multi-task settings. The key technique is to decompose the input into sequence data as illustrated in Section 3.2 and project each portion to an embedding with a fixed dimension. Appendix C provides detailed descriptions of the network structures.
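For intuition, the toy numpy sketch below evaluates the coordination skill discovery objective L_s for a single agent, assuming the encoder posterior is given as logits and the decoder's log-likelihood of the dataset action has been precomputed per candidate skill. In ODIS both networks are transformers and the expectation over z_i is handled by sampling, so this only illustrates the loss arithmetic.

```python
import numpy as np

def skill_discovery_loss(skill_logits, action_logprobs, beta=0.1):
    """Per-agent skill discovery loss: reconstruction + beta * KL.

    skill_logits:    (|Z|,) logits of the state-encoder posterior
                     q(z_i | s, a, i) over discrete skills.
    action_logprobs: (|Z|,) log p(a_i | tau_i, z_i) of the dataset action
                     under the action decoder, one entry per skill.
    """
    q = np.exp(skill_logits - skill_logits.max())
    q /= q.sum()                                   # softmax posterior
    recon = -(q * action_logprobs).sum()           # -E_{z~q} log p(a|tau,z)
    kl = (q * np.log(q * len(q) + 1e-12)).sum()    # KL(q || Uniform(|Z|))
    return float(recon + beta * kl)

# A uniform posterior pays (almost) no KL penalty against the uniform prior.
loss_uniform = skill_discovery_loss(np.zeros(4), np.full(4, -1.0))
```

With uniform logits the KL term vanishes and the loss reduces to the expected negative log-likelihood, which is the behavior the uniform-prior regularization encourages.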

4.2. COORDINATION POLICY LEARNING

After coordination skill discovery, we have an action decoder that can generate a task-related action from an agent's local information and an input coordination skill. We can therefore build high-level decision-making on the discovered coordination skills. As our coordination skills are universal for a given task set, we can train a generalizable coordination policy that decides on appropriate coordination skills for different tasks. We adopt a value-based MARL method with the CTDE paradigm to learn a global value function Q_tot(τ, z) that can be decomposed into individual value functions Q_1(τ_1, z_1), . . . , Q_n(τ_n, z_n). The global value function Q_tot(τ, z) is trained with the squared TD loss:

L_TD(θ_v) = E_{(s,τ,a,r,τ')∼D, z} [ ( r + γ max_{z'} Q_tot(τ', z'; θ_v^-) - Q_tot(τ, z; θ_v) )^2 ],   (3)

where θ_v denotes all parameters in the value networks, including the individual and mixing networks, and θ_v^- denotes the parameters of the target networks. Note that in Equation 3, we cannot directly acquire the joint coordination skill z from the offline dataset D. Therefore, the state encoder q(z_i | s, a, i) trained in the coordination skill discovery phase is reused to provide a joint coordination skill z = (z_1, . . . , z_n) from the state and joint action. When estimating Q-targets, we choose the joint coordination skill z' by selecting each coordination skill z_i' with maximal individual Q-value Q_i(τ_i', z_i') to avoid searching the large joint coordination skill space, following previous MARL methods (Sunehag et al., 2018; Rashid et al., 2020). When the value decomposition satisfies the individual-global-max (IGM) principle (Wang et al., 2021a), this selection with individual value functions is correct; we adopt a QMIX-style mixing network to ensure this property. Our designs of the individual value network and the mixing network also utilize the transformer structure.
Practically, the individual value network calculates local Q-values by taking representations from an observation encoder as input. The observation encoder has the same network structure as the aforementioned action decoder for processing local trajectory information. The mixing network takes the global state information as input to generate non-negative weights for combining individual Q-values. We also provide detailed descriptions of these networks in Appendix C.

As the coordination skills in the discovery phase are obtained with global information, a remaining challenge is that the coordination policy may not learn to choose effective coordination skills from local observations alone. We find that learning this directly from the TD objective is extremely inefficient. It is therefore necessary to introduce an auxiliary objective that learns better representations, guiding the coordination policy towards a coordinative perspective. When updating the parameters of the observation encoder, we use the last layer of the pre-trained state encoder to process the output representations of the observation encoder, yielding a coordination skill distribution q̂_i(· | τ_i). We expect this output distribution to be similar to that of the pre-trained state encoder q(· | s, a, i), and thus minimize the KL divergence between them as the consistency loss L_c:

L_c(φ_o) = Σ_{i=1}^{n} E_{(s,τ,a)∼D} [ D_KL( q̂_i(· | τ_i) ∥ q(· | s, a, i) ) ],   (4)

where φ_o denotes the parameters of the observation encoder in the individual value network. To tackle the out-of-distribution issue in offline RL, we also adopt the popular conservative Q-learning (CQL) method (Kumar et al., 2020). In summary, the total loss in the coordination policy learning phase is

L_p(θ_v, φ_o) = L_TD(θ_v) + α L_CQL(θ_v) + λ L_c(φ_o),   (5)

where α and λ are two coefficients and L_CQL is the loss term from CQL.

Decentralized Execution.
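The two auxiliary terms in the total loss L_p can be sketched in isolation. Below are toy numpy versions of a per-agent CQL penalty over discrete skills (the standard logsumexp-minus-data-value form) and the KL consistency term L_c; the shapes and names are illustrative, and in practice both terms are averaged over batches and agents.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cql_penalty(q_values, z_data):
    """CQL term for one agent over discrete skills: logsumexp of all
    skill Q-values minus the Q-value of the skill found in the data,
    pushing down the values of out-of-data skills."""
    lse = np.log(np.exp(q_values - q_values.max()).sum()) + q_values.max()
    return float(lse - q_values[z_data])

def consistency_loss(obs_logits, state_logits):
    """KL(q_hat_i(. | tau_i) || q(. | s, a, i)) between the observation
    encoder's skill distribution and the pre-trained state encoder's."""
    p, q = softmax(obs_logits), softmax(state_logits)
    return float((p * np.log(p / (q + 1e-12) + 1e-12)).sum())

# Identical distributions incur (numerically) zero consistency loss.
lc = consistency_loss(np.array([1.0, 2.0]), np.array([1.0, 2.0]))
cql = cql_penalty(np.array([0.0, 0.0]), z_data=0)  # log(2) for flat values
```

The CQL penalty is zero only when all probability mass of the induced policy sits on in-data skills, which is what enforces conservatism here.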
When performing decentralized execution in a test task, we use local information to calculate the Q-value of each coordination skill with the individual value network Q_i(τ_i, z_i) and choose the coordination skill with maximal Q-value. The action decoder then uses the chosen coordination skill and local information to produce a task-related action for the particular task.
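A decentralized decision step therefore needs only two learned components per agent. The sketch below wires them together with placeholder callables; the real networks are transformers, and all names here are illustrative.

```python
import numpy as np

def execute_step(obs_traj, skill_q_network, action_decoder):
    """One decentralized decision for a single agent: greedily pick the
    coordination skill with maximal local Q-value, then decode it into a
    task-specific action using local information only."""
    q_values = skill_q_network(obs_traj)      # (|Z|,) Q_i(tau_i, z_i)
    z = int(np.argmax(q_values))              # greedy skill selection
    action = action_decoder(obs_traj, z)      # draw from p(a_i | tau_i, z_i)
    return z, action

# Toy stand-ins for the pre-trained networks.
z, action = execute_step(
    obs_traj=None,
    skill_q_network=lambda _: np.array([0.1, 0.9, 0.3]),
    action_decoder=lambda _, z: f"action-for-skill-{z}",
)
```

Because only τ_i enters both calls, this step needs no global state at test time, which is what permits direct deployment to unseen tasks.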

5. EXPERIMENTS

In this section, we design experiments to evaluate the following properties of ODIS: its multi-task generalization performance, the semantics of the discovered coordination skills, and, through ablation studies, how each component of ODIS affects performance.

5.1. PERFORMANCES ON MULTI-TASK GENERALIZATION

Baselines. We introduce comparable baselines and perform several adaptations, since none of the existing multi-task MARL algorithms considers offline learning. UPDeT (Hu et al., 2021) is a state-of-the-art multi-agent transfer learning baseline that adopts a transformer-based individual network to tackle multi-agent transfer. However, the mixing network of UPDeT is not designed for simultaneous multi-task learning. As alternatives, we implement two variants of UPDeT by adopting the transformer-based mixing network of ODIS (UPDeT-m) and the linear decomposable network of VDN (Sunehag et al., 2018) (UPDeT-l), respectively. We keep the same transformer structures and apply the same CQL loss to these UPDeT baselines. Therefore, UPDeT-m can be seen as an ODIS variant without coordination skill discovery that learns an RL policy to select actions directly. We also implement behavior cloning baselines since they are popular for offline tasks: a transformer-based behavior cloning baseline (BC-t) with the same structure as ODIS, and, as decision transformer methods (Chen et al., 2021a) prevail in the recent offline literature, a variant BC-r that appends return-to-go information to BC-t.

Table 1: Average test win rates of the final policies in the task set marine-hard with different data qualities. The listed performance is averaged over five random seeds. We abbreviate asymmetric task names for simplicity; for example, "5m6m" denotes the SMAC map "5m_vs_6m". BC-best stands for the best test win rate between BC-t and BC-r.
More details of our implementations and hyper-parameters are reported in Appendix C. We notice that a recent work, MADT (Meng et al., 2021), proposes to train a multi-agent decision transformer with offline training and online tuning. However, we find that MADT is generally not comparable in our setting and defer the related discussion to Appendix G.

StarCraft multi-agent micromanagement tasks. Following previous multi-task MARL methods (Hu et al., 2021; Qin et al., 2022a), we extend the original SMAC maps and sort out three task sets. In each task set, agents control units like marines, stalkers, and zealots, but the numbers of controllable agents and target enemies differ across tasks. We refer to the three task sets as marine-easy, marine-hard, and stalker-zealot; detailed descriptions can be found in Appendix A. We adopt the popular QMIX algorithm (Rashid et al., 2020) to collect four classes of offline datasets with different qualities. Following the guidelines of the single-agent D4RL offline RL benchmarks (Fu et al., 2020; Qin et al., 2022b), the four dataset qualities are labeled expert, medium, medium-expert, and medium-replay. We report the detailed properties of these datasets in Appendix B.

We conduct experiments in the three task sets with the four data qualities. We train all methods with offline data from only three source tasks and evaluate them on a wide range of unseen tasks. The average test win rates in the task set marine-hard are shown in Table 1, while results of the two other task sets are deferred to Appendix I. The tables report the best test win rates between BC-t and BC-r as BC-best. We find that ODIS generally outperforms the other baselines in both source tasks and unseen tasks. ODIS can discover and exploit common coordination skills from multi-task data, resulting in superior and stable performance compared with UPDeT-l and UPDeT-m, which cannot generalize well across different levels of tasks.
We notice that the behavior cloning methods present comparable performance, especially on expert datasets, indicating that our proposed transformer structure also enhances generalization across tasks. However, these behavior cloning baselines cannot effectively exploit multi-task data to further improve their performance, even though BC-r adds global return-to-go information. To evaluate the validity of ODIS, we also compare ODIS with single-task MARL approaches that can perform offline training, including ICQ (Yang et al., 2021) and QPLEX (Wang et al., 2021a). These methods can learn offline but cannot directly learn from multi-task data. We find that ODIS exhibits comparable performance with these single-task offline MARL methods even on tasks that ODIS has never trained on. These experiments reveal that the multi-task offline training manner of ODIS can indeed promote generalization to different tasks. We defer the experimental results and detailed discussion to Appendix D.

Cooperative navigation tasks. To further validate the effectiveness of ODIS, we also conduct experiments in several cooperative navigation tasks from the multi-agent particle environment (Lowe et al., 2017), where varying numbers of agents learn to reach corresponding landmarks. ODIS again shows superior performance in both source and unseen tasks. The results and detailed descriptions of these tasks can be found in Appendix F.

5.2. SEMANTICS OF DISCOVERED COORDINATION SKILLS

To investigate how the discovered coordination skills help decision-making, we deploy ODIS agents to different tasks and record the chosen coordination skills in test episodes. As shown in Figure 3, we exhibit the percentage of each coordination skill usage from policies learned in the marine-easy and stalker-zealot task sets with expert datasets. For the marine-easy task set in Figure 3(a), we adopt a coordination skill number of three and find that the curves of each coordination skill are quite similar between a source task 5m and an unseen task 12m despite different episode lengths. Further visualization shows that the three coordination skills represent different coordinative behaviors. The agents learn to utilize skill 3 to disperse in different directions and then use skill 2 to advance. When enemies are in attack range, most agents choose skill 1 to focus fire on specific enemies, and a few agents choose skill 2 and skill 3 to pursue enemies or run away from fire. For the stalker-zealot task set in Figure 3(b), we adopt a coordination skill number of four as the task is a little more complicated. We exhibit the coordination skill usages in a source task 2s3z and an unseen task 3s4z, whose curves are also similar. The visualization indicates that the chosen coordination skills may present different semantics from those in marine-easy. In the beginning, agents use skill 4 to advance. When approaching the enemies, some agents choose to attack nearby enemies with skill 2, and others learn to absorb damage at the front with skill 1. We find that the zealot, a melee unit, is more willing to draw enemy fire than the ranged unit, the stalker. After that, a few agents use skill 3 to attack enemies far away, and more agents tend to choose skill 3 when nearby enemies fall.

5.3. ABLATION STUDIES

In the ablation studies, we investigate the effectiveness of each component of our proposed ODIS structure. First, we examine whether discovering coordination skills from multiple tasks yields better performance in a particular task than single-task discovery. We perform ODIS offline training separately with multi-task and single-task coordination skill discovery. ODIS with multi-task coordination skill discovery can utilize data from the two other tasks to discover potentially more effective coordination skills, while ODIS with single-task coordination skill discovery only accesses data from the corresponding task. In coordination policy learning, both approaches learn from single-task data. Results in Figure 4(a) show that ODIS with multi-task coordination skill discovery acquires better performance, indicating that coordination skills shared across tasks can benefit individual tasks. We also conduct experiments to investigate the proposed consistency loss, which helps learn effective local representations for choosing coordination skills from a coordinative perspective, as the state encoder does. We run two variants of ODIS, with and without the consistency loss, in the marine-hard task set and present average test win rates in three source tasks and four unseen tasks. As shown in Figure 4(b), ODIS with the consistency loss performs significantly better than ODIS without it, showing that the proposed loss maintains the consistency of chosen coordination skills between coordination skill discovery and coordination policy learning. Furthermore, the number of coordination skills is a key hyper-parameter of ODIS, so we compare performance with different coordination skill numbers in Table 7. We find that the performance of ODIS is not sensitive to the number of coordination skills in the range from 3 to 16. However, adopting fewer skills or selecting skills at random fails significantly, as discussed in Appendix E.

6. CONCLUSION

We propose ODIS, an offline MARL algorithm that discovers coordination skills from multi-task data, realizing multi-task generalization in a fully data-driven manner. ODIS extracts and utilizes coordination skills shared among different tasks and thus acquires superior performance in both source and unseen tasks. The effectiveness of ODIS indicates that the underlying coordination skills in multi-task data can be crucial for generalization in cooperative MARL. As discrete coordination skills might be limited when facing dissimilar tasks, developing general representations from multi-task or many-task data across dissimilar tasks is a promising future direction. In addition, research on offline MARL algorithms and benchmarks in various domains will also benefit real-world applications of MARL.

A DESCRIPTIONS OF DEFINED TASK SETS IN SMAC

The StarCraft multi-agent challenge (SMAC) (Samvelyan et al., 2019) is a widely used cooperative multi-agent environment containing different types of StarCraft micromanagement tasks. In this paper, we sort out three task sets and call them marine-easy, marine-hard, and stalker-zealot, respectively. The marine-easy and marine-hard task sets include several marine battle tasks where ally marines need to beat the same number or a larger number of enemy marines. In the marine-easy task set, the number of enemy marines equals the number of ally marines, while tasks from the marine-hard task set contain both equal and larger numbers of enemy marines. The stalker-zealot task set includes several tasks with symmetric stalkers and zealots on each side. We exhibit illustrations of these tasks in Figure 5(a) and Figure 5(b). As our goal is generalization to unseen tasks with limited sources, we select three tasks from each task set as training tasks, and the other tasks are used only for evaluation. The detailed properties of these task sets can be seen in Tables 2, 3, and 4, respectively.

B PROPERTIES OF OFFLINE DATASETS

As stated in the experiments section, we construct offline datasets based on the PyMARL implementation of the MARL algorithm QMIX (Rashid et al., 2020). Following the popular single-agent offline reinforcement learning benchmark D4RL (Fu et al., 2020), we collect data with four types of qualities called expert, medium, medium-expert, and medium-replay, respectively. Definitions of these four qualities are listed below:

• The expert dataset contains trajectory data collected by a QMIX policy trained with 2,000,000 steps of environment interactions. We also record the test win rate of the trained QMIX policy (as the expert policy) for constructing medium datasets.
• The medium dataset contains trajectory data collected by a QMIX policy (as the medium policy) whose test win rate is half that of the expert QMIX policy.
• The medium-expert dataset mixes data from the expert dataset and the medium dataset to acquire a more diverse dataset.
• The medium-replay dataset is the replay buffer of the medium policy, containing trajectory data of lower quality.

As we consider generalization to unseen tasks, we only require offline datasets for the source tasks of the three task sets mentioned above. For the expert and medium datasets, we collect 2,000 trajectories for each dataset, so the medium-expert dataset contains 4,000 trajectories as a mixture of the expert and medium data. The trajectory count of the medium-replay dataset depends on the number of trajectories sampled before the medium policy stops training. When performing multi-task offline training in our experiments, we select up to 2,000 trajectories from each source task; when the medium-replay dataset contains fewer than 2,000 trajectories, we select all of them. Data from different tasks are merged into a multi-task dataset so that the policy is trained on multi-task data simultaneously.
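The dataset composition described above can be sketched in a few lines. This is an illustrative sketch (function names are ours), assuming trajectories are stored as list items:

```python
import random

def cap_trajectories(trajs, limit=2000):
    """Take up to `limit` trajectories from one task's dataset; datasets such
    as medium-replay may hold fewer, in which case all are kept."""
    return list(trajs)[:limit]

def build_medium_expert(expert_trajs, medium_trajs, seed=0):
    """Compose a medium-expert dataset by mixing and shuffling the expert and
    medium trajectories, so 2,000 + 2,000 trajectories yield a 4,000-trajectory
    mixed dataset."""
    mixed = list(expert_trajs) + list(medium_trajs)
    random.Random(seed).shuffle(mixed)
    return mixed

mixed = build_medium_expert(["e0", "e1"], ["m0", "m1"])
```

A multi-task dataset is then just the concatenation of each source task's capped trajectory list.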

C STRUCTURES, HYPER-PARAMETERS, AND TRAINING DETAILS OF ODIS

In this section, we exhibit the network structure, hyper-parameter choices, and other training details of ODIS. As illustrated in Figure 6, the network structure of ODIS mainly contains four components: the action decoder, the state encoder, the observation encoder, and the mixing network. These four components apply the attention mechanism to process alterable state and observation spaces. The observation encoder and the action decoder take the observation as input, while the state encoder and the mixing network take the global state as input. As presented in the preliminaries, we decompose the observation information o_i of an agent i into several portions, including its own/environmental information o_i^own and other entities' information {o_{i,j}^other}. Each portion is fed into a separate fully connected layer to acquire an embedding. For the self-attention network, we generate Q, K, V with fully connected layers from the embedding sequence and then perform self-attention (Vaswani et al., 2017) with input and output dimensions of 64. The output sequence can be formalized as the representation of agent i's own information e_i^own and the representations {e_{i,j}^other} of other entities in agent i's view. To handle partial observability, we append a historical embedding h_i^{t-1} to the input sequence when applying self-attention and thus acquire the output h_i^t, following the UPDeT structure (Hu et al., 2021).
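The shared self-attention machinery can be sketched as plain scaled dot-product attention over the sequence of entity embeddings. The sketch below is illustrative, not the released implementation: the observation encoder keeps Q/K at dimension 64, while the state encoder reduces Q/K to dimension 8 (the `d_qk` parameter here), and random weights stand in for the learned projections:

```python
import numpy as np

def self_attention(embeddings, d_qk=8, d_v=64, seed=0):
    """Scaled dot-product self-attention over entity embeddings.
    Q and K are projected to d_qk dimensions (small for the state encoder,
    to cut computation over many entities); V keeps d_v dimensions."""
    rng = np.random.default_rng(seed)  # stand-in for learned FC weights
    n, d = embeddings.shape
    w_q = rng.standard_normal((d, d_qk)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_qk)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d_v)) / np.sqrt(d)
    q, k, v = embeddings @ w_q, embeddings @ w_k, embeddings @ w_v
    scores = q @ k.T / np.sqrt(d_qk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-softmax
    return weights @ v  # one output representation per input embedding

# e.g. own info + 3 entity embeddings + the historical embedding h_i^{t-1}
entities = np.random.default_rng(1).standard_normal((5, 64))
out = self_attention(entities)
```

Because attention is permutation-invariant over the entity axis, the same network handles tasks with different numbers of entities, which is what makes the architecture transferable across task sets.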
For the observation encoder, we only select the own-information representation e_i^own and feed it into the individual value network to calculate Q-values or into the coordination skill classifier. For the action decoder, we embed the chosen skill into a 64-dimensional vector and concatenate it with the output sequence. When predicting actions, we divide available actions into interactive and non-interactive actions, where an interactive action needs to interact with an entity and a non-interactive action is only relevant to the agent itself. We output the values of non-interactive actions from e_i^own and the values of interactive actions with entity j from the corresponding e_{i,j}^other, and concatenate them to select the action with the maximal value.

Figure 7: Average test win rates of ODIS, ICQ, and QPLEX in three maps of the marine-easy task set with medium datasets (top), expert datasets (middle), and medium-expert datasets (bottom). Here ODIS is trained with offline data from three source tasks, 3m, 5m, and 10m, as shown in Table 2, while ICQ and QPLEX are trained with corresponding single-task data. Note that 8m is an unseen task for ODIS, where ODIS can still acquire comparable performance to the two other offline MARL algorithms, which use offline data of 8m for training.

We also decompose the state information s into several portions, including the environment information s^env, ally information {s_i^ally}, and other entities' information {s_j^other}. As with the observation, we feed them into separate fully connected layers to acquire embeddings with a dimension of 64, then calculate Q, K, V and perform self-attention. The dimensions of Q and K are set to 8 to reduce computation, as the state may contain numerous entities, while V remains at a dimension of 64. The output sequence consists of the representation of each portion.
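Assembling the per-action Q-values from the two kinds of representations can be sketched as follows. This is an illustrative sketch under our own naming, with random weights standing in for the learned heads; the key point is that interactive actions borrow the representation of the entity they target, so the action space grows naturally with the number of entities:

```python
import numpy as np

rng = np.random.default_rng(1)
e_own = rng.standard_normal(64)          # agent's own representation e_i^own
e_others = rng.standard_normal((3, 64))  # representations of 3 visible entities

w_noninter = rng.standard_normal((64, 6))  # head for 6 non-interactive actions
w_inter = rng.standard_normal(64)          # shared head for interactive actions

def action_values(e_own, e_others, w_noninter, w_inter):
    """Non-interactive action values come from the agent's own representation;
    the value of interacting with entity j comes from that entity's
    representation. The two parts are concatenated for the final argmax."""
    q_non = e_own @ w_noninter   # shape (6,)
    q_int = e_others @ w_inter   # shape (3,), one value per entity
    return np.concatenate([q_non, q_int])

q = action_values(e_own, e_others, w_noninter, w_inter)
best = int(np.argmax(q))  # greedy action over the concatenated values
```

Because the interactive head is shared across entities, the same decoder works whether a map has 3 or 11 enemies.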
For the mixing network, we utilize all ally information to generate non-negative weights by taking the absolute value of MLP outputs. For the state encoder, we additionally append the action a_i to the corresponding ally information s_i^ally before performing self-attention and use the ally representations from the output sequence as the input of the coordination skill classifier. The coordination skill classifier contains a fully connected layer followed by a softmax function to compute the probability of each coordination skill. The individual value network also contains a fully connected layer to transform local representations into Q-values.

Besides the above network structure, ODIS needs two training phases to perform coordination skill discovery and coordination policy learning separately. We implement ODIS with the aforementioned PyMARL framework to ensure that the mechanics irrelevant to the algorithm are the same as in previous methods. Other specific hyper-parameters are listed in Table 6. All tabular results show the performance of ODIS with 50,000 optimization steps, where the steps of the coordination policy learning phase equal the total steps minus the coordination skill discovery steps. The training process of ODIS with an NVIDIA GeForce RTX 2080Ti GPU and a 32-core CPU typically costs 12-14 hours. Our released implementation of ODIS, along with the provided offline datasets, follows Apache License 2.0, the same as the PyMARL framework.

Table 7: Average test win rates of the final policies trained with different coordination skill numbers (abbreviated as "skill num.") in the task set marine-hard with medium data quality. The listed performance is averaged over five random seeds. We abbreviate asymmetric task names for simplicity. For example, the task name "5m6m" denotes the SMAC map "5m_vs_6m". The column name "rand. skill" stands for random coordination skill selection.
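The non-negative mixing described in this section can be sketched minimally. The sketch assumes a single linear head per ally in place of the full MLP (our simplification); taking the absolute value of its output guarantees monotonicity of the total value in each agent's Q-value, as in QMIX-style mixing:

```python
import numpy as np

def mix_q_values(agent_qs, ally_reprs, w_head):
    """Monotonic mixing: per-agent weights are absolute values of a head
    applied to each ally's state representation, so dQ_tot/dQ_i >= 0."""
    weights = np.abs(ally_reprs @ w_head)  # (n_agents,), non-negative
    return float(weights @ agent_qs)

rng = np.random.default_rng(2)
ally_reprs = rng.standard_normal((4, 64))  # one representation per ally
w_head = rng.standard_normal(64)           # stand-in for the learned MLP
q_tot_low = mix_q_values(np.zeros(4), ally_reprs, w_head)
q_tot_high = mix_q_values(np.ones(4), ally_reprs, w_head)
```

Monotonicity is what lets the argmax of each agent's individual Q-values coincide with the greedy joint action under centralized training with decentralized execution.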

D COMPARISONS WITH SINGLE-TASK OFFLINE MARL METHODS

Although ODIS aims at the multi-task offline MARL domain, we also compare ODIS with offline MARL methods trained on single tasks in the marine-easy task set, where ODIS utilizes three source tasks for training. We select two baselines, ICQ and QPLEX, for comparison. ICQ (Yang et al., 2021) is a conservative-style offline MARL algorithm that can handle the severe distribution shift issue in offline MARL. QPLEX (Wang et al., 2021a) claims that its IGM-complete value decomposition can benefit offline training. As the two baselines cannot leverage multi-task offline data, we feed the corresponding task dataset into these algorithms to train policies independently. We evaluate the performance in three tasks, including 5m, 10m, and 8m, where 8m is an unseen task for ODIS. To train the two baselines in task 8m, we also collect data with a QMIX policy as described above, and the properties of the data are exhibited in Table 5. The data in 8m is only used for the two baselines and remains unseen to ODIS. We conduct experiments with medium, expert, and medium-expert data qualities. As shown in Figure 7, ODIS outperforms the baselines in both the source tasks 5m and 10m and the unseen task 8m, indicating that by learning from multi-task data, ODIS can not only perform better than single-task offline MARL methods in most tasks but also present strong zero-shot generalization to unseen tasks. ICQ can generally acquire good performance with expert data, but its conservative paradigm may limit its performance. QPLEX struggles in these datasets without particular tuning, and we speculate that this is because our datasets cannot provide sufficiently diverse data for QPLEX to perform policy exploitation, leading to large extrapolation errors.
The empirical results show that discovering and sharing coordination skills can be efficient and powerful across different tasks.

E EXPERIMENTS WITH DIFFERENT COORDINATION SKILL NUMBERS

The size of the coordination skill set Z is a key hyper-parameter in ODIS, representing the number of actions that can be chosen by the coordination policy. We conduct experiments in the marine-hard task set with medium data quality to investigate whether decision-making in the coordination skill space works. We select a coordination skill number of 5 as the default setting for our main experiments. As shown in Table 7, coordination skill numbers of 3, 5, 8, and 16 exhibit comparable performance in most unseen tasks, indicating that ODIS does not require sophisticated tuning of the coordination skill number. We finally choose a coordination skill number of 5 in this task set because it obtains generally better performance. On the other hand, a coordination skill number of 1 cannot generalize to most unseen tasks, as its performance depends entirely on the reconstructive ability of the action decoder when only one coordination skill can be selected. We also evaluate ODIS with randomly chosen coordination skills when the total skill number is 5, where the coordination policy is not trained and decision-making relies only on the action decoder. ODIS with random skills exhibits average test win rates of 0 in all tasks, indicating that the action decoder depends on the discovered coordination skills to generate task-relevant actions, and the skills cannot simply be disentangled from the framework.

F RESULTS ON COOPERATIVE NAVIGATION TASKS

To further evaluate the effectiveness of ODIS, we design and conduct experiments in a task set from the cooperative navigation environment. Cooperative navigation is a series of tasks from the multi-agent particle environment (MPE) introduced with MADDPG (Lowe et al., 2017). In this environment, a number of agents try to reach corresponding landmarks and only acquire a reward of 1 when they all successfully reach the landmarks. A visualization of this task can be found in Figure 5(c). We add discrete control support to the original cooperative navigation tasks, where agents can execute actions of moving in four directions and a "none" operation. We design a task set in the cooperative navigation environment and collect different qualities of data using the QMIX algorithm. The detailed dataset properties are shown in Table 8. We compare ODIS with the aforementioned baselines, including BC-t, BC-r, UPDeT-m, and UPDeT-l, and evaluate on the expert and medium datasets, respectively. The results are exhibited in Table 9. ODIS outperforms the other baselines on both expert and medium data. As a reference, we also include the results of online QMIX in the table. We find that in two unseen tasks, CN-3 and CN-5, ODIS trained with expert data acquires better performance than a learn-from-scratch QMIX. A notable observation is that ODIS can generalize to CN-5, while online QMIX fails to learn a valid policy in this task.
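The sparse team reward and the discrete action set described above can be sketched as follows. The distance threshold `eps` and the direction encoding are our illustrative assumptions; the environment's actual success criterion may differ:

```python
import numpy as np

# discrete control added on top of MPE: four move directions plus a "none" op
ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0), 4: (0, 0)}

def team_reward(agent_pos, landmark_pos, eps=0.1):
    """Sparse cooperative reward: 1 only when every agent is within eps of
    its corresponding landmark; otherwise 0."""
    dists = np.linalg.norm(np.asarray(agent_pos) - np.asarray(landmark_pos),
                           axis=-1)
    return 1.0 if bool((dists < eps).all()) else 0.0

landmarks = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
```

The all-or-nothing reward makes the task hard for online exploration, which is consistent with the observation that online QMIX fails on CN-5 while ODIS generalizes to it from offline data.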

G EVALUATIONS OF ODIS AND MADT

MADT (Meng et al., 2021) is a recent approach that trains a multi-agent decision transformer with offline training and optional online tuning. However, MADT was not originally designed for simultaneous multi-task training. We compare ODIS with MADT in the marine-hard task set with expert and medium data qualities and exhibit the results in Table 10. We find that MADT generally cannot learn a valid policy in our multi-task settings, and we offer the following three explanations:

1. MADT utilizes feature encoding and action masking techniques to deal with different input shapes. All inputs are encoded into the same shape with zero padding, and the action space has to be aligned to the maximal action space with unavailable actions masked. These techniques may induce poor generalization when the input shape changes dramatically (e.g., the observation size of task 3m is 42 while for 10m_vs_11m it is 132).
2. The multi-task generalization setting proposed by MADT adopts a task set with similar input shapes (i.e., the observation size range is 25-48), which makes the above issue less significant.
3. Our implemented baselines (including ODIS, BC-t, and the proposed UPDeT variants) decompose the input with specially designed observation and state encoders to better handle the varying-shape issue. This means our baseline design is generally fair and comparable. The benefit of our encoding method can be seen from the comparison between MADT and BC-t.

To validate the observed phenomena, we additionally conduct a simple experiment where the input shapes of the task set are restricted to a small range. The results are shown in Table 11. MADT generalizes better in this small task set, while ODIS still outperforms MADT.

When constructing the offline datasets, we choose the number of trajectories mainly based on recent offline MARL works (Yang et al., 2021) and the single-agent offline RL benchmark D4RL (Fu et al., 2020).
As datasets in D4RL are transition-based, we estimate the corresponding trajectory numbers based on the average episode length in our evaluated benchmarks. We finally chose a trajectory number of 2,000 for each task to compose the multi-task offline datasets. We conduct additional experiments on the expert dataset of the marine-hard task set to evaluate ODIS with datasets of 1,000 trajectories. The results are exhibited in Table 12. We find that ODIS-1000 also presents good performance, indicating that ODIS is not very sensitive to the data size.

As stated in the experiments section, we conduct experiments in three designed task sets called marine-easy, marine-hard, and stalker-zealot, respectively. Having presented the empirical results for the marine-hard task set in Table 1, we here exhibit the results for the marine-easy and stalker-zealot task sets. The empirical results in the marine-easy task set are shown in Table 14, and the empirical results in the stalker-zealot task set are shown in Table 15. As a reference, we also exhibit the online QMIX performance on all tasks in Table 13. We find that ODIS generally outperforms the other baselines in most source and unseen tasks. A specific case is that the performance of ODIS is less promising on medium-replay data, because the medium-replay data of marine-easy and stalker-zealot has very low quality and thus hinders ODIS from discovering effective coordination skills. Despite that, ODIS still reaches performance similar to the other baselines, indicating the validity of the ODIS algorithm. In addition, ODIS achieves performance comparable to online QMIX in some unseen tasks with zero-shot generalization, showing the effectiveness of our proposed method.



Code available at https://github.com/LAMDA-RL/ODIS
PyMARL: https://github.com/oxwhirl/pymarl



(a) The ability of multi-task generalization, including zero-shot generalization in unseen tasks. We conduct experiments in specially designed task sets from the StarCraft multi-agent challenge (SMAC) (Samvelyan et al., 2019) using offline data with diverse qualities. (b) The semantics of discovered coordination skills. We analyze the coordination skill usage in several test episodes from different tasks to investigate how ODIS makes decisions with different coordination skills. (c) Effectiveness of the ODIS structure.

Figure 3: An illustration of the semantics of discovered coordination skills. We plot the percentages of used coordination skills in two maps from (a) the marine-easy task set and (b) the stalker-zealot task set. We annotate the semantics of each coordination skill and visualize several test frames from corresponding time steps in (c) the 5m and 12m maps and (d) the 2s3z and 3s4z maps. Agents in colored circles choose the coordination skill with the same color in the legends of (a) and (b).

Figure 4: (a) Average test win rates of ODIS when performing multi-task/single-task coordination skill discovery. (b) Average test win rates of ODIS with/without the consistent loss.

Figure 5: Illustrations of different kinds of tasks in the experiments section. (a) The marine battle task 5m_vs_6m of SMAC, where agents control 5 controllable marines to beat 6 built-in-AI marines. (b) The heterogeneous task 2s3z of SMAC, where agents control 2 stalkers and 3 zealots to beat the same number of built-in-AI units. (c) The cooperative navigation task with 3 agents from the multi-agent particle environment, where 3 agents need to reach corresponding landmarks.

Figure 6: The network structure of ODIS. In coordination skill discovery, ODIS leverages (a) a local action decoder and (b) a global state encoder to discover coordination skills from multi-task offline data. In coordination policy learning, ODIS performs MARL with the CTDE paradigm to train individual coordination policies with the assistance of (c) a mixing network. The individual coordination policy contains (d) an observation encoder to extract representations from local information.


Descriptions of tasks in marine-easy task set.

Descriptions of tasks in marine-hard task set.

Descriptions of tasks in stalker-zealot task set.

Properties of offline datasets with different qualities.


Properties of offline datasets with different qualities in cooperative navigation tasks.

Average test success rates in the cooperative navigation task set with different data qualities.

Average test win rates of ODIS and MADT in the marine-hard task set with expert and medium data qualities.

Average test win rates of ODIS and MADT in a small task set (including source tasks of 3m and 5m, and unseen tasks of 4m and 6m) with medium data qualities.

Average test win rates of ODIS in the marine-hard task set with expert data of 2000 trajectories (ODIS-2000) and 1000 trajectories (ODIS-1000).

Test win rates of online QMIX algorithms in all test tasks. We abbreviate asymmetric task names for simplicity. For example, the task name "5m6m" denotes the SMAC map "5m_vs_6m".

ADDITIONAL RESULTS ON MULTI-TASK OFFLINE LEARNING IN SMAC

Average test win rates in the marine-easy task set with different data qualities. Results of BC-best stand for the best test win rates between BC-t and BC-r (the same applies to Table 15).

Average test win rates in the stalker-zealot task set with different data qualities.

ACKNOWLEDGMENTS

This work is supported by National Key Research and Development Program of China (2020AAA0107200), the National Science Foundation of China (61921006, 62276126), and the Natural Science Foundation of Jiangsu (BK20221442). We thank Feng Chen, Xinyu Yang, and the anonymous reviewers for their support and helpful discussions on improving the paper.

