DISCOVERING GENERALIZABLE MULTI-AGENT COORDINATION SKILLS FROM MULTI-TASK OFFLINE DATA

Abstract

Cooperative multi-agent reinforcement learning (MARL) faces the challenge of adapting to multiple tasks with varying agents and targets. Previous multi-task MARL approaches require costly interactions to simultaneously learn or fine-tune policies across different tasks. However, the setting in which an agent must generalize to multiple tasks using only offline data from a limited set of tasks is better aligned with the needs of real-world applications. Since offline multi-task data contain a variety of behaviors, an effective data-driven approach is to extract informative latent variables that represent universal skills for realizing coordination across tasks. In this paper, we propose a novel Offline MARL algorithm to Discover coordInation Skills (ODIS) from multi-task data. ODIS first extracts task-invariant coordination skills from offline multi-task data and learns to delineate different agent behaviors with the discovered coordination skills. We then train a coordination policy to choose optimal coordination skills under the centralized training with decentralized execution paradigm. We further demonstrate that the discovered coordination skills can induce effective coordinative behaviors, thus significantly enhancing generalization to unseen tasks. Empirical results on cooperative MARL benchmarks, including the StarCraft multi-agent challenge, show that ODIS achieves superior performance across a wide range of tasks using only offline data from limited sources.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) has drawn broad attention for addressing problems such as video games, sensor networks, and autonomous driving (Peng et al., 2017; Cao et al., 2013; Gronauer & Diepold, 2022; Yun et al., 2022; Xue et al., 2022b). Since recent MARL methods mainly focus on learning policies for a single task in simulated environments (Sunehag et al., 2018; Rashid et al., 2020; Lowe et al., 2017; Foerster et al., 2018), two obstacles arise when applying them to real-world problems. One is poor generalization when facing tasks with varying agents and targets, where the practical demand is to adapt to multiple tasks rather than learn every new task from scratch (Omidshafiei et al., 2017). The other is the potentially high cost and risk of real-world interactions executed by a partially trained policy (Levine et al., 2020). Multi-agent systems are expected to perform flexibly across general scenarios in which the agents and targets may differ, and multi-task MARL is one promising way to realize such flexibility and generalizability. Previous related works mainly focus on training simultaneously on a pre-defined task set (Omidshafiei et al., 2017; Iqbal et al., 2021) or fine-tuning a pre-trained policy on target tasks (Hu et al., 2021; Zhou et al., 2021; Qin et al., 2022a) in an online manner. Although these approaches exhibit promising performance on some tasks, the expensive cost of online interaction hinders their application to a broader range of tasks. Offline RL (Levine et al., 2020), which aims to learn policies from a static dataset, removes the need for interaction during training. However, most current offline RL methods conservatively regularize the learned policies toward the datasets (Wu et al., 2019; Kumar et al., 2019; Yang et al., 2021; Fujimoto et al., 2019).
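As a toy illustration of such conservative regularization (this is not any cited method's implementation; the function `regularized_update`, the linear policy, and the squared behavior-cloning penalty are our own illustrative assumptions), consider a one-parameter policy that chases higher value while a penalty pulls its actions back toward the offline dataset:

```python
import numpy as np

def regularized_update(w, states, data_actions, q_grad, alpha=1.0, lr=0.1):
    """One gradient step on a linear deterministic policy a = w * s.

    The step ascends the (supplied) action-value gradient, while a
    behavior-cloning penalty 0.5 * (a - a_data)^2 pulls the policy's
    actions back toward the actions stored in the offline dataset.
    """
    actions = w * states
    # gradient of expected Q w.r.t. w (chain rule through a = w * s)
    improvement = np.mean(q_grad(states, actions) * states)
    # gradient of the behavior-cloning penalty w.r.t. w
    constraint = alpha * np.mean((actions - data_actions) * states)
    return w + lr * (improvement - constraint)
```

With a large penalty weight `alpha`, the policy converges near the dataset's behavior even when the value gradient keeps pointing away from it; with `alpha = 0`, the policy drifts arbitrarily far from the data. This trade-off is exactly the conservatism-versus-generalization tension discussed in the text.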
Although conservatism effectively mitigates the distribution-shift issue of offline learning, it restricts the learned policy to stay close to the behavior policy, leading to severe degradation of generalization when facing unseen data (Chen et al., 2021b). Therefore, leveraging multi-agent offline data to train a policy with adequate generalization ability across tasks is in demand. This paper finds that underlying generic skills can greatly improve a policy's generalization. Indeed, humans are good at summarizing skills from several tasks and reusing these skills in other similar tasks. Taking Figure 1 as an example, we learn a policy from StarCraft multi-agent challenge (Samvelyan et al., 2019) tasks 5m and 8m, in which we control five or eight marines, respectively, to beat the same number of enemy marines. We then aim to deploy the learned policy directly, without fine-tuning, on an unseen target task, 10m, which has ten marines on each side. One effective way to achieve this is to extract skills from the source tasks, such as focusing fire on the same enemy or moving back low-health units, and then apply these skills to the target task. Although these tasks have different state or action spaces, such skills are applicable to a broad range of tasks. We refer to these task-invariant skills as coordination skills, since they help realize coordination across different tasks. In other words, extracting coordination skills from known tasks facilitates generalization by reusing them in unseen tasks. Towards learning and reusing coordination skills in a data-driven way, we propose a novel Offline MARL algorithm to Discover coordInation Skills (ODIS), where agents only access a multi-task dataset to discover coordination skills and learn generalizable policies. ODIS first extracts task-invariant coordination skills that delineate agent behaviors from a coordinative perspective.
These shared coordination skills help agents perform high-level decision-making without considering task-specific action spaces. ODIS then learns a coordination policy that selects appropriate coordination skills to maximize the global return under the centralized training with decentralized execution (CTDE) paradigm (Oliehoek et al., 2008). Finally, we deploy the coordination policy directly on unseen tasks. Our proposed coordination skills differ from those in previous work on online multi-agent skill discovery (Yang et al., 2020; He et al., 2020), which uses hierarchically learned skills to improve exploration and data efficiency. Empirical results show that ODIS can learn to choose proper coordination skills and generalize to a wide range of tasks using only data from limited sources. To the best of our knowledge, this is the first attempt towards unseen-task generalization in offline MARL.
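To make the two-stage pipeline concrete, the following minimal sketch substitutes simple stand-ins for ODIS's learned components: k-means clustering plays the role of the skill encoder that maps observed behaviors to discrete coordination-skill codes, and a tabular return lookup plays the role of the coordination policy. All function names, the clustering choice, and the discretized observations are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def discover_skills(behaviors, n_skills, iters=20, seed=0):
    """Stage 1 stand-in: cluster behavior features into discrete skill codes
    (a k-means proxy for a learned, task-invariant skill encoder)."""
    rng = np.random.default_rng(seed)
    centroids = behaviors[rng.choice(len(behaviors), n_skills, replace=False)]
    for _ in range(iters):
        # assign each behavior to its nearest skill centroid
        codes = np.argmin(((behaviors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for k in range(n_skills):
            if np.any(codes == k):
                centroids[k] = behaviors[codes == k].mean(axis=0)
    return centroids, codes

def coordination_policy(obs_bins, codes, returns, n_obs, n_skills):
    """Stage 2 stand-in: per discretized observation, choose the skill code
    with the highest mean return observed in the offline data."""
    q = np.full((n_obs, n_skills), -np.inf)
    for o in range(n_obs):
        for k in range(n_skills):
            mask = (obs_bins == o) & (codes == k)
            if mask.any():
                q[o, k] = returns[mask].mean()
    return q.argmax(axis=1)  # chosen skill index per observation bin
```

Because the policy selects among abstract skill indices rather than task-specific actions, the same skill codes remain meaningful in a new task whose behaviors map onto the same clusters; the paper's learned analogue of this mapping is what enables deployment on unseen tasks without fine-tuning.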

2. RELATED WORK

Multi-task MARL. Multi-task RL and transfer learning in MARL can improve sample efficiency through knowledge reuse (da Silva & Costa, 2021). Knowledge reuse across multiple tasks can be impeded by varying populations and input dimensions, calling for policy networks with flexible structures such as graph neural networks (Agarwal et al., 2020) or self-attention mechanisms (Hu et al., 2021; Zhou et al., 2021). Recent works utilize policy representations or agent representations to realize multi-task adaptation (Grover et al., 2018). EPL (Long et al., 2020) introduces an evolutionary curriculum learning approach to scale up the number of agents. REFIL (Iqbal et al., 2021) adopts randomized entity-wise factorization for multi-task learning. UPDeT



Figure 1: An illustration of coordination skill discovery from multi-task offline data. Offline data from marine battle source tasks like 5m and 8m of the StarCraft multi-agent challenge contain generalizable coordination skills like focusing fire, moving back, etc. After discovering these coordination skills from source data, a coordination policy learns to appropriately choose coordination skills through an offline RL training process. When facing the unseen task 10m, the agents reuse the discovered coordination skills to achieve coordination and accomplish the task.

