DISCOVERING GENERALIZABLE MULTI-AGENT COORDINATION SKILLS FROM MULTI-TASK OFFLINE DATA

Abstract

Cooperative multi-agent reinforcement learning (MARL) faces the challenge of adapting to multiple tasks with varying agents and targets. Previous multi-task MARL approaches require costly interactions to simultaneously learn or fine-tune policies in different tasks. However, the setting in which an agent must generalize to multiple tasks with only offline data from limited tasks better matches the needs of real-world applications. Since offline multi-task data contains a variety of behaviors, an effective data-driven approach is to extract informative latent variables that can represent universal skills for realizing coordination across tasks. In this paper, we propose a novel Offline MARL algorithm to Discover coordInation Skills (ODIS) from multi-task data. ODIS first extracts task-invariant coordination skills from offline multi-task data and learns to delineate different agent behaviors with the discovered coordination skills. We then train a coordination policy that selects optimal coordination skills under the centralized training with decentralized execution paradigm. We further demonstrate that the discovered coordination skills induce effective coordinative behaviors, thus significantly enhancing generalization to unseen tasks. Empirical results on cooperative MARL benchmarks, including the StarCraft Multi-Agent Challenge, show that ODIS achieves superior performance across a wide range of tasks using offline data from only limited sources.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) has drawn broad attention for addressing problems such as video games, sensor networks, and autonomous driving (Peng et al., 2017; Cao et al., 2013; Gronauer & Diepold, 2022; Yun et al., 2022; Xue et al., 2022b). Since recent MARL methods mainly focus on learning policies for a single task in simulated environments (Sunehag et al., 2018; Rashid et al., 2020; Lowe et al., 2017; Foerster et al., 2018), two obstacles arise when applying them to real-world problems. One is poor generalization when facing tasks with varying agents and targets, where the practical demand is to adapt to multiple tasks rather than learn every new task from scratch (Omidshafiei et al., 2017). The other is the potentially high cost and risk incurred by real-world interactions through an under-trained policy (Levine et al., 2020). Multi-agent systems are expected to perform flexibly across multiple general scenarios in which the agents and targets may differ. Multi-task MARL is one promising way to realize such flexibility and generalizability. Previous related works mainly focus on training simultaneously on a pre-defined task set (Omidshafiei et al., 2017; Iqbal et al., 2021) or fine-tuning a pre-trained policy on target tasks (Hu et al., 2021; Zhou et al., 2021; Qin et al., 2022a) in an online manner. Although these approaches exhibit promising performance on some tasks, the expensive cost of online interactions hinders their application to a broader range of tasks. Offline RL (Levine et al., 2020), which aims to learn policies from a static dataset, removes the need for interactions during training. However, most current offline RL methods conservatively regularize the learned policies towards the datasets (Wu et al., 2019; Kumar et al., 2019; Yang et al., 2021; Fujimoto et al., 2019). Albeit conservatism

