OFFLINE COMMUNICATION LEARNING WITH MULTI-SOURCE DATASETS

Abstract

Scalability and partial observability are two major challenges in multi-agent reinforcement learning (MARL). Recently, researchers have proposed offline MARL algorithms that improve scalability by reducing the cost of online exploration, while the problem of partial observability is often ignored in the offline MARL setting. Communication is a promising approach to alleviate the miscoordination caused by partial observability, so in this paper we focus on offline communication learning, where agents learn from a fixed dataset. We find that learning communication in an end-to-end manner from a given offline dataset without communication information is intractable, since the space of correct communication protocols is sparse compared with the joint state-action space, which grows exponentially with the number of agents. Besides, unlike offline policy learning, which can be guided by reward signals, offline communication learning struggles because communication messages only implicitly impact the reward. Moreover, in real-world applications, offline MARL datasets are often collected from multiple sources, making offline MARL communication learning even more challenging. Therefore, we present a new benchmark containing a diverse set of challenging offline MARL communication tasks with single- and multi-source datasets, and propose a novel Multi-Head structure for Communication Imitation learning (MHCI) algorithm that automatically adapts to the distribution of the dataset. Empirical results show the effectiveness of our method on various tasks of the new offline communication learning benchmark.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning is essential for many real-world tasks where multiple agents must coordinate to achieve a joint goal. However, the problems of scalability and partial observability limit the effectiveness of online MARL algorithms. The large joint state-action space makes exploration costly, especially as the number of agents increases. On the other hand, partial observability requires communication among agents to make better decisions. Plenty of previous works in MARL seek solutions to these two challenges, with the hope of making cooperative MARL applicable to more complicated real-world tasks (Sukhbaatar et al. (2016); Singh et al. (2019); Yang et al. (2021)). Recently, emerging research applies offline RL to cooperative multi-agent RL in order to avoid costly exploration across the joint state-action space, thereby improving scalability. Offline RL is defined as learning from a fixed dataset instead of online interactions. In the context of single-agent offline RL, the main challenge is the distributional shift issue: the learned policy reaches unseen state-action pairs whose values are not correctly estimated. By constraining the learned policy to stay close to the behavioral policy, offline RL has gained success on diverse single-agent offline tasks like locomotion (Fu et al. (2020)) and planning (Finn & Levine (2017)) without expensive online exploration. As the problem of scalability can be promisingly alleviated by utilizing offline datasets, the other challenge of MARL, i.e., partial observability, can be addressed by introducing communication during coordination. Communication allows agents to share information and work as a team. A wide range of multi-agent systems benefit from effective communication, including electronic games like StarCraft II (Rashid et al. (2018)) and soccer (Huang et al. (2021)), as well as real-world applications, e.g., autonomous driving (Shalev-Shwartz et al. (2016)) and traffic control (Das et al. (2019)). Although several offline MARL algorithms (Yang et al. (2021); Pan et al. (2022)) have recently been proposed to tackle the scalability challenge, how to deal with partial observability in the offline MARL setting has not received much attention.

Unfortunately, simply adopting a communication mechanism in offline MARL, i.e., learning communication in an end-to-end manner from offline datasets, is still problematic. Finding effective communication protocols without any guidance can become the bottleneck, especially as the task scale increases, and learning may converge to sub-optimal communication protocols that harm downstream policy learning. To handle this problem, in this paper we investigate the new area of offline communication learning, where multiple agents learn communication protocols from a static offline dataset containing extra communication information. We call this kind of dataset a "communication-based dataset" to distinguish it from a single-agent offline dataset. In real-world applications, communication-based datasets may be collected under a variety of existing communication protocols, such as handcrafted protocols designed by experts or hidden protocols learned by other agents. Therefore, communication-based datasets can be leveraged in offline MARL to boost performance on downstream tasks. Previous offline RL works focus on eliminating the problem of distributional shift, while offline MARL communication learning faces different challenges. Unlike policy learning, which is directly guided by reward signals since actions influence the expected return, communication learning is hard to evaluate, since communication affects the reward only implicitly, through its influence on other agents. What's worse, real-world offline datasets are likely to be multi-source, so trajectories may be sampled under different communication protocols as well as different policies. The multi-source property introduces an extra challenge, as we cannot simply imitate the communication in the dataset: offline communication learning algorithms need to identify the source of each trajectory before learning from it.

In this paper, we propose Multi-Head Communication Imitation (MHCI), which accomplishes multi-source classification and message imitation at the same time. To the best of our knowledge, MHCI is the first method to learn a composite communication protocol from a multi-source communication-based dataset. We also provide a theoretical explanation of its optimality under the dataset. To better evaluate the effectiveness of our algorithm, as well as for further study, we propose an offline communication learning benchmark, including environments from previous works and additional environments that require sophisticated communication. The empirical results show that MHCI successfully combines and refines the information in the communication-based dataset, and thus outperforms baselines on diverse challenging tasks of the offline communication learning benchmark. Our main contributions are two-fold: 1) we analyze the new challenges in offline communication learning, and introduce a benchmark of offline communication learning that contains diverse tasks; 2) we propose an effective algorithm, MHCI, which addresses the problem of learning from single-source or multi-source datasets, and our method shows superior performance in various environments of our benchmark.
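As a rough illustration of the multi-head idea, the sketch below alternates source classification and message imitation: each trajectory is assigned to the head that imitates its messages best, and each head is refit only on its assigned trajectories. This is a minimal numpy sketch under strong simplifying assumptions, not the paper's actual implementation; the `MultiHeadImitator` class, the linear message heads, and the hard-EM loop are all illustrative.

```python
import numpy as np

class MultiHeadImitator:
    """Hypothetical, simplified multi-head communication imitation:
    K linear heads map observations to messages; trajectories are
    clustered by which head imitates them best (hard EM)."""

    def __init__(self, obs_dim, msg_dim, n_heads, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per head.
        self.W = [rng.normal(scale=0.1, size=(obs_dim, msg_dim))
                  for _ in range(n_heads)]

    def head_loss(self, k, obs, msg):
        # Mean squared imitation error of head k on one trajectory.
        return float(np.mean((obs @ self.W[k] - msg) ** 2))

    def assign(self, obs, msg):
        # Source classification: the head with the lowest imitation loss.
        return min(range(len(self.W)), key=lambda k: self.head_loss(k, obs, msg))

    def fit(self, trajectories, n_iters=5):
        # Seed head k with the k-th trajectory, then alternate trajectory
        # assignment and per-head least-squares refits.
        for k in range(len(self.W)):
            obs, msg = trajectories[k % len(trajectories)]
            self.W[k], *_ = np.linalg.lstsq(obs, msg, rcond=None)
        for _ in range(n_iters):
            groups = {k: [] for k in range(len(self.W))}
            for obs, msg in trajectories:
                groups[self.assign(obs, msg)].append((obs, msg))
            for k, data in groups.items():
                if data:
                    X = np.vstack([o for o, _ in data])
                    Y = np.vstack([m for _, m in data])
                    self.W[k], *_ = np.linalg.lstsq(X, Y, rcond=None)

# Toy multi-source dataset: two made-up linear protocols A and B,
# interleaved so the dataset mixes the two sources.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 2))
B = -rng.normal(size=(4, 2))
trajs = []
for _ in range(5):
    for W_true in (A, B):
        obs = rng.normal(size=(20, 4))
        trajs.append((obs, obs @ W_true))

model = MultiHeadImitator(obs_dim=4, msg_dim=2, n_heads=2)
model.fit(trajs)
labels = [model.assign(o, m) for o, m in trajs]
```

On this exactly-linear synthetic data the alternation recovers the two sources: all A-trajectories share one label and all B-trajectories the other. Real communication-based datasets would of course require richer heads than a linear map.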

2. RELATED WORK

Multi-agent reinforcement learning has attracted great attention in recent years (Tampuu et al. (2017); Matignon et al. (2012); Mordatch & Abbeel (2017); Wen et al. (2019)). In MARL, the framework of centralized training with decentralized execution has been widely adopted (Kraemer & Banerjee (2016); Lowe et al. (2017)). For cooperative scenarios under this framework, COMA (Foerster et al. (2018)) assigns credit to different agents based on a centralized critic and counterfactual advantage functions, while another series of works, including VDN (Sunehag et al. (2018)), QMIX (Rashid et al. (2018)) and QTRAN (Son et al. (2019)), achieves this by applying value-function factorization. These MARL algorithms show remarkable empirical results on the popular StarCraft unit micromanagement benchmark (SMAC) (Samvelyan et al. (2019)). CommNet (Sukhbaatar et al. (2016)), RIAL and DIAL (Foerster et al. (2016)) are seminal works that allow agents to learn how to communicate with each other in MARL. CommNet and DIAL design the communication structure in a differentiable way to enable end-to-end training, and RIAL trains

