OFFLINE COMMUNICATION LEARNING WITH MULTI-SOURCE DATASETS

Abstract

Scalability and partial observability are two major challenges in multi-agent reinforcement learning (MARL). Recently, researchers have proposed offline MARL algorithms that improve scalability by reducing the cost of online exploration, while the problem of partial observability is often ignored in the offline setting. Communication is a promising approach to alleviate the miscoordination caused by partial observability; in this paper we therefore focus on offline communication learning, where agents learn to communicate from a fixed dataset. We find that learning communication end-to-end from an offline dataset that contains no communication information is intractable, since the space of correct communication protocols is sparse compared with the joint state-action space, which grows exponentially with the number of agents. Moreover, unlike offline policy learning, which can be guided directly by reward signals, offline communication learning is difficult because communication messages affect the reward only implicitly. In addition, real-world offline MARL datasets are often collected from multiple sources, which makes offline communication learning even more challenging. We therefore present a new benchmark containing a diverse set of challenging offline MARL communication tasks with single- and multi-source datasets, and propose a novel Multi-Head structure for Communication Imitation learning (MHCI) algorithm that automatically adapts to the distribution of the dataset. Empirical results show the effectiveness of our method on various tasks of the new offline communication learning benchmark.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning is essential for many real-world tasks where multiple agents must coordinate to achieve a joint goal. However, the problems of scalability and partial observability limit the effectiveness of online MARL algorithms. The large joint state-action space makes exploration costly, especially as the number of agents increases. Meanwhile, partial observability requires communication among agents in order to make better decisions. Many previous works in MARL seek solutions to these two challenges, with the hope of making cooperative MARL applicable to more complicated real-world tasks (Sukhbaatar et al. (2016); Singh et al. (2019); Yang et al. (2021)).

Recently, emerging research applies offline RL to cooperative multi-agent RL in order to avoid costly exploration across the joint state-action space, thereby improving scalability. Offline RL is defined as learning from a fixed dataset instead of online interactions. In the context of single-agent offline RL, the main challenge is the distributional shift issue: the learned policy reaches unseen state-action pairs whose values are not correctly estimated. By constraining the learned policy to stay close to the behavioral policy, offline RL has succeeded on diverse single-agent tasks such as locomotion (Fu et al. (2020)) and planning without expensive online exploration (Finn & Levine (2017)).

As the problem of scalability can be alleviated by utilizing offline datasets, the other challenge of MARL, partial observability, can be addressed by introducing communication during coordination. Communication allows agents to share information and work as a team. A wide range of multi-agent systems benefit from effective communication, including electronic games like
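To make the behavior-constraint idea concrete, the following is a minimal, hypothetical sketch of a behavior-regularized offline objective: the policy is rewarded for high estimated value but penalized for deviating from the actions in the dataset. The function name, the penalty form, and the weight `lam` are illustrative assumptions, not details from any specific algorithm discussed here.

```python
import numpy as np

def bc_regularized_loss(q_values, policy_actions, dataset_actions, lam=2.5):
    """Trade off maximizing estimated Q-values against staying close to
    the behavioral (dataset) actions. Illustrative sketch only."""
    # Scale the Q term by its magnitude so the behavior-cloning penalty
    # keeps a stable relative weight across tasks.
    alpha = lam / (np.abs(q_values).mean() + 1e-8)
    q_term = -alpha * q_values.mean()  # negated: we minimize the loss
    # Mean squared deviation from the dataset actions (the constraint).
    bc_term = np.mean((policy_actions - dataset_actions) ** 2)
    return q_term + bc_term

# When the policy reproduces the dataset actions, only the Q term remains.
loss_on_data = bc_regularized_loss(
    np.array([1.0, 3.0]), np.array([0.5, -0.5]), np.array([0.5, -0.5]))
# Deviating from the dataset actions increases the loss.
loss_off_data = bc_regularized_loss(
    np.array([1.0, 3.0]), np.array([2.5, -2.5]), np.array([0.5, -0.5]))
```

This kind of constraint uses the reward-bearing Q-values directly; as argued above, no analogous direct signal exists for communication messages, which is what makes offline communication learning harder than offline policy learning.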

