LEARNING TO COMMUNICATE USING CONTRASTIVE LEARNING

Abstract

Communication is a powerful tool for coordination in multi-agent RL. Inducing an effective, common language has been a difficult challenge, particularly in the decentralized setting. In this work, we introduce an alternative perspective where communicative messages sent between agents are considered as different incomplete views of the environment state. Based on this perspective, we propose to learn to communicate using contrastive learning by maximizing the mutual information between messages of a given trajectory. In communication-essential environments, our method outperforms previous work in both performance and learning speed. Using qualitative metrics and representation probing, we show that our method induces more symmetric communication and captures task-relevant information from the environment. Finally, we demonstrate promising results on zero-shot communication, a first for MARL. Overall, we show the power of contrastive learning, and self-supervised learning in general, as a method for learning to communicate.

1. INTRODUCTION

Figure 1: Multi-view contrastive learning and CACL, contrastive learning for multi-agent communication. In multi-view learning, augmentations of the original image, or "views", are positive samples used to contrastively learn features. In CACL, different agents' views of the same environment state are considered positive samples, and messages are contrastively learned as encodings of the state.

Communication between agents is a key capability necessary for effective coordination in partially observable environments. In multi-agent reinforcement learning (MARL) (Sutton & Barto, 2018), agents can use their actions to transmit information (Grupen et al., 2020), but continuous or discrete messages on a communication channel (Foerster et al., 2016), also known as linguistic communication (Lazaridou & Baroni, 2020), are more flexible and powerful. To successfully communicate, a speaker and a listener must share a common language with a shared understanding of the symbols being used (Skyrms, 2010; Dafoe et al., 2020). Learning a common protocol, or emergent communication (Wagner et al., 2003; Lazaridou & Baroni, 2020), is a thriving research direction, but many works focus on simple, single-turn, sender-receiver games (Lazaridou et al., 2018; Chaabouni et al., 2019). In more visually and structurally complex MARL environments (Samvelyan et al., 2019), existing approaches often rely on centralized learning mechanisms, sharing models (Lowe et al., 2017) or gradients (Sukhbaatar et al., 2016). However, a centralized controller is impractical in many real-world environments (Mai et al., 2021; Jung et al., 2021), and centralized training with decentralized execution (CTDE) (Lowe et al., 2017) may not perform better than purely decentralized training (Lyu et al., 2021). Furthermore, the decentralized setting is more flexible and requires fewer assumptions about other agents, making it more realistic in many real-world scenarios (Li et al., 2020).
The decentralized setting also scales better, as a centralized controller suffers from the curse of dimensionality: as the number of agents it must control increases, the amount of inter-agent communication to process grows exponentially (Jin et al., 2021). Hence, this work explores learning to communicate in order to coordinate agents in the decentralized setting. In MARL, this means each agent has its own model to decide how to act and communicate, and no agents share parameters or gradients. Standard RL approaches to decentralized communication are known to perform poorly even in simple tasks (Foerster et al., 2016). The main challenges are the large space of possible protocols to explore, the high variance of RL, and the lack of a common grounding on which to base communication (Lin et al., 2021). Earlier work leveraged how communication influences other agents (Jaques et al., 2018; Eccles et al., 2019) to learn the protocol. Most recently, Lin et al. (2021) proposed agents that autoencode their observations and simply use the encodings as communication, using the shared environment as the common grounding.

We propose to use the shared environment, and the knowledge that all agents are communicating, to ground a protocol. If, like Lin et al. (2021), we consider our agents' messages to be encodings of their observations, then agents in similar states should produce similar messages. This perspective leads to a simple method based on contrastive learning to ground communication. Inspired by the representation learning literature that uses different "views" of a data sample (Bachman et al., 2019), we propose that, within a given trajectory, an agent's observation is a "view" of the underlying environment state. Different agents' messages are therefore encodings of different "views" of the same underlying state. From this perspective, messages within a trajectory should be more similar to each other than to messages from another trajectory.
We show our perspective visually in Figure 1. We propose that each agent use contrastive learning between sent and received messages to learn to communicate, a method we term Communication Alignment Contrastive Learning (CACL). We experimentally validate our method in three communication-essential environments and empirically show that it leads to improved performance and learning speed, outperforming state-of-the-art decentralized MARL communication algorithms. To understand CACL's success, we propose a suite of qualitative and quantitative metrics. We demonstrate that CACL leads to more symmetric communication, allowing agents to be more mutually intelligible. By treating messages as representations, we show that CACL's messages capture task-relevant semantic information about the environment better than baselines. Finally, we examine zero-shot cooperation with partners unseen at training time, a first for MARL communication. Despite the difficulty of the task, we demonstrate the first promising results in this direction. Overall, we argue that self-supervised learning is a powerful direction for multi-agent communication.
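The idea of treating messages from the same trajectory as positive pairs can be made concrete with a contrastive objective. The following is a minimal sketch of an InfoNCE-style loss over message vectors; the function name, the batch layout (row i of each array is a message pair from the same trajectory, all other rows serving as negatives), and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def contrastive_message_loss(anchor_msgs, positive_msgs, temperature=0.1):
    """InfoNCE-style contrastive loss over message vectors.

    anchor_msgs, positive_msgs: (batch, dim) arrays. Row i of each array
    is assumed to be a pair of messages from the same trajectory (a
    positive pair); all other rows in the batch act as negatives.
    Returns a scalar loss that is minimized when paired messages are
    more similar to each other than to messages from other trajectories.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = anchor_msgs / np.linalg.norm(anchor_msgs, axis=1, keepdims=True)
    p = positive_msgs / np.linalg.norm(positive_msgs, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; entry (i, j) compares anchor i
    # with candidate j. Positives lie on the diagonal.
    logits = (a @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Cross-entropy with the diagonal as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pushes messages from the same trajectory together and messages from different trajectories apart, which is one way to realize the mutual-information maximization described above.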

2. RELATED WORK

Learning to coordinate multiple RL agents is a challenging and unsolved task where naively applying single-agent RL algorithms often fails (Foerster et al., 2016). Recent approaches focus on agents parameterized by neural networks (Goodfellow et al., 2016), augmented with a message channel so that they can develop a common communication protocol (Lazaridou & Baroni, 2020). To address non-stationarity, some work focuses on centralized learning approaches that globally share models (Foerster et al., 2016), training procedures (Lowe et al., 2017), or gradients (Sukhbaatar et al., 2016) among agents. This simplifies optimization but can still be sub-optimal (Foerster et al., 2016; Lin et al., 2021). It also violates independence assumptions, effectively modelling the multi-agent scenario as a single agent (Eccles et al., 2019). This work focuses on independent, decentralized agents and non-differentiable communication. In previous work, Jaques et al. (2018) propose a loss to influence other agents, but this requires explicit and complex models of other agents, and their experiments focus on mixed cooperative-competitive

