CHEAP TALK DISCOVERY AND UTILIZATION IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

By enabling agents to communicate, recent cooperative multi-agent reinforcement learning (MARL) methods have demonstrated better task performance and more coordinated behavior. Most existing approaches facilitate inter-agent communication by allowing agents to send messages to each other through free communication channels, i.e., cheap talk channels. Current methods require these channels to be constantly accessible and known to the agents a priori. In this work, we lift these requirements such that the agents must discover the cheap talk channels and learn how to use them. Hence, the problem has two main parts: cheap talk discovery (CTD) and cheap talk utilization (CTU). We introduce a novel conceptual framework for both parts and develop a new algorithm based on mutual information maximization that outperforms existing algorithms in CTD/CTU settings. We also release a novel benchmark suite to stimulate future research in CTD/CTU.

1. INTRODUCTION

Effective communication is essential for many multi-agent systems in the partially observable setting, which is common in real-world applications such as elevator control (Crites & Barto, 1998) and sensor networks (Fox et al., 2000). Communicating the right information at the right time is crucial to completing tasks effectively. In the multi-agent reinforcement learning (MARL) setting, communication often occurs on free channels known as cheap talk channels. The agents' goal is to learn an effective communication protocol over the channel. The transmitted messages can be either discrete or continuous (Foerster et al., 2016). Existing work often assumes that the agents have prior knowledge about these channels (e.g., channel capacities and noise levels). However, such assumptions do not always hold. Even if the existence of these channels can be assumed, they might not be persistent, i.e., available at every state. Consider the real-world application of inter-satellite laser communication. In this case, the communication channel is only functional when the satellites are within line of sight, so positioning becomes essential (Lakshmi et al., 2008). Thus, without these assumptions, agents in realistic MARL settings need the capability to discover where best to communicate before learning a protocol. In this work, we investigate the setting where these assumptions on cheap talk channels are lifted. Precisely, these channels are only effective in a subset of the state space. Hence, agents must discover where these channels are before they can learn how to use them. We divide this problem into two sequential steps: cheap talk discovery (CTD) and cheap talk utilization (CTU). The problem is a strict generalization of the common setting used in the emergent communication literature, with fewer assumptions, and is thus more akin to real-world scenarios (see Appendix A for a more in-depth discussion of the setting's significance and use cases).
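To make the notion of a non-persistent channel concrete, the following is a minimal sketch and not the formal definition used in this work. It assumes grid positions as states and assumes that a message is delivered only when both agents occupy designated communicative states; the names `cheap_talk_channel`, `sender_comm_states` and `receiver_comm_states` are hypothetical and introduced only for illustration.

```python
from typing import Optional, Set, Tuple

# Hypothetical state type: a grid position (row, col).
State = Tuple[int, int]


def cheap_talk_channel(
    sender_state: State,
    receiver_state: State,
    message: int,
    sender_comm_states: Set[State],
    receiver_comm_states: Set[State],
) -> Optional[int]:
    """Deliver `message` only inside the communicative subset of the state space.

    Assumption (for illustration): the channel is functional only when BOTH
    agents are in a communicative state. Otherwise the receiver observes
    nothing, represented here by None.
    """
    if sender_state in sender_comm_states and receiver_state in receiver_comm_states:
        return message
    return None


# Illustrative layout: one communicative state per agent.
sender_booths = {(0, 0)}
receiver_booths = {(1, 1)}

delivered = cheap_talk_channel((0, 0), (1, 1), 1, sender_booths, receiver_booths)
dropped = cheap_talk_channel((2, 2), (1, 1), 1, sender_booths, receiver_booths)
```

Under these assumptions, the common persistent setting corresponds to the special case in which every state is communicative for both agents.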
This setting is particularly difficult as it suffers from the temporal credit assignment problem (Sutton, 1984) for communicative actions. Consider an example we call the phone booth maze (PBMaze). The environment has a sender and a receiver, placed in two separate rooms. The receiver's goal is to escape through the correct one of two possible exits. Only the sender knows which exit is correct. The sender's goal is to communicate this information using functional phone booths. This leads to two learning stages. First, both agents need to learn to reach the booths. Second, the sender has to learn to form a protocol that distinguishes the different exit information, while the receiver has to learn to interpret the sender's protocol by trying different exits. This makes credit assignment particularly difficult as communicative actions do not lead to immediate rewards. Additionally, having communicative actions that are only effective in a small subset of the state space makes this a challenging joint exploration problem, especially when communication is necessary for task completion. Figure 1 provides a visual depiction of the two learning stages in this environment. As a whole, our contributions are four-fold. Firstly, we provide a formulation of the CTD and CTU problem. Secondly, we introduce a configurable environment to benchmark MARL algorithms on the problem. Thirdly, we propose a method to solve the CTD and CTU problems based on information theory and advances in MARL, including off-belief learning (Hu et al., 2021, OBL) and differentiable inter-agent learning (Foerster et al., 2016, DIAL). Finally, we show that our proposed approach empirically compares favourably to other MARL baselines, validate the importance of specific components via ablation studies, and illustrate how our method can act as a measure of channel capacity to learn where best to communicate.
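The intuition that an information-theoretic quantity can indicate where communication is possible can be illustrated with a toy calculation. The sketch below is not the proposed method; it is a plug-in estimate of the mutual information between a sender's binary message and the receiver's observation in a simplified PBMaze-like setting. It assumes a noiseless channel at the booth and a constant receiver observation elsewhere; the functions `empirical_mi` and `sample_pairs` are hypothetical helpers.

```python
import math
import random
from collections import Counter


def empirical_mi(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) p(y)) written with integer counts.
        ratio = count * n / (px[x] * py[y])
        mi += (count / n) * math.log2(ratio)
    return mi


def sample_pairs(at_booth, n=2000, seed=0):
    """Sample (message, receiver observation) pairs.

    Assumption: at the booth the receiver observes the message exactly;
    elsewhere the receiver always observes a constant 0.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        message = rng.randint(0, 1)
        observation = message if at_booth else 0
        pairs.append((message, observation))
    return pairs


mi_booth = empirical_mi(sample_pairs(at_booth=True))
mi_elsewhere = empirical_mi(sample_pairs(at_booth=False))
print(f"MI at booth: {mi_booth:.3f} bits, elsewhere: {mi_elsewhere:.3f} bits")
```

In this toy setting the estimate is close to one bit at the booth and zero elsewhere, which is the kind of contrast a discovery signal could exploit. A realistic environment would involve noisy channels and learned policies, so estimates there would be less clean.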

2. RELATED WORK

The use of mutual information (MI) has been explored in the MARL setting. Wang et al. (2019) propose a shaping reward based on the MI between agents' transitions to improve exploration, encouraging agents to visit critical points where one can influence other agents. Our proposed method also has an MI term for reward shaping. Their measure might behave similarly to ours but is harder to compute and requires full environmental states during training. Sokota et al. (2022) propose a method to discover implicit communication protocols using environment actions via minimum entropy coupling, separating communicative and non-communicative decision-making. We propose a similar problem decomposition by separating the state and action spaces into two subsets based on whether communication can occur or not. Unlike Sokota et al. (2022), we focus on explicit communication



Figure 1: The two learning stages for CTD/CTU based on PBMaze. Stage (a): Discover the functional phone booths; Stage (b): Form a protocol to use the phone booth and learn to interpret the messages (left), and solve the task (right). The blue and red agents are the sender and the receiver, respectively.

