LEARNING TO COOPERATE AND COMMUNICATE OVER IMPERFECT CHANNELS

Abstract

Information exchange in multi-agent systems improves cooperation among agents, especially in partially observable settings. Learning such an exchange can be framed as a problem in which the agents simultaneously learn how to communicate and how to solve a shared task. In the real world, communication is often carried out over imperfect channels, which requires the agents to deal with uncertainty due to potential information loss. In this paper, we consider a cooperative multi-agent system where the agents act and exchange information in a decentralized manner over a limited and unreliable channel. To cope with such channel constraints, we propose a novel communication approach based on independent Q-learning. Our method allows agents to dynamically adapt how much information to share by sending messages of different sizes, depending on their local observations and the channel properties. In addition to this message size selection, agents learn to encode and decode messages to improve their policies. We show that our approach outperforms approaches without adaptive capabilities and discuss its limitations in different environments.

1. INTRODUCTION

In multi-agent systems, cooperation and communication are closely related. Whenever a task requires agents with partial views to cooperate, exchanging information about one's view and intent can reduce uncertainty and allows for more well-founded decisions. Communication allows agents to solve tasks more efficiently, and can even be necessary to achieve acceptable results (Singh et al., 2019). As an example, consider a safety-critical autonomous driving scenario (Li et al., 2021). By letting the cars exchange sensor data or abstract details about detected objects in the scene, occluded objects can be considered in the planning processes of all cars, reducing the risk of collisions. Multi-agent reinforcement learning (MARL) comprises learning methods for problems where multiple agents interact with a shared environment (Buşoniu et al., 2010; Hernandez-Leal et al., 2019). The goal is to find an optimal policy for the agents that maximizes the outcome of their actions with respect to the environment's reward signal. Key challenges in MARL include non-stationarity (Papoudakis et al., 2019), the credit assignment problem (Zhou et al., 2020) and partial observability (Oroojlooyjadid & Hajinezhad, 2019). We focus on cooperative environments with partial observability. As communication is essential in cooperative environments, many works include a predefined information exchange between agents (Melo et al., 2011; Schneider et al., 2021). Additionally, there is ongoing research on including learnable communication in MARL approaches. Pioneering work provided the first empirical evidence that communication between agents can be learned with deep MARL (Foerster et al., 2016; Lowe et al., 2017; Sukhbaatar et al., 2016). This improves performance in existing environments and makes it possible to address a new class of problems that require communication between agents.
Building upon these ideas, many researchers proposed methods to improve the performance and stability of these approaches (Gupta et al., 2020; Jiang & Lu, 2018; Li et al., 2021). While related work investigates the effects of using different fixed message sizes (Li et al., 2022) and multiple communication rounds (Das et al., 2019), selectively sending messages (Singh et al., 2019), and sending messages only to other agents in proximity (Jiang & Lu, 2018), most of these approaches are designed for communication channels without capacity limitations or message losses. Recent approaches started to investigate such settings, e.g. by learning central controllers for coordinated access to a communication channel (Kim et al., 2019). To the best of our knowledge, there are no studies on message size adaptation to improve multi-agent communication over imperfect channels. In our work, we address this gap by investigating a cooperative MARL setting in which agents communicate over an unreliable and limited channel. We focus on agents learning when, what, and how much to communicate over imperfect channels in a decentralized manner. The key challenge here is to determine how to utilize the limited capacity efficiently and cope with the lack of reliability, in order to maximize the benefit for the cooperative multi-agent problem. With this paper, we provide a novel approach to address this challenge. Our contributions are as follows: (i) we propose a novel communication approach that allows for an adaptive message size selection while learning the message encoders and decoders, (ii) we introduce discrete communication trained with the pseudo-gradient method, (iii) we analyze the effect of different message types and message sizes, (iv) we introduce POMNIST, a fast MNIST-based benchmark environment for communication, (v) we show that agents adapt to the given communication channels in POMNIST and show limitations of our approach in the traffic junction environment.
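To make contribution (i) concrete, the sketch below shows one way an independent Q-learner could treat the message size as an additional decision alongside its environment action, using a Q-network output split into two heads. All names, the head layout, and the candidate sizes are illustrative assumptions for this sketch, not details taken from the method described above.

```python
import numpy as np

# Illustrative only: one agent's Q-output is split into an action head and a
# message-size head, where size 0 means "stay silent".
ENV_ACTIONS = 4
MSG_SIZES = [0, 2, 8]  # hypothetical candidate message sizes in symbols

rng = np.random.default_rng(0)
# Stand-in for a Q-network forward pass on the agent's local observation.
q_values = rng.normal(size=ENV_ACTIONS + len(MSG_SIZES))

def select(q, eps=0.1):
    """Epsilon-greedy selection over both heads of the Q-output."""
    q_act, q_msg = q[:ENV_ACTIONS], q[ENV_ACTIONS:]
    if rng.random() < eps:  # explore: pick a random action and message size
        return int(rng.integers(ENV_ACTIONS)), MSG_SIZES[int(rng.integers(len(MSG_SIZES)))]
    return int(np.argmax(q_act)), MSG_SIZES[int(np.argmax(q_msg))]

action, msg_size = select(q_values, eps=0.0)  # greedy choice for both heads
```

In such a scheme the agent can learn, from the reward signal alone, when a larger message is worth its share of the limited channel capacity and when staying silent is preferable.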

2. RELATED WORK

Agents in MARL can exchange information a) with implicit communication, and b) with explicit messages that are forwarded between the agents. Implicit communication refers to the exchange of information without separate communication actions, e.g. through the agents' regular actions and observations (Foerster et al., 2019) or with a joint policy (Berner et al., 2019). Within the scope of this paper, we focus on explicit communication. This can further be divided into i) continuous communication with real-valued messages and ii) discrete communication with a finite set of messages. In the context of deep learning, exchanging continuous messages allows for backpropagation across different agents (Sukhbaatar et al., 2016). This results in significant performance improvements in partially observable environments, where agents can benefit from coordination or the exchange of local information. Recent approaches include restricting communication to agent groups (Jiang & Lu, 2018) and specific topologies (Du et al., 2021), deciding when to send messages (Liu et al., 2020; Singh et al., 2019) and estimating the importance of messages with attention (Das et al., 2019; Li et al., 2021; Rangwala & Williams, 2020). Discrete communication with finite message sets allows for more fine-grained control of the used data rate in limited communication scenarios and is the focus of this paper. In order to facilitate backpropagation for discrete communication, Foerster et al. (2016) regularize continuous messages with noise during training and discretize them during evaluation. In their experiments, this yields better results than extending the action space with communication actions. Differentiability can also be retained by sampling messages from a Gumbel-Softmax distribution (Jang et al., 2017) instead of a categorical distribution (Gupta et al., 2020; Lowe et al., 2017). Li et al. (2022) aim to compensate for the message discretization with skip connections.
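The Gumbel-Softmax trick mentioned above can be sketched in a few lines: perturbing logits with Gumbel noise and applying a temperature-scaled softmax yields a differentiable, approximately one-hot sample, which can be hardened to a discrete message at evaluation time. This is a minimal NumPy illustration of the sampling step only (no gradients), with arbitrary example logits.

```python
import numpy as np

rng = np.random.default_rng(42)

def gumbel_softmax(logits, tau=1.0):
    """Draw a relaxed (soft) one-hot sample from a categorical distribution.
    During training the soft sample keeps gradients flowing through `logits`;
    at evaluation time one would take the argmax for a hard, discrete message."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau                                # temperature-scaled perturbed logits
    e = np.exp(y - y.max())                               # numerically stable softmax
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])  # unnormalized scores for 3 message symbols
soft_msg = gumbel_softmax(logits, tau=0.5)
hard_msg = np.eye(len(logits))[np.argmax(soft_msg)]  # discretization for evaluation
```

Lower temperatures `tau` make the soft sample closer to one-hot at the cost of higher-variance gradients, which is why the temperature is typically annealed during training.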
Their results also suggest that the message size has a negligible effect on continuous communication and a significant effect on discrete communication. Our work combines learnable communication with deep Q-learning to adaptively select the message size based on the observations given at each step. We consider an uncoordinated channel of limited capacity and demonstrate how agents can benefit from message size selection in such settings. Related works consider limited communication via a centrally controlled channel of limited capacity (Kim et al., 2019), by pruning messages (Mao et al., 2020) or with regularizations based on the length of messages (Freed et al., 2020). Hu et al. (2022) show empirically that controlling whether to send messages improves communication in a slotted p-CSMA channel. It remains unclear whether and how agents can choose from different message sizes to improve their communication efficiency over unreliable channels of limited capacity. We address this research gap with our work. Although not the focus of this work, efficient use of imperfect channels can also be improved by coordinating the agents' access to the channel. For example, Kim et al. (2019) and Wang et al. (2020) consider this by learning a centralized scheduler for multi-agent communication. The classical communication literature comprises a multitude of sophisticated mechanisms for medium access control (Huang et al., 2013; Kumar et al., 2018). We note that such schemes can be used in conjunction with our adaptive message size selection and leave further exploration of this combination to future work.
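To fix intuition for the kind of channel discussed above, the following is a deliberately simplified toy model of an uncoordinated, imperfect channel: each message may be lost at random (unreliability), and messages that would exceed the per-step symbol budget are dropped (limited capacity). The drop-on-overflow policy, function names, and parameters are assumptions made for this sketch, not the channel model of any cited work.

```python
import numpy as np

rng = np.random.default_rng(7)

def transmit(messages, capacity=8, loss_prob=0.2):
    """Toy imperfect channel: `messages` is a list of symbol lists sent in one
    step. Each message is independently lost with probability `loss_prob`;
    messages that would push total traffic past `capacity` symbols are dropped.
    Returns the per-sender delivery result (None = not delivered)."""
    delivered, used = [], 0
    for msg in messages:
        if rng.random() < loss_prob:        # unreliable: random message loss
            delivered.append(None)
            continue
        if used + len(msg) > capacity:      # limited: capacity exhausted
            delivered.append(None)
            continue
        used += len(msg)
        delivered.append(msg)
    return delivered

# Three agents send 2, 4, and 3 symbols over a lossless channel of capacity 6:
# the first two fit (2 + 4 = 6 symbols), the third is dropped.
out = transmit([[1, 2], [3, 4, 5, 6], [7, 8, 9]], capacity=6, loss_prob=0.0)
# → [[1, 2], [3, 4, 5, 6], None]
```

Even this toy model makes the trade-off visible: agents that learn to send smaller messages (or none) leave capacity for others, which is exactly the behavior an adaptive message size selection can exploit.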

