CENTRALIZED TRAINING WITH HYBRID EXECUTION IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents aim to successfully perform cooperative tasks at any communication level at execution time by taking advantage of information-sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized) to a setting featuring full communication (fully centralized). To formalize our setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs), which we name hybrid-POMDPs, that explicitly models a communication process between the agents. We contribute MARO, an approach that combines an autoregressive predictive model, used to estimate missing agents' observations, with a dropout-based RL training scheme that simulates different communication levels during the centralized training phase. We evaluate MARO on standard scenarios and on extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL. Experimental results show that our method consistently outperforms baselines, allowing agents to act under faulty communication while successfully exploiting shared information.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) aims to learn utility-maximizing behavior in scenarios involving multiple agents. In recent years, deep MARL methods have been successfully applied to multi-agent tasks such as game-playing (Papoudakis et al., 2020), traffic light control (Wei et al., 2019), and energy management (Fang et al., 2020). Despite these successes, the multi-agent setting is substantially harder than its single-agent counterpart (Canese et al., 2021) for several reasons: multiple concurrent learners create non-stationary conditions that hinder learning; the curse of dimensionality obstructs centralized approaches to MARL, due to the exponential growth of the state and action spaces with the number of agents; and agents seldom observe the true state of the environment. To deal with the exponential growth of the state/action space and with environmental constraints, both in perception and actuation, existing methods aim to learn decentralized policies that allow the agents to act based on local perceptions and partial information about other agents' intentions. The paradigm of centralized training with decentralized execution is at the core of recent research in the field (Oliehoek et al., 2011; Rashid et al., 2018; Foerster et al., 2016); this paradigm takes advantage of the fact that additional information, available only at training time, can be used to learn decentralized policies in a way that alleviates the need for communication. While in some settings partial observability and/or communication constraints require learning fully decentralized policies, the assumption that agents cannot communicate at execution time is often too restrictive for a great number of real-world application domains, such as robotics, game-playing, or autonomous driving (Ho et al., 2019; Yurtsever et al., 2020).
In such domains, learning fully decentralized policies is inappropriate, since such policies ignore the possibility of communication between the agents. Other MARL strategies, which take advantage of additional information shared among the agents, can certainly be developed (Zhu et al., 2022). In this work, our objective is to develop agents that are able to exploit the benefits of centralized training while, simultaneously, taking advantage of information-sharing at execution time. We introduce the paradigm of hybrid execution, in which agents act in scenarios with any possible communication level, ranging from no communication (fully decentralized) to full communication (fully centralized).
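To make the range of communication levels concrete, the following is a minimal sketch of how dropout-style masking of inter-agent messages might be simulated during training. The function name, the zeroed-message placeholder, and the per-message drop probability are illustrative assumptions, not the paper's actual implementation of MARO.

```python
import numpy as np

def mask_communication(joint_obs, agent_idx, p_drop, rng):
    """Simulate a communication level for one agent: each other agent's
    observation is independently dropped with probability p_drop, while
    the agent's own local observation is always kept.

    p_drop = 0.0 corresponds to full communication (fully centralized
    input); p_drop = 1.0 corresponds to no communication (fully
    decentralized input)."""
    masked = joint_obs.copy()
    for j in range(joint_obs.shape[0]):
        if j != agent_idx and rng.random() < p_drop:
            masked[j] = 0.0  # dropped message: zero placeholder
    return masked

rng = np.random.default_rng(0)

# 3 agents, each with a 4-dimensional local observation
joint_obs = rng.normal(size=(3, 4))

centralized = mask_communication(joint_obs, 0, 0.0, rng)    # keeps everything
decentralized = mask_communication(joint_obs, 0, 1.0, rng)  # only own obs
```

Sampling `p_drop` (or per-message drop events) afresh at each training step would expose the policy to the whole spectrum between the two extremes, which is the intuition behind training for hybrid execution.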

