CENTRALIZED TRAINING WITH HYBRID EXECUTION IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents aim to successfully perform cooperative tasks with any communication level at execution time by taking advantage of information-sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized) to a setting featuring full communication (fully centralized). To formalize our setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs) that we name hybrid-POMDPs, which explicitly model a communication process between the agents. We contribute MARO, an approach that combines an autoregressive predictive model to estimate missing agents' observations with a dropout-based RL training scheme that simulates different communication levels during the centralized training phase. We evaluate MARO on standard scenarios and on extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL. Experimental results show that our method consistently outperforms baselines, allowing agents to act under faulty communication while successfully exploiting shared information.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) aims to learn utility-maximizing behavior in scenarios involving multiple agents. In recent years, deep MARL methods have been successfully applied to multi-agent tasks such as game-playing (Papoudakis et al., 2020), traffic light control (Wei et al., 2019), and energy management (Fang et al., 2020). Despite these successes, the multi-agent setting is substantially harder than its single-agent counterpart (Canese et al., 2021): multiple concurrent learners create non-stationary conditions that hinder learning; the curse of dimensionality obstructs centralized approaches to MARL due to the exponential growth of the state and action spaces with the number of agents; and agents seldom observe the true state of the environment. To deal with this exponential growth and with environmental constraints on both perception and actuation, existing methods aim to learn decentralized policies that allow the agents to act based on local perceptions and partial information about other agents' intentions. The paradigm of centralized training with decentralized execution is at the core of recent research in the field (Oliehoek et al., 2011; Rashid et al., 2018; Foerster et al., 2016); this paradigm exploits the fact that additional information, available only at training time, can be used to learn decentralized policies in a way that alleviates the need for communication. While in some settings partial observability and/or communication constraints require learning fully decentralized policies, the assumption that agents cannot communicate at execution time is often too restrictive for many real-world application domains such as robotics, game-playing, or autonomous driving (Ho et al., 2019; Yurtsever et al., 2020).
In such domains, learning fully decentralized policies is inappropriate, since such policies ignore the possibility of communication between the agents. Other MARL strategies that take advantage of additional information shared among the agents can certainly be developed (Zhu et al., 2022). In this work, our objective is to develop agents that exploit the benefits of centralized training while, simultaneously, taking advantage of information-sharing at execution time. We introduce the paradigm of hybrid execution, in which agents act in scenarios with any possible communication level, ranging from no communication (fully decentralized) to full communication between the agents (fully centralized). In particular, we focus on multi-agent cooperative tasks in which the sharing of local information (the observations and actions of the agents) is critical to successful execution. To formalize our setting, we define the hybrid partially observable Markov decision process (H-POMDP), a new class of multi-agent POMDPs that explicitly considers a communication process between the agents. Our goal is to find a method that allows agents to solve H-POMDPs regardless of the communication process encountered at execution time. To allow for hybrid execution, we propose an autoregressive model that explicitly predicts non-shared information from past observations of the agents. In addition, we propose a training scheme for the agents' controllers that simulates communication faults during the centralized training phase. We call our combined approach multi-agent observation sharing with communication dropout (MARO). MARO can be easily integrated with current deep MARL methods. We evaluate the performance of MARO across different communication levels, in different MARL benchmark environments, and using multiple RL algorithms.
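The dropout-based training scheme described above can be sketched as follows. This is a minimal illustration, assuming fixed-size vector observations and independent per-message drops; all function and variable names are ours, not the paper's:

```python
import numpy as np

def mask_shared_observations(joint_obs, p_drop, rng):
    """Simulate communication failures during centralized training: agent i
    receives agent j's observation (j != i) only with probability 1 - p_drop.
    Missing entries are zeroed, and a binary flag per message marks which
    observations arrived. `joint_obs` is an (n_agents, obs_dim) array.
    Illustrative only."""
    n, d = joint_obs.shape
    received = rng.random((n, n)) >= p_drop      # received[i, j]: i hears j
    np.fill_diagonal(received, True)             # agents always see themselves
    masked = received[:, :, None] * joint_obs[None, :, :]   # (n, n, d)
    inputs = masked.reshape(n, n * d)            # per-agent flattened input
    flags = received.astype(np.float32)          # availability indicators
    return np.concatenate([inputs, flags], axis=1)
```

Setting `p_drop = 0` recovers the fully centralized setting and `p_drop = 1` the fully decentralized one; sampling `p_drop` during training exposes the controllers to the whole range of communication levels.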
Furthermore, we introduce three novel MARL environments that explicitly require communication during execution to successfully perform cooperative tasks, a feature currently missing in the literature. Finally, we perform an ablation study that highlights the importance of both the predictive model and the training scheme to the overall performance of MARO. The results show that our method consistently outperforms the baselines, allowing agents to exploit shared information during execution and perform tasks under various communication levels. In summary, our contribution is three-fold: (i) we propose and formalize the setting of hybrid execution in MARL, in which agents must perform partially observable cooperative tasks across all possible communication levels; (ii) we propose MARO, an approach that combines an autoregressive predictive model of agents' observations with a novel training scheme; and (iii) we evaluate MARO in both standard benchmarks and novel environments, using different RL algorithms, showing that our approach consistently allows agents to act under different communication levels.

2. HYBRID EXECUTION IN MULTI-AGENT REINFORCEMENT LEARNING

A fully cooperative multi-agent system with Markovian dynamics can be modelled as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek & Amato, 2016). A Dec-POMDP is a tuple ([n], X, A, P, r, γ, Z, O), where [n] = {1, . . . , n} is the set of indexes of the n agents; X is the set of states of the environment; A = A_1 × · · · × A_n is the set of joint actions, where A_i is the set of individual actions of agent i; P is the set of probability distributions over next states in X, one for each state-action pair in X × A; r : X × A → R maps states and actions to expected rewards; γ ∈ [0, 1[ is a discount factor; Z = Z_1 × · · · × Z_n is the set of joint observations, where Z_i is the set of local observations of agent i; and O is the set of probability distributions over joint observations in Z, one for each state-action pair in X × A. A decentralized policy for agent i is π_i : Z_i → A_i, and the joint decentralized policy is π : Z → A such that π(z_1, . . . , z_n) = (π_1(z_1), . . . , π_n(z_n)). Fully decentralized approaches to MARL directly apply standard single-agent RL algorithms to learn each agent's policy π_i in a decentralized manner. In independent Q-learning (IQL) (Tan, 1993), as well as in other independent RL algorithms (Schulman et al., 2017), each agent treats the other agents as part of the environment, ignoring the influence of their observations and actions. More recently, under the paradigm of centralized training with decentralized execution, QMIX (Rashid et al., 2018) and other MARL algorithms (Sunehag et al., 2017; Son et al., 2019; Mahajan et al., 2019; Wang et al., 2020b; Yu et al., 2021) aim at learning decentralized policies with centralization at training time while fostering cooperation among the agents. Finally, if agents can share their observations, we can use either approach to learn fully centralized policies.
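For concreteness, the Dec-POMDP tuple and the joint decentralized policy defined above can be represented as a simple container and a helper function. This is an illustrative sketch only, with simplified component types (finite sets as sequences, distributions as callables); none of these names come from the paper:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDP:
    """Container for the tuple ([n], X, A, P, r, gamma, Z, O)."""
    n_agents: int                 # [n] = {1, ..., n}
    states: Sequence              # X
    joint_actions: Sequence      # A = A_1 x ... x A_n
    transition: Callable          # P(x' | x, a)
    reward: Callable              # r(x, a): expected reward
    gamma: float                  # discount factor in [0, 1)
    joint_observations: Sequence  # Z = Z_1 x ... x Z_n
    observation: Callable         # O(z | x, a)

def joint_decentralized_policy(policies, joint_obs):
    """pi(z_1, ..., z_n) = (pi_1(z_1), ..., pi_n(z_n)): each agent selects
    its action from its own local observation only."""
    return tuple(pi_i(z_i) for pi_i, z_i in zip(policies, joint_obs))
```

The helper makes the decentralization constraint explicit: agent i's action depends on z_i alone, never on z_j for j ≠ i.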
None of the aforementioned classes of methods considers that agents may sometimes have access to other agents' observations and sometimes not. Therefore, decentralized agents are unable to take advantage of the additional information that they may receive from other agents at execution time, and centralized agents are unable to act when the sharing of information fails. In this work, we introduce hybrid execution in MARL, a setting in which agents act regardless of the communication process while taking advantage of any additional information they may receive during execution. To formalize this setting, we define a new class of multi-agent POMDPs that we name hybrid-POMDPs (H-POMDPs), which explicitly consider a specific communication process among the agents.
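As a rough illustration of what acting "regardless of the communication process" entails, a hybrid policy input can be assembled from whatever was received at the current step. This is our own sketch, assuming vector observations and a zero placeholder for missing messages (MARO instead fills missing slots with its learned autoregressive predictive model):

```python
import numpy as np

def assemble_hybrid_input(own_obs, shared, obs_dim):
    """`shared[j]` is agent j's observation array, or None if it was not
    communicated this step. All entries None corresponds to fully
    decentralized execution; no entries None to fully centralized.
    Missing slots get a zero placeholder here, purely for illustration."""
    filled = [s if s is not None else np.zeros(obs_dim) for s in shared]
    return np.concatenate([np.asarray(own_obs)] + [np.asarray(f) for f in filled])
```

A single policy trained over inputs of this fixed shape can then be deployed unchanged at any communication level encountered at execution time.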

