QPLEX: DUPLEX DUELING MULTI-AGENT Q-LEARNING

Abstract

We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). A core concept in CTDE is the Individual-Global-Max (IGM) principle, which requires consistency between joint and local greedy action selections to support efficient local decision-making. However, in order to achieve scalability, existing MARL methods either limit the representational expressiveness of their value function classes or relax the IGM consistency, which risks instability or poor performance in complex domains. This paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), which takes a duplex dueling network architecture to factorize the joint value function. This duplex dueling structure encodes the IGM principle into the neural network architecture and thus enables efficient value function learning. Theoretical analysis shows that QPLEX achieves a complete IGM function class. Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) has broad prospects for addressing many complex real-world problems, such as sensor networks (Zhang & Lesser, 2011), coordination of robot swarms (Hüttenrauch et al., 2017), and autonomous cars (Cao et al., 2012). However, cooperative MARL faces two major challenges in practical applications: scalability and partial observability. The joint state-action space grows exponentially with the number of agents, and partial observability together with communication constraints requires each agent to make decisions based only on its local action-observation history.

To address these challenges, a popular MARL paradigm called centralized training with decentralized execution (CTDE) (Oliehoek et al., 2008; Kraemer & Banerjee, 2016) has recently attracted great attention, in which agents' policies are trained with access to global information in a centralized way and executed based only on local histories in a decentralized way. Many CTDE learning approaches have been proposed recently, among which value-based MARL algorithms (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Wang et al., 2019b) have shown state-of-the-art performance on challenging tasks, e.g., unit micromanagement in StarCraft II (Samvelyan et al., 2019).

To enable effective CTDE for multi-agent Q-learning, it is critical that the joint greedy action be equivalent to the collection of individual greedy actions of the agents, a requirement known as the IGM (Individual-Global-Max) principle (Son et al., 2019) and formalized below. The IGM principle provides two advantages: 1) ensuring policy consistency between centralized training (learning the joint Q-function) and decentralized execution (using individual Q-functions), and 2) enabling
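For concreteness, the IGM condition can be stated as the following consistency requirement between the joint action-value function $Q_{tot}$ and the individual action-value functions $Q_i$ (a standard formalization following Son et al. (2019), reproduced here for illustration, where $\boldsymbol{\tau} = (\tau_1, \ldots, \tau_n)$ is the joint action-observation history and $\boldsymbol{a} = (a_1, \ldots, a_n)$ is the joint action):

$$\arg\max_{\boldsymbol{a}} Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a}) = \Big( \arg\max_{a_1} Q_1(\tau_1, a_1), \;\ldots,\; \arg\max_{a_n} Q_n(\tau_n, a_n) \Big).$$

When this condition holds, each agent greedily maximizing its own $Q_i$ jointly recovers the greedy action of $Q_{tot}$, so the joint value function learned during centralized training can be executed exactly in a decentralized way.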

