QPLEX: DUPLEX DUELING MULTI-AGENT Q-LEARNING

Abstract

We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). A key concept in CTDE is the Individual-Global-Max (IGM) principle, which requires consistency between joint and local action selections to support efficient local decision-making. However, to achieve scalability, existing MARL methods either limit the representational expressiveness of their value function classes or relax the IGM consistency, which may cause instability or poor performance in complex domains. This paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), which uses a duplex dueling network architecture to factorize the joint value function. This duplex dueling structure encodes the IGM principle into the neural network architecture and thus enables efficient value function learning. Theoretical analysis shows that QPLEX achieves a complete IGM function class. Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) has broad prospects for addressing many complex real-world problems, such as sensor networks (Zhang & Lesser, 2011), coordination of robot swarms (Hüttenrauch et al., 2017), and autonomous cars (Cao et al., 2012). However, cooperative MARL encounters two major challenges in practical applications: scalability and partial observability. The joint state-action space grows exponentially as the number of agents increases. The partial observability and communication constraints of the environment require each agent to make its individual decisions based on local action-observation histories. To address these challenges, a popular MARL paradigm, called centralized training with decentralized execution (CTDE) (Oliehoek et al., 2008; Kraemer & Banerjee, 2016), has recently attracted great attention: agents' policies are trained in a centralized way with access to global information and executed in a decentralized way based only on local histories. Many CTDE learning approaches have been proposed recently, among which value-based MARL algorithms (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Wang et al., 2019b) have shown state-of-the-art performance on challenging tasks, e.g., unit micromanagement in StarCraft II (Samvelyan et al., 2019).

To enable effective CTDE for multi-agent Q-learning, it is critical that the joint greedy action be equivalent to the collection of the agents' individual greedy actions, which is called the IGM (Individual-Global-Max) principle (Son et al., 2019). This IGM principle provides two advantages: 1) it ensures policy consistency between centralized training (learning the joint Q-function) and decentralized execution (using individual Q-functions), and 2) it enables scalable centralized training when computing the one-step TD target of the joint Q-function (the joint greedy action is derived from the individual Q-functions).
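As an illustration of the IGM principle, the following minimal sketch checks whether the greedy joint action of a tabular joint Q-function coincides with the tuple of individual greedy actions. The function names and all numerical values are hypothetical and chosen only for this example; the additive joint table follows the VDN-style sum of individual Q-values, and the second table is an arbitrary joint Q-function that the chosen local Q-values are not consistent with.

```python
from itertools import product


def greedy_joint_action(q_joint):
    """Return the joint action (a tuple) with the highest joint Q-value.

    q_joint maps joint-action tuples to values; ties go to the first key seen.
    """
    return max(q_joint, key=q_joint.get)


def greedy_local_actions(q_locals):
    """Return the tuple of each agent's individually greedy action index."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_locals)


def satisfies_igm(q_joint, q_locals):
    """IGM check: joint greedy action == collection of local greedy actions."""
    return greedy_joint_action(q_joint) == greedy_local_actions(q_locals)


# Hypothetical individual Q-values for two agents with three actions each.
q1 = [1.0, 3.0, 2.0]
q2 = [0.5, 0.0, 4.0]

# Additive (VDN-style) joint table: Q_tot(a1, a2) = Q_1(a1) + Q_2(a2).
q_additive = {
    (a1, a2): q1[a1] + q2[a2] for a1, a2 in product(range(3), repeat=2)
}
igm_additive = satisfies_igm(q_additive, [q1, q2])

# An arbitrary joint table whose maximum lies at (0, 0); the local
# Q-values above prefer (1, 2), so IGM is violated for this pair.
payoff = [[8.0, -12.0, -12.0], [-12.0, 0.0, 0.0], [-12.0, 0.0, 0.0]]
q_other = {(a1, a2): payoff[a1][a2] for a1, a2 in product(range(3), repeat=2)}
igm_other = satisfies_igm(q_other, [q1, q2])

print(igm_additive, igm_other)
```

When IGM holds, the joint greedy action needed for the TD target can be read off from the individual Q-functions, one agent at a time, instead of searching the exponentially large joint action space.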
To realize this principle, VDN (Sunehag et al., 2018) and QMIX (Rashid et al., 2018) propose two sufficient conditions of IGM to factorize the joint action-value function. However, these two decomposition methods suffer from structural constraints that limit the joint action-value function class they can represent. As shown by Wang et al. (2020a), the incompleteness of the joint value function class may lead to poor performance or a risk of training instability in the offline setting (Levine et al., 2020). Several methods have been proposed to address this structural limitation. QTRAN (Son et al., 2019) constructs two soft regularizations to align the greedy action selections between the joint and individual value functions. WQMIX (Rashid et al., 2020) considers a weighted projection that places more importance on better joint actions. However, due to computational considerations, both implementations are approximate and based on heuristics, and so cannot guarantee exact IGM consistency. Therefore, achieving the complete expressiveness of the IGM function class with effective scalability remains an open problem for cooperative MARL.

To address this challenge, this paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), that uses a duplex dueling network architecture to factorize the joint action-value function into individual action-value functions. QPLEX introduces the dueling structure Q = V + A (Wang et al., 2016) for representing both joint and individual (duplex) action-value functions and then reformulates the IGM principle as an advantage-based IGM. This reformulation transforms the IGM consistency into constraints on the value range of the advantage functions and thus facilitates action-value function learning with a linear decomposition structure. Unlike QTRAN and WQMIX (Son et al., 2019; Rashid et al., 2020), which lose the guarantee of exact IGM consistency due to approximation, QPLEX uses a duplex dueling architecture to encode the principle into the neural network structure and thereby guarantees IGM consistency. To the best of our knowledge, QPLEX is the first multi-agent Q-learning algorithm that effectively achieves high scalability with a full realization of the IGM principle.

We evaluate the performance of QPLEX on both didactic problems proposed by prior work (Son et al., 2019; Wang et al., 2020a) and a range of unit micromanagement benchmark tasks in StarCraft II (Samvelyan et al., 2019). In the didactic problems, QPLEX demonstrates its full representation expressiveness, thereby learning the optimal policy and avoiding the potential risk of training instability. Empirical results on the more challenging StarCraft II tasks show that QPLEX significantly outperforms other multi-agent Q-learning baselines in both online and offline data collection settings. It is particularly interesting that QPLEX is able to support offline training, an ability the other baselines lack. This ability provides QPLEX not only with high stability and sample efficiency but also with opportunities to efficiently utilize multi-source offline data without additional online exploration (Fujimoto et al., 2019; Fu et al., 2020; Levine et al., 2020; Yu et al., 2020).
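To make the dueling structure Q = V + A mentioned above concrete, the sketch below decomposes a vector of action values into a value term and advantages. The choice V = max_a Q (so that A = Q - V) is an assumption made here for illustration, since this passage only states the form Q = V + A; the function name and numbers are likewise hypothetical. Under that assumption every advantage is non-positive and is zero exactly at a greedy action, which is the kind of value-range constraint on advantages that the text refers to.

```python
def dueling_decompose(q_values):
    """Split action values into (V, A) with V = max_a Q and A = Q - V.

    Assumption for this sketch: the value term is the maximum action value.
    """
    v = max(q_values)
    advantages = [q - v for q in q_values]
    return v, advantages


# Hypothetical action values for one agent with three actions.
q_i = [1.0, 3.0, 2.0]
v_i, adv_i = dueling_decompose(q_i)

# Recomposition recovers the original action values: Q = V + A.
recomposed = [v_i + a for a in adv_i]

# Greedy actions are exactly those with zero advantage.
greedy_actions = [idx for idx, a in enumerate(adv_i) if a == 0.0]

print(v_i, adv_i, greedy_actions)
```

Because the greedy action is determined by where the advantage reaches its maximum of zero, consistency between joint and individual greedy actions can be stated purely in terms of advantages, independently of the value terms.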

2. PRELIMINARIES

2.1 DECENTRALIZED PARTIALLY OBSERVABLE MDP (DEC-POMDP)

We model a fully cooperative multi-agent task as a Dec-POMDP (Oliehoek et al., 2016) defined by a tuple $\mathcal{M} = \langle \mathcal{N}, \mathcal{S}, \mathcal{A}, P, \Omega, O, r, \gamma \rangle$, where $\mathcal{N} \equiv \{1, 2, \dots, n\}$ is a finite set of agents and $\mathcal{S}$ is a finite set of global states $s$. At each time step, every agent $i \in \mathcal{N}$ chooses an action $a_i \in \mathcal{A} \equiv \{\mathcal{A}^{(1)}, \dots, \mathcal{A}^{(|\mathcal{A}|)}\}$ on a global state $s$, which forms a joint action $\boldsymbol{a} \equiv [a_i]_{i=1}^{n} \in \boldsymbol{\mathcal{A}} \equiv \mathcal{A}^n$. It results in a joint reward $r(s, \boldsymbol{a})$ and a transition to the next global state $s' \sim P(\cdot \mid s, \boldsymbol{a})$. $\gamma \in [0, 1)$ is a discount factor. We consider a partially observable setting, where each agent $i$ receives an individual partial observation $o_i \in \Omega$ according to the observation probability function $O(o_i \mid s, a_i)$. Each agent $i$ has an action-observation history $\tau_i \in \mathcal{T} \equiv (\Omega \times \mathcal{A})^{*}$ and constructs its individual policy $\pi_i(a \mid \tau_i)$ to jointly maximize team performance. We use $\boldsymbol{\tau} \in \boldsymbol{\mathcal{T}} \equiv \mathcal{T}^n$ to denote the joint action-observation history. The formal objective is to find a joint policy $\boldsymbol{\pi} = \langle \pi_1, \dots, \pi_n \rangle$ that maximizes the joint value function $V^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \boldsymbol{\pi}\right]$. Another quantity of interest in policy search is the joint action-value function $Q^{\boldsymbol{\pi}}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \mathbb{E}_{s'}\left[V^{\boldsymbol{\pi}}(s')\right]$.

2.2 DEEP MULTI-AGENT Q-LEARNING IN DEC-POMDP

Q-learning is a popular algorithm for finding the optimal joint action-value function $Q^{*}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \mathbb{E}_{s'}\left[\max_{\boldsymbol{a}'} Q^{*}(s', \boldsymbol{a}')\right]$. Deep Q-learning represents the action-value function

