QTRAN++: IMPROVED VALUE TRANSFORMATION FOR COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

Abstract

QTRAN is a multi-agent reinforcement learning (MARL) algorithm capable of learning the largest class of joint-action value functions to date. However, despite its strong theoretical guarantee, it has shown poor empirical performance in complex environments, such as the StarCraft Multi-Agent Challenge (SMAC). In this paper, we identify the performance bottleneck of QTRAN and propose a substantially improved version, coined QTRAN++. Our gains come from (i) stabilizing the training objective of QTRAN, (ii) removing the strict role separation between the action-value estimators of QTRAN, and (iii) introducing a multi-head mixing network for value transformation. Through extensive evaluation, we confirm that our diagnosis is correct, and QTRAN++ successfully bridges the gap between empirical performance and theoretical guarantee. In particular, QTRAN++ newly achieves state-of-the-art performance in the SMAC environment. The code will be released.

1. INTRODUCTION

Over the past decade, reinforcement learning (RL) has shown successful results on single-agent tasks (Mnih et al., 2015; Lillicrap et al., 2015). However, progress in multi-agent reinforcement learning (MARL) has been relatively slow despite its importance in many applications, e.g., controlling robot swarms (Yogeswaran et al., 2013) and autonomous driving (Shalev-Shwartz et al., 2016). Indeed, naïvely applying single-agent algorithms has produced underwhelming results (Tan, 1993; Lauer and Riedmiller, 2000; Tampuu et al., 2017). The main challenge is handling the non-stationarity of the policies: a small perturbation in one agent's policy can lead to large deviations in another agent's policy. Centralized training with decentralized execution (CTDE) is a popular paradigm for tackling this issue. Under this paradigm, value-based methods (i) train a central action-value estimator with access to full information about the environment, (ii) decompose the estimator into agent-wise utility functions, and (iii) set the decentralized policy of each agent to maximize the corresponding utility function. Their key idea is to design the action-value estimator as a decentralizable function (Son et al., 2019), i.e., one for which individual policies jointly maximize the central action-value estimate. For example, value-decomposition networks (VDN, Sunehag et al. 2018) decompose the action-value estimator into a summation of utility functions, while QMIX (Rashid et al., 2018) instead uses a monotonic function of the utility functions for the decomposition. The common challenge addressed by these prior works is how to design the action-value estimator to be as flexible as possible while maintaining the decentralizability constraint required for execution. Recently, Son et al. (2019) proposed QTRAN to eliminate the restriction, imposed by existing value-based CTDE methods, that the action-value estimator itself be decentralizable.
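The two decompositions mentioned above can be contrasted in a small numerical sketch. The utility vectors, the positive mixing weights, and the particular monotone mixer below are hypothetical stand-ins for the learned networks, chosen only to illustrate why both families keep the joint greedy action consistent with the per-agent greedy actions.

```python
import numpy as np

# Hypothetical per-agent utilities q_i(tau_i, u_i) for a fixed history,
# two agents with three actions each.
q1 = np.array([1.0, 3.0, 2.0])
q2 = np.array([0.5, 0.2, 4.0])

# VDN: additive decomposition, Q_jt(u1, u2) = q1[u1] + q2[u2].
Q_vdn = q1[:, None] + q2[None, :]

# QMIX-style: any monotonically non-decreasing mixer of the utilities also
# preserves decentralizability; here a positive-weighted sum passed through
# a softplus stands in for the learned mixing network.
w1, w2 = 0.7, 1.3                                   # positive weights => monotone mixer
Q_qmix = np.log1p(np.exp(w1 * q1[:, None] + w2 * q2[None, :]))

# In both cases the joint greedy action equals the tuple of per-agent argmaxes,
# so agents can act greedily on their own utilities at execution time.
for Q in (Q_vdn, Q_qmix):
    joint_best = np.unravel_index(np.argmax(Q), Q.shape)
    assert tuple(joint_best) == (np.argmax(q1), np.argmax(q2))
```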
To be specific, the authors introduced a true action-value estimator and a transformed action-value estimator with inequality constraints imposed between them. They provided a theoretical analysis of how these inequality constraints allow QTRAN to represent a larger class of estimators than the existing value-based CTDE methods. However, despite its promise, other recent studies have found that QTRAN performs empirically worse than QMIX in complex MARL environments (Mahajan et al., 2019; Samvelyan et al., 2019; Rashid et al., 2020a). Namely, there is an evident gap between the theoretical analysis and the empirical observations, and we are motivated to fill this gap.

Contribution. In this paper, we propose QTRAN++, a novel value-based MARL algorithm that resolves the limitations of QTRAN, i.e., it fills the gap between theoretical guarantee and empirical performance. Our algorithm maintains the theoretical benefit of QTRAN of representing the largest class of joint action-value estimators, while achieving state-of-the-art performance in the popular complex MARL environment, the StarCraft Multi-Agent Challenge (SMAC, Samvelyan et al. 2019). At a high level, the proposed QTRAN++ improves over QTRAN through the following critical modifications: (a) enriching the training signals through a change of loss functions, (b) allowing shared roles between the true and the transformed joint action-value functions, and (c) introducing multi-head mixing networks for joint action-value estimation. To be specific, we achieve (a) by enforcing additional inequality constraints on the transformed action-value estimator and by using a non-fixed true action-value estimator. Furthermore, (b) helps to maintain high expressive power for the transformed action-value estimator, with representation transferred from the true action-value estimator. Finally, (c) allows an unbiased credit assignment in tasks that are fundamentally non-decentralizable.
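As context for modification (a), the equality/inequality constraints that QTRAN imposes between the transformed estimator Q'(u) = Σᵢ qᵢ(uᵢ) + V and the true estimator Q_jt can be sketched for a two-agent case. The function below is a simplified illustration with hypothetical tables, not the authors' implementation: equality is penalized at the greedy joint action, and inequality violations (Q' < Q_jt) are penalized elsewhere.

```python
import numpy as np

def qtran_losses(q1, q2, v, Q_jt):
    """Hedged 2-agent sketch of QTRAN-style constraint losses.

    q1, q2 : hypothetical per-agent utility vectors q_i(tau_i, u_i)
    v      : state-value correction V(s)
    Q_jt   : true joint action-value table of shape (|U|, |U|)
    """
    # Transformed estimator over all joint actions.
    Q_prime = q1[:, None] + q2[None, :] + v
    # Greedy joint action: the tuple of per-agent argmaxes.
    ubar = (int(np.argmax(q1)), int(np.argmax(q2)))

    # Equality constraint at the greedy action: Q'(ubar) should track Q_jt(ubar).
    l_opt = (Q_prime[ubar] - Q_jt[ubar]) ** 2

    # Inequality constraint elsewhere: Q'(u) >= Q_jt(u); only violations
    # (negative differences) contribute to the loss.
    diff = Q_prime - Q_jt
    l_nopt = np.mean(np.minimum(diff, 0.0) ** 2)
    return l_opt, l_nopt
```

In QTRAN these two terms are combined with an ordinary TD loss on Q_jt; the modifications (a)-(c) above change how these signals are generated and balanced.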
We extensively evaluate QTRAN++ in the SMAC environment, comparing it with 5 MARL baselines under 10 different scenarios. We additionally consider a new rewarding mechanism which promotes "selfish" behavior of agents by penalizing agents based on their self-interest. To highlight, QTRAN++ consistently achieves state-of-the-art performance in all the considered experiments. Even in the newly considered settings, QTRAN++ successfully trains the agents to cooperate with each other to achieve high performance, while the existing MARL algorithms may fall into local optima of learning selfish behaviors, e.g., QMIX trains the agents to run away from enemies to preserve their health. We also conduct ablation studies, which reveal how the algorithmic components of QTRAN++ are complementary to each other and crucial for achieving state-of-the-art performance. We believe that QTRAN++ can serve as a strong baseline for future MARL research.

2.1. PROBLEM STATEMENT

In this paper, we consider a decentralized partially observable Markov decision process (Oliehoek et al., 2016) represented by a tuple G = ⟨S, U, P, r, O, N, γ⟩. To be specific, we let s ∈ S denote the true state of the environment. At each time step, an agent i ∈ N := {1, ..., N} selects an action u_i ∈ U as an element of the joint action vector u := [u_1, ..., u_N]. The environment then follows a stochastic transition dynamic described by the probability P(s′ | s, u). All agents share the same reward r(s, u), discounted by a factor of γ. Each agent i is associated with a partial observation O(s, i) and an action-observation history τ_i. The concatenation of the agent-wise action-observation histories is denoted as the overall action-observation history τ. We consider value-based policies under the paradigm of centralized training with decentralized execution (CTDE). To this end, we train a joint action-value estimator Q_jt(s, τ, u) with access to the overall action-observation history τ and the underlying state s. Each agent i follows the policy of maximizing an agent-wise utility function q_i(τ_i, u_i), which can be executed in parallel without access to the state s. Finally, we say the joint action-value estimator Q_jt is decentralized into agent-wise utility functions q_1, ..., q_N when the following condition is satisfied:

arg max_u Q_jt(s, τ, u) = [arg max_{u_1} q_1(τ_1, u_1), ..., arg max_{u_N} q_N(τ_N, u_N)].  (1)

Namely, the joint action-value estimator is decentralizable when there exist agent-wise policies maximizing the estimated action-value.
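For small action spaces, condition (1) can be checked by brute force: the joint argmax of the full Q_jt table must coincide with the tuple of per-agent argmaxes. The helper below is an illustrative checker over a dense joint table, not part of any training algorithm.

```python
import numpy as np

def is_decentralizable(Q_jt, utilities):
    """Brute-force check of Eq. (1) for a fixed history.

    Q_jt      : dense joint action-value table, one axis per agent
    utilities : list of per-agent utility vectors q_i(tau_i, .)
    """
    # Tuple of per-agent greedy actions.
    greedy = tuple(int(np.argmax(q)) for q in utilities)
    # Joint greedy action recovered from the flattened argmax.
    joint_best = tuple(
        int(i) for i in np.unravel_index(int(np.argmax(Q_jt)), Q_jt.shape)
    )
    return joint_best == greedy
```

For instance, an additive table built from the utilities always passes this check, while a joint table whose maximum sits at a non-greedy joint action fails it.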

2.2. RELATED WORK

QMIX. Rashid et al. (2018) showed that the decentralization in Equation (1) is achieved when the joint action-value estimator is restricted to a non-decreasing monotonic function of the agent-wise utility functions. Based on this result, they proposed to parameterize the joint action-value estimator as a mixing network f_mix over the utilities q_i, with parameters θ(s) produced by hypernetworks conditioned on the state:

Q_jt(s, τ, u) = f_mix(q_1(τ_1, u_1), ..., q_N(τ_N, u_N); θ(s)),

where monotonicity is enforced by constraining the mixing-network weights to be non-negative, so that ∂Q_jt/∂q_i ≥ 0 for every agent i.
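A minimal numerical sketch of this parameterization follows, assuming a single hidden layer and hypothetical hypernetwork matrices hyper_w1 / hyper_w2; the actual QMIX architecture additionally uses ELU activations and state-conditioned biases, both omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixing_network(qs, state, hyper_w1, hyper_w2):
    """Toy QMIX-style mixer: weights come from the state, forced non-negative.

    qs                 : vector of agent utilities q_i(tau_i, u_i)
    state              : state feature vector s
    hyper_w1, hyper_w2 : hypothetical hypernetwork matrices producing theta(s)
    """
    # Absolute value makes every mixing weight non-negative, which (together
    # with a non-decreasing activation) guarantees dQ_jt/dq_i >= 0.
    w1 = np.abs(state @ hyper_w1).reshape(len(qs), -1)
    w2 = np.abs(state @ hyper_w2)
    hidden = np.maximum(qs @ w1, 0.0)   # ReLU stand-in for the paper's ELU
    return float(hidden @ w2)
```

Because every weight is non-negative and the activation is non-decreasing, raising any single agent's utility can never decrease the mixed joint value, which is exactly the monotonicity restriction above.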

