FSV: LEARNING TO FACTORIZE SOFT VALUE FUNCTION FOR COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

Abstract

We explore stochastic policy solutions for cooperative multi-agent reinforcement learning (MARL) using the idea of function factorization in centralized training with decentralized execution (CTDE). Existing CTDE-based factorization methods are susceptible to relative overgeneralization, a well-known game-theoretic pathology in which agents converge to a suboptimal Nash Equilibrium. To resolve this issue, we propose a novel factorization method for cooperative MARL, named FSV, which learns to factorize the joint soft value function into individual ones for decentralized execution. Theoretical analysis shows that FSV solves a rich class of factorization tasks. Our experiments on the well-known Non-Monotonic Matrix game and Max of Two Quadratics game show that FSV converges to optima in the joint action space through local search, in both discrete and continuous tasks. We evaluate FSV on a challenging set of StarCraft II micromanagement tasks and show that FSV significantly outperforms existing factorization-based multi-agent reinforcement learning methods.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) aims to learn policies that maximize the team reward accumulated over time (Panait & Luke (2005); Busoniu et al. (2008); Tuyls & Weiss (2012)), and has great potential to address complex real-world problems, such as coordinating autonomous cars (Cao et al. (2013)). Owing to measurement and communication limitations in practical problems, cooperative MARL faces the challenge of partial observability: each agent chooses actions based only on its local observations. Centralized training with decentralized execution (CTDE) (Oliehoek et al. (2011)) is a common paradigm to address partial observability, where agents' policies are trained with access to global information in a centralized way and executed based only on local observations in a decentralized way, as in MADDPG (Lowe (2017)) and COMA (Foerster et al. (2017)). However, the size of the joint state-action space of the centralized value function grows exponentially as the number of agents increases, which is known as the scalability challenge. Value function factorization methods have become an increasingly popular way to address scalability in CTDE by satisfying the Individual-Global-Max (IGM) principle, which requires that the optimal joint action selection be consistent with the optimal individual action selections. Three representative examples of value function factorization methods are VDN (Sunehag et al. (2017)), QMIX (Rashid et al. (2018)), and QTRAN (Son et al. (2019)). All of these methods use ε-greedy policies: VDN and QMIX give sufficient but unnecessary conditions for IGM through additivity and monotonicity structures, respectively, while QTRAN formulates IGM as an optimization problem with linear constraints. Although these methods have seen success on some tasks, they all face relative overgeneralization, where agents may get stuck in a suboptimal Nash Equilibrium.
In fact, relative overgeneralization is a grave pathology that occurs when a suboptimal Nash Equilibrium in the joint action space is preferred over an optimal Nash Equilibrium, because each agent's action in the suboptimal equilibrium appears to be a better choice when averaged over the other agents' behavior (Wei & Luke (2016)). The non-monotonic matrix game is a simple discrete example. Both VDN and QMIX fail to learn the optimal policy in the non-monotonic matrix game due to their structural limitations. Although QTRAN has full value function representation ability in the non-monotonic matrix game, its expressiveness degrades in complex tasks because its computationally intractable constraints are relaxed into tractable L2 penalties. Moreover, QTRAN sacrifices tractability in continuous action spaces. Therefore, achieving effective scalability while avoiding relative overgeneralization, in both discrete and continuous tasks, remains an open problem for cooperative MARL. To address this challenge, this paper presents a new definition of factorizable tasks called IGO (Individual-Global-Optimal), which requires consistency between the joint optimal stochastic policy and the individual optimal stochastic policies. Theoretical analysis shows that IGO degenerates into IGM when the policies are greedy, which demonstrates the generality of IGO. Under IGO, this paper proposes a novel factorization solution for MARL, named FSV, which learns to factorize the joint soft value function into individual ones for decentralized execution, enabling efficient learning and exploration through maximum entropy reinforcement learning. To the best of our knowledge, FSV is the first multi-agent algorithm with stochastic policies that uses the idea of factorization, and theoretical analysis shows that FSV solves a rich class of tasks. We evaluate the performance of FSV on both the discrete and continuous problems proposed by Son et al. (2019) and Wei et al. (2018), and on a range of unit micromanagement benchmark tasks in StarCraft II.
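To make relative overgeneralization concrete, the sketch below builds a small cooperative matrix game of the kind used in such examples (the payoff values here are illustrative, not taken from the paper). Averaged over a uniform policy of the teammate, the action participating in the optimal joint action looks worse than the suboptimal actions, which is exactly the signal an independent learner follows into the suboptimal equilibrium.

```python
import numpy as np

# Payoff matrix for a 2-agent, 3-action cooperative game (illustrative values).
# The optimal joint action (0, 0) yields 8, but miscoordination on action 0
# is heavily penalized.
R = np.array([
    [  8.0, -12.0, -12.0],
    [-12.0,   0.0,   0.0],
    [-12.0,   0.0,   0.0],
])

# Centralized view: the best joint action.
best = np.unravel_index(np.argmax(R), R.shape)

# Decentralized view: expected payoff of each of agent 1's actions when the
# teammate acts uniformly at random -- what an independent learner estimates.
expected = R.mean(axis=1)

print(best)      # (0, 0): the optimal joint action
print(expected)  # action 0 looks worst on average: [-5.33, -4.0, -4.0]
```

Because `expected[0]` is the smallest entry, a greedy independent learner drifts toward actions 1 or 2, settling in the suboptimal equilibrium with reward 0 instead of 8.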
The Non-Monotonic Matrix game shows that FSV has full expressive ability in the discrete task, and the Max of Two Quadratics game shows that FSV is the first factorization algorithm that avoids relative overgeneralization and converges to the optimum in the continuous task. On the more challenging StarCraft II tasks in SMAC (Samvelyan et al. (2019)), FSV significantly outperforms the other baselines, owing to its high representation ability and exploration efficiency.

2. PRELIMINARIES

2.1. DEC-POMDP AND CTDE

A fully cooperative multi-agent task can be described as a Dec-POMDP defined by a tuple G = ⟨S, U, P, r, Z, O, N, γ⟩, where s ∈ S is the global state of the environment. Each agent i ∈ N chooses an action u_i ∈ U at each time step, forming a joint action u ∈ U^N. This causes a transition to the next state according to the state transition function P(s'|s, u): S × U^N × S → [0, 1] and yields a reward r(s, u): S × U^N → R shared by all agents. γ ∈ [0, 1] is a discount factor. Each agent has an individual, partial observation z ∈ Z according to the observation function O(s, i): S × N → Z. Each agent also has an action-observation history τ_i ∈ T ≡ (Z × U)*, on which it conditions a stochastic policy π_i(u_i|τ_i): T × U → [0, 1]. The joint policy π has a joint action-value function Q^π(s_t, u_t) = E_{s_{t+1:∞}, u_{t+1:∞}}[ Σ_{k=0}^∞ γ^k r_{t+k} | s_t, u_t ]. Centralized Training with Decentralized Execution (CTDE) is a common paradigm for cooperative MARL tasks. During centralized training, the action-observation histories of all agents and the full state can be made accessible to all agents. This allows agents to learn and construct individual action-value functions correctly, while at execution time each agent selects actions based only on its own local action-observation history.
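The action-value function above is the expectation of a discounted return; a minimal sketch of that inner sum, computed for one sampled reward sequence (function name and signature are ours, for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a sampled reward sequence.

    Q^pi(s_t, u_t) is the expectation of this quantity over trajectories
    generated by the joint policy pi; here we evaluate it for one sample.
    """
    g = 0.0
    # Accumulate backwards so each reward is discounted by gamma^k.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```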

2.2. VDN, QMIX AND QTRAN

An important concept for factorizable tasks is IGM, which asserts that the joint action-value function Q_tot: T^N × U^N → R and the individual action-value functions [Q_i: T × U → R]_{i=1}^N satisfy

arg max_u Q_tot(τ, u) = ( arg max_{u_1} Q_1(τ_1, u_1), ..., arg max_{u_N} Q_N(τ_N, u_N) ).   (1)

To this end, VDN and QMIX give sufficient conditions for IGM through additivity and monotonicity structures, respectively:

Q_tot(τ, u) = Σ_{i=1}^N Q_i(τ_i, u_i)   and   ∂Q_tot(τ, u) / ∂Q_i(τ_i, u_i) > 0, ∀i ∈ N.   (2)

However, there exist tasks whose joint action-value functions do not satisfy these conditions, for which VDN and QMIX fail to construct individual action-value functions correctly. QTRAN instead uses linear constraints between individual and joint action values to guarantee optimal decentralization.
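The additivity condition of VDN makes IGM hold by construction: the joint argmax of a sum of per-agent utilities decomposes into per-agent argmaxes. A minimal numerical check (random utilities stand in for learned networks; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

# Random individual utilities Q_i(tau_i, u_i), one row per agent.
# Under VDN, Q_tot(tau, u) = sum_i Q_i(tau_i, u_i).
Q_i = rng.normal(size=(n_agents, n_actions))

# Decentralized greedy selection: each agent maximizes its own Q_i.
decentralized = tuple(int(np.argmax(q)) for q in Q_i)

# Centralized argmax over the joint action space (exponential in n_agents).
joint = np.zeros((n_actions,) * n_agents)
for idx in np.ndindex(*joint.shape):
    joint[idx] = sum(Q_i[i, a] for i, a in enumerate(idx))
centralized = tuple(int(a) for a in np.unravel_index(np.argmax(joint), joint.shape))

# Additivity guarantees IGM: the two selections coincide.
print(decentralized == centralized)  # True
```

The centralized search enumerates n_actions^n_agents joint actions, which is exactly the exponential cost that factorization avoids; the decentralized selection touches only n_agents × n_actions entries.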


