FSV: LEARNING TO FACTORIZE SOFT VALUE FUNCTION FOR COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

Abstract

We explore stochastic policy solutions for cooperative multi-agent reinforcement learning (MARL) using the idea of function factorization in centralized training with decentralized execution (CTDE). Existing CTDE-based factorization methods are susceptible to relative overgeneralization, a well-known game-theoretic pathology in which agents converge to a suboptimal Nash equilibrium. To resolve this issue, we propose a novel factorization method for cooperative MARL, named FSV, which learns to factorize the joint soft value function into individual ones for decentralized execution. Theoretical analysis shows that FSV solves a rich class of factorization tasks. Our experiments on the well-known Non-Monotonic Matrix game and Max of Two Quadratics game show that FSV converges to the optimum in the joint action space by local searching, in both discrete and continuous tasks. We also evaluate FSV on a challenging set of StarCraft II micromanagement tasks, where it significantly outperforms existing factorization multi-agent reinforcement learning methods.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) aims to instill in agents policies that maximize the team reward accumulated over time (Panait & Luke (2005); Busoniu et al. (2008); Tuyls & Weiss (2012)), and has great potential to address complex real-world problems, such as coordinating autonomous cars (Cao et al. (2013)). Given the measurement and communication limitations in practical problems, cooperative MARL faces the challenge of partial observability: each agent must choose actions based only on its local observations. Centralized training with decentralized execution (CTDE) (Oliehoek et al. (2011)) is a common paradigm to address partial observability, where agents' policies are trained with access to global information in a centralized way but executed based only on local observations in a decentralized way, as in MADDPG (Lowe (2017)) and COMA (Foerster et al. (2017)). However, the size of the joint state-action space of the centralized value function grows exponentially with the number of agents, which is known as the scalability challenge. Value function factorization methods have become an increasingly popular paradigm for addressing scalability in CTDE by satisfying the Individual-Global-Max (IGM) principle, under which the optimal joint action selection is consistent with the optimal individual action selections. Three representative examples of value function factorization methods are VDN (Sunehag et al. (2017)), QMIX (Rashid et al. (2018)), and QTRAN (Son et al. (2019)). All these methods use ε-greedy policies: VDN and QMIX give sufficient but unnecessary conditions for IGM through additivity and monotonicity structures, respectively, while QTRAN formulates IGM as an optimization problem with linear constraints. Although these methods have achieved success on some tasks, they all face relative overgeneralization, where agents may get stuck in a suboptimal Nash equilibrium.
In fact, relative overgeneralization is a grave pathology that occurs when a suboptimal Nash equilibrium in the joint action space is preferred over an optimal Nash equilibrium, because each agent's action in the suboptimal equilibrium is a better choice when matched with arbitrary actions from the collaborating agents (Wei & Luke (2016)). The non-monotonic matrix game is a simple discrete example. Both VDN and QMIX fail to learn the optimal policy in the non-monotonic matrix game due to their structural limitations. Although QTRAN has full value-function representation ability in the non-monotonic matrix game, its expressiveness degrades in complex tasks because its computationally intractable constraints are relaxed with tractable L2 penalties. Moreover, QTRAN sacrifices tractability in continuous action spaces. Therefore, in both discrete and continuous tasks, achieving effective scalability while avoiding relative overgeneralization remains an open problem for cooperative MARL. To address this challenge, this paper presents a new definition of factorizable tasks, called IGO (Individual-Global-Optimal), which requires consistency between the joint optimal stochastic policy and the individual optimal stochastic policies. Theoretical analysis shows that IGO degenerates into IGM if the policy is greedy, which demonstrates the generality of IGO. Under IGO, this paper proposes a novel factorization solution for MARL, named FSV, which learns to factorize the soft value function into individual ones for decentralized execution, enabling efficient learning and exploration through maximum entropy reinforcement learning. To the best of our knowledge, FSV is the first multi-agent algorithm with stochastic policies using the idea of factorization, and theoretical analysis shows that FSV solves a rich class of tasks. We evaluate the performance of FSV on both the discrete and continuous problems proposed by Son et al. (2019) and Wei et al. (2018), and on a range of unit micromanagement benchmark tasks in StarCraft II.
The Non-Monotonic Matrix game shows that FSV has full expressiveness in the discrete task, and the Max of Two Quadratics game shows that FSV is the first factorization algorithm that avoids relative overgeneralization and converges to the optimum in the continuous task. On the more challenging StarCraft II tasks from the SMAC benchmark (Samvelyan et al. (2019)), FSV significantly outperforms the other baselines thanks to its high representation ability and exploration efficiency.

2. PRELIMINARIES

2.1. DEC-POMDP AND CTDE

A fully cooperative multi-agent task can be described as a Dec-POMDP defined by a tuple $G = \langle S, U, P, r, Z, O, N, \gamma \rangle$, where $s \in S$ is the global state of the environment. Each agent $i \in N$ chooses an action $u_i \in U$ at each time step, forming a joint action $u \in U^N$. This causes a transition to the next state according to the state transition function $P(s'|s, u): S \times U^N \times S \to [0, 1]$ and the reward function $r(s, u): S \times U^N \to \mathbb{R}$ shared by all agents. $\gamma \in [0, 1]$ is a discount factor. Each agent has an individual, partial observation $z \in Z$ according to the observation function $O(s, i): S \times N \to Z$. Each agent also has an action-observation history $\tau_i \in T \equiv (Z \times U)^*$, on which it conditions a stochastic policy $\pi_i(u_i|\tau_i): T \times U \to [0, 1]$. The joint policy $\pi$ has a joint action-value function $Q^\pi(s_t, u_t) = \mathbb{E}_{s_{t+1:\infty}, u_{t+1:\infty}}[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,|\, s_t, u_t]$. Centralized training with decentralized execution (CTDE) is a common paradigm for cooperative MARL tasks. During centralized training, the action-observation histories of all agents and the full state can be made accessible to all agents. This allows agents to learn and construct individual action-value functions correctly, while each agent selects actions based only on its own local action-observation history at execution time.

2.2. VDN, QMIX AND QTRAN

An important concept for factorizable tasks is IGM, which asserts that the joint action-value function $Q_{tot}: T^N \times U^N \to \mathbb{R}$ and the individual action-value functions $[Q_i: T \times U \to \mathbb{R}]_{i=1}^N$ satisfy

$\arg\max_u Q_{tot}(\tau, u) = (\arg\max_{u_1} Q_1(\tau_1, u_1), \ldots, \arg\max_{u_N} Q_N(\tau_N, u_N))$ (1)

To this end, VDN and QMIX give sufficient conditions for IGM through additivity and monotonicity structures, respectively:

$Q_{tot}(\tau, u) = \sum_{i=1}^N Q_i(\tau_i, u_i)$ and $\frac{\partial Q_{tot}(\tau, u)}{\partial Q_i(\tau_i, u_i)} > 0, \ \forall i \in N$ (2)

However, there exist tasks whose joint action-value functions do not meet these conditions, on which VDN and QMIX fail to construct the individual action-value functions correctly. QTRAN uses linear constraints between individual and joint action values to guarantee optimal decentralization. To avoid intractability, QTRAN relaxes these constraints using two L2 penalties. However, this relaxation may violate IGM, and QTRAN has been reported to perform poorly on multiple multi-agent cooperative benchmarks.
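To make the additivity condition concrete, the following sketch (with made-up Q-tables; all values are illustrative, not from the paper) checks that a VDN-style additive decomposition automatically satisfies IGM:

```python
import numpy as np

# Hypothetical per-agent Q-values for 2 agents with 3 actions each.
Q1 = np.array([3.0, 1.0, 0.5])
Q2 = np.array([2.0, 0.0, 1.5])

# VDN's additive factorization: Q_tot(u1, u2) = Q1(u1) + Q2(u2).
Q_tot = Q1[:, None] + Q2[None, :]

# IGM: the joint argmax must coincide with the per-agent argmaxes,
# so greedy decentralized execution recovers the joint greedy action.
joint_best = np.unravel_index(Q_tot.argmax(), Q_tot.shape)
assert joint_best == (Q1.argmax(), Q2.argmax())
```

The converse direction is what fails in practice: a non-monotonic $Q_{tot}$ (next section) cannot be written in this additive form at all.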

2.3. THE RELATIVE OVERGENERALIZATION PROBLEM

Relative overgeneralization occurs when a sub-optimal Nash equilibrium (e.g., N in Fig. 1) in the joint action space is preferred over an optimal Nash equilibrium (e.g., M in Fig. 1), because each agent's action in the suboptimal equilibrium is a better choice when matched with arbitrary actions from the collaborating agents. Specifically, as shown in Figure 1, where two agents with one-dimensional bounded actions (or three actions in the discrete case) try to cooperate and find the optimal joint action, action B (or C) is often preferred by most algorithms, as noted in Son et al. (2019) and Wei et al. (2018), due to their structural limitations and lack of exploration.
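A short numeric illustration of the pathology, using a payoff matrix in the style of the QTRAN non-monotonic matrix game (the exact values here are an assumption for illustration; the paper's Table 1 is not reproduced in this text):

```python
import numpy as np

# Non-monotonic shared payoff; rows = agent 1's action, cols = agent 2's.
#             A      B     C
R = np.array([[  8., -12., -12.],   # A
              [-12.,   0.,   0.],   # B
              [-12.,   0.,   0.]])  # C

# Under uniform exploration by the partner, each action's expected payoff:
expected = R.mean(axis=1)   # approx [-5.33, -4.0, -4.0]

# Relative overgeneralization: the optimal action A looks *worst* on
# average, even though (A, A) = 8 is the unique optimal joint action.
assert expected.argmax() != R.max(axis=1).argmax()
```

This is exactly why ε-greedy learners drift toward B or C: during exploration the return estimate for A is dragged down by the partner's uncoordinated actions.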

3. METHOD

In this section, we first introduce IGO (Individual-Global-Optimal), a new definition of factorizable MARL tasks with stochastic policies. Theoretical analysis shows that IGO degenerates into IGM if the policy is greedy. With energy-based policies, the structure between the joint and individual action values under IGO can be explicitly constructed, yielding the novel stochastic-policy factorization method we propose, named FSV. Specifically, FSV realizes IGO using an efficient linear structure and learns stochastic policies through maximum entropy reinforcement learning.

3.1. INDIVIDUAL GLOBAL OPTIMAL

In the CTDE paradigm, each agent $i \in N$ chooses an action based on a stochastic policy $\pi_i(u_i|\tau_i)$ at each time step. The joint policy $\pi_{tot}(u|\tau) = \prod_{i=1}^N \pi_i(u_i|\tau_i)$ describes the probability of taking joint action $u$ given joint observation history $\tau$. If the joint policy is exactly optimal when each agent adopts its individually optimal policy, the task can achieve the global optimum through local optima, which naturally motivates us to consider factorizable tasks with stochastic policies as follows:

Definition 1. For a joint optimal policy $\pi^*_{tot}(u|\tau): T^N \times U^N \to [0, 1]$, if there exist individual optimal policies $[\pi^*_i(u_i|\tau_i): T \times U \to [0, 1]]_{i=1}^N$ such that the following holds:

$\pi^*_{tot}(u|\tau) = \prod_{i=1}^N \pi^*_i(u_i|\tau_i)$ (3)

then we say that $[\pi_i]$ satisfy IGO for $\pi_{tot}$.

As specified above, IGO requires consistency of the joint optimal policy and the individual optimal policies, rather than of the optimal actions as in IGM, but it degenerates into IGM if the policies are greedy. That is to say, IGO is more general than IGM.

3.2. FSV

In this work, we take energy-based policies as the joint and individual optimal policies, respectively:

$\pi^*_{tot}(u|\tau) = \exp(\frac{1}{\alpha}(Q_{tot}(\tau, u) - V_{tot}(\tau)))$ (4)

$\pi^*_i(u_i|\tau_i) = \exp(\frac{1}{\alpha_i}(Q_i(\tau_i, u_i) - V_i(\tau_i)))$ (5)

where $\alpha, \alpha_i$ are temperature parameters, and $V_{tot}(\tau) = \alpha \log \int_{U^N} \exp(\frac{1}{\alpha} Q_{tot}(\tau, u')) du'$ and $V_i(\tau_i) = \alpha_i \log \int_U \exp(\frac{1}{\alpha_i} Q_i(\tau_i, u')) du'$ are the partition functions. The benefit of using energy-based policies is that they form a very general class of distributions that can represent complex, multi-modal behaviors (Haarnoja et al. (2017)). Moreover, energy-based policies easily degenerate into greedy policies as $\alpha, \alpha_i$ anneal. To learn such decentralized energy-based policies, we extend the maximum entropy reinforcement learning framework to the multi-agent setting, as described below. Another benefit of considering stochastic policies with an explicit function class for factorizable tasks through IGO is that the architecture between joint and individual action values can be constructed directly from the constraint on policies, as follows.

Theorem 1. If the task satisfies IGO, with energy-based optimal policies, the joint action value $Q_{tot}$ can be factorized by the individual action values $[Q_i]_{i=1}^N$ as follows:

$Q_{tot}(\tau, u) = \sum_{i=1}^N \lambda^*_i [Q_i(\tau_i, u_i) - V_i(\tau_i)] + V_{tot}(\tau)$ (6)

where $\lambda^*_i = \alpha / \alpha_i$.

Theorem 1 gives a decomposition structure like VDN's: the joint value is a linear combination of individual values weighted by $\lambda^*_i > 0$. However, the function class defined by Eq. (6), which should depend only on the task itself, is related to and limited by the policy distributions. Although the energy-based distribution is general enough to represent most tasks, to establish the correct architecture between joint and individual Q-values and enable stable learning, we need to extend the function class to arbitrary distributions.
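Theorem 1 can be verified numerically on a toy discrete example (arbitrary Q-values and temperatures, chosen only for illustration): constructing $Q_{tot}$ via Eq. (6) makes the joint energy-based policy of Eq. (4) factorize exactly into the individual policies of Eq. (5):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, alphas = 0.5, [1.0, 2.0]              # joint and per-agent temperatures
Qs = [rng.normal(size=3) for _ in range(2)]  # toy individual Q-values

def soft_v(q, a):
    # Partition function: V = a * log sum_u exp(Q(u) / a)
    return a * np.log(np.exp(q / a).sum())

Vs = [soft_v(q, a) for q, a in zip(Qs, alphas)]
pis = [np.exp((q - v) / a) for q, v, a in zip(Qs, Vs, alphas)]  # Eq. (5)

# Theorem 1: Q_tot = sum_i (alpha/alpha_i) * (Q_i - V_i) + V_tot.
V_tot = 1.7   # any constant works; it cancels inside the joint policy
Q_tot = (alpha / alphas[0]) * (Qs[0] - Vs[0])[:, None] \
      + (alpha / alphas[1]) * (Qs[1] - Vs[1])[None, :] + V_tot
pi_tot = np.exp((Q_tot - soft_v(Q_tot.ravel(), alpha)) / alpha)  # Eq. (4)

# IGO: the joint energy-based policy factorizes into the individual ones.
assert np.allclose(pi_tot, np.outer(pis[0], pis[1]))
```

Note that the check succeeds for any choice of $V_{tot}$ because the individual policies each normalize to 1, which forces the joint log-partition to equal $V_{tot}$.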
The key idea is that we approximate the weight vector $\lambda_i$ directly in the limit where $\alpha, \alpha_i$ are zero, instead of annealing $\alpha_i$ during the training process. This extends the function class and at least guarantees the IGM constraint when $\alpha, \alpha_i$ are zero.

Theorem 2. When $\alpha, \alpha_i \to 0$, the function class defined by IGM is equivalent to the following:

$Q_{tot}(\tau, u) = \sum_{i=1}^N \lambda_i(\tau, u)[Q_i(\tau_i, u_i) - V_i(\tau_i)] + V_{tot}(\tau)$ (7)

where $\lambda_i(\tau, u) = \lim_{\alpha, \alpha_i \to 0} \lambda^*_i$.

Note that $\lambda_i$ is now a function of observations and actions due to the relaxation. Eq. (7) allows us to use a simple linear structure to train joint and individual action values efficiently while guaranteeing a correct estimation of the optimal Q-values; we describe this in the experiments.

We now introduce maximum entropy reinforcement learning in the CTDE setting, which is a direct extension of soft actor-critic (soft Q-learning). Standard reinforcement learning maximizes the expected return $\sum_t \mathbb{E}_\pi[r_t]$, while the maximum entropy objective generalizes it by augmenting it with an entropy term, so that the optimal policy additionally maximizes its entropy at each visited state:

$\pi_{MaxEnt} = \arg\max_\pi \sum_t \mathbb{E}_\pi[r_t + \alpha H(\pi(\cdot|s_t))]$ (8)

where $\alpha$ is the temperature parameter that determines the relative importance of the entropy term versus the reward, and thus controls the stochasticity of the optimal policy (Haarnoja et al. (2017)). We extend it to cooperative multi-agent tasks by directly considering the joint policy $\pi_{tot}(u|\tau)$ and defining the soft joint action-value function as follows:

$Q_{tot}(\tau_t, u_t) = r(\tau_t, u_t) + \mathbb{E}_{\tau_{t+1}, \ldots}[\sum_{k=1}^{\infty} \gamma^k (r_{t+k} + \alpha H(\pi^*_{tot}(\cdot|\tau_{t+k})))]$ (9)

Then the joint optimal policy for Eq. (8) is given by Eq. (4) (Haarnoja et al. (2017)).
Note that, before considering decentralized policies, the joint Q-function should satisfy the soft Bellman equation:

$Q^*_{tot}(\tau_t, u_t) = r_t + \mathbb{E}_{\tau_{t+1}}[V^*_{tot}(\tau_{t+1})]$ (10)

and we can update the joint Q-function in centralized training through soft Q-iteration:

$Q_{tot}(\tau_t, u_t) \leftarrow r_t + \mathbb{E}_{\tau_{t+1}}[V_{tot}(\tau_{t+1})]$ (11)

It is natural to take a similar energy-based distribution as the individual optimal policy $\pi^*_i$ in Eq. (5), which allows us to update the individual policies through soft policy iteration:

$\pi^{new}_i = \arg\min_{\pi'} D_{KL}(\pi'(\cdot|\tau) \,\|\, \pi^*_i(\cdot|\tau))$ (12)
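A minimal tabular sketch of the soft Q-iteration in Eq. (11), on a toy two-state MDP with made-up rewards and transitions (everything here is an illustrative assumption, not part of FSV's architecture):

```python
import numpy as np

alpha, gamma = 0.2, 0.9
# Toy centralized table: 2 joint histories ("states"), 3 joint actions.
r = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
next_state = np.array([[1, 1, 0],    # hypothetical deterministic transitions
                       [0, 0, 1]])

def soft_value(q_row):
    # V(tau) = alpha * log sum_u exp(Q(tau, u) / alpha), discrete analogue
    # of the partition-function integral in Sec. 3.2.
    return alpha * np.log(np.exp(q_row / alpha).sum())

Q = np.zeros_like(r)
for _ in range(500):                 # soft Q-iteration: Q <- r + gamma * V(tau')
    V = np.array([soft_value(Q[s]) for s in range(2)])
    Q = r + gamma * V[next_state]

# The soft value upper-bounds the greedy value at every state; the gap
# vanishes as alpha anneals toward zero.
assert all(soft_value(Q[s]) >= Q[s].max() for s in range(2))
```

In FSV the same backup is realized with function approximation via the loss in Eq. (16) rather than by exact tabular sweeps.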

3.3. ARCHITECTURE

In this section, we present a novel MARL framework named FSV, which incorporates these ideas in a simple and efficient architecture through Eq. (7) with multi-agent maximum entropy reinforcement learning. FSV can be applied in continuous action spaces and, as a simplification, in discrete action spaces. Figure 2 shows the overall learning framework, which consists of two parts: (i) individual parts for each agent $i$, which represent $Q_i$, $V_i$, and $\pi_i$; and (ii) an incorporation part that composes $Q_i$, $V_i$ into $Q_{tot}$.

The individual part for each agent $i$ has three networks: (i) an individual Q network that takes the agent's own action and observation history $\tau_i, u_i$ as input and outputs the action value $Q_i(\tau_i, u_i)$; (ii) an individual value network that takes $\tau_i$ as input and outputs $V_i(\tau_i)$; and (iii) an individual policy network that takes $\tau_i$ as input and outputs a distribution (e.g., the mean and standard deviation of a Gaussian) from which actions are sampled.

The incorporation part composes $Q_i$, $V_i$ into $Q_{tot}$ through a linear combination. Specifically, it sums up $[Q_i - V_i]_{i=1}^N$ with coefficients $\lambda_i$ and uses a one-layer hyper-network to efficiently approximate the high-dimensional partition function:

$V_{tot}(\tau) = \sum_{i=1}^N w_i(\tau) V_i(\tau_i) + b(\tau)$ (13)

where $w_i$ and $b$ are a positive weight and a bias, respectively. To enable efficient learning, we adopt a multi-head attention structure to estimate the weight vector:

$\lambda_i(\tau, u) = \sum_{h=1}^H \lambda_{i,h}(\tau, u)$ (14)

where $H$ is the number of attention heads and $\lambda_{i,h}$ is defined by

$\lambda_{i,h} \propto \exp(e_u^T W_{k,h}^T W_{q,h} e_s)$ (15)

where $e_u$ and $e_s$ are obtained by two-layer embedding transformations of $u$ and $s$. The joint action-value function $Q_{tot}$ is updated through soft Q-iteration by minimizing

$J_\theta^{Q_{tot}} = \mathbb{E}_{(\tau_t, u_t) \sim D}[(Q_{tot}(\tau_t, u_t) - \hat{Q}(\tau_t, u_t))^2]$ (16)

where $\hat{Q}(\tau_t, u_t) = r(\tau_t, u_t) + \gamma \mathbb{E}_{\tau_{t+1} \sim D, u_{t+1} \sim \pi}[Q_{tot}(\tau_{t+1}, u_{t+1}) - \alpha \log \pi_{tot}(u_{t+1}|\tau_{t+1})]$.
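A numpy sketch of the multi-head weight estimation in Eqs. (14)-(15). The embedding sizes, random projection matrices, and the per-head softmax normalization over agents are all assumptions made for illustration (the paper does not specify the normalization in this excerpt); the point is that the construction keeps every $\lambda_i$ strictly positive, as Theorem 2 requires:

```python
import numpy as np

rng = np.random.default_rng(1)
N, H, d = 3, 4, 8                    # agents, attention heads, embedding dim
e_s = rng.normal(size=d)             # state embedding (two-layer MLP in FSV)
e_u = rng.normal(size=(N, d))        # per-agent action embeddings

W_q = rng.normal(size=(H, d, d))     # query projections, one per head
W_k = rng.normal(size=(H, d, d))     # key projections, one per head

lam = np.zeros(N)
for h in range(H):
    # lambda_{i,h} proportional to exp(e_u_i^T W_k^T W_q e_s)  -- Eq. (15)
    scores = e_u @ W_k[h].T @ W_q[h] @ e_s
    scores -= scores.max()                        # numerical stability
    lam += np.exp(scores) / np.exp(scores).sum()  # normalize over agents

# Positivity of the mixing weights, consistent with lambda_i > 0.
assert (lam > 0).all()
```

Positivity matters because Eq. (7) with $\lambda_i > 0$ is what preserves the argmax consistency between $Q_{tot}$ and each $Q_i$.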
The individual value networks are trained by minimizing

$J_{\phi_i}^{V_i} = \mathbb{E}_{\tau_i \sim D}[(V_i(\tau_i) - \mathbb{E}_{u_i}[Q_i(\tau_i, u_i) - \alpha \log \pi_i(u_i|\tau_i)])^2]$ (17)

The policy network of each agent is trained by minimizing the expected KL-divergence

$J_{\psi_i}^{\pi_i} = \mathbb{E}_{\tau_i \sim D, u_i \sim \pi_i}[\alpha \log \pi_i(u_i|\tau_i) - Q_i(\tau_i, u_i)]$ (18)

For discrete action spaces, it is convenient to simplify this framework to Q-learning. Specifically, we directly compute the individual value function $V_i = \alpha_i \log \sum_{u'} \exp(\frac{1}{\alpha_i} Q_i(\tau_i, u'))$ instead of updating the value network, and action distributions are produced directly by Eq. (5) instead of by the policy network.
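As a sanity check on the value target in Eq. (17): when $\pi_i$ is exactly the energy-based policy of Eq. (5), the target $\mathbb{E}_{u}[Q_i - \alpha \log \pi_i]$ collapses to the soft value (log-partition function), so a converged value network reproduces the discrete formula above. A toy numeric check (arbitrary Q-values, and taking $\alpha_i = \alpha$ as in the appendix):

```python
import numpy as np

alpha = 0.5
Q = np.array([1.0, -0.3, 0.7])               # toy individual Q-values
V = alpha * np.log(np.exp(Q / alpha).sum())  # soft value (log-partition)
pi = np.exp((Q - V) / alpha)                 # energy-based policy, Eq. (5)

# Value-network target from Eq. (17): E_{u ~ pi}[Q(u) - alpha * log pi(u)]
target = (pi * (Q - alpha * np.log(pi))).sum()
assert np.isclose(target, V)                 # fixed point of the value update
```

This fixed-point property is what justifies replacing the value network by the closed-form log-sum-exp in the discrete simplification.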

4. RELATED WORK

There are many early works based on the maximum entropy principle: Todorov (2010) and Levine & Koltun (2013) use it for policy search in linear dynamics, while Kappen (2005) and Theodorou et al. (2010) use it for path integral control in general dynamics. Recent off-policy methods (Haarnoja et al. (2017); Haarnoja et al. (2018b); Haarnoja et al. (2018a)) learn an energy-based policy efficiently through the maximum entropy objective, which we adopt in our framework. Value function factorization methods start from VDN (Sunehag et al. (2017)) and are extended by QMIX (Rashid et al. (2018)) and QTRAN (Son et al. (2019)). Other methods such as QATTEN (Yang et al. (2020)) and MAVEN (Mahajan et al. (2019)) go a step further on architecture and exploration. Our method belongs to this family but moves beyond deterministic policies.

Current methods adopt different ideas to address the relative overgeneralization problem. Wei et al. (2018) conduct multi-agent soft Q-learning for better exploration. Wen et al. (2019) use probabilistic recursive reasoning to model opponents, Yu et al. (2019) adopt inverse reinforcement learning to avoid this problem through correct demonstrations, and Tian et al. (2019) derive a variational lower bound on the likelihood of achieving optimality for modeling opponents. However, none of them adopt value function factorization like FSV, which means they suffer from the scalability problem.

5. EXPERIMENTS

In this section, we first consider two simple examples proposed by prior work (Son et al. (2019); Wei et al. (2018)) to demonstrate the optimality and convergence of FSV in discrete and continuous action spaces, respectively. We then evaluate its performance on a challenging set of cooperative StarCraft II maps from the SMAC benchmark (Samvelyan et al. (2019)).

5.1. MATRIX GAME

The matrix game is proposed by QTRAN (Son et al. (2019)): two agents with three actions and a shared reward, as illustrated in Table 1, should learn to cooperate to find the optimal joint action (A, A). This is a simple example of the relative overgeneralization problem, where the sub-optimal actions B and C have higher expected returns during exploration. We train all algorithms with full exploration (i.e., $\epsilon = 1$ in $\epsilon$-greedy) over 20,000 steps, while FSV is trained by annealing $\alpha$ from 1 to $\alpha_0$. To examine how expressiveness relates to the temperature parameter $\alpha$, we set $\alpha_0 = 1, 0.1, 0.01$, respectively. As shown in Table 3, QMIX fails to represent the optimal joint action value and the optimal action due to the limitation of its additivity and monotonicity structures, while FSV and QTRAN successfully represent all the joint action values. In addition, even if $\alpha$ is not annealed to a very small value, FSV correctly approximates the optimal joint action values, because we directly estimate $\lambda$ in the limit where $\alpha$ and $\alpha_i$ tend to 0, which relaxes the constraints of the function class and guarantees the correct structure during training.
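The annealing behavior referenced above is easy to visualize: as $\alpha$ shrinks, the energy-based policy of Eq. (5) concentrates on the greedy action (the Q-values below are toy numbers for illustration only):

```python
import numpy as np

Q = np.array([8.0, -12.0, -12.0])   # toy Q-values for actions A, B, C

def boltzmann(q, alpha):
    # Energy-based policy of Eq. (5); subtract max for numerical stability.
    z = (q - q.max()) / alpha
    p = np.exp(z)
    return p / p.sum()

for alpha in (1.0, 0.1, 0.01):      # the alpha_0 values used in Sec. 5.1
    pi = boltzmann(Q, alpha)
    assert pi.argmax() == 0         # mass concentrates on the optimal action

# At alpha = 0.01 the policy is effectively greedy.
assert boltzmann(Q, 0.01)[0] > 0.999
```

This is the sense in which the energy-based policy "degenerates into a greedy policy as $\alpha$ anneals" (Sec. 3.2).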

5.2. MAX OF TWO QUADRATICS GAME

We use the Max of Two Quadratics game (Wei et al. (2018)), a simple single-state continuous game for two agents, to examine how current algorithms behave under relative overgeneralization. Each agent has a one-dimensional bounded action, with shared reward given by

$f_1 = h_1 \times [-(\frac{u_1 - x_1}{s_1})^2 - (\frac{u_2 - y_1}{s_1})^2]$
$f_2 = h_2 \times [-(\frac{u_1 - x_2}{s_2})^2 - (\frac{u_2 - y_2}{s_2})^2] + c$
$r(u_1, u_2) = \max(f_1, f_2)$ (19)

where $u_1, u_2$ are the actions of agent 1 and agent 2, respectively, and $h_1 = 0.8$, $h_2 = 1$, $s_1 = 3$, $s_2 = 1$, $x_1 = -5$, $x_2 = 5$, $y_1 = -5$, $y_2 = 5$, $c = 10$. The reward function is shown in Fig. 3(a). Although this game is very simple, the gradient points to the sub-optimal solution at $(x_1, y_1)$ over almost all of the action space, which misleads policy-based methods. For value function factorization methods, this task requires non-monotonic structures to correctly represent the optimal joint Q-values through the individual Q-values. We extend QMIX and VDN to the actor-critic framework (like DDPG), while QTRAN is not applicable in continuous action spaces due to its requirement of max operations on Q-values. Table 7 gives a more detailed result: MADDPG and QMIX happened to find the optimal actions twice due to random initialization, while VDN never finds the optimal actions and even fails to find the sub-optimal one 4 times.

6. CONCLUSION

In this paper, we proposed IGO, a new definition of factorizable tasks that requires consistency between the joint and individual optimal stochastic policies. Then we introduced FSV, a novel MARL algorithm under IGO, which learns to factorize the soft value function into individual ones for decentralized execution, enabling efficient learning and exploration through maximum entropy reinforcement learning. As immediate future work, we aim to develop a theoretical analysis of FSV as a policy-based method. We would also like to explore committed exploration as in Mahajan et al. (2019) in continuous spaces, due to the miscoordination caused by energy-based policies (Wei & Luke (2016)).



Figure 1: The relative overgeneralization in discrete (a) and continuous (b) action space

Figure 2: FSV network architecture

Figure 3: Max of Two Quadratics game: (a) reward function, (b) average reward for FSV, VDN, QMIX and MADDPG

Figure 4: Test win rate of FSV, VDN, QMIX and QTRAN

Table 1: Payoff of the matrix game

Table 7: Training results for the Max of Two Quadratics game

To overcome the relative overgeneralization problem, both a more explorative policy and a correct estimation of Q-values are needed. Using a centralized critic like MADDPG to guide the decentralized actors misleads the policy gradients, because it averages the Q-values over the other agents' policies. Using individual Q-values to guide the actors requires full expressiveness over factorizable tasks, where QMIX and VDN fail to estimate individual Q-values correctly due to their structural limitations, as shown in Sec. 5.1, and QTRAN loses its tractability in continuous tasks. To enable better exploration in the joint action space, Wei et al. (2018) adopt multi-agent soft Q-learning to avoid relative overgeneralization, but it still uses a centralized critic, which suffers from poor scalability and is very sensitive to how the temperature parameter is annealed. FSV, which utilizes value function factorization to obtain correct estimates of the individual Q-values and explores with a more explorative energy-based policy, achieves a 100% success rate.
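The reward surface of Eq. (19) is easy to reproduce; the sketch below (using the parameter values given in Sec. 5.2) confirms the wide suboptimal basin at $(x_1, y_1)$ and the narrow global optimum at $(x_2, y_2)$:

```python
import numpy as np

# Parameters of the Max of Two Quadratics game from Sec. 5.2.
h1, h2, s1, s2 = 0.8, 1.0, 3.0, 1.0
x1, x2, y1, y2, c = -5.0, 5.0, -5.0, 5.0, 10.0

def reward(u1, u2):
    f1 = h1 * (-((u1 - x1) / s1) ** 2 - ((u2 - y1) / s1) ** 2)
    f2 = h2 * (-((u1 - x2) / s2) ** 2 - ((u2 - y2) / s2) ** 2) + c
    return max(f1, f2)

# The narrow global optimum sits at (x2, y2) with reward c = 10, while the
# much wider basin (s1 = 3 vs. s2 = 1) around (x1, y1) peaks at only 0
# but dominates the gradient field over most of the joint action space.
assert reward(x2, y2) == 10.0
assert reward(x1, y1) == 0.0

grid = np.linspace(-10.0, 10.0, 201)
vals = np.array([[reward(a, b) for b in grid] for a in grid])
assert np.isclose(vals.max(), 10.0)   # grid search recovers the optimum
```

Because the $f_1$ basin is three times wider, local gradient steps from a random initialization usually descend into it, which is exactly the failure mode Table 7 records for the baselines.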

7. APPENDIX

7.1. PROOFS

7.1.1. RELATIONSHIP BETWEEN IGO AND IGM

If the joint and individual optimal policies are greedy, then IGO gives that $u = \arg\max_u Q_{tot}(\tau, u)$ if and only if $u_i = \arg\max_{u_i} Q_i(\tau_i, u_i)$ for every $i$, which is equivalent to IGM.

7.1.2. PROOF OF THEOREM1

Theorem 1. If the task satisfies IGO, with energy-based optimal policies, the joint action value $Q_{tot}$ can be factorized by the individual action values $[Q_i]_{i=1}^N$ as in Eq. (6), where $\lambda^*_i = \alpha / \alpha_i$.

Proof. Considering Eq. (4) and Eq. (5), IGO can be reformulated as:

$\exp(\frac{1}{\alpha}(Q_{tot}(\tau, u) - V_{tot}(\tau))) = \prod_{i=1}^N \exp(\frac{1}{\alpha_i}(Q_i(\tau_i, u_i) - V_i(\tau_i)))$

Taking the logarithm of both sides gives:

$\frac{1}{\alpha}(Q_{tot}(\tau, u) - V_{tot}(\tau)) = \sum_{i=1}^N \frac{1}{\alpha_i}(Q_i(\tau_i, u_i) - V_i(\tau_i))$

Multiplying by $\alpha$ and rearranging:

$Q_{tot}(\tau, u) = \sum_{i=1}^N \frac{\alpha}{\alpha_i}[Q_i(\tau_i, u_i) - V_i(\tau_i)] + V_{tot}(\tau)$

which is Theorem 1.

7.1.3. PROOF OF THEOREM2

Theorem 2. When $\alpha, \alpha_i \to 0$, the function class defined by IGM is equivalent to Eq. (7), where $\lambda_i(\tau, u) = \lim_{\alpha, \alpha_i \to 0} \lambda^*_i$.

Proof. (IGM $\Rightarrow$ Eq. (7)) It is clear that Eq. (7) can always hold if $\lambda_i$ is well constructed. Here we give one way to construct $\lambda_i$, which also explains how we extend the function class limited by the energy-based policy. Denote by $\pi_i, \pi_{tot}$ the current policies and by $\pi^*_i, \pi^*_{tot}$ the optimal policies. We can always take the individual policies to lie within a small parameter $\epsilon$ of the individual optimal policies during the process of approaching the greedy limit; the joint policy is then given by the product of these near-optimal individual policies. Considering IGM, $u = \arg\max_u Q_{tot}(\tau, u)$ is consistent with the individual argmaxes, and in the limit $\alpha, \alpha_i \to 0$ the ratio $\lambda^*_i = \alpha / \alpha_i$ becomes a function of the observations and actions, which defines $\lambda_i(\tau, u)$ and makes Eq. (7) hold. In particular, if the sampled action $u$ is exactly the current $\arg\max_u Q_{tot}$, then $Q_{tot} = V_{tot}$ and $Q_i = V_i$ when $\alpha, \alpha_i \to 0$, and $\lambda_i$ can be set to 1. Thus we extend the function class, and Eq. (7) holds for any action.

(IGM $\Leftarrow$ Eq. (7)) As $\alpha, \alpha_i \to 0$, $Q_{tot} = V_{tot}$ and $Q_i = V_i$ if and only if $u = u^*$ and $u_i = u^*_i$, respectively. Considering Eq. (7) and $\lambda_i > 0$, $u = u^*$ if and only if $u_i = u^*_i$ for all $i$, which completes the proof.

7.2. EXPERIMENTAL SETTINGS

Hyper-parameters are listed in Table 8; all others are the default settings in PyMARL. In continuous tasks, we extend VDN and QMIX to the actor-critic framework. Specifically, we add an actor for each individual agent that maximizes the Q-values from its own critic, as in DDPG. We list the hyper-parameters of these algorithms, as well as FSV, in Table 8. For stability, we reformulate Eq. (7) and stop all gradients except those through $\lambda_i$ and the last term $Q_i$. Since an incorrect weight $\lambda_i$ would cause an incorrect $\alpha_i$ at the beginning of training and thus obstruct exploration, we use $\alpha_i = \alpha$, which means we ignore the KL-divergence between the current policy and the optimal policy. The only temperature parameter $\alpha$ is updated through annealing or the automating entropy adjustment of Haarnoja et al. (2018b):

$J(\alpha) = \mathbb{E}_{u_t \sim \pi_t}[-\alpha \log \pi_t(u_t|\tau_t) - \alpha \bar{H}]$

where $\bar{H}$ is the target entropy.

