REPRESENTATION INTERFERENCE SUPPRESSION VIA NON-LINEAR VALUE FACTORIZATION FOR INDECOMPOSABLE MARKOV GAMES

Abstract

Value factorization is an efficient approach to centralized training with decentralized execution in cooperative multi-agent reinforcement learning tasks. As the simplest implementation of value factorization, Linear Value Factorization (LVF) has attracted wide attention. In this paper, firstly, we investigate the applicable conditions of LVF, which are important but usually neglected by previous works. We prove that, due to its representation limitation, LVF is perfectly applicable only to an extremely narrow class of tasks, which we define as decomposable Markov games. Secondly, to handle indecomposable Markov games, where LVF is inapplicable, we turn to value factorization with complete representation capability (CRC) and explore the general form of value factorization functions that satisfy both the Independent Global Max (IGM) and CRC conditions. A common problem of these value factorization functions is the representation interference among true Q values that share local Q value functions. As a result, the policy could be trapped in local optima due to the representation interference on the optimal true Q values. Thirdly, to address this problem, we propose a novel value factorization method, namely Q Factorization with Representation Interference Suppression (QFRIS). QFRIS adaptively reduces the gradients of the local Q value functions contributed by the non-optimal true Q values. Our method is evaluated on various benchmarks. Experimental results demonstrate the good convergence of QFRIS.

1. INTRODUCTION

Centralized training with decentralized execution (CTDE) (Lowe et al., 2017; Oliehoek et al., 2008; Foerster et al., 2016) shows surprising performance and great scalability in challenging fully cooperative multi-agent reinforcement learning (MARL) tasks (Tan, 1993b). Such tasks only provide rewards shared by all agents, so each agent is expected to deduce its own contribution to the team, which introduces the problem of credit assignment (Foerster et al., 2018). As a simple and efficient approach to credit assignment in the CTDE paradigm, value factorization, especially Linear Value Factorization (LVF), has recently gained growing attention, e.g., VDN (Sunehag et al., 2017) and QMIX (Rashid et al., 2018). An important property of LVF is that it concisely meets the Independent Global Max (IGM) principle (Son et al., 2019). The IGM principle is defined as the identity between the greedy action of the joint Q value function and the set of greedy actions of the factorized local Q value functions, and is widely acknowledged as a critical rule for value factorization. However, the linearly factorizable joint Q value function of LVF is incapable of representing non-linear true Q value functions, which is known as the representation limitation of LVF. Recent works focus on solutions to the representation limitation but usually neglect under what conditions the true Q value function is not linearly factorizable. In this paper, we prove that in the context of Markov games, linear factorizability relies on two conditions: (1) the reward function is linearly factorizable on a set of subspaces of the joint state-action space; (2) the state transition in each subspace is independent of the states and actions outside the subspace. Based on these two conditions, we define the decomposability of the Markov game. In other words, the true Q value function is linearly factorizable if and only if the Markov game is decomposable. Most tasks are indecomposable Markov games, so we go deeper into the properties of LVF in this case.
We prove that the target of the joint Q value function in the Bellman equation (Sutton & Barto, 2018) is always unbiased for LVF under value iteration in the sarsa manner. To deal with indecomposable Markov games, where the true Q value function is not linearly factorizable, we consider improving the representation capability of the value factorization function by introducing extra approximators. According to the partial derivatives with respect to the local Q value functions, value factorization functions can be classified into two categories, i.e., linear and non-linear. For both categories, we investigate the conditions under which value factorization functions satisfy complete representation capability (i.e., the capability to approximate any true Q value function) and the IGM principle. We then propose a rule to generate qualified functions and list some example functions for both the linear and non-linear cases. A common problem of these value factorization functions is the representation interference among true Q values. Specifically, a local Q value corresponds to multiple true Q values in value factorization. As a result, the representations of these true Q values interfere with each other through the training of the shared local Q value function. The representation interference on the optimal true Q value function could leave the policy trapped in local optima. To address this problem, we design a novel value factorization function. Our method, namely Q Factorization with Representation Interference Suppression (QFRIS), alleviates the representation interference on the optimal true Q value by reducing the weights contributed by the non-optimal ones. QFRIS is evaluated on a matrix game, predator-prey, and the StarCraft Multi-Agent Challenge (SMAC). The experimental results demonstrate the good convergence of our method.
We have three main contributions in this work: (1) We prove a sufficient and necessary condition for the linear factorizability of the true Q value function, which can be used to tell whether the joint Q value function of LVF faces the representation limitation; (2) To deal with indecomposable Markov games, we propose rules to generate value factorization functions that satisfy both the IGM and CRC conditions; (3) We point out a common problem of value factorization, namely representation interference, and design a novel value factorization function to address it. Our method shows good convergence in experiments on various benchmarks.

2.1. DEC-POMDP

A fully cooperative multi-agent reinforcement learning problem can be modelled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), which is usually described by a tuple $G = \langle S, U, P, r, Z, O, n, \gamma\rangle$ (Guestrin et al., 2001; Oliehoek & Amato, 2016; Seuken & Zilberstein, 2008). $s \in S$ denotes the global state of the environment, from which a local observation $z_i \in Z_i$ is assigned to agent $i \in I \equiv \{1, 2, \cdots, n\}$ according to the observation function $O: S \times I \to Z_i$. After receiving $z_i$, each agent chooses an individual action $u_i \in U_i$ based on its local policy $\pi_i(u_i|\tau_i): T_i \times U_i \to [0, 1]$, where $\tau_i \in T_i \equiv (Z_i \times U_i)$ is the local observation-action history, i.e., the local trajectory. After the execution of the joint action $u = \{u_1, \cdots, u_n\}$, a reward $r$ shared by all agents and the next state $s'$ are generated by the reward function $r(s, u): S \times U \to \mathbb{R}$ and the transition function $P(s'|s, u): S \times U \times S \to [0, 1]$, respectively. $\gamma \in [0, 1)$ is a discount factor. Note that we use bold symbols to denote global and joint variables, e.g., $S$ and $u$. The true Q value function is defined as the expectation of the accumulated rewards, i.e., $Q(s_t, u_t) := \mathbb{E}_{s_{t+1:\infty}, u_{t+1:\infty}}[R_t|s_t, u_t]$, where $R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$. $Q(s_t, u_t)$ is approximated by the learned joint Q value function. We denote the optimal action of the true Q value function and the greedy action of the joint Q value function by $u^*$ and $u_{gre}$, respectively.
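As a concrete reading of the return definition above (our sketch, not part of the paper), the discounted return $R_t = \sum_{i \ge 0} \gamma^i r_{t+i}$ of a finite episode can be computed by the backward recursion $R_t = r_t + \gamma R_{t+1}$:

```python
# Illustrative sketch: compute the discounted return R_t for every time step
# of a finite episode. The Monte-Carlo average of such returns estimates the
# true Q value Q(s_t, u_t) = E[R_t | s_t, u_t].

def discounted_return(rewards, gamma=0.9):
    """Return [R_0, ..., R_{T-1}] for a list of per-step shared team rewards."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future  # R_t = r_t + gamma * R_{t+1}
        returns[t] = future
    return returns
```

For example, `discounted_return([1.0, 0.0, 1.0], gamma=0.5)` returns `[1.25, 0.5, 1.0]`.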

2.2. VALUE FACTORIZATION

In value factorization, the joint Q value function is factorized through a value factorization operator $F(\cdot)$ as
$$Q(s, u) = F(Q_1(\tau_1, u_1), \cdots, Q_n(\tau_n, u_n)) \quad (1)$$
where $Q_i(\tau_i, u_i): T_i \times U_i \to \mathbb{R}$ ($i \in [1, n]$) is defined as the local Q value function of agent $i$. A critical rule of value factorization is the Independent Global Max principle, defined as the identity between the joint greedy action and the set of local greedy actions. Formally, given the joint Q value function $Q(s, u)$ and the local Q value functions $\{Q_1(\tau_1, u_1), \cdots, Q_n(\tau_n, u_n)\}$ factorized by $F(\cdot)$, if the following equality holds
$$\arg\max_{u} Q(s, u) = \{\arg\max_{u_1} Q_1(\tau_1, u_1), \cdots, \arg\max_{u_n} Q_n(\tau_n, u_n)\} \quad (2)$$
we say the factorization operator satisfies the IGM principle. The IGM principle enables the coordination of local policies under the centrally trained joint Q value function. Linear Value Factorization (LVF) naturally meets the IGM principle and has become the most popular value factorization method in recent years. In LVF, the joint Q value function is linearly factorized as
$$Q(s, u) = F(Q_1(\tau_1, u_1), \cdots, Q_n(\tau_n, u_n)) = \sum_{i=1}^n w_i Q_i(\tau_i, u_i) + V(s) \quad (3)$$
The joint Q value function of LVF can only represent linearly factorizable true Q value functions, which is known as the problem of representation limitation. As a result, the optimal Bellman operator could fail to be a $\gamma$-contraction (Wang et al., 2020a) when faced with non-linear true Q value functions. In other words, there could be multiple convergence points for the joint Q value function (Wan et al., 2021) and the policy could get trapped in sub-optima.
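As a quick numerical illustration of the IGM property of LVF (our sketch, not the paper's code; all table values are random), the joint greedy action of a linearly mixed Q value equals the collection of local greedy actions whenever the mixing weights $w_i$ are positive:

```python
import numpy as np

# Toy check: for a VDN-style linear factorization
# Q(s,u) = sum_i w_i * Q_i(u_i) + V(s) with w_i > 0,
# the joint greedy action coincides with the per-agent greedy actions (IGM).

rng = np.random.default_rng(0)
m, n = 4, 3                           # m actions per agent, n agents
q_loc = rng.normal(size=(n, m))       # random local Q tables Q_i(u_i)
w = rng.uniform(0.1, 1.0, size=n)     # positive mixing weights
v = 0.7                               # state value offset V(s)

# Enumerate all m**n joint actions and build the joint Q table.
joint_actions = np.stack(
    np.meshgrid(*[np.arange(m)] * n, indexing="ij"), -1
).reshape(-1, n)
q_joint = np.array(
    [sum(w[i] * q_loc[i, u[i]] for i in range(n)) + v for u in joint_actions]
)

u_joint_greedy = joint_actions[q_joint.argmax()]
u_local_greedy = q_loc.argmax(axis=1)
assert np.array_equal(u_joint_greedy, u_local_greedy)  # IGM holds
```

Because each $w_i$ is positive, maximizing the sum decouples into maximizing each local table independently, which is exactly the IGM identity of Eq. 2.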

3.1. DECOMPOSABILITY OF MARKOV GAMES

The linearly factorizable joint Q value function in LVF is incapable of representing non-linear true Q value functions. In this section, we investigate the conditions for the linearity of the true Q value function in the context of Markov games. A Markov game (Littman, 1994) is equivalent to a decentralized fully observable Markov decision process, which can be described by a tuple $MG = \langle S, U, P, r, n, \gamma\rangle$. The explanation of the symbols can be found in the preliminary. Firstly, we introduce the concept of decomposability of Markov games.

Definition 1 (Decomposable Markov Game). Given a Markov game (Dou et al., 2022) $MG = \langle S, U, P, r, n, \gamma\rangle$, suppose there exists a collection of subspaces of the joint state-action space $\{S_1 \times \hat{U}_1, S_2 \times \hat{U}_2, \cdots, S_k \times \hat{U}_k\}$ ($k \ge 2$) such that (1) the reward function is linearly factorizable on the subspaces, i.e., $r(s, u) = \sum_{i=1}^k r_i(s_i, \hat{u}_i)$; (2) the state transition in each subspace is independent of the states and actions outside it, i.e., $P(s_{i,t+1}|s_{i,t}, \hat{u}_{i,t}) = P(s_{i,t+1}|s_t, u_t)$. Then $MG$ is said to be decomposable by $\{MG_1, MG_2, \cdots, MG_k\}$, where $MG_i = \langle S_i, \hat{U}_i, P_i, r_i, n_i, \gamma\rangle$. $MG_i, MG_j$ ($\forall i, j \in [1, k]$ and $i \neq j$) should not be considered as elements of the decomposition in the following situations: (1) void decomposition, $\hat{U}_i = \emptyset$ and $r_i(s_i, \hat{u}_i) = 0$; (2) self-decomposition, $\hat{U}_i = U$ and $r_i(s_i, \hat{u}_i) = C \cdot r(s, u)$, where $C$ is a constant; (3) overlapping decomposition, $\hat{U}_i = \hat{U}_j$ and $r_i(s_i, \hat{u}_i) = C \cdot r_j(s_j, \hat{u}_j)$. Therefore, we also require $\forall i, j \in [1, k]$ ($i \neq j$): $\hat{U}_i \neq \emptyset$, $\hat{U}_i \neq U$, and $\hat{U}_i \neq \hat{U}_j$ for a decomposable Markov game.

Examples of both decomposable and indecomposable Markov games are provided in Fig. 3.1, where 4 agents (denoted by dots) need to cover 2 landmarks (denoted by squares) in pairs. Agents are assigned target landmarks in colors. The team receives an instant reward when any agent covers its target landmark. In the indecomposable case, the team only receives a reward when a landmark is covered by the first 2 agents, and the reward function is not linearly factorizable since it is determined by the policies of all agents. Fig. 3.1(c) and (e) present two decompositions of the decomposable Markov game. In particular, the decomposition in Fig. 3.1(e) is the MGD, since none of the decomposed Markov games is further decomposable.
Proposition 1 (Linear factorizability of the true Q value function in decomposable Markov games). The true Q value function can be linearly factorized as $Q(s, u) = \sum_{i=1}^k Q_i(s_i, \hat{u}_i)$ for $\forall (s, u) \in S \times U$ if and only if $MG$ is decomposable by $\{MG_1, MG_2, \cdots, MG_k\}$.

The proof of Proposition 1 can be found in Appendix A. Fig. 3.1(d) presents the factorization of the true Q value function under the decomposition in Fig. 3.1(c). The joint Q value function of LVF is capable of representing the true Q value function only if each decomposed Markov game of the MGD involves a single agent. Note that the decomposition of a Markov game is non-unique. We can obtain new decompositions from an existing one, for which we introduce the following lemma:

Lemma 1. Suppose $\{MG_1, MG_2, \cdots, MG_k\}$ ($k \ge 2$) is a decomposition of Markov game $MG$. $\{\overline{MG}_1, \overline{MG}_2, \cdots, \overline{MG}_{k_s}\}$ is also a decomposition of $MG$ if the following conditions hold: (1) $\overline{MG}_j$ is decomposable by a non-empty subset of $\{MG_1, MG_2, \cdots, MG_k\}$ for $\forall j \in [1, k_s]$; (2) $\cup_{j=1}^{k_s}\{\overline{S}_j \times \hat{\overline{U}}_j\} = S \times U$, where $\overline{MG}_j = \langle \overline{S}_j, \overline{U}_j, P, \overline{r}_j, \overline{n}_j, \gamma\rangle$.

The proof of Lemma 1 can be found in Appendix B. Obviously, if $\{MG_1, MG_2, \cdots, MG_k\}$ ($k \ge 2$) is a decomposition of Markov game $MG$, we can also obtain new decompositions by further decomposing the elements of $\{MG_1, MG_2, \cdots, MG_k\}$.

3.2. LVF IN INDECOMPOSABLE MARKOV GAMES

Decomposability is unusual for Markov games. Multi-agent tasks involving cooperative rewards or interactive transitions of all agents are usually indecomposable Markov games, where the true Q value functions are not linearly factorizable. In this subsection, we investigate the performance of LVF in the most frequent case, i.e., indecomposable Markov games. Our investigation is carried out from the perspective of indecomposable Markov games with discrete action spaces, where representing the true Q value function is equivalent to solving the linear equation system
$$Q(s, u) = \sum_{i=1}^n w_i Q_i(s, u_i) + V(s), \quad \forall u \in U \quad (4)$$
The maximum number of independent equations is $m^n$, where $m$ is the size of the discrete local action space. It can be proved that the rank of the coefficient matrix of the equation system equals $n(m-1)+1$ (the proof is available in Appendix C). The equation system is overdetermined since $m^n > n(m-1)+1$ for $\forall m, n \in [2, \infty)$. Despite the representation error of the joint Q value function, the target of the joint Q value function is always unbiased for LVF under value iteration in the sarsa manner. To explain this, we introduce the following proposition:

Proposition 2. In indecomposable Markov games, the estimate of the state value function is unbiased for LVF under value iteration in the sarsa manner, i.e., $\sum_{u \in U} \pi(u|s) Q_{LVF}(s, u) = \sum_{u \in U} \pi(u|s) Q(s, u)$, where $Q_{LVF}(s, u)$ denotes the linearly factorized joint Q value function.

The proof of Proposition 2 can be found in Appendix D. Furthermore, we have
$$Q_{target,t} = r(s_t, u_t) + \gamma \int_{s_{t+1}} P(s_{t+1}|s_t, u_t) \sum_{u_{t+1} \in U} \pi(u_{t+1}|s_{t+1}) Q_{LVF}(s_{t+1}, u_{t+1})\, ds_{t+1} = r(s_t, u_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim P(\cdot|s_t, u_t),\, u_{t+1} \sim \pi(\cdot|s_{t+1})}[Q(s_{t+1}, u_{t+1})] = Q(s_t, u_t) \quad (5)$$
Eq. 5 indicates that the target of the joint Q value function still equals the true Q value function in indecomposable Markov games for LVF under value iteration in the sarsa manner.
In this case, LVF is capable of finding the optimal policy if for $\forall t \in [0, \infty)$ the following holds:
$$u_t^* = \arg\max_{u_t} Q_{LVF}(s_t^*, u_t) \quad (6)$$
where $\tau^* := (s_0^*, u_0^*, s_1^*, u_1^*, \cdots)$ is the optimal trajectory, i.e., $\forall t \in [0, \infty)$, $u_t^* = \arg\max_{u_t} Q(s_t^*, u_t)$. Eq. 6 is equivalent to solving a single-step matrix game. But note that the joint Q value function is a biased estimate of the true Q value function. Therefore, we have $\max_{u_t} Q_{LVF}(s_t, u_t) \neq \max_{u_t} Q(s_t, u_t)$, which suggests there are errors in the Q-learning target. Such errors could accumulate along the trajectories through the bootstrap of the joint Q value function.
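To make the overdetermination of Eq. 4 and the mean-preserving property of Proposition 2 concrete, here is a small numerical sketch (ours, not the paper's code; it assumes a single fixed state and a uniform policy). It fits the best linear factorization to a random joint Q table by least squares: the fit has a non-zero residual, yet its policy-weighted mean matches that of the true Q table:

```python
import numpy as np

# Fit hatQ(u) = sum_i Q_i(u_i) + V to a random joint Q table by least squares.
# The system has m**n equations but only n*(m-1)+1 degrees of freedom, so a
# generic (non-factorizable) Q cannot be represented exactly. Under a uniform
# policy, however, the fitted mean equals the true mean, mirroring Prop. 2.

rng = np.random.default_rng(1)
m, n = 3, 2
joint = np.stack(np.meshgrid(*[np.arange(m)] * n, indexing="ij"), -1).reshape(-1, n)
q_true = rng.normal(size=len(joint))   # a generic, non-factorizable joint Q

# Feature matrix: one one-hot block per agent plus a constant column for V(s).
feats = np.concatenate(
    [np.eye(m)[joint[:, i]] for i in range(n)] + [np.ones((len(joint), 1))],
    axis=1,
)
coef, *_ = np.linalg.lstsq(feats, q_true, rcond=None)
q_hat = feats @ coef

assert not np.allclose(q_hat, q_true)           # representation error exists
assert np.isclose(q_hat.mean(), q_true.mean())  # uniform-policy mean preserved
```

The least-squares residual is orthogonal to the column space of the features, which contains the constant vector; that orthogonality is precisely why the uniformly weighted mean is preserved despite the representation error.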

4. VALUE FACTORIZATION FUNCTIONS FOR INDECOMPOSABLE MARKOV GAMES

Although the target of the joint Q value function exactly equals the true Q value function in indecomposable Markov games for LVF under value iteration in the sarsa manner, it is still impractical for LVF to solve every single-step matrix game along the optimal trajectory. To deal with indecomposable Markov games, in this section we turn to value factorization functions that satisfy both the IGM and CRC conditions. According to the partial derivatives with respect to the local Q value functions, we divide value factorization functions into linear and non-linear ones.

4.1. EXTENDED LINEAR VALUE FACTORIZATION FUNCTION

Firstly, consider a linear value factorization function $F(Q_1(\tau_1, u_1), \cdots, Q_n(\tau_n, u_n))$. Let $Q_{set}(\tau, u) := \{Q_1(\tau_1, u_1), \cdots, Q_n(\tau_n, u_n)\}$ denote the collection of local Q value functions. We have $\partial F(Q_{set}(\tau, u))/\partial Q_i(\tau_i, u_i) = w_i$ for $\forall i \in [1, n]$. To improve the representation capability of $F(\cdot)$, we introduce a set of parameterized modules denoted by $M_{set}(s, u) := \{M_1(s_1, \hat{u}_1), \cdots, M_k(s_k, \hat{u}_k)\}$, where $(s_i, \hat{u}_i) \in S_i \times \hat{U}_i$ ($i \in [1, k]$) and $S_i \times \hat{U}_i \subset S \times U$. The joint Q value function equals
$$Q(\tau, u) = F(Q_{set}(\tau, u), M_{set}(s, u)) = \sum_{i=1}^n w_i Q_i(\tau_i, u_i) + \sum_{j=1}^k M_j(s_j, \hat{u}_j) + V(s) \quad (7)$$
To distinguish it from the LVF in Eq. 3, we refer to the function in Eq. 7 as extended LVF. Note that indecomposable Markov games are not decomposable on any collection of subspaces of the joint state-action space. According to Proposition 1, the true Q value function is therefore not linearly factorizable by any functions defined on proper subspaces of the joint state-action space. In other words, a necessary condition for Eq. 7 to represent any true Q value function is $\exists M_j(s_j, \hat{u}_j)$ with $(s_j, \hat{u}_j) = (s, u)$ ($j \in [1, k]$), i.e., at least one module takes the full state-action pair as input. Notice that $F: S \times U \to (-\infty, Q^*]$. Therefore, we also require $M_j: S \times U \to (-\infty, C]$, where $C$ is an arbitrary constant. Now consider the IGM principle, which requires $\partial F(Q_{set}(\tau, u), M_{set}(s, u))/\partial Q_i(\tau_i, u_i) = w_i > 0$ and $M_j(s_j, \hat{u}_j) \le M_j(s_j, \hat{u}_{j,gre})$. Based on the constraints above, we list some examples of extended LVF functions in Appendix E.
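The constraints above admit many concrete instances. Below is one hand-built sketch (ours; the gating rule and the softplus bound are our assumptions, not the examples from Appendix E) of an extended LVF whose full-input module $M(s, u)$ is non-positive and vanishes at the local greedy joint action, so the added capacity does not break IGM:

```python
import numpy as np

# One admissible instance of an extended LVF (Eq. 7 style): a full-input
# module M(s,u) = -softplus(g(s,u)) * 1[u != u_gre] is non-positive and equals
# zero at the local greedy joint action, so argmax_u F stays at u_gre.

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(2)
m, n = 3, 2
q_loc = rng.normal(size=(n, m))            # local Q tables
w = rng.uniform(0.5, 1.5, size=n)          # positive linear weights
u_gre = q_loc.argmax(axis=1)               # per-agent greedy actions

joint = np.stack(np.meshgrid(*[np.arange(m)] * n, indexing="ij"), -1).reshape(-1, n)
g = rng.normal(size=len(joint))            # stand-in for a learned module output

def f_ext(k):
    u = joint[k]
    lin = sum(w[i] * q_loc[i, u[i]] for i in range(n))
    gate = 0.0 if np.array_equal(u, u_gre) else 1.0
    return lin - gate * softplus(g[k])     # M(s,u) <= 0, with M = 0 at u_gre

q_joint = np.array([f_ext(k) for k in range(len(joint))])
assert np.array_equal(joint[q_joint.argmax()], u_gre)  # IGM preserved
```

The linear part is already maximized at the local greedy joint action, and every other joint action is only pushed further down by the non-positive module, so the greedy identity of Eq. 2 survives the extra capacity.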

4.2. NON-LINEAR VALUE FACTORIZATION FUNCTION

Linear value factorization functions constitute only a small part of the whole value factorization function family. In this subsection, we discuss Non-linear Value Factorization (NVF) functions, for which $\partial F(Q_{set}(\tau, u))/\partial Q_i(\tau_i, u_i) = f_i(Q_{set}(\tau, u), s_i, \hat{u}_i)$, where $S_i \times \hat{U}_i \subset S \times U$. There are two different approaches to improving the representation capability of the function: (1) introducing parameterized functions directly; (2) introducing parameterized modules. Let $F_\theta(Q_{set}(\tau, u))$ denote the parameterized function, where $\theta$ is the collection of introduced parameters. A defect of $F_\theta(Q_{set}(\tau, u))$ is the uncontrollable sign of the derivatives with respect to the local Q value functions. As a result, the function suffers from poor convergence. More details are provided in Appendix E. Consider the second approach, i.e., introducing parameterized modules into a predefined NVF function. Let $M_{set}(s, u) := \{M_1(s_1, \hat{u}_1), \cdots, M_k(s_k, \hat{u}_k)\}$ denote the introduced modules; we have $Q(\tau, u) = F(Q_{set}(\tau, u), M_{set}(s, u))$. We denote the partial derivative with respect to $Q_i(\tau_i, u_i)$ by $F_i := \partial F(Q_{set}(\tau, u), M_{set}(s, u))/\partial Q_i(\tau_i, u_i)$ ($i \in [1, n]$). For good convergence, we expect $F_i > 0$ for $Q_i(\tau_i, u_i) \in (-\infty, Q_i(\tau_i, u_{i,gre})]$, which is a stricter constraint than the IGM principle. Since the function $F(\cdot)$ is predefined with a fixed form, a necessary condition for CRC is $\exists M_j(s_j, \hat{u}_j)$ with $(s_j, \hat{u}_j) = (s, u)$ ($j \in [1, k]$). Based on the constraints above, we list some examples of NVF functions in Appendix E.
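The defect of the first approach fits in a two-line example (ours; the weights are hypothetical): once the mixer $F_\theta$ is an unconstrained parameterized function, nothing keeps $\partial F_\theta / \partial Q_i$ positive, so a gradient step can push a local Q value in the wrong direction:

```python
import numpy as np

# An unconstrained parameterized mixer can have a negative partial derivative
# with respect to a local Q value: here increasing Q_1 *decreases* the joint
# estimate, so TD updates would move Q_1 away from its correct target.

w = np.array([-0.8, 0.5])          # hypothetical learned weights, w_1 < 0

def f_theta(q1, q2):
    return np.tanh(w[0] * q1 + w[1] * q2)

# Finite-difference estimate of dF/dQ_1 at an arbitrary point.
eps = 1e-5
q1, q2 = 0.3, -0.1
dF_dq1 = (f_theta(q1 + eps, q2) - f_theta(q1 - eps, q2)) / (2 * eps)
assert dF_dq1 < 0                  # the gradient sign is uncontrolled
```

Since $\tanh$ is strictly increasing, the sign of $\partial F_\theta / \partial Q_1$ is exactly the sign of $w_1$, and nothing in an unconstrained parameterization prevents $w_1 < 0$.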

5. METHODOLOGY

In value factorization, multiple true Q values are represented through a shared local Q value function $Q_i(\tau_i, u_i)$. Methods that ignore this correlation of representation would suffer from poor convergence: the gradient on $Q_i(\tau_i, u_i^*)$ contributed by the optimal true Q value is interfered with, or even submerged in, the gradients contributed by the correlated non-optimal true Q values. In other words, the representation of $Q(s, u^*)$ is interfered with through its representation correlation with the non-optimal true Q values. Let $w_i^*$ denote the relative weight of the gradient contributed by $Q(\tau, u^*)$ among all true Q values involving $u_i^*$, i.e.,
$$w_i^* = \frac{\pi(u^*|s) \cdot \frac{\partial F}{\partial Q_i}\big|_{u=u^*}}{\sum_{u_{\backslash i} \in U^{n-1}} \pi(u_i^*, u_{\backslash i}|s) \cdot \frac{\partial F}{\partial Q_i}\big|_{u=\{u_i^*, u_{\backslash i}\}}} \quad (8)$$
where $u_{\backslash i}$ denotes the joint action of all agents except agent $i$. The representation interference on $Q(\tau, u^*)$ is negatively correlated with $w_i^*$. For a linear value factorization function, according to Eq. 7, we have $\partial F / \partial Q_i = w_i$ for $\forall i \in [1, n]$ and $\forall u \in U^n$, so $w_i^* = \frac{\pi(u^*|s)}{\sum_{u_{\backslash i} \in U^{n-1}} \pi(u_i^*, u_{\backslash i}|s)}$ is mainly determined by the sample distribution. By contrast, for a non-linear value factorization function, $\partial F / \partial Q_i$ is a function of $u$. An example is QPLEX, where the representation interference is serious due to the sharply decreasing $w_i^*$ during training. More details and an analysis of QPLEX can be found in Appendix F. Since the representation interference on $Q(\tau, u^*)$ is related to the form of the value factorization function, we consider designing a non-linear value factorization function to address the problem.
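As a worked instance of Eq. 8 (our example; the optimal action and the uniform policy are assumptions), for a linear factorization the constant $w_i$ cancels, and $w_i^*$ collapses to a ratio of policy probabilities, equal to $1/m^{n-1}$ under a uniform policy:

```python
import numpy as np
from itertools import product

# Eq. 8 for a *linear* factorization: dF/dQ_i is the constant w_i, so
# w*_i = pi(u*|s) / sum_{u_{\i}} pi(u*_i, u_{\i} | s).
# Under a uniform policy over m**n joint actions this equals 1 / m**(n-1).

m, n, i = 3, 2, 0
u_star = (1, 2)                                           # hypothetical optimal joint action
pi = {u: 1.0 / m**n for u in product(range(m), repeat=n)} # uniform policy

num = pi[u_star]
den = sum(p for u, p in pi.items() if u[i] == u_star[i])  # all u sharing u*_i
w_star_i = num / den

assert np.isclose(w_star_i, 1.0 / m**(n - 1))
```

With a uniform policy the denominator sums $m^{n-1}$ equal probabilities, so the optimal true Q value contributes only a $1/m^{n-1}$ share of the gradient on $Q_i(\tau_i, u_i^*)$, which shrinks quickly as the number of agents grows.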

5.2. REPRESENTATION INTERFERENCE SUPPRESSION VIA NON-LINEAR VALUE FACTORIZATION

To alleviate the representation interference on $Q(u^*)$, we consider raising the relative weight of the gradient contributed by $Q(u^*)$, i.e., $w_i^*$. Referring to Eq. 8, $w_i^*$ is determined by the sample distribution and the partial derivative of $F(Q_i(u_i^*), Q_{\backslash i}(u_{\backslash i}))$ with respect to $Q_i(u_i^*)$, where $Q_{\backslash i}(u_{\backslash i})$ denotes the set of all agents' local Q value functions except agent $i$'s. Note that $w_i^*$ continuously decreases during training in QPLEX, whose value factorization function is
$$F(Q_{set}(\tau, u), M_{set}(s, u)) = -\sum_{i=1}^n |M_i(s, u)| \cdot [Q_i(\tau_i, u_{i,gre}) - Q_i(\tau_i, u_i)] + \sum_{i=1}^n Q_i(\tau_i, u_{i,gre}) \quad (9)$$
We make a slight change to the function to reverse the trend:
$$F(Q_{set}(\tau, u), M_{set}(s, u)) = -\sum_{i=1}^n \left[Q_i(\tau_i, u_{i,gre}) - e^{-I(u \neq u_{gre}) \cdot |M_i(s, u)|} \cdot Q_i(\tau_i, u_i)\right] + \sum_{i=1}^n Q_i(\tau_i, u_{i,gre}) = \sum_{i=1}^n e^{-I(u \neq u_{gre}) \cdot |M_i(s, u)|} \cdot Q_i(\tau_i, u_i) \quad (10)$$
where $I(u \neq u_{gre})$ equals 1 if $u \neq u_{gre}$ and 0 otherwise. When $Q_i(\tau_i, u_i) > 0$, we have $\partial F / \partial |M_i(s, u)| = -e^{-I(u \neq u_{gre}) \cdot |M_i(s, u)|} Q_i(\tau_i, u_i) < 0$, i.e., $|M_i(s, u)|$ decreases as $Q(\tau, u) = F(Q_{set}(\tau, u), M_{set}(s, u))$ grows. Let $F_i := \partial F(Q_{set}(u), M_{set}(u)) / \partial Q_i(\tau_i, u_i)$ denote the partial derivative with respect to $Q_i(\tau_i, u_i)$. We have $F_i = e^{-I(u \neq u_{gre}) \cdot |M_i(s, u)|}$, which is negatively related to $|M_i(s, u)|$. Therefore, $F_i$ is positively related to $Q(\tau, u)$ when $Q_i(\tau_i, u_i) > 0$. To ensure $Q_i(\tau_i, u_i) > 0$, we replace $Q_i(\tau_i, u_i)$ with $|Q_i(\tau_i, u_i)|$:
$$F(Q_{set}(\tau, u), M_{set}(s, u)) = \sum_{i=1}^n e^{-I(u \neq u_{gre}) \cdot |M_i(s, u)|} \cdot |Q_i(\tau_i, u_i)| + V(s) \quad (11)$$
where $V(s)$ enables $F(Q_{set}(\tau, u), M_{set}(s, u))$ to represent negative true Q values. Based on the value factorization function above, we introduce our method, namely Q Factorization with Representation Interference Suppression (QFRIS).
The value factorization function of QFRIS equals
$$F_{QFRIS}(Q_{set}(\tau, u), M_{set}(s, u)) = e^{-|M(s, u)|^2 \cdot I(u \neq u_{gre})} \sum_{i=1}^n Q_i(\tau_i, u_i) - |M(s, u)| \cdot I(u \neq u_{gre}) + V(s)$$
Obviously, our QFRIS satisfies both the IGM and CRC conditions. The network structure of QFRIS is provided in Appendix G.
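A minimal sketch of the QFRIS-style factorization above (ours, not the released implementation): the module output $M(s, u)$ is replaced by a random table, and the local Q values are kept positive, matching the $|Q_i|$ trick in the text. The check confirms that down-weighting non-greedy joint actions leaves the greedy joint action consistent with the local greedy actions (IGM):

```python
import numpy as np

# QFRIS-style mixing (our sketch): non-greedy joint actions are scaled by
# exp(-|M|^2) and shifted by -|M|, suppressing the gradient they contribute to
# the shared local Q functions, while the greedy joint action is untouched.
# Positivity of the local Q values (the |Q_i| trick) is assumed.

rng = np.random.default_rng(3)
m, n = 3, 2
q_loc = rng.uniform(0.1, 1.0, size=(n, m))   # positive local Q tables
u_gre = q_loc.argmax(axis=1)
v = -0.4                                     # V(s): allows negative joint values

joint = np.stack(np.meshgrid(*[np.arange(m)] * n, indexing="ij"), -1).reshape(-1, n)
m_out = rng.normal(size=len(joint))          # stand-in for the learned module M(s,u)

def f_qfris(k):
    u = joint[k]
    ind = 0.0 if np.array_equal(u, u_gre) else 1.0   # I(u != u_gre)
    s = sum(q_loc[i, u[i]] for i in range(n))
    return np.exp(-m_out[k] ** 2 * ind) * s - abs(m_out[k]) * ind + v

q_joint = np.array([f_qfris(k) for k in range(len(joint))])
assert np.array_equal(joint[q_joint.argmax()], u_gre)    # IGM holds
```

With positive local Q values, every non-greedy joint action is multiplied by a factor at most 1 and then shifted down, so it can never overtake the untouched greedy value; this is the sense in which the suppression weighting coexists with IGM.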

6. EXPERIMENTS

Our experiments consist of four parts. Firstly, we verify our propositions on a finite Markov game; secondly, we compare the QFRIS value factorization function with the value factorization functions of other methods; thirdly and fourthly, we evaluate the performance of QFRIS on predator-prey and the StarCraft Multi-Agent Challenge (SMAC). The latter three parts are available in Appendix H. We design toy games for both the decomposable and indecomposable cases of Fig. 3.1 and carry out experiments to verify our propositions about the decomposability of Markov games. The tasks are shown in Fig. 6(a), where 4 agents (denoted by dots) need to cover 2 landmarks (denoted by squares) in pairs. The map is gridded as a 4 × 4 checkerboard. All agents are initialized at position (3, 0) and select an action from {up, right} at each time step. Each agent is assigned a target landmark in color. The team receives an instant reward of 1.0 when any agent covers its target landmark. For the indecomposable case, the team only receives the reward when a landmark is covered by the first 2 agents. Invalid actions, e.g., up at position (0, 0), are masked. The true Q value function of the decomposable case can be linearly factorized as $Q(s, u) = Q_{33,1}(s_1, u_1, u_2, u_3) + Q_{33,2}(s_2, u_1, u_2, u_4)$ or $Q(s, u) = Q_1(s_1, u_1) + Q_2(s_2, u_2) + Q_3(s_3, u_3) + Q_4(s_4, u_4)$. We apply two neural networks, denoted by $Q_{33}(s, u)$ and $Q_{MGD}(s, u)$, to model $Q(s, u)$ under the two decompositions, respectively. Each network sums the approximated true Q value functions of the decomposed Markov games, e.g., those of Fig. 3.1(d) for $Q_{33}(s, u)$. To verify the factorizability of $Q(s, u)$, we evaluate the estimation errors of $Q_{33}(s, u)$ and $Q_{MGD}(s, u)$. To be specific, we approximate $Q(s, u)$ by a non-factorized neural network. As shown in Fig. 6(b), based on the positions of all agents, there are in total 98 states in the first 3 time steps.
At each state, we calculate the Root Mean Square Error (RMSE) of $Q_{33}(s, u)$ and $Q_{MGD}(s, u)$, respectively, e.g.,
$$RMSE_{33}(s) = \left[\frac{1}{m^n} \sum_{u \in U} [Q(s, u) - Q_{33}(s, u)]^2\right]^{\frac{1}{2}}$$
All agents follow random policies. The experimental results after 6k training steps are shown in Fig. 6(c), where each bar denotes the result of a single state. The estimation errors of $Q_{33}(s, u)$ and $Q_{MGD}(s, u)$ are negligible for the decomposable case but sizable for the indecomposable case, which suggests $Q(s, u)$ is linearly factorizable only if the Markov game is decomposable. We also test the return under $Q_{33}(s, u)$ and $Q_{MGD}(s, u)$ in both the decomposable and indecomposable cases. The results are shown in Fig. 6(d). The task is solved when all agents cover their target landmarks, i.e., the return equals 4.0. Both $Q_{33}(s, u)$ and $Q_{MGD}(s, u)$ are able to handle the decomposable case. But for the indecomposable case, both joint Q value functions fail to solve the task since the true Q value function is not linearly factorizable. Verification of Proposition 2. According to Proposition 2, the estimate of the state value function is unbiased for LVF under value iteration in the sarsa manner. We model the linearly factorized joint Q value function under the MGD and the non-factorized true Q value function by $Q_{MGD}(s, u)$ and $Q(s, u)$, respectively. The state value function can be approximated by $V(s) = \sum_{u \in U} \pi(u|s) Q(s, u)$. The estimated state value function of LVF equals $V_{MGD}(s) = \sum_{u \in U} \pi(u|s) Q_{MGD}(s, u)$. Note that the target of $Q(s_t, u_t)$ equals $r(s_t, u_t) + \gamma P(s_{t+1}|s_t, u_t) V(s_{t+1})$ for sarsa. To evaluate the error of the representation target for LVF in indecomposable Markov games, we calculate the difference between $V(s)$ and $V_{MGD}(s)$. Besides, the target of $Q(s_t, u_t)$ equals $r(s_t, u_t) + \gamma P(s_{t+1}|s_t, u_t) \max_{u_{t+1}} Q(s_{t+1}, u_{t+1})$ for Q-learning. We also calculate the difference between $\max_{u_t} Q(s_t, u_t)$ and $\max_{u_t} Q_{MGD}(s_t, u_t)$. The experimental results are shown in Fig. 6, where each bar denotes the result of a single state. From Fig. 6 we can see that for $Q_{MGD}(s, u)$ trained by sarsa value iteration, the difference between $V(s)$ and $V_{MGD}(s)$ is negligible in both decomposable and indecomposable Markov games. By contrast, for $Q_{MGD}(s, u)$ trained by Q-learning value iteration, the difference between $\max_u Q(s_t, u_t)$ and $\max_u Q_{MGD}(s_t, u_t)$ is sizable in the indecomposable case. The experimental results indicate that although the true Q value function is not linearly factorizable in indecomposable Markov games, the representation target of a linearly factorized joint Q value function is still unbiased under value iteration in the sarsa manner. However, for a linearly factorized joint Q value function trained by Q-learning, the representation target is biased in indecomposable Markov games.
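The per-state error metric used above can be written as a small helper (our sketch): the RMSE between a factorized and a non-factorized estimate of $Q(s, \cdot)$ over all $m^n$ joint actions:

```python
import numpy as np

# Per-state RMSE between two Q estimates evaluated at every joint action of
# one state s; the factorized estimate plays the role of Q_33 or Q_MGD above.

def rmse(q_state, q_fac_state):
    """Both arguments are length-m**n arrays of Q values for a single state."""
    q_state, q_fac_state = np.asarray(q_state), np.asarray(q_fac_state)
    return np.sqrt(np.mean((q_state - q_fac_state) ** 2))

assert np.isclose(rmse([1.0, 3.0], [1.0, 3.0]), 0.0)
assert np.isclose(rmse([0.0, 2.0], [0.0, 0.0]), np.sqrt(2.0))
```

A negligible value of this metric at every state is exactly the evidence used above for the linear factorizability of the decomposable case.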

7. CONCLUSION

In this paper, we define the decomposability of Markov games and prove that the true Q value function is linearly factorizable if and only if the Markov game is decomposable. LVF is perfectly applicable only in decomposable Markov games where each element of the MGD involves a single agent. We also prove that in indecomposable Markov games, the estimate of the state value function is still unbiased for LVF under value iteration in the sarsa manner. In addition to theoretical proofs, our conclusions are verified in experiments on a toy game. To deal with indecomposable Markov games, we explore the general form of value factorization functions that satisfy both the IGM and CRC conditions. A common problem of these functions is the representation interference on the optimal true Q value function. To address this problem, we design a non-linear value factorization function that adaptively reweights the gradients contributed by different true Q values. Our method, namely QFRIS, is shown to effectively address the representation interference in experiments on matrix games. Besides, comparisons with baselines in predator-prey and SMAC demonstrate the good convergence of our method.

A PROOF OF PROPOSITION 1

A.1 PROOF OF SUFFICIENCY

Consider value iteration in the sarsa manner, and suppose $Q(s_{t+1}, u_{t+1})$ is linearly factorizable as $Q(s_{t+1}, u_{t+1}) = \sum_{i=1}^k Q_i(s_{i,t+1}, \hat{u}_{i,t+1})$. The state value function equals
$$V(s_{t+1}) = \sum_{i=1}^k \int_{u_{t+1}} \pi(u_{t+1}|s_{t+1}) Q_i(s_{i,t+1}, \hat{u}_{i,t+1})\, du_{t+1} \quad (14)$$
Let $\hat{u}_{\backslash i,t}$ denote the collection of the joint action except $\hat{u}_{i,t}$, i.e., $\hat{u}_{i,t} \cup \hat{u}_{\backslash i,t} = u_t$ and $\hat{u}_{i,t} \cap \hat{u}_{\backslash i,t} = \emptyset$. Since the local policies are decentralized, the local actions are independent of each other.
We have
$$V(s_{t+1}) = \sum_{i=1}^k \int_{u_{t+1}} \prod_{j=1}^k \pi_j(\hat{u}_{j,t+1}|s_{j,t+1})\, Q_i(s_{i,t+1}, \hat{u}_{i,t+1})\, du_{t+1} = \sum_{i=1}^k \int_{\hat{u}_{i,t+1}} \Big[\prod_{j \neq i} \int_{\hat{u}_{j,t+1}} \pi_j(\hat{u}_{j,t+1}|s_{j,t+1})\, d\hat{u}_{j,t+1}\Big] \pi_i(\hat{u}_{i,t+1}|s_{i,t+1}) Q_i(s_{i,t+1}, \hat{u}_{i,t+1})\, d\hat{u}_{i,t+1} = \sum_{i=1}^k \int_{\hat{u}_{i,t+1}} \pi_i(\hat{u}_{i,t+1}|s_{i,t+1}) Q_i(s_{i,t+1}, \hat{u}_{i,t+1})\, d\hat{u}_{i,t+1} \quad (15)$$
Let $V_i(s_{i,t+1}) := \int_{\hat{u}_{i,t+1}} \pi_i(\hat{u}_{i,t+1}|s_{i,t+1}) Q_i(s_{i,t+1}, \hat{u}_{i,t+1})\, d\hat{u}_{i,t+1}$ denote the state value function of $MG_i$. We have $V(s_{t+1}) = \sum_{i=1}^k V_i(s_{i,t+1})$. According to Definition 1, the reward function is linearly factorizable in a decomposable Markov game. The true Q value function equals
$$Q(s_t, u_t) = r(s_t, u_t) + \gamma \int_{s_{t+1}} P(s_{t+1}|s_t, u_t) V(s_{t+1})\, ds_{t+1} = \sum_{i=1}^k \Big[r_i(s_{i,t}, \hat{u}_{i,t}) + \gamma \int_{s_{t+1}} P(s_{t+1}|s_t, u_t) V_i(s_{i,t+1})\, ds_{t+1}\Big] \quad (16)$$
Note that $S_i$ is a subspace of $S$. We have
$$\int_{s_{t+1}} P(s_{t+1}|s_t, u_t) V_i(s_{i,t+1})\, ds_{t+1} = \int_{s_{i,t+1}} P(s_{i,t+1}|s_t, u_t) V_i(s_{i,t+1})\, ds_{i,t+1} \quad (17)$$
According to the second condition in Definition 1, $P(s_{i,t+1}|s_{i,t}, \hat{u}_{i,t}) = P(s_{i,t+1}|s_t, u_t)$. We have
$$Q(s_t, u_t) = \sum_{i=1}^k \Big[r_i(s_{i,t}, \hat{u}_{i,t}) + \gamma \int_{s_{i,t+1}} P(s_{i,t+1}|s_t, u_t) V_i(s_{i,t+1})\, ds_{i,t+1}\Big] = \sum_{i=1}^k \Big[r_i(s_{i,t}, \hat{u}_{i,t}) + \gamma \int_{s_{i,t+1}} P(s_{i,t+1}|s_{i,t}, \hat{u}_{i,t}) V_i(s_{i,t+1})\, ds_{i,t+1}\Big] = \sum_{i=1}^k Q_i(s_{i,t}, \hat{u}_{i,t}) \quad (18)$$
For the joint Q value function under value iteration in the Q-learning manner, suppose $Q(s_{t+1}, u_{t+1})$ is linearly factorizable; we have
$$\max_{u_{t+1}} Q(s_{t+1}, u_{t+1}) = \sum_{i=1}^k \max_{\hat{u}_{i,t+1}} Q_i(s_{i,t+1}, \hat{u}_{i,t+1}) \quad (19)$$
According to the properties of the decomposable Markov game, we have
$$Q(s_t, u_t) = r(s_t, u_t) + \gamma \int_{s_{t+1}} P(s_{t+1}|s_t, u_t) \cdot \max_{u_{t+1}} Q(s_{t+1}, u_{t+1})\, ds_{t+1} = \sum_{i=1}^k \Big[r_i(s_{i,t}, \hat{u}_{i,t}) + \gamma \int_{s_{t+1}} P(s_{t+1}|s_t, u_t) \cdot \max_{\hat{u}_{i,t+1}} Q_i(s_{i,t+1}, \hat{u}_{i,t+1})\, ds_{t+1}\Big] = \sum_{i=1}^k \Big[r_i(s_{i,t}, \hat{u}_{i,t}) + \gamma \int_{s_{i,t+1}} P_i(s_{i,t+1}|s_{i,t}, \hat{u}_{i,t}) \max_{\hat{u}_{i,t+1}} Q_i(s_{i,t+1}, \hat{u}_{i,t+1})\, ds_{i,t+1}\Big] = \sum_{i=1}^k Q_i(s_{i,t}, \hat{u}_{i,t}) \quad (20)$$
We have proved that if $Q(s_{t+1}, u_{t+1})$ is linearly factorizable, then $Q(s_t, u_t)$ is also linearly factorizable.
For a finite Markov game, let $Q(s_T, u_T) = Q_i(s_{i,T}, \hat{u}_{i,T}) = 0$ for $\forall i \in [1, k]$, where $T$ is the terminal time step. Since $Q(s_T, u_T) = \sum_{i=1}^k Q_i(s_{i,T}, \hat{u}_{i,T}) = 0$ is linearly factorizable, $Q(s_t, u_t)$ is linearly factorizable for $\forall t \in [0, T]$. The factorizability of $Q(s_t, u_t)$ in decomposable Markov games is proved.

A.2 PROOF OF NECESSITY

Given a Markov game $MG = \langle S, U, P, r, n, \gamma\rangle$, suppose the true Q value function $Q(s_t, u_t)$ is linearly factorizable as $Q(s_t, u_t) = \sum_{i=1}^k Q_i(s_{i,t}, \hat{u}_{i,t})$, where $(s_t, u_t) \in S \times U$ and $(s_{i,t}, \hat{u}_{i,t}) \in S_i \times \hat{U}_i$ ($i \in [1, k]$). $S_i \times \hat{U}_i$ is a subspace of the joint state-action space (i.e., $S_i \times \hat{U}_i \subset S \times U$). We have
$$Q(s_t, u_t) = r(s_t, u_t) + \gamma \int_{s_{t+1}} P(s_{t+1}|s_t, u_t) V(s_{t+1})\, ds_{t+1} = \sum_{i=1}^k Q_i(s_{i,t}, \hat{u}_{i,t}) \quad (21)$$
Note that $r(s_t, u_t)$ is independent of $V(s_{t+1})$, because $r(s_t, u_t)$ is the reward of the current time step while $V(s_{t+1})$ is determined by the policies, transitions, and rewards of future time steps. Therefore, Eq. 21 holds if and only if both $r(s_t, u_t)$ and $\int_{s_{t+1}} P(s_{t+1}|s_t, u_t) V(s_{t+1})\, ds_{t+1}$ are linearly factorizable as
$$r(s_t, u_t) = \sum_{i=1}^k r_i(s_{i,t}, \hat{u}_{i,t}), \qquad \int_{s_{t+1}} P(s_{t+1}|s_t, u_t) V(s_{t+1})\, ds_{t+1} = \sum_{i=1}^k f_i(s_{i,t}, \hat{u}_{i,t}) \quad (22)$$
Let $s_{\backslash i,t}$ denote the dimensions of $s_t$ other than $s_{i,t}$, i.e., $s_{i,t} \cup s_{\backslash i,t} = s_t$ and $s_{i,t} \cap s_{\backslash i,t} = \emptyset$. Note that $Q_i(s_{i,t}, \hat{u}_{i,t}) = r_i(s_{i,t}, \hat{u}_{i,t}) + \gamma \int_{s_{i,t+1}} P_i(s_{i,t+1}|s_{i,t}, \hat{u}_{i,t}) V_i(s_{i,t+1})\, ds_{i,t+1}$ and $Q_i(s_{i,t}, \hat{u}_{i,t})$ is independent of $(s_{\backslash i,t}, u_{\backslash i,t})$. We have
$$P_i(s_{i,t+1}|s_{i,t}, \hat{u}_{i,t}) = P_i(s_{i,t+1}|s_{i,t}, s_{\backslash i,t}, \hat{u}_{i,t}, u_{\backslash i,t}) = P_i(s_{i,t+1}|s_t, u_t) \quad (23)$$
According to Definition 1, $MG$ is decomposable by $\{MG_1, MG_2, \cdots, MG_k\}$.

B PROOF OF LEMMA 1

Given a Markov game $MG = \langle S, U, P, r, n, \gamma\rangle$, $\{MG_1, MG_2, \cdots, MG_k\}$ ($k \ge 2$) is a decomposition of $MG$.
Suppose (1) $\overline{MG}_j = \langle \bar S_j, \bar U_j, P, \bar r_j, \bar n_j, \gamma\rangle$ is decomposable by a non-empty subset of $\{MG_1, MG_2,\cdots, MG_k\}$ for all $j\in[1,k_s]$; (2) $\cup_{j=1}^{k_s}\{\bar S_j\times\hat{\bar U}_j\} = S\times U$. Let $A_j = [A_{1,j}, A_{2,j},\cdots, A_{k,j}]$ ($j\in[1,k_s]$) denote an indicator vector, where $A_{i,j} = 1$ ($i\in[1,k]$) if $MG_i$ is an element of the decomposition of $\overline{MG}_j$, and $A_{i,j} = 0$ otherwise. According to the definition of the decomposable Markov game, $S_i\times\hat U_i\subset \bar S_j\times\hat{\bar U}_j$ if $A_{i,j} = 1$. We have $\bar s_j = \cup_{i=1}^{k} A_{i,j}\cdot s_i$ and $\hat{\bar u}_j = \cup_{i=1}^{k} A_{i,j}\cdot\hat u_i$. Let $\bar r_j(\bar s_j,\hat{\bar u}_j)$ denote the reward function of $\overline{MG}_j$, which is defined as
$$\bar r_j(\bar s_j,\hat{\bar u}_j) = \sum_{i=1}^{k}\frac{A_{i,j}\cdot r_i(s_i,\hat u_i)}{\sum_{j'=1}^{k_s} A_{i,j'}} \quad (24)$$
According to the second condition above, we have $\cup_{j=1}^{k_s}\hat{\bar u}_j = u$. Note that $\cup_{i=1}^{k}\hat u_i = u$ since $\{MG_1, MG_2,\cdots, MG_k\}$ ($k\ge 2$) is a decomposition of $MG$. Therefore,
$$\cup_{j=1}^{k_s}\hat{\bar u}_j = \cup_{j=1}^{k_s}\cup_{i=1}^{k} A_{i,j}\cdot\hat u_i = \cup_{i=1}^{k}\big(\cup_{j=1}^{k_s} A_{i,j}\big)\cdot\hat u_i = \cup_{i=1}^{k}\hat u_i$$
which indicates that $\sum_{j=1}^{k_s} A_{i,j}\ge 1$ for all $i\in[1,k]$. In words, the denominators in Eq. 24 are non-zero. The sum of the reward functions of $\{\overline{MG}_1, \overline{MG}_2,\cdots,\overline{MG}_{k_s}\}$ equals
$$\sum_{j=1}^{k_s}\bar r_j(\bar s_j,\hat{\bar u}_j) = \sum_{j=1}^{k_s}\sum_{i=1}^{k}\frac{A_{i,j}\cdot r_i(s_i,\hat u_i)}{\sum_{j'=1}^{k_s} A_{i,j'}} = \sum_{i=1}^{k}\frac{\sum_{j=1}^{k_s} A_{i,j}\cdot r_i(s_i,\hat u_i)}{\sum_{j'=1}^{k_s} A_{i,j'}} = \sum_{i=1}^{k} r_i(s_i,\hat u_i) = r(s,u)$$
We have proved that the reward function is linearly factorizable on the collection of state-action spaces of $\{\overline{MG}_1, \overline{MG}_2,\cdots,\overline{MG}_{k_s}\}$. Besides, since $\{MG_1, MG_2,\cdots, MG_k\}$ is a decomposition of $MG$, we have $P(s_{i,t+1}|s_{i,t},\hat u_{i,t}) = P(s_{i,t+1}|s_t,u_t)$ for all $i\in[1,k]$. Note that $\bar s_j = \cup_{i=1}^{k} A_{i,j}\cdot s_i$ and $\hat{\bar u}_j = \cup_{i=1}^{k} A_{i,j}\cdot\hat u_i$. We have
$$P(\bar s_{j,t+1}|\bar s_{j,t},\hat{\bar u}_{j,t}) = P(\cup_{i=1}^{k} A_{i,j}\cdot s_{i,t+1} \mid \cup_{i=1}^{k} A_{i,j}\cdot s_{i,t},\ \cup_{i=1}^{k} A_{i,j}\cdot\hat u_{i,t}) = P(\cup_{i=1}^{k} A_{i,j}\cdot s_{i,t+1}|s_t,u_t) = P(\bar s_{j,t+1}|s_t,u_t)$$
According to the definition of the decomposable Markov game, $\{\overline{MG}_1, \overline{MG}_2,\cdots,\overline{MG}_{k_s}\}$ is a decomposition of $MG$.
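As a sanity check on the sufficiency direction (a decomposable Markov game has a linearly factorizable true Q function), the sketch below builds two independent random MDPs, composes them into a product game with summed rewards and independent transitions, and verifies that Q-learning value iteration on the joint game yields exactly the sum of the component Q functions. All sizes and the random seed are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Q-learning-style value iteration: Q(s,a) = R(s,a) + gamma * E_s'[max_a' Q(s',a')]."""
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * P @ V  # P has shape (nS, nA, nS)
    return Q

rng = np.random.default_rng(0)

def random_mdp(nS, nA):
    P = rng.random((nS, nA, nS))
    P /= P.sum(axis=-1, keepdims=True)
    return P, rng.random((nS, nA))

# Two independent component games MG_1 and MG_2 (hypothetical sizes).
(P1, R1), (P2, R2) = random_mdp(2, 2), random_mdp(3, 2)

# Product game: state (s1, s2), action (a1, a2); reward is the sum and the
# components transition independently, i.e., a decomposable Markov game.
S = list(itertools.product(range(2), range(3)))
A = list(itertools.product(range(2), range(2)))
P = np.zeros((len(S), len(A), len(S)))
R = np.zeros((len(S), len(A)))
for i, (s1, s2) in enumerate(S):
    for j, (a1, a2) in enumerate(A):
        R[i, j] = R1[s1, a1] + R2[s2, a2]
        for k, (t1, t2) in enumerate(S):
            P[i, j, k] = P1[s1, a1, t1] * P2[s2, a2, t2]

Q = value_iteration(P, R)      # joint true Q
Q1 = value_iteration(P1, R1)   # local Q of MG_1
Q2 = value_iteration(P2, R2)   # local Q of MG_2

# Q(s, u) = Q1(s1, a1) + Q2(s2, a2) for every joint state-action pair.
for i, (s1, s2) in enumerate(S):
    for j, (a1, a2) in enumerate(A):
        assert abs(Q[i, j] - (Q1[s1, a1] + Q2[s2, a2])) < 1e-6
```

The equality holds at every iteration of value iteration, mirroring the induction argument in the proof: if it holds at step t+1, the max over the joint action splits into per-component maxes and it holds at step t.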

C RANK OF THE COEFFICIENT MATRIX

In indecomposable Markov games with discrete action spaces, the representation of the true Q value function is equivalent to solving the linear equation system
$$Q(s,u) = \sum_{i=1}^{n} w_i Q_i(s,u_i) + V(s), \quad \forall u\in U \quad (28)$$
For simplicity, we omit the state value function $V(s)$. Let $\{1,2,\cdots,m\}$ denote the discrete local action space. The joint Q value function can be written as
$$Q(s,u) = \sum_{i=1}^{n}\sum_{a=1}^{m} I(u_i = a)\, Q_i(a) = \big[\underbrace{I(u_1=1)\cdots I(u_1=m)}_{\text{agent }1}\ \cdots\ \underbrace{I(u_n=1)\cdots I(u_n=m)}_{\text{agent }n}\big]\cdot\big[\underbrace{Q_1(1)\cdots Q_1(m)}_{\text{agent }1}\ \cdots\ \underbrace{Q_n(1)\cdots Q_n(m)}_{\text{agent }n}\big]^\top \quad (29)$$
Here we omit the states in all inputs. In a Markov game with a discrete action space, the true Q values of all permutations of the joint actions constitute the complete set of representation targets. For example, all permutations of 2-agent joint actions are
$$\begin{bmatrix} (1,1) & (2,1) & \cdots & (m,1)\\ (1,2) & (2,2) & \cdots & (m,2)\\ \vdots & & & \vdots\\ (1,m) & (2,m) & \cdots & (m,m) \end{bmatrix} \quad (30)$$
Eq. 28 is equivalent to the following matrix equation in 2-agent cases:
$$A^2\times Q^2_{loc} = Q^2 \quad (31)$$
where each row of $A^2$ is the indicator vector of Eq. 29 for the corresponding joint action, and
$$Q^2_{loc} = [Q_1(1), Q_1(2),\cdots,Q_1(m), Q_2(1), Q_2(2),\cdots,Q_2(m)]^\top, \qquad Q^2 = [Q(1,1), Q(2,1),\cdots,Q(m,1),\cdots,Q(1,m), Q(2,m),\cdots,Q(m,m)]^\top \quad (32)$$
The coefficient matrix $A^2$ can be represented block-wise by
$$A^2 = \begin{bmatrix} E_m & A^2_1\\ E_m & A^2_2\\ \vdots & \vdots\\ E_m & A^2_m \end{bmatrix}, \qquad A^2_i = \big[O^2_{i^-}\ \ I^2\ \ O^2_{i^+}\big] \quad (33)$$
where $E_m$ is the $m$-dimensional identity matrix, and $O^2_{i^-}$ and $O^2_{i^+}$ are zero matrices of size $m\times(i-1)$ and $m\times(m-i)$ ($i\in[1,m]$), respectively.
$I^2$ is an $m$-dimensional column vector with all elements equal to 1. Note that
$$rk\begin{bmatrix} E_m & A^2_1\\ E_m & A^2_2\end{bmatrix} = m+1$$
We have $rk(A^2) = m + (m-1) = 2m-1$. Now we extend the 2-agent case to the 3-agent case, where
$$A^3 = \begin{bmatrix} A^2 & A^3_1\\ A^2 & A^3_2\\ \vdots & \vdots\\ A^2 & A^3_m\end{bmatrix}, \qquad A^3_i = \big[O^3_{i^-}\ \ I^3\ \ O^3_{i^+}\big] \quad (34)$$
$O^3_{i^-}$ and $O^3_{i^+}$ are zero matrices of size $m^2\times(i-1)$ and $m^2\times(m-i)$ ($i\in[1,m]$), respectively, and $I^3$ is an $m^2$-dimensional column vector with all elements equal to 1. We have $rk(A^3) = rk(A^2) + m - 1 = 3m-2$. For the $n$-agent case, we can infer that
$$rk(A^n) = rk(A^{n-1}) + m - 1 = rk(A^2) + (n-2)\cdot(m-1) = n(m-1)+1 \quad (35)$$

D PROOF OF PROPOSITION 2

Eq. 28 is equivalent to the following matrix equation:
$$A^n\times Q^n_{loc} = Q^n \quad (36)$$
The expressions of $A^n$, $Q^n_{loc}$ and $Q^n$ can be inferred from Eq. 32. We consider the worst case where the augmented matrix is full rank, i.e., $rk([A^n\ \ Q^n]) = m^n$. Note that $m^n > n(m-1)+1$ for all $m,n\in[2,\infty)$, so the matrix equation is overdetermined and can be solved by the least squares method. Let $\pi^n$ denote the vector of the probabilities of all permutations of the joint actions, and let $\cdot$ denote the element-wise (row-wise) product. Notice that $\sqrt{\pi^n}\cdot(A^n\times Q^n_{loc}) = (\sqrt{\pi^n}\cdot A^n)\times Q^n_{loc}$. The aim of the least squares method is
$$\min\ \pi^n\cdot\|A^n\times Q^n_{loc} - Q^n\| = \min\ \|(\sqrt{\pi^n}\cdot A^n)\times Q^n_{loc} - \sqrt{\pi^n}\cdot Q^n\| \quad (37)$$
$Q^{n*}_{loc}$ is the least squares solution if and only if the following normal equation holds:
$$(\sqrt{\pi^n}\cdot A^n)^\top\times(\sqrt{\pi^n}\cdot A^n)\times Q^{n*}_{loc} = (\sqrt{\pi^n}\cdot A^n)^\top\times(\sqrt{\pi^n}\cdot Q^n) \quad (38)$$
Let $Q^{n*}_{jt}$ denote the vector of all permutations of the joint Q values under the least squares solution. Notice that $Q^{n*}_{jt} = A^n\times Q^{n*}_{loc}$.
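The rank result $rk(A^n) = n(m-1)+1$ of Eq. 35 can be verified numerically by constructing the indicator coefficient matrix directly; the small sketch below mirrors the construction of Eq. 32 (one row per joint action, one one-hot block of m columns per agent):

```python
import itertools
import numpy as np

def coeff_matrix(n, m):
    """Coefficient matrix A^n of the LVF equation system: row for joint action u
    has a 1 in column (i*m + u_i) for each agent i, zeros elsewhere."""
    rows = []
    for u in itertools.product(range(m), repeat=n):
        row = np.zeros(n * m)
        for i, a in enumerate(u):
            row[i * m + a] = 1.0
        rows.append(row)
    return np.array(rows)

# rk(A^n) = n*(m-1) + 1 for all n, m >= 2 (Eq. 35).
for n in (2, 3, 4):
    for m in (2, 3, 4):
        A = coeff_matrix(n, m)
        assert A.shape == (m ** n, n * m)
        assert np.linalg.matrix_rank(A) == n * (m - 1) + 1
```

Intuitively, each agent's block of m indicator columns sums to the all-ones vector, which removes one degree of freedom per agent beyond the first; this matches the recursive rank argument above.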
We have
$$(\sqrt{\pi^n}\cdot A^n)^\top\times(\sqrt{\pi^n}\cdot A^n)\times Q^{n*}_{loc} = (\sqrt{\pi^n}\cdot A^n)^\top\times(\sqrt{\pi^n}\cdot Q^{n*}_{jt}) \quad (39)$$
Combining Eq. 38 with Eq. 39, we have
$$(\sqrt{\pi^n}\cdot A^n)^\top\times(\sqrt{\pi^n}\cdot Q^n) = (A^n)^\top\times(\pi^n\cdot Q^n) = (\sqrt{\pi^n}\cdot A^n)^\top\times(\sqrt{\pi^n}\cdot Q^{n*}_{jt}) = (A^n)^\top\times(\pi^n\cdot Q^{n*}_{jt}) \quad (40)$$
According to Eq. 34, the transpose of the coefficient matrix can be written as
$$(A^n)^\top = \begin{bmatrix}(A^{n-1})^\top & (A^{n-1})^\top & \cdots & (A^{n-1})^\top\\ (A^n_1)^\top & (A^n_2)^\top & \cdots & (A^n_m)^\top\end{bmatrix}, \qquad (A^n_i)^\top = \begin{bmatrix}O^n_{i^-}\\ I^n\\ O^n_{i^+}\end{bmatrix} \quad (41)$$
where $O^n_{i^-}$ and $O^n_{i^+}$ are zero matrices of size $(i-1)\times m^{n-1}$ and $(m-i)\times m^{n-1}$ ($i\in[1,m]$), respectively, and $I^n$ is an $m^{n-1}$-dimensional row vector with all elements equal to 1. Applying the block rows $(A^n_1)^\top,\cdots,(A^n_m)^\top$ of Eq. 41 to both sides of Eq. 40, we have, for all $i\in[1,m]$,
$$\sum_{u_{\backslash 1}} \pi(u_1=i, u_{\backslash 1}|s)\, Q_{jt}(s, u_1=i, u_{\backslash 1}) = \sum_{u_{\backslash 1}} \pi(u_1=i, u_{\backslash 1}|s)\, Q(s, u_1=i, u_{\backslash 1})$$
where $u_{\backslash 1}$ denotes the group of all actions except $u_1$, the sum runs over all $m^{n-1}$ permutations of $u_{\backslash 1}$, and $Q_{jt}$ denotes the entries of $Q^{n*}_{jt}$. Summing up the equations from $u_1=1$ to $u_1=m$, we have
$$\sum_{i=1}^{m}\sum_{u_{\backslash 1}} \pi(u_1=i,u_{\backslash 1}|s)\, Q_{jt}(s,u_1=i,u_{\backslash 1}) = \sum_{u}\pi(u|s)\, Q_{jt}(s,u) = \sum_{i=1}^{m}\sum_{u_{\backslash 1}} \pi(u_1=i,u_{\backslash 1}|s)\, Q(s,u_1=i,u_{\backslash 1}) = \sum_{u}\pi(u|s)\, Q(s,u)$$
$\sum_{u}\pi(u|s)\, Q_{jt}(s,u)$ is the state value estimated by the joint Q value function of LVF and $\sum_{u}\pi(u|s)\, Q(s,u)$ is the actual state value. In words, the estimate of the state value function is unbiased for LVF under value iteration in the sarsa manner.

A variant of $F^1_d(Q_{set}(u), M_{set}(u))$ is $\sum_{i=1}^{n}|M_1(u)|\cdot[Q_i(u_{i,gre}) - Q_i(u_i)]$. Let $F(Q_{set}(u_{gre}), M_{set}(u_{gre})) = \sum_{i=1}^{n} Q_i(u_{i,gre})$. The joint Q value function equals
$$Q(u) = F(Q_{set}(u), M_{set}(u)) = -\sum_{i=1}^{n}|M_1(u)|\cdot[Q_i(u_{i,gre}) - Q_i(u_i)] + \sum_{i=1}^{n} Q_i(u_{i,gre}) \quad (50)$$
which is exactly the joint Q value function of QPLEX (Wang et al., 2020b).
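The unbiasedness claim above (the π-weighted least squares LVF solution preserves the state value) can be checked numerically. The sketch below builds the 2-agent coefficient matrix, draws arbitrary true Q values and an arbitrary joint policy, solves the weighted least squares problem, and confirms that the policy-weighted averages match; the sizes and the seed are illustrative.

```python
import itertools
import numpy as np

def coeff_matrix(n, m):
    """Indicator coefficient matrix of the LVF equation system (Eq. 32 style)."""
    rows = []
    for u in itertools.product(range(m), repeat=n):
        row = np.zeros(n * m)
        for i, a in enumerate(u):
            row[i * m + a] = 1.0
        rows.append(row)
    return np.array(rows)

rng = np.random.default_rng(1)
n, m = 2, 3
A = coeff_matrix(n, m)                 # 9 x 6
Q = rng.normal(size=m ** n)            # arbitrary (non-factorizable) true Q values
pi = rng.random(m ** n)
pi /= pi.sum()                         # arbitrary joint policy probabilities pi(u|s)

# pi-weighted least squares: min || sqrt(pi) * (A q_loc - Q) ||
q_loc, *_ = np.linalg.lstsq(np.sqrt(pi)[:, None] * A, np.sqrt(pi) * Q, rcond=None)
Q_jt = A @ q_loc                       # LVF joint Q values under the solution

# Unbiased state value under sarsa-style weighting: E_pi[Q_jt] == E_pi[Q].
assert abs(pi @ Q_jt - pi @ Q) < 1e-8
```

The equality follows from the normal equations: each agent's indicator columns sum to the all-ones vector, so the π-weighted residual must have zero mean.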

F RELATED WORKS

Independent learning has been applied to fully cooperative multi-agent tasks for a long time (Tan, 1993a). In tasks with a small number of agents, independent proximal policy optimization (PPO) with agent-specific reward functions is able to acquire strategies at the level of human experts. For better scalability, recent works turn to automatic credit assignment under reward functions shared by the team. Meanwhile, by introducing global information into the training of local policies, centralized training with decentralized execution (CTDE) achieves great success in complex cooperative MARL tasks. As a simple and effective approach to credit assignment in the CTDE paradigm, value decomposition has recently gained wide attention.

F.1 LINEAR VALUE FACTORIZATION

There is a series of implementations of linear value factorization. VDN (Sunehag et al., 2017) obtains the joint Q value function by simply adding all local Q value functions together and updates it by Q-learning value iteration. Based on VDN, QMIX (Rashid et al., 2018) extracts a set of non-negative weights from the global state and applies them to the local Q value functions. SMIX (Wen et al., 2020) and Qatten (Yang et al., 2020) share the same value factorization function with QMIX: SMIX replaces the TD(0) Q-learning target with a TD(λ) sarsa target, while Qatten introduces an attention network before the mixing network. All methods above suffer from relative overgeneralization due to the representation limitation of the joint Q value function, i.e., training may converge to one of multiple possible points, which may be sub-optimal.
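The two mixing schemes can be sketched in a few lines: VDN sums the local Qs, while QMIX produces state-conditioned non-negative weights so that the joint Q is monotonic in every local Q. The single-layer mixer below is a simplification (the real QMIX uses a two-layer mixer whose weights are generated by hypernetworks); all shapes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def vdn_mix(q_locals):
    """VDN: the joint Q is the plain sum of the local Qs."""
    return float(np.sum(q_locals))

def qmix_mix(q_locals, state, w_net, b_net):
    """QMIX (single-layer sketch): state-conditioned weights made non-negative
    via abs(), so dQ_jt/dQ_i >= 0, which is sufficient for IGM."""
    w = np.abs(w_net @ state)   # abs() enforces monotonicity
    b = b_net @ state
    return float(w @ q_locals + b.sum())

n_agents, state_dim = 3, 4
q = rng.normal(size=n_agents)
s = rng.normal(size=state_dim)
w_net = rng.normal(size=(n_agents, state_dim))  # stand-in for a hypernetwork
b_net = rng.normal(size=(1, state_dim))

q_jt = qmix_mix(q, s, w_net, b_net)
# Monotonicity check: raising any local Q never lowers the joint Q.
for i in range(n_agents):
    q_up = q.copy()
    q_up[i] += 1.0
    assert qmix_mix(q_up, s, w_net, b_net) >= q_jt
assert np.isclose(vdn_mix(q), q.sum())
```

The monotonicity constraint is exactly what makes the joint argmax decompose into local argmaxes (IGM), and also what limits QMIX's representation capability.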

F.2 VALUE FACTORIZATION FOR INDECOMPOSABLE MARKOV GAMES

There are various works addressing relative overgeneralization, which can be summarized into two categories. The first is biased representation. The basic idea is to reduce the representation errors of the Q-learning targets at the expense of increased representation errors of the non-maximal joint Q values. WQMIX (Rashid et al., 2020) alleviates the estimation error of the Q-learning targets by attaching larger weights to the representation of the joint Q values of potentially optimal actions. In practice, a weight α ∈ (0, 1) is applied to the samples with lower targets than expected; the Q-learning targets are unbiased when α = 0. GVR (Wan et al., 2021) achieves an approximately unbiased estimate of the Q-learning targets by target shaping. The former reshapes the targets of joint Q values lower than expected, while the latter reshapes the target matrices into a monotonic form. Biased representation alleviates but does not eliminate the representation errors. Besides, these methods rely on the identification of potentially optimal actions, which introduces extra errors into training. Another route to address the indecomposable Markov game is completing the representation capability of the joint Q value function under the IGM principle, i.e., introducing value factorization functions with both the IGM and CRC properties, e.g., Qtran and QPLEX. Qtran (Son et al., 2019) adopts a linear value factorization function $Q(s,u) = \sum_{i=1}^{n} Q_i(\tau_i,u_i) + M_1(s,u) + V(s)$ (Eq. 45). The IGM principle requires $M_1(s,u)\le M_1(s,u_{gre}) = 0$. However, instead of modelling $M_1(s,u)$ explicitly, Qtran models $Q(s,u)$ and represents $M_1(s,u)$ by $M_1(s,u) = Q(s,u) - \sum_{i=1}^{n} Q_i(\tau_i,u_i) - V(s)$. As a result, (1) Qtran does not satisfy the IGM principle strictly. To approximate the IGM principle, Qtran applies a multi-stage training procedure to regulate that $M_1(s,u)\le M_1(s,u_{gre}) = 0$.
In the first stage, $Q(s,u)$ is trained to approximate the true Q value function; in the second stage, $V(s)$ is trained to meet $M_1(u_{gre}) = 0$, i.e., $V(s) = Q_{jt}(s,u_{gre}) - \sum_{i=1}^{n} Q_i(\tau_i,u_{i,gre})$; in the third stage, the local value functions are trained to meet $M_1(s,u)\le 0$, where the local policies are updated only if $M_1(s,u) > 0$, i.e., $Q(s,u) - \sum_{i=1}^{n} Q_i(\tau_i,u_i) > V(s)$. As a result, (2) the estimation errors of $Q(s,u)$ and $V(s)$ magnify the estimation errors of the local Q value functions.
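The WQMIX-style weighting described above (down-weighting samples whose targets are lower than expected) can be sketched as follows; the function name and the numbers are illustrative, not WQMIX's actual implementation:

```python
import numpy as np

def wqmix_weights(td_targets, q_estimates, alpha=0.5):
    """WQMIX-style sample weighting sketch: samples whose TD target is lower
    than the current joint-Q estimate get down-weighted by alpha in (0, 1);
    samples with higher-than-expected targets keep full weight."""
    return np.where(td_targets < q_estimates, alpha, 1.0)

targets = np.array([1.0, 3.0, -2.0])
estimates = np.array([2.0, 2.0, 2.0])
w = wqmix_weights(targets, estimates, alpha=0.5)
# Only the samples with lower-than-expected targets are down-weighted.
assert w.tolist() == [0.5, 1.0, 0.5]
# The weighted TD loss would then be: mean(w * (targets - estimates) ** 2).
loss = float(np.mean(w * (targets - estimates) ** 2))
```

As α shrinks toward 0, the loss increasingly ignores pessimistic samples, which biases the monotonic projection toward potentially optimal actions, matching the trade-off discussed above.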

H.2 MATRIX GAME

In this subsection, we compare the performance of different value factorization functions in one-step matrix games. The pay-off matrix is shown in Fig. H.2(a). Two agents select actions from {0, 1, 2} and receive a shared reward according to the pay-off matrix. To evaluate whether each function is capable of driving the joint Q value function out of a local optimum, we only consider the cases where the greedy action is trapped at (2, 2) after a few rounds of training. The experimental results are shown in Fig. H.2. The red curves denote the mean test return. The orange and blue curves denote the difference between the local Q values of the optimal action and the current greedy action. The green and brown curves denote the non-linear coefficients on the optimal local Q values contributed by the optimal and non-optimal true Q values, respectively. We do not measure the coefficients on the local Q values for (extended) linear value factorization functions (QMIX and Qtran), where $\frac{\partial F}{\partial Q_i}$ is a constant. We compare the value factorization functions of QMIX (Rashid et al., 2018), Qtran (Son et al., 2019) (Eq. 45), QPLEX (Wang et al., 2020b) (Eq. 50), a variant of QPLEX and our method (Eq. 12). The variant of QPLEX is
$$F(Q_{set}(u), M_{set}(u)) = -\sum_{i=1}^{n}(|M_1(u)|+1)\cdot[Q_i(u_{i,gre}) - Q_i(u_i)] + \sum_{i=1}^{n} Q_i(u_{i,gre}) \quad (51)$$
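The failure mode that motivates this comparison can be reproduced with plain least squares: fitting an additive (LVF) model to a pay-off matrix whose optimum is risky leaves the greedy joint action at the safe corner. The matrix below is a hypothetical example with the optimum at (0, 0) and the trap at (2, 2); it is not the paper's actual pay-off matrix.

```python
import itertools
import numpy as np

# Hypothetical 3x3 pay-off: global optimum 8 at (0, 0), heavy miscoordination
# penalties around it, and a safe local optimum near (2, 2).
payoff = np.array([[  8., -12., -12.],
                   [-12.,   0.,   0.],
                   [-12.,   0.,   6.]])
m = 3

# LVF: fit Q(u1, u2) ~= Q1(u1) + Q2(u2) by unweighted least squares.
A = np.array([[float(i == u1) for i in range(m)] +
              [float(j == u2) for j in range(m)]
              for u1, u2 in itertools.product(range(m), repeat=2)])
q_loc, *_ = np.linalg.lstsq(A, payoff.ravel(), rcond=None)
q1, q2 = q_loc[:m], q_loc[m:]

greedy = (int(np.argmax(q1)), int(np.argmax(q2)))
# Relative overgeneralization: the LVF greedy action is the safe (2, 2),
# not the true optimum (0, 0).
assert greedy == (2, 2)
```

The additive fit averages each action's pay-off over the partner's actions, so action 0's large penalties drown out its large reward; this is exactly the representation limitation the non-linear factorizations target.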






Figure 1: Examples of decomposable and indecomposable Markov games.

Figure 2: In value factorization, the representation of all true Q values involving u 1 is interfered by each other through the training of the shared local Q value function Q 1 (u 1 ).

Figure 3: Verification of the factorizability of the true Q value functions in decomposable & indecomposable Markov games. (a) Tasks for decomposable & indecomposable cases; (b) state number in first 3 time steps; (c) RMSE of linearly factorized joint Q value functions; (d) test mean return of linearly factorized joint Q value functions.

Figure 4: The estimation error of state value (sarsa) and max Q value (Q learning) in both decomposable and indecomposable Markov games.

Figure 6: Evaluation of various value factorization operators.

$MG$ is decomposable by $\{MG_1, MG_2,\cdots, MG_k\}$, where $MG_i := \langle S_i, \hat U_i, P, r_i, n_i, \gamma\rangle$ ($i\in[1,k]$); otherwise, we say $MG$ is indecomposable. Here $n_i$ is the number of agents involved in $MG_i$. In particular, if $MG_i$ is no longer decomposable for any $i\in[1,k]$, we say that $\{MG_1, MG_2,\cdots, MG_k\}$ is the Minimum Granularity Decomposition (MGD).

Lipeng Wan, Zeyang Liu, Xingyu Chen, Han Wang, and Xuguang Lan. Greedy-based value representation for optimal coordination in multi-agent reinforcement learning. arXiv preprint arXiv:2112.04454, 2021.

E EXAMPLES OF EXTENDED LVF AND NVF

E.1 EXAMPLES OF EXTENDED LVF

An example of an extended linear value factorization function is QTRAN (Son et al., 2019). Let $M_{set}(u) = M_1(u)$ and $C_{Q,i} = 1$ ($i\in[1,n]$). We have
$$F(Q_{set}(u), M_{set}(u)) = \sum_{i=1}^{n} Q_i(u_i) + M_1(u) + C_F$$
In QTRAN, $Q(u)$ and $C_F$ refer to the joint Q value function (i.e., $Q_{jt}$) and the state value function (i.e., $V_{jt}$), respectively. $M_1(u)$ is the error between $Q(u)$ and $\sum_{i=1}^{n} Q_i(u_i) + C_F$. To ensure the IGM principle, QTRAN applies two regularizations to regulate that $M_1(u)\le M_1(u_{gre}) = 0$. To be more compact, let $M_{set}(u) = M_1(u)$, $C_{Q,i} = 1$ ($i\in[1,n]$) and $C_F = -M_1(u_{gre})$. We have
$$Q(u) = \sum_{i=1}^{n} Q_i(u_i) + M_1(u) - M_1(u_{gre})$$

E.2 EXAMPLES OF NVF FUNCTIONS

Consider parameterizing the operator directly. The IGM principle requires $\frac{\partial F}{\partial Q_i}\ge 0$, which can be modelled by Eq. 47. For linear value factorization functions, the problem can be avoided by setting constant non-negative coefficients. Consider introducing parameterized modules. For brevity, we only consider the model $M_j(\hat u_j)$ conditioned on the joint action space, i.e., $\hat U_j = U$ (refer to Eq. 48).

A simple method to overcome the above two defects of Qtran is to model $M_1(s,u)$ explicitly with functions that satisfy $M_1(s,u)\le M_1(s,u_{gre}) = 0$, e.g., distance functions, then represent $Q(s,u)$ by $M_1(s,u)$ and train $Q(s,u)$ end-to-end.
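The remedy above (modelling $M_1$ explicitly with a non-positive function that vanishes only at the greedy joint action) can be illustrated with a hypothetical distance-based $M_1$; the check confirms that the resulting joint Q satisfies IGM by construction. The Hamming-distance form is an assumption chosen for illustration, not the paper's choice.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 3
q_loc = rng.normal(size=(n, m))                        # local Q values Q_i(u_i)
u_gre = tuple(int(np.argmax(q_loc[i])) for i in range(n))

def M1(u):
    """Hypothetical explicit correction: non-positive everywhere, and zero
    exactly at the greedy joint action (negative Hamming distance)."""
    return -float(sum(a != g for a, g in zip(u, u_gre)))

def Q_jt(u):
    """Joint Q in the Qtran-style form Q(u) = sum_i Q_i(u_i) + M_1(u)."""
    return sum(q_loc[i, u[i]] for i in range(n)) + M1(u)

# IGM holds by construction: the joint argmax matches the local argmaxes,
# since M_1 <= 0 with equality only at u_gre.
best = max(itertools.product(range(m), repeat=n), key=Q_jt)
assert best == u_gre
```

Because $M_1$ is modelled explicitly, no multi-stage regularization is needed to enforce $M_1(s,u)\le M_1(s,u_{gre}) = 0$; it holds for every input.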

QPLEX (Wang et al., 2020b) adopts a non-linear value factorization function, given in Eq. 50.

However, QPLEX is easily trapped in local optimums for the following two reasons. (1) Suppose the current greedy action is not the optimal action, i.e., $u_{gre}\ne u^*$. Convergence to the global optimum requires increasing $Q(s,u^*)$, and $|M_i(s,u^*)|$ can be viewed as a weight of the sample. As $Q(s,u^*)$ increases, $|M_i(s,u^*)|$, i.e., the weight of the optimal sample, decreases to 0. (2) For a non-optimal sample $Q(s,u')$ with a low target value, the weight of the sample increases sharply. $Q_1(\tau_1,u^*_1)$ is updated by all samples involving $u^*_1$, e.g., $Q(s,u')$ and $Q(s,u^*)$. As a result, the update of $Q_1(\tau_1,u^*_1)$ is dominated by non-optimal samples.
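Reason (1) can be made concrete by solving the QPLEX-style form of Eq. 50 for the coefficient at a fixed non-greedy action: writing $Q(u) = -|M_1(u)|\cdot D(u) + S$ with $S = \sum_i Q_i(u_{i,gre})$ and $D(u) = \sum_i [Q_i(u_{i,gre}) - Q_i(u_i)] > 0$ gives $|M_1(u)| = (S - Q(u))/D(u)$, which shrinks as the target grows. The numbers below are illustrative placeholders.

```python
def qplex_weight(q_target, S, D):
    """Coefficient |M_1(u)| implied by the Eq. 50 form at a non-greedy action:
    Q(u) = -|M_1(u)| * D + S  =>  |M_1(u)| = (S - Q(u)) / D."""
    return (S - q_target) / D

S, D = 4.0, 2.0   # hypothetical sum of greedy local Qs and local-Q gap
weights = [qplex_weight(q, S, D) for q in (0.0, 2.0, 3.9)]

# As the target Q(s, u*) grows toward S, the weight on the optimal sample
# shrinks toward 0, so the optimal local Qs stop receiving gradient while
# low-valued non-optimal samples keep large weights.
assert weights[0] > weights[1] > weights[2] > 0
assert weights[2] < 0.1
```

This vanishing-gradient behavior on the optimal sample, combined with large weights on low-valued samples sharing local Qs, is exactly the representation interference that QFRIS suppresses.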

H EXPERIMENTS

H.1 EXPERIMENTAL SETTINGS

In the experiments on one-step matrix games and the verifications of the propositions, all modules are implemented by multilayer perceptrons. A replay buffer of length 1000 is applied for all algorithms. In the experiments on predator-prey and SMAC, we adopt the default settings for VDN, QMIX, QPLEX and WQMIX. The length of the replay buffer is 5000 and the batch size is 32. For WQMIX, we apply a weight of 0.5 (predator-prey) and 0.1 (SMAC) to the samples of poor performance. The game version of StarCraft II is 69232. Each algorithm is trained for 2e6 steps in MMM2, 2c_vs_64zg, 3s_vs_5z and 5m_vs_6m, with the exploration rate annealed from 1 to 0.05 over the first 5e4 steps. Besides, in 6h_vs_8z and 3s5z_vs_3s6z, each algorithm is trained for 5e6 steps, with the exploration rate annealed from 1 to 0.05 over the first 1e6 steps. All experiments are repeated over 5 seeds.

According to Definition 1, tasks involving cooperative rewards or interactive transitions of all agents are indecomposable, e.g., predator-prey (Böhmer et al., 2020) and the StarCraft multi-agent challenge (SMAC) (Samvelyan et al., 2019). The former involves punitive rewards for miscoordination, and the latter involves the transition of the health of enemy units, which is determined by the policies of all agents.

In the matrix game, the non-linear coefficient contributed by Q(0, 0) decays during training and quickly approximates 0. As a result, $Q_1(0)$ and $Q_2(0)$ grow slowly and the policy is trapped in the local optimum. We add a constant 1 to the non-linear coefficients and obtain a variant of QPLEX (Eq. 51). As shown in Fig. H.2(e), the minimum coefficient contributed by Q(0, 0) becomes 1. However, this value factorization function still cannot drive the joint Q value function out of the local optimum, since the coefficients on $Q_1(0)$ and $Q_2(0)$ contributed by the non-optimal true Q values (the brown curve) are much larger. QFRIS adopts the representation interference suppression technique, which greatly suppresses the coefficients on $Q_1(0)$ and $Q_2(0)$ contributed by the non-optimal true Q values (the brown curve); it is thus capable of jumping out of the local optimum quickly and stably.
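Assuming the reported annealing from 1 to 0.05 refers to a linearly decayed exploration rate (ε), a common convention in these codebases but not stated explicitly here, the schedule can be written as:

```python
def epsilon(step, start=1.0, end=0.05, anneal_steps=50_000):
    """Linear exploration-rate schedule matching the reported settings:
    annealed from `start` to `end` over the first `anneal_steps` environment
    steps, then held constant at `end`."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

assert epsilon(0) == 1.0
assert abs(epsilon(25_000) - 0.525) < 1e-9   # halfway through the anneal
assert abs(epsilon(100_000) - 0.05) < 1e-9   # held constant afterwards
```

For the harder maps (6h_vs_8z, 3s5z_vs_3s6z) the same schedule would use `anneal_steps=1_000_000`, matching the longer 5e6-step training budget.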

H.3 PREDATOR-PREY

Predator-prey is a cooperative multi-agent task which requires highly coordinated policies. The agents, i.e., the predators, are trained to capture preys that move with random policies. The team receives an instant reward at each time step. The basic reward is 0; there is a bonus when any prey is captured by more than one agent, as well as a punishment when any prey is captured by a single agent. As the punishment increases, the agents are more likely to adopt a sub-optimal but safe policy, i.e., staying away from the preys. We carry out experiments on 3 different levels of punishment. The experimental results are shown in Fig. 3.1, from which we can see that our method can handle the task under all levels of punishment. The two implementations of LVF, i.e., VDN and QMIX, are incapable of solving the tasks. cw-QMIX and ow-QMIX reduce the weight of samples with poor performance, which is able to deal with small punishments. Although Qtran and QPLEX adopt value factorization functions that satisfy both the IGM and CRC conditions, the problem cannot be well solved due to representation interference.
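The reward structure described above can be sketched as a per-prey instant reward; the bonus and punishment magnitudes below are hypothetical placeholders (the experiments vary the punishment level):

```python
def capture_reward(num_predators_at_prey, bonus=10.0, punishment=-2.0):
    """Per-prey instant reward sketch for predator-prey (illustrative values):
    a bonus when two or more predators capture the prey together, a punishment
    when a single predator captures it alone, and 0 otherwise."""
    if num_predators_at_prey >= 2:
        return bonus
    if num_predators_at_prey == 1:
        return punishment
    return 0.0

assert capture_reward(3) == 10.0   # coordinated capture
assert capture_reward(1) == -2.0   # miscoordination penalty
assert capture_reward(0) == 0.0    # no interaction
```

Making `punishment` more negative raises the cost of attempting a capture alone, which is what pushes LVF methods toward the safe stay-away policy described above.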

H.4 STARCRAFT MULTI-AGENT CHALLENGE

We compare our method with the baselines on challenging tasks of the StarCraft Multi-Agent Challenge. The experimental results are shown in Fig. H.4, from which we can see that our method outperforms the baselines in most of the tasks.

H.5 ABLATION STUDIES

To evaluate the effect of the interference suppression introduced by the non-linear value factorization function, we compare QFRIS with a linear variant of it, whose joint Q value function is

