CAUSAL MEAN FIELD MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Scalability remains a challenge in multi-agent reinforcement learning and is under active research. The mean-field reinforcement learning (MFRL) framework alleviates the scalability problem by employing mean-field theory to turn a many-agent problem into a two-agent problem. However, this framework cannot identify essential interactions in non-stationary environments. Causality captures relatively invariant mechanisms behind interactions even when environments are non-stationary. We therefore propose an algorithm called causal mean-field Q-learning (CMFQ) to address the scalability problem. CMFQ inherits the compact state-action representation of MFRL while being considerably more robust to changes in the number of agents. First, we model the causality behind the decision-making process of MFRL as a structural causal model (SCM). Then the importance of each interaction is quantified by intervening on the SCM. Furthermore, we design a causality-aware compact representation of the agents' behavioral information as the weighted sum of all behavioral information according to their causal effects. We test CMFQ in a mixed cooperative-competitive game and a cooperative game. The results show that our method scales well both when training in environments containing a large number of agents and when testing in environments containing many more agents.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) has achieved remarkable success in some challenging tasks, e.g., video games (Vinyals et al., 2019; Wu, 2019). However, training a large number of agents remains a challenge in MARL. The main reasons are that 1) the dimensionality of the joint state-action space increases exponentially with the number of agents, and 2) while a single agent is being trained, the policies of the other agents keep changing, causing a non-stationarity problem whose severity grows with the number of agents (Sycara, 1998; Zhang et al., 2019; Gronauer & Diepold, 2021). Existing works generally use the centralized-training, decentralized-execution paradigm to mitigate the scalability problem via mitigating non-stationarity (Rashid et al., 2018; Foerster et al., 2018; Lowe et al., 2017; Sunehag et al., 2017). Curriculum learning and attention techniques have also been used to improve scalability (Long et al., 2020; Iqbal & Sha, 2019). However, the methods above mostly target tens of agents. For large-scale multi-agent systems (MAS) containing hundreds of agents, studies in game theory (Blume, 1993) and mean-field theory (Stanley, 1971; Yang et al., 2018) offer a feasible framework for mitigating the scalability problem. Under this framework, Yang et al. (2018) propose an algorithm called mean-field Q-learning (MFQ), which replaces the joint action in the joint Q-function with an average action, assuming that all agent-wise interactions can be reduced to the mean of local pairwise interactions. That is, MFQ reduces the dimensionality of the joint state-action space with a merged agent. However, this approach ignores the differing importance of the pairwise interactions, resulting in poor robustness.
Nevertheless, one drawback of mean-field theory is that it does not properly account for fluctuations when few interactions exist (Uzunov, 1993); e.g., the average action may change drastically if there are only two adjacent agents. Wang et al. (2022) attempt to improve the representational ability of the merged agent by assigning each pairwise interaction a weight given by its attention score. However, this method requires the observations of other agents as input, which limits its practicality in the real world. In addition, the attention score is essentially a correlation in feature space, which is unconvincing for two reasons. On the one hand, an agent does not pay more attention to another agent simply because of a higher correlation. On the other hand, proximal agents will almost inevitably be assigned high weights just because their observations are highly similar. In this paper, we discuss a better way to represent the merged agent. We propose an algorithm named causal mean-field Q-learning (CMFQ) to address the lack of robustness of MFQ via causal inference. Research in psychology reveals that humans have a sense of the logic of intervention and employ it in decision-making contexts (Sloman & Lagnado, 2015). This suggests that if agents in the mean-field reinforcement learning (MFRL) framework are allowed to intervene, they can identify the more essential interactions as humans do. Inspired by this insight, we assume that different pairwise interactions should be assigned different weights, and that these weights can be obtained via intervention. We introduce a structural causal model (SCM) that represents the invariant causal structure of decision-making in MFRL. We intervene on the SCM so that the effect of a specific pairwise interaction is revealed by comparing the policies before and after the intervention.
Intuitively, intervening enables agents to ask "what if the merged agent were replaced with an adjacent agent," as illustrated in Fig. 1. In practice, a pairwise interaction is embodied as the action taken between two agents, so the intervention is also performed on the action. CMFQ is based on the assumption that the joint Q-function can be factorized into local pairwise Q-functions, which mitigates the curse of dimensionality in the scalability problem. Moreover, CMFQ alleviates another aspect of the scalability problem, namely non-stationarity, by focusing on crucial pairwise interactions. Identifying crucial interactions is based on causal inference instead of an attention mechanism; notably, the scalability of CMFQ is much better than that of the attention-based method (Wang et al., 2022), for reasons discussed in the experiments section. As causal inference only needs local pairwise Q-functions, CMFQ is practical in real-world applications, which are usually partially observable. We evaluate CMFQ in a cooperative predator-prey game and a mixed cooperative-competitive battle game. The results illustrate that the scalability of CMFQ significantly outperforms all baselines, and that agents controlled by CMFQ exhibit more advanced collective intelligence.

2. RELATED WORK

The scalability problem has been widely investigated in the current literature. Yang et al. (2018) propose the MFRL framework, which increases scalability by reducing the state-action space. Several works in the related area of mean-field games also prove that using a compact representation to characterize population information helps solve the scalability problem (Guo et al., 2019; Perrin et al., 2021). Several works improve on MFQ. Wu et al. (2022) propose a weighted mean field that assigns different weights to neighbor actions according to correlations over a hand-crafted set of agent attributes, which is difficult to generalize to different environments. Wang et al. (2022) calculate the weights with attention scores; since the observations of other agents are needed to compute them, its practicality is unsatisfactory. Our work is also closely related to recent developments in causal inference (Pearl, 2019; 2001; Peters et al., 2017). Research indicates that once an SCM, which implicitly contains the causal relationships between variables, is constructed, we can obtain causal effects by intervening. Causal inference has already been exploited for communication pruning (Ding et al., 2020) and for solving the credit assignment problem (Foerster et al., 2018; Omidshafiei et al., 2019), demonstrating its potential in reinforcement learning. Xia et al. (2021) and Zečević et al. (2021) further prove that an SCM can be equivalently replaced by a neural causal model (NCM) under certain constraints, enabling us to ask "what if" by intervening directly on a neural network.
Figure 2: (a) CMFQ's architecture. Each neighboring agent is assigned a weight according to its causal effect on the policy of the central agent. (b) The causal module. It calculates the KL divergence between the two policies in which the merged agent is represented by the average action and by the k-th neighboring agent's action, respectively. A large KL divergence means the k-th neighboring agent might be ignored when the merged agent is represented by the average action; hence it should be assigned a higher weight to form a better merged agent.

3. PRELIMINARY

This section discusses the concepts of the stochastic game, mean-field reinforcement learning, and causal inference.

3.1. STOCHASTIC GAME

An $N$-player stochastic game can be formalized as $G = \langle S, A, P, r, N, \gamma \rangle$, in which $N$ agents take joint actions $a \in A = \times_{i=1}^{N} A^i$ to interact with each other and the environment. The environment transitions according to the probability $P(s' \mid s, a) : S \times A \times S \to [0, 1]$, after which every agent obtains its reward $r^i(s, a^i) : S \times A^i \to \mathbb{R}$; $\gamma \in [0, 1]$ is the discount factor. Each agent makes decisions according to its policy $\pi^i(s) : S \to \Omega(A^i)$, where $\Omega(A^i)$ is a probability distribution over agent $i$'s action space $A^i$. The joint Q-function of agent $i$ is parameterized by $\theta_i$, takes $s$ and $a$ as input, and is updated as

$$L^i(\theta_i) = \mathbb{E}_{s,a,r,s'}\Big[\big(Q^i(s, a; \theta_i) - y\big)^2\Big], \qquad y = r + \gamma \max_{a'} Q^i\big(s', a'; \theta_i^-\big) \tag{1}$$

where $\theta_i^-$ denotes the target-network parameters, which are copied from $\theta_i$ every $C$ steps and held fixed until the next $C$ steps finish.
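The update rule in Eq. (1) is the standard temporal-difference loss with a periodically synchronized target network. A minimal tabular sketch in plain Python (the `TabularQ` class and its parameter names are illustrative stand-ins, not from the paper):

```python
import numpy as np

class TabularQ:
    """Per-agent Q-learning with a frozen target copy, as in Eq. (1).
    A tabular stand-in for the paper's Q-network."""

    def __init__(self, n_states, n_actions, gamma=0.95, lr=0.1, sync_every=100):
        self.q = np.zeros((n_states, n_actions))   # predict parameters θ
        self.q_target = self.q.copy()              # target parameters θ⁻
        self.gamma, self.lr, self.sync_every = gamma, lr, sync_every
        self.updates = 0

    def update(self, s, a, r, s_next):
        # y = r + γ max_a' Q(s', a'; θ⁻)
        y = r + self.gamma * self.q_target[s_next].max()
        # gradient step on (Q(s, a; θ) − y)²
        self.q[s, a] += self.lr * (y - self.q[s, a])
        self.updates += 1
        if self.updates % self.sync_every == 0:    # θ⁻ ← θ every C updates
            self.q_target = self.q.copy()
```

Freezing `q_target` between synchronizations is what keeps the regression target `y` stationary within each window of `C` updates.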

3.2. MEAN FIELD REINFORCEMENT LEARNING

The mean-field approximation turns a many-agent problem into a two-agent problem by mapping the joint action space to a single action space. The joint-action Q-function is first factorized considering only local pairwise interactions, and the pairwise interactions are then approximated using mean-field theory:

$$Q^i\big(s, a^1, a^2, \ldots, a^N\big) = \frac{1}{N^i} \sum_{k \in \mathcal{N}(i)} Q^i\big(s, a^i, a^k\big) \approx Q^i\big(s, a^i, \bar{a}^i\big) \tag{2}$$

where $N^i = |\mathcal{N}(i)|$ and $\mathcal{N}(i)$ is the set of agent $i$'s neighboring agents. The interactions between the central agent $i$ and its neighbors are thereby reduced to the interaction between the central agent and an abstract merged agent, represented by the average behavioral information $\bar{a}^i$ of the agents in the neighborhood of agent $i$. Finally, the policy of the central agent $i$ is determined by the pairwise Q-function:

$$\pi^i_t\big(a^i_t \mid s_t, \bar{a}^i_{t-1}\big) = \frac{\exp\big(\beta Q^i_t(s_t, a^i_t, \bar{a}^i_{t-1})\big)}{\sum_{a'^i \in A^i} \exp\big(\beta Q^i_t(s_t, a'^i, \bar{a}^i_{t-1})\big)} \tag{3}$$

It is proven that $\pi^i_t$ eventually converges (Yang et al., 2018).
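Eqs. (2)–(3) can be sketched in a few lines; here `q_fn` is a stand-in for the learned pairwise Q-network, and the function names are illustrative:

```python
import numpy as np

def mean_action(neighbor_actions, n_actions):
    """ā: average of the neighbors' one-hot actions (Eq. 2)."""
    return np.eye(n_actions)[neighbor_actions].mean(axis=0)

def boltzmann_policy(q_fn, s, a_bar, n_actions, beta=1.0):
    """π(a | s, ā) ∝ exp(β Q(s, a, ā)) (Eq. 3)."""
    logits = beta * np.array([q_fn(s, a, a_bar) for a in range(n_actions)])
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

For instance, with three neighbors taking actions 0, 0, and 2 out of three actions, `mean_action` returns the distribution-like vector (2/3, 0, 1/3) that stands in for the whole neighborhood.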

3.3. CAUSAL INFERENCE

Data-driven statistical learning lacks the identification of causality, which is a vital part of human knowledge. An SCM built with human knowledge is needed to represent the causality among the variables under consideration. An SCM is a 4-tuple $M = \langle U, V, F, P(U) \rangle$. $U = \{U_1, U_2, \ldots, U_m\}$ is the set of exogenous variables, which are determined by factors outside the model. $V = \{V_1, V_2, \ldots, V_n\}$ is the set of endogenous variables, which are determined by other variables in the model. $F$ is a set of functions $\{f_{V_1}, f_{V_2}, \ldots, f_{V_n}\}$ such that $f_{V_j}$ maps $\mathrm{Pa}_{V_j} \cup U_{V_j}$ to $V_j$, where $U_{V_j} \subseteq U$ are the exogenous variables pointing directly to $V_j$ and $\mathrm{Pa}_{V_j} \subseteq V \setminus V_j$ are the endogenous variables pointing directly to $V_j$. That is, $V_j = f_{V_j}(\mathrm{Pa}_{V_j}, U_{V_j})$ for $j = 1, \ldots, n$. $P(U)$ is the probability distribution over the domain of $U$. The causal mechanisms in an SCM $M$ induce a directed acyclic graph $G$, in which a directed arrow represents a direct effect between variables, as shown in Fig. 3. An intervention is performed through the operator $do(x)$, which deletes $f_X$ and replaces it with the constant $X = x$ while the rest of the model is kept unchanged. The post-intervention distribution is defined by

$$P_M(y \mid do(x)) \triangleq P_{M_x}(y) \tag{4}$$

where $M_x$ is the SCM after performing $do(x)$. Once we obtain the post-intervention distribution, we can measure the causal effect by comparing it with the pre-intervention distribution. A common measure is the average causal effect

$$\mathbb{E}[Y \mid do(x'_0)] - \mathbb{E}[Y \mid do(x_0)] \tag{5}$$

where $x'_0$ and $x_0$ are two different interventions. The causal effect may also be measured by the experimental risk ratio (Pearl, 2010)

$$\frac{\mathbb{E}[Y \mid do(x'_0)]}{\mathbb{E}[Y \mid do(x_0)]} \tag{6}$$

4. METHOD

4.1. CONSTRUCTION OF SCM

The first step in obtaining the causal effects behind the interactions among agents is to construct an SCM, which reveals the causal relations among all the variables.
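Before turning to the SCM of MFRL, the $do(\cdot)$ operator of Section 3.3 can be illustrated on a toy SCM mirroring Fig. 3(a); the structural functions below (a unit-slope $f_X$ and a slope-2 $f_Y$) are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_y(n, do_x=None):
    """Toy SCM Z → X → Y. do(X = x) deletes f_X (cutting the Z → X
    edge) and fixes X = x, while the rest of the model is unchanged."""
    z = rng.normal(size=n)                                   # Z = U_Z
    x = z + rng.normal(size=n) if do_x is None else np.full(n, do_x)
    y = 2.0 * x + rng.normal(size=n)                         # f_Y(X, U_Y)
    return y

# Average causal effect E[Y | do(x'_0)] − E[Y | do(x_0)] of Eq. (5):
ace = sample_y(200_000, do_x=1.0).mean() - sample_y(200_000, do_x=0.0).mean()
```

Because $f_Y$ multiplies $X$ by 2, the estimated average causal effect comes out close to 2, recovering the slope of the intervened mechanism rather than any observational correlation.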
In the setting of MFRL, the mean action $\bar{a}^i_{t-1}$ and state $s_t$ determine the policy $\pi^i_t(\cdot \mid s_t, \bar{a}^i_{t-1})$ of agent $i$. As the key causal effect we are concerned with is how important an interaction is for decision making, i.e., how the interaction contained in $\bar{a}^i_{t-1}$ affects $\pi^i_t$, we construct the SCM centered on $\pi^i_t$ as illustrated in Fig. 3(b). The importance of the interaction with adjacent agent $k$ for the central agent $i$ can be estimated by replacing $\bar{a}^i_{t-1}$ with $a^{i,k}_{t-1}$ and quantified by the change in $\pi^i_t$. Formally, the causal effect of agent $k$'s action on $\pi^i_t$ is

$$TE^{i,k}_t = \mathrm{DIST}\Big(\pi^i_t\big(\cdot \mid s_t, \bar{a}^i_{t-1}\big),\; \pi^i_t\big(\cdot \mid s_t, do(\bar{a}^i_{t-1} = a^{i,k}_{t-1})\big)\Big) \tag{7}$$

where $a^{i,k}_{t-1}$ is the action of the $k$-th agent in the neighborhood of agent $i$. The causal effects in Eq. (5) and Eq. (6) are quantified using the difference in statistics before and after the intervention; as the policies are known, we can use the difference in policies to quantify causal effects. $\mathrm{DIST}$ measures the difference between the pre-intervention and post-intervention policies; we use the KL divergence as the $\mathrm{DIST}$ function in practice. As $\pi^i_t$ is parameterized by a neural network, the $do(\cdot)$ operation is performed by directly changing the input of $\pi^i_t$. It is worth noting that not all neural networks are capable of causal inference (Xia et al., 2021). As a network learned by interacting with the environment, $\pi^i_t$ lies on the second layer of the Pearl Causal Hierarchy (Bareinboim et al., 2022) and naturally contains both the causality of agent-wise interactions and the causality of agent-environment interactions, which is sufficient for estimating the causal effect of a given interaction.

Figure 3: (a) A canonical SCM: when the $do(x_0)$ operation is performed on $X$, all incoming causes of $X$ are cut and all variables are kept unchanged except that $X$ is set to $x_0$.
(b) The SCM of MFRL; the $do(\cdot)$ operation on $\bar{a}^i_{t-1}$ follows the same procedure.
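The intervention of Eq. (7) reduces to one extra forward pass per neighbor. A minimal sketch, where `policy_fn` stands in for the learned policy network $\pi^i_t$ and its signature is an assumption:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def causal_effect(policy_fn, s, a_bar, a_k):
    """TE^{i,k} of Eq. (7): the do(·) is realized by swapping the
    network input ā for the neighbor's one-hot action a^{i,k}."""
    pre = policy_fn(s, a_bar)   # π(· | s, ā)
    post = policy_fn(s, a_k)    # π(· | s, do(ā = a^{i,k}))
    return kl(pre, post)
```

If substituting a neighbor's action leaves the policy unchanged, its causal effect is zero; the more the intervened policy diverges, the more that interaction mattered.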

4.2. IMPROVING MFQ WITH CAUSAL EFFECT

In MFRL, we assume that different pairwise Q-functions should be assigned different weights depending on their potential influence on the policy of the central agent. Hence, the factorization of Eq. (2) should be revised to

$$Q^i\big(s, a^1, a^2, \ldots, a^N\big) = \sum_{k \in \mathcal{N}(i)} w^{i,k}\, Q^i\big(s, a^i, a^{i,k}\big) \tag{8}$$

where $\mathcal{N}(i)$ is the set of agent $i$'s adjacent agents. Then $Q^i(s, a^1, a^2, \ldots, a^N)$ is approximated using mean-field theory, taking the causality-aware weights into account:

$$
\begin{aligned}
Q^i\big(s, a^1, \ldots, a^N\big) &= \sum_{k \in \mathcal{N}(i)} w^{i,k}\, Q^i\big(s, a^i, a^{i,k}\big) \\
&= \sum_{k \in \mathcal{N}(i)} w^{i,k} \Big[ Q^i\big(s, a^i, \check{a}^i\big) + \nabla_{\check{a}^i} Q^i\big(s, a^i, \check{a}^i\big) \cdot \delta a^{i,k} + \tfrac{1}{2}\, \delta a^{i,k} \cdot \nabla^2_{\tilde{a}^{i,k}} Q^i\big(s, a^i, \tilde{a}^{i,k}\big) \cdot \delta a^{i,k} \Big] \\
&= Q^i\big(s, a^i, \check{a}^i\big) + \nabla_{\check{a}^i} Q^i\big(s, a^i, \check{a}^i\big) \cdot \Big[ \sum_{k \in \mathcal{N}(i)} w^{i,k}\, \delta a^{i,k} \Big] + \sum_{k \in \mathcal{N}(i)} w^{i,k}\, \tfrac{1}{2}\, \delta a^{i,k} \cdot \nabla^2_{\tilde{a}^{i,k}} Q^i\big(s, a^i, \tilde{a}^{i,k}\big) \cdot \delta a^{i,k} \\
&= Q^i\big(s, a^i, \check{a}^i\big) + \sum_{k \in \mathcal{N}(i)} w^{i,k}\, R^i_{s,a^i}\big(a^{i,k}\big) \\
&\approx Q^i\big(s, a^i, \check{a}^i\big)
\end{aligned} \tag{9}
$$

where $\delta a^{i,k} = a^{i,k} - \check{a}^i$ and $\check{a}^i = \sum_{k \in \mathcal{N}(i)} w^{i,k} a^{i,k}$, hence $\sum_k w^{i,k} \delta a^{i,k} = 0$. In the second-order term, $\tilde{a}^{i,k} = \check{a}^i + \epsilon^{i,k}\, \delta a^{i,k}$ with $\epsilon^{i,k} \in (0, 1)$. $R^i_{s,a^i}(a^{i,k})$ denotes the Lagrange remainder of the first-order Taylor expansion, which is bounded by $[-L, L]$ provided the function $Q^i(s, a^i, a^{i,k})$ is $L$-smooth; to be self-contained, we give the derivation in Appendix B. The remainder is a value fluctuating around zero; as Yang et al. (2018) discuss, under the assumption that the fluctuations caused by adjacent agents tend to cancel each other out, the remainder can be neglected. Once the causal effects of the pairwise interactions are known, the next question is how to improve the representational capacity of the merged agent. Both linear methods (e.g., a weighted sum) and nonlinear methods (e.g., encoding with a neural network) might be useful. However, to keep the merged agent reasonable, we prefer a representation in the linear space spanned by the adjacent agents' action vectors, and an intuitive method that induces reasonable output is a weighted sum.
In practice, we find that a weighted sum using the respective causal effects as weights suffices to effectively improve the representational capacity of the average action:

$$\pi^i_t\big(a^i_t \mid s_t, \check{a}^i_{t-1}\big) = \frac{\exp\big(\beta Q^i_t(s_t, a^i_t, \check{a}^i_{t-1})\big)}{\sum_{a'^i \in A^i} \exp\big(\beta Q^i_t(s_t, a'^i, \check{a}^i_{t-1})\big)}, \qquad \check{a}^i_{t-1} = \sum_{k \in \mathcal{N}(i)} w^{i,k}_t a^{i,k}_{t-1} \tag{10}$$

$$w^{i,k}_t = \frac{TE^{i,k}_t + \epsilon}{\sum_{k \in \mathcal{N}(i)} \big(TE^{i,k}_t + \epsilon\big)} \tag{11}$$

where subscripts denote time steps and $TE^{i,k}_t$ is calculated according to Eq. (7). Each $a^{i,k}_{t-1}$ is encoded as a one-hot vector, so the weighted sum yields a reasonable representation in the linear space spanned by the neighbors' actions. Moreover, the representation stays close to the essential actions, emphasizing interactions with high potential impact. The term $\epsilon$ smooths the weight distribution across all adjacent agents, avoiding additional non-stationarity during training; the naive mean-field approximation is recovered as $\epsilon \to \infty$. The Q-function $Q^i$ is updated using a loss function similar to Eq. (1):

$$L^i(\theta_i) = \mathbb{E}_{s,a,r,s'} \Big[ \Big( Q^i\big(s, a^i, \check{a}^i; \theta_i\big) - \big(r + \gamma \max_{a'^i} Q^i\big(s', a'^i, \check{a}^i; \theta_i^-\big)\big) \Big)^2 \Big] \tag{12}$$
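The weighting of Eqs. (10)–(11) can be sketched directly; the function names below are illustrative:

```python
import numpy as np

def causal_weights(te, eps=0.1):
    """w^{i,k} = (TE^{i,k} + ε) / Σ_k (TE^{i,k} + ε), Eq. (11).
    A large ε smooths toward uniform weights, i.e. plain MFQ."""
    te = np.asarray(te, dtype=float)
    return (te + eps) / (te + eps).sum()

def merged_action(te, neighbor_actions, n_actions, eps=0.1):
    """ǎ of Eq. (10): a causality-aware point in the simplex
    spanned by the neighbors' one-hot actions."""
    return causal_weights(te, eps) @ np.eye(n_actions)[neighbor_actions]
```

Because the weights are a convex combination, the merged action `ǎ` always remains a valid point in the simplex over actions, and sending `eps` to infinity recovers the uniform average of MFQ.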

5. EXPERIMENTS

We evaluate CMFQ in two tasks: a mixed cooperative-competitive battle game and a cooperative predator-prey game. In the battle task, we compare CMFQ with IQL (Tampuu et al., 2017), MFQ (Yang et al., 2018), and Attention-MFQ (Wang et al., 2022) to investigate the effectiveness and scaling capacity of CMFQ. We further verify the effectiveness of CMFQ in the predator-prey task, where we compare CMFQ with MFQ and Attention-MFQ. Our experimental environment is MAgent (Zheng et al., 2018). Task Setting. In the battle task, agents are separated into two groups, each containing N agents. Every agent tries to survive and annihilate the other group; ultimately, the team with more surviving agents wins. Each agent obtains a partial observation of the environment and knows the last actions other agents took. Agents are punished when moving and attacking, which leads them to act efficiently. Agents are punished when dead and rewarded only when killing an enemy; this reward setting requires agents to cooperate efficiently with teammates to annihilate the enemies. In the experiments, we train CMFQ, IQL, MFQ, and Attention-MFQ with N = 64 and then vary N from 64 to 400 to investigate the scalability of CMFQ. The concrete reward values are set as follows: r_attack = -0.1, r_move = -0.005, r_dead = -0.1, r_kill = 5. We train every algorithm in the self-play paradigm.

5.1. MIXED COOPERATIVE-COMPETITIVE GAME

Quantitative Results and Analysis. As illustrated in Fig. 5(a), we compare CMFQ with Attention-MFQ, MFQ, and IQL. We do not choose Wu et al. (2022) as a baseline because it is a correlation-based algorithm similar to Attention-MFQ, and we consider the attention-based method the more challenging baseline. In addition to these algorithms, we also set up an ablation algorithm named Random to verify that the performance improvement of CMFQ is not caused by randomization; Random follows the same pipeline as CMFQ but returns a random causal effect for each interaction. Fig. 4 shows the learning curves of all algorithms. The total rewards of all algorithms converge to stable values, empirically demonstrating the training scalability of our algorithm. To compare performance, we put the trained algorithms in a test environment with N = 64 and let them battle against each other. Fig. 5(a) shows that MFQ performs better than IQL but worse than Attention-MFQ, indicating that the mean-field approximation mitigates the scalability problem in this task, but that simple averaging as in MFQ is not a good representation of the population's behavioral information; to improve its representational ability in large-scale scenarios, different weights must be assigned to different agents. Moreover, CMFQ outperforms Attention-MFQ during the test, verifying our hypothesis that correlation-based weighting is insufficient to capture the essential interactions properly, while intervention fills this gap by giving agents the ability to ask "what if". Ablations. We run two ablation experiments. The first ablates the causal effects in CMFQ: as illustrated in Fig. 5(a), the performance of Random is inferior to MFQ, verifying the validity of the causal effect in CMFQ. The second is an ablation of ϵ.
As analyzed in Section 4.2, ϵ is an adjustable parameter in the interval [0, +∞). As ϵ increases, the influence of each interaction becomes smoother, and CMFQ eventually equals MFQ as ϵ → +∞. From Fig. 5(c), we can see that as we adjust ϵ from 0.001 to 1, the learning curve of CMFQ always converges, and in the test environment the win rate of CMFQ always exceeds the other baselines; when ϵ is relatively large, the win rate is close to that of MFQ. Visualization Analysis. As illustrated in Fig. 6(a), CMFQ learns the tactic of besieging, while MFQ tends to confront frontally. The results in Fig. 6(b) highlight a tricky issue in the mixed cooperative-competitive game: agents need to cooperate with their teammates to kill enemies, whereas only the agent who lands the fatal attack gets the biggest reward r_kill, driving agents to hesitate to attack first. When there are few agents, the policies of MFQ and CMFQ tend to be conservative. However, CMFQ exhibits more advanced tactics: agents learn the trick of teaming up. When an agent chooses to attack, adjacent teammates come to help, achieving the maximum reward at the smallest cost of health. Moreover, Fig. 6(b) also shows that the attacks of CMFQ are more focused than those of the baselines: CMFQ can discriminate key interactions and times its attacks more accurately, while MFQ lacks this discriminative ability and thus keeps attacking. Task Setting. In the predator-prey task, agents are divided into predators and prey. Prey move 1.5 times faster than predators, and their task is to avoid predators as much as possible. Predators are four times larger than prey and can attack, though attacks yield no damage. Predators are only rewarded when they are close to prey; therefore, to gain reward, they must cooperate with other predators and try to surround prey using their size advantage.
In our experiments, to test the scalability of CMFQ, we first train MFQ, CMFQ, and Attention-MFQ using the self-play paradigm in a scenario involving 20 predators and 40 prey, and then test them in environments involving (20 predators, 40 prey), (80 predators, 160 prey), and (180 predators, 360 prey), respectively. The rewards are set as follows: r_attack = -0.2, r_surround = 1, r_be_surrounded = -1. The total reward of Attention-MFQ is higher than that of MFQ, with a similar trend. In comparison, the total reward of CMFQ is higher than both, and its trend is even flatter, indicating that CMFQ has better scalability.

5.2. COOPERATIVE GAME

Figure 9: Visualization of the cooperative predator-prey game. The first row shows the results of CMFQ and the second row the results of Attention-MFQ, with N_predator = 20, N_prey = 40 in the left column, N_predator = 40, N_prey = 80 in the middle column, and N_predator = 180, N_prey = 360 in the right column. Red squares are predators, blue squares are prey, and grey squares are obstacles. All images are taken 400 steps after the game begins.

Visualization Analysis. The results of trained CMFQ and Attention-MFQ controlling predators are shown in Fig. 9. In the environment with N_predator = 20, N_prey = 40, CMFQ and Attention-MFQ perform similarly. Predators learn two strategies: four predators cooperating to surround prey in an open area, and two or three predators surrounding prey with the help of obstacles. In the environment with N_predator = 40, N_prey = 80, as the number of agents increases, predators controlled by Attention-MFQ are more dispersed than those controlled by CMFQ, and more of them stay idle. Predators controlled by CMFQ gather at the map edges, because it is more efficient to surround prey with the help of the edges; in addition, they learn an advanced strategy of driving prey to the map edges and then taking advantage of the terrain to surround them. In the environment with N_predator = 180, N_prey = 360, this advanced strategy also appears. Moreover, predators controlled by CMFQ master the skill of using the bodies of still teammates who have already captured prey as obstacles. Thus, predators controlled by CMFQ exhibit a high degree of aggregation and environmental adaptability.

Obviously, we can obtain the bound of $\lambda$: $\lambda \in [-L, L]$. $\nabla^2 Q(x)$ is a real symmetric matrix, so there exists an orthogonal matrix $U$ that diagonalizes $\nabla^2 Q(x)$ such that $U^T \nabla^2 Q(x)\, U = \Lambda \triangleq \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_N]$.
Then the bound of $R^i_{s,a^i}(a^{i,k})$ can be derived as follows:

$$R^i_{s,a^i}\big(a^{i,k}\big) = \frac{1}{2}\, \delta a^{i,k} \cdot \nabla^2 Q\big(\tilde{a}^k\big) \cdot \delta a^{i,k} = \frac{1}{2} \big(U \delta a^{i,k}\big)^T \Lambda \big(U \delta a^{i,k}\big) = \frac{1}{2} \sum_{n=1}^{N} \lambda_n \big[U \delta a^{i,k}\big]_n^2 \tag{17}$$

$$-L \big\|U \delta a^{i,k}\big\|^2 \le \sum_{n=1}^{N} \lambda_n \big[U \delta a^{i,k}\big]_n^2 \le L \big\|U \delta a^{i,k}\big\|^2 \tag{18}$$

where $[U \delta a^{i,k}]_n$ refers to the $n$-th element of the vector $U \delta a^{i,k}$, and

$$\big\|U \delta a^{i,k}\big\|^2 = \big\|\delta a^{i,k}\big\|^2 = \big(a^{i,k} - \check{a}^i\big)^T \big(a^{i,k} - \check{a}^i\big) = a^{i,k\,T} a^{i,k} + \check{a}^{i\,T} \check{a}^i - 2\, \check{a}^{i\,T} a^{i,k} \le 2\big(1 - \check{a}^i_n\big) \le 2 \tag{19}$$

where $a^{i,k}$ is a one-hot encoded action and $\check{a}^i_n$ denotes the $n$-th element of $\check{a}^i$, with $n$ the index of the nonzero entry of $a^{i,k}$. Finally, combining Eq. (17), Eq. (18), and Eq. (19), the bound of $R^i_{s,a^i}(a^{i,k})$ is $[-L, L]$.

C VISUALIZATION FOR THE WEIGHTS OF CMFQ AND ATTENTION-MFQ

To further analyze why CMFQ is empirically more effective than Attention-MFQ, we randomly select an agent in the mixed cooperative-competitive game task and visualize its weights. Several interesting observations can be made from Fig. 10(a). First, it makes sense that the agents on the front line are given high weights, because they are battling. Second, the weights of agents at the edge of the front line are relatively small, possibly because these agents can cooperate with nearby teammates to attack an enemy thanks to their positional advantage, so they are in a relatively dominant state. In addition, agents at the very edge of the front line are given higher weights even when they are out of combat, because they are in a position to flank the opponents and work with their teammates to surround them. In Fig. 10(b), we observe a result consistent with the analysis in our paper: the attention-based method uses the attributes of other agents to calculate attention scores, and observation is an important part of those attributes, so it tends to give high weights to nearby agents simply because their observations are similar.



6. CONCLUSIONS

This paper addresses the scalability problem in large-scale MAS. First, we inherit the framework of MFRL, which significantly reduces the dimensionality of the joint state-action space. To further handle the intractable non-stationarity when the number of agents is large, we propose an SCM to model the decision-making process and enable agents to identify the more crucial interactions by intervening on the SCM. Finally, a causality-aware representation of population behavioral information is obtained as the weighted sum of each agent's action according to its causal effect. Experiments on two tasks reveal the excellent scalability of CMFQ.




Figure 4: Total reward during training.


Figure 5: Win rates during execution. (a) shows the results when the algorithms battle against each other: the horizontal axis is divided into five groups by algorithm, and within each group five bars represent the win rate of that algorithm against each opponent. (b) shows the win rates of the labeled algorithms against the MFQ algorithms on the horizontal axis. (c) shows the win rate of CMFQ with different ϵ against the other algorithms.

Figure 7: Total reward during training.

Figure 8: Total reward of predators during execution as the number of agents increases. 1× denotes N_predator = 20, N_prey = 40; 4× denotes N_predator = 80, N_prey = 160, and so on. All algorithms are trained in the 1× environment.

(a) The weights obtained by CMFQ. (b) The weights obtained by Attention-MFQ.

Figure 10: The two figures visualize the mixed cooperative-competitive task, where each agent in the blue team in (a) is controlled by CMFQ and each agent in the blue team in (b) is controlled by Attention-MFQ. Each agent in the red team is controlled by MFQ. The agent in the blue team whose weights are visualized is labeled in green. The number above each blue agent represents the normalized weight that the green agent assigns to the pairwise interaction between them. Due to space constraints, the integer parts of all weights are omitted.

A IMPLEMENTATION DETAILS

The pseudocode of CMFQ is listed below.

Algorithm 1 Causal Mean-Field Q-learning
Input: Initialize state s_0; Q_{θ_i}, Q_{θ⁻_i}, and ǎ^i_0 for every agent i ∈ {1, 2, ..., N}
while training has not finished do
    for each agent i do
        Compute the causal effects TE^{i,k}_t by Eq. (7) and the weights w^{i,k}_t by Eq. (11);
        Compute ǎ^i_{t-1} and sample action a^i_t by Eq. (10);
    end for
    Execute the joint action and store the transitions in the replay buffer;
    for each agent i do
        Sample a minibatch of transitions from the replay buffer;
        Calculate L^i and update θ_i by Eq. (12);
        Update the target network by θ⁻_i = θ_i after every C updates of θ_i;
    end for
end while
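The per-agent decision step of Algorithm 1 can be condensed into one function; this is a minimal numpy sketch, where `q_fn(s, a, merged)` stands in for the learned pairwise Q-network and all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cmfq_act(q_fn, s, neighbor_actions, n_actions, beta=1.0, eps=0.1):
    """One decision step of Algorithm 1 for a single agent: intervene with
    each neighbor's action, weight interactions by causal effect, and act
    Boltzmann-greedily on the causality-aware merged action."""
    onehot = np.eye(n_actions)[neighbor_actions]
    a_bar = onehot.mean(axis=0)                      # plain mean-field action

    def policy(merged):                              # Eq. (10) for a given ā/ǎ
        return softmax(beta * np.array(
            [q_fn(s, a, merged) for a in range(n_actions)]))

    pre = policy(a_bar)
    te = np.array([np.sum(pre * np.log((pre + 1e-12) / (policy(a_k) + 1e-12)))
                   for a_k in onehot])               # Eq. (7), KL divergence
    w = (te + eps) / (te + eps).sum()                # Eq. (11)
    a_check = w @ onehot                             # causality-aware ǎ
    return policy(a_check), a_check
```

Note that the extra cost over MFQ is one forward pass of the policy per neighbor, which is what keeps the method compatible with partial observability: only the neighbors' last actions are needed.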

B DERIVATION FOR THE BOUND OF THE LAGRANGE REMAINDER

As $s$ and $a^i$ in $Q^i(s, a^i, a^{i,k})$ are fixed parameters in the derivation of Eq. (9), for simplicity the pairwise Q-function $Q^i(s, a^i, a^{i,k})$ is written as $Q(a^k)$ in the following. We assume that $a^k$ is a one-hot encoding over $n$ actions; to make $Q(a^k)$ more general, we replace the discrete $a^k$ ($a^k \in \mathbb{R}^N$) by a continuous $x$ ($x \in \mathbb{R}^N$), which does not violate the domain of the parameterized Q-function. Given that $Q(x)$ is $L$-smooth, for any two points $x, y \in \mathrm{dom}(Q) \subseteq \mathbb{R}^N$ there exists a Lipschitz constant $L$ such that $\|\nabla Q(y) - \nabla Q(x)\|_2 \le L \|y - x\|_2$. By the first-order Taylor expansion with Lagrange remainder, we have $Q(y) = Q(x) + \nabla Q(x) \cdot u + R(u)$, where $u = y - x$ and $\lim_{u \to 0} R(u)/\|u\|_2 = 0$. Assuming $x \neq y$, the remainder can be written in its second-order form $R(u) = \frac{1}{2}\, u \cdot \nabla^2 Q(\xi) \cdot u$ for some $\xi$ on the segment between $x$ and $y$; $L$-smoothness bounds the eigenvalues of $\nabla^2 Q(x)$, so Eq. (15) can be converted as follows.

D SUPPLEMENTAL EXPERIMENT ON MPE

To further investigate the applicability of CMFQ, we perform an experiment in another environment, the multi-agent particle environment (MPE) (Mordatch & Abbeel, 2017). As the dimensionality of the state-action space changes with the initial number of agents, it is difficult to verify scalability in MPE; however, we believe CMFQ's scalability has been adequately validated in the previous experiments. For MPE, we test the predator-prey task with the same number of agents as in the training environment and compare the results with Section 5.2 to see whether the same conclusions hold in both environments.

Figure 11: Average reward during training. (a) Average reward of predators. (b) Average reward of prey.

Task Setting. There are 20 predators, 40 prey, and 20 obstacles. A predator gets r_collide = 10 if it collides with a prey; a prey gets r_be_collided = -10 if a predator collides with it. The speed of prey is 1.3 times that of predators. To make prey learn to leverage obstacles instead of running to infinity, we manually draw a bounded area; if prey go beyond this area, they receive a penalty r_bound that grows with the distance beyond the boundary, up to r_bound = -10. We train MFQ, CMFQ, and Attention-MFQ in the self-play paradigm; the training curves are shown in Fig. 11.

Quantitative Results and Analysis. In the test phase, we control the 20 predators and 40 prey with the different algorithms, test 10 times, and calculate the average reward of each algorithm, as shown in Table 1. First, the average reward of MFQ is lower than those of CMFQ and Attention-MFQ, regardless of whether it controls predators or prey, indicating that the representational ability of the averaged merged agent is insufficient. Second, when MFQ controls the prey, the average predator reward of CMFQ is higher than that of Attention-MFQ, indicating that the weights obtained by CMFQ are more representative.
Finally, in the direct comparison between CMFQ and Attention-MFQ, CMFQ outperforms Attention-MFQ in both predator reward and prey reward, further confirming the superiority of CMFQ. In the task where the number of agents at test time equals that at training time, we compare the performance of MFQ, CMFQ, and Attention-MFQ and reach the same conclusion as in Section 5.

