CAUSAL MEAN FIELD MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Scalability remains a challenge in multi-agent reinforcement learning and is currently under active investigation. A framework named mean-field reinforcement learning (MFRL) alleviates the scalability problem by employing mean-field theory to turn a many-agent problem into a two-agent problem. However, this framework lacks the ability to identify essential interactions in non-stationary environments. Causality contains the relatively invariant mechanisms behind interactions, even though environments are non-stationary. Therefore, we propose an algorithm called causal mean-field Q-learning (CMFQ) to address the scalability problem. CMFQ is more robust to changes in the number of agents while inheriting the compressed state-action representation of MFRL. First, we model the causality behind the decision-making process of MFRL as a structural causal model (SCM). Then the essentiality of each interaction is quantified by intervening on the SCM. Furthermore, we design a causality-aware compact representation of agents' behavioral information as the weighted sum of all behavioral information according to their causal effects. We test CMFQ in a mixed cooperative-competitive game and a cooperative game. The results show that our method achieves excellent scalability both when training in environments containing a large number of agents and when testing in environments containing many more agents.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) has achieved remarkable success in some challenging tasks, e.g., video games (Vinyals et al., 2019; Wu, 2019). However, training a large number of agents remains a challenge in MARL. The main reasons are that 1) the dimensionality of the joint state-action space increases exponentially with the number of agents, and 2) while a single agent is being trained, the policies of the other agents keep changing, causing a non-stationarity problem whose severity grows with the number of agents (Sycara, 1998; Zhang et al., 2019; Gronauer & Diepold, 2021). Existing works generally use the centralized-training decentralized-execution paradigm to mitigate the scalability problem by mitigating non-stationarity (Rashid et al., 2018; Foerster et al., 2018; Lowe et al., 2017; Sunehag et al., 2017). Curriculum learning and attention techniques have also been used to improve scalability (Long et al., 2020; Iqbal & Sha, 2019). However, the above methods mostly handle tens of agents. For large-scale multi-agent systems (MAS) containing hundreds of agents, studies in game theory (Blume, 1993) and mean-field theory (Stanley, 1971; Yang et al., 2018) offer a feasible framework to mitigate the scalability problem. Under this framework, Yang et al. (2018) propose an algorithm called mean-field Q-learning (MFQ), which replaces the joint action in the joint Q-function with an average action, assuming that the entirety of agent-wise interactions can be simplified into the mean of local pairwise interactions. That is, MFQ reduces the dimensionality of the joint state-action space with a merged agent. However, this approach ignores that pairwise interactions differ in importance, resulting in poor robustness.
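As a concrete illustration, the mean action that MFQ substitutes for the joint action can be computed as the average of the neighbors' one-hot actions (a minimal sketch under a discrete action space; the function name and encoding are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def mean_action(neighbor_actions, n_actions):
    """Compress neighbors' discrete actions into one mean action (MFQ-style).

    neighbor_actions: integer action indices of agent i's neighbors.
    Returns the average of their one-hot encodings, so the joint
    Q-function Q(s, a_i, a_joint) reduces to a pairwise Q(s, a_i, a_bar).
    """
    one_hot = np.eye(n_actions)[neighbor_actions]  # shape (k, n_actions)
    return one_hot.mean(axis=0)                    # shape (n_actions,)

# Four neighbors taking actions 0, 2, 2, 1 out of 4 possible actions:
a_bar = mean_action([0, 2, 2, 1], n_actions=4)
# a_bar is the empirical action distribution of the neighborhood
```

Note that every neighbor contributes equally to `a_bar`, which is exactly the uniform weighting that CMFQ later replaces with causal-effect weights.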
Moreover, one of the drawbacks of mean-field theory is that it does not properly account for fluctuations when few interactions exist (Uzunov, 1993) (e.g., the average action may change drastically if there are only two adjacent agents). Wang et al. (2022) attempt to improve the representational ability of the merged agent by assigning each pairwise interaction a weight given by its attention score. However, this method requires the observations of other agents as input, making it less practical in the real world. In addition, the attention score is essentially a correlation in feature space, which seems unconvincing: on the one hand, an agent pays more attention to another agent not simply because of a higher correlation; on the other hand, proximal agents may inevitably be assigned high weights merely because their observations are highly similar. In this paper, we discuss a better way to represent the merged agent. We propose an algorithm named causal mean-field Q-learning (CMFQ) that addresses the shortcoming of MFQ in robustness via causal inference. Research in psychology reveals that humans have a sense of the logic of intervention and employ it in decision-making contexts (Sloman & Lagnado, 2015). This suggests that if agents are allowed to intervene within the framework of mean-field reinforcement learning (MFRL), they could identify the more essential interactions as humans do. Inspired by this insight, we assume that different pairwise interactions should be assigned different weights, and that these weights can be obtained via intervening. We introduce a structural causal model (SCM) that represents the invariant causal structure of decision-making in MFRL. We intervene on the SCM so that the effect of a specific pairwise interaction can be revealed by comparing the difference before and after the intervention.
Intuitively, intervening enables agents to ask "what if the merged agent were replaced with an adjacent agent", as illustrated in Fig. 1. In practice, a pairwise interaction is embodied as the actions taken between two agents, so the intervention is likewise performed on the action. CMFQ rests on the assumption that the joint Q-function can be factorized into local pairwise Q-functions, which mitigates the curse of dimensionality in the scalability problem. Moreover, CMFQ alleviates another facet of the scalability problem, namely non-stationarity, by focusing on crucial pairwise interactions. Identifying crucial interactions is based on causal inference rather than an attention mechanism; notably, the scalability of CMFQ turns out to be much better than that of the attention-based method (Wang et al., 2022), for reasons discussed in the experiments section. Since causal inference only requires the local pairwise Q-functions, CMFQ is practical in real-world applications, which are usually partially observable. We evaluate CMFQ in a cooperative predator-prey game and a mixed cooperative-competitive battle game. The results illustrate that the scalability of CMFQ significantly outperforms all baselines, and that agents controlled by CMFQ exhibit more advanced collective intelligence.

Figure 1: Blue agents and orange agents belong to different teams. The purple agent denotes a merged agent that simply averages all agents in agent i's neighborhood. The left diagram shows a scenario in which the central agent i interacts with many agents; i_k denotes the k-th agent in the observation of agent i. In the framework of MFRL, the scenario is transformed into the middle diagram, in which a merged agent characterizes all agents in the central agent's observation. Our method further enables the central agent to learn to ask "what if": it can imagine the scenario illustrated in the right diagram, hypothetically replacing the action of the merged agent in MFRL with the action of a neighboring agent. If this replacement causes a dramatic change in its policy, the neighboring agent is potentially important, and the central agent should pay more attention to the interaction with it.
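The "what if" replacement described above can be sketched as follows. The causal effect of neighbor k is measured by how much the central agent's policy changes when the merged agent's mean action is replaced by neighbor k's action; the L1 policy distance, the Boltzmann policy, the softmax normalization, and `q_values_fn` are illustrative assumptions here, not the paper's exact design:

```python
import numpy as np

def causal_weights(q_values_fn, state, neighbor_actions, a_bar, tau=1.0):
    """Weight each neighbor by the effect of the intervention do(a_bar := a_k).

    q_values_fn(state, a_other) returns agent i's Q-values over its own
    actions given the action of the merged agent. The causal effect of
    neighbor k is measured as the L1 distance between the Boltzmann
    policies before and after replacing the mean action a_bar with a_k.
    """
    def policy(a_other):
        q = q_values_fn(state, a_other) / tau
        z = np.exp(q - q.max())          # numerically stable softmax
        return z / z.sum()

    base = policy(a_bar)                 # policy under the plain mean action
    effects = np.array([np.abs(policy(a_k) - base).sum()
                        for a_k in neighbor_actions])
    return np.exp(effects) / np.exp(effects).sum()  # softmax-normalized weights

# Toy check: a neighbor whose action differs from a_bar perturbs the
# policy more, so it receives a larger weight in the reweighted mean.
q_fn = lambda s, a_other: s + a_other    # dummy linear Q-function
state = np.zeros(3)
a_bar = np.array([0.5, 0.0, 0.5])
neighbors = [np.array([1.0, 0.0, 0.0]), a_bar.copy()]
w = causal_weights(q_fn, state, neighbors, a_bar)
```

The resulting weights `w` would then replace the uniform averaging of MFQ, i.e., the reweighted mean action is the weighted sum of the neighbors' actions according to their causal effects.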

2. RELATED WORK

The scalability problem has been widely investigated in the current literature. Yang et al. (2018) propose the framework of MFRL, which increases scalability by reducing the state-action space. Several works in the related area of mean-field games also prove that using a compact representation to




