INTENTION PROPAGATION FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

A hallmark of an AI agent is the ability to understand and interact with others, as humans do. In this paper, we propose a collaborative multi-agent reinforcement learning algorithm that learns a joint policy through interactions between agents. To reach a joint decision, each agent makes an initial decision and communicates its policy to its neighbors. Each agent then adjusts its own policy based on the received messages and propagates its updated plan. As this intention propagation procedure continues, we prove that it converges to a mean-field approximation of the joint policy within the framework of neural embedded probabilistic inference. We evaluate our algorithm on several large-scale challenging tasks and demonstrate that it outperforms previous state-of-the-art methods.

1. INTRODUCTION

Collaborative multi-agent reinforcement learning is an important sub-field of multi-agent reinforcement learning (MARL), where the agents learn to coordinate to achieve joint success. It has wide applications in traffic control (Kuyer et al., 2008), autonomous driving (Shalev-Shwartz et al., 2016) and smart grids (Yang et al., 2018). To learn coordination, interactions between agents are indispensable. For instance, humans can reason about others' behaviors or learn other people's intentions through communication, and then determine an effective coordination plan. However, how to design such an interaction mechanism in a principled way while solving large-scale real-world applications remains a challenging problem. Recently, there has been a surge of interest in solving the collaborative MARL problem (Foerster et al., 2018; Qu et al., 2019; Lowe et al., 2017). Among these, joint policy approaches have demonstrated their superiority (Rashid et al., 2018; Sunehag et al., 2018; Oliehoek et al., 2016). A straightforward approach is to replace the action in single-agent reinforcement learning by the joint action $a = (a_1, a_2, \ldots, a_N)$, but this obviously suffers from an exponentially large action space. Thus several approaches have been proposed to factorize the joint action space to mitigate this issue, which can be roughly grouped into two categories:
• Factorization on the policy. This approach explicitly assumes that $\pi(a|s) := \prod_{i=1}^N \pi_i(a_i|s)$, i.e., the policies are independent (Foerster et al., 2018; Zhang et al., 2018). To mitigate the instability caused by independent learners, it generally needs a centralized critic.
• Factorization on the value function. This approach has a similar spirit but factorizes the joint value function into several utility functions, each involving the actions of only one agent (Rashid et al., 2018; Sunehag et al., 2018).
However, both approaches lack interactions between agents, since in their algorithms agent $i$ does not consider the plan of agent $j$. Indeed, they may suffer from a phenomenon called relative overgeneralization in game theory, observed by Wei & Luke (2016); Castellini et al. (2019); Palmer et al. (2018). Approaches based on the coordination graph can effectively prevent such cases, where the value function is factorized as a summation of utility functions on pairwise or local joint actions (Guestrin et al., 2002; Böhmer et al., 2020). However, they can only be applied to small-scale games with discrete actions. Furthermore, despite the empirical success of the aforementioned work in certain scenarios, theoretical insight is still lacking. In this work, we make only a simple yet realistic assumption: the reward function $r_i$ of each agent $i$ depends just on its individual action, the actions of its neighbors, and the state, i.e.,
$$r_i(s, a) = r_i(s, a_i, a_{N_i}), \qquad (1)$$
where $N_i$ denotes the neighbors of agent $i$ and $s$ denotes the global state. It says that the goal or decision of an agent is explicitly influenced by only a small subset $N_i$ of the other agents. Such an assumption is reasonable in many real scenarios. For instance,
• The traffic light at an intersection decides on phase changes mainly based on the traffic flow around it and the policies of its neighboring traffic lights.
• The main goal of a defender in a soccer game is to tackle the opponent's attacker, while he rarely needs to pay attention to the opposing goalkeeper's strategy.
Based on the assumption in equation 1, we propose a principled multi-agent reinforcement learning algorithm in the framework of probabilistic inference, where the objective is to maximize the long-term reward of the group, i.e., $\sum_{t=0}^{\infty} \sum_{i=1}^{N} \gamma^t r_i^t$ (see details in section 4).
Note that since each agent's reward depends on its neighbors, we still need a joint policy that maximizes the global reward through interactions. In this paper, we derive an iterative procedure for such interaction to learn the joint policy in collaborative MARL and name it intention propagation. Particularly,
• In the first round, each agent $i$ makes an independent decision and spreads its plan $\tilde\mu_i$ (which we call its intention) to its neighbors.
• In the second round, agent $i$ adjusts its initial intention based on its neighbors' intentions $\tilde\mu_j$, $j \in N_i$, and propagates its updated intention $\tilde\mu_i$ again.
• In the third round, it revises the decision of the second round by a similar argument.
• As this procedure goes on, we show that the final output of the agents' policies converges to the mean-field approximation (the variational inference method from probabilistic graphical models (Bishop, 2006)) of the joint policy. In addition, this joint policy has the form of a Markov random field induced by the locality of the reward function (proposition 1).
Such a procedure is computationally efficient when the underlying graph is sparse, since in each round each agent only needs to consider what its neighbors intend to do. Remark: (1) Our work is not related to the mean-field game (MFG) (Yang et al., 2018). The goal of MFG is to find a Nash equilibrium, while our work aims at the optimal joint policy in a collaborative game. Furthermore, MFG generally assumes agents are identical and interchangeable; when the number of agents goes to infinity, MFG can view the states of the other agents as a population state distribution. Our problem requires no such assumptions. (2) Our analysis is not limited to the mean-field approximation. When we change the message passing structure of intention propagation, we can show that it converges to other approximations of the joint policy, e.g., loopy belief propagation in variational inference (Yedidia et al., 2001) (see Appendix B.2).
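To make the rounds above concrete, here is a toy numeric sketch of the propagation dynamics. It is not the paper's neural parameterization (which appears in section 4); each agent simply blends its current "intention" (a distribution over two actions) with the average intention of its neighbors, and on a connected graph the repeated rounds settle to a fixed point. The `mix` coefficient is an illustrative assumption.

```python
import numpy as np

def propagate_intentions(init, neighbors, n_rounds=50, mix=0.5):
    """Toy intention propagation: each round, every agent blends its own
    intention with the average intention of its neighbors.

    init:      dict agent -> (A,) initial action distribution (the first-round plan)
    neighbors: dict agent -> list of neighbor agents
    """
    mu = {i: v.copy() for i, v in init.items()}
    for _ in range(n_rounds):
        new = {}
        for i, nbrs in neighbors.items():
            avg = sum(mu[j] for j in nbrs) / len(nbrs)
            # adjust the plan using the neighbors' broadcast intentions
            new[i] = (1 - mix) * mu[i] + mix * avg
        mu = new
    return mu
```

On a 3-agent line graph with conflicting initial plans, the agents converge to a common compromise distribution, illustrating how local exchanges produce a group-level agreement.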

Contributions:

(1) We propose a principled method named intention propagation to solve the joint policy collaborative MARL problem; (2) our method is computationally efficient, scaling up to one thousand agents and thus meeting the requirements of real applications; (3) empirically, it outperforms state-of-the-art baselines by a wide margin when the number of agents is large; (4) our work builds a bridge between MARL and neural embedded probabilistic inference, which may lead to new algorithms beyond intention propagation. Notation: $s_i^t$ and $a_i^t$ represent the state and action of agent $i$ at time step $t$. The neighbors of agent $i$ are denoted $N_i$. We denote by $X$ a random variable with domain $\mathcal{X}$ and refer to instantiations of $X$ by the lower-case character $x$. We denote a density on $\mathcal{X}$ by $p(x)$ and the space of all such densities by $\mathcal{P}$.

2. RELATED WORK

We first discuss the factorized approaches to the joint policy. COMA designs a MARL algorithm based on the actor-critic framework with independent actors $\pi_i(a_i|s)$, where the joint policy is factorized as $\pi(a|s) = \prod_{i=1}^N \pi_i(a_i|s)$ (Foerster et al., 2018). MADDPG considers MARL in cooperative or competitive settings, creating a critic for each agent (Lowe et al., 2017). Other similar works include (de Witt et al., 2019; Wei et al., 2018). Another way is to factorize the value function into several utility functions. Sunehag et al. (2018) assume that the overall Q function can be factorized as $Q(s, a_1, a_2, \ldots, a_N) = \sum_{i=1}^N Q_i(s_i, a_i)$. QMIX extends this work to a richer class of functions, assuming the overall Q function is monotonic w.r.t. each $Q_i(s_i, a_i)$ (Rashid et al., 2018). Similarly, Son et al. (2019) further relax the structural constraint on the joint value function. However, these factorized methods suffer from the relative overgeneralization issue (Castellini et al., 2019; Palmer et al., 2018). Generally speaking, it pushes the agents to underestimate a certain action because of the low rewards they receive, while they could get a higher one by coordinating perfectly. A middle ground between the (fully) joint policy and the factorized policy is the coordination graph (Guestrin et al., 2002), where the value function is factorized as a summation of utility functions on pairwise actions. Böhmer et al. (2020); Castellini et al. (2019) combine deep learning techniques with the coordination graph. This addresses the issue of relative overgeneralization, but still has two limitations, especially in large-scale MARL problems. (1) The max-sum algorithm can only be implemented in discrete action spaces, since it needs a max-sum operation over the actions of the Q function.
(2) Even in the discrete-action case, each step of Q learning has to run several loops of the max-sum operation over the whole graph if the graph contains a cycle. Our algorithm can handle both discrete and continuous action spaces and alleviates the scalability issue by designing an intention propagation network. Another category of MARL considers communication among agents. An attention mechanism can be used to decide when and with whom to communicate (Das et al., 2018). Foerster et al. (2016) propose an end-to-end method to learn a communication protocol. In (Liu et al., 2019; Chu et al., 2020), each agent sends its action information to its neighbors. In addition, Chu et al. (2020) require a strong assumption that the MDP has the spatial-temporal Markov property. However, these methods utilize the neighbors' action information in a heuristic way, and it is thus unclear what the agents are learning (e.g., do they learn the optimal joint policy that maximizes the group reward?). Jiang et al. (2020) propose DGN, which uses a GNN to spread state embedding information to neighbors. However, each agent still uses independent Q learning to learn its policy and neglects other agents' plans. In contrast, we propose a principled algorithm where each agent makes decisions considering other agents' plans. The procedure can be parameterized by GNNs and other neural networks (see section 4.1 and appendix B.2), and we prove its convergence to the solution of variational inference methods.

3. BACKGROUNDS

Probabilistic Reinforcement Learning: Probabilistic reinforcement learning (PRL) (Levine, 2018) is our building block. PRL defines the trajectory $\tau$ up to time step $T$ as $\tau = [s_0, a_0, s_1, a_1, \ldots, s_T, a_T, s_{T+1}]$. The probability distribution of the trajectory $\tau$ induced by the optimal policy is defined as
$$\hat p(\tau) = \Big[p(s_0) \prod_{t=0}^T p(s_{t+1}|s_t, a_t)\Big] \exp\Big(\sum_{t=0}^T r(s_t, a_t)\Big),$$
while the probability of the trajectory $\tau$ under a policy $\pi(a|s)$ is defined as
$$p(\tau) = p(s_0) \prod_{t=0}^T p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t).$$
The objective is to minimize the KL divergence between $p(\tau)$ and $\hat p(\tau)$. This is equivalent to maximum entropy reinforcement learning, $\max_\pi J(\pi) = \sum_{t=0}^T \mathbb{E}[r(s_t, a_t) + H(\pi(\cdot|s_t))]$, where we omit the discount factor $\gamma$ and the regularization factor $\alpha$ of the entropy term, since it is easy to incorporate them into the transition and reward respectively. Notice that in this framework the max operator in the Bellman optimality equation is replaced by the softmax operator, and thus the optimal policy is a softmax function of the Q function (Haarnoja et al., 2017). This framework subsumes state-of-the-art algorithms such as soft actor-critic (SAC) (Haarnoja et al., 2018). In each iteration, SAC optimizes the following loss functions of $Q$, $\pi$ and $V$ respectively:
$$\mathbb{E}_{(s_t,a_t)\sim\mathcal{D}}\big[\big(Q(s_t,a_t) - r(s_t,a_t) - \gamma\,\mathbb{E}_{s_{t+1}\sim p}[V(s_{t+1})]\big)^2\big],$$
$$\mathbb{E}_{s_t\sim\mathcal{D}}\,\mathbb{E}_{a_t\sim\pi}\big[\log \pi(a_t|s_t) - Q(s_t,a_t)\big],$$
$$\mathbb{E}_{s_t\sim\mathcal{D}}\big[\big(V(s_t) - \mathbb{E}_{a_t\sim\pi_\theta}[Q(s_t,a_t) - \log\pi(a_t|s_t)]\big)^2\big],$$
where $\mathcal{D}$ is the replay buffer. Function Space Embedding of Distributions: In our work, we use the tool of embeddings in a reproducing kernel Hilbert space (RKHS) to design the intention propagation procedure (Smola et al., 2007). We let $\phi(X)$ be an implicit feature map and $X$ be a random variable with distribution $p(x)$. The embedding of $p(x)$ is given by $\mu_X := \mathbb{E}_X[\phi(X)] = \int \phi(x)\,p(x)\,dx$, i.e., the distribution is mapped to its expected feature map.
By assuming that there exists a feature space in which the embeddings are injective, we can treat the embedding $\mu_X$ of the density $p(x)$ as a sufficient statistic of the density, i.e., any information we need from the density is preserved in $\mu_X$ (Smola et al., 2007). Such an injectivity assumption generally holds under mild conditions (Sriperumbudur et al., 2008). This property is important since we can reformulate a functional $f: \mathcal{P} \to \mathbb{R}$ of $p(\cdot)$ using the embedding only, i.e., $f(p(x)) = \tilde f(\mu_X)$. It can also be generalized to the operator case. In particular, applying an operator $\mathcal{T}: \mathcal{P} \to \mathbb{R}^d$ to a density can be equivalently carried out using its embedding, $\mathcal{T} \circ p(x) = \tilde{\mathcal{T}} \circ \mu_X$, where $\tilde{\mathcal{T}}: \mathcal{F} \to \mathbb{R}^d$ is the alternative operator working on the embedding. In practice, $\mu_X$, $\tilde f$ and $\tilde{\mathcal{T}}$ have complicated dependence on $\phi$. As such, we approximate them by neural networks, which is known as the neural embedding approach to distributions (Dai et al., 2016).
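A small sketch of the embedding $\mu_X = \mathbb{E}_X[\phi(X)]$ estimated from samples. The paper leaves $\phi$ implicit and learns it with a neural network; here we substitute an explicit random Fourier feature map (an RBF-kernel approximation, purely for illustration) to show that the empirical mean feature map distinguishes distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature_map(d_in, d_feat):
    """Random Fourier features approximating an RBF kernel's feature map phi.
    (Illustrative stand-in; in the paper phi is learned implicitly.)"""
    W = rng.normal(size=(d_feat, d_in))
    b = rng.uniform(0, 2 * np.pi, size=d_feat)
    return lambda x: np.sqrt(2.0 / d_feat) * np.cos(W @ x + b)

def empirical_embedding(samples, phi):
    """mu_X = E[phi(X)], estimated by the empirical mean of the feature map."""
    return np.mean([phi(x) for x in samples], axis=0)
```

Two sample sets from the same distribution yield nearly identical embeddings, while a shifted distribution yields a clearly different one, which is the practical content of the injectivity assumption.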

4. OUR METHOD

In this section, we present our method, intention propagation, for collaborative multi-agent reinforcement learning. To begin with, we formally define the problem as a networked MDP. The network is characterized by a graph $G = (V, E)$, where each vertex $i \in V$ represents an agent and an edge $ij \in E$ represents a communication link between agents $i$ and $j$. We say $i, j$ are neighbors if they are connected by an edge. The corresponding networked MDP is characterized by a tuple $(\{S_i\}_{i=1}^N, \{A_i\}_{i=1}^N, p, \{r_i\}_{i=1}^N, \gamma, G)$, where $N$ is the number of agents, $S_i$ is the local state space of agent $i$ and $A_i$ denotes the set of actions available to agent $i$. We let $S := \prod_{i=1}^N S_i$ and $A := \prod_{i=1}^N A_i$ be the global state and joint action spaces respectively. At time step $t+1$, the global state $s_{t+1} \in S$ is drawn from the transition $s_{t+1} \sim p(\cdot|s_t, a_t)$, conditioned on the current state $s_t$ and the joint action $a_t = (a_1^t, a_2^t, \ldots, a_N^t) \in A$. Each transition yields a reward $r_i^t = r_i(s_t, a_t)$ for agent $i$, and $\gamma$ is the discount factor. The aim of our algorithm is to learn a joint policy $\pi(a_t|s_t)$ that maximizes the overall long-term reward (with an entropy term $H(\cdot|s)$ on the joint action $a$),
$$\eta(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \Big(\sum_{i=1}^N r_i^t + H(\cdot|s_t)\Big)\Big],$$
where each agent $i$ can only observe its own state $s_i$ and the messages from neighborhood communication. We denote the neighbors of agent $i$ by $N_i$ and further assume that the reward $r_i$ depends on the state and the actions of agent $i$ and its neighbors, i.e., $r_i(s, a) := r_i(s, a_i, a_{N_i})$. Such an assumption is reasonable in many real scenarios, as discussed in the introduction. In the following, we start the derivation with the full-observation case and discuss how to handle partial observation later. The roadmap of the derivation: first, we prove that the optimal policy has a Markov random field (MRF) form, which reduces the exponentially large search space to a polynomial one.
However, implementing an MRF policy is non-trivial in the RL setting (e.g., sampling an action from such a policy). Thus we resort to variational inference (we focus on the mean-field approximation in the main paper and leave other methods to the appendix), but this introduces complicated computations. Finally, we apply the kernel embedding method introduced in section 3 to solve this problem and learn the kernel embedding with neural networks. We also discuss how to handle the partially observable setting.

4.1. REDUCE POLICY SEARCHING SPACE

Recall that our aim is to maximize the long-term reward with the entropy term. We therefore follow the definition of the optimal policy in probabilistic reinforcement learning (Levine, 2018) and obtain proposition 1: under the assumption $r_i(s, a) = r_i(s, a_i, a_{N_i})$, the optimal policy has the form of a Markov random field (MRF). We prove the following proposition in appendix I.1.
Proposition 1 The optimal policy has the form
$$\pi^*(a_t|s_t) = \frac{1}{Z} \exp\Big(\sum_{i=1}^N \psi_i(s_t, a_i^t, a_{N_i}^t)\Big),$$
where $Z$ is the normalization term. This proposition is important since it suggests that we should construct the policy $\pi(a_t|s_t)$ in this form, e.g., as a parametric family, so that it contains the optimal policy. If agent $i$ and its neighbors compose a clique, the policy reduces to an MRF and $\psi$ is the potential function. One common example is a reward that is a function of pairwise actions, i.e., $r(s, a) = \sum_{i \in V} r_i(s, a_i) + \sum_{(i,j) \in E} r_{ij}(s, a_i, a_j)$. Then the policy has the form
$$\pi(a|s) = \frac{1}{Z} \exp\Big(\sum_{i \in V} \psi_i(s, a_i) + \sum_{(i,j) \in E} \psi_{ij}(s, a_i, a_j)\Big),$$
which is a pairwise MRF. For instance, in traffic light control, we can define a 2-D grid network and a pairwise reward function. The MRF formulation of the policy effectively reduces the policy space compared with the exponentially large one of the fully connected graph. A straightforward way to leverage this observation is to define $\pi_\theta(a_t|s_t)$ as an MRF and then apply a policy gradient algorithm, e.g., as in SAC:
$$\nabla_\theta\, \mathbb{E}_{s_t \sim \mathcal{D}}\, \mathbb{E}_{a_t \sim \pi_\theta}\big[\log \pi_\theta(a_t|s_t) - Q_\kappa(s_t, a_t)\big].$$
However, it is still very hard to sample the joint action $a_t$ from $\pi_\theta(a_t|s_t)$. In the next section, we resort to embeddings to alleviate this problem. Recall that the remaining problem is how to sample the joint action from an MRF policy. Classical approaches include Markov chain Monte Carlo methods and variational inference.
The former guarantees exact samples from the target density but is computationally intensive, and is therefore not applicable in the multi-agent RL setting, since we need to sample an action in each interaction with the environment. As such, we advocate the second approach. Here we use the mean-field approximation for simplicity of presentation and defer other variational inference methods, e.g., loopy belief propagation, to Appendix B.2. We use an intention propagation network with embeddings of distributions to represent the update rule of the mean-field approximation.
Mean-field approximation. We approximate $\pi^*(a|s)$ by the mean-field variational family of product distributions:
$$\min_{(p_1, p_2, \ldots, p_N)} \mathrm{KL}\Big(\prod_{i=1}^N p_i(a_i|s)\,\Big\|\,\pi^*(a|s)\Big),$$
where we omit the superscript $t$ to simplify notation. We denote the optimal solution of the above problem by $q_i$. Using coordinate ascent variational inference, the optimal solution $q_i$ should satisfy the following fixed-point equation (Bishop, 2006); since the objective function is (generally) non-convex, the update converges to a local optimum (Blei et al., 2017):
$$q_i(a_i|s) \propto \exp\Big(\int \prod_{j \neq i} q_j(a_j|s)\, \log \pi^*(a|s)\, da_{-i}\Big). \qquad (2)$$
For simplicity of presentation, in the following discussion we assume the policy is a pairwise MRF, but the methodology applies to the more general case with a more involved expression. Particularly, we assume $\pi^*(a|s) = \frac{1}{Z}\exp(\sum_{i\in V}\psi_i(s,a_i) + \sum_{(i,j)\in E}\psi_{ij}(s,a_i,a_j))$. Plugging this into equation 2, we obtain the following fixed-point equation:
$$\log q_i(a_i|s) = c_i + \psi_i(s, a_i) + \sum_{j \in N_i} \int q_j(a_j|s)\, \psi_{ij}(s, a_i, a_j)\, da_j, \qquad (3)$$
where $c_i$ is a constant that does not depend on $a_i$. We can understand this mean-field update rule from the perspective of intention propagation. Equation 3 basically says that each agent cannot make its decision independently; instead, its policy $q_i$ should depend on the policies of others, particularly its neighbors.
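The coordinate-ascent update can be sketched directly for discrete actions and a fixed state. The unary and pairwise potentials below are hypothetical inputs (in the paper they are never formed explicitly; the update is learned via embeddings), and the brute-force marginal routine is only a sanity check feasible for tiny problems:

```python
import numpy as np
from itertools import product

def cavi_mean_field(psi, psi_pair, neighbors, n_iters=50):
    """Coordinate-ascent mean-field updates for a pairwise MRF policy
    pi*(a|s) ~ exp(sum_i psi_i(a_i) + sum_{ij} psi_ij(a_i, a_j)),
    for one fixed state s with discrete actions.

    psi:       dict i -> (A,) unary potentials psi_i(s, .)
    psi_pair:  dict (i, j) -> (A, A) pairwise potentials psi_ij(s, ., .)
    neighbors: dict i -> list of neighbors N_i
    Returns dict i -> (A,) approximate marginal q_i(a_i | s).
    """
    q = {}
    for i, u in psi.items():                      # initialize from the unary part
        e = np.exp(u - u.max()); q[i] = e / e.sum()
    for _ in range(n_iters):
        for i in psi:
            # log q_i(a_i) = c_i + psi_i(a_i) + sum_{j in N_i} E_{q_j}[psi_ij(a_i, a_j)]
            logits = psi[i].copy()
            for j in neighbors[i]:
                pot = psi_pair[(i, j)] if (i, j) in psi_pair else psi_pair[(j, i)].T
                logits += pot @ q[j]
            e = np.exp(logits - logits.max()); q[i] = e / e.sum()
    return q

def exact_marginals(psi, psi_pair, n_agents, n_actions):
    """Brute-force marginals of the joint MRF policy (only feasible for tiny N),
    used to sanity-check the mean-field approximation."""
    probs = {}
    for a in product(range(n_actions), repeat=n_agents):
        score = sum(psi[i][a[i]] for i in range(n_agents))
        score += sum(pot[a[i], a[j]] for (i, j), pot in psi_pair.items())
        probs[a] = np.exp(score)
    Z = sum(probs.values())
    marg = np.zeros((n_agents, n_actions))
    for a, p in probs.items():
        for i in range(n_agents):
            marg[i, a[i]] += p / Z
    return marg
```

With zero pairwise potentials the joint policy factorizes, and the mean-field marginals coincide exactly with the true marginals (each a softmax of its unary potential).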
Clearly, if we construct an intention propagation corresponding to equation 3, the final policy obtained from intention propagation converges to the mean-field approximation of the joint policy. However, we cannot directly apply this update in our algorithm, since it includes a complicated integral. To this end, in the next section we resort to the embedding of the distribution $q_i$ (Smola et al., 2007), which maps distributions into a reproducing kernel Hilbert space.
Embed the update rule. Observe that the fixed-point formulation in equation 3 says that $q_i(a_i|s)$ is a functional of the neighborhood marginal distributions $\{q_j(a_j|s)\}_{j\in N_i}$, i.e., $q_i(a_i|s) = f(a_i, s, \{q_j\}_{j\in N_i})$. Denote the $d$-dimensional embedding of $q_j(a_j|s)$ by $\tilde\mu_j = \int q_j(a_j|s)\,\phi(a_j|s)\,da_j$. Notice that the form of the feature map $\phi$ is not fixed at the moment and will be learned implicitly by the neural network. Following the assumption in section 3 that there exists a feature space in which the embeddings are injective, we can replace each distribution by its embedding and obtain the fixed-point formulation
$$q_i(a_i|s) = f(a_i, s, \{\tilde\mu_j\}_{j\in N_i}). \qquad (4)$$
For more theoretical guarantees on the kernel embedding, e.g., the convergence rate of the empirical mean of the kernel embedding, please refer to (Smola et al., 2007). Roughly speaking, once there is enough data, the learned kernel embedding is close to the true kernel embedding, and therefore the updates in equation 4 and equation 5 below converge to the fixed point of equation 2.
[Figure 1: (a) Message passing: agent 1 receives the embeddings (intentions) from its neighbors and outputs $q_1(a_1|s)$ after M iterations of propagation. (b) Parameterization of the intention propagation network by a graph neural network.]
Recall from section 3 that we can integrate both sides against the feature map $\phi$, which yields
$$\tilde\mu_i = \int q_i(a_i|s)\,\phi(a_i|s)\,da_i = \int f(a_i, s, \{\tilde\mu_j\}_{j\in N_i})\,\phi(a_i|s)\,da_i.$$
Thus we can rewrite it as a new operator on the embedding, which induces a fixed-point equation again:
$$\tilde\mu_i = \tilde{\mathcal{T}} \circ (s, \{\tilde\mu_j\}_{j\in N_i}). \qquad (5)$$
In practice, we run this fixed-point update for $M$ iterations:
$$\tilde\mu_i^m \leftarrow \tilde{\mathcal{T}} \circ (s, \{\tilde\mu_j^{m-1}\}_{j\in N_i}), \quad m = 1, \ldots, M.$$
Finally, we output the distribution $q_i$ with $q_i(a_i|s) = f(a_i, s, \{\tilde\mu_j^M\}_{j\in N_i})$. In the next section, we show how to represent these variables by neural networks.
Parameterization by Neural Networks. In general, $f$ and $\tilde{\mathcal{T}}$ have complicated dependency on $\psi$ and $\phi$. Instead of learning such dependency, we directly approximate $f$ and $\tilde{\mathcal{T}}$ by neural networks. For instance, we can represent the operator $\tilde{\mathcal{T}}$ in equation 5 by $\tilde\mu_i = \sigma(W_1 s + W_2 \sum_{j\in N_i} \tilde\mu_j)$, where $\sigma$ is a nonlinear activation function and $W_1$, $W_2$ are matrices with $d$ rows. Interestingly, this is exactly the message passing form of a graph neural network (GNN) (Hamilton et al., 2017). Thus we can use an $M$-hop (layer) GNN to represent the fixed-point update in equation 5. If the action space is discrete, the output $q_i(a_i|s)$ is a softmax function; in this case $f$ is a fully connected layer with a softmax output. When the action space is continuous, we output a Gaussian distribution with the reparameterization trick (Kingma & Welling, 2019). We denote this intention propagation procedure as the intention propagation network $\Lambda_\theta(a|s)$ with parameter $\theta$, shown in Figure 1(b). Figure 1(a) illustrates the graph and the message passing procedure. Agent 1 receives the embeddings (intentions) $\tilde\mu_2^{m-1}$, $\tilde\mu_5^{m-1}$, $\tilde\mu_6^{m-1}$ from its neighbors, updates its own embedding with the operator $\tilde{\mathcal{T}}$, and spreads its new embedding $\tilde\mu_1^m$ in the next iteration. Figure 1(b) gives the details of the GNN parameterization. Here we use agent 1 as an example and, to ease the exposition, assume agent 1 has just one neighbor, agent 2. Each agent observes its own state $s_i$.
After an MLP and a softmax layer (we do not sample actions here, but just use the probabilities of the actions), we get an embedding $\tilde\mu_i^0$, which is the initial distribution of the policy. Then agent 1 receives the embedding $\tilde\mu_2^0$ of its neighbor (agent 2). After a GNN layer that combines the information, e.g., $\tilde\mu_1^1 = \mathrm{ReLU}[W_1(s_1 + s_2) + W_2(\tilde\mu_1^0 + \tilde\mu_2^0)]$ ($W_1$, $W_2$ are shared across all agents, as in a GNN), we obtain the new embedding $\tilde\mu_1^1$ of agent 1. Notice that we also do message passing on the states, since in practice the global state is not available. The second layer proceeds similarly. We defer a detailed discussion and the extension to other neural networks to Appendix B due to the space constraint.
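A minimal numpy sketch of this M-hop message passing for discrete actions. The weight shapes, the single linear layer standing in for the initial MLP, and the simple sum aggregator over the closed neighborhood are illustrative assumptions; the paper's actual network also handles continuous actions via a Gaussian head.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def intention_propagation(states, neighbors, W1, W2, W_out, M=2):
    """Sketch of the M-hop intention propagation network (shared-weight GNN).

    states:    dict agent -> (d_s,) local observation s_i
    neighbors: dict agent -> list of neighbors N_i
    W1, W2:    (d, d_s) and (d, d) shared message-passing weights
    W_out:     (A, d) readout producing discrete-action logits (the layer f)
    Returns dict agent -> (A,) policy q_i(a_i | s).
    """
    # hop 0: initial intention from the agent's own observation
    mu = {i: relu(W1 @ states[i]) for i in states}
    for _ in range(M):
        # each agent combines the states and current intentions of itself
        # and its neighbors, mirroring mu_i = sigma(W1 s + W2 sum_j mu_j)
        mu = {i: relu(W1 @ sum(states[j] for j in [i] + neighbors[i])
                      + W2 @ sum(mu[j] for j in [i] + neighbors[i]))
              for i in states}
    # readout: softmax over discrete actions
    return {i: softmax(W_out @ mu[i]) for i in states}
```

Note that the message passing also runs on the states, as in the text, since the global state is unavailable under partial observability.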

4.2. ALGORITHM

We are ready to give the overall algorithm by combining all the pieces. Detailed derivations of $V_i$, $Q_i$ for agent $i$ and the corresponding loss functions are given in appendix I due to the space constraint. Recall that we have a mean-field approximation $q_i$ of the joint policy, obtained by $M$ iterations of intention propagation. We represent this procedure by an $M$-hop graph neural network with parameter $\theta$, as discussed above. Notice that this factorization is different from the case $\pi(a|s) = \prod_{i=1}^N \pi_i(a_i|s)$ in (Zhang et al., 2018; Foerster et al., 2018), since $q_i(a_i|s)$ depends on the information of other agents' plans. Using the mean-field approximation $q_i$, we can further decompose $Q = \sum_{i=1}^N Q_i$ and $V = \sum_{i=1}^N V_i$; see appendix I. We use neural networks to approximate the $V_i$ and $Q_i$ functions with parameters $\eta_i$ and $\kappa_i$ respectively. As in TD3 (Fujimoto et al., 2018), for each agent $i$ we have a target value network $V_{\bar\eta_i}$ and two $Q_{\kappa_i}$ functions to mitigate overestimation, training them simultaneously on the same data and selecting the minimum of the two as the target in the value update. In the following we write $q_i(a_i|s)$ as $q_{i,\theta}(a_i|s)$ to explicitly indicate its dependence on the intention propagation network $\Lambda_\theta$, and use $\mathcal{D}$ to denote the replay buffer. The whole algorithm is presented in Algorithm 1.
Loss Functions. The loss of the value function $V_i$:
$$J(\eta_i) = \mathbb{E}_{s_t\sim\mathcal{D}}\Big[\tfrac{1}{2}\big(V_{\eta_i}(s_t) - \mathbb{E}_{(a_i^t, a_{N_i}^t)\sim(q_i, q_{N_i})}[Q_{\kappa_i}(s_t, a_i^t, a_{N_i}^t) - \log q_{i,\theta}(a_i^t|s_t)]\big)^2\Big].$$
The loss of $Q_i$:
$$J(\kappa_i) = \mathbb{E}_{(s_t, a_i^t, a_{N_i}^t)\sim\mathcal{D}}\Big[\tfrac{1}{2}\big(Q_{\kappa_i}(s_t, a_i^t, a_{N_i}^t) - \hat Q_i(s_t, a_i^t, a_{N_i}^t)\big)^2\Big],$$
where $\hat Q_i(s_t, a_i^t, a_{N_i}^t) = r_i + \gamma\,\mathbb{E}_{s_{t+1}\sim p(\cdot|s_t, a_t)}[V_{\bar\eta_i}(s_{t+1})]$. The loss of the policy:
$$J(\theta) = \mathbb{E}_{s_t\sim\mathcal{D},\, a_t\sim\prod_{i=1}^N q_i}\Big[\sum_{i=1}^N \log q_{i,\theta}(a_i^t|s_t) - \sum_{i=1}^N Q_{\kappa_i}(s_t, a_i^t, a_{N_i}^t)\Big].$$
It is interesting to compare these losses with their counterparts in single-agent SAC in section 3.
• $q_{i,\theta}(a_i|s)$ is the output of the intention propagation network $\Lambda_\theta(a|s)$ parameterized by a graph neural network; thus it depends on the policies of the other agents.
• $Q_{\kappa_i}$ depends on the actions of agent $i$ and its neighbors, which in practice can also be computed by the graph neural network.
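Single-sample estimates of the three losses can be sketched as follows. The scalar arguments stand for network outputs on one transition from the replay buffer (hypothetical names, not the paper's code); the expectations in the losses are estimated by averaging these per-sample terms over a minibatch.

```python
def value_loss_sample(V_s, Q_sa, logq_a):
    """Per-sample squared error for J(eta_i): V_i(s) regresses the soft target
    E_{a~q}[Q_i(s,a) - log q_i(a|s)], here a one-sample estimate."""
    target = Q_sa - logq_a
    return 0.5 * (V_s - target) ** 2

def q_loss_sample(Q_sa, r_i, V_next, gamma=0.99):
    """Per-sample squared Bellman error for J(kappa_i), with the target network
    value V_next standing in for E_{s'~p}[V_i(s')]."""
    target = r_i + gamma * V_next
    return 0.5 * (Q_sa - target) ** 2

def policy_loss_sample(logqs, Qs):
    """Single-sample estimate of J(theta): sum_i log q_i - sum_i Q_i,
    with one entry per agent in each list."""
    return sum(logqs) - sum(Qs)
```

In training, gradients of the minibatch averages of these terms flow into $\eta_i$, $\kappa_i$ and $\theta$ respectively, with the reparameterization (or probability) trick handling the sampling inside $J(\theta)$.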

Algorithm 1 Intention Propagation

Inputs: replay buffer $\mathcal{D}$; $V_i$, $Q_i$ for each agent $i$; intention propagation network $\Lambda_\theta(a_t|s_t)$ with outputs $\{q_{i,\theta}\}_{i=1}^N$; learning rates $l_\eta$, $l_\kappa$, $l_\theta$; moving-average parameter $\tau$ for the target networks.
for each iteration do
    for each environment step do
        Sample $a_t \sim \prod_{i=1}^N q_{i,\theta}(a_i^t|s_t)$ from the intention propagation network.
        $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$; $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_i^t, a_i^t, r_i^t, s_i^{t+1})\}_{i=1}^N$.
    end for
    for each gradient step do
        Update $\eta_i$, $\kappa_i$, $\theta$, $\bar\eta_i$:
        $\eta_i \leftarrow \eta_i - l_\eta \nabla J(\eta_i)$, $\kappa_i \leftarrow \kappa_i - l_\kappa \nabla J(\kappa_i)$,
        $\theta \leftarrow \theta - l_\theta \nabla J(\theta)$, $\bar\eta_i \leftarrow \tau \eta_i + (1-\tau)\bar\eta_i$.
    end for
end for
Handle Partial Observation: So far we have assumed that agents can observe the global state, while in practice each agent observes only its own state $s_i$. Thus, besides communicating intentions, we also perform message passing on the state embeddings with the graph neural network. The idea of this local state sharing is similar to (Jiang et al., 2020), although the overall structure of our work is quite different; see the discussion in the related work.

5. EXPERIMENT

In this section, we evaluate our method and eight state-of-the-art baselines on more than ten different scenarios from three popular MARL platforms: (1) CityFlow, a traffic signal control environment (Tang et al., 2019). It is an advanced version of SUMO (Lopez et al., 2018), widely used in the MARL community. (2) The multiple particle environment (MPE) (Mordatch & Abbeel, 2017), and (3) the grid-world platform MAgent (Zheng et al., 2018). Our intention propagation (IP) empirically outperforms all baselines on all scenarios, especially on large-scale problems.
[Figure 3: Performance on large-scale traffic lights control scenarios in CityFlow. Horizontal axis: environmental steps. Vertical axis: average episode reward (negative average travel time). Higher rewards are better. Our intention propagation (IP) performs best, especially on large-scale tasks.]

5.1. SETTINGS

We give a brief introduction to the experimental settings and defer details such as hyperparameter tuning of intention propagation and the baselines to appendix D. Notice that all algorithms are tested in the partially observable setting, i.e., each agent can only observe its own state $s_i$. In the traffic signal control problem (left panel in Figure 2), each traffic light at an intersection is an agent. The goal is to learn policies for the traffic lights that reduce the average waiting time and alleviate traffic jams. Graph for CityFlow: the graph is the 2-D grid induced by the map (e.g., Figure 2); the roads are the edges that connect the agents. We define the cost $-r_i$ as the traveling time of the vehicles around intersection $i$, so the total cost indicates the average traveling time. Clearly, $r_i$ is closely related to the actions of the neighbors of agent $i$ but depends little on traffic lights far away; therefore our assumption on the reward function holds. We evaluate the different methods on both real-world and synthetic traffic data under different numbers of intersections. MPE (Mordatch & Abbeel, 2017) and MAgent (Zheng et al., 2018) (Figure 2) are popular particle environments for MARL (Lowe et al., 2017; Jiang et al., 2020). Graph for particle environments: each agent has connections (i.e., edges of the graph) with its $k$ nearest neighbors. Since the graph is dynamic, we update the adjacency matrix of the graph every $n$ steps, e.g., $n = 5$; this is just a small overhead compared with training the neural networks. The reward functions also have the locality property, since they are explicitly or implicitly affected by the distances between agents. For instance, in heterogeneous navigation, if small agents collide with big agents, they obtain a large negative reward, so their reward depends on the actions of nearby agents. Similarly, in the jungle environment, an agent can attack nearby agents to obtain a high reward.
Baselines.
We compare our method against eight baselines mentioned in the introduction and related work: QMIX (Rashid et al., 2018); MADDPG (Lowe et al., 2017); permutation invariant critic (PIC) (Liu et al., 2019); graph convolutional reinforcement learning (DGN) (Jiang et al., 2020); independent Q-learning (IQL) (Tan, 1993); permutation invariant MADDPG with a data shuffling mechanism (MADDPGS); COMA (Foerster et al., 2018); and MFQ (Yang et al., 2018). These baselines are reported as the leading algorithms for tasks in CityFlow, MPE and MAgent. Among them, DGN and MFQ need communication with neighbors during training and execution. Also notice that PIC assumes the actor can observe the global state; thus, in the partially observable setting, each agent in PIC also needs to communicate to obtain global state information during training and execution. Further details on the baselines are given in appendix E.1. Neural Network and Parameters. Recall that the intention propagation network is represented by a GNN. In our experiments, the graph neural network has 2 hops (2 GNN layers, i.e., $M = 2$) and 1 fully connected layer on top. Each layer contains 128 hidden units. Other hyperparameters are listed in appendix H.
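The dynamic $k$-nearest-neighbor graph described above can be sketched as follows (recomputed every $n$ environment steps). The symmetrization step is our illustrative choice to make communication mutual; the paper does not specify this detail.

```python
import numpy as np

def knn_adjacency(positions, k):
    """Build the dynamic communication graph: each agent connects to its
    k nearest neighbors by Euclidean distance.

    positions: (n, dim) array of agent positions
    Returns a boolean (n, n) adjacency matrix with no self-edges.
    """
    n = len(positions)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)       # exclude self-edges
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        adj[i, np.argsort(dist[i])[:k]] = True
    return adj | adj.T                   # symmetrize: communication is mutual
```

Since rebuilding this matrix is O(n²) distance computations every $n$ steps, it is indeed a small overhead relative to network training, as noted in the text.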

5.2. COMPARISON TO STATE-OF-THE-ART

In this section, we compare intention propagation (IP) with the baselines. The experiments are evaluated by average episode reward (Lowe et al., 2017). For CityFlow tasks, the average reward refers to the negative average travel time. All experiments are repeated for 5 runs with different random seeds; we report the mean and standard deviation in the curves. We report the results of six experiments and defer all the others to appendix G due to the space limit. CityFlow. We first evaluate our algorithm on the traffic control problem. In particular, we gradually increase the number of intersections (agents) to increase the difficulty of the tasks. Figure 3 presents the performance of different methods on both real-world and synthetic CityFlow data with different numbers of intersections. On the Manhattan City task, our intention propagation (IP) method and the baselines PIC and DGN achieve better rewards than the other methods, while our method reaches a higher reward within fewer steps. On the larger task (N=100), both PIC and DGN have large variance and perform poorly. The experiment with N=1225 agents is an extremely challenging task, on which our algorithm outperforms all baselines by a wide margin. The runner-up is MADDPG with the data-shuffling mechanism; its final performance is around -4646 and it suffers from large variance. In contrast, the performance of our method is around -569 (much higher than the baselines). It is clear that, in both real-world and synthetic CityFlow scenarios, the proposed IP method obtains the best performance. We defer further experimental results to appendix G.

MPE and MAgent.

Figure 4 demonstrates the performance of different methods on three other representative scenarios: a small task, cooperative navigation (N=30), and two large-scale tasks, heterogeneous navigation (N=100) and prey and predator (N=100). We run all algorithms long enough (more than 1e6 steps). In all experiments, our algorithm performs best. For cooperative navigation, MADDPGS performs better than MADDPG; the improvement likely comes from the data-shuffling mechanism, which makes MADDPGS more robust to the manually specified order of agents. QMIX performs much better than MADDPG, MADDPGS and IQL, but its performance is not stable even in the small setting (N=30). DGN is better and more stable than QMIX; however, on the large-scale settings its performance is much worse than PIC and our intention propagation (IP). Although PIC can solve large-scale tasks, our IP method is still much better. In prey and predator, there are two groups of agents: good agents and adversaries. To make a fair comparison of rewards across methods, we fix the good agents' policies and use all methods to learn the adversaries' policies; this setting is commonly used (Lowe et al., 2017; Liu et al., 2019). Stability. Stability is a key criterion for evaluating MARL. In all experiments, our method is quite stable with small variance. For instance, as shown in Figure 3(b), DGN reaches -1210 ± 419 on the CityFlow scenario with N=100 intersections, while our method reaches -465 ± 20 after 1.6 × 10^6 steps (much better and more stable). The reason is that, to make the joint decision, each agent in our algorithm can adjust its own policy properly by considering other agents' plans. Ablation study. We conduct a set of ablation studies on the effect of the joint policy, the graph, the hop size, the number of neighbors, and the assumption on the reward function. In particular, we find the joint policy is essential for good performance.
In CityFlow, the traffic graph (the 2-D grid induced by the roadmap) performs better than the fully-connected graph. In MPE and MAgent, we define the adjacency matrix based on the k nearest neighbors and pick k = 8 in large-scale problems and k = 4 in small-scale problems. In all of our experiments, we choose the 2-hop GNN. Due to space limitations, we only summarize our conclusions here and place the details in appendix F.
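The dynamic k-nearest-neighbor graph construction described above can be sketched as follows (a minimal NumPy illustration of our own; the function and parameter names are not from the paper). The adjacency matrix is rebuilt from agent positions and refreshed only every few environment steps, as in the ablation:

```python
import numpy as np

def knn_adjacency(positions, k):
    """Build a symmetric k-nearest-neighbor adjacency matrix
    from an (N, 2) array of agent positions."""
    n = len(positions)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))      # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)           # exclude self-loops
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in np.argsort(dist[i])[:min(k, n - 1)]:
            adj[i, j] = adj[j, i] = 1        # symmetrize: messages flow both ways
    return adj

def maybe_update(adj, positions, step, k=8, refresh=5):
    """Refresh the graph only every `refresh` environment steps."""
    return knn_adjacency(positions, k) if step % refresh == 0 else adj
```

Rebuilding the adjacency matrix is O(N^2) here, which is indeed a small overhead relative to a training step of the neural networks.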

A ORGANIZATION OF THE APPENDIX

In appendix B, we give the details of the intention propagation network and the parameterization of the GNN, and explain intention propagation from the MARL point of view. Finally, we extend intention propagation to other approximations which converge to other solutions of the variational inference problem; such extensions can also be easily parameterized by neural networks. In appendix C, we give the details of the algorithm deferred from the main paper. Appendix D summarizes the configuration of the experiments and the MARL environments. Appendix E gives more details on the baselines and the hyperparameters of the GNN used in our model. Appendix F contains the ablation studies deferred from the main paper. Appendices G and H give more experimental results and the hyperparameters used in the algorithms. In appendix I, we derive the algorithm and prove proposition 1.

B INTENTION PROPAGATION NETWORK B.1 DETAILS ON THE INTENTION PROPAGATION NETWORK

In this section, we give the details of the intention propagation network deferred from the main paper. We first illustrate the message passing of intention propagation derived in section 4.1, and then give details on how to construct the graph neural network. Message passing and explanation from the view of MARL: μ_i is the embedding of the policy of agent i, which represents the intention of agent i. At iteration 0, every agent makes an independent decision; the policy of agent i is mapped into its embedding μ_i^0, which we call the intention of agent i at iteration 0. Then agent i sends its plan to its neighbors. In Figure 5, μ_i^m is the d-dimensional (d = 3 in this figure) embedding of q_i at the m-th iteration of intention propagation. We draw the update of μ_1^m: agent 1 receives the embeddings μ^{m-1} of its neighbors (e.g., μ_6^{m-1}), and then updates its own embedding with the operator T. After M iterations, we obtain μ_1^M and output the policy distribution q_1 using equation 4. A similar procedure holds for the other agents. At each RL step t, we run this procedure (with M iterations) once to generate the joint policy. M is in general small, e.g., M = 2 or 3, so the procedure is efficient.

Parameterization on GNN:

We then illustrate the parameterization of the graph neural network in Figure 6. If the action space is discrete, the output q_i(a_i|s) is a softmax function. When it is continuous, we can output a Gaussian distribution (mean and variance) with the reparametrization trick (Kingma & Welling, 2019). Here, we draw a 2-hop (2-layer) GNN to parameterize intention propagation with discrete actions. In Figure 6(b), each agent observes its own state s_i. After an MLP and a softmax layer (we do not sample here, and just use the output probabilities of the actions), we get an embedding μ_i^0, which is the initial distribution of the policy. In the following, we use agent 1 as an example and, to ease the exposition, assume agent 1 has just one neighbor, agent 2. Agent 1 receives the embedding μ_2^0 of its neighbor. After a GNN layer that combines the information, e.g., Relu[W_1(s_1 + s_2) + W_2(μ_1^0 + μ_2^0)], we obtain the new embedding μ_1^1 of agent 1. Notice we also do message passing on the state, since in practice the global state is not available. The second layer is similar: agent 1 receives the embedding μ_2^1 from its neighbor and obtains a new embedding μ_1^2. This embedding then passes through an MLP + softmax layer to output the probability of each action, i.e., q_1(a_1|s).
Figure 6: Details of the graph neural network.
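The two-layer forward pass above can be sketched in a few lines of NumPy (our own minimal illustration; weight names, shapes and the dictionary layout are assumptions, not the paper's exact architecture). Each agent starts from an independent softmax embedding μ_i^0, then aggregates its own and its neighbors' states and intentions for each hop with Relu[W_1(Σ s) + W_2(Σ μ)]:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def intention_propagation(S, adj, weights, hops=2):
    """S: (N, ds) local states; adj: (N, N) 0/1 adjacency matrix.
    weights: dict with an input MLP, per-hop matrices W1/W2, and an output MLP.
    Returns per-agent action distributions q_i(a_i | s)."""
    # iteration 0: independent decisions (initial policy embedding from own state)
    mu = softmax(S @ weights["W_in"])                     # (N, |A|)
    for m in range(hops):
        W1, W2 = weights["W1"][m], weights["W2"][m]
        agg_s = S + adj @ S                               # own + neighbor states
        agg_mu = mu + adj @ mu                            # own + neighbor intentions
        mu = np.maximum(0.0, agg_s @ W1 + agg_mu @ W2)    # Relu[W1(.) + W2(.)]
    return softmax(mu @ weights["W_out"])                 # MLP + softmax head
```

After `hops=2` updates, each agent's distribution depends on information from its 2-hop neighborhood, matching the M = 2 setting used in the experiments.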

B.2 EXTENSION TO OTHER VARIATIONAL INFERENCE METHODS AND NEURAL NETWORKS

In this section, we show how to approximate the joint policy with loopy belief propagation (Yedidia et al., 2001). This leads to a new form of neural network beyond the vanilla GNN illustrated above. The objective function of loopy belief propagation is the Bethe free energy (Yedidia et al., 2001). Different from the mean-field approximation, it introduces another variational variable q_ij, which brings more flexibility to the approximation. In our case the objective is

min_{q_i, q_ij} - Σ_i (|N_i| - 1) ∫ q_i(a_i|s) log [ q_i(a_i|s) / ψ_i(s, a_i) ] da_i + Σ_{ij} ∫ q_ij(a_i, a_j|s) log [ q_ij(a_i, a_j|s) / ( ψ_ij(s, a_i, a_j) ψ_i(s, a_i) ψ_j(s, a_j) ) ] da_i da_j,
s.t. ∫ q_ij(a_i, a_j|s) da_j = q_i(a_i|s), ∫ q_ij(a_i, a_j|s) da_i = q_j(a_j|s).

Solving the above problem gives the fixed-point algorithm

m_ij(a_j|s) ← ∫ Π_{k∈N_i\j} m_ki(a_i|s) ψ_i(s, a_i) ψ_ij(s, a_i, a_j) da_i,
q_i(a_i|s) ∝ ψ_i(s, a_i) Π_{j∈N_i} m_ji(a_i|s).

Similar to the mean-field approximation, we have m_ij(a_j|s) = f(a_j, s, {m_ki}_{k∈N_i\j}) and q_i(a_i|s) = g(a_i, s, {m_ki}_{k∈N_i}), i.e., the message m_ij and the marginal q_i are functionals of the messages from the neighbors. Denote the embeddings ν_ij = ∫ ψ_j(s, a_j) m_ij(a_j|s) da_j and μ_i = ∫ ψ_i(s, a_i) q_i(a_i|s) da_i; then

ν_ij = T ∘ (s, {ν_ki}_{k∈N_i\j}), μ_i = T ∘ (s, {ν_ki}_{k∈N_i}).

Again, we can parameterize the above equations by a (graph) neural network:

ν_ij = σ(W_1 s + W_2 Σ_{k∈N_i\j} ν_ki), μ_i = σ(W_3 s + W_4 Σ_{k∈N_i} ν_ki).

In a similar way, we can derive different intention propagation algorithms by changing the objective function, corresponding to, e.g., double-loop belief propagation (Yuille, 2002), tree-reweighted belief propagation (Wainwright et al., 2003) and many others.
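The loopy-BP embedding recursion above can be sketched as follows (a minimal NumPy illustration of our own; names, shapes and σ = ReLU are assumptions). Each directed edge (i → j) keeps an embedding ν_ij updated from the incoming edge embeddings ν_ki with k ∈ N_i \ j, and each node embedding μ_i aggregates all incoming ν_ki:

```python
import numpy as np

def loopy_bp_layer(s, nu, edges, W1, W2, W3, W4):
    """One synchronous update of the loopy-BP embeddings.
    s: (ds,) state feature; nu: dict {(i, j): (d,) edge embedding};
    edges: list of directed (i, j) pairs (both directions present)."""
    relu = lambda x: np.maximum(0.0, x)
    in_msgs = {}                              # incoming edges grouped by receiver
    for (k, i) in edges:
        in_msgs.setdefault(i, []).append((k, nu[(k, i)]))
    new_nu = {}
    for (i, j) in edges:                      # nu_ij <- sigma(W1 s + W2 sum_{k in N_i \ j} nu_ki)
        total = sum((v for k, v in in_msgs.get(i, []) if k != j),
                    np.zeros_like(nu[(i, j)]))
        new_nu[(i, j)] = relu(W1 @ s + W2 @ total)
    mu = {}
    nodes = {i for e in edges for i in e}
    for i in nodes:                           # mu_i <- sigma(W3 s + W4 sum_{k in N_i} nu_ki)
        total = sum((v for _, v in in_msgs.get(i, [])),
                    np.zeros(W4.shape[1]))
        mu[i] = relu(W3 @ s + W4 @ total)
    return new_nu, mu
```

Compared with the mean-field GNN, the state here lives on directed edges rather than nodes, which is exactly the extra flexibility that the pairwise variables q_ij provide.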

C ALGORITHM

We present some remarks on the intention propagation algorithm (algorithm 1) deferred from the main paper. Remark: to calculate the loss function J(η_i), each agent needs to sample the global state and (a_i, a_{N_i}). Thus we first sample a global state from the replay buffer and then sample all actions a at once using the intention propagation network.

D FURTHER DETAILS ABOUT ENVIRONMENTS AND EXPERIMENTAL SETTINGS

Table 1 summarizes the settings of the tasks in our experiments.

D.1 CITYFLOW

CityFlow (Tang et al., 2019) is an open-source MARL environment for large-scale city traffic signal control. After the traffic road map and flow data are fed into the simulator, each vehicle moves from its origin to its destination. The traffic data contain bidirectional and dynamic flows with turning traffic. We evaluate different methods on both real-world and synthetic traffic data. For real-world data, we select traffic flow data from the Gudang sub-district, Hangzhou, China and from Manhattan, USA. For synthetic data, we simulate several road networks: a 7 × 7 grid network (N = 49) and large-scale grid networks with N = 10 × 10 = 100, 15 × 15 = 225, and 35 × 35 = 1225. Each traffic light at an intersection is an agent. In the real-world settings (Hangzhou, Manhattan), the graph is a 2-D grid induced by the roadmap; in particular, the roads are edges which connect the nodes (agents) of the graph. For the synthetic data, the map is an n × n 2-D grid (similar to Figure 7), where edges represent roads and nodes are the traffic lights. We present the experimental results deferred from the main paper in Figure 10.

D.2 MPE

In MPE (Mordatch & Abbeel, 2017), the observation of each agent contains the relative locations and velocities of neighboring agents and landmarks. The number of visible neighbors in an agent's observation is at most 10. We consider four scenarios in MPE. (1) Cooperative navigation: N agents work together and move to cover L landmarks. The closer the agents get to the landmarks, the larger the reward they obtain. In this scenario, the agent observes its own location and velocity, and the relative locations of the nearest 5 landmarks and N agents; the observation dimension is 26. (2) Prey and predator: N slower cooperating agents must chase the faster adversaries around a randomly generated environment with L large landmarks. Note that the landmarks impede the way of all agents and adversaries, which makes the scenario much more challenging. In this scenario, the agent observes its own location and velocity, and the relative locations of the nearest 5 landmarks and 5 preys; the observation dimension is 34. (3) Cooperative push: N cooperating agents are rewarded for pushing a large ball to a landmark. Each agent can observe the 10 nearest agents and 5 nearest landmarks; the observation dimension is 28. (4) Heterogeneous navigation: this scenario is similar to cooperative navigation except that the N agents are divided into N/2 big, slow agents and N/2 small, fast agents. If small agents collide with big agents, they obtain a large negative reward. Each agent can observe the 10 nearest agents and 5 nearest landmarks; the observation dimension is 26. Further details about this environment can be found at https://github.com/IouJenLiu/PIC.

D.3 MAGENT

MAgent (Zheng et al., 2018) is a grid-world platform and serves as another popular environment for evaluating MARL algorithms. Jiang et al. (2020) tested their method on two scenarios: jungle and battle. In jungle, there are N agents and F foods. An agent receives a positive reward for eating food, but a higher reward for attacking other agents; this interesting scenario is known as a moral dilemma. In battle, N agents learn to fight against several enemies, which is very similar to the prey and predator scenario in MPE. In our experiments, we evaluate our method on jungle. The grid-world environment has size 30 × 30. Each agent occupies one grid cell and can observe the 11 × 11 grid cells centered at itself, together with its own coordinates. The actions include moving and attacking along the coordinates. Further details about this environment can be found at https://github.com/geek-ai/MAgent and https://github.com/PKU-AI-Edge/DGN.

E FURTHER DETAILS ON SETTINGS E.1 DESCRIPTION OF OUR BASELINES

We compare our method with: multi-agent deep deterministic policy gradient (MADDPG) (Lowe et al., 2017), a strong actor-critic algorithm based on the framework of centralized training with decentralized execution; QMIX (Rashid et al., 2018), a Q-learning based monotonic value function factorisation algorithm; permutation invariant critic (PIC) (Liu et al., 2019), a leading algorithm on MPE that yields identical output irrespective of the agent permutation; graph convolutional reinforcement learning (DGN) (Jiang et al., 2020), a deep Q-learning algorithm based on a deep convolutional graph neural network with multi-head attention, which is a leading algorithm on MAgent; and independent Q-learning (IQL) (Tan, 1993), which decomposes a multi-agent problem into a collection of simultaneous single-agent problems sharing the same environment, and usually serves as a surprisingly strong benchmark in mixed and competitive games (Tampuu et al., 2017). In homogeneous settings, the input to the centralized critic in MADDPG is the concatenation of all agents' observations and actions along a specified agent order, which does not hold the property of permutation invariance. We follow a setting similar to (Liu et al., 2019) and shuffle the agents' observations and actions in each training batch. COMA (Foerster et al., 2018) directly assumes that the policy is factorized, and calculates a counterfactual baseline to address the credit assignment problem in MARL. In our experiments, since we can observe each reward function, each agent can directly approximate the Q function without the counterfactual baseline. MFQ derives its algorithm from the view of mean-field games (Yang et al., 2018). Notice that the aim of a mean-field game is to find a Nash equilibrium rather than to maximize the total reward of the group; furthermore, it requires the assumption that agents are identical.

E.2 NEURAL NETWORKS ARCHITECTURE

To learn features from the structural graph built from spatial distances between agents, we design our graph neural network based on structure2vec (Dai et al., 2016), an effective and scalable approach for structured data representation that embeds latent variable models into feature spaces. Structure2vec extracts features by performing a sequence of function mappings in a way similar to graphical model inference procedures, such as mean field and belief propagation. After M graph neural network layers, each node receives information from its M-hop neighbors via message passing. Recently, attention mechanisms have empirically led to more powerful representations on graph data (Veličković et al., 2017; Jiang et al., 2020), and we employ this idea in our graph neural network. In some settings, such as the heterogeneous navigation scenario from MPE, the observations of different groups of agents are heterogeneous. To handle this, we use different nonlinear functions to extract features from the heterogeneous observations and map them into a latent layer, then use the same graph neural network to learn the policy for all types of agents. In our experiments, our graph neural network has M = 2 layers and 1 fully-connected layer at the top. Each layer contains 128 hidden units.

F ABLATION STUDIES

F.1 INDEPENDENT POLICY VS INTENTION PROPAGATION

We first give a toy example where the independent policy (without communication) fails. To implement such an algorithm, we simply replace the intention propagation network by an independent policy network and keep the other parts the same. Consider the 3 × 3 2-D grid in Figure 7, where the global state (observed by all agents) is a constant scalar (thus carrying no information). Each agent chooses an action a_i = 0 or 1. The aim is to maximize the reward -(a_1 - a_2)^2 - (a_1 - a_4)^2 - (a_2 - a_3)^2 - ... - (a_8 - a_9)^2, i.e., the summation of the reward functions on the edges.
Obviously the optimal value is 0, attained by the coordinated policies a_1 = a_2 = ... = a_9 = 0 or a_1 = a_2 = ... = a_9 = 1. However, the independent policy fails, since each agent does not know which action its allies pick; the learned policy is thus random. We show the result of this toy example in Figure 7, where intention propagation learns the optimal policy.
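This failure mode is easy to verify numerically. The following self-contained check (our own illustration) enumerates all joint actions on the 3 × 3 grid: only the two coordinated assignments attain the optimum 0, while uniform-random independent policies incur an expected penalty of -0.5 per edge:

```python
import itertools

def grid_edges(n=3):
    """Edges of an n x n grid, agents indexed row-major 0..n*n-1."""
    edges = []
    for r in range(n):
        for c in range(n):
            i = r * n + c
            if c + 1 < n:
                edges.append((i, i + 1))   # horizontal edge
            if r + 1 < n:
                edges.append((i, i + n))   # vertical edge
    return edges

def reward(actions, edges):
    # summation of -(a_i - a_j)^2 over the grid edges
    return -sum((actions[i] - actions[j]) ** 2 for i, j in edges)

edges = grid_edges()
best = max(reward(a, edges) for a in itertools.product([0, 1], repeat=9))
optimal = [a for a in itertools.product([0, 1], repeat=9)
           if reward(a, edges) == best]
# under independent uniform policies, each of the 12 edges mismatches with
# probability 1/2, so the expected reward is -6
mean = sum(reward(a, edges) for a in itertools.product([0, 1], repeat=9)) / 512
```

Since the grid is connected, any non-constant assignment has at least one mismatched edge, so exactly the two constant assignments are optimal.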

F.2 GRAPH TYPES, NUMBER OF NEIGHBORS, AND HOP SIZE

We conduct a set of ablation studies related to graph types, the number of neighbors, and the hop size. Figures 8(a) and 8(b) show the performance of our method with the traffic graph and the fully-connected graph on the CityFlow scenarios (N=49 and N=100). In these experiments, each agent can only get information from its neighbors through message passing (state embedding and policy embedding). The result makes sense, since the traffic graph represents the structure of the map. Although an agent in the fully-connected graph would obtain global information, it may receive irrelevant information from agents far away. Figures 8(c) and 8(d) show the performance under different numbers of neighbors and hop sizes on cooperative navigation (N=30), respectively. The algorithm with neighbors=8 has the best performance. Again, the fully-connected graph (neighbors=30) may introduce irrelevant information from agents far away; thus its performance is worse than that of the graph constructed by k-nearest neighbors. In addition, the fully-connected graph introduces more computation during training. In Figure 8(d), we increase the hop size from 1 to 3. The performance of IP with hop=2 is much better than with hop=1, while hop=3 is only slightly better than hop=2; this means a graph neural network with hop size 2 has aggregated enough information. In Figure 8(e), we test the importance of the k-nearest-neighbor structure. IP(neighbors=3)+random means that we pick 3 agents uniformly at random as the neighbors. Clearly, IP with k-nearest neighbors outperforms IP with the random graph by a large margin. In Figure 8(f), we update the adjacency matrix every 1, 5, or 10 steps. IP(neighbors=8) denotes updating the adjacency matrix every step, while IP(neighbors=8)+reset(5) and IP(neighbors=8)+reset(10) denote updating it every 5 and 10 steps, respectively. IP(neighbors=8) has the best result.
IP(neighbors=8)+reset(5) is better than IP(neighbors=8)+reset(10). The result makes sense, since the adjacency matrix is more accurate when the update interval is smaller.

F.3 ASSUMPTION VIOLATION

The aforementioned experimental evaluations are based on a mild assumption: the actions of agents that are far away do not affect the learner because of their physical distance. It is interesting to see the performance when this assumption is violated. To this end, we modify the reward in the cooperative navigation experiment. In particular, the reward is defined by r = r_1 + r_2, where r_1 encourages the agents to cover (get close to) the landmarks and r_2 is a log function of the distances between agents (farther agents have a larger impact). To create a violation, we let r_2 dominate the reward. We conduct experiments with hop = 1, 2, 3. Figure 9 shows that the rewards obtained by our method are 4115 ± 21, 4564 ± 22, and 4586 ± 25, respectively. This is expected, since in this scenario we should use a large hop size to collect information from the far-away agents.

G FURTHER EXPERIMENTAL RESULTS

For most of the experiments, we run 1 million to 1.5 million steps and then stop (even though in some cases our algorithm has not yet converged to its asymptotic result), since every experiment in MARL may cost several days. We present the results on CityFlow in Figure 10. Figure 11 provides the experimental results on the cooperative navigation instances with N = 15, N = 30 and N = 200 agents. Note that the instance with N = 200 is a large-scale and challenging multi-agent reinforcement learning setting (Chen et al., 2018; Liu et al., 2019), which typically needs several days to run millions of steps. It is clear that IQL, MADDPG and MADDPGS perform well in the small setting (N=15); however, they fail on the large-scale instances (N = 30 and N = 200). On the instance with N = 30, MADDPGS performs better than MADDPG; the potential reason is that, with the help of shuffling, MADDPGS is more robust to the manually specified order of agents. Although QMIX performs well on the instances with N = 15 and N = 30, it has large variance in both settings. Since DGN's graph convolutional network holds the property of permutation invariance, it obtains much better performance than QMIX on these two settings; however, it also fails to solve the large-scale setting with N = 200 agents. Empirically, after 1.5 × 10^6 steps, PIC obtains a large reward (-425085 ± 31259) on this large-scale setting. Despite all this, the proposed intention propagation (IP) reaches -329229 ± 14730 and is much better than PIC. Furthermore, Figure 11 shows the results of different methods on (d) jungle (N=20, F=12) and (e) prey and predator (N=100); the experimental results show that our method beats all baselines on these two tasks. On the scenario of cooperative push (N=100), shown in Figure 11(f), it is clear that DGN, QMIX, IQL, MADDPG and MADDPGS all fail to converge to good rewards after 1.5 × 10^6 environment steps.
In contrast, PIC and the proposed IP method obtain much better rewards than these baselines. Limited by the computational resources, we only show the long-term performance of the best two methods. Figure 11 (f) shows that IP is slightly better than PIC in this setting. 

G.1 POLICY INTERPRETATION

Explicitly analyzing the policy learned by a deep multi-agent reinforcement learning algorithm is challenging, especially for large-scale problems. We follow ideas similar to (Zheng et al., 2019) and analyze the learned policy on CityFlow in the following way. We select the same period of environment steps, [210000, 1600000], and group these steps into 69 intervals (each containing about 20000 steps). We compute the ratio of vehicle volume on each movement and the sampled action volume from the learned policy (each movement can be assigned to one action according to the internal function of CityFlow). We define the ratio of vehicle volume over all movements as the vehicle volume distribution, and the ratio of sampled action volume from the learned policy over all movements as the sampled action distribution. A good MARL algorithm is expected to make these two distributions very similar over a period of time. Figure 12 reports their KL divergence by interval. It is clear that the proposed intention propagation method (IP) obtains the lowest KL divergence (much better than the state-of-the-art baselines). Because the KL divergence is not a symmetric metric, we also calculate the Euclidean distances: the distance for our method is 0.0271, while DGN's is 0.0938 and PIC's is 0.0933.
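The interval-wise comparison described above amounts to computing, for each interval, the KL divergence between two categorical distributions over movements. A minimal sketch (our own illustration; names and the smoothing constant are assumptions):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over movements."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()   # renormalize count ratios to distributions
    return float(np.sum(p * np.log(p / q)))

def interval_kl(vehicle_counts, action_counts):
    """Per-interval KL between the vehicle-volume distribution and the
    sampled-action distribution (lists of per-movement count vectors)."""
    return [kl_divergence(v, a) for v, a in zip(vehicle_counts, action_counts)]
```

Since KL(p || q) generally differs from KL(q || p), a symmetric measure such as the Euclidean distance is a natural complementary metric, as used above.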

H HYPERPARAMETERS

The parameters of the environment. For the max episode length, we follow settings similar to the baselines (Lowe et al., 2017). In particular, we set 25 for MPE and 100 for CityFlow. For MAgent, we find that setting the max episode length to 25 is better than 100. All methods share the same setting. We list the range of hyperparameters that we tune in all baselines and intention propagation. γ : {0.

I PROOF OF PROPOSITION 1 AND DERIVATION OF THE ALGORITHM

I.1 PROOF OF PROPOSITION 1

We prove the result by induction using the backward view. To see that, plug r(s^t, a^t) = Σ_{i=1}^N r_i(s^t, a_i^t, a_{N_i}^t) into the distribution of the optimal policy defined in section 3. Recall that the goal is to find the best approximation of π(a^t|s^t) such that the trajectory distribution p̂(τ) induced by this policy matches the optimal trajectory probability p(τ). Thus we minimize the KL divergence between them, min_π D_KL(p̂(τ)||p(τ)), where p̂(τ) = p(s^0) Π_{t=0}^T p(s^{t+1}|s^t, a^t) π(a^t|s^t). We can optimize w.r.t. π(a^t|s^t) as in (Levine, 2018) and obtain a backward recursion for the policy π*(a^t|s^t) (equation 12). Now we optimize the KL divergence w.r.t. π(·|s^t). Considering the constraint Σ_j π(j|s^t) = 1, we introduce a Lagrange multiplier λ(Σ_{j=1}^{|A|} π(j|s^t) - 1) (rigorously speaking, we also need the constraint that each element of π is nonnegative, but the optimal value satisfies this constraint automatically, as we will see). We then take the gradient of KL(p̂(τ)||p(τ)) + λ(Σ_j π(j|s^t) - 1) w.r.t. π(·|s^t) and set it to zero. For convenience, we define the soft V function and Q function as in (Levine, 2018), and will show how to decompose them into V_i and Q_i later. In the following, we demonstrate how to update the parameter θ of the propagation network Λ_θ(a|s).



Footnotes: (1) https://github.com/cityflow-project/CityFlow (2) We download the maps from https://github.com/traffic-signal-control/sample-code. (3) To make the environment more computation-efficient, Liu et al. (2019) provided an improved version of MPE; the code is released at https://github.com/IouJenLiu/PIC. (4) This operation doesn't change the state of the actions.



Figure 1: (a) Illustration of the message passing of intention propagation Λ_θ(a|s) (equation 5). (b) An instance of a 2-layer GNN with the discrete action outputs (n agents).

Figure 2: Experimental scenarios. Cityflow: Manhattan, Predator-Prey and Cooperative-Navigation.

Figure 4: Experimental results on Cooperative Navigation, Heterogeneous Navigation, Prey and Predator. Our intention propagation (IP) beats all the baselines.

Figure 5: Illustration of the message passing in the intention propagation network Λ_θ(a|s).

Figure 7: (a) a toy task on 2d-grid. (b) The performance of independent policy and intention propagation.

Figure 8: Performance of the proposed method under different ablation settings. (a) Traffic graph and fully-connected (fc) graph on CityFlow (N=49). (b) Traffic graph and fully-connected (fc) graph on CityFlow (N=100). (c) Cooperative Nav. (N=30): different numbers of neighbors. (d) Cooperative Nav. (N=30): different hop sizes of the graph neural network. (e) Cooperative Nav. (N=30): random graph vs k-nearest-neighbor graph (k = 3, 8, 10). (f) Cooperative Nav. (N=30): update the 8-nearest-neighbor graph every n environment steps (every 5 and 10 steps, respectively).

Figure 10: Performance of different methods on traffic light control scenarios in the CityFlow environment: (a) N=16 (4 × 4 grid), Gudang sub-district, Hangzhou, China. (b) N=49 (7 × 7 grid). (c) N=96 (irregular grid map), Manhattan, USA. (d) N=100 (10 × 10 grid). (e) N=225 (15 × 15 grid). (f) N=1225 (35 × 35 grid). The horizontal axis is time steps (interactions with the environment). The vertical axis is average episode reward, which refers to negative average travel time; higher rewards are better. The proposed intention propagation (IP) obtains much better performance than all the baselines on large-scale tasks.


Tasks. We evaluate MARL algorithms on more than 10 different tasks from three different environments.



π*(a^t|s^t) (see equation 13 in I.2) has the form

π*(a^t|s^t) = (1/Z) exp( E_{p(s^{t+1:T}, a^{t+1:T} | s^t, a^t)} [ Σ_{t'=t}^T Σ_{i=1}^N r_i(s^{t'}, a_i^{t'}, a_{N_i}^{t'}) ] ).

Obviously, at the last step it satisfies the required form, π*(a^T|s^T) = (1/Z) exp( Σ_{i=1}^N r_i(s^T, a_i^T, a_{N_i}^T) ), where Z is a constant related to the normalization term. Thus, we redefine a new term

ψ_i(s^t, a_i^t, a_{N_i}^t) = exp( E_{p(s^{t+1:T}, a^{t+1:T} | s^t, a^t)} [ Σ_{t'=t}^T r_i(s^{t'}, a_i^{t'}, a_{N_i}^{t'}) ] ). (11)

Then π*(a^t|s^t) satisfies the form we need by absorbing the constant C into the normalization term. Thus we have the result.

I.2 DERIVATION OF THE ALGORITHM

We start the derivation with the minimization of the KL divergence KL(p̂(τ)||p(τ)), where p̂(τ) = p(s^0) Π_{t=0}^T p(s^{t+1}|s^t, a^t) π(a^t|s^t), and KL(p̂(τ)||p(τ)) = E_{τ ∼ p̂(τ)}[ log p̂(τ) - log p(τ) ].

Taking the gradient of KL(p̂(τ)||p(τ)) + λ(Σ_{j=1}^{|A|} π(j|s^t) - 1) w.r.t. π(·|s^t) and setting it to zero, we obtain log π*(a^t|s^t) = E_{p(s^{t+1:T}, a^{t+1:T} | s^t, a^t)}[ Σ_{t'=t}^T r(s^{t'}, a^{t'}) ] + const, i.e., π*(a^t|s^t) ∝ exp( E_{p(s^{t+1:T}, a^{t+1:T} | s^t, a^t)}[ Σ_{t'=t}^T r(s^{t'}, a^{t'}) ] ). Since Σ_j π(j|s^t) = 1, the normalization constant is determined and we obtain π*(a^t|s^t).

We approximate π*(a^t|s^t) with a neural network by again minimizing the KL divergence KL( Π_{i=1}^N q_{i,θ}(a_i^t|s^t) || π*(a^t|s^t) ). Plugging π*(a^t|s^t) = exp(Q(s^t, a^t)) / ∫ exp(Q(s^t, a^t)) da^t into the KL divergence, it is easy to see that, by the definition of the KL divergence, this is equivalent to the following optimization problem:

max_θ E_{s^t} E_{a^t ∼ Π_i q_{i,θ}(a_i^t|s^t)} [ Q(s^t, a^t) - Σ_i log q_{i,θ}(a_i^t|s^t) ].

Thus we sample states from the replay buffer and obtain the policy loss

J(θ) = E_{s^t ∼ D, a^t ∼ Π_{i=1}^N q_{i,θ}(a_i^t|s^t)} [ Σ_i log q_{i,θ}(a_i^t|s^t) - Q(s^t, a^t) ].
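A Monte-Carlo estimate of this policy loss from a replay buffer can be sketched as follows (our own illustration; `sample_actions` and `Q` are hypothetical stand-ins for the intention propagation network and the learned soft Q-function, not the paper's code):

```python
def policy_loss(states, sample_actions, Q):
    """Monte-Carlo estimate of
    J(theta) = E_{s ~ D, a ~ prod_i q_i}[ sum_i log q_i(a_i|s) - Q(s, a) ].
    states: batch of states drawn from the replay buffer D;
    sample_actions(s) -> (joint action a, list of per-agent log-probs);
    Q(s, a) -> scalar soft Q-value."""
    total = 0.0
    for s in states:
        a, logps = sample_actions(s)   # a ~ prod_i q_{i,theta}(a_i | s)
        total += sum(logps) - Q(s, a)
    return total / len(states)
```

Minimizing this estimate over θ (with the log-probabilities differentiable through the propagation network) corresponds to the maximization problem above with the sign flipped.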


The optimal policy is π*(a^t|s^t) = exp(Q(s^t, a^t)) / ∫ exp(Q(s^t, a^t)) da^t, obtained by plugging the definition of Q into equation 13. Recall that in section 4.1 we approximated the optimal joint policy by the mean-field approximation Π_{i=1}^N q_i(a_i|s). We now plug this into the definition in equation 14 and consider the discount factor. Notice that it is easy to incorporate the discount factor by defining an absorbing state, where each transition has probability (1 - γ) of going to that state. Thus we can further decompose V and Q into V_i and Q_i, defined so that V is the sum of the V_i and Q is the sum of the Q_i. This suggests constructing the loss functions on V_i and Q_i accordingly; in the following, we use parametric families (e.g., neural networks) characterized by η_i and κ_i to approximate V_i and Q_i, respectively, where Q̂_i(s^t, a_i^t, a_{N_i}^t) = r_i + γ E_{s^{t+1} ∼ p(s^{t+1}|s^t, a^t)}[ V_{η_i}(s^{t+1}) ]. Now we are ready to derive the update rule of the policy, i.e., the intention propagation network. Remember that the intention propagation network is in fact a mean-field approximation of the joint policy; it is an optimization over the functions q_i rather than over certain parameters. We have proved that after M iterations of intention propagation, the output is the nearly optimal solution q_i.
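As a concrete sketch of this decomposition under the mean-field policy (our own reconstruction in the style of soft Q-learning (Levine, 2018), consistent with the surrounding definitions but not verbatim from the original equations):

```latex
V(s^t) = \sum_{i=1}^{N} V_i(s^t), \qquad
Q(s^t, a^t) = \sum_{i=1}^{N} Q_i(s^t, a_i^t, a_{N_i}^t),

V_i(s^t) = \mathbb{E}_{a^t \sim \prod_j q_j(\cdot \mid s^t)}
  \big[\, Q_i(s^t, a_i^t, a_{N_i}^t) - \log q_i(a_i^t \mid s^t) \,\big],

\hat{Q}_i(s^t, a_i^t, a_{N_i}^t) = r_i(s^t, a_i^t, a_{N_i}^t)
  + \gamma\, \mathbb{E}_{s^{t+1} \sim p(\cdot \mid s^t, a^t)}
  \big[\, V_{\eta_i}(s^{t+1}) \,\big].
```

Each agent then regresses V_{η_i} and Q_{κ_i} toward these targets, which only require r_i and the actions of the agent's neighborhood.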

