A MAXIMUM MUTUAL INFORMATION FRAMEWORK FOR MULTI-AGENT REINFORCEMENT LEARNING Anonymous authors Paper under double-blind review

Abstract

In this paper, we propose a maximum mutual information (MMI) framework for multi-agent reinforcement learning (MARL) to enable multiple agents to learn coordinated behaviors by regularizing the accumulated return with the mutual information between actions. By introducing a latent variable to induce nonzero mutual information between actions and applying a variational bound, we derive a tractable lower bound on the considered MMI-regularized objective function. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows centralized learning with decentralized execution (CTDE). We evaluated VM3-AC for several games requiring coordination, and numerical results show that VM3-AC outperforms other MARL algorithms in multi-agent tasks requiring coordination.

1. INTRODUCTION

With the success of RL in the single-agent domain (Mnih et al. (2015); Lillicrap et al. (2015)), MARL is being actively studied and applied to real-world problems such as traffic control systems and connected self-driving cars, which can be modeled as multi-agent systems requiring coordinated control (Li et al. (2019); Andriotis & Papakonstantinou (2019)). The simplest approach to MARL is independent learning, which trains each agent independently while treating other agents as a part of the environment. One such example is independent Q-learning (IQL) (Tan (1993)), which is an extension of Q-learning to the multi-agent setting. However, this approach suffers from non-stationarity of the environment. A common solution to this problem is to use a fully centralized critic in the framework of centralized training with decentralized execution (CTDE) (OroojlooyJadid & Hajinezhad (2019); Rashid et al. (2018)). For example, MADDPG (Lowe et al. (2017)) uses a centralized critic to train a decentralized policy for each agent, and COMA (Foerster et al. (2018)) uses a common centralized critic to train all decentralized policies. However, these approaches assume that decentralized policies are independent and hence the joint policy is the product of each agent's policy. Such an uncorrelated factorization of the joint policy limits the agents' ability to learn coordinated behavior because it neglects the influence of other agents (Wen et al. (2019); de Witt et al. (2019)). Thus, learning coordinated behavior is one of the fundamental problems in MARL (Wen et al. (2019); Liu et al. (2020)). In this paper, we introduce a new framework for MARL to learn coordinated behavior under CTDE. Our framework is based on regularizing the expected cumulative reward with mutual information among agents' actions induced by injecting a latent variable.
The intuition behind the proposed framework is that agents can coordinate with other agents if they know what other agents will do with high probability, and the dependence between action policies can be captured by the mutual information. High mutual information among actions means low uncertainty of other agents' actions. Hence, by regularizing the objective of the expected cumulative reward with mutual information among agents' actions, we can coordinate the behaviors of agents implicitly without explicit dependence enforcement. However, the optimization problem with the proposed objective function has several difficulties since we consider decentralized policies without explicit dependence or communication in the execution phase. In addition, optimizing mutual information is difficult because of the intractable conditional distribution. We circumvent these difficulties by exploiting the property of the latent variable injected to induce mutual information, and applying variational lower bound on the mutual information. With the proposed framework, we apply policy iteration by redefining value functions to propose the VM3-AC algorithm for MARL with coordinated behavior under CTDE.

2. RELATED WORK

Learning coordinated behavior in multi-agent systems is studied extensively in the MARL community. To promote coordination, some previous works used communication among agents (Zhang & Lesser (2013) ; Foerster et al. (2016) ; Pesce & Montana (2019) ). For example, Foerster et al. (2016) proposed the DIAL algorithm to learn a communication protocol that enables the agents to coordinate their behaviors. Instead of relying on communication, Jaques et al. (2018) proposed social influence intrinsic reward which is related to the mutual information between actions to achieve coordination. The purpose of the social influence approach is similar to our approach and the social influence yields good performance in social dilemma environments. The difference between our algorithm and the social influence approach will be explained in detail and the effectiveness of our approach over the social influence approach will be shown in Section 6. Wang et al. (2019) proposed an intrinsic reward capturing the influence based on the mutual information between an agent's current actions/states and other agents' next states. In addition, they proposed an intrinsic reward based on a decision-theoretic measure. Although they used the mutual information to enhance exploration, our approach focuses on the mutual information between simultaneous actions capturing policy correlation not influence. Besides, they considered independent policies, whereas policies are correlated in our approach. Some previous works considered correlated policies instead of independent policies. For example, Liu et al. (2020) proposed explicit modeling of correlated policies for multi-agent imitation learning, and Wen et al. ( 2019) proposed a recursive reasoning framework for MARL to maximize the expected return by decomposing the joint policy into own policy and opponents' policies. 
Going beyond adopting correlated policies, our approach maximizes the mutual information between actions which is a measure of correlation. Our framework can be interpreted as enhancing correlated exploration by increasing the entropy of own policy while decreasing the uncertainty about other agents' actions. Some previous works proposed other techniques to enhance correlated exploration (Zheng & Yue (2018); Mahajan et al. (2019) ). For example, MAVEN addressed the poor exploration problem of QMIX by maximizing the mutual information between the latent variable and the observed trajectories (Mahajan et al. (2019) ). However, MAVEN does not consider the correlation among policies.

3. BACKGROUND

We consider a Markov game (Littman (1994)), which is an extension of the Markov decision process (MDP) to the multi-agent setting. An $N$-agent Markov game is defined by an environment state space $\mathcal{S}$, action spaces for $N$ agents $\mathcal{A}^1, \cdots, \mathcal{A}^N$, a state transition probability $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, where $\mathcal{A} = \prod_{i=1}^{N} \mathcal{A}^i$ is the joint action space, and a reward function $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. At each time step $t$, Agent $i$ executes action $a_t^i \in \mathcal{A}^i$ based on state $s_t \in \mathcal{S}$. The actions of all agents $a_t = (a_t^1, \cdots, a_t^N)$ yield the next state $s_{t+1}$ according to $\mathcal{T}$ and a shared common reward $r_t$ according to $R$ under the assumption of fully cooperative MARL. The discounted return is defined as $R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau} r_{\tau}$, where $\gamma \in [0, 1)$ is the discounting factor. We assume CTDE incorporating resource asymmetry between the training and execution phases, as widely considered in MARL (Lowe et al. (2017); Iqbal & Sha (2018); Foerster et al. (2018)). Under CTDE, each agent can access all information including the environment state, observations and actions of other agents in the training phase, whereas the policy of each agent can be conditioned only on its own action-observation history $\tau_t^i$ or observation $o_t^i$ in the execution phase. For a given joint policy $\pi = (\pi^1, \cdots, \pi^N)$, the goal of fully cooperative MARL is to find the optimal joint policy $\pi^*$ that maximizes the objective $J(\pi) = \mathbb{E}_{\pi}[R_0]$.

Maximum Entropy RL The goal of maximum entropy RL is to find an optimal policy that maximizes the entropy-regularized objective function, given by

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(r_t(s_t, a_t) + \alpha H(\pi(\cdot|s_t))\Big)\right]. \tag{1}$$

It is known that this objective encourages the policy to enhance exploration in the state and action spaces and helps the policy avoid converging to a local optimum. Soft actor-critic (SAC), which is based on the maximum entropy RL principle, approximates soft policy iteration with an actor-critic method. SAC outperforms other deep RL algorithms in many continuous action tasks (Haarnoja et al. (2018)). We can simply extend SAC to the multi-agent setting in the manner of independent learning: each agent trains its decentralized policy using a decentralized critic to maximize the weighted sum of the cumulative return and the entropy of its policy. We refer to this method as Independent SAC (I-SAC). Adopting the framework of CTDE, we can replace the decentralized critic with a centralized critic which incorporates observations and actions of all agents. We refer to this method as multi-agent soft actor-critic (MA-SAC). Both I-SAC and MA-SAC are considered as baselines in the experiment section.
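As a small illustration of the entropy-regularized objective, the following sketch evaluates the discounted sum of $r_t + \alpha H(\pi(\cdot|s_t))$ along a single trajectory with discrete actions. The function names, the toy trajectory, and the values of $\gamma$ and $\alpha$ are assumptions made for illustration only; the sketch is a one-sample estimate of the expectation, not the SAC training procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_return(rewards, policies, gamma=0.99, alpha=0.2):
    """Discounted sum of r_t + alpha * H(pi(.|s_t)) along one trajectory.

    rewards[t]  : reward received at step t
    policies[t] : action probabilities pi(.|s_t) used at step t
    """
    total = 0.0
    for t, (r, pi) in enumerate(zip(rewards, policies)):
        total += (gamma ** t) * (r + alpha * entropy(pi))
    return total


# Hypothetical two-step trajectory: a uniform policy, then a deterministic one.
rewards = [1.0, 1.0]
policies = [[0.5, 0.5], [1.0, 0.0]]

plain = soft_return(rewards, policies, gamma=0.9, alpha=0.0)  # return only
soft = soft_return(rewards, policies, gamma=0.9, alpha=0.5)   # return + entropy bonus
```

With $\alpha = 0$ the quantity reduces to the ordinary discounted return; with $\alpha > 0$ only the step with a stochastic policy contributes an entropy bonus.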

4. THE PROPOSED MAXIMUM MUTUAL INFORMATION FRAMEWORK

We assume that the environment is fully observable, i.e., each agent can observe the environment state $s_t$, for the theoretical development in this section, and will consider partially observable environments for practical algorithm construction under CTDE in the next section. Under the proposed MMI framework, we aim to find the policy that maximizes the mutual information between actions in addition to the cumulative return. Thus, the MMI-regularized objective function for joint policy $\pi$ is given by

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(r_t(s_t, a_t) + \alpha \sum_{(i,j)} I(\pi^i(\cdot|s_t); \pi^j(\cdot|s_t))\Big)\right], \tag{2}$$

where $a_t^i \sim \pi^i(\cdot|s_t)$ and $\alpha$ is the temperature parameter that controls the relative importance of the mutual information against the reward. As aforementioned, we assume decentralized policies and want the decentralized policies to exhibit coordinated behavior. By regularization with mutual information in the proposed objective function (2), the policy of each agent is implicitly encouraged to coordinate with other agents' policies without explicit dependency by reducing the uncertainty about other agents' policies. This can be seen as follows: mutual information is expressed in terms of entropy and conditional entropy as

$$I(\pi^i(\cdot|s_t); \pi^j(\cdot|s_t)) = H(\pi^j(\cdot|s_t)) - H(\pi^j(\cdot|s_t) \,|\, \pi^i(\cdot|s_t)). \tag{3}$$

If the knowledge of $\pi^i(\cdot|s_t)$ does not provide any information about $\pi^j(\cdot|s_t)$, the conditional entropy reduces to the unconditional entropy, i.e., $H(\pi^j(\cdot|s_t) \,|\, \pi^i(\cdot|s_t)) = H(\pi^j(\cdot|s_t))$, and the mutual information becomes zero. Maximizing mutual information is equivalent to minimizing the uncertainty about other agents' policies conditioned on the agent's own policy, which can lead the agent to learn coordinated behavior based on the reduced uncertainty about other agents' policies. However, direct optimization of the objective function (2) is not easy. Fig. 1(a) shows the causal diagram of the considered system model described in Section 3 in the case of two agents with decentralized policies.
Since we consider the case of no explicit dependency, the two policy distributions can be expressed as π 1 (a 1 t |s t ) and π 2 (a 2 t |s t ). Then, for given environment state s t observed by both agents, π 1 (a 1 t |s t ) and π 2 (a 2 t |s t ) are conditionally independent and the mutual information I(π 1 (•|s t ); π 2 (•|s t )) = 0. Thus, the MMI objective (2) reduces to the standard MARL objective of only the accumulated return. In the following subsections, we present our approach to circumvent this difficulty and implement the MMI framework and its operation under CTDE.

4.1. INDUCING MUTUAL INFORMATION USING LATENT VARIABLE

First, in order to induce mutual information among agents' policies under the considered system causal diagram shown in Fig. 1(a), we introduce a latent variable $z_t$. For illustration, consider the new diagram with latent variable $z_t$ in Fig. 1(b). Suppose that the latent variable $z_t$ has a prior distribution $p(z_t)$, and assume that both actions $a_t^1$ and $a_t^2$ are generated from the observed random variable $s_t$ and the unobserved random variable $z_t$. Then, the policy of Agent $i$ is given by the marginal distribution $\pi^i(\cdot|s_t) = \int_z \pi^i(\cdot|s_t, z)\, p(z)\, dz$, marginalized over $z$. With the unobserved latent random variable $z$, the conditional independence does not hold for $a_t^1$ and $a_t^2$, and the mutual information can be positive, i.e., $I(\pi^1(\cdot|s_t); \pi^2(\cdot|s_t)) > 0$. Hence, we can induce the mutual information between actions without explicit dependence by introducing the latent variable. In the general case of $N$ agents, we have $\pi(a^1, \cdots, a^N|s) = \mathbb{E}_z[\pi^1(a^1|s, z) \cdots \pi^N(a^N|s, z)]$. Note that in this case we inject a common latent variable $z$ into all agents' policies.
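The effect of the latent variable can be checked numerically on a toy example. The sketch below assumes two agents with binary actions, a fixed state, and a binary latent variable (the paper uses a Gaussian latent variable; the discrete choice and all probabilities here are illustrative assumptions). It builds the joint action distribution $\sum_z p(z)\,\pi^1(a^1|s,z)\,\pi^2(a^2|s,z)$ and computes the mutual information between the two actions.

```python
import math


def mutual_information(joint):
    """I(A1; A2) in nats for a joint probability table joint[a1][a2]."""
    p1 = [sum(row) for row in joint]
    p2 = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a1, row in enumerate(joint):
        for a2, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log(p / (p1[a1] * p2[a2]))
    return mi


def joint_from_latent(prior, pi1, pi2):
    """Joint p(a1, a2 | s) = sum_z p(z) * pi1(a1 | s, z) * pi2(a2 | s, z).

    pi1[z][a] and pi2[z][a] are the policies conditioned on latent value z.
    """
    n1, n2 = len(pi1[0]), len(pi2[0])
    joint = [[0.0] * n2 for _ in range(n1)]
    for z, pz in enumerate(prior):
        for a1 in range(n1):
            for a2 in range(n2):
                joint[a1][a2] += pz * pi1[z][a1] * pi2[z][a2]
    return joint


prior = [0.5, 0.5]  # p(z) for a binary latent variable (toy choice)

# Policies that ignore z: actions are conditionally independent given s.
ignore_z = [[0.5, 0.5], [0.5, 0.5]]
mi_independent = mutual_information(joint_from_latent(prior, ignore_z, ignore_z))

# Policies that follow z: both agents pick action z with probability 0.9.
follow_z = [[0.9, 0.1], [0.1, 0.9]]
mi_latent = mutual_information(joint_from_latent(prior, follow_z, follow_z))
```

In this toy setting the mutual information is zero when the policies ignore $z$ and strictly positive when both policies depend on the common $z$, even though neither agent observes the other's action.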

4.2. VARIATIONAL BOUND OF MUTUAL INFORMATION

Even with non-trivial mutual information $I(\pi^i(\cdot|s_t); \pi^j(\cdot|s_t))$, it is difficult to directly compute the mutual information. Note that we need the conditional distribution of $a_t^j$ given $(a_t^i, s_t)$ to compute the mutual information as seen in (4), but it is difficult to know the conditional distribution directly. To circumvent this difficulty, we use a variational distribution $q(a_t^j|a_t^i, s_t)$ to approximate $p(a_t^j|a_t^i, s_t)$ and derive a lower bound on the mutual information $I(\pi^i(\cdot|s_t); \pi^j(\cdot|s_t)) =: I_{ij}(s_t)$ as

$$I_{ij}(s_t) = \mathbb{E}_{p(a_t^i, a_t^j|s_t)}\left[\log \frac{q(a_t^j|a_t^i, s_t)}{p(a_t^j|s_t)}\right] + \mathbb{E}_{p(a_t^i|s_t)}\left[\mathrm{KL}\big(p(a_t^j|a_t^i, s_t) \,\|\, q(a_t^j|a_t^i, s_t)\big)\right] \geq H(\pi^j(\cdot|s_t)) + \mathbb{E}_{p(a_t^i, a_t^j|s_t)}\left[\log q(a_t^j|a_t^i, s_t)\right], \tag{4}$$

where the inequality holds because the KL divergence is always non-negative. The lower bound becomes tight when $q(a_t^j|a_t^i, s_t)$ approximates $p(a_t^j|a_t^i, s_t)$ well. Using the symmetry of mutual information, we can rewrite the lower bound as

$$I_{ij}(s_t) \geq \frac{1}{2}\Big[H(\pi^i(\cdot|s_t)) + H(\pi^j(\cdot|s_t)) + \mathbb{E}_{p(a_t^i, a_t^j|s_t)}\big[\log q(a_t^i|a_t^j, s_t) + \log q(a_t^j|a_t^i, s_t)\big]\Big]. \tag{5}$$

Then, we can maximize this lower bound of mutual information by using the tractable approximation $q(a_t^i|a_t^j, s_t)$.
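The one-sided bound $I_{ij} \geq H(\pi^j) + \mathbb{E}[\log q(a^j|a^i)]$ can be verified on a small discrete example. The sketch below assumes a fixed state and a hand-picked joint action table; the tables `joint` and `rough_q` are illustrative values, not quantities from the experiments. It compares the exact mutual information with the bound for the true conditional distribution and for a deliberately inaccurate variational distribution.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mi_and_bound(joint, q):
    """Return (exact MI, variational lower bound) for a table joint[ai][aj].

    q[ai][aj] is a variational estimate of p(aj | ai); it must be positive
    wherever the joint probability is positive.
    """
    p_i = [sum(row) for row in joint]
    p_j = [sum(col) for col in zip(*joint)]
    exact = 0.0
    expected_log_q = 0.0
    for ai, row in enumerate(joint):
        for aj, p in enumerate(row):
            if p > 0.0:
                exact += p * math.log(p / (p_i[ai] * p_j[aj]))
                expected_log_q += p * math.log(q[ai][aj])
    # Lower bound: H(a_j) + E[log q(a_j | a_i)]
    return exact, entropy(p_j) + expected_log_q


joint = [[0.41, 0.09], [0.09, 0.41]]                       # toy p(a_i, a_j | s)
true_cond = [[p / sum(row) for p in row] for row in joint]  # exact p(a_j | a_i)
rough_q = [[0.7, 0.3], [0.3, 0.7]]                          # imperfect variational q

exact, tight = mi_and_bound(joint, true_cond)
_, loose = mi_and_bound(joint, rough_q)
```

The bound coincides with the exact value when $q$ equals the true conditional and falls below it otherwise, with the gap equal to the expected KL divergence in (4).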

4.3. MODIFIED POLICY ITERATION

In this subsection, we develop policy iteration for the MMI framework. First, we replace the original MMI objective function (2) with the following tractable objective function based on the variational lower bound (5):

$$\hat{J}(\pi, q) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(r_t(s_t, a_t) + \alpha N \sum_{i=1}^{N} H(\pi^i(\cdot|s_t)) + \alpha \sum_{i=1}^{N} \sum_{j \neq i} \log q(a_t^j|a_t^i, s_t)\Big)\right], \tag{6}$$

where $q(a_t^j|a_t^i, s_t)$ is the variational distribution to approximate the conditional distribution $p(a_t^j|a_t^i, s_t)$. Then, we determine the individual objective function $\hat{J}^i(\pi^i, q)$ for Agent $i$ as the sum of the terms in (6) associated with Agent $i$'s policy $\pi^i$ or action $a_t^i$, given by

$$\hat{J}^i(\pi^i, q) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(\underbrace{r_t(s_t, a_t) + \beta \cdot H(\pi^i(\cdot|s_t))}_{(a)} + \underbrace{\frac{\beta}{N} \sum_{j \neq i} \big[\log q(a_t^i|a_t^j, s_t) + \log q(a_t^j|a_t^i, s_t)\big]}_{(b)}\Big)\right], \tag{7}$$

where $\beta = \alpha N$ is the temperature parameter. Note that maximizing the term (a) in (7) implies that each agent maximizes the weighted sum of the policy entropy and the return, which can be interpreted as an extension of maximum entropy RL to the multi-agent setting. On the other hand, maximizing the term (b) with respect to $\pi^i$ means that we update the policy $\pi^i$ so that Agent $j$ predicts Agent $i$'s action well (the first term in (b)) and Agent $i$ predicts Agent $j$'s action well (the second term in (b)). Thus, the objective function (7) can be interpreted as the maximum entropy MARL objective combined with predictability enhancement for other agents' actions. Note that predictability is reduced when actions are uncorrelated. Since the policy entropy term $H(\pi^i(\cdot|s_t))$ enhances individual exploration due to the maximum entropy principle (Haarnoja et al. (2018)) and the term (b) in (7) enhances predictability or correlation among agents' actions, the proposed objective function (7) can be considered as one implementation of the concept of correlated exploration in MARL (Mahajan et al. (2019)).
Now, in order to learn policy $\pi^i$ to maximize the objective function (7), we modify the policy iteration in standard RL. For this, we redefine the state and state-action value functions for each agent as follows:

$$V_i^{\pi}(s) \triangleq \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(r_t + \beta H(\pi^i(\cdot|s_t)) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_t^i, a_t^j, s_t)\Big) \,\Big|\, s_0 = s\right] \tag{8}$$

$$Q_i^{\pi}(s, a) \triangleq \mathbb{E}_{\pi}\left[r_0 + \gamma V_i^{\pi}(s_1) \,\Big|\, s_0 = s, a_0 = a\right], \tag{9}$$

where $q^{(i,j)}(a_t^i, a_t^j, s_t) \triangleq q(a_t^i|a_t^j, s_t)\, q(a_t^j|a_t^i, s_t)$. Then, the Bellman operator corresponding to $V_i^{\pi}$ and $Q_i^{\pi}$ is given by

$$T^{\pi} Q_i(s, a) \triangleq r(s, a) + \gamma \mathbb{E}_{s' \sim p}[V_i(s')], \tag{10}$$

where

$$V_i(s) = \mathbb{E}_{a \sim \pi}\left[Q_i(s, a) - \beta \log \pi^i(a^i|s) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s)\right]. \tag{11}$$

In the policy evaluation step, we compute the value functions defined in (8) and (9) by applying the modified Bellman operator $T^{\pi}$ repeatedly to any initial function $Q_i^0$.

Lemma 1. (Variational Policy Evaluation). For fixed $\pi$ and the variational distribution $q$, consider the modified Bellman operator $T^{\pi}$ in (10) and an arbitrary initial function $Q_i^0: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and define $Q_i^{k+1} = T^{\pi} Q_i^k$. Then, $Q_i^k$ converges to $Q_i^{\pi}$ defined in (9).

Proof. See Appendix A.

In the policy improvement step, we update the policy and the variational distribution by using the value function evaluated in the policy evaluation step. Here, each agent updates its policy and variational distribution while keeping other agents' policies fixed as follows:

$$(\pi_{k+1}^i, q_{k+1}) = \arg\max_{\pi^i, q} \mathbb{E}_{(a^i, a^{-i}) \sim (\pi^i, \pi_k^{-i})}\left[Q_i^{\pi_k}(s, a) - \beta \log \pi^i(a^i|s) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s)\right], \tag{12}$$

where $a^{-i} \triangleq \{a^1, \cdots, a^N\} \setminus \{a^i\}$. Then, we have the following lemma regarding the improvement step.

Lemma 2. (Variational Policy Improvement). Let $\pi_{new}^i$ and $q_{new}$ be the updated policy and the variational distribution from (30) in Appendix A. Then, $Q_i^{\pi_{new}^i, \pi_{old}^{-i}}(s, a) \geq Q_i^{\pi_{old}^i, \pi_{old}^{-i}}(s, a)$ for all $(s, a) \in (\mathcal{S} \times \mathcal{A})$.

Proof. See Appendix A.
The modified policy iteration is defined as applying the variational policy evaluation and variational improvement steps in an alternating manner. Each agent trains its policy, critic and the variational distribution to maximize its objective function (7).
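The policy evaluation step can be illustrated in a tabular setting. The following sketch is a simplified illustration under stated assumptions, not the VM3-AC implementation: it assumes a two-state problem with two joint actions, a fixed joint policy, and a precomputed table `bonus[s][a]` standing in for the entropy and variational terms $-\beta \log \pi^i(a^i|s) + \frac{\beta}{N}\sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s)$. All numbers are hypothetical. It applies the modified Bellman operator repeatedly from two different initial tables and tracks their sup-norm distance.

```python
def bellman_backup(q_table, reward, trans, policy, bonus, gamma):
    """One application of the modified operator T^pi to a tabular Q."""
    n_states, n_actions = len(reward), len(reward[0])
    # V(s) = E_{a ~ pi}[ Q(s, a) + bonus(s, a) ]
    value = [
        sum(policy[s][a] * (q_table[s][a] + bonus[s][a]) for a in range(n_actions))
        for s in range(n_states)
    ]
    # (T^pi Q)(s, a) = r(s, a) + gamma * E_{s' ~ p}[ V(s') ]
    return [
        [
            reward[s][a]
            + gamma * sum(trans[s][a][s2] * value[s2] for s2 in range(n_states))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]


def sup_norm_diff(q1, q2):
    """Largest absolute entrywise difference between two Q tables."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


gamma = 0.9
reward = [[1.0, 0.0], [0.0, 2.0]]                            # r(s, a)
trans = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]  # p(s' | s, a)
policy = [[0.6, 0.4], [0.3, 0.7]]                            # pi(a | s), held fixed
bonus = [[0.2, -0.1], [0.05, 0.3]]  # hypothetical entropy and log q terms

q_a = [[0.0, 0.0], [0.0, 0.0]]
q_b = [[5.0, -3.0], [2.0, 7.0]]
start_gap = sup_norm_diff(q_a, q_b)
for _ in range(200):
    q_a = bellman_backup(q_a, reward, trans, policy, bonus, gamma)
    q_b = bellman_backup(q_b, reward, trans, policy, bonus, gamma)
end_gap = sup_norm_diff(q_a, q_b)
```

Both initial tables approach the same fixed point, and the distance between them shrinks by at least a factor of $\gamma$ per iteration, which is the contraction property used in the proof of the variational policy evaluation lemma.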

5. ALGORITHM CONSTRUCTION

Summarizing the development above, we now propose the variational maximum mutual information multi-agent actor-critic (VM3-AC) algorithm, which can be applied to continuous and partially 2017)). Let $x$ denote either the environment state $s$ or the observations of all agents $(o^1, \cdots, o^N)$, whichever is used. In order to deal with the large continuous state-action spaces, we adopt deep neural networks to approximate the required functions. For Agent $i$, we parameterize the variational distribution with $\xi^i$ as $q_{\xi^i}(a^j|a^i, o^i, o^j)$, the state-value function with $\psi^i$ as $V^i_{\psi^i}(x)$, two action-value functions with $\theta^{i,1}$ and $\theta^{i,2}$ as $Q^i_{\theta^{i,1}}(x, a)$, $Q^i_{\theta^{i,2}}(x, a)$, and the policy with $\phi^i$ as $\pi^i_{\phi^i}(a|o^i) = \mathbb{E}_z[\pi^i_{\phi^i}(a|o^i, z)]$. We assume a normal distribution for the latent variable, which plays a key role in inducing coordination among agents' policies, i.e., $z_t \sim \mathcal{N}(0, I)$, and further assume that the variational distribution is a Gaussian distribution with constant variance $\sigma^2$, i.e., $q_{\xi^i}(a^j|a^i, o^i, o^j) = \mathcal{N}(\mu_{\xi^i}(a^i, o^i, o^j), \sigma^2)$, where $\mu_{\xi^i}(a^i, o^i, o^j)$ is the mean of the distribution.

Centralized Training

As aforementioned, the policy is the marginalized distribution over the latent variable $z$, where the policies of all agents take the same $z_t$ generated from $\mathcal{N}(0, I)$ as an input variable. We perform the required marginalization based on Monte Carlo numerical expectation as follows:

$$\pi(a|s) = \mathbb{E}_z\left[\pi^1_{\phi^1}(a^1|s, z) \cdots \pi^N_{\phi^N}(a^N|s, z)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \pi^1_{\phi^1}(a^1|s, z_l) \cdots \pi^N_{\phi^N}(a^N|s, z_l), \tag{13}$$

and we use $L = 1$ for simplicity. The parameterized value functions, policy, and variational distributions are trained similarly to the training in SAC. Due to space limitations, training details and pseudocode are provided in Appendices B and C.
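The single-sample ($L = 1$) Monte Carlo marginalization with a common latent variable can be sketched as follows. This is a toy sketch under stated assumptions: the linear policy `agent_action`, its weights, the scalar latent variable, and the noise scale are hypothetical stand-ins for the neural network policies $\pi^i_{\phi^i}(a|o^i, z)$. The sketch draws joint actions with a shared $z$ and, for contrast, with an independent $z$ per agent, and measures the resulting correlation between the two agents' actions.

```python
import math
import random


def agent_action(obs, z, eps, w_obs, w_z, noise_std):
    """Toy reparameterized policy a = f(o; eps, z) with a scalar action."""
    return w_obs * obs + w_z * z + noise_std * eps


def sample_joint_actions(observations, weights, rng, share_latent=True):
    """Draw one joint action using a single latent sample (L = 1)."""
    z_common = rng.gauss(0.0, 1.0)  # common latent sample z ~ N(0, 1)
    actions = []
    for obs, (w_obs, w_z) in zip(observations, weights):
        # Either reuse the common latent sample or draw a private one.
        z = z_common if share_latent else rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)  # agent-specific exploration noise
        actions.append(agent_action(obs, z, eps, w_obs, w_z, noise_std=0.1))
    return actions


def correlation(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)
observations = [0.3, -0.2]          # hypothetical local observations
weights = [(1.0, 1.0), (1.0, 1.0)]  # hypothetical (w_obs, w_z) per agent

shared = [sample_joint_actions(observations, weights, rng, True) for _ in range(2000)]
separate = [sample_joint_actions(observations, weights, rng, False) for _ in range(2000)]

corr_shared = correlation([a[0] for a in shared], [a[1] for a in shared])
corr_separate = correlation([a[0] for a in separate], [a[1] for a in separate])
```

With a shared latent sample the two actions are strongly correlated, whereas with private latent samples they are nearly uncorrelated, which mirrors the role of the common $z$ in (13).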

Decentralized Execution

In the centralized training phase, we pick actions (a 1 , • • • , a N ) by using Monte Carlo expectation based on common latent variable z l generated from zero-mean Gaussian distribution, as seen in ( 13). We consider two methods to achieve the same operation in the decentralized execution phase. First, this can be done by making all agents have the same Gaussian random sequence generator and distributing the same seed to this random sequence generator only once in the beginning of the execution phase. This eliminates the necessity of communication for sharing the latent variable. In fact, this way of sharing z l can be applied to the centralized training phase too. Second, we exploit the property of zero-mean Gaussian latent variable z l . That is, we simply replace z l with zero vector with the matching dimensions in the decentralized execution phase. This substitution method does not deteriorate the performance much as seen in Section 6 since the latent variable distribution is zero-mean Gaussian and zero has the highest density. Thus, the proposed algorithm is fully operative under CTDE.
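Both execution-time options can be sketched in a few lines. The sketch below assumes each agent holds its own pseudo-random generator seeded once with a common value; the class `LatentSource`, the seed, and the latent dimension are hypothetical names and values chosen for illustration. It checks that all agents obtain the same latent sequence without exchanging messages, and also shows the zero-vector substitution.

```python
import random


class LatentSource:
    """Per-agent generator of the latent sequence z_t ~ N(0, I)."""

    def __init__(self, seed, dim):
        self._rng = random.Random(seed)  # seeded once, before execution starts
        self._dim = dim

    def sample(self):
        """Return the next latent vector in this agent's sequence."""
        return [self._rng.gauss(0.0, 1.0) for _ in range(self._dim)]


def zero_latent(dim):
    """Second option: replace z by the zero vector (the mode of N(0, I))."""
    return [0.0] * dim


shared_seed, latent_dim, num_agents = 1234, 3, 4

# Option 1: every agent owns an identical generator with the same seed.
sources = [LatentSource(shared_seed, latent_dim) for _ in range(num_agents)]
all_steps_match = True
for _ in range(5):  # five execution time steps
    draws = [source.sample() for source in sources]
    all_steps_match = all_steps_match and all(d == draws[0] for d in draws)

# Option 2: no randomness at all at execution time.
z_zero = zero_latent(latent_dim)
```

The first option reproduces the training-time sampling exactly, at the cost of distributing one seed beforehand; the second needs no shared state at all.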

6. EXPERIMENT

In this section, we provide numerical results to evaluate VM3-AC. We considered four baselines: 1) MADDPG (Lowe et al. (2017)) - an extension of DDPG with a centralized critic to train a 2019)). Similarly to VM3-AC, MAVEN introduced a latent variable and a variational approach for optimizing the mutual information. However, MAVEN does not consider the mutual information between actions but considers the mutual information between the latent variable and trajectories of the agents. 4) Influence MOA (IMOA), also known as social influence (SI) (Jaques et al. (2018)). The IMOA method models $p(a_{t+1}^j|a_t^i, s_t^i)$, where agents $j$ and $i$ are the influencee and the influencer, respectively, and adds an intrinsic reward given by the mutual information $I(a_t^i; a_{t+1}^j|s_t)$ between the influencer's current action and the influencee's next-time-step action, not the mutual information $I(a_t^i; a_t^j|s_t)$ between two agents' simultaneous actions, which is considered in our MMI framework. Both MAVEN and IMOA are implemented on top of MA-AC since we consider continuous action-space environments. We evaluated the proposed algorithm and the baselines in three multi-agent environments with varying numbers of agents: multi-walker (Gupta et al. (2017)), predator-prey (Lowe et al. (2017)), and cooperative navigation (Lowe et al. (2017)). We modified the original environments to require further coordination among agents. For example, we increased the size of the agent and the collision reward in cooperative navigation. Hence, the agent should consider other agents more while achieving its goal. The detailed setting of each environment is provided in Appendix D.

6.1. RESULT

Fig. 3 shows the learning curves for the considered three environments with different numbers of agents. The y-axis denotes the average of all agents' rewards averaged over 7 random seeds, and the x-axis denotes time step.
The hyperparameters including the temperature parameter β and the dimension of the latent variable are provided in Appendix E. As shown in Fig. 3 , VM3-AC outperforms the baselines in the considered environments. Especially, in the case of the multi-walker environment, VM3-AC has large performance gain. This is because the agents in the multi-walker environment are required especially to learn coordinated behavior to obtain high rewards. In addition, the agents in the predator-prey environment, where the number of agents is four, should spread out in groups of two to get more reward. In this environment, VM3-AC also has large performance gain. Thus, it is seen that the proposed MMI framework improves performance in complex multi-agent tasks requiring high-quality coordination. It is observed that both MAVEN and IMOA outperform the basic algorithm MA-AC but not VM3-AC. Hence, the mutual information between the latent variable and trajectory (used in MAVEN) and the mutual information between the action of the agent and the next time step action of other agents (used in IMOA) are not as effective for coordinated behavior as the mutual information between agents' simultaneous actions used for VM3-AC.

6.2. ABLATION STUDY

In this section, we provide ablation study on the major techniques and hyperparameter of VM3-AC: 1) mutual information versus entropy 2) the latent variable, 3) the temperature parameter β, and 4) injecting zero vector instead of the latent variable z to policies in the execution phase.

Mutual information versus entropy:

The proposed MMI framework maximizes the sum of entropy and variational conditional probability, which provides a lower bound of mutual information between actions. As aforementioned, maximizing the entropy and the variational conditional probability enhances exploration and predictability for other agents' actions, respectively. Hence, the proposed MMI framework enhances correlated exploration among agents. We compared VM3-AC with MA-SAC, which is an extension of maximum entropy RL to the multi-agent setting. We performed an experiment in the predator-prey environment with four agents where the number of agents required to catch the prey is two. In this environment, the agents start at the center of the map. Hence, the agents should spread out in groups of two to catch preys efficiently. Fig. 4 shows the positions of the four agents at five time-steps after the episode starts. The first row and the second row in Fig. 4 show the results of VM3-AC and MA-SAC in the early stage of the training, respectively. It is seen that the agents of VM3-AC explore in groups of two while the agents of MA-SAC tend to explore independently. We provide the performance comparisons of VM3-AC with MA-SAC in Fig. 5(a) and (b).

Latent variable:

The role of the latent variable is to induce mutual information among actions and promote coordinated behavior. We compared VM3-AC and VM3-AC without the latent variable (implemented by setting dim(z) = 0) in the multi-walker environment. In both cases, VM3-AC yields better performance than VM3-AC without the latent variable, as shown in Fig. 5(a) and 5(b).

Injecting zero vector instead of the latent variable:

As mentioned in Section 5, we replace the latent variable with the zero vector to execute actions without communication in the execution phase. We compared the performance of decentralized policies which use the zero vector and decentralized policies which use the latent variable assuming communication.
We used deterministic evaluation based on 20 episodes generated by the corresponding deterministic policy, i.e., each agent selects action using the mean network of Gaussian policy π i φ i . We averaged the return over 7 seeds, and the result 

APPENDIX A: VARIATIONAL POLICY EVALUATION AND POLICY IMPROVEMENT

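In the main paper, we defined the state and state-action value functions for each agent as follows:

$$V_i^{\pi}(s) \triangleq \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(r_t + \beta H(\pi^i(\cdot|s_t)) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_t^i, a_t^j, s_t)\Big) \,\Big|\, s_0 = s\right] \tag{14}$$

$$Q_i^{\pi}(s, a) \triangleq \mathbb{E}_{\pi}\left[r_0 + \gamma V_i^{\pi}(s_1) \,\Big|\, s_0 = s, a_0 = a\right]. \tag{15}$$

Lemma 3. (Variational Policy Evaluation). For fixed $\pi$ and the variational distribution $q$, consider the modified Bellman operator $T^{\pi}$ in (16) and an arbitrary initial function $Q_i^0: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and define $Q_i^{k+1} = T^{\pi} Q_i^k$. Then, $Q_i^k$ converges to $Q_i^{\pi}$ defined in (15).

$$T^{\pi} Q_i(s, a) \triangleq r(s, a) + \gamma \mathbb{E}_{s' \sim p}[V_i(s')], \tag{16}$$

where

$$V_i(s) = \mathbb{E}_{a \sim \pi}\left[Q_i(s, a) - \beta \log \pi^i(a^i|s) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s)\right]. \tag{17}$$

Proof. Define the mutual information augmented reward $r_{\pi}(s_t, a_t)$ through the following decomposition of the operator:

$$T^{\pi} Q_i(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p,\, a_{t+1} \sim \pi}\left[Q_i(s_{t+1}, a_{t+1}) - \beta \log \pi^i(a_{t+1}^i|s_{t+1}) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_{t+1}^i, a_{t+1}^j, s_{t+1})\right] \tag{18}$$

$$= \underbrace{r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p,\, a_{t+1} \sim \pi}\left[-\beta \log \pi^i(a_{t+1}^i|s_{t+1}) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_{t+1}^i, a_{t+1}^j, s_{t+1})\right]}_{r_{\pi}(s_t, a_t)} + \gamma \mathbb{E}_{s_{t+1} \sim p,\, a_{t+1} \sim \pi}\left[Q_i(s_{t+1}, a_{t+1})\right] \tag{20}$$

$$= r_{\pi}(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p,\, a_{t+1} \sim \pi}\left[Q_i(s_{t+1}, a_{t+1})\right]. \tag{21}$$

Then, we can apply the standard convergence results for policy evaluation. Define

$$T^{\pi}(v) = R^{\pi} + \gamma P^{\pi} v \tag{22}$$

for $v = [Q(s, a)]_{s \in \mathcal{S}, a \in \mathcal{A}}$. Then, the operator $T^{\pi}$ is a $\gamma$-contraction:

$$\|T^{\pi}(v) - T^{\pi}(u)\|_{\infty} = \|(R^{\pi} + \gamma P^{\pi} v) - (R^{\pi} + \gamma P^{\pi} u)\|_{\infty} \tag{23}$$

$$= \|\gamma P^{\pi}(v - u)\|_{\infty} \tag{24}$$

$$\leq \|\gamma P^{\pi}\|_{\infty} \|v - u\|_{\infty} \tag{25}$$

$$\leq \gamma \|u - v\|_{\infty}. \tag{26}$$

Note that the operator $T^{\pi}$ has a unique fixed point by the contraction mapping theorem, and we define the fixed point as $Q_i^{\pi}(s, a)$. Since

$$\|Q_i^k(s, a) - Q_i^{\pi}(s, a)\|_{\infty} \leq \gamma \|Q_i^{k-1}(s, a) - Q_i^{\pi}(s, a)\|_{\infty} \leq \cdots \leq \gamma^k \|Q_i^0(s, a) - Q_i^{\pi}(s, a)\|_{\infty}, \tag{27}$$

we have $\lim_{k \to \infty} \|Q_i^k(s, a) - Q_i^{\pi}(s, a)\|_{\infty} = 0$, and this implies $\lim_{k \to \infty} Q_i^k(s, a) = Q_i^{\pi}(s, a)$, $\forall (s, a) \in (\mathcal{S} \times \mathcal{A})$. This proves the variational policy evaluation for finite state-action sets.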
We can extend it to infinite state-action sets under the following assumptions:

• Assume that the Q functions for $\pi$ are in $L^{\infty}$.
• From Folland (1999), $L^{\infty}$ is a Banach space.
• From Agarwal et al. (2018), by the Banach fixed point theorem, the Q function converges to a unique point in $L^{\infty}$ space, and that point is the Q function of the given $\pi$.

Lemma 4. (Variational Policy Improvement). Let $\pi_{new}^i$ and $q_{new}$ be the updated policy and the variational distribution from (30). Then, $Q_i^{\pi_{new}^i, \pi_{old}^{-i}}(s, a) \geq Q_i^{\pi_{old}^i, \pi_{old}^{-i}}(s, a)$ for all $(s, a) \in (\mathcal{S} \times \mathcal{A})$.

$$(\pi_{k+1}^i, q_{k+1}) = \arg\max_{\pi^i, q} \mathbb{E}_{(a^i, a^{-i}) \sim (\pi^i, \pi_k^{-i})}\left[Q_i^{\pi_k}(s, a) - \beta \log \pi^i(a^i|s) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a^i, a^j, s)\right]. \tag{30}$$

Proof. Let $\pi_{new}$ be determined as

$$(\pi_{new}^i, q_{new}) = \arg\max_{\pi^i, q} \mathbb{E}_{(a_t^i, a_t^{-i}) \sim (\pi^i, \pi_{old}^{-i})}\left[Q_i^{\pi_{old}}(s_t, a_t) - \beta \log \pi^i(a_t^i|s_t) + \frac{\beta}{N} \sum_{j \neq i} \log q^{(i,j)}(a_t^i, a_t^j, s_t)\right]. \tag{33}$$

Then, the following inequality holds E (a i t ,



Figure 1: Causal diagram in 2-agent Markov Game: (a) Standard MARL, (b) Introducing the latent variable to the standard MARL

Figure 2: Overall operation of the proposed VM3-AC. We only need the operation in the red box after training.

Figure 3: Performance of MADDPG (blue), MA-AC (green), MAVEN (purple), IMOA (black), and VM3-AC (the proposed method, red) on multi-walker environments (a)-(b), predator-prey (c)-(e), and cooperative navigation (f). (MW, PP, and CN denote multi-walker, predator-prey, and cooperative navigation, respectively)

Figure 4: The positions of the four agents five time-steps after the episode begins in the early stage of the training: 1st row - VM3-AC and 2nd row - MA-SAC. Each column corresponds to a different seed. The black squares are the preys and each color except black shows the position of each agent.

(a) and 5(b).

Impact of replacing the latent variable z ∼ N (0, I) with zero vector z =

Figure 5: (a) and (b): VM3-AC (red), VM3-AC without latent variable (orange), and MA-SAC (cyan) and (c) and (d): performance with respect to the temperature parameter

7. CONCLUSION

In this paper, we have proposed the MMI framework for MARL to enhance multi-agent coordinated learning under CTDE by regularizing the cumulative return with mutual information among actions. The MMI framework is implemented practically by using a latent variable and variational technique and applying approximate policy iteration. Numerical results show that the derived algorithm named VM3-AC outperforms other baselines, especially in multi-agent tasks requiring high coordination among agents. Furthermore, the MMI framework can be combined with other techniques for cooperative MARL, such as value decomposition (Rashid et al. (2018)), to yield better performance.

Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic recursive reasoning for multi-agent reinforcement learning. arXiv preprint arXiv:1901.09207, 2019.

Chongjie Zhang and Victor Lesser. Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pp. 1101-1108, 2013.

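$$
\begin{aligned}
\cdots, a_t) &= r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[ V^{\pi_{\text{old}}} \cdots \\
&\le r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\, \mathbb{E}_{(a^i \cdots} \cdots \\
&\le r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\, \mathbb{E}_{(a^i_{t+1}, a^{-i}_{t+1}) \sim (\pi^i_{\text{new}}, \pi^{-i}_{\text{old}})}\big[ r^i(s_{t+1}, a_{t+1}) \cdots
\end{aligned}
$$

HYPERPARAMETER AND TRAINING DETAIL

The hyperparameters for MA-AC, MA-SAC, MADDPG, and VM3-AC are summarized in Table 2.

Hyperparameters of all algorithms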

The temperature parameter β and the dimension of the latent variable z for VM3-AC on the considered environments. Note that the temperature parameter β in I-SAC and MA-SAC controls the relative importance between the reward and the entropy, whereas the temperature parameter β in VM3-AC controls the relative importance between the reward and the mutual information.

APPENDIX B: ALGORITHM CONSTRUCTION

7.1 CENTRALIZED TRAINING

The value functions $V^i_{\psi_i}(x)$ and $Q^i_{\theta_i}(x, a)$ are updated based on the modified Bellman operator defined in (16) and (17). The state-value function $V^i_{\psi_i}(x)$ is trained to minimize the following loss function:

where … $\xi_i(a^i_t, a^j_t, o^i_t, o^j_t)$, $D$ is the replay buffer that stores the transitions $(x_t, a_t, r_t, x_{t+1})$, and … is the minimum of the two action-value functions, which is used to prevent the overestimation problem (Fujimoto et al. (2018)). The two action-value functions are updated by minimizing the loss

where … and $V_{\psi_i}$ is the target value network, which is updated by the exponential moving average method.

We implement the reparameterization trick to estimate the stochastic gradient of the policy loss. The action of agent $i$ is then given by $a^i = f_{\phi_i}(s; \epsilon_i, z)$, where $\epsilon_i \sim \mathcal{N}(0, I)$ and $z \sim \mathcal{N}(0, I)$. The policy for agent $i$ and the variational distribution are trained to minimize the following policy improvement loss,

where $q^{(i,j)}$ ….

Since the approximation of the variational distribution is not accurate in the early stage of training, and learning via the term (a) in equation 45 is more susceptible to approximation error, we propagate the gradient only through the term (b) in equation 45 to stabilize learning. Note that minimizing $-\log q_{\xi_i}(a^j \mid a^i, s_t)$ is equivalent to minimizing the mean-squared error between $a^j$ and $\mu_{\xi_i}(a^i, o^i, o^j)$ due to our Gaussian assumption on the variational distribution.
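As a rough illustration of two steps described above, the following Python sketch shows (i) that, for a Gaussian with a fixed isotropic variance, the negative log-likelihood differs from a scaled squared error only by a constant that does not depend on the mean, and (ii) an exponential-moving-average update of target-network parameters. The fixed standard deviation `sigma`, the averaging rate `tau`, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math


def squared_error(target, mean):
    """Sum of squared differences between two equal-length vectors."""
    return sum((t - m) ** 2 for t, m in zip(target, mean))


def gaussian_nll(target, mean, sigma=1.0):
    """Negative log-density of `target` under N(mean, sigma^2 I).

    Assumption: the variance is fixed and shared across dimensions.
    """
    d = len(target)
    const = 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    return 0.5 * squared_error(target, mean) / sigma ** 2 + const


def ema_update(target_params, online_params, tau=0.005):
    """Soft update: target <- (1 - tau) * target + tau * online.

    `tau` is a hypothetical rate, not a value reported in the paper.
    """
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    a_j = [0.5, -1.0]          # action of agent j (illustrative values)
    mu_1 = [0.0, 0.0]          # one candidate predicted mean
    mu_2 = [0.5, -0.5]         # another candidate predicted mean
    # The NLL gap equals half the squared-error gap when sigma = 1.
    print(gaussian_nll(a_j, mu_1) - gaussian_nll(a_j, mu_2))
    print(0.5 * (squared_error(a_j, mu_1) - squared_error(a_j, mu_2)))
    print(ema_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

Because the constant term does not depend on the predicted mean, gradient steps on either objective move the mean in the same direction under this fixed-variance assumption.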

APPENDIX C: PSEUDO CODE

Algorithm 1 VM3-AC (L=1) 

APPENDIX D: ENVIRONMENT DETAIL

Multi-walker The multi-walker environment, which was introduced in Gupta et al. (2017), is a modification of the BipedalWalker environment in OpenAI Gym to the multi-agent setting. The environment consists of $N$ bipedal walkers and a large package. The goal is to move forward together while holding the large package on top of the walkers. The observation of each agent consists of the joint angular speeds, the positions of the joints, and so on. Each agent has a 4-dimensional continuous action that controls the torque of its legs. Each agent receives a shared reward $R_1$ depending on the distance over which the package has moved, and receives a negative local reward $R_2$ if it drops the package or falls to the ground. An episode ends when one of the agents falls, the package is dropped, or $T$ time steps elapse. To obtain higher rewards, the agents should learn coordinated behavior. For example, if one agent only tries to learn to move forward, ignoring the other agents, then the other agents may fall. In addition, different coordinated behavior is required as the number of agents changes. We set $T = 500$, $R_2 = -10$, and $R_1 = 10d$, where $d$ is the distance over which the package has moved. We simulated this environment in three cases by changing the number of agents ($N = 2$, $N = 3$, and $N = 4$).

All algorithms used neural networks to approximate the required functions. In the algorithms except I-SAC, we used the neural network architecture proposed in Kim et al. (2019) to emphasize the agent's own observation and action for centralized critics. For agent $i$, we used a shared neural network for the variational distributions $q_{\xi_i}(a^j_t \mid a^i_t, o^i_t, o^j_t)$ for $j \in \{1, \cdots, N\} \setminus \{i\}$, and the network takes a one-hot vector indicating $j$ as input. Experimental details are given in Appendix E.

Cooperative navigation Cooperative navigation, which was proposed in Lowe et al. (2017), consists of $N$ agents and $L$ landmarks. The goal of this environment is to occupy all landmarks while avoiding collisions with other agents. The agents receive a shared reward $R_1$, which is the sum of the minimum distances from each landmark to any agent, and agents that collide with each other receive a negative reward $-R_2$. In addition, all agents receive $R_3$ if all landmarks are occupied. The observation of each agent consists of the locations of all other agents and landmarks, and the action is a two-dimensional physical action. We set $R_2 = 10$, $R_3 = 1$, and $T = 50$. We simulated the environment in the case of ($N = 3$, $L = 3$).
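The reward structure of cooperative navigation described above can be sketched in Python as follows. This is only an illustration under several assumptions: the distance term is negated so that being closer to the landmarks yields a higher reward (the text gives the term as a sum of minimum distances without stating its sign), and the thresholds `collision_dist` and `occupy_dist`, like the function name, are hypothetical and not specified in the text. Only $R_2 = 10$ and $R_3 = 1$ come from the description.

```python
import math


def cooperative_navigation_rewards(agents, landmarks, r2=10.0, r3=1.0,
                                   collision_dist=0.3, occupy_dist=0.1):
    """Per-agent rewards for a single time step.

    agents, landmarks: lists of (x, y) positions.
    collision_dist, occupy_dist: hypothetical thresholds (assumptions).
    """
    # Distance from each landmark to its nearest agent.
    min_dists = [min(math.dist(lm, ag) for ag in agents) for lm in landmarks]

    # Shared distance term, negated here (assumption) so closer is better.
    shared = -sum(min_dists)

    # Every agent receives R3 when all landmarks are occupied.
    if all(d <= occupy_dist for d in min_dists):
        shared += r3

    rewards = []
    for i, ag in enumerate(agents):
        # An agent collides if any other agent is within collision_dist.
        collided = any(j != i and math.dist(ag, other) < collision_dist
                       for j, other in enumerate(agents))
        rewards.append(shared - r2 if collided else shared)
    return rewards


if __name__ == "__main__":
    landmarks = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    # Each agent sits on a distinct landmark: no collisions, all occupied.
    print(cooperative_navigation_rewards(landmarks, landmarks))
    # Two agents crowd the same landmark and collide.
    crowded = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0)]
    print(cooperative_navigation_rewards(crowded, landmarks))
```

In the second call the two crowded agents share the same distance term as the third agent but additionally pay the collision penalty, which is the kind of behavior the agents must learn to avoid through coordination.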

