A MAXIMUM MUTUAL INFORMATION FRAMEWORK FOR MULTI-AGENT REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

In this paper, we propose a maximum mutual information (MMI) framework for multi-agent reinforcement learning (MARL) that enables multiple agents to learn coordinated behaviors by regularizing the accumulated return with the mutual information between their actions. By introducing a latent variable to induce nonzero mutual information between actions and applying a variational bound, we derive a tractable lower bound on the considered MMI-regularized objective function. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows the centralized training with decentralized execution (CTDE) paradigm. We evaluate VM3-AC on several games requiring coordination, and numerical results show that it outperforms other MARL algorithms in such multi-agent tasks.

1. INTRODUCTION

With the success of RL in the single-agent domain (Mnih et al. (2015); Lillicrap et al. (2015)), MARL is being actively studied and applied to real-world problems such as traffic control systems and connected self-driving cars, which can be modeled as multi-agent systems requiring coordinated control (Li et al. (2019); Andriotis & Papakonstantinou (2019)). The simplest approach to MARL is independent learning, which trains each agent independently while treating the other agents as part of the environment. One such example is independent Q-learning (IQL) (Tan (1993)), an extension of Q-learning to the multi-agent setting. However, this approach suffers from the non-stationarity of the environment. A common remedy is to use a centralized critic in the framework of centralized training with decentralized execution (CTDE) (OroojlooyJadid & Hajinezhad (2019); Rashid et al. (2018)). For example, MADDPG (Lowe et al. (2017)) uses a centralized critic to train a decentralized policy for each agent, and COMA (Foerster et al. (2018)) uses a common centralized critic to train all decentralized policies. However, these approaches assume that the decentralized policies are independent and hence that the joint policy is the product of each agent's policy. Such non-correlated factorization of the joint policy limits the agents' ability to learn coordinated behavior because it neglects the influence of other agents (Wen et al. (2019); de Witt et al. (2019)). Thus, learning coordinated behavior is one of the fundamental problems in MARL (Wen et al. (2019); Liu et al. (2020)).

In this paper, we introduce a new framework for MARL to learn coordinated behavior under CTDE. Our framework is based on regularizing the expected cumulative reward with the mutual information among agents' actions, induced by injecting a latent variable. The intuition behind the proposed framework is that agents can coordinate with other agents if they know with high probability what the other agents will do, and this dependence between action policies can be captured by the mutual information. High mutual information among actions means low uncertainty about other agents' actions. Hence, by regularizing the objective of the expected cumulative reward with the mutual information among agents' actions, we can coordinate the behaviors of agents implicitly, without explicit dependence enforcement. However, optimizing the proposed objective function presents several difficulties: we consider decentralized policies without explicit dependence or communication in the execution phase, and directly optimizing the mutual information is hard because the conditional distribution it involves is intractable. We circumvent these difficulties by exploiting the property of the latent variable injected to induce mutual information and by applying a variational lower bound on the mutual information. Within the proposed framework, we apply policy iteration with redefined value functions to obtain the VM3-AC algorithm for MARL with coordinated behavior under CTDE.
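As a minimal illustration of the latent-variable idea (a toy sketch, not the paper's implementation), the following computes the exact mutual information I(a1; a2) between two agents' discrete actions when each agent samples from its own conditional policy given a shared latent variable z. All distributions below are hypothetical, chosen only to show that conditioning on a common z induces nonzero mutual information while each agent still acts independently given z:

```python
# Toy sketch: a shared latent variable z makes two agents' decentralized
# actions statistically dependent, so I(a1; a2) > 0, even though each agent
# samples its action from its own conditional policy pi_i(a | z).
import math

def mutual_information(p_z, p_a1_given_z, p_a2_given_z):
    """Exact I(a1; a2) for the joint p(a1, a2) = sum_z p(z) p(a1|z) p(a2|z)."""
    n1, n2 = len(p_a1_given_z[0]), len(p_a2_given_z[0])
    # Marginalize out the latent variable to get the joint action distribution.
    joint = [[sum(p_z[z] * p_a1_given_z[z][i] * p_a2_given_z[z][j]
                  for z in range(len(p_z)))
              for j in range(n2)] for i in range(n1)]
    p1 = [sum(row) for row in joint]
    p2 = [sum(joint[i][j] for i in range(n1)) for j in range(n2)]
    mi = 0.0
    for i in range(n1):
        for j in range(n2):
            if joint[i][j] > 0:
                mi += joint[i][j] * math.log(joint[i][j] / (p1[i] * p2[j]))
    return mi

p_z = [0.5, 0.5]  # uniform binary latent variable shared by both agents
# Policies that condition on z: actions become correlated through z.
mi_correlated = mutual_information(p_z, [[0.9, 0.1], [0.1, 0.9]],
                                        [[0.9, 0.1], [0.1, 0.9]])
# Policies that ignore z: actions are independent and I(a1; a2) = 0.
mi_independent = mutual_information(p_z, [[0.5, 0.5], [0.5, 0.5]],
                                         [[0.5, 0.5], [0.5, 0.5]])
```

This is exactly the mechanism the framework exploits: without z the product-form joint policy has zero mutual information, whereas injecting z yields dependent actions that the MMI regularizer can reward.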

2. RELATED WORK

Learning coordinated behavior in multi-agent systems has been studied extensively in the MARL community. To promote coordination, some previous works used communication among agents (Zhang & Lesser (2013); Foerster et al. (2016); Pesce & Montana (2019)). For example, Foerster et al. (2016) proposed the DIAL algorithm, which learns a communication protocol that enables the agents to coordinate their behaviors. Instead of relying on communication, Jaques et al. (2018) proposed a social influence intrinsic reward, which is related to the mutual information between actions, to achieve coordination. The purpose of the social influence approach is similar to that of our approach, and social influence yields good performance in social dilemma environments. The difference between our algorithm and the social influence approach is explained in detail, and the advantage of our approach over social influence is shown, in Section 6. Wang et al. (2019) proposed an intrinsic reward capturing influence based on the mutual information between an agent's current actions/states and other agents' next states, together with an intrinsic reward based on a decision-theoretic measure. Whereas they used mutual information to enhance exploration, our approach focuses on the mutual information between simultaneous actions, capturing policy correlation rather than influence. Moreover, they considered independent policies, whereas policies are correlated in our approach.

Some previous works considered correlated policies instead of independent policies. For example, Liu et al. (2020) proposed explicit modeling of correlated policies for multi-agent imitation learning, and Wen et al. (2019) proposed a recursive reasoning framework for MARL that maximizes the expected return by decomposing the joint policy into the agent's own policy and the opponents' policies. Going beyond adopting correlated policies, our approach maximizes the mutual information between actions, which is a measure of their correlation. Our framework can be interpreted as enhancing correlated exploration: each agent increases the entropy of its own policy while decreasing its uncertainty about other agents' actions. Other techniques for enhancing correlated exploration have also been proposed (Zheng & Yue (2018); Mahajan et al. (2019)). For example, MAVEN addresses the poor exploration of QMIX by maximizing the mutual information between a latent variable and the observed trajectories (Mahajan et al. (2019)). However, MAVEN does not consider the correlation among policies.

3. BACKGROUND

We consider a Markov game (Littman (1994)), which is an extension of the Markov decision process (MDP) to the multi-agent setting. An $N$-agent Markov game is defined by an environment state space $\mathcal{S}$, action spaces $\mathcal{A}^1, \cdots, \mathcal{A}^N$ for the $N$ agents, a state transition probability $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$, where $\mathcal{A} = \prod_{i=1}^{N} \mathcal{A}^i$ is the joint action space, and a reward function $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. At each time step $t$, Agent $i$ executes action $a_t^i \in \mathcal{A}^i$ based on state $s_t \in \mathcal{S}$. The actions of all agents $\mathbf{a}_t = (a_t^1, \cdots, a_t^N)$ yield the next state $s_{t+1}$ according to $\mathcal{T}$ and a shared common reward $r_t$ according to $R$ under the assumption of fully cooperative MARL. The discounted return is defined as $R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau} r_{\tau}$, where $\gamma \in [0,1)$ is the discount factor.

We assume CTDE, which incorporates the resource asymmetry between the training and execution phases and is widely considered in MARL (Lowe et al. (2017); Iqbal & Sha (2018); Foerster et al. (2018)). Under CTDE, each agent can access all information, including the environment state and the observations and actions of other agents, in the training phase, whereas the policy of each agent can be conditioned only on its own action-observation history $\tau_t^i$ or observation $o_t^i$ in the execution phase. For a given joint policy $\boldsymbol{\pi} = (\pi^1, \cdots, \pi^N)$, the goal of fully cooperative MARL is to find the optimal joint policy $\boldsymbol{\pi}^*$ that maximizes the objective $J(\boldsymbol{\pi}) = \mathbb{E}_{\boldsymbol{\pi}}[R_0]$.

Maximum Entropy RL. The goal of maximum entropy RL is to find an optimal policy that maximizes the entropy-regularized objective function, given by
$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r_t(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right] \tag{1}$$
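To make the entropy-regularized objective in (1) concrete, here is a minimal sketch that evaluates a finite-horizon truncation of it for a single trajectory; the rewards, action distributions, and hyperparameter values below are illustrative, not taken from the paper:

```python
# Finite-horizon estimate of the maximum-entropy objective in (1):
# J(pi) = E[ sum_t gamma^t ( r_t + alpha * H(pi(.|s_t)) ) ].
# All numeric values below are illustrative placeholders.
import math

def entropy(dist):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def entropy_regularized_return(rewards, policy_dists, gamma=0.99, alpha=0.2):
    """Discounted return augmented with a per-step entropy bonus."""
    return sum(gamma ** t * (r + alpha * entropy(d))
               for t, (r, d) in enumerate(zip(rewards, policy_dists)))

rewards = [1.0, 0.0, 2.0]                                  # r_0, r_1, r_2
policy_dists = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]      # pi(.|s_t) per step
j_soft = entropy_regularized_return(rewards, policy_dists)
j_plain = entropy_regularized_return(rewards, policy_dists, alpha=0.0)
```

With alpha = 0 the expression reduces to the ordinary discounted return; a positive alpha adds a bonus for stochastic (high-entropy) policies, which is the single-agent mechanism that the MMI framework generalizes to the multi-agent setting.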





