INTENTION PROPAGATION FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

A hallmark of an AI agent is the ability to mimic human beings in understanding and interacting with others. In this paper, we propose a collaborative multi-agent reinforcement learning algorithm that learns a joint policy through interactions among agents. To reach a joint decision for the group, each agent makes an initial decision and communicates its policy to its neighbors. Each agent then properly revises its own policy based on the received messages and spreads out its updated plan. As this intention propagation procedure goes on, we prove that it converges to a mean-field approximation of the joint policy within the framework of neural embedded probabilistic inference. We evaluate our algorithm on several large-scale, challenging tasks and demonstrate that it outperforms previous state-of-the-art methods.

1. INTRODUCTION

Collaborative multi-agent reinforcement learning is an important sub-field of multi-agent reinforcement learning (MARL), in which the agents learn to coordinate to achieve joint success. It has wide applications in traffic control (Kuyer et al., 2008), autonomous driving (Shalev-Shwartz et al., 2016) and the smart grid (Yang et al., 2018). To learn coordination, interactions between agents are indispensable. For instance, humans can reason about others' behaviors or learn other people's intentions through communication and then determine an effective coordination plan. However, how to design such an interaction mechanism in a principled way that, at the same time, scales to large real-world applications remains a challenging problem.

Recently, there has been a surge of interest in solving the collaborative MARL problem (Foerster et al., 2018; Qu et al., 2019; Lowe et al., 2017). Among these, joint policy approaches have demonstrated their superiority (Rashid et al., 2018; Sunehag et al., 2018; Oliehoek et al., 2016). A straightforward approach is to replace the action in single-agent reinforcement learning by the joint action a = (a_1, a_2, ..., a_N), but this obviously suffers from an exponentially large action space. Several approaches have therefore been proposed to factorize the joint action space to mitigate this issue; they can be roughly grouped into two categories:

• Factorization on the policy. This approach explicitly assumes that $\pi(a|s) := \prod_{i=1}^{N} \pi_i(a_i|s)$, i.e., that policies are independent (Foerster et al., 2018; Zhang et al., 2018). To mitigate the instability caused by independent learners, it generally needs a centralized critic.

• Factorization on the value function. This approach has a similar spirit but factorizes the joint value function into several utility functions, each involving only the actions of one agent (Rashid et al., 2018; Sunehag et al., 2018).
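The policy factorization above can be illustrated with a toy numerical sketch. All sizes and numbers below are illustrative assumptions, not from the paper; the point is that a product policy $\pi(a|s) = \prod_{i=1}^{N}\pi_i(a_i|s)$ stores only N*K parameters per state instead of a table over K**N joint actions, while still defining a valid distribution over the joint action space:

```python
import numpy as np
from itertools import product

# Hypothetical example: N agents, each with K discrete actions.
rng = np.random.default_rng(0)
N, K = 4, 3

# Per-agent policies pi_i(.|s): each row is a distribution over K actions.
per_agent = rng.random((N, K))
per_agent /= per_agent.sum(axis=1, keepdims=True)

def joint_prob(actions):
    """Probability of joint action a = (a_1, ..., a_N) under the product policy."""
    return float(np.prod([per_agent[i, a] for i, a in enumerate(actions)]))

# The factorized policy is a valid distribution over all K**N joint actions.
total = sum(joint_prob(a) for a in product(range(K), repeat=N))
print(round(total, 6))  # sums to 1.0
```

Note the trade-off this buys: the joint table has K**N = 81 entries here, the factorized form only N*K = 12, at the cost of being unable to represent correlations between agents' actions.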
However, these two approaches lack interactions between agents: in their algorithms, agent i does not take the plan of agent j into account. Coordination-graph methods do model such interactions (Guestrin et al., 2002; Böhmer et al., 2020), but they can only be applied to discrete-action, small-scale games. Furthermore, despite the empirical success of the aforementioned work in certain scenarios, theoretical insight is still lacking.

In this work, we only make a simple yet realistic assumption: the reward function r_i of each agent i depends just on its individual action and the actions of its neighbors (and the state), i.e.,

$$r_i(s, a) = r_i(s, a_i, a_{N_i}), \qquad (1)$$

where we use N_i to denote the neighbors of agent i and s to denote the global state. It says that the goal or decision of an agent is explicitly influenced only by a small subset N_i of the other agents. Such an assumption is reasonable in many real scenarios. For instance,

• A traffic light at an intersection decides on its phase change relying mainly on the traffic flow around it and the policies of its neighboring traffic lights.

• The main goal of a defender in a soccer game is to tackle the opponent's attacker, while he rarely needs to pay attention to the opposing goalkeeper's strategy.

Based on the assumption in equation 1, we propose a principled multi-agent reinforcement learning algorithm in the framework of probabilistic inference, where the objective is to maximize the long-term reward of the group, i.e., $\sum_{t=0}^{\infty}\sum_{i=1}^{N} \gamma^t r_i^t$ (see details in Section 4). Note that since each agent's reward depends on its neighbors, we still need a joint policy that maximizes the global reward through interactions. In this paper, we derive an iterative procedure for such interaction to learn the joint policy in collaborative MARL and name it intention propagation. Particularly,

• In the first round, each agent i makes an independent decision and spreads out its plan μ_i (we name it an intention) to its neighbors.
• In the second round, agent i adjusts its initial intention properly based on its neighbors' intentions μ_j, j ∈ N_i, and propagates its intention μ_i again.

• In the third round, it revises the decision made in the second round by a similar argument.

• As this procedure goes on, we show that the final output of the agents' policies converges to the mean-field approximation (a variational inference method from probabilistic graphical models (Bishop, 2006)) of the joint policy. In addition, this joint policy has the form of a Markov random field induced by the locality of the reward function (Proposition 1).

Therefore, such a procedure is computationally efficient when the underlying graph is sparse, since in each round each agent just needs to care about what its neighbors intend to do.

Remark: (1) Our work is not related to the mean-field game (MFG) (Yang et al., 2018). The goal of an MFG is to find a Nash equilibrium, while our work aims to find the optimal joint policy in a collaborative game. Furthermore, MFG generally assumes that agents are identical and interchangeable; when the number of agents goes to infinity, an MFG can view the states of the other agents as a population state distribution. In our problem, we make no such assumptions. (2) Our analysis is not limited to the mean-field approximation. When we change the message-passing structure of intention propagation, we can show that it converges to other approximations of the joint policy, e.g., loopy belief propagation in variational inference (Yedidia et al., 2001) (see Appendix B.2).
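The rounds described above can be sketched as standard mean-field coordinate updates on a toy pairwise Markov random field. Everything below (the graph, the action count K, and the potentials `unary` and `pair`) is an illustrative assumption standing in for the paper's learned model; it only shows the shape of the iteration, where each agent's intention μ_i is repeatedly re-softmaxed against its neighbors' current intentions:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3  # actions per agent (illustrative)
neighbors = {0: [1], 1: [0, 2], 2: [1]}  # a 3-agent chain (illustrative)

# Toy MRF potentials: theta_i(a_i) and symmetric theta_ij(a_i, a_j).
unary = {i: rng.normal(size=K) for i in neighbors}
pair = {}
for i in neighbors:
    for j in neighbors[i]:
        pair[(i, j)] = pair[(j, i)].T if (j, i) in pair else rng.normal(size=(K, K))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Round 0: independent initial intentions mu_i.
mu = {i: softmax(unary[i]) for i in neighbors}

for _ in range(50):  # each sweep = one round of spreading intentions
    for i in neighbors:
        # Adjust own intention given neighbors' current intentions mu_j:
        # mu_i(a_i) ∝ exp( theta_i(a_i) + sum_j E_{mu_j}[theta_ij(a_i, a_j)] )
        field = unary[i].copy()
        for j in neighbors[i]:
            field += pair[(i, j)] @ mu[j]
        mu[i] = softmax(field)

for i in neighbors:
    print(i, np.round(mu[i], 3))  # converged mean-field intentions, each sums to 1
```

Each update touches only an agent's own potential and its neighbors' current intentions, which is why the procedure stays cheap on sparse graphs.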

Contributions:

(1) We propose a principled method named intention propagation to solve the joint-policy collaborative MARL problem; (2) our method is computationally efficient, scaling up to one thousand agents and thus meeting the requirements of real applications; (3) empirically, it outperforms state-of-the-art baselines by a wide margin when the number of agents is large; (4) our work builds a bridge between MARL and neural embedded probabilistic inference, which may lead to new algorithms beyond intention propagation.

Notation: s_i^t and a_i^t represent the state and action of agent i at time step t. The neighbors of agent i are denoted N_i. We denote by X a random variable with domain X and refer to instantiations of X by the lower-case character x. We denote a density on X by p(x) and the space of all such densities by P.
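As a concrete illustration of the locality assumption r_i(s, a) = r_i(s, a_i, a_{N_i}) from the introduction, the toy sketch below (the graph and the reward rule are hypothetical, not from the paper) computes the group objective while each term reads only an agent's own action and its neighbors' actions, never the full joint action:

```python
# A chain of 4 agents, e.g. traffic lights along one road (hypothetical).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def local_reward(i, s, a):
    # Toy rule: agent i is rewarded for agreeing with each of its neighbors.
    return sum(1.0 for j in neighbors[i] if a[j] == a[i])

def group_reward(s, a):
    # The global objective sums the local rewards; every term depends only
    # on a_i and a_{N_i}, which is exactly the locality assumption.
    return sum(local_reward(i, s, a) for i in neighbors)

print(group_reward(s=None, a=[0, 0, 1, 1]))  # agents 0-1 agree, 2-3 agree -> 4.0
```

Because each r_i has this local structure, the induced joint policy factorizes over the neighborhood graph, which is what the Markov-random-field form in Proposition 1 exploits.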

2. RELATED WORK

We first discuss work on factorized approaches to the joint policy. COMA designs a MARL algorithm based on the actor-critic framework with independent actors π_i(a_i|s), where the joint policy is factorized as $\pi(a|s) = \prod_{i=1}^{N} \pi_i(a_i|s)$ (Foerster et al., 2018). MADDPG considers MARL in the cooperative or competitive setting, where it creates a critic for each agent (Lowe et al., 2017). Other similar works include (de Witt et al., 2019; Wei et al., 2018). Another way is to factorize the value function into several utility functions. Sunehag et al. (2018) assumes that the

