INTENTION PROPAGATION FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

A hallmark of an AI agent is the ability to understand and interact with others, as humans do. In this paper, we propose a collaborative multi-agent reinforcement learning algorithm that learns a joint policy through interactions among agents. To reach a joint decision for the group, each agent makes an initial decision and communicates its policy to its neighbors. Each agent then adjusts its own policy based on the received messages and propagates its updated plan. As this intention propagation procedure continues, we prove that it converges to a mean-field approximation of the joint policy within the framework of neural-embedded probabilistic inference. We evaluate our algorithm on several large-scale challenging tasks and demonstrate that it outperforms previous state-of-the-art methods.
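The propagation procedure described above can be illustrated with a toy sketch: each agent holds a distribution over its own actions (its "intention"), shares it with its neighbors, and iteratively refines it in a mean-field-style fixed-point loop. Everything here (the averaged-neighbor message, the damping factor, the graph layout) is a simplifying assumption for illustration, not the paper's actual update rule.

```python
import numpy as np

def intention_propagation(logits, neighbors, iters=10, damping=0.5):
    """Toy mean-field-style iteration: agent i's intention q[i] is a
    distribution over its own actions, refined from its neighbors'
    current intentions. `neighbors` maps agent index -> list of indices."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    q = softmax(logits)  # initial independent decisions
    for _ in range(iters):
        new_q = np.empty_like(q)
        for i, nbrs in neighbors.items():
            # Message: average of neighbors' intentions (a stand-in for
            # the true pairwise-utility messages in the paper).
            msg = np.mean([q[j] for j in nbrs], axis=0) if nbrs else 0.0
            new_q[i] = softmax(logits[i] + msg)
        q = damping * q + (1 - damping) * new_q  # damped update
    return q
```

On a 3-agent chain with uniform initial logits, the iteration leaves every agent's intention uniform, which is the expected fixed point of this symmetric toy setup.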

1. INTRODUCTION

Collaborative multi-agent reinforcement learning is an important sub-field of multi-agent reinforcement learning (MARL), in which agents learn to coordinate to achieve joint success. It has wide applications in traffic control (Kuyer et al., 2008), autonomous driving (Shalev-Shwartz et al., 2016), and smart grids (Yang et al., 2018). To learn to coordinate, interactions between agents are indispensable. For instance, humans can reason about others' behaviors or infer other people's intentions through communication and then determine an effective coordination plan. However, how to design such an interaction mechanism in a principled way while scaling to large real-world applications remains a challenging problem. Recently, there has been a surge of interest in solving the collaborative MARL problem (Foerster et al., 2018; Qu et al., 2019; Lowe et al., 2017). Among these, joint policy approaches have demonstrated their superiority (Rashid et al., 2018; Sunehag et al., 2018; Oliehoek et al., 2016). A straightforward approach is to replace the action in single-agent reinforcement learning by the joint action a = (a_1, a_2, ..., a_N), but this obviously suffers from an exponentially large action space. Several approaches have therefore been proposed to factorize the joint action space to mitigate this issue; they can be roughly grouped into two categories:
• Factorization on the policy. This approach explicitly assumes that π(a|s) := ∏_{i=1}^{N} π_i(a_i|s), i.e., that the policies are independent (Foerster et al., 2018; Zhang et al., 2018). To mitigate the instability caused by independent learners, it generally needs a centralized critic.
• Factorization on the value function. This approach has a similar spirit but factorizes the joint value function into several utility functions, each involving the actions of only one agent (Rashid et al., 2018; Sunehag et al., 2018).
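The two factorizations above can be made concrete with a small numerical sketch. The per-agent probabilities and utilities below are hypothetical values for three agents with two actions each; the policy factorization multiplies per-agent probabilities, while the value factorization (in the additive, VDN-style form) sums per-agent utilities.

```python
import numpy as np

# Hypothetical per-agent quantities for N = 3 agents, 2 actions each.
pi = np.array([[0.7, 0.3],    # pi_i(a_i | s): each row sums to 1
               [0.6, 0.4],
               [0.9, 0.1]])
util = np.array([[1.0, 0.5],  # Q_i(s, a_i): per-agent utility
                 [0.2, 0.8],
                 [0.3, 0.3]])
joint_action = (0, 1, 0)      # a = (a_1, a_2, a_3)

# Factorization on the policy: pi(a|s) = prod_i pi_i(a_i|s)
joint_prob = np.prod([pi[i, a] for i, a in enumerate(joint_action)])

# Factorization on the value (additive form): Q(s, a) = sum_i Q_i(s, a_i)
joint_value = sum(util[i, a] for i, a in enumerate(joint_action))

print(joint_prob)   # 0.7 * 0.4 * 0.9
print(joint_value)  # 1.0 + 0.8 + 0.3
```

Note that both factorizations evaluate a joint action with only N per-agent lookups instead of enumerating the exponentially many joint actions, which is exactly the motivation for factorizing in the first place.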
However, these two approaches lack interactions between agents, since in their algorithms agent i does not account for the plan of agent j. Indeed, they may suffer from a phenomenon known in game theory as relative over-generalization, observed by Wei & Luke (2016); Castellini et al. (2019); Palmer et al. (2018). Approaches based on the coordination graph can effectively prevent such cases: the value function is factorized as a summation of utility functions over pairwise or local joint actions (Guestrin et al., 2002; Böhmer et al., 2020). However, they can only be applied to small-scale games with discrete actions. Furthermore, despite the empirical success of the aforementioned work in certain scenarios, it still lacks theoretical insight. In this work, we make only a simple yet realistic assumption: the reward function r_i of each agent i depends only on its individual action and the actions of its neighbors (and

