MULTI-AGENT POLICY OPTIMIZATION WITH APPROXIMATIVELY SYNCHRONOUS ADVANTAGE ESTIMATION

Abstract

Cooperative multi-agent tasks require agents to deduce their own contributions from shared global rewards, a challenge known as credit assignment. General policy-based multi-agent reinforcement learning methods address this challenge by introducing differentiated value functions or advantage functions for individual agents. In a multi-agent system, the policies of different agents need to be evaluated jointly. In order to update policies synchronously, such value functions or advantage functions also require synchronous evaluation. However, in current methods, value functions or advantage functions use counter-factual joint actions that are evaluated asynchronously, and thus suffer from a natural estimation bias. In this work, we propose approximatively synchronous advantage estimation. We first derive the marginal advantage function, an extension of the single-agent advantage function to multi-agent systems. Furthermore, we introduce a policy approximation for synchronous advantage estimation, and break the multi-agent policy optimization problem down into multiple sub-problems of single-agent policy optimization. Our method is compared with baseline algorithms on StarCraft multi-agent challenges, and shows the best performance on most of the tasks.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have shown impressive performance on many single-agent (SA) environment tasks (Mnih et al., 2013; Jaderberg et al., 2016; Oh et al., 2018). However, many real-world environments are far more complex, and RL agents often need to cooperate with other agents, for example in taxi scheduling (Nguyen et al., 2018) and network control (Chu et al., 2019). In cooperative multi-agent tasks, each agent is treated as an independent decision-maker, but agents can be trained together to learn cooperation. The common goal is to maximize the global return from the perspective of the team of agents. To deal with such tasks, the architecture of centralized training and decentralized execution (CTDE) was proposed (Oliehoek & Vlassis, 2007; Jorge et al., 2016). The basic idea of CTDE is to construct a centralized policy evaluator, which works only during training and has access to global information. At the same time, each agent is assigned a local policy for decentralized execution. The role of the evaluator is to evaluate the agents' local policies differentially from the global perspective. A challenge in constructing the centralized evaluator is multi-agent credit assignment (Chang et al., 2004): in cooperative settings, joint actions typically generate only global rewards, making it difficult for each agent to deduce its own contribution to the team's success. Credit assignment requires differentiated evaluation of agents' local policies, but designing an individual reward function for each agent is often complicated and lacks generalization (Grzes, 2017; Mannion et al., 2018). Current policy-based MARL methods generally realize credit assignment by introducing differentiated value functions or advantage functions (Foerster et al., 2018; Lowe et al., 2017).
However, these value functions or advantage functions are estimated asynchronously while decentralized policies are updated synchronously, as shown in figure 1(b), which results in a natural estimation bias. In this paper, we propose a novel policy-based MARL method called multi-agent policy optimization with approximatively synchronous advantage estimation (ASAE). In our work, we first define counter-factual scenes, in which MA advantage estimation can be converted to SA advantage estimation. For a certain agent, each counter-factual scene is assigned a SA advantage. Then
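The idea of converting MA advantage estimation into SA advantage estimation can be illustrated with a minimal toy sketch: for a fixed state, an agent's advantage for its own action is taken as the expectation of the joint advantage over the other agents' actions, drawn from their current policies. The array values, the two-agent/two-action setting, and the function name below are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

# Hypothetical joint advantage table for one fixed state s in a
# 2-agent, 2-action setting: joint_adv[a0, a1] = A(s, a0, a1).
joint_adv = np.array([[1.0, -1.0],
                      [0.5,  2.0]])

# Agent 1's current policy at state s (probabilities over its 2 actions).
pi_other = np.array([0.25, 0.75])

def marginal_advantage(joint_adv, pi_other):
    """Marginal advantage of agent 0 at state s:
    A_0(s, a0) = E_{a1 ~ pi_1}[ A(s, a0, a1) ],
    i.e. the joint advantage averaged over agent 1's policy."""
    return joint_adv @ pi_other

adv0 = marginal_advantage(joint_adv, pi_other)
# adv0[a0] now plays the role of a single-agent advantage for agent 0,
# so single-agent policy-gradient machinery can be applied to it.
print(adv0)  # [-0.5, 1.625]
```

Each entry of `adv0` corresponds to one of agent 0's actions with the other agent's behavior marginalized out, which is what allows the multi-agent optimization problem to be decomposed into per-agent sub-problems.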

