MULTI-AGENT POLICY OPTIMIZATION WITH APPROXIMATIVELY SYNCHRONOUS ADVANTAGE ESTIMATION

Abstract

Cooperative multi-agent tasks require agents to deduce their own contributions from a shared global reward, known as the challenge of credit assignment. Policy-based multi-agent reinforcement learning methods generally address this challenge by introducing differentiated value functions or advantage functions for individual agents. In a multi-agent system, the policies of different agents need to be evaluated jointly. In order to update policies synchronously, such value functions or advantage functions also require synchronous evaluation. However, in current methods, value functions or advantage functions use counter-factual joint actions that are evaluated asynchronously, and thus suffer from an inherent estimation bias. In this work, we propose approximatively synchronous advantage estimation. We first derive the marginal advantage function, an extension of the single-agent advantage function to multi-agent systems. Furthermore, we introduce a policy approximation for synchronous advantage estimation, and break down the multi-agent policy optimization problem into multiple sub-problems of single-agent policy optimization. Our method is compared with baseline algorithms on StarCraft multi-agent challenges, and shows the best performance on most of the tasks.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have shown impressive performance on many single-agent (SA) environment tasks (Mnih et al., 2013; Jaderberg et al., 2016; Oh et al., 2018). However, many real-world problems involve much more complex environments in which RL agents need to cooperate with other agents, for example taxi scheduling (Nguyen et al., 2018) and network control (Chu et al., 2019). In cooperative multi-agent tasks, each agent is treated as an independent decision-maker, but agents can be trained together to learn cooperation. The common goal is to maximize the global return from the perspective of the team of agents. To deal with such tasks, the architecture of centralized training with decentralized execution (CTDE) was proposed (Oliehoek & Vlassis, 2007; Jorge et al., 2016). The basic idea of CTDE is to construct a centralized policy evaluator, which works only during training and has access to global information. At the same time, each agent is assigned a local policy for decentralized execution. The role of the evaluator is to evaluate agents' local policies differentially from the global perspective. A key challenge in constructing the centralized evaluator is multi-agent credit assignment (Chang et al., 2004): in cooperative settings, joint actions typically generate only global rewards, making it difficult for each agent to deduce its own contribution to the team's success. Credit assignment requires differentiated evaluation of agents' local policies, but designing an individual reward function for each agent is often complicated and lacks generality (Grzes, 2017; Mannion et al., 2018). Current policy-based MARL methods generally realize credit assignment by introducing differentiated value functions or advantage functions (Foerster et al., 2018; Lowe et al., 2017).
However, these value functions or advantage functions are estimated asynchronously while decentralized policies are updated synchronously, as shown in Figure 1(b), which results in an inherent estimation bias. In this paper, we propose a novel policy-based MARL method called multi-agent policy optimization with approximatively synchronous advantage estimation (ASAE). In our work, we first define counter-factual scenes, in which MA advantage estimation can be converted to SA advantage estimation. For a given agent, each counter-factual scene is assigned an SA advantage. The marginal advantage function is then defined as the expectation of SA advantages over the distribution of counter-factual scenes, and credit assignment is realized by constructing different scene distributions for different agents. Moreover, in order to achieve synchronous advantage estimation, an approximation of the other agents' joint future policy is introduced. To ensure that the approximation is reliable, a restriction is applied to the original multi-agent policy optimization (MAPO) problem. The approximate optimization problem is simplified and broken down into multiple sub-problems, which have a form similar to the trust region policy optimization (TRPO) problem. The sub-problems are finally solved with the proximal policy optimization (PPO) method. We make two contributions in this work: (1) We propose a novel advantage estimation method, marginal advantage estimation, which realizes credit assignment for MARL. More importantly, this method provides a channel through which various SA advantage functions can be extended to multi-agent systems. (2) We propose a simple yet effective method for approximatively synchronous advantage estimation.
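The marginal advantage described above is an expectation of single-agent advantages over a distribution of counter-factual scenes. The fragment below is an illustrative sketch of that expectation only; the array names, shapes, and toy numbers are our own assumptions, not the paper's notation:

```python
import numpy as np

def marginal_advantage(scene_probs, sa_advantages):
    """Expectation of single-agent advantages over the counter-factual
    scene distribution. Per-agent credit assignment would come from
    using a different scene distribution for each agent."""
    return float(np.dot(scene_probs, sa_advantages))

# toy example: two counter-factual scenes for one agent
probs = np.array([0.4, 0.6])   # hypothetical scene distribution
advs = np.array([1.0, -0.5])   # hypothetical SA advantages per scene
m_adv = marginal_advantage(probs, advs)  # 0.4*1.0 + 0.6*(-0.5) = 0.1
```

The sketch deliberately omits how scene distributions are constructed, which is the substance of the method developed later in the paper.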

2. RELATED WORK

A common challenge in cooperative multi-agent tasks is credit assignment. RL algorithms designed for single-agent tasks ignore credit assignment and treat other agents as part of a partially observable environment. Such algorithms perform poorly in complex cooperative tasks that require a high degree of coordination (Lowe et al., 2017). To deal with this challenge, some value-based MARL methods estimate a local Q value for each agent, and the shared global Q value is then constructed from these local Q values. The value decomposition network (VDN) constructs the global Q value by simply summing all local Q values (Sunehag et al., 2018). In the QMIX algorithm (Rashid et al., 2018), the global Q value is obtained by mixing local Q values with a neural network. In mean-field multi-agent methods, local Q values are defined on agent pairs, and the mapping from local Q values to the global Q value is established by measuring the influence of each agent pair's joint action on the global return (Yang et al., 2018). Similarly, for policy-based MARL methods, credit assignment is generally realized through differentiated evaluation with the CTDE structure. Some naive policy-based methods estimate local Q values for individual agents with a centralized critic (Lowe et al., 2017), resulting in large variance. Other methods introduce advantage functions into MARL. The counter-factual multi-agent policy gradient (COMA) method (Foerster et al., 2018) is inspired by the idea of difference rewards (Wolpert & Tumer, 2002) and provides a simple yet effective approach to differentiated advantage estimation in cooperative MARL. In COMA, a centralized critic is used to predict the joint Q value function Q^π(s, u) of joint action u under state s, and the advantage for agent a is defined as

A^a(s, u) = Q(s, u) - Σ_{u'^a} π^a(u'^a | τ^a) Q(s, (u^{-a}, u'^a))    (1)

where τ and π represent trajectory and policy respectively, and a and -a denote the current agent and the set of other agents respectively.
COMA introduces a counter-factual baseline, which assumes that the other agents keep their actions fixed while the current agent's action is marginalized over its own policy.
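As a concrete illustration of equation (1), the COMA advantage for one agent can be computed from the Q values of its alternative actions with the other agents' joint action held fixed. The snippet below is a minimal NumPy sketch; the function name, variable names, and Q values are our own toy assumptions:

```python
import numpy as np

def coma_advantage(q_values, policy, action):
    """Counter-factual advantage for one agent, per equation (1).

    q_values: Q(s, (u^-a, u'^a)) for each alternative action u'^a of
              agent a, with the other agents' actions u^-a held fixed.
    policy:   agent a's action distribution pi^a(. | tau^a).
    action:   the action u^a that agent a actually took.
    """
    # counter-factual baseline: sum_u' pi^a(u') * Q(s, (u^-a, u'))
    baseline = float(np.dot(policy, q_values))
    return q_values[action] - baseline

# toy usage: 3 actions
q = np.array([1.0, 2.0, 3.0])    # hypothetical per-action Q values
pi = np.array([0.2, 0.3, 0.5])   # hypothetical policy of agent a
adv = coma_advantage(q, pi, action=2)  # 3.0 - 2.3 = 0.7
```

Note that the baseline marginalizes only agent a's action; this is precisely the asynchronous evaluation of counter-factual joint actions that the present paper identifies as a source of estimation bias.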



Figure 1: Comparison among three different manners of advantage estimation and update. π^{a,t} represents the policy of agent a at iteration t, and there are n agents in total. Lines with arrows represent policy updates, and ep denotes the update epoch. In a single iteration, synchronous update takes only one epoch while asynchronous update takes n epochs. In advantage estimation, policies need to be evaluated jointly, and the dashed boxes contain the joint policies used for advantage estimation in the corresponding update. In particular, synchronous estimation requires the other agents' future policies.

