MULTI-AGENT COLLABORATION VIA REWARD ATTRIBUTION DECOMPOSITION

Abstract

Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders of magnitude more training rounds than humans and may not generalize to slightly altered environments or new agent configurations (i.e., ad hoc team play). In this work, we propose Collaborative Q-learning (CollaQ), which achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization on reward assignment and show that under certain conditions, each agent has an approximately optimal decentralized Q-function that can be decomposed into two terms: the self term, which depends only on the agent's own state, and the interactive term, which depends on the states of nearby agents, often observed by the current agent. The two terms are jointly trained using regular DQN, regularized with a Multi-Agent Reward Attribution (MARA) loss that ensures both terms retain their semantics. CollaQ is evaluated on various StarCraft maps and outperforms existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN), improving the win rate by 40% with the same number of environment steps. In the more challenging ad hoc team play setting (i.e., reweighting/adding/removing units without retraining or finetuning), CollaQ outperforms the previous state of the art by over 30%.
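To make the decomposition concrete, the following is a minimal NumPy sketch of a decentralized Q-function split into a self term and an interactive term, with a MARA-style penalty that drives the interactive term to zero when other agents' observations are masked out. All network sizes, variable names, and the masking convention here are illustrative assumptions, not the paper's actual implementation; in practice both terms would be trained jointly with DQN's TD loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper).
N_ACTIONS, OBS_SELF, OBS_OTHERS, HIDDEN = 5, 8, 16, 32

def linear(in_dim, out_dim):
    # Small random weights for a toy two-layer network.
    return rng.normal(0.0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

def mlp(x, params):
    (W1, b1), (W2, b2) = params
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

# Self term: depends only on the agent's own observation.
q_alone_params = [linear(OBS_SELF, HIDDEN), linear(HIDDEN, N_ACTIONS)]
# Interactive term: depends on own plus nearby agents' observations.
q_collab_params = [linear(OBS_SELF + OBS_OTHERS, HIDDEN), linear(HIDDEN, N_ACTIONS)]

def q_values(o_self, o_others):
    """Decomposed Q: Q_i = Q_alone(o_self) + Q_collab(o_self, o_others)."""
    joint = np.concatenate([o_self, o_others])
    return mlp(o_self, q_alone_params) + mlp(joint, q_collab_params)

def mara_penalty(o_self):
    """MARA-style regularizer (sketch): with other agents masked out,
    the interactive term should contribute nothing, so penalize its
    squared magnitude at the masked observation."""
    masked = np.concatenate([o_self, np.zeros(OBS_OTHERS)])
    return np.mean(mlp(masked, q_collab_params) ** 2)

o_self = rng.normal(size=OBS_SELF)
o_others = rng.normal(size=OBS_OTHERS)
q = q_values(o_self, o_others)   # one Q-value per action
penalty = mara_penalty(o_self)   # added to the usual TD loss during training
```

The penalty keeps the self term responsible for rewards the agent can earn alone, so the interactive term carries only collaboration-dependent value, which is what allows it to adapt as teammates change.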

1. INTRODUCTION

In recent years, multi-agent deep reinforcement learning (MARL) has drawn increasing interest from the research community. MARL algorithms have shown super-human performance in various games like Dota 2 (Berner et al., 2019), Quake 3 Arena (Jaderberg et al., 2019), and StarCraft (Samvelyan et al., 2019). However, these algorithms (Schulman et al., 2017; Mnih et al., 2013) are far less sample efficient than humans. For example, in Hide and Seek (Baker et al., 2019), it takes agents 2.69-8.62 million episodes to learn the simple strategy of door blocking, while humans need only a few rounds to learn this behavior. One key reason for the slow learning is that the number of joint states grows exponentially with the number of agents. Moreover, many real-world situations require agents to adapt to new team configurations. This can be modeled as ad hoc multi-agent reinforcement learning (Stone et al., 2010) (Ad-hoc MARL), in which agents must adapt to different team sizes and configurations at test time. In contrast to the MARL setting, where agents can learn a fixed and team-dependent policy, in the Ad-hoc MARL setting agents must assess and adapt to the capabilities of others to behave optimally. Existing work on ad hoc team play either requires sophisticated online learning at test time (Barrett et al., 2011) or prior knowledge about teammate behaviors (Barrett and Stone, 2015). As a result, these methods do not generalize to complex real-world scenarios. Most existing works focus either on improving generalization towards different opponent strategies (Lanctot et al., 2017; Hu et al., 2020) or on simple ad hoc settings such as a varying number of test-time teammates (Schwab et al., 2018; Long et al., 2020). We consider a more general setting where test-time teammates may have different capabilities.
The need to reason about different team configurations in Ad-hoc MARL results in an additional exponential increase (Stone et al., 2010) in representational complexity compared to the MARL setting. In collaborative settings, one way to address the complexity of ad hoc team play is to explicitly model how agents collaborate. In this paper, one key observation is that when collaborating with different agents, an agent changes its behavior because it realizes that the team could function better if it focuses on some of the rewards while leaving the others to its teammates. Inspired by this principle, we formulate multi-agent collaboration as a joint

