MULTI-AGENT COLLABORATION VIA REWARD ATTRIBUTION DECOMPOSITION

Abstract

Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders of magnitude more training rounds than humans and may not generalize to slightly altered environments or new agent configurations (i.e., ad hoc team play). In this work, we propose Collaborative Q-learning (CollaQ), which achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization on reward assignment and show that, under certain conditions, each agent has a decentralized Q-function that is approximately optimal and can be decomposed into two terms: the self term, which relies only on the agent's own state, and the interactive term, which depends on the states of nearby agents, often observed by the current agent. The two terms are jointly trained using regular DQN, regularized with a Multi-Agent Reward Attribution (MARA) loss that ensures both terms retain their semantics. CollaQ is evaluated on various StarCraft maps and outperforms existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN), improving the win rate by 40% with the same number of environment steps. In the more challenging ad hoc team play setting (i.e., reweighting/adding/removing units without retraining or finetuning), CollaQ outperforms previous SoTA by over 30%.

1. INTRODUCTION

In recent years, multi-agent deep reinforcement learning (MARL) has drawn increasing interest from the research community. MARL algorithms have shown super-human performance in various games like Dota 2 (Berner et al., 2019), Quake 3 Arena (Jaderberg et al., 2019), and StarCraft (Samvelyan et al., 2019). However, these algorithms (Schulman et al., 2017; Mnih et al., 2013) are far less sample efficient than humans. For example, in Hide and Seek (Baker et al., 2019), it takes agents 2.69–8.62 million episodes to learn the simple strategy of door blocking, while it takes humans only a few rounds. One key reason for the slow learning is that the number of joint states grows exponentially with the number of agents. Moreover, many real-world situations require agents to adapt to new team configurations. This can be modeled as ad hoc multi-agent reinforcement learning (Stone et al., 2010) (Ad-hoc MARL), in which agents must adapt to different team sizes and configurations at test time. In contrast to the MARL setting, where agents can learn a fixed, team-dependent policy, in the Ad-hoc MARL setting agents must assess and adapt to the capabilities of others to behave optimally. Existing work on ad hoc team play either requires sophisticated online learning at test time (Barrett et al., 2011) or prior knowledge about teammate behaviors (Barrett and Stone, 2015). As a result, it does not generalize to complex real-world scenarios. Most existing works focus either on improving generalization towards different opponent strategies (Lanctot et al., 2017; Hu et al., 2020) or on simple ad hoc settings like a varying number of test-time teammates (Schwab et al., 2018; Long et al., 2020). We consider a more general setting in which test-time teammates may have different capabilities.
The need to reason about different team configurations in Ad-hoc MARL results in an additional exponential increase (Stone et al., 2010) in representational complexity compared to the MARL setting. In the collaborative setting, one way to address this complexity is to explicitly model how agents collaborate. A key observation in this paper is that when collaborating with different teammates, an agent changes its behavior because it realizes that the team could function better if it focuses on some of the rewards while leaving the others to its teammates. Inspired by this principle, we formulate multi-agent collaboration as a joint optimization over an implicit reward assignment among agents. Because the rewards are assigned differently for different team configurations, the behavior of an agent changes and adaptation follows. While solving this optimization directly requires centralization at test time, we make an interesting theoretical finding: each agent has a decentralized policy that is (1) approximately optimal for the joint optimization, and (2) only depends on the local configuration of other agents. This enables us to learn a direct mapping from the states of nearby agents (the "observation" of agent $i$) to its Q-function using a deep neural network. Furthermore, this finding suggests that the Q-function of agent $i$ should be decomposed into two terms: $Q^{\text{alone}}_i$, which only depends on agent $i$'s own state $s_i$, and $Q^{\text{collab}}_i$, which depends on nearby agents but vanishes if no other agents are nearby. To enforce these semantics, we regularize $Q^{\text{collab}}_i(s_i, \cdot) = 0$ during training via a novel Multi-Agent Reward Attribution (MARA) loss. The resulting algorithm, Collaborative Q-learning (CollaQ), achieves a 40% improvement in win rate over state-of-the-art techniques on the StarCraft multi-agent challenge.
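The two-term decomposition with the MARA regularizer can be sketched as a small two-headed network. This is an illustrative sketch under our own assumptions, not the paper's exact architecture; the class, argument, and method names (`CollaQAgent`, `state_dim`, `obs_dim`, `mara_loss`) are hypothetical:

```python
import torch
import torch.nn as nn


class CollaQAgent(nn.Module):
    """Sketch of the decomposed per-agent Q-function:
    Q_i(s_i, o_i) = Q_alone(s_i) + Q_collab(o_i).

    state_dim, obs_dim, and the MLP sizes are illustrative assumptions.
    """

    def __init__(self, state_dim, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Q_alone sees only the agent's own state s_i.
        self.q_alone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )
        # Q_collab sees the local observation o_i (own state plus nearby agents).
        self.q_collab = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, s_i, o_i):
        # Sum of the self term and the interactive term.
        return self.q_alone(s_i) + self.q_collab(o_i)

    def mara_loss(self, o_self_only):
        # MARA regularizer: push Q_collab(s_i, .) toward zero on observations
        # that contain no other agents (here, the own state zero-padded).
        return self.q_collab(o_self_only).pow(2).mean()
```

In training, this regularizer would be added to the usual DQN TD loss so that the collaborative head keeps its "vanishes when alone" semantics.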
We show that (1) the MARA loss is critical for strong performance and (2) both $Q^{\text{alone}}$ and $Q^{\text{collab}}$ are interpretable via visualization. Furthermore, CollaQ agents can achieve ad hoc team play without retraining or fine-tuning. We propose three tasks to evaluate ad hoc team play performance at test time: (a) assign a new VIP unit whose survival matters, (b) swap different units in and out, and (c) add or remove units. Results show that CollaQ outperforms baselines by an average of 30% in all these settings.

Related Works. The most straightforward way to train such a MARL task is to learn each agent's value function $Q_i$ independently (IQL) (Tan, 1993). However, the environment becomes non-stationary from the perspective of an individual agent, so IQL performs poorly in practice. Recent works, e.g., VDN (Sunehag et al., 2017), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019), adopt centralized training with decentralized execution to solve this problem. They propose to write the joint value function as $Q^\pi(s, a) = \phi(s, Q_1(o_1, a_1), \ldots, Q_K(o_K, a_K))$, where the formulation of $\phi$ differs in each method. These methods successfully use centralized training to alleviate the non-stationarity issue. However, none of them generalizes well to ad hoc team play, since the learned $Q_i$ functions depend heavily on the existence of other agents.
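For concreteness, the simplest instance of the mixing function $\phi$ is VDN's additive form, sketched below (the function name is ours; QMIX instead feeds the per-agent values through a state-conditioned monotonic network):

```python
import torch


def vdn_mix(per_agent_qs):
    """VDN's choice of phi: the joint action-value is the sum of the
    per-agent values Q_k(o_k, a_k).

    per_agent_qs: list of K tensors, each of shape (batch,), holding the
    chosen-action Q-value of one agent for a batch of transitions.
    """
    return torch.stack(per_agent_qs, dim=0).sum(dim=0)
```

Because the sum is monotonic in each $Q_k$, each agent can still act greedily on its own $Q_k$ at execution time, which is what makes decentralized execution possible under centralized training.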

2. COLLABORATIVE MULTI-AGENT REWARD ASSIGNMENT

Basic Setting. A multi-agent extension of the Markov Decision Process, called collaborative partially observable Markov Games (Littman, 1994), is defined by a set of states $S$ describing the possible configurations of all $K$ agents, sets of possible actions $A_1, \ldots, A_K$, and sets of possible observations $O_1, \ldots, O_K$. At every step, each agent $i$ chooses its action $a_i$ by a stochastic policy $\pi_i: O_i \times A_i \to [0, 1]$. The joint action $a$ produces the next state via a transition function $P: S \times A_1 \times \cdots \times A_K \to S$. All agents share the same reward $r: S \times A_1 \times \cdots \times A_K \to \mathbb{R}$ and a joint value function $Q^\pi = \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}[R_t \mid s_t, a_t]$, where $R_t = \sum_{j=0}^{\infty} \gamma^j r_{t+j}$ is the discounted return. In Sec. 2.1, we first model multi-agent collaboration as a joint optimization on reward assignment: instead of acting based on the joint state $s$, each agent $i$ acts independently on its own state $s_i$, following its own optimal value $V_i$, which is a function of the perceived reward assignment $r_i$. While the optimal perceived reward assignment $r^*_i(s)$ depends on the joint state of all agents and requires centralization, in Sec. 2.2 we prove that there exists an approximately optimal solution $\hat{r}_i$ that only depends on the local observation $s^{\text{local}}_i$ of agent $i$, thus enabling decentralized execution. Lastly, in Sec. 2.3, we distill the theoretical insights into a practical algorithm, CollaQ, by directly learning the compositional mapping $s^{\text{local}}_i \to \hat{r}_i \to V_i$ in an end-to-end fashion, while keeping the decomposed structure of self state and local observations.
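As a quick illustration of the discounted return $R_t = \sum_{j=0}^{\infty} \gamma^j r_{t+j}$ defined above, it can be computed for a finite episode in a single backward pass using the recursion $R_t = r_t + \gamma R_{t+1}$ (a minimal sketch; the function name is ours):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{j>=0} gamma^j r_{t+j} for every step t of a
    finite episode, given the per-step rewards r_0, ..., r_{T-1}."""
    ret = 0.0
    out = []
    for r in reversed(rewards):
        ret = r + gamma * ret  # recursion R_t = r_t + gamma * R_{t+1}
        out.append(ret)
    out.reverse()
    return out
```

For example, `discounted_returns([1.0, 1.0, 1.0], gamma=0.5)` returns `[1.75, 1.5, 1.0]`.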

2.1. BASIC ASSUMPTION

A naive modeling of multi-agent collaboration is to estimate a joint value function $V_{\text{joint}} := V_{\text{joint}}(s_1, s_2, \ldots, s_K)$ and find the best action for agent $i$ to maximize $V_{\text{joint}}$ given the current joint state $s = (s_1, s_2, \ldots, s_K)$. However, this has three fundamental drawbacks: (1) $V_{\text{joint}}$ generally requires an exponential number of samples to learn; (2) evaluating this function requires full observation of the states of all agents, which disallows decentralized execution, a key desideratum of multi-agent RL; and (3) for any environment/team change (e.g., teaming with different agents), $V_{\text{joint}}$ needs to be relearned for all agents, rendering ad hoc team play impossible. CollaQ addresses these three issues with a novel theoretical framework that decouples the interactions between agents. Instead of using a $V_{\text{joint}}$ that bundles all the agent interactions together, we consider the underlying mechanism of how they interact: in a fully collaborative setting, the reason why

