MULTI-AGENT COLLABORATION VIA REWARD ATTRIBUTION DECOMPOSITION

Abstract

Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders-of-magnitude more training rounds than humans and may not generalize to slightly altered environments or new agent configurations (i.e., ad hoc team play). In this work, we propose Collaborative Q-learning (CollaQ), which achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization on reward assignment and show that under certain conditions, each agent has a decentralized Q-function that is approximately optimal and can be decomposed into two terms: the self-term, which only relies on the agent's own state, and the interactive term, which is related to the states of nearby agents, often observed by the current agent. The two terms are jointly trained using regular DQN, regularized with a Multi-Agent Reward Attribution (MARA) loss that ensures both terms retain their semantics. CollaQ is evaluated on various StarCraft maps, outperforming existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN) by improving the win rate by 40% with the same number of environment steps. In the more challenging ad hoc team play setting (i.e., reweight/add/remove units without retraining or finetuning), CollaQ outperforms the previous SoTA by over 30%.

1. INTRODUCTION

In recent years, multi-agent deep reinforcement learning (MARL) has drawn increasing interest from the research community. MARL algorithms have shown super-human level performance in various games like Dota 2 (Berner et al., 2019), Quake 3 Arena (Jaderberg et al., 2019), and StarCraft (Samvelyan et al., 2019). However, the algorithms (Schulman et al., 2017; Mnih et al., 2013) are far less sample efficient than humans. For example, in Hide and Seek (Baker et al., 2019), it takes agents 2.69-8.62 million episodes to learn the simple strategy of door blocking, while humans need only a few rounds. One key reason for the slow learning is that the number of joint states grows exponentially with the number of agents. Moreover, many real-world situations require agents to adapt to new team configurations. This can be modeled as ad hoc multi-agent reinforcement learning (Stone et al., 2010) (ad hoc MARL), in which agents must adapt to different team sizes and configurations at test time. In contrast to the MARL setting, where agents can learn a fixed, team-dependent policy, in the ad hoc MARL setting agents must assess and adapt to the capabilities of others to behave optimally. Existing work on ad hoc team play either requires sophisticated online learning at test time (Barrett et al., 2011) or prior knowledge about teammate behaviors (Barrett and Stone, 2015). As a result, it does not generalize to complex real-world scenarios. Most existing works either focus on improving generalization to different opponent strategies (Lanctot et al., 2017; Hu et al., 2020) or on simple ad hoc settings such as a varying number of test-time teammates (Schwab et al., 2018; Long et al., 2020). We consider a more general setting where test-time teammates may have different capabilities.
The need to reason about different team configurations in ad hoc MARL results in an additional exponential increase (Stone et al., 2010) in representational complexity compared to the MARL setting. In the collaborative setting, one way to address the complexity of ad hoc team play is to explicitly model how agents collaborate. In this paper, one key observation is that when collaborating with different agents, an agent changes its behavior because it realizes that the team could function better if it focuses on some of the rewards while leaving other rewards to its teammates. Inspired by this principle, we formulate multi-agent collaboration as a joint optimization over an implicit reward assignment among agents. Because the rewards are assigned differently for different team configurations, the behavior of an agent changes and adaptation follows. While solving this optimization directly requires centralization at test time, we make an interesting theoretical finding: each agent has a decentralized policy that is (1) approximately optimal for the joint optimization, and (2) only depends on the local configuration of other agents. This enables us to learn a direct mapping from the states of nearby agents (the "observation" of agent i) to its Q-function using a deep neural network. Furthermore, this finding also suggests that the Q-function of agent i should be decomposed into two terms: Q_i^alone, which only depends on agent i's own state s_i, and Q_i^collab, which depends on nearby agents and vanishes if no other agents are nearby. To enforce this semantics, we regularize Q_i^collab(s_i, ·) = 0 in training via a novel Multi-Agent Reward Attribution (MARA) loss. The resulting algorithm, Collaborative Q-learning (CollaQ), achieves a 40% improvement in win rates over state-of-the-art techniques for the StarCraft multi-agent challenge.
We show that (1) the MARA loss is critical for strong performance and (2) both Q^alone and Q^collab are interpretable via visualization. Furthermore, CollaQ agents can achieve ad hoc team play without retraining or fine-tuning. We propose three tasks to evaluate ad hoc team play performance at test time: (a) assign a new VIP unit whose survival matters, (b) swap different units in and out, and (c) add or remove units. Results show that CollaQ outperforms baselines by an average of 30% in all these settings.

Related Work. The most straightforward way to train such a MARL task is to learn each agent's value function Q_i independently (IQL) (Tan, 1993). However, the environment becomes non-stationary from the perspective of an individual agent, so this performs poorly in practice. Recent works, e.g., VDN (Sunehag et al., 2017), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019), adopt centralized training with decentralized execution to address this problem. They write the joint value function as Q^π(s, a) = φ(s, Q_1(o_1, a_1), ..., Q_K(o_K, a_K)), where the form of φ differs across methods. These methods successfully use centralized training to alleviate the non-stationarity issue. However, none of them generalizes well to ad hoc team play, since the learned Q_i functions depend heavily on the existence of other agents.
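As a concrete illustration of the mixing formulation Q^π(s, a) = φ(s, Q_1, ..., Q_K) above, the following sketch uses the simplest choice of φ, VDN-style additive mixing, on toy utilities. All names and numbers here are illustrative, not the authors' implementation; it only shows why additive mixing makes decentralized greedy execution consistent with the centralized argmax:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Toy per-agent utilities: K agents, each with n_actions actions.
K, n_actions = 3, 4
per_agent_q = [rng.normal(size=n_actions) for _ in range(K)]

def vdn_joint_q(joint_action):
    # VDN's phi: the joint value is the sum of the chosen per-agent utilities.
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))

# Decentralized greedy execution: each agent argmaxes its own Q_i ...
greedy = tuple(int(np.argmax(q)) for q in per_agent_q)

# ... which coincides with the argmax of the joint Q under additive mixing.
best = max(product(range(n_actions), repeat=K), key=vdn_joint_q)
assert best == greedy
```

QMIX and QTRAN replace the plain sum with richer (monotonic or transformed) mixers, but preserve this argmax consistency.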

2. COLLABORATIVE MULTI-AGENT REWARD ASSIGNMENT

Basic Setting. A multi-agent extension of the Markov Decision Process, called collaborative partially observable Markov Games (Littman, 1994), is defined by a set of states S describing the possible configurations of all K agents, sets of possible actions A_1, ..., A_K, and sets of possible observations O_1, ..., O_K. At every step, each agent i chooses its action a_i according to a stochastic policy π_i : O_i × A_i → [0, 1]. The joint action a produces the next state via a transition function P : S × A_1 × ... × A_K → S. All agents share the same reward r : S × A_1 × ... × A_K → R, and the joint value function is Q^π = E_{s_{t+1:∞}, a_{t+1:∞}}[R_t | s_t, a_t], where R_t = Σ_{j=0}^∞ γ^j r_{t+j} is the discounted return. In Sec. 2.1, we first model multi-agent collaboration as a joint optimization over reward assignment: instead of acting based on the joint state s, each agent i acts independently on its own state s_i, following its own optimal value V_i, which is a function of the perceived reward assignment r_i. While the optimal perceived reward assignment r_i^*(s) depends on the joint state of all agents and requires centralization, in Sec. 2.2 we prove that there exists an approximately optimal solution r̂_i that only depends on the local observation s_i^local of agent i, thus enabling decentralized execution. Lastly, in Sec. 2.3, we distill these theoretical insights into a practical algorithm, CollaQ, by directly learning the compositional mapping s_i^local → r̂_i → V_i in an end-to-end fashion, while keeping the decomposition structure of self state and local observations.
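The discounted return R_t = Σ_{j=0}^∞ γ^j r_{t+j} defined above can be computed for a finite episode by backward accumulation (a minimal sketch; the truncation at episode end is an assumption for illustration):

```python
# Discounted return R_t = sum_{j>=0} gamma^j * r_{t+j}, truncated at episode end.
def discounted_return(rewards, gamma):
    R = 0.0
    for r in reversed(rewards):   # backward pass: R = r + gamma * R
        R = r + gamma * R
    return R

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # -> 1.75
```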

2.1. BASIC ASSUMPTION

A naive model of multi-agent collaboration is to estimate a joint value function V_joint := V_joint(s_1, s_2, ..., s_K) and find the best action for agent i to maximize V_joint given the current joint state s = (s_1, s_2, ..., s_K). However, this has three fundamental drawbacks: (1) V_joint generally requires an exponential number of samples to learn; (2) evaluating this function requires a full observation of the states of all agents, which disallows decentralized execution, a key desideratum of multi-agent RL; and (3) for any environment/team change (e.g., teaming with different agents), V_joint needs to be relearned for all agents, rendering ad hoc team play impossible. CollaQ addresses these three issues with a novel theoretical framework that decouples the interactions between agents. Instead of using V_joint, which bundles all the agent interactions together, we consider the underlying mechanism of how agents interact: in a fully collaborative setting, agent i takes actions towards a state not only because that state is rewarding to agent i, but also because it is more rewarding to agent i than to other agents in the team, from agent i's point of view. This is the concept of the perceived reward of agent i. Each agent then acts independently following its own value function V_i, which is the optimal solution to the Bellman equation conditioned on the assigned perceived reward, and is a function of it. This naturally leads to collaboration. We build a mathematical framework to model such behaviors. Specifically, we make the following assumption on the behavior of each agent:

Assumption 1. Each agent i has a perceived reward assignment r_i ∈ R_+^{|S_i||A_i|} that may depend on the joint state s = (s_1, ..., s_K). Agent i acts according to its own state s_i and the individual optimal value V_i = V_i(s_i; r_i) (and associated Q_i(s_i, a_i; r_i)), which is a function of r_i.
Note that the perceived reward assignment r_i ∈ R_+^{|S_i||A_i|} is a non-negative vector containing the assigned scalar reward at each state-action pair (hence its length is |S_i||A_i|). We may equivalently write it as a function r_i(x, a) : S_i × A_i → R, where x ∈ S_i and a ∈ A_i. Here x is a dummy variable that runs through all states of agent i, while s_i refers to its current state. Given the perceived reward assignments {r_i}, the values and actions of the agents become decoupled. Due to the fully collaborative nature of the task, a natural choice of {r_i} is the optimal solution of the following objective J(r_1, r_2, ..., r_K), where r_e is the external reward of the environment, w_i ≥ 0 is the preference of agent i, and ⊙ is the Hadamard (element-wise) product:

J(r_1, ..., r_K) := Σ_{i=1}^K V_i(s_i; r_i)   s.t.   Σ_{i=1}^K w_i ⊙ r_i ≤ r_e    (1)

Note that the constraint ensures that the objective has a bounded solution. Without this constraint, we could take each perceived reward r_i to +∞, since each value function V_i(s_i; r_i) increases monotonically with respect to r_i. Intuitively, Eqn. 1 means that we "assign" the external reward r_e optimally to the K agents as perceived rewards, so that their overall values are the highest. In the case of sparse reward, r_e(x, a) = 0 for most state-action pairs (x, a); by Eqn. 1, the perceived reward r_i(x, a) = 0 at those pairs for all agents i. We can therefore focus on the nonzero entries of each r_i. Define M to be the number of state-action pairs with positive reward: M = Σ_{x, a} 1{r_e(x, a) > 0}. Discarding zero entries, we can regard each r_i as an M-dimensional vector. Finally, we define the reward matrix R = [r_1, ..., r_K] ∈ R^{M×K}.

Clarification on Rewards. There are two kinds of rewards here: the external reward r_e and the perceived reward r_i of each agent. r_e is the environmental reward shared by all agents: r_e : S × A_1 × ... × A_K → R.
Given this external reward and a specific reward assignment, each agent receives a perceived reward r_i that drives its behavior. If the reward assignment is properly defined/optimized, then all agents acting on their perceived rewards jointly optimize (maximize) the shared external reward.

2.2. LEARN TO PREDICT THE OPTIMAL ASSIGNED REWARD r_i^*(s)

The optimal reward assignment R^* of Eqn. 1, as well as its i-th column r_i^*, is a function of the joint state s = (s_1, s_2, ..., s_K). Once the optimization is done, each agent can take the best action a_i^* = argmax_{a_i} Q_i(s_i, a_i; r_i^*(s)) independently from the reconstructed Q-function. The formulation V_i(s_i; r_i) avoids learning value functions over statistically infeasible joint states. Since an agent acts solely based on r_i, ad hoc team play becomes possible if the correct r_i is assigned. However, issues remain. First, since each V_i is a convex function of r_i, maximizing Eqn. 1 means maximizing a sum of convex functions under linear constraints, which is computationally hard. Furthermore, to obtain actions for each agent, we would need to solve Eqn. 1 at every step, which still requires centralization at test time and prevents decentralized execution. To overcome the optimization complexity and enable decentralized execution, we consider learning a direct mapping from the joint state s to the optimally assigned reward r_i^*(s). However, since s is a joint state, learning such a mapping can be as hard as modeling V_i(s). Fortunately, V_i(s_i; r_i(s)) is not an arbitrary function, but an optimal value function satisfying the Bellman equation. Thanks to this structure, we can find an approximate assignment r̂_i for each agent i such that r̂_i only depends on a local observation s_i^local of the states of nearby agents observed by agent i: r̂_i(s) = r̂_i(s_i^local).
At the same time, these approximate reward assignments {r̂_i} achieve approximate optimality for the joint optimization (Eqn. 1) with bounded error:

Theorem 1. For all i ∈ {1, ..., K} and all s_i ∈ S_i, there exists a reward assignment r̂_i that (1) only depends on s_i^local, and (2) is the i-th column of a feasible global reward assignment R̂ such that J(R̂) ≥ J(R^*) − (γ^C + γ^D) R_max M K, where C and D are constants related to the distances between agents/rewards (details in the Appendix).

Since r̂_i only depends on the local observation of agent i (i.e., the agent's own state s_i as well as the states of nearby agents), it enables decentralized execution: for each agent i, the local observation is sufficient to act near-optimally.

Limitation. One limitation of Theorem 1 is that the optimality gap of r̂_i depends heavily on the size of s_i^local. If the local observation of agent i covers more agents, the gap is smaller, but the cost of learning such a mapping is higher, since the mapping has more input states and becomes higher-dimensional. In practice, we found that letting the observation o_i of agent i cover s_i^local works sufficiently well, as shown in the experiments (Sec. 4).
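The reward-assignment objective of Eqn. 1 can be illustrated on a deliberately tiny problem. The sketch below (an assumption-laden toy, not the paper's solver) puts two agents on a 1-D chain, computes each V_i(s_i; r_i) by value iteration, and enumerates hard 0/1 assignments of two reward sites, i.e., feasible points of the constraint Σ_i r_i ≤ r_e with w_i = 1. The best assignment sends each agent to the site it can reach fastest:

```python
import itertools
import numpy as np

N, gamma = 7, 0.9                 # chain of N states, discount factor
sites, r_e = [0, 6], [1.0, 1.0]   # two reward sites with external reward 1
agent_pos = [1, 5]                # current states s_1, s_2 of the two agents

def value(start, reward_vec):
    # Value iteration on the chain (actions: left / stay / right).
    V = np.zeros(N)
    for _ in range(200):
        Vn = np.empty(N)
        for s in range(N):
            moves = [max(s - 1, 0), s, min(s + 1, N - 1)]
            Vn[s] = max(reward_vec[s2] + gamma * V[s2] for s2 in moves)
        V = Vn
    return V[start]

best_J, best_assign = -np.inf, None
for assign in itertools.product(range(2), repeat=len(sites)):
    # assign[m] = index of the agent that receives reward site m
    r = np.zeros((2, N))
    for m, i in enumerate(assign):
        r[i][sites[m]] = r_e[m]
    J = sum(value(agent_pos[i], r[i]) for i in range(2))
    if J > best_J:
        best_J, best_assign = J, assign

print(best_assign)  # -> (0, 1): site 0 to the agent at 1, site 6 to the agent at 5
```

The enumeration is exponential in the number of sites, which is exactly the computational hardness the theorem and the learned mapping r̂_i(s_i^local) sidestep.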

2.3. COLLABORATIVE Q-LEARNING (COLLAQ)

While Theorem 1 shows the existence of a perceived reward r̂_i = r̂_i(s_i^local) with good properties, learning r̂_i(s_i^local) is not a trivial task. Learning it in a supervised manner requires (close to) optimal assignments as labels, which in turn requires solving Eqn. 1. Instead, we resort to end-to-end learning of Q_i for each agent i with a decomposition structure inspired by the theory above. To see this, we expand the Q-function of agent i, Q_i = Q_i(s_i, a_i; r̂_i), with respect to its perceived reward, using a Taylor expansion at the ground-zero reward r_{0i} = r_i(s_i), which is the perceived reward when only agent i is present in the environment:

Q_i(s_i, a_i; r̂_i) = Q_i(s_i, a_i; r_{0i}) + ∇_r Q_i(s_i, a_i; r_{0i}) · (r̂_i − r_{0i}) + O(‖r̂_i − r_{0i}‖²)

The first term, Q_i(s_i, a_i; r_{0i}), is the alone policy of agent i; we name it Q_i^alone(s_i, a_i) since it operates as if no other agents exist. The remaining terms are collectively called Q_i^collab(s_i^local, a_i), which models the interaction among agents via the perceived reward r̂_i. Both Q^alone and Q^collab are neural networks. Thanks to Theorem 1, we only need to feed the local observation o_i := s_i^local of agent i, which contains the observations of W < K local agents (Fig. 1), to obtain an approximately optimal Q_i. The overall Q_i is then computed by a simple addition (here o_i^alone := s_i is the individual state of agent i):

Q_i(o_i, a_i) = Q_i^alone(o_i^alone, a_i) + Q_i^collab(o_i, a_i)

Multi-Agent Reward Attribution (MARA) Loss. With a simple addition, the decomposition into Q_i^alone and Q_i^collab is not unique: indeed, we could add any constant to Q^alone and subtract the same constant from Q^collab to yield the same overall Q_i. However, according to Eqn.
3, there is an additional constraint: if o_i = o_i^alone, then r̂_i = r_{0i} and Q_i^collab(o_i^alone, a_i) ≡ 0. This leads to the objective:

L = E_{s_i, a_i ∼ ρ(·)} [ (y − Q_i(o_i, a_i))²  (DQN objective)  +  α (Q_i^collab(o_i^alone, a_i))²  (MARA objective) ]

where the hyper-parameter α determines the relative importance of the MARA objective against the DQN objective. We use a soft-constraint version of the MARA loss and observe that it substantially stabilizes training. To train multiple agents together, we follow QMIX: we feed the outputs {Q_i} into a top network and train end-to-end in a centralized fashion. CollaQ has advantages over standard Q-learning: since Q_i^alone only takes o_i^alone, whose dimension is independent of the number of agents, this term can be learned exponentially faster than Q_i^collab. Thus, CollaQ enjoys a much faster learning speed, as shown in Fig. 5, Fig. 6 and Fig. 7.

Attention-based Architecture. In practice, we apply attention over agent i's own state and other agents' states in the field of view of agent i. This is because the observation o_i can be spatially large and cover agents whose states contribute little to agent i's action; the effective s_i^local is smaller than o_i. Our architecture is similar to EPC (Long et al., 2020), except that we use a transformer architecture (stacking multiple layers of attention modules). As shown in the experiments, this helps improve performance in various StarCraft settings.

Intuition of CollaQ and Connection to the Theory. The intuitive explanation of CollaQ and the MARA loss is that when the agent cannot see others (i.e., other agents have no influence on it), the Q-value Q_i should equal the individual Q-value Q_i^alone. This can be interpreted through two equivalent statements: (1) the problem decomposes well into local sub-problems; (2) the existence of other agents does not influence the Q-value of the particular agent. The MARA loss eliminates the ambiguity in the decomposition. The semantic meanings of Q_i^alone and Q_i are shown in Fig. 3.
The intuition connects to the theory: Theorem 1 shows that under some mild assumptions, the CollaQ objective can be viewed as a sub-optimal solution to an optimization problem on reward assignment, so each component of CollaQ and the MARA loss is well-justified. Although the problem defined in Eqn. 1 is hard to optimize, the empirical success of CollaQ demonstrates the effectiveness of this view. The theory serves primarily as inspiration for the practical algorithm; we leave the analysis of the gap between exact optimization and CollaQ to future work.
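The decomposition Q_i = Q_i^alone + Q_i^collab and the MARA penalty can be sketched in a few lines of numpy. The shapes, the linear stand-ins for the two networks, and the zero-padding convention for the "no teammates visible" observation are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: self-state, full local observation, action count.
d_self, d_obs, n_actions, alpha = 4, 12, 5, 1.0
W_alone = rng.normal(scale=0.1, size=(d_self, n_actions))   # stand-in for Q_i^alone net
W_collab = rng.normal(scale=0.1, size=(d_obs, n_actions))   # stand-in for Q_i^collab net

def q_decomposed(o_alone, o_full):
    q_alone = o_alone @ W_alone    # depends on agent i's own state only
    q_collab = o_full @ W_collab   # depends on the full local observation
    return q_alone + q_collab, q_collab

def collaq_loss(o_alone, o_full, action, y):
    q, _ = q_decomposed(o_alone, o_full)
    # Observation with no visible teammates (o_i = o_i^alone): here the
    # teammate slots are simply zeroed out (an assumed encoding).
    o_solo = np.concatenate([o_alone, np.zeros(d_obs - d_self)])
    _, q_collab_solo = q_decomposed(o_alone, o_solo)
    dqn_term = (y - q[action]) ** 2           # standard TD objective
    mara_term = q_collab_solo[action] ** 2    # pushes Q^collab(o^alone, .) toward 0
    return dqn_term + alpha * mara_term

o_alone = rng.normal(size=d_self)
o_full = np.concatenate([o_alone, rng.normal(size=d_obs - d_self)])
loss = collaq_loss(o_alone, o_full, action=2, y=1.0)
```

Without the `mara_term`, any constant could be shifted between the two heads; the soft penalty pins Q^collab to zero whenever the agent is effectively alone, which is what gives the two heads their semantics.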

3. EXPERIMENTS ON RESOURCE COLLECTION

In this section, we demonstrate the effectiveness of CollaQ in a toy gridworld environment where the states are fully observable. We also visualize the trained policies Q_i and Q_i^alone.

Ad hoc Resource Collection.

We demonstrate CollaQ in a toy example where multiple agents collaboratively collect resources in a grid world to maximize the aggregated team reward. In this setup, the same type of resource can return different rewards depending on the type of agent that collects it. The reward setup is randomly initialized at the beginning of each episode and can be seen by all agents. The game ends when all resources are collected. An agent is the expert for a certain resource if it gets the highest reward in the team for collecting that resource. Consequently, to maximize the shared team reward, the optimal strategy is to let each expert collect its corresponding resource. For testing, we devise the following reward setup: we have apples and lemons as resources and N agents. For picking a lemon, agent 1 receives the highest reward for the team, agent 2 the second highest, and so on. For apples, the reward assignment is reversed (agent N gets the highest reward, agent N−1 the second highest, ...). This specific reward setup is excluded from the training environments. It is a very hard ad hoc team play task at test time, since the agents need to demonstrate completely different behaviors from training time to achieve a higher team reward. The left plot in Fig. 2 shows the training reward and the right one shows ad hoc team play; we train with 5 agents in this setting. CollaQ outperforms IQL in both training and testing. In this example, random actions already work reasonably well, so any improvement over them is substantial.

Visualization of Q_i^alone and Q_i. In Fig. 3, we visualize the trained Q_i^alone and Q_i (the overall policy for agent i) to show how Q_i^collab affects the behavior of each agent. The policies Q_i^alone and Q_i learned by CollaQ are both meaningful: Q_i^alone is the simple strategy of collecting the nearest resource (the optimal policy when the agent is the only one acting in the environment), and Q_i is the optimal policy described above.
The leftmost column in Fig. 3 shows the reward setup for different agents collecting different resources (e.g., the red agent gets 4 points for collecting a lemon and 10 points for collecting an apple). The red agent specializes in collecting apples and the yellow one in collecting lemons. In a), Q_i^alone directs both agents to collect the nearest resource. However, neither agent is the expert on its nearest resource. Therefore, Q_i^collab alters the decision of Q_i^alone, directing Q_i towards the resource with the highest return. This behavior is also observed in c) with a different resource placement. b) shows the scenario where both agents are the experts on their nearest resources; Q_i^collab reinforces the decision of Q_i^alone, making Q_i point to the same resource as Q_i^alone.
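The test-time reward setup described in this section (lemon rewards falling with agent index, apple rewards rising) can be made concrete with a toy check that the expert assignment maximizes team reward. The specific numbers below are illustrative, not the paper's values:

```python
# Toy version of the ad hoc test setup: agent 0 is the lemon expert,
# agent N-1 the apple expert (illustrative rewards).
N = 5
reward = {i: {"lemon": N - i, "apple": i + 1} for i in range(N)}  # agents 0..N-1

def team_reward(assignment):
    """assignment: resource -> index of the collecting agent."""
    return sum(reward[agent][res] for res, agent in assignment.items())

experts = {"lemon": 0, "apple": N - 1}   # each expert takes its own resource
swapped = {"lemon": N - 1, "apple": 0}   # the anti-expert assignment
print(team_reward(experts), team_reward(swapped))  # -> 10 2
```

At training time the rankings are randomized per episode, so an agent cannot memorize "I collect lemons"; it must read the reward setup and route itself to the resource it is currently the expert for.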

4. EXPERIMENTS ON STARCRAFT MULTI-AGENT CHALLENGE

The StarCraft multi-agent challenge (Samvelyan et al., 2019) is a widely used benchmark for MARL evaluation. The task is to manage a team of units (each unit controlled by an agent) to defeat a team controlled by built-in AI. While this task has been extensively studied in previous works, the performance of agents trained by SoTA methods (e.g., QMIX) deteriorates under a slight modification of the environment in which the agent IDs are changed. The SoTA methods severely overfit to the precise environment and thus cannot generalize to ad hoc team play. In contrast, CollaQ shows better performance in the presence of random agent IDs, generalizes significantly better to more diverse test environments (e.g., adding/swapping/removing a unit at test time), and is more robust in ad hoc team play.

4.1. ISSUES IN THE CURRENT BENCHMARK

In the default StarCraft multi-agent environment, the ID of each agent never changes. A trained agent can thus memorize what to do based on its ID instead of dynamically figuring out the role of its unit during play. As illustrated in Fig. 4, if we randomly shuffle the IDs of the agents at test time, the performance of QMIX degrades substantially. In some cases (e.g., 8m_vs_9m), the win rate drops from 95% to 50%, a deterioration of more than 40%. The results show that QMIX relies on this extra information (the order of agents) and consequently overfits to the exact setting, making it less robust in ad hoc team play. Introducing randomly shuffled agent IDs at training time addresses this issue for QMIX, as illustrated in Fig. 4; with an attention model, the performance is even stronger. Trained CollaQ agents demonstrate interesting behaviors. On MMM2: (1) the Medivac dropship only heals the unit under attack, and (2) damaged units move backward to avoid focused fire from the opponent, while healthy units move forward to absorb fire. In comparison, QMIX only learns (1), and it is not obvious that (2) was learned. On 2c_vs_64zg, CollaQ learns to focus fire on one side of the attack to clear one of the corridors; it also retreats along that corridor while attacking, whereas agents trained by QMIX do not. See Appendix D for more video snapshots.
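The random-ID augmentation discussed above amounts to permuting the per-agent slots of each training sample so that no policy can key on a fixed agent index. A minimal sketch, assuming a hypothetical flat (K, d) per-agent feature layout:

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_agent_ids(obs, acts, rng):
    """Permute agent slots jointly.

    obs: (K, d) per-agent states; acts: (K,) matching per-agent actions.
    """
    perm = rng.permutation(obs.shape[0])
    return obs[perm], acts[perm]

obs = np.arange(12, dtype=float).reshape(4, 3)   # 4 agents, 3 features each
acts = np.array([0, 1, 2, 3])
obs2, acts2 = shuffle_agent_ids(obs, acts, rng)
# Each agent's state row travels with its action; only the slot order changes.
assert all(np.array_equal(obs2[k], obs[acts2[k]]) for k in range(4))
```

An attention/transformer encoder over the agent slots is then permutation-equivariant by construction, which is why combining the two works even better.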

4.3. AD HOC TEAM WORK

We now demonstrate that, in addition to handling random IDs, CollaQ is robust to changes in agent configuration and/or priority at test time, i.e., ad hoc team play.

Different VIP agent. In this setting, the team gets an additional reward if the VIP agent is alive after winning the battle. The VIP agent is randomly selected from agents 1 to N−1 during training; at test time, agent N becomes the VIP, a setup never seen in training. Fig. 6 shows the VIP agent survival rate at test time. CollaQ outperforms QMIX by 10%-32%. CollaQ also learns the behavior of protecting the VIP: when the team is about to win, the VIP agent is covered by other agents to avoid being attacked. Such behavior is not clearly shown by QMIX given the same objective.

Swap / Add / Remove different units. We also test ad hoc team play in three harder settings: swapping the agent type, and adding or removing one agent at test time. From Fig. 7, we can see that CollaQ generalizes better to these ad hoc test settings. Note that, to deal with the changing number of agents at test time, all methods (QMIX, QTRAN, VDN, IQL, and CollaQ) are augmented with attention-based neural architectures for a fair comparison. CollaQ outperforms QMIX, the second best, by 9.21% on swapping, 14.69% on removing, and 8.28% on adding agents.

4.4. ABLATION STUDY

We further verify CollaQ in an ablation study. First, we show that CollaQ outperforms a baseline (SumTwoNets) that simply sums two networks, each taking the agent's full observation as input. SumTwoNets does not distinguish between Q^alone (which only takes s_i as input) and Q^collab (which respects the condition Q^collab(s_i, ·) = 0). Second, we show that the MARA loss is indeed critical for the performance of CollaQ. We compare our method with SumTwoNets trained with QMIX for each agent; the baseline has a similar parameter count to CollaQ. As shown in Fig. 8, CollaQ improves win rates over SumTwoNets trained with QMIX by 17%-47% on hard scenarios. We also study the importance of the MARA loss by removing it from CollaQ: using the MARA loss boosts performance by 14%-39% on hard scenarios, consistent with the decomposition proposed in Sec. 2.3.

5. RELATED WORK

Multi-agent reinforcement learning (MARL) has been studied since the 1990s (Tan, 1993; Littman, 1994; Bu et al., 2008). Recent progress in deep reinforcement learning has given rise to an increasing effort in designing general-purpose deep MARL algorithms (including COMA (Foerster et al., 2018), MADDPG (Lowe et al., 2017), MAPPO (Berner et al., 2019), PBT (Jaderberg et al., 2019), MAAC (Iqbal and Sha, 2018), etc.) for complex multi-agent games. We utilize the Q-learning framework and consider collaborative tasks in strategic games. Other works focus on different aspects of the collaborative MARL setting, such as learning to communicate (Foerster et al., 2016; Sukhbaatar et al., 2016; Mordatch and Abbeel, 2018), robotic manipulation (Chitnis et al., 2019), traffic control (Vinitsky et al., 2018), social dilemmas (Leibo et al., 2017), etc. The problem of ad hoc team play in multi-agent cooperative games was raised in the early 2000s (Bowling and McCracken, 2005; Stone et al., 2010) and is mostly studied in the robotic soccer domain (Hausknecht et al., 2016). Most works (Barrett and Stone, 2015; Barrett et al., 2012; Chakraborty and Stone, 2013; Woodward et al., 2019) either require sophisticated online learning at test time or strong domain knowledge of possible teammates, which poses significant limitations in complex real-world situations. In contrast, our framework achieves zero-shot generalization and requires few changes to existing MARL training. Other works consider a much simplified ad hoc teamwork setting with a varying number of test-time homogeneous agents (Schwab et al., 2018; Long et al., 2020), while our method handles more general scenarios. Previous work on generalization/robustness in MARL typically considers a competitive setting and aims to learn policies that generalize to different test-time opponents.
Popular techniques include meta-learning for adaptation (Al-Shedivat et al., 2017), adversarial training (Li et al., 2019), Bayesian inference (He et al., 2016; Shen and How, 2019; Serrino et al., 2019), symmetry breaking (Hu et al., 2020), learning Nash equilibrium strategies (Lanctot et al., 2017; Brown and Sandholm, 2019), and population-based training (Vinyals et al., 2019; Long et al., 2020; Canaan et al., 2020). Population-based algorithms use ad hoc team play as a training component with the overall objective of improving opponent generalization, whereas we consider zero-shot generalization to different teammates at test time. Our work is also related to hierarchical approaches for multi-agent collaborative tasks (Shu and Tian, 2019; Carion et al., 2019; Yang et al., 2020), which train a centralized manager to assign subtasks to individual workers and can generalize to new workers at test time. However, these works assume known worker types or policies, which is infeasible for complex tasks; our method makes no such assumptions and can be easily trained end-to-end. There have also been efforts to decompose the observation space through individual networks: ASN (Wang et al., 2019) decomposes the observation space of each agent to capture the semantic meaning of actions, DyAN (Wang et al., 2020) adopts a similar architecture in a curriculum domain, and EPC (Long et al., 2020) uses attention between individual agents to make the network structure invariant to the number of agents. While the network structure of CollaQ shares some similarity with these works, the semantic meaning of each component is different: CollaQ models the interaction between agents using an alone network and an attention-based collaborative network, one modeling self-interested solutions and the other the influence of other agents on the particular agent.
Several papers also discuss social dilemmas in a multi-agent setting (Leibo et al., 2017; Rapoport, 1974; Van Lange et al., 2013), and several reinforcement learning works address problems such as the prisoner's dilemma (Sandholm and Crites, 1996; de Cote et al., 2006; Wunder et al., 2010). However, in our setting all agents share the same environmental reward, so the optimal solution for all agents is to jointly optimize the shared reward. SSD (Jaques et al., 2019) gives an agent an extra intrinsic reward when its action has a large influence on others; CollaQ does not use any intrinsic reward. Lastly, our mathematical formulation is related to the credit assignment problem in RL (Sutton, 1985; Foerster et al., 2018; Nguyen et al., 2018); some reward-shaping literature also falls into this category (Devlin et al., 2014; Devlin and Kudenko, 2012). However, our approach does not calculate any explicit reward assignment: we distill the theoretical insight into a simple yet effective learning objective.

We show the exact win rates for all maps and settings mentioned in the StarCraft Multi-Agent Challenge. From Tab. 1, we can clearly see that CollaQ improves on the previous SoTA by a large margin. We also measure the margin of winning, i.e., how many units survive after winning the battle; the experiments are repeated over 128 random seeds. CollaQ surpasses QMIX by over 2 surviving units on average (Tab. 2), a large gain.

E VIDEOS AND VISUALIZATIONS OF STARCRAFT MULTI-AGENT CHALLENGE

We extract several video frames from the replays of CollaQ's agents for better visualization. In addition, we provide the full replays of QMIX and CollaQ. CollaQ's agents demonstrate interesting behaviors such as healing agents under attack, dragging back unhealthy agents, and protecting the VIP agent (in the ad hoc team play setting with different VIP agents). The visualizations and videos are available at https://sites.google.com/view/collaq-starcraft

F PROOF AND LEMMAS

Lemma 1. If $a_1' \ge a_1$, then $0 \le \max(a_1', a_2) - \max(a_1, a_2) \le a_1' - a_1$.

Proof. Note that $\max(a_1, a_2) = \frac{a_1 + a_2}{2} + \frac{|a_1 - a_2|}{2}$. So we have:

$$\max(a_1', a_2) - \max(a_1, a_2) = \frac{a_1' - a_1}{2} + \frac{|a_1' - a_2|}{2} - \frac{|a_1 - a_2|}{2} \le \frac{a_1' - a_1}{2} + \frac{a_1' - a_1}{2} = a_1' - a_1$$

The lower bound follows since $\max$ is non-decreasing in each argument.

F.1 LEMMAS

Lemma 2. Consider a Markov Decision Process with finite horizon $H$ and discount factor $\gamma < 1$. For all $i \in \{1, \ldots, K\}$, all $r_1, r_2 \in \mathbb{R}^M$, and all $s_i \in S_i$, we have:

$$|V_i(s_i; r_1) - V_i(s_i; r_2)| \le \sum_{x,a} \gamma^{|s_i - x|} |r_1(x, a) - r_2(x, a)|$$

where $|s_i - x|$ is the number of steps needed to move from $s_i$ to $x$.

Proof. By definition, the optimal value function $V_i$ of agent $i$ satisfies the following Bellman equation:

$$V_i(x_h; r_i) = \max_{a_h} \left[ r_i(x_h, a_h) + \gamma \mathbb{E}_{x_{h+1}|x_h, a_h}[V_i(x_{h+1})] \right]$$

Note that to avoid confusion between the agents' initial states $s = \{s_1, \ldots, s_K\}$ and the reward at a state-action pair, we write state-action pairs as $(x, a)$. For a terminal node $x_H$, which exists since the MDP has finite horizon $H$, $V_i(x_H) = r_i(x_H)$. The current state $s_i$ is at step 0 (i.e., $x_0 = s_i$).

We first consider the case where $r_1$ and $r_2$ differ only at a single state-action pair $(x_h^0, a_h^0)$ with $h \le H$. Without loss of generality, assume $r_1(x_h^0, a_h^0) > r_2(x_h^0, a_h^0)$. Since the MDP has finite horizon, $V_i(x_{h'}; r_1) = V_i(x_{h'}; r_2)$ for all $h' > h$.

By the property of the max function (Lemma 1), we have:

$$0 \le V_i(x_h^0; r_1) - V_i(x_h^0; r_2) \le r_1(x_h^0, a_h^0) - r_2(x_h^0, a_h^0) \quad (11)$$

Since $p(x_h^0 | x_{h-1}, a_{h-1}) \le 1$ for any $(x_{h-1}, a_{h-1})$ at step $h-1$, we have:

$$0 \le \gamma \left( \mathbb{E}_{x_h|x_{h-1}, a_{h-1}}[V_i(x_h; r_1)] - \mathbb{E}_{x_h|x_{h-1}, a_{h-1}}[V_i(x_h; r_2)] \right) \le \gamma \left( r_1(x_h^0, a_h^0) - r_2(x_h^0, a_h^0) \right) \quad (12)$$

Applying Lemma 1 and noting that all other rewards are unchanged, we have:

$$0 \le V_i(x_{h-1}; r_1) - V_i(x_{h-1}; r_2) \le \gamma \left( r_1(x_h^0, a_h^0) - r_2(x_h^0, a_h^0) \right)$$

Iterating this argument down to step 0, we obtain:

$$0 \le V_i(s_i; r_1) - V_i(s_i; r_2) \le \gamma^h \left( r_1(x_h^0, a_h^0) - r_2(x_h^0, a_h^0) \right)$$

The case $r_1(x_h^0, a_h^0) < r_2(x_h^0, a_h^0)$ is symmetric; therefore:

$$|V_i(s_i; r_1) - V_i(s_i; r_2)| \le \gamma^h |r_1(x_h^0, a_h^0) - r_2(x_h^0, a_h^0)|$$

where $h = |x_h^0 - s_i|$ is the distance between $s_i$ and $x_h^0$. Now consider general $r_1 \ne r_2$. We can design a path $\{r_t\}$ from $r_1$ to $r_2$ that changes a single distinct reward entry at each step. Each $(x, a)$ pair then appears at most once along the path, and summing the single-pair bound over the path yields the claim.

Then we have:

$$|V_i(s_i; r_i) - V_i(s_i; \tilde{r}_i)| \le \gamma^C R_{\max} M \quad (20)$$

where $M$ is the total number of sparse reward sites and $R_{\max}$ is the maximal reward that can be assigned at each reward site $x$ while satisfying the constraint $\phi(r_1(x, a), r_2(x, a), \ldots, r_K(x, a)) \le 0$. Note that "sparse reward site" is important here; otherwise there could be exponentially many sites $x \notin S_i^{local}$ and Eqn. 23 would become vacuous.

Then we prove the theorem.

Proof. Given a constant $C$, for each agent $i$, we define the vicinity reward sites $B_i(C) := \{x : |x - s_i| \le C\}$. Given agent $i$ and its local "buddies" $s_i^{local}$ (a subset of agent indices), we then construct a few reward assignments (see Fig. 11), given the current agent states $s = \{s_1, s_2, \ldots, s_K\}$. For brevity, we write $R[M, s]$ for the submatrix of the reward assignment restricted to reward site set $M$ and agent set $s$.

• The optimal solution $R^*$ for Eqn. 1.
• The perturbed optimal solution $\tilde{R}^*$, obtained by moving the reward assigned to $s_i^{remote}$ over to $s_i^{local}$.



which eliminates such ambiguity. For this, we add the Multi-Agent Reward Attribution (MARA) loss.

Overall Training Paradigm. For agent $i$, we use standard DQN training with the MARA loss. Defining the target Q-value $y = \mathbb{E}_{s' \sim \mathcal{E}}[r + \gamma \max_{a'} Q_i(o', a') \mid s, a]$, the overall training objective is:
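The overall objective combines a standard DQN TD error with the MARA loss. A minimal numpy sketch of one plausible instantiation is below; all names are hypothetical, the MARA term is modeled here as an L2 penalty pushing $Q^{collab}(o^{alone}_i, \cdot)$ toward zero, and $\alpha$ is assumed to weight that term:

```python
import numpy as np

def collaq_loss(q_alone, q_collab_full, q_collab_alone, action, y, alpha=1.0):
    """Per-agent CollaQ objective (sketch, hypothetical names).

    q_alone:        Q^alone_i(o^alone_i, .)   -- shape (n_actions,)
    q_collab_full:  Q^collab_i(o_i, .)        -- shape (n_actions,)
    q_collab_alone: Q^collab_i(o^alone_i, .)  -- shape (n_actions,)
    action:         action taken in the transition
    y:              target Q-value r + gamma * max_a' Q_i(o', a')
    alpha:          assumed weight of the MARA regularizer
    """
    q_i = q_alone + q_collab_full            # Q_i = Q^alone + Q^collab
    td_loss = (y - q_i[action]) ** 2         # standard DQN TD error
    mara_loss = np.sum(q_collab_alone ** 2)  # push Q^collab(o^alone, .) toward 0
    return td_loss + alpha * mara_loss

# toy usage with made-up Q-values
q_a = np.array([1.0, 0.5])
q_cf = np.array([0.2, -0.1])
q_ca = np.array([0.05, 0.0])
loss = collaq_loss(q_a, q_cf, q_ca, action=0, y=1.0, alpha=1.0)
```

The regularizer is what keeps the decomposition identifiable: without it, any constant could be shifted between the two terms without changing $Q_i$.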

Figure 1: Architecture of the network. We use a standard DRQN architecture on $o^{alone}_i$ and an attention-based model for $Q^{collab}$. The attention layers take the encoded inputs from all agents and output an attention embedding.

Figure 2: Results in resource collection. CollaQ (green) produces much higher rewards in both training and ad hoc team play than IQL (orange).

Figure 3: Visualization of $Q^{alone}_i$ and $Q_i$ in resource collection. The reward setup is shown in the leftmost column. Interesting behaviors emerge: in b), $Q^{collab}_i$

Figure 4: QMIX overfits to agent IDs. Introducing random agent IDs at test time greatly affects performance.

Figure 5: Results on standard StarCraft benchmarks with random agent IDs. CollaQ (without Attn and with Attn) clearly surpasses the previous SoTAs. The attention-based model further improves the win rates on all maps except 2c_vs_64zg, which has only 2 agents, so attention may not bring enough benefit.

Figure 6: Results for StarCraft ad hoc team play with a different VIP agent. At test time, CollaQ has a substantially higher VIP survival rate than QMIX. The attention-based model further boosts the survival rate.

Figure 7: Ad hoc team play with: a) swapping, b) adding, and c) removing a unit at test time. CollaQ outperforms QMIX and other methods substantially in all three settings.

Figure 9: Results for resource collection. Adding the attention-based model to CollaQ introduces a larger variance, so the performance is slightly worse. QMIX does not perform well in this setting.

Summing the single-pair bound of Lemma 2 along the path $\{r_t\}$ gives:

$$|V_i(s_i; r_1) - V_i(s_i; r_2)| \le \sum_t |V_i(s_i; r_{t-1}) - V_i(s_i; r_t)| \quad (17)$$
$$\le \sum_{x,a} \gamma^{|x - s_i|} |r_1(x, a) - r_2(x, a)| \quad (18)$$

F.2 THM. 1

First we prove the following lemma:

Lemma 3. For any reward assignment $r_i$ for agent $i$ in the optimization problem (Eqn. 1) and a local reward set $M_i^{local} \supseteq \{x : |x - s_i| \le C\}$, construct $\tilde{r}_i$ as follows:

$$\tilde{r}_i(x, a) = \begin{cases} r_i(x, a) & x \in M_i^{local} \\ 0 & \text{otherwise} \end{cases}$$
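Lemma 2's sensitivity bound can be sanity-checked numerically. The sketch below runs backward induction on a toy forward-moving chain MDP, so each reward site is visited at most once, mirroring the time-indexed states $x_h$ in the proof; the environment and all names are illustrative, not from the paper:

```python
import math
import numpy as np

def optimal_value(rewards, gamma, horizon):
    """Backward induction on a chain MDP; actions advance +1 or +2 (clipped)."""
    n = len(rewards)
    v = rewards.astype(float).copy()          # terminal step: V = r
    for _ in range(horizon):
        nxt = np.empty(n)
        for st in range(n):
            succ = [min(st + 1, n - 1), min(st + 2, n - 1)]
            nxt[st] = rewards[st] + gamma * max(v[t] for t in succ)
        v = nxt
    return v

gamma, horizon, n = 0.9, 12, 8
r1 = np.zeros(n)
r2 = np.zeros(n)
r2[5] = 1.0                                    # rewards differ at a single site x = 5
v1 = optimal_value(r1, gamma, horizon)
v2 = optimal_value(r2, gamma, horizon)

s = 2
dist = math.ceil((5 - s) / 2)                  # fewest steps from s to the site x
# Lemma 2 (single-site case): |V(s; r1) - V(s; r2)| <= gamma^|s - x| * |r1 - r2|
assert abs(v2[s] - v1[s]) <= gamma ** dist * 1.0 + 1e-9
```

In this toy instance the bound is tight: the optimal policy reaches the perturbed site in the fewest possible steps, so the value gap equals $\gamma^{|s - x|}$ exactly.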

Figure 11: Different reward assignments.

(a subset of agent indices), we construct the corresponding reward site set $M_i^{local}$: the larger $D$ is, the larger the distance between $s_i$ and the reward sites relevant to remote agents, and the tighter the bound. There is a trade-off between $C$ and $D$: the larger the vicinity, the more $M_i^{local}$ expands and the smaller $D$ becomes.

• From $R^*$, we get $R^*_0$ by setting the region $[M_i^{remote}, s_i^{local}]$ to zero.
• The local optimal solution $R^*_{local}$, which depends only on $s_i^{local}$. It is obtained by setting $[:, s_i^{remote}]$ to zero and optimizing Eqn. 1.
• From $R^*_{local}$, we get $R^*_{local(0)}$ by setting $[M_i^{remote}, s_i^{local}]$ to zero.
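The submatrix operations used to build $R^*_0$ and $R^*_{local(0)}$ amount to zeroing index blocks of the reward-assignment matrix. A small numpy illustration with a hypothetical $4 \times 3$ assignment (rows are reward sites, columns are agents; all index choices are made up for the example):

```python
import numpy as np

# Hypothetical reward assignment: rows = reward sites, columns = agents.
R = np.arange(1.0, 13.0).reshape(4, 3)

local_sites = [0, 1]      # M^local_i : sites within the vicinity B_i(C)
remote_sites = [2, 3]     # M^remote_i
local_agents = [0, 1]     # s^local_i : agent i and its nearby "buddies"
remote_agents = [2]       # s^remote_i

# R*_0: zero out the block [M^remote_i, s^local_i]
R0 = R.copy()
R0[np.ix_(remote_sites, local_agents)] = 0.0

# R*_local(0): additionally zero out [:, s^remote_i]
R_local0 = R0.copy()
R_local0[:, remote_agents] = 0.0

# the top-right and bottom-left sub-blocks are now zero, as used in the proof
assert np.all(R_local0[np.ix_(remote_sites, local_agents)] == 0.0)
assert np.all(R_local0[np.ix_(local_sites, remote_agents)] == 0.0)
```

Zeroing blocks of a feasible assignment preserves feasibility here for the same reason as in the proof: the constraint $\phi$ only becomes easier to satisfy when rewards are removed.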

Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. Potential-based difference rewards for multiagent reinforcement learning. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, pages 165-172, 2014.

Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 433-440. IFAAMAS, 2012.

Table 1: Win rates for the StarCraft Multi-Agent Challenge. CollaQ shows superior performance over all baselines.

Table 2: Number of surviving units on six StarCraft maps. We compute mean and standard deviation over 128 runs. CollaQ outperforms all baselines significantly by keeping more units alive. CollaQ with Attn: 2.77 ± 0.17, 4.73 ± 1.08, 1.00 ± 0.49, 5.22 ± 1.79, 3.68 ± 0.63, 4.73 ± 0.41.

In a simple ad hoc team play setting, we assign a new VIP agent whose survival matters at test time. Results in Tab. 3 show that at test time, the VIP agent in CollaQ has a substantially higher survival rate than in QMIX.

Table 3: VIP agent survival rates for the StarCraft Multi-Agent Challenge. CollaQ with attention surpasses QMIX by a large margin.

We also test CollaQ in a harder ad hoc team play setting: swapping/adding/removing agents at test time. Tab. 4 summarizes the results; CollaQ outperforms QMIX by a large margin.

Table 4: Win rates for the StarCraft Multi-Agent Challenge with swapping/adding/removing agents. CollaQ improves over QMIX substantially.

6. CONCLUSION

In this work, we propose CollaQ, which models multi-agent RL as a dynamic reward assignment problem. We show that under certain conditions, there exist decentralized policies for each agent that are approximately optimal with respect to the team goal. CollaQ learns these policies with an end-to-end training framework, using the Q-function decomposition suggested by the theoretical analysis. CollaQ is tested on the challenging StarCraft Multi-Agent Challenge and surpasses the previous SoTA by 40% in win rate on various maps and by 30% in several ad hoc team play settings. We believe the idea of multi-agent reward assignment used in CollaQ can be an effective strategy for ad hoc MARL.

A COLLABORATIVE Q DETAILS

We derive the gradient and provide the training details for Eq. 5.

Gradient for Training Objective. Taking the derivative w.r.t. $\theta^a_n$ and $\theta^c_n$ in Eq. 5, we arrive at the gradient of the objective.

Soft CollaQ. In the actual implementation, we use a soft-constraint version of CollaQ: we subtract $Q^{collab}(o^{alone}_i, a_i)$ from Eq. 4, so the Q-value decomposition becomes:

$$Q_i(o_i, a_i) = Q^{alone}(o^{alone}_i, a_i) + Q^{collab}(o_i, a_i) - Q^{collab}(o^{alone}_i, a_i) \quad (7)$$

The optimization objective is kept the same as in Eq. 5. This helps reduce variance in all the settings in resource collection and the StarCraft Multi-Agent Challenge. We sometimes also replace $Q^{collab}(o^{alone}_i, a_i)$ in Eq. 7 by its target-network value to further stabilize training.
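The soft decomposition has the useful property that when no teammates are observed, $o_i$ reduces to $o^{alone}_i$, the two collaborative terms cancel, and $Q_i$ falls back to $Q^{alone}$ exactly. A minimal numpy sketch of this identity (function and variable names hypothetical):

```python
import numpy as np

def soft_collaq_q(q_alone, q_collab_full, q_collab_alone):
    """Soft CollaQ decomposition (sketch of the subtraction in Eq. 7):
    Q_i(o_i, .) = Q^alone(o^alone_i, .) + Q^collab(o_i, .) - Q^collab(o^alone_i, .)
    """
    return q_alone + q_collab_full - q_collab_alone

q_alone = np.array([1.0, 0.5, -0.2])
q_collab_alone = np.array([0.1, -0.3, 0.0])

# With teammates in view, the collaborative term shifts the Q-values.
q_collab_full = np.array([0.4, -0.1, 0.2])
q_with_team = soft_collaq_q(q_alone, q_collab_full, q_collab_alone)

# With no teammates observed, o_i == o^alone_i and the correction cancels.
q_no_team = soft_collaq_q(q_alone, q_collab_alone, q_collab_alone)
```

This cancellation is one way to see why the subtraction reduces variance: the collaborative branch only contributes the part of its output that actually depends on other agents.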

B ENVIRONMENT SETUP AND TRAINING DETAILS

Resource Collection. We set the discount factor to 0.992 and use the RMSprop optimizer with a learning rate of 4e-5. $\epsilon$-greedy exploration is used, with $\epsilon$ annealed linearly from 1.0 to 0.01 over 100k steps. We use a batch size of 128 and update the target network every 10k steps. We set the temperature parameter $\alpha$ to 1. We run all experiments 3 times and plot the mean/std in all figures.

StarCraft Multi-Agent Challenge. We set the discount factor to 0.99 and use the RMSprop optimizer with a learning rate of 5e-4. $\epsilon$-greedy exploration is used, with $\epsilon$ annealed linearly from 1.0 to 0.05 over 50k steps. We use a batch size of 32 and update the target network every 200 episodes. We set the temperature parameter $\alpha$ to 0.1 for 27m_vs_30m and to 1 for all other maps.

All experiments on StarCraft II use the default reward and observation settings of the SMAC benchmark. For ad hoc team play with a different VIP, an additional reward of 100 is added to the original reward of 200 for winning the game if the VIP agent is alive at the end of the episode.

For swapping agent types, we design the maps 3s1z_vs_16zg, 1s3z_vs_16zg and 2s2z_vs_16zg (s stands for stalker, z for zealot, and zg for zergling). We use the first two maps for training and the third for testing. For adding units, we use 27m_vs_30m for training and 28m_vs_30m for testing (m stands for marine). For removing units, we use 29m_vs_30m for training and 28m_vs_30m for testing. We run all these experiments 4 times and plot the mean/std in all figures.
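The linear annealing schedule above is straightforward to implement; a small sketch using the StarCraft settings (1.0 to 0.05 over 50k steps; the function name is ours):

```python
def epsilon(step, start=1.0, end=0.05, anneal_steps=50_000):
    """Linearly annealed epsilon-greedy schedule; clamps at `end` afterwards."""
    frac = min(step / anneal_steps, 1.0)  # fraction of the annealing window elapsed
    return start + frac * (end - start)

# schedule at a few checkpoints
eps_start = epsilon(0)
eps_mid = epsilon(25_000)
eps_end = epsilon(50_000)
```

The resource collection setting uses the same schedule shape with end=0.01 and anneal_steps=100_000.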

C DETAILED RESULTS FOR RESOURCE COLLECTION

We compare CollaQ against QMIX and against CollaQ with the attention-based model in the resource collection setting. As shown in Fig. 9, QMIX does not perform well, as it is even worse than random actions. Adding the attention-based model introduces a larger variance, so the performance degrades by 10.66 in training but improves by 2.13 in ad hoc team play.

D DETAILED RESULTS FOR STARCRAFT MULTI-AGENT CHALLENGE

We provide the win rates for CollaQ and QMIX on the environments without random agent IDs on three maps. Fig. 10 shows the results for both methods.

It is easy to show that all these reward assignments are feasible solutions to Eqn. 1: if the original solution is feasible, then setting some reward assignments to zero also yields a feasible solution, due to the property of the constraint $\phi$. For simplicity, we define $J_{local}$ to be the partial objective that sums over $s_j \in s_i^{local}$, and similarly $J_{remote}$.

We can show the following relationships between these solutions. Each reward assignment move costs at most $\gamma^D R_{\max}$ by Lemma 2, and there are at most $MK$ such movements. On the other hand, for each agent, Lemma 3 gives a gap of at most $\gamma^C R_{\max} M$. The combined assignment is still a feasible solution, since in both $R^*_{local(0)}$ and $R^*_0$ the top-right and bottom-left sub-matrices are zero, and its objective remains good. Note that (1) is due to Eqn. 27; (2) is due to the optimality of $R^*_{local}$ (and the looser constraints for $R^*_{local}$); (3) is due to the fact that $\tilde{R}^*$ is obtained by adding rewards released from $s_i^{remote}$ to $s_i^{local}$; (4) is due to the fact that $R^*_0$ and $\tilde{R}^*$ have the same remote components; (5) is due to Eqn. 26; and (6) is by definition of $J_{local}$ and $J_{remote}$.

Therefore we obtain $\tilde{r}_i = [\tilde{R}]_i$, which depends only on $s_i^{local}$. Moreover, the solution $\tilde{R}$ is close to the optimal $R^*$, with gap $(\gamma^C + \gamma^D) R_{\max} M K$.

