A COACH-PLAYER FRAMEWORK FOR DYNAMIC TEAM COMPOSITION

Anonymous

Abstract

In real-world multi-agent teams, agents with different capabilities may join or leave "on the fly" without altering the team's overarching goals. Coordinating teams with such dynamic composition remains a challenging problem: the optimal team strategy may vary with its composition. Inspired by real-world team sports, we propose a coach-player framework to tackle this problem. We assume that the players only have a partial view of the environment, while the coach has a complete view. The coach coordinates the players by distributing individual strategies. Specifically, we 1) propose an attention mechanism for both the players and the coach; 2) incorporate a variational objective to regularize learning; and 3) design an adaptive communication method to let the coach decide when to communicate with different players. The attention mechanism on the players and the coach allows for a varying number of heterogeneous agents and can thus handle dynamic team composition. We validate our methods on resource collection tasks in the multi-agent particle environment and demonstrate zero-shot generalization to new team compositions with varying numbers of heterogeneous agents. The performance of our method is comparable to, or even better than, the setting where all players have a full view of the environment but no coach. Moreover, performance stays nearly the same even when the coach communicates as little as 13% of the time using our adaptive communication strategy. These results demonstrate the significance of a coach in coordinating players in dynamic teams.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is the problem of coordinating a team of agents to perform a shared task. It has broad applications in autonomous vehicle teams (Cao et al., 2012), sensor networks (Choi et al., 2009), finance (Lee et al., 2007), and social science (Leibo et al., 2017). Recent works in MARL have shed light on solving challenging multi-agent problems, such as playing StarCraft with deep learning models (Rashid et al., 2018). Among these methods, centralized training with decentralized execution (CTDE) has gained much attention, since learning in a centralized way enables better cooperation while executing independently makes the system efficient and scalable (Lowe et al., 2017). However, most deep CTDE approaches for cooperative MARL are limited to a fixed number of homogeneous agents. Real-world multi-agent tasks, on the other hand, often involve dynamic teams. For example, in a soccer game, a team receiving a red card has one fewer player; in this case, the team may switch to a more defensive strategy. As another example, consider an autonomous vehicle team for delivery. The control over the team depends on how many vehicles we have, how much load each vehicle permits, as well as the delivery destinations. In both examples, the optimal team strategy varies according to the team composition, i.e., the size of the team and each agent's capability. In these settings, it is intractable to re-train the agents for each new team composition, and it is thus desirable to have zero-shot generalization to new team compositions that are not seen during training. Recently, Iqbal et al. (2020) proposed a multi-head attention model for learning in environments with a variable number of agents under the CTDE framework. However, in many challenging tasks, the CTDE constraint is too restrictive, as each agent only has access to its own decisions and partial environmental observations at test time; see Section 3.1 for an example where this requirement causes failure to learn.
The CTDE constraint can be relaxed either by 1) allowing all agents to communicate with each other (Zhang et al., 2018), or 2) having a special "coach" agent who distributes strategic information based on a full view of the environment (Stone & Veloso, 1999). The former is typically too expensive for many CTDE scenarios (e.g., battery-powered drones or vehicles), while the latter may be feasible (e.g., satellites or watchtowers monitoring the field in which agents operate). In this work, we focus on the latter approach of having a coach to coordinate the agents. Specifically, we grant the coach a global observation, while agents only have partial views of the environment. We assume that the coach can distribute information to the agents only in limited amounts. We model this communication through a continuous vector, termed the strategy vector, which is specific to each agent. Each agent's decision module incorporates the most recent strategy vector from the coach. We further propose a variational objective to regularize learning, inspired by Rakelly et al. (2019) and Wang et al. (2020a). To save the cost incurred in receiving information from the coach, we additionally design an adaptive policy where the coach communicates with different players only as needed. To train the coach and agents, we sample different teams from a set of team compositions; recall that training is centralized under the CTDE framework. At execution time, the learned policy generalizes across different team compositions in a zero-shot manner. Our framework also allows for dynamic teams whose composition varies over time (see Figure 1(a-b)).

Summary of Results:

We (1) propose a coach-player framework for dynamic team composition of heterogeneous agents; (2) introduce a variational objective to regularize learning, which leads to improved performance; and (3) design an adaptive communication strategy to minimize communication from the coach to the agents. We apply our methods to resource-collection tasks in multi-agent particle environments and evaluate zero-shot generalization to new team compositions at test time. Results show comparable or even better performance compared to methods where players have full observation but no coach. Moreover, there is almost no performance degradation even when the coach communicates with the players as little as 13% of the time. These results demonstrate the effectiveness of having a coach in dynamic teams.

2.1. PROBLEM FORMULATION

We model the cooperative multi-agent task as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek et al., 2016). Specifically, we build on the setting of Dec-POMDP with entities (de Witt et al., 2019), which considers entity-based knowledge representation. Here, entities include both controllable agents and other environment landmarks. In addition, we extend the representation to allow agents to have individual characteristics, e.g., skill level or physical condition. A Dec-POMDP with characteristics-based entities can therefore be described as a tuple (S, U, O, P, R, E, A, C, m, Ω, ρ, γ). E represents the space of entities. Every entity e ∈ E has a state representation s^e ∈ R^{d_e}, and the global state is the set s = {s^e | e ∈ E} ∈ S. A subset of the entities are controllable agents, a ∈ A ⊆ E. We differentiate both agents and non-agent entities by their characteristics c^e ∈ C. For example, c^e can be a continuous vector consisting of two parts such that only one part is non-zero: if e is an agent, the first part can represent its skill level or physical condition; if e is a non-agent entity, the second part can represent its entity type. A scenario is a multiset of entities c = {c^e | e ∈ E} ∈ Ω, and possible scenarios are drawn from the distribution ρ(c). In other words, scenarios are unique up to the composition of the team and that of the world entities. Fixing any particular scenario c yields an ordinary Dec-POMDP with the fixed multiset of entities {e | c^e ∈ c}. Given a scenario c, at each environment step each agent a observes a subset of entities specified by an observability function m : A × E → {0, 1}, where m(a, e) indicates whether agent a can observe entity e. An agent's observation is therefore the set o^a = {s^e | m(a, e) = 1} ∈ O.
All agents perform the joint action u = {u^a | a ∈ A} ∈ U, and the environment steps according to the transition dynamics P(s' | s, u; c). The entire team then receives a single scalar reward r ∼ R(s, u; c). Starting from an initial state s_0, the MARL objective is to maximize the discounted cumulative team reward over time, G = E_{s_0, u_0, s_1, u_1, ...}[Σ_{t=0}^∞ γ^t r_t], where γ is the discount factor. Our goal is to learn a team policy that can generalize across different scenarios c (different team compositions) and eventually dynamic scenarios (team compositions varying over time). For optimizing G, Q-learning is a specific method that learns an accurate action-value function and makes decisions based on it. The optimal action-value function satisfies the Bellman equation

Q^tot_*(s, u; c) = r(s, u; c) + γ E_{s' ∼ P(·|s,u;c)}[max_{u'} Q^tot_*(s', u'; c)],

where Q^tot_* denotes the team's optimal Q-value. A common strategy is to adopt function approximation and parameterize Q^tot_* with parameters θ. Moreover, due to partial observability, the history of observation-action pairs is often encoded into a compact vector representation, e.g., via a recurrent neural network (Medsker & Jain, 1999), in place of the state: Q^tot_θ(τ_t, u_t; c) ≈ E[Q^tot_*(s_t, u_t; c)], where τ = {τ^a | a ∈ A} and τ^a = (o^a_0, u^a_0, ..., o^a_t). In practice, at each time step t the recurrent network takes (u^a_{t-1}, o^a_t) as the new input, with u^a_{-1} = 0 at t = 0 (Zhu et al., 2017). Deep Q-learning (Mnih et al., 2015) uses deep neural networks to approximate the Q function; in our case its objective is

L(θ) = E_{(c, τ_t, u_t, r_t, τ_{t+1}) ∼ D}[(r_t + γ max_{u'} Q^tot_θ̄(τ_{t+1}, u'; c) − Q^tot_θ(τ_t, u_t; c))^2].   (1)

Here, D is a replay buffer that stores previously generated off-policy data, and Q^tot_θ̄ is the target network, parameterized by a delayed copy θ̄ of θ for stability.
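To make the loss in Equation 1 concrete, here is a minimal numpy sketch of the TD target and mean squared Bellman error, abstracting away the scenario conditioning and network details (array shapes and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def td_loss(q_tot, q_tot_target, rewards, dones, gamma=0.99):
    """Mean squared Bellman error of Eq. 1, sketched on batched arrays.

    q_tot:        (B,) chosen-action team values Q_tot(tau_t, u_t)
    q_tot_target: (B, n_actions) target-network values at tau_{t+1}
    rewards:      (B,) team rewards r_t
    dones:        (B,) 1.0 if the episode ended at step t, else 0.0
    """
    # Bootstrapped target: r_t + gamma * max_{u'} Q_target(tau_{t+1}, u')
    target = rewards + gamma * (1.0 - dones) * q_tot_target.max(axis=1)
    return float(np.mean((target - q_tot) ** 2))
```

In practice the target term is held fixed (no gradient flows through the delayed copy θ̄), which here is implicit since the target is computed from a separate array.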

2.2. VALUE FUNCTION FACTORIZATION AND ATTENTION QMIX

Factorizing the action-value function Q^tot into per-agent value functions has become a popular approach for centralized training with decentralized execution. Specifically, Rashid et al. (2018) propose QMIX, which factorizes Q^tot(τ_t, u) into {Q^a(τ^a_t, u^a) | a ∈ A} and combines them via a mixing network such that ∂Q^tot/∂Q^a ≥ 0 for all a. This condition guarantees that each individually optimal action u^a is also the best action for the team. As a result, during execution the mixing network can be removed and agents act independently according to their own Q^a. Attention QMIX (A-QMIX) (Iqbal et al., 2020) augments QMIX with an attention mechanism to deal with an indefinite number of agents/entities. In particular, for each agent, the algorithm applies a multi-head attention (MHA) layer (Vaswani et al., 2017) to summarize the information of the other entities. This information is used both for encoding the agent's state and for adjusting the mixing network. Specifically, the input o is represented by two matrices: the entity state matrix X_E and the observability matrix M. Assume the given scenario c contains n_e entities, n_a of which are controllable agents. Then X_E ∈ R^{n_e × d_e} stacks all entity encodings, with the first n_a rows belonging to agents. M ∈ {0,1}^{n_a × n_e} is a binary observability mask, where M_ij = m(a_i, e_j) indicates whether agent i observes entity j. X_E is first passed through an encoder, i.e., a single-layer feed-forward network, to become X. Denoting the k-th row of X as h_k, the MHA layer takes h_i as the query and {h_j | M_ij = 1} as the keys to compute a latent representation of agent a_i's observation. For the mixing network, the same MHA layer takes X_E and the full observability matrix M*, where M*_ij = 1 if both e_i and e_j exist in the scenario c, and outputs an encoded global representation for each agent. These encoded representations are then used to generate the mixing network.
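The masking step above can be sketched as follows. This is a minimal single-head version in numpy (A-QMIX uses multi-head attention; the projection matrices Wq, Wk, Wv are hypothetical parameters introduced only for illustration):

```python
import numpy as np

def masked_attention(X, M, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with an observability mask.

    X:  (n_e, d) encoded entity states; the first n_a rows are agents.
    M:  (n_a, n_e) binary mask, M[i, j] = 1 iff agent i observes entity j.
    Wq, Wk, Wv: (d, d_k) projection matrices.
    Returns an (n_a, d_k) observation summary, one row per agent.
    """
    n_a = M.shape[0]
    Q = X[:n_a] @ Wq           # agent rows act as the queries
    K, V = X @ Wk, X @ Wv      # every entity can be a key/value
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores = np.where(M.astype(bool), scores, -1e9)  # hide unobserved entities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V
```

Because the mask is applied before the softmax, unobserved entities receive (numerically) zero weight, so the same code handles any number of entities per scenario.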
We refer readers to Appendix B of Iqbal et al. (2020) for more details. While A-QMIX in principle applies to the dynamic team composition problem, it is restricted to fully decentralized execution with partial observation. We borrow the attention modules from A-QMIX but additionally investigate how to efficiently take advantage of global information by introducing the coach. Iqbal et al. (2020) also propose an extended version of A-QMIX, called Attentive-Imaginative QMIX (AI-QMIX), which randomly breaks the team into two disjoint parts within each agent's Q^a to further decompose the Q value. The authors demonstrate that AI-QMIX outperforms A-QMIX on a grid-world resource allocation task and a modified StarCraft environment. However, as we will show in the experiment section, we find that AI-QMIX does not improve much over A-QMIX while doubling the computational cost. For this reason, our method is mainly based on the A-QMIX framework, but extending it to AI-QMIX is straightforward.

3. METHOD

Here we present the coach-player architecture, which incorporates global information to adapt the team-level strategy across different scenarios c. We first introduce the coach agent, which coordinates the base agents with global information by periodically broadcasting strategies. We then present the learning objective and an additional variational objective to regularize training. We finish by introducing a method to reduce the broadcast rate, along with analysis to support it.

3.1. ON THE IMPORTANCE OF GLOBAL INFORMATION

As the optimal team strategy varies with the scenario c, which includes the team composition, it is important for the team to be aware of scenario changes promptly. Consider an extreme example: a multi-agent problem where every agent has a skill level represented by a real number c^a ∈ R, and there is a single task to complete. For each agent a, u^a ∈ {0, 1} indicates whether a chooses to perform the task. The reward is R(u; c) = (max_a c^a · u^a) · 1[Σ_a u^a ≤ 1]. In other words, the reward is proportional to the skill level of the agent who performs the task, and the team is penalized whenever more than one agent chooses to perform it. If the underlying scenario c is fixed, even if all agents are unaware of others' capabilities, the team can still gradually figure out the optimal strategy. By contrast, when c is subject to change, i.e., agents with different c can join or leave, even if we allow agents to communicate via a network, the information that a particular agent has joined or left generally takes d time steps to propagate, where d is the longest shortest path from that agent to any other agent. We can thus see that knowing the global information is not only beneficial but sometimes also necessary for coordination. This motivates the introduction of the coach agent.

3.2. COACH AND PLAYERS

We introduce a coach agent and grant it global observation. To preserve the efficiency of the decentralized setting, we limit the coach to distributing information only via a continuous vector z^a ∈ R^{d_z} (d_z is the strategy dimension) to each agent a, which we call the strategy, once every T time steps; T is the communication interval. The team strategy is therefore z = {z^a | a ∈ A}. Strategies are predicted via a function f parameterized by φ. Specifically, we assume z^a ∼ N(μ^a, Σ^a), where

(μ = {μ^a | a ∈ A}, Σ = {Σ^a | a ∈ A}) = f_φ(s; c).   (2)

Within the next T steps, agent a acts conditioned on the strategy z^a. Concretely, within an episode, at each time t_k ∈ {v | v ≡ 0 (mod T)}, the coach observes the global state s_{t_k}, then computes and distributes the strategies z_{t_k} to all agents. For t ∈ [t_k, t_k + T − 1], each agent a acts according to its individual action-value Q^a(τ^a_t, · | z^a_{t_k}; c^a). Denote by t̄ = max{v | v ≡ 0 (mod T), v ≤ t} the most recent time step at which the coach distributed strategies. The mean squared Bellman error objective in Equation 1 becomes

L_RL(θ, φ) = E_{(c, τ_t, u_t, r_t, s_t̄, s_t̄') ∼ D}[(r_t + γ max_{u'} Q^tot_θ̄(τ_{t+1}, u' | z̄_{t+1}; c) − Q^tot_θ(τ_t, u_t | z_t̄; c))^2],

where z_t̄ ∼ f_φ(s_t̄; c), z̄_{t+1} ∼ f_φ̄(s_t̄'; c) with t̄' the most recent broadcast step for time t+1, and φ̄ parameterizes the target network for the coach's strategy predictor f. We build our network on top of A-QMIX but use a separate multi-head attention (MHA) layer to encode the global state that the coach observes. For the mixing network, we also use the coach's MHA output to mix the individual Q^a into the team Q^tot. The entire architecture is described in Figure 2; we provide more details in the Appendix.
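The broadcast timing can be sketched as a simple control loop. The linear `coach_strategy` stand-in below is a hypothetical placeholder for the coach's attention network f_φ; it only illustrates sampling z^a ∼ N(μ^a, Σ^a) once every T steps (Eq. 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def coach_strategy(state, n_agents, d_z=4):
    """Placeholder for f_phi: map the global state to a diagonal Gaussian per
    agent and sample one strategy vector z^a each (Eq. 2)."""
    mu = np.tanh(state.sum() + np.arange(n_agents * d_z, dtype=float))
    mu = mu.reshape(n_agents, d_z)
    sigma = 0.1 * np.ones_like(mu)          # toy fixed standard deviation
    return mu + sigma * rng.standard_normal(mu.shape)

T, n_agents, episode_len = 4, 3, 12
z, broadcasts = None, []
for t in range(episode_len):
    state = rng.standard_normal(8)          # toy global state s_t
    if t % T == 0:                          # coach broadcasts every T steps
        z = coach_strategy(state, n_agents)
        broadcasts.append(t)
    # here each agent a would act greedily w.r.t. Q^a(tau^a_t, . | z[a])
```

Between broadcasts, agents keep conditioning on the most recently received row of `z`, which is exactly the role of t̄ in the objective above.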

3.3. REGULARIZING WITH VARIATIONAL OBJECTIVE

Inspired by recent work applying variational inference to regularize the learning of a latent space in reinforcement learning (Rakelly et al., 2019; Wang et al., 2020a), we also introduce a variational objective to stabilize training. Intuitively, an agent's behavior should be consistent with its assigned strategy; in other words, the received strategy should be identifiable from the agent's future trajectory. We therefore propose to maximize the mutual information between the strategy and the agent's future observation-action pairs ζ^a_t = (o^a_{t+1}, u^a_{t+1}, o^a_{t+2}, u^a_{t+2}, ..., o^a_{t+T−1}, u^a_{t+T−1}). We maximize the following variational lower bound:

I(z^a_t; ζ^a_t, s_t) = E_{s_t, z^a_t, ζ^a_t}[log (q_ξ(z^a_t | ζ^a_t, s_t) / p(z^a_t | s_t))] + D_KL(p(z^a_t | ζ^a_t, s_t) || q_ξ(z^a_t | ζ^a_t, s_t))
≥ E_{s_t, z^a_t, ζ^a_t}[log (q_ξ(z^a_t | ζ^a_t, s_t) / p(z^a_t | s_t))]
= E_{s_t, z^a_t, ζ^a_t}[log q_ξ(z^a_t | ζ^a_t, s_t)] + H(z^a_t | s_t).

Here H(·) denotes the entropy and q_ξ is the variational distribution parameterized by ξ. We further adopt the Gaussian factorization for q_ξ as in Rakelly et al. (2019), i.e.,

q_ξ(z^a_t | ζ^a_t, s_t) ∝ q^(t)_ξ(z^a_t | s_t, u^a_t) · Π_{k=t+1}^{t+T−1} q^(k)_ξ(z^a_t | o^a_k, u^a_k),

where each q^(·)_ξ is a Gaussian distribution. Thus q_ξ predicts the mean μ̂^a_t and covariance Σ̂^a_t of a multivariate normal distribution, from which we compute the log-probability of z^a_t. In practice, z^a_t is sampled from f_φ using the reparameterization trick (Kingma & Welling, 2013). The objective is

L_var(φ, ξ) = −λ_1 E_{s_t, z^a_t, ζ^a_t}[log q_ξ(z^a_t | ζ^a_t, s_t)] − λ_2 H(z^a_t | s_t),

where λ_1 and λ_2 are tunable coefficients.
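A minimal numpy sketch of this objective, under simplifying assumptions: diagonal Gaussians, and the per-step factors combined by simply summing their log-densities (the paper renormalizes the product into a single Gaussian, which we skip here; the entropy term is passed in as a precomputed scalar):

```python
import numpy as np

def diag_gaussian_logpdf(z, mu, log_sigma):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return -0.5 * np.sum(((z - mu) / np.exp(log_sigma)) ** 2
                         + 2 * log_sigma + np.log(2 * np.pi))

def variational_loss(z, step_mu, step_log_sigma, entropy,
                     lam1=1e-3, lam2=1e-4):
    """Sketch of L_var: q_xi factorizes over the T steps of zeta^a_t, so the
    log of the (unnormalized) product is a sum of per-step Gaussian terms.
    Shapes: z (d_z,); step_mu, step_log_sigma are length-T lists of (d_z,)."""
    log_q = sum(diag_gaussian_logpdf(z, m, s)
                for m, s in zip(step_mu, step_log_sigma))
    return -lam1 * log_q - lam2 * entropy
```

The λ values here are arbitrary placeholders; in the paper they are tuned hyper-parameters.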

3.4. REDUCING THE COMMUNICATION FREQUENCY

So far, we have assumed the coach periodically broadcasts new strategies to all agents every T steps. In practice, broadcasting incurs communication cost or is subject to bandwidth limits, so it is desirable to distribute strategies only when "necessary". To reduce the communication frequency, we propose an intuitive method that decides whether to distribute a new strategy based on the l2 distance between the old strategy and the new one. In particular, at time step t = kT, k ∈ Z, assuming the prior strategy for agent a is z^a_old, the strategy used for agent a is

z̃^a_t = z^a_t ∼ f_φ(s; c)   if ||z^a_t − z^a_old||_2 > β,
z̃^a_t = z^a_old              otherwise.   (5)

For a general time step t, the individual strategy for a is therefore z̃^a_t̄. Here β is a manually specified threshold. Note that we can train a single model and apply this criterion for all agents; by adjusting β, one can easily achieve different communication frequencies. Intuitively, when the previous strategy is "close" to the current one, we should be more tolerant of keeping it. This intuition can be made concrete when the learned Q^tot_θ has a relatively small Lipschitz constant. Assume that for all τ_t, u_t, s_t, s_t̄, c,

|Q^tot(τ_t, u_t | f(s_t̄); c) − Q^tot_*(s_t, u_t; c)| ≤ κ,

where Q^tot_* is the optimal Q, and that for all z^a_1, z^a_2,

|Q^tot(τ_t, u_t | z^a_1, z^{−a}; c) − Q^tot(τ_t, u_t | z^a_2, z^{−a}; c)| ≤ η ||z^a_1 − z^a_2||_2.

Then we have the following:

Theorem 1. Suppose the team strategies z̃_t used at execution satisfy ||z̃^a_t − z^a_t||_2 ≤ β for all a, t. Denote the action-value and value function of following the used strategies as Q̃ and Ṽ, i.e., Ṽ(τ_t | z̃_t; c) = max_u Q̃(τ_t, u | z̃_t; c), and define V^tot_* similarly. Then

||V^tot_*(s_t; c) − Ṽ(τ_t | z̃_t; c)||_∞ ≤ 2(n_a η β + κ) / (1 − γ),

where n_a is the number of agents and γ is the discount factor. We defer the proof to Appendix A.
The method described in equation 5 satisfies the condition in Theorem 1, and therefore when β is small, distributing strategies according to equation 5 does not result in much performance drop.
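The rule in equation 5 amounts to a one-line thresholded update. A minimal vectorized sketch (array shapes are illustrative assumptions):

```python
import numpy as np

def maybe_broadcast(z_new, z_old, beta):
    """Eq. 5, per agent: only adopt the freshly sampled strategy when it has
    moved more than beta (in l2 norm) from the previously distributed one.

    z_new, z_old: (n_a, d_z) strategy matrices.
    Returns (strategies_used, sent_mask)."""
    dist = np.linalg.norm(z_new - z_old, axis=1)
    send = dist > beta                      # which agents get a new message
    z_used = np.where(send[:, None], z_new, z_old)
    return z_used, send
```

Sweeping β then trades communication frequency against the bounded value loss of Theorem 1: larger β means fewer broadcasts but a looser 2(n_a η β + κ)/(1 − γ) guarantee.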

4. EXPERIMENTS

We design the experiments to 1) verify the effectiveness of the coach agent; 2) investigate how performance varies with the interval T; 3) test whether the variational objective is useful; and 4) understand how much performance drops when adopting the method in equation 5. We test our ideas on a resource collection task with different scenarios in customized multi-agent particle environments (Lowe et al., 2017). In the following, we call our method COPA (COach-and-PlAyer).

4.1. RESOURCE COLLECTION

In Resource Collection, a team of agents coordinates to collect different resources spread over a square map of width 1.8. There are 4 types of entities: resources, agents, the home, and the invader. We assume 3 types of resources: (r)ed, (g)reen, and (b)lue. The world always contains 6 resources, 2 of each type. Each agent has 4 characteristics (c^a_r, c^a_g, c^a_b, v^a), where c^a_x represents how efficiently agent a collects resource x and v^a is the agent's maximum moving speed. The agents' job is to collect as many resources as possible and bring them home, and to catch the invader if it appears. If a collects x, the team receives a reward of 10 · c^a_x. While holding a resource, an agent cannot collect more and needs to bring the resource home before going out again. Bringing a resource home yields a reward of 1. Occasionally the invader appears and heads directly for home; any agent that catches the invader earns a reward of 4, and if the invader reaches home, the team is penalized with a reward of -4. Each agent has 5 actions: accelerate up/down/left/right and decelerate, and it observes anything within a distance of 0.2. The maximum episode length is 145. In training, we allow scenarios with 2 to 4 agents, and for each agent, c^a_r, c^a_g, c^a_b are chosen from {0.1, 0.5, 0.9} and the max speed v^a from {0.3, 0.5, 0.7}. We design 3 testing tasks: a 5-agent task, a 6-agent task, and a varying-agent task. For each task, we generate 1000 different scenarios c. Each scenario includes n_a agents, 6 resources, and an invader. For agents, c^a_r, c^a_g, c^a_b are chosen uniformly from the interval [0.1, 0.9] and v^a from [0.2, 0.8]. For a particular scenario in the varying-agent task, starting from 4 agents, the environment randomly adds or drops an agent every ν steps as long as the number of agents remains in [2, 6]; ν is a random variable drawn from the uniform distribution U(8, 12). See Figure 3 for an example run of the learned policy.

Effectiveness of Coach

We provide the training curve in Figure 4(a), where the communication interval is set to T = 4. The black solid line is a hard-coded greedy algorithm where agents always go for the resource they are best at collecting, and whenever the invader appears, the closest agent goes for it. We see that without global information, A-QMIX and AI-QMIX are significantly below the hard-coded baseline. Without the coach, we let all agents have a global view every T steps in A-QMIX (periodic), but it barely improves over A-QMIX. A-QMIX (full) is fully centralized, i.e., all agents always have a global view. Without the variational objective, COPA is comparable to A-QMIX (full); with the variational objective, it becomes even better. Note that all baseline methods are scaled to have more parameters than COPA. The results demonstrate the importance of global coordination and the coach-player hierarchy.

Communication Interval

To investigate how performance varies with T, we train with different T chosen from {2, 4, 8, 12, 16, 20, 24} in Figure 4(b). Interestingly, the performance peaks at T = 4, contradicting the intuition that smaller T is better. This suggests the coach is most useful when it makes the agents' behavior smooth and consistent over time.

Table 1: Generalization performance on unseen environments with more agents and dynamic team composition. Results are computed from 5 models trained with 5 different seeds. Communication frequency is compared to communicating with all agents at every step.

Figure 5: The varying sensitivity to communication frequency.

Zero-shot Generalization

We apply the learned model with T = 4 to the 3 testing environments. Results are provided in Table 1. The communication frequency is calculated relative to the fully centralized setting; for instance, T = 4 and β = 0 results in an average 25% communication frequency. As we increase β to suppress the distribution of strategies, the performance shows no significant drop down to a 13% communication frequency. Moreover, we apply the same model to 3 environments that are dynamic to different extents. In the more static environment, resources are always spawned at the same locations. In the medium environment, resources are spawned randomly but there is no invader. The most dynamic environment is the third environment in Table 1, where the team composition is dynamic and the invader exists. Results are summarized in Figure 5, where the x-axis is normalized by the communication frequency at β = 0 and the y-axis by the corresponding performance. As expected, the more dynamic the environment, the more severely a low communication frequency degrades performance.

4.2. RESCUE GAME

Search-and-rescue is a natural application of multi-agent systems, and in this section we further apply COPA to a rescue game. In particular, we consider a 10 × 10 grid world where each cell contains a building. At any time step, each building may catch fire. When a building b is on fire, it has an emergency level c^b ∼ U(0, 1). At most 10 buildings in the world will be on fire at the same time. We have n robots acting as surveillance firefighters, where n is a random number from 2 to 8. Each robot a has a skill level c^a ∈ [0.2, 1.0]. A robot has 5 actions: moving up/down/left/right and putting out a fire. If a is at a burning building and chooses to put out the fire, the emergency level is reduced to c^b ← max(c^b − c^a, 0). At each time step t, the overall emergency is given by c^B_t = Σ_b (c^b)^2, since we want to penalize the existence of more urgent fires more heavily. The reward is defined as r_t = c^B_{t−1} − c^B_t, the amount of emergency the team reduces. During training, we sample n from 3 to 5, and the robots are spawned randomly across the world; each agent's skill level is sampled from {0.2, 0.5, 1.0}. Then a random number of 3 to 6 buildings catch fire. During testing, we enlarge n to the range 2 to 8 and sample up to 10 burning buildings. We summarize the results in the corresponding table; the fully observable baseline failed to learn, and we conjecture this is because the team has too much information to process during training, making it hard to search for a good policy. COPA consistently outperforms all baselines even with a communication frequency as low as 0.15.
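The rescue-game reward above is easy to state in code. A small sketch of the emergency bookkeeping (function names are ours, not the environment's API):

```python
import numpy as np

def overall_emergency(levels):
    """c^B_t = sum_b (c^b)^2 over the buildings currently on fire;
    squaring penalizes the most urgent fires more heavily."""
    return float(np.sum(np.square(levels)))

def step_reward(levels_before, levels_after):
    """r_t = c^B_{t-1} - c^B_t: the amount of emergency reduced this step."""
    return overall_emergency(levels_before) - overall_emergency(levels_after)

def extinguish(c_b, c_a):
    """A robot with skill c_a works on a building with emergency c_b."""
    return max(c_b - c_a, 0.0)
```

Note the reward is dense and telescoping: summing r_t over an episode gives the total emergency eliminated, so a team that ends with no fires collects the full initial c^B_0.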

5. RELATED WORKS

In this section we briefly review related work in cooperative multi-agent reinforcement learning and hierarchical reinforcement learning.

Centralized Training with Decentralized Execution

Centralized training with decentralized execution (CTDE) assumes agents execute independently but uses global information for training. One branch of methods investigates factorizable Q functions (Sunehag et al., 2017; Rashid et al., 2018; Mahajan et al., 2019; Son et al., 2019), where the team Q is decomposed into individual utility functions. Other methods adopt actor-critic frameworks where only the critic is centralized (Foerster et al., 2017; Lowe et al., 2017). However, most deep CTDE methods structurally require fixed-size teams and are often applied to homogeneous teams.

Methods for Dynamic Compositions

Several recent works study transfer learning and curriculum learning in MARL, where the learned policy is a warm start for new tasks (Carion et al., 2019; Shu & Tian, 2018; Agarwal et al., 2019; Wang et al., 2020b; Long et al., 2020). These works focus on curriculum learning and mostly consider homogeneous agents, so the team strategy is relatively consistent. Iqbal et al. (2020) were the first to adopt a multi-head attention mechanism for dealing with a varying-size heterogeneous team, but the heterogeneity comes from a small finite set of agent types (usually 2 to 3). Additionally, the method is fully decentralized and therefore less adaptive to frequent changes in team composition.

Ad Hoc Teamwork and Networked Agents

Ad hoc teamwork studies the problem of quick adaptation to unknown teams (Genter et al., 2011; Barrett & Stone, 2012). However, ad hoc teamwork focuses on a single ad hoc agent and often assumes no control over the teammates; it is therefore essentially a single-agent problem.
Decentralized networked agents assume information can propagate between agents and their neighbors (Kar et al., 2013; Macua et al., 2014; Suttle et al., 2019; Zhang et al., 2018). However, research on networked agents still mainly focuses on homogeneous, fixed-size teams. Although it may be possible to extend the idea to the dynamic team composition problem, we leave this as future work.

Hierarchical Reinforcement Learning

The main focus of hierarchical RL/MARL is to decompose the task into hierarchies: a meta-controller selects either a temporally abstracted action called an option (Bacon et al., 2017) or a goal state (Vezhnevets et al., 2017) for the base agents. The base agents then shift their purpose to finishing the assigned option or reaching the goal, so the base agents usually have a different learning objective from the meta-controller. Recent deep MARL methods also demonstrate role emergence (Wang et al., 2020a) or skill emergence (Yang et al., 2019), but the inferred role/skill is only conditioned on the individual trajectory. The coach in our method instead uses global information to determine the strategies for the base agents. To the best of our knowledge, we are the first to apply such a hierarchy to teams with a varying number of heterogeneous agents.

6. CONCLUSION

We investigated a new setting of multi-agent reinforcement learning, where both the team size and the members' capabilities are subject to change. To this end, we proposed a coach-player framework where the coach coordinates with a global view while players execute independently with local views and the coach's strategies. We developed a variational objective to regularize learning and introduced an intuitive method to suppress unnecessary distribution of strategies. The experimental results across multiple unseen scenarios on the Resource Collection task demonstrate the effectiveness of the coach agent. The zero-shot generalization ability of our method shows a promising direction for real-world ad hoc multi-agent coordination.

A. PROOF OF THEOREM 1

Proof. From assumption 2, it is easy to check that if ||z̃^a_t − z^a_t||_2 ≤ β for all a, then |Q^tot(τ_t, u_t | z_t; c) − Q^tot(τ_t, u_t | z̃_t; c)| ≤ n_a η β. For notational convenience, we omit the superscript tot and the conditioning on c. For a state s, denote the action taken by the learned policy as u† = argmax_u Q(τ, u). Similarly, define u* as the action taken by the optimal Q_* and ũ as the action taken by Q̃. From assumption 1, we know that

Q_*(s, u†) ≥ Q(τ, u†) − κ ≥ Q(τ, u*) − κ ≥ Q_*(s, u*) − 2κ.

Therefore, taking u† results in at most a 2κ performance drop at this single step. Similarly, denoting ε_0 = n_a η β,

Q(τ, ũ) ≥ Q̃(τ, ũ) − ε_0 ≥ Q̃(τ, u†) − ε_0 ≥ Q(τ, u†) − 2ε_0.

Hence Q_*(s, ũ) ≥ Q_*(s, u*) − 2(ε_0 + κ). This means that taking the action ũ in place of u* at state s results in at most a 2(ε_0 + κ) performance drop, and this conclusion generalizes to any step t. Therefore, if at every single step the performance drop is bounded by 2(ε_0 + κ), the overall performance gap is within 2(ε_0 + κ)/(1 − γ).

NETWORK ARCHITECTURE

For all experiments, we use the same network architecture, in which all intermediate hidden layers have 128 dimensions. Note that this is possible because the only difference across scenarios is the number of entities, which does not affect our architecture since we adopt an attention model. The architecture details follow exactly Appendix A of Iqbal et al. (2020).
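To illustrate why an attention model is unaffected by the number of entities, here is a minimal single-head attention sketch in NumPy: each agent summarizes however many entities it observes into one fixed-size vector, so downstream layers never depend on the count. The dimensions and random initialization are illustrative only; the paper uses 128-dimensional hidden layers and multi-head attention following Iqbal et al. (2020).

```python
import numpy as np

# Minimal sketch: single-head attention over a variable-length entity set.
# The output is always a d-vector, regardless of how many entities attend.
rng = np.random.default_rng(0)
d = 8                                   # embedding size (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def attend(query_feat, entity_feats):
    """Summarize a variable-length set of entity features into one d-vector."""
    q = query_feat @ Wq                 # (d,)
    K = entity_feats @ Wk               # (n, d)
    V = entity_feats @ Wv               # (n, d)
    logits = K @ q / np.sqrt(d)         # (n,) one score per entity
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()            # softmax over entities
    return weights @ V                  # (d,) independent of n

agent = rng.standard_normal(d)
for n in (2, 5, 9):                     # different team/entity counts
    out = attend(agent, rng.standard_normal((n, d)))
    assert out.shape == (d,)            # fixed-size output for any n
```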

TRAINING DETAILS

To train the model, we set the maximum total number of environment steps to 5 million. We use an exponentially decayed $\epsilon$-greedy algorithm as our exploration policy, starting from $\epsilon_0 = 1.0$ and decaying to $\epsilon_n = 0.05$. We parallelize the environment across 8 threads for training. Details on hyper-parameters are available in Section A.
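A sketch of one common form of the exponentially decayed schedule described above, interpolating from 1.0 down to 0.05 over the 5 million training steps. The exact decay formula is an assumption, as the text does not specify it:

```python
# Exponential epsilon-greedy decay from EPS_START to EPS_END over TOTAL_STEPS.
# The geometric-interpolation form below is one common choice (assumed here;
# the paper only states the endpoints 1.0 and 0.05).
EPS_START, EPS_END, TOTAL_STEPS = 1.0, 0.05, 5_000_000

def epsilon(step):
    """Exploration rate at a given training step."""
    frac = min(step / TOTAL_STEPS, 1.0)     # clamp after training ends
    return EPS_START * (EPS_END / EPS_START) ** frac

assert abs(epsilon(0) - 1.0) < 1e-9         # starts at eps_0
assert abs(epsilon(TOTAL_STEPS) - 0.05) < 1e-9  # ends at eps_n
```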

HYPER PARAMETERS

For all experiments, we use the same set of hyper-parameters, which we provide in the table below.



Footnote 1: Team composition is part of an environmental scenario (de Witt et al., 2019), which also includes the other environment entities. The formal definition is in Section 2.1.
Footnote 2: Rigorously speaking, the players in our method occasionally receive global information from the coach. But the players still execute independently with local views, while benefiting from centralized learning.
Footnote 3: $c^e$ is part of $s^e$, but we explicitly write out $c^e$ in the following for emphasis.
Footnote 4: An agent can always observe itself, i.e., $m(a, a) = 1, \forall a \in A$.
Footnote 5: Note that here we only assume $Q^{tot}$ is accurate around the strategy predicted by $f$, not for arbitrary strategies.



Figure 1: (a) In training, we sample teams from a set of compositions. The coach observes the entire world and coordinates different teams by broadcasting strategies periodically; (b) a team with dynamic composition can be viewed as a sequence of fixed-composition teams, so the proposed training generalizes to dynamic composition; (c) our method's position within MARL (marked by the star).

Figure 2: The coach-player network architecture. Here, GRU refers to a gated recurrent unit (Chung et al., 2014); MLP refers to a multi-layer perceptron; FC refers to a fully connected layer. Both the coach and the players use multi-head attention to encode information. The coach has a full view while the players have partial views. $h^a_t$ encodes agent $a$'s history. $h^a_t$ is combined with the most recent strategy $z_t = z_{t - t \bmod T}$ to predict the individual utility $Q^a$. The mixing network combines all $Q^a$'s to predict $Q^{tot}$.
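The indexing $z_t = z_{t - t \bmod T}$ in the caption means the coach broadcasts every $T$ steps and the players reuse the latest broadcast in between. A toy loop making that indexing concrete (the strings stand in for the coach's actual strategy vectors):

```python
# Periodic broadcast with communication interval T: at step t, players use the
# strategy last broadcast at step t - t % T. Strings stand in for the real
# strategy vectors produced by the coach network.
T = 4
last_broadcast = {}
for t in range(10):
    if t % T == 0:
        last_broadcast[t] = f"z_{t}"    # coach broadcasts a new strategy
    used = t - t % T                     # step whose strategy the players use
    assert used in last_broadcast        # always points at an existing broadcast
# e.g. at t = 6, players still act on the strategy from step 4 (6 - 6 % 4 = 4)
```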

Figure 3: An example episode up to t = 30 with communication interval T = 4. Here, $c^a$ is represented by rgb values, $c^a = (r, g, b, v)$. For illustration, we set the agents' rgb values to be one-hot, but they can vary in practice. (i) An agent starts at home; (ii) the invader (black) appears while the agent (red) goes to the red resource; (iii) another agent is spawned while the old agent brings the resource home; (iv) one agent goes for the invader while the other goes for a resource; (v-vi) a new agent (blue) is spawned and goes for the blue resource while the other agents (red) are bringing resources home.

Figure 4: Training curves for Resource Collection. (a) Comparison against A-QMIX, AI-QMIX, and COPA without the variational objective; here we choose T = 4. (b) Ablations on the communication interval T. All results are averaged over 5 seeds.

Interestingly, we find that A-QMIX with full observation

Table: Average episodic reward over the same 500 Rescue games, comparing Random, Greedy, A-QMIX, A-QMIX (full), COPA (w/o $L_{var}$), COPA (1), COPA (0.5), and COPA (0.15). Results are averaged over the same algorithm trained with 3 different seeds. For COPA (x), x denotes the communication frequency. The Greedy algorithm matches the k-th most skillful agent to the k-th emergent building.

Hyper-parameters in our experiments.

A APPENDIX

PROOF OF THEOREM 1

Here we expand the two assumptions used in Theorem 1:

Assumption 1. Denote the learned team action-value function as $Q^{tot}$, the learned coach strategy encoder as $f$, and the true optimal action-value function as $Q^{tot}_*$. We assume that for any $\tau_t, u_t, s_t, c$,
$$|Q^{tot}(\tau_t, u_t \mid f(s_t), c) - Q^{tot}_*(s_t, u_t \mid c)| \le \kappa.$$

Assumption 2. Denote the learned individual action-value functions as $\{Q^{a_i}\}_{i=1}^{n_a}$, and the individual action-values at a state $s$ with action $u$ as $\{q^{a_i} = Q^{a_i}(s^{a_i}, u^{a_i})\}_{i=1}^{n_a}$. We assume that unilaterally varying any $q^{a_i}$ to $q'$ (i.e., all other $q^{-a_i}$ remain the same) does not cause a dramatic change in $Q^{tot}$ if $q'$ stays close to $q^{a_i}$:
$$|Q^{tot}(q^{-a_i}, q^{a_i}) - Q^{tot}(q^{-a_i}, q')| \le \eta_1 |q^{a_i} - q'|, \quad (8)$$
and that for any agent $a$ and any $c^a, \tau^a_t, u^a_t, z^a_1, z^a_2$ with proper dimensions,
$$|Q^a(\tau^a_t, u^a_t \mid z^a_1, c^a) - Q^a(\tau^a_t, u^a_t \mid z^a_2, c^a)| \le \eta_2 \|z^a_1 - z^a_2\|_2.$$

In other words, Assumption 1 assumes that the learned $Q^{tot}$, combined with the learned coach strategy function $f$, approximates the true optimal $Q^{tot}_*$ well, and Assumption 2 assumes that the learned action-value functions have bounded Lipschitz constants. Theorem 1 follows by chaining these two assumptions, as shown in the proof above.
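The proof uses Assumption 2 only through the product bound $|Q^{tot}(\cdot \mid \tilde{z}) - Q^{tot}(\cdot \mid z)| \le n_a \eta_1 \eta_2 \beta$. A toy numerical check of that chaining, using linear stand-ins for the individual value functions and the mixing network (chosen only so the Lipschitz constants are explicit; this is not the paper's model):

```python
import numpy as np

# Toy check that per-agent Lipschitz bounds chain into the team bound: if each
# Q^a is eta2-Lipschitz in its strategy z^a, and Q_tot is eta1-Lipschitz in
# each q^a, then perturbing every z^a by at most beta moves Q_tot by at most
# n_a * eta1 * eta2 * beta. Linear functions are illustrative stand-ins.
rng = np.random.default_rng(1)
n_a, dim = 4, 6
eta1, eta2, beta = 0.5, 0.8, 0.1

# Q^a(z) = w_a . z with ||w_a||_2 = eta2  (so eta2-Lipschitz in z)
w = rng.standard_normal((n_a, dim))
w *= eta2 / np.linalg.norm(w, axis=1, keepdims=True)
# Q_tot(q) = m . q with |m_a| = eta1     (so eta1-Lipschitz in each q^a)
m = np.full(n_a, eta1)

z = rng.standard_normal((n_a, dim))
delta = rng.standard_normal((n_a, dim))
delta *= beta / np.linalg.norm(delta, axis=1, keepdims=True)  # ||delta^a|| = beta

q_old = np.einsum("ad,ad->a", w, z)           # per-agent values before perturbation
q_new = np.einsum("ad,ad->a", w, z + delta)   # per-agent values after perturbation
gap = abs(m @ q_new - m @ q_old)
assert gap <= n_a * eta1 * eta2 * beta + 1e-9  # matches epsilon_0 in the proof
```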

