A COACH-PLAYER FRAMEWORK FOR DYNAMIC TEAM COMPOSITION

Anonymous

Abstract

In real-world multi-agent teams, agents with different capabilities may join or leave "on the fly" without altering the team's overarching goals. Coordinating teams with such dynamic composition remains a challenging problem: the optimal team strategy may vary with the composition. Inspired by real-world team sports, we propose a coach-player framework to tackle this problem. We assume that the players only have a partial view of the environment, while the coach has a complete view. The coach coordinates the players by distributing individual strategies. Specifically, we 1) propose an attention mechanism for both the players and the coach; 2) incorporate a variational objective to regularize learning; and 3) design an adaptive communication method that lets the coach decide when to communicate with different players. The attention mechanisms on the players and the coach allow for a varying number of heterogeneous agents and can thus handle dynamic team composition. We validate our methods on resource collection tasks in the multi-agent particle environment and demonstrate zero-shot generalization to new team compositions with varying numbers of heterogeneous agents. The performance of our method is comparable to, or even better than, a setting where all players have a full view of the environment but there is no coach. Moreover, performance stays nearly the same even when the coach communicates as little as 13% of the time under our adaptive communication strategy. These results demonstrate the significance of a coach in coordinating players in dynamic teams.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is the problem of coordinating a team of agents to perform a shared task. It has broad applications in autonomous vehicle teams (Cao et al., 2012), sensor networks (Choi et al., 2009), finance (Lee et al., 2007), and social science (Leibo et al., 2017). Recent work in MARL has shed light on challenging multi-agent problems such as playing StarCraft with deep learning models (Rashid et al., 2018). Among these methods, centralized training with decentralized execution (CTDE) has gained much attention, since learning in a centralized way enables better cooperation while executing independently makes the system efficient and scalable (Lowe et al., 2017). However, most deep CTDE approaches for cooperative MARL are limited to a fixed number of homogeneous agents. Real-world multi-agent tasks, on the other hand, often involve dynamic teams. For example, in a soccer game, a team receiving a red card has one fewer player; in this case, the team may switch to a more defensive strategy. As another example, consider an autonomous vehicle team for delivery. The control over the team depends on how many vehicles we have, how much load each vehicle can carry, and the delivery destinations. In both examples, the optimal team strategy varies with the team composition, i.e., the size of the team and each agent's capability. In these settings, it is intractable to re-train the agents for each new team composition, so it is desirable to have zero-shot generalization to new team compositions that are not seen during training. Recently, Iqbal et al. (2020) proposed a multi-head attention model for learning in environments with a variable number of agents under the CTDE framework.
However, in many challenging tasks, the CTDE constraint is too restrictive, as each agent only has access to its own decisions and partial environmental observations at test time; see Section 3.1 for an example where this requirement causes failure to learn. The constraint can be relaxed either by 1) allowing all agents to communicate with each other (Zhang et al., 2018) or 2) having a special "coach" agent who distributes strategic information based on a full view of the environment (Stone & Veloso, 1999). The former is typically too expensive in many CTDE scenarios (e.g., battery-powered drones or vehicles), while the latter is often feasible (e.g., using satellites or watchtowers to monitor the field in which the agents operate). In this work, we focus on the latter approach of having a coach to coordinate the agents. Specifically, we grant the coach a global observation, while the agents only have partial views of the environment. We assume that the coach can distribute information to the agents only in limited amounts. We model this communication through a continuous vector, termed the strategy vector, which is specific to each agent. We design each agent's decision module to incorporate the most recent strategy vector from the coach. We further propose a variational objective to regularize learning, inspired by Rakelly et al. (2019) and Wang et al. (2020a). To save the cost of receiving information from the coach, we additionally design an adaptive policy under which the coach communicates with different players only as needed. To train the coach and agents, we sample different teams from a set of team compositions. Recall that training is centralized under the CTDE framework. At execution time, the learned policy generalizes across different team compositions in a zero-shot manner. Our framework also allows for dynamic teams whose composition varies over time (see Figure 1(a-b)).
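As a rough illustration of the communication pattern described above (this is our own minimal sketch, not the paper's actual architecture: the linear strategy map, the norm-based gate, and its threshold are all placeholders), the coach periodically produces a per-agent strategy vector from the global state and, under the adaptive policy, sends it only when it differs enough from the last vector that agent received:

```python
import numpy as np

rng = np.random.default_rng(0)

class Coach:
    """Maps an agent's global-state features to a strategy vector (toy linear map)."""
    def __init__(self, state_dim, strategy_dim, threshold=0.5):
        self.W = rng.normal(size=(strategy_dim, state_dim)) * 0.1
        self.threshold = threshold   # gate: communicate only on a large enough change
        self.last_sent = {}          # agent id -> last strategy vector sent

    def strategy(self, agent_state):
        return np.tanh(self.W @ agent_state)

    def maybe_send(self, agent_id, agent_state):
        """Adaptive communication: send only if the new strategy differs enough."""
        z = self.strategy(agent_state)
        prev = self.last_sent.get(agent_id)
        if prev is None or np.linalg.norm(z - prev) > self.threshold:
            self.last_sent[agent_id] = z
            return z                 # communicate the new strategy
        return None                  # stay silent; the agent reuses its old vector

class Player:
    """Keeps the most recent strategy vector and conditions decisions on it."""
    def __init__(self, strategy_dim):
        self.z = np.zeros(strategy_dim)

    def receive(self, z):
        if z is not None:            # None means the coach stayed silent
            self.z = z

    def act(self, local_obs):
        # Placeholder for a policy network over (local observation, strategy).
        return np.concatenate([local_obs, self.z])
```

When the global state barely changes between communication rounds, `maybe_send` returns `None` and the player simply keeps acting on its most recent strategy vector, which is how the coach can communicate only a small fraction of the time.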

Summary of Results:

We (1) propose a coach-player framework for dynamic team composition of heterogeneous agents; (2) introduce a variational objective to regularize learning, which leads to improved performance; and (3) design an adaptive communication strategy to minimize communication from the coach to the agents. We apply our methods to resource-collection tasks in multi-agent particle environments and evaluate zero-shot generalization to new team compositions at test time. Results show comparable or even better performance than methods where players have full observation but no coach. Moreover, there is almost no performance degradation even when the coach communicates with the players as little as 13% of the time. These results demonstrate the effectiveness of having a coach in dynamic teams.

2. BACKGROUND

2.1 PROBLEM FORMULATION

We model the cooperative multi-agent task as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek et al., 2016). Specifically, we build on the setting of Dec-POMDPs with entities (de Witt et al., 2019), which considers entity-based knowledge representation. Here, entities include both controllable agents and other environment landmarks. In addition, we extend the representation to allow agents to have individual characteristics, e.g., skill level or physical condition. A Dec-POMDP with characteristics-based entities can therefore be described as a tuple (S, U, O, P, R, E, A, C, m, Ω, ρ, γ). E represents the space of entities. Every entity e ∈ E has a state representation s^e ∈ R^{d_e}. The global state is therefore the set s = {s^e | e ∈ E} ∈ S.
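To make the entity-based representation concrete, the following is our own illustrative sketch (the feature layout of position, an agent flag, and a skill-level characteristic is a made-up example, not the paper's exact encoding): the global state is a set of per-entity feature vectors, and its size changes with the team composition.

```python
import numpy as np

def entity_state(position, is_agent=False, characteristics=(0.0,)):
    """Build one entity's state s^e: base features plus characteristic features.

    Landmarks get zeroed characteristics so every s^e shares one feature layout.
    """
    return np.concatenate([np.asarray(position, dtype=float),
                           [1.0 if is_agent else 0.0],
                           np.asarray(characteristics, dtype=float)])

# Global state s = {s^e | e in E}: two agents with different skill levels
# plus one landmark. The size of this set varies with the team composition.
state = {
    "agent_0":    entity_state([0.0, 1.0],  is_agent=True, characteristics=[0.9]),
    "agent_1":    entity_state([2.0, -1.0], is_agent=True, characteristics=[0.4]),
    "landmark_0": entity_state([5.0, 5.0]),
}
```

Because the state is a set rather than a fixed-length vector, attention-based models of the kind proposed here can consume it for any number of entities, which is what enables zero-shot generalization across team compositions.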



Footnotes: (1) Team composition is part of an environmental scenario (de Witt et al., 2019), which also includes other environment entities; the formal definition is in Section 2.1. (2) Strictly speaking, the players in our method occasionally receive global information from the coach, but they still execute independently with local views while benefiting from centralized learning.



Figure 1: (a) In training, we sample teams from a set of compositions. The coach observes the entire world and coordinates different teams by broadcasting strategies periodically; (b) A team with dynamic composition can be viewed as a sequence of fixed-composition teams, so the proposed training generalizes to dynamic composition; (c) Our method sits at the star position within the MARL landscape.

