A COACH-PLAYER FRAMEWORK FOR DYNAMIC TEAM COMPOSITION

Anonymous

Abstract

In real-world multi-agent teams, agents with different capabilities may join or leave "on the fly" without altering the team's overarching goals. Coordinating teams with such dynamic composition remains a challenging problem: the optimal team strategy may vary with its composition. Inspired by real-world team sports, we propose a coach-player framework to tackle this problem. We assume that the players only have a partial view of the environment, while the coach has a complete view. The coach coordinates the players by distributing individual strategies. Specifically, we 1) propose an attention mechanism for both the players and the coach; 2) incorporate a variational objective to regularize learning; and 3) design an adaptive communication method that lets the coach decide when to communicate with different players. Our attention mechanism on the players and the coach allows for a varying number of heterogeneous agents, and can thus tackle dynamic team composition. We validate our methods on resource collection tasks in the multi-agent particle environment, demonstrating zero-shot generalization to new team compositions with varying numbers of heterogeneous agents. The performance of our method is comparable to, or even better than, the setting where all players have a full view of the environment but no coach. Moreover, performance stays nearly the same even when the coach communicates as little as 13% of the time using our adaptive communication strategy. These results demonstrate the significance of a coach in coordinating players in dynamic teams.
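The core property the abstract relies on is that attention pools over a set of agent embeddings, so the same parameters apply to any team size and are invariant to agent ordering. The following is a minimal sketch of this idea in numpy; the function and variable names (`coach_attention`, `coach_query`) are illustrative and simplified to a single parameter-free attention head, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coach_attention(coach_query, agent_embeddings):
    """Scaled dot-product attention: the coach's query attends over a
    variable number of agent embeddings.

    coach_query:      shape (d,)
    agent_embeddings: shape (n_agents, d), n_agents may vary per team

    Returns a fixed-size (d,) summary regardless of n_agents, and the
    result is invariant to permuting the agents' order.
    """
    d = coach_query.shape[-1]
    scores = agent_embeddings @ coach_query / np.sqrt(d)  # (n_agents,)
    weights = softmax(scores)                             # attention weights
    return weights @ agent_embeddings                     # (d,) pooled summary
```

Because the output dimension does not depend on the number of rows in `agent_embeddings`, the same module can process teams of 3 or 5 agents without retraining, which is the property that enables zero-shot generalization to unseen team compositions.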

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is the problem of coordinating a team of agents to perform a shared task. It has broad applications in autonomous vehicle teams (Cao et al., 2012), sensor networks (Choi et al., 2009), finance (Lee et al., 2007), and social science (Leibo et al., 2017). Recent works in MARL have shed light on solving challenging multi-agent problems such as playing StarCraft with deep learning models (Rashid et al., 2018). Among these methods, centralized training with decentralized execution (CTDE) has gained much attention, since learning in a centralized way enables better cooperation while executing independently makes the system efficient and scalable (Lowe et al., 2017). However, most deep CTDE approaches for cooperative MARL are limited to a fixed number of homogeneous agents. Real-world multi-agent tasks, on the other hand, often involve dynamic teams. For example, in a soccer game, a team receiving a red card has one fewer player; in this case, the team may switch to a more defensive strategy. As another example, consider an autonomous vehicle team for delivery: the control over the team depends on how many vehicles we have, how much load each vehicle permits, as well as the delivery destinations. In both examples, the optimal team strategy varies according to the team composition,¹ i.e., the size of the team and each agent's capability. In these settings, it is intractable to re-train the agents for each new team composition, and it is thus desirable to have zero-shot generalization to new team compositions that are not seen during training.



¹ Team composition is part of an environmental scenario (de Witt et al., 2019), which also includes other environment entities. The formal definition is in Section 2.1.

