RANDOMIZED ENTITY-WISE FACTORIZATION FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Real-world multi-agent tasks often involve varying types and quantities of agents and non-agent entities; however, agents within these tasks rarely need to consider all others at all times in order to act effectively. Factored value function approaches have historically leveraged such independence to improve learning efficiency, but these approaches typically rely on domain knowledge to select fixed subsets of state features to include in each factor. We propose to utilize value function factoring with random subsets of entities in each factor as an auxiliary objective, in order to disentangle value predictions from irrelevant entities. This factoring approach is instantiated through a simple masking procedure in attention-based models. We hypothesize that such an approach helps agents learn more effectively in multi-agent settings by discovering common trajectories across episodes within sub-groups of agents/entities. Our approach, Randomized Entity-wise Factorization for Imagined Learning (REFIL), outperforms strong baselines by a significant margin on challenging StarCraft micromanagement tasks.

1. INTRODUCTION

Figure 1: Breakaway sub-scenario in soccer. Agents in the yellow square can ignore the context outside of this region and still predict their success effectively.

Many real-world multi-agent tasks contain scenarios in which an agent must deal with varying numbers and/or types of cooperative agents, antagonistic enemies, or other entities. Agents, however, can often select their optimal actions while ignoring a subset of agents/entities. For example, in the sport of soccer, a "breakaway" occurs when an attacker with the ball passes the defense and only needs to beat the goalkeeper in order to score (see Figure 1). In this situation, only the opposing goalkeeper is immediately relevant to the attacker's success, so the attacker can safely ignore players other than the goalkeeper for the time being. By ignoring irrelevant context, the attacker can better generalize this experience to its next breakaway. Furthermore, soccer takes many forms, from casual 5 vs. 5 to full-scale 11 vs. 11 matches, and breakaways occur in all of them. If agents can identify independent patterns of behavior such as breakaways, they should be able to learn more efficiently and share their experiences across all forms of soccer.

Value function factoring approaches attempt to leverage independence between agents, such as that in our soccer example, by learning value functions as a combination of independent factors that depend on disjoint subsets of the state and action spaces (Koller & Parr, 1999). These subsets are typically fixed in advance using domain knowledge about the problem at hand, and thus do not scale to complex domains where dependencies are unknown and may shift over time. Recent approaches in cooperative deep multi-agent reinforcement learning (MARL) factor value functions into separate components for each agent's action and observation space in order to enable decentralized execution (e.g., VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018)).
These approaches learn a utility function for each agent that depends only on the agent's own action and observations. The global Q-value is then predicted as a monotonic combination of these utilities, allowing agents to greedily select their actions from local information while maximizing the global Q. These approaches effectively leverage independence between agents' local actions and observations; however, we note that observable entities are provided by the environment and are not all necessarily relevant to an agent's value function.

We build on these recent approaches by additionally factoring the observation space of each agent into factors for sub-groups of observed entities. Unlike classic works that factor the state or observation spaces, our work does not depend on fixed subsets of features designated through domain knowledge. Instead, we propose to randomly select sub-groups of observed entities and "imagine" the predicted utilities within these groups for each agent. These terms will not account for potential interactions outside of the groups, so we include additional factors that estimate the effect of the entities outside of each sub-group on each agent's utility. In order to estimate the true returns, we combine all factors using a mixing network (as in QMIX; Rashid et al., 2018), which allows our model to weight factors based on the full state context. We hypothesize this approach is beneficial for two reasons: 1) randomly partitioning entities and predicting returns from disjoint factors allows our model to explore all possible independence relationships among agents and entities, teaching agents to ignore irrelevant context when possible, and 2) by teaching our models when they can ignore irrelevant context, they will learn more efficiently across varied settings that share common patterns of behavior, such as breakaways in soccer.
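The random partitioning step can be sketched with a small helper. This is our own illustration under simplifying assumptions (a single split into two groups, with agents occupying the first entity indices); the function name `random_partition_masks` is hypothetical, not the paper's code:

```python
import numpy as np

def random_partition_masks(n_agents, n_entities, rng=None):
    """Randomly split all entities into two disjoint groups and return, for
    each agent, a boolean mask over entities for its in-group ("imagined")
    factor and one for the out-of-group interaction factor."""
    rng = np.random.default_rng(rng)
    # Each entity (agents included) is independently assigned to group 0 or 1.
    group = rng.integers(0, 2, size=n_entities)
    in_group = np.zeros((n_agents, n_entities), dtype=bool)
    out_group = np.zeros((n_agents, n_entities), dtype=bool)
    for a in range(n_agents):  # assume agents are entities 0..n_agents-1
        same = group == group[a]
        in_group[a] = same        # agent attends only within its sub-group
        out_group[a] = ~same      # interaction factor sees the complement...
        out_group[a, a] = True    # ...plus the agent itself
    return in_group, out_group
```

Together, the two masks cover every entity for every agent, so combining the two factors can still estimate the full return.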
The loss for training randomized factorization is added to the QMIX loss (i.e., the loss computed using full observations) as an auxiliary objective. Our reasoning is again twofold: 1) we must learn the true returns to use as a target prediction for the Q-learning loss, and 2) we do not know a priori which entities are unnecessary, so we still need to learn policies that act on full observations. Our entity-wise factoring procedure can be implemented easily in practice using a simple masking procedure in attention-based models. Furthermore, by leveraging attention models, we can apply our approach to domains with varying entity quantities. Just as a soccer agent experiencing a breakaway can generalize its behavior across settings (5 vs. 5, 11 vs. 11, etc.) if it ignores irrelevant context, we hypothesize that our approach will improve performance across settings with variable agent and entity configurations. We propose Randomized Entity-wise Factorization for Imagined Learning (REFIL) and test it on complex StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019) tasks with varying agent types and quantities, finding that it attains improved performance over state-of-the-art methods.
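To illustrate the masking procedure, below is a minimal NumPy sketch of masked scaled dot-product attention: entities excluded by the mask receive a logit of negative infinity and therefore receive zero attention weight after the softmax. The function name and array shapes are our own assumptions, not the paper's implementation:

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention over entities.

    q: (n_queries, d); k, v: (n_entities, d);
    mask: (n_queries, n_entities) boolean, where mask[i, j] = True means
    query i may attend to entity j. Masked-out entities get -inf logits,
    i.e., zero attention weight, so they cannot influence the output."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = np.where(mask, logits, -np.inf)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)                       # exp(-inf) = 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because masked entities contribute exactly zero weight, the same network can compute both the full-observation utilities and the "imagined" in-group utilities simply by swapping the mask.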

2. BACKGROUND AND PRELIMINARIES

In this work, we consider the decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek et al., 2016), which describes fully cooperative multi-agent tasks. Specifically, we utilize the setting of Dec-POMDPs with entities (Schroeder de Witt et al., 2019). Dec-POMDPs with entities are described as tuples (S, U, O, P, r, E, A, Φ, µ). E is the set of entities in the environment. Each entity e has a state representation s^e, and the global state is the set s = {s^e | e ∈ E} ∈ S. Some entities can be agents a ∈ A ⊆ E. Non-agent entities are parts of the environment that are not controlled by learning policies (e.g., landmarks, obstacles, agents with fixed behavior). The state features of each entity comprise two parts: s^e = [f^e, φ^e], where f^e describes the entity's current state (e.g., position, orientation, velocity, etc.) and φ^e ∈ Φ represents the entity's type (e.g., outfield player, goalkeeper, etc.), of which there is a discrete set. An entity's type affects the state dynamics as well as the reward function and, importantly, it remains fixed for the duration of the entity's existence. Not all entities may be visible to each agent, so we define a binary observability mask µ(s^a, s^e) ∈ {0, 1}, where agents can always observe themselves: µ(s^a, s^a) = 1 for all a ∈ A. Thus, an agent's observation is defined as o^a = {s^e | µ(s^a, s^e) = 1, e ∈ E} ∈ O. Each agent a can execute actions u^a, and the joint action of all agents is denoted u = {u^a | a ∈ A} ∈ U. P is the state transition function, which defines the probability P(s′ | s, u). r(s, u) is the reward function, which maps the global state and joint actions to a single scalar reward. We do not consider entities being added during an episode, but they may become inactive (e.g., a unit dying in StarCraft), in which case they no longer affect transitions and rewards.
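The entity-based state representation and the observation o^a induced by the observability mask µ can be made concrete with a short sketch. The `Entity` class and `observation` helper are hypothetical names chosen for illustration; the self-observability rule µ(s^a, s^a) = 1 is encoded explicitly:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Entity:
    features: Tuple[float, ...]  # f^e: e.g., position, orientation, velocity
    etype: int                   # phi^e: discrete type index, fixed for life
    is_agent: bool = False

def observation(agent: Entity, entities: List[Entity],
                mu: Callable[[Entity, Entity], bool]) -> List[Entity]:
    """o^a = {s^e | mu(s^a, s^e) = 1, e in E}; an agent always sees itself."""
    return [e for e in entities if e is agent or mu(agent, e)]
```

Here `mu` stands in for the binary observability mask; a distance-based visibility predicate is one natural instance.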
Since s and u are sets, their ordering does not matter, and our modeling construct should account for this (e.g., by modeling with permutation invariance/equivariance (Lee et al., 2019)). In many domains, the set of entity types present, {φ^e | e ∈ E}, is fixed across episodes. We are particularly interested in cases where the quantity and types of entities vary between episodes, as identifying independence relationships between entities is crucial to generalizing experience effectively in these cases.
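To illustrate the permutation-invariance requirement, a minimal sum-pooling encoder (our own sketch, not the paper's architecture) produces the same output regardless of entity ordering:

```python
import numpy as np

def set_encode(entity_feats, w):
    """Encode each entity independently, then sum-pool across the set.

    entity_feats: (n_entities, d_in); w: (d_in, d_out).
    Summation is commutative, so permuting the rows of entity_feats
    leaves the output unchanged, as required for set-valued inputs."""
    return np.tanh(entity_feats @ w).sum(axis=0)
```

Attention-based models achieve the same property with learned, input-dependent pooling weights, which is why they are a natural fit for varying entity quantities.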

