THE EMERGENCE OF INDIVIDUALITY IN MULTI-AGENT REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Individuality is essential in human society. It induces the division of labor and thus improves efficiency and productivity. Similarly, it should also be key to multi-agent cooperation. Inspired by the fact that individuality means being an individual separate from others, we propose a simple yet efficient method for the emergence of individuality (EOI) in multi-agent reinforcement learning (MARL). EOI learns a probabilistic classifier that predicts a probability distribution over agents given their observations and gives each agent an intrinsic reward for being correctly predicted by the classifier. The intrinsic reward encourages the agents to visit their own familiar observations, and learning the classifier from such observations makes the intrinsic reward signals stronger, which in turn makes the agents more identifiable. To further enhance the intrinsic reward and promote the emergence of individuality, we propose two regularizers that increase the discriminability of the classifier. We implement EOI on top of popular MARL algorithms. Empirically, we show that EOI outperforms existing methods in a variety of multi-agent cooperative scenarios.

1. INTRODUCTION

Humans develop into distinct individuals due to both genes and environments (Freund et al., 2013). Individuality induces the division of labor (Gordon, 1996), which improves the productivity and efficiency of human society. Analogously, the emergence of individuality should also be essential for multi-agent cooperation. Although multi-agent reinforcement learning (MARL) has been applied to multi-agent cooperation, it is widely observed that agents usually learn similar behaviors, especially when the agents are homogeneous, share a global reward, and are co-trained (McKee et al., 2020). For example, in multi-camera multi-object tracking (Liu et al., 2017), where camera agents learn to cooperatively track multiple objects, the camera agents all tend to track the easy object. Such similar behaviors can easily trap the learned policies in a local optimum. If the agents can instead track different objects, they are more likely to solve the task optimally. Many studies formulate such a problem as task allocation or role assignment (Sander et al., 2002; Dastani et al., 2003; Sims et al., 2008). However, they require that the agent roles be rule-based and the tasks pre-defined, and thus are not general methods. Some studies intentionally pursue differences in agent policies via diversity (Lee et al., 2020; Yang et al., 2020) or emergent roles (Wang et al., 2020a); however, the induced differences are not appropriately linked to task success. In contrast, the emergence of individuality along with learning cooperation can automatically drive agents to behave differently and take a variety of roles, if needed, to successfully complete tasks. Biologically, the emergence of individuality is attributed to innate characteristics and experiences. However, as RL agents are mostly homogeneous in practice, we mainly focus on enabling agents to develop individuality through interactions with the environment during policy learning.
Intuitively, in multi-agent environments where agents independently explore and interact with the environment, individuality should emerge from what they experience. In this paper, we propose a novel method for the emergence of individuality (EOI) in MARL. EOI learns a probabilistic classifier that predicts a probability distribution over agents given their observations and gives each agent an intrinsic reward equal to the probability of it being correctly predicted by the classifier. Encouraged by the intrinsic reward, agents tend to visit their own familiar observations. Learning the probabilistic classifier from such observations makes the intrinsic reward signals stronger, which in turn makes the agents more identifiable. In this closed loop with positive feedback, agent individuality gradually emerges. However, at the early learning stage, the observations visited by different agents cannot be easily distinguished by the classifier, so the intrinsic reward signals are not strong enough to induce agent characteristics. Therefore, we propose two regularizers for learning the classifier to increase its discriminability, enhance the feedback, and thus promote the emergence of individuality.

EOI is compatible with centralized training and decentralized execution (CTDE). We realize EOI on top of two popular MARL methods, MAAC (Iqbal & Sha, 2019) and QMIX (Rashid et al., 2018). For MAAC, as each agent has its own critic, it is convenient to shape the reward for each agent. For QMIX, we introduce an auxiliary gradient and update the individual value function both by minimizing the TD error of the joint action-value function and by maximizing the cumulative intrinsic reward. We evaluate EOI in three scenarios where agents are expected to take different roles, i.e., Pac-Men, Windy Maze, and Firefighters, and we empirically demonstrate that EOI significantly outperforms existing methods.
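To make the mechanism concrete, the following is a minimal sketch of the classifier-based intrinsic reward. It uses a toy linear softmax classifier in place of the paper's neural network, omits the two regularizers, and the names (AgentClassifier, shaped_reward) and the weighting coefficient alpha are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

class AgentClassifier:
    """Toy linear softmax classifier P(agent | observation).

    A stand-in for the paper's classifier; a real implementation would
    be a neural network trained with cross-entropy plus the two
    regularizers described in Section 3.
    """
    def __init__(self, obs_dim, n_agents, lr=0.1):
        self.W = np.zeros((obs_dim, n_agents))
        self.lr = lr

    def predict(self, obs):
        # Probability distribution over agents for one observation.
        return softmax(obs @ self.W)

    def update(self, obs, agent_id):
        # One cross-entropy gradient step on the pair (obs, agent_id).
        p = self.predict(obs)
        grad = np.outer(obs, p)
        grad[:, agent_id] -= obs
        self.W -= self.lr * grad

def shaped_reward(env_reward, clf, obs, agent_id, alpha=0.5):
    """Environment reward plus the intrinsic reward: the probability
    the classifier assigns to the agent that actually visited obs.
    alpha is a hypothetical weighting coefficient."""
    return env_reward + alpha * clf.predict(obs)[agent_id]
```

As the agents visit increasingly distinct observations, the classifier's predictions sharpen and the intrinsic reward grows, which is exactly the positive-feedback loop described above.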
Additionally, in a StarCraft II micromanagement task (Samvelyan et al., 2019) where the need for a division of labor is unknown a priori, EOI also learns faster than existing methods. Through ablation studies, we confirm that the proposed regularizers indeed promote the emergence of individuality even when agents have the same innate characteristics.

2. RELATED WORK

MARL. We consider the formulation of the Decentralized Partially Observable Markov Decision Process (Dec-POMDP), where at each timestep t each agent i receives a local observation o_i^t, takes an action a_i^t, and gets a shared global reward r^t. The agents together aim to maximize the expected return E[∑_{t=0}^{T} γ^t r^t], where γ is the discount factor and T is the time horizon. Many methods have been proposed for Dec-POMDPs, most of which adopt CTDE. Some methods (Lowe et al., 2017; Foerster et al., 2018; Iqbal & Sha, 2019) extend policy gradient to the multi-agent case. Value function factorization methods (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019) decompose the joint value function into individual value functions. Communication methods (Das et al., 2019; Jiang et al., 2020) share information between agents for better cooperation.

Behavior Diversification. Many cooperative multi-agent applications require agents to take different behaviors to complete the task successfully. Behavior diversification can be handcrafted or can emerge through agents' learning. Handcrafted diversification is widely studied as task allocation or role assignment. Heuristic methods (Sander et al., 2002; Dastani et al., 2003; Sims et al., 2008; Macarthur et al., 2011) assign specific tasks or pre-defined roles to each agent based on goal, capability, or visibility, or by search. M³RL (Shu & Tian, 2019) learns a manager that assigns suitable sub-tasks to rule-based workers with different preferences and skills. These methods require that the sub-tasks and roles be pre-defined and the worker agents be rule-based. In general, however, the task cannot be easily decomposed even with domain knowledge, and the workers are learning agents.
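As a small illustration of the shared-return objective in the Dec-POMDP formulation above, the following sketch computes the discounted return ∑_{t=0}^{T} γ^t r^t for one episode of shared team rewards; the function name and default γ are illustrative choices, not part of the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_{t=0}^{T} gamma^t * r^t for one episode,
    where rewards[t] is the shared global reward r^t at timestep t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```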
Emergent diversification for a single agent has been studied in DIAYN (Eysenbach et al., 2019), which learns reusable diverse skills in complex and transferable tasks without any reward signal by maximizing the mutual information between states and skill embeddings, together with entropy. In multi-agent learning, SVO (McKee et al., 2020) introduces diversity into heterogeneous agents to obtain more generalized and higher-performing policies in social dilemmas. Some methods have been proposed for behavior diversification in multi-agent cooperation. ROMA (Wang et al., 2020a) learns a role encoder to generate role embeddings and a role decoder to generate the neural network parameters. However, no mechanism guarantees that the role decoder generates different parameters for different role embeddings. Learning low-level skills for each agent using DIAYN is considered in Lee et al. (2020) and Yang et al. (2020), where agents' diverse low-level skills are coordinated by a high-level policy. However, the independently trained skills limit cooperation, and diversity is not considered in the high-level policy.
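For reference, DIAYN's mutual-information objective is commonly optimized with an intrinsic reward of the form log q(z|s) - log p(z), where q is a learned skill discriminator and p(z) is a (typically uniform) skill prior. A minimal sketch, assuming the discriminator's output distribution over skills is given:

```python
import numpy as np

def diayn_intrinsic_reward(skill_probs, skill, n_skills):
    """DIAYN-style intrinsic reward log q(z|s) - log p(z), assuming a
    uniform skill prior p(z) = 1/n_skills. skill_probs is the
    discriminator's predicted distribution over skills for state s."""
    return float(np.log(skill_probs[skill]) - np.log(1.0 / n_skills))
```

The reward is positive whenever the discriminator attributes the state to the active skill with above-chance probability, pushing each skill toward visiting states that distinguish it from the others.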

3. METHOD

Individuality means being an individual separate from others. Motivated by this, we propose EOI, where agents are intrinsically rewarded for being correctly predicted by a probabilistic classifier that is learned from the agents' observations. If the classifier learns to accurately distinguish the agents, the agents should behave differently and thus individuality emerges. Two regularizers are introduced for

