A MUTUAL INFORMATION DUALITY ALGORITHM FOR MULTI-AGENT SPECIALIZATION Anonymous

Abstract

Social behavior change in a heterogeneous population is an essential subject of multi-agent learning. Interactions between unique agents involve not only the optimization of each individual agent; agents' behavioral changes also depend on the mutual similarity and dissimilarity between the agents. Our study provides a theoretical derivation of policy interactions under the formulation of joint policy optimization. We discover that joint policy optimization in a heterogeneous population can affect population behaviors through mutual information (MI) maximization. We introduce a minimax formulation of MI (M&M) that optimizes population behavior under MI minimization against the joint policy optimization. Our main findings show that MI minimization can reduce the behavioral similarity between agents and enable each agent to develop an individualized policy specialization. Empirically, M&M demonstrates a substantial gain in average population performance and diversity, and narrows the performance gap among the agents.
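The minimax structure described above can be sketched as follows. This is a plausible instantiation for illustration only, not necessarily the paper's exact objective: writing $z$ for the agent identity, $J(\pi_\theta)$ for the expected population return, and $q_\phi$ for an auxiliary variational classifier (all symbols introduced here are assumptions), one adversarial form of MI minimization is

$$
\max_{\theta}\; J(\pi_\theta)\;-\;\beta\,\max_{\phi}\;\mathbb{E}_{(s,a)\sim\pi_\theta}\!\left[\log q_{\phi}(z \mid s, a)\right],
$$

where the inner maximization over $\phi$ yields (up to the constant entropy $H(z)$) a variational lower bound on $I\big(z;(s,a)\big)$, so the outer policy update drives the bound on MI between agent identity and behavior downward while still maximizing return.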

1. INTRODUCTION

Following the success of multi-agent game play (OpenAI et al., 2019; Vinyals et al., 2019), heterogeneous multi-agent learning is being actively studied in areas such as AI game design (Contributors, 2022), embedded IoT devices (Toyama et al., 2021), and research on future human-AI interaction. Given the unique physiques and specialized character attributes in a heterogeneous population, the ability to optimize policies for specific purposes is essential. The difficulty, as well as the goal, of heterogeneous population learning research is to define a general learning algorithm that optimizes an agent population such that the uniqueness of each agent is fully utilized within the population. To approach this problem, our research aims to understand the symmetric and asymmetric social behavior changes that occur during population learning through a mutual information formulation.

To study learning and behavioral change, multi-agent learning has formed two branches of study. In the simplest form of multi-agent RL, individualized learning is performed to learn a separate behavioral policy for each agent. Prior works of this kind include Independent Q-Learning (IQL) (Tan, 1993), Policy Space Response Oracle (PSRO) (Lanctot et al., 2017), and AlphaStar (Vinyals et al., 2019). However, the empirical success of these approaches has come without knowledge sharing: since training is done independently, one agent's learned experiences do not transfer to another. The individualized training results in highly redundant re-exploration and weakly socialized behaviors among the agents. In contrast, joint policy optimization has been proposed as a solution to these problems. Joint policy optimization utilizes a single conditioned policy to learn the diverse skill set and character attributes of the population via distributed RL optimization.
Through shared population experiences and joint policy optimization, a single conditioned policy network can learn a set of generalized policy skills that transfer across the different agents. Notable examples include OpenAI Five (OpenAI et al., 2019), HAPPO (Kuba et al., 2021), and NeuPL (Liu et al., 2022), which optimize multi-agent behaviors under the expected population accumulated reward. Through knowledge sharing and joint policy optimization, population learning in interactive games has benefited from increased learning efficiency and better generalization of agents' social behaviors. Our research focuses on this latter branch, where we analyze the cause of social behavior change.
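The single conditioned policy described above can be sketched minimally as follows. This is an illustrative toy, not the paper's architecture: all sizes, names, and the linear-softmax parameterization are assumptions. The key point it shows is that one shared parameter set serves the whole population, with the agent's identity supplied as a conditioning input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration.
N_AGENTS, OBS_DIM, N_ACTIONS = 3, 4, 2

# One shared weight matrix for the whole population: the policy is
# conditioned on agent identity via a one-hot appended to the observation.
W = rng.normal(size=(OBS_DIM + N_AGENTS, N_ACTIONS))

def conditioned_policy(obs, agent_id):
    """Action distribution of the single joint policy pi(a | s, agent_id)."""
    one_hot = np.eye(N_AGENTS)[agent_id]
    logits = np.concatenate([obs, one_hot]) @ W
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

obs = rng.normal(size=OBS_DIM)
# Same observation, same weights, yet each agent gets its own distribution,
# so skills learned by one agent's updates are stored in parameters shared by all.
dists = [conditioned_policy(obs, i) for i in range(N_AGENTS)]
```

Because every agent's experience updates the same `W`, experience gathered by one agent shapes the behavior available to the others, which is exactly the knowledge-sharing property that independent per-agent training lacks.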

